AWS Certified Data Engineer Associate Practice Test (DEA-C01)
Use the form below to configure your AWS Certified Data Engineer Associate Practice Test (DEA-C01). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

AWS Certified Data Engineer Associate DEA-C01 Information
The AWS Certified Data Engineer – Associate certification validates your ability to design, build, and manage data pipelines on the AWS Cloud. It’s designed for professionals who transform raw data into actionable insights using AWS analytics and storage services. This certification proves you can work with modern data architectures that handle both batch and streaming data, using tools like Amazon S3, Glue, Redshift, EMR, Kinesis, and Athena to deliver scalable and efficient data solutions.
The exam covers the full data lifecycle — from ingestion and transformation to storage, analysis, and optimization. Candidates are tested on their understanding of how to choose the right AWS services for specific use cases, design secure and cost-effective pipelines, and ensure data reliability and governance. You’ll need hands-on knowledge of how to build ETL workflows, process large datasets efficiently, and use automation to manage data infrastructure in production environments.
Earning this certification demonstrates to employers that you have the technical expertise to turn data into value on AWS. It’s ideal for data engineers, analysts, and developers who work with cloud-based data systems and want to validate their skills in one of the most in-demand areas of cloud computing today. Whether you’re building data lakes, streaming pipelines, or analytics solutions, this certification confirms you can do it the AWS way — efficiently, securely, and at scale.

Free AWS Certified Data Engineer Associate DEA-C01 Practice Test
- 20 Questions
- Unlimited time
- Data Ingestion and Transformation
- Data Store Management
- Data Operations and Support
- Data Security and Governance
An Amazon Athena table stores clickstream events as Parquet files in an S3 location partitioned by year, month, and day. A nightly ETL job currently runs the following query and is incurring high scan costs:
SELECT user_id, page, event_time FROM clickstream WHERE event_time BETWEEN date '2023-07-01' AND date '2023-07-31';
How should you rewrite the SQL to scan the least amount of data without changing the table definition?
Append a LIMIT clause so the statement becomes:
SELECT user_id, page, event_time FROM clickstream WHERE event_time BETWEEN date '2023-07-01' AND date '2023-07-31' LIMIT 100000;
Create a common table expression (CTE) that selects all columns and then filter the CTE on event_time within the main query.
Add a filter on the partition columns, for example:
SELECT user_id, page, event_time FROM clickstream WHERE year = 2023 AND month = 7 AND day BETWEEN 1 AND 31;
Include an ORDER BY year, month, day clause to ensure the data is read in partition order.
Answer Description
Athena partitions are stored as separate folders in Amazon S3. When a query's WHERE clause references the partition columns, Athena prunes the unrelated partitions and reads only the relevant files, which reduces the amount of data scanned and lowers cost. Filtering solely on event_time does not use partition pruning because that column is stored inside the files, not in the partition path. A LIMIT clause, ORDER BY, or a common table expression does not affect how much data is read from S3, so they provide no scan-cost benefit.
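For illustration, a minimal boto3 sketch (the database name and results bucket below are placeholders, not values from the question) that runs the partition-filtered query and then reports how much data Athena actually scanned:
import time
import boto3

athena = boto3.client("athena")

# Partition-filtered query: Athena prunes to the July 2023 folders only.
sql = ("SELECT user_id, page, event_time FROM clickstream "
       "WHERE year = 2023 AND month = 7 AND day BETWEEN 1 AND 31")

query_id = athena.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={"Database": "analytics_db"},                       # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},   # placeholder bucket
)["QueryExecutionId"]

# Wait for completion, then read the scan statistics.
while True:
    execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
    if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print(execution["Status"]["State"], execution["Statistics"].get("DataScannedInBytes"))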
A retail company runs nightly AWS Glue ETL jobs that load data into an Amazon Redshift cluster. The job script currently hard-codes the database user name and password. Security now requires removing plaintext credentials, rotating the password automatically every 30 days, and making no changes to the ETL code. Which solution meets these requirements most securely?
Create an AWS Secrets Manager secret for the Redshift cluster, enable automatic rotation, update the existing AWS Glue connection to reference the secret's ARN, and add secretsmanager:GetSecretValue permission to the Glue job role.
Encrypt the user name and password with AWS KMS and place the ciphertext in environment variables of the Glue job; configure KMS key rotation every 30 days.
Store the database credentials as SecureString parameters in AWS Systems Manager Parameter Store and schedule an Amazon EventBridge rule that invokes a Lambda function every 30 days to update the parameters; grant the Glue job role ssm:GetParameters permission.
Save the credentials in the AWS Glue Data Catalog connection properties and enable automatic rotation in the connection settings.
Answer Description
AWS Secrets Manager can create a managed secret for an Amazon Redshift cluster whose password is rotated automatically every 30 days. An AWS Glue connection can reference the secret's ARN, so the job continues to run without code changes; the only additional step is to grant the Glue job role permission to call secretsmanager:GetSecretValue. Systems Manager Parameter Store has no built-in rotation, encrypting environment variables with KMS rotates keys rather than credentials, and AWS Glue connections do not provide automatic credential rotation. Therefore the Secrets Manager approach is the only option that satisfies all stated requirements.
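As a hedged sketch of that setup (the secret name, rotation Lambda ARN, and connection name below are assumptions, not values from the question), the 30-day rotation and the connection's secret reference could be wired up like this:
import boto3

secrets = boto3.client("secretsmanager")
glue = boto3.client("glue")

# Rotate the existing Redshift secret automatically every 30 days.
secrets.rotate_secret(
    SecretId="redshift/etl-user",                                                          # placeholder secret
    RotationLambdaARN="arn:aws:lambda:eu-west-1:111122223333:function:RedshiftRotation",   # placeholder
    RotationRules={"AutomaticallyAfterDays": 30},
)

# Point the existing Glue connection at the secret instead of plaintext credentials.
conn = glue.get_connection(Name="redshift-etl-conn")["Connection"]                         # placeholder connection
props = dict(conn["ConnectionProperties"])
props.pop("USERNAME", None)
props.pop("PASSWORD", None)
props["SECRET_ID"] = "redshift/etl-user"

conn_input = {"Name": conn["Name"], "ConnectionType": conn["ConnectionType"], "ConnectionProperties": props}
if "PhysicalConnectionRequirements" in conn:
    conn_input["PhysicalConnectionRequirements"] = conn["PhysicalConnectionRequirements"]
glue.update_connection(Name=conn["Name"], ConnectionInput=conn_input)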
AWS Glue cataloged two Amazon Athena tables: web_clicks (fact) with product_id and dim_products (dimension). New product_ids may appear in web_clicks before dim_products is updated. You need an Athena view that returns every click event and adds product_name when available, otherwise null. Which JOIN clause meets this goal?
SELECT ... FROM web_clicks w RIGHT JOIN dim_products d ON w.product_id = d.product_id
SELECT ... FROM web_clicks w FULL OUTER JOIN dim_products d ON w.product_id = d.product_id
SELECT ... FROM web_clicks w LEFT JOIN dim_products d ON w.product_id = d.product_id
SELECT ... FROM web_clicks w INNER JOIN dim_products d ON w.product_id = d.product_id
Answer Description
To keep every row from web_clicks regardless of whether matching product metadata exists, the query must preserve all rows from the left (click) table and add columns from the right (product) table when matches are found. A left outer join does exactly that, returning NULLs for columns from dim_products when no corresponding product_id exists. An inner join would drop unmatched click events, a right join would keep only rows where a product exists, and a full outer join would add extra rows from dim_products that have never been clicked. None of those options satisfies the stated requirement.
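For reference, a minimal boto3 sketch of creating such a view in Athena (the view name, database, and results bucket are placeholders):
import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=(
        "CREATE OR REPLACE VIEW clicks_with_products AS "   # placeholder view name
        "SELECT w.*, d.product_name "
        "FROM web_clicks w "
        "LEFT JOIN dim_products d ON w.product_id = d.product_id"
    ),
    QueryExecutionContext={"Database": "analytics_db"},                       # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},   # placeholder bucket
)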
A company ingests high-frequency IoT sensor readings and must land them in Amazon S3 in under 30 seconds. Operations teams also need the ability to replay any portion of the incoming stream if a downstream transformation job fails. Which solution meets these requirements while keeping operational overhead to a minimum?
Use an AWS IoT Core rule to write the sensor messages directly to an S3 bucket with the "Customer managed" retry option enabled.
Send the data to Amazon Kinesis Data Streams with a 24-hour retention period, add an Amazon Kinesis Data Firehose delivery stream as a consumer, and configure Firehose buffering to 1 MiB or 10 seconds before writing to Amazon S3.
Deploy an Apache Kafka cluster on Amazon EC2, configure a topic for the sensors, and use a Kafka Connect S3 sink to write data to Amazon S3 every 10 seconds.
Create a direct Amazon Kinesis Data Firehose delivery stream and reduce the S3 buffering size to 1 MiB and interval to 10 seconds.
Answer Description
Amazon Kinesis Data Streams provides a durable stream that can retain data for hours or days, allowing applications to reread any record for replay or recovery. By registering Amazon Kinesis Data Firehose as a stream consumer and setting the buffering limits to 1 MiB and 10 seconds, records are delivered to Amazon S3 in well under the 30-second requirement. A Firehose delivery stream that ingests data directly, an AWS IoT Core rule, or a self-managed Kafka cluster either lacks built-in replay capability or adds unnecessary operational burden, so they do not satisfy both requirements as effectively.
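A minimal sketch of the Firehose side of that design, assuming placeholder names and ARNs throughout:
import boto3

firehose = boto3.client("firehose")
firehose.create_delivery_stream(
    DeliveryStreamName="iot-to-s3",                                                        # placeholder
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:eu-west-1:111122223333:stream/iot-readings",  # placeholder
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-read-stream",                  # placeholder
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-write-s3",                     # placeholder
        "BucketARN": "arn:aws:s3:::example-iot-landing",                                   # placeholder
        # Flush at 1 MiB or 10 seconds, whichever comes first, to stay well under 30 s.
        "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 10},
    },
)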
An e-commerce company collects mobile-game clickstream events at 10 MB/s. The data must land in an Amazon S3 data lake and simultaneously feed three independent services: sub-second fraud detection, a personalization microservice, and a daily Amazon EMR batch job. The solution must be fully managed, replayable for 24 hours, and auto-scaling without capacity planning. Which approach is MOST cost-effective?
Provision an Amazon MSK cluster and publish the events to a Kafka topic. Have each consumer application subscribe to the topic and use Apache Flink on MSK to write the data to Amazon S3.
Insert events into a DynamoDB table and enable DynamoDB Streams. Use a Lambda function to forward stream records to Amazon S3 and invoke the consumer services.
Create a Kinesis Data Firehose delivery stream with a Lambda transformation. Configure S3 event notifications from the destination bucket to invoke the fraud and personalization services.
Create a Kinesis Data Stream in on-demand mode. Register three enhanced fan-out consumers: a fraud-detection Lambda function, a personalization microservice, and a Kinesis Data Firehose delivery stream that writes to Amazon S3.
Answer Description
A Kinesis Data Stream in on-demand mode eliminates shard provisioning and automatically scales. Registering each application as an enhanced fan-out consumer gives every consumer its own 2 MB/s pipe per shard with average propagation latency of about 70 ms, supporting sub-second fraud detection. Attaching a Kinesis Data Firehose consumer writes the same records to Amazon S3 for the EMR job. The stream's default 24-hour retention enables replays. Firehose alone lacks low-latency fan-out, MSK entails cluster administration and higher cost, and DynamoDB Streams targets table-change capture, not high-volume clickstreams.
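A rough sketch of the stream setup (all names are placeholders). The Lambda function and microservice read through enhanced fan-out consumers; the Firehose delivery stream that lands data in S3 is created separately with this stream as its source:
import boto3

kinesis = boto3.client("kinesis")

# On-demand mode removes shard capacity planning.
kinesis.create_stream(StreamName="game-clickstream", StreamModeDetails={"StreamMode": "ON_DEMAND"})
kinesis.get_waiter("stream_exists").wait(StreamName="game-clickstream")

stream_arn = kinesis.describe_stream_summary(StreamName="game-clickstream")[
    "StreamDescriptionSummary"]["StreamARN"]

# Each enhanced fan-out consumer gets its own dedicated 2 MB/s-per-shard pipe.
for consumer_name in ("fraud-detection", "personalization"):
    kinesis.register_stream_consumer(StreamARN=stream_arn, ConsumerName=consumer_name)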
An online gaming company delivers about 5 MB/s of gameplay telemetry to AWS. The engineering team must store each record for 7 days, support millisecond-latency writes and multiple parallel reads, and invoke AWS Lambda functions that calculate near-real-time leaderboards. They want the lowest operational overhead and predictable pricing. Which service should they use as the primary data store?
Insert each record into an on-demand Amazon DynamoDB table and export the table to Amazon S3 after 7 days.
Send the data to Amazon S3 through Kinesis Data Firehose and have Lambda query the objects with Amazon Athena.
Create an Amazon Kinesis Data Streams stream with a 7-day retention period and configure AWS Lambda as a consumer.
Deploy an Amazon MSK cluster and write the telemetry to a Kafka topic configured for 7-day retention.
Answer Description
Amazon Kinesis Data Streams is purpose-built for high-throughput, low-latency ingestion. A stream shard provides single-digit-millisecond put and get latency, and the service can retain data for up to 365 days when the retention period is extended, so a 7-day requirement is easily met. Lambda can be configured as an event source, allowing records to be processed almost immediately after they are written, and the fully managed nature of the service eliminates cluster management while costs scale linearly with provisioned shard capacity.
Writing directly to Amazon S3 via Kinesis Data Firehose meets the retention goal but cannot provide millisecond-level reads or support multiple independent consumer applications in real time. Persisting each record in DynamoDB gives low-latency access, but costs rise rapidly for write-heavy workloads and the service is not optimized for sequential stream processing by multiple consumers. Amazon MSK offers Kafka-compatible streams with the required retention but introduces significant operational burden to manage brokers, scaling, and patching, which contradicts the low-overhead requirement.
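As a minimal illustration (the stream and function names are placeholders), extending retention to 7 days and attaching Lambda as a consumer could look like this:
import boto3

kinesis = boto3.client("kinesis")
lambda_client = boto3.client("lambda")

# 7 days = 168 hours; Kinesis supports retention of up to 365 days.
kinesis.increase_stream_retention_period(StreamName="gameplay-telemetry", RetentionPeriodHours=168)

stream_arn = kinesis.describe_stream_summary(StreamName="gameplay-telemetry")[
    "StreamDescriptionSummary"]["StreamARN"]

# The event source mapping invokes the leaderboard function as records arrive.
lambda_client.create_event_source_mapping(
    EventSourceArn=stream_arn,
    FunctionName="leaderboard-aggregator",       # placeholder function name
    StartingPosition="LATEST",
    BatchSize=500,
)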
A retail company plans to ingest click-stream events with Apache Kafka. Security mandates that producer and consumer applications authenticate only with short-lived IAM role credentials, and that the data engineering team must not build or rotate cluster user passwords. Which deployment choice meets the requirement while minimizing operational effort?
Create an Amazon MSK cluster but disable IAM access control, instead using SASL/SCRAM authentication with credentials stored in Secrets Manager.
Deploy an Apache Kafka cluster on Amazon EC2 behind a Network Load Balancer and enforce mutual TLS with private certificates from AWS Certificate Manager Private CA.
Deploy an Apache Kafka cluster on Amazon EC2 instances and configure SASL/SCRAM authentication, storing usernames and passwords in AWS Secrets Manager.
Provision an Amazon MSK cluster with IAM access control enabled so clients authenticate with SigV4-signed requests using their IAM roles.
Answer Description
Amazon MSK is a fully managed service that can be configured to use IAM access control. When this option is enabled, clients sign their requests with SigV4 by assuming an IAM role, so there are no static usernames or passwords to create, store, or rotate. Deploying Kafka on Amazon EC2, or using Amazon MSK without IAM access control, requires you to configure SASL/SCRAM or mutual TLS, store credentials or certificates (often in AWS Secrets Manager), and implement a rotation process, which contradicts the requirement to avoid managing passwords. Therefore, enabling IAM access control on an Amazon MSK cluster is the only solution that satisfies both the authentication mandate and the low-operations goal.
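For illustration only, the IAM policy attached to a producer or consumer role might scope the Kafka actions like this (the account ID, cluster name, and wildcards are placeholders):
import json

msk_client_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["kafka-cluster:Connect", "kafka-cluster:DescribeCluster"],
            "Resource": "arn:aws:kafka:eu-west-1:111122223333:cluster/clickstream/*",
        },
        {
            "Effect": "Allow",
            "Action": ["kafka-cluster:WriteData", "kafka-cluster:ReadData", "kafka-cluster:DescribeTopic"],
            "Resource": "arn:aws:kafka:eu-west-1:111122223333:topic/clickstream/*",
        },
        {
            "Effect": "Allow",
            "Action": ["kafka-cluster:AlterGroup", "kafka-cluster:DescribeGroup"],
            "Resource": "arn:aws:kafka:eu-west-1:111122223333:group/clickstream/*",
        },
    ],
}
print(json.dumps(msk_client_policy, indent=2))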
Your data engineering team uses AWS Glue to transform data that lands in Amazon S3. To comply with EU data-sovereignty rules, every analytic object must remain in either eu-west-1 or eu-central-1. Across dozens of AWS accounts, you must prevent any resource creation or data replication in other Regions. Which solution BEST enforces this requirement?
Require SSE-KMS with customer-managed keys created in the EU Regions and mandate bucket policies that enforce encryption on all uploads.
Turn on Amazon Macie automatic sensitive-data discovery and configure Security Hub to raise findings when objects are stored in non-EU Regions.
Enable S3 Object Lock on all buckets and configure default retention settings so that objects cannot be deleted or overwritten outside the EU.
Attach a service control policy (SCP) to the organization that denies all actions in Regions other than eu-west-1 and eu-central-1 by using the aws:RequestedRegion global condition key.
Answer Description
A service control policy (SCP) applied at the AWS Organizations level can evaluate every API request before it is allowed. By using the aws:RequestedRegion global condition key, the SCP can explicitly Deny any action requested in Regions other than eu-west-1 or eu-central-1. This prevents engineers, and even automated services, from creating S3 buckets, enabling cross-Region replication, or launching resources outside the approved EU Regions, fully satisfying data-sovereignty requirements.
Enabling S3 Object Lock only stops object deletion or modification; it does not stop data being stored in other Regions. Requiring SSE-KMS with EU-based keys encrypts data but does not restrict its geographic location. Amazon Macie with Security Hub can detect non-compliant storage locations, but it is reactive and cannot block operations. Therefore, the SCP with a Region deny condition is the most effective preventive control.
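As an illustrative sketch, the SCP could be as simple as the following; real deployments usually add a NotAction exemption for global services such as IAM:
import json

eu_region_scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideApprovedEURegions",
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": ["eu-west-1", "eu-central-1"]}
            },
        }
    ],
}
print(json.dumps(eu_region_scp, indent=2))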
Your company ingests website click-stream events that are serialized as JSON. The structure of the events will evolve as new product features are released, and the data engineering team wants analysts to run ad-hoc SQL queries in Amazon Redshift without performing manual DDL each time a new attribute appears. The solution must keep storage costs low and avoid interrupting existing queries. Which design meets these requirements?
Persist the events in an Amazon RDS PostgreSQL database and query the table from Redshift by using federated queries.
Write the JSON events to Amazon S3, use an AWS Glue crawler to catalog the files, and create an Amazon Redshift Spectrum external table that references the Glue Data Catalog.
Stream the JSON events directly into an Amazon Redshift table that uses the SUPER data type and rely on Redshift to surface new keys automatically.
Use AWS Database Migration Service (AWS DMS) to load the events from S3 into a Redshift columnar table and run a nightly job that issues ALTER TABLE ADD COLUMN statements for any new attributes.
Answer Description
Landing the raw JSON objects in Amazon S3 keeps storage costs lower than storing them inside the data warehouse. When an AWS Glue crawler catalogs the objects, it applies schema-on-read semantics and automatically adds any new or missing attributes it discovers. Creating an external schema in Amazon Redshift that points to the Glue Data Catalog lets analysts query the data through Redshift Spectrum. Because Spectrum consults the Data Catalog at query time, new attributes become available to analysts immediately, with no ALTER TABLE commands and no downtime.
Why the other designs fall short:
- Streaming the JSON directly into a Redshift table that uses the SUPER data type does allow schemaless ingestion without DDL, but all raw data is stored in Redshift Managed Storage, which is more expensive than S3, so it does not satisfy the cost constraint.
- Using AWS DMS to load the events and running nightly ALTER TABLE commands adds operational overhead and locks the table during DDL, interrupting queries.
- Persisting the events in an Amazon RDS PostgreSQL database and querying through Redshift federated queries duplicates data in a full relational database, incurs higher storage and compute costs, and still doesn't provide automatic schema evolution for semi-structured JSON columns.
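A minimal sketch of the Redshift side of the correct design, assuming a Glue database named clickstream_db and placeholder cluster, secret, and role identifiers; the Redshift Data API submits the one-time DDL:
import boto3

rsd = boto3.client("redshift-data")
rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",                                                 # placeholder
    Database="dev",                                                                        # placeholder
    SecretArn="arn:aws:secretsmanager:eu-west-1:111122223333:secret:redshift-admin",       # placeholder
    Sql=(
        "CREATE EXTERNAL SCHEMA IF NOT EXISTS clicks_ext "
        "FROM DATA CATALOG DATABASE 'clickstream_db' "
        "IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-spectrum-role'"
    ),
)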
A data engineering team must expose a JSON ingestion REST endpoint to several financial partners. Company policy requires each partner to authenticate by presenting an X.509 client certificate issued by the partner's intermediate CA. The endpoint must be reachable only from the company VPC, and the team wants to avoid writing custom certificate-validation logic. Which solution meets these requirements with the least operational overhead?
Issue an IAM access key and secret key to each partner and require Signature Version 4-signed HTTPS requests to an Internet-facing API Gateway endpoint secured with IAM authorization.
Create a private Amazon API Gateway REST API, enable mutual TLS with a trust store that contains the partners' CA certificates, and access the API through an interface VPC endpoint.
Deploy an internal Application Load Balancer with an HTTPS listener configured for mutual TLS verify mode. Create an ELB trust store containing the partners' CA certificates in Amazon S3 and attach it to the listener.
Provide partners with presigned Amazon S3 PUT URLs secured with TLS 1.2 so they can upload their data files.
Answer Description
An internal Application Load Balancer (ALB) can terminate TLS and perform mutual TLS verification. The team uploads a CA bundle that trusts the partners' intermediate CAs to Amazon S3, creates an ELB trust store from that bundle, attaches the trust store to an HTTPS listener in mutual TLS verify mode, and points the ALB at the ingestion service. The ALB authenticates client certificates during the TLS handshake and blocks untrusted connections, so no backend changes are needed. Because the ALB is internal and secured by VPC security groups, the endpoint is accessible only from the company VPC. The other options rely on presigned URLs or IAM Signature Version 4 signing, which are key based rather than certificate based, or attempt to use mutual TLS with a private API Gateway REST API, which is not supported.
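A hedged sketch of that ALB configuration (the trust store name, S3 location, and listener ARN are placeholders):
import boto3

elbv2 = boto3.client("elbv2")

# Trust store built from the partners' CA bundle stored in S3.
trust_store_arn = elbv2.create_trust_store(
    Name="partner-ca-bundle",
    CaCertificatesBundleS3Bucket="example-mtls-config",     # placeholder bucket
    CaCertificatesBundleS3Key="partners/ca-bundle.pem",     # placeholder key
)["TrustStores"][0]["TrustStoreArn"]

# Enforce mutual TLS verification on the internal HTTPS listener.
elbv2.modify_listener(
    ListenerArn="arn:aws:elasticloadbalancing:eu-west-1:111122223333:listener/app/ingest-alb/abc/def",  # placeholder
    MutualAuthentication={"Mode": "verify", "TrustStoreArn": trust_store_arn},
)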
An e-commerce company stores its daily sales metrics as partitioned Parquet files in Amazon S3. Business analysts must build interactive dashboards that refresh hourly, support ad-hoc filtering, and must not require the data engineering team to provision or manage servers. Users are authenticated through Amazon Cognito. Which approach meets the requirements with the least operational overhead?
Configure Amazon QuickSight to query the S3 data through an Athena data source, enable SPICE for hourly refreshes, and share dashboards with Cognito-authenticated users.
Create an Amazon Redshift cluster, load the Parquet data with the COPY command, and connect Amazon QuickSight in direct-query mode.
Launch an Amazon EMR cluster running Presto to serve the data and deploy an open-source visualization tool on Amazon ECS for analysts.
Schedule AWS Glue DataBrew jobs to generate visual charts and publish them as static HTML pages in Amazon S3.
Answer Description
Amazon QuickSight is a fully managed, serverless BI service. It can connect to Amazon S3 data through an Athena data source, and the in-memory SPICE engine can be scheduled for hourly refreshes while still allowing analysts to perform fast, interactive filtering. Dashboards can be shared with users federated by Amazon Cognito without the team managing any infrastructure. AWS Glue DataBrew is designed for data preparation and profiling, not for hosting interactive dashboards. Creating an Amazon Redshift cluster or an Amazon EMR-based Presto query layer would satisfy the visualization need only after the team provisions, secures, and scales additional infrastructure, which contradicts the low-overhead requirement.
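As a rough sketch only (the account ID, dataset ID, and schedule ID are assumptions), an hourly SPICE refresh could be scheduled like this:
import datetime
import boto3

quicksight = boto3.client("quicksight")
quicksight.create_refresh_schedule(
    AwsAccountId="111122223333",                 # placeholder account
    DataSetId="daily-sales-dataset",             # placeholder dataset ID
    Schedule={
        "ScheduleId": "hourly-spice-refresh",
        "ScheduleFrequency": {"Interval": "HOURLY"},
        "RefreshType": "FULL_REFRESH",
        "StartAfterDateTime": datetime.datetime(2025, 1, 1),
    },
)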
A data engineering team created a materialized view in Amazon Redshift that joins the internal fact_sales table with an external product_dim table stored in Amazon S3 through a Spectrum external schema. After the product_dim data files are overwritten each night, analysts notice that the view returns stale data. The team must keep results current in the most cost-effective way without copying the external table into Redshift. What should they do?
Convert the product_dim external table into a regular Redshift table so the view can refresh automatically.
Replace the materialized view with a late-binding view so it always reads the latest external data.
Run ALTER MATERIALIZED VIEW … AUTO REFRESH YES to enable incremental refresh on the existing view.
Schedule the REFRESH MATERIALIZED VIEW command to run after the nightly S3 load completes.
Answer Description
Materialized views that reference external schemas cannot be configured with AUTO REFRESH. They remain unchanged until explicitly refreshed. The least-cost approach is to leave the data in Amazon S3 and schedule the REFRESH MATERIALIZED VIEW command (using Amazon EventBridge, the Redshift scheduler API, or another orchestrator) to run immediately after the nightly file replacement. Loading the external table into Redshift or switching to a late-binding view would either add storage cost or reduce performance. ALTER MATERIALIZED VIEW … AUTO REFRESH YES is unsupported for external schemas, so it would fail.
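For illustration (the cluster, database, secret, and view names are placeholders), the refresh itself is a single Redshift Data API call that an EventBridge rule or scheduler can trigger once the nightly S3 load finishes:
import boto3

rsd = boto3.client("redshift-data")
rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",                                            # placeholder
    Database="sales",                                                                 # placeholder
    SecretArn="arn:aws:secretsmanager:eu-west-1:111122223333:secret:redshift-admin",  # placeholder
    Sql="REFRESH MATERIALIZED VIEW mv_sales_by_product;",                             # placeholder view
)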
Your organization uses AWS Lake Formation to govern a raw data lake in Amazon S3. You registered the s3://finance-raw bucket and cataloged the transactions table in the finance database. Analysts already have Lake Formation SELECT on the table, yet Athena returns "Access Denied - insufficient Lake Formation permissions." Which additional Lake Formation permission will resolve the error without granting broader S3 or IAM access?
Grant Lake Formation DATA_LOCATION_ACCESS on the s3://finance-raw location.
Give the IAM role Lake Formation ALTER permission on the transactions table.
Grant Lake Formation DESCRIBE permission on the default database.
Attach an IAM policy that allows s3:GetObject on the finance-raw bucket.
Answer Description
For Athena to run a query, Lake Formation must be able to read metadata in the Glue Data Catalog. The service looks in the default database, so the querying principal needs at least DESCRIBE on that database. Without it, Lake Formation blocks the request and Athena reports "Access Denied." Granting DESCRIBE on the default database satisfies the metadata check; DATA_LOCATION_ACCESS is only required for creating resources, and adding direct S3 permissions or ALTER on the table would bypass governance or still fail the metadata lookup.
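A minimal sketch of the grant (the analyst role ARN is a placeholder):
import boto3

lakeformation = boto3.client("lakeformation")
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst-role"},  # placeholder
    Resource={"Database": {"Name": "default"}},
    Permissions=["DESCRIBE"],
)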
An ecommerce company keeps 3 years of web-server logs as uncompressed .txt files in the s3://company-data/logs/ prefix. Data analysts must run interactive ad-hoc SQL queries against only the most recent 90 days of logs. The solution must minimize query cost, leave the raw files unchanged, and avoid managing long-running infrastructure. Which approach best meets these requirements?
Copy the most recent 90 days of logs into an Amazon Redshift cluster and pause the cluster when queries are finished.
Use an AWS Glue ETL job to convert the latest 90 days of .txt logs to compressed Parquet files in a separate S3 prefix and query that prefix with Amazon Athena.
Import all .txt logs into an Amazon RDS for PostgreSQL instance with auto-scaling storage and index the timestamp column.
Create external tables in Amazon Athena that reference the existing .txt files and add day-based partitions for the last 90 days.
Answer Description
Converting the most recent 90-day slice of the .txt logs to a columnar, compressed format such as Parquet sharply reduces the amount of data that Amazon Athena needs to scan, lowering both latency and per-query cost. A serverless AWS Glue job can write the converted data to a new S3 prefix, so the original uncompressed text files remain untouched. Athena can then query only the Parquet objects without requiring an always-on cluster. Pointing Athena directly at the .txt files still incurs high scan costs even with partitioning, while Amazon Redshift or Amazon RDS would require provisioning and managing database instances, increasing operational overhead and cost.
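A simplified Glue (PySpark) job sketch under the assumption that the job is passed the 90-day source prefix and a target prefix as arguments; real jobs would also parse each log line into columns before writing:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["SOURCE_PREFIX", "TARGET_PREFIX"])   # placeholder job arguments

spark = GlueContext(SparkContext()).spark_session

# Read only the recent raw text logs; the original .txt files stay untouched.
logs = spark.read.text(args["SOURCE_PREFIX"])

# Write compressed, columnar Parquet to a separate prefix for Athena to scan cheaply.
logs.write.mode("overwrite").option("compression", "snappy").parquet(args["TARGET_PREFIX"])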
An organization runs nightly Apache Spark ETL jobs with Amazon EMR on EKS. Each executor pod requests 4 vCPU and 32 GiB memory, but its CPU limit is also set to 4 vCPU. CloudWatch shows frequent CpuCfsThrottledSeconds and long task runtimes, while cluster nodes have unused CPU. The team wants faster jobs without adding nodes or instances. Which action meets the requirement?
Remove the CPU limit or raise it well above the request so executor containers can use idle vCPU on the node.
Migrate the workload to AWS Glue interactive sessions, which automatically scale compute resources.
Replace gp3 root volumes with io2 volumes on worker nodes to increase disk throughput.
Enable Spark dynamic allocation so the job can launch additional executor pods during the run.
Answer Description
CPU throttling occurs when a container exhausts the CPU limit defined in its pod specification. For Spark workloads this causes tasks to pause even though idle CPU is still available on the node. Removing the CPU limit (or increasing it well above the request) lets the executor containers borrow spare vCPUs and eliminates cgroup throttling, reducing job runtime without needing additional EC2 capacity. Increasing disk throughput, enabling dynamic allocation, or migrating to a different service may improve performance in some cases but do not directly address the throttling caused by the Kubernetes CPU limit in this scenario.
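A hedged sketch of how the executor CPU limit could be raised well above the request when submitting the job with EMR on EKS (the cluster ID, role ARN, release label, and script path are placeholders):
import boto3

emr_containers = boto3.client("emr-containers")
emr_containers.start_job_run(
    virtualClusterId="abc123virtualcluster",                                  # placeholder
    name="nightly-etl",
    executionRoleArn="arn:aws:iam::111122223333:role/emr-eks-job-role",       # placeholder
    releaseLabel="emr-6.15.0-latest",                                         # placeholder
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://example-code/etl.py",                         # placeholder script
            "sparkSubmitParameters": (
                "--conf spark.executor.memory=32g "
                "--conf spark.kubernetes.executor.request.cores=4 "
                # Limit raised well above the request so executors can use idle vCPU;
                # omitting the limit removes CPU capping entirely.
                "--conf spark.kubernetes.executor.limit.cores=8"
            ),
        }
    },
)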
Your team has registered an Amazon S3 data lake with AWS Lake Formation, and analysts query the data through Amazon Athena. The security team must ensure that any S3 object Amazon Macie flags as containing PII is automatically blocked from the analyst LF-principal but remains accessible to the governance LF-principal. The solution must rely on AWS-managed integrations and involve as little custom code as possible. Which approach meets these requirements?
Run an AWS Glue crawler with custom classifiers that detect PII and update the Data Catalog, then attach IAM policies that deny analysts access to any tables the crawler marks as sensitive.
Configure an Amazon Macie discovery job and an EventBridge rule that starts a Step Functions workflow. The workflow calls Lake Formation AddLFTagsToResource to tag resources Classification=Sensitive and applies LF-tag policies that block analysts and allow governance users.
Generate daily S3 Inventory reports, use S3 Batch Operations to tag files that contain sensitive keywords, and add bucket policies that block the analyst group from those objects while permitting governance access.
Use S3 Object Lambda with a Lambda function that removes or redacts PII from objects before analysts access them, while governance users read the original objects directly.
Answer Description
Create an Amazon Macie sensitive-data discovery job for the lake buckets. Configure an Amazon EventBridge rule that triggers an AWS Step Functions state machine whenever Macie publishes a sensitive-data finding. In the workflow, use an AWS SDK task to call the Lake Formation AddLFTagsToResource API and attach an LF-tag such as Classification=Sensitive to the Data Catalog table or columns that correspond to the flagged object. Lake Formation tag-based access-control policies then deny the analyst principal and allow the governance principal for resources tagged Classification=Sensitive. This uses only managed integrations (Macie, EventBridge, Step Functions, Lake Formation) and requires minimal code; no bespoke parsing is needed beyond the workflow definition.
The other options either rely on custom parsing inside Lambda, do not use Macie for detection, or cannot apply Lake Formation permissions automatically.
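For illustration, the SDK task in the workflow reduces to a call like the following, where the database, table, and column names would be derived from the Macie finding (the values shown are placeholders):
import boto3

lakeformation = boto3.client("lakeformation")
lakeformation.add_lf_tags_to_resource(
    Resource={
        "TableWithColumns": {
            "DatabaseName": "finance",                   # placeholder database
            "Name": "transactions",                      # placeholder table
            "ColumnNames": ["card_number", "email"],     # placeholder columns flagged by Macie
        }
    },
    LFTags=[{"TagKey": "Classification", "TagValues": ["Sensitive"]}],
)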
An Amazon Athena table named clickstream contains columns session_id string, page string, event_time timestamp, and load_time_ms int. A data engineer must return the five pages with the highest average load_time_ms recorded in the last 7 days, but only for pages that have at least 100 distinct sessions. Which SQL query satisfies the requirement?
SELECT page, AVG(load_time_ms) AS avg_load FROM clickstream GROUP BY page HAVING COUNT(DISTINCT session_id) >= 100 AND event_time >= current_timestamp - INTERVAL '7' day ORDER BY avg_load DESC LIMIT 5;
SELECT page, AVG(load_time_ms) AS avg_load FROM clickstream WHERE event_time >= current_timestamp - INTERVAL '7' day GROUP BY page HAVING COUNT(DISTINCT session_id) >= 100 ORDER BY COUNT(DISTINCT session_id) DESC LIMIT 5;
SELECT page, AVG(load_time_ms) AS avg_load FROM clickstream WHERE event_time >= current_timestamp - INTERVAL '7' day AND COUNT(DISTINCT session_id) >= 100 GROUP BY page ORDER BY avg_load DESC LIMIT 5;
SELECT page, AVG(load_time_ms) AS avg_load FROM clickstream WHERE event_time >= current_timestamp - INTERVAL '7' day GROUP BY page HAVING COUNT(DISTINCT session_id) >= 100 ORDER BY avg_load DESC LIMIT 5;
Answer Description
The correct query uses multiple qualifiers in the appropriate order: WHERE filters rows that fall within the last 7 days, GROUP BY aggregates by page, HAVING keeps only pages whose aggregated count of distinct session_id values meets the threshold, ORDER BY sorts on the derived average, and LIMIT returns the top five results. Aggregated conditions cannot appear in a WHERE clause, and non-grouped columns such as event_time cannot appear in HAVING without aggregation. Ordering by the session count would not satisfy the requirement to sort by the highest average load_time_ms.
A DynamoDB table that stores IoT sensor readings peaks at 40,000 writes per second. The analytics team must land every new item in an Amazon S3 data lake within 60 seconds. The solution must auto-scale, provide at-least-once delivery, and minimize operational overhead. Which architecture meets these requirements MOST effectively?
Enable DynamoDB Streams with the NEW_IMAGE view and configure the stream as an event source for an AWS Lambda function; inside the function, batch the records and submit them to an Amazon Kinesis Data Firehose delivery stream that writes to S3.
Use AWS Database Migration Service in change data capture mode to replicate the DynamoDB table continuously to an S3 target.
Schedule an AWS Glue batch job every minute to export the entire table to S3 by using DynamoDB export to S3.
Create an AWS Glue streaming ETL job that consumes the table's stream ARN directly and writes the data to Amazon S3.
Answer Description
Using DynamoDB Streams with an AWS Lambda function satisfies all stated requirements. Streams delivers every write in near-real time, and the stream can be configured as a native event source for Lambda, which automatically scales with the number of stream shards and provides at-least-once processing without server management. The function can batch the records and submit them to an Amazon Kinesis Data Firehose delivery stream, which efficiently loads the data into Amazon S3. Glue streaming jobs cannot consume DynamoDB Streams, DMS introduces additional infrastructure and higher latency, and frequent full-table exports are inefficient and would miss the one-minute latency target.
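A minimal Lambda handler sketch (the delivery stream name is a placeholder); production code would also chunk batches to the 500-record PutRecordBatch limit and retry any failed records:
import json
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    # Each stream record carries the new item image in DynamoDB JSON format.
    records = [
        {"Data": (json.dumps(record["dynamodb"]["NewImage"]) + "\n").encode("utf-8")}
        for record in event["Records"]
        if "NewImage" in record.get("dynamodb", {})
    ]
    if records:
        firehose.put_record_batch(
            DeliveryStreamName="sensor-readings-to-s3",   # placeholder delivery stream
            Records=records,
        )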
A company ingests clickstream events into an Amazon DynamoDB table. Traffic remains near zero most of the day but bursts to 40,000 writes per second during marketing campaigns. Analysts query events by userId and timestamp range. Provisioned capacity with auto scaling causes throttling and wasted spend. Which configuration best meets the performance and cost requirements with minimal administration?
Convert the table and all global secondary indexes to on-demand capacity mode.
Add a DynamoDB Accelerator (DAX) cluster in front of the table to cache hot items.
Triple the provisioned write capacity and reduce the auto-scaling cooldown period to 30 seconds.
Enable DynamoDB Streams and invoke an AWS Lambda function to batch writes into Amazon S3.
Answer Description
Switching the table and any global secondary indexes to on-demand capacity mode removes the need to provision or scale capacity manually. DynamoDB automatically accommodates sudden traffic spikes (up to tens of thousands of requests per second) and charges only for the read and write requests actually made, eliminating throttling during peaks and cost during idle periods. Simply raising provisioned capacity or tweaking auto-scaling still incurs idle cost and requires tuning. DAX improves read latency but does not increase write throughput. DynamoDB Streams with Lambda off-loads data elsewhere but leaves the original write traffic and throttling issues unresolved.
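The switch itself is a single call (the table name is a placeholder); global secondary indexes follow the table's billing mode:
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.update_table(
    TableName="clickstream-events",    # placeholder table name
    BillingMode="PAY_PER_REQUEST",     # on-demand capacity mode
)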
Your team receives unpredictable batches of CSV transaction files in a dedicated Amazon S3 prefix. Every file must be ingested into an Amazon Redshift staging table within five minutes of arrival. The solution must follow an event-driven batch pattern, avoid idle infrastructure, and scale automatically with the daily file count. Which approach meets these requirements while keeping operational overhead low?
Send the files to an Amazon Kinesis Data Firehose delivery stream configured to deliver records to Amazon Redshift.
Configure an Amazon S3 event notification that routes through EventBridge to trigger an AWS Glue job, and have the job run a Redshift COPY command for the new object.
Set up an AWS Database Migration Service task with S3 as the source endpoint and Redshift as the target to perform full load and change data capture.
Create an AWS Glue job with a 5-minute cron schedule that recursively scans the prefix and loads any discovered files into Redshift.
Answer Description
Amazon S3 can emit an event for every new object. Publishing that event to Amazon EventBridge allows a rule to start an AWS Glue job only when a file is written. The Glue job can issue a COPY command that loads the single object into Amazon Redshift, giving near-real-time latency without running servers between arrivals. A cron-based Glue schedule polls rather than reacts to events and could miss the five-minute window or waste resources. AWS DMS cannot use S3 as a change-data-capture source for Redshift in this scenario, and Kinesis Data Firehose expects streaming records, not entire objects already in S3, so it does not satisfy the event-driven batch requirement.
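A sketch of the Glue job body under the assumption that the EventBridge target passes the bucket and object key in as job arguments (all names and ARNs are placeholders):
import sys
import boto3
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["BUCKET", "KEY"])    # placeholder job arguments

rsd = boto3.client("redshift-data")
rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",                                           # placeholder
    Database="dev",                                                                  # placeholder
    SecretArn="arn:aws:secretsmanager:eu-west-1:111122223333:secret:redshift-etl",   # placeholder
    Sql=(
        f"COPY staging.transactions FROM 's3://{args['BUCKET']}/{args['KEY']}' "
        "IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-copy-role' "
        "FORMAT AS CSV IGNOREHEADER 1"
    ),
)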