
AWS Certified Data Engineer Associate Practice Test (DEA-C01)

Use the form below to configure your AWS Certified Data Engineer Associate Practice Test (DEA-C01). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.


AWS Certified Data Engineer Associate DEA-C01 Information

The AWS Certified Data Engineer – Associate certification validates your ability to design, build, and manage data pipelines on the AWS Cloud. It’s designed for professionals who transform raw data into actionable insights using AWS analytics and storage services. This certification proves you can work with modern data architectures that handle both batch and streaming data, using services such as Amazon S3, AWS Glue, Amazon Redshift, Amazon EMR, Amazon Kinesis, and Amazon Athena to deliver scalable and efficient data solutions.

The exam covers the full data lifecycle — from ingestion and transformation to storage, analysis, and optimization. Candidates are tested on their understanding of how to choose the right AWS services for specific use cases, design secure and cost-effective pipelines, and ensure data reliability and governance. You’ll need hands-on knowledge of how to build ETL workflows, process large datasets efficiently, and use automation to manage data infrastructure in production environments.

Earning this certification demonstrates to employers that you have the technical expertise to turn data into value on AWS. It’s ideal for data engineers, analysts, and developers who work with cloud-based data systems and want to validate their skills in one of the most in-demand areas of cloud computing today. Whether you’re building data lakes, streaming pipelines, or analytics solutions, this certification confirms you can do it the AWS way — efficiently, securely, and at scale.

This free practice test includes 20 questions, has no time limit, and covers the following domains:

  • Data Ingestion and Transformation
  • Data Store Management
  • Data Operations and Support
  • Data Security and Governance
Question 1 of 20

Your team needs a managed, serverless workflow that starts when an object arrives under s3://sales/landing/. The workflow must invoke a Lambda function to validate each file, run an AWS Glue Spark job to transform the data, then call another Lambda to load the result into Amazon Redshift. It must provide automatic per-step retries, execution history, and one-click resume from failures. Which solution is most cost-effective?

  • Set up an Amazon EventBridge pipe to invoke the first Lambda function; have that function synchronously call the Glue job and second Lambda while implementing all retries in code.

  • Deploy an Amazon MWAA environment and author an Apache Airflow DAG that coordinates the two Lambda tasks and the Glue job.

  • Build an AWS Glue Workflow that runs the Glue job and add the two Lambda steps as Python shell jobs inside the workflow.

  • Create an AWS Step Functions state machine that invokes the two Lambda functions and the AWS Glue job, and trigger the state machine with an Amazon EventBridge rule for the S3 prefix.
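
For reference, a minimal boto3 sketch of the EventBridge-to-Step-Functions wiring described in the last option above. The rule name, role ARN, and state machine ARN are placeholders, and it assumes EventBridge notifications are enabled on the sales bucket.

import json
import boto3

events = boto3.client("events")

# Rule that matches S3 "Object Created" events under s3://sales/landing/.
rule_name = "sales-landing-object-created"  # placeholder name
state_machine_arn = "arn:aws:states:us-east-1:123456789012:stateMachine:sales-etl"  # placeholder

events.put_rule(
    Name=rule_name,
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": ["sales"]},
            "object": {"key": [{"prefix": "landing/"}]},
        },
    }),
)

# Point the rule at the state machine; the role must allow states:StartExecution.
events.put_targets(
    Rule=rule_name,
    Targets=[{
        "Id": "start-sales-etl",
        "Arn": state_machine_arn,
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-sfn",  # placeholder
    }],
)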

Question 2 of 20

An AWS Glue ETL job writes driver logs to the log group /aws-glue/jobs/output in JSON format. Each log event contains the fields level, message, and jobRunId. You must use CloudWatch Logs Insights to quickly show a count of unique jobRunId values that logged the string "ERROR TimeoutException" during the last 24 hours, while minimizing query cost. Which query meets these requirements?

  • fields @timestamp, jobRunId, message | filter message like /TimeoutException/ | stats count_distinct(message)

  • fields @timestamp, jobRunId, message | sort @timestamp desc | filter message like /ERROR TimeoutException/ | limit 1000 | stats count_distinct(jobRunId)

  • filter message like /ERROR TimeoutException/ | stats count(jobRunId)

  • fields message, jobRunId | filter message like /ERROR TimeoutException/ | stats count_distinct(jobRunId) as affectedRuns
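
As a study aid, here is one way to run a Logs Insights query like those above programmatically with boto3; the query string is simply one of the candidates shown and is included only to illustrate the API.

import time
import boto3

logs = boto3.client("logs")

# One of the candidate queries above, run over the last 24 hours of the
# /aws-glue/jobs/output log group.
query_string = (
    "fields message, jobRunId "
    "| filter message like /ERROR TimeoutException/ "
    "| stats count_distinct(jobRunId) as affectedRuns"
)

end_time = int(time.time())
run = logs.start_query(
    logGroupName="/aws-glue/jobs/output",
    startTime=end_time - 24 * 3600,
    endTime=end_time,
    queryString=query_string,
)

# Poll until the query finishes, then read the single aggregate row.
while True:
    result = logs.get_query_results(queryId=run["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)
print(result["results"])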

Question 3 of 20

A company has 5 TB of structured sales data that analysts query using complex joins, window functions, and aggregations. The queries must return results within seconds during business hours, and the team wants automatic columnar storage compression without managing infrastructure. Which AWS storage platform should host the dataset to meet these requirements?

  • Amazon DynamoDB

  • Amazon Redshift

  • Amazon RDS for MySQL

  • An AWS Lake Formation data lake on Amazon S3 queried with Amazon Athena

Question 4 of 20

A data engineer is developing a production ML workflow that uses Amazon SageMaker Pipelines to read raw files from Amazon S3, perform data preprocessing, train a model, and deploy the model to a SageMaker endpoint. The company must keep an auditable, end-to-end record of every dataset, processing job, model version, and endpoint created by the pipeline while writing as little custom tracking code as possible. Which solution meets these requirements?

  • Refactor the workflow into AWS Step Functions and enable AWS X-Ray tracing so that each state transition captures lineage information for audit queries.

  • Run an AWS Glue crawler after every pipeline step and store the results in the AWS Glue Data Catalog to represent lineage between datasets, jobs, and models.

  • Enable SageMaker ML Lineage Tracking in the SageMaker Pipeline so that each step automatically registers its artifacts and relationships, then query the lineage graph through the SageMaker Lineage API.

  • Turn on AWS CloudTrail for all SageMaker API calls and analyze the resulting logs with Amazon Athena to reconstruct the lineage of artifacts.

Question 5 of 20

A data engineering team manages a MySQL database hosted on Amazon RDS. Compliance requires that the application password be rotated automatically every 30 days without manual scripting. The analytics pipeline runs on AWS Lambda functions in the same account. Which approach meets the requirement while minimizing operational overhead?

  • Encrypt the password with AWS KMS, save it in a Lambda environment variable, and update the variable manually through a CI/CD pipeline each month.

  • Store the password in AWS Systems Manager Parameter Store as a SecureString and use an EventBridge rule to trigger a custom Lambda function to rotate it every 30 days.

  • Set the master password in Amazon RDS to the keyword AWS_ROTATE to enable automatic rotation and allow Lambda to read the password from the DB instance endpoint.

  • Store the password in AWS Secrets Manager, enable the built-in RDS MySQL rotation schedule, and grant the Lambda execution role permission to retrieve the secret.
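
For context, a minimal sketch of a Lambda handler that reads an RDS MySQL secret from AWS Secrets Manager at run time. The secret name is a placeholder, and the execution role is assumed to have secretsmanager:GetSecretValue on it.

import json
import boto3

secrets = boto3.client("secretsmanager")
SECRET_ID = "prod/analytics/mysql"  # placeholder secret name

def lambda_handler(event, context):
    # RDS secrets managed by Secrets Manager store a JSON blob with
    # username, password, host, port, and dbname keys.
    secret = json.loads(secrets.get_secret_value(SecretId=SECRET_ID)["SecretString"])
    # Because the secret is fetched on every invocation, rotation requires no
    # code or configuration change. A MySQL client (not bundled here) would
    # then connect with secret["host"], secret["username"], secret["password"].
    return {"statusCode": 200}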

Question 6 of 20

A data engineer is building an AWS Step Functions Standard workflow that will invoke an AWS Glue job for each of 200 daily S3 partitions. No more than 10 Glue jobs should run at the same time, each invocation must automatically retry twice with exponential backoff for transient errors, and the workflow must fail immediately on a custom "DATA_VALIDATION_FAILED" error returned by the job. Which Step Functions design will meet these requirements with the least custom code?

  • Run an Express Step Functions workflow triggered by Amazon EventBridge rules that submit Glue jobs in batches of 10 until all partitions are processed.

  • Use a Parallel state with 10 static branches; each branch invokes the Glue job for a subset of partitions.

  • Invoke the Glue job from a Lambda function in a Task state and write custom code in the function to iterate through partitions, manage retries, and enforce a 10-job concurrency limit.

  • Create a Map state that iterates over the array of partition prefixes, set MaxConcurrency to 10, and configure Retry with a BackoffRate and a Catch clause for the DATA_VALIDATION_FAILED error.
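
For reference, one possible shape for the Map-state machine described in the last option, written in Amazon States Language as a Python dictionary. The job name, state names, and transient error names are illustrative placeholders, and it assumes the Glue job surfaces the custom error under the name DATA_VALIDATION_FAILED.

import json

definition = {
    "StartAt": "ProcessPartitions",
    "States": {
        "ProcessPartitions": {
            "Type": "Map",
            "ItemsPath": "$.partitions",     # array of partition prefixes
            "MaxConcurrency": 10,            # at most 10 Glue jobs at a time
            "Iterator": {
                "StartAt": "RunGlueJob",
                "States": {
                    "RunGlueJob": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::glue:startJobRun.sync",
                        "Parameters": {
                            "JobName": "daily-partition-etl",      # placeholder
                            "Arguments": {"--partition.$": "$"},
                        },
                        # Retry transient failures twice with exponential backoff;
                        # error names here are illustrative and should be narrowed
                        # to the errors actually observed.
                        "Retry": [{
                            "ErrorEquals": ["Glue.AWSGlueException", "States.Timeout"],
                            "MaxAttempts": 2,
                            "BackoffRate": 2.0,
                        }],
                        # Fail immediately on the custom validation error.
                        "Catch": [{
                            "ErrorEquals": ["DATA_VALIDATION_FAILED"],
                            "Next": "ValidationFailed",
                        }],
                        "End": True,
                    },
                    "ValidationFailed": {"Type": "Fail", "Error": "DATA_VALIDATION_FAILED"},
                },
            },
            "End": True,
        },
    },
}
print(json.dumps(definition, indent=2))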

Question 7 of 20

A data engineer must enable analysts to run ad hoc SQL queries from Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR Presto against semi-structured JSON files stored in an S3 data lake. The solution must avoid duplicating table definitions and should automatically detect new daily partitions that land in the same S3 prefix. Which approach meets these requirements with minimal operational overhead?

  • Embed the JSON schema in every Spark job and instruct analysts to load the data into temporary views before running SQL queries.

  • Create separate external tables with identical names in Athena, Redshift Spectrum, and the EMR Hive metastore, updating each table manually when partitions arrive.

  • Configure an AWS Glue crawler on the S3 prefix to populate an AWS Glue Data Catalog table and have all query engines reference that catalog.

  • Store Avro schema definition files alongside the data in S3 and rely on each engine's SerDe to discover new partitions at query time.
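
A rough boto3 sketch of a scheduled Glue crawler over an S3 prefix; the crawler name, IAM role, database name, S3 path, and schedule are placeholders.

import boto3

glue = boto3.client("glue")

# Crawl the JSON prefix daily so new partitions are added to a single Glue
# Data Catalog table that Athena, Redshift Spectrum, and EMR can all reference.
glue.create_crawler(
    Name="clickstream-json-crawler",                         # placeholder
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder
    DatabaseName="data_lake",                                # placeholder
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/events/"}]},  # placeholder
    Schedule="cron(30 0 * * ? *)",                           # placeholder daily schedule
    SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE", "DeleteBehavior": "LOG"},
)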

Question 8 of 20

An insurance company keeps policy documents in an Amazon S3 bucket that has versioning enabled. Regulations require that every object, including all previous versions, must be permanently deleted exactly 7 years (2,555 days) after its creation. The solution must prove compliance while minimizing operational overhead and maintenance work. Which action will meet these requirements?

  • Configure an AWS Backup plan for the bucket with a 7-year retention rule so that the original objects are deleted after the backups expire.

  • Create an S3 Lifecycle rule with two expiration actions that permanently delete current object versions and noncurrent object versions after 2,555 days, and enable removal of expired object delete markers.

  • Set up an EventBridge rule that invokes an AWS Lambda function daily to list objects older than 2,555 days and delete each version individually.

  • Enable S3 Object Lock in compliance mode with a 7-year retention period so that objects are automatically removed when the retention period ends.
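
For reference, a boto3 sketch of a versioned-bucket lifecycle configuration along these lines. The bucket name is a placeholder, and because S3 does not allow ExpiredObjectDeleteMarker in the same Expiration element as Days, delete-marker cleanup is shown as a second rule.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-policy-documents",   # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                # Current versions expire (a delete marker is added) 2,555 days
                # after creation; noncurrent versions are permanently removed
                # 2,555 days after they become noncurrent.
                "ID": "expire-after-7-years",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Expiration": {"Days": 2555},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 2555},
            },
            {
                # Clean up delete markers left behind once all versions are gone.
                "ID": "remove-expired-delete-markers",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Expiration": {"ExpiredObjectDeleteMarker": True},
            },
        ]
    },
)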

Question 9 of 20

An analytics team ingests clickstream logs into Amazon S3 and uses nightly AWS Glue Spark jobs to aggregate the data and load it into Amazon Redshift. Auditors must be able to trace each Redshift column back to the exact S3 objects that produced it to verify data accuracy. Which approach delivers automatic column-level data lineage with minimal operational overhead?

  • Run the transformations with AWS Glue ETL jobs and use the AWS Glue Data Catalog's built-in lineage features to track sources and targets.

  • Enable AWS CloudTrail data events for S3 and Redshift and analyze the logs in Amazon Athena to reconstruct lineage.

  • Schedule Amazon Inspector assessments of the Redshift cluster to generate data provenance reports.

  • Attach custom S3 object tags that identify lineage and propagate the tags through each Glue job using job parameters.

Question 10 of 20

An e-commerce startup ingests clickstream events into an Amazon DynamoDB table. Traffic is highly unpredictable: most of the day only a few hundred writes per minute occur, but flash-sale campaigns generate short spikes of up to 50,000 writes per second. The team wants the simplest configuration that keeps costs low during idle periods while automatically absorbing the spikes without throttling. Which solution satisfies these requirements?

  • Enable on-demand capacity mode and turn on TTL so write capacity automatically drops to zero when items expire.

  • Create the table in on-demand capacity mode; rely on its automatic scaling for write traffic.

  • Configure the table with 5,000 provisioned WCUs and attach a multi-node DynamoDB Accelerator (DAX) cluster to absorb burst writes.

  • Use provisioned capacity mode with 50,000 write capacity units and enable auto scaling between 1,000 and 50,000 WCUs.
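
As a quick illustration, creating a DynamoDB table in on-demand capacity mode with boto3; the table and key names are placeholders.

import boto3

dynamodb = boto3.client("dynamodb")

# On-demand (PAY_PER_REQUEST) mode: no capacity units to provision, and the
# table absorbs write spikes without manual scaling.
dynamodb.create_table(
    TableName="clickstream-events",      # placeholder
    BillingMode="PAY_PER_REQUEST",
    AttributeDefinitions=[
        {"AttributeName": "sessionId", "AttributeType": "S"},
        {"AttributeName": "eventTime", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "sessionId", "KeyType": "HASH"},
        {"AttributeName": "eventTime", "KeyType": "RANGE"},
    ],
)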

Question 11 of 20

A data engineering team launches a transient Amazon EMR cluster each night through an AWS Step Functions workflow. Before any Spark job runs, the cluster must have a proprietary JDBC driver installed on every node. After installation, a PySpark ETL script stored in Amazon S3 must be executed. What is the most operationally efficient way to meet these requirements using native EMR scripting capabilities?

  • Configure a bootstrap action that downloads and installs the driver on all nodes, then add an EMR step that runs spark-submit on the PySpark script in Amazon S3.

  • Schedule an EMR Notebook that first installs the driver with pip commands and then executes the PySpark code, triggered nightly by a cron expression.

  • Pass a shell script to a Hadoop Streaming step that both installs the driver and calls the PySpark script in a single command.

  • Build a custom AMI with the driver pre-installed and specify the PySpark ETL through classification properties when creating the cluster.
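
For reference, a trimmed run_job_flow sketch that combines a bootstrap action with a spark-submit step. Every name, path, instance count, and release label below is a placeholder.

import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="nightly-etl",                               # placeholder
    ReleaseLabel="emr-6.15.0",                        # placeholder release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,         # transient cluster
    },
    # Bootstrap actions run on every node before any step starts.
    BootstrapActions=[{
        "Name": "install-jdbc-driver",
        "ScriptBootstrapAction": {"Path": "s3://example-scripts/install_jdbc_driver.sh"},
    }],
    # The step submits the PySpark ETL script stored in S3.
    Steps=[{
        "Name": "run-pyspark-etl",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-scripts/etl_job.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)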

Question 12 of 20

You manage an Amazon EKS cluster that runs containerized Apache Spark batch jobs that transform data in Amazon S3. The cluster uses a fixed managed node group of twenty m5.xlarge On-Demand instances. During nightly runs CPU utilization exceeds 80 percent and jobs slow, but daytime utilization is under 10 percent. You must boost performance and cut idle costs with minimal operations effort. Which approach meets these goals?

  • Install the Kubernetes Cluster Autoscaler on the EKS cluster, create a managed node group that mixes On-Demand and Spot Instances, and set CPU and memory requests for all Spark pods.

  • Create an EKS Fargate profile for the Spark namespace so every Spark pod runs on Fargate while keeping the existing node group for system pods.

  • Migrate the Spark containers to Amazon ECS and enable Service Auto Scaling based on average CPU utilization across tasks.

  • Increase the existing node group to forty m5.xlarge instances and enable vertical pod autoscaling for Spark executors to remove resource contention.

Question 13 of 20

Every day at 02:00 UTC, a healthcare company must ingest the previous day's CSV file from an Amazon S3 bucket into a staging table in Amazon Redshift. The team wants a fully managed, serverless solution that minimizes cost and ongoing administration while reliably running at the scheduled time. Which approach best meets these requirements?

  • Create a cron-based Amazon EventBridge rule that starts an AWS Glue ETL job, which reads the CSV file from S3 and writes it to Amazon Redshift.

  • Deploy an Amazon Managed Workflows for Apache Airflow (MWAA) environment and schedule a DAG that issues a Redshift COPY command for the file in S3.

  • Launch a transient Amazon EMR cluster each night that runs a Spark job to copy the file from S3 to Redshift, then terminates the cluster.

  • Configure an Amazon Kinesis Data Firehose delivery stream with a Lambda transformation to send data to Redshift and enable the stream at 02:00 UTC using the AWS CLI.
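
One fully managed way to run a Glue job at 02:00 UTC each day is a Glue scheduled trigger, sketched below with boto3; the trigger and job names are placeholders. The EventBridge approach in the first option keeps the same cron schedule outside of Glue.

import boto3

glue = boto3.client("glue")

# Serverless nightly schedule: a Glue scheduled trigger that starts the ETL job.
glue.create_trigger(
    Name="nightly-redshift-load",              # placeholder
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",              # 02:00 UTC daily
    Actions=[{"JobName": "csv-to-redshift-staging"}],   # placeholder job name
    StartOnCreation=True,
)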

Question 14 of 20

A data engineering team must allow an AWS Glue job running in account A to write objects to an Amazon S3 bucket that belongs to account B. The solution must prevent storage of long-lived credentials inside the job code and must operate without human interaction. Which authentication method should the team use?

  • Create an IAM user in account B, store its access keys in AWS Secrets Manager, and retrieve them from the job at runtime.

  • Configure an IAM role in account B and allow the AWS Glue job to assume that role by using AWS STS.

  • Upload an X.509 client certificate so the Glue job can use mutual TLS authentication with Amazon S3.

  • Generate a pre-signed S3 URL and embed it in the Glue job parameters before each run.
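
For reference, a minimal sketch of the STS pattern: code running in account A assumes a role that account B exposes and writes with the temporary credentials. The role ARN, bucket, and key are placeholders.

import boto3

# Assume the role that account B exposes for writes to its bucket.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::222222222222:role/partner-s3-writer",   # placeholder
    RoleSessionName="glue-cross-account-write",
)["Credentials"]

# Use the short-lived credentials; nothing long-lived is stored in the job.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
s3.put_object(
    Bucket="account-b-destination-bucket",     # placeholder
    Key="exports/output.parquet",              # placeholder
    Body=b"...",                               # placeholder payload
)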

Question 15 of 20

A data engineer is configuring a Spark job on an existing Amazon EMR cluster that periodically connects to an Amazon Redshift database. The job must retrieve the database user name and password at runtime. Security mandates that the credentials are encrypted at rest, automatically rotated every 30 days, and accessed through IAM roles without code changes. Which solution meets these requirements?

  • Store credentials as SecureString parameters in AWS Systems Manager Parameter Store encrypted with a customer managed KMS key. Grant the EMR instance profile role permission to read the parameters.

  • Embed the credentials in the cluster bootstrap action script and restrict script access with an EMR security configuration; create an IAM role that allows reading the script.

  • Store credentials in AWS Secrets Manager, enable built-in rotation with an AWS Lambda function scheduled every 30 days, and allow the EMR instance profile role to read the secret.

  • Place a JSON file containing the credentials in an Amazon S3 bucket encrypted with SSE-KMS and rotate the object every 30 days using a CloudWatch Events rule and Lambda.

Question 16 of 20

An Amazon Athena table stores clickstream events as Parquet files in an S3 location partitioned by year, month, and day. A nightly ETL job currently runs the following query and is incurring high scan costs:

SELECT user_id, page, event_time FROM clickstream WHERE event_time BETWEEN date '2023-07-01' AND date '2023-07-31';

How should you rewrite the SQL to scan the least amount of data without changing the table definition?

  • Append a LIMIT clause so the statement becomes:

    SELECT user_id, page, event_time FROM clickstream WHERE event_time BETWEEN date '2023-07-01' AND date '2023-07-31' LIMIT 100000;

  • Add a filter on the partition columns, for example:

    SELECT user_id, page, event_time FROM clickstream WHERE year = 2023 AND month = 7 AND day BETWEEN 1 AND 31;

  • Include an ORDER BY year, month, day clause to ensure the data is read in partition order.

  • Create a common table expression (CTE) that selects all columns and then filter the CTE on event_time within the main query.

Question 17 of 20

A data engineer loads transformed sales totals into Amazon Redshift Serverless each night. An external partner needs to query the current day's total over the internet through a low-latency HTTPS endpoint. The partner cannot obtain AWS credentials but can pass an API key for authentication. The solution must remain fully serverless and require the least operational overhead. Which approach satisfies these requirements?

  • Write the daily total to a JSON file in an Amazon S3 bucket and share a presigned URL with the partner.

  • Expose the Amazon Redshift Data API endpoint to the partner and store database credentials in AWS Secrets Manager.

  • Deploy a microservice on Amazon ECS Fargate behind an Application Load Balancer that connects to Amazon Redshift with JDBC and returns results.

  • Create a REST API in Amazon API Gateway that requires an API key and invokes an AWS Lambda function, which queries Amazon Redshift through the Redshift Data API and returns JSON.
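
A minimal sketch of the Lambda function that would sit behind such an API Gateway method, using the Redshift Data API against a Redshift Serverless workgroup. The workgroup, database, and SQL are placeholders, and the simple polling loop assumes a small, fast query.

import json
import time
import boto3

client = boto3.client("redshift-data")

def lambda_handler(event, context):
    # Run the query against the serverless workgroup; no JDBC driver or
    # persistent connection is needed inside the function.
    run = client.execute_statement(
        WorkgroupName="analytics-wg",           # placeholder workgroup
        Database="sales",                       # placeholder database
        Sql="SELECT sale_date, total FROM daily_totals WHERE sale_date = CURRENT_DATE",
    )

    # Poll until the statement finishes.
    while True:
        desc = client.describe_statement(Id=run["Id"])
        if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(0.25)

    rows = client.get_statement_result(Id=run["Id"])["Records"]
    return {"statusCode": 200, "body": json.dumps(rows, default=str)}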

Question 18 of 20

The analytics team stores PII in an Amazon S3 data lake in us-east-2 and protects it with AWS Backup. Company policy mandates that no backups or object replicas may ever leave us-east-2. You need an organization-wide control that prevents any engineer from configuring cross-Region replication or AWS Backup copy jobs to other Regions while still allowing normal operations in us-east-2. Which approach meets the requirement with minimal ongoing maintenance?

  • Encrypt all recovery points with a customer-managed AWS KMS key that exists solely in us-east-2 and rotate the key quarterly.

  • Enable Amazon S3 Same-Region Replication on every bucket and remove all cross-Region copy rules from existing AWS Backup plans.

  • Attach an AWS Organizations SCP that denies s3:PutBucketReplication, s3:CreateBucket, and backup:StartCopyJob whenever aws:RequestedRegion or s3:LocationConstraint is not "us-east-2", and apply the policy to the OU that contains all data accounts.

  • Create VPC interface endpoints for Amazon S3 and AWS Backup only in us-east-2 and delete the endpoints in all other AWS Regions.
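
For illustration, an SCP along the lines of the option above, expressed as a Python dictionary and created through the Organizations API. The policy name, OU ID, and exact action and condition keys are placeholders that would need review against the real requirement.

import json
import boto3

# Deny replication setup and Backup copy jobs whenever the request targets a
# Region other than us-east-2; operations inside us-east-2 are unaffected.
scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyCrossRegionCopies",
        "Effect": "Deny",
        "Action": ["s3:PutBucketReplication", "s3:CreateBucket", "backup:StartCopyJob"],
        "Resource": "*",
        "Condition": {"StringNotEquals": {"aws:RequestedRegion": "us-east-2"}},
    }],
}

org = boto3.client("organizations")
policy = org.create_policy(
    Name="restrict-to-us-east-2",                      # placeholder
    Description="Keep backups and replicas in us-east-2",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)

# Attach the policy to the OU that contains the data accounts (placeholder OU ID).
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-xxxx-xxxxxxxx",
)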

Question 19 of 20

After launching a new mobile game, a company ingests 20,000 player-event records per second through Amazon Kinesis Data Streams. An in-game personalization microservice must retrieve the most recent statistics for an individual player in less than 10 ms. Events older than 24 hours will be queried ad hoc in Amazon Athena. Which data-storage approach best meets these requirements while minimizing cost?

  • Store each event in an Amazon DynamoDB table keyed by playerId with a 24-hour TTL; process the DynamoDB stream with AWS Lambda to batch write expired and changed items to Amazon S3 for Athena.

  • Use Amazon Kinesis Data Analytics to aggregate events and load them into an Amazon Redshift cluster; have the microservice query Redshift for personalization and analysts run reports on the same cluster.

  • Publish events to an Amazon MSK topic; have the microservice read the topic for player statistics and use MSK Connect to continuously sink the stream to Amazon S3 for Athena.

  • Configure Amazon Kinesis Data Firehose to deliver events directly to Amazon S3 in Parquet format and have both the microservice and analysts query the data with Amazon Athena.

Question 20 of 20

A company ingests 50,000 IoT sensor readings per second. Each record is less than 1 KB of JSON. Data must be available for dashboards that query individual device readings with single-digit millisecond latency. Records are retained for 30 days, after which they should be automatically removed without administrator intervention. Which AWS storage service best meets these requirements while minimizing operational overhead?

  • Amazon Redshift cluster using automatic table vacuum and retention policies

  • Amazon Aurora MySQL with read replica autoscaling

  • Amazon S3 bucket storing gzip-compressed JSON objects

  • Amazon DynamoDB with TTL enabled on the ingestion timestamp
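
As a brief illustration, enabling TTL on a DynamoDB table and writing an item that expires 30 days after ingestion; the table and attribute names are placeholders.

import time
import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "sensor-readings"   # placeholder table name

# Tell DynamoDB which numeric attribute holds the Unix-epoch expiry time.
dynamodb.update_time_to_live(
    TableName=TABLE,
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expireAt"},
)

# Each reading carries an expireAt value 30 days out; DynamoDB removes the
# item automatically after that time with no administrator action.
now = int(time.time())
dynamodb.put_item(
    TableName=TABLE,
    Item={
        "deviceId": {"S": "sensor-0001"},
        "readingTime": {"N": str(now)},
        "payload": {"S": "{\"temp\": 21.7}"},
        "expireAt": {"N": str(now + 30 * 24 * 3600)},
    },
)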