
AWS Certified Data Engineer Associate Practice Test (DEA-C01)

Use the form below to configure your AWS Certified Data Engineer Associate Practice Test (DEA-C01). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

Questions
Number of questions in the practice test
Free users are limited to 20 questions; upgrade for unlimited questions
Seconds Per Question
Determines how long you have to finish the practice test
Exam Objectives
Which exam objectives should be included in the practice test

AWS Certified Data Engineer Associate DEA-C01 Information

The AWS Certified Data Engineer – Associate certification validates your ability to design, build, and manage data pipelines on the AWS Cloud. It’s designed for professionals who transform raw data into actionable insights using AWS analytics and storage services. This certification proves you can work with modern data architectures that handle both batch and streaming data, using tools like Amazon S3, Glue, Redshift, EMR, Kinesis, and Athena to deliver scalable and efficient data solutions.

The exam covers the full data lifecycle — from ingestion and transformation to storage, analysis, and optimization. Candidates are tested on their understanding of how to choose the right AWS services for specific use cases, design secure and cost-effective pipelines, and ensure data reliability and governance. You’ll need hands-on knowledge of how to build ETL workflows, process large datasets efficiently, and use automation to manage data infrastructure in production environments.

Earning this certification demonstrates to employers that you have the technical expertise to turn data into value on AWS. It’s ideal for data engineers, analysts, and developers who work with cloud-based data systems and want to validate their skills in one of the most in-demand areas of cloud computing today. Whether you’re building data lakes, streaming pipelines, or analytics solutions, this certification confirms you can do it the AWS way — efficiently, securely, and at scale.

  • Free AWS Certified Data Engineer Associate DEA-C01 Practice Test
  • 20 Questions
  • Unlimited
  • Exam domains covered:
    • Data Ingestion and Transformation
    • Data Store Management
    • Data Operations and Support
    • Data Security and Governance

Free Preview

This test is a free preview, no account required.

Question 1 of 20

Your ecommerce company stores daily order data as Parquet files in Amazon S3 under the prefix s3://sales-data/orders/year=YYYY/month=MM/day=DD/. A Lambda function, triggered every 15 minutes by Amazon EventBridge, submits Amazon Athena queries that must include the most recent files as soon as they arrive. The team wants to minimize query latency and eliminate the operational cost of running AWS Glue crawlers or the MSCK REPAIR TABLE command after each file delivery. Which approach best meets these requirements?

  • Modify the Lambda function to run the statement MSCK REPAIR TABLE orders before every query submission to refresh partition metadata.

  • Enable partition projection on the Athena table and specify the year, month, and day ranges; keep the partition columns in the WHERE clause of each query.

  • Create a new unpartitioned table with a CREATE TABLE AS SELECT (CTAS) statement and query the consolidated data instead of the partitioned source.

  • Schedule an AWS Glue crawler to run every 15 minutes so that new partitions are added to the Data Catalog before each query executes.
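
For reference, partition projection removes the need for crawlers or MSCK REPAIR TABLE entirely: Athena computes partition locations from table properties at query time. Below is a minimal sketch of such a table definition submitted through boto3; the column schema, projection ranges, and results location are assumptions, not part of the question.

    import boto3

    # Hypothetical DDL for the orders table with partition projection enabled
    # on the year/month/day partition columns.
    DDL = """
    CREATE EXTERNAL TABLE IF NOT EXISTS sales.orders (
        order_id string,
        order_total double
    )
    PARTITIONED BY (year string, month string, day string)
    STORED AS PARQUET
    LOCATION 's3://sales-data/orders/'
    TBLPROPERTIES (
        'projection.enabled'        = 'true',
        'projection.year.type'      = 'integer',
        'projection.year.range'     = '2020,2030',
        'projection.month.type'     = 'integer',
        'projection.month.range'    = '1,12',
        'projection.month.digits'   = '2',
        'projection.day.type'       = 'integer',
        'projection.day.range'      = '1,31',
        'projection.day.digits'     = '2',
        'storage.location.template' = 's3://sales-data/orders/year=${year}/month=${month}/day=${day}/'
    )
    """

    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString=DDL,
        ResultConfiguration={"OutputLocation": "s3://sales-data/athena-results/"},  # hypothetical results bucket
    )

As long as each query keeps year, month, and day in its WHERE clause, newly delivered files are visible as soon as they land in S3.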

Question 2 of 20

An e-commerce company runs a MySQL 8.0 database on a Single-AZ db.m5.large Amazon RDS instance. The workload peaks at about 300 writes/sec and 3,000 read queries/sec during sales. Management wants to improve read performance and availability while controlling cost and making as few application changes as possible. Which solution meets these requirements?

  • Create two Amazon RDS MySQL read replicas in different Availability Zones and route read queries to the replicas.

  • Migrate the database to Amazon Aurora MySQL Serverless v2 and use two Aurora Replicas.

  • Move frequently read tables to Amazon ElastiCache for Redis and switch the database storage to gp3 volumes.

  • Enable a Multi-AZ deployment and upgrade the primary instance to db.m6i.2xlarge.

Question 3 of 20

A gaming company captures real-time session events from Amazon Kinesis Data Streams. The backend must persist each player's most recent 24-hour session data, handle unpredictable spikes to millions of writes per second, and return player records in single-digit milliseconds by primary key. Operations wants a fully managed, auto-scaling or serverless solution with built-in TTL so stale data is deleted automatically. Which AWS data store best meets these requirements?

  • Amazon S3 bucket storing JSON objects queried through Amazon Athena and S3 Lifecycle rules

  • Amazon Redshift streaming ingestion into an RA3 cluster with automatic table sort keys

  • Amazon DynamoDB table with on-demand capacity and TTL enabled

  • Amazon Aurora MySQL Serverless v2 cluster with auto-scaling read/write endpoints
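
For reference, both requirements called out in the DynamoDB option are plain table settings: on-demand (pay-per-request) capacity and a TTL attribute. A minimal boto3 sketch; the table and attribute names are hypothetical.

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Table keyed by player ID, billed per request so it absorbs write spikes
    # without capacity planning.
    dynamodb.create_table(
        TableName="PlayerSessions",
        AttributeDefinitions=[{"AttributeName": "player_id", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "player_id", "KeyType": "HASH"}],
        BillingMode="PAY_PER_REQUEST",
    )
    dynamodb.get_waiter("table_exists").wait(TableName="PlayerSessions")

    # TTL: items whose "expires_at" epoch-seconds value is in the past are
    # removed automatically, so 24-hour session data ages out on its own.
    dynamodb.update_time_to_live(
        TableName="PlayerSessions",
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
    )

Writers would set expires_at to the event timestamp plus 24 hours when persisting each record.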

Question 4 of 20

An analytics team receives hourly CSV files from external vendors. When a file lands in an S3 bucket, it must be validated, transformed with AWS Glue, and loaded into Amazon Redshift. The solution must be serverless, event-driven, include retry logic, and minimize operational overhead. Which architecture best meets these requirements?

  • Create a CloudWatch Events scheduled rule that runs every 5 minutes and invokes a Lambda function. The function lists recently added objects, kicks off an AWS Batch job to transform the data, and then loads the results into Redshift.

  • Deploy Apache Airflow on an EC2 Auto Scaling group and build a DAG that polls the S3 bucket every minute, then starts a Glue job and a Redshift COPY task.

  • Set up Kinesis Data Firehose with the S3 bucket as the data source, enable transformation with a Lambda function, and configure the delivery stream to load directly into Amazon Redshift.

  • Configure an S3 Event Notification to deliver ObjectCreated events to EventBridge, which triggers a Step Functions state machine. The state machine runs a Glue job for transformation, then uses the Redshift Data API to issue a COPY command. Step Functions' built-in retries handle transient failures.
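
To illustrate the Step Functions option, both the Glue job and the Redshift COPY can be called through service integrations, with retries declared per state instead of in custom code. One possible Amazon States Language definition, registered via boto3, is sketched below; the job, cluster, secret, and bucket names are hypothetical.

    import json
    import boto3

    definition = {
        "StartAt": "TransformWithGlue",
        "States": {
            "TransformWithGlue": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",   # waits for the Glue job to finish
                "Parameters": {"JobName": "validate-and-transform"},     # hypothetical Glue job
                "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 60,
                           "MaxAttempts": 3, "BackoffRate": 2.0}],
                "Next": "CopyIntoRedshift",
            },
            "CopyIntoRedshift": {
                "Type": "Task",
                # SDK integration: submits the COPY and returns a statement ID
                # (a describeStatement poll can be added if the workflow must wait).
                "Resource": "arn:aws:states:::aws-sdk:redshiftdata:executeStatement",
                "Parameters": {
                    "ClusterIdentifier": "analytics-cluster",
                    "Database": "analytics",
                    "SecretArn": "arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",
                    "Sql": "COPY staging.vendor_feed FROM 's3://vendor-bucket/processed/' IAM_ROLE default FORMAT AS CSV IGNOREHEADER 1",
                },
                "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 30,
                           "MaxAttempts": 3, "BackoffRate": 2.0}],
                "End": True,
            },
        },
    }

    sfn = boto3.client("stepfunctions")
    sfn.create_state_machine(
        name="vendor-file-load",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/StatesExecutionRole",  # hypothetical execution role
    )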

Question 5 of 20

A data engineering team runs a persistent Amazon EMR cluster that stores intermediate data in HDFS. Each night, about 50 TB of gzip log files arrive in an Amazon S3 bucket and must be copied into HDFS before downstream MapReduce jobs start. The transfer must maximize throughput, minimize S3 request costs, and run by using only the existing EMR cluster resources. Which solution meets these requirements?

  • Mount the S3 bucket on every core node with s3fs and move the objects to HDFS with the Linux cp command.

  • Use AWS DataSync to transfer the objects to volumes on each core node, then import the data into HDFS.

  • Add an EMR step that uses S3DistCp to copy the objects from Amazon S3 to HDFS in parallel.

  • From the master node, run the AWS CLI command "aws s3 cp --recursive" to copy the objects into HDFS.
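
For reference, S3DistCp ships with EMR and runs as a cluster step, so the copy uses the cluster's own core nodes in parallel. A boto3 sketch that adds the step; the cluster ID, prefixes, and grouping pattern are hypothetical.

    import boto3

    emr = boto3.client("emr")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",          # hypothetical EMR cluster ID
        Steps=[{
            "Name": "Copy nightly gzip logs from S3 into HDFS",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",   # standard way to run s3-dist-cp on EMR 4.x and later
                "Args": [
                    "s3-dist-cp",
                    "--src", "s3://nightly-log-bucket/2024/01/15/",
                    "--dest", "hdfs:///data/logs/2024-01-15/",
                    "--groupBy", ".*(\\d{4}/\\d{2}/\\d{2}).*\\.gz",  # optionally combine many small files
                    "--targetSize", "1024",                          # aggregated file target size in MiB
                ],
            },
        }],
    )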

Question 6 of 20

A retailer stores clickstream data as Parquet files in Amazon S3. Analysts query the data with Amazon Athena several times a day, and weekly batch jobs update or delete late-arriving records. The company uses AWS Lake Formation and must enforce row-level security while supporting ACID transactions with the least administration. Which approach meets these requirements?

  • Load the data into an Amazon Redshift cluster and share secure views through Lake Formation for row-level access.

  • Convert the dataset to a Lake Formation governed table and use LF tag-based policies to grant analysts SELECT access with row filters.

  • Enable object-level ACLs on the S3 bucket and restrict rows by forcing analysts to use Athena views containing WHERE clauses.

  • Create an external table in the AWS Glue Data Catalog and control access only with S3 bucket policies and Athena workgroup-level data filters.

Question 7 of 20

An ecommerce company uses an Amazon Redshift RA3 cluster. A BI query joins two 200-GB Redshift tables with an Aurora PostgreSQL orders table through a federated query. Grafana runs the query every minute, causing 10-second latency and high Aurora CPU. Data may be 5 minutes old, and the team wants the lowest ongoing cost. What should the data engineer do?

  • Create a materialized view that joins the Redshift and federated tables, and schedule REFRESH MATERIALIZED VIEW every 5 minutes with Amazon EventBridge. Point the dashboard to the materialized view.

  • Unload the two Redshift tables to Amazon S3, create external tables, and use Redshift Spectrum to join them with the federated orders table.

  • Use AWS DMS and COPY to load the orders table into Redshift every 5 minutes, then keep the dashboard query unchanged.

  • Replace the query with a standard Redshift view and rely on the query result cache for most dashboard requests.
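
For reference, the materialized-view option moves the expensive federated join off the dashboard's critical path: Redshift computes the join once every few minutes and Grafana reads the stored result. A sketch of the SQL, issued through the Redshift Data API from a Lambda function that an EventBridge schedule invokes every 5 minutes; all object, cluster, and secret names are hypothetical.

    import boto3

    redshift_data = boto3.client("redshift-data")

    # One-time setup (run once): precompute the join of the two local Redshift
    # tables with the federated Aurora PostgreSQL orders table.
    CREATE_MV = """
    CREATE MATERIALIZED VIEW bi.orders_dashboard_mv AS
    SELECT o.order_id, o.order_total, c.segment, s.ship_date
    FROM federated_pg.orders o          -- external schema over Aurora PostgreSQL
    JOIN sales.customers c ON c.customer_id = o.customer_id
    JOIN sales.shipments s ON s.order_id = o.order_id
    """

    def handler(event, context):
        """Scheduled refresh; the dashboard query simply selects from the view."""
        redshift_data.execute_statement(
            ClusterIdentifier="analytics-cluster",
            Database="analytics",
            SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",
            Sql="REFRESH MATERIALIZED VIEW bi.orders_dashboard_mv",
        )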

Question 8 of 20

An AWS Glue crawler registers daily Parquet files stored under the Amazon S3 prefix s3://datalake/iot/year=YYYY/month=MM/day=DD/. Business analysts query the table from Amazon Athena, but the current day's data is not visible until the crawler's nightly run. As a data engineer, how can you expose new partitions to Athena within minutes of arrival while keeping operational effort low?

  • Replace the crawler with Athena partition projection and define formulas that generate the year, month, and day partitions.

  • Trigger an AWS Step Functions workflow from CloudWatch Events that calls ALTER TABLE ADD PARTITION for each new file detected.

  • Change the crawler to run every five minutes on a fixed schedule.

  • Enable Amazon S3 event notifications to invoke the crawler in incremental mode whenever new objects are created.

Question 9 of 20

A retail company captures clickstream events in an Amazon Kinesis Data Stream. Business analysts need the events to be query-able in Amazon Redshift within one minute of being produced. The data engineering team wants the simplest solution that avoids intermediate storage and minimizes ongoing maintenance. Which approach best meets these requirements?

  • Trigger an AWS Lambda function from the Kinesis Data Stream to batch records and insert them into Redshift via the Data API.

  • Create a materialized view in Amazon Redshift that performs streaming ingestion from the Kinesis Data Stream and enables AUTO REFRESH.

  • Configure an Amazon Kinesis Data Firehose delivery stream to load the data into Amazon Redshift on a 1-minute buffer interval.

  • Build an AWS Glue streaming ETL job that reads from the Kinesis Data Stream and writes the records to Redshift through a JDBC connection.
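
For reference, Redshift streaming ingestion needs no intermediate storage at all: an external schema points at the Kinesis data stream and a materialized view with AUTO REFRESH pulls records straight from it. A hedged sketch submitted through the Data API; the stream, schema, role, cluster, and secret names are hypothetical.

    import time
    import boto3

    redshift_data = boto3.client("redshift-data")

    CREATE_STREAM_SCHEMA = """
    CREATE EXTERNAL SCHEMA kds
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftStreamingRole'
    """

    CREATE_STREAMING_MV = """
    CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           JSON_PARSE(kinesis_data) AS event_payload
    FROM kds."clickstream-events"
    """

    def run(sql):
        """Submit one statement and wait for it to complete."""
        stmt = redshift_data.execute_statement(
            ClusterIdentifier="analytics-cluster",
            Database="analytics",
            SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",
            Sql=sql,
        )
        while redshift_data.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
            time.sleep(1)

    run(CREATE_STREAM_SCHEMA)   # one-time: map the external schema to the stream
    run(CREATE_STREAMING_MV)    # Redshift then keeps the view refreshed automatically

Analysts query clickstream_mv directly, and records are typically queryable within seconds of arriving on the stream.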

Question 10 of 20

Your organization uses AWS Lake Formation to govern a raw data lake in Amazon S3. You registered the s3://finance-raw bucket and cataloged the transactions table in the finance database. Analysts already have Lake Formation SELECT on the table, yet Athena returns "Access Denied - insufficient Lake Formation permissions." Which additional Lake Formation permission will resolve the error without granting broader S3 or IAM access?

  • Grant Lake Formation DESCRIBE permission on the default database.

  • Give the IAM role Lake Formation ALTER permission on the transactions table.

  • Attach an IAM policy that allows s3:GetObject on the finance-raw bucket.

  • Grant Lake Formation DATA_LOCATION_ACCESS on the s3://finance-raw location.

Question 11 of 20

A company's Amazon Redshift RA3 cluster hosts a 5-TB fact table that receives new rows each night. Business analysts issue the same complex aggregation query every morning to populate dashboards, but the query still takes about 40 minutes even after regular VACUUM and ANALYZE operations. As the data engineer, you must cut the runtime dramatically, keep administration effort low, and avoid a large cost increase. Which approach will best meet these requirements?

  • Increase the WLM queue's slot count and enable short query acceleration to allocate more memory to the query.

  • Enable Amazon Redshift Concurrency Scaling so the query can execute on additional transient clusters.

  • Create a materialized view that pre-aggregates the required data, schedule an automatic REFRESH after the nightly load, and direct the dashboard to query the materialized view.

  • Change the fact table's distribution style to ALL so every node stores a full copy, eliminating data shuffling during joins.

Question 12 of 20

A CloudFormation template will deploy an AWS Glue job that runs in a private subnet. The job only needs to read objects from the S3 bucket named analytics-data. Security insists that the template 1) follow the principle of least privilege and 2) keep the IAM role definition concise by avoiding a long inline policy block within the role. Which CloudFormation approach best meets these requirements?

  • Define an AWS::IAM::Role and attach the AWS-managed policy AmazonS3ReadOnlyAccess in the ManagedPolicyArns property.

  • Attach an AWS::IAM::InstanceProfile to the Glue job so it inherits the default EC2 instance role.

  • Create an AWS::IAM::ManagedPolicy resource granting s3:GetObject on arn:aws:s3:::analytics-data/* and reference it in the role's ManagedPolicyArns property.

  • Add an AWS::IAM::Policy resource (an inline policy) that grants s3:GetObject on the bucket and attach it to the role.
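
For illustration, the managed-policy pattern keeps the role definition to a few lines while staying least-privilege. To stay in one language across these sketches, the same resources are expressed with the AWS CDK in Python rather than raw CloudFormation YAML; it synthesizes to an AWS::IAM::ManagedPolicy plus an AWS::IAM::Role that lists the policy in ManagedPolicyArns. Construct IDs are hypothetical.

    from aws_cdk import App, Stack
    from aws_cdk import aws_iam as iam
    from constructs import Construct

    class GlueJobRoleStack(Stack):
        def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
            super().__init__(scope, construct_id, **kwargs)

            # Customer managed policy scoped to read-only access on one bucket's objects.
            read_policy = iam.ManagedPolicy(
                self, "AnalyticsDataReadOnly",
                statements=[iam.PolicyStatement(
                    actions=["s3:GetObject"],
                    resources=["arn:aws:s3:::analytics-data/*"],
                )],
            )

            # Role assumed by AWS Glue; the policy ARN is referenced, not inlined.
            iam.Role(
                self, "GlueJobRole",
                assumed_by=iam.ServicePrincipal("glue.amazonaws.com"),
                managed_policies=[read_policy],
            )

    app = App()
    GlueJobRoleStack(app, "GlueJobRoleStack")
    app.synth()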

Question 13 of 20

A company ingests 50,000 JSON events per second from IoT sensors into an Amazon Kinesis Data Stream. The analytics team needs each record converted to Apache Parquet with sub-second latency and written to Amazon S3. The solution must scale automatically with the unpredictable event rate and require minimal infrastructure management. Which approach meets these requirements most effectively?

  • Create an AWS Glue streaming ETL job that reads from the Kinesis Data Stream and writes Parquet files to Amazon S3.

  • Use AWS Lambda with Kinesis Data Streams as the event source; each invocation converts the JSON record to Parquet and writes it to Amazon S3.

  • Configure an Amazon EMR cluster with Spark Structured Streaming to poll the stream and convert data to Parquet in Amazon S3.

  • Deliver the stream to Amazon S3 through Kinesis Data Firehose with a Lambda transformation that converts incoming records to Parquet format.

Question 14 of 20

A retail company receives a 10-GB CSV file in an Amazon S3 bucket every night. The file must be loaded into Amazon Redshift as soon as it arrives. The solution must be fully managed, cost-effective, and must avoid re-loading the same file if the job is restarted after a failure. Which approach meets these requirements?

  • Configure AWS DataSync to move the file into an Amazon Redshift Spectrum external table and run an INSERT statement into the target table.

  • Create an Amazon EventBridge rule for the s3:ObjectCreated event to start an AWS Glue job that copies the file into Amazon Redshift, and enable AWS Glue job bookmarks.

  • Schedule an Amazon EMR cluster to start nightly, run a Spark script that uses the COPY command to load the file into Amazon Redshift, and terminate the cluster afterward.

  • Use Amazon Kinesis Data Analytics with an S3 source and a Redshift destination to stream the file contents into Amazon Redshift.
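
For reference, the re-load protection in the Glue option comes from job bookmarks, which track which S3 objects a job has already processed. A boto3 sketch of creating such a job with bookmarks enabled; every name and ARN is hypothetical.

    import boto3

    glue = boto3.client("glue")

    glue.create_job(
        Name="nightly-orders-load",
        Role="arn:aws:iam::123456789012:role/GlueJobRole",
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://etl-scripts/load_orders_to_redshift.py",
        },
        # Bookmarks record processed input; a restarted run skips the same file.
        DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
        GlueVersion="4.0",
    )

An EventBridge rule on the S3 object-created event can then start this job, for example through an EventBridge-triggered Glue workflow.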

Question 15 of 20

A company stores daily .csv transaction files in an Amazon S3 bucket. A data engineer must ensure that every new object triggers a processing Lambda function exactly once, in the same order that the files arrive, and that failed invocations are automatically retried without manual intervention. Which approach meets these requirements with the least operational overhead?

  • Send S3 event notifications directly to the Lambda function and restrict its reserved concurrency to 1 to enforce sequential execution.

  • Create an Amazon EventBridge rule for s3:ObjectCreated:Put events and set the Lambda function as the rule's only target.

  • Configure an S3 event notification with a suffix filter of .csv that publishes to an Amazon SQS FIFO queue, then set the Lambda function to poll the queue with a batch size of 1.

  • Enable S3 replication to a second bucket and create a Step Functions state machine that the replication process invokes for each replicated object.

Question 16 of 20

An ETL pipeline is orchestrated by Amazon EventBridge: a rule starts an AWS Glue job whenever new objects land in an S3 bucket. The data engineering team must alert on-call staff immediately when the Glue job finishes with either SUCCEEDED or FAILED status. Notifications must support email and SMS without introducing custom code. Which solution meets these requirements with minimal operational effort?

  • Wrap the Glue job in an AWS Step Functions state machine and use a Catch block that calls a webhook to a chat application when the task fails or succeeds.

  • Configure an Amazon CloudWatch alarm on the job's DPU consumed metric and set the alarm action to push messages to an SQS queue, then invoke a Lambda function to forward notifications.

  • Create a second EventBridge rule that matches Glue Job State Change events with states SUCCEEDED or FAILED and sends them to an Amazon SNS topic that has email and SMS subscriptions.

  • Add code at the end of the Glue script to use Amazon Simple Email Service (Amazon SES) to send an email when the job completes.
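
For reference, Glue publishes "Glue Job State Change" events to the default EventBridge bus, so the whole solution is an event pattern plus an SNS target with email and SMS subscriptions. A boto3 sketch; the rule, job, and topic names are hypothetical.

    import json
    import boto3

    events = boto3.client("events")

    # Match only the terminal states of the ETL job.
    events.put_rule(
        Name="glue-job-terminal-states",
        EventPattern=json.dumps({
            "source": ["aws.glue"],
            "detail-type": ["Glue Job State Change"],
            "detail": {
                "jobName": ["nightly-etl-job"],
                "state": ["SUCCEEDED", "FAILED"],
            },
        }),
    )

    # Fan the matched events out to an SNS topic with email and SMS subscribers.
    events.put_targets(
        Rule="glue-job-terminal-states",
        Targets=[{
            "Id": "notify-on-call",
            "Arn": "arn:aws:sns:us-east-1:123456789012:oncall-alerts",
        }],
    )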

Question 17 of 20

Your data engineering team stores daily AWS Glue Apache Spark job logs as compressed JSON files in an Amazon S3 bucket. Analysts must run ad-hoc SQL to find long-running stages and join the result with an existing reference dataset that also resides in S3. The solution must become queryable within minutes of log delivery, require no servers to manage, and minimize operational effort. Which solution best meets these requirements?

  • Stream the log files from S3 into Amazon CloudWatch Logs and analyze them with CloudWatch Logs Insights queries.

  • Launch an on-demand Amazon EMR cluster with Trino, mount the S3 buckets, and submit SQL queries through the Trino coordinator.

  • Run an AWS Glue crawler on the log prefix to update the Data Catalog and query both log and reference tables in Amazon Athena.

  • Deliver the logs to Amazon OpenSearch Service with Amazon Kinesis Data Firehose and query them alongside the reference data using OpenSearch Dashboards.

Question 18 of 20

Your team receives unpredictable batches of CSV transaction files in a dedicated Amazon S3 prefix. Every file must be ingested into an Amazon Redshift staging table within five minutes of arrival. The solution must follow an event-driven batch pattern, avoid idle infrastructure, and scale automatically with the daily file count. Which approach meets these requirements while keeping operational overhead low?

  • Send the files to an Amazon Kinesis Data Firehose delivery stream configured to deliver records to Amazon Redshift.

  • Configure an Amazon S3 event notification that routes through EventBridge to trigger an AWS Glue job, and have the job run a Redshift COPY command for the new object.

  • Set up an AWS Database Migration Service task with S3 as the source endpoint and Redshift as the target to perform full load and change data capture.

  • Create an AWS Glue job with a 5-minute cron schedule that recursively scans the prefix and loads any discovered files into Redshift.

Question 19 of 20

A data engineer loads transformed sales totals into Amazon Redshift Serverless each night. An external partner needs to query the current day's total over the internet through a low-latency HTTPS endpoint. The partner cannot obtain AWS credentials but can pass an API key for authentication. The solution must remain fully serverless and require the least operational overhead. Which approach satisfies these requirements?

  • Write the daily total to a JSON file in an Amazon S3 bucket and share a presigned URL with the partner.

  • Expose the Amazon Redshift Data API endpoint to the partner and store database credentials in AWS Secrets Manager.

  • Deploy a microservice on Amazon ECS Fargate behind an Application Load Balancer that connects to Amazon Redshift with JDBC and returns results.

  • Create a REST API in Amazon API Gateway that requires an API key and invokes an AWS Lambda function, which queries Amazon Redshift through the Redshift Data API and returns JSON.

Question 20 of 20

An Amazon EMR cluster is running an Apache Spark SQL job that joins a 500 GB click-stream DataFrame with a 100 MB reference DataFrame. Shuffle stages dominate the runtime and the team cannot resize the cluster or rewrite the input data. Which Spark-level change will most effectively reduce shuffle traffic and speed up the join?

  • Apply a broadcast join hint to the 100 MB reference DataFrame so each executor receives a local copy.

  • Increase the value of spark.sql.shuffle.partitions to create more shuffle tasks.

  • Persist both DataFrames in memory before executing the join.

  • Enable speculative execution by setting spark.speculation to true.
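
For reference, the broadcast hint turns the shuffle join into a map-side join: the 100 MB reference DataFrame is shipped once to each executor and the 500 GB DataFrame never moves across the network. A PySpark sketch; the paths and join key are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("clickstream-enrichment").getOrCreate()

    clicks = spark.read.parquet("s3://datalake/clickstream/")      # ~500 GB fact data
    reference = spark.read.parquet("s3://datalake/reference/")     # ~100 MB lookup data

    # broadcast() marks the small side for replication to every executor, so the
    # join runs locally on each partition of the large DataFrame with no shuffle.
    joined = clicks.join(broadcast(reference), on="item_id", how="left")

    joined.write.mode("overwrite").parquet("s3://datalake/output/clickstream_enriched/")

The same effect is available in Spark SQL with a /*+ BROADCAST(reference) */ hint, and Spark broadcasts automatically when the small table falls below spark.sql.autoBroadcastJoinThreshold.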