
AWS Certified Data Engineer Associate Practice Test (DEA-C01)

Use the form below to configure your AWS Certified Data Engineer Associate Practice Test (DEA-C01). The practice test can be configured to include only specific exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

  • Questions: the number of questions in the practice test. Free users are limited to 20 questions; upgrade for unlimited questions.
  • Seconds Per Question: determines how long you have to finish the practice test.
  • Exam Objectives: the exam objectives and domains to include in the practice test.

AWS Certified Data Engineer Associate DEA-C01 Information

The AWS Certified Data Engineer – Associate certification validates your ability to design, build, and manage data pipelines on the AWS Cloud. It’s designed for professionals who transform raw data into actionable insights using AWS analytics and storage services. This certification proves you can work with modern data architectures that handle both batch and streaming data, using tools like Amazon S3, Glue, Redshift, EMR, Kinesis, and Athena to deliver scalable and efficient data solutions.

The exam covers the full data lifecycle — from ingestion and transformation to storage, analysis, and optimization. Candidates are tested on their understanding of how to choose the right AWS services for specific use cases, design secure and cost-effective pipelines, and ensure data reliability and governance. You’ll need hands-on knowledge of how to build ETL workflows, process large datasets efficiently, and use automation to manage data infrastructure in production environments.

Earning this certification demonstrates to employers that you have the technical expertise to turn data into value on AWS. It’s ideal for data engineers, analysts, and developers who work with cloud-based data systems and want to validate their skills in one of the most in-demand areas of cloud computing today. Whether you’re building data lakes, streaming pipelines, or analytics solutions, this certification confirms you can do it the AWS way — efficiently, securely, and at scale.

  • Free AWS Certified Data Engineer Associate DEA-C01 Practice Test
  • 20 Questions
  • Unlimited time
  • Data Ingestion and Transformation
  • Data Store Management
  • Data Operations and Support
  • Data Security and Governance
Question 1 of 20

An Amazon Athena table stores clickstream events as Parquet files in an S3 location partitioned by year, month, and day. A nightly ETL job currently runs the following query and is incurring high scan costs:

SELECT user_id, page, event_time FROM clickstream WHERE event_time BETWEEN date '2023-07-01' AND date '2023-07-31';

How should you rewrite the SQL to scan the least amount of data without changing the table definition?

  • Append a LIMIT clause so the statement becomes:

    SELECT user_id, page, event_time FROM clickstream WHERE event_time BETWEEN date '2023-07-01' AND date '2023-07-31' LIMIT 100000;

  • Create a common table expression (CTE) that selects all columns and then filter the CTE on event_time within the main query.

  • Add a filter on the partition columns, for example:

    SELECT user_id, page, event_time FROM clickstream WHERE year = 2023 AND month = 7 AND day BETWEEN 1 AND 31;

  • Include an ORDER BY year, month, day clause to ensure the data is read in partition order.
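
For context, Athena only prunes S3 data when the WHERE clause references columns that the Glue Data Catalog lists as partition keys. A minimal boto3 sketch to check which columns those are (the database name is hypothetical; the question only names the clickstream table):

    import boto3

    glue = boto3.client("glue")

    # Hypothetical database name; the table definition itself is unchanged.
    table = glue.get_table(DatabaseName="analytics", Name="clickstream")["Table"]

    # Partition columns (e.g. year, month, day) are listed separately from data
    # columns; filters on them let Athena skip whole S3 prefixes.
    print([c["Name"] for c in table["PartitionKeys"]])
    print([c["Name"] for c in table["StorageDescriptor"]["Columns"]])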

Question 2 of 20

A retail company runs nightly AWS Glue ETL jobs that load data into an Amazon Redshift cluster. The job script currently hard-codes the database user name and password. Security now requires removing plaintext credentials, rotating the password automatically every 30 days, and making no changes to the ETL code. Which solution meets these requirements most securely?

  • Create an AWS Secrets Manager secret for the Redshift cluster, enable automatic rotation, update the existing AWS Glue connection to reference the secret's ARN, and add secretsmanager:GetSecretValue permission to the Glue job role.

  • Encrypt the user name and password with AWS KMS and place the ciphertext in environment variables of the Glue job; configure KMS key rotation every 30 days.

  • Store the database credentials as SecureString parameters in AWS Systems Manager Parameter Store and schedule an Amazon EventBridge rule that invokes a Lambda function every 30 days to update the parameters; grant the Glue job role ssm:GetParameters permission.

  • Save the credentials in the AWS Glue Data Catalog connection properties and enable automatic rotation in the connection settings.
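
For reference, the Secrets Manager integration mentioned in the options can be wired up with boto3 roughly as follows; the names and ARNs are hypothetical, and AWS publishes rotation function templates for Amazon Redshift:

    import json
    import boto3

    sm = boto3.client("secretsmanager")
    glue = boto3.client("glue")

    # Store the Redshift credentials once, outside the ETL code.
    secret = sm.create_secret(
        Name="redshift/etl-user",
        SecretString=json.dumps({"username": "etl_user", "password": "initial-password"}),
    )

    # Rotate automatically every 30 days with a rotation Lambda function.
    sm.rotate_secret(
        SecretId=secret["ARN"],
        RotationLambdaARN="arn:aws:lambda:us-east-1:111122223333:function:RedshiftRotation",
        RotationRules={"AutomaticallyAfterDays": 30},
    )

    # Point the existing Glue connection at the secret instead of plaintext values.
    glue.update_connection(
        Name="redshift-conn",
        ConnectionInput={
            "Name": "redshift-conn",
            "ConnectionType": "JDBC",
            "ConnectionProperties": {
                "JDBC_CONNECTION_URL": "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
                "SECRET_ID": secret["ARN"],
            },
        },
    )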

Question 3 of 20

AWS Glue has cataloged two Amazon Athena tables: a fact table named web_clicks that contains a product_id column, and a dimension table named dim_products. New product_ids may appear in web_clicks before dim_products is updated. You need an Athena view that returns every click event and adds product_name when it is available, otherwise null. Which JOIN clause meets this goal?

  • SELECT ... FROM web_clicks w RIGHT JOIN dim_products d ON w.product_id = d.product_id

  • SELECT ... FROM web_clicks w FULL OUTER JOIN dim_products d ON w.product_id = d.product_id

  • SELECT ... FROM web_clicks w LEFT JOIN dim_products d ON w.product_id = d.product_id

  • SELECT ... FROM web_clicks w INNER JOIN dim_products d ON w.product_id = d.product_id

Question 4 of 20

A company ingests high-frequency IoT sensor readings and must land them in Amazon S3 in under 30 seconds. Operations teams also need the ability to replay any portion of the incoming stream if a downstream transformation job fails. Which solution meets these requirements while keeping operational overhead to a minimum?

  • Use an AWS IoT Core rule to write the sensor messages directly to an S3 bucket with the "Customer managed" retry option enabled.

  • Send the data to Amazon Kinesis Data Streams with a 24-hour retention period, add an Amazon Kinesis Data Firehose delivery stream as a consumer, and configure Firehose buffering to 1 MiB or 10 seconds before writing to Amazon S3.

  • Deploy an Apache Kafka cluster on Amazon EC2, configure a topic for the sensors, and use a Kafka Connect S3 sink to write data to Amazon S3 every 10 seconds.

  • Create a direct Amazon Kinesis Data Firehose delivery stream and reduce the S3 buffering size to 1 MiB and interval to 10 seconds.
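
As a sketch of the buffering knobs the options refer to, a Firehose delivery stream that reads from a Kinesis data stream and flushes small batches to S3 can be created like this (ARNs are hypothetical; intervals below 60 seconds rely on Firehose's newer low-buffering support):

    import boto3

    firehose = boto3.client("firehose")

    firehose.create_delivery_stream(
        DeliveryStreamName="sensor-to-s3",
        DeliveryStreamType="KinesisStreamAsSource",
        KinesisStreamSourceConfiguration={
            "KinesisStreamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/sensor-stream",
            "RoleARN": "arn:aws:iam::111122223333:role/FirehoseReadStream",
        },
        ExtendedS3DestinationConfiguration={
            "BucketARN": "arn:aws:s3:::sensor-landing-bucket",
            "RoleARN": "arn:aws:iam::111122223333:role/FirehoseWriteS3",
            # Small buffers trade per-object efficiency for lower landing latency.
            "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 10},
        },
    )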

Question 5 of 20

An e-commerce company collects mobile-game clickstream events at 10 MB/s. The data must land in an Amazon S3 data lake and simultaneously feed three independent services: sub-second fraud detection, a personalization microservice, and a daily Amazon EMR batch job. The solution must be fully managed, replayable for 24 hours, and auto-scaling without capacity planning. Which approach is MOST cost-effective?

  • Provision an Amazon MSK cluster and publish the events to a Kafka topic. Have each consumer application subscribe to the topic and use Apache Flink on MSK to write the data to Amazon S3.

  • Insert events into a DynamoDB table and enable DynamoDB Streams. Use a Lambda function to forward stream records to Amazon S3 and invoke the consumer services.

  • Create a Kinesis Data Firehose delivery stream with a Lambda transformation. Configure S3 event notifications from the destination bucket to invoke the fraud and personalization services.

  • Create a Kinesis Data Stream in on-demand mode. Register three enhanced fan-out consumers: a fraud-detection Lambda function, a personalization microservice, and a Kinesis Data Firehose delivery stream that writes to Amazon S3.
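
For reference, an on-demand stream and enhanced fan-out consumers (each with its own dedicated read throughput) are set up roughly like this; the stream and consumer names are hypothetical, and a Firehose delivery stream writing to S3 would be attached separately with the stream as its source:

    import boto3

    kinesis = boto3.client("kinesis")

    # On-demand mode removes shard capacity planning.
    kinesis.create_stream(
        StreamName="game-clickstream",
        StreamModeDetails={"StreamMode": "ON_DEMAND"},
    )

    # (In real code, wait for the stream to become ACTIVE first.)
    stream_arn = kinesis.describe_stream_summary(StreamName="game-clickstream")[
        "StreamDescriptionSummary"
    ]["StreamARN"]

    # One enhanced fan-out registration per low-latency consumer application.
    for consumer in ["fraud-detection", "personalization"]:
        kinesis.register_stream_consumer(StreamARN=stream_arn, ConsumerName=consumer)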

Question 6 of 20

An online gaming company delivers about 5 MB/s of gameplay telemetry to AWS. The engineering team must store each record for 7 days, support millisecond-latency writes and multiple parallel reads, and invoke AWS Lambda functions that calculate near-real-time leaderboards. They want the lowest operational overhead and predictable pricing. Which service should they use as the primary data store?

  • Insert each record into an on-demand Amazon DynamoDB table and export the table to Amazon S3 after 7 days.

  • Send the data to Amazon S3 through Kinesis Data Firehose and have Lambda query the objects with Amazon Athena.

  • Create an Amazon Kinesis Data Streams stream with a 7-day retention period and configure AWS Lambda as a consumer.

  • Deploy an Amazon MSK cluster and write the telemetry to a Kafka topic configured for 7-day retention.
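
As an illustration of the retention and consumer settings in play here, a minimal boto3 sketch (stream and function names are hypothetical):

    import boto3

    kinesis = boto3.client("kinesis")
    lambda_client = boto3.client("lambda")

    # Extend retention from the 24-hour default to 7 days (168 hours).
    kinesis.increase_stream_retention_period(
        StreamName="gameplay-telemetry", RetentionPeriodHours=168
    )

    # Attach a Lambda consumer for the near-real-time leaderboard calculations.
    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:kinesis:us-east-1:111122223333:stream/gameplay-telemetry",
        FunctionName="leaderboard-aggregator",
        StartingPosition="LATEST",
        BatchSize=500,
    )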

Question 7 of 20

A retail company plans to ingest click-stream events with Apache Kafka. Security mandates that producer and consumer applications authenticate only with short-lived IAM role credentials, and that the data engineering team must not create or rotate cluster user passwords. Which deployment choice meets the requirement while minimizing operational effort?

  • Create an Amazon MSK cluster but disable IAM access control, instead using SASL/SCRAM authentication with credentials stored in Secrets Manager.

  • Deploy an Apache Kafka cluster on Amazon EC2 behind a Network Load Balancer and enforce mutual TLS with private certificates from AWS Certificate Manager Private CA.

  • Deploy an Apache Kafka cluster on Amazon EC2 instances and configure SASL/SCRAM authentication, storing usernames and passwords in AWS Secrets Manager.

  • Provision an Amazon MSK cluster with IAM access control enabled so clients authenticate with SigV4-signed requests using their IAM roles.
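
For reference, IAM access control is enabled when the MSK cluster is created; a rough boto3 sketch, with hypothetical subnets, security groups, and Kafka version:

    import boto3

    kafka = boto3.client("kafka")

    kafka.create_cluster_v2(
        ClusterName="clickstream-msk",
        Provisioned={
            "KafkaVersion": "3.6.0",
            "NumberOfBrokerNodes": 3,
            "BrokerNodeGroupInfo": {
                "InstanceType": "kafka.m5.large",
                "ClientSubnets": ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"],
                "SecurityGroups": ["sg-0123456789abcdef0"],
            },
            # Clients then authenticate with SigV4-signed requests using their IAM
            # roles; there are no SCRAM usernames or passwords to create or rotate.
            "ClientAuthentication": {"Sasl": {"Iam": {"Enabled": True}}},
        },
    )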

Question 8 of 20

Your data engineering team uses AWS Glue to transform data that lands in Amazon S3. To comply with EU data-sovereignty rules, every analytic object must remain in either eu-west-1 or eu-central-1. Across dozens of AWS accounts, you must prevent any resource creation or data replication in other Regions. Which solution BEST enforces this requirement?

  • Require SSE-KMS with customer-managed keys created in the EU Regions and mandate bucket policies that enforce encryption on all uploads.

  • Turn on Amazon Macie automatic sensitive-data discovery and configure Security Hub to raise findings when objects are stored in non-EU Regions.

  • Enable S3 Object Lock on all buckets and configure default retention settings so that objects cannot be deleted or overwritten outside the EU.

  • Attach a service control policy (SCP) to the organization that denies all actions in Regions other than eu-west-1 and eu-central-1 by using the aws:RequestedRegion global condition key.
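
As a sketch of the Region-restriction pattern referenced in the options, an SCP built around the aws:RequestedRegion condition key might look like this (the exempted global services and the root ID are assumptions you would adjust):

    import json
    import boto3

    org = boto3.client("organizations")

    scp = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideEURegions",
                "Effect": "Deny",
                # Global services such as IAM and Organizations are commonly exempted.
                "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": ["eu-west-1", "eu-central-1"]
                    }
                },
            }
        ],
    }

    policy = org.create_policy(
        Name="eu-only-regions",
        Description="Deny actions outside eu-west-1 and eu-central-1",
        Type="SERVICE_CONTROL_POLICY",
        Content=json.dumps(scp),
    )
    org.attach_policy(PolicyId=policy["Policy"]["PolicySummary"]["Id"], TargetId="r-examplerootid")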

Question 9 of 20

Your company ingests website click-stream events that are serialized as JSON. The structure of the events will evolve as new product features are released, and the data engineering team wants analysts to run ad-hoc SQL queries in Amazon Redshift without performing manual DDL each time a new attribute appears. The solution must keep storage costs low and avoid interrupting existing queries. Which design meets these requirements?

  • Persist the events in an Amazon RDS PostgreSQL database and query the table from Redshift by using federated queries.

  • Write the JSON events to Amazon S3, use an AWS Glue crawler to catalog the files, and create an Amazon Redshift Spectrum external table that references the Glue Data Catalog.

  • Stream the JSON events directly into an Amazon Redshift table that uses the SUPER data type and rely on Redshift to surface new keys automatically.

  • Use AWS Database Migration Service (AWS DMS) to load the events from S3 into a Redshift columnar table and run a nightly job that issues ALTER TABLE ADD COLUMN statements for any new attributes.
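
For context, Redshift Spectrum reads Glue-cataloged S3 data through an external schema; a minimal sketch using the Redshift Data API (the cluster, database, role, and catalog names are hypothetical):

    import boto3

    rsd = boto3.client("redshift-data")

    # Map the Glue Data Catalog database maintained by the crawler into Redshift.
    sql = """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS clickstream_ext
    FROM DATA CATALOG
    DATABASE 'clickstream_db'
    IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftSpectrumRole';
    """

    rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="dev",
        DbUser="admin",
        Sql=sql,
    )

Analysts can then query the external tables in that schema, and new attributes surfaced by the crawler appear without manual DDL in Redshift.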

Question 10 of 20

A data engineering team must expose a JSON ingestion REST endpoint to several financial partners. Company policy requires each partner to authenticate by presenting an X.509 client certificate issued by the partner's intermediate CA. The endpoint must be reachable only from the company VPC, and the team wants to avoid writing custom certificate-validation logic. Which solution meets these requirements with the least operational overhead?

  • Issue an IAM access key and secret key to each partner and require Signature Version 4-signed HTTPS requests to an Internet-facing API Gateway endpoint secured with IAM authorization.

  • Create a private Amazon API Gateway REST API, enable mutual TLS with a trust store that contains the partners' CA certificates, and access the API through an interface VPC endpoint.

  • Deploy an internal Application Load Balancer with an HTTPS listener configured for mutual TLS verify mode. Create an ELB trust store containing the partners' CA certificates in Amazon S3 and attach it to the listener.

  • Provide partners with presigned Amazon S3 PUT URLs secured with TLS 1.2 so they can upload their data files.

Question 11 of 20

An e-commerce company stores its daily sales metrics as partitioned Parquet files in Amazon S3. Business analysts must build interactive dashboards that refresh hourly, support ad-hoc filtering, and must not require the data engineering team to provision or manage servers. Users are authenticated through Amazon Cognito. Which approach meets the requirements with the least operational overhead?

  • Configure Amazon QuickSight to query the S3 data through an Athena data source, enable SPICE for hourly refreshes, and share dashboards with Cognito-authenticated users.

  • Create an Amazon Redshift cluster, load the Parquet data with the COPY command, and connect Amazon QuickSight in direct-query mode.

  • Launch an Amazon EMR cluster running Presto to serve the data and deploy an open-source visualization tool on Amazon ECS for analysts.

  • Schedule AWS Glue DataBrew jobs to generate visual charts and publish them as static HTML pages in Amazon S3.

Question 12 of 20

A data engineering team created a materialized view in Amazon Redshift that joins the internal fact_sales table with an external product_dim table stored in Amazon S3 through a Spectrum external schema. After the product_dim data files are overwritten each night, analysts notice that the view returns stale data. The team must keep results current in the most cost-effective way without copying the external table into Redshift. What should they do?

  • Convert the product_dim external table into a regular Redshift table so the view can refresh automatically.

  • Replace the materialized view with a late-binding view so it always reads the latest external data.

  • Run ALTER MATERIALIZED VIEW … AUTO REFRESH YES to enable incremental refresh on the existing view.

  • Schedule the REFRESH MATERIALIZED VIEW command to run after the nightly S3 load completes.
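
For reference, a refresh statement like the one mentioned above can be issued from a scheduled job (for example, right after the nightly S3 load) through the Redshift Data API; the cluster and view names are hypothetical:

    import boto3

    rsd = boto3.client("redshift-data")

    rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",
        Database="dev",
        DbUser="admin",
        Sql="REFRESH MATERIALIZED VIEW mv_sales_with_products;",
    )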

Question 13 of 20

Your organization uses AWS Lake Formation to govern a raw data lake in Amazon S3. You registered the s3://finance-raw bucket and cataloged the transactions table in the finance database. Analysts already have Lake Formation SELECT on the table, yet Athena returns "Access Denied - insufficient Lake Formation permissions." Which additional Lake Formation permission will resolve the error without granting broader S3 or IAM access?

  • Grant Lake Formation DATA_LOCATION_ACCESS on the s3://finance-raw location.

  • Give the IAM role Lake Formation ALTER permission on the transactions table.

  • Grant Lake Formation DESCRIBE permission on the default database.

  • Attach an IAM policy that allows s3:GetObject on the finance-raw bucket.

Question 14 of 20

An ecommerce company keeps 3 years of web-server logs as uncompressed .txt files in the s3://company-data/logs/ prefix. Data analysts must run interactive ad-hoc SQL queries against only the most recent 90 days of logs. The solution must minimize query cost, leave the raw files unchanged, and avoid managing long-running infrastructure. Which approach best meets these requirements?

  • Copy the most recent 90 days of logs into an Amazon Redshift cluster and pause the cluster when queries are finished.

  • Use an AWS Glue ETL job to convert the latest 90 days of .txt logs to compressed Parquet files in a separate S3 prefix and query that prefix with Amazon Athena.

  • Import all .txt logs into an Amazon RDS for PostgreSQL instance with auto-scaling storage and index the timestamp column.

  • Create external tables in Amazon Athena that reference the existing .txt files and add day-based partitions for the last 90 days.

Question 15 of 20

An organization runs nightly Apache Spark ETL jobs with Amazon EMR on EKS. Each executor pod requests 4 vCPU and 32 GiB memory, but its CPU limit is also set to 4 vCPU. CloudWatch shows frequent CpuCfsThrottledSeconds and long task runtimes, while cluster nodes have unused CPU. The team wants faster jobs without adding nodes or instances. Which action meets the requirement?

  • Remove the CPU limit or raise it well above the request so executor containers can use idle vCPU on the node.

  • Migrate the workload to AWS Glue interactive sessions, which automatically scale compute resources.

  • Replace gp3 root volumes with io2 volumes on worker nodes to increase disk throughput.

  • Enable Spark dynamic allocation so the job can launch additional executor pods during the run.
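
As an illustration of how the executor CPU request and limit are set for EMR on EKS jobs, a hypothetical start_job_run call; the IDs, paths, and the limit value of 8 vCPU are assumptions:

    import boto3

    emr = boto3.client("emr-containers")

    emr.start_job_run(
        virtualClusterId="abcdef1234567890example",
        name="nightly-etl",
        executionRoleArn="arn:aws:iam::111122223333:role/EmrOnEksJobRole",
        releaseLabel="emr-6.15.0-latest",
        jobDriver={
            "sparkSubmitJobDriver": {
                "entryPoint": "s3://example-bucket/jobs/nightly_etl.py",
                "sparkSubmitParameters": (
                    "--conf spark.executor.memory=28g "
                    # Keeping the request at 4 vCPU preserves scheduling behavior, while
                    # a higher (or absent) limit lets executors use idle node CPU
                    # instead of being CFS-throttled.
                    "--conf spark.kubernetes.executor.request.cores=4 "
                    "--conf spark.kubernetes.executor.limit.cores=8"
                ),
            }
        },
    )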

Question 16 of 20

Your team has registered an Amazon S3 data lake with AWS Lake Formation, and analysts query the data through Amazon Athena. The security team must ensure that any S3 object Amazon Macie flags as containing PII is automatically blocked from the analyst LF-principal but remains accessible to the governance LF-principal. The solution must rely on AWS-managed integrations and involve as little custom code as possible. Which approach meets these requirements?

  • Run an AWS Glue crawler with custom classifiers that detect PII and update the Data Catalog, then attach IAM policies that deny analysts access to any tables the crawler marks as sensitive.

  • Configure an Amazon Macie discovery job and an EventBridge rule that starts a Step Functions workflow. The workflow calls Lake Formation AddLFTagsToResource to tag resources Classification=Sensitive and applies LF-tag policies that block analysts and allow governance users.

  • Generate daily S3 Inventory reports, use S3 Batch Operations to tag files that contain sensitive keywords, and add bucket policies that block the analyst group from those objects while permitting governance access.

  • Use S3 Object Lambda with a Lambda function that removes or redacts PII from objects before analysts access them, while governance users read the original objects directly.
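
For reference, the tagging step that a Step Functions task would perform is a Lake Formation API call roughly like this; the database and table names are hypothetical, and the Classification LF-tag must already exist (created with create_lf_tag):

    import boto3

    lf = boto3.client("lakeformation")

    # Tag the resource Macie flagged; LF-tag policies then decide who can read it.
    lf.add_lf_tags_to_resource(
        Resource={"Table": {"DatabaseName": "datalake", "Name": "customer_events"}},
        LFTags=[{"TagKey": "Classification", "TagValues": ["Sensitive"]}],
    )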

Question 17 of 20

An Amazon Athena table named clickstream contains columns session_id string, page string, event_time timestamp, and load_time_ms int. A data engineer must return the five pages with the highest average load_time_ms recorded in the last 7 days, but only for pages that have at least 100 distinct sessions. Which SQL query satisfies the requirement?

  • SELECT page,
           AVG(load_time_ms) AS avg_load
    FROM clickstream
    GROUP BY page
    HAVING COUNT(DISTINCT session_id) >= 100
       AND event_time >= current_timestamp - INTERVAL '7' day
    ORDER BY avg_load DESC
    LIMIT 5;
    
  • SELECT page,
           AVG(load_time_ms) AS avg_load
    FROM clickstream
    WHERE event_time >= current_timestamp - INTERVAL '7' day
    GROUP BY page
    HAVING COUNT(DISTINCT session_id) >= 100
    ORDER BY COUNT(DISTINCT session_id) DESC
    LIMIT 5;
    
  • SELECT page,
           AVG(load_time_ms) AS avg_load
    FROM clickstream
    WHERE event_time >= current_timestamp - INTERVAL '7' day
      AND COUNT(DISTINCT session_id) >= 100
    GROUP BY page
    ORDER BY avg_load DESC
    LIMIT 5;
    
  • SELECT page,
           AVG(load_time_ms) AS avg_load
    FROM clickstream
    WHERE event_time >= current_timestamp - INTERVAL '7' day
    GROUP BY page
    HAVING COUNT(DISTINCT session_id) >= 100
    ORDER BY avg_load DESC
    LIMIT 5;
    
Question 18 of 20

A DynamoDB table that stores IoT sensor readings peaks at 40,000 writes per second. The analytics team must land every new item in an Amazon S3 data lake within 60 seconds. The solution must auto-scale, provide at-least-once delivery, and minimize operational overhead. Which architecture meets these requirements MOST effectively?

  • Enable DynamoDB Streams with the NEW_IMAGE view and configure an AWS Lambda function as the event source; inside the function batch the records and submit them to an Amazon Kinesis Data Firehose delivery stream that writes to S3.

  • Use AWS Database Migration Service in change data capture mode to replicate the DynamoDB table continuously to an S3 target.

  • Schedule an AWS Glue batch job every minute to export the entire table to S3 by using DynamoDB export to S3.

  • Create an AWS Glue streaming ETL job that consumes the table's stream ARN directly and writes the data to Amazon S3.
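
As a sketch of the stream-consumer pattern described in the options, a Lambda handler that forwards DynamoDB Streams records to a Firehose delivery stream might look like this (the delivery-stream name is hypothetical; error handling and retries are omitted):

    import json
    import boto3

    firehose = boto3.client("firehose")

    def handler(event, context):
        records = []
        for record in event["Records"]:
            # NEW_IMAGE view: the full item as written, in DynamoDB JSON form.
            new_image = record["dynamodb"].get("NewImage")
            if new_image:
                records.append({"Data": (json.dumps(new_image) + "\n").encode("utf-8")})
        if records:
            # put_record_batch accepts up to 500 records per call.
            firehose.put_record_batch(DeliveryStreamName="sensor-to-s3", Records=records)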

Question 19 of 20

A company ingests clickstream events into an Amazon DynamoDB table. Traffic remains near zero most of the day but bursts to 40,000 writes per second during marketing campaigns. Analysts query events by userId and timestamp range. Provisioned capacity with auto scaling causes throttling and wasted spend. Which configuration best meets the performance and cost requirements with minimal administration?

  • Convert the table and all global secondary indexes to on-demand capacity mode.

  • Add a DynamoDB Accelerator (DAX) cluster in front of the table to cache hot items.

  • Triple the provisioned write capacity and reduce the auto-scaling cooldown period to 30 seconds.

  • Enable DynamoDB Streams and invoke an AWS Lambda function to batch writes into Amazon S3.
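
For reference, a table's capacity mode is switched with a single update call; the table name is hypothetical, and global secondary indexes follow the table's billing mode:

    import boto3

    dynamodb = boto3.client("dynamodb")

    # PAY_PER_REQUEST (on-demand) removes provisioned-throughput planning entirely.
    dynamodb.update_table(TableName="clickstream_events", BillingMode="PAY_PER_REQUEST")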

Question 20 of 20

Your team receives unpredictable batches of CSV transaction files in a dedicated Amazon S3 prefix. Every file must be ingested into an Amazon Redshift staging table within five minutes of arrival. The solution must follow an event-driven batch pattern, avoid idle infrastructure, and scale automatically with the daily file count. Which approach meets these requirements while keeping operational overhead low?

  • Send the files to an Amazon Kinesis Data Firehose delivery stream configured to deliver records to Amazon Redshift.

  • Configure an Amazon S3 event notification that routes through EventBridge to trigger an AWS Glue job, and have the job run a Redshift COPY command for the new object.

  • Set up an AWS Database Migration Service task with S3 as the source endpoint and Redshift as the target to perform full load and change data capture.

  • Create an AWS Glue job with a 5-minute cron schedule that recursively scans the prefix and loads any discovered files into Redshift.