
AWS Certified Data Engineer Associate Practice Test (DEA-C01)

Use the form below to configure your AWS Certified Data Engineer Associate Practice Test (DEA-C01). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

  • Questions: the number of questions in the practice test
  • Seconds Per Question: determines how long you have to finish the practice test
  • Exam Objectives: which exam objectives should be included in the practice test

AWS Certified Data Engineer Associate DEA-C01 Information

The AWS Certified Data Engineer – Associate certification validates your ability to design, build, and manage data pipelines on the AWS Cloud. It’s designed for professionals who transform raw data into actionable insights using AWS analytics and storage services. This certification proves you can work with modern data architectures that handle both batch and streaming data, using tools like Amazon S3, Glue, Redshift, EMR, Kinesis, and Athena to deliver scalable and efficient data solutions.

The exam covers the full data lifecycle — from ingestion and transformation to storage, analysis, and optimization. Candidates are tested on their understanding of how to choose the right AWS services for specific use cases, design secure and cost-effective pipelines, and ensure data reliability and governance. You’ll need hands-on knowledge of how to build ETL workflows, process large datasets efficiently, and use automation to manage data infrastructure in production environments.

Earning this certification demonstrates to employers that you have the technical expertise to turn data into value on AWS. It’s ideal for data engineers, analysts, and developers who work with cloud-based data systems and want to validate their skills in one of the most in-demand areas of cloud computing today. Whether you’re building data lakes, streaming pipelines, or analytics solutions, this certification confirms you can do it the AWS way — efficiently, securely, and at scale.

  • Free AWS Certified Data Engineer Associate DEA-C01 Practice Test
  • Questions: 20
  • Seconds Per Question: Unlimited
  • Exam Objectives:
    Data Ingestion and Transformation
    Data Store Management
    Data Operations and Support
    Data Security and Governance
Question 1 of 20

A fintech startup captures tick-level trade events in an Amazon Kinesis Data Stream. Business analysts need to run near-real-time SQL queries in Amazon Redshift with end-to-end latency under 15 seconds. The team wants the simplest, most cost-effective solution and does not want to manage intermediate Amazon S3 staging or custom infrastructure. Which approach should the data engineer implement to meet these requirements?

  • Create a materialized view in Amazon Redshift that references the Kinesis stream with the KINESIS clause and enable auto-refresh for continuous ingestion.

  • Configure Amazon Kinesis Data Firehose to deliver the stream to an S3 bucket and schedule a Redshift COPY command to load the files every minute.

  • Build an AWS Glue streaming job that reads from the Kinesis stream and writes batches to Amazon Redshift using JDBC.

  • Attach an AWS Lambda function as a stream consumer that buffers events and inserts them into Amazon Redshift through the Data API.
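
For reference, the streaming-ingestion pattern described in the materialized-view option looks roughly like the sketch below. The schema, stream, role, and workgroup names are placeholders, and the SQL is submitted through the Redshift Data API purely for illustration (a provisioned cluster would use ClusterIdentifier and DbUser instead of WorkgroupName).

    import boto3

    rsd = boto3.client("redshift-data")

    # Hypothetical schema, stream, role, and workgroup names.
    CREATE_SCHEMA = """
    CREATE EXTERNAL SCHEMA kinesis_schema
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftStreamingRole';
    """

    CREATE_MV = """
    CREATE MATERIALIZED VIEW trade_events_mv AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           partition_key,
           JSON_PARSE(kinesis_data) AS payload
    FROM kinesis_schema."tick-trade-events";
    """

    rsd.batch_execute_statement(
        WorkgroupName="analytics-wg",   # or ClusterIdentifier=... / DbUser=... for provisioned Redshift
        Database="dev",
        Sqls=[CREATE_SCHEMA, CREATE_MV],
    )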

Question 2 of 20

Your team has registered an Amazon S3 data lake with AWS Lake Formation, and analysts query the data through Amazon Athena. The security team must ensure that any S3 object Amazon Macie flags as containing PII is automatically blocked for the analysts' Lake Formation principal but remains accessible to the governance principal. The solution must rely on AWS-managed integrations and involve as little custom code as possible. Which approach meets these requirements?

  • Configure an Amazon Macie discovery job and an EventBridge rule that starts a Step Functions workflow. The workflow calls Lake Formation AddLFTagsToResource to tag resources Classification=Sensitive and applies LF-tag policies that block analysts and allow governance users.

  • Run an AWS Glue crawler with custom classifiers that detect PII and update the Data Catalog, then attach IAM policies that deny analysts access to any tables the crawler marks as sensitive.

  • Generate daily S3 Inventory reports, use S3 Batch Operations to tag files that contain sensitive keywords, and add bucket policies that block the analyst group from those objects while permitting governance access.

  • Use S3 Object Lambda with a Lambda function that removes or redacts PII from objects before analysts access them, while governance users read the original objects directly.

Question 3 of 20

An e-commerce company transforms 2 TB of clickstream data stored in Amazon S3 every night by running a PySpark script that is version-controlled in an S3 path. Engineers want to invoke the job from a Jenkins pipeline through API calls, avoid managing any clusters, yet retain access to the Spark UI for detailed job troubleshooting. Which solution best satisfies these requirements?

  • Package the script in a Docker image and run it with AWS Batch on AWS Fargate; submit the job via the SubmitJob API; inspect the CloudWatch Logs stream for troubleshooting.

  • Create an AWS Glue Spark job that references the script in Amazon S3; trigger the job by calling the StartJobRun API from Jenkins; use the AWS Glue Spark UI to debug failed runs.

  • Provision an Amazon EMR cluster on EC2 each night and submit the script as a step by calling the AddJobFlowSteps API; access the Spark UI on the cluster's master node for troubleshooting; terminate the cluster after completion.

  • Load the script into an Amazon Athena Spark notebook and invoke it by calling the StartQueryExecution API; view execution output in Athena's query editor for debugging.

Question 4 of 20

An organization runs nightly Apache Spark ETL jobs with Amazon EMR on EKS. Each executor pod requests 4 vCPU and 32 GiB memory, but its CPU limit is also set to 4 vCPU. CloudWatch shows frequent CpuCfsThrottledSeconds and long task runtimes, while cluster nodes have unused CPU. The team wants faster jobs without adding nodes or instances. Which action meets the requirement?

  • Replace gp3 root volumes with io2 volumes on worker nodes to increase disk throughput.

  • Remove the CPU limit or raise it well above the request so executor containers can use idle vCPU on the node.

  • Migrate the workload to AWS Glue interactive sessions, which automatically scale compute resources.

  • Enable Spark dynamic allocation so the job can launch additional executor pods during the run.

Question 5 of 20

An analytics team must build an AWS Glue Spark job that enriches 500 GB of Parquet click-stream data stored in Amazon S3 with a 5 GB customer dimension table that resides in an Amazon RDS for PostgreSQL instance. The solution must minimize infrastructure management, let multiple future jobs reuse the same metadata, and ensure that all traffic stays within the VPC. Which approach meets these requirements?

  • Set up AWS Database Migration Service to export the RDS table to Amazon S3 each night, crawl the exported files, and join them with the click-stream data in the Glue job.

  • Configure Amazon Athena with the PostgreSQL federated query connector and have the Glue job retrieve the customer table by querying Athena during each run.

  • Use AWS DMS to replicate the RDS table into Amazon DynamoDB and query DynamoDB from the Glue Spark job for the customer dimension data.

  • Create an AWS Glue JDBC connection to the RDS endpoint in the VPC, run a crawler with that connection to catalog the customer table, and have the Glue Spark job read the cataloged JDBC table alongside the Parquet files.

Question 6 of 20

An analytics team runs a provisioned Amazon Redshift cluster that loads 3 TB of data nightly and is queried by business analysts. Queries arrive unpredictably, with some days heavy ad-hoc activity and most days almost no usage. The company wants to cut costs and remove cluster management tasks while keeping the existing Redshift schema and SQL. Which solution best meets these requirements?

  • Resize the cluster to RA3 nodes and enable Redshift Concurrency Scaling.

  • Query the nightly data files directly from Amazon S3 by using Amazon Athena.

  • Migrate the workload to an on-demand Amazon EMR cluster running Apache Hive.

  • Take a snapshot of the cluster and restore it to an Amazon Redshift Serverless workgroup.

Question 7 of 20

Your company runs several Amazon EMR clusters that execute nightly Spark jobs. The engineering team wants a managed solution to aggregate application and step logs from every cluster, retain the data for 30 days, and provide near-real-time search and interactive dashboards to troubleshoot performance issues. Which approach meets these requirements with the least operational overhead?

  • Stream logs from the EMR master node to Amazon Kinesis Data Streams, invoke AWS Lambda to load the records into Amazon DynamoDB, and build Amazon QuickSight analyses on the table.

  • Enable log archiving to Amazon S3, run Amazon Athena queries against the logs, and visualize the results in Amazon QuickSight with a 30-day lifecycle policy on the S3 bucket.

  • Configure each EMR cluster to publish its logs to CloudWatch Logs, create a CloudWatch Logs subscription that streams the logs to an Amazon OpenSearch Service domain, and set a 30-day retention policy on the log groups.

  • Install Filebeat on every EMR node to forward logs to an ELK stack running on a separate always-on EMR cluster and delete indices older than 30 days.

Question 8 of 20

An ecommerce company stores hundreds of Parquet datasets in Amazon S3. The analytics team catalogs the data in AWS Glue. They must indicate for each table and column whether the data is public, internal only, or contains customer PII, and they must enforce different Athena permissions based on these classifications. Which solution requires the least ongoing administration?

  • Maintain separate AWS Glue databases for Public, Internal, and PII data and restrict Athena users to the corresponding database.

  • Create Lake Formation LF-tags for each sensitivity level, attach them to the relevant tables and columns, and grant tag-based permissions to the appropriate IAM principals.

  • Configure custom classifiers in AWS Glue crawlers to label tables and use Glue column-level IAM policies to restrict Athena access.

  • Enable Amazon Macie on the S3 buckets and use Macie findings to automatically block unauthorized Athena queries against sensitive data.
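
A rough sketch of the LF-tag option, using a hypothetical tag key and values, placeholder database/table names, and a placeholder analyst role ARN:

    import boto3

    lf = boto3.client("lakeformation")

    # Define a sensitivity LF-tag (hypothetical key and values).
    lf.create_lf_tag(TagKey="Sensitivity", TagValues=["Public", "Internal", "PII"])

    # Attach the tag to a cataloged table (names are placeholders).
    lf.add_lf_tags_to_resource(
        Resource={"Table": {"DatabaseName": "sales", "Name": "orders"}},
        LFTags=[{"TagKey": "Sensitivity", "TagValues": ["Internal"]}],
    )

    # Grant SELECT on everything tagged Public or Internal to an analyst role.
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
        Resource={
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": "Sensitivity", "TagValues": ["Public", "Internal"]}],
            }
        },
        Permissions=["SELECT"],
    )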

Question 9 of 20

A data engineering team receives hourly CSV files in an Amazon S3 bucket. Each time a file arrives they must 1) launch an AWS Glue ETL job, 2) run an Amazon Athena CTAS query to aggregate the transformed data, and 3) send an Amazon SNS notification. The solution must provide built-in retries, visual workflow monitoring, JSON-based infrastructure-as-code definitions, and minimal operational overhead. Which service should orchestrate this pipeline?

  • Create an Amazon EventBridge Scheduler cron expression that invokes three Lambda functions in sequence to run Glue, Athena, and SNS.

  • Deploy an Amazon Managed Workflows for Apache Airflow environment and implement a DAG that calls Glue and Athena operators, then publishes an SNS message.

  • Define an AWS Step Functions state machine triggered by an EventBridge rule that invokes the Glue job, runs the Athena query with the SDK integration, and publishes to SNS.

  • Use an AWS Glue Workflow to run the Glue job, followed by a crawler and a trigger that starts an Athena query via Lambda, then send an SNS notification.
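
A minimal sketch of the Step Functions option, with hypothetical job, table, topic, and role names; the .sync service integrations let the state machine wait for the Glue job and the Athena query to finish before publishing to SNS:

    import json
    import boto3

    sfn = boto3.client("stepfunctions")

    definition = {
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": "hourly-csv-transform"},
                "Next": "RunAthenaCtas",
            },
            "RunAthenaCtas": {
                "Type": "Task",
                "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
                "Parameters": {
                    # Hypothetical CTAS query and workgroup.
                    "QueryString": "CREATE TABLE hourly_agg AS SELECT event_hour, COUNT(*) AS events FROM transformed_clicks GROUP BY event_hour",
                    "WorkGroup": "primary",
                },
                "Next": "Notify",
            },
            "Notify": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": "arn:aws:sns:us-east-1:111122223333:etl-complete",
                    "Message": "Hourly pipeline finished",
                },
                "End": True,
            },
        },
    }

    sfn.create_state_machine(
        name="hourly-etl",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::111122223333:role/StepFunctionsEtlRole",
    )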

Question 10 of 20

An ecommerce platform streams purchase events to an Amazon Kinesis Data Stream that contains three shards. A Lambda function is configured as the only consumer through an event source mapping. CloudWatch shows the IteratorAge metric growing to several minutes even though the function successfully processes each batch in less than 200 ms. The team must reduce the lag without changing code or adding shards. Which action should the data engineer take?

  • Reduce the BatchSize value to invoke the function with fewer records more frequently.

  • Increase the Lambda function's memory allocation to provide more CPU and shorten runtime.

  • Increase the ParallelizationFactor setting on the event source mapping so multiple batches from each shard are processed concurrently.

  • Enable enhanced fan-out on the stream and register the Lambda function as an enhanced consumer.
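
For illustration, raising the parallelization factor is a single update to the existing event source mapping; the UUID below is a placeholder for the mapping's identifier:

    import boto3

    lam = boto3.client("lambda")

    lam.update_event_source_mapping(
        UUID="11111111-2222-3333-4444-555555555555",  # existing Kinesis mapping (placeholder)
        ParallelizationFactor=10,                     # up to 10 concurrent batches per shard
    )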

Question 11 of 20

A company stores multiple datasets in a single Amazon S3 bucket. Each object carries a Team tag that identifies the owning team, and AWS Glue jobs run under IAM roles that carry the same Team tag. The security team wants each job to read only objects whose Team tag matches its role's Team tag, without creating new policies when new teams join. Which authorization approach will best satisfy this requirement?

  • Apply S3 object ACLs that grant read permission to each team's IAM role whenever new data is uploaded.

  • Implement ABAC by attaching one IAM policy that allows s3:GetObject when the principal's Team tag matches the object's Team tag.

  • Create a dedicated IAM role and managed policy for each team that grants access to that team's S3 prefix.

  • Provision an S3 Access Point per team and use access point resource policies to restrict read access to the corresponding role.
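
A minimal sketch of the ABAC option: a single reusable policy whose condition compares the caller's Team principal tag to the object's Team tag. The bucket and policy names are hypothetical:

    import json
    import boto3

    iam = boto3.client("iam")

    # One policy shared by every team role: access is allowed only when the
    # caller's Team principal tag matches the Team tag on the S3 object.
    abac_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::shared-datasets-bucket/*",
                "Condition": {
                    "StringEquals": {
                        "s3:ExistingObjectTag/Team": "${aws:PrincipalTag/Team}"
                    }
                },
            }
        ],
    }

    iam.create_policy(
        PolicyName="TeamTagObjectAccess",
        PolicyDocument=json.dumps(abac_policy),
    )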

Question 12 of 20

An AWS Glue ETL job processes files that contain PII. The source and destination Amazon S3 buckets must enforce encryption at rest with customer-managed keys. Security forbids use of the default aws/s3 KMS key and wants other AWS accounts to read the output. Which approach meets these requirements with the least operational effort?

  • Enable SSE-KMS with a customer-managed key, configure bucket default encryption to use that key, and add the external accounts to the key policy and bucket policy.

  • Enable SSE-KMS with the AWS managed key (aws/s3) and create S3 Access Points for the external accounts.

  • Enable SSE-S3 on both buckets and add a bucket policy that denies uploads without encryption.

  • Implement client-side encryption in the Glue job using a key stored in AWS Secrets Manager, then upload the encrypted objects.
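
For reference, pointing a bucket's default encryption at a customer-managed key is a single call; the bucket name and key ARN below are placeholders, and the external reader accounts would additionally need kms:Decrypt in the key policy and s3:GetObject in the bucket policy (not shown):

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_encryption(
        Bucket="pii-output-bucket",
        ServerSideEncryptionConfiguration={
            "Rules": [
                {
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "aws:kms",
                        "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
                    },
                    "BucketKeyEnabled": True,
                }
            ]
        },
    )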

Question 13 of 20

A company's Amazon Redshift RA3 cluster hosts a 5-TB fact table that receives new rows each night. Business analysts issue the same complex aggregation query every morning to populate dashboards, but the query still takes about 40 minutes even after regular VACUUM and ANALYZE operations. As the data engineer, you must cut the runtime dramatically, keep administration effort low, and avoid a large cost increase. Which approach will best meet these requirements?

  • Enable Amazon Redshift Concurrency Scaling so the query can execute on additional transient clusters.

  • Increase the WLM queue's slot count and enable short query acceleration to allocate more memory to the query.

  • Change the fact table's distribution style to ALL so every node stores a full copy, eliminating data shuffling during joins.

  • Create a materialized view that pre-aggregates the required data, schedule an automatic REFRESH after the nightly load, and direct the dashboard to query the materialized view.
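
A rough sketch of the materialized-view option, with hypothetical cluster, database, user, table, and view names; the CREATE runs once, and the REFRESH statement is what gets scheduled right after the nightly load:

    import boto3

    rsd = boto3.client("redshift-data")

    CREATE_MV = """
    CREATE MATERIALIZED VIEW daily_sales_mv AS
    SELECT sale_date, region, SUM(amount) AS total_amount, COUNT(*) AS order_count
    FROM fact_sales
    GROUP BY sale_date, region;
    """

    # One-time creation of the pre-aggregated view.
    rsd.execute_statement(
        ClusterIdentifier="analytics-cluster", Database="prod", DbUser="awsuser",
        Sql=CREATE_MV,
    )

    # Statement to schedule after the nightly load (for example with the Redshift
    # query scheduler or EventBridge) so dashboards read fresh, precomputed rows.
    rsd.execute_statement(
        ClusterIdentifier="analytics-cluster", Database="prod", DbUser="awsuser",
        Sql="REFRESH MATERIALIZED VIEW daily_sales_mv;",
    )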

Question 14 of 20

An e-commerce company ingests about 800 GB of product images and related JSON metadata each day. The data must be stored with eleven nines (99.999999999%) of durability, read by Spark jobs on Amazon EMR, and later queried using Amazon Athena. The solution should scale automatically, require minimal administration, and cut storage costs because the images are seldom accessed after the first few days. Which AWS storage option best meets these requirements?

  • Save the images as binary attributes in an Amazon DynamoDB table and scan the table from Amazon EMR.

  • Store the images and metadata in an Amazon S3 bucket and apply an S3 Lifecycle rule that transitions objects to S3 Glacier Instant Retrieval after 30 days.

  • Load the images and metadata into an Amazon Redshift RA3 cluster and query the data with Redshift Spectrum.

  • Mount an Amazon EFS One Zone-IA file system on the EMR cluster and place the images and metadata there.
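
For illustration, the transition described in the S3 lifecycle option is a single rule; the bucket and prefix names are placeholders:

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="product-image-lake",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "images-to-glacier-ir",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "images/"},
                    "Transitions": [
                        {"Days": 30, "StorageClass": "GLACIER_IR"}
                    ],
                }
            ]
        },
    )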

Question 15 of 20

A workload must ingest 20 MB/s of 20 KB JSON messages produced by thousands of IoT devices and make each record available to a downstream analytics application within a few hundred milliseconds. Which solution meets the throughput and latency requirements in the most cost-effective way?

  • Send the data to an Amazon Kinesis Data Firehose delivery stream with default buffering and deliver it to the analytics application.

  • Publish the events to an Amazon EventBridge bus and have a rule invoke the analytics application for each event.

  • Send the messages to an Amazon Kinesis Data Streams stream sized with at least 20 shards, then have the analytics application consume from the stream.

  • Buffer records on each device and write multipart objects directly to an Amazon S3 bucket, then trigger processing with S3 event notifications.
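
The shard math behind the Kinesis Data Streams option: each shard ingests up to 1 MB/s, so 20 MB/s needs at least 20 shards, while 20 MB/s of 20 KB messages is only about 1,000 records/s in total, well under the 1,000 records/s-per-shard limit across 20 shards. A minimal sketch with a hypothetical stream name:

    import boto3

    kinesis = boto3.client("kinesis")

    # 20 MB/s ingest / 1 MB/s per shard = 20 shards; record rate is not the bottleneck.
    kinesis.create_stream(
        StreamName="iot-events",
        ShardCount=20,
        StreamModeDetails={"StreamMode": "PROVISIONED"},
    )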

Question 16 of 20

Your company stores raw transactional data with credit-card and SSN columns in an Amazon S3 data lake. Business analysts query the data using Amazon Athena. Compliance mandates that analysts see all columns except those with PII. The solution must avoid duplicating data, follow least privilege, and require minimal maintenance. Which approach satisfies these needs?

  • Encrypt PII columns client-side before uploading to S3 and withhold the encryption key from analysts so that ciphertext values appear unreadable when they query the data.

  • Schedule Amazon Macie to classify objects daily and move any files containing PII to an encrypted quarantine bucket that analysts cannot access; analysts query the remaining bucket with Athena.

  • Register the S3 location with AWS Lake Formation, tag PII columns in the Data Catalog, and grant the analyst group column-level permissions that exclude columns tagged as PII.

  • Use an AWS Glue job to copy the dataset into a new Parquet table that omits PII columns, and direct analysts to query the new table instead of the raw data.
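
A minimal sketch of the Lake Formation option, granting column-level SELECT that excludes the PII columns; the database, table, column, and role names are hypothetical:

    import boto3

    lf = boto3.client("lakeformation")

    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
        Resource={
            "TableWithColumns": {
                "DatabaseName": "transactions_db",
                "Name": "raw_transactions",
                # Analysts can SELECT every column except the PII ones.
                "ColumnWildcard": {"ExcludedColumnNames": ["credit_card_number", "ssn"]},
            }
        },
        Permissions=["SELECT"],
    )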

Question 17 of 20

A data engineering team runs a managed Apache Airflow environment on Amazon MWAA to orchestrate nightly ETL pipelines. Company policy states that no task may use the MWAA execution role; each task must assume a job-specific IAM role automatically. The team wants to satisfy the policy without refactoring the existing DAG code. Which solution will meet these requirements with the LEAST operational overhead?

  • Create a new Docker image that includes custom Airflow configuration with job-specific credentials and attach it to the MWAA environment.

  • Edit the aws_default Airflow connection in the MWAA environment and set the role_arn extra field to the IAM role that the pipeline should assume.

  • Store long-lived access keys for each job-specific IAM user in separate Airflow connections and reference them from every task.

  • Transform each task into an AWS Lambda function that first calls sts:AssumeRole and then performs the workload.

Question 18 of 20

A data engineering team schedules an AWS Glue Spark job through Amazon EventBridge to transform and load daily CSV files from an S3 landing prefix into a partitioned analytics bucket. The job writes with append mode, and Athena reports sometimes reveal duplicate rows for the same day even though the source files are never modified. Which change will most effectively prevent these duplicates while keeping the pipeline fully automated and cost-effective?

  • Enable AWS Glue job bookmarks so the job automatically ignores files it has already processed.

  • Add an AWS Step Functions state machine that calls Athena to delete duplicate records after each load completes.

  • Configure an S3 lifecycle rule to delete files in the landing prefix immediately after the job finishes.

  • Change the Spark write operation to overwrite the existing date partition each day.
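
For reference, job bookmarks require the script to initialize and commit the job and to tag the S3 source with a transformation_ctx, and the job itself must run with --job-bookmark-option set to job-bookmark-enable. The sketch below uses placeholder bucket paths and assumes the data has a dt column for partitioning:

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)   # bookmark state is keyed by the job name

    # transformation_ctx lets the bookmark track which landing files were already read.
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://landing-bucket/daily-csv/"]},
        format="csv",
        format_options={"withHeader": True},
        transformation_ctx="daily_csv_source",
    )

    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://analytics-bucket/clicks/", "partitionKeys": ["dt"]},
        format="parquet",
        transformation_ctx="analytics_sink",
    )

    job.commit()   # persists the bookmark so processed files are skipped next run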

Question 19 of 20

A company receives CSV files in an Amazon S3 bucket that is owned by another AWS account. A data engineer must copy any new files to the company's central data-lake bucket every hour between 08:00 and 18:00, Monday through Friday. The solution must be serverless, easy to adjust for future schedule changes, and incur the lowest possible operational cost. Which approach meets these requirements MOST effectively?

  • Deploy Apache Airflow in Amazon Managed Workflows for Apache Airflow (MWAA) and create an hourly DAG that runs an AWS Data Pipeline task to replicate the files.

  • Configure an hourly AWS Glue crawler on the source bucket and trigger an AWS Glue job to copy the files into the destination bucket.

  • Launch an Amazon EC2 instance and configure a Linux cron job that runs the "aws s3 sync" command every hour to copy the objects between buckets.

  • Create an Amazon EventBridge rule with a cron expression that invokes an AWS Lambda function every hour during business hours; the function assumes a cross-account role and copies any new objects to the data-lake bucket.
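
A minimal sketch of the EventBridge-plus-Lambda option. The cron expression fires at the top of every hour from 08:00 to 18:00 UTC on weekdays; the rule name, function name, and ARNs are placeholders, and the function would also need a resource-based permission for events.amazonaws.com (not shown):

    import boto3

    events = boto3.client("events")

    events.put_rule(
        Name="hourly-cross-account-copy",
        ScheduleExpression="cron(0 8-18 ? * MON-FRI *)",  # hourly, 08:00-18:00 UTC, Mon-Fri
        State="ENABLED",
    )

    events.put_targets(
        Rule="hourly-cross-account-copy",
        Targets=[
            {
                "Id": "copy-lambda",
                "Arn": "arn:aws:lambda:us-east-1:111122223333:function:copy-new-objects",
            }
        ],
    )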

Question 20 of 20

A data lake on Amazon S3 contains a raw table with customer email addresses. Compliance requires downstream analytics to receive a deterministic pseudonym for each address so that joins remain possible, while the original email can never be inferred without an internal secret key. As the data engineer, which solution most simply applies a secret-keyed hash during anonymization while relying only on managed services?

  • Enable S3 Bucket Keys with SSE-KMS and configure an S3 Object Lambda access point to rewrite objects on the fly.

  • Deploy an AWS Lambda function triggered by S3 PUT to read each object, prepend a random value to every email before hashing, store the mapping in Amazon DynamoDB, and write the redacted file back to S3.

  • Use server-side encryption with customer-provided keys (SSE-C) on the raw bucket and rotate the keys daily.

  • Create an AWS Glue DataBrew recipe that applies the HMAC-SHA256 transformation to the email column using a secret key retrieved from AWS Secrets Manager, then write the output to a curated S3 prefix.
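
The DataBrew option relies on a deterministic keyed hash. As a plain-Python illustration of that idea (not the DataBrew recipe itself), with a hypothetical secret name:

    import hashlib
    import hmac

    import boto3

    # Hypothetical secret name; the same key must be used on every run so the
    # pseudonyms stay deterministic and joinable.
    secret = boto3.client("secretsmanager").get_secret_value(SecretId="email-hash-key")
    key = secret["SecretString"].encode("utf-8")

    def pseudonymize(email: str) -> str:
        # Identical emails always map to the same token, but the address cannot
        # be recovered without the secret key.
        return hmac.new(key, email.lower().encode("utf-8"), hashlib.sha256).hexdigest()

    print(pseudonymize("jane.doe@example.com"))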