AWS Certified Data Engineer Associate Practice Test (DEA-C01)

Use the form below to configure your AWS Certified Data Engineer Associate Practice Test (DEA-C01). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

Questions
Number of questions in the practice test
Free users are limited to 20 questions; upgrade for unlimited questions
Seconds Per Question
Determines how long you have to finish the practice test
Exam Objectives
Which exam objectives should be included in the practice test

AWS Certified Data Engineer Associate DEA-C01 Information

The AWS Certified Data Engineer – Associate certification validates your ability to design, build, and manage data pipelines on the AWS Cloud. It’s designed for professionals who transform raw data into actionable insights using AWS analytics and storage services. This certification proves you can work with modern data architectures that handle both batch and streaming data, using tools like Amazon S3, Glue, Redshift, EMR, Kinesis, and Athena to deliver scalable and efficient data solutions.

The exam covers the full data lifecycle — from ingestion and transformation to storage, analysis, and optimization. Candidates are tested on their understanding of how to choose the right AWS services for specific use cases, design secure and cost-effective pipelines, and ensure data reliability and governance. You’ll need hands-on knowledge of how to build ETL workflows, process large datasets efficiently, and use automation to manage data infrastructure in production environments.

Earning this certification demonstrates to employers that you have the technical expertise to turn data into value on AWS. It’s ideal for data engineers, analysts, and developers who work with cloud-based data systems and want to validate their skills in one of the most in-demand areas of cloud computing today. Whether you’re building data lakes, streaming pipelines, or analytics solutions, this certification confirms you can do it the AWS way — efficiently, securely, and at scale.

  • Free AWS Certified Data Engineer Associate DEA-C01 Practice Test

  • 20 Questions
  • Unlimited
  • Data Ingestion and Transformation
  • Data Store Management
  • Data Operations and Support
  • Data Security and Governance
Question 1 of 20

A data engineering team runs a persistent Amazon EMR cluster that stores intermediate data in HDFS. Each night, about 50 TB of gzip log files arrive in an Amazon S3 bucket and must be copied into HDFS before downstream MapReduce jobs start. The transfer must maximize throughput, minimize S3 request costs, and run by using only the existing EMR cluster resources. Which solution meets these requirements?

  • Mount the S3 bucket on every core node with s3fs and move the objects to HDFS with the Linux cp command.

  • From the master node, run the AWS CLI command "aws s3 cp --recursive" to copy the objects into HDFS.

  • Use AWS DataSync to transfer the objects to volumes on each core node, then import the data into HDFS.

  • Add an EMR step that uses S3DistCp to copy the objects from Amazon S3 to HDFS in parallel.
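
For context on one of the options above: S3DistCp runs as a step on the cluster itself and copies objects from Amazon S3 to HDFS in parallel across the core nodes. A minimal boto3 sketch of submitting such a step; the cluster ID, bucket, and HDFS path are placeholders:

```python
import boto3

emr = boto3.client("emr")

# Submit an S3DistCp step to an existing cluster. The copy runs as a
# distributed MapReduce job, so throughput scales with the core nodes.
emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLE12345",  # placeholder cluster ID
    Steps=[{
        "Name": "Nightly S3-to-HDFS copy",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://example-log-bucket/nightly/",
                "--dest", "hdfs:///data/logs/",
            ],
        },
    }],
)
```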

Question 2 of 20

A company stores operational data in an Amazon Aurora PostgreSQL cluster. Analysts need to join this data with large fact tables that already reside in Amazon Redshift for near-real-time ad-hoc reporting. The solution must minimize data movement and ongoing maintenance while allowing analysts to run standard SQL joins from their Redshift data warehouse. Which approach meets these requirements with the least operational overhead?

  • Set up an AWS Database Migration Service task with change data capture (CDC) to replicate the Aurora tables into Redshift and run joins on the replicated tables.

  • Create an external schema in Amazon Redshift that references the Aurora PostgreSQL database and use Amazon Redshift federated queries to join the remote tables with local fact tables.

  • Schedule an AWS Glue ETL job to load the Aurora data into Redshift staging tables every 15 minutes and join the staging tables with the fact tables.

  • Export the Aurora tables to Amazon S3 and use Redshift Spectrum external tables to join the exported data with Redshift fact tables.
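
For reference, a Redshift federated query setup starts with an external schema that points at the Aurora PostgreSQL endpoint. A hedged sketch using the Redshift Data API; the cluster, endpoint, and ARNs are placeholders:

```python
import boto3

rsd = boto3.client("redshift-data")

# Create an external schema that maps the Aurora PostgreSQL database into
# Redshift; the secret stores the Aurora credentials and the IAM role lets
# Redshift read that secret.
rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",  # placeholder
    Database="dev",
    DbUser="admin",
    Sql="""
        CREATE EXTERNAL SCHEMA aurora_ops
        FROM POSTGRES
        DATABASE 'operations' SCHEMA 'public'
        URI 'aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
        SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:aurora-creds'
    """,
)
# Analysts can then join aurora_ops.<table> with local Redshift fact tables
# in ordinary SQL, with no copy or replication pipeline to maintain.
```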

Question 3 of 20

Your company receives hourly comma-separated value (CSV) log files in an Amazon S3 prefix. Data analysts use Amazon Athena for ad-hoc queries, but scan costs and runtimes are increasing as the dataset grows. As a data engineer, you must convert both existing and future files to an optimized columnar format, partition the data by event_date, and avoid managing any servers or long-running clusters.

Which solution MOST cost-effectively meets these requirements?

  • Create an AWS Glue crawler to catalog the CSV files, then schedule an AWS Glue Spark job that reads the crawler's table, writes Snappy-compressed Parquet files partitioned by event_date to a new S3 prefix, and updates the Data Catalog.

  • Provision an Amazon EMR cluster with Apache Hive, run a CREATE EXTERNAL TABLE … STORED AS ORC statement to convert the CSV data to ORC, and keep the cluster running to process new hourly files.

  • Enable S3 Storage Lens and apply Lifecycle rules to transition the CSV objects to the S3 Glacier Flexible Retrieval storage class after 30 days to reduce storage and Athena scan costs.

  • Modify the source application to write Parquet files directly to the target S3 prefix and drop the existing CSV files once verified.
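
To illustrate the Glue-job option, a sketch of a serverless PySpark script that reads the crawler's CSV table, writes Snappy-compressed Parquet partitioned by event_date, and updates the Data Catalog through a Glue sink. Database, table, and S3 path names are placeholders:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the CSV table that the crawler registered in the Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="logs_db",          # placeholder database
    table_name="raw_csv_logs",   # placeholder table
)

# Write columnar output partitioned by event_date and register the new
# table and partitions in the Data Catalog as part of the job run.
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://example-bucket/curated/logs/",  # placeholder target prefix
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["event_date"],
)
sink.setCatalogInfo(catalogDatabase="logs_db", catalogTableName="curated_logs")
sink.setFormat("glueparquet")  # Glue's Parquet writer; Snappy is the default codec
sink.writeFrame(dyf)

job.commit()
```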

Question 4 of 20

An application writes 2 TB of structured transactional data as comma-separated files to an S3 bucket each day. Analysts query the data with Amazon Athena and experience long runtimes and high scan charges. A data engineer will add a nightly AWS Glue Spark job to transform the data. Which transformation will best address the volume characteristics while retaining the relational schema?

  • Merge all daily CSV files into a single uncompressed file to reduce S3 object overhead.

  • Compress the existing CSV files with Gzip and remove all header rows.

  • Split each CSV file into chunks no larger than 128 MB to increase Athena parallelism.

  • Convert the files to Apache Parquet, apply Snappy compression, and partition the dataset by transaction_date.

Question 5 of 20

You run an AWS Glue 3.0 Spark job written in Python that reads 50,000 gzip-compressed JSON files (about 100 KB each) from one Amazon S3 prefix, transforms the data, and writes Parquet files back to S3. The job uses the default 10 G.1X DPUs and currently completes in eight hours while average CPU utilization stays under 30 percent. Which modification will most improve performance without increasing cost?

  • Use create_dynamic_frame_from_options with connection_options {"groupFiles": "inPartition", "groupSize": "134217728"} so Glue combines many small objects before processing.

  • Write the Parquet output with the Zstandard compression codec to shrink the file sizes.

  • Enable AWS Glue job bookmarking so previously processed files are skipped.

  • Add --conf spark.executor.memory=16g to the job parameters to increase executor heap size.
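
For context on the file-grouping option: Glue can coalesce many small S3 objects into larger read groups at load time. A minimal sketch of the read call under the assumption that the rest of the job is unchanged; the prefix is a placeholder:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext())

# Combine thousands of ~100 KB objects into ~128 MB read groups so each
# Spark task processes a sensible amount of data instead of one tiny file.
dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://example-bucket/raw/json/"],  # placeholder prefix
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "134217728",  # ~128 MB per group
    },
    format="json",  # gzip-compressed JSON is decompressed on read
)
```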

Question 6 of 20

A company stores application logs as compressed JSON files in an Amazon S3 location that is partitioned by the prefix logs/region/date=YYYY-MM-DD. A data engineer created an AWS Glue crawler that builds an Athena table so analysts can run ad-hoc queries. The crawler runs on a daily schedule, but after several months it spends most of its run time re-processing unchanged folders, delaying data availability for the most recent partition.

Which crawler configuration change will minimize the crawl time without requiring code changes to the ingest process?

  • Enable partition projection in the Athena table and delete the crawler.

  • Change the crawler's recrawl behavior to CRAWL_NEW_FOLDERS_ONLY so it processes only folders that were added since the last run.

  • Switch the crawler trigger to Amazon S3 event notifications so it runs once for every new object.

  • Configure the crawler to create a separate table for each region/date folder.
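
For reference, the recrawl behavior is a crawler-level setting that can be changed without touching the ingest code. A hedged boto3 sketch (the crawler name is a placeholder); the incremental mode expects the schema-change behaviors to be set to LOG:

```python
import boto3

glue = boto3.client("glue")

# Re-point the existing daily crawler at new folders only, so unchanged
# region/date prefixes are skipped on each run.
glue.update_crawler(
    Name="daily-logs-crawler",  # placeholder crawler name
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
)
```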

Question 7 of 20

A retail company runs nightly AWS Glue ETL jobs that load data into an Amazon Redshift cluster. The job script currently hard-codes the database user name and password. Security now requires removing plaintext credentials, rotating the password automatically every 30 days, and making no changes to the ETL code. Which solution meets these requirements most securely?

  • Store the database credentials as SecureString parameters in AWS Systems Manager Parameter Store and schedule an Amazon EventBridge rule that invokes a Lambda function every 30 days to update the parameters; grant the Glue job role ssm:GetParameters permission.

  • Save the credentials in the AWS Glue Data Catalog connection properties and enable automatic rotation in the connection settings.

  • Encrypt the user name and password with AWS KMS and place the ciphertext in environment variables of the Glue job; configure KMS key rotation every 30 days.

  • Create an AWS Secrets Manager secret for the Redshift cluster, enable automatic rotation, update the existing AWS Glue connection to reference the secret's ARN, and add secretsmanager:GetSecretValue permission to the Glue job role.
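
For context on the secret-based option, a hedged sketch of storing the credentials in AWS Secrets Manager and pointing an existing Glue connection at the secret. All names and ARNs are placeholders, and the SECRET_ID connection property should be verified for your Glue version; rotation itself is enabled on the secret (Secrets Manager uses a rotation Lambda for Amazon Redshift credentials):

```python
import json
import boto3

sm = boto3.client("secretsmanager")
glue = boto3.client("glue")

# Store the Redshift credentials once; the ETL code never sees them again.
secret = sm.create_secret(
    Name="prod/redshift/etl-user",  # placeholder secret name
    SecretString=json.dumps({"username": "etl_user", "password": "REPLACE_ME"}),
)

# Reference the secret from the Glue connection instead of plaintext values.
glue.update_connection(
    Name="redshift-conn",  # placeholder connection name
    ConnectionInput={
        "Name": "redshift-conn",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
            "SECRET_ID": secret["ARN"],  # assumed property for secret-backed connections
        },
    },
)
```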

Question 8 of 20

A data engineering team keeps the Python script for an AWS Glue ETL job in an AWS CodeCommit repository. The team wants every commit to automatically (1) package the script, (2) update a development Glue job, (3) pause for manager approval, and (4) promote the change to the production Glue job. Which approach delivers this CI/CD workflow with the least custom code and operational overhead?

  • Configure an Amazon EventBridge rule to start an AWS Glue workflow that copies the latest script to both development and production jobs, then ask engineers to manually trigger the production job after testing.

  • Create an AWS CodePipeline with a CodeCommit source stage, a CodeBuild stage that packages the script to Amazon S3, a CloudFormation deploy action for the development Glue job, a manual approval action, and a second CloudFormation deploy action for the production Glue job.

  • Add an S3 trigger to both Glue job script locations that invokes a Lambda function; the function pulls the latest commit from CodeCommit and updates the jobs without any intermediate steps.

  • Use AWS CodeDeploy to create deployment groups for the Glue job and set up a deployment pipeline that pushes the script to development and production, inserting a wait step before the production deployment.

Question 9 of 20

Your analytics team plans to land about 2 TB of new, structured sales data in AWS each day. They must run complex SQL joins across 100 TB of historical data, support roughly 200 concurrent dashboard users, and load new data continuously without locking running queries. Queries should complete within seconds. Which managed AWS data store is the most appropriate?

  • Create an Amazon Redshift cluster with RA3 nodes and enable Concurrency Scaling.

  • Run an Amazon EMR cluster and execute Apache Hive queries on Parquet files stored in Amazon S3.

  • Deploy Amazon RDS for PostgreSQL on db.r6g.16xlarge with provisioned IOPS and multiple read replicas.

  • Store the data in Amazon DynamoDB using on-demand capacity and query it with PartiQL.

Question 10 of 20

A data engineering team uses AWS Step Functions to launch a transient Amazon EMR 6.x cluster nightly to run a PySpark ETL step, after which the cluster terminates automatically. When a step fails, the cluster shuts down before engineers can view Spark driver and executor logs. The team must retain detailed logs and the Spark history UI for post-mortem analysis while adding minimal EC2 cost. Which action meets these requirements?

  • Enable termination protection and disable auto-termination so the cluster remains available for manual log retrieval via SSH.

  • Configure EMRFS Consistent View so logs are automatically synchronized to Amazon S3 after each task.

  • Specify an Amazon S3 log URI and enable persistent application user interfaces for Spark when creating the EMR cluster.

  • Enable CloudTrail data events on the input data bucket to capture Spark driver logs for later review.

Question 11 of 20

A data engineering team processes log files stored in Amazon S3. Nightly AWS Glue ETL jobs write curated data back to S3, while analysts run ad-hoc queries with Amazon Athena and Apache Spark on Amazon EMR. Maintaining separate metastores for each service has resulted in schema drift and extra administration. The team needs a single, serverless data catalog that all three services can reference directly, with the least operational overhead. Which approach satisfies these requirements?

  • Run an Apache Hive metastore on the EMR primary node and connect Athena to it with AWS Glue connectors.

  • Create external schemas in Amazon Redshift and have Athena and EMR issue federated queries against them.

  • Store table metadata in an Amazon DynamoDB table and update Athena and EMR Spark jobs to read from it using custom code.

  • Use the AWS Glue Data Catalog as the unified metastore and configure both Athena and EMR to reference it.
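
For reference, pointing EMR applications at the AWS Glue Data Catalog is a cluster configuration rather than custom code, and Athena uses the same catalog by default. A hedged sketch of launching a cluster with Spark and Hive both using the Glue catalog as their metastore; instance types, roles, and names are placeholders:

```python
import boto3

emr = boto3.client("emr")

# Both classifications point the on-cluster metastore client at the Glue
# Data Catalog, so EMR Spark/Hive, Athena, and Glue ETL share one set of
# table definitions.
glue_catalog = {
    "hive.metastore.client.factory.class":
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
}

emr.run_job_flow(
    Name="analytics-cluster",   # placeholder
    ReleaseLabel="emr-6.15.0",  # placeholder release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Configurations=[
        {"Classification": "hive-site", "Properties": glue_catalog},
        {"Classification": "spark-hive-site", "Properties": glue_catalog},
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```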

Question 12 of 20

A data engineer must explore a 200 GB CSV data lake on Amazon S3, remove duplicate rows, and check for malformed records. Company policy prohibits long-running clusters, and the engineer wants to perform the work from an existing Jupyter notebook in Amazon SageMaker Studio with minimal infrastructure to manage. Which approach meets these requirements while keeping costs low?

  • Run ad-hoc Amazon Athena SQL queries from the notebook with the Boto3 SDK to identify and delete bad or duplicate rows.

  • Use the Athena for Apache Spark notebook interface to open a new serverless Spark session and connect the SageMaker Studio notebook to it with a JDBC driver.

  • Create an Amazon EMR cluster with JupyterHub enabled, attach the notebook to the cluster, and terminate the cluster after processing.

  • Launch an AWS Glue interactive session from the SageMaker Studio notebook by switching to the Glue PySpark kernel and process the data with Apache Spark.

Question 13 of 20

A retail company stores clickstream records in Amazon S3 using the prefix structure s3://bucket/events/year=YYYY/month=MM/day=DD/hour=HH/. An AWS Glue Data Catalog table exposes the data to Amazon Athena. Hundreds of new hour-level partitions arrive each day, and analysts must query the most recent data within minutes while keeping maintenance cost low. Which solution best meets these requirements?

  • Schedule an AWS Glue crawler to run every 5 minutes to discover and add new partitions.

  • Enable partition projection on the Glue Data Catalog table and define templates for year, month, day, and hour.

  • Instruct analysts to execute MSCK REPAIR TABLE before each Athena query to refresh partition metadata.

  • Configure Amazon S3 event notifications to trigger an AWS Lambda function that calls BatchCreatePartition for every new object.
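
For context on the partition projection option: projection is configured as table properties, after which Athena computes partition locations from templates instead of waiting for catalog partition metadata. A hedged sketch that applies the properties with an Athena DDL statement; the database, table, bucket, ranges, and workgroup are placeholders:

```python
import boto3

athena = boto3.client("athena")

# Define integer projections for each partition key and a storage location
# template that mirrors the existing prefix layout.
ddl = """
ALTER TABLE clickstream.events SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer', 'projection.month.range' = '1,12', 'projection.month.digits' = '2',
  'projection.day.type' = 'integer',   'projection.day.range' = '1,31',   'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',  'projection.hour.range' = '0,23',  'projection.hour.digits' = '2',
  'storage.location.template' = 's3://bucket/events/year=${year}/month=${month}/day=${day}/hour=${hour}/'
)
"""

athena.start_query_execution(
    QueryString=ddl,
    WorkGroup="primary",  # placeholder workgroup
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)
```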

Question 14 of 20

A security team needs to audit API activity across 50 AWS accounts that belong to a single AWS Organization. They must aggregate all CloudTrail management events in near-real time, keep the logs immutable for 365 days, and let analysts run ad-hoc SQL queries without exporting the data to another service. Which solution requires the LEAST ongoing operational effort?

  • In each member account, stream CloudTrail events to CloudWatch Logs and subscribe the log groups to an Amazon OpenSearch Service domain for search and analysis.

  • Enable Amazon Security Lake across the organization to collect CloudTrail management events and query the Parquet files in the Security Lake S3 buckets with Athena.

  • Configure an organization CloudTrail trail that delivers logs to an S3 bucket protected with S3 Object Lock, catalog the logs with AWS Glue, and query them using Amazon Athena.

  • Create an organization event data store in AWS CloudTrail Lake from the delegated administrator account, set one-year extendable retention, and grant analysts permission to run Lake SQL queries.
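
For reference, an organization-scoped CloudTrail Lake event data store is created once and then queried in place with SQL. A hedged boto3 sketch from the delegated administrator account; the names and the sample query are placeholders:

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# One event data store for all management events across the organization,
# retained for 365 days.
eds = cloudtrail.create_event_data_store(
    Name="org-management-events",  # placeholder name
    OrganizationEnabled=True,
    MultiRegionEnabled=True,
    RetentionPeriod=365,  # days
    AdvancedEventSelectors=[{
        "Name": "Management events only",
        "FieldSelectors": [{"Field": "eventCategory", "Equals": ["Management"]}],
    }],
)

# Analysts query the store directly; the FROM clause uses the data store ID.
eds_id = eds["EventDataStoreArn"].split("/")[-1]
cloudtrail.start_query(
    QueryStatement=(
        f"SELECT eventSource, eventName, count(*) AS calls FROM {eds_id} "
        "WHERE eventTime > '2025-01-01 00:00:00' "
        "GROUP BY eventSource, eventName ORDER BY calls DESC"
    )
)
```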

Question 15 of 20

A fintech startup captures tick-level trade events in an Amazon Kinesis Data Stream. Business analysts need to run near-real-time SQL queries in Amazon Redshift with end-to-end latency under 15 seconds. The team wants the simplest, most cost-effective solution and does not want to manage intermediate Amazon S3 staging or custom infrastructure. Which approach should the data engineer implement to meet these requirements?

  • Build an AWS Glue streaming job that reads from the Kinesis stream and writes batches to Amazon Redshift using JDBC.

  • Create a materialized view in Amazon Redshift that references the Kinesis stream with the KINESIS clause and enable auto-refresh for continuous ingestion.

  • Configure Amazon Kinesis Data Firehose to deliver the stream to an S3 bucket and schedule a Redshift COPY command to load the files every minute.

  • Attach an AWS Lambda function as a stream consumer that buffers events and inserts them into Amazon Redshift through the Data API.
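
For context on the materialized-view option: Redshift streaming ingestion maps the Kinesis stream through an external schema and keeps a materialized view refreshed directly from the shards, with no S3 staging. A hedged sketch via the Redshift Data API; the cluster, role, and stream names are placeholders, and the exact payload-parsing expression depends on the record format:

```python
import boto3

rsd = boto3.client("redshift-data")

statements = [
    # Map the Kinesis data stream into Redshift.
    """
    CREATE EXTERNAL SCHEMA kds
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftStreamingRole'
    """,
    # Materialized view over the stream; AUTO REFRESH keeps ingestion continuous.
    """
    CREATE MATERIALIZED VIEW trade_events AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           partition_key,
           JSON_PARSE(kinesis_data) AS payload
    FROM kds."trade-ticks"
    """,
]

for sql in statements:
    rsd.execute_statement(
        ClusterIdentifier="analytics-cluster",  # placeholder
        Database="dev",
        DbUser="admin",
        Sql=sql,
    )
```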

Question 16 of 20

Your company stores JSON transaction logs in Amazon S3 using the prefix s3://company-logs/year=YYYY/month=MM/day=DD/. Analysts query the data with Amazon Athena. You must configure an AWS Glue crawler that automatically adds each new day folder as a Data Catalog partition, deletes the partition when the folder is removed, and finishes quickly by scanning only changed objects. Which Glue crawler settings meet these requirements?

  • Set RecrawlPolicy RecrawlBehavior = CRAWL_EVENT_MODE and SchemaChangePolicy DeleteBehavior = DELETE_FROM_DATABASE (UpdateBehavior = LOG).

  • Set RecrawlPolicy RecrawlBehavior = CRAWL_NEW_FOLDERS_ONLY and SchemaChangePolicy DeleteBehavior = LOG.

  • Set RecrawlPolicy RecrawlBehavior = CRAWL_EVERYTHING and SchemaChangePolicy DeleteBehavior = DELETE_FROM_DATABASE.

  • Schedule a nightly full crawl with SchemaChangePolicy UpdateBehavior = UPDATE_IN_DATABASE and DeleteBehavior = LOG.

Question 17 of 20

An analytics team stores click-stream data as Parquet files in Amazon S3, partitioned by year/month/day (for example, s3://datalake/events/year=2025/month=10/day=07/). A daily AWS Glue crawler adds partitions to the AWS Glue Data Catalog so analysts can query the table in Amazon Athena. After two years the crawler's runtime and cost have increased significantly. The team wants to keep automatic partition discovery while minimizing ongoing cost and administration. What should they do?

  • Switch to Amazon S3 event notifications that invoke an AWS Glue job calling the batchCreatePartition API to add each new partition to the Data Catalog.

  • Change the existing crawler's recrawl policy to crawl new folders only and enable partition indexes on the Data Catalog table.

  • Enable partition projection for the Athena table, configure the year, month, and day keys, and stop scheduling the AWS Glue crawler.

  • Create an AWS Lambda function that runs MSCK REPAIR TABLE after each crawler run to update the Data Catalog incrementally.

Question 18 of 20

An Amazon Redshift cluster runs in private subnets without a NAT gateway. The cluster must query only the objects in the s3://dept-finance/raw/ prefix by using Redshift Spectrum. A VPC interface endpoint (AWS PrivateLink) for Amazon S3 already exists in the subnets. Which action enforces this restriction while leaving other VPC workloads unaffected?

  • Replace the interface endpoint with an S3 gateway endpoint, associate it with the private subnets, and create a bucket policy that limits access to the raw/ prefix.

  • Add a bucket policy on the dept-finance bucket that allows GetObject only from the specified VPC endpoint and raw/ prefix while denying all other access paths.

  • Modify the Redshift cluster's IAM role to allow s3:GetObject on dept-finance/raw/* and s3:ListBucket on the dept-finance bucket, leaving the endpoint configuration unchanged.

  • Attach a custom IAM endpoint policy to the S3 interface VPC endpoint that permits s3:GetObject on arn:aws:s3:::dept-finance/raw/*, s3:ListBucket on arn:aws:s3:::dept-finance, and denies all other S3 actions.
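
For reference, an endpoint policy like the one described in the last option is attached to the interface endpoint itself, so it constrains only traffic that flows through that endpoint. A hedged sketch; the endpoint ID is a placeholder, and endpoint policies implicitly deny anything they do not allow:

```python
import json
import boto3

ec2 = boto3.client("ec2")

# Allow only reads from the raw/ prefix (plus the listing Spectrum needs)
# through this specific S3 interface endpoint.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::dept-finance/raw/*",
        },
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::dept-finance",
            "Condition": {"StringLike": {"s3:prefix": ["raw/*"]}},
        },
    ],
}

ec2.modify_vpc_endpoint(
    VpcEndpointId="vpce-0abc123example",  # placeholder endpoint ID
    PolicyDocument=json.dumps(policy),
)
```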

Question 19 of 20

A data engineer is generating an AWS Step Functions workflow from a dependency table containing up to 10,000 tasks, each with at most 30 downstream dependencies. The engineer must store the directed acyclic graph in memory inside a 512 MB Lambda function and run a topological sort in O(V+E) time. Which in-memory representation best meets these requirements?

  • A 10,000 × 10,000 boolean adjacency matrix stored in memory.

  • An adjacency list implemented as a dictionary that maps each task ID to a list of its dependent task IDs.

  • A nested dictionary that maps each source task ID to a dictionary of destination IDs set to true.

  • A single list containing one JSON object for every edge, scanned each time the graph is traversed.
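
To make the data-structure trade-off concrete: an adjacency list stores only the roughly 10,000 vertices and up to 300,000 edges (versus 100 million cells for a full adjacency matrix) and supports Kahn's topological sort in O(V+E). A minimal sketch with hypothetical task IDs:

```python
from collections import defaultdict, deque


def topological_sort(edges, num_tasks):
    """Kahn's algorithm over an adjacency list; runs in O(V + E) time and
    uses memory proportional to tasks plus dependencies."""
    adjacency = defaultdict(list)   # task ID -> downstream task IDs
    in_degree = [0] * num_tasks

    for src, dst in edges:
        adjacency[src].append(dst)
        in_degree[dst] += 1

    ready = deque(t for t in range(num_tasks) if in_degree[t] == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for downstream in adjacency[task]:
            in_degree[downstream] -= 1
            if in_degree[downstream] == 0:
                ready.append(downstream)

    if len(order) != num_tasks:
        raise ValueError("cycle detected: dependency table is not a DAG")
    return order


# Example: task 0 precedes 1 and 2, which both precede 3.
print(topological_sort([(0, 1), (0, 2), (1, 3), (2, 3)], 4))
```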

Question 20 of 20

A data engineer must catalog tables from an Amazon RDS for MySQL database that sits in a private subnet with no NAT or internet gateway. The engineer is creating a new AWS Glue crawler to read the schema. Which configuration will allow the crawler to reach the database without exposing it publicly or adding extra network infrastructure?

  • Do not create any connection; selecting Amazon RDS as the data store is sufficient because Glue can connect to all regional RDS endpoints by default.

  • Create a JDBC connection with the default Glue security group; the crawler will automatically route through the account's NAT gateway.

  • Create a network connection that uses a public subnet with an internet gateway so the crawler can reach the database over its public endpoint.

  • Create a JDBC AWS Glue connection that specifies the RDS endpoint, references credentials in AWS Secrets Manager, and selects the same VPC, private subnet, and a security group allowing port 3306.
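
For context on the JDBC-connection option, a hedged boto3 sketch of a Glue connection pinned to the database's VPC and private subnet. Every identifier is a placeholder; the security group must allow inbound port 3306 (typically via a self-referencing rule), and the crawler then runs entirely inside the VPC:

```python
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "rds-mysql-private",  # placeholder connection name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            # Private RDS endpoint and a Secrets Manager secret for credentials.
            "JDBC_CONNECTION_URL": "jdbc:mysql://mydb.abc123.us-east-1.rds.amazonaws.com:3306/appdb",
            "SECRET_ID": "arn:aws:secretsmanager:us-east-1:123456789012:secret:rds-mysql-creds",
        },
        "PhysicalConnectionRequirements": {
            # Glue creates its elastic network interfaces here, so traffic to
            # the database never leaves the VPC.
            "SubnetId": "subnet-0abc123example",
            "SecurityGroupIdList": ["sg-0abc123example"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```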