AWS Certified Data Engineer Associate DEA-C01 Practice Question
Domain: Data Ingestion and Transformation

A company stores raw clickstream logs in Amazon S3. A PySpark job converts each day's files to partitioned Parquet before analysts arrive. Daily input ranges from 20 GB to 2 TB. The team wants to minimize operational effort, pay only for compute actually used, and still finish processing within a 2-hour SLA. Which solution best meets these requirements?
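
For context, the kind of job the scenario describes might look like the following minimal PySpark sketch. The bucket names, the JSON-lines input format, the event_timestamp field, and the hour partition column are illustrative assumptions, not details given in the question.

```python
# Minimal sketch of the daily conversion job described in the scenario.
# Bucket names, input format, and the partition column are assumptions
# for illustration only.
import sys
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream-to-parquet").getOrCreate()

run_date = sys.argv[1]  # e.g. "2024-06-01", passed in by the scheduler

# Read the day's raw clickstream files (assumed JSON lines).
raw = spark.read.json(f"s3://example-raw-clickstream/dt={run_date}/")

# Derive an hour column so analysts can prune partitions on read;
# event_timestamp is a hypothetical field assumed to be parseable.
curated = raw.withColumn("hour", F.hour(F.to_timestamp("event_timestamp")))

(curated.write
    .mode("overwrite")
    .partitionBy("hour")
    .parquet(f"s3://example-curated-clickstream/dt={run_date}/"))
```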

  • Create an Amazon EMR Serverless Spark application and invoke the PySpark script with an AWS Step Functions workflow each morning (the invocation pattern is sketched after this list).

  • Create an AWS Glue Spark job with the G.2X worker type and increase the number of workers until the job completes within the SLA.

  • Run the job on Amazon EMR on EKS, using Spot-backed worker node groups that are scaled by the Kubernetes Cluster Autoscaler.

  • Deploy a persistent EMR cluster with On-Demand core nodes and enable cluster auto scaling; schedule the PySpark job with Apache Airflow running on the master node.
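
To illustrate the first option: a Step Functions state would typically call the EMR Serverless StartJobRun API; the equivalent boto3 call is sketched below. The application ID, IAM role ARN, and S3 paths are placeholders, and this assumes a Spark application has already been created.

```python
# Minimal sketch of starting the daily run on an EMR Serverless Spark
# application (the boto3 equivalent of the Step Functions StartJobRun
# integration). All identifiers and paths are placeholders.
import boto3

emr = boto3.client("emr-serverless")

response = emr.start_job_run(
    applicationId="00example123",  # pre-created Spark application
    executionRoleArn="arn:aws:iam::123456789012:role/example-emr-serverless-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://example-scripts/clickstream_to_parquet.py",
            "entryPointArguments": ["2024-06-01"],  # the run date argument
        }
    },
)
print(response["jobRunId"])
```

Because the application is serverless, capacity scales with each day's input (20 GB or 2 TB alike) and billing stops when the run finishes, which is what the requirements in the stem hinge on.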
