AWS Certified Data Engineer Associate DEA-C01 Practice Question

An analytics team runs ad hoc PySpark jobs on Amazon EMR. Each job iteratively processes a 40 GB Parquet dataset stored in Amazon S3, rereading the files on every pass and incurring high latency and S3 request costs. Jobs launch only twice per week, making a long-running cluster too expensive. Which configuration will yield the fastest run time while minimizing total cost?

  • Create a persistent EC2 fleet with attached EBS volumes, populate them with the dataset, and mount the volumes as HDFS for all future EMR jobs.

  • Use EMRFS Consistent View and run the Spark job directly against the objects stored in Amazon S3.

  • Launch a transient EMR cluster and add an S3DistCp step to copy the input data to HDFS at startup; write final results back to S3 before the cluster terminates (a configuration sketch follows the options).

  • Enable Amazon S3 Select pushdown in Spark so each iteration reads only the required Parquet columns from S3.
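For illustration, here is a minimal boto3 sketch of the transient-cluster pattern described in the third option: an EMR cluster that stages the dataset into HDFS with an S3DistCp step, runs the PySpark job against the local copy, writes results back to S3, and then terminates. The bucket names, script path, instance types, and release label below are assumptions for the sketch, not values from the question.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical locations for illustration only.
SRC = "s3://example-bucket/input/parquet/"    # source dataset in S3
HDFS_DIR = "hdfs:///staged-input/"            # staging area on the cluster
OUT = "s3://example-bucket/output/"           # final results

response = emr.run_job_flow(
    Name="transient-pyspark-iterative",
    ReleaseLabel="emr-6.15.0",                # assumed release label
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        # Terminate automatically once all steps finish (transient cluster).
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            # Step 1: stage the Parquet dataset from S3 into HDFS once,
            # so iterative Spark passes read local disks instead of S3.
            "Name": "CopyInputToHDFS",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp", f"--src={SRC}", f"--dest={HDFS_DIR}"],
            },
        },
        {
            # Step 2: run the PySpark job against HDFS; the job itself
            # (etl.py, a hypothetical script) writes final results to S3.
            "Name": "RunPySparkJob",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/jobs/etl.py",
                         HDFS_DIR, OUT],
            },
        },
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("ClusterId:", response["JobFlowId"])
```

Setting KeepJobFlowAliveWhenNoSteps to False is what makes the cluster transient: EMR tears it down as soon as the last step finishes, so compute is billed only for the twice-weekly run time while the single S3DistCp copy avoids repeated S3 reads during iteration.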

Domain: Data Store Management