AWS Certified Data Engineer Associate DEA-C01 Practice Question

An analytics team runs ad hoc PySpark jobs on Amazon EMR. Each job iteratively processes a 40 GB Parquet dataset stored in Amazon S3, rereading the files on every pass and incurring high latency and S3 request costs. Jobs launch only twice per week, making a long-running cluster too expensive. Which configuration will yield the fastest run time while minimizing total cost?

  • Create a persistent EC2 fleet with attached EBS volumes, populate them with the dataset, and mount the volumes as HDFS for all future EMR jobs.

  • Use EMRFS Consistent View and run the Spark job directly against the objects stored in Amazon S3.

  • Launch a transient EMR cluster and add an S3DistCp step to copy the input data to HDFS at startup; write final results back to S3 before the cluster terminates (a configuration sketch follows the options).

  • Enable Amazon S3 Select pushdown in Spark so each iteration reads only the required Parquet columns from S3.
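For illustration, here is a minimal boto3 sketch of the transient-cluster pattern described in the third option: an EMR cluster that stages the dataset into HDFS with an S3DistCp step, runs the PySpark job against the local copy, writes results back to S3, and then terminates. The bucket names, script path, instance types, and release label below are assumptions for the sketch, not values from the question.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical locations for illustration only.
SRC = "s3://example-bucket/input/parquet/"    # source dataset in S3
HDFS_DIR = "hdfs:///staged-input/"            # staging area on the cluster
OUT = "s3://example-bucket/output/"           # final results

response = emr.run_job_flow(
    Name="transient-pyspark-iterative",
    ReleaseLabel="emr-6.15.0",                # assumed release label
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        # Terminate automatically once all steps finish (transient cluster).
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            # Step 1: stage the Parquet dataset from S3 into HDFS once,
            # so iterative Spark passes read local disks instead of S3.
            "Name": "CopyInputToHDFS",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp", f"--src={SRC}", f"--dest={HDFS_DIR}"],
            },
        },
        {
            # Step 2: run the PySpark job against HDFS; the job itself
            # (etl.py, a hypothetical script) writes final results to S3.
            "Name": "RunPySparkJob",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/jobs/etl.py",
                         HDFS_DIR, OUT],
            },
        },
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("ClusterId:", response["JobFlowId"])
```

Setting KeepJobFlowAliveWhenNoSteps to False is what makes the cluster transient: EMR tears it down as soon as the last step finishes, so compute is billed only for the twice-weekly run time while the single S3DistCp copy avoids repeated S3 reads during iteration.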

Domain: Data Store Management