AWS Certified Data Engineer Associate DEA-C01 Practice Question
An analytics team runs ad hoc PySpark jobs on Amazon EMR. Each job iteratively processes a 40 GB Parquet data set stored in Amazon S3, rereading the files each time and incurring high latency and S3 request costs. Jobs launch only twice per week, making a long-running cluster too expensive. Which configuration will yield the fastest run time while minimizing total cost?
Create a persistent EC2 fleet with attached EBS volumes, populate them with the dataset, and mount the volumes as HDFS for all future EMR jobs.
Use EMRFS Consistent View and run the Spark job directly against the objects stored in Amazon S3.
Launch a transient EMR cluster and add an S3DistCp step to copy the input data to HDFS at startup; write final results back to S3 before the cluster terminates.
Enable Amazon S3 Select pushdown in Spark so each iteration reads only the required Parquet columns from S3.
The best choice is the transient EMR cluster with an S3DistCp step. Copying the 40 GB data set from S3 into the cluster's HDFS once at the start of each run lets Spark read the input and write intermediate data on disks attached to the nodes, so later iterations no longer issue repeated network calls to S3 and request charges drop. Because the cluster is transient, compute and storage charges stop when the job finishes, so the overall cost is lower than keeping a long-lived cluster running or rereading the data from S3 on every iteration.
The other options fall short. EMRFS Consistent View still reads every object from S3, so performance remains limited by network I/O. S3 Select targets column projection and predicate pushdown, not shuffle-heavy iterative algorithms, and it still incurs per-request charges on every pass. A persistent EC2 fleet with attached EBS volumes defeats the purpose of an on-demand EMR environment and incurs continuous instance and storage charges for a workload that runs only twice per week.
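The transient-cluster pattern described above can be sketched as a pair of EMR step definitions: one that runs S3DistCp to stage the input in HDFS, and one that submits the Spark job against the HDFS copy. This is a minimal illustration, not a definitive configuration. The bucket names, paths, script location, and the helper name `build_transient_steps` are all hypothetical, and the PySpark script is assumed to take its input and output locations as arguments and to write its final results to S3 itself.

```python
def build_transient_steps(src_s3, hdfs_input, script_s3, output_s3):
    """Build EMR step definitions: stage S3 data in HDFS, then run Spark.

    All arguments are illustrative placeholders supplied by the caller.
    """
    copy_step = {
        "Name": "Copy input from S3 to HDFS",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            # command-runner.jar lets a step invoke tools such as s3-dist-cp
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src_s3, "--dest", hdfs_input],
        },
    }
    spark_step = {
        "Name": "Run iterative PySpark job against HDFS",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # Assumption: the script reads hdfs_input and writes to output_s3
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                script_s3, hdfs_input, output_s3,
            ],
        },
    }
    return [copy_step, spark_step]


# Hypothetical locations, for illustration only
steps = build_transient_steps(
    src_s3="s3://example-bucket/input/",
    hdfs_input="hdfs:///data/input/",
    script_s3="s3://example-bucket/scripts/job.py",
    output_s3="s3://example-bucket/output/",
)

# With boto3, these steps could be passed to the EMR client's run_job_flow
# call as Steps=steps, with Instances["KeepJobFlowAliveWhenNoSteps"] set to
# False so the cluster terminates after the last step (a transient cluster).
for step in steps:
    print(step["Name"], "->", " ".join(step["HadoopJarStep"]["Args"]))
```

Setting `ActionOnFailure` to `TERMINATE_CLUSTER` is one reasonable choice here: if the copy fails, the cluster shuts down instead of running the Spark job on missing data or sitting idle and accruing cost.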