AWS Certified Data Engineer Associate DEA-C01 Practice Question
An Amazon EMR cluster is running an Apache Spark SQL job that joins a 500 GB click-stream DataFrame with a 100 MB reference DataFrame. Shuffle stages dominate the runtime and the team cannot resize the cluster or rewrite the input data. Which Spark-level change will most effectively reduce shuffle traffic and speed up the join?
Enable speculative execution by setting spark.speculation to true.
Increase the value of spark.sql.shuffle.partitions to create more shuffle tasks.
Persist both DataFrames in memory before executing the join.
Apply a broadcast join hint to the 100 MB reference DataFrame so each executor receives a local copy.
A broadcast join hint copies the 100 MB reference DataFrame to every executor, so each partition of the 500 GB DataFrame is joined locally and the large dataset is never shuffled across the network. The explicit hint matters because 100 MB exceeds the default spark.sql.autoBroadcastJoinThreshold of 10 MB, so Spark would otherwise typically plan a shuffle-based sort-merge join. Increasing spark.sql.shuffle.partitions changes how many tasks the shuffle is split into, not how much data is shuffled. Persisting the DataFrames adds memory pressure without eliminating the shuffle. Speculative execution only re-launches slow tasks and does not reduce the shuffle cost of the join. Broadcasting the small table is therefore the most effective change.
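To make the difference in data movement concrete, here is a minimal pure-Python sketch that models both join strategies and counts the rows each one sends over the network. It is a toy model, not Spark code: the function names, the one-partition-per-executor layout, and the worst-case assumption that every shuffled row leaves its node are all illustrative assumptions.

```python
from collections import defaultdict


def shuffle_join(large_parts, small_parts, num_reducers):
    """Shuffle-based join: rows of BOTH inputs are routed to the reducer
    that owns their key. Assumption: every routed row counts as moved."""
    large_buckets = defaultdict(list)
    small_buckets = defaultdict(list)
    moved = 0
    for part in large_parts:
        for key, value in part:
            large_buckets[hash(key) % num_reducers].append((key, value))
            moved += 1
    for part in small_parts:
        for key, value in part:
            small_buckets[hash(key) % num_reducers].append((key, value))
            moved += 1
    joined = []
    for reducer in range(num_reducers):
        lookup = dict(small_buckets[reducer])
        for key, value in large_buckets[reducer]:
            if key in lookup:
                joined.append((key, value, lookup[key]))
    return joined, moved


def broadcast_join(large_parts, small_parts):
    """Broadcast join: only the small table moves, one full copy per
    executor. Assumption: one large partition per executor."""
    small_rows = [row for part in small_parts for row in part]
    lookup = dict(small_rows)
    moved = len(small_rows) * len(large_parts)
    joined = []
    for part in large_parts:  # each partition is probed in place
        for key, value in part:
            if key in lookup:
                joined.append((key, value, lookup[key]))
    return joined, moved


# Hypothetical data: 4 partitions of 1,000 click rows, 10 reference rows.
large_parts = [[(i % 10, f"click-{p}-{i}") for i in range(1000)] for p in range(4)]
small_parts = [[(k, f"ref-{k}") for k in range(10)]]

shuffled, shuffle_moved = shuffle_join(large_parts, small_parts, num_reducers=4)
broadcasted, broadcast_moved = broadcast_join(large_parts, small_parts)
print("rows moved by shuffle join:", shuffle_moved)
print("rows moved by broadcast join:", broadcast_moved)
```

Both strategies produce the same joined rows, but in this model the broadcast variant moves only the reference rows. In PySpark the equivalent request is usually written with `pyspark.sql.functions.broadcast`, for example `clicks_df.join(broadcast(ref_df), "key")`, where `clicks_df`, `ref_df`, and `key` are placeholder names rather than anything given in the question.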
AWS Certified Data Engineer Associate DEA-C01
Data Ingestion and Transformation