AWS Certified Data Engineer Associate DEA-C01 Practice Question
An AWS Glue Spark job joins a 500 GB clickstream fact table stored as partitioned Parquet files in Amazon S3 with a 40 MB reference table of country codes. The join causes lengthy shuffle stages and high network traffic. Without adding nodes or executors, which code change will most effectively reduce shuffle and shorten the job's runtime?
Repartition both DataFrames by a randomly generated salt column, then perform the join.
Raise the spark.sql.shuffle.partitions configuration value to double the current setting.
Convert both datasets from Parquet to CSV so the executor reads smaller individual files.
Apply the broadcast() or /*+ BROADCAST */ hint to the country-codes DataFrame before performing the join.
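In a distributed Spark join, most of the overhead comes from shuffling rows across the network so that matching keys land on the same executor. When one table is very small (tens of megabytes) compared to the other, marking it for a broadcast join sends a full copy of the small table to every executor. Each executor then joins its partitions of the large fact table locally, so the 500 GB table is never shuffled for the join. The explicit hint matters here because Spark's default spark.sql.autoBroadcastJoinThreshold is 10 MB, so a 40 MB table is not broadcast automatically unless that threshold has been raised.
The other options do not address the bottleneck. Repartitioning both tables by a random salt column helps with skewed keys but still shuffles both tables, so total shuffle volume does not drop. Raising spark.sql.shuffle.partitions spreads the same data over more, smaller tasks but does not reduce the amount of data sent over the network. Converting Parquet to CSV discards columnar compression and column pruning, which increases I/O and data size and slows the job further. Therefore, broadcasting the 40 MB reference table is the correct optimization.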
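The difference in network traffic can be made concrete with a rough estimate. The sketch below is a back-of-the-envelope model, not a measurement: the executor count of 20 is a hypothetical value not given in the question, the function names are made up for illustration, and the table sizes are treated as the bytes actually moved (real Spark jobs move serialized, possibly compressed rows, so true figures will differ).

```python
# Rough model of network traffic for a shuffle join versus a broadcast join.
# The executor count is an illustrative assumption, not a value from the question.

GB = 1024 ** 3
MB = 1024 ** 2


def shuffle_join_bytes(fact_bytes: int, dim_bytes: int, num_executors: int) -> int:
    """Approximate bytes moved when both tables are hash-partitioned on the join key.

    With uniform hashing, a row already sits on its target executor with
    probability 1/num_executors, so the remaining fraction crosses the network.
    """
    return (fact_bytes + dim_bytes) * (num_executors - 1) // num_executors


def broadcast_join_bytes(dim_bytes: int, num_executors: int) -> int:
    """Approximate bytes moved when one copy of the small table goes to each executor."""
    return dim_bytes * num_executors


fact_table = 500 * GB      # clickstream fact table from the question
country_codes = 40 * MB    # reference table from the question
executors = 20             # hypothetical cluster size

shuffled = shuffle_join_bytes(fact_table, country_codes, executors)
broadcasted = broadcast_join_bytes(country_codes, executors)

print(f"shuffle join:   ~{shuffled / GB:.1f} GB over the network")
print(f"broadcast join: ~{broadcasted / MB:.0f} MB over the network")
```

Under these assumptions the broadcast join moves hundreds of megabytes where the shuffle join moves hundreds of gigabytes. In PySpark, the change itself is typically a one-line edit such as `fact.join(broadcast(countries), "country_code")`, with `broadcast` imported from `pyspark.sql.functions`; the DataFrame and column names here are hypothetical.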
Data Ingestion and Transformation