AWS Certified Data Engineer Associate DEA-C01 Practice Question

An Amazon EMR cluster is running an Apache Spark SQL job that joins a 500 GB click-stream DataFrame with a 100 MB reference DataFrame. Shuffle stages dominate the runtime and the team cannot resize the cluster or rewrite the input data. Which Spark-level change will most effectively reduce shuffle traffic and speed up the join?

  • Enable speculative execution by setting spark.speculation to true.

  • Increase the value of spark.sql.shuffle.partitions to create more shuffle tasks.

  • Persist both DataFrames in memory before executing the join.

  • Apply a broadcast join hint to the 100 MB reference DataFrame so each executor receives a local copy.
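
As a rough illustration of what is at stake in the broadcast option, the sketch below estimates worst-case network traffic for a shuffle-based join versus a broadcast join. The data sizes come from the question. The executor count of 50 and the helper function names are assumptions made up for this example, and real traffic also depends on data locality, compression, and serialization.

```python
# Back-of-envelope model, not a measurement.
# Sizes are taken from the question; the executor count is an assumed value.
MB_PER_GB = 1024


def shuffle_join_traffic_mb(large_mb: int, small_mb: int) -> int:
    """Shuffle-based join: both inputs are repartitioned by the join key,
    so in the worst case every row of both sides crosses the network."""
    return large_mb + small_mb


def broadcast_join_traffic_mb(small_mb: int, num_executors: int) -> int:
    """Broadcast join: only the small side moves, one copy per executor.
    The large side is joined in place and is not shuffled for the join."""
    return small_mb * num_executors


clickstream_mb = 500 * MB_PER_GB  # 500 GB click-stream DataFrame
reference_mb = 100                # 100 MB reference DataFrame
executors = 50                    # assumption for illustration only

shuffle_mb = shuffle_join_traffic_mb(clickstream_mb, reference_mb)
broadcast_mb = broadcast_join_traffic_mb(reference_mb, executors)

print(f"shuffle join, worst case:   {shuffle_mb:,} MB moved")
print(f"broadcast join, worst case: {broadcast_mb:,} MB moved")
```

In PySpark, the hint is typically written as clicks.join(broadcast(ref), "key"), with broadcast imported from pyspark.sql.functions; clicks, ref, and key are hypothetical names. Because spark.sql.autoBroadcastJoinThreshold defaults to 10 MB, a 100 MB table would not normally be broadcast automatically, so an explicit hint or a raised threshold is needed.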

Domain: Data Ingestion and Transformation