AWS Certified Data Engineer Associate DEA-C01 Practice Question

An AWS Glue Spark job joins a 500 GB clickstream fact table stored as partitioned Parquet files in Amazon S3 with a 40 MB reference table of country codes. The join causes lengthy shuffle stages and high network traffic. Without adding nodes or executors, which code change will most effectively reduce shuffle and shorten the job's runtime?

  • Repartition both DataFrames by a randomly generated salt column, then perform the join.

  • Raise the spark.sql.shuffle.partitions configuration value to double the current setting.

  • Convert both datasets from Parquet to CSV so the executor reads smaller individual files.

  • Apply the broadcast() or /*+ BROADCAST */ hint to the country-codes DataFrame before performing the join.

Data Ingestion and Transformation