AWS Certified Data Engineer Associate DEA-C01 Practice Question

An AWS Glue Spark job joins a 500 GB clickstream fact table stored as partitioned Parquet files in Amazon S3 with a 40 MB reference table of country codes. The join causes lengthy shuffle stages and high network traffic. Without adding nodes or executors, which code change will most effectively reduce shuffle and shorten the job's runtime?

Convert both datasets from Parquet to CSV so the executor reads smaller individual files.
Apply the broadcast() or /*+ BROADCAST */ hint to the country-codes DataFrame before performing the join.
Raise the spark.sql.shuffle.partitions configuration value to double the current setting.
Repartition both DataFrames by a randomly generated salt column, then perform the join.
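
To make the shuffle cost in the options above concrete, here is a minimal pure-Python sketch, not Spark itself. It counts how many rows would cross the network when both tables are re-partitioned by join key, compared with copying only the small table to each fact partition. The function names, the tiny in-memory datasets, and the "rows moved" metric are illustrative assumptions. In PySpark, the corresponding hint is `pyspark.sql.functions.broadcast(df)` applied to the small DataFrame.

```python
from collections import defaultdict


def shuffle_join(fact_parts, dim_parts, num_reducers):
    """Model a shuffle join: every row of BOTH tables is routed by key hash."""
    moved = 0
    fact_buckets = defaultdict(list)
    dim_buckets = defaultdict(dict)
    for part in fact_parts:
        for key, value in part:
            fact_buckets[hash(key) % num_reducers].append((key, value))
            moved += 1  # each fact row is sent to its reducer
    for part in dim_parts:
        for key, name in part:
            dim_buckets[hash(key) % num_reducers][key] = name
            moved += 1  # each reference row is sent to its reducer
    joined = []
    for reducer, rows in fact_buckets.items():
        lookup = dim_buckets[reducer]
        joined.extend((k, v, lookup[k]) for k, v in rows if k in lookup)
    return joined, moved


def broadcast_join(fact_parts, dim_parts):
    """Model a broadcast join: only the small table moves; fact rows stay put."""
    lookup = {key: name for part in dim_parts for key, name in part}
    # Assumption: one copy per fact partition, a pessimistic stand-in for
    # one copy per executor in a real cluster.
    moved = len(lookup) * len(fact_parts)
    joined = []
    for part in fact_parts:
        joined.extend((k, v, lookup[k]) for k, v in part if k in lookup)
    return joined, moved


# Hypothetical data: 4 fact partitions of 1,000 rows each, 3 country codes.
countries = [("US", "United States"), ("DE", "Germany"), ("JP", "Japan")]
dim_parts = [countries[:2], countries[2:]]
codes = [code for code, _ in countries]
fact_parts = [
    [(codes[(p + i) % 3], p * 1000 + i) for i in range(1000)]
    for p in range(4)
]

shuffled, shuffle_moved = shuffle_join(fact_parts, dim_parts, num_reducers=8)
broadcasted, broadcast_moved = broadcast_join(fact_parts, dim_parts)

print("rows moved with shuffle join:  ", shuffle_moved)
print("rows moved with broadcast join:", broadcast_moved)
```

With this toy data the shuffle version moves 4,003 rows and the broadcast version moves 12, and both produce the same joined rows. The real saving depends on cluster layout and data sizes. Note also that Spark's `spark.sql.autoBroadcastJoinThreshold` defaults to 10 MB, so a 40 MB table would not normally be broadcast automatically without an explicit hint or a raised threshold.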

Data Ingestion and Transformation
