AWS Certified Data Engineer Associate DEA-C01 Practice Question

An AWS Glue Spark job processes clickstream data stored in Amazon S3 as Parquet files partitioned by the event_date column (YYYY-MM-DD). The job runs daily with a job parameter such as DATE=2025-10-01, but the code currently executes:

path = 's3://analytics/clicks/*'                 # wildcard matches every event_date partition
df = spark.read.parquet(path)                    # reads all partitions: a full-table scan
df_filtered = df.filter(df.event_date == DATE)   # filter is applied only after the scan

The team reports that the job scans several terabytes and exceeds its 15-minute SLA. Which change will MOST effectively reduce the job's runtime with minimal additional cost?

  • Insert df = df.repartition(1) immediately after the filter to minimize the number of output files.

  • Double the executor and driver memory in the Glue job's Spark configuration.

  • Change the read path to s3://analytics/clicks/event_date=${DATE}/ (or pass a push_down_predicate for event_date) so Spark loads only the matching partition (see the sketch after the options).

  • Call df.cache() before all downstream transformations to keep the dataset in memory.
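
For illustration, here is a minimal sketch of the partition-pruned read described in the third option. The s3://analytics/clicks/ prefix and the DATE parameter come from the question; the getResolvedOptions call assumes the parameter is passed to the job as --DATE, and the Glue Data Catalog database and table names in the second variant are hypothetical.

import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

# Resolve the DATE job parameter (passed to the job as --DATE 2025-10-01).
args = getResolvedOptions(sys.argv, ['DATE'])
DATE = args['DATE']

spark = SparkSession.builder.getOrCreate()

# Variant 1: point the read at the single partition directory so Spark
# lists and scans only that day's files.
df = spark.read.parquet(f's3://analytics/clicks/event_date={DATE}/')

# Variant 2 (assumes the dataset is registered in the Glue Data Catalog
# under hypothetical names): let Glue prune partitions before reading.
glue_context = GlueContext(spark.sparkContext)
dyf = glue_context.create_dynamic_frame.from_catalog(
    database='analytics_db',                       # hypothetical database name
    table_name='clicks',                           # hypothetical table name
    push_down_predicate=f"event_date = '{DATE}'",  # only matching partitions are read
)

Either variant limits the S3 listing and scan to the one matching partition; repartitioning, extra memory, or caching leave the amount of data read from S3 unchanged.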

Data Ingestion and Transformation