AWS Certified Data Engineer Associate DEA-C01 Practice Question

An AWS Glue Spark job processes clickstream data stored in Amazon S3 as Parquet files partitioned by the event_date column (YYYY-MM-DD). The job runs daily with a job parameter such as DATE=2025-10-01, but the code currently executes:

path = 's3://analytics/clicks/*'                 # wildcard matches every event_date partition
df = spark.read.parquet(path)                    # reads all partitions: a full-table scan
df_filtered = df.filter(df.event_date == DATE)   # filter is applied only after the scan

The team reports that the job scans several terabytes and exceeds its 15-minute SLA. Which change will MOST effectively reduce the job's runtime with minimal additional cost?

  • Insert df = df.repartition(1) immediately after the filter to minimize the number of output files.

  • Double the executor and driver memory in the Glue job's Spark configuration.

  • Change the read path to s3://analytics/clicks/event_date=${DATE}/ (or pass a push_down_predicate for event_date) so Spark loads only the matching partition (see the sketch after the options).

  • Call df.cache() before all downstream transformations to keep the dataset in memory.
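
For illustration, here is a minimal sketch of the partition-pruned read described in the third option. The s3://analytics/clicks/ prefix and the DATE parameter come from the question; the getResolvedOptions call assumes the parameter is passed to the job as --DATE, and the Glue Data Catalog database and table names in the second variant are hypothetical.

import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

# Resolve the DATE job parameter (passed to the job as --DATE 2025-10-01).
args = getResolvedOptions(sys.argv, ['DATE'])
DATE = args['DATE']

spark = SparkSession.builder.getOrCreate()

# Variant 1: point the read at the single partition directory so Spark
# lists and scans only that day's files.
df = spark.read.parquet(f's3://analytics/clicks/event_date={DATE}/')

# Variant 2 (assumes the dataset is registered in the Glue Data Catalog
# under hypothetical names): let Glue prune partitions before reading.
glue_context = GlueContext(spark.sparkContext)
dyf = glue_context.create_dynamic_frame.from_catalog(
    database='analytics_db',                       # hypothetical database name
    table_name='clicks',                           # hypothetical table name
    push_down_predicate=f"event_date = '{DATE}'",  # only matching partitions are read
)

Either variant limits the S3 listing and scan to the one matching partition; repartitioning, extra memory, or caching leave the amount of data read from S3 unchanged.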

Data Ingestion and Transformation