AWS Certified Data Engineer Associate DEA-C01 Practice Question

A data engineer is building an AWS Glue PySpark job that runs hourly data-quality checks on a 10 TB orders dataset stored in Amazon S3. The data is heavily skewed across 12 distinct values in the order_status column; several rare statuses represent business-critical exceptions. The team must minimize cost by reading only a small fraction of the dataset while guaranteeing that every status is examined during each run. Which sampling technique BEST satisfies these requirements?

  • Implement stratified sampling on the order_status column so each status contributes a proportionate subset of records to every hourly sample.

  • Apply reservoir sampling in a single pass to collect a fixed-size subset of records.

  • Perform simple random sampling without replacement on the entire dataset at a fixed 1 percent rate.

  • Use systematic sampling by sorting the data and selecting every Nth record.
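For reference, here is a minimal PySpark sketch of stratified sampling on the order_status column, as described in the first option. It is a simplified illustration, not the full Glue job: the S3 path, the per-status fraction, and the use of a plain SparkSession instead of the Glue boilerplate are assumptions for the example.

```python
# Minimal PySpark sketch of stratified sampling on order_status.
# Assumptions (not from the question): the dataset is Parquet at
# s3://example-bucket/orders/ and a plain SparkSession stands in
# for the AWS Glue job setup.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-dq-sample").getOrCreate()

orders = spark.read.parquet("s3://example-bucket/orders/")

# Collect the distinct statuses (12 expected) and assign each one a
# sampling fraction so every stratum appears in the hourly sample.
statuses = [row["order_status"] for row in
            orders.select("order_status").distinct().collect()]
fractions = {status: 0.01 for status in statuses}  # ~1% per status

# sampleBy draws an approximate fraction of rows from each stratum
# without replacement, keyed on the order_status column.
sample = orders.sampleBy("order_status", fractions=fractions, seed=42)

# Run the hourly data-quality checks against the stratified sample.
sample.groupBy("order_status").count().show()
```

In practice, the fractions for the rare, business-critical statuses could be raised above the baseline so those strata always contribute enough rows for the quality checks.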
