AWS Certified Data Engineer Associate DEA-C01 Practice Question

An Amazon Athena table stores clickstream events as Parquet files in an S3 location partitioned by year, month, and day. A nightly ETL job currently runs the following query and is incurring high scan costs:

SELECT user_id, page, event_time FROM clickstream WHERE event_time BETWEEN date '2023-07-01' AND date '2023-07-31';

How should you rewrite the SQL to scan the least amount of data without changing the table definition?

  • Include an ORDER BY year, month, day clause to ensure the data is read in partition order.

  • Append a LIMIT clause so the statement becomes:

    SELECT user_id, page, event_time FROM clickstream WHERE event_time BETWEEN date '2023-07-01' AND date '2023-07-31' LIMIT 100000;

  • Create a common table expression (CTE) that selects all columns and then filter the CTE on event_time within the main query.

  • Add a filter on the partition columns, for example:

    SELECT user_id, page, event_time FROM clickstream WHERE year = 2023 AND month = 7 AND day BETWEEN 1 AND 31;

AWS Certified Data Engineer Associate DEA-C01
Data Ingestion and Transformation
Your Score:
Settings & Objectives
Random Mixed
Questions are selected randomly from all chosen topics, with a preference for those you haven’t seen before. You may see several questions from the same objective or domain in a row.
Rotate by Objective
Questions cycle through each objective or domain in turn, helping you avoid long streaks of questions from the same area. You may see some repeat questions, but the distribution will be more balanced across topics.

Check or uncheck an objective to set which questions you will receive.

Bash, the Crucial Exams Chat Bot
AI Bot