AWS Certified Data Engineer Associate DEA-C01 Practice Question

Your company stores raw click-stream events as gzip-compressed JSON files in an S3 bucket partitioned by dt=YYYY-MM-DD. Analysts report that some records occasionally lack the required session_id field. You must generate a curated dataset in another S3 prefix that contains only valid records, can be refreshed daily, and uses standard SQL while remaining fully serverless and cost-efficient. Which solution meets these requirements?

  • Provision an Amazon EMR cluster with Hive, schedule a daily HiveQL job that selects only records with a non-null session_id and writes the output to another S3 prefix, then terminate the cluster.

  • Run a CREATE TABLE AS SELECT (CTAS) query in Amazon Athena that keeps only rows where session_id IS NOT NULL and writes the results to a new S3 prefix; schedule the statement to run daily, for example with an Amazon EventBridge schedule.

  • Create an AWS Glue DataBrew project pointing at the S3 dataset, add a recipe step to delete rows with null session_id, and run the DataBrew job on a daily schedule.

  • Load the raw files into Amazon Redshift Serverless each day, issue a SQL query to remove rows with a null session_id, and UNLOAD the cleaned data back to a different S3 location.
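
For reference, the Athena-based option can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the table name clickstream_raw, the bucket example-bucket, the per-day table naming, and the helper names build_ctas and submit are all hypothetical, and the sketch assumes the raw table is already registered in the Glue Data Catalog with dt as a string partition column. Because a CTAS statement fails if the target table already exists or the target location is not empty, the sketch writes each day to its own table and prefix; an INSERT INTO against a single curated table would be another reasonable design.

```python
from datetime import date


def build_ctas(
    run_date: date,
    raw_table: str = "clickstream_raw",  # hypothetical catalog table
    curated_prefix: str = "s3://example-bucket/curated/clickstream",  # hypothetical
) -> str:
    """Build a CTAS statement that keeps only rows with a session_id."""
    dt = run_date.isoformat()
    # One table per day, since CTAS cannot overwrite an existing table.
    table = f"clickstream_curated_{dt.replace('-', '_')}"
    return (
        f"CREATE TABLE {table}\n"
        f"WITH (format = 'PARQUET', external_location = '{curated_prefix}/dt={dt}/')\n"
        f"AS SELECT * FROM {raw_table}\n"
        f"WHERE dt = '{dt}' AND session_id IS NOT NULL"
    )


def submit(sql: str, database: str, output_location: str) -> str:
    """Submit the statement to Athena and return the query execution ID."""
    import boto3  # imported lazily so build_ctas works without AWS access

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # Print the statement for today's partition without contacting AWS.
    print(build_ctas(date.today()))
```

A daily schedule would call submit(build_ctas(date.today()), ...) from whatever serverless trigger is in use; only the filter dt = '...' AND session_id IS NOT NULL is essential to the requirement in the question.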

Data Operations and Support