AWS Certified Data Engineer Associate DEA-C01 Practice Question
Your company receives hourly comma-separated values (CSV) log files in an Amazon S3 prefix. Data analysts use Amazon Athena for ad-hoc queries, but scan costs and runtimes are increasing as the dataset grows. As a data engineer, you must convert both existing and future files to an optimized columnar format, partition the data by event_date, and avoid managing any servers or long-running clusters.
Which solution MOST cost-effectively meets these requirements?
Modify the source application to write Parquet files directly to the target S3 prefix and drop the existing CSV files once verified.
Enable S3 Storage Lens and apply Lifecycle rules to transition the CSV objects to the S3 Glacier Flexible Retrieval storage class after 30 days to reduce storage and Athena scan costs.
Provision an Amazon EMR cluster with Apache Hive, run a CREATE EXTERNAL TABLE … STORED AS ORC statement to convert the CSV data to ORC, and keep the cluster running to process new hourly files.
Create an AWS Glue crawler to catalog the CSV files, then schedule an AWS Glue Spark job that reads the crawler's table, writes Snappy-compressed Parquet files partitioned by event_date to a new S3 prefix, and updates the Data Catalog.
AWS Glue is a serverless ETL service: you pay only while a job runs and you manage no clusters. A Glue crawler can infer the schema of the incoming CSV files and store it in the Data Catalog. A Glue Spark job can then read the CSV data from the source prefix and write compressed Parquet files partitioned by event_date to a separate S3 prefix. The first run converts the existing files, and an hourly schedule converts new files as they arrive. Athena queries against the partitioned Parquet table scan far less data, because Athena reads only the columns a query references and only the partitions that match a filter on event_date. That lowers query cost and improves performance.
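The conversion step can be sketched in a few lines of PySpark. This is a minimal illustration under stated assumptions, not the job the question has in mind: the bucket names are hypothetical, the CSV files are assumed to have a header row that includes an event_date column, and the sketch reads the source prefix directly with plain Spark instead of going through the crawler's Data Catalog table. It also does not update the Data Catalog.

```python
from datetime import date

# Hypothetical locations; the question does not name buckets or prefixes.
SOURCE_PREFIX = "s3://example-logs/raw-csv/"
TARGET_PREFIX = "s3://example-logs/parquet/"


def partition_prefix(target_prefix: str, event_date: date) -> str:
    """Hive-style prefix that partitionBy("event_date") produces for one day."""
    return f"{target_prefix.rstrip('/')}/event_date={event_date.isoformat()}/"


def convert_csv_to_parquet(spark, source_prefix: str, target_prefix: str) -> None:
    """Read CSV logs and append Snappy-compressed Parquet partitioned by event_date.

    `spark` is a SparkSession (inside a Glue job, GlueContext.spark_session).
    Assumption: the CSV files have a header row with an event_date column.
    Without an explicit schema, every column is read as a string.
    """
    df = spark.read.option("header", "true").csv(source_prefix)
    (
        df.write.mode("append")            # keep partitions written by earlier runs
        .partitionBy("event_date")         # one event_date=<value>/ prefix per day
        .option("compression", "snappy")
        .parquet(target_prefix)
    )
```

Calling `convert_csv_to_parquet(spark, SOURCE_PREFIX, TARGET_PREFIX)` would place each day's rows under a prefix such as the one `partition_prefix` returns, which is the layout Athena uses to skip partitions that a date filter excludes. Because the write mode is append, a scheduled job would also need a way to skip files it has already converted; Glue job bookmarks are one option.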
Keeping an EMR cluster running to process the hourly files introduces persistent infrastructure that you must configure and pay for even when it is idle, which conflicts with the requirement to avoid long-running clusters and makes it less cost-effective for this workload. S3 Storage Lens and S3 Lifecycle rules address storage visibility and storage cost; they do not convert files to a columnar format or partition them. Writing Parquet directly from the source application removes the need to convert future files, but it requires changing the upstream producer, which is outside the stated scope, and it leaves the existing CSV files unconverted.
Data Ingestion and Transformation