AWS Certified Data Engineer Associate DEA-C01 Practice Question
A company lands 2 TB of comma-separated log files in an Amazon S3 landing prefix every night at 01:00. Analysts query the data with Amazon Athena and need each night's records available for querying within 30 minutes of arrival, stored in a curated S3 prefix as Apache Parquet and partitioned by ingestion date. The data engineering team wants the lowest operational overhead and minimal compute costs while the nightly workload is idle. Which approach meets these requirements?
Spin up a long-running Amazon EMR cluster with Apache Spark. Schedule a daily step at 01:05 that converts the files to Parquet and writes them to the curated prefix, leaving the cluster running for the next day's job.
Configure an AWS Glue Spark job that is triggered when new files arrive. The job converts the CSV input to Parquet, partitions the output by date, and writes it to the curated S3 prefix; because Glue is serverless, no compute is billed while the job is not running.
Load the CSV data into an Amazon Redshift table each night, then run an UNLOAD command to write Parquet files partitioned by date back to S3 for Athena queries.
Invoke an AWS Lambda function from each S3 PUT event. The function uses pandas to read the CSV objects, convert them to Parquet, and store the results in the curated prefix.
AWS Glue provides a fully managed, serverless Spark environment that can be triggered by Amazon S3 event notifications or a scheduled AWS Glue workflow. When the landing files arrive, Glue automatically spins up workers, reads the CSV data from the landing prefix, and writes the output as Parquet files partitioned by date to the curated prefix. Workers shut down when the job finishes, so no resources accrue charges while the pipeline is idle.
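As an illustration, a minimal Glue PySpark script for this pattern might look like the following sketch. The bucket names, prefixes, and the header assumption are illustrative, not part of the question:

```python
# Minimal AWS Glue PySpark job sketch: convert landed CSV to
# date-partitioned Parquet. S3 paths here are hypothetical examples.
import sys
from datetime import datetime, timezone

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import lit

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical landing and curated prefixes.
landing_path = "s3://example-logs/landing/"
curated_path = "s3://example-logs/curated/"

ingestion_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")

# Read the raw CSV logs, stamp each row with the ingestion date,
# and write Parquet partitioned by that date.
df = spark.read.option("header", "true").csv(landing_path)
(
    df.withColumn("ingestion_date", lit(ingestion_date))
    .write.mode("append")
    .partitionBy("ingestion_date")
    .parquet(curated_path)
)

job.commit()
```

Writing with partitionBy("ingestion_date") produces Hive-style ingestion_date=YYYY-MM-DD prefixes that Athena can prune at query time, once the new partition is registered (for example, via a Glue crawler or partition projection).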
An always-on EMR cluster could perform the same transformation, but keeping the cluster running 24×7 would incur unnecessary EC2 costs and add operational management overhead. A Lambda function cannot reliably process 2 TB of data because Lambda invocations are limited to 15 minutes and 10 GB of memory, forcing complex fan-out coordination. Loading the data into Redshift and then UNLOADing to S3 would double the storage footprint, require a Redshift cluster to be online, and add an extra data movement step, increasing both cost and latency. Therefore, the AWS Glue job is the most cost-effective and operationally simple solution.
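If the team preferred the scheduled route mentioned above (kicking the job off shortly after the 01:00 landing) instead of S3 event triggers, a Glue trigger could be created with boto3 along these lines. The trigger and job names are hypothetical:

```python
# Sketch: schedule the Glue job for 01:05 UTC daily using a Glue trigger.
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="nightly-csv-to-parquet",                # hypothetical trigger name
    Type="SCHEDULED",
    Schedule="cron(5 1 * * ? *)",                 # 01:05 UTC every day
    Actions=[{"JobName": "csv-to-parquet-job"}],  # hypothetical job name
    StartOnCreation=True,
)
```

Either trigger style keeps the pipeline serverless; the cron schedule simply trades event-driven latency for a fixed start time that still fits the 30-minute window.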