AWS Certified Data Engineer Associate DEA-C01 Practice Question
A media company uses an Amazon S3 data lake. CSV files are delivered every hour to the prefix s3://company-raw/year=/month=/day=/. A data engineer must convert each new batch to Apache Parquet, partitioned by the same date keys, and catalog the resulting tables so they are queryable in Amazon Athena. The solution must:
avoid re-processing files that were already converted
scale without provisioning or managing servers
require the least custom code
Which approach meets these requirements MOST cost-effectively?
Launch an AWS Glue Python shell job on an hourly schedule that reads the CSV files with pandas, converts them to Parquet, and writes the results to the curated prefix.
Set up an Amazon Kinesis Data Firehose delivery stream with an S3 source and Parquet output conversion enabled, then point it at the raw bucket prefix.
Create an AWS Glue Spark ETL job that reads from the raw S3 prefix, enables job bookmarks, writes the output in Parquet to an s3://company-curated/ prefix partitioned by year, month, and day, and updates the AWS Glue Data Catalog on each run.
Configure an AWS Lambda function triggered by S3 ObjectCreated events that converts each CSV file to Parquet, writes it to the curated bucket, and uses the Athena API to add partitions.
An AWS Glue ETL job that runs on the serverless Spark runtime provides built-in readers for CSV, writers for Parquet, and automatic integration with the AWS Glue Data Catalog. When job bookmarks are enabled, the job maintains state about the input files it has already processed, so subsequent runs skip those objects and process only new data. The job can be triggered on an hourly schedule with no infrastructure to manage.
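The pattern can be illustrated with a short Glue ETL script. The sketch below is hypothetical: the curated path matches the option above, but the database name, table name, and transformation_ctx labels are placeholders, and it assumes the year, month, and day columns are present in (or derived from) the input data.

```python
# Hypothetical Glue ETL (PySpark) script sketch for the CSV-to-Parquet pattern.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # bookmarks track state under this job name

# Read only new CSV objects from the raw prefix; the transformation_ctx key is
# what the job bookmark uses to remember which files were already processed.
raw = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://company-raw/"], "recurse": True},
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="raw_source",
)

# Write Parquet to the curated prefix, partitioned by the same date keys, and
# update the Glue Data Catalog so Athena sees new partitions on each run.
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://company-curated/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["year", "month", "day"],
    transformation_ctx="curated_sink",
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="media_lake", catalogTableName="events_curated")
sink.writeFrame(raw)

job.commit()  # advances the bookmark so the next run skips these objects
```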
A Python shell job does not support job bookmarks, so it would re-process the same objects on every run unless additional tracking logic is written. The Lambda approach is serverless, but it requires custom code for the Parquet conversion and for adding partitions through the Athena API, and large hourly batches can exceed Lambda's memory and 15-minute timeout limits. Kinesis Data Firehose is designed for streaming ingestion; it cannot use an S3 prefix as a source, so it cannot convert files that have already been delivered to the raw bucket.
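For reference, the bookmark option and hourly schedule described above are set on the job itself rather than in the script. A rough boto3 sketch follows; the job name, IAM role, and script location are placeholders, not values from the question.

```python
# Hypothetical provisioning sketch using boto3.
import boto3

glue = boto3.client("glue")

# Create the Spark ETL job with job bookmarks enabled by default.
glue.create_job(
    Name="csv-to-parquet-hourly",
    Role="arn:aws:iam::123456789012:role/GlueEtlRole",
    Command={
        "Name": "glueetl",  # serverless Spark runtime
        "ScriptLocation": "s3://company-scripts/csv_to_parquet.py",
        "PythonVersion": "3",
    },
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)

# Run the job at the top of every hour, matching the hourly CSV deliveries.
glue.create_trigger(
    Name="csv-to-parquet-hourly-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"JobName": "csv-to-parquet-hourly"}],
    StartOnCreation=True,
)
```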