AWS Certified Data Engineer Associate DEA-C01 Practice Question
An analytics team runs an Amazon EMR cluster that finishes a nightly Spark batch job at 02:00 UTC. The job writes partitioned Parquet files to HDFS under /data/events/date=YYYY-MM-DD. The new files must be ingested into an Amazon S3 data lake by 03:00 UTC. The solution must minimize operational effort, avoid opening inbound ports on the cluster, and control costs. Which approach meets these requirements?
Reconfigure the Spark job to write its output directly to an Amazon S3 prefix by using EMRFS, then schedule an AWS Glue crawler on that prefix to catalog the daily partition.
Install AWS DataSync agents on the EMR core nodes and configure a nightly task to copy the HDFS folder to Amazon S3.
Add a nightly Amazon EMR step that runs DistCp from HDFS to an S3 bucket, orchestrated by AWS Step Functions.
Create an AWS Glue JDBC connection to the Hive metastore on the EMR master node and have an AWS Glue job read the HDFS location each night.
Writing the Spark output directly to an Amazon S3 prefix through EMRFS removes the separate copy step and lets the cluster terminate as soon as the job ends, which lowers cost. Because the data is already in S3, a scheduled AWS Glue crawler can detect the new date partition without any network access to the EMR cluster, so no inbound ports have to be opened. A Glue JDBC connection to the Hive metastore would require opening an inbound port on the EMR master node (for example, HiveServer2's default JDBC port 10000; 9083 is the metastore's Thrift port, not a JDBC endpoint), and it would expose only table metadata while the Parquet files stay in HDFS on the cluster. Using AWS DataSync would require installing and managing agents on the EMR core nodes, which adds operational overhead and keeps the cluster running longer. Running DistCp each night adds a copy step and its orchestration to maintain, and it keeps the cluster running longer than writing directly to S3 in the first place.
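As a rough illustration of the correct option, the sketch below shows how the nightly job might write one day's partition straight to S3 instead of HDFS. It is a minimal sketch under stated assumptions, not the job from the question: the bucket name `example-data-lake`, the helper names `partition_uri` and `write_nightly_events`, and the premise that `events_df` is a PySpark DataFrame holding only that day's rows (without a separate `date` column) are all hypothetical.

```python
from datetime import date

# Hypothetical bucket and prefix; substitute your own data lake location.
LAKE_BASE_URI = "s3://example-data-lake/data/events"


def partition_uri(base_uri: str, run_date: date) -> str:
    """Return the Hive-style partition location (date=YYYY-MM-DD) for one run."""
    return f"{base_uri.rstrip('/')}/date={run_date.isoformat()}"


def write_nightly_events(events_df, run_date: date,
                         base_uri: str = LAKE_BASE_URI) -> str:
    """Write one day's events as Parquet directly to S3, with no HDFS hop.

    events_df is assumed to be a PySpark DataFrame containing only the rows
    for run_date.
    """
    target = partition_uri(base_uri, run_date)
    # On EMR, an s3:// URI is handled by EMRFS. Overwriting only this day's
    # prefix means a rerun replaces that day and leaves other days untouched.
    events_df.write.mode("overwrite").parquet(target)
    return target
```

Because the output keeps the same `date=YYYY-MM-DD` layout that the job used in HDFS, a Glue crawler scheduled on the base prefix can register each night's folder as a new partition.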
Data Ingestion and Transformation