AWS Certified Data Engineer Associate DEA-C01 Practice Question

An analytics team runs an Amazon EMR cluster that finishes a nightly Spark batch job at 02:00 UTC. The job writes partitioned Parquet files to HDFS under /data/events/date=YYYY-MM-DD. The new files must be ingested into an Amazon S3 data lake by 03:00 UTC. The solution must minimize operational effort, avoid opening inbound ports on the cluster, and control costs. Which approach meets these requirements?

  • Reconfigure the Spark job to write its output directly to an Amazon S3 prefix by using EMRFS, then schedule an AWS Glue crawler on that prefix to catalog the daily partition.

  • Install AWS DataSync agents on the EMR core nodes and configure a nightly task to copy the HDFS folder to Amazon S3.

  • Add a nightly Amazon EMR step that runs DistCp from HDFS to an S3 bucket, orchestrated by AWS Step Functions.

  • Create an AWS Glue JDBC connection to the Hive metastore on the EMR primary (master) node and have an AWS Glue job read the HDFS location each night.
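
To make the options easier to compare, the sketch below shows how the nightly partition path from the question could map onto an S3 prefix, and what an EMR step definition for an HDFS-to-S3 copy might look like. It is illustrative only and does not indicate which option is correct. The bucket name `example-data-lake`, the helper names, and the step name are hypothetical. The copy uses `s3-dist-cp` (the EMR variant of DistCp) through `command-runner.jar` rather than plain `hadoop distcp`, which is a substitution made for this example. Nothing here calls AWS; it only builds strings and a dictionary in the shape accepted by the EMR AddJobFlowSteps API.

```python
from datetime import date

HDFS_ROOT = "/data/events"                  # path given in the question
S3_ROOT = "s3://example-data-lake/events"   # hypothetical bucket and prefix


def partition_suffix(run_date: date) -> str:
    """Return the Hive-style partition folder, e.g. 'date=2024-01-15'."""
    return f"date={run_date.isoformat()}"


def hdfs_partition(run_date: date) -> str:
    """HDFS location the nightly Spark job writes to (default NameNode)."""
    return f"hdfs://{HDFS_ROOT}/{partition_suffix(run_date)}"


def s3_partition(run_date: date) -> str:
    """Matching S3 prefix. With EMRFS, a Spark job could write here directly."""
    return f"{S3_ROOT}/{partition_suffix(run_date)}"


def copy_step(run_date: date) -> dict:
    """Build an EMR step definition that copies one partition from HDFS to S3.

    Assumption: s3-dist-cp is available on the cluster and is invoked
    through command-runner.jar.
    """
    return {
        "Name": f"copy-events-{run_date.isoformat()}",  # hypothetical name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", hdfs_partition(run_date),
                "--dest", s3_partition(run_date),
            ],
        },
    }


if __name__ == "__main__":
    step = copy_step(date(2024, 1, 15))
    print(step["HadoopJarStep"]["Args"])
```

If the Spark job writes straight to the `s3_partition` location instead, the copy step and its orchestration are no longer needed, which is the trade-off the options ask you to weigh against the 03:00 UTC deadline and the operational-effort requirement.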

Data Ingestion and Transformation