AWS Certified Data Engineer Associate DEA-C01 Practice Question
A data engineering team runs a persistent Amazon EMR cluster that stores intermediate data in HDFS. Each night, about 50 TB of gzip-compressed log files arrive in an Amazon S3 bucket and must be copied into HDFS before downstream MapReduce jobs start. The transfer must maximize throughput, minimize S3 request costs, and use only the existing EMR cluster resources. Which solution meets these requirements?
Use AWS DataSync to transfer the objects to volumes on each core node, then import the data into HDFS.
Add an EMR step that uses S3DistCp to copy the objects from Amazon S3 to HDFS in parallel.
Mount the S3 bucket on every core node with s3fs and move the objects to HDFS with the Linux cp command.
From the master node, run the AWS CLI command "aws s3 cp --recursive" to copy the objects into HDFS.
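S3DistCp is an Amazon EMR utility built on Apache DistCp that runs as a step on the cluster. It launches a MapReduce job whose tasks copy objects in parallel across the cluster's nodes, so the transfer uses the aggregate network bandwidth of the cluster instead of that of a single node. It can also combine small files into larger ones (the --groupBy option). Because it reads each object directly through the S3 API, it avoids the extra metadata requests that a file system mount such as s3fs typically generates, which keeps request costs down. The other options fall short. Running "aws s3 cp --recursive" from the master node confines the whole transfer to one node's bandwidth, and the command writes to the local file system rather than to HDFS. DataSync is a separate managed service, so it violates the requirement to use only the existing cluster resources, and staging the data on core node volumes adds a second import step. Mounting the bucket with s3fs and using the Linux cp command provides no cluster-level parallelism and incurs high per-object overhead.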
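As an illustration of the S3DistCp step described above, the following is a minimal sketch in Python, not a definitive implementation. It builds a step definition that invokes s3-dist-cp through command-runner.jar and, optionally, submits it with boto3's add_job_flow_steps. The bucket name, HDFS path, step name, and the helper names build_s3distcp_step and submit_step are hypothetical and chosen only for this example.

```python
# Sketch only: bucket, paths, and helper names below are hypothetical.

def build_s3distcp_step(src, dest, src_pattern=None):
    """Return an EMR step definition that runs s3-dist-cp on the cluster."""
    args = ["s3-dist-cp", f"--src={src}", f"--dest={dest}"]
    if src_pattern:
        # Restrict the copy to objects whose path matches this regex.
        args.append(f"--srcPattern={src_pattern}")
    return {
        "Name": "Nightly S3 to HDFS copy",  # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar runs the named command on the cluster.
            "Jar": "command-runner.jar",
            "Args": args,
        },
    }


def submit_step(cluster_id, step):
    """Submit the step to a running cluster (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder works without boto3 installed

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]


# Hypothetical source bucket and HDFS destination.
step = build_s3distcp_step(
    src="s3://example-log-bucket/logs/",
    dest="hdfs:///data/logs/",
    src_pattern=r".*\.gz",
)
print(step["HadoopJarStep"]["Args"])
```

Calling submit_step with the ID of the persistent cluster would queue the copy as a regular step, so the work runs on the existing nodes and no additional service is involved.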
Exam domain: Data Ingestion and Transformation