AWS Certified Data Engineer Associate DEA-C01 Practice Question

A data engineer must explore a 200 GB CSV data lake on Amazon S3, remove duplicate rows, and check for malformed records. Company policy prohibits long-running clusters, and the engineer wants to perform the work from an existing Jupyter notebook in Amazon SageMaker Studio with minimal infrastructure to manage. Which approach meets these requirements while keeping costs low?

  • Launch an AWS Glue interactive session from the SageMaker Studio notebook by switching to the Glue PySpark kernel, and process the data with Apache Spark.

  • Create an Amazon EMR cluster with JupyterHub enabled, attach the notebook to the cluster, and terminate the cluster after processing.

  • Use the Athena for Apache Spark notebook interface to open a new serverless Spark session and connect the SageMaker Studio notebook to it with a JDBC driver.

  • Run ad-hoc Amazon Athena SQL queries from the notebook with the Boto3 SDK to identify and delete bad or duplicate rows.

Data Operations and Support