AWS Certified Data Engineer Associate DEA-C01 Practice Question

A data engineer must explore a 200 GB CSV data lake on Amazon S3, remove duplicate rows, and check for malformed records. Company policy prohibits long-running clusters, and the engineer wants to perform the work from an existing Jupyter notebook in Amazon SageMaker Studio with minimal infrastructure to manage. Which approach meets these requirements while keeping costs low?

  • Launch an AWS Glue interactive session from the SageMaker Studio notebook by switching to the Glue PySpark kernel, and process the data with Apache Spark.

  • Create an Amazon EMR cluster with JupyterHub enabled, attach the notebook to the cluster, and terminate the cluster after processing.

  • Use the Athena for Apache Spark notebook interface to open a new serverless Spark session and connect the SageMaker Studio notebook to it with a JDBC driver.

  • Run ad-hoc Amazon Athena SQL queries from the notebook with the Boto3 SDK to identify and delete bad or duplicate rows.

Data Operations and Support