AWS Certified Data Engineer Associate DEA-C01 Practice Question

An analytics team has cataloged several Parquet tables in AWS Glue Data Catalog. You are launching an Amazon EMR 6.6 cluster running Spark SQL to process the data in Amazon S3. The cluster should query the tables without copying metadata locally, and any changes to table definitions made in the Data Catalog must become immediately available to the jobs. Which solution meets these requirements with the least operational effort?

Configure the EMR cluster to use AWS Glue Data Catalog as its Hive metastore by enabling the glue-datacatalog integration and granting the cluster role Glue permissions.
Create an Amazon Athena workgroup and connect the cluster's Spark engine to it through the JDBC driver so Spark queries can read the tables.
Add a bootstrap action that exports the Data Catalog tables as Hive DDL statements and executes them with beeline to populate the cluster's local metastore at startup.
Run an hourly AWS Glue crawler that writes updated schemas into the cluster's default Hive metastore hosted on Amazon RDS.

Answer Description

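Configuring the EMR cluster to use the AWS Glue Data Catalog as its Hive metastore lets Spark SQL and Hive reference the existing table definitions directly. The cluster is pointed at the Data Catalog by setting hive.metastore.client.factory.class to com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory in the spark-hive-site configuration classification (and in hive-site if Hive also runs on the cluster); selecting the Glue Data Catalog option for Spark table metadata in the EMR console applies the same setting. The cluster's EC2 instance profile role also needs permission to call the relevant AWS Glue APIs. Because the catalog remains external to the cluster, schema updates and new partitions registered in AWS Glue are visible to jobs on the cluster without additional scripts or crawlers. Exporting metadata to a local Hive metastore or executing DDL during bootstrap creates a separate copy that must be kept in sync, which increases operational overhead. The hourly crawler option has the same drawback and leaves definitions stale for up to an hour; in addition, AWS Glue crawlers write to the Data Catalog, not to a Hive metastore hosted on Amazon RDS. Connecting Spark through an Athena JDBC driver routes queries through Athena instead of letting Spark SQL on the cluster read the Glue metadata directly, and it adds unnecessary complexity.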
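
As an illustration of the configuration described above, here is a minimal Python sketch that builds the EMR configuration classifications as plain data. The helper name glue_catalog_configurations and its include_hive parameter are hypothetical, chosen only for this example; the classification names and the factory class are the ones AWS documents for this integration. The sketch only prints JSON and does not call any AWS API.

```python
import json

# Metastore client factory class that AWS documents for the Glue Data Catalog.
GLUE_CLIENT_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(include_hive=True):
    """Return EMR configuration classifications that point Spark SQL
    (and optionally Hive) at the AWS Glue Data Catalog.

    Hypothetical helper for illustration; it builds data only.
    """
    props = {"hive.metastore.client.factory.class": GLUE_CLIENT_FACTORY}
    # spark-hive-site covers Spark SQL; hive-site covers Hive on the cluster.
    names = ["spark-hive-site"]
    if include_hive:
        names.append("hive-site")
    # Copy the properties so each classification owns its own dict.
    return [{"Classification": name, "Properties": dict(props)} for name in names]


if __name__ == "__main__":
    # Emit JSON in the shape EMR expects for cluster configurations.
    print(json.dumps(glue_catalog_configurations(), indent=2))
```

The printed JSON could be saved to a file and supplied through the --configurations option of aws emr create-cluster, or the list could be passed as the Configurations argument of boto3's run_job_flow. Either way, the instance profile role would still need Glue read permissions such as glue:GetDatabase, glue:GetTable, and glue:GetPartitions; the exact policy depends on what the jobs do.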
