GCP Professional Data Engineer Practice Question

Your on-premises 50-node Hadoop cluster runs nightly Spark and Hive batch jobs scheduled by Oozie. The code depends on standard Hadoop libraries and HDFS paths. You must move the workload to Google Cloud in two weeks with minimal refactoring, keep triggering jobs from your existing CI pipeline, and ensure clusters auto-scale and delete themselves when work finishes to control cost. Which Google Cloud service is the most appropriate execution target?

  • Rewrite all Spark and Hive transformations as BigQuery SQL and schedule them with BigQuery Data Transfer Service.

  • Run the pipelines on Cloud Dataflow by rewriting the Spark and Hive code in Apache Beam.

  • Create ephemeral Cloud Dataproc clusters backed by Cloud Storage and submit the existing Spark and Hive jobs without modification.

  • Package each Spark job into a container image and execute it on Cloud Run with automatic scaling.
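For context on the ephemeral Dataproc option, here is a minimal sketch using the google-cloud-dataproc Python client: it creates a short-lived cluster with an idle-delete TTL and submits an unmodified Spark jar staged in Cloud Storage. The project, region, bucket, jar path, main class, and TTL value are placeholders chosen for illustration, not part of the question.

```python
"""Minimal sketch: run an existing Spark job on an ephemeral Dataproc cluster.

All names (project, region, cluster, bucket, jar, main class) are placeholders.
"""
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"        # placeholder
REGION = "us-central1"           # placeholder
CLUSTER_NAME = "nightly-etl"     # placeholder
BUCKET = "my-dataproc-staging"   # placeholder; gs:// paths replace HDFS paths

endpoint = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)

# Cluster config: a small cluster plus an idle-delete TTL so the cluster
# removes itself after the nightly jobs finish (30-minute idle is an assumption).
cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": CLUSTER_NAME,
    "config": {
        "config_bucket": BUCKET,
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        "lifecycle_config": {"idle_delete_ttl": {"seconds": 1800}},
        # Optionally attach a pre-created autoscaling policy (placeholder name):
        # "autoscaling_config": {
        #     "policy_uri": f"projects/{PROJECT_ID}/regions/{REGION}"
        #                   "/autoscalingPolicies/nightly-etl-policy",
        # },
    },
}

create_op = cluster_client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
)
create_op.result()  # block until the cluster is ready

# Submit the existing Spark jar unchanged, with HDFS paths swapped for gs:// URIs.
job = {
    "placement": {"cluster_name": CLUSTER_NAME},
    "spark_job": {
        "main_class": "com.example.NightlyEtl",                    # placeholder
        "jar_file_uris": [f"gs://{BUCKET}/jobs/nightly-etl.jar"],  # placeholder
    },
}
submit_op = job_client.submit_job_as_operation(
    request={"project_id": PROJECT_ID, "region": REGION, "job": job}
)
print(submit_op.result().driver_output_resource_uri)
```

In practice the same flow is commonly driven from CI with plain `gcloud dataproc` commands or a Dataproc workflow template, which keeps the existing trigger mechanism intact while the idle-delete TTL handles cluster teardown.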
