GCP Professional Data Engineer Practice Question

An enterprise runs nightly Apache Spark ETL jobs written in Scala on an on-premises Hadoop YARN cluster. The team wants to lift and shift these jobs to Google Cloud with almost no code changes. The solution must provide on-demand autoscaling, support preemptible workers to reduce cost, continue to use a Hive Metastore for shared table definitions, and let engineers SSH into worker nodes for live debugging. Which Google Cloud service best meets all of these requirements?

  • Schedule equivalent SQL transformations as BigQuery scheduled queries and stored procedures.

  • Migrate the Spark code to Apache Beam and execute the pipelines on Cloud Dataflow.

  • Create on-demand Dataproc clusters that run the existing Spark jobs, enable autoscaling with preemptible secondary workers, and connect to Dataproc Metastore.

  • Rebuild the ETL logic in Cloud Data Fusion and run the pipelines in batch mode on Cloud Dataflow.
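As context for the lift-and-shift requirement, the sketch below shows the kind of existing Scala Spark job the scenario describes; the object, database, and table names are hypothetical. Because it relies only on standard Spark APIs and a Hive Metastore for table resolution, a job like this can typically be submitted unchanged to a Dataproc cluster that is attached to a Dataproc Metastore.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical nightly ETL job; database, table, and app names are placeholders.
object NightlyOrdersEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nightly-orders-etl")
      .enableHiveSupport()   // resolve tables through the shared Hive Metastore
      .getOrCreate()

    // Read a table whose definition lives in the metastore.
    val orders = spark.sql(
      "SELECT order_id, customer_id, amount, order_date FROM raw.orders")

    // Representative transformation step: aggregate order amounts per day.
    val dailyTotals = orders.groupBy("order_date").sum("amount")

    // Write the result back as a managed table registered in the metastore.
    dailyTotals.write.mode("overwrite").saveAsTable("curated.daily_order_totals")

    spark.stop()
  }
}
```

On a YARN-based cluster this job would be launched with spark-submit; on Dataproc the same artifact can be submitted as a Spark job against a cluster configured with an autoscaling policy and preemptible secondary workers, with no changes to the Scala code itself.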
