An enterprise runs nightly Apache Spark ETL jobs written in Scala on an on-prem Hadoop YARN cluster. They want to lift-and-shift these jobs to Google Cloud with almost no code changes. The solution must provide on-demand autoscaling, let them attach preemptible workers to save costs, keep using a Hive Metastore for shared table definitions, and allow engineers to open an SSH session on worker nodes for live debugging. Which Google Cloud service best meets all of these requirements?
Rebuild the ETL logic in Cloud Data Fusion and run the pipelines in batch mode on Cloud Dataflow.
Create on-demand Dataproc clusters that run the existing Spark jobs, enable autoscaling with preemptible secondary workers, and connect to Dataproc Metastore.
Schedule equivalent SQL transformations as BigQuery scheduled queries and stored procedures.
Migrate the Spark code to Apache Beam and execute the pipelines on Cloud Dataflow.
Dataproc offers managed Hadoop/Spark clusters that can be started on demand, autoscaled, and configured with preemptible secondary workers to lower cost. It supports direct execution of existing Spark jobs without rewriting, integrates with Dataproc Metastore (or an external Hive Metastore) so table metadata remains available, and exposes standard Compute Engine VMs so engineers can SSH to the nodes. Dataflow would require rewriting Spark code into Apache Beam pipelines, BigQuery would force a shift to SQL-based ELT, and Cloud Data Fusion would mean rebuilding pipelines in a visual tool; neither alternative gives direct SSH access to Spark executors or Hive compatibility out of the box.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Dataproc in Google Cloud?
Open an interactive chat with Bash
What is a Hive Metastore and why is it important for Spark jobs?
Open an interactive chat with Bash
What are preemptible workers in Google Dataproc, and how do they help save costs?
Open an interactive chat with Bash
What are preemptible workers in Google Cloud Dataproc?
Open an interactive chat with Bash
What is Dataproc Metastore, and why is it important?
Open an interactive chat with Bash
How does autoscaling work in Dataproc clusters?
Open an interactive chat with Bash
What is Dataproc in Google Cloud?
Open an interactive chat with Bash
What is a preemptible VM, and how does it save costs?
Open an interactive chat with Bash
What is the role of a Metastore in big data pipelines?
Open an interactive chat with Bash
GCP Professional Data Engineer
Ingesting and processing the data
Your Score:
Report Issue
Bash, the Crucial Exams Chat Bot
AI Bot
Loading...
Loading...
Loading...
Pass with Confidence.
IT & Cybersecurity Package
You have hit the limits of our free tier, become a Premium Member today for unlimited access.
Military, Healthcare worker, Gov. employee or Teacher? See if you qualify for a Community Discount.
Monthly
$19.99
$19.99/mo
Billed monthly, Cancel any time.
3 Month Pass
$44.99
$14.99/mo
One time purchase of $44.99, Does not auto-renew.
MOST POPULAR
Annual Pass
$119.99
$9.99/mo
One time purchase of $119.99, Does not auto-renew.
BEST DEAL
Lifetime Pass
$189.99
One time purchase, Good for life.
What You Get
All IT & Cybersecurity Package plans include the following perks and exams .