Your organization needs to lift-and-shift 7,000 existing batch and streaming PySpark jobs that currently run on an on-premises Apache Hadoop cluster. Leadership insists that the cloud solution remain based on open-source processing engines to minimize vendor lock-in and allow future migration to other clouds. The data engineering team also wants the ability to create short-lived clusters that automatically delete themselves once each job finishes to keep costs low. Which Google Cloud service best fulfills these requirements while requiring little or no refactoring of the existing PySpark code?
Migrate the transformations into BigQuery user-defined functions (UDFs) and schedule them as BigQuery jobs.
Replicate the logic in Cloud Data Fusion visual pipelines, which are executed on managed Dataproc clusters.
Spin up on-demand Cloud Dataproc clusters for each job and enable cluster auto-deletion once the Spark or Hadoop job finishes.
Rewrite the PySpark workloads as Apache Beam pipelines and execute them with Cloud Dataflow Flex Templates.
Cloud Dataproc is the most suitable service. It provisions managed clusters that run the open-source Apache Hadoop, Spark, and Hive stack, so existing PySpark code can run with little or no refactoring. Dataproc also supports short-lived clusters: a workflow template with a managed cluster creates the cluster, runs the job, and deletes the cluster when the job finishes, while scheduled deletion removes a cluster after a configured idle time or maximum age (TTL). Teams can therefore spin up a temporary cluster for each workload and have it shut down automatically, which keeps costs low. The other options all require rewriting the workloads. Cloud Dataflow would require porting them to the Apache Beam model. Cloud Data Fusion can run its pipelines on Dataproc, but it is designed for graphical ETL/ELT rather than for executing arbitrary PySpark scripts. BigQuery relies on its proprietary SQL engine, so moving the transformations into UDFs would mean rewriting the PySpark logic.
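The temporary-cluster pattern described above can be sketched as a pair of gcloud commands. The Python helper below only builds the command lines and does not call Google Cloud. The helper name, the cluster name, the region, and the bucket path are hypothetical placeholders, and the 30-minute idle timeout is an assumed value; `--max-idle` is the scheduled-deletion flag of `gcloud dataproc clusters create`.

```python
import shlex


def ephemeral_cluster_commands(cluster, region, script_uri, max_idle="30m"):
    """Build gcloud argv lists for a short-lived Dataproc cluster.

    The first command creates a cluster that Dataproc deletes after it has
    been idle for `max_idle`; the second submits an unmodified PySpark script.
    All argument values are supplied by the caller (hypothetical here).
    """
    create = [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--max-idle={max_idle}",  # scheduled deletion after the idle period
    ]
    submit = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script_uri,
        f"--cluster={cluster}",
        f"--region={region}",
    ]
    return [create, submit]


if __name__ == "__main__":
    # Placeholder names for illustration only.
    commands = ephemeral_cluster_commands(
        cluster="nightly-etl-tmp",
        region="us-central1",
        script_uri="gs://example-bucket/jobs/nightly_etl.py",
    )
    for argv in commands:
        print(shlex.join(argv))
```

With idle-based deletion the cluster lingers for the idle period after the job ends. A Dataproc workflow template with a managed cluster is the alternative when the cluster should be deleted as soon as the job completes.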
GCP Professional Data Engineer
Designing data processing systems