Your company runs hundreds of nightly ETL workloads implemented as Apache Spark jobs on an on-premises Hadoop cluster. Management wants to migrate these pipelines to Google Cloud, but the CTO insists the Spark code remain unchanged so it can later run on Amazon EMR. The data engineering team also wants to avoid managing clusters or manually patching software in Google Cloud. Which approach best meets both the portability and operational requirements?
Deploy the existing Spark jobs on on-demand Cloud Dataproc clusters, which manage the underlying Hadoop and Spark runtime automatically.
Package the Spark jobs into containers and orchestrate them on Google Kubernetes Engine using the open-source Spark Operator.
Load the source data into BigQuery and replace the Spark transformations with SQL and dbt models orchestrated by Cloud Composer.
Rewrite the pipelines in Apache Beam and execute them on Cloud Dataflow so they can later run on any Beam-compatible runner.
Cloud Dataproc lets the team submit the existing Spark jobs without modification because it runs open-source Apache Spark, the same engine that Amazon EMR supports. Dataproc provisions and manages the clusters, supplies maintained Spark and Hadoop images, and can scale resources through autoscaling policies, which removes most administrative overhead while the workload runs in Google Cloud. Rewriting the jobs in Apache Beam for Cloud Dataflow would provide cross-runner portability, but it violates the "no code changes" mandate. Running Spark on GKE with the Spark Operator keeps the code intact but makes the team responsible for the Kubernetes cluster, the operator, and the container images. Replacing Spark with BigQuery SQL and dbt requires extensive refactoring and abandons Spark altogether. Cloud Dataproc with on-demand managed Spark clusters is therefore the only option that both preserves code portability to EMR and minimizes operational management on Google Cloud.
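To make the portability argument concrete, here is a minimal sketch that builds the submission command for the same Spark jar on each platform. The cluster names, bucket paths, main class, and job argument are hypothetical placeholders, and the two helper functions are illustrative rather than part of any SDK. It assumes the jar is copied to a `gs://` bucket for Dataproc and an `s3://` bucket for EMR; only the submission wrapper differs, not the Spark code.

```python
import shlex
from typing import List

# Hypothetical job details; the same compiled jar is used on both platforms.
MAIN_CLASS = "com.example.NightlyEtl"
JOB_ARGS = ["2024-01-01"]


def dataproc_submit_cmd(cluster: str, region: str, main_class: str,
                        jar_uri: str, job_args: List[str]) -> List[str]:
    """Build a `gcloud dataproc jobs submit spark` command for an existing jar."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--class={main_class}",
        f"--jars={jar_uri}",
        "--",  # arguments after this separator go to the Spark job itself
        *job_args,
    ]


def emr_add_step_cmd(cluster_id: str, main_class: str,
                     jar_uri: str, job_args: List[str]) -> List[str]:
    """Build an `aws emr add-steps` command that runs the same jar as a Spark step."""
    # For Type=Spark steps, Args are handed to spark-submit on the cluster.
    spark_args = ",".join(["--class", main_class, jar_uri, *job_args])
    step = f"Type=Spark,Name=nightly-etl,ActionOnFailure=CONTINUE,Args=[{spark_args}]"
    return ["aws", "emr", "add-steps", "--cluster-id", cluster_id, "--steps", step]


if __name__ == "__main__":
    gcp = dataproc_submit_cmd(
        "etl-cluster", "us-central1", MAIN_CLASS,
        "gs://example-bucket/jobs/nightly-etl.jar", JOB_ARGS,
    )
    aws = emr_add_step_cmd(
        "j-EXAMPLE12345", MAIN_CLASS,
        "s3://example-bucket/jobs/nightly-etl.jar", JOB_ARGS,
    )
    # Print the commands rather than running them; neither CLI is invoked here.
    print(shlex.join(gcp))
    print(shlex.join(aws))
```

The main class and job arguments are identical in both commands, which is the property the CTO's requirement depends on. Any storage paths hard-coded inside the job would still need to be parameterized for a later move to EMR.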
GCP Professional Data Engineer
Designing data processing systems