Your company processes 12 TB of IoT sensor data every night in an on-premises Hadoop cluster using PySpark jobs. The company must move to Google Cloud while keeping the existing Spark code, avoiding vendor lock-in so workloads can be repatriated later, and using open-source orchestration. Which Google Cloud design best meets these portability requirements while adding the ability to scale on demand?
Rewrite the pipelines in Apache Beam and execute them on Dataflow, scheduling executions with Workflows.
Use autoscaling Dataproc Spark clusters that read and write Parquet files in Cloud Storage, orchestrated end-to-end with Cloud Composer DAGs.
Containerize each Spark job and deploy on Cloud Run, triggering executions via Pub/Sub and coordinating with Cloud Scheduler.
Load historical data into BigQuery, stream new data with the BigQuery Streaming API, and schedule nightly SQL transformations with Dataform.
Running the Spark code unchanged on Dataproc preserves the existing investment and ensures portability because Dataproc is built on open-source Hadoop and Spark; the same jobs can run on other clouds or back on on-premises clusters. Storing data in an open format such as Parquet on Cloud Storage keeps it cloud-agnostic. Autoscaling Dataproc clusters provide on-demand elasticity without manual capacity management, and Cloud Composer is built on open-source Apache Airflow, so the orchestration definitions remain portable as well.
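To make the correct design concrete, here is a minimal sketch of a Cloud Composer (Airflow) DAG that creates an ephemeral autoscaling Dataproc cluster, submits the existing PySpark job, and deletes the cluster afterward. It assumes the apache-airflow-providers-google package that ships with Cloud Composer; the project ID, region, bucket, cluster name, and autoscaling policy name are placeholders, not details from the scenario.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.trigger_rule import TriggerRule

PROJECT_ID = "my-project"           # placeholder
REGION = "us-central1"              # placeholder
CLUSTER_NAME = "nightly-iot-spark"  # placeholder

CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-8"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-8"},
    # Attach a pre-created autoscaling policy so workers scale with the nightly load.
    "autoscaling_config": {
        "policy_uri": f"projects/{PROJECT_ID}/regions/{REGION}/autoscalingPolicies/iot-nightly"
    },
}

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        # The unchanged Spark script, staged in Cloud Storage.
        "main_python_file_uri": "gs://my-bucket/jobs/process_sensors.py",
        "args": ["--input", "gs://my-bucket/raw/", "--output", "gs://my-bucket/curated/"],
    },
}

with DAG(
    dag_id="nightly_iot_batch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # run nightly
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,
    )

    run_pyspark = DataprocSubmitJobOperator(
        task_id="run_pyspark",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule=TriggerRule.ALL_DONE,  # tear down even if the job fails
    )

    create_cluster >> run_pyspark >> delete_cluster
```

Because the DAG is standard Airflow Python, the same definition (with different operators) could run on any self-managed Airflow deployment, which is what keeps the orchestration portable.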
Dataflow would require rewriting Spark jobs into Apache Beam pipelines. BigQuery-centric solutions tie the workload to a proprietary warehouse and SQL dialect. Running Spark in arbitrary containers on Cloud Run lacks native cluster semantics and would still need bespoke orchestration, limiting portability and scalability for large distributed jobs.
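To contrast with the Beam rewrite that Dataflow would require, the sketch below shows how the existing PySpark logic stays intact on Dataproc: only the storage locations move from HDFS paths to gs:// URIs, which the Cloud Storage connector on Dataproc resolves natively. The bucket, paths, and column names are illustrative placeholders, not part of the scenario.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nightly-iot-batch").getOrCreate()

# Read the night's raw sensor Parquet files from Cloud Storage instead of HDFS.
raw = spark.read.parquet("gs://my-bucket/raw/dt=2024-01-01/")

# The same transformation logic that ran on the on-premises cluster (illustrative).
hourly = (
    raw.withColumn("hour", F.date_trunc("hour", F.col("event_ts")))
       .groupBy("sensor_id", "hour")
       .agg(F.avg("reading").alias("avg_reading"), F.count("*").alias("samples"))
)

# Write results back to Cloud Storage in Parquet, keeping the data in an open format.
hourly.write.mode("overwrite").parquet("gs://my-bucket/curated/hourly/")

spark.stop()
```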