A retail analytics division runs nightly Spark ETL pipelines for four business units. All jobs currently share a single persistent Dataproc cluster backed by Cloud Storage. During month-end closes, long-running joins from one unit starve executors needed by others, and the central platform team spends hours tuning YARN queues. Management asks you to eliminate cross-team resource contention, keep costs low when jobs are idle, and avoid complex capacity management. What should you do?
Migrate the Spark pipelines to BigQuery using BigQuery-Connector and execute them as scheduled queries with on-demand pricing.
Launch an ephemeral Dataproc cluster for each team's nightly job, run the Spark pipeline, and configure the cluster to delete itself when the job succeeds or fails.
Resize the existing persistent cluster to a larger machine class and define stricter YARN capacity scheduler queues for each business unit.
Enable autoscaling on the persistent cluster and add preemptible secondary workers to handle month-end peaks.
Creating a short-lived Dataproc cluster for each team's batch job isolates resources because every job receives its own master and worker VMs. The cluster can be provisioned with hardware and initialization actions that match that job's needs and is deleted automatically when the job finishes, so no idle capacity is billed. A large persistent multi-tenant cluster (even with autoscaling or YARN queues) continues to risk contention and incurs charges while it is up. Migrating to BigQuery could meet some objectives but requires rewriting Spark code and was not requested.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
ELI5: What is Dataproc?
Open an interactive chat with Bash
Why does creating ephemeral clusters avoid idle costs?
Open an interactive chat with Bash
How does Dataproc isolate resources for individual teams?
Open an interactive chat with Bash
What is Dataproc and how does it help Spark ETL pipelines?
Open an interactive chat with Bash
What are YARN capacity scheduler queues, and why are they not ideal for this scenario?
Open an interactive chat with Bash
Why is an ephemeral Dataproc cluster preferred over autoscaling for the persistent cluster?
Open an interactive chat with Bash
GCP Professional Data Engineer
Maintaining and automating data workloads
Your Score:
Report Issue
Bash, the Crucial Exams Chat Bot
AI Bot
Loading...
Loading...
Loading...
Pass with Confidence.
IT & Cybersecurity Package
You have hit the limits of our free tier, become a Premium Member today for unlimited access.
Military, Healthcare worker, Gov. employee or Teacher? See if you qualify for a Community Discount.
Monthly
$19.99 $11.99
$11.99/mo
Billed monthly, Cancel any time.
$19.99 after promotion ends
3 Month Pass
$44.99 $26.99
$8.99/mo
One time purchase of $26.99, Does not auto-renew.
$44.99 after promotion ends
Save $18!
MOST POPULAR
Annual Pass
$119.99 $71.99
$5.99/mo
One time purchase of $71.99, Does not auto-renew.
$119.99 after promotion ends
Save $48!
BEST DEAL
Lifetime Pass
$189.99 $113.99
One time purchase, Good for life.
Save $76!
What You Get
All IT & Cybersecurity Package plans include the following perks and exams .