Your data engineering team operates a 50-node Dataproc cluster that runs a nightly Spark ETL job on clickstream data for about two hours. The rest of the day the cluster is largely idle, except for occasional ad-hoc Hive queries from analysts who can wait a few minutes for results to start returning. Management asks you to lower compute costs while keeping existing SLAs. What approach should you take?
Retire Dataproc and rewrite the Spark pipelines as BigQuery SQL, using only on-demand query pricing.
Keep the existing persistent cluster but attach an autoscaling policy so all workers can scale down to zero when idle.
Run each ETL and ad-hoc workload on an ephemeral Dataproc cluster that is created when the job is submitted and deleted when it completes.
Keep the persistent cluster and convert all worker nodes to preemptible VMs while leaving master nodes unchanged.
Creating job-scoped (ephemeral) Dataproc clusters for both the nightly batch job and the ad-hoc queries eliminates almost all idle-time spending. Each cluster is provisioned with hardware and software tailored to its job, runs only for the job's duration, and is deleted afterward. Analysts can tolerate the few minutes of cluster startup, so the existing SLAs still hold. Autoscaling an always-on cluster cannot scale the masters or primary workers to zero, so fixed costs remain. Preemptible VMs lower the hourly rate, but Dataproc allows them only as secondary workers, so the masters and primary workers would still run through the roughly 22 idle hours each day. Migrating to BigQuery could remove cluster costs, but it would mean re-engineering the Spark workloads, which puts SLAs at risk and adds query charges; any cost reduction is less certain in the short term.
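The idle-time argument can be checked with rough arithmetic. The sketch below compares billable node-hours per day for the always-on cluster against job-scoped clusters. The 50 nodes and the two-hour ETL window come from the question; the ad-hoc workload (four queries a day, each on a 5-node cluster for half an hour) is a made-up assumption for illustration, and the function names are hypothetical. It counts node-hours only and ignores per-VM pricing, disks, and cluster startup time.

```python
def node_hours_persistent(nodes: int, hours_per_day: float = 24.0) -> float:
    """Node-hours billed per day by a cluster that never shuts down."""
    return nodes * hours_per_day


def node_hours_ephemeral(jobs: list[tuple[int, float]]) -> float:
    """Node-hours billed per day when each job gets its own cluster.

    Each job is a (nodes, hours) pair; the cluster exists only while it runs.
    """
    return sum(nodes * hours for nodes, hours in jobs)


NODES = 50        # from the question
ETL_HOURS = 2.0   # from the question

# Assumption (not in the question): 4 ad-hoc queries per day,
# each on a 5-node cluster that lives for 0.5 hours.
ADHOC_JOBS = [(5, 0.5)] * 4

persistent = node_hours_persistent(NODES)
ephemeral = node_hours_ephemeral([(NODES, ETL_HOURS)] + ADHOC_JOBS)
savings = 1 - ephemeral / persistent

print(f"persistent: {persistent:.0f} node-hours/day")  # 1200
print(f"ephemeral:  {ephemeral:.0f} node-hours/day")   # 110
print(f"reduction:  {savings:.0%}")
```

Under these assumptions the ephemeral approach bills for under a tenth of the node-hours. The exact figure depends entirely on the assumed ad-hoc load, but the nightly job alone accounts for only 100 of the 1,200 node-hours the persistent cluster consumes.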
GCP Professional Data Engineer
Maintaining and automating data workloads