Your nightly Spark ETL job on Dataproc processes 12 TB from Cloud Storage and must finish within two hours. A persistent cluster with three masters and 50 workers currently meets the SLA but sits idle for roughly 20 hours a day, generating unnecessary cost. The job depends on several Python wheels and third-party JARs. Which redesign most effectively minimizes compute cost while preserving the SLA?
Trigger a workflow template from Cloud Composer that spins up an ephemeral Dataproc cluster with the required initialization actions, uses mostly preemptible worker VMs, runs the Spark job, and deletes the cluster when it completes.
Attach an autoscaling policy to the existing cluster so that all worker nodes scale to zero after the job finishes, leaving the masters running until the next job.
Port the Spark code to Dataproc Serverless for Spark to let the service allocate and bill only for resources used during the run.
Schedule the persistent cluster for automatic deletion two hours after it starts and manually recreate the cluster each night before the job.
Creating a job-scoped (ephemeral) Dataproc cluster for each run eliminates almost all idle compute charges because the cluster exists only during the two-hour processing window. A workflow template (or a Cloud Composer DAG) can programmatically create the cluster, run initialization actions to install the required Python wheels and JAR files, and then delete the cluster when the job finishes. Using preemptible secondary workers further cuts per-run cost; because a core of standard primary workers remains, occasional preemptions do not threaten the deadline.
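As a sketch of this design (the template, bucket, region, and job names here are hypothetical), the workflow template could be defined with gcloud roughly as follows:

```shell
# Sketch only -- template, bucket, and class names are hypothetical.
# 1. Create an empty workflow template.
gcloud dataproc workflow-templates create nightly-etl \
  --region=us-central1

# 2. Attach a managed (ephemeral) cluster: it is created when the
#    workflow runs and deleted automatically when the workflow ends.
gcloud dataproc workflow-templates set-managed-cluster nightly-etl \
  --region=us-central1 \
  --cluster-name=nightly-etl-cluster \
  --num-workers=10 \
  --num-secondary-workers=40 \
  --secondary-worker-type=preemptible \
  --initialization-actions=gs://my-bucket/install-wheels-and-jars.sh

# 3. Add the Spark job as a workflow step.
gcloud dataproc workflow-templates add-job spark \
  --workflow-template=nightly-etl \
  --region=us-central1 \
  --step-id=etl \
  --class=com.example.NightlyEtl \
  --jars=gs://my-bucket/jars/etl-deps.jar
```

The initialization action script is where the Python wheels and third-party JARs get installed on every node of the ephemeral cluster before the job starts.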
Keeping the existing cluster with an autoscaling policy cannot remove the three master VMs and cannot shrink the primary worker group below its minimum size, so fixed compute charges remain. Migrating to Dataproc Serverless would eliminate cluster management, but Serverless does not support initialization actions, so the wheels and JARs would need to be repackaged (for example, into a custom container image) and the code may require refactoring. Manually deleting and recreating a full-sized cluster each night would save idle cost, but it introduces operational toil and a risk of configuration drift compared with an automated workflow template. An automated ephemeral cluster with preemptible secondary workers is therefore the most cost-effective and reliable solution.
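With the template in place, the nightly scheduler (a Cloud Composer DAG or any cron-style trigger) only needs to instantiate it; Dataproc then handles cluster creation, job execution, and teardown end to end. A minimal sketch, assuming a template named nightly-etl:

```shell
# Blocks until the whole workflow completes:
# create cluster -> run Spark job -> delete cluster.
gcloud dataproc workflow-templates instantiate nightly-etl \
  --region=us-central1
```

Because teardown is part of the workflow itself, there is no window in which a forgotten cluster accrues idle charges.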
GCP Professional Data Engineer
Maintaining and automating data workloads