Your analytics team schedules a 4-hour Spark ETL job every night on a Dataproc cluster with one master node and 20 n2-standard-4 workers. To keep the cluster available for occasional ad-hoc jobs, the team leaves it running the remaining 20 hours each day, resulting in high compute charges for mostly idle resources. Management asks you to reduce costs without extending the nightly ETL completion time or sacrificing ad-hoc flexibility. What should you do?
Keep the persistent cluster but convert all worker nodes to preemptible VMs to lower the hourly rate.
Configure Cloud Composer to spin up a job-scoped Dataproc cluster each night (and for any ad-hoc submission), run the Spark job, then delete the cluster after completion.
Enable Dataproc autoscaling on the existing cluster so worker nodes scale down during idle periods while keeping the master and two workers running.
Rewrite the workload as a continuous Dataflow streaming pipeline that runs as a single, permanently provisioned regional Dataflow job.
Dataproc charges for every VM (including master nodes) as long as the cluster is running, even when it is idle. The most effective way to avoid paying for idle capacity is to replace the always-on (persistent) cluster with job-scoped, ephemeral clusters. A workflow scheduler such as Cloud Composer can create an appropriately sized cluster just before each nightly Spark job starts, run the workload, and delete the cluster immediately after it finishes. Ad-hoc users can submit their own on-demand Dataproc jobs that each spin up a short-lived cluster, ensuring resources are paid for only while they are used. Autoscaling or switching to preemptible workers on a persistent cluster still incurs charges for the master nodes and any baseline workers that remain running. Converting the batch process to a streaming pipeline would keep resources allocated all day and raise, not lower, costs.
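The savings argument above can be made concrete with back-of-the-envelope arithmetic. The hourly rate below is an assumption for illustration only (not current GCP pricing), and the sketch ignores Dataproc's per-vCPU fee and persistent-disk charges; in practice the create/run/delete lifecycle would be driven by a Cloud Composer DAG using the Dataproc operators.

```python
# Back-of-the-envelope cost comparison for the scenario above.
# RATE_PER_VM_HOUR is an assumed n2-standard-4 on-demand rate (USD),
# not current GCP pricing; it ignores the Dataproc surcharge and disks.
RATE_PER_VM_HOUR = 0.194
VMS = 1 + 20   # one master plus twenty workers
DAYS = 30

def monthly_cost(hours_per_day: float) -> float:
    """Monthly VM cost for a cluster running hours_per_day each day."""
    return hours_per_day * DAYS * VMS * RATE_PER_VM_HOUR

persistent = monthly_cost(24)  # always-on cluster
ephemeral = monthly_cost(4)    # job-scoped cluster, deleted after the ETL

print(f"Persistent cluster: ${persistent:,.2f}/month")
print(f"Job-scoped cluster: ${ephemeral:,.2f}/month")
print(f"Compute savings:    {1 - ephemeral / persistent:.0%}")
```

Because the job-scoped cluster is billed only for the roughly four hours it exists, compute spend falls in proportion to runtime, about a sixfold reduction here regardless of the exact hourly rate.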
GCP Professional Data Engineer
Maintaining and automating data workloads