Your company runs a Spark-based ETL that processes about 4 TB of logs every night. The job takes 40 minutes and has a strict SLA but no interactivity requirements. During business hours, data scientists occasionally submit short interactive Spark SQL queries (5-10 minutes each) for troubleshooting. The current always-on 20-node Dataproc cluster sits idle most of the time. The CIO demands a 70% cost reduction without hurting either workload. Which redesign best satisfies the goal?
Enable autoscaling and preemptible workers on the existing 20-node persistent cluster so it scales down when idle but remains available for both workloads.
Retire Dataproc and load the logs into BigQuery, running both the nightly ETL and interactive analysis as SQL queries with on-demand pricing.
Migrate the nightly ETL to an ephemeral Dataproc job cluster that terminates on completion, and keep a minimal two-node persistent cluster dedicated to the interactive Spark SQL queries.
Replace the single cluster with a new per-job Dataproc cluster for every workload, including the interactive queries, deleting each cluster immediately after the query finishes.
Creating a dedicated, job-scoped (ephemeral) cluster for the predictable nightly batch ETL means the large cluster is paid for only during the roughly 40-minute run, instead of sitting idle for about 23 hours every day. Because the batch workload has no need for interactivity, the few-minute cluster start-up time is acceptable. Keeping a very small, always-on cluster for daytime troubleshooting preserves the low-latency experience the data scientists require while incurring only minimal fixed cost. Alternatives that rely solely on a single persistent cluster, whether autoscaled or using preemptible workers, still pay for at least the master nodes around the clock. Launching per-job clusters for interactive queries could satisfy the cost target, but each 5-10 minute query would first wait several minutes for a cluster to start, which violates the requirement for fast, on-demand query execution. Moving to BigQuery may eventually help, but it changes the toolset and exceeds the stated scope of comparing Dataproc cluster models.
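The cost argument above can be checked with a rough node-hour estimate. The sketch below is only illustrative and rests on assumptions that are not part of the question: every node costs the same per hour, the ephemeral cluster keeps the original 20 nodes, and cluster start-up plus teardown adds about 5 minutes to the 40-minute job. Real pricing depends on machine types, disks, and the Dataproc fee, so treat the result as an order-of-magnitude check rather than a quote.

```python
# Back-of-the-envelope node-hour comparison.
# All inputs marked "assumption" are illustrative and not taken from the question.

HOURS_PER_DAY = 24


def node_hours(nodes: int, hours: float) -> float:
    """Node-hours consumed by `nodes` machines running for `hours`."""
    return nodes * hours


def daily_reduction(baseline: float, redesigned: float) -> float:
    """Fractional saving of the redesigned setup relative to the baseline."""
    return 1 - redesigned / baseline


# Current design: 20 nodes, always on.
baseline = node_hours(20, HOURS_PER_DAY)

# Redesign, part 1: ephemeral cluster for the nightly ETL.
EPHEMERAL_NODES = 20   # assumption: same size as the current cluster
JOB_MINUTES = 40       # from the question
STARTUP_MINUTES = 5    # assumption: start-up plus teardown overhead
ephemeral = node_hours(EPHEMERAL_NODES, (JOB_MINUTES + STARTUP_MINUTES) / 60)

# Redesign, part 2: two-node persistent cluster for interactive queries.
interactive = node_hours(2, HOURS_PER_DAY)

redesigned = ephemeral + interactive
reduction = daily_reduction(baseline, redesigned)

print(f"baseline:   {baseline:.0f} node-hours/day")
print(f"redesigned: {redesigned:.0f} node-hours/day")
print(f"reduction:  {reduction:.0%}")
```

Under these assumptions the redesign uses 63 node-hours per day instead of 480, a reduction of about 87%, which clears the 70% target with some margin for a larger interactive cluster or a longer start-up time.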
GCP Professional Data Engineer
Maintaining and automating data workloads