A media analytics team currently runs both of the following workloads on the same 50-node Dataproc cluster in us-central1:
A Spark Streaming job that ingests click-stream events 24×7 with a strict sub-second latency SLA.
An hourly Hive aggregation that executes for about 8 minutes and writes the results to BigQuery.
The cluster sits mostly idle outside the streaming driver and a few executors. Management asks you to redesign the deployment to cut compute costs while keeping the streaming SLA unchanged and avoiding operational toil. What should you do?
Keep a small persistent Dataproc cluster sized for the streaming workload, and schedule the hourly Hive aggregation on separate job-based Dataproc clusters that delete themselves after completion.
Convert both workloads to job-based (ephemeral) clusters that are created and destroyed for every execution, including the Spark Streaming application.
Consolidate the workloads on a larger persistent cluster with autoscaling disabled, using preemptible workers to lower the hourly price.
Move both workloads to a single autoscaling persistent cluster configured to scale to zero when idle so that costs are eliminated outside batch windows.
The latency-sensitive Spark Streaming application must stay online around the clock, so it belongs on a small persistent Dataproc cluster sized only for that workload. The hourly Hive aggregation runs for about 8 minutes out of every 60, which makes it an ideal fit for job-scoped (ephemeral) clusters: each one spins up with the resources the job needs, reads data from Cloud Storage, writes the results to BigQuery, and deletes itself when the job finishes. This hybrid approach removes the idle capacity that the batch workload leaves behind while the streaming workload is unaffected. The other options fall short. Running both jobs on one persistent cluster still pays for unused resources between batch runs, and preemptible workers can be reclaimed at any time, which puts a sub-second SLA at risk. A cluster that hosts a 24×7 streaming job is never idle, so it cannot scale to zero. Making the streaming job ephemeral would repeatedly restart a long-running pipeline and violate its SLA.
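The cost argument above can be made concrete with a rough node-hour comparison. The sketch below is illustrative only: the 50-node cluster and the 8-minute hourly batch come from the scenario, while the split into 5 streaming nodes and 45 batch nodes and the 2-minute cluster startup overhead are assumptions chosen for the example, not figures from the question. Node-hours are used as a simple proxy for compute cost; actual billing depends on machine types and pricing.

```python
def daily_node_hours(nodes: int, minutes_per_hour: float) -> float:
    """Node-hours consumed per day by `nodes` that run `minutes_per_hour` of every hour."""
    return nodes * (minutes_per_hour / 60.0) * 24


# Figures taken from the scenario.
CURRENT_NODES = 50   # single shared cluster, always on
BATCH_MINUTES = 8    # hourly Hive aggregation runtime

# Assumptions for illustration only (not given in the question).
STREAMING_NODES = 5  # hypothetical size of the small persistent cluster
BATCH_NODES = 45     # hypothetical size of each ephemeral batch cluster
STARTUP_MINUTES = 2  # hypothetical per-run cluster creation overhead

# Today: all 50 nodes run 60 minutes of every hour.
current = daily_node_hours(CURRENT_NODES, 60)

# Hybrid: streaming nodes run continuously, batch nodes only while a
# job-scoped cluster exists (startup overhead plus job runtime).
hybrid = (
    daily_node_hours(STREAMING_NODES, 60)
    + daily_node_hours(BATCH_NODES, BATCH_MINUTES + STARTUP_MINUTES)
)

savings = 1 - hybrid / current

print(f"current: {current:.0f} node-hours/day")
print(f"hybrid:  {hybrid:.0f} node-hours/day")
print(f"savings: {savings:.0%}")
```

Under these assumed numbers the hybrid layout uses about 300 node-hours per day against 1,200 for the shared cluster. The exact ratio shifts with the real sizing, but the batch nodes are only paid for during the short window in which each ephemeral cluster exists.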
GCP Professional Data Engineer
Maintaining and automating data workloads