Your team runs a Spark Structured Streaming pipeline that reads events from Pub/Sub, enriches them with lookup data stored on Cloud Storage, and writes low-latency aggregates to Bigtable for a near real-time dashboard. The job must run 24×7, allow on-the-fly code updates for tuning, and survive individual worker failures without interrupting the stream. You also need the flexibility to scale the number of workers up or down automatically as traffic fluctuates. Which Dataproc deployment model best meets these requirements while controlling unnecessary idle costs?
Submit the job to a transient (ephemeral) Dataproc cluster created by a workflow template and deleted after each micro-batch completes.
Use Dataproc Serverless for Spark to submit the streaming job as a batch task that spins up resources on demand and terminates when the driver exits.
Run the job on a persistent Dataproc cluster configured with an autoscaling policy.
Create a Cloud Composer DAG that launches a new Dataproc cluster every hour to run the streaming job and tears it down when the hour ends.
Because the streaming job is long-running and must stay continuously available, it should execute on a Dataproc persistent cluster. A persistent cluster keeps the Spark driver and executors alive, so stateful streaming can continue without the overhead of tearing the cluster down between micro-batches. Enabling Dataproc autoscaling lets the cluster grow and shrink with traffic, reducing idle expense while still maintaining the always-on characteristic that streaming workloads require.
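Ephemeral (job-scoped) clusters and serverless batch jobs are designed for finite batch processing: they terminate when the job finishes, so using them for a continuous stream introduces restart latency and potential reprocessing from the last checkpoint. Workflow templates that create and delete clusters are likewise optimized for batch ETL, not for 24×7 pipelines. Relaunching a cluster every hour from a Cloud Composer DAG has the same drawback, because the stream stops at each teardown and must recover from its checkpoint on the next cluster.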
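To make the autoscaling part of the answer concrete, here is a minimal sketch of what an autoscaling policy for the persistent cluster might look like. The field names follow the Dataproc autoscaling policy resource (`workerConfig`, `basicAlgorithm`, `yarnConfig`); the helper name `build_autoscaling_policy`, the worker bounds, the scaling factors, and the timeouts are illustrative assumptions, not values taken from the question.

```python
import json


def build_autoscaling_policy(min_workers, max_workers,
                             scale_up=0.5, scale_down=0.5,
                             cooldown="240s", decommission_timeout="600s"):
    """Return a dict shaped like a Dataproc autoscaling policy.

    Hypothetical helper for illustration; all default values are assumptions.
    """
    # Assumption: a standard cluster keeps at least 2 primary workers.
    if not 2 <= min_workers <= max_workers:
        raise ValueError("expected 2 <= min_workers <= max_workers")
    return {
        # Bounds on the primary worker group.
        "workerConfig": {
            "minInstances": min_workers,
            "maxInstances": max_workers,
        },
        "basicAlgorithm": {
            # Minimum wait between two scaling decisions.
            "cooldownPeriod": cooldown,
            "yarnConfig": {
                # Fraction of pending YARN memory to add capacity for.
                "scaleUpFactor": scale_up,
                # Fraction of available YARN memory to remove capacity for.
                "scaleDownFactor": scale_down,
                # Time allowed for running work to finish before a node is removed.
                "gracefulDecommissionTimeout": decommission_timeout,
            },
        },
    }


policy = build_autoscaling_policy(min_workers=2, max_workers=10)
print(json.dumps(policy, indent=2))
```

The printed JSON could be saved to a file and registered with `gcloud dataproc autoscaling-policies import`, then attached when the persistent cluster is created using the `--autoscaling-policy` flag of `gcloud dataproc clusters create`. JSON is a subset of YAML, so the import should accept it, but treat that as something to verify in your own project rather than a guarantee.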
GCP Professional Data Engineer
Maintaining and automating data workloads