A media company ingests clickstream data 24×7 from its mobile applications and processes it in near-real-time using Spark Structured Streaming and a custom Flink job. The pipeline must continuously enrich events with user profiles stored in Bigtable and write the results to BigQuery with end-to-end latency under one minute. Operators also need to run on-demand SQL queries against the same Spark metastore during business hours. Which Dataproc deployment model best meets these requirements while balancing cost and operational complexity?
Use Dataproc Serverless for all streaming and interactive workloads and disable any persistent cluster resources.
Migrate the streaming code to Cloud Dataflow and spin up an on-demand Dataproc cluster only for interactive SQL queries.
Create a persistent Dataproc cluster with autoscaling enabled and run both Spark Streaming and Flink jobs continuously.
Submit each Spark and Flink job to a separate ephemeral Dataproc job-cluster that terminates when the job finishes.
The correct choice is the persistent Dataproc cluster with autoscaling enabled. The Spark and Flink jobs run continuously, so a cluster whose lifetime is tied to a single job is a poor fit: every restart would add cluster startup delay and force the jobs to reinitialize their state. A long-lived (persistent) Dataproc cluster keeps the streaming applications and their executors running, which supports the sub-minute latency goal, and lets operators run interactive queries against the same Hive/Spark catalog. Job-scoped ephemeral clusters are optimized for finite batch workloads; their initialization overhead and job-bound lifetime make them unsuitable for always-on streaming pipelines. Dataproc Serverless runs Spark workloads only, so the Flink job would have to be rewritten or hosted elsewhere, and Dataflow would require porting both codebases to Apache Beam. Neither alternative shares a common metastore as easily, which adds complexity.
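As a rough illustration of what the persistent-cluster option implies, the sketch below assembles a cluster definition in Python. The field names follow the Dataproc `Cluster` resource as best understood here (`autoscaling_config.policy_uri`, `software_config.optional_components`, `metastore_config.dataproc_metastore_service`, `lifecycle_config`); the function names, machine types, instance counts, and resource paths are hypothetical placeholders, not values taken from the question.

```python
from typing import Any, Dict


def build_persistent_cluster(
    project_id: str,
    cluster_name: str,
    autoscaling_policy_uri: str,
    metastore_service: str,
) -> Dict[str, Any]:
    """Return a Dataproc-style cluster definition for a long-lived cluster.

    All sizes and machine types are illustrative assumptions.
    """
    return {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-4"},
            # Flink is installed as an optional component alongside Spark.
            "software_config": {"optional_components": ["FLINK"]},
            # Autoscaling policy lets the worker pool follow the load.
            "autoscaling_config": {"policy_uri": autoscaling_policy_uri},
            # Shared metastore so streaming jobs and ad hoc SQL see one catalog.
            "metastore_config": {"dataproc_metastore_service": metastore_service},
            # Deliberately no "lifecycle_config": without idle or scheduled
            # deletion settings the cluster stays up until deleted manually.
        },
    }


def is_persistent(cluster: Dict[str, Any]) -> bool:
    """Treat a cluster as persistent when no auto-delete lifecycle is set."""
    return "lifecycle_config" not in cluster.get("config", {})


if __name__ == "__main__":
    cluster = build_persistent_cluster(
        project_id="example-project",  # hypothetical
        cluster_name="clickstream-streaming",  # hypothetical
        autoscaling_policy_uri="projects/example-project/regions/us-central1/autoscalingPolicies/example-policy",
        metastore_service="projects/example-project/locations/us-central1/services/example-metastore",
    )
    print(is_persistent(cluster))
```

A dictionary of this shape could be passed to `ClusterControllerClient.create_cluster` from the `google-cloud-dataproc` package, assuming that library is installed and credentials are configured. An ephemeral job cluster would differ mainly in carrying a `lifecycle_config` or being deleted by the workflow that created it.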
GCP Professional Data Engineer
Maintaining and automating data workloads