GCP Professional Data Engineer Practice Question

A media company ingests clickstream data 24×7 from its mobile applications and processes it in near-real-time using Spark Structured Streaming and a custom Flink job. The pipeline must continuously enrich events with user profiles stored in Bigtable and write the results to BigQuery with end-to-end latency under one minute. Operators also need to run on-demand SQL queries against the same Spark metastore during business hours. Which Dataproc deployment model best meets these requirements while balancing cost and operational complexity?

  • Migrate the streaming code to Cloud Dataflow and spin up an on-demand Dataproc cluster only for interactive SQL queries.

  • Submit each Spark and Flink job to a separate ephemeral Dataproc job-cluster that terminates when the job finishes.

  • Use Dataproc Serverless for all streaming and interactive workloads and disable any persistent cluster resources.

  • Create a persistent Dataproc cluster with autoscaling enabled and run both Spark Streaming and Flink jobs continuously.
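The persistent-cluster option hinges on autoscaling to keep cost in check while the streaming jobs run around the clock. A minimal sketch of a Dataproc autoscaling policy is below; all instance counts and timeouts are illustrative assumptions, not values from the question:

```yaml
# Illustrative Dataproc autoscaling policy (all values are assumptions).
# Import with:
#   gcloud dataproc autoscaling-policies import streaming-policy \
#     --source=policy.yaml --region=us-central1
workerConfig:
  minInstances: 2        # floor for the always-on Spark/Flink streaming jobs
  maxInstances: 10
secondaryWorkerConfig:
  minInstances: 0        # secondary workers absorb business-hours query bursts
  maxInstances: 20
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.5
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h  # drain running work before removing nodes
```

A policy like this would be attached at cluster creation with `--autoscaling-policy`, and Flink can be enabled on the same cluster via the `--optional-components=FLINK` flag of `gcloud dataproc clusters create`.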

Objective: Maintaining and automating data workloads