GCP Professional Data Engineer Practice Question

A media analytics team runs the following two workloads on the same 50-node Dataproc cluster in us-central1:

  1. A Spark Streaming job that ingests click-stream events 24×7 with a strict sub-second latency SLA.
  2. An hourly Hive aggregation that executes for about 8 minutes and writes the results to BigQuery.

Apart from the streaming driver and a few executors, the cluster sits mostly idle. Management asks you to redesign the deployment to cut compute costs while keeping the streaming SLA unchanged and avoiding operational toil. What should you do?
  • Keep a small persistent Dataproc cluster sized for the streaming workload, and schedule the hourly Hive aggregation on separate job-based Dataproc clusters that delete themselves after completion (the job-based pattern is sketched after the options).

  • Convert both workloads to job-based (ephemeral) clusters that are created and destroyed for every execution, including the Spark Streaming application.

  • Consolidate the workloads on a larger persistent cluster that disables autoscaling but uses preemptible workers to lower the hourly price.

  • Move both workloads to a single autoscaling persistent cluster configured to scale to zero when idle so that costs are eliminated outside batch windows.
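For context, two of the options hinge on job-based (ephemeral) clusters. Below is a minimal sketch of that pattern using the google-cloud-dataproc Python client: an inline workflow template whose managed cluster is created for the run and deleted as soon as the job finishes. The project ID, bucket, machine types, and query file are hypothetical placeholders, not values from the scenario.

    from google.cloud import dataproc_v1

    PROJECT = "my-project"   # hypothetical project ID
    REGION = "us-central1"

    # Dataproc clients must target the regional endpoint.
    client = dataproc_v1.WorkflowTemplateServiceClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )

    # The managed_cluster is created just before the jobs run and deleted
    # when they finish, so compute is billed only for the short batch window.
    template = {
        "id": "hourly-hive-aggregation",
        "placement": {
            "managed_cluster": {
                "cluster_name": "hive-agg-ephemeral",
                "config": {
                    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
                    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
                },
            }
        },
        "jobs": [
            {
                "step_id": "hive-aggregation",
                # Hypothetical Cloud Storage path to the Hive query.
                "hive_job": {"query_file_uri": "gs://my-bucket/hourly_agg.hql"},
            }
        ],
    }

    # Runs the template without saving it; the operation completes only
    # after the Hive job finishes and the managed cluster is deleted.
    operation = client.instantiate_inline_workflow_template(
        request={"parent": f"projects/{PROJECT}/regions/{REGION}", "template": template}
    )
    operation.result()

A Cloud Scheduler job or Cloud Composer DAG could invoke this hourly; because the cluster exists only for the duration of the run, there is nothing to patch, resize, or monitor between executions.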

Objective: Maintaining and automating data workloads