Your media company runs three independent Spark batch pipelines every hour. Each pipeline finishes on a 20-node Dataproc cluster in about 10 minutes, after which the cluster remains idle until the next hour. Engineers must continue using a proprietary Spark I/O connector that is not supported outside Dataproc. You need to cut compute costs without increasing job runtime or compromising the custom connector. What should you do?
Use Dataproc workflow templates to spin up an ephemeral cluster for each pipeline, configure preemptible secondary workers, store all data on Cloud Storage, and delete the cluster when the job completes.
Purchase a fixed BigQuery Standard Edition reservation sized for the three hourly jobs and rewrite the Spark pipelines as SQL queries.
Keep a single long-running Dataproc cluster but attach an autoscaling policy so workers scale down to zero between hourly runs.
Migrate the pipelines to Cloud Dataflow with streaming autoscaling templates that read from Pub/Sub and write to BigQuery.
Creating an ephemeral Dataproc cluster for each Spark job and deleting it automatically at job completion eliminates all charges for idle master and worker nodes between hourly runs. Preemptible secondary workers further lower per-job cost while still providing enough capacity during execution, and persisting data in Cloud Storage avoids the need for costly HDFS persistent disks. A long-running cluster with autoscaling cannot remove the master node or scale primary workers to zero, so you would still pay for idle instances between runs. Migrating to BigQuery or Dataflow would require re-implementing the proprietary Spark connector, which the scenario rules out, and could risk performance or compatibility regressions. The job-scoped, auto-deleting Dataproc cluster with preemptible secondary workers is therefore the most cost-efficient option that meets every constraint.
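A job-scoped cluster like this can be defined declaratively with a Dataproc workflow template. The sketch below is illustrative only: the cluster name, zone, machine types, main class, and bucket path are hypothetical placeholders, and instance counts would be sized to match the 20-node workload. The `managedCluster` block makes the cluster ephemeral (created per run, deleted when the job completes), `preemptibility: PREEMPTIBLE` marks the secondary workers, and the jar is read from Cloud Storage rather than HDFS:

```yaml
# Hypothetical workflow template; run with something like:
#   gcloud dataproc workflow-templates instantiate-from-file \
#     --file=template.yaml --region=us-central1
placement:
  managedCluster:                    # ephemeral: created per run, auto-deleted at completion
    clusterName: hourly-pipeline-a   # placeholder name
    config:
      gceClusterConfig:
        zoneUri: us-central1-a
      masterConfig:
        numInstances: 1
        machineTypeUri: n2-standard-4
      workerConfig:                  # small non-preemptible primary group
        numInstances: 2
        machineTypeUri: n2-standard-4
      secondaryWorkerConfig:
        numInstances: 18
        preemptibility: PREEMPTIBLE  # lowers per-job compute cost
jobs:
  - stepId: run-pipeline
    sparkJob:
      mainClass: com.example.Pipeline      # hypothetical pipeline entry point
      jarFileUris:
        - gs://my-bucket/pipeline.jar      # code and data on Cloud Storage, not HDFS
```

Instantiating the template creates the cluster, runs the Spark step, and tears the cluster down, so billing covers only the roughly 10 minutes of actual work each hour.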
GCP Professional Data Engineer
Maintaining and automating data workloads