GCP Professional Data Engineer Practice Question

Your nightly Spark ETL job on Dataproc processes 12 TB from Cloud Storage and must finish within two hours. A persistent three-master, 50-worker cluster currently meets the SLA but sits idle for roughly 20 hours each day, generating unnecessary cost. The job depends on several Python wheels and third-party JARs. Which redesign most effectively minimizes compute cost while preserving the SLA?

  • Port the Spark code to Dataproc Serverless for Spark so the service allocates resources on demand and bills only for what the run uses (sketch 1 below).

  • Attach an autoscaling policy to the existing cluster so that all worker nodes scale to zero after the job finishes, leaving the masters running until the next job (sketch 2 below).

  • Schedule the persistent cluster for automatic deletion two hours after it starts, and manually recreate it each night before the job runs (sketch 3 below).

  • Trigger a workflow template from Cloud Composer that spins up an ephemeral Dataproc cluster with the required initialization actions, runs the Spark job on mostly preemptible worker VMs, and deletes the cluster when the job completes (sketch 4 below).
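
Sketch 1 (Dataproc Serverless): a minimal batch submission. The bucket paths, file names, and region are illustrative assumptions, not values from the question; the wheels and JARs ride along via --py-files and --jars.

    # Submit the ETL as a Serverless for Spark batch; billing covers only the run.
    gcloud dataproc batches submit pyspark gs://my-bucket/jobs/nightly_etl.py \
        --region=us-central1 \
        --py-files=gs://my-bucket/deps/libs.whl \
        --jars=gs://my-bucket/deps/connector.jar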
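
Sketch 2 (autoscaling policy): roughly what the policy and attachment would look like, with assumed names. One caveat worth knowing: a Dataproc autoscaling policy keeps at least two primary workers, so "scale to zero" is only achievable for secondary workers.

    # policy.yaml (illustrative); the timeout is a Duration string.
    workerConfig:
      minInstances: 2      # primary workers cannot scale below two
      maxInstances: 50
    basicAlgorithm:
      yarnConfig:
        scaleUpFactor: 1.0
        scaleDownFactor: 1.0
        gracefulDecommissionTimeout: 3600s

    # Import the policy and attach it to the existing cluster.
    gcloud dataproc autoscaling-policies import nightly-etl-policy \
        --source=policy.yaml --region=us-central1
    gcloud dataproc clusters update etl-cluster \
        --autoscaling-policy=nightly-etl-policy --region=us-central1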
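
Sketch 3 (scheduled deletion): the nightly create with a max-age TTL, under the same illustrative names. Note that the recreation step the option describes stays manual.

    # Cluster self-deletes two hours after creation; recreation is still manual.
    gcloud dataproc clusters create nightly-etl-cluster \
        --region=us-central1 \
        --num-masters=3 --num-workers=50 \
        --initialization-actions=gs://my-bucket/init/install_deps.sh \
        --max-age=2h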
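
Sketch 4 (workflow template with an ephemeral cluster): the template definition, again with assumed names, sizes, and paths. Cloud Composer would trigger the instantiate step nightly (the Google provider's DataprocInstantiateWorkflowTemplateOperator is one way to do it); the managed cluster is created for the run and deleted automatically when the job finishes.

    # Managed (ephemeral) cluster: mostly preemptible secondary workers;
    # an init action installs the wheel/JAR dependencies.
    gcloud dataproc workflow-templates create nightly-etl --region=us-central1
    gcloud dataproc workflow-templates set-managed-cluster nightly-etl \
        --region=us-central1 \
        --cluster-name=ephemeral-etl \
        --num-workers=10 \
        --num-secondary-workers=40 \
        --secondary-worker-type=preemptible \
        --initialization-actions=gs://my-bucket/init/install_deps.sh
    gcloud dataproc workflow-templates add-job pyspark gs://my-bucket/jobs/nightly_etl.py \
        --region=us-central1 \
        --workflow-template=nightly-etl \
        --step-id=etl \
        --py-files=gs://my-bucket/deps/libs.whl \
        --jars=gs://my-bucket/deps/connector.jar

    # Composer triggers this nightly; the manual equivalent is:
    gcloud dataproc workflow-templates instantiate nightly-etl --region=us-central1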

Objective: Maintaining and automating data workloads