Your media analytics team runs 50 independent Hive batch transformations every night on a 20-node single-master Dataproc cluster. Each job completes in about 40 minutes, after which the cluster sits idle until the next evening. Finance has asked you to reduce the cluster's compute cost by at least 60 percent, but you must continue using the existing Hive scripts and need the flexibility to set custom initialization actions for individual jobs. What should you do?
Downsize the persistent cluster to five workers and add preemptible secondary workers with local SSDs to accelerate the nightly ETL without changing the job structure.
Keep the existing cluster but attach an autoscaling policy that reduces primary workers to zero during idle periods to avoid VM charges.
Rewrite the Hive transformations for BigQuery and schedule them as low-priority batch queries to take advantage of on-demand pricing.
Package each nightly Hive job in a Dataproc workflow template that launches a job-scoped cluster with the required initialization actions, executes the job against data in Cloud Storage, and automatically deletes the cluster when the job finishes.
Because the jobs are short-lived batch transformations that do not need a cluster running between executions, the most cost-effective design is to run each job on its own ephemeral Dataproc cluster that is created on demand and deleted automatically when the job finishes. A Dataproc workflow template with a managed (job-scoped) cluster automates exactly this lifecycle and lets each job carry its own initialization actions and hardware sizing. All job input and output can reside in Cloud Storage (HCFS), so no data is lost when the cluster is torn down. This model eliminates charges for idle VM instances, including the master, between nightly runs, which easily achieves the requested 60 percent cost reduction while keeping the existing Hive scripts unchanged.
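A minimal sketch of this pattern with the google-cloud-dataproc Python client is shown below. The project ID, region, bucket paths, machine types, and initialization-action script are placeholders standing in for one of the fifty nightly jobs, not values given in the scenario:

```python
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"   # placeholder
REGION = "us-central1"      # placeholder

# Workflow template calls must go to the regional Dataproc endpoint.
client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

parent = f"projects/{PROJECT_ID}/regions/{REGION}"

# One template per nightly Hive job: a managed (job-scoped) cluster with its own
# initialization action, a single Hive step that reads and writes Cloud Storage,
# and automatic cluster deletion once the step finishes.
template = {
    "id": "nightly-hive-job-01",
    "placement": {
        "managed_cluster": {
            "cluster_name": "ephemeral-hive-01",
            "config": {
                "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
                "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
                "initialization_actions": [
                    # Hypothetical per-job setup script in Cloud Storage.
                    {"executable_file": "gs://my-bucket/init/job-01-deps.sh"}
                ],
            },
        }
    },
    "jobs": [
        {
            "step_id": "hive-transform",
            # Existing Hive script, unchanged, stored in Cloud Storage.
            "hive_job": {"query_file_uri": "gs://my-bucket/hive/transform_01.hql"},
        }
    ],
}

client.create_workflow_template(parent=parent, template=template)

# Instantiation creates the cluster, runs the Hive step, then deletes the cluster.
operation = client.instantiate_workflow_template(
    name=f"{parent}/workflowTemplates/nightly-hive-job-01"
)
operation.result()  # blocks until the workflow, including cluster teardown, completes
```

Instantiating each template nightly (for example from an orchestrator) keeps VMs running only for the roughly 40 minutes a job actually needs, and each of the 50 templates can specify its own initialization actions and cluster shape.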
Dataproc autoscaling never scales the master node and cannot reduce primary workers to zero, so a persistent cluster keeps accruing charges around the clock. Merely shrinking the cluster or adding preemptible secondary workers would reduce, but not eliminate, idle costs and is unlikely to reach a 60 percent saving. Migrating the transformations to BigQuery would require rewriting them, which violates the stated constraint to keep the existing Hive scripts.