Your media analytics team runs 50 independent Hive batch transformations every night on a 20-node single-master Dataproc cluster. Each job completes in about 40 minutes, after which the cluster sits idle until the next evening. Finance has asked you to reduce the cluster's compute cost by at least 60 percent, but you must continue using the existing Hive scripts and need the flexibility to set custom initialization actions for individual jobs. What should you do?
Downsize the persistent cluster to five workers and add preemptible secondary workers with local SSDs to accelerate the nightly ETL without changing the job structure.
Keep the existing cluster but attach an autoscaling policy that reduces primary workers to zero during idle periods to avoid VM charges.
Rewrite the Hive transformations for BigQuery and schedule them as low-priority batch queries to take advantage of on-demand pricing.
Package each nightly Hive job in a Dataproc workflow template that launches a job-scoped cluster with the required initialization actions, executes the job against data in Cloud Storage, and automatically deletes the cluster when the job finishes.
Because the jobs are short-lived batch transformations that do not need a cluster running between executions, the most cost-effective design is to run each job on its own ephemeral Dataproc cluster that is created on demand and deleted automatically when the job finishes. A Dataproc workflow template with a managed (job-scoped) cluster automates exactly this lifecycle and lets each job carry its own initialization actions and hardware sizing. All job input and output can reside in Cloud Storage (HCFS), so no data is lost when the cluster is torn down. This model eliminates charges for idle VM instances, including the master, between nightly runs, which easily achieves the requested 60 percent cost reduction while keeping the existing Hive scripts unchanged.
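A minimal sketch of this pattern with the google-cloud-dataproc Python client is shown below. The project ID, region, bucket paths, machine types, and initialization-action script are placeholders standing in for one of the fifty nightly jobs, not values given in the scenario:

```python
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"   # placeholder
REGION = "us-central1"      # placeholder

# Workflow template calls must go to the regional Dataproc endpoint.
client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

parent = f"projects/{PROJECT_ID}/regions/{REGION}"

# One template per nightly Hive job: a managed (job-scoped) cluster with its own
# initialization action, a single Hive step that reads and writes Cloud Storage,
# and automatic cluster deletion once the step finishes.
template = {
    "id": "nightly-hive-job-01",
    "placement": {
        "managed_cluster": {
            "cluster_name": "ephemeral-hive-01",
            "config": {
                "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
                "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
                "initialization_actions": [
                    # Hypothetical per-job setup script in Cloud Storage.
                    {"executable_file": "gs://my-bucket/init/job-01-deps.sh"}
                ],
            },
        }
    },
    "jobs": [
        {
            "step_id": "hive-transform",
            # Existing Hive script, unchanged, stored in Cloud Storage.
            "hive_job": {"query_file_uri": "gs://my-bucket/hive/transform_01.hql"},
        }
    ],
}

client.create_workflow_template(parent=parent, template=template)

# Instantiation creates the cluster, runs the Hive step, then deletes the cluster.
operation = client.instantiate_workflow_template(
    name=f"{parent}/workflowTemplates/nightly-hive-job-01"
)
operation.result()  # blocks until the workflow, including cluster teardown, completes
```

Instantiating each template nightly (for example from an orchestrator) keeps VMs running only for the roughly 40 minutes a job actually needs, and each of the 50 templates can specify its own initialization actions and cluster shape.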
Dataproc autoscaling never scales the master node and cannot reduce primary workers to zero, so a persistent cluster keeps accruing charges around the clock. Merely shrinking the cluster or adding preemptible secondary workers would reduce, but not eliminate, idle costs and is unlikely to reach a 60 percent saving. Migrating the transformations to BigQuery would require rewriting them, which violates the stated constraint to keep the existing Hive scripts.