Your analytics team runs a Spark ETL job every night. The job processes 3 TB of data, writes the cleansed result to a BigQuery table, and must finish within a two-hour window. Today it executes on a three-node Dataproc cluster that is kept running 24 × 7 and stores temporary files in HDFS on the cluster's persistent disks. The cluster is idle the rest of the day, and monthly Compute Engine and persistent-disk charges have become the largest cost item in the project. You have been asked to redesign the solution to cut operating costs while still meeting the existing SLA and without rewriting the Spark code. Which approach best meets these requirements?
Migrate the Spark job to a scheduled BigQuery stored procedure that rewrites the ETL logic in SQL and leverages BigQuery's on-demand pricing.
Replace the persistent cluster with an ephemeral Dataproc workflow that spins up a job-scoped cluster each night, uses Cloud Storage instead of HDFS, and optionally adds preemptible secondary workers for extra capacity at lower cost.
Add local SSDs to the existing persistent cluster for faster I/O and purchase Committed Use Discounts on the VM instances to lower hourly costs.
Keep the current cluster but enable Dataproc autoscaling and resize the cluster to zero workers after the job finishes; restart the same cluster before the next run.
Provisioning an ephemeral Dataproc cluster from a workflow template just before the nightly run and deleting it as soon as the Spark job completes avoids paying for roughly 22 idle hours of VM and disk usage each day. Configuring Spark to read and write temporary data in Cloud Storage through the Cloud Storage connector (Dataproc's Hadoop Compatible File System, or HCFS, implementation) removes the need for large HDFS data disks, so only minimal boot-disk charges accrue while the cluster is running and inexpensive Cloud Storage fees apply to the temporary data. Adding low-cost preemptible secondary workers can further reduce compute spend, and because the workload is short-lived and can tolerate the loss of some workers, the two-hour SLA can still be met. Keeping a persistent cluster continues to incur baseline master-VM and disk charges even when idle, and Dataproc does not support resizing a standard cluster to zero primary workers, so the "scale to zero" option cannot eliminate those costs. Rewriting the ETL logic as a BigQuery SQL stored procedure would discard the existing Spark code, which is out of scope, and adding local SSDs plus Committed Use Discounts only reduces the unit price of a cluster that remains idle 22 hours a day.
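The workflow described above can be sketched with the `gcloud dataproc workflow-templates` commands. This is an illustrative outline, not a value from the scenario: the template name, region, bucket paths, and the job's jar/main class are all hypothetical placeholders.

```shell
REGION=us-central1
TEMPLATE=nightly-etl

# One-time setup: create the template and define its job-scoped
# (managed) cluster, which Dataproc creates before the job and
# deletes automatically after it finishes.
gcloud dataproc workflow-templates create "$TEMPLATE" --region="$REGION"

gcloud dataproc workflow-templates set-managed-cluster "$TEMPLATE" \
  --region="$REGION" \
  --cluster-name=nightly-etl-cluster \
  --num-workers=2 \
  --num-secondary-workers=4 \
  --secondary-worker-type=preemptible   # cheaper, interruptible capacity

# Attach the existing Spark job unchanged; inputs and outputs use
# gs:// (HCFS) paths instead of hdfs://, so no large data disks are needed.
gcloud dataproc workflow-templates add-job spark \
  --workflow-template="$TEMPLATE" \
  --region="$REGION" \
  --step-id=etl \
  --class=com.example.NightlyEtl \
  --jars=gs://my-bucket/jars/etl.jar \
  -- gs://my-bucket/raw/ gs://my-bucket/cleansed/

# Nightly trigger (e.g. from Cloud Scheduler): instantiate the template;
# cluster creation, job execution, and cluster deletion all happen here.
gcloud dataproc workflow-templates instantiate "$TEMPLATE" --region="$REGION"
```

Because the cluster exists only for the duration of the instantiation, VM and disk charges accrue only during the roughly two-hour run rather than around the clock.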
GCP Professional Data Engineer
Maintaining and automating data workloads