Your media company runs three independent Spark batch pipelines every hour. Each pipeline finishes on a 20-node Dataproc cluster in about 10 minutes, after which the cluster remains idle until the next hour. Engineers must continue using a proprietary Spark I/O connector that is not supported outside Dataproc. You need to cut compute costs without increasing job runtime or compromising the custom connector. What should you do?
Use Dataproc workflow templates to spin up an ephemeral cluster for each pipeline, configure preemptible secondary workers, store all data on Cloud Storage, and delete the cluster when the job completes.
Purchase a fixed BigQuery Standard Edition reservation sized for the three hourly jobs and rewrite the Spark pipelines as SQL queries.
Keep a single long-running Dataproc cluster but attach an autoscaling policy so workers scale down to zero between hourly runs.
Migrate the pipelines to Cloud Dataflow with streaming autoscaling templates that read from Pub/Sub and write to BigQuery.
Creating an ephemeral Dataproc cluster for each Spark job and deleting it automatically at job completion eliminates all charges for idle master and worker nodes between hourly runs. Preemptible secondary workers further lower per-job cost while still providing enough capacity during execution, and persisting data in Cloud Storage avoids the need for costly HDFS persistent disks. A long-running cluster with autoscaling cannot remove the master node or scale primary workers to zero, so you would still pay for idle instances between runs. Migrating to BigQuery or Dataflow would require re-implementing the proprietary Spark connector, which the scenario rules out, and could risk performance or compatibility regressions. The job-scoped, auto-deleting Dataproc cluster with preemptible secondary workers is therefore the most cost-efficient option that meets every constraint.
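A job-scoped cluster like this can be defined declaratively with a Dataproc workflow template. The sketch below is illustrative only: the cluster name, zone, machine types, main class, and bucket path are hypothetical placeholders, and instance counts would be sized to match the 20-node workload. The `managedCluster` block makes the cluster ephemeral (created per run, deleted when the job completes), `preemptibility: PREEMPTIBLE` marks the secondary workers, and the jar is read from Cloud Storage rather than HDFS:

```yaml
# Hypothetical workflow template; run with something like:
#   gcloud dataproc workflow-templates instantiate-from-file \
#     --file=template.yaml --region=us-central1
placement:
  managedCluster:                    # ephemeral: created per run, auto-deleted at completion
    clusterName: hourly-pipeline-a   # placeholder name
    config:
      gceClusterConfig:
        zoneUri: us-central1-a
      masterConfig:
        numInstances: 1
        machineTypeUri: n2-standard-4
      workerConfig:                  # small non-preemptible primary group
        numInstances: 2
        machineTypeUri: n2-standard-4
      secondaryWorkerConfig:
        numInstances: 18
        preemptibility: PREEMPTIBLE  # lowers per-job compute cost
jobs:
  - stepId: run-pipeline
    sparkJob:
      mainClass: com.example.Pipeline      # hypothetical pipeline entry point
      jarFileUris:
        - gs://my-bucket/pipeline.jar      # code and data on Cloud Storage, not HDFS
```

Instantiating the template creates the cluster, runs the Spark step, and tears the cluster down, so billing covers only the roughly 10 minutes of actual work each hour.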
GCP Professional Data Engineer
Maintaining and automating data workloads