A retail analytics team runs a Spark-based ETL workflow each night that processes 8 TB of sales logs stored in Cloud Storage. The job finishes in about four hours, and there is no other Spark workload during the day. Security policy requires that each run use a fresh environment so that no leftover libraries or temp files persist after completion. The team's main goal is to eliminate the roughly 20 hours of daily idle cost from their current always-on 20-node Dataproc cluster while still giving every nightly run the exact Spark configuration it needs. Which approach best meets these requirements?
Use a Dataproc Workflow Template that creates a job-scoped (ephemeral) cluster, runs the Spark ETL job with Cloud Storage as the default file system, and deletes the cluster automatically after the workflow succeeds or fails.
Rewrite the nightly pipeline as scheduled queries in BigQuery and drop Dataproc altogether.
Convert the Spark job to a streaming Dataflow pipeline launched from a Flex Template, allowing Dataflow to scale workers down after processing completes.
Keep the existing cluster but enable autoscaling with preemptible secondary workers to shrink the cluster to zero workers when idle.
Creating an ephemeral Dataproc cluster for every ETL run removes the cost of an idle persistent cluster because the cluster exists only for the duration of the job. Workflow Templates (or job APIs) let you specify cluster-scoped hardware, initialization actions, and Cloud Storage as the primary file system, then automatically delete the cluster when the Spark job completes. Autoscaling or preemptible workers on a long-lived cluster would still leave the master nodes running (and incur charges) throughout the day. Re-engineering the pipeline into BigQuery SQL or Dataflow would change the technology stack and is unnecessary for the stated goal of reducing idle Dataproc costs while keeping Spark.
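As a sketch of how such a workflow is defined, the template below shows the managed (job-scoped) cluster and the Spark step in the Dataproc WorkflowTemplate YAML format; all names, machine types, and `gs://` paths here are hypothetical placeholders, not values from the scenario. A file like this can be loaded with `gcloud dataproc workflow-templates import` and run with `gcloud dataproc workflow-templates instantiate`, which creates the cluster, runs the job, and deletes the cluster when the workflow finishes.

```yaml
# workflow.yaml — hypothetical example; import with:
#   gcloud dataproc workflow-templates import nightly-etl \
#     --region=us-central1 --source=workflow.yaml
placement:
  managedCluster:          # cluster created per run, deleted afterwards
    clusterName: etl-ephemeral
    config:
      masterConfig:
        machineTypeUri: n2-standard-8
      workerConfig:
        numInstances: 20
        machineTypeUri: n2-standard-8
jobs:
  - stepId: sales-etl
    sparkJob:
      mainClass: com.example.SalesEtl          # placeholder class name
      jarFileUris:
        - gs://example-bucket/jars/sales-etl.jar
      args:                                    # read from and write to Cloud Storage
        - gs://example-bucket/sales-logs/
        - gs://example-bucket/output/
```

Because the input and output paths are `gs://` URIs, the job uses the Cloud Storage connector as its file system, so no data needs to outlive the cluster.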
GCP Professional Data Engineer
Maintaining and automating data workloads