Your team operates a 20-node persistent Dataproc cluster, each worker provisioned with 2 TB of persistent disk. A nightly Spark batch job ingests about 1 TB of new data, writes aggregated results, and then the cluster sits idle for roughly 20 hours. Analysts still need access to both the raw and aggregated data until the next run. Which redesign will most effectively reduce storage costs while preserving data availability between jobs?
Retain the persistent cluster and add preemptible secondary workers so nodes can be released during idle periods while HDFS replicas stay on primary workers.
Migrate the workload to BigQuery for storage and querying, but retain the Dataproc cluster for transformations that BigQuery cannot perform.
Switch to a job-scoped (ephemeral) Dataproc cluster configured to use a regional Cloud Storage bucket as the default file system, uploading input and output data to the bucket and deleting the cluster after each run.
Shrink worker persistent disks to 100 GB and add local SSDs for shuffle spill, keeping the cluster running so HDFS holds data for analysts.
Running an ephemeral Dataproc cluster that reads from and writes to a regional Cloud Storage bucket makes Cloud Storage, rather than HDFS on persistent disks, the system of record. When the job finishes, the cluster is deleted, so you stop paying for idle VM instances and their attached persistent disks. Because the data lives in Cloud Storage, it remains available to analysts and to the next night's job without the cluster (or its disks) having to stay up. The other options fall short: adding preemptible secondary workers and shrinking the worker disks both keep the cluster running with HDFS on the primary workers' persistent disks, and the BigQuery migration introduces another service while retaining the Dataproc cluster and its disks. None of them reduces storage spend as effectively.
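To make the cost reasoning concrete, here is a minimal back-of-the-envelope sketch in Python. The node count, disk size, and idle hours come from the question. The per-TB prices and the amount of data retained in the bucket are hypothetical placeholders, not actual Google Cloud pricing, so substitute current figures before drawing conclusions.

```python
# Figures taken from the question.
WORKERS = 20
DISK_TB_PER_WORKER = 2
IDLE_HOURS_PER_DAY = 20

# ASSUMPTIONS: placeholder values for illustration only, not real pricing.
HYPOTHETICAL_PD_PRICE_PER_TB_MONTH = 40.0   # persistent disk
HYPOTHETICAL_GCS_PRICE_PER_TB_MONTH = 20.0  # regional Cloud Storage
HYPOTHETICAL_STORED_TB = 10.0               # data actually kept in the bucket


def provisioned_disk_tb(workers: int, disk_tb_per_worker: float) -> float:
    """Persistent disk is billed on provisioned capacity, used or not."""
    return workers * disk_tb_per_worker


def idle_fraction(idle_hours: float, hours_per_day: float = 24.0) -> float:
    """Share of each day the persistent cluster runs without doing work."""
    return idle_hours / hours_per_day


def monthly_cost(tb: float, price_per_tb_month: float) -> float:
    """Monthly storage cost for a given capacity and unit price."""
    return tb * price_per_tb_month


if __name__ == "__main__":
    disk_tb = provisioned_disk_tb(WORKERS, DISK_TB_PER_WORKER)
    print(f"Provisioned persistent disk: {disk_tb} TB")
    print(f"Idle share of the day: {idle_fraction(IDLE_HOURS_PER_DAY):.0%}")
    print("Persistent cluster disks (placeholder price):",
          monthly_cost(disk_tb, HYPOTHETICAL_PD_PRICE_PER_TB_MONTH))
    print("Cloud Storage bucket (placeholder price):",
          monthly_cost(HYPOTHETICAL_STORED_TB, HYPOTHETICAL_GCS_PRICE_PER_TB_MONTH))
```

The structural point does not depend on the placeholder prices: the persistent design pays for all provisioned disk capacity around the clock, while the bucket-backed design pays only for the bytes actually stored.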
GCP Professional Data Engineer
Maintaining and automating data workloads