A retailer is migrating its on-prem Hadoop environment to Google Cloud.
A Spark ETL job processes 4 TB of data each night and must finish before 02:00.
Data scientists launch unpredictable, short-lived Spark SQL sessions during business hours that need up to 32 vCPUs and low query latency.
All datasets must persist independently of any compute lifecycle.
Finance wants to minimize spend on idle resources. Which approach best satisfies the requirements?
Create a single persistent Dataproc cluster sized for the peak interactive workload, keep it running 24×7, and store all data on the cluster's HDFS disks.
Run the nightly ETL in an ephemeral Dataproc job cluster that reads and writes to Cloud Storage, and keep a small persistent Dataproc cluster with autoscaling enabled for data-science exploration.
Use Dataproc Serverless for the nightly ETL and migrate interactive analytics to BigQuery, storing all data in Cloud Storage.
Run both the nightly ETL and each interactive Spark SQL session in separate ephemeral Dataproc clusters that read from and write to Cloud Storage, deleting every cluster when its job completes.
Ephemeral (job-scoped) Dataproc clusters eliminate idle cost because each cluster exists only for the duration of one job or interactive session and is deleted as soon as it finishes. Keeping the data in Cloud Storage satisfies the requirement that datasets outlive any cluster, and it avoids paying for persistent disks to hold HDFS. Giving the nightly ETL its own ephemeral cluster lets you size the hardware for the 4 TB batch so that it finishes before 02:00. Launching an on-demand ephemeral cluster for each ad-hoc session gives data scientists the 32-vCPU, low-latency environment they need without a large cluster running all day.

The other options each miss a requirement. A single always-on cluster pays for idle capacity around the clock, and storing data on its HDFS disks ties persistence to the cluster's lifecycle. A small persistent cluster with autoscaling still bills for its baseline nodes when no one is using it. Moving interactive work to BigQuery replaces the Spark SQL engine the data scientists use, a re-platforming the scenario did not ask for.
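The create, run, delete lifecycle described above can be sketched as a small Python helper that assembles the corresponding `gcloud dataproc` commands. This is only an illustrative sketch: the region, bucket, cluster name, script path, and the 4 x `n2-standard-8` sizing (32 vCPUs of workers) are assumptions, not values from the scenario, and the helper prints the commands rather than executing them.

```python
from typing import List

REGION = "us-central1"               # assumption: any Dataproc region works
BUCKET = "gs://example-retail-data"  # hypothetical bucket holding data and job scripts


def create_cluster_cmd(name: str, workers: int, machine_type: str,
                       max_idle: str = "30m") -> List[str]:
    """Build the command that creates a job-scoped cluster.

    --max-idle is a safety net: Dataproc deletes the cluster after this
    much idle time even if the explicit delete step never runs.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={REGION}",
        f"--num-workers={workers}",
        f"--worker-machine-type={machine_type}",
        f"--max-idle={max_idle}",
    ]


def submit_pyspark_cmd(cluster: str, script_uri: str) -> List[str]:
    """Build the command that submits a PySpark job stored in Cloud Storage."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script_uri,
        f"--cluster={cluster}",
        f"--region={REGION}",
    ]


def delete_cluster_cmd(name: str) -> List[str]:
    """Build the command that deletes the cluster without a confirmation prompt."""
    return [
        "gcloud", "dataproc", "clusters", "delete", name,
        f"--region={REGION}",
        "--quiet",
    ]


def ephemeral_run(name: str, workers: int, machine_type: str,
                  script_uri: str) -> List[List[str]]:
    """Return the three commands of one ephemeral run, in execution order."""
    return [
        create_cluster_cmd(name, workers, machine_type),
        submit_pyspark_cmd(name, script_uri),
        delete_cluster_cmd(name),
    ]


if __name__ == "__main__":
    # Hypothetical nightly ETL: 4 workers x 8 vCPUs each.
    for cmd in ephemeral_run("nightly-etl", 4, "n2-standard-8",
                             f"{BUCKET}/jobs/etl.py"):
        print(" ".join(cmd))
```

If the commands were actually executed (for example with `subprocess.run`), the delete step would belong in a `try`/`finally` so that a failed job does not leave a billable cluster behind. Because the job reads and writes `gs://` paths, nothing is lost when the cluster is deleted.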
GCP Professional Data Engineer
Maintaining and automating data workloads