GCP Professional Data Engineer Practice Question

A retailer is migrating its on-prem Hadoop environment to Google Cloud.

  • A Spark ETL job processes 4 TB of data each night and must finish before 02:00.
  • Data scientists launch unpredictable, short-lived Spark SQL sessions during business hours that need up to 32 vCPUs and low query latency.
  • All datasets must persist independently of any compute lifecycle.
  • Finance wants to minimize spend on idle resources.
Which approach best satisfies the requirements?

  • Run the nightly ETL in an ephemeral Dataproc job cluster that reads and writes to Cloud Storage, and keep a small persistent Dataproc cluster with autoscaling enabled for data-science exploration.

  • Use Dataproc Serverless for the nightly ETL and migrate interactive analytics to BigQuery, storing all data in Cloud Storage.

  • Create a single persistent Dataproc cluster sized for the peak interactive workload, keep it running 24×7, and store all data on the cluster's HDFS disks.

  • Run both the nightly ETL and each interactive Spark SQL session in separate ephemeral Dataproc clusters that read from and write to Cloud Storage, deleting every cluster when its job completes.
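For context, the ephemeral job-cluster pattern named in the options above can be sketched with `gcloud` commands. This is a minimal sketch, not part of the question: the project, region, cluster name, bucket paths, and job class (`my-project`, `etl-cluster`, `gs://retail-data/...`, `com.example.NightlyEtl`) are all placeholder assumptions.

```shell
# Create a short-lived Dataproc cluster for the nightly ETL; --max-idle
# schedules automatic deletion after 10 minutes of inactivity, so no one
# pays for an idle cluster once the job finishes.
gcloud dataproc clusters create etl-cluster \
  --project=my-project \
  --region=us-central1 \
  --max-idle=10m

# Submit the Spark ETL job. Input and output live in Cloud Storage, so the
# data persists independently of the cluster lifecycle.
gcloud dataproc jobs submit spark \
  --project=my-project \
  --region=us-central1 \
  --cluster=etl-cluster \
  --class=com.example.NightlyEtl \
  --jars=gs://retail-data/jars/etl.jar \
  -- gs://retail-data/raw/ gs://retail-data/curated/

# The Dataproc Serverless variant runs the same batch with no cluster to
# create or delete at all:
gcloud dataproc batches submit spark \
  --project=my-project \
  --region=us-central1 \
  --class=com.example.NightlyEtl \
  --jars=gs://retail-data/jars/etl.jar \
  -- gs://retail-data/raw/ gs://retail-data/curated/
```

Either form satisfies the persistence and idle-cost requirements because the datasets live in Cloud Storage rather than on cluster-local HDFS.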

Exam: GCP Professional Data Engineer
Objective: Maintaining and automating data workloads