Your company stores all raw and curated data for its enterprise-wide data lake in regional Cloud Storage buckets. A 40-minute Spark job must convert the previous day's 3 TB of application logs from JSON to partitioned Parquet each night. Leadership wants to pay for compute only while the transformation runs and to delete all cluster resources immediately afterward, without risking data loss or an extra data-copy step. Which design satisfies these requirements?
Attach local SSDs to each worker, copy the logs to the SSDs, perform the Spark conversion, and rely on VM snapshots to preserve the Parquet files when the cluster shuts down.
Launch a Dataproc cluster on demand, run the Spark job with input and output paths set to gs:// buckets, and configure the cluster to auto-delete immediately after the job completes.
Load the JSON logs into the cluster's HDFS, run the Spark conversion there, then copy the Parquet files back to Cloud Storage before manually deleting the cluster.
Create a long-running Dataproc cluster that persists the logs and Parquet output in Bigtable tables mounted on the cluster; shut down only the worker VMs overnight.
Reading the input logs from gs:// paths and writing the transformed Parquet files back to Cloud Storage lets the job rely on Cloud Storage's durable, decoupled object storage instead of cluster-local HDFS. Submitting the job to an ephemeral Dataproc cluster that is configured to auto-delete when the job finishes means the VMs (and their attached disks) exist only for the job's duration, so you pay for compute and persistent disks only while the processing runs. All data remains in Cloud Storage after the cluster is gone, eliminating extra data-copy steps and the risk of data loss.
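For illustration, a minimal PySpark sketch of such a nightly job might look like the following; the bucket names, dated prefix, and the `event_date` partition column are assumptions, not details given in the question.

```python
from pyspark.sql import SparkSession

# On an ephemeral Dataproc cluster, read raw JSON logs directly from Cloud Storage
# and write partitioned Parquet back to Cloud Storage; nothing is staged on HDFS.
spark = SparkSession.builder.appName("nightly-json-to-parquet").getOrCreate()

# Hypothetical layout: the previous day's logs live under a dated prefix.
raw_logs = spark.read.json("gs://example-raw-logs/dt=2024-01-01/*.json")

(
    raw_logs
    .write
    .mode("overwrite")
    .partitionBy("event_date")  # assumed partition column present in the logs
    .parquet("gs://example-curated/application_logs/")
)

spark.stop()
```

Because both paths are gs:// URIs, the output is durable the moment the write finishes, so the cluster can be deleted immediately afterward with no copy-back step.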
Staging data on HDFS or local SSDs (as in the other options) would require copying it back to Cloud Storage before deletion, or would lose it when the VMs are removed. Persisting data in Bigtable on a long-running cluster keeps those resources, and their costs, allocated continuously, defeating the pay-only-while-running goal.
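As a sketch of the auto-delete portion of the correct design: assuming the google-cloud-dataproc Python client library, an ephemeral cluster can be created with an idle-delete TTL so Dataproc removes the whole cluster shortly after the submitted job finishes. The project ID, region, machine sizing, and 10-minute TTL below are illustrative only.

```python
from google.cloud import dataproc_v1

PROJECT = "example-project"  # hypothetical project ID
REGION = "us-central1"       # hypothetical region

# Dataproc cluster operations use a regional endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT,
    "cluster_name": "nightly-json-to-parquet",  # illustrative name
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 8, "machine_type_uri": "n1-standard-8"},
        # Delete the cluster after 10 idle minutes, i.e. shortly after the
        # nightly Spark job completes, so compute costs stop automatically.
        "lifecycle_config": {"idle_delete_ttl": {"seconds": 600}},
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
)
operation.result()  # blocks until the cluster exists; submit the Spark job next
```

The same effect is commonly achieved from the command line with an idle-TTL flag on cluster creation or by running the job through a Dataproc workflow template, which provisions and tears down the cluster around the job.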