GCP Professional Data Engineer Practice Question

Your team operates a 20-node persistent Dataproc cluster, each worker provisioned with 2 TB of persistent disk. A nightly Spark batch job ingests about 1 TB of new data, writes aggregated results, and then the cluster sits idle for roughly 20 hours. Analysts still need access to both the raw and aggregated data until the next run. Which redesign will most effectively reduce storage costs while preserving data availability between jobs?

  • Retain the persistent cluster and add preemptible secondary workers so nodes can be released during idle periods while HDFS replicas stay on primary workers.

  • Migrate the workload to BigQuery for storage and querying, but retain the Dataproc cluster for transformations that BigQuery cannot perform.

  • Switch to a job-scoped (ephemeral) Dataproc cluster configured to use a regional Cloud Storage bucket as the default file system, uploading input and output data to the bucket and deleting the cluster after each run.

  • Shrink worker persistent disks to 100 GB and add local SSDs for shuffle spill, keeping the cluster running so HDFS holds data for analysts.

GCP Professional Data Engineer
Maintaining and automating data workloads
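
For reference, the job-scoped (ephemeral) cluster pattern described in the third option can be sketched with the Dataproc Python client: create a short-lived cluster whose default file system is a regional Cloud Storage bucket, run the nightly Spark job against gs:// paths, then delete the cluster so nothing sits idle. This is a minimal sketch, not a definitive implementation; the project, region, bucket, cluster name, machine types, and the aggregate.py job script are all hypothetical placeholders.

```python
from google.cloud import dataproc_v1

PROJECT = "my-project"      # hypothetical project ID
REGION = "us-central1"      # region of both the cluster and the bucket
BUCKET = "nightly-data"     # hypothetical regional Cloud Storage bucket
CLUSTER = "nightly-batch"   # short-lived, job-scoped cluster name

endpoint = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
clusters = dataproc_v1.ClusterControllerClient(client_options=endpoint)
jobs = dataproc_v1.JobControllerClient(client_options=endpoint)

# Create the ephemeral cluster; core:fs.defaultFS points the default
# Hadoop/Spark file system at the bucket, so data outlives the cluster.
cluster = {
    "project_id": PROJECT,
    "cluster_name": CLUSTER,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 20, "machine_type_uri": "n1-standard-8"},
        "software_config": {"properties": {"core:fs.defaultFS": f"gs://{BUCKET}"}},
    },
}
clusters.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
).result()

# Run the nightly aggregation job; input and output live in Cloud Storage,
# so analysts can still reach raw and aggregated data between runs.
job = {
    "placement": {"cluster_name": CLUSTER},
    "pyspark_job": {"main_python_file_uri": f"gs://{BUCKET}/jobs/aggregate.py"},
}
jobs.submit_job_as_operation(
    request={"project_id": PROJECT, "region": REGION, "job": job}
).result()

# Tear the cluster down; ongoing storage cost is only the Cloud Storage bucket.
clusters.delete_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster_name": CLUSTER}
).result()
```

In practice the same sequence is usually driven by a scheduler (for example Cloud Composer or a Dataproc workflow template) so the cluster exists only for the duration of the nightly run.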