GCP Professional Data Engineer Practice Question

Your nightly Spark ETL job on Dataproc processes 12 TB from Cloud Storage and must finish within two hours. A persistent three-master, 50-worker cluster currently meets the SLA but sits idle for roughly 20 hours each day, generating unnecessary cost. The job depends on several Python wheels and third-party JARs. Which redesign most effectively minimizes compute cost while preserving the SLA?

  • Port the Spark code to Dataproc Serverless for Spark so the service allocates resources on demand and bills only for what the run uses (sketch 1 below).

  • Attach an autoscaling policy to the existing cluster so that all worker nodes scale to zero after the job finishes, leaving the masters running until the next job (sketch 2 below).

  • Schedule the persistent cluster for automatic deletion two hours after it starts, and manually recreate it each night before the job runs (sketch 3 below).

  • Trigger a workflow template from Cloud Composer that spins up an ephemeral Dataproc cluster with the required initialization actions, runs the Spark job on mostly preemptible worker VMs, and deletes the cluster when the job completes (sketch 4 below).
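
Sketch 1 (Dataproc Serverless): a minimal batch submission. The bucket paths, file names, and region are illustrative assumptions, not values from the question; the wheels and JARs ride along via --py-files and --jars.

    # Submit the ETL as a Serverless for Spark batch; billing covers only the run.
    gcloud dataproc batches submit pyspark gs://my-bucket/jobs/nightly_etl.py \
        --region=us-central1 \
        --py-files=gs://my-bucket/deps/libs.whl \
        --jars=gs://my-bucket/deps/connector.jar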
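
Sketch 2 (autoscaling policy): roughly what the policy and attachment would look like, with assumed names. One caveat worth knowing: a Dataproc autoscaling policy keeps at least two primary workers, so "scale to zero" is only achievable for secondary workers.

    # policy.yaml (illustrative); the timeout is a Duration string.
    workerConfig:
      minInstances: 2      # primary workers cannot scale below two
      maxInstances: 50
    basicAlgorithm:
      yarnConfig:
        scaleUpFactor: 1.0
        scaleDownFactor: 1.0
        gracefulDecommissionTimeout: 3600s

    # Import the policy and attach it to the existing cluster.
    gcloud dataproc autoscaling-policies import nightly-etl-policy \
        --source=policy.yaml --region=us-central1
    gcloud dataproc clusters update etl-cluster \
        --autoscaling-policy=nightly-etl-policy --region=us-central1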
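
Sketch 3 (scheduled deletion): the nightly create with a max-age TTL, under the same illustrative names. Note that the recreation step the option describes stays manual.

    # Cluster self-deletes two hours after creation; recreation is still manual.
    gcloud dataproc clusters create nightly-etl-cluster \
        --region=us-central1 \
        --num-masters=3 --num-workers=50 \
        --initialization-actions=gs://my-bucket/init/install_deps.sh \
        --max-age=2h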
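
Sketch 4 (workflow template with an ephemeral cluster): the template definition, again with assumed names, sizes, and paths. Cloud Composer would trigger the instantiate step nightly (the Google provider's DataprocInstantiateWorkflowTemplateOperator is one way to do it); the managed cluster is created for the run and deleted automatically when the job finishes.

    # Managed (ephemeral) cluster: mostly preemptible secondary workers;
    # an init action installs the wheel/JAR dependencies.
    gcloud dataproc workflow-templates create nightly-etl --region=us-central1
    gcloud dataproc workflow-templates set-managed-cluster nightly-etl \
        --region=us-central1 \
        --cluster-name=ephemeral-etl \
        --num-workers=10 \
        --num-secondary-workers=40 \
        --secondary-worker-type=preemptible \
        --initialization-actions=gs://my-bucket/init/install_deps.sh
    gcloud dataproc workflow-templates add-job pyspark gs://my-bucket/jobs/nightly_etl.py \
        --region=us-central1 \
        --workflow-template=nightly-etl \
        --step-id=etl \
        --py-files=gs://my-bucket/deps/libs.whl \
        --jars=gs://my-bucket/deps/connector.jar

    # Composer triggers this nightly; the manual equivalent is:
    gcloud dataproc workflow-templates instantiate nightly-etl --region=us-central1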

Objective: Maintaining and automating data workloads