GCP Professional Data Engineer Practice Question

Your media company runs three independent Spark batch pipelines every hour. Each pipeline finishes on a 20-node Dataproc cluster in about 10 minutes, after which the cluster remains idle until the next hour. Engineers must continue using a proprietary Spark I/O connector that is not supported outside Dataproc. You need to cut compute costs without increasing job runtime or giving up the proprietary connector. What should you do?

  • Use Dataproc workflow templates to spin up an ephemeral cluster for each pipeline, configure preemptible secondary workers, store all data on Cloud Storage, and delete the cluster when the job completes.

  • Purchase a fixed BigQuery Standard Edition reservation sized for the three hourly jobs and rewrite the Spark pipelines as SQL queries.

  • Keep a single long-running Dataproc cluster but attach an autoscaling policy so workers scale down to zero between hourly runs.

  • Migrate the pipelines to Cloud Dataflow with streaming autoscaling templates that read from Pub/Sub and write to BigQuery.
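For reference, the ephemeral-cluster pattern described in the first option can be expressed as an inline Dataproc workflow template. The sketch below uses the google-cloud-dataproc Python client; the project, region, bucket, jar path, main class, node counts, and machine types are placeholder assumptions rather than values taken from the question.

```python
# Minimal sketch: instantiate an inline Dataproc workflow template that
# creates a managed (ephemeral) cluster, runs one Spark job, and deletes
# the cluster when the job finishes. All names below are placeholders.
from google.cloud import dataproc_v1

PROJECT = "my-project"            # placeholder project ID
REGION = "us-central1"            # placeholder region
CONNECTOR_JAR = "gs://my-bucket/jars/proprietary-connector.jar"  # placeholder

client = dataproc_v1.WorkflowTemplateServiceClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

template = {
    "id": "hourly-pipeline-a",
    "placement": {
        "managed_cluster": {
            "cluster_name": "pipeline-a-ephemeral",
            "config": {
                "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
                "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
                # Cheaper preemptible secondary workers carry most of the load.
                "secondary_worker_config": {
                    "num_instances": 18,
                    "preemptibility": "PREEMPTIBLE",
                },
            },
        }
    },
    "jobs": [
        {
            "step_id": "run-pipeline-a",
            "spark_job": {
                "main_class": "com.example.PipelineA",   # placeholder main class
                "jar_file_uris": [CONNECTOR_JAR],
                "args": ["--input", "gs://my-bucket/input", "--output", "gs://my-bucket/output"],
            },
        }
    ],
}

# Kick off the workflow; result() blocks until the job completes and the
# managed cluster has been torn down.
operation = client.instantiate_inline_workflow_template(
    parent=f"projects/{PROJECT}/regions/{REGION}", template=template
)
operation.result()
```

In this pattern, one such template per pipeline would be triggered each hour (for example by Cloud Scheduler or an orchestrator), so compute is billed only for the roughly ten minutes each cluster exists, while the jobs still run on Dataproc and can keep using the proprietary connector.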
