GCP Professional Data Engineer Practice Question

A retailer is migrating its on-prem Hadoop environment to Google Cloud.

  • A Spark ETL job processes 4 TB of data each night and must finish before 02:00.
  • Data scientists launch unpredictable, short-lived Spark SQL sessions during business hours that need up to 32 vCPUs and low query latency.
  • All datasets must persist independently of any compute lifecycle.
  • Finance wants to minimize spend on idle resources.
Which approach best satisfies the requirements?

  • Run the nightly ETL in an ephemeral Dataproc job cluster that reads and writes to Cloud Storage, and keep a small persistent Dataproc cluster with autoscaling enabled for data-science exploration.

  • Use Dataproc Serverless for the nightly ETL and migrate interactive analytics to BigQuery, storing all data in Cloud Storage.

  • Create a single persistent Dataproc cluster sized for the peak interactive workload, keep it running 24×7, and store all data on the cluster's HDFS disks.

  • Run both the nightly ETL and each interactive Spark SQL session in separate ephemeral Dataproc clusters that read from and write to Cloud Storage, deleting every cluster when its job completes.
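For context, the ephemeral job-cluster pattern named in the options above can be sketched with `gcloud` commands. This is a minimal sketch, not part of the question: the project, region, cluster name, bucket paths, and job class (`my-project`, `etl-cluster`, `gs://retail-data/...`, `com.example.NightlyEtl`) are all placeholder assumptions.

```shell
# Create a short-lived Dataproc cluster for the nightly ETL; --max-idle
# schedules automatic deletion after 10 minutes of inactivity, so no one
# pays for an idle cluster once the job finishes.
gcloud dataproc clusters create etl-cluster \
  --project=my-project \
  --region=us-central1 \
  --max-idle=10m

# Submit the Spark ETL job. Input and output live in Cloud Storage, so the
# data persists independently of the cluster lifecycle.
gcloud dataproc jobs submit spark \
  --project=my-project \
  --region=us-central1 \
  --cluster=etl-cluster \
  --class=com.example.NightlyEtl \
  --jars=gs://retail-data/jars/etl.jar \
  -- gs://retail-data/raw/ gs://retail-data/curated/

# The Dataproc Serverless variant runs the same batch with no cluster to
# create or delete at all:
gcloud dataproc batches submit spark \
  --project=my-project \
  --region=us-central1 \
  --class=com.example.NightlyEtl \
  --jars=gs://retail-data/jars/etl.jar \
  -- gs://retail-data/raw/ gs://retail-data/curated/
```

Either form satisfies the persistence and idle-cost requirements because the datasets live in Cloud Storage rather than on cluster-local HDFS.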

Exam: GCP Professional Data Engineer
Objective: Maintaining and automating data workloads