GCP Professional Data Engineer Practice Question

Your company stores all raw and curated data for its enterprise-wide data lake in regional Cloud Storage buckets. A 40-minute Spark job must convert the previous day's 3 TB of application logs from JSON to partitioned Parquet each night. Leadership wants to pay for compute only while the transformation runs and to delete all cluster resources immediately afterward, without risking data loss and without adding an extra data-copy step. Which design satisfies these requirements?

  • Attach local SSDs to each worker, copy the logs to the SSDs, perform the Spark conversion, and rely on VM snapshots to preserve the Parquet files when the cluster shuts down.

  • Launch a Dataproc cluster on demand, run the Spark job with input and output paths set to gs:// buckets, and configure the cluster to auto-delete immediately after the job completes.

  • Load the JSON logs into the cluster's HDFS, run the Spark conversion there, then copy the Parquet files back to Cloud Storage before manually deleting the cluster.

  • Create a long-running Dataproc cluster that persists the logs and Parquet output in Bigtable tables mounted on the cluster; shut down only the worker VMs overnight.
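
For reference, the sketch below illustrates the Spark transformation described in the stem as a minimal PySpark job, assuming hypothetical bucket names and an event_date column in the logs to partition by. On Dataproc, the preinstalled Cloud Storage connector lets Spark read and write gs:// paths directly, so no staging copy into HDFS or onto local SSDs is required; the cluster lifecycle itself (create on demand, delete when the job finishes) can be handled outside the job, for example with Dataproc scheduled deletion or a workflow template that uses a managed cluster.

    from datetime import date, timedelta

    from pyspark.sql import SparkSession

    # Hypothetical bucket names and log layout; adjust to your environment.
    RAW_BUCKET = "gs://example-datalake-raw"
    CURATED_BUCKET = "gs://example-datalake-curated"


    def main() -> None:
        spark = SparkSession.builder.appName("json-to-parquet-nightly").getOrCreate()

        # Previous day's application logs, e.g. .../logs/dt=2024-05-01/*.json
        run_date = (date.today() - timedelta(days=1)).isoformat()
        input_path = f"{RAW_BUCKET}/logs/dt={run_date}/*.json"
        output_path = f"{CURATED_BUCKET}/logs_parquet"

        # The Cloud Storage connector on Dataproc resolves gs:// paths natively,
        # so the job reads from and writes to the buckets without an HDFS copy.
        logs = spark.read.json(input_path)

        (
            logs.write
            .mode("append")
            .partitionBy("event_date")  # assumed partition column in the logs
            .parquet(output_path)
        )

        spark.stop()


    if __name__ == "__main__":
        main()

Submitted as a PySpark job against the cluster, this script leaves every input and output in Cloud Storage, so deleting the cluster immediately after the run loses no data.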

Objective: Storing the data