GCP Professional Data Engineer Practice Question

Your team must design a data-cleansing layer for a clickstream pipeline that ingests millions of events per minute from Pub/Sub. Requirements: apply schema validation, deduplication, and null handling; mask email addresses with Cloud DLP; support the same code path for historical backfills from Cloud Storage; and minimize infrastructure management. Which solution should you implement?

  • Build cleansing recipes in Cloud Dataprep, schedule them as Dataflow jobs for Pub/Sub data, and create a separate Dataprep flow to process historical files from Cloud Storage.

  • Configure a Cloud Data Fusion replication pipeline that reads Pub/Sub messages, applies Wrangler transformations with a DLP plugin, writes to BigQuery, and enable a parallel batch pipeline for Cloud Storage files.

  • Provision a long-running Dataproc cluster with Spark Structured Streaming to read from Pub/Sub, use custom UDFs that invoke Cloud DLP for masking, and run additional Spark batch jobs on the cluster for backfills.

  • Develop an Apache Beam pipeline and run it on Cloud Dataflow; use Beam schemas for validation and deduplication, apply a Cloud DLP transform to mask email addresses, and execute the same pipeline in batch mode for Cloud Storage backfills (a minimal sketch follows the options).
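
For context, below is a minimal sketch of what the Apache Beam / Cloud Dataflow approach could look like, with one code path shared between the streaming (Pub/Sub) run and the batch (Cloud Storage) backfill. The field names, Pub/Sub subscription, Cloud Storage path, BigQuery table, and DLP masking configuration are illustrative assumptions, not values given in the question.

```python
# Sketch only: names below (fields, subscription, paths, table) are assumed placeholders.
import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window

REQUIRED_FIELDS = ("event_id", "user_email", "event_time")  # assumed event schema


def validate(record):
    """Schema validation + null handling: drop events missing required fields."""
    if isinstance(record, bytes):          # Pub/Sub yields bytes, TextIO yields str
        record = record.decode("utf-8")
    event = json.loads(record)
    if all(event.get(f) is not None for f in REQUIRED_FIELDS):
        yield event


class MaskEmail(beam.DoFn):
    """Mask the email field by calling the Cloud DLP API per element
    (a production pipeline would batch these calls)."""

    def __init__(self, project):
        self._project = project

    def setup(self):
        from google.cloud import dlp_v2
        self._dlp = dlp_v2.DlpServiceClient()

    def process(self, event):
        response = self._dlp.deidentify_content(
            request={
                "parent": f"projects/{self._project}",
                "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
                "deidentify_config": {
                    "info_type_transformations": {
                        "transformations": [{
                            "primitive_transformation": {
                                "character_mask_config": {"masking_character": "#"}
                            }
                        }]
                    }
                },
                "item": {"value": event["user_email"]},
            }
        )
        event["user_email"] = response.item.value
        yield event


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", required=True)
    parser.add_argument("--subscription")   # set for the streaming (Pub/Sub) run
    parser.add_argument("--input_path")     # set for the batch (Cloud Storage) backfill
    known, beam_args = parser.parse_known_args(argv)

    opts = PipelineOptions(beam_args)
    if known.subscription:
        opts.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=opts) as p:
        if known.subscription:
            raw = p | "ReadPubSub" >> beam.io.ReadFromPubSub(subscription=known.subscription)
        else:
            raw = p | "ReadGCS" >> beam.io.ReadFromText(known.input_path)

        (raw
         | "Validate" >> beam.FlatMap(validate)
         | "Window" >> beam.WindowInto(window.FixedWindows(60))
         | "KeyById" >> beam.Map(lambda e: (e["event_id"], e))
         | "GroupById" >> beam.GroupByKey()
         | "Dedup" >> beam.Map(lambda kv: next(iter(kv[1])))   # keep one event per id
         | "MaskEmail" >> beam.ParDo(MaskEmail(known.project))
         | "WriteBQ" >> beam.io.WriteToBigQuery(
             f"{known.project}:clickstream.cleansed_events",   # assumed destination table
             schema="event_id:STRING,user_email:STRING,event_time:STRING",
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))


if __name__ == "__main__":
    run()
```

The same transforms run unchanged in both modes; only the source changes, which is what makes this the option that reuses one code path while Dataflow handles the infrastructure.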

Exam: GCP Professional Data Engineer
Objective: Ingesting and processing the data