Your team must design a data-cleansing layer for a clickstream pipeline that ingests millions of events per minute from Pub/Sub. The layer must apply schema validation, deduplication, and null handling; mask email addresses with Cloud DLP; support the same code path for historical backfills from Cloud Storage; and minimize infrastructure management. Which solution should you implement?
Build cleansing recipes in Cloud Dataprep, schedule them as Dataflow jobs for Pub/Sub data, and create a separate Dataprep flow to process historical files from Cloud Storage.
Provision a long-running Dataproc cluster with Spark Structured Streaming to read from Pub/Sub, use custom UDFs that invoke Cloud DLP for masking, and run additional Spark batch jobs on the cluster for backfills.
Develop an Apache Beam pipeline and run it on Cloud Dataflow; use Beam schemas for validation and deduplication, apply the Cloud DLP transform to mask email addresses, and execute the same pipeline in batch mode for Cloud Storage backfills.
Configure a Cloud Data Fusion replication pipeline that reads Pub/Sub messages, applies Wrangler transformations with a DLP plugin, writes to BigQuery, and enable a parallel batch pipeline for Cloud Storage files.
An Apache Beam pipeline running on Cloud Dataflow satisfies every requirement. Beam's unified model lets a single pipeline execute in streaming mode for Pub/Sub ingestion and in batch mode for historical files in Cloud Storage, so there is no second codebase to maintain. Beam schemas and built-in transforms provide schema validation, null handling, windowing, and deduplication. The Cloud DLP transform for Dataflow invokes Google Cloud DLP to identify and mask email addresses while data is in flight. Cloud Dataflow is a serverless, fully managed service that automatically provisions and scales workers, so there are no clusters to administer. The sketch below illustrates the single code path.
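A minimal sketch of that unified pipeline in the Beam Python SDK. The flag names, the `event_id`/`timestamp` fields, and the 60-second dedup window are illustrative assumptions, not part of the question; Beam's schema-aware transforms could replace the hand-rolled validation shown here.

```python
import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_and_validate(raw):
    """Parse a JSON event, enforce required fields, and patch nulls."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return  # drop malformed records
    if not event.get("event_id") or not event.get("timestamp"):
        return  # schema validation: skip records missing required fields
    event.setdefault("referrer", "unknown")  # simple null handling
    yield event


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_subscription")  # set for streaming ingestion
    parser.add_argument("--input_path")          # set for a Cloud Storage backfill
    args, beam_args = parser.parse_known_args(argv)

    with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
        if args.input_subscription:
            # Streaming mode (also pass --streaming): raw bytes from Pub/Sub.
            raw = (p
                   | "ReadPubSub" >> beam.io.ReadFromPubSub(
                       subscription=args.input_subscription)
                   | "Decode" >> beam.Map(lambda b: b.decode("utf-8")))
        else:
            # Batch mode: the same pipeline body reads historical files.
            raw = p | "ReadGCS" >> beam.io.ReadFromText(args.input_path)

        cleaned = (
            raw
            | "Validate" >> beam.FlatMap(parse_and_validate)
            | "KeyById" >> beam.Map(lambda e: (e["event_id"], e))
            | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))
            | "Dedup" >> beam.CombinePerKey(lambda events: next(iter(events)))
            | "Values" >> beam.Values()
        )
        # Downstream: DLP masking and a sink (e.g. BigQuery) would follow here.


if __name__ == "__main__":
    run()
```

Only the source selection differs between the two modes; every transform after it runs unchanged, which is exactly the "same code path" the requirements demand.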
Cloud Dataprep has no native Pub/Sub connector and runs only scheduled batch jobs, so it cannot meet the real-time requirement. Dataproc reduces cluster setup effort, but a long-running cluster must still be sized, patched, and monitored, and you would maintain separate streaming and batch Spark jobs plus a custom DLP integration, adding complexity. Cloud Data Fusion's Replication feature targets database change-data capture rather than Pub/Sub, its pipelines run on underlying Dataproc clusters, and streaming and batch pipelines are authored separately, so there is no single unified code path. Beam on Dataflow with the DLP transform is therefore the most suitable choice.
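For the masking step itself, the Beam Python SDK ships a Cloud DLP transform (`apache_beam.ml.gcp.cloud_dlp.MaskDetectedDetails`) that calls the DLP API on in-flight records. A hedged sketch follows, assuming JSON-serialized events and a placeholder project id; verify the parameter names against the SDK version you run.

```python
import json

import apache_beam as beam
from apache_beam.ml.gcp.cloud_dlp import MaskDetectedDetails

# DLP inspection: look only for email addresses.
INSPECT_CONFIG = {"info_types": [{"name": "EMAIL_ADDRESS"}]}

# DLP de-identification: replace each matched character with '#'.
DEIDENTIFY_CONFIG = {
    "info_type_transformations": {
        "transformations": [{
            "info_types": [{"name": "EMAIL_ADDRESS"}],
            "primitive_transformation": {
                "character_mask_config": {"masking_character": "#"}
            },
        }]
    }
}


def mask_emails(events):
    """Serialize events to text, mask emails via Cloud DLP, parse back."""
    return (
        events
        | "ToJson" >> beam.Map(json.dumps)
        | "MaskEmails" >> MaskDetectedDetails(
            project="your-gcp-project",  # placeholder: your GCP project id
            inspection_config=INSPECT_CONFIG,
            deidentification_config=DEIDENTIFY_CONFIG,
        )
        | "FromJson" >> beam.Map(json.loads)
    )
```

Because the transform calls the DLP API, its quotas and payload-size limits apply; the same step runs unchanged in both streaming and batch executions of the pipeline.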