Your team must design a data-cleansing layer for a clickstream pipeline that ingests millions of events per minute from Pub/Sub. The layer must apply schema validation, deduplication, and null handling; mask email addresses with Cloud DLP; support the same code path for historical backfills from Cloud Storage; and minimize infrastructure management. Which solution should you implement?
Build cleansing recipes in Cloud Dataprep, schedule them as Dataflow jobs for Pub/Sub data, and create a separate Dataprep flow to process historical files from Cloud Storage.
Provision a long-running Dataproc cluster with Spark Structured Streaming to read from Pub/Sub, use custom UDFs that invoke Cloud DLP for masking, and run additional Spark batch jobs on the cluster for backfills.
Develop an Apache Beam pipeline and run it on Cloud Dataflow; use Beam schemas for validation and deduplication, apply the Cloud DLP transform to mask email addresses, and execute the same pipeline in batch mode for Cloud Storage backfills.
Configure a Cloud Data Fusion replication pipeline that reads Pub/Sub messages, applies Wrangler transformations with a DLP plugin, writes to BigQuery, and enable a parallel batch pipeline for Cloud Storage files.
An Apache Beam pipeline running on Cloud Dataflow satisfies every requirement. Beam's unified model lets a single pipeline execute in streaming mode for Pub/Sub ingestion and in batch mode for historical files in Cloud Storage, so there is no second codebase to maintain. Beam schemas and built-in transforms provide schema validation, null handling, windowing, and deduplication. The Cloud DLP transform for Dataflow invokes Google Cloud DLP to identify and mask email addresses while data is in flight. Cloud Dataflow is a serverless, fully managed service that automatically provisions and scales workers, so there are no clusters to administer. The sketch below illustrates the single code path.
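A minimal sketch of that unified pipeline in the Beam Python SDK. The flag names, the `event_id`/`timestamp` fields, and the 60-second dedup window are illustrative assumptions, not part of the question; Beam's schema-aware transforms could replace the hand-rolled validation shown here.

```python
import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_and_validate(raw):
    """Parse a JSON event, enforce required fields, and patch nulls."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return  # drop malformed records
    if not event.get("event_id") or not event.get("timestamp"):
        return  # schema validation: skip records missing required fields
    event.setdefault("referrer", "unknown")  # simple null handling
    yield event


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_subscription")  # set for streaming ingestion
    parser.add_argument("--input_path")          # set for a Cloud Storage backfill
    args, beam_args = parser.parse_known_args(argv)

    with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
        if args.input_subscription:
            # Streaming mode (also pass --streaming): raw bytes from Pub/Sub.
            raw = (p
                   | "ReadPubSub" >> beam.io.ReadFromPubSub(
                       subscription=args.input_subscription)
                   | "Decode" >> beam.Map(lambda b: b.decode("utf-8")))
        else:
            # Batch mode: the same pipeline body reads historical files.
            raw = p | "ReadGCS" >> beam.io.ReadFromText(args.input_path)

        cleaned = (
            raw
            | "Validate" >> beam.FlatMap(parse_and_validate)
            | "KeyById" >> beam.Map(lambda e: (e["event_id"], e))
            | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))
            | "Dedup" >> beam.CombinePerKey(lambda events: next(iter(events)))
            | "Values" >> beam.Values()
        )
        # Downstream: DLP masking and a sink (e.g. BigQuery) would follow here.


if __name__ == "__main__":
    run()
```

Only the source selection differs between the two modes; every transform after it runs unchanged, which is exactly the "same code path" the requirements demand.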
Cloud Dataprep has no native Pub/Sub connector and runs only scheduled batch jobs, so it cannot meet the real-time requirement. Dataproc reduces cluster setup effort, but a long-running cluster must still be sized, patched, and monitored, and you would maintain separate streaming and batch Spark jobs plus a custom DLP integration, adding complexity. Cloud Data Fusion's Replication feature targets database change-data capture rather than Pub/Sub, its pipelines run on underlying Dataproc clusters, and streaming and batch pipelines are authored separately, so there is no single unified code path. Beam on Dataflow with the DLP transform is therefore the most suitable choice.
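For the masking step itself, the Beam Python SDK ships a Cloud DLP transform (`apache_beam.ml.gcp.cloud_dlp.MaskDetectedDetails`) that calls the DLP API on in-flight records. A hedged sketch follows, assuming JSON-serialized events and a placeholder project id; verify the parameter names against the SDK version you run.

```python
import json

import apache_beam as beam
from apache_beam.ml.gcp.cloud_dlp import MaskDetectedDetails

# DLP inspection: look only for email addresses.
INSPECT_CONFIG = {"info_types": [{"name": "EMAIL_ADDRESS"}]}

# DLP de-identification: replace each matched character with '#'.
DEIDENTIFY_CONFIG = {
    "info_type_transformations": {
        "transformations": [{
            "info_types": [{"name": "EMAIL_ADDRESS"}],
            "primitive_transformation": {
                "character_mask_config": {"masking_character": "#"}
            },
        }]
    }
}


def mask_emails(events):
    """Serialize events to text, mask emails via Cloud DLP, parse back."""
    return (
        events
        | "ToJson" >> beam.Map(json.dumps)
        | "MaskEmails" >> MaskDetectedDetails(
            project="your-gcp-project",  # placeholder: your GCP project id
            inspection_config=INSPECT_CONFIG,
            deidentification_config=DEIDENTIFY_CONFIG,
        )
        | "FromJson" >> beam.Map(json.loads)
    )
```

Because the transform calls the DLP API, its quotas and payload-size limits apply; the same step runs unchanged in both streaming and batch executions of the pipeline.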