Your company streams purchase events from Pub/Sub into BigQuery for near real-time dashboards. Compliance requires that any primary account number (PAN) in the field card_number is tokenized before it is written. Business analysts also need each record to contain a non-empty order_id and want duplicate order_id values to be discarded if they arrive again within 24 hours. You must keep end-to-end latency below five seconds and avoid managing cluster infrastructure. Which design should you implement to satisfy the cleansing requirements?
Deploy a long-running Dataproc Spark Streaming job that calls Cloud DLP for tokenization, removes duplicate order_id values, stores Parquet files in Cloud Storage, and triggers a BigQuery load job every hour.
Use the Pub/Sub to BigQuery streaming template without modification and rely on BigQuery policy tags to mask the card_number column, accepting all rows and deduplicating later with a nightly BigQuery MERGE job.
Build a streaming Dataflow pipeline that invokes Cloud DLP to tokenize card_number, filters out events with a null order_id, applies a 24-hour windowed Distinct on order_id, and writes the cleansed stream to BigQuery via the Storage Write API.
Create an hourly Cloud Data Fusion batch pipeline that pulls messages from Pub/Sub, uses the built-in Cloud DLP plugin to tokenize card_number, deduplicates on order_id, and then loads the result into BigQuery.
A streaming Dataflow pipeline can call the Cloud DLP API through the provided DLP transform (or the DLP De-identify template) to tokenize the card_number field before any data leaves the pipeline. Inside the same Beam pipeline you can apply a filter that removes events with a null or empty order_id and then add a 24-hour fixed (or session) window followed by a Distinct transform keyed on order_id to drop repeats. Dataflow is serverless, auto-scales, and typically delivers sub-second to a few-second latency, so it meets the operational and latency constraints. The other options either do not tokenize data before it reaches BigQuery, cannot meet the five-second latency target (batch Cloud Data Fusion or hourly Dataproc jobs), or rely on masking in BigQuery, which does not remove the clear-text PAN before storage and therefore violates the compliance requirement.
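The cleansing logic the explanation describes can be sketched in plain Python. This is a minimal, self-contained illustration of the semantics, not the Beam pipeline itself: the HMAC-based `tokenize_pan` is a stand-in for a Cloud DLP deterministic-encryption `deidentify` request, and the in-memory `DedupFilter` plays the role of the 24-hour window plus Distinct transform. All names here are illustrative.

```python
import hashlib
import hmac
from datetime import datetime, timedelta

# Stand-in for a Cloud DLP deterministic-encryption call. In the real
# pipeline this would be a DLP deidentifyContent request backed by a
# key in Cloud KMS; the hard-coded key here is for illustration only.
SECRET_KEY = b"demo-tokenization-key"

def tokenize_pan(card_number: str) -> str:
    """Replace the PAN with a deterministic token (hypothetical stand-in)."""
    return hmac.new(SECRET_KEY, card_number.encode(), hashlib.sha256).hexdigest()

class DedupFilter:
    """Drops order_ids already seen within the look-back window.

    In Beam this role is played by the 24-hour window plus Distinct;
    this in-memory version only illustrates the semantics.
    """
    def __init__(self, window: timedelta = timedelta(hours=24)):
        self.window = window
        self.seen: dict[str, datetime] = {}

    def accept(self, order_id: str, event_time: datetime) -> bool:
        last = self.seen.get(order_id)
        if last is not None and event_time - last < self.window:
            return False          # duplicate inside the 24-hour window
        self.seen[order_id] = event_time
        return True

def cleanse(events, dedup: DedupFilter):
    """Filter empty order_ids, drop 24h duplicates, tokenize card_number."""
    for e in events:
        if not e.get("order_id"):   # analysts require a non-empty order_id
            continue
        if not dedup.accept(e["order_id"], e["event_time"]):
            continue
        out = dict(e)
        out["card_number"] = tokenize_pan(e["card_number"])
        yield out
```

In the actual Dataflow job, `cleanse` would be split into Beam transforms (a `Filter`, a windowed `Distinct` or `Deduplicate`, and a `ParDo` calling the DLP API) before the `WriteToBigQuery` step using the Storage Write API.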
GCP Professional Data Engineer
Ingesting and processing the data