Your team is building a streaming Apache Beam pipeline that reads click-stream events from Pub/Sub and writes them to BigQuery. Each event has a globally unique field called event_id and several required dimension columns such as user_id and page_id. Occasionally, the upstream system retries publishes, which produces exact duplicate events with the same event_id. Some events also arrive with a null user_id that violates downstream business rules. The pipeline must:
1. guarantee that no duplicate rows are inserted into BigQuery,
2. divert any event whose user_id is null to a separate Pub/Sub dead-letter topic, and
3. ensure that state held for deduplication cannot grow without bound as the unbounded stream continues.
Which implementation in Dataflow best satisfies these requirements while minimizing operational overhead?
Add a fixed-duration window (for example, one hour) to the event stream, apply the Beam Distinct (or RemoveDuplicates) transform keyed by event_id, then use a ParDo with side outputs to send events with null user_id to a Pub/Sub dead-letter topic before writing the main output to BigQuery.
Enable exactly-once delivery on the Pub/Sub subscription to prevent duplicates, then call the Data Loss Prevention API from a ParDo to remove records that have null user_id values.
Send the data directly to BigQuery using the streaming API and set insertId to event_id; configure the table schema so user_id is NULLABLE to avoid rejects.
Keep the stream in the global window, apply Distinct on event_id with an allowed lateness of seven days, and rely on BigQuery load errors to reject rows where user_id is null.
Applying a fixed (or session) window before deduplication constrains state growth: each window maintains only the keys seen within its time span, and once the watermark passes the end of a window (plus any allowed lateness), that window's state can be garbage-collected. Within each window, the Beam Distinct (or RemoveDuplicates) transform, keyed on event_id via a representative value function, removes the replayed events. A subsequent ParDo can then inspect each record, emitting valid elements to the main output (ultimately written to BigQuery) and routing records whose user_id is null to a side output that a Pub/Sub sink publishes to a dead-letter topic. Relying on BigQuery's best-effort deduplication (insertId) or Pub/Sub's exactly-once delivery would address neither state growth nor the need to quarantine bad records within the pipeline, and keeping the stream in only the global window would cause unbounded state accumulation.
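For illustration, here is a minimal Apache Beam Python sketch of this pattern, under a few assumptions: the subscription, dead-letter topic, and table names are hypothetical placeholders; events arrive as JSON; and because the retried publishes are exact duplicates, keeping one element per event_id per window (keying plus Latest.PerKey here stands in for the Java SDK's Distinct.withRepresentativeValueFn) is sufficient for deduplication. This is a sketch of the approach, not a production-ready pipeline.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Placeholder resource names -- these are assumptions, not values from
# the question; substitute your own project, subscription, topic, table.
INPUT_SUBSCRIPTION = 'projects/my-project/subscriptions/clickstream-sub'
DEAD_LETTER_TOPIC = 'projects/my-project/topics/clickstream-dead-letter'
OUTPUT_TABLE = 'my-project:analytics.click_events'


class RouteNullUserId(beam.DoFn):
    """Send events with a null user_id to a tagged side output."""
    DEAD_LETTER = 'dead_letter'

    def process(self, event):
        if event.get('user_id') is None:
            yield beam.pvalue.TaggedOutput(self.DEAD_LETTER, event)
        else:
            yield event


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        deduped = (
            p
            | 'ReadPubSub' >> beam.io.ReadFromPubSub(
                subscription=INPUT_SUBSCRIPTION)
            | 'Parse' >> beam.Map(json.loads)
            # Fixed windows bound the deduplication state: once the watermark
            # passes the end of a window, its per-key state is released.
            | 'HourlyWindows' >> beam.WindowInto(window.FixedWindows(60 * 60))
            # Retried publishes are exact duplicates, so keeping one element
            # per event_id per window removes them.
            | 'KeyByEventId' >> beam.WithKeys(lambda e: e['event_id'])
            | 'OnePerKey' >> beam.combiners.Latest.PerKey()
            | 'DropKeys' >> beam.Values())

        routed = deduped | 'RouteNulls' >> beam.ParDo(
            RouteNullUserId()).with_outputs(
                RouteNullUserId.DEAD_LETTER, main='valid')

        # Main output: valid events go to BigQuery (table assumed to exist).
        routed.valid | 'WriteBigQuery' >> beam.io.WriteToBigQuery(
            OUTPUT_TABLE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

        # Side output: null-user_id events go to the dead-letter topic.
        (routed[RouteNullUserId.DEAD_LETTER]
         | 'EncodeDeadLetters' >> beam.Map(
             lambda e: json.dumps(e).encode('utf-8'))
         | 'WriteDeadLetters' >> beam.io.WriteToPubSub(
             topic=DEAD_LETTER_TOPIC))


if __name__ == '__main__':
    run()
```

In the Java SDK the same shape would use Window.into(FixedWindows.of(Duration.standardHours(1))), Distinct.withRepresentativeValueFn(...), and a ParDo with TupleTag side outputs. Either way, the hourly window is what keeps deduplication state bounded, because per-key state is dropped once the watermark passes the window's end.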