GCP Professional Data Engineer Practice Question

Your team is building a streaming Apache Beam pipeline that reads clickstream events from Pub/Sub and writes them to BigQuery. Each event has a globally unique field called event_id and several required dimension columns such as user_id and page_id. Occasionally, the upstream system retries a publish, which produces exact duplicate events with the same event_id. Some events also arrive with a null user_id, which violates downstream business rules. The pipeline must:

  1. guarantee that no duplicate rows are inserted into BigQuery,
  2. divert any event whose user_id is null to a separate Pub/Sub dead-letter topic, and
  3. ensure that state held for deduplication cannot grow without bound as the unbounded stream continues.

Which implementation in Dataflow best satisfies these requirements while minimizing operational overhead?

  • Send the data directly to BigQuery using the streaming API and set insertId to event_id; configure the table schema so user_id is NULLABLE to avoid rejects.

  • Enable exactly-once delivery on the Pub/Sub subscription to prevent duplicates, then call the Data Loss Prevention API from a ParDo to remove records that have null user_id values.

  • Keep the stream in the global window, apply Distinct on event_id with an allowed lateness of seven days, and rely on BigQuery load errors to reject rows where user_id is null.

  • Add a fixed-duration window (for example, one hour) to the event stream, apply the Beam Distinct (or RemoveDuplicates) transform keyed by event_id, then use a ParDo with side outputs to send events with null user_id to a Pub/Sub dead-letter topic before writing the main output to BigQuery.
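The behavior described in the last option (fixed windows, per-window deduplication on event_id, and a separate output for events with a null user_id) can be modeled in plain Python. This is a minimal sketch of the semantics only, not Beam or Dataflow code. The names process, window_start, and WINDOW_SECONDS, the ts field (event time in epoch seconds), and the simplified watermark (the largest timestamp seen so far) are all assumptions made for illustration.

```python
WINDOW_SECONDS = 3600  # one-hour fixed windows, matching the example in the option


def window_start(ts, size=WINDOW_SECONDS):
    """Return the start of the fixed window that contains timestamp ts."""
    return ts - (ts % size)


def process(events, size=WINDOW_SECONDS):
    """Deduplicate per window and route null-user_id events to a dead-letter list.

    events: iterable of dicts with 'event_id', 'user_id', and 'ts' keys
            (hypothetical shape; 'ts' is event time in epoch seconds).
    Returns (main, dead_letter, seen), where seen is the remaining dedup state.
    """
    seen = {}        # window start -> set of event_ids observed in that window
    watermark = 0    # simplified watermark: the largest timestamp seen so far
    main, dead_letter = [], []

    for event in events:
        watermark = max(watermark, event["ts"])

        # Evict state for windows that have closed, so state stays bounded.
        for closed in [w for w in seen if w + size <= watermark]:
            del seen[closed]

        w = window_start(event["ts"], size)
        if w + size <= watermark:
            continue  # late event for a closed window; dropped in this model

        ids = seen.setdefault(w, set())
        if event["event_id"] in ids:
            continue  # duplicate within the same window
        ids.add(event["event_id"])

        # Tagged-output style routing: null user_id goes to the dead-letter list.
        if event["user_id"] is None:
            dead_letter.append(event)
        else:
            main.append(event)

    return main, dead_letter, seen
```

In this model, the dedup state is limited to the windows that are still open, which is what keeps it from growing with the stream. One consequence is that a duplicate arriving in a different window from its original is not detected, so the window size should cover the expected retry delay.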
