Your retail analytics team must deliver a near-real-time dashboard that charts the rolling five-minute average number of product page clicks per product category. Clickstream events are published to Pub/Sub with an event-time attribute. Static product metadata is stored in a BigQuery dimension table that is refreshed once per hour. Aggregated results have to be queryable in BigQuery within one minute after every window closes, and the solution should minimise operational overhead. Which design should you implement to meet these requirements?
Deploy a streaming Dataflow pipeline that uses the BigQuery dimension table as an hourly-refreshed side input to enrich the aggregates before writing them to BigQuery.
Stream events directly into BigQuery and rely on a materialized view to compute the rolling five-minute averages and join with the dimension table, eliminating the need for Dataflow.
Build a streaming Dataflow pipeline that reads from Pub/Sub, assigns event-time timestamps, applies a five-minute hopping window that advances every minute to compute category-level averages, writes the aggregated results to BigQuery via streaming inserts, and exposes a BigQuery view that joins this fact table with the product dimension table.
Schedule a Cloud Data Fusion batch ETL pipeline every five minutes that pulls recent events from Pub/Sub, performs the aggregation and join, and loads the result into BigQuery.
The most operationally efficient approach is to perform the time-critical computation of rolling five-minute averages in a continuously running Dataflow streaming pipeline, which natively handles event-time windowing and late data. The pipeline writes each window's result to BigQuery through BigQueryIO streaming inserts, making the data queryable within seconds, comfortably inside the one-minute requirement. Because streaming side inputs in Apache Beam are not refreshed automatically (keeping one current against BigQuery requires an additional periodic-refresh pattern), it is simpler and more reliable to join the aggregates with the product dimension in BigQuery itself. A plain view (or a scheduled query, if denormalised output is required) combines the fact table written by Dataflow with the hourly updated dimension table without touching the pipeline.

The other options fall short: a Cloud Data Fusion batch job every five minutes introduces recurring job-setup overhead and risks missing the one-minute freshness target; the hourly refreshed streaming side input is not something Beam provides out of the box; and relying solely on BigQuery for continuous windowed computation does not guarantee one-minute freshness, since materialized-view refreshes for complex rolling aggregations are not bounded that tightly.
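To make the winning design concrete, here is a minimal sketch of the pipeline in the Apache Beam Python SDK. The topic name, table names, schema, and the `event_time` attribute and `clicks` field are illustrative assumptions, not values given in the question:

```python
# Sketch of the streaming pipeline (Apache Beam Python SDK).
# PROJECT, the topic, the fact table, and the message fields below
# are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

TOPIC = "projects/PROJECT/topics/clickstream"         # assumed topic
FACT_TABLE = "PROJECT:analytics.category_click_avgs"  # assumed fact table


def parse_event(msg: bytes):
    """Decode a Pub/Sub message into a (category, clicks) pair."""
    event = json.loads(msg.decode("utf-8"))
    return event["category"], event["clicks"]


def to_row(kv, win=beam.DoFn.WindowParam):
    """Turn (category, average) plus the window bounds into a BigQuery row."""
    category, avg_clicks = kv
    return {
        "category": category,
        "window_end": win.end.to_utc_datetime().isoformat(),
        "avg_clicks": avg_clicks,
    }


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # timestamp_attribute makes Beam use the event-time attribute
        # on each message instead of the publish time.
        | "ReadClicks" >> beam.io.ReadFromPubSub(
            topic=TOPIC, timestamp_attribute="event_time")
        | "Parse" >> beam.Map(parse_event)
        # Five-minute sliding (hopping) windows advancing every minute.
        | "Window" >> beam.WindowInto(
            window.SlidingWindows(size=300, period=60))
        | "AvgPerCategory" >> beam.combiners.Mean.PerKey()
        | "ToRow" >> beam.Map(to_row)
        # Streaming inserts make each window's result queryable in seconds.
        | "WriteFacts" >> beam.io.WriteToBigQuery(
            FACT_TABLE,
            schema="category:STRING,window_end:TIMESTAMP,avg_clicks:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS)
    )
```

The dimension join then lives in BigQuery rather than in the pipeline. A hypothetical one-time setup for the serving view, assuming a dimension table `analytics.product_dim` keyed by `category`, might look like this with the BigQuery client library:

```python
# One-time creation of the serving view; table names are assumptions.
from google.cloud import bigquery

client = bigquery.Client()
view = bigquery.Table("PROJECT.analytics.category_dashboard_v")
view.view_query = """
SELECT f.category, f.window_end, f.avg_clicks, d.category_name
FROM `PROJECT.analytics.category_click_avgs` AS f
JOIN `PROJECT.analytics.product_dim` AS d
  ON f.category = d.category
"""
client.create_table(view, exists_ok=True)  # idempotent view creation
```

Because the view is evaluated at query time, the dashboard automatically picks up both the latest window results and the hourly dimension refresh with no pipeline changes.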