Your retail analytics team must deliver a near-real-time dashboard that charts the rolling five-minute average number of product page clicks per product category. Clickstream events are published to Pub/Sub with an event-time attribute. Static product metadata is stored in a BigQuery dimension table that is refreshed once per hour. Aggregated results have to be queryable in BigQuery within one minute after every window closes, and the solution should minimise operational overhead. Which design should you implement to meet these requirements?
Deploy a streaming Dataflow pipeline that uses the BigQuery dimension table as an hourly-refreshed side input to enrich the aggregates before writing them to BigQuery.
Stream events directly into BigQuery and rely on a materialized view to compute the rolling five-minute averages and join with the dimension table, eliminating the need for Dataflow.
Build a streaming Dataflow pipeline that reads from Pub/Sub, assigns event-time timestamps, applies a five-minute hopping window that advances every minute to compute category-level averages, writes the aggregated results to BigQuery via streaming inserts, and exposes a BigQuery view that joins this fact table with the product dimension table.
Schedule a Cloud Data Fusion batch ETL pipeline every five minutes that pulls recent events from Pub/Sub, performs the aggregation and join, and loads the result into BigQuery.
The most operationally efficient approach is to perform the time-critical computation of rolling five-minute averages in a continuously running Dataflow streaming pipeline, which natively handles event-time windowing and late data. The pipeline writes each window's result to BigQuery through BigQueryIO streaming inserts, making the data queryable within seconds, comfortably inside the one-minute requirement. Because streaming side inputs in Apache Beam are not refreshed automatically (keeping one current against BigQuery requires an additional periodic-refresh pattern), it is simpler and more reliable to join the aggregates with the product dimension in BigQuery itself. A plain view (or a scheduled query, if denormalised output is required) combines the fact table written by Dataflow with the hourly updated dimension table without touching the pipeline.

The other options fall short: a Cloud Data Fusion batch job every five minutes introduces recurring job-setup overhead and risks missing the one-minute freshness target; the hourly refreshed streaming side input is not something Beam provides out of the box; and relying solely on BigQuery for continuous windowed computation does not guarantee one-minute freshness, since materialized-view refreshes for complex rolling aggregations are not bounded that tightly.
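To make the winning design concrete, here is a minimal sketch of the pipeline in the Apache Beam Python SDK. The topic name, table names, schema, and the `event_time` attribute and `clicks` field are illustrative assumptions, not values given in the question:

```python
# Sketch of the streaming pipeline (Apache Beam Python SDK).
# PROJECT, the topic, the fact table, and the message fields below
# are hypothetical placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

TOPIC = "projects/PROJECT/topics/clickstream"         # assumed topic
FACT_TABLE = "PROJECT:analytics.category_click_avgs"  # assumed fact table


def parse_event(msg: bytes):
    """Decode a Pub/Sub message into a (category, clicks) pair."""
    event = json.loads(msg.decode("utf-8"))
    return event["category"], event["clicks"]


def to_row(kv, win=beam.DoFn.WindowParam):
    """Turn (category, average) plus the window bounds into a BigQuery row."""
    category, avg_clicks = kv
    return {
        "category": category,
        "window_end": win.end.to_utc_datetime().isoformat(),
        "avg_clicks": avg_clicks,
    }


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # timestamp_attribute makes Beam use the event-time attribute
        # on each message instead of the publish time.
        | "ReadClicks" >> beam.io.ReadFromPubSub(
            topic=TOPIC, timestamp_attribute="event_time")
        | "Parse" >> beam.Map(parse_event)
        # Five-minute sliding (hopping) windows advancing every minute.
        | "Window" >> beam.WindowInto(
            window.SlidingWindows(size=300, period=60))
        | "AvgPerCategory" >> beam.combiners.Mean.PerKey()
        | "ToRow" >> beam.Map(to_row)
        # Streaming inserts make each window's result queryable in seconds.
        | "WriteFacts" >> beam.io.WriteToBigQuery(
            FACT_TABLE,
            schema="category:STRING,window_end:TIMESTAMP,avg_clicks:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS)
    )
```

The dimension join then lives in BigQuery rather than in the pipeline. A hypothetical one-time setup for the serving view, assuming a dimension table `analytics.product_dim` keyed by `category`, might look like this with the BigQuery client library:

```python
# One-time creation of the serving view; table names are assumptions.
from google.cloud import bigquery

client = bigquery.Client()
view = bigquery.Table("PROJECT.analytics.category_dashboard_v")
view.view_query = """
SELECT f.category, f.window_end, f.avg_clicks, d.category_name
FROM `PROJECT.analytics.category_click_avgs` AS f
JOIN `PROJECT.analytics.product_dim` AS d
  ON f.category = d.category
"""
client.create_table(view, exists_ok=True)  # idempotent view creation
```

Because the view is evaluated at query time, the dashboard automatically picks up both the latest window results and the hourly dimension refresh with no pipeline changes.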