GCP Professional Data Engineer Practice Question

An online media company is rebuilding its clickstream ingestion pipeline on Google Cloud. About 80,000 JSON events per second are published from mobile devices to a Cloud Pub/Sub topic. A personalization microservice must be able to look up the latest events for any given user ID with single-digit millisecond latency for up to seven days after ingestion. Data scientists will also run monthly aggregations on a full year of clickstream history in BigQuery. Which design for the initial sink that subscribes to Pub/Sub best meets these requirements while keeping the architecture simple and cost-efficient?

  • Write each message to BigQuery using streaming inserts into partitioned tables, and let the microservice query BigQuery directly for recent events.

  • Use a Dataflow pipeline to write events as Avro files to Cloud Storage and create external tables in BigQuery over the bucket for analytics.

  • Persist events in Cloud Bigtable using the user ID as the row key, then export the table daily to Cloud Storage and batch-load the files into BigQuery.

  • Trigger Cloud Functions for each Pub/Sub message to insert the event into Cloud SQL, and configure federated queries from BigQuery to Cloud SQL for analytics.
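For readers who want to see what the row-key pattern described in the Cloud Bigtable option looks like in practice, the sketch below is a minimal, illustrative example only. It assumes a hypothetical project "my-project", instance "clickstream-instance", table "click_events", and column family "ev" that already exist; the reversed-timestamp suffix makes each user's newest events sort first, so a bounded prefix scan returns the latest rows quickly.

```python
# Minimal sketch (assumed names: project "my-project", instance
# "clickstream-instance", table "click_events", column family "ev").
import json
import time

from google.cloud import bigtable

MAX_TS_MS = 2**63 - 1  # subtracting the timestamp reverses sort order: newest first

client = bigtable.Client(project="my-project")
table = client.instance("clickstream-instance").table("click_events")


def write_event(user_id: str, event: dict) -> None:
    """Persist one clickstream event keyed by user ID plus a reversed timestamp."""
    reverse_ts = MAX_TS_MS - int(time.time() * 1000)
    row = table.direct_row(f"{user_id}#{reverse_ts:019d}".encode())
    row.set_cell("ev", "payload", json.dumps(event).encode())
    row.commit()


def latest_events(user_id: str, limit: int = 10) -> list:
    """Return the most recent events for a user via a bounded prefix scan."""
    prefix = f"{user_id}#".encode()
    rows = table.read_rows(
        start_key=prefix,
        end_key=prefix + b"\xff",
        limit=limit,
    )
    return [
        json.loads(r.cells["ev"][b"payload"][0].value)
        for r in rows
    ]
```

In a production pipeline the writes would typically come from a Dataflow job subscribed to Pub/Sub rather than a bare client loop; the sketch only illustrates how a user-ID row key supports low-latency point lookups for recent events.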

GCP Professional Data Engineer
Ingesting and processing the data