GCP Professional Data Engineer Practice Question

A digital retailer ingests 2 TB of raw click-stream logs into Cloud Storage every night. Analysts frequently refine attribution logic and must be able to replay all historical data within hours without managing long-running clusters. To minimize operational overhead and avoid repeated data movement, which integration pattern and execution environment should you recommend for the transformation stage of the pipeline?

Stream the logs through Pub/Sub into a Dataflow job that transforms and writes the output to BigQuery in near real time (streaming ETL).
Spin up a transient Dataproc cluster each night to transform the logs before loading the curated results into BigQuery (ETL).
Keep existing BigQuery tables as the source and copy them into Cloud SQL so downstream applications can consume them (reverse ETL).
Load the raw logs into BigQuery and perform all cleansing and attribution logic there using SQL (ELT).

Answer Description

Loading the raw files directly into BigQuery and then applying SQL transformations follows an ELT pattern: extract from the source, load into the analytical store, and transform in place. Because BigQuery is a fully managed, serverless warehouse that scales storage and compute independently, analysts can rerun or add transformation queries over all historical data without provisioning clusters, and the raw data is loaded only once, so changing the attribution logic means rerunning a query rather than moving data again. Transforming the logs in Dataproc or Dataflow before loading is an ETL approach: the logic lives in a separate processing job that must be scheduled and re-run over the raw files whenever business rules change, and Dataproc additionally requires creating and sizing clusters. The streaming option also does not match the workload, since the logs arrive as nightly batch files rather than as a live event stream. Publishing data out of BigQuery to operational systems such as Cloud SQL is reverse ETL, which does not address the need to flexibly reprocess historical raw data inside the warehouse.
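
To make the ELT pattern concrete, here is a minimal Python sketch of the two steps: load one night's files into a raw table, then rebuild a curated table from all raw history with SQL. It is an illustration under stated assumptions, not a reference implementation. Every project, dataset, table, bucket, and column name is hypothetical, and the newline-delimited JSON file format is assumed because the question does not say how the logs are stored. The SQL is built by plain string functions so it can be inspected without a GCP project; only run_elt needs the google-cloud-bigquery package and credentials.

```python
"""ELT sketch: load raw files into BigQuery, then transform them with SQL.

All project, dataset, table, bucket, and column names are hypothetical.
"""

RAW_TABLE = "my_project.clickstream.raw_events"         # hypothetical
CURATED_TABLE = "my_project.clickstream.attributed"     # hypothetical


def build_load_sql(raw_table: str, gcs_uri: str) -> str:
    """Return a LOAD DATA statement that appends files at gcs_uri to raw_table."""
    # Assumption: the logs are newline-delimited JSON.
    return (
        f"LOAD DATA INTO `{raw_table}`\n"
        f"FROM FILES (format = 'NEWLINE_DELIMITED_JSON', uris = ['{gcs_uri}'])"
    )


def build_transform_sql(raw_table: str, curated_table: str) -> str:
    """Return a statement that rebuilds curated_table from all rows in raw_table."""
    # Placeholder logic: a real attribution query would replace this SELECT.
    return (
        f"CREATE OR REPLACE TABLE `{curated_table}` AS\n"
        "SELECT user_id, session_id, campaign_id, event_timestamp\n"
        f"FROM `{raw_table}`\n"
        "WHERE event_type = 'purchase'"
    )


def run_elt(gcs_uri: str) -> None:
    """Run both steps; requires google-cloud-bigquery and GCP credentials."""
    from google.cloud import bigquery  # lazy import so the builders work offline

    client = bigquery.Client()
    statements = (
        build_load_sql(RAW_TABLE, gcs_uri),
        build_transform_sql(RAW_TABLE, CURATED_TABLE),
    )
    for sql in statements:
        client.query(sql).result()  # wait for each job to finish


if __name__ == "__main__":
    # Print the generated SQL instead of executing it.
    print(build_load_sql(RAW_TABLE, "gs://my-bucket/logs/2024-01-01/*.json"))
    print(build_transform_sql(RAW_TABLE, CURATED_TABLE))
```

When the attribution rules change, only the SELECT in build_transform_sql needs editing. Rerunning that one statement reprocesses the full history already stored in the raw table, with no reload from Cloud Storage and no cluster to manage.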
