Your team receives hourly Avro files compressed with gzip in Cloud Storage paths like gs://raw-bucket/dt=YYYYMMDD-HH/*.avro.gz. About 100 GB arrives per hour, and analysts must query the data in BigQuery within 15 minutes. Additional requirements: keep ingestion cost as low as possible, load each hour atomically (all or nothing), automatically accept newly added nullable fields, and partition the target table by ingestion time. Which approach best meets the requirements?
Build a batch Dataflow pipeline that reads the Avro files from Cloud Storage and writes the records to BigQuery through the Storage Write API's default stream.
Configure a Cloud Storage finalize event to trigger a Cloud Function that starts a BigQuery load job using a wildcard URI for the hour's Avro files, WRITE_APPEND disposition, schema_update_options = ALLOW_FIELD_ADDITION, and ingestion-time partitioning.
Create a permanent external table over the Avro objects and run scheduled queries each hour to copy data into a destination table.
Schedule hourly Dataproc Spark jobs in Cloud Composer to parse the Avro files and insert the rows into BigQuery using the legacy streaming insert API.
Submitting a BigQuery load job from Cloud Storage satisfies every constraint:
Cost: Load jobs from Cloud Storage are free, whereas ingestion through the Storage Write API or the legacy streaming insert API is billed per byte ingested. External tables avoid load cost but force every query to read the source files, shifting the cost to query time.
Latency: A load job over 100 GB of Avro data typically finishes well within the 15-minute window, especially when triggered as soon as the hour's objects finalize.
Atomicity: With the default setting of maxBadRecords = 0, a load job either commits all rows or fails, giving all-or-nothing visibility for each hourly batch.
Schema evolution: For Avro loads, setting schema_update_options = ALLOW_FIELD_ADDITION lets the job add newly appearing nullable fields to the destination table's schema automatically.
Partitioning: Configuring the destination table for time partitioning without naming a partitioning column (ingestion-time partitioning) means each load's rows land automatically in the partition corresponding to the load time.
The other options each fail at least one requirement: external tables do not provide atomic per-hour visibility and shift cost to queries; a Dataflow pipeline writing through the Storage Write API's default stream incurs ingestion charges and makes rows visible as they are committed rather than all at once; Dataproc Spark jobs add unnecessary infrastructure and still rely on the billed, non-atomic legacy streaming insert API.
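A minimal sketch of the chosen approach, assuming a hypothetical project my-project, dataset analytics, table events, HOUR partition granularity, and the bucket layout from the question; the entry point follows the background Cloud Functions signature for Cloud Storage finalize events, and the prefix parsing is illustrative only:

```python
# Hedged sketch: the table ID (my-project.analytics.events), HOUR granularity,
# and the prefix parsing below are assumptions, not part of the original scenario.
from google.cloud import bigquery


def load_hourly_batch(event, context):
    """Background Cloud Function entry point for a Cloud Storage finalize event."""
    client = bigquery.Client()

    # Derive the hourly prefix (dt=YYYYMMDD-HH) from the finalized object's name.
    bucket = event["bucket"]
    prefix = event["name"].split("/")[0]              # e.g. "dt=20240101-07"
    source_uri = f"gs://{bucket}/{prefix}/*.avro.gz"  # wildcard over the hour's files

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        # Accept newly added nullable fields without manual schema changes.
        schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
        # No partitioning field is named, so this is ingestion-time partitioning.
        time_partitioning=bigquery.TimePartitioning(
            type_=bigquery.TimePartitioningType.HOUR
        ),
        max_bad_records=0,  # default: any bad record fails the whole load (atomicity)
    )

    load_job = client.load_table_from_uri(
        source_uri, "my-project.analytics.events", job_config=job_config
    )
    load_job.result()  # block so failures surface in the function's logs
```

Because a finalize event fires once per object, in practice the load would need a guard so each hour is loaded exactly once, for example by triggering only on a per-hour manifest or _SUCCESS marker object; the sketch above omits that detail.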