Your team receives hourly Avro files compressed with gzip in Cloud Storage paths like gs://raw-bucket/dt=YYYYMMDD-HH/*.avro.gz. About 100 GB arrives per hour, and analysts must query the data in BigQuery within 15 minutes. Additional requirements: keep ingestion cost as low as possible, load each hour atomically (all or nothing), automatically accept newly added nullable fields, and partition the target table by ingestion time. Which approach best meets the requirements?
Build a batch Dataflow pipeline that reads the Avro files from Cloud Storage and writes the records to BigQuery through the Storage Write API's default stream.
Configure a Cloud Storage finalize event to trigger a Cloud Function that starts a BigQuery load job using a wildcard URI for the hour's Avro files, WRITE_APPEND disposition, schema_update_options = ALLOW_FIELD_ADDITION, and ingestion-time partitioning.
Create a permanent external table over the Avro objects and run scheduled queries each hour to copy data into a destination table.
Schedule hourly Dataproc Spark jobs in Cloud Composer to parse the Avro files and insert the rows into BigQuery using the legacy streaming insert API.
Submitting a BigQuery load job from Cloud Storage satisfies every constraint:
Cost: Load jobs from Cloud Storage are free, whereas ingestion through the Storage Write API or the legacy streaming insert API is billed per byte ingested. External tables avoid load cost but force every query to read the source files, shifting the cost to query time.
Latency: A load job over 100 GB of Avro data typically finishes well within the 15-minute window, especially when triggered as soon as the hour's objects finalize.
Atomicity: With the default setting of maxBadRecords = 0, a load job either commits all rows or fails, giving all-or-nothing visibility for each hourly batch.
Schema evolution: For Avro loads, setting schema_update_options = ALLOW_FIELD_ADDITION lets the job add newly appearing nullable fields to the destination table's schema automatically.
Partitioning: Configuring the destination table for time partitioning without naming a partitioning column (ingestion-time partitioning) means each load's rows land automatically in the partition corresponding to the load time.
The other options each fail at least one requirement: external tables do not provide atomic per-hour visibility and shift cost to queries; a Dataflow pipeline writing through the Storage Write API's default stream incurs ingestion charges and makes rows visible as they are committed rather than all at once; Dataproc Spark jobs add unnecessary infrastructure and still rely on the billed, non-atomic legacy streaming insert API.
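A minimal sketch of the chosen approach, assuming a hypothetical project my-project, dataset analytics, table events, HOUR partition granularity, and the bucket layout from the question; the entry point follows the background Cloud Functions signature for Cloud Storage finalize events, and the prefix parsing is illustrative only:

```python
# Hedged sketch: the table ID (my-project.analytics.events), HOUR granularity,
# and the prefix parsing below are assumptions, not part of the original scenario.
from google.cloud import bigquery


def load_hourly_batch(event, context):
    """Background Cloud Function entry point for a Cloud Storage finalize event."""
    client = bigquery.Client()

    # Derive the hourly prefix (dt=YYYYMMDD-HH) from the finalized object's name.
    bucket = event["bucket"]
    prefix = event["name"].split("/")[0]              # e.g. "dt=20240101-07"
    source_uri = f"gs://{bucket}/{prefix}/*.avro.gz"  # wildcard over the hour's files

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        # Accept newly added nullable fields without manual schema changes.
        schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
        # No partitioning field is named, so this is ingestion-time partitioning.
        time_partitioning=bigquery.TimePartitioning(
            type_=bigquery.TimePartitioningType.HOUR
        ),
        max_bad_records=0,  # default: any bad record fails the whole load (atomicity)
    )

    load_job = client.load_table_from_uri(
        source_uri, "my-project.analytics.events", job_config=job_config
    )
    load_job.result()  # block so failures surface in the function's logs
```

Because a finalize event fires once per object, in practice the load would need a guard so each hour is loaded exactly once, for example by triggering only on a per-hour manifest or _SUCCESS marker object; the sketch above omits that detail.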