Your company ingests 2 TB/day of raw IoT sensor data as daily Avro files in a Cloud Storage landing bucket cataloged by Dataplex. Each night a job must 1) mask customer PII, 2) aggregate readings by device-hour, and 3) write compressed Parquet files to a curated bucket. Volume can triple without notice. Ops requires a fully managed, pay-as-you-go service that auto-scales and needs no cluster maintenance. The same code must support back-fill runs via Cloud Scheduler with a date parameter. Which solution meets these needs?
Create a Spark SQL job in Cloud Dataproc, run it nightly on a static cluster sized for peak load, and orchestrate executions with Cloud Composer.
Load the Avro files into BigQuery staging tables with a scheduled load job, run SQL to mask and aggregate, then export the results as CSV files to Cloud Storage via a scheduled extract job.
Attach Cloud Functions to the landing bucket's object-finalize events; each function parses its file, removes PII, performs in-memory aggregation across all files, and writes a Parquet file back to the curated bucket.
Implement an Apache Beam pipeline that reads Avro files from Cloud Storage, masks PII, aggregates by device-hour, and writes compressed Parquet to the curated bucket; package it as a Dataflow Flex Template and trigger the template nightly from Cloud Scheduler with a date parameter.
A Dataflow pipeline built with Apache Beam can ingest Avro files from Cloud Storage using AvroIO, transform the data to hash or remove PII, perform keyed windowed aggregations by device and hour, and write compressed Parquet outputs to the curated bucket with ParquetIO. When packaged as a Flex Template, the identical pipeline can be launched nightly or on-demand for historical back-fills, with the processing date passed as a runtime parameter from Cloud Scheduler through Cloud Functions or Workflows. Dataflow's serverless design automatically provisions and horizontally scales workers, so no cluster sizing or maintenance is required.
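The masking and device-hour aggregation steps described above can be sketched in plain Python. This is only an illustration of the per-record logic, not the pipeline itself: the field names (`customer_id`, `device_id`, `event_ts`, `reading`), the salt, and the choice of count and mean as aggregates are assumptions, since the question does not give the Avro schema. In a Beam pipeline, functions like these would typically be wrapped in a map step and a per-key combine between the Avro read and the Parquet write.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical field name; the real Avro schema is not given in the question.
PII_FIELD = "customer_id"


def mask_pii(record, salt="example-salt"):
    """Return a copy of the record with the PII field replaced by a salted SHA-256 digest."""
    masked = dict(record)  # copy so the input record is not mutated
    raw = (salt + str(record[PII_FIELD])).encode("utf-8")
    masked[PII_FIELD] = hashlib.sha256(raw).hexdigest()
    return masked


def device_hour_key(record):
    """Build the (device_id, hour) grouping key from an epoch-seconds timestamp."""
    ts = datetime.fromtimestamp(record["event_ts"], tz=timezone.utc)
    return record["device_id"], ts.strftime("%Y-%m-%dT%H:00Z")


def aggregate_by_device_hour(records):
    """Compute the count and mean reading for each (device, hour) key."""
    acc = defaultdict(lambda: [0, 0.0])  # key -> [count, running sum]
    for r in records:
        slot = acc[device_hour_key(r)]
        slot[0] += 1
        slot[1] += r["reading"]
    return {k: {"count": n, "mean": total / n} for k, (n, total) in acc.items()}
```

Keying on the truncated hour keeps the aggregation independent of file boundaries, which is why the same logic serves both nightly runs and back-fills of arbitrary dates.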
A fixed-size Dataproc cluster would not scale automatically and still requires operational oversight. Cloud Data Fusion, which is not one of the listed options, runs its pipelines on Dataproc clusters behind the scenes, introducing startup latency and additional cost. A BigQuery-centric workflow would involve separate load, transform, and extract steps and, when exporting CSV as proposed, fails the Parquet requirement. Cloud Functions are constrained in execution time and memory, and each invocation sees only the file that triggered it, so they are unsuitable for multi-terabyte nightly processing that aggregates across files. Therefore, the Dataflow Flex Template is the most suitable solution.
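The date-parameterized launch that makes nightly runs and back-fills share one template can be sketched as a request builder for the Dataflow `flexTemplates.launch` REST method. The project, region, template path, job-name prefix, and the `run_date` parameter name are all hypothetical; the actual parameter names depend on how the template's metadata is defined. A Cloud Scheduler HTTP job, or an intermediate Cloud Function or Workflow, would send a body like this with suitable credentials.

```python
import json
from datetime import date


def build_launch_request(project, region, template_gcs_path, run_date):
    """Return (url, json_body) for launching the Flex Template for one processing date."""
    url = (
        "https://dataflow.googleapis.com/v1b3/"
        f"projects/{project}/locations/{region}/flexTemplates:launch"
    )
    body = {
        "launchParameter": {
            # Job name prefix is illustrative; the date suffix keeps runs distinguishable.
            "jobName": f"iot-curate-{run_date:%Y%m%d}",
            "containerSpecGcsPath": template_gcs_path,
            # "run_date" is a hypothetical template parameter name.
            "parameters": {"run_date": run_date.isoformat()},
        }
    }
    return url, json.dumps(body)
```

A back-fill then amounts to calling the same builder with an earlier `run_date`, for example `build_launch_request("my-project", "us-central1", "gs://my-bucket/templates/iot.json", date(2024, 1, 15))`, where every argument shown is a placeholder.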
GCP Professional Data Engineer
Storing the data