GCP Professional Data Engineer Practice Question

Your team is designing the first landing layer for a new pipeline that ingests about 10 TB per day of compressed JSON and image files through Pub/Sub and Dataflow. Compliance requires that the raw, immutable data be retained for at least seven years at the lowest possible cost. The data will later be transformed in Dataproc and occasionally queried in BigQuery, so it must remain accessible to multiple analytics engines and be able to transition automatically to colder, cheaper tiers as it ages. Which Google Cloud sink best satisfies these requirements?

  • A Cloud Bigtable instance with a wide-row schema storing each file as a cell value

  • A BigQuery dataset that relies on BigQuery long-term storage pricing after 90 days

  • A Cloud Storage bucket configured with lifecycle rules to serve as the organization's data lake

  • A regional Cloud Spanner database using a BLOB column for each ingested object
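The lifecycle-rule pattern referenced in the Cloud Storage option can be sketched as a bucket lifecycle configuration in the JSON format accepted by `gsutil lifecycle set`. The storage classes and age thresholds below are illustrative assumptions, not values stated in the question; the 2555-day delete rule approximates the seven-year retention floor (7 × 365 days):

```python
import json

# Sketch of a Cloud Storage lifecycle configuration: transition raw objects
# to progressively colder tiers as they age, and delete only after the
# compliance window. Thresholds here are illustrative assumptions.
lifecycle_config = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}},
        {"action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
         "condition": {"age": 365}},
        # ~7 years: objects may be deleted only after the retention floor.
        {"action": {"type": "Delete"}, "condition": {"age": 2555}},
    ]
}

# Serialize for use with: gsutil lifecycle set lifecycle.json gs://BUCKET_NAME
print(json.dumps(lifecycle_config, indent=2))
```

Because the data stays in Cloud Storage, it remains readable by Dataproc (via the `gs://` connector) and queryable from BigQuery as external tables, while the rules handle tiering automatically.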

Domain: Ingesting and processing the data