GCP Professional Data Engineer Practice Question

Your company has three regional Google Cloud projects where raw log and ad-impression CSVs land hourly in Cloud Storage buckets, and those buckets must remain the primary data lake. The central analytics team needs a single searchable catalog across all files, including automatic schema discovery and profiling, and wants to avoid ongoing engineering work whenever new buckets or folder paths appear. Which architecture meets these goals while following Google Cloud's data governance best practices?

  • Use hourly Dataflow jobs to load all incoming files into a single multi-region BigQuery dataset and let analysts search the dataset through Data Catalog.

  • Enable Cloud Asset Inventory exports for each project, write the bucket metadata to BigQuery, and expose a custom Looker dashboard for analysts to locate files.

  • Mount each bucket on a GKE cluster via Cloud Storage FUSE and run an open-source metadata crawler nightly to populate a self-hosted catalog service.

  • Create a Dataplex lake spanning the three projects, register every bucket as a managed asset in a raw zone with auto-discovery enabled, and grant analysts access to the resulting catalog entries.

Exam: GCP Professional Data Engineer
Objective: Designing data processing systems
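
For context on the Dataplex architecture described in the final option, the sketch below shows roughly how an existing regional bucket could be registered as a managed asset in a raw zone with auto-discovery enabled, using the google-cloud-dataplex Python client. The project, lake, zone, asset, and bucket names are hypothetical placeholders, and the lake itself is assumed to have been created already; treat this as an illustrative sketch under those assumptions, not a complete setup.

```python
# Illustrative sketch: register a Cloud Storage bucket as a managed Dataplex
# asset in a raw zone with auto-discovery enabled. All resource names below
# (projects, lake, zone, asset, bucket) are hypothetical placeholders, and the
# lake "ads-logs-lake" is assumed to exist already.
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()

# The lake lives in a central governance project; its assets may reference
# buckets owned by the three regional projects.
lake_name = "projects/central-analytics/locations/us-central1/lakes/ads-logs-lake"

# Create a raw zone to hold the unprocessed CSV landing buckets.
zone = dataplex_v1.Zone(
    type_=dataplex_v1.Zone.Type.RAW,
    resource_spec=dataplex_v1.Zone.ResourceSpec(
        location_type=dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION
    ),
    discovery_spec=dataplex_v1.Zone.DiscoverySpec(enabled=True),
)
client.create_zone(parent=lake_name, zone_id="raw-landing", zone=zone).result()

# Register one regional bucket as a managed asset; discovery keeps catalog
# entries current as new folders and files arrive, with no extra pipelines.
asset = dataplex_v1.Asset(
    resource_spec=dataplex_v1.Asset.ResourceSpec(
        name="projects/region-a-project/buckets/region-a-raw-logs",
        type_=dataplex_v1.Asset.ResourceSpec.Type.STORAGE_BUCKET,
    ),
    discovery_spec=dataplex_v1.Asset.DiscoverySpec(enabled=True),
)
created = client.create_asset(
    parent=f"{lake_name}/zones/raw-landing",
    asset_id="region-a-raw-logs",
    asset=asset,
).result()
print("Registered asset:", created.name)
```

Once discovery runs, Dataplex publishes metadata entries for the discovered CSV tables, so analysts can search the catalog without any per-bucket pipelines being built or maintained.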