GCP Professional Data Engineer Practice Test
Use the form below to configure your GCP Professional Data Engineer Practice Test. The practice test can be configured to only include certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

GCP Professional Data Engineer Information
Overview
The Google Cloud Professional Data Engineer (PDE) certification is designed to validate a practitioner's ability to build, operationalize, secure, and monitor data processing systems on Google Cloud Platform (GCP). Candidates are expected to demonstrate proficiency in designing data-driven solutions that are reliable, scalable, and cost-effective, spanning everything from ingestion pipelines and transformation jobs to advanced analytics and machine-learning models. Earning the PDE credential signals to employers that you can translate business and technical requirements into robust data architectures while adhering to best practices for security, compliance, and governance.
Exam Structure and Knowledge Domains
The exam is a two-hour, multiple-choice test delivered in a proctored format, either in person or online. Questions target real-world scenarios across four broad domains: (1) designing data processing systems; (2) building and operationalizing data processing systems; (3) operationalizing machine-learning models; and (4) ensuring solution quality. You might be asked to choose optimal storage solutions (BigQuery, Cloud Spanner, Bigtable), architect streaming pipelines with Pub/Sub and Dataflow, or troubleshoot performance bottlenecks. Because the PDE focuses heavily on applied problem-solving rather than rote memorization, hands-on experience, whether via professional projects or Google's Qwiklabs/Cloud Skills Boost labs, is critical for success.
About GCP PDE Practice Exams
Taking reputable practice exams is one of the most efficient ways to gauge readiness and close knowledge gaps. High-quality mocks mirror the actual test's wording, timing, and scenario-based style, helping you get comfortable with the pace and depth of questioning. After each attempt, review the explanations (not just the items you missed, but also the ones you answered correctly) to reinforce concepts and uncover lucky guesses. Tracking performance over multiple sittings shows whether your improvement is consistent or if certain domains lag behind. When used alongside hands-on labs, whitepapers, and documentation, practice tests become a feedback loop that sharpens both your intuition and time-management skills.
Preparation Tips
Begin your preparation with the official exam guide to map each task statement to concrete learning resources (Coursera courses, Google documentation, blog posts). Build small proof-of-concept projects, such as streaming IoT data to BigQuery or automating model retraining with AI Platform, to anchor theory in practice. In the final weeks, shift from broad study to focused review: revisit weak areas highlighted by practice exams, skim product release notes for recent feature updates, and fine-tune your exam-day strategy (flag uncertain questions, manage breaks, monitor the clock). By combining targeted study, practical experimentation, and iterative assessment, you can approach the GCP Professional Data Engineer exam with confidence and a clear roadmap to certification.

Free GCP Professional Data Engineer Practice Test
- 20 Questions
- Unlimited time
- Designing data processing systems
- Ingesting and processing the data
- Storing the data
- Preparing and using data for analysis
- Maintaining and automating data workloads
Your organization runs a Dataflow streaming job that continuously writes events into an existing BigQuery dataset containing sensitive customer information. Security policy mandates least-privilege access for the Dataflow worker service account: it must be able to create new tables in that dataset and append or overwrite rows, but it must not change table schemas or manage dataset-level access controls. You need to grant a single predefined IAM role on the dataset to satisfy this requirement. Which role should you assign?
Grant roles/bigquery.dataEditor on the dataset
Grant roles/bigquery.jobUser on the project
Grant roles/bigquery.dataOwner on the dataset
Grant roles/bigquery.admin on the project
Answer Description
The BigQuery Data Editor role (roles/bigquery.dataEditor) is scoped to datasets and grants permissions such as bigquery.tables.create and bigquery.tables.updateData, which allow a principal to create tables and write rows. Although it also includes bigquery.tables.getData (read access), it does not include permissions like bigquery.tables.update (alter table schemas) or bigquery.datasets.update (change access controls). Thus it provides the minimum required capabilities without granting schema-modification rights. BigQuery Data Owner and BigQuery Admin include schema and access-control permissions, violating least privilege. BigQuery Data Viewer is read-only, and BigQuery Job User controls job execution but provides no direct dataset write permissions. Therefore granting roles/bigquery.dataEditor on the dataset is the correct choice.
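As an illustration, a dataset-scoped grant like this could be applied with the google-cloud-bigquery Python client; this is a minimal sketch, and the project, dataset, and service-account names are placeholders rather than values from the scenario.

```python
# Sketch: grant roles/bigquery.dataEditor to the Dataflow worker service account
# on one dataset only (not the project), using the google-cloud-bigquery client.
# All resource names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
dataset = client.get_dataset("my-project.customer_events")  # placeholder dataset

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="roles/bigquery.dataEditor",
        entity_type="userByEmail",
        entity_id="dataflow-worker@my-project.iam.gserviceaccount.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])  # push only the ACL change
```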
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is the difference between roles/bigquery.dataEditor and roles/bigquery.dataOwner?
Why can't roles/bigquery.admin be assigned instead of roles/bigquery.dataEditor?
What does least-privilege access mean in IAM roles?
What is the scope of the BigQuery Data Editor role?
What’s the difference between BigQuery Data Editor and BigQuery Admin roles?
What permissions are excluded from the BigQuery Data Editor role?
A media-streaming company runs an Apache Beam pipeline on Cloud Dataflow in the us-central1 region. The job keeps several terabytes of user session data in Redis to perform low-latency joins. Management wants the pipeline to survive a complete zonal outage without manual intervention while keeping operational overhead and complexity to a minimum. Which approach best meets these requirements?
Create a Memorystore for Redis Cluster instance in us-central1. Configure the Dataflow pipeline to connect through the cluster's discovery endpoint and rely on its built-in multi-zone shard replication.
Run open-source Redis Cluster on a stateful GKE deployment distributed across three zones and manage failover with custom scripts and Kubernetes operators.
Provision two Basic Tier Memorystore for Redis instances, one in us-central1-a and one in us-central1-b, and modify the Dataflow job to write to both instances for redundancy.
Deploy a Standard Tier Memorystore for Redis instance in us-central1-a and create a Cloud SQL read replica in a different zone to take over if the primary zone fails.
Answer Description
Memorystore for Redis Cluster is a regional service that automatically shards data and places each shard's primary and replica in separate zones inside the region. If an entire zone fails, clients reconnect to the surviving replicas without manual reconfiguration, so the Dataflow job continues with minimal disruption. Running your own Redis on GKE or managing dual basic instances would require you to build and operate the sharding, replication, and failover logic yourself. A single-zone Standard Tier instance, even with a Cloud SQL read replica, would not protect against a zonal outage and mixes unrelated services. Therefore, a regional Memorystore for Redis Cluster instance with its built-in multi-zone placement and automatic failover delivers the desired high availability with the least operational effort.
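For context, a pipeline worker could reach such an instance through the cluster's discovery endpoint with redis-py's cluster client, roughly as sketched below; the endpoint address and key names are hypothetical, and production settings such as TLS and IAM auth are omitted.

```python
# Sketch: connect to a Memorystore for Redis Cluster instance through its
# discovery endpoint. The cluster client learns the shard topology from the
# endpoint and re-routes requests after a zonal failover.
from redis.cluster import RedisCluster

rc = RedisCluster(host="10.0.0.5", port=6379)  # placeholder discovery endpoint

# Cache a session record used for a low-latency join in the pipeline.
rc.hset("session:user-123", mapping={"started_at": "1717000000", "events": "42"})
print(rc.hgetall("session:user-123"))
```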
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Memorystore for Redis Cluster?
How does multi-zone shard replication work in Memorystore?
Why isn't using multiple Basic Tier instances a good solution for zonal outages?
What is Memorystore for Redis Cluster and how does it ensure high availability?
Why is running Redis on GKE more complex compared to Memorystore for Redis?
Why is a single-zone Standard Tier Memorystore for Redis instance insufficient for zonal outage protection?
Your ecommerce platform runs a self-managed PostgreSQL 12 cluster handling 25,000 write TPS against 10 TB of data. Business analysts need sub-second ad-hoc reporting on the same tables without affecting OLTP latency. You must lift-and-shift to Google Cloud within one quarter, reuse existing SQL, avoid managing storage-compute scaling or patching, and minimize downtime during future maintenance. Which managed Google Cloud service best satisfies both transactional and analytical requirements while preserving PostgreSQL compatibility?
BigQuery with federated external tables over exported data
Spanner configured with the PostgreSQL interface
Cloud SQL for PostgreSQL with high-availability configuration and read replicas
AlloyDB for PostgreSQL
Answer Description
AlloyDB for PostgreSQL is a fully managed, PostgreSQL-compatible service that separates compute and log-structured storage to deliver high write throughput and near-linear read scaling through read pools. It offers an integrated columnar engine that accelerates analytical queries by orders of magnitude on the same operational data, and Google handles replication, patching, and zero-downtime maintenance. Cloud SQL provides PostgreSQL compatibility but lacks the columnar accelerator and can suffer from replication lag under heavy write loads. Spanner's PostgreSQL interface is not identical to upstream PostgreSQL and would require more re-engineering, while BigQuery excels at analytics but is not suitable for high-write OLTP workloads and would force a dual-system architecture.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is AlloyDB for PostgreSQL?
How does AlloyDB compare to Cloud SQL?
What makes AlloyDB suitable for OLTP and analytical workloads?
How does the columnar engine in AlloyDB improve analytical queries?
Why is AlloyDB preferred over Cloud SQL for PostgreSQL in this scenario?
Your manufacturing company collects 150,000 JSON telemetry events per second from thousands of factory devices worldwide. Dashboards in BigQuery must reflect events within 30 seconds of publication. Devices occasionally emit malformed JSON that should be quarantined for later inspection without interrupting ingest. The team wants a fully managed, autoscaling solution that minimizes ongoing operations. Which architecture best satisfies these requirements?
Deploy a long-lived Spark Streaming job on a Dataproc cluster that consumes the Pub/Sub topic, cleans the data, writes to BigQuery, and stores malformed records in an HDFS directory.
Have devices write newline-delimited JSON files to Cloud Storage and configure a BigQuery load job every 15 minutes with an error log destination for rows that fail to parse.
Trigger a Cloud Function for each message delivered by a Pub/Sub push subscription and insert the event into BigQuery; wrap the insert in a try/catch block that logs malformed JSON to Cloud Logging.
Publish events to a Pub/Sub topic that has a dead-letter topic enabled; run an autoscaling Dataflow streaming pipeline that parses the JSON, writes valid rows to BigQuery via the Storage Write API, and routes parsing failures to the dead-letter topic.
Answer Description
Publishing events to a Pub/Sub topic provides a serverless ingestion layer that automatically scales to high throughput. A streaming Dataflow job can subscribe to the topic, parse the JSON, and write valid rows to BigQuery with the BigQuery Storage Write API, making data queryable within seconds. The pipeline can send parsing failures to a Pub/Sub dead-letter topic (or a side output) so bad records are isolated without stopping the job. This combination is fully managed and autoscaling, requiring no cluster maintenance. Cloud Functions would face concurrency limits and per-invocation overhead at 150,000 msg/s, Dataproc introduces cluster administration work and is not serverless, and batch file loads from Cloud Storage cannot meet the sub-minute latency target.
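A minimal Apache Beam (Python SDK) sketch of this topology is shown below; the topic, subscription, and table names are placeholders, and the BigQuery table is assumed to already exist with a matching schema.

```python
# Sketch: Pub/Sub -> parse -> BigQuery (Storage Write API), with malformed
# payloads routed to a dead-letter Pub/Sub topic via a tagged side output.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ParseJson(beam.DoFn):
    def process(self, msg: bytes):
        try:
            yield json.loads(msg.decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            # Route malformed payloads to a side output instead of failing the job.
            yield beam.pvalue.TaggedOutput("dead_letter", msg)


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    parsed = (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/telemetry-sub")
        | "Parse" >> beam.ParDo(ParseJson()).with_outputs("dead_letter", main="valid")
    )

    parsed.valid | "WriteBQ" >> beam.io.WriteToBigQuery(
        "my-project:factory.telemetry",            # table assumed to exist
        method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
    parsed.dead_letter | "DeadLetter" >> beam.io.WriteToPubSub(
        topic="projects/my-project/topics/telemetry-dead-letter"
    )
```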
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Pub/Sub and why is it suitable for high-throughput data ingestion?
What is the role of Dataflow in the solution and how does it support autoscaling?
What is the BigQuery Storage Write API and how does it improve data latency?
What is Pub/Sub, and why is it ideal for high-throughput messaging?
What is the BigQuery Storage Write API, and how does it enable fast data querying?
How does a Dataflow streaming pipeline handle malformed JSON data efficiently?
Your organization is adopting Dataplex to unify governance across its Google Cloud data estate. Three business domains (Sales, Marketing, and Finance) will ingest raw data into domain-owned buckets, transform it in Dataproc, and publish cleaned datasets in BigQuery. In addition, several enterprise reference tables (for example, country codes and fiscal calendars) must be discoverable and consistently governed by a central data-governance team while remaining accessible to all domains without duplicating the data. Which Dataplex design best satisfies these requirements and aligns with the lake → zone → asset hierarchy?
Store the reference datasets in a Cloud Storage bucket that is not registered with Dataplex and let each domain lake create external tables pointing to that bucket for queries.
Within each domain lake, define three zones (raw, reference, curated) and bulk-replicate the reference datasets into the reference zone of every lake so each team can manage its own copy.
Create a separate Enterprise lake managed by the governance team that contains a single curated zone with the reference datasets as assets, and keep raw and curated zones inside each domain lake for domain-specific data.
Add the reference datasets as additional assets inside the curated zone of every domain lake and rely on Dataplex asset sharing to grant cross-domain access.
Answer Description
Dataplex lakes are intended to map to broad business areas or administrative ownership boundaries. Creating a dedicated Enterprise lake allows the central governance team to own and manage a curated zone that contains the shared reference datasets as individual assets (BigQuery datasets or Cloud Storage buckets). IAM policies can then be granted at lake, zone, or asset level so that the domain teams in the Sales, Marketing, and Finance lakes can read the reference data without copying it. Replicating the reference data in every domain lake or maintaining it outside Dataplex would defeat the purpose of centralized metadata, governance, and cost control, and Dataplex today does not support asset-level "sharing" that automatically spans multiple lakes; each asset belongs to exactly one zone within one lake.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Dataplex in Google Cloud?
How does Dataplex enforce governance without data duplication?
What is the lake → zone → asset hierarchy in Dataplex?
What is Dataplex's zone and asset hierarchy?
How does Dataplex handle centralized governance?
Why is it important to avoid replicating reference data across domains?
Your organization manages several BigQuery projects. Interactive queries that refresh the executive dashboard in the prod-analytics project start at 09:00 each weekday and must finish within seconds. During the rest of the day, development and ad-hoc queries from other projects may use any spare capacity, but the dashboard must never be slowed when it runs. What is the most cost-effective way to guarantee performance for the dashboard while still letting the other projects use leftover capacity?
Purchase a 1-year commitment for 1,000 slots, place them in a dedicated reservation assigned to prod-analytics, and create a second 0-slot reservation for the other projects so they can opportunistically borrow idle slots.
Run the dashboard queries with batch priority and enable query result caching so they do not compete with other workloads.
Upgrade every BigQuery project to Enterprise Plus edition and rely on automatic slot scaling to handle the 09:00 dashboard workload.
Buy Flex slots at 08:55 each morning for prod-analytics and delete them after the dashboard finishes; keep all projects on on-demand pricing for the rest of the day.
Answer Description
A one-year slot commitment provides the lowest unit cost when the same capacity is required every workday. Create a reservation that holds the 1,000 committed slots and assign only the prod-analytics project to it. Then create a second reservation that owns 0 slots and assign the other projects to that reservation. BigQuery automatically lets the 0-slot reservation borrow any idle slots from the 1,000-slot reservation, but when the dashboard starts, the full 1,000 slots are immediately reclaimed for prod-analytics, ensuring its performance target. Flex slots are more expensive per hour and would require daily provisioning, while upgrading editions or lowering query priority would not reserve capacity and therefore cannot guarantee the required latency.
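A rough sketch of that reservation layout using the google-cloud-bigquery-reservation client follows; the admin project, region, and assignee project IDs are placeholders, not values from the scenario.

```python
# Sketch: 1-year commitment, a 1,000-slot reservation for prod-analytics, and a
# 0-slot reservation whose assignees may only borrow idle slots.
from google.cloud import bigquery_reservation_v1 as br

client = br.ReservationServiceClient()
parent = "projects/admin-project/locations/US"   # placeholder admin project

client.create_capacity_commitment(
    parent=parent,
    capacity_commitment=br.CapacityCommitment(
        slot_count=1000,
        plan=br.CapacityCommitment.CommitmentPlan.ANNUAL,
    ),
)

dash = client.create_reservation(
    parent=parent,
    reservation_id="dashboard",
    reservation=br.Reservation(slot_capacity=1000),
)
spare = client.create_reservation(
    parent=parent,
    reservation_id="adhoc",
    reservation=br.Reservation(slot_capacity=0),   # borrows idle slots only
)

client.create_assignment(
    parent=dash.name,
    assignment=br.Assignment(
        assignee="projects/prod-analytics",
        job_type=br.Assignment.JobType.QUERY,
    ),
)
client.create_assignment(
    parent=spare.name,
    assignment=br.Assignment(
        assignee="projects/dev-analytics",          # hypothetical other project
        job_type=br.Assignment.JobType.QUERY,
    ),
)
```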
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What are BigQuery slot reservations, and how do they work?
What is the difference between committed slots and flex slots in BigQuery?
How do 0-slot reservations work in BigQuery?
What are BigQuery slots?
What is a reservation in BigQuery?
What is the difference between committed and flex slots in BigQuery?
You are designing a streaming pipeline that ingests temperature readings from tens of thousands of IoT devices through Pub/Sub and must persist the data in a storage tier that can sustain millions of writes per second while offering sub-10-millisecond read latency for lookups by device-id and event timestamp to power a real-time dashboard. The data is append-only and will not be queried with complex joins or multi-row transactions. Which Google Cloud sink best meets these requirements with minimal operational overhead?
Cloud Pub/Sub Lite topic configured with 7-day message retention
BigQuery partitioned table using ingestion-time partitioning
Cloud Spanner table keyed on device-id with a secondary index on timestamp
Cloud Bigtable with a composite row key of device-id and reversed timestamp
Answer Description
Cloud Bigtable is engineered for very high write throughput and single-row reads in single-digit milliseconds, making it ideal for time-series or sensor data that is appended continuously and retrieved by a well-designed row key such as device-id#timestamp. BigQuery is optimized for analytical scans, not low-latency point reads. Cloud Spanner provides strong relational semantics and distributed transactions, which add cost and complexity unnecessary for this append-only workload. Pub/Sub Lite is a messaging service rather than a serving store and is not intended for low-latency random access queries.
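To illustrate the row-key design, a hedged google-cloud-bigtable snippet might look like this; the instance, table, and column-family names are hypothetical.

```python
# Sketch: write a reading with a device-id#reversed-timestamp row key so the
# newest readings for a device sort first and can be fetched with a prefix scan.
import time
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=False)
table = client.instance("telemetry-instance").table("readings")

device_id = "device-42"
event_ts_ms = int(time.time() * 1000)
# Reversing the timestamp makes newer rows lexicographically smaller.
reversed_ts = (2**63 - 1) - event_ts_ms
row_key = f"{device_id}#{reversed_ts:020d}".encode("utf-8")

row = table.direct_row(row_key)
row.set_cell("metrics", "temperature_c", b"21.7")
row.commit()
```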
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Cloud Bigtable, and why is it a good choice for this use case?
Why isn't BigQuery suitable for low-latency point queries in this scenario?
What is the limitation of using Cloud Spanner in this scenario?
How does Cloud Bigtable achieve sub-10-millisecond read latency while handling millions of writes per second?
Why is a composite row key with device-id and reversed-timestamp recommended in this use case?
Why is BigQuery not suitable for real-time, low-latency point lookups in this scenario?
During a quarterly audit, you discover that all 20 data scientists in your analytics project were granted the primitive Editor role so they could create and modify BigQuery tables. The CISO asks you to immediately reduce the blast radius while ensuring the scientists can continue their normal workloads. Which action best satisfies the principle of least privilege?
Replace the Editor role with a custom role that includes all resourcemanager.* permissions but excludes storage.* permissions to protect Cloud Storage data.
Downgrade each scientist to the Viewer primitive role and allow them to impersonate a service account that still has the Editor role when they need write access.
Retain the Editor role but enable Cloud Audit Logs and set up log-based alerts to detect any misuse of non-BigQuery services.
Remove the Editor binding and grant each scientist the predefined role roles/bigquery.dataEditor only on the datasets they work with.
Answer Description
The Editor primitive role grants thousands of permissions across nearly every Google Cloud service, including the ability to create, modify, and delete resources such as Compute Engine instances and Cloud Storage buckets. To comply with least-privilege guidelines, you should remove this broad role and replace it with a predefined BigQuery-specific role that contains only the permissions required for the scientists' tasks. Granting roles/bigquery.dataEditor at the dataset level lets them create and update tables without exposing the project to unnecessary risk. The other options either continue to over-provision access, add unnecessary impersonation complexity, or rely solely on monitoring rather than removing excessive permissions.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is the principle of least privilege in cloud security?
What does the roles/bigquery.dataEditor role allow users to do?
Why is assigning primitive roles like Editor considered a security risk?
What is the principle of least privilege?
What does the roles/bigquery.dataEditor role include?
How can granting permissions at the dataset level reduce risk?
A global equities trading platform must ingest more than one million trade order updates per second from users on three continents. Each order write must commit in under 10 ms with full ACID semantics and globally consistent reads. Analysts require ad-hoc SQL queries on the full history of trades with less than five-second freshness, but the engineering team wants to avoid managing infrastructure. Which GCP service combination best satisfies both workloads?
Ingest both transactional and analytical workloads directly into a partitioned BigQuery dataset using BigQuery Omni.
Persist orders in Cloud Spanner and stream Spanner change streams to BigQuery with Dataflow for analytics.
Use Cloud SQL with cross-region read replicas for orders and replicate to BigQuery with Datastream.
Store orders in Cloud Bigtable and expose the table as an external table for direct querying from BigQuery.
Answer Description
Cloud Spanner is a fully managed, horizontally scalable relational store that offers strong consistency, global replication, and ACID transactions with single-digit-millisecond latency, which makes it ideal for the high-volume transactional (OLTP) order stream. Spanner change streams can continuously capture mutations, and a Dataflow template can stream them into BigQuery, Google Cloud's serverless data warehouse optimized for interactive analytical (OLAP) SQL. This architecture cleanly separates transactional and analytical paths while remaining fully managed. Bigtable lacks multi-row ACID guarantees and global consistency, Cloud SQL cannot scale to millions of updates per second, and BigQuery alone is not designed for sub-millisecond transactional writes.
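As a sketch, the change stream itself is just a DDL statement that could be issued with the google-cloud-spanner client before launching the replication pipeline; the instance, database, and stream names are placeholders.

```python
# Sketch: define a change stream on the orders table so a Dataflow template can
# replicate its mutations into BigQuery for analytics.
from google.cloud import spanner

client = spanner.Client(project="trading-prod")            # placeholder project
database = client.instance("orders-instance").database("orders-db")

op = database.update_ddl(["CREATE CHANGE STREAM orders_stream FOR orders"])
op.result()  # wait for the schema change to complete
print("Change stream ready; point the Spanner-to-BigQuery Dataflow template at it.")
```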
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What are ACID semantics in databases?
What are Spanner change streams, and how do they work with Dataflow?
How does Cloud Spanner achieve global consistency?
What is Cloud Spanner used for in this architecture?
How do Spanner change streams work with Dataflow and BigQuery?
Why was Cloud Bigtable not suitable for this use case?
Your analytics team has deployed a Cloud Data Fusion Enterprise edition instance in the us-central1 region. The instance was provisioned with a private IP so that its management UI and the Dataproc ephemeral clusters it creates have no public IPv4 addresses.
You now need to allow the pipelines that run inside the Data Fusion tenant project to read and write data in Cloud Bigtable tables that reside in a VPC network (prod-analytics-vpc) in your customer project. The security team requires that all traffic stay on Google's private backbone; the Bigtable instances must remain reachable only over internal IP addresses, and no inbound firewall openings in prod-analytics-vpc are allowed.
Which networking approach meets the requirements while following Google-recommended architecture for Cloud Data Fusion private deployments?
Expose the Cloud Bigtable instances through Private Service Connect and have the Data Fusion instance consume the published PSC endpoints over the internet.
Convert the tenant project into a service project of the customer's Shared VPC host so that Dataproc clusters obtain IP addresses directly inside prod-analytics-vpc.
Create a Cloud NAT gateway in the tenant project and route traffic from the Dataproc subnet to the internet; whitelist the gateway's public IP range on Bigtable.
Peer the tenant project's default network with prod-analytics-vpc by using VPC Network Peering and rely on existing firewall egress rules for the Dataproc workers.
Answer Description
Cloud Data Fusion instances that use the Private IP option create their control plane in a Google-managed tenant project. Runtime resources such as Dataproc clusters run in that tenant project's default network. To let these private workers reach services that live in a customer VPC (for example, Cloud Bigtable with only internal IPs), Google recommends creating a VPC-network-peering connection between the tenant project's network and the customer project's VPC. Peering keeps the traffic on Google's private backbone, requires no public IPs, and honours existing firewall rules from both sides; no ingress holes have to be opened because Dataproc workers initiate the outbound connections.
Other options fail to satisfy one or more constraints:
- Using Cloud NAT would still expose Dataproc workers to the public internet.
- Private Service Connect endpoints are not supported for Bigtable yet and would still require deploying PSC back-ends in the customer VPC.
- Shared VPC is not possible because the tenant project is Google-managed and cannot be attached as a service project under your host project.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is a tenant project in Cloud Data Fusion?
What is VPC Network Peering and how does it work?
Why can’t Shared VPC be used with a tenant project?
What is a tenant project in GCP?
How does VPC Network Peering work in GCP?
Why can't Cloud Data Fusion tenant projects use Shared VPCs?
Why is Cloud NAT not recommended for this setup?
You are building a BigQuery ML logistic-regression model on table prod.customers, which contains nullable numeric columns (usage_minutes, tenure_days) and a high-cardinality STRING column plan_type. Analysts will later call ML.PREDICT directly on the raw table from BI dashboards. You need to guarantee that missing numeric values are mean-imputed and that plan_type is one-hot encoded during both model training and every subsequent prediction, without requiring any additional preprocessing SQL in the dashboards. What should you do?
Create a materialized view that performs the imputing and one-hot encoding, train the model on that view, and require dashboards to invoke ML.PREDICT against the view instead of the raw table.
Apply only numeric normalization in the TRANSFORM clause and instruct dashboard developers to one-hot encode plan_type within their ML.PREDICT queries.
Specify a TRANSFORM clause when you CREATE MODEL, using ML.IMPUTER for the numeric columns and ML.ONE_HOT_ENCODER for plan_type; BigQuery ML will reuse these transformations automatically during ML.PREDICT.
Run a scheduled Dataflow pipeline that writes a fully preprocessed feature table; instruct dashboards to join to this table before calling ML.PREDICT so that the model receives clean features.
Answer Description
Define the preprocessing inside the TRANSFORM clause of the CREATE MODEL statement. By calling ML.IMPUTER on usage_minutes and tenure_days you ensure mean imputation for missing numeric values, and applying ML.ONE_HOT_ENCODER to plan_type converts the high-cardinality string column into sparse indicator features, optionally limited by top_k. BigQuery ML stores these transformations with the model and automatically reapplies them during ML.PREDICT, so dashboard queries can run predictions on raw records without repeating the logic. External pipelines, materialized views, or delegating encoding to dashboards would require manual coordination and risk feature skew, whereas BigQuery ML does not offer an auto_preprocess flag in CREATE MODEL.
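A hedged example of such a CREATE MODEL statement, issued through the BigQuery Python client, is shown below; the label column name (churned) and model name are assumptions, while the table and feature columns come from the scenario.

```python
# Sketch: TRANSFORM clause with ML.IMPUTER and ML.ONE_HOT_ENCODER so the
# preprocessing is stored with the model and reapplied by ML.PREDICT.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
CREATE OR REPLACE MODEL `prod.customer_churn_model`
TRANSFORM(
  ML.IMPUTER(usage_minutes, 'mean') OVER() AS usage_minutes,
  ML.IMPUTER(tenure_days, 'mean') OVER() AS tenure_days,
  ML.ONE_HOT_ENCODER(plan_type) OVER() AS plan_type,
  churned                                   -- assumed label column
)
OPTIONS(model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT usage_minutes, tenure_days, plan_type, churned
FROM `prod.customers`
"""
client.query(sql).result()

# Dashboards can now predict on raw rows; the stored TRANSFORM runs automatically.
predict_sql = (
    "SELECT * FROM ML.PREDICT(MODEL `prod.customer_churn_model`, "
    "TABLE `prod.customers`)"
)
```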
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is the `TRANSFORM` clause in BigQuery ML?
What does `ML.IMPUTER` do in BigQuery ML?
How does `ML.ONE_HOT_ENCODER` work for categorical variables?
What does ML.IMPUTER do in BigQuery ML?
How does ML.ONE_HOT_ENCODER work for categorical columns?
Why is it risky to preprocess data outside the TRANSFORM clause in BigQuery ML?
A fintech startup uses BigQuery on-demand for an ad-hoc fraud-detection dashboard and a nightly ETL that ingests multiple terabytes. Security analysts expect dashboard queries to return in seconds, but when the ETL overlaps, some dashboards queue for minutes and miss their SLO. You must remove this contention without buying additional BigQuery capacity or editing SQL. What should you do?
Keep both jobs interactive but move the ETL schedule to 03:00-05:00 when fewer analysts are online.
Buy additional on-demand slots so both workloads can run as interactive queries concurrently.
Submit the ETL job with batch query priority while leaving the dashboard queries as interactive.
Create a 500-slot reservation for the dashboard project and keep both workloads as interactive.
Answer Description
Submitting the ETL as a batch-priority query queues it until BigQuery detects idle slots, while interactive dashboard queries keep immediate priority. This prevents the ETL from delaying analyst queries, costs no extra money, and requires no SQL changes. Purchasing slots or reservations adds cost, and rescheduling the ETL might still overlap with peak usage.
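For illustration, batch priority is a one-line job-configuration change in the BigQuery Python client; the project name and ETL statement below are placeholders, and the SQL itself is unchanged.

```python
# Sketch: submit the nightly ETL with batch priority so it only consumes idle
# slots, leaving interactive capacity free for dashboard queries.
from google.cloud import bigquery

client = bigquery.Client(project="fraud-analytics")          # placeholder project
job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)

etl_sql = "CALL `fraud-analytics.etl.nightly_load`()"        # placeholder ETL statement
job = client.query(etl_sql, job_config=job_config)
job.result()  # BigQuery starts the job when idle slots become available
```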
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
ELI5: What does query priority mean in BigQuery?
What are idle slots in BigQuery and how do they affect batch queries?
Can query priority changes affect costs in BigQuery?
What is the difference between batch query priority and interactive query priority in BigQuery?
Why do batch-priority queries prevent contention with interactive queries in BigQuery?
How does setting query priority in BigQuery affect costs and performance?
Your organization has subscribed to a private listing in Analytics Hub that a partner publishes. The subscription automatically created a read-only linked dataset called ecommerce_partner.default in your analytics project. Several business analysts need to build Looker Studio dashboards that query this data, and you want to avoid additional storage cost or data-movement operations. Which action enables the analysts to visualize the shared data while following Google-recommended architecture and keeping operational overhead minimal?
Have analysts create a BigQuery data source in Looker Studio that points directly to the linked dataset, and grant them BigQuery Data Viewer on the dataset plus BigQuery Job User on the project.
Schedule a daily BigQuery Data Transfer Service job that copies all tables from the linked dataset into a native dataset, then connect Looker Studio to the copied tables.
Trigger a Cloud Function each night to export the linked tables as CSV files to Cloud Storage and use Looker Studio's Cloud Storage connector to build reports.
Set up scheduled queries that write the linked data into Cloud SQL, and configure Looker Studio to read from the Cloud SQL instance instead of BigQuery.
Answer Description
The linked dataset is already a first-class BigQuery dataset in the consumer project, backed by tables that physically reside in the publisher's project. Looker Studio natively connects to BigQuery through the BigQuery API, so it can query the linked dataset directly without copying or exporting data. Analysts simply need permission to read the dataset (roles/bigquery.dataViewer) and to run query jobs that bill the consumer project (roles/bigquery.jobUser). No data transfer, export, or replication jobs are required, eliminating extra storage cost and maintenance. Copying tables, exporting to Cloud Storage, or materializing data in Cloud SQL introduce unnecessary cost and operational complexity and therefore are not recommended for consuming Analytics Hub exchanges.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is a linked dataset in BigQuery?
What does roles/bigquery.dataViewer and roles/bigquery.jobUser provide access to?
How does Looker Studio integrate with BigQuery?
What is a linked dataset in Analytics Hub?
What permissions are required for analysts to work with linked datasets in Looker Studio?
Why is using direct BigQuery connections preferred for Looker Studio dashboards?
A smart-city analytics team ingests billions of JSON sensor readings each day through Pub/Sub and immediately writes them to a raw staging location. Compliance rules require that the unmodified records be retained for five years, with older data automatically moved to colder, less expensive storage classes. Engineers will later run Dataflow jobs that cleanse the data and load curated subsets into BigQuery on demand. Which sink best satisfies the retention, cost, and future-processing requirements for the raw data layer?
Insert the records into a partitioned BigQuery table using streaming inserts
Load the records into a Cloud SQL PostgreSQL database and enable point-in-time recovery
Store the records as objects in a Cloud Storage bucket with lifecycle rules
Persist the records in a wide-column Cloud Bigtable instance
Answer Description
Cloud Storage is designed for data-lake use cases where raw, unstructured, or semi-structured files must be kept for long periods at low cost. Bucket lifecycle policies can automatically transition objects from Standard to Nearline, Coldline, or Archive classes to meet five-year retention at minimal expense. Dataflow can natively read from and write to Cloud Storage, allowing downstream cleansing and loading into BigQuery without relocating the data. BigQuery is an analytics warehouse optimized for SQL query performance; loading every raw record would incur storage and streaming-insert costs and prevents tiered lifecycle storage. Cloud Bigtable provides low-latency key/value serving but lacks object lifecycle tiering and is cost-inefficient for write-once archival data. Cloud SQL is a managed relational OLTP service with size limits, higher per-GB cost, and no automated tiering, making it unsuitable for petabyte-scale, append-only raw data.
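A sketch of such lifecycle rules with the google-cloud-storage client might look like the following; the bucket name and age thresholds are illustrative, not requirements from the scenario.

```python
# Sketch: age raw telemetry objects into colder storage classes and delete them
# after the five-year retention period.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("raw-sensor-landing")             # placeholder bucket

bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)    # after 30 days
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=365)   # after 1 year
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=730)    # after 2 years
bucket.add_lifecycle_delete_rule(age=5 * 365)                      # 5-year retention
bucket.patch()
```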
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Cloud Storage's bucket lifecycle policy?
How does Dataflow integrate with Cloud Storage for processing?
Why is Cloud Storage preferred over BigQuery, Bigtable, or Cloud SQL in this scenario?
What are lifecycle rules in Cloud Storage?
Why is Cloud Storage better suited for raw data storage compared to BigQuery?
How does Dataflow integrate with Cloud Storage for data processing?
Your team builds an ELT workflow in Dataform that lands raw click-stream data in BigQuery and publishes cleaned tables for analysts. Compliance requires the nightly job to stop immediately whenever the current load introduces duplicate primary keys or orphaned foreign keys. Which Dataform construct should you use to add these data-quality gates so that the pipeline run automatically fails when the rule-checking query returns rows?
Configure the tables as incremental in Dataform and filter out problematic records with a WHERE clause referencing the latest updated_at timestamp.
Rely on BigQuery's built-in NOT NULL and UNIQUE table constraints to reject bad data during the load step.
Create separate .sqlx files defined with type: "assertion", each containing a query that returns rows when the quality rule is violated.
Attach postOperations blocks to the target tables to delete duplicates and unresolved foreign keys after the load finishes.
Answer Description
Create assertion actions in Dataform. An assertion is authored in a .sqlx file whose config block contains type: "assertion". At runtime Dataform executes the query and expects it to return an empty result set; any rows indicate a data-quality violation, causing the assertion to fail and the entire Dataform run to stop with an error. Post-operation scripts, BigQuery schema constraints, or incremental table filters do not provide the same automatic run-blocking behavior for complex business-rule checks such as duplicate detection or referential-integrity validation.
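To make the mechanism concrete, the duplicate-key rule below is shown as a standalone query run with the BigQuery Python client; in Dataform the same SELECT would be the body of a .sqlx file whose config block sets type: "assertion". Table and column names are placeholders.

```python
# Sketch: a data-quality rule expressed as a query that must return zero rows.
# Dataform fails the assertion (and the run) whenever rows come back; here the
# check is run manually for illustration.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT event_id, COUNT(*) AS dup_count
FROM `analytics.clickstream_clean`        -- placeholder table
GROUP BY event_id
HAVING COUNT(*) > 1
"""
rows = list(client.query(sql).result())
if rows:
    raise RuntimeError(f"Duplicate primary keys found: {len(rows)} offending ids")
```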
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Dataform and how does it work with BigQuery?
What are '.sqlx' files and how are they used for assertions?
Why are postOperations blocks or schema constraints unsuitable for Dataform's data-quality gating?
What is an assertion in Dataform?
How does Dataform stop the pipeline when an assertion fails?
Why don't postOperations or BigQuery constraints work the same as assertions for data-quality checks?
Your e-commerce analytics team issues ad-hoc interactive queries against a 180-TB BigQuery table that stores 90 days of click-stream events. The project is billed with BigQuery's on-demand model, and daily query volume fluctuates, making long-term slot commitments unattractive. Analysts usually inspect only the most recent three days of data, but each query currently scans the full table, driving up costs. To lower query charges while continuing to use on-demand pricing, which approach should you implement?
Export the data to Cloud Storage and query it as a BigLake external table, eliminating per-query charges.
Partition the table by date and require queries to include a filter on the partitioning column so only recent partitions are scanned.
Upgrade to BigQuery Enterprise Edition and buy a 500-slot reservation to run queries on flat-rate capacity.
Apply gzip compression to the existing table so the bytes scanned by each query are smaller.
Answer Description
Under BigQuery's on-demand model you pay for the number of bytes each query reads. Converting the log table to a date-partitioned table (for example, partitioned by ingestion or event date) and having analysts filter on the partitioning column limits scanning to just the partitions that hold the last three days of data. Because the amount of data read drops from 180 TB to roughly 6 TB, the on-demand cost of every query falls proportionally. Purchasing slot reservations, exporting to external tables, or applying gzip compression would not cut on-demand query bytes in this scenario: flat-rate slots change the pricing model, external tables are still billed per bytes processed, and storage compression does not change how many logical bytes a query scans.
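A sketch of the partitioned-table setup with the google-cloud-bigquery client follows; table and column names are illustrative, and require_partition_filter enforces the filter requirement described above.

```python
# Sketch: create a date-partitioned table that rejects queries without a
# partition filter, then query only the most recent three days.
from google.cloud import bigquery

client = bigquery.Client()
table = bigquery.Table(
    "analytics.clickstream.events_partitioned",   # placeholder project.dataset.table
    schema=[
        bigquery.SchemaField("event_date", "DATE"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("payload", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="event_date"
)
table.require_partition_filter = True  # queries must filter on event_date
client.create_table(table)

# A dashboard query now scans roughly three partitions instead of 180 TB.
recent_sql = """
SELECT user_id, COUNT(*) AS events
FROM `analytics.clickstream.events_partitioned`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
GROUP BY user_id
"""
```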
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
Why is table partitioning effective for lowering query costs in BigQuery?
What is the difference between on-demand pricing and flat-rate pricing in BigQuery?
How does gzip compression affect BigQuery query costs?
What is table partitioning in BigQuery?
How is cost calculated under BigQuery's on-demand pricing model?
What are the key differences between flat-rate and on-demand pricing in BigQuery?
Your company runs several streaming Dataflow jobs in separate Google Cloud projects, each triggered by Cloud Composer. SREs want one place to observe end-to-end pipeline lag, receive a PagerDuty alert if lag exceeds five minutes, and inspect individual worker logs without receiving broad project-level permissions. Which architecture best satisfies these requirements while minimizing operational overhead?
Enable Cloud Trace in every project, export latency traces to a shared Trace project, create alerts on trace duration, and grant SREs Trace Viewer to inspect worker traces.
Deploy Prometheus on GKE to scrape OpenCensus metrics from Dataflow workers, configure Alertmanager for paging, and set up an Elasticsearch-Kibana stack for logs with Kibana viewer access for SREs.
Create a central operations project, add all pipeline projects to its Cloud Monitoring metrics scope, define an alert on the Dataflow job system_lag metric (>300 s) with a PagerDuty notification channel, and configure aggregated Log Router sinks that export Dataflow worker logs to a log bucket in the operations project where SREs have Logs Viewer access.
Publish Dataflow metrics to Pub/Sub, stream them into BigQuery, use Cloud Scheduler queries to compute lag, trigger Cloud Functions to send PagerDuty alerts, and store worker logs in a BigQuery dataset shared with SREs.
Answer Description
Using a Cloud Monitoring metrics scope in a dedicated operations project aggregates metrics from multiple service projects automatically, letting engineers build a single dashboard. Dataflow already publishes the job metric system_lag, so an alerting policy that fires when the value is more than 300 seconds can notify PagerDuty. Aggregated Log Router sinks from each project can route Dataflow worker logs into a centralized log bucket where responders are granted the Logs Viewer role, giving them drill-down visibility without exposing wider project access. The other approaches either rely on services that do not natively collect Dataflow lag (Cloud Trace), add unnecessary components and maintenance (custom Pub/Sub-to-BigQuery plumbing or self-hosted Prometheus/Elasticsearch), or still require broader IAM grants.
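As a hedged sketch, the lag alert could be defined with the google-cloud-monitoring client roughly as follows; the operations project and the PagerDuty notification-channel resource name are placeholders.

```python
# Sketch: alert policy on Dataflow's job/system_lag metric in the central
# operations project, wired to an existing PagerDuty notification channel.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()
policy = monitoring_v3.AlertPolicy(
    display_name="Dataflow pipeline lag > 5 min",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="system_lag above 300s",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter=(
                    'metric.type = "dataflow.googleapis.com/job/system_lag" '
                    'AND resource.type = "dataflow_job"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=300,
                duration=duration_pb2.Duration(seconds=60),
            ),
        )
    ],
    notification_channels=[
        "projects/ops-project/notificationChannels/1234567890"  # placeholder channel
    ],
)
client.create_alert_policy(name="projects/ops-project", alert_policy=policy)
```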
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is a Cloud Monitoring metrics scope?
What is the system_lag metric in Dataflow?
What are aggregated Log Router sinks?
What is a Cloud Monitoring metrics scope in Google Cloud?
How does the Log Router sink work for exporting logs in Google Cloud?
What is the system_lag metric in Google Cloud Dataflow?
Your company ingests click-stream events into Pub/Sub and processes them in Cloud Dataflow to compute, per user, the duration of each browsing session. A session is any sequence of events separated by less than 30 minutes of inactivity. Product managers require an initial (possibly partial) session duration to be available within one minute after the first event in the session, while still accepting events that arrive up to 10 minutes late. Which Apache Beam windowing and trigger configuration best satisfies these requirements?
Sliding windows of 30 minutes with a 1-minute slide, no allowed lateness, trigger AfterCount(1) in DISCARDING mode
Session windows with a 30-minute gap duration, allowed lateness of 10 minutes, default AfterWatermark trigger plus an early firing AfterProcessingTime(1 minute) in ACCUMULATING mode
Global window with a processing-time trigger that fires every minute, ACCUMULATING mode, no allowed lateness
Fixed (tumbling) windows of 1 minute, allowed lateness of 10 minutes, AfterWatermark trigger only, DISCARDING mode
Answer Description
Because the business definition of a session is "activity separated by a 30-minute idle gap," the pipeline should use session windows with a 30-minute gap duration. To deliver an early, partial aggregation within one minute, add an early firing trigger that emits after one minute of processing time. The default AfterWatermark trigger ensures an on-time pane when the watermark passes the window end, and an allowed lateness of 10 minutes lets Dataflow reopen the window to merge any late events that still arrive within that tolerance. Using ACCUMULATING mode guarantees that every pane adds to the previous result instead of replacing it.
Other options fail to meet one or more requirements:
- Fixed (tumbling) windows cannot adapt to user-driven session boundaries.
- Sliding windows would double-count data and omit the required gap logic.
- A global window with only processing-time triggers would have no concept of a 30-minute session boundary and could not close windows correctly for late data.
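A minimal Apache Beam (Python SDK) sketch of the windowing configuration described above follows; the input is assumed to be a keyed PCollection of (user_id, event_timestamp_seconds) pairs.

```python
# Sketch: 30-minute session gap, early pane after 1 minute of processing time,
# 10 minutes of allowed lateness, accumulating panes.
import apache_beam as beam
from apache_beam.transforms.window import Sessions
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterProcessingTime, AfterWatermark,
)


def _session_duration(user_id, event_times):
    """Duration of one session pane: newest minus oldest event timestamp."""
    times = list(event_times)
    return user_id, max(times) - min(times)


def compute_session_durations(events):
    """events: PCollection of (user_id, event_timestamp_seconds) tuples."""
    return (
        events
        | "SessionWindow" >> beam.WindowInto(
            Sessions(gap_size=30 * 60),                             # 30-minute idle gap
            trigger=AfterWatermark(early=AfterProcessingTime(60)),  # partial pane after 1 min
            accumulation_mode=AccumulationMode.ACCUMULATING,
            allowed_lateness=10 * 60,                               # accept 10 min of late data
        )
        | "GroupByUser" >> beam.GroupByKey()
        | "Duration" >> beam.MapTuple(_session_duration)
    )
```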
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What are session windows in Apache Beam?
What does the 'AfterWatermark' trigger do in Apache Beam?
What is the difference between ACCUMULATING and DISCARDING mode in Apache Beam?
What are session windows in Apache Beam, and why are they used?
What does 'allowed lateness' mean in windowing, and why is it important here?
What is the role of triggers in Apache Beam, particularly the AfterWatermark trigger?
Your company has a 20-TB BigQuery dataset updated hourly. Three partners in different Google Cloud organizations need SQL access for dashboards, but the data must remain in your project. Each partner must pay its own query costs. Security policy prohibits granting dataset-level IAM roles to external principals; instead, access must be provided through a service built for cross-organization sharing. You also need to revoke access instantly without data copies or export jobs. Which design satisfies these constraints?
Use BigQuery Data Transfer Service to replicate the dataset into each partner's project on an hourly schedule.
Create a private data exchange in Analytics Hub, publish the BigQuery dataset as a listing, and have each partner subscribe, which creates a linked dataset they can query in their own projects.
Grant the partners BigQuery Data Viewer roles on the dataset and instruct them to run cross-project queries using the fully qualified table name.
Schedule a daily export of the dataset to Cloud Storage and give partners ACL access so they can create external tables that query the exported files.
Answer Description
Publishing the dataset through Analytics Hub meets every requirement while respecting the security policy. A private data exchange with a listing that references the existing dataset lets the data stay in the publisher project. When each partner subscribes, BigQuery creates a read-only linked dataset in the partner's project that appears in their BigQuery Explorer, and all query charges are billed to the partner's project. Access can be revoked instantly by disabling the listing or removing a subscriber. The other approaches either violate the policy against external IAM roles, require ongoing data movement, or risk billing the publisher.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Analytics Hub in Google Cloud?
What is the difference between linked datasets and exported datasets?
How does billing work when using linked datasets from Analytics Hub?
What is Analytics Hub in Google Cloud?
How does BigQuery handle linked datasets for subscribers?
What differentiates a private data exchange from a public one in Analytics Hub?
Your analytics team must orchestrate a daily data pipeline that: triggers a Cloud Storage Transfer job, runs custom Python data-quality scripts on Cloud Run, loads cleansed data into BigQuery, and finally calls a Vertex AI prediction endpoint. The workflow needs conditional branching, cross-task retries, SSH connections to an on-premises host, and a graphical DAG that operators can monitor. To satisfy these requirements while avoiding heavy infrastructure management and allowing reuse of existing Airflow DAGs, which Google Cloud service should you use?
Workflows
Cloud Composer (managed Apache Airflow)
Dataflow Flex Templates with pipeline options
Cloud Scheduler triggers invoking Pub/Sub topics and Cloud Functions
Answer Description
Cloud Composer is Google Cloud's managed Apache Airflow service. Airflow expresses complex, multi-step workflows as DAGs that support conditional branches, retries, and dependencies. Composer lets you install custom Python packages and Airflow plugins, so you can invoke Cloud Run jobs, BigQuery operators, Storage Transfer Service hooks, SSH operators for on-prem hosts, and custom calls to Vertex AI. It also provides the standard Airflow web UI for visual monitoring, while Google manages the underlying infrastructure.
Workflows can orchestrate Google Cloud APIs but offers limited first-class support for SSH connections, custom Python libraries, or direct reuse of existing Airflow DAGs; authoring complex branching pipelines is possible but less ergonomic. Cloud Scheduler with Pub/Sub and Cloud Functions would require substantial custom code to replicate DAG-level dependency management and monitoring. Dataflow Flex Templates address data processing, not cross-service orchestration. Therefore, Cloud Composer best meets all stated requirements.
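To make this concrete, a trimmed-down Airflow DAG for Composer might look like the sketch below; connection IDs, SQL, and the branching condition are placeholders, the Cloud Run and Vertex AI tasks are omitted for brevity, and Airflow 2 with the Google and SSH provider packages is assumed.

```python
# Sketch: daily DAG with retries, an SSH step to an on-prem host, a branch, and
# a BigQuery load. All connection IDs and statements are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="daily_quality_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:

    pull_onprem_manifest = SSHOperator(
        task_id="pull_onprem_manifest",
        ssh_conn_id="onprem_ssh",                     # placeholder Airflow connection
        command="/opt/exports/generate_manifest.sh",
    )

    def choose_branch():
        # In a real DAG, branch on the outcome of the Cloud Run quality checks.
        quality_passed = True  # placeholder
        return "load_bigquery" if quality_passed else "skip_load"

    branch = BranchPythonOperator(task_id="branch", python_callable=choose_branch)

    load_bigquery = BigQueryInsertJobOperator(
        task_id="load_bigquery",
        configuration={
            "query": {
                "query": "CALL `analytics.etl.load_cleansed`()",  # placeholder
                "useLegacySql": False,
            }
        },
    )
    skip_load = EmptyOperator(task_id="skip_load")

    pull_onprem_manifest >> branch >> [load_bigquery, skip_load]
```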
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is a DAG in Apache Airflow?
How does Cloud Composer manage infrastructure for workflows?
Why is Cloud Composer better for this use case than Workflows?
What is a DAG in Apache Airflow?
How does Cloud Composer handle infrastructure management?
Why can't Workflows replace Cloud Composer in the given use case?
Cool beans!
Looks like that's it! You can go back and review your answers or click the button below to grade your test.