GCP Professional Data Engineer Practice Test

Use the form below to configure your GCP Professional Data Engineer Practice Test. The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

  • Questions: the number of questions in the practice test (free users are limited to 20 questions; upgrade for unlimited)
  • Seconds Per Question: determines how long you have to finish the practice test
  • Exam Objectives: which exam objectives should be included in the practice test

GCP Professional Data Engineer Information

Overview

The Google Cloud Professional Data Engineer (PDE) certification is designed to validate a practitioner’s ability to build, operationalize, secure, and monitor data processing systems on Google Cloud Platform (GCP). Candidates are expected to demonstrate proficiency in designing data‐driven solutions that are reliable, scalable, and cost-effective—spanning everything from ingestion pipelines and transformation jobs to advanced analytics and machine-learning models. Earning the PDE credential signals to employers that you can translate business and technical requirements into robust data architectures while adhering to best practices for security, compliance, and governance.

Exam Structure and Knowledge Domains

The exam is a two-hour, multiple-choice test available in a proctored, in-person or online format. Questions target real-world scenarios across five domains: (1) designing data processing systems; (2) ingesting and processing the data; (3) storing the data; (4) preparing and using data for analysis; and (5) maintaining and automating data workloads. You might be asked to choose optimal storage solutions (BigQuery, Cloud Spanner, Bigtable), architect streaming pipelines with Pub/Sub and Dataflow, or troubleshoot performance bottlenecks. Because the PDE focuses heavily on applied problem-solving rather than rote memorization, hands-on experience—whether via professional projects or Google’s Cloud Skills Boost (formerly Qwiklabs) labs—is critical for success.

About GCP PDE Practice Exams

Taking reputable practice exams is one of the most efficient ways to gauge readiness and close knowledge gaps. High-quality mocks mirror the actual test’s wording, timing, and scenario-based style, helping you get comfortable with the pace and depth of questioning. After each attempt, review explanations—not just the items you missed, but also the ones you answered correctly—to reinforce concepts and uncover lucky guesses. Tracking performance over multiple sittings shows whether your improvement is consistent or if certain domains lag behind. When used alongside hands-on labs, whitepapers, and documentation, practice tests become a feedback loop that sharpens both your intuition and time-management skills.

Preparation Tips

Begin your preparation with the official exam guide to map each task statement to concrete learning resources (Coursera courses, Google documentation, blog posts). Build small proof-of-concept projects—such as streaming IoT data to BigQuery or automating model retraining with AI Platform—to anchor theory in practice. In the final weeks, shift from broad study to focused review: revisit weak areas highlighted by practice exams, skim product release notes for recent feature updates, and fine-tune your exam-day strategy (flag uncertain questions, manage breaks, monitor the clock). By combining targeted study, practical experimentation, and iterative assessment, you can approach the GCP Professional Data Engineer exam with confidence and a clear roadmap to certification.

  • Free GCP Professional Data Engineer Practice Test

  • 20 Questions
  • Unlimited time
  • Designing data processing systems
  • Ingesting and processing the data
  • Storing the data
  • Preparing and using data for analysis
  • Maintaining and automating data workloads
Question 1 of 20

Your organization runs a Dataflow streaming job that continuously writes events into an existing BigQuery dataset containing sensitive customer information. Security policy mandates least-privilege access for the Dataflow worker service account: it must be able to create new tables in that dataset and append or overwrite rows, but it must not change table schemas or manage dataset-level access controls. You need to grant a single predefined IAM role on the dataset to satisfy this requirement. Which role should you assign?

  • Grant roles/bigquery.dataEditor on the dataset

  • Grant roles/bigquery.jobUser on the project

  • Grant roles/bigquery.dataOwner on the dataset

  • Grant roles/bigquery.admin on the project
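
To make the dataset-level mechanics concrete, here is a minimal Python sketch of adding an access entry for a worker service account with the google-cloud-bigquery client. The project, dataset, and service-account names are placeholders, and the role string is only one of the options listed above, not an answer key.

    from google.cloud import bigquery

    # All names below are hypothetical; substitute your own project, dataset,
    # Dataflow worker service account, and whichever predefined role you select.
    client = bigquery.Client(project="example-project")
    dataset = client.get_dataset("example-project.customer_events")

    entries = list(dataset.access_entries)
    entries.append(
        bigquery.AccessEntry(
            role="roles/bigquery.dataEditor",  # placeholder predefined role
            entity_type="userByEmail",         # service accounts are granted access by email
            entity_id="dataflow-worker@example-project.iam.gserviceaccount.com",
        )
    )
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])  # patch only the access list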

Question 2 of 20

A media-streaming company runs an Apache Beam pipeline on Cloud Dataflow in the us-central1 region. The job keeps several terabytes of user session data in Redis to perform low-latency joins. Management wants the pipeline to survive a complete zonal outage without manual intervention while keeping operational overhead and complexity to a minimum. Which approach best meets these requirements?

  • Create a Memorystore for Redis Cluster instance in us-central1. Configure the Dataflow pipeline to connect through the cluster's discovery endpoint and rely on its built-in multi-zone shard replication.

  • Run open-source Redis Cluster on a stateful GKE deployment distributed across three zones and manage failover with custom scripts and Kubernetes operators.

  • Provision two Basic Tier Memorystore for Redis instances, one in us-central1-a and one in us-central1-b, and modify the Dataflow job to write to both instances for redundancy.

  • Deploy a Standard Tier Memorystore for Redis instance in us-central1-a and create a Cloud SQL read replica in a different zone to take over if the primary zone fails.
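
For context on how a pipeline might talk to a cluster-mode Redis deployment, the sketch below shows a Beam DoFn that opens one redis-py cluster connection per worker through a discovery endpoint. The endpoint address, key layout, and field names are assumptions for illustration only.

    import apache_beam as beam
    from redis.cluster import RedisCluster

    class EnrichWithSession(beam.DoFn):
        """Illustrative low-latency lookup against a Redis cluster during a join."""

        def __init__(self, discovery_host, discovery_port=6379):
            self._host = discovery_host   # e.g. the cluster's discovery endpoint address
            self._port = discovery_port
            self._client = None

        def setup(self):
            # One client per worker; the cluster client discovers the shard topology itself.
            self._client = RedisCluster(host=self._host, port=self._port)

        def process(self, event):
            session = self._client.get(f"session:{event['user_id']}")  # assumed key layout
            yield {**event, "session": session}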

Question 3 of 20

Your e-commerce platform runs a self-managed PostgreSQL 12 cluster handling 25,000 write TPS against 10 TB of data. Business analysts need sub-second ad-hoc reporting on the same tables without affecting OLTP latency. You must lift-and-shift to Google Cloud within one quarter, reuse existing SQL, avoid managing storage-compute scaling or patching, and minimize downtime during future maintenance. Which managed Google Cloud service best satisfies both transactional and analytical requirements while preserving PostgreSQL compatibility?

  • BigQuery with federated external tables over exported data

  • Spanner configured with the PostgreSQL interface

  • Cloud SQL for PostgreSQL with high-availability configuration and read replicas

  • AlloyDB for PostgreSQL

Question 4 of 20

Your manufacturing company collects 150,000 JSON telemetry events per second from thousands of factory devices worldwide. Dashboards in BigQuery must reflect events within 30 seconds of publication. Devices occasionally emit malformed JSON that should be quarantined for later inspection without interrupting ingest. The team wants a fully managed, autoscaling solution that minimizes ongoing operations. Which architecture best satisfies these requirements?

  • Deploy a long-lived Spark Streaming job on a Dataproc cluster that consumes the Pub/Sub topic, cleans the data, writes to BigQuery, and stores malformed records in an HDFS directory.

  • Have devices write newline-delimited JSON files to Cloud Storage and configure a BigQuery load job every 15 minutes with an error log destination for rows that fail to parse.

  • Trigger a Cloud Function for each message delivered by a Pub/Sub push subscription and insert the event into BigQuery; wrap the insert in a try/catch block that logs malformed JSON to Cloud Logging.

  • Publish events to a Pub/Sub topic that has a dead-letter topic enabled; run an autoscaling Dataflow streaming pipeline that parses the JSON, writes valid rows to BigQuery via the Storage Write API, and routes parsing failures to the dead-letter topic.
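
As a sketch of the dead-letter pattern referenced in the options, the pipeline below tags rows that fail JSON parsing and routes them to a separate topic while valid rows go to BigQuery. The subscription, topic, and table names are assumptions, and the destination table is assumed to already exist with a matching schema.

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse(msg):
        # Tag malformed payloads instead of letting them crash the pipeline.
        try:
            yield beam.pvalue.TaggedOutput("valid", json.loads(msg.decode("utf-8")))
        except (ValueError, UnicodeDecodeError):
            yield beam.pvalue.TaggedOutput("invalid", msg)

    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        parsed = (
            p
            | beam.io.ReadFromPubSub(subscription="projects/example/subscriptions/telemetry-sub")
            | beam.FlatMap(parse).with_outputs("valid", "invalid")
        )
        parsed.valid | beam.io.WriteToBigQuery(
            "example:factory.telemetry",  # table assumed to exist with a matching schema
            method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
        )
        parsed.invalid | beam.io.WriteToPubSub(topic="projects/example/topics/telemetry-dead-letter")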

Question 5 of 20

Your organization is adopting Dataplex to unify governance across its Google Cloud data estate. Three business domains (Sales, Marketing, and Finance) will ingest raw data into domain-owned buckets, transform it in Dataproc, and publish cleaned datasets in BigQuery. In addition, several enterprise reference tables (for example, country codes and fiscal calendars) must be discoverable and consistently governed by a central data-governance team while remaining accessible to all domains without duplicating the data. Which Dataplex design best satisfies these requirements and aligns with the lake → zone → asset hierarchy?

  • Store the reference datasets in a Cloud Storage bucket that is not registered with Dataplex and let each domain lake create external tables pointing to that bucket for queries.

  • Within each domain lake, define three zones (raw, reference, curated) and bulk-replicate the reference datasets into the reference zone of every lake so each team can manage its own copy.

  • Create a separate Enterprise lake managed by the governance team that contains a single curated zone with the reference datasets as assets, and keep raw and curated zones inside each domain lake for domain-specific data.

  • Add the reference datasets as additional assets inside the curated zone of every domain lake and rely on Dataplex asset sharing to grant cross-domain access.

Question 6 of 20

Your organization manages several BigQuery projects. Interactive queries that refresh the executive dashboard in the prod-analytics project start at 09:00 each weekday and must finish within seconds. During the rest of the day, development and ad-hoc queries from other projects may use any spare capacity, but the dashboard must never be slowed when it runs. What is the most cost-effective way to guarantee performance for the dashboard while still letting the other projects use leftover capacity?

  • Purchase a 1-year commitment for 1,000 slots, place them in a dedicated reservation assigned to prod-analytics, and create a second 0-slot reservation for the other projects so they can opportunistically borrow idle slots.

  • Run the dashboard queries with batch priority and enable query result caching so they do not compete with other workloads.

  • Upgrade every BigQuery project to Enterprise Plus edition and rely on automatic slot scaling to handle the 09:00 dashboard workload.

  • Buy Flex slots at 08:55 each morning for prod-analytics and delete them after the dashboard finishes; keep all projects on on-demand pricing for the rest of the day.
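
For reference, here is a hedged sketch of how slot capacity, a reservation, and a project assignment might be wired together with the BigQuery Reservation API Python client. The admin project, location, slot count, and reservation ID are assumptions, not a recommendation of any particular option above.

    from google.cloud import bigquery_reservation_v1 as reservation

    client = reservation.ReservationServiceClient()
    parent = "projects/admin-project/locations/US"  # assumed reservation admin project/location

    # Commit to capacity, carve it into a dedicated reservation, and assign a project to it.
    client.create_capacity_commitment(
        parent=parent,
        capacity_commitment=reservation.CapacityCommitment(
            slot_count=1000,
            plan=reservation.CapacityCommitment.CommitmentPlan.ANNUAL,
        ),
    )
    dashboards = client.create_reservation(
        parent=parent,
        reservation_id="dashboards",
        reservation=reservation.Reservation(slot_capacity=1000),
    )
    client.create_assignment(
        parent=dashboards.name,
        assignment=reservation.Assignment(
            assignee="projects/prod-analytics",
            job_type=reservation.Assignment.JobType.QUERY,
        ),
    )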

Question 7 of 20

You are designing a streaming pipeline that ingests temperature readings from tens of thousands of IoT devices through Pub/Sub and must persist the data in a storage tier that can sustain millions of writes per second while offering sub-10-millisecond read latency for lookups by device-id and event timestamp to power a real-time dashboard. The data is append-only and will not be queried with complex joins or multi-row transactions. Which Google Cloud sink best meets these requirements with minimal operational overhead?

  • Cloud Pub/Sub Lite topic configured with 7-day message retention

  • BigQuery partitioned table using ingestion-time partitioning

  • Cloud Spanner table keyed on device-id with a secondary index on timestamp

  • Cloud Bigtable with a composite row key of device-id and reversed timestamp
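
To illustrate the composite row-key design referenced above, the sketch below writes readings keyed by device-id plus a reversed timestamp so the newest rows for a device sort first. The instance, table, and column-family names are assumptions.

    import time
    from google.cloud import bigtable

    MAX_TS_MS = 10**13  # constant subtracted from the timestamp to reverse sort order

    client = bigtable.Client(project="example-project", admin=False)
    table = client.instance("telemetry-instance").table("device_readings")

    def write_reading(device_id, temperature_c):
        ts_ms = int(time.time() * 1000)
        row_key = f"{device_id}#{MAX_TS_MS - ts_ms}".encode()  # device-id + reversed timestamp
        row = table.direct_row(row_key)
        row.set_cell("metrics", "temperature", str(temperature_c))
        row.commit()

    # A dashboard lookup for the latest readings of one device becomes a short prefix scan.
    latest = table.read_rows(start_key=b"sensor-42#", end_key=b"sensor-42#\xff", limit=10)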

Question 8 of 20

During a quarterly audit, you discover that all 20 data scientists in your analytics project were granted the primitive Editor role so they could create and modify BigQuery tables. The CISO asks you to immediately reduce the blast radius while ensuring the scientists can continue their normal workloads. Which action best satisfies the principle of least privilege?

  • Replace the Editor role with a custom role that includes all resourcemanager.* permissions but excludes storage.* permissions to protect Cloud Storage data.

  • Downgrade each scientist to the Viewer primitive role and allow them to impersonate a service account that still has the Editor role when they need write access.

  • Retain the Editor role but enable Cloud Audit Logs and set up log-based alerts to detect any misuse of non-BigQuery services.

  • Remove the Editor binding and grant each scientist the predefined role roles/bigquery.dataEditor only on the datasets they work with.

Question 9 of 20

A global equities trading platform must ingest more than one million trade order updates per second from users on three continents. Each order write must commit in under 10 ms with full ACID semantics and globally consistent reads. Analysts require ad-hoc SQL queries on the full history of trades with less than five-second freshness, but the engineering team wants to avoid managing infrastructure. Which GCP service combination best satisfies both workloads?

  • Ingest both transactional and analytical workloads directly into a partitioned BigQuery dataset using BigQuery Omni.

  • Persist orders in Cloud Spanner and stream Spanner change streams to BigQuery with Dataflow for analytics.

  • Use Cloud SQL with cross-region read replicas for orders and replicate to BigQuery with Datastream.

  • Store orders in Cloud Bigtable and expose the table as an external table for direct querying from BigQuery.
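
For background on change streams, the sketch below declares one over a hypothetical Orders table using the Spanner Python client. The instance, database, table, and stream names are assumptions; the Dataflow replication into BigQuery is referenced only in the comment.

    from google.cloud import spanner

    client = spanner.Client(project="example-project")
    database = client.instance("trading-instance").database("orders-db")

    # Declare a change stream over the Orders table; a Dataflow job (for example, the
    # Google-provided Spanner change streams to BigQuery template) can then replicate
    # every commit into an analytics dataset within seconds.
    operation = database.update_ddl(["CREATE CHANGE STREAM OrdersStream FOR Orders"])
    operation.result()  # block until the schema change completes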

Question 10 of 20

Your analytics team has deployed a Cloud Data Fusion Enterprise edition instance in the us-central1 region. The instance was provisioned with a private IP so that its management UI and the Dataproc ephemeral clusters it creates have no public IPv4 addresses.
You now need to allow the pipelines that run inside the Data Fusion tenant project to read and write data in Cloud Bigtable tables that reside in a VPC network (prod-analytics-vpc) in your customer project. The security team requires that all traffic stay on Google's private backbone; the Bigtable instances must remain reachable only over internal IP addresses, and no inbound firewall openings in prod-analytics-vpc are allowed.
Which networking approach meets the requirements while following Google-recommended architecture for Cloud Data Fusion private deployments?

  • Expose the Cloud Bigtable instances through Private Service Connect and have the Data Fusion instance consume the published PSC endpoints over the internet.

  • Convert the tenant project into a service project of the customer's Shared VPC host so that Dataproc clusters obtain IP addresses directly inside prod-analytics-vpc.

  • Create a Cloud NAT gateway in the tenant project and route traffic from the Dataproc subnet to the internet; whitelist the gateway's public IP range on Bigtable.

  • Peer the tenant project's default network with prod-analytics-vpc by using VPC Network Peering and rely on existing firewall egress rules for the Dataproc workers.

Question 11 of 20

You are building a BigQuery ML logistic-regression model on table prod.customers, which contains nullable numeric columns (usage_minutes, tenure_days) and a high-cardinality STRING column plan_type. Analysts will later call ML.PREDICT directly on the raw table from BI dashboards. You need to guarantee that missing numeric values are mean-imputed and that plan_type is one-hot encoded during both model training and every subsequent prediction, without requiring any additional preprocessing SQL in the dashboards. What should you do?

  • Create a materialized view that performs the imputing and one-hot encoding, train the model on that view, and require dashboards to invoke ML.PREDICT against the view instead of the raw table.

  • Apply only numeric normalization in the TRANSFORM clause and instruct dashboard developers to one-hot encode plan_type within their ML.PREDICT queries.

  • Specify a TRANSFORM clause when you CREATE MODEL, using ML.IMPUTER for the numeric columns and ML.ONE_HOT_ENCODER for plan_type; BigQuery ML will reuse these transformations automatically during ML.PREDICT.

  • Run a scheduled Dataflow pipeline that writes a fully preprocessed feature table; instruct dashboards to join to this table before calling ML.PREDICT so that the model receives clean features.
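
Since this question centers on the TRANSFORM clause, here is a hedged CREATE MODEL sketch submitted through the BigQuery Python client. The model name, label column (churned), and imputation strategy are assumptions layered on top of the columns described in the question.

    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")

    # Transformations declared in TRANSFORM are stored with the model, so the same
    # imputation and encoding are replayed automatically during ML.PREDICT.
    create_model_sql = """
    CREATE OR REPLACE MODEL `example-project.prod.churn_model`
    TRANSFORM(
      ML.IMPUTER(usage_minutes, 'mean') OVER () AS usage_minutes,
      ML.IMPUTER(tenure_days, 'mean') OVER () AS tenure_days,
      ML.ONE_HOT_ENCODER(plan_type) OVER () AS plan_type_encoded,
      churned                                   -- assumed label column
    )
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT usage_minutes, tenure_days, plan_type, churned
    FROM `example-project.prod.customers`
    """
    client.query(create_model_sql).result()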

Question 12 of 20

A fintech startup uses BigQuery on-demand for an ad-hoc fraud-detection dashboard and a nightly ETL that ingests multiple terabytes. Security analysts expect dashboard queries to return in seconds, but when the ETL overlaps, some dashboards queue for minutes and miss their SLO. You must remove this contention without buying additional BigQuery capacity or editing SQL. What should you do?

  • Keep both jobs interactive but move the ETL schedule to 03:00-05:00 when fewer analysts are online.

  • Buy additional on-demand slots so both workloads can run as interactive queries concurrently.

  • Submit the ETL job with batch query priority while leaving the dashboard queries as interactive.

  • Create a 500-slot reservation for the dashboard project and keep both workloads as interactive.
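
As a concrete illustration of query priorities, the sketch below submits a hypothetical ETL statement as a batch-priority job with the BigQuery Python client; the stored procedure name is an assumption.

    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")

    # Batch-priority jobs wait for idle capacity instead of competing with
    # interactive dashboard queries.
    job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)
    etl_job = client.query(
        "CALL `example-project.etl.nightly_ingest`()",  # assumed stored procedure
        job_config=job_config,
    )
    etl_job.result()  # blocks until the batch job eventually runs and finishes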

Question 13 of 20

Your organization has subscribed to a private listing in Analytics Hub that a partner publishes. The subscription automatically created a read-only linked dataset called ecommerce_partner.default in your analytics project. Several business analysts need to build Looker Studio dashboards that query this data, and you want to avoid additional storage cost or data-movement operations. Which action enables the analysts to visualize the shared data while following Google-recommended architecture and keeping operational overhead minimal?

  • Have analysts create a BigQuery data source in Looker Studio that points directly to the linked dataset, and grant them BigQuery Data Viewer on the dataset plus BigQuery Job User on the project.

  • Schedule a daily BigQuery Data Transfer Service job that copies all tables from the linked dataset into a native dataset, then connect Looker Studio to the copied tables.

  • Trigger a Cloud Function each night to export the linked tables as CSV files to Cloud Storage and use Looker Studio's Cloud Storage connector to build reports.

  • Set up scheduled queries that write the linked data into Cloud SQL, and configure Looker Studio to read from the Cloud SQL instance instead of BigQuery.

Question 14 of 20

A smart-city analytics team ingests billions of JSON sensor readings each day through Pub/Sub and immediately writes them to a raw staging location. Compliance rules require that the unmodified records be retained for five years, with older data automatically moved to colder, less expensive storage classes. Engineers will later run Dataflow jobs that cleanse the data and load curated subsets into BigQuery on demand. Which sink best satisfies the retention, cost, and future-processing requirements for the raw data layer?

  • Insert the records into a partitioned BigQuery table using streaming inserts

  • Load the records into a Cloud SQL PostgreSQL database and enable point-in-time recovery

  • Store the records as objects in a Cloud Storage bucket with lifecycle rules

  • Persist the records in a wide-column Cloud Bigtable instance
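
To show what lifecycle management looks like in code, here is a sketch that tiers raw objects to colder storage classes and expires them after roughly five years. The bucket name and age thresholds are assumptions.

    from google.cloud import storage

    client = storage.Client(project="example-project")
    bucket = client.get_bucket("raw-sensor-landing")  # assumed staging bucket

    # Tier raw objects down over time and delete them once the retention window ends.
    bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=365)
    bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=1095)
    bucket.add_lifecycle_delete_rule(age=1825)  # ~5 years
    bucket.patch()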

Question 15 of 20

Your team builds an ELT workflow in Dataform that lands raw click-stream data in BigQuery and publishes cleaned tables for analysts. Compliance requires the nightly job to stop immediately whenever the current load introduces duplicate primary keys or orphaned foreign keys. Which Dataform construct should you use to add these data-quality gates so that the pipeline run automatically fails when the rule-checking query returns rows?

  • Configure the tables as incremental in Dataform and filter out problematic records with a WHERE clause referencing the latest updated_at timestamp.

  • Rely on BigQuery's built-in NOT NULL and UNIQUE table constraints to reject bad data during the load step.

  • Create separate .sqlx files defined with type: "assertion", each containing a query that returns rows when the quality rule is violated.

  • Attach postOperations blocks to the target tables to delete duplicates and unresolved foreign keys after the load finishes.

Question 16 of 20

Your e-commerce analytics team issues ad-hoc interactive queries against a 180-TB BigQuery table that stores 90 days of click-stream events. The project is billed with BigQuery's on-demand model, and daily query volume fluctuates, making long-term slot commitments unattractive. Analysts usually inspect only the most recent three days of data, but each query currently scans the full table, driving up costs. To lower query charges while continuing to use on-demand pricing, which approach should you implement?

  • Export the data to Cloud Storage and query it as a BigLake external table, eliminating per-query charges.

  • Partition the table by date and require queries to include a filter on the partitioning column so only recent partitions are scanned.

  • Upgrade to BigQuery Enterprise Edition and buy a 500-slot reservation to run queries on flat-rate capacity.

  • Apply gzip compression to the existing table so the bytes scanned by each query are smaller.
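
For reference, the sketch below shows one way a date-partitioned copy of the table could be created with a required partition filter, submitted through the BigQuery Python client. The dataset, table, and timestamp column names are assumptions.

    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")

    # Repartition by event date and force every query to prune partitions.
    client.query("""
        CREATE TABLE `example-project.analytics.events_partitioned`
        PARTITION BY DATE(event_ts)
        OPTIONS (require_partition_filter = TRUE) AS
        SELECT * FROM `example-project.analytics.events`
    """).result()

    # A three-day dashboard query then adds a filter such as:
    #   WHERE DATE(event_ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)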

Question 17 of 20

Your company runs several streaming Dataflow jobs in separate Google Cloud projects, each triggered by Cloud Composer. SREs want one place to observe end-to-end pipeline lag, receive a PagerDuty alert if lag exceeds five minutes, and inspect individual worker logs without receiving broad project-level permissions. Which architecture best satisfies these requirements while minimizing operational overhead?

  • Enable Cloud Trace in every project, export latency traces to a shared Trace project, create alerts on trace duration, and grant SREs Trace Viewer to inspect worker traces.

  • Deploy Prometheus on GKE to scrape OpenCensus metrics from Dataflow workers, configure Alertmanager for paging, and set up an Elasticsearch-Kibana stack for logs with Kibana viewer access for SREs.

  • Create a central operations project, add all pipeline projects to its Cloud Monitoring metrics scope, define an alert on the Dataflow job system_lag metric (>300 s) with a PagerDuty notification channel, and configure aggregated Log Router sinks that export Dataflow worker logs to a log bucket in the operations project where SREs have Logs Viewer access.

  • Publish Dataflow metrics to Pub/Sub, stream them into BigQuery, use Cloud Scheduler queries to compute lag, trigger Cloud Functions to send PagerDuty alerts, and store worker logs in a BigQuery dataset shared with SREs.

Question 18 of 20

Your company ingests click-stream events into Pub/Sub and processes them in Cloud Dataflow to compute, per user, the duration of each browsing session. A session is any sequence of events separated by less than 30 minutes of inactivity. Product managers require an initial (possibly partial) session duration to be available within one minute after the first event in the session, while still accepting events that arrive up to 10 minutes late. Which Apache Beam windowing and trigger configuration best satisfies these requirements?

  • Sliding windows of 30 minutes with a 1-minute slide, no allowed lateness, trigger AfterCount(1) in DISCARDING mode

  • Session windows with a 30-minute gap duration, allowed lateness of 10 minutes, default AfterWatermark trigger plus an early firing AfterProcessingTime(1 minute) in ACCUMULATING mode

  • Global window with a processing-time trigger that fires every minute, ACCUMULATING mode, no allowed lateness

  • Fixed (tumbling) windows of 1 minute, allowed lateness of 10 minutes, AfterWatermark trigger only, DISCARDING mode
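
To ground the windowing vocabulary used in these options, here is a runnable Beam sketch that applies session windows with an early processing-time firing and a late-data allowance. The element shape and the placeholder aggregation are assumptions, not the full session-duration calculation.

    import apache_beam as beam
    from apache_beam.transforms import trigger, window

    with beam.Pipeline() as p:
        sessions = (
            p
            | beam.Create([("user-1", 10.0), ("user-1", 400.0)])  # (user_id, event time) stand-ins
            | beam.MapTuple(lambda user, ts: window.TimestampedValue((user, ts), ts))
            | beam.WindowInto(
                window.Sessions(30 * 60),                    # 30-minute inactivity gap
                trigger=trigger.AfterWatermark(
                    early=trigger.AfterProcessingTime(60)    # speculative firing after ~1 minute
                ),
                accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
                allowed_lateness=10 * 60,                    # accept events up to 10 minutes late
            )
            | beam.CombinePerKey(min)  # placeholder; real code would derive session duration
        )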

Question 19 of 20

Your company has a 20-TB BigQuery dataset updated hourly. Three partners in different Google Cloud organizations need SQL access for dashboards, but the data must remain in your project. Each partner must pay its own query costs. Security policy prohibits granting dataset-level IAM roles to external principals; instead, access must be provided through a service built for cross-organization sharing. You also need to revoke access instantly without data copies or export jobs. Which design satisfies these constraints?

  • Use BigQuery Data Transfer Service to replicate the dataset into each partner's project on an hourly schedule.

  • Create a private data exchange in Analytics Hub, publish the BigQuery dataset as a listing, and have each partner subscribe, which creates a linked dataset they can query in their own projects.

  • Grant the partners BigQuery Data Viewer roles on the dataset and instruct them to run cross-project queries using the fully qualified table name.

  • Schedule a daily export of the dataset to Cloud Storage and give partners ACL access so they can create external tables that query the exported files.

Question 20 of 20

Your analytics team must orchestrate a daily data pipeline that: triggers a Cloud Storage Transfer job, runs custom Python data-quality scripts on Cloud Run, loads cleansed data into BigQuery, and finally calls a Vertex AI prediction endpoint. The workflow needs conditional branching, cross-task retries, SSH connections to an on-premises host, and a graphical DAG that operators can monitor. To satisfy these requirements while avoiding heavy infrastructure management and allowing reuse of existing Airflow DAGs, which Google Cloud service should you use?

  • Workflows

  • Cloud Composer (managed Apache Airflow)

  • Dataflow Flex Templates with pipeline options

  • Cloud Scheduler triggers invoking Pub/Sub topics and Cloud Functions
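
Because the question hinges on reusing Airflow conventions, here is a hedged, minimal DAG sketch of the kind Cloud Composer would run. The operator choices, task names, connection IDs, and the stored procedure are all assumptions standing in for the real transfer, Cloud Run, BigQuery, SSH, and Vertex AI steps.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import BranchPythonOperator, PythonOperator
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
    from airflow.providers.ssh.operators.ssh import SSHOperator

    def start_transfer(**_):
        """Placeholder: trigger the Storage Transfer job through its API client."""

    def run_quality_checks(**_):
        """Placeholder: call the Cloud Run data-quality service over HTTPS."""

    def call_vertex_endpoint(**_):
        """Placeholder: send features to the Vertex AI prediction endpoint."""

    def choose_branch(**_):
        # Conditional branching: only load when the quality checks pass.
        return "load_to_bigquery"

    with DAG(
        dag_id="daily_analytics_pipeline",        # assumed DAG and task names
        start_date=datetime(2024, 1, 1),
        schedule="@daily",                        # Airflow 2.4+ keyword
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},  # cross-task retries
    ) as dag:
        transfer = PythonOperator(task_id="trigger_transfer", python_callable=start_transfer)
        quality = PythonOperator(task_id="quality_checks_on_cloud_run", python_callable=run_quality_checks)
        branch = BranchPythonOperator(task_id="branch_on_quality", python_callable=choose_branch)
        load = BigQueryInsertJobOperator(
            task_id="load_to_bigquery",
            configuration={"query": {"query": "CALL `example.etl.load_cleansed`()", "useLegacySql": False}},
        )
        onprem = SSHOperator(
            task_id="notify_on_prem_host",
            ssh_conn_id="onprem_ssh",             # assumed Airflow connection
            command="/opt/pipeline/post_load_hook.sh",
        )
        predict = PythonOperator(task_id="call_vertex_endpoint", python_callable=call_vertex_endpoint)

        transfer >> quality >> branch >> load >> onprem >> predict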