
GCP Professional Data Engineer Practice Test

Use the form below to configure your GCP Professional Data Engineer Practice Test. You can limit the practice test to specific exam objectives and domains, choose between 5 and 100 questions, and set a time limit.

  • Questions: the number of questions in the practice test (free users are limited to 20 questions; upgrade for unlimited).
  • Seconds Per Question: determines how long you have to finish the practice test.
  • Exam Objectives: which exam objectives should be included in the practice test.

GCP Professional Data Engineer Information

Overview

The Google Cloud Professional Data Engineer (PDE) certification is designed to validate a practitioner’s ability to build, operationalize, secure, and monitor data processing systems on Google Cloud Platform (GCP). Candidates are expected to demonstrate proficiency in designing data‐driven solutions that are reliable, scalable, and cost-effective—spanning everything from ingestion pipelines and transformation jobs to advanced analytics and machine-learning models. Earning the PDE credential signals to employers that you can translate business and technical requirements into robust data architectures while adhering to best practices for security, compliance, and governance.

Exam Structure and Knowledge Domains

The exam is a two-hour test of multiple-choice and multiple-select questions, available in a proctored, in-person or online format. Questions target real-world scenarios across five sections: (1) designing data processing systems; (2) ingesting and processing the data; (3) storing the data; (4) preparing and using data for analysis; and (5) maintaining and automating data workloads. You might be asked to choose optimal storage solutions (BigQuery, Cloud Spanner, Bigtable), architect streaming pipelines with Pub/Sub and Dataflow, or troubleshoot performance bottlenecks. Because the PDE focuses heavily on applied problem-solving rather than rote memorization, hands-on experience—whether via professional projects or Google’s Qwiklabs/Cloud Skills Boost labs—is critical for success.

About GCP PDE Practice Exams

Taking reputable practice exams is one of the most efficient ways to gauge readiness and close knowledge gaps. High-quality mocks mirror the actual test’s wording, timing, and scenario-based style, helping you get comfortable with the pace and depth of questioning. After each attempt, review explanations—not just the items you missed, but also the ones you answered correctly—to reinforce concepts and uncover lucky guesses. Tracking performance over multiple sittings shows whether your improvement is consistent or if certain domains lag behind. When used alongside hands-on labs, whitepapers, and documentation, practice tests become a feedback loop that sharpens both your intuition and time-management skills.

Preparation Tips

Begin your preparation with the official exam guide to map each task statement to concrete learning resources (Coursera courses, Google documentation, blog posts). Build small proof-of-concept projects—such as streaming IoT data to BigQuery or automating model retraining with AI Platform—to anchor theory in practice. In the final weeks, shift from broad study to focused review: revisit weak areas highlighted by practice exams, skim product release notes for recent feature updates, and fine-tune your exam-day strategy (flag uncertain questions, manage breaks, monitor the clock). By combining targeted study, practical experimentation, and iterative assessment, you can approach the GCP Professional Data Engineer exam with confidence and a clear roadmap to certification.

  • Free GCP Professional Data Engineer Practice Test

  • 20 Questions
  • Unlimited time
  • Designing data processing systems
  • Ingesting and processing the data
  • Storing the data
  • Preparing and using data for analysis
  • Maintaining and automating data workloads

Free Preview

This test is a free preview, no account required.

Question 1 of 20

Your e-commerce site streams user clicks from Pub/Sub into a Dataflow pipeline built with Apache Beam. A session is any sequence of events for one user with no more than 30 minutes of inactivity. Product managers expect provisional per-user session counts every 5 minutes of processing time, yet final results must still include events that arrive up to 15 minutes after the session gap closes. Which windowing and triggering setup best satisfies these needs while avoiding unnecessary windows?

  • Use session windows with a 30-minute gap; add an early AfterProcessingTime trigger that fires 5 minutes after the first element, keep the default on-time watermark firing, set allowed lateness to 15 minutes, and accumulate fired panes.

  • Use 30-minute fixed windows; add an AfterProcessingTime trigger that fires 5 minutes after the first element and set allowed lateness to 15 minutes.

  • Use sliding windows of 30 minutes that advance every 5 minutes with the default watermark trigger and no allowed lateness.

  • Use the global window; set an AfterProcessingTime trigger to fire 30 minutes after each element and discard fired panes with no allowed lateness.
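
For reference, session windowing with early firings and allowed lateness looks roughly like the following in the Beam Python SDK. This is a minimal sketch, assuming `clicks` is an already-timestamped PCollection of (user_id, event) tuples read from Pub/Sub; the name is a placeholder.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger


def apply_session_counts(clicks):
    """clicks: PCollection of (user_id, event) tuples with event timestamps."""
    return (
        clicks
        | "SessionWindow" >> beam.WindowInto(
            window.Sessions(30 * 60),                        # 30-minute inactivity gap
            trigger=trigger.AfterWatermark(                  # default on-time firing...
                early=trigger.AfterProcessingTime(5 * 60)),  # ...plus provisional firings
            allowed_lateness=15 * 60,                        # accept events up to 15 min late
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        | "CountPerUser" >> beam.combiners.Count.PerKey())
```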

Question 2 of 20

Your organization runs a Dataflow streaming job that continuously writes events into an existing BigQuery dataset containing sensitive customer information. Security policy mandates least-privilege access for the Dataflow worker service account: it must be able to create new tables in that dataset and append or overwrite rows, but it must not change table schemas or manage dataset-level access controls. You need to grant a single predefined IAM role on the dataset to satisfy this requirement. Which role should you assign?

  • Grant roles/bigquery.dataEditor on the dataset

  • Grant roles/bigquery.dataOwner on the dataset

  • Grant roles/bigquery.jobUser on the project

  • Grant roles/bigquery.admin on the project

Question 3 of 20

Your organization receives 4 TB of JSON telemetry each day from hundreds of thousands of IoT devices. The events must be filtered for malformed records, deduplicated by device-timestamp, enriched with a 100 MB reference table that updates weekly, and streamed into partitioned BigQuery tables with sub-minute latency. Data engineers also need to rerun the identical logic over months of archived data for occasional backfills. Operations teams require automatic horizontal scaling and no cluster management. Which Google Cloud solution best satisfies all requirements?

  • Run Spark Streaming on a long-running Cloud Dataproc cluster, coupled with scheduled Dataprep jobs for cleaning and a separate batch Spark job for backfills.

  • Load raw events directly into BigQuery with streaming inserts and use Dataform SQL models to cleanse, deduplicate, and enrich data in place.

  • Implement an Apache Beam pipeline on Cloud Dataflow that streams events, uses a weekly-refreshed side input for enrichment, and can be re-run in batch for backfills.

  • Create a Cloud Data Fusion pipeline with Wrangler transforms, triggered by Cloud Composer, and manually scale the underlying Dataproc cluster for spikes.
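
As background for the enrichment pattern mentioned above, a simplified Beam sketch is shown below: it reads from Pub/Sub, drops malformed records, joins against a reference table via a side input, and writes to BigQuery. Deduplication and the weekly side-input refresh (typically built around PeriodicImpulse) are omitted for brevity, and all project, topic, and table names are placeholders.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_json(message):
    """Yield the decoded event, silently dropping malformed records."""
    try:
        yield json.loads(message)
    except (ValueError, TypeError):
        return


def run(argv=None):
    opts = PipelineOptions(argv, streaming=True, save_main_session=True)
    with beam.Pipeline(options=opts) as p:
        # Reference data is loaded once at startup; a weekly refresh would
        # rebuild this side input from a PeriodicImpulse-driven branch.
        reference = (
            p
            | "ReadReference" >> beam.io.ReadFromBigQuery(
                query="SELECT device_id, model FROM `my-project.ref.devices`",
                use_standard_sql=True)
            | "ToKV" >> beam.Map(lambda row: (row["device_id"], row["model"])))

        (p
         | "ReadEvents" >> beam.io.ReadFromPubSub(
             topic="projects/my-project/topics/telemetry")
         | "Parse" >> beam.FlatMap(parse_json)
         | "DropIncomplete" >> beam.Filter(
             lambda e: isinstance(e, dict) and "device_id" in e and "ts" in e)
         | "Enrich" >> beam.Map(
             lambda e, ref: {**e, "model": ref.get(e["device_id"])},
             ref=beam.pvalue.AsDict(reference))
         | "WriteBQ" >> beam.io.WriteToBigQuery(
             "my-project:telemetry.events",  # assumed existing partitioned table
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


if __name__ == "__main__":
    run()
```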

Question 4 of 20

Your company runs a group of Compute Engine instances that execute nightly analytics jobs containing protected health information (PHI). The jobs must read reference files from an encrypted Cloud Storage bucket and write results to a BigQuery dataset, both located in the production project. Compliance forbids embedding any long-lived user credentials in the VM images, and the security team requires least-privilege access with minimal operational effort for credential rotation. Which design best satisfies these constraints?

  • Generate individual service-account keys for each engineer, embed the JSON key files in the VM startup script, and grant BigQuery Admin and Storage Admin roles at the project level. Rotate the keys quarterly.

  • Grant the default Compute Engine service account the Project Editor role and let the application use the default credentials automatically provided by the metadata server.

  • Store a Cloud Storage HMAC key in Secret Manager; have the application fetch the key at startup to sign requests to the bucket and to authenticate to BigQuery with signed URLs.

  • Create a dedicated service account (for example, sa-analytics-vm). Grant it Storage Object Viewer on the specific bucket and BigQuery Data Editor on the target dataset, attach it as the runtime service account for the instances, and do not generate any user-managed keys.
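
For context, code running on a VM with an attached service account obtains short-lived credentials from the metadata server through Application Default Credentials, so no key files are involved. A minimal sketch, with bucket, object, and dataset names as placeholders:

```python
import google.auth
from google.cloud import bigquery, storage

# Application Default Credentials resolve to the VM's attached service
# account via the metadata server; no JSON key file is ever loaded.
credentials, project_id = google.auth.default()

# Read a reference file from the encrypted bucket (placeholder names).
storage_client = storage.Client(credentials=credentials, project=project_id)
reference_bytes = (storage_client.bucket("analytics-reference")
                   .blob("ref.csv")
                   .download_as_bytes())

# Write results to BigQuery (placeholder dataset/table).
bq_client = bigquery.Client(credentials=credentials, project=project_id)
bq_client.query(
    "INSERT INTO `analytics.results` (run_date) VALUES (CURRENT_DATE())"
).result()
```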

Question 5 of 20

You are the lead data engineer at a media-streaming company. A real-time pipeline ingests Pub/Sub events, processes them through multiple Cloud Dataflow streaming jobs, and writes to BigQuery. Support engineers need a single dashboard that shows each stage's built-in system metrics together with Apache Beam user-defined counters, and they must receive alerts whenever end-to-end latency exceeds two minutes or worker-pool CPU utilization stays above 80 percent for five minutes. The business wants to minimize operational overhead and avoid operating any additional monitoring stack. Which monitoring strategy best meets these requirements?

  • Leverage the default export of Dataflow job metrics to Cloud Monitoring, emit Beam counters as custom metrics, and build Cloud Monitoring dashboards and alerting policies for latency and CPU usage.

  • Deploy a managed Prometheus server on GKE to scrape Dataflow worker logs, store the metrics, and visualize and alert on them using a self-hosted Grafana dashboard.

  • Enable Cloud Trace for each Dataflow job and configure trace-based alerts; correlate spans from Pub/Sub, Dataflow, and BigQuery in the Cloud Trace console.

  • Create log sinks for Pub/Sub, Dataflow, and BigQuery into Cloud Logging, derive log-based metrics for latency and CPU, and visualize them with Logging charts and alerts.
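
As a refresher, Apache Beam user-defined metrics declared in pipeline code are exported by Dataflow alongside its built-in job metrics, where they can be charted and alerted on in Cloud Monitoring. A minimal sketch of a counter and a distribution inside a DoFn:

```python
import apache_beam as beam
from apache_beam.metrics import Metrics


class TagEvents(beam.DoFn):
    def __init__(self):
        # User-defined metrics, surfaced as custom metrics for the Dataflow job.
        self.parsed = Metrics.counter(self.__class__, "events_parsed")
        self.payload_size = Metrics.distribution(self.__class__, "payload_bytes")

    def process(self, element):
        self.parsed.inc()
        self.payload_size.update(len(element))
        yield element
```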

Question 6 of 20

A healthcare provider is building a Dataflow pipeline that loads sensitive genomic records into a BigQuery dataset located in the europe-west2 region. Regulations require that: 1) all data be encrypted with keys that remain exclusively in the hospital's on-premises HSM, 2) every decryption operation be auditable in Cloud Logging, and 3) no application code changes be needed beyond configuration. Which key-management approach should you recommend?

  • Create Customer-Managed Encryption Keys (CMEK) in Cloud KMS and rotate them weekly.

  • Enable default Google-managed encryption for BigQuery and Dataflow artifacts and rely on Cloud Audit Logs.

  • Configure Customer-Supplied Encryption Keys (CSEK) and pass the key with every Dataflow and BigQuery request.

  • Use Cloud External Key Manager (EKM) with an externally managed key and enable CMEK on the BigQuery dataset and Dataflow temporary buckets.
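
Whatever the key source, CMEK on BigQuery is applied through configuration rather than application code. The sketch below sets a dataset's default key with the Python client; the key resource name is a placeholder (for Cloud EKM it would reference a key with the EXTERNAL protection level).

```python
from google.cloud import bigquery

client = bigquery.Client()

dataset = bigquery.Dataset("my-project.genomics_eu")  # placeholder dataset ID
dataset.location = "europe-west2"
# Placeholder key name; with Cloud EKM this key's material stays in the
# external (on-premises) HSM and Cloud KMS only forwards wrap/unwrap calls.
dataset.default_encryption_configuration = bigquery.EncryptionConfiguration(
    kms_key_name="projects/my-project/locations/europe-west2/"
                 "keyRings/ehr-ring/cryptoKeys/ekm-key")

dataset = client.create_dataset(dataset)
```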

Question 7 of 20

A global retailer ingests daily sales transactions into BigQuery. Customer email addresses and phone numbers must not be visible to analysts at partner companies, yet those partners need to join multiple days of data on the same customers to calculate repeat-purchase metrics. To satisfy GDPR requirements, the retailer wants a managed solution that irreversibly replaces the sensitive fields while preserving deterministic joinability across data sets. Which approach best meets these needs?

  • Rely on BigQuery column-level access controls to hide the email and phone columns from partner accounts.

  • Enable Customer-Managed Encryption Keys (CMEK) on the BigQuery dataset so the PII remains encrypted when shared with partners.

  • Run Cloud DLP to perform deterministic cryptographic tokenization of the email and phone fields using a customer-managed key before loading each day's files.

  • Store the PII columns in a separate Cloud SQL instance and share only integer foreign keys with partners.

Question 8 of 20

Your company's data platform team must provide a self-service environment where data engineers across multiple projects can discover, profile, and govern files stored in Cloud Storage and tables in BigQuery. The solution should automatically scan new and existing assets to harvest technical metadata, generate data profiles that include statistics such as null counts and cardinality, and surface the assets through a unified catalog that supports fine-grained access controls. The team wants to minimize custom code and avoid deploying third-party software. Which design best satisfies these requirements?

  • Register every bucket and dataset in standalone Data Catalog entry groups and trigger Cloud Functions that launch Dataflow jobs to calculate statistics and update metadata tables.

  • Create Dataplex lakes and governed zones that reference the Cloud Storage buckets and BigQuery datasets, enable automated discovery, data profiling, and quality scans in each zone, and use the Dataplex catalog for cross-project search and access control.

  • Centralize all data by copying it into a single BigQuery dataset with BigQuery Omni, then rely on INFORMATION_SCHEMA views and custom Cloud Composer DAGs to generate profiling reports.

  • Use Cloud Asset Inventory to index storage objects and datasets, and schedule BigQuery Data Transfer Service jobs to load audit logs that analysts can query for metadata and quality metrics.

Question 9 of 20

Your company ingests IoT telemetry at 30,000 messages per second via Cloud Pub/Sub. A streaming Dataflow job in us-central1 transforms the data and writes to a BigQuery dataset also in us-central1. The business requires that if the entire us-central1 region becomes unavailable, no more than 60 seconds of data may be lost (RPO ≀ 1 minute) and processing must resume in another region within 15 minutes (RTO ≀ 15 minutes) without manual code changes. Which design meets these objectives with the least operational overhead?

  • Enable Pub/Sub topic replication to us-east1 and use a Cloud Composer DAG that launches the Dataflow template in us-east1 when a regional health check fails; keep the dataset in a us-east1 regional BigQuery location.

  • Configure the job to autoscale across all zones in us-central1 and snapshot state to a dual-region Cloud Storage bucket every minute; redeploy the template manually in another region during an outage.

  • Create a second pull subscription to the Pub/Sub topic and deploy an identical streaming Dataflow Flex Template in us-east1 writing to a multi-region BigQuery dataset; run both pipelines continuously with idempotent writes.

  • Modify the existing Dataflow job to enable drain-and-restore, set a 60-second checkpoint interval, and rely on BigQuery regional redundancy for protection.
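
The two-subscription pattern referenced above relies on the fact that every subscription receives its own copy of each message published to a topic, so a pipeline in a second region can consume the stream independently. A minimal sketch of creating the extra pull subscription (project, topic, and subscription IDs are placeholders):

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path("my-project", "iot-telemetry")
subscription_path = subscriber.subscription_path(
    "my-project", "iot-telemetry-us-east1")

# Each subscription gets a full copy of the message stream, so the
# us-east1 pipeline processes independently of the one in us-central1.
subscriber.create_subscription(
    request={"name": subscription_path, "topic": topic_path})
```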

Question 10 of 20

Your company, a global retailer subject to GDPR, stores transactional data in a BigQuery table called customer_orders that has the columns order_id, item_id, customer_email, credit_card_hash, and amount. Marketing analysts must be able to run ad-hoc SQL on every column except customer_email and credit_card_hash, while the Risk team needs unrestricted access. The solution must scale so that any new columns later classified as PII are automatically protected without rewriting queries or creating additional tables. How should you implement this in BigQuery?

  • Create a Data Catalog taxonomy with a PII policy tag, attach the tag to customer_email and credit_card_hash, grant the Risk group permissions to read that policy tag and the dataset, and give Marketing only dataset-level BigQuery read access without tag permission.

  • Move customer_email and credit_card_hash into a separate BigQuery table, restrict access to that table to the Risk team, and let Marketing query the remaining columns in the original table.

  • Encrypt only the customer_email and credit_card_hash columns with customer-managed encryption keys (CMEK) and provide the decryption key to the Risk team but not to Marketing analysts.

  • Build an authorized view that omits the customer_email and credit_card_hash columns, share the view with Marketing analysts, and share the underlying table directly with the Risk team.
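
For reference, BigQuery column-level security is driven by policy tags attached to individual schema fields, so newly classified columns are protected by tagging them rather than by rewriting queries. A sketch with the Python client; the table ID and policy-tag resource name are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.sales.customer_orders")  # placeholder table ID

# Placeholder resource name of the PII policy tag in the Data Catalog taxonomy.
PII_TAG = "projects/my-project/locations/us/taxonomies/123/policyTags/456"

new_schema = []
for field in table.schema:
    if field.name in ("customer_email", "credit_card_hash"):
        # Re-create the field with the PII policy tag attached.
        field = bigquery.SchemaField(
            field.name, field.field_type, mode=field.mode,
            policy_tags=bigquery.PolicyTagList(names=[PII_TAG]))
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])
```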

Question 11 of 20

Your company ACME Payments is building a streaming analytics pipeline on Google Cloud to process credit-card transactions from EU customers. Regulations require that (1) all personal data is stored and processed exclusively in EU regions, (2) primary account numbers (PANs) are pseudonymized but remain reversible for future investigations, (3) data analysts must not have access to decryption keys, and (4) the Dataflow pipeline must follow least-privilege principles. Which approach best meets these requirements?

  • Use Cloud External Key Manager with keys in a US HSM for format-preserving encryption, store the pseudonymized data in a BigQuery dataset in europe-west2, and allow analysts to decrypt by granting them roles/cloudkms.cryptoKeyEncrypterDecrypter.

  • Enforce the constraints/gcp.resourceLocations policy to permit only EU regions; run Dataflow in europe-west1 using Cloud DLP deterministic encryption protected by an EU-resident CMEK key in Cloud KMS; write results to a BigQuery dataset in europe-west1; grant analysts roles/bigquery.dataViewer only; grant the Dataflow service account roles/bigquery.dataEditor on the dataset and roles/cloudkms.cryptoKeyEncrypterDecrypter on the key.

  • Enable Assured Workloads for EU but allow resources in any region; in Dataflow apply irreversible DLP redaction before loading to a multi-regional BigQuery dataset; grant analysts roles/bigquery.dataOwner and roles/cloudkms.cryptoKeyDecrypter for investigation needs.

  • Deploy Dataflow in us-central1, hash PANs with SHA-256 during processing, store the output in a US multi-region BigQuery dataset, and grant analysts only the roles/bigquery.metadataViewer role.

Question 12 of 20

A healthcare provider stores sensitive patient telemetry in BigQuery. A new regulation requires that the encryption keys protecting this data must remain in an on-premises, FIPS 140-2 Level 3 certified HSM that is managed exclusively by the provider's security team. Analysts must continue to run existing SQL workloads without code changes, and key rotation must occur automatically through the key-management system rather than by updating application logic. Which Google Cloud encryption approach best meets these requirements?

  • Configure BigQuery to use a Customer-Managed Encryption Key that is hosted in an on-premises HSM through Cloud External Key Manager.

  • Enable the default Google-managed encryption that automatically secures data at rest.

  • Protect the dataset with Customer-Supplied Encryption Keys (CSEK) provided in every BigQuery API call.

  • Configure BigQuery with Customer-Managed Encryption Keys stored in Cloud KMS and backed by Cloud HSM.

Question 13 of 20

A payment-processing company ingests transaction records from multiple branches into a BigQuery table. Each record contains the cardholder's full name and the 16-digit primary account number (PAN).
Compliance requires the following before data can be queried by data scientists in the analytics project:

  • Names must be pseudonymized in a way that lets datasets from different branches still be joined on the same customer.
  • PANs must be rendered non-reversible, but analysts need the last four digits for charge-back investigations.

You are designing a Dataflow pipeline that calls Cloud Data Loss Prevention (DLP) for in-stream de-identification. Which approach best meets both requirements while minimizing the risk of re-identification?

  • Apply a CryptoReplaceFfxFpeConfig transform to the name field and to the PAN field using the same Cloud KMS key so that both values remain reversible for auditors.

  • Store the raw table in a restricted project and grant analysts a BigQuery view that excludes the name and PAN columns; do not perform any in-pipeline transformation.

  • Encrypt the entire table with Cloud KMS at rest and allow analysts to decrypt on read; rely on Data Catalog column tags to warn users about personal data.

  • Apply a CryptoDeterministicConfig transform to the name field using a shared Cloud KMS key, and apply a CharacterMaskConfig that masks the first 12 digits of the PAN, leaving the last 4 digits visible.
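
The two Cloud DLP primitive transformations mentioned above can be combined in a single de-identification request. The sketch below shows the shape of such a request against a small sample table; the field names, KMS key, and wrapped-key bytes are placeholders.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

deidentify_config = {
    "record_transformations": {
        "field_transformations": [
            {
                # Deterministic tokenization keeps names joinable across branches.
                "fields": [{"name": "cardholder_name"}],
                "primitive_transformation": {
                    "crypto_deterministic_config": {
                        "crypto_key": {"kms_wrapped": {
                            "wrapped_key": b"...",  # placeholder wrapped key
                            "crypto_key_name": (
                                "projects/my-project/locations/us/"
                                "keyRings/dlp-ring/cryptoKeys/token-key")}},
                        "surrogate_info_type": {"name": "NAME_TOKEN"},
                    }
                },
            },
            {
                # Mask the first 12 PAN digits; keep the last 4 for charge-backs.
                "fields": [{"name": "pan"}],
                "primitive_transformation": {
                    "character_mask_config": {
                        "masking_character": "#",
                        "number_to_mask": 12,
                    }
                },
            },
        ]
    }
}

response = dlp.deidentify_content(
    request={
        "parent": "projects/my-project/locations/global",
        "deidentify_config": deidentify_config,
        "item": {"table": {  # one sample row; a Dataflow pipeline streams these
            "headers": [{"name": "cardholder_name"}, {"name": "pan"}],
            "rows": [{"values": [{"string_value": "Jane Doe"},
                                 {"string_value": "4111111111111111"}]}]}},
    })
```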

Question 14 of 20

Your retail company runs a 15 TB Oracle 12c database in its on-premises data center that records incoming online orders. You need to populate a BigQuery dataset in Google Cloud and keep it synchronized with the source in near real time so analysts always see up-to-date data. A 1 Gbps Cloud VPN already connects the data center to Google Cloud, and the team prefers a managed, serverless solution that automatically performs change-data capture with minimal ongoing operations work. Which Google Cloud migration service should you use?

  • Migrate the database with Database Migration Service into Cloud SQL and query it from BigQuery.

  • Use BigQuery Data Transfer Service to schedule daily incremental imports from Oracle.

  • Ship a Transfer Appliance with exported database files, then load them into BigQuery.

  • Use Datastream to stream CDC events from Oracle to BigQuery.

Question 15 of 20

After migrating a 40-TB Oracle data mart to BigQuery using Datastream (CDC -> Cloud Storage) and Dataflow loads, you must prove before cut-over that every source row matches its BigQuery copy. The solution has to 1) scale across hundreds of tables without per-table coding, 2) surface any row-level mismatches, and 3) expose results to Cloud Monitoring for alerts. Which approach best meets these requirements?

  • Develop individual Dataflow pipelines for each table that calculate row hashes in Oracle and BigQuery, then compare the results and publish a metric.

  • Enable a built-in Datastream data-validation feature to generate checksum comparisons automatically and send the results to Cloud Logging.

  • Run Google's open-source Data Validation Tool as a Dataflow flex template to compute per-table checksums between Oracle and BigQuery, log results to Cloud Logging, and create log-based metrics for Cloud Monitoring alerts.

  • Create final BigQuery snapshots and run manual EXCEPT queries against exported Oracle CSV files; record any differences in a spreadsheet.

Question 16 of 20

Your health-insurance company ingests millions of call-center transcripts from Cloud Storage into BigQuery each day for trend analysis. Regulations forbid storing clear-text PII such as customer names and phone numbers, yet analysts must be able to deterministically group conversations that belong to the same customer for audits. You want a scalable, fully managed solution that requires minimal custom code and lets you add new PII detectors later. Which design should you implement?

  • Load the raw transcripts into BigQuery first and use SQL REGEXP_REPLACE functions in scheduled queries to overwrite PII columns with randomly generated strings.

  • Encrypt each transcript locally with a customer-supplied encryption key (CSEK) and load the encrypted files directly into BigQuery so analysts can decrypt data when needed.

  • Enable BigQuery column-level security on the PII columns and grant access only to authorized roles while keeping the original transcripts unchanged in BigQuery.

  • Invoke the Cloud DLP Files on Cloud Storage to BigQuery Dataflow template with a de-identification configuration that applies CryptoDeterministicConfig using a customer-managed Cloud KMS key, producing tokenized names and phone numbers before loading the data into BigQuery.

Question 17 of 20

A financial services company runs its analytics platform on Google Cloud. Security architects set these requirements: all BigQuery tables containing customer PII must reside only in EU regions; business analysts can run aggregate queries but must never see raw email or phone columns; a Dataflow pipeline service account should have only the permissions required to insert new partitions into the same tables. Which design best satisfies all requirements while following the principle of least privilege?

  • Create a raw dataset in europe-west1 and apply the gcp.resourceLocations organization policy to EU regions. Publish an authorized view that provides only aggregated results and share that view with the analyst group. Grant the analysts bigquery.dataViewer on the dataset that houses the view and bigquery.jobUser on the project. Grant the Dataflow service account bigquery.dataEditor on the raw dataset.

  • Load PII into a US multi-regional dataset after redacting email and phone fields with Cloud DLP; give analysts bigquery.jobUser on the project and bigquery.dataViewer on the dataset; grant the Dataflow service account bigquery.dataEditor.

  • Replicate the dataset to europe-west1 and give analysts access through BigQuery column-level security by assigning them the bigquery.policyTagAccessor role; omit any organization policy, and grant the Dataflow service account bigquery.dataOwner on the dataset.

  • Place the tables in the EU multi-regional location and label sensitive columns with Data Catalog policy tags; give analysts bigquery.dataViewer on the raw dataset and bigquery.tagUser on the tags, and give the Dataflow service account bigquery.dataOwner on the dataset.
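
As background, an authorized view is an ordinary view that is additionally added to the source dataset's access list, letting it query the raw tables on behalf of readers who cannot access that dataset directly. A sketch with the BigQuery Python client (project, dataset, and query are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Create the aggregated view in a dataset the analysts can read.
view = bigquery.Table("my-project.reporting_eu.daily_aggregates")  # placeholder IDs
view.view_query = """
    SELECT country, DATE(event_ts) AS day, COUNT(*) AS events
    FROM `my-project.raw_eu.customer_events`
    GROUP BY country, day
"""
view = client.create_table(view)

# Authorize the view against the raw dataset so it can read the PII tables
# even though the analysts themselves have no access to that dataset.
raw_dataset = client.get_dataset("my-project.raw_eu")
entries = list(raw_dataset.access_entries)
entries.append(
    bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
raw_dataset.access_entries = entries
client.update_dataset(raw_dataset, ["access_entries"])
```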

Question 18 of 20

During a quarterly audit, you discover that all 20 data scientists in your analytics project were granted the primitive Editor role so they could create and modify BigQuery tables. The CISO asks you to immediately reduce the blast radius while ensuring the scientists can continue their normal workloads. Which action best satisfies the principle of least privilege?

  • Remove the Editor binding and grant each scientist the predefined role roles/bigquery.dataEditor only on the datasets they work with.

  • Downgrade each scientist to the Viewer primitive role and allow them to impersonate a service account that still has the Editor role when they need write access.

  • Retain the Editor role but enable Cloud Audit Logs and set up log-based alerts to detect any misuse of non-BigQuery services.

  • Replace the Editor role with a custom role that includes all resourcemanager.* permissions but excludes storage.* permissions to protect Cloud Storage data.

Question 19 of 20

A global media company stores raw logs in several Cloud Storage buckets across multiple regions and ingests curated data into multiple BigQuery projects that are owned by different business units. Data scientists complain that they cannot easily discover which tables contain video-stream metrics or which buckets store ad-impression logs without asking individual teams. The chief data officer wants a single place where all Cloud Storage objects and BigQuery tables are automatically indexed, enriched with business metadata, and made searchable through a common API while still enforcing existing IAM policies. As the lead data engineer, which design should you implement to satisfy these requirements with minimal custom development and maximum portability across present and future Google Cloud projects?

  • Use Cloud Asset Inventory to list resources across projects, write Cloud Functions that parse the export into Pub/Sub, and build a Looker dashboard for interactive search.

  • Create a Dataplex lake, attach each Cloud Storage bucket and BigQuery dataset as governed assets, define zones and business tags, and rely on the Dataplex Universal Catalog (searchable through Data Catalog APIs) for discovery.

  • Enable BigQuery Data Catalog in every project, export catalog entries nightly to a central Cloud SQL instance, and build a custom front-end that merges the exports for search.

  • Install an open-source metadata repository such as DataHub on Google Kubernetes Engine, build custom crawlers for Cloud Storage and BigQuery, and expose search through a REST endpoint.
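
For reference, assets indexed in the catalog can be searched programmatically across projects through the Data Catalog search API, which continues to honor the caller's IAM permissions. A minimal sketch; the project IDs and query string are placeholders.

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

scope = datacatalog_v1.SearchCatalogRequest.Scope(
    include_project_ids=["media-raw", "media-curated"])  # placeholder projects

# Search across BigQuery tables and Cloud Storage filesets indexed in the catalog.
results = client.search_catalog(
    request={"scope": scope, "query": "video stream metrics type=table"})

for result in results:
    print(result.relative_resource_name)
```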

Question 20 of 20

A media-analytics company stores all BigQuery datasets in the us-central1 regional location. A new internal SLA says that reporting queries must keep working with no data loss if one of the zones in that region goes offline. Management will accept a short outage if the entire region fails and does not want to pay for extra pipelines or a second copy of the data. What should you change to meet the SLA?

  • Set up a BigQuery Data Transfer Service job to copy each table from us-central1 to us-east1 every 15 minutes.

  • Nightly export the datasets to a dual-region Cloud Storage bucket and re-import them into BigQuery during an outage.

  • Keep the datasets in the current regional location and make no additional changes; BigQuery already provides automatic, zero-RPO replication across zones inside the region.

  • Move all datasets to the multi-regional US location so that data is replicated across multiple regions instead of zones.