GCP Professional Data Engineer Practice Test
Use the form below to configure your GCP Professional Data Engineer Practice Test. The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

GCP Professional Data Engineer Information
Overview
The Google Cloud Professional Data Engineer (PDE) certification is designed to validate a practitioner's ability to build, operationalize, secure, and monitor data processing systems on Google Cloud Platform (GCP). Candidates are expected to demonstrate proficiency in designing data-driven solutions that are reliable, scalable, and cost-effective, spanning everything from ingestion pipelines and transformation jobs to advanced analytics and machine-learning models. Earning the PDE credential signals to employers that you can translate business and technical requirements into robust data architectures while adhering to best practices for security, compliance, and governance.
Exam Structure and Knowledge Domains
The exam is a two-hour, multiple-choice test available in a proctored, in-person or online format. Questions target real-world scenarios across four broad domains: (1) designing data processing systems; (2) building and operationalizing data processing systems; (3) operationalizing machine-learning models; and (4) ensuring solution quality. You might be asked to choose optimal storage solutions (BigQuery, Cloud Spanner, Bigtable), architect streaming pipelines with Pub/Sub and Dataflow, or troubleshoot performance bottlenecks. Because the PDE focuses heavily on applied problem-solving rather than rote memorization, hands-on experience, whether via professional projects or Google's Qwiklabs/Cloud Skills Boost labs, is critical for success.
About GCP PDE Practice Exams
Taking reputable practice exams is one of the most efficient ways to gauge readiness and close knowledge gaps. High-quality mocks mirror the actual test's wording, timing, and scenario-based style, helping you get comfortable with the pace and depth of questioning. After each attempt, review explanations, not just for the items you missed but also for the ones you answered correctly, to reinforce concepts and uncover lucky guesses. Tracking performance over multiple sittings shows whether your improvement is consistent or if certain domains lag behind. When used alongside hands-on labs, whitepapers, and documentation, practice tests become a feedback loop that sharpens both your intuition and time-management skills.
Preparation Tips
Begin your preparation with the official exam guide to map each task statement to concrete learning resources (Coursera courses, Google documentation, blog posts). Build small proof-of-concept projects, such as streaming IoT data to BigQuery or automating model retraining with AI Platform, to anchor theory in practice. In the final weeks, shift from broad study to focused review: revisit weak areas highlighted by practice exams, skim product release notes for recent feature updates, and fine-tune your exam-day strategy (flag uncertain questions, manage breaks, monitor the clock). By combining targeted study, practical experimentation, and iterative assessment, you can approach the GCP Professional Data Engineer exam with confidence and a clear roadmap to certification.

Free GCP Professional Data Engineer Practice Test
- 20 Questions
- Unlimited time
- Designing data processing systems
- Ingesting and processing the data
- Storing the data
- Preparing and using data for analysis
- Maintaining and automating data workloads
Free Preview
This test is a free preview; no account is required.
Subscribe to unlock all content, keep track of your scores, and access AI features!
Your e-commerce site streams user clicks from Pub/Sub into a Dataflow pipeline built with Apache Beam. A session is any sequence of events for one user with no more than 30 minutes of inactivity. Product managers expect provisional per-user session counts every 5 minutes of processing time, yet final results must still include events that arrive up to 15 minutes after the session gap closes. Which windowing and triggering setup best satisfies these needs while avoiding unnecessary windows?
Use session windows with a 30-minute gap; add an early AfterProcessingTime trigger that fires 5 minutes after the first element, keep the default on-time watermark firing, set allowed lateness to 15 minutes, and accumulate fired panes.
Use 30-minute fixed windows; add an AfterProcessingTime trigger that fires 5 minutes after the first element and set allowed lateness to 15 minutes.
Use sliding windows of 30 minutes that advance every 5 minutes with the default watermark trigger and no allowed lateness.
Use the global window; set an AfterProcessingTime trigger to fire 30 minutes after each element and discard fired panes with no allowed lateness.
Answer Description
Session windows naturally group events that are separated by less than a specified gap, matching the 30-minute inactivity rule for user sessions. Adding an early processing-time trigger such as AfterProcessingTime.pastFirstElementInPane().plusDelayOf(5 minutes) produces interim aggregates every five minutes. Leaving the default on-time firing (watermark past end of window) ensures a pane is emitted when the 30-minute gap elapses, and setting allowedLateness to 15 minutes lets late events merge into the existing session. Using accumulating mode updates prior results without opening new windows. Fixed, sliding, or global windows fail to capture true user sessions or to meet late-data requirements without extra complexity.
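For readers who want to see this configuration in code, below is a minimal sketch in the Beam Python SDK (the Java-style trigger named above has a direct Python equivalent). The `events` PCollection, its element shape, and the transform labels are assumptions for illustration only.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger


def sessionize(events):
    """events: PCollection of (user_id, 1) pairs with event timestamps attached."""
    return (
        events
        | "SessionWindows" >> beam.WindowInto(
            window.Sessions(gap_size=30 * 60),               # 30-minute inactivity gap
            trigger=trigger.AfterWatermark(                    # default on-time firing at window close
                early=trigger.AfterProcessingTime(5 * 60)),    # provisional pane every 5 minutes
            allowed_lateness=15 * 60,                          # accept events up to 15 minutes late
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)  # refine prior panes, don't discard
        | "CountPerUserSession" >> beam.combiners.Count.PerKey())
```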
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is a session window in Apache Beam?
What is the purpose of an AfterProcessingTime trigger in Apache Beam?
What does allowed lateness mean in windowing in Apache Beam?
What is the difference between session windows and fixed windows in Apache Beam?
What is an early processing-time trigger in Apache Beam and why is it useful?
What does allowed lateness mean in the context of Apache Beam windows?
Your organization runs a Dataflow streaming job that continuously writes events into an existing BigQuery dataset containing sensitive customer information. Security policy mandates least-privilege access for the Dataflow worker service account: it must be able to create new tables in that dataset and append or overwrite rows, but it must not change table schemas or manage dataset-level access controls. You need to grant a single predefined IAM role on the dataset to satisfy this requirement. Which role should you assign?
Grant roles/bigquery.dataEditor on the dataset
Grant roles/bigquery.dataOwner on the dataset
Grant roles/bigquery.jobUser on the project
Grant roles/bigquery.admin on the project
Answer Description
The BigQuery Data Editor role (roles/bigquery.dataEditor) can be granted at the dataset level and includes permissions such as bigquery.tables.create and bigquery.tables.updateData, which allow a principal to create tables and write or overwrite rows. It does not include dataset-management permissions such as bigquery.datasets.update or bigquery.datasets.setIamPolicy, so it cannot change dataset-level access controls, and among the options it is the narrowest predefined role that still permits table creation and data writes. BigQuery Data Owner and BigQuery Admin add dataset- and project-wide management and access-control permissions, violating least privilege. BigQuery Job User only allows running jobs and grants no direct dataset write permissions. Therefore granting roles/bigquery.dataEditor on the dataset is the correct choice.
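As a hypothetical illustration of granting this role at dataset scope, the sketch below uses the google-cloud-bigquery Python client to append an access entry for the Dataflow worker service account; the project, dataset, and account names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-project.events_dataset")      # hypothetical dataset

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="roles/bigquery.dataEditor",                       # create tables, append/overwrite rows
        entity_type="userByEmail",                              # service accounts use userByEmail
        entity_id="dataflow-worker@my-project.iam.gserviceaccount.com"))
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])              # patch only the ACL field
```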
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is the difference between roles/bigquery.dataEditor and roles/bigquery.dataOwner?
Why can't roles/bigquery.admin be assigned instead of roles/bigquery.dataEditor?
What does least-privilege access mean in IAM roles?
What is the scope of the BigQuery Data Editor role?
What’s the difference between BigQuery Data Editor and BigQuery Admin roles?
What permissions are excluded from the BigQuery Data Editor role?
Your organization receives 4 TB of JSON telemetry each day from hundreds of thousands of IoT devices. The events must be filtered for malformed records, deduplicated by device-timestamp, enriched with a 100 MB reference table that updates weekly, and streamed into partitioned BigQuery tables with sub-minute latency. Data engineers also need to rerun the identical logic over months of archived data for occasional backfills. Operations teams require automatic horizontal scaling and no cluster management. Which Google Cloud solution best satisfies all requirements?
Run Spark Streaming on a long-running Cloud Dataproc cluster, coupled with scheduled Dataprep jobs for cleaning and a separate batch Spark job for backfills.
Load raw events directly into BigQuery with streaming inserts and use Dataform SQL models to cleanse, deduplicate, and enrich data in place.
Implement an Apache Beam pipeline on Cloud Dataflow that streams events, uses a weekly-refreshed side input for enrichment, and can be re-run in batch for backfills.
Create a Cloud Data Fusion pipeline with Wrangler transforms, triggered by Cloud Composer, and manually scale the underlying Dataproc cluster for spikes.
Answer Description
Cloud Dataflow executes Apache Beam pipelines that run in either true streaming or batch mode from the same code base. It supports user-defined transforms for complex cleansing and deduplication, side inputs for periodically refreshed lookup tables, exactly-once BigQuery sinks, and automatic horizontal scaling without any cluster to manage. Dataproc would require cluster provisioning and separate batch/stream jobs; Cloud Data Fusion and Dataform focus on GUI or SQL-based ETL and are not ideal for high-volume, low-latency streaming with code reuse across batch and streaming workloads.
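The side-input enrichment pattern mentioned above can be sketched as follows with the Beam Python SDK. For brevity the sketch uses in-memory literals in batch mode; a production streaming job would read from Pub/Sub and refresh the reference table periodically. All names and element shapes are illustrative.

```python
import apache_beam as beam


def enrich(event, device_lookup):
    """Attach reference attributes to a telemetry event (simplified)."""
    enriched = dict(event)
    enriched["device_info"] = device_lookup.get(event["device_id"], {})
    return enriched


with beam.Pipeline() as p:
    reference = p | "RefTable" >> beam.Create([("dev-1", {"model": "A1"})])
    events = p | "Events" >> beam.Create([{"device_id": "dev-1", "temp_c": 21.5}])

    (events
     | "Enrich" >> beam.Map(enrich, device_lookup=beam.pvalue.AsDict(reference))
     | "Print" >> beam.Map(print))
```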
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Apache Beam and why is it suitable for this use case?
What are side inputs and how do they work in Cloud Dataflow?
How does Cloud Dataflow ensure automatic horizontal scaling?
Why is Apache Beam suitable for both streaming and batch processing in Cloud Dataflow?
What is a side input in Apache Beam, and how does it enable enrichment?
Your company runs a group of Compute Engine instances that execute nightly analytics jobs containing protected health information (PHI). The jobs must read reference files from an encrypted Cloud Storage bucket and write results to a BigQuery dataset, both located in the production project. Compliance forbids embedding any long-lived user credentials in the VM images, and the security team requires least-privilege access with minimal operational effort for credential rotation. Which design best satisfies these constraints?
Generate individual service-account keys for each engineer, embed the JSON key files in the VM startup script, and grant BigQuery Admin and Storage Admin roles at the project level. Rotate the keys quarterly.
Grant the default Compute Engine service account the Project Editor role and let the application use the default credentials automatically provided by the metadata server.
Store a Cloud Storage HMAC key in Secret Manager; have the application fetch the key at startup to sign requests to the bucket and to authenticate to BigQuery with signed URLs.
Create a dedicated service account (for example, sa-analytics-vm). Grant it Storage Object Viewer on the specific bucket and BigQuery Data Editor on the target dataset, attach it as the runtime service account for the instances, and do not generate any user-managed keys.
Answer Description
Attaching a dedicated service account to the Compute Engine instances and granting it only the permissions required for the two target resources meets the principle of least privilege. Because the service account is bound to the VM, the workload can obtain short-lived OAuth 2.0 tokens from the instance metadata server at run time; no user credentials or downloadable keys need to be stored on disk, so key rotation happens automatically. Granting broad project-level roles or distributing user-managed keys would violate least-privilege goals and introduce operational overhead. Using HMAC or signed URLs still requires managing long-lived secrets and does not cover BigQuery access.
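To illustrate why no key files are needed, the snippet below shows a workload relying on Application Default Credentials, which on Compute Engine resolve to the attached service account via the metadata server. The bucket, object, and query are hypothetical.

```python
import google.auth
from google.cloud import bigquery, storage

# On Compute Engine, Application Default Credentials resolve to the attached
# service account; short-lived tokens are fetched from the metadata server,
# so no key file ever touches the VM image or disk.
credentials, project_id = google.auth.default()

storage_client = storage.Client(credentials=credentials, project=project_id)
bq_client = bigquery.Client(credentials=credentials, project=project_id)

# Hypothetical resource names: the account only needs Storage Object Viewer on
# this bucket and BigQuery Data Editor on the results dataset.
reference_blob = storage_client.bucket("analytics-reference-files").blob("codes.csv")
reference_bytes = reference_blob.download_as_bytes()

bq_client.query("SELECT CURRENT_DATE() AS run_date").result()
```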
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is a service account in Google Cloud?
How does the metadata server provide OAuth 2.0 tokens to VM instances?
What is the principle of least privilege, and why is it important?
What is a service account in GCP?
How does the Metadata server work in Compute Engine?
What is the principle of least privilege in access control?
You are the lead data engineer at a media-streaming company. A real-time pipeline ingests Pub/Sub events, processes them through multiple Cloud Dataflow streaming jobs, and writes to BigQuery. Support engineers need a single dashboard that shows each stage's built-in system metrics together with Apache Beam user-defined counters, and they must receive alerts whenever end-to-end latency exceeds two minutes or worker-pool CPU utilization stays above 80 percent for five minutes. The business wants to minimize operational overhead and avoid operating any additional monitoring stack. Which monitoring strategy best meets these requirements?
Leverage the default export of Dataflow job metrics to Cloud Monitoring, emit Beam counters as custom metrics, and build Cloud Monitoring dashboards and alerting policies for latency and CPU usage.
Deploy a managed Prometheus server on GKE to scrape Dataflow worker logs, store the metrics, and visualize and alert on them using a self-hosted Grafana dashboard.
Enable Cloud Trace for each Dataflow job and configure trace-based alerts; correlate spans from Pub/Sub, Dataflow, and BigQuery in the Cloud Trace console.
Create log sinks for Pub/Sub, Dataflow, and BigQuery into Cloud Logging, derive log-based metrics for latency and CPU, and visualize them with Logging charts and alerts.
Answer Description
Cloud Dataflow automatically exports a rich set of built-in metrics (for example, elements-per-second, data freshness, CPU and memory usage) to Cloud Monitoring. Apache Beam user-defined counters can also be published as custom metrics that appear in the same Monitoring workspace. Using these metrics, engineers can build dashboards that combine Dataflow, Pub/Sub, and BigQuery data and can create alerting policies for latency and sustained CPU utilization. This provides a single pane of glass and proactive notifications without deploying or managing extra tooling.
The other options fall short:
- Enabling Cloud Trace surfaces trace spans but not CPU, memory, or custom Beam counters.
- Log-based metrics would require per-service extraction rules and still miss real-time Dataflow internals.
- Running Prometheus and Grafana introduces additional infrastructure and contradicts the low-overhead requirement.
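As a small example of the user-defined counters mentioned above, the following Beam Python DoFn increments a custom metric that Dataflow forwards to Cloud Monitoring alongside the built-in job metrics; the validity check and field name are purely illustrative.

```python
import apache_beam as beam
from apache_beam.metrics import Metrics


class DropMalformed(beam.DoFn):
    """Drops events without a user_id and counts them with a user-defined
    Beam counter that surfaces as a custom metric in Cloud Monitoring."""

    def __init__(self):
        super().__init__()
        self.malformed = Metrics.counter(self.__class__, "malformed_events")

    def process(self, event):
        if "user_id" not in event:       # illustrative validity check
            self.malformed.inc()
            return
        yield event
```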
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What are Apache Beam user-defined counters?
How does Cloud Monitoring integrate with Cloud Dataflow?
What are the advantages of using Cloud Monitoring over Prometheus and Grafana in this scenario?
What built-in metrics does Cloud Dataflow export to Cloud Monitoring?
Why is Cloud Monitoring preferred for dashboards and alerts in this scenario?
A healthcare provider is building a Dataflow pipeline that loads sensitive genomic records into a BigQuery dataset located in the europe-west2 region. Regulations require that: 1) all data be encrypted with keys that remain exclusively in the hospital's on-premises HSM, 2) every decryption operation be auditable in Cloud Logging, and 3) no application code changes be needed beyond configuration. Which key-management approach should you recommend?
Create Customer-Managed Encryption Keys (CMEK) in Cloud KMS and rotate them weekly.
Enable default Google-managed encryption for BigQuery and Dataflow artifacts and rely on Cloud Audit Logs.
Configure Customer-Supplied Encryption Keys (CSEK) and pass the key with every Dataflow and BigQuery request.
Use Cloud External Key Manager (EKM) with an externally managed key and enable CMEK on the BigQuery dataset and Dataflow temporary buckets.
Answer Description
Cloud External Key Manager (EKM) lets Google Cloud services such as BigQuery and Cloud Storage use an encryption key that is stored and managed in an external HSM. When you enable CMEK protection with an EKM key, the key material never resides in Google Cloud, yet BigQuery and Dataflow can transparently encrypt and decrypt data without code changes. Every call to the external key is captured in Cloud Audit Logs, satisfying the auditing requirement. Customer-supplied encryption keys cannot be used with BigQuery and require applications to pass the key on each request, while CMEK keys kept entirely inside Cloud KMS do not meet the mandate that the key stay on-premises. Default Google-managed keys provide no customer control.
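Configuration-wise, pointing BigQuery at an EKM-backed key looks the same as any CMEK setup: you reference a Cloud KMS key resource whose material actually lives in the external HSM. A minimal sketch with the Python client and hypothetical names:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical CMEK resource name; with Cloud EKM the key material referenced
# here remains inside the hospital's external HSM.
ekm_key = (
    "projects/my-project/locations/europe-west2/"
    "keyRings/ekm-ring/cryptoKeys/bq-external-key")

dataset = bigquery.Dataset("my-project.genomics_eu")
dataset.location = "europe-west2"
dataset.default_encryption_configuration = bigquery.EncryptionConfiguration(
    kms_key_name=ekm_key)                 # every new table inherits this CMEK
client.create_dataset(dataset)
```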
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Cloud External Key Manager (EKM)?
How do Customer-Managed Encryption Keys (CMEK) work in Google Cloud?
What is the difference between CMEK and CSEK?
A global retailer ingests daily sales transactions into BigQuery. Customer email addresses and phone numbers must not be visible to analysts at partner companies, yet those partners need to join multiple days of data on the same customers to calculate repeat-purchase metrics. To satisfy GDPR requirements, the retailer wants a managed solution that irreversibly replaces the sensitive fields while preserving deterministic joinability across data sets. Which approach best meets these needs?
Rely on BigQuery column-level access controls to hide the email and phone columns from partner accounts.
Enable Customer-Managed Encryption Keys (CMEK) on the BigQuery dataset so the PII remains encrypted when shared with partners.
Run Cloud DLP to perform deterministic cryptographic tokenization of the email and phone fields using a customer-managed key before loading each day's files.
Store the PII columns in a separate Cloud SQL instance and share only integer foreign keys with partners.
Answer Description
Cloud Data Loss Prevention (DLP) can de-identify sensitive data by applying deterministic cryptographic transformations. When you configure a CryptoDeterministicConfig with a customer-managed (or external) key, DLP replaces each instance of the identified PII with a stable surrogate token: the same input value always maps to the same output, enabling joins across data sets. Because only the keyed cryptographic function can reverse the process, partners cannot recover the original PII, meeting GDPR data-minimization requirements.
Merely encrypting the BigQuery dataset with CMEK protects data at rest but still exposes clear-text PII to anyone who can query the table. Column-level access controls would block partners from seeing the identifiers entirely, preventing the required joins. Off-loading PII to Cloud SQL and sharing foreign keys would still leave linkage information exposed and complicate operations without providing managed tokenization. Therefore, deterministic tokenization with Cloud DLP and a customer-managed key is the correct solution.
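A minimal sketch of the deterministic tokenization call is shown below using the Cloud DLP Python client; the project, key resource, surrogate name, and sample value are placeholders, and in practice the wrapped key would come from a prior Cloud KMS encrypt operation.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()

info_types = [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}]
deidentify_config = {
    "info_type_transformations": {
        "transformations": [{
            "info_types": info_types,
            "primitive_transformation": {
                "crypto_deterministic_config": {
                    "crypto_key": {"kms_wrapped": {
                        "wrapped_key": b"...",   # produced by a prior Cloud KMS encrypt call
                        "crypto_key_name": (
                            "projects/my-project/locations/europe/"
                            "keyRings/dlp/cryptoKeys/tokenize")}},
                    "surrogate_info_type": {"name": "CUSTOMER_TOKEN"},
                }}}]}}

response = dlp.deidentify_content(request={
    "parent": "projects/my-project/locations/global",
    "inspect_config": {"info_types": info_types},
    "deidentify_config": deidentify_config,
    "item": {"value": "alice@example.com, +1 650-555-0100"},
})
print(response.item.value)   # identical inputs always produce identical tokens
```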
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Cloud DLP?
How does deterministic cryptographic tokenization work?
What is the role of Customer-Managed Keys (CMKs) in Cloud DLP?
What is deterministic cryptographic tokenization?
How does Cloud DLP ensure GDPR compliance?
What is the role of customer-managed keys in Cloud DLP?
Your company's data platform team must provide a self-service environment where data engineers across multiple projects can discover, profile, and govern files stored in Cloud Storage and tables in BigQuery. The solution should automatically scan new and existing assets to harvest technical metadata, generate data profiles that include statistics such as null counts and cardinality, and surface the assets through a unified catalog that supports fine-grained access controls. The team wants to minimize custom code and avoid deploying third-party software. Which design best satisfies these requirements?
Register every bucket and dataset in standalone Data Catalog entry groups and trigger Cloud Functions that launch Dataflow jobs to calculate statistics and update metadata tables.
Create Dataplex lakes and governed zones that reference the Cloud Storage buckets and BigQuery datasets, enable automated discovery, data profiling, and quality scans in each zone, and use the Dataplex catalog for cross-project search and access control.
Centralize all data by copying it into a single BigQuery dataset with BigQuery Omni, then rely on INFORMATION_SCHEMA views and custom Cloud Composer DAGs to generate profiling reports.
Use Cloud Asset Inventory to index storage objects and datasets, and schedule BigQuery Data Transfer Service jobs to load audit logs that analysts can query for metadata and quality metrics.
Answer Description
Dataplex natively unifies governance for Cloud Storage and BigQuery by grouping assets into lakes and zones. When you attach a bucket or dataset to a zone, Dataplex automatically runs discovery jobs that register the assets in the Dataplex (Data Catalog) catalog, creates data profiles with statistics such as null counts, and can run built-in data quality scans. Access to the cataloged assets is controlled through IAM roles on Dataplex resources, giving fine-grained governance without custom pipelines. The other options either rely on separate services that do not provide automated profiling (BigQuery INFORMATION_SCHEMA), require custom functions and pipelines to keep metadata up to date, or use services (Cloud Asset Inventory, BigQuery Data Transfer Service) that are not intended for end-user data discovery and quality management.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Dataplex and how does it support data governance in this solution?
What are data profiles and how does Dataplex generate them?
How does Dataplex provide fine-grained access control, and why is it beneficial?
What is Dataplex in GCP?
How does Dataplex perform automated discovery and profiling?
Why is Dataplex better than standalone Data Catalog or other services?
Your company ingests IoT telemetry at 30,000 messages per second via Cloud Pub/Sub. A streaming Dataflow job in us-central1 transforms the data and writes to a BigQuery dataset also in us-central1. The business requires that if the entire us-central1 region becomes unavailable, no more than 60 seconds of data may be lost (RPO ≤ 1 minute) and processing must resume in another region within 15 minutes (RTO ≤ 15 minutes) without manual code changes. Which design meets these objectives with the least operational overhead?
Enable Pub/Sub topic replication to us-east1 and use a Cloud Composer DAG that launches the Dataflow template in us-east1 when a regional health check fails; keep the dataset in a us-east1 regional BigQuery location.
Configure the job to autoscale across all zones in us-central1 and snapshot state to a dual-region Cloud Storage bucket every minute; redeploy the template manually in another region during an outage.
Create a second pull subscription to the Pub/Sub topic and deploy an identical streaming Dataflow Flex Template in us-east1 writing to a multi-region BigQuery dataset; run both pipelines continuously with idempotent writes.
Modify the existing Dataflow job to enable drain-and-restore, set a 60-second checkpoint interval, and rely on BigQuery regional redundancy for protection.
Answer Description
Running two identical streaming jobs in separate regions provides active-active redundancy: if one region fails, the other continues to process new messages with no intervention, keeping RTO effectively zero and RPO limited only by Pub/Sub delivery guarantees. Using separate subscriptions prevents message acknowledgment coupling, and a multi-region BigQuery dataset remains reachable from either job. Idempotent or exactly-once semantics in the pipeline mitigate duplicate writes that can occur when both jobs are running. Relying on a single-region job, manual restarts from snapshots, or workflow-driven failover all introduce higher operational burden or longer recovery times that threaten the 15-minute RTO and 1-minute RPO targets.
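A rough sketch of one of the two regional pipelines is shown below (Beam Python); the second deployment would be identical except for its region and subscription. The resource names, schema, and parse step are assumptions.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse(message_bytes):
    """Decode a Pub/Sub message into a BigQuery row dict (assumed schema)."""
    return json.loads(message_bytes.decode("utf-8"))


options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",
    region="us-east1",                        # the second, independent copy of the job
    temp_location="gs://my-temp-bucket/tmp")

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
         subscription="projects/my-project/subscriptions/telemetry-us-east1")
     | "Parse" >> beam.Map(parse)
     | "Write" >> beam.io.WriteToBigQuery(
         "my-project:telemetry.events",       # multi-region dataset reachable from either job
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```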
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is RPO and RTO in cloud architecture?
What are idempotent writes in data pipelines?
How does Pub/Sub support high availability with multiple subscriptions?
What is active-active redundancy in streaming data pipelines?
How does Pub/Sub ensure message delivery guarantees for RPO requirements?
Your company, a global retailer subject to GDPR, stores transactional data in a BigQuery table called customer_orders that has the columns order_id, item_id, customer_email, credit_card_hash, and amount. Marketing analysts must be able to run ad-hoc SQL on every column except customer_email and credit_card_hash, while the Risk team needs unrestricted access. The solution must scale so that any new columns later classified as PII are automatically protected without rewriting queries or creating additional tables. How should you implement this in BigQuery?
Create a Data Catalog taxonomy with a PII policy tag, attach the tag to customer_email and credit_card_hash, grant the Risk group permissions to read that policy tag and the dataset, and give Marketing only dataset-level BigQuery read access without tag permission.
Move customer_email and credit_card_hash into a separate BigQuery table, restrict access to that table to the Risk team, and let Marketing query the remaining columns in the original table.
Encrypt only the customer_email and credit_card_hash columns with customer-managed encryption keys (CMEK) and provide the decryption key to the Risk team but not to Marketing analysts.
Build an authorized view that omits the customer_email and credit_card_hash columns, share the view with Marketing analysts, and share the underlying table directly with the Risk team.
Answer Description
BigQuery enforces column-level security through policy tags that live in Data Catalog taxonomies. By tagging each sensitive column with a PII policy tag and granting access to that tag only to the Risk group (via the Data Catalog Fine-Grained Reader or BigQuery Data Policy User role), you ensure they can query the protected columns while Marketing, which has dataset-level read rights but no access to the tag, cannot. When new PII columns are added, attaching the same policy tag immediately protects them, so no views or table restructuring is required. Row-level security filters rows, not columns; splitting tables or using authorized views would work but would require ongoing schema maintenance; per-column encryption with customer-managed keys is not natively enforced by BigQuery for column masking and would block all users without the key, not only Marketing.
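Attaching the policy tag to the sensitive columns can be done through a schema update, for example with the Python client as sketched below; the taxonomy, policy-tag, and table names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.sales.customer_orders")

# Hypothetical policy tag from a Data Catalog taxonomy marked as PII.
pii_tags = bigquery.PolicyTagList([
    "projects/my-project/locations/eu/taxonomies/1234567890/policyTags/111"])

updated_schema = []
for field in table.schema:
    if field.name in ("customer_email", "credit_card_hash"):
        field = bigquery.SchemaField(
            field.name, field.field_type, mode=field.mode, policy_tags=pii_tags)
    updated_schema.append(field)

table.schema = updated_schema
client.update_table(table, ["schema"])   # only principals with access to the tag can read these columns
```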
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is column-level security in BigQuery?
What are Data Catalog taxonomies and policy tags?
Why does GDPR compliance matter for managing sensitive data in BigQuery?
What is a Data Catalog taxonomy in BigQuery?
How do policy tags enable column-level security in BigQuery?
What makes column-level security with policy tags scalable for GDPR compliance?
Your company ACME Payments is building a streaming analytics pipeline on Google Cloud to process credit-card transactions from EU customers. Regulations require that (1) all personal data is stored and processed exclusively in EU regions, (2) primary account numbers (PANs) are pseudonymized but remain reversible for future investigations, (3) data analysts must not have access to decryption keys, and (4) the Dataflow pipeline must follow least-privilege principles. Which approach best meets these requirements?
Use Cloud External Key Manager with keys in a US HSM for format-preserving encryption, store the pseudonymized data in a BigQuery dataset in europe-west2, and allow analysts to decrypt by granting them roles/cloudkms.cryptoKeyEncrypterDecrypter.
Enforce the constraints/gcp.resourceLocations policy to permit only EU regions; run Dataflow in europe-west1 using Cloud DLP deterministic encryption protected by an EU-resident CMEK key in Cloud KMS; write results to a BigQuery dataset in europe-west1; grant analysts roles/bigquery.dataViewer only; grant the Dataflow service account roles/bigquery.dataEditor on the dataset and roles/cloudkms.cryptoKeyEncrypterDecrypter on the key.
Enable Assured Workloads for EU but allow resources in any region; in Dataflow apply irreversible DLP redaction before loading to a multi-regional BigQuery dataset; grant analysts roles/bigquery.dataOwner and roles/cloudkms.cryptoKeyDecrypter for investigation needs.
Deploy Dataflow in us-central1, hash PANs with SHA-256 during processing, store the output in a US multi-region BigQuery dataset, and grant analysts only the roles/bigquery.metadataViewer role.
Answer Description
The correct approach enforces the constraints as follows:
- Apply the organization-policy constraint constraints/gcp.resourceLocations to allow only EU regions, ensuring Cloud Storage buckets, Dataflow jobs, BigQuery datasets, and Cloud KMS key rings are created inside the EU.
- Within the Dataflow job, call Cloud DLP to perform deterministic encryption on the PAN field, using a customer-managed key (CMEK) stored in a Cloud KMS key ring located in an EU region. Deterministic encryption provides reversible pseudonymization while preserving referential integrity.
- Persist the transformed data to a BigQuery dataset in an EU region such as europe-west1, satisfying data-residency rules.
- Grant data analysts only the roles/bigquery.dataViewer role on the dataset so they can query pseudonymized data but lack any Cloud KMS permissions required to decrypt it.
- Grant the Dataflow worker service account the minimum required privileges: roles/bigquery.dataEditor on the target dataset and roles/cloudkms.cryptoKeyEncrypterDecrypter on the specific CMEK key, preventing broader access.
Alternative solutions fail to satisfy one or more requirements: allowing non-EU regions violates residency rules; using irreversible redaction or hashing breaks the reversibility requirement; storing keys outside the EU or granting analysts decrypter privileges violates compliance and least-privilege principles.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Cloud DLP deterministic encryption?
What is the constraints/gcp.resourceLocations policy?
How does least-privilege access work in Google Cloud?
What is deterministic encryption, and why is it used?
What is CMEK in Google Cloud, and why is it significant in this solution?
A healthcare provider stores sensitive patient telemetry in BigQuery. A new regulation requires that the encryption keys protecting this data must remain in an on-premises, FIPS 140-2 Level 3 certified HSM that is managed exclusively by the provider's security team. Analysts must continue to run existing SQL workloads without code changes, and key rotation must occur automatically through the key-management system rather than by updating application logic. Which Google Cloud encryption approach best meets these requirements?
Configure BigQuery to use a Customer-Managed Encryption Key that is hosted in an on-premises HSM through Cloud External Key Manager.
Enable the default Google-managed encryption that automatically secures data at rest.
Protect the dataset with Customer-Supplied Encryption Keys (CSEK) provided in every BigQuery API call.
Configure BigQuery with Customer-Managed Encryption Keys stored in Cloud KMS and backed by Cloud HSM.
Answer Description
Because the regulation stipulates that encryption keys must stay in an on-premises FIPS 140-2 Level 3 HSM under the customer's sole control, the only Google Cloud option that satisfies this is Cloud External Key Manager (EKM). With EKM, BigQuery can be configured to use a customer-managed encryption key that never leaves the customer-owned HSM; Google Cloud retrieves key material on-demand over a secure channel, so the data remains encrypted at rest with an externally hosted key. Key rotation is handled in the external HSM and is transparent to BigQuery clients, so no SQL jobs or application code need to change.
Using CMEK backed by Cloud HSM would store the key inside Google Cloud, violating the requirement that keys remain on-premises. Default Google-managed keys do not satisfy customer-control or residency requirements. Customer-supplied encryption keys (CSEK) are not supported for BigQuery and require applications to supply a key with every request, which would break existing workloads and fail to automate rotation. Therefore, configuring BigQuery with an external key through Cloud EKM is the correct solution.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Cloud External Key Manager (EKM)?
What is FIPS 140-2 certification and why is it important for HSMs?
How does key rotation work with Cloud External Key Manager (EKM)?
What is FIPS 140-2 Level 3 certification?
How does BigQuery interact with the on-premises HSM when using Cloud EKM?
A payment-processing company ingests transaction records from multiple branches into a BigQuery table. Each record contains the cardholder's full name and the 16-digit primary account number (PAN).
Compliance requires the following before data can be queried by data scientists in the analytics project:
- Names must be pseudonymized in a way that lets datasets from different branches still be joined on the same customer.
- PANs must be rendered non-reversible, but analysts need the last four digits for charge-back investigations.
You are designing a Dataflow pipeline that calls Cloud Data Loss Prevention (DLP) for in-stream de-identification. Which approach best meets both requirements while minimizing the risk of re-identification?
Apply a CryptoReplaceFfxFpeConfig transform to the name field and to the PAN field using the same Cloud KMS key so that both values remain reversible for auditors.
Store the raw table in a restricted project and grant analysts a BigQuery view that excludes the name and PAN columns; do not perform any in-pipeline transformation.
Encrypt the entire table with Cloud KMS at rest and allow analysts to decrypt on read; rely on Data Catalog column tags to warn users about personal data.
Apply a CryptoDeterministicConfig transform to the name field using a shared Cloud KMS key, and apply a CharacterMaskConfig that masks the first 12 digits of the PAN, leaving the last 4 digits visible.
Answer Description
Deterministic cryptographic transformation with a centrally-managed Cloud KMS key converts identical names to the same surrogate value across all branches, so joins are still possible and the process can be reversed only by a team that controls the key.
Character masking that replaces the first 12 digits of the PAN with a fixed symbol irreversibly removes sensitive data yet keeps the last four digits available to analysts.
The other proposals either (1) use format-preserving encryption that is still reversible by anyone with a key, (2) encrypt data in bulk without selectively exposing the last four digits, or (3) rely only on access controls without actually de-identifying the data, and therefore do not satisfy the stated compliance goals.
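For reference, the two transformations could be expressed in a single DLP de-identification configuration roughly like the fragment below (Python dict form of the API request); the key resource, surrogate, and field names are placeholders.

```python
# Fragment of a DLP deidentify_config combining both transformations; the
# Dataflow DoFn would pass this (plus the table item) to deidentify_content.
deidentify_config = {
    "record_transformations": {
        "field_transformations": [
            {   # Deterministic tokenization keeps names joinable across branches.
                "fields": [{"name": "cardholder_name"}],
                "primitive_transformation": {
                    "crypto_deterministic_config": {
                        "crypto_key": {"kms_wrapped": {
                            "wrapped_key": b"...",   # wrapped by the shared Cloud KMS key
                            "crypto_key_name": (
                                "projects/my-project/locations/global/"
                                "keyRings/dlp/cryptoKeys/name-token")}},
                        "surrogate_info_type": {"name": "NAME_TOKEN"}}}},
            {   # Mask the first 12 PAN digits; the last 4 stay visible for charge-backs.
                "fields": [{"name": "pan"}],
                "primitive_transformation": {
                    "character_mask_config": {
                        "masking_character": "*",
                        "number_to_mask": 12}}},
        ]}}
```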
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Cloud Data Loss Prevention (DLP) and how does it help in de-identification?
How does CryptoDeterministicConfig enable pseudonymization across datasets?
Why use CharacterMaskConfig for PANs instead of encrypting the entire field?
What is CryptoDeterministicConfig in GCP Dataflow?
How does CharacterMaskConfig work in Cloud DLP?
Why use Cloud KMS for managing encryption keys in data pipelines?
Your retail company runs a 15 TB Oracle 12c database in its on-premises data center that records incoming online orders. You need to populate a BigQuery dataset in Google Cloud and keep it synchronized with the source in near real time so analysts always see up-to-date data. A 1 Gbps Cloud VPN already connects the data center to Google Cloud, and the team prefers a managed, serverless solution that automatically performs change-data capture with minimal ongoing operations work. Which Google Cloud migration service should you use?
Migrate the database with Database Migration Service into Cloud SQL and query it from BigQuery.
Use BigQuery Data Transfer Service to schedule daily incremental imports from Oracle.
Ship a Transfer Appliance with exported database files, then load them into BigQuery.
Use Datastream to stream CDC events from Oracle to BigQuery.
Answer Description
Datastream is Google Cloud's fully managed, serverless change-data-capture (CDC) and replication service. It supports Oracle sources and can continuously stream inserts, updates, and deletes to BigQuery with sub-second to minute-level latency, requiring little operational overhead.
The BigQuery Data Transfer Service only performs scheduled batch loads and cannot read directly from Oracle. Transfer Appliance is designed for one-time, offline bulk transfers of files to Cloud Storage, not continuous CDC, so it would not keep the dataset current. Database Migration Service focuses on moving databases into Cloud SQL (and currently offers limited support for Oracle) and would still require additional tooling to feed changes into BigQuery. Therefore, Datastream best meets the requirement for managed, serverless, near-real-time replication from an on-prem Oracle database to BigQuery.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Change Data Capture (CDC)?
How does Datastream work for Oracle databases?
Why is Datastream better for this task compared to BigQuery Data Transfer Service?
What is CDC (Change Data Capture) in Datastream?
How does Datastream ensure near-real-time data synchronization?
Why is Datastream a better fit than BigQuery Data Transfer Service for this use case?
After migrating a 40-TB Oracle data mart to BigQuery using Datastream (CDC -> Cloud Storage) and Dataflow loads, you must prove before cut-over that every source row matches its BigQuery copy. The solution has to 1) scale across hundreds of tables without per-table coding, 2) surface any row-level mismatches, and 3) expose results to Cloud Monitoring for alerts. Which approach best meets these requirements?
Develop individual Dataflow pipelines for each table that calculate row hashes in Oracle and BigQuery, then compare the results and publish a metric.
Enable a built-in Datastream data-validation feature to generate checksum comparisons automatically and send the results to Cloud Logging.
Run Google's open-source Data Validation Tool as a Dataflow flex template to compute per-table checksums between Oracle and BigQuery, log results to Cloud Logging, and create log-based metrics for Cloud Monitoring alerts.
Create final BigQuery snapshots and run manual EXCEPT queries against exported Oracle CSV files; record any differences in a spreadsheet.
Answer Description
Running Google's open-source Data Validation Tool (DVT) as a Dataflow flex template fulfills all constraints. The template compares row counts and column-level checksums between Oracle and BigQuery for every table based on a single YAML configuration, so no table-specific code is required. It writes detailed match and mismatch results to BigQuery tables and Cloud Logging, from which log-based metrics can feed Cloud Monitoring alert policies. Datastream lacks built-in validation, custom Dataflow jobs would require per-table logic, BigQuery snapshots with manual SQL diffing do not scale, and BigQuery Data Transfer Service offers no validation capability. Therefore, deploying the DVT Dataflow template is the most efficient, scalable, and monitorable option.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Google's open-source Data Validation Tool (DVT) and how does it work?
What are Dataflow flex templates and how do they enhance this solution?
How does Cloud Monitoring integrate with log-based metrics for alerts in this solution?
What is the Data Validation Tool (DVT)?
What is a Dataflow flex template?
How can log-based metrics enable Cloud Monitoring alerts?
Your health-insurance company ingests millions of call-center transcripts from Cloud Storage into BigQuery each day for trend analysis. Regulations forbid storing clear-text PII such as customer names and phone numbers, yet analysts must be able to deterministically group conversations that belong to the same customer for audits. You want a scalable, fully managed solution that requires minimal custom code and lets you add new PII detectors later. Which design should you implement?
Load the raw transcripts into BigQuery first and use SQL REGEXP_REPLACE functions in scheduled queries to overwrite PII columns with randomly generated strings.
Encrypt each transcript locally with a customer-supplied encryption key (CSEK) and load the encrypted files directly into BigQuery so analysts can decrypt data when needed.
Enable BigQuery column-level security on the PII columns and grant access only to authorized roles while keeping the original transcripts unchanged in BigQuery.
Invoke the Cloud DLP Files on Cloud Storage to BigQuery Dataflow template with a de-identification configuration that applies CryptoDeterministicConfig using a customer-managed Cloud KMS key, producing tokenized names and phone numbers before loading the data into BigQuery.
Answer Description
Cloud Data Loss Prevention (DLP) offers fully managed inspection and de-identification capabilities that scale automatically. By calling DLP from the "Cloud Storage Text to BigQuery with Cloud DLP" Dataflow template and configuring a CryptoDeterministicConfig that uses a customer-managed Cloud KMS key, all detected PII (for example, PERSON_NAME and PHONE_NUMBER infoTypes) is replaced with stable surrogate tokens that cannot be reversed without the key before the data is written to BigQuery. Because the same source value always maps to the same token when the same key is used, analysts can reliably join or aggregate records that refer to the same individual without exposing the original sensitive values. BigQuery's column-level security or views alone would leave the raw PII in storage, and client-side encryption or ad-hoc regex masking would break deterministic linking and add significant custom development and maintenance. Therefore, orchestrating a Dataflow DLP de-identification template with deterministic cryptographic tokenization best satisfies the privacy, scalability, and maintainability requirements.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Cloud DLP in GCP?
What is CryptoDeterministicConfig in Cloud DLP?
Why use Cloud KMS with Cloud DLP?
What is Cloud DLP and how does it help with data de-identification?
How does CryptoDeterministicConfig enable deterministic tokenization?
Why is a customer-managed Cloud KMS key important in this design?
A financial services company runs its analytics platform on Google Cloud. Security architects set these requirements: all BigQuery tables containing customer PII must reside only in EU regions; business analysts can run aggregate queries but must never see raw email or phone columns; a Dataflow pipeline service account should have only the permissions required to insert new partitions into the same tables. Which design best satisfies all requirements while following the principle of least privilege?
Create a raw dataset in europe-west1 and apply the gcp.resourceLocations organization policy to EU regions. Publish an authorized view that provides only aggregated results and share that view with the analyst group. Grant the analysts bigquery.dataViewer on the dataset that houses the view and bigquery.jobUser on the project. Grant the Dataflow service account bigquery.dataEditor on the raw dataset.
Load PII into a US multi-regional dataset after redacting email and phone fields with Cloud DLP; give analysts bigquery.jobUser on the project and bigquery.dataViewer on the dataset; grant the Dataflow service account bigquery.dataEditor.
Replicate the dataset to europe-west1 and give analysts access through BigQuery column-level security by assigning them the bigquery.policyTagAccessor role; omit any organization policy, and grant the Dataflow service account bigquery.dataOwner on the dataset.
Place the tables in the EU multi-regional location and label sensitive columns with Data Catalog policy tags; give analysts bigquery.dataViewer on the raw dataset and bigquery.tagUser on the tags, and give the Dataflow service account bigquery.dataOwner on the dataset.
Answer Description
Storing the raw tables in a single-region EU dataset (for example, europe-west1) keeps data physically in the EU. Enforcing the organization policy constraint gcp.resourceLocations ensures no one can accidentally create resources outside approved EU regions. An authorized view can expose only aggregated results, so analysts can run their queries without gaining direct access to the sensitive columns. Granting analysts bigquery.dataViewer on the dataset that contains the views plus bigquery.jobUser on the project lets them execute queries but not modify data. The Dataflow pipeline needs to load and overwrite table partitions, so bigquery.dataEditor on the raw dataset is sufficient; there is no need for broader Owner privileges. The other options either do not restrict residency to the EU, expose raw columns through insufficient controls, or assign overly broad IAM roles.
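A minimal sketch of creating and authorizing such an aggregate view with the Python client is shown below; the datasets, view, and query contents are illustrative only.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Aggregate-only view in a separate reporting dataset (all names illustrative).
view = bigquery.Table("my-project.reporting_eu.orders_by_country")
view.view_query = """
    SELECT country, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM `my-project.raw_eu.customer_orders`
    GROUP BY country
"""
view = client.create_table(view)

# Authorize the view against the raw dataset so analysts never query raw PII directly.
raw_dataset = client.get_dataset("my-project.raw_eu")
entries = list(raw_dataset.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
raw_dataset.access_entries = entries
client.update_dataset(raw_dataset, ["access_entries"])
```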
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is the purpose of the gcp.resourceLocations organization policy?
What is an authorized view in BigQuery, and how does it protect sensitive data?
Why is specific IAM role assignment important for the Dataflow service account?
What is gcp.resourceLocations organization policy?
How do authorized views in BigQuery work?
Why is bigquery.dataEditor sufficient for the Dataflow service account?
During a quarterly audit, you discover that all 20 data scientists in your analytics project were granted the primitive Editor role so they could create and modify BigQuery tables. The CISO asks you to immediately reduce the blast radius while ensuring the scientists can continue their normal workloads. Which action best satisfies the principle of least privilege?
Remove the Editor binding and grant each scientist the predefined role roles/bigquery.dataEditor only on the datasets they work with.
Downgrade each scientist to the Viewer primitive role and allow them to impersonate a service account that still has the Editor role when they need write access.
Retain the Editor role but enable Cloud Audit Logs and set up log-based alerts to detect any misuse of non-BigQuery services.
Replace the Editor role with a custom role that includes all resourcemanager.* permissions but excludes storage.* permissions to protect Cloud Storage data.
Answer Description
The Editor primitive role grants thousands of permissions across nearly every Google Cloud service, including the ability to create, modify, and delete resources such as Compute Engine instances and Cloud Storage buckets. To comply with least-privilege guidelines, you should remove this broad role and replace it with a predefined BigQuery-specific role that contains only the permissions required for the scientists' tasks. Granting roles/bigquery.dataEditor at the dataset level lets them create and update tables without exposing the project to unnecessary risk. The other options either continue to over-provision access, add unnecessary impersonation complexity, or rely solely on monitoring rather than removing excessive permissions.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is the principle of least privilege in cloud security?
What does the roles/bigquery.dataEditor role allow users to do?
Why is assigning primitive roles like Editor considered a security risk?
What is the principle of least privilege?
What does the roles/bigquery.dataEditor role include?
How can granting permissions at the dataset level reduce risk?
A global media company stores raw logs in several Cloud Storage buckets across multiple regions and ingests curated data into multiple BigQuery projects that are owned by different business units. Data scientists complain that they cannot easily discover which tables contain video-stream metrics or which buckets store ad-impression logs without asking individual teams. The chief data officer wants a single place where all Cloud Storage objects and BigQuery tables are automatically indexed, enriched with business metadata, and made searchable through a common API while still enforcing existing IAM policies. As the lead data engineer, which design should you implement to satisfy these requirements with minimal custom development and maximum portability across present and future Google Cloud projects?
Use Cloud Asset Inventory to list resources across projects, write Cloud Functions that parse the export into Pub/Sub, and build a Looker dashboard for interactive search.
Create a Dataplex lake, attach each Cloud Storage bucket and BigQuery dataset as governed assets, define zones and business tags, and rely on the Dataplex Universal Catalog (searchable through Data Catalog APIs) for discovery.
Enable BigQuery Data Catalog in every project, export catalog entries nightly to a central Cloud SQL instance, and build a custom front-end that merges the exports for search.
Install an open-source metadata repository such as DataHub on Google Kubernetes Engine, build custom crawlers for Cloud Storage and BigQuery, and expose search through a REST endpoint.
Answer Description
Dataplex can attach Cloud Storage buckets and BigQuery datasets from any project into logical data lakes and zones. When an asset is attached, Dataplex's built-in metadata service automatically crawls the underlying storage, extracts technical metadata (schemas, locations, partitions), and surfaces it, together with user-defined business tags, in the Dataplex Universal Catalog. The catalog entries reside in Data Catalog, so analysts can use Data Catalog search APIs to look for business terms such as "video-stream metrics" or "ad-impressions" without knowing the physical location of the data. IAM policies applied at the project, dataset, or bucket level are respected because Dataplex manages only references and does not copy data. Running a third-party catalog or exporting metadata to another system would add operational overhead and would not benefit from Dataplex's tight integration with Google Cloud services and IAM, reducing portability and increasing maintenance effort. Therefore, enabling Dataplex, organizing assets into lakes and zones, and relying on the Universal Catalog best meets the stated goals.
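Once assets are registered, analysts can search the catalog programmatically; the sketch below uses the Data Catalog Python client with a hypothetical organization ID and search term.

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

scope = datacatalog_v1.SearchCatalogRequest.Scope(
    include_org_ids=["123456789012"])          # hypothetical organization ID

# Free-text search over entries that Dataplex registered in the catalog;
# results are filtered by the caller's IAM permissions.
for result in client.search_catalog(scope=scope, query="video-stream metrics"):
    print(result.relative_resource_name)
```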
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is Dataplex and why is it useful in this scenario?
How does Dataplex Universal Catalog integrate with Data Catalog?
What are the advantages of using Dataplex over third-party metadata repositories?
What is Dataplex in Google Cloud?
What is the Dataplex Universal Catalog?
How does Dataplex enforce IAM policies during data discovery?
A media-analytics company stores all BigQuery datasets in the us-central1 regional location. A new internal SLA says that reporting queries must keep working with no data loss if one of the zones in that region goes offline. Management will accept a short outage if the entire region fails and does not want to pay for extra pipelines or a second copy of the data. What should you change to meet the SLA?
Set up a BigQuery Data Transfer Service job to copy each table from us-central1 to us-east1 every 15 minutes.
Nightly export the datasets to a dual-region Cloud Storage bucket and re-import them into BigQuery during an outage.
Keep the datasets in the current regional location and make no additional changes; BigQuery already provides automatic, zero-RPO replication across zones inside the region.
Move all datasets to the multi-regional US location so that data is replicated across multiple regions instead of zones.
Answer Description
BigQuery regional locations already replicate table data synchronously to at least two separate zones inside the same region and automatically reroute queries to healthy replicas if a zone is unavailable. This provides a Recovery Point Objective of zero and near-instant failover (very low RTO) for single-zone outages without any additional storage, pipelines, or operational overhead. Moving to a multi-region location or creating cross-region copies would exceed the stated requirements and add cost or complexity, while periodic exports or scheduled transfers would introduce an RPO greater than zero and require extra maintenance. Therefore, no architectural change is required.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What does zero-RPO replication mean in BigQuery?
How does BigQuery handle zone outages within a regional location?
What is the difference between a regional and multi-regional location in BigQuery?
How does BigQuery handle failover during a zone outage?
What is the difference between regional and multi-regional locations in BigQuery?
Looks like that's it! You can go back and review your answers or click the button below to grade your test.