Data Processing and Transformation (GCP PDE) Flashcards

| Front | Back |
| --- | --- |
| How do you define a streaming pipeline in Dataflow | A pipeline with unbounded data input that processes data in real time. |
| How does Apache Beam support fault tolerance | It supports fault tolerance through distributed execution and checkpointing mechanisms. |
| How does Dataflow handle late data | Through watermarks, allowed-lateness settings, and triggers, which control when window results are emitted and when late elements are dropped. |
| What are common use cases for Dataproc | ETL, Data analysis, Machine learning, and migrations from on-prem Hadoop/Spark clusters. |
| What are ParDo transforms used for in Apache Beam | ParDo is used for element-wise transformations such as filtering, mapping, or flat-mapping in a pipeline. |
| What are sinks in Dataflow pipelines | Destinations where data is written, like BigQuery or Cloud Storage. |
| What are the benefits of regional endpoints in Dataflow | They allow for better control over data locality and compliance with regional data residency requirements. |
| What are the main components of a Dataflow pipeline | Input (source), transformations (processing), and output (sink). |
| What does the term "autoscaling" mean in Dataflow | It refers to the service dynamically adjusting worker resources based on workload. |
| What format does Dataflow use for pipeline definitions | The Apache Beam programming model. |
| What is a BigQuery connector in Dataflow | A feature allowing pipelines to read/write data from/to BigQuery tables. |
| What is a combiner in Apache Beam | A transform that merges values efficiently, performing partial aggregation on workers before data is shuffled. |
| What is a DoFn in Apache Beam | The core processing unit within a ParDo transform, responsible for the per-element processing logic. |
| What is a PCollection in Apache Beam | An immutable, distributed dataset used in Beam pipelines. |
| What is a runner in Apache Beam | A component that executes the pipeline, such as Dataflow Runner or Direct Runner. |
| What is a shuffle operation in Dataproc | The data-exchange step in the Spark/Hadoop framework that redistributes data across workers, typically required for aggregations and joins. |
| What is a side input in Apache Beam | A static dataset provided as additional input to a ParDo transform. |
| What is a watermark in Dataflow | It is a mechanism for tracking the progress of event-time data in a streaming pipeline. |
| What is Apache Beam | A unified programming model for defining and executing data processing workflows across different execution engines. |
| What is ensemble learning in Dataproc workflows | Ensemble learning is a machine learning method that combines multiple models to achieve better performance, often processed on Dataproc clusters. |
| What is Google Dataflow | A fully managed, serverless data processing service for batch and stream processing. |
| What is Google Dataproc | An easy-to-use and fully managed cloud service for deploying and running Apache Spark and Hadoop clusters. |
| What is the default storage layer for Dataproc | Google Cloud Storage, accessed via the Cloud Storage connector; cluster-local HDFS is also available but is ephemeral. |
| What is the difference between batch and streaming in Dataflow | Batch processes finite datasets, while streaming processes infinite or unbounded datasets in near real-time. |
| What is the function of a Dataflow template | It allows you to predefine pipeline configurations and reuse them with different parameters for execution. |
| What is the purpose of a pipeline runner in Apache Beam | The pipeline runner specifies where and how the pipeline is executed (e.g., on Dataflow, DirectRunner, etc.). |
| What is the purpose of a transform in Apache Beam | A transform processes input data to produce output data. |
| What is the purpose of the GroupByKey transform in Apache Beam | GroupByKey is used to group data by key before applying further processing or aggregations. |
| What is the windowing concept in Apache Beam | A way to divide an unbounded data stream into finite, logical windows (fixed, sliding, or session-based) for processing. |
| What type of processing is supported by Dataflow | Dataflow supports both batch and stream processing. |
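Several of the Beam cards above (PCollection, ParDo/DoFn, GroupByKey, combiner) describe stages that compose into a single flow. As a rough, SDK-free sketch in plain Python (this deliberately mimics, but is not, the Beam API), an element-wise ParDo followed by GroupByKey and a per-key combine looks like:

```python
from collections import defaultdict

def par_do(pcollection, do_fn):
    """Element-wise processing: apply do_fn to each element and
    flatten the results, analogous to Beam's ParDo over a PCollection."""
    out = []
    for element in pcollection:
        out.extend(do_fn(element))
    return out

def group_by_key(pairs):
    """Collect (key, value) pairs into (key, [values]), like GroupByKey."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# A word-count style pipeline sketch: source -> ParDo -> GroupByKey -> combine.
lines = ["hello world", "hello beam"]  # bounded "source"
pairs = par_do(lines, lambda line: [(w, 1) for w in line.split()])
counts = {k: sum(vs) for k, vs in group_by_key(pairs).items()}  # combine step
print(counts)  # {'hello': 2, 'world': 1, 'beam': 1}
```

In the real SDK these stages are chained with the `|` operator onto a `Pipeline` object, and a runner (DirectRunner locally, Dataflow Runner in GCP) decides how the work is distributed.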
About the Flashcards
This study set offers a focused review of key data engineering services for the GCP Professional Data Engineer exam, covering essential terminology for Google Dataflow, Google Dataproc, and the Apache Beam programming model. You will review the fundamentals of building batch and stream processing pipelines and learn how these managed services operate, from data sources and sinks to the execution of workflows.
Dive deeper into the technical details of the Apache Beam model, including PCollections, transforms, and windowing for handling unbounded data. The cards also clarify concepts such as autoscaling in Dataflow and common use cases for Dataproc, reinforcing the key ideas you need for exam success.
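For intuition on the windowing concept mentioned above, fixed windows simply bucket elements by event time. A minimal sketch in plain Python (assuming integer timestamps in seconds; this mirrors the bucketing of a fixed-window strategy, not Beam's actual API):

```python
from collections import defaultdict

def assign_fixed_windows(events, size):
    """Bucket (event_time, value) pairs into fixed windows of `size` seconds.

    An event with timestamp t lands in the window [start, start + size),
    where start = (t // size) * size -- the same event-time bucketing a
    fixed-window strategy applies to an unbounded stream.
    """
    windows = defaultdict(list)
    for t, value in events:
        start = (t // size) * size
        windows[(start, start + size)].append(value)
    return dict(windows)

events = [(3, "a"), (42, "b"), (63, "c")]  # (event_time_sec, payload)
print(assign_fixed_windows(events, 60))
# {(0, 60): ['a', 'b'], (60, 120): ['c']}
```

In a real streaming pipeline the watermark tells the runner when a window is likely complete, so results can be emitted without waiting forever for stragglers.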
Topics covered in this flashcard deck:
- Google Dataflow Fundamentals
- Google Dataproc Concepts
- Apache Beam Programming Model
- Data Processing Pipelines
- Stream and Batch Processing
- Transforms and PCollections