Data Processing and Transformation (GCP PDE) Flashcards
GCP Professional Data Engineer Flashcards
| Front | Back |
| ----- | ---- |
| How do you define a streaming pipeline in Dataflow | A pipeline with unbounded data input that processes data in real time. |
| How does Apache Beam support fault tolerance | Through the underlying runner, which provides distributed execution, automatic retries, and checkpointing. |
| How does Dataflow handle late data | Through watermarks combined with allowed lateness and triggers, which determine whether late elements are still processed or dropped. |
| What are common use cases for Dataproc | ETL, data analysis, machine learning, and migrations of on-premises Hadoop/Spark clusters. |
| What are ParDo transforms used for in Apache Beam | ParDo applies element-wise transformations such as filtering, mapping, or flat-mapping in a pipeline (see the ParDo/DoFn sketch below the table). |
| What are sinks in Dataflow pipelines | Destinations where data is written, like BigQuery or Cloud Storage. |
| What are the benefits of regional endpoints in Dataflow | They allow for better control over data locality and compliance with regional data residency requirements. |
| What are the main components of a Dataflow pipeline | Input (source), transformations (processing), and output (sink); a minimal pipeline sketch appears below the table. |
| What does the term "autoscaling" mean in Dataflow | It refers to the service dynamically adjusting worker resources based on workload. |
| What format does Dataflow use for pipeline definitions | The Apache Beam programming model. |
| What is a BigQuery connector in Dataflow | A feature allowing pipelines to read/write data from/to BigQuery tables. |
| What is a combiner in Apache Beam | A transform that merges values associatively (for example, a sum), letting the runner pre-aggregate data before the shuffle. |
| What is a DoFn in Apache Beam | The per-element processing function that a ParDo transform invokes; it holds the transform's main logic. |
| What is a PCollection in Apache Beam | An immutable, distributed dataset used in Beam pipelines. |
| What is a runner in Apache Beam | A component that executes the pipeline, such as Dataflow Runner or Direct Runner. |
| What is a shuffle operation in Dataproc | The data exchange step in the Spark/Hadoop framework that redistributes data across workers, commonly required for aggregations and joins. |
| What is a side input in Apache Beam | An additional dataset made available in full to every element processed by a ParDo transform (also shown in the ParDo/DoFn sketch below the table). |
| What is a watermark in Dataflow | It is a mechanism for tracking the progress of event-time data in a streaming pipeline. |
| What is Apache Beam | A unified programming model for defining and executing data processing workflows across different execution engines. |
| What is ensemble learning in Dataproc workflows | A machine learning technique that combines multiple models for better performance; its training workloads are often run at scale on Dataproc clusters. |
| What is Google Dataflow | A fully managed, serverless data processing service for both batch and stream processing. |
| What is Google Dataproc | A fully managed cloud service for deploying and running Apache Spark and Hadoop clusters. |
| What is the default storage layer for Dataproc | Google Cloud Storage. |
| What is the difference between batch and streaming in Dataflow | Batch processes finite datasets, while streaming processes infinite or unbounded datasets in near real-time. |
| What is the function of a Dataflow template | It allows you to predefine pipeline configurations and reuse them with different parameters for execution. |
| What is the purpose of a pipeline runner in Apache Beam | The pipeline runner specifies where and how the pipeline is executed (e.g., on Dataflow, DirectRunner, etc.). |
| What is the purpose of a transform in Apache Beam | A transform processes input data to produce output data. |
| What is the purpose of the GroupByKey transform in Apache Beam | GroupByKey groups a PCollection of key/value pairs by key before further processing or aggregation (see the grouping sketch below the table). |
| What is the windowing concept in Apache Beam | A way to divide an unbounded stream into logical, typically time-based windows (fixed, sliding, or session) so each window can be processed independently (see the windowing sketch below the table). |
| What type of processing is supported by Dataflow | Dataflow supports both batch and stream processing. |
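
The source → transform → sink structure maps directly to a few lines of Beam code. Here is a minimal sketch, assuming the Apache Beam Python SDK; the file paths are placeholders:

```python
import apache_beam as beam

# A minimal batch pipeline: read (source) -> transform -> write (sink).
# With no runner specified, Beam falls back to the local DirectRunner.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("input.txt")   # source
        | "Upper" >> beam.Map(str.upper)                 # transform
        | "Write" >> beam.io.WriteToText("output")       # sink
    )
```

Swapping `ReadFromText` for a BigQuery or Pub/Sub connector changes the source without touching the rest of the pipeline, which is the point of the model's separation of source, transform, and sink.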
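
ParDo, DoFn, and side inputs fit together as follows. A minimal sketch, again assuming the Beam Python SDK; the `FilterByAllowList` class and its data are invented for illustration:

```python
import apache_beam as beam

class FilterByAllowList(beam.DoFn):
    """DoFn: the per-element logic that a ParDo transform invokes."""

    def process(self, element, allowed):
        # 'allowed' arrives as a side input: an additional dataset made
        # available in full to every element this DoFn processes.
        if element in allowed:
            yield element

with beam.Pipeline() as pipeline:
    words = pipeline | "Words" >> beam.Create(["cat", "dog", "emu"])
    allow = pipeline | "Allow" >> beam.Create(["cat", "dog"])
    (
        words
        | "Filter" >> beam.ParDo(FilterByAllowList(),
                                 allowed=beam.pvalue.AsList(allow))
        | "Print" >> beam.Map(print)   # prints: cat, dog
    )
```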
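
The difference between GroupByKey and a combiner is easiest to see side by side. A minimal grouping sketch with made-up key/value data, assuming the Beam Python SDK:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    sales = pipeline | beam.Create([("us", 3), ("eu", 5), ("us", 4)])

    # GroupByKey collects every value for a key into one iterable:
    # ("us", [3, 4]) and ("eu", [5]).
    grouped = sales | "Group" >> beam.GroupByKey()
    grouped | "PrintGroups" >> beam.Map(print)

    # A combiner (here the built-in sum) merges values associatively, so
    # the runner can pre-aggregate before the shuffle: ("us", 7), ("eu", 5).
    totals = sales | "Sum" >> beam.CombinePerKey(sum)
    totals | "PrintTotals" >> beam.Map(print)
```

Because a combiner can run on partial data before the shuffle, `CombinePerKey(sum)` typically moves far less data between workers than a `GroupByKey` followed by a manual sum.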
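
Windowing, watermarks, allowed lateness, and triggers come together in `WindowInto`. A minimal windowing sketch, assuming the Beam Python SDK; `Create` plus manually attached timestamps stands in for a real streaming source such as Pub/Sub:

```python
import time
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([("click", 1), ("click", 1)])
        # Attach event timestamps; a real streaming source supplies these.
        | beam.Map(lambda kv: window.TimestampedValue(kv, time.time()))
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),       # 60-second event-time windows
            trigger=AfterWatermark(),      # fire when the watermark passes the window end
            allowed_lateness=300,          # keep windows open 5 minutes for late data
            accumulation_mode=AccumulationMode.DISCARDING,
        )
        | "CountPerWindow" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```

Elements arriving after the watermark but within the allowed lateness still update their window; anything later is dropped.
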
This deck covers tools like Dataflow, Dataproc, and Apache Beam, focusing on designing and building data pipelines and processing workflows.