Data Processing and Transformation (GCP PDE) Flashcards
GCP Professional Data Engineer Flashcards
| Front | Back |
|-------|------|
| How do you define a streaming pipeline in Dataflow | A pipeline with unbounded data input that processes data in real time. |
| How does Apache Beam support fault tolerance | It supports fault tolerance through distributed execution and checkpointing mechanisms. |
| How does Dataflow handle late data | Through watermarks combined with allowed lateness and triggers, which determine whether late elements are still processed or dropped. |
| What are common use cases for Dataproc | ETL, Data analysis, Machine learning, and migrations from on-prem Hadoop/Spark clusters. |
| What are ParDo transforms used for in Apache Beam | ParDo is used for element-wise transformations such as filtering, mapping, or flat-mapping in a pipeline. |
| What are sinks in Dataflow pipelines | Destinations where data is written, like BigQuery or Cloud Storage. |
| What are the benefits of regional endpoints in Dataflow | They allow for better control over data locality and compliance with regional data residency requirements. |
| What are the main components of a Dataflow pipeline | Input (source), transformations (processing), and output (sink). |
| What does the term "autoscaling" mean in Dataflow | It refers to the service dynamically adjusting worker resources based on workload. |
| What model does Dataflow use for pipeline definitions | The Apache Beam programming model. |
| What is a BigQuery connector in Dataflow | A feature allowing pipelines to read/write data from/to BigQuery tables. |
| What is a combiner in Apache Beam | A transform (CombineFn) that merges elements efficiently by pre-aggregating partial results on each worker before the shuffle. |
| What is a DoFn in Apache Beam | The core processing unit within a ParDo transform, responsible for the per-element processing logic. |
| What is a PCollection in Apache Beam | An immutable, distributed dataset used in Beam pipelines. |
| What is a runner in Apache Beam | A component that executes the pipeline, such as Dataflow Runner or Direct Runner. |
| What is a shuffle operation in Dataproc | The data exchange phase within the Spark/Hadoop framework, commonly required for aggregations and joins. |
| What is a side input in Apache Beam | A static dataset provided as additional input to a ParDo transform. |
| What is a watermark in Dataflow | It is a mechanism for tracking the progress of event-time data in a streaming pipeline. |
| What is Apache Beam | A unified programming model for defining and executing data processing workflows across different execution engines. |
| What is ensemble learning in Dataproc workflows | Ensemble learning is a machine learning method that combines multiple models to achieve better performance, often processed on Dataproc clusters. |
| What is Google Dataflow | A fully managed, serverless data processing service for both batch and stream processing. |
| What is Google Dataproc | A fully managed cloud service for deploying and running Apache Spark and Hadoop clusters. |
| What is the default storage layer for Dataproc | Google Cloud Storage. |
| What is the difference between batch and streaming in Dataflow | Batch processes finite datasets, while streaming processes infinite or unbounded datasets in near real-time. |
| What is the function of a Dataflow template | It allows you to predefine pipeline configurations and reuse them with different parameters for execution. |
| What is the purpose of a pipeline runner in Apache Beam | The pipeline runner specifies where and how the pipeline is executed (e.g., on Dataflow, DirectRunner, etc.). |
| What is the purpose of a transform in Apache Beam | An operation that takes one or more PCollections as input and produces one or more PCollections as output. |
| What is the purpose of the GroupByKey transform in Apache Beam | GroupByKey is used to group data by key before applying further processing or aggregations. |
| What is the windowing concept in Apache Beam | A way to subdivide an unbounded data stream into finite logical windows (fixed, sliding, or session) based on event time. |
| What type of processing is supported by Dataflow | Dataflow supports both batch and stream processing. |
This deck covers tools like Dataflow, Dataproc, and Apache Beam, focusing on designing and building data pipelines and processing workflows. The code sketches below illustrate several of the concepts from the table.
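To make the pipeline anatomy concrete, here is a minimal sketch using the Apache Beam Python SDK; the element values and output path are placeholders. Each `|` stage yields a new immutable PCollection, and the three stages map directly onto source, transformation, and sink.

```python
import apache_beam as beam

# A minimal batch pipeline: source -> transform -> sink.
# Each stage produces a new immutable PCollection; the context
# manager runs the whole graph when the block exits.
with beam.Pipeline() as p:
    (
        p
        | "Source" >> beam.Create(["alpha", "beta", "gamma"])  # input
        | "Transform" >> beam.Map(str.upper)                   # processing
        | "Sink" >> beam.io.WriteToText("/tmp/out")            # output
    )
```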
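A sketch of ParDo and DoFn: the DoFn subclass holds the per-element logic, and ParDo applies it across the PCollection. The class name and sample sentences are illustrative.

```python
import apache_beam as beam

class SplitWordsFn(beam.DoFn):
    """The DoFn carries the per-element logic that ParDo applies."""
    def process(self, element):
        # One input element may yield zero, one, or many outputs,
        # which is why ParDo also covers flat-mapping and filtering.
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["the quick brown fox", "jumps over"])
        | beam.ParDo(SplitWordsFn())
        | beam.Map(print)
    )
```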
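GroupByKey in miniature (keys and values are made up): it collects all values sharing a key into one iterable, ready for downstream aggregation.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("fruit", "apple"), ("veg", "carrot"), ("fruit", "pear")])
        | beam.GroupByKey()  # e.g. ("fruit", ["apple", "pear"])
        | beam.Map(print)
    )
```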
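A sketch of a custom combiner: the CombineFn below computes a per-key mean, and its accumulator design is what lets runners pre-aggregate partial results on each worker before shuffling.

```python
import apache_beam as beam

class MeanFn(beam.CombineFn):
    """Keeps a (sum, count) accumulator so partial results
    can be merged efficiently across workers."""
    def create_accumulator(self):
        return 0.0, 0
    def add_input(self, acc, value):
        total, count = acc
        return total + value, count + 1
    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return sum(totals), sum(counts)
    def extract_output(self, acc):
        total, count = acc
        return total / count if count else float("nan")

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("a", 1), ("a", 3), ("b", 10)])
        | beam.CombinePerKey(MeanFn())  # -> ("a", 2.0), ("b", 10.0)
        | beam.Map(print)
    )
```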
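Side inputs, sketched with hypothetical currency data: the small lookup PCollection is materialized in full and handed to every element of the main input via AsDict.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # A small lookup table provided as a side input.
    rates = p | "Rates" >> beam.Create([("EUR", 1.1), ("GBP", 1.3)])
    (
        p
        | "Orders" >> beam.Create([("EUR", 100), ("GBP", 50)])
        # fx arrives as a dict alongside each main-input element.
        | beam.Map(
            lambda order, fx: order[1] * fx[order[0]],
            fx=beam.pvalue.AsDict(rates),
        )
        | beam.Map(print)  # converted order amounts
    )
```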
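Windowing, watermarks, and late data in one sketch (the timestamps are invented epoch seconds): elements are assigned to fixed 60-second event-time windows, the watermark tracks event-time progress, and allowed_lateness plus a late-firing trigger keep each window open so late elements are re-emitted rather than dropped.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterWatermark)

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("k", 1), ("k", 2), ("k", 3)])
        # Attach hypothetical event timestamps (epoch seconds).
        | beam.Map(lambda kv: window.TimestampedValue(kv, 10 * kv[1]))
        | beam.WindowInto(
            window.FixedWindows(60),                     # 60s event-time windows
            trigger=AfterWatermark(late=AfterCount(1)),  # refire on late data
            accumulation_mode=AccumulationMode.ACCUMULATING,
            allowed_lateness=120,                        # keep windows open 2 min
        )
        | beam.CombinePerKey(sum)  # all three timestamps land in one window
        | beam.Map(print)
    )
```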
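Runner choice and streaming mode are both pipeline options, sketched here with placeholder project, bucket, and topic names: swapping DataflowRunner for DirectRunner runs the same graph locally, and streaming=True plus an unbounded Pub/Sub source makes this a streaming pipeline.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, region, bucket, and topic names.
options = PipelineOptions(
    runner="DataflowRunner",  # or "DirectRunner" for local execution
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    streaming=True,           # unbounded input => streaming mode
)

with beam.Pipeline(options=options) as p:
    (
        p
        | beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | beam.Map(lambda msg: msg.decode("utf-8"))
        | beam.Map(print)
    )
```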
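Finally, a sketch of the BigQuery connector with hypothetical project, dataset, and bucket names: WriteToBigQuery creates the table if needed and appends rows, with batch loads staging files under temp_location first.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Batch loads into BigQuery stage files in temp_location first.
options = PipelineOptions(temp_location="gs://my-bucket/tmp")

with beam.Pipeline(options=options) as p:
    (
        p
        | beam.Create([{"word": "beam", "count": 3}])
        | beam.io.WriteToBigQuery(
            "my-project:my_dataset.word_counts",
            schema="word:STRING,count:INTEGER",
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```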