Data Processing and Transformation (GCP PDE) Flashcards
GCP Professional Data Engineer Flashcards
| Front | Back |
| ----- | ---- |
| How do you define a streaming pipeline in Dataflow | A pipeline with unbounded data input that processes data in real time. |
| How does Apache Beam support fault tolerance | Through the underlying runner, which provides distributed execution, automatic retries, and checkpointing. |
| How does Dataflow handle late data | Through watermarks combined with allowed lateness and triggers, which determine whether late elements are still processed or dropped. |
| What are common use cases for Dataproc | ETL, data analysis, machine learning, and migrations of on-premises Hadoop/Spark clusters. |
| What are ParDo transforms used for in Apache Beam | ParDo applies element-wise transformations such as filtering, mapping, or flat-mapping in a pipeline (see the ParDo/DoFn sketch below the table). |
| What are sinks in Dataflow pipelines | Destinations where data is written, like BigQuery or Cloud Storage. |
| What are the benefits of regional endpoints in Dataflow | They allow for better control over data locality and compliance with regional data residency requirements. |
| What are the main components of a Dataflow pipeline | Input (source), transformations (processing), and output (sink); a minimal pipeline sketch appears below the table. |
| What does the term "autoscaling" mean in Dataflow | It refers to the service dynamically adjusting worker resources based on workload. |
| What format does Dataflow use for pipeline definitions | The Apache Beam programming model. |
| What is a BigQuery connector in Dataflow | A feature allowing pipelines to read/write data from/to BigQuery tables. |
| What is a combiner in Apache Beam | A transform that merges values associatively (for example, a sum), letting the runner pre-aggregate data before the shuffle. |
| What is a DoFn in Apache Beam | The per-element processing function that a ParDo transform invokes; it holds the transform's main logic. |
| What is a PCollection in Apache Beam | An immutable, distributed dataset used in Beam pipelines. |
| What is a runner in Apache Beam | A component that executes the pipeline, such as Dataflow Runner or Direct Runner. |
| What is a shuffle operation in Dataproc | The data exchange step in the Spark/Hadoop framework that redistributes data across workers, commonly required for aggregations and joins. |
| What is a side input in Apache Beam | An additional dataset made available in full to every element processed by a ParDo transform (also shown in the ParDo/DoFn sketch below the table). |
| What is a watermark in Dataflow | It is a mechanism for tracking the progress of event-time data in a streaming pipeline. |
| What is Apache Beam | A unified programming model for defining and executing data processing workflows across different execution engines. |
| What is ensemble learning in Dataproc workflows | A machine learning technique that combines multiple models for better performance; its training workloads are often run at scale on Dataproc clusters. |
| What is Google Dataflow | A fully managed, serverless data processing service for both batch and stream processing. |
| What is Google Dataproc | A fully managed cloud service for deploying and running Apache Spark and Hadoop clusters. |
| What is the default storage layer for Dataproc | Google Cloud Storage. |
| What is the difference between batch and streaming in Dataflow | Batch processes finite datasets, while streaming processes infinite or unbounded datasets in near real-time. |
| What is the function of a Dataflow template | It allows you to predefine pipeline configurations and reuse them with different parameters for execution. |
| What is the purpose of a pipeline runner in Apache Beam | The pipeline runner specifies where and how the pipeline is executed (e.g., on Dataflow, DirectRunner, etc.). |
| What is the purpose of a transform in Apache Beam | A transform processes input data to produce output data. |
| What is the purpose of the GroupByKey transform in Apache Beam | GroupByKey groups a PCollection of key/value pairs by key before further processing or aggregation (see the grouping sketch below the table). |
| What is the windowing concept in Apache Beam | A way to divide an unbounded stream into logical, typically time-based windows (fixed, sliding, or session) so each window can be processed independently (see the windowing sketch below the table). |
| What type of processing is supported by Dataflow | Dataflow supports both batch and stream processing. |
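
The source → transform → sink structure maps directly to a few lines of Beam code. Here is a minimal sketch, assuming the Apache Beam Python SDK; the file paths are placeholders:

```python
import apache_beam as beam

# A minimal batch pipeline: read (source) -> transform -> write (sink).
# With no runner specified, Beam falls back to the local DirectRunner.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("input.txt")   # source
        | "Upper" >> beam.Map(str.upper)                 # transform
        | "Write" >> beam.io.WriteToText("output")       # sink
    )
```

Swapping `ReadFromText` for a BigQuery or Pub/Sub connector changes the source without touching the rest of the pipeline, which is the point of the model's separation of source, transform, and sink.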
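
ParDo, DoFn, and side inputs fit together as follows. A minimal sketch, again assuming the Beam Python SDK; the `FilterByAllowList` class and its data are invented for illustration:

```python
import apache_beam as beam

class FilterByAllowList(beam.DoFn):
    """DoFn: the per-element logic that a ParDo transform invokes."""

    def process(self, element, allowed):
        # 'allowed' arrives as a side input: an additional dataset made
        # available in full to every element this DoFn processes.
        if element in allowed:
            yield element

with beam.Pipeline() as pipeline:
    words = pipeline | "Words" >> beam.Create(["cat", "dog", "emu"])
    allow = pipeline | "Allow" >> beam.Create(["cat", "dog"])
    (
        words
        | "Filter" >> beam.ParDo(FilterByAllowList(),
                                 allowed=beam.pvalue.AsList(allow))
        | "Print" >> beam.Map(print)   # prints: cat, dog
    )
```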
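
The difference between GroupByKey and a combiner is easiest to see side by side. A minimal grouping sketch with made-up key/value data, assuming the Beam Python SDK:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    sales = pipeline | beam.Create([("us", 3), ("eu", 5), ("us", 4)])

    # GroupByKey collects every value for a key into one iterable:
    # ("us", [3, 4]) and ("eu", [5]).
    grouped = sales | "Group" >> beam.GroupByKey()
    grouped | "PrintGroups" >> beam.Map(print)

    # A combiner (here the built-in sum) merges values associatively, so
    # the runner can pre-aggregate before the shuffle: ("us", 7), ("eu", 5).
    totals = sales | "Sum" >> beam.CombinePerKey(sum)
    totals | "PrintTotals" >> beam.Map(print)
```

Because a combiner can run on partial data before the shuffle, `CombinePerKey(sum)` typically moves far less data between workers than a `GroupByKey` followed by a manual sum.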
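
Windowing, watermarks, allowed lateness, and triggers come together in `WindowInto`. A minimal windowing sketch, assuming the Beam Python SDK; `Create` plus manually attached timestamps stands in for a real streaming source such as Pub/Sub:

```python
import time
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AccumulationMode, AfterWatermark

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([("click", 1), ("click", 1)])
        # Attach event timestamps; a real streaming source supplies these.
        | beam.Map(lambda kv: window.TimestampedValue(kv, time.time()))
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),       # 60-second event-time windows
            trigger=AfterWatermark(),      # fire when the watermark passes the window end
            allowed_lateness=300,          # keep windows open 5 minutes for late data
            accumulation_mode=AccumulationMode.DISCARDING,
        )
        | "CountPerWindow" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```

Elements arriving after the watermark but within the allowed lateness still update their window; anything later is dropped.
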
This deck covers tools like Dataflow, Dataproc, and Apache Beam, focusing on designing and building data pipelines and processing workflows.