Data Processing and Transformation (GCP PDE) Flashcards

| Front | Back |
| --- | --- |
| How do you define a streaming pipeline in Dataflow | A pipeline with unbounded data input that processes data in real time. |
| How does Apache Beam support fault tolerance | It supports fault tolerance through distributed execution and checkpointing mechanisms. |
| How does Dataflow handle late data | Through watermarks, allowed-lateness settings, and triggers, which control when window results are emitted and when late elements are dropped. |
| What are common use cases for Dataproc | ETL, Data analysis, Machine learning, and migrations from on-prem Hadoop/Spark clusters. |
| What are ParDo transforms used for in Apache Beam | ParDo is used for element-wise transformations such as filtering, mapping, or flat-mapping in a pipeline. |
| What are sinks in Dataflow pipelines | Destinations where data is written, like BigQuery or Cloud Storage. |
| What are the benefits of regional endpoints in Dataflow | They allow for better control over data locality and compliance with regional data residency requirements. |
| What are the main components of a Dataflow pipeline | Input (source), transformations (processing), and output (sink). |
| What does the term "autoscaling" mean in Dataflow | It refers to the service dynamically adjusting worker resources based on workload. |
| What format does Dataflow use for pipeline definitions | The Apache Beam programming model. |
| What is a BigQuery connector in Dataflow | A feature allowing pipelines to read/write data from/to BigQuery tables. |
| What is a combiner in Apache Beam | A transform that merges values efficiently, performing partial aggregation on workers before data is shuffled. |
| What is a DoFn in Apache Beam | The core processing unit within a ParDo transform, responsible for the per-element processing logic. |
| What is a PCollection in Apache Beam | An immutable, distributed dataset used in Beam pipelines. |
| What is a runner in Apache Beam | A component that executes the pipeline, such as Dataflow Runner or Direct Runner. |
| What is a shuffle operation in Dataproc | The data-exchange step in the Spark/Hadoop framework that redistributes data across workers, typically required for aggregations and joins. |
| What is a side input in Apache Beam | A static dataset provided as additional input to a ParDo transform. |
| What is a watermark in Dataflow | It is a mechanism for tracking the progress of event-time data in a streaming pipeline. |
| What is Apache Beam | A unified programming model for defining and executing data processing workflows across different execution engines. |
| What is ensemble learning in Dataproc workflows | Ensemble learning is a machine learning method that combines multiple models to achieve better performance, often processed on Dataproc clusters. |
| What is Google Dataflow | A fully managed, serverless data processing service for batch and stream processing. |
| What is Google Dataproc | An easy-to-use and fully managed cloud service for deploying and running Apache Spark and Hadoop clusters. |
| What is the default storage layer for Dataproc | Google Cloud Storage, accessed via the Cloud Storage connector; cluster-local HDFS is also available but is ephemeral. |
| What is the difference between batch and streaming in Dataflow | Batch processes finite datasets, while streaming processes infinite or unbounded datasets in near real-time. |
| What is the function of a Dataflow template | It allows you to predefine pipeline configurations and reuse them with different parameters for execution. |
| What is the purpose of a pipeline runner in Apache Beam | The pipeline runner specifies where and how the pipeline is executed (e.g., on Dataflow, DirectRunner, etc.). |
| What is the purpose of a transform in Apache Beam | A transform processes input data to produce output data. |
| What is the purpose of the GroupByKey transform in Apache Beam | GroupByKey is used to group data by key before applying further processing or aggregations. |
| What is the windowing concept in Apache Beam | A way to divide an unbounded data stream into finite, logical windows (fixed, sliding, or session-based) for processing. |
| What type of processing is supported by Dataflow | Dataflow supports both batch and stream processing. |
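Several of the Beam cards above (PCollection, ParDo/DoFn, GroupByKey, combiner) describe stages that compose into a single flow. As a rough, SDK-free sketch in plain Python (this deliberately mimics, but is not, the Beam API), an element-wise ParDo followed by GroupByKey and a per-key combine looks like:

```python
from collections import defaultdict

def par_do(pcollection, do_fn):
    """Element-wise processing: apply do_fn to each element and
    flatten the results, analogous to Beam's ParDo over a PCollection."""
    out = []
    for element in pcollection:
        out.extend(do_fn(element))
    return out

def group_by_key(pairs):
    """Collect (key, value) pairs into (key, [values]), like GroupByKey."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

# A word-count style pipeline sketch: source -> ParDo -> GroupByKey -> combine.
lines = ["hello world", "hello beam"]  # bounded "source"
pairs = par_do(lines, lambda line: [(w, 1) for w in line.split()])
counts = {k: sum(vs) for k, vs in group_by_key(pairs).items()}  # combine step
print(counts)  # {'hello': 2, 'world': 1, 'beam': 1}
```

In the real SDK these stages are chained with the `|` operator onto a `Pipeline` object, and a runner (DirectRunner locally, Dataflow Runner in GCP) decides how the work is distributed.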
About the Flashcards
This study set offers a focused review of key data engineering services for the GCP Professional Data Engineer exam, covering essential terminology for Google Dataflow, Google Dataproc, and the Apache Beam programming model. You will review the fundamentals of building batch and stream processing pipelines and learn how these managed services operate, from data sources and sinks to the execution of workflows.
Dive deeper into the technical details of the Apache Beam model, including PCollections, transforms, and windowing for handling unbounded data. The cards also clarify concepts such as autoscaling in Dataflow and common use cases for Dataproc, reinforcing the key ideas you need for exam success.
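For intuition on the windowing concept mentioned above, fixed windows simply bucket elements by event time. A minimal sketch in plain Python (assuming integer timestamps in seconds; this mirrors the bucketing of a fixed-window strategy, not Beam's actual API):

```python
from collections import defaultdict

def assign_fixed_windows(events, size):
    """Bucket (event_time, value) pairs into fixed windows of `size` seconds.

    An event with timestamp t lands in the window [start, start + size),
    where start = (t // size) * size -- the same event-time bucketing a
    fixed-window strategy applies to an unbounded stream.
    """
    windows = defaultdict(list)
    for t, value in events:
        start = (t // size) * size
        windows[(start, start + size)].append(value)
    return dict(windows)

events = [(3, "a"), (42, "b"), (63, "c")]  # (event_time_sec, payload)
print(assign_fixed_windows(events, 60))
# {(0, 60): ['a', 'b'], (60, 120): ['c']}
```

In a real streaming pipeline the watermark tells the runner when a window is likely complete, so results can be emitted without waiting forever for stragglers.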
Topics covered in this flashcard deck:
- Google Dataflow Fundamentals
- Google Dataproc Concepts
- Apache Beam Programming Model
- Data Processing Pipelines
- Stream and Batch Processing
- Transforms and PCollections