
Data Processing and Transformation (GCP PDE) Flashcards

GCP Professional Data Engineer Flashcards

How do you define a streaming pipeline in Dataflow? A pipeline with unbounded data input that processes data in real time.
How does Apache Beam support fault tolerance? Through distributed execution and checkpointing mechanisms.
How does Dataflow handle late data? Through window expiration policies and watermarks.
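The interplay of watermarks and window expiration can be sketched in plain Python. This is an illustrative model of the decision a streaming engine makes, not the actual Dataflow API: an element's window accepts data normally until the watermark passes the window end, accepts it as "late" while within the allowed-lateness horizon, and drops it once the window has fully expired. All names here are hypothetical.

```python
# Hypothetical sketch of late-data handling in a streaming engine
# (illustrative only, not the Dataflow API). Timestamps are in seconds.

def classify(window_end, watermark, allowed_lateness):
    """Classify an arriving element for a window ending at window_end."""
    if watermark <= window_end:
        return "on_time"   # window still open: element processed normally
    if watermark <= window_end + allowed_lateness:
        return "late"      # watermark passed the window end, but within allowed lateness
    return "dropped"       # window expired: element is discarded

# Window [0, 60) with 30 seconds of allowed lateness:
print(classify(60, 45, 30))   # on_time
print(classify(60, 75, 30))   # late
print(classify(60, 95, 30))   # dropped
```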
What are common use cases for Dataproc? ETL, data analysis, machine learning, and migrations from on-prem Hadoop/Spark clusters.
What are ParDo transforms used for in Apache Beam? ParDo is used for element-wise transformations such as filtering, mapping, or flat-mapping in a pipeline.
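ParDo is essentially a parallel flat-map: the processing function is called once per element and may emit zero, one, or many outputs, which covers filtering, mapping, and flat-mapping as special cases. A plain-Python sketch of that semantics (not the Beam API):

```python
# Plain-Python sketch of ParDo semantics (not the Beam API): a DoFn-style
# function is applied to each element and may emit 0..n outputs per element.

def par_do(elements, process_fn):
    out = []
    for element in elements:
        out.extend(process_fn(element))   # each call may yield zero or more outputs
    return out

# Filtering: zero or one output per element.
evens = par_do([1, 2, 3, 4], lambda x: [x] if x % 2 == 0 else [])
# Flat-mapping: potentially many outputs per element.
words = par_do(["a b", "c"], lambda line: line.split())
print(evens, words)   # [2, 4] ['a', 'b', 'c']
```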
What are sinks in Dataflow pipelines? Destinations where data is written, such as BigQuery or Cloud Storage.
What are the benefits of regional endpoints in Dataflow? They allow for better control over data locality and compliance with regional data residency requirements.
What are the main components of a Dataflow pipeline? Input (source), transformations (processing), and output (sink).
What does the term "autoscaling" mean in Dataflow? It refers to the service dynamically adjusting worker resources based on workload.
What format does Dataflow use for pipeline definitions? The Apache Beam programming model.
What is a BigQuery connector in Dataflow? A feature allowing pipelines to read/write data from/to BigQuery tables.
What is a combiner in Apache Beam? A specialized transform for merging data efficiently.
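The efficiency of a combiner comes from the combining operation being associative and commutative: each worker can reduce its share of the data to a partial result, and only those small partials cross the network to be merged. A plain-Python sketch of that idea (illustrative names, not the Beam API):

```python
# Sketch of why combiners are efficient (not the Beam API): an associative,
# commutative operation lets each worker pre-combine locally, so only small
# partial results need to be shuffled and merged.

def combine_per_key(pairs, op):
    acc = {}
    for key, value in pairs:
        acc[key] = op(acc[key], value) if key in acc else value
    return acc

# Two workers pre-combine their local shards...
worker1 = combine_per_key([("a", 1), ("a", 2), ("b", 5)], lambda x, y: x + y)
worker2 = combine_per_key([("a", 4), ("b", 1)], lambda x, y: x + y)
# ...then only the partial sums are merged.
merged = combine_per_key(list(worker1.items()) + list(worker2.items()),
                         lambda x, y: x + y)
print(merged)   # {'a': 7, 'b': 6}
```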
What is a DoFn in Apache Beam? The core processing unit within a ParDo transform, responsible for the per-element processing logic.
What is a PCollection in Apache Beam? An immutable, distributed dataset used in Beam pipelines.
What is a runner in Apache Beam? A component that executes the pipeline, such as the Dataflow Runner or the Direct Runner.
What is a shuffle operation in Dataproc? The data-exchange process within the Spark/Hadoop framework, often used for aggregations and joins.
What is a side input in Apache Beam? A static dataset provided as additional input to a ParDo transform.
What is a watermark in Dataflow? A mechanism for tracking the progress of event time in a streaming pipeline.
What is Apache Beam? A unified programming model for defining and executing data processing workflows across different execution engines.
What is ensemble learning in Dataproc workflows? Ensemble learning is a machine learning method that combines multiple models to achieve better performance, often processed on Dataproc clusters.
What is Google Dataflow? A fully managed, serverless data processing service for both batch and stream processing.
What is Google Dataproc? A fully managed cloud service for deploying and running Apache Spark and Hadoop clusters.
What is the default storage layer for Dataproc? Google Cloud Storage.
What is the difference between batch and streaming in Dataflow? Batch processes finite datasets, while streaming processes unbounded datasets in near real time.
What is the function of a Dataflow template? It lets you predefine a pipeline configuration and reuse it with different parameters at execution time.
What is the purpose of a pipeline runner in Apache Beam? The pipeline runner specifies where and how the pipeline is executed (e.g., on Dataflow or locally with the DirectRunner).
What is the purpose of a transform in Apache Beam? A transform processes input data to produce output data.
What is the purpose of the GroupByKey transform in Apache Beam? GroupByKey groups data by key before applying further processing or aggregations.
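The semantics of GroupByKey can be sketched in plain Python (not the Beam API): a collection of (key, value) pairs is regrouped into one (key, [values]) entry per distinct key.

```python
# Plain-Python sketch of GroupByKey semantics (not the Beam API):
# (key, value) pairs become (key, [values]) groups.
from collections import defaultdict

def group_by_key(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

grouped = group_by_key([("a", 1), ("b", 2), ("a", 3)])
print(grouped)   # {'a': [1, 3], 'b': [2]}
```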
What is the windowing concept in Apache Beam? A way to divide a data stream into logical windows of fixed time or events for processing.
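For fixed (tumbling) time windows, an element's event timestamp alone determines which window it falls into. A small arithmetic sketch of that assignment (illustrative, not the Beam API):

```python
# Sketch of fixed (tumbling) window assignment (not the Beam API):
# the event timestamp determines the [start, end) window, in seconds.

def fixed_window(timestamp, size):
    """Return the [start, end) bounds of the window containing timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)

# With 60-second fixed windows:
print(fixed_window(10, 60))   # (0, 60)
print(fixed_window(75, 60))   # (60, 120)
```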
What type of processing is supported by Dataflow? Dataflow supports both batch and stream processing.
This deck covers tools like Dataflow, Dataproc, and Apache Beam, focusing on designing and building data pipelines and processing workflows.