Data Engineering and Preparation Flashcards

AWS Machine Learning Engineer Associate MLA-C01 Flashcards

Front	Back
Difference between data lake and data warehouse	A data lake stores raw data of any type (structured, semi-structured, and unstructured) in its native format, while a data warehouse stores processed, structured data for analytics
How does Glue transform data	It runs ETL scripts written in Python (PySpark) or Scala on a managed Apache Spark environment to process and transform data
How does partitioning improve query performance	It allows queries to scan only relevant subsets of data instead of the entire dataset
What are the three main steps of ETL	Extract, Transform, Load
What does the term data lineage mean	Tracking the origin and transformation history of data through a pipeline
What is a data lake	A centralized repository to store structured, unstructured, and semi-structured data at any scale
What is the AWS Glue Data Catalog	A central metadata repository in AWS Glue that stores metadata for data sources, tables, and schemas
What is Amazon EMR used for	Running big data frameworks like Apache Hadoop and Apache Spark on AWS
What is an Amazon Redshift Spectrum query	Running SQL queries directly on data stored in Amazon S3 without needing to load it into Redshift
What is Apache Spark used for in data engineering	Fast and scalable processing of large datasets using distributed computing
What is cluster resizing in Amazon EMR	Adding or removing nodes in a running cluster to optimize cost or performance
What is data deduplication	The process of removing duplicate records from a dataset
What is data engineering	The process of designing, building, and maintaining systems for collecting, storing, and analyzing data at scale
What is data transformation	The process of converting raw data into a structured and usable format
What is partitioning in AWS Glue	Organizing data into segments by key values (for example, by date) so that jobs and queries read only the relevant partitions
What is S3 used for in data pipelines	Storing data in a scalable and durable object storage solution
What is schema-on-read	Applying a schema only when data is read from storage, so raw and unstructured data can be stored without a predefined structure
What is schema-on-write	Enforcing a schema when data is written to storage, ensuring consistency in structure
What is Snowflake commonly used for in data pipelines	Cloud-based data warehousing and analytics
What is the benefit of using Parquet or ORC file formats	They are columnar formats optimized for analytics: queries read only the columns they need, and compression reduces storage space
What is the difference between on-demand and spot instances in EMR	On-Demand Instances are billed at a fixed rate for the time used, while Spot Instances use spare EC2 capacity at discounted rates but can be interrupted
What is the primary use of AWS Glue	ETL (Extract, Transform, Load) processes to prepare and transform data
What is the purpose of a data warehouse	Centralizing and organizing large amounts of structured data for analysis
What is the purpose of data normalization	In database design, organizing data into related tables to reduce redundancy and improve integrity
What is the role of an IAM role in AWS Glue	Granting Glue jobs and crawlers the permissions they need to access data sources and related services
What is the role of Amazon Redshift in data pipelines	Providing a scalable data warehouse solution for analytics and querying large datasets
What is the role of HDFS in an EMR cluster	Distributed storage across the cluster's nodes for the input and output data of processing frameworks like Hadoop or Spark
What is the role of metadata in data preparation	Providing contextual information about data like structure, schema, or attributes
What is the use of a Glue crawler	Automatically scanning data sources to infer schemas and create table metadata in the Glue Data Catalog
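
The deduplication and partitioning cards above can be illustrated with a minimal, library-free sketch. It is not how Glue, Spark, or Redshift implement these features; the sample records, field names, and helper functions are all hypothetical and exist only to show the idea that a query touching one partition does not need to scan the others.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"id": 1, "date": "2024-01-01", "value": 10},
    {"id": 2, "date": "2024-01-01", "value": 20},
    {"id": 2, "date": "2024-01-01", "value": 20},  # exact duplicate of the row above
    {"id": 3, "date": "2024-01-02", "value": 30},
]


def deduplicate(rows):
    """Remove exact duplicate records while keeping the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def partition_by(rows, column):
    """Group records by the value of one column, like date-based folders."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def query_partition(partitions, key):
    """Read only the partition that matches the key, skipping all others."""
    return partitions.get(key, [])


unique_records = deduplicate(records)
partitions = partition_by(unique_records, "date")
jan_first = query_partition(partitions, "2024-01-01")
print(len(unique_records), sorted(partitions), [r["id"] for r in jan_first])
```

In a real pipeline the partitions would typically be separate prefixes in S3 rather than keys in a dictionary, but the access pattern is the same: filter on the partition key first, then read.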

This deck focuses on data preparation, transformation, and the use of AWS tools like Glue, EMR, and Redshift for effective data pipelines.
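
The deck's schema-on-write and schema-on-read cards can be contrasted with a small sketch in plain Python. The schema, field names, and helper functions below are hypothetical and only approximate the behavior of a warehouse (validate before storing) and a data lake (store raw, interpret when reading).

```python
import json

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"user_id": int, "amount": float}


def write_with_schema(store, record):
    """Schema-on-write: reject records that do not match the schema."""
    for field, field_type in SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], field_type):
            raise TypeError(f"wrong type for field: {field}")
    store.append(record)


def write_raw(store, line):
    """Schema-on-read, write side: store the raw text without any checks."""
    store.append(line)


def read_with_schema(store):
    """Schema-on-read, read side: parse and cast each raw line when queried."""
    rows = []
    for line in store:
        raw = json.loads(line)
        rows.append({"user_id": int(raw["user_id"]), "amount": float(raw["amount"])})
    return rows


warehouse = []
write_with_schema(warehouse, {"user_id": 1, "amount": 9.5})

rejected = False
try:
    write_with_schema(warehouse, {"user_id": "1", "amount": 9.5})  # wrong type
except TypeError:
    rejected = True

lake = []
write_raw(lake, '{"user_id": "2", "amount": "3.25", "note": "extra field"}')
rows = read_with_schema(lake)
print(rejected, rows)
```

The trade-off shown here is the one the cards describe: the first approach guarantees a consistent structure at write time, while the second accepts any input and defers interpretation until the data is read.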