Data Engineering and Preparation Flashcards
AWS Machine Learning Engineer Associate MLA-C01 Flashcards

| Front | Back |
| --- | --- |
| Difference between data lake and data warehouse | A data lake stores all types of data, while a data warehouse stores processed and structured data for analytics |
| How does Glue transform data | It uses PySpark or Scala ETL scripts to process and transform data |
| How does partitioning improve query performance | It allows queries to scan only relevant subsets of data instead of the entire dataset |
| What are the three main steps of ETL | Extract, Transform, Load |
| What does the term data lineage mean | Tracking the origin and transformation history of data through a pipeline |
| What is a data lake | A centralized repository to store structured, unstructured, and semi-structured data at any scale |
| What is the Glue Data Catalog | A persistent metadata store in AWS Glue that holds table definitions, schemas, and data source locations |
| What is Amazon EMR used for | Running big data frameworks like Apache Hadoop and Apache Spark on AWS |
| What does Amazon Redshift Spectrum enable | Running SQL queries directly on data stored in Amazon S3 without loading it into Redshift |
| What is Apache Spark used for in data engineering | Fast and scalable processing of large datasets using distributed computing |
| What is cluster resizing in Amazon EMR | Changing the number and type of nodes in a cluster to optimize costs or performance |
| What is data deduplication | The process of removing duplicate records from a dataset |
| What is data engineering | The process of designing, building, and maintaining systems for collecting, storing, and analyzing data at scale |
| What is data transformation | The process of converting raw data into a structured and usable format |
| What is partitioning in AWS Glue | Organizing data into segments for more efficient queries and processing |
| What is S3 used for in data pipelines | Storing data in a scalable and durable object storage solution |
| What is schema-on-read | Applying a schema only when data is read from storage, giving flexibility for unstructured and semi-structured data |
| What is schema-on-write | Enforcing a schema when data is written to storage, ensuring consistency in structure |
| What is Snowflake commonly used for in data pipelines | Cloud-based data warehousing and analytics |
| What is the benefit of using Parquet or ORC file formats | They are optimized for analytics and reduce storage space through columnar layout and compression |
| What is the difference between on-demand and spot instances in EMR | On-Demand Instances are billed at a fixed rate for the time used, while Spot Instances offer spare EC2 capacity at steep discounts but can be interrupted |
| What is the primary use of AWS Glue | ETL (Extract, Transform, Load) processes to prepare and transform data |
| What is the purpose of a data warehouse | Centralizing and organizing large amounts of structured data for analysis |
| What is the purpose of data normalization | Organizing data to reduce redundancy and improve integrity |
| What is the role of an IAM role in AWS Glue | Managing permissions and access control for Glue jobs and related services |
| What is the role of Amazon Redshift in data pipelines | Providing a scalable data warehouse solution for analytics and querying large datasets |
| What is the role of HDFS in an EMR cluster | Distributed storage to store input data and output data for processing frameworks like Hadoop or Spark |
| What is the role of metadata in data preparation | Providing contextual information about data like structure, schema, or attributes |
| What is the use of a Glue crawler | Automating the process of data source discovery and metadata creation in the data catalog |
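Several of the cards above (the ETL steps, deduplication, and partitioning) can be tied together in a minimal sketch. This is an illustrative, stdlib-only Python example, not a real AWS Glue job: the record fields, the `id` dedup key, and the `year=` partition key are all assumptions chosen for the demo. Hive-style `key=value` path prefixes are what engines like Athena and Glue use to prune partitions so a query scans only the relevant subset of data.

```python
# Minimal ETL sketch (stdlib only): extract -> transform (dedup) -> load
# into Hive-style partitions. Fields and keys are illustrative only.

def extract(rows):
    """Extract step: yield raw records (here, from an in-memory list)."""
    yield from rows

def transform(records):
    """Transform step: deduplicate records on the 'id' field."""
    seen = set()
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            yield rec

def load(records):
    """Load step: group records under Hive-style partition prefixes
    (e.g. 'year=2024/'), the layout partition-aware engines expect."""
    partitions = {}
    for rec in records:
        key = f"year={rec['year']}"
        partitions.setdefault(key, []).append(rec)
    return partitions

raw = [
    {"id": 1, "year": 2023, "value": 10},
    {"id": 2, "year": 2024, "value": 20},
    {"id": 1, "year": 2023, "value": 10},  # duplicate, removed by transform
]

result = load(transform(extract(raw)))
# A query filtered on year=2024 can now scan one partition, not everything.
```

In a real pipeline the extract and load steps would read from and write to S3 (and register the partitions in the Glue Data Catalog), but the dedup and partition-pruning logic follows the same shape.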
About the Flashcards
Flashcards for the AWS Machine Learning Engineer Associate exam provide targeted review of core data engineering terms, cloud services, and pipeline practices. Use these cards to reinforce core definitions, such as data engineering, the ETL steps, and data lakes versus data warehouses, and to memorize how tools like AWS Glue, Amazon EMR, and Apache Spark fit into end-to-end pipelines.
The deck focuses on practical exam topics such as S3 storage, Amazon Redshift and Snowflake warehousing, Glue Data Catalog and crawlers, schema-on-read versus schema-on-write, transformation techniques, Parquet/ORC and partitioning, data lineage, deduplication, normalization, IAM role permissions, and cluster sizing and cost/performance trade-offs.
Topics covered in this flashcard deck:
- ETL concepts
- Data lake vs warehouse
- AWS Glue & Data Catalog
- Apache Spark and EMR
- S3 and Amazon Redshift
- Parquet/ORC and partitioning
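The schema-on-read versus schema-on-write cards can also be sketched in a few lines of stdlib Python. This is a hedged illustration, not any AWS API: the `SCHEMA` dict, field names, and helper functions are made up for the demo. Schema-on-write validates structure before storing (as a warehouse does); schema-on-read stores raw data and imposes structure only at query time (as a data lake does).

```python
# Sketch contrasting schema-on-write (validate before storing) with
# schema-on-read (store raw, apply structure when reading).
# Schema and field names are illustrative assumptions.
import json

SCHEMA = {"name": str, "age": int}

def write_with_schema(record, store):
    """Schema-on-write: reject records that don't match the schema."""
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"bad field: {field}")
    store.append(json.dumps(record))

def read_with_schema(raw_line):
    """Schema-on-read: parse raw data and project it onto a schema
    only at read time, ignoring unknown fields."""
    record = json.loads(raw_line)
    return {f: record.get(f) for f in SCHEMA}

store = []
write_with_schema({"name": "Ada", "age": 36}, store)

# Schema-on-read tolerates extra fields in the raw data:
parsed = read_with_schema('{"name": "Grace", "age": 40, "extra": true}')
```

The trade-off mirrors the cards: schema-on-write guarantees consistent structure up front, while schema-on-read keeps ingestion flexible and defers structure to each consumer.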