Data Engineering and Preparation Flashcards

AWS Machine Learning Engineer Associate MLA-C01 Flashcards

Study our Data Engineering and Preparation flashcards for the AWS Machine Learning Engineer Associate MLA-C01 exam with 29+ flashcards. View as flashcards, a searchable table, or as a fun matching game.
Front | Back
Difference between data lake and data warehouse | A data lake stores raw data of any type (structured, semi-structured, or unstructured), while a data warehouse stores processed, structured data for analytics
How does Glue transform data | It runs PySpark or Scala ETL scripts to process and transform data
How does partitioning improve query performance | It allows queries to scan only the relevant subsets of data instead of the entire dataset
What are the three main steps of ETL | Extract, Transform, Load
What does the term data lineage mean | Tracking the origin and transformation history of data through a pipeline
What is a data lake | A centralized repository to store structured, unstructured, and semi-structured data at any scale
What is a Glue Data Catalog | A central metadata repository in AWS Glue that stores metadata for data sources, tables, and schemas
What is Amazon EMR used for | Running big data frameworks like Apache Hadoop and Apache Spark on AWS
What is an Amazon Redshift Spectrum query | A SQL query that runs directly on data stored in Amazon S3 without loading it into Redshift
What is Apache Spark used for in data engineering | Fast and scalable processing of large datasets using distributed computing
What is cluster resizing in Amazon EMR | Changing the number and type of nodes in a cluster to optimize costs or performance
What is data deduplication | The process of removing duplicate records from a dataset
What is data engineering | The process of designing, building, and maintaining systems for collecting, storing, and analyzing data at scale
What is data transformation | The process of converting raw data into a structured and usable format
What is partitioning in AWS Glue | Organizing data into segments for more efficient queries and processing
What is S3 used for in data pipelines | Storing data in a scalable and durable object storage solution
What is schema-on-read | Applying a schema only when data is read from storage, which gives flexibility for raw and unstructured data
What is schema-on-write | Enforcing a schema when data is written to storage, ensuring consistency in structure
What is Snowflake commonly used for in data pipelines | Cloud-based data warehousing and analytics
What is the benefit of using Parquet or ORC file formats | Columnar storage that is optimized for analytics and reduces data storage space
What is the difference between on-demand and spot instances in EMR | On-demand instances are billed at a fixed rate for the time used, while spot instances use spare capacity at discounted rates and can be interrupted
What is the primary use of AWS Glue | ETL (Extract, Transform, Load) processes to prepare and transform data
What is the purpose of a data warehouse | Centralizing and organizing large amounts of structured data for analysis
What is the purpose of data normalization | Organizing data to reduce redundancy and improve integrity
What is the role of an IAM role in AWS Glue | Managing permissions and access control for Glue jobs and related services
What is the role of Amazon Redshift in data pipelines | Providing a scalable data warehouse solution for analytics and querying large datasets
What is the role of HDFS in an EMR cluster | Distributed storage for the input and output data of processing frameworks like Hadoop or Spark
What is the role of metadata in data preparation | Providing contextual information about data, such as structure, schema, or attributes
What is the use of a Glue crawler | Automating data source discovery and metadata creation in the Data Catalog
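
The partitioning cards can be made concrete with a small sketch. It is plain Python, not Glue or Athena code, and the `region` key, sample orders, and function names are all assumptions made for illustration. It only shows the idea that a query on a partitioned dataset reads the matching partition instead of every row.

```python
from collections import defaultdict


def partition_by(records, key):
    """Group records into partitions keyed by the value of `key`."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record)
    return dict(partitions)


def full_scan(records, key, value):
    """Check every record; returns (matches, rows_scanned)."""
    matches = [r for r in records if r[key] == value]
    return matches, len(records)


def pruned_scan(partitions, value):
    """Read only the partition for `value`; returns (matches, rows_scanned)."""
    rows = partitions.get(value, [])
    return list(rows), len(rows)


# Hypothetical sample data: 3 regions with 4 orders each.
records = [
    {"region": region, "order_id": i}
    for region in ("eu", "us", "ap")
    for i in range(4)
]
partitions = partition_by(records, "region")

full_matches, full_rows = full_scan(records, "region", "us")
pruned_matches, pruned_rows = pruned_scan(partitions, "us")
print(full_rows, pruned_rows)  # 12 4
```

Both scans return the same rows, but the pruned scan touches a third of the data. Real engines get the same effect from partition keys encoded in storage paths, which this sketch does not model.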

About the Flashcards

Flashcards for the AWS Machine Learning Engineer Associate exam provide targeted review of core data engineering terms, cloud services, and pipeline practices. Use these cards to reinforce definitions, including data engineering, ETL steps, data lakes and data warehouses, and to memorize how tools like AWS Glue, Amazon EMR, and Apache Spark fit into end-to-end pipelines.
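
As a rough illustration of the Extract, Transform, Load steps, the sketch below runs a tiny pipeline with only the Python standard library. It is not AWS Glue code: the inline CSV, the `orders` table, and the in-memory SQLite target are assumptions standing in for a real source and warehouse. The transform step also shows deduplication and basic type cleanup.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; order 1 appears twice and names carry stray spaces.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,bob,20.00
1, alice ,10.50
3,carol,5.25
"""


def extract(text):
    """Extract: parse the raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cast types, trim whitespace, drop duplicate order_ids."""
    seen = set()
    cleaned = []
    for row in rows:
        order_id = int(row["order_id"])
        if order_id in seen:
            continue  # deduplication: keep the first occurrence only
        seen.add(order_id)
        cleaned.append((order_id, row["customer"].strip(), float(row["amount"])))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
row_count, total = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(row_count, total)  # 3 35.75
```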

The deck focuses on practical exam topics such as S3 storage, Amazon Redshift and Snowflake warehousing, Glue Data Catalog and crawlers, schema-on-read versus schema-on-write, transformation techniques, Parquet/ORC and partitioning, data lineage, deduplication, normalization, IAM role permissions, and EMR cluster resizing with its cost/performance trade-offs.
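
The contrast between schema-on-write and schema-on-read can be sketched in a few lines of Python. Everything here is hypothetical: the two-field `SCHEMA`, the list-based `warehouse` and `lake` stores, and the helper names are illustrative assumptions, not how any AWS service implements these ideas.

```python
import json

SCHEMA = {"user_id": int, "event": str}  # hypothetical schema for illustration


def conforms(record, schema):
    """Return True if every schema field is present with the expected type."""
    return all(isinstance(record.get(field), kind) for field, kind in schema.items())


def write_validated(store, record, schema):
    """Schema-on-write: reject non-conforming records before storing them."""
    if not conforms(record, schema):
        raise ValueError(f"record does not match schema: {record}")
    store.append(record)


def write_raw(store, record):
    """Schema-on-read: store the serialized record without any checks."""
    store.append(json.dumps(record))


def read_with_schema(store, schema):
    """Schema-on-read: parse and apply the schema only at query time."""
    parsed = (json.loads(line) for line in store)
    return [record for record in parsed if conforms(record, schema)]


good = {"user_id": 1, "event": "login"}
bad = {"user_id": "n/a", "event": "login"}  # wrong type for user_id

warehouse, lake = [], []
write_validated(warehouse, good, SCHEMA)
try:
    write_validated(warehouse, bad, SCHEMA)
    rejected = False
except ValueError:
    rejected = True

write_raw(lake, good)
write_raw(lake, bad)
readable = read_with_schema(lake, SCHEMA)
print(rejected, len(lake), len(readable))  # True 2 1
```

The write-time check keeps the store consistent at the cost of rejecting data up front, while the read-time approach accepts everything and defers the decision about malformed records to each query.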

Topics covered in this flashcard deck:

  • ETL concepts
  • Data lake vs warehouse
  • AWS Glue & Data Catalog
  • Apache Spark and EMR
  • S3 and Amazon Redshift
  • Parquet/ORC and partitioning