Data Engineering and Preparation Flashcards
AWS Machine Learning Engineer Associate MLA-C01 Flashcards
| Front | Back |
| --- | --- |
| Difference between data lake and data warehouse | A data lake stores raw data of any type (structured, semi-structured, or unstructured), while a data warehouse stores processed, structured data for analytics |
| How does Glue transform data | It runs ETL scripts written in Python (PySpark) or Scala on a managed, serverless Apache Spark environment |
| How does partitioning improve query performance | It allows queries to scan only relevant subsets of data instead of the entire dataset |
| What are the three main steps of ETL | Extract, Transform, Load |
| What does the term data lineage mean | Tracking the origin and transformation history of data through a pipeline |
| What is a data lake | A centralized repository to store structured, unstructured, and semi-structured data at any scale |
| What is a Glue Data Catalog | A central metadata repository in AWS Glue that stores table definitions, schemas, and locations for data sources |
| What is Amazon EMR used for | Running big data frameworks like Apache Hadoop and Apache Spark on AWS |
| What is an Amazon Redshift Spectrum query | Running SQL queries directly on data stored in Amazon S3 without needing to load it into Redshift |
| What is Apache Spark used for in data engineering | Fast and scalable processing of large datasets using distributed computing |
| What is cluster resizing in Amazon EMR | Adding or removing core and task nodes in a running cluster to optimize cost or performance |
| What is data deduplication | The process of removing duplicate records from a dataset |
| What is data engineering | The process of designing, building, and maintaining systems for collecting, storing, and analyzing data at scale |
| What is data transformation | The process of converting raw data into a structured and usable format |
| What is partitioning in AWS Glue | Organizing data into segments for more efficient queries and processing |
| What is S3 used for in data pipelines | Storing data in a scalable and durable object storage solution |
| What is schema-on-read | Applying a schema only when data is read from storage, which keeps raw or loosely structured data flexible |
| What is schema-on-write | Enforcing a schema when data is written to storage, ensuring consistency in structure |
| What is Snowflake commonly used for in data pipelines | Cloud-based data warehousing and analytics |
| What is the benefit of using Parquet or ORC file formats | Their columnar layout is optimized for analytics and, combined with compression, reduces storage space and the amount of data scanned |
| What is the difference between on-demand and spot instances in EMR | On-Demand Instances are billed at a fixed rate for the time they run with no commitment, while Spot Instances use spare EC2 capacity at a discount but can be interrupted |
| What is the primary use of AWS Glue | ETL (Extract, Transform, Load) processes to prepare and transform data |
| What is the purpose of a data warehouse | Centralizing and organizing large amounts of structured data for analysis |
| What is the purpose of data normalization | In database design, organizing data into related tables to reduce redundancy and improve integrity |
| What is the role of an IAM role in AWS Glue | Granting Glue jobs and crawlers the permissions they need to access data sources, targets, and related services |
| What is the role of Amazon Redshift in data pipelines | Providing a scalable data warehouse solution for analytics and querying large datasets |
| What is the role of HDFS in an EMR cluster | Distributed storage on the cluster for input and output data used by processing frameworks like Hadoop or Spark; it is ephemeral and does not persist after the cluster terminates |
| What is the role of metadata in data preparation | Providing contextual information about data like structure, schema, or attributes |
| What is the use of a Glue crawler | Automating the process of data source discovery and metadata creation in the data catalog |
This deck focuses on data preparation, transformation, and the use of AWS tools like Glue, EMR, and Redshift for effective data pipelines.
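
As a study aid for the partitioning cards, here is a minimal Python sketch of partition pruning over Hive-style `key=value` paths. The object keys, the `sales/` prefix, and the helper names `partition_values` and `prune` are hypothetical and only illustrate the idea; query engines do this work internally and this is not how any AWS service is implemented.

```python
# Hypothetical S3-style object keys laid out with Hive-style partitions.
KEYS = [
    "sales/year=2023/month=12/part-0000.parquet",
    "sales/year=2024/month=01/part-0000.parquet",
    "sales/year=2024/month=01/part-0001.parquet",
    "sales/year=2024/month=02/part-0000.parquet",
]


def partition_values(key):
    """Extract partition columns from the key=value segments of a path."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)


def prune(keys, **wanted):
    """Keep only the keys whose partition values match every filter."""
    return [
        k for k in keys
        if all(partition_values(k).get(col) == val for col, val in wanted.items())
    ]


# A query filtered on year and month only needs to read two of the four files.
selected = prune(KEYS, year="2024", month="01")
print(selected)
```

Because the filter is resolved from the path alone, the files under the other partitions are never opened, which is the sense in which partitioning lets a query scan only the relevant subset of data.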