Data Engineering and Preparation Flashcards

AWS Machine Learning Engineer Associate MLA-C01 Flashcards

| Front | Back |
| --- | --- |
| What is the difference between a data lake and a data warehouse? | A data lake stores all types of data, while a data warehouse stores processed and structured data for analytics |
| How does Glue transform data? | It runs PySpark or Scala ETL scripts to process and transform data |
| How does partitioning improve query performance? | It allows queries to scan only the relevant subsets of data instead of the entire dataset |
| What are the three main steps of ETL? | Extract, Transform, Load |
| What does the term data lineage mean? | Tracking the origin and transformation history of data through a pipeline |
| What is a data lake? | A centralized repository that stores structured, semi-structured, and unstructured data at any scale |
| What is a Glue Data Catalog? | A central metadata store in AWS Glue for data sources, tables, and schemas |
| What is Amazon EMR used for? | Running big data frameworks like Apache Hadoop and Apache Spark on AWS |
| What is an Amazon Redshift Spectrum query? | A SQL query that runs directly on data stored in Amazon S3 without loading it into Redshift |
| What is Apache Spark used for in data engineering? | Fast and scalable processing of large datasets using distributed computing |
| What is cluster resizing in Amazon EMR? | Changing the number and type of nodes in a cluster to optimize cost or performance |
| What is data deduplication? | The process of removing duplicate records from a dataset |
| What is data engineering? | The process of designing, building, and maintaining systems for collecting, storing, and analyzing data at scale |
| What is data transformation? | The process of converting raw data into a structured and usable format |
| What is partitioning in AWS Glue? | Organizing data into segments for more efficient queries and processing |
| What is S3 used for in data pipelines? | Storing data in scalable and durable object storage |
| What is schema-on-read? | Applying a schema only when data is read from storage, which gives flexibility for unstructured data |
| What is schema-on-write? | Enforcing a schema when data is written to storage, which ensures a consistent structure |
| What is Snowflake commonly used for in data pipelines? | Cloud-based data warehousing and analytics |
| What is the benefit of using Parquet or ORC file formats? | Their columnar storage is optimized for analytics and reduces storage space |
| What is the difference between On-Demand and Spot Instances in EMR? | On-Demand Instances are billed at a fixed rate for the time they run, while Spot Instances use spare EC2 capacity at discounted rates and can be interrupted |
| What is the primary use of AWS Glue? | ETL (Extract, Transform, Load) processes to prepare and transform data |
| What is the purpose of a data warehouse? | Centralizing and organizing large amounts of structured data for analysis |
| What is the purpose of data normalization? | Organizing data to reduce redundancy and improve integrity |
| What is the role of an IAM role in AWS Glue? | Managing permissions and access control for Glue jobs and related services |
| What is the role of Amazon Redshift in data pipelines? | Providing a scalable data warehouse for analytics and for querying large datasets |
| What is the role of HDFS in an EMR cluster? | Distributed storage for the input and output data of processing frameworks like Hadoop or Spark |
| What is the role of metadata in data preparation? | Providing contextual information about data, such as its structure, schema, or attributes |
| What is the use of a Glue crawler? | Automating the discovery of data sources and the creation of metadata in the Data Catalog |
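As a rough illustration of the partitioning cards above, the sketch below shows why a partition filter lets a query skip most of a dataset. It assumes Hive-style `key=value` path segments, a layout commonly used for partitioned data in S3. The object keys and the `partition_values` and `prune` helpers are hypothetical and are not part of any AWS API.

```python
# Hypothetical S3-style object keys laid out with Hive-style partitions.
OBJECT_KEYS = [
    "sales/year=2023/month=11/part-0000.parquet",
    "sales/year=2023/month=12/part-0000.parquet",
    "sales/year=2024/month=01/part-0000.parquet",
    "sales/year=2024/month=01/part-0001.parquet",
    "sales/year=2024/month=02/part-0000.parquet",
]


def partition_values(key):
    """Extract {partition_column: value} from the key=value path segments."""
    values = {}
    for segment in key.split("/")[:-1]:  # skip the file name
        if "=" in segment:
            column, value = segment.split("=", 1)
            values[column] = value
    return values


def prune(keys, **wanted):
    """Keep only the keys whose partition values match every filter."""
    return [
        key
        for key in keys
        if all(partition_values(key).get(col) == val for col, val in wanted.items())
    ]


# Only files under year=2024/month=01 need to be read for this filter.
selected = prune(OBJECT_KEYS, year="2024", month="01")
print(f"scanning {len(selected)} of {len(OBJECT_KEYS)} files")
```

The filter is resolved from the paths alone, so files in the other partitions are never opened. Real query engines make the same decision using partition metadata from a catalog rather than by parsing key names.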
This deck focuses on data preparation, transformation, and the use of AWS tools like Glue, EMR, and Redshift for effective data pipelines.
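To make the Extract, Transform, Load steps and the deduplication card concrete, here is a minimal sketch using only the Python standard library. It is not how Glue or EMR implement ETL: the in-memory SQLite database stands in for a warehouse such as Redshift, and the sample CSV, the `orders` table, and the function names are assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in a real pipeline this might be a file in S3.
RAW_CSV = """order_id,customer,amount
1001, alice ,19.99
1002,Bob,5.00
1001, alice ,19.99
1003,carol,not_a_number
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean fields, drop unparseable rows, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        record = (int(row["order_id"]), row["customer"].strip().lower(), amount)
        if record in seen:
            continue  # deduplication: keep the first copy only
        seen.add(record)
        cleaned.append(record)
    return cleaned


def load(records, connection):
    """Load: write the cleaned records into a table."""
    connection.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
rows = connection.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(rows)
```

Of the four input rows, the duplicate order and the row with a non-numeric amount are dropped during the transform step, so two rows reach the table.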