Data Preparation and Processing Flashcards
AWS Certified AI Practitioner AIF-C01 Flashcards

| Front | Back |
| --- | --- |
| What is data augmentation | Creating additional training samples by modifying existing data, often used in image or text datasets. |
| What is data cleaning | The process of detecting and correcting errors or inconsistencies in a dataset to improve its quality. |
| What is data deduplication | The process of removing duplicate records to maintain data integrity and avoid redundancy. |
| What is data labeling | Assigning meaningful tags or categories to data samples to make them usable for machine learning models. |
| What is data transformation | The process of changing data into a format suitable for analysis, such as normalization or encoding. |
| What is feature selection | The process of choosing relevant features to improve model performance and reduce computational complexity. |
| What is imputation | The process of replacing missing values in a dataset with substituted values like the mean, median, or mode. |
| What is normalization | Rescaling numeric data to a range, typically between 0 and 1, to ensure fair contributions to a model. |
| What is one-hot encoding | Converting categorical data into binary vectors in which exactly one element is 1 (marking the category) and all others are 0. |
| What is outlier detection | Identifying data points that are significantly different from the rest of the data, often due to errors or unusual conditions. |
| What is PCA (Principal Component Analysis) | A technique used to reduce dimensionality by projecting data onto principal components that explain most of the variance. |
| What is SMOTE | Synthetic Minority Oversampling Technique, used to balance datasets by generating synthetic samples for the minority class. |
| What is standardization | Transforming data to have a mean of 0 and a standard deviation of 1 for consistent scaling. |
| What is the difference between structured and unstructured data | Structured data is organized into rows and columns, while unstructured data lacks predefined organization. |
| What is the difference between train-test split and cross-validation | Train-test split divides the data once, whereas cross-validation rotates through multiple splits so every sample is evaluated on, yielding a more reliable performance estimate. |
| What is the role of data integration | Combining data from multiple sources to ensure consistency and enable meaningful analysis. |
| Why is data preprocessing important | Because raw data may contain noise, errors, or irrelevant information that can hinder model learning. |
| Why is data splitting important | To divide data into training, validation, and test sets for unbiased evaluation of machine learning models. |
| Why is feature scaling necessary | To ensure features contribute equally to a machine learning model, avoiding dominance by larger values. |
| Why is handling missing data important | Because missing values can negatively affect model performance and lead to biased results. |
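Several of the cards above (imputation, normalization, standardization) can be made concrete in a few lines of plain Python. This is a minimal sketch with made-up values and hypothetical helper names, not exam material:

```python
# Minimal sketch (pure Python): mean imputation, min-max normalization,
# and z-score standardization on a single numeric feature.

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def normalize(values):
    """Rescale values to the [0, 1] range (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-score)."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

ages = [20, None, 40, 60]      # one missing value
filled = impute_mean(ages)     # [20, 40.0, 40, 60]
print(normalize(filled))       # [0.0, 0.5, 0.5, 1.0]
```

In practice libraries such as scikit-learn provide these transforms (`SimpleImputer`, `MinMaxScaler`, `StandardScaler`), but the arithmetic is exactly what the flashcard definitions describe.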
About the Flashcards
Flashcards for the AWS Certified AI Practitioner exam provide a concise refresher on every step that turns raw data into trustworthy input for analytics. Review definitions of data cleaning, labeling, integration, and deduplication, and see why managing missing values or outliers is critical to unbiased results and stronger model performance.
Cards also walk through practical preprocessing techniques you must recognize on test day: imputation strategies, normalization versus standardization, one-hot encoding, SMOTE balancing, PCA dimensionality reduction, feature selection, and proper train-test or cross-validation splits. This focused deck helps solidify terminology and processes you will spot in scenario questions.
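One-hot encoding is a frequent source of scenario-question confusion, so here is a minimal sketch of the idea in plain Python (the category names are invented for illustration):

```python
# One-hot encoding sketch: each category becomes a binary vector
# with a single 1 at the position of that category.

def one_hot(labels):
    """Map each label to a binary vector over the sorted unique categories."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    return [
        [1 if index[label] == i else 0 for i in range(len(categories))]
        for label in labels
    ]

colors = ["red", "green", "blue", "green"]
# Sorted categories: ['blue', 'green', 'red']
print(one_hot(colors))
# -> [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Because each vector contains exactly one 1, no artificial ordering is imposed on the categories, which is why one-hot encoding is preferred over integer labels for nominal features.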
Topics covered in this flashcard deck:
- Data cleaning & labeling
- Missing data handling
- Feature scaling & encoding
- Dimensionality reduction
- Class imbalance techniques
- Train-test validation splits
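The last topic, train-test versus cross-validation splits, can also be sketched with index bookkeeping alone. This is an illustrative toy (no shuffling, simple interleaved folds), not the scikit-learn implementation:

```python
# Sketch: a single train-test split vs. k-fold cross-validation indices.

def train_test_split_indices(n, test_ratio=0.25):
    """One split: hold out the last test_ratio fraction of n samples."""
    cut = int(n * (1 - test_ratio))
    return list(range(cut)), list(range(cut, n))

def kfold_indices(n, k):
    """k-fold CV: each of k interleaved folds serves once as the test set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield sorted(train), test

train, test = train_test_split_indices(8, test_ratio=0.25)
print(train, test)  # [0, 1, 2, 3, 4, 5] [6, 7]

for train, test in kfold_indices(8, k=4):
    print(train, test)  # every sample appears in exactly one test fold
```

The key contrast the flashcard draws is visible here: the single split evaluates on one fixed subset, while the k-fold loop eventually evaluates on every sample.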