Data Preparation and Processing Flashcards
AWS Certified AI Practitioner AIF-C01 Flashcards
| Front | Back |
|-------|------|
| What is data augmentation | Creating additional training samples by modifying existing data, often used in image or text datasets. |
| What is data cleaning | The process of detecting and correcting errors or inconsistencies in a dataset to improve its quality. |
| What is data deduplication | The process of removing duplicate records to maintain data integrity and avoid redundancy. |
| What is data labeling | Assigning meaningful tags or categories to data samples to make them usable for machine learning models. |
| What is data transformation | The process of changing data into a format suitable for analysis, such as normalization or encoding. |
| What is feature selection | The process of choosing relevant features to improve model performance and reduce computational complexity. |
| What is imputation | The process of replacing missing values in a dataset with substituted values like the mean, median, or mode. |
| What is normalization | Rescaling numeric data to a range, typically between 0 and 1, to ensure fair contributions to a model. |
| What is one-hot encoding | Converting categorical data into binary vectors in which the position for the matching category is 1 and all other positions are 0. |
| What is outlier detection | Identifying data points that are significantly different from the rest of the data, often due to errors or unusual conditions. |
| What is PCA (Principal Component Analysis) | A technique used to reduce dimensionality by projecting data onto principal components that explain most of the variance. |
| What is SMOTE | Synthetic Minority Oversampling Technique, used to balance datasets by generating synthetic samples for the minority class. |
| What is standardization | Transforming data to have a mean of 0 and a standard deviation of 1 for consistent scaling. |
| What is the difference between structured and unstructured data | Structured data is organized into rows and columns, while unstructured data lacks predefined organization. |
| What is the difference between train-test split and cross-validation | Train-test split divides the data once into training and test sets, whereas cross-validation repeatedly splits it into folds so every sample is used for evaluation, giving a more reliable estimate. |
| What is the role of data integration | Combining data from multiple sources to ensure consistency and enable meaningful analysis. |
| Why is data preprocessing important | Because raw data may contain noise, errors, or irrelevant information that can hinder model learning. |
| Why is data splitting important | To divide data into training, validation, and test sets for unbiased evaluation of machine learning models. |
| Why is feature scaling necessary | To ensure features contribute equally to a machine learning model, avoiding dominance by larger values. |
| Why is handling missing data important | Because missing values can negatively affect model performance and lead to biased results. |
This deck focuses on the steps involved in preparing and processing data for machine learning models, including data cleaning, labeling, and transformation techniques.
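The definitions above are tool-agnostic; the short sketches that follow illustrate a few of the techniques in Python, assuming scikit-learn, pandas, and NumPy are available (these libraries and the toy values are illustrative assumptions, not part of the exam material). First, a minimal sketch contrasting normalization (min-max scaling to the range 0 to 1) with standardization (mean 0, standard deviation 1):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix: two columns on very different scales
X = np.array([[50.0, 2000.0],
              [60.0, 3000.0],
              [80.0, 10000.0]])

X_minmax = MinMaxScaler().fit_transform(X)      # normalization: each column rescaled to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # standardization: each column to mean 0, std 1

print(X_minmax)
print(X_standard)
```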
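A minimal sketch of one-hot encoding, assuming pandas and a hypothetical `color` column:

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each category becomes its own 0/1 column; exactly one column is 1 per row
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```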
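A minimal imputation sketch, assuming scikit-learn's `SimpleImputer` and a hypothetical array with missing values; mean imputation is shown, but median or most-frequent work the same way:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values marked as np.nan
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each missing value with its column mean
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(X_imputed)
```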
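A sketch contrasting a single train-test split with 5-fold cross-validation, assuming scikit-learn's bundled Iris dataset and a logistic regression model chosen purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Single split: one training set, one held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# Cross-validation: the data is split into 5 folds and each fold takes a turn
# as the evaluation set, giving a more reliable estimate than one split
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("cross-validation accuracy:", scores.mean())
```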
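A PCA sketch, again assuming the Iris dataset, reducing four features to the two principal components that explain most of the variance:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4-dimensional measurements onto the 2 components
# that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```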
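A SMOTE sketch, assuming the separate imbalanced-learn package is installed and using a synthetic, deliberately imbalanced dataset:

```python
# Requires the imbalanced-learn package (pip install imbalanced-learn)
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Hypothetical imbalanced dataset: roughly 10% minority class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating
# between existing minority-class neighbors
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))
```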
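Finally, a cleaning sketch combining deduplication with simple IQR-based outlier detection on a hypothetical orders table:

```python
import pandas as pd

# Hypothetical raw records with one duplicate row and one extreme price
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "price":    [19.9, 25.0, 25.0, 22.5, 9999.0],
})

# Deduplication: drop exact duplicate records
df = df.drop_duplicates()

# Outlier detection with the IQR rule: flag values far outside the middle 50%
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(outliers)
```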