
Data Preparation and Processing Flashcards

Front
Why is data splitting important
Back
To divide data into training, validation, and test sets for unbiased evaluation of machine learning models.
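A minimal sketch of a hold-out split, assuming scikit-learn is available; a validation set can be carved out of the training portion with a second split:

    from sklearn.model_selection import train_test_split

    X = [[i] for i in range(10)]   # toy feature matrix
    y = [0, 1] * 5                 # toy labels

    # Hold out 20% of the samples for unbiased final evaluation.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(len(X_train), len(X_test))   # 8 2
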
Front
What is data labeling
Back
Assigning meaningful tags or categories to data samples to make them usable for machine learning models.
Front
What is imputation
Back
The process of replacing missing values in a dataset with substituted values like the mean, median, or mode.
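A minimal sketch using scikit-learn's SimpleImputer; the strategy can be "mean", "median", or "most_frequent" (mode):

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [7.0, np.nan]])

    # Each NaN is replaced by the mean of its column.
    imputer = SimpleImputer(strategy="mean")
    print(imputer.fit_transform(X))
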
Front
What is PCA (Principal Component Analysis)
Back
A technique used to reduce dimensionality by projecting data onto principal components that explain most of the variance.
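A minimal sketch with scikit-learn's PCA on random toy data:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))          # 100 samples, 5 features

    # Project onto the 2 components that capture the most variance.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)                 # (100, 2)
    print(pca.explained_variance_ratio_)   # variance explained per component
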
Front
Why is feature scaling necessary
Back
To ensure features contribute equally to a machine learning model, avoiding dominance by larger values.
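An illustrative sketch (the feature names and ranges here are invented) of how an unscaled feature dominates a Euclidean distance:

    import numpy as np

    # Age (tens) vs. income (tens of thousands): the raw distance is
    # driven almost entirely by the income column.
    a = np.array([25, 50_000])
    b = np.array([60, 51_000])
    print(np.linalg.norm(a - b))           # ~1000.6

    # After rescaling both features to [0, 1], each contributes comparably.
    lo = np.array([18, 20_000])
    hi = np.array([70, 120_000])
    a_s = (a - lo) / (hi - lo)
    b_s = (b - lo) / (hi - lo)
    print(np.linalg.norm(a_s - b_s))       # ~0.67
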
Front
What is feature selection
Back
The process of choosing relevant features to improve model performance and reduce computational complexity.
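One possible approach, sketched with scikit-learn's SelectKBest on the built-in iris dataset:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)

    # Keep the 2 features with the strongest ANOVA F-score vs. the labels.
    selector = SelectKBest(score_func=f_classif, k=2)
    X_selected = selector.fit_transform(X, y)
    print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
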
Front
Why is data preprocessing important
Back
Because raw data may contain noise, errors, or irrelevant information that can hinder model learning.
Front
What is normalization
Back
Rescaling numeric data to a range, typically between 0 and 1, to ensure fair contributions to a model.
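A minimal sketch with scikit-learn's MinMaxScaler:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[1.0], [5.0], [10.0]])

    # Min-max scaling: x' = (x - min) / (max - min), mapping to [0, 1].
    print(MinMaxScaler().fit_transform(X))  # [[0.], [0.444...], [1.]]
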
Front
What is SMOTE
Back
Synthetic Minority Oversampling Technique, used to balance datasets by generating synthetic samples for the minority class.
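A minimal sketch, assuming the third-party imbalanced-learn package (which provides SMOTE) is installed alongside scikit-learn:

    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # A roughly 9:1 imbalanced toy dataset.
    X, y = make_classification(n_samples=200, weights=[0.9, 0.1],
                               random_state=0)
    print(Counter(y))                      # minority class is rare

    # SMOTE interpolates between a minority sample and its nearest
    # minority-class neighbors to synthesize new minority samples.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y_res))                  # classes now balanced
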
Front
What is data cleaning
Back
The process of detecting and correcting errors or inconsistencies in a dataset to improve its quality.
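An illustrative sketch with pandas; the toy data and cleaning rules are invented for the example:

    import pandas as pd

    df = pd.DataFrame({
        "age":  [25, -3, 40],          # -3 is an impossible value
        "city": ["NYC", "nyc", None],  # inconsistent casing, missing entry
    })

    df["age"] = df["age"].where(df["age"] >= 0)  # mark invalid ages missing
    df["city"] = df["city"].str.upper()          # normalize casing
    print(df.dropna())                           # drop rows still incomplete
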
Front
What is the role of data integration
Back
Combining data from multiple sources to ensure consistency and enable meaningful analysis.
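A minimal sketch of joining two toy tables on a shared key with pandas:

    import pandas as pd

    customers = pd.DataFrame({"id": [1, 2], "name": ["Ana", "Bo"]})
    orders    = pd.DataFrame({"id": [1, 1, 2], "total": [9.5, 20.0, 3.25]})

    # Join two sources on a shared key so they can be analyzed together.
    print(customers.merge(orders, on="id", how="inner"))
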
Front
What is outlier detection
Back
Identifying data points that are significantly different from the rest of the data, often due to errors or unusual conditions.
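One common approach, sketched with the interquartile-range (IQR) rule; the 1.5 multiplier is a conventional choice, not a fixed requirement:

    import numpy as np

    data = np.array([10.0, 11.0, 10.5, 9.8, 10.2, 55.0])

    # IQR rule: flag points more than 1.5 * IQR outside the quartiles.
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
    print(data[mask])                      # [55.]
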
Front
What is data transformation
Back
The process of changing data into a format suitable for analysis, such as normalization or encoding.
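One common transformation, sketched with a log transform on invented, right-skewed values:

    import numpy as np

    incomes = np.array([20_000.0, 45_000.0, 1_200_000.0])  # right-skewed

    # A log transform compresses the long tail into a more
    # symmetric distribution that many models handle better.
    print(np.log1p(incomes))
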
Front
What is data augmentation
Back
Creating additional training samples by modifying existing data, often used in image or text datasets.
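An illustrative sketch of one image-style augmentation, a horizontal flip, on a toy array:

    import numpy as np

    image = np.arange(9).reshape(3, 3)   # stand-in for a tiny grayscale image

    # A horizontal flip is label-preserving, so it yields a "new"
    # training sample for free.
    print(np.fliplr(image))
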
Front
What is one-hot encoding
Back
Converting categorical data into binary vectors where each category gets its own 0/1 column and exactly one position is set to 1 per sample.
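A minimal sketch with pandas.get_dummies on an invented column:

    import pandas as pd

    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

    # Each category becomes its own 0/1 column; exactly one is 1 per row.
    print(pd.get_dummies(df, columns=["color"], dtype=int))
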
Front
What is the difference between train-test split and cross-validation
Back
Train-test split partitions the data once, whereas cross-validation rotates which fold is held out so every sample is tested exactly once, giving a more reliable performance estimate.
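A minimal sketch of 5-fold cross-validation with scikit-learn:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    # 5-fold CV: every sample is held out exactly once, yielding
    # 5 scores instead of a single split's one.
    scores = cross_val_score(model, X, y, cv=5)
    print(scores.mean(), scores.std())
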
Front
What is data deduplication
Back
The process of removing duplicate records to maintain data integrity and avoid redundancy.
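A minimal sketch with pandas.drop_duplicates on invented records:

    import pandas as pd

    df = pd.DataFrame({"email": ["a@x.com", "b@x.com", "a@x.com"],
                       "plan":  ["free", "pro", "free"]})

    # Keep the first occurrence of each fully duplicated row.
    print(df.drop_duplicates())
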
Front
What is standardization
Back
Transforming data to have a mean of 0 and a standard deviation of 1 for consistent scaling.
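A minimal sketch with scikit-learn's StandardScaler:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0], [5.0], [10.0]])

    # z-score scaling: x' = (x - mean) / std.
    X_std = StandardScaler().fit_transform(X)
    print(X_std.mean(), X_std.std())       # ~0.0, 1.0
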
Front
What is the difference between structured and unstructured data
Back
Structured data is organized into rows and columns, while unstructured data lacks predefined organization.
Front
Why is handling missing data important
Back
Because missing values can negatively affect model performance and lead to biased results.
This deck focuses on the steps involved in preparing and processing data for machine learning models, including data cleaning, labeling, and transformation techniques.