A data scientist is building a fraud detection model for a financial institution. The historical transaction dataset is highly imbalanced, with fraudulent transactions (the minority class) accounting for only 0.5% of the data. A baseline model trained on this data shows high accuracy but has an extremely low recall for the fraud class. The scientist needs to apply a mitigation technique to rebalance the training data. Which of the following approaches best addresses the class imbalance by creating new, varied examples for the minority class, thereby reducing the specific risk of overfitting that arises from simple duplication?
Synthetic Minority Oversampling Technique (SMOTE)
Applying L2 regularization to the baseline model
Randomly undersampling the non-fraudulent (majority) class
Randomly oversampling the fraudulent (minority) class by duplication
The correct answer is the Synthetic Minority Oversampling Technique (SMOTE). SMOTE creates new, synthetic data points for the minority class: it selects a minority-class instance, finds its k nearest minority-class neighbors, and generates synthetic instances by interpolating between the selected instance and those neighbors. This is superior to simple oversampling because it produces new, plausible examples rather than duplicates of existing ones, which helps the model generalize and reduces the risk of overfitting.
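As a concrete illustration, the sketch below applies SMOTE using the imbalanced-learn package (imblearn). The dataset is simulated with scikit-learn's make_classification; the sample size, seed, and k_neighbors value are illustrative assumptions, not part of the question.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Simulate a fraud-like dataset: roughly 0.5% fraudulent (class 1).
X, y = make_classification(
    n_samples=20_000, n_features=10,
    weights=[0.995, 0.005], random_state=42,
)
print("Before:", Counter(y))

# k_neighbors is the number of nearest minority-class neighbors
# used when interpolating each new synthetic point.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("After: ", Counter(y_res))  # minority count now matches the majority
```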
Randomly undersampling the majority class is incorrect because it discards a large number of potentially informative majority-class samples, which can lose information and bias the model.
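For contrast, here is a minimal sketch of random undersampling with imblearn's RandomUnderSampler, under the same illustrative setup as above; note how far the training set shrinks.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(
    n_samples=20_000, n_features=10,
    weights=[0.995, 0.005], random_state=42,
)
# Drop majority-class rows until the classes match in size.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))  # both classes reduced to the minority count
```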
Randomly oversampling the minority class is less effective because it simply duplicates existing minority-class samples. Since no new information is added, the model can overfit, memorizing the duplicated examples instead of learning the underlying patterns of fraud.
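The duplication is easy to verify. The sketch below uses imblearn's RandomOverSampler (the same tooling assumption as above) and counts the distinct minority rows after resampling.

```python
import numpy as np
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(
    n_samples=20_000, n_features=10,
    weights=[0.995, 0.005], random_state=42,
)
X_dup, y_dup = RandomOverSampler(random_state=42).fit_resample(X, y)
print(Counter(y_dup))

# Every added minority row is an exact copy of an original one,
# so the number of distinct minority rows does not grow.
print("unique minority rows:", np.unique(X_dup[y_dup == 1], axis=0).shape[0])
```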
Applying L2 regularization is incorrect in this context. Regularization combats overfitting by penalizing large model coefficients, but it does nothing to rebalance the training data: the minority class remains just as underrepresented.
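For completeness, here is what L2 regularization looks like in scikit-learn's LogisticRegression; the parameter values and the X_train/y_train names are illustrative placeholders. The penalty shapes the coefficients, not the class distribution.

```python
from sklearn.linear_model import LogisticRegression

# L2 regularization adds a penalty proportional to the squared
# coefficient magnitudes (strength controlled by C, the inverse
# regularization weight). It constrains the model, not the data.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
# clf.fit(X_train, y_train)  # would still see ~200 legitimate
#                            # transactions per fraudulent one
```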