A data scientist is working with a binary fraud-detection dataset of 1,000,000 observations, of which only 0.2% are labeled as fraud. The model of choice is a gradient-boosted decision tree. The scientist plans to mitigate the extreme class imbalance with the Synthetic Minority Over-sampling Technique (SMOTE) and to assess performance with 5-fold stratified cross-validation before evaluating on a separate, untouched test set whose class distribution mirrors production.
Which procedure is the most appropriate for oversampling in this scenario so that the minority class is strengthened without introducing optimistic validation bias or excessive overfitting?
Run SMOTE on the entire dataset first so that synthetic minority records are present in every cross-validation fold.
Inside each cross-validation fold, apply SMOTE solely to the training partition, then train the model on that augmented data and validate on the untouched fold hold-out.
Build an ensemble that draws bootstrap samples from the majority class only, keeping each minority instance exactly once in every bootstrap replica.
Before cross-validation, duplicate every minority-class record 499 times to obtain a perfectly balanced 1:1 class ratio, then train and validate on this expanded dataset.
Applying SMOTE only to the training partition inside each cross-validation fold keeps the validation data distribution unchanged and prevents synthetic records (built from information in the training partition) from leaking into the fold's validation split. This preserves an unbiased estimate of generalization performance while still enriching the minority class during learning.
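As an illustration, the sketch below shows one common way to implement this in Python with scikit-learn and imbalanced-learn: SMOTE is placed inside an imblearn Pipeline, which resamples only the data the estimator is fitted on, so each validation fold is scored untouched. The synthetic dataset and all variable names here are placeholders, not the scientist's actual data.

```python
# Sketch: fold-wise SMOTE inside 5-fold stratified cross-validation.
# Assumes scikit-learn and imbalanced-learn are installed; the generated
# X, y below stand in for the real fraud table.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Placeholder data with roughly 0.2% positives.
X, y = make_classification(n_samples=100_000, n_features=20,
                           weights=[0.998, 0.002], random_state=0)

# The imblearn Pipeline applies SMOTE only when the estimator is fitted,
# i.e. only to the training partition of each fold; validation data pass
# straight to the classifier unchanged.
pipeline = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("gbdt", HistGradientBoostingClassifier(random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="average_precision")
print("PR-AUC per fold:", scores)
```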
If SMOTE is run before cross-validation (or on the whole dataset), synthetic observations derived from one original minority instance can appear simultaneously in both the training and validation partitions, inflating metrics. Simply duplicating minority rows on a massive scale carries the same leakage risk and also encourages the model to memorize exact copies, increasing overfitting. Bootstrapping only the majority class re-uses existing majority records but never increases minority support, so it fails to correct the imbalance problem.
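For contrast, the leaky pattern described above would look roughly like the following sketch (reusing the assumed X and y from the previous example): SMOTE is applied to the whole dataset before the folds are drawn, so synthetic points interpolated from a minority record can land in the validation fold while their source records sit in the training fold.

```python
# Anti-pattern sketch: resampling BEFORE cross-validation leaks information,
# so the reported scores are optimistically biased.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X, y)  # whole dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
leaky_scores = cross_val_score(HistGradientBoostingClassifier(random_state=0),
                               X_resampled, y_resampled, cv=cv,
                               scoring="average_precision")
print("Inflated PR-AUC per fold:", leaky_scores)  # do not trust these numbers
```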
Therefore, the fold-wise application of SMOTE to the training split is the only strategy that both corrects the skew and yields reliable validation scores.
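To round out the workflow described in the question, the same pipeline can then be refit on the full training data and scored once on the untouched test set whose class mix mirrors production. The split below is only a stand-in for that held-out set, with hypothetical variable names.

```python
# Sketch of the final evaluation step, assuming the same pipeline as above
# and a held-out test set (X_test, y_test) that was never resampled.
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Stratified hold-out split standing in for the real, untouched test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipeline.fit(X_train, y_train)             # SMOTE runs on the training data only
proba = pipeline.predict_proba(X_test)[:, 1]
print("Test PR-AUC:", average_precision_score(y_test, proba))
```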