You are tuning a logistic-regression fraud detector trained on 455 000 legitimate and 5 000 fraudulent transactions (≈ 1 % positives). A baseline model built on the imbalanced data yields an average F1 of 0.12 under stratified 5-fold cross-validation (CV). You then apply random oversampling so that each training split becomes 50 / 50 positive-to-negative, keeping the validation folds untouched. After retraining, you observe:
Training-set F1: 0.93
Cross-validated F1: 0.10
Which explanation best accounts for the drop in CV performance despite the much higher training score?
Oversampling should always lower variance, so the CV drop indicates target leakage between your folds rather than any overfitting problem.
Duplicating the same minority transactions through random oversampling caused the model to overfit to those repeats, inflating training F1 but hurting generalization.
Oversampling only shifts the decision threshold without affecting learned parameters; the lower CV F1 is expected until you retune the threshold.
The oversampler injected label noise that increases model bias; therefore training F1 should have fallen, so the discrepancy must come from a metric-calculation error.
Random oversampling copies minority-class points with replacement until the desired balance is reached. Because the minority class originally contains only 5 000 unique examples, many of them are duplicated. Logistic regression can then memorize these repeats, fitting larger coefficients that classify the duplicates almost perfectly; hence the very high training F1. The validation folds, however, still contain the original 1 % minority rate and no duplicated points. The model therefore generalizes poorly, and its F1 actually falls below the baseline. The gap is a textbook symptom of overfitting caused by duplicated samples, not of data leakage, label noise, or threshold mis-calibration.
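The pattern described above can be reproduced in a few lines of scikit-learn. This is a minimal sketch on a small synthetic dataset with a 1 % positive rate (an assumption standing in for the 455 000 / 5 000 transactions in the question, which are not available here): the minority class is oversampled with replacement inside each training split only, the validation fold keeps its natural imbalance, and the training F1 comes out far above the cross-validated F1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic stand-in for the fraud data: ~1% positives (assumption,
# not the actual dataset from the question).
rng = np.random.RandomState(0)
n_neg, n_pos = 9900, 100
X = np.vstack([rng.normal(0.0, 1.0, (n_neg, 5)),
               rng.normal(0.5, 1.0, (n_pos, 5))])
y = np.array([0] * n_neg + [1] * n_pos)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
train_scores, cv_scores = [], []
for tr, va in skf.split(X, y):
    X_tr, y_tr = X[tr], y[tr]
    # Oversample the minority class *inside* the training split only;
    # the validation fold is left untouched at its original 1% rate.
    pos = np.where(y_tr == 1)[0]
    neg = np.where(y_tr == 0)[0]
    pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=0)
    idx = np.concatenate([neg, pos_up])
    clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    # F1 on the balanced (duplicated) training data vs. the untouched fold.
    train_scores.append(f1_score(y_tr[idx], clf.predict(X_tr[idx])))
    cv_scores.append(f1_score(y[va], clf.predict(X[va])))

print(f"balanced-train F1: {np.mean(train_scores):.2f}")
print(f"cross-val F1:      {np.mean(cv_scores):.2f}")
```

The exact scores depend on the synthetic data, but the balanced-train F1 consistently lands well above the cross-validated F1, mirroring the 0.93-vs-0.10 gap in the question.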