A machine learning engineer is tasked with building a classification model and estimating its generalization error. They use a single loop of k-fold cross-validation: in each fold, a grid search trains candidate models on the training partition, the hyperparameters that score best on that fold's validation set are selected, and the model is then scored with those same hyperparameters on that same validation set. The final performance is reported as the average score across all folds. The model performs exceptionally well during this cross-validation procedure but fails to generalize to new production data. Which of the following is the most likely cause of this discrepancy?
Standard k-fold cross-validation is only appropriate for regression models, and a stratified approach should have been used for this classification task.
The process causes information leakage, leading to an optimistic performance estimate because the validation data influences both hyperparameter selection and performance evaluation.
The model is underfitting due to the reduced size of the training partition created in each fold of the cross-validation process.
The grid search for hyperparameter tuning is computationally inefficient and likely resulted in a globally suboptimal model.
The correct answer identifies the fundamental methodological flaw in the described procedure. When the same validation data is used both to select the best-performing hyperparameters (by seeing which candidates score highest on it) and to report the model's performance, information about the validation set leaks into the model-selection process. The result is an optimistically biased performance estimate: the hyperparameters are effectively overfit to the specific folds of the dataset. The proper way to obtain an unbiased estimate of generalization performance while also tuning hyperparameters is nested cross-validation, in which an outer loop assesses performance and an inner loop, with its own data splits, performs the hyperparameter tuning.
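The following minimal sketch contrasts the two procedures using scikit-learn. The toy dataset, SVC estimator, and parameter grid are illustrative assumptions, not details from the question; the point is the structure: the biased estimate scores on the same folds that picked the hyperparameters, while the nested estimate scores only on outer folds the inner grid search never saw.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Hypothetical toy data and hyperparameter grid for illustration only.
X, y = make_classification(n_samples=300, random_state=0)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)

search = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# Flawed single loop: the same folds both choose the hyperparameters and
# report the score, so best_score_ is optimistically biased.
search.fit(X, y)
print("single-loop (biased) estimate:", search.best_score_)

# Nested CV: each outer fold scores a model whose hyperparameters were
# tuned only on that fold's inner splits, so no validation data leaks
# into model selection.
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print("nested (unbiased) estimate:", nested_scores.mean())
```

On real tasks the single-loop estimate typically comes out higher than the nested one, and the nested estimate is the one that tracks production performance.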
The model is not underfitting; the scenario states that it performed exceptionally well in cross-validation, which is the opposite of underfitting. While grid search can be computationally inefficient, inefficiency does not explain the gap between high cross-validation scores and poor production performance. Finally, although stratified k-fold is often preferred for classification because it preserves class proportions in each fold, standard k-fold is still valid, and its use would not by itself produce such a large optimistic bias.
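As an aside on that last point, this small sketch (with a hypothetical, deliberately imbalanced toy dataset) shows what stratification buys you: StratifiedKFold keeps the class ratio roughly constant across test folds, whereas plain KFold on sorted data can leave most folds with no minority-class samples at all.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Assumed toy data: 10% positive class, labels sorted for illustration.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant for this comparison

for name, cv in [("KFold", KFold(5)), ("StratifiedKFold", StratifiedKFold(5))]:
    # Fraction of positive samples in each test fold.
    ratios = [y[test].mean() for _, test in cv.split(X, y)]
    print(name, "positive-class ratio per test fold:", ratios)
```

Here plain KFold yields four folds with no positives and one fold that is half positives, while StratifiedKFold holds every fold at the overall 10% rate.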