During hyper-parameter tuning of a Ridge regression model, you standardize all 120 numeric predictors and evaluate five penalty values (λ = 0, 0.1, 1, 10, 100) with 10-fold cross-validation. The average validation MSE falls as λ increases from 0, reaches its minimum around λ ≈ 5, then climbs steeply by the time λ reaches 100. Pre-processing and data splits have already been verified. Which explanation best accounts for the rise in validation error at very large λ values?
A high λ forces some coefficients exactly to zero, removing important predictors and increasing variance in the folds.
Large λ amplifies multicollinearity, making the coefficient estimates more sensitive to small changes in the data.
The matrix (XᵀX + λI) becomes non-invertible at large λ values, causing numerical instability that inflates the error.
A very large λ over-penalizes the weights, shrinking almost all coefficients toward zero and introducing high bias, so the model underfits the data.
The Ridge penalty adds λ∑β² to the loss. A moderate λ reduces variance by shrinking the coefficients and often improves generalization, but an excessively large λ drives almost every weight toward zero. When the model can no longer capture the underlying signal, it becomes highly biased and underfits, so the cross-validated MSE rises again. Adding λI (for λ > 0) never makes (XᵀX + λI) non-invertible; if anything, it improves the condition number. Unlike LASSO, Ridge does not set coefficients exactly to zero, and its L2 penalty mitigates rather than amplifies multicollinearity. Therefore the increase in error is best explained by underfitting caused by an overly strong penalty.
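To make the scenario concrete, here is a minimal sketch using scikit-learn on synthetic data; the 500-sample make_regression dataset and the extra λ = 10 000 grid point are illustrative assumptions, not part of the question. It sweeps the penalty with 10-fold cross-validation and shows the validation MSE rising again once the penalty becomes very strong.

```python
# Minimal sketch (synthetic data, assumed settings): sweep Ridge penalties with
# 10-fold CV and watch the average validation MSE rise again at very large lambda.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the 120 standardized numeric predictors.
X, y = make_regression(n_samples=500, n_features=120, n_informative=40,
                       noise=10.0, random_state=0)

for lam in [0, 0.1, 1, 10, 100, 10_000]:  # 10_000 added to exaggerate the underfitting regime
    # Scale inside the pipeline so each fold is standardized on its own training split.
    model = make_pipeline(StandardScaler(), Ridge(alpha=lam))
    mse = -cross_val_score(model, X, y, cv=10,
                           scoring="neg_mean_squared_error").mean()
    print(f"lambda = {lam:>7}: mean CV MSE = {mse:.1f}")

# Expected pattern: MSE falls for moderate lambda, then climbs once the penalty
# shrinks nearly every coefficient toward zero and the model underfits.
```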
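The explanation's other two claims can be checked the same way. The rough sketch below (again on assumed synthetic data with arbitrary α values) shows that LASSO zeroes out coefficients while Ridge only shrinks them, and that adding λI to XᵀX lowers its condition number rather than making it singular.

```python
# Rough sketch (assumed synthetic data and alpha values): Ridge shrinks but does
# not zero out coefficients, LASSO does, and (X^T X + lambda*I) only becomes
# better conditioned as lambda grows.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=50, n_informative=10,
                       noise=5.0, random_state=1)
X = StandardScaler().fit_transform(X)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)
print("Ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)))  # typically 0
print("LASSO coefficients exactly zero:", int(np.sum(lasso.coef_ == 0)))  # typically many

XtX = X.T @ X
for lam in [0, 1, 100]:
    cond = np.linalg.cond(XtX + lam * np.eye(X.shape[1]))
    print(f"lambda = {lam:>3}: condition number of (XtX + lambda*I) = {cond:.2e}")
```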