During model selection you compare several penalized linear regressions on a data set with 500 predictors and 300 observations. The model chosen by 10-fold cross-validation minimizes the residual sum of squares plus a penalty term λ‖β‖₁, and only 42 predictors keep non-zero coefficients. Which characteristic of the L1 penalty best explains why this approach produces a much sparser solution than a model that instead penalizes λ‖β‖₂² (ridge regression)?
The L1 constraint region is a diamond with sharp, axis-aligned corners, so the optimum often falls on a corner where one or more coefficients are zero.
Because the L1 penalty replaces mean squared error with mean absolute error, gradients vanish for small coefficients and push them to zero.
The L1 penalty minimizes each predictor's variance inflation factor and discards any term whose VIF exceeds a preset threshold.
The L1 penalty clusters highly correlated predictors into principal components, forcing the remaining component loadings to zero.
The LASSO adds an L1 (absolute-value) penalty, which is equivalent to constraining the coefficients to lie inside a diamond-shaped region whose corners sit on the coordinate axes. Because the contours of the least-squares loss are ellipses, the first point at which a contour touches the diamond is often one of these corners, where at least one coefficient is exactly zero. In contrast, an L2 penalty (ridge regression) corresponds to a circular constraint region with no corners, so the optimum rarely lands with any coefficient exactly equal to zero: ridge shrinks coefficients toward zero but does not eliminate them. The other statements describe effects (changing the error metric, PCA-like grouping, or VIF thresholding) that are not part of the LASSO optimization.
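To see the contrast concretely, here is a minimal sketch, assuming scikit-learn is available and using a synthetic data set sized like the scenario above (300 observations, 500 predictors, 42 informative features). It fits a cross-validated lasso and a ridge model and counts the non-zero coefficients; the exact counts will vary with the simulated data and chosen penalty strengths.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic wide data: more predictors (500) than observations (300).
# Dimensions mirror the question; the data itself is purely illustrative.
X, y = make_regression(
    n_samples=300, n_features=500, n_informative=42, noise=10.0, random_state=0
)

# L1-penalized fit: the penalty strength (alpha) is chosen by 10-fold CV.
lasso = LassoCV(cv=10, random_state=0).fit(X, y)

# L2-penalized fit over an assumed grid of penalty strengths.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X, y)

print("Lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))
print("Ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))
# Typical outcome: the lasso sets most coefficients exactly to zero,
# while ridge shrinks all 500 toward zero but leaves them non-zero.
```

The count of non-zero lasso coefficients illustrates the corner-of-the-diamond geometry described above, whereas the ridge fit keeps every predictor with a small but non-zero weight.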