A data scientist has developed a multiple linear regression model to predict housing prices. After the initial training, the scientist examines the model's performance by creating a residual vs. fitted values plot. The plot reveals that the residuals are not randomly scattered around the zero line; instead, they form a distinct, parabolic (U-shaped) pattern. What is the most likely issue with the model, and what is the most appropriate next step in the model design iteration process?
The model is likely overfitting the training data. The next step should be to increase the L2 regularization penalty (e.g., in a Ridge regression) to reduce the model's complexity.
The model exhibits non-linearity, indicating it fails to capture the underlying structure of the data. The next step should be to use feature engineering to create polynomial terms for the relevant predictors.
The plot shows evidence of heteroscedasticity, meaning the variance of the errors is not constant. The next step should be to apply a Box-Cox transformation to the response variable to stabilize the variance.
The plot reveals multicollinearity among the predictor variables. The next step should be to calculate the Variance Inflation Factor (VIF) for each feature and consider removing highly correlated predictors.
The correct option identifies non-linearity as the issue and suggests creating polynomial features as the solution. A parabolic or U-shaped pattern in a residual vs. fitted values plot is a classic indicator that the linear model is failing to capture a non-linear relationship in the data. This is a form of underfitting, where the model is too simple. The appropriate corrective action is to engineer new features that can account for this curvature, such as adding squared or cubic terms of the existing predictors (polynomial features).
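The diagnosis and the fix can be sketched numerically. The example below is a minimal illustration on synthetic data, not the scientist's actual model: the variable names (`size`, `price`), the quadratic data-generating process, and the helper `residual_curvature` are all assumptions made for the demonstration. It fits a straight-line model, measures how much of the residuals can be explained by a quadratic in the fitted values (a numeric stand-in for "seeing a U-shape"), and then repeats the check after adding a squared term.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): price depends on size quadratically.
size = rng.uniform(0.0, 10.0, 200)
price = 50.0 + 3.0 * size + 2.0 * size**2 + rng.normal(0.0, 5.0, 200)


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns (fitted, residuals)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    return fitted, y - fitted


def r_squared(y, residuals):
    """Share of the variance in y that is NOT left in the residuals."""
    return 1.0 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)


def residual_curvature(fitted, residuals):
    """R^2 from regressing residuals on fitted and fitted^2.

    Values near 0 suggest no parabolic pattern; values near 1 suggest a strong one.
    """
    _, leftover = fit_ols(np.column_stack([fitted, fitted**2]), residuals)
    return r_squared(residuals, leftover)


# Model 1: straight line in size (too simple for this data).
fitted_lin, resid_lin = fit_ols(size, price)

# Model 2: same model plus an engineered polynomial feature, size^2.
fitted_poly, resid_poly = fit_ols(np.column_stack([size, size**2]), price)

curv_lin = residual_curvature(fitted_lin, resid_lin)
curv_poly = residual_curvature(fitted_poly, resid_poly)

print(f"curvature in residuals, linear model:     {curv_lin:.3f}")
print(f"curvature in residuals, polynomial model: {curv_poly:.3f}")
```

Under these assumptions, the linear model's residuals are largely explained by the quadratic term, while the polynomial model's residuals are not. In practice the same comparison is usually made by eye on the two residual plots.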
The option suggesting heteroscedasticity is incorrect because heteroscedasticity typically appears as a cone or fan shape in the residual plot, where the spread of residuals changes as the fitted values increase or decrease. While a Box-Cox transformation is a valid technique to address non-constant variance, it is not the primary solution for the U-shaped pattern described.
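For contrast, a fan shape can be sketched the same way. The example below uses synthetic data in which the noise grows with the predictor (an assumption chosen to produce heteroscedasticity), and it uses the correlation between absolute residuals and fitted values as a rough numeric proxy for the fan shape. A log transform of the response stands in for Box-Cox here, since the log is the Box-Cox transform with lambda equal to 0; a real analysis would estimate lambda from the data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed): noise is multiplicative, so its spread grows with x.
x = rng.uniform(1.0, 10.0, 500)
y = x * (2.0 + 0.3 * rng.normal(0.0, 1.0, 500))


def fit_line(x, y):
    """Simple least-squares line with intercept; returns (fitted, residuals)."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    return fitted, y - fitted


def spread_trend(fitted, residuals):
    """Correlation of |residuals| with fitted values; near 0 means constant spread."""
    return float(np.corrcoef(np.abs(residuals), fitted)[0, 1])


# Raw response: residual spread widens as fitted values increase (fan shape).
fitted_raw, resid_raw = fit_line(x, y)
spread_raw = spread_trend(fitted_raw, resid_raw)

# Log-transformed response and predictor: the multiplicative noise becomes additive.
fitted_log, resid_log = fit_line(np.log(x), np.log(y))
spread_log = spread_trend(fitted_log, resid_log)

print(f"spread trend, raw response: {spread_raw:.3f}")
print(f"spread trend, log response: {spread_log:.3f}")
```

The point of the contrast is that this check targets the spread of the residuals, not their mean, which is why it is the wrong tool for a U-shaped pattern.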
The option suggesting multicollinearity is incorrect because multicollinearity, the correlation between predictor variables, is not diagnosed using a residual vs. fitted plot. It is typically identified using a correlation matrix or by calculating the Variance Inflation Factor (VIF).
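As a rough sketch of the VIF calculation mentioned above: the VIF for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The function `vif` and the synthetic predictors below (`sqft`, `living_area`, `age`) are hypothetical names introduced for illustration only.

```python
import numpy as np


def vif(X):
    """Variance Inflation Factor for each column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus every other column.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


rng = np.random.default_rng(3)

# Synthetic predictors (assumed): living_area is nearly a copy of sqft; age is unrelated.
sqft = rng.normal(0.0, 1.0, 300)
living_area = sqft + 0.1 * rng.normal(0.0, 1.0, 300)
age = rng.normal(0.0, 1.0, 300)

vifs = vif(np.column_stack([sqft, living_area, age]))
for name, value in zip(["sqft", "living_area", "age"], vifs):
    print(f"{name:12s} VIF = {value:.1f}")
```

Note that the response variable never enters the calculation. VIF depends only on the predictors, which is consistent with multicollinearity not being visible in a residual vs. fitted plot.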
The option suggesting overfitting is incorrect. A U-shaped residual plot indicates underfitting (the model is too simple to capture the underlying pattern), not overfitting. Increasing regularization would further simplify the model, likely worsening the issue.
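A small sketch can illustrate why more regularization does not help here. It uses the closed-form ridge slope for a single centered predictor with an unpenalized intercept, on synthetic quadratic data; the data, the `alphas` grid, and the helper `ridge_training_sse` are assumptions for illustration. As the penalty grows, the slope shrinks toward zero and the training error of the already too-simple model can only stay the same or increase.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (assumed): the true relationship is quadratic.
size = rng.uniform(0.0, 10.0, 200)
price = 50.0 + 3.0 * size + 2.0 * size**2 + rng.normal(0.0, 5.0, 200)


def ridge_training_sse(x, y, alpha):
    """Training sum of squared errors for one-predictor ridge regression.

    The intercept is left unpenalized by centering x and y first.
    """
    xc = x - x.mean()
    yc = y - y.mean()
    slope = (xc @ yc) / (xc @ xc + alpha)  # closed-form ridge slope
    residuals = yc - slope * xc
    return float(residuals @ residuals)


alphas = [0.0, 10.0, 1000.0, 100000.0]
sse = [ridge_training_sse(size, price, a) for a in alphas]

for a, s in zip(alphas, sse):
    print(f"alpha = {a:>9.1f}   training SSE = {s:,.0f}")
```

No value of the penalty adds the missing curvature; only a change to the feature set, such as a squared term, can do that.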