A data scientist is building a multiple linear regression model to predict housing prices. The initial model, using only the living area in square feet as a predictor, yields an R-squared value of 0.65. To improve the model, the data scientist adds ten additional predictor variables, including number of bedrooms, number of bathrooms, and age of the house. The new model results in an R-squared value of 0.78. Which of the following is the most critical consideration for the data scientist when interpreting this increase in R-squared?
An R-squared of 0.78 indicates that 78% of the model's predictions for house prices will be correct.
The increase from 0.65 to 0.78 definitively proves that the additional variables have strong predictive power and the new model is superior.
The new R-squared value is high, which invalidates the p-values of the individual coefficients in the model.
The R-squared value will almost always increase when more predictors are added to the model, regardless of their actual significance, potentially leading to overfitting.
The correct answer highlights a key limitation of the R-squared metric. R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables. A critical characteristic of R-squared is that it will either increase or stay the same whenever a new predictor variable is added to the model, even if that variable has no real relationship with the outcome. This mathematical property means that simply observing an increase in R-squared after adding more variables is not sufficient evidence of a better model. It may indicate that the model is becoming overly complex and fitting to the noise in the training data (overfitting), rather than capturing the true underlying relationships. Therefore, the most critical consideration is recognizing that this increase is expected and could be misleading.
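This property is easy to demonstrate numerically. The sketch below (a minimal simulation; the variable names and simulated data are illustrative, not from the question) fits ordinary least squares first with a single meaningful predictor, then with ten added columns of pure random noise, and shows that the training R-squared still does not decrease:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Outcome depends only on living area; every other predictor will be noise.
area = rng.uniform(500, 3500, n)
price = 100 * area + rng.normal(0, 20000, n)

def r_squared(X, y):
    """Fit OLS by least squares and return the training R-squared."""
    X = np.column_stack([np.ones(len(y)), X])  # add intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_simple = r_squared(area.reshape(-1, 1), price)

# Append ten predictors that carry no information about price.
noise = rng.normal(size=(n, 10))
r2_padded = r_squared(np.column_stack([area, noise]), price)

# Training R-squared can only go up (or stay equal) as columns are added.
assert r2_padded >= r2_simple
print(r2_simple, r2_padded)
```

The assertion holds by construction: adding a column to an OLS design matrix can never increase the residual sum of squares on the training data, which is exactly why a higher R-squared alone is weak evidence of a better model.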
The other options are incorrect. Stating that the increase definitively proves the new model is superior is a flawed interpretation because it ignores the risk of overfitting and the inherent tendency of R-squared to increase. R-squared does not measure predictive accuracy in terms of the percentage of correct predictions; it measures the proportion of explained variance. A high R-squared value does not invalidate the p-values of the coefficients; these are separate (though related) diagnostic measures of a regression model.
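A common remedy is to report adjusted R-squared alongside R-squared, since it penalizes the metric for the number of predictors. The sketch below applies the standard formula to the R-squared values from the question; the sample size n = 20 is a hypothetical choice for illustration (the question does not state one):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared: penalizes R-squared for model size.

    n = number of observations, p = number of predictors.
    """
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical small sample of n = 20 houses, using the R-squared
# values from the question: 0.65 with 1 predictor, 0.78 with 11.
simple = adjusted_r2(0.65, n=20, p=1)   # ~0.631
full = adjusted_r2(0.78, n=20, p=11)    # ~0.478
print(simple, full)
```

With a small sample, the eleven-predictor model's adjusted R-squared falls below the one-predictor model's despite its higher raw R-squared, illustrating how the penalty can flip the comparison when extra predictors add little real explanatory power.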