You are tasked with predicting compressor efficiency using ordinary least-squares linear regression with temperature, pressure, and rotational speed as predictors. Five-fold cross-validation yields a mean training R² of 0.93 but a validation R² of only 0.48. In the residual-versus-fitted plot, the residuals form a clear U-shaped curve: they are negative at mid-range fitted values and positive at both low and high fitted values. Variance-inflation factors for all predictors are below 2, and the residuals show approximately constant variance. Given this evidence, which data issue is most likely responsible for the model's poor generalization?
Multicollinearity among the predictors, which inflates coefficient variance and destabilizes the model
Non-linearity between the predictors and compressor efficiency that the linear model cannot capture
Granularity misalignment between sensor readings and efficiency measurements that introduces aggregation bias
Lagged (autocorrelated) observations that violate the independence assumption of ordinary least squares
A systematic U-shaped (or otherwise curved) pattern in a residual-versus-fitted plot indicates that a straight line does not adequately capture the true relationship between the predictors and the response. When OLS is fitted to such data, its errors are structured rather than random: here the model over-predicts at mid-range fitted values (negative residuals) and under-predicts at both the low and high extremes (positive residuals), and this structured error degrades validation performance. VIFs below 2 rule out severe multicollinearity, the approximately constant residual variance rules out heteroscedasticity, and nothing in the scenario points to temporal ordering or aggregation problems. Adding nonlinear feature transformations (e.g., polynomial or interaction terms) or switching to a nonlinear model would address the issue.
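The diagnosis and the suggested fix can be illustrated with a minimal sketch. It assumes synthetic data (three predictors scaled to [0, 1] and a response that is convex in the third), so it does not reproduce the R² values in the question. The helper names `design`, `r2`, and `cv_r2` are hypothetical and exist only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: three predictors scaled to [0, 1] standing in
# for temperature, pressure, and rotational speed. The response is convex in
# the third predictor, which a straight-line fit cannot represent.
n = 500
X = rng.uniform(0.0, 1.0, size=(n, 3))
y = (0.6 + 0.05 * X[:, 0] + 0.05 * X[:, 1] + 0.3 * X[:, 2] ** 2
     + rng.normal(0.0, 0.01, n))


def design(X, squared):
    """Build a design matrix with an intercept and, optionally, squared terms."""
    cols = [np.ones(len(X)), *X.T]
    if squared:
        cols.extend((X ** 2).T)
    return np.column_stack(cols)


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def cv_r2(X, y, squared, k=5):
    """Mean validation R^2 over k folds for an OLS fit."""
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        beta, *_ = np.linalg.lstsq(design(X[train], squared), y[train], rcond=None)
        scores.append(r2(y[val], design(X[val], squared) @ beta))
    return float(np.mean(scores))


# Residual-versus-fitted check for the purely linear model: fit a quadratic
# to (fitted, residual) pairs. A positive leading coefficient means the
# residuals bend upward at both ends, i.e. a U shape.
A = design(X, squared=False)
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
resid = y - fitted
curvature = np.polyfit(fitted, resid, 2)[0]

# Compare cross-validated R^2 with and without squared terms.
r2_linear = cv_r2(X, y, squared=False)
r2_squared = cv_r2(X, y, squared=True)

print(f"residual curvature: {curvature:.3f}")
print(f"validation R^2, linear terms only: {r2_linear:.3f}")
print(f"validation R^2, with squared terms: {r2_squared:.3f}")
```

A positive `curvature` corresponds to the U shape described in the question. In this sketch, adding squared terms removes the curvature that the linear model leaves in its residuals, so the cross-validated R² should rise; on real compressor data the appropriate transformation would have to be chosen from the residual plots and domain knowledge.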