While building an ordinary least-squares model to forecast quarterly sales, an analyst includes eight predictors, among them Total marketing spend and Digital ad spend. After fitting the model, she observes several issues: the model's R-squared is high at 0.93, but neither marketing variable is individually significant (p ≈ 0.35). Furthermore, if Digital ad spend is removed from the model, Total marketing spend becomes highly significant (p < 0.01) and its coefficient's sign flips from negative to positive. A final check shows the variance inflation factor (VIF) for each marketing variable exceeds 12.
Which statement best explains these symptoms and identifies the most appropriate first step to address them?
The residuals are heteroscedastic; switch to weighted least squares to stabilize the standard errors.
The model exhibits multicollinearity; first remove or consolidate the two highly correlated marketing predictors and then refit the model.
The model is overfitting; apply k-fold cross-validation and add L2 regularization to reduce variance.
Sales observations are imbalanced; oversample the low-sales class with SMOTE to improve coefficient significance.
A VIF above 10 is a common rule-of-thumb signal of severe multicollinearity, indicating that two or more predictors carry nearly the same information. Multicollinearity inflates the standard errors of the affected coefficients, which can make them appear insignificant even when the overall model is strong, and it can cause the sign of a coefficient to flip when another correlated predictor is added or removed. The primary corrective action is to eliminate the redundancy-either by dropping one of the correlated variables, combining them (for example, by creating a single composite feature), or otherwise reducing their overlap-before considering more complex remedies such as regularization. Heteroscedasticity, overfitting, or class imbalance would not explain the high VIF values or the specific sign reversal symptom.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is multicollinearity and why is it an issue in regression models?
Open an interactive chat with Bash
What is a variance inflation factor (VIF) and how does it indicate multicollinearity?
Open an interactive chat with Bash
What are potential methods to address multicollinearity in regression analysis?