A data scientist is developing a multiple linear regression model to predict daily sales revenue for an online retail business. The model uses several predictor variables, including daily_marketing_spend, website_sessions, number_of_promotions, and average_competitor_price. Initial model training yields a high R-squared value, suggesting a good overall fit. However, a detailed analysis of the model's coefficients reveals the following issues:
The coefficient for website_sessions is negative, which is counterintuitive.
The standard errors for the coefficients of both daily_marketing_spend and website_sessions are unusually large.
The p-values for daily_marketing_spend and website_sessions are not statistically significant, despite the model's high overall F-statistic and R-squared value.
Given these observations, which of the following data issues is the most likely cause of these paradoxical results?
The correct answer is multicollinearity. The scenario describes its classic symptoms in a regression model. Multicollinearity occurs when two or more independent variables are highly correlated with each other, as daily_marketing_spend and website_sessions plausibly are, since higher spend tends to drive more sessions. Because the correlated predictors carry largely redundant information, the model cannot cleanly attribute variation in the dependent variable to either one individually; this inflates their standard errors, pushes their p-values above significance thresholds, and can even flip a coefficient's sign. The telltale pattern is exactly what the scenario describes: a high overall R-squared and a significant F-statistic alongside individually insignificant, counterintuitively signed coefficients.
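To make the diagnosis concrete, here is a minimal sketch in Python (assuming numpy, pandas, and statsmodels are available) that simulates two nearly redundant predictors, fits an OLS model, and computes variance inflation factors (VIFs). The data-generating coefficients, noise levels, and the rule-of-thumb VIF cutoff are illustrative assumptions, not values from the scenario.

```python
# Illustrative sketch: reproduce the paradox with simulated data, then
# diagnose it with variance inflation factors (VIFs).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(42)
n = 365

# website_sessions is almost a deterministic function of marketing spend,
# so the two predictors carry nearly the same information.
daily_marketing_spend = rng.normal(1_000, 200, n)
website_sessions = 5 * daily_marketing_spend + rng.normal(0, 20, n)
number_of_promotions = rng.poisson(3, n).astype(float)

X = pd.DataFrame({
    "daily_marketing_spend": daily_marketing_spend,
    "website_sessions": website_sessions,
    "number_of_promotions": number_of_promotions,
})
y = 2 * daily_marketing_spend + 50 * number_of_promotions + rng.normal(0, 200, n)

# Expect a high R-squared overall, but large standard errors and
# insignificant p-values on the two collinear predictors.
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())

# VIFs above roughly 5-10 are a common rule of thumb for problematic
# collinearity; the collinear pair should score far above that here.
exog = sm.add_constant(X)
for i, name in enumerate(exog.columns[1:], start=1):
    print(f"{name:>24}: VIF = {variance_inflation_factor(exog.values, i):,.1f}")
```

Remedies follow from the same logic: drop or combine one of the redundant predictors, or use a regularized model such as ridge regression, which tolerates correlated inputs.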
Non-stationarity is a common issue in time-series data and can lead to spurious regressions, where a relationship appears significant when none exists. However, it does not typically produce this specific pattern: high overall model significance combined with insignificant, sign-flipped coefficients for a pair of correlated predictors.
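For contrast, non-stationarity is usually checked directly on the series itself, for example with an augmented Dickey-Fuller test. This sketch uses a simulated random walk rather than the retailer's actual data.

```python
# Illustrative sketch: an augmented Dickey-Fuller (ADF) test on a simulated
# random walk, which is non-stationary by construction.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(0, 1, 500))

adf_stat, p_value, *_ = adfuller(random_walk)
print(f"ADF statistic = {adf_stat:.2f}, p-value = {p_value:.3f}")
# A large p-value means the unit-root null cannot be rejected, i.e. the
# series looks non-stationary; none of this implies collinear predictors.
```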
The presence of multivariate outliers can significantly influence regression coefficients and model fit. However, outliers would not systematically inflate the standard errors and flip a coefficient sign for one specific pair of correlated variables while leaving the overall R-squared high.
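Outliers are likewise diagnosed with different tools; one common approach is Mahalanobis distance, sketched below on synthetic data with a hypothetical chi-squared cutoff.

```python
# Illustrative sketch: flag multivariate outliers via squared Mahalanobis
# distance against a chi-squared cutoff.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=300)
X[0] = [6.0, -6.0]  # plant one obvious multivariate outlier

diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared distances

cutoff = chi2.ppf(0.999, df=X.shape[1])  # 99.9th percentile for 2 dimensions
print("flagged rows:", np.where(d2 > cutoff)[0])
```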
Granularity misalignment refers to aggregating or joining data at different levels (e.g., daily sales with weekly marketing spend). This is a data structuring problem that can lead to loss of information, but it does not directly cause the statistical artifacts (inflated standard errors, insignificant p-values) observed in the model's output.
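If granularity misalignment were the problem, the fix would live in data preparation rather than model diagnostics. A minimal pandas sketch (with hypothetical column names and dates) broadcasts a weekly figure onto daily rows before joining.

```python
# Illustrative sketch: align weekly marketing spend to daily sales before
# joining, instead of mixing the two granularities directly.
import pandas as pd

daily_sales = pd.DataFrame(
    {"revenue": range(14)},
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)
weekly_spend = pd.DataFrame(
    {"marketing_spend": [700.0, 840.0]},
    index=pd.date_range("2024-01-01", periods=2, freq="W-MON"),
)

# Forward-fill each weekly figure across the days of its week, then join.
aligned = daily_sales.join(weekly_spend.reindex(daily_sales.index, method="ffill"))
print(aligned.head(8))
```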