A data scientist is developing a linear regression model to predict quarterly sales for a large retail chain. The features include advertising spend, number of promotional events, and several macroeconomic indicators like GDP growth rate, unemployment rate, and the consumer price index (CPI). During model diagnostics, the data scientist observes that the p-values for the macroeconomic indicators are high, and their coefficients are highly sensitive to the inclusion or exclusion of other variables. Furthermore, some coefficients have signs that contradict established economic principles. Which of the following data issues is the most probable cause of these specific observations?
The correct answer is multicollinearity. Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, making it difficult to distinguish their individual effects on the dependent variable. The scenario describes classic symptoms of multicollinearity: inflated standard errors (leading to high p-values), unstable coefficient estimates that change dramatically when other variables are added or removed, and coefficients with counter-intuitive signs. The macroeconomic indicators used (GDP, unemployment, CPI) are often highly correlated with each other, making this the most likely cause.
Non-stationarity refers to time series data whose statistical properties (like mean and variance) change over time. While it can cause issues like spurious correlations, it doesn't directly explain the coefficient instability and sign-flipping described.
Seasonality is a regular, periodic pattern in data. If unaccounted for, it would likely appear as a pattern in the model's residuals, but it is not the primary cause for unstable coefficients among a group of predictors.
Sparse data refers to a dataset with a high proportion of zero or null values. This is not suggested by the scenario, which involves macroeconomic indicators that are typically dense, and the symptoms described are not characteristic of sparsity.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is multicollinearity in simple terms?
Open an interactive chat with Bash
How can multicollinearity be detected in a regression model?
Open an interactive chat with Bash
What are some ways to handle multicollinearity in regression models?