A data science team at a retail company has developed a linear regression model to predict future monthly sales. The model was trained on data from January 2020 to December 2024. Its primary feature is a time index, supplemented by features for marketing spend and seasonal events. The model demonstrates high accuracy on the hold-out test set (data from 2024). The team is now tasked with using this model to forecast sales for all of 2026. Which of the following describes the most significant risk associated with this specific forecasting task?
The model will be performing extrapolation, which assumes that the relationships and trends observed in the training data will continue unchanged into the future, an often unreliable assumption.
The model is performing interpolation, which can be inaccurate if the new data points fall in a sparse region of the original training data.
The model is likely to suffer from underfitting because the linear relationship cannot capture the complexity of future market dynamics.
Data leakage from the test set may have inflated the model's performance metrics, giving a false sense of confidence for future predictions.
The correct answer identifies extrapolation as the most significant risk. Extrapolation is the process of estimating a value beyond the original range of the training data. In this scenario, the model was trained on data up to December 2024, and it is being asked to predict for 2026, which is outside that range. The primary risk of extrapolation is the assumption that the trends and relationships learned from the past data will continue to hold true for the future, which is often not the case due to changing market conditions, consumer behavior, or other unforeseen factors.
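The gap between in-range accuracy and out-of-range accuracy can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `true_sales` function, its numbers, and the idea that growth flattens after December 2024 are invented and are not part of the scenario. The time index counts months from January 2020 (month 0), so 2026 corresponds to months 72 through 83.

```python
import numpy as np

def true_sales(t):
    # Hypothetical ground truth (an assumption, not scenario data):
    # sales grow linearly through month 59 (Dec 2024), then stay flat.
    t = np.asarray(t, dtype=float)
    return 100.0 + 2.0 * np.minimum(t, 59.0)

train_t = np.arange(60)         # Jan 2020 .. Dec 2024 (training range)
forecast_t = np.arange(72, 84)  # Jan 2026 .. Dec 2026 (outside that range)

# Fit a straight line to the training period using the time index only.
slope, intercept = np.polyfit(train_t, true_sales(train_t), deg=1)

def predict(t):
    return slope * np.asarray(t, dtype=float) + intercept

# Largest absolute error inside and outside the training range.
in_range_error = np.max(np.abs(predict(train_t) - true_sales(train_t)))
out_of_range_error = np.max(np.abs(predict(forecast_t) - true_sales(forecast_t)))

print(f"max error, 2020-2024: {in_range_error:.4f}")
print(f"max error, 2026:      {out_of_range_error:.4f}")
```

In this toy setup the line fits the training period almost perfectly, yet it keeps projecting the old growth rate into 2026, where the assumed trend no longer holds. Strong accuracy on data from inside the training window says nothing about that failure mode.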
The option regarding underfitting is incorrect because the scenario states the model has 'high accuracy' on the test set, which suggests underfitting is not the primary issue. The option about data leakage is a plausible but less certain risk; there is no specific information in the scenario to suggest leakage occurred, whereas extrapolation is a certainty. The option mentioning interpolation is incorrect because interpolation involves making predictions within the range of the training data, and this scenario is a clear case of predicting outside that range.