Your analytics team is building a daily cost-forecasting model for an international e-commerce company. You receive two data sets:
Orders.csv - purchase orders time-stamped at 00:00 for each calendar date.
FX_rates.csv - daily closing foreign-exchange rates recorded at 17:00 New York time and stored under the same calendar date.
After an inner join on calendar date, you train a regression model that predicts a day's average order cost from the FX rate and other features. During testing, the model's accuracy is unrealistically high, and an audit shows that the FX rate being used actually reflects market conditions after the orders were placed.
Which data issue is present, and what should be the first corrective action during exploratory data analysis (EDA)?
Multicollinearity exists among predictors; drop highly correlated currency features based on VIF analysis.
Lagged observations are present; shift the FX_rate series by one day so that each order date is joined to the previous day's closing rate.
Seasonality is present; perform STL decomposition to remove periodic components from the FX_rate series.
The FX_rate variable is non-linear; apply a logarithmic transformation to linearize its relationship with cost.
Because orders are time-stamped at 00:00 but joined to FX rates recorded at 17:00 New York time under the same calendar date, each joined record pairs an order with an FX value that did not yet exist when the order was placed. This is a classic case of data leakage caused by a systematic time shift, which falls under the category of lagged-observation issues. The misalignment lets future information into the predictors and inflates performance metrics. The proper first step is to lag the FX_rates series by one day before the join, so that orders dated D are paired with the close recorded on D-1 and every predictor value is contemporaneous with, or precedes, the event it is meant to explain. Seasonality removal, multicollinearity mitigation, and non-linear transformations would not fix the temporal misalignment that drives the spurious accuracy.
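The realignment can be sketched in pandas as follows. This is a minimal illustration, not the audited pipeline: the column names (`date`, `avg_order_cost`, `fx_rate`, `fx_rate_prev_close`) and the toy values are assumptions standing in for the real contents of Orders.csv and FX_rates.csv.

```python
import pandas as pd

# Hypothetical toy data; the real files' column names are not given in the question.
orders = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
    "avg_order_cost": [100.0, 102.0, 101.0],
})
fx = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
    "fx_rate": [1.10, 1.11, 1.12, 1.13],
})

# Leaky join: orders dated D see the 17:00 close of the same date D.
leaky = orders.merge(fx, on="date", how="inner")

# Lag the FX series by one calendar day: the close recorded on D-1
# becomes the feature for orders dated D.
fx_lagged = fx.sort_values("date").copy()
fx_lagged["date"] = fx_lagged["date"] + pd.Timedelta(days=1)
fx_lagged = fx_lagged.rename(columns={"fx_rate": "fx_rate_prev_close"})

# Aligned join: every FX value now precedes the orders it is paired with.
aligned = orders.merge(fx_lagged, on="date", how="inner")
print(aligned)
```

A fixed one-day lag assumes a rate exists for every calendar date. If the FX file skips weekends or holidays, `pd.merge_asof` with `direction="backward"` and `allow_exact_matches=False` is one way to pick the most recent earlier close instead.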