A data-science team is developing a binary classifier that predicts equipment failure seven days ahead from two years of hourly sensor readings. The engineer follows this workflow:
(1) remove rows that contain any null sensor value;
(2) compute a 24-hour rolling mean for every sensor and append it as a new feature;
(3) randomly split the resulting data into 80% training and 20% test sets;
(4) fit a StandardScaler on the training split and apply the scaler to both splits;
(5) train a gradient-boosting classifier;
(6) evaluate accuracy on the test split.
The offline test accuracy is 0.93, but the model's accuracy on live streaming data drops to 0.64.
Which single step in this workflow is the most likely cause of the data leakage that explains the performance drop, and why?
Step (4) - Scaling the data with StandardScaler fitted on the training split; this is the correct way to scale and does not cause leakage.
Step (3) - Randomly splitting time-stamped data; this puts future observations in the training set and lets the model learn about events that occur after some test instances, creating temporal data leakage.
Step (2) - Computing the 24-hour rolling mean before the split; the feature engineering leaks test values into training features and inflates accuracy.
Step (1) - Eliminating rows with missing readings; this reduces sample size but does not provide the model with information about future failures.
Randomly splitting records that have a natural time order lets examples from the future enter the training set while examples from the past end up in the test set. Because the model is trained on data that chronologically follow some of the test examples, it gains information that would never be available in production. This temporal look-ahead inflates the offline score; once the model is deployed on genuinely unseen future data, performance degrades sharply.
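The remedy is a chronological split. Below is a minimal sketch, assuming the readings live in a pandas DataFrame with a timestamp column; the file name and column name are hypothetical:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])  # hypothetical file

# Leaky: rows from any point in time can land in either split, so the model
# can train on observations recorded after some of the test observations.
train_leaky, test_leaky = train_test_split(df, test_size=0.2, random_state=42)

# Leak-free: sort by time and hold out the most recent 20% as the test set,
# so evaluation always uses data that comes strictly after the training data.
df = df.sort_values("timestamp")
cutoff = int(len(df) * 0.8)
train_df, test_df = df.iloc[:cutoff], df.iloc[cutoff:]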
Rolling means (step 2) are not harmful by themselves as long as each mean is computed only from past observations and the split preserves chronology. Dropping missing rows (step 1) can bias the sample but does not leak target information. Fitting a StandardScaler on the training split only (step 4) is the correct, leakage-free procedure.
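For completeness, here is a sketch of steps (2) and (4) done without leakage, continuing from the chronological split above; the sensor column names are assumed for illustration:

from sklearn.preprocessing import StandardScaler

sensor_cols = ["sensor_1", "sensor_2"]               # assumed column names
mean_cols = [f"{c}_24h_mean" for c in sensor_cols]
feature_cols = sensor_cols + mean_cols

def add_trailing_means(frame):
    """Append a 24-hour rolling mean that uses only strictly past readings."""
    frame = frame.copy()
    for col in sensor_cols:
        # shift(1) excludes the current row, so each mean is computed from
        # the previous 24 hourly readings only.
        frame[f"{col}_24h_mean"] = (
            frame[col].shift(1).rolling(window=24, min_periods=1).mean()
        )
    return frame

# The first row of each split has no past readings and becomes NaN; drop it.
train_df = add_trailing_means(train_df).dropna(subset=mean_cols)
test_df = add_trailing_means(test_df).dropna(subset=mean_cols)

# Step (4) done correctly: fit the scaler on the training split only,
# then apply the same transformation to both splits.
scaler = StandardScaler().fit(train_df[feature_cols])
X_train = scaler.transform(train_df[feature_cols])
X_test = scaler.transform(test_df[feature_cols])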