In a churn-prediction initiative, your team builds a gradient-boosting model using 24 monthly snapshots (January 2023 - December 2024). Before the model can enter any online experiments, policy requires an offline validation step that (a) prevents temporal leakage and (b) ensures that every record is used for training at least once during hyper-parameter search. Which validation strategy best meets both requirements?
A single 80/20 hold-out split where the last five months are used only for testing and never included in training.
Random k-fold cross-validation with shuffling enabled so each fold contains a mixture of months.
Walk-forward (expanding-window) time-series cross-validation that trains on the earliest months and validates on the next contiguous month, repeating until all folds are evaluated.
Leave-one-customer-out cross-validation that removes one customer's entire history per fold regardless of transaction dates.
Walk-forward (expanding-window) time-series cross-validation always trains on past observations and validates on the immediately following time slice, so the model never sees data from the future and temporal leakage is avoided. Because the window rolls forward, each month that serves as a validation slice in one fold joins the training window of later folds, so the full dataset informs model fitting and hyper-parameter tuning (the final slice contributes once the tuned model is refit on the complete history). Random k-fold or stratified splits that ignore time order mix future records into the training set, leaking information. A single 80/20 hold-out that reserves the last few months avoids leakage but permanently withholds those months from training, violating the requirement that all data contribute to model learning. Leave-one-customer-out splits likewise disregard date ordering, so a fold can still train on months later than those used for validation, again risking leakage.
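The fold structure described above can be sketched with a minimal pure-Python splitter. This is an illustrative sketch, not a production implementation: it assumes records have already been grouped into month indices 0–23 matching the 24 monthly snapshots in the question, and the `min_train` warm-up size is an arbitrary choice for the example.

```python
def walk_forward_splits(n_months, min_train=1):
    """Yield (train_months, validation_month) pairs for an
    expanding-window walk-forward scheme.

    Each fold trains on every month strictly before the validation
    month, so no future information can leak into training.
    """
    splits = []
    for val in range(min_train, n_months):
        train = list(range(val))  # all months before the validation slice
        splits.append((train, val))
    return splits

# 24 monthly snapshots, with a 12-month warm-up before the first fold.
folds = walk_forward_splits(24, min_train=12)
# First fold trains on months 0-11 and validates on month 12;
# the last fold trains on months 0-22 and validates on month 23.
```

Note that each validation month (except the last) reappears in the training window of every subsequent fold, which is why the scheme lets nearly all records inform hyper-parameter tuning before the final refit on the full history.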