A data scientist is preparing to hand off a machine-learning pipeline that supports drug-trial decisions to the company's compliance team. The source code lives in Git, and the model artifacts are pushed to an internal registry, but the training script
Six months later the auditors re-run the same Git commit and obtain different model coefficients because both the S3 object and several Python packages have silently changed. According to data-science life-cycle best practices, which single additional action would have most directly prevented this reproducibility failure?
Record the pseudo-random seed used during training and store it in the model registry metadata.
Increase the hold-out test set from 20% to 30% so that validation scores have lower variance.
Schedule weekly retraining jobs that always pull the newest dataset and latest package versions, overwriting the previous model artifact.
Version the exact training dataset and commit a dependency lock file that pins every package and hash alongside the model code.
Reproducibility requires that every artifact able to change over time (code, data, and execution environment) be captured immutably. Committing an immutable snapshot (or content-addressed pointer) of the exact training dataset, together with a fully pinned dependency lock file (for example, a requirements.txt or conda-lock file with exact versions and hashes), in the same version-control commit as the model code lets anyone recreate the same inputs and environment months later.
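As a minimal sketch of the "content-addressed pointer" idea, the Python below records a dataset's SHA-256 digest in a small manifest that could be committed next to the training code, then checks the file against it before training. The helper names and the manifest layout are hypothetical, chosen for illustration; dedicated data-versioning tools do the same job more completely.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_path: Path, manifest_path: Path) -> dict:
    """Record the dataset's digest and size (hypothetical manifest layout)."""
    manifest = {
        "file": data_path.name,
        "sha256": sha256_of_file(data_path),
        "bytes": data_path.stat().st_size,
    }
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


def verify_dataset(data_path: Path, manifest_path: Path) -> bool:
    """Return True only if the file still matches the recorded digest."""
    manifest = json.loads(manifest_path.read_text())
    return sha256_of_file(data_path) == manifest["sha256"]
```

A training script could call `verify_dataset` first and refuse to run when it returns `False`, so a silently overwritten object is caught before it produces a different model. The dependency side is usually handled by the package manager itself, for example a requirements file whose entries carry exact versions and hashes.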
Simply logging a random seed does not protect against data or package drift. Regularly retraining with the latest files overwrites history and makes past results irrecoverable. Enlarging the test set improves evaluation robustness but has no impact on audit reproducibility.
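The limits of a logged seed can be illustrated with a toy example. The sketch below (standard library only, with made-up data and a hypothetical `fit_line` helper) fits a line on a seeded subsample: the same seed on the same data reproduces the coefficients exactly, but the same seed on altered data does not.

```python
import random


def fit_line(points, seed):
    """Fit y = a + b*x by ordinary least squares on a seeded 80% subsample."""
    rng = random.Random(seed)
    sample = rng.sample(points, k=int(0.8 * len(points)))
    n = len(sample)
    mean_x = sum(x for x, _ in sample) / n
    mean_y = sum(y for _, y in sample) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in sample)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in sample)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope  # (intercept, slope)


# Made-up data standing in for the original training file.
original = [(float(x), 2.0 * x + 1.0) for x in range(10)]
# Simulate the stored object changing upstream: every target is shifted.
drifted = [(x, y + 0.5) for x, y in original]

# Same seed, same data: identical coefficients.
same_run = fit_line(original, seed=42) == fit_line(original, seed=42)
# Same seed, changed data: coefficients differ despite the logged seed.
after_drift = fit_line(original, seed=42) == fit_line(drifted, seed=42)
```

Here `same_run` is `True` and `after_drift` is `False`: the seed fixes which rows are sampled, not what those rows contain.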