Your organization's GitHub repository contains code for an ML pipeline while the training data (≈200 GB) lives in an Amazon S3 bucket that is overwritten every week. Compliance rules require that anyone who checks out any past Git commit can automatically restore exactly the dataset that was used for that commit, without bloating the repository or exceeding GitHub file-size limits. Which approach best satisfies these requirements?
Enable S3 object versioning and save the object version IDs in a YAML configuration file that the pipeline reads at runtime.
Store the full dataset in Git Large File Storage so each commit contains a pointer to the data blobs managed by Git LFS.
Package every weekly snapshot as a compressed archive and upload it as a GitHub release asset referenced by a repository tag.
Track the dataset with DVC: commit the lightweight .dvc pointer files to Git and configure an S3 DVC remote so that "git checkout" followed by "dvc pull" retrieves the exact snapshot.
Committing small .dvc metadata files to Git while pushing the actual data to an S3 remote managed by DVC ties each data snapshot to the Git commit that used it. Each .dvc file records a content hash of the data, so after a developer runs "git checkout" for a historical commit, "dvc pull" reads that commit's .dvc files and fetches the matching data version from the remote. This gives reproducibility without storing large binaries in Git. Git LFS on GitHub is subject to per-file size limits and storage and bandwidth quotas, and it can become costly for weekly 200 GB snapshots. Recording S3 object version IDs by hand or uploading .tar.gz archives to GitHub releases creates no automatic link between code and data and depends on error-prone manual steps, which defeats the goal of seamless reproducibility.
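The workflow described above can be sketched as two small Python helpers that shell out to Git and DVC. This is a minimal illustration, not a definitive implementation: it assumes Git and DVC are installed, that an S3 remote has already been configured as the default DVC remote, and that the dataset is a top-level directory in the repository (so that "dvc add" writes the pointer file next to the root .gitignore). The function names and the `run` parameter are hypothetical and exist only so the command sequence can be inspected or tested without touching a real repository.

```python
import subprocess
from typing import Callable, List


def record_snapshot(data_dir: str, message: str,
                    run: Callable[..., object] = subprocess.run) -> List[List[str]]:
    """Record the current contents of `data_dir` as a snapshot tied to a new Git commit.

    Assumption: `data_dir` is a top-level directory, so `dvc add` creates
    `<data_dir>.dvc` and updates the repository-root `.gitignore`.
    """
    commands = [
        ["dvc", "add", data_dir],                        # hash the data, write the .dvc pointer
        ["git", "add", f"{data_dir}.dvc", ".gitignore"],  # stage only the small metadata files
        ["git", "commit", "-m", message],                # the commit now references this snapshot
        ["dvc", "push"],                                 # upload the data to the configured remote
    ]
    for cmd in commands:
        run(cmd, check=True)
    return commands


def restore_snapshot(commit: str,
                     run: Callable[..., object] = subprocess.run) -> List[List[str]]:
    """Check out `commit`, then fetch the data version its .dvc files point to."""
    commands = [
        ["git", "checkout", commit],  # restores code and the .dvc pointer files
        ["dvc", "pull"],              # downloads the data those pointers reference
    ]
    for cmd in commands:
        run(cmd, check=True)
    return commands
```

Calling `restore_snapshot("<commit>")` inside a repository set up this way would run the same two commands named in the correct answer. Passing `check=True` makes a failed checkout stop the sequence before any data is pulled.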