A machine learning team at a financial services company is developing a credit risk model. They use Git for versioning their Python code. However, they are encountering issues with storing trained model artifacts, which often exceed 100MB, directly in their Git repository. This practice is causing repository bloat and slow clone times. Furthermore, they have no systematic way to link a specific model file to the version of the dataset and the hyperparameters used to train it, leading to reproducibility problems. To address these challenges in alignment with MLOps best practices for model versioning, which of the following solutions should the team implement?
Implement Git Large File Storage (LFS) to track the large model files, which replaces the files with text pointers in Git while storing the file contents on a remote server.
Adopt a file naming convention where each serialized model file includes the training date and version number, and store these files in a cloud storage bucket versioned by date.
Integrate Data Version Control (DVC) with the existing Git workflow to track model artifacts as pointers in Git, while storing the actual files in a designated remote storage.
Utilize an MLflow Tracking server exclusively to log hyperparameters and metrics, and use the MLflow Model Registry to manage model versions and stages.
The correct answer is to integrate Data Version Control (DVC) with the existing Git workflow. This approach directly addresses both problems in the scenario. DVC is designed specifically for machine learning projects: it stores large files such as datasets and trained models in a designated remote storage, while Git versions only lightweight metadata files that point to the exact content by hash, keeping the repository small and fast to clone. Crucially, DVC can also define and version entire ML pipelines, explicitly linking code versions, data dependencies, hyperparameters, and output models, which solves the reproducibility problem.
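To make this concrete: the usual workflow is to track an artifact with `dvc add models/model.pkl`, commit the resulting `.dvc` pointer file to Git, and run `dvc push` to upload the binary to the configured remote. Any consumer can then retrieve the exact model tied to a given Git revision through DVC's Python API. The sketch below is illustrative only; the repository URL, file path, and tag are hypothetical placeholders.

```python
import pickle

import dvc.api

# Hypothetical repo URL, artifact path, and Git tag -- adjust to your project.
REPO = "https://github.com/example-org/credit-risk-model"
MODEL_PATH = "models/credit_risk.pkl"
REV = "v1.2.0"  # any Git commit, branch, or tag

# dvc.api.open streams the artifact from the DVC remote referenced by the
# .dvc pointer file at that revision -- the binary never lives in Git itself.
with dvc.api.open(MODEL_PATH, repo=REPO, rev=REV, mode="rb") as f:
    model = pickle.load(f)

# params_show returns the hyperparameters (e.g. from params.yaml) recorded
# at that same revision, so code, data, and config stay linked together.
params = dvc.api.params_show(repo=REPO, rev=REV)
print(params)
```

Because the pointer file and the parameters live in the same Git commit as the training code, checking out a commit recovers the model, its data, and its hyperparameters together, which is exactly the lineage the scenario asks for.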
Git Large File Storage (LFS) is incorrect because, while it solves the repository bloat caused by large files (e.g., via `git lfs track "*.pkl"`), it is a general-purpose tool: it lacks DVC's built-in MLOps capabilities for tracking data dependencies and defining reproducible pipelines.

MLflow is also incorrect as a standalone solution. It excels at tracking experiments, parameters, and metrics, and at managing model lifecycles in a registry, but it does not address versioning large files within a Git workflow. MLflow is often used in conjunction with DVC; DVC is the tool that specifically solves the stated data and model file versioning problem.

Relying on a manual file naming convention is an anti-pattern: it is error-prone, difficult to scale, and provides none of the robust, automated lineage tracking that MLOps best practices require.
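For contrast, here is roughly what the MLflow-only option looks like in practice. It captures parameters, metrics, and registry versions well, but note that nothing ties the run to a specific version of the training dataset in Git. The tracking URI, experiment name, and toy data below are hypothetical stand-ins.

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical tracking server and experiment name.
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("credit-risk")

# Toy stand-in for the real training data.
X = np.random.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)

with mlflow.start_run():
    params = {"C": 0.5, "max_iter": 200}
    model = LogisticRegression(**params).fit(X, y)

    mlflow.log_params(params)
    mlflow.log_metric("train_accuracy", model.score(X, y))

    # Registers a new version in the Model Registry; the serialized model
    # is stored in MLflow's artifact store, outside the Git workflow.
    mlflow.sklearn.log_model(
        model, "model", registered_model_name="credit-risk-model"
    )
```

This is why the exam answer favors DVC for the stated problem: MLflow records what was run and manages model stages, but DVC is what versions the large files alongside the code that produced them.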