A machine learning team is developing a computer vision model using a 2TB dataset of high-resolution images. The team uses Git for source code versioning but faces significant challenges versioning the dataset itself. Standard Git is not viable at this scale, and although Git LFS was considered, its storage costs and performance overhead with the team's cloud provider are prohibitive. The primary requirement is to maintain lightweight, reproducible links between specific code versions and the corresponding data versions without duplicating the entire dataset for each change.
Which of the following solutions would be the MOST effective and cost-efficient for the team to implement for data versioning in this scenario?
Implement Data Version Control (DVC) to track metadata pointers in Git while keeping the actual image files in a separate, cost-effective cloud storage.
Utilize Docker to package each version of the 2TB dataset into a new container image, versioning the data and the environment together for reproducibility.
Store periodic snapshots of the dataset as compressed archives in a shared cloud storage location, using a naming convention that corresponds to Git commit hashes.
Build a custom database and API to manage data versioning by storing file paths and metadata, which the team's code will query to retrieve specific dataset versions.
The correct answer is to implement Data Version Control (DVC). DVC is an open-source tool designed specifically for versioning data and models in machine learning projects. It works alongside Git by storing lightweight metadata files (pointers) in the Git repository, while the actual large data files are stored in a separate, cost-effective storage system like AWS S3, Google Cloud Storage, or an on-premises server. This approach directly addresses the scenario's requirements by keeping the Git repository small and fast, providing clear, reproducible links between code and data versions, and avoiding the high costs associated with storing large files directly in Git or Git LFS.
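In practice, the team would run `dvc add` on the dataset directory, commit the resulting `.dvc` pointer file alongside the code, and upload the image files to remote storage with `dvc push`; checking out an older commit and running `dvc pull` then restores the matching data. The sketch below shows how a training script could retrieve a data version pinned to a specific Git revision through DVC's Python API; the repository URL, file path, and tag name are hypothetical placeholders, not details from the scenario.

```python
"""Minimal sketch: fetching DVC-tracked data pinned to a Git revision.

The repo URL, path, and tag below are illustrative placeholders.
"""
import dvc.api

# Hypothetical Git repo containing the .dvc pointer files, plus a tag
# marking the exact code/data version we want to reproduce.
REPO_URL = "https://github.com/example-org/cv-model.git"
DATA_PATH = "data/images/manifest.csv"
REVISION = "v1.2.0"

# Resolve where this file lives in remote storage (e.g. S3) for the
# given revision, without downloading the whole 2TB dataset.
url = dvc.api.get_url(DATA_PATH, repo=REPO_URL, rev=REVISION)
print(f"Data for {REVISION} resolves to: {url}")

# Stream a tracked file exactly as it existed at that revision.
with dvc.api.open(DATA_PATH, repo=REPO_URL, rev=REVISION) as f:
    header = f.readline()
    print(f"First line of the pinned manifest: {header.strip()}")
```

Because only small pointer files are versioned in Git, the repository stays fast while every commit still maps to an exact, recoverable state of the 2TB dataset.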
The approach of using compressed archives with a naming convention is a manual, error-prone process that lacks robust dependency tracking and makes it difficult to switch between versions efficiently. Building a custom database and API is not cost-effective, as it involves significant development and maintenance effort and essentially reinvents a solution that already exists in tools like DVC. Using Docker to store a 2TB dataset within an image is a significant anti-pattern; it would lead to extremely large, unmanageable images, slow down container operations, and go against the best practice of keeping images small and using volumes for large or persistent data.