A machine learning team is developing a computer vision model using a 2TB dataset of high-resolution images. The team uses Git for source code versioning but faces significant challenges versioning the dataset itself. Standard Git is not viable at this scale, and although Git LFS was considered, its storage costs and performance overhead with the team's cloud provider are prohibitive. The primary requirement is to maintain lightweight, reproducible links between specific code versions and the corresponding data versions without duplicating the entire dataset for each change.
Which of the following solutions would be the MOST effective and cost-efficient for the team to implement for data versioning in this scenario?
Implement Data Version Control (DVC) to track metadata pointers in Git while keeping the actual image files in a separate, cost-effective cloud storage.
Utilize Docker to package each version of the 2TB dataset into a new container image, versioning the data and the environment together for reproducibility.
Store periodic snapshots of the dataset as compressed archives in a shared cloud storage location, using a naming convention that corresponds to Git commit hashes.
Build a custom database and API to manage data versioning by storing file paths and metadata, which the team's code will query to retrieve specific dataset versions.
The correct answer is to implement Data Version Control (DVC). DVC is an open-source tool designed specifically for versioning data and models in machine learning projects. It works alongside Git by storing lightweight metadata files (pointers) in the Git repository, while the actual large data files are stored in a separate, cost-effective storage system like AWS S3, Google Cloud Storage, or an on-premises server. This approach directly addresses the scenario's requirements by keeping the Git repository small and fast, providing clear, reproducible links between code and data versions, and avoiding the high costs associated with storing large files directly in Git or Git LFS.
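In practice, the team would run `dvc add` on the dataset directory, commit the resulting `.dvc` pointer file alongside the code, and upload the image files to remote storage with `dvc push`; checking out an older commit and running `dvc pull` then restores the matching data. The sketch below shows how a training script could retrieve a data version pinned to a specific Git revision through DVC's Python API; the repository URL, file path, and tag name are hypothetical placeholders, not details from the scenario.

```python
"""Minimal sketch: fetching DVC-tracked data pinned to a Git revision.

The repo URL, path, and tag below are illustrative placeholders.
"""
import dvc.api

# Hypothetical Git repo containing the .dvc pointer files, plus a tag
# marking the exact code/data version we want to reproduce.
REPO_URL = "https://github.com/example-org/cv-model.git"
DATA_PATH = "data/images/manifest.csv"
REVISION = "v1.2.0"

# Resolve where this file lives in remote storage (e.g. S3) for the
# given revision, without downloading the whole 2TB dataset.
url = dvc.api.get_url(DATA_PATH, repo=REPO_URL, rev=REVISION)
print(f"Data for {REVISION} resolves to: {url}")

# Stream a tracked file exactly as it existed at that revision.
with dvc.api.open(DATA_PATH, repo=REPO_URL, rev=REVISION) as f:
    header = f.readline()
    print(f"First line of the pinned manifest: {header.strip()}")
```

Because only small pointer files are versioned in Git, the repository stays fast while every commit still maps to an exact, recoverable state of the 2TB dataset.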
The approach of using compressed archives with a naming convention is a manual, error-prone process that lacks robust dependency tracking and makes it difficult to switch between versions efficiently. Building a custom database and API is not cost-effective, as it involves significant development and maintenance effort and essentially reinvents a solution that already exists in tools like DVC. Using Docker to store a 2TB dataset within an image is a significant anti-pattern; it would lead to extremely large, unmanageable images, slow down container operations, and go against the best practice of keeping images small and using volumes for large or persistent data.