A machine learning team is tasked with building a credit card fraud detection model. The historical dataset is highly imbalanced, with fraudulent transactions accounting for less than 1% of the data. To evaluate their chosen algorithm, a data scientist implements a standard 10-fold cross-validation procedure. Which of the following describes the most critical issue the data scientist is likely to encounter with this evaluation approach?
The model's training time will increase tenfold compared to a single train-test split, making the evaluation process computationally infeasible.
Some folds may not contain any instances of the minority (fraudulent) class, leading to a skewed and unreliable performance estimate.
The validation process will systematically underestimate the model's bias because each training set is smaller than the total dataset.
The use of k-fold cross-validation will cause the model to underfit, as it is only trained on 90% of the data at any given time.
The correct answer identifies the most critical issue with using standard k-fold cross-validation on a highly imbalanced dataset. With fraud making up less than 1% of the data, purely random splitting can produce folds that contain very few, or even zero, fraudulent examples. A fold with no actual fraud cases makes metrics such as recall undefined, and the averaged performance across folds then becomes an unreliable estimate of how well the model generalizes to the minority class. The appropriate technique in this scenario is Stratified K-Fold cross-validation, which ensures that each fold preserves the class distribution of the original dataset, as illustrated in the sketch below.
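To make the failure mode concrete, here is a minimal sketch using scikit-learn and NumPy. The 2,000-transaction dataset, the 1% fraud rate, and the random seeds are illustrative assumptions, not values from the question; the sketch simply counts how many fraudulent cases land in each test fold under plain KFold versus StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(seed=42)

# Synthetic stand-in for the scenario: 2,000 transactions, 1% fraud (label 1).
n_samples, n_fraud = 2_000, 20
y = np.zeros(n_samples, dtype=int)
y[:n_fraud] = 1
rng.shuffle(y)
X = rng.normal(size=(n_samples, 5))  # placeholder features

splitters = {
    "KFold": KFold(n_splits=10, shuffle=True, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
}
for name, cv in splitters.items():
    # Count how many fraud cases land in each of the 10 test folds.
    counts = [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]
    print(f"{name:>16}: fraud cases per test fold = {counts}")
# Plain KFold counts fluctuate from fold to fold and can hit zero, making
# recall undefined on that fold; StratifiedKFold keeps every fold at ~1% fraud.
```

Note that StratifiedKFold.split requires the labels y in order to stratify, whereas KFold.split accepts them but ignores them.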
The distractor regarding computational cost describes a general characteristic of cross-validation, but the roughly tenfold increase in training time is a planned trade-off for a more robust evaluation and rarely makes the process infeasible; it is not the most critical methodological flaw in this specific scenario.
The distractor about underestimating bias is incorrect; k-fold cross-validation, especially with a high k, provides a low-bias estimate of the test error, because each training fold contains 90% of the data and is therefore nearly the size of the full dataset.
The distractor concerning underfitting is also incorrect. Cross-validation is a technique for assessing (and thereby helping to prevent) overfitting or underfitting by providing a better estimate of generalization error; it does not itself cause underfitting, and in practice the final model is typically retrained on the full dataset once evaluation is complete.
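Putting the pieces together, the sketch below shows one reasonable evaluation setup under these constraints: stratified folds combined with precision, recall, and F1 rather than accuracy. The synthetic data, the LogisticRegression stand-in classifier, and class_weight="balanced" are illustrative assumptions, not part of the original question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data: ~99% legitimate, ~1% fraudulent.
X, y = make_classification(n_samples=5_000, n_features=10,
                           weights=[0.99], flip_y=0, random_state=0)

# Stratified folds plus minority-class-aware metrics.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = LogisticRegression(class_weight="balanced", max_iter=1_000)
scores = cross_validate(model, X, y, cv=cv,
                        scoring=["precision", "recall", "f1"])

for metric in ("precision", "recall", "f1"):
    vals = scores[f"test_{metric}"]
    print(f"{metric:>9}: mean={vals.mean():.3f}  std={vals.std():.3f}")
```

Reporting the per-fold spread alongside the mean matters here: on imbalanced data, high fold-to-fold variance in recall is itself a warning sign about the evaluation.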