CompTIA DataX DY0-001 (V1) Practice Question

A data scientist is building a logistic regression model to detect fraudulent financial transactions. The model uses four features: age, account_balance, number_of_monthly_transactions, and average_transaction_amount. An initial exploratory data analysis using box plots for each individual feature reveals no significant outliers. However, the model's performance is unexpectedly poor, and a residuals vs. leverage plot indicates that a few data points have an unusually high influence on the model's coefficients.

Given this scenario, which of the following methods is the MOST appropriate for identifying these influential, problematic data points?

Generate a scatter plot matrix of all feature pairs.
Apply a Box-Cox transformation to each feature.
Implement an Isolation Forest algorithm on the dataset.
Calculate the Mahalanobis distance for each data point.

Answer Description

The correct answer is to calculate the Mahalanobis distance for each data point. Mahalanobis distance is a multivariate outlier detection measure: it gives the distance of a point from the center of the data (the centroid), scaled by the covariance matrix so that the correlations between the variables are taken into account. Because the univariate box plots showed no outliers, the problem is most likely an unusual combination of feature values, for example a young customer whose account balance and transaction frequency are each within the normal range overall but highly atypical for that age. This is exactly the kind of point Mahalanobis distance is designed to detect. Such multivariate outliers lie far from the bulk of the data in feature space and can therefore exert high leverage on a regression model, which is consistent with the residuals vs. leverage plot.
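To make the idea concrete, here is a minimal sketch in Python with NumPy. It is not part of the original question: the data is synthetic, only two features are used instead of four so the effect is easy to see, and the helper names `mahalanobis_sq` and `iqr_outliers` are made up for illustration. The squared distance is computed as D² = (x − mean)ᵀ S⁻¹ (x − mean), where S is the sample covariance matrix. The cutoff assumes the data is roughly multivariate normal, in which case D² is approximately chi-square distributed; for two features the 97.5% quantile has the closed form −2·ln(0.025).

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    centered = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))  # features are columns
    # Row-wise quadratic form: d_i^2 = c_i^T S^{-1} c_i
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

def iqr_outliers(col):
    """Indices outside the usual box-plot fences (1.5 * IQR rule)."""
    q1, q3 = np.percentile(col, [25, 75])
    iqr = q3 - q1
    return np.flatnonzero((col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr))

# Synthetic data (an assumption for illustration): two strongly
# correlated features, plus one planted point whose values are ordinary
# on each axis but break the correlation pattern.
x = np.arange(20, dtype=float)
y = x + np.where(np.arange(20) % 2 == 0, 0.5, -0.5)
X = np.vstack([np.column_stack([x, y]), [3.0, 16.0]])  # planted row: index 20

# Univariate check, one feature at a time (what the box plots did).
univariate_flags = [iqr_outliers(X[:, j]) for j in range(X.shape[1])]

# Multivariate check.
d2 = mahalanobis_sq(X)
cutoff = -2.0 * np.log(1.0 - 0.975)  # chi-square 97.5% quantile, 2 df only
flagged = np.flatnonzero(d2 > cutoff)

print("per-feature box-plot outliers:", [f.tolist() for f in univariate_flags])
print("Mahalanobis outliers:", flagged.tolist())
```

With four features, as in the question, the same function applies unchanged; only the cutoff would need a chi-square quantile with four degrees of freedom (for example from a statistics library). A robust covariance estimate may be preferable when several outliers are present, because extreme points inflate the ordinary sample covariance.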

Generating a scatter plot matrix is a useful visualization technique but is limited to showing relationships between pairs of variables. It would not reliably identify outliers that only become apparent when considering three or more variables simultaneously.
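An Isolation Forest is a general-purpose anomaly detection algorithm that scores points by how easily random splits isolate them. It can detect multivariate outliers, but its score is not tied to the covariance structure that determines a point's influence in a linear model. Mahalanobis distance is the more fundamental statistical measure here: in ordinary linear regression, a point's leverage is a monotone function of its squared Mahalanobis distance from the centroid of the features. For identifying influential points in a regression setting, Mahalanobis distance is therefore the most direct and classic approach.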
Applying a Box-Cox transformation is a technique used to stabilize variance and make data more closely resemble a normal distribution. Its purpose is to transform the data to better meet model assumptions, not to identify which specific data points are outliers.
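The link between Mahalanobis distance and leverage mentioned above can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original explanation: it uses random synthetic features and ordinary least squares with an intercept, where the identity h_ii = 1/n + D_i² / (n − 1) holds exactly. Logistic regression uses a weighted hat matrix, so there the relationship is only approximate.

```python
import numpy as np

# Synthetic feature matrix (assumption): 50 rows, 4 features.
rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))

# Squared Mahalanobis distance of each row from the feature centroid.
centered = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

# Leverage: diagonal of the hat matrix A (A^T A)^{-1} A^T,
# where A is the design matrix with an intercept column.
A = np.column_stack([np.ones(n), X])
hat_diag = np.einsum("ij,jk,ik->i", A, np.linalg.inv(A.T @ A), A)

# Leverage reconstructed from the Mahalanobis distances alone.
leverage_from_d2 = 1.0 / n + d2 / (n - 1)

print("identity holds:", np.allclose(hat_diag, leverage_from_d2))
```

Because the two quantities rank the points identically in this linear setting, sorting by Mahalanobis distance surfaces the same high-leverage points that a residuals vs. leverage plot highlights.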

CompTIA DataX DY0-001 (V1)

Modeling, Analysis, and Outcomes
