CompTIA DataX Practice Test (DY0-001)

Use the form below to configure your CompTIA DataX Practice Test (DY0-001). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

Questions
Number of questions in the practice test
Free users are limited to 20 questions; upgrade for unlimited questions
Seconds Per Question
Determines how long you have to finish the practice test
Exam Objectives
Which exam objectives should be included in the practice test

CompTIA DataX DY0-001 (V1) Information

CompTIA DataX is an expert-level, vendor-neutral certification aimed at deeply experienced data science professionals. Launched on July 25, 2024, the exam verifies advanced competencies across the full data science lifecycle, from mathematical modeling and machine learning to deployment and specialized applications such as NLP, computer vision, and anomaly detection.

The exam comprehensively covers five key domains:

  • Mathematics and Statistics (~17%)
  • Modeling, Analysis, and Outcomes (~24%)
  • Machine Learning (~24%)
  • Operations and Processes (~22%)
  • Specialized Applications of Data Science (~13%)

It includes a mix of multiple‑choice and performance‑based questions (PBQs), simulating real-world tasks like interpreting data pipelines or optimizing machine learning workflows. The duration is 165 minutes, with a maximum of 90 questions. Scoring is pass/fail only, with no scaled score reported.

Free CompTIA DataX DY0-001 (V1) Practice Test

Press start when you are ready, or press Change to modify any settings for the practice test.

  • Questions: 15
  • Time: Unlimited
  • Included Topics:
    Mathematics and Statistics
    Modeling, Analysis, and Outcomes
    Machine Learning
    Operations and Processes
    Specialized Applications of Data Science

Free Preview

This test is a free preview; no account is required.

Question 1 of 15

A data scientist wants to report a two-sided 95% confidence interval for the true population Pearson correlation between two numerical features. In a random sample of n = 60 observations, the sample correlation is r = 0.58. To use standard normal critical values, which pre-processing step should be applied to the correlation estimate before constructing the confidence interval?

  • Apply the Wilson score method directly to r to obtain the interval.

  • Transform r with Fisher's inverse hyperbolic tangent (z-transformation), build the interval in the transformed space, then back-transform the interval's endpoints.

  • Use a Box-Cox transformation on each variable so that the resulting correlation can be treated as normally distributed.

  • Multiply r by √(n−2)/√(1−r²) and treat the result as standard normal when forming the interval.
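
For reference, a minimal sketch of the Fisher z-based interval for the figures in this question (n = 60, r = 0.58), assuming SciPy is available for the normal critical value:

```python
import math
from scipy import stats

n, r = 60, 0.58

# Fisher z-transformation: z = artanh(r) is approximately normal with SE = 1/sqrt(n - 3)
z = math.atanh(r)
se = 1 / math.sqrt(n - 3)
z_crit = stats.norm.ppf(0.975)            # two-sided 95% critical value

# Build the interval in the transformed space, then back-transform the endpoints
lower, upper = math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
print(round(lower, 2), round(upper, 2))   # roughly 0.38 and 0.73
```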

Question 2 of 15

A data scientist develops a multiple linear regression model to predict housing prices. Upon evaluation, a plot of the model's residuals versus its fitted values reveals a distinct fan shape, where the vertical spread of the residuals increases as the predicted housing price increases. Which of the following statements describes the most critical implication of this observation for the model's statistical inference?

  • The model suffers from severe multicollinearity, making it difficult to isolate the individual impact of each predictor variable.

  • The coefficient estimates are biased, leading to a systematic overestimation or underestimation of the true population parameters.

  • The standard errors of the coefficients are biased, rendering hypothesis tests and confidence intervals unreliable.

  • The residuals are not normally distributed, which violates the primary assumption required for the coefficient estimates to be valid.
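
A fan-shaped residuals-versus-fitted plot is the visual signature of heteroscedasticity. As a minimal sketch (using simulated data and the third-party statsmodels package, neither of which comes from the question), the pattern can also be checked numerically with a Breusch-Pagan test:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)

# Simulated prices whose error spread grows with the predictor (heteroscedastic by design)
x = rng.uniform(1, 10, 200)
y = 50 + 30 * x + rng.normal(0, 5 * x)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Breusch-Pagan: a small p-value flags non-constant residual variance, meaning the
# usual OLS standard errors (and the tests/intervals built on them) are unreliable.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.3g}")
```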

Question 3 of 15

An analytics team is evaluating three nested multiple linear regression models to predict annual energy consumption (kWh) for office buildings. The validation-set summary is:

Model | Predictors | R² | Adjusted R² | RMSE (kWh) | F-statistic p-value
M1 | 6 | 0.88 | 0.879 | 12,400 | <0.001
M2 | 15 | 0.90 | 0.893 | 12,100 | <0.001
M3 | 25 | 0.91 | 0.885 | 12,050 | <0.001

Hardware constraints limit the production model to the smallest set of predictors that still yields clear performance gains. Which single performance metric from the table gives the most defensible basis for deciding which model best achieves this balance?

  • Root-mean-square error (RMSE)

  • F-statistic p-value

  • Adjusted R²
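
For context, adjusted R² applies a penalty for each added predictor: adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). The sketch below uses a hypothetical validation-set size of n = 100, since the question does not state one, so its printed values illustrate the penalty mechanism rather than reproduce the table:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = 100 is a hypothetical validation-set size (not given in the question),
# used only to show how the penalty grows with the number of predictors.
for name, p, r2 in [("M1", 6, 0.88), ("M2", 15, 0.90), ("M3", 25, 0.91)]:
    print(name, round(adjusted_r2(r2, n=100, p=p), 3))
# With n = 100, M2 edges out M3 even though M3 has the higher raw R^2.
```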

Question 4 of 15

A customer analytics team is cleaning a dataset that contains customer age (fully observed), loyalty tier (fully observed), and total annual spending, of which about 18% of the values are missing. Exploratory analysis shows that customers who are younger and those in the highest loyalty tier are less likely to report spending. However, within any given age-tier combination, the probability that spending is missing is unrelated to the true (unobserved) spending amount. Which description best characterizes the missingness mechanism for the spending variable in this situation?

  • Missing Completely at Random due to a random data-entry glitch that uniformly deleted 18% of spending values across the dataset.

  • Missing Completely at Random (MCAR); missingness is unrelated to any observed or unobserved variables.

  • Missing at Random (MAR); the probability of a missing spending value depends only on the observed age and loyalty tier.

  • Missing Not at Random (MNAR); higher or lower spending directly influences the chance that the value is missing, even after accounting for age and tier.
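
To make the MAR idea concrete, here is a small simulation sketch (all column names and probabilities below are invented for illustration) in which missingness depends only on the observed age and loyalty tier, never on the unobserved spending amount:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

age = rng.integers(18, 75, n)
tier = rng.integers(0, 3, n)              # 0 = basic ... 2 = highest loyalty tier
spend = rng.gamma(2.0, 500.0, n)          # true annual spending (fully simulated)

# MAR: the probability of a missing spend value depends only on OBSERVED columns
# (age and tier), not on the spend value itself.
p_missing = 0.06 + 0.25 * (age < 30) + 0.20 * (tier == 2)
mask = rng.random(n) < p_missing

df = pd.DataFrame({"age": age, "tier": tier, "spend": spend})
df.loc[mask, "spend"] = np.nan
print(df["spend"].isna().mean())          # overall share of missing spend values
```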

Question 5 of 15

A data scientist at an aerospace firm has developed a binary classification model to predict catastrophic engine failures. The positive class represents a "failure" event, which is extremely rare in the operational data. The primary business objective is to avoid missing any potential failures, as a single missed event (a False Negative) is unacceptable due to safety implications. The cost of a False Positive (flagging a healthy engine for inspection) is considered minimal. Which classifier performance metric should be prioritized to best evaluate and optimize the model for this specific requirement?

  • Recall

  • Accuracy

  • Precision

  • F1 Score

Question 6 of 15

A data scientist fits a multiple linear regression model with an intercept and six predictor variables (p = 6) to a sample of n = 80 observations. The model's coefficient of determination is R² = 0.37.

Using the classical F-test for overall model significance (H₀: all slope coefficients = 0), what is the value of the F statistic that should be reported?

  • Approximately 37.0

  • Approximately 0.59

  • Approximately 12.2

  • Approximately 7.1
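
For reference, the overall F statistic follows directly from R², p, and n: F = (R²/p) / ((1 − R²)/(n − p − 1)), with p and n − p − 1 degrees of freedom. A quick check with the question's figures (SciPy is used only for the p-value):

```python
from scipy import stats

n, p, r2 = 80, 6, 0.37

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1)), with (p, n - p - 1) degrees of freedom
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))
p_value = stats.f.sf(f_stat, p, n - p - 1)
print(round(f_stat, 1), f"p = {p_value:.2g}")   # F is approximately 7.1
```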

Question 7 of 15

A data scientist is conducting a survival analysis to model customer churn for a subscription-based service. The dataset includes the tenure of each customer and a status indicator for whether they have churned or are still active (censored data). An initial, non-parametric Kaplan-Meier estimator was used to visualize the survival probability.

The next objective is to understand how covariates, such as the customer's subscription plan and monthly spending, influence the risk of churn over time. The data scientist wants to quantify the effect of these covariates but is hesitant to make a strong assumption about the specific shape of the underlying baseline hazard function.

Given these requirements, which of the following models is the most appropriate choice?

  • Weibull AFT model

  • Kaplan-Meier estimator

  • ARIMA model

  • Cox Proportional Hazards model
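
As a workflow illustration only (not part of the question), a semi-parametric proportional-hazards fit with the third-party lifelines library might look like the sketch below. It uses the library's bundled load_rossi example data; the churn analysis would instead pass the customer DataFrame with the tenure column as duration_col and the churn indicator as event_col:

```python
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

# load_rossi() ships with lifelines: 'week' is the duration and 'arrest' the event flag,
# standing in here for customer tenure and the churn indicator.
data = load_rossi()

# Cox PH estimates covariate hazard ratios while leaving the baseline hazard
# unspecified, i.e. no assumption about its shape is needed.
cph = CoxPHFitter()
cph.fit(data, duration_col="week", event_col="arrest")
cph.print_summary()
```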

Question 8 of 15

An automated trading-surveillance system must ensure that at least 70% of the orders it flags as suspicious are truly manipulative, so the compliance team has set a minimum precision of 0.70. Two candidate classifiers were evaluated on a validation set of 50,000 historical orders with the following confusion-matrix counts:

Classifier X - TP = 260, FP = 110, FN = 190, TN = 47,440
Classifier Y - TP = 350, FP = 180, FN = 100, TN = 47,370

Which option correctly identifies the classifier(s) that meet the compliance requirement and states the corresponding precision value?

  • Only Classifier X satisfies the requirement with a precision of approximately 0.70

  • Neither classifier satisfies the requirement because both precisions are below 0.70

  • Both classifiers satisfy the requirement because each has a precision above 0.70

  • Only Classifier Y satisfies the requirement with a precision of approximately 0.66
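
The compliance check reduces to precision = TP / (TP + FP), computed from the counts in the question:

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

# Counts from the question's validation results
print(f"Classifier X: {precision(260, 110):.3f}")   # 260 / 370 ≈ 0.703
print(f"Classifier Y: {precision(350, 180):.3f}")   # 350 / 530 ≈ 0.660
```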

Question 9 of 15

A data scientist is building a decision tree classifier to predict customer churn. At a specific node containing 20 samples, 10 customers have churned and 10 have not. The scientist is evaluating two features, 'Contract Type' and 'Has Tech Support', to determine the optimal split. The results of splitting by each feature are as follows:

  • Split by 'Contract Type':

    • Node A ('Month-to-Month'): 12 samples (9 Churn, 3 No Churn)
    • Node B ('One/Two Year'): 8 samples (1 Churn, 7 No Churn)
  • Split by 'Has Tech Support':

    • Node C ('Yes'): 10 samples (3 Churn, 7 No Churn)
    • Node D ('No'): 10 samples (7 Churn, 3 No Churn)

Given that the algorithm uses entropy to maximize information gain, which of the following conclusions is correct?

  • The 'Has Tech Support' feature should be selected because its resulting split has a lower weighted average entropy than the 'Contract Type' split.

  • The 'Has Tech Support' feature should be selected because its child nodes are perfectly balanced in size (10 samples each), which maximizes the reduction in impurity.

  • The 'Contract Type' feature should be selected because its resulting split has a lower weighted average entropy (approximately 0.705) than the 'Has Tech Support' split (approximately 0.881).

  • The information gain for both splits is equal, so the Gini index must be calculated to determine the optimal feature.
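
For reference, the weighted entropies quoted in the options can be reproduced in a few lines:

```python
from math import log2

def entropy(pos: int, neg: int) -> float:
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c > 0)

def weighted_entropy(children) -> float:
    n = sum(pos + neg for pos, neg in children)
    return sum((pos + neg) / n * entropy(pos, neg) for pos, neg in children)

# (Churn, No Churn) counts from the question
print(round(weighted_entropy([(9, 3), (1, 7)]), 3))   # Contract Type split: ~0.704
print(round(weighted_entropy([(3, 7), (7, 3)]), 3))   # Has Tech Support split: ~0.881
```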

Question 10 of 15

A data scientist develops a classification model to identify fraudulent financial transactions. The test dataset contains 1,000,000 transactions, of which 1,000 (0.1%) are fraudulent. After testing, the model produces the following confusion matrix:

 | Predicted: Fraud | Predicted: Not Fraud
Actual: Fraud | 800 (TP) | 200 (FN)
Actual: Not Fraud | 500 (FP) | 998,500 (TN)

The primary business objective is to minimize the number of missed fraudulent transactions (False Negatives), even at the cost of flagging some legitimate transactions for review (False Positives). Given this objective and the severe class imbalance, which performance metric provides the most relevant assessment of the model's effectiveness for its intended purpose?

  • Recall

  • Accuracy

  • Precision

  • Matthews Correlation Coefficient (MCC)
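
For reference, the headline metrics can be read straight off this confusion matrix, which also shows why accuracy is uninformative under this degree of imbalance:

```python
tp, fn, fp, tn = 800, 200, 500, 998_500

recall = tp / (tp + fn)                        # 0.80: share of actual fraud caught
precision = tp / (tp + fp)                     # ~0.615
accuracy = (tp + tn) / (tp + tn + fp + fn)     # ~0.9993, inflated by the majority class
print(recall, round(precision, 3), round(accuracy, 4))
```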

Question 11 of 15

You are implementing a Monte Carlo simulator for network-packet jitter. The only random-number source available returns independent samples from the continuous uniform distribution on (0, 1). To feed the noise model, the simulator must generate pairs of independent standard normal (mean 0, variance 1) random variables on every call. Which one of the following transformations of two independent Uniform(0, 1) samples U1 and U2 will correctly produce the required standard normal variables Z1 and Z2?

  • Z1 = √(-2 ln U1) cos(2π U2); Z2 = √(-2 ln U1) sin(2π U2)

  • Z1 = ln (U1) / √2; Z2 = ln (U2) / √2

  • Z1 = √(-2 ln (U1 / U2)); Z2 = √(-2 ln (U2 / U1))

  • Z1 = √(-2 ln U1) cos(π U2); Z2 = √(-2 ln U2) sin(π U1)
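
The transformation in question is the Box-Muller method; a minimal sketch built only on Uniform(0, 1) draws:

```python
import math
import random

def box_muller(u1: float, u2: float) -> tuple[float, float]:
    """Map two independent Uniform(0,1) draws to two independent N(0,1) draws."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)

# Sanity check: the sample mean should be near 0 and the variance near 1.
# (1 - random.random() keeps u1 strictly positive so log(u1) is defined.)
samples = [z for _ in range(50_000)
           for z in box_muller(1 - random.random(), random.random())]
mean = sum(samples) / len(samples)
var = sum((z - mean) ** 2 for z in samples) / len(samples)
print(round(mean, 3), round(var, 3))
```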

Question 12 of 15

An SRE team is analyzing the daily count of service outages for a cloud platform. Over the last 365 days the observed frequencies are: 0 outages on 310 days, 1 outage on 45 days, and 2 outages on 10 days (no day had more than two outages). The sample mean is 0.18 outages per day and the sample variance is 0.20. To develop a generative model for the number of outages per day, which distribution and supporting rationale provides the most statistically appropriate starting point?

  • Binomial distribution - because the count of outages can be viewed as successes in 365 daily trials with variance np(1 − p).

  • Power law distribution - heavy-tailed behavior explains low-probability, high-impact outage counts.

  • Poisson distribution - the near-equality of the sample mean and variance supports a Poisson rate parameter λ ≈ 0.18 for rare, independent daily outages.

  • Student's t-distribution - its heavier tails better model the occasional two-outage days in a small sample.
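
For reference, the mean-variance check and the implied Poisson fit can be reproduced from the observed frequencies (SciPy supplies the Poisson pmf):

```python
from scipy import stats

# Observed daily outage counts: 310 days with 0, 45 days with 1, 10 days with 2
counts = [0] * 310 + [1] * 45 + [2] * 10
n = len(counts)

mean = sum(counts) / n
var = sum((c - mean) ** 2 for c in counts) / (n - 1)
print(round(mean, 2), round(var, 2))      # ~0.18 and ~0.20: nearly equal, as Poisson implies

# Expected number of days with 0, 1, and 2 outages under Poisson(lambda = sample mean)
expected = [n * stats.poisson.pmf(k, mean) for k in range(3)]
print([round(e, 1) for e in expected])
```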

Question 13 of 15

A data scientist is building a decision tree classifier to predict customer churn. They are evaluating a potential split on a categorical feature. The parent node contains 100 samples, with 50 belonging to the 'Churn' class and 50 to the 'No Churn' class. The proposed split creates two child nodes:

  • Child Node 1: 60 samples, with 40 'Churn' and 20 'No Churn'.
  • Child Node 2: 40 samples, with 10 'Churn' and 30 'No Churn'.

To evaluate the quality of this split, what is the weighted Gini impurity?

  • 0.417

  • 0.083

  • 0.410

  • 0.500
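
For reference, the weighted Gini impurity of the proposed split works out as follows:

```python
def gini(pos: int, neg: int) -> float:
    total = pos + neg
    return 1.0 - (pos / total) ** 2 - (neg / total) ** 2

# (Churn, No Churn) counts from the question
children = [(40, 20), (10, 30)]
n = sum(a + b for a, b in children)

weighted = sum((a + b) / n * gini(a, b) for a, b in children)
print(round(weighted, 3))   # 0.6 * 0.444 + 0.4 * 0.375 ≈ 0.417
```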

Question 14 of 15

A data scientist is analyzing a clinical trial dataset that includes the variables patient_age and systolic_blood_pressure (SBP). They observe that a significant number of SBP values are missing. Upon further investigation, the data scientist discovers that the probability of an SBP value being missing is correlated with patient_age, with younger patients being more likely to have a missing SBP value. However, within any specific age group, the reason for the missing SBP value is not related to the actual (unobserved) blood pressure level or any other unmeasured factor. Which type of missingness does this scenario describe?

  • Missing Completely at Random (MCAR)

  • Missing at Random (MAR)

  • Structural Missingness

  • Not Missing at Random (NMAR)

Question 15 of 15

An analyst is investigating the linear association between two continuous variables X and Y using n = 7 paired observations. The following summary statistics are available:

  • Sample standard deviation of X: s_X = 4
  • Sample standard deviation of Y: s_Y = 3
  • Sum of cross-products of deviations: Σ(x_i − x̄)(y_i − ȳ) = 36

Using Pearson's correlation coefficient and testing at the α = 0.05 significance level (two-tailed) for H₀: ρ = 0, which statement correctly states both the value of the sample correlation r and the appropriate decision on the null hypothesis?

  • r ≈ 0.83 and fail to reject the null hypothesis (no statistically significant linear correlation)

  • r = 0.50 and fail to reject the null hypothesis (no statistically significant linear correlation)

  • r ≈ 0.83 and reject the null hypothesis (statistically significant linear correlation)

  • r = 0.50 and reject the null hypothesis (statistically significant linear correlation)
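
For reference, both quantities follow from the summary statistics: r = Σ(x_i − x̄)(y_i − ȳ) / ((n − 1) s_X s_Y), and the test statistic t = r√(n − 2)/√(1 − r²) is compared with the two-tailed critical value of a t distribution with n − 2 degrees of freedom:

```python
import math
from scipy import stats

n, s_x, s_y, sxy = 7, 4, 3, 36

r = sxy / ((n - 1) * s_x * s_y)                   # 36 / (6 * 4 * 3) = 0.50
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # ≈ 1.29
t_crit = stats.t.ppf(0.975, df=n - 2)             # ≈ 2.57 for df = 5

print(r, round(t, 2), "reject H0" if abs(t) > t_crit else "fail to reject H0")
```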