CompTIA DataX Practice Test (DY0-001)
Use the form below to configure your CompTIA DataX Practice Test (DY0-001). The practice test can be configured to only include certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

CompTIA DataX DY0-001 (V1) Information
CompTIA DataX is an expert‑level, vendor‑neutral certification aimed at deeply experienced data science professionals. Launched on July 25, 2024, the exam verifies advanced competencies across the full data science lifecycle - from mathematical modeling and machine learning to deployment and specialized applications like NLP, computer vision, and anomaly detection.
The exam comprehensively covers five key domains:
- Mathematics and Statistics (~17%)
- Modeling, Analysis, and Outcomes (~24%)
- Machine Learning (~24%)
- Operations and Processes (~22%)
- Specialized Applications of Data Science (~13%)
It includes a mix of multiple‑choice and performance‑based questions (PBQs), simulating real-world tasks like interpreting data pipelines or optimizing machine learning workflows. The duration is 165 minutes, with a maximum of 90 questions. Scoring is pass/fail only, with no scaled score reported.
Free CompTIA DataX DY0-001 (V1) Practice Test
Press start when you are ready, or press Change to modify any settings for the practice test.
- Questions: 15
- Time: Unlimited
- Included Topics: Mathematics and Statistics; Modeling, Analysis, and Outcomes; Machine Learning; Operations and Processes; Specialized Applications of Data Science
Free Preview
This test is a free preview; no account is required.
Subscribe to unlock all content, keep track of your scores, and access AI features!
A data scientist wants to report a two-sided 95% confidence interval for the true population Pearson correlation between two numerical features. In a random sample of n = 60 observations, the sample correlation is r = 0.58. To use standard normal critical values, which pre-processing step should be applied to the correlation estimate before constructing the confidence interval?
Apply the Wilson score method directly to r to obtain the interval.
Transform r with Fisher's inverse hyperbolic tangent (z-transformation), build the interval in the transformed space, then back-transform the interval's endpoints.
Use a Box-Cox transformation on each variable so that the resulting correlation can be treated as normally distributed.
Multiply r by √(n−2)/√(1−r²) and treat the result as standard normal when forming the interval.
Answer Description
Because the sampling distribution of Pearson's r is skewed and its variance depends on the unknown population correlation (ρ), a direct calculation using normal theory is inappropriate. Fisher's z-transformation, z = atanh(r) = ½ ln[(1+r)/(1−r)], is a variance-stabilizing transform that makes the resulting statistic, z, approximately normally distributed as N(atanh(ρ), 1/(n−3)). A 95% interval for this transformed value is therefore z ± 1.96 / √(n−3). Applying the inverse transform (tanh) to the interval's endpoints yields the confidence interval for ρ. The Wilson score interval is designed for binomial proportions. A Box-Cox transformation applies to the raw data, not the correlation coefficient r. The statistic r√(n−2)/√(1−r²) follows a t-distribution and is used for hypothesis testing, not interval estimation.
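To make the procedure concrete, here is a minimal Python sketch (using NumPy and SciPy, with the question's r = 0.58 and n = 60) that builds the interval in the transformed space and back-transforms the endpoints:

```python
import numpy as np
from scipy import stats

r, n = 0.58, 60                      # sample correlation and sample size from the question
z = np.arctanh(r)                    # Fisher z-transform: atanh(r) = 0.5 * ln((1+r)/(1-r))
se = 1 / np.sqrt(n - 3)              # approximate standard error of z
z_crit = stats.norm.ppf(0.975)       # two-sided 95% critical value (≈ 1.96)

lo, hi = z - z_crit * se, z + z_crit * se
ci = np.tanh([lo, hi])               # back-transform the endpoints to the correlation scale
print(f"95% CI for rho: ({ci[0]:.3f}, {ci[1]:.3f})")
```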
A data scientist develops a multiple linear regression model to predict housing prices. Upon evaluation, a plot of the model's residuals versus its fitted values reveals a distinct fan shape, where the vertical spread of the residuals increases as the predicted housing price increases. Which of the following statements describes the most critical implication of this observation for the model's statistical inference?
The model suffers from severe multicollinearity, making it difficult to isolate the individual impact of each predictor variable.
The coefficient estimates are biased, leading to a systematic overestimation or underestimation of the true population parameters.
The standard errors of the coefficients are biased, rendering hypothesis tests and confidence intervals unreliable.
The residuals are not normally distributed, which violates the primary assumption required for the coefficient estimates to be valid.
Answer Description
The correct answer is that the standard errors of the coefficients are biased, which renders hypothesis tests and confidence intervals unreliable. The fan-shaped pattern in the residual plot is a classic indicator of heteroskedasticity, which means the variance of the error term is not constant across all levels of the independent variables. In the presence of heteroskedasticity, Ordinary Least Squares (OLS) coefficient estimates remain unbiased, but they are no longer efficient (i.e., not BLUE - Best Linear Unbiased Estimators). The primary issue for statistical inference is that the formulas used to calculate the variance and standard errors of the coefficients, which assume homoskedasticity, become biased. This bias in the standard errors leads to unreliable t-statistics, p-values, and confidence intervals, potentially causing the analyst to draw incorrect conclusions about the statistical significance of the predictor variables.
The coefficient estimates themselves do not become biased due to heteroskedasticity in an OLS model. Multicollinearity is a separate issue related to high correlation between predictor variables, not the variance of the residuals. While the normality of residuals is another OLS assumption, the fan shape specifically points to non-constant variance (heteroskedasticity), not necessarily a deviation from a normal distribution.
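As an illustration, the following Python sketch generates synthetic housing data with the kind of fan-shaped error structure described above, detects heteroskedasticity with a Breusch-Pagan test, and refits the inference with heteroskedasticity-robust (HC3) standard errors. The data and variable names are assumptions for demonstration only, not part of the question:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Hypothetical data: error spread grows with house size, mimicking the fan-shaped residual plot
rng = np.random.default_rng(0)
size = rng.uniform(50, 400, 500)
price = 1000 + 900 * size + rng.normal(0, 2 * size, 500)

X = sm.add_constant(size)
ols = sm.OLS(price, X).fit()

# Breusch-Pagan test: a small p-value suggests heteroskedasticity
lm_stat, lm_pval, _, _ = het_breuschpagan(ols.resid, ols.model.exog)
print(f"Breusch-Pagan p-value: {lm_pval:.4g}")

# Same coefficient estimates, but heteroskedasticity-robust (HC3) standard errors for valid inference
robust = ols.get_robustcov_results(cov_type="HC3")
print(robust.summary())
```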
An analytics team is evaluating three nested multiple linear regression models to predict annual energy consumption (kWh) for office buildings. The validation-set summary is:
| Model | Predictors | R² | Adjusted R² | RMSE (kWh) | F-statistic p-value |
|---|---|---|---|---|---|
| M1 | 6 | 0.88 | 0.879 | 12 400 | <0.001 |
| M2 | 15 | 0.90 | 0.893 | 12 100 | <0.001 |
| M3 | 25 | 0.91 | 0.885 | 12 050 | <0.001 |
Hardware constraints limit the production model to the smallest set of predictors that still yields clear performance gains. Which single performance metric from the table gives the most defensible basis for deciding which model best achieves this balance?
R²
Root-mean-square error (RMSE)
F-statistic p-value
Adjusted R²
Answer Description
Adjusted R² modifies the ordinary R² by incorporating both sample size and the number of predictors, so it rises only when additional variables reduce the residual variance more than would be expected by chance. It therefore rewards genuine improvement while penalizing unnecessary complexity. In the table, Model 2 attains the highest adjusted R², indicating the best trade-off between parsimony and predictive power. Plain R² and RMSE both improve (or stay nearly the same) as more predictors are added, so they cannot flag overfitting. The F-statistic p-value only tests whether each model outperforms an intercept-only model; because all three p-values are identical, it offers no guidance for choosing among the competing models.
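For reference, a short Python sketch of the adjusted R² formula is shown below. The validation-set size n is not given in the table, so the value used here is an assumption for illustration; the exact figures depend on the true n.

```python
# Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)
def adjusted_r2(r2: float, n: int, p: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 100  # assumed validation-set size, for illustration only
for name, p, r2 in [("M1", 6, 0.88), ("M2", 15, 0.90), ("M3", 25, 0.91)]:
    print(f"{name}: adjusted R² ≈ {adjusted_r2(r2, n, p):.3f}")
```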
A customer analytics team is cleaning a dataset that contains customer age (fully observed), loyalty tier (fully observed), and total annual spending, of which about 18 % of the values are missing. Exploratory analysis shows that customers who are younger and those in the highest loyalty tier are less likely to report spending. However, within any given age-tier combination, the probability that spending is missing is unrelated to the true (unobserved) spending amount. Which description best characterizes the missingness mechanism for the spending variable in this situation?
Missing Completely at Random due to a random data-entry glitch that uniformly deleted 18 % of spending values across the dataset.
Missing Completely at Random (MCAR); missingness is unrelated to any observed or unobserved variables.
Missing at Random (MAR); the probability of a missing spending value depends only on the observed age and loyalty tier.
Missing Not at Random (MNAR); higher or lower spending directly influences the chance that the value is missing, even after accounting for age and tier.
Answer Description
The missingness depends on two fully observed variables-age and loyalty tier-but, conditional on them, it is not related to the spending values that are actually missing. This matches the definition of Missing at Random (MAR). Under MAR, the missing-data mechanism is considered ignorable for likelihood-based models or multiple imputation, provided the observed predictors that drive missingness are included in the analysis. The mechanism is not Missing Completely at Random (MCAR) because younger, high-tier customers have a higher propensity for missingness, and it is not Missing Not at Random (MNAR) because spending itself does not influence whether it is missing once age and tier are taken into account.
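A quick way to see why this pattern is MAR rather than MCAR is to compare missingness rates across the observed variables. The Python sketch below uses synthetic data (an assumption, since the team's dataset is not provided) in which missingness in spending is driven only by age and tier:

```python
import numpy as np
import pandas as pd

# Synthetic customer data: missingness in 'spending' depends only on the observed
# 'age' and 'tier' columns, matching the MAR mechanism described in the question.
rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({
    "age": rng.integers(18, 75, n),
    "tier": rng.choice(["bronze", "silver", "gold"], n, p=[0.5, 0.3, 0.2]),
})
df["spending"] = 200 + 15 * df["age"] + rng.normal(0, 300, n)

p_miss = 0.30 * (df["age"] < 30) + 0.25 * (df["tier"] == "gold") + 0.05
df.loc[rng.random(n) < p_miss, "spending"] = np.nan

# Diagnostic: the missingness rate varies across observed groups (consistent with MAR, not MCAR)
rates = (df.assign(age_band=pd.cut(df["age"], [17, 30, 50, 75]),
                   missing=df["spending"].isna())
           .groupby(["tier", "age_band"], observed=True)["missing"].mean())
print(rates)
```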
A data scientist at an aerospace firm has developed a binary classification model to predict catastrophic engine failures. The positive class represents a "failure" event, which is extremely rare in the operational data. The primary business objective is to avoid missing any potential failures, as a single missed event (a False Negative) is unacceptable due to safety implications. The cost of a False Positive (flagging a healthy engine for inspection) is considered minimal. Which classifier performance metric should be prioritized to best evaluate and optimize the model for this specific requirement?
Recall
Accuracy
Precision
F1 Score
Answer Description
The correct answer is Recall. Recall, also known as sensitivity or the true positive rate, is calculated as TP / (TP + FN), where TP is True Positives and FN is False Negatives. In scenarios where the cost of a False Negative is very high, such as failing to predict a critical equipment failure, maximizing recall is the primary objective. This metric directly measures the model's ability to identify all actual positive instances.
- Precision, calculated as TP / (TP + FP), measures the accuracy of positive predictions. Prioritizing precision would aim to reduce False Positives, which is not the main concern in this scenario.
- Accuracy is not suitable for highly imbalanced datasets because a model can achieve a high accuracy score by simply predicting the majority class, while completely failing to identify the rare, critical events.
- The F1 score is the harmonic mean of precision and recall and seeks a balance between them. While useful, it does not specifically prioritize minimizing False Negatives, which is the explicit goal here.
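A small Python sketch with assumed confusion-matrix counts (the question gives none) shows how recall isolates the false-negative cost while accuracy stays misleadingly high:

```python
# Hypothetical counts for a rare-failure classifier; values are illustrative only
TP, FN, FP, TN = 45, 5, 900, 99_050

recall    = TP / (TP + FN)                   # sensitivity / true positive rate: the metric to prioritize
precision = TP / (TP + FP)
accuracy  = (TP + TN) / (TP + FN + FP + TN)
f1        = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f}  precision={precision:.3f}  accuracy={accuracy:.4f}  F1={f1:.3f}")
# Accuracy stays near 0.99 even if failures are missed; recall directly tracks the false-negative cost.
```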
A data scientist fits a multiple linear regression model with an intercept and six predictor variables (p = 6) to a sample of n = 80 observations. The model's coefficient of determination is R² = 0.37.
Using the classical F-test for overall model significance (H₀: all slope coefficients = 0), what is the value of the F statistic that should be reported?
Approximately 37.0
Approximately 0.59
Approximately 12.2
Approximately 7.1
Answer Description
For the overall significance test in multiple regression, the F statistic can be written in terms of R²:
F = (R² / p) ÷ [(1 - R²) / (n - p - 1)]
Substituting the values:
- Numerator term: R² / p = 0.37 / 6 ≈ 0.0617
- Denominator term: (1 - R²) / (n - p - 1) = 0.63 / 73 ≈ 0.00863
F = 0.0617 / 0.00863 ≈ 7.15
Because this value greatly exceeds typical critical values for the F(6, 73) distribution at α = 0.05 (≈ 2.20), the null hypothesis would be rejected.
The distractor values reflect common errors: computing R² / (1 − R²) alone (≈0.59), using (n − p − 1) / p alone (≈12.2), or reporting R² as a percentage rather than computing the F statistic (37.0).
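The same computation in Python, using SciPy for the F(6, 73) reference distribution:

```python
from scipy import stats

R2, n, p = 0.37, 80, 6
F = (R2 / p) / ((1 - R2) / (n - p - 1))            # ≈ 7.15
f_crit = stats.f.ppf(0.95, dfn=p, dfd=n - p - 1)   # ≈ 2.2, the alpha = 0.05 critical value
p_value = stats.f.sf(F, dfn=p, dfd=n - p - 1)      # upper-tail probability under F(6, 73)

print(f"F ≈ {F:.2f}, critical value ≈ {f_crit:.2f}, p-value ≈ {p_value:.2g}")
```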
A data scientist is conducting a survival analysis to model customer churn for a subscription-based service. The dataset includes the tenure of each customer and a status indicator for whether they have churned or are still active (censored data). The initial analysis with a non-parametric Kaplan-Meier estimator was used to visualize the survival probability.
The next objective is to understand how covariates, such as the customer's subscription plan and monthly spending, influence the risk of churn over time. The data scientist wants to quantify the effect of these covariates but is hesitant to make a strong assumption about the specific shape of the underlying baseline hazard function.
Given these requirements, which of the following models is the most appropriate choice?
Weibull AFT model
Kaplan-Meier estimator
ARIMA model
Cox Proportional Hazards model
Answer Description
The correct answer is the Cox Proportional Hazards model. This model is a semi-parametric regression model and is ideal for this scenario because it allows for the estimation of the effects of covariates (like subscription plan and spending) on the hazard rate without making any assumptions about the shape of the baseline hazard function. This directly addresses the requirement to quantify covariate effects while avoiding strong distributional assumptions.
- The Kaplan-Meier estimator is a non-parametric method used to estimate and visualize the survival function. While useful for initial analysis, it cannot incorporate multiple or continuous covariates into a regression framework to quantify their individual effects on the hazard rate.
- The Weibull AFT (Accelerated Failure Time) model is a fully parametric model. It requires the assumption that survival times follow a specific distribution (the Weibull distribution). This contradicts the data scientist's goal of avoiding strong assumptions about the underlying distribution.
- An ARIMA model is used for time series forecasting, which analyzes data points collected over time to predict future values (e.g., monthly sales). It is not designed for time-to-event analysis, which involves understanding the duration until an event occurs and must account for censored data and individual-level covariates.
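A minimal sketch of fitting such a model with the lifelines library (assuming it is installed) is shown below; the DataFrame columns and toy values are illustrative, not from the question:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Toy churn data: 'churned' = 0 marks customers who are still active (right-censored)
df = pd.DataFrame({
    "tenure_months": [3, 12, 24, 7, 30, 18, 5, 26],
    "churned":       [1, 1, 0, 1, 0, 1, 0, 0],
    "monthly_spend": [80, 45, 40, 95, 65, 60, 50, 55],
    "premium_plan":  [0, 1, 1, 0, 1, 0, 0, 1],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="tenure_months", event_col="churned")
cph.print_summary()                 # hazard ratios quantify each covariate's effect on churn risk
# cph.check_assumptions(df)         # optional: test the proportional-hazards assumption
```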
An automated trading-surveillance system must ensure that at least 70 % of the orders it flags as suspicious are truly manipulative, so the compliance team has set a minimum precision of 0.70. Two candidate classifiers were evaluated on a validation set of 50 000 historical orders with the following confusion-matrix counts:
Classifier X - TP = 260, FP = 110, FN = 190, TN = 47 440
Classifier Y - TP = 350, FP = 180, FN = 100, TN = 47 370
Which option correctly identifies the classifier(s) that meet the compliance requirement and states the corresponding precision value?
Only Classifier X satisfies the requirement with a precision of approximately 0.70
Neither classifier satisfies the requirement because both precisions are below 0.70
Both classifiers satisfy the requirement because each has a precision above 0.70
Only Classifier Y satisfies the requirement with a precision of approximately 0.66
Answer Description
Precision is defined as TP / (TP + FP), the proportion of positive predictions that are correct.
- Classifier X: 260 / (260 + 110) ≈ 0.703 (> 0.70).
- Classifier Y: 350 / (350 + 180) ≈ 0.660 (< 0.70).
Because only Classifier X achieves a precision of at least 0.70, it alone satisfies the compliance requirement. Classifier Y falls short despite having more true positives, because it also generates more false positives, lowering its precision.
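The check can be reproduced with a few lines of Python:

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

for name, tp, fp in [("X", 260, 110), ("Y", 350, 180)]:
    p = precision(tp, fp)
    verdict = "meets" if p >= 0.70 else "fails"
    print(f"Classifier {name}: precision = {p:.3f} -> {verdict} the 0.70 requirement")
```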
A data scientist is building a decision tree classifier to predict customer churn. At a specific node containing 20 samples, 10 customers have churned and 10 have not. The scientist is evaluating two features, 'Contract Type' and 'Has Tech Support', to determine the optimal split. The results of splitting by each feature are as follows:
Split by 'Contract Type':
- Node A ('Month-to-Month'): 12 samples (9 Churn, 3 No Churn)
- Node B ('One/Two Year'): 8 samples (1 Churn, 7 No Churn)
Split by 'Has Tech Support':
- Node C ('Yes'): 10 samples (3 Churn, 7 No Churn)
- Node D ('No'): 10 samples (7 Churn, 3 No Churn)
Given that the algorithm uses entropy to maximize information gain, which of the following conclusions is correct?
The 'Has Tech Support' feature should be selected because its resulting split has a lower weighted average entropy than the 'Contract Type' split.
The 'Has Tech Support' feature should be selected because its child nodes are perfectly balanced in size (10 samples each), which maximizes the reduction in impurity.
The 'Contract Type' feature should be selected because its resulting split has a lower weighted average entropy (approximately 0.705) than the 'Has Tech Support' split (approximately 0.881).
The information gain for both splits is equal, so the Gini index must be calculated to determine the optimal feature.
Answer Description
The correct answer is that 'Contract Type' should be selected because its split results in a lower weighted average entropy. The goal of a decision tree split is to maximize Information Gain, which is equivalent to minimizing the weighted average entropy of the child nodes.
The calculation is as follows:
Step 1: Calculate the entropy for each child node. The formula for entropy is: E = -p * log2(p) - (1-p) * log2(1-p).
- E(Node A), 9/12 Churn: -( (9/12) * log2(9/12) + (3/12) * log2(3/12) ) ≈ 0.811
- E(Node B), 1/8 Churn: -( (1/8) * log2(1/8) + (7/8) * log2(7/8) ) ≈ 0.544
- E(Node C), 3/10 Churn: -( (3/10) * log2(3/10) + (7/10) * log2(7/10) ) ≈ 0.881
- E(Node D), 7/10 Churn: -( (7/10) * log2(7/10) + (3/10) * log2(3/10) ) ≈ 0.881
Step 2: Calculate the weighted average entropy for each split, which is the sum of (samples_in_child / total_samples) * entropy_of_child.
- Weighted entropy('Contract Type') = (12/20) * 0.811 + (8/20) * 0.544 = 0.6 * 0.811 + 0.4 * 0.544 ≈ 0.487 + 0.218 = 0.705
- Weighted entropy('Has Tech Support') = (10/20) * 0.881 + (10/20) * 0.881 = 0.881
Step 3: Compare the results. The split on 'Contract Type' (0.705) has a lower weighted average entropy than the split on 'Has Tech Support' (0.881). Therefore, 'Contract Type' yields a higher information gain and is the better split.
The other options are incorrect. The 'Has Tech Support' split has a higher weighted entropy, making it the less desirable choice. The balance of sample sizes in the child nodes for 'Has Tech Support' does not guarantee higher information gain; the purity of the classes within those nodes is what matters. Finally, calculating the Gini index is an alternative to entropy, not a necessary tie-breaker.
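The calculation can be verified with a short Python sketch:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def weighted_entropy(children):
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

parent   = (10, 10)              # (churn, no-churn) in the parent node
contract = [(9, 3), (1, 7)]      # Node A, Node B
support  = [(3, 7), (7, 3)]      # Node C, Node D

for name, split in [("Contract Type", contract), ("Has Tech Support", support)]:
    we = weighted_entropy(split)
    print(f"{name}: weighted entropy ≈ {we:.3f}, information gain ≈ {entropy(parent) - we:.3f}")
```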
A data scientist develops a classification model to identify fraudulent financial transactions. The test dataset contains 1,000,000 transactions, of which 1,000 (0.1%) are fraudulent. After testing, the model produces the following confusion matrix:
| | Predicted: Fraud | Predicted: Not Fraud |
|---|---|---|
| Actual: Fraud | 800 (TP) | 200 (FN) |
| Actual: Not Fraud | 500 (FP) | 998,500 (TN) |
The primary business objective is to minimize the number of missed fraudulent transactions (False Negatives), even at the cost of flagging some legitimate transactions for review (False Positives). Given this objective and the severe class imbalance, which performance metric provides the most relevant assessment of the model's effectiveness for its intended purpose?
Recall
Accuracy
Precision
Matthews Correlation Coefficient (MCC)
Answer Description
The correct answer is Recall.
Recall (Sensitivity or True Positive Rate) is calculated as TP / (TP + FN). It measures the proportion of actual positive cases that the model correctly identified. In this scenario, Recall = 800 / (800 + 200) = 80%. This metric directly addresses the business objective of minimizing missed fraudulent transactions (False Negatives). A high recall indicates that the model is effective at identifying the vast majority of actual fraud cases.
Accuracy is incorrect because it is a misleading metric for datasets with severe class imbalance. It is calculated as (TP + TN) / Total, which in this case is (800 + 998,500) / 1,000,000 = 99.93%. While this number seems very high, a naive model that predicts "Not Fraud" for every transaction would achieve 99.9% accuracy, making it a poor indicator of the model's ability to detect the rare positive class.
Precision is incorrect in this context. Precision is calculated as TP / (TP + FP) and measures the proportion of positive predictions that were actually correct. Here, Precision = 800 / (800 + 500) = 61.5%. This metric is important when the cost of a False Positive is high. However, the business objective explicitly prioritizes minimizing False Negatives over False Positives, making Recall the more relevant metric.
Matthews Correlation Coefficient (MCC) is a sophisticated and generally robust metric for imbalanced datasets because it considers all four cells of the confusion matrix. However, the question asks for the metric that is most relevant to the specific business objective of minimizing False Negatives. While MCC provides a balanced, overall score, Recall is the most direct and explicit measure of the model's performance against that particular goal.
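The metrics can be reproduced directly from the question's confusion-matrix counts:

```python
from math import sqrt

TP, FN, FP, TN = 800, 200, 500, 998_500   # counts from the confusion matrix above

recall    = TP / (TP + FN)                           # 0.80
precision = TP / (TP + FP)                           # ≈ 0.615
accuracy  = (TP + TN) / (TP + FN + FP + TN)          # ≈ 0.9993, misleading under imbalance
mcc = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(f"recall={recall:.3f}  precision={precision:.3f}  accuracy={accuracy:.4f}  MCC={mcc:.3f}")
```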
You are implementing a Monte Carlo simulator for network-packet jitter. The only random-number source available returns independent samples from the continuous uniform distribution on (0, 1). To feed the noise model, the simulator must generate pairs of independent standard normal (mean 0, variance 1) random variables on every call. Which one of the following transformations of two independent Uniform(0, 1) samples U1 and U2 will correctly produce the required standard normal variables Z1 and Z2?
Z1 = √(-2 ln U1) cos(2π U2); Z2 = √(-2 ln U1) sin(2π U2)
Z1 = ln (U1) / √2; Z2 = ln (U2) / √2
Z1 = √(-2 ln (U1 / U2)); Z2 = √(-2 ln (U2 / U1))
Z1 = √(-2 ln U1) cos(π U2); Z2 = √(-2 ln U2) sin(π U1)
Answer Description
The Box-Muller transform maps two independent Uniform(0, 1) variables to two independent N(0, 1) variables by converting the uniform samples into polar coordinates. The correct transformation is:
- Z1 = √(-2 ln U1) cos(2π U2)
- Z2 = √(-2 ln U1) sin(2π U2)
This works by setting the squared radius R² = -2 ln U1 and the angle Θ = 2π U2. Because the Jacobian of this transformation exactly cancels the standard bivariate normal density, the resulting pair has the desired distribution. The other choices break one or more of the requirements:
- Transformations that use π instead of 2π for the angle or take logarithms of both U1 and U2 distort the output, so it is no longer standard normal.
- Mixing the two uniform variables inside the logarithm, such as in the expression √(-2 ln (U1 / U2)), changes the radial distribution and makes Z1 and Z2 dependent, so they are neither independent nor standard normal.
Therefore, the transformation using √(-2 ln U1) for the radius and the trigonometric functions of 2π U2 for the angle is the only one that satisfies the specification.
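A minimal Python sketch of the Box-Muller transform (with NumPy assumed as the uniform source) is shown below; the sample-moment check at the end is only a sanity test, not a proof of normality:

```python
import numpy as np

def box_muller(u1: np.ndarray, u2: np.ndarray):
    """Map independent Uniform(0,1) pairs to independent standard normal pairs."""
    r = np.sqrt(-2.0 * np.log(u1))        # radius from U1
    theta = 2.0 * np.pi * u2              # angle from U2
    return r * np.cos(theta), r * np.sin(theta)

rng = np.random.default_rng(42)
u1 = 1.0 - rng.random(100_000)            # shift samples into (0, 1] to avoid log(0)
u2 = rng.random(100_000)
z1, z2 = box_muller(u1, u2)
print(z1.mean(), z1.std(), z2.mean(), z2.std())   # each should be close to 0 and 1
```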
An SRE team is analyzing the daily count of service outages for a cloud platform. Over the last 365 days the observed frequencies are: 0 outages on 310 days, 1 outage on 45 days, and 2 outages on 10 days (no day had more than two outages). The sample mean is 0.18 outages per day and the sample variance is 0.20. To develop a generative model for the number of outages per day, which distribution and supporting rationale provides the most statistically appropriate starting point?
Binomial distribution - because the count of outages can be viewed as successes in 365 daily trials with variance np(1 − p).
Power law distribution - heavy-tailed behavior explains low-probability, high-impact outage counts.
Poisson distribution - the near-equality of the sample mean and variance supports a Poisson rate parameter λ ≈ 0.18 for rare, independent daily outages.
Student's t-distribution - its heavier tails better model the occasional two-outage days in a small sample.
Answer Description
The Poisson distribution is designed for modeling the number of independent events that occur in a fixed interval when those events are rare and occur at a constant average rate. A defining property of the Poisson distribution is that its mean and variance are both equal to the rate parameter λ. Because the observed data are non-negative integer counts, the mean (0.18) is very close to the variance (0.20), and outages are presumed independent from day to day, the Poisson distribution is the most appropriate first model.
The binomial distribution is inappropriate here because it requires a fixed number of identical trials (n) each day; in this context n would need to be the unknown number of "possible outage opportunities" within a day, and its variance is np(1 − p), which is strictly less than the mean np. Since the sample variance is slightly greater than the sample mean, the binomial is a poorer fit than the Poisson.
The Student's t-distribution is a continuous distribution used for inference on sample means when population variance is unknown; it cannot generate non-negative integer counts.
A power-law distribution is heavy-tailed and continuous or defined on positive integers with probabilities that decay polynomially; it is suited to modeling extreme events across many orders of magnitude, not tightly bounded low counts such as 0, 1, 2 outages. Therefore, only the Poisson model aligns with both the empirical moment relationship and the data-generation mechanism.
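The moment check and the implied Poisson fit can be reproduced in a few lines of Python (SciPy assumed):

```python
import numpy as np
from scipy import stats

observed = {0: 310, 1: 45, 2: 10}                  # days with each outage count
counts = np.repeat(list(observed), list(observed.values()))
lam = counts.mean()                                # MLE of the Poisson rate, ≈ 0.18

print(f"mean={counts.mean():.3f}, variance={counts.var(ddof=1):.3f}")  # near-equal -> Poisson-like

# Expected number of days under Poisson(lambda) for 0, 1, and 2+ outages
n_days = counts.size
expected = [n_days * stats.poisson.pmf(k, lam) for k in (0, 1)]
expected.append(n_days * stats.poisson.sf(1, lam))  # P(X >= 2)
print("expected days (0, 1, 2+):", [round(e, 1) for e in expected])
```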
A data scientist is building a decision tree classifier to predict customer churn. They are evaluating a potential split on a categorical feature. The parent node contains 100 samples, with 50 belonging to the 'Churn' class and 50 to the 'No Churn' class. The proposed split creates two child nodes:
- Child Node 1: 60 samples, with 40 'Churn' and 20 'No Churn'.
- Child Node 2: 40 samples, with 10 'Churn' and 30 'No Churn'. To evaluate the quality of this split, what is the weighted Gini impurity?
0.417
0.083
0.410
0.500
Answer Description
The correct answer is 0.417. The weighted Gini impurity is calculated by finding the Gini impurity for each child node and then computing their weighted average based on the number of samples in each node.
Step 1: Calculate Gini Impurity for Child Node 1
- The proportion of 'Churn' is p1 = 40/60 = 2/3.
- The proportion of 'No Churn' is p2 = 20/60 = 1/3.
- Gini(Node 1) = 1 - (p1² + p2²) = 1 - ((2/3)² + (1/3)²) = 1 - (4/9 + 1/9) = 1 - 5/9 = 4/9 ≈ 0.444.
Step 2: Calculate Gini Impurity for Child Node 2
- The proportion of 'Churn' is p1 = 10/40 = 1/4.
- The proportion of 'No Churn' is p2 = 30/40 = 3/4.
- Gini(Node 2) = 1 - (p1² + p2²) = 1 - ((1/4)² + (3/4)²) = 1 - (1/16 + 9/16) = 1 - 10/16 = 6/16 = 0.375.
Step 3: Calculate the Weighted Gini Impurity
- The weight for Node 1 is the number of samples in it divided by the total samples in the parent: w1 = 60/100 = 0.6.
- The weight for Node 2 is w2 = 40/100 = 0.4.
- Weighted Gini = (w1 * Gini(Node 1)) + (w2 * Gini(Node 2)) = (0.6 * 4/9) + (0.4 * 0.375) ≈ 0.267 + 0.150 = 0.417.
Incorrect Answer Analysis:
- 0.500 is the Gini impurity of the parent node (1 - (0.5² + 0.5²) = 0.5), not the weighted impurity of the split.
- 0.083 is the Information Gain (Gini Gain), which is calculated by subtracting the weighted Gini impurity from the parent node's Gini impurity (0.500 - 0.417 = 0.083).
- 0.410 is an incorrect calculation, likely resulting from taking a simple average of the two child node impurities instead of a weighted average.
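The arithmetic can be verified with a short Python sketch:

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

children = [(40, 20), (10, 30)]                      # (churn, no-churn) in each child node
total = sum(sum(c) for c in children)

weighted = sum(sum(c) / total * gini(c) for c in children)
gain = gini((50, 50)) - weighted                     # parent node is 50/50

print(f"weighted Gini ≈ {weighted:.3f}, Gini gain ≈ {gain:.3f}")   # ≈ 0.417 and ≈ 0.083
```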
A data scientist is analyzing a clinical trial dataset that includes the variables patient_age and systolic_blood_pressure (SBP). They observe that a significant number of SBP values are missing. Upon further investigation, the data scientist discovers that the probability of an SBP value being missing is correlated with patient_age, with younger patients being more likely to have a missing SBP value. However, within any specific age group, the reason for the missing SBP value is not related to the actual (unobserved) blood pressure level or any other unmeasured factor. Which type of missingness does this scenario describe?
Missing Completely at Random (MCAR)
Missing at Random (MAR)
Structural Missingness
Not Missing at Random (NMAR)
Answer Description
The correct answer is Missing at Random (MAR). In this scenario, the missingness of the systolic_blood_pressure (SBP) is dependent on another observed variable, which is patient_age. This is the key characteristic of MAR: the probability of a value being missing is related to other observed information in the dataset but not to the unobserved value itself.
- Missing Completely at Random (MCAR) is incorrect because the missingness is not completely random; it has a systematic relationship with the patient_age variable. If the data were MCAR, the probability of a missing SBP value would be the same for all patients, regardless of their age or any other characteristic.
- Not Missing at Random (NMAR) is incorrect because the scenario explicitly states that the missingness is not related to the actual unobserved blood pressure level. NMAR would apply if, for example, patients with very high blood pressure were less likely to have their SBP recorded, meaning the missingness depends on the value of the missing variable itself.
- Structural Missingness is incorrect. This term typically refers to data that is missing for a logical reason inherent in the study's design. For instance, a question about the number of pregnancies would be structurally missing for male participants. The scenario described does not fit this definition.
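As a sketch of how MAR data like this might be handled, the following Python example (using synthetic data and scikit-learn's IterativeImputer, both assumptions not stated in the question) imputes SBP while conditioning on patient_age, the observed variable that drives the missingness:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Synthetic clinical data: younger patients are more likely to have a missing SBP value
rng = np.random.default_rng(7)
n = 1000
age = rng.integers(20, 85, n)
sbp = 95 + 0.6 * age + rng.normal(0, 10, n)

df = pd.DataFrame({"patient_age": age, "systolic_blood_pressure": sbp})
missing_mask = rng.random(n) < np.clip(0.5 - 0.005 * age, 0.05, 0.5)
df.loc[missing_mask, "systolic_blood_pressure"] = np.nan

# Because age drives the missingness, including it in the imputation model keeps the mechanism ignorable
imputed = IterativeImputer(random_state=0).fit_transform(df)
df_imputed = pd.DataFrame(imputed, columns=df.columns)
print(df_imputed.describe())
```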
An analyst is investigating the linear association between two continuous variables X and Y using n = 7 paired observations. The following summary statistics are available:
- Sample standard deviation of X: s_X = 4
- Sample standard deviation of Y: s_Y = 3
- Sum of cross-products of deviations: Σ(x_i − x̄)(y_i − ȳ) = 36
Using Pearson's correlation coefficient and testing at the α = 0.05 significance level (two-tailed) for H₀: ρ = 0, which statement correctly states both the value of the sample correlation r and the appropriate decision on the null hypothesis?
r ≈ 0.83 and fail to reject the null hypothesis (no statistically significant linear correlation)
r = 0.50 and fail to reject the null hypothesis (no statistically significant linear correlation)
r ≈ 0.83 and reject the null hypothesis (statistically significant linear correlation)
r = 0.50 and reject the null hypothesis (statistically significant linear correlation)
Answer Description
The Pearson correlation for a sample is r = Σ(x_i − x̄)(y_i − ȳ) / [(n − 1)s_X s_Y]. Substituting the given values gives r = 36 / (6 × 4 × 3) = 0.50. To test significance, use the t statistic t = r √(n − 2) / √(1 − r²). With n = 7, t = 0.50 × √5 / √(1 − 0.25) ≈ 1.29. The critical value for a two-tailed test with df = 5 at α = 0.05 is about ±2.57, so |1.29| does not exceed the critical value. Therefore the analyst fails to reject H₀; the correlation is not statistically significant at the 0.05 level.
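The same numbers in Python, with SciPy supplying the t critical value and p-value:

```python
from math import sqrt
from scipy import stats

n, s_x, s_y, sxy = 7, 4, 3, 36
r = sxy / ((n - 1) * s_x * s_y)            # 36 / (6 * 4 * 3) = 0.50

t = r * sqrt(n - 2) / sqrt(1 - r ** 2)     # ≈ 1.29
t_crit = stats.t.ppf(0.975, df=n - 2)      # ≈ 2.57 for a two-tailed test at alpha = 0.05
p_value = 2 * stats.t.sf(abs(t), df=n - 2)

print(f"r = {r:.2f}, t ≈ {t:.2f}, critical ≈ {t_crit:.2f}, p ≈ {p_value:.3f}")
```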
Nice!
Looks like that's it! You can go back and review your answers or click the button below to grade your test.