CompTIA DataX Practice Test (DY0-001)
Use the form below to configure your CompTIA DataX Practice Test (DY0-001). The practice test can be configured to only include certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

CompTIA DataX DY0-001 (V1) Information
CompTIA DataX is an expert‑level, vendor‑neutral certification aimed at deeply experienced data science professionals. Launched on July 25, 2024, the exam verifies advanced competencies across the full data science lifecycle - from mathematical modeling and machine learning to deployment and specialized applications like NLP, computer vision, and anomaly detection.
The exam comprehensively covers five key domains:
- Mathematics and Statistics (~17%)
- Modeling, Analysis, and Outcomes (~24%)
- Machine Learning (~24%)
- Operations and Processes (~22%)
- Specialized Applications of Data Science (~13%)
It includes a mix of multiple‑choice and performance‑based questions (PBQs), simulating real-world tasks like interpreting data pipelines or optimizing machine learning workflows. The duration is 165 minutes, with a maximum of 90 questions. Scoring is pass/fail only, with no scaled score reported.
Free CompTIA DataX DY0-001 (V1) Practice Test
Press start when you are ready, or press Change to modify any settings for the practice test.
- Questions: 15
- Time: Unlimited
- Included Topics: Mathematics and Statistics; Modeling, Analysis, and Outcomes; Machine Learning; Operations and Processes; Specialized Applications of Data Science
Free Preview
This test is a free preview; no account is required.
Subscribe to unlock all content, keep track of your scores, and access AI features!
A data scientist wants to report a two-sided 95% confidence interval for the true population Pearson correlation between two numerical features. In a random sample of n = 60 observations, the sample correlation is r = 0.58. To use standard normal critical values, which pre-processing step should be applied to the correlation estimate before constructing the confidence interval?
Apply the Wilson score method directly to r to obtain the interval.
Transform r with Fisher's inverse hyperbolic tangent (z-transformation), build the interval in the transformed space, then back-transform the interval's endpoints.
Use a Box-Cox transformation on each variable so that the resulting correlation can be treated as normally distributed.
Multiply r by √(n−2)/√(1−r²) and treat the result as standard normal when forming the interval.
Answer Description
Because the sampling distribution of Pearson's r is skewed and its variance depends on the unknown population correlation (ρ), a direct calculation using normal theory is inappropriate. Fisher's z-transformation, z = atanh(r) = ½ ln[(1+r)/(1−r)], is a variance-stabilizing transform that makes the resulting statistic, z, approximately normally distributed as N(atanh(ρ), 1/(n−3)). A 95% interval for this transformed value is therefore z ± 1.96 / √(n−3). Applying the inverse transform (tanh) to the interval's endpoints yields the confidence interval for ρ. The Wilson score interval is designed for binomial proportions. A Box-Cox transformation applies to the raw data, not the correlation coefficient r. The statistic r√(n−2)/√(1−r²) follows a t-distribution and is used for hypothesis testing, not interval estimation.
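To make the procedure concrete, here is a minimal Python sketch (using NumPy and SciPy, with the question's r = 0.58 and n = 60) that builds the interval in the transformed space and back-transforms the endpoints:

```python
import numpy as np
from scipy import stats

r, n = 0.58, 60                      # sample correlation and sample size from the question
z = np.arctanh(r)                    # Fisher z-transform: atanh(r) = 0.5 * ln((1+r)/(1-r))
se = 1 / np.sqrt(n - 3)              # approximate standard error of z
z_crit = stats.norm.ppf(0.975)       # two-sided 95% critical value (≈ 1.96)

lo, hi = z - z_crit * se, z + z_crit * se
ci = np.tanh([lo, hi])               # back-transform the endpoints to the correlation scale
print(f"95% CI for rho: ({ci[0]:.3f}, {ci[1]:.3f})")
```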
A data scientist develops a multiple linear regression model to predict housing prices. Upon evaluation, a plot of the model's residuals versus its fitted values reveals a distinct fan shape, where the vertical spread of the residuals increases as the predicted housing price increases. Which of the following statements describes the most critical implication of this observation for the model's statistical inference?
The model suffers from severe multicollinearity, making it difficult to isolate the individual impact of each predictor variable.
The coefficient estimates are biased, leading to a systematic overestimation or underestimation of the true population parameters.
The standard errors of the coefficients are biased, rendering hypothesis tests and confidence intervals unreliable.
The residuals are not normally distributed, which violates the primary assumption required for the coefficient estimates to be valid.
Answer Description
The correct answer is that the standard errors of the coefficients are biased, which renders hypothesis tests and confidence intervals unreliable. The fan-shaped pattern in the residual plot is a classic indicator of heteroskedasticity, which means the variance of the error term is not constant across all levels of the independent variables. In the presence of heteroskedasticity, Ordinary Least Squares (OLS) coefficient estimates remain unbiased, but they are no longer efficient (i.e., not BLUE - Best Linear Unbiased Estimators). The primary issue for statistical inference is that the formulas used to calculate the variance and standard errors of the coefficients, which assume homoskedasticity, become biased. This bias in the standard errors leads to unreliable t-statistics, p-values, and confidence intervals, potentially causing the analyst to draw incorrect conclusions about the statistical significance of the predictor variables.
The coefficient estimates themselves do not become biased due to heteroskedasticity in an OLS model. Multicollinearity is a separate issue related to high correlation between predictor variables, not the variance of the residuals. While the normality of residuals is another OLS assumption, the fan shape specifically points to non-constant variance (heteroskedasticity), not necessarily a deviation from a normal distribution.
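As an illustration, the following Python sketch generates synthetic housing data with the kind of fan-shaped error structure described above, detects heteroskedasticity with a Breusch-Pagan test, and refits the inference with heteroskedasticity-robust (HC3) standard errors. The data and variable names are assumptions for demonstration only, not part of the question:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Hypothetical data: error spread grows with house size, mimicking the fan-shaped residual plot
rng = np.random.default_rng(0)
size = rng.uniform(50, 400, 500)
price = 1000 + 900 * size + rng.normal(0, 2 * size, 500)

X = sm.add_constant(size)
ols = sm.OLS(price, X).fit()

# Breusch-Pagan test: a small p-value suggests heteroskedasticity
lm_stat, lm_pval, _, _ = het_breuschpagan(ols.resid, ols.model.exog)
print(f"Breusch-Pagan p-value: {lm_pval:.4g}")

# Same coefficient estimates, but heteroskedasticity-robust (HC3) standard errors for valid inference
robust = ols.get_robustcov_results(cov_type="HC3")
print(robust.summary())
```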
An analytics team is evaluating three nested multiple linear regression models to predict annual energy consumption (kWh) for office buildings. The validation-set summary is:
| Model | Predictors | R² | Adjusted R² | RMSE (kWh) | F-statistic p-value |
|---|---|---|---|---|---|
| M1 | 6 | 0.88 | 0.879 | 12 400 | <0.001 |
| M2 | 15 | 0.90 | 0.893 | 12 100 | <0.001 |
| M3 | 25 | 0.91 | 0.885 | 12 050 | <0.001 |
Hardware constraints limit the production model to the smallest set of predictors that still yields clear performance gains. Which single performance metric from the table gives the most defensible basis for deciding which model best achieves this balance?
R²
Root-mean-square error (RMSE)
F-statistic p-value
Adjusted R²
Answer Description
Adjusted R² modifies the ordinary R² by incorporating both sample size and the number of predictors, so it rises only when additional variables reduce the residual variance more than would be expected by chance. It therefore rewards genuine improvement while penalizing unnecessary complexity. In the table, Model 2 attains the highest adjusted R², indicating the best trade-off between parsimony and predictive power. Plain R² and RMSE both improve (or stay nearly the same) as more predictors are added, so they cannot flag overfitting. The F-statistic p-value only tests whether each model outperforms an intercept-only model; because all three p-values are identical, it offers no guidance for choosing among the competing models.
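For reference, a short Python sketch of the adjusted R² formula is shown below. The validation-set size n is not given in the table, so the value used here is an assumption for illustration; the exact figures depend on the true n.

```python
# Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)
def adjusted_r2(r2: float, n: int, p: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 100  # assumed validation-set size, for illustration only
for name, p, r2 in [("M1", 6, 0.88), ("M2", 15, 0.90), ("M3", 25, 0.91)]:
    print(f"{name}: adjusted R² ≈ {adjusted_r2(r2, n, p):.3f}")
```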
A customer analytics team is cleaning a dataset that contains customer age (fully observed), loyalty tier (fully observed), and total annual spending, of which about 18 % of the values are missing. Exploratory analysis shows that customers who are younger and those in the highest loyalty tier are less likely to report spending. However, within any given age-tier combination, the probability that spending is missing is unrelated to the true (unobserved) spending amount. Which description best characterizes the missingness mechanism for the spending variable in this situation?
Missing Completely at Random due to a random data-entry glitch that uniformly deleted 18 % of spending values across the dataset.
Missing Completely at Random (MCAR); missingness is unrelated to any observed or unobserved variables.
Missing at Random (MAR); the probability of a missing spending value depends only on the observed age and loyalty tier.
Missing Not at Random (MNAR); higher or lower spending directly influences the chance that the value is missing, even after accounting for age and tier.
Answer Description
The missingness depends on two fully observed variables-age and loyalty tier-but, conditional on them, it is not related to the spending values that are actually missing. This matches the definition of Missing at Random (MAR). Under MAR, the missing-data mechanism is considered ignorable for likelihood-based models or multiple imputation, provided the observed predictors that drive missingness are included in the analysis. The mechanism is not Missing Completely at Random (MCAR) because younger, high-tier customers have a higher propensity for missingness, and it is not Missing Not at Random (MNAR) because spending itself does not influence whether it is missing once age and tier are taken into account.
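A quick way to see why this pattern is MAR rather than MCAR is to compare missingness rates across the observed variables. The Python sketch below uses synthetic data (an assumption, since the team's dataset is not provided) in which missingness in spending is driven only by age and tier:

```python
import numpy as np
import pandas as pd

# Synthetic customer data: missingness in 'spending' depends only on the observed
# 'age' and 'tier' columns, matching the MAR mechanism described in the question.
rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({
    "age": rng.integers(18, 75, n),
    "tier": rng.choice(["bronze", "silver", "gold"], n, p=[0.5, 0.3, 0.2]),
})
df["spending"] = 200 + 15 * df["age"] + rng.normal(0, 300, n)

p_miss = 0.30 * (df["age"] < 30) + 0.25 * (df["tier"] == "gold") + 0.05
df.loc[rng.random(n) < p_miss, "spending"] = np.nan

# Diagnostic: the missingness rate varies across observed groups (consistent with MAR, not MCAR)
rates = (df.assign(age_band=pd.cut(df["age"], [17, 30, 50, 75]),
                   missing=df["spending"].isna())
           .groupby(["tier", "age_band"], observed=True)["missing"].mean())
print(rates)
```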
A data scientist at an aerospace firm has developed a binary classification model to predict catastrophic engine failures. The positive class represents a "failure" event, which is extremely rare in the operational data. The primary business objective is to avoid missing any potential failures, as a single missed event (a False Negative) is unacceptable due to safety implications. The cost of a False Positive (flagging a healthy engine for inspection) is considered minimal. Which classifier performance metric should be prioritized to best evaluate and optimize the model for this specific requirement?
Recall
Accuracy
Precision
F1 Score
Answer Description
The correct answer is Recall. Recall, also known as sensitivity or the true positive rate, is calculated as TP / (TP + FN), where TP is True Positives and FN is False Negatives. In scenarios where the cost of a False Negative is very high, such as failing to predict a critical equipment failure, maximizing recall is the primary objective. This metric directly measures the model's ability to identify all actual positive instances.
- Precision, calculated as TP / (TP + FP), measures the accuracy of positive predictions. Prioritizing precision would aim to reduce False Positives, which is not the main concern in this scenario.
- Accuracy is not suitable for highly imbalanced datasets because a model can achieve a high accuracy score by simply predicting the majority class, while completely failing to identify the rare, critical events.
- The F1 score is the harmonic mean of precision and recall and seeks a balance between them. While useful, it does not specifically prioritize minimizing False Negatives, which is the explicit goal here.
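A small Python sketch with assumed confusion-matrix counts (the question gives none) shows how recall isolates the false-negative cost while accuracy stays misleadingly high:

```python
# Hypothetical counts for a rare-failure classifier; values are illustrative only
TP, FN, FP, TN = 45, 5, 900, 99_050

recall    = TP / (TP + FN)                   # sensitivity / true positive rate: the metric to prioritize
precision = TP / (TP + FP)
accuracy  = (TP + TN) / (TP + FN + FP + TN)
f1        = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f}  precision={precision:.3f}  accuracy={accuracy:.4f}  F1={f1:.3f}")
# Accuracy stays near 0.99 even if failures are missed; recall directly tracks the false-negative cost.
```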
A data scientist fits a multiple linear regression model with an intercept and six predictor variables (p = 6) to a sample of n = 80 observations. The model's coefficient of determination is R² = 0.37.
Using the classical F-test for overall model significance (H₀: all slope coefficients = 0), what is the value of the F statistic that should be reported?
Approximately 37.0
Approximately 0.59
Approximately 12.2
Approximately 7.1
Answer Description
For the overall significance test in multiple regression, the F statistic can be written in terms of R²:
F = (R² / p) ÷ [(1 - R²) / (n - p - 1)]
Substituting the values:
- Numerator term: R² / p = 0.37 / 6 ≈ 0.0617
- Denominator term: (1 - R²) / (n - p - 1) = 0.63 / 73 ≈ 0.00863
F = 0.0617 / 0.00863 ≈ 7.15
Because this value greatly exceeds typical critical values for the F(6, 73) distribution at α = 0.05 (≈ 2.20), the null hypothesis would be rejected.
The distractor values reflect common errors: computing R² / (1 − R²) alone (≈0.59), using (n − p − 1) / p alone (≈12.2), or reporting R² as a percentage rather than computing the F statistic (37.0).
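The same computation in Python, using SciPy for the F(6, 73) reference distribution:

```python
from scipy import stats

R2, n, p = 0.37, 80, 6
F = (R2 / p) / ((1 - R2) / (n - p - 1))            # ≈ 7.15
f_crit = stats.f.ppf(0.95, dfn=p, dfd=n - p - 1)   # ≈ 2.2, the alpha = 0.05 critical value
p_value = stats.f.sf(F, dfn=p, dfd=n - p - 1)      # upper-tail probability under F(6, 73)

print(f"F ≈ {F:.2f}, critical value ≈ {f_crit:.2f}, p-value ≈ {p_value:.2g}")
```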
A data scientist is conducting a survival analysis to model customer churn for a subscription-based service. The dataset includes the tenure of each customer and a status indicator for whether they have churned or are still active (censored data). The initial analysis with a non-parametric Kaplan-Meier estimator was used to visualize the survival probability.
The next objective is to understand how covariates, such as the customer's subscription plan and monthly spending, influence the risk of churn over time. The data scientist wants to quantify the effect of these covariates but is hesitant to make a strong assumption about the specific shape of the underlying baseline hazard function.
Given these requirements, which of the following models is the most appropriate choice?
Weibull AFT model
Kaplan-Meier estimator
ARIMA model
Cox Proportional Hazards model
Answer Description
The correct answer is the Cox Proportional Hazards model. This model is a semi-parametric regression model and is ideal for this scenario because it allows for the estimation of the effects of covariates (like subscription plan and spending) on the hazard rate without making any assumptions about the shape of the baseline hazard function. This directly addresses the requirement to quantify covariate effects while avoiding strong distributional assumptions.
- The Kaplan-Meier estimator is a non-parametric method used to estimate and visualize the survival function. While useful for initial analysis, it cannot incorporate multiple or continuous covariates into a regression framework to quantify their individual effects on the hazard rate.
- The Weibull AFT (Accelerated Failure Time) model is a fully parametric model. It requires the assumption that survival times follow a specific distribution (the Weibull distribution). This contradicts the data scientist's goal of avoiding strong assumptions about the underlying distribution.
- An ARIMA model is used for time series forecasting, which analyzes data points collected over time to predict future values (e.g., monthly sales). It is not designed for time-to-event analysis, which involves understanding the duration until an event occurs and must account for censored data and individual-level covariates.
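A minimal sketch of fitting such a model with the lifelines library (assuming it is installed) is shown below; the DataFrame columns and toy values are illustrative, not from the question:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Toy churn data: 'churned' = 0 marks customers who are still active (right-censored)
df = pd.DataFrame({
    "tenure_months": [3, 12, 24, 7, 30, 18, 5, 26],
    "churned":       [1, 1, 0, 1, 0, 1, 0, 0],
    "monthly_spend": [80, 45, 40, 95, 65, 60, 50, 55],
    "premium_plan":  [0, 1, 1, 0, 1, 0, 0, 1],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="tenure_months", event_col="churned")
cph.print_summary()                 # hazard ratios quantify each covariate's effect on churn risk
# cph.check_assumptions(df)         # optional: test the proportional-hazards assumption
```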
An automated trading-surveillance system must ensure that at least 70 % of the orders it flags as suspicious are truly manipulative, so the compliance team has set a minimum precision of 0.70. Two candidate classifiers were evaluated on a validation set of 50 000 historical orders with the following confusion-matrix counts:
Classifier X - TP = 260, FP = 110, FN = 190, TN = 47 440
Classifier Y - TP = 350, FP = 180, FN = 100, TN = 47 370
Which option correctly identifies the classifier(s) that meet the compliance requirement and states the corresponding precision value?
Only Classifier X satisfies the requirement with a precision of approximately 0.70
Neither classifier satisfies the requirement because both precisions are below 0.70
Both classifiers satisfy the requirement because each has a precision above 0.70
Only Classifier Y satisfies the requirement with a precision of approximately 0.66
Answer Description
Precision is defined as TP / (TP + FP), the proportion of positive predictions that are correct.
- Classifier X: 260 / (260 + 110) ≈ 0.703 (> 0.70).
- Classifier Y: 350 / (350 + 180) ≈ 0.660 (< 0.70).
Because only Classifier X achieves a precision of at least 0.70, it alone satisfies the compliance requirement. Classifier Y falls short despite having more true positives, because it also generates more false positives, lowering its precision.
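The check can be reproduced with a few lines of Python:

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

for name, tp, fp in [("X", 260, 110), ("Y", 350, 180)]:
    p = precision(tp, fp)
    verdict = "meets" if p >= 0.70 else "fails"
    print(f"Classifier {name}: precision = {p:.3f} -> {verdict} the 0.70 requirement")
```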
A data scientist is building a decision tree classifier to predict customer churn. At a specific node containing 20 samples, 10 customers have churned and 10 have not. The scientist is evaluating two features, 'Contract Type' and 'Has Tech Support', to determine the optimal split. The results of splitting by each feature are as follows:
Split by 'Contract Type':
- Node A ('Month-to-Month'): 12 samples (9 Churn, 3 No Churn)
- Node B ('One/Two Year'): 8 samples (1 Churn, 7 No Churn)
Split by 'Has Tech Support':
- Node C ('Yes'): 10 samples (3 Churn, 7 No Churn)
- Node D ('No'): 10 samples (7 Churn, 3 No Churn)
Given that the algorithm uses entropy to maximize information gain, which of the following conclusions is correct?
The 'Has Tech Support' feature should be selected because its resulting split has a lower weighted average entropy than the 'Contract Type' split.
The 'Has Tech Support' feature should be selected because its child nodes are perfectly balanced in size (10 samples each), which maximizes the reduction in impurity.
The 'Contract Type' feature should be selected because its resulting split has a lower weighted average entropy (approximately 0.705) than the 'Has Tech Support' split (approximately 0.881).
The information gain for both splits is equal, so the Gini index must be calculated to determine the optimal feature.
Answer Description
The correct answer is that 'Contract Type' should be selected because its split results in a lower weighted average entropy. The goal of a decision tree split is to maximize Information Gain, which is equivalent to minimizing the weighted average entropy of the child nodes.
The calculation is as follows:
Step 1: Calculate the entropy for each child node. The formula for entropy is: E = -p * log2(p) - (1-p) * log2(1-p).
- E(Node A), 9/12 Churn: -( (9/12) * log2(9/12) + (3/12) * log2(3/12) ) ≈ 0.811
- E(Node B), 1/8 Churn: -( (1/8) * log2(1/8) + (7/8) * log2(7/8) ) ≈ 0.544
- E(Node C), 3/10 Churn: -( (3/10) * log2(3/10) + (7/10) * log2(7/10) ) ≈ 0.881
- E(Node D), 7/10 Churn: -( (7/10) * log2(7/10) + (3/10) * log2(3/10) ) ≈ 0.881
Step 2: Calculate the weighted average entropy for each split, which is the sum of (samples_in_child / total_samples) * entropy_of_child.
- Weighted entropy('Contract Type') = (12/20) * 0.811 + (8/20) * 0.544 = 0.6 * 0.811 + 0.4 * 0.544 ≈ 0.487 + 0.218 = 0.705
- Weighted entropy('Has Tech Support') = (10/20) * 0.881 + (10/20) * 0.881 = 0.881
Step 3: Compare the results. The split on 'Contract Type' (0.705) has a lower weighted average entropy than the split on 'Has Tech Support' (0.881). Therefore, 'Contract Type' yields a higher information gain and is the better split.
The other options are incorrect. The 'Has Tech Support' split has a higher weighted entropy, making it the less desirable choice. The balance of sample sizes in the child nodes for 'Has Tech Support' does not guarantee higher information gain; the purity of the classes within those nodes is what matters. Finally, calculating the Gini index is an alternative to entropy, not a necessary tie-breaker.
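The calculation can be verified with a short Python sketch:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def weighted_entropy(children):
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

parent   = (10, 10)              # (churn, no-churn) in the parent node
contract = [(9, 3), (1, 7)]      # Node A, Node B
support  = [(3, 7), (7, 3)]      # Node C, Node D

for name, split in [("Contract Type", contract), ("Has Tech Support", support)]:
    we = weighted_entropy(split)
    print(f"{name}: weighted entropy ≈ {we:.3f}, information gain ≈ {entropy(parent) - we:.3f}")
```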
A data scientist develops a classification model to identify fraudulent financial transactions. The test dataset contains 1,000,000 transactions, of which 1,000 (0.1%) are fraudulent. After testing, the model produces the following confusion matrix:
| | Predicted: Fraud | Predicted: Not Fraud |
|---|---|---|
| Actual: Fraud | 800 (TP) | 200 (FN) |
| Actual: Not Fraud | 500 (FP) | 998,500 (TN) |
The primary business objective is to minimize the number of missed fraudulent transactions (False Negatives), even at the cost of flagging some legitimate transactions for review (False Positives). Given this objective and the severe class imbalance, which performance metric provides the most relevant assessment of the model's effectiveness for its intended purpose?
Recall
Accuracy
Precision
Matthews Correlation Coefficient (MCC)
Answer Description
The correct answer is Recall.
Recall (Sensitivity or True Positive Rate) is calculated as TP / (TP + FN). It measures the proportion of actual positive cases that the model correctly identified. In this scenario, Recall = 800 / (800 + 200) = 80%. This metric directly addresses the business objective of minimizing missed fraudulent transactions (False Negatives). A high recall indicates that the model is effective at identifying the vast majority of actual fraud cases.
Accuracy is incorrect because it is a misleading metric for datasets with severe class imbalance. It is calculated as (TP + TN) / Total, which in this case is (800 + 998,500) / 1,000,000 = 99.93%. While this number seems very high, a naive model that predicts "Not Fraud" for every transaction would achieve 99.9% accuracy, making it a poor indicator of the model's ability to detect the rare positive class.
Precision is incorrect in this context. Precision is calculated as TP / (TP + FP) and measures the proportion of positive predictions that were actually correct. Here, Precision = 800 / (800 + 500) = 61.5%. This metric is important when the cost of a False Positive is high. However, the business objective explicitly prioritizes minimizing False Negatives over False Positives, making Recall the more relevant metric.
Matthews Correlation Coefficient (MCC) is a sophisticated and generally robust metric for imbalanced datasets because it considers all four cells of the confusion matrix. However, the question asks for the metric that is most relevant to the specific business objective of minimizing False Negatives. While MCC provides a balanced, overall score, Recall is the most direct and explicit measure of the model's performance against that particular goal.
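The metrics can be reproduced directly from the question's confusion-matrix counts:

```python
from math import sqrt

TP, FN, FP, TN = 800, 200, 500, 998_500   # counts from the confusion matrix above

recall    = TP / (TP + FN)                           # 0.80
precision = TP / (TP + FP)                           # ≈ 0.615
accuracy  = (TP + TN) / (TP + FN + FP + TN)          # ≈ 0.9993, misleading under imbalance
mcc = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(f"recall={recall:.3f}  precision={precision:.3f}  accuracy={accuracy:.4f}  MCC={mcc:.3f}")
```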
You are implementing a Monte Carlo simulator for network-packet jitter. The only random-number source available returns independent samples from the continuous uniform distribution on (0, 1). To feed the noise model, the simulator must generate pairs of independent standard normal (mean 0, variance 1) random variables on every call. Which one of the following transformations of two independent Uniform(0, 1) samples U1 and U2 will correctly produce the required standard normal variables Z1 and Z2?
Z1 = √(-2 ln U1) cos(2π U2); Z2 = √(-2 ln U1) sin(2π U2)
Z1 = ln (U1) / √2; Z2 = ln (U2) / √2
Z1 = √(-2 ln (U1 / U2)); Z2 = √(-2 ln (U2 / U1))
Z1 = √(-2 ln U1) cos(π U2); Z2 = √(-2 ln U2) sin(π U1)
Answer Description
The Box-Muller transform maps two independent Uniform(0, 1) variables to two independent N(0, 1) variables by converting the uniform samples into polar coordinates. The correct transformation is:
- Z1 = √(-2 ln U1) cos(2π U2)
- Z2 = √(-2 ln U1) sin(2π U2)
This works by setting the squared radius R² = -2 ln U1 and the angle Θ = 2π U2. Because the Jacobian of this transformation exactly cancels the standard bivariate normal density, the resulting pair has the desired distribution. The other choices break one or more of the requirements:
- Transformations that use π instead of 2π for the angle or take logarithms of both U1 and U2 distort the output, so it is no longer standard normal.
- Mixing the two uniform variables inside the logarithm, such as in the expression √(-2 ln (U1 / U2)), changes the radial distribution and makes Z1 and Z2 dependent, so they are neither independent nor standard normal.
Therefore, the transformation using √(-2 ln U1) for the radius and the trigonometric functions of 2π U2 for the angle is the only one that satisfies the specification.
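A minimal Python sketch of the Box-Muller transform (with NumPy assumed as the uniform source) is shown below; the sample-moment check at the end is only a sanity test, not a proof of normality:

```python
import numpy as np

def box_muller(u1: np.ndarray, u2: np.ndarray):
    """Map independent Uniform(0,1) pairs to independent standard normal pairs."""
    r = np.sqrt(-2.0 * np.log(u1))        # radius from U1
    theta = 2.0 * np.pi * u2              # angle from U2
    return r * np.cos(theta), r * np.sin(theta)

rng = np.random.default_rng(42)
u1 = 1.0 - rng.random(100_000)            # shift samples into (0, 1] to avoid log(0)
u2 = rng.random(100_000)
z1, z2 = box_muller(u1, u2)
print(z1.mean(), z1.std(), z2.mean(), z2.std())   # each should be close to 0 and 1
```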
An SRE team is analyzing the daily count of service outages for a cloud platform. Over the last 365 days the observed frequencies are: 0 outages on 310 days, 1 outage on 45 days, and 2 outages on 10 days (no day had more than two outages). The sample mean is 0.18 outages per day and the sample variance is 0.20. To develop a generative model for the number of outages per day, which distribution and supporting rationale provides the most statistically appropriate starting point?
Binomial distribution - because the count of outages can be viewed as successes in 365 daily trials with variance np(1 − p).
Power law distribution - heavy-tailed behavior explains low-probability, high-impact outage counts.
Poisson distribution - the near-equality of the sample mean and variance supports a Poisson rate parameter λ ≈ 0.18 for rare, independent daily outages.
Student's t-distribution - its heavier tails better model the occasional two-outage days in a small sample.
Answer Description
The Poisson distribution is designed for modeling the number of independent events that occur in a fixed interval when those events are rare and occur at a constant average rate. A defining property of the Poisson distribution is that its mean and variance are both equal to the rate parameter λ. Because the observed data are non-negative integer counts, the mean (0.18) is very close to the variance (0.20), and outages are presumed independent from day to day, the Poisson distribution is the most appropriate first model.
The binomial distribution is inappropriate here because it requires a fixed number of identical trials (n) each day; in this context n would need to be the unknown number of "possible outage opportunities" within a day, and its variance is np(1 − p), which is strictly less than the mean np. Since the sample variance is slightly greater than the sample mean, the binomial is a poorer fit than the Poisson.
The Student's t-distribution is a continuous distribution used for inference on sample means when population variance is unknown; it cannot generate non-negative integer counts.
A power-law distribution is heavy-tailed and continuous or defined on positive integers with probabilities that decay polynomially; it is suited to modeling extreme events across many orders of magnitude, not tightly bounded low counts such as 0, 1, 2 outages. Therefore, only the Poisson model aligns with both the empirical moment relationship and the data-generation mechanism.
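The moment check and the implied Poisson fit can be reproduced in a few lines of Python (SciPy assumed):

```python
import numpy as np
from scipy import stats

observed = {0: 310, 1: 45, 2: 10}                  # days with each outage count
counts = np.repeat(list(observed), list(observed.values()))
lam = counts.mean()                                # MLE of the Poisson rate, ≈ 0.18

print(f"mean={counts.mean():.3f}, variance={counts.var(ddof=1):.3f}")  # near-equal -> Poisson-like

# Expected number of days under Poisson(lambda) for 0, 1, and 2+ outages
n_days = counts.size
expected = [n_days * stats.poisson.pmf(k, lam) for k in (0, 1)]
expected.append(n_days * stats.poisson.sf(1, lam))  # P(X >= 2)
print("expected days (0, 1, 2+):", [round(e, 1) for e in expected])
```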
A data scientist is building a decision tree classifier to predict customer churn. They are evaluating a potential split on a categorical feature. The parent node contains 100 samples, with 50 belonging to the 'Churn' class and 50 to the 'No Churn' class. The proposed split creates two child nodes:
- Child Node 1: 60 samples, with 40 'Churn' and 20 'No Churn'.
- Child Node 2: 40 samples, with 10 'Churn' and 30 'No Churn'. To evaluate the quality of this split, what is the weighted Gini impurity?
0.417
0.083
0.410
0.500
Answer Description
The correct answer is 0.417. The weighted Gini impurity is calculated by finding the Gini impurity for each child node and then computing their weighted average based on the number of samples in each node.
Step 1: Calculate Gini Impurity for Child Node 1
- The proportion of 'Churn' is p1 = 40/60 = 2/3.
- The proportion of 'No Churn' is p2 = 20/60 = 1/3.
- Gini(Node 1) = 1 - (p1² + p2²) = 1 - ((2/3)² + (1/3)²) = 1 - (4/9 + 1/9) = 1 - 5/9 = 4/9 ≈ 0.444.
Step 2: Calculate Gini Impurity for Child Node 2
- The proportion of 'Churn' is p1 = 10/40 = 1/4.
- The proportion of 'No Churn' is p2 = 30/40 = 3/4.
- Gini(Node 2) = 1 - (p1² + p2²) = 1 - ((1/4)² + (3/4)²) = 1 - (1/16 + 9/16) = 1 - 10/16 = 6/16 = 0.375.
Step 3: Calculate the Weighted Gini Impurity
- The weight for Node 1 is the number of samples in it divided by the total samples in the parent: w1 = 60/100 = 0.6.
- The weight for Node 2 is w2 = 40/100 = 0.4.
- Weighted Gini = (w1 * Gini(Node 1)) + (w2 * Gini(Node 2)) = (0.6 * 4/9) + (0.4 * 0.375) ≈ 0.267 + 0.150 = 0.417.
Incorrect Answer Analysis:
- 0.500 is the Gini impurity of the parent node (1 - (0.5² + 0.5²) = 0.5), not the weighted impurity of the split.
- 0.083 is the Information Gain (Gini Gain), which is calculated by subtracting the weighted Gini impurity from the parent node's Gini impurity (0.500 - 0.417 = 0.083).
- 0.410 is an incorrect calculation, likely resulting from taking a simple average of the two child node impurities instead of a weighted average.
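The arithmetic can be verified with a short Python sketch:

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

children = [(40, 20), (10, 30)]                      # (churn, no-churn) in each child node
total = sum(sum(c) for c in children)

weighted = sum(sum(c) / total * gini(c) for c in children)
gain = gini((50, 50)) - weighted                     # parent node is 50/50

print(f"weighted Gini ≈ {weighted:.3f}, Gini gain ≈ {gain:.3f}")   # ≈ 0.417 and ≈ 0.083
```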
A data scientist is analyzing a clinical trial dataset that includes the variables patient_age and systolic_blood_pressure (SBP). They observe that a significant number of SBP values are missing. Upon further investigation, the data scientist discovers that the probability of an SBP value being missing is correlated with patient_age, with younger patients being more likely to have a missing SBP value. However, within any specific age group, the reason for the missing SBP value is not related to the actual (unobserved) blood pressure level or any other unmeasured factor. Which type of missingness does this scenario describe?
Missing Completely at Random (MCAR)
Missing at Random (MAR)
Structural Missingness
Not Missing at Random (NMAR)
Answer Description
The correct answer is Missing at Random (MAR). In this scenario, the missingness of the systolic_blood_pressure (SBP) is dependent on another observed variable, which is patient_age. This is the key characteristic of MAR: the probability of a value being missing is related to other observed information in the dataset but not to the unobserved value itself.
- Missing Completely at Random (MCAR) is incorrect because the missingness is not completely random; it has a systematic relationship with the patient_age variable. If the data were MCAR, the probability of a missing SBP value would be the same for all patients, regardless of their age or any other characteristic.
- Not Missing at Random (NMAR) is incorrect because the scenario explicitly states that the missingness is not related to the actual unobserved blood pressure level. NMAR would apply if, for example, patients with very high blood pressure were less likely to have their SBP recorded, meaning the missingness depends on the value of the missing variable itself.
- Structural Missingness is incorrect. This term typically refers to data that is missing for a logical reason inherent in the study's design. For instance, a question about the number of pregnancies would be structurally missing for male participants. The scenario described does not fit this definition.
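As a sketch of how MAR data like this might be handled, the following Python example (using synthetic data and scikit-learn's IterativeImputer, both assumptions not stated in the question) imputes SBP while conditioning on patient_age, the observed variable that drives the missingness:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Synthetic clinical data: younger patients are more likely to have a missing SBP value
rng = np.random.default_rng(7)
n = 1000
age = rng.integers(20, 85, n)
sbp = 95 + 0.6 * age + rng.normal(0, 10, n)

df = pd.DataFrame({"patient_age": age, "systolic_blood_pressure": sbp})
missing_mask = rng.random(n) < np.clip(0.5 - 0.005 * age, 0.05, 0.5)
df.loc[missing_mask, "systolic_blood_pressure"] = np.nan

# Because age drives the missingness, including it in the imputation model keeps the mechanism ignorable
imputed = IterativeImputer(random_state=0).fit_transform(df)
df_imputed = pd.DataFrame(imputed, columns=df.columns)
print(df_imputed.describe())
```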
An analyst is investigating the linear association between two continuous variables X and Y using n = 7 paired observations. The following summary statistics are available:
- Sample standard deviation of X: s_X = 4
- Sample standard deviation of Y: s_Y = 3
- Sum of cross-products of deviations: Σ(x_i − x̄)(y_i − ȳ) = 36
Using Pearson's correlation coefficient and testing at the α = 0.05 significance level (two-tailed) for H₀: ρ = 0, which statement correctly states both the value of the sample correlation r and the appropriate decision on the null hypothesis?
r ≈ 0.83 and fail to reject the null hypothesis (no statistically significant linear correlation)
r = 0.50 and fail to reject the null hypothesis (no statistically significant linear correlation)
r ≈ 0.83 and reject the null hypothesis (statistically significant linear correlation)
r = 0.50 and reject the null hypothesis (statistically significant linear correlation)
Answer Description
The Pearson correlation for a sample is r = Σ(x_i − x̄)(y_i − ȳ) / [(n − 1)s_X s_Y]. Substituting the given values gives r = 36 / (6 × 4 × 3) = 0.50. To test significance, use the t statistic t = r √(n − 2) / √(1 − r²). With n = 7, t = 0.50 × √5 / √(1 − 0.25) ≈ 1.29. The critical value for a two-tailed test with df = 5 at α = 0.05 is about ±2.57, so |1.29| does not exceed the critical value. Therefore the analyst fails to reject H₀; the correlation is not statistically significant at the 0.05 level.
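The same numbers in Python, with SciPy supplying the t critical value and p-value:

```python
from math import sqrt
from scipy import stats

n, s_x, s_y, sxy = 7, 4, 3, 36
r = sxy / ((n - 1) * s_x * s_y)            # 36 / (6 * 4 * 3) = 0.50

t = r * sqrt(n - 2) / sqrt(1 - r ** 2)     # ≈ 1.29
t_crit = stats.t.ppf(0.975, df=n - 2)      # ≈ 2.57 for a two-tailed test at alpha = 0.05
p_value = 2 * stats.t.sf(abs(t), df=n - 2)

print(f"r = {r:.2f}, t ≈ {t:.2f}, critical ≈ {t_crit:.2f}, p ≈ {p_value:.3f}")
```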
Nice!
Looks like that's it! You can go back and review your answers or click the button below to grade your test.