
CompTIA DataX Practice Test (DY0-001)

Use the form below to configure your CompTIA DataX Practice Test (DY0-001). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

Questions
Number of questions in the practice test
Free users are limited to 20 questions; upgrade for unlimited questions
Seconds Per Question
Determines how long you have to finish the practice test
Exam Objectives
Which exam objectives should be included in the practice test

CompTIA DataX DY0-001 (V1) Information

CompTIA DataX is an expert-level, vendor-neutral certification aimed at deeply experienced data science professionals. Launched on July 25, 2024, the exam verifies advanced competencies across the full data science lifecycle, from mathematical modeling and machine learning to deployment and specialized applications such as NLP, computer vision, and anomaly detection.

The exam comprehensively covers five key domains:

  • Mathematics and Statistics (~17%)
  • Modeling, Analysis, and Outcomes (~24%)
  • Machine Learning (~24%)
  • Operations and Processes (~22%)
  • Specialized Applications of Data Science (~13%)

It includes a mix of multiple‑choice and performance‑based questions (PBQs), simulating real-world tasks like interpreting data pipelines or optimizing machine learning workflows. The duration is 165 minutes, with a maximum of 90 questions. Scoring is pass/fail only, with no scaled score reported.

Question 1 of 20

A data scientist is performing exploratory data analysis on a dataset of e-commerce transaction amounts. They generate a histogram to understand the distribution of the transaction values, which are continuous and highly right-skewed. The initial plot, created using the default settings of a popular data visualization library, shows nearly all the data points clustered into a single bar on the far left, with a few other bars sparsely populated to the right. Which of the following is the most effective next step to improve the visualization and gain a clearer understanding of the data's distribution?

  • Replace the histogram with a box and whisker plot to better visualize the median and interquartile range.

  • Switch to a density plot, as histograms are not suitable for visualizing skewed continuous data.

  • Adjust the binning strategy by experimenting with different bin widths or applying a rule like the Freedman-Diaconis rule.

  • Increase the number of bins to the maximum allowable value to ensure maximum granularity.
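
For reference, a minimal sketch of the binning adjustment described above, using hypothetical right-skewed data and NumPy/Matplotlib's built-in Freedman-Diaconis option:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    amounts = rng.lognormal(mean=3.0, sigma=1.2, size=10_000)  # toy right-skewed transactions

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.hist(amounts, bins=10)      # default-style coarse binning: one dominant bar
    ax1.set_title("Default bins")
    ax2.hist(amounts, bins="fd")    # Freedman-Diaconis rule sets the bin width from the IQR
    ax2.set_title("Freedman-Diaconis bins")
    plt.show()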

Question 2 of 20

While building a logistic-regression model to predict loan default, you find that 8% of values for the numeric attribute debt_to_income_ratio are missing in the training data. Exploratory analysis reveals that the probability of a value being missing increases for borrowers who are younger than 25 and who have less than one year of employment, but within those strata the missingness appears random. The feature is continuous, right-skewed, and has a strong influence on the target. Regulation requires that the chosen imputation technique preserve the variable's variance and explicitly propagate the extra uncertainty introduced by the missing data to any downstream parameter estimates. Which imputation type is the most appropriate to meet these constraints?

  • k-nearest-neighbors imputation using Euclidean distance on standardized predictors

  • Multiple imputation with pooled estimates across several completed data sets

  • Listwise deletion of all records that lack debt_to_income_ratio

  • Single mean imputation calculated within each cross-validation fold
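
A simplified sketch of the multiple-imputation-with-pooling idea, using scikit-learn's IterativeImputer with posterior sampling as one possible engine; the data layout and the reduced pooling step (full Rubin's rules also combine within-imputation variance) are illustrative assumptions:

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import LogisticRegression

    def pooled_logit_coefficients(X_missing, y, m=5):
        """Fit the model on m completed data sets and pool the estimates."""
        coefs = []
        for i in range(m):
            imputer = IterativeImputer(sample_posterior=True, random_state=i)
            X_complete = imputer.fit_transform(X_missing)
            fit = LogisticRegression(max_iter=1000).fit(X_complete, y)
            coefs.append(fit.coef_.ravel())
        coefs = np.vstack(coefs)
        pooled = coefs.mean(axis=0)               # pooled point estimate
        between_var = coefs.var(axis=0, ddof=1)   # extra uncertainty from the missing data
        return pooled, between_var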

Question 3 of 20

A data scientist is preparing a manufacturing data set for a k-nearest neighbors (k-NN) model that uses Euclidean distance. The data contain two continuous variables: AnnualEnergy_kWh, with a range of 0 to 12,000, and MaintenanceDowntime_min, with a range of 0 to 7,200.

During pilot runs, the distance metric is dominated by AnnualEnergy_kWh, causing records with high downtime to be misclassified. According to best practice for normalization, which preprocessing step should the data scientist apply before training so that both variables contribute proportionally to the distance calculation?

  • Standardize each feature to zero mean and unit variance (z-score).

  • Apply a natural logarithm transform (log1p) to every value in both features.

  • Rescale each feature to the 0-1 interval using min-max normalization.

  • Generate polynomial cross-terms between the two features and include them in the model.
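
A minimal sketch of the min-max rescaling step with scikit-learn (the sample values are hypothetical):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[11500.0,  300.0],
                  [ 4200.0, 6900.0],
                  [  800.0, 1200.0]])          # [AnnualEnergy_kWh, MaintenanceDowntime_min]
    X_scaled = MinMaxScaler().fit_transform(X)  # each column now lies in [0, 1]
    print(X_scaled)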

Question 4 of 20

A data scientist is analyzing the relationship between two continuous variables. A scatter plot reveals a clear pattern: as one variable increases, the other consistently increases, but the relationship is distinctly non-linear (curved). The calculated coefficients are a Pearson correlation of 0.2 and a Spearman correlation of 0.9. Which statement provides the MOST accurate explanation for this significant difference in the coefficient values?

  • Spearman correlation is only appropriate for ordinal data, and its application to continuous data has artificially inflated its value.

  • The presence of significant outliers is suppressing the Pearson correlation, while the rank-based Spearman correlation is unaffected.

  • Pearson correlation's low value reflects the lack of a linear relationship, while Spearman correlation's high value accurately captures the strong monotonic trend.

  • The data likely violates the normality assumption required for Pearson correlation, leading to an inaccurate result.
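
A quick illustration of the two coefficients on hypothetical monotonic but non-linear data, using SciPy:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    x = np.linspace(1, 10, 200)
    y = np.exp(x)                  # strictly increasing, but strongly non-linear

    print(pearsonr(x, y)[0])       # below 1: captures only the linear component of the trend
    print(spearmanr(x, y)[0])      # exactly 1.0: the relationship is perfectly monotonic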

Question 5 of 20

A data scientist is building a multiple linear regression model to predict housing prices. The initial model, using only the living area in square feet as a predictor, yields an R-squared value of 0.65. To improve the model, the data scientist adds ten additional predictor variables, including number of bedrooms, number of bathrooms, and age of the house. The new model results in an R-squared value of 0.78. Which of the following is the most critical consideration for the data scientist when interpreting this increase in R-squared?

  • The new R-squared value is high, which invalidates the p-values of the individual coefficients in the model.

  • The increase from 0.65 to 0.78 definitively proves that the additional variables have strong predictive power and the new model is superior.

  • The R-squared value will almost always increase when more predictors are added to the model, regardless of their actual significance, potentially leading to overfitting.

  • An R-squared of 0.78 indicates that 78% of the model's predictions for house prices will be correct.
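
A small sketch of adjusted R-squared, which penalizes added predictors; the sample sizes used below are hypothetical:

    def adjusted_r2(r2, n, p):
        """n = number of observations, p = number of predictors."""
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)

    print(adjusted_r2(0.78, n=500, p=11))   # only slightly below 0.78
    print(adjusted_r2(0.78, n=50, p=11))    # the penalty grows sharply when n is small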

Question 6 of 20

A data scientist is investigating the relationship between two categorical variables: 'User Segment' (with 4 levels: 'Free Trial', 'Basic', 'Pro', 'Enterprise') and 'Feature Adoption Rate' (with 3 levels: 'Low', 'Medium', 'High'). They construct a 4x3 contingency table to perform a Chi-squared test of independence. After calculating the expected frequencies, they discover that two cells have an expected frequency below 5. Given this situation, what is the most appropriate immediate action to ensure the validity of the analysis?

  • Immediately apply Fisher's Exact Test, as it is more accurate for small sample sizes and low expected frequencies.

  • Combine adjacent or logically similar categories in one or both variables to increase the expected frequencies in the cells.

  • Remove the rows or columns containing the cells with low expected frequencies from the analysis.

  • Perform an independent samples t-test for each pair of user segments to compare their feature adoption.
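
A sketch of how the expected frequencies can be inspected and categories combined with SciPy/NumPy; the counts below are hypothetical:

    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[40, 25, 10],     # Free Trial: Low / Medium / High
                      [35, 30, 12],     # Basic
                      [20, 18,  6],     # Pro
                      [ 8,  5,  2]])    # Enterprise (sparse row)

    chi2, p, dof, expected = chi2_contingency(table)
    print((expected < 5).sum())         # number of cells that violate the rule of thumb

    # One remedy: combine logically similar categories, e.g. merge Pro and Enterprise.
    merged = np.vstack([table[:2], table[2] + table[3]])
    print(chi2_contingency(merged)[3])  # recomputed expected frequencies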

Question 7 of 20

A data science team is evaluating four association rules that have already met the project's minimum support and confidence thresholds:

  • Rule A: → support = 2%, confidence = 80%
  • Rule B: → support = 4%, confidence = 50%
  • Rule C: → support = 1%, confidence = 90%
  • Rule D: → support = 3%, confidence = 60%

To rank the rules, the team will use the reinforcement metric, also known as Rule Power Factor. Based on this metric, which rule is the most powerful?

  • Rule A

  • Rule D

  • Rule B

  • Rule C
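
The Rule Power Factor is commonly computed as support × confidence; a quick check over the four rules:

    rules = {"A": (0.02, 0.80), "B": (0.04, 0.50),
             "C": (0.01, 0.90), "D": (0.03, 0.60)}   # (support, confidence)
    rpf = {name: support * confidence for name, (support, confidence) in rules.items()}
    print(rpf)
    print(max(rpf, key=rpf.get))   # rule with the highest Rule Power Factor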

Question 8 of 20

A data science team is tuning a pricing engine whose objective is twice-differentiable and non-convex, subject to hundreds of inequality constraints and simple bounds. They have analytic gradients and Hessians and want every iterate to remain strictly inside the feasible region throughout the search. To do this, they choose a solver that

  1. augments the objective with a logarithmic barrier term −μ ∑log sᵢ(x) to prevent boundary violations,
  2. follows a central path by gradually decreasing the barrier parameter μ→0, and
  3. at each outer iteration solves a primal-dual Newton system instead of a quadratic programming subproblem.

Which class of constrained nonlinear optimization algorithms matches this strategy?

  • Nelder-Mead simplex search

  • Augmented Lagrangian (method of multipliers)

  • Primal-dual interior-point (path-following) method

  • Sequential quadratic programming
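
A simplified, primal-only barrier sketch of the path-following idea on a one-dimensional toy problem (a true primal-dual interior-point solver also maintains dual variables and solves a primal-dual Newton system at each outer iteration):

    # Minimize f(x) = (x - 3)^2 subject to 1 < x < 2, with slacks s1 = x - 1 and s2 = 2 - x.
    def newton_on_barrier(x, mu, iters=50):
        for _ in range(iters):
            # gradient and Hessian of (x - 3)^2 - mu*(log(x - 1) + log(2 - x))
            g = 2 * (x - 3) - mu / (x - 1) + mu / (2 - x)
            h = 2 + mu / (x - 1) ** 2 + mu / (2 - x) ** 2
            step = g / h
            while not (1 < x - step < 2):   # damp the step to stay strictly feasible
                step *= 0.5
            x -= step
        return x

    x = 1.5                                  # strictly feasible start
    for mu in [1.0, 0.1, 0.01, 1e-3, 1e-4]:  # follow the central path as mu -> 0
        x = newton_on_barrier(x, mu)
    print(x)                                 # approaches the constrained optimum at x = 2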

Question 9 of 20

An e-commerce company plans to run online validation of a new ranking model. The current production model (champion) will continue to serve users, while 50% of requests are randomly routed to a challenger. Business stakeholders want to replace the champion only if the challenger shows at least a 2% lift in click-through rate (CTR). Which step is most critical before traffic is split to ensure the experiment yields statistically valid evidence?

  • Increase the inference endpoint's autoscaling threshold so both variants can absorb peak traffic without throttling.

  • Deploy the challenger only in shadow mode, receiving mirrored traffic that does not affect live users.

  • Enable detailed feature logging for the challenger so offline explainability tools can be applied after the test.

  • Compute the minimum number of user impressions needed to detect a 2% absolute lift in CTR at the chosen significance and power levels.
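
A sketch of the pre-experiment power calculation with statsmodels; the baseline CTR, significance level, and power below are hypothetical choices:

    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    baseline_ctr = 0.05                      # assumed champion CTR
    effect = proportion_effectsize(baseline_ctr + 0.02, baseline_ctr)  # 2% absolute lift
    n_per_arm = NormalIndPower().solve_power(effect_size=effect,
                                             alpha=0.05, power=0.8,
                                             alternative="two-sided")
    print(round(n_per_arm))                  # minimum impressions needed per variant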

Question 10 of 20

A machine learning engineer is manually implementing the gradient descent algorithm to optimize a multivariate linear regression model. The objective is to minimize the Mean Squared Error (MSE) cost function by iteratively adjusting the model's parameters (weights). For each iteration of the algorithm, which of the following mathematical operations is most fundamental for determining the direction and magnitude of the update for a specific weight?

  • Computing the second partial derivative (Hessian matrix) of the cost function.

  • Calculating the Euclidean distance between the predicted and actual values.

  • Calculating the partial derivative of the MSE cost function with respect to that specific weight.

  • Applying the chain rule to the model's activation function.
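
A minimal NumPy sketch of one gradient-descent update for linear regression under MSE, showing the partial-derivative computation:

    import numpy as np

    def gradient_step(X, y, w, lr=0.01):
        errors = X @ w - y
        # partial derivative of MSE with respect to each weight w_j:
        # (2/n) * sum_i x_ij * (x_i . w - y_i)
        grad = (2.0 / len(y)) * (X.T @ errors)
        return w - lr * grad                 # move against the gradient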

Question 11 of 20

A data scientist has developed a multiple linear regression model to predict housing prices. After the initial training, the scientist examines the model's performance by creating a residual vs. fitted values plot. The plot reveals that the residuals are not randomly scattered around the zero line; instead, they form a distinct, parabolic (U-shaped) pattern. What is the most likely issue with the model, and what is the most appropriate next step in the model design iteration process?

  • The model exhibits non-linearity, indicating it fails to capture the underlying structure of the data. The next step should be to use feature engineering to create polynomial terms for the relevant predictors.

  • The model is likely overfitting the training data. The next step should be to increase the L2 regularization penalty (e.g., in a Ridge regression) to reduce the model's complexity.

  • The plot reveals multicollinearity among the predictor variables. The next step should be to calculate the Variance Inflation Factor (VIF) for each feature and consider removing highly correlated predictors.

  • The plot shows evidence of heteroscedasticity, meaning the variance of the errors is not constant. The next step should be to apply a Box-Cox transformation to the response variable to stabilize the variance.
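
A sketch of the diagnosis-and-fix loop on hypothetical quadratic data, using scikit-learn's PolynomialFeatures for the feature-engineering step:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(300, 1))
    y = 2 + 0.5 * X[:, 0] ** 2 + rng.normal(0, 1, 300)   # truly curved relationship

    linear = LinearRegression().fit(X, y)
    residuals = y - linear.predict(X)                    # plotted vs fitted: U-shaped pattern

    X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
    poly_fit = LinearRegression().fit(X_poly, y)
    residuals_poly = y - poly_fit.predict(X_poly)        # now roughly patternless around zero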

Question 12 of 20

You are building a sentiment classifier that must label customer-service tickets as Positive, Negative, or Neutral. In a corpus of 600,000 tickets, about 80% are Neutral, 15% Negative, and 5% Positive. An LSTM model currently reports 81% overall accuracy, but stakeholders want a single evaluation metric that is not dominated by the Neutral majority and instead gives each sentiment category equal influence on the final score. Which metric should you monitor during model development to satisfy this requirement?

  • Micro-averaged precision

  • Overall accuracy

  • Macro-averaged F1 score

  • Weighted F1 score
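
A small illustration of why macro averaging resists the majority class, with a deliberately degenerate classifier and hypothetical labels:

    from sklearn.metrics import f1_score

    y_true = ["Neutral"] * 8 + ["Negative"] * 1 + ["Positive"] * 1   # imbalanced toy labels
    y_pred = ["Neutral"] * 10                                        # always predicts the majority

    print(f1_score(y_true, y_pred, average="macro", zero_division=0))     # low: every class counts equally
    print(f1_score(y_true, y_pred, average="weighted", zero_division=0))  # flattered by the Neutral majority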

Question 13 of 20

A data scientist is constructing a feature matrix where the existing feature vectors are linearly independent. A new feature vector is engineered, which is a linear combination of two of the original vectors. This new vector is then appended as a new column to the matrix. Which statement correctly describes the primary consequence of this action on the properties of the feature matrix?

  • The span of the column space expands to a higher dimension because an additional vector has been introduced.

  • The new vector replaces one of the original vectors in the basis, resolving a deficient rank problem in the original matrix.

  • The span of the column space is unaffected, which improves the numerical stability of subsequent model coefficient estimations.

  • The span of the column space remains unchanged, but perfect multicollinearity is introduced.
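
A quick NumPy check of the rank argument, with a hypothetical feature matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                     # linearly independent columns
    new_col = 2 * X[:, 0] + 3 * X[:, 1]               # linear combination of existing features
    X_aug = np.column_stack([X, new_col])

    print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(X_aug))  # both 3: the span is unchanged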

Question 14 of 20

A machine learning engineer is training a deep neural network for a non-stationary problem and notices that the learning process has effectively halted. They determine that their current optimizer, Adagrad, has caused the learning rate to diminish to a near-zero value. To mitigate this, they decide to switch to the Root Mean Square Propagation (RMSprop) optimizer. What is the key mechanism in RMSprop that directly addresses this issue of a rapidly vanishing learning rate seen in Adagrad?

  • It introduces a penalty term to the loss function based on the magnitude of the model's weights to prevent overfitting.

  • It adds a fraction of the previous weight update vector to the current one, helping to accelerate convergence and dampen oscillations.

  • It computes adaptive learning rates by storing an exponentially decaying average of past gradients (first moment) and past squared gradients (second moment).

  • It calculates a moving average of the squared gradients using a decay parameter, which prevents the denominator of the update rule from monotonically increasing.
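
A side-by-side sketch of the two update rules (simplified, element-wise NumPy form) showing why Adagrad's denominator can only grow while RMSprop's can shrink again:

    import numpy as np

    def adagrad_update(w, grad, accum, lr=0.01, eps=1e-8):
        accum = accum + grad ** 2                      # running SUM of squared gradients
        return w - lr * grad / (np.sqrt(accum) + eps), accum

    def rmsprop_update(w, grad, avg, lr=0.01, rho=0.9, eps=1e-8):
        avg = rho * avg + (1 - rho) * grad ** 2        # exponentially DECAYING average
        return w - lr * grad / (np.sqrt(avg) + eps), avg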

Question 15 of 20

A monitoring script records the number of checkout failures per minute on a high-traffic e-commerce platform. Historical data indicate that failures occur independently at a constant average rate of 2.3 per minute. Assuming this process follows a Poisson distribution, which of the following values is closest to the probability that at least five checkout failures will be observed in a randomly selected minute?

  • 0.024

  • 0.084

  • 0.916

  • 0.209
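
A one-line check of the tail probability with SciPy:

    from scipy.stats import poisson

    print(poisson.sf(4, mu=2.3))        # P(X >= 5) = 1 - P(X <= 4), roughly 0.084
    print(1 - poisson.cdf(4, mu=2.3))   # equivalent computation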

Question 16 of 20

A data scientist is performing Principal Component Analysis (PCA) on a high-dimensional dataset where the features have been standardized. After computing the covariance matrix of the data, the analysis proceeds with an eigen-decomposition. What does the first principal component represent in this context?

  • The largest eigenvalue of the covariance matrix, which quantifies the total variance captured by the model.

  • The eigenvector of the covariance matrix associated with the largest eigenvalue.

  • The direction defined by the eigenvector with the smallest eigenvalue, as it captures the least amount of systemic noise.

  • A linear combination of features designed to maximize the separation between predefined classes.
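
A compact NumPy sketch of the eigen-decomposition view of PCA on hypothetical standardized data:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)      # standardized features

    cov = np.cov(X_std, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigh returns eigenvalues in ascending order
    first_pc = eigvecs[:, np.argmax(eigvals)]         # eigenvector paired with the largest eigenvalue
    scores = X_std @ first_pc                         # data projected onto the first principal component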

Question 17 of 20

A data science team at an e-commerce company has developed a highly accurate customer churn prediction model using a complex gradient boosting algorithm. During the Evaluation phase, stakeholders confirm the model's predictive power but state that their primary goal has evolved. They now need to understand the specific reasons why customers are churning to inform retention strategies, a task for which the current "black box" model is ill-suited. According to the CRISP-DM methodology, what is the most appropriate immediate next step?

  • Return to the Business Understanding phase to redefine the project objectives and success criteria to include model interpretability.

  • Return to the Modeling phase and retrain using an inherently interpretable model, such as a decision tree or logistic regression.

  • Proceed to the Deployment phase since the model is technically accurate, and initiate a separate project for root-cause analysis.

  • Return to the Data Preparation phase to create new features that might provide more explanatory power when used in a new model.

Question 18 of 20

A data science team is creating a container image for a predictive-analytics service that will be offered under a proprietary license. Corporate policy forbids distribution of any image that contains a direct or transitive dependency released under the GNU GPL or other strong-copyleft licenses. The team wants to block non-compliant images automatically before they are pushed to the internal registry, while adding as little manual work as possible to the continuous-integration (CI) pipeline.

Which approach best meets these dependency-licensing requirements?

  • Replace any GPL-licensed dependencies with internal forks released under a permissive license and document the change in the project's README.

  • Generate an SBOM during each build with Syft or Trivy and have an Open Policy Agent rule fail the pipeline whenever a prohibited license is detected.

  • Pin every third-party package version in a requirements.txt file and commit it to version control to keep a reproducible inventory of licenses.

  • Run pip freeze after the image is built, store the output as a build artifact, and ask the compliance team to review the file once a quarter.
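
For illustration only, a Python stand-in for the policy gate described in the SBOM option (production pipelines would typically express the rule in OPA/Rego); the CycloneDX-style field layout and license list are simplifying assumptions:

    import json
    import sys

    PROHIBITED = ("GPL-2.0", "GPL-3.0", "AGPL-3.0", "SSPL-1.0")   # strong-copyleft identifiers (assumed list)

    def find_violations(sbom_path):
        with open(sbom_path) as fh:
            sbom = json.load(fh)
        violations = []
        for component in sbom.get("components", []):
            for entry in component.get("licenses", []):
                license_id = entry.get("license", {}).get("id", "")
                if license_id.startswith(PROHIBITED):
                    violations.append((component.get("name"), license_id))
        return violations

    if __name__ == "__main__":
        found = find_violations(sys.argv[1])
        if found:
            print("Prohibited licenses detected:", found)
            sys.exit(1)    # non-zero exit fails the CI stage and blocks the image push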

Question 19 of 20

During a schema-on-read validation step in your ETL pipeline, you must reject any record whose order_date field is not a valid calendar date in the form YYYY-MM-DD. The rule should allow only years between 1900 and 2099, months 01-12, and days 01-31; it does not need to account for month-specific day limits (for example, 31 February may pass). Which regular expression best enforces this requirement?

  • ^\d{4}-\d{2}-\d{2}$

  • ^([0-9]{2}){2}-(0[1-9]|1[0-2])-(0[1-9]|3[01])$

  • ^(19|20)\d{2}/(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])$

  • ^(19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$
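
A quick way to exercise a candidate pattern against sample values (the dates below are hypothetical); note that 2024-02-31 passes by design, since the rule ignores month-specific day limits:

    import re

    pattern = re.compile(r"^(19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")
    for value in ["2024-07-25", "1899-12-31", "2024-13-01", "2024/07/25", "2024-02-31"]:
        print(value, bool(pattern.match(value)))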

Question 20 of 20

During a market-basket analysis of 10,000 e-commerce transactions, you evaluate the association rule {Wireless Mouse} → {Mouse Pad}. The items appear with the following absolute frequencies:

  • Wireless Mouse: 2,000 transactions
  • Mouse Pad: 1,500 transactions
  • Both items together: 600 transactions

Based on these counts, which statement about the lift of the rule is correct?

  • The lift is 3.33, showing a very strong positive association between the two items.

  • The lift is 0.75, indicating that customers who buy a wireless mouse are less likely than average to buy a mouse pad.

  • The lift is 2.0, showing that customers who buy a wireless mouse are twice as likely to buy a mouse pad compared with the baseline.

  • The lift is 1.5, indicating a slight positive association between the two items.
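
A worked check of the confidence and lift values from the stated counts:

    n_transactions = 10_000
    support_mouse = 2_000 / n_transactions            # P(Wireless Mouse) = 0.20
    support_pad = 1_500 / n_transactions              # P(Mouse Pad) = 0.15
    support_both = 600 / n_transactions               # P(both) = 0.06

    confidence = support_both / support_mouse         # 0.30
    lift = support_both / (support_mouse * support_pad)
    print(confidence, lift)                           # lift = 0.06 / (0.20 * 0.15) = 2.0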