CompTIA DataX Practice Test (DY0-001)

The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

CompTIA DataX DY0-001 (V1) Information

CompTIA DataX is an expert‑level, vendor‑neutral certification aimed at deeply experienced data science professionals. Launched on July 25, 2024, the exam verifies advanced competencies across the full data science lifecycle - from mathematical modeling and machine learning to deployment and specialized applications like NLP, computer vision, and anomaly detection.

The exam comprehensively covers five key domains:

  • Mathematics and Statistics (~17%)
  • Modeling, Analysis and Outcomes (~24%)
  • Machine Learning (~24%)
  • Operations and Processes (~22%)
  • Specialized Applications of Data Science (~13%)

It includes a mix of multiple‑choice and performance‑based questions (PBQs), simulating real-world tasks like interpreting data pipelines or optimizing machine learning workflows. The duration is 165 minutes, with a maximum of 90 questions. Scoring is pass/fail only, with no scaled score reported.

Question 1 of 20

You are inspecting a retail dataset where all columns have been imported as numeric values:

  • Loyalty_Tier: 1 = Bronze, 2 = Silver, 3 = Gold, 4 = Platinum
  • Discount_Rate: numeric percentage between 0 and 100
  • Units_Sold: whole-number count of items per transaction
  • Transaction_Timestamp: Unix epoch seconds

Before computing summary statistics and visualizations, which single column should you cast to a categorical type so that exploratory data analysis treats its values as membership levels rather than quantities?

  • Loyalty_Tier

  • Units_Sold

  • Transaction_Timestamp

  • Discount_Rate

Question 2 of 20

A gradient-boosting regressor that predicts delivery times for an online food-delivery platform was trained on six months of historical orders. Two months after deployment, a new municipal traffic law lowers the maximum speed limit from 35 mph to 25 mph on all urban streets. The distributions of the model's input features (order size, time of day, restaurant-to-customer distance, day of week) remain statistically indistinguishable from the training set, yet the model's residuals become consistently positive and the mean absolute error doubles. Which primary cause of model drift best explains this behaviour?

  • Random measurement noise in the performance metric (irreducible error)

  • A shift in the relationship between features and target caused by the external policy change (concept drift)

  • A covariate shift in the input feature distributions (data drift)

  • Information about the target variable leaking into the feature set (data leakage)

Question 3 of 20

During a schema-on-read validation step in your ETL pipeline, you must reject any record whose order_date field is not a valid calendar date in the form YYYY-MM-DD. The rule should allow only years between 1900 and 2099, months 01-12, and days 01-31; it does not need to account for month-specific day limits (for example, 31 February may pass). Which regular expression best enforces this requirement?

  • ^(19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$

  • ^([0-9]{2}){2}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$

  • ^\d{4}-\d{2}-\d{2}$

  • ^(19|20)\d{2}/(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])$
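
The rule in this question can be tried out against a few sample strings. The following is a minimal sketch, assuming Python's standard `re` module; the pattern is one way to express "years 1900-2099, months 01-12, days 01-31", and the function name and sample values are hypothetical.

```python
import re

# Assumed pattern: year 1900-2099, month 01-12, day 01-31, hyphen separators.
# It deliberately ignores month-specific day limits, as the question allows.
DATE_RE = re.compile(r"^(19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")


def is_valid_order_date(value: str) -> bool:
    """Return True when the whole string has the required YYYY-MM-DD shape."""
    return DATE_RE.fullmatch(value) is not None


# Hypothetical sample records for illustration only.
samples = ["2024-02-31", "1899-12-31", "2024-13-01", "2024/01/15", "2024-01-32"]
results = {s: is_valid_order_date(s) for s in samples}

for sample, ok in results.items():
    print(f"{sample}: {'accept' if ok else 'reject'}")
```

Only the first sample is accepted: 31 February passes by design, while the out-of-range year, month, and day and the slash-separated date are rejected.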

Question 4 of 20

A data science team is developing a real-time fraud detection model for financial transactions. The deployment specifications are strict: inference latency must not exceed 100ms to ensure a seamless user experience, and the model must achieve a recall of at least 0.92 to minimize the number of missed fraudulent transactions. After experimenting with several architectures, the team has narrowed the choice down to three models and has compiled the following specification testing results:

Model              | Recall | F1-Score | Average Inference Latency (ms)
--------------------------------------------------------------------
Model A (DNN)      |   0.95 |     0.91 | 145
Model B (GBM)      |   0.93 |     0.92 |  85
Model C (LogReg)   |   0.88 |     0.89 |  20

Based on an analysis of these specification testing results, which model should be recommended for deployment?

  • Model C (LogReg), because its extremely low latency provides the best user experience while maintaining a high F1-Score.

  • Model B (GBM), because it is the only model that satisfies both the minimum recall and maximum latency requirements.

  • None of the models are suitable, as no single model optimizes both recall and latency simultaneously.

  • Model A (DNN), because it has the highest recall, which is the most critical metric for minimizing missed fraud.
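
The two deployment constraints can be applied mechanically to the results table. This is a small sketch in plain Python; the dictionary layout and the names `MIN_RECALL` and `MAX_LATENCY_MS` are illustrative assumptions, while the numbers come from the table above.

```python
# Specification testing results copied from the question.
candidates = [
    {"name": "Model A (DNN)", "recall": 0.95, "f1": 0.91, "latency_ms": 145},
    {"name": "Model B (GBM)", "recall": 0.93, "f1": 0.92, "latency_ms": 85},
    {"name": "Model C (LogReg)", "recall": 0.88, "f1": 0.89, "latency_ms": 20},
]

MIN_RECALL = 0.92      # recall must be at least this value
MAX_LATENCY_MS = 100   # inference latency must not exceed this value

# Keep only the models that satisfy both hard constraints.
eligible = [
    m["name"]
    for m in candidates
    if m["recall"] >= MIN_RECALL and m["latency_ms"] <= MAX_LATENCY_MS
]
print(eligible)
```

Treating both requirements as hard filters, rather than as metrics to trade off, makes it clear which candidates remain before any secondary metric such as F1 is considered.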

Question 5 of 20

A quantitative trading firm is building a model to predict the end-of-day price volatility of a specific Exchange-Traded Fund (ETF). The team is using two primary data sources:

  1. The ETF's historical daily Open, High, Low, and Close (OHLC) prices, recorded once at the end of each trading day.
  2. A real-time social media sentiment score related to the ETF's underlying assets, captured and timestamped every minute during trading hours.

The team's initial approach involves a direct join of the two datasets on the calendar date, which results in the daily OHLC data being duplicated for every one-minute sentiment reading. Which data issue is most fundamentally compromising the model's integrity, and what is the correct first step to remediate it?

  • The model has insufficient features. The team should engineer new features by creating lagged observations of the sentiment data to capture its delayed impact on daily prices.

  • The model will suffer from multicollinearity. A Variance Inflation Factor (VIF) analysis should be run to identify and remove the high correlation between sentiment and price movements.

  • The data exhibits non-stationarity. Both time series should be made stationary using differencing before any modeling is attempted to avoid spurious correlations.

  • The datasets have a granularity misalignment. The one-minute sentiment data must be aggregated into a daily summary statistic (e.g., mean, total, or final value) before being joined with the daily OHLC data.
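
To illustrate what aligning granularity looks like in practice, the sketch below rolls per-minute sentiment readings up to one value per day before joining them to daily prices. It uses only the Python standard library; all dates, prices, and scores are made-up illustrative values, and the daily mean is just one of several possible summary statistics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-minute sentiment readings: (date, time, score).
minute_sentiment = [
    ("2024-03-01", "09:30", 0.2),
    ("2024-03-01", "09:31", 0.4),
    ("2024-03-04", "09:30", -0.1),
    ("2024-03-04", "09:31", -0.3),
]

# Hypothetical daily OHLC rows, one per trading day.
daily_ohlc = {
    "2024-03-01": {"open": 100.0, "high": 102.0, "low": 99.5, "close": 101.0},
    "2024-03-04": {"open": 101.0, "high": 101.5, "low": 98.0, "close": 98.5},
}

# Step 1: aggregate the minute-level scores to one value per calendar date.
scores_by_day = defaultdict(list)
for date, _time, score in minute_sentiment:
    scores_by_day[date].append(score)
daily_sentiment = {date: mean(scores) for date, scores in scores_by_day.items()}

# Step 2: join on date; each trading day now appears exactly once.
joined = [
    {"date": date, **ohlc, "sentiment_mean": daily_sentiment[date]}
    for date, ohlc in daily_ohlc.items()
    if date in daily_sentiment
]
print(joined)
```

After aggregation the join is one-to-one, so the daily OHLC rows are no longer duplicated for every sentiment reading.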

Question 6 of 20

A data scientist is working with a dataset containing 10,000 samples and 784 features, represented as a data matrix D with dimensions 10,000 x 784. The goal is to apply a linear transformation to this dataset to reduce the number of features to 64. This transformation is achieved by right-multiplying the data matrix D by a transformation matrix T, resulting in a new matrix D_prime. What must be the dimensions of the transformation matrix T for this operation to be valid, and what will be the dimensions of the resulting matrix D_prime?

  • Matrix T must be 784 x 784, and D_prime will be 10,000 x 784.

  • Matrix T must be 784 x 64, and D_prime will be 10,000 x 64.

  • Matrix T must be 10,000 x 64, and D_prime will be 10,000 x 64.

  • Matrix T must be 64 x 784, and D_prime will be 10,000 x 784.
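
The shape rule for matrix multiplication can be checked without building the matrices. The helper below is a hypothetical sketch in plain Python: an (n x k) matrix can be right-multiplied only by a (k x m) matrix, and the product is (n x m).

```python
def matmul_shape(left, right):
    """Return the shape of left @ right, or raise if the shapes are incompatible."""
    n, k = left
    k2, m = right
    if k != k2:
        raise ValueError(f"inner dimensions differ: {k} vs {k2}")
    return (n, m)


d_shape = (10_000, 784)   # data matrix D from the question
t_shape = (784, 64)       # a candidate transformation matrix T

d_prime_shape = matmul_shape(d_shape, t_shape)
print(d_prime_shape)
```

Passing a shape such as `(64, 784)` or `(10_000, 64)` for T raises `ValueError`, because the inner dimensions no longer match the 784 columns of D.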

Question 7 of 20

A data science team was tasked with developing a predictive maintenance model for a manufacturing plant's machinery. The team immediately sourced sensor data, cleaned it, and built a technically robust model with 98% accuracy in identifying potential failures on a held-out test set. However, during the initial deployment meetings, it became clear that the model's output did not integrate with the maintenance department's existing workflow, and the predictions were not aligned with the specific component failures the business had prioritized for cost-saving. This has led to significant resistance from stakeholders. According to the Cross-Industry Standard Process for Data Mining (CRISP-DM) model, which phase was most likely neglected, leading to these adoption challenges?

  • Modeling

  • Data Preparation

  • Evaluation

  • Business Understanding

Question 8 of 20

A data scientist is finalizing a presentation for a government regulatory body and internal executive stakeholders. The presentation's central element is a heat map visualizing model performance degradation across demographic segments. The current visualization uses a traditional red-to-green diverging color scale to represent poor to strong performance, respectively. An accessibility audit flagged this choice as non-compliant for users with deuteranopia. To ensure the chart is fully accessible and clearly communicates insights to all viewers, which action is the most appropriate for the data scientist to take?

  • Convert the visualization to use a single-hue sequential color palette, such as 'viridis', varying lightness from light to dark to represent the performance metric.

  • Supplement the heat map with a detailed data table in an appendix and add an accessibility note directing users to this table for the raw values.

  • Adjust the saturation and brightness of the existing red and green colors until the contrast ratio between them meets the WCAG 2.1 AA requirement of 3:1 for graphical objects.

  • Replace the red-green scale with a perceptually uniform, colorblind-safe diverging palette, such as blue-to-orange, and verify that adjacent colors meet a 3:1 contrast ratio.

Question 9 of 20

A fashion e-commerce company wants to roll out multimodal search so that shoppers can type a natural-language query such as "red leather ankle boots" and instantly retrieve the most relevant product images from a catalog of 50 million pictures. Design constraints include:

  • End-to-end latency must stay below 100 ms.
  • Queries are open-ended, not limited to a fixed set of classes.
  • The image catalog will be stored as dense vectors in an approximate nearest-neighbor (ANN) index.

Which modeling strategy should the data-science team choose to satisfy all of these requirements while preserving strong semantic alignment between text and images?

  • Fine-tune a large language model on product captions only and use its [CLS] token embedding as the representation for both queries and images.

  • Deploy separate image and text classifiers and average their softmax probability outputs at query time (late fusion) to rank results.

  • Train a contrastive dual-encoder (two-tower) model on paired caption-image data so that the text and image encoders produce vectors in the same embedding space, then pre-compute and ANN-index the image embeddings.

  • Generate synthetic captions for every product image with an image-captioning model and index those captions with a TF-IDF bag-of-words search engine.

Question 10 of 20

A data science team has developed a large gradient boosting model for a real-time credit card fraud detection system. During offline testing on a historical dataset, the model achieved an F1-score of 0.95, significantly outperforming the existing rule-based system. The primary business requirement is to reduce fraud losses, and a key technical constraint is that any transaction must be scored in under 50 milliseconds to avoid impacting the customer experience. What is the most critical step the team must take to validate the model against the project requirements before recommending deployment?

  • Establish a continuous monitoring system to detect data drift and concept drift in the production data stream.

  • Conduct further hyperparameter tuning using a wider search space and cross-validation to attempt to increase the F1-score above 0.95.

  • Deploy the model to a staging environment that mirrors production hardware and conduct load testing to measure its inference latency under simulated real-world traffic.

  • Implement SHAP (SHapley Additive exPlanations) to generate detailed explanations for the model's predictions to meet potential audit requirements.

Question 11 of 20

A machine learning engineer is manually implementing the gradient descent algorithm to optimize a multivariate linear regression model. The objective is to minimize the Mean Squared Error (MSE) cost function by iteratively adjusting the model's parameters (weights). For each iteration of the algorithm, which of the following mathematical operations is most fundamental for determining the direction and magnitude of the update for a specific weight?

  • Calculating the Euclidean distance between the predicted and actual values.

  • Calculating the partial derivative of the MSE cost function with respect to that specific weight.

  • Applying the chain rule to the model's activation function.

  • Computing the second partial derivative (Hessian matrix) of the cost function.
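
For reference, a gradient descent update for linear regression under MSE can be written out by hand. The sketch below is a minimal plain-Python version; the toy dataset, learning rate, and iteration count are arbitrary illustrative choices, and the data were generated from weights (2, 3) with zero bias.

```python
def predict(x, w, b):
    """Linear model: dot(w, x) + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def mse_gradient(X, y, w, b):
    """Partial derivatives of MSE = (1/n) * sum((pred - y)^2) w.r.t. each weight and the bias."""
    n = len(X)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        err = predict(xi, w, b) - yi
        for j, xj in enumerate(xi):
            grad_w[j] += 2.0 * err * xj / n   # dMSE/dw_j
        grad_b += 2.0 * err / n               # dMSE/db
    return grad_w, grad_b


# Toy data (hypothetical): y = 2*x1 + 3*x2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 3.0, 5.0]

w, b = [0.0, 0.0], 0.0
LEARNING_RATE = 0.1
initial_grad_w, initial_grad_b = mse_gradient(X, y, w, b)

for _ in range(3000):
    grad_w, grad_b = mse_gradient(X, y, w, b)
    # Step against the gradient; its sign gives direction, its size scales the step.
    w = [wj - LEARNING_RATE * gj for wj, gj in zip(w, grad_w)]
    b -= LEARNING_RATE * grad_b

print(w, b)
```

Each weight is updated using only its own partial derivative of the cost, scaled by the learning rate; no second derivatives are required for this basic form of the algorithm.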

Question 12 of 20

A payment-processing platform is evaluating a gradient-boosted decision-tree (GBDT) fraud-detection model against the company's long-standing rule-based filter. During a 30-day A/B test the following aggregated results were collected:

Metric                                   Rule-based |  GBDT
--------------------------------------------------------------
Precision                                        0.52 | 0.71
Recall                                           0.78 | 0.80
F1 score                                         0.62 | 0.75
Average inference latency (ms)                     22 | 48
False positives per million transactions         9600 | 4800
False negatives per million transactions         2600 | 2400
Infrastructure cost per 1 M inferences (USD)       25 | 60

The service-level agreement (SLA) requires latency to stay below 75 ms and fewer than 6000 false positives per million transactions. Monthly volume is 40 million transactions. Finance estimates that each false positive costs USD 2.50 in manual-review labor, while each false negative leads to an average chargeback loss of USD 15.

Which statement most strongly justifies recommending the GBDT model over the conventional rule-based process?

  • The infrastructure cost of the GBDT increases by 140 %, making it economically infeasible despite its modest gain in recall.

  • The GBDT satisfies all SLA limits and, after accounting for error-related costs, is projected to save roughly USD 600 000 per month even after its extra USD 1 400 infrastructure bill.

  • Because confidence intervals for precision and recall were not reported, the results cannot justify replacing the well-understood rule-based system.

  • Although the GBDT improves F1, its inference latency more than doubles, so user-experience risk outweighs any potential savings.
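
The cost figures in this question can be reproduced directly from the table and the finance estimates. The sketch below is plain Python; the dictionary keys and function name are illustrative, and the inputs are the values stated in the question.

```python
MONTHLY_MILLIONS = 40   # monthly volume, in millions of transactions
FP_COST = 2.50          # USD per false positive (manual review)
FN_COST = 15.00         # USD per false negative (chargeback)

models = {
    "rule_based": {"fp_per_m": 9600, "fn_per_m": 2600, "infra_per_m": 25},
    "gbdt": {"fp_per_m": 4800, "fn_per_m": 2400, "infra_per_m": 60},
}


def monthly_costs(m):
    """Return (error-related cost, infrastructure cost) in USD per month."""
    error_cost = MONTHLY_MILLIONS * (m["fp_per_m"] * FP_COST + m["fn_per_m"] * FN_COST)
    infra_cost = MONTHLY_MILLIONS * m["infra_per_m"]
    return error_cost, infra_cost


rule_error, rule_infra = monthly_costs(models["rule_based"])
gbdt_error, gbdt_infra = monthly_costs(models["gbdt"])

error_savings = rule_error - gbdt_error   # reduction in error-related cost
extra_infra = gbdt_infra - rule_infra     # additional infrastructure spend
net_savings = error_savings - extra_infra

print(error_savings, extra_infra, net_savings)
```

With these inputs the error-related cost falls from USD 2 520 000 to USD 1 920 000 per month, while infrastructure spend rises from USD 1 000 to USD 2 400, for a net monthly saving of USD 598 600.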

Question 13 of 20

You are tuning a logistic-regression fraud detector trained on 455 000 real and 5 000 fraudulent transactions (≈ 1 % positives). A baseline model built on the imbalanced data yields an average F1 of 0.12 under stratified 5-fold cross-validation (CV). You then apply random oversampling so that the training split is 50 / 50 positive-to-negative, keeping the validation folds untouched. After retraining, you observe:

  • Training-set F1: 0.93
  • Cross-validated F1: 0.10

Which explanation best accounts for the drop in CV performance despite the much higher training score?

  • Duplicating the same minority transactions through random oversampling caused the model to overfit to those repeats, inflating training F1 but hurting generalization.

  • Oversampling should always lower variance, so the CV drop indicates target leakage between your folds rather than any overfitting problem.

  • The oversampler injected label noise that increases model bias; therefore training F1 should have fallen, so the discrepancy must come from a metric-calculation error.

  • Oversampling only shifts the decision threshold without affecting learned parameters; the lower CV F1 is expected until you retune the threshold.

Question 14 of 20

A data science team deployed a gradient-boosted model to detect fraudulent credit card transactions. The model, trained on historical data from the previous year, achieved a 95% F1-score during validation. After six months in production, monitoring systems indicate a drop in the F1-score to 78%, accompanied by a significant increase in false negatives. Analysis of the live inference data reveals that the statistical distribution of features like 'transaction amount' and 'time of day' has shifted compared to the original training dataset. However, the fundamental patterns defining a fraudulent transaction are believed to be unchanged.

Which of the following best identifies the primary cause of the model's performance degradation and the most appropriate initial action?

  • The model is experiencing concept drift. The team should perform extensive hyperparameter tuning on the existing model architecture to adapt to the new fraud patterns.

  • The original model was overfitted to the training data. The best course of action is to simplify the model by reducing its complexity and then redeploying.

  • The performance drop is likely due to multicollinearity in the new data. The team should focus on advanced feature engineering to create new, uncorrelated variables.

  • The model is experiencing data drift. The most appropriate initial action is to retrain the model using a more recent dataset that includes the last six months of production data.

Question 15 of 20

You are developing a regression model to forecast the next-quarter energy usage of a large manufacturing plant. The training set has 20 000 rows and roughly 400 engineered features from industrial sensors, many of which are highly correlated. An ordinary least-squares model overfits and shows high validation error. The stakeholders insist on a linear model that (1) applies coefficient shrinkage to reduce variance, (2) can drive some coefficients exactly to zero to eliminate redundant sensors, and (3) remains stable in the presence of strongly correlated predictors. Which regressor best satisfies all of these requirements?

  • Elastic Net regression

  • LASSO regression

  • Decision tree regressor

  • Ridge regression

Question 16 of 20

A movie-streaming provider keeps a 1-5 star rating matrix and wants to build a user-based, similarity-based recommender. Some customers are "tough graders" who rarely rate above three stars, while others routinely give four or five stars even to average titles. To make sure that neighbor selection reflects relative preferences rather than each customer's personal rating scale, which similarity measure should the data scientist choose when constructing the user-user similarity matrix?

  • Euclidean distance between raw rating vectors

  • Pearson correlation coefficient computed on co-rated items

  • Cosine similarity applied to the raw rating vectors

  • Jaccard similarity on the sets of movies each user has rated
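
The effect of personal rating scales on a similarity measure can be seen with two hypothetical users whose preferences are identical but whose ratings are offset by two stars. The sketch below uses plain Python; Pearson correlation is computed as the cosine similarity of mean-centered vectors, and the rating vectors are made up for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def pearson(a, b):
    """Pearson correlation: cosine similarity after subtracting each user's mean."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])


# Ratings on the same four co-rated movies (hypothetical values).
tough_grader = [1, 2, 3, 1]
generous_grader = [3, 4, 5, 3]   # same ordering of preferences, shifted up by 2

raw_cosine = cosine(tough_grader, generous_grader)
centered = pearson(tough_grader, generous_grader)
print(round(raw_cosine, 4), round(centered, 4))
```

Mean-centering removes each user's baseline, so the two users come out as perfectly correlated, whereas the raw-vector cosine is pulled below 1 by the difference in scale.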

Question 17 of 20

A data scientist is developing a linear regression model to predict the annual income of individuals based on several predictor variables, including years of experience. A preliminary analysis of the target variable, Annual_Income, reveals that its distribution is strongly right-skewed. Furthermore, after fitting an initial model, an examination of the residual vs. fitted values plot shows a distinct cone shape, where the variance of the residuals increases as the predicted income increases. Which of the following data transformation techniques is the most direct and appropriate method to address both the right-skewness and the observed heteroscedasticity in this scenario?

  • Apply an exponential transformation to the Annual_Income variable.

  • Standardize both the target variable and the predictor variables.

  • Apply a logarithmic transformation to the Annual_Income variable.

  • Apply a Box-Cox transformation to the Annual_Income variable.
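
As a rough illustration of how a variance-stabilizing transformation changes a right-skewed variable, the sketch below compares sample skewness before and after taking logarithms. It is plain Python; the income values are synthetic (a geometric sequence chosen so the effect is easy to see), and only the skewness aspect is demonstrated, not the residual plot.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Synthetic incomes growing multiplicatively (hypothetical data).
incomes = [20_000 * 1.5 ** k for k in range(10)]
log_incomes = [math.log(v) for v in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)
print(round(raw_skew, 3), round(log_skew, 3))
```

On this synthetic data the raw values are clearly right-skewed while the logged values are symmetric. A Box-Cox transformation with its parameter fixed at zero reduces to the same logarithm; with an estimated parameter it selects from a wider power family.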

Question 18 of 20

Which situation best satisfies the Missing at Random assumption and therefore allows standard multiple-imputation methods that rely on MAR to yield unbiased estimates?

  • Fasting triglyceride measurements are missing more often for study participants who are under 18 years old, and every participant's age is fully recorded in the dataset.

  • A wearable fitness tracker's heart-rate sensor occasionally loses connection because of random Bluetooth interference, producing gaps unrelated to any user characteristics or physiology.

  • At a diabetes clinic, laboratory staff sometimes leave the blood-glucose field blank when the measured value exceeds 400 mg/dL and triggers an outlier warning.

  • In an anonymous salary survey, respondents earning very low or very high incomes are less likely to disclose their pay, and no other collected variable predicts this behavior.

Question 19 of 20

A data scientist is preparing to build a predictive model and needs to validate a critical assumption for several linear regression techniques: the normality of the model's residuals. After fitting an initial model, the residuals have been extracted. Which of the following visualization methods is the most precise for graphically assessing whether the residuals conform to a normal distribution?

  • Density plot

  • Quantile-Quantile (Q-Q) plot

  • Histogram with a normal distribution overlay

  • Box and whisker plot

Question 20 of 20

While exploring a 2-dimensional dataset that contains two spatial clusters (one very dense and one much sparser), a data scientist tries to find a single (eps, minPts) setting in DBSCAN that will correctly identify both clusters. Every time she preserves the dense cluster, the sparse cluster is either merged into it or labeled as noise, and whenever she isolates the sparse cluster, the dense cluster fragments. Which underlying property of DBSCAN most directly causes this limitation?

  • DBSCAN requires the user to specify the exact number of clusters beforehand; supplying the wrong number causes clusters to fragment or merge.

  • DBSCAN assigns points to clusters by minimizing within-cluster sum of squared errors (SSE), which biases it toward clusters of uniform density.

  • DBSCAN relies on a single global density threshold (eps) that applies to every point, so it cannot accommodate clusters with markedly different densities.

  • DBSCAN assumes that all features are statistically independent and identically distributed, so clusters of varying density violate this assumption.