
CompTIA DataX Practice Test (DY0-001)

Use the form below to configure your CompTIA DataX Practice Test (DY0-001). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

  • Questions: the number of questions in the practice test (free users are limited to 20 questions; upgrade for unlimited)
  • Seconds Per Question: determines how long you have to finish the practice test
  • Exam Objectives: which exam objectives should be included in the practice test

CompTIA DataX DY0-001 (V1) Information

CompTIA DataX is an expert-level, vendor-neutral certification aimed at deeply experienced data science professionals. Launched on July 25, 2024, the exam verifies advanced competencies across the full data science lifecycle, from mathematical modeling and machine learning to deployment and specialized applications such as NLP, computer vision, and anomaly detection.

The exam comprehensively covers five key domains:

  • Mathematics and Statistics (~17%)
  • Modeling, Analysis, and Outcomes (~24%)
  • Machine Learning (~24%)
  • Operations and Processes (~22%)
  • Specialized Applications of Data Science (~13%)

It includes a mix of multiple‑choice and performance‑based questions (PBQs), simulating real-world tasks like interpreting data pipelines or optimizing machine learning workflows. The duration is 165 minutes, with a maximum of 90 questions. Scoring is pass/fail only, with no scaled score reported.

Free CompTIA DataX DY0-001 (V1) Practice Test

Press start when you are ready, or press Change to modify any settings for the practice test.

  • Questions: 20
  • Time: Unlimited
  • Included Topics:
    Mathematics and Statistics
    Modeling, Analysis, and Outcomes
    Machine Learning
    Operations and Processes
    Specialized Applications of Data Science
Question 1 of 20

A data scientist is building a multiple linear regression model to predict housing prices. The initial model, using only the living area in square feet as a predictor, yields an R-squared value of 0.65. To improve the model, the data scientist adds ten additional predictor variables, including number of bedrooms, number of bathrooms, and age of the house. The new model results in an R-squared value of 0.78. Which of the following is the most critical consideration for the data scientist when interpreting this increase in R-squared?

  • The increase from 0.65 to 0.78 definitively proves that the additional variables have strong predictive power and the new model is superior.

  • The new R-squared value is high, which invalidates the p-values of the individual coefficients in the model.

  • The R-squared value will almost always increase when more predictors are added to the model, regardless of their actual significance, potentially leading to overfitting.

  • An R-squared of 0.78 indicates that 78% of the model's predictions for house prices will be correct.
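
For context on the concept being tested, adjusted R-squared applies the penalty for extra predictors that plain R-squared lacks. A minimal sketch using the values from the question and a hypothetical sample size of n = 200 (the question does not state one):

    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    def adjusted_r2(r2, n, p):
        """Penalize R^2 for the number of predictors p, given n observations."""
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)

    n = 200  # hypothetical sample size
    print(adjusted_r2(0.65, n, p=1))   # single predictor (living area)
    print(adjusted_r2(0.78, n, p=11))  # after adding ten more predictors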

Question 2 of 20

A data science team is developing an automated ingestion pipeline for customer feedback data provided as CSV files. The pipeline frequently fails due to parsing errors, specifically when feedback text contains commas or line breaks. Although the text fields are enclosed in double quotes as per convention, the parser still misinterprets the data structure. Which of the following is the most likely underlying cause of this data ingestion problem?

  • The ingestion pipeline is attempting to infer a data schema, and the presence of mixed data types is causing type-casting failures.

  • The CSV files are being saved with a UTF-8 byte-order mark (BOM) that the ingestion script cannot interpret.

  • The CSV files contain unescaped double quotes within data fields that are also enclosed in double quotes.

  • The data provider is using a regional-specific delimiter, such as a semicolon, instead of a comma.
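
To illustrate the quoting convention at issue, here is a minimal sketch with Python's csv module: under RFC 4180, a double quote inside a quoted field must be escaped by doubling it, which csv.writer does automatically and csv.reader undoes when parsing.

    import csv, io

    buf = io.StringIO()
    csv.writer(buf).writerow(
        ["cust-42", 'Feedback: "great" product,\nwill buy again']
    )
    print(buf.getvalue())   # the embedded quotes are doubled: ""great""

    # Reading the properly escaped record recovers the original two fields,
    # including the comma and the line break inside the quoted text.
    print(list(csv.reader(io.StringIO(buf.getvalue())))[0])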

Question 3 of 20

A data science team at an e-commerce company is tasked with measuring the success of a new customer loyalty program. The primary business objective is to 'significantly boost repeat business profitability'. After an initial discovery phase, the team is tracking several data points. Which of the following represents the most effective Key Performance Indicator (KPI) for the stated business objective?

  • The total number of weekly transactions made by loyalty program members.

  • The ratio of active loyalty program members to total registered users.

  • The month-over-month growth rate of new sign-ups for the loyalty program.

  • A 15% increase in the average Customer Lifetime Value (CLV) for loyalty program members over the next fiscal year.

Question 4 of 20

A machine learning engineer is training a deep neural network. The process involves a forward pass to generate predictions, a loss function to quantify error, and a backward pass to learn from that error. Within this training loop, what is the primary computational contribution of the backpropagation algorithm itself?

  • To apply an optimization rule, such as momentum or Adam, to update the network's parameters.

  • To normalize the activations of hidden layers to ensure a stable distribution of inputs during training.

  • To determine the initial error value by comparing the network's final output with the ground-truth labels.

  • To efficiently calculate the gradient of the loss function with respect to every weight and bias in the network.
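
A minimal PyTorch sketch (the framework is an assumption; the question names none) that separates the gradient computation performed by backpropagation from the parameter update applied by the optimizer:

    import torch

    model = torch.nn.Linear(4, 1)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x, y = torch.randn(8, 4), torch.randn(8, 1)

    pred = model(x)                                # forward pass
    loss = torch.nn.functional.mse_loss(pred, y)   # loss quantifies the error
    loss.backward()    # backpropagation: fills each parameter's .grad with dLoss/dparam
    opt.step()         # the optimizer rule (here Adam) then uses those gradients
    opt.zero_grad()    # clear gradients before the next iteration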

Question 5 of 20

You have inherited a helper function named standardize_df() in a production feature-engineering library. The function must (1) subtract the mean and divide by the population standard deviation for every numeric column and (2) leave any column whose standard deviation is exactly zero unchanged to avoid division-by-zero problems. You are charged with adding a single PyTest unit test that delivers the strongest regression-catching power while still following unit-testing best practices (deterministic data, small scope, Arrange-Act-Assert structure, no unnecessary external libraries). Which test design best satisfies these requirements?

  • Apply scikit-learn's StandardScaler to a different DataFrame and assert that its output equals the output of standardize_df.

  • Generate 10 000 random rows, call standardize_df, and assert only that the output DataFrame has the same shape as the input.

  • Set numpy.random.seed(0) inside the test and simply check that standardize_df executes without raising an exception.

  • Construct a small DataFrame with one constant and one varying numeric column, run standardize_df, then assert with pytest.approx that the varying column now has mean 0 and std 1 and that the constant column is identical to the original.
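
A minimal sketch of the test design described in the last option, with a simplified stand-in for the hypothetical standardize_df helper so the example is self-contained:

    import pandas as pd
    import pytest

    def standardize_df(df):  # simplified stand-in for the library function under test
        out = df.copy()
        for col in out.select_dtypes("number"):
            std = out[col].std(ddof=0)             # population standard deviation
            if std != 0:
                out[col] = (out[col] - out[col].mean()) / std
        return out

    def test_standardize_df_varying_and_constant_columns():
        # Arrange: one varying and one constant numeric column, deterministic values
        df = pd.DataFrame({"varying": [1.0, 2.0, 3.0, 4.0], "constant": [5.0] * 4})
        # Act
        result = standardize_df(df)
        # Assert
        assert result["varying"].mean() == pytest.approx(0.0)
        assert result["varying"].std(ddof=0) == pytest.approx(1.0)
        assert result["constant"].equals(df["constant"])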

Question 6 of 20

A data scientist wants to report a two-sided 95% confidence interval for the true population Pearson correlation between two numerical features. In a random sample of n = 60 observations, the sample correlation is r = 0.58. To use standard normal critical values, which pre-processing step should be applied to the correlation estimate before constructing the confidence interval?

  • Transform r with Fisher's inverse hyperbolic tangent (z-transformation), build the interval in the transformed space, then back-transform the interval's endpoints.

  • Multiply r by √(n−2)/√(1−r²) and treat the result as standard normal when forming the interval.

  • Use a Box-Cox transformation on each variable so that the resulting correlation can be treated as normally distributed.

  • Apply the Wilson score method directly to r to obtain the interval.
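
A worked sketch of the Fisher z approach using the numbers from the question:

    import numpy as np
    from scipy.stats import norm

    r, n = 0.58, 60
    z = np.arctanh(r)             # Fisher z-transformation of the sample correlation
    se = 1 / np.sqrt(n - 3)       # standard error in the transformed space
    zcrit = norm.ppf(0.975)       # two-sided 95% critical value
    lo, hi = np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)   # back-transform endpoints
    print(round(lo, 3), round(hi, 3))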

Question 7 of 20

A data science team is tasked with determining if a new, computationally intensive recommendation algorithm causes a statistically significant increase in user engagement compared to the current algorithm. To generate the data needed for this analysis, the team plans to deploy the new algorithm to a segment of users. Which of the following is the most critical component of the experimental design to ensure the resulting data can be used to infer causality?

  • Randomly assigning users to either the new algorithm (treatment group) or the existing algorithm (control group).

  • Formulating a precise null hypothesis and an alternative hypothesis with a defined p-value threshold for significance.

  • Selecting the most active users for the new algorithm's group to maximize the potential observable impact.

  • Implementing detailed logging to capture all user interactions with the recommendations, such as clicks and hover time.
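
A minimal sketch of the randomization step, assuming a hypothetical list of user IDs:

    import numpy as np

    rng = np.random.default_rng(seed=7)
    user_ids = np.arange(10_000)      # hypothetical user population
    assignment = rng.choice(["treatment", "control"], size=user_ids.size)
    # Random assignment balances known and unknown confounders in expectation,
    # which is what allows a causal reading of any difference in engagement.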

Question 8 of 20

During a model audit, you examine the first convolutional layer of an image-classification network. The layer receives a 128×128×3 input and applies 64 kernels of size 5×5 with stride 1 and "same" padding so that the spatial resolution of the output remains 128×128. Bias terms are present (one per kernel), but you must report only the number of trainable weights excluding biases in this layer. How many weights does the layer contain?

  • 78 643 200

  • 9 600

  • 4 800

  • 1 600
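
A sketch of the weight count, both by hand and via a PyTorch layer (the framework is assumed for illustration only):

    import torch.nn as nn

    # Each kernel spans 5 x 5 spatial positions across 3 input channels; there are 64 kernels.
    weights = 5 * 5 * 3 * 64
    conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=5, stride=1, padding="same")
    assert conv.weight.numel() == weights   # excludes the 64 bias terms
    print(weights, conv.bias.numel())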

Question 9 of 20

You have just finished training a logistic-regression model that flags potentially fraudulent B2B invoices. For next week's 10-minute board meeting, the CFO wants one slide that instantly shows how many legitimate invoices would be held for manual review (false positives) and how many fraudulent invoices the model might miss (false negatives). Several board members are color-blind and have little time for technical explanations. Which visualization and design choice will best satisfy these communication requirements?

  • An annotated confusion-matrix heatmap that uses a color-blind-safe blue-orange palette and displays the four cell counts in large text.

  • A scatter plot of predicted fraud probability versus invoice amount, colored by the model's predicted class labels.

  • A 3-D stacked pie chart that uses red and green slices to depict true positives, false positives, true negatives, and false negatives.

  • An ROC curve showing the area under the curve (AUC) with an interactive threshold slider.
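
A minimal sketch of the annotated heatmap idea using scikit-learn and matplotlib, with illustrative labels and a single-hue sequential palette that stays readable for color-blind viewers:

    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    y_true = [0, 0, 1, 1, 0, 1, 0, 0]    # illustrative labels (1 = fraudulent)
    y_pred = [0, 1, 1, 0, 0, 1, 0, 0]
    ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred,
        display_labels=["Legitimate", "Fraudulent"],
        cmap="Blues",          # sequential palette, color-blind friendly
        values_format="d",     # show the four raw counts in the cells
    )
    plt.show()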

Question 10 of 20

A data scientist is developing a model to classify product images into 150 distinct categories. The chosen base algorithm is a Support Vector Machine (SVM), which is inherently a binary classifier. The development team is operating under significant computational resource constraints, making training time a primary concern. Which multiclass classification strategy is the most appropriate choice for adapting the SVM model in this scenario?

  • One-vs-Rest (OvR)

  • One-vs-One (OvO)

  • Error-Correcting Output Codes (ECOC)

  • Multinomial Logistic Regression
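
A sketch comparing how many binary SVMs each strategy must train for K = 150 classes, alongside the corresponding scikit-learn wrappers:

    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import LinearSVC

    K = 150
    print("OvR binary models:", K)                  # one classifier per class
    print("OvO binary models:", K * (K - 1) // 2)   # 11,175 pairwise classifiers

    ovr = OneVsRestClassifier(LinearSVC())          # trains K binary SVMs
    ovo = OneVsOneClassifier(LinearSVC())           # trains K*(K-1)/2 binary SVMs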

Question 11 of 20

A data science team is designing a data lake architecture on a distributed file system to store terabytes of structured event data for analytical querying. The primary use case involves running complex, read-heavy queries for feature engineering, which frequently select a small subset of columns from a wide table containing over 200 columns. The system must also support schema evolution as new event properties are added over time. Given these requirements, which data format is the most appropriate for storing the processed data in the data lake to optimize query performance and storage efficiency?

  • CSV

  • Avro

  • Parquet

  • JSON
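
A minimal pandas sketch (assuming the pyarrow engine is installed) of the column-pruned reads that favor a columnar format for this workload:

    import pandas as pd

    events = pd.DataFrame({"user_id": [1, 2], "event_type": ["click", "view"],
                           "ts": ["2024-01-01", "2024-01-02"]})   # stand-in for a wide table
    events.to_parquet("events.parquet")     # columnar, compressed on-disk layout

    # Only the requested columns are read from disk, unlike row-oriented CSV or JSON.
    subset = pd.read_parquet("events.parquet", columns=["user_id", "event_type"])
    print(subset)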

Question 12 of 20

Your data science team must release a patient readmission data set to external researchers. The file currently contains direct identifiers (patient name, Social Security number) and quasi-identifiers (full date of birth, 5-digit ZIP code). Compliance requires that (a) the released data no longer be considered protected health information (PHI) under the HIPAA Privacy Rule, so no patient authorization is needed, and (b) age and regional patterns remain analytically useful. Which preprocessing approach best meets both requirements?

  • Encrypt the entire data set with AES-256 and provide researchers with the decryption key after they sign a data-use agreement.

  • Replace each name and Social Security number with a random UUID but keep the full date of birth and 5-digit ZIP code intact.

  • Mask all but the last four digits of each Social Security number and hash patient names with SHA-256 while leaving other fields unchanged.

  • Remove the 18 identifiers listed in 45 CFR §164.514(b)(2), convert dates of birth to age in years, and truncate ZIP codes to their first three digits when the corresponding area exceeds 20,000 residents.
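
A simplified pandas sketch of the generalizations described in the last option; the column names are illustrative, and the low-population ZIP3 check is shown with a placeholder set rather than the actual census-derived list:

    import pandas as pd

    df = pd.DataFrame({"name": ["A. Patient"], "ssn": ["123-45-6789"],
                       "dob": ["1980-06-15"], "zip": ["30301"]})

    df = df.drop(columns=["name", "ssn"])                    # remove direct identifiers
    today = pd.Timestamp("2024-07-01")
    df["age_years"] = (today - pd.to_datetime(df["dob"])).dt.days // 365   # approximate age
    RESTRICTED_ZIP3 = {"036", "059"}    # placeholder; use the current low-population list
    zip3 = df["zip"].str[:3]
    df["zip3"] = zip3.where(~zip3.isin(RESTRICTED_ZIP3), "000")
    df = df.drop(columns=["dob", "zip"])
    print(df)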

Question 13 of 20

A data science team is developing a predictive model for equipment failure using a single, unpruned decision tree. During testing, they observe two phenomena:

  1. The model achieves near-perfect accuracy on the training dataset but performs poorly on the unseen validation dataset.
  2. Minor changes to the training data, such as removing a small number of data points, result in a drastically different tree structure and predictions.

Which underlying characteristic of decision trees is the primary cause of both of these observations?

  • High bias

  • Multicollinearity

  • High variance

  • The curse of dimensionality
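
A small scikit-learn sketch that reproduces both symptoms with an unpruned tree: near-perfect training accuracy alongside weaker validation accuracy, and predictions that change after dropping a handful of training rows:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=400, n_features=20, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)     # unpruned by default
    print(full.score(X_tr, y_tr), full.score(X_val, y_val))           # ~1.0 vs noticeably lower

    perturbed = DecisionTreeClassifier(random_state=0).fit(X_tr[10:], y_tr[10:])
    print((full.predict(X_val) != perturbed.predict(X_val)).mean())   # fraction of changed predictions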

Question 14 of 20

A quantitative analyst is modeling a company's monthly sales data from the last decade. A time series plot reveals a consistent upward trend, and an Augmented Dickey-Fuller (ADF) test confirms the series is non-stationary. The analyst plans to use an Autoregressive Integrated Moving Average (ARIMA) model for forecasting. To address the identified non-stationarity, which configuration choice for the ARIMA(p, d, q) model is the most critical first step?

  • Determine the autoregressive order, p, by examining the Partial Autocorrelation Function (PACF) of the original series.

  • Set the differencing order, d, to a value of 1 or more to make the series stationary.

  • Select the moving average order, q, by inspecting the Autocorrelation Function (ACF) of the original series.

  • Apply a Box-Cox transformation to the series to remove the stochastic trend.
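
A statsmodels sketch of the differencing decision on a hypothetical trending monthly series: test the original series, difference once (d = 1), re-test, then fit with that order:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA
    from statsmodels.tsa.stattools import adfuller

    rng = np.random.default_rng(0)
    sales = pd.Series(100 + 2 * np.arange(120) + rng.normal(0, 5, 120))   # upward trend

    print("ADF p-value, original:   ", adfuller(sales)[1])                  # typically > 0.05
    print("ADF p-value, differenced:", adfuller(sales.diff().dropna())[1])  # typically < 0.05

    model = ARIMA(sales, order=(1, 1, 1)).fit()    # d = 1 addresses the non-stationarity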

Question 15 of 20

You are deploying an orchestration platform to automate hourly data-ingestion pipelines. A six-hour network outage prevents the 03:00-08:00 runs from executing. When connectivity returns, the business wants the orchestrator to automatically create and execute pipeline instances for each missed hour in chronological order, with no manual triggers or code changes. Which built-in orchestration capability is specifically designed to satisfy this requirement?

  • Service-level-agreement (SLA) miss callbacks that fire alerts when a task runs too long

  • Branching operators that choose between alternative execution paths during the workflow

  • Catch-up (backfill) scheduling that automatically creates runs for every missed time interval

  • Dynamic task mapping that expands a single task into parallel subtasks at runtime
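
An Airflow 2.x-style sketch (one common orchestrator; the question does not name a product) in which catchup=True has the scheduler create a run for every missed hourly interval once it can execute again:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    with DAG(
        dag_id="hourly_ingestion",
        start_date=datetime(2024, 1, 1),
        schedule="@hourly",
        catchup=True,     # create runs for every interval missed during the outage
    ) as dag:
        ingest = EmptyOperator(task_id="ingest")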

Question 16 of 20

A data scientist is developing an ordinary least-squares model to predict daily revenue, a strictly positive continuous variable. The revenue distribution is highly right-skewed and, after an initial linear fit, the residual-versus-fitted plot shows a wedge-shaped pattern that widens as fitted values increase, indicating heteroscedasticity. The scientist needs a single data transformation on the response variable that (1) can stabilize the variance and approximate normality and (2) lets the optimal transformation be chosen from a continuum of power functions using maximum-likelihood estimation. Which transformation should be applied before refitting the model?

  • Standardize the revenue variable with a z-score (mean 0, standard deviation 1).

  • Rescale the revenue variable to the 0-1 range with min-max normalization.

  • Apply a Box-Cox power transformation to the revenue variable.

  • Take the natural logarithm (ln) of the revenue variable.
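
A SciPy sketch: boxcox fits the power parameter lambda by maximum likelihood and returns the transformed values, shown here on an illustrative right-skewed revenue sample:

    import numpy as np
    from scipy.stats import boxcox

    rng = np.random.default_rng(1)
    revenue = rng.lognormal(mean=8, sigma=0.9, size=500)   # strictly positive, right-skewed

    transformed, lam = boxcox(revenue)     # lambda chosen via maximum-likelihood estimation
    print("fitted lambda:", round(lam, 3))
    # A lambda near 0 behaves like a log transform; other values select other power functions.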

Question 17 of 20

A data scientist is building a logistic regression model to detect fraudulent financial transactions. The model uses four features: age, account_balance, number_of_monthly_transactions, and average_transaction_amount. An initial exploratory data analysis using box plots for each individual feature reveals no significant outliers. However, the model's performance is unexpectedly poor, and a residuals vs. leverage plot indicates that a few data points have an unusually high influence on the model's coefficients.

Given this scenario, which of the following methods is the MOST appropriate for identifying these influential, problematic data points?

  • Apply a Box-Cox transformation to each feature.

  • Calculate the Mahalanobis distance for each data point.

  • Implement an Isolation Forest algorithm on the dataset.

  • Generate a scatter plot matrix of all feature pairs.
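
A sketch of multivariate outlier scoring with the Mahalanobis distance, using simulated stand-ins for the four features:

    import numpy as np
    from scipy.spatial.distance import mahalanobis
    from scipy.stats import chi2

    rng = np.random.default_rng(2)
    X = rng.normal(size=(1000, 4))     # stand-in for age, balance, transaction count, amount

    mean = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.array([mahalanobis(row, mean, cov_inv) for row in X])

    # Points far from the multivariate center are flagged even when no single
    # feature looks unusual on a univariate box plot.
    cutoff = np.sqrt(chi2.ppf(0.999, df=X.shape[1]))
    print("flagged:", int((d > cutoff).sum()))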

Question 18 of 20

While tuning a CNN that classifies photographs of industrial defects, you observe that validation accuracy drops sharply whenever the defect is partly hidden by a worker's hand or tool, even though training loss remains low. You decide to add a masking-based data-augmentation step that follows the Random Erasing technique. Which configuration is MOST likely to increase robustness to partial occlusion without altering the ground-truth class labels?

  • Add a binary channel that records the location of an arbitrary mask and train the model to reconstruct the hidden pixels as a secondary objective.

  • At each forward pass, randomly zero-out a comparable fraction of convolutional filters in the network's first layer to simulate information loss.

  • With a fixed probability, overwrite one randomly located rectangular region covering roughly 2-20% of every training image with random pixel values (or the per-channel dataset mean) while keeping the original label.

  • Replace all background pixels that fall outside each annotated bounding box with a uniform gray mask so that only the foreground object remains visible.
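
A torchvision sketch of the Random Erasing configuration described in the third option (the probability, area fraction, and fill mode shown are illustrative):

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.ToTensor(),
        transforms.RandomErasing(
            p=0.5,                # apply to a fraction of training images
            scale=(0.02, 0.2),    # erase roughly 2-20% of the image area
            value="random",       # fill the rectangle with random pixel values
        ),
    ])
    # The class label is left untouched, so the network learns to classify despite occlusion.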

Question 19 of 20

A manufacturing company deployed a gradient-boosted model to predict bearing failures from streaming sensor data. Two weeks after a firmware update changed the calibration of the vibration sensors, the model's precision fell from 0.82 to 0.55 even though the proportion of actual failures in the field remained at 3.4%. Subsequent analysis shows that the mean and variance of multiple vibration-related features have shifted by more than two standard deviations, but the conditional relationship between those features and the failure label appears unchanged. Which phenomenon is the most likely root cause of the model's performance degradation?

  • Model over-fitting resulting from excessively high variance during initial training

  • Data drift (covariate shift) caused by the firmware-induced change in input feature distributions

  • Concept drift because the physical mechanism of bearing failure has evolved

  • Data leakage introduced by inadvertently training on target-related features
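
A minimal drift check comparing a vibration feature's reference distribution with post-update production values using a two-sample Kolmogorov-Smirnov test (feature and data are illustrative):

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(3)
    vibration_train = rng.normal(loc=0.0, scale=1.0, size=5000)   # training-time reference
    vibration_live = rng.normal(loc=2.5, scale=1.3, size=5000)    # after the firmware update

    stat, p_value = ks_2samp(vibration_train, vibration_live)
    print(f"KS statistic={stat:.3f}, p-value={p_value:.2e}")
    # A large statistic with a tiny p-value signals a shift in the input distribution
    # (covariate shift) even when the feature-to-label relationship is unchanged.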

Question 20 of 20

A machine-learning engineer must deploy a Docker-packaged real-time fraud-detection model. Traffic is usually near zero but can spike unpredictably to thousands of requests per second. The business wants to pay nothing while the service is idle yet keep end-to-end inference latency below 100 ms during spikes. Which cloud deployment approach best meets these requirements?

  • Deploy the container on a request-driven serverless container platform that supports automatic scale-to-zero (for example, Google Cloud Run or Azure Container Apps).

  • Host the container on a single, large bare-metal instance to eliminate virtualization overhead.

  • Provision a fleet of reserved virtual machines sized for the maximum anticipated peak load.

  • Run the model on a managed Kubernetes cluster with a fixed-size node pool sized for average traffic.