CompTIA DataX Practice Test (DY0-001)

CompTIA DataX DY0-001 (V1) Information

CompTIA DataX is an expert‑level, vendor‑neutral certification aimed at deeply experienced data science professionals. Launched on July 25, 2024, the exam verifies advanced competencies across the full data science lifecycle, from mathematical modeling and machine learning to deployment and specialized applications like NLP, computer vision, and anomaly detection.

The exam comprehensively covers five key domains:

  • Mathematics and Statistics (~17%)
  • Modeling, Analysis, and Outcomes (~24%)
  • Machine Learning (~24%)
  • Operations and Processes (~22%)
  • Specialized Applications of Data Science (~13%)

It includes a mix of multiple‑choice and performance‑based questions (PBQs), simulating real-world tasks like interpreting data pipelines or optimizing machine learning workflows. The duration is 165 minutes, with a maximum of 90 questions. Scoring is pass/fail only, with no scaled score reported.

Free CompTIA DataX DY0-001 (V1) Practice Test

  • 20 Questions
  • Unlimited time
  • Mathematics and Statistics
  • Modeling, Analysis, and Outcomes
  • Machine Learning
  • Operations and Processes
  • Specialized Applications of Data Science

Question 1 of 20

In the TF-IDF text-classification pipeline you are building for English-language restaurant reviews, the initial document-term matrix contains more than 150,000 unique tokens because words such as "run", "running", and "ran" are treated as separate features. You want to reduce this sparsity without accidentally conflating semantically different words like "universe" and "university". Which single text-preparation step best satisfies the requirement?

  • Switch to character-level tokenization so each character becomes a feature.

  • Apply part-of-speech-aware lemmatization to convert each token to its dictionary lemma.

  • Remove all stop words, including verbs and adjectives, before vectorization.

  • Run the Porter stemming algorithm to strip suffixes from every token.

Question 2 of 20

A data scientist is hand-coding the backward pass for a multi-class logistic regression model. For a logit vector z ∈ ℝᴷ the softmax function is defined as

σ_k(z) = exp(z_k) / \sum_{j=1}^{K} exp(z_j).

During backpropagation they must compute the Jacobian element ∂σ_k(z) / ∂z_i. Which of the following expressions is mathematically correct for this partial derivative (δ denotes the Kronecker delta)?

  • σ_k(z) (δ_{ki} − σ_i(z))

  • σ_k(z) (1 − σ_k(z))

  • σ_i(z) (δ_{ki} − σ_k(z))

  • δ_{ki} − σ_k(z) σ_i(z)
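
A candidate closed form for the softmax Jacobian can be checked numerically rather than taken on trust. The sketch below is a study aid only, not part of the exam material: it compares one candidate expression against a central-difference estimate. The function names (`softmax`, `numeric_jacobian`, `candidate`) and the sample logits are hypothetical choices made for illustration.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def numeric_jacobian(z, eps=1e-6):
    """Central-difference estimate of J[k][i] = d softmax_k / d z_i."""
    K = len(z)
    J = [[0.0] * K for _ in range(K)]
    for i in range(K):
        z_plus = list(z)
        z_plus[i] += eps
        z_minus = list(z)
        z_minus[i] -= eps
        s_plus, s_minus = softmax(z_plus), softmax(z_minus)
        for k in range(K):
            J[k][i] = (s_plus[k] - s_minus[k]) / (2 * eps)
    return J

def candidate(z, k, i):
    """Candidate closed form under test: s_k * (delta_ki - s_i)."""
    s = softmax(z)
    delta = 1.0 if k == i else 0.0
    return s[k] * (delta - s[i])

z = [1.0, -0.5, 2.0]  # arbitrary sample logits (assumption)
J = numeric_jacobian(z)
max_err = max(
    abs(J[k][i] - candidate(z, k, i))
    for k in range(len(z))
    for i in range(len(z))
)
print(f"max abs difference vs. finite differences: {max_err:.2e}")
```

Swapping a different expression into `candidate` and rerunning shows whether it still agrees with the numerical estimate.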

Question 3 of 20

A data science team is deploying a real-time fraud detection model for a financial institution. The system's architecture requires that the data used for model inference be perfectly consistent with the primary transactional database at all times. Any lag in data propagation could lead to significant financial loss. Given this strict requirement for data integrity and consistency, which data replication strategy is the most appropriate for the underlying feature store?

  • Synchronous replication, as it guarantees a write operation is committed to both the primary and replica before returning success, ensuring zero data loss (RPO=0) at the cost of higher write latency.

  • Asynchronous replication, as it provides the lowest write latency by acknowledging writes before they are committed to the replica, which is ideal for high-throughput systems.

  • Multi-master replication, as it allows writes to any node, maximizing availability and write performance across geographically distributed locations.

  • Snapshot replication, as it creates point-in-time copies of the data, which is ideal for versioning datasets for model reproducibility and retraining.

Question 4 of 20

A data science team is developing a fraud detection model using a Gradient Boosting Machine (GBM) on a large dataset with thousands of features. After training, the model achieves 99.8% accuracy on the training set but only 85% accuracy on a held-out validation set. The training loss is near zero, while the validation loss is substantially higher and was observed to increase after a certain number of boosting rounds. Given this significant performance gap, which of the following BEST describes the phenomenon the model is exhibiting and the most effective initial step to address it?

  • The model is overfitting to the training data. The most effective initial step is to apply regularization techniques, such as increasing the reg_lambda or reg_alpha hyperparameters, or to reduce the complexity of the model by limiting the maximum tree depth.

  • The model is underfitting the data. The best course of action is to increase the model's complexity by adding more estimators (trees) or allowing for deeper trees to better capture the data's patterns.

  • The validation set is exhibiting concept drift. The team should acquire more recent data for validation and consider implementing a drift detection mechanism before retraining.

  • The model is suffering from data leakage. The team should re-evaluate the feature engineering and data splitting process to ensure a strict separation of data before any transformations are applied.

Question 5 of 20

A data science team is developing a model for real-time fraud detection, which will be deployed in a low-latency environment. The training data is known to be highly imbalanced. During the model selection phase, the team conducts a thorough literature review. What should be the primary focus of this literature review to ensure the selection of an appropriate initial model?

  • To find publicly available datasets that can be used to augment the team's proprietary data.

  • To identify model architectures and feature engineering techniques that have proven effective for problems with similar constraints.

  • To select a set of optimal hyperparameters for a predetermined model like XGBoost.

  • To establish a definitive performance benchmark by averaging the reported F1-scores from published papers.

Question 6 of 20

A machine learning engineer is training a deep neural network. The process involves a forward pass to generate predictions, a loss function to quantify error, and a backward pass to learn from that error. Within this training loop, what is the primary computational contribution of the backpropagation algorithm itself?

  • To normalize the activations of hidden layers to ensure a stable distribution of inputs during training.

  • To efficiently calculate the gradient of the loss function with respect to every weight and bias in the network.

  • To apply an optimization rule, such as momentum or Adam, to update the network's parameters.

  • To determine the initial error value by comparing the network's final output with the ground-truth labels.

Question 7 of 20

A data scientist at a high-volume semiconductor manufacturing plant is responsible for monitoring a critical etching process. They use hypothesis testing to decide if the process has deviated from its specifications. The null hypothesis (H0) is that the process is operating within specification, while the alternative hypothesis (H1) is that it has deviated. A deviation could result in producing millions of faulty microchips. To minimize production interruptions from false alarms, the team has chosen a very low significance level (alpha), such as 0.001, for their control tests.

Which of the following statements best describes the primary risk associated with this statistical strategy?

  • By lowering the significance level, the team decreases the probability of a Type I error (falsely concluding the process has deviated), but this simultaneously increases the probability of a Type II error, elevating the risk of not detecting a true deviation and consequently shipping defective products.

  • A low significance level increases the statistical power (1 - β) of the test, thereby reducing the probability of both Type I and Type II errors simultaneously.

  • This strategy correctly minimizes the most critical risk, which is the Type II error (failing to detect a deviation), by making the test more sensitive to any anomalies.

  • The chosen alpha level directly minimizes the risk of shipping defective products by ensuring that the process is only stopped for statistically significant, genuine deviations.
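
The relationship between the significance level and the two error types can be explored with a small calculation. This is a minimal sketch assuming a one-sided z-test with known unit variance; the effect size of 0.5 standard deviations and the sample size of 25 are made-up illustration values, not figures from the scenario.

```python
from statistics import NormalDist

def type_ii_error(alpha, effect_size, n):
    """Return beta for a one-sided z-test with known unit variance.

    effect_size is the true mean shift in standard deviations
    (a hypothetical value chosen for illustration).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Under H1 the test statistic is distributed N(effect_size * sqrt(n), 1),
    # so beta is the probability that it stays below the critical value.
    return std_normal.cdf(z_crit - effect_size * n ** 0.5)

for alpha in (0.05, 0.01, 0.001):
    beta = type_ii_error(alpha, effect_size=0.5, n=25)
    print(f"alpha={alpha:<6} beta={beta:.3f} power={1 - beta:.3f}")
```

Holding the sample size and effect size fixed, the loop prints how beta and power move as alpha is tightened.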

Question 8 of 20

A data scientist is developing a text classification model using a large corpus of over one million documents. They have generated TF-IDF feature vectors, resulting in a document-term matrix with more than 200,000 unique terms (features). When training a k-Nearest Neighbors (k-NN) classifier on these high-dimensional, sparse vectors, they observe two primary issues: extremely long training times and poor predictive accuracy. Which of the following strategies provides the most effective solution to address both the computational inefficiency and the model performance problem?

  • Augment the feature set by including bigrams and trigrams from the text corpus.

  • Apply Truncated SVD to the feature matrix to reduce its dimensionality.

  • Convert the TF-IDF matrix into a Compressed Sparse Row (CSR) format.

  • Standardize the feature vectors using a StandardScaler to have zero mean and unit variance.

Question 9 of 20

A data science team at a financial institution is architecting a system for regulatory compliance reporting. The system must guarantee transactional atomicity, consistency, isolation, and durability (ACID). It also requires strict schema-on-write enforcement for data integrity and auditability, while providing optimized performance for predefined, complex analytical queries. Given these critical requirements, which storage concept is the most appropriate choice?

  • A data lake with Parquet files.

  • A document-oriented NoSQL database.

  • A relational database management system (RDBMS).

  • A key-value store.

Question 10 of 20

A data scientist is comparing two binary classification models, Model A and Model B, for a credit default prediction task. Model A achieves an Area Under the Curve (AUC) of 0.85, while Model B achieves an AUC of 0.82. A detailed analysis of their Receiver Operating Characteristic (ROC) curves reveals that Model B's curve is positioned above Model A's curve for all False Positive Rate (FPR) values below 0.2. Conversely, Model A's curve is superior for all FPR values above 0.2. The primary business requirement is to select a model that performs best while maintaining a very low rate of incorrectly flagging creditworthy customers as high-risk, specifically keeping the FPR under 0.2. Given this constraint, which model should be recommended and why?

  • Model A, because a higher AUC guarantees a lower number of total misclassifications regardless of the chosen threshold.

  • Neither model, as a different metric like Precision-Recall AUC should be used since the AUC values are too close to make a definitive decision.

  • Model B, because it has a higher True Positive Rate (TPR) for the acceptable range of False Positive Rate (FPR) defined by the business constraint.

  • Model A, because its overall Area Under the Curve (AUC) is higher, indicating superior performance across all classification thresholds.

Question 11 of 20

A data scientist is developing a linear regression model to predict quarterly sales for a large retail chain. The features include advertising spend, number of promotional events, and several macroeconomic indicators like GDP growth rate, unemployment rate, and the consumer price index (CPI). During model diagnostics, the data scientist observes that the p-values for the macroeconomic indicators are high, and their coefficients are highly sensitive to the inclusion or exclusion of other variables. Furthermore, some coefficients have signs that contradict established economic principles. Which of the following data issues is the most probable cause of these specific observations?

  • Sparse data

  • Multicollinearity

  • Non-stationarity

  • Seasonality

Question 12 of 20

A data science team is gathering requirements for a new predictive‐maintenance model that will be deployed on factory equipment. One requirement is to capture the model's relevant range of application. Which item best fulfills this specific requirement?

  • Document the acceptable boundaries for each input variable and the operating conditions under which model performance targets are guaranteed.

  • List the database tables and columns that the ingestion pipeline must extract for model training and retraining.

  • Estimate the yearly cloud compute costs required to run the batch-scoring service at peak load.

  • Provide the RACI matrix that defines who reviews, approves, and deploys code changes to the model.

Question 13 of 20

A data scientist is designing a strategy for a sequential decision-making problem, drawing inspiration from the principles of the 'one-armed bandit' problem. The goal is to maximize a cumulative reward over a series of trials. Which of the following represents the central dilemma that any effective bandit algorithm must navigate?

  • Ensuring the solution adheres to predefined budget and resource constraints using a linear solver.

  • Minimizing the risk of overfitting by applying regularization techniques to the reward function.

  • Reducing the dimensionality of the action space to decrease computational complexity.

  • Balancing the choice between continuing with the action that has yielded the highest observed reward so far (exploitation) and trying other actions to gather more information about their potential rewards (exploration).
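
For readers who want to experiment with bandit strategies, here is a minimal epsilon-greedy sketch. Epsilon-greedy is only one of several bandit algorithms and is used here as an assumption for illustration; the arm payout probabilities, the trial count, and the function name `epsilon_greedy` are all hypothetical.

```python
import random

def epsilon_greedy(true_probs, n_trials, epsilon, seed=0):
    """Simulate an epsilon-greedy agent on Bernoulli arms.

    true_probs are the hidden payout probabilities (made up for this sketch).
    Returns per-arm pull counts, estimated values, and the total reward.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running mean reward per arm
    total_reward = 0
    for _ in range(n_trials):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # try a random arm
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # best estimate so far
        reward = 1 if rng.random() < true_probs[arm] else 0
        counts[arm] += 1
        # Incremental update of the running mean for the chosen arm.
        values[arm] += (reward - values[arm]) / counts[arm]
        total_reward += reward
    return counts, values, total_reward

counts, values, total = epsilon_greedy([0.2, 0.5, 0.8], n_trials=1000, epsilon=0.1)
print("pulls per arm:", counts)
print("estimated values:", [round(v, 3) for v in values])
print("total reward:", total)
```

Rerunning with `epsilon=0.0` or `epsilon=1.0` gives a feel for what happens at either extreme.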

Question 14 of 20

A data science team is developing a classification model to predict customer churn based on several continuous features. A preliminary analysis, which included a Bartlett's test, reveals strong evidence that the covariance matrices of the 'churn' and 'no churn' classes are statistically different. The team is deciding between Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) for the final model.

Given this finding, which of the following statements provides the most accurate guidance for model selection?

  • QDA should be preferred because it models each class using its own distinct covariance matrix, making it suitable for data where classes do not share a common covariance structure.

  • QDA should be selected because it is a non-parametric method that can adapt to the differing class variances without making distributional assumptions.

  • Either model can be used interchangeably, provided the features are first transformed using Principal Component Analysis (PCA) to ensure the covariance matrices are equalized.

  • LDA should be preferred because it is less prone to overfitting than QDA, and its robustness will provide a more generalized model even when the covariance assumption is violated.

Question 15 of 20

During exploratory analysis of model residuals, you create a normal Q-Q plot. The plotted points form an S-shape: observations in the left tail fall below the 45-degree reference line, the middle portion stays near the line, and observations in the right tail rise above it. Which conclusion and next step most appropriately address this diagnostic?

  • The residuals exhibit heavy tails relative to a normal distribution; refit the model with a heavy-tailed or robust error distribution (e.g., Student-t).

  • The residuals have lighter tails than a normal distribution; a uniform or other thin-tailed distribution will suffice without further changes.

  • The residuals are left-skewed; square the residuals to symmetrize the distribution before re-estimating the model.

  • The residuals are right-skewed; apply a logarithmic transformation to reduce skewness before refitting the model.

Question 16 of 20

During training you notice that a deep multilayer perceptron that uses tanh(x) in every hidden layer begins to learn extremely slowly after the first few epochs. You suspect the gradients are vanishing as they are back-propagated. From a mathematical standpoint, which property of the tanh activation most directly explains why its use can drive gradients toward zero when neuron inputs have large magnitude?

  • Its first derivative equals x for |x| > 1, causing gradients to grow without bound and leading to exploding rather than vanishing gradients.

  • Its output range is strictly 0 to 1, so activations stay positive and bias the gradient toward zero.

  • Its first derivative is 1 − tanh²(x), which tends to zero as |x| becomes large, so back-propagated gradients are repeatedly attenuated.

  • Its second derivative is a constant 1, so there is no curvature change and gradients get stuck at saddle points instead of vanishing.
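
The behaviour of tanh's gradient at large input magnitudes can be inspected directly. The sketch below uses the standard identity d/dx tanh(x) = 1 − tanh²(x); the helper names and the sample inputs are hypothetical, and the chained product ignores weight matrices, so it is a simplification rather than a full backpropagation model.

```python
import math

def tanh_grad(x):
    """Derivative of tanh at x, using d/dx tanh(x) = 1 - tanh(x)**2."""
    return 1.0 - math.tanh(x) ** 2

def chained_gradient(pre_activations):
    """Product of tanh derivatives along a chain of layers.

    Weights are ignored (treated as 1), which is a simplifying assumption.
    """
    g = 1.0
    for x in pre_activations:
        g *= tanh_grad(x)
    return g

for x in (0.0, 1.0, 3.0, 6.0):
    print(f"x={x:>4}: tanh'(x) = {tanh_grad(x):.6f}")

print("10 layers, pre-activation 0.5 each:", chained_gradient([0.5] * 10))
print("10 layers, pre-activation 3.0 each:", chained_gradient([3.0] * 10))
```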

Question 17 of 20

A data architect at a major e-commerce company is designing an ingestion and storage solution for a new analytics platform. The platform will process high-velocity user clickstream data, which arrives as semi-structured JSON objects. The primary requirements are to support fast, complex analytical queries on specific columns while minimizing storage costs and providing data that is refreshed every few minutes. Which of the following approaches best meets all of these requirements?

  • Stream the incoming JSON data directly into a structured, relational database, normalizing the data into multiple tables.

  • Implement a real-time streaming pipeline that writes the raw, nested JSON data directly to object storage as individual files.

  • Ingest the data in micro-batches, converting the nested JSON into a flattened, columnar Parquet format for storage.

  • Set up a daily batch process to collect all clickstream events, flatten them, and store them as compressed CSV files.

Question 18 of 20

A data scientist is developing a multiple linear regression model using ordinary least squares (OLS). The feature matrix X is a 1000x15 matrix (1000 samples, 15 features). During model fitting, the process fails because the matrix (X^T * X) is singular and cannot be inverted. This problem indicates perfect multicollinearity among the features. What does this singularity imply about the rank of the feature matrix X?

  • The rank of X is equal to 15.

  • The rank of X is equal to 1000.

  • The rank of X is less than 15.

  • The rank of X is greater than 15.
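
The link between linearly dependent columns and the rank of X and X^T X can be seen on a toy matrix. This is a sketch under stated assumptions: the 4x3 matrix below is made up (its third column is the sum of the first two), and `matrix_rank` is a hypothetical helper that uses exact Gaussian elimination with fractions instead of a numerical library.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry here.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank

def gram(rows):
    """Compute X^T X for a matrix given as a list of rows."""
    cols = list(zip(*rows))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

# Toy feature matrix: column 3 = column 1 + column 2 (perfect collinearity).
X = [
    [1, 2, 3],
    [2, 0, 2],
    [0, 1, 1],
    [4, 1, 5],
]

print("columns:", len(X[0]))
print("rank(X):", matrix_rank(X))
print("rank(X^T X):", matrix_rank(gram(X)))
```

Replacing the third column with values that are not a combination of the other two changes both ranks, which is a quick way to see when X^T X becomes invertible.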

Question 19 of 20

A data scientist is analyzing a large, multi-site clinical trial dataset. During exploratory data analysis, it's discovered that a number of entries for the 'Resting Heart Rate' variable are missing. Which of the following scenarios provides the strongest evidence that the data for 'Resting Heart Rate' is Missing Completely At Random (MCAR)?

  • Patients who reported experiencing palpitations, a condition often correlated with high resting heart rates, were more likely to have their measurement postponed by clinicians, leading to missing entries.

  • Due to a software bug, the data collection application failed to save the 'Resting Heart Rate' entry for approximately 5% of patients. The failures occurred unpredictably across all clinical sites and demographic groups.

  • The study protocol allowed clinicians to skip the 'Resting Heart Rate' measurement for patients whose blood pressure was within a normal range, as it was deemed less critical. Blood pressure data is fully recorded for all patients.

  • A single data collection device at a high-volume urban clinic was found to be improperly calibrated. All readings from this device were flagged as invalid and subsequently removed during the data cleaning phase.

Question 20 of 20

A data science team has developed a high-accuracy, 32-bit floating-point (FP32) convolutional neural network (CNN) for a complex object detection task. The business requires this model to be deployed on a fleet of battery-powered aerial drones with significant constraints on processing power, memory, and energy consumption for real-time inference. Which of the following strategies is the most effective for adapting the model for this edge computing scenario while attempting to minimize accuracy loss?

  • Retraining the entire model from scratch using a higher learning rate and applying aggressive L2 regularization to reduce weight magnitudes.

  • Applying post-training quantization to convert model weights to 8-bit integers (INT8) and using structured pruning to remove entire redundant filters.

  • Implementing data augmentation through image rotation and scaling, and increasing the inference batch size to improve throughput.

  • Deploying the model to a high-performance cloud server and creating a REST API for the drones to send image data for remote inference.
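
To make the idea of INT8 quantization concrete, the sketch below applies symmetric per-tensor quantization to a handful of weights and measures the round-trip error. It is an illustration only: real toolchains add calibration, per-channel scales, and activation quantization, and the function names and sample weights here are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.0]  # made-up FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight now fits in one byte instead of four, at the cost of a rounding error bounded by roughly half the scale.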