CompTIA DataX Practice Test (DY0-001)
Use the form below to configure your CompTIA DataX Practice Test (DY0-001). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

CompTIA DataX DY0-001 (V1) Information
CompTIA DataX is an expert‑level, vendor‑neutral certification aimed at deeply experienced data science professionals. Launched on July 25, 2024, the exam verifies advanced competencies across the full data science lifecycle - from mathematical modeling and machine learning to deployment and specialized applications like NLP, computer vision, and anomaly detection.
The exam comprehensively covers five key domains:
- Mathematics and Statistics (~17%)
- Modeling, Analysis and Outcomes (~24%)
- Machine Learning (~24%)
- Operations and Processes (~22%)
- Specialized Applications of Data Science (~13%)
It includes a mix of multiple‑choice and performance‑based questions (PBQs), simulating real-world tasks like interpreting data pipelines or optimizing machine learning workflows. The duration is 165 minutes, with a maximum of 90 questions. Scoring is pass/fail only, with no scaled score reported.

Free CompTIA DataX DY0-001 (V1) Practice Test
- 20 Questions
- Unlimited time
- Domains: Mathematics and Statistics; Modeling, Analysis, and Outcomes; Machine Learning; Operations and Processes; Specialized Applications of Data Science
A data scientist is performing exploratory data analysis on a dataset of e-commerce transaction amounts. They generate a histogram to understand the distribution of the transaction values, which are continuous and highly right-skewed. The initial plot, created using the default settings of a popular data visualization library, shows nearly all the data points clustered into a single bar on the far left, with a few other bars sparsely populated to the right. Which of the following is the most effective next step to improve the visualization and gain a clearer understanding of the data's distribution?
Replace the histogram with a box and whisker plot to better visualize the median and interquartile range.
Switch to a density plot, as histograms are not suitable for visualizing skewed continuous data.
Adjust the binning strategy by experimenting with different bin widths or applying a rule like the Freedman-Diaconis rule.
Increase the number of bins to the maximum allowable value to ensure maximum granularity.
Answer Description
The correct answer is to experiment with different bin widths or apply a robust binning rule such as the Freedman-Diaconis rule, which bases bin width on the interquartile range and is therefore less distorted by skew and outliers. In a histogram, the way data are grouped into bins is critical for interpretation. With highly skewed data, default binning algorithms (which often assume a roughly normal distribution) can create misleading visualizations. A very large bin width can group all the smaller, more frequent values into one bar, while the long tail of larger, infrequent values is spread thinly across the remaining bins, obscuring the details of the distribution. Adjusting the number of bins, or the width of each bin, allows for a more granular view. For right-skewed data, using more bins or applying a transformation (such as a logarithmic scale on the x-axis, which is conceptually similar to using equal-width bins on a log scale) can help spread out the clustered data and make the distribution's shape more apparent.
Using a box plot is a plausible option for skewed data but it summarizes the distribution into quartiles and may hide features like bimodality, which a well-constructed histogram could reveal. Simply increasing the number of bins without considering the data's skewness might lead to a noisy, difficult-to-interpret plot. A density plot is a good alternative, but adjusting the histogram's parameters is the most direct and fundamental step to address the described problem with the initial histogram itself.
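As an illustration, here is a minimal sketch in Python (assuming NumPy and Matplotlib, with synthetic log-normal values standing in for the transaction amounts) showing how a better binning strategy changes the picture:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical right-skewed transaction amounts (log-normal is a common stand-in).
rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=3.5, sigma=1.2, size=10_000)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Default-style coarse binning: most values collapse into the leftmost bars.
axes[0].hist(amounts, bins=10)
axes[0].set_title("Coarse default bins")

# Freedman-Diaconis rule: bin width = 2 * IQR * n^(-1/3), available via bins='fd'.
axes[1].hist(amounts, bins="fd")
axes[1].set_xscale("log")  # a log x-axis further spreads the clustered low values
axes[1].set_title("Freedman-Diaconis bins, log x-axis")

plt.tight_layout()
plt.show()
```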
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is the Freedman-Diaconis rule?
Why are histograms more suitable than box plots for visualizing skewed data?
How does applying a logarithmic scale help with skewed data in histograms?
While building a logistic-regression model to predict loan default, your training data show that 8% of values for the numeric attribute debt_to_income_ratio are missing. Exploratory analysis reveals that the probability of a value being missing increases for borrowers who are younger than 25 and who have less than one year of employment, but within those strata the missingness appears random. The feature is continuous, right-skewed, and has a strong influence on the target. Regulation requires that the chosen imputation technique preserve the variable's variance and explicitly propagate the extra uncertainty introduced by the missing data to any downstream parameter estimates. Which imputation type is the most appropriate to meet these constraints?
k-nearest-neighbors imputation using Euclidean distance on standardized predictors
Multiple imputation with pooled estimates across several completed data sets
Listwise deletion of all records that lack debt_to_income_ratio
Single mean imputation calculated within each cross-validation fold
Answer Description
Multiple imputation (for example, multiple imputation by chained equations) stochastically draws several plausible values for each missing observation conditional on the observed data, creating multiple completed data sets. Model parameters are estimated in each data set and then pooled, so between-imputation variability is carried forward and reflected in standard errors, satisfying the requirement to propagate uncertainty under a Missing-at-Random mechanism. Listwise deletion simply removes affected rows, reducing sample size and yielding biased coefficients when missingness depends on observed covariates. Single mean imputation is deterministic; it underestimates variance and ignores imputation uncertainty. k-nearest-neighbors imputation generates only one completed data set and likewise fails to account for the sampling variability of the imputed values. Therefore, multiple imputation is the most suitable choice.
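A simplified sketch of the idea, assuming Python with scikit-learn and a small hypothetical loan data frame (column names such as years_employed are illustrative). Here IterativeImputer with sample_posterior=True stands in for a full MICE implementation, and only the between-imputation spread is shown rather than complete Rubin's-rules pooling:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

# Hypothetical frame: debt_to_income_ratio is missing for some young, newly employed borrowers.
df = pd.DataFrame({
    "age": [22, 24, 31, 45, 29, 23, 52, 38],
    "years_employed": [0.5, 0.8, 4.0, 10.0, 3.0, 0.4, 20.0, 7.0],
    "debt_to_income_ratio": [np.nan, 0.42, 0.31, 0.18, np.nan, 0.55, 0.12, 0.27],
    "default": [1, 1, 0, 0, 1, 1, 0, 0],
})
X_cols = ["age", "years_employed", "debt_to_income_ratio"]

M = 5  # number of imputed data sets
coefs = []
for m in range(M):
    # sample_posterior=True draws stochastic imputations, giving M distinct completed sets.
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    X_complete = imputer.fit_transform(df[X_cols])
    model = LogisticRegression().fit(X_complete, df["default"])
    coefs.append(model.coef_[0])

coefs = np.array(coefs)
pooled_mean = coefs.mean(axis=0)                     # pooled point estimates
between_imputation_var = coefs.var(axis=0, ddof=1)   # extra variance from imputation
# Full Rubin's rules would also add the average within-imputation variance.
print(pooled_mean, between_imputation_var)
```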
Ask Bash
What is multiple imputation and how does it work?
Why does listwise deletion lead to biased coefficients?
What does 'propagating uncertainty' mean in the context of imputation?
A data scientist is preparing a manufacturing data set for a k-nearest neighbors (k-NN) model that uses Euclidean distance. The data contain two continuous variables: AnnualEnergy_kWh, with a range of 0 to 12,000, and MaintenanceDowntime_min, with a range of 0 to 7,200.
During pilot runs, the distance metric is dominated by AnnualEnergy_kWh, causing records with high downtime to be misclassified. According to best practice for normalization, which preprocessing step should the data scientist apply before training so that both variables contribute proportionally to the distance calculation?
Standardize each feature to zero mean and unit variance (z-score).
Apply a natural logarithm transform (log1p) to every value in both features.
Rescale each feature to the 0-1 interval using min-max normalization.
Generate polynomial cross-terms between the two features and include them in the model.
Answer Description
The k-nearest neighbors (k-NN) algorithm relies on raw Euclidean distances, so any feature with a larger numeric range will disproportionately dominate the distance metric. Min-max normalization addresses this by rescaling each continuous feature to a common bounded range (typically 0 to 1), ensuring that both AnnualEnergy_kWh and MaintenanceDowntime_min have an equal potential influence on the distance metric. Standardization (z-score) is another common scaling technique, but it scales data based on a mean of 0 and a standard deviation of 1, without enforcing a strict, bounded range; this can make it more sensitive to outliers than min-max normalization. A log transform changes a feature's distribution shape to handle skewness and is not designed for scaling features of different magnitudes. Creating polynomial cross-terms is a feature engineering technique that introduces new, unscaled variables, which would likely worsen the imbalance. Therefore, min-max normalization is the most appropriate technique in this scenario.
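A brief sketch of this preprocessing choice, assuming Python with scikit-learn and a few hypothetical records:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical records: [AnnualEnergy_kWh, MaintenanceDowntime_min]
X = np.array([[11500, 300], [400, 6900], [9800, 250], [600, 7100]])
y = np.array([0, 1, 0, 1])

# Without scaling, Euclidean distance is dominated by the kWh column (range ~0-12,000).
# MinMaxScaler rescales each feature to [0, 1] so both contribute proportionally.
knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=1))
knn.fit(X, y)
print(knn.predict([[700, 6800]]))  # neighbor search now reflects downtime as well
```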
Ask Bash
Why is min-max normalization preferred over standardization for k-NN with Euclidean distance?
How does Euclidean distance work in k-NN, and why do variable ranges matter?
What is the difference between min-max normalization and a log transform?
A data scientist is analyzing the relationship between two continuous variables. A scatter plot reveals a clear pattern: as one variable increases, the other consistently increases, but the relationship is distinctly non-linear (curved). The calculated coefficients are a Pearson correlation of 0.2 and a Spearman correlation of 0.9. Which statement provides the MOST accurate explanation for this significant difference in the coefficient values?
Spearman correlation is only appropriate for ordinal data, and its application to continuous data has artificially inflated its value.
The presence of significant outliers is suppressing the Pearson correlation, while the rank-based Spearman correlation is unaffected.
Pearson correlation's low value reflects the lack of a linear relationship, while Spearman correlation's high value accurately captures the strong monotonic trend.
The data likely violates the normality assumption required for Pearson correlation, leading to an inaccurate result.
Answer Description
The correct answer is that Pearson correlation's low value reflects the lack of a linear relationship, while Spearman correlation's high value accurately captures the strong monotonic trend. Pearson correlation specifically measures the strength and direction of a linear association between two variables. If the relationship is strong but not linear (e.g., curved), the Pearson coefficient will be low. Spearman correlation, on the other hand, measures the strength and direction of a monotonic relationship. A monotonic relationship is one where the variables tend to move in the same direction, but not necessarily at a constant rate. Since the described relationship is consistently increasing (monotonic) but curved (non-linear), Spearman's rank-based method correctly identifies a strong relationship (0.9), while Pearson's method correctly identifies a weak linear relationship (0.2).
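The gap between the two coefficients is easy to reproduce. A short sketch assuming Python with SciPy and a deterministic exponential curve (the exact values will differ from the 0.2 and 0.9 in the question):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# A strictly increasing but strongly non-linear relationship: y grows exponentially with x.
x = np.linspace(0, 10, 200)
y = np.exp(x)

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
print(f"Pearson:  {r_pearson:.2f}")   # well below 1, because the trend is curved, not linear
print(f"Spearman: {r_spearman:.2f}")  # exactly 1, because the ranks move perfectly together
```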
Ask Bash
What is the main difference between Pearson and Spearman correlation?
What is a monotonic relationship, and why is it important for Spearman correlation?
Why does Pearson correlation require a linear relationship to yield a strong result?
A data scientist is building a multiple linear regression model to predict housing prices. The initial model, using only the living area in square feet as a predictor, yields an R-squared value of 0.65. To improve the model, the data scientist adds ten additional predictor variables, including number of bedrooms, number of bathrooms, and age of the house. The new model results in an R-squared value of 0.78. Which of the following is the most critical consideration for the data scientist when interpreting this increase in R-squared?
The new R-squared value is high, which invalidates the p-values of the individual coefficients in the model.
The increase from 0.65 to 0.78 definitively proves that the additional variables have strong predictive power and the new model is superior.
The R-squared value will almost always increase when more predictors are added to the model, regardless of their actual significance, potentially leading to overfitting.
An R-squared of 0.78 indicates that 78% of the model's predictions for house prices will be correct.
Answer Description
The correct answer highlights a key limitation of the R-squared metric. R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables. A critical characteristic of R-squared is that it will either increase or stay the same whenever a new predictor variable is added to the model, even if that variable has no real relationship with the outcome. This mathematical property means that simply observing an increase in R-squared after adding more variables is not sufficient evidence of a better model. It may indicate that the model is becoming overly complex and fitting to the noise in the training data (overfitting), rather than capturing the true underlying relationships. Therefore, the most critical consideration is recognizing that this increase is expected and could be misleading.
The other options are incorrect. Stating that the increase definitively proves the new model is superior is a flawed interpretation because it ignores the risk of overfitting and the inherent tendency of R-squared to increase. R-squared does not measure predictive accuracy in terms of the percentage of correct predictions; it measures the proportion of explained variance. A high R-squared value does not invalidate the p-values of the coefficients; these are separate (though related) diagnostic measures of a regression model.
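A quick way to see this property is to append pure-noise predictors and watch the training R-squared rise anyway. A sketch assuming Python with scikit-learn and simulated housing data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
sqft = rng.uniform(500, 4000, size=n)
price = 50_000 + 150 * sqft + rng.normal(scale=40_000, size=n)

# Baseline model: living area only.
X_base = sqft.reshape(-1, 1)
r2_base = LinearRegression().fit(X_base, price).score(X_base, price)

# Add ten predictors of pure noise: training R^2 still creeps upward.
X_noisy = np.column_stack([sqft, rng.normal(size=(n, 10))])
r2_noisy = LinearRegression().fit(X_noisy, price).score(X_noisy, price)

print(f"R^2 with sqft only:      {r2_base:.4f}")
print(f"R^2 with 10 noise terms: {r2_noisy:.4f}")  # never lower than the baseline on training data
```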
Ask Bash
Why does R-squared increase when more predictors are added?
What is overfitting in the context of regression models?
How can a data scientist avoid overfitting when adding predictors?
A data scientist is investigating the relationship between two categorical variables: 'User Segment' (with 4 levels: 'Free Trial', 'Basic', 'Pro', 'Enterprise') and 'Feature Adoption Rate' (with 3 levels: 'Low', 'Medium', 'High'). They construct a 4x3 contingency table to perform a Chi-squared test of independence. After calculating the expected frequencies, they discover that two cells have an expected frequency below 5. Given this situation, what is the most appropriate immediate action to ensure the validity of the analysis?
Immediately apply Fisher's Exact Test, as it is more accurate for small sample sizes and low expected frequencies.
Combine adjacent or logically similar categories in one or both variables to increase the expected frequencies in the cells.
Remove the rows or columns containing the cells with low expected frequencies from the analysis.
Perform an independent samples t-test for each pair of user segments to compare their feature adoption.
Answer Description
The correct action is to combine adjacent or logically similar categories. The Chi-squared test of independence operates under the assumption that the expected frequency in each cell of the contingency table should be at least 5. When this assumption is violated, as in this scenario, the Chi-squared distribution may not accurately approximate the test statistic, potentially leading to unreliable p-values and an increased risk of a Type I error. The most common and appropriate first step to address this is to combine logically related categories. For instance, the 'Pro' and 'Enterprise' segments could be combined into a 'Paid' category, or the 'Low' and 'Medium' adoption rates could be merged. This action increases the cell counts, helping to meet the test's assumption while retaining most of the data.
- Applying Fisher's Exact Test is a plausible alternative, as it is designed for small sample sizes and does not rely on the same large-sample approximation. However, for contingency tables larger than 2x2, combining categories is often the more practical and interpretable first step. Fisher's test can also be computationally intensive for larger tables.
- Performing an independent samples t-test is incorrect because a t-test is used to compare the means of a continuous variable between two groups. Both variables in this scenario are categorical.
- Removing rows or columns with low expected frequencies is inappropriate as it results in a loss of valuable data and can introduce bias into the analysis.
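A sketch of the check-and-combine workflow, assuming Python with SciPy and a hypothetical 4x3 table whose Enterprise row is sparse:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = Free Trial, Basic, Pro, Enterprise; cols = Low, Medium, High.
observed = np.array([
    [40, 25, 10],
    [30, 35, 20],
    [12, 18, 9],
    [3, 4, 2],     # sparse Enterprise row drives the low expected counts
])

chi2, p, dof, expected = chi2_contingency(observed)
print(expected.round(1))          # inspect which cells fall below 5

# Combine the 'Pro' and 'Enterprise' rows into a single 'Paid' category, then retest.
combined = np.vstack([observed[:2], observed[2] + observed[3]])
chi2_c, p_c, dof_c, expected_c = chi2_contingency(combined)
print(expected_c.round(1))        # expected counts now meet the >= 5 guideline
```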
Ask Bash
Why is an expected frequency of at least 5 important in a Chi-squared test?
What are some practical methods for combining categories in a contingency table?
When should Fisher's Exact Test be used instead of a Chi-squared test?
A data science team is evaluating four association rules that have already met the project's minimum support and confidence thresholds:
- Rule A: support = 2%, confidence = 80%
- Rule B: support = 4%, confidence = 50%
- Rule C: support = 1%, confidence = 90%
- Rule D: support = 3%, confidence = 60%
To rank the rules, the team will use the reinforcement metric, also known as Rule Power Factor. Based on this metric, which rule is the most powerful?
Rule A
Rule D
Rule B
Rule C
Answer Description
Reinforcement, also known as the Rule Power Factor, is calculated by multiplying a rule's support by its confidence (both expressed as proportions).
- Rule A: 0.02 × 0.80 = 0.016
- Rule B: 0.04 × 0.50 = 0.020
- Rule C: 0.01 × 0.90 = 0.009
- Rule D: 0.03 × 0.60 = 0.018
Rule B yields the largest reinforcement value (0.020), so it is the most powerful rule according to this metric. Rules D, A, and C follow in descending order of reinforcement.
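The ranking can be reproduced in a few lines of Python (a small sketch; the rule names map to the support and confidence pairs listed above):

```python
# Reinforcement (Rule Power Factor) = support x confidence, both as proportions.
rules = {"A": (0.02, 0.80), "B": (0.04, 0.50), "C": (0.01, 0.90), "D": (0.03, 0.60)}

rpf = {name: support * confidence for name, (support, confidence) in rules.items()}
for name, value in sorted(rpf.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Rule {name}: {value:.3f}")
# Rule B: 0.020, Rule D: 0.018, Rule A: 0.016, Rule C: 0.009
```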
Ask Bash
What is the Rule Power Factor (Reinforcement Metric)?
How are support and confidence defined in association rules?
Why was Rule B ranked higher than the others?
A data‐science team is tuning a pricing engine whose objective is twice-differentiable and non-convex, subject to hundreds of inequality constraints and simple bounds. They have analytic gradients and Hessians and want every iterate to remain strictly inside the feasible region throughout the search. To do this, they choose a solver that
- augments the objective with a logarithmic barrier term −μ ∑log sᵢ(x) to prevent boundary violations,
- follows a central path by gradually decreasing the barrier parameter μ→0, and
- at each outer iteration solves a primal-dual Newton system instead of a quadratic programming subproblem.
Which class of constrained nonlinear optimization algorithms matches this strategy?
Nelder-Mead simplex search
Augmented Lagrangian (method of multipliers)
Primal-dual interior-point (path-following) method
Sequential quadratic programming
Answer Description
The described approach is characteristic of primal-dual interior-point (path-following) methods. Interior-point algorithms embed all inequality constraints in a logarithmic barrier, ensuring every iterate remains strictly feasible; they then trace the central path toward optimality while repeatedly updating μ and solving Newton-type systems.
Sequential quadratic programming also handles nonlinear constraints, but it linearizes the constraints and solves a series of quadratic programming subproblems rather than inserting a barrier term.
Augmented Lagrangian methods add a quadratic penalty to the Lagrangian and relax feasibility between outer iterations; iterates can lie outside the feasible region.
The Nelder-Mead simplex search is a derivative-free direct-search technique that is typically used for unconstrained (or only box-constrained) problems and does not rely on gradients, Hessians, or barrier terms.
Therefore, only the interior-point class fits all three bullet-point requirements.
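For intuition only, a toy primal log-barrier sketch in Python (not the full primal-dual Newton system) that shows the barrier term keeping every iterate strictly feasible while μ shrinks along the central path; the one-dimensional problem and all constants are illustrative:

```python
# Toy barrier method: minimize f(x) = (x - 3)^2 subject to x <= 2,
# i.e. the slack s(x) = 2 - x must stay strictly positive.
def barrier_grad(x, mu):
    return 2 * (x - 3) + mu / (2 - x)      # d/dx of f(x) - mu * ln(2 - x)

def barrier_hess(x, mu):
    return 2 + mu / (2 - x) ** 2

x, mu = 0.0, 1.0          # start strictly inside the feasible region (x < 2)
for _ in range(8):
    for _ in range(20):   # Newton iterations on the barrier subproblem for this mu
        step = barrier_grad(x, mu) / barrier_hess(x, mu)
        while x - step >= 2:          # damp the step so the iterate never hits the boundary
            step *= 0.5
        x -= step
    print(f"mu = {mu:8.1e}, x = {x:.6f}, slack = {2 - x:.6f}")
    mu *= 0.1             # follow the central path by shrinking the barrier parameter
# x approaches the constrained optimum x* = 2 from the interior as mu -> 0.
```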
Ask Bash
What is a primal-dual interior-point method?
Why is the logarithmic barrier term −μ ∑log sᵢ(x) used?
How does a primal-dual Newton system work in this context?
An e-commerce company plans to run online validation of a new ranking model. The current production model (champion) will continue to serve users, while 50 % of requests are randomly routed to a challenger. Business stakeholders want to replace the champion only if the challenger shows at least a 2 % lift in click-through rate (CTR). Which step is most critical before traffic is split to ensure the experiment yields statistically valid evidence?
Increase the inference endpoint's autoscaling threshold so both variants can absorb peak traffic without throttling.
Deploy the challenger only in shadow mode, receiving mirrored traffic that does not affect live users.
Enable detailed feature logging for the challenger so offline explainability tools can be applied after the test.
Compute the minimum number of user impressions needed to detect a 2 % absolute lift in CTR at the chosen significance and power levels.
Answer Description
Determining the minimum sample size with a power analysis, which involves setting a significance level (α) and statistical power (1-β), is crucial. This step ensures the A/B test will collect enough impressions to reliably detect a 2% lift while controlling type I and type II error rates. If the sample size is too small, the test could end prematurely, leading to false conclusions. The other actions improve observability (feature logging), safety (shadow mode), or reliability (autoscaling), but they do not guarantee that an observed CTR difference will be statistically meaningful.
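A sketch of the sample-size calculation, assuming Python with statsmodels and an assumed baseline CTR of 5% (the baseline, α, and power values are illustrative):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_ctr = 0.05                 # assumed champion CTR
target_ctr = baseline_ctr + 0.02    # 2% absolute lift to detect

effect_size = proportion_effectsize(target_ctr, baseline_ctr)  # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,       # significance level
    power=0.80,       # 1 - beta
    ratio=1.0,        # 50/50 traffic split
    alternative="two-sided",
)
print(f"Minimum impressions per variant: {n_per_arm:.0f}")
```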
Ask Bash
What is statistical power in the context of A/B testing?
Why is setting a significance level important in an experiment?
How do you calculate the minimum sample size for detecting a 2% lift in CTR?
A machine learning engineer is manually implementing the gradient descent algorithm to optimize a multivariate linear regression model. The objective is to minimize the Mean Squared Error (MSE) cost function by iteratively adjusting the model's parameters (weights). For each iteration of the algorithm, which of the following mathematical operations is most fundamental for determining the direction and magnitude of the update for a specific weight?
Computing the second partial derivative (Hessian matrix) of the cost function.
Calculating the Euclidean distance between the predicted and actual values.
Calculating the partial derivative of the MSE cost function with respect to that specific weight.
Applying the chain rule to the model's activation function.
Answer Description
The correct answer is to calculate the partial derivative of the MSE cost function with respect to that specific weight. In gradient descent, the goal is to minimize a cost function by adjusting model parameters. The gradient, which is a vector composed of the partial derivatives of the cost function with respect to each parameter, points in the direction of the steepest ascent of the cost function. Therefore, to minimize the cost, the algorithm updates the parameters by taking a step in the opposite direction of the gradient. The partial derivative for a specific weight tells us how a small change in that weight will affect the total error, thus defining the direction and contributing to the magnitude of the necessary update for that weight.
- Computing the second partial derivative (Hessian matrix) is characteristic of second-order optimization methods, like Newton's method, which use curvature information to converge faster but are more computationally expensive. The question specifically asks about gradient descent, which is a first-order method.
- Applying the chain rule is a necessary step in the process of deriving the partial derivative for complex functions (like in neural networks), but the fundamental quantity needed for the update step in gradient descent is the partial derivative itself.
- Calculating the Euclidean distance between predicted and actual values is part of computing the overall MSE cost, not the update step. The partial derivative of this cost is what guides the optimization.
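A compact sketch of the update rule, assuming Python with NumPy and a simulated regression problem:

```python
import numpy as np

# Gradient-descent sketch for multivariate linear regression with an MSE cost.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    residuals = X @ w - y
    # Partial derivative of MSE = (1/n) * sum((Xw - y)^2) with respect to each weight:
    grad = (2 / len(y)) * X.T @ residuals
    w -= lr * grad        # step opposite the gradient
print(w.round(2))          # close to [2.0, -1.0, 0.5]
```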
Ask Bash
What is the purpose of calculating the partial derivative in gradient descent?
What is the difference between gradient descent and second-order methods like Newton's method?
Why is the chain rule important in calculating partial derivatives?
A data scientist has developed a multiple linear regression model to predict housing prices. After the initial training, the scientist examines the model's performance by creating a residual vs. fitted values plot. The plot reveals that the residuals are not randomly scattered around the zero line; instead, they form a distinct, parabolic (U-shaped) pattern. What is the most likely issue with the model, and what is the most appropriate next step in the model design iteration process?
The model exhibits non-linearity, indicating it fails to capture the underlying structure of the data. The next step should be to use feature engineering to create polynomial terms for the relevant predictors.
The model is likely overfitting the training data. The next step should be to increase the L2 regularization penalty (e.g., in a Ridge regression) to reduce the model's complexity.
The plot reveals multicollinearity among the predictor variables. The next step should be to calculate the Variance Inflation Factor (VIF) for each feature and consider removing highly correlated predictors.
The plot shows evidence of heteroscedasticity, meaning the variance of the errors is not constant. The next step should be to apply a Box-Cox transformation to the response variable to stabilize the variance.
Answer Description
The correct option identifies non-linearity as the issue and suggests creating polynomial features as the solution. A parabolic or U-shaped pattern in a residual vs. fitted values plot is a classic indicator that the linear model is failing to capture a non-linear relationship in the data. This is a form of underfitting, where the model is too simple. The appropriate corrective action is to engineer new features that can account for this curvature, such as adding squared or cubic terms of the existing predictors (polynomial features).
The option suggesting heteroscedasticity is incorrect because heteroscedasticity typically appears as a cone or fan shape in the residual plot, where the spread of residuals changes as the fitted values increase or decrease. While a Box-Cox transformation is a valid technique to address non-constant variance, it is not the primary solution for the U-shaped pattern described.
The option suggesting multicollinearity is incorrect because multicollinearity, the correlation between predictor variables, is not diagnosed using a residual vs. fitted plot. It is typically identified using a correlation matrix or by calculating the Variance Inflation Factor (VIF).
The option suggesting overfitting is incorrect. A U-shaped residual plot indicates underfitting (the model is too simple to capture the underlying pattern), not overfitting. Increasing regularization would further simplify the model, likely worsening the issue.
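A sketch of the corrective step, assuming Python with scikit-learn and simulated data with a genuinely quadratic relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=300).reshape(-1, 1)
y = 2 + 0.5 * x.ravel() ** 2 + rng.normal(scale=2, size=300)   # quadratic ground truth

linear = LinearRegression().fit(x, y)
resid_linear = y - linear.predict(x)          # plotted vs. fitted values, these form a U shape

quadratic = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                          LinearRegression()).fit(x, y)
resid_quad = y - quadratic.predict(x)         # curvature captured; residuals scatter randomly

print(f"Residual std, linear:    {resid_linear.std():.2f}")
print(f"Residual std, quadratic: {resid_quad.std():.2f}")
```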
Ask Bash
What does a U-shaped pattern in a residual vs. fitted values plot signify?
What are polynomial features in machine learning?
Why is multicollinearity not diagnosed using a residual vs. fitted values plot?
You are building a sentiment classifier that must label customer-service tickets as Positive, Negative, or Neutral. In a corpus of 600 000 tickets, about 80 % are Neutral, 15 % Negative, and 5 % Positive. An LSTM model currently reports 81 % overall accuracy, but stakeholders want a single evaluation metric that is not dominated by the Neutral majority and instead gives each sentiment category equal influence on the final score. Which metric should you monitor during model development to satisfy this requirement?
Micro-averaged precision
Overall accuracy
Macro-averaged F1 score
Weighted F1 score
Answer Description
A macro-averaged F1 score first computes the F1 for each class separately and then takes the unweighted mean. Because each class contributes equally, performance on Positive and Negative tickets is just as influential as Neutral, making this metric well suited to imbalanced multi-class sentiment problems. Overall accuracy and micro-averaged precision are dominated by the large Neutral class and can mask poor minority-class performance. A weighted F1 score partially addresses imbalance but still scales each class's contribution by its support, so the majority class would continue to drive the result, failing to meet the stakeholders' requirement.
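A small sketch of how the metrics diverge on an imbalanced sample, assuming Python with scikit-learn and a degenerate model that always predicts Neutral:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels (0 = Neutral, 1 = Negative, 2 = Positive) with an 80/15/5 split.
y_true = [0] * 16 + [1] * 3 + [2] * 1
y_pred = [0] * 20                    # model that always predicts the Neutral majority

print(accuracy_score(y_true, y_pred))                                    # 0.80 - looks fine
print(f1_score(y_true, y_pred, average="macro", zero_division=0))        # ~0.30 - exposes minority failure
print(f1_score(y_true, y_pred, average="weighted", zero_division=0))     # ~0.71 - still flattered by Neutral
```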
Ask Bash
What is a Macro-averaged F1 score?
Why is overall accuracy not suitable for imbalanced datasets?
How is a Weighted F1 score different from a Macro-averaged F1 score?
A data scientist is constructing a feature matrix where the existing feature vectors are linearly independent. A new feature vector is engineered, which is a linear combination of two of the original vectors. This new vector is then appended as a new column to the matrix. Which statement correctly describes the primary consequence of this action on the properties of the feature matrix?
The span of the column space expands to a higher dimension because an additional vector has been introduced.
The new vector replaces one of the original vectors in the basis, resolving a deficient rank problem in the original matrix.
The span of the column space is unaffected, which improves the numerical stability of subsequent model coefficient estimations.
The span of the column space remains unchanged, but perfect multicollinearity is introduced.
Answer Description
The correct answer is that the span of the column space remains unchanged, but this introduces perfect multicollinearity into the model. The span of a set of vectors is the set of all possible linear combinations of those vectors. Since the new feature vector is explicitly created as a linear combination of existing vectors, it already lies within the original span and does not add any new dimensions to the column space. The direct consequence of adding a linearly dependent feature vector is the introduction of perfect multicollinearity. This condition can destabilize linear models, making coefficient estimates unreliable and non-unique. The other options are incorrect because the span does not expand, the addition of a dependent vector degrades rather than improves model stability, and a new basis is not necessarily formed while ignoring the more critical issue of multicollinearity.
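A short numerical illustration, assuming Python with NumPy and randomly generated independent columns:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))              # three linearly independent feature columns
new_col = 2 * X[:, 0] + 0.5 * X[:, 1]     # engineered as a linear combination of two columns
X_aug = np.column_stack([X, new_col])

print(np.linalg.matrix_rank(X))       # 3
print(np.linalg.matrix_rank(X_aug))   # still 3 - the column space (span) is unchanged

# Perfect multicollinearity shows up as a (near-)singular Gram matrix X^T X.
print(np.linalg.cond(X_aug.T @ X_aug))   # enormous condition number
```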
Ask Bash
What does 'perfect multicollinearity' mean in linear modeling?
What is the column space of a matrix?
How does linear dependence affect the numerical stability of a linear model?
A machine learning engineer is training a deep neural network for a non-stationary problem and notices that the learning process has effectively halted. They determine that their current optimizer, Adagrad, has caused the learning rate to diminish to a near-zero value. To mitigate this, they decide to switch to the Root Mean Square Propagation (RMSprop) optimizer. What is the key mechanism in RMSprop that directly addresses this issue of a rapidly vanishing learning rate seen in Adagrad?
It introduces a penalty term to the loss function based on the magnitude of the model's weights to prevent overfitting.
It adds a fraction of the previous weight update vector to the current one, helping to accelerate convergence and dampen oscillations.
It computes adaptive learning rates by storing an exponentially decaying average of past gradients (first moment) and past squared gradients (second moment).
It calculates a moving average of the squared gradients using a decay parameter, which prevents the denominator of the update rule from monotonically increasing.
Answer Description
The correct answer explains the core mechanism of RMSprop that solves a key limitation of Adagrad. RMSprop maintains an exponentially decaying average of past squared gradients. Unlike Adagrad, which accumulates all past squared gradients, RMSprop's use of a moving average (controlled by a decay parameter, rho) prevents the denominator in the learning rate update from growing indefinitely. This ensures the learning rate does not become too small, allowing the model to continue learning effectively, especially in non-stationary settings.
The distractor describing the use of both first and second moment estimates refers to the Adam optimizer, which combines the adaptive learning rate mechanism of RMSprop with momentum. The option describing the addition of a previous weight update vector refers to the Momentum optimizer, a different technique used to accelerate gradient descent. The final incorrect option describes L2 regularization (weight decay), which is a technique to prevent overfitting by penalizing large weights, and is unrelated to the adaptive learning rate mechanism of RMSprop.
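For reference, a minimal NumPy sketch of the RMSprop update (the learning rate, rho, and toy loss are illustrative):

```python
import numpy as np

def rmsprop_update(w, grad, cache, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSprop step: cache is the exponentially decaying average of squared gradients."""
    cache = rho * cache + (1 - rho) * grad ** 2     # the decay keeps the denominator bounded
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Contrast with Adagrad, whose accumulator only ever grows:
#   cache += grad ** 2   ->  lr / sqrt(cache) shrinks toward zero over time.
w, cache = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(3):
    grad = 2 * w                                    # gradient of a toy quadratic loss w^2
    w, cache = rmsprop_update(w, grad, cache)
print(w)
```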
Ask Bash
What makes RMSprop better suited for non-stationary problems compared to Adagrad?
How does the decay parameter (rho) in RMSprop function?
How does RMSprop differ from the Adam optimizer in handling gradients?
A monitoring script records the number of checkout failures per minute on a high-traffic e-commerce platform. Historical data indicate that failures occur independently at a constant average rate of 2.3 per minute. Assuming this process follows a Poisson distribution, which of the following values is closest to the probability that at least five checkout failures will be observed in a randomly selected minute?
0.024
0.084
0.916
0.209
Answer Description
For a Poisson(λ = 2.3) random variable X, the probability of exactly k events is P(X = k) = e^{-λ} λ^k / k!. The required probability is P(X ≥ 5) = 1 − P(X ≤ 4). Computing the cumulative probability up to k = 4: P(0) ≈ 0.1003, P(1) ≈ 0.2306, P(2) ≈ 0.2652, P(3) ≈ 0.2033, P(4) ≈ 0.1169. The sum is about 0.9163. Subtracting from 1 gives P(X ≥ 5) ≈ 0.084. The listed value closest to this result is 0.084. The other choices correspond to unrelated cumulative or single-point probabilities (e.g., exactly four events, at least four, or at most four) and therefore do not answer the question.
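The same result can be verified in a couple of lines with SciPy (a quick check sketch):

```python
from scipy.stats import poisson

lam = 2.3
p_at_least_5 = 1 - poisson.cdf(4, mu=lam)   # P(X >= 5) = 1 - P(X <= 4)
print(round(p_at_least_5, 3))               # ~0.084
```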
Ask Bash
What is the Poisson distribution used for?
How do you calculate cumulative probability in a Poisson distribution?
Why is e (Euler's number) used in the Poisson distribution formula?
A data scientist is performing Principal Component Analysis (PCA) on a high-dimensional dataset where the features have been standardized. After computing the covariance matrix of the data, the analysis proceeds with an eigen-decomposition. What does the first principal component represent in this context?
The largest eigenvalue of the covariance matrix, which quantifies the total variance captured by the model.
The eigenvector of the covariance matrix associated with the largest eigenvalue.
The direction defined by the eigenvector with the smallest eigenvalue, as it captures the least amount of systemic noise.
A linear combination of features designed to maximize the separation between predefined classes.
Answer Description
The correct answer is that the first principal component is the eigenvector of the covariance matrix associated with the largest eigenvalue. Principal Component Analysis (PCA) works by finding the directions of maximum variance in the data. These directions are mathematically represented by the eigenvectors of the data's covariance matrix. The amount of variance captured along each eigenvector's direction is given by its corresponding eigenvalue. Therefore, the first principal component, which by definition captures the most variance, corresponds to the eigenvector with the largest eigenvalue.
The largest eigenvalue itself is a scalar value that represents the amount of variance explained by the first principal component, not the component (direction) itself. Maximizing the separation between predefined classes is the objective of a supervised dimensionality reduction technique like Linear Discriminant Analysis (LDA), not the unsupervised PCA. The eigenvector with the smallest eigenvalue represents the last principal component, which captures the least amount of variance in the data.
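A short sketch of the eigen-decomposition view of PCA, assuming Python with NumPy and random standardized data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardized features

cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: symmetric matrix, ascending order

order = np.argsort(eigenvalues)[::-1]
first_pc = eigenvectors[:, order[0]]              # eigenvector paired with the largest eigenvalue
explained = eigenvalues[order[0]] / eigenvalues.sum()
print(first_pc, f"explains {explained:.1%} of the variance")
```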
Ask Bash
Why is the first principal component associated with the largest eigenvalue?
How does PCA differ from Linear Discriminant Analysis (LDA)?
What is the importance of standardizing features before performing PCA?
A data science team at an e-commerce company has developed a highly accurate customer churn prediction model using a complex gradient boosting algorithm. During the Evaluation phase, stakeholders confirm the model's predictive power but state that their primary goal has evolved. They now need to understand the specific reasons why customers are churning to inform retention strategies, a task for which the current "black box" model is ill-suited. According to the CRISP-DM methodology, what is the most appropriate immediate next step?
Return to the Business Understanding phase to redefine the project objectives and success criteria to include model interpretability.
Return to the Modeling phase to retrain the data with an inherently interpretable model, such as a decision tree or logistic regression.
Proceed to the Deployment phase since the model is technically accurate, and initiate a separate project for root-cause analysis.
Return to the Data Preparation phase to create new features that might provide more explanatory power when used in a new model.
Answer Description
The correct answer is to return to the Business Understanding phase. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is an iterative process model. The Evaluation phase is designed specifically to assess whether the developed model meets the business success criteria defined in the initial Business Understanding phase. In this scenario, the business goals have fundamentally changed from pure prediction to requiring model interpretability for strategic insights. Because the project's primary objective has been redefined, the team must formally return to the Business Understanding phase to update the project goals, redefine the business success criteria to include interpretability, and adjust the project plan accordingly.
Jumping directly to the Modeling or Data Preparation phases is incorrect because these technical steps should be guided by a clearly defined and agreed-upon business objective. Proceeding to the Deployment phase is also incorrect as it would mean delivering a solution that no longer meets the stakeholders' primary needs, which violates a core principle of the Evaluation phase.
Ask Bash
What is CRISP-DM, and why is it important in data projects?
Why is model interpretability important in certain projects?
What makes decision trees or logistic regression more interpretable compared to gradient boosting?
A data science team is creating a container image for a predictive-analytics service that will be offered under a proprietary license. Corporate policy forbids distribution of any image that contains a direct or transitive dependency released under the GNU GPL or other strong-copyleft licenses. The team wants to block non-compliant images automatically before they are pushed to the internal registry, while adding as little manual work as possible to the continuous-integration (CI) pipeline.
Which approach best meets these dependency-licensing requirements?
Replace any GPL-licensed dependencies with internal forks released under a permissive license and document the change in the project's README.
Generate an SBOM during each build with Syft or Trivy and have an Open Policy Agent rule fail the pipeline whenever a prohibited license is detected.
Pin every third-party package version in a requirements.txt file and commit it to version control to keep a reproducible inventory of licenses.
Run pip freeze after the image is built, store the output as a build artifact, and ask the compliance team to review the file once a quarter.
Answer Description
The most effective way to enforce license policy with minimal manual effort is to automate it in the CI pipeline. Generating a Software Bill of Materials (SBOM) with a tool such as Syft or Trivy and evaluating that SBOM with policy-as-code (for example, an Open Policy Agent rule) allows the build to fail immediately whenever a prohibited GPL or similar license appears-regardless of whether the package is a direct or transitive dependency.
Simply pinning versions in a requirements.txt file helps with reproducibility but does not check license metadata. Keeping a frozen list for quarterly manual review still allows non-compliant code to ship between reviews. Forking and relicensing troublesome packages requires ongoing manual maintenance and does not provide an automated gate for future dependencies. Therefore, integrating SBOM generation and automated license scanning in the CI pipeline is the only option that continuously enforces the organization's licensing policy.
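As a rough illustration of the automated gate, here is a Python stand-in for the policy check (the text describes using an OPA rule; this sketch assumes a CycloneDX-style SBOM JSON already produced by a tool such as Syft or Trivy, and the prohibited-license tags are illustrative):

```python
import json
import sys

# Hypothetical stand-in for the policy gate: in the described pipeline, the SBOM would be
# evaluated by an Open Policy Agent rule; here a small Python check plays that role.
PROHIBITED = ("GPL", "AGPL", "LGPL")   # adjust to the organization's copyleft policy

def find_violations(sbom: dict) -> list:
    violations = []
    for component in sbom.get("components", []):        # CycloneDX-style component list
        for lic in component.get("licenses", []):
            entry = lic.get("license", {})
            license_id = entry.get("id") or entry.get("name") or ""
            if any(tag in license_id.upper() for tag in PROHIBITED):
                violations.append(f"{component.get('name')}: {license_id}")
    return violations

if __name__ == "__main__":
    with open(sys.argv[1]) as fh:        # e.g. an sbom.json produced during the image build
        bad = find_violations(json.load(fh))
    if bad:
        print("Prohibited licenses found:", *bad, sep="\n  ")
        sys.exit(1)                      # a non-zero exit fails the CI pipeline
```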
Ask Bash
What is an SBOM and why is it important in enforcing licensing policies?
What is Open Policy Agent (OPA) and how does it work in a CI pipeline?
Why is using tools like Syft or Trivy better than manual dependency management?
During a schema-on-read validation step in your ETL pipeline, you must reject any record whose order_date field is not a valid calendar date in the form YYYY-MM-DD. The rule should allow only years between 1900 and 2099, months 01-12, and days 01-31; it does not need to account for month-specific day limits (for example, 31 February may pass). Which regular expression best enforces this requirement?
^\d{4}-\d{2}-\d{2}$
^([0-9]{2}){2}-(0[1-9]|1[0-2])-(0[1-9]|3[01])$
^(19|20)\d{2}/(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])$
^(19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$
Answer Description
The goal is to keep the pattern tight enough to eliminate obviously invalid tokens but avoid excessive complexity. Anchoring the pattern with ^ and $ ensures that the entire string is validated, not just a substring.
The expression ^(19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$ works as follows:
- (19|20)\d{2} constrains the year to 1900-2099.
- (0[1-9]|1[0-2]) forces the month to 01-12.
- (0[1-9]|[12]\d|3[01]) limits the day to 01-31 by handling 01-09, 10-29, and 30-31.
- Each part is separated by the required hyphen.
Distractor explanations:
- ^(19|20)\d{2}/... uses slashes as separators, so it fails the hyphen requirement.
- ^\d{4}-\d{2}-\d{2}$ allows 0000-00-00 and other impossible values because it lacks specific range checks.
- ^([0-9]{2}){2}-... repeats a generic two-digit group for the year (so a value like 9919 would pass) and uses an incomplete day range, so invalid years would be accepted and some valid days rejected.
Therefore, the hyphen-separated pattern with explicit year, month, and day ranges is the most precise fit for the stated constraint.
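A quick sketch of the validation step in Python, using the reconstructed pattern above on a few illustrative values:

```python
import re

ORDER_DATE = re.compile(r"^(19|20)\d{2}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

for value in ["2024-07-25", "1899-05-10", "2024/07/25", "2024-13-01", "2024-02-31"]:
    print(value, bool(ORDER_DATE.match(value)))
# 2024-07-25 True, 1899-05-10 False, 2024/07/25 False, 2024-13-01 False,
# 2024-02-31 True (month-specific day limits are intentionally not enforced)
```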
Ask Bash
What does the ^ and $ in a regular expression do?
Why does (19|20)\d{2} constrain the year to 1900-2099?
How does (0[1-9]|1[0-2]) ensure the month is valid?
During a market-basket analysis of 10,000 e-commerce transactions, you evaluate the association rule {Wireless Mouse} → {Mouse Pad}. The items appear with the following absolute frequencies:
- Wireless Mouse: 2,000 transactions
- Mouse Pad: 1,500 transactions
- Both items together: 600 transactions
Based on these counts, which statement about the lift of the rule is correct?
The lift is 3.33, showing a very strong positive association between the two items.
The lift is 0.75, indicating that customers who buy a wireless mouse are less likely than average to buy a mouse pad.
The lift is 2.0, showing that customers who buy a wireless mouse are twice as likely to buy a mouse pad compared with the baseline.
The lift is 1.5, indicating a slight positive association between the two items.
Answer Description
Support(Wireless Mouse) = 2,000 / 10,000 = 0.20. Support(Mouse Pad) = 1,500 / 10,000 = 0.15. Support(both) = 600 / 10,000 = 0.06.
Confidence(Wireless Mouse → Mouse Pad) = Support(both) / Support(Wireless Mouse) = 0.06 / 0.20 = 0.30. Lift = Confidence / Support(Mouse Pad) = 0.30 / 0.15 = 2.0.
A lift of 2.0 means the consequent (Mouse Pad) is twice as likely to appear when the antecedent (Wireless Mouse) is present than it is on average, indicating a positive association. The other options give incorrect numeric results and therefore incorrect interpretations: 0.75 would imply a negative association, 1.5 underestimates the strength, and 3.33 overstates it.
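The arithmetic in a few lines of Python:

```python
total = 10_000
support_mouse = 2_000 / total       # 0.20
support_pad = 1_500 / total         # 0.15
support_both = 600 / total          # 0.06

confidence = support_both / support_mouse   # 0.30
lift = confidence / support_pad             # 2.0
print(confidence, lift)
```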
Ask Bash
What does 'lift' mean in market-basket analysis?
How is confidence different from lift in association rules?
Why is the lift value of 2.0 significant in the given analysis?