CompTIA DataX Practice Test (DY0-001)
Use the form below to configure your CompTIA DataX Practice Test (DY0-001). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

CompTIA DataX DY0-001 (V1) Information
CompTIA DataX is an expert‑level, vendor‑neutral certification aimed at deeply experienced data science professionals. Launched on July 25, 2024, the exam verifies advanced competencies across the full data science lifecycle - from mathematical modeling and machine learning to deployment and specialized applications like NLP, computer vision, and anomaly detection.
The exam comprehensively covers five key domains:
- Mathematics and Statistics (~17%)
- Modeling, Analysis, and Outcomes (~24%)
- Machine Learning (~24%)
- Operations and Processes (~22%)
- Specialized Applications of Data Science (~13%)
It includes a mix of multiple‑choice and performance‑based questions (PBQs), simulating real-world tasks like interpreting data pipelines or optimizing machine learning workflows. The duration is 165 minutes, with a maximum of 90 questions. Scoring is pass/fail only, with no scaled score reported.
Free CompTIA DataX DY0-001 (V1) Practice Test
Press start when you are ready, or press Change to modify any settings for the practice test.
- Questions: 20
- Time: Unlimited
- Included Topics: Mathematics and Statistics; Modeling, Analysis, and Outcomes; Machine Learning; Operations and Processes; Specialized Applications of Data Science
A data scientist is building a multiple linear regression model to predict housing prices. The initial model, using only the living area in square feet as a predictor, yields an R-squared value of 0.65. To improve the model, the data scientist adds ten additional predictor variables, including number of bedrooms, number of bathrooms, and age of the house. The new model results in an R-squared value of 0.78. Which of the following is the most critical consideration for the data scientist when interpreting this increase in R-squared?
The increase from 0.65 to 0.78 definitively proves that the additional variables have strong predictive power and the new model is superior.
The new R-squared value is high, which invalidates the p-values of the individual coefficients in the model.
The R-squared value will almost always increase when more predictors are added to the model, regardless of their actual significance, potentially leading to overfitting.
An R-squared of 0.78 indicates that 78% of the model's predictions for house prices will be correct.
Answer Description
The correct answer highlights a key limitation of the R-squared metric. R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables. A critical characteristic of R-squared is that it will either increase or stay the same whenever a new predictor variable is added to the model, even if that variable has no real relationship with the outcome. This mathematical property means that simply observing an increase in R-squared after adding more variables is not sufficient evidence of a better model. It may indicate that the model is becoming overly complex and fitting to the noise in the training data (overfitting), rather than capturing the true underlying relationships. Therefore, the most critical consideration is recognizing that this increase is expected and could be misleading.
The other options are incorrect. Stating that the increase definitively proves the new model is superior is a flawed interpretation because it ignores the risk of overfitting and the inherent tendency of R-squared to increase. R-squared does not measure predictive accuracy in terms of the percentage of correct predictions; it measures the proportion of explained variance. A high R-squared value does not invalidate the p-values of the coefficients; these are separate (though related) diagnostic measures of a regression model.
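A brief, hedged sketch of this property using synthetic data (scikit-learn and NumPy assumed available): adding ten pure-noise predictors still pushes training R-squared up, while an adjusted R-squared helper penalizes the extra complexity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
sqft = rng.normal(1500, 400, size=n)
price = 100 * sqft + rng.normal(0, 40_000, size=n)   # price truly depends only on sqft
noise = rng.normal(size=(n, 10))                     # ten irrelevant predictors

X1 = sqft.reshape(-1, 1)
X2 = np.hstack([X1, noise])

r2_one = LinearRegression().fit(X1, price).score(X1, price)
r2_many = LinearRegression().fit(X2, price).score(X2, price)

def adjusted_r2(r2, n_obs, n_predictors):
    # Penalizes additional predictors; can decrease when they add no real signal.
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

print(r2_many >= r2_one)                                       # True: R-squared never decreases
print(adjusted_r2(r2_one, n, 1), adjusted_r2(r2_many, n, 11))  # adjusted R-squared tells a different story
```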
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
Why does R-squared increase when more predictors are added?
What is overfitting in the context of regression models?
How can a data scientist avoid overfitting when adding predictors?
A data science team is developing an automated ingestion pipeline for customer feedback data provided as CSV files. The pipeline frequently fails due to parsing errors, specifically when feedback text contains commas or line breaks. Although the text fields are enclosed in double quotes as per convention, the parser still misinterprets the data structure. Which of the following is the most likely underlying cause of this data ingestion problem?
The ingestion pipeline is attempting to infer a data schema, and the presence of mixed data types is causing type-casting failures.
The CSV files are being saved with a UTF-8 byte-order mark (BOM) that the ingestion script cannot interpret.
The CSV files contain unescaped double quotes within data fields that are also enclosed in double quotes.
The data provider is using a regional-specific delimiter, such as a semicolon, instead of a comma.
Answer Description
The correct answer identifies that the most probable cause is the presence of unescaped double quotes within fields that are already quoted. According to RFC 4180, a common convention for CSV files, if a field is enclosed in double quotes to handle special characters like commas or line breaks, any double quote character within the field's content must be escaped by preceding it with another double quote. Failure to do so confuses the parser, which interprets the unescaped quote as the end of the field, leading to structural errors.
- The use of a UTF-8 BOM is a common issue but typically causes the entire file to be misread from the start or results in garbled characters, not intermittent parsing failures based on specific field content.
- An incorrect delimiter, like a semicolon, would cause every line to be parsed incorrectly, not just the lines where specific characters appear within the text fields.
- Type-casting failures occur after the file has been successfully parsed into a tabular structure and the system attempts to assign data types. The problem described is a parsing failure, which happens before type inference.
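A minimal sketch with Python's standard csv module (using the strict dialect option) illustrating the RFC 4180 quote-doubling rule described above:

```python
import csv
import io

# RFC 4180: a double quote inside a quoted field must be escaped by doubling it.
good = 'id,feedback\n1,"Great product, ""fast"" shipping"\n'
bad = 'id,feedback\n1,"Great product, "fast" shipping"\n'   # inner quotes not doubled

# Parses cleanly into two rows; the doubled quotes become literal quotes in the field.
print(list(csv.reader(io.StringIO(good), strict=True)))

try:
    list(csv.reader(io.StringIO(bad), strict=True))
except csv.Error as exc:
    print("parse failure:", exc)   # a strict parser rejects the malformed quoting
```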
Ask Bash
What does 'escaping' mean in the context of CSV files?
Why does a UTF-8 BOM not cause parsing errors like unescaped double quotes?
What role does RFC 4180 play in CSV file formatting?
A data science team at an e-commerce company is tasked with measuring the success of a new customer loyalty program. The primary business objective is to 'significantly boost repeat business profitability'. After an initial discovery phase, the team is tracking several data points. Which of the following represents the most effective Key Performance Indicator (KPI) for the stated business objective?
The total number of weekly transactions made by loyalty program members.
The ratio of active loyalty program members to total registered users.
The month-over-month growth rate of new sign-ups for the loyalty program.
A 15% increase in the average Customer Lifetime Value (CLV) for loyalty program members over the next fiscal year.
Answer Description
The correct answer is the option that outlines a 15% increase in the average Customer Lifetime Value (CLV) for loyalty program members over the next fiscal year. A Key Performance Indicator (KPI) is a quantifiable measure of performance over time for a specific strategic objective. The primary objective is to 'boost repeat business profitability'. Increasing the average CLV directly measures long-term customer value and profitability, and setting a specific, measurable, and time-bound target (15% in the next fiscal year) makes it a true KPI.
The other options are less effective as KPIs for this specific goal:
- The total number of weekly transactions is a simple metric or measure. It tracks volume but not profitability; many low-value transactions might not meet the business goal.
- The month-over-month growth rate of new sign-ups is a metric that can be considered a 'vanity metric' in this context. It shows program adoption but does not guarantee that these new members are profitable or retained long-term.
- The ratio of active loyalty members to total users is an excellent engagement metric. However, it measures program activity, not the direct financial impact on profitability, making it a supporting metric rather than the primary KPI for the stated objective.
Ask Bash
Why is Customer Lifetime Value (CLV) an effective KPI for profitability?
What is the difference between a KPI and a metric?
How can a loyalty program improve Customer Lifetime Value (CLV)?
A machine learning engineer is training a deep neural network. The process involves a forward pass to generate predictions, a loss function to quantify error, and a backward pass to learn from that error. Within this training loop, what is the primary computational contribution of the backpropagation algorithm itself?
To apply an optimization rule, such as momentum or Adam, to update the network's parameters.
To normalize the activations of hidden layers to ensure a stable distribution of inputs during training.
To determine the initial error value by comparing the network's final output with the ground-truth labels.
To efficiently calculate the gradient of the loss function with respect to every weight and bias in the network.
Answer Description
The correct answer is that backpropagation's primary role is to efficiently compute the gradient of the loss function with respect to every parameter (weights and biases) in the network. It does this by applying the chain rule of calculus, starting from the output layer and working backward.
- The option suggesting that backpropagation applies an optimization rule like Adam is incorrect. Backpropagation calculates the gradients, but the optimization algorithm (like Adam or SGD) is a separate component that uses these gradients to update the network's parameters.
- The option about determining the initial error value describes the loss calculation step, which happens after the forward pass but before the backward pass and backpropagation.
- The option referring to normalizing activations describes Batch Normalization, which is a separate technique used to stabilize training, not the function of backpropagation.
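A tiny numerical sketch (single weight and bias, sigmoid activation, squared-error loss) that separates the gradient computation backpropagation performs from the parameter update a separate optimizer performs afterward:

```python
import numpy as np

x, y = 1.5, 1.0          # one input and its ground-truth target
w, b = 0.3, 0.1          # trainable parameters

# Forward pass: prediction and loss
z = w * x + b
y_hat = 1.0 / (1.0 + np.exp(-z))
loss = (y_hat - y) ** 2

# Backward pass (backpropagation = chain rule): gradients only
dL_dyhat = 2.0 * (y_hat - y)
dyhat_dz = y_hat * (1.0 - y_hat)
dL_dw = dL_dyhat * dyhat_dz * x
dL_db = dL_dyhat * dyhat_dz

# A separate optimizer (plain SGD here) then uses those gradients to update parameters.
lr = 0.1
w -= lr * dL_dw
b -= lr * dL_db
print(loss, dL_dw, dL_db)
```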
Ask Bash
What is backpropagation and how does it work?
What is the chain rule and why is it important in backpropagation?
How does backpropagation differ from an optimization algorithm like Adam?
You have inherited a helper function named standardize_df() in a production feature-engineering library. The function must (1) subtract the mean and divide by the population standard deviation for every numeric column and (2) leave any column whose standard deviation is exactly zero unchanged to avoid division-by-zero problems. You are charged with adding a single PyTest unit test that delivers the strongest regression-catching power while still following unit-testing best practices (deterministic data, small scope, Arrange-Act-Assert structure, no unnecessary external libraries). Which test design best satisfies these requirements?
Apply scikit-learn's StandardScaler to a different DataFrame and assert that its output equals the output of standardize_df.
Generate 10 000 random rows, call standardize_df, and assert only that the output DataFrame has the same shape as the input.
Set numpy.random.seed(0) inside the test and simply check that standardize_df executes without raising an exception.
Construct a small DataFrame with one constant and one varying numeric column, run standardize_df, then assert with pytest.approx that the varying column now has mean 0 and std 1 and that the constant column is identical to the original.
Answer Description
The most effective test is the one that deliberately constructs a minimal, fully deterministic DataFrame that exercises both documented behaviors and then makes precise assertions. A two-column frame - one varying column and one constant (zero-variance) column - hits the boundary case that would otherwise raise division-by-zero errors. Using pytest.approx (or an equivalent tolerance-aware comparison) to assert that the varying column's mean ≈ 0 and std ≈ 1 confirms correct standardization, while Series.equals on the constant column verifies that it was left untouched. This follows the Arrange-Act-Assert pattern, uses no external services, and will fail loudly if any future change alters the algorithm.
The other choices either observe only superficial properties (shape), rely on an additional library (StandardScaler) so that failures could originate outside the function under test, or make no content-specific assertion at all; hence they provide far less defect detection value.
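A minimal sketch of the winning test design, assuming standardize_df is importable from a hypothetical feature_lib module, returns a new DataFrame, and uses the population standard deviation (ddof=0):

```python
import pandas as pd
import pytest

from feature_lib import standardize_df   # hypothetical import path


def test_standardize_df_handles_constant_and_varying_columns():
    # Arrange: deterministic two-column frame covering both documented behaviors
    df = pd.DataFrame(
        {"varying": [1.0, 2.0, 3.0, 4.0], "constant": [5.0, 5.0, 5.0, 5.0]}
    )

    # Act
    result = standardize_df(df)

    # Assert: the varying column is standardized with the population std (ddof=0)
    assert result["varying"].mean() == pytest.approx(0.0)
    assert result["varying"].std(ddof=0) == pytest.approx(1.0)
    # Assert: the zero-variance column is left untouched
    assert result["constant"].equals(df["constant"])
```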
Ask Bash
What is the purpose of `pytest.approx` in the unit test?
Why is having a constant column (zero-variance) important in the test?
Why are deterministic data and the Arrange-Act-Assert structure crucial for unit tests?
A data scientist wants to report a two-sided 95% confidence interval for the true population Pearson correlation between two numerical features. In a random sample of n = 60 observations, the sample correlation is r = 0.58. To use standard normal critical values, which pre-processing step should be applied to the correlation estimate before constructing the confidence interval?
Transform r with Fisher's inverse hyperbolic tangent (z-transformation), build the interval in the transformed space, then back-transform the interval's endpoints.
Multiply r by √(n−2)/√(1−r²) and treat the result as standard normal when forming the interval.
Use a Box-Cox transformation on each variable so that the resulting correlation can be treated as normally distributed.
Apply the Wilson score method directly to r to obtain the interval.
Answer Description
Because the sampling distribution of Pearson's r is skewed and its variance depends on the unknown population correlation (ρ), a direct calculation using normal theory is inappropriate. Fisher's z-transformation, z = atanh(r) = ½ ln[(1+r)/(1−r)], is a variance-stabilizing transform that makes the resulting statistic, z, approximately normally distributed as N(atanh(ρ), 1/(n−3)). A 95% interval for this transformed value is therefore z ± 1.96 / √(n−3). Applying the inverse transform (tanh) to the interval's endpoints yields the confidence interval for ρ. The Wilson score interval is designed for binomial proportions. A Box-Cox transformation applies to the raw data, not the correlation coefficient r. The statistic r√(n−2)/√(1−r²) follows a t-distribution and is used for hypothesis testing, not interval estimation.
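A short NumPy sketch of the calculation for r = 0.58 and n = 60, which yields an interval of roughly 0.38 to 0.73:

```python
import numpy as np

r, n = 0.58, 60
z = np.arctanh(r)                  # Fisher transform: 0.5 * ln((1 + r) / (1 - r))
se = 1 / np.sqrt(n - 3)            # standard error in the transformed space
lo, hi = z - 1.96 * se, z + 1.96 * se
ci = np.tanh([lo, hi])             # back-transform the endpoints
print(ci)                          # approximately [0.38, 0.73]
```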
Ask Bash
What is Fisher's z-transformation and why is it used for correlations?
Why can't the Wilson score method or Box-Cox transformation be used in this case?
What is the role of sample size (n) in constructing the confidence interval for correlation?
A data science team is tasked with determining if a new, computationally intensive recommendation algorithm causes a statistically significant increase in user engagement compared to the current algorithm. To generate the data needed for this analysis, the team plans to deploy the new algorithm to a segment of users. Which of the following is the most critical component of the experimental design to ensure the resulting data can be used to infer causality?
Randomly assigning users to either the new algorithm (treatment group) or the existing algorithm (control group).
Formulating a precise null hypothesis and an alternative hypothesis with a defined p-value threshold for significance.
Selecting the most active users for the new algorithm's group to maximize the potential observable impact.
Implementing detailed logging to capture all user interactions with the recommendations, such as clicks and hover time.
Answer Description
The correct answer is the random assignment of users to either the treatment group (new algorithm) or the control group (existing algorithm). This process is the cornerstone of a randomized controlled trial (RCT), which is the gold standard for establishing a causal relationship. Randomization helps ensure that, on average, both groups are similar in all respects (both known and unknown characteristics) before the experiment begins. This minimizes the risk of confounding variables-external factors that could influence user engagement and be mistaken for an effect of the new algorithm. By isolating the independent variable (the algorithm type), any statistically significant difference in engagement observed between the groups can be confidently attributed to the algorithm, thus supporting a causal inference.
- Implementing detailed logging is crucial for measuring the outcome (the dependent variable), but it does not in itself validate the experimental setup for causal inference. Without randomization, you cannot rule out that differences in engagement were caused by pre-existing differences between user groups rather than the algorithm itself.
- Selecting the most active users for the new algorithm introduces severe selection bias. Any observed increase in engagement in this group would be confounded by their inherent high activity levels, making it impossible to determine if the new algorithm had any real effect. The results would not be generalizable.
- Formulating a hypothesis is a critical step in the scientific method and precedes the experiment, but it is part of the analytical framework, not the data generation design. A well-formed hypothesis is meaningless if the data used to test it is collected in a biased manner that invalidates the results.
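For context, a brief sketch of how the random assignment and the follow-up comparison might look; the user counts, engagement values, and the Welch t-test are illustrative assumptions, not part of the question:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
user_ids = np.arange(10_000)
shuffled = rng.permutation(user_ids)                # randomize first, then split
treatment_ids, control_ids = shuffled[:5_000], shuffled[5_000:]

# After the experiment window, engagement is measured per group (synthetic here).
treat_engagement = rng.normal(5.2, 1.5, size=5_000)
ctrl_engagement = rng.normal(5.0, 1.5, size=5_000)

t_stat, p_value = stats.ttest_ind(treat_engagement, ctrl_engagement, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```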
Ask Bash
Why is random assignment important in experiments?
What are confounding variables and how do they affect causal inference?
What is a randomized controlled trial (RCT) and why is it considered the gold standard?
During a model audit, you examine the first convolutional layer of an image-classification network. The layer receives a 128×128×3 input and applies 64 kernels of size 5×5 with stride 1 and "same" padding so that the spatial resolution of the output remains 128×128. Bias terms are present (one per kernel), but you must report only the number of trainable weights excluding biases in this layer. How many weights does the layer contain?
78 643 200
9 600
4 800
1 600
Answer Description
A 2-D convolutional layer learns one set of weights per filter. The number of weights per filter equals kernel_height × kernel_width × input_channels.
- Each filter: 5 × 5 × 3 = 75 weights.
- Number of filters: 64.
Total weights = 75 × 64 = 4 800.
The other values arise from common mistakes: 1 600 ignores the three input channels (25 × 64); 9 600 double-counts parameters by multiplying by an incorrect factor; 78 643 200 assumes every output neuron has its own kernel instead of sharing parameters, eliminating the key efficiency of CNNs.
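The arithmetic can be sanity-checked with a few lines of Python; the helper below is just the weight-sharing formula, not any specific framework's API:

```python
def conv2d_weight_count(kernel_h, kernel_w, in_channels, n_filters):
    # Weights are shared across all 128x128 output positions, so the spatial
    # size never enters the count; the 64 biases are excluded as requested.
    return kernel_h * kernel_w * in_channels * n_filters

print(conv2d_weight_count(5, 5, 3, 64))   # 5 * 5 * 3 * 64 = 4800
```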
Ask Bash
What does 'stride' mean in a convolutional layer?
Why is 'same' padding used, and how does it preserve spatial resolution?
What is the role of kernels in a convolutional layer?
You have just finished training a logistic-regression model that flags potentially fraudulent B2B invoices. For next week's 10-minute board meeting, the CFO wants one slide that instantly shows how many legitimate invoices would be held for manual review (false positives) and how many fraudulent invoices the model might miss (false negatives). Several board members are color-blind and have little time for technical explanations. Which visualization and design choice will best satisfy these communication requirements?
An annotated confusion-matrix heatmap that uses a color-blind-safe blue-orange palette and displays the four cell counts in large text.
A scatter plot of predicted fraud probability versus invoice amount, colored by the model's predicted class labels.
A 3-D stacked pie chart that uses red and green slices to depict true positives, false positives, true negatives, and false negatives.
An ROC curve showing the area under the curve (AUC) with an interactive threshold slider.
Answer Description
A confusion-matrix heatmap uses the same 2×2 layout as the underlying error counts, so non-technical executives can immediately map each cell to a real-world outcome (true/false and positive/negative). Showing the counts as large annotations removes the need for the audience to estimate values from color alone, and selecting a blue-orange (or other color-blind-safe) palette ensures that board members with red-green deficiencies see clear contrast. ROC curves and probability scatterplots require an understanding of thresholds or continuous scores that most executives do not possess. A 3-D pie chart exaggerates areas, makes comparison difficult, and relies on red/green hues that many viewers cannot distinguish. Therefore, the annotated, color-blind-friendly confusion-matrix heatmap is the most effective and accessible choice.
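One way such a slide could be drafted with matplotlib; the cell counts are invented placeholders, and the sequential "Blues" map stands in for any colour-blind-safe palette:

```python
import numpy as np
import matplotlib.pyplot as plt

# Rows: actual class, columns: predicted class (counts are placeholders).
cm = np.array([[9120, 310],   # legitimate: true negatives, false positives (held for review)
               [45, 525]])    # fraudulent: false negatives (missed fraud), true positives
labels = ["Legitimate", "Fraudulent"]

fig, ax = plt.subplots()
ax.imshow(cm, cmap="Blues")                     # sequential blues read clearly for color-blind viewers
for i in range(2):
    for j in range(2):
        ax.text(j, i, f"{int(cm[i, j]):,}", ha="center", va="center", fontsize=18)
ax.set_xticks([0, 1]); ax.set_xticklabels(labels)
ax.set_yticks([0, 1]); ax.set_yticklabels(labels)
ax.set_xlabel("Predicted"); ax.set_ylabel("Actual")
ax.set_title("Invoices held for review vs. fraud missed")
plt.show()
```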
Ask Bash
What is a confusion matrix and why is it useful?
Why is a color-blind-safe palette important in data visualization?
Why are ROC curves or 3-D pie charts unsuitable for executive audiences?
A data scientist is developing a model to classify product images into 150 distinct categories. The chosen base algorithm is a Support Vector Machine (SVM), which is inherently a binary classifier. The development team is operating under significant computational resource constraints, making training time a primary concern. Which multiclass classification strategy is the most appropriate choice for adapting the SVM model in this scenario?
One-vs-Rest (OvR)
One-vs-One (OvO)
Error-Correcting Output Codes (ECOC)
Multinomial Logistic Regression
Answer Description
The correct answer is One-vs-Rest (OvR). In a multiclass classification problem with 'K' classes, the OvR strategy trains 'K' individual binary classifiers. Each classifier is trained to distinguish one class from the remaining 'K-1' classes. For this scenario with 150 classes, OvR would require training 150 SVM models. The One-vs-One (OvO) strategy trains a binary classifier for every pair of classes, resulting in K*(K-1)/2 classifiers. For 150 classes, this would be 150*149/2 = 11,175 classifiers, which is far more computationally expensive to train than OvR. While each OvO classifier is trained on a smaller subset of data, the sheer number of classifiers makes it less efficient for problems with a large number of classes. Multinomial Logistic Regression is an inherently multiclass algorithm and is not a strategy for adapting a binary classifier like SVM. Error-Correcting Output Codes (ECOC) is a more complex ensemble method that can be more robust but is generally more computationally intensive to train than OvR, as it often involves training more than 'K' classifiers.
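A brief sketch of the classifier counts, plus the scikit-learn wrapper that implements OvR (LinearSVC is an assumed choice of SVM implementation):

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

K = 150
print("OvR binary classifiers:", K)                 # 150
print("OvO binary classifiers:", K * (K - 1) // 2)  # 11,175

# Wrap the binary SVM; training would proceed with ovr.fit(X_train, y_train).
ovr = OneVsRestClassifier(LinearSVC())
```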
Ask Bash
Why is One-vs-Rest (OvR) preferred over One-vs-One (OvO) in this scenario?
How does a binary classifier like SVM handle multiclass problems using the OvR approach?
What are the trade-offs of using One-vs-Rest (OvR) for multiclass classification?
A data science team is designing a data lake architecture on a distributed file system to store terabytes of structured event data for analytical querying. The primary use case involves running complex, read-heavy queries for feature engineering, which frequently select a small subset of columns from a wide table containing over 200 columns. The system must also support schema evolution as new event properties are added over time. Given these requirements, which data format is the most appropriate for storing the processed data in the data lake to optimize query performance and storage efficiency?
CSV
Avro
Parquet
JSON
Answer Description
The correct answer is Parquet. Parquet is a columnar storage format specifically designed for efficient data storage and retrieval in analytical workloads. Its columnar nature allows query engines to read only the necessary columns to satisfy a query, which drastically reduces I/O and improves performance, especially for wide tables where only a subset of columns is accessed. Parquet also offers excellent compression and supports schema evolution, making it the ideal choice for this scenario.
- Avro is an incorrect choice because it is a row-based storage format. While it is efficient for write-heavy workloads and data serialization (like in streaming pipelines), its row-based nature requires reading entire rows of data, which is inefficient for analytical queries that only need a few columns from a wide table.
- JSON is incorrect because, although it supports schema flexibility and nested data, it is a text-based, row-oriented format. It is more verbose and significantly less performant for large-scale analytical queries compared to binary, columnar formats like Parquet.
- CSV is incorrect as it is a simple, text-based, row-oriented format. It is inefficient for querying subsets of columns from large, wide datasets and lacks robust support for schema evolution or data typing.
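A small pandas sketch (pyarrow assumed installed; column names illustrative) showing the column-pruning benefit when reading Parquet:

```python
import pandas as pd

# Tiny stand-in for the wide event table; the real table would have 200+ columns.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "event_time": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "event_type": ["click", "view", "purchase"],
    "extra_property": [0.1, 0.2, 0.3],
})

df.to_parquet("events.parquet", engine="pyarrow", compression="snappy")

# Column pruning: only the requested columns are read from disk.
subset = pd.read_parquet("events.parquet", columns=["user_id", "event_type"])
print(subset)
```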
Ask Bash
Why is Parquet considered a columnar format?
What is schema evolution and how does Parquet handle it?
Why is Avro unsuitable for read-heavy analytical queries?
Your data science team must release a patient readmission data set to external researchers. The file currently contains direct identifiers (patient name, Social Security number) and quasi-identifiers (full date of birth, 5-digit ZIP code). Compliance requires that (a) the released data no longer be considered protected health information (PHI) under the HIPAA Privacy Rule, so no patient authorization is needed, and (b) age and regional patterns remain analytically useful. Which preprocessing approach best meets both requirements?
Encrypt the entire data set with AES-256 and provide researchers with the decryption key after they sign a data-use agreement.
Replace each name and Social Security number with a random UUID but keep the full date of birth and 5-digit ZIP code intact.
Mask all but the last four digits of each Social Security number and hash patient names with SHA-256 while leaving other fields unchanged.
Remove the 18 identifiers listed in 45 CFR §164.514(b)(2), convert dates of birth to age in years, and truncate ZIP codes to their first three digits when the corresponding area exceeds 20,000 residents.
Answer Description
Under the HIPAA Safe Harbor de-identification method in 45 CFR §164.514(b)(2), removing 18 specific identifiers-and, for geographic and temporal data, reducing them to coarser values-renders the data "not individually identifiable." Once those identifiers are removed and ZIP codes are truncated to the first three digits (when the combined area has >20,000 residents) and dates are generalized (for example, converting date of birth to age in years), the resulting data set is no longer PHI and may be disclosed without individual authorization. Pseudonymization that leaves full dates of birth and 5-digit ZIP codes is reversible and therefore still PHI; encrypting the file or partially masking identifiers preserves the underlying direct identifiers, so the data remain PHI unless every user lacks the key. Only the Safe Harbor-compliant transformation both removes the regulatory burden and preserves useful aggregated age and location information.
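A hedged pandas sketch of this Safe Harbor-style generalization; the sample record, reference date, and the set of restricted three-digit ZIP prefixes are illustrative assumptions:

```python
import pandas as pd

# Hypothetical subset of three-digit ZIP areas with 20,000 or fewer residents,
# which Safe Harbor requires to be replaced with "000".
RESTRICTED_ZIP3 = {"036", "059", "102"}
REFERENCE_DATE = pd.Timestamp("2024-07-01")

df = pd.DataFrame({
    "name": ["A. Smith"],
    "ssn": ["000-00-0000"],
    "date_of_birth": pd.to_datetime(["1958-04-17"]),
    "zip_code": ["10214"],
    "readmitted": [True],
})

zip3 = df["zip_code"].str[:3]
deidentified = (
    df.drop(columns=["name", "ssn"])                                   # strip direct identifiers
      .assign(
          age_years=(REFERENCE_DATE - df["date_of_birth"]).dt.days // 365,   # generalize DOB to age
          zip3=zip3.where(~zip3.isin(RESTRICTED_ZIP3), "000"),               # truncate ZIP codes
      )
      .drop(columns=["date_of_birth", "zip_code"])
)
print(deidentified)
```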
Ask Bash
Why is truncating ZIP codes to the first three digits necessary under HIPAA?
What is the significance of converting dates of birth to age in years?
What are the '18 identifiers' referenced in HIPAA Safe Harbor?
A data science team is developing a predictive model for equipment failure using a single, unpruned decision tree. During testing, they observe two phenomena:
- The model achieves near-perfect accuracy on the training dataset but performs poorly on the unseen validation dataset.
- Minor changes to the training data, such as removing a small number of data points, result in a drastically different tree structure and predictions.
Which underlying characteristic of decision trees is the primary cause of both of these observations?
High bias
Multicollinearity
High variance
The curse of dimensionality
Answer Description
The correct answer is high variance. High variance in a model means it is highly sensitive to fluctuations in the training data. This sensitivity causes two primary effects seen in unpruned decision trees. First, the model learns the training data, including its noise, too well, which leads to overfitting. This explains why the model has high accuracy on the training set but generalizes poorly to new, unseen data. Second, because the model is so closely fitted to the specific training data, even small changes to that data can lead to significant changes in the model's structure and predictions, a behavior known as instability.
- High bias is incorrect. High bias refers to underfitting, where the model is too simple to capture the underlying patterns in the data. This would result in poor performance on both the training and validation sets, which contradicts the scenario.
- The curse of dimensionality refers to problems that arise when working with high-dimensional data, such as data sparsity and increased computational cost. While it can impact model performance, it is not the direct cause of a model's instability and overfitting in the way high variance is.
- Multicollinearity, the correlation between predictor variables, can affect the stability of a decision tree's feature selection and interpretability but is not the fundamental reason for overfitting and sensitivity to data changes. The core issue described is high variance, for which decision trees are well-known.
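A short scikit-learn sketch of that instability: two unpruned trees fit on nearly identical training subsets can disagree on a noticeable share of test predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
preds = []
for _ in range(2):
    # Drop five training points at random and refit an unpruned tree.
    idx = rng.choice(len(X_train), size=len(X_train) - 5, replace=False)
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    preds.append(tree.predict(X_test))

disagreement = np.mean(preds[0] != preds[1])
print(f"fraction of test predictions that changed: {disagreement:.2%}")
```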
Ask Bash
What is high variance in machine learning models?
Why are decision trees prone to instability?
How can high variance in decision trees be reduced?
A quantitative analyst is modeling a company's monthly sales data from the last decade. A time series plot reveals a consistent upward trend, and an Augmented Dickey-Fuller (ADF) test confirms the series is non-stationary. The analyst plans to use an Autoregressive Integrated Moving Average (ARIMA) model for forecasting. To address the identified non-stationarity, which configuration choice for the ARIMA(p, d, q) model is the most critical first step?
Determine the autoregressive order, p, by examining the Partial Autocorrelation Function (PACF) of the original series.
Set the differencing order, d, to a value of 1 or more to make the series stationary.
Select the moving average order, q, by inspecting the Autocorrelation Function (ACF) of the original series.
Apply a Box-Cox transformation to the series to remove the stochastic trend.
Answer Description
The correct answer is to set the differencing order, d, to a value of 1 or more. The 'I' in ARIMA stands for 'Integrated' and represents the differencing applied to the raw observations to make the time series stationary. Since the ADF test confirmed the series is non-stationary due to a trend, applying one or more orders of differencing (setting d >= 1) is the essential first step to stabilize the mean of the series. Only after the series is stationary should the analyst examine the ACF and PACF plots of the differenced series to determine the appropriate p and q values. Analyzing the ACF and PACF plots of the original, non-stationary series would be misleading. A Box-Cox transformation is used to stabilize non-constant variance (heteroskedasticity), not to remove a trend, which is a form of non-stationarity in the mean.
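A compact statsmodels sketch on a synthetic trending monthly series; the order (1, d, 1) is only a placeholder, since p and q would later be read from the ACF/PACF of the differenced series:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA

# Synthetic trending monthly series standing in for the sales data.
idx = pd.date_range("2014-01-01", periods=120, freq="MS")
rng = np.random.default_rng(0)
sales = pd.Series(50 + 2.0 * np.arange(120) + rng.normal(0, 5, 120), index=idx)

p_value = adfuller(sales)[1]          # high p-value: fail to reject the unit-root null
d = 1 if p_value > 0.05 else 0        # difference once when the series is non-stationary
print(f"ADF p-value = {p_value:.3f}, using d = {d}")

model = ARIMA(sales, order=(1, d, 1)).fit()
print(model.params)
```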
Ask Bash
What does 'd' in ARIMA(p, d, q) represent?
What is the purpose of the Augmented Dickey-Fuller (ADF) test?
Why can't we use the ACF and PACF plots of a non-stationary series to determine 'p' and 'q'?
You are deploying an orchestration platform to automate hourly data-ingestion pipelines. A six-hour network outage prevents the 03:00-08:00 runs from executing. When connectivity returns, the business wants the orchestrator to automatically create and execute pipeline instances for each missed hour in chronological order, with no manual triggers or code changes. Which built-in orchestration capability is specifically designed to satisfy this requirement?
Service-level-agreement (SLA) miss callbacks that fire alerts when a task runs too long
Branching operators that choose between alternative execution paths during the workflow
Catch-up (backfill) scheduling that automatically creates runs for every missed time interval
Dynamic task mapping that expands a single task into parallel subtasks at runtime
Answer Description
Most workflow-orchestration tools (for example, Apache Airflow, Prefect, and Dagster) provide a catch-up or backfill feature. When enabled, the scheduler inspects the schedule for any intervals that have not yet been processed and automatically generates separate pipeline (DAG/flow) runs for each missing interval. Those runs are queued and executed in timestamp order, honoring task dependencies exactly as they would have run originally. Dynamic task mapping, branching, and SLA-miss callbacks are unrelated: they affect how individual tasks behave, not how the scheduler recovers entire missed pipeline intervals. Therefore, enabling catch-up/backfill directly meets the requirement to re-run all six missed hourly pipelines without manual intervention.
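For illustration, a minimal DAG for a recent Apache Airflow 2.x release with catch-up enabled; the DAG id and task are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="hourly_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=True,         # scheduler creates a run for every missed hourly interval
    max_active_runs=1,    # execute the backfilled runs one at a time, in order
) as dag:
    ingest = EmptyOperator(task_id="ingest_feedback")
```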
Ask Bash
What is 'catch-up' (backfill) scheduling in an orchestration tool?
What is the difference between catch-up scheduling and dynamic task mapping?
How does catch-up scheduling maintain task dependencies?
A data scientist is developing an ordinary least-squares model to predict daily revenue, a strictly positive continuous variable. The revenue distribution is highly right-skewed and, after an initial linear fit, the residual-versus-fitted plot shows a wedge-shaped pattern that widens as fitted values increase, indicating heteroscedasticity. The scientist needs a single data transformation on the response variable that (1) can stabilize the variance and approximate normality and (2) lets the optimal transformation be chosen from a continuum of power functions using maximum-likelihood estimation. Which transformation should be applied before refitting the model?
Standardize the revenue variable with a z-score (mean 0, standard deviation 1).
Rescale the revenue variable to the 0-1 range with min-max normalization.
Apply a Box-Cox power transformation to the revenue variable.
Take the natural logarithm (ln) of the revenue variable.
Answer Description
The Box-Cox transformation is a family of power transforms, y^(λ) = (y^λ − 1)/λ, with the limiting case λ = 0 equal to the natural log. Because it includes a tunable parameter λ, practitioners can estimate the value that maximizes the likelihood of the data, producing a variance-stabilizing, near-normal response. This directly addresses the wedge-shaped heteroscedasticity evident in the residual plot. A plain logarithmic transform is a special case of Box-Cox but fixes λ at 0, removing the ability to search for a better power. Z-score standardization and min-max scaling only change location and scale; they neither normalize a skewed distribution nor correct non-constant error variance.
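A short SciPy sketch on synthetic right-skewed revenue, with λ estimated by maximum likelihood:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
revenue = rng.lognormal(mean=3.0, sigma=0.6, size=500)   # strictly positive, right-skewed

transformed, lam = stats.boxcox(revenue)   # lambda chosen by maximum-likelihood estimation
print(f"estimated lambda: {lam:.3f}")      # a lambda near 0 would reproduce the log transform
```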
Ask Bash
What is heteroscedasticity and why is it problematic in regression models?
Why is the Box-Cox transformation preferred over a fixed transformation like natural logarithm?
What are the limitations of z-score standardization and min-max normalization in addressing heteroscedasticity?
A data scientist is building a logistic regression model to detect fraudulent financial transactions. The model uses four features: age, account_balance, number_of_monthly_transactions, and average_transaction_amount. An initial exploratory data analysis using box plots for each individual feature reveals no significant outliers. However, the model's performance is unexpectedly poor, and a residuals vs. leverage plot indicates that a few data points have an unusually high influence on the model's coefficients.
Given this scenario, which of the following methods is the MOST appropriate for identifying these influential, problematic data points?
Apply a Box-Cox transformation to each feature.
Calculate the Mahalanobis distance for each data point.
Implement an Isolation Forest algorithm on the dataset.
Generate a scatter plot matrix of all feature pairs.
Answer Description
The correct answer is to calculate the Mahalanobis distance for each data point. Mahalanobis distance is a multivariate outlier detection method that measures the distance of a point from the center of a distribution (the centroid), while accounting for the correlation between the variables. In this scenario, since univariate analysis showed no outliers, the problem is likely due to an unusual combination of feature values (e.g., a young person with an extremely high account balance and transaction frequency), which is exactly what Mahalanobis distance is designed to detect. These multivariate outliers can exert high leverage on regression models, which is consistent with the diagnostic plot findings.
- Generating a scatter plot matrix is a useful visualization technique but is limited to showing relationships between pairs of variables. It would not reliably identify outliers that only become apparent when considering three or more variables simultaneously.
- An Isolation Forest is a powerful, modern algorithm for anomaly detection. While it is effective for multivariate outliers, Mahalanobis distance is a more fundamental statistical measure directly related to the geometric influence of a point in a multivariate linear model context. For identifying influential points in a regression setting, Mahalanobis distance is the most direct and classic approach.
- Applying a Box-Cox transformation is a technique used to stabilize variance and make data more closely resemble a normal distribution. Its purpose is to transform the data to better meet model assumptions, not to identify which specific data points are outliers.
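A NumPy/SciPy sketch on synthetic correlated data; flagging points beyond the chi-square 99.9th percentile is a common convention but an assumption here:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
cov = np.full((4, 4), 0.6) + 0.4 * np.eye(4)          # four correlated features
X = rng.multivariate_normal(np.zeros(4), cov, size=500)

center = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - center
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)    # squared Mahalanobis distances

threshold = chi2.ppf(0.999, df=X.shape[1])            # cutoff under multivariate normality
print("flagged multivariate outliers:", np.where(d2 > threshold)[0])
```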
Ask Bash
What is Mahalanobis distance, and why is it useful in detecting multivariate outliers?
How does Mahalanobis distance differ from an Isolation Forest algorithm?
Why wouldn't a scatter plot matrix or Box-Cox transformation address the issue in this situation?
While tuning a CNN that classifies photographs of industrial defects, you observe that validation accuracy drops sharply whenever the defect is partly hidden by a worker's hand or tool, even though training loss remains low. You decide to add a masking-based data-augmentation step that follows the Random Erasing technique. Which configuration is MOST likely to increase robustness to partial occlusion without altering the ground-truth class labels?
Add a binary channel that records the location of an arbitrary mask and train the model to reconstruct the hidden pixels as a secondary objective.
At each forward pass, randomly zero-out a comparable fraction of convolutional filters in the network's first layer to simulate information loss.
With a fixed probability, overwrite one randomly located rectangular region covering roughly 2-20 % of every training image with random pixel values (or the per-channel dataset mean) while keeping the original label.
Replace all background pixels that fall outside each annotated bounding box with a uniform gray mask so that only the foreground object remains visible.
Answer Description
Random Erasing augments training data by selecting a random rectangle (typically 2-20 % of the image area with a random aspect ratio) and replacing its pixels with random values or the dataset's mean. The label remains unchanged, forcing the network to rely on contextual cues distributed across the image, which empirically improves robustness to occlusion and reduces over-fitting.
Replacing the entire background (second choice) changes the statistical structure of every image rather than introducing random occlusions and can even remove features that are useful for differentiation. Randomly dropping convolutional filters (third choice) is a form of network regularization (analogous to dropout) rather than data augmentation and does not teach the model to handle occluded inputs. Adding an explicit binary mask channel and reconstruction task (fourth choice) converts the problem into multitask learning and signals the network where the occlusion is, defeating the purpose of forcing invariance to hidden regions.
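With torchvision (assuming it is part of the training stack), the correct option maps naturally onto transforms.RandomErasing; the probability and scale values below are illustrative:

```python
from torchvision import transforms

# RandomErasing operates on tensors, so it follows ToTensor in the pipeline.
augment = transforms.Compose([
    transforms.ToTensor(),
    transforms.RandomErasing(
        p=0.5,                # apply to roughly half of the training images
        scale=(0.02, 0.20),   # erase a rectangle covering 2-20 % of the image area
        ratio=(0.3, 3.3),     # random aspect ratio for the erased rectangle
        value="random",       # fill with random pixel values; labels are left unchanged
    ),
])
```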
Ask Bash
What is the Random Erasing technique in data augmentation?
Why does Random Erasing improve robustness to occlusion over methods like masking the background?
Why is multitask learning not suitable for this specific problem?
A manufacturing company deployed a gradient-boosted model to predict bearing failures from streaming sensor data. Two weeks after a firmware update changed the calibration of the vibration sensors, the model's precision fell from 0.82 to 0.55 even though the proportion of actual failures in the field remained at 3.4 %. Subsequent analysis shows that the mean and variance of multiple vibration-related features have shifted by more than two standard deviations, but the conditional relationship between those features and the failure label appears unchanged. Which phenomenon is the most likely root cause of the model's performance degradation?
Model over-fitting resulting from excessively high variance during initial training
Data drift (covariate shift) caused by the firmware-induced change in input feature distributions
Concept drift because the physical mechanism of bearing failure has evolved
Data leakage introduced by inadvertently training on target-related features
Answer Description
The firmware update altered the statistical distribution of several input features (covariate shift) while the underlying mapping from features to the failure label stayed the same. This is the textbook definition of data drift. Data drift (also called feature or covariate drift) occurs when P(X) changes but P(Y|X) remains stationary; such a mismatch between the training and production input distribution causes a loss of predictive power even though the concept being modeled has not changed. Concept drift would require the relationship P(Y|X) itself to change, data leakage involves improper inclusion of future or target-related variables during training, and classic over-fitting/under-fitting problems originate in the training process rather than in a post-deployment shift in feature statistics.
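A brief monitoring sketch: a two-sample Kolmogorov-Smirnov test comparing training-time and post-firmware distributions of one vibration feature (the synthetic values and the 0.01 alert threshold are assumptions):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_vibration = rng.normal(loc=0.0, scale=1.0, size=5000)   # pre-update feature distribution
prod_vibration = rng.normal(loc=2.5, scale=1.3, size=5000)    # post-firmware shift in mean and variance

stat, p_value = ks_2samp(train_vibration, prod_vibration)
if p_value < 0.01:
    print(f"covariate shift detected (KS statistic = {stat:.3f})")
```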
Ask Bash
What is data drift in machine learning?
How does concept drift differ from data drift?
What steps can be taken to monitor and address data drift?
A machine-learning engineer must deploy a Docker-packaged real-time fraud-detection model. Traffic is usually near zero but can spike unpredictably to thousands of requests per second. The business wants to pay nothing while the service is idle yet keep end-to-end inference latency below 100 ms during spikes. Which cloud deployment approach best meets these requirements?
Deploy the container on a request-driven serverless container platform that supports automatic scale-to-zero (for example, Google Cloud Run or Azure Container Apps).
Host the container on a single, large bare-metal instance to eliminate virtualization overhead.
Provision a fleet of reserved virtual machines sized for the maximum anticipated peak load.
Run the model on a managed Kubernetes cluster with a fixed-size node pool sized for average traffic.
Answer Description
A request-driven serverless container platform-such as Google Cloud Run or Azure Container Apps-meets both goals. These services automatically scale container instances from zero to many within seconds and bill per request, so no compute costs accrue while the application is idle. You can also pin a small minimum instance count to keep one warm container ready, ensuring sub-100 ms response times even during sudden bursts. In contrast, reserving VM capacity or using a fixed-size Kubernetes node pool wastes money during idle periods, while a single bare-metal host cannot absorb large spikes.
Ask Bash
What is a serverless container platform?
How does auto-scaling work in serverless platforms?
What does scale-to-zero mean in the context of serverless computing?
Cool beans!
Looks like that's it! You can go back and review your answers or click the button below to grade your test.