A data scientist develops a binary classification model to predict critical equipment failures in a large manufacturing plant. These failures are extremely rare, making up only 0.5% of the instances in the historical data. After training, the model is evaluated on a holdout test set of 10,000 instances and achieves an overall accuracy of 99.5%. Which of the following is the most important and valid conclusion the data scientist should draw from this result?
The 99.5% accuracy is highly misleading due to the severe class imbalance and likely indicates that the model has little to no skill in predicting the rare failure events.
The high accuracy of 99.5% is a strong indicator that the model is overfitting to the training data and requires immediate regularization.
An accuracy of 99.5% is a good baseline, but the model should be retrained on a larger dataset to confirm its stability and performance.
The model's 99.5% accuracy demonstrates exceptional performance and can be confidently deployed to proactively manage equipment maintenance schedules.
The correct answer highlights a critical weakness of using accuracy as a primary evaluation metric on datasets with severe class imbalance, a phenomenon often called the 'accuracy paradox'. In this scenario, the positive class (failure) represents only 0.5% of the data (50 instances in a 10,000-instance test set), while the negative class (no failure) represents 99.5% (9,950 instances).

A trivial model that simply predicts the majority class ('no failure') for every single instance would achieve an accuracy of (0 True Positives + 9,950 True Negatives) / 10,000 total instances = 99.5%. However, this model would be completely useless, as it fails to identify any of the critical failures (0% Recall for the positive class). Therefore, the high accuracy score is misleading.

In such cases, it is essential to evaluate the model using metrics that are sensitive to performance on the minority class, such as Precision, Recall, F1-score, or the Matthews Correlation Coefficient (MCC), to get a true picture of the model's predictive utility.
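The arithmetic above can be verified with a minimal sketch in plain Python. It assumes the trivial majority-class model on the 10,000-instance test set described in the explanation (50 true failures, all predicted as 'no failure'):

```python
# Accuracy paradox: a trivial model that always predicts "no failure"
# on a test set with 9,950 negatives and 50 positives.
TP, FN = 0, 50      # all 50 real failures are missed
TN, FP = 9950, 0    # all 9,950 non-failures are "correctly" predicted

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN) if (TP + FN) else 0.0      # sensitivity on failures
precision = TP / (TP + FP) if (TP + FP) else 0.0   # undefined here; treated as 0

print(f"Accuracy:  {accuracy:.1%}")   # 99.5%, despite zero predictive skill
print(f"Recall:    {recall:.1%}")     # 0.0% of critical failures detected
```

The 99.5% accuracy exactly matches the question's result, while the 0% recall exposes that the model never catches a failure, which is why minority-class-sensitive metrics are required.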