You are developing a binary classifier that flags defective parts on an assembly line. Historical data show that only about 0.5% of the parts are actually defective. Your first model reports an overall accuracy of 99.4% and an AUROC of 0.93, yet quality-control engineers still find that many defects slip through. To obtain a more informative view of the model's ability to detect the rare defective parts and to guide further tuning, which single performance metric should you evaluate next, and why?
Log-loss (cross-entropy) - it is unaffected by class imbalance and therefore gives an unbiased measure of model quality.
Area under the ROC curve (AUROC) - it directly compensates for class imbalance by weighting false-negative errors more heavily.
Area under the Precision-Recall curve (AUPRC) - it emphasizes the trade-off between precision and recall and is not dominated by the vast number of true negatives.
Overall accuracy - it already reflects the combined effect of precision and recall over both classes.
Area under the Precision-Recall curve (AUPRC) summarizes how well a classifier balances precision (the proportion of flagged parts that are truly defective) against recall (the proportion of all defects that are found) across every possible decision threshold. Because precision and recall ignore true negatives, the metric is not inflated by the overwhelming number of non-defective parts, making it sensitive to the minority positive class.
Overall accuracy can be very high when the positive class is rare, even for a model that never predicts a defect. AUROC plots the true-positive rate (TPR) against the false-positive rate (FPR); with extreme class imbalance, an apparently strong AUROC can still correspond to very low precision, because even a small FPR translates into a flood of false alarms relative to the few true defects. Log-loss is dominated by the majority class unless class weights are applied, so it may mask poor positive-class performance. Therefore, AUPRC is the most informative single metric in this scenario.
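The effect described above can be demonstrated on synthetic data. The sketch below (a hypothetical illustration, assuming scikit-learn is available; the score distributions and defect rate are invented to mirror the scenario) shows that a "never flag anything" baseline scores high accuracy, while a model with a strong AUROC still has a much lower AUPRC on a 0.5%-positive dataset:

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 20_000
y = (rng.random(n) < 0.005).astype(int)  # ~0.5% defective, as in the question

# Hypothetical model scores: defective parts tend to score higher, imperfectly.
scores = np.where(y == 1, rng.normal(2.0, 1.0, n), rng.normal(0.0, 1.0, n))

# A trivial "never flag anything" baseline already looks excellent by accuracy.
acc_baseline = accuracy_score(y, np.zeros(n, dtype=int))

auroc = roc_auc_score(y, scores)
auprc = average_precision_score(y, scores)  # average precision, a common AUPRC estimate
print(f"baseline accuracy={acc_baseline:.3f}  AUROC={auroc:.3f}  AUPRC={auprc:.3f}")
```

On data like this, accuracy and AUROC both look strong while AUPRC stays modest, which is exactly the gap the quality-control engineers are observing.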