A data scientist is tasked with building a multi-class classification model to categorize customer support tickets into 10 distinct types. The dataset is highly imbalanced; some ticket types represent over 40% of the data, while three critical but rare types each account for less than 1%. The primary business requirement is to ensure the model performs well across all categories, giving equal importance to both common and rare ticket types. Given this specific requirement, which statistical metric is the most appropriate for evaluating model performance during design iterations?
The correct answer is Macro-Averaged F1-Score. In a multi-class classification scenario with imbalanced data, the choice of evaluation metric is critical, and the stated business requirement is to give all classes, including the rare ones, equal importance.
Macro-Averaged F1-Score: This metric calculates the F1-score for each class independently and then computes their unweighted average. By doing so, it treats all classes equally, regardless of their size. This directly addresses the business need to evaluate performance on rare but critical categories fairly.
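As a concrete illustration, here is a minimal sketch using scikit-learn's f1_score. The three-class toy labels are invented for brevity and are not drawn from a real ticket dataset:

```python
from sklearn.metrics import f1_score

# Toy imbalanced labels (3 classes for brevity): class 0 is common,
# class 2 is rare, and the model misses the rare class entirely.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

# average="macro": compute per-class F1 first, then take an unweighted
# mean, so the rare class counts exactly as much as the common one.
# zero_division=0 silences the undefined-precision warning for the
# class that was never predicted.
print(f1_score(y_true, y_pred, average="macro", zero_division=0))
# ~0.62 -- dragged down by the completely missed rare class
```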
Micro-Averaged F1-Score: This metric aggregates the counts of true positives, false negatives, and false positives across all classes before calculating a single F1-score. In an imbalanced dataset, this score will be dominated by the performance on the majority classes. For single-label, multi-class problems, the micro-F1 score is equivalent to overall accuracy.
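That equivalence is easy to verify with a quick check on the same illustrative toy labels:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

# average="micro" pools true positives, false positives, and false
# negatives across all classes; for single-label multi-class data
# this matches plain accuracy.
print(f1_score(y_true, y_pred, average="micro"))  # 0.8
print(accuracy_score(y_true, y_pred))             # 0.8 -- identical
```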
Overall Accuracy: This is the ratio of correct predictions to the total number of predictions. It is not suitable for imbalanced datasets because a model can achieve a high accuracy score by simply predicting the majority class, while failing completely on minority classes.
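A short sketch of this failure mode, again with invented toy data: a degenerate baseline that always predicts the majority class scores high accuracy while its macro F1 collapses:

```python
from sklearn.metrics import accuracy_score, f1_score

# 96 tickets of the majority type, 4 of a rare critical type.
y_true = [0] * 96 + [1] * 4
y_pred = [0] * 100  # a "model" that always predicts the majority class

print(accuracy_score(y_true, y_pred))  # 0.96 -- looks strong
print(f1_score(y_true, y_pred, average="macro", zero_division=0))
# ~0.49 -- the rare class's F1 of 0 reveals the failure
```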
R-squared: Also known as the coefficient of determination, R-squared is a metric used to evaluate the performance of regression models, not classification models. It measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
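For reference, the standard definition (the residual sum of squares over the total sum of squares, subtracted from one):

$$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$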