You are building a sentiment classifier that must label customer-service tickets as Positive, Negative, or Neutral. In a corpus of 600 000 tickets, about 80 % are Neutral, 15 % Negative, and 5 % Positive. An LSTM model currently reports 81 % overall accuracy, but stakeholders want a single evaluation metric that is not dominated by the Neutral majority and instead gives each sentiment category equal influence on the final score. Which metric should you monitor during model development to satisfy this requirement?
A macro-averaged F1 score first computes the F1 for each class separately and then takes the unweighted mean. Because each class contributes equally, performance on Positive and Negative tickets is just as influential as Neutral, making this metric well suited to imbalanced multi-class sentiment problems. Overall accuracy and micro-averaged precision are dominated by the large Neutral class and can mask poor minority-class performance. A weighted F1 score partially addresses imbalance but still scales each class's contribution by its support, so the majority class would continue to drive the result, failing to meet the stakeholders' requirement.
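To make the contrast concrete, here is a minimal sketch (plain Python, no external libraries) that computes per-class F1 and the macro average on a tiny imbalanced sample shaped like the ticket corpus. The degenerate classifier that predicts Neutral for everything is a hypothetical illustration, not a claim about the LSTM in the question:

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores: every class counts equally."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)

# Tiny sample mirroring the imbalance: 8 Neutral, 1 Negative, 1 Positive.
# A classifier that predicts Neutral for every ticket looks accurate...
y_true = ["Neutral"] * 8 + ["Negative", "Positive"]
y_pred = ["Neutral"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")        # 0.80 — flattered by the majority class
print(f"macro F1: {macro_f1(y_true, y_pred):.2f}")  # 0.30 — exposes the failure
```

Accuracy rewards the all-Neutral strategy (0.80), while macro F1 collapses to about 0.30 because Negative and Positive each contribute an F1 of zero. With scikit-learn, the same result comes from `f1_score(y_true, y_pred, average="macro")`.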