You are building an LDA-based topic exploration dashboard for a corpus of 50,000 product reviews. During hyperparameter tuning you train several models with 20 to 150 topics. Perplexity on a held-out validation set keeps decreasing as more topics are added, yet domain experts say the extra topics become redundant and semantically confusing beyond roughly 80. Which additional quantitative metric should you add to the tuning pipeline so that the automatically selected model better tracks human interpretability of the topics?
Measure the BLEU score between the model's predicted words and the original review sentences.
Rely solely on the validation perplexity because lower perplexity always implies more interpretable topics.
Use the silhouette coefficient of k-means clusters built from TF-IDF document vectors to pick the best topic count.
Compute a topic coherence score (e.g., c_v or NPMI) on the top words of each model and choose the model that maximizes it.
Perplexity measures how well a probabilistic model predicts unseen tokens, so it continues to fall as you add parameters, even when the resulting topics are no longer meaningful to people. Topic coherence scores (for example c_v, c_npmi, or u_mass) instead quantify how strongly the top words in each topic co-occur across the corpus, a signal that has been shown to correlate with human judgments of topic quality. BLEU is designed for n-gram overlap in machine translation and says nothing about latent topics, while a silhouette coefficient evaluates vector-space clustering, not word-distribution topics. Therefore, adding a topic coherence metric is the correct way to align automatic model selection with human interpretability.
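As a minimal sketch of how such a check could sit in the tuning pipeline (assuming gensim is available and that a pre-tokenized list of reviews called tokenized_reviews already exists; both the variable name and the topic-count grid below are hypothetical choices, not part of the question):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# tokenized_reviews: list of token lists, e.g. [["battery", "lasts", "long"], ...]
# (assumed to come from your own preprocessing step)
dictionary = Dictionary(tokenized_reviews)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_reviews]

results = []
for num_topics in range(20, 151, 10):
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=5, random_state=42)
    # c_v coherence over each topic's top words; higher generally means
    # more interpretable topics
    coherence = CoherenceModel(model=lda, texts=tokenized_reviews,
                               dictionary=dictionary,
                               coherence="c_v").get_coherence()
    results.append((num_topics, coherence))

best_k, best_score = max(results, key=lambda pair: pair[1])
print(f"Best topic count by c_v coherence: {best_k} ({best_score:.3f})")
```

In practice you would report coherence alongside validation perplexity and pick the topic count where coherence peaks (or starts to plateau), rather than the one with the lowest perplexity.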
What is perplexity and why does it decrease as more topics are added in LDA models?
What is a topic coherence score, and how does it differ from perplexity?
Why are metrics like BLEU or silhouette coefficient not suitable for evaluating LDA topics?