In the TF-IDF text-classification pipeline you are building for English-language restaurant reviews, the initial document-term matrix contains more than 150 000 unique tokens because words such as "run", "running", and "ran" are treated as separate features. You want to reduce this sparsity without accidentally conflating semantically different words like "universe" and "university". Which single text-preparation step best satisfies the requirement?
Switch to character-level tokenization so each character becomes a feature.
Remove all stop words, including verbs and adjectives, before vectorization.
Apply part-of-speech-aware lemmatization to convert each token to its dictionary lemma.
Run the Porter stemming algorithm to strip suffixes from every token.
Part-of-speech-aware lemmatization replaces each inflected form with its canonical dictionary lemma (e.g., "running" → "run"), using POS tags to choose the correct form. This groups true morphological variants together, shrinking the vocabulary and reducing sparsity while still distinguishing unrelated words. Porter stemming also merges variants, but it can over-truncate and map unrelated words to the same root ("universe"/"university" → "univers"). Character-level tokenization increases rather than reduces dimensionality, and indiscriminate stop-word removal drops many sentiment-bearing tokens while leaving inflectional variation intact.
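As an illustrative sketch (assuming NLTK with its WordNet lemmatizer, Porter stemmer, and the standard tokenizer/tagger data already downloaded), the snippet below shows how the two normalizers diverge on the words mentioned in the explanation:

```python
# Sketch: POS-aware lemmatization vs. Porter stemming (assumes NLTK data
# such as the punkt tokenizer, POS tagger, and WordNet are already installed
# via nltk.download()).
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer


def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the WordNet POS constant."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN


lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

text = "He ran to the university while running a blog about the universe"
tokens = word_tokenize(text)

# POS-aware lemmatization: "ran" and "running" both map to "run",
# while "universe" and "university" remain distinct features.
lemmas = [lemmatizer.lemmatize(tok, to_wordnet_pos(tag))
          for tok, tag in pos_tag(tokens)]

# Porter stemming: both "universe" and "university" collapse to "univers".
stems = [stemmer.stem(tok) for tok in tokens]

print(lemmas)
print(stems)
```

Running the lemmatized tokens through the TF-IDF vectorizer instead of the raw tokens merges only true inflectional variants, which is exactly the vocabulary reduction the question asks for.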