During error analysis of a document-classification pipeline, you notice that the model assigns the same feature representation to the sentences "Please book a conference room" and "The book is overdue." The ambiguity arises because the bag-of-words/TF-IDF vectorizer ignores the syntactic role of the token book. To capture this distinction while keeping the representation compatible with a sparse document-term matrix, which preprocessing adjustment should you implement?
Remove all verbs from the corpus prior to vectorization to eliminate tokens that cause sense ambiguity.
Lemmatize all nouns and verbs and omit POS information; lemmatization alone resolves the ambiguity.
Lower-case every token and collapse repeated characters (e.g., soooo → so); the classifier will infer syntactic context automatically.
Append the POS tag to every token before TF-IDF vectorization (e.g., book_VB vs book_NN) so syntactically different usages become distinct features.
Attaching each word's part-of-speech (POS) tag before vectorization creates separate features such as "book_VB" and "book_NN." This preserves grammatical context and lets the classifier distinguish verbs from nouns without expanding the feature space more than necessary. Removing all verbs would discard valuable information, lemmatizing alone would still collapse both uses of book to the same lemma, and simple lower-casing/character normalization provides no syntactic signal, so none of those approaches resolves the ambiguity.
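As a minimal sketch of the correct approach, the snippet below tags tokens with NLTK's perceptron tagger and feeds the POS-augmented text to scikit-learn's TfidfVectorizer. The library choices and resource names are illustrative (any tagger would do), and the exact tags emitted for "book" depend on the tagger and sentence context.

```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

# Resource names may vary slightly across NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def pos_augment(text: str) -> str:
    """Replace each token with token_TAG, e.g. 'book' -> 'book_VB'."""
    tokens = nltk.word_tokenize(text)
    return " ".join(f"{tok}_{tag}" for tok, tag in nltk.pos_tag(tokens))

docs = [
    "Please book a conference room",
    "The book is overdue",
]

# Tag first, then vectorize: the document-term matrix stays sparse,
# but the two usages of 'book' now map to distinct columns.
# lowercase=False keeps the appended tags (_VB, _NN) intact, and the
# token_pattern treats each whitespace-separated token_TAG as one feature.
vectorizer = TfidfVectorizer(lowercase=False, token_pattern=r"\S+")
tfidf = vectorizer.fit_transform(pos_augment(d) for d in docs)

# Expect separate features along the lines of 'book_VB' and 'book_NN'
# (exact tags are tagger-dependent).
print(sorted(vectorizer.get_feature_names_out()))
```

Because the tagging happens purely at the preprocessing stage, the downstream pipeline is unchanged: the output is still an ordinary sparse TF-IDF matrix, just with a modestly larger vocabulary.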