A financial-services firm needs to automate the processing of thousands of annual reports. The workflow must 1) categorize each report by its dominant business sector and 2) extract every company name and monetary amount into a structured database. Which sequence of NLP techniques best meets these two requirements?
Latent Dirichlet Allocation (LDA) for categorization, then named-entity recognition (NER).
K-means clustering for categorization, then named-entity recognition (NER).
Word2vec embeddings for categorization, then named-entity recognition (NER).
Latent Dirichlet Allocation (LDA) for categorization, then part-of-speech (POS) tagging.
The best approach is Latent Dirichlet Allocation (LDA) followed by named-entity recognition (NER).
LDA is an unsupervised topic-modeling algorithm that represents each document as a mixture of latent topics. The highest-weighted topic in a report can be read as its dominant theme, which makes LDA well suited to assigning each report to a business sector without labeled examples.
After categorization, NER scans the text and tags spans with entity labels such as ORG (organization) and MONEY, providing the company names and monetary figures needed for the database.
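The hand-off from NER to the database might look like the following sketch. It assumes the NER model has already returned (text, label) pairs; the sample spans, the `store_entities` helper, and the table schema are all hypothetical and exist only to show the filtering and loading step.

```python
import sqlite3

# Hypothetical NER output for one report: (span text, entity label) pairs.
# A real pipeline would obtain these from an NER model.
ner_spans = [
    ("Acme Corp", "ORG"),
    ("$4.2 billion", "MONEY"),
    ("2023", "DATE"),
    ("Globex Ltd", "ORG"),
    ("$310 million", "MONEY"),
]

# Only these entity labels are needed for the database.
WANTED = {"ORG", "MONEY"}


def store_entities(conn, report_id, spans):
    """Insert only ORG and MONEY spans for one report; return the row count."""
    rows = [(report_id, label, text) for text, label in spans if label in WANTED]
    conn.executemany(
        "INSERT INTO entities (report_id, label, text) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


# In-memory database with an illustrative schema (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (report_id TEXT, label TEXT, text TEXT)")

inserted = store_entities(conn, "report-001", ner_spans)
orgs = [
    row[0]
    for row in conn.execute(
        "SELECT text FROM entities WHERE label = 'ORG' ORDER BY rowid"
    )
]
print(inserted, orgs)
```

The DATE span is dropped because only company names and monetary amounts are required; a production schema would likely also normalize the monetary strings into numeric values and currencies.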
Why the other choices are unsuitable:
K-means clustering + NER: K-means can cluster vectorized documents, but it does not explicitly model topics, so its groupings are often less interpretable and less sector-specific than those from LDA.
LDA + part-of-speech tagging: POS tagging assigns grammatical categories (noun, verb, etc.) and will not reliably identify entities like company names or currency values.
Word2vec embeddings + NER: Word2vec yields word-level vectors; using them alone for document-level sector classification requires additional aggregation and supervised modeling, adding complexity without clear benefit here.
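To make the "additional aggregation" point concrete, here is a minimal sketch of one common way to turn word-level vectors into a single document vector: averaging them. The three-dimensional vectors and the `document_vector` helper are hypothetical; real Word2vec vectors are learned from a corpus and usually have a hundred or more dimensions.

```python
# Hypothetical 3-dimensional word vectors, hand-written for illustration.
word_vectors = {
    "bank": [0.9, 0.1, 0.0],
    "loan": [0.7, 0.3, 0.0],
    "drug": [0.0, 0.2, 0.8],
}


def document_vector(tokens, vectors):
    """Mean-pool the vectors of known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# "merger" has no vector here, so it is skipped during pooling.
doc_vec = document_vector(["bank", "loan", "merger"], word_vectors)
print(doc_vec)
```

Even after pooling, the result is only a feature vector. Assigning a sector would still require a classifier trained on labeled reports, which is the extra supervised step the explanation refers to.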