A data scientist is performing topic modeling on a corpus of several hundred thousand financial reports. They construct a document-term matrix (DTM) as the initial feature set. Due to the large and specialized vocabulary, the resulting DTM is extremely high-dimensional and sparse. This leads to the "curse of dimensionality", which presents a significant challenge for subsequent analysis. Which of the following statements BEST describes a primary consequence of this issue and a standard method to address it?
The high dimensionality causes distance metrics to become less meaningful, hampering the performance of clustering and classification algorithms. This can be mitigated by applying dimensionality reduction techniques like Singular Value Decomposition (SVD).
The sparsity of the matrix guarantees that any machine learning model trained on it will be underfit. The primary solution is to use a more complex model, such as a deep neural network, to capture the sparse features.
The primary issue is the loss of semantic relationships between words, such as synonymy. This is addressed by applying TF-IDF weighting to the DTM before modeling.
The computational cost of creating the DTM itself is the main bottleneck. This is best solved by implementing a more efficient tokenization algorithm and using a hashing vectorizer.
The correct answer identifies the core consequence: as dimensionality grows, pairwise distances tend to concentrate, so the distance metrics that underpin algorithms such as k-means clustering and nearest-neighbor classification become far less informative. It also pairs this problem with a standard mitigation, Singular Value Decomposition (SVD), which projects the sparse DTM onto a much smaller number of latent dimensions. The underfitting option has it backwards: high dimensionality and sparsity tend to cause overfitting, because a model can latch onto noise in the vast, mostly empty feature space, and switching to a more complex model would make that worse. The loss of semantic relationships such as synonymy stems from the bag-of-words representation itself, not directly from high dimensionality, and TF-IDF is a term-weighting scheme, not a remedy for either dimensionality or semantics. Finally, while constructing the DTM can be computationally expensive, the question concerns the analytical challenges that arise after the matrix has been built.
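To make the mitigation concrete, the sketch below (assuming scikit-learn is available; the four-document corpus is purely illustrative) builds a sparse DTM and projects it into a low-dimensional dense space with truncated SVD, the same decomposition that underlies latent semantic analysis.

```python
# Minimal sketch: reduce a sparse document-term matrix with truncated SVD (LSA).
# The tiny corpus below is a hypothetical stand-in for the financial-report collection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "quarterly revenue increased due to strong equity markets",
    "credit risk exposure declined in the fourth quarter",
    "revenue growth was offset by higher operating expenses",
    "the fund's equity holdings outperformed the benchmark index",
]

# Build the sparse DTM (TF-IDF weights here; raw counts work the same way).
vectorizer = TfidfVectorizer()
dtm = vectorizer.fit_transform(corpus)   # shape: (n_docs, vocabulary_size), sparse

# TruncatedSVD operates directly on the sparse matrix, so the DTM never has to be
# densified. n_components is a modeling choice; a few hundred is common for real
# corpora, and 2 is used here only because the toy vocabulary is tiny.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(dtm)         # shape: (n_docs, n_components), dense

print("original dimensionality:", dtm.shape[1])
print("reduced dimensionality:", reduced.shape[1])
print("variance explained:", svd.explained_variance_ratio_.sum())
```

Distance-based methods such as clustering can then be run on the reduced matrix, where distances between documents are more meaningful than in the original sparse space.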