A data scientist is building a text classification model for a large corpus of customer support tickets. After applying a TF-IDF vectorizer with a vocabulary of 75,000 terms, the resulting document-term matrix is over 99.5% sparse. The initial model, a support vector machine with a linear kernel, is training very slowly and showing poor generalization. The scientist suspects the extreme sparsity and high dimensionality are the root causes. Which of the following is the most appropriate next step to mitigate these specific problems?
Utilize a one-hot encoding scheme on the document categories before re-fitting the model.
Replace all zero-value entries in the matrix with the column (term) mean to create a dense matrix.
Convert the sparse matrix to a dense format and then use standard Principal Component Analysis (PCA) for feature extraction.
Apply Truncated Singular Value Decomposition (SVD) to reduce the dimensionality of the feature space.
The correct answer is to apply Truncated Singular Value Decomposition (SVD). TF-IDF vectorization on large text corpora characteristically produces very high-dimensional and sparse matrices. These matrices can cause computational inefficiency and lead to poor model performance, a phenomenon related to the 'curse of dimensionality'.
Truncated SVD is a dimensionality reduction technique particularly well suited to this scenario because it operates directly on large sparse matrices without densifying or centering them, unlike standard Principal Component Analysis (PCA). It projects the sparse TF-IDF matrix onto a much lower-dimensional, dense representation that captures the most significant latent semantic relationships in the data. Applied to TF-IDF features, this process is known as Latent Semantic Analysis (LSA); it addresses both the high dimensionality and the computational burden of sparsity, which can improve the training time and generalization of the downstream SVM.
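A minimal sketch of this workflow, assuming scikit-learn is available; the toy tickets, labels, and n_components value below are illustrative placeholders rather than details from the question. TruncatedSVD accepts the scipy sparse output of TfidfVectorizer directly and returns a dense, low-dimensional matrix that a LinearSVC can train on quickly.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the real support-ticket data.
tickets = [
    "printer will not connect to the office wifi",
    "please refund the duplicate charge on my invoice",
    "password reset email never arrives in my inbox",
    "invoice shows the wrong billing address",
]
labels = ["hardware", "billing", "account", "billing"]

# TF-IDF yields a scipy.sparse matrix (the real one has 75,000 columns and is
# >99.5% sparse); TruncatedSVD consumes it directly, with no densifying and
# no mean-centering.
vectorizer = TfidfVectorizer()
X_sparse = vectorizer.fit_transform(tickets)

lsa = make_pipeline(
    TruncatedSVD(n_components=3, random_state=0),  # n_components << vocabulary size
    Normalizer(copy=False),                        # re-normalize rows for the linear SVM
)
X_reduced = lsa.fit_transform(X_sparse)

clf = LinearSVC().fit(X_reduced, labels)
print(X_sparse.shape, "->", X_reduced.shape)

Re-normalizing the rows after the SVD projection is a common follow-up step for LSA, since the projection does not preserve the unit-length rows that TF-IDF produces and linear classifiers often behave better on normalized features.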
Replacing zero-value entries with the mean is incorrect. In a TF-IDF matrix, zeros are meaningful; they indicate the absence of a term, not missing data. Imputing them would destroy this information, create a computationally intractable dense matrix, and introduce noise.
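To see the densification effect concretely, here is a minimal sketch using NumPy and SciPy; the matrix shape and density are hypothetical stand-ins for the real TF-IDF matrix.

import numpy as np
from scipy import sparse

# Hypothetical small matrix standing in for the TF-IDF matrix
# (the real one is n_tickets x 75,000 at >99.5% sparsity).
X = sparse.random(1000, 500, density=0.005, format="csr", random_state=0)

# "Impute" zeros with each column (term) mean, as the incorrect option suggests.
X_dense = X.toarray()
col_means = np.asarray(X.mean(axis=0)).ravel()
X_imputed = np.where(X_dense == 0, col_means, X_dense)

print("nonzero cells before imputation:", X.nnz)                       # 2,500
print("nonzero cells after imputation:", np.count_nonzero(X_imputed))  # ~500,000

Every zero in a column that contains at least one nonzero value becomes a nonzero mean, so the matrix loses its sparsity entirely while the "imputed" values carry no information about the documents.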
Applying one-hot encoding to the document categories is irrelevant to the problem. This action would modify the target variable, not the sparse feature matrix that is causing the training and performance issues.
Converting the sparse matrix to a dense format to use standard PCA is impractical. The conversion would likely exhaust available memory (a MemoryError in Python) because the dense matrix would be enormous. Furthermore, standard PCA implementations center the data by subtracting each column's mean, which turns nearly every zero into a nonzero value and destroys the sparsity that makes the matrix tractable; Truncated SVD avoids this problem because it does not require centering.
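A back-of-the-envelope check makes the scale concrete. Only the 75,000-term vocabulary comes from the question; the two-million-ticket corpus size below is a hypothetical assumption for illustration.

# Memory estimate for densifying the TF-IDF matrix.
# NOTE: n_docs is a hypothetical corpus size; only the 75,000-term
# vocabulary is given in the question.
n_docs = 2_000_000
n_terms = 75_000
bytes_per_value = 8  # float64
dense_bytes = n_docs * n_terms * bytes_per_value
print(f"Dense matrix would need about {dense_bytes / 1e12:.1f} TB of RAM")  # ~1.2 TB

The sparse representation, by contrast, stores only the roughly 0.5% of entries that are nonzero plus their indices, which is orders of magnitude smaller.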