You are building a text-clustering workflow that starts with an extremely sparse 1 000 000 × 50 000 term-document matrix X. Because the matrix will not fit in memory when densified, constructing the covariance matrix XᵀX for a standard principal component analysis (PCA) is not an option. Instead, you choose to apply a truncated singular value decomposition (t-SVD) to reduce the dimensionality of X prior to clustering.
Which statement best explains why t-SVD is generally preferred over covariance-based PCA for this scenario?
t-SVD guarantees that the resulting singular vectors are both orthogonal and sparse, making clusters easier to interpret than those obtained from PCA.
t-SVD forces all components of the lower-dimensional representation to be non-negative, so the projected features can be read as probabilities without any post-processing.
t-SVD automatically scales every column of X to unit variance, eliminating the need for TF-IDF or other term-weighting schemes.
t-SVD can be computed with iterative methods (e.g., randomized SVD or Lanczos) that multiply X by vectors without ever materializing XᵀX, allowing the decomposition to run efficiently on the sparse matrix.
Truncated SVD operates directly on the original sample matrix, so iterative solvers such as randomized SVD or Lanczos need only matrix-vector products with X. This avoids forming or storing the dense 50 000 × 50 000 covariance matrix, lets the solver stream over the sparse data structure, and greatly reduces both memory usage and runtime. Covariance-based PCA, in contrast, first centers the data (which by itself destroys sparsity, since subtracting nonzero column means fills in the zero entries) and then computes XᵀX (or XXᵀ), which is prohibitive for a huge, sparse term-document matrix. The other options describe properties that t-SVD does not provide: it does not automatically normalize term frequencies, it does not guarantee sparsity of the singular vectors (and PCA's principal directions are orthogonal as well), and it does not enforce non-negativity on the embedded features.
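The memory-friendly workflow can be sketched with scikit-learn's `TruncatedSVD`, which accepts sparse input and, with the default randomized solver, never materializes XᵀX. The dimensions and density below are stand-ins chosen so the example runs quickly; a real term-document matrix would come from a vectorizer such as `TfidfVectorizer`.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Small stand-in for the huge term-document matrix: 1000 docs x 500 terms,
# ~0.5% nonzero, stored in CSR format -- the matrix is never densified.
X = sparse_random(1000, 500, density=0.005, format="csr", random_state=0)

# The randomized solver only needs matrix-vector products with X,
# so it works directly on the sparse structure.
svd = TruncatedSVD(n_components=50, algorithm="randomized", random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)  # (1000, 50) -- dense low-dimensional embedding
```

The resulting `X_reduced` is the low-rank representation you would feed into a clustering algorithm such as k-means; in text applications this pipeline is known as latent semantic analysis.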