A data scientist is evaluating several classifiers for a large-scale e-mail filtering project. The feature set is a 750 000 × 120 000 bag-of-words matrix stored in compressed-sparse-row (CSR) format, in which fewer than 1 % of the entries are non-zero. Training MultinomialNB and LinearSVC completes quickly and stays below 4 GB of RAM, but running GaussianNB on the same matrix causes the Python process to allocate more than 60 GB before the job is killed.
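For scale, a back-of-envelope footprint comparison (a rough sketch, taking the stated 1 % density as an upper bound and assuming float64 values with 32-bit CSR indices) shows why the sparse representation fits in RAM while a dense copy cannot:

```python
rows, cols, density = 750_000, 120_000, 0.01  # density is an upper bound

# Dense float64: every one of the 9e10 cells stored explicitly, 8 bytes each.
dense_gb = rows * cols * 8 / 1e9                   # ~720 GB

# CSR stores only the non-zeros: 8-byte values, 4-byte column indices,
# plus one 4-byte row pointer per row.
nnz = int(rows * cols * density)
csr_gb = (nnz * (8 + 4) + (rows + 1) * 4) / 1e9    # ~10.8 GB at 1 % density

print(f"dense copy: {dense_gb:,.0f} GB   CSR: {csr_gb:.1f} GB")
```

At the actual sub-1 % density the CSR figure shrinks proportionally (at 0.1 % it is about 1 GB, consistent with the sub-4 GB training runs), while the dense figure stays at roughly 720 GB regardless of sparsity.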
Which property of sparse-matrix handling in this scenario best explains why the GaussianNB run exhausts memory while the other two models do not?
A. GaussianNB applies kernel density estimation that adds synthetic features, dramatically increasing dimensionality when the input is sparse.
B. GaussianNB computes an all-pairs Euclidean distance matrix and therefore materializes a full n × n distance table in memory.
C. GaussianNB requires integer word-count features, so it duplicates the sparse matrix as a separate float array before fitting.
D. GaussianNB implicitly converts the CSR matrix to a dense array in order to calculate feature means and variances, causing all zero entries to be stored explicitly.
Correct answer: D. MultinomialNB and LinearSVC are implemented to work directly on SciPy sparse matrices, leaving zero entries implicit and therefore cheap in both memory and computation. GaussianNB, in contrast, calls validation utilities that do not accept sparse input, so the CSR matrix ends up expanded into a dense NumPy array before the per-feature means and variances can be computed. Densifying the 750 000 × 120 000 matrix materializes every zero, inflating the data size by two to three orders of magnitude and quickly overrunning available memory. The failure is thus a consequence of the implicit-to-explicit expansion of zeros during the dense conversion, not of integer-count requirements, kernel density estimation, or pairwise distance calculations.
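The behavior is easy to reproduce at small scale. The sketch below assumes a recent scikit-learn, where the sparse-input validation surfaces as a TypeError directing the caller to X.toarray() rather than converting silently; either way, the required fix-up is a full densification. MultinomialNB and LinearSVC fit the CSR matrix as-is:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import LinearSVC

# Small stand-in for the 750_000 x 120_000 matrix: 1 % non-zero, CSR format.
X = sp.random(1_000, 2_000, density=0.01, format="csr", random_state=0)
y = np.random.default_rng(0).integers(0, 2, size=X.shape[0])

MultinomialNB().fit(X, y)   # operates on the CSR matrix as-is
LinearSVC().fit(X, y)       # liblinear also consumes CSR directly

try:
    GaussianNB().fit(X, y)  # input validation does not accept sparse data
except TypeError as exc:
    print(exc)              # asks the caller to densify via X.toarray()

# X.toarray() is harmless here (~16 MB) but materializes every zero:
# on the real 750_000 x 120_000 matrix it yields a ~720 GB float64 array.
```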