You are integrating a k-nearest neighbors-based anomaly detector that flags points whose mean distance to their k nearest neighbors is unusually large. The raw data consist of 2 million rows with 200 numeric features. A prototype that uses brute-force neighbor search on the original features exhausts available memory and takes minutes to answer queries.
Which modification is most likely to reduce both memory usage and query latency without sacrificing the detector's ability to isolate outliers?
Build a KD-tree index on the original 200-dimensional features.
Switch the distance metric to cosine similarity and keep brute-force search.
Keep brute-force search but lower k from 20 to 5.
Apply PCA to reduce dimensionality, then build a ball-tree index on the reduced space.
Principal component analysis (PCA) can project the 200-dimensional data onto a lower-dimensional subspace that captures most of the variance. Working in that subspace shortens each feature vector, so the distance calculations and the tree index require less memory and fewer floating-point operations. After the reduction, a ball-tree index is preferable to a KD-tree because ball trees partition points with hyperspheres and degrade less sharply as dimensionality grows. Building a KD-tree index on the original 200-dimensional features would be inefficient: KD-trees are generally effective only when the dimensionality is roughly below 20, and on 200 dimensions their query time approaches that of brute force. Simply lowering k or switching the distance metric while keeping brute-force search leaves the core performance problem unsolved, because every query must still compute distances to all 2 million points in the full 200-dimensional space. These latter changes have a marginal impact on memory and only a modest impact on runtime, and they may also hurt anomaly-detection fidelity.
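The recommended pipeline can be sketched with scikit-learn, assuming it is available; the dataset here is a small synthetic stand-in (5,000 rows instead of 2 million) with a few injected outliers, so sizes and the 95% variance threshold are illustrative choices, not values from the question:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 200))  # stand-in for the 2M x 200 dataset
X[:10] += 8.0                     # inject 10 obvious outliers

# 1. Project onto a subspace capturing ~95% of the variance.
pca = PCA(n_components=0.95)
X_red = pca.fit_transform(X)

# 2. Build a ball-tree index on the reduced features.
k = 20
nn = NearestNeighbors(n_neighbors=k + 1, algorithm="ball_tree")
nn.fit(X_red)

# 3. Anomaly score = mean distance to the k nearest neighbors
#    (drop the first column, which is each point's zero self-distance).
dist, _ = nn.kneighbors(X_red)
scores = dist[:, 1:].mean(axis=1)

# The injected outliers should receive the largest scores.
top = np.argsort(scores)[-10:]
print(sorted(top.tolist()))
```

Passing a float between 0 and 1 as `n_components` tells PCA to keep just enough components to explain that fraction of the variance, which is usually far fewer than 200 and is what shrinks both the index and each distance computation.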