A payments-security team is clustering 100 000 transaction embeddings, each represented by 128 continuous features. They believe fraudulent user rings form clusters that are highly irregular in shape, vary greatly in size, and are surrounded by many benign transactions that should be labeled as noise. Because the true number of fraud rings is unknown, the team needs an algorithm that can discover an appropriate number of clusters on its own. For scalability, they will accelerate neighborhood queries with a k-d tree and aim for an overall runtime close to O(n log n). Which unsupervised technique best satisfies these requirements?
Expectation-Maximization Gaussian mixture modeling with Bayesian information criterion (BIC) to select the number of components
k-means clustering with the elbow method to determine the value of k
Agglomerative hierarchical clustering using Ward linkage and a dendrogram cutoff
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) with ε and minPts tuned on a validation subset
DBSCAN is a density-based algorithm that (1) does not require users to specify the number of clusters, (2) can discover clusters of arbitrary shape and varying size, and (3) automatically labels low-density points as noise, which is ideal for separating fraud rings from the surrounding benign transactions. When range searches are indexed with a k-d tree (or a similar spatial index), the average complexity drops to roughly O(n log n), matching the team's performance goal (a bound that k-d trees approach most reliably in low-dimensional data).
k-means assumes roughly spherical, similar-size clusters and requires k to be chosen in advance, so it cannot meet the shape, size, or unknown-k constraints. A Gaussian mixture model likewise needs the number of components (or a costly model-selection loop such as a BIC sweep) and assigns every point to some component, offering no built-in noise handling. Agglomerative clustering still needs an explicit cut of the dendrogram (an implicit k), Ward linkage in particular favors compact clusters rather than irregular ones, and the standard algorithm scales at least quadratically in n, far worse than an indexed DBSCAN implementation.
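To make the three DBSCAN properties above concrete, here is a minimal sketch in plain Python. It is an illustration only, not a production implementation: the function names (`region_query`, `dbscan`), the toy 2-D points, and the values `eps=1.5` and `min_pts=3` are all assumptions chosen for the example. Neighborhood queries are brute force, so this version runs in O(n^2); an indexed implementation would replace `region_query` with a k-d tree range search.

```python
import math

NOISE = -1
UNVISITED = None


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (brute force)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point; may later be relabeled as a border point.
            labels[i] = NOISE
            continue
        # Core point: start a new cluster and grow it through its neighbors.
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Hypothetical toy data: two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (50.0, 50.0),
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The number of clusters (two here) is never passed in; it falls out of the density parameters, and the isolated point is labeled -1 rather than being forced into a cluster.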