A large-scale video streaming service is developing a new recommender system. The available data consists of a massive, sparse user-item interaction matrix derived from implicit feedback, such as which videos users watched to completion. The key operational requirement is for a highly scalable algorithm that can be parallelized to handle millions of users and items efficiently. Given these constraints, which of the following approaches is the most appropriate choice?
Alternating Least Squares (ALS)
User-based k-Nearest Neighbors (k-NN)
Content-based filtering using item metadata
Singular Value Decomposition (SVD) with mean imputation for missing values
The correct answer is Alternating Least Squares (ALS). ALS is a matrix factorization algorithm used in collaborative filtering that is particularly well suited to large-scale, sparse datasets and implicit feedback. It alternates between solving for the user factors with the item factors held fixed, and then the reverse; each half-step decomposes into independent regularized least-squares problems (one per user, or one per item) that can be solved on separate machines, which is what makes the algorithm highly parallelizable and satisfies the key scalability requirement in the scenario.
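The alternating step can be sketched on a single machine with NumPy and a made-up 3x4 implicit-feedback matrix (a production system would use a distributed implementation such as Spark MLlib's ALS; the data, factor count, and regularization value below are illustrative assumptions):

```python
import numpy as np

# Toy implicit-feedback matrix (hypothetical data):
# rows = users, cols = videos, 1.0 = watched to completion.
R = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 1.0],
])

n_users, n_items = R.shape
k = 2        # number of latent factors (assumed)
lam = 0.1    # L2 regularization strength (assumed)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_users, k))   # user factors
V = rng.normal(scale=0.1, size=(n_items, k))   # item factors

for _ in range(20):
    # Fix V, solve a regularized least-squares problem for the user factors:
    # U = R V (V^T V + lam*I)^{-1}. Each user's row is independent.
    A = V.T @ V + lam * np.eye(k)
    U = np.linalg.solve(A, V.T @ R.T).T
    # Fix U, solve for the item factors (the "alternating" half-step).
    B = U.T @ U + lam * np.eye(k)
    V = np.linalg.solve(B, U.T @ R).T

pred = U @ V.T  # reconstructed preference scores for every user-item pair
```

Because every row of U (and of V) is an independent small least-squares solve, each half-step can be sharded across users or items, which is the property the explanation above relies on.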
User-based k-Nearest Neighbors (k-NN) is a less appropriate choice. While it is a form of collaborative filtering, it suffers from significant scalability issues. Calculating user-to-user similarities becomes computationally prohibitive in a system with millions of users.
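The quadratic cost is easy to see in a sketch (hypothetical random data): user-based k-NN needs a full user-user similarity matrix, whose size grows with the square of the user count.

```python
import numpy as np

# Hypothetical interaction matrix: 500 users x 50 videos, binary watches.
R = np.random.default_rng(1).integers(0, 2, size=(500, 50)).astype(float)

# Cosine similarity between every pair of users.
norms = np.linalg.norm(R, axis=1, keepdims=True)
norms[norms == 0] = 1.0          # guard against empty rows
unit = R / norms
sims = unit @ unit.T             # shape (500, 500): already 250,000 entries
```

At 500 users this is trivial, but the matrix scales as users squared: with 10 million users it would have 10^14 entries, which is why the approach is computationally prohibitive at the scale in the scenario.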
Content-based filtering is incorrect because the scenario specifies that the primary data source is a user-item interaction matrix (implicit feedback). Content-based filtering relies on item metadata (e.g., genre, actors, director), not user interaction patterns.
Singular Value Decomposition (SVD) with mean imputation is also not the best choice. Standard SVD struggles with the massive data sparsity typical of recommender systems and requires a complete matrix. Imputing missing values in such a large, sparse matrix is impractical and can introduce significant noise. While variants of SVD exist, ALS is specifically designed to handle sparse, implicit feedback data at scale more efficiently.
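A small sketch makes the objection concrete (hypothetical 3x4 matrix with unobserved cells stored as NaN): mean imputation densifies the matrix, costing memory proportional to users times items, and treats every missing cell as a genuine observation, which is the noise the explanation refers to.

```python
import numpy as np

# Sparse implicit-feedback matrix; NaN marks unobserved entries (made-up data).
R = np.array([
    [1.0, np.nan, 1.0,    0.0],
    [np.nan, 1.0, 1.0, np.nan],
    [1.0,    0.0, np.nan, 1.0],
])

# Mean imputation: every unobserved cell becomes the global mean.
# The matrix is now fully dense -- infeasible at millions of users and items.
filled = np.where(np.isnan(R), np.nanmean(R), R)

# Standard truncated SVD on the imputed (complete) matrix.
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

ALS, by contrast, never materializes the dense matrix: it works directly with the observed interactions (weighted by confidence in the implicit-feedback formulation), which is why it scales where imputed SVD does not.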