A movie-streaming provider keeps a 1-5 star rating matrix and wants to build a user-based, similarity-based recommender. Some customers are "tough graders" who rarely rate above three stars, while others routinely give four or five stars even to average titles. To make sure that neighbor selection reflects relative preferences rather than each customer's personal rating scale, which similarity measure should the data scientist choose when constructing the user-user similarity matrix?
Jaccard similarity on the sets of movies each user has rated
Euclidean distance between raw rating vectors
Pearson correlation coefficient computed on co-rated items
Cosine similarity applied to the raw rating vectors
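The Pearson correlation coefficient centers each user's ratings by subtracting that user's own mean before computing the covariance, then divides by the product of the two users' standard deviations. This removes systematic "easy" or "harsh" rating bias and measures how similarly two users deviate from their individual averages, which makes it the right choice when users apply different rating scales. Cosine similarity and Euclidean distance both operate on the raw, uncentered ratings. Euclidean distance penalizes a constant offset directly, so a tough grader and a generous grader with the same relative preferences appear far apart. Cosine similarity ignores a multiplicative rescaling but not an additive offset, and because all ratings are positive it tends to score almost every pair of users as highly similar, even users with opposite tastes. Jaccard similarity ignores rating values entirely and compares only which movies each user rated, so it suits binary implicit feedback rather than 1-5 star data.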
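The difference can be illustrated with a small sketch in Python using NumPy. The rating vectors and the helper names `pearson` and `cosine` are hypothetical, chosen only for illustration; missing ratings are encoded as NaN, and each user's mean is taken over the co-rated items only, which is one common convention rather than the only one.

```python
import numpy as np

def co_rated(u, v):
    """Return the ratings of both users restricted to items both have rated."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    return u[mask], v[mask]

def pearson(u, v):
    """Pearson correlation on co-rated items (means taken over those items)."""
    a, b = co_rated(u, v)
    if a.size < 2:
        return 0.0  # too little overlap to estimate a correlation
    da, db = a - a.mean(), b - b.mean()
    denom = np.sqrt((da ** 2).sum() * (db ** 2).sum())
    return float(da @ db / denom) if denom > 0 else 0.0

def cosine(u, v):
    """Cosine similarity on the raw (uncentered) co-rated ratings."""
    a, b = co_rated(u, v)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical users: 'generous' rates every title two stars above 'tough';
# 'opposite' prefers exactly the titles that 'tough' dislikes.
tough    = np.array([1, 2, 3, 2, np.nan, 1], dtype=float)
generous = np.array([3, 4, 5, 4, 5, 3], dtype=float)
opposite = np.array([5, 4, 3, 4, 2, 5], dtype=float)

print("pearson(tough, generous):", round(pearson(tough, generous), 3))
print("pearson(tough, opposite):", round(pearson(tough, opposite), 3))
print("cosine(tough, generous): ", round(cosine(tough, generous), 3))
print("cosine(tough, opposite): ", round(cosine(tough, opposite), 3))
```

On this toy data, Pearson gives 1 for the tough/generous pair and -1 for the tough/opposite pair, while raw cosine similarity stays high for both (roughly 0.98 and 0.84), so it barely separates a like-minded neighbor from one with opposite tastes.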