Data Mining The art of extracting knowledge from large bodies of structured data. Let's put it to use!
Recommendations
Basic Recommendations with Collaborative Filtering
Making Recommendations
The Netflix Prize (2006-2009)
What was the Netflix Prize? In October 2006, Netflix released a dataset containing 100 million anonymous movie ratings and challenged the data mining, machine learning, and computer science communities to develop systems that could beat the accuracy of its recommendation system, Cinematch. Thus began the Netflix Prize, an open competition for the best collaborative filtering algorithm to predict user ratings for films, based solely on previous ratings, without any other information about the users or films.
The Netflix Prize Datasets Netflix provided a training dataset of 100,480,507 ratings that 480,189 users gave to 17,770 movies. Each training rating (or instance) is of the form (user, movie, date of rating, rating). The user and movie fields are integer IDs, while ratings are integral, from 1 to 5 stars.
The Netflix Prize Datasets The qualifying dataset contained 2,817,131 instances of the form (user, movie, date of rating), with ratings known only to the jury. A participating team's algorithm had to predict ratings for the entire qualifying set, which consisted of a validation (or "quiz") set and a test set. During the competition, teams were only informed of their score on the quiz set of 1,408,342 ratings; the jury used the test set of 1,408,789 ratings to determine potential prize winners.
The Netflix Prize Data [Figure: an n-user by m-movie ratings matrix, sparsely filled with 1-5 star ratings and one '?' entry to predict. The rows (users) are the instances, also called samples, examples, or observations; the columns (movies) are the features, also called attributes or dimensions.]
The Netflix Prize Goal Movie Ratings:

User       Star Wars   Hoop Dreams   Contact   Titanic
Joe            5            2           5         4
John           2            5           3
Al             2            2           4         2
Everaldo       5            1           5         ?

Goal: Predict ? (a movie rating) for a user.
The Netflix Prize Methods Bennett, James, and Stan Lanning. "The Netflix Prize." Proceedings of KDD Cup and Workshop, 2007.
The Netflix Prize Methods [Figure: overview of the competition's methods. Some we will discuss now; the rest we will discuss by the end of the course.]
Raw Averages
User average: simply assign the average rating given by user $u$, $\bar{r}_u = \frac{1}{|I_u|} \sum_{i \in I_u} r_{u,i}$, where $I_u$ is the set of items rated by $u$.
Item average: simply assign the average rating received by item $i$, $\bar{r}_i = \frac{1}{|U_i|} \sum_{u \in U_i} r_{u,i}$, where $U_i$ is the set of users who rated $i$ and $r_{u,i}$ is the rating given to item $i$ by user $u$.
Raw Averages What about universally good or bad movies? Or skewed rating systems?
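These two baselines can be sketched in a few lines of Python. The (user, movie, rating) triples below are invented toy data, reusing the names from the example table:

```python
# Toy ratings as (user, movie, rating) triples; names are illustrative only.
ratings = [
    ("Joe", "Star Wars", 5), ("Joe", "Contact", 5), ("Joe", "Titanic", 4),
    ("John", "Star Wars", 2), ("John", "Contact", 5),
    ("Al", "Star Wars", 2), ("Al", "Titanic", 2),
]

def user_average(user):
    """Mean rating given by `user` across all items they rated."""
    rs = [r for u, _, r in ratings if u == user]
    return sum(rs) / len(rs)

def item_average(item):
    """Mean rating received by `item` across all users who rated it."""
    rs = [r for _, i, r in ratings if i == item]
    return sum(rs) / len(rs)

print(user_average("Joe"))        # mean of 5, 5, 4
print(item_average("Star Wars"))  # mean of 5, 2, 2
```

The user average predicts the same value for every item a given user sees; the item average predicts the same value for every user, which is why neither can account for the interaction between a user's taste and a particular movie.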
Bayesian Method Apply Bayes' Theorem: of the ratings $r \in R$ a user could give for a movie, assign the one with the highest value of
$$P(r \mid i) = \frac{P(i \mid r)\, P(r)}{P(i)}$$
where $P(r \mid i)$ is the (conditional) probability of rating $r$ given item $i$, $P(i \mid r)$ is the (conditional) probability of item $i$ given rating $r$, $P(r)$ is the (prior) probability of rating $r$, and $P(i)$ is the (prior) probability of item $i$.
Bayesian Method But this method still doesn't account for the similarity between users.
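A minimal sketch of the Bayesian baseline on made-up toy data. Note that $P(i)$ is constant across ratings $r$, and with count-based (maximum-likelihood) estimates $P(i \mid r)\,P(r) = \mathrm{count}(i, r) / N$, so the argmax reduces to the item's most frequent rating:

```python
from collections import Counter

# Toy (user, item, rating) data; all names are made up for illustration.
ratings = [
    ("u1", "m1", 5), ("u2", "m1", 5), ("u3", "m1", 4),
    ("u1", "m2", 2), ("u2", "m2", 3), ("u3", "m2", 2),
]

def bayes_predict(item, rating_scale=(1, 2, 3, 4, 5)):
    """Pick the rating r maximizing P(r|i), which is proportional
    to P(i|r) * P(r) = count(i, r) / N under count-based estimates."""
    n = len(ratings)
    counts = Counter(r for _, i, r in ratings if i == item)
    return max(rating_scale, key=lambda r: counts.get(r, 0) / n)

print(bayes_predict("m1"))  # 5 is the most frequent rating for m1
```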
Cute Kitten Picture Intermission
Key to Collaborative Filtering Common insight: personal tastes are correlated. If Alice and Bob both like X and Alice likes Y, then Bob is more likely to like Y, especially (perhaps) if Bob knows Alice.
Collaborative Filtering Collaborative filtering (CF) systems work by collecting user feedback in the form of ratings for items in a given domain, and by exploiting similarities in rating behavior among several users to determine how to recommend an item.
Collaborative Filtering Dataset [Figure: an n-user by m-item ratings matrix with a missing entry '?'.] Goal: Predict ? (an item's rating) for user n.
Types of Collaborative Filtering
1. Neighborhood- or Memory-based
2. Model-based
3. Hybrid
Types of Collaborative Filtering We'll talk about the first type, neighborhood- or memory-based CF, now.
Neighborhood-based CF A subset of users is chosen based on their similarity to the active user, and a weighted combination of their ratings is used to produce predictions for this user.
Neighborhood-based CF It has three steps:
1. Assign a weight to all users with respect to their similarity to the active user.
2. Select the k users that have the highest similarity with the active user, commonly called the neighborhood.
3. Compute a prediction from a weighted combination of the selected neighbors' ratings.
Neighborhood-based CF Step 1 In step 1, the weight $w_{a,u}$ is a measure of similarity between the user $u$ and the active user $a$. The most commonly used measure of similarity is the Pearson correlation coefficient between the ratings of the two users:
$$w_{a,u} = \frac{\sum_{i \in I} (r_{a,i} - \bar{r}_a)(r_{u,i} - \bar{r}_u)}{\sqrt{\sum_{i \in I} (r_{a,i} - \bar{r}_a)^2} \sqrt{\sum_{i \in I} (r_{u,i} - \bar{r}_u)^2}}$$
where $I$ is the set of items rated by both users, $r_{u,i}$ is the rating given to item $i$ by user $u$, and $\bar{r}_u$ is the mean rating given by user $u$.
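Step 1 can be sketched as follows, with a toy user-to-ratings dictionary (names invented). Note that the sums run over co-rated items, while each mean is the user's mean over all of their ratings, matching the definitions above:

```python
from math import sqrt

# Toy ratings as user -> {item: rating}; names are illustrative only.
R = {
    "a": {"i1": 5, "i2": 3, "i3": 4},
    "u": {"i1": 4, "i2": 1, "i3": 5, "i4": 2},
}

def pearson(a, u):
    """Pearson correlation w_{a,u} over the items rated by both users."""
    common = set(R[a]) & set(R[u])
    if not common:
        return 0.0
    mean_a = sum(R[a].values()) / len(R[a])
    mean_u = sum(R[u].values()) / len(R[u])
    num = sum((R[a][i] - mean_a) * (R[u][i] - mean_u) for i in common)
    den_a = sqrt(sum((R[a][i] - mean_a) ** 2 for i in common))
    den_u = sqrt(sum((R[u][i] - mean_u) ** 2 for i in common))
    if den_a == 0 or den_u == 0:
        return 0.0  # a user with constant ratings gives no signal
    return num / (den_a * den_u)

print(pearson("a", "u"))
```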
Neighborhood-based CF Step 2 In step 2, some sort of threshold is used on the similarity score to determine the neighborhood.
Neighborhood-based CF Step 3 In step 3, predictions are generally computed as the weighted average of deviations from the neighbors' means, as in:
$$p_{a,i} = \bar{r}_a + \frac{\sum_{u \in K} (r_{u,i} - \bar{r}_u)\, w_{a,u}}{\sum_{u \in K} w_{a,u}}$$
where $p_{a,i}$ is the prediction for the active user $a$ for item $i$, $w_{a,u}$ is the similarity between users $a$ and $u$, and $K$ is the neighborhood or set of most similar users.
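Step 3 can be sketched on toy data as follows; the neighborhood K and its weights $w_{a,u}$ are passed in as a dict, as if they came out of steps 1 and 2:

```python
# Toy ratings as user -> {item: rating}; names are illustrative only.
R = {
    "a":  {"i1": 4, "i2": 2},
    "u1": {"i1": 5, "i2": 1, "i3": 3},
    "u2": {"i1": 3, "i2": 3, "i3": 5},
}

def mean(u):
    """Mean rating given by user u over all of their ratings."""
    return sum(R[u].values()) / len(R[u])

def predict(a, i, weights):
    """p_{a,i} = r_bar_a + sum_u (r_{u,i} - r_bar_u) w_{a,u} / sum_u w_{a,u},
    where `weights` maps each neighbor u in K to w_{a,u}."""
    num = sum((R[u][i] - mean(u)) * w for u, w in weights.items() if i in R[u])
    den = sum(w for u, w in weights.items() if i in R[u])
    return mean(a) + (num / den if den else 0.0)

p = predict("a", "i3", {"u1": 0.8, "u2": 0.5})  # about 3.513
```

Predicting deviations from each neighbor's mean, rather than raw ratings, compensates for neighbors who rate systematically high or low.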
Neighborhood-based CF Common Problems:
- The search for similar users has high computational complexity, so conventional neighborhood-based CF algorithms do not scale well.
- It is common for the active user to have highly correlated neighbors based on very few co-rated (overlapping) items, which often results in bad predictors.
- When measuring the similarity between users, items that have been rated by everyone (and universally liked or disliked) are not as useful as less common items.
Item-to-Item Matching An extension to neighborhood-based CF. Addresses the problem of the high computational complexity of searching for similar users. The idea: rather than matching similar users, match a user's rated items to similar items.
Item-to-Item Matching In this approach, similarities between pairs of items $i$ and $j$ are computed off-line using Pearson correlation, given by:
$$w_{i,j} = \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_i)(r_{u,j} - \bar{r}_j)}{\sqrt{\sum_{u \in U} (r_{u,i} - \bar{r}_i)^2} \sqrt{\sum_{u \in U} (r_{u,j} - \bar{r}_j)^2}}$$
where $U$ is the set of all users who have rated both items $i$ and $j$, $r_{u,i}$ is the rating of user $u$ on item $i$, and $\bar{r}_i$ is the average rating of the $i$th item across users.
Item-to-Item Matching Now the rating of item $i$ for user $a$ can be predicted using a simple weighted average, as in:
$$p_{a,i} = \frac{\sum_{j \in K} r_{a,j}\, w_{i,j}}{\sum_{j \in K} w_{i,j}}$$
where $K$ is the neighborhood set of the $k$ items rated by $a$ that are most similar to $i$.
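Both pieces can be sketched on toy data (item and user names invented): the item-item Pearson similarities, which a real system would precompute off-line, and the weighted-average prediction over the neighborhood K:

```python
from math import sqrt

# Toy ratings as user -> {item: rating}; names are illustrative only.
R = {
    "u1": {"i": 5, "j": 4, "k": 2},
    "u2": {"i": 3, "j": 3, "k": 1},
    "u3": {"i": 4, "j": 5, "k": 2},
}

def item_mean(i):
    """Average rating of item i across the users who rated it."""
    rs = [r[i] for r in R.values() if i in r]
    return sum(rs) / len(rs)

def item_pearson(i, j):
    """w_{i,j}: Pearson correlation between items i and j over the
    users who rated both (precomputed off-line in a real system)."""
    users = [u for u in R if i in R[u] and j in R[u]]
    mi, mj = item_mean(i), item_mean(j)
    num = sum((R[u][i] - mi) * (R[u][j] - mj) for u in users)
    di = sqrt(sum((R[u][i] - mi) ** 2 for u in users))
    dj = sqrt(sum((R[u][j] - mj) ** 2 for u in users))
    return num / (di * dj) if di and dj else 0.0

def predict(a, i, K):
    """p_{a,i}: weighted average of a's ratings on the neighborhood K
    of items most similar to i, weighted by item-item similarity."""
    num = sum(R[a][j] * item_pearson(i, j) for j in K if j in R[a])
    den = sum(item_pearson(i, j) for j in K if j in R[a])
    return num / den if den else item_mean(i)

print(predict("u1", "i", ["j", "k"]))
```

Because item-item similarities change slowly, computing them off-line moves the expensive search out of the request path, which is what makes this variant scale better than the user-based one.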
Significance Weighting Another extension to neighborhood-based CF. Addresses the problem of bad predictors that arise when the active user's highly correlated neighbors are based on very few co-rated (overlapping) items. The idea: multiply the similarity weight by a significance weighting factor, which devalues correlations based on few co-rated items.
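One common form of the factor scales similarities linearly up to a cutoff on the number of co-rated items. The cutoff of 50 below is an assumed, tunable constant, not something prescribed by this slide:

```python
def significance_weight(similarity, n_common, cutoff=50):
    """Devalue a similarity computed from few co-rated items.

    `cutoff` is an assumed tunable constant: correlations based on at
    least `cutoff` co-rated items keep their full weight; those based
    on fewer are scaled down proportionally.
    """
    return similarity * min(n_common, cutoff) / cutoff

# A correlation of 0.9 built on only 5 co-rated items is heavily devalued:
w = significance_weight(0.9, 5)  # 0.9 * 5/50 = 0.09
```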
Inverse User Frequency Yet another extension to neighborhood-based CF. Addresses the problem of the dominance of items that have been rated by everyone (and universally liked or disliked), yet are not as useful as less common items. The idea: weight an item's rating by the inverse of the frequency with which that item is rated.
Inverse User Frequency When measuring the similarity between users, items that have been rated by everyone (and universally liked or disliked) are not as useful as less common items. To account for this, compute
$$f_i = \log \frac{n}{n_i}$$
where $n_i$ is the number of users who have rated item $i$ out of the total number of users $n$. To apply inverse user frequency while using similarity-based CF, the original rating for item $i$ is transformed by multiplying it by the factor $f_i$.
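A sketch of the transform, again on a made-up ratings dictionary. An item rated by every user gets $f_i = \log 1 = 0$ and drops out of the similarity computation entirely:

```python
from math import log

# Toy ratings as user -> {item: rating}; names are illustrative only.
R = {
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 4},
    "u3": {"i1": 4, "i2": 2},
}

def inverse_user_frequency(i):
    """f_i = log(n / n_i), where n_i of the n users rated item i."""
    n = len(R)
    n_i = sum(1 for u in R if i in R[u])
    return log(n / n_i)

def transformed_rating(u, i):
    """Multiply the original rating by f_i before measuring similarity."""
    return R[u][i] * inverse_user_frequency(i)

print(inverse_user_frequency("i1"))  # rated by all 3 users: log(1) = 0.0
print(inverse_user_frequency("i2"))  # rated by 2 of 3 users: log(3/2)
```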
And Now Let's run the data mining on some data!
References Prem Melville and Vikas Sindhwani. "Recommender Systems." In Encyclopedia of Machine Learning, Claude Sammut and Geoffrey Webb (Eds.), Springer, 2010.