Andriy Mnih and Ruslan Salakhutdinov

Similar documents
Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

Restricted Boltzmann Machines for Collaborative Filtering

* Matrix Factorization and Recommendation Systems

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Scaling Neighbourhood Methods

Large-scale Ordinal Collaborative Filtering

Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model

Graphical Models for Collaborative Filtering

SCMF: Sparse Covariance Matrix Factorization for Collaborative Filtering

Preliminaries. Data Mining. The art of extracting knowledge from large bodies of structured data. Let's put it to use!

Collaborative Filtering. Radek Pelánek

Scalable Bayesian Matrix Factorization

A Modified PMF Model Incorporating Implicit Item Associations

Using SVD to Recommend Movies

Recommender Systems. Dipanjan Das Language Technologies Institute Carnegie Mellon University. 20 November, 2007

Recommendation Systems

Collaborative Filtering Applied to Educational Data Mining

Generative Models for Discrete Data

Approximate Inference using MCMC

Mixed Membership Matrix Factorization

STA 4273H: Statistical Machine Learning

Mixed Membership Matrix Factorization

Matrix Factorization Techniques for Recommender Systems

Algorithms for Collaborative Filtering

Bayesian Matrix Factorization with Side Information and Dirichlet Process Mixtures

Probabilistic Matrix Factorization

Bayesian Machine Learning

Matrix and Tensor Factorization from a Machine Learning Perspective

Data Mining Techniques

STA 4273H: Statistical Machine Learning

STA414/2104 Statistical Methods for Machine Learning II

Collaborative Filtering

Matrix Factorization Techniques for Recommender Systems

CS281 Section 4: Factor Analysis and PCA

6.034 Introduction to Artificial Intelligence

Matrix Factorization Techniques For Recommender Systems. Collaborative Filtering

Matrix Factorization and Collaborative Filtering

CS 175: Project in Artificial Intelligence. Slides 4: Collaborative Filtering

STA 4273H: Statistical Machine Learning

Cost-Aware Collaborative Filtering for Travel Tour Recommendations

Collaborative topic models: motivations cont

Mixture-Rank Matrix Approximation for Collaborative Filtering

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Collaborative Filtering on Ordinal User Feedback

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN

ADVANCED MACHINE LEARNING. Non-linear regression techniques Part - II

STA 4273H: Statistical Machine Learning

Social Relations Model for Collaborative Filtering

Collaborative Recommendation with Multiclass Preference Context

ECE521 week 3: 23/26 January 2017

A UCB-Like Strategy of Collaborative Filtering

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

Rating Prediction with Topic Gradient Descent Method for Matrix Factorization in Recommendation

Nonparametric Bayesian Methods (Gaussian Processes)

1-bit Matrix Completion. PAC-Bayes and Variational Approximation

Transfer Learning for Collective Link Prediction in Multiple Heterogenous Domains

TopicMF: Simultaneously Exploiting Ratings and Reviews for Recommendation

STA 414/2104: Machine Learning

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Recommendation Systems

Principal Component Analysis (PCA) for Sparse High-Dimensional Data

arxiv: v2 [cs.ir] 14 May 2018

DATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD

Decoupled Collaborative Ranking

Collaborative Filtering: A Machine Learning Perspective

Principal Component Analysis (PCA) for Sparse High-Dimensional Data

The Pragmatic Theory solution to the Netflix Grand Prize

STA 4273H: Statistical Machine Learning

Learning to Learn and Collaborative Filtering

GWAS IV: Bayesian linear (variance component) models

Jeffrey D. Ullman Stanford University

Mixed Membership Matrix Factorization

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Recommendation. Tobias Scheffer

GAUSSIAN PROCESS REGRESSION

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

Autoencoders and Score Matching for Energy Based Models. Kevin Swersky, Marc'Aurelio Ranzato, David Buchman, Benjamin M. Marlin, Nando de Freitas

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression

Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model

A Regularized Recommendation Algorithm with Probabilistic Sentiment-Ratings

Bayesian Machine Learning

Joint user knowledge and matrix factorization for recommender systems

Recommender Systems EE448, Big Data Mining, Lecture 10. Weinan Zhang Shanghai Jiao Tong University

Improving Quality of Crowdsourced Labels via Probabilistic Matrix Factorization

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Collaborative Filtering via Rating Concentration

Algorithmisches Lernen/Machine Learning

Location Regularization-Based POI Recommendation in Location-Based Social Networks

Introduction to Machine Learning. Regression. Computer Science, Tel-Aviv University,

Hierarchical Bayesian Matrix Factorization with Side Information

Recommender systems, matrix factorization, variable selection and social graph data

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Lecture 3. Linear Regression II Bastian Leibe RWTH Aachen

Practical Bayesian Optimization of Machine Learning. Learning Algorithms

CS425: Algorithms for Web Scale Data

Deep Matrix Factorization for Recommendation

CS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu

Techniques for Dimensionality Reduction. PCA and Other Matrix Factorization Methods

Transcription:

MATRIX FACTORIZATION METHODS FOR COLLABORATIVE FILTERING. Andriy Mnih and Ruslan Salakhutdinov, University of Toronto, Machine Learning Group

What is collaborative filtering? The goal of collaborative filtering (CF) is to infer user preferences for items given a large but incomplete collection of preferences for many users. For example: suppose you infer from the data that most of the users who like Star Wars also like Lord of the Rings and dislike Dune. Then if a user watched and liked Star Wars, you would recommend Lord of the Rings to him/her, but not Dune. Preferences can be explicit or implicit: Explicit preferences: ratings given to items by users. Implicit preferences: which items were rented or bought by users.

Collaborative filtering vs. content-based filtering. Content-based filtering makes recommendations based on item content. E.g. for a movie: genre, actors, director, length, language, number of car chases, etc. It can be used to recommend new items for which no ratings are available yet, but it does not perform as well as collaborative filtering in most cases. Collaborative filtering does not look at item content. Preferences are inferred from rating patterns alone. It cannot recommend new items: they all look the same to the system. It is very effective when a sufficient amount of data is available.

Netflix Prize: In it for the money. Two years ago, Netflix announced a movie rating prediction competition. Whoever improves Netflix's own baseline score by 10% will win the 1 million dollar prize. The training data set consists of 100,480,507 ratings from 480,189 randomly-chosen, anonymous users on 17,770 movie titles. The data is very sparse: most users rate only a few movies. Netflix also provides a test set containing 2,817,131 user/movie pairs with the ratings withheld. The goal is to predict those ratings as accurately as possible.

Course projects. We will provide you with a subset of the Netflix training data: a few thousand users + a few thousand movies, so that you can easily run your algorithms on CDF machines. We will also provide you with a validation set, and you will report the achieved prediction accuracy on this validation set. There will be two projects based on the following two models: Probabilistic Matrix Factorization (PMF) and Restricted Boltzmann Machines (RBMs). You can choose which model you would like to work on. This tutorial will cover only PMF (the easy 4-5% on Netflix).

CF as matrix completion. Collaborative filtering can be viewed as a matrix completion problem. Task: given a user/item matrix with only a small subset of entries present, fill in (some of) the missing entries. Perhaps the simplest effective way to do this is to factorize the rating matrix into a product of two smaller matrices.

Matrix factorization: notation. [Figure: the N x M rating matrix R is approximated as $R \approx U^T V$, with a user feature matrix U and a movie feature matrix V.] Suppose we have M movies, N users, and integer rating values from 1 to K. Let $R_{ij}$ be the rating of user i for movie j, and let $U \in \mathbb{R}^{D \times N}$ and $V \in \mathbb{R}^{D \times M}$ be the latent user and movie feature matrices. We will use $U_i$ and $V_j$ to denote the latent feature vectors for user i and movie j respectively.

Matrix factorization: the non-probabilistic view. To predict the rating given by user i to movie j, we simply compute the dot product between the corresponding feature vectors: $\hat{R}_{ij} = U_i^T V_j = \sum_k U_{ik} V_{jk}$. Intuition: for each user, we predict a movie rating by giving the movie feature vector to a linear model. The movie feature vector can be viewed as the input, the user feature vector as the weight vector, and the predicted rating as the output. Unlike in linear regression, where inputs are fixed and weights are learned, here we learn both the weights and the inputs (by minimizing squared error). Note that the model is symmetric in movies and users.
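
As a concrete illustration, here is a minimal NumPy sketch of this prediction rule; the matrix sizes, the random initialization, and the predict helper are illustrative assumptions, not part of the original slides.

```python
import numpy as np

# Illustrative shapes: D latent features, N users, M movies.
D, N, M = 5, 4, 6
rng = np.random.default_rng(0)
U = rng.normal(size=(D, N))   # latent user features, one column per user
V = rng.normal(size=(D, M))   # latent movie features, one column per movie

def predict(i, j):
    """Predicted rating of user i for movie j: the dot product U_i^T V_j."""
    return U[:, i] @ V[:, j]

# Predicting every entry at once gives the low-rank reconstruction R_hat = U^T V.
R_hat = U.T @ V               # shape (N, M)
assert np.isclose(R_hat[2, 3], predict(2, 3))
```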

Probabilistic Matrix Factorization (PMF). PMF is a simple probabilistic linear model with Gaussian observation noise. [Figure: graphical model with plates over users i=1,...,N and movies j=1,...,M; $U_i$ and $V_j$ are parents of $R_{ij}$.] Given the feature vectors for the user and the movie, the distribution of the corresponding rating is: $p(R_{ij} \mid U_i, V_j, \sigma^2) = \mathcal{N}(R_{ij} \mid U_i^T V_j, \sigma^2)$. The user and movie feature vectors are given zero-mean spherical Gaussian priors: $p(U \mid \sigma_U^2) = \prod_{i=1}^N \mathcal{N}(U_i \mid 0, \sigma_U^2 I)$, $p(V \mid \sigma_V^2) = \prod_{j=1}^M \mathcal{N}(V_j \mid 0, \sigma_V^2 I)$.
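
The generative story can be written out directly. The sketch below assumes illustrative values for D, N, M and for the standard deviations, and simply draws the feature matrices from their priors and adds observation noise.

```python
import numpy as np

D, N, M = 5, 100, 80
sigma, sigma_U, sigma_V = 0.5, 1.0, 1.0   # illustrative noise/prior std deviations
rng = np.random.default_rng(1)

# Draw user and movie feature vectors from the zero-mean spherical Gaussian priors.
U = rng.normal(0.0, sigma_U, size=(D, N))
V = rng.normal(0.0, sigma_V, size=(D, M))

# Each rating is the dot product of the feature vectors plus Gaussian observation noise.
R = U.T @ V + rng.normal(0.0, sigma, size=(N, M))
```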

Learning (I). MAP learning: maximize the log-posterior over movie and user features with fixed hyperparameters. This is equivalent to minimizing the sum-of-squared-errors objective with quadratic regularization terms: $E = \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^M I_{ij} \left( R_{ij} - U_i^T V_j \right)^2 + \frac{\lambda_U}{2} \sum_{i=1}^N \| U_i \|_{\mathrm{Fro}}^2 + \frac{\lambda_V}{2} \sum_{j=1}^M \| V_j \|_{\mathrm{Fro}}^2$, where $\lambda_U = \sigma^2 / \sigma_U^2$, $\lambda_V = \sigma^2 / \sigma_V^2$, and $I_{ij} = 1$ if user i rated movie j and 0 otherwise.

Learning (II). $E = \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^M I_{ij} \left( R_{ij} - U_i^T V_j \right)^2 + \frac{\lambda_U}{2} \sum_{i=1}^N \| U_i \|_{\mathrm{Fro}}^2 + \frac{\lambda_V}{2} \sum_{j=1}^M \| V_j \|_{\mathrm{Fro}}^2$. Find a local minimum by performing gradient descent in U and V. If all ratings were observed, the objective reduces to the SVD objective in the limit of prior variances going to infinity. PMF can thus be viewed as a probabilistic extension of SVD that works well even when most entries in R are missing.
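
A minimal sketch of MAP learning by batch gradient descent on this objective follows; the function name, learning rate, epoch count, and regularization values are illustrative assumptions (the slides do not specify an optimizer schedule).

```python
import numpy as np

def train_pmf(R, I, D=10, lam_U=0.01, lam_V=0.01, lr=0.005, epochs=100, seed=0):
    """Batch gradient descent on the PMF MAP objective (a minimal sketch).

    R : (N, M) rating matrix, I : (N, M) 0/1 observed-entry indicator.
    Hyperparameter values here are illustrative, not the ones from the paper.
    """
    N, M = R.shape
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((D, N))   # user features
    V = 0.1 * rng.standard_normal((D, M))   # movie features

    for _ in range(epochs):
        E = I * (R - U.T @ V)               # residuals on observed entries only
        grad_U = -V @ E.T + lam_U * U       # dE/dU
        grad_V = -U @ E + lam_V * V         # dE/dV
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V
```

Predictions for held-out user/movie pairs are then entries of U.T @ V, typically clipped to the valid rating range 1..K.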

Automatic Complexity Control for PMF (I). Model complexity is controlled by the noise variance $\sigma^2$ and the parameters of the priors ($\sigma_U^2$ and $\sigma_V^2$). [Figure: graphical model with hyperpriors $\alpha_U, \alpha_V$ over the prior parameters $\Theta_U, \Theta_V$.] Approach: find a MAP estimate for the hyperparameters after introducing priors for them. Learning: find a point estimate of parameters and hyperparameters by maximizing the log-posterior: $\ln p(U, V, \sigma^2, \Theta_U, \Theta_V \mid R) = \ln p(R \mid U, V, \sigma^2) + \ln p(U \mid \Theta_U) + \ln p(V \mid \Theta_V) + \ln p(\Theta_U) + \ln p(\Theta_V) + C$.

Automatic Complexity Control for PMF (II). We can use more sophisticated regularization methods than simple penalization of the Frobenius norm of the feature matrices: priors with diagonal or full covariance matrices and adjustable means, or even mixture-of-Gaussians priors. Using spherical Gaussian priors for the feature vectors leads to standard PMF with $\lambda_U$ and $\lambda_V$ chosen automatically. Automatic selection of the hyperparameter values worked considerably better than the manual approach that used a validation set.
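
One simple way to realize this automatic selection, assuming flat hyperpriors so that the hyperparameter step has a closed form, is to alternate gradient steps on U and V with point-estimate updates of the variances. The sketch below (update_hyperparameters is a hypothetical helper) is an illustrative simplification, not necessarily the exact scheme used in the paper.

```python
import numpy as np

def update_hyperparameters(R, I, U, V):
    """One closed-form hyperparameter step, assuming flat hyperpriors (a sketch).

    Maximizing the log-posterior over sigma^2, sigma_U^2, sigma_V^2 with U and V
    held fixed gives the usual Gaussian variance estimates; lambda_U and lambda_V
    then follow from lambda = sigma^2 / sigma_prior^2.
    """
    D, N = U.shape
    _, M = V.shape
    sigma2_U = np.sum(U ** 2) / (N * D)                      # prior variance for users
    sigma2_V = np.sum(V ** 2) / (M * D)                      # prior variance for movies
    sigma2 = np.sum(I * (R - U.T @ V) ** 2) / np.sum(I)      # observation noise variance
    lam_U, lam_V = sigma2 / sigma2_U, sigma2 / sigma2_V
    return sigma2, lam_U, lam_V
```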

Constrained PMF (I). Two users who have rated similar movies are likely to have more similar preferences than two randomly chosen users. Idea: make the prior for the user feature vector depend on the movies the user has rated. This will force users who have seen the same (or similar) movies to have similar prior distributions for their feature vectors.

Constrained PMF (II). Let $W \in \mathbb{R}^{D \times M}$ be a latent similarity constraint matrix. We define the feature vector for user i as: $U_i = Y_i + \frac{\sum_{k=1}^M I_{ik} W_k}{\sum_{k=1}^M I_{ik}}$. [Figure: graphical model for constrained PMF with variables $Y_i$, $W_k$, $V_j$, and $I_i$.] Here I is the observed indicator matrix, with $I_{ij} = 1$ if user i rated movie j and 0 otherwise. Constrained PMF performs considerably better on infrequent users.

Constrained PMF (III). The feature vector for user i: $U_i = Y_i + \frac{\sum_{k=1}^M I_{ik} W_k}{\sum_{k=1}^M I_{ik}}$. For standard PMF, $U_i$ and $Y_i$ are equal because the prior mean is fixed at zero. The model: $p(R \mid Y, V, W, \sigma^2) = \prod_{i=1}^N \prod_{j=1}^M \left[ \mathcal{N}\!\left( R_{ij} \,\Big|\, \Big[ Y_i + \frac{\sum_{k=1}^M I_{ik} W_k}{\sum_{k=1}^M I_{ik}} \Big]^T V_j, \sigma^2 \right) \right]^{I_{ij}}$, with $p(W \mid \sigma_W) = \prod_{k=1}^M \mathcal{N}(W_k \mid 0, \sigma_W^2 I)$.
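
A small sketch of how the constrained user features can be computed for all users at once; the helper name and the guard against users with no ratings are assumptions made for illustration.

```python
import numpy as np

def constrained_user_features(Y, W, I):
    """U_i = Y_i + (sum_k I_ik W_k) / (sum_k I_ik), vectorised over all users.

    Y : (D, N) free user features, W : (D, M) similarity-constraint matrix,
    I : (N, M) 0/1 indicator of which movies each user rated (a sketch).
    """
    counts = np.maximum(I.sum(axis=1), 1)   # avoid division by zero for users with no ratings
    offsets = (W @ I.T) / counts            # (D, N): average of W_k over the movies each user rated
    return Y + offsets

# Prediction is then U_i^T V_j exactly as in standard PMF:
# R_hat = constrained_user_features(Y, W, I).T @ V
```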

The Netflix Dataset. The Netflix dataset is large, sparse, and imbalanced. The training set: 100,480,507 ratings from 480,189 users on 17,770 movies. The validation set: 1,408,395 ratings. The test set: 2,817,131 user/movie pairs with ratings withheld. The dataset is very imbalanced: the number of ratings entered by each user ranges from 1 to over 15,000. Performance is assessed by submitting predictions to Netflix, which prevents accidental cheating since the test answers are known only to Netflix.

Experimental Results. [Figure: RMSE as a function of training epochs for 10D (left panel) and 30D (right panel) feature vectors, comparing SVD, PMF variants, and the Netflix baseline score.] Performance of SVD, PMF, and PMF with adaptive priors, using 10D and 30D feature vectors, on the full Netflix validation set.

Experimental Results. [Figure: left panel shows RMSE vs. number of observed ratings per user; right panel shows the percentage of users vs. number of observed ratings.] Left panel: performance of constrained PMF, PMF, and a movie-average algorithm that always predicts the average rating of each movie. Right panel: distribution of the number of ratings per user in the training dataset.

Experimental Results. [Figure: RMSE vs. training epochs for constrained PMF with and without the test-set rated/unrated indicator.] Performance of constrained PMF that uses additional rated/unrated information from the test dataset. Netflix tells us in advance which user/movie pairs occur in the test set.

Bayesian PMF? Training: PMF models are trained efficiently by finding point estimates of model parameters and hyperparameters. Can we take a fully Bayesian approach by placing proper priors over the hyperparameters and resorting to MCMC methods? With 100 million ratings, 0.5 million users, and 18 thousand movies? Initially this seemed infeasible to us due to the great computational cost of handling a dataset of this size.

Bayesian PMF! Bayesian PMF implemented using MCMC can be surprisingly efficient. Going fully Bayesian improves performance by nearly 1.5% compared to just doing MAP.
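
For intuition, here is a minimal sketch of a single Gibbs update for one user vector with the hyperparameters held fixed at spherical values. The full Bayesian PMF model also places priors on the hyperparameters and samples them in turn, which this sketch omits; all parameter values are illustrative assumptions.

```python
import numpy as np

def sample_user_vector(i, R, I, V, sigma2=0.25, sigma2_U=1.0, rng=None):
    """One Gibbs update for U_i given V and the ratings of user i (a sketch).

    The conditional posterior over U_i is Gaussian: its precision collects the
    outer products of the rated movies' feature vectors, and its mean weights
    their feature vectors by the observed ratings.
    """
    rng = rng or np.random.default_rng()
    D = V.shape[0]
    rated = I[i] == 1                                   # movies rated by user i
    V_i = V[:, rated]                                   # (D, n_i)
    r_i = R[i, rated]                                   # (n_i,)
    precision = np.eye(D) / sigma2_U + (V_i @ V_i.T) / sigma2
    cov = np.linalg.inv(precision)
    mean = cov @ (V_i @ r_i) / sigma2
    return rng.multivariate_normal(mean, cov)
```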

Variations on PMF. Many variations on PMF are possible: non-negative matrix factorization; different training methods: stochastic, minibatch, alternating least squares, variational Bayes, particle filtering; etc.

THE END