Probabilistic Matrix Factorization

Similar documents
Collaborative topic models: motivations cont

Andriy Mnih and Ruslan Salakhutdinov

Causal Inference and Recommendation Systems. David M. Blei Departments of Computer Science and Statistics Data Science Institute Columbia University

Statistical Models. David M. Blei Columbia University. October 14, 2014

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu

Probabilistic Graphical Models

STA 4273H: Statistical Machine Learning

Graphical Models for Collaborative Filtering

Linear Dynamical Systems

Dimension Reduction. David M. Blei. April 23, 2012

Scaling Neighbourhood Methods

Generative Clustering, Topic Modeling, & Bayesian Inference

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

Recommendation Systems

CS281 Section 4: Factor Analysis and PCA

STA 414/2104: Machine Learning

STA 4273H: Statistical Machine Learning

Clustering. Professor Ameet Talwalkar. CS260 Machine Learning Algorithms. March 8

Topic Models and Applications to Short Documents

Probabilistic Graphical Models

Generative Models for Discrete Data

13: Variational inference II

Bayesian Methods for Machine Learning

Contrastive Divergence

Collaborative Topic Modeling for Recommending Scientific Articles

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

STA 4273H: Statistical Machine Learning

CS Lecture 18. Topic Models and LDA

Topic Modelling and Latent Dirichlet Allocation

Recurrent Latent Variable Networks for Session-Based Recommendation

Mixture Models and Expectation-Maximization

Mixed Membership Matrix Factorization

The Expectation Maximization or EM algorithm

Study Notes on the Latent Dirichlet Allocation

Data Mining Techniques

Robust Monte Carlo Methods for Sequential Planning and Decision Making

STA 4273H: Statistical Machine Learning

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

11. Learning graphical models

10708 Graphical Models: Homework 2

Unsupervised Learning

Introduction to Machine Learning. Regression. Computer Science, Tel-Aviv University,

GWAS V: Gaussian processes

Markov Chain Monte Carlo

Mixed Membership Matrix Factorization

Mixed-membership Models (and an introduction to variational inference)

Machine Learning (CS 567) Lecture 5

Variational Inference and Learning. Sargur N. Srihari

Introduction to Gaussian Processes

Introduction to Probabilistic Machine Learning

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes

16 : Approximate Inference: Markov Chain Monte Carlo

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Relational Stacked Denoising Autoencoder for Tag Recommendation. Hao Wang

Variational Inference (11/04/13)

Bayesian Linear Regression [DRAFT - In Progress]

Gaussian Processes in Machine Learning

Probabilistic Graphical Models

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

PROBABILISTIC LATENT SEMANTIC ANALYSIS

AN INTRODUCTION TO TOPIC MODELS

Introduction to Bayesian inference

ECE521 Lecture 19 HMM cont. Inference in HMM

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression

Lecture 2: From Linear Regression to Kalman Filter and Beyond

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

Today. Probability and Statistics. Linear Algebra. Calculus. Naïve Bayes Classification. Matrix Multiplication Matrix Inversion

ECE 5984: Introduction to Machine Learning

K-Means and Gaussian Mixture Models

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Linear Dynamical Systems (Kalman filter)

Introduction to Graphical Models

Gaussian Mixture Model

Deep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

STA 414/2104, Spring 2014, Practice Problem Set #1

Large-scale Ordinal Collaborative Filtering

ECE521 lecture 4: 19 January Optimization, MLE, regularization

Introduction to Machine Learning

CS Machine Learning Qualifying Exam

Content-based Recommendation

COMS 4721: Machine Learning for Data Science Lecture 16, 3/28/2017

COMS 4721: Machine Learning for Data Science Lecture 1, 1/17/2017

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

13 : Variational Inference: Loopy Belief Propagation and Mean Field

Modeling Data with Linear Combinations of Basis Functions. Read Chapter 3 in the text by Bishop

Overfitting, Bias / Variance Analysis

Lecture 3: Latent Variables Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July25, 2006 Machine Learning Summer School, Taiwan

Multiple Linear Regression for the Supervisor Data

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

Latent Variable Models and Expectation Maximization

Learning the Linear Dynamical System with ASOS ( Approximated Second-Order Statistics )

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Learning Gaussian Process Models from Uncertain Data

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

COS513 LECTURE 8 STATISTICAL CONCEPTS

PMR Learning as Inference

Transcription:

Probabilistic Matrix Factorization
David M. Blei
Columbia University
November 25, 2015

1 Dyadic data

One important type of modern data is dyadic data. Dyadic data are measurements on pairs: the observed value $y_{ij}$ represents something measured about the interaction of $i$ and $j$.

One of the most prevalent types of dyadic data is user consumption data, where $y_{ij}$ represents an interaction between user $i$ and item $j$. For example, $y_{ij}$ can be the number of times user $i$ clicked on webpage $j$, whether user $i$ purchased book $j$, the rating user $i$ gave to movie $j$, or how long user $i$ spent at location $j$. This problem has cropped up everywhere: Amazon (purchasing), web sites (clicks), music listening services (listens). It is important both for forming predictions and for understanding people's behavior.

Throughout this lecture, we will use the language of user consumption behavior: there are users, items, and measurements about them. Typically these data are represented in a matrix, the consumption matrix. [DRAW A MATRIX, MOVIES AND USERS; STAR WARS AND ROMCOMS]

As usual, the two types of problems we want to solve with dyadic data are prediction and exploration. In exploration, we want to understand patterns of consumption: what kind of user likes what kind of movie? These problems are less explored in Computer Science, but they are very important in Econometrics. Here we focus on prediction.

In prediction we want to predict which unrated items a user will like. For example, will user #3 like Star Wars? Empire Strikes Back? When Harry Met Sally? (How about the new Star Wars movie? Already this gets interesting. That prediction problem is called the cold start problem. More on that later.) More realistically, in prediction we provide a ranked list of unwatched movies for each user. A good list has movies she will like ranked higher in the list.

[Figure 1: The graphical model for probabilistic matrix factorization. Plates over the $N$ users and $M$ items; $x_i \sim \mathcal{N}_K(0, \sigma_x^2 I)$, $\beta_j \sim \mathcal{N}_K(0, \sigma_\beta^2 I)$, and $y_{ij} \sim p(\cdot \mid x_i^\top \beta_j)$.]

2 Matrix Factorization

In probabilistic matrix factorization, each user and each item is associated with a latent vector. For users we call this vector her preferences; for items we call this vector its attributes. Loosely, whether a user likes an item is governed by the distance between the user's preferences and the item's attributes. If they are closer together in the latent space, then it is more likely that the user will like the item. [DRAW PICTURE IN 2D; INCLUDE STAR WARS AND ROMCOMS]

Formally, the generative process is:

1. For each user $i$, draw each preference $x_{ik} \sim \mathcal{N}(0, \sigma_x^2)$, where $k \in \{1, \ldots, K\}$.
2. For each item $j$, draw each item attribute $\beta_{jk} \sim \mathcal{N}(0, \sigma_\beta^2)$, where $k \in \{1, \ldots, K\}$.
3. For each user-item pair, draw $y_{ij} \sim f(x_i^\top \beta_j)$.

Figure 1 gives the graphical model. (For those that are familiar, this is similar to factor analysis and principal component analysis.)

Given a user/item matrix, our goal is to compute the posterior distribution of $x_i$ (for each user) and $\beta_j$ (for each item). This embeds users and items into a low-dimensional space. It places each user close to the items she likes, and places each item close to the users who like it.
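To make the generative process concrete, here is a minimal simulation sketch. The sizes and variances (n_users, n_items, K, sigma_x, sigma_beta, sigma_y) are illustrative choices, not values from the lecture, and taking $f$ to be Gaussian anticipates the assumption made below.

```python
# Minimal simulation of the PMF generative process (illustrative settings only).
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, K = 100, 50, 5              # assumed sizes, not from the lecture
sigma_x, sigma_beta, sigma_y = 1.0, 1.0, 1.0  # assumed variances

# 1. User preferences x_i ~ N(0, sigma_x^2 I_K).
x = rng.normal(0.0, sigma_x, size=(n_users, K))
# 2. Item attributes beta_j ~ N(0, sigma_beta^2 I_K).
beta = rng.normal(0.0, sigma_beta, size=(n_items, K))
# 3. Ratings y_ij ~ N(x_i^T beta_j, sigma_y^2), taking f to be Gaussian.
y = x @ beta.T + rng.normal(0.0, sigma_y, size=(n_users, n_items))
```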

Intuitively, why does this work? Consider two users, Jack and Diane. Jack has seen every Star Wars movie except for Return of the Jedi. Diane has seen every Star Wars movie. The posterior will want to place Jack near the movies he saw and place Diane near the movies that she saw. But they have seen similar collections of movies. Thus they will be near each other in the latent space, and near the Star Wars movies. Consequently, Return of the Jedi will be close to Jack, and we can recommend that he see it.

As we did for LDA, let's write down the log of the joint distribution. This will help us make these intuitions more rigorous. The joint distribution of hidden and observed variables is

$$\log p(\beta, x, y) = \sum_i \log p(x_i) + \sum_j \log p(\beta_j) + \sum_i \sum_j \log p(y_{ij} \mid x_i, \beta_j). \quad (1)$$

Assume the data are Gaussian,

$$y_{ij} \mid x_i, \beta_j \sim \mathcal{N}(x_i^\top \beta_j, \sigma^2). \quad (2)$$

Thus the last term of the log joint is related to the mean squared error between the observed ratings $y_{ij}$ and the model's representation $x_i^\top \beta_j$. Again, consider a set of users who like the same two movies. The model can place the movies close together in the latent space, place the users in the same vicinity, and explain the ratings. When a new user likes one of those movies, she is placed in the same part of the space; we predict that she likes the other movie too.

The priors in the first two terms of the log joint are regularization terms; they shrink the representations of users and movies, pulling them towards zero. Empirically, regularization is crucial in these models. Practitioners use cross-validation to set the regularization parameters, i.e., the hyperparameters of the prior. When we study regression, we will learn more about regularization and why it is important.
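As a check on Eqs. 1 and 2, here is a hedged sketch of the Gaussian log joint. It assumes a fully observed rating matrix and reuses the variables x, beta, y, sigma_x, sigma_beta, sigma_y from the simulation above; additive constants that do not involve $x$ or $\beta$ are dropped.

```python
import numpy as np

def log_joint(x, beta, y, sigma_x, sigma_beta, sigma_y):
    """log p(beta, x, y) for Gaussian PMF, up to constants not involving x, beta."""
    log_prior_x = -np.sum(x ** 2) / (2.0 * sigma_x ** 2)           # sum_i log p(x_i)
    log_prior_beta = -np.sum(beta ** 2) / (2.0 * sigma_beta ** 2)  # sum_j log p(beta_j)
    residuals = y - x @ beta.T                                     # y_ij - x_i^T beta_j
    log_lik = -np.sum(residuals ** 2) / (2.0 * sigma_y ** 2)       # sum_ij log p(y_ij | x_i, beta_j)
    return log_prior_x + log_prior_beta + log_lik
```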

We turn to inference. First, we discuss the MAP estimates of the preferences $x$ and attributes $\beta$. (Recall that the MAP estimate is the value of the latent variables that maximizes the posterior.) We expand the log joint to include the Gaussian assumptions and remove terms that do not depend on the latent variables,

$$\mathcal{L} = -\sum_k \left( \sum_i \frac{x_{ik}^2}{2\sigma_x^2} + \sum_j \frac{\beta_{jk}^2}{2\sigma_\beta^2} \right) - \sum_{i,j} \frac{(y_{ij} - x_i^\top \beta_j)^2}{2}. \quad (3)$$

(To keep it simpler, we assumed that the likelihood variance is one.)

We take the derivative with respect to $x_{ik}$, user $i$'s preference for factor $k$. We expand the square in the likelihood term and collect the terms that depend on $x_{ik}$,

$$\mathcal{L}(x_{ik}) = -\frac{x_{ik}^2}{2\sigma_x^2} + \sum_j \left( -\frac{y_{ij}^2}{2} + x_{ik}\beta_{jk} y_{ij} - \frac{x_{ik}^2 \beta_{jk}^2}{2} - x_{ik}\beta_{jk} \sum_{\ell \neq k} x_{i\ell}\beta_{j\ell} \right). \quad (4)$$

(The last two terms come from the square of the dot product, $-(x_i^\top \beta_j)^2/2$.)

The derivative with respect to $x_{ik}$ is

$$\frac{\partial \mathcal{L}}{\partial x_{ik}} = -\frac{x_{ik}}{\sigma_x^2} + \sum_j \left( y_{ij}\beta_{jk} - (x_i^\top \beta_j)\beta_{jk} \right) \quad (5)$$

$$= -\frac{x_{ik}}{\sigma_x^2} + \sum_j (y_{ij} - x_i^\top \beta_j)\beta_{jk} \quad (6)$$

$$= -\frac{x_{ik}}{\sigma_x^2} + \sum_j (y_{ij} - \mathbb{E}[Y_{ij} \mid x_i, \beta_j])\beta_{jk}. \quad (7)$$

For those of you who know regression, this is a (regularized) linear regression: the coefficients are $x_i$, the covariates are $\beta_j$, and the response is $y_{ij}$. (There is one response per item.) We will see these types of gradients again when we study generalized linear models.

The derivative with respect to $\beta_{jk}$ is analogous:

$$\frac{\partial \mathcal{L}}{\partial \beta_{jk}} = -\frac{\beta_{jk}}{\sigma_\beta^2} + \sum_i (y_{ij} - \mathbb{E}[Y_{ij} \mid x_i, \beta_j])\, x_{ik}. \quad (8)$$

This is a (regularized) regression where $\beta_j$ are the coefficients, $x_i$ are the covariates, and $y_{ij}$ is the response. (There is one response per user.) Coordinate updates of these latent variables iterate between solving user-based regularized regressions and item-based regularized regressions.
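The alternation between user-side and item-side regularized regressions can be sketched as follows. This is a hedged illustration under assumptions the lecture does not fix: a fully observed rating matrix y, unit likelihood variance, and full ridge solves per user and per item obtained by setting the gradients in Eqs. 5-8 to zero; the function name and default settings are hypothetical.

```python
import numpy as np

def map_coordinate_updates(y, K=5, sigma_x=1.0, sigma_beta=1.0, n_iters=50, seed=0):
    """Alternate user-wise and item-wise ridge regressions to find a MAP estimate."""
    rng = np.random.default_rng(seed)
    n_users, n_items = y.shape
    x = rng.normal(size=(n_users, K))
    beta = rng.normal(size=(n_items, K))
    for _ in range(n_iters):
        # User step: each x_i solves a ridge regression with covariates beta_j
        # and responses y_ij (one response per item); see Eqs. 5-7.
        A = beta.T @ beta + np.eye(K) / sigma_x ** 2
        x = np.linalg.solve(A, beta.T @ y.T).T
        # Item step: each beta_j solves a ridge regression with covariates x_i
        # and responses y_ij (one response per user); see Eq. 8.
        B = x.T @ x + np.eye(K) / sigma_beta ** 2
        beta = np.linalg.solve(B, x.T @ y).T
    return x, beta
```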

Now consider MCMC. The complete conditional of the preference $x_{ik}$ is proportional to the joint. Examining Eq. 4, we see that this is a Gaussian distribution. It is an exponential family with sufficient statistics $x_{ik}$ and $x_{ik}^2$. (A sketch of the resulting Gaussian is given after the lists below.)

We have scratched the surface. Further ideas:
- Tensor factorization
- The latent space model of networks
- Including observed characteristics

Big challenges:
- Sparsity and implicit data
- The cold start problem and content-based recommendation
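As noted above, here is a sketch of that complete conditional, obtained by collecting the terms in Eq. 4 that are linear and quadratic in $x_{ik}$ (still assuming a likelihood variance of one); the notation $x_{i,-k}$ for the remaining coordinates is ours, not the lecture's.

$$x_{ik} \mid x_{i,-k}, \beta, y \sim \mathcal{N}\left(\mu_{ik}, \lambda_{ik}^{-1}\right), \qquad
\lambda_{ik} = \frac{1}{\sigma_x^2} + \sum_j \beta_{jk}^2, \qquad
\mu_{ik} = \frac{1}{\lambda_{ik}} \sum_j \beta_{jk} \Big( y_{ij} - \sum_{\ell \neq k} x_{i\ell}\, \beta_{j\ell} \Big).$$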