Graphical Models for Collaborative Filtering


Graphical Models for Collaborative Filtering
Le Song
Machine Learning II: Advanced Topics, CSE 8803ML, Spring 2012

Sequence modeling (HMM, Kalman filter, etc.)
- Similarity: the same graphical-model topology; learning and inference follow the same principles
- Difference: the actual transition and observation models chosen
- Learning parameters: EM
- Computing the posterior of hidden states: message passing
- Decoding: message passing with max-product
[Figure: chain of hidden states Y_1, Y_2, ..., Y_t, Y_{t+1} with observations X_1, X_2, ..., X_t, X_{t+1}]

MRF vs. CRF
An MRF models the joint distribution of X and Y:
P(X, Y) = \frac{1}{Z} \prod_c \Psi(X_c, Y_c)
- Z does not depend on X or Y
- Some complicated relations in Y may be hard to model
- Learning: each gradient step needs one inference
A CRF models the conditional distribution of X given Y:
P(X \mid Y) = \frac{1}{Z(Y)} \prod_c \Psi(X_c, Y_c)
- Z(Y) is a function of Y
- Avoids modeling complicated relations in Y
- Learning: each gradient step needs one inference per training point

Parameter Learning for Conditional Random Fields

P(X_1, \ldots, X_k \mid Y, \theta) = \frac{1}{Z(Y,\theta)} \exp\Big( \sum_{ij} \theta_{ij} X_i X_j Y + \sum_i \theta_i X_i Y \Big) = \frac{1}{Z(Y,\theta)} \prod_{ij} \exp(\theta_{ij} X_i X_j Y) \prod_i \exp(\theta_i X_i Y)

Z(Y, \theta) = \sum_x \prod_{ij} \exp(\theta_{ij} X_i X_j Y) \prod_i \exp(\theta_i X_i Y)

(X_i X_j Y can be replaced by other feature functions f(x_i, x_j, y).)

Maximize the log conditional likelihood:

cl(\theta; D) = \log \prod_{l=1}^{N} \frac{1}{Z(y^l,\theta)} \prod_{ij} \exp(\theta_{ij} x_i^l x_j^l y^l) \prod_i \exp(\theta_i x_i^l y^l)
= \sum_{l=1}^{N} \Big( \sum_{ij} \log \exp(\theta_{ij} x_i^l x_j^l y^l) + \sum_i \log \exp(\theta_i x_i^l y^l) - \log Z(y^l, \theta) \Big)
= \sum_{l=1}^{N} \Big( \sum_{ij} \theta_{ij} x_i^l x_j^l y^l + \sum_i \theta_i x_i^l y^l - \log Z(y^l, \theta) \Big)

[Figure: chain X_1 - X_2 - X_3 connected to Y]

The term \log Z(y^l, \theta) does not decompose!

Derivatives of the log conditional likelihood

cl(\theta; D) = \frac{1}{N} \sum_{l=1}^{N} \Big( \sum_{ij} \theta_{ij} x_i^l x_j^l y^l + \sum_i \theta_i x_i^l y^l - \log Z(y^l, \theta) \Big)

\frac{\partial cl(\theta; D)}{\partial \theta_{ij}} = \frac{1}{N} \sum_{l=1}^{N} \Big( x_i^l x_j^l y^l - \frac{\partial \log Z(y^l,\theta)}{\partial \theta_{ij}} \Big)
= \frac{1}{N} \sum_{l=1}^{N} \Big( x_i^l x_j^l y^l - \frac{1}{Z(y^l,\theta)} \frac{\partial Z(y^l,\theta)}{\partial \theta_{ij}} \Big)
= \frac{1}{N} \sum_{l=1}^{N} \Big( x_i^l x_j^l y^l - \frac{1}{Z(y^l,\theta)} \sum_x \prod_{ij} \exp(\theta_{ij} X_i X_j y^l) \prod_i \exp(\theta_i X_i y^l) \; X_i X_j y^l \Big)

A convex problem: can find the global optimum.
But we need to do inference for each y^l!

Moment matching condition

\frac{\partial cl(\theta; D)}{\partial \theta_{ij}} = \underbrace{\frac{1}{N} \sum_{l=1}^{N} x_i^l x_j^l y^l}_{\text{empirical covariance}} - \underbrace{\frac{1}{N} \sum_{l=1}^{N} \frac{1}{Z(y^l,\theta)} \sum_x \prod_{ij} \exp(\theta_{ij} X_i X_j y^l) \prod_i \exp(\theta_i X_i y^l) \; X_i X_j y^l}_{\text{covariance under the model}}

Define the empirical distributions
\hat P(X_i, X_j, Y) = \frac{1}{N} \sum_{l=1}^{N} \delta(X_i, x_i^l)\, \delta(X_j, x_j^l)\, \delta(Y, y^l), \qquad \hat P(Y) = \frac{1}{N} \sum_{l=1}^{N} \delta(Y, y^l).

Moment matching:
\frac{\partial cl(\theta; D)}{\partial \theta_{ij}} = E_{\hat P(X_i, X_j, Y)}[X_i X_j Y] - E_{P(X \mid Y, \theta)\hat P(Y)}[X_i X_j Y]

Optimize the MLE for undirected models
\max_\theta cl(\theta; D) is a convex optimization problem. It can be solved by many methods, such as gradient descent or conjugate gradient.

Initialize the model parameters \theta.
Loop until convergence:
  Compute \frac{\partial cl(\theta; D)}{\partial \theta_{ij}} = E_{\hat P(X_i, X_j, Y)}[X_i X_j Y] - E_{P(X \mid Y, \theta)\hat P(Y)}[X_i X_j Y]
  Update \theta_{ij} \leftarrow \theta_{ij} + \eta \, \frac{\partial cl(\theta; D)}{\partial \theta_{ij}}   (gradient ascent on cl)
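To make the loop concrete, here is a small self-contained sketch (my own illustration, not from the slides) that fits the pairwise and unary parameters of a tiny CRF by gradient ascent on the conditional log-likelihood; because the model is tiny, the per-example inference is done by brute-force enumeration instead of message passing.

```python
# Moment-matching gradient ascent for a tiny CRF (illustrative sketch).
# Assumes binary x_i in {-1,+1}, a scalar y per example, a chain of k nodes,
# and features x_i * x_{i+1} * y (pairwise) and x_i * y (unary).
import itertools
import numpy as np

k = 4                                    # number of X nodes; small enough to enumerate exactly
rng = np.random.default_rng(0)

def features(x, y):
    """Stack pairwise features x_i x_{i+1} y and unary features x_i y."""
    pair = np.array([x[i] * x[i + 1] * y for i in range(k - 1)])
    unary = np.array([x[i] * y for i in range(k)])
    return np.concatenate([pair, unary])

def model_expectation(theta, y):
    """E_{P(X | y, theta)}[f(X, y)], computed by enumerating all 2^k configurations."""
    configs = [np.array(c) for c in itertools.product([-1, 1], repeat=k)]
    feats = np.array([features(x, y) for x in configs])
    logits = feats @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # exp(theta . f) / Z(y, theta)
    return probs @ feats

# Toy training pairs (x^l, y^l); in practice these come from labeled sequences.
data = [(rng.choice([-1, 1], size=k), rng.choice([-1, 1])) for _ in range(50)]
empirical = np.mean([features(x, y) for x, y in data], axis=0)

theta, eta = np.zeros(2 * k - 1), 0.1
for _ in range(200):                     # gradient ascent on cl(theta; D)
    model = np.mean([model_expectation(theta, y) for _, y in data], axis=0)
    theta += eta * (empirical - model)   # moment-matching gradient
```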

Collaborative Filtering

Collaborative Filtering
R: rating matrix; U: user factors; V: movie factors

\min_{U, V} f(U, V) = \| R - U V^\top \|_F^2 \quad \text{s.t.}\; U \ge 0,\; V \ge 0,\; k \ll m, n

- Low-rank matrix approximation approach
- Probabilistic matrix factorization
- Bayesian probabilistic matrix factorization

Parameter Estimation and Prediction
The Bayesian approach treats the unknown parameter as a random variable:
P(\theta \mid D) = \frac{P(D \mid \theta) P(\theta)}{P(D)} = \frac{P(D \mid \theta) P(\theta)}{\int P(D \mid \theta) P(\theta)\, d\theta}

Posterior mean estimate: \hat\theta_{Bayes} = \int \theta \, P(\theta \mid D)\, d\theta

Maximum likelihood and MAP estimates:
\hat\theta_{ML} = \arg\max_\theta P(D \mid \theta), \qquad \hat\theta_{MAP} = \arg\max_\theta P(\theta \mid D)

[Figure: plate model with observations X_1, \ldots, X_N and a new point X_{new}]

Bayesian prediction takes into account all possible values of \theta:
P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta)\, P(\theta \mid D)\, d\theta

A frequentist prediction uses a plug-in estimator:
P(x_{new} \mid D) \approx P(x_{new} \mid \hat\theta_{ML}) \quad \text{or} \quad P(x_{new} \mid D) \approx P(x_{new} \mid \hat\theta_{MAP})
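To make the distinction concrete, here is a standard worked example (mine, not from the slides): coin flips with a Beta prior, where every quantity above has a closed form. With D = \{x_1, \ldots, x_N\}, x_i \in \{0,1\}, s = \sum_i x_i, likelihood P(D \mid \theta) = \theta^{s}(1-\theta)^{N-s} and prior P(\theta) = \mathrm{Beta}(\theta \mid a, b):

P(\theta \mid D) = \mathrm{Beta}(\theta \mid a + s,\; b + N - s)
\hat\theta_{ML} = \frac{s}{N}, \qquad \hat\theta_{MAP} = \frac{a + s - 1}{a + b + N - 2}, \qquad \hat\theta_{Bayes} = \frac{a + s}{a + b + N}
P(x_{new} = 1 \mid D) = \int_0^1 \theta\, \mathrm{Beta}(\theta \mid a + s, b + N - s)\, d\theta = \frac{a + s}{a + b + N}

Here the Bayesian predictive probability coincides with the posterior mean, while the plug-in predictions use s/N or the MAP value instead.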

Matrix Factorization Approach
The unconstrained problem with the Frobenius norm can be solved using the SVD:
\arg\min_{U,V} \| R - U V^\top \|_F^2
Global optimum: the rank-k truncated SVD of R.

When we only have sparse (partially observed) entries, the Frobenius norm is computed only over the observed entries, and the SVD can no longer be used directly.
E.g., use nonnegative matrix factorization:
\arg\min_{U,V} \| R - U V^\top \|_F^2, \quad U, V \ge 0
Local optima; overfitting problem.
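A minimal numpy sketch (my own illustration, not from the slides) of the fully observed case: the rank-k truncated SVD gives the best rank-k approximation in Frobenius norm.

```python
# Rank-k approximation of a fully observed rating matrix via truncated SVD.
# Illustrative only; with missing entries this closed-form solution no longer applies.
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 80, 5                                           # users, movies, latent dimension (toy sizes)
R = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # a noiseless rank-k "rating" matrix

U_full, s, Vt = np.linalg.svd(R, full_matrices=False)
U = U_full[:, :k] * s[:k]                                      # absorb singular values into the user factors
V = Vt[:k, :].T                                                # movie factors

print("reconstruction error:", np.linalg.norm(R - U @ V.T))    # ~0, since rank(R) == k
```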

Probabilistic Matrix Factorization (PMF)
Model components
[Figure: PMF graphical model]
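Since the slide's content is a figure, here is the model it depicts, restated from the PMF paper (with \sigma^2, \sigma_U^2, \sigma_V^2 the observation and prior variances and I_{ij} the observed-entry indicator):

p(R \mid U, V, \sigma^2) = \prod_{i=1}^{m} \prod_{j=1}^{n} \big[ \mathcal{N}(R_{ij} \mid U_i^\top V_j, \sigma^2) \big]^{I_{ij}}, \qquad
p(U \mid \sigma_U^2) = \prod_{i=1}^{m} \mathcal{N}(U_i \mid 0, \sigma_U^2 I), \qquad
p(V \mid \sigma_V^2) = \prod_{j=1}^{n} \mathcal{N}(V_j \mid 0, \sigma_V^2 I)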

PMF: Parameter Estimation
Parameter estimation: MAP estimate
\hat\theta_{MAP} = \arg\max_\theta P(\theta \mid D, \alpha) = \arg\max_\theta P(D \mid \theta)\, P(\theta \mid \alpha) = \arg\max_\theta P(\theta, D \mid \alpha)
In the paper:
[Figure from the paper]

PMF: Interpret the Prior as Regularization
Maximize the posterior distribution with respect to the parameters U and V.
This is equivalent to minimizing a sum-of-squares error function with quadratic regularization terms (plug in the Gaussians and take the log).
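Writing this out (the resulting objective as given in the PMF paper, with I_{ij} the observed-entry indicator):

E = \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{n} I_{ij} \big( R_{ij} - U_i^\top V_j \big)^2 + \frac{\lambda_U}{2} \sum_{i=1}^{m} \| U_i \|^2 + \frac{\lambda_V}{2} \sum_{j=1}^{n} \| V_j \|^2, \qquad \lambda_U = \frac{\sigma^2}{\sigma_U^2}, \; \lambda_V = \frac{\sigma^2}{\sigma_V^2}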

PMF: Optimization
Optimization: alternate between U and V.
- With U fixed, the objective is convex in V.
- With V fixed, the objective is convex in U.
Finds a local minimum.
Need to choose the regularization parameters \lambda_U and \lambda_V.
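A small numpy sketch (my own illustration, assuming the regularized sum-of-squares objective above): with one factor matrix fixed, each user or movie vector has a closed-form ridge-regression update, so the alternating scheme can be run as alternating least squares.

```python
# Alternating minimization of the PMF MAP objective (illustrative sketch).
# With one factor fixed, each user/movie vector is a ridge-regression solution.
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 50, 40, 3
R = rng.integers(1, 6, size=(m, n)).astype(float)   # toy ratings in 1..5
I = rng.random((m, n)) < 0.3                         # mask of observed entries
lam_u, lam_v = 0.1, 0.1

U = 0.1 * rng.standard_normal((m, k))
V = 0.1 * rng.standard_normal((n, k))

for _ in range(20):
    for i in range(m):                               # update each user factor with V fixed
        Vi = V[I[i]]                                 # factors of movies rated by user i
        A = Vi.T @ Vi + lam_u * np.eye(k)
        U[i] = np.linalg.solve(A, Vi.T @ R[i, I[i]])
    for j in range(n):                               # update each movie factor with U fixed
        Uj = U[I[:, j]]                              # factors of users who rated movie j
        A = Uj.T @ Uj + lam_v * np.eye(k)
        V[j] = np.linalg.solve(A, Uj.T @ R[I[:, j], j])

rmse = np.sqrt(np.mean((R[I] - (U @ V.T)[I]) ** 2))
print("training RMSE on observed entries:", rmse)
```

(The PMF paper itself uses gradient descent on the same objective; the alternating ridge solves shown here are one convenient way to exploit the convexity in each block.)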

Bayesian PMF: Generative Model
A more flexible prior over the U and V factors.
Hyperparameters: \Theta_U = \{\mu_U, \Lambda_U\}, \quad \Theta_V = \{\mu_V, \Lambda_V\}
[Figure: Bayesian PMF graphical model]

Bayesian PMF: Prior over the Prior
Add a prior over the hyperparameters.
Hyper-hyperparameters: \Theta_0 = \{\mu_0, \nu_0, W_0\}
\mathcal{W} is the Wishart distribution with \nu_0 degrees of freedom and a D \times D scale matrix W_0.
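For reference (restated from the BPMF paper, not from the slide; \beta_0 is an additional scaling constant that the slide folds into \Theta_0), the prior is Gaussian-Wishart:

p(\Theta_U \mid \Theta_0) = p(\mu_U \mid \Lambda_U)\, p(\Lambda_U) = \mathcal{N}\big(\mu_U \mid \mu_0, (\beta_0 \Lambda_U)^{-1}\big)\, \mathcal{W}(\Lambda_U \mid W_0, \nu_0)

and similarly for \Theta_V.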

Bayesian PMF: Predicting New Ratings
Bayesian prediction takes into account all possible values of \theta:
P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta)\, P(\theta \mid D)\, d\theta
In the paper, all parameters and hyperparameters are integrated out.

Bayesian PMF: Sampling for Inference
Use a sampling technique to compute the predictive distribution.
Key idea: approximate an expectation by a sample average:
E[f] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)})
E_{U, V, \Theta_U, \Theta_V \mid R} \big[ p(R_{ij} \mid U_i, V_j) \big] \approx \frac{1}{K} \sum_{k=1}^{K} p\big(R_{ij} \mid U_i^{(k)}, V_j^{(k)}\big)

Gibbs Sampling
Gibbs sampling:
Initialize X = x^{(0)}
For t = 1 to N:
  x_1^{(t)} \sim P(X_1 \mid x_2^{(t-1)}, \ldots, x_K^{(t-1)})
  x_2^{(t)} \sim P(X_2 \mid x_1^{(t)}, x_3^{(t-1)}, \ldots, x_K^{(t-1)})
  \ldots
  x_K^{(t)} \sim P(X_K \mid x_1^{(t)}, \ldots, x_{K-1}^{(t)})

For graphical models, we only need to condition on the variables in the Markov blanket.
[Figure: undirected graph over X_1, \ldots, X_5]
Variants:
- Randomly pick which variable to sample
- Sample block by block
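A tiny runnable sketch (my own illustration, not from the slides): Gibbs sampling from a bivariate Gaussian with correlation rho, where each full conditional is a one-dimensional Gaussian.

```python
# Gibbs sampling from a zero-mean bivariate Gaussian with unit variances and correlation rho.
# Full conditionals: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2 | x1.
import numpy as np

rng = np.random.default_rng(0)
rho, n_iters, burn_in = 0.8, 5000, 500
x1, x2 = 0.0, 0.0
samples = []

for t in range(n_iters):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))   # sample x1 given the current x2
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))   # sample x2 given the new x1
    if t >= burn_in:                                    # discard the burn-in period
        samples.append((x1, x2))

samples = np.array(samples)
print("empirical correlation:", np.corrcoef(samples.T)[0, 1])  # should be close to rho
```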

Bayesian PMF: Gibbs Sampling
We have a directed graphical model, so moralize first.
Markov blankets:
- U_i: R, V, \Theta_U
- V_j: R, U, \Theta_V
- \Theta_U: U, \Theta_0
- \Theta_V: V, \Theta_0

Bayesian PMF: Gibbs Sampling Equation
Gibbs sampling equation for U_i:
[Equation shown in the slide's figure]
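For reference, the conditional stated in the BPMF paper is Gaussian (restated here since the slide shows it only in a figure; \alpha is the rating precision and I_{ij} the observed-entry indicator):

p(U_i \mid R, V, \Theta_U, \alpha) = \mathcal{N}\big(U_i \mid \mu_i^{*}, [\Lambda_i^{*}]^{-1}\big), \qquad
\Lambda_i^{*} = \Lambda_U + \alpha \sum_{j=1}^{n} I_{ij}\, V_j V_j^\top, \qquad
\mu_i^{*} = [\Lambda_i^{*}]^{-1} \Big( \alpha \sum_{j=1}^{n} I_{ij}\, R_{ij} V_j + \Lambda_U \mu_U \Big)

The conditional for V_j has the same form with the roles of users and movies swapped.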

Bayesian PMF: Overall Algorithm
Given V and \Theta_U, the user factors U_i are conditionally independent and can be sampled in parallel (and likewise for the movie factors V_j given U and \Theta_V).
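A condensed runnable sketch of the loop structure (my own simplification: the hyperparameters \mu_U, \Lambda_U, \mu_V, \Lambda_V and the precision alpha are held fixed rather than resampled from their Gaussian-Wishart conditionals, so this is Gibbs over U and V only), ending with the Monte Carlo average of predictions.

```python
# Simplified BPMF-style Gibbs sampler: alternately sample every U_i and every V_j from
# their Gaussian conditionals, then average the predictions U_i^T V_j over samples.
# The full algorithm also resamples the hyperparameters; they are fixed here for brevity.
import numpy as np

rng = np.random.default_rng(0)
m, n, k, alpha = 30, 25, 3, 2.0
R = rng.integers(1, 6, size=(m, n)).astype(float)     # toy ratings in 1..5
I = rng.random((m, n)) < 0.3                          # mask of observed entries
mu_U, Lam_U = np.zeros(k), np.eye(k)
mu_V, Lam_V = np.zeros(k), np.eye(k)
U, V = rng.standard_normal((m, k)), rng.standard_normal((n, k))

def sample_factor(r_vec, mask, other, mu, Lam):
    """Draw one factor vector from its Gaussian full conditional."""
    O = other[mask]                                   # factors of the entities it was rated with
    prec = Lam + alpha * O.T @ O
    cov = np.linalg.inv(prec)
    mean = cov @ (alpha * O.T @ r_vec[mask] + Lam @ mu)
    return rng.multivariate_normal(mean, cov)

pred, n_samples, burn_in = np.zeros((m, n)), 200, 50
for t in range(n_samples):
    for i in range(m):                                # these draws are mutually independent,
        U[i] = sample_factor(R[i], I[i], V, mu_U, Lam_U)        # so they could run in parallel
    for j in range(n):
        V[j] = sample_factor(R[:, j], I[:, j], U, mu_V, Lam_V)
    if t >= burn_in:
        pred += U @ V.T                               # accumulate the Monte Carlo average
pred /= (n_samples - burn_in)                         # approximate posterior-mean predictions
```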

Experiments: RMSE

Experiments: Runtime

Experiments: Posterior distributions

Experiments: Users with different amounts of rating history

Experiments: Effect of training-set size

Issue: Diagnosing convergence
Gibbs sampling: take samples only after the burn-in period.
[Figure: trace plot of sampled value vs. iteration number]