Graphical Models for Collaborative Filtering

Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

Sequence modeling
HMM, Kalman filter, etc.
Similarity: the same graphical model topology; learning and inference follow the same principles.
Difference: the actual transition and observation models chosen.
Parameter learning: EM.
Computing the posterior of hidden states: message passing.
Decoding: message passing with max-product.
[Figure: chain-structured graphical model over $Y_1, Y_2, \dots, Y_t, Y_{t+1}$ and $X_1, X_2, \dots, X_t, X_{t+1}$]

MRF vs. CRF
MRF: models the joint distribution of X and Y
$$P(X, Y) = \frac{1}{Z} \prod_c \Psi(X_c, Y_c)$$
Z does not depend on X or Y.
Some complicated relations in Y may be hard to model.
Learning: each gradient needs one inference.
CRF: models the conditional distribution of X given Y
$$P(X \mid Y) = \frac{1}{Z(Y)} \prod_c \Psi(X_c, Y_c)$$
Z(Y) is a function of Y.
Avoids modeling complicated relations in Y.
Learning: in each gradient, one inference per training point.

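Parameter Learning for Conditional Random Fields
Model:
$$P(X_1, \dots, X_k \mid Y, \theta) = \frac{1}{Z(Y,\theta)} \exp\Big(\sum_{ij} \theta_{ij} X_i X_j Y + \sum_i \theta_i X_i Y\Big) = \frac{1}{Z(Y,\theta)} \prod_{ij} \exp(\theta_{ij} X_i X_j Y) \prod_i \exp(\theta_i X_i Y)$$
$$Z(Y,\theta) = \sum_x \prod_{ij} \exp(\theta_{ij} X_i X_j Y) \prod_i \exp(\theta_i X_i Y)$$
Maximize the log conditional likelihood:
$$cl(\theta, D) = \log \prod_{l=1}^{N} \frac{1}{Z(y^l,\theta)} \prod_{ij} \exp(\theta_{ij} x_i^l x_j^l y^l) \prod_i \exp(\theta_i x_i^l y^l)$$
$$= \sum_{l=1}^{N} \Big( \sum_{ij} \log \exp(\theta_{ij} x_i^l x_j^l y^l) + \sum_i \log \exp(\theta_i x_i^l y^l) - \log Z(y^l, \theta) \Big)$$
$$= \sum_{l=1}^{N} \Big( \sum_{ij} \theta_{ij} x_i^l x_j^l y^l + \sum_i \theta_i x_i^l y^l - \log Z(y^l, \theta) \Big)$$
Note: $x_i$ can be replaced by another feature function $f(x_i)$.
The term $\log Z(y^l, \theta)$ does not decompose!
[Figure: graphical model over $X_1, X_2, X_3$ and $Y$]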

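Derivatives of log likelihood
With the log-likelihood averaged over the N training points:
$$cl(\theta, D) = \frac{1}{N} \sum_{l=1}^{N} \Big( \sum_{ij} \theta_{ij} x_i^l x_j^l y^l + \sum_i \theta_i x_i^l y^l - \log Z(y^l, \theta) \Big)$$
$$\frac{\partial\, cl(\theta, D)}{\partial \theta_{ij}} = \frac{1}{N} \sum_{l=1}^{N} \Big( x_i^l x_j^l y^l - \frac{\partial \log Z(y^l,\theta)}{\partial \theta_{ij}} \Big)$$
$$= \frac{1}{N} \sum_{l=1}^{N} \Big( x_i^l x_j^l y^l - \frac{1}{Z(y^l,\theta)} \frac{\partial Z(y^l,\theta)}{\partial \theta_{ij}} \Big)$$
$$= \frac{1}{N} \sum_{l=1}^{N} \Big( x_i^l x_j^l y^l - \frac{1}{Z(y^l,\theta)} \sum_x \prod_{ij} \exp(\theta_{ij} X_i X_j y^l) \prod_i \exp(\theta_i X_i y^l)\, X_i X_j y^l \Big)$$
A convex problem: the global optimum can be found.
Inference is needed for each $y^l$!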

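Moment matching condition
$$\frac{\partial\, cl(\theta, D)}{\partial \theta_{ij}} = \frac{1}{N} \sum_{l=1}^{N} \Big( x_i^l x_j^l y^l - \frac{1}{Z(y^l,\theta)} \sum_x \prod_{ij} \exp(\theta_{ij} X_i X_j y^l) \prod_i \exp(\theta_i X_i y^l)\, X_i X_j y^l \Big)$$
The first term is the empirical covariance matrix; the second is the covariance matrix from the model.
Empirical distributions:
$$P(X_i, X_j, Y) = \frac{1}{N} \sum_{l=1}^{N} \delta(X_i, x_i^l)\, \delta(X_j, x_j^l)\, \delta(Y, y^l)$$
$$P(Y) = \frac{1}{N} \sum_{l=1}^{N} \delta(Y, y^l)$$
Moment matching:
$$\frac{\partial\, cl(\theta, D)}{\partial \theta_{ij}} = E_{P(X_i, X_j, Y)}[X_i X_j Y] - E_{P(X \mid Y, \theta) P(Y)}[X_i X_j Y]$$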

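Optimize MLE for undirected models
$\max_\theta cl(\theta, D)$ is a convex optimization problem (the objective is concave in $\theta$).
It can be solved by many methods, such as gradient ascent or conjugate gradient.
Initialize model parameters $\theta$
Loop until convergence:
  Compute $\frac{\partial\, cl(\theta, D)}{\partial \theta_{ij}} = E_{P(X_i, X_j, Y)}[X_i X_j Y] - E_{P(X \mid Y, \theta) P(Y)}[X_i X_j Y]$
  Update $\theta_{ij} \leftarrow \theta_{ij} + \eta \frac{\partial\, cl(\theta, D)}{\partial \theta_{ij}}$ (an ascent step, since $cl$ is maximized)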
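The loop above can be sketched in a few lines of Python for a toy model. This is only an illustration under stated assumptions: three label variables taking values in {-1, +1}, a scalar input y, and brute-force enumeration of all joint states standing in for message passing. The names `fit`, `gradient`, `conditional`, the step size `eta = 0.1`, and the five training points are all made up for the example.

```python
import itertools
import math

K = 3  # number of label variables X_1..X_K, each in {-1, +1} (toy assumption)
STATES = list(itertools.product([-1, 1], repeat=K))
PAIRS = [(i, j) for i in range(K) for j in range(i + 1, K)]


def score(x, y, theta_pair, theta_single):
    """Unnormalized log-probability: sum_ij theta_ij x_i x_j y + sum_i theta_i x_i y."""
    s = sum(theta_pair[(i, j)] * x[i] * x[j] * y for (i, j) in PAIRS)
    s += sum(theta_single[i] * x[i] * y for i in range(K))
    return s


def conditional(y, theta_pair, theta_single):
    """P(x | y, theta) for every joint state x, by brute-force enumeration."""
    weights = [math.exp(score(x, y, theta_pair, theta_single)) for x in STATES]
    z = sum(weights)  # partition function Z(y, theta)
    return [w / z for w in weights]


def log_likelihood(data, theta_pair, theta_single):
    """Average log conditional likelihood over the training points."""
    total = 0.0
    for x, y in data:
        probs = conditional(y, theta_pair, theta_single)
        total += math.log(probs[STATES.index(tuple(x))])
    return total / len(data)


def gradient(data, theta_pair, theta_single):
    """Empirical moments minus model moments (one inference per training point)."""
    g_pair = {p: 0.0 for p in PAIRS}
    g_single = [0.0] * K
    n = len(data)
    for x, y in data:
        probs = conditional(y, theta_pair, theta_single)
        for (i, j) in PAIRS:
            model = sum(q * s[i] * s[j] * y for q, s in zip(probs, STATES))
            g_pair[(i, j)] += (x[i] * x[j] * y - model) / n
        for i in range(K):
            model = sum(q * s[i] * y for q, s in zip(probs, STATES))
            g_single[i] += (x[i] * y - model) / n
    return g_pair, g_single


def fit(data, eta=0.1, steps=200):
    """Plain gradient ascent on the average log conditional likelihood."""
    theta_pair = {p: 0.0 for p in PAIRS}
    theta_single = [0.0] * K
    for _ in range(steps):
        g_pair, g_single = gradient(data, theta_pair, theta_single)
        for p in PAIRS:
            theta_pair[p] += eta * g_pair[p]
        for i in range(K):
            theta_single[i] += eta * g_single[i]
    return theta_pair, theta_single


# Hypothetical training set of (x, y) pairs.
data = [((1, 1, 1), 1), ((1, 1, -1), 1), ((-1, -1, -1), -1),
        ((-1, 1, 1), 1), ((1, -1, 1), -1)]
theta_pair, theta_single = fit(data)
```

Enumerating all 2^K states is only feasible for very small K; in a real model the expectation under $P(X \mid y^l, \theta)$ would come from an inference routine instead.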

Collaborative Filtering

Collaborative Filtering
R: rating matrix; U: user factor; V: movie factor
$$\min_{U, V} f(U, V) = \| R - U V^T \|_F^2 \quad \text{s.t. } U \ge 0,\ V \ge 0,\ k \ll m, n$$
Approaches:
Low rank matrix approximation
Probabilistic matrix factorization
Bayesian probabilistic matrix factorization

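Parameter Estimation and Prediction
The Bayesian approach treats the unknown parameters as a random variable:
$$P(\theta \mid D) = \frac{P(D \mid \theta) P(\theta)}{P(D)} = \frac{P(D \mid \theta) P(\theta)}{\int P(D \mid \theta) P(\theta)\, d\theta}$$
Posterior mean estimation:
$$\theta_{bayes} = \int \theta\, P(\theta \mid D)\, d\theta$$
Maximum likelihood and MAP approaches:
$$\theta_{ML} = \arg\max_\theta P(D \mid \theta), \qquad \theta_{MAP} = \arg\max_\theta P(\theta \mid D)$$
Bayesian prediction takes into account all possible values of $\theta$:
$$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta) P(\theta \mid D)\, d\theta$$
A frequentist prediction uses a plug-in estimator:
$$P(x_{new} \mid D) = P(x_{new} \mid \theta_{ML}) \quad \text{or} \quad P(x_{new} \mid D) = P(x_{new} \mid \theta_{MAP})$$
[Figure: graphical model with parameter $\theta$, observations $X$ (plate of size N), and $X_{new}$]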
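To make the difference between plug-in and Bayesian prediction concrete, here is a small sketch for a coin-flip model. It assumes a Bernoulli likelihood with a Beta(a, b) prior, which is not part of the slides; the function names and the default a = b = 2 are chosen only for illustration. For this model the Bayesian predictive probability of heads equals the posterior mean of $\theta$.

```python
def posterior_params(heads, tails, a=2.0, b=2.0):
    """Beta(a, b) prior with Bernoulli data gives a Beta(a + heads, b + tails) posterior."""
    return a + heads, b + tails


def predict_ml(heads, tails):
    """Plug-in prediction P(x_new = 1 | theta_ML); theta_ML is the observed frequency."""
    return heads / (heads + tails)


def predict_map(heads, tails, a=2.0, b=2.0):
    """Plug-in prediction with theta_MAP, the mode of the Beta posterior (needs pa, pb > 1)."""
    pa, pb = posterior_params(heads, tails, a, b)
    return (pa - 1.0) / (pa + pb - 2.0)


def predict_bayes(heads, tails, a=2.0, b=2.0):
    """Bayesian prediction: integral of theta * P(theta | D), i.e. the posterior mean."""
    pa, pb = posterior_params(heads, tails, a, b)
    return pa / (pa + pb)


# Three heads and no tails: the three predictions disagree noticeably.
p_ml = predict_ml(3, 0)        # 1.0
p_map = predict_map(3, 0)      # 4/5
p_bayes = predict_bayes(3, 0)  # 5/7
```

With so little data the maximum likelihood plug-in predicts heads with certainty, while the Bayesian prediction stays more cautious because it averages over all values of $\theta$.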

Matrix Factorization Approach
The unconstrained problem with Frobenius norm can be solved using SVD, which gives the global optimum:
$$\arg\min_{U,V} \| R - U V^T \|_F^2$$
[Figure: R factored as the product of U and V]
When only sparse entries are observed, the Frobenius norm is computed over the observed entries only, and SVD can no longer be used.
E.g., use nonnegative matrix factorization:
$$\arg\min_{U,V} \| R - U V^T \|_F^2, \quad U \ge 0,\ V \ge 0$$
This finds only a local optimum and is prone to overfitting.
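For the fully observed, unconstrained case, the SVD solution can be sketched with NumPy as below. The helper name `best_rank_k`, the random 6 x 5 matrix, and the choice k = 2 are illustrative assumptions. Splitting the singular values evenly between the two factors is one convention among many.

```python
import numpy as np


def best_rank_k(R, k):
    """Rank-k factors U_k, V_k with U_k @ V_k.T equal to the truncated SVD of R."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    root = np.sqrt(s[:k])
    U_k = U[:, :k] * root      # scale columns by sqrt of singular values
    V_k = Vt[:k, :].T * root
    return U_k, V_k, s


rng = np.random.default_rng(0)
R = rng.normal(size=(6, 5))    # toy, fully observed "rating" matrix
U_k, V_k, s = best_rank_k(R, k=2)

# Frobenius error of the rank-2 approximation.
err = np.linalg.norm(R - U_k @ V_k.T, "fro")
# By the Eckart-Young theorem it equals the norm of the discarded singular values.
expected = np.sqrt(np.sum(s[2:] ** 2))
```

This shortcut does not carry over to a matrix with missing entries, which is the situation the rest of the section addresses.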

Probabilistic matrix factorization (PMF)
Model components

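PMF: Parameter Estimation
Parameter estimation: MAP estimate
$$\theta_{MAP} = \arg\max_\theta P(\theta \mid D, \alpha) = \arg\max_\theta P(D \mid \theta) P(\theta \mid \alpha) = \arg\max_\theta P(\theta, D \mid \alpha)$$
In the paper: [equation not preserved]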

PMF: Interpret prior as regularization
Maximize the posterior distribution with respect to the parameters U and V.
This is equivalent to minimizing the sum-of-squares error function with a quadratic regularization term (plug in the Gaussians and take the log).
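The step "plug in Gaussians and take log" can be written out as follows. This is a sketch under assumptions that the text here does not spell out: a Gaussian likelihood with variance $\sigma^2$ on the observed entries (indicator $I_{ij}$), and zero-mean spherical Gaussian priors with variances $\sigma_U^2$ and $\sigma_V^2$ on the rows $U_i$ and $V_j$. The symbols $I_{ij}$, $\sigma$, $\sigma_U$, $\sigma_V$ are introduced only for this illustration.

```latex
\begin{align*}
-\ln p(U, V \mid R)
  &= \frac{1}{2\sigma^2} \sum_{i,j} I_{ij} \left( R_{ij} - U_i^T V_j \right)^2
   + \frac{1}{2\sigma_U^2} \sum_i \| U_i \|^2
   + \frac{1}{2\sigma_V^2} \sum_j \| V_j \|^2
   + \text{const} \\
\intertext{Multiplying by $\sigma^2$ and dropping the constant gives the regularized sum-of-squares objective}
E(U, V)
  &= \frac{1}{2} \sum_{i,j} I_{ij} \left( R_{ij} - U_i^T V_j \right)^2
   + \frac{\lambda_U}{2} \sum_i \| U_i \|^2
   + \frac{\lambda_V}{2} \sum_j \| V_j \|^2,
\qquad \lambda_U = \frac{\sigma^2}{\sigma_U^2},\quad \lambda_V = \frac{\sigma^2}{\sigma_V^2}.
\end{align*}
```

Under these assumptions a broader prior (larger $\sigma_U$ or $\sigma_V$) corresponds to weaker regularization.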

PMF: optimization
Optimization: alternate between U and V.
With U fixed, the problem is convex in V.
With V fixed, the problem is convex in U.
This finds a local minimum.
The regularization parameters $\lambda_U$ and $\lambda_V$ need to be chosen.
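A minimal sketch of the alternating scheme, assuming the regularized squared-error objective over observed entries and solving each convex subproblem exactly as a small ridge regression per row. The function names, the toy 4 x 4 rating matrix, the convention that a 0 entry marks a missing rating, and the values of `lam_u` and `lam_v` are all assumptions made for the example.

```python
import numpy as np


def objective(R, M, U, V, lam_u, lam_v):
    """Squared error on observed entries plus quadratic regularization."""
    resid = M * (R - U @ V.T)
    return (0.5 * np.sum(resid ** 2)
            + 0.5 * lam_u * np.sum(U ** 2)
            + 0.5 * lam_v * np.sum(V ** 2))


def update_rows(R, M, F, lam):
    """With factor F fixed, solve the ridge regression for every row exactly."""
    k = F.shape[1]
    out = np.zeros((R.shape[0], k))
    for i in range(R.shape[0]):
        obs = M[i] > 0                       # entries observed in row i
        A = F[obs].T @ F[obs] + lam * np.eye(k)
        b = F[obs].T @ R[i, obs]
        out[i] = np.linalg.solve(A, b)
    return out


def als(R, M, k=2, lam_u=0.1, lam_v=0.1, iters=20, seed=0):
    """Alternate exact minimization over U (V fixed) and V (U fixed)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(R.shape[0], k))
    V = 0.1 * rng.normal(size=(R.shape[1], k))
    history = []
    for _ in range(iters):
        U = update_rows(R, M, V, lam_u)
        V = update_rows(R.T, M.T, U, lam_v)
        history.append(objective(R, M, U, V, lam_u, lam_v))
    return U, V, history


# Toy ratings; a 0 marks a missing entry in this example only.
R = np.array([[5, 4, 0, 1],
              [4, 0, 1, 1],
              [1, 1, 5, 0],
              [0, 1, 4, 5]], dtype=float)
M = (R > 0).astype(float)
U, V, history = als(R, M)
```

Because each half-step minimizes the objective exactly over one factor, the recorded objective values never increase, although the point reached depends on the initialization.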

Bayesian PMF: generative model
A more flexible prior over the U and V factors
Hyperparameters: $\Theta_U = \{\mu_U, \Lambda_U\}$, $\Theta_V = \{\mu_V, \Lambda_V\}$

Bayesian PMF: prior over prior
Add a prior over the hyperparameters.
Hyper-hyperparameters: $\Theta_0 = \{\mu_0, \nu_0, W_0\}$
$\mathcal{W}$ is the Wishart distribution with $\nu_0$ degrees of freedom and a $D \times D$ scale matrix $W_0$.

Bayesian PMF: predicting new ratings
Bayesian prediction takes into account all possible values of $\theta$:
$$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta) P(\theta \mid D)\, d\theta$$
In the paper, all parameters and hyperparameters are integrated out.

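Bayesian PMF: sampling for inference
Use a sampling technique to compute the predictive distribution.
Key idea: approximate an expectation by a sample average
$$E[f] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^i)$$
$$E_{U, V, \Theta_U, \Theta_V \mid R}\big[ p(R_{ij} \mid U_i, V_j) \big] \approx \frac{1}{N} \sum_{k} p(R_{ij} \mid U_i^k, V_j^k)$$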
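The sample-average idea can be illustrated on a case where the answer is known. The sketch below estimates $E[x^2]$ for a standard normal variable, whose exact value is 1. The helper name `mc_expectation`, the sample size, and the seed are arbitrary choices for the example.

```python
import random


def mc_expectation(f, sampler, n):
    """Approximate E[f(x)] by averaging f over n draws from sampler()."""
    return sum(f(sampler()) for _ in range(n)) / n


rng = random.Random(0)
# E[x^2] under a standard normal is exactly 1; the estimate should be close.
estimate = mc_expectation(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), 100_000)
```

In Bayesian PMF the draws would instead be samples of the factors and hyperparameters from their posterior, and f would be the predictive density of a rating.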

Gibbs Sampling
Gibbs sampling:
  Initialize $X = x^0$
  For $t = 1$ to $N$:
    $x_1^t \sim P(X_1 \mid x_2^{t-1}, \dots, x_K^{t-1})$
    $x_2^t \sim P(X_2 \mid x_1^t, x_3^{t-1}, \dots, x_K^{t-1})$
    ...
    $x_K^t \sim P(X_K \mid x_1^t, \dots, x_{K-1}^t)$
For graphical models: only the variables in the Markov blanket need to be conditioned on.
Variants:
  Randomly pick the variable to sample
  Sample block by block
[Figure: graphical model over $X_1, X_2, X_3, X_4, X_5$]
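As a concrete instance of the scheme, the sketch below runs a two-variable Gibbs sampler on a standard bivariate normal with correlation `rho`, where each conditional is itself a one-dimensional normal with mean `rho` times the other variable and variance `1 - rho**2`. The target distribution, the function name, and the burn-in length are assumptions chosen for illustration.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a zero-mean, unit-variance bivariate normal with correlation rho."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)   # conditional standard deviation
    x1, x2 = 0.0, 0.0                 # initial state x^0
    samples = []
    for t in range(burn_in + n_samples):
        x1 = rng.gauss(rho * x2, sd)  # draw X1 given the current x2
        x2 = rng.gauss(rho * x1, sd)  # draw X2 given the new x1
        if t >= burn_in:
            samples.append((x1, x2))
    return samples


samples = gibbs_bivariate_normal(rho=0.8, n_samples=20_000)
mean_x1 = sum(a for a, _ in samples) / len(samples)
mean_product = sum(a * b for a, b in samples) / len(samples)  # should be near rho
```

Successive samples are correlated, so more draws are needed than with independent sampling to reach the same accuracy.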

Bayesian PMF: Gibbs sampling
We have a directed graphical model: moralize first.
Markov blankets:
  $U_i$: $R, V, \Theta_U$
  $V_j$: $R, U, \Theta_V$
  $\Theta_U$: $U, \Theta_0$
  $\Theta_V$: $V, \Theta_0$

Bayesian PMF: Gibbs sampling equation
Gibbs sampling equation for $U_i$
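One way such a conditional draw can look is sketched below. It assumes a Gaussian likelihood $R_{ij} \sim \mathcal{N}(U_i^T V_j, \alpha^{-1})$ on observed entries and a Gaussian prior $U_i \sim \mathcal{N}(\mu_U, \Lambda_U^{-1})$; under those assumptions the conditional of $U_i$ given its Markov blanket is Gaussian with the precision and mean computed in the code. The precision parameter `alpha`, the function names, and the toy numbers are introduced for this example and are not taken from the source.

```python
import numpy as np


def user_conditional(r_i, mask_i, V, mu_u, lambda_u, alpha):
    """Mean and precision of the Gaussian conditional of U_i given R, V, Theta_U."""
    V_obs = V[mask_i]                                   # factors of movies rated by user i
    precision = lambda_u + alpha * V_obs.T @ V_obs
    rhs = alpha * V_obs.T @ r_i[mask_i] + lambda_u @ mu_u
    mean = np.linalg.solve(precision, rhs)
    return mean, precision


def sample_user(r_i, mask_i, V, mu_u, lambda_u, alpha, rng):
    """Draw U_i from its Gaussian conditional using a Cholesky factor of the precision."""
    mean, precision = user_conditional(r_i, mask_i, V, mu_u, lambda_u, alpha)
    L = np.linalg.cholesky(precision)                   # precision = L @ L.T
    z = rng.standard_normal(len(mean))
    return mean + np.linalg.solve(L.T, z)               # covariance is inverse precision


# Toy setup: 3 movies, 2 latent dimensions, user rated movies 0 and 2.
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 2))
r_i = np.array([4.0, 0.0, 2.0])
mask_i = np.array([True, False, True])
mu_u = np.zeros(2)
lambda_u = np.eye(2)
u_sample = sample_user(r_i, mask_i, V, mu_u, lambda_u, alpha=2.0, rng=rng)
```

The conditional for one user depends on R, V and the hyperparameters but not on the other users' factors, which is why independent draws per user are possible.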

Bayesian PMF: overall algorithm
Can be sampled in parallel

Experiments: RMSE

Experiments: Runtime

Experiments: posterior distribution

Experiments: users with different history

Experiments: effect of training size

Issue: diagnose convergence
Gibbs sampling: take samples after the burn-in period.
[Figure: trace plot of sampled value against iteration number]
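The effect of burn-in can be seen on a toy Markov chain that starts far from its stationary distribution. The chain below is a simple autoregressive process standing in for a Gibbs sampler; its coefficient, starting value, and the burn-in length of 200 are assumptions picked for illustration.

```python
import random


def run_chain(n_steps, start=50.0, coef=0.9, seed=0):
    """Toy chain x_t = coef * x_{t-1} + noise, whose stationary mean is 0."""
    rng = random.Random(seed)
    x, trace = start, []
    for _ in range(n_steps):
        x = coef * x + rng.gauss(0.0, 1.0)
        trace.append(x)
    return trace


trace = run_chain(2000)
burn_in = 200
kept = trace[burn_in:]                 # discard the early, start-dependent samples
mean_all = sum(trace) / len(trace)     # contaminated by the initial transient
mean_kept = sum(kept) / len(kept)      # estimate after burn-in
```

Plotting `trace` against the iteration number gives the kind of trace plot used to judge when the chain has forgotten its starting point.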