Graphical Models for Collaborative Filtering



1 Graphical Models for Collaborative Filtering
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

2 Sequence modeling
HMM, Kalman filter, etc.
Similarity: the same graphical model topology; learning and inference follow the same principles.
Difference: the actual transition and observation models chosen.
Learning parameters: EM.
Computing the posterior of hidden states: message passing.
Decoding: message passing with max-product.
[Figure: chain-structured graphical model over Y_1, Y_2, ..., Y_t, Y_{t+1} and X_1, X_2, ..., X_t, X_{t+1}]


3 MRF vs. CRF
MRF
Models the joint of X and Y: $P(X, Y) = \frac{1}{Z} \prod_c \Psi(X_c, Y_c)$
Z does not depend on X or Y.
Some complicated relations in Y may be hard to model.
Learning: each gradient needs one inference.
CRF
Models the conditional of X given Y: $P(X \mid Y) = \frac{1}{Z(Y)} \prod_c \Psi(X_c, Y_c)$
Z(Y) is a function of Y.
Avoids modeling complicated relations in Y.
Learning: in each gradient, one inference per training point.


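4 Parameter Learning for Conditional Random Fields
$P(X_1, \dots, X_k \mid Y, \theta) = \frac{1}{Z(Y, \theta)} \exp\Big( \sum_{ij} \theta_{ij} X_i X_j Y + \sum_i \theta_i X_i Y \Big) = \frac{1}{Z(Y, \theta)} \prod_{ij} \exp(\theta_{ij} X_i X_j Y) \prod_i \exp(\theta_i X_i Y)$
$Z(Y, \theta) = \sum_x \prod_{ij} \exp(\theta_{ij} X_i X_j Y) \prod_i \exp(\theta_i X_i Y)$
Maximize the log conditional likelihood:
$cl(\theta, D) = \log \prod_{l=1}^{N} \frac{1}{Z(y^l, \theta)} \prod_{ij} \exp(\theta_{ij} x_i^l x_j^l y^l) \prod_i \exp(\theta_i x_i^l y^l)$
$= \sum_{l=1}^{N} \Big( \sum_{ij} \log \exp(\theta_{ij} x_i^l x_j^l y^l) + \sum_i \log \exp(\theta_i x_i^l y^l) - \log Z(y^l, \theta) \Big)$
$= \sum_{l=1}^{N} \Big( \sum_{ij} \theta_{ij} x_i^l x_j^l y^l + \sum_i \theta_i x_i^l y^l - \log Z(y^l, \theta) \Big)$
[Figure: graphical model over X_1, X_2, X_3 and Y]
Note: other feature functions f(x_i) can be used.
The term $\log Z(y^l, \theta)$ does not decompose!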

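5 Derivatives of log likelihood
$cl(\theta, D) = \frac{1}{N} \sum_{l=1}^{N} \Big( \sum_{ij} \theta_{ij} x_i^l x_j^l y^l + \sum_i \theta_i x_i^l y^l - \log Z(y^l, \theta) \Big)$
$\frac{\partial cl(\theta, D)}{\partial \theta_{ij}} = \frac{1}{N} \sum_{l=1}^{N} \Big( x_i^l x_j^l y^l - \frac{\partial \log Z(y^l, \theta)}{\partial \theta_{ij}} \Big)$
$= \frac{1}{N} \sum_{l=1}^{N} \Big( x_i^l x_j^l y^l - \frac{1}{Z(y^l, \theta)} \frac{\partial Z(y^l, \theta)}{\partial \theta_{ij}} \Big)$
$= \frac{1}{N} \sum_{l=1}^{N} \Big( x_i^l x_j^l y^l - \frac{1}{Z(y^l, \theta)} \sum_x \prod_{ij} \exp(\theta_{ij} X_i X_j y^l) \prod_i \exp(\theta_i X_i y^l) \, X_i X_j y^l \Big)$
A convex problem: the global optimum can be found.
Need to do inference for each $y^l$!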


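6 Moment matching condition
$\frac{\partial cl(\theta, D)}{\partial \theta_{ij}} = \frac{1}{N} \sum_{l=1}^{N} \Big( x_i^l x_j^l y^l - \frac{1}{Z(y^l, \theta)} \sum_x \prod_{ij} \exp(\theta_{ij} X_i X_j y^l) \prod_i \exp(\theta_i X_i y^l) \, X_i X_j y^l \Big)$
The first term is the empirical covariance matrix; the second is the covariance matrix from the model.
Empirical distributions:
$\tilde{P}(X_i, X_j, Y) = \frac{1}{N} \sum_{l=1}^{N} \delta(X_i, x_i^l) \, \delta(X_j, x_j^l) \, \delta(Y, y^l)$
$\tilde{P}(Y) = \frac{1}{N} \sum_{l=1}^{N} \delta(Y, y^l)$
Moment matching:
$\frac{\partial cl(\theta, D)}{\partial \theta_{ij}} = E_{\tilde{P}(X_i, X_j, Y)}[X_i X_j Y] - E_{P(X \mid Y, \theta) \tilde{P}(Y)}[X_i X_j Y]$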

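7 Optimize MLE for undirected models
$\max_\theta cl(\theta, D)$ is a convex optimization problem.
It can be solved by many methods, such as gradient descent or conjugate gradient.
Initialize model parameters θ
Loop until convergence:
Compute $\frac{\partial cl(\theta, D)}{\partial \theta_{ij}} = E_{\tilde{P}(X_i, X_j, Y)}[X_i X_j Y] - E_{P(X \mid Y, \theta) \tilde{P}(Y)}[X_i X_j Y]$
Update $\theta_{ij} \leftarrow \theta_{ij} + \eta \frac{\partial cl(\theta, D)}{\partial \theta_{ij}}$ (a step along the gradient, since cl is being maximized)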
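The loop on this slide can be sketched on a toy model small enough that the inference step is done by brute-force enumeration. This is a minimal sketch under stated assumptions, not the course's implementation: the variables are taken to be X_i in {-1, +1}, the six training pairs are made up, and the names `features`, `model_moments`, `cl_and_gradient` and `fit` are hypothetical. The gradient is computed exactly as in the moment matching condition: empirical features minus their expectation under P(X | y^l, θ).

```python
import itertools
import math

K = 3                                    # number of output variables X_1..X_K
PAIRS = [(i, j) for i in range(K) for j in range(i + 1, K)]
STATES = list(itertools.product([-1, 1], repeat=K))   # all joint assignments of X

def features(x, y):
    """Sufficient statistics: x_i * x_j * y for each pair, then x_i * y."""
    return [x[i] * x[j] * y for i, j in PAIRS] + [x[i] * y for i in range(K)]

def model_moments(theta, y):
    """Return (E_{P(X | y, theta)}[features], log Z(y, theta)) by enumeration."""
    scores = [sum(t * f for t, f in zip(theta, features(x, y))) for x in STATES]
    log_z = math.log(sum(math.exp(s) for s in scores))
    expect = [0.0] * len(theta)
    for x, s in zip(STATES, scores):
        p = math.exp(s - log_z)
        for d, f in enumerate(features(x, y)):
            expect[d] += p * f
    return expect, log_z

def cl_and_gradient(theta, data):
    """Average conditional log-likelihood and its gradient."""
    n = len(data)
    cl, grad = 0.0, [0.0] * len(theta)
    for x, y in data:
        expect, log_z = model_moments(theta, y)   # one inference per training point
        f = features(x, y)
        cl += (sum(t * v for t, v in zip(theta, f)) - log_z) / n
        for d in range(len(theta)):
            grad[d] += (f[d] - expect[d]) / n     # empirical minus model moment
    return cl, grad

def fit(data, eta=0.05, steps=500):
    theta = [0.0] * (len(PAIRS) + K)
    history = []
    for _ in range(steps):
        cl, grad = cl_and_gradient(theta, data)
        history.append(cl)
        theta = [t + eta * g for t, g in zip(theta, grad)]  # step along the gradient
    return theta, history

# Hypothetical toy data: (x, y) pairs with x in {-1, +1}^3 and y in {1, 2}.
data = [((1, 1, -1), 1), ((1, 1, 1), 2), ((-1, -1, 1), 1),
        ((1, -1, 1), 2), ((1, 1, -1), 2), ((-1, -1, -1), 1)]
theta, history = fit(data)
final_cl, final_grad = cl_and_gradient(theta, data)
```

Enumerating all 2^K states is only feasible for tiny K; in a real model the call to `model_moments` would be replaced by message passing or another inference routine.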

8 Collaborative Filtering

9 Collaborative Filtering
R: rating matrix; U: user factor; V: movie factor
$\min_{U, V} f(U, V) = \| R - U V^T \|_F^2 \quad \text{s.t. } U \ge 0,\ V \ge 0,\ k \ll m, n$
Approaches:
Low rank matrix approximation
Probabilistic matrix factorization
Bayesian probabilistic matrix factorization


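10 Parameter Estimation and Prediction
The Bayesian approach treats the unknown parameters as a random variable:
$P(\theta \mid D) = \frac{P(D \mid \theta) P(\theta)}{P(D)} = \frac{P(D \mid \theta) P(\theta)}{\int P(D \mid \theta) P(\theta) \, d\theta}$
Posterior mean estimation: $\theta_{bayes} = \int \theta \, P(\theta \mid D) \, d\theta$
Maximum likelihood approach: $\theta_{ML} = \arg\max_\theta P(D \mid \theta)$, $\theta_{MAP} = \arg\max_\theta P(\theta \mid D)$
[Figure: graphical model over θ, X (N copies), and X_new]
Bayesian prediction takes into account all possible values of θ:
$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D) \, d\theta = \int P(x_{new} \mid \theta) P(\theta \mid D) \, d\theta$
A frequentist prediction uses a plug-in estimator:
$P(x_{new} \mid D) = P(x_{new} \mid \theta_{ML})$ or $P(x_{new} \mid D) = P(x_{new} \mid \theta_{MAP})$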
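The difference between the plug-in and the Bayesian prediction is easiest to see in a model where the integral has a closed form. The sketch below assumes a Bernoulli likelihood with a Beta(a, b) prior on θ; that model, the function name `predict_heads` and the data (three heads, no tails) are illustrative choices, not taken from the slides.

```python
from fractions import Fraction

def predict_heads(heads, tails, a=1, b=1):
    """Plug-in and Bayesian predictions of P(x_new = 1 | D) for a Bernoulli model.

    Assumes a Beta(a, b) prior on theta (illustrative, not from the slides).
    The MAP formula is the Beta posterior mode, valid when heads + a > 1 and
    tails + b > 1, or when a = b = 1 and there is at least one observation.
    """
    n = heads + tails
    theta_ml = Fraction(heads, n)                        # argmax_theta P(D | theta)
    theta_map = Fraction(heads + a - 1, n + a + b - 2)   # argmax_theta P(theta | D)
    # Bayesian prediction: integral of P(x_new = 1 | theta) P(theta | D) d theta,
    # which for a Beta posterior equals the posterior mean of theta.
    bayes = Fraction(heads + a, n + a + b)
    return theta_ml, theta_map, bayes

ml, map_, bayes = predict_heads(heads=3, tails=0, a=2, b=2)
print(ml, map_, bayes)
```

With these inputs the ML plug-in predicts heads with probability 1, the MAP plug-in with 4/5, and the Bayesian prediction with 5/7: averaging over θ keeps some probability on an outcome that has not been observed yet.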


11 Matrix Factorization Approach
The unconstrained problem with the Frobenius norm can be solved using SVD:
$\arg\min_{U, V} \| R - U V^T \|_F^2$
Global optimum: $R = U V^T$
When the entries are sparse, the Frobenius norm is computed only over the observed entries, and SVD can no longer be used.
E.g., use nonnegative matrix factorization:
$\arg\min_{U, V} \| R - U V^T \|_F^2 \quad \text{s.t. } U \ge 0,\ V \ge 0$
This gives only a local optimum and is prone to overfitting.

12 Probabilistic matrix factorization (PMF)
Model components

13 PMF: Parameter Estimation
Parameter estimation: MAP estimate
$\theta_{MAP} = \arg\max_\theta P(\theta \mid D, \alpha) = \arg\max_\theta P(D \mid \theta) P(\theta \mid \alpha) = \arg\max_\theta P(\theta, D \mid \alpha)$
In the paper:

14 PMF: Interpret prior as regularization
Maximize the posterior distribution with respect to the parameters U and V.
This is equivalent to minimizing the sum-of-squares error function with quadratic regularization terms (plug in the Gaussians and take the log).
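The slide's own equations did not survive extraction, so the step "plug in Gaussians and take log" is written out here as a sketch. It assumes the standard PMF model: Gaussian noise with variance σ² on the observed ratings and zero-mean spherical Gaussian priors on the rows of U and V. The indicator I_ij (1 if rating R_ij is observed, 0 otherwise) and the variances σ_U², σ_V² are notation introduced for this sketch.

```latex
% Assumed model (standard PMF; not recoverable from the extracted slide):
%   R_{ij} \mid U, V \sim \mathcal{N}(U_i^\top V_j, \sigma^2) for observed entries (I_{ij} = 1)
%   U_i \sim \mathcal{N}(0, \sigma_U^2 I), \quad V_j \sim \mathcal{N}(0, \sigma_V^2 I)
\begin{align*}
\ln P(U, V \mid R)
  &= -\frac{1}{2\sigma^2} \sum_{i,j} I_{ij} \, (R_{ij} - U_i^\top V_j)^2
     - \frac{1}{2\sigma_U^2} \sum_i \|U_i\|^2
     - \frac{1}{2\sigma_V^2} \sum_j \|V_j\|^2 + \text{const} \\
E(U, V)
  &= \frac{1}{2} \sum_{i,j} I_{ij} \, (R_{ij} - U_i^\top V_j)^2
     + \frac{\lambda_U}{2} \sum_i \|U_i\|^2
     + \frac{\lambda_V}{2} \sum_j \|V_j\|^2,
  \qquad \lambda_U = \frac{\sigma^2}{\sigma_U^2}, \quad \lambda_V = \frac{\sigma^2}{\sigma_V^2}
\end{align*}
```

Multiplying the negative log posterior by σ² and dropping the constant gives E(U, V), so maximizing the posterior and minimizing E have the same solution, and the prior variances play the role of regularization strengths.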

15 PMF: optimization
Optimization: alternate between U and V.
With U fixed, the problem is convex in V.
With V fixed, the problem is convex in U.
This finds a local minimum.
The regularization parameters $\lambda_U$ and $\lambda_V$ need to be chosen.
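The alternating scheme can be sketched as follows. With one factor fixed, each row of the other factor solves a small ridge regression over its observed ratings, so every half-step is an exact minimization and the objective never increases. This is a minimal sketch assuming the regularized squared-error objective over observed entries; the ratings, the rank k = 2, the λ values and all function names are hypothetical, and it is written in plain Python only to stay dependency-free.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def solve(A, b):
    """Solve the small dense system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ridge_update(F, G, obs, lam):
    """With G fixed, set each row F[i] to the minimizer of its convex subproblem:
    (sum_j G_j G_j^T + lam * I) F_i = sum_j r_ij G_j, summing over observed j."""
    k = len(G[0])
    for i, pairs in obs.items():
        A = [[lam if a == b else 0.0 for b in range(k)] for a in range(k)]
        rhs = [0.0] * k
        for j, r in pairs:
            for a in range(k):
                rhs[a] += r * G[j][a]
                for b in range(k):
                    A[a][b] += G[j][a] * G[j][b]
        F[i] = solve(A, rhs)

def objective(U, V, ratings, lam_u, lam_v):
    err = sum((r - dot(U[i], V[j])) ** 2 for (i, j), r in ratings.items())
    reg = lam_u * sum(dot(u, u) for u in U) + lam_v * sum(dot(v, v) for v in V)
    return 0.5 * (err + reg)

def als(ratings, m, n, k=2, lam_u=0.1, lam_v=0.1, sweeps=20, seed=0):
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(m)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n)]
    by_user, by_item = {}, {}
    for (i, j), r in ratings.items():
        by_user.setdefault(i, []).append((j, r))
        by_item.setdefault(j, []).append((i, r))
    history = [objective(U, V, ratings, lam_u, lam_v)]
    for _ in range(sweeps):
        ridge_update(U, V, by_user, lam_u)   # fix V, solve for U
        ridge_update(V, U, by_item, lam_v)   # fix U, solve for V
        history.append(objective(U, V, ratings, lam_u, lam_v))
    return U, V, history

# Hypothetical sparse ratings: {(user, movie): rating}, 4 users x 3 movies.
ratings = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1,
           (2, 1): 1, (2, 2): 5, (3, 0): 1, (3, 2): 4}
U, V, history = als(ratings, m=4, n=3)
```

The result depends on the random initialization, which is the practical meaning of "finds a local minimum": different seeds can end at different values of the objective.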

16 Bayesian PMF: generative model
A more flexible prior over the U and V factors
Hyperparameters: $\Theta_U = \{\mu_U, \Lambda_U\}$, $\Theta_V = \{\mu_V, \Lambda_V\}$

17 Bayesian PMF: prior over prior
Add a prior over the hyperparameters.
Hyper-hyperparameters: $\Theta_0 = \{\mu_0, \nu_0, W_0\}$
W is the Wishart distribution with $\nu_0$ degrees of freedom and a $D \times D$ scale matrix $W_0$.

18 Bayesian PMF: predicting new ratings
Bayesian prediction takes into account all possible values of θ:
$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D) \, d\theta = \int P(x_{new} \mid \theta) P(\theta \mid D) \, d\theta$
In the paper, all parameters and hyperparameters are integrated out.

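19 Bayesian PMF: sampling for inference
Use a sampling technique to compute the predictive distribution.
Key idea: approximate the expectation by a sample average:
$E[f] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)$
$E_{U, V, \Theta_U, \Theta_V \mid R}\big[ p(R_{ij} \mid U_i, V_j) \big] \approx \frac{1}{N} \sum_k p(R_{ij} \mid U_i^k, V_j^k)$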
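The key idea can be checked on an expectation whose value is known. The sketch below is only an illustration: it draws independent samples from a standard normal and estimates E[x²], which is exactly 1. In Bayesian PMF the samples would instead come from a Markov chain over U, V and the hyperparameters. The helper name `sample_average` is hypothetical.

```python
import random

def sample_average(f, sampler, n):
    """Approximate E[f(x)] by (1/n) * sum of f(x_i) over n draws x_i from sampler()."""
    return sum(f(sampler()) for _ in range(n)) / n

rng = random.Random(0)
# Illustrative target: E[x^2] under a standard normal, whose exact value is 1.
estimate = sample_average(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), n=100_000)
print(estimate)
```

For independent samples the error of the estimate shrinks at the rate 1/sqrt(N), regardless of the dimension of x.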


20 Gibbs Sampling
Gibbs sampling:
Initialize $X = x^0$
For t = 1 to N:
$x_1^t \sim P(X_1 \mid x_2^{t-1}, \dots, x_K^{t-1})$
$x_2^t \sim P(X_2 \mid x_1^t, x_3^{t-1}, \dots, x_K^{t-1})$
...
$x_K^t \sim P(X_K \mid x_1^t, \dots, x_{K-1}^t)$
For graphical models, it is only necessary to condition on the variables in the Markov blanket.
Variants: randomly pick the variable to sample; sample block by block.
[Figure: graphical model over X_1, X_2, X_3, X_4, X_5]
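The sweep above can be sketched for a two-variable target whose full conditionals are known in closed form. The example assumes a zero-mean, unit-variance bivariate normal with correlation rho, for which X1 | x2 is normal with mean rho * x2 and variance 1 - rho². The target distribution, the burn-in length and the function name are illustrative assumptions.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each full conditional is itself normal: X1 | x2 ~ N(rho * x2, 1 - rho^2),
    and symmetrically for X2 | x1.
    """
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    x1, x2 = 0.0, 0.0                      # X = x^0
    samples = []
    for t in range(burn_in + n_samples):
        x1 = rng.gauss(rho * x2, sd)       # x1^t ~ P(X1 | x2^{t-1})
        x2 = rng.gauss(rho * x1, sd)       # x2^t ~ P(X2 | x1^t)
        if t >= burn_in:                   # keep samples only after burn-in
            samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20_000)
n = len(samples)
mean1 = sum(a for a, _ in samples) / n
mean2 = sum(b for _, b in samples) / n
cov = sum((a - mean1) * (b - mean2) for a, b in samples) / n
var1 = sum((a - mean1) ** 2 for a, _ in samples) / n
var2 = sum((b - mean2) ** 2 for _, b in samples) / n
corr = cov / math.sqrt(var1 * var2)        # should be close to rho
```

Successive samples are correlated, so the chain needs more draws than independent sampling would for the same accuracy; the stronger the coupling between variables (here, the closer rho is to 1), the slower the chain mixes.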

21 Bayesian PMF: Gibbs sampling
We have a directed graphical model: moralize first.
Markov blankets:
$U_i$: $R, V, \Theta_U$
$V_j$: $R, U, \Theta_V$
$\Theta_U$: $U, \Theta_0$
$\Theta_V$: $V, \Theta_0$

22 Bayesian PMF: Gibbs sampling equation
Gibbs sampling equation for $U_i$

23 Bayesian PMF: overall algorithm
Can be sampled in parallel

24 Experiments: RMSE

25 Experiments: Runtime

26 Experiments: posterior distribution

27 Experiments: users with different history

28 Experiments: effect of training size

29 Issue: diagnose convergence
Gibbs sampling: take samples after the burn-in period.
[Figure: sampled value vs. iteration number]
