Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)


Case Study 4: Collaborative Filtering
Review: Probabilistic Matrix Factorization
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Emily Fox, February 20th, 2014

Probabilistic Matrix Factorization (PMF)
A generative process:
- Pick user factors: $L_u \sim N(0, \sigma_u^2 I)$
- Pick movie factors: $R_v \sim N(0, \sigma_v^2 I)$
- For each (user, movie) pair observed: pick rating as $L_u \cdot R_v$ + noise, i.e. $r_{uv} \sim N(L_u \cdot R_v, \sigma_r^2)$
Joint probability:
$$P(L, R, X) = P(L)\,P(R)\,P(X \mid L, R)$$
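The generative process above can be sketched in a few lines of NumPy. This is only an illustration under assumed settings: the function name `sample_pmf`, the zero-mean isotropic Gaussian priors, and the default standard deviations are choices made here, not values taken from the slides.

```python
import numpy as np

def sample_pmf(n_users, n_movies, k, observed_pairs,
               sigma_u=1.0, sigma_v=1.0, sigma_r=0.5, seed=0):
    """Draw factors and ratings from a PMF-style generative process.

    Assumes zero-mean isotropic Gaussian priors on the factors and
    Gaussian observation noise (illustrative defaults, not from the slides).
    """
    rng = np.random.default_rng(seed)
    # Pick user factors: one k-dimensional vector per user.
    L = rng.normal(0.0, sigma_u, size=(n_users, k))
    # Pick movie factors: one k-dimensional vector per movie.
    R = rng.normal(0.0, sigma_v, size=(n_movies, k))
    # For each observed (user, movie) pair, the rating is L_u . R_v + noise.
    ratings = {}
    for (u, v) in observed_pairs:
        ratings[(u, v)] = L[u] @ R[v] + rng.normal(0.0, sigma_r)
    return L, R, ratings

pairs = [(0, 0), (0, 2), (1, 1), (2, 0)]
L, R, ratings = sample_pmf(n_users=3, n_movies=4, k=2, observed_pairs=pairs)
```

Only the observed pairs receive a rating, which mirrors the sparsity of real rating matrices.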

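PMF Graphical Model
$$P(L, R \mid X) \propto P(L)\,P(R)\,P(X \mid L, R)$$
Graphically: (figure: PMF graphical model)

MAP versus Regularized Least-Squares for Matrix Completion
MAP under Gaussian model:
$$\max_{L,R} \log P(L, R \mid X) = -\frac{1}{2\sigma_u^2} \sum_u \sum_i L_{ui}^2 - \frac{1}{2\sigma_v^2} \sum_v \sum_i R_{vi}^2 - \frac{1}{2\sigma_r^2} \sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2 + \text{const}$$
Least-squares matrix completion with $L_2$ regularization:
$$\min_{L,R} \frac{1}{2} \sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2 + \frac{\lambda_u}{2} \|L\|_F^2 + \frac{\lambda_v}{2} \|R\|_F^2$$
The two problems coincide when $\lambda_u = \sigma_r^2 / \sigma_u^2$ and $\lambda_v = \sigma_r^2 / \sigma_v^2$ (multiply the log posterior by $-\sigma_r^2$).
Understanding as a probabilistic model is very useful! E.g.,
- Change priors
- Incorporate other sources of information or dependencies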
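A quick numerical check can make the MAP and regularized least-squares correspondence concrete. The sketch below assumes zero-mean Gaussian priors with variances `sigma_u**2` and `sigma_v**2`, Gaussian noise with variance `sigma_r**2`, and a regularized objective with a factor of 1/2 on every term. The function names and the toy data are hypothetical.

```python
import numpy as np

def log_posterior(L, R, obs, sigma_u, sigma_v, sigma_r):
    """Unnormalized log P(L, R | X); the additive constant is dropped."""
    sq_err = sum((L[u] @ R[v] - r) ** 2 for (u, v), r in obs.items())
    return (-np.sum(L ** 2) / (2 * sigma_u ** 2)
            - np.sum(R ** 2) / (2 * sigma_v ** 2)
            - sq_err / (2 * sigma_r ** 2))

def regularized_loss(L, R, obs, lam_u, lam_v):
    """Squared error on observed entries plus Frobenius-norm penalties."""
    sq_err = sum((L[u] @ R[v] - r) ** 2 for (u, v), r in obs.items())
    return (0.5 * sq_err
            + 0.5 * lam_u * np.sum(L ** 2)
            + 0.5 * lam_v * np.sum(R ** 2))

rng = np.random.default_rng(0)
L = rng.normal(size=(3, 2))   # arbitrary user factors
R = rng.normal(size=(4, 2))   # arbitrary movie factors
obs = {(0, 0): 4.0, (0, 2): 1.0, (1, 1): 3.0, (2, 3): 5.0}  # toy ratings

sigma_u, sigma_v, sigma_r = 2.0, 1.5, 0.5
lam_u = sigma_r ** 2 / sigma_u ** 2
lam_v = sigma_r ** 2 / sigma_v ** 2

# Scaling the negative log posterior by sigma_r^2 gives the regularized loss.
lhs = -sigma_r ** 2 * log_posterior(L, R, obs, sigma_u, sigma_v, sigma_r)
rhs = regularized_loss(L, R, obs, lam_u, lam_v)
print(np.isclose(lhs, rhs))
```

Because the two objectives differ only by a positive scale factor and a constant, they share the same maximizer/minimizer, so tightening a prior (smaller variance) acts as stronger regularization.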

Posterior Computations
MAP estimation focuses on point estimation:
$$\hat{\theta}^{MAP} = \arg\max_{\theta} p(\theta \mid x)$$
What if we want a full characterization of the posterior?
- Maintain a measure of uncertainty
- Estimators other than posterior mode (different loss functions)
- Predictive distributions for future observations
Often no closed-form characterization (e.g., mixture models, PMF, etc.)

Bayesian PMF Example
Latent user and movie factors: $L_u$, $u = 1, \dots, n$, and $R_v$, $v = 1, \dots, m$
Observations: $r_{uv}$
Hyperparameters:
Want to predict new movie rating:

Bayesian PMF Example
$$p(r_{uv} \mid X, \phi) = \int p(r_{uv} \mid L_u, R_v)\, p(L, R \mid X, \phi)\, dL\, dR$$
where $\phi$ denotes the hyperparameters.
Monte Carlo methods:
Ideally:

Bayesian PMF Gibbs Sampler
Outline of Bayesian PMF sampler
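The slide leaves the sampler outline blank, so the following is only a simplified sketch of how Gibbs sampling and Monte Carlo prediction could fit together. It makes a loud simplifying assumption: the prior and noise standard deviations (`sigma_u`, `sigma_v`, `sigma_r`) are held fixed, whereas a full Bayesian PMF sampler would also resample the hyperparameters. All function names are hypothetical.

```python
import numpy as np

def sample_factor(other, idx, ratings, sigma_prior, sigma_r, rng):
    """Draw one factor vector from its Gaussian full conditional.

    other: (m, k) factors on the opposite side; idx: rows of `other`
    that share an observed rating with this factor; ratings: those ratings.
    """
    k = other.shape[1]
    F = other[idx]                                   # (n_obs, k)
    prec = np.eye(k) / sigma_prior ** 2 + F.T @ F / sigma_r ** 2
    cov = np.linalg.inv(prec)
    cov = 0.5 * (cov + cov.T)                        # guard against round-off asymmetry
    mean = cov @ (F.T @ ratings) / sigma_r ** 2
    return rng.multivariate_normal(mean, cov)

def gibbs_pmf(obs, n_users, n_movies, k, n_samples=200, burn_in=50,
              sigma_u=1.0, sigma_v=1.0, sigma_r=0.5, seed=0):
    """Gibbs sampler for PMF with FIXED hyperparameters (a simplification)."""
    rng = np.random.default_rng(seed)
    L = rng.normal(size=(n_users, k))
    R = rng.normal(size=(n_movies, k))
    by_user = {u: [] for u in range(n_users)}
    by_movie = {v: [] for v in range(n_movies)}
    for (u, v), r in obs.items():
        by_user[u].append((v, r))
        by_movie[v].append((u, r))
    samples = []
    for t in range(n_samples):
        for u in range(n_users):                     # resample each user factor
            idx = np.array([v for v, _ in by_user[u]], dtype=int)
            r = np.array([r for _, r in by_user[u]], dtype=float)
            L[u] = sample_factor(R, idx, r, sigma_u, sigma_r, rng)
        for v in range(n_movies):                    # resample each movie factor
            idx = np.array([u for u, _ in by_movie[v]], dtype=int)
            r = np.array([r for _, r in by_movie[v]], dtype=float)
            R[v] = sample_factor(L, idx, r, sigma_v, sigma_r, rng)
        if t >= burn_in:                             # discard burn-in draws
            samples.append((L.copy(), R.copy()))
    return samples

def predict_rating(samples, u, v):
    """Monte Carlo estimate of the predictive mean: average L_u . R_v over draws."""
    return float(np.mean([L[u] @ R[v] for L, R in samples]))

obs = {(0, 0): 4.0, (0, 1): 2.0, (1, 0): 3.0, (2, 1): 1.0}  # toy ratings
samples = gibbs_pmf(obs, n_users=3, n_movies=2, k=2)
prediction = predict_rating(samples, 2, 0)                   # unobserved pair
```

Averaging predictions over retained samples approximates the predictive integral, rather than plugging in a single MAP estimate of the factors.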

Bayesian PMF Results
From Salakhutdinov and Mnih, ICML 2008
Netflix data with:
- Training set = 100,480,507 ratings from 480,189 users on 17,770 movie titles
- Validation set = 1,408,395 ratings
- Test set = 2,817,131 user/movie pairs with the ratings withheld

Figure 2. Left panel: Performance of SVD, PMF, logistic PMF, and Bayesian PMF using 30D feature vectors, on the Netflix validation data. The y-axis displays RMSE (root mean squared error), and the x-axis shows the number of epochs, or passes, through the entire training set. Right panel: RMSE for the Bayesian PMF models on the validation set as a function of the number of samples generated. The two curves are for the models with 30D and 60D feature vectors.

Bayesian PMF Results
From Salakhutdinov and Mnih, ICML 2008
Bayesian model better controls for overfitting by averaging over possible parameters (instead of committing to one)

Figure 3. Samples from the posterior over the user and movie feature vectors generated by each step of the Gibbs sampler. The two dimensions with the highest variance are shown for two users and two movies (panels include Movie X (5 ratings), Movie Y (142 ratings), and User C (319 ratings)). The first 800 samples were discarded as burn-in.

Table 1. Performance of Bayesian PMF (BPMF) and linear PMF on Netflix validation and test sets. Columns: D; Valid. RMSE (PMF, BPMF); % Inc.; Test RMSE (PMF, BPMF); % Inc. (table values not preserved)

Excerpt from the paper (partly cut off in the slide image):

We then trained larger PMF models with D = 40 and D = 60. Capacity control for such models becomes a rather challenging task. For example, a PMF model with D = 60 has approximately 30 million parameters. Searching for appropriate values of the regularization coefficients becomes a very computationally expensive task. Table 1 further shows that for the 60-dimensional feature vectors, Bayesian PMF outperforms its MAP counterpart by over 2%. We should also point out that even the simplest possible Bayesian extension of the PMF model, where Gamma priors are placed over the precision hyperparameters α_U and α_V (see Fig. 1, [...] as the Bayesian PMF models.

It is interesting to observe that as the feature dimensionality grows, the performance accuracy for MAP-trained PMF models does not improve, and controlling overfitting becomes a critical issue. The predictive accuracy of the Bayesian PMF models, however, steadily improves as the model complexity grows. Inspired by this result, we experimented with Bayesian PMF models with D = 150 and D = 300 feature vectors. Note that these models have about [...] million parameters, and running the Gibbs sampler becomes computationally much more expensive. Nonetheless, the validation set RMSEs for the models were 0.8931 and 0.8920. Table 1 shows these models not only significantly outperform their MAP counterparts but also outperform Bayesian PMF models that have fewer parameters. These results clearly show that the Bayesian approach does not require limiting the complexity of the model based on the number of the training samples. In practice, however, we will be limited by the available computer resources. For completeness, we also report the performance

What you need to know
- Idea of full posterior inference vs. MAP estimation
- Gibbs sampling as an MCMC approach
- Example of inference in Bayesian probabilistic matrix factorization model

Case Study 4: Collaborative Filtering
Matrix Factorization and Probabilistic LFMs for Network Modeling


Network Data
- Structure of network data

Properties of Data Source
- Similarities to Netflix data:
  - Matrix
  - High-dimensional
  - Sparse
- Differences:
  - Square
  - Binary

Matrix Factorization for Network Data
- Vanilla matrix factorization approach:
- What to return for link prediction?
- Slightly fancier:

Probabilistic Latent Space Models
- Assume features (covariates) of the user or relationship
- Each user has a position in a k-dimensional latent space
- Probability of link:

Probabilistic Latent Space Models
Probability of link (distance form):
$$\text{log odds}\; p(r_{uv} = 1 \mid L_u, L_v, x_{uv}, \beta) = \beta_0 + \beta^T x_{uv} - \|L_u - L_v\|$$
Probability of link (inner-product form):
$$\text{log odds}\; p(r_{uv} = 1 \mid L_u, L_v, x_{uv}, \beta) = \beta_0 + \beta^T x_{uv} + L_u^T L_v$$
Bayesian approach:
- Place prior on user factors and regression coefficients
- Place hyperprior on user factor hyperparameters
Many other options and extensions (e.g., can use GMM for $L_u$ → clustering of users in the latent space)

What you need to know
- Representation of network data as a matrix
  - Adjacency matrix
- Similarities and differences between adjacency matrices and general matrix-valued data
- Matrix factorization approaches for network data
  - Just use standard MF and threshold output
  - Introduce link functions to constrain predicted values
- Probabilistic latent space models
  - Model link probabilities using distance between latent factors
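As a rough illustration of the latent space link model, the sketch below computes link probabilities by passing the log odds through a logistic function. It assumes the log odds are an intercept plus a covariate term, minus the Euclidean distance between latent positions (distance form) or plus their inner product (inner-product form). The names `beta0`, `beta`, and both function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Map log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def link_prob_distance(L_u, L_v, x_uv, beta0, beta):
    """Distance form: nodes that are closer in latent space link more often."""
    log_odds = beta0 + beta @ x_uv - np.linalg.norm(L_u - L_v)
    return sigmoid(log_odds)

def link_prob_inner(L_u, L_v, x_uv, beta0, beta):
    """Inner-product form: aligned latent vectors link more often."""
    log_odds = beta0 + beta @ x_uv + L_u @ L_v
    return sigmoid(log_odds)

L_u = np.array([0.0, 0.0])      # latent position of node u
L_near = np.array([0.1, 0.0])   # a nearby node
L_far = np.array([3.0, 4.0])    # a distant node
x_uv = np.array([1.0])          # one pairwise covariate (illustrative)
beta0, beta = 0.0, np.array([0.5])

p_near = link_prob_distance(L_u, L_near, x_uv, beta0, beta)
p_far = link_prob_distance(L_u, L_far, x_uv, beta0, beta)
```

Unlike thresholding the output of a standard matrix factorization, the logistic link keeps every prediction in (0, 1), so it can be read directly as a link probability.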
