Case Study 4: Collaborative Filtering
Review: Probabilistic Matrix Factorization
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Emily Fox, February 20th, 2014

Probabilistic Matrix Factorization (PMF)

A generative process:
- Pick user factors
- Pick movie factors
- For each observed (user, movie) pair: pick the rating as $L_u \cdot R_v$ + noise

Joint probability:
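As a concrete illustration of the generative process above, here is a minimal numpy sketch (not from the lecture; the sizes, noise scales, and the toy set of observed pairs are all illustrative):

```python
import numpy as np

# A minimal sketch of the PMF generative process: n users, m movies,
# k latent dimensions. sigma_u, sigma_v are prior standard deviations
# and sigma is the observation noise scale (all values illustrative).
rng = np.random.default_rng(0)
n, m, k = 100, 50, 5
sigma_u, sigma_v, sigma = 1.0, 1.0, 0.5

# Pick user factors: each row L[u] ~ N(0, sigma_u^2 I).
L = rng.normal(0.0, sigma_u, size=(n, k))
# Pick movie factors: each row R[v] ~ N(0, sigma_v^2 I).
R = rng.normal(0.0, sigma_v, size=(m, k))

# For each observed (user, movie) pair, pick the rating as L_u . R_v + noise.
observed = [(0, 3), (2, 7), (5, 1)]  # toy observed pairs
ratings = {(u, v): L[u] @ R[v] + rng.normal(0.0, sigma)
           for (u, v) in observed}
```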
PMF Graphical Model

$P(L, R \mid X) \propto P(L)\, P(R)\, P(X \mid L, R)$

Graphically:

MAP versus Regularized Least-Squares for Matrix Completion

MAP under the Gaussian model:

$\max_{L,R} \log P(L, R \mid X) = -\frac{1}{2\sigma_u^2}\sum_u \sum_i L_{ui}^2 - \frac{1}{2\sigma_v^2}\sum_v \sum_i R_{vi}^2 - \frac{1}{2\sigma^2}\sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2 + \text{const}$

Least-squares matrix completion with $L_2$ regularization:

$\min_{L,R} \frac{1}{2}\sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2 + \frac{\lambda_u}{2}\|L\|_F^2 + \frac{\lambda_v}{2}\|R\|_F^2$

Understanding this as a probabilistic model is very useful! E.g.:
- Change priors
- Incorporate other sources of information or dependencies
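To make the correspondence concrete, here is a small sketch (not from the slides) showing that the two objectives match up to a positive rescaling, with $\lambda_u = \sigma^2/\sigma_u^2$ and $\lambda_v = \sigma^2/\sigma_v^2$; function names are illustrative:

```python
import numpy as np

def map_log_posterior(L, R, ratings, sigma, sigma_u, sigma_v):
    """log P(L, R | X) up to an additive constant, under the Gaussian model."""
    sq_err = sum((L[u] @ R[v] - r) ** 2 for (u, v), r in ratings.items())
    return (-sq_err / (2 * sigma**2)
            - (L**2).sum() / (2 * sigma_u**2)
            - (R**2).sum() / (2 * sigma_v**2))

def regularized_least_squares(L, R, ratings, lam_u, lam_v):
    """L2-regularized matrix-completion objective."""
    sq_err = sum((L[u] @ R[v] - r) ** 2 for (u, v), r in ratings.items())
    return (0.5 * sq_err
            + 0.5 * lam_u * (L**2).sum()
            + 0.5 * lam_v * (R**2).sum())

# With lam_u = sigma^2 / sigma_u^2 and lam_v = sigma^2 / sigma_v^2,
# regularized_least_squares(...) == -sigma^2 * map_log_posterior(...)
# up to the dropped constant, so the argmax and argmin coincide.
```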
Posterior Computations

MAP estimation focuses on point estimation:

$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \; p(\theta \mid x)$

What if we want a full characterization of the posterior?
- Maintain a measure of uncertainty
- Estimators other than the posterior mode (different loss functions)
- Predictive distributions for future observations

Often there is no closed-form characterization (e.g., mixture models, PMF, etc.)

Bayesian PMF Example

- Latent user and movie factors: $L_u$ ($u = 1, \dots, n$) and $R_v$ ($v = 1, \dots, m$)
- Observations: ratings $r_{uv}$
- Hyperparameters: $\Theta_u, \Theta_v$

Want to predict a new movie rating:
Bayesian PMF Example

$p(r_{uv} \mid X, \Theta) = \int p(r_{uv} \mid L_u, R_v)\, p(L, R \mid X, \Theta)\, dL\, dR$

Monte Carlo methods:

Ideally:

Bayesian PMF Gibbs Sampler

Outline of the Bayesian PMF sampler:
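As a hedged sketch of what such a sampler can look like, here is a simplified Gibbs sweep for PMF with Gaussian priors. This is not the exact sampler from the paper: the hyperparameters `alpha` and `lam` are held fixed, whereas full Bayesian PMF also resamples them from Normal-Wishart conditionals; all names and toy data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_factor(other, ratings_of_i, alpha, lam):
    """Draw one factor from its Gaussian conditional given the other side.

    Model: r ~ N(L_u . R_v, 1/alpha), prior factor ~ N(0, (1/lam) I), so the
    conditional has precision lam*I + alpha * sum_j other_j other_j^T and
    mean alpha * cov @ sum_j r_j * other_j.
    """
    k = other.shape[1]
    prec = lam * np.eye(k)
    b = np.zeros(k)
    for j, r in ratings_of_i:  # observed ratings involving this node
        prec += alpha * np.outer(other[j], other[j])
        b += alpha * r * other[j]
    cov = np.linalg.inv(prec)
    return rng.multivariate_normal(cov @ b, cov)

def gibbs_sweep(L, R, by_user, by_movie, alpha=4.0, lam=2.0):
    for u in range(L.shape[0]):            # L_u | R, ratings
        L[u] = sample_factor(R, by_user.get(u, []), alpha, lam)
    for v in range(R.shape[0]):            # R_v | L, ratings
        R[v] = sample_factor(L, by_movie.get(v, []), alpha, lam)

# Monte Carlo estimate of the predictive mean: average L_u . R_v over
# post-burn-in samples instead of plugging in a single point estimate.
def predict(samples, u, v):
    return np.mean([L[u] @ R[v] for (L, R) in samples])

# Toy usage: the same ratings indexed both by user and by movie.
L = rng.normal(size=(4, 2)); R = rng.normal(size=(3, 2))
by_user = {0: [(0, 4.0), (2, 1.0)], 1: [(1, 5.0)]}
by_movie = {0: [(0, 4.0)], 1: [(1, 5.0)], 2: [(0, 1.0)]}
samples = []
for t in range(300):
    gibbs_sweep(L, R, by_user, by_movie)
    if t >= 100:                           # discard burn-in
        samples.append((L.copy(), R.copy()))
print(predict(samples, u=0, v=1))
```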
Bayesian PMF Results

From Salakhutdinov and Mnih, ICML 2008. Netflix data with:
- Training set = 100,480,507 ratings from 480,189 users on 17,770 movie titles
- Validation set = 1,408,395 ratings
- Test set = 2,817,131 user/movie pairs with the ratings withheld

[Figure 2 of the paper. Left panel: RMSE on the Netflix validation data vs. number of epochs (passes through the training set) for SVD, PMF, logistic PMF, and Bayesian PMF with 30D feature vectors, compared against the Netflix baseline score. Right panel: validation RMSE of the Bayesian PMF models with 30D and 60D feature vectors as a function of the number of samples generated (4 to 512), annotated with sampling times.]

[Figure 3 of the paper: samples from the posterior over the user and movie feature vectors generated by each step of the Gibbs sampler, shown for the two dimensions with the highest variance, for two users (including User C, 319 ratings) and two movies (Movie X, 5 ratings; Movie Y, 142 ratings). The first 800 samples were discarded as burn-in.]

The Bayesian model better controls for overfitting by averaging over possible parameters (instead of committing to one).

Table 1. Performance of Bayesian PMF (BPMF) and linear PMF on the Netflix validation and test sets.

       Valid. RMSE               Test RMSE
D      PMF      BPMF    % Inc.   PMF      BPMF    % Inc.
30     0.9154   0.8994  1.74     0.9188   0.9029  1.73
40     0.9135   0.8968  1.83     0.9170   0.9002  1.83
60     0.9150   0.8954  2.14     0.9185   0.8989  2.13
150    0.9178   0.8931  2.69     0.9211   0.8965  2.67
300    0.9231   0.8920  3.37     0.9265   0.8954  3.36

From the paper's discussion: "We then trained larger PMF models with D = 40 and D = 60. Capacity control for such models becomes a rather challenging task. For example, a PMF model with D = 60 has approximately 30 million parameters. Searching for appropriate values of the regularization coefficients becomes a very computationally expensive task. Table 1 further shows that for the 60-dimensional feature vectors, Bayesian PMF outperforms its MAP counterpart by over 2%. We should also point out that even the simplest possible Bayesian extension of the PMF model, where Gamma priors are placed over the precision hyperparameters αU and αV (see Fig. 1), [...] as the Bayesian PMF models. It is interesting to observe that as the feature dimensionality grows, the performance of the MAP-trained PMF models does not improve, and controlling overfitting becomes a critical issue. The predictive accuracy of the Bayesian PMF models, however, steadily improves as the model complexity grows. Inspired by this result, we experimented with Bayesian PMF models with D = 150 and D = 300 feature vectors. Note that these models have about 75 and 150 million parameters, and running the Gibbs sampler becomes computationally much more expensive. Nonetheless, the validation set RMSEs for the two models were 0.8931 and 0.8920. Table 1 shows that these models not only significantly outperform their MAP counterparts but also outperform Bayesian PMF models that have fewer parameters. These results clearly show that the Bayesian approach does not require limiting the complexity of the model based on the number of the training samples. In practice, however, we will be limited by the available computer resources."
What you need to know
- The idea of full posterior inference vs. MAP estimation
- Gibbs sampling as an MCMC approach
- Example of inference in the Bayesian probabilistic matrix factorization model

Case Study 4: Collaborative Filtering
Matrix Factorization and Probabilistic LFMs for Network Modeling
Machine Learning for Big Data, CSE547/STAT548, University of Washington
Emily Fox, February 20th, 2014
Network Data

- Structure of network data:

Properties of the Data Source

- Similarities to Netflix data:
  - Matrix
  - High-dimensional
  - Sparse
- Differences:
  - Square
  - Binary
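A small sketch (illustrative, not from the slides) of why the adjacency-matrix view has exactly these properties: the matrix is square (nodes index both axes), binary (link / no link), and sparse:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A network on n nodes stored as its sparse binary adjacency matrix.
# The edge list is a toy example.
n = 6
edges = [(0, 1), (0, 3), (2, 4), (4, 5)]
rows, cols = zip(*edges)
A = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))
A = ((A + A.T) > 0).astype(int)   # symmetrize for an undirected network
print(A.toarray())
```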
Matrix Factorization for Network Data

- Vanilla matrix factorization approach:
  - What to return for link prediction?
- Slightly fancier:
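One way to read the "vanilla" vs. "slightly fancier" options: threshold the raw factorization score directly, or pass it through a logistic link so the output is a valid link probability. A minimal sketch (names are illustrative; `L` is a hypothetical learned node-factor matrix):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_link_vanilla(L, u, v, thresh=0.5):
    return int(L[u] @ L[v] > thresh)   # raw score, thresholded

def predict_link_logistic(L, u, v):
    return sigmoid(L[u] @ L[v])        # probability in (0, 1)
```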
Probabilistic Latent Space Models

- Assume features (covariates) of the user or relationship
- Each user has a position in a k-dimensional latent space
- Probability of link:

[Figure: users plotted at their latent positions along dimensions Z1 and Z2]

- Distance form: $\text{log odds}\; p(r_{uv} = 1 \mid L_u, L_v, x_{uv}, \beta) = \alpha + \beta^T x_{uv} - |L_u - L_v|$
- Inner-product form: $\text{log odds}\; p(r_{uv} = 1 \mid L_u, L_v, x_{uv}, \beta) = \alpha + \beta^T x_{uv} + L_u^T L_v$

(A small sketch of both link models follows at the end of these notes.)

- Bayesian approach:
  - Place a prior on user factors and regression coefficients
  - Place a hyperprior on user factor hyperparameters
- Many other options and extensions (e.g., can use a GMM for $L_u$ → clustering of users in the latent space)

What you need to know
- Representation of network data as a matrix: the adjacency matrix
- Similarities and differences between adjacency matrices and general matrix-valued data
- Matrix factorization approaches for network data:
  - Just use standard MF and threshold the output
  - Introduce link functions to constrain predicted values
- Probabilistic latent space models: model link probabilities using distance between latent factors
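The sketch promised above: computing link probabilities under both latent space forms. All names are illustrative, and the covariates $x_{uv}$ and coefficients are assumed given (e.g., already fit or sampled):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_link_distance(L, x_uv, u, v, alpha, beta):
    # log odds = alpha + beta' x_uv - |L_u - L_v|
    return sigmoid(alpha + beta @ x_uv - np.linalg.norm(L[u] - L[v]))

def p_link_inner(L, x_uv, u, v, alpha, beta):
    # log odds = alpha + beta' x_uv + L_u' L_v
    return sigmoid(alpha + beta @ x_uv + L[u] @ L[v])
```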