Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

Similar documents
Andriy Mnih and Ruslan Salakhutdinov

Large-scale Ordinal Collaborative Filtering

Scaling Neighbourhood Methods

Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model

STA 4273H: Statistical Machine Learning

Restricted Boltzmann Machines for Collaborative Filtering

Scalable Bayesian Matrix Factorization

Graphical Models for Collaborative Filtering

Matrix Factorization and Recommendation Systems

Mixed Membership Matrix Factorization

Bayesian Matrix Factorization with Side Information and Dirichlet Process Mixtures

Mixed Membership Matrix Factorization

STA 4273H: Statistical Machine Learning

Bayesian Methods for Machine Learning

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Pattern Recognition and Machine Learning

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Algorithms for Collaborative Filtering

Introduction to Machine Learning

STA414/2104 Statistical Methods for Machine Learning II

STA 4273H: Statistical Machine Learning

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

Gaussian Mixture Model

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

STA 4273H: Statistical Machine Learning

ADVANCED MACHINE LEARNING. Non-linear regression techniques Part II

SCMF: Sparse Covariance Matrix Factorization for Collaborative Filtering

Machine Learning Techniques for Computer Vision

Mixed Membership Matrix Factorization

Approximate Inference using MCMC

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

Data Mining Techniques

The connection of dropout and Bayesian statistics

Relevance Vector Machines

Bayesian Learning (II)

Bayesian Mixtures of Bernoulli Distributions

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu

Collaborative Filtering Matrix Completion Alternating Least Squares

Default Priors and Efficient Posterior Computation in Bayesian

Collaborative Filtering. Radek Pelánek

Decoupled Collaborative Ranking

CPSC 540: Machine Learning

Recent Advances in Bayesian Inference Techniques

Introduction to Gaussian Processes

Nonparametric Bayesian Methods (Gaussian Processes)

VCMC: Variational Consensus Monte Carlo

Ad Placement Strategies

Probabilistic Graphical Models

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STAT 518 Intro Student Presentation

Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model

Integrated Non-Factorized Variational Inference

STA 4273H: Statistical Machine Learning

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Introduction to Probabilistic Machine Learning

CS281 Section 4: Factor Analysis and PCA

Matrix and Tensor Factorization from a Machine Learning Perspective

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

Density Estimation. Seungjin Choi

Machine Learning, Midterm Exam: Spring 2009 SOLUTION

Fast Likelihood-Free Inference via Bayesian Optimization

Mathematical Formulation of Our Example

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

A Bayesian Treatment of Social Links in Recommender Systems ; CU-CS

Stat 406: Algorithms for classification and prediction. Lecture 1: Introduction. Kevin Murphy. Mon 7 January,

Part 1: Expectation Propagation

Approximate Inference Part 1 of 2

STA 414/2104: Machine Learning

Practical Bayesian Optimization of Machine Learning Algorithms

MTTTS16 Learning from Multiple Sources

Approximate Inference Part 1 of 2

Bayesian Regression Linear and Logistic Regression

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

GWAS V: Gaussian processes

Mixture-Rank Matrix Approximation for Collaborative Filtering

Social Relations Model for Collaborative Filtering

ECE521 week 3: 23/26 January 2017

20: Gaussian Processes

Machine Learning Overview

Bayesian Inference. Chapter 4: Regression and Hierarchical Models

Introduction to Logistic Regression

STA 414/2104: Machine Learning

Multivariate Normal Models

Bayesian linear regression

Temporal Collaborative Filtering with Bayesian Probabilistic Tensor Factorization

PMR Learning as Inference

Active and Semi-supervised Kernel Classification

Learning the hyper-parameters. Luca Martino

Probabilistic Time Series Classification

Machine Learning for Signal Processing Bayes Classification and Regression

INTRODUCTION TO BAYESIAN INFERENCE PART 2 CHRIS BISHOP

Bayesian Approach 2. CSC412 Probabilistic Learning & Reasoning

GWAS IV: Bayesian linear (variance component) models

Real Estate Price Prediction with Regression and Classification CS 229 Autumn 2016 Project Final Report

Large-scale Collaborative Ranking in Near-Linear Time

Probabilistic Machine Learning. Industrial AI Lab.

Probability and Information Theory. Sargur N. Srihari

Gaussian Models

Afternoon Meeting on Bayesian Computation 2018 University of Reading

Transcription:

Case Study 4: Collaborative Filtering. Review: Probabilistic Matrix Factorization. Machine Learning for Big Data, CSE547/STAT548, University of Washington. Emily Fox, February 20th, 2014.

Probabilistic Matrix Factorization (PMF)
A generative process:
- Pick user factors $L_u$
- Pick movie factors $R_v$
- For each observed (user, movie) pair, pick the rating as $L_u \cdot R_v$ + noise
Joint probability: $P(L, R, X) = P(L)\, P(R)\, P(X \mid L, R)$
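To make the generative story concrete, here is a minimal NumPy sketch of this process. The sizes, variances, and observed pairs are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Minimal sketch of the PMF generative process (illustrative sizes and variances).
rng = np.random.default_rng(0)
n_users, n_movies, k = 100, 50, 5        # assumed counts and latent dimension
sigma_u, sigma_v, sigma = 1.0, 1.0, 0.5  # prior and noise standard deviations

# Pick user factors L_u and movie factors R_v from zero-mean Gaussian priors.
L = rng.normal(0.0, sigma_u, size=(n_users, k))
R = rng.normal(0.0, sigma_v, size=(n_movies, k))

# For each observed (user, movie) pair, pick a rating as L_u . R_v + noise.
observed = [(0, 3), (0, 7), (42, 19)]    # hypothetical observed pairs
ratings = {(u, v): L[u] @ R[v] + rng.normal(0.0, sigma) for (u, v) in observed}
print(ratings)
```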

PMF Graphical Model
$P(L, R \mid X) \propto P(L)\, P(R)\, P(X \mid L, R)$
Graphically: (plate diagram of the PMF model shown on the slide)

MAP versus Regularized Least-Squares for Matrix Completion
MAP under the Gaussian model:
$\max_{L,R} \log P(L, R \mid X) = -\frac{1}{2\sigma_u^2} \sum_u \|L_u\|_2^2 - \frac{1}{2\sigma_v^2} \sum_v \|R_v\|_2^2 - \frac{1}{2\sigma^2} \sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2 + \text{const}$
Least-squares matrix completion with L2 regularization:
$\min_{L,R} \frac{1}{2} \sum_{r_{uv}} (L_u \cdot R_v - r_{uv})^2 + \frac{\lambda_u}{2} \|L\|_F^2 + \frac{\lambda_v}{2} \|R\|_F^2$
Understanding this as a probabilistic model is very useful, e.g., for changing priors or incorporating other sources of information or dependencies.
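A small sketch that makes the MAP / regularized least-squares correspondence concrete by evaluating both objectives on the same factors. The function names, test data, and variances are illustrative; the objectives share minimizers when $\lambda_u = \sigma^2/\sigma_u^2$ and $\lambda_v = \sigma^2/\sigma_v^2$.

```python
import numpy as np

def neg_log_posterior(L, R, ratings, sigma, sigma_u, sigma_v):
    """Negative log P(L, R | X) up to an additive constant (Gaussian PMF model)."""
    sq_err = sum((L[u] @ R[v] - r) ** 2 for (u, v), r in ratings.items())
    return (sq_err / (2 * sigma**2)
            + np.sum(L**2) / (2 * sigma_u**2)
            + np.sum(R**2) / (2 * sigma_v**2))

def regularized_ls(L, R, ratings, lam_u, lam_v):
    """L2-regularized least-squares matrix-completion objective."""
    sq_err = sum((L[u] @ R[v] - r) ** 2 for (u, v), r in ratings.items())
    return 0.5 * sq_err + 0.5 * lam_u * np.sum(L**2) + 0.5 * lam_v * np.sum(R**2)

# Quick check with random factors and a few made-up ratings: the two objectives
# agree up to the positive factor sigma^2 when lam = sigma^2 / prior variance.
rng = np.random.default_rng(1)
L, R = rng.normal(size=(4, 2)), rng.normal(size=(3, 2))
ratings = {(0, 1): 3.0, (2, 2): 4.5}
sigma, sigma_u, sigma_v = 0.5, 1.0, 1.0
lhs = sigma**2 * neg_log_posterior(L, R, ratings, sigma, sigma_u, sigma_v)
rhs = regularized_ls(L, R, ratings, sigma**2 / sigma_u**2, sigma**2 / sigma_v**2)
print(np.isclose(lhs, rhs))  # True
```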

Posterior Computations
MAP estimation focuses on point estimation: $\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} p(\theta \mid x)$
What if we want a full characterization of the posterior?
- Maintain a measure of uncertainty
- Use estimators other than the posterior mode (different loss functions)
- Form predictive distributions for future observations
Often there is no closed-form characterization (e.g., mixture models, PMF, etc.).

Bayesian PMF Example
Latent user and movie factors $L_u$ ($u = 1, \ldots, n$) and $R_v$ ($v = 1, \ldots, m$); observations $r_{uv}$; hyperparameters $\Theta_0$. We want to predict a new movie rating.

Bayesian PMF Example
$p(r_{uv} \mid X, \Theta_0) = \int p(r_{uv} \mid L_u, R_v)\, p(L, R \mid X, \Theta_0)\, dL\, dR$
Ideally we would compute this integral exactly; in practice we use Monte Carlo methods, averaging over posterior samples of the factors. (The slide shows the corresponding graphical model, with plates over users $u = 1, \ldots, n$ and movies $v = 1, \ldots, m$ and hyperparameters governing $L_u$ and $R_v$.)

Bayesian PMF Gibbs Sampler
Outline of the Bayesian PMF sampler.
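The sampler outline itself is not reproduced in the transcription, so here is a simplified sketch of a Gibbs sampler for this model with fixed noise and prior variances (the full Bayesian PMF sampler of Salakhutdinov and Mnih also resamples Gaussian-Wishart hyperparameters on each sweep). All names and default values are illustrative assumptions.

```python
import numpy as np

def gibbs_bpmf(ratings, n_users, n_movies, k=5, sigma=0.5, sigma_u=1.0,
               sigma_v=1.0, n_iters=200, burn_in=50, seed=0):
    """Simplified Gibbs sampler for Bayesian PMF with fixed hyperparameters.

    ratings: dict mapping (user, movie) -> observed rating.
    Returns Monte Carlo estimates of the predictive mean rating per pair.
    """
    rng = np.random.default_rng(seed)
    L = rng.normal(0.0, sigma_u, (n_users, k))
    R = rng.normal(0.0, sigma_v, (n_movies, k))

    by_user = {u: [] for u in range(n_users)}
    by_movie = {v: [] for v in range(n_movies)}
    for (u, v), r in ratings.items():
        by_user[u].append((v, r))
        by_movie[v].append((u, r))

    def sample_factor(others, obs, prior_var):
        # Gaussian conditional: precision = I/prior_var + (1/sigma^2) sum x x^T,
        # mean = cov @ (1/sigma^2) sum r x.
        prec = np.eye(k) / prior_var
        mean_term = np.zeros(k)
        for idx, r in obs:
            x = others[idx]
            prec += np.outer(x, x) / sigma**2
            mean_term += r * x / sigma**2
        cov = np.linalg.inv(prec)
        return rng.multivariate_normal(cov @ mean_term, cov)

    pred_sum, n_kept = {}, 0
    for it in range(n_iters):
        for u in range(n_users):
            L[u] = sample_factor(R, by_user[u], sigma_u**2)
        for v in range(n_movies):
            R[v] = sample_factor(L, by_movie[v], sigma_v**2)
        if it >= burn_in:
            n_kept += 1
            for (u, v) in ratings:   # or any (user, movie) pairs of interest
                pred_sum[(u, v)] = pred_sum.get((u, v), 0.0) + L[u] @ R[v]
    return {uv: s / n_kept for uv, s in pred_sum.items()}
```

Each sweep draws every user factor from its Gaussian conditional given the current movie factors, then every movie factor given the user factors; predictions are Monte Carlo averages of $L_u \cdot R_v$ over the retained samples, approximating the predictive integral above.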

Bayesian PMF Results
From Salakhutdinov and Mnih, ICML 2008. Netflix data with:
- Training set: 100,480,507 ratings from 480,189 users on 17,770 movie titles
- Validation set: 1,408,395 ratings
- Test set: 2,817,131 user/movie pairs with the ratings withheld

Figure 2. Left panel: performance of SVD, PMF, logistic PMF, and Bayesian PMF using 30D feature vectors on the Netflix validation data, with the Netflix baseline score shown for reference; the y-axis displays RMSE (root mean squared error) and the x-axis shows the number of epochs, or passes through the entire training set. Right panel: RMSE for the Bayesian PMF models on the validation set as a function of the number of samples generated; the two curves are for the models with 30D and 60D feature vectors.

The Bayesian model better controls for overfitting by averaging over possible parameters (instead of committing to one).

Figure 3. Samples from the posterior over the user and movie feature vectors generated by each step of the Gibbs sampler. The two dimensions with the highest variance are shown for two users and two movies (e.g., User C with 319 ratings, Movie X with 5 ratings, Movie Y with 142 ratings). The first 800 samples were discarded as burn-in.

Table 1. Performance of Bayesian PMF (BPMF) and linear PMF on the Netflix validation and test sets (RMSE; Inc. = relative improvement of BPMF over PMF):

D     Valid. RMSE (PMF / BPMF / Inc. %)    Test RMSE (PMF / BPMF / Inc. %)
30    0.9154 / 0.8994 / 1.74               0.9188 / 0.9029 / 1.73
40    0.9135 / 0.8968 / 1.83               0.9170 / 0.9002 / 1.83
60    0.9150 / 0.8954 / 2.14               0.9185 / 0.8989 / 2.13
150   0.9178 / 0.8931 / 2.69               0.9211 / 0.8965 / 2.67
300   0.9231 / 0.8920 / 3.37               0.9265 / 0.8954 / 3.36

We then trained larger PMF models with D = 40 and D = 60. Capacity control for such models becomes a rather challenging task: for example, a PMF model with D = 60 has approximately 30 million parameters, and searching for appropriate values of the regularization coefficients becomes very computationally expensive. Table 1 further shows that for the 60-dimensional feature vectors, Bayesian PMF outperforms its MAP counterpart by over 2%. We should also point out that even the simplest possible Bayesian extension of the PMF model, where Gamma priors are placed over the precision hyperparameters αU and αV (see Fig. 1) [...] as the Bayesian PMF models. It is interesting to observe that as the feature dimensionality grows, the performance accuracy for MAP-trained PMF models does not improve, and controlling overfitting becomes a critical issue. The predictive accuracy of the Bayesian PMF models, however, steadily improves as the model complexity grows. Inspired by this result, we experimented with Bayesian PMF models with D = 150 and D = 300 feature vectors. Note that these models have about 75 and 150 million parameters, and running the Gibbs sampler becomes computationally much more expensive. Nonetheless, the validation set RMSEs for the two models were 0.8931 and 0.8920. Table 1 shows that these models not only significantly outperform their MAP counterparts but also outperform Bayesian PMF models that have fewer parameters. These results clearly show that the Bayesian approach does not require limiting the complexity of the model based on the number of training samples. In practice, however, we will be limited by the available computer resources. For completeness, we also report the performance...

What you need to know
- The idea of full posterior inference vs. MAP estimation
- Gibbs sampling as an MCMC approach
- An example of inference in the Bayesian probabilistic matrix factorization model

Case Study 4: Collaborative Filtering. Matrix Factorization and Probabilistic LFMs for Network Modeling. Machine Learning for Big Data, CSE547/STAT548, University of Washington. Emily Fox, February 20th, 2014.

Network Data
- Structure of network data (adjacency matrix illustrated on the slide)

Properties of the Data Source
Similarities to the Netflix data:
- Matrix-valued
- High-dimensional
- Sparse
Differences:
- Square
- Binary
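As a concrete illustration of this representation, here is a small sketch that builds a sparse, square, binary adjacency matrix from an edge list; the edge list and node count are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical directed edge list over n nodes; 1 = observed link, absent = no link.
n = 6
edges = [(0, 1), (1, 2), (2, 0), (3, 4)]
rows, cols = zip(*edges)
A = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))

# Square (n x n), binary, and typically very sparse -- unlike the rectangular,
# real-valued (but also sparse) user x movie ratings matrix.
print(A.toarray())
```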

Matrix Factorization for Network Data
- Vanilla matrix factorization approach: what should we return for link prediction?
- Slightly fancier: introduce a link function to constrain the predicted values

Probabilistic Latent Space Models
- Assume features (covariates) of the user or relationship
- Each user has a position in a k-dimensional latent space (illustrated on the slide with axes Z1 and Z2)
- Probability of link:

Probabilistic Latent Space Models
Probability of link, latent distance form:
$\operatorname{logodds}\, p(r_{uv} = 1 \mid L_u, L_v, x_{uv}, \beta) = \alpha + \beta^T x_{uv} - \|L_u - L_v\|$
Probability of link, bilinear (latent factor) form:
$\operatorname{logodds}\, p(r_{uv} = 1 \mid L_u, L_v, x_{uv}, \beta) = \alpha + \beta^T x_{uv} + L_u^T L_v$
Bayesian approach:
- Place a prior on the user factors and regression coefficients
- Place a hyperprior on the user-factor hyperparameters
- Many other options and extensions (e.g., use a GMM prior on $L_u$ → clustering of users in the latent space)

What you need to know
- Representation of network data as a matrix: the adjacency matrix
- Similarities and differences between adjacency matrices and general matrix-valued data
- Matrix factorization approaches for network data: just use standard MF and threshold the output, or introduce link functions to constrain predicted values
- Probabilistic latent space models: model link probabilities using the distance between latent factors
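A brief sketch of the two link-probability parameterizations above, assuming a logistic link on the log-odds; the toy latent positions, covariate, and coefficients are made-up illustrations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def link_prob_distance(L_u, L_v, x_uv, alpha, beta):
    """Latent distance model: log-odds = alpha + beta^T x_uv - ||L_u - L_v||."""
    return sigmoid(alpha + beta @ x_uv - np.linalg.norm(L_u - L_v))

def link_prob_bilinear(L_u, L_v, x_uv, alpha, beta):
    """Bilinear (latent factor) model: log-odds = alpha + beta^T x_uv + L_u^T L_v."""
    return sigmoid(alpha + beta @ x_uv + L_u @ L_v)

# Toy example: two users in a 2-d latent space with one edge covariate.
L_u, L_v = np.array([0.5, -1.0]), np.array([0.3, -0.8])
x_uv, alpha, beta = np.array([1.0]), -1.0, np.array([0.4])
print(link_prob_distance(L_u, L_v, x_uv, alpha, beta),
      link_prob_bilinear(L_u, L_v, x_uv, alpha, beta))
```

Users that are close in the latent space (or have a large positive inner product) get a high link probability, which is what makes a GMM prior on the latent positions act as a clustering of users.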