Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model


Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model
Kai Yu, joint work with John Lafferty and Shenghuo Zhu
NEC Laboratories America / Carnegie Mellon University

Learning multiple tasks

For an input vector $z_j$ and its outputs $Y_{ij}$ (tasks), the standard regression model, used under various conditions, is

    $Y_{ij} = \mu + m_i(z_j) + \epsilon_{ij}$,   $\epsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$,   $i = 1, \ldots, M$,  $j = 1, \ldots, N$.
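A minimal NumPy sketch (not part of the original slides) that simulates data from this standard model; the sizes, the global mean, and the choice of per-task linear functions $m_i$ are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    M, N, mu, sigma = 5, 20, 3.0, 0.1            # tasks, inputs, global mean, noise sd (assumed values)
    Z = rng.uniform(-1.0, 1.0, size=N)           # shared input locations z_j

    slopes = rng.normal(size=M)                  # each task i gets its own (here: linear) function m_i
    m = slopes[:, None] * Z[None, :]             # m_i(z_j), an M x N array
    Y = mu + m + sigma * rng.normal(size=(M, N)) # observations Y_ij = mu + m_i(z_j) + eps_ij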

A kernel approach to multi-task learning

To model the dependency between tasks, a hierarchical Bayesian approach may assume

    $m_i \overset{iid}{\sim} GP(0, \Sigma)$,

where $\Sigma(z_j, z_{j'}) \succeq 0$ is a shared covariance function among inputs. Many multi-task learning approaches are similar to this. [ICML-09 tutorial, Tresp & Yu]

Using task-specific features

Assuming task-specific features $x$ are available, a more flexible approach is to model the data jointly, as

    $Y_{ij} = \mu + m(x_i, z_j) + \epsilon_{ij}$,   $\epsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$,

where $m_{ij} = m(x_i, z_j)$ is a relational function.

A nonparametric kernel-based approach

Assume the relational function follows

    $m \sim GP(0, \Omega \otimes \Sigma)$,

where
- $\Omega(x_i, x_{i'})$ is a covariance function on tasks;
- $\Sigma(z_j, z_{j'})$ is a covariance function on inputs;
- any finite submatrix follows $m \sim N(0, \Omega \otimes \Sigma)$, with $\mathrm{Cov}(m_{ij}, m_{i'j'}) = \Omega_{ii'} \Sigma_{jj'}$;
- if $\Omega = \delta$, the prior reduces to $m_i \overset{iid}{\sim} GP(0, \Sigma)$.

[Yu et al., 2007; Bonilla et al., 2008]
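A hedged NumPy sketch of this Kronecker-structured prior on a finite grid; the RBF kernel and all sizes are illustrative assumptions, not fixed by the slides:

    import numpy as np

    def rbf(A, B, ell=1.0):
        # squared-exponential kernel between rows of A and rows of B
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ell**2)

    rng = np.random.default_rng(0)
    M, N, p, q = 4, 6, 3, 2
    X = rng.normal(size=(M, p))                # task features x_i
    Z = rng.normal(size=(N, q))                # input features z_j
    Omega = rbf(X, X) + 1e-8 * np.eye(M)       # covariance on tasks
    Sigma = rbf(Z, Z) + 1e-8 * np.eye(N)       # covariance on inputs

    K = np.kron(Omega, Sigma)                  # Cov(vec(m)) = Omega kron Sigma
    m = rng.multivariate_normal(np.zeros(M * N), K).reshape(M, N)
    # by construction Cov(m_ij, m_i'j') = Omega[i, i'] * Sigma[j, j']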

The collaborative prediction problem

- This is essentially a multi-task learning problem with task features;
- matrix factorization using additional row/column attributes;
- the formulation applies to many relational prediction problems.

Challenges to the kernel approach

- Computation: the cost $O(M^3 N^3)$ is prohibitive. Netflix data: $M = 480{,}189$ and $N = 17{,}770$.
- Dependent noise: when $Y_{ij}$ cannot be fully explained by the predictors $x_i$ and $z_j$, the conditional independence assumption is invalid, which means

    $p(Y \mid m, x, z) \neq \prod_{i,j} p(Y_{ij} \mid m, x_i, z_j)$.

  User and movie features are weak predictors; the relational observations $Y_{ij}$ are themselves informative about one another.
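A quick back-of-envelope check of why the naive cost is prohibitive, assuming the $O(M^3 N^3)$ count translates directly into floating-point operations:

    M, N = 480_189, 17_770                       # Netflix users and movies
    flops = float(M * N) ** 3                    # naive O((MN)^3) operation count, roughly 6e29
    years_at_1_tflops = flops / 1e12 / (3600 * 24 * 365)
    print(f"{flops:.1e} operations, roughly {years_at_1_tflops:.1e} years at 1 Tflop/s")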

This work

- A novel multi-task model using both input and task attributes;
- nonparametric random effects to resolve "dependent noises";
- an efficient algorithm for large-scale collaborative prediction problems.

Nonparametric random effects

    $Y_{ij} = \mu + m_{ij} + f_{ij} + \epsilon_{ij}$

- $m(x_i, z_j)$: a function depending on known attributes;
- $f_{ij}$: random effects for "dependent noises", modeling dependency in observations with repeated structures.

Letting $f_{ij}$ be nonparametric, the dimensionality increases with data size: nonparametric matrix factorization.

Efficiency considerations in modeling

To save computation, we absorb $\epsilon$ into $f$ and obtain

    $Y_{ij} = \mu + m_{ij} + f_{ij}$.

We then introduce a special generative process for $m$ and $f$, built on base covariance functions $\Omega_0(x_i, x_{i'})$ on tasks and $\Sigma_0(z_j, z_{j'})$ on inputs.

The row-wise generative model

[Figure-only slides: step-by-step graphical illustration of the row-wise generative process.]

The column-wise generative model

[Figure-only slide: graphical illustration of the column-wise generative process.]

Two generative models

    $Y_{ij} = \mu + m_{ij} + f_{ij}$

Column-wise model:

    $\Omega \sim IWP(\kappa, \Omega_0 + \tau\delta)$,  $m \sim GP(0, \Omega \otimes \Sigma_0)$,  $f_j \overset{iid}{\sim} GP(0, \lambda\Omega)$.

Row-wise model:

    $\Sigma \sim IWP(\kappa, \Sigma_0 + \lambda\delta)$,  $m \sim GP(0, \Omega_0 \otimes \Sigma)$,  $f_i \overset{iid}{\sim} GP(0, \tau\Sigma)$.

Two generative models are equivalent

Both models lead to the same matrix-variate Student-t process

    $Y \sim MTP\!\left(\kappa, 0, (\Omega_0 + \tau\delta), (\Sigma_0 + \lambda\delta)\right)$.

- The model learns both $\Omega$ and $\Sigma$ simultaneously;
- sometimes one model is computationally cheaper than the other.

An idea of large-scale modeling

[Figure-only slide.]

Modeling large-scale data

If $\Omega_0(x_i, x_{i'}) = \langle \phi(x_i), \phi(x_{i'}) \rangle$, then $m \sim GP(0, \Omega_0 \otimes \Sigma)$ implies $m_{ij} = \langle \phi(x_i), \beta_j \rangle$.

Without loss of generality, let $\Omega_0(x_i, x_{i'}) = \langle p^{-1/2} x_i, p^{-1/2} x_{i'} \rangle$, with $x_i \in \mathbb{R}^p$.

On a finite observational matrix $Y \in \mathbb{R}^{M \times N}$, $M \gg N$, the row-wise model becomes

    $\Sigma \sim IW(\kappa, \Sigma_0 + \lambda I_N)$,  $\beta \sim N(0, I_p \otimes \Sigma)$,  $Y_i \sim N(\beta^\top x_i, \tau\Sigma)$,  $i = 1, \ldots, M$,

where $\beta$ is a $p \times N$ random matrix.
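A hedged sketch (toy sizes, illustrative hyperparameters) that samples from this finite row-wise model using SciPy's inverse-Wishart:

    import numpy as np
    from scipy.stats import invwishart

    rng = np.random.default_rng(0)
    M, N, p = 50, 8, 3                           # rows, columns, task-feature dimension (toy sizes)
    tau, lam, kappa = 0.5, 0.1, N + 4            # illustrative hyperparameters
    Sigma_0 = np.eye(N)

    Sigma = invwishart(df=kappa, scale=Sigma_0 + lam * np.eye(N)).rvs(random_state=0)
    # Cov(vec(beta)) = I_p kron Sigma, i.e. the p rows of beta are i.i.d. N(0, Sigma)
    beta = rng.multivariate_normal(np.zeros(N), Sigma, size=p)          # p x N
    X = rng.normal(size=(M, p)) / np.sqrt(p)     # task features, scaled as in Omega_0
    Y = np.stack([rng.multivariate_normal(beta.T @ X[i], tau * Sigma) for i in range(M)])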

Approximate Inference - EM

E-step: compute the sufficient statistics $\{\upsilon_i, C_i\}$ for the posterior of $Y_i$ given the current $\beta$ and $\Sigma$:

    $Q(Y) = \prod_{i=1}^{M} p(Y_i \mid Y_{O_i}, \beta, \Sigma) = \prod_{i=1}^{M} N(Y_i \mid \upsilon_i, C_i)$.

M-step: optimize $\beta$ and $\Sigma$:

    $\beta^*, \Sigma^* = \arg\min_{\beta, \Sigma} \; E_{Q(Y)}\!\left[ -\log p(Y, \beta, \Sigma \mid \theta) \right]$,

and then let $\beta \leftarrow \beta^*$, $\Sigma \leftarrow \Sigma^*$.

Some notation

- Let $J_i \subseteq \{1, \ldots, N\}$ be the index set of the $N_i$ observed elements in the row $Y_i$;
- $\Sigma_{[:, J_i]} \in \mathbb{R}^{N \times N_i}$ is the matrix obtained by keeping the columns of $\Sigma$ indexed by $J_i$;
- $\Sigma_{[J_i, J_i]} \in \mathbb{R}^{N_i \times N_i}$ is obtained from $\Sigma_{[:, J_i]}$ by further keeping only the rows indexed by $J_i$;
- similarly, we define $\Sigma_{[J_i, :]}$, $Y_{[i, J_i]}$ and $m_{[i, J_i]}$.

The EM algorithm

E-step: for $i = 1, \ldots, M$,

    $m_i = \beta^\top x_i$,
    $\upsilon_i = m_i + \Sigma_{[:, J_i]} \Sigma_{[J_i, J_i]}^{-1} \left( Y_{[i, J_i]} - m_{[i, J_i]} \right)$,
    $C_i = \tau\Sigma - \tau \Sigma_{[:, J_i]} \Sigma_{[J_i, J_i]}^{-1} \Sigma_{[J_i, :]}$.

M-step:

    $\beta = (x^\top x + \tau I_p)^{-1} x^\top \upsilon$,
    $\Sigma = \dfrac{\tau^{-1} \left[ \sum_{i=1}^{M} (C_i + \upsilon_i \upsilon_i^\top) - \upsilon^\top x (x^\top x + \tau I_p)^{-1} x^\top \upsilon \right] + \Sigma_0 + \lambda I_N}{M + 2N + p + \kappa}$.

On Netflix, each EM iteration takes several thousand hours.
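A hedged Python sketch of one such EM iteration on dense toy data (explicit loop over rows, no speed tricks); variable names mirror the slides, and the hyperparameters are whatever the caller supplies:

    import numpy as np

    def em_iteration(Y, O, X, beta, Sigma, tau, lam, kappa, Sigma_0):
        """One EM iteration. Y: M x N ratings (unobserved entries arbitrary),
        O: M x N boolean mask of observed entries (assumed non-empty per row),
        X: M x p task features, beta: p x N, Sigma: N x N."""
        M, N = Y.shape
        p = X.shape[1]
        U = np.zeros((M, N))                       # posterior means upsilon_i
        C_sum = np.zeros((N, N))                   # running sum of the C_i
        for i in range(M):
            J = np.flatnonzero(O[i])               # index set J_i
            m_i = beta.T @ X[i]                    # prior mean of row i
            S_JJ = Sigma[np.ix_(J, J)]
            K = np.linalg.solve(S_JJ, Sigma[J, :]).T        # Sigma[:, J_i] Sigma[J_i, J_i]^{-1}
            U[i] = m_i + K @ (Y[i, J] - m_i[J])             # upsilon_i
            C_sum += tau * Sigma - tau * K @ Sigma[J, :]    # C_i, accumulated
        A = X.T @ X + tau * np.eye(p)
        beta_new = np.linalg.solve(A, X.T @ U)              # M-step for beta
        S = C_sum + U.T @ U - U.T @ X @ np.linalg.solve(A, X.T @ U)
        Sigma_new = (S / tau + Sigma_0 + lam * np.eye(N)) / (M + 2 * N + p + kappa)
        return beta_new, Sigma_new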

Fast implementation

Let $U_i \in \mathbb{R}^{N \times N_i}$ be a column selection operator, such that $\Sigma_{[:, J_i]} = \Sigma U_i$ and $\Sigma_{[J_i, J_i]} = U_i^\top \Sigma U_i$.

The M-step only needs $C = \sum_{i=1}^{M} C_i + \upsilon^\top \upsilon$ and $\upsilon^\top x$ from the previous E-step. To obtain them, it is unnecessary to compute the individual $\upsilon_i$ and $C_i$. For example,

    $\sum_{i=1}^{M} C_i = \sum_{i=1}^{M} \left( \tau\Sigma - \tau \Sigma_{[:, J_i]} \Sigma_{[J_i, J_i]}^{-1} \Sigma_{[J_i, :]} \right)
    = \sum_{i=1}^{M} \left( \tau\Sigma - \tau \Sigma U_i \Sigma_{[J_i, J_i]}^{-1} U_i^\top \Sigma \right)
    = \tau M \Sigma - \tau \Sigma \left( \sum_{i=1}^{M} U_i \Sigma_{[J_i, J_i]}^{-1} U_i^\top \right) \Sigma$.

Similar tricks can be applied to $\upsilon^\top \upsilon$ and $\upsilon^\top x$. The time for each iteration is reduced from thousands of hours to only 5 hours.
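A hedged sketch of this aggregation for $\sum_i C_i$: rather than forming each $N \times N$ matrix $C_i$, accumulate the small inverses $\Sigma_{[J_i, J_i]}^{-1}$ into one $N \times N$ matrix and finish with two matrix products (illustrative code, not the authors' implementation):

    import numpy as np

    def sum_Ci_fast(O, Sigma, tau):
        """O: M x N boolean mask of observed entries; returns sum_i C_i."""
        M, N = O.shape
        S = np.zeros((N, N))                       # accumulates sum_i U_i Sigma[J_i, J_i]^{-1} U_i^T
        for i in range(M):
            J = np.flatnonzero(O[i])
            S[np.ix_(J, J)] += np.linalg.inv(Sigma[np.ix_(J, J)])
        return tau * M * Sigma - tau * Sigma @ S @ Sigma   # tau*M*Sigma - tau*Sigma (sum_i ...) Sigma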

EachMovie Data

- 74,424 users, 1,648 movies;
- 2,811,718 numeric ratings $Y_{ij} \in \{1, \ldots, 6\}$;
- 97.17% of the elements are missing;
- use 80% of each user's ratings for training and the rest for testing;
- this random selection is repeated 10 times independently.
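A small illustrative sketch of such a per-user 80/20 split (the rounding and seeding details are my assumptions, not from the slides):

    import numpy as np

    def per_user_split(O, frac=0.8, seed=0):
        """O: M x N boolean mask of observed ratings; returns (train_mask, test_mask)."""
        rng = np.random.default_rng(seed)
        train = np.zeros_like(O)
        for i in range(O.shape[0]):
            J = np.flatnonzero(O[i])                            # this user's rated items
            keep = rng.choice(J, size=int(round(frac * len(J))), replace=False)
            train[i, keep] = True
        return train, O & ~train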

Compared methods

- User Mean & Movie Mean: prediction by the empirical mean;
- FMMMF: fast max-margin matrix factorization (Rennie & Srebro, 2005);
- PPCA: probabilistic principal component analysis (Tipping & Bishop, 1999);
- BSRM: Bayesian stochastic relational model (Zhu, Yu, & Gong, 2009); BSRM-1 uses no additional user/movie attributes;
- NREM: nonparametric random effects model; NREM-1 uses no additional attributes.

The additional attributes are the top 20 eigenvectors of the binary matrix indicating whether a rating is observed or not.

Results on EachMovie

Table: Prediction error on EachMovie data

    Method       RMSE     Standard Error   Run Time (hours)
    User Mean    1.4251   0.0004           -
    Movie Mean   1.3866   0.0004           -
    FMMMF        1.1552   0.0008           4.94
    PPCA         1.1045   0.0004           1.26
    BSRM-1       1.0902   0.0003           1.67
    BSRM-2       1.0852   0.0003           1.70
    NREM-1       1.0816   0.0003           0.59
    NREM-2       1.0758   0.0003           0.59

Netflix Data

- 100,480,507 ratings from 480,189 users on 17,770 movies, $Y_{ij} \in \{1, 2, 3, 4, 5\}$;
- a validation set contains 1,408,395 ratings;
- thus 98.81% of the elements of the rating matrix are missing;
- the test set includes 2,817,131 ratings.

Compared methods

In addition to those compared in the EachMovie experiment, several other methods are included:

- SVD: a method almost the same as FMMMF, using a gradient-based method for optimization (Kurucz, Benczur, & Csalogany, 2007);
- RBM: restricted Boltzmann machine trained with contrastive divergence (Salakhutdinov, Mnih, & Hinton, 2007);
- PMF and BPMF: probabilistic matrix factorization (Salakhutdinov & Mnih, 2008b) and its Bayesian version (Salakhutdinov & Mnih, 2008a);
- PMF-VB: probabilistic matrix factorization using a variational Bayes method for inference (Lim & Teh, 2007).

Results on Netflix

Table: Performance on Netflix data

    Method      RMSE     Run Time (hours)
    Cinematch   0.9514   -
    SVD         0.920    300
    PMF         0.9265   -
    RBM         0.9060   -
    PMF-VB      0.9141   -
    BPMF        0.8954   1100
    BSRM-2      0.8881   350
    NREM-1      0.8876   148
    NREM-2      0.8853   150

Predictive Uncertainty

[Figure: standard deviations of prediction residuals vs. standard deviations predicted by our model on EachMovie.]

Related work

- Multi-task learning using Gaussian processes: methods that learn the covariance $\Sigma$ shared across tasks (Lawrence & Platt, 2004; Schwaighofer, Tresp & Yu, 2004; Yu, Tresp & Schwaighofer, 2005), and those that additionally consider the covariance $\Omega$ between tasks (Yu, Chu, Yu, Tresp, & Xu, 2007; Bonilla, Chai, & Williams, 2008).
- Application of GP models to collaborative filtering (Schwaighofer, Tresp & Yu, 2004; Yu & Chu, 2007).
- Low-rank matrix factorization, e.g., Salakhutdinov & Mnih (2008b). Our model is nonparametric in the sense that no rank constraint is imposed.
- Very few matrix factorization methods use known predictors. One such work (Hoff, 2005) introduces low-rank multiplicative random effects in modeling networked observations.

Summary

- The model provides a novel way to use random effects and known attributes to explain the complex dependence of data;
- we make the nonparametric model scalable and efficient on very large-scale problems;
- our experiments demonstrate that the algorithm works very well on two challenging collaborative prediction problems;
- in the near future, it will be promising to perform full Bayesian inference with a parallel Gibbs sampling method.