Distributed Gaussian Processes


Distributed Gaussian Processes. Marc Deisenroth, Department of Computing, Imperial College London. http://wp.doc.ic.ac.uk/sml/marc-deisenroth. Gaussian Process Summer School, University of Sheffield, 15th September 2015.

Table of Contents: Gaussian Processes, Sparse Gaussian Processes, Distributed Gaussian Processes

Problem Setting
(Figure: noisy observations of a one-dimensional function f(x).)
Objective: for a set of N observations y_i = f(x_i) + ε, ε ~ N(0, σ_ε^2), find a distribution over functions p(f | X, y) that explains the data. A GP is a good solution to this probabilistic regression problem.

GP Training via Marginal Likelihood Maximization
GP training: maximize the evidence/marginal likelihood p(y | X, θ) with respect to the hyper-parameters θ:
θ* ∈ arg max_θ log p(y | X, θ),
log p(y | X, θ) = -1/2 y^T K^{-1} y - 1/2 log|K| + const
Automatic trade-off between data fit and model complexity.
Gradient-based optimization is possible:
∂ log p(y | X, θ)/∂θ = 1/2 y^T K^{-1} (∂K/∂θ) K^{-1} y - 1/2 tr(K^{-1} ∂K/∂θ)
Computational complexity: O(N^3) for computing |K| and K^{-1}.
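The following is a minimal NumPy/SciPy sketch of this training procedure for an RBF kernel (not part of the original slides; the kernel choice, function names, and the toy data are illustrative assumptions):

    import numpy as np
    from scipy.optimize import minimize

    def rbf_kernel(X1, X2, log_lengthscale, log_signal_std):
        # Squared-exponential kernel k(x, x') = s_f^2 exp(-||x - x'||^2 / (2 l^2))
        ell, sf2 = np.exp(log_lengthscale), np.exp(2.0 * log_signal_std)
        d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
        return sf2 * np.exp(-0.5 * d2 / ell**2)

    def negative_log_marginal_likelihood(theta, X, y):
        # theta = [log lengthscale, log signal std, log noise std]
        K = rbf_kernel(X, X, theta[0], theta[1]) + np.exp(2.0 * theta[2]) * np.eye(len(X))
        L = np.linalg.cholesky(K)                              # O(N^3)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # K^{-1} y
        return (0.5 * y @ alpha                                # data-fit term
                + np.sum(np.log(np.diag(L)))                   # 1/2 log|K| (complexity term)
                + 0.5 * len(X) * np.log(2.0 * np.pi))

    # Train the hyper-parameters by maximizing the evidence (numerical gradients for brevity)
    X = np.random.uniform(-5, 5, (100, 1))
    y = np.sin(X[:, 0]) + 0.1 * np.random.randn(100)
    res = minimize(negative_log_marginal_likelihood, np.zeros(3), args=(X, y))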

GP Predictions
At a test point x_*, the predictive (posterior) distribution is Gaussian:
p(f(x_*) | x_*, X, y, θ) = N(f_* | m_*, σ_*^2)
m_* = k(X, x_*)^T K^{-1} y
σ_*^2 = k(x_*, x_*) - k(X, x_*)^T K^{-1} k(X, x_*)
If K^{-1} and K^{-1} y are cached after training, the mean prediction can be computed in O(N) and the variance prediction in O(N^2).
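Continuing the sketch above (same assumed rbf_kernel and trained hyper-parameters res.x), prediction with cached quantities might look as follows:

    def gp_predict(X_star, X, y, theta):
        K = rbf_kernel(X, X, theta[0], theta[1]) + np.exp(2.0 * theta[2]) * np.eye(len(X))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # cache K^{-1} y once
        k_star = rbf_kernel(X, X_star, theta[0], theta[1])     # N x N_* cross-covariances
        mean = k_star.T @ alpha                                # O(N) per test point
        v = np.linalg.solve(L, k_star)
        # prior variance k(x_*, x_*) = s_f^2 minus the explained variance, O(N^2) per test point
        var = np.exp(2.0 * theta[1]) - np.sum(v**2, axis=0)
        return mean, var

    mean, var = gp_predict(np.linspace(-5, 5, 50)[:, None], X, y, res.x)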

Application Areas
Bayesian optimization (experimental design): model unknown utility functions with GPs.
Reinforcement learning and robotics: model value functions and/or dynamics with GPs.
Data visualization: nonlinear dimensionality reduction (GP-LVM).

Limitations of Gaussian Processes
Computational and memory complexity: training scales in O(N^3), prediction (variances) scales in O(N^2), and the memory requirement is O(ND + N^2). Practical limit: N ≈ 10,000.

Table of Contents: Gaussian Processes, Sparse Gaussian Processes, Distributed Gaussian Processes

GP Factor Graph
(Figure: factor graph of a GP, showing training function values f(x_1), ..., f(x_N), test function values f_1^*, ..., f_L^*, and inducing function values f(u_1), ..., f(u_M).)
Probabilistic graphical model (factor graph) of a GP. All function values are jointly Gaussian distributed (e.g., training and test function values).
GP prior:
p(f, f_*) = N( 0, [K_ff, K_f*; K_*f, K_**] )

Inducing Variables
(Figure: factor graph augmented with hypothetical inducing function values f(u_1), ..., f(u_M).)
Introduce inducing function values f_u (hypothetical function values). All function values are still jointly Gaussian distributed (e.g., training, test and inducing function values).
Approach: compress real function values into inducing function values.

Central Approximation Scheme
Approximation: the training and test sets are conditionally independent given the inducing function values: f ⊥ f_* | f_u.
Then the effective GP prior is
q(f, f_*) = ∫ p(f | f_u) p(f_* | f_u) p(f_u) df_u = N( 0, [K_ff, Q_f*; Q_*f, K_**] ),
Q_f* := K_fu K_uu^{-1} K_uf*  (Nyström approximation).
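A minimal sketch of the Nyström approximation Q = K_fu K_uu^{-1} K_uf, reusing the assumed rbf_kernel from above (the jitter term for numerical stability is an implementation detail, not from the slides):

    def nystrom(X, U, theta, jitter=1e-6):
        # Q_ff = K_fu K_uu^{-1} K_uf: a rank-M approximation of K_ff
        K_fu = rbf_kernel(X, U, theta[0], theta[1])                         # N x M
        K_uu = rbf_kernel(U, U, theta[0], theta[1]) + jitter * np.eye(len(U))
        return K_fu @ np.linalg.solve(K_uu, K_fu.T)                         # N x N, rank <= M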

FI(T)C Sparse Approximation
Assume that the training (and test) sets are fully independent given the inducing variables (Snelson & Ghahramani, 2006).
Effective GP prior with this approximation:
q(f, f_*) = N( 0, [Q_ff - diag(Q_ff - K_ff), Q_f*; Q_*f, K_**] )
Q_** - diag(Q_** - K_**) can be used instead of K_** (FIC).
Training: O(NM^2), prediction: O(M^2).
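As a sketch of the FI(T)C training covariance, again reusing the helpers above (for the RBF kernel the exact diagonal of K_ff is simply the signal variance):

    def fitc_train_cov(X, U, theta):
        # Q_ff with its diagonal replaced by the exact diag(K_ff)
        Q_ff = nystrom(X, U, theta)
        K_ff_diag = np.exp(2.0 * theta[1]) * np.ones(len(X))
        return Q_ff - np.diag(np.diag(Q_ff) - K_ff_diag)

A practical O(NM^2) implementation would work with K_fu and K_uu directly rather than forming this N x N matrix; the code above only illustrates the structure of the approximation.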

Inducing Inputs
The FI(T)C sparse approximation exploits inducing function values f(u_i), where the u_i are the corresponding inputs. These inputs are unknown a priori, so we need to find optimal ones.
Find them by maximizing the FI(T)C marginal likelihood with respect to the inducing inputs (and the standard hyper-parameters):
u*_{1:M} ∈ arg max_{u_{1:M}} q_FITC(y | X, u_{1:M}, θ)
Intuitively: the marginal likelihood is parameterized not only by the hyper-parameters θ, but also by the inducing inputs u_{1:M}. We end up with a high-dimensional non-convex optimization problem with MD additional parameters.
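A hedged sketch of how the inducing inputs can be stacked into the optimization vector (building on the helpers above; this naive version forms the full N x N FITC covariance and is O(N^3), purely for clarity):

    def fitc_neg_log_marginal_likelihood(params, X, y, M):
        # params = [3 hyper-parameters, flattened inducing inputs u_{1:M}]
        theta, U = params[:3], params[3:].reshape(M, X.shape[1])
        C = fitc_train_cov(X, U, theta) + np.exp(2.0 * theta[2]) * np.eye(len(X))
        L = np.linalg.cholesky(C)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(X) * np.log(2.0 * np.pi)

    # Jointly optimize the hyper-parameters and the M*D inducing-input coordinates
    M = 20
    U0 = X[np.random.choice(len(X), M, replace=False)]    # initialize at a random data subset
    params0 = np.concatenate([np.zeros(3), U0.ravel()])
    res_fitc = minimize(fitc_neg_log_marginal_likelihood, params0, args=(X, y, M))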

FITC Example
(Figure from Ed Snelson: pink: original data; red crosses: initialization of the inducing inputs; blue crosses: locations of the inducing inputs after optimization.)
Efficient compression of the original data set.

Summary: Sparse Gaussian Processes
Sparse approximations typically approximate a GP with N data points by a model with M ≪ N data points. Selection of these M data points can be tricky and may involve non-trivial computations (e.g., optimizing inducing inputs). Simple (random) subset selection is fast and generally robust (Chalupka et al., 2013).
Computational complexity: O(M^3) or O(NM^2) for training. Practical limit M ≤ 10^4; often M ∈ O(10^2) in the case of inducing variables. If we set M = N/100, i.e., each inducing function value summarizes 100 real function values, the practical limit is N ∈ O(10^6).

Table of Contents: Gaussian Processes, Sparse Gaussian Processes, Distributed Gaussian Processes

Distributed Gaussian Processes (joint work with Jun Wei Ng)

An Orthogonal Approximation: Distributed GPs
(Figure: the full kernel matrix of a standard GP, with O(N^3) training cost, versus the block-diagonal approximation used by the distributed GP, with O(MP^3) training cost.)
Randomly split the full data set into M chunks. Place M independent GP models (experts) on these small chunks. The independent computations can be distributed. This corresponds to a block-diagonal approximation of the kernel matrix K. Combine the independent computations into an overall result.

Training the Distributed GP
Split the data set of size N into M chunks of size P. Independence of the experts gives a factorization of the marginal likelihood:
log p(y | X, θ) ≈ Σ_{k=1}^M log p_k(y^{(k)} | X^{(k)}, θ)
Distributed optimization and training are straightforward.
Computational complexity: O(MP^3) [instead of O(N^3)], and it can be distributed over many machines. Memory footprint: O(MP^2 + ND) [instead of O(N^2 + ND)].
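A minimal sketch of the factorized training objective (reusing negative_log_marginal_likelihood from the single-GP sketch above; in practice the independent per-expert terms would be evaluated in parallel, e.g. via multiprocessing, which is not shown here):

    def distributed_neg_log_marginal_likelihood(theta, chunks):
        # Sum of independent per-expert NLML terms, all sharing the same hyper-parameters theta
        return sum(negative_log_marginal_likelihood(theta, X_k, y_k) for X_k, y_k in chunks)

    # Randomly split the data into M chunks and train shared hyper-parameters
    M_experts = 10
    perm = np.random.permutation(len(X))
    chunks = [(X[idx], y[idx]) for idx in np.array_split(perm, M_experts)]
    res_dgp = minimize(distributed_neg_log_marginal_likelihood, np.zeros(3), args=(chunks,))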

Empirical Training Time
(Figure: computation time of the NLML and its gradients as a function of training-set size, for the distributed GP, the full GP, and FITC, together with the number of GP experts used by the distributed GP.)
The time to compute the NLML and its gradients is proportional to the training time. A full GP with 16K training points ≈ a sparse GP with 50K training points ≈ a distributed GP with 16M training points. This pushes the practical limit by order(s) of magnitude.

Practical Training Times
Training* with N = 10^6, D = 1 on a laptop: ≈ 10-30 min. Training* with N = 5 × 10^6, D = 8 on a workstation: ≈ 4 hours.
*: Maximize the marginal likelihood, stop when converged**.
**: Convergence often after 30-80 line searches***.
***: A line search requires ≈ 2-3 evaluations of the marginal likelihood and its gradient (usually O(N^3)).

Predictions with the Distributed GP
The prediction of each GP expert is Gaussian, N(μ_k, σ_k^2). How do we combine them into an overall prediction N(μ, σ^2)?
Product-of-GP-experts approaches:
PoE (product of experts; Ng & Deisenroth, 2014)
gPoE (generalized product of experts; Cao & Fleet, 2014)
BCM (Bayesian committee machine; Tresp, 2000)
rBCM (robust BCM; Deisenroth & Ng, 2015)

Objectives
(Figure: two computational graphs for combining the experts' predictions, a single-layer and a two-layer architecture.)
Scale to large data sets. Good approximation of the full GP ("ground truth"). Predictions independent of the computational graph, so the method runs on heterogeneous computing infrastructures (laptop, cluster, ...). Reasonable predictive variances.

Running Example
Investigate various product-of-experts models: same training procedure, but different mechanisms for predictions.

Product of GP Experts
Prediction model (independent predictors):
p(f_* | x_*, D) ∝ ∏_{k=1}^M p_k(f_* | x_*, D^{(k)}),  p_k(f_* | x_*, D^{(k)}) = N(f_* | μ_k(x_*), σ_k^2(x_*))
Predictive precision (inverse variance) and mean:
(σ_*^poe)^{-2} = Σ_k σ_k^{-2}(x_*)
μ_*^poe = (σ_*^poe)^2 Σ_k σ_k^{-2}(x_*) μ_k(x_*)
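A minimal sketch of the PoE combination at a single test input (mu_k and var_k are arrays holding the experts' predictive means and variances; the names are illustrative):

    def poe_combine(mu_k, var_k):
        # Product of Gaussian experts: precisions add, means are precision-weighted
        prec = np.sum(1.0 / var_k)
        var = 1.0 / prec
        mu = var * np.sum(mu_k / var_k)
        return mu, var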

Computational Graph
(Figure: computational graph with GP experts at the leaves and PoE nodes combining them.)
Prediction:
p(f_* | x_*, D) ∝ ∏_{k=1}^M p_k(f_* | x_*, D^{(k)})
Multiplication is associative: a·b·c·d = (a·b)·(c·d). Hence
∏_{k=1}^M p_k(f_* | D^{(k)}) = ∏_{k=1}^L ∏_{i=1}^{L_k} p_{k_i}(f_* | D^{(k_i)}),  Σ_{k=1}^L L_k = M
Independent of the computational graph.

Product of GP Experts
Unreasonable variances for M > 1:
(σ_*^poe)^{-2} = Σ_k σ_k^{-2}(x_*)
The more experts, the more certain the prediction, even if every expert itself is very uncertain. The model cannot fall back to the prior.

Generalized Product of GP Experts
Weight the responsibility of each expert in the PoE with β_k.
Prediction model (independent predictors):
p(f_* | x_*, D) ∝ ∏_{k=1}^M p_k^{β_k}(f_* | x_*, D^{(k)}),  p_k(f_* | x_*, D^{(k)}) = N(f_* | μ_k(x_*), σ_k^2(x_*))
Predictive precision and mean:
(σ_*^gpoe)^{-2} = Σ_k β_k σ_k^{-2}(x_*)
μ_*^gpoe = (σ_*^gpoe)^2 Σ_k β_k σ_k^{-2}(x_*) μ_k(x_*)
With Σ_k β_k = 1, the model can fall back to the prior. This is the log-opinion-pool model (Heskes, 1998).
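The corresponding gPoE combination, sketched under the same assumptions as the PoE code above (beta defaults to 1/M so that the weights sum to one):

    def gpoe_combine(mu_k, var_k, beta=None):
        # Weighted product of Gaussian experts; sum(beta) = 1 allows falling back to the prior
        beta = np.full(len(mu_k), 1.0 / len(mu_k)) if beta is None else beta
        prec = np.sum(beta / var_k)
        var = 1.0 / prec
        mu = var * np.sum(beta * mu_k / var_k)
        return mu, var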

Computational Graph
(Figure: computational graph with GP experts at the leaves, combined by PoE and gPoE nodes.)
Prediction:
p(f_* | x_*, D) ∝ ∏_{k=1}^M p_k^{β_k}(f_* | x_*, D^{(k)}) = ∏_{k=1}^L ∏_{i=1}^{L_k} p_{k_i}^{β_{k_i}}(f_* | D^{(k_i)}),  Σ_{k,i} β_{k_i} = 1
Independent of the computational graph if Σ_{k,i} β_{k_i} = 1. An a-priori setting of the β_{k_i} is required, e.g., β_{k_i} = 1/M a priori.

Generalized Product of GP Experts
Same mean as the PoE. The model is no longer overconfident and falls back to the prior, but the variances are very conservative.

Bayesian Committee Machine
Apply Bayes' theorem when combining predictions (and not only for computing the individual predictions).
Prediction model (conditional independence D^{(j)} ⊥ D^{(k)} | f_*):
p(f_* | x_*, D) ∝ [∏_{k=1}^M p_k(f_* | x_*, D^{(k)})] / p^{M-1}(f_*)
Predictive precision and mean:
(σ_*^bcm)^{-2} = Σ_{k=1}^M σ_k^{-2}(x_*) - (M-1) σ_**^{-2}
μ_*^bcm = (σ_*^bcm)^2 Σ_{k=1}^M σ_k^{-2}(x_*) μ_k(x_*)
where σ_**^2 is the prior variance of f_*. This is the product of GP experts, divided by (M-1) times the prior. The model is guaranteed to fall back to the prior outside the data regime.
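A sketch of the BCM combination (same illustrative conventions as above; prior_var is the prior variance σ_**^2 = k(x_*, x_*) of the latent function at the test input):

    def bcm_combine(mu_k, var_k, prior_var):
        # Product of experts, corrected by (M-1) copies of the prior precision
        M = len(mu_k)
        prec = np.sum(1.0 / var_k) - (M - 1) / prior_var
        var = 1.0 / prec
        mu = var * np.sum(mu_k / var_k)
        return mu, var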

Computational Graph
(Figure: computational graph with GP experts at the leaves, PoE nodes in between, and a prior correction at the root.)
Prediction:
p(f_* | x_*, D) ∝ [∏_{k=1}^M p_k(f_* | x_*, D^{(k)})] / p^{M-1}(f_*) = [∏_{k=1}^L ∏_{i=1}^{L_k} p_{k_i}(f_* | D^{(k_i)})] / p^{M-1}(f_*)
Independent of the computational graph.

Bayesian Committee Machine
The variance estimates are about right. However, when leaving the data regime, the BCM can produce junk. This motivates a robustified version.

Robust Bayesian Committee Machine
Merge the gPoE (weighting of the experts) with the BCM (Bayes' theorem when combining predictions).
Prediction model (conditional independence D^{(j)} ⊥ D^{(k)} | f_*):
p(f_* | x_*, D) ∝ [∏_{k=1}^M p_k^{β_k}(f_* | x_*, D^{(k)})] / p^{Σ_k β_k - 1}(f_*)
Predictive precision and mean:
(σ_*^rbcm)^{-2} = Σ_{k=1}^M β_k σ_k^{-2}(x_*) + (1 - Σ_{k=1}^M β_k) σ_**^{-2}
μ_*^rbcm = (σ_*^rbcm)^2 Σ_k β_k σ_k^{-2}(x_*) μ_k(x_*)
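A sketch of the rBCM combination; the particular weight β_k = 0.5 (log σ_**^2 - log σ_k^2(x_*)) used below follows Deisenroth & Ng (2015) and is an assumption on top of these slides, which leave β_k unspecified:

    def rbcm_combine(mu_k, var_k, prior_var):
        # Expert weights: reduction in differential entropy from prior to expert posterior
        beta = 0.5 * (np.log(prior_var) - np.log(var_k))
        prec = np.sum(beta / var_k) + (1.0 - np.sum(beta)) / prior_var
        var = 1.0 / prec
        mu = var * np.sum(beta * mu_k / var_k)
        return mu, var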

Computational Graph
(Figure: computational graph with GP experts at the leaves, PoE/gPoE nodes in between, and a prior correction at the root.)
Prediction:
p(f_* | x_*, D) ∝ [∏_{k=1}^M p_k^{β_k}(f_* | x_*, D^{(k)})] / p^{Σ_k β_k - 1}(f_*) = [∏_{k=1}^L ∏_{i=1}^{L_k} p_{k_i}^{β_{k_i}}(f_* | D^{(k_i)})] / p^{Σ_{k,i} β_{k_i} - 1}(f_*)
Independent of the computational graph, even with arbitrary β_{k_i}.

Robust Bayesian Committee Machine
Does not break down in the case of weak experts (a robustified version of the BCM). Reasonable predictions. Independent of the computational graph (for all choices of β_k).

Empirical Approximation Error
(Figure: RMSE as a function of gradient computation time for rBCM, BCM, gPoE, PoE, SOD and the full GP, with 39 to 10,000 points per expert.)
Simulated robot-arm data (10K training, 10K test points), using the hyper-parameters of the ground-truth full GP. RMSE is shown as a function of the training time. The sparse GP (SOD, subset of data) performs worse than any distributed GP. The rBCM performs best with weak GP experts.

Empirical Approximation Error (2)
(Figure: NLPD as a function of gradient computation time for the same methods and settings.)
NLPD as a function of the training time assesses both the mean and the variance. The BCM and PoE are not robust for weak experts. The gPoE suffers from too conservative variances. The rBCM consistently outperforms the other methods.

Large Data Sets: Airline Data (700K)
(Figure: RMSE on the airline-delay data set (700K points) for the DGP with random and KD-tree data assignment (rBCM, BCM, gPoE, PoE), compared with the sparse Dist-VGP and SVI-GP baselines.)
The rBCM performs best. The (r)BCM and (g)PoE use 4096 GP experts; gradient time is 13 seconds (12 cores). Inducing-input baselines: Dist-VGP (Gal et al., 2014) and SVI-GP (Hensman et al., 2013). The (g)PoE and BCM perform no worse than the sparse GPs. KD-tree data assignment clearly helps the (r)BCM.

Summary: Distributed Gaussian Processes
(Figure: two-layer computational graph combining the experts' predictions.)
Scale Gaussian processes to large data sets (beyond 10^6 data points). The model is conceptually straightforward and easy to train. Key: distributed computation. Currently tested with N ∈ O(10^7); scales to arbitrarily large data sets (with enough computing power).
Contact: m.deisenroth@imperial.ac.uk

References
[1] Y. Cao and D. J. Fleet. Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process Predictions. http://arxiv.org/abs/1410.7827, October 2014.
[2] K. Chalupka, C. K. I. Williams, and I. Murray. A Framework for Evaluating Approximate Methods for Gaussian Process Regression. Journal of Machine Learning Research, 14:333-350, February 2013.
[3] M. P. Deisenroth and J. W. Ng. Distributed Gaussian Processes. In Proceedings of the International Conference on Machine Learning, 2015.
[4] M. P. Deisenroth, C. E. Rasmussen, and D. Fox. Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2011.
[5] Y. Gal, M. van der Wilk, and C. E. Rasmussen. Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models. In Advances in Neural Information Processing Systems, 2014.
[6] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian Processes for Big Data. In A. Nicholson and P. Smyth, editors, Proceedings of the Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2013.
[7] T. Heskes. Selecting Weighting Factors in Logarithmic Opinion Pools. In Advances in Neural Information Processing Systems, pages 266-272. Morgan Kaufmann, 1998.
[8] N. Lawrence. Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models. Journal of Machine Learning Research, 6:1783-1816, November 2005.
[9] J. Ng and M. P. Deisenroth. Hierarchical Mixture-of-Experts Model for Large-Scale Gaussian Process Regression. http://arxiv.org/abs/1412.3078, December 2014.
[10] J. Quiñonero-Candela and C. E. Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6(2):1939-1960, 2005.
[11] E. Snelson and Z. Ghahramani. Sparse Gaussian Processes using Pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257-1264. The MIT Press, Cambridge, MA, USA, 2006.
[12] M. K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009.
[13] V. Tresp. A Bayesian Committee Machine. Neural Computation, 12(11):2719-2741, 2000.