Tree-structured Gaussian Process Approximations

Tree-structured Gaussian Process Approximations
Thang Bui, joint work with Richard Turner
MLG, Cambridge, 1 July 2014

Outline
1. Introduction
2. Tree-structured GP approximation
3. Experiments
4. Summary

GPs for regression - a quick recap

Given {x_n, y_n}_{n=1}^N with

    y_n = f(x_n) + ε_n,   ε_n ~iid N(0, σ_n²),   f ~ GP(0, k_θ(·, ·)).

[Graphical model: latent function f with values f_1, ..., f_N at the training inputs.]

The posterior is also a GP, with

    m_f(x) = K_xf (K_ff + σ_n² I)^{-1} y,
    k_f(x, x') = k(x, x') - K_xf (K_ff + σ_n² I)^{-1} K_fx,

and the log marginal likelihood used for learning is

    L = log N(y; 0, K_ff + σ_n² I).

Cost: O(N³).
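As a concrete companion to the recap above, here is a minimal NumPy sketch (not part of the talk) of exact GP regression with a squared-exponential kernel: it computes the posterior mean, posterior covariance and log marginal likelihood, and the Cholesky factorisation of K_ff + σ_n² I is the O(N³) step. The kernel parameters and toy data are arbitrary illustrative choices.

import numpy as np

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    # k(x, x') = sigma^2 exp(-(x - x')^2 / (2 ell^2))
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_regression(x, y, x_star, sigma_n=0.1):
    Kff = se_kernel(x, x)
    Ksf = se_kernel(x_star, x)
    Kss = se_kernel(x_star, x_star)
    A = Kff + sigma_n**2 * np.eye(len(x))
    L = np.linalg.cholesky(A)                              # the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K_ff + s^2 I)^{-1} y
    mean = Ksf @ alpha                                     # m_f(x*)
    V = np.linalg.solve(L, Ksf.T)
    cov = Kss - V.T @ V                                    # k_f(x*, x*')
    log_ml = (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
              - 0.5 * len(x) * np.log(2 * np.pi))          # log N(y; 0, K_ff + s^2 I)
    return mean, cov, log_ml

x = np.linspace(-5, 5, 100)
y = np.sin(x) + 0.1 * np.random.randn(100)
mean, cov, log_ml = gp_regression(x, y, np.linspace(-6, 6, 50))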

Prior work

Indirect posterior approximation schemes:
- Introduce a pseudo-dataset {x_m, u_m}_{m=1}^M and remove some dependencies in the prior: FITC, PI(T)C (Snelson and Ghahramani 2006; Snelson and Ghahramani 2007)
- Approximate the prior using M cosine basis functions: SSGP (Lázaro-Gredilla et al. 2010)

Direct posterior approximation schemes:
- Variational free energy approach (Seeger 2003; Titsias 2009), with an SVI extension to handle big data (Hensman et al. 2013)
- Expectation propagation (Qi et al. 2010)

Local approximations (Tresp 2000; Urtasun and Darrell 2008)

Fully independent training conditionals (FITC or SPGP) (Snelson and Ghahramani 2006)

- Introduce a pseudo-dataset {x_m, u_m}_{m=1}^M; the prior becomes p(u, f) = p(u) p(f|u).
- Assume f_i ⊥ f_j | u for all i, j, giving the approximate prior q(u, f) = q(u) ∏_i q(f_i|u).

[Graphical model: pseudo-points u at the top, with f_1, ..., f_N conditionally independent given u.]

- Calibrate the model by minimising KL(p(u, f) || q(u, f)), which gives q(u) = p(u) and q(f_i|u) = p(f_i|u).
- New generative model:

    p(u) = N(u; 0, K_uu),
    p(y|u) = N(y; K_fu K_uu^{-1} u, diag(K_ff - K_fu K_uu^{-1} K_uf) + σ_n² I).

- Cost: O(NM²)
- If we instead assume f_{B_i} ⊥ f_{B_j} | u for blocks i ≠ j, we obtain PI(T)C:

    p(y|u) = N(y; K_fu K_uu^{-1} u, blkdiag(K_ff - K_fu K_uu^{-1} K_uf) + σ_n² I).
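To make the modified generative model concrete, here is a hedged NumPy sketch (an assumed implementation, not the authors' code) of the FITC log marginal likelihood log N(y; 0, Q_ff + diag(K_ff - Q_ff) + σ_n² I), with Q_ff = K_fu K_uu^{-1} K_uf. For brevity it forms the N x N covariance directly; exploiting the low-rank-plus-diagonal structure via the Woodbury identity is what brings the cost down to O(NM²). The kernel(a, b) interface and the pseudo-input locations z are assumed inputs.

import numpy as np

def fitc_log_marginal(x, y, z, kernel, sigma_n):
    # x: (N,) inputs, y: (N,) targets, z: (M,) pseudo-inputs,
    # kernel(a, b): covariance matrix between input sets a and b (assumed signature)
    Kuu = kernel(z, z) + 1e-8 * np.eye(len(z))   # jitter for numerical stability
    Kfu = kernel(x, z)
    kff_diag = np.diag(kernel(x, x))
    L = np.linalg.cholesky(Kuu)
    V = np.linalg.solve(L, Kfu.T)                # V^T V = Q_ff = K_fu K_uu^{-1} K_uf
    qff_diag = np.sum(V**2, axis=0)
    cov = V.T @ V + np.diag(kff_diag - qff_diag + sigma_n**2)
    sign, logdet = np.linalg.slogdet(cov)
    alpha = np.linalg.solve(cov, y)
    return -0.5 * (y @ alpha + logdet + len(y) * np.log(2 * np.pi))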

Variational free energy (VFE) approach (Titsias 2009)

- Augment the model with a pseudo-dataset {x_m, u_m}_{m=1}^M; the joint posterior of f and u is p(f, u|y) ∝ p(f, u) p(y|f).
- Introducing the variational distribution q(f, u) = p(f|u) q(u) gives the ELBO

    F(q(u)) = ∫ du df p(f|u) q(u) log [ p(u) p(y|f) / q(u) ].

- The optimal distribution is

    q(u) = (1/Z) p(u) exp( ∫ df p(f|u) log p(y|f) ),

  and the bound becomes

    F(q(u)) = log N(y; 0, σ_n² I + K_fu K_uu^{-1} K_uf) - (1 / 2σ_n²) Tr(K_ff - K_fu K_uu^{-1} K_uf).

- Cost: O(NM²)
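The collapsed bound above can be evaluated in O(NM²) using the matrix inversion and determinant lemmas; the sketch below (an assumed implementation, not from the talk) does exactly that, with the same assumed kernel(a, b) interface as before.

import numpy as np

def vfe_bound(x, y, z, kernel, sigma_n):
    N, M = len(x), len(z)
    Kuu = kernel(z, z) + 1e-8 * np.eye(M)
    Kfu = kernel(x, z)
    L = np.linalg.cholesky(Kuu)
    V = np.linalg.solve(L, Kfu.T)                          # V^T V = Q_ff
    B = np.eye(M) + V @ V.T / sigma_n**2                   # I + V V^T / s^2
    LB = np.linalg.cholesky(B)
    beta = np.linalg.solve(LB, V @ y) / sigma_n**2
    quad = y @ y / sigma_n**2 - beta @ beta                # y^T (s^2 I + Q_ff)^{-1} y
    logdet = 2 * np.sum(np.log(np.diag(LB))) + N * np.log(sigma_n**2)
    log_gauss = -0.5 * (quad + logdet + N * np.log(2 * np.pi))
    trace_term = (np.sum(np.diag(kernel(x, x))) - np.sum(V**2)) / (2 * sigma_n**2)
    return log_gauss - trace_term                          # F(q(u))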

Example 1

[Figure: exact GP vs. VFE posterior on a smooth 1D dataset, N = 100, M = 10.]

We only need a small M if the underlying function is simple.

Example 2

[Figure: exact GP vs. VFE posterior on a more complicated 1D dataset, N = 100, M = 10.]

M needs to be large if the underlying function is complicated.

Limitations of global approximations

[Figure: 1D dataset spanning a range L, with a function whose lengthscale l is much shorter than L.]

Approximately, we need M ≳ ∏_{d=1}^D L_d / l_d, where L_d and l_d are the data range and lengthscale in dimension d, i.e. M must be large when
- datasets span a large input space, e.g. time-series or spatial datasets;
- the underlying functions have short lengthscales, i.e. lots of wiggles.

O(NM²) is then still expensive! Local approximations may give a better time/accuracy trade-off.
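A tiny illustration of the heuristic above, with made-up numbers: a 2D spatial dataset spanning 30 km by 20 km, modelled with a 1 km lengthscale in each dimension, already calls for on the order of 600 pseudo-points.

ranges = [30.0, 20.0]        # data range L_d in each dimension (illustrative values)
lengthscales = [1.0, 1.0]    # lengthscale l_d in each dimension
M_required = 1.0
for L_d, l_d in zip(ranges, lengthscales):
    M_required *= L_d / l_d
print(M_required)            # 600.0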

Local GPs

- Divide the training set into M disjoint partitions {B_i}_{i=1}^M, where B_i = {x_j, y_j}_{j=1}^{N_i}.

[Graphical model: independent blocks f_{B_1}, ..., f_{B_M}.]

- Obtain the posterior for each partition: p(f_i|y_{B_i}) ∝ p(f_i) p(y_{B_i}|f_i).
- Predict using the posterior of only the partition closest to the test point (see Tresp 2000 for a way to combine predictors):

    p(f*|y_{B_i}) = ∫ df_i p(f*|f_i) p(f_i|y_{B_i}).

- Partitions can have shared or separate hyperparameters.
- Cost: O(ND²), where D is the average size of the partitions.
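A minimal sketch of this scheme for 1D inputs (assumed code, not the authors' implementation): each partition is fitted independently, and a test point is predicted from the partition whose centre is nearest; kernel(a, b) is again an assumed covariance-function interface.

import numpy as np

def fit_local_gps(x, y, assignments, kernel, sigma_n):
    # assignments[i] in {0, ..., M-1}: the partition of training point i
    models = []
    for k in np.unique(assignments):
        xk, yk = x[assignments == k], y[assignments == k]
        L = np.linalg.cholesky(kernel(xk, xk) + sigma_n**2 * np.eye(len(xk)))
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, yk))
        models.append((xk, xk.mean(), alpha))              # inputs, centre, (K + s^2 I)^{-1} y
    return models

def predict_local_mean(x_star, models, kernel):
    means = np.empty(len(x_star))
    for i, xs in enumerate(x_star):
        # use only the closest partition; see Tresp (2000) for combining predictors
        xk, centre, alpha = min(models, key=lambda m: abs(xs - m[1]))
        means[i] = (kernel(np.array([xs]), xk) @ alpha)[0]
    return means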

2. Tree-structured GP approximation

Tree-structured approximation (TSGP)

TSGP is in the same family as FITC and PITC, i.e. indirect approximation via prior modification, but with additional structure: local inducing variables for each partition, and sparse connections between the inducing blocks.

[Graphical models: (a) full GP, (b) FITC, (c) PIC, (d) tree/chain structure with block-specific inducing variables u_{B_1}, ..., u_{B_K}.]

Prior modification

Generative model:

    q(u) = ∏_{k=1}^K q(u_{B_k}|u_{B_l}),   (B_l denotes the parent block of B_k)
    q(f|u) = ∏_{k=1}^K q(f_{B_k}|u_{B_k}),
    p(y|f) = ∏_{n=1}^N p(y_n; f_n, σ_n²).

[Graphical model: parent block u_{B_l} connected to u_{B_k} via (A_k, Q_k); u_{B_k} connected to f_{B_k} via (C_k, R_k).]

The model is calibrated by minimising a forward KL divergence,

    KL( p(f, u) || ∏_k q(f_{B_k}|u_{B_k}) q(u_{B_k}|u_{B_l}) ),

which gives

    q(u_{B_k}|u_{B_l}) = p(u_{B_k}|u_{B_l}) = N(u_{B_k}; A_k u_{B_l}, Q_k),
    q(f_{B_k}|u_{B_k}) = p(f_{B_k}|u_{B_k}) = N(f_{B_k}; C_k u_{B_k}, R_k),

where

    A_k = K_{u_k u_l} K_{u_l u_l}^{-1},   Q_k = K_{u_k u_k} - K_{u_k u_l} K_{u_l u_l}^{-1} K_{u_l u_k},
    C_k = K_{f_k u_k} K_{u_k u_k}^{-1},   R_k = K_{f_k f_k} - K_{f_k u_k} K_{u_k u_k}^{-1} K_{u_k f_k}.
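The conditional-Gaussian parameters above follow directly from standard Gaussian conditioning; a hedged helper (assumed code, for illustration only) that computes them for one block k with parent l might look as follows, with kernel(a, b), block inputs x_f and pseudo-inputs z_k, z_l as assumed arguments.

import numpy as np

def tree_block_params(x_f, z_k, z_l, kernel, jitter=1e-8):
    Kkk = kernel(z_k, z_k) + jitter * np.eye(len(z_k))
    Kll = kernel(z_l, z_l) + jitter * np.eye(len(z_l))
    Kkl = kernel(z_k, z_l)
    Kfk = kernel(x_f, z_k)
    A_k = Kkl @ np.linalg.inv(Kll)          # K_{u_k u_l} K_{u_l u_l}^{-1}
    Q_k = Kkk - A_k @ Kkl.T                 # K_{u_k u_k} - K_{u_k u_l} K_{u_l u_l}^{-1} K_{u_l u_k}
    C_k = Kfk @ np.linalg.inv(Kkk)          # K_{f_k u_k} K_{u_k u_k}^{-1}
    R_k = kernel(x_f, x_f) - C_k @ Kfk.T    # K_{f_k f_k} - K_{f_k u_k} K_{u_k u_k}^{-1} K_{u_k f_k}
    return A_k, Q_k, C_k, R_k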

Inference

- Marginalising out f, the model becomes a tree-structured Gaussian model with latent variables u and observations y.
- Special case: a linear Gaussian state space model for time series or 1D data.
- Joint posterior:

    p(u|y) ∝ ∏_{i ∈ V} exp( -½ u_i^T J_i u_i + u_i^T h_i ) ∏_{(i,j) ∈ E} exp( u_i^T J_ij u_j ),

  where

    J_i = Q_i^{-1} + C_i^T (R_i + σ² I_i)^{-1} C_i + Σ_{j ∈ nei(i)} A_j^T Q_j^{-1} A_j,
    h_i = C_i^T (R_i + σ² I_i)^{-1} y_i,
    J_ij = Q_i^{-1} A_i.

- Use the Gaussian belief propagation algorithm to find the marginal distributions p(u_{B_i}|y).
- Prediction at test points: p(f*|y) = ∫ du_{B_i} p(f*|u_{B_i}) p(u_{B_i}|y).
- Cost: O(ND²), where D is the average number of observations per block.
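As a small-scale illustration of these potentials (assumed code, not the authors' implementation), the sketch below assembles the sparse joint precision and information vector of p(u|y) from (A_k, Q_k, C_k, R_k) and the block observations, then solves the system densely. On a tree, Gaussian belief propagation recovers the same marginals with cost linear in the number of blocks; the dense solve here is only a correctness check. The root block is assumed to use its marginal prior covariance as Q and to have no A.

import numpy as np

def posterior_over_u(A, Q, C, R, y_blocks, sigma_n, edges):
    # A, Q, C, R, y_blocks: lists indexed by block; edges: (child i, parent j) pairs
    K = len(Q)
    sizes = [q.shape[0] for q in Q]
    offs = np.cumsum([0] + sizes)
    J = np.zeros((offs[-1], offs[-1]))
    h = np.zeros(offs[-1])
    parent = dict(edges)
    for i in range(K):
        Si = np.linalg.inv(R[i] + sigma_n**2 * np.eye(R[i].shape[0]))
        Ji = np.linalg.inv(Q[i]) + C[i].T @ Si @ C[i]            # J_i: prior + data terms
        for child, par in edges:                                 # children add A^T Q^{-1} A
            if par == i:
                Ji += A[child].T @ np.linalg.inv(Q[child]) @ A[child]
        J[offs[i]:offs[i+1], offs[i]:offs[i+1]] = Ji
        h[offs[i]:offs[i+1]] = C[i].T @ Si @ y_blocks[i]         # h_i
        if i in parent:
            j = parent[i]
            Jij = np.linalg.inv(Q[i]) @ A[i]                     # J_ij
            J[offs[i]:offs[i+1], offs[j]:offs[j+1]] = -Jij       # off-diagonal precision blocks
            J[offs[j]:offs[j+1], offs[i]:offs[i+1]] = -Jij.T
    mean = np.linalg.solve(J, h)        # posterior mean of u
    cov = np.linalg.inv(J)              # posterior covariance of u
    return mean, cov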

Hyperparameter learning

The log marginal likelihood and its derivatives can be computed using the same message passing algorithm:

    p(y_{1:K}|θ) = ∏_{k=1}^K p(y_k|y_{1:k-1}, θ),
    d/dθ log p(y|θ) = Σ_{k=1}^K [ ⟨ d/dθ log p(u_k|u_l) ⟩_{p(u_k, u_l|y)} + ⟨ d/dθ log p(y_k|u_k) ⟩_{p(u_k|y)} ].

Tree construction:
- start with k-means clustering to find the observation blocks,
- use Kruskal's algorithm to greedily select a tree,
- choose a large random subset of the observations in each block to be pseudo-outputs; no optimisation is needed.
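A possible sketch of this tree-construction recipe (assumed code, not the authors' implementation): k-means over the inputs to form blocks, a minimum spanning tree over the block centres (SciPy's routine plays the role of Kruskal's greedy edge selection), and a random subset of each block's inputs as pseudo-input locations.

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def build_tree(X, n_blocks, n_pseudo_per_block, seed=0):
    # X: (N, D) array of inputs
    rng = np.random.default_rng(seed)
    centres, labels = kmeans2(X, n_blocks, minit='++')
    mst = minimum_spanning_tree(cdist(centres, centres)).toarray()
    edges = [tuple(e) for e in np.argwhere(mst > 0)]     # pairs of connected blocks
    pseudo_inputs = []
    for k in range(n_blocks):
        idx = np.where(labels == k)[0]
        take = min(n_pseudo_per_block, len(idx))
        pseudo_inputs.append(X[rng.choice(idx, size=take, replace=False)])
    return labels, edges, pseudo_inputs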

Comparison of KL minimisations

FITC:  minimises KL( p(f, u) || q(u) ∏_n q(f_n|u) )
       result: q(u) = p(u), q(f_n|u) = p(f_n|u)
PIC:   minimises KL( p(f, u) || q(u) ∏_k q(f_{C_k}|u) )
       result: q(u) = p(u), q(f_{C_k}|u) = p(f_{C_k}|u)
PP:    minimises KL( (1/Z) p(u) p(f|u) q(y|u) || p(f, u|y) )
       result: q(y|u) = N(y; K_fu K_uu^{-1} u, σ_n² I)
VFE:   minimises KL( p(f|u) q(u) || p(f, u|y) )
       result: q(u) ∝ p(u) exp( ⟨log p(y|f)⟩_{p(f|u)} )
EP:    minimises KL( q(f; u) p(y_n|f_n) / q_n(f; u) || q(f; u) )
       result: q(f; u) ∝ p(f) ∏_m p(u_m|f_m)
Tree:  minimises KL( p(f, u) || ∏_k q(f_{B_k}|u_{B_k}) q(u_{B_k}|u_{par(B_k)}) )
       result: q(f_{B_k}|u_{B_k}) = p(f_{B_k}|u_{B_k}), q(u_{B_k}|u_{par(B_k)}) = p(u_{B_k}|u_{par(B_k)})

3. Experiments

Audio data

- Task: filling in missing data.
- Data:
  - Subband of a speech signal: N = 50000, SE kernel k_θ(t, t') = σ² exp( -(t - t')² / (2l²) ).
  - Filtered speech signal: N = 50000, spectral mixture kernel k_θ(t, t') = Σ_{k=1}^2 σ_k² cos(ω_k (t - t')) exp( -(t - t')² / (2 l_k²) ).
- Evaluation: prediction error vs. training/test time.

[Figure: imputation of missing segments. Left: subband data; right: filtered signal. Curves show the true signal and the chain and local predictions.]
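For reference, a small NumPy sketch (assumed code, not from the talk) of the two-component spectral mixture kernel quoted above, with the hyperparameters passed as sequences.

import numpy as np

def spectral_mixture_kernel(t1, t2, sigmas, omegas, ells):
    # k(t, t') = sum_k sigma_k^2 cos(omega_k (t - t')) exp(-(t - t')^2 / (2 l_k^2))
    d = t1[:, None] - t2[None, :]
    K = np.zeros_like(d, dtype=float)
    for s, w, l in zip(sigmas, omegas, ells):
        K += s**2 * np.cos(w * d) * np.exp(-0.5 * (d / l) ** 2)
    return K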

Audio subband data

[Figure: SMSE vs. training time (left) and test time (right) for the chain, local, FITC, VFE and SSGP approximations; markers are labelled by the number of pseudo-points or block sizes.]

Audio filter data

[Figure: SMSE vs. training time (left) and test time (right) for the chain, local and FITC approximations on the filtered signal.]

Terrain data

- Task: filling in missing data.
- Data: altitude over a 20 km x 30 km region, with 80 missing blocks of 1 km x 1 km, giving 200k/40k training/test points; 2D SE kernel.
- Evaluation: prediction error vs. training/test time.

[Figure: (a) the complete data and the tree over blocks, (b)-(c) inference error for the tree, local and FITC approximations.]

Terrain data

[Figure: SMSE vs. training time (left) and test time (right) for the tree, local, VFE, FITC and SSGP approximations on the terrain data.]

Summary

Tree-structured Gaussian process approximation:
- the pseudo-dataset has a tree/chain structure,
- the model is calibrated using a KL divergence,
- inference and learning are performed via Gaussian belief propagation,
- better time/accuracy trade-off compared to FITC and VFE,
- possible extensions: online learning, loopy BP.

References

- Hensman, James, Nicolò Fusi, and Neil D. Lawrence (2013). "Gaussian processes for big data". arXiv preprint arXiv:1309.6835.
- Lázaro-Gredilla, Miguel, et al. (2010). "Sparse spectrum Gaussian process regression". Journal of Machine Learning Research 11, pp. 1865-1881.
- Qi, Yuan (Alan), Ahmed H. Abdel-Gawad, and Thomas P. Minka (2010). "Sparse-posterior Gaussian processes for general likelihoods". In: UAI, ed. by Peter Grünwald and Peter Spirtes. AUAI Press, pp. 450-457.
- Seeger, Matthias (2003). "Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations". PhD thesis. School of Informatics, College of Science and Engineering, University of Edinburgh.
- Snelson, Edward and Zoubin Ghahramani (2006). "Sparse Gaussian processes using pseudo-inputs". In: Advances in Neural Information Processing Systems. MIT Press, pp. 1257-1264.
- Snelson, Edward and Zoubin Ghahramani (2007). "Local and global sparse Gaussian process approximations". In: International Conference on Artificial Intelligence and Statistics, pp. 524-531.
- Titsias, Michalis K. (2009). "Variational learning of inducing variables in sparse Gaussian processes". In: International Conference on Artificial Intelligence and Statistics, pp. 567-574.
- Tresp, Volker (2000). "A Bayesian committee machine". Neural Computation 12(11), pp. 2719-2741.
- Urtasun, Raquel and Trevor Darrell (2008). "Sparse probabilistic regression for activity-independent human pose inference". In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-8.

Thanks!

Bayesian committee machine (BCM) (Tresp 2000)

BCM combines predictions from M estimators, each of which uses a subset of the training points. Consider M partitions of the training set; by Bayes' rule,

    p(f*|y_{B_{1:m}}) = p(f*|y_{B_{1:m-1}}, y_{B_m})
                      ∝ p(f*) p(y_{B_m}|f*) p(y_{B_{1:m-1}}|y_{B_m}, f*)
                      ≈ p(f*) p(y_{B_m}|f*) p(y_{B_{1:m-1}}|f*)
                      ∝ p(f*|y_{B_m}) p(f*|y_{B_{1:m-1}}) / p(f*).        (1)

Applying (1) recursively gives

    p(f*|y_{B_{1:M}}) ∝ ∏_{i=1}^M p(f*|y_{B_i}) / p(f*)^{M-1}.

BCM for GP regression (Tresp 2000)

Let p(f*) = N(0, K_{**}) and p(f*|y_{B_i}) = N(μ̂_i, Σ̂_i). Then p(f*|y_{B_{1:M}}) = N(μ̂, Σ̂), where

    Σ̂^{-1} = -(M - 1) K_{**}^{-1} + Σ_{i=1}^M Σ̂_i^{-1},
    μ̂ = Σ̂ Σ_{i=1}^M Σ̂_i^{-1} μ̂_i.

Cost: O(ND²), where D is the partition size. More test points give a better approximation!
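A minimal sketch (assumed code, not from the talk) of this combination rule: given the per-partition predictive means and covariances at a common set of test points and the prior covariance there, it returns the BCM mean and covariance.

import numpy as np

def bcm_combine(mus, Sigmas, K_prior):
    # mus: list of (T,) means; Sigmas: list of (T, T) covariances; K_prior: (T, T) prior covariance
    M = len(mus)
    prec = -(M - 1) * np.linalg.inv(K_prior)     # -(M - 1) K_**^{-1}
    rhs = np.zeros(len(mus[0]))
    for mu_i, Sig_i in zip(mus, Sigmas):
        Si = np.linalg.inv(Sig_i)
        prec += Si                               # + sum_i Sigma_i^{-1}
        rhs += Si @ mu_i
    Sigma = np.linalg.inv(prec)
    mu = Sigma @ rhs                             # Sigma * sum_i Sigma_i^{-1} mu_i
    return mu, Sigma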