Variational Model Selection for Sparse Gaussian Process Regression


Variational Model Selection for Sparse Gaussian Process Regression
Michalis K. Titsias, School of Computer Science, University of Manchester
7 September 2008

Outline
- Gaussian process regression and sparse methods
- Variational inference based on inducing variables
- Auxiliary inducing variables
- The variational bound
- Comparison with the PP/DTC and SPGP/FITC marginal likelihoods
- Experiments on large datasets
- Inducing variables selected from the training data
- Variational reformulation of SD, FITC and PITC
- Related work / conclusions

Gaussian process regression: regression with Gaussian noise
Data: {(x_i, y_i), i = 1, ..., n}, where x_i is an input vector and y_i a scalar output.
Likelihood: y_i = f(x_i) + ε, ε ~ N(0, σ²), so p(y | f) = N(y | f, σ² I), with f_i = f(x_i).
GP prior on f: p(f) = N(f | 0, K_nn), where K_nn is the n × n covariance matrix on the training inputs, computed from a kernel that depends on θ.
Hyperparameters: (σ², θ).

Gaussian process regression: maximum likelihood II inference and learning
Prediction: assume the hyperparameters (σ², θ) are known and infer the latent values f_* at test inputs X_*:
p(f_* | y) = ∫ p(f_* | f) p(f | y) df,
where p(f_* | f) is the test conditional and p(f | y) the posterior over the training latent values.
Learning (σ², θ): maximize the marginal likelihood
p(y) = ∫ p(y | f) p(f) df = N(y | 0, σ² I + K_nn).
Time complexity is O(n³).
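To make the O(n³) cost concrete, here is a minimal NumPy sketch of the exact log marginal likelihood log N(y | 0, σ² I + K_nn) with a squared exponential kernel; the helper names (rbf_kernel, gp_log_marginal_likelihood) and the Cholesky-based implementation are illustrative, not code from the talk.

import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, signal_var=1.0):
    # Squared exponential kernel: signal_var * exp(-||x - x'||^2 / (2 lengthscale^2))
    sqdist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * sqdist / lengthscale ** 2)

def gp_log_marginal_likelihood(X, y, sigma2, lengthscale, signal_var):
    # log N(y | 0, sigma^2 I + K_nn); the Cholesky factorization is the O(n^3) bottleneck
    n = X.shape[0]
    Knn = rbf_kernel(X, X, lengthscale, signal_var)
    L = np.linalg.cholesky(Knn + sigma2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)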

Sparse GP regression
Time complexity is O(n³), which is intractable for large datasets: exact prediction and training are out of reach, since we can compute neither the predictive distribution p(f_* | y) nor the marginal likelihood p(y).
Approximate/sparse methods:
- Subset of data: keep only m training points; complexity is O(m³)
- Inducing/active/support variables: complexity O(nm²)
- Other methods: iterative methods for linear systems

Sparse GP regression using inducing variables
Inducing variables can be:
- a subset of the training points (Csato and Opper, 2002; Seeger et al., 2003; Smola and Bartlett, 2001)
- test points (BCM; Tresp, 2000)
- auxiliary variables (Snelson and Ghahramani, 2006; Quiñonero-Candela and Rasmussen, 2005)
Training the sparse GP regression system requires selecting the inducing inputs and the hyperparameters (σ², θ). Which objective function is going to do all that? The approximate marginal likelihood. But which approximate marginal likelihood?

Sparse GP regression using inducing variables
The approximate marginal likelihoods currently used are derived by changing/approximating the likelihood p(y | f) or by changing/approximating the prior p(f) (Quiñonero-Candela and Rasmussen, 2005). They all have the form
F_P = log N(y | 0, K̃),
where K̃ is some approximation to the true covariance σ² I + K_nn.
Overfitting can often occur: the approximate marginal likelihood is not a lower bound, and joint learning of the inducing points and hyperparameters easily leads to overfitting.

Sparse GP regression using inducing variables: what we wish to do here
- Do model selection in a different way: never think about approximating the likelihood p(y | f) or the prior p(f).
- Apply standard variational inference: just introduce a variational distribution to approximate the true posterior. That gives us a lower bound.
- Propose the bound for model selection, jointly handling inducing inputs and hyperparameters.

Auxiliary inducing variables (Snelson and Ghahramani, 2006)
Auxiliary inducing variables: m latent function values f_m associated with arbitrary inputs X_m.
Model augmentation: we augment the GP prior,
prior: p(f, f_m) = p(f | f_m) p(f_m),
joint: p(y | f) p(f | f_m) p(f_m),
marginal likelihood: p(y) = ∫ p(y | f) p(f | f_m) p(f_m) df df_m.
The model is unchanged: the predictive distribution and the marginal likelihood are the same. The parameters X_m play no active role (at the moment)... and there is no fear of overfitting when we specify X_m.

Auxiliary inducing variables
What we wish: to use the auxiliary variables (f_m, X_m) to facilitate inference about the training function values f.
Before we get there, let's specify the ideal inducing variables.
Definition: we call (f_m, X_m) optimal when y and f are conditionally independent given f_m:
p(f | f_m, y) = p(f | f_m).
At optimality, the augmented true posterior p(f, f_m | y) factorizes as
p(f, f_m | y) = p(f | f_m) p(f_m | y).

Auxiliary inducing variables
What we wish: to use the auxiliary variables (f_m, X_m) to facilitate inference about the training function values f.
Question: how can we discover optimal inducing variables?
Answer: minimize a distance between the true p(f, f_m | y) and an approximate q(f, f_m) with respect to X_m and (optionally) the number m.
The key: q(f, f_m) must satisfy the factorization that holds for optimal inducing variables:
true: p(f, f_m | y) = p(f | f_m, y) p(f_m | y)
approximate: q(f, f_m) = p(f | f_m) φ(f_m)

Variational learning of inducing variables
Variational distribution: q(f, f_m) = p(f | f_m) φ(f_m), where φ(f_m) is an unconstrained variational distribution over f_m.
Standard variational inference: we minimize the divergence KL(q(f, f_m) || p(f, f_m | y)). Equivalently, we maximize a bound on the true log marginal likelihood:
F_V(X_m, φ(f_m)) = ∫ q(f, f_m) log [ p(y | f) p(f | f_m) p(f_m) / q(f, f_m) ] df df_m.
Let's compute this.
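For completeness, the identity behind this equivalence (standard variational inference, not spelled out on the slide):

log p(y) = F_V(X_m, φ(f_m)) + KL( q(f, f_m) || p(f, f_m | y) ),

so, since the KL term is non-negative and log p(y) does not depend on X_m or φ(f_m), maximizing the bound is exactly equivalent to minimizing the divergence.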

Computation of the variational bound
F_V(X_m, φ(f_m))
= ∫ p(f | f_m) φ(f_m) log [ p(y | f) p(f | f_m) p(f_m) / ( p(f | f_m) φ(f_m) ) ] df df_m
= ∫ p(f | f_m) φ(f_m) log [ p(y | f) p(f_m) / φ(f_m) ] df df_m
= ∫ φ(f_m) { ∫ p(f | f_m) log p(y | f) df + log [ p(f_m) / φ(f_m) ] } df_m
= ∫ φ(f_m) { log G(f_m, y) + log [ p(f_m) / φ(f_m) ] } df_m,
where
log G(f_m, y) = log N(y | E[f | f_m], σ² I) − (1/(2σ²)) Tr[Cov(f | f_m)],
E[f | f_m] = K_nm K_mm^{-1} f_m,   Cov(f | f_m) = K_nn − K_nm K_mm^{-1} K_mn.

Computation of the variational bound
Merge the logs:
F_V(X_m, φ(f_m)) = ∫ φ(f_m) log [ G(f_m, y) p(f_m) / φ(f_m) ] df_m.
Reverse Jensen's inequality to maximize with respect to φ(f_m):
F_V(X_m) = log ∫ G(f_m, y) p(f_m) df_m
= log ∫ N(y | α_m, σ² I) p(f_m) df_m − (1/(2σ²)) Tr[Cov(f | f_m)]
= log N(y | 0, σ² I + K_nm K_mm^{-1} K_mn) − (1/(2σ²)) Tr[Cov(f | f_m)],
where α_m = E[f | f_m] and Cov(f | f_m) = K_nn − K_nm K_mm^{-1} K_mn.
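As a concrete illustration (not part of the original slides), the final expression can be evaluated with a few lines of dense linear algebra. The sketch below forms the n × n matrix σ² I + K_nm K_mm^{-1} K_mn explicitly for clarity; a practical implementation would use the matrix inversion lemma to reach O(nm²) cost. The function name and signature are illustrative.

import numpy as np

def variational_bound(y, Knn_diag, Knm, Kmm, sigma2, jitter=1e-6):
    # F_V = log N(y | 0, sigma^2 I + Q_nn) - (1/(2 sigma^2)) Tr[K_nn - Q_nn],
    # where Q_nn = K_nm K_mm^{-1} K_mn.
    n, m = Knm.shape
    Lm = np.linalg.cholesky(Kmm + jitter * np.eye(m))
    A = np.linalg.solve(Lm, Knm.T)            # A^T A = Q_nn
    Qnn_diag = (A ** 2).sum(0)                # diag of Q_nn
    C = sigma2 * np.eye(n) + A.T @ A          # sigma^2 I + Q_nn (dense, for clarity)
    Lc = np.linalg.cholesky(C)
    alpha = np.linalg.solve(Lc.T, np.linalg.solve(Lc, y))
    log_gauss = -0.5 * (y @ alpha + 2 * np.log(np.diag(Lc)).sum() + n * np.log(2 * np.pi))
    return log_gauss - 0.5 * (Knn_diag - Qnn_diag).sum() / sigma2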

Variational bound versus PP log likelihood
The traditional projected process (PP or DTC) log likelihood is
F_P = log N(y | 0, σ² I + K_nm K_mm^{-1} K_mn).
What we obtained is
F_V = log N(y | 0, σ² I + K_nm K_mm^{-1} K_mn) − (1/(2σ²)) Tr[K_nn − K_nm K_mm^{-1} K_mn].
We got an extra trace term: the total variance of p(f | f_m).

Optimal φ*(f_m) and predictive distribution
The optimal φ*(f_m) that corresponds to the above bound gives rise to the PP predictive distribution (Csato and Opper, 2002; Seeger, Williams and Lawrence, 2003). The approximate predictive distribution is identical to PP.
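For reference, maximizing the bound with respect to φ(f_m) gives the optimal distribution in closed form (a standard result, written here in the notation above; it is not spelled out on the slide):

φ*(f_m) = N( f_m | σ^{-2} K_mm Σ^{-1} K_mn y, K_mm Σ^{-1} K_mm ),   with Σ = K_mm + σ^{-2} K_mn K_nm.

Substituting φ*(f_m) into q(f_* | y) = ∫ p(f_* | f_m) φ*(f_m) df_m recovers the PP/DTC predictive mean and variance.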

Variational bound for model selection
Learning the inducing inputs X_m and (σ², θ) using continuous optimization: maximize the bound with respect to (X_m, σ², θ),
F_V = log N(y | 0, σ² I + K_nm K_mm^{-1} K_mn) − (1/(2σ²)) Tr[K_nn − K_nm K_mm^{-1} K_mn].
The first term encourages fitting the data y. The second (trace) term says to minimize the total variance of p(f | f_m). The trace Tr[K_nn − K_nm K_mm^{-1} K_mn] can stand on its own as an objective function for sparse GP learning.
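A minimal sketch of how the bound could be fed to a generic optimizer for joint learning of (X_m, σ², θ), reusing the illustrative rbf_kernel and variational_bound helpers sketched above; finite-difference gradients keep the example short, whereas analytic or automatic gradients would be used in practice.

import numpy as np
from scipy.optimize import minimize

def neg_bound(params, X, y, m):
    # Unpack inducing inputs and log-hyperparameters from a flat vector.
    d = X.shape[1]
    Xm = params[:m * d].reshape(m, d)
    sigma2, ls, sf2 = np.exp(params[m * d:])
    Kmm = rbf_kernel(Xm, Xm, ls, sf2)
    Knm = rbf_kernel(X, Xm, ls, sf2)
    Knn_diag = np.full(X.shape[0], sf2)       # diag of K_nn for the RBF kernel
    return -variational_bound(y, Knn_diag, Knm, Kmm, sigma2)

# Usage: initialize X_m as a subset of the training inputs, e.g.
# init = np.concatenate([X[:m].ravel(), np.log([0.1, 1.0, 1.0])])
# res = minimize(neg_bound, init, args=(X, y, m), method="L-BFGS-B")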

Variational bound for model selection
When the bound becomes equal to the true log marginal likelihood, i.e. F_V = log p(y), then:
Tr[K_nn − K_nm K_mm^{-1} K_mn] = 0, i.e. K_nn = K_nm K_mm^{-1} K_mn,
p(f | f_m) becomes a delta function, and we reproduce the full/exact GP prediction.

Illustrative comparison on Ed Snelson's toy data
Figure: the one-dimensional toy dataset.
We compare the traditional PP/DTC log likelihood
F_P = log N(y | 0, σ² I + K_nm K_mm^{-1} K_mn)
and the bound
F_V = log N(y | 0, σ² I + K_nm K_mm^{-1} K_mn) − (1/(2σ²)) Tr[K_nn − K_nm K_mm^{-1} K_mn].
We will jointly maximize over (X_m, σ², θ).

Illustrative comparison
200 training points; the red line is the full GP, the blue line the sparse GP. We used 8, 10 and 15 inducing points.
Figure: predictive distributions for the rows labelled PP and VAR, with 8, 10 and 15 inducing points.

Illustrative comparison
Exponential kernel: σ_f² exp( −(x_m − x_n)² / (2ℓ²) ).
Table: model parameters found by variational training

            8          10         15         full GP
ℓ²          0.5050     0.4327     0.3573     0.3561
σ_f²        0.5736     0.6820     0.6854     0.6833
σ²          0.0859     0.0817     0.0796     0.0796
MargL     -63.5282   -57.6909   -55.5708   -55.5647

There is a pattern here (observed in many datasets): the noise σ² decreases with the number of inducing points, until the full GP is matched. This is desirable: the method prefers to explain some signal as noise when the number of inducing variables is not enough.

Illustrative comparison: a more challenging problem
From the original 200 training points keep only 20 (using the MATLAB command X = X(1:10:end)).
Figure: the full 200-point dataset and the 20-point subset.

Illustrative comparison
Figure: predictive distributions on the 20-point dataset for VAR and PP with 8, 10 and 15 inducing points.

Illustrative comparison
Exponential kernel: σ_f² exp( −(x_m − x_n)² / (2ℓ²) ).
Table: model parameters found by variational training

            8          10         15         full GP
ℓ²          0.2621     0.2808     0.1804     0.1798
σ_f²        0.3721     0.5334     0.5209     0.5209
σ²          0.1163     0.0846     0.0647     0.0646
MargL     -16.0995   -14.8373   -14.3473   -14.3461

Table: model parameters found by the PP marginal likelihood

            8          10         15         full GP
ℓ²          0.0766     0.0632     0.0593     0.1798
σ_f²        1.0846     1.1353     1.1939     0.5209
σ²          0.0536     0.0589     0.0531     0.0646
MargL      -8.7969    -8.3492    -8.0989   -14.3461

Variational bound compared to PP likelihood
- The variational method converges to the full GP model in a systematic way as we increase the number of inducing variables.
- It tends to find smoother predictive distributions than the full GP (the decreasing σ² pattern) when the number of inducing variables is not enough.
- The PP marginal likelihood will not converge to the full GP as we increase the number of inducing inputs and maximize over them; PP tends to interpolate the training examples.

SPGP/FITC marginal likelihood (Snelson and Ghahramani, 2006)
SPGP uses the following marginal likelihood:
N(y | 0, σ² I + diag[K_nn − K_nm K_mm^{-1} K_mn] + K_nm K_mm^{-1} K_mn).
The covariance used is closer to the true covariance σ² I + K_nn than the one used by PP. SPGP uses a non-stationary covariance matrix that can model input-dependent noise. SPGP is significantly better for model selection than the PP marginal likelihood (Snelson and Ghahramani, 2006; Snelson, 2007).
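For comparison with the earlier sketch of F_V, a minimal sketch of the SPGP/FITC log marginal likelihood, again forming the n × n covariance densely for clarity (the function name is illustrative).

import numpy as np

def fitc_log_marginal_likelihood(y, Knn_diag, Knm, Kmm, sigma2, jitter=1e-6):
    # log N(y | 0, Lambda + Q_nn), with Lambda = sigma^2 I + diag[K_nn - Q_nn]
    n, m = Knm.shape
    Lm = np.linalg.cholesky(Kmm + jitter * np.eye(m))
    A = np.linalg.solve(Lm, Knm.T)
    Qnn = A.T @ A                              # K_nm K_mm^{-1} K_mn
    Lam = sigma2 + (Knn_diag - np.diag(Qnn))   # diagonal of Lambda
    Lc = np.linalg.cholesky(np.diag(Lam) + Qnn)
    alpha = np.linalg.solve(Lc.T, np.linalg.solve(Lc, y))
    return -0.5 * (y @ alpha + 2 * np.log(np.diag(Lc)).sum() + n * np.log(2 * np.pi))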

SPGP/FITC marginal likelihood on the toy data
Figure: SPGP/FITC predictive distributions with 8, 10 and 15 inducing points; the first row is for 200 training points and the second row for 20 training points.

SPGP/FITC on the toy data
Model parameters found by the SPGP/FITC marginal likelihood.
Table: 200 training points

            8          10         15         full GP
ℓ²          0.2531     0.3260     0.3096     0.3561
σ_f²        0.3377     0.7414     0.6761     0.6833
σ²          0.0586     0.0552     0.0674     0.0796
MargL     -56.4397   -50.3789   -52.7890   -55.5647

Table: 20 training points

            8          10         15         full GP
ℓ²          0.2622     0.2664     0.1657     0.1798
σ_f²        0.5976     0.6489     0.5419     0.5209
σ²          0.0046     0.0065     0.0008     0.0646
MargL     -11.8439   -11.8636   -11.4308   -14.3461

SPGP/FITC marginal likelihood
- It can be much more robust to overfitting than PP, although joint learning of inducing points and hyperparameters can still cause overfitting.
- It is able to model input-dependent noise. That is a great advantage for performance measures that involve the predictive variance (like the average negative log probability density).
- It will not converge to the full GP as we increase the number of inducing points and optimize over them.

Boston-housing dataset
13 inputs, 455 training points, 51 test points. Optimizing only over the inducing points X_m; (σ², θ) are fixed to the values obtained from the full GP.
Figure: KL(p || q) and KL(q || p) between the full GP predictive distribution (a 51-dimensional Gaussian) and the sparse ones, and the log marginal likelihood, as functions of the number of inducing variables, for VAR, PP and SPGP.
Only the variational method drops the KLs to zero.

Boston-housing dataset
Joint learning of inducing inputs and hyperparameters.
Figure: standardised mean squared error (SMSE), standardised negative log probability density (SNLP) and the log marginal likelihood with respect to the number of inducing points, for the full GP, VAR, PP and SPGP.
With 250 inducing points the variational method is very close to the full GP.

Large datasets
Two large datasets:
- kin40k dataset: 10000 training points, 30000 test points, 8 attributes (http://ida.first.fraunhofer.de/~anton/data.html)
- sarcos dataset: 44,484 training points, 4,449 test points, 21 attributes (http://www.gaussianprocess.org/gpml/data/)
The inputs were normalized to have zero mean and unit variance on the training set, and the outputs were centred to have zero mean on the training set.

kin40k
Joint learning of inducing points and hyperparameters. The subset of data (SD) baseline uses 2000 training points.
Figure: standardised mean squared error (SMSE) and standardised negative log probability density (SNLP) with respect to the number of inducing points, for SD, VAR, PP and SPGP.

sarcos
Joint learning of inducing points and hyperparameters. The subset of data (SD) baseline uses 2000 training points.
Figure: standardised mean squared error (SMSE) and standardised negative log probability density (SNLP) with respect to the number of inducing points, for SD, VAR, PP and SPGP.

Variational bound for greedy model selection
Inducing inputs X_m selected from the training set. Let m ⊂ {1, ..., n} be the indices of the subset of data used as inducing/active variables, and let n−m denote the remaining training points.
Optimal active latent values f_m satisfy
p(f | y) = p(f_{n−m} | f_m, y_{n−m}) p(f_m | y) = p(f_{n−m} | f_m) p(f_m | y).
Variational distribution: q(f) = p(f_{n−m} | f_m) φ(f_m).
Variational bound:
F_V = log N(y | 0, σ² I + K_nm K_mm^{-1} K_mn) − (1/(2σ²)) Tr[Cov(f_{n−m} | f_m)].

Variational bound for greedy model selection
Greedy selection with hyperparameter adaptation (Seeger et al., 2003):
1. Initialization: m = ∅, n−m = {1, ..., n}.
2. Point insertion and adaptation:
   E-like step: add a point j ∈ J ⊂ n−m into m so that a criterion evaluated at j is maximised.
   M-like step: update (σ², θ) by maximizing the approximate marginal likelihood.
3. Go to step 2 or stop.
For the PP marginal likelihood this is problematic: convergence is not smooth, because the algorithm is not an EM algorithm. The variational bound solves this problem: the above procedure becomes precisely a variational EM algorithm.

Variational bound for greedy model selection
The variational EM property comes out of Proposition 1.
Proposition 1: Let (m, X_m, f_m) be the current set of active points. Any training point i ∈ n−m added into the active set can never decrease the lower bound.
In other words, any point inserted cannot increase the divergence KL(q(f) || p(f | y)).
E-step (point insertion): corresponds to an update of the variational distribution q(f) = p(f_{n−m} | f_m) φ(f_m).
M-step: updates the parameters by maximizing the bound.
Monotonic increase of the variational bound is guaranteed for any possible insertion criterion.
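A minimal sketch of the resulting greedy loop, reusing the illustrative rbf_kernel and variational_bound helpers from the earlier sketches. The insertion criterion used here is simply the value of F_V after inserting each candidate from a random working set (the slides leave the criterion open, so this is just one possible choice), and the M-like hyperparameter update is only indicated in a comment.

import numpy as np

def greedy_select(X, y, num_inducing, sigma2, ls, sf2, working_set_size=50, seed=0):
    rng = np.random.default_rng(seed)
    sf2_diag = np.full(X.shape[0], sf2)       # diag of K_nn for the RBF kernel

    def bound_for(active):
        Xm = X[active]
        return variational_bound(y, sf2_diag, rbf_kernel(X, Xm, ls, sf2),
                                 rbf_kernel(Xm, Xm, ls, sf2), sigma2)

    active, remaining = [], list(range(X.shape[0]))
    while len(active) < num_inducing:
        # E-like step: insert the candidate from a random working set that gives
        # the largest bound; by Proposition 1 no insertion can decrease F_V.
        J = rng.choice(remaining, size=min(working_set_size, len(remaining)),
                       replace=False)
        scores = [bound_for(active + [int(j)]) for j in J]
        best = int(J[int(np.argmax(scores))])
        active.append(best)
        remaining.remove(best)
        # M-like step (omitted): update (sigma^2, theta) by maximizing the bound
        # with the current active set fixed, e.g. with the optimizer sketch above.
    return active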

Variational formulation for sparse GP regression
- Define a full GP regression model.
- Define a variational distribution of the form q(f, f_m) = p(f | f_m) φ(f_m).
- Get the approximate predictive distribution:
  true: p(f_* | y) = ∫ p(f_* | f, f_m) p(f, f_m | y) df df_m
  approx.: q(f_* | y) = ∫ p(f_* | f_m) p(f | f_m) φ(f_m) df df_m = ∫ p(f_* | f_m) φ(f_m) df_m
- Compute the bound and use it for model selection.
Regarding the predictive distribution, what differentiates SD, PP/DTC, FITC and PITC is the φ(f_m) distribution.

Variational bound for FITC (similarly for PITC)
The full GP model that variationally reformulates FITC models input-dependent noise:
p(y | f) = N(y | f, σ² I + diag[K_nn − K_nm K_mm^{-1} K_mn]).
FITC log marginal likelihood:
F_SPGP(X_m) = log N(y | 0, Λ + K_nm K_mm^{-1} K_mn), where Λ = σ² I + diag[K_nn − K_nm K_mm^{-1} K_mn].
The corresponding variational bound:
F_V(X_m) = log N(y | 0, Λ + K_nm K_mm^{-1} K_mn) − (1/2) Tr[Λ^{-1} K̃], where K̃ = K_nn − K_nm K_mm^{-1} K_mn.
Again a trace term is added.
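A minimal sketch of this FITC-style bound, following the same dense-linear-algebra pattern as the earlier sketches (the function name is illustrative). Since Λ is diagonal, only the diagonal of K̃ contributes to the trace.

import numpy as np

def fitc_variational_bound(y, Knn_diag, Knm, Kmm, sigma2, jitter=1e-6):
    # log N(y | 0, Lambda + Q_nn) - (1/2) Tr[Lambda^{-1} (K_nn - Q_nn)]
    n, m = Knm.shape
    Lm = np.linalg.cholesky(Kmm + jitter * np.eye(m))
    A = np.linalg.solve(Lm, Knm.T)
    Qnn = A.T @ A
    Ktilde_diag = Knn_diag - np.diag(Qnn)     # diag of K_nn - Q_nn
    Lam = sigma2 + Ktilde_diag                # diagonal of Lambda
    Lc = np.linalg.cholesky(np.diag(Lam) + Qnn)
    alpha = np.linalg.solve(Lc.T, np.linalg.solve(Lc, y))
    log_marg = -0.5 * (y @ alpha + 2 * np.log(np.diag(Lc)).sum() + n * np.log(2 * np.pi))
    return log_marg - 0.5 * (Ktilde_diag / Lam).sum()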

Related work / conclusions
Related work:
- There is an unpublished draft by Lehel Csato and Manfred Opper on variational learning of hyperparameters in sparse GPs.
- Seeger (2003) also uses variational methods for sparse GP classification problems.
Conclusions:
- The variational method provides us with lower bounds.
- This can be very useful for joint learning of inducing inputs and hyperparameters.
- Future extensions: classification, differential equations.

Acknowledgements Thanks for feedback to: Neil Lawrence, Magnus Rattray, Chris Williams, Joaquin Quiñonero-Candela, Ed Snelson, Manfred Opper, Mauricio Alvarez and Kevin Sharp