Explicit Rates of Convergence for Sparse Variational Inference in Gaussian Process Regression


JMLR: Workshop and Conference Proceedings 1–9, 2018. Symposium on Advances in Approximate Bayesian Inference

Explicit Rates of Convergence for Sparse Variational Inference in Gaussian Process Regression

David Burt drb6@cam.ac.uk and Carl Edward Rasmussen cer54@cam.ac.uk
University of Cambridge, Cambridge, UK

Mark van der Wilk mark@prowler.io
PROWLER.io, Cambridge, UK

Abstract

Sparse variational inference in Gaussian process regression has theoretical guarantees that make it robust to over-fitting. Additionally, it is well known that in the non-sparse regime, with M = N, variational inducing point methods can be equivalent to full inference. In this paper, we derive bounds on the KL-divergence from the sparse variational approximation to the full posterior, and show convergence for M on the order of log N inducing features when using the squared exponential kernel.

1. Introduction

Gaussian processes can model complex functions, are robust to over-fitting and provide uncertainty estimates. In the case of regression, the marginal likelihood and predictive posterior can be calculated in closed form. However, this becomes computationally intractable for large data sets. The variational framework of Titsias (2009) approximates the model posterior with a simpler Gaussian process. This approximation retains many of the desirable properties of non-parametric models, in contrast to the parametric model approximations of Quiñonero Candela and Rasmussen (2005). A precise treatment of the KL-divergence between stochastic processes is given in Matthews et al. (2016).

While the variational approximation provides an objective function (ELBO) for selecting the hyperparameters, it can introduce bias into the selection process (Turner and Sahani, 2011). However, if the gap between the ELBO and the log marginal likelihood at the optimum choice of hyperparameters is small, the optimization will result in a choice of hyperparameters similarly good to those learned using the marginal likelihood.

In this work, we introduce eigenfunction inducing features, an inter-domain feature in the style of Lázaro-Gredilla and Figueiras-Vidal (2009), based on the eigendecomposition of the kernel. We obtain bounds on the KL-divergence for sparse inference with these features and give theoretical insight into the number of features needed to approximate the posterior process. These features also result in a diagonal covariance matrix K_{u,u}, which can lead to computational gains (Burt et al., 2018).

2. Bounds on the Marginal Likelihood

The log marginal likelihood for regression with a mean-centered Gaussian process prior with covariance function k(·, ·), used to regress D = {(x_i, y_i)}_{i=1}^N, is given by (Rasmussen and Williams, 2006) as

L = log p(y),   (1)

where y is distributed according to N(y; 0, K_{f,f} + σ²_noise I), σ²_noise is the likelihood variance and K_{f,f} is the data covariance matrix, with [K_{f,f}]_{i,j} = k(x_i, x_j). Titsias (2009) provided an ELBO for (1) based on a set of M inducing features, {u_m}_{m=0}^{M−1}:

L ≥ L_lower = log N(y; 0, Q_{f,f} + σ²_noise I) − t/(2σ²_noise),   (2)

where t = tr(K_{f,f} − Q_{f,f}), Q_{f,f} = K_{u,f}^T K_{u,u}^{-1} K_{u,f}, [K_{u,f}]_{m,i} = cov(u_m, f(x_i)) and [K_{u,u}]_{m,n} = cov(u_m, u_n). This ELBO is commonly used as a computationally tractable objective function for learning hyperparameters. Moreover, Matthews et al. (2016) showed that the gap between the log marginal likelihood and the evidence lower bound is the KL-divergence between the posterior Gaussian process and the process used in approximate inference. Instead of bounding L − L_lower directly, we bound the gap between the ELBO and an upper bound on the marginal likelihood (Titsias, 2014):

L ≤ L_upper = log N(y; 0, Q_{f,f} + tI + σ²_noise I) + (1/2) log|Q_{f,f} + tI + σ²_noise I| − (1/2) log|Q_{f,f} + σ²_noise I|.   (3)

If t = 0, then L_lower = L = L_upper, so in order to obtain a bound on the KL-divergence it suffices to bound t. Explicitly:

Lemma 1  With the same notation as in (3),

KL(Q ‖ P̂) ≤ t/(2σ²_noise) + ‖y‖² t / (2σ²_noise(σ²_noise + t)),   (4)

where P̂ is the full posterior process and Q is the variational approximation. Note that under mild conditions, ‖y‖² = O(N). The proof is provided in Appendix A.

Explicitly writing out the diagonal entries of K_{f,f} − Q_{f,f} in order to bound t is difficult, due to the inverse matrix K_{u,u}^{-1} appearing in the definition of Q_{f,f}.

3. Eigenvalues and Optimal Rate of Convergence

Before proving upper bounds on t, we consider a lower bound. As observed in Titsias (2014), t is the error in trace norm of a low-rank approximation to the covariance matrix, so

t ≥ Σ_{m=M+1}^{N} γ_m,   (5)

where γ_m denotes the m-th largest eigenvalue of K_{f,f}.
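To make these quantities concrete, the following is a minimal numerical sketch (not taken from the paper) that evaluates L, L_lower, L_upper, t and the bound (4) on a toy regression problem, using inducing points chosen as a subset of the data. All specific values (N, M, v_k, ℓ, σ²_noise, the data-generating function) are illustrative assumptions.

```python
# Minimal sketch (not from the paper): numerically check (2)-(4) on a toy
# problem. Inducing points are a subset of the data, so Q_ff is a Nystrom
# approximation; all constants below are illustrative assumptions.
import numpy as np

def k_se(a, b, v_k=1.0, ell=1.0):
    """Squared exponential kernel matrix between 1-D input vectors a and b."""
    return v_k * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def log_gauss(y, cov):
    """log N(y; 0, cov) computed via slogdet and a linear solve."""
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(y) * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(cov, y))

rng = np.random.default_rng(0)
N, M, sigma2 = 200, 15, 0.1
x = rng.normal(0.0, 2.0, size=N)
y = np.sin(x) + rng.normal(0.0, np.sqrt(sigma2), size=N)

z = x[rng.choice(N, M, replace=False)]                 # inducing inputs
K_ff = k_se(x, x)
K_uf = k_se(z, x)
K_uu = k_se(z, z) + 1e-8 * np.eye(M)                   # jitter for stability
Q_ff = K_uf.T @ np.linalg.solve(K_uu, K_uf)
t = np.trace(K_ff - Q_ff)
I = np.eye(N)

L = log_gauss(y, K_ff + sigma2 * I)                                        # eq. (1)
L_lower = log_gauss(y, Q_ff + sigma2 * I) - t / (2 * sigma2)               # eq. (2)
_, ld_t = np.linalg.slogdet(Q_ff + (t + sigma2) * I)
_, ld_0 = np.linalg.slogdet(Q_ff + sigma2 * I)
L_upper = log_gauss(y, Q_ff + (t + sigma2) * I) + 0.5 * (ld_t - ld_0)      # eq. (3)

kl_gap = L - L_lower                          # gap = KL(Q || posterior) at the optimal q(u)
kl_bound = t / (2 * sigma2) + (y @ y) * t / (2 * sigma2 * (sigma2 + t))    # eq. (4)
assert L_lower <= L <= L_upper and kl_gap <= kl_bound
```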

Generally, computing the eigenvalues and eigenvectors of a large matrix is costly. However, as observed in Williams and Seeger (2001), as N → ∞, if the training points are i.i.d. random variables drawn according to a probability measure ρ, the eigenvalues of K_{f,f} (scaled by 1/N) converge to those of the operator

(K_ρ f)(x') = ∫_X f(x) k(x, x') dρ(x).   (6)

3.1. Connection to the Nyström Approximation

In the case of inducing points subsampled from the data, the matrix Q_{f,f} is a Nyström approximation to K_{f,f} (Williams and Seeger, 2001). Therefore, any bound on the rate of convergence of the Nyström approximation, for example that of Gittens and Mahoney (2013), can be used in conjunction with Lemma 1 to yield a bound on the KL-divergence. While the result of Gittens and Mahoney (2013) yields similar asymptotic (in M) behavior to the bounds we prove, due to large constants in their proof, direct application of their result to this problem appears to require tens of thousands of points. Additionally, a proper analysis when M is allowed to grow as a function of N involves bounding rates of convergence of the kernel matrix spectrum to the operator spectrum. Instead, we introduce eigenfunction features, defined with respect to the operator in (6), which yield elegant analytic upper bounds on t. These bounds asymptotically (in N) match the lower bound (5). Eigenfunction features are defined so as to be orthogonal, greatly simplifying the computation of t because K_{u,u} = I.

3.2. Eigenfunction Inducing Features

Mercer's theorem tells us that

k(x, x') = Σ_{m=0}^{∞} λ_m φ_m(x) φ_m(x'),   (7)

where {(λ_m, φ_m)}_{m=0}^{∞} are the eigenvalue–eigenfunction pairs of the operator in (6), defined with respect to a probability measure µ that can be treated as a variational parameter, and which we assume are sorted so that λ_m > λ_{m+1} > 0 for all m. We define eigenfunction inducing features by

u_m = (1/√λ_m) ∫_X f(x) φ_m(x) dµ(x).

Using the orthogonality of the eigenfunctions and the eigenfunction property, it can be shown that

cov(u_m, f(x)) = √λ_m φ_m(x)  and  cov(u_m, u_{m'}) = δ_{m,m'}.   (8)

3.3. An Example: Squared Exponential Kernel

The squared exponential kernel, defined for X = R by k_se(x, x') = v_k exp(−(x − x')²/(2ℓ²)), is among the most popular choices of kernel. Suppose that the measure µ used in defining the operator has normal density N(0, s²). In this case the eigenvalues are given by (Rasmussen and Williams, 2006):

λ_m = v_k √(2a/A) B^m,   (9)

with a = 1/(4s²), b = 1/(2ℓ²), c = √(a² + 2ab), A = a + b + c and B = b/A. From the above expression, when B is near one we should expect to need more inducing features in order to bound the KL-divergence. This occurs when the observed data spans many kernel lengthscales. For any lengthscale, the eigenvalues decay exponentially. This asymptotic decay is closely connected to the smoothness of sample functions from the prior; see Rasmussen and Williams (2006, Chapters 4, 7). It has previously been shown that truncating the kernel to the first M eigenbasis functions, with M chosen such that λ_M ≤ σ²_noise/N, leads to almost no loss in model performance (Ferrari-Trecate et al., 1999). We prove the following:

Theorem 2  Suppose the x_i are drawn i.i.d. from N(0, s₁²), and sparse inference is performed with eigenfunction inducing features defined with respect to q(x) = N(0, s₂²) with 2s₁² < s₂². Then for any ε > 0 and δ ∈ (0, 1) there exists an N_{s₁,s₂} such that, for all N > N_{s₁,s₂} training points, inference with a set of M = c_{s₁,s₂} log N features results in Pr(KL(Q ‖ P̂) > ε) < δ.

While the full proof is in Appendix B, we suggest why this should be true in the following section.

4. Bounds on the Trace using Eigenfunction Features

Using the eigenfunction features, deriving a bound on t is straightforward. From (8),

[K_{u,f}^T K_{u,f}]_{i,j} = Σ_{m=0}^{M−1} λ_m φ_m(x_i) φ_m(x_j).

Combining this with Mercer's theorem yields the following lemma:

Lemma 3  Suppose the training data consist of N i.i.d. random variables drawn according to a probability density p(x), and we use M eigenfunction inducing features defined with respect to a density q(x) such that k = max_x p(x)/q(x) < ∞. Then

lim_{N→∞} (1/N) tr(K_{f,f} − Q_{f,f}) = Σ_{m=M}^{∞} λ_m E_p[φ_m(x_i)²] ≤ k Σ_{m=M}^{∞} λ_m.   (10)

For the squared exponential kernel, (10) is a geometric series and has a closed form (see Appendix B). An example of the bound on the expected value of the trace and the resulting bound on the KL-divergence is shown in Figure 1. Under the assumed input distribution in Lemma 3, as N tends to infinity for fixed M, the empirical eigenvalues in (5) approach the λ_m, so the expected value of (1/N) t is asymptotically tight. In order to rigorously prove Theorem 2 we need to understand the trace itself, not just the expected value of its entries. In order to achieve this, as well as to obtain bounds that are effective as M grows as a function of N, we show that the squares of the eigenfunctions have bounded second moment under the input distribution. The details are given in Appendix B.
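As an illustration, the sketch below evaluates the closed form of the geometric tail in (10) for the squared exponential kernel, treating N · k · Σ_{m≥M} λ_m = N · k · λ_0 · B^M/(1 − B) as an asymptotic proxy bound on the expected trace, and searches for the smallest M that drives it below a fixed tolerance. The kernel and input-scale values (v_k, ℓ, s₁, s₂) and the tolerance are illustrative assumptions, not values taken from the paper.

```python
# Sketch of the closed-form tail bound from Lemma 3 / eq. (9)-(10) for the SE
# kernel; v_k, ell, s1, s2 and the tolerance are illustrative assumptions.
import numpy as np

def se_operator_constants(ell, s2):
    """Constants a, b, c, A, B of Section 3.3, for the measure N(0, s2^2)."""
    a = 1.0 / (4.0 * s2 ** 2)
    b = 1.0 / (2.0 * ell ** 2)
    c = np.sqrt(a ** 2 + 2.0 * a * b)
    A = a + b + c
    return a, A, b / A

def expected_trace_bound(M, N, v_k, ell, s1, s2):
    """N * k * sum_{m >= M} lambda_m = N * k * lambda_0 * B**M / (1 - B)."""
    a, A, B = se_operator_constants(ell, s2)
    lam0 = v_k * np.sqrt(2.0 * a / A)
    k = s2 / s1            # k = max_x p(x)/q(x) for p = N(0, s1^2), q = N(0, s2^2)
    return N * k * lam0 * B ** M / (1.0 - B)

v_k, ell, s1, s2 = 1.0, 1.0, 1.0, 2.0
for N in [10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5]:
    # smallest M pushing the bound below a fixed tolerance
    M = next(m for m in range(1000)
             if expected_trace_bound(m, N, v_k, ell, s1, s2) < 1e-3)
    print(f"N = {N:>6d}  ->  M = {M}")
```

Because the bound decays geometrically in M while growing only linearly in N, the required M increases by a constant amount per decade of N, which is the log N scaling claimed in Theorem 2.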

Figure 1: Log-linear plot of the bounds (blue) on the trace (left) and on the KL-divergence (right), plotted against the actual values (yellow), for a synthetic data set with x ∼ N(0, 5).

References

Bengt von Bahr and Carl-Gustav Esseen. Inequalities for the r-th Absolute Moment of a Sum of Random Variables, 1 ≤ r ≤ 2. The Annals of Mathematical Statistics, 36:299–303, 1965. ISSN 0003-4851.

David Burt, Carl Edward Rasmussen, and Mark van der Wilk. Sparse Variational Gaussian Process Inference with Eigenfunction Inducing Features. Submitted, October 2018.

Pafnutii Lvovich Chebyshev. Des valeurs moyennes. Liouville's J. Math. Pures Appl., 12:177–184, 1867.

Giancarlo Ferrari-Trecate, Christopher K. I. Williams, and Manfred Opper. Finite-dimensional approximation of Gaussian processes. In Advances in Neural Information Processing Systems, pages 218–224, 1999.

Alex Gittens and Michael Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 567–575, Atlanta, Georgia, USA, June 2013. PMLR.

Izrail Solomonovich Gradshteyn and Iosif Moiseevich Ryzhik. Table of Integrals, Series, and Products. Academic Press, 2014.

Miguel Lázaro-Gredilla and Aníbal Figueiras-Vidal. Inter-domain Gaussian Processes for Sparse Inference using Inducing Features. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1087–1095. Curran Associates, Inc., 2009.

Alexander G. de G. Matthews, James Hensman, Richard Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. Journal of Machine Learning Research, 51:231–239, 2016.

Joaquin Quiñonero Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.

Michalis K. Titsias. Variational Inference for Gaussian and Determinantal Point Processes. Workshop on Advances in Variational Inference, NIPS, December 2014.

Richard Eric Turner and Maneesh Sahani. Two problems with variational expectation maximisation for time series models, pages 104–124. Cambridge University Press, 2011.

Christopher K. I. Williams and Matthias Seeger. Using the Nyström Method to Speed Up Kernel Machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.

Appendix A. Proof of Lemma 1

Proof  The proof is similar in spirit to that of (3) provided in Titsias (2014). Let R = Q_{f,f} + σ²_noise I. Then

L_upper − L_lower = t/(2σ²_noise) + (1/2) y^T R^{-1} y − (1/2) y^T (R + tI)^{-1} y
                  = t/(2σ²_noise) + (1/2) y^T (R^{-1} − (R + tI)^{-1}) y.

Since Q_{f,f} is symmetric positive semidefinite, R is positive definite with eigenvalues bounded below by σ²_noise. Write R = UΛU^T, where U is unitary and Λ is a diagonal matrix with non-increasing diagonal entries γ_1 ≥ γ_2 ≥ … ≥ γ_N ≥ σ²_noise. We can rewrite the second term (ignoring the factor of one half) as

y^T (UΛ^{-1}U^T − U(Λ + tI)^{-1}U^T) y = (U^T y)^T (Λ^{-1} − (Λ + tI)^{-1}) (U^T y).

Now define z = U^T y. Since U is unitary, ‖z‖ = ‖y‖, and

(U^T y)^T (Λ^{-1} − (Λ + tI)^{-1}) (U^T y) = z^T (Λ^{-1} − (Λ + tI)^{-1}) z = Σ_i z_i² t/(γ_i(γ_i + t)) ≤ ‖y‖² t/(γ_N(γ_N + t)).

The last inequality comes from noting that the fraction in the sum attains its maximum when γ_i is minimized. Since σ²_noise is a lower bound on the smallest eigenvalue of R, we have

y^T (R^{-1} − (R + tI)^{-1}) y = t · y^T R^{-1}(R + tI)^{-1} y ≤ ‖y‖² t/(σ⁴_noise + σ²_noise t),

from which, recalling the factor of one half and that KL(Q ‖ P̂) = L − L_lower ≤ L_upper − L_lower, Lemma 1 follows.

Appendix B. Proof of Theorem 2

For inference with a squared exponential kernel and M eigenfunction inducing features defined with respect to µ = N(0, s₂²) (so that the constants a, c, A and B of Section 3.3 are evaluated with s = s₂), a stronger version of Lemma 3 is the following:

Theorem 4  Suppose the x_i are drawn i.i.d. from N(0, s₁²) with s₂ ≥ s₁. Then, almost surely, for all M ≥ 0,

lim_{N→∞} (1/N) t ≤ k λ_0 B^M/(1 − B),  with k = s₂/s₁.   (13)

If in addition 2s₁² < s₂², then for any α > 0,

P( t > K N B^M ) ≤ (1/(α² N)) ( √(s₂²/(s₂² − 2s₁²)) − s₂²/(s₂² − s₁²) ),   (14)

where K := 1.19 √(4cs₂²) v_k √(2a/A) (1/(1 − B)) (α + √(s₂²/(s₂² − s₁²))).

Remark 5  This proof could likely be generalized to all s₂ ≥ s₁ by using a concentration inequality based on the r-th moment for 1 < r ≤ 2, for example those in von Bahr and Esseen (1965), in place of Chebyshev's inequality.

The proof of the first part of the statement, (13), comes from the observation that if q is the density associated with µ and p is the density from which the x_i are drawn, then max_x p(x)/q(x) = p(0)/q(0) = k. Note that here it is essential that s₂ ≥ s₁, or else this ratio would not be bounded as x → ∞. Then,

(1/N) t = (1/N) Σ_{m=M}^{∞} λ_m Σ_{i=1}^{N} φ_m(x_i)² = Σ_{m=M}^{∞} λ_m ( (1/N) Σ_{i=1}^{N} φ_m(x_i)² ).   (15)

The inner sum is an average of i.i.d. random variables. By the strong law of large numbers,

(1/N) Σ_{i=1}^{N} φ_m(x_i)²  →  E_p[φ_m(x_i)²] = ∫ φ_m(x)² p(x) dx ≤ k ∫ φ_m(x)² q(x) dx = k   (16)

almost surely.

This shows (13) holds as N tends to infinity for any fixed M. As the countable intersection of events that occur with probability one also occurs with probability one, this holds for all M simultaneously almost surely.

For the probabilistic bound, we begin by establishing a term-wise bound on the elements of K_{f,f} − Q_{f,f} that takes into account the locations of the training inputs.

Lemma 6  Let q_{i,j} denote the (i, j)-th entry of Q_{f,f} and k_{i,j} = k(x_i, x_j) the (i, j)-th entry of K_{f,f}. Then

|k_{i,j} − q_{i,j}| ≤ 1.19 √(4cs₂²) v_k √(2a/A) (B^M/(1 − B)) exp((x_i² + x_j²)/(4s₂²))   (17)

holds for all pairs x_i, x_j.

Proof  As in earlier proofs, we have

k_{i,j} − q_{i,j} = Σ_{m=M}^{∞} λ_m φ_m(x_i) φ_m(x_j).

We now need to take the locations of the x_i into account in this bound. For the squared exponential kernel,

φ_m(x) = (4cs₂²)^{1/4} (2^m m!)^{-1/2} H_m(√(2c) x) exp(−(c − a)x²),

where H_m(x) is the m-th Hermite polynomial (Rasmussen and Williams, 2006); we have normalized the basis so that ‖φ_m‖_{L²(µ)} = 1. We use the following bound on Hermite polynomials, obtained by squaring the bound in Gradshteyn and Ryzhik (2014):

|H_m(x_i) H_m(x_j)| < 1.19 · 2^m m! · exp((x_i² + x_j²)/2).

Expanding the definition of the φ_m, we obtain

|k_{i,j} − q_{i,j}| = | Σ_{m=M}^{∞} λ_m √(4cs₂²) (2^m m!)^{-1} H_m(√(2c) x_i) e^{−cx_i²} H_m(√(2c) x_j) e^{−cx_j²} | exp(a(x_i² + x_j²))
≤ √(4cs₂²) exp(a(x_i² + x_j²)) Σ_{m=M}^{∞} λ_m (2^m m!)^{-1} |H_m(√(2c) x_i) e^{−cx_i²}| |H_m(√(2c) x_j) e^{−cx_j²}|
≤ 1.19 √(4cs₂²) exp(a(x_i² + x_j²)) Σ_{m=M}^{∞} λ_m
= 1.19 √(4cs₂²) exp((x_i² + x_j²)/(4s₂²)) v_k √(2a/A) B^M/(1 − B).

The first inequality (the triangle inequality) is sharp on the terms affecting the trace, since when x_i = x_j the sum is term-wise positive.
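As a quick sanity check on the Hermite bound used in the proof above (taking 1.19 as the constant appearing there), the sketch below verifies the single-polynomial, Cramér-type form of the inequality numerically over a grid; the product bound then follows by multiplying two such factors. The grid range and polynomial orders are arbitrary illustrative choices.

```python
# Numerical sanity check (not from the paper) of the Cramer-type Hermite bound
# used in the proof of Lemma 6: |H_m(y)| exp(-y^2/2) <= sqrt(1.19) 2^(m/2) sqrt(m!),
# whose square gives |H_m(x_i)H_m(x_j)| <= 1.19 * 2^m * m! * exp((x_i^2 + x_j^2)/2).
import math
import numpy as np

def hermite(m, y):
    """Physicists' Hermite polynomial H_m(y) via the three-term recurrence."""
    h_prev, h = np.ones_like(y), 2.0 * y
    if m == 0:
        return h_prev
    for n in range(1, m):
        h_prev, h = h, 2.0 * y * h - 2.0 * n * h_prev
    return h

ys = np.linspace(-8.0, 8.0, 4001)        # grid range chosen for illustration
for m in range(30):
    lhs = np.abs(hermite(m, ys)) * np.exp(-ys ** 2 / 2.0)
    rhs = math.sqrt(1.19) * 2.0 ** (m / 2.0) * math.sqrt(math.factorial(m))
    assert np.all(lhs <= rhs), m
```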

Let A_i = exp(x_i²/(2s₂²)). It remains to derive a probabilistic bound on S := Σ_{i=1}^{N} A_i. We first compute the moments of the A_i under the input distribution:

E[A_i^r] = (1/(√(2π) s₁)) ∫_R exp(r x²/(2s₂²)) exp(−x²/(2s₁²)) dx = √(s₂²/(s₂² − r s₁²)),   (18)

for r s₁² < s₂². From this we deduce Var(A_i) = E[A_i²] − E[A_i]² = √(s₂²/(s₂² − 2s₁²)) − s₂²/(s₂² − s₁²). Chebyshev's inequality (Chebyshev, 1867) tells us that for any α > 0,

P( S > N(α + E[A_i]) ) ≤ Var(A_i)/(α² N).   (19)

Summing the term-wise bound of Lemma 6 over the diagonal gives t ≤ 1.19 √(4cs₂²) v_k √(2a/A) (B^M/(1 − B)) S, and Theorem 4 follows. In order to prove Theorem 2, choose M ≥ 3 log N / log(1/B) = log(N³)/log(1/B), so that K N B^M ≤ K N^{−2}. For large N, this is an upper bound on the trace with high probability. Using this upper bound in Lemma 1 gives Theorem 2 as a corollary.
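To tie the pieces of Appendix B together, the sketch below builds the eigenfunction features explicitly for the squared exponential kernel, forms Q_{f,f} from the first M eigenpairs, and compares the exact trace t with the bound obtained by summing (17) over the diagonal, i.e. 1.19 √(4cs₂²) v_k √(2a/A) (B^M/(1 − B)) S. The expressions for λ_m and φ_m are the ones given above; the particular values of v_k, ℓ, s₁, s₂, N and M are illustrative assumptions, not values used in the paper.

```python
# Sketch (illustrative parameter values, not from the paper): compare the exact
# trace t = tr(K_ff - Q_ff) for eigenfunction inducing features with the bound
# obtained by summing the term-wise bound (17) over the diagonal.
import math
import numpy as np

v_k, ell, s1, s2 = 1.0, 1.0, 1.0, 2.0
a = 1.0 / (4.0 * s2 ** 2)
b = 1.0 / (2.0 * ell ** 2)
c = math.sqrt(a ** 2 + 2.0 * a * b)
A = a + b + c
B = b / A

def hermite(m, y):
    """Physicists' Hermite polynomial H_m(y) via the three-term recurrence."""
    h_prev, h = np.ones_like(y), 2.0 * y
    if m == 0:
        return h_prev
    for n in range(1, m):
        h_prev, h = h, 2.0 * y * h - 2.0 * n * h_prev
    return h

def phi(m, x):
    """Eigenfunction of the SE-kernel operator, normalised in L2(N(0, s2^2))."""
    norm = (4.0 * c * s2 ** 2) ** 0.25 / math.sqrt(2.0 ** m * math.factorial(m))
    return norm * hermite(m, np.sqrt(2.0 * c) * x) * np.exp(-(c - a) * x ** 2)

def lam(m):
    return v_k * math.sqrt(2.0 * a / A) * B ** m

rng = np.random.default_rng(1)
N, M = 500, 20
x = rng.normal(0.0, s1, size=N)
K_ff = v_k * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell ** 2)
Phi = np.stack([phi(m, x) for m in range(M)])                    # shape (M, N)
Q_ff = Phi.T @ (np.array([lam(m) for m in range(M)])[:, None] * Phi)

t = np.trace(K_ff - Q_ff)                                        # exact trace
S = np.sum(np.exp(x ** 2 / (2.0 * s2 ** 2)))
t_bound = (1.19 * math.sqrt(4.0 * c * s2 ** 2) * v_k * math.sqrt(2.0 * a / A)
           * B ** M / (1.0 - B) * S)
print(f"t = {t:.3e}   bound = {t_bound:.3e}")                    # t <= bound
```

Since this bound decays as B^M while ‖y‖² grows only linearly in N, plugging it into Lemma 1 reproduces the log N scaling used in the proof of Theorem 2.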