Explicit Rates of Convergence for Sparse Variational Inference in Gaussian Process Regression

JMLR: Workshop and Conference Proceedings 1–9, 2018. Symposium on Advances in Approximate Bayesian Inference

Explicit Rates of Convergence for Sparse Variational Inference in Gaussian Process Regression

David Burt (drb6@cam.ac.uk) and Carl Edward Rasmussen (cer54@cam.ac.uk), University of Cambridge, Cambridge, UK
Mark van der Wilk (mark@prowler.io), PROWLER.io, Cambridge, UK

© 2018 D. Burt, C.E. Rasmussen & M. van der Wilk.

Abstract

Sparse variational inference in Gaussian process regression has theoretical guarantees that make it robust to over-fitting. Additionally, it is well known that in the non-sparse regime, with M = N, variational inducing point methods can be equivalent to full inference. In this paper, we derive bounds on the KL-divergence from the sparse variational approximation to the full posterior and show convergence for M on the order of log N inducing features when using the squared exponential kernel.

1. Introduction

Gaussian processes can model complex functions, are robust to over-fitting and provide uncertainty estimates. In the case of regression, the marginal likelihood and predictive posterior can be calculated in closed form. However, this becomes computationally intractable for large data sets. The variational framework of Titsias (2009) approximates the model posterior with a simpler Gaussian process. This approximation retains many of the desirable properties of non-parametric models, in contrast to the parametric model approximations of Quiñonero-Candela and Rasmussen (2005). A precise treatment of the KL-divergence between stochastic processes is given in Matthews et al. (2016).

While the variational approximation provides an objective function (ELBO) for selecting the hyperparameters, it can introduce bias into the selection process (Turner and Sahani, 2011). However, if the gap between the ELBO and the log marginal likelihood at the optimal choice of hyperparameters is small, the optimization will result in a choice of hyperparameters similarly good to those learned using the marginal likelihood.

In this work, we introduce eigenfunction inducing features, inter-domain features in the style of Lázaro-Gredilla and Figueiras-Vidal (2009), based on the eigendecomposition of the kernel. We obtain bounds on the KL-divergence for sparse inference with these features and give theoretical insight into the number of features needed to approximate the posterior process. These features also result in a diagonal covariance matrix K_{u,u}, which can lead to computational gains (Burt et al., 2018).

2. Bounds on the Marginal Likelihood

The log marginal likelihood for regression with a mean-centered Gaussian process prior with covariance function k(·,·), used to regress on D = {(x_i, y_i)}_{i=1}^N, is given by (Rasmussen and Williams, 2006):

L = log p(y),    (1)

where y is distributed according to N(y; 0, K_{f,f} + σ_noise^2 I), σ_noise^2 is the likelihood variance and K_{f,f} is the data covariance matrix, with [K_{f,f}]_{i,j} = k(x_i, x_j). Titsias (2009) provided an ELBO for (1) based on a set of M inducing features, {u_m}_{m=0}^{M-1}:

L ≥ L_lower = log N(y; 0, Q_{f,f} + σ_noise^2 I) − t/(2σ_noise^2),    (2)

where t = tr(K_{f,f} − Q_{f,f}), Q_{f,f} = K_{u,f}^T K_{u,u}^{-1} K_{u,f}, [K_{u,f}]_{m,i} = cov(u_m, f(x_i)) and [K_{u,u}]_{m,n} = cov(u_m, u_n). This ELBO is commonly used as a computationally tractable objective function for learning hyperparameters. Moreover, Matthews et al. (2016) showed that the gap between the log marginal likelihood and the evidence lower bound is the KL-divergence between the posterior Gaussian process and the process used in approximate inference. Instead of bounding L − L_lower directly, we bound the gap between the ELBO and an upper bound on the marginal likelihood (Titsias, 2014):

L ≤ L_upper = log N(y; 0, Q_{f,f} + tI + σ_noise^2 I) + (1/2) log|Q_{f,f} + tI + σ_noise^2 I| − (1/2) log|Q_{f,f} + σ_noise^2 I|.    (3)

If t = 0, then L_lower = L = L_upper, so in order to obtain a bound on the KL-divergence it suffices to bound t. Explicitly,

Lemma 1 With the same notation as in (3),

KL(Q ‖ P̂) ≤ t/(2σ_noise^2) + ‖y‖^2 t / (2(σ_noise^4 + σ_noise^2 t)),    (4)

where P̂ is the full posterior process and Q is the variational approximation. Note that under mild conditions, ‖y‖^2 = O(N).

The proof is provided in Appendix A. Explicitly writing out the diagonal entries of K_{f,f} − Q_{f,f} in order to bound t is difficult due to the inverse matrix K_{u,u}^{-1} appearing in the definition of Q_{f,f}.

3. Eigenvalues and Optimal Rate of Convergence

Before proving upper bounds on t, we consider a lower bound. As observed in Titsias (2014), t is the error in trace norm of a low-rank approximation to the covariance matrix, so

t ≥ Σ_{m=M+1}^{N} γ_m,    (5)

where γ_m denotes the m-th largest eigenvalue of K_{f,f}.
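Before turning to the eigenvalue analysis, the quantities above are easy to check numerically. The following sketch is not part of the paper: the helper se_kernel, the parameter values (N, M, the noise variance, the lengthscale) and the use of inducing points subsampled from the data are all illustrative assumptions. It evaluates the exact log marginal likelihood (1), the ELBO (2), the upper bound (3), the Lemma 1 bound (4) and the trace lower bound (5) on a small synthetic regression problem.

import numpy as np
from scipy.stats import multivariate_normal

def se_kernel(x, z, v_k=1.0, ell=1.0):
    # squared exponential kernel v_k * exp(-(x - z)^2 / (2 ell^2)) for 1-d inputs
    return v_k * np.exp(-(x[:, None] - z[None, :]) ** 2 / (2 * ell**2))

rng = np.random.default_rng(0)
N, M, noise = 200, 15, 0.1                       # noise plays the role of sigma_noise^2
X = rng.normal(0.0, 2.0, N)
y = np.sin(X) + rng.normal(0.0, np.sqrt(noise), N)

Z = rng.choice(X, M, replace=False)              # inducing inputs subsampled from the data
Kff = se_kernel(X, X)
Kuf = se_kernel(Z, X)
Kuu = se_kernel(Z, Z) + 1e-8 * np.eye(M)         # jitter for numerical stability
Qff = Kuf.T @ np.linalg.solve(Kuu, Kuf)
Qff = 0.5 * (Qff + Qff.T)                        # symmetrise against round-off
t = np.trace(Kff - Qff)

def log_gauss(y, cov):
    return multivariate_normal(mean=np.zeros(len(y)), cov=cov).logpdf(y)

I = np.eye(N)
L = log_gauss(y, Kff + noise * I)                            # exact log marginal likelihood (1)
L_lower = log_gauss(y, Qff + noise * I) - t / (2 * noise)    # ELBO (2)
_, ld_big = np.linalg.slogdet(Qff + (t + noise) * I)
_, ld_small = np.linalg.slogdet(Qff + noise * I)
L_upper = log_gauss(y, Qff + (t + noise) * I) + 0.5 * ld_big - 0.5 * ld_small   # upper bound (3)

kl_gap = L - L_lower                                         # KL-divergence to the posterior process
kl_bound = t / (2 * noise) + t * (y @ y) / (2 * (noise**2 + noise * t))          # Lemma 1, (4)
gammas = np.sort(np.linalg.eigvalsh(Kff))[::-1]              # eigenvalues of K_ff, largest first
print(f"L_lower = {L_lower:.2f} <= L = {L:.2f} <= L_upper = {L_upper:.2f}")
print(f"KL gap = {kl_gap:.4f} <= bound (4) = {kl_bound:.4f}")
print(f"lower bound (5): {gammas[M:].sum():.4f} <= t = {t:.4f}")

Under these settings the three bounds can be compared directly and monitored as M is varied.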

Generally, computing eigenvalues and eigenvectors of a large matrix is costly. However, as observed in Williams and Seeger (2001), as N → ∞, if the training points are i.i.d. random variables drawn according to the probability measure ρ, the eigenvalues of N^{-1} K_{f,f} converge to those of the operator

(Kf)(x') = ∫_X f(x) k(x, x') dρ(x).    (6)

3.1. Connection to the Nyström Approximation

In the case of inducing points subsampled from the data, the matrix Q_{f,f} is a Nyström approximation to K_{f,f} (Williams and Seeger, 2001). Therefore, any bound on the rate of convergence of the Nyström approximation, for example the bounds of Gittens and Mahoney (2013), can be used in conjunction with Lemma 1 to yield a bound on the KL-divergence. While the result of Gittens and Mahoney (2013) yields similar asymptotic (in M) behaviour to the bounds we prove, due to large constants in their proof, direct application of their result to this problem appears to require tens of thousands of inducing points. Additionally, a proper analysis in which M is allowed to grow as a function of N involves bounding rates of convergence of the kernel matrix spectrum to the operator spectrum. Instead, we introduce eigenfunction features, defined with respect to K, which yield elegant analytic upper bounds on t. These bounds asymptotically (in N) match the lower bound (5). Eigenfunction features are defined so as to be orthogonal, greatly simplifying the computation of t because K_{u,u} = I.

3.2. Eigenfunction Inducing Features

Mercer's theorem tells us that

k(x, x') = Σ_{m=0}^∞ λ_m φ_m(x) φ_m(x'),    (7)

where (λ_m, φ_m)_{m=0}^∞ are the eigenvalue–eigenfunction pairs of an operator of the form (6) defined with respect to a probability measure µ, which we assume are sorted so that λ_m > λ_{m+1} > 0 for all m. We define eigenfunction inducing features by

u_m = (1/√λ_m) ∫_X f(x) φ_m(x) dµ(x),

where µ can be treated as a variational parameter. Using the orthogonality of the eigenfunctions and the eigenfunction property, it can be shown that

cov(u_m, f(x)) = √λ_m φ_m(x)  and  cov(u_m, u_{m'}) = δ_{m,m'}.    (8)

3.3. An Example: Squared Exponential Kernel

The squared exponential kernel, defined for X = R by k_se(x, x') = v_k exp(−(x − x')^2/(2ℓ^2)), is among the most popular choices of kernel. Suppose that the measure used in defining K has normal density N(0, s^2). In this case the eigenvalues of K are given by (Rasmussen and Williams, 2006):

λ_m = v_k √(2a/A) B^m,    (9)

with a = 1/(4s^2), b = 1/(2ℓ^2), c = √(a^2 + 2ab), A = a + b + c and B = b/A. From the above expression, when B is near one we should expect to need more inducing features in order to bound the KL-divergence. This occurs when the observed data spans many kernel lengthscales. For any lengthscale, the eigenvalues decay exponentially. This asymptotic decay is closely connected to the smoothness of sample functions from the prior; see Rasmussen and Williams (2006, Chapters 4 and 7). It has previously been shown that truncating the kernel eigenbasis at an M such that λ_M ≈ σ_noise^2/N leads to almost no loss in model performance (Ferrari-Trecate et al., 1999). We prove the following:

Theorem 2 Suppose the x_i are drawn i.i.d. from N(0, s^2), and sparse inference is performed with eigenfunction inducing features defined with respect to q(x) = N(0, s̃^2) with 2s^2 < s̃^2. Then for any ε, δ > 0 there exists an N_{s,s̃} such that, for all N > N_{s,s̃}, inference with a set of M = c_{s,s̃} log N inducing features results in

Pr(KL(Q ‖ P̂) > ε) < δ.

While the full proof is in Appendix B, we suggest why this should be true in the following section.

4. Bounds on the Trace using Eigenfunction Features

Using the eigenfunction features, deriving a bound on t is straightforward. From (8),

[K_{u,f}^T K_{u,f}]_{i,j} = Σ_{m=0}^{M-1} λ_m φ_m(x_i) φ_m(x_j).

Combining this with Mercer's theorem yields the following lemma:

Lemma 3 Suppose the training data consists of N i.i.d. random variables drawn according to a probability density p(x), and we use M eigenfunction inducing features defined with respect to a density q(x) such that k = max_x p(x)/q(x) < ∞. Then

lim_{N→∞} (1/N) tr(K_{f,f} − Q_{f,f}) = Σ_{m=M}^∞ λ_m E_p[φ_m(x)^2] ≤ k Σ_{m=M}^∞ λ_m.    (10)

For the squared exponential kernel, (10) is a geometric series and has a closed form (see Appendix B). An example of the bound on the expected value of the trace and the resulting bound on the KL-divergence is shown in Figure 1. Under the assumed input distribution in Lemma 3, as N tends to infinity for fixed M, the scaled empirical eigenvalues in (5) approach the λ_m, so the bound on the expected value of N^{-1} t is asymptotically tight. In order to rigorously prove Theorem 2 we need to understand the trace itself, not just the expected value of its entries. In order to achieve this, as well as to obtain bounds that remain effective as M grows as a function of N, we show that the square of the eigenfunctions has bounded second moment under the input distribution. The details are given in Appendix B.
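The geometric decay in (9) and the closed form of the tail bound in (10) are simple to evaluate. The following sketch is not from the paper; the kernel and measure parameters (v_k, ell, s, s_tilde) are illustrative, and k λ_0 B^M/(1 − B) is the closed form of the right-hand side of (10) for the squared exponential kernel under these assumptions.

import numpy as np

def se_operator_eigs(v_k, ell, s_tilde):
    # closed-form eigenvalue parameters of the SE-kernel operator under a N(0, s_tilde^2) measure,
    # following the expressions quoted from Rasmussen and Williams (2006) in equation (9)
    a = 1.0 / (4.0 * s_tilde**2)
    b = 1.0 / (2.0 * ell**2)
    c = np.sqrt(a**2 + 2.0 * a * b)
    A = a + b + c
    B = b / A
    lam0 = v_k * np.sqrt(2.0 * a / A)         # lambda_m = lam0 * B**m for m = 0, 1, ...
    return lam0, B

v_k, ell = 1.0, 1.0
s, s_tilde = 1.0, 2.0                          # data std s, feature-measure std s_tilde
k_ratio = s_tilde / s                          # k = max_x p(x)/q(x) for these two Gaussians
lam0, B = se_operator_eigs(v_k, ell, s_tilde)

# right-hand side of (10): k * sum_{m >= M} lambda_m = k * lam0 * B**M / (1 - B)
for M in [5, 10, 20, 40]:
    print(f"M = {M:>2}: bound on lim t/N = {k_ratio * lam0 * B**M / (1 - B):.3e}")

# features needed to push the bound below a tolerance: M grows only logarithmically in 1/tol
tol = 1e-8
M_needed = int(np.ceil(np.log(tol * (1 - B) / (k_ratio * lam0)) / np.log(B)))
print(f"M needed for a bound below {tol:.0e}: {M_needed}")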

Figure 1: Log-linear plot of the bounds (blue) on the trace (left) and on the KL-divergence (right), plotted against the actual values (yellow), for a synthetic data set with normally distributed inputs and a squared exponential kernel.

References

Bengt von Bahr and Carl-Gustav Esseen. Inequalities for the r-th Absolute Moment of a Sum of Random Variables, 1 ≤ r ≤ 2. The Annals of Mathematical Statistics, 36(1):299–303, 1965.

David Burt, Carl Edward Rasmussen, and Mark van der Wilk. Sparse Variational Gaussian Process Inference with Eigenfunction Inducing Features. Submitted, October 2018.

Pafnutii Lvovich Chebyshev. Des valeurs moyennes. Liouville's J. Math. Pures Appl., 177–184, 1867.

Giancarlo Ferrari-Trecate, Christopher K. I. Williams, and Manfred Opper. Finite-dimensional approximation of Gaussian processes. In Advances in Neural Information Processing Systems, pages 218–224, 1999.

Alex Gittens and Michael Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, Atlanta, Georgia, USA, June 2013. PMLR.

Izrail Solomonovich Gradshteyn and Iosif Moiseevich Ryzhik. Table of Integrals, Series, and Products. Academic Press, 2014.

Miguel Lázaro-Gredilla and Aníbal Figueiras-Vidal. Inter-domain Gaussian Processes for Sparse Inference using Inducing Features. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22. Curran Associates, Inc., 2009.

Alexander G. de G. Matthews, James Hensman, Richard Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. Journal of Machine Learning Research (W&CP), 51:231–239, 2016.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.

Michalis K. Titsias. Variational Inference for Gaussian and Determinantal Point Processes. Workshop on Advances in Variational Inference, NIPS, December 2014.

Richard Eric Turner and Maneesh Sahani. Two problems with variational expectation maximisation for time series models. Cambridge University Press, 2011.

Christopher K. I. Williams and Matthias Seeger. Using the Nyström Method to Speed Up Kernel Machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.

Appendix A. Proof of Lemma 1

Proof. The proof has a similar spirit to the proof of (3) provided in Titsias (2014). Since KL(Q ‖ P̂) = L − L_lower ≤ L_upper − L_lower, it suffices to bound the latter. Let R = Q_{f,f} + σ_noise^2 I. Then

L_upper − L_lower = t/(2σ_noise^2) + (1/2)( y^T R^{-1} y − y^T (R + tI)^{-1} y )
                  = t/(2σ_noise^2) + (1/2) y^T R^{-1}(R + tI)^{-1} y t.

Since Q_{f,f} is symmetric positive semidefinite, R is positive definite with eigenvalues bounded below by σ_noise^2. Write R = UΛU^T, where U is unitary and Λ is a diagonal matrix with non-increasing diagonal entries γ_1 ≥ γ_2 ≥ ... ≥ γ_N ≥ σ_noise^2. We can rewrite the second term (ignoring the factor of one half) as

y^T UΛ^{-1}U^T (UΛU^T + tI)^{-1} y t = (U^T y)^T Λ^{-1}(Λ + tI)^{-1} (U^T y) t.

Now define z = U^T y. Since U is unitary, ‖z‖ = ‖y‖, and

(U^T y)^T Λ^{-1}(Λ + tI)^{-1} (U^T y) t = z^T Λ^{-1}(Λ + tI)^{-1} z t = Σ_i z_i^2 t/(γ_i^2 + γ_i t) ≤ ‖y‖^2 t/(γ_N^2 + γ_N t).

The last inequality comes from noting that each fraction in the sum attains its maximum when γ_i is minimized. Since σ_noise^2 is a lower bound on the smallest eigenvalue of R, we have

y^T R^{-1}(R + tI)^{-1} y t ≤ ‖y‖^2 t/(σ_noise^4 + σ_noise^2 t),

from which Lemma 1 follows.

Appendix B. Proof of Theorem 2

For inference with a squared exponential kernel and M eigenfunction inducing features defined with respect to µ = N(0, s̃^2) (so that the constants a, b, c, A and B of Section 3.3 are computed with s̃ in place of s), a stronger version of Lemma 3 is the following:

Theorem 4 Suppose the x_i are drawn i.i.d. from N(0, s^2) with s ≤ s̃. Then for all M ≥ 0,

lim_{N→∞} (1/N) t ≤ k λ_0 B^M/(1 − B),  with k = s̃/s,    (13)

holds almost surely. If, in addition, 2s^2 < s̃^2, then for any α > 0,

P( t > K N B^M (α + s̃/√(s̃^2 − s^2)) ) ≤ ( s̃/√(s̃^2 − 2s^2) − s̃^2/(s̃^2 − s^2) ) / (α^2 N),    (14)

where K := 1.9 √(4cs̃^2) v_k √(2a/A) / (1 − B).

Remark 5 This proof could likely be generalized to all s < s̃ by using a concentration inequality based on the r-th moment for 1 < r ≤ 2, for example those in von Bahr and Esseen (1965), in place of Chebyshev's inequality.

The proof of the first part of the statement, (13), comes from the observation that if q is the density associated with µ and p is the density from which the x_i are drawn, then max_x p(x)/q(x) = p(0)/q(0) = k. Note that here it is essential that s ≤ s̃, or else this ratio would not be bounded as x → ∞. Then,

(1/N) t = (1/N) Σ_{i=1}^N Σ_{m=M}^∞ λ_m φ_m(x_i)^2 = Σ_{m=M}^∞ λ_m (1/N) Σ_{i=1}^N φ_m(x_i)^2.    (15)

The inner sum is an empirical average of i.i.d. random variables. By the strong law of large numbers,

(1/N) Σ_{i=1}^N φ_m(x_i)^2 →_{a.s.} E_p[φ_m(x)^2] = ∫ φ_m(x)^2 p(x) dx ≤ k ∫ φ_m(x)^2 q(x) dx = k.    (16)

This shows (13) holds as N tends to infinity for any fixed M. As the countable intersection of events that occur with probability one also occurs with probability one, this holds for all M simultaneously almost surely.

For the probabilistic bound, we begin by establishing a term-wise bound on the elements of K_{f,f} − Q_{f,f} that takes into account the locations of the training inputs.

Lemma 6 Let q_{i,j} denote the (i,j)-th entry of Q_{f,f} and k_{i,j} = k(x_i, x_j) the (i,j)-th entry of K_{f,f}. Then

k_{i,j} − q_{i,j} ≤ 1.9 √(4cs̃^2) v_k √(2a/A) (B^M/(1 − B)) exp((x_i^2 + x_j^2)/(4s̃^2))    (17)

holds for all pairs (x_i, x_j).

Proof. As in the earlier derivation, we have

k_{i,j} − q_{i,j} = Σ_{m=M}^∞ λ_m φ_m(x_i) φ_m(x_j).

We now need to take the locations of the x_i into account in this bound. For the squared exponential kernel,

φ_m(x) = (4cs̃^2)^{1/4} (2^m m!)^{-1/2} H_m(√(2c) x) exp(−(c − a)x^2),

where H_m(x) is the m-th Hermite polynomial (Rasmussen and Williams, 2006); we have normalized the basis so that ‖φ_m‖_{L^2(µ)} = 1. We use the following bound on Hermite functions, obtained by squaring the bound in Gradshteyn and Ryzhik (2014):

|H_m(x_i) H_m(x_j)| < 1.9 m! 2^m e^{(x_i^2 + x_j^2)/2}.

Expanding the definition of the φ_m we obtain

k_{i,j} − q_{i,j} = √(4cs̃^2) exp(a(x_i^2 + x_j^2)) Σ_{m=M}^∞ λ_m [H_m(√(2c)x_i) e^{−cx_i^2}] [H_m(√(2c)x_j) e^{−cx_j^2}] / (2^m m!)
               ≤ √(4cs̃^2) exp(a(x_i^2 + x_j^2)) Σ_{m=M}^∞ λ_m |H_m(√(2c)x_i) e^{−cx_i^2}| |H_m(√(2c)x_j) e^{−cx_j^2}| / (2^m m!)
               ≤ 1.9 √(4cs̃^2) exp(a(x_i^2 + x_j^2)) Σ_{m=M}^∞ λ_m
               = 1.9 √(4cs̃^2) exp((x_i^2 + x_j^2)/(4s̃^2)) v_k √(2a/A) B^M/(1 − B).

The first inequality (the triangle inequality) is sharp on the terms affecting the trace, since when x_i = x_j the sum is term-wise positive.
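Lemma 6 can be checked numerically for small M. The sketch below is not from the paper: it assumes the eigenfunction normalization written in the proof above, uses scipy's eval_hermite, and picks arbitrary values for v_k, ell, s_tilde, M and the test inputs. It compares the truncation error k_{i,j} − q_{i,j} against the right-hand side of (17).

import numpy as np
from math import factorial
from scipy.special import eval_hermite

v_k, ell, s_tilde = 1.0, 1.0, 2.0              # illustrative kernel and feature-measure parameters
a = 1.0 / (4 * s_tilde**2)
b = 1.0 / (2 * ell**2)
c = np.sqrt(a**2 + 2 * a * b)
A = a + b + c
B = b / A

def lam(m):
    # eigenvalues (9) of the SE-kernel operator under N(0, s_tilde^2)
    return v_k * np.sqrt(2 * a / A) * B**m

def phi(m, x):
    # eigenfunctions, normalized to unit L2 norm under N(0, s_tilde^2)
    norm = (4 * c * s_tilde**2) ** 0.25 / np.sqrt(2.0**m * factorial(m))
    return norm * eval_hermite(m, np.sqrt(2 * c) * x) * np.exp(-(c - a) * x**2)

def k_se(x, z):
    return v_k * np.exp(-(x - z) ** 2 / (2 * ell**2))

M = 10
rng = np.random.default_rng(1)
for _ in range(5):
    xi, xj = rng.normal(0.0, 2.0, 2)                            # arbitrary test inputs
    q_ij = sum(lam(m) * phi(m, xi) * phi(m, xj) for m in range(M))
    resid = k_se(xi, xj) - q_ij                                 # tail of the Mercer series
    bound = (1.9 * np.sqrt(4 * c * s_tilde**2) * v_k * np.sqrt(2 * a / A)
             * B**M / (1 - B) * np.exp((xi**2 + xj**2) / (4 * s_tilde**2)))
    print(f"x_i = {xi:+.2f}, x_j = {xj:+.2f}:  k - q = {resid:+.3e}  <=  bound (17) = {bound:.3e}")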

Summing the diagonal terms of (17), t ≤ K B^M Σ_{i=1}^N exp(x_i^2/(2s̃^2)), with K as in the statement of Theorem 4. Let A_i = exp(x_i^2/(2s̃^2)). It remains to derive a probabilistic bound on S := Σ_{i=1}^N A_i. We first compute the moments of the A_i under the input distribution:

E[A_i^r] = (1/√(2πs^2)) ∫_R exp(r x^2/(2s̃^2)) exp(−x^2/(2s^2)) dx = s̃/√(s̃^2 − rs^2),    (18)

for rs^2 < s̃^2. From this we deduce Var(A_i) = E[A_i^2] − E[A_i]^2 = s̃/√(s̃^2 − 2s^2) − s̃^2/(s̃^2 − s^2). Chebyshev's inequality (Chebyshev, 1867) tells us that for any α > 0,

P( S > N(α + s̃/√(s̃^2 − s^2)) ) ≤ Var(A_i)/(α^2 N).    (19)

Theorem 4 follows. In order to prove Theorem 2, choose M ≥ 3 log N / log(B^{-1}) = 3 log_{B^{-1}} N; then B^M ≤ N^{-3}, so K N B^M ≤ K N^{-2}. For N sufficiently large, this quantity is an upper bound on the trace with high probability, and it tends to zero. Using this upper bound in Lemma 1 gives Theorem 2 as a corollary.
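The choice of M in the final step and the resulting high-probability bound on t can be tabulated directly. This last sketch is not from the paper; the parameter values are illustrative (and satisfy 2s^2 < s̃^2), and it simply evaluates K, E[A_i] and Var(A_i) from (18) together with the bound (14) at M = ⌈3 log N / log(B^{-1})⌉.

import numpy as np

v_k, ell = 1.0, 1.0
s, s_tilde = 1.0, 2.0                                   # data std and feature-measure std, 2 s^2 < s_tilde^2
a = 1.0 / (4 * s_tilde**2)
b = 1.0 / (2 * ell**2)
c = np.sqrt(a**2 + 2 * a * b)
A = a + b + c
B = b / A
lam0 = v_k * np.sqrt(2 * a / A)
K_const = 1.9 * np.sqrt(4 * c * s_tilde**2) * lam0 / (1 - B)      # constant K from Theorem 4

mean_A = s_tilde / np.sqrt(s_tilde**2 - s**2)                     # E[A_i], equation (18) with r = 1
var_A = s_tilde / np.sqrt(s_tilde**2 - 2 * s**2) - s_tilde**2 / (s_tilde**2 - s**2)

alpha = 1.0
for N in [10**3, 10**5, 10**7]:
    M = int(np.ceil(3 * np.log(N) / np.log(1 / B)))               # M >= 3 log N / log(1/B)
    t_bound = K_const * N * B**M * (alpha + mean_A)               # high-probability bound on t, from (14)
    fail_prob = var_A / (alpha**2 * N)                            # Chebyshev failure probability (19)
    print(f"N = {N:>8}  M = {M:>3}  t <= {t_bound:.2e} with probability >= {1 - fail_prob:.6f}")

As expected, M grows only logarithmically in N while the bound on t shrinks like N^{-2}; since ‖y‖^2 = O(N), this is enough for the Lemma 1 bound on the KL-divergence to vanish.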
