Explicit Rates of Convergence for Sparse Variational Inference in Gaussian Process Regression
JMLR: Workshop and Conference Proceedings, 2018. Symposium on Advances in Approximate Bayesian Inference.

David Burt (drb6@cam.ac.uk) and Carl Edward Rasmussen (cer54@cam.ac.uk), University of Cambridge, Cambridge, UK
Mark van der Wilk (mark@prowler.io), PROWLER.io, Cambridge, UK

© 2018 D. Burt, C. E. Rasmussen & M. van der Wilk.

Abstract
Sparse variational inference in Gaussian process regression has theoretical guarantees that make it robust to over-fitting. Additionally, it is well known that in the non-sparse regime, with M = N inducing variables, variational inducing point methods can be equivalent to full inference. In this paper, we derive bounds on the KL-divergence from the sparse variational approximation to the full posterior and show convergence for M on the order of log N inducing features when using the squared exponential kernel.

1. Introduction
Gaussian processes can model complex functions, are robust to over-fitting and provide uncertainty estimates. In the case of regression, the marginal likelihood and predictive posterior can be calculated in closed form. However, this becomes computationally intractable for large data sets. The variational framework of Titsias (2009) approximates the model posterior with a simpler Gaussian process. This approximation retains many of the desirable properties of non-parametric models, in contrast to the parametric model approximations of Quiñonero-Candela and Rasmussen (2005). A precise treatment of the KL-divergence between stochastic processes is given in Matthews et al. (2016). While the variational approximation provides an objective function (the ELBO) for selecting the hyperparameters, it can introduce bias into the selection process (Turner and Sahani, 2011). However, if the gap between the ELBO and the log marginal likelihood at the optimal choice of hyperparameters is small, the optimization will result in a choice of hyperparameters similarly good to those learned using the marginal likelihood.

In this work, we introduce eigenfunction inducing features, an inter-domain feature in the style of Lázaro-Gredilla and Figueiras-Vidal (2009), based on the eigendecomposition of the kernel. We obtain bounds on the KL-divergence for sparse inference with these features and give theoretical insight into the number of features needed to approximate the posterior process. These features also result in a diagonal covariance matrix, which can lead to computational gains (Burt et al., 2018).

2. Bounds on the Marginal Likelihood
The log marginal likelihood for regression with a mean-centered Gaussian process prior with covariance function k(·,·), used to regress D = {(x_i, y_i)}_{i=1}^N, is given by (Rasmussen and Williams, 2006):
L = log p(y),    (1)

where y is distributed according to N(y; 0, K_ff + σ²_noise I), σ²_noise is the likelihood variance and K_ff is the data covariance matrix, with [K_ff]_{i,j} = k(x_i, x_j). Titsias (2009) provided an ELBO for (1) based on a set of M inducing features {u_m}_{m=0}^{M−1}:

L ≥ L_lower = log N(y; 0, Q_ff + σ²_noise I) − t/(2σ²_noise),    (2)

where t = tr(K_ff − Q_ff), Q_ff = K_uf^T K_uu^{−1} K_uf, [K_uf]_{m,i} = cov(u_m, f(x_i)) and [K_uu]_{m,n} = cov(u_m, u_n). This ELBO is commonly used as a computationally tractable objective function for learning hyperparameters. Moreover, Matthews et al. (2016) showed that the gap between the log marginal likelihood and the evidence lower bound is the KL-divergence between the posterior Gaussian process and the process used in approximate inference. Instead of bounding L − L_lower directly, we bound the gap between the ELBO and an upper bound on the marginal likelihood (Titsias, 2014):

L ≤ L_upper = log N(y; 0, Q_ff + tI + σ²_noise I) + (1/2) log|Q_ff + tI + σ²_noise I| − (1/2) log|Q_ff + σ²_noise I|.    (3)

If t = 0, then L_lower = L = L_upper, so in order to obtain a bound on the KL-divergence it suffices to bound t. Explicitly,

Lemma 1  With the same notation as in (3),

KL(Q ‖ P̂) ≤ t/(2σ²_noise) · (1 + ‖y‖²/(σ²_noise + t)),    (4)

where P̂ is the full posterior process and Q is the variational approximation. Note that under mild conditions, ‖y‖² = O(N).

The proof is provided in Appendix A. Explicitly writing out the diagonal entries of K_ff − Q_ff in order to bound t is difficult due to the inverse matrix K_uu^{−1} appearing in the definition of Q_ff.
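The quantities in (1)–(4) are cheap to evaluate on a small problem. The sketch below is not from the paper; it is a minimal numpy illustration with hypothetical helper names (se_kernel, log_gaussian) and illustrative parameter values. It forms Q_ff from a subset of the data used as inducing points, and checks numerically that L_lower ≤ L ≤ L_upper and that the gap L − L_lower is controlled by the right-hand side of (4).

```python
import numpy as np

def se_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    """Squared exponential kernel matrix k(x1, x2)."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def log_gaussian(y, cov):
    """log N(y; 0, cov) via a Cholesky factorisation."""
    L = np.linalg.cholesky(cov)
    alpha = np.linalg.solve(L, y)
    return (-0.5 * alpha @ alpha
            - np.log(np.diag(L)).sum()
            - 0.5 * len(y) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
N, M, noise_var = 200, 15, 0.1

# Synthetic regression data drawn from the GP prior plus noise.
x = rng.normal(0.0, 1.0, size=N)
K_ff = se_kernel(x, x)
y = rng.multivariate_normal(np.zeros(N), K_ff + noise_var * np.eye(N))

# Inducing points chosen as a subset of the data (Nystrom construction).
z = x[:M]
K_uu = se_kernel(z, z) + 1e-10 * np.eye(M)
K_uf = se_kernel(z, x)
Q_ff = K_uf.T @ np.linalg.solve(K_uu, K_uf)
t = np.trace(K_ff - Q_ff)

L = log_gaussian(y, K_ff + noise_var * np.eye(N))
L_lower = log_gaussian(y, Q_ff + noise_var * np.eye(N)) - t / (2 * noise_var)
L_upper = (log_gaussian(y, Q_ff + (t + noise_var) * np.eye(N))
           + 0.5 * np.linalg.slogdet(Q_ff + (t + noise_var) * np.eye(N))[1]
           - 0.5 * np.linalg.slogdet(Q_ff + noise_var * np.eye(N))[1])

# Lemma 1: the gap is at most t/(2 sigma^2) * (1 + ||y||^2 / (sigma^2 + t)).
kl_bound = t / (2 * noise_var) * (1 + y @ y / (noise_var + t))

print(f"L_lower={L_lower:.3f}  L={L:.3f}  L_upper={L_upper:.3f}")
print(f"gap L - L_lower = {L - L_lower:.4f} <= bound {kl_bound:.4f}")
assert L_lower <= L <= L_upper + 1e-6
assert L - L_lower <= kl_bound + 1e-6
```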
3. Eigenvalues and Optimal Rate of Convergence
Before proving upper bounds on t, we consider a lower bound. As observed in Titsias (2014), t is the error in trace norm of a low-rank approximation to the covariance matrix, so

t ≥ Σ_{m=M+1}^{N} γ_m,    (5)

where γ_m denotes the m-th largest eigenvalue of K_ff.

Generally, computing eigenvalues and eigenvectors of a large matrix is costly. However, as observed in Williams and Seeger (2001), as N → ∞, if the training points are i.i.d. random variables drawn according to the probability measure ρ, the eigenvalues of (1/N) K_ff converge to those of the operator

(K_ρ f)(x') = ∫_X f(x) k(x, x') dρ(x).    (6)

3.1 Connection to the Nyström Approximation
In the case of inducing points subsampled from the data, the matrix Q_ff is a Nyström approximation to K_ff (Williams and Seeger, 2001). Therefore, any bound on the rate of convergence of the Nyström approximation, for example a lemma of Gittens and Mahoney (2013), can be used in conjunction with Lemma 1 to yield a bound on the KL-divergence. While the lemma of Gittens and Mahoney (2013) yields similar asymptotic (in M) behavior to the bounds we prove, due to large constants in their proof, direct application of their result to this problem appears to require tens of thousands of points. Additionally, a proper analysis when M is allowed to grow as a function of N involves bounding rates of convergence of the kernel matrix spectrum to the operator spectrum. Instead, we introduce eigenfunction features, defined with respect to K_ρ, which yield elegant analytic upper bounds on t. These bounds asymptotically (in N) match the lower bound (5). Eigenfunction features are defined so as to be orthogonal, which greatly simplifies the computation of t because K_uu = I.
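As a small standalone illustration of (5) and of the Nyström connection (not from the paper; a numpy sketch with illustrative parameter values), the trace error of a Nyström approximation built from M subsampled inputs can be compared against the optimal rank-M error given by the eigenvalue tail of K_ff.

```python
import numpy as np

def se_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(1)
N = 300
x = rng.normal(0.0, 1.0, size=N)
K_ff = se_kernel(x, x)
eigvals = np.sort(np.linalg.eigvalsh(K_ff))[::-1]    # gamma_1 >= gamma_2 >= ...

for M in (5, 10, 20, 40):
    z = x[:M]                                         # inducing points from the data
    K_uu = se_kernel(z, z) + 1e-10 * np.eye(M)
    K_uf = se_kernel(z, x)
    Q_ff = K_uf.T @ np.linalg.solve(K_uu, K_uf)
    t_nystrom = np.trace(K_ff - Q_ff)
    t_optimal = eigvals[M:].sum()                     # lower bound (5) on t
    print(f"M={M:3d}  t(Nystrom)={t_nystrom:10.4e}  eigenvalue tail={t_optimal:10.4e}")
```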
3.2 Eigenfunction Inducing Features
Mercer's theorem tells us that

k(x, x') = Σ_{m=0}^∞ λ_m φ_m(x) φ_m(x'),    (7)

where {(λ_m, φ_m)}_{m=0}^∞ are the eigenvalue-eigenfunction pairs of the operator K_ρ in (6), with ρ taken to be µ, which we assume are sorted so that λ_m > λ_{m+1} > 0 for all m. We define eigenfunction inducing features by

u_m = (1/√λ_m) ∫_X f(x) φ_m(x) dµ(x),    (8)

where µ is a probability measure that can be treated as a variational parameter. Using the orthogonality of the eigenfunctions and the eigenfunction property, it can be shown that

cov(u_m, f(x)) = √λ_m φ_m(x)   and   cov(u_m, u_{m'}) = δ_{m,m'}.

3.3 An Example: Squared Exponential Kernel
The squared exponential kernel, defined for X = R by k_se(x, x') = v_k exp(−(x − x')²/(2ℓ²)), is among the most popular choices of kernel. Suppose that the measure used in defining K_ρ has normal density N(0, s²). In this case the eigenvalues of K_ρ are given by (Rasmussen and Williams, 2006):

λ_m = v_k √(2a/A) B^m,    (9)

with a = 1/(4s²), b = 1/(2ℓ²), c = √(a² + 2ab), A = a + b + c, B = b/A. From the above expression, when B is near one we should expect to need more inducing features in order to bound the KL-divergence. This occurs when the observed data spans many kernel lengthscales. For any lengthscale, the eigenvalues decay exponentially. This asymptotic decay is closely connected to the smoothness of sample functions from the prior; see Rasmussen and Williams (2006, Chapters 4 and 7). It has previously been shown that truncating the kernel eigenbasis at an M such that λ_M ≈ σ²_noise/N leads to almost no loss in model performance (Ferrari-Trecate et al., 1999). We prove the following:

Theorem 2  Suppose the x_i are drawn i.i.d. from N(0, s₁²), and consider sparse inference with eigenfunction inducing features defined with respect to q(x) = N(0, s₂²) with 2s₁² < s₂². For any ε > 0 and δ ∈ (0, 1), there exists an N_{s₁,s₂} such that for all N > N_{s₁,s₂}, inference with a set of M = c_{s₁,s₂} log N features results in

Pr(KL(Q ‖ P̂) > ε) < δ.

While the full proof is in Appendix B, we suggest why this should be true in the following section.

4. Bounds on the Trace using Eigenfunction Features
Using the eigenfunction features, deriving a bound on t is straightforward. From (8),

[K_uf^T K_uf]_{i,j} = Σ_{m=0}^{M−1} λ_m φ_m(x_i) φ_m(x_j).

Combining this with Mercer's theorem yields the following lemma:

Lemma 3  Suppose the training data consist of N i.i.d. random variables drawn according to a probability density p(x), and that we use M eigenfunction inducing features defined with respect to a density q(x) such that k = max_x p(x)/q(x) < ∞. Then

lim_{N→∞} (1/N) tr(K_ff − Q_ff) = Σ_{m=M}^∞ λ_m E_p[φ_m(x_i)²] ≤ k Σ_{m=M}^∞ λ_m.    (10)

For the squared exponential kernel, (10) is a geometric series and has a closed form (see Appendix B). An example of the bound on the expected value of the trace and the resulting bound on the KL-divergence is shown in Figure 1. Under the assumed input distribution in Lemma 3, as N tends to infinity for fixed M, the empirical eigenvalues in (5) approach the λ_m, so the expected value of (1/N) t is asymptotically tight. In order to rigorously prove Theorem 2 we need to understand the trace itself, not only the expected value of its entries. In order to achieve this, as well as to obtain bounds that remain effective as M grows as a function of N, we show that the square of the eigenfunctions has bounded second moment under p(x). The details are given in Appendix B.
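The following sketch is not from the paper; it is a numpy/scipy illustration under assumed settings (the helper names and the values of s₁, s₂, M, N are hypothetical choices). It constructs the squared exponential eigenvalues (9) and the corresponding Hermite eigenfunctions for a Gaussian feature density q = N(0, s₂²), forms Q_ff = K_uf^T K_uf directly (no matrix inverse, since K_uu = I), and compares (1/N) tr(K_ff − Q_ff) with the geometric-series bound k Σ_{m≥M} λ_m from Lemma 3.

```python
import numpy as np
from math import factorial
from scipy.special import eval_hermite

def se_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def se_eigensystem(s2, lengthscale=1.0, variance=1.0):
    """Eigenvalues/eigenfunctions of the SE kernel operator with respect to
    q = N(0, s2^2), following Rasmussen and Williams (2006, Sec. 4.3)."""
    a = 1.0 / (4.0 * s2 ** 2)
    b = 1.0 / (2.0 * lengthscale ** 2)
    c = np.sqrt(a ** 2 + 2.0 * a * b)
    A = a + b + c
    B = b / A
    def lam(m):
        return variance * np.sqrt(2.0 * a / A) * B ** m
    def phi(m, x):
        # Eigenfunctions normalised so that E_q[phi_m(x)^2] = 1.
        norm = (4.0 * c * s2 ** 2) ** 0.25 / np.sqrt(2.0 ** m * factorial(m))
        return norm * eval_hermite(m, np.sqrt(2.0 * c) * x) * np.exp(-(c - a) * x ** 2)
    return lam, phi, B

rng = np.random.default_rng(2)
N, M = 500, 15
s1, s2 = 1.0, 1.6            # data std s1, feature-measure std s2 (2*s1^2 < s2^2)
x = rng.normal(0.0, s1, size=N)

lam, phi, B = se_eigensystem(s2)
K_ff = se_kernel(x, x)

# K_uu = I for eigenfunction features, so Q_ff is just K_uf^T K_uf.
K_uf = np.stack([np.sqrt(lam(m)) * phi(m, x) for m in range(M)])   # shape (M, N)
Q_ff = K_uf.T @ K_uf
trace_per_point = np.trace(K_ff - Q_ff) / N

# Lemma 3 bound: k * sum_{m >= M} lambda_m, a geometric series.
k = s2 / s1                  # max_x p(x)/q(x) for these two Gaussians
tail = lam(M) / (1.0 - B)
print(f"(1/N) tr(K_ff - Q_ff) = {trace_per_point:.4e}")
print(f"Lemma 3 bound k*tail  = {k * tail:.4e}")
```

Because K_uu is the identity here, no factorisation of an M × M matrix is needed; the cost of forming Q_ff is dominated by evaluating the M eigenfunctions at the N inputs.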
Figure 1: Log-linear plot of the bounds (blue) on the trace (left) and on the KL-divergence (right), plotted against the actual values (yellow), for a synthetic data set with Gaussian-distributed inputs and a squared exponential kernel.

References

Bengt von Bahr and Carl-Gustav Esseen. Inequalities for the r-th absolute moment of a sum of random variables, 1 ≤ r ≤ 2. The Annals of Mathematical Statistics, 36:299–303, 1965.

David Burt, Carl Edward Rasmussen, and Mark van der Wilk. Sparse variational Gaussian process inference with eigenfunction inducing features. Submitted, October 2018.

Pafnutii Lvovich Chebyshev. Des valeurs moyennes. Journal de Mathématiques Pures et Appliquées, 12:177–184, 1867.

Giancarlo Ferrari-Trecate, Christopher K. I. Williams, and Manfred Opper. Finite-dimensional approximation of Gaussian processes. In Advances in Neural Information Processing Systems, pages 218–224, 1999.

Alex Gittens and Michael Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, Atlanta, Georgia, USA, June 2013. PMLR.

Izrail Solomonovich Gradshteyn and Iosif Moiseevich Ryzhik. Table of Integrals, Series, and Products. Academic Press, 2014.

Miguel Lázaro-Gredilla and Aníbal Figueiras-Vidal. Inter-domain Gaussian processes for sparse inference using inducing features. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22. Curran Associates, Inc., 2009.

Alexander G. de G. Matthews, James Hensman, Richard Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. Journal of Machine Learning Research, 51:231–239, 2016.
Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, 2009.

Michalis K. Titsias. Variational inference for Gaussian and determinantal point processes. Workshop on Advances in Variational Inference (NIPS), December 2014.

Richard Eric Turner and Maneesh Sahani. Two problems with variational expectation maximisation for time series models. In Bayesian Time Series Models. Cambridge University Press, 2011.

Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.

Appendix A. Proof of Lemma 1

Proof  The proof is similar in spirit to that of (3) provided in Titsias (2014). Let R = Q_ff + σ²_noise I. The log-determinant terms in L_lower and L_upper cancel, leaving

L_upper − L_lower = t/(2σ²_noise) + (1/2)(y^T R^{−1} y − y^T (R + tI)^{−1} y)
                  = t/(2σ²_noise) + (1/2) y^T (R^{−1} − (R + tI)^{−1}) y.    (11)

Since Q_ff is symmetric positive semidefinite, R is positive definite with eigenvalues bounded below by σ²_noise. Write R = U Λ U^T, where U is unitary and Λ is a diagonal matrix with non-increasing diagonal entries γ_1 ≥ γ_2 ≥ … ≥ γ_N ≥ σ²_noise. We can rewrite the second term (ignoring the factor of one half) in (11) as

y^T (U Λ^{−1} U^T − U (Λ + tI)^{−1} U^T) y = (U^T y)^T (Λ^{−1} − (Λ + tI)^{−1}) (U^T y).

Now define z = U^T y. Since U is unitary, ‖z‖ = ‖y‖. Then

(U^T y)^T (Λ^{−1} − (Λ + tI)^{−1}) (U^T y) = z^T (Λ^{−1} − (Λ + tI)^{−1}) z = Σ_i z_i² t/(γ_i² + γ_i t) ≤ ‖y‖² t/(γ_N² + γ_N t).

The last inequality comes from noting that the fraction in the sum attains its maximum when γ_i is minimized. Since σ²_noise is a lower bound on the smallest eigenvalue of R, we have

y^T (R^{−1} − (R + tI)^{−1}) y ≤ ‖y‖² t/(σ⁴_noise + σ²_noise t),    (12)

from which Lemma 1 follows.
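A quick numerical check of the inequality chain above (a standalone numpy sketch, not from the paper; the matrix sizes and constants are illustrative): for a random PSD matrix standing in for Q_ff and R = Q_ff + σ²_noise I, the quadratic-form gap is bounded by ‖y‖² t/(σ⁴_noise + σ²_noise t).

```python
import numpy as np

rng = np.random.default_rng(3)
N, noise_var = 50, 0.1

# Random low-rank PSD matrix standing in for Q_ff.
W = rng.normal(size=(N, 10))
Q_ff = W @ W.T
R = Q_ff + noise_var * np.eye(N)

y = rng.normal(size=N)
t = abs(rng.normal()) * 5.0          # any t > 0 works for this inequality

gap = y @ np.linalg.solve(R, y) - y @ np.linalg.solve(R + t * np.eye(N), y)
bound = (y @ y) * t / (noise_var ** 2 + noise_var * t)
print(f"y^T (R^-1 - (R+tI)^-1) y = {gap:.4f} <= {bound:.4f}")
assert gap <= bound + 1e-8
```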
Appendix B. Proof of Theorem 2

For inference with a squared exponential kernel and M eigenfunction inducing features defined with respect to µ = N(0, s₂²), a stronger version of Lemma 3 is the following (throughout, a, b, c, A, B and the λ_m are as in Section 3.3 with s = s₂):

Theorem 4  Suppose the x_i are drawn i.i.d. from N(0, s₁²) with s₁ ≤ s₂. Then, for all M ≥ 0,

lim_{N→∞} (1/N) t ≤ k λ_0 B^M/(1 − B),   with k = s₂/s₁,    (13)

holds almost surely. If moreover 2s₁² < s₂², then for any α > 0,

Pr( t > K N B^M ) ≤ ( s₂/√(s₂² − 2s₁²) − s₂²/(s₂² − s₁²) ) / (α² N),    (14)

where K := 1.19 √(4cs₂²) v_k √(2a/A) (1 − B)^{−1} (α + s₂/√(s₂² − s₁²)).

Remark 5  This proof could likely be generalized to all s₁ < s₂ by using a concentration inequality based on the r-th moment for 1 < r ≤ 2, for example those in von Bahr and Esseen (1965), in place of Chebyshev's inequality.

The proof of the first part of the statement, (13), comes from the observation that if q is the density associated to µ and p is the density the x_i are drawn from, then max_x p(x)/q(x) = p(0)/q(0) = k. Note that here it is essential that s₁ ≤ s₂, or else this ratio would not be bounded as x → ∞. Then,

(1/N) t = (1/N) Σ_{m=M}^∞ λ_m Σ_{i=1}^N φ_m(x_i)² = Σ_{m=M}^∞ λ_m (1/N) Σ_{i=1}^N φ_m(x_i)².    (15)

The inner sum is an empirical average of i.i.d. random variables. By the strong law of large numbers,

(1/N) Σ_{i=1}^N φ_m(x_i)² → E_p[φ_m(x_i)²] = ∫ φ_m(x)² p(x) dx ≤ k ∫ φ_m(x)² q(x) dx = k   almost surely.    (16)
This shows that (13) holds as N tends to infinity for any fixed M. As the countable intersection of events that occur with probability one also occurs with probability one, this holds for all M simultaneously almost surely.

For the probabilistic bound, we begin by establishing a term-wise bound on the elements of K_ff − Q_ff that takes into account the locations of the training inputs.

Lemma 6  Let q_{i,j} denote the (i,j)-th entry of Q_ff and k_{i,j} = k(x_i, x_j) the (i,j)-th entry of K_ff. Then

k_{i,j} − q_{i,j} ≤ 1.19 √(4cs₂²) v_k √(2a/A) (B^M/(1 − B)) exp((x_i² + x_j²)/(4s₂²))    (17)

holds for all pairs x_i, x_j.

Proof  As in earlier proofs, we have

k_{i,j} − q_{i,j} = Σ_{m=M}^∞ λ_m φ_m(x_i) φ_m(x_j).

We now need to take the locations of the x_i into account in this bound. For the squared exponential kernel,

φ_m(x) = ((4cs₂²)^{1/4}/√(2^m m!)) H_m(√(2c) x) exp(−(c − a)x²),

where H_m is the m-th Hermite polynomial (Rasmussen and Williams, 2006); we have normalized the basis so that ‖φ_m‖_{L²(µ)} = 1. We use the following bound on Hermite functions, obtained by squaring the bound in Gradshteyn and Ryzhik (2014):

H_m(x_i) H_m(x_j) < 1.19 · 2^m m! e^{(x_i² + x_j²)/2}.

Expanding the definition of the φ_m, we obtain

k_{i,j} − q_{i,j} = √(4cs₂²) exp(a(x_i² + x_j²)) Σ_{m=M}^∞ λ_m [H_m(√(2c)x_i) e^{−cx_i²}][H_m(√(2c)x_j) e^{−cx_j²}]/(2^m m!)
 ≤ √(4cs₂²) exp(a(x_i² + x_j²)) Σ_{m=M}^∞ λ_m |H_m(√(2c)x_i) e^{−cx_i²}| |H_m(√(2c)x_j) e^{−cx_j²}|/(2^m m!)
 ≤ 1.19 √(4cs₂²) exp(a(x_i² + x_j²)) Σ_{m=M}^∞ λ_m
 = 1.19 √(4cs₂²) exp((x_i² + x_j²)/(4s₂²)) v_k √(2a/A) B^M/(1 − B).

The first inequality (the triangle inequality) is sharp on the terms affecting the trace, since when x_i = x_j the sum must be term-wise positive.
Let A_i = exp(x_i²/(2s₂²)). It remains to derive a probabilistic bound on S := Σ_{i=1}^N A_i. We first compute the moments of the A_i under the input distribution:

E[A_i^r] = (1/(√(2π) s₁)) ∫_R exp(r x²/(2s₂²)) exp(−x²/(2s₁²)) dx = s₂/√(s₂² − r s₁²),    (18)

for r s₁² < s₂². From this we deduce Var(A_i) = E[A_i²] − E[A_i]² = s₂/√(s₂² − 2s₁²) − s₂²/(s₂² − s₁²). Chebyshev's inequality (Chebyshev, 1867) tells us that for any α > 0,

Pr( S > N (α + s₂/√(s₂² − s₁²)) ) ≤ Var(A_1)/(α² N).    (19)

Theorem 4 follows.

In order to prove Theorem 2, choose M ≥ 3 log_{1/B} N = log_{1/B}(N³); then K N B^M ≤ K N^{−2}. For N sufficiently large, (14) shows that this is an upper bound on the trace with high probability. Using this upper bound in Lemma 1 gives Theorem 2 as a corollary.
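As a sanity check of (18) and (19), the following standalone numpy sketch (not from the paper; the values of s₁, s₂, α and the trial counts are illustrative choices) simulates S = Σ_i A_i with A_i = exp(x_i²/(2s₂²)) and x_i ~ N(0, s₁²), and compares the empirical tail probability with the Chebyshev bound.

```python
import numpy as np

rng = np.random.default_rng(4)
s1, s2 = 1.0, 1.6                    # requires 2*s1^2 < s2^2 for a finite variance
N, n_trials, alpha = 500, 2000, 0.05

mean_A = s2 / np.sqrt(s2 ** 2 - s1 ** 2)              # E[A_i], eq. (18) with r = 1
var_A = s2 / np.sqrt(s2 ** 2 - 2 * s1 ** 2) - s2 ** 2 / (s2 ** 2 - s1 ** 2)

x = rng.normal(0.0, s1, size=(n_trials, N))
S = np.exp(x ** 2 / (2 * s2 ** 2)).sum(axis=1)

empirical = np.mean(S > N * (alpha + mean_A))
chebyshev = var_A / (alpha ** 2 * N)                  # right-hand side of (19)
print(f"empirical tail probability = {empirical:.5f}")
print(f"Chebyshev bound (19)       = {chebyshev:.5f}")
```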