Lecture 1c: Gaussian Processes for Regression

Lecture 1c: Gaussian Processes for Regression
Cédric Archambeau
Centre for Computational Statistics and Machine Learning, Department of Computer Science, University College London
c.archambeau@cs.ucl.ac.uk
Advanced Topics in Machine Learning (MSc in Intelligent Systems), January 2008

Today's plan

- The equivalent kernel
- Definition of a Gaussian process
- Sampling from a Gaussian process
- Gaussian processes for regression
- Parameter inference
- Automatic relevance determination
- Covariance functions
- Sparse extensions
- Non-Gaussian likelihoods

Probabilistic linear regression and the equivalent kernel

The predictive mean is defined by a weighted combination of the targets:

y(x, µ_w) = µ_w^T φ(x) = σ^{-2} t^T Φ Σ_w φ(x) = Σ_n σ^{-2} φ(x)^T Σ_w φ(x_n) t_n = Σ_n k(x, x_n) t_n.

The equivalent kernel k(x, x') is implicitly defined in terms of the basis functions φ(·) and is data dependent through Σ_w. It can be reformulated in terms of an inner product:

k(x, x') = ψ(x)^T ψ(x'),   ψ(x) ≡ σ^{-1} Σ_w^{1/2} φ(x).

It determines the correlation between (often nearby) input pairs:

cov[y(x, w), y(x', w)] = φ(x)^T cov[w] φ(x') = σ² k(x, x').

The idea is to define the covariance function or kernel directly, instead of choosing basis functions which induce an implicit kernel.
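As a concrete illustration (not part of the original lecture), here is a minimal NumPy sketch of the equivalent kernel for Bayesian linear regression; the Gaussian basis functions, their centres, the noise variance and the prior precision alpha are all illustrative assumptions.

```python
# Minimal sketch (illustrative values): the equivalent kernel
# k(x*, x_n) = sigma^{-2} phi(x*)^T Sigma_w phi(x_n) of Bayesian linear
# regression with Gaussian basis functions.
import numpy as np

def phi(x, centres, width=0.2):
    """Gaussian basis functions evaluated at the inputs x, shape [N, M]."""
    x = np.atleast_1d(x)
    return np.exp(-0.5 * (x[:, None] - centres[None, :])**2 / width**2)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 30)
centres = np.linspace(-1, 1, 12)
sigma2, alpha = 0.1**2, 1.0          # noise variance and prior precision (assumed)

Phi = phi(x_train, centres)                                        # N x M
Sigma_w = np.linalg.inv(alpha * np.eye(len(centres)) + Phi.T @ Phi / sigma2)

def equivalent_kernel(x_star, x_n):
    return (phi(x_star, centres) @ Sigma_w @ phi(x_n, centres).T / sigma2)[0, 0]

# The predictive mean at x* is sum_n k(x*, x_n) t_n; well inside the data
# region the kernel weights typically sum to roughly one.
weights = np.array([equivalent_kernel(0.0, xn) for xn in x_train])
print(weights.sum())
```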

Gaussian process

A multivariate Gaussian distribution:
- defines a probability density (based on correlations) over D random variables;
- is defined by a mean vector µ and a covariance matrix Σ: y ≡ (y_1, ..., y_D)^T ~ N(µ, Σ).

A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables. It defines a probability measure over random functions. (Informally, a function can be viewed as an infinitely long vector.) It is defined by a mean function m(x) and a covariance function k(x, x'):

y(·) ~ GP(m(·), k(·, ·)).

The joint distribution over a finite subset of variables is a consistent finite-dimensional Gaussian! (See Lecture 1a.)

Example of a covariance function

The squared exponential kernel is defined as

k(x, x') = c exp{ -(x - x')² / (2 l²) },

where c > 0 and l > 0 are hyperparameters. It is a valid kernel as it leads to a positive semidefinite Gram matrix K ∈ R^{N×N} for any possible choice of the set {x_n}_{n=1}^N. It is a stationary kernel, i.e. it depends only on the difference x - x'. It corresponds to projecting the input data into an infinite-dimensional feature space (see e.g. Shawe-Taylor and Cristianini, 2004). Alternatively, it corresponds to using an infinite number of basis functions (not just ones centred on the training points).
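A minimal sketch of this kernel in NumPy, with arbitrary hyperparameter values and a numerical check that the resulting Gram matrix is positive semidefinite (the function name sq_exp and the 2l² convention follow the reconstruction above):

```python
import numpy as np

def sq_exp(X1, X2, c=1.0, ell=0.5):
    """Squared exponential kernel k(x, x') = c * exp(-(x - x')^2 / (2 ell^2))."""
    d = X1[:, None] - X2[None, :]
    return c * np.exp(-0.5 * d**2 / ell**2)

x = np.random.default_rng(1).uniform(-1, 1, 50)
K = sq_exp(x, x)
# Positive semidefinite up to numerical round-off:
print(np.linalg.eigvalsh(K).min() > -1e-8)
```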

Consider a basis function which is an infinite sum of squared exponentials weighted by Gaussian random variables:

ψ(x) = ∫ w(u) e^{-(x-u)²} du,   where w(u) ~ N(0, 1) for all u.

The resulting covariance function defines the squared exponential kernel:

k(x, x') = ⟨ψ(x) ψ(x')⟩ = ∫ e^{-(x-u)²} e^{-(x'-u)²} du ∝ e^{-(x-x')²/2}   (convolution of two Gaussians).

Sampling random functions from a Gaussian process

Sequential sampling:

p(y) = Π_{n≥1} p(y_n | y_{\n}) = Π_{n≥1} N(y_n | m̃_n, σ̃_n²),   where y_{\n} ≡ (y_{n-1}, ..., y_1)^T.

Repeat for n ≥ 1: generate x_n; draw a sample z_n from N(0, 1); compute the function value associated to x_n using y_n = σ̃_n z_n + m̃_n.

Batch sampling: y ~ N(m, K). Generate a set of inputs {x_n}_{n=1}^N, draw N samples z from N(0, I) and compute the function values using y = L^T z + m, where L is the upper triangular Cholesky factor of the kernel matrix K (so that K = L^T L).
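The batch recipe translates almost line by line into NumPy; the sketch below uses numpy.linalg.cholesky, which returns the lower triangular factor L with K = L L^T, so the sample is L z + m rather than L^T z + m. The kernel, input grid and jitter term are arbitrary choices.

```python
import numpy as np

def sq_exp(X1, X2, c=1.0, ell=0.5):
    d = X1[:, None] - X2[None, :]
    return c * np.exp(-0.5 * d**2 / ell**2)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)                      # inputs {x_n}
K = sq_exp(x, x) + 1e-10 * np.eye(len(x))        # jitter for numerical stability
L = np.linalg.cholesky(K)                        # lower triangular, K = L L^T
m = np.zeros(len(x))                             # zero mean function

Z = rng.standard_normal((len(x), 3))             # three sets of N(0, 1) draws
samples = (L @ Z).T + m                          # three random functions
```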

The function values y_n and y_{\n} are jointly Gaussian:

p(y_n, y_{\n}) = N( [m(x_n), m_{\n}]^T, [[k(x_n, x_n), k_n^T], [k_n, K_{\n}]] ) = N(m, K).

The conditional p(y_n | y_{\n}) is then also Gaussian, with the conditional mean and the conditional variance respectively given by

m̃_n = m(x_n) + k_n^T K_{\n}^{-1} (y_{\n} - m_{\n}),
σ̃_n² = k(x_n, x_n) - k_n^T K_{\n}^{-1} k_n.

Example (demo: gpsampl fun)

Figure: Three random functions generated from a GP with m(x) = 0 and a squared exponential covariance function (c = 1 and l = 0.5).

Gaussian processes for regression

The choice of the kernel defines a prior process (and a prior measure over functions):

y(·) ~ GP(0, k(·, ·)).

We assume a finite number of observations and iid Gaussian noise. The likelihood is given by

t | y, σ² ~ N(y, σ² I_N),

where y ≡ (y(x_1), ..., y(x_N))^T are the latent function values. The posterior process is again a Gaussian process:

y(·) | t, σ² ~ GP(m̃(·), k̃(·, ·)),

where

m̃(x) = k^T(x) (K + σ² I_N)^{-1} t,
k̃(x, x') = k(x, x') - k^T(x) (K + σ² I_N)^{-1} k(x').

Any latent function value y(x) is jointly Gaussian with the finite subset y:

p(y, y(x)) = N( 0, [[K, k(x)], [k^T(x), k(x, x)]] ),

where k(x) ≡ (k(x, x_1), ..., k(x, x_N))^T. The mean and the variance of the conditional Gaussian p(y(x) | y) are given by

µ(x) = k^T(x) K^{-1} y,
κ(x, x') = k(x, x') - k^T(x) K^{-1} k(x').

We have p(y) = N(0, K) and p(t | y) = N(y, σ² I_N), such that

p(y | t) = N(σ^{-2} Σ t, Σ),   where Σ = (K^{-1} + σ^{-2} I_N)^{-1}.

Hence, the marginal posterior p(y(x) | t) = ∫ p(y(x) | y) p(y | t) dy is a Gaussian with mean and variance given by

m̃(x) = k^T(x) (K + σ² I_N)^{-1} t,
k̃(x, x) = k(x, x) - k^T(x) (K + σ² I_N)^{-1} k(x),

where the Woodbury identity was invoked.
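A minimal sketch of these predictive equations on made-up data (toy targets, arbitrary squared exponential hyperparameters and noise variance; not the lecture's demo):

```python
import numpy as np

def sq_exp(X1, X2, c=1.0, ell=0.5):
    d = X1[:, None] - X2[None, :]
    return c * np.exp(-0.5 * d**2 / ell**2)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 20)
t = np.sin(3 * x_train) + 0.1 * rng.standard_normal(len(x_train))
sigma2 = 0.1**2

A = sq_exp(x_train, x_train) + sigma2 * np.eye(len(x_train))    # K + sigma^2 I_N
alpha = np.linalg.solve(A, t)                                   # (K + sigma^2 I_N)^{-1} t

x_star = np.linspace(-1, 1, 100)
k_star = sq_exp(x_star, x_train)                                # rows are k(x*)^T
mean = k_star @ alpha                                           # posterior mean
var = (np.diag(sq_exp(x_star, x_star))
       - np.sum(k_star * np.linalg.solve(A, k_star.T).T, axis=1))  # posterior variance
# Predictions for noisy targets t* add sigma2 on top of `var`.
```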

Example (demo: gpsampl fun)

(a) Prior. (b) Posterior.

Figure: Three random functions generated from (a) the prior GP and (b) the posterior GP. An observation is indicated by a +, the mean function by a dashed line and the 3 standard deviation error bars by the shaded regions. We used a squared exponential covariance function (c = 1 and l = 0.5).

Learning the parameters by type II ML

Let us denote the kernel parameters by θ. We view the latent function values as nuisance parameters and maximise the log-marginal likelihood with respect to σ² and θ. The log-marginal likelihood is given by

ln p(t | σ², θ) = -(N/2) ln 2π - (1/2) ln |K(θ) + σ² I_N| - (1/2) t^T (K(θ) + σ² I_N)^{-1} t,

where the log-determinant term acts as a complexity penalty and the quadratic term measures the data fit. The noise variance σ² and the kernel parameters θ can be learned by means of gradient ascent techniques (see Nocedal and Wright):

∂ ln p(t | σ², θ) / ∂σ² = -(1/2) tr{(K + σ² I_N)^{-1}} + (1/2) ν^T ν,
∂ ln p(t | σ², θ) / ∂θ_k = -(1/2) tr{ ((K + σ² I_N)^{-1} - ν ν^T) ∂K/∂θ_k },

where ν ≡ (K(θ) + σ² I_N)^{-1} t. The negative log-marginal surface is non-convex (there is no guarantee of attaining a global minimum) and the computational complexity of its evaluation is O(N³).

p(t | σ², θ) = ∫ p(t | y, σ²) p(y | θ) dy = ∫ N(t | y, σ² I_N) N(y | 0, K(θ)) dy = N(t | 0, K(θ) + σ² I_N).
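A sketch of evaluating the log-marginal likelihood and its gradient with respect to σ² via a Cholesky factorisation; this is standard practice, but the function below and its interface are my own illustration.

```python
import numpy as np

def log_marginal(K, t, sigma2):
    """ln p(t | sigma^2, theta) and its derivative w.r.t. sigma^2."""
    N = len(t)
    L = np.linalg.cholesky(K + sigma2 * np.eye(N))
    nu = np.linalg.solve(L.T, np.linalg.solve(L, t))      # (K + sigma^2 I)^{-1} t
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    lml = -0.5 * N * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * t @ nu
    A_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(N)))
    grad_sigma2 = -0.5 * np.trace(A_inv) + 0.5 * nu @ nu
    return lml, grad_sigma2
```

These values can be handed to any off-the-shelf gradient-based optimiser (typically applied to the negative log-marginal), together with the analogous gradients with respect to the kernel parameters θ_k.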

Predictive distribution

The predictive distribution at x_* for type II ML estimates of the hyperparameters is given by

p(t_* | t) ≈ p(t_* | t, σ²_ML, θ_ML) = N(m̃_ML(x_*), k̃_ML(x_*, x_*) + σ²_ML).

The predictive variance has three components:
- the prior variance k_ML(x_*, x_*);
- the term k^T_ML(x_*) (K_ML + σ²_ML I_N)^{-1} k_ML(x_*), which reduces the prior uncertainty and tells us how much is explained by the data;
- the noise σ²_ML on the observations.

Note that the predictive variance is independent of the targets!

p(t_* | t, σ²_ML, θ_ML) = ∫ p(t_* | y(x_*), σ²_ML) p(y(x_*) | t, σ²_ML, θ_ML) dy(x_*)   [the second factor is the posterior GP]
= ∫ N(t_* | y(x_*), σ²_ML) N(y(x_*) | m̃_ML(x_*), k̃_ML(x_*, x_*)) dy(x_*)
= N(t_* | m̃_ML(x_*), σ²_ML + k̃_ML(x_*, x_*)).

Sinc example revisited

(a) Variational linear regression. (b) GP regression.

Figure: Comparison of the optimal solutions found by (a) variational linear regression with squared exponential basis functions (λ = .495) and by (b) Gaussian process regression with a squared exponential kernel (λ = .84).

Automatic relevance determination (ARD)

Can we select the relevant input dimensions from the data? Consider a more general form of the squared exponential kernel:

k(x, x') = c exp{ -(1/2) Σ_{d=1}^D (x_d - x'_d)² / l_d² },

where the length scales {l_d}_{d=1}^D are allowed to be different. The characteristic length scale l_d measures the distance along x_d over which the function values become uncorrelated. Hence, x_d is not relevant if 1/l_d is small.

In general, ARD can be implemented by imposing hierarchical priors on the parameters. For example, ARD is used in relevance vector machines for achieving sparsity: a prior with a different inverse scale α_m is imposed on each weight w_m.
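A minimal sketch of the ARD squared exponential kernel (the function name and length-scale values are arbitrary); a large l_d effectively switches dimension d off:

```python
import numpy as np

def ard_sq_exp(X1, X2, c=1.0, ell=(0.5, 5.0)):
    """X1: [N, D], X2: [M, D]; one length scale per input dimension."""
    ell = np.asarray(ell, dtype=float)
    d2 = ((X1[:, None, :] - X2[None, :, :]) / ell)**2
    return c * np.exp(-0.5 * d2.sum(axis=-1))

X = np.random.default_rng(0).uniform(-1, 1, (5, 2))
K = ard_sq_exp(X, X)   # with ell = (0.5, 5.0) the second dimension barely matters
```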

Example

Figure: Latent function values y(x) drawn at random from GPs with two-dimensional inputs, for three different distance measures in the squared exponential covariance function. In (a) both input dimensions are equally relevant, while in (b) the function varies much less rapidly along one of them, so that dimension is nearly irrelevant; in (c) a general distance measure Λ picks out the direction of most rapid variation. (Reproduced from C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006, www.gaussianprocess.org/gpml.)

Covariance functions

In order to be valid, a kernel should satisfy Mercer's condition (see e.g. Shawe-Taylor and Cristianini, 2004). In practice, we require the kernel to induce a symmetric and positive semidefinite kernel matrix.

Examples of other kernels: non-stationary kernels (e.g. the sigmoidal kernel); kernels for structured inputs (e.g. string kernels).

Some rules for kernel design:

k(x, x') = c k_1(x, x'),
k(x, x') = k_1(x, x') + k_2(x, x'),
k(x, x') = k_1(x, x') k_2(x, x'),
k(x, x') = f(x) k_1(x, x') f(x'),
...

where c > 0 is a constant and f(·) is a deterministic function. An interesting open question is how to learn (the type of) the kernel.
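The sketch below checks numerically that kernels built with these rules (scaling, sums and products of simpler kernels) still yield positive semidefinite Gram matrices; the particular building blocks are arbitrary choices:

```python
import numpy as np

def sq_exp(X1, X2, ell=0.5):
    return np.exp(-0.5 * (X1[:, None] - X2[None, :])**2 / ell**2)

def linear(X1, X2):
    return X1[:, None] * X2[None, :]

x = np.random.default_rng(2).uniform(-1, 1, 40)
K = 2.0 * sq_exp(x, x) + sq_exp(x, x, ell=2.0) * linear(x, x)
print(np.linalg.eigvalsh(K).min() > -1e-8)   # still positive semidefinite
```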

Periodic covariance functions

A periodic signal can be constructed using the following warping function:

u(x) = (sin x, cos x)^T.

Plugging u into the squared exponential kernel leads to a periodic kernel:

k(x, x') = c exp{ -2 sin²((x - x')/2) / l² },

where we used the fact that ||u(x) - u(x')||² = 4 sin²((x - x')/2).

Figure: Three random functions generated with a periodic kernel (c = 1 and l = 0.5).
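A quick numerical check (with arbitrary parameter values) that the warping construction and the closed-form periodic kernel above coincide:

```python
import numpy as np

def periodic(x1, x2, c=1.0, ell=0.5):
    d = x1[:, None] - x2[None, :]
    return c * np.exp(-2.0 * np.sin(0.5 * d)**2 / ell**2)

def warped_sq_exp(x1, x2, c=1.0, ell=0.5):
    u1 = np.stack([np.sin(x1), np.cos(x1)], axis=-1)
    u2 = np.stack([np.sin(x2), np.cos(x2)], axis=-1)
    d2 = ((u1[:, None, :] - u2[None, :, :])**2).sum(axis=-1)
    return c * np.exp(-0.5 * d2 / ell**2)

x = np.linspace(-5, 5, 30)
print(np.allclose(periodic(x, x), warped_sq_exp(x, x)))   # True
```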

Rational quadratic covariance functions

The rational quadratic kernel is defined as follows:

k(x, x') = ( 1 + (x - x')² / (ν l²) )^{-(ν+D)/2},

where ν > 0 is the shape parameter, l > 0 the scale parameter and D is the dimension of the input space. The rational quadratic kernel (or Student-t kernel) corresponds to an infinite mixture of scaled squared exponentials:

∫ p(r | u, l) p(u | ν) du = ∫ N(r | 0, l²/u) G(u | ν/2, ν/2) du ∝ ( 1 + r² / (ν l²) )^{-(ν+D)/2},

where r ≡ x - x'. The shape parameter ν defines the thickness of the kernel tails. The squared exponential is recovered for ν → ∞.
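A sketch of the rational quadratic kernel for one-dimensional inputs, assuming the -(ν+D)/2 exponent reconstructed above, together with a numerical check that it approaches the squared exponential for large ν:

```python
import numpy as np

def rational_quadratic(x1, x2, ell=0.5, nu=3.0, D=1):
    r2 = (x1[:, None] - x2[None, :])**2
    return (1.0 + r2 / (nu * ell**2))**(-0.5 * (nu + D))

def sq_exp(x1, x2, ell=0.5):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ell**2)

x = np.linspace(-1, 1, 20)
# The gap shrinks towards zero as nu grows:
print(np.abs(rational_quadratic(x, x, nu=1e6) - sq_exp(x, x)).max())
```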

Example revisited

(a)-(c) Prior and (d)-(f) posterior, for increasing values of the shape parameter ν (with ν → ∞ in the right-most panels).

Figure: Three random functions generated from the prior GP (top row) and the posterior GP (bottom row) with the rational quadratic kernel (l = 0.5). The observations are indicated by +, the means by dashed lines and the 3 standard deviation error bars by the shaded regions.

Matérn covariance functions

The Matérn kernel is given by

k(x, x') = (2^{1-ν} / Γ(ν)) ( √(2ν) |x - x'| / l )^ν K_ν( √(2ν) |x - x'| / l ),

where ν > 0 and l > 0. The function K_ν(·) is the modified Bessel function of the second kind. The order ν defines the roughness of the random functions (they are k times mean-square differentiable if and only if ν > k):
- We obtain the Laplacian or Ornstein-Uhlenbeck kernel for ν = 1/2.
- For ν = p + 1/2 with p ∈ N, the covariance function takes the simple form of a product of an exponential and a polynomial of order p:

k(x, x') = exp{ -√(2ν) |x - x'| / l } (p! / (2p)!) Σ_{i=0}^p ( (p + i)! / (i! (p - i)!) ) ( √(8ν) |x - x'| / l )^{p-i}.

- We recover the squared exponential kernel for ν → ∞.

There is in general no closed-form solution for the derivative of K_ν(·) with respect to ν. The Ornstein-Uhlenbeck (OU) process is a mathematical description of the velocity of a particle undergoing Brownian motion.
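For the half-integer orders mentioned above, the Matérn kernel has well-known closed forms; the sketch below implements the cases p = 0 (Ornstein-Uhlenbeck), p = 1 and p = 2 for one-dimensional inputs, with an arbitrary length scale:

```python
import numpy as np

def matern(x1, x2, ell=0.5, p=1):
    """Matérn kernel of order nu = p + 1/2 for p in {0, 1, 2}."""
    r = np.abs(x1[:, None] - x2[None, :])
    nu = p + 0.5
    a = np.sqrt(2.0 * nu) * r / ell
    if p == 0:
        poly = 1.0                      # nu = 1/2: Ornstein-Uhlenbeck, exp(-r/ell)
    elif p == 1:
        poly = 1.0 + a                  # nu = 3/2
    elif p == 2:
        poly = 1.0 + a + a**2 / 3.0     # nu = 5/2
    else:
        raise NotImplementedError("only p = 0, 1, 2 in this sketch")
    return poly * np.exp(-a)
```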

Example revisited

(a)-(c) Prior and (d)-(f) posterior, for increasing values of the roughness parameter ν (with ν → ∞ in the right-most panels).

Figure: Three random functions generated from the prior GP (top row) and the posterior GP (bottom row) with the Matérn kernel (l = 0.5). The observations are indicated by +, the means by dashed lines and the 3 standard deviation error bars by the shaded regions.

Matérn kernel vs rational quadratic kernel

(a) Rational quadratic. (b) Matérn.

Figure: Comparison of the rational quadratic and the Matérn kernel with unit length scale (l = 1) for three values of, respectively, the shape and the roughness parameter. Both kernels are less localised than the squared exponential. Forcing the random latent functions to be infinitely differentiable might be unrealistic in practice.

Sparse Gaussian processes

The main problem with GPs is that exact inference is O(N³), where N is the number of training inputs.

1. Subset of training data: the data points in the active set are selected in a greedy fashion according to some heuristic:
- random selection;
- vector quantisation or clustering (e.g. K-means);
- maximum entropy score (Lawrence et al., 2003): H[p(y_n | y_{\n})] - H[p(y_n | y)];
- maximum information gain (Seeger et al., 2003): KL[p(y_n | y) || p(y_n | y_{\n})];
- ...
Predictions are made based on the active set only.

2. Subset of regressors: consider a set of inducing variables u ∈ R^M, which are deterministically related to the latent function values:

y(x) = k_u^T(x) K_u^{-1} u,

where k_u(x) is the vector of covariances between x and the inducing inputs and K_u is their M × M kernel matrix. The GP prior is replaced by a degenerate GP with the covariance function

k_SoR(x, x') = ⟨y(x) y(x')⟩ = k_u^T(x) K_u^{-1} k_u(x').

The (inputs of the) inducing variables are selected from the training data according to some simple heuristic.
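A rough sketch of subset-of-regressors predictions on synthetic data, with the inducing inputs picked by the simplest heuristic above (random selection); the kernel, noise level, M and all names are illustrative assumptions:

```python
import numpy as np

def sq_exp(X1, X2, ell=0.5):
    return np.exp(-0.5 * (X1[:, None] - X2[None, :])**2 / ell**2)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
t = np.sin(3 * x) + 0.1 * rng.standard_normal(len(x))
sigma2, M = 0.1**2, 20

xu = rng.choice(x, M, replace=False)              # inducing inputs (random heuristic)
Kuu = sq_exp(xu, xu) + 1e-8 * np.eye(M)
Kuf = sq_exp(xu, x)                               # M x N cross-covariances

A = sigma2 * Kuu + Kuf @ Kuf.T                    # only an M x M system to solve
x_star = np.linspace(-1, 1, 100)
mean = sq_exp(xu, x_star).T @ np.linalg.solve(A, Kuf @ t)   # SoR mean, O(N M^2)
```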

Sparse Gaussian processes (continued)

3. Projected process approximation (Csató and Opper, 2002): consider again a set of inducing variables u ∈ R^M. They are now related to the observations:

t_n | u ~ N(k_u^T(x_n) K_u^{-1} u, σ²),   u ~ N(0, K_u).

The information contained in the N observations is absorbed into the M inducing variables. This gives the same predictive mean as the subset of regressors, but a more realistic predictive variance (i.e. it grows when moving away from the observations).

4. Pseudo-inputs approximation (Snelson and Ghahramani, 2006): the approximate likelihood is chosen from a richer class:

t_n | u ~ N(k_u^T(x_n) K_u^{-1} u, k(x_n, x_n) - k_u^T(x_n) K_u^{-1} k_u(x_n) + σ²).

It can be shown that this choice leads to a (non-degenerate) GP with the covariance function

k_PI(x, x') = k_SoR(x, x') + δ(x, x') ( k(x, x') - k_SoR(x, x') ),

where δ(x, x') is the Kronecker delta.

Non-Gaussian noise

Assume the noise is non-Gaussian, but still iid. The likelihood factorises and takes the following form:

p(t | y, θ) ∝ exp{ -Σ_{n=1}^N V_n },

where V_n ≡ V_θ(t_n, y_n) is a nonlinear function parametrised by θ. Even for a GP prior, the posterior (non-Gaussian) process is intractable. We consider the variational Gaussian distribution q(y) = N(µ, Σ), which maximises the free energy (Opper and Archambeau, 2008)

F(q, θ) = ⟨ln p(t, y | θ)⟩_{q(y)} + H[q(y)].

The stationary points are given by

µ = K ν,   ν ≡ -( ..., ∂⟨V_n⟩_{q_n}/∂µ_n, ... )^T,
Σ = (K^{-1} + Λ)^{-1},   Λ ≡ 2 diag{ ..., ∂⟨V_n⟩_{q_n}/∂Σ_nn, ... },

where q_n ≡ q(y_n) is the marginal Gaussian. The number of parameters to optimise (e.g. by gradient descent) is O(N)!

Non-Gaussian noise (continued)

If y ~ GP, then the conditional mean function and the conditional variance function are given by

µ(x) = k^T(x) K^{-1} y,
κ(x, x') = k(x, x') - k^T(x) K^{-1} k(x').

The approximate posterior process is a Gaussian process,

y(·) | t ~ ∫ p(y(·) | y) q(y) dy = GP(m̃(·), k̃(·, ·)),

with mean function and covariance function given by

m̃(x) = k^T(x) ν,
k̃(x, x') = k(x, x') - k^T(x) (K + Λ^{-1})^{-1} k(x'),

where the Woodbury identity was invoked. The log-marginal likelihood is intractable, but the noise and the kernel parameters can be estimated by maximising F.

Sinc example with Laplace noise

The likelihood is defined as p(t_n | y_n, η) = (η/2) exp{ -η |t_n - y_n| }, with η > 0.

(a) Standard GP. (b) Variational GP.

Figure: Sinc example with Laplace noise. Both GPs use an optimised squared exponential kernel. Note that the shaded regions indicate the standard deviation error bars.

Useful Gaussian identities (see Opper and Archambeau (2008) for a proof):

∂⟨V_n⟩_{q_n}/∂µ_n = ⟨∂V_n/∂y_n⟩_{q_n} = ⟨(y_n - µ_n) V_n⟩_{q_n} / Σ_nn,
∂⟨V_n⟩_{q_n}/∂Σ_nn = (1/2) ⟨∂²V_n/∂y_n²⟩_{q_n} = ( ⟨(y_n - µ_n)² V_n⟩_{q_n} / Σ_nn - ⟨V_n⟩_{q_n} ) / (2 Σ_nn).
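As a heuristic sketch (not the lecture's code) of the variational Gaussian updates for Laplace noise: for V_n = η |t_n - y_n| the Gaussian identities above give ⟨∂V_n/∂y_n⟩ = η (1 - 2Φ((t_n - µ_n)/√Σ_nn)) and ⟨∂²V_n/∂y_n²⟩ = 2η N(t_n | µ_n, Σ_nn), which can be plugged into the fixed-point equations for µ and Λ. The data, η, the damping factor and the iteration count below are arbitrary choices, and a serious implementation would monitor the free energy for convergence.

```python
import numpy as np
from scipy.stats import norm

def sq_exp(X1, X2, ell=0.5):
    return np.exp(-0.5 * (X1[:, None] - X2[None, :])**2 / ell**2)

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 60)
t = np.sinc(x) + rng.laplace(scale=0.1, size=len(x))
eta = 10.0                                            # Laplace rate (assumed)

K = sq_exp(x, x) + 1e-6 * np.eye(len(x))
mu, Lam = np.zeros(len(x)), np.full(len(x), eta)      # crude initialisation

for _ in range(200):
    # Sigma = (K^{-1} + diag(Lam))^{-1}, computed without inverting K
    Sigma = np.linalg.solve(np.eye(len(x)) + K @ np.diag(Lam), K)
    s = np.sqrt(np.diag(Sigma))
    nu = eta * (2.0 * norm.cdf((t - mu) / s) - 1.0)     # -<dV_n/dy_n>
    Lam_new = 2.0 * eta * norm.pdf(t, loc=mu, scale=s)  # <d^2V_n/dy_n^2>
    mu = 0.8 * mu + 0.2 * (K @ nu)                      # damped fixed-point updates
    Lam = 0.8 * Lam + 0.2 * Lam_new
```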

Interpretation of the variational Gaussian approximation

Laplace approximation: a Gaussian density is fitted locally at a mode of the posterior and the covariance is built from the curvature of the log-posterior around this point:

0 = ∇_y ln p(t, y | θ) |_{y=µ},
Σ^{-1} = -∇_y ∇_y ln p(t, y | θ) |_{y=µ}.

Variational Gaussian approximation: the variational mean and the variational covariance can be rewritten in two different ways:

0 = ∇_µ ⟨ln p(t, y | θ)⟩_{q(y)} = ⟨∇_y ln p(t, y | θ)⟩_{q(y)},
Σ^{-1} = -∇_µ ∇_µ ⟨ln p(t, y | θ)⟩_{q(y)} = -⟨∇_y ∇_y ln p(t, y | θ)⟩_{q(y)}.

A Gaussian density is fitted globally, i.e. the conditions of the Laplace approximation hold on average. The variational Gaussian method is also equivalent to applying Laplace's method to the implicitly defined probability density q̃(µ) ∝ e^{⟨ln p(t, y | θ)⟩_{q(y)}}.

References

L. Csató and M. Opper, Sparse on-line Gaussian processes, Neural Computation 14:641-668, 2002.
C. M. Bishop: Pattern Recognition and Machine Learning. Springer, 2006.
J. Nocedal and S. J. Wright: Numerical Optimization. Springer.
M. Opper and C. Archambeau, The variational Gaussian approximation revisited, Neural Computation, 2008.
C. E. Rasmussen and C. K. I. Williams: Gaussian Processes for Machine Learning. MIT Press, 2006.
J. Shawe-Taylor and N. Cristianini: Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
E. Snelson and Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, NIPS 2005.
Tutorial on Gaussian processes at NIPS 2006 by C. E. Rasmussen.
The Matrix Cookbook by K. B. Petersen and M. S. Pedersen.