Lecture 1c: Gaussian Processes for Regression

Cédric Archambeau
Centre for Computational Statistics and Machine Learning
Department of Computer Science, University College London
c.archambeau@cs.ucl.ac.uk

Advanced Topics in Machine Learning (MSc in Intelligent Systems), January 8

Today's plan

- The equivalent kernel
- Definition of a Gaussian process
- Sampling from a Gaussian process
- Gaussian processes for regression
- Parameter inference
- Automatic relevance determination
- Covariance functions
- Sparse extensions
- Non-Gaussian likelihoods

Probabilistic linear regression and the equivalent kernel

The predictive mean is a weighted combination of the targets:

$$y(x, \mu_w) = \mu_w^\top \phi(x) = \sigma^{-2}\, t^\top \Phi\, \Sigma_w\, \phi(x) = \sum_n \sigma^{-2} \phi^\top(x)\, \Sigma_w\, \phi(x_n)\, t_n \equiv \sum_n k(x, x_n)\, t_n.$$

- The equivalent kernel $k(x, x')$ is implicitly defined in terms of the basis functions $\phi(\cdot)$ and is data dependent through $\Sigma_w$.
- It can be reformulated in terms of an inner product: $k(x, x') = \psi^\top(x)\, \psi(x')$, with $\psi(x) \equiv \sigma^{-1} \Sigma_w^{1/2} \phi(x)$.
- It determines the (posterior) correlation between, often nearby, input pairs: $\mathrm{cov}[y(x, w), y(x', w)] = \phi^\top(x)\, \Sigma_w\, \phi(x') = \sigma^2 k(x, x')$.

The idea is to define the covariance function or kernel directly instead of choosing basis functions which induce an implicit kernel.
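The equivalence between the weight-space predictive mean and the kernel-weighted sum of targets can be checked numerically. The sketch below is illustrative only: the Gaussian basis functions, their centres, the toy data and the zero-mean isotropic weight prior with precision `alpha` (so that $\Sigma_w = (\alpha I + \sigma^{-2}\Phi^\top\Phi)^{-1}$) are all assumptions, not taken from the slides.

```python
import numpy as np

def gaussian_basis(x, centres, width=0.5):
    """Design matrix with one squared exponential basis function per centre."""
    return np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2 / width**2)

rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 20)
t = np.sin(np.pi * x_train) + 0.1 * rng.standard_normal(x_train.size)

centres = np.linspace(-1.0, 1.0, 9)   # hypothetical basis centres
sigma2 = 0.1**2                        # noise variance sigma^2
alpha = 1.0                            # assumed prior precision on the weights

Phi = gaussian_basis(x_train, centres)                        # N x M design matrix
Sigma_w = np.linalg.inv(alpha * np.eye(centres.size) + Phi.T @ Phi / sigma2)
mu_w = Sigma_w @ Phi.T @ t / sigma2                           # posterior mean of the weights

x_test = np.array([0.3])
phi_test = gaussian_basis(x_test, centres)                    # 1 x M

# Equivalent kernel k(x, x_n) = sigma^{-2} phi(x)^T Sigma_w phi(x_n), one value per target
k_equiv = (phi_test @ Sigma_w @ Phi.T / sigma2).ravel()

mean_from_weights = (phi_test @ mu_w).item()   # mu_w^T phi(x)
mean_from_kernel = float(k_equiv @ t)          # sum_n k(x, x_n) t_n
```

Both routes give the same predictive mean; the kernel route never refers to the weights explicitly, which is what motivates specifying the kernel directly.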

Gaussian process

A multivariate Gaussian distribution:
- Defines a probability density (based on correlations) over $D$ random variables.
- Is defined by a mean vector $\mu$ and a covariance matrix $\Sigma$: $y \equiv (y_1, \ldots, y_D)^\top \sim \mathcal{N}(\mu, \Sigma)$.

A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables:
- It defines a probability measure over random functions. (Informally, a function can be viewed as an infinitely long vector.)
- It is defined by a mean function $m(x)$ and a covariance function $k(x, x')$: $y(\cdot) \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot))$.

The joint distribution over a finite subset of variables is a consistent finite dimensional Gaussian! (see Lecture a)

Example of a covariance function

The squared exponential kernel is defined as

$$k(x, x') = c \exp\left\{ -\frac{\|x - x'\|^2}{2 l^2} \right\},$$

where $c > 0$ and $l > 0$ are hyperparameters.

- It is a valid kernel, as it leads to a positive semidefinite Gram matrix $K \in \mathbb{R}^{N \times N}$ for any choice of the input set $\{x_n\}_{n=1}^N$.
- It is a stationary kernel, i.e. it depends only on the difference $x - x'$.
- It corresponds to projecting the input data into an infinite dimensional feature space (see e.g. Shawe-Taylor and Cristianini, 2004).
- Alternatively, it corresponds to using an infinite number of basis functions (not just on the training points).
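A minimal sketch of these properties, assuming the form $k(x, x') = c \exp\{-\|x - x'\|^2 / (2 l^2)\}$ and arbitrary random inputs; the function name `se_kernel` and the hyperparameter values are illustrative choices.

```python
import numpy as np

def se_kernel(X1, X2, c=1.0, l=0.5):
    """Squared exponential kernel c * exp(-||x - x'||^2 / (2 l^2)) for row-wise inputs."""
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return c * np.exp(-0.5 * sq_dist / l**2)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(50, 2))   # 50 arbitrary inputs in 2-D
K = se_kernel(X, X)

# Validity: the Gram matrix is symmetric with non-negative eigenvalues (up to round-off).
is_symmetric = np.allclose(K, K.T)
min_eig = np.linalg.eigvalsh(K).min()

# Stationarity: shifting all inputs by the same vector leaves the Gram matrix unchanged.
shift = np.array([3.0, -2.0])
K_shifted = se_kernel(X + shift, X + shift)
```

In floating point the smallest eigenvalue can come out slightly negative, which is why a small diagonal "jitter" is usually added before factorising such a matrix.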

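Consider a basis function which is an infinite sum of squared exponentials weighted by Gaussian random variables:

$$\psi(x) = \int w(u)\, e^{-(x - u)^2}\, du,$$

where $w(u) \sim \mathcal{N}(0, 1)$ for all $u$. The resulting covariance function defines the squared exponential kernel (the integral is a convolution of two squared exponentials):

$$k(x, x') = \langle \psi(x)\, \psi(x') \rangle = \int e^{-(x - u)^2}\, e^{-(x' - u)^2}\, du \propto e^{-\frac{(x - x')^2}{2}}.$$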

Sampling random functions from a Gaussian process

Sequential sampling:

$$p(y) = \prod_{n > 0} p(y_n | y_{\setminus n}) = \prod_{n > 0} \mathcal{N}(\tilde m_n, \sigma_n^2), \qquad y_{\setminus n} = (y_{n-1}, \ldots, y_1)^\top.$$

Repeat for $n > 0$:
- Generate $x_n$.
- Draw a sample $z_n$ from $\mathcal{N}(0, 1)$.
- Compute the function value associated to $x_n$ using $y_n = \sigma_n z_n + \tilde m_n$.

Batch sampling: $y \sim \mathcal{N}(m, K)$.
- Generate a set of inputs $\{x_n\}_{n=1}^N$.
- Draw $N$ samples $z$ from $\mathcal{N}(0, 1)$.
- Compute the function values using $y = L^\top z + m$.

Matrix $L$ is the upper triangular Cholesky factor of the kernel matrix, $K = L^\top L$.
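The batch procedure can be sketched as follows. The squared exponential kernel, its hyperparameters, the input grid and the diagonal jitter are assumptions made for illustration. Note that `numpy.linalg.cholesky` returns the lower triangular factor, so it is transposed to obtain the upper factor used above.

```python
import numpy as np

def se_kernel(x1, x2, c=1.0, l=0.5):
    """Squared exponential kernel for 1-D inputs."""
    return c * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l**2)

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 100)        # set of inputs {x_n}
m = np.zeros_like(x)                   # zero mean function
jitter = 1e-6                          # assumed small diagonal term for numerical stability
K = se_kernel(x, x) + jitter * np.eye(x.size)

L_lower = np.linalg.cholesky(K)        # NumPy convention: K = L_lower @ L_lower.T
L_upper = L_lower.T                    # upper factor: K = L_upper.T @ L_upper

z = rng.standard_normal((x.size, 3))   # N standard normal draws per sample function
y = L_upper.T @ z + m[:, None]         # each column is one random function on the grid
```

Each column of `y` has covariance `L_upper.T @ L_upper = K`, as required.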

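The function values $y_n$ and $y_{\setminus n}$ are jointly Gaussian:

$$p(y_n, y_{\setminus n}) = \mathcal{N}\left( \begin{bmatrix} m(x_n) \\ m_{\setminus n} \end{bmatrix}, \begin{bmatrix} k(x_n, x_n) & k_n^\top \\ k_n & K_{\setminus n} \end{bmatrix} \right) = \mathcal{N}(m, K).$$

The conditional $p(y_n | y_{\setminus n})$ is then also Gaussian, with the conditional mean and the conditional variance respectively given by

$$\tilde m_n = m(x_n) + k_n^\top K_{\setminus n}^{-1} (y_{\setminus n} - m_{\setminus n}), \qquad \sigma_n^2 = k(x_n, x_n) - k_n^\top K_{\setminus n}^{-1} k_n.$$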
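A sketch of sequential sampling built on these conditionals, assuming a zero mean function, a squared exponential kernel and a small jitter term (all illustrative choices). Each new value is drawn from the Gaussian conditional given the values generated so far.

```python
import numpy as np

def se_kernel(x1, x2, c=1.0, l=0.5):
    """Squared exponential kernel for 1-D inputs."""
    return c * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l**2)

def sample_sequentially(x, rng, jitter=1e-6):
    """Draw one zero-mean GP sample by conditioning each value on the previous ones."""
    y = np.zeros(x.size)
    for n in range(x.size):
        k_nn = se_kernel(x[n:n + 1], x[n:n + 1])[0, 0] + jitter
        if n == 0:
            mean, var = 0.0, k_nn                      # nothing to condition on yet
        else:
            K_prev = se_kernel(x[:n], x[:n]) + jitter * np.eye(n)
            k_n = se_kernel(x[:n], x[n:n + 1])[:, 0]
            w = np.linalg.solve(K_prev, k_n)           # K_prev^{-1} k_n
            mean = w @ y[:n]                           # conditional mean (m = 0)
            var = k_nn - w @ k_n                       # conditional variance
        z = rng.standard_normal()
        y[n] = np.sqrt(max(var, 0.0)) * z + mean       # guard against round-off below zero
    return y

rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 30)
y = sample_sequentially(x, rng)
```

Solving a new linear system at every step makes this O(N^4) overall; it is written for clarity, whereas the batch approach needs a single Cholesky factorisation.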

Example (demo: gpsampl fun)

Figure: Three random functions $y(x)$ generated from a GP with $m(x) = 0$ and a squared exponential covariance function (c = and l =.5).

Gaussian processes for regression

The choice of the kernel defines a prior process (and a prior measure over functions):

$$y(\cdot) \sim \mathcal{GP}(0, k(\cdot, \cdot)).$$

We assume a finite number of observations and iid Gaussian noise. The likelihood is given by

$$t \,|\, y, \sigma \sim \mathcal{N}(y, \sigma^2 I_N),$$

where $y \equiv (y(x_1), \ldots, y(x_N))^\top$ are the latent function values.

The posterior process is again a Gaussian process:

$$y(\cdot) \,|\, t, \sigma \sim \mathcal{GP}(\tilde m(\cdot), \tilde k(\cdot, \cdot)),$$

where

$$\tilde m(x) = k^\top(x) (K + \sigma^2 I_N)^{-1} t, \qquad \tilde k(x, x') = k(x, x') - k^\top(x) (K + \sigma^2 I_N)^{-1} k(x').$$

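Any latent function value $y(x)$ is jointly Gaussian with the finite subset $y$:

$$p(y, y(x)) = \mathcal{N}\left( 0, \begin{bmatrix} K & k(x) \\ k^\top(x) & k(x, x) \end{bmatrix} \right),$$

where $k(x) \equiv (k(x, x_1), \ldots, k(x, x_N))^\top$. The mean and the variance of the conditional Gaussian $p(y(x) | y)$ are given by

$$\mu(x) = k^\top(x) K^{-1} y, \qquad \kappa(x, x) = k(x, x) - k^\top(x) K^{-1} k(x).$$

We have the prior $p(y) = \mathcal{N}(0, K)$ and the likelihood $p(t | y) = \mathcal{N}(y, \sigma^2 I_N)$, such that the posterior is

$$p(y | t) = \mathcal{N}(\sigma^{-2} \Sigma t, \Sigma), \qquad \Sigma = (K^{-1} + \sigma^{-2} I_N)^{-1}.$$

Hence, the marginal posterior $p(y(x) | t) = \int p(y(x) | y)\, p(y | t)\, dy$ is a Gaussian with mean and variance given by

$$\tilde m(x) = k^\top(x) (K + \sigma^2 I_N)^{-1} t, \qquad \tilde k(x, x) = k(x, x) - k^\top(x) (K + \sigma^2 I_N)^{-1} k(x),$$

where the Woodbury identity was invoked.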
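The posterior mean and covariance can be computed with a single Cholesky factorisation of $K + \sigma^2 I_N$ rather than an explicit inverse. The sketch below assumes a squared exponential kernel and toy data; `gp_posterior` is a hypothetical helper name.

```python
import numpy as np

def se_kernel(x1, x2, c=1.0, l=0.5):
    """Squared exponential kernel for 1-D inputs."""
    return c * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l**2)

def gp_posterior(x_train, t, x_test, sigma2, kernel=se_kernel):
    """Posterior mean and covariance of the latent function at x_test."""
    K = kernel(x_train, x_train)
    k_star = kernel(x_train, x_test)                      # columns are k(x) for each test input
    L = np.linalg.cholesky(K + sigma2 * np.eye(x_train.size))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t))   # (K + sigma^2 I)^{-1} t
    V = np.linalg.solve(L, k_star)                        # L^{-1} k(x)
    mean = k_star.T @ alpha
    cov = kernel(x_test, x_test) - V.T @ V                # k - k^T (K + sigma^2 I)^{-1} k
    return mean, cov

rng = np.random.default_rng(4)
x_train = np.array([-0.8, -0.3, 0.1, 0.6])
t = np.sin(3.0 * x_train) + 0.05 * rng.standard_normal(x_train.size)
x_test = np.linspace(-1.0, 1.0, 50)

mean, cov = gp_posterior(x_train, t, x_test, sigma2=0.05**2)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))           # pointwise posterior standard deviation
```

The posterior variance never exceeds the prior variance $k(x, x)$, and it shrinks most near the observed inputs.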

Example (demo: gpsampl fun)

Figure: Three random functions $y(x)$ generated from (a) the prior GP and (b) the posterior GP. An observation is indicated by a +, the mean function by a dashed line and the 3 standard deviation error bars by the shaded regions. We used a squared exponential covariance function (c = and l =.5).

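Learning the parameters by type II ML

Let us denote the kernel parameters by $\theta$. We view the latent function values as nuisance parameters and maximise the log-marginal likelihood with respect to $\sigma^2$ and $\theta$.

The log-marginal likelihood is given by

$$\ln p(t | \sigma^2, \theta) = -\frac{N}{2} \ln 2\pi \; \underbrace{- \frac{1}{2} \ln |K(\theta) + \sigma^2 I_N|}_{\text{complexity penalty}} \; \underbrace{- \frac{1}{2} t^\top (K(\theta) + \sigma^2 I_N)^{-1} t}_{\text{data fit}}.$$

The noise variance $\sigma^2$ and the kernel parameters $\theta$ can be learned by means of gradient ascent techniques (see Nocedal and Wright):

$$\frac{\partial \ln p(t | \sigma^2, \theta)}{\partial \sigma^2} = -\frac{1}{2} \mathrm{tr}\left\{ (K + \sigma^2 I_N)^{-1} \right\} + \frac{1}{2} \nu^\top \nu,$$

$$\frac{\partial \ln p(t | \sigma^2, \theta)}{\partial \theta_k} = -\frac{1}{2} \mathrm{tr}\left\{ \left( (K + \sigma^2 I_N)^{-1} - \nu \nu^\top \right) \frac{\partial K}{\partial \theta_k} \right\},$$

where $\nu \equiv (K(\theta) + \sigma^2 I_N)^{-1} t$.

The negative log-marginal surface is non-convex (there is no guarantee of attaining a global minimum) and the computational complexity of its evaluation is $O(N^3)$.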

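The marginal likelihood follows from integrating out the latent function values:

$$p(t | \sigma^2, \theta) = \int p(t | y, \sigma^2)\, p(y | \theta)\, dy = \int \mathcal{N}(y, \sigma^2 I_N)\, \mathcal{N}(0, K(\theta))\, dy = \mathcal{N}(0, K(\theta) + \sigma^2 I_N).$$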
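A sketch of the log-marginal likelihood $\ln \mathcal{N}(t \,|\, 0, K + \sigma^2 I_N)$ and its derivative with respect to the noise variance, checked against a central finite difference. The kernel, its fixed hyperparameters and the toy data are assumptions for illustration.

```python
import numpy as np

def se_kernel(x1, x2, c=1.0, l=0.5):
    """Squared exponential kernel for 1-D inputs."""
    return c * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l**2)

def log_marginal(t, K, sigma2):
    """ln N(t | 0, K + sigma^2 I), evaluated via a Cholesky factorisation."""
    N = t.size
    L = np.linalg.cholesky(K + sigma2 * np.eye(N))
    nu = np.linalg.solve(L.T, np.linalg.solve(L, t))   # (K + sigma^2 I)^{-1} t
    log_det = 2.0 * np.log(np.diag(L)).sum()           # ln |K + sigma^2 I|
    return -0.5 * N * np.log(2.0 * np.pi) - 0.5 * log_det - 0.5 * t @ nu

def grad_sigma2(t, K, sigma2):
    """Derivative of the log-marginal with respect to sigma^2."""
    C_inv = np.linalg.inv(K + sigma2 * np.eye(t.size))
    nu = C_inv @ t
    return -0.5 * np.trace(C_inv) + 0.5 * nu @ nu

rng = np.random.default_rng(5)
x = np.linspace(-1.0, 1.0, 15)
t = np.sin(3.0 * x) + 0.1 * rng.standard_normal(x.size)
K = se_kernel(x, x)
sigma2 = 0.1

eps = 1e-6
numeric = (log_marginal(t, K, sigma2 + eps) - log_marginal(t, K, sigma2 - eps)) / (2 * eps)
analytic = grad_sigma2(t, K, sigma2)
```

The same finite-difference check applies to the kernel parameters once $\partial K / \partial \theta_k$ is available; in practice these gradients would be passed to an off-the-shelf optimiser and restarted from several initial values, since the surface is non-convex.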

Predictive distribution

The predictive distribution at a test input $x_*$ for type II ML estimates of the hyperparameters is given by

$$p(t_* | t) \approx p(t_* | t, \sigma^2_{ML}, \theta_{ML}) = \mathcal{N}(\tilde m_{ML}(x_*), \tilde k_{ML}(x_*, x_*) + \sigma^2_{ML}).$$

The predictive variance has three components:
- The prior variance $k_{ML}(x_*, x_*)$.
- The term $k_{ML}^\top(x_*) (K_{ML} + \sigma^2_{ML} I_N)^{-1} k_{ML}(x_*)$, which reduces the prior uncertainty and tells us how much is explained by the data.
- The noise $\sigma^2_{ML}$ on the observations.

For fixed hyperparameters, the predictive variance is independent of the targets!

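The predictive distribution is obtained by integrating the noise model against the posterior GP:

$$p(t_* | t, \sigma^2_{ML}, \theta_{ML}) = \int p(t_* | y(x_*), \sigma^2_{ML})\, \underbrace{p(y(x_*) | t, \sigma^2_{ML}, \theta_{ML})}_{\text{posterior GP}}\, dy(x_*)$$

$$= \int \mathcal{N}(y(x_*), \sigma^2_{ML})\, \mathcal{N}(\tilde m_{ML}(x_*), \tilde k_{ML}(x_*, x_*))\, dy(x_*) = \mathcal{N}(\tilde m_{ML}(x_*), \sigma^2_{ML} + \tilde k_{ML}(x_*, x_*)).$$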

Sinc example revisited

Figure: Comparison of the optimal solutions found by (a) variational linear regression with squared exponential basis functions (λ =.495) and by (b) Gaussian process regression with a squared exponential kernel (λ =.84).

Automatic relevance determination (ARD)

Can we select the relevant input dimensions from the data?

Consider a more general form of the squared exponential kernel:

$$k(x, x') = c \exp\left\{ -\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{l_d^2} \right\},$$

where the $\{l_d\}_{d=1}^D$ are allowed to be different.

- The characteristic length scale $l_d$ measures the distance for being uncorrelated along $x_d$. Hence, $x_d$ is not relevant if $1/l_d$ is small.
- In general, ARD can be implemented by imposing hierarchical priors on the parameters.
- For example, ARD is used in relevance vector machines for achieving sparsity. A prior with a different inverse length scale $\alpha_m$ is imposed on each weight $w_m$.
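A small numerical illustration of the ARD effect, assuming the kernel form $c \exp\{-\tfrac{1}{2}\sum_d (x_d - x'_d)^2 / l_d^2\}$; the inputs, the shift and the length scales are arbitrary choices. Moving the inputs along a dimension with a very large length scale barely changes the kernel values.

```python
import numpy as np

def ard_kernel(X1, X2, lengthscales, c=1.0):
    """c * exp(-0.5 * sum_d (x_d - x'_d)^2 / l_d^2) for row-wise inputs."""
    S1 = X1 / lengthscales                     # rescale each dimension by its length scale
    S2 = X2 / lengthscales
    sq_dist = ((S1[:, None, :] - S2[None, :, :]) ** 2).sum(axis=-1)
    return c * np.exp(-0.5 * sq_dist)

rng = np.random.default_rng(6)
X = rng.uniform(-1.0, 1.0, size=(5, 2))
X_moved = X.copy()
X_moved[:, 1] += 0.5                           # perturb the second input dimension only

l_relevant = np.array([1.0, 1.0])              # both dimensions matter
l_irrelevant = np.array([1.0, 1e3])            # second dimension has a huge length scale

change_relevant = np.abs(
    ard_kernel(X, X_moved, l_relevant) - ard_kernel(X, X, l_relevant)
).max()
change_irrelevant = np.abs(
    ard_kernel(X, X_moved, l_irrelevant) - ard_kernel(X, X, l_irrelevant)
).max()
```

When the length scales are learned by type II ML, dimensions whose fitted $1/l_d$ ends up close to zero are effectively switched off in this way.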

Example

Figure: Latent function values $y(x)$ as a function of the input dimensions for three different distance measures. In (a) both dimensions are relevant, while in (b) only one is. (Figure 5.1 of C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006; www.gaussianprocess.org/gpml.)

Covariance functions

In order to be valid, a kernel should satisfy Mercer's condition (see e.g. Shawe-Taylor and Cristianini, 2004). In practice we require the kernel to induce a symmetric and positive semidefinite kernel matrix.

Examples of other kernels:
- Non-stationary kernels (e.g. sigmoidal kernel).
- Kernels for structured inputs (e.g. string kernels).

Some rules for kernel design:

$$k(x, x') = c\, k_1(x, x'),$$
$$k(x, x') = k_1(x, x') + k_2(x, x'),$$
$$k(x, x') = k_1(x, x')\, k_2(x, x'),$$
$$k(x, x') = f(x)\, k_1(x, x')\, f(x'),$$
$$\vdots$$

where $c > 0$ is a constant and $f(\cdot)$ is a deterministic function.

An interesting open question is how to learn (the type of) the kernel.
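The design rules can be sanity-checked numerically by verifying that each combined Gram matrix remains positive semidefinite. This is only an empirical check on one input set, not a proof; the two base kernels, the constant and the function `f` are arbitrary choices.

```python
import numpy as np

def se_kernel(x1, x2, c=1.0, l=0.5):
    """Squared exponential kernel for 1-D inputs."""
    return c * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l**2)

def linear_kernel(x1, x2):
    """Linear kernel x * x' for 1-D inputs."""
    return x1[:, None] * x2[None, :]

def min_eigenvalue(K):
    """Smallest eigenvalue of the symmetrised matrix."""
    return np.linalg.eigvalsh(0.5 * (K + K.T)).min()

x = np.linspace(-1.0, 1.0, 40)
K1 = se_kernel(x, x)
K2 = linear_kernel(x, x)
f = np.cos(2.0 * x)                               # arbitrary deterministic function f(x)

candidates = {
    "scaling": 3.0 * K1,                          # c * k1 with c > 0
    "sum": K1 + K2,                               # k1 + k2
    "product": K1 * K2,                           # k1 * k2 (elementwise on the Gram matrices)
    "warped": f[:, None] * K1 * f[None, :],       # f(x) k1(x, x') f(x')
}
min_eigs = {name: min_eigenvalue(K) for name, K in candidates.items()}
```

All smallest eigenvalues are non-negative up to round-off, consistent with the rules above.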

Periodic covariance functions

A periodic signal can be constructed using the following warping function:

$$u(x) = (\sin x, \cos x)^\top.$$

Plugging $u$ into the squared exponential kernel leads to a periodic kernel:

$$k(x, x') = c \exp\left\{ -\frac{2 \sin^2\left( \frac{x - x'}{2} \right)}{l^2} \right\},$$

where we used the fact that $\|u(x) - u(x')\|^2 = 4 \sin^2\left( \frac{x - x'}{2} \right)$.

Figure: Three random functions $y(x)$ generated with a periodic kernel (c = and l =.5).
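The warping construction can be verified numerically: a squared exponential kernel applied to $u(x) = (\sin x, \cos x)^\top$ coincides with the closed-form periodic kernel, and the result is invariant to shifts by $2\pi$. The hyperparameter values and the input grid are illustrative assumptions.

```python
import numpy as np

def warp(x):
    """Warping function u(x) = (sin x, cos x)."""
    return np.stack([np.sin(x), np.cos(x)], axis=-1)

def se_kernel_nd(U1, U2, c=1.0, l=0.5):
    """Squared exponential kernel c * exp(-||u - u'||^2 / (2 l^2)) for row-wise inputs."""
    sq_dist = ((U1[:, None, :] - U2[None, :, :]) ** 2).sum(axis=-1)
    return c * np.exp(-0.5 * sq_dist / l**2)

def periodic_kernel(x1, x2, c=1.0, l=0.5):
    """Closed form c * exp(-2 sin^2((x - x') / 2) / l^2)."""
    diff = x1[:, None] - x2[None, :]
    return c * np.exp(-2.0 * np.sin(0.5 * diff) ** 2 / l**2)

x = np.linspace(-10.0, 10.0, 60)
K_warped = se_kernel_nd(warp(x), warp(x))        # SE kernel on the warped inputs
K_periodic = periodic_kernel(x, x)               # closed-form periodic kernel
K_shifted = periodic_kernel(x + 2.0 * np.pi, x)  # shift the first argument by one period
```

The period is fixed to $2\pi$ here; rescaling the argument of the warping function would change it.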

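Rational quadratic covariance functions

The rational quadratic kernel is defined as follows:

$$k(x, x') = \left( 1 + \frac{\|x - x'\|^2}{\nu l^2} \right)^{-\frac{\nu + D}{2}},$$

where $\nu > 0$ is the shape parameter, $l > 0$ the scale parameter and $D$ is the dimension of the input space.

- The rational quadratic kernel (or Student-t kernel) corresponds to an infinite mixture of scaled squared exponentials:

$$\int p(r | u, l)\, p(u | \nu)\, du = \int \mathcal{N}(0, l^2 / u)\, \mathcal{G}\left( \tfrac{\nu}{2}, \tfrac{\nu}{2} \right) du \propto \left( 1 + \frac{r^2}{\nu l^2} \right)^{-\frac{\nu + D}{2}}, \qquad r \equiv \|x - x'\|.$$

- The shape parameter $\nu$ defines the thickness of the kernel tails.
- The squared exponential is recovered for $\nu \to \infty$.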
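The limiting behaviour can be checked numerically by comparing the rational quadratic kernel with the squared exponential $\exp\{-r^2 / (2 l^2)\}$ for increasing $\nu$. The grid of distances, the unit length scale and $D = 1$ are assumptions for illustration.

```python
import numpy as np

def rq_kernel(r, nu, l=1.0, D=1):
    """Rational quadratic kernel (1 + r^2 / (nu l^2))^(-(nu + D) / 2)."""
    return (1.0 + r**2 / (nu * l**2)) ** (-0.5 * (nu + D))

def se_kernel_r(r, l=1.0):
    """Squared exponential kernel as a function of the distance r."""
    return np.exp(-0.5 * r**2 / l**2)

r = np.linspace(0.0, 5.0, 200)

# Largest pointwise gap between the two kernels for several shape parameters.
gap = {nu: np.abs(rq_kernel(r, nu) - se_kernel_r(r)).max() for nu in (1.0, 10.0, 1e3, 1e6)}
```

The gap shrinks as $\nu$ grows, while small $\nu$ gives visibly heavier tails than the squared exponential.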

Example revisited

Figure (panels (a) to (f), one column per value of ν): Three random functions $y(x)$ generated from the prior GP and the posterior GP with the rational quadratic kernel (l =.5). The observations are indicated by +, the means by dashed lines and the 3 standard deviation error bars by the shaded regions.

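Matérn covariance functions

The Matérn kernel is given by

$$k(x, x') = \frac{2^{1 - \nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\, \|x - x'\|}{l} \right)^{\nu} K_\nu\left( \frac{\sqrt{2\nu}\, \|x - x'\|}{l} \right),$$

where $\nu > 0$ and $l > 0$. The function $K_\nu(\cdot)$ is the modified Bessel function of the second kind.

The order $\nu$ defines the roughness of the random functions: they are $k$ times (mean-square) differentiable if and only if $\nu > k$.

- We have the Laplacian or Ornstein-Uhlenbeck kernel for $\nu = 1/2$.
- For $\nu = p + 1/2$ with $p \in \mathbb{N}$, the covariance function takes the simple form of a product of an exponential and a polynomial of order $p$:

$$k(x, x') = \exp\left\{ -\frac{\sqrt{2\nu}\, \|x - x'\|}{l} \right\} \frac{p!}{(2p)!} \sum_{i=0}^{p} \frac{(p + i)!}{i!\,(p - i)!} \left( \frac{\sqrt{8\nu}\, \|x - x'\|}{l} \right)^{p - i}.$$

- We recover the squared exponential kernel for $\nu \to \infty$.

There is in general no closed form solution for the derivative of $K_\nu(\cdot)$ with respect to $\nu$.

The Ornstein-Uhlenbeck (OU) process is a mathematical description of the velocity of a particle undergoing Brownian motion.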
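The half-integer form can be implemented directly and compared with the familiar special cases. The sketch below assumes the exponential-times-polynomial expression for $\nu = p + 1/2$, written as a function of the distance $r = \|x - x'\|$; `matern_half_integer` is a hypothetical helper name.

```python
import numpy as np
from math import factorial

def matern_half_integer(r, p, l=1.0):
    """Matérn kernel for nu = p + 1/2 as an exponential times a polynomial of order p."""
    nu = p + 0.5
    scaled = np.sqrt(8.0 * nu) * r / l
    poly = sum(
        factorial(p + i) / (factorial(i) * factorial(p - i)) * scaled ** (p - i)
        for i in range(p + 1)
    )
    return np.exp(-np.sqrt(2.0 * nu) * r / l) * factorial(p) / factorial(2 * p) * poly

r = np.linspace(0.0, 4.0, 100)
k_ou = matern_half_integer(r, p=0)   # nu = 1/2, Ornstein-Uhlenbeck kernel
k_32 = matern_half_integer(r, p=1)   # nu = 3/2
k_52 = matern_half_integer(r, p=2)   # nu = 5/2

# Closed forms for comparison (unit length scale).
ou_closed = np.exp(-r)
m32_closed = (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)
m52_closed = (1.0 + np.sqrt(5.0) * r + 5.0 * r**2 / 3.0) * np.exp(-np.sqrt(5.0) * r)
```

For general $\nu$ the Bessel-function form is needed instead, for which a special-function library would typically be used.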

Example revisited

Figure (panels (a) to (f), one column per value of ν): Three random functions $y(x)$ generated from the prior GP and the posterior GP with the Matérn kernel (l =.5). The observations are indicated by +, the means by dashed lines and the 3 standard deviation error bars by the shaded regions.

Matérn kernel vs rational quadratic kernel

Figure: Comparison of (a) the rational quadratic and (b) the Matérn kernel with unit length scale ($l = 1$) for three values of, respectively, the shape and the roughness parameter. Each panel shows $k(x, x')$ against the distance between inputs.

- Both kernels are less localised than the squared exponential.
- Forcing the random latent functions to be infinitely differentiable might be unrealistic in practice.

Sparse Gaussian processes

The main problem with GPs is that exact inference is $O(N^3)$, where $N$ is the number of training inputs.

1. Subset of training data:
   The data points in the active set are selected in a greedy fashion according to some heuristic:
   - Random selection.
   - Vector quantisation or clustering (e.g. K-means).
   - Maximum entropy score (Lawrence et al., 2003): $H[p(y_n | y_{\setminus n})] - H[p(y_n | y)]$.
   - Maximum information gain (Seeger et al., 2003): $\mathrm{KL}[p(y_n | y) \,\|\, p(y_n | y_{\setminus n})]$.
   - ...
   Predictions are made based on the active set only.

2. Subset of regressors:
   Consider a set of inducing variables $u \in \mathbb{R}^M$, which are deterministically related to the latent function values:
   $$y(x) = k_u^\top(x) K_u^{-1} u.$$
   The GP prior is replaced by a degenerate GP with the covariance function
   $$k_{SoR}(x, x') = \langle y(x)\, y(x') \rangle = k_u^\top(x) K_u^{-1} k_u(x').$$
   The (inputs of the) inducing variables are selected from the training data according to some simple heuristic.

Sparse Gaussian processes (continued)

3. Projected process approximation (Csató and Opper, 2002):
   Consider again a set of inducing variables $u \in \mathbb{R}^M$. They are now related to the observations:
   $$t \,|\, y \approx t \,|\, u \sim \mathcal{N}(k_u^\top(x) K_u^{-1} u, \sigma^2), \qquad u \sim \mathcal{N}(0, K_u).$$
   The information contained in the $N$ observations is absorbed into the $M$ inducing variables. The predictive mean is the same as for the subset of regressors, but the predictive variance is more realistic (i.e. it grows when moving away from the observations).

4. Pseudo-inputs approximation (Snelson and Ghahramani, 2006):
   The approximate likelihood is chosen from a richer class:
   $$t \,|\, y \approx t \,|\, u \sim \mathcal{N}\left( k_u^\top(x) K_u^{-1} u,\; k(x, x) - k_u^\top(x) K_u^{-1} k_u(x) + \sigma^2 \right).$$
   It can be shown that this choice leads to a (non-degenerate) GP with the covariance function
   $$k_{PI}(x, x') = k_{SoR}(x, x') + \delta(x, x') \left( k(x, x') - k_{SoR}(x, x') \right),$$
   where $\delta(x, x')$ is the Kronecker delta.
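The difference between the degenerate subset-of-regressors covariance $k_u^\top(x) K_u^{-1} k_u(x')$ and the pseudo-inputs covariance can be made concrete on a grid of inputs. The kernel, the evenly spaced inducing inputs and the jitter term below are illustrative assumptions.

```python
import numpy as np

def se_kernel(x1, x2, c=1.0, l=0.5):
    """Squared exponential kernel for 1-D inputs."""
    return c * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l**2)

x = np.linspace(-1.0, 1.0, 50)       # N training inputs
x_u = np.linspace(-1.0, 1.0, 8)      # M inducing inputs (hypothetical choice)
jitter = 1e-8                        # assumed stabiliser for K_u

K = se_kernel(x, x)                                   # full N x N covariance
K_u = se_kernel(x_u, x_u) + jitter * np.eye(x_u.size)
K_xu = se_kernel(x, x_u)                              # rows are k_u(x)^T

# Subset of regressors: low-rank (degenerate) covariance of rank at most M.
K_sor = K_xu @ np.linalg.solve(K_u, K_xu.T)

# Pseudo-inputs: SoR off the diagonal, exact prior variance on the diagonal.
K_pi = K_sor + np.diag(np.diag(K - K_sor))

rank_sor = np.linalg.matrix_rank(K_sor, tol=1e-6)
residual_min_eig = np.linalg.eigvalsh(K - K_sor).min()   # K - K_sor is PSD up to round-off
```

The diagonal correction is what makes the pseudo-inputs prior non-degenerate: away from the inducing inputs the SoR variance collapses towards zero, whereas the corrected covariance keeps the full prior variance.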

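Non-Gaussian noise

Assume the noise is non-Gaussian, but still iid. The likelihood factorises and takes the following form:

$$p(t | y, \theta) \propto e^{-\sum_{n=1}^{N} V_n},$$

where $V_n \equiv V_\theta(t_n, y_n)$ is a nonlinear function parametrised by $\theta$.

Even for a GP prior, the posterior process is non-Gaussian and intractable.

We consider the variational Gaussian distribution $q(y) = \mathcal{N}(\mu, \Sigma)$, which maximises the free energy (Opper and Archambeau, 2008)

$$\mathcal{F}(q, \theta) = \langle \ln p(t, y | \theta) \rangle_{q(y)} + H[q(y)].$$

The stationary points are given by

$$\mu = K \nu, \qquad \nu \equiv -\left( \ldots, \frac{\partial \langle V_n \rangle_{q_n}}{\partial \mu_n}, \ldots \right)^\top,$$

$$\Sigma^{-1} = K^{-1} + \Lambda, \qquad \Lambda \equiv 2\, \mathrm{diag}\left\{ \ldots, \frac{\partial \langle V_n \rangle_{q_n}}{\partial \Sigma_{nn}}, \ldots \right\},$$

where $q_n \equiv q(y_n)$ is the marginal Gaussian.

The number of parameters to optimise (e.g. by gradient descent) is $O(N)$!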

Non-Gaussian noise (continued)

If $y(\cdot) \sim \mathcal{GP}$, then the conditional mean function and the conditional variance function are given by

$$\mu(x) = k^\top(x) K^{-1} y, \qquad \kappa(x, x') = k(x, x') - k^\top(x) K^{-1} k(x').$$

The approximate posterior process is a Gaussian process,

$$y(\cdot) \,|\, t \approx \int p(y(\cdot) | y)\, q(y)\, dy = \mathcal{GP}(\tilde m(\cdot), \tilde k(\cdot, \cdot)),$$

with the mean function and the predictive variance given by

$$\tilde m(x) = k^\top(x)\, \nu, \qquad \tilde k(x, x') = k(x, x') - k^\top(x) (K + \Lambda^{-1})^{-1} k(x'),$$

where the Woodbury identity was invoked.

The log-marginal is intractable, but the noise and the kernel parameters can be estimated by maximising $\mathcal{F}$.
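As a sanity check on the stationary conditions $\mu = K\nu$ and $\Sigma^{-1} = K^{-1} + \Lambda$, consider the special case of Gaussian noise, $V_n = (t_n - y_n)^2 / (2\sigma^2) + \text{const}$. This worked example is not part of the slides; it is a sketch showing that the variational solution then reduces to the exact GP regression posterior.

```latex
\begin{align*}
\langle V_n \rangle_{q_n} &= \frac{(t_n - \mu_n)^2 + \Sigma_{nn}}{2\sigma^2} + \text{const}, \\
\nu_n &= -\frac{\partial \langle V_n \rangle_{q_n}}{\partial \mu_n} = \frac{t_n - \mu_n}{\sigma^2},
\qquad
\Lambda_{nn} = 2\,\frac{\partial \langle V_n \rangle_{q_n}}{\partial \Sigma_{nn}} = \frac{1}{\sigma^2}, \\
\mu &= K\nu = \sigma^{-2} K (t - \mu)
\;\Rightarrow\; \mu = K (K + \sigma^2 I_N)^{-1} t, \\
\nu &= K^{-1}\mu = (K + \sigma^2 I_N)^{-1} t, \\
\tilde m(x) &= k^\top(x)\,\nu = k^\top(x) (K + \sigma^2 I_N)^{-1} t, \\
\tilde k(x, x') &= k(x, x') - k^\top(x) (K + \Lambda^{-1})^{-1} k(x')
= k(x, x') - k^\top(x) (K + \sigma^2 I_N)^{-1} k(x').
\end{align*}
```

With $\Lambda^{-1} = \sigma^2 I_N$, both expressions coincide with the posterior mean and covariance of standard GP regression, so the variational approximation is exact in this case.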

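Sinc example with Laplace noise

The likelihood is defined as

$$p(t | y, \eta) = \frac{\eta}{2}\, e^{-\eta |t - y|}, \qquad \eta > 0.$$

Figure: Sinc example with Laplace noise (η = ), (a) standard GP and (b) variational GP. Both GPs use an optimised squared exponential kernel. Note that the shaded regions indicate the standard deviation error bars.

Useful Gaussian identities (see Opper and Archambeau (2008) for proof):

$$\frac{\partial \langle V_n \rangle_{q_n}}{\partial \mu_n} = \left\langle \frac{\partial V_n}{\partial y_n} \right\rangle_{q_n} = \frac{\langle (y_n - \mu_n) V_n \rangle_{q_n}}{\Sigma_{nn}},$$

$$\frac{\partial \langle V_n \rangle_{q_n}}{\partial \Sigma_{nn}} = \frac{1}{2} \left\langle \frac{\partial^2 V_n}{\partial y_n^2} \right\rangle_{q_n} = \frac{\langle (y_n - \mu_n)^2 V_n \rangle_{q_n}}{2 \Sigma_{nn}^2} - \frac{\langle V_n \rangle_{q_n}}{2 \Sigma_{nn}}.$$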
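The two Gaussian identities, which relate derivatives of $\langle V_n \rangle_{q_n}$ with respect to $\mu_n$ and $\Sigma_{nn}$ to expectations under $q_n$, can be checked numerically with Gauss-Hermite quadrature. The sketch below uses a hypothetical smooth potential $V(y) = (t - y)^4$ (chosen so that the quadrature is exact, not taken from the slides) and compares finite differences of $\langle V \rangle$ with both expectation forms.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

nodes, weights = hermegauss(40)             # quadrature for the weight exp(-z^2 / 2)
weights = weights / np.sqrt(2.0 * np.pi)    # normalise to expectations under N(0, 1)

def expect(g, mu, var):
    """E[g(y)] for y ~ N(mu, var)."""
    return weights @ g(mu + np.sqrt(var) * nodes)

t = 0.7
V = lambda y: (t - y) ** 4                  # hypothetical smooth potential
dV = lambda y: -4.0 * (t - y) ** 3          # first derivative with respect to y
d2V = lambda y: 12.0 * (t - y) ** 2         # second derivative with respect to y

mu, var, eps = 0.2, 0.5, 1e-5

# Identity for the mean: d<V>/dmu = <dV/dy> = <(y - mu) V> / var
lhs_mu = (expect(V, mu + eps, var) - expect(V, mu - eps, var)) / (2 * eps)
mid_mu = expect(dV, mu, var)
rhs_mu = expect(lambda y: (y - mu) * V(y), mu, var) / var

# Identity for the variance: d<V>/dvar = <d2V/dy2> / 2 = <(y - mu)^2 V> / (2 var^2) - <V> / (2 var)
lhs_var = (expect(V, mu, var + eps) - expect(V, mu, var - eps)) / (2 * eps)
mid_var = 0.5 * expect(d2V, mu, var)
rhs_var = expect(lambda y: (y - mu) ** 2 * V(y), mu, var) / (2 * var**2) - expect(V, mu, var) / (2 * var)
```

The right-hand forms only require evaluating $V$ itself, which is what makes the identities useful for non-differentiable potentials such as the Laplace noise model.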

Interpretation of the variational Gaussian approximation

Laplace approximation:
A Gaussian density is fitted locally at a mode of the posterior and the covariance is built from the curvature of the log-posterior around this point:

$$0 = \nabla_y \ln p(t, y | \theta), \qquad \Sigma^{-1} = -\nabla_y \nabla_y \ln p(t, y | \theta).$$

Variational Gaussian approximation:
The variational mean and the variational covariance can be rewritten in two different ways:

$$0 = \nabla_\mu \langle \ln p(t, y | \theta) \rangle_{q(y)} = \langle \nabla_y \ln p(t, y | \theta) \rangle_{q(y)},$$

$$\Sigma^{-1} = -\nabla_\mu \nabla_\mu \langle \ln p(t, y | \theta) \rangle_{q(y)} = -\langle \nabla_y \nabla_y \ln p(t, y | \theta) \rangle_{q(y)}.$$

- A Gaussian density is fitted globally, i.e. the conditions of the Laplace approximation hold on average.
- The variational Gaussian method is also equivalent to applying Laplace's method to an implicitly defined probability density $q(\mu) \propto e^{\langle \ln p(t, y | \theta) \rangle_{q(y)}}$.

References

- L. Csató and M. Opper. Sparse on-line Gaussian processes. Neural Computation 14:641-668, 2002.
- C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- J. Nocedal and S. J. Wright. Numerical Optimization. Springer.
- M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Computation, 2008.
- C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
- J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
- E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. NIPS 2005.
- Tutorial on Gaussian processes at NIPS 2006 by C. E. Rasmussen.
- The Matrix Cookbook by K. B. Petersen and M. S. Pedersen.