Gaussian Processes for Machine Learning

Gaussian Processes for Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics Tübingen, Germany carl@tuebingen.mpg.de Carlos III, Madrid, May 2006 The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man s mind. James Clerk Maxwell [1850]

Course Goals To understand how learning for regression classification can be based on Gaussian processes how GPs relate to alternative methods the advantages and shortcomings of GPs some practical familiarity using matlab functions 1/30

Todays Outline Brief background on probability Gaussian processes Bayesian inference two views the process view Occam s Razor the weight-space view on the equivalence between the two views 2/30

Why Gaussian Processes Gaussian processes are not new! Used in spatial statistics, geostatistics (kriging), meteorology Why are GPs not used more often? computational reasons Bayesian 3/30

Probabilities are non-negative p(x) 0 Probabilities and Probability Densities Probabilities normalize: x p(x) = 1 (discrete) or p(x)dx = 1 (continuous). The joint probability of x and y is p(x, y). The marginal probability of x is p(x) = p(x, y)dy. The conditional probability of x given y is p(x y) = p(x, y) p(y). Bayes Rule is given by p(x, y) = p(x y)p(y) = p(y x)p(x) p(y x) = p(x y)p(y) p(x) p(x y)p(y). Interpretations of p(x): long run frequencies (frequentist, classical) subjective degree of belief (Bayesian) 4/30

Expectation and Variance (Moments) The expectation (or mean or average) of a random variable is: µ = E[x] = xp(x)dx = x p(x). The variance (or second central moment) is: σ 2 = V[x] = (x µ) 2 p(x)dx = E[x 2 ] (E[x]) 2. The covariance between x and y: cov(x, y) = E[(x E[x])(y E[y])]. If x and y are independent, then their covariance is zero. 5/30

The Gaussian Distribution A joint, or multivariate Gaussian distribution in D dimensions: y N (µ, Σ) = (2π) D/2 Σ 1/2 exp ( 1 2 (y µ) Σ 1 (y µ) ), is fully specified by its mean vector where µ is the mean vector and Σ the covariance matrix. Let x and y be jointly Gaussian random vectors [ x y ] ([ ] [ ]) µx A C N, µ y C B ( [ ] [ ] 1 ) µx Ã C = N, µ y C, B then the marginal distribution of x and the conditional distribution of x given y are x N (µ x, A), and x y N ( µ x + CB 1 (y µ y ), A CB 1 C ) or x y N ( µ x Ã 1 C(y µy ), Ã 1). (1) 6/30

Products of Gaussians The product of two Gaussians gives another (un-normalized) Gaussian N (x a, A)N (x b, B) = Z 1 N (x c, C) (2) where c = C(A 1 a + B 1 b) and C = (A 1 + B 1 ) 1. Notice that the resulting Gaussian has a precision (inverse variance) equal to the sum of the precisions and a mean equal to the convex sum of the means, weighted by the precisions. The normalizing constant looks itself like a Gaussian (in a or b) Z 1 = (2π) D/2 A + B 1/2 exp ( 1 2 (a b) (A + B) 1 (a b) ). 7/30

Gaussian Processes Definition: A Gaussian Process is a collection of random variables any finite number of which have (consistent) joint Gaussian distributions. Key observation: A Gaussian process is a natural generalization from vectors to functions: f(x) GP(m(x), k(x, x )), and is fully specified by its mean function m(x) and covariance function k(x, x ). This may sound very impractical......but we are saved by the marginalization property. Recall: p(x) = p(x, y)dy, for Gaussians: p(x, y) = N ([ a b ], [ A B B C ] ) = p(x) = N (a, A) 8/30

Gaussian Processes specify distributions over functions To generate a random sample from a D dimensional joint Gaussian with covariance matrix S and mean vector m: (in matlab) x = randn(d,1); y = chol(s) *x + m; where chol is the Cholesky factor R such that R R = S. Thus, the covariance of y is: E[yy ] = E[R xx R] = R E[xx ]R = R IR = S. Example: smooth, stationary, Squared Exponential (or Gaussian) GP: ( f(x) GP m(x) = 0, k(x, x ) = exp ( 1 2 (x x ) 2)). Do try this at home! 9/30

2 output, f(x) 1 0 1 2 5 0 5 input, x 10/30

Function drawn at random from a Gaussian Process with Gaussian covariance 8 7 6 5 4 3 2 1 0 1 2 6 4 2 0 2 4 6 6 4 2 0 2 4 6 11/30

Bayesian Inference, parametric model Supervised parametric learning: data: x, y model: y = f w (x) + ε Gaussian likelihood: p(y x, w, M i ) c exp( 1 2 (y c f w (x c )) 2 /σ 2 noise). Parameter prior: p(w M i ) Posterior parameter distribution by Bayes rule p(a b) = p(b a)p(a)/p(b): p(w x, y, M i ) = p(w M i)p(y x, w, M i ) p(y x, M i ) 12/30

Bayesian Inference, parametric model, cont. Making predictions: p(y x, x, y, M i ) = p(y w, x, M i )p(w x, y, M i )dw Marginal likelihood: p(y x, M i ) = p(w M i )p(y x, w, M i )dw. Model probability: p(m i x, y) = p(m i)p(y x, M i ) p(y x) Problem: integrals are intractable for most interesting models! 13/30

Non-parametric Gaussian process models In our non-parametric model, the parameters are the function itself! Gaussian likelihood: y x, f(x), M i N (f, σ 2 noisei) (Zero mean) Gaussian process prior: f(x) M i GP ( m(x) 0, k(x, x ) ) Leads to a Gaussian process posterior f(x) x, y, M i GP ( m post (x) = k(x, x)[k(x, x) + σ 2 noisei] 1 y, k post (x, x ) = k(x, x ) k(x, x)[k(x, x) + σ 2 noisei] 1 k(x, x ) ). And a Gaussian predictive distribution: y x, x, y, M i N ( k(x, x) [K + σ 2 noisei] 1 y, k(x, x ) + σ 2 noise k(x, x) [K + σ 2 noisei] 1 k(x, x) ) 14/30

Prior and Posterior 2 2 output, f(x) 1 0 1 output, f(x) 1 0 1 2 5 0 5 input, x 2 5 0 5 input, x Predictive distribution: p(y x, x, y) N ( k(x, x) [K + σ 2 noisei] 1 y, k(x, x ) + σ 2 noise k(x, x) [K + σ 2 noisei] 1 k(x, x) ) 15/30

Graphical model for Gaussian Process observations y (1) y ( ) * y (c) Gaussian field f (1) f ( ) * f (c) inputs x (1) x (2) x ( ) x (c) * Square nodes are clamped, round nodes stochastic (free). Thick black line indicates every-one is connected to everyone. Notice, that adding a triplet x (i), f (i), y (i) (where the input x (i) is clamped) does not influence the distribution (as long as y (i) is not clamped). This is guaranteed by the consistency of the GP. This explains why we can make inference using a finite amount of computation! 16/30

The marginal likelihood Log marginal likelihood: log p(y x, M i ) = 1 2 y K 1 y 1 2 log K n 2 log(2π) is the combination of a data fit term and complexity penalty. Occam s Razor is automatic. Learning in Gaussian process models involves finding the form of the covariance function, and any unknown (hyper-) parameters θ. This can be done by optimizing the marginal likelihood: log p(y x, θ, M i ) θ j = 1 2 y K 1 K θ j K 1 y 1 2 trace(k 1 K θ j ) 17/30

Example: Fitting the length scale parameter Parameterized covariance function: k(x, x ) = v 2 exp ( (x x ) 2 2l 2 ) + σ 2 n δ xx. 1.5 1 observations too short good length scale too long 0.5 0 0.5 10 8 6 4 2 0 2 4 6 8 10 The mean posterior predictive function is plotted for 3 different length scales (the green curve corresponds to optimizing the marginal likelihood). Notice, that an almost exact fit to the data can be achieved by reducing the length scale but the marginal likelihood does not favour this! 18/30

Why, in principle, does Bayesian Inference work? Occam s Razor too simple P(Y M i ) "just right" too complex Y All possible data sets 19/30

The weight-space view Linear model: Gaussian noise. assume y = x w + ε = f(x) + ε, where ε N (0, σnoise 2 ) is independent Likelihood: p(y X, w) = = n p(y i x i, w) = i=1 n i=1 1 exp ( (y i x i w)2 ) 2πσn 2σ 2 n 1 (2πσn) exp ( 1 2 n/2 2σn 2 y X w 2) = N (X w, σni). 2 Note, that the likelihood also has a Gaussian shape w.r.t. normalized. the parameter w, although not 20/30

Maximum Likelihood Maximizing the likelihood over weights: log p(y X, w) w = 0 = w ML = (XX ) 1 Xy. (The solution could be ill determined if XX is close to singular.) Predictive distribution: p(f x, D, w ML ) = N (x w ML, σ 2 n). 21/30

The Bayesian Approach Define the prior distribution: eg, Σ p = σ 2 pi, for some σ 2 p. p(w) = N (0, Σ p ), What does this prior mean? Posterior Likelihood Prior p(w X, y) exp ( 1 2σn 2 (y X w) (y X w) ) exp ( 1 2 w Σ 1 p w ) Ie, the posterior is Gaussian: exp ( 1 2 (w w) ( 1 σ 2 n XX + Σ 1 ) ) p (w w), p(w X, y) N ( w = 1 σn 2 A 1 Xy, A 1 ), where A = σn 2 XX + Σ 1 p. 22/30

Making Predictions The predictive distribution is again Gaussian: p(f x, X, y) = p(f x, w)p(w X, y)dw = = N ( 1 σn 2 x A 1 Xy, x A 1 ) x. x w p(w X, y)dw Notice, how the error-bars increase when predicting further away from the origin. Why? Outstanding Problems: The model is still too limited (only linear). What about σ 2 n and Σ p? 23/30

Nonlinear Regression Instead of x, we can use a linear model in some fixed function basis φ(x) = (ϕ 1 (x), ϕ 2 (x),...). Example: φ(x) = (1, x, x 2,...), turns: f(x) = φ(x) w, into polynomial regression, which is more flexible. Note: the analysis from the previous slides can be reused by replacing x by φ(x). The maximum likelihood parameters become: w ML = (Φ(X)Φ(X) ) 1 Φ(X)y. with design matrix Φ(X) = (φ(x 1 ), φ(x 2 ),...). 24/30

Re-introducing the Feature Space The predictive distribution: p(f x, X, y) = p(f x, w)p(w X, y)dw where A = σ 2 n where K = Φ Σ p Φ. = N ( 1 σn 2 φ(x ) A 1 Φ(X)y, φ(x ) A 1 φ(x ) ), Φ(X)Φ(X) + Σ 1 p. This can be re-written as: f x, X, y N ( φ Σ p Φ(K + σ 2 ni) 1 y, φ Σ p φ φ Σ p Φ(K + σ 2 ni) 1 Φ Σ p φ ), Notice, that this last expression contains φ(x) only in the form of inner products can be possibly replaced by kernel functions k(x, x ) = φ(x) Σ p φ(x )! 25/30

On the correspondence between weight space and function space views There is an exact correspondence between the two views: Given a set of m basis functions, φ(x), it is obvious that we can construct the corresponding covariance function: k(x, x ) = φ(x) Σ p φ(x ). conversely, given a positive definite covariance function k(x, x ) we can decompose it (Mercer s theorem): k(x, x ) = λ i φ i (x)φ i (x ), where λ i and φ i are eigenvalues and eigenfunctions respectively, obeying: k(x, x )φ i (x)dx = λφ i (x ). i=1 The function space view is valid irrespective of whether the corresponding eigendecomposition is finite. The squared exponential covariance function does not have a finite decomposistion. 26/30

From random functions to covariance functions Consider the class of linear functions: f(x) = ax + b, where a N (0, α), and b N (0, β). We can compute the mean function: µ(x) = E[f(x)] = f(x)p(a)p(b)dadb = axp(a)da + bp(b)db = 0, and covariance function: k(x, x ) = E[(f(x) 0)(f(x ) 0)] = (ax + b)(ax + b)p(a)p(b)dadb = a 2 xx p(a)da + b 2 p(b)db + (x + x ) abp(a)p(b)dadb = αxx + β. 27/30

Predictions from a noise free Gaussian Process Imagine that we have observed the value of the function at some locations {X, f}, and seek to make predictions f at (one or more) test inputs X. The joint distribution under the GP is: [ f f ] N ( 0, [ ] K(X, X) K(X, X ) ). K(X, X) K(X, X ) Now, we can condition on the obsevations, to obtain f X, X, f N ( f ; K(X, X )K(X, X) 1 f, K(X, X ) K(X, X )K(X, X) 1 K(X, X) ). Remark: Conditioning a Gaussian gives another Gaussian: [ x y] N ( [ ]) A C T 0, C B = x y N (C T B 1 y, A C T B 1 C) 28/30

Predictions from a noisy Gaussian Process More commonly, we have access to noisy observations {X, y}. Assuming independent Gaussian noise, the joint distribution of observations and test predictions is: [ ] y N ( [ ] K(X, X) + σ 2 0, n I K(X, X ) ), f K(X, X) K(X, X ) Conditioning on the observations: f X, X, y N ( K(X, X )[K(X, X) + σ 2 ni] 1 y, K(X, X ) K(X, X )[K(X, X) + σ 2 ni] 1 K(X, X) ). Note: Equating K(X, Z) with Φ(X) Σ p Φ(Z) for the Bayesian linear model, gives exactly identical results! 29/30

Some Interpretation Recall our main result: f X, X, y N ( K(X, X)[K(X, X) + σni] 2 1 y, K(X, X ) K(X, X)[K(X, X) + σni] 2 1 K(X, X ) ). The mean is linear in two ways: µ(x ) = k(x, X)[K(X, X) + σ 2 n] 1 y = n β c y (c) = c=1 n α c k(x, x (c) ). c=1 The last form is most commonly encountered in the kernel literature. The variance is the difference between two terms: V (x ) = k(x, x ) k(x, X)[K(X, X) + σ 2 ni] 1 k(x, x ), the first term is the prior variance, from which we subtract a (positive) term, telling how much the data X has explained. Note, that the variance is independent of the observed outputs y. 30/30