Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression

Size: px

Start display at page:

Download "Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression"

Louise Cannon
5 years ago
Views:

1 Group Prof. Daniel Cremers 4. Gaussian Processes - Regression

2 Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. The number of random variables can be infinite! This means: a GP is a Gaussian distribution over functions! To specify a GP we need: mean function: m(x) =E[y(x)] covariance function: k(x 1, x )=E[y(x 1 ) m(x 1 )y(x ) m(x )] Group

3 Example (Rep.) green line: sinusoidal data source blue circles: data points with Gaussian noise red line: mean function of the Gaussian process shaded red area: σ confidence interval 3 Group

4 A Simple Example (Rep.) In Bayesian linear regression, we had y(x) = (x) T w with prior probability w N (0, p ). This means: E[y(x)] = (x) T E[w] =0 E[y(x 1 )y(x ))] = (x 1 ) T E[ww T ] (x )= (x 1 ) T p (x ) Any number of function values y(x 1 ),...,y(x N ) is jointly Gaussian with zero mean. The covariance function of this process is k(x 1, x )= (x 1 ) T p (x ) In general, any valid kernel function can be used. 4 Group

5 The Covariance Function (Rep.) The most used covariance function (kernel) is: k(x p, x q )= f exp( 1 l (x p x q ) )+ n pq signal variance length scale noise variance It is known as squared exponential, radial basis function or Gaussian kernel. Other possibilities exist, e.g. the exponential kernel: k(x p, x q )=exp( x p x q ) This is used in the Ornstein-Uhlenbeck process. 5 Group

6 Sampling from a GP (Rep.) Just as we can sample from a Gaussian distribution, we can also generate samples from a GP. Every sample will then be a function! Process: 1.Choose a number of input points x 1,...,x M.Compute the covariance matrix K where K ij = k(x i, x j ) 3.Generate a random Gaussian vector from y N (0,K) 4.Plot the values x 1,...,x M versus y 1,...,y M 6 Group

7 Sampling from a GP (Rep.) Squared exponential kernel Exponential kernel 7 Group

8 Prediction with a Gaussian Process Most often we are more interested in predicting new function values for given input data. We have: training data test input x 1,...,x N x 1,...,x M y 1,...,y N And we want test outputs y 1,...,y M The joint probability is y y K(X, X) K(X, X ) N 0, K(X,X) K(X,X ) and we need to compute p(y x,x,y). 8 Group

9 Gaussian Conditionals Assume we have two variables and that are jointly Gaussian distributed, i.e. with x = xa x b µ = µa µ b x a x b N (x µ, ) = aa ab Then it follows p(x a x b )=N (x µ a b, a b ) where and µ a b = µ a + ab 1 bb (x b µ b ) a b = aa ab 1 bb ba Schur Complement 9 Group

10 Gaussian Marginals and Conditionals Main idea of the proof for the conditional (using inverse of block matrices): aa ba ab bb 1 = I 0 1 bb ba I ( / bb ) bb I ab 1 bb 0 I The lower line corresponds to a quadratic form that is only dependent on p(x b ), i.e. the rest can be identified with the conditional Normal distribution p(x a x b ). (for details see, e.g. Bishop page 86/87) 10 Group

11 Prediction with a Gaussian Process In the case of only one test point we have Now we compute the conditional distribution where K(X, x )= 0 k(x 1, x )... k(x N, x ) 1 x C A = k p(y x,x,y) =N (y µ, ) µ = k T K 1 t = k(x, x ) k T K 1 k This defines the predictive distribution. 11 Group

12 Example Functions sampled from a Gaussian Process prior Functions sampled from the predictive distribution The predictive distribution is itself a Gaussian process. It represents the posterior after observing the data. The covariance is low in the vicinity of data points. 1 Group

13 Varying the Hyperparameters l = f =1, n = l =0.3, f =1.08, n = data samples GP prediction with different kernel hyper parameters l =3 f =1.16 n = Group

14 Varying the Hyperparameters The squared exponential covariance function can be generalized to k(x p, x q )= f exp( 1 (x p x q ) T M(x p x q )) + n pq where M can be: I M = l : this is equal to the above case : every feature dimension has its own length scale parameter M = diag(l 1,...,l D ) : here Λ has less than M = T + diag(l 1,...,l D ) D columns 14 Group

15 Varying the Hyperparameters 1 1 output y 0 1 output y input x 0 input x1 0 input x 0 input x1 M = I 1 M = diag(1, 3) output y 0 1 M = input x 0 input x1 + diag(6, 6) 15 Group

16 Implementation Algorithm 1: GP regression Data: training data (X, y), test data x Input: Hyper parameters K ij k(x i, x j ) f, l, n L cholesky(k + yi) n L T \(L\y) E[f ] k T v L\k var[f ] k(x, x ) v T v log p(y X) 1 yt P i log L ii Training Phase N log( ) Test Phase Cholesky decomposition is numerically stable Can be used to compute inverse efficiently 16 Group

17 Estimating the Hyperparameters To find optimal hyper parameters we need the marginal likelihood: p(y X) = Z p(y f,x)p(f X)df This expression implicitly depends on the hyper parameters, but y and X are given from the training data. It can be computed in closed form, as all terms are Gaussians. We take the logarithm, compute the derivative and set it to 0. This is the training step. 17 Group

18 Estimating the Hyperparameters noise standard deviation output, y characteristic lengthscale input, x The log marginal likelihood is not necessarily concave, i.e. it can have local maxima. The local maxima can correspond to sub-optimal solutions. 18 output, y input, x Group

19 Automatic Relevance Determination We have seen how the covariance function can be generalized using a matrix M If M is diagonal this results in the kernel function k(x, x 0 )= f exp 1 DX i=1 i (x i x 0 i)! We can interpret the feature dimension i as weights for each Thus, if the length scale l i =1/ i of an input dimension is large, the input is less relevant During training this is done automatically 19 Group

20 Automatic Relevance Determination 3-dimensional data, parameters 1 3 as they evolve during training 1 3 During the optimization process to learn the hyper-parameters, the reciprocal length scale for one parameter decreases, i.e.: This hyper parameter is not very relevant! 0 Group

21 Group Prof. Daniel Cremers Gaussian Processes - Classification

22 Gaussian Processes For Classification In regression we have classification we have y R y { 1; 1}, in binary To use a GP for classification, we can apply a sigmoid function to the posterior obtained from the GP and compute the class probability as: p(y =+1 x) = (f(x)) If the sigmoid function is symmetric: then we have p(y x) = (yf(x)). ( z) =1 (z) A typical type of sigmoid function is the logistic sigmoid: 1 (z) = 1 + exp( z) Group

23 Application of the Sigmoid Function Function sampled from a Gaussian Process Sigmoid function applied to the GP function Another symmetric sigmoid function is the cumulative Gaussian: (z) = Z z 1 N (x 0, 1)dx 3 Group

24 Visualization of Sigmoid Functions The cumulative Gaussian is slightly steeper than the logistic sigmoid 4 Group

25 The Latent Variables In regression, we directly estimated f as f(x) GP(m(x),k(x, x 0 )) and values of f where observed in the training data. Now only labels +1 or -1 are observed and f is treated as a set of latent variables. A major advantage of the Gaussian process classifier over other methods is that it marginalizes over all latent functions rather than maximizing some model parameters. 5 Group

26 Class Prediction with a GP The aim is to compute the predictive distribution Z p(y =+1 X, y, x )= p(y f )p(f X, y, x )df (f ) 6 Group

27 Class Prediction with a GP The aim is to compute the predictive distribution Z p(y =+1 X, y, x )= p(y f )p(f X, y, x )df we marginalize over the latent variables from the training data: p(f X, y, x )= Z p(f X, x, f)p(f X, y)df predictive distribution of the latent variable (from regression) 7 Group

28 Class Prediction with a GP The aim is to compute the predictive distribution Z p(y =+1 X, y, x )= p(y f )p(f X, y, x )df we marginalize over the latent variables from the training data: p(f X, y, x )= Z p(f X, x, f)p(f X, y)df we need the posterior over the latent variables: likelihood (sigmoid) p(f X, y) = p(y f)p(f X) p(y X) prior normalizer 8 Group

29 A Simple Example Red: Two-class training data Green: mean function of p(f X, y) Light blue: sigmoid of the mean function 9 Group

30 But There Is A Problem... p(f X, y) = p(y f)p(f X) p(y X) The likelihood term is not a Gaussian! This means, we can not compute the posterior in closed form. There are several different solutions in the literature, e.g.: Laplace approximation Expectation Propagation Variational methods 30 Group

31 Laplace Approximation p(f X, y) q(f X, y) =N (f ˆf,A 1 ) where and ˆf = arg max p(f X, y) f A = rr log p(f X, y) f=ˆf second-order Taylor expansion To compute ˆf an iterative approach using Newton s method has to be used. The Hessian matrix A can be computed as A = K 1 + W where W = rr log p(y f) is a diagonal matrix which depends on the sigmoid function. 31 Group

32 Laplace Approximation Yellow: a non-gaussian posterior Red: a Gaussian approximation, the mean is the mode of the posterior, the variance is the negative second derivative at the mode 3 Group

33 Predictions Now that we have p(f X, y, x )= p(f X, y) Z we can compute: p(f X, x, f)p(f X, y)df From the regression case we have: p(f X, x, f) =N (f µ, ) where µ = k T K 1 f = k(x, x ) k T K 1 k Linear in f This reminds us of a property of Gaussians that we saw earlier! 33 Group

34 Gaussian Properties (Rep.) If we are given this: I. II. p(x) =N (x µ, 1 ) p(y x) =N (y Ax + b, ) Then it follows (properties of Gaussians): III. IV. p(y) =N (y Aµ + b, + A 1 A T ) p(x y) =N (x (A T 1 (y b)+ 1 y), ) 1 where =( A T 1 s A) 1 34 Group

35 Applying this to Laplace E[f X, y, x ]=k(x ) T K 1ˆf V[f X, y, x ]=k(x, x ) k T (K + W 1 ) 1 k It remains to compute p(y =+1 X, y, x )= Z p(y f )p(f X, y, x )df Depending on the kind of sigmoid function we can compute this in closed form (cumulative Gaussian sigmoid) have to use sampling methods or analytical approximations (logistic sigmoid) 35 Group

36 A Simple Example Two-class problem (training data in red and blue) Green line: optimal decision boundary Black line: GP classifier decision boundary Right: posterior probability 36 Group

37 Summary Kernel methods solve problems by implicitly mapping the data into a (high-dimensional) feature space The feature function itself is not used, instead the algorithm is expressed in terms of the kernel Gaussian Processes are Normal distributions over functions To specify a GP we need a covariance function (kernel) and a mean function More on Gaussian Processes: 37 Group

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step: