Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression

Definition (Rep.) A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. The number of random variables can be infinite! This means: a GP is a Gaussian distribution over functions. To specify a GP we need a mean function $m(x) = E[y(x)]$ and a covariance function $k(x_1, x_2) = E[(y(x_1) - m(x_1))(y(x_2) - m(x_2))]$.

Example (Rep.) Green line: sinusoidal data source; blue circles: data points with Gaussian noise; red line: mean function of the Gaussian process; shaded red area: 2σ confidence interval.

A Simple Example (Rep.) In Bayesian linear regression, we had $y(x) = \phi(x)^T w$ with prior probability $w \sim \mathcal{N}(0, \Sigma_p)$. This means: $E[y(x)] = \phi(x)^T E[w] = 0$ and $E[y(x_1) y(x_2)] = \phi(x_1)^T E[w w^T] \phi(x_2) = \phi(x_1)^T \Sigma_p \phi(x_2)$. Any number of function values $y(x_1), \dots, y(x_N)$ is jointly Gaussian with zero mean. The covariance function of this process is $k(x_1, x_2) = \phi(x_1)^T \Sigma_p \phi(x_2)$. In general, any valid kernel function can be used.

The Covariance Function (Rep.) The most used covariance function (kernel) is the squared exponential, also known as radial basis function or Gaussian kernel: $k(x_p, x_q) = \sigma_f^2 \exp\!\left(-\frac{1}{2l^2}(x_p - x_q)^2\right) + \sigma_n^2 \delta_{pq}$, with signal variance $\sigma_f^2$, length scale $l$, and noise variance $\sigma_n^2$. Other possibilities exist, e.g. the exponential kernel $k(x_p, x_q) = \exp(-|x_p - x_q|)$, which is used in the Ornstein-Uhlenbeck process.
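
To make the two kernels concrete, here is a minimal Python/NumPy sketch (not part of the original slides); the names squared_exponential, exponential, sigma_f, length_scale, and sigma_n are illustrative choices.

```python
import numpy as np

def squared_exponential(xp, xq, sigma_f=1.0, length_scale=1.0, sigma_n=0.1):
    """Squared exponential (RBF / Gaussian) kernel with an optional noise term."""
    xp, xq = np.atleast_1d(xp), np.atleast_1d(xq)
    sq_dist = np.sum((xp - xq) ** 2)
    noise = sigma_n ** 2 if np.allclose(xp, xq) else 0.0   # sigma_n^2 * delta_pq
    return sigma_f ** 2 * np.exp(-0.5 * sq_dist / length_scale ** 2) + noise

def exponential(xp, xq):
    """Exponential kernel, as used in the Ornstein-Uhlenbeck process."""
    return np.exp(-np.linalg.norm(np.atleast_1d(xp) - np.atleast_1d(xq)))
```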

Sampling from a GP (Rep.) Just as we can sample from a Gaussian distribution, we can also generate samples from a GP. Every sample will then be a function! Process (a code sketch follows below):
1. Choose a number of input points $x_1, \dots, x_M$.
2. Compute the covariance matrix $K$ with $K_{ij} = k(x_i, x_j)$.
3. Generate a random Gaussian vector $y \sim \mathcal{N}(0, K)$.
4. Plot the values $x_1, \dots, x_M$ versus $y_1, \dots, y_M$.
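
A possible NumPy rendering of these four steps, as a sketch: it assumes a noise-free squared exponential kernel with unit parameters, and the small jitter term is added only for numerical stability.

```python
import numpy as np

def se_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    # squared exponential kernel on scalar inputs (broadcasts over arrays)
    return sigma_f**2 * np.exp(-0.5 * (x1 - x2)**2 / length_scale**2)

# 1. choose input points x_1, ..., x_M
x = np.linspace(-5.0, 5.0, 200)

# 2. compute the covariance matrix K with K_ij = k(x_i, x_j)
K = se_kernel(x[:, None], x[None, :])

# 3. draw a Gaussian vector y ~ N(0, K); jitter keeps K numerically positive definite
y = np.random.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)))

# 4. plot x_1,...,x_M versus y_1,...,y_M, e.g. with matplotlib:
# import matplotlib.pyplot as plt; plt.plot(x, y); plt.show()
```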

Sampling from a GP (Rep.) (Figure: sample functions drawn with the squared exponential kernel and with the exponential kernel.)

Prediction with a Gaussian Process Most often we are more interested in predicting new function values for given input data. We have training data $x_1, \dots, x_N$ with targets $y_1, \dots, y_N$ and test inputs $x_{*1}, \dots, x_{*M}$, and we want the test outputs $y_{*1}, \dots, y_{*M}$. The joint probability is
$\begin{pmatrix} y \\ y_* \end{pmatrix} \sim \mathcal{N}\!\left(0, \begin{pmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix}\right)$
and we need to compute $p(y_* \mid x_*, X, y)$.

Gaussian Conditionals Assume we have two variables $x_a$ and $x_b$ that are jointly Gaussian distributed, i.e. $x \sim \mathcal{N}(x \mid \mu, \Sigma)$ with
$x = \begin{pmatrix} x_a \\ x_b \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}.$
Then it follows that $p(x_a \mid x_b) = \mathcal{N}(x_a \mid \mu_{a|b}, \Sigma_{a|b})$, where $\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b)$ and $\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$ (the Schur complement).
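
As a quick sanity check of these formulas, the following sketch conditions a hypothetical two-dimensional Gaussian; all numbers are made up for illustration.

```python
import numpy as np

mu = np.array([0.0, 1.0])          # (mu_a, mu_b)
Sigma = np.array([[2.0, 0.8],      # [[Sigma_aa, Sigma_ab],
                  [0.8, 1.0]])     #  [Sigma_ba, Sigma_bb]]
x_b = 2.0                          # observed value of x_b

S_aa, S_ab, S_ba, S_bb = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]

mu_cond = mu[0] + S_ab / S_bb * (x_b - mu[1])   # mu_a + Sigma_ab Sigma_bb^{-1} (x_b - mu_b)
S_cond = S_aa - S_ab / S_bb * S_ba              # Schur complement
print(mu_cond, S_cond)                          # 0.8 1.36
```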

Gaussian Marginals and Conditionals Main idea of the proof for the conditional (using the inverse of block matrices):
$\begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} I & 0 \\ -\Sigma_{bb}^{-1}\Sigma_{ba} & I \end{pmatrix} \begin{pmatrix} (\Sigma/\Sigma_{bb})^{-1} & 0 \\ 0 & \Sigma_{bb}^{-1} \end{pmatrix} \begin{pmatrix} I & -\Sigma_{ab}\Sigma_{bb}^{-1} \\ 0 & I \end{pmatrix}$
The lower-right block corresponds to a quadratic form that depends only on $x_b$, i.e. it yields $p(x_b)$; the rest can be identified with the conditional normal distribution $p(x_a \mid x_b)$. (For details see, e.g., Bishop, pages 86/87.)

Prediction with a Gaussian Process In the case of only one test point $x_*$ we have $K(X, x_*) = (k(x_1, x_*), \dots, k(x_N, x_*))^T =: k_*$. Now we compute the conditional distribution $p(y_* \mid x_*, X, y) = \mathcal{N}(y_* \mid \mu_*, \sigma_*^2)$, where $\mu_* = k_*^T K^{-1} y$ and $\sigma_*^2 = k(x_*, x_*) - k_*^T K^{-1} k_*$. This defines the predictive distribution.
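
A direct NumPy sketch of this predictive distribution for one test point; the data are hypothetical, and here K already contains the noise variance sigma_n^2 on its diagonal, as in the implementation slide further below.

```python
import numpy as np

def se_kernel(a, b, sigma_f=1.0, length_scale=1.0):
    return sigma_f**2 * np.exp(-0.5 * (a - b)**2 / length_scale**2)

X = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])       # training inputs (made up)
y = np.sin(X)                                   # training targets (made up)
x_star, sigma_n = 0.5, 0.1                      # test input and assumed noise level

K = se_kernel(X[:, None], X[None, :]) + sigma_n**2 * np.eye(len(X))
k_star = se_kernel(X, x_star)                   # k_* = (k(x_1,x_*), ..., k(x_N,x_*))

mu_star = k_star @ np.linalg.solve(K, y)        # k_*^T K^{-1} y
var_star = se_kernel(x_star, x_star) - k_star @ np.linalg.solve(K, k_star)
```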

Example (Figure: functions sampled from a Gaussian process prior, and functions sampled from the predictive distribution.) The predictive distribution is itself a Gaussian process. It represents the posterior after observing the data. The covariance is low in the vicinity of data points.

Varying the Hyperparameters (Figure: 20 data samples and the GP prediction for different kernel hyperparameters: $l = 1, \sigma_f = 1, \sigma_n = 0.1$; $l = 0.3, \sigma_f = 1.08, \sigma_n = 0.0005$; $l = 3, \sigma_f = 1.16, \sigma_n = 0.89$.)

Varying the Hyperparameters The squared exponential covariance function can be generalized to $k(x_p, x_q) = \sigma_f^2 \exp\!\left(-\tfrac{1}{2}(x_p - x_q)^T M (x_p - x_q)\right) + \sigma_n^2 \delta_{pq}$, where $M$ can be: $M = l^{-2} I$ (this is equal to the above case); $M = \mathrm{diag}(l_1, \dots, l_D)^{-2}$ (every feature dimension has its own length-scale parameter); or $M = \Lambda\Lambda^T + \mathrm{diag}(l_1, \dots, l_D)^{-2}$, where $\Lambda$ has fewer than $D$ columns.

Varying the Hyperparameters (Figure: sample functions over two inputs $x_1, x_2$ for different choices of $M$: $M = I$, $M = \mathrm{diag}(1, 3)^{-2}$, and a matrix of the form $\Lambda\Lambda^T + \mathrm{diag}(6, 6)^{-2}$; axes are the inputs $x_1$, $x_2$ and the output $y$.)

Implementation Algorithm 1: GP regression. Data: training data $(X, y)$, test input $x_*$. Input: hyperparameters $\sigma_f$, $l$, $\sigma_n$.
Training phase:
  $K_{ij} \leftarrow k(x_i, x_j)$
  $L \leftarrow \mathrm{cholesky}(K + \sigma_n^2 I)$
  $\alpha \leftarrow L^T \backslash (L \backslash y)$
  $\log p(y \mid X) \leftarrow -\tfrac{1}{2} y^T \alpha - \sum_i \log L_{ii} - \tfrac{N}{2}\log 2\pi$
Test phase:
  $E[f_*] \leftarrow k_*^T \alpha$
  $v \leftarrow L \backslash k_*$
  $V[f_*] \leftarrow k(x_*, x_*) - v^T v$
The Cholesky decomposition is numerically stable and can be used to compute the inverse efficiently.
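
A minimal NumPy/SciPy sketch of Algorithm 1, assuming a scalar test input and a kernel function such as the squared exponential above; this is an illustrative reimplementation, not the course's reference code.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gp_regression(X, y, x_star, kernel, sigma_n):
    """Cholesky-based GP regression: predictive mean/variance and log marginal likelihood."""
    N = len(X)
    K = kernel(X[:, None], X[None, :])                                 # K_ij = k(x_i, x_j)

    # training phase
    L = cholesky(K + sigma_n**2 * np.eye(N), lower=True)               # L <- cholesky(K + sigma_n^2 I)
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))  # alpha <- L^T \ (L \ y)
    log_marginal = (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
                    - 0.5 * N * np.log(2 * np.pi))

    # test phase
    k_star = kernel(X, x_star)
    mean = k_star @ alpha                                              # E[f_*] = k_*^T alpha
    v = solve_triangular(L, k_star, lower=True)                        # v <- L \ k_*
    var = kernel(x_star, x_star) - v @ v                               # V[f_*] = k(x_*,x_*) - v^T v
    return mean, var, log_marginal
```

The training phase costs $O(N^3)$ once for the Cholesky factorization; each additional test point then only requires the $O(N^2)$ test phase.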

Estimating the Hyperparameters To find optimal hyperparameters we need the marginal likelihood: $p(y \mid X) = \int p(y \mid f, X)\, p(f \mid X)\, df$. This expression implicitly depends on the hyperparameters, while $y$ and $X$ are given by the training data. It can be computed in closed form, as all terms are Gaussians. We take the logarithm, compute the derivative and set it to zero. This is the training step.
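
For illustration, the hyperparameters could be fitted by numerically minimizing the negative log marginal likelihood; the sketch below uses a gradient-free optimizer purely to keep the example short (in practice one would use the analytic gradients), and the data are made up. As the next slide notes, this objective can have several local optima, so restarts from different initializations are advisable.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import cholesky, solve_triangular

def neg_log_marginal_likelihood(log_params, X, y):
    """-log p(y | X) for a squared exponential kernel; log-parameters keep sigma_f, l, sigma_n positive."""
    sigma_f, length_scale, sigma_n = np.exp(log_params)
    K = sigma_f**2 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / length_scale**2) \
        + sigma_n**2 * np.eye(len(X))
    L = cholesky(K, lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(X) * np.log(2 * np.pi)

X = np.linspace(-5, 5, 30)
y = np.sin(X) + 0.1 * np.random.randn(len(X))
result = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, y), method="Nelder-Mead")
sigma_f, length_scale, sigma_n = np.exp(result.x)
```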

Estimating the Hyperparameters (Figure: log marginal likelihood as a function of the characteristic length scale and the noise standard deviation, together with the corresponding GP fits.) The log marginal likelihood is not necessarily concave, i.e. it can have local maxima. The local maxima can correspond to sub-optimal solutions.

Automatic Relevance Determination We have seen how the covariance function can be generalized using a matrix $M$. If $M$ is diagonal, this results in the kernel function $k(x, x') = \sigma_f^2 \exp\!\left(-\tfrac{1}{2}\sum_{i=1}^{D} \eta_i (x_i - x'_i)^2\right)$. We can interpret the $\eta_i$ as weights for each feature dimension. Thus, if the length scale $l_i = 1/\eta_i$ of an input dimension is large, the input is less relevant. During training this is done automatically.
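
A sketch of such an ARD kernel in NumPy; the weight vector eta and the example values are illustrative assumptions.

```python
import numpy as np

def ard_kernel(x1, x2, sigma_f=1.0, eta=None):
    """ARD squared exponential: each input dimension i has its own weight eta_i."""
    x1, x2 = np.atleast_1d(x1), np.atleast_1d(x2)
    eta = np.ones_like(x1) if eta is None else np.asarray(eta)
    return sigma_f**2 * np.exp(-0.5 * np.sum(eta * (x1 - x2)**2))

# a small eta_i (i.e. a large length scale l_i) makes dimension i nearly irrelevant
print(ard_kernel([0.0, 0.0], [0.0, 1.0], eta=[1.0, 1.0]))    # second dimension matters
print(ard_kernel([0.0, 0.0], [0.0, 1.0], eta=[1.0, 1e-6]))   # second dimension ignored (~1.0)
```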

Automatic Relevance Determination (Figure: 3-dimensional data; the parameters $\eta_1, \eta_2, \eta_3$ as they evolve during training.) During the optimization process to learn the hyperparameters, the reciprocal length scale for one parameter decreases, i.e. this hyperparameter is not very relevant!

Computer Vision Group Prof. Daniel Cremers. Gaussian Processes - Classification

Gaussian Processes for Classification In regression we have $y \in \mathbb{R}$; in binary classification we have $y \in \{-1, +1\}$. To use a GP for classification, we can apply a sigmoid function to the posterior obtained from the GP and compute the class probability as $p(y = +1 \mid x) = \sigma(f(x))$. If the sigmoid function is symmetric, i.e. $\sigma(-z) = 1 - \sigma(z)$, then we have $p(y \mid x) = \sigma(y f(x))$. A typical sigmoid function is the logistic sigmoid $\sigma(z) = \frac{1}{1 + \exp(-z)}$.

Application of the Sigmoid Function (Figure: a function sampled from a Gaussian process, and the sigmoid function applied to that GP function.) Another symmetric sigmoid function is the cumulative Gaussian: $\Phi(z) = \int_{-\infty}^{z} \mathcal{N}(x \mid 0, 1)\, dx$.
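
Both sigmoid choices are readily available in Python; a small sketch comparing them (scipy.stats.norm.cdf implements the cumulative Gaussian):

```python
import numpy as np
from scipy.stats import norm

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
print(np.round(logistic(z), 3))    # logistic sigmoid
print(np.round(norm.cdf(z), 3))    # cumulative Gaussian: steeper transition around z = 0
```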

Visualization of Sigmoid Functions (Figure: the cumulative Gaussian is slightly steeper than the logistic sigmoid.)

The Latent Variables In regression, we directly estimated $f$ as $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, and values of $f$ were observed in the training data. Now only labels +1 or -1 are observed, and $f$ is treated as a set of latent variables. A major advantage of the Gaussian process classifier over other methods is that it marginalizes over all latent functions rather than maximizing some model parameters.

Class Prediction with a GP The aim is to compute the predictive distribution $p(y_* = +1 \mid X, y, x_*) = \int p(y_* \mid f_*)\, p(f_* \mid X, y, x_*)\, df_*$, where $p(y_* \mid f_*) = \sigma(f_*)$.

Class Prediction with a GP The aim is to compute the predictive distribution $p(y_* = +1 \mid X, y, x_*) = \int p(y_* \mid f_*)\, p(f_* \mid X, y, x_*)\, df_*$. We marginalize over the latent variables from the training data: $p(f_* \mid X, y, x_*) = \int p(f_* \mid X, x_*, f)\, p(f \mid X, y)\, df$, where $p(f_* \mid X, x_*, f)$ is the predictive distribution of the latent variable (from regression).

Class Prediction with a GP The aim is to compute the predictive distribution $p(y_* = +1 \mid X, y, x_*) = \int p(y_* \mid f_*)\, p(f_* \mid X, y, x_*)\, df_*$. We marginalize over the latent variables from the training data: $p(f_* \mid X, y, x_*) = \int p(f_* \mid X, x_*, f)\, p(f \mid X, y)\, df$. For this we need the posterior over the latent variables: $p(f \mid X, y) = \frac{p(y \mid f)\, p(f \mid X)}{p(y \mid X)}$, with likelihood (sigmoid) $p(y \mid f)$, prior $p(f \mid X)$, and normalizer $p(y \mid X)$.

A Simple Example (Figure: red: two-class training data; green: mean function of $p(f \mid X, y)$; light blue: sigmoid of the mean function.)

But There Is a Problem... In $p(f \mid X, y) = \frac{p(y \mid f)\, p(f \mid X)}{p(y \mid X)}$ the likelihood term is not a Gaussian! This means we cannot compute the posterior in closed form. There are several different solutions in the literature, e.g. the Laplace approximation, Expectation Propagation, and variational methods.

Laplace Approximation We approximate $p(f \mid X, y) \approx q(f \mid X, y) = \mathcal{N}(f \mid \hat{f}, A^{-1})$, where $\hat{f} = \arg\max_f p(f \mid X, y)$ and $A = -\nabla\nabla \log p(f \mid X, y)\big|_{f = \hat{f}}$ (a second-order Taylor expansion). To compute $\hat{f}$, an iterative approach using Newton's method has to be used. The Hessian matrix $A$ can be computed as $A = K^{-1} + W$, where $W = -\nabla\nabla \log p(y \mid f)$ is a diagonal matrix which depends on the sigmoid function.
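
A simplified sketch of this Newton iteration for the logistic likelihood; it omits a convergence test and the numerically stabilized $W^{1/2}$ formulation that a practical implementation would use. Here K is the kernel matrix of the training inputs and y holds labels in {-1, +1}.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_mode(K, y, n_iter=20):
    """Newton iterations for the mode f_hat of p(f | X, y) under a logistic likelihood."""
    N = len(y)
    f = np.zeros(N)
    for _ in range(n_iter):
        pi = sigmoid(f)
        grad = (y + 1) / 2.0 - pi            # gradient of log p(y | f) for y in {-1, +1}
        W = np.diag(pi * (1.0 - pi))         # W = -Hessian of log p(y | f), diagonal
        b = W @ f + grad
        # Newton step: f <- (K^{-1} + W)^{-1} (W f + grad) = (I + K W)^{-1} K b
        f = np.linalg.solve(np.eye(N) + K @ W, K @ b)
    return f, W                              # f approximates f_hat; W enters A = K^{-1} + W
```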

Laplace Approximation (Figure: yellow: a non-Gaussian posterior; red: its Gaussian approximation.) The mean of the approximation is the mode of the posterior; the variance is the inverse of the negative second derivative of the log posterior at the mode.

Predictions Now that we have $p(f \mid X, y)$, we can compute $p(f_* \mid X, y, x_*) = \int p(f_* \mid X, x_*, f)\, p(f \mid X, y)\, df$. From the regression case we have $p(f_* \mid X, x_*, f) = \mathcal{N}(f_* \mid \mu_*, \sigma_*^2)$, where $\mu_* = k_*^T K^{-1} f$ (linear in $f$) and $\sigma_*^2 = k(x_*, x_*) - k_*^T K^{-1} k_*$. This reminds us of a property of Gaussians that we saw earlier!

Gaussian Properties (Rep.) If we are given I. $p(x) = \mathcal{N}(x \mid \mu, \Sigma_1)$ and II. $p(y \mid x) = \mathcal{N}(y \mid Ax + b, \Sigma_2)$, then it follows (properties of Gaussians) that III. $p(y) = \mathcal{N}(y \mid A\mu + b, \Sigma_2 + A \Sigma_1 A^T)$ and IV. $p(x \mid y) = \mathcal{N}(x \mid \Sigma(A^T \Sigma_2^{-1}(y - b) + \Sigma_1^{-1}\mu), \Sigma)$, where $\Sigma = (\Sigma_1^{-1} + A^T \Sigma_2^{-1} A)^{-1}$.

Applying This to Laplace $E[f_* \mid X, y, x_*] = k(x_*)^T K^{-1} \hat{f}$ and $V[f_* \mid X, y, x_*] = k(x_*, x_*) - k_*^T (K + W^{-1})^{-1} k_*$. It remains to compute $p(y_* = +1 \mid X, y, x_*) = \int p(y_* \mid f_*)\, p(f_* \mid X, y, x_*)\, df_*$. Depending on the kind of sigmoid function, we can either compute this in closed form (cumulative Gaussian sigmoid) or have to use sampling methods or analytical approximations (logistic sigmoid).
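
For the cumulative Gaussian sigmoid the remaining integral is analytic: averaging $\Phi(f_*)$ over $\mathcal{N}(f_* \mid \mu_*, \sigma_*^2)$ gives $\Phi(\mu_* / \sqrt{1 + \sigma_*^2})$, a standard Gaussian identity stated here as background rather than taken from the slides. A one-line sketch:

```python
import numpy as np
from scipy.stats import norm

def class_probability_probit(mu_star, var_star):
    # p(y_* = +1 | X, y, x_*) = Phi(mu_* / sqrt(1 + var_*)) for the cumulative Gaussian sigmoid
    return norm.cdf(mu_star / np.sqrt(1.0 + var_star))
```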

A Simple Example (Figure: two-class problem with training data in red and blue; green line: optimal decision boundary; black line: GP classifier decision boundary; right: posterior probability.)

Summary Kernel methods solve problems by implicitly mapping the data into a (high-dimensional) feature space. The feature function itself is not used; instead, the algorithm is expressed in terms of the kernel. Gaussian processes are normal distributions over functions. To specify a GP we need a covariance function (kernel) and a mean function. More on Gaussian Processes: http://videolectures.net/epsrcws08_rasmussen_lgp/