Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression


Repetition: Regularized Regression. Before, we solved for w using the pseudoinverse. But we can kernelize this problem as well! The first step is the matrix inversion lemma.

Kernelized Regression. Thus, we have

w = (\lambda I_D + \Phi^T \Phi)^{-1} \Phi^T t = \Phi^T (\lambda I_N + \Phi \Phi^T)^{-1} t

By defining K = \Phi \Phi^T and a = (\lambda I_N + K)^{-1} t we get

y(x) = \phi(x)^T w = \phi(x)^T \Phi^T a = k(x)^T (K + \lambda I_N)^{-1} t

(the same result as in the last lecture). This means that the predicted output is a linear combination of the training outputs, where the coefficients depend on the similarities of the new input to the training inputs.
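As a minimal illustration of this identity, the following sketch (NumPy; the RBF kernel choice, the regularizer value, and the toy sine data are my assumptions, not part of the lecture) predicts via y(x) = k(x)^T (K + \lambda I_N)^{-1} t without ever forming w explicitly.

import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # squared-exponential kernel matrix between the rows of A and B
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / length_scale**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))               # toy training inputs
t = np.sin(X).ravel() + 0.1 * rng.standard_normal(20)

lam = 0.1                                          # regularization parameter lambda
K = rbf_kernel(X, X)                               # K = Phi Phi^T via the kernel trick
a = np.linalg.solve(K + lam * np.eye(len(X)), t)   # a = (K + lambda I_N)^-1 t

X_new = np.linspace(-3, 3, 5)[:, None]
y_new = rbf_kernel(X_new, X) @ a                   # y(x) = k(x)^T a
print(y_new)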

Motivation. We have found a way to predict function values of y for new input points x. As we used regularized regression, we can equivalently find the predictive distribution by marginalizing out the parameters w. Can we find a closed form for that distribution? How can we model the uncertainty of our prediction? Can we use that for classification?

Gaussian Marginals and Conditionals. Before we start, we need some formulae. Assume we have two variables x_a and x_b that are jointly Gaussian distributed, i.e. N(x \mid \mu, \Sigma) with

x = \begin{pmatrix} x_a \\ x_b \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}

Then the conditional distribution is p(x_a \mid x_b) = N(x_a \mid \mu_{a|b}, \Sigma_{a|b}), where

\mu_{a|b} = \mu_a + \Sigma_{ab} \Sigma_{bb}^{-1} (x_b - \mu_b), \quad \Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba} \quad \text{(Schur complement)}

The marginal is p(x_a) = N(x_a \mid \mu_a, \Sigma_{aa}).
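These two formulae are easy to verify numerically. The following sketch (NumPy; the index bookkeeping and the toy numbers are mine) computes the conditional mean and covariance exactly as defined above.

import numpy as np

def gaussian_conditional(mu, Sigma, idx_a, idx_b, x_b):
    # returns mean and covariance of p(x_a | x_b) for a joint Gaussian N(mu, Sigma)
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    gain = S_ab @ np.linalg.inv(S_bb)          # Sigma_ab Sigma_bb^-1
    mu_cond = mu_a + gain @ (x_b - mu_b)
    Sigma_cond = S_aa - gain @ S_ab.T          # Schur complement
    return mu_cond, Sigma_cond

# toy 2-D example with made-up numbers
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
print(gaussian_conditional(mu, Sigma, [0], [1], np.array([2.0])))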

Gaussian Marginals and Conditionals. Main idea of the proof for the conditional (using the inverse of a block matrix):

\begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} I & 0 \\ -\Sigma_{bb}^{-1}\Sigma_{ba} & I \end{pmatrix} \begin{pmatrix} (\Sigma/\Sigma_{bb})^{-1} & 0 \\ 0 & \Sigma_{bb}^{-1} \end{pmatrix} \begin{pmatrix} I & -\Sigma_{ab}\Sigma_{bb}^{-1} \\ 0 & I \end{pmatrix}

The lower block corresponds to a quadratic form that depends only on x_b, i.e. it gives p(x_b); the rest can be identified with the conditional normal distribution p(x_a \mid x_b). (For details see, e.g., Bishop or Murphy.)

Definition. A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. The number of random variables can be infinite! This means: a GP is a Gaussian distribution over functions! To specify a GP we need a mean function m(x) = E[y(x)] and a covariance function k(x_1, x_2) = E[(y(x_1) - m(x_1))(y(x_2) - m(x_2))].

Example. Figure: green line: sinusoidal data source; blue circles: data points with Gaussian noise; red line: mean function of the Gaussian process.

How Can We Handle Infinity? Idea: split the (infinite) set of random variables into a finite and an infinite subset,

x = \begin{pmatrix} x_f \\ x_i \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_f \\ \mu_i \end{pmatrix}, \begin{pmatrix} \Sigma_f & \Sigma_{fi} \\ \Sigma_{fi}^T & \Sigma_i \end{pmatrix} \right)

where x_f is the finite part and x_i the infinite part. From the marginalization property we get

p(x_f) = \int p(x_f, x_i) \, dx_i = N(x_f \mid \mu_f, \Sigma_f)

This means we can work with finite vectors.

A Simple Example. In Bayesian linear regression, we had y(x) = \phi(x)^T w with prior probability w \sim N(0, \Sigma_p). This means

E[y(x)] = \phi(x)^T E[w] = 0
E[y(x_1) y(x_2)] = \phi(x_1)^T E[w w^T] \phi(x_2) = \phi(x_1)^T \Sigma_p \phi(x_2)

Any number of function values y(x_1), ..., y(x_N) is jointly Gaussian with zero mean. The covariance function of this process is k(x_1, x_2) = \phi(x_1)^T \Sigma_p \phi(x_2). In general, any valid kernel function can be used.

The Covariance Function. The most commonly used covariance function (kernel) is

k(x_p, x_q) = \sigma_f^2 \exp\left(-\frac{1}{2 l^2} (x_p - x_q)^2\right) + \sigma_n^2 \delta_{pq}

with signal variance \sigma_f^2, length scale l, and noise variance \sigma_n^2. It is known as the squared exponential, radial basis function, or Gaussian kernel. Other possibilities exist, e.g. the exponential kernel

k(x_p, x_q) = \exp(-|x_p - x_q|)

which is used in the Ornstein-Uhlenbeck process.
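For concreteness, here is a small sketch of both kernels for scalar inputs (the parameter defaults are arbitrary choices of mine; the noise term is added only when the two inputs coincide, standing in for \sigma_n^2 \delta_{pq}).

import numpy as np

def squared_exponential(xp, xq, sigma_f=1.0, length_scale=1.0, sigma_n=0.1):
    # SE / RBF / Gaussian kernel with an additive noise term on the diagonal
    noise = sigma_n**2 if xp == xq else 0.0
    return sigma_f**2 * np.exp(-0.5 * (xp - xq)**2 / length_scale**2) + noise

def exponential(xp, xq):
    # exponential kernel, as used in the Ornstein-Uhlenbeck process
    return np.exp(-abs(xp - xq))

print(squared_exponential(0.0, 0.5), exponential(0.0, 0.5))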

Sampling from a GP. Just as we can sample from a Gaussian distribution, we can also generate samples from a GP. Every sample is then a function! Process:
1. Choose a number of input points x_1, ..., x_M.
2. Compute the covariance matrix K with K_{ij} = k(x_i, x_j).
3. Generate a random Gaussian vector y \sim N(0, K).
4. Plot the values x_1, ..., x_M versus y_1, ..., y_M.
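A minimal sketch of this procedure (NumPy/Matplotlib; the noise-free squared exponential kernel, the input grid, and the jitter constant are my choices):

import numpy as np
import matplotlib.pyplot as plt

def se_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    # squared exponential kernel matrix for 1-D inputs
    return sigma_f**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / length_scale**2)

x = np.linspace(-5, 5, 200)                                  # 1. choose input points
K = se_kernel(x, x) + 1e-8 * np.eye(len(x))                  # 2. covariance matrix (jitter for stability)
rng = np.random.default_rng(1)
y = rng.multivariate_normal(np.zeros(len(x)), K, size=3)     # 3. draw y ~ N(0, K)
plt.plot(x, y.T)                                             # 4. plot x versus the sampled values
plt.show()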

Sampling from a GP. Figure: samples drawn with the squared exponential kernel and with the exponential kernel.

Prediction with a Gaussian Process. Most often we are more interested in predicting new function values for given input data. We have training inputs x_1, ..., x_N with outputs y_1, ..., y_N and test inputs x_{*1}, ..., x_{*M}, and we want the test outputs y_{*1}, ..., y_{*M}. The joint probability is

\begin{pmatrix} y \\ y_* \end{pmatrix} \sim N\left(0, \begin{pmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix}\right)

and we need to compute p(y_* \mid x_*, X, y).

Prediction with a Gaussian Process. In the case of only one test point x_* we have

K(X, x_*) = (k(x_1, x_*), ..., k(x_N, x_*))^T =: k_*

Now we compute the conditional distribution

p(y_* \mid x_*, X, y) = N(y_* \mid \mu_*, \sigma_*^2), \quad \mu_* = k_*^T K^{-1} y, \quad \sigma_*^2 = k(x_*, x_*) - k_*^T K^{-1} k_*

This defines the predictive distribution.
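A direct sketch of these two formulae (NumPy; the noise variance, the kernel, and the toy data are assumptions on my part, and a plain matrix solve is used for clarity — the numerically preferable Cholesky version appears on the Implementation slide below):

import numpy as np

def se_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    return sigma_f**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / length_scale**2)

def gp_predict(X, y, x_star, sigma_n=0.1):
    # predictive mean and variance at a single test input x_star
    K = se_kernel(X, X) + sigma_n**2 * np.eye(len(X))   # noisy training covariance
    k_star = se_kernel(X, np.array([x_star]))[:, 0]     # k_* = (k(x_1,x_*), ..., k(x_N,x_*))
    mu = k_star @ np.linalg.solve(K, y)                 # mu_* = k_*^T K^-1 y
    k_ss = se_kernel(np.array([x_star]), np.array([x_star]))[0, 0]
    var = k_ss - k_star @ np.linalg.solve(K, k_star)    # sigma_*^2 = k(x_*,x_*) - k_*^T K^-1 k_*
    return mu, var

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)
print(gp_predict(X, y, 0.5))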

Example. Figure: functions sampled from a Gaussian process prior, and functions sampled from the predictive distribution. The predictive distribution is itself a Gaussian process. It represents the posterior after observing the data. The covariance is low in the vicinity of data points.

Varying the Hyperparameters. Figure: data samples and GP predictions with different kernel hyperparameters: l = 1, \sigma_f = 1, \sigma_n = 0.1; l = 0.3, \sigma_f = 1.08, \sigma_n = 0.0005; l = 3, \sigma_f = 1.16, \sigma_n = 0.89.

Varying the Hyperparameters. The squared exponential covariance function can be generalized to

k(x_p, x_q) = \sigma_f^2 \exp\left(-\frac{1}{2} (x_p - x_q)^T M (x_p - x_q)\right) + \sigma_n^2 \delta_{pq}

where M can be:
M = l^{-2} I: this is equal to the case above;
M = \mathrm{diag}(l_1, ..., l_D)^{-2}: every feature dimension has its own length scale parameter;
M = \Lambda\Lambda^T + \mathrm{diag}(l_1, ..., l_D)^{-2}: here \Lambda has fewer than D columns.

Varying the Hyperparameters. Figure: GP samples over two input dimensions (x_1, x_2) for M = I, M = \mathrm{diag}(1, 3), and the low-rank case M = \Lambda\Lambda^T + \mathrm{diag}(6, 6) with \Lambda = (1, -1)^T.

Implementation. Algorithm 1: GP regression. Data: training data (X, y), test input x_*. Input: hyperparameters \sigma_f, l, \sigma_n.
Training phase:
  K_{ij} \leftarrow k(x_i, x_j)
  L \leftarrow \mathrm{cholesky}(K + \sigma_n^2 I)
  \alpha \leftarrow L^T \backslash (L \backslash y)
  \log p(y \mid X) \leftarrow -\frac{1}{2} y^T \alpha - \sum_i \log L_{ii} - \frac{N}{2} \log(2\pi)
Test phase:
  E[f_*] \leftarrow k_*^T \alpha
  v \leftarrow L \backslash k_*
  \mathrm{var}[f_*] \leftarrow k(x_*, x_*) - v^T v
The Cholesky decomposition is numerically stable and can be used to compute the inverse efficiently.
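A direct transcription of Algorithm 1 into NumPy/SciPy (the squared exponential kernel and the toy data are my assumptions; triangular solves replace the backslash operator):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def se_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    return sigma_f**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / length_scale**2)

def gp_regression(X, y, x_star, sigma_f=1.0, length_scale=1.0, sigma_n=0.1):
    # Cholesky-based GP regression following Algorithm 1
    N = len(X)
    K = se_kernel(X, X, sigma_f, length_scale)
    L = cholesky(K + sigma_n**2 * np.eye(N), lower=True)               # L <- cholesky(K + sigma_n^2 I)
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))  # alpha <- L^T \ (L \ y)
    log_ml = -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * N * np.log(2 * np.pi)
    k_star = se_kernel(X, x_star, sigma_f, length_scale)
    mean = k_star.T @ alpha                                            # E[f_*] = k_*^T alpha
    v = solve_triangular(L, k_star, lower=True)                        # v <- L \ k_*
    var = se_kernel(x_star, x_star, sigma_f, length_scale).diagonal() - np.sum(v**2, axis=0)
    return mean, var, log_ml

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)
print(gp_regression(X, y, np.linspace(-5, 5, 5)))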

Estimating the Hyperparameters. To find optimal hyperparameters we need the marginal likelihood

p(y \mid X) = \int p(y \mid f, X) \, p(f \mid X) \, df

This expression implicitly depends on the hyperparameters, while y and X are given by the training data. It can be computed in closed form, as all terms are Gaussian. We take the logarithm, compute the derivative with respect to the hyperparameters, and set it to zero. This is the training step.
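In practice one maximizes the log marginal likelihood numerically. A minimal sketch (SciPy; the SE kernel, the log-space parameterization, and the gradient-free optimizer are my choices, not part of the lecture):

import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y):
    # -log p(y|X) for SE-kernel hyperparameters, parameterized in log space
    sigma_f, length_scale, sigma_n = np.exp(log_params)
    K = sigma_f**2 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / length_scale**2)
    L = cholesky(K + sigma_n**2 * np.eye(len(X)), lower=True)
    alpha = solve_triangular(L.T, solve_triangular(L, y, lower=True))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(X) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)
res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(X, y), method="Nelder-Mead")   # restarts help against local maxima
print(np.exp(res.x))    # learned sigma_f, length scale, sigma_n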

Estimating the Hyperparameters. Figure: log marginal likelihood over characteristic length scale and noise standard deviation, and the corresponding fits (output y over input x). The log marginal likelihood is not necessarily concave, i.e. it can have local maxima. These local maxima can correspond to sub-optimal solutions.

Automatic Relevance Determination. We have seen how the covariance function can be generalized using a matrix M. If M is diagonal, this results in the kernel function

k(x, x') = \sigma_f^2 \exp\left(-\frac{1}{2} \sum_{i=1}^{D} \eta_i (x_i - x'_i)^2\right)

We can interpret the \eta_i as weights for the individual feature dimensions. Thus, if the length scale l_i = 1/\eta_i of an input dimension is large, that input is less relevant. During training this is determined automatically.
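A small sketch of such an ARD kernel over D-dimensional inputs (the function name, the toy data, and the example weights are mine; in a full implementation the \eta_i would be learned jointly with the other hyperparameters via the marginal likelihood):

import numpy as np

def ard_kernel(X1, X2, eta, sigma_f=1.0):
    # ARD squared exponential: one weight eta_i per input dimension
    diff = X1[:, None, :] - X2[None, :, :]                  # shape (N, M, D)
    return sigma_f**2 * np.exp(-0.5 * np.sum(eta * diff**2, axis=-1))

X = np.random.default_rng(0).standard_normal((5, 3))
eta = np.array([1.0, 0.1, 0.001])    # tiny eta_3: large length scale, nearly irrelevant dimension
print(ard_kernel(X, X, eta).round(3))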

Automatic Relevance Determination. Figure: 3-dimensional data; the parameters \eta_1, \eta_2, \eta_3 as they evolve during training. During the optimization process to learn the hyperparameters, the reciprocal length scale of one parameter decreases, i.e. the corresponding input dimension is not very relevant!

Computer Vision Group Prof. Daniel Cremers. Gaussian Processes - Classification

Gaussian Processes for Classification. In regression we have y \in \mathbb{R}; in binary classification we have y \in \{-1, +1\}. To use a GP for classification, we can apply a sigmoid function to the posterior obtained from the GP and compute the class probability as

p(y = +1 \mid x) = \sigma(f(x))

If the sigmoid function is symmetric, i.e. \sigma(-z) = 1 - \sigma(z), then we have p(y \mid x) = \sigma(y f(x)). A typical sigmoid function is the logistic sigmoid:

\sigma(z) = \frac{1}{1 + \exp(-z)}

Application of the Sigmoid Function. Figure: a function sampled from a Gaussian process, and the sigmoid function applied to that sample. Another symmetric sigmoid function is the cumulative Gaussian:

\Phi(z) = \int_{-\infty}^{z} N(x \mid 0, 1) \, dx

Visualization of Sigmoid Functions. Figure: the cumulative Gaussian is slightly steeper than the logistic sigmoid.

The Latent Variables. In regression, we directly estimated f as f(x) \sim GP(m(x), k(x, x')), and values of f were observed in the training data. Now only the labels +1 or -1 are observed, and f is treated as a set of latent variables. A major advantage of the Gaussian process classifier over other methods is that it marginalizes over all latent functions rather than maximizing some model parameters.

Class Prediction with a GP. The aim is to compute the predictive distribution

p(y_* = +1 \mid X, y, x_*) = \int \sigma(f_*) \, p(f_* \mid X, y, x_*) \, df_*

To obtain p(f_* \mid X, y, x_*), we marginalize over the latent variables from the training data:

p(f_* \mid X, y, x_*) = \int p(f_* \mid X, x_*, f) \, p(f \mid X, y) \, df

Here p(f_* \mid X, x_*, f) is the predictive distribution of the latent variable (from regression), and we need the posterior over the latent variables:

p(f \mid X, y) = \frac{p(y \mid f) \, p(f \mid X)}{p(y \mid X)}

with the likelihood (sigmoid) p(y \mid f), the prior p(f \mid X), and the normalizer p(y \mid X).

A Simple Example. Figure: red: two-class training data; green: mean function of p(f \mid X, y); light blue: sigmoid of the mean function.

But There Is a Problem... In

p(f \mid X, y) = \frac{p(y \mid f) \, p(f \mid X)}{p(y \mid X)}

the likelihood term is not a Gaussian! This means we cannot compute the posterior in closed form. There are several different solutions in the literature, e.g. the Laplace approximation, expectation propagation, and variational methods.

Laplace Approximation. We approximate the posterior by a Gaussian obtained from a second-order Taylor expansion:

p(f \mid X, y) \approx q(f \mid X, y) = N(f \mid \hat{f}, A^{-1})

where \hat{f} = \arg\max_f p(f \mid X, y) and A = -\nabla\nabla \log p(f \mid X, y) \big|_{f = \hat{f}}.

To compute \hat{f}, an iterative approach using Newton's method has to be used. The Hessian matrix A can be computed as A = K^{-1} + W, where W = -\nabla\nabla \log p(y \mid f) is a diagonal matrix which depends on the sigmoid function.
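A naive sketch of the Newton iteration for \hat{f} with a logistic likelihood (NumPy; the toy data are mine, and the explicit inverse of K is kept for readability — standard implementations, e.g. in Rasmussen and Williams, use a numerically more stable formulation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_mode(K, y, n_iter=20):
    # Newton iterations for f_hat = argmax_f p(f | X, y), labels y in {-1, +1}
    f = np.zeros(len(y))
    for _ in range(n_iter):
        pi = sigmoid(f)                       # p(y_i = +1 | f_i)
        grad = (y + 1) / 2 - pi               # d/df log p(y | f)
        W = pi * (1 - pi)                     # diagonal of W = -dd log p(y | f)
        A = np.linalg.inv(K) + np.diag(W)     # Hessian A = K^-1 + W
        f = np.linalg.solve(A, W * f + grad)  # Newton step: f <- (K^-1 + W)^-1 (W f + grad)
    return f

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 10), rng.normal(2, 1, 10)])
y = np.concatenate([-np.ones(10), np.ones(10)])
K = np.exp(-0.5 * (X[:, None] - X[None, :])**2) + 1e-6 * np.eye(20)
print(np.round(sigmoid(laplace_mode(K, y)), 2))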

Laplace Approximation. Figure: yellow: a non-Gaussian posterior; red: its Gaussian approximation, whose mean is the mode of the posterior and whose variance is the inverse of the negative second derivative of the log posterior at the mode.

Predictions. Now that we have p(f \mid X, y), we can compute

p(f_* \mid X, y, x_*) = \int p(f_* \mid X, x_*, f) \, p(f \mid X, y) \, df

From the regression case we have p(f_* \mid X, x_*, f) = N(f_* \mid \mu_*, \sigma_*^2), where

\mu_* = k_*^T K^{-1} f, \quad \sigma_*^2 = k(x_*, x_*) - k_*^T K^{-1} k_*

The mean is linear in f. This reminds us of a property of Gaussians that we saw earlier!

Gaussian Properties (Rep.). If we are given

I. p(x) = N(x \mid \mu, \Sigma_1)
II. p(y \mid x) = N(y \mid A x + b, \Sigma_s)

then it follows (properties of Gaussians):

III. p(y) = N(y \mid A\mu + b, \Sigma_s + A \Sigma_1 A^T)
IV. p(x \mid y) = N(x \mid \Sigma (A^T \Sigma_s^{-1} (y - b) + \Sigma_1^{-1} \mu), \Sigma), where \Sigma = (\Sigma_1^{-1} + A^T \Sigma_s^{-1} A)^{-1}

Applying This to the Laplace Approximation. We obtain

E[f_* \mid X, y, x_*] = k(x_*)^T K^{-1} \hat{f}
V[f_* \mid X, y, x_*] = k(x_*, x_*) - k_*^T (K + W^{-1})^{-1} k_*

It remains to compute

p(y_* = +1 \mid X, y, x_*) = \int p(y_* \mid f_*) \, p(f_* \mid X, y, x_*) \, df_*

Depending on the kind of sigmoid function, we can compute this in closed form (cumulative Gaussian sigmoid) or have to use sampling methods or analytical approximations (logistic sigmoid).
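For the cumulative Gaussian sigmoid, the integral has the well-known closed form \Phi(\mu_* / \sqrt{1 + \sigma_*^2}). A tiny sketch (SciPy; the numerical example values are made up):

from math import sqrt
from scipy.stats import norm

def predictive_class_probability(mu_star, var_star):
    # p(y_* = +1 | x_*) when the sigmoid is the cumulative Gaussian:
    # integrating Phi(f) against N(f | mu_star, var_star) gives Phi(mu / sqrt(1 + var))
    return norm.cdf(mu_star / sqrt(1.0 + var_star))

print(predictive_class_probability(mu_star=0.8, var_star=0.5))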

A Simple Example. Figure: a two-class problem (training data in red and blue). Green line: optimal decision boundary; black line: GP classifier decision boundary; right: posterior probability.

Summary. Gaussian processes are normal distributions over functions. To specify a GP we need a covariance function (kernel) and a mean function. For regression we can compute the predictive distribution in closed form. For classification, we use a sigmoid and have to approximate the latent posterior. More on Gaussian processes: http://videolectures.net/epsrcws08_rasmussen_lgp/