Gaussian Processes for Machine Learning

Gaussian Processes for Machine Learning

Carl Edward Rasmussen
Max Planck Institute for Biological Cybernetics, Tübingen, Germany
carl@tuebingen.mpg.de

Carlos III, Madrid, May 2006

"The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of Probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind."
James Clerk Maxwell [1850]

Course Goals

- to understand how learning for regression and classification can be based on Gaussian processes
- how GPs relate to alternative methods
- the advantages and shortcomings of GPs
- to gain some practical familiarity using matlab functions

Today's Outline

- brief background on probability
- Gaussian processes
- Bayesian inference: two views
- the process view and Occam's Razor
- the weight-space view
- on the equivalence between the two views

Why Gaussian Processes?

Gaussian processes are not new! They have long been used in spatial statistics, geostatistics (kriging) and meteorology.

Why are GPs not used more often?
- computational reasons
- Bayesian

Probabilities and Probability Densities

Probabilities are non-negative: p(x) ≥ 0.
Probabilities normalize: Σ_x p(x) = 1 (discrete) or ∫ p(x) dx = 1 (continuous).
The joint probability of x and y is p(x, y).
The marginal probability of x is p(x) = ∫ p(x, y) dy.
The conditional probability of x given y is p(x|y) = p(x, y) / p(y).
Bayes' Rule is given by p(x, y) = p(x|y) p(y) = p(y|x) p(x), so

    p(y|x) = p(x|y) p(y) / p(x) ∝ p(x|y) p(y).

Interpretations of p(x):
- long run frequencies (frequentist, classical)
- subjective degree of belief (Bayesian)

Expectation and Variance (Moments)

The expectation (or mean or average) of a random variable is:

    µ = E[x] = ∫ x p(x) dx  or  Σ_x x p(x).

The variance (or second central moment) is:

    σ² = V[x] = ∫ (x − µ)² p(x) dx = E[x²] − (E[x])².

The covariance between x and y is:

    cov(x, y) = E[(x − E[x])(y − E[y])].

If x and y are independent, then their covariance is zero.

The Gaussian Distribution

A joint, or multivariate Gaussian distribution in D dimensions,

    y ~ N(µ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp( −½ (y − µ)ᵀ Σ⁻¹ (y − µ) ),

is fully specified by its mean vector µ and covariance matrix Σ.

Let x and y be jointly Gaussian random vectors

    [x; y] ~ N( [µ_x; µ_y], [A C; Cᵀ B] ) = N( [µ_x; µ_y], [Ã C̃; C̃ᵀ B̃]⁻¹ ),

then the marginal distribution of x and the conditional distribution of x given y are

    x ~ N(µ_x, A),  and
    x|y ~ N( µ_x + C B⁻¹ (y − µ_y), A − C B⁻¹ Cᵀ ),
    or equivalently  x|y ~ N( µ_x − Ã⁻¹ C̃ (y − µ_y), Ã⁻¹ ).    (1)
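
As a quick numerical check of the conditioning formula (1), the following Octave/MATLAB sketch evaluates the conditional mean and covariance for a small example; the blocks A, B, C, the means and the observed value are all made-up numbers, chosen only for illustration:

    % check of x|y ~ N( mu_x + C B^{-1}(y - mu_y), A - C B^{-1} C' )
    mu_x = [0; 0];             % mean of x (2-dimensional)
    mu_y = 1;                  % mean of y (1-dimensional)
    A = [1.0 0.3; 0.3 1.0];    % cov(x, x)
    B = 2.0;                   % cov(y, y)
    C = [0.8; 0.5];            % cov(x, y)
    y_obs = 1.5;               % an observed value of y
    cond_mean = mu_x + C * (B \ (y_obs - mu_y))
    cond_cov  = A - C * (B \ C')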

Products of Gaussians

The product of two Gaussians gives another (un-normalized) Gaussian:

    N(x|a, A) N(x|b, B) = Z⁻¹ N(x|c, C),    (2)

where c = C(A⁻¹ a + B⁻¹ b) and C = (A⁻¹ + B⁻¹)⁻¹.

Notice that the resulting Gaussian has a precision (inverse variance) equal to the sum of the precisions, and a mean equal to the convex sum of the means, weighted by the precisions.

The normalizing constant itself looks like a Gaussian (in a or b):

    Z⁻¹ = (2π)^(−D/2) |A + B|^(−1/2) exp( −½ (a − b)ᵀ (A + B)⁻¹ (a − b) ).

Gaussian Processes

Definition: A Gaussian process is a collection of random variables, any finite number of which have (consistent) joint Gaussian distributions.

Key observation: A Gaussian process is a natural generalization from vectors to functions:

    f(x) ~ GP( m(x), k(x, x') ),

and is fully specified by its mean function m(x) and covariance function k(x, x').

This may sound very impractical... but we are saved by the marginalization property. Recall p(x) = ∫ p(x, y) dy; for Gaussians:

    p(x, y) = N( [a; b], [A B; Bᵀ C] )   =>   p(x) = N(a, A).

Gaussian Processes specify distributions over functions

To generate a random sample from a D dimensional joint Gaussian with covariance matrix S and mean vector m (in matlab):

    x = randn(D,1);
    y = chol(S)'*x + m;

where chol returns the Cholesky factor R such that RᵀR = S. Thus, the covariance of y is (for m = 0):

    E[y yᵀ] = E[Rᵀ x xᵀ R] = Rᵀ E[x xᵀ] R = Rᵀ I R = S.

Example: smooth, stationary, Squared Exponential (or Gaussian) GP:

    f(x) ~ GP( m(x) = 0, k(x, x') = exp( −½ (x − x')² ) ).

Do try this at home!

[Figure: functions drawn at random from this GP, output f(x) against input x.]

Function drawn at random from a Gaussian Process with Gaussian covariance

[Figure: a sample function drawn from a GP with Gaussian covariance, plotted over a two-dimensional input.]
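
Expanding the recipe above into a complete Octave/MATLAB sketch for drawing functions from the squared exponential GP prior; the input grid, the number of samples and the small jitter term are choices made here for illustration:

    x = linspace(-5, 5, 201)';          % input locations
    K = exp(-0.5 * (x - x').^2);        % squared exponential covariance matrix
                                        % (x - x' uses implicit expansion; on old
                                        % MATLAB versions use bsxfun(@minus, x, x'))
    K = K + 1e-10 * eye(length(x));     % tiny jitter so chol succeeds numerically
    R = chol(K);                        % upper triangular R with R'*R = K
    f = R' * randn(length(x), 3);       % three independent samples from GP(0, k)
    plot(x, f);                         % each column is one random function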

Bayesian Inference, parametric model

Supervised parametric learning:
- data: x, y
- model: y = f_w(x) + ε

Gaussian likelihood:

    p(y|x, w, M_i) ∝ ∏_c exp( −½ (y_c − f_w(x_c))² / σ²_noise ).

Parameter prior:

    p(w|M_i).

Posterior parameter distribution by Bayes' rule p(a|b) = p(b|a) p(a) / p(b):

    p(w|x, y, M_i) = p(w|M_i) p(y|x, w, M_i) / p(y|x, M_i).

Bayesian Inference, parametric model, cont.

Making predictions:

    p(y*|x*, x, y, M_i) = ∫ p(y*|w, x*, M_i) p(w|x, y, M_i) dw.

Marginal likelihood:

    p(y|x, M_i) = ∫ p(w|M_i) p(y|x, w, M_i) dw.

Model probability:

    p(M_i|x, y) = p(M_i) p(y|x, M_i) / p(y|x).

Problem: integrals are intractable for most interesting models!

Non-parametric Gaussian process models

In our non-parametric model, the parameters are the function itself!

Gaussian likelihood:

    y | x, f(x), M_i ~ N(f, σ²_noise I).

(Zero mean) Gaussian process prior:

    f(x) | M_i ~ GP( m(x) ≡ 0, k(x, x') ).

This leads to a Gaussian process posterior (writing X for the training inputs):

    f(x) | X, y, M_i ~ GP( m_post(x) = k(x, X)[K(X, X) + σ²_noise I]⁻¹ y,
                           k_post(x, x') = k(x, x') − k(x, X)[K(X, X) + σ²_noise I]⁻¹ k(X, x') ),

and a Gaussian predictive distribution:

    y* | x*, X, y, M_i ~ N( k(x*, X)ᵀ [K + σ²_noise I]⁻¹ y,
                            k(x*, x*) + σ²_noise − k(x*, X)ᵀ [K + σ²_noise I]⁻¹ k(X, x*) ).
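
A minimal Octave/MATLAB sketch of these predictive equations, using the squared exponential covariance and a tiny training set invented purely for illustration; applying [K + σ²_noise I]⁻¹ via a Cholesky factor is the usual numerically stable choice:

    x  = [-4; -3; -1; 0; 2];              % training inputs (made up)
    y  = sin(x);                          % training targets (made up)
    xs = linspace(-5, 5, 101)';           % test inputs x*
    sn2 = 0.01;                           % noise variance sigma_noise^2
    k = @(a, b) exp(-0.5 * (a - b').^2);  % squared exponential covariance
    K = k(x, x);  Ks = k(xs, x);  Kss = k(xs, xs);
    L = chol(K + sn2 * eye(length(x)), 'lower');
    alpha = L' \ (L \ y);                 % [K + sn2*I]^{-1} y
    mu = Ks * alpha;                      % predictive mean at the test inputs
    V  = L \ Ks';
    s2 = diag(Kss) + sn2 - sum(V.^2, 1)'; % predictive variance (including noise)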

Prior and Posterior

[Figure: two panels of output f(x) against input x, one for the GP prior and one for the GP posterior after observing data.]

Predictive distribution:

    p(y*|x*, X, y) ~ N( k(x*, X)ᵀ [K + σ²_noise I]⁻¹ y,
                        k(x*, x*) + σ²_noise − k(x*, X)ᵀ [K + σ²_noise I]⁻¹ k(X, x*) ).

Graphical model for Gaussian Process

[Figure: graphical model with three layers: observations y^(1), ..., y^(*), ..., y^(c); the Gaussian field f^(1), ..., f^(*), ..., f^(c); and inputs x^(1), x^(2), ..., x^(*), ..., x^(c). Square nodes are clamped, round nodes are stochastic (free); the thick black line indicates that everyone is connected to everyone.]

Notice that adding a triplet x^(i), f^(i), y^(i) (where the input x^(i) is clamped) does not influence the distribution, as long as y^(i) is not clamped. This is guaranteed by the consistency of the GP. This explains why we can make inference using a finite amount of computation!

The marginal likelihood

Log marginal likelihood:

    log p(y|x, M_i) = −½ yᵀ K⁻¹ y − ½ log |K| − (n/2) log(2π)

is the combination of a data fit term and a complexity penalty: Occam's Razor is automatic.

Learning in Gaussian process models involves finding the form of the covariance function, and any unknown (hyper-) parameters θ. This can be done by optimizing the marginal likelihood:

    ∂ log p(y|x, θ, M_i) / ∂θ_j = ½ yᵀ K⁻¹ (∂K/∂θ_j) K⁻¹ y − ½ trace( K⁻¹ ∂K/∂θ_j ).
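
Both the log marginal likelihood and its gradient are cheap to evaluate once a Cholesky factor of the covariance is available. A sketch for a squared exponential covariance with a single length scale hyperparameter ℓ, on toy data invented for illustration:

    x = [-4; -3; -1; 0; 2];  y = sin(x);        % toy data (made up)
    sn2 = 0.01;  ell = 1.0;  n = length(x);
    D2 = (x - x').^2;                           % squared distances
    Ky = exp(-0.5 * D2 / ell^2) + sn2 * eye(n); % K + sigma_n^2 * I
    L  = chol(Ky, 'lower');
    alpha = L' \ (L \ y);                       % Ky^{-1} y
    lml = -0.5 * y' * alpha - sum(log(diag(L))) - 0.5 * n * log(2*pi)
    dK = exp(-0.5 * D2 / ell^2) .* D2 / ell^3;  % dK/d(ell)
    dlml = 0.5 * alpha' * dK * alpha - 0.5 * trace(Ky \ dK)   % d(lml)/d(ell)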

Example: Fitting the length scale parameter

Parameterized covariance function:

    k(x, x') = v² exp( −(x − x')² / (2ℓ²) ) + σ²_n δ_{xx'}.

[Figure: the observations together with the mean posterior predictive function for three length scales: too short, a good length scale, and too long.]

The mean posterior predictive function is plotted for 3 different length scales (the green curve corresponds to optimizing the marginal likelihood). Notice that an almost exact fit to the data can be achieved by reducing the length scale, but the marginal likelihood does not favour this!
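
The effect can be seen by simply comparing the log marginal likelihood for a few candidate length scales (reusing D2, y, sn2 and n from the sketch above; in practice one would use a gradient-based optimizer rather than a grid):

    for ell = [0.2 1.0 5.0]                     % candidate length scales
        Ky = exp(-0.5 * D2 / ell^2) + sn2 * eye(n);
        L  = chol(Ky, 'lower');
        lml = -0.5 * y' * (L' \ (L \ y)) - sum(log(diag(L))) - 0.5 * n * log(2*pi);
        fprintf('ell = %.1f   log marginal likelihood = %.3f\n', ell, lml);
    end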

Why, in principle, does Bayesian Inference work? Occam's Razor

[Figure: the evidence P(Y|M_i) plotted against all possible data sets Y, for a model that is too simple, one that is "just right", and one that is too complex.]

The weight-space view

Linear model with Gaussian noise: assume

    y = xᵀw + ε = f(x) + ε,

where ε ~ N(0, σ²_noise) is independent Gaussian noise.

Likelihood:

    p(y|X, w) = ∏_{i=1}^n p(y_i|x_i, w) = ∏_{i=1}^n (2πσ²_n)^(−1/2) exp( −(y_i − x_iᵀw)² / (2σ²_n) )
              = (2πσ²_n)^(−n/2) exp( −‖y − Xᵀw‖² / (2σ²_n) ) = N(Xᵀw, σ²_n I).

Note that the likelihood also has a Gaussian shape with respect to the parameters w, although it is not normalized as a function of w.

Maximum Likelihood

Maximizing the likelihood over the weights:

    ∂ log p(y|X, w) / ∂w = 0   =>   w_ML = (XXᵀ)⁻¹ X y.

(The solution could be ill determined if XXᵀ is close to singular.)

Predictive distribution:

    p(f*|x*, D, w_ML) = N( x*ᵀ w_ML, σ²_n ).
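
In code, the maximum likelihood solution is a single line. A sketch with the inputs stored as columns of X and data invented for illustration:

    X = [randn(1, 20); ones(1, 20)];      % 20 inputs as columns, plus a constant feature
    w_true = [2; -1];                     % "true" weights, made up
    y = X' * w_true + 0.1 * randn(20, 1); % noisy targets
    w_ml = (X * X') \ (X * y)             % w_ML = (X X')^{-1} X y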

The Bayesian Approach

Define the prior distribution p(w) = N(0, Σ_p), e.g. Σ_p = σ²_p I for some σ²_p. What does this prior mean?

Posterior ∝ Likelihood × Prior:

    p(w|X, y) ∝ exp( −(y − Xᵀw)ᵀ(y − Xᵀw) / (2σ²_n) ) exp( −½ wᵀ Σ_p⁻¹ w )
             ∝ exp( −½ (w − w̄)ᵀ ( XXᵀ/σ²_n + Σ_p⁻¹ ) (w − w̄) ),

i.e. the posterior is Gaussian:

    p(w|X, y) = N( w̄ = (1/σ²_n) A⁻¹ X y, A⁻¹ ),  where A = XXᵀ/σ²_n + Σ_p⁻¹.

Making Predictions

The predictive distribution is again Gaussian:

    p(f*|x*, X, y) = ∫ p(f*|x*, w) p(w|X, y) dw
                   = N( (1/σ²_n) x*ᵀ A⁻¹ X y, x*ᵀ A⁻¹ x* )

(its mean is ∫ x*ᵀ w p(w|X, y) dw = x*ᵀ w̄).

Notice how the error-bars increase when predicting further away from the origin. Why?

Outstanding problems:
- The model is still too limited (only linear).
- What about σ²_n and Σ_p?
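
Continuing the linear-model sketch above, the posterior over the weights and the predictive distribution at a test input take only a few more lines; the prior covariance Σ_p, the noise variance and the test input are arbitrary choices made for illustration:

    sn2 = 0.01;                          % noise variance sigma_n^2
    Sigma_p = eye(2);                    % prior covariance of w
    A = X * X' / sn2 + inv(Sigma_p);     % posterior precision matrix
    w_bar = (A \ (X * y)) / sn2;         % posterior mean of w
    xs = [0.5; 1];                       % a test input (with constant feature)
    f_mean = xs' * w_bar                 % predictive mean of f*
    f_var  = xs' * (A \ xs)              % predictive variance of f* (noise free)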

Nonlinear Regression

Instead of x, we can use a linear model in some fixed function basis φ(x) = (ϕ₁(x), ϕ₂(x), ...).

Example: φ(x) = (1, x, x², ...) turns f(x) = φ(x)ᵀw into polynomial regression, which is more flexible.

Note: the analysis from the previous slides can be reused by replacing x by φ(x). The maximum likelihood parameters become

    w_ML = (Φ(X)Φ(X)ᵀ)⁻¹ Φ(X) y,

with design matrix Φ(X) = (φ(x₁), φ(x₂), ...).
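
For example, a quadratic polynomial basis in Octave/MATLAB, with data invented for illustration:

    x = linspace(-1, 1, 30)';
    y = 0.5 - x + 2 * x.^2 + 0.05 * randn(size(x));  % targets from a made-up quadratic
    Phi = [ones(size(x)), x, x.^2]';     % design matrix, phi(x_i) as columns
    w_ml = (Phi * Phi') \ (Phi * y)      % w_ML = (Phi Phi')^{-1} Phi y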

Re-introducing the Feature Space

The predictive distribution:

    p(f*|x*, X, y) = ∫ p(f*|x*, w) p(w|X, y) dw
                   = N( (1/σ²_n) φ(x*)ᵀ A⁻¹ Φ(X) y, φ(x*)ᵀ A⁻¹ φ(x*) ),

where A = Φ(X)Φ(X)ᵀ/σ²_n + Σ_p⁻¹.

This can be re-written as

    f*|x*, X, y ~ N( φ*ᵀ Σ_p Φ (K + σ²_n I)⁻¹ y,
                     φ*ᵀ Σ_p φ* − φ*ᵀ Σ_p Φ (K + σ²_n I)⁻¹ Φᵀ Σ_p φ* ),

where K = Φᵀ Σ_p Φ.

Notice that this last expression contains φ(x) only in the form of inner products, which can possibly be replaced by kernel functions k(x, x') = φ(x)ᵀ Σ_p φ(x')!

On the correspondence between weight space and function space views

There is an exact correspondence between the two views:

Given a set of m basis functions φ(x), it is obvious that we can construct the corresponding covariance function:

    k(x, x') = φ(x)ᵀ Σ_p φ(x').

Conversely, given a positive definite covariance function k(x, x'), we can decompose it (Mercer's theorem):

    k(x, x') = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(x'),

where λ_i and φ_i are eigenvalues and eigenfunctions respectively, obeying

    ∫ k(x, x') φ_i(x) dx = λ_i φ_i(x').

The function space view is valid irrespective of whether the corresponding eigendecomposition is finite. The squared exponential covariance function does not have a finite decomposition.

From random functions to covariance functions

Consider the class of linear functions

    f(x) = ax + b,  where a ~ N(0, α) and b ~ N(0, β).

We can compute the mean function:

    µ(x) = E[f(x)] = ∫∫ f(x) p(a) p(b) da db = ∫ ax p(a) da + ∫ b p(b) db = 0,

and the covariance function:

    k(x, x') = E[(f(x) − 0)(f(x') − 0)] = ∫∫ (ax + b)(ax' + b) p(a) p(b) da db
             = ∫ a² x x' p(a) da + ∫ b² p(b) db + (x + x') ∫∫ a b p(a) p(b) da db
             = α x x' + β.
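
This calculation is easy to verify by simulation; a sketch with arbitrary values of α and β and two arbitrary inputs:

    alpha = 2.0;  beta = 0.5;  N = 1e5;  % made-up prior variances and sample size
    a = sqrt(alpha) * randn(N, 1);       % a ~ N(0, alpha)
    b = sqrt(beta)  * randn(N, 1);       % b ~ N(0, beta)
    x1 = 1.0;  x2 = 3.0;                 % two input locations
    f1 = a * x1 + b;  f2 = a * x2 + b;   % the random functions evaluated at x1, x2
    empirical   = mean(f1 .* f2)         % sample estimate of the covariance
    theoretical = alpha * x1 * x2 + beta % = 6.5 for these values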

Predictions from a noise free Gaussian Process

Imagine that we have observed the value of the function at some locations {X, f}, and seek to make predictions f* at (one or more) test inputs X*. The joint distribution under the GP is:

    [f; f*] ~ N( 0, [K(X, X)   K(X, X*);
                     K(X*, X)  K(X*, X*)] ).

Now we can condition on the observations to obtain

    f*|X*, X, f ~ N( K(X*, X) K(X, X)⁻¹ f,
                     K(X*, X*) − K(X*, X) K(X, X)⁻¹ K(X, X*) ).

Remark: Conditioning a Gaussian gives another Gaussian:

    [x; y] ~ N( 0, [A Cᵀ; C B] )   =>   x|y ~ N( Cᵀ B⁻¹ y, A − Cᵀ B⁻¹ C ).

Predictions from a noisy Gaussian Process

More commonly, we have access to noisy observations {X, y}. Assuming independent Gaussian noise, the joint distribution of observations and test predictions is:

    [y; f*] ~ N( 0, [K(X, X) + σ²_n I   K(X, X*);
                     K(X*, X)           K(X*, X*)] ).

Conditioning on the observations:

    f*|X*, X, y ~ N( K(X*, X)[K(X, X) + σ²_n I]⁻¹ y,
                     K(X*, X*) − K(X*, X)[K(X, X) + σ²_n I]⁻¹ K(X, X*) ).

Note: Equating K(X, Z) with Φ(X)ᵀ Σ_p Φ(Z) for the Bayesian linear model gives exactly identical results!
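
The claimed equivalence is easy to confirm numerically. A sketch comparing the two predictive means for an explicit basis, with the basis, data, noise level and prior all made up for illustration (both printed values agree up to round-off):

    phi = @(z) [ones(size(z)), z, z.^2]';     % basis functions, phi(x_i) as columns
    x = [-1; 0; 0.5; 2];  y = [1; 0; 0.3; 3]; % made-up data
    sn2 = 0.1;  Sigma_p = eye(3);             % noise variance and prior covariance
    xs = 1.2;                                 % a single test input
    % weight-space view
    A = phi(x) * phi(x)' / sn2 + inv(Sigma_p);
    mu_w = phi(xs)' * (A \ (phi(x) * y)) / sn2
    % function-space view with k(x, z) = phi(x)' * Sigma_p * phi(z)
    K  = phi(x)'  * Sigma_p * phi(x);
    ks = phi(xs)' * Sigma_p * phi(x);
    mu_f = ks * ((K + sn2 * eye(length(x))) \ y)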

Some Interpretation

Recall our main result:

    f*|X*, X, y ~ N( K(X*, X)[K(X, X) + σ²_n I]⁻¹ y,
                     K(X*, X*) − K(X*, X)[K(X, X) + σ²_n I]⁻¹ K(X, X*) ).

The mean is linear in two ways:

    µ(x*) = k(x*, X)[K(X, X) + σ²_n I]⁻¹ y = Σ_{c=1}^n β_c y^(c) = Σ_{c=1}^n α_c k(x*, x^(c)).

The last form is the one most commonly encountered in the kernel literature.

The variance is the difference between two terms:

    V(x*) = k(x*, x*) − k(x*, X)[K(X, X) + σ²_n I]⁻¹ k(X, x*),

where the first term is the prior variance, from which we subtract a (positive) term telling how much the data X has explained. Note that the variance is independent of the observed outputs y.