CSci 8980: Advanced Topics in Graphical Models Gaussian Processes


CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007

Gaussian Processes Outline

Parametric Bayesian Regression
Parameters to Functions
GP Regression
GP Classification

We will use:
Primary: Carl Rasmussen's GP tutorial slides (NIPS 2006)
Secondary: Hanna Wallach's slides on regression

The Prediction Problem

[Figure, shown over three slides: atmospheric CO2 concentration (ppm) plotted against year, roughly 1960-2020; the task is to predict how the observed series continues.]

Maximum likelihood, parametric model

Supervised parametric learning:
data: x, y
model: y = f_w(x) + ε

Gaussian likelihood:
$$p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, \mathcal{M}_i) \propto \prod_c \exp\!\big(-\tfrac{1}{2}(y_c - f_{\mathbf{w}}(x_c))^2 / \sigma^2_{\text{noise}}\big).$$

Maximize the likelihood:
$$\mathbf{w}_{\mathrm{ML}} = \operatorname*{argmax}_{\mathbf{w}}\; p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, \mathcal{M}_i).$$

Make predictions by plugging in the ML estimate: $p(y_* \mid x_*, \mathbf{w}_{\mathrm{ML}}, \mathcal{M}_i)$.

Bayesian Inference, parametric model

Supervised parametric learning:
data: x, y
model: y = f_w(x) + ε

Gaussian likelihood:
$$p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, \mathcal{M}_i) \propto \prod_c \exp\!\big(-\tfrac{1}{2}(y_c - f_{\mathbf{w}}(x_c))^2 / \sigma^2_{\text{noise}}\big).$$

Parameter prior: $p(\mathbf{w} \mid \mathcal{M}_i)$.

Posterior parameter distribution by Bayes' rule, $p(a \mid b) = p(b \mid a)\,p(a)/p(b)$:
$$p(\mathbf{w} \mid \mathbf{x}, \mathbf{y}, \mathcal{M}_i) = \frac{p(\mathbf{w} \mid \mathcal{M}_i)\, p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, \mathcal{M}_i)}{p(\mathbf{y} \mid \mathbf{x}, \mathcal{M}_i)}.$$

Bayesian Inference, parametric model, cont.

Making predictions:
$$p(y_* \mid x_*, \mathbf{x}, \mathbf{y}, \mathcal{M}_i) = \int p(y_* \mid \mathbf{w}, x_*, \mathcal{M}_i)\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{y}, \mathcal{M}_i)\, d\mathbf{w}.$$

Marginal likelihood:
$$p(\mathbf{y} \mid \mathbf{x}, \mathcal{M}_i) = \int p(\mathbf{w} \mid \mathcal{M}_i)\, p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, \mathcal{M}_i)\, d\mathbf{w}.$$

Model probability:
$$p(\mathcal{M}_i \mid \mathbf{x}, \mathbf{y}) = \frac{p(\mathcal{M}_i)\, p(\mathbf{y} \mid \mathbf{x}, \mathcal{M}_i)}{p(\mathbf{y} \mid \mathbf{x})}.$$

Problem: the integrals are intractable for most interesting models!

Bayesian Linear Regression (2)

Likelihood of the parameters:
$$P(\mathbf{y} \mid X, \mathbf{w}) = \mathcal{N}(X^{\top}\mathbf{w}, \sigma^2 I).$$

Assume a Gaussian prior over the parameters:
$$P(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \Sigma_p).$$

Apply Bayes' theorem to obtain the posterior:
$$P(\mathbf{w} \mid \mathbf{y}, X) \propto P(\mathbf{y} \mid X, \mathbf{w})\, P(\mathbf{w}).$$

Bayesian Linear Regression (3)

The posterior distribution over w is
$$P(\mathbf{w} \mid \mathbf{y}, X) = \mathcal{N}\big(\tfrac{1}{\sigma^2} A^{-1} X \mathbf{y},\; A^{-1}\big), \qquad A = \Sigma_p^{-1} + \tfrac{1}{\sigma^2} X X^{\top}.$$

The predictive distribution is
$$P(f_* \mid x_*, X, \mathbf{y}) = \int P(f_* \mid x_*, \mathbf{w})\, P(\mathbf{w} \mid X, \mathbf{y})\, d\mathbf{w} = \mathcal{N}\big(\tfrac{1}{\sigma^2} x_*^{\top} A^{-1} X \mathbf{y},\; x_*^{\top} A^{-1} x_*\big).$$
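The posterior and predictive distribution above reduce to a few lines of linear algebra. A minimal NumPy sketch, with made-up toy data and an illustrative noise level (none of the names below come from the slides):

```python
import numpy as np

def blr_posterior(X, y, Sigma_p, sigma2):
    # X: (d, n) design matrix with inputs as columns (as on the slide), y: (n,) targets
    A = np.linalg.inv(Sigma_p) + X @ X.T / sigma2      # A = Sigma_p^{-1} + X X^T / sigma^2
    A_inv = np.linalg.inv(A)
    w_mean = A_inv @ X @ y / sigma2                    # posterior mean (1/sigma^2) A^{-1} X y
    return w_mean, A_inv                               # posterior is N(w_mean, A^{-1})

def blr_predict(x_star, w_mean, A_inv):
    # predictive mean and variance of f(x*) = x*^T w
    return x_star @ w_mean, x_star @ A_inv @ x_star

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 20))                           # 20 two-dimensional inputs as columns
y = np.array([1.0, -0.5]) @ X + 0.1 * rng.normal(size=20)
w_mean, A_inv = blr_posterior(X, y, Sigma_p=np.eye(2), sigma2=0.01)
print(blr_predict(np.array([1.0, 1.0]), w_mean, A_inv))
```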


The Gaussian Distribution

The Gaussian distribution is given by
$$p(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \mathcal{N}(\boldsymbol{\mu}, \Sigma) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\!\big(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\big),$$
where µ is the mean vector and Σ the covariance matrix.

Conditionals and Marginals of a Gaussian

[Figure: two panels of a joint Gaussian, one showing a conditional slice and one showing a marginal projection.]

Both the conditionals and the marginals of a joint Gaussian are again Gaussian.

What is a Gaussian Process?

A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables. Informally: an infinitely long vector ≃ a function.

Definition: a Gaussian process is a collection of random variables, any finite number of which have (consistent) joint Gaussian distributions.

A Gaussian distribution is fully specified by a mean vector µ and covariance matrix Σ:
$$\mathbf{f} = (f_1, \ldots, f_n)^{\top} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma), \qquad \text{indices } i = 1, \ldots, n.$$

A Gaussian process is fully specified by a mean function m(x) and covariance function k(x, x'):
$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big), \qquad \text{index } x.$$

The marginalization property

Thinking of a GP as a Gaussian distribution with an infinitely long mean vector and an infinite-by-infinite covariance matrix may seem impractical... luckily we are saved by the marginalization property.

Recall: $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{y})\, d\mathbf{y}$.

For Gaussians:
$$p(\mathbf{x}, \mathbf{y}) = \mathcal{N}\!\left(\begin{bmatrix} \mathbf{a} \\ \mathbf{b} \end{bmatrix}, \begin{bmatrix} A & B \\ B^{\top} & C \end{bmatrix}\right) \;\Longrightarrow\; p(\mathbf{x}) = \mathcal{N}(\mathbf{a}, A).$$

Random functions from a Gaussian Process

Example of a one-dimensional Gaussian process:
$$f(x) \sim \mathcal{GP}\big(m(x) = 0,\; k(x, x') = \exp(-\tfrac{1}{2}(x - x')^2)\big).$$

To get an indication of what this distribution over functions looks like, focus on a finite subset of function values $\mathbf{f} = (f(x_1), f(x_2), \ldots, f(x_n))^{\top}$, for which
$$\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \Sigma), \qquad \Sigma_{ij} = k(x_i, x_j),$$
and then plot the coordinates of f as a function of the corresponding x values.

Some values of the random function

[Figure: sample functions f(x) drawn from this GP prior, plotted over inputs x in [-5, 5]; the sampled outputs lie roughly in [-1.5, 1.5].]
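A sketch of how a figure like this can be generated, following the recipe on the previous slide; the grid size, number of samples, and jitter term are implementation choices of mine, not from the slides:

```python
import numpy as np

def se_kernel(xa, xb):
    # squared-exponential covariance k(x, x') = exp(-(x - x')^2 / 2)
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2)

x = np.linspace(-5, 5, 200)
K = se_kernel(x, x) + 1e-6 * np.eye(len(x))    # small jitter for numerical stability
L = np.linalg.cholesky(K)                      # K = L L^T
rng = np.random.default_rng(1)
samples = L @ rng.normal(size=(len(x), 3))     # three draws f ~ N(0, K), one per column
# plotting each column of `samples` against x reproduces a figure like the one above
```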

Non-parametric Gaussian process models

In our non-parametric model, the "parameter" is the function itself!

Gaussian likelihood:
$$\mathbf{y} \mid \mathbf{x}, f(x), \mathcal{M}_i \sim \mathcal{N}(\mathbf{f}, \sigma^2_{\text{noise}} I).$$

(Zero-mean) Gaussian process prior:
$$f(x) \mid \mathcal{M}_i \sim \mathcal{GP}\big(m(x) \equiv 0,\; k(x, x')\big)$$

leads to a Gaussian process posterior
$$f(x) \mid \mathbf{x}, \mathbf{y}, \mathcal{M}_i \sim \mathcal{GP}\big(m_{\text{post}}(x),\; k_{\text{post}}(x, x')\big),$$
$$m_{\text{post}}(x) = k(x, \mathbf{x})[K(\mathbf{x}, \mathbf{x}) + \sigma^2_{\text{noise}} I]^{-1}\mathbf{y},$$
$$k_{\text{post}}(x, x') = k(x, x') - k(x, \mathbf{x})[K(\mathbf{x}, \mathbf{x}) + \sigma^2_{\text{noise}} I]^{-1} k(\mathbf{x}, x'),$$

and a Gaussian predictive distribution:
$$y_* \mid x_*, \mathbf{x}, \mathbf{y}, \mathcal{M}_i \sim \mathcal{N}\big(k(x_*, \mathbf{x})^{\top}[K + \sigma^2_{\text{noise}} I]^{-1}\mathbf{y},\; k(x_*, x_*) + \sigma^2_{\text{noise}} - k(x_*, \mathbf{x})^{\top}[K + \sigma^2_{\text{noise}} I]^{-1} k(x_*, \mathbf{x})\big).$$

Prior and Posterior

[Figure: two panels of functions over inputs x in [-5, 5]: samples from the GP prior, and samples from the GP posterior after conditioning on a handful of observations.]

Predictive distribution:
$$p(y_* \mid x_*, \mathbf{x}, \mathbf{y}) = \mathcal{N}\big(k(x_*, \mathbf{x})^{\top}[K + \sigma^2_{\text{noise}} I]^{-1}\mathbf{y},\; k(x_*, x_*) + \sigma^2_{\text{noise}} - k(x_*, \mathbf{x})^{\top}[K + \sigma^2_{\text{noise}} I]^{-1} k(x_*, \mathbf{x})\big).$$
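The predictive distribution translates directly into code. A minimal NumPy sketch using the unit-length-scale squared-exponential kernel from earlier; the training data and noise level are made up for illustration:

```python
import numpy as np

def se_kernel(xa, xb):
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2)

def gp_predict(x_train, y_train, x_test, sigma2_noise):
    # predictive mean and covariance of y* from the formula above
    K_y = se_kernel(x_train, x_train) + sigma2_noise * np.eye(len(x_train))   # K + sigma^2 I
    K_s = se_kernel(x_test, x_train)                                          # k(x*, x)
    mean = K_s @ np.linalg.solve(K_y, y_train)
    cov = (se_kernel(x_test, x_test) + sigma2_noise * np.eye(len(x_test))
           - K_s @ np.linalg.solve(K_y, K_s.T))
    return mean, cov

x_train = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])        # hypothetical toy observations
y_train = np.sin(x_train)
x_test = np.linspace(-5, 5, 100)
mean, cov = gp_predict(x_train, y_train, x_test, sigma2_noise=0.01)
```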

Graphical model for a Gaussian Process

[Figure: graphical model with inputs x_1, ..., x_n, latent function values f_1, ..., f_n, and outputs y_1, ..., y_n; square nodes are observed (clamped), round nodes are stochastic (free).]

All pairs of latent variables are connected. Predictions y_* depend only on the corresponding single latent f_*.

Notice that adding a triplet x_m, f_m, y_m does not influence the distribution. This is guaranteed by the marginalization property of the GP. This explains why we can make inference using a finite amount of computation!

Some interpretation

Recall our main result:
$$\mathbf{f}_* \mid X, X_*, \mathbf{y} \sim \mathcal{N}\big(K(X_*, X)[K(X, X) + \sigma_n^2 I]^{-1}\mathbf{y},\; K(X_*, X_*) - K(X_*, X)[K(X, X) + \sigma_n^2 I]^{-1} K(X, X_*)\big).$$

The mean is linear in two ways:
$$\mu(x_*) = k(x_*, X)[K(X, X) + \sigma_n^2 I]^{-1}\mathbf{y} = \sum_{c=1}^{n} \beta_c\, y^{(c)} = \sum_{c=1}^{n} \alpha_c\, k(x_*, x^{(c)}).$$
The last form is the one most commonly encountered in the kernel literature.

The variance is the difference between two terms:
$$\mathbb{V}(x_*) = k(x_*, x_*) - k(x_*, X)[K(X, X) + \sigma_n^2 I]^{-1} k(X, x_*),$$
where the first term is the prior variance, from which we subtract a (positive) term telling how much the data X has explained. Note that the variance is independent of the observed outputs y.
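Continuing the regression sketch above (and reusing its names x_train, y_train, x_test, se_kernel, and the 0.01 noise level), the kernel-weighted form of the mean is one extra linear solve:

```python
# mu(x*) = sum_c alpha_c k(x*, x^(c)) with alpha = [K + sigma_n^2 I]^{-1} y
alpha = np.linalg.solve(se_kernel(x_train, x_train) + 0.01 * np.eye(len(x_train)),
                        y_train)
mu_star = se_kernel(x_test, x_train) @ alpha           # weighted sum of kernel evaluations
```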

The marginal likelihood

Log marginal likelihood:
$$\log p(\mathbf{y} \mid \mathbf{x}, \mathcal{M}_i) = -\tfrac{1}{2}\mathbf{y}^{\top} K^{-1}\mathbf{y} - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log(2\pi)$$
is the combination of a data-fit term and a complexity penalty: Occam's razor is automatic.

Learning in Gaussian process models involves finding the form of the covariance function and any unknown (hyper-)parameters θ. This can be done by optimizing the marginal likelihood:
$$\frac{\partial \log p(\mathbf{y} \mid \mathbf{x}, \theta, \mathcal{M}_i)}{\partial \theta_j} = \tfrac{1}{2}\,\mathbf{y}^{\top} K^{-1}\frac{\partial K}{\partial \theta_j}K^{-1}\mathbf{y} - \tfrac{1}{2}\operatorname{trace}\!\Big(K^{-1}\frac{\partial K}{\partial \theta_j}\Big).$$
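As a sketch, the log marginal likelihood above in NumPy, computed via a Cholesky factorization; here K is assumed to already include the noise variance:

```python
import numpy as np

def log_marginal_likelihood(K, y):
    L = np.linalg.cholesky(K)                            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^{-1} y
    return (-0.5 * y @ alpha                             # data-fit term  -1/2 y^T K^{-1} y
            - np.sum(np.log(np.diag(L)))                 # complexity term -1/2 log|K|
            - 0.5 * len(y) * np.log(2 * np.pi))
```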

Example: Fitting the length-scale parameter

Parameterized covariance function:
$$k(x, x') = v^2 \exp\!\Big(-\frac{(x - x')^2}{2\ell^2}\Big) + \sigma_n^2\,\delta_{xx'}.$$

[Figure: observations together with posterior mean predictions over x in [-10, 10] for three length scales, labelled "too short", "good length scale", and "too long".]

The mean posterior predictive function is plotted for three different length scales (the green curve corresponds to optimizing the marginal likelihood). Notice that an almost exact fit to the data can be achieved by reducing the length scale, but the marginal likelihood does not favour this!
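One way to reproduce this kind of fit is to maximize the marginal likelihood over (v, ℓ, σ_n) numerically. A hedged sketch using scipy's generic optimizer with numerical gradients instead of the analytic gradient from the previous slide; the starting values are arbitrary, and x_train, y_train are the toy arrays from the earlier regression sketch:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, x, y):
    v, ell, sigma_n = np.exp(log_params)                # work in log space to keep them positive
    K = v**2 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2) \
        + sigma_n**2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(x_train, y_train), method="L-BFGS-B")
v_hat, ell_hat, sigma_n_hat = np.exp(res.x)             # fitted hyperparameters
```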

Why, in principle, does Bayesian Inference work? Occam's Razor

[Figure: schematic of the evidence P(Y | M_i) over all possible data sets Y for three models, labelled "too simple", "just right", and "too complex".]

An illustrative analogous example

Imagine the simple task of fitting the variance, σ², of a zero-mean Gaussian to a set of n scalar observations. The log likelihood is
$$\log p(\mathbf{y} \mid \mu, \sigma^2) = -\tfrac{1}{2}\sum_i (y_i - \mu)^2/\sigma^2 - \tfrac{n}{2}\log(\sigma^2) - \tfrac{n}{2}\log(2\pi).$$

From random functions to covariance functions

Consider the class of linear functions
$$f(x) = a x + b, \qquad a \sim \mathcal{N}(0, \alpha), \quad b \sim \mathcal{N}(0, \beta).$$

We can compute the mean function:
$$\mu(x) = \mathbb{E}[f(x)] = \int\!\!\int f(x)\, p(a)\, p(b)\, da\, db = \int a x\, p(a)\, da + \int b\, p(b)\, db = 0,$$
and the covariance function:
$$k(x, x') = \mathbb{E}[(f(x) - 0)(f(x') - 0)] = \int\!\!\int (a x + b)(a x' + b)\, p(a)\, p(b)\, da\, db$$
$$= \int a^2 x x'\, p(a)\, da + \int b^2\, p(b)\, db + (x + x')\int\!\!\int a b\, p(a)\, p(b)\, da\, db = \alpha x x' + \beta.$$
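A quick Monte Carlo sanity check of this derivation (all numbers below are arbitrary): the empirical covariance of f(x) = ax + b should approach αxx' + β.

```python
import numpy as np

alpha_, beta_, x1, x2 = 2.0, 0.5, 1.3, -0.7              # arbitrary illustrative values
rng = np.random.default_rng(0)
a = rng.normal(0.0, np.sqrt(alpha_), size=200_000)        # a ~ N(0, alpha)
b = rng.normal(0.0, np.sqrt(beta_), size=200_000)         # b ~ N(0, beta)
empirical = np.mean((a * x1 + b) * (a * x2 + b))          # E[f(x1) f(x2)] by Monte Carlo
print(empirical, alpha_ * x1 * x2 + beta_)                # the two numbers should agree closely
```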

From random functions to covariance functions II

Consider the class of functions (sums of squared exponentials):
$$f(x) = \lim_{n \to \infty} \frac{1}{n} \sum_i \gamma_i \exp\!\big(-(x - i/n)^2\big), \quad \gamma_i \sim \mathcal{N}(0, 1)\ \forall i$$
$$\phantom{f(x)} = \int_{-\infty}^{\infty} \gamma(u) \exp\!\big(-(x - u)^2\big)\, du, \quad \gamma(u) \sim \mathcal{N}(0, 1)\ \forall u.$$

The mean function is
$$\mu(x) = \mathbb{E}[f(x)] = \int \exp\!\big(-(x - u)^2\big) \int \gamma\, p(\gamma)\, d\gamma\, du = 0,$$
and the covariance function is
$$\mathbb{E}[f(x) f(x')] = \int \exp\!\big(-(x - u)^2 - (x' - u)^2\big)\, du = \int \exp\!\Big(-2\big(u - \tfrac{x + x'}{2}\big)^2 + \tfrac{(x + x')^2}{2} - x^2 - x'^2\Big)\, du \propto \exp\!\Big(-\tfrac{(x - x')^2}{2}\Big).$$

Thus, the squared exponential covariance function is equivalent to regression using infinitely many Gaussian-shaped basis functions placed everywhere, not just at your training points!

Model Selection in Practice: Hyperparameters

There are two types of task: choosing the form of the covariance function, and choosing its parameters. Typically our prior is too weak to quantify all aspects of the covariance function, so we use a hierarchical model with hyperparameters. For example, in ARD (automatic relevance determination):
$$k(x, x') = v_0^2 \exp\!\Big(-\sum_{d=1}^{D} \frac{(x_d - x_d')^2}{2 v_d^2}\Big), \qquad \text{hyperparameters } \theta = (v_0, v_1, \ldots, v_D, \sigma_n^2).$$

[Figure: three sample surfaces over (x1, x2) drawn with length scales v1 = v2 = 1, v1 = v2 = 0.32, and v1 = 0.32, v2 = 1.]
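The ARD covariance above as a small NumPy function; the array-shape convention is mine, not the slides':

```python
import numpy as np

def ard_se_kernel(Xa, Xb, v0, v):
    # Xa: (n, D), Xb: (m, D); v[d] is the length scale of input dimension d
    diff = (Xa[:, None, :] - Xb[None, :, :]) / v
    return v0**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

# e.g. the three settings from the figure: v = [1, 1], [0.32, 0.32], [0.32, 1]
```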

Binary Gaussian Process Classification

[Figure: two panels: a latent function f(x) over the input x, and the corresponding class probability π(x) obtained by squashing f through a sigmoid.]

The class probability is related to the latent function f through
$$p(y = +1 \mid f(x)) = \pi(x) = \Phi\big(f(x)\big),$$
where Φ is a sigmoid function, such as the logistic or the cumulative Gaussian.

Observations are independent given f, so the likelihood is
$$p(\mathbf{y} \mid \mathbf{f}) = \prod_{i=1}^{n} p(y_i \mid f_i) = \prod_{i=1}^{n} \Phi(y_i f_i).$$

Prior and Posterior for Classification

We use a Gaussian process prior for the latent function:
$$\mathbf{f} \mid X, \theta \sim \mathcal{N}(\mathbf{0}, K).$$

The posterior becomes
$$p(\mathbf{f} \mid \mathcal{D}, \theta) = \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X, \theta)}{p(\mathcal{D} \mid \theta)} = \frac{\mathcal{N}(\mathbf{f} \mid \mathbf{0}, K)}{p(\mathcal{D} \mid \theta)} \prod_{i=1}^{n} \Phi(y_i f_i),$$
which is non-Gaussian.

The latent value at the test point, f(x*), is
$$p(f_* \mid \mathcal{D}, \theta, x_*) = \int p(f_* \mid \mathbf{f}, X, \theta, x_*)\, p(\mathbf{f} \mid \mathcal{D}, \theta)\, d\mathbf{f},$$
and the predictive class probability becomes
$$p(y_* \mid \mathcal{D}, \theta, x_*) = \int p(y_* \mid f_*)\, p(f_* \mid \mathcal{D}, \theta, x_*)\, df_*,$$
both of which are intractable to compute.

Gaussian Approximation to the Posterior

We approximate the non-Gaussian posterior by a Gaussian:
$$p(\mathbf{f} \mid \mathcal{D}, \theta) \simeq q(\mathbf{f} \mid \mathcal{D}, \theta) = \mathcal{N}(\mathbf{m}, A),$$
then
$$q(f_* \mid \mathcal{D}, \theta, x_*) = \mathcal{N}(f_* \mid \mu_*, \sigma_*^2), \qquad \mu_* = \mathbf{k}_*^{\top} K^{-1}\mathbf{m}, \qquad \sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^{\top}\big(K^{-1} - K^{-1} A K^{-1}\big)\mathbf{k}_*.$$

Using this approximation with the cumulative Gaussian likelihood,
$$q(y_* = +1 \mid \mathcal{D}, \theta, x_*) = \int \Phi(f_*)\, \mathcal{N}(f_* \mid \mu_*, \sigma_*^2)\, df_* = \Phi\!\Big(\frac{\mu_*}{\sqrt{1 + \sigma_*^2}}\Big).$$
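Given some Gaussian approximation N(m, A) to the posterior (from Laplace, EP, or a variational bound; see the next slide), the approximate predictive class probability is a one-liner. A sketch, where k_star denotes k(x*, x) and k_star_star denotes k(x*, x*):

```python
import numpy as np
from scipy.stats import norm

def gp_class_predict(K, k_star, k_star_star, m, A):
    # mu = k*^T K^{-1} m,  sigma^2 = k(x*,x*) - k*^T (K^{-1} - K^{-1} A K^{-1}) k*
    K_inv = np.linalg.inv(K)
    mu = k_star @ K_inv @ m
    var = k_star_star - k_star @ (K_inv - K_inv @ A @ K_inv) @ k_star
    return norm.cdf(mu / np.sqrt(1.0 + var))             # Phi(mu / sqrt(1 + sigma^2))
```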

Laplace's method and Expectation Propagation

How do we find a good Gaussian approximation N(m, A) to the posterior?

Laplace's method: find the maximum a posteriori (MAP) latent values f_MAP and use a local (Gaussian) expansion around this point, as suggested by Williams and Barber [10].

Variational bounds: bound the likelihood by a tractable expression. A local variational bound for each likelihood term was given by Gibbs and MacKay [1]; a lower bound based on Jensen's inequality by Opper and Seeger [7].

Expectation Propagation: use an approximation of the likelihood such that the moments of the marginals of the approximate posterior match the (approximate) moments of the posterior, Minka [6].

Laplace's method and EP were compared by Kuss and Rasmussen [3].
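For concreteness, a hedged sketch of Laplace's method for this model, using a logistic sigmoid and a standard Newton iteration for f_MAP (a generic textbook-style implementation, not code from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_approximation(K, y, n_iter=25):
    # y in {-1, +1}. Newton iterations for the MAP latent values with a logistic likelihood.
    f = np.zeros(len(y))
    for _ in range(n_iter):
        pi = sigmoid(f)                                  # p(y = +1 | f)
        W = pi * (1.0 - pi)                              # -d^2 log p(y|f) / df^2 (diagonal)
        grad = (y + 1) / 2 - pi                          # d log p(y|f) / df
        sqrt_W = np.sqrt(W)
        B = np.eye(len(y)) + sqrt_W[:, None] * K * sqrt_W[None, :]
        b = W * f + grad
        a = b - sqrt_W * np.linalg.solve(B, sqrt_W * (K @ b))
        f = K @ a                                        # updated estimate of f_MAP
    A = np.linalg.inv(np.linalg.inv(K) + np.diag(W))     # covariance of the local Gaussian expansion
    return f, A                                          # q(f | D) = N(f_MAP, A)
```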

Conclusions

Complex non-linear inference problems can be solved by manipulating plain old Gaussian distributions.

Bayesian inference is tractable for GP regression, and good approximations exist for classification:
predictions are probabilistic;
different models can be compared via the marginal likelihood.

GPs are a simple and intuitive means of specifying prior information and explaining data, and are equivalent or closely related to other models: RVMs, splines, and SVMs.

Outlook:
new interesting covariance functions;
application to structured data;
better understanding of sparse methods.