
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015

1. BACKGROUND The kernel trick again...

The Kernel Trick. Consider again the linear regression model $y(x) = \phi(x)^\top w + \varepsilon$, with prior $p(w) = \mathcal{N}(w; 0, \Sigma)$. The kernel trick is to define the function $K(x, x') = \phi(x)^\top \Sigma\, \phi(x')$, which allows us to...

The Kernel Trick. ...given training data $D$ and test inputs $X_*$, write the predictive distribution in this nice form:

$p(y_* \mid X_*, D, \sigma^2) = \mathcal{N}(y_*; \mu_{y_* \mid D}, K_{y_* \mid D}),$

where

$\mu_{y_* \mid D} = K_*^\top (K + \sigma^2 I)^{-1} y$
$K_{y_* \mid D} = K_{**} - K_*^\top (K + \sigma^2 I)^{-1} K_*,$

and we have defined $K = K(X, X)$, $K_* = K(X, X_*)$, $K_{**} = K(X_*, X_*)$.
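As a concrete illustration, here is a minimal sketch of this predictive computation in Python/NumPy. It assumes a squared exponential kernel (introduced later in these slides) and represents inputs as arrays with one point per row; the helper names `se_kernel` and `predictive` are illustrative, not from the lecture.

```python
import numpy as np

def se_kernel(A, B, ell=1.0, lam=1.0):
    """Squared exponential kernel matrix between the rows of A and B (assumed helper)."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return lam**2 * np.exp(-0.5 * d2 / ell**2)

def predictive(X, y, X_star, sigma=0.1, kernel=se_kernel):
    """Predictive distribution p(y* | X*, D, sigma^2) in the kernelized form above."""
    K = kernel(X, X)                     # K   = K(X, X)
    K_s = kernel(X, X_star)              # K*  = K(X, X*)
    K_ss = kernel(X_star, X_star)        # K** = K(X*, X*)
    A = K + sigma**2 * np.eye(len(X))    # K + sigma^2 I
    mu = K_s.T @ np.linalg.solve(A, y)            # K*^T (K + sigma^2 I)^{-1} y
    cov = K_ss - K_s.T @ np.linalg.solve(A, K_s)  # K** - K*^T (K + sigma^2 I)^{-1} K*
    return mu, cov
```

Note that only kernel evaluations appear; the feature map $\phi$ and the weights $w$ never do.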

The Kernel Trick. This does more than just make the expressions pretty. In particular, it is often easier and cheaper to calculate $K(x, x')$ directly than to explicitly compute $\phi(x)$ and take the dot product. Example: the all-subsets kernel.

The Kernel Trick. Idea: completely abandon the idea of deriving explicit feature expansions and simply derive (positive-definite) kernel functions $K$ directly!

The Kernel Trick. Maybe we could skip the entire procedure of thinking about $w$, which we never explicitly use here...

$p(y_* \mid X_*, D, \sigma^2) = \mathcal{N}(y_*; \mu_{y_* \mid D}, K_{y_* \mid D}),$

where

$\mu_{y_* \mid D} = K_*^\top (K + \sigma^2 I)^{-1} y$
$K_{y_* \mid D} = K_{**} - K_*^\top (K + \sigma^2 I)^{-1} K_*.$

2. GAUSSIAN PROCESSES A reimagining of Bayesian regression

For more information (also code!): http://www.gaussianprocess.org/

Regression. Consider the general regression problem. Here we have: an input domain $\mathcal{X}$ (for example, $\mathbb{R}^n$, but in general anything), an unknown function $f\colon \mathcal{X} \to \mathbb{R}$, and (perhaps noisy) observations of the function: $D = \{(x_i, y_i)\}$, where $y_i = f(x_i) + \varepsilon_i$. Our goal is to predict the value of the function $f(x_*)$ at some test locations $X_*$.

Gaussian processes. Gaussian processes take a nonparametric approach to regression. We select a prior distribution over the function $f$ and condition this distribution on our observations, using the posterior distribution to make predictions. Gaussian processes are very powerful and leverage the many convenient properties of the Gaussian distribution to enable tractable inference.

From the Gaussian distribution to GPs. How can we leverage these useful properties of the Gaussian distribution to approach the regression problem? We have a problem: the latent function $f$ is usually infinite-dimensional; however, the multivariate Gaussian distribution is only useful in finite dimensions. The Gaussian process is a natural generalization of the multivariate Gaussian distribution to potentially infinite settings.

GPs: Definition. Definition (GP): A Gaussian process is a (potentially infinite) collection of random variables such that the joint distribution of any finite number of them is multivariate Gaussian.

GPs: Notation. A Gaussian process distribution on $f$ is written $p(f) = \mathcal{GP}(f; \mu, K)$, and just like the multivariate Gaussian distribution, it is parameterized by its first two moments (now functions): $\mathbb{E}[f] = \mu\colon \mathcal{X} \to \mathbb{R}$, the mean function, and $\mathbb{E}\bigl[(f(x) - \mu(x))(f(x') - \mu(x'))\bigr] = K(x, x')$, with $K\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a positive semidefinite covariance function or kernel.

GPs: Mean and covariance functions. The mean function encodes the central tendency of the function, and is often assumed to be a constant (usually zero). The covariance function encodes information about the shape and structure we expect the function to have. A simple and very common example is the squared exponential covariance: $K(x, x') = \exp\bigl(-\tfrac{1}{2}\lVert x - x' \rVert^2\bigr)$, which encodes the notion that nearby points should have similar function values.
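To make the "nearby points have similar values" intuition concrete, here is a quick numerical check of this covariance; a sketch only, with the illustrative helper name `k_se`.

```python
import numpy as np

def k_se(x, xp):
    """Squared exponential covariance K(x, x') = exp(-1/2 ||x - x'||^2)."""
    return np.exp(-0.5 * np.sum((np.asarray(x) - np.asarray(xp))**2))

print(k_se(0.0, 0.1))  # ~0.995: nearby inputs are strongly correlated
print(k_se(0.0, 3.0))  # ~0.011: distant inputs are nearly uncorrelated
```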

GPs: Prior on finite sets. Suppose we have selected a GP prior $\mathcal{GP}(f; \mu, K)$ for the function $f$. Consider a finite set of points $X \subset \mathcal{X}$. The GP prior on $f$, by definition, implies the following joint distribution on the associated function values $f = f(X)$:

$p(f \mid X) = \mathcal{N}\bigl(f; \mu(X), K(X, X)\bigr).$

That is, we simply evaluate the mean and covariance functions at $X$ and take the associated multivariate Gaussian distribution. Very simple!
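The sampling examples on the next few slides can be reproduced with a sketch like the following: evaluate the prior on a grid and draw samples from the resulting multivariate Gaussian. It reuses the assumed `se_kernel` helper from the earlier sketch and a zero mean function; the jitter term is a standard numerical trick, not part of the slides.

```python
import numpy as np

X_grid = np.linspace(0, 10, 200)[:, None]   # a finite set of points X in the input domain
mu = np.zeros(len(X_grid))                  # mu(X), assuming a zero mean function
cov = se_kernel(X_grid, X_grid)             # K(X, X)
cov += 1e-10 * np.eye(len(X_grid))          # tiny jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, cov, size=3)  # each row is one draw of f = f(X)
```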

Prior: Sampling examples. [Figure: prior mean $\mu(x)$, $\pm 2\sigma$ credible interval, and sample functions on $x \in [0, 10]$, using $K(x, x') = \exp\bigl(-\tfrac{1}{2}\lVert x - x' \rVert^2\bigr)$.]

Prior: Sampling examples. [Figure: prior mean $\mu(x)$, $\pm 2\sigma$ credible interval, and sample functions on $x \in [0, 10]$, using $K(x, x') = \lambda^2 \exp\bigl(-\tfrac{\lVert x - x' \rVert^2}{2\ell^2}\bigr)$ with $\lambda = \tfrac{1}{2}$, $\ell = 2$.]

Prior: Sampling examples. [Figure: prior mean $\mu(x)$, $\pm 2\sigma$ credible interval, and sample functions on $x \in [0, 10]$, using the exponential covariance $K(x, x') = \exp\bigl(-\lVert x - x' \rVert\bigr)$.]

From the prior to the posterior. So far, I've only told you how to construct prior distributions over the function $f$. How do we condition our prior on some observations $D = (X, f)$ to make predictions about the value of $f$ at some points $X_*$?

From the prior to the posterior. We begin by writing the joint distribution between the training function values $f(X) = f$ and the test function values $f(X_*) = f_*$:

$p(f, f_*) = \mathcal{N}\left( \begin{bmatrix} f \\ f_* \end{bmatrix};\ \begin{bmatrix} \mu(X) \\ \mu(X_*) \end{bmatrix},\ \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)\ldots$

From the prior to the posterior. ...we then condition this multivariate Gaussian on the known training values $f$. We already know how to do that!

$p(f_* \mid X_*, D) = \mathcal{N}\bigl(f_*; \mu_{f \mid D}(X_*), K_{f \mid D}(X_*, X_*)\bigr),$

where

$\mu_{f \mid D}(x) = \mu(x) + K(x, X)\, K(X, X)^{-1} \bigl(f - \mu(X)\bigr)$
$K_{f \mid D}(x, x') = K(x, x') - K(x, X)\, K(X, X)^{-1} K(X, x').$
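For completeness, a minimal sketch of this exact (noise-free) conditioning, again assuming the `se_kernel` helper and a zero mean function from the earlier sketches; the small jitter is a numerical convenience, not part of the formulas.

```python
import numpy as np

def condition(X, f, X_star, kernel):
    """Posterior over f(X*) given exact observations f = f(X), with mu = 0."""
    K = kernel(X, X) + 1e-10 * np.eye(len(X))   # K(X, X), plus tiny jitter
    K_s = kernel(X, X_star)                     # K(X, X*)
    K_ss = kernel(X_star, X_star)               # K(X*, X*)
    mu_post = K_s.T @ np.linalg.solve(K, f)            # K(X*, X) K(X, X)^{-1} f
    cov_post = K_ss - K_s.T @ np.linalg.solve(K, K_s)  # K(X*, X*) - K(X*, X) K(X, X)^{-1} K(X, X*)
    return mu_post, cov_post
```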

From the prior to the posterior. Notice that the functions $\mu_{f \mid D}$ and $K_{f \mid D}$ are valid mean and covariance functions, respectively. That means the previous slide is telling us that the posterior distribution over $f$ is a Gaussian process!

The posterior mean. One way to understand the posterior mean function $\mu_{f \mid D}$ is as a correction to the prior mean consisting of a weighted combination of kernel functions, one for each training data point:

$\mu_{f \mid D}(x) = \mu(x) + K(x, X)\, K(X, X)^{-1} \bigl(f - \mu(X)\bigr) = \mu(x) + \sum_{i=1}^{N} \alpha_i\, K(x_i, x),$

where $\alpha_i$ is the $i$-th entry of $\alpha = K(X, X)^{-1}\bigl(f - \mu(X)\bigr)$.
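A sketch of this "weighted combination of kernels" view, with the same assumed helpers and a zero prior mean; the function names are illustrative.

```python
import numpy as np

def posterior_mean_weights(X, f, kernel):
    """alpha = K(X, X)^{-1} (f - mu(X)), with mu = 0."""
    return np.linalg.solve(kernel(X, X), f)

def posterior_mean(x, X, alpha, kernel):
    """mu_{f|D}(x) = mu(x) + sum_i alpha_i K(x_i, x), with mu = 0."""
    return kernel(np.atleast_2d(x), X) @ alpha
```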

Prior. [Figure: prior mean $\mu(x)$, $\pm 2\sigma$ credible interval, and samples on $x \in [0, 10]$, using $K(x, x') = \exp\bigl(-\tfrac{1}{2}\lVert x - x' \rVert^2\bigr)$.]

Posterior example. [Figure: observations together with the posterior mean $\mu(x)$ and $\pm 2\sigma$ credible interval on $x \in [0, 10]$.]

Posterior: Sampling. [Figure: observations, posterior mean $\mu(x)$, $\pm 2\sigma$ credible interval, and posterior samples on $x \in [0, 10]$.]

Dealing with noise. So far, we have assumed we can sample the function $f$ exactly, which is uncommon in regression settings. How do we deal with observation noise? tl;dr: the same way we did with Bayesian linear regression!

Dealing with noise. We must create a model for our observations given the latent function. To begin, we will choose the simple i.i.d., zero-mean, additive Gaussian noise model: $y(x) = f(x) + \varepsilon$, with $p(\varepsilon \mid x) = \mathcal{N}(\varepsilon; 0, \sigma^2)$; combined, we have $p(y \mid f) = \mathcal{N}(y; f, \sigma^2 I)$.

Noisy posterior. To derive the posterior given noisy observations $D$, we again write the joint distribution between the training observations $y$ and the test function values $f_*$:

$p(y, f_*) = \mathcal{N}\left( \begin{bmatrix} y \\ f_* \end{bmatrix};\ \begin{bmatrix} \mu(X) \\ \mu(X_*) \end{bmatrix},\ \begin{bmatrix} K(X, X) + \sigma^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)\ldots \quad (1)$

Noisy posterior. ...and condition as before:

$p(f_* \mid X_*, D) = \mathcal{N}\bigl(f_*; \mu_{f \mid D}(X_*), K_{f \mid D}(X_*, X_*)\bigr),$

where

$\mu_{f \mid D}(x) = \mu(x) + K(x, X)\bigl(K(X, X) + \sigma^2 I\bigr)^{-1}\bigl(y - \mu(X)\bigr)$
$K_{f \mid D}(x, x') = K(x, x') - K(x, X)\bigl(K(X, X) + \sigma^2 I\bigr)^{-1} K(X, x').$
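A sketch of the noisy-observation posterior over $f$ following these formulas, again assuming the `se_kernel` helper; `mean_fn` stands in for an arbitrary prior mean function (zero by default).

```python
import numpy as np

def noisy_posterior(X, y, X_star, sigma, kernel,
                    mean_fn=lambda Z: np.zeros(len(Z))):
    """Posterior over f(X*) given noisy observations y = f(X) + noise."""
    A = kernel(X, X) + sigma**2 * np.eye(len(X))   # K(X, X) + sigma^2 I
    K_s = kernel(X, X_star)                        # K(X, X*)
    K_ss = kernel(X_star, X_star)                  # K(X*, X*)
    resid = y - mean_fn(X)                         # y - mu(X)
    mu_post = mean_fn(X_star) + K_s.T @ np.linalg.solve(A, resid)
    cov_post = K_ss - K_s.T @ np.linalg.solve(A, K_s)
    return mu_post, cov_post
```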

Noisy posterior: Sampling. [Figure: observations, posterior mean $\mu(x)$, $\pm 2\sigma$ credible interval, and samples on $x \in [0, 10]$, with noise $\sigma = 0.1$.]

Noisy posterior: Sampling. [Figure: observations, posterior mean $\mu(x)$, $\pm 2\sigma$ credible interval, and samples on $x \in [0, 10]$, with noise $\sigma = 0.5$.]

3. HYPERPARAMETERS We're not done yet...

Hyperparameters. So far, we have assumed that the Gaussian process prior distribution on $f$ has been specified a priori. But this prior distribution itself has parameters, for example the length scale $\ell$, the output scale $\lambda$, and the noise variance $\sigma^2$. As parameters of a prior distribution, we call these hyperparameters. For convenience, we will write $\theta$ to denote the vector of all hyperparameters of the model (including those of $\mu$ and $K$). How do we learn $\theta$?

Marginal likelihood. Assume we have chosen a parameterized prior $p(f \mid \theta) = \mathcal{GP}\bigl(f; \mu(x; \theta), K(x, x'; \theta)\bigr)$. We will measure the quality of the fit to our training data $D = (X, y)$ with the marginal likelihood, the probability of observing the given data under our prior:

$p(y \mid X, \theta) = \int p(y \mid f)\, p(f \mid X, \theta)\, df,$

where we have marginalized the unknown function values $f$ (hence, marginal likelihood).

Marginal likelihood: Evaluating. Thankfully, this is an integral we can do analytically under the Gaussian noise assumption!

$p(y \mid X, \theta) = \int p(y \mid f)\, p(f \mid X, \theta)\, df = \int \underbrace{\mathcal{N}(y; f, \sigma^2 I)}_{\text{iid noise}}\; \underbrace{\mathcal{N}\bigl(f; \mu(X; \theta), K(X, X; \theta)\bigr)}_{\text{GP prior}}\, df = \mathcal{N}\bigl(y; \mu(X; \theta), K(X, X; \theta) + \sigma^2 I\bigr).$

(Convolutions of two Gaussians are Gaussian.)

Marginal likelihood: Evaluating. The log-likelihood of our data under the chosen prior is then (writing $V = K(X, X; \theta) + \sigma^2 I$):

$\log p(y \mid X, \theta) = \underbrace{-\tfrac{1}{2}(y - \mu)^\top V^{-1}(y - \mu)}_{\text{data fit}}\; \underbrace{-\,\tfrac{1}{2}\log\det V}_{\text{Occam's razor}}\; -\; \tfrac{N}{2}\log 2\pi.$

The first term is large when the data fit the model well, and the second term is large when the volume of the prior covariance is small; that is, when the model is simpler.
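A sketch of evaluating this quantity numerically, assuming the `se_kernel` helper and a zero prior mean; the Cholesky factorization is one standard way to get both the solve and the log-determinant stably.

```python
import numpy as np

def log_marginal_likelihood(X, y, ell, lam, sigma):
    """log p(y | X, theta) for theta = (lambda, ell, sigma), with mu = 0."""
    N = len(X)
    V = se_kernel(X, X, ell=ell, lam=lam) + sigma**2 * np.eye(N)  # V = K + sigma^2 I
    L = np.linalg.cholesky(V)                                     # V = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))           # V^{-1} y
    data_fit = -0.5 * y @ alpha                # -1/2 y^T V^{-1} y   (data fit)
    occam = -np.sum(np.log(np.diag(L)))        # -1/2 log det V      (Occam's razor)
    return data_fit + occam - 0.5 * N * np.log(2 * np.pi)
```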

Hyperparameters: Example. [Figure: posterior fit to the observed data with $\theta = (\lambda, \ell, \sigma) = (1, 1, \tfrac{1}{5})$; $\log p(y \mid X, \theta) = -27.6$.]

Hyperparameters: Example. [Figure: posterior fit to the observed data with $\theta = (\lambda, \ell, \sigma) = (2, \tfrac{1}{3}, \tfrac{1}{20})$; $\log p(y \mid X, \theta) = -46.5$.]

Hyperparameters are important. Comparing the marginal likelihoods (the likelihood ratio is $\exp(-27.6 + 46.5) = \exp(18.9) \approx 1.6 \times 10^8$), we see that the observed data are over 100 million times more likely to have been generated by the first model than by the second! Clearly hyperparameters can be quite important.

Can we marginalize hyperparameters? To be fully Bayesian, we would choose a hyperprior $p(\theta)$ over $\theta$ and marginalize the unknown hyperparameters when making predictions:

$p(f_* \mid x_*, D) = \dfrac{\int p(f_* \mid x_*, D, \theta)\, p(y \mid X, \theta)\, p(\theta)\, d\theta}{\int p(y \mid X, \theta)\, p(\theta)\, d\theta}.$

Unfortunately, this integral cannot be resolved analytically. (Of course...)

Maximum likelihood-II. Instead, if we believe the posterior distribution over $\theta$ to be well-concentrated (for example, if we have many training examples), we may approximate $p(\theta \mid D)$ with a delta distribution at the point with the maximum marginal likelihood:

$\theta_{\mathrm{MLE}} = \arg\max_{\theta}\, p(y \mid X, \theta).$

This is called maximum likelihood-II (ML-II) inference. It effectively gives the approximation

$p(f_* \mid x_*, D) \approx p(f_* \mid x_*, D, \theta_{\mathrm{MLE}}).$

How can we find $\theta_{\mathrm{MLE}}$?
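One common answer, sketched here rather than taken from the slides, is to run a general-purpose optimizer on the negative log marginal likelihood. This reuses the assumed `log_marginal_likelihood` helper from the earlier sketch and optimizes in log-space (an assumption, chosen to keep the hyperparameters positive).

```python
import numpy as np
from scipy.optimize import minimize

def ml2(X, y, theta0=np.zeros(3)):
    """ML-II: maximize log p(y | X, theta) over theta = (log lambda, log ell, log sigma)."""
    def negative_lml(theta):
        lam, ell, sigma = np.exp(theta)
        return -log_marginal_likelihood(X, y, ell=ell, lam=lam, sigma=sigma)
    result = minimize(negative_lml, theta0, method="L-BFGS-B")
    return np.exp(result.x)  # (lambda, ell, sigma) at the ML-II point
```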