
STA561: Probabilistic machine learning
Gaussian Processes (10/16/13)
Lecturer: Barbara Engelhardt
Scribes: Changwei Hu, Di Jin, Mengdi Wang

1 Introduction

In supervised learning, we observe some inputs $x_i$ and some outputs $y_i$. We assume that $y_i = f(x_i)$ for some unknown $f$, possibly corrupted by noise:
$$y_i = f(x_i) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2).$$
One approach to modeling this relationship is to compute a posterior distribution over functions given the data, $p(f \mid X, y)$, and then to use this posterior distribution to make predictions given new inputs, marginalizing over all possible functions, i.e., to compute
$$p(y_* \mid x_*, X, y) = \int_{\mathcal{F}} p(y_* \mid f, x_*)\, p(f \mid X, y)\, df.$$

Up until now, we have focused on parametric representations of the function $f$, so that instead of inferring $p(f \mid \mathcal{D})$ we infer $p(\theta \mid \mathcal{D})$. In this lecture, we discuss a way to perform Bayesian inference over functions themselves. Our approach is based on Gaussian processes (GPs). A GP defines a prior distribution over functions; when we use it as a prior, we can build a posterior distribution over functions given data observations.

Although it might seem difficult to represent a distribution over functions, a key point is that, with a GP, we can define a posterior distribution over $f(x)$ at all values $x \in \mathcal{X}$ given only a finite set of observed points, say $x_1, \ldots, x_n$. In order to infer the function at an infinite number of points from a finite number of points, a GP assumes that $p(f(x_1), f(x_2), \ldots, f(x_n))$ is jointly Gaussian, with mean $\mu(x)$ and covariance $\Sigma(x)$ given by $\Sigma_{ij} = \kappa(x_i, x_j)$, where $\kappa$ is a positive definite kernel function. We use the kernel to enforce smoothness of the random function across input points: if $x_i$ and $x_j$ are similar according to our kernel function $\kappa$, then the distribution of $f(x_i)$ and $f(x_j)$ is constrained so that the two values are (in probability) similar: $\mathrm{cov}(f(x_i), f(x_j)) = \kappa(x_i, x_j)$.
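To make this finite-dimensional view concrete, here is a minimal sketch (not part of the original notes; the grid, seed, and kernel settings are illustrative assumptions) that draws random functions from a zero-mean GP prior with a squared exponential kernel by sampling the joint Gaussian over a grid of input points:

```python
import numpy as np

def sq_exp_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Squared exponential (RBF) kernel: kappa(x, x') = variance * exp(-(x - x')^2 / (2 l^2))."""
    sqdist = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / length_scale ** 2)

# A finite grid of input points x_1, ..., x_n standing in for the whole input space.
x = np.linspace(-5, 5, 100)
K = sq_exp_kernel(x, x)

# Under the GP prior, p(f(x_1), ..., f(x_n)) is jointly Gaussian with mean 0 and covariance K;
# each row of f_samples is one random function evaluated on the grid.
rng = np.random.default_rng(0)
f_samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
```

Each sampled vector can be plotted against $x$ to visualize one draw from the prior over functions (compare the left panel of Figure 3 below).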

2 Gaussian Processes

2.1 Prediction

We observe some inputs $x_i$ and some outputs $y_i$. We assume that $y_i = f(x_i) + \epsilon$ (Figure 1), for some unknown $f(\cdot)$ corrupted by Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$:
$$y_i = f(x_i) + \epsilon_i.$$

Figure 1: The x-axis is $x$, the y-axis is $f(x)$; the plot shows one instance of a function in this GP space.

A simple example is illustrated in Figure 1: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_6, y_6)\}$ are the observed data, and the curved line is one possible function of $x$, i.e., $f(x)$. To make a prediction at a new input $x_*$, we can find the expected value of $y_*$ from the expected value of the posterior distribution of $f(x_*)$.

2.2 Distribution over functions

Our approach will be to define a GP prior over functions. We need to be able to define a distribution over the function's values at a finite set of points, say $x_1, \ldots, x_n$. Recall that a GP assumes that $p(f(x_1), f(x_2), \ldots, f(x_n))$ is jointly Gaussian, with some mean $\mu(x) = E[f(x)]$ and covariance $\Sigma(x)$ given by $\Sigma_{ij} = \kappa(x_i, x_j)$, where $\kappa$ is a positive definite kernel function. In particular,
$$f(x) \sim GP(\mu(x), \kappa(x, x')).$$
Often, $\mu(x) = 0$ for simplicity. The posterior distribution of the function given the data is
$$p(f \mid X, y) = \mathcal{N}(f \mid \mu, K)$$
where $\mu = (\mu(x_1), \ldots, \mu(x_n))$ and $K_{ij} = \kappa(x_i, x_j)$.
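Since the GP restricted to a finite set of points is just a multivariate Gaussian, we can score any candidate vector of function values under $\mathcal{N}(\mu, K)$. The short sketch below is illustrative only (not from the notes; the input points, candidate values, and jitter term are arbitrary choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

def sq_exp_kernel(X1, X2, length_scale=1.0):
    """Kernel matrix entries kappa(x_i, x_j) for the squared exponential kernel."""
    sqdist = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-0.5 * sqdist / length_scale ** 2)

# A finite set of inputs with a zero mean function mu(x) = 0 (illustrative choices).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
mu = np.zeros(len(x))
K = sq_exp_kernel(x, x) + 1e-8 * np.eye(len(x))  # small jitter for numerical stability

# log N(f | mu, K) for one candidate vector of function values at these points.
f_candidate = np.sin(x)
log_density = multivariate_normal(mean=mu, cov=K).logpdf(f_candidate)
```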

We can produce predictions from this posterior distribution as follows:
$$p(y_* \mid x_*, X, y) = \int_{\mathcal{F}} p(y_* \mid f, x_*)\, p(f \mid X, y)\, df.$$
Because this is a Gaussian distribution, it turns out that this integral is straightforward; we will write it out below.

Figure 2: A Gaussian process puts a distribution on the function space $\mathcal{F}$.

A Gaussian process is a distribution over functions; used as a prior, it lets us define a posterior distribution over functions given data. The key point is that the data do not need to define the function for all values of $x$, but only at a finite number of values. In Figure 2 we assume that $p(f(x_1), f(x_2), \ldots, f(x_N))$ follows some distribution (i.e., there is a probability $p(f \mid \mu, \kappa)$ over functions in the function space $\mathcal{F}$). Given a new data point $x_*$, we predict $y_*$ using statistical machinery (i.e., the posterior predictive distribution).

2.3 Predictions for noise-free observations

Suppose we observe a training set $\mathcal{D} = \{(x_i, f(x_i)),\ i = 1:n\}$, where $f_i = f(x_i)$ is the noise-free observation of the function evaluated at $x_i$. Given a test set $X_*$ of size $n_*$, we want to predict the function outputs $f_*$. In the setting of no uncertainty (i.e., $\epsilon = 0$), if we ask the GP to predict $f(x)$ for a value of $x$ that it has already seen, we want the GP to return the answer $f(x)$ with no uncertainty: it should memorize the training data and perform noiseless interpolation at points $x$ not in the training data set. This will only happen if the observations are noiseless, i.e., every time $x$ appears in the data set, the same $f(x)$ appears as the label. We will consider the case of noisy observations later.
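For reference, the standard Gaussian conditioning identity from the previous lecture, which is applied below to derive the GP predictive distribution, is: if
$$\begin{pmatrix} a \\ b \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \begin{pmatrix} A & C \\ C^T & B \end{pmatrix} \right),$$
then
$$a \mid b \sim \mathcal{N}\big(\mu_a + C B^{-1}(b - \mu_b),\; A - C B^{-1} C^T\big).$$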

Figure 3: Left: functions sampled from a GP prior with a squared exponential kernel. Right: samples from a GP posterior, after conditioning on five noise-free observations. The shaded area represents $E[f(x)] \pm 2\,\mathrm{std}(f(x))$. (Figure 15.2 from Murphy, 2012)

Now we return to the prediction problem. By the definition of the GP, the joint distribution has the following form:
$$\begin{pmatrix} f \\ f_* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu \\ \mu_* \end{pmatrix}, \begin{pmatrix} K & K_* \\ K_*^T & K_{**} \end{pmatrix} \right)$$
where $K = \kappa(X, X)$ is $n \times n$, $K_* = \kappa(X, X_*)$ is $n \times n_*$, and $K_{**} = \kappa(X_*, X_*)$ is $n_* \times n_*$. By the standard rules for conditioning Gaussians (see previous lecture), the posterior has the following form:
$$f_* \mid f, X, X_* \sim \mathcal{N}(m_*, \Sigma_*)$$
where $m_* = \mu(X_*) + K_*^T K^{-1}(f - \mu(X))$ and $\Sigma_* = K_{**} - K_*^T K^{-1} K_*$. This allows us to compute the posterior prediction for the noiseless function $f(x_*)$ given our data and a new sample $x_*$.

This process is illustrated in Figure 5. On the left we show samples from the prior, $p(f \mid X)$, where we use a squared exponential kernel (i.e., a Gaussian kernel or RBF kernel). On the right we show samples from the posterior, $p(f_* \mid f, X, X_*)$. We see that the model interpolates the training data according to the posterior over the points in $X_*$, and that the predictive uncertainty increases as we move further away from the observed data. This uncertainty is strongly regulated by the choice of kernel: changing the length scale of the kernel one way gives much less uncertainty (i.e., a smoother function), while changing it the other way gives much greater uncertainty in our predictions (i.e., a less smooth function). This is how the kernel controls the smoothness of the functions in the underlying function space.
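The conditioning formulas above translate directly into a few lines of linear algebra. The following is a minimal sketch (not from the notes; the kernel, training points, and jitter term are illustrative assumptions) of noise-free GP prediction with a squared exponential kernel and zero prior mean:

```python
import numpy as np

def sq_exp_kernel(X1, X2, length_scale=1.0):
    """Squared exponential kernel kappa(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    sqdist = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-0.5 * sqdist / length_scale ** 2)

def gp_predict_noiseless(X_train, f_train, X_test, length_scale=1.0):
    """Posterior mean and covariance of f_* given noise-free observations f."""
    K = sq_exp_kernel(X_train, X_train, length_scale)    # n x n
    K_s = sq_exp_kernel(X_train, X_test, length_scale)   # n x n_*
    K_ss = sq_exp_kernel(X_test, X_test, length_scale)   # n_* x n_*
    K_inv_Ks = np.linalg.solve(K + 1e-10 * np.eye(len(X_train)), K_s)  # jitter for stability
    m_star = K_inv_Ks.T @ f_train            # m_* = K_*^T K^{-1} f
    Sigma_star = K_ss - K_s.T @ K_inv_Ks     # Sigma_* = K_** - K_*^T K^{-1} K_*
    return m_star, Sigma_star

# Five noise-free observations (illustrative), as in the right panel of Figure 3.
X_train = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
f_train = np.sin(X_train)
X_test = np.linspace(-5, 5, 200)
mean, cov = gp_predict_noiseless(X_train, f_train, X_test)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))  # for the E[f(x)] +/- 2 std(f(x)) band
```

The diagonal of `Sigma_star` gives the pointwise predictive variance, which is what produces the shaded $E[f(x)] \pm 2\,\mathrm{std}(f(x))$ band in Figure 3.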

2.4 Relationship between kernel and smoothness

When we use GPs for prediction, we infer the distribution of functions in between the observed data points $x$ using some notion of smoothness of the functions in function space (see Figure 3). Let us assume $p(f(x_1), \ldots, f(x_N))$ is jointly Gaussian:
$$(f(x_1), \ldots, f(x_N)) \sim \mathcal{N}_N(\mu(x), \Sigma(x))$$
where $\Sigma(x) = K$, $K_{ij} = \kappa(x_i, x_j)$, and $\kappa$ is our Mercer kernel function. The idea is that the quantification of similarity between $x_i$ and $x_j$ is actually a measure of the similarity between $f(x_i)$ and $f(x_j)$, i.e., $\mathrm{cov}(f(x_i), f(x_j)) = \kappa(x_i, x_j)$. If neighboring points are similar, then the conditional distribution of $f(x_*)$ given $x_*$, $x$, and $f(x)$ has less uncertainty; whereas if those points have no similarity, the same conditional distribution simply reverts to the prior.

Figure 4: Two draws from a noiseless GP posterior with different length scales in an RBF kernel.

Consider the example in Figure 4: at each point $x$, $f(x)$ is marginally Gaussian; between the $x$ points, we see the smoothness of the functions, and that smoothness is dictated by the kernel. For example, if we use an RBF kernel with a small length scale, $\kappa(x, x')$ is near zero, so the function is non-smooth (the dashed line); under a large length scale, $\kappa(x, x')$ is larger, so the function is smoother between the noiseless observations (the solid line).
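A small numeric check of this point (a sketch with illustrative length scales, not from the notes): condition $f(x_*)$ on a single neighboring observation $f(x)$ under a unit-variance RBF kernel, and compare the conditional variance for a short and a long length scale.

```python
import numpy as np

def rbf(x, x_prime, length_scale):
    """Unit-variance RBF kernel, so the prior variance kappa(x, x) is 1."""
    return np.exp(-0.5 * (x - x_prime) ** 2 / length_scale ** 2)

x_star, x_obs = 0.0, 1.0   # two inputs a unit distance apart (illustrative)
for length_scale in (0.3, 3.0):
    k = rbf(x_star, x_obs, length_scale)
    # Bivariate Gaussian conditioning: var(f(x_*) | f(x_obs)) = kappa(x_*, x_*) - kappa(x_*, x_obs)^2 / kappa(x_obs, x_obs)
    cond_var = 1.0 - k ** 2
    print(f"length scale {length_scale}: kappa = {k:.3f}, conditional variance = {cond_var:.3f}")

# Short length scale: kappa is near 0, so the conditional variance stays near the prior variance 1.
# Long length scale: kappa is near 1, so conditioning on the neighbor removes most of the uncertainty.
```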

2.5 Predictions using noisy observations

Now let us consider the case where what we observe is a noisy version of the underlying function, $y = f(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma_y^2)$. At each point $x$, the observed response carries noise on top of $f(x)$, because we observe $y = f(x) + \epsilon$; in other words, there is now conditional uncertainty in the function space even when conditioning on a specific $x$, and predictions must incorporate this uncertainty. The covariance of the observed noisy responses $y_i$ and $y_j$ changes only along the diagonal (i.e., the variance at a specific point $x$):
$$\mathrm{cov}(y_i, y_j) = \kappa(x_i, x_j) + \sigma_y^2\,\delta(i = j).$$
From this equation, if $i \neq j$ then $\mathrm{cov}(y_i, y_j) = \kappa(x_i, x_j)$, and if $i = j$ then $\mathrm{cov}(y_i, y_i) = \kappa(x_i, x_i) + \sigma_y^2$.

Let us define
$$\mathrm{cov}(y \mid x) = K + \sigma_y^2 I_n \equiv K_y.$$
The joint distribution of $y$ and $f_*$ is
$$\begin{pmatrix} y \\ f_* \end{pmatrix} \sim \mathcal{N}\left( 0, \begin{pmatrix} K_y & K_* \\ K_*^T & K_{**} \end{pmatrix} \right).$$
Here we assume the mean is zero for notational simplicity. As before, using the rules for conditioning Gaussians, given the observed $y$, $X$ and new data $X_*$, $f_*$ is distributed as
$$f_* \mid y, X, X_* \sim \mathcal{N}(\mu_*, \Sigma_*)$$
where $\mu_* = K_*^T K_y^{-1} y$ and $\Sigma_* = K_{**} - K_*^T K_y^{-1} K_*$.

For a single test input $x_*$, let $k_* = [\kappa(x_*, x_1), \ldots, \kappa(x_*, x_n)]^T$; another way to write the posterior mean is
$$\bar{f}_* = k_*^T K_y^{-1} y = \sum_{i=1}^{n} \alpha_i\, \kappa(x_i, x_*)$$
where $\alpha = K_y^{-1} y$. Note the similarity of this equation to the prediction equation for support vector machines (SVMs). For GPs, however, there is not necessarily the sparsity implied by the choice of support vectors: every observation has some impact, since the matrix $K$ and vector $y$ are dense.
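Putting the noisy-observation formulas together, here is a minimal sketch (not from the notes; the Cholesky-based solve and the specific hyperparameter values are illustrative implementation choices) of GP prediction from noisy observations with a squared exponential kernel and zero prior mean:

```python
import numpy as np

def sq_exp_kernel(X1, X2, length_scale=1.0):
    """Squared exponential kernel kappa(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    sqdist = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-0.5 * sqdist / length_scale ** 2)

def gp_predict_noisy(X_train, y_train, X_test, sigma_y=0.1, length_scale=1.0):
    """Posterior mean and covariance of f_* given noisy observations y."""
    n = len(X_train)
    K_y = sq_exp_kernel(X_train, X_train, length_scale) + sigma_y ** 2 * np.eye(n)
    K_s = sq_exp_kernel(X_train, X_test, length_scale)
    K_ss = sq_exp_kernel(X_test, X_test, length_scale)
    L = np.linalg.cholesky(K_y)                                  # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))    # alpha = K_y^{-1} y
    mu_star = K_s.T @ alpha                                      # mu_* = K_*^T K_y^{-1} y
    v = np.linalg.solve(L, K_s)
    Sigma_star = K_ss - v.T @ v                                  # Sigma_* = K_** - K_*^T K_y^{-1} K_*
    return mu_star, Sigma_star
```

Here `alpha` is exactly the weight vector $\alpha = K_y^{-1} y$ from the expansion above, so every training point contributes to the prediction, in contrast to the sparse expansion over support vectors in an SVM.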

3 GP Regression

GPs can be used for (nonlinear) regression. Let the prior on the (possibly non-linear) regression function $f(\cdot)$ be a GP, denoted by
$$f(x) \sim GP(m(x), \kappa(x, x'))$$
where $m(x)$ is the mean function and $\kappa(x, x')$ is the kernel or covariance function, i.e.,
$$m(x) = E[f(x)], \qquad \kappa(x, x') = E\big[(f(x) - m(x))(f(x') - m(x'))^T\big].$$
We obviously require that $\kappa(\cdot, \cdot)$ be a positive definite kernel. For any finite set of points, this process defines a joint Gaussian:
$$p(f \mid X) = \mathcal{N}(f \mid \mu, K)$$
where $\mu = (m(x_1), \ldots, m(x_N))$ and $K_{ij} = \kappa(x_i, x_j)$. Note that it is common to use a mean function $m(x) = 0$, since the GP is flexible enough to model the mean arbitrarily well.

Figure 5

Then we can model GP regression, $y_i = f(x_i) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, as noisy GP prediction. In particular, we have the same mean and covariance parameterization as with noisy GP prediction, and we can compute the joint density of the response $y$ and the underlying function $f$ as above. We can also do GP classification: as with logistic regression, we squash the GP response through a logistic function to get predictions in $(0, 1)$.
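As a rough illustration of that last remark (a sketch only, with hypothetical posterior-mean values; a full treatment of GP classification is not covered in these notes), the squashing step itself is just:

```python
import numpy as np

def logistic(a):
    """Squash a real-valued GP output into (0, 1), as in logistic regression."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical posterior mean values of f(x_*) at a few test inputs, e.g. from the
# noisy GP prediction sketch above; squashing them gives class-1 probabilities.
f_star_mean = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
p_class1 = logistic(f_star_mean)   # values in (0, 1)
```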