Nonparametric Regression With Gaussian Processes

Nonparametric Regression With Gaussian Processes. From Chap. 45 of Information Theory, Inference, and Learning Algorithms, D. J. C. MacKay. Presented by Micha Elsner.

A Non-linear Dataset. We have a group of N data points in d-dimensional space, X, and associated values t. [Figure: scatter plot of the noisy, nonlinear dataset.]

Linear Least-Squares Regression. We wish to find a function y(x) for which the mean squared error is small: $\hat{y} = \arg\min_y \frac{1}{N} \sum_i (y(x_i) - t_i)^2$. We hypothesize that y is a linear function of the training data, parameterized by the weight vector w: $y(x; w) = w_0 + w^T x$. Our optimal w is now $w = (X^T X)^{-1} X^T t$.
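
As a concrete illustration of the closed-form solution above, here is a minimal sketch in NumPy; the toy inputs and targets are stand-ins for the dataset on the previous slide.

    import numpy as np

    # Toy 1-D data standing in for the dataset shown earlier.
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 12, size=(50, 1))
    t = 2.0 + 0.1 * X[:, 0] + 0.3 * rng.normal(size=50)

    # Append a column of ones so that w_0 acts as the bias term.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])

    # w = (X^T X)^{-1} X^T t; np.linalg.solve avoids forming the explicit inverse.
    w = np.linalg.solve(Xb.T @ Xb, Xb.T @ t)
    y_hat = Xb @ w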

Which looks like this: [Figure: the least-squares linear fit overlaid on the data.]

A Statistical Model. Linear least-squares regression is a procedure for finding a w; let's turn it into a model for our data. We suppose that there exists a function y(x; w*), and that our data t is generated from this function, then corrupted by Gaussian white noise (mean 0, variance $\sigma_\nu^2$): $t_i = y(x_i; w^*) + \nu_i$, with $\nu_i \sim N(0, \sigma_\nu^2)$. In this case, our data t consists of samples from a Gaussian: $t(x) \sim N(y(x; w), \sigma_\nu^2)$.
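
A minimal sketch of sampling from this generative model; the true weights w* and the noise level are made-up values for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    w_star = np.array([2.0, 0.1])   # hypothetical true weights (w_0, w_1)
    sigma_nu = 0.3                  # noise standard deviation

    x = rng.uniform(0, 12, size=40)
    y_clean = w_star[0] + w_star[1] * x                 # y(x; w*)
    t = y_clean + sigma_nu * rng.normal(size=x.size)    # t_i = y(x_i; w*) + nu_i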

Bayesian inference. Now the probability of our parameter w is: $p(w, \sigma_\nu^2 \mid t, X) = \frac{p(t \mid y(x; w), \sigma_\nu^2, X)\, p(w, \sigma_\nu^2)}{p(t \mid X)}$. For an appropriate form of y and an appropriate prior over w and $\sigma_\nu^2$, the solution is the same least-squares solution as before.

Only now we have error bars. [Figure: the linear fit with error bars on the data, alongside the Gaussian white noise density.]

Non-parametric Bayesian inference. We could dispense with the parameter w and compute over the space of functions directly: $p(y \mid t, X) = \frac{p(t \mid y(x), X)\, p(y(x))}{p(t \mid X)}$. This is a generalization of the previous equation; to work with it we need to be able to define priors over the infinite space of functions y. However, we'd like to retain some useful properties: simple descriptions of our favorite parametric models; a tractable (finite-dimensional) representation of the posterior; and a Gaussian distribution of the data given the model.

Gaussian Processes. A prior which is a Gaussian process has all the properties we desire. A Gaussian process is a distribution over the space of functions, with a mean function µ(x) (a function because, unlike above, it need not be constant; it may be something like f(x; w)) and an operator A playing the role of an inverse covariance: $P(y(x) \mid \mu(x), A) = \frac{1}{Z} \exp\!\left(-\tfrac{1}{2}\,(y(x) - \mu(x))^T A\, (y(x) - \mu(x))\right)$. The inner product $(y(x) - \mu(x))^T A\, (y(x) - \mu(x))$ is defined as $\int (y(x) - \mu(x))\, A\, (y(x) - \mu(x))\, dx$.
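
At any finite set of input points the process reduces to an ordinary multivariate Gaussian, so we can draw sample functions from the prior by sampling that Gaussian. A minimal sketch, assuming a zero mean function and a squared-exponential covariance (the length scale and amplitude are arbitrary choices):

    import numpy as np

    def sq_exp_cov(xa, xb, theta1=1.0, r=1.0):
        # Squared-exponential covariance between two sets of 1-D inputs.
        d2 = (xa[:, None] - xb[None, :]) ** 2
        return theta1 * np.exp(-d2 / (2 * r ** 2))

    xs = np.linspace(0, 12, 200)
    K = sq_exp_cov(xs, xs) + 1e-8 * np.eye(xs.size)   # small jitter for numerical stability
    prior_samples = np.random.default_rng(2).multivariate_normal(
        mean=np.zeros(xs.size), cov=K, size=3)        # three functions drawn from the prior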

Linear Regression Revisited. Back in our simple linear regression model, let's impose a prior on w; unsurprisingly, let's make it Gaussian, with mean 0 and spherical covariance $\sigma_w^2 I$: $w \sim N(0, \sigma_w^2 I)$. Let Y be the value of y(x) at each point x. What are our prior beliefs about Y like? Y must still be Gaussian, and has expected mean 0. The covariance $Q = \langle Y Y^T \rangle$ is: $\langle (Xw)(Xw)^T \rangle = X \langle w w^T \rangle X^T = X \sigma_w^2 X^T$.

Covariance Terms. t is Y, plus the additive Gaussian noise terms ν. It's still Gaussian with mean 0. The covariance C is Q, plus the noise covariance $\sigma_\nu^2 I$. Any particular element of Q is the data covariance at that point, times the variance in w: $Q_{i,j} = (\sigma_w^2 X X^T)_{i,j}$. A particular element of C is: $C_{i,j} = Q_{i,j} + \delta_{ij}\, \sigma_\nu^2$.
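
A minimal sketch that builds Q and C as above and checks the identity for Q by Monte Carlo; sigma_w and sigma_nu are arbitrary values.

    import numpy as np

    rng = np.random.default_rng(3)
    X = np.hstack([np.ones((30, 1)), rng.uniform(0, 12, size=(30, 1))])  # bias column + inputs
    sigma_w, sigma_nu = 1.0, 0.3

    Q = sigma_w ** 2 * X @ X.T                  # Q_{ij} = (sigma_w^2 X X^T)_{ij}
    C = Q + sigma_nu ** 2 * np.eye(len(X))      # C_{ij} = Q_{ij} + delta_{ij} sigma_nu^2

    # Empirical check: for w ~ N(0, sigma_w^2 I) and Y = Xw, cov(Y) approaches Q.
    W = sigma_w * rng.normal(size=(2, 100_000))
    Q_mc = (X @ W) @ (X @ W).T / W.shape[1]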

The Kernel Trick. So far, we've been working explicitly in the input space (the dimensions of X plus an extra feature always equal to 1, to allow a bias term). We can project the data into feature space using a set of basis functions φ. One way to do this would be to form $R = [\phi_1(x) \ldots \phi_H(x)]$. But our set of basis functions could be huge, even infinite. However, our covariance Q only depends on the dot products between vectors. So all we really need is a function $C(x, x')$ that produces the entries of the covariance matrix Q.

Doing Inference. So suppose we have fixed C as some covariance function. Actually using our regression model (predicting t for some new x) requires inference over the posterior $p(t_{n+1}(x_{n+1}) \mid t, X) = \frac{p(t_{n+1}(x_{n+1}),\, t(x))}{p(t)}$. This distribution is a Gaussian (because our prior distribution over t is Gaussian). So the posterior is: $p(t_{n+1}(x_{n+1}) \mid t(x)) \propto \exp\!\left(-\tfrac{1}{2}\, [t_N\; t_{n+1}]\, C_{N+1}^{-1}\, [t_N\; t_{n+1}]^T\right)$. This is just an ordinary Gaussian distribution. The tough part is the covariance matrix $C_{N+1}^{-1}$, which is the inverse of the covariance C for the data X augmented by the new point $x_{n+1}$.

Inverting the Covariance Matrix. Inverting matrices is hard. Ideally we only want to invert the data covariance once, especially if we have many training points. The inverse covariance of the augmented data matrix can be computed in terms of only the inverse covariance of the original data, $C^{-1}$, and the covariance of the new point against the original data, k. The prediction for the new point is: $\hat{t}_{n+1} = k^T C^{-1} t$. The (Gaussian) posterior distribution for the point is: $p(t_{n+1}(x_{n+1}) \mid t(x)) \propto \exp\!\left(-\frac{(t_{n+1} - \hat{t}_{n+1})^2}{2\sigma_{n+1}^2}\right)$, where $\sigma_{n+1}^2 = \kappa - k^T C^{-1} k$ (κ is defined on the next slide).
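
A minimal sketch of these predictive equations, assuming a covariance function cov(xa, xb) like the ones described on later slides, and training arrays X_train, t_train; it returns the predictive mean and variance for one new input.

    import numpy as np

    def gp_predict(x_new, X_train, t_train, cov, sigma_nu):
        # x_new: length-1 array holding the new input.
        # C: covariance of the training targets (signal plus noise on the diagonal).
        C = cov(X_train, X_train) + sigma_nu ** 2 * np.eye(len(X_train))
        C_inv = np.linalg.inv(C)                 # invert the training covariance once
        k = cov(X_train, x_new)                  # covariance of the new point vs. the training set
        kappa = cov(x_new, x_new) + sigma_nu ** 2
        t_hat = k.T @ C_inv @ t_train            # \hat{t}_{n+1} = k^T C^{-1} t
        var = kappa - k.T @ C_inv @ k            # sigma_{n+1}^2 = kappa - k^T C^{-1} k
        return t_hat, var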

Incremental learning. Suppose we get a new point $x_{n+1}$ and its label $t_{n+1}$. We could recompute $C_{n+1}$ and invert it all over again, but that would be expensive. Notice that: $C_{n+1} = \begin{bmatrix} C_n & k \\ k^T & \kappa \end{bmatrix}$. Here again k is the covariance of $x_{n+1}$ with the training set: $k_i = C(x_i, x_{n+1})$. κ is $C(x_{n+1}, x_{n+1}) + \sigma_\nu^2$, and $C(x, x)$ is usually 1. We can compute the inverse $C_{n+1}^{-1}$ without inverting the matrix from scratch.

Incremental inversion. $C_{n+1}^{-1} = \begin{bmatrix} M & m \\ m^T & s_m \end{bmatrix}$ (partitioned inverse equations; Barnett 1979), where $s_m = (\kappa - k^T C_n^{-1} k)^{-1}$, $m = -s_m C_n^{-1} k$, and $M = C_n^{-1} + \frac{1}{s_m} m m^T$.
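
A minimal sketch of this update; C_inv_n is the previously computed inverse, and k and kappa are as defined on the previous slide.

    import numpy as np

    def grow_inverse(C_inv_n, k, kappa):
        # Extend C_n^{-1} to C_{n+1}^{-1} without re-inverting from scratch.
        s_m = 1.0 / (kappa - k @ C_inv_n @ k)    # s_m = (kappa - k^T C_n^{-1} k)^{-1}
        m = -s_m * (C_inv_n @ k)                 # m = -s_m C_n^{-1} k
        M = C_inv_n + np.outer(m, m) / s_m       # M = C_n^{-1} + (1/s_m) m m^T
        top = np.hstack([M, m[:, None]])
        bottom = np.hstack([m[None, :], np.array([[s_m]])])
        return np.vstack([top, bottom])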

A few useful priors. The radial basis kernel assumes that points near one another have similar values; the similarity between points is expressed as a Gaussian: $d(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2r^2}\right)$. If we assume our basis contains an infinite number of these functions and set our covariance function to $C(x, y) = \theta_1 \exp\!\left(-\frac{\|x - y\|^2}{4r^2}\right)$, we get the same thing.
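
A minimal sketch of this covariance function, vectorized over two sets of 1-D inputs; theta1 and r are hyperparameters to be chosen (or inferred, as discussed later).

    import numpy as np

    def rbf_cov(x, y, theta1=1.0, r=1.0):
        # C(x, y) = theta1 * exp(-|x - y|^2 / (4 r^2))
        d2 = (np.asarray(x)[:, None] - np.asarray(y)[None, :]) ** 2
        return theta1 * np.exp(-d2 / (4 * r ** 2))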

Result looks like: [Figure: the GP regression fit to the data using the radial basis covariance.]

Exploiting periodicity. We can use a prior that assumes points nearby in the period of the function have similar values: $C(x, y) = \theta_1 \exp\!\left(-\left(\frac{2\sin\!\left(0.5\,(x - y)\right)}{r}\right)^{2}\right)$. So points separated by a full period, such as 0 and 2π, are treated as identical.
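
A minimal sketch of a periodic covariance of this form; the exact placement of the square is a reconstruction, so treat the parameterization as an assumption.

    import numpy as np

    def periodic_cov(x, y, theta1=1.0, r=1.0):
        # C(x, y) = theta1 * exp(-(2 sin(0.5 (x - y)) / r)^2); periodic with period 2*pi.
        d = np.asarray(x)[:, None] - np.asarray(y)[None, :]
        return theta1 * np.exp(-(2.0 * np.sin(0.5 * d) / r) ** 2)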

Result looks like: with only 25 training points the radial basis prior gives us some sparsity issues, while a sinusoidal prior helps us reconstruct the data. [Figure: two panels comparing the radial basis fit and the periodic fit on 25 training points.]

Why Bayesian analysis? Why not just use kernel methods? Error bars: how unlikely is a given prediction? Generative model: given an x, efficiently sample a t. Hierarchical inference: sample hyperparameters (the θ parameters that specify C, and the parameters for the noise model). Posterior inference: we can compute the data likelihood given the posterior distribution over the parameters, not a single set of parameters.

Review: How do you do it? Get some data. Decide on a covariance function C (with some hyperparameters θ). Guess the noise variance $\sigma_\nu^2$. Construct the covariance matrix C, where $C_{i,j} = C(x_i, x_j)$ and $C_{i,i}$ has an additional term $\sigma_\nu^2$. Invert the covariance matrix. (This is the hard part!) Predict for new points $x_{n+1}$: take the covariances $k_i = C(x_i, x_{n+1})$ and compute $k^T C^{-1} t$. Remember, if you can't directly invert your training covariance matrix (more than about 1000 points) you have to be cleverer!
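
Putting the recipe together, a minimal end-to-end sketch; the covariance function, hyperparameter values, and toy data are all stand-ins rather than the settings used for the figures above.

    import numpy as np

    rng = np.random.default_rng(4)
    X = np.sort(rng.uniform(0, 12, size=25))              # 1. get some data
    t = 2.0 + np.sin(X) + 0.2 * rng.normal(size=X.size)
    theta1, r, sigma_nu2 = 1.0, 1.0, 0.04                 # 2-3. hyperparameters and noise variance guess

    def cov(xa, xb):
        return theta1 * np.exp(-(xa[:, None] - xb[None, :]) ** 2 / (4 * r ** 2))

    C = cov(X, X) + sigma_nu2 * np.eye(X.size)            # 4. C_{ij} = C(x_i, x_j) + delta_{ij} sigma_nu^2
    C_inv = np.linalg.inv(C)                              # 5. invert it (the hard part)

    X_new = np.linspace(0, 12, 200)                       # 6. predict for new points
    k = cov(X, X_new)                                     #    k_i = C(x_i, x_{n+1})
    mean = k.T @ C_inv @ t                                #    k^T C^{-1} t
    var = theta1 + sigma_nu2 - np.einsum('ij,jk,ki->i', k.T, C_inv, k)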