Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics, CSE 8803ML, Spring 2012

Pictorial view of embedding a distribution: transform the entire distribution to its expected features. (Figure: feature map into feature space.)

Embedding distributions: mean. The mean reduces the entire distribution to a single number; its representation power is very restricted (a 1D feature space).

Embedding distributions: mean + variance. The mean and variance reduce the entire distribution to two numbers, a richer representation, but still not enough (a 2D feature space).

Embedding with kernel features. Transform the distribution to an infinite-dimensional vector, a rich representation in feature space: mean, variance, and higher-order moments.

Finite sample approximation of the embedding.

Estimating the embedding distance. Finite sample estimator: form a kernel matrix with 4 blocks and average within each block. (Figure: the four blocks of the kernel matrix.)
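
A minimal numpy sketch of this block-averaging estimator (not from the slides; the Gaussian RBF kernel, the sample sizes, and the synthetic data are my own illustrative assumptions):

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), evaluated for all pairs of rows of A and B
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def embedding_distance2(X, Y, gamma=1.0):
    # Biased finite-sample estimate of the squared distance between mean embeddings:
    # average of the XX block + average of the YY block - 2 * average of the XY block
    # (the two cross blocks of the kernel matrix are identical by symmetry).
    K_xx = rbf_kernel(X, X, gamma)
    K_yy = rbf_kernel(Y, Y, gamma)
    K_xy = rbf_kernel(X, Y, gamma)
    return K_xx.mean() + K_yy.mean() - 2 * K_xy.mean()

rng = np.random.default_rng(0)
X  = rng.normal(0.0, 1.0, size=(200, 1))
X2 = rng.normal(0.0, 1.0, size=(200, 1))   # same distribution as X
Y  = rng.normal(1.0, 1.0, size=(200, 1))   # shifted distribution
print(embedding_distance2(X, X2), embedding_distance2(X, Y))  # near zero vs. clearly larger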

Measuring dependence via embeddings. Use the squared distance between embeddings to measure the dependence between X and Y. (Figure: feature space capturing the means of X and Y, their covariance, and higher-order features.)

Estimating embedding distances. Given samples (x_1, y_1), …, (x_m, y_m) ~ P(X, Y), the dependence measure can be expressed as inner products:
μ_XY − μ_X μ_Y = E_XY[φ(X) ⊗ ψ(Y)] − E_X[φ(X)] ⊗ E_Y[ψ(Y)]
‖μ_XY − μ_X μ_Y‖² = ⟨μ_XY, μ_XY⟩ − 2⟨μ_XY, μ_X μ_Y⟩ + ⟨μ_X μ_Y, μ_X μ_Y⟩
Kernel matrix operation (with centering matrix H = I − (1/m) 1 1ᵀ, and the X and Y data ordered in the same way):
(1/m²) trace(H K_x H K_y), where (K_x)_ij = k(x_i, x_j) and (K_y)_ij = k(y_i, y_j).
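
A sketch of this trace estimator in numpy (the RBF kernel choice and the paired synthetic data below are illustrative assumptions, not part of the lecture):

import numpy as np

def rbf_gram(A, gamma=1.0):
    # Kernel (Gram) matrix K_ij = exp(-gamma * ||a_i - a_j||^2)
    sq = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
    return np.exp(-gamma * sq)

def dependence_measure(X, Y, gamma=1.0):
    # (1/m^2) trace(H K_x H K_y) with H = I - (1/m) 1 1^T.
    # Row i of X must correspond to row i of Y (same ordering of the samples).
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    K_x = rbf_gram(X, gamma)
    K_y = rbf_gram(Y, gamma)
    return np.trace(H @ K_x @ H @ K_y) / m**2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y_dep = np.sin(3 * X) + 0.1 * rng.normal(size=(200, 1))   # depends on X
Y_ind = rng.normal(size=(200, 1))                         # independent of X
print(dependence_measure(X, Y_dep), dependence_measure(X, Y_ind))  # dependent pair scores higher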

Applications of the kernel distance measure.

References.

Multivariate Gaussians.
P(X_1, X_2, …, X_n) = (2π)^(−n/2) |Σ|^(−1/2) exp(−½ (x − μ)ᵀ Σ^(−1) (x − μ))
Mean vector: μ_i = E[X_i], μ = (μ_1, μ_2, …, μ_n)ᵀ.
Covariance matrix: σ_ij = E[(X_i − μ_i)(X_j − μ_j)], e.g. for n = 3
Σ = [ σ_1²  σ_12  σ_13
      σ_21  σ_2²  σ_23
      σ_31  σ_32  σ_3² ]
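
As a quick sanity check on this density formula, a small sketch comparing the direct evaluation against scipy (the 3-dimensional numbers are arbitrary; scipy is assumed available):

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])
x = np.array([0.5, 0.5, 0.0])

# Evaluate (2*pi)^(-n/2) |Sigma|^(-1/2) exp(-1/2 (x - mu)^T Sigma^{-1} (x - mu)) directly
n = len(mu)
diff = x - mu
quad = diff @ np.linalg.solve(Sigma, diff)
p = (2 * np.pi) ** (-n / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

print(p, multivariate_normal(mu, Sigma).pdf(x))  # the two values agree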

Conditioning on a Gaussian. Joint Gaussian: P(X, Y) ~ N(μ; Σ). Conditioning a Gaussian variable Y on another Gaussian variable X still gives a Gaussian, P(Y | X) ~ N(μ_{Y|X}; σ²_{Y|X}):
μ_{Y|X} = μ_Y + (σ_{YX}/σ_X²)(X − μ_X)   (prior mean, corrected by the new observation)
σ²_{Y|X} = σ_Y² − σ_{YX}²/σ_X²   (prior variance, minus a reduction term)
The posterior variance does not depend on the particular observed value; observing X always decreases the variance.

The conditional Gaussian is a linear model. Conditional linear Gaussian: P(Y | X) ~ N(μ_{Y|X}; σ²_{Y|X}) with μ_{Y|X} = μ_Y + (σ_{YX}/σ_X²)(X − μ_X), i.e. P(Y | X) ~ N(β_0 + βX; σ²_{Y|X}). The ridge in the figure is the line β_0 + βX. If we take a slice at a particular X, we get a Gaussian, and all these Gaussian slices have the same variance σ²_{Y|X} = σ_Y² − σ_{YX}²/σ_X².

Conditional Gaussian (general case). Joint Gaussian: P(X, Y) ~ N(μ; Σ). Conditional Gaussian: P(Y | X) ~ N(μ_{Y|X}; Σ_{YY|X}) with
μ_{Y|X} = μ_Y + Σ_{YX} Σ_{XX}^(−1) (X − μ_X)
Σ_{YY|X} = Σ_{YY} − Σ_{YX} Σ_{XX}^(−1) Σ_{XY}
The conditional Gaussian is linear in X: P(Y | X) ~ N(β_0 + BX; Σ_{YY|X}) with β_0 = μ_Y − Σ_{YX} Σ_{XX}^(−1) μ_X and B = Σ_{YX} Σ_{XX}^(−1). This is the linear regression model Y = β_0 + BX + ε with white noise ε ~ N(0, Σ_{YY|X}).
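
A small numpy sketch of these conditioning formulas, assuming the joint mean and covariance are partitioned into X and Y blocks (the specific numbers are made up for illustration):

import numpy as np

# Joint Gaussian over (X, Y): first block is X, second block is Y.
mu_x, mu_y = np.array([0.0]), np.array([1.0])
S_xx = np.array([[2.0]])
S_xy = np.array([[0.8]])
S_yx = S_xy.T
S_yy = np.array([[1.0]])

x_obs = np.array([1.5])  # observed value of X

# mu_{Y|X} = mu_Y + S_YX S_XX^{-1} (x - mu_X)
mu_cond = mu_y + S_yx @ np.linalg.solve(S_xx, x_obs - mu_x)
# Sigma_{YY|X} = S_YY - S_YX S_XX^{-1} S_XY  (does not depend on the observed value)
S_cond = S_yy - S_yx @ np.linalg.solve(S_xx, S_xy)

print(mu_cond, S_cond)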

What is a Gaussian process? A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables. Formally: a collection of random variables, any finite number of which have (consistent) joint Gaussian distributions. Informally: an infinitely long vector whose dimensions are indexed by x, i.e. a function f(x). A Gaussian process is fully specified by a mean function m(x) = E[f(x)] and a covariance function k(x, x') = E[(f(x) − m(x))(f(x') − m(x'))]; we write f(x) ~ GP(m(x), k(x, x')), where the x are the indices.

A set of samples from a Gaussian process. For each fixed value of x there is a Gaussian variable associated with it. Focus on a finite subset of values f = (f(x_1), f(x_2), …, f(x_N)), for which f ~ N(0, Σ) with Σ_ij = k(x_i, x_j). Then plot the coordinates of f as a function of the corresponding x values.

Random functions from a Gaussian process. One-dimensional Gaussian process: f(x) ~ GP(0, k(x, x')) with k(x, x') = exp(−½ (x − x')²). To generate a sample from the GP: the Gaussian variables f_i, f_j are indexed by x_i, x_j respectively, and their covariance (the ij-th entry of Σ) is defined by k(x_i, x_j). Generate N i.i.d. samples y = (y_1, …, y_N) ~ N(0; I), then transform the sample: f = (f_1, …, f_N) = μ + Σ^(1/2) y.
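
A minimal sketch of this sampling recipe with the squared exponential covariance (using a Cholesky factor with a small jitter term as the matrix square root; the grid, the jitter, and the plotting note are my assumptions, not from the slides):

import numpy as np

def rbf(x, xp):
    # k(x, x') = exp(-1/2 (x - x')^2)
    return np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2)

N = 100
xs = np.linspace(-5, 5, N)           # indices x_1, ..., x_N
Sigma = rbf(xs, xs)                  # Sigma_ij = k(x_i, x_j)

rng = np.random.default_rng(0)
y = rng.standard_normal(N)           # N i.i.d. samples, y ~ N(0, I)
L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(N))  # plays the role of Sigma^{1/2}
f = 0.0 + L @ y                      # f = mu + Sigma^{1/2} y, here mu = 0

# Plotting f against xs gives one random function drawn from this GP prior.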

Random functions from a Gaussian process. Now there are two indices, x and y, with covariance function k((x, y), (x', y')) = exp(−((x − x')² + (y − y')²)).

Gaussian process as a prior. A Gaussian process is a prior over functions, so we can use it for nonparametric regression: fit a function to noisy observations. Gaussian process regression: Gaussian likelihood y | x, f(x) ~ N(f, σ²_noise I), where the parameter is a function with Gaussian process prior f(x) ~ GP(m(x) = 0, k(x, x')).

Graphical model for a Gaussian process. Square nodes are observed, round nodes are unobserved (latent). Red nodes are training data, blue nodes are test data. All pairs of latent variables f are connected. The prediction of y depends only on the corresponding f. We can do learning and inference based on this graphical model.

Covariance function of Gaussian processes. For any finite collection of indices x_1, x_2, …, x_n, the covariance matrix must be positive semidefinite:
Σ = [ k(x_1, x_1)  k(x_1, x_2)  …  k(x_1, x_n)
      k(x_2, x_1)  k(x_2, x_2)  …  k(x_2, x_n)
      …
      k(x_n, x_1)  k(x_n, x_2)  …  k(x_n, x_n) ]
The covariance function needs to be a kernel function over the indices! E.g. the Gaussian RBF kernel k(x, x') = exp(−½ ‖x − x'‖²).
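
A quick numerical illustration of this positive semidefiniteness requirement for the Gaussian RBF kernel (the random inputs and tolerance are arbitrary choices for the sketch):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=20)

# Sigma_ij = k(x_i, x_j) = exp(-1/2 (x_i - x_j)^2)
Sigma = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

eigvals = np.linalg.eigvalsh(Sigma)
print(eigvals.min() >= -1e-10)  # all eigenvalues nonnegative, up to round-off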

Covariance function of a Gaussian process. Another example:
k(x_i, x_j) = v_0 exp(−(|x_i − x_j| / r)^α) + v_1 + v_2 δ_ij
These kernel parameters are interpretable in the covariance function context: v_0 is the variance scale, v_1 the variance bias, v_2 the noise variance, r the lengthscale, and α the roughness.

Samples from GPs with different kernels.

Matern kernel.
k(x_i, x_j) = (1 / (Γ(ν) 2^(ν−1))) (√(2ν) |x_i − x_j| / l)^ν K_ν(√(2ν) |x_i − x_j| / l)
where K_ν is the modified Bessel function of the second kind of order ν, and l is the lengthscale. Sample functions from a GP with a Matern kernel are ⌈ν⌉ − 1 times differentiable; the hyperparameter ν controls the smoothness. Special cases (let r = |x_i − x_j|):
k_{ν=1/2}(r) = exp(−r/l): Laplace kernel, Brownian motion
k_{ν=3/2}(r) = (1 + √3 r/l) exp(−√3 r/l) (once differentiable)
k_{ν=5/2}(r) = (1 + √5 r/l + 5r²/(3l²)) exp(−√5 r/l) (twice differentiable)
k_{ν→∞}(r) = exp(−r²/(2l²)): smooth (infinitely differentiable)
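
A sketch of the three listed special cases, checked against the general Bessel-function form via scipy.special.kv (the distance grid and the small clamp at r = 0 are my own choices):

import numpy as np
from scipy.special import kv, gamma

def matern(r, nu, l=1.0):
    # General form: (2^{1-nu}/Gamma(nu)) (sqrt(2 nu) r / l)^nu K_nu(sqrt(2 nu) r / l)
    r = np.maximum(r, 1e-12)          # guard against 0 * inf at r = 0 (the limit there is 1)
    z = np.sqrt(2 * nu) * r / l
    return (2 ** (1 - nu) / gamma(nu)) * z ** nu * kv(nu, z)

def matern_12(r, l=1.0):  # Laplace / exponential kernel
    return np.exp(-r / l)

def matern_32(r, l=1.0):  # once-differentiable sample paths
    return (1 + np.sqrt(3) * r / l) * np.exp(-np.sqrt(3) * r / l)

def matern_52(r, l=1.0):  # twice-differentiable sample paths
    return (1 + np.sqrt(5) * r / l + 5 * r**2 / (3 * l**2)) * np.exp(-np.sqrt(5) * r / l)

r = np.linspace(0.01, 3, 5)
print(np.allclose(matern(r, 0.5), matern_12(r)))   # True
print(np.allclose(matern(r, 1.5), matern_32(r)))   # True
print(np.allclose(matern(r, 2.5), matern_52(r)))   # True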

Matern kernel II. (Figure: univariate Matern kernel functions with unit lengthscale.)

Kernels for periodic, smooth functions. To create a GP over periodic functions, we can first map the inputs to u = (sin x, cos x) and then measure distances in u-space. Combined with the squared exponential function, this gives
k(x, x') = exp(−2 sin²(π (x − x')) / l²)
(Figure: three functions drawn at random; left, l > 1; right, l < 1.)
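
A sketch of this periodic covariance, with samples drawn exactly as in the earlier sampling recipe (the grid, lengthscale, and jitter are illustrative assumptions):

import numpy as np

def periodic_kernel(x, xp, l=1.0):
    # k(x, x') = exp(-2 sin^2(pi (x - x')) / l^2): period-1 functions, smoothness set by l
    return np.exp(-2 * np.sin(np.pi * (x[:, None] - xp[None, :])) ** 2 / l**2)

xs = np.linspace(0, 3, 200)
K = periodic_kernel(xs, xs, l=0.7)

rng = np.random.default_rng(1)
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(xs)))
samples = L @ rng.standard_normal((len(xs), 3))   # three periodic functions drawn at random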

Using Gaussian processes for nonlinear regression. We observe a dataset D = {(x_i, y_i)}_{i=1}^n. The prior P(f) is a Gaussian process, and, just as in the multivariate Gaussian case, the posterior of f is also a Gaussian process. Bayes' rule: P(f | D) = P(D | f) P(f) / P(D). Everything else about GPs follows from the basic rules of probability applied to multivariate Gaussians.

Posterior of a Gaussian process. Gaussian process regression: for simplicity, assume noiseless observations y = f(x). The parameter is a function with Gaussian process prior f(x) ~ GP(m(x) = 0, k(x, x')).
Multivariate Gaussian conditioning: P(Y | X) ~ N(μ_{Y|X}; Σ_{YY|X}) with
μ_{Y|X} = μ_Y + Σ_{YX} Σ_{XX}^(−1) (X − μ_X)
Σ_{YY|X} = Σ_{YY} − Σ_{YX} Σ_{XX}^(−1) Σ_{XY}
GP posterior: with Y = (y_1, …, y_n) = (f(x_1), …, f(x_n)),
f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x')), where
m_post(x) = 0 + Σ_{f(x)Y} Σ_{YY}^(−1) Y
k_post(x, x') = Σ_{f(x)f(x')} − Σ_{f(x)Y} Σ_{YY}^(−1) Σ_{Yf(x')}

Prior and posterior GP. In the noiseless case (y = f(x)), the mean function of the posterior GP passes through the training data points. The posterior GP has reduced variance, with zero variance at the training points. (Figure: prior and posterior.)

Noisy observations. Gaussian likelihood y | x, f(x) ~ N(f, σ²_noise I). With Y = (y_1, …, y_n),
f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x')), where
m_post(x) = 0 + Σ_{f(x)Y} (Σ_{YY} + σ²_noise I)^(−1) Y
k_post(x, x') = Σ_{f(x)f(x')} − Σ_{f(x)Y} (Σ_{YY} + σ²_noise I)^(−1) Σ_{Yf(x')}
The covariance function is the kernel function:
Σ_{f(x)Y} = (k(x, x_1), …, k(x, x_n))
Σ_{YY} = [ k(x_1, x_1)  k(x_1, x_2)  …  k(x_1, x_n)
           k(x_2, x_1)  k(x_2, x_2)  …  k(x_2, x_n)
           …
           k(x_n, x_1)  k(x_n, x_2)  …  k(x_n, x_n) ]
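
A compact numpy sketch of these prediction equations (the toy data, RBF kernel, lengthscale, and noise level are illustrative assumptions):

import numpy as np

def rbf(A, B, l=1.0):
    # k(x, x') = exp(-(x - x')^2 / (2 l^2))
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / l**2)

# Training data: noisy observations of an unknown function
rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, 10)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(10)
x_test = np.linspace(-4, 4, 200)

sigma_noise = 0.1
K_yy = rbf(x_train, x_train) + sigma_noise**2 * np.eye(len(x_train))   # Sigma_YY + sigma^2 I
K_fy = rbf(x_test, x_train)                                            # Sigma_{f(x) Y}
K_ff = rbf(x_test, x_test)                                             # Sigma_{f(x) f(x')}

# m_post(x)    = Sigma_{f(x)Y} (Sigma_YY + sigma^2 I)^{-1} Y
# k_post(x,x') = Sigma_{f(x)f(x')} - Sigma_{f(x)Y} (Sigma_YY + sigma^2 I)^{-1} Sigma_{Y f(x')}
m_post = K_fy @ np.linalg.solve(K_yy, y_train)
k_post = K_ff - K_fy @ np.linalg.solve(K_yy, K_fy.T)
std_post = np.sqrt(np.clip(np.diag(k_post), 0, None))   # pointwise predictive std

# Setting sigma_noise = 0 recovers the noiseless interpolation of the previous slides.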

Prior and posterior: noisy case. In the noisy case (y = f(x) + ε), the mean function of the posterior GP does not necessarily pass through the training data points. The posterior GP still has reduced variance.