Gaussian Processes
Le Song, Machine Learning II: Advanced Topics, CSE 8803ML, Spring 2012
Pictorial view of embedding a distribution
Transform the entire distribution to expected features: a feature map $\varphi$ carries the distribution into the feature space.
Embedding Distributions: Mean
The mean reduces the entire distribution to a single number; the representation power is very restricted (a 1D feature space).
Embedding Distributions: Mean + Variance
The mean and variance reduce the entire distribution to two numbers (a 2D feature space): a richer representation, but still not enough.
Embedding with kernel features
Transform the distribution to an infinite-dimensional vector $\mu_X = E_X[\varphi(X)]$ in the feature space: a rich representation capturing the mean, variance, and higher-order moments.
Finite sample approximation of the embedding
Given samples $x_1, \dots, x_m \sim P(X)$, the embedding is approximated by the empirical average of features: $\hat{\mu}_X = \frac{1}{m} \sum_{i=1}^m \varphi(x_i)$.
Estimating the embedding distance
The finite-sample estimator of $\|\mu_X - \mu_Y\|^2$ forms a kernel matrix over the combined sample, which has 4 blocks (X-X, X-Y, Y-X, Y-Y), and averages each block: add the averages of the two diagonal blocks and subtract the averages of the two cross blocks.
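As a concrete sketch (not from the slides), the four-block average can be written in a few lines of NumPy; the helper names `rbf_kernel` and `mmd2` are my own:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix: k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Squared embedding distance ||mu_X - mu_Y||^2: average the X-X and Y-Y
    blocks of the joint kernel matrix, subtract twice the X-Y block average."""
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2 * rbf_kernel(X, Y, sigma).mean())

X = np.random.randn(200, 2)        # sample from P
Y = np.random.randn(200, 2) + 1.0  # sample from Q (shifted mean)
print(mmd2(X, Y))                  # near 0 only if the two samples match in distribution
```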
Measure Dependence via Embeddings
Use the squared distance in feature space between the joint embedding and the product of the marginal embeddings, $\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2$, to measure the dependence between X and Y; the embeddings capture the means, the covariance, and higher-order features.
Estimating embedding distances
Given samples $(x_1, y_1), \dots, (x_m, y_m) \sim P(X, Y)$, the dependence measure can be expressed via inner products:
$\|\mu_{XY} - \mu_X \otimes \mu_Y\|^2 = \|E_{XY}[\varphi(X) \otimes \psi(Y)] - E_X[\varphi(X)] \otimes E_Y[\psi(Y)]\|^2$
$= \langle \mu_{XY}, \mu_{XY} \rangle - 2 \langle \mu_{XY}, \mu_X \otimes \mu_Y \rangle + \langle \mu_X \otimes \mu_Y, \mu_X \otimes \mu_Y \rangle$
As a kernel matrix operation, with centering matrix $H = I - \frac{1}{m} \mathbf{1}\mathbf{1}^\top$ and the X and Y data ordered in the same way:
$\frac{1}{m^2} \, \mathrm{trace}(K H L H)$, where $K_{ij} = k(x_i, x_j)$ and $L_{ij} = k(y_i, y_j)$.
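A minimal NumPy sketch of this trace estimator (the function names `rbf` and `hsic` are assumptions, not from the slides):

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between two sample sets."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def hsic(X, Y, sigma=1.0):
    """Biased finite-sample dependence measure trace(K H L H) / m^2."""
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m   # centering matrix H = I - (1/m) 1 1^T
    K = rbf(X, X, sigma)                  # K_ij = k(x_i, x_j)
    L = rbf(Y, Y, sigma)                  # L_ij = k(y_i, y_j)
    return np.trace(K @ H @ L @ H) / m**2

X = np.random.randn(200, 1)
print(hsic(X, X**2))                      # nonlinearly dependent pair: clearly above 0
print(hsic(X, np.random.randn(200, 1)))   # independent pair: near 0
```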
Application of kernel distance measure
Multivariate Gaussians
$P(X_1, X_2, \dots, X_n) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\!\left( -\tfrac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)$
Mean vector: $\mu_i = E[X_i]$, $\mu = (\mu_1, \mu_2, \dots, \mu_n)^\top$
Covariance matrix: $\sigma_{ij} = E[(X_i - \mu_i)(X_j - \mu_j)]$, e.g. for $n = 3$:
$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{pmatrix}$
Conditioning on a Gaussian
Joint Gaussian: $P(X, Y) \sim N(\mu, \Sigma)$. Conditioning a Gaussian variable Y on another Gaussian variable X still gives a Gaussian, $P(Y \mid X) \sim N(\mu_{Y|X}, \sigma_{Y|X}^2)$, with
$\mu_{Y|X} = \mu_Y + \sigma_{YX} \sigma_X^{-2} (X - \mu_X)$ (prior mean, corrected by the new observation)
$\sigma_{Y|X}^2 = \sigma_Y^2 - \sigma_{YX}^2 \sigma_X^{-2}$ (prior variance, minus a reduction term)
The posterior variance does not depend on the particular observed value; observing X always decreases the variance.
Conditional Gaussian is a linear model
The conditional linear Gaussian $P(Y \mid X) \sim N(\mu_{Y|X}, \sigma_{Y|X}^2)$ with $\mu_{Y|X} = \mu_Y + \sigma_{YX} \sigma_X^{-2} (X - \mu_X)$ can be written as $P(Y \mid X) \sim N(\beta_0 + \beta X, \sigma_{Y|X}^2)$. The ridge in the figure is the line $\beta_0 + \beta X$. If we take a slice at a particular X, we get a Gaussian, and all these Gaussian slices have the same variance $\sigma_{Y|X}^2 = \sigma_Y^2 - \sigma_{YX}^2 \sigma_X^{-2}$.
Conditional Gaussian (general case)
Joint Gaussian: $P(X, Y) \sim N(\mu, \Sigma)$. Conditional Gaussian: $P(Y \mid X) \sim N(\mu_{Y|X}, \Sigma_{YY|X})$ with
$\mu_{Y|X} = \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (X - \mu_X)$
$\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$
The conditional Gaussian is linear in X: $P(Y \mid X) \sim N(\beta_0 + B X, \Sigma_{YY|X})$ with $\beta_0 = \mu_Y - \Sigma_{YX} \Sigma_{XX}^{-1} \mu_X$ and $B = \Sigma_{YX} \Sigma_{XX}^{-1}$. This is the linear regression model $Y = \beta_0 + B X + \varepsilon$ with white noise $\varepsilon \sim N(0, \Sigma_{YY|X})$.
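A minimal sketch of these conditioning formulas in NumPy (the partitioning convention and the name `condition_gaussian` are my own):

```python
import numpy as np

def condition_gaussian(mu, Sigma, dX, x_obs):
    """Return (mu_{Y|X}, Sigma_{YY|X}) for a joint Gaussian whose first dX
    dimensions are X and whose remaining dimensions are Y."""
    mu_X, mu_Y = mu[:dX], mu[dX:]
    Sxx, Sxy = Sigma[:dX, :dX], Sigma[:dX, dX:]
    Syx, Syy = Sigma[dX:, :dX], Sigma[dX:, dX:]
    B = Syx @ np.linalg.inv(Sxx)          # B = Sigma_YX Sigma_XX^{-1}
    mu_cond = mu_Y + B @ (x_obs - mu_X)   # linear in the observation
    Sigma_cond = Syy - B @ Sxy            # does not depend on x_obs
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])
print(condition_gaussian(mu, Sigma, dX=1, x_obs=np.array([2.0])))
```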
What is a Gaussian Process?
A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables. Formally: a collection of random variables, any finite number of which have (consistent) Gaussian distributions. Informally: an infinitely long vector with dimensions indexed by x, i.e., a function f(x). A Gaussian process is fully specified by a mean function $m(x) = E[f(x)]$ and a covariance function $k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]$: $f(x) \sim GP(m(x), k(x, x'))$, with x as the index.
A set of samples from a Gaussian process
For each fixed value of x there is a Gaussian variable associated with it. Focus on a finite subset of values, $\mathbf{f} = (f(x_1), f(x_2), \dots, f(x_N))^\top$, for which $\mathbf{f} \sim N(0, \Sigma)$ with $\Sigma_{ij} = k(x_i, x_j)$. Then plot the coordinates of $\mathbf{f}$ as a function of the corresponding x values.
Random functions from a Gaussian process
One-dimensional Gaussian process: $f(x) \sim GP\!\left(0,\, k(x, x') = \exp(-\tfrac{1}{2}(x - x')^2)\right)$. To generate a sample from the GP: the Gaussian variables $f_i, f_j$ are indexed by $x_i, x_j$ respectively, and their covariance (the ij-th entry of $\Sigma$) is defined by $k(x_i, x_j)$. Generate N iid samples $y = (y_1, \dots, y_N)^\top \sim N(0, I)$, then transform: $\mathbf{f} = (f_1, \dots, f_N)^\top = \mu + \Sigma^{1/2} y$.
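A sketch of this sampling recipe in NumPy, using a Cholesky factor as the matrix square root (the small jitter term is an implementation detail for numerical stability, not from the slides):

```python
import numpy as np

x = np.linspace(-5, 5, 200)                          # indices of the GP
Sigma = np.exp(-0.5 * (x[:, None] - x[None, :])**2)  # Sigma_ij = k(x_i, x_j)
L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(len(x)))  # acts as Sigma^{1/2}

y = np.random.randn(len(x), 3)   # three iid N(0, I) draws
f = L @ y                        # three random functions; mean mu = 0 here
# Plotting each column f[:, i] against x shows a smooth random curve
# drawn from the GP prior.
```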
Random functions from a Gaussian process (two-dimensional index)
Now there are two indices, x and y, with covariance function $k((x, y), (x', y')) = \exp\!\left( -(x - x')^2 - (y - y')^2 \right)$.
Gaussian process as a prior
A Gaussian process is a prior over functions, so we can use it for nonparametric regression: fitting a function to noisy observations. Gaussian process regression uses a Gaussian likelihood, $y \mid x, f(x) \sim N(f, \sigma_{\text{noise}}^2 I)$, where the parameter is a function with a Gaussian process prior, $f(x) \sim GP(m(x) = 0, k(x, x'))$.
Graphical model for a Gaussian process
Square nodes are observed; round nodes are unobserved (latent). Red nodes are training data; blue nodes are test data. All pairs of latent variables (f) are connected. The prediction of y depends only on the corresponding f. We can do learning and inference based on this graphical model.
Covariance function of Gaussian processes
For any finite collection of indices $x_1, x_2, \dots, x_n$, the covariance matrix must be positive semidefinite:
$\Sigma = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix}$
The covariance function therefore needs to be a kernel function over the indices, e.g. the Gaussian RBF kernel $k(x, x') = \exp(-\tfrac{1}{2} \|x - x'\|^2)$.
Covariance function of a Gaussian process: another example
$k(x_i, x_j) = v_0 \exp\!\left( -\left( \frac{|x_i - x_j|}{r} \right)^{\alpha} \right) + v_1 + v_2 \delta_{ij}$
These kernel parameters are interpretable in the covariance-function context: $v_0$ is the variance scale, $v_1$ the variance bias, $v_2$ the noise variance, $r$ the lengthscale, and $\alpha$ the roughness.
Samples from GPs with different kernels
Matern kernel
$k(x_i, x_j) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}}{l} \|x_i - x_j\| \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}}{l} \|x_i - x_j\| \right)$
where $K_{\nu}$ is the modified Bessel function of the second kind of order $\nu$ and $l$ is the length scale. Sample functions from a GP with a Matern kernel are $\lceil \nu \rceil - 1$ times differentiable, so the hyperparameter $\nu$ controls smoothness. Special cases (let $r = \|x_i - x_j\|$):
$k_{\nu=1/2}(r) = \exp\!\left(-\tfrac{r}{l}\right)$: Laplace kernel, Brownian motion
$k_{\nu=3/2}(r) = \left(1 + \tfrac{\sqrt{3}\,r}{l}\right) \exp\!\left(-\tfrac{\sqrt{3}\,r}{l}\right)$: once differentiable
$k_{\nu=5/2}(r) = \left(1 + \tfrac{\sqrt{5}\,r}{l} + \tfrac{5 r^2}{3 l^2}\right) \exp\!\left(-\tfrac{\sqrt{5}\,r}{l}\right)$: twice differentiable
$k_{\nu \to \infty}(r) = \exp\!\left(-\tfrac{r^2}{2 l^2}\right)$: smooth (infinitely differentiable)
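A sketch of the closed-form special cases in NumPy (the function name `matern` is my own; general $\nu$ would need the Bessel function, e.g. `scipy.special.kv`):

```python
import numpy as np

def matern(r, nu, l=1.0):
    """Matern kernel at distance r for nu in {1/2, 3/2, 5/2}."""
    r = np.abs(r)
    if nu == 0.5:
        return np.exp(-r / l)                                # Laplace kernel
    if nu == 1.5:
        c = np.sqrt(3) * r / l
        return (1 + c) * np.exp(-c)                          # once differentiable
    if nu == 2.5:
        c = np.sqrt(5) * r / l
        return (1 + c + 5 * r**2 / (3 * l**2)) * np.exp(-c)  # twice differentiable
    raise ValueError("closed form implemented only for nu = 1/2, 3/2, 5/2")

r = np.linspace(0, 3, 7)
print(matern(r, 0.5), matern(r, 1.5), matern(r, 2.5))
```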
Matern kernel II
(Figure: univariate Matern kernel functions with unit length scale.)
Kernels for periodic, smooth functions
To create a GP over periodic functions, we can first map the inputs to $u = (\sin x, \cos x)$ and then measure distance in u-space. Combined with the squared exponential kernel, this gives
$k(x, x') = \exp\!\left( -\frac{2 \sin^2(\pi (x - x'))}{l^2} \right)$
(Figure: three functions drawn at random, with $l > 1$ on the left and $l < 1$ on the right.)
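A sketch of this periodic kernel in NumPy (the name `periodic_kernel` is my own); with the $\pi$ rescaling, the kernel has period 1 in $x - x'$:

```python
import numpy as np

def periodic_kernel(x1, x2, l=1.0):
    """k(x, x') = exp(-2 sin^2(pi (x - x')) / l^2)."""
    return np.exp(-2.0 * np.sin(np.pi * (x1 - x2))**2 / l**2)

print(periodic_kernel(0.3, 0.3))  # 1.0: zero distance
print(periodic_kernel(0.3, 1.3))  # 1.0 again: inputs exactly one period apart
print(periodic_kernel(0.3, 0.8))  # smallest value: half a period apart
```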
Using Gaussian processes for nonlinear regression
Observing a dataset $D = \{(x_i, y_i)\}_{i=1}^n$, with a Gaussian process prior $P(f)$: just as with a multivariate Gaussian, the posterior over f is also a Gaussian process. Bayes' rule: $P(f \mid D) = \frac{P(D \mid f)\, P(f)}{P(D)}$. Everything else about GPs follows from the basic rules of probability applied to multivariate Gaussians.
Posterior of a Gaussian process
Gaussian process regression with, for simplicity, noiseless observations $y = f(x)$; the parameter is a function with Gaussian process prior $f(x) \sim GP(m(x) = 0, k(x, x'))$. Recall multivariate Gaussian conditioning: $P(Y \mid X) \sim N(\mu_{Y|X}, \Sigma_{YY|X})$ with $\mu_{Y|X} = \mu_Y + \Sigma_{YX} \Sigma_{XX}^{-1} (X - \mu_X)$ and $\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$. Writing $Y = (y_1, \dots, y_n)^\top = (f(x_1), \dots, f(x_n))^\top$, the GP posterior is
$f(x) \mid \{(x_i, y_i)\}_{i=1}^n \sim GP(m_{\text{post}}(x), k_{\text{post}}(x, x'))$
$m_{\text{post}}(x) = 0 + \Sigma_{f(x) Y} \Sigma_{YY}^{-1} Y$
$k_{\text{post}}(x, x') = \Sigma_{f(x) f(x')} - \Sigma_{f(x) Y} \Sigma_{YY}^{-1} \Sigma_{Y f(x')}$
Prior and posterior GP
In the noiseless case ($y = f(x)$), the mean function of the posterior GP passes through the training data points. The posterior GP has reduced variance, with zero variance at the training points. (Figure panels: prior; posterior.)
Noisy observations
With Gaussian likelihood $y \mid x, f(x) \sim N(f, \sigma_{\text{noise}}^2 I)$ and $Y = (y_1, \dots, y_n)^\top$:
$f(x) \mid \{(x_i, y_i)\}_{i=1}^n \sim GP(m_{\text{post}}(x), k_{\text{post}}(x, x'))$
$m_{\text{post}}(x) = 0 + \Sigma_{f(x) Y} (\Sigma_{YY} + \sigma_{\text{noise}}^2 I)^{-1} Y$
$k_{\text{post}}(x, x') = \Sigma_{f(x) f(x')} - \Sigma_{f(x) Y} (\Sigma_{YY} + \sigma_{\text{noise}}^2 I)^{-1} \Sigma_{Y f(x')}$
The covariance function is the kernel function: $\Sigma_{f(x) Y} = (k(x, x_1), \dots, k(x, x_n))$ and
$\Sigma_{YY} = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}$
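A minimal GP regression sketch implementing these posterior equations with an RBF kernel (all names here are my own, and `np.linalg.solve` replaces the explicit matrix inverse for numerical stability):

```python
import numpy as np

def k(A, B, l=1.0):
    """RBF covariance matrix between two 1-D index arrays."""
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2 / l**2)

def gp_posterior(x_train, y_train, x_test, noise=0.1, l=1.0):
    """Posterior mean and covariance at x_test given noisy observations."""
    K = k(x_train, x_train, l) + noise**2 * np.eye(len(x_train))  # Sigma_YY + sigma^2 I
    Ks = k(x_test, x_train, l)                                    # Sigma_{f(x*) Y}
    Kss = k(x_test, x_test, l)                                    # Sigma_{f(x*) f(x*)}
    mean = Ks @ np.linalg.solve(K, y_train)            # m_post(x*)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)          # k_post(x*, x*)
    return mean, cov

x_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(x_train)
x_test = np.linspace(-3, 3, 5)
mean, cov = gp_posterior(x_train, y_train, x_test)
print(mean)          # posterior mean at the test points
print(np.diag(cov))  # variance, reduced near the training inputs
```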
Prior and posterior: noisy case
In the noisy case ($y = f(x) + \varepsilon$), the mean function of the posterior GP does not necessarily pass through the training data points. The posterior GP still has reduced variance.