Probabilistic Graphical Models Lecture 20: Gaussian Processes

Probabilistic Graphical Models Lecture 20: Gaussian Processes Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 30, 2015 1 / 53

What is Machine Learning? Machine learning algorithms adapt with data, rather than relying on fixed decision rules. Machine learning aims not only to equip people with tools to analyse data, but to create algorithms which can learn and make decisions without human intervention. 1,2 In order for a model to automatically learn and make decisions, it must be able to discover patterns and extrapolate those patterns to new situations. 1 E.g., N.D. Lawrence (2010), What is Machine Learning? 2 T.M. Mitchell (2006), What is Machine Learning and Where Is it Headed? 2 / 53

Function Learning Example [Figure: airline passengers (thousands) versus year, 1949-1961; training data only] 3 / 53

Function Learning Example [Figure: as above, with a "Human?" extrapolation of the training data] 4 / 53

Function Learning Example [Figure: as above, with an "Alien?" extrapolation] 5 / 53

Function Learning Example [Figure: as above, with another "Human?" extrapolation] 6 / 53

Function Learning Example [Figure: as above, with both "Human?" and "Alien?" extrapolations] 7 / 53

Function Learning Example [Figure: as above, with the held-out test data revealed alongside the "Human?" and "Alien?" extrapolations] 8 / 53

Building an Intelligent Model The ability of a model to learn from data depends on its: 1. Support: what solutions we think are a priori possible. 2. Inductive biases: what solutions we think are a priori likely. Examples: Function Learning, Character Recognition. The human ability to make remarkable generalisations from data could derive from an expressive prior combined with Bayesian inference. 9 / 53

Statistics from Scratch
Basic Regression Problem: Training set of N targets (observations) y = (y(x_1), ..., y(x_N))^T. Observations evaluated at inputs X = (x_1, ..., x_N)^T. We want to predict the value of y(x_*) at a test input x_*.
For example: given CO2 concentrations y measured at times X, what will the CO2 concentration be for x_* = 2024, 10 years from now?
Just knowing high school math, what might you try?
10 / 53

Statistics from Scratch
[Figure: airline passengers (thousands) versus year, 1949-1961]
11 / 53

Statistics from Scratch
Guess the parametric form of a function that could fit the data:
f(x, w) = w^T x    [Linear function of w and x]
f(x, w) = w^T φ(x)    [Linear function of w]  (Linear Basis Function Model)
f(x, w) = g(w^T φ(x))    [Non-linear in x and w]  (E.g., Neural Network)
φ(x) is a vector of basis functions. For example, if φ(x) = (1, x, x^2) and x ∈ R^1, then f(x, w) = w_0 + w_1 x + w_2 x^2 is a quadratic function.
Choose an error measure E(w) and minimize it with respect to w:
E(w) = Σ_{i=1}^N [f(x_i, w) − y(x_i)]^2
12 / 53
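
To make the recipe concrete, here is a minimal MATLAB sketch (not from the slides) that fits the quadratic basis φ(x) = (1, x, x^2) by least squares; the synthetic data and variable names are my own.

% Minimal least-squares fit with a quadratic basis (synthetic data for illustration)
x = linspace(-3, 3, 50)';                  % inputs
y = 1 + 2*x - 0.5*x.^2 + 0.3*randn(50,1);  % noisy observations of a quadratic
Phi = [ones(size(x)), x, x.^2];            % basis functions phi(x) = (1, x, x^2)
w = Phi \ y;                               % minimise E(w) = sum_i (phi(x_i)'*w - y_i)^2
yhat = Phi * w;                            % fitted values
plot(x, y, '.', x, yhat, '-');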

Statistics from Scratch
A probabilistic approach: we could explicitly account for noise in our model, y(x) = f(x, w) + ε(x), where ε(x) is a noise function. One commonly takes ε(x) ∼ N(0, σ^2) for i.i.d. additive Gaussian noise, in which case
p(y(x) | x, w, σ^2) = N(y(x); f(x, w), σ^2)    [Observation Model]  (1)
p(y | X, w, σ^2) = Π_{i=1}^N N(y(x_i); f(x_i, w), σ^2)    [Likelihood]  (2)
Maximize the likelihood of the data p(y | X, w, σ^2) with respect to σ^2 and w. For a Gaussian noise model, this approach will make the same predictions as using a squared loss error function:
log p(y | X, w, σ^2) ∝ −(1/(2σ^2)) Σ_{i=1}^N [f(x_i, w) − y(x_i)]^2    (3)
13 / 53
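
As a quick numerical illustration of equation (3) (my own sketch, reusing Phi and y from the previous snippet and assuming a noise level sigma), maximising the Gaussian likelihood in w gives the same answer as least squares:

sigma = 0.3;                                     % assumed noise standard deviation
nll = @(w) 0.5/sigma^2 * sum((Phi*w - y).^2);    % negative log likelihood, up to constants in w
w_ml = fminsearch(nll, zeros(size(Phi,2),1));    % maximum likelihood estimate of w
w_ls = Phi \ y;                                  % least-squares solution
[w_ml, w_ls]                                     % the two estimates agree up to optimiser tolerance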

Statistics from Scratch
The probabilistic approach helps us interpret the error measure in the deterministic approach, and gives us a sense of the noise level σ^2. Probabilistic methods thus provide an intuitive framework for representing uncertainty, and for model development. Both approaches are prone to over-fitting for flexible f(x, w): low error on the training data, high error on the test set.
Regularization: use a penalized log likelihood (or error function), such as
log p(y | X, w) ∝ −(1/(2σ^2)) Σ_{i=1}^n [f(x_i, w) − y(x_i)]^2  [model fit]  − λ w^T w  [complexity penalty].    (4)
But how should we define complexity, and how much should we penalize complexity? We can set λ using cross-validation.
14 / 53

Bayesian Inference
Bayes' Rule:
p(a | b) = p(b | a) p(a) / p(b),    p(a | b) ∝ p(b | a) p(a).    (5)
posterior = (likelihood × prior) / (marginal likelihood):
p(w | y, X, σ^2) = p(y | X, w, σ^2) p(w) / p(y | X, σ^2).    (6)
Predictive Distribution:
p(y_* | x_*, y, X) = ∫ p(y_* | x_*, w) p(w | y, X) dw.    (7)
An average of infinitely many models weighted by their posterior probabilities. No over-fitting, automatically calibrated complexity. Typically we are more interested in the distribution over functions than in the parameters w.
15 / 53

Representing Uncertainty
Different types of uncertainty:
Uncertainty through lack of knowledge.
Intrinsic uncertainty; e.g., radioactive decay.
Uncertainty through lack of knowledge can seem like intrinsic uncertainty (e.g., rolling dice).
Regardless of whether or not the universe is deterministic, and whether there is some underlying true answer, we will always have uncertainty. We can represent this belief using probability distributions (Bayesian methods, probabilistic modelling).
16 / 53

Parametric Regression Review
Deterministic:
E(w) = Σ_{i=1}^N (f(x_i, w) − y_i)^2.    (8)
Maximum Likelihood:
p(y(x) | x, w) = N(y(x); f(x, w), σ_n^2),    (9)
p(y | X, w) = Π_{i=1}^N N(y(x_i); f(x_i, w), σ_n^2).    (10)
Bayesian:
posterior = (likelihood × prior) / (marginal likelihood),
p(w | y, X) = p(y | X, w) p(w) / p(y | X).    (11)
17 / 53

Model Selection and Marginal Likelihood
p(y | M_1, X) = ∫ p(y | f_1(x, w)) p(w) dw    (13)
[Figure: marginal likelihood p(y | M) over all possible datasets y, for a simple model, a complex model, and an appropriate model]
18 / 53

Blackboard: Examples of Occam's Razor in Everyday Inferences
For further reading, see the MacKay (2003) textbook, Information Theory, Inference, and Learning Algorithms.
19 / 53

Occam's Razor Example
−1, 3, 7, 11, ??, ??
H_1: the sequence is an arithmetic progression, "add n", where n is an integer.
H_2: the sequence is generated by a cubic function of the form cx^3 + dx^2 + e, where c, d, and e are fractions. (−(1/11) x^3 + (9/11) x^2 + 23/11)
20 / 53

Model Selection
[Figure: observations y(x) over inputs x in [0, 100]]
Observations y(x). Assume p(y(x) | f(x)) = N(y(x); f(x), σ^2). Consider polynomials of different orders. As always, the observations are out of the chosen model class! Which model should we choose?
f_0(x) = a_0,    (14)
f_1(x) = a_0 + a_1 x,    (15)
f_2(x) = a_0 + a_1 x + a_2 x^2,    (16)
...    (17)
f_J(x) = a_0 + a_1 x + a_2 x^2 + ... + a_J x^J.    (18)
21 / 53

Model Selection: Occam's Hill
[Figure: marginal likelihood (evidence) versus model order 1-11, peaked at an intermediate order]
Marginal likelihood (evidence) as a function of model order, using an isotropic prior p(a) = N(0, σ^2 I).
22 / 53

Model Selection: Occam's Asymptote
[Figure: marginal likelihood (evidence) versus model order 1-11, rising towards an asymptote]
Marginal likelihood (evidence) as a function of model order, using an anisotropic prior p(a_i) = N(0, γ_i), with γ learned from the data.
23 / 53

Occam's Razor
[Figure: marginal likelihood (evidence) versus model order, (a) with an isotropic Gaussian prior and (b) with an anisotropic Gaussian prior]
For further reading, see Rasmussen and Ghahramani (2001) (Occam's Razor) and Kass and Raftery (1995) (Bayes Factors).
24 / 53

Linear Basis Models
Consider the simple linear model,
f(x) = a_0 + a_1 x,    (19)
a_0, a_1 ∼ N(0, 1).    (20)
[Figure: sample functions f(x) drawn from this model, for x in [−10, 10]]
25 / 53

Linear Models
We are interested in the induced distribution over functions, not the parameters. Let's characterise the properties of these functions directly:
f(x | a_0, a_1) = a_0 + a_1 x,    a_0, a_1 ∼ N(0, 1).    (21)
E[f(x)] = E[a_0] + E[a_1] x = 0.    (22)
cov[f(x_b), f(x_c)] = E[f(x_b) f(x_c)] − E[f(x_b)] E[f(x_c)]    (23)
= E[a_0^2 + a_0 a_1 (x_b + x_c) + a_1^2 x_b x_c] − 0    (24)
= E[a_0^2] + E[a_1^2 x_b x_c] + E[a_0 a_1 (x_b + x_c)]    (25)
= 1 + x_b x_c + 0    (26)
= 1 + x_b x_c.    (27)
26 / 53
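
A quick numerical check of this derivation (my own sketch, not from the slides): sample many (a_0, a_1) pairs and compare the empirical covariance of f(x_b) and f(x_c) with 1 + x_b x_c.

% Empirical check that cov[f(x_b), f(x_c)] = 1 + x_b*x_c for f(x) = a0 + a1*x
S = 1e5;                            % number of Monte Carlo samples
a0 = randn(S,1); a1 = randn(S,1);   % a0, a1 ~ N(0,1)
xb = 0.7; xc = -1.3;                % two arbitrary input locations
fb = a0 + a1*xb; fc = a0 + a1*xc;
empirical = mean(fb.*fc) - mean(fb)*mean(fc)
theoretical = 1 + xb*xc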

Linear Models
Therefore any collection of values has a joint Gaussian distribution,
[f(x_1), ..., f(x_N)] ∼ N(0, K),    (28)
K_ij = cov(f(x_i), f(x_j)) = k(x_i, x_j) = 1 + x_i x_j.    (29)
By definition, f(x) is a Gaussian process.
Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution. We write f(x) ∼ GP(m, k) to mean
[f(x_1), ..., f(x_N)] ∼ N(μ, K),    (30)
μ_i = m(x_i),    (31)
K_ij = k(x_i, x_j),    (32)
for any collection of input values x_1, ..., x_N. In other words, f is a GP with mean function m(x) and covariance kernel k(x_i, x_j).
27 / 53

Linear Basis Function Models
Model Specification:
f(x, w) = w^T φ(x)    (33)
p(w) = N(0, Σ_w)    (34)
Moments of the Induced Distribution over Functions:
E[f(x, w)] = m(x) = E[w^T] φ(x) = 0    (35)
cov(f(x_i), f(x_j)) = k(x_i, x_j) = E[f(x_i) f(x_j)] − E[f(x_i)] E[f(x_j)]    (36)
= φ(x_i)^T E[w w^T] φ(x_j) − 0    (37)
= φ(x_i)^T Σ_w φ(x_j)    (38)
f(x, w) is a Gaussian process, f(x) ∼ GP(m, k), with mean function m(x) = 0 and covariance kernel k(x_i, x_j) = φ(x_i)^T Σ_w φ(x_j). The entire basis function model of Eqs. (33) and (34) is encapsulated as a distribution over functions with kernel k(x, x').
28 / 53
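
The induced kernel can be written down directly in code. Here is a small sketch (mine, with an assumed quadratic basis and Σ_w = I) of k(x, x') = φ(x)^T Σ_w φ(x'):

% Kernel induced by a basis function model: k(x, x') = phi(x)' * Sigma_w * phi(x')
phi = @(x) [1; x; x.^2];            % example basis phi(x) = (1, x, x^2)
Sigma_w = eye(3);                   % prior covariance of the weights, p(w) = N(0, Sigma_w)
k = @(x, xp) phi(x)' * Sigma_w * phi(xp);
k(0.5, 2.0)                         % prior covariance between f(0.5) and f(2.0)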

Gaussian Processes We are ultimately more interested in, and have stronger intuitions about, the functions that model our data than the weights w in a parametric model, and we can express those intuitions using a covariance kernel. The kernel controls the support and inductive biases of our model, and thus its ability to generalise. 29 / 53

Example: RBF Kernel
k_RBF(x, x') = cov(f(x), f(x')) = a^2 exp(−||x − x'||^2 / (2l^2))    (39)
Far and away the most popular kernel. Expresses the intuition that function values at nearby inputs are more correlated than function values at far away inputs. The kernel hyperparameters a and l control the amplitude and wiggliness of these functions. GPs with an RBF kernel have large support and are universal approximators.
30 / 53

Sampling from a GP with an RBF Kernel

a = 1; l = 2;                                     % kernel amplitude and length-scale
k_rbf = @(x,xp) a^2*exp(-(x-xp).^2/(2*l^2));      % RBF kernel (assumed; definition not shown on the slide)
x = [-10:0.2:10]';      % inputs (where we query the GP)
N = numel(x);           % number of inputs
K = zeros(N,N);         % covariance matrix
% very inefficient way of creating K in Matlab
for i=1:N
    for j=1:N
        K(i,j) = k_rbf(x(i),x(j));
    end
end
K = K + 1e-6*eye(N);    % add jitter for conditioning
CK = chol(K);           % upper-triangular factor, CK'*CK = K
f = CK'*randn(N,1);     % draws from N(0,K)
plot(x,f);
31 / 53

Samples from a GP with an RBF Kernel
[Figure: sample functions f(x) drawn from the GP prior, for x in [−10, 10]]
32 / 53

1D RBF Kernel with Different Length-scales
k_RBF(x, x') = cov(f(x), f(x')) = a^2 exp(−||x − x'||^2 / (2l^2))    (40)
Figure: SE kernels with different length-scales (l = 0.7, l = 2.28, l = 7), plotted as a function of τ = x − x'.
33 / 53

RBF Kernel Covariance Matrix
k_RBF(x, x') = cov(f(x), f(x')) = a^2 exp(−||x − x'||^2 / (2l^2))    (41)
[Figure: the covariance matrix K for ordered inputs on a 1D grid, with K_ij = k_RBF(x_i, x_j)]
34 / 53
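
A short sketch (mine, with assumed values a = 1 and l = 2) that builds and displays such a covariance matrix on an ordered 1D grid:

% Build and display the RBF covariance matrix on an ordered 1D grid
a = 1; l = 2;                             % amplitude and length-scale
x = linspace(0, 10, 100)';
K = a^2 * exp(-(x - x').^2 / (2*l^2));    % K_ij = k_RBF(x_i, x_j), via implicit expansion (R2016b+)
imagesc(K); colorbar; title('RBF covariance matrix');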

Gaussian Process Inference
Observed noisy data y = (y(x_1), ..., y(x_N))^T at input locations X. Start with the standard regression assumption: N(y(x); f(x), σ^2). Place a Gaussian process distribution over noise-free functions, f(x) ∼ GP(0, k_θ). The kernel k is parametrized by θ. Infer p(f_* | y, X, X_*) for the noise-free function f evaluated at test points X_*.
Joint distribution:
[y; f_*] ∼ N( 0, [ K_θ(X, X) + σ^2 I,  K_θ(X, X_*);  K_θ(X_*, X),  K_θ(X_*, X_*) ] ).    (42)
Conditional predictive distribution:
f_* | X_*, X, y, θ ∼ N(f̄_*, cov(f_*)),    (43)
f̄_* = K_θ(X_*, X) [K_θ(X, X) + σ^2 I]^{-1} y,    (44)
cov(f_*) = K_θ(X_*, X_*) − K_θ(X_*, X) [K_θ(X, X) + σ^2 I]^{-1} K_θ(X, X_*).    (45)
35 / 53
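
The predictive equations (44) and (45) translate directly into code. Below is a minimal MATLAB sketch (mine, not from the slides) on toy data, with fixed rather than learned hyperparameters:

% GP regression: predictive mean and covariance (minimal sketch of Eqs. 44-45)
a = 1; l = 1; sigma = 0.1;                            % kernel amplitude, length-scale, noise level
k  = @(X1, X2) a^2 * exp(-(X1 - X2').^2 / (2*l^2));   % RBF kernel between column vectors of inputs
X  = (-5:0.5:5)';  y = sin(X) + sigma*randn(size(X)); % toy training data
Xs = (-6:0.1:6)';                                     % test inputs X_*
Ky   = k(X, X) + sigma^2 * eye(numel(X));             % K_theta(X,X) + sigma^2 I
fbar = k(Xs, X) * (Ky \ y);                           % predictive mean, Eq. (44)
covf = k(Xs, Xs) - k(Xs, X) * (Ky \ k(X, Xs));        % predictive covariance, Eq. (45)
plot(X, y, '+', Xs, fbar, '-');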

Inference using an RBF kernel
Specify f(x) ∼ GP(0, k). Choose k_RBF(x, x') = a_0^2 exp(−||x − x'||^2 / (2l_0^2)). Choose values for a_0 and l_0. Observe data, then look at the prior and posterior over functions.
[Figure: (a) samples from the GP prior; (b) samples from the GP posterior, for x in [−10, 10]]
Does something look strange about these functions?
36 / 53

Inference using an RBF kernel
Increase the length-scale l.
[Figure: (a) samples from the GP prior and (b) samples from the GP posterior, with the larger length-scale]
37 / 53

Learning and Model Selection p(m i y) = p(y M i)p(m i ) (46) p(y) We can write the evidence of the model as p(y M i ) = p(y f, M i )p(f)df, (47) Complex Model Simple Model Appropriate Model 4 3 2 Data Simple Complex Appropriate p(y M) Output, f(x) 1 0 1 2 3 y All Possible Datasets (a) 4 10 8 6 4 2 0 2 4 6 8 10 Input, x (b) 38 / 53

Learning and Model Selection
We can integrate away the entire Gaussian process f(x) to obtain the marginal likelihood as a function of the kernel hyperparameters θ alone:
p(y | θ, X) = ∫ p(y | f, X) p(f | θ, X) df.    (48)
log p(y | θ, X) = −(1/2) y^T (K_θ + σ^2 I)^{-1} y  [model fit]  − (1/2) log |K_θ + σ^2 I|  [complexity penalty]  − (N/2) log(2π).    (49)
This is an extremely powerful mechanism for kernel learning.
[Figure: samples from the GP prior and samples from the GP posterior]
39 / 53
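
Equation (49) is typically evaluated through a Cholesky factor rather than an explicit inverse or determinant. A minimal sketch (mine), assuming a kernel matrix K, targets y, and noise level sigma are already in the workspace:

% Log marginal likelihood of a GP (Eq. 49), evaluated stably via a Cholesky factor
N  = numel(y);
Ky = K + sigma^2 * eye(N);
L  = chol(Ky, 'lower');                 % Ky = L*L'
alpha = L' \ (L \ y);                   % (K_theta + sigma^2 I)^{-1} y
logML = -0.5 * (y' * alpha) ...         % model fit
        - sum(log(diag(L))) ...         % -0.5*log det(Ky), the complexity penalty
        - 0.5 * N * log(2*pi);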

Learning and Model Selection
A fully Bayesian treatment would integrate away the kernel hyperparameters θ:
p(f_* | X_*, X, y) = ∫ p(f_* | X_*, X, y, θ) p(θ | y) dθ.    (50)
For example, we could specify a prior p(θ), use MCMC to take J samples from p(θ | y) ∝ p(y | θ) p(θ), and then find
p(f_* | X_*, X, y) ≈ (1/J) Σ_{i=1}^J p(f_* | X_*, X, y, θ^(i)),    θ^(i) ∼ p(θ | y).    (51)
If we have a non-Gaussian noise model, and thus cannot integrate away f, the strong dependencies between the Gaussian process f and the hyperparameters θ make sampling extremely difficult. In my experience, the most effective solution is to use a deterministic approximation for the posterior p(f | y), which enables one to work with an approximate marginal likelihood.
40 / 53

Gaussian Process Covariance Kernels
Let τ = x − x':
k_SE(τ) = exp(−0.5 τ^2 / l^2)    (52)
k_MA(τ) = a (1 + √3 τ / l) exp(−√3 τ / l)    (53)
k_RQ(τ) = (1 + τ^2 / (2 α l^2))^{−α}    (54)
k_PE(τ) = exp(−2 sin^2(π τ ω) / l^2)    (55)
41 / 53
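
For reference, the four kernels can be written as anonymous functions of τ. This is my own sketch with arbitrary example hyperparameters, using |τ| in the Matérn form so it applies to negative τ as well:

% The four stationary kernels above as functions of tau = x - x' (sketch)
a = 1; l = 1; alpha = 2; omega = 1;                                     % example hyperparameters
k_SE = @(tau) exp(-0.5 * tau.^2 / l^2);                                 % squared exponential
k_MA = @(tau) a*(1 + sqrt(3)*abs(tau)/l) .* exp(-sqrt(3)*abs(tau)/l);   % Matern-3/2
k_RQ = @(tau) (1 + tau.^2 ./ (2*alpha*l^2)).^(-alpha);                  % rational quadratic
k_PE = @(tau) exp(-2 * sin(pi * tau * omega).^2 / l^2);                 % periodic
tau = linspace(0, 10, 200);
plot(tau, [k_SE(tau); k_MA(tau); k_RQ(tau); k_PE(tau)]);
legend('SE', 'MA', 'RQ', 'PE');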

Inference and Learning
1. Learning: optimize the marginal likelihood,
log p(y | θ, X) = −(1/2) y^T (K_θ + σ^2 I)^{-1} y  [model fit]  − (1/2) log |K_θ + σ^2 I|  [complexity penalty]  − (N/2) log(2π),
with respect to the kernel hyperparameters θ.
2. Inference: conditioned on kernel hyperparameters θ, form the predictive distribution for test inputs X_*:
f_* | X_*, X, y, θ ∼ N(f̄_*, cov(f_*)),
f̄_* = K_θ(X_*, X) [K_θ(X, X) + σ^2 I]^{-1} y,
cov(f_*) = K_θ(X_*, X_*) − K_θ(X_*, X) [K_θ(X, X) + σ^2 I]^{-1} K_θ(X, X_*).
42 / 53

Gaussian process graphical model
Squares are observed variables, circles are latent variables, and the thick bar is a set of fully connected nodes. Each y_i is conditionally independent given f_i. Because of the marginalization property of a GP, the addition of further inputs x_* and unobserved targets y_* does not change the distribution of any other variables.
Figure from GPML, Rasmussen and Williams (2006).
43 / 53

Worked Example: Combining Kernels, CO2 Data
[Figure: CO2 concentration (ppm), roughly 320-400, versus year, 1968-2004]
Example from Rasmussen and Williams (2006), Gaussian Processes for Machine Learning.
44 / 53

Worked Example: Combining Kernels, CO2 Data 45 / 53

Worked Example: Combining Kernels, CO2 Data
Long rising trend:
k_1(x_p, x_q) = θ_1^2 exp(−(x_p − x_q)^2 / (2 θ_2^2))
Quasi-periodic seasonal changes:
k_2(x_p, x_q) = k_RBF(x_p, x_q) k_PER(x_p, x_q) = θ_3^2 exp(−(x_p − x_q)^2 / (2 θ_4^2) − 2 sin^2(π (x_p − x_q)) / θ_5^2)
Multi-scale medium term irregularities:
k_3(x_p, x_q) = θ_6^2 (1 + (x_p − x_q)^2 / (2 θ_8 θ_7^2))^{−θ_8}
Correlated and i.i.d. noise:
k_4(x_p, x_q) = θ_9^2 exp(−(x_p − x_q)^2 / (2 θ_10^2)) + θ_11^2 δ_pq
k_total(x_p, x_q) = k_1(x_p, x_q) + k_2(x_p, x_q) + k_3(x_p, x_q) + k_4(x_p, x_q)
46 / 53
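
A sketch of the composite kernel in code (mine, not from the slides); the θ values below are placeholders standing in for hyperparameters that would be learned by maximizing the marginal likelihood:

% The composite CO2 kernel k_total as a function of (xp, xq); theta values are placeholders
theta = ones(1, 11);                                  % 11 hyperparameters, to be learned from data
r2 = @(xp, xq) (xp - xq).^2;
k1 = @(xp, xq) theta(1)^2 * exp(-r2(xp,xq) / (2*theta(2)^2));                        % long rising trend
k2 = @(xp, xq) theta(3)^2 * exp(-r2(xp,xq)/(2*theta(4)^2) ...
                                - 2*sin(pi*(xp - xq)).^2 / theta(5)^2);              % quasi-periodic seasonal
k3 = @(xp, xq) theta(6)^2 * (1 + r2(xp,xq)./(2*theta(8)*theta(7)^2)).^(-theta(8));   % medium-term irregularities
k4 = @(xp, xq) theta(9)^2 * exp(-r2(xp,xq)/(2*theta(10)^2)) + theta(11)^2*(xp == xq); % correlated + iid noise
k_total = @(xp, xq) k1(xp,xq) + k2(xp,xq) + k3(xp,xq) + k4(xp,xq);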

Worked Example: Combining Kernels, CO2 Data
We hand-crafted a kernel combination to perform extrapolation. Confidence in the extrapolation is high, which suggests that the model is well specified. We can interpret the learned kernel hyperparameters θ to learn information about our dataset. A lot of the interesting pattern recognition has been done by a human in this example; we would like to completely automate this modelling procedure.
47 / 53

Non-Gaussian Likelihoods
We can no longer analytically integrate away the Gaussian process, but we can use a simple Monte Carlo sum:
p(f_* | y, X, x_*) = ∫ p(f_* | f, x_*) p(f | y) df ≈ (1/J) Σ_{j=1}^J p(f_* | f^(j), x_*),    f^(j) ∼ p(f | y).
But how do we sample from p(f | y)?
48 / 53

Non-Gaussian Likelihoods
How do we sample from p(f | y)? Elliptical slice sampling (Murray et al., AISTATS 2010).
49 / 53

Non-Gaussian Likelihoods
But what about the hyperparameters? It is easy to implement Gibbs sampling:
p(f | y, θ) ∝ p(y | f) p(f | θ),    (56)
p(θ | f, y) ∝ p(f | θ) p(θ).    (57)
But this won't work, because of strong correlations between f and θ.
50 / 53

Non-Gaussian Likelihoods
Instead, transform into a whitened space, f = Lν, and sample ν and θ, which decouples the correlations.
51 / 53
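
A minimal sketch of the whitening idea (mine), assuming K is the kernel matrix for the current θ:

% Whitened parameterisation: instead of sampling f directly, sample nu ~ N(0, I)
% and set f = L*nu, where L is a Cholesky factor of K_theta (sketch).
L  = chol(K + 1e-6*eye(size(K,1)), 'lower');   % K_theta = L*L' (jitter for conditioning)
nu = randn(size(K,1), 1);                      % whitened variables, a priori independent of theta
f  = L * nu;                                   % implied latent function values, f ~ N(0, K_theta)
% A sampler can now update (nu, theta) instead of (f, theta), recomputing f = L*nu
% whenever theta (and hence K_theta and L) changes.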

Non-Gaussian Likelihoods
Alternatively, use a deterministic approach to approximately integrate away f, giving access to a marginal likelihood conditioned only on the kernel hyperparameters θ:
p(y | θ) = ∫ p(y | f) p(f | θ) df.    (62)
The Laplace approximation, for example, approximates p(f | y) as a Gaussian.
52 / 53

Readings for Next Time
C. Rasmussen and C. Williams, GPML, Ch. 4, 5.
Y. Saatchi, PhD Thesis, 2011, Chapter 5.
J. Quiñonero-Candela and C.E. Rasmussen, A unifying view of sparse approximate Gaussian process regression, JMLR 2005.
A.G. Wilson and R.P. Adams, Gaussian process kernels for pattern discovery and extrapolation, ICML 2013.
53 / 53