Model Selection for Gaussian Processes


Institute for Adaptive and Neural Computation, School of Informatics, University of Edinburgh, UK, December 2006

Outline
- GP basics
- Model selection: covariance functions and parameterizations
- Criteria for model selection
- Marginal likelihood
- Cross-validation
- Examples
Thanks to Carl Rasmussen (book co-author)

Features
- Bayesian learning
- Higher-level regularization
- Predictive uncertainty
- Cross-validation
- Ensemble learning
- Search strategies (no more grid search)
- Feature selection
- More than 2 levels of inference

Gaussian Process Basics
- For a stochastic process f(x), the mean function is µ(x) = E[f(x)]. Assume µ(x) ≡ 0 for all x
- The covariance function is k(x, x') = E[f(x) f(x')]
- Priors over function-space can be defined directly by choosing a covariance function, e.g. E[f(x) f(x')] = exp(-w |x - x'|)
- Gaussian processes are stochastic processes defined by their mean and covariance functions

Gaussian Process Prediction
- A Gaussian process places a prior over functions
- Observe data D = (x_i, y_i), i = 1, ..., n, and obtain a posterior distribution
    p(f|D) ∝ p(f) p(D|f)
    posterior ∝ prior × likelihood
- For a Gaussian likelihood (regression), predictions can be made exactly via matrix computations
- For classification, we need approximations (or MCMC)

GP Regression
Prediction at x_* is N(f(x_*), var(x_*)), with
  f(x_*) = Σ_{i=1}^n α_i k(x_*, x_i),   where α = (K + σ^2 I)^{-1} y,
and
  var(x_*) = k(x_*, x_*) - k(x_*)^T (K + σ^2 I)^{-1} k(x_*),   with k(x_*) = (k(x_*, x_1), ..., k(x_*, x_n))^T
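To make these equations concrete, here is a minimal NumPy sketch (not from the slides) of the predictive mean and variance under an isotropic squared-exponential covariance; the names k_se and gp_predict, the toy data, and the hyperparameter values are illustrative.

```python
import numpy as np

def k_se(A, B, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential covariance between inputs A (n x D) and B (m x D)."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return signal_var * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_predict(X, y, X_star, noise_var=0.1, **kern_args):
    """GP regression predictive mean and variance at test inputs X_star."""
    K = k_se(X, X, **kern_args) + noise_var * np.eye(len(X))  # K + sigma^2 I
    k_star = k_se(X, X_star, **kern_args)                     # columns are k(x_*)
    alpha = np.linalg.solve(K, y)                             # alpha = (K + sigma^2 I)^{-1} y
    mean = k_star.T @ alpha                                   # f(x_*) = sum_i alpha_i k(x_*, x_i)
    v = np.linalg.solve(K, k_star)
    var = np.diag(k_se(X_star, X_star, **kern_args)) - np.sum(k_star * v, axis=0)
    return mean, var

# Toy usage
X = np.linspace(-3, 3, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)
mean, var = gp_predict(X, y, np.array([[0.5]]))
```

In practice the linear solves would be done via a Cholesky factorization of K + σ^2 I rather than repeated calls to solve.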

GP classification
- Response function π(x) = r(f(x)), e.g. logistic or probit
- For these choices the log likelihood is concave, so there is a unique maximum
- Use MAP or EP for inference (unconstrained optimization, c.f. SVMs)
- For GPR and GPC, some consistency results are available for non-degenerate kernels

Model Selection
- The covariance function often has some free parameters θ
- We can choose different families of covariance functions, e.g.
    k_SE(x_p, x_q) = σ_f^2 exp(-½ (x_p - x_q)^T M (x_p - x_q)) + σ_n^2 δ_pq,
    k_NN(x_p, x_q) = σ_f^2 sin^{-1}( 2 x̃_p^T M x̃_q / sqrt((1 + 2 x̃_p^T M x̃_p)(1 + 2 x̃_q^T M x̃_q)) ) + σ_n^2 δ_pq,
  where x̃ = (1, x_1, ..., x_D)^T is the augmented input vector

Automatic Relevance Determination
  k_SE(x_p, x_q) = σ_f^2 exp(-½ (x_p - x_q)^T M (x_p - x_q)) + σ_n^2 δ_pq
- Isotropic: M = l^{-2} I
- ARD: M = diag(l_1^{-2}, l_2^{-2}, ..., l_D^{-2})
[Figure: sample functions for the isotropic and ARD parameterizations, plotted as output y over inputs x1 and x2]
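As a small illustration (not from the slides), the SE covariance with a general M can be coded directly; the lengthscale values below are made up, and a large l_d effectively switches off input dimension d.

```python
import numpy as np

def k_se(xp, xq, M, signal_var=1.0):
    """k_SE(x_p, x_q) = sigma_f^2 exp(-0.5 (x_p - x_q)^T M (x_p - x_q)); noise term omitted."""
    d = xp - xq
    return signal_var * np.exp(-0.5 * d @ M @ d)

D = 3
ell = np.array([0.5, 2.0, 50.0])      # per-dimension lengthscales (illustrative)
M_iso = np.eye(D) / 0.5**2            # isotropic: M = l^{-2} I
M_ard = np.diag(1.0 / ell**2)         # ARD: M = diag(l_1^{-2}, ..., l_D^{-2})

x1, x2 = np.array([1.0, 0.3, -2.0]), np.array([0.8, 0.5, 3.0])
print(k_se(x1, x2, M_iso), k_se(x1, x2, M_ard))  # ARD barely penalizes the third (long-lengthscale) input
```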

Further modelling flexibility
- We can combine covariance functions to make things more general
- Example: functional ANOVA, e.g.
    k(x, x') = Σ_{i=1}^D k_i(x_i, x_i') + Σ_{i=2}^D Σ_{j=1}^{i-1} k_ij(x_i, x_j; x_i', x_j')
- Non-linear warping of the input space (Sampson and Guttorp, 1992)

The Baby and the Bathwater
- MacKay (2003, ch. 45): in moving from neural networks to kernel machines, did we throw out the baby with the bathwater? I.e. the ability to learn hidden features/representations
- But consider M = ΛΛ^T + diag(l)^{-2}, with Λ being D × k, for k < D
- The k columns of Λ can identify directions in the input space with specially high relevance (Vivarelli and Williams, 1999)

Understanding the prior
We can analyze and draw samples from the prior, e.g. for
  k(x, x') = exp(-|x - x'| / l)   and   k(x, x') = exp(-|x - x'|^2 / (2 l^2)),
with l = 0.1.
[Figure: sample functions drawn from the GP prior under each of these two covariance functions]
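For instance, prior samples can be drawn by evaluating the covariance matrix on a grid of inputs and sampling from the corresponding multivariate Gaussian; this is an illustrative sketch, not taken from the slides.

```python
import numpy as np

def k_exp(X, l=0.1):
    """k(x, x') = exp(-|x - x'| / l), the exponential (Ornstein-Uhlenbeck) covariance."""
    return np.exp(-np.abs(X[:, None] - X[None, :]) / l)

def k_sqexp(X, l=0.1):
    """k(x, x') = exp(-|x - x'|^2 / (2 l^2)), the squared-exponential covariance."""
    return np.exp(-0.5 * (X[:, None] - X[None, :])**2 / l**2)

x = np.linspace(0, 1, 200)
rng = np.random.default_rng(0)
draws = {}
for name, k in {"exponential": k_exp, "squared-exponential": k_sqexp}.items():
    K = k(x) + 1e-9 * np.eye(len(x))   # small jitter for numerical stability
    draws[name] = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
# draws["exponential"] gives rough, non-differentiable sample paths;
# draws["squared-exponential"] gives very smooth ones.
```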

[Figure: samples from the neural network covariance function for several values of σ, plotted as output f(x) against input x]

Criteria for Model Selection
- Bayesian marginal likelihood ("evidence") p(D|model)
- Estimate the generalization error (e.g. cross-validation)
- Bound the generalization error (e.g. PAC-Bayes)

Bayesian Model Selection
- For GPR, we can compute the marginal likelihood p(y|X, θ, M) exactly (integrating out f); for GPC it can be approximated using the Laplace approximation or EP
- Can also use MCMC to sample
    p(M_i | y, X) ∝ p(y|X, M_i) p(M_i),
  where
    p(y|X, M_i) = ∫ p(y|X, θ_i, M_i) p(θ_i|M_i) dθ_i

Type-II Maximum Likelihood
- Type-II maximum likelihood maximizes the marginal likelihood p(y|X, θ_i, M_i) rather than integrating over θ_i
- p(y|X, θ_i, M_i) is differentiable w.r.t. θ: no more grid search!
    ∂/∂θ_j log p(y|X, θ) = ½ y^T K^{-1} (∂K/∂θ_j) K^{-1} y - ½ trace(K^{-1} ∂K/∂θ_j)
- This was in Williams and Rasmussen (NIPS*95)
- Can also use MAP estimation or MCMC in θ-space, see e.g. Williams and Barber (1998)
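A hedged sketch of type-II maximum likelihood for a 1-D SE kernel, using the gradient expression above with log-transformed hyperparameters θ = (log l, log σ_f, log σ_n) and scipy.optimize; the parameterization and toy data are illustrative, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marglik(theta, X, y):
    """Negative log marginal likelihood and its gradient for an SE kernel plus noise."""
    ell, sf, sn = np.exp(theta)
    n = len(y)
    sq = (X[:, None] - X[None, :])**2
    Kf = sf**2 * np.exp(-0.5 * sq / ell**2)
    K = Kf + sn**2 * np.eye(n)                  # K_y = K_f + sigma_n^2 I
    Ki = np.linalg.inv(K)                       # fine for a sketch; use Cholesky in practice
    alpha = Ki @ y
    nll = 0.5 * y @ alpha + 0.5 * np.linalg.slogdet(K)[1] + 0.5 * n * np.log(2 * np.pi)
    # dK/dtheta_j for each log-hyperparameter
    dK = [Kf * sq / ell**2, 2 * Kf, 2 * sn**2 * np.eye(n)]
    A = np.outer(alpha, alpha) - Ki             # alpha alpha^T - K^{-1}
    grad = np.array([-0.5 * np.trace(A @ dKj) for dKj in dK])
    return nll, grad

X = np.linspace(-3, 3, 40)
y = np.sin(X) + 0.1 * np.random.randn(40)
res = minimize(neg_log_marglik, x0=np.zeros(3), args=(X, y), jac=True, method='L-BFGS-B')
# res.x holds the learned log-hyperparameters; no grid search required.
```

In practice a Cholesky factorization replaces the explicit inverse, and several random restarts guard against the local optima discussed below.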

[Figure: (left) marginal likelihood p(y|X, H_i) over all possible data sets y, for simple, intermediate and complex models; (right) log probability against characteristic lengthscale, showing the data fit, minus the complexity penalty, and the resulting marginal likelihood]
The marginal likelihood automatically incorporates a trade-off between model fit and model complexity:
  log p(y|X, θ_i, M_i) = -½ y^T K_y^{-1} y - ½ log |K_y| - (n/2) log 2π,
where K_y = K + σ_n^2 I.

Marginal Likelihood and Local Optima
[Figure: two GP regression fits to the same data (output y against input x), corresponding to different optima of the marginal likelihood]
- There can be multiple optima of the marginal likelihood
- These correspond to different interpretations of the data

Cross-validation for GPR
The leave-one-out (LOO) log predictive density is
  log p(y_i | X, y_{-i}, θ) = -½ log(2πσ_i^2) - (y_i - µ_i)^2 / (2σ_i^2),
  L_LOO(X, y, θ) = Σ_{i=1}^n log p(y_i | X, y_{-i}, θ)
- Leave-one-out predictions can be made efficiently (e.g. Wahba, 1990; Sundararajan and Keerthi, 2001)
- We can also compute the derivatives ∂L_LOO/∂θ_j
- LOO with squared error ignores the predictive variances, and does not determine the overall scale of the covariance function
- For GPC, LOO computations are trickier, but the cavity method (Opper and Winther, 2000) seems to be effective
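The efficient LOO computations follow from the identities µ_i = y_i - [K_y^{-1} y]_i / [K_y^{-1}]_ii and σ_i^2 = 1 / [K_y^{-1}]_ii (Rasmussen and Williams, ch. 5), so all n leave-one-out predictions come from a single matrix inverse. A minimal sketch, assuming a precomputed noise-free covariance matrix K:

```python
import numpy as np

def loo_log_pred_density(K, y, noise_var):
    """L_LOO for GP regression with covariance matrix K and noise variance sigma_n^2."""
    Ky_inv = np.linalg.inv(K + noise_var * np.eye(len(y)))
    alpha = Ky_inv @ y
    diag = np.diag(Ky_inv)
    mu_i = y - alpha / diag                # LOO predictive means
    var_i = 1.0 / diag                     # LOO predictive variances
    logp = -0.5 * np.log(2 * np.pi * var_i) - (y - mu_i)**2 / (2 * var_i)
    return np.sum(logp)                    # L_LOO(X, y, theta)
```

For squared error alone the σ_i^2 terms drop out of the criterion, which is why LOO with squared error cannot fix the overall scale of the covariance function, as noted above.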

Comparing marginal likelihood and LOO-CV
  L = Σ_{i=1}^n log p(y_i | {y_j : j < i}, θ)
  L_LOO = Σ_{i=1}^n log p(y_i | {y_j : j ≠ i}, θ)
- The marginal likelihood tells us the probability of the data given the assumptions of the model
- LOO-CV gives an estimate of the predictive log probability, whether or not the model assumptions are fulfilled
- CV procedures should be more robust against model mis-specification (e.g. Wahba, 1990)

Example 1: Mauna Loa CO2
[Figure: atmospheric CO2 concentration (ppm) at Mauna Loa, plotted against year]
- Fit this data with a sum of four covariance functions, modelling (i) the smooth trend, (ii) a periodic component (with some decay), (iii) medium-term irregularities, and (iv) noise
- Optimize and compare models using the marginal likelihood
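A rough sketch of what such a four-part covariance might look like (the functional forms mirror the decomposition above, with x measured in years; the function name k_co2 and the placeholder hyperparameters are illustrative and would in practice be set by maximizing the marginal likelihood):

```python
import numpy as np

def k_co2(x, xp, theta):
    """Sum of (i) smooth trend, (ii) decaying periodic, (iii) medium-term (rational quadratic) and (iv) noise terms."""
    r = x - xp
    k1 = theta['sf1']**2 * np.exp(-r**2 / (2 * theta['l1']**2))                      # (i) long-term smooth trend
    k2 = theta['sf2']**2 * np.exp(-r**2 / (2 * theta['l2']**2)
                                  - 2 * np.sin(np.pi * r)**2 / theta['lp']**2)       # (ii) 1-year periodicity, decaying
    k3 = theta['sf3']**2 * (1 + r**2 / (2 * theta['a'] * theta['l3']**2))**(-theta['a'])     # (iii) medium-term irregularities
    k4 = theta['sf4']**2 * np.exp(-r**2 / (2 * theta['l4']**2)) + theta['sn']**2 * (r == 0)  # (iv) correlated + white noise
    return k1 + k2 + k3 + k4

# Placeholder hyperparameters (to be learned), inputs in years
theta = dict(sf1=50.0, l1=50.0, sf2=2.0, l2=100.0, lp=1.0,
             sf3=0.5, a=1.0, l3=1.0, sf4=0.2, l4=0.1, sn=0.2)
print(k_co2(1990.0, 1990.5, theta))
```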

Example 2: Robot Arm Inverse Dynamics
- 44,484 training and 4,449 test examples, in 21 dimensions
- Map from the positions, velocities and accelerations of 7 joints to torque
- Use the SE (Gaussian) covariance function with ARD: 23 hyperparameters, optimizing the marginal likelihood or L_LOO
- Similar accuracy for both, in terms of SMSE and mean standardized log loss, but marginal likelihood optimization is quicker

Multi-task Learning
[Figure: graphical models relating hyperparameters across multiple data sets D_1, D_2, D_3]
Minka and Picard (1999)

Summary
- Bayesian learning
- Higher-level regularization
- Predictive uncertainty
- Cross-validation
- Ensemble learning
- Search strategies (no more grid search)
- Feature selection
- More than 2 levels of inference
Model selection is much more than just setting parameters in a covariance function.

Relationship to alignment
For y with +1/-1 elements, the kernel-target alignment is
  A(K, y) = y^T K y / (n ‖K‖_F),
so that
  log A(K, y) = log(y^T K y) - log(n ‖K‖_F).
Compare this with the Laplace approximation to the GPC log marginal likelihood,
  log q(y|K) = -½ f̂^T K^{-1} f̂ + log p(y|f̂) - ½ log |B|,
where B = I + W^{1/2} K W^{1/2}, f̂ is the maximum of the posterior found by Newton's method, and W is the diagonal matrix W = -∇∇ log p(y|f̂).
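A small sketch (assumed, not from the slides) of the alignment score as defined above:

```python
import numpy as np

def alignment(K, y):
    """A(K, y) = y^T K y / (n ||K||_F), for labels y in {+1, -1}."""
    n = len(y)
    return (y @ K @ y) / (n * np.linalg.norm(K, 'fro'))
```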