Global Optimisation with Gaussian Processes. Michael A. Osborne, Machine Learning Research Group, Department of Engineering Science, University of Oxford

Global optimisation considers objective functions that are multi-modal and expensive to evaluate.

Optimisation is a decision problem: we must select evaluations to determine the minimum. Hence we need to specify a loss function and a probability distribution.

By defining a loss function (including the cost of observation), we can select evaluations optimally by minimising the expected loss. (Diagram: input x → objective function y(x) → output y.)

We define a loss function that is equal to the lowest function value found: the lower this value, the lower our loss. Assuming we have only one evaluation remaining, the loss of it returning value y, given that the current lowest value obtained is η, is λ(y) = min(y, η).

This loss function makes computing the expected loss simple: as such, we'll take a myopic approximation and only ever consider the next evaluation (rather than all possible future evaluations). Given all available information, the expected loss of evaluating at the next location is the expected lowest value of the function we've evaluated after the next evaluation.

We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss.
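
To make this concrete, here is a minimal sketch (not the talk's own code) of the myopic expected loss when the GP's prediction at a candidate input is y ~ N(m, s²) and η is the lowest value observed so far; it equals η minus the classic expected improvement.

# Minimal sketch: myopic expected loss E[min(y, eta)] under a Gaussian
# predictive distribution y ~ N(m, s^2), where eta is the current best value.
import numpy as np
from scipy.stats import norm

def expected_loss(m, s, eta):
    """Expected value of min(y, eta) for y ~ N(m, s**2)."""
    z = (eta - m) / s
    return eta + (m - eta) * norm.cdf(z) - s * norm.pdf(z)

# The next evaluation is the candidate x minimising expected_loss(m(x), s(x), eta),
# with m(x) and s(x) the GP posterior mean and standard deviation.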

We can use Gaussian processes for optimisation.

We have shifted an optimisation problem over an objective to an optimisation problem over an expected loss: the Gaussian process has served as a surrogate (or emulator) for the objective. (Diagram: input x → expected loss function λ(x) → output λ.)

This is useful because the expected loss is cheaper to evaluate than the objective, and admits gradient and Hessian observations to aid the optimisation. (Diagram: input x → expected loss function λ(x) → outputs λ, ∂iλ, ∂i∂jλ.)

Our Gaussian process is specified by hyperparameters λ and σ, giving expected length scales of the function in output and input spaces respectively.
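
For concreteness, a minimal sketch of one covariance consistent with this description (the talk's exact form is assumed, not stated): a squared-exponential covariance with output scale lam and input scale sigma.

# Minimal sketch (assumed form): squared-exponential covariance with
# output scale lam and input length scale sigma, for 1-D inputs.
import numpy as np

def sq_exp_cov(x1, x2, lam, sigma):
    """k(x, x') = lam**2 * exp(-(x - x')**2 / (2 * sigma**2))."""
    diff = np.subtract.outer(np.asarray(x1), np.asarray(x2))
    return lam**2 * np.exp(-0.5 * (diff / sigma) ** 2)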

Hyperparameter values can have a significant influence: below, prior A has a small input scale and prior C has a large input scale. In optimisation, amongst other things, hyperparameters specify a step size.

Management of hyperparameters is important for optimisation: we start with no data!

Management of hyperparameters is important (we'll come back to this!).

We can improve upon our myopic approximation by considering multiple function evaluations into the future using (expensive) sampling. (In the accompanying diagram, I_n is our total information up to index n, M is the index of our final decision, and all nodes along the dark line are correlated.)
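
A minimal Monte Carlo sketch of this idea (an assumed simplification, not the talk's exact scheme), using scikit-learn's GaussianProcessRegressor purely for illustration: fantasise the outcome of a first evaluation at x1, refit, and average the best myopic expected loss then achievable over a set of candidate second evaluations. It assumes a fitted gp and training data X, y; myopic_loss is the closed-form expected loss given earlier.

# Minimal Monte Carlo sketch (assumed simplification) of two-step lookahead:
# fantasise the first evaluation's outcome, refit the GP, and average the
# best myopic expected loss that would follow.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def myopic_loss(m, s, eta):
    z = (eta - m) / np.maximum(s, 1e-12)
    return eta + (m - eta) * norm.cdf(z) - s * norm.pdf(z)

def two_step_loss(gp, X, y, x1, candidates, n_fantasies=20, seed=0):
    rng = np.random.default_rng(seed)
    eta = y.min()
    m1, s1 = gp.predict(np.atleast_2d(x1), return_std=True)
    losses = []
    for _ in range(n_fantasies):
        y1 = rng.normal(m1[0], s1[0])                 # fantasised outcome at x1
        gp2 = GaussianProcessRegressor(kernel=RBF()).fit(
            np.vstack([X, np.atleast_2d(x1)]), np.append(y, y1))
        m2, s2 = gp2.predict(candidates, return_std=True)
        losses.append(myopic_loss(m2, s2, min(eta, y1)).min())
    return np.mean(losses)                            # lower is a better x1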

We tested on a range of problems: Ackley 2/5, Branin, 6-hump Camelback, Goldstein-Price, Griewank 2/5, Hartman 3/6, Rastrigin, Shekel 5/7/10, and Shubert.

Gaussian process global optimisation (GPGO) outperforms tested alternatives for global optimisation over 140 test problems. (Figure: optimum found after a set number of function evaluations, comparing the start point, RBF, DIRECT, GPGO (myopic and 2-step lookahead) and GPGO (Best) against the true global optimum.)

We can extend GPGO to optimise dynamic functions, by simply taking time as an additional input (that we have no control over).

We can use GPGO to select the optimal subset of sensor locations for weather prediction.

Consider functions of sets, such as the predictive quality of a set of sensors. In order to perform inference about such functions, we build a covariance function over sets.

To build such a covariance, we need the distance d(A, B) between sets A and B. (Figure: example pairs of sets whose distances are large, somewhat large, and small; for δ sufficiently large, d(A, B) is large.)

d(A, B) is the minimum total distance that nodes would have to be moved in order to render set A identical to set B, if nodes can be merged along the way.

d(A, B) is the minimum total distance that nodes would have to be moved in order to render set A identical to set B (the Earth mover's distance). Above, we assume equally weighted nodes (weights are indicated by the thickness of the outline).
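
As a sketch of the equally weighted, equal-size case only (an assumption; unequal weights need a full optimal-transport solver), the Earth mover's distance reduces to a minimum-cost matching of nodes:

# Minimal sketch: Earth mover's distance between two equally weighted sets of
# the same size, via a minimum-cost assignment of nodes in A to nodes in B.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

def emd_equal_sets(A, B):
    """A, B: (n, d) arrays of node locations; returns mean matched distance."""
    cost = cdist(A, B)                         # pairwise node distances
    rows, cols = linear_sum_assignment(cost)   # minimum-cost matching
    return cost[rows, cols].mean()             # distance moved per unit mass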

We also wish to determine weights that reflect the importance of sensors to the overall network (more isolated nodes are more important).

The mean prediction at a point x∗, given observations y_A from set A, is m(x∗) = K(x∗, A) K(A, A)⁻¹ y_A. The term K(x∗, A) K(A, A)⁻¹ weights the observations from A: we use it as a measure of the importance of the nodes. Note that we integrate over x∗ and normalise.

We sequentially selected subsets of 5 sensors (from 50 around the UK) so as to optimise the quality of our predictions (Q) about air temperature from 1959 to 1984. (Diagram: a loop of observe → evaluate Q → GPGO.)

(Figure panels:) A) All sensors. B) Sensors selected by the greedy mutual information (MI) procedure; MI is not adaptive to data! C) Sensors we selected in 1969. D) Sensors we selected in 1979.

We could extend GPGO to additionally find the optimal number of set elements by defining the cost of additional elements.

Gaussian distributed variables are jointly Gaussian with any affine transform of them.

A function over which we have a Gaussian process is jointly Gaussian with any integral or derivative of it, as integration and differentiation are affine.

We can modify covariance functions to manage derivative or integral observations. For a derivative observation at x_i and a function observation at x_j: K^D(x_i, x_j) = ∂/∂x K(x, x_j), evaluated at x = x_i. For a derivative observation at x_i and a derivative observation at x_j: K^{D,D}(x_i, x_j) = ∂²/(∂x ∂x') K(x, x'), evaluated at x = x_i, x' = x_j.

We can modify the squared exponential covariance to manage derivative observations.
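
A minimal 1-D sketch (assuming the squared-exponential form with output scale lam and input scale sigma introduced earlier) of the resulting covariances:

# Minimal 1-D sketch: squared-exponential covariance and its derivative
# versions, matching the expressions above.
import numpy as np

def k(x, xp, lam, sigma):
    """Function observation at x, function observation at xp."""
    r = x - xp
    return lam**2 * np.exp(-0.5 * (r / sigma) ** 2)

def k_d(x, xp, lam, sigma):
    """Derivative observation at x, function observation at xp: d/dx k."""
    r = x - xp
    return -(r / sigma**2) * k(x, xp, lam, sigma)

def k_dd(x, xp, lam, sigma):
    """Derivative observations at both x and xp: d^2/(dx dxp) k."""
    r = x - xp
    return (1.0 / sigma**2 - r**2 / sigma**4) * k(x, xp, lam, sigma)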

The performance of Gaussian process global optimisation can be improved by including observations of the gradient of a function.

Conditioning becomes an issue when we have multiple close observations, giving rows in the covariance matrix that are very similar. For example, the first two rows of

    [ 1       0.9999   0     0
      0.9999  1        0     0
      0       0        1     0.1
      0       0        0.1   1 ]

are too similar. Conditioning is a particular problem in optimisation, where we often take many samples around a local minimum to exactly determine it.

The usual solution to conditioning problems is to add a small positive quantity (jitter) to the diagonal of the covariance matrix:

    [ 1.01    0.9999   0      0
      0.9999  1.01     0      0
      0       0        1.01   0.1
      0       0        0.1    1.01 ]

whose rows are now sufficiently dissimilar. As jitter is effectively imposed noise, adding jitter to all diagonal elements dilutes the informativeness of our data.
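
A minimal sketch of this fix (the default jitter magnitude is an assumption; in practice it is scaled to the problem):

# Minimal sketch: add a small positive jitter to the diagonal to improve
# the conditioning of a covariance matrix.
import numpy as np

def add_jitter(K, jitter=1e-6):
    """Return a better-conditioned copy of the covariance matrix K."""
    return K + jitter * np.eye(K.shape[0])

K = np.array([[1.0,    0.9999, 0.0, 0.0],
              [0.9999, 1.0,    0.0, 0.0],
              [0.0,    0.0,    1.0, 0.1],
              [0.0,    0.0,    0.1, 1.0]])
print(np.linalg.cond(K), np.linalg.cond(add_jitter(K, 1e-2)))  # conditioning improves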

We can readily incorporate observations of the derivative into the Gaussian process used in GPGO. This also gives us a way to resolve conditioning issues (that result from the very close observations that are taken during optimisation).

Likewise, we can use observations of an integrand in order to perform inference for its integral, Z: this is known as Bayesian Quadrature.
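
A minimal sketch of the idea under common assumptions not specified in the talk (a squared-exponential GP prior on the integrand and a Gaussian weighting N(x; b, v)): the Bayesian quadrature mean estimate of Z is z^T K^{-1} f, where z_i is the covariance integrated against the weighting.

# Minimal sketch (assumed setup): Bayesian quadrature for
# Z = ∫ f(x) N(x; b, v) dx, with a squared-exponential GP prior on f.
import numpy as np

def bq_estimate(x, f, lam, sigma, b, v, jitter=1e-8):
    """x: (n,) sample locations; f: (n,) integrand values at x."""
    diff = np.subtract.outer(x, x)
    K = lam**2 * np.exp(-0.5 * (diff / sigma) ** 2) + jitter * np.eye(len(x))
    # z_i = ∫ k(x, x_i) N(x; b, v) dx has a closed form for this kernel:
    z = lam**2 * np.sqrt(sigma**2 / (sigma**2 + v)) * \
        np.exp(-0.5 * (x - b) ** 2 / (sigma**2 + v))
    return z @ np.linalg.solve(K, f)   # posterior mean estimate of Z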

Modelling data is problematic due to unknown parameters, such as the parameters of a polynomial model.

We could minimise errors (or maximise likelihood) to find a 'best-fit', at the risk of overfitting.

The Bayesian approach is to average over many possible parameters, requiring integration.

Inference requires integrating (or marginalising) over the many possible states of the world consistent with our data, which is often non-analytic: the probability P(y) is the area under the curve of the probability density p(y, ϑ) as a function of ϑ, i.e. P(y) = ∫ p(y, ϑ) dϑ.
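
A tiny worked sketch of such a marginalisation, using a hypothetical joint density chosen purely for illustration:

# Tiny worked sketch (hypothetical joint density, for illustration only):
# marginalising P(y) = ∫ p(y, theta) d(theta) by simple numerical quadrature.
import numpy as np
from scipy.stats import norm

theta = np.linspace(-10.0, 10.0, 2001)
y = 1.3                                             # an arbitrary observed value
# assumed joint: theta ~ N(0, 1) prior, y | theta ~ N(theta, 0.5**2) likelihood
p_joint = norm.pdf(theta, loc=0.0, scale=1.0) * norm.pdf(y, loc=theta, scale=0.5)
P_y = np.sum(p_joint) * (theta[1] - theta[0])       # area under the curve
print(P_y, norm.pdf(y, loc=0.0, scale=np.sqrt(1.0 + 0.5**2)))  # should agree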

There are many different approaches to quadrature (numerical integration) for probabilistic integrals; integrand estimation is usually undervalued. (The following figures plot a log-likelihood integrand against its parameter.)

Optimisation (as in maximum likelihood), particularly using global optimisers, gives a reasonable heuristic for exploring the integrand.

However, maximum likelihood is an unreasonable way of estimating a multi-modal or broad likelihood integrand: why throw away all those other samples?

In particular, maximum likelihood is inadequate for the hyperparameters of a Gaussian process used for optimisation. Approximating the posterior over hyperparameters (above right) as a delta function results in insufficient exploration of the objective function!

Monte Carlo schemes give even more powerful methods of exploration, which have revolutionised Bayesian inference.

These methods force us to be uncertain about our action, which is unlikely to result in actions that minimise a sensible expected loss.

Monte Carlo schemes give a fairly reasonable method of exploration, but an unreasonable means of integrand estimation.

Bayesian quadrature (aka Bayesian Monte Carlo) gives a powerful method for estimating the integrand: a Gaussian process.

For such an integral, Bayesian quadrature achieves more accurate results than Monte Carlo, and provides an estimate of our uncertainty.

Doubly-Bayesian quadrature (BBQ) additionally explores the integrand so as to minimise the uncertainty about the integral. (Figure: integrand samples plotted against sample number.)

Bayesian optimisation is a means of optimising expensive, multi-modal objective functions; Bayesian quadrature is a means of integrating over expensive, multi-modal integrands.

Thanks! I would also like to thank my collaborators: David Duvenaud, Roman Garnett, Zoubin Ghahramani, Carl Rasmussen, and Stephen Roberts. References: http://www.robots.ox.ac.uk/~mosb