Big model configuration with Bayesian quadrature. David Duvenaud, Roman Garnett, Tom Gunter, Philipp Hennig, Michael A. Osborne and Stephen Roberts.

This talk will develop Bayesian quadrature approaches to building ensembles of models for big and complex data.

For big and complex data, it is difficult to pick the right model parameters.

Modelling data requires fitting parameters, such as the a and b of y = ax + b.

The performance of non-parametric models is sensitive to the selection of hyperparameters.

Evaluating the quality of model fit (or the likelihood) is expensive for big data.

Assuming correlated errors, the computational cost will be super-linear in the number of data points.
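
To make that cost concrete, here is a minimal Python sketch (not code from the talk; all names are illustrative) of a Gaussian process log marginal likelihood with a squared-exponential covariance: correlated errors mean factorising a dense n-by-n covariance matrix, an O(n^3) operation.

    import numpy as np

    def sq_exp_kernel(X, Xp, output_scale=1.0, input_scale=1.0):
        # Squared-exponential covariance between two sets of 1-D inputs.
        d = X[:, None] - Xp[None, :]
        return output_scale**2 * np.exp(-0.5 * (d / input_scale)**2)

    def gp_log_likelihood(X, y, noise=1e-2, **kernel_args):
        # Correlated errors: the covariance of all n observations is a dense n x n matrix.
        n = len(X)
        K = sq_exp_kernel(X, X, **kernel_args) + noise**2 * np.eye(n)
        L = np.linalg.cholesky(K)  # O(n^3): the dominant cost for big data
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2 * np.pi)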

Complex models and real data often lead to multi-modal likelihood functions.

Optimisation (as in maximum likelihood or least squares) gives a reasonable heuristic for exploring the likelihood.

The naïve fitting of models to data performed by maximum likelihood can lead to overfitting.

Bayesian averaging over ensembles of models reduces overfitting, and provides more honest estimates of uncertainty.

Averaging requires integrating over the many possible states of the world consistent with data: this is often non-analytic.

There are many different approaches to quadrature (numerical integration); integrand estimation is undervalued by most.

Maximum likelihood is an unreasonable way of estimating a multi-modal or broad likelihood integrand.

Monte Carlo schemes give powerful methods of exploration that have revolutionised Bayesian inference.

Monte Carlo schemes give a reasonable method of exploration, but an unsound means of integrand estimation.

Monte Carlo schemes have some other potential issues: some parameters must be hand-tuned; convergence diagnostics are often unreliable.

Bayesian quadrature provides optimal ensembles of models for big data.

Bayesian quadrature gives a powerful method for estimating the integrand: a Gaussian process.

Gaussian-distributed variables are jointly Gaussian with any affine transform of them.

A function over which we have a Gaussian process is jointly Gaussian with any integral or derivative of it, since integration and differentiation are linear operators.

We can use observations of an integrand ℓ to perform inference for its integral, Z: this is known as Bayesian quadrature.
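
As a sketch of the standard construction (the prior p(x) over the parameter x is made explicit here; it is implicit on the slide), a Gaussian process on the integrand induces a Gaussian posterior on the integral:

    Z = \int \ell(x)\, p(x)\, \mathrm{d}x,
    \qquad \ell \sim \mathcal{GP}(m, K)
    \;\Longrightarrow\;
    \mathbb{E}[Z \mid \mathcal{D}] = \int m_{\mathcal{D}}(x)\, p(x)\, \mathrm{d}x,
    \quad
    \mathbb{V}[Z \mid \mathcal{D}] = \iint K_{\mathcal{D}}(x, x')\, p(x)\, p(x')\, \mathrm{d}x\, \mathrm{d}x',

where m_D and K_D are the posterior mean and covariance of the Gaussian process conditioned on the evaluations D = {(x_i, ℓ(x_i))}; both integrals are analytic for common kernel and prior pairs.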

Bayesian quadrature makes the best use of our evaluations of model fit, which is important for big data, where such evaluations are expensive. [Figure: likelihood (model fit), samples, posterior mean, posterior variance, and expected area, plotted against the parameter.]

Consider a simple example integral: Bayesian quadrature achieves more accurate results than Monte Carlo, and provides an estimate of our uncertainty.
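
A minimal Python sketch of this estimator, assuming a one-dimensional Gaussian prior N(mu, tau^2) and a squared-exponential covariance, for which the required kernel integrals have closed forms (the function is illustrative, not the talk's implementation):

    import numpy as np

    def bayesian_quadrature_1d(x, ell, mu=0.0, tau=1.0,
                               out_scale=1.0, in_scale=1.0, jitter=1e-10):
        """Posterior mean and variance of Z = int ell(x) N(x; mu, tau^2) dx,
        given noise-free evaluations ell at locations x, under a zero-mean GP
        prior on ell with a squared-exponential covariance."""
        x, ell = np.asarray(x, float), np.asarray(ell, float)
        d = x[:, None] - x[None, :]
        K = out_scale**2 * np.exp(-0.5 * (d / in_scale)**2) + jitter * np.eye(len(x))
        # Closed-form integrals of the kernel against the Gaussian prior:
        s2 = in_scale**2 + tau**2
        z = out_scale**2 * np.sqrt(in_scale**2 / s2) * np.exp(-0.5 * (x - mu)**2 / s2)
        zz = out_scale**2 * np.sqrt(in_scale**2 / (in_scale**2 + 2 * tau**2))
        Kinv_ell = np.linalg.solve(K, ell)
        mean = z @ Kinv_ell                   # posterior mean of the integral
        var = zz - z @ np.linalg.solve(K, z)  # posterior variance of the integral
        return mean, var

    # Example: integrate exp(-x^2) against a standard normal prior.
    xs = np.linspace(-3, 3, 15)
    est_mean, est_var = bayesian_quadrature_1d(xs, np.exp(-xs**2))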

Bayesian quadrature promises solutions to some of the issues of Monte Carlo: hyper-parameters can be automatically set by maximum likelihood; the variance in the integral is a natural convergence diagnostic.

We use a Laplace approximation to marginalise the hyperparameters of the Gaussian process model.

We really want to use a Gaussian process to model the log-likelihood, rather than the likelihood. Doing so better captures the dynamic range of likelihoods, and extends the correlation range.

Using a Gaussian process for the log-likelihood means that the distribution for the integral of the likelihood is no longer analytic.

We could linearise the likelihood as a function of the log-likelihood. This renders the likelihood and log-likelihood jointly part of one Gaussian process (along with integrals).

However, this linearisation is typically poor for the extreme log transform. Rather than the log, we model the square root of the likelihood (WSABI), which is more amenable to linearisation. [Figure: the χ² process ℓ(x), alongside moment-matched (WSABI-M) and linearised (WSABI-L) approximations with 95% confidence intervals; the WSABI-L mean is nearly identical to the ground truth.]
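
A sketch of the warping, following the WSABI construction: write the likelihood as an offset plus half the square of a warped function, place the Gaussian process on the warped function, and linearise about its posterior mean:

    \ell(x) = \alpha + \tfrac{1}{2}\,\tilde{\ell}(x)^2,
    \qquad \tilde{\ell} \sim \mathcal{GP}(\tilde{m}, \tilde{C}),

    \ell(x) \approx \alpha + \tfrac{1}{2}\,\tilde{m}(x)^2
           + \tilde{m}(x)\,\bigl(\tilde{\ell}(x) - \tilde{m}(x)\bigr),

so the linearised (WSABI-L) approximation to the likelihood is again a Gaussian process, with mean \alpha + \tfrac{1}{2}\tilde{m}(x)^2 and covariance \tilde{m}(x)\,\tilde{C}(x,x')\,\tilde{m}(x'), and its integral against the prior is once more analytic.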

Doubly-Bayesian quadrature (BBQ) additionally explores the integrand so as to minimise the uncertainty about the integral.

Doubly-Bayesian quadrature (BBQ) selects efficient samples, but the computation of the expected reduction in integral variance is extremely costly.

Our linearisation implies we are more uncertain about large likelihoods than small likelihoods. Hence selecting samples with large variance promotes both exploration and exploitation.
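
A minimal Python sketch of this uncertainty-sampling rule (tilde_mean, tilde_var and prior_pdf are placeholder callables for the warped GP posterior and the prior, not functions from the talk): under the linearisation, the variance of the approximate integrand at x scales as tilde_m(x)^2 * tilde_C(x, x) * p(x)^2, and we pick the candidate that maximises it.

    import numpy as np

    def next_sample(candidates, tilde_mean, tilde_var, prior_pdf):
        """Uncertainty sampling for linearised (WSABI-L style) Bayesian quadrature."""
        m = tilde_mean(candidates)   # posterior mean of the warped (square-root) GP
        v = tilde_var(candidates)    # posterior variance of the warped GP
        # Variance of the linearised likelihood, weighted by the squared prior:
        # large values indicate either large likelihood (exploitation) or
        # large uncertainty (exploration).
        acquisition = m**2 * v * prior_pdf(candidates)**2
        return candidates[np.argmax(acquisition)]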

Our method (Warped Sequential Active Bayesian Integration, WSABI) converges quickly in wall-clock time for a synthetic integrand. [Figure: fractional error versus ground truth against time in seconds, comparing WSABI-L and SMC, each ± 1 standard error.]

WSABI-L converges more quickly than annealed importance sampling in integrating out eight hyperparameters in a Gaussian process regression problem (yacht data). [Figure: log Z against time in seconds for WSABI-L, WSABI-M, SMC, AIS and BMC, with the ground truth.]

WSABI-L converges quickly in integrating out hyperparameters in a Gaussian process classification problem (CiteSeerX data). [Figure: log Z against time in seconds for WSABI-L, WSABI-M, SMC, AIS and BMC, with the ground truth.]

Active Bayesian quadrature gives optimal averaging over models for big and complex data.

Bayesian quadrature is an example of probabilistic numerics: the study of numerical methods as learning algorithms.

With Bayesian quadrature, we can also estimate integrals to compute posterior distributions for any hyperparameters.

Complex data with anomalies, changepoints and faults demands model averaging.

In considering data with changepoints and faults, we must entertain multiple hypotheses using Bayesian quadrature.

Changepoint covariances feature hyperparameters (the pre-changepoint input scale, the changepoint location, and the post-changepoint input scale), for which we can produce posterior distributions using quadrature. Changepoint detection requires the posterior for the changepoint location hyperparameter.
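
As an illustrative sketch only (not necessarily the exact covariance used here), a drastic-changepoint covariance can be assembled from two squared-exponential kernels with separate input scales and zero correlation across the changepoint location; its three hyperparameters are then exactly those listed above.

    import numpy as np

    def sq_exp(x, xp, input_scale):
        return np.exp(-0.5 * ((x[:, None] - xp[None, :]) / input_scale)**2)

    def changepoint_cov(x, xp, scale_pre, scale_post, x_c):
        """Squared-exponential with scale_pre before the changepoint x_c,
        scale_post after it, and zero covariance across the changepoint."""
        x, xp = np.asarray(x, float), np.asarray(xp, float)
        before = (x[:, None] < x_c) & (xp[None, :] < x_c)
        after = (x[:, None] >= x_c) & (xp[None, :] >= x_c)
        K = np.zeros((len(x), len(xp)))
        K[before] = sq_exp(x, xp, scale_pre)[before]
        K[after] = sq_exp(x, xp, scale_post)[after]
        return K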

We can perform both prediction and changepoint detection using Bayesian quadrature.

We can build covariances to accommodate faults, a common challenge in sensor networks.

We use algorithms capable of spotting hidden patterns and anomalies in on-line data.

We identify the OPEC embargo in Oct 1973 and the resignation of Nixon in Aug 1974.

We apply the same algorithms, capable of spotting hidden patterns and anomalies in on-line data, to Nile flood levels.

Our algorithm detects a possible change in measurement noise in AD 715.

We detect the Nilometer built in AD 715.

Saccades (sudden eye movements) introduce spurious peaks into EEG data.

We can perform honest prediction for this complex signal during saccade anomalies.

Wannengrat hosts a remote weather sensor network used for climate change science, for which observations are costly.

Our algorithm acquires more data during interesting volatile periods.

Bayesian quadrature has enabled changepoint detection through efficient model averaging.

Global optimisation considers objective functions that are multi-modal and expensive to evaluate.

By defining the costs of observation and uncertainty, we can select evaluations of the objective function y(x) optimally by minimising the expected loss with respect to a probability distribution.

We choose a Gaussian process as the probability distribution for the objective function, giving a tractable expected loss.
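
A sketch of that expected loss, under the common assumption that the loss after the next evaluation is the value of the best (lowest) observation so far, \eta: with the Gaussian process posterior y(x) ~ N(m(x), s(x)^2) at a candidate input,

    \Lambda(x) = \mathbb{E}\bigl[\min\{\eta,\, y(x)\}\bigr]
               = \eta
               - \bigl(\eta - m(x)\bigr)\,\Phi\!\left(\frac{\eta - m(x)}{s(x)}\right)
               - s(x)\,\phi\!\left(\frac{\eta - m(x)}{s(x)}\right),

where \Phi and \phi are the standard normal CDF and PDF; minimising \Lambda over candidate inputs is equivalent to maximising the familiar expected improvement.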

Our Gaussian process is specified by hyperparameters λ and σ, giving expected length scales of the function in output and input spaces respectively.

Management of hyperparameters is important for optimisation: we start with no data!
