Bayesian Quadrature: Model-based Approximate Integration
David Duvenaud, University of Cambridge


The Quadrature Problem

- We want to estimate an integral Z = \int f(x)\, p(x)\, dx (a toy numerical example follows below).
- Most computational problems in inference correspond to integrals:
  - Expectations
  - Marginal distributions
  - Integrating out nuisance parameters
  - Normalization constants
  - Model comparison

[Figure: an example integrand f(x) and input density p(x).]
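As a point of reference for the slides below, here is a minimal sketch (not part of the talk) that fixes a toy integrand f(x) and a Gaussian input density p(x) and computes Z by brute-force one-dimensional quadrature; the particular f, the integration limits, and all names are illustrative assumptions.

```python
import numpy as np
from scipy import integrate, stats

def f(x):
    """A stand-in for a smooth, expensive-to-evaluate integrand."""
    return np.exp(-0.5 * x**2) * np.sin(3.0 * x) ** 2

p = stats.norm(loc=0.0, scale=1.0)   # input density p(x)

# Ground truth by brute-force one-dimensional quadrature, for later comparison.
Z_true, _ = integrate.quad(lambda x: f(x) * p.pdf(x), -10.0, 10.0)
print(f"Z (reference) = {Z_true:.6f}")
```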

Sampling Methods

- Monte Carlo methods: sample x_i ~ p(x) and take the empirical mean
  \hat{Z} = \frac{1}{N} \sum_{i=1}^{N} f(x_i)
  (a minimal sketch follows below).
- This is possibly sub-optimal for two reasons:
  - The samples bunch up at random.
  - Often, nearby function values are similar, so clustered samples are partly redundant.
- Model-based and quasi-Monte Carlo methods spread out samples to achieve faster convergence.

[Figure: the integrand f(x), the input density p(x), and a set of samples.]
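A minimal sketch of the plain Monte Carlo estimator above, assuming the same toy f(x) and standard-normal p(x) as in the previous sketch (re-declared here so the snippet stands alone); the sample sizes and number of repetitions are arbitrary.

```python
import numpy as np
from scipy import stats

def f(x):
    return np.exp(-0.5 * x**2) * np.sin(3.0 * x) ** 2   # same toy integrand as above

p = stats.norm(0.0, 1.0)                                 # input density p(x)
rng = np.random.default_rng(0)

def mc_estimate(n_samples):
    xs = p.rvs(size=n_samples, random_state=rng)         # x_i ~ p(x)
    return f(xs).mean()                                   # Z_hat = (1/N) * sum_i f(x_i)

# The spread across repeated runs shrinks roughly as O(1/sqrt(N)).
for n in (10, 100, 1000):
    estimates = np.array([mc_estimate(n) for _ in range(20)])
    print(f"N={n:5d}  mean={estimates.mean():.4f}  std across runs={estimates.std():.4f}")
```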

Model-based Integration

- Place a prior on f, for example a GP.
- The posterior over f then implies a posterior over Z.
- We'll call the approach that uses a GP prior Bayesian Quadrature.

[Figure: the integrand f(x), the density p(x), the GP posterior mean ± 2 SD given the samples, and the implied posterior p(Z).]

Bayesian Quadrature Estimator

- The posterior over Z has a mean that is linear in the observed values f(x_s):
  E_{GP}[Z \mid f(x_s)] = z^\top K^{-1} f(x_s) = \sum_{i=1}^{N} \left[z^\top K^{-1}\right]_i f(x_i),
  where z_n = \int k(x, x_n)\, p(x)\, dx and K is the kernel matrix on the sample locations.
- It is natural to choose sample locations to minimize the posterior variance of Z:
  V[Z \mid f(x_s)] = \iint k(x, x')\, p(x)\, p(x')\, dx\, dx' - z^\top K^{-1} z
- This variance doesn't depend on the function values at all!
- Choosing samples sequentially to minimize the variance is Sequential Bayesian Quadrature (a sketch follows below).
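A minimal one-dimensional sketch of the estimator and of sequential point selection, assuming a unit-amplitude squared-exponential kernel and a standard-normal p(x), for which z_n and the double integral have closed forms; the lengthscale, candidate grid, and toy integrand are illustrative choices rather than anything prescribed in the talk.

```python
import numpy as np

ell, s = 0.5, 1.0          # assumed kernel lengthscale and density std dev
jitter = 1e-10             # numerical stabiliser for the kernel matrix

def f(x):                  # toy integrand, as in the earlier sketches
    return np.exp(-0.5 * x**2) * np.sin(3.0 * x) ** 2

def k(a, b):               # squared-exponential kernel matrix
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def z_vec(xs):             # z_n = integral of k(x, x_n) N(x; 0, s^2) dx (closed form)
    return np.sqrt(ell**2 / (ell**2 + s**2)) * np.exp(-0.5 * xs**2 / (ell**2 + s**2))

# Double integral of k(x, x') p(x) p(x') for this kernel/density pair (closed form).
double_integral = np.sqrt(ell**2 / (ell**2 + 2.0 * s**2))

def bq_posterior(xs, fs):
    K = k(xs, xs) + jitter * np.eye(len(xs))
    z = z_vec(xs)
    w = np.linalg.solve(K, z)                  # BQ weights z^T K^{-1}
    return w @ fs, double_integral - z @ w     # posterior mean and variance of Z

def next_point(xs, candidates):
    # Sequential BQ: the variance ignores function values, so we can score
    # each candidate location before ever evaluating f there.
    variances = [bq_posterior(np.append(xs, c), np.zeros(len(xs) + 1))[1]
                 for c in candidates]
    return candidates[int(np.argmin(variances))]

xs = np.array([0.0])
candidates = np.linspace(-3.0, 3.0, 61)
for _ in range(9):
    xs = np.append(xs, next_point(xs, candidates))

mean, var = bq_posterior(xs, f(xs))
# Reference value by a dense Riemann sum, for comparison.
grid = np.linspace(-8.0, 8.0, 4001)
p_pdf = np.exp(-0.5 * grid**2 / s**2) / np.sqrt(2.0 * np.pi * s**2)
Z_ref = np.sum(f(grid) * p_pdf) * (grid[1] - grid[0])
print(f"BQ estimate = {mean:.4f} +/- {np.sqrt(var):.4f}   (reference ~ {Z_ref:.4f})")
```

Because the posterior variance ignores the function values, the candidate-scoring loop never calls f; that is what makes the sequential design step cheap relative to an expensive integrand.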

Things you can do with Bayesian Quadrature

- You can incorporate knowledge of the function, such as symmetries: if f(x, y) = f(y, x), use the symmetrized kernel
  k_s(x, y, x', y') = k(x, y, x', y') + k(x, y, y', x') + k(y, x, x', y') + k(y, x, y', x')
  (a sketch follows below).
- You can condition on gradient observations.
- The posterior variance is a natural convergence diagnostic.

[Figure: posterior over Z with error bars shrinking as the number of samples grows from 100 to 200.]
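A small sketch of the symmetry trick above, assuming an RBF base kernel on (x, y) pairs: summing the base kernel over both orderings of each argument yields a GP whose samples are symmetric, so every evaluation of f informs the model at both (x, y) and (y, x). The kernel choice and names are illustrative.

```python
import numpy as np

def k_base(a, b, ell=1.0):
    # RBF kernel on 2-D inputs a = (x, y), b = (x', y')
    return np.exp(-0.5 * np.sum((a - b) ** 2) / ell**2)

def k_sym(a, b, ell=1.0):
    # Sum the base kernel over both orderings of each argument pair.
    swap = lambda v: v[::-1]
    return (k_base(a, b, ell) + k_base(a, swap(b), ell)
            + k_base(swap(a), b, ell) + k_base(swap(a), swap(b), ell))

u = np.array([0.3, 1.2])
v = u[::-1]                                    # the same point with x and y swapped
print(np.isclose(k_sym(u, u), k_sym(u, v)))    # True: the prior treats them identically
```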

More things you can do with Bayesian Quadrature

- You can compute the likelihood of the GP and learn the kernel.
- You can compute marginals with error bars, in two ways:
  - simply from the GP posterior, or
  - by recomputing f_θ(x) for different θ with the same Z.
- Much nicer than histograms!

[Figure: estimated marginal likelihood over σ with error bars, with the true σ marked.]

Rates of Convergence

What is the rate of convergence of SBQ when its assumptions are true?

[Figure: log-log plots of expected variance / MMD versus number of samples, comparing i.i.d. sampling, herding with 1/N weights, herding with BQ weights, and SBQ with BQ weights, against O(1/√N) and O(1/N) reference slopes; successive panels show empirical rates inside the RKHS, empirical rates outside the RKHS, and a bound on the Bayesian error.]

GPs vs Log-GPs for Inference

[Figure: top panel: a likelihood function l(x) with the true function, evaluations, a GP posterior fit directly to l(x), and a log-GP posterior; bottom panel: log l(x) with the true log-function, evaluations, and a GP posterior on the log-likelihood.]

GPs vs Log-GPs

[Figure: a second example: l(x) with the true function, evaluations, GP posterior, and log-GP posterior (top); log l(x) with the true log-function, evaluations, and GP posterior (bottom).]

Integrating under Log-GPs

[Figure: the likelihood l(x) with the true function, evaluations, GP posterior, and log-GP posterior, followed by an approximation built from the mean of the GP and a set of inducing points; the lower panels show log l(x) with the true log-function, evaluations, GP posterior, and the inducing points.]

Conclusions

- Model-based integration allows active learning about integrals, can require fewer samples than MCMC, and allows us to check our assumptions.
- BQ has nice convergence properties if its assumptions are correct.
- For inference, a GP is not an especially appropriate model of the likelihood, but more appropriate models are intractable.

Limitations and Future Directions

- Right now, BQ really only works in low dimensions (< 10), when the function is fairly smooth, and it is only worth using when computing f(x) is expensive.
- How can we extend it to high dimensions? Gradient observations are helpful, but a D-dimensional gradient is D separate observations.
- It seems unlikely that we'll find another tractable nonparametric distribution like GPs. Should we accept that we'll need a second round of approximate integration on a surrogate model?
- How much overhead is worthwhile? Work on bounded rationality seems relevant.

Thanks!