Bayesian Quadrature: Model-based Approximate Integration
David Duvenaud, University of Cambridge
The Quadrature Problem

▶ We want to estimate an integral
    $Z = \int f(x)\, p(x)\, dx$
▶ Most computational problems in inference correspond to integrals:
    ▶ Expectations
    ▶ Marginal distributions
    ▶ Integrating out nuisance parameters
    ▶ Normalization constants
    ▶ Model comparison

[Figure: an example integrand f(x) and input density p(x)]
Sampling Methods

▶ Monte Carlo methods: sample $x_i \sim p(x)$, take the empirical mean:
    $\hat{Z} = \frac{1}{N} \sum_{i=1}^{N} f(x_i)$
▶ Possibly sub-optimal for two reasons:
    ▶ Random bunching up
    ▶ Often, nearby function values will be similar
▶ Model-based and quasi-Monte Carlo methods spread out samples to achieve faster convergence. (A minimal sketch of the plain Monte Carlo baseline follows below.)

[Figure: function f(x), input density p(x), and samples]
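To make the baseline concrete, here is a minimal sketch of the Monte Carlo estimator above. The integrand f and the choice p(x) = N(0, 1) are illustrative assumptions, not from the slides.

```python
import numpy as np

# Minimal Monte Carlo sketch: sample from p(x), average f at the samples.
# The integrand f and the density p(x) = N(0, 1) are illustrative choices.
def f(x):
    return np.exp(-x**2) * np.sin(3 * x) + 1.0  # arbitrary smooth integrand

rng = np.random.default_rng(0)
N = 1000
x = rng.normal(loc=0.0, scale=1.0, size=N)  # x_i ~ p(x) = N(0, 1)
Z_hat = f(x).mean()                         # estimates Z = ∫ f(x) p(x) dx
print(f"Monte Carlo estimate of Z: {Z_hat:.4f}")
```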
Model-based Integration

▶ Place a prior on f, for example, a GP.
▶ The posterior over f implies a posterior over Z. (A sketch of this idea follows below.)
▶ We'll call quadrature using a GP prior Bayesian Quadrature.

[Figure: posterior samples of f(x), the GP mean and GP mean ± 2SD, the input density p(x), and the implied posterior p(Z)]
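A minimal sketch of how a GP posterior over f induces a distribution over Z: draw posterior functions on a grid and integrate each one numerically against p(x). The kernel, lengthscale, integrand, and evaluation points below are all illustrative assumptions.

```python
import numpy as np

# Sketch: a GP posterior over f induces a posterior over Z = ∫ f(x) p(x) dx.
# We draw posterior functions on a grid and integrate each one numerically.
# Kernel, lengthscale, integrand, and design points are all assumptions.

def k(a, b, ell=0.5):  # squared-exponential kernel
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

f = lambda x: np.exp(-x**2) * np.sin(3 * x) + 1.0  # hypothetical integrand
xs = np.array([-2.0, -0.5, 0.3, 1.5])              # observed inputs
ys = f(xs)

grid = np.linspace(-4, 4, 400)
p = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)    # p(x) = N(0, 1)

K = k(xs, xs) + 1e-10 * np.eye(len(xs))            # jitter for stability
Ks = k(grid, xs)
mean = Ks @ np.linalg.solve(K, ys)                     # GP posterior mean
cov = k(grid, grid) - Ks @ np.linalg.solve(K, Ks.T)    # posterior covariance

rng = np.random.default_rng(0)
fs = rng.multivariate_normal(mean, cov + 1e-8 * np.eye(len(grid)), size=200)
Zs = (fs * p).sum(axis=1) * (grid[1] - grid[0])    # one Z per sampled function
print(f"p(Z | data): mean {Zs.mean():.3f}, sd {Zs.std():.3f}")
```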
Bayesian Quadrature Estimator

▶ The posterior over Z has a mean that is linear in the observed function values $f(\mathbf{x}_s)$:
    $\mathbb{E}_{\mathrm{GP}}[Z \mid f(\mathbf{x}_s)] = \sum_{i=1}^{N} \left[ \mathbf{z}^\top K^{-1} \right]_i f(x_i)$, where $z_n = \int k(x, x_n)\, p(x)\, dx$
▶ It is natural to minimize the posterior variance of Z:
    $\mathbb{V}[Z \mid f(\mathbf{x}_s)] = \iint k(x, x')\, p(x)\, p(x')\, dx\, dx' \, - \, \mathbf{z}^\top K^{-1} \mathbf{z}$
▶ The variance doesn't depend on the function values at all!
▶ Choosing samples sequentially to minimize variance: Sequential Bayesian Quadrature (SBQ). (A sketch of both follows below.)
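For a squared-exponential kernel with Gaussian p(x), both $z_n$ and the double integral above have closed forms, so the estimator fits in a few lines. The sketch below is a hedged illustration under those assumptions; the integrand and all settings are invented, and the SBQ step uses a naive grid search rather than a proper optimizer.

```python
import numpy as np

# BQ sketch for the standard closed-form case: squared-exponential kernel
# k(x, x') = exp(-(x - x')² / 2ℓ²) with p(x) = N(0, 1).
# The integrand and all settings below are illustrative assumptions.

ell = 0.5  # kernel lengthscale

def k(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def z(xs):
    # z_n = ∫ k(x, x_n) N(x; 0, 1) dx  (closed form for this kernel/density)
    return np.sqrt(ell**2 / (ell**2 + 1)) * np.exp(-xs**2 / (2 * (ell**2 + 1)))

zz = np.sqrt(ell**2 / (ell**2 + 2))  # ∬ k(x, x') p(x) p(x') dx dx'

def bq_var(xs):
    K = k(xs, xs) + 1e-8 * np.eye(len(xs))
    return zz - z(xs) @ np.linalg.solve(K, z(xs))  # needs no function values

f = lambda x: np.exp(-x**2) * np.sin(3 * x) + 1.0  # hypothetical integrand
xs = np.array([-2.0, -0.5, 0.3, 1.5])

K = k(xs, xs) + 1e-8 * np.eye(len(xs))
bq_mean = z(xs) @ np.linalg.solve(K, f(xs))        # E[Z|f(x_s)] = zᵀ K⁻¹ f(x_s)
print(f"BQ estimate: {bq_mean:.4f} ± {np.sqrt(bq_var(xs)):.4f}")

# Sequential BQ: because the variance ignores f, the next evaluation point
# can be chosen (here by naive grid search) before ever evaluating f there.
candidates = np.linspace(-3.0, 3.0, 120)
next_x = min(candidates, key=lambda c: bq_var(np.append(xs, c)))
print(f"SBQ would evaluate f next at x = {next_x:.2f}")
```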
Things you can do with Bayesian Quadrature

▶ Can incorporate knowledge of the function, such as symmetries: if $f(x, y) = f(y, x)$, use the symmetrized kernel
    $k_s([x, y], [x', y']) = k(x, y, x', y') + k(x, y, y', x') + k(y, x, x', y') + k(y, x, y', x')$
    (a sketch follows below)
▶ Can condition on gradients
▶ The posterior variance is a natural convergence diagnostic

[Figure: the posterior over Z narrowing as the number of samples grows from 100 to 200]
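Here is a hedged sketch of the symmetrization trick: summing a base kernel over all argument swaps yields a GP whose samples all satisfy f(x, y) = f(y, x). The base kernel and all names are assumptions.

```python
import numpy as np

# Sketch: encode the known symmetry f(x, y) = f(y, x) by summing a base
# kernel over argument swaps. Every sample from the resulting GP is
# symmetric. The base kernel choice here is an illustrative assumption.

def k_base(a, b, ell=1.0):
    # squared-exponential on R²; a is (N, 2), b is (M, 2)
    d2 = ((a[:, None, :] - b[None, :, :])**2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def k_sym(a, b):
    swap = lambda u: u[:, ::-1]  # (x, y) -> (y, x)
    return (k_base(a, b) + k_base(a, swap(b))
            + k_base(swap(a), b) + k_base(swap(a), swap(b)))

# The kernel cannot distinguish a point from its swap:
a = np.array([[0.3, 1.2]])
b = np.array([[1.2, 0.3]])
print(k_sym(a, a), k_sym(a, b))  # identical values
```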
More things you can do with Bayesian Quadrature

▶ Can compute the likelihood of the GP, and so learn the kernel (a sketch follows below)
▶ Can compute marginals with error bars, in two ways:
    ▶ simply from the GP posterior, or
    ▶ by recomputing $f_\theta(x)$ for different $\theta$ with the same $Z$
▶ Much nicer than histograms!

[Figure: the marginal likelihood of σ with error bars, compared against the true σ]
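Unlike standard Monte Carlo, BQ conditions on actual function values, so GP hyperparameters can be scored by marginal likelihood. A minimal sketch, assuming a squared-exponential kernel and scoring only the lengthscale on a grid:

```python
import numpy as np

# Sketch: learn the kernel by maximizing the GP marginal likelihood of the
# observed function values. Kernel form, grid, and integrand are assumptions.

def log_marginal_likelihood(xs, ys, ell):
    K = np.exp(-0.5 * (xs[:, None] - xs[None, :])**2 / ell**2)
    K += 1e-8 * np.eye(len(xs))                      # jitter for stability
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, ys))
    return (-0.5 * ys @ alpha                        # data fit
            - np.log(np.diag(L)).sum()               # complexity penalty
            - 0.5 * len(xs) * np.log(2 * np.pi))

f = lambda x: np.exp(-x**2) * np.sin(3 * x) + 1.0    # hypothetical integrand
xs = np.linspace(-2.0, 2.0, 9)
ys = f(xs)

ells = np.linspace(0.1, 2.0, 50)
best = ells[np.argmax([log_marginal_likelihood(xs, ys, e) for e in ells])]
print(f"lengthscale maximizing the GP marginal likelihood: {best:.2f}")
```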
Rates of Convergence

What is the rate of convergence of SBQ when its assumptions are true?

[Figure: expected variance / MMD versus number of samples on log-log axes, shown as empirical rates in the RKHS, empirical rates out of the RKHS, and a bound on the Bayesian error. Curves compared: i.i.d. sampling, herding with 1/N weights, herding with BQ weights, and SBQ with BQ weights, against $O(1/\sqrt{N})$ and $O(1/N)$ reference rates.]
GPs vs Log-GPs for Inference

[Figure, top panel: a likelihood function l(x), showing the true function, evaluations, the GP posterior, and the log-GP posterior. Bottom panel: log l(x), showing the true log-function, evaluations, and the GP posterior.]
GPs vs Log-GPs

[Figure: the same comparison on a second likelihood with larger dynamic range. Top panel: l(x) with the true function, evaluations, GP posterior, and log-GP posterior. Bottom panel: log l(x) with the true log-function, evaluations, and GP posterior. A sketch contrasting the two models follows below.]
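A minimal sketch of the contrast, using a toy peaked likelihood: a GP fit directly to l(x) is not constrained to be positive, while exponentiating a GP fit to log l(x) is positive by construction. All specifics below are illustrative assumptions.

```python
import numpy as np

# Sketch: fit a GP to l(x) directly vs. to log l(x) and exponentiate.
# The exponentiated log-GP mean is positive everywhere; the direct GP
# mean is not constrained to be. The toy likelihood is an assumption.

def gp_mean(xs, ys, grid, ell=0.7):
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    K = k(xs, xs) + 1e-8 * np.eye(len(xs))
    return k(grid, xs) @ np.linalg.solve(K, ys)

l = lambda x: 100 * np.exp(-2 * (x - 1)**2)         # hypothetical likelihood
xs = np.array([-2.0, -1.0, 0.0, 0.8, 1.2, 3.0])
grid = np.linspace(-3, 4, 200)

direct = gp_mean(xs, l(xs), grid)                   # GP on l(x)
via_log = np.exp(gp_mean(xs, np.log(l(xs)), grid))  # exp of GP on log l(x)
print(f"min of direct GP mean:   {direct.min():.3f}")
print(f"min of exp(log-GP mean): {via_log.min():.3e}")
```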
Integrating under Log-GPs

[Figure, top panel: l(x) with the true function, evaluations, the mean of the GP approximation to the log-GP, and inducing points. Bottom panel: log l(x) with the true log-function, evaluations, the GP posterior, and inducing points. A sketch of this kind of approximation follows below.]
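The integral of exp(GP) has no closed form; the figure suggests building a tractable surrogate of the exponentiated log-GP on a set of inducing points and integrating that instead. As a very rough stand-in for that machinery, the sketch below just integrates the exponentiated posterior mean of the log-GP numerically; the fuller method (e.g. linearizing exp(g) ≈ exp(m)(1 + g − m) around the mean m) is more careful, and everything here is an assumption.

```python
import numpy as np

# Very rough sketch: approximate Z = ∫ exp(g(x)) p(x) dx, with g a GP fit
# to log l(x), by evaluating exp of the posterior mean on a dense set of
# points and integrating numerically. This stands in for the fuller
# linearization / inducing-point machinery; all settings are assumptions.

def gp_mean(xs, ys, grid, ell=0.7):
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    K = k(xs, xs) + 1e-8 * np.eye(len(xs))
    return k(grid, xs) @ np.linalg.solve(K, ys)

l = lambda x: 100 * np.exp(-2 * (x - 1)**2)        # hypothetical likelihood
xs = np.array([-2.0, -1.0, 0.0, 0.8, 1.2, 3.0])

u = np.linspace(-3, 4, 300)                        # dense surrogate points
p = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)       # p(x) = N(0, 1)
m = gp_mean(xs, np.log(l(xs)), u)                  # log-GP posterior mean
Z_approx = (np.exp(m) * p).sum() * (u[1] - u[0])
print(f"Z ≈ {Z_approx:.3f}")
```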
Conclusions

▶ Model-based integration allows active learning about integrals, can require fewer samples than MCMC, and allows us to check our assumptions.
▶ BQ has nice convergence properties if its assumptions are correct.
▶ For inference, a plain GP on the likelihood is not especially appropriate, but other models are intractable.
Limitations and Future Directions

▶ Right now, BQ really only works in low dimensions (< 10), when the function is fairly smooth, and is only worth using when computing f(x) is expensive.
▶ How to extend to high dimensions? Gradient observations are helpful, but a D-dimensional gradient is D separate observations.
▶ It seems unlikely that we'll find another tractable nonparametric distribution like GPs. Should we accept that we'll need a second round of approximate integration on a surrogate model?
▶ How much overhead is worthwhile? Work on bounded rationality seems relevant.

Thanks!