Bayesian Quadrature: Model-based Approximate Integration
David Duvenaud, University of Cambridge
The Quadrature Problem

▶ We want to estimate an integral
    $Z = \int f(x)\, p(x)\, dx$
▶ Most computational problems in inference correspond to integrals:
    ▶ Expectations
    ▶ Marginal distributions
    ▶ Integrating out nuisance parameters
    ▶ Normalization constants
    ▶ Model comparison

[Figure: an example integrand f(x) and input density p(x)]
Sampling Methods

▶ Monte Carlo methods: sample $x_i \sim p(x)$, take the empirical mean:
    $\hat{Z} = \frac{1}{N} \sum_{i=1}^{N} f(x_i)$
▶ Possibly sub-optimal for two reasons:
    ▶ Random bunching up
    ▶ Often, nearby function values will be similar
▶ Model-based and quasi-Monte Carlo methods spread out samples to achieve faster convergence. (A minimal sketch of the plain Monte Carlo baseline follows below.)

[Figure: function f(x), input density p(x), and samples]
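To make the baseline concrete, here is a minimal sketch of the Monte Carlo estimator above. The integrand f and the choice p(x) = N(0, 1) are illustrative assumptions, not from the slides.

```python
import numpy as np

# Minimal Monte Carlo sketch: sample from p(x), average f at the samples.
# The integrand f and the density p(x) = N(0, 1) are illustrative choices.
def f(x):
    return np.exp(-x**2) * np.sin(3 * x) + 1.0  # arbitrary smooth integrand

rng = np.random.default_rng(0)
N = 1000
x = rng.normal(loc=0.0, scale=1.0, size=N)  # x_i ~ p(x) = N(0, 1)
Z_hat = f(x).mean()                         # estimates Z = ∫ f(x) p(x) dx
print(f"Monte Carlo estimate of Z: {Z_hat:.4f}")
```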
Model-based Integration

▶ Place a prior on f, for example, a GP.
▶ The posterior over f implies a posterior over Z. (A sketch of this idea follows below.)
▶ We'll call quadrature using a GP prior Bayesian Quadrature.

[Figure: posterior samples of f(x), the GP mean and GP mean ± 2SD, the input density p(x), and the implied posterior p(Z)]
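A minimal sketch of how a GP posterior over f induces a distribution over Z: draw posterior functions on a grid and integrate each one numerically against p(x). The kernel, lengthscale, integrand, and evaluation points below are all illustrative assumptions.

```python
import numpy as np

# Sketch: a GP posterior over f induces a posterior over Z = ∫ f(x) p(x) dx.
# We draw posterior functions on a grid and integrate each one numerically.
# Kernel, lengthscale, integrand, and design points are all assumptions.

def k(a, b, ell=0.5):  # squared-exponential kernel
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

f = lambda x: np.exp(-x**2) * np.sin(3 * x) + 1.0  # hypothetical integrand
xs = np.array([-2.0, -0.5, 0.3, 1.5])              # observed inputs
ys = f(xs)

grid = np.linspace(-4, 4, 400)
p = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)    # p(x) = N(0, 1)

K = k(xs, xs) + 1e-10 * np.eye(len(xs))            # jitter for stability
Ks = k(grid, xs)
mean = Ks @ np.linalg.solve(K, ys)                     # GP posterior mean
cov = k(grid, grid) - Ks @ np.linalg.solve(K, Ks.T)    # posterior covariance

rng = np.random.default_rng(0)
fs = rng.multivariate_normal(mean, cov + 1e-8 * np.eye(len(grid)), size=200)
Zs = (fs * p).sum(axis=1) * (grid[1] - grid[0])    # one Z per sampled function
print(f"p(Z | data): mean {Zs.mean():.3f}, sd {Zs.std():.3f}")
```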
Bayesian Quadrature Estimator

▶ The posterior over Z has a mean that is linear in the observed function values $f(\mathbf{x}_s)$:
    $\mathbb{E}_{\mathrm{GP}}[Z \mid f(\mathbf{x}_s)] = \sum_{i=1}^{N} \left[ \mathbf{z}^\top K^{-1} \right]_i f(x_i)$, where $z_n = \int k(x, x_n)\, p(x)\, dx$
▶ It is natural to minimize the posterior variance of Z:
    $\mathbb{V}[Z \mid f(\mathbf{x}_s)] = \iint k(x, x')\, p(x)\, p(x')\, dx\, dx' \, - \, \mathbf{z}^\top K^{-1} \mathbf{z}$
▶ The variance doesn't depend on the function values at all!
▶ Choosing samples sequentially to minimize variance: Sequential Bayesian Quadrature (SBQ). (A sketch of both follows below.)
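For a squared-exponential kernel with Gaussian p(x), both $z_n$ and the double integral above have closed forms, so the estimator fits in a few lines. The sketch below is a hedged illustration under those assumptions; the integrand and all settings are invented, and the SBQ step uses a naive grid search rather than a proper optimizer.

```python
import numpy as np

# BQ sketch for the standard closed-form case: squared-exponential kernel
# k(x, x') = exp(-(x - x')² / 2ℓ²) with p(x) = N(0, 1).
# The integrand and all settings below are illustrative assumptions.

ell = 0.5  # kernel lengthscale

def k(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def z(xs):
    # z_n = ∫ k(x, x_n) N(x; 0, 1) dx  (closed form for this kernel/density)
    return np.sqrt(ell**2 / (ell**2 + 1)) * np.exp(-xs**2 / (2 * (ell**2 + 1)))

zz = np.sqrt(ell**2 / (ell**2 + 2))  # ∬ k(x, x') p(x) p(x') dx dx'

def bq_var(xs):
    K = k(xs, xs) + 1e-8 * np.eye(len(xs))
    return zz - z(xs) @ np.linalg.solve(K, z(xs))  # needs no function values

f = lambda x: np.exp(-x**2) * np.sin(3 * x) + 1.0  # hypothetical integrand
xs = np.array([-2.0, -0.5, 0.3, 1.5])

K = k(xs, xs) + 1e-8 * np.eye(len(xs))
bq_mean = z(xs) @ np.linalg.solve(K, f(xs))        # E[Z|f(x_s)] = zᵀ K⁻¹ f(x_s)
print(f"BQ estimate: {bq_mean:.4f} ± {np.sqrt(bq_var(xs)):.4f}")

# Sequential BQ: because the variance ignores f, the next evaluation point
# can be chosen (here by naive grid search) before ever evaluating f there.
candidates = np.linspace(-3.0, 3.0, 120)
next_x = min(candidates, key=lambda c: bq_var(np.append(xs, c)))
print(f"SBQ would evaluate f next at x = {next_x:.2f}")
```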
Things you can do with Bayesian Quadrature

▶ Can incorporate knowledge of the function, such as symmetries: if $f(x, y) = f(y, x)$, use the symmetrized kernel
    $k_s([x, y], [x', y']) = k(x, y, x', y') + k(x, y, y', x') + k(y, x, x', y') + k(y, x, y', x')$
    (a sketch follows below)
▶ Can condition on gradients
▶ The posterior variance is a natural convergence diagnostic

[Figure: the posterior over Z narrowing as the number of samples grows from 100 to 200]
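Here is a hedged sketch of the symmetrization trick: summing a base kernel over all argument swaps yields a GP whose samples all satisfy f(x, y) = f(y, x). The base kernel and all names are assumptions.

```python
import numpy as np

# Sketch: encode the known symmetry f(x, y) = f(y, x) by summing a base
# kernel over argument swaps. Every sample from the resulting GP is
# symmetric. The base kernel choice here is an illustrative assumption.

def k_base(a, b, ell=1.0):
    # squared-exponential on R²; a is (N, 2), b is (M, 2)
    d2 = ((a[:, None, :] - b[None, :, :])**2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

def k_sym(a, b):
    swap = lambda u: u[:, ::-1]  # (x, y) -> (y, x)
    return (k_base(a, b) + k_base(a, swap(b))
            + k_base(swap(a), b) + k_base(swap(a), swap(b)))

# The kernel cannot distinguish a point from its swap:
a = np.array([[0.3, 1.2]])
b = np.array([[1.2, 0.3]])
print(k_sym(a, a), k_sym(a, b))  # identical values
```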
More things you can do with Bayesian Quadrature

▶ Can compute the likelihood of the GP, and so learn the kernel (a sketch follows below)
▶ Can compute marginals with error bars, in two ways:
    ▶ simply from the GP posterior, or
    ▶ by recomputing $f_\theta(x)$ for different $\theta$ with the same $Z$
▶ Much nicer than histograms!

[Figure: the marginal likelihood of σ with error bars, compared against the true σ]
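Unlike standard Monte Carlo, BQ conditions on actual function values, so GP hyperparameters can be scored by marginal likelihood. A minimal sketch, assuming a squared-exponential kernel and scoring only the lengthscale on a grid:

```python
import numpy as np

# Sketch: learn the kernel by maximizing the GP marginal likelihood of the
# observed function values. Kernel form, grid, and integrand are assumptions.

def log_marginal_likelihood(xs, ys, ell):
    K = np.exp(-0.5 * (xs[:, None] - xs[None, :])**2 / ell**2)
    K += 1e-8 * np.eye(len(xs))                      # jitter for stability
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, ys))
    return (-0.5 * ys @ alpha                        # data fit
            - np.log(np.diag(L)).sum()               # complexity penalty
            - 0.5 * len(xs) * np.log(2 * np.pi))

f = lambda x: np.exp(-x**2) * np.sin(3 * x) + 1.0    # hypothetical integrand
xs = np.linspace(-2.0, 2.0, 9)
ys = f(xs)

ells = np.linspace(0.1, 2.0, 50)
best = ells[np.argmax([log_marginal_likelihood(xs, ys, e) for e in ells])]
print(f"lengthscale maximizing the GP marginal likelihood: {best:.2f}")
```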
Rates of Convergence

What is the rate of convergence of SBQ when its assumptions are true?

[Figure: expected variance / MMD versus number of samples on log-log axes, shown as empirical rates in the RKHS, empirical rates out of the RKHS, and a bound on the Bayesian error. Curves compared: i.i.d. sampling, herding with 1/N weights, herding with BQ weights, and SBQ with BQ weights, against $O(1/\sqrt{N})$ and $O(1/N)$ reference rates.]
GPs vs Log-GPs for Inference

[Figure, top panel: a likelihood function l(x), showing the true function, evaluations, the GP posterior, and the log-GP posterior. Bottom panel: log l(x), showing the true log-function, evaluations, and the GP posterior.]
GPs vs Log-GPs

[Figure: the same comparison on a second likelihood with larger dynamic range. Top panel: l(x) with the true function, evaluations, GP posterior, and log-GP posterior. Bottom panel: log l(x) with the true log-function, evaluations, and GP posterior. A sketch contrasting the two models follows below.]
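A minimal sketch of the contrast, using a toy peaked likelihood: a GP fit directly to l(x) is not constrained to be positive, while exponentiating a GP fit to log l(x) is positive by construction. All specifics below are illustrative assumptions.

```python
import numpy as np

# Sketch: fit a GP to l(x) directly vs. to log l(x) and exponentiate.
# The exponentiated log-GP mean is positive everywhere; the direct GP
# mean is not constrained to be. The toy likelihood is an assumption.

def gp_mean(xs, ys, grid, ell=0.7):
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    K = k(xs, xs) + 1e-8 * np.eye(len(xs))
    return k(grid, xs) @ np.linalg.solve(K, ys)

l = lambda x: 100 * np.exp(-2 * (x - 1)**2)         # hypothetical likelihood
xs = np.array([-2.0, -1.0, 0.0, 0.8, 1.2, 3.0])
grid = np.linspace(-3, 4, 200)

direct = gp_mean(xs, l(xs), grid)                   # GP on l(x)
via_log = np.exp(gp_mean(xs, np.log(l(xs)), grid))  # exp of GP on log l(x)
print(f"min of direct GP mean:   {direct.min():.3f}")
print(f"min of exp(log-GP mean): {via_log.min():.3e}")
```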
Integrating under Log-GPs

[Figure, top panel: l(x) with the true function, evaluations, the mean of the GP approximation to the log-GP, and inducing points. Bottom panel: log l(x) with the true log-function, evaluations, the GP posterior, and inducing points. A sketch of this kind of approximation follows below.]
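The integral of exp(GP) has no closed form; the figure suggests building a tractable surrogate of the exponentiated log-GP on a set of inducing points and integrating that instead. As a very rough stand-in for that machinery, the sketch below just integrates the exponentiated posterior mean of the log-GP numerically; the fuller method (e.g. linearizing exp(g) ≈ exp(m)(1 + g − m) around the mean m) is more careful, and everything here is an assumption.

```python
import numpy as np

# Very rough sketch: approximate Z = ∫ exp(g(x)) p(x) dx, with g a GP fit
# to log l(x), by evaluating exp of the posterior mean on a dense set of
# points and integrating numerically. This stands in for the fuller
# linearization / inducing-point machinery; all settings are assumptions.

def gp_mean(xs, ys, grid, ell=0.7):
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    K = k(xs, xs) + 1e-8 * np.eye(len(xs))
    return k(grid, xs) @ np.linalg.solve(K, ys)

l = lambda x: 100 * np.exp(-2 * (x - 1)**2)        # hypothetical likelihood
xs = np.array([-2.0, -1.0, 0.0, 0.8, 1.2, 3.0])

u = np.linspace(-3, 4, 300)                        # dense surrogate points
p = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)       # p(x) = N(0, 1)
m = gp_mean(xs, np.log(l(xs)), u)                  # log-GP posterior mean
Z_approx = (np.exp(m) * p).sum() * (u[1] - u[0])
print(f"Z ≈ {Z_approx:.3f}")
```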
Conclusions

▶ Model-based integration allows active learning about integrals, can require fewer samples than MCMC, and allows us to check our assumptions.
▶ BQ has nice convergence properties if its assumptions are correct.
▶ For inference, a plain GP on the likelihood is not especially appropriate, but other models are intractable.
Limitations and Future Directions

▶ Right now, BQ really only works in low dimensions (< 10), when the function is fairly smooth, and is only worth using when computing f(x) is expensive.
▶ How to extend to high dimensions? Gradient observations are helpful, but a D-dimensional gradient is D separate observations.
▶ It seems unlikely that we'll find another tractable nonparametric distribution like GPs. Should we accept that we'll need a second round of approximate integration on a surrogate model?
▶ How much overhead is worthwhile? Work on bounded rationality seems relevant.

Thanks!