Kernel Sequential Monte Carlo


Kernel Sequential Monte Carlo
Ingmar Schuster (Paris Dauphine), Heiko Strathmann (University College London), Brooks Paige (Oxford), Dino Sejdinovic (Oxford); * equal contribution
April 25, 2016

Section 1: Outline

1. Introduction: Importance Sampling, PMC and SMC; intractable likelihoods; kernel emulators
2. Kernel SMC
3. Implementation Details
4. Evaluation
5. Conclusion

Section 2: Introduction

Importance Sampling, PMC and SMC

Importance Sampling estimators. The Importance Sampling identity:

$$H = \int \pi(x)\, h(x)\,\mathrm{d}x = \int \frac{\pi(x)}{q(x)}\, h(x)\, q(x)\,\mathrm{d}x \approx \frac{1}{w_\Sigma} \sum_{i=1}^{N} w(X_i)\, h(X_i)$$

where $X_i \sim q$ iid, $w(x) = \pi(x)/q(x)$ is called the unnormalized importance weight, and $w_\Sigma = \sum_{i=1}^{N} w(X_i)$.

PMC identity: for any law $g$ over proposals,

$$H = \iint \frac{\pi(x)}{q_t(x)}\, h(x)\,\mathrm{d}q_t(x)\,\mathrm{d}g(q_t) \approx \frac{1}{w_\Sigma} \sum_{t=1}^{T} \sum_{i=1}^{N} w_t(X_i)\, h(X_i)$$
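The self-normalized estimator above can be sketched in a few lines; the toy target, proposal, and sample size below are illustrative choices, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def snis_estimate(h, log_pi, sample_q, log_q, n=10_000):
    """Self-normalized importance sampling estimate of E_pi[h(X)]."""
    x = sample_q(n)                        # X_i ~ q, iid
    log_w = log_pi(x) - log_q(x)           # unnormalized log-weights log(pi/q)
    w = np.exp(log_w - log_w.max())        # stabilized; constants cancel below
    return np.sum(w * h(x)) / np.sum(w)    # (1/w_sigma) * sum_i w_i h(X_i)

# toy check: target N(0,1), fatter proposal N(0, 2^2), estimate E[X^2] = 1
est = snis_estimate(
    h=lambda x: x**2,
    log_pi=lambda x: -0.5 * x**2,              # unnormalized N(0,1)
    sample_q=lambda n: rng.normal(0.0, 2.0, n),
    log_q=lambda x: -0.5 * (x / 2.0) ** 2,     # unnormalized N(0,4)
)
```

Both densities are left unnormalized: the normalizing constants cancel in the self-normalized ratio, which is what makes the identity usable with an unnormalized $\pi$.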

Proposal fatter than target. [figure: $\pi(x)$, $q(x)$, and $w(x) = \pi(x)/q(x)$ plotted over $x \in [-4, 4]$]

Proposal thinner than target. [figure: $\pi(x)$, $q(x)$, and $w(x) = \pi(x)/q(x)$ plotted over $x \in [-4, 4]$]

Population Monte Carlo (Cappé et al., 2004)
Input: initial proposal density $q_0$, unnormalized density $\pi$, population size $N$, sample size $m$
Output: lists $P$, $W$ of $m$ samples and weights

Initialize $P$ and $W$ as empty lists
while $\mathrm{len}(P) < m$ do
  construct proposal distribution $q_t$
  generate a set $X_t$ of $N$ samples from $q_t$ and append it to $P$
  for each $X \in X_t$, append the weight $\pi(X)/q_t(X)$ to $W$
end while
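The loop above can be sketched as follows; the adaptation rule (recentre a Gaussian proposal on the weighted population mean and standard deviation) is one simple illustrative choice, not the specific scheme of Cappé et al. (2004).

```python
import numpy as np

rng = np.random.default_rng(1)

def pmc(log_pi, m, n_per_iter=200, sigma0=2.0):
    """Minimal Population Monte Carlo sketch with a 1D Gaussian proposal."""
    P, logW = [], []
    mu, sigma = 0.0, sigma0                       # q_0 = N(0, sigma0^2)
    while len(P) < m:
        x = rng.normal(mu, sigma, n_per_iter)     # X_t ~ q_t
        log_q = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma)
        log_w = log_pi(x) - log_q                 # unnormalized weights pi/q_t
        P.extend(x)
        logW.extend(log_w)
        # adapt q_{t+1} from the weighted population
        w = np.exp(log_w - log_w.max())
        wn = w / w.sum()
        mu = float(np.sum(wn * x))
        sigma = max(float(np.sqrt(np.sum(wn * (x - mu) ** 2))), 1e-3)
    logW = np.array(logW)
    return np.array(P), np.exp(logW - logW.max())

# toy target: unnormalized N(3, 1)
P, W = pmc(lambda x: -0.5 * (x - 3.0) ** 2, m=2000)
mean_est = np.sum(W * P) / np.sum(W)
```

Note that log-weights are stored and exponentiated with a single global shift at the end, so weights from different iterations (different proposals) remain comparable.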

Sequential Monte Carlo Samplers

Approximate integrals with respect to a target distribution $\pi_T$. Build upon Importance Sampling: approximate the integral of $h$ wrt density $\pi_T$ using samples following density $q$ (under certain conditions):

$$\int h(x)\,\mathrm{d}\pi_T(x) = \int h(x)\, \frac{\pi_T(x)}{q(x)}\,\mathrm{d}q(x)$$

Given a prior $\pi_0$, build a sequence $\pi_0, \ldots, \pi_i, \ldots, \pi_T$ such that $\pi_{i+1}$ is closer to $\pi_T$ than $\pi_i$ is ($\delta(\pi_{i+1}, \pi_T) < \delta(\pi_i, \pi_T)$ for some divergence $\delta$). A sample from $\pi_i$ can then approximate $\pi_{i+1}$ well using the importance weight function $w(\cdot) = \pi_{i+1}(\cdot)/\pi_i(\cdot)$.

Sequential Monte Carlo Samplers

At $i = 0$:
- using proposal density $q_0$, generate particles $\{(w_{0,j}, X_{0,j})\}_{j=1}^N$ where $w_{0,j} = \pi_0(X_{0,j})/q_0(X_{0,j})$
- importance resampling, resulting in $N$ equally weighted particles $\{(1/N, X_{0,j})\}_{j=1}^N$
- rejuvenation move for each $X_{0,j}$ by a Markov kernel leaving $\pi_0$ invariant

At $i > 0$:
- approximate $\pi_i$ by $\{(\pi_i(X_{i-1,j})/\pi_{i-1}(X_{i-1,j}),\ X_{i-1,j})\}_{j=1}^N$
- resampling
- rejuvenation leaving $\pi_i$ invariant
- if $\pi_i \neq \pi_T$, repeat
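The reweight/resample/rejuvenate cycle can be sketched end to end; this uses a geometric bridge between a normalized Gaussian prior and an unnormalized Gaussian target, multinomial resampling at every step, and a single random-walk MH rejuvenation move per particle (all illustrative simplifications, not KASMC itself).

```python
import numpy as np

rng = np.random.default_rng(2)

def smc_sampler(log_pi0, sample_pi0, log_piT, n=500, n_steps=10, mh_scale=1.0):
    """Minimal SMC sampler on the bridge pi_i ∝ pi_0^(1-rho_i) pi_T^(rho_i)."""
    rhos = np.linspace(0.0, 1.0, n_steps + 1)[1:]
    x = sample_pi0(n)
    log_target = lambda z, r: (1 - r) * log_pi0(z) + r * log_piT(z)
    log_Z, prev_rho = 0.0, 0.0
    for rho in rhos:
        # reweight from pi_{i-1} to pi_i
        log_w = log_target(x, rho) - log_target(x, prev_rho)
        c = log_w.max()
        log_Z += c + np.log(np.mean(np.exp(log_w - c)))   # evidence increment
        w = np.exp(log_w - c)
        w /= w.sum()
        # multinomial resampling proportional to weights
        x = x[rng.choice(n, size=n, p=w)]
        # one random-walk MH rejuvenation move leaving pi_i invariant
        prop = x + mh_scale * rng.normal(size=n)
        accept = np.log(rng.uniform(size=n)) < log_target(prop, rho) - log_target(x, rho)
        x = np.where(accept, prop, x)
        prev_rho = rho
    return x, log_Z

# bridge from the normalized prior N(0, 3^2) to the unnormalized target exp(-(z-2)^2/2)
x, log_Z = smc_sampler(
    log_pi0=lambda z: -0.5 * (z / 3.0) ** 2 - np.log(3.0) - 0.5 * np.log(2 * np.pi),
    sample_pi0=lambda n: rng.normal(0.0, 3.0, n),
    log_piT=lambda z: -0.5 * (z - 2.0) ** 2,
)
```

Since the prior is normalized, `log_Z` estimates $\log Z_T = \log\sqrt{2\pi} \approx 0.919$, and the final particles approximate $\mathcal{N}(2, 1)$.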

A visual SMC iteration. [figure: target $\pi_i(x)$ and previous distribution $\pi_{i-1}(x)$; weighted samples; resampling proportional to weights; MCMC rejuvenation]

Sequential Monte Carlo Samplers

Estimate the evidence $Z_T$ of $\pi_T$ (aka normalizing constant, marginal likelihood) by

$$Z_T \approx Z_0 \prod_{i=1}^{T} \frac{1}{N} \sum_{j} w_{i,j}$$

Can be adaptive in the rejuvenation steps without the diminishing adaptation required in adaptive MCMC. We will construct rejuvenation moves using an RKHS embedding of the particles.
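As a worked instance of the product formula, with hypothetical incremental weights for 3 SMC steps and 4 particles:

```python
import numpy as np

# Evidence estimate Z_T ≈ Z_0 * prod_i (1/N) sum_j w_{i,j}:
# each factor is the mean incremental weight of one SMC step.
w = np.array([[0.9, 1.1, 1.0, 1.2],    # step 1: mean 1.05
              [0.8, 1.0, 1.3, 0.9],    # step 2: mean 1.00
              [1.1, 1.0, 0.9, 1.0]])   # step 3: mean 1.00
Z0 = 1.0
Z_T = Z0 * np.prod(w.mean(axis=1))     # 1.05 * 1.00 * 1.00
```

In practice the per-step means are accumulated in log space to avoid underflow over many steps.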

Intractable Likelihoods and Evidence
- intractable likelihoods arise in many models (e.g. nonconjugate latent variable models)
- with unbiased likelihood estimates, SMC/PMC remain valid
- simple case: estimate the likelihood itself using IS or SMC, leading to IS² (Tran et al., 2013) and SMC² (Chopin et al., 2011)
- results in noisy importance weights, but the approximation of the evidence (the marginal likelihood of the data under the model) is still valid (Tran et al., 2013, Lemma 3)
- cannot easily use information on the geometry of $\pi$ for efficient inference (e.g. gradients unavailable)

Kernel emulators

In the following: adapt RKHS-based emulators to PMC and SMC in intractable-likelihood settings, so as to adapt to the target geometry. Using a pd kernel $k(\cdot, \cdot)$ we can
- adapt to local covariance (Sejdinovic et al., 2014)
- use gradient information of an infinite exponential family approximation to $\pi$ (Strathmann et al., 2015)

Emulators are used for constructing proposals $q_t$, with
- importance correction in PMC
- Metropolis-Hastings correction within SMC rejuvenation moves for samples $X \sim q_t$

Local covariance: let $k'(y, x) = \nabla_y k(y, x)$ and $\mu(y) = \int k'(y, x)\,\mathrm{d}\pi(x)$; then

$$K(y) = \int \left(k'(y, x) - \mu(y)\right)^2 \mathrm{d}\pi(x)$$

Gradient emulation: fit an infinite exponential family approximation $q(y) = \exp(f(y) - A(f))$, where $f(y) = \langle f, k(y, \cdot)\rangle_{\mathcal{H}}$ is the inner product between the natural parameter $f$ and the sufficient statistic $k(y, \cdot)$ in $\mathcal{H}$, by minimizing

$$\int \left(\nabla_y \log \pi(y) - \nabla_y f(y)\right)^2 \mathrm{d}\pi(y)$$

Use the gradient information of $\log q$ in proposals.

Section 3: Kernel SMC

Kernel Gradient Importance Sampling
- use $\mathcal{N}(X + \delta_1 \nabla_X \log q(X),\ \delta_2 C)$ proposals with importance weighting in PMC
- $C$ is a fit to the global covariance of the target $\pi$
- resulting in Kernel Gradient Importance Sampling (KGRIS), a variant of Gradient Importance Sampling (Schuster, 2015)

Kernel Adaptive SMC Sampler
- use an artificial sequence of distributions leading from the prior $\pi_0$ to the posterior $\pi_T$
- rejuvenation with MH moves using $\mathcal{N}(X, \delta K(X))$ proposals
- resulting in Kernel Adaptive SMC (KASMC)
- similar to the Adaptive SMC sampler (Fearnhead and Taylor, 2013), which arises as the special case of a linear kernel
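A rejuvenation move of this flavour can be sketched as follows. The kernel-weighted empirical covariance here is a simplified stand-in for the RKHS-derived $K(x)$ of KAMH/KASMC (bandwidth and regularization values are illustrative), and because the proposal covariance depends on the current point, both proposal densities enter the acceptance ratio.

```python
import numpy as np

rng = np.random.default_rng(4)

def kernel_local_cov(x, particles, bandwidth=1.0):
    """Gaussian-kernel-weighted empirical covariance around x (stand-in for K(x))."""
    w = np.exp(-0.5 * np.sum((particles - x) ** 2, axis=1) / bandwidth**2)
    w /= w.sum()
    mu = w @ particles
    centred = particles - mu
    return (centred * w[:, None]).T @ centred + 1e-6 * np.eye(x.size)

def log_mvn(z, mu, cov):
    """Log-density of N(mu, cov) at z, up to the constant -d/2 * log(2*pi)."""
    L = np.linalg.cholesky(cov)
    v = np.linalg.solve(L, z - mu)
    return -0.5 * v @ v - np.sum(np.log(np.diag(L)))

def kasmc_rejuvenate(x, particles, log_pi, delta=1.0):
    """One MH move with the locally informed proposal N(x, delta * K(x))."""
    Kx = kernel_local_cov(x, particles)
    prop = rng.multivariate_normal(x, delta * Kx)
    Kp = kernel_local_cov(prop, particles)
    log_ratio = (log_pi(prop) - log_pi(x)
                 + log_mvn(x, prop, delta * Kp) - log_mvn(prop, x, delta * Kx))
    return prop if np.log(rng.uniform()) < log_ratio else x

# toy run: particles approximating N(0, I) inform moves on that same target
particles = rng.normal(size=(300, 2))
log_pi = lambda z: -0.5 * z @ z
x = np.zeros(2)
chain = []
for _ in range(1000):
    x = kasmc_rejuvenate(x, particles, log_pi)
    chain.append(x)
chain = np.array(chain)
```

The key point mirrors the slide: the proposal covariance is learned from the current particle population, so no diminishing-adaptation argument is needed.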

KASMC versus ASMC. [figure; green: ASMC / KASMC with linear kernel, red: KASMC with Gaussian RBF kernel]

Section 4: Implementation Details

Construction of the Target Sequence

For the artificial distribution sequence we used the geometric bridge $\pi_i \propto \pi_0^{1-\rho_i}\, \pi_T^{\rho_i}$, where $(\rho_i)_{i=1}^T$ is an increasing sequence satisfying $\rho_T = 1$. Another standard choice in Bayesian inference is adding data points one after another, $\pi_i(x) = \pi(x \mid d_1, \ldots, d_{\rho_i D})$, resulting in Iterated Batch Importance Sampling (IBIS; Chopin, 2002).
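The geometric bridge is one line in log space; a linear ladder for $\rho_i$ is one common choice (any increasing sequence with $\rho_T = 1$ works):

```python
import numpy as np

def bridge_log_densities(log_pi0, log_piT, n_steps):
    """Geometric bridge pi_i ∝ pi_0^(1-rho_i) * pi_T^(rho_i), linear rho ladder."""
    rhos = np.linspace(0.0, 1.0, n_steps + 1)[1:]
    return [lambda x, r=r: (1 - r) * log_pi0(x) + r * log_piT(x) for r in rhos]

# bridge between two unnormalized 1D Gaussians
seq = bridge_log_densities(lambda x: -0.5 * x**2,
                           lambda x: -0.5 * (x - 4.0) ** 2,
                           n_steps=5)
# the last density in the sequence (rho = 1) is the target itself
```

Note the `r=r` default argument, which freezes each temperature at lambda-creation time; without it, every closure would see the final value of `r`.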

Stochastic approximation, variance reduction
- free scaling parameters can be tuned for optimal scaling of MCMC using the stochastic approximation framework of Andrieu and Thoms (2008)
- the asymptotically optimal acceptance rate for random-walk MH is $\alpha_{\mathrm{opt}} = 0.234$ (Rosenthal, 2011)
- tune a single parameter $\delta_i$ by $\delta_{i+1} = \delta_i + \lambda_i(\hat{\alpha}_i - \alpha_{\mathrm{opt}})$ for a non-increasing sequence $\lambda_1, \ldots, \lambda_T$
- used Random Fourier Features (Rahimi and Recht, 2007) for efficient online updates of the emulators
- used weighted updates and Rao-Blackwellization for variance reduction in the estimated emulators
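The scale-tuning recursion is a Robbins-Monro update; below is a minimal sketch with $\lambda_i = c/(i+1)$ as one standard non-increasing schedule, driven by a synthetic acceptance curve $\hat{\alpha}(\delta) = e^{-\delta}$ standing in for observed acceptance rates (in practice one would adapt $\log \delta$ to keep the scale positive).

```python
import numpy as np

def adapt_scale(delta, acc_rate, i, target=0.234, c=2.0):
    """One step of delta_{i+1} = delta_i + lambda_i * (acc_hat_i - target)."""
    lam = c / (i + 1)                  # non-increasing step sizes
    return delta + lam * (acc_rate - target)

# drive the recursion with a synthetic, monotone acceptance curve exp(-delta);
# the fixed point satisfies exp(-delta) = 0.234, i.e. delta = -log(0.234)
delta = 0.5
for i in range(2000):
    delta = adapt_scale(delta, np.exp(-delta), i)
```

The update increases the scale when acceptance is above target and shrinks it otherwise, settling at the scale whose acceptance rate matches 0.234.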

Section 5: Evaluation

Synthetic nonlinear target (Banana)

Synthetic target: Banana distribution in 8 dimensions, i.e. a Gaussian with a twisted second dimension. [figure: contour plot of the banana target]
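The banana target can be constructed by twisting a Gaussian; the curvature and scale parameters below ($b$, $v$) are illustrative values, not necessarily those used in the talk.

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_banana(n, d=8, b=0.03, v=100.0):
    """Draw from N(0, diag(v, 1, ..., 1)) and twist the second coordinate."""
    x = rng.normal(size=(n, d))
    x[:, 0] *= np.sqrt(v)                 # stretch first dimension
    x[:, 1] += b * (x[:, 0] ** 2 - v)     # twist second dimension
    return x

def log_banana(x, b=0.03, v=100.0):
    """Unnormalized log-density, obtained by inverting the twist."""
    y = x.copy()
    y[:, 1] -= b * (y[:, 0] ** 2 - v)
    y[:, 0] /= np.sqrt(v)
    return -0.5 * np.sum(y ** 2, axis=1)

X = sample_banana(5000)
```

Because the twist is a measure-preserving change of variables with unit Jacobian, exact samples and the exact density are both available, which is what makes the banana a convenient ground-truth benchmark.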

- compare performance of random-walk rejuvenation with asymptotically optimal scaling ($\nu = 2.38/\sqrt{d}$), ASMC, and KASMC with a Gaussian RBF kernel
- fixed learning rate of $\lambda = 0.1$ to adapt the scale parameter using stochastic approximation
- geometric bridge of length 20
- 30 Monte Carlo runs
- report Maximum Mean Discrepancy (MMD) with a polynomial kernel of order 3 (a distance of moments up to order 3) between ground-truth samples and the samples produced by each method
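The MMD criterion with a polynomial kernel is straightforward to compute; the kernel offset $c$ below is an illustrative assumption.

```python
import numpy as np

def mmd2_poly3(X, Y, c=1.0):
    """Biased MMD^2 estimate with k(x, y) = (x.y + c)^3, which compares
    all mixed moments of the two samples up to order 3."""
    Kxx = (X @ X.T + c) ** 3
    Kyy = (Y @ Y.T + c) ** 3
    Kxy = (X @ Y.T + c) ** 3
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(6)
same = mmd2_poly3(rng.normal(size=(100, 2)), rng.normal(size=(100, 2)))
shifted = mmd2_poly3(rng.normal(size=(100, 2)), rng.normal(loc=3.0, size=(100, 2)))
```

A sample compared against itself gives exactly zero, while a mean shift inflates the statistic; the biased (V-statistic) form is used here for brevity.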

Figure: Improved convergence of all mixed moments up to order 3 for KASMC, compared to ASMC and RW-SMC.

Evidence approximation for intractable likelihoods
- in classification using Gaussian Processes (GP), the logistic transformation renders the likelihood intractable
- the likelihood can be unbiasedly estimated using Importance Sampling from an EP approximation
- estimate the model evidence when using an ARD kernel in the GP
- particularly hard because noisy likelihoods mean noisy importance weights
- ground truth obtained by averaging evidence estimates over 20 long-running SMC algorithms

Figure: Monte Carlo variance of the evidence estimate against the number of particles (up to 500); KASMC in blue, ASMC in green.

Stochastic volatility model with intractable likelihood
- stochastic volatility is a particularly challenging class of Bayesian inverse problems
- the time series acts as a high-dimensional nuisance variable
- models have to capture the non-linearities in the data (Barndorff-Nielsen and Shephard, 2001)
- we concentrate on the prediction of daily volatility of asset prices, reusing the model and dataset studied by Chopin et al. (2011) (nuisance of dimension $d = 753$)
- report the RMSE of the target covariance estimate

KGRIS with stochastic volatility. [figure: RMSE of the covariance estimate against population size (5 to 50), comparing AM and KGRIS]

Section 6: Conclusion

Conclusion (1)
- developed the Kernel SMC framework
- KSMC exploits kernel emulators of the target structure
- combines these with the general SMC/PMC advantages for multimodal targets and evidence estimation
- especially attractive when likelihoods are intractable

Conclusion (2)
- evaluated on several challenging models, where it clearly improved statistical efficiency
- KASMC exhibits better MMD on the Banana target
- KASMC has lower MC variance than ASMC in evidence estimation for GP classification
- KGRIS clearly improves covariance estimates in the stochastic volatility model

Thanks!

Literature I
- Andrieu, C. and Thoms, J. (2008). A tutorial on adaptive MCMC. Statistics and Computing, 18:343–373.
- Barndorff-Nielsen, O. E. and Shephard, N. (2001). Non-Gaussian Ornstein-Uhlenbeck-based models and some of their uses in financial economics. Journal of the Royal Statistical Society, Series B, 63(2):167–241.
- Cappé, O., Guillin, A., Marin, J. M., and Robert, C. P. (2004). Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4):907–929.
- Chopin, N. (2002). A sequential particle filter method for static models. Biometrika, 89(3):539–552.

Literature II
- Chopin, N., Jacob, P. E., and Papaspiliopoulos, O. (2011). SMC²: an efficient algorithm for sequential analysis of state-space models.
- Fearnhead, P. and Taylor, B. M. (2013). An Adaptive Sequential Monte Carlo Sampler. Bayesian Analysis, (2):411–438.
- Rahimi, A. and Recht, B. (2007). Random Features for Large-Scale Kernel Machines. In Neural Information Processing Systems.
- Rosenthal, J. S. (2011). Optimal Proposal Distributions and Adaptive MCMC. In Handbook of Markov Chain Monte Carlo, chapter 4, pages 93–112. Chapman & Hall.

Literature III
- Schuster, I. (2015). Consistency of Importance Sampling estimates based on dependent sample sets and an application to models with factorizing likelihoods. arXiv preprint.
- Sejdinovic, D., Strathmann, H., Garcia, M. L., Andrieu, C., and Gretton, A. (2014). Kernel Adaptive Metropolis-Hastings. arXiv.
- Strathmann, H., Sejdinovic, D., Livingstone, S., Szabo, Z., and Gretton, A. (2015). Gradient-free Hamiltonian Monte Carlo with efficient Kernel Exponential Families. In Neural Information Processing Systems.
- Tran, M.-N., Scharth, M., Pitt, M. K., and Kohn, R. (2013). Importance sampling squared for Bayesian inference in latent variable models.