Kernel adaptive Sequential Monte Carlo


Kernel adaptive Sequential Monte Carlo
Ingmar Schuster (Paris Dauphine), Heiko Strathmann (University College London), Brooks Paige (Oxford), Dino Sejdinovic (Oxford)
December 7, 2015

Section 1: Outline

1. Introduction
2. Kernel Adaptive SMC (KASS)
3. Implementation Details
4. Evaluation
5. Conclusion

Section 2: Introduction

Sequential Monte Carlo Samplers
- Approximate integrals with respect to a target distribution π_T.
- Build upon importance sampling: approximate the integral of h w.r.t. the density π_T using samples following a density q (under certain conditions):
  ∫ h(x) dπ_T(x) = ∫ h(x) (π_T(x) / q(x)) dq(x)
- Given a prior π_0, build a sequence π_0, ..., π_i, ..., π_T such that π_{i+1} is closer to π_T than π_i (δ(π_{i+1}, π_T) < δ(π_i, π_T) for some divergence δ).
- A sample from π_i can then approximate π_{i+1} well, using the importance weight function w(·) = π_{i+1}(·) / π_i(·).
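The importance-sampling identity above can be sketched numerically with a self-normalized estimator (the target, proposal, and sample size here are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target pi_T = N(0, 1); proposal q = N(0, 2^2) (both illustrative choices).
def log_pi_T(x):
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_q(x):
    return -0.5 * (x / 2.0)**2 - np.log(2.0) - 0.5 * np.log(2 * np.pi)

x = rng.normal(0.0, 2.0, size=100_000)   # samples following q
w = np.exp(log_pi_T(x) - log_q(x))       # importance weights pi_T / q
h = x**2                                 # integrand h(x) = x^2
est = np.sum(w * h) / np.sum(w)          # self-normalized IS estimate of E[X^2]
```

Since the target is a standard normal, `est` should be close to 1, the second moment of π_T.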

Sequential Monte Carlo Samplers
At i = 0:
- Using a proposal density q_0, generate particles {(w_{0,j}, X_{0,j})}_{j=1}^N where w_{0,j} = π_0(X_{0,j}) / q_0(X_{0,j}).
- Importance resampling, resulting in N equally weighted particles {(1/N, X_{0,j})}_{j=1}^N.
- Rejuvenation move for each X_{0,j} by a Markov kernel leaving π_0 invariant.
At i > 0:
- Approximate π_i by {(π_i(X_{i−1,j}) / π_{i−1}(X_{i−1,j}), X_{i−1,j})}_{j=1}^N.
- Resampling.
- Rejuvenation leaving π_i invariant.
- If π_i ≠ π_T, repeat.
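The reweight / resample / rejuvenate cycle can be sketched in a few lines; this is a toy 1-D version with a random-walk MH rejuvenation kernel (all densities and settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def smc_step(X, log_pi_prev, log_pi_next, n_mh=5, step=0.5):
    """One SMC-sampler iteration on a 1-D target: reweight, resample, rejuvenate."""
    N = len(X)
    # 1. Importance weights w_j = pi_{i+1}(X_j) / pi_i(X_j)
    logw = log_pi_next(X) - log_pi_prev(X)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # 2. Multinomial resampling -> N equally weighted particles
    X = X[rng.choice(N, size=N, p=w)]
    # 3. Rejuvenation: random-walk MH moves leaving pi_{i+1} invariant
    for _ in range(n_mh):
        prop = X + step * rng.normal(size=N)
        accept = np.log(rng.uniform(size=N)) < log_pi_next(prop) - log_pi_next(X)
        X = np.where(accept, prop, X)
    return X

# Move particles from N(0, 3^2) toward N(0, 1) in a single step
X = rng.normal(0.0, 3.0, size=5000)
X = smc_step(X, lambda x: -0.5 * (x / 3.0)**2, lambda x: -0.5 * x**2)
```

After the step, the particle cloud should approximate the next target N(0, 1).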

Sequential Monte Carlo Samplers
- Estimate the evidence Z_T of π_T (aka normalizing constant, marginal likelihood) by Z_T ≈ Z_0 ∏_{i=1}^T (1/N) Σ_j w_{i,j}.
- Can be adaptive in the rejuvenation steps without the diminishing adaptation required in adaptive MCMC.
- Will construct rejuvenation using an RKHS embedding of the particles.
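The running-product evidence estimator can be checked on a toy problem where Z_T is known; the prior, bridge schedule, and MH settings below are illustrative assumptions, not those of the talk:

```python
import numpy as np

rng = np.random.default_rng(2)

s0 = 3.0  # prior standard deviation

def log_pi0(x):      # normalized prior N(0, s0^2), so Z_0 = 1
    return -0.5 * (x / s0)**2 - np.log(s0) - 0.5 * np.log(2 * np.pi)

def log_gamma_T(x):  # unnormalized target exp(-x^2/2); true Z_T = sqrt(2*pi)
    return -0.5 * x**2

N = 20_000
rhos = np.linspace(0.0, 1.0, 21)
X = rng.normal(0.0, s0, size=N)
log_Z = 0.0
for r_prev, r_next in zip(rhos[:-1], rhos[1:]):
    # incremental weights gamma_{i+1}/gamma_i on a geometric bridge
    logw = (r_next - r_prev) * (log_gamma_T(X) - log_pi0(X))
    log_Z += np.log(np.mean(np.exp(logw)))   # accumulate log (1/N) sum_j w_{i,j}
    w = np.exp(logw - logw.max())
    w /= w.sum()
    X = X[rng.choice(N, size=N, p=w)]        # resample
    def log_tgt(x):                          # pi_{i+1} up to a constant
        return (1.0 - r_next) * log_pi0(x) + r_next * log_gamma_T(x)
    for _ in range(3):                       # rejuvenate with random-walk MH
        prop = X + rng.normal(scale=1.0, size=N)
        acc = np.log(rng.uniform(size=N)) < log_tgt(prop) - log_tgt(X)
        X = np.where(acc, prop, X)

Z_est = np.exp(log_Z)
```

`Z_est` should land close to sqrt(2π) ≈ 2.507, the true normalizing constant of exp(−x²/2).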

Intractable Likelihoods and Evidence
- In nonconjugate latent variable models, intractable likelihoods arise.
- When the likelihood can be estimated unbiasedly, SMC is still valid.
- Simple case: estimate the likelihood using IS or SMC, leading to IS² (Tran et al., 2013) and SMC² (Chopin et al., 2011).
- This results in noisy importance weights, but the evidence approximation is still valid (Tran et al., 2013, Lemma 3).

Nonlinear proposals based on positive definite kernels
- Kernel Adaptive Metropolis-Hastings (KAMH) was introduced in Sejdinovic et al. (2014).
- Given previous samples from the target distribution π, draw new ones more efficiently.
- Each sample is mapped to a functional in a Reproducing Kernel Hilbert Space (RKHS) H_k using a pd kernel k(·, ·).
- Fit a Gaussian q_k in H_k with
  μ = ∫ k(·, x) dπ(x) ≈ (1/n) Σ_{i=1}^n k(·, X_i)
  Σ = ∫ (k(·, x) − μ) ⊗ (k(·, x) − μ) dπ(x)
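The empirical mean embedding μ ≈ (1/n) Σ k(·, X_i) is straightforward to evaluate pointwise, since evaluating it at y just averages the kernel against the samples; a small sketch (kernel choice and bandwidth are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

def rbf(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y)**2) / (2 * sigma**2))

X = rng.normal(size=(100, 2))   # samples from pi

def mu_at(y):
    """Empirical mean embedding evaluated at y: (1/n) sum_i k(y, X_i)."""
    return np.mean([rbf(y, xi) for xi in X])

v = mu_at(np.zeros(2))   # estimate of the integral k(y, x) dpi(x) at y = 0
```

Because the RBF kernel takes values in (0, 1], the embedding evaluated anywhere lies strictly between 0 and 1.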

Nonlinear proposals based on positive definite kernels
- Draw a sample from q_k and project it back into the original space; use it as the proposal in MH.
- KAMH is set in adaptive MCMC, using vanishing adaptation (e.g. a vanishing probability of using new samples for computing the adaptive proposal).
- Depending on the positive definite kernel used, it can adapt to nonlinear targets.

Section 3: Kernel Adaptive SMC (KASS)

Adaptive SMC Sampler
- SMC works on a sequence of targets, so we use an artificial sequence of distributions leading from the prior π_0 to the posterior π_T.
- Parameters of the rejuvenation kernel can be adapted before rejuvenation.
- Fearnhead and Taylor (2013) used a global Gaussian approximation as the proposal in Metropolis-Hastings rejuvenation, resulting in the adaptive SMC sampler (ASMC).

Kernel adaptive rejuvenation
- Instead, we use the RKHS proposal projected into the input space (in closed form).
- Given unweighted particles {X̃_i}_{i=1}^N, the proposal at X̃_j is
  q_KAMH(· | X̃_j) = N(X̃_j, ν² M_{X̃,X̃_j} C M_{X̃,X̃_j}ᵀ + γ² I),
  where C = I − (1/N) 1 1ᵀ is the centering matrix and M_{X̃,X̃_j} = 2 [∇_x k(x, X̃_1)|_{x=X̃_j}, ..., ∇_x k(x, X̃_N)|_{x=X̃_j}].
- With the linear kernel k(x, x′) = xᵀx′ this recovers ASMC.
- A locally adaptive fit results from the Gaussian RBF kernel k(x, x′) = exp(−‖x − x′‖² / (2σ²)).
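Assembling the proposal covariance ν² M C Mᵀ + γ² I for a Gaussian RBF kernel can be sketched as follows; dimensions and parameter values are illustrative, and this naive per-particle loop is a didactic sketch rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf_grad(x, xi, sigma=1.0):
    """Gradient w.r.t. x of k(x, xi) = exp(-||x - xi||^2 / (2 sigma^2))."""
    d = x - xi
    return -d / sigma**2 * np.exp(-(d @ d) / (2 * sigma**2))

def kass_cov(X, xj, nu2=1.0, gamma2=1e-4, sigma=1.0):
    """Proposal covariance nu^2 * M C M^T + gamma^2 I at particle xj."""
    N, D = X.shape
    # M is D x N: kernel gradients at xj w.r.t. each particle
    M = 2.0 * np.stack([rbf_grad(xj, xi, sigma) for xi in X], axis=1)
    C = np.eye(N) - np.ones((N, N)) / N   # centering matrix I - (1/N) 1 1^T
    return nu2 * M @ C @ M.T + gamma2 * np.eye(D)

X = rng.normal(size=(50, 2))   # unweighted particles in 2-D
S = kass_cov(X, X[0])
```

Since M C Mᵀ is positive semi-definite and γ² I is added, `S` is symmetric positive definite and thus a valid Gaussian proposal covariance.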

KASS versus ASMC
- Green: ASMC / KASS with linear kernel.
- Red: KASS with Gaussian RBF kernel.

Related Work
- Most directly related to ASMC, which is a special case of KASS.
- All SMC samplers are related to Annealed Importance Sampling, which however does not use resampling (Neal, 1998).
- Local Adaptive Importance Sampling (Givens and Raftery, 1996, LAIS) has a similar locally adaptive effect: at each iteration, compute pairwise distances between importance samples and use the k nearest neighbors to fit a local Gaussian proposal.
- Its lack of resampling steps means a decrease in sampling efficiency which is exponential in the dimensionality of the problem.

Section 4: Implementation Details

Construction of the Target Sequence
- For the artificial distribution sequence we used the geometric bridge π_i ∝ π_0^{1−ρ_i} π_T^{ρ_i}, where (ρ_i)_{i=1}^T is an increasing sequence satisfying ρ_T = 1.
- Another standard choice in Bayesian inference is adding datapoints one after another, π_i(X) = π(X | d_1, ..., d_{⌈ρ_i D⌉}), resulting in Iterated Batch Importance Sampling (Chopin, 2002, IBIS).
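The geometric-bridge log-density is a one-liner; a small sketch with illustrative endpoint densities:

```python
import numpy as np

# Illustrative endpoints: wide prior and narrower target (log densities, up to constants)
def log_pi0(x):
    return -0.5 * (x / 3.0)**2

def log_piT(x):
    return -0.5 * x**2

def log_bridge(x, rho):
    """log pi_i(x) up to a constant: (1 - rho) log pi_0(x) + rho log pi_T(x)."""
    return (1.0 - rho) * log_pi0(x) + rho * log_piT(x)

rhos = np.linspace(0.05, 1.0, 20)   # increasing schedule with rho_T = 1
densities = [log_bridge(2.0, r) for r in rhos]
```

At ρ = 0 the bridge coincides with the prior and at ρ = 1 with the target, so the sequence interpolates between the two.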

Stochastic approximation tuning of ν²
- The free scaling parameter ν² of KASS can be tuned for optimal scaling.
- Fearnhead and Taylor (2013) use an auxiliary variable approach with the ESJD criterion.
- We used the stochastic approximation framework of Andrieu and Thoms (2008) instead.
- The asymptotically optimal acceptance rate for random-walk proposals is α_opt = 0.234 (Rosenthal, 2011).
- After rejuvenation, a Rao-Blackwellized estimator α̂_i is available by averaging the MH acceptance probabilities.
- Tune ν² by ν²_{i+1} = ν²_i + λ_i (α̂_i − α_opt) for a non-increasing sequence λ_1, ..., λ_T.
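The update ν²_{i+1} = ν²_i + λ_i (α̂_i − α_opt) is a Robbins-Monro style recursion; a compact sketch (the learning rate and the positivity floor on ν² are illustrative choices):

```python
def tune_scale(nu2, alpha_hat, alpha_opt=0.234, lam=0.1):
    """One stochastic-approximation update of the proposal scale nu^2."""
    return max(nu2 + lam * (alpha_hat - alpha_opt), 1e-8)  # keep nu^2 positive

nu2 = 1.0
for alpha_hat in [0.05, 0.10, 0.15]:   # acceptance below 0.234 -> shrink the scale
    nu2 = tune_scale(nu2, alpha_hat)
```

When the estimated acceptance rate sits below α_opt, successive updates decrease ν², which in turn raises the acceptance rate of the random-walk proposal.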

Section 5: Evaluation

Synthetic nonlinear target (Banana)
- Synthetic target: Banana distribution in 8 dimensions, i.e. a Gaussian with twisted second dimension.
[Figure: scatter plot of a 2-D marginal of the Banana distribution]

Synthetic nonlinear target (Banana)
- Compare the performance of random-walk rejuvenation with asymptotically optimal scaling (ν = 2.38/√d), ASMC, and KASS with a Gaussian RBF kernel.
- Fixed learning rate λ = 0.1 to adapt the scale parameter using stochastic approximation.
- Geometric bridge of length 20; 30 Monte Carlo runs.
- Report the Maximum Mean Discrepancy (MMD) using a polynomial kernel of order 3: the distance of moments up to order 3 between ground-truth samples and the samples produced by each method.
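A biased V-statistic estimate of MMD² with a polynomial kernel of order 3 can be sketched as follows (sample sizes and the kernel offset c are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def mmd2_poly(X, Y, c=1.0, degree=3):
    """Biased MMD^2 estimate with polynomial kernel k(x, y) = (x'y + c)^degree."""
    k = lambda A, B: (A @ B.T + c) ** degree
    # E[k(X,X')] + E[k(Y,Y')] - 2 E[k(X,Y)], each estimated by a Gram-matrix mean
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

X = rng.normal(0.0, 1.0, size=(2000, 2))
Y_same = rng.normal(0.0, 1.0, size=(2000, 2))   # same distribution as X
Y_diff = rng.normal(1.0, 1.0, size=(2000, 2))   # shifted mean
```

With a degree-3 kernel this compares all mixed moments up to order 3, so samples from the same distribution give an MMD² near zero while the mean-shifted sample gives a clearly larger value.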

Synthetic nonlinear target (Banana)
[Figure: MMD to a benchmark sample as a function of population size, for KASS, RW-SMC, and ASMC. Caption: Improved convergence of all mixed moments up to order 3 of KASS compared to ASMC and RW-SMC.]

Sensor network localization
- Applied problem: infer the locations of S = 3 sensors in a sensor network measuring distances to each other.
- Known positions for B = 2 base sensors.
- Measurements are successful with probability decaying exponentially in the squared distance (otherwise unobserved):
  Z_{i,j} ~ Binom(1, exp(−‖x_i − x_j‖²₂ / (2 · 0.3²)))
- Observed measurements are corrupted by Gaussian noise:
  Y_{i,j} ~ N(‖x_i − x_j‖, 0.02) if Z_{i,j} = 1, and Y_{i,j} = 0 otherwise.
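The observation model can be simulated directly; this is a sketch where the function name and the reading of 0.02 as the noise variance are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_obs(xi, xj, scale=0.3, noise_var=0.02):
    """One sensor-pair observation: Bernoulli detection with probability
    decaying in squared distance, then a Gaussian-noised distance if detected."""
    d = np.linalg.norm(xi - xj)
    p_obs = np.exp(-d**2 / (2 * scale**2))       # detection probability
    if rng.uniform() < p_obs:                    # Z_ij = 1: measurement observed
        return d + np.sqrt(noise_var) * rng.normal()   # Y_ij ~ N(d, 0.02)
    return 0.0                                   # Z_ij = 0: unobserved

y = simulate_obs(np.array([0.0, 0.0]), np.array([0.1, 0.0]))
```

Nearby sensors are detected with probability close to one and so almost always return a noisy version of their true distance, while distant pairs are usually unobserved.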

Sensor network localization
- Run KASS and ASMC with a geometric bridge of length 50 and 10,000 particles, fixed learning rate λ_i = 1.
- Run KAMH for 50 · 10,000 iterations, discarding the first half as burn-in, with diminishing adaptation λ_i = 1/i.
- Initialize both algorithms with samples from the prior.
- Qualitative comparison of KASS and the closest adaptive MCMC algorithm, KAMH.

Sensor network localization: KAMH adaptive MCMC
[Figure: Posterior samples of the unknown sensor locations (in color) by KAMH. The set-up of the true sensor locations (black dots) and base sensors (black stars) causes uncertainty in the posterior.]

Sensor network localization: KASS adaptive SMC
[Figure: Posterior samples of the unknown sensor locations (in color) by KASS. The set-up of the true sensor locations (black dots) and base sensors (black stars) causes uncertainty in the posterior.]

Sensor network localization
- An MCMC algorithm is not able to traverse all the modes without special care (e.g. Wormhole HMC by Lan et al., 2014).
- KASS and ASMC perform similarly in this setup.
- With S = 2 (higher uncertainty) and 1000 particles: MMD of 0.76 ± 0.4 for KASS, 0.94 ± 0.7 for ASMC.

Evidence approximation for intractable likelihoods
- In classification using Gaussian Processes (GPs), the logistic transformation renders the likelihood intractable.
- The likelihood can be unbiasedly estimated using importance sampling from the EP approximation.
- Estimating the model evidence when using an ARD kernel in the GP is particularly hard, because noisy likelihoods mean noisy importance weights.
- Ground truth obtained by averaging the evidence estimate over 20 long-running SMC algorithms.

Evidence approximation for intractable likelihoods
[Figure: Ground truth in red, KASS in blue, ASMC in green.]

Section 6: Conclusion

Conclusion (1)
- Developed the Kernel Adaptive SMC sampler (KASS) for static models.
- KASS exploits the local covariance of the target through RKHS-informed rejuvenation proposals.
- It combines these with the general SMC advantages for multimodal targets and evidence estimation.
- Especially attractive when likelihoods are intractable.

Conclusion (2)
- Evaluated on a strongly twisted Banana, where KASS was clearly better than ASMC.
- KASS enables exploring multiple modes in the nonlinear sensor network localization problem.
- KASS exhibits less variance than ASMC in evidence estimation for GP classification, even in the case of intractable likelihoods.

Thanks!

Literature I
Andrieu, C. and Thoms, J. (2008). A tutorial on adaptive MCMC. Statistics and Computing, 18:343–373.
Chopin, N. (2002). A sequential particle filter method for static models. Biometrika, 89(3):539–552.
Chopin, N., Jacob, P. E., and Papaspiliopoulos, O. (2011). SMC²: an efficient algorithm for sequential analysis of state-space models.
Fearnhead, P. and Taylor, B. M. (2013). An Adaptive Sequential Monte Carlo Sampler. Bayesian Analysis, (2):411–438.
Givens, G. H. and Raftery, A. E. (1996). Local Adaptive Importance Sampling for Multivariate Densities with Strong Nonlinear Relationships. Journal of the American Statistical Association, 91(433):132–141.

Literature II
Lan, S., Streets, J., and Shahbaba, B. (2014). Wormhole Hamiltonian Monte Carlo. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
Neal, R. (1998). Annealed Importance Sampling. Technical report, University of Toronto.
Rosenthal, J. S. (2011). Optimal Proposal Distributions and Adaptive MCMC. In Handbook of Markov Chain Monte Carlo, chapter 4, pages 93–112. Chapman & Hall.
Sejdinovic, D., Strathmann, H., Lomeli, M. G., Andrieu, C., and Gretton, A. (2014). Kernel Adaptive Metropolis-Hastings. In International Conference on Machine Learning (ICML), pages 1665–1673.

Literature III
Tran, M.-N., Scharth, M., Pitt, M. K., and Kohn, R. (2013). Importance sampling squared for Bayesian inference in latent variable models. Pages 1–39.