Kernel Adaptive Sequential Monte Carlo
Ingmar Schuster (Paris Dauphine), Heiko Strathmann (University College London), Brooks Paige (Oxford), Dino Sejdinovic (Oxford)
December 7, 2015
1 / 36
Section 1: Outline
1. Introduction
2. Kernel Adaptive SMC (KASS)
3. Implementation Details
4. Evaluation
5. Conclusion
Section 2: Introduction
Sequential Monte Carlo Samplers
- Approximate integrals with respect to a target distribution $\pi_T$
- Build upon Importance Sampling: approximate the integral of $h$ w.r.t. the density $\pi_T$ using samples following a density $q$ (under certain conditions):
  $\int h(x)\, d\pi_T(x) = \int h(x)\, \frac{\pi_T(x)}{q(x)}\, dq(x)$
- Given a prior $\pi_0$, build a sequence $\pi_0, \dots, \pi_i, \dots, \pi_T$ such that $\pi_{i+1}$ is closer to $\pi_T$ than $\pi_i$ is ($\delta(\pi_{i+1}, \pi_T) < \delta(\pi_i, \pi_T)$ for some divergence $\delta$)
- A sample from $\pi_i$ can then approximate $\pi_{i+1}$ well using the importance weight function $w(\cdot) = \pi_{i+1}(\cdot)/\pi_i(\cdot)$
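The importance sampling identity on this slide can be checked numerically. This is a minimal sketch with illustrative choices not taken from the talk: target $\pi_T = \mathcal{N}(0, 1)$, proposal $q = \mathcal{N}(0, 2^2)$, and $h(x) = x^2$ (true expectation 1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy check of E_{pi_T}[h(X)] = E_q[h(X) * pi_T(X)/q(X)].
# Target pi_T = N(0, 1); proposal q = N(0, 2^2) covers the target's support.
def pi_T(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def q_pdf(x):
    return np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, size=200_000)   # samples from q
w = pi_T(x) / q_pdf(x)                   # importance weights
estimate = np.mean(w * x**2)             # estimates E_{pi_T}[X^2] = 1
```

The average weight itself estimates the normalizing constant of $\pi_T$ (here 1), which is the mechanism the evidence estimator two slides later relies on.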
Sequential Monte Carlo Samplers
At $i = 0$:
- Using a proposal density $q_0$, generate particles $\{(w_{0,j}, X_{0,j})\}_{j=1}^N$ where $w_{0,j} = \pi_0(X_{0,j})/q_0(X_{0,j})$
- Importance resampling, resulting in $N$ equally weighted particles $\{(1/N, X_{0,j})\}_{j=1}^N$
- Rejuvenation move for each $X_{0,j}$ by a Markov kernel leaving $\pi_0$ invariant
At $i > 0$:
- Approximate $\pi_i$ by $\{(\pi_i(X_{i-1,j})/\pi_{i-1}(X_{i-1,j}),\; X_{i-1,j})\}_{j=1}^N$
- Resampling
- Rejuvenation leaving $\pi_i$ invariant
- If $\pi_i \neq \pi_T$, repeat
Sequential Monte Carlo Samplers
- Estimate the evidence $Z_T$ of $\pi_T$ (aka normalizing constant, marginal likelihood) by $Z_T \approx Z_0 \prod_{i=1}^T \frac{1}{N} \sum_j w_{i,j}$
- Can be adaptive in the rejuvenation steps, without the diminishing adaptation required in adaptive MCMC
- We will construct rejuvenation moves using an RKHS embedding of the particles
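The weight / resample / rejuvenate loop from the last two slides, together with the running evidence estimate, can be sketched on a toy 1-D problem. The Gaussian targets, random-walk step size, and schedule below are illustrative assumptions, not settings from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)

T_STEPS, N = 10, 5000

def log_pi(i, x):
    # toy sequence: interpolate from pi_0 = N(0, 3^2) to pi_T = N(2, 0.5^2)
    rho = i / T_STEPS
    return (1 - rho) * (-0.5 * (x / 3.0)**2) + rho * (-0.5 * ((x - 2.0) / 0.5)**2)

X = rng.normal(0.0, 3.0, size=N)   # particles drawn from pi_0
log_Z_ratio = 0.0                  # accumulates log(Z_T / Z_0)

for i in range(1, T_STEPS + 1):
    # 1) reweight: w_j = pi_i(X_j) / pi_{i-1}(X_j)
    log_w = log_pi(i, X) - log_pi(i - 1, X)
    log_Z_ratio += np.log(np.mean(np.exp(log_w)))
    # 2) resample to N equally weighted particles
    p = np.exp(log_w - log_w.max())
    X = X[rng.choice(N, size=N, p=p / p.sum())]
    # 3) rejuvenate: one MH random-walk move leaving pi_i invariant
    prop = X + rng.normal(0.0, 0.5, size=N)
    accept = np.log(rng.uniform(size=N)) < log_pi(i, prop) - log_pi(i, X)
    X = np.where(accept, prop, X)
```

For these unnormalized Gaussians the true ratio $Z_T/Z_0$ is $0.5/3$, so the accumulated log-evidence estimate can be checked against $\log(1/6)$.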
Intractable Likelihoods and Evidence
- In nonconjugate latent variable models, intractable likelihoods arise
- When the likelihood can be estimated unbiasedly, SMC is still valid
- Simple case: estimate the likelihood itself using IS or SMC, leading to IS$^2$ (Tran et al., 2013) and SMC$^2$ (Chopin et al., 2011)
- Results in noisy importance weights, but the evidence approximation is still valid (Tran et al., 2013, Lemma 3)
Nonlinear proposals based on positive definite kernels
- Kernel Adaptive Metropolis-Hastings (KAMH) was introduced in Sejdinovic et al. (2014)
- Given previous samples from the target distribution $\pi$, draw new ones more efficiently
- Each sample is mapped to a functional in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_k$ using a pd kernel $k(\cdot, \cdot)$
- Fit a Gaussian $q_k$ in $\mathcal{H}_k$ with
  $\mu = \int k(\cdot, x)\, d\pi(x) \approx \frac{1}{n} \sum_{i=1}^n k(\cdot, X_i)$
  $\Sigma = \int k(\cdot, x) \otimes k(\cdot, x)\, d\pi(x) - \mu \otimes \mu$
Nonlinear proposals based on positive definite kernels
- Draw a sample from $q_k$ and project it back into the original space; use it as the proposal in MH
- KAMH is set in adaptive MCMC, using vanishing adaptation (e.g. a vanishing probability of using new samples for computing the adaptive proposal)
- Depending on the positive definite kernel used, it can adapt to nonlinear targets
Section 3: Kernel Adaptive SMC (KASS)
Adaptive SMC Sampler
- SMC works on a sequence of targets, so we use an artificial sequence of distributions leading from the prior $\pi_0$ to the posterior $\pi_T$
- The parameters of the rejuvenation kernel can be adapted before each rejuvenation
- Fearnhead and Taylor (2013) used a global Gaussian approximation as the proposal in Metropolis-Hastings rejuvenation, resulting in the adaptive SMC sampler (ASMC)
Kernel adaptive rejuvenation
- Instead, we use the RKHS proposal projected back into the input space (available in closed form)
- Given unweighted particles $\{X_i\}_{i=1}^N$, the proposal at $X_j$ is
  $q_{\mathrm{KAMH}}(\cdot \mid X_j) = \mathcal{N}(X_j,\; \nu^2 M_{X,X_j} C M_{X,X_j}^\top + \gamma^2 I)$
  where $C = I - \frac{1}{N}\mathbf{1}\mathbf{1}^\top$ is the centering matrix and
  $M_{X,X_j} = 2\,[\nabla_x k(x, X_1)|_{x=X_j}, \dots, \nabla_x k(x, X_N)|_{x=X_j}]$
- Recovers ASMC when using the linear kernel $k(x, x') = x^\top x'$
- Locally adaptive fit when using the Gaussian RBF kernel $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
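The proposal covariance above can be sketched for the Gaussian RBF kernel, whose gradient is $\nabla_x k(x, y) = -(x - y)\, k(x, y)/\sigma^2$. The particle set and parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf_gradients(X, x, sigma):
    # M_{X,x} = 2 [grad_x k(x, X_1), ..., grad_x k(x, X_N)], shape (d, N),
    # for the RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)),
    # using grad_x k(x, y) = -(x - y) * k(x, y) / sigma^2
    diffs = x[None, :] - X                                   # (N, d)
    k_vals = np.exp(-np.sum(diffs**2, axis=1) / (2 * sigma**2))
    return 2 * (-(diffs / sigma**2) * k_vals[:, None]).T     # (d, N)

def kass_proposal_cov(X, x, sigma, nu2, gamma2):
    # covariance of the Gaussian proposal at particle x:
    # nu^2 * M C M^T + gamma^2 * I, with centering matrix C = I - (1/N) 1 1^T
    N, d = X.shape
    M = rbf_gradients(X, x, sigma)
    C = np.eye(N) - np.ones((N, N)) / N
    return nu2 * M @ C @ M.T + gamma2 * np.eye(d)

X = rng.normal(size=(100, 2))                   # toy particle set
cov = kass_proposal_cov(X, X[0], sigma=1.0, nu2=1.0, gamma2=0.1)
sample = rng.multivariate_normal(X[0], cov)     # one proposed point
```

Since $M C M^\top$ is positive semi-definite and $\gamma^2 I$ adds a ridge, the covariance is always valid; the gradient term stretches the proposal along directions in which nearby particles vary.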
KASS versus ASMC
Figure: proposal fits; green: ASMC / KASS with linear kernel, red: KASS with Gaussian RBF kernel.
Related Work
- Most directly related to ASMC (which is a special case of KASS)
- All SMC samplers are related to Annealed Importance Sampling, which however does not use resampling (Neal, 1998)
- Local Adaptive Importance Sampling (Givens and Raftery, 1996, LAIS) has a similar locally adaptive effect:
  - at each iteration, compute pairwise distances between importance samples
  - use the k nearest neighbors to fit a local Gaussian proposal
  - no resampling steps, which means a decrease in sampling efficiency that is exponential in the dimensionality of the problem
Section 4: Implementation Details
Construction of Target Sequence
- For the artificial distribution sequence we used the geometric bridge $\pi_i \propto \pi_0^{1-\rho_i} \pi_T^{\rho_i}$, where $(\rho_i)_{i=1}^T$ is an increasing sequence satisfying $\rho_T = 1$
- Another standard choice in Bayesian inference is adding datapoints one after another, $\pi_i(X) = \pi(X \mid d_1, \dots, d_{\rho_i D})$, resulting in Iterated Batch Importance Sampling (Chopin, 2002, IBIS)
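In log space the geometric bridge is just a convex combination of the log densities. A minimal sketch, using a linear schedule $\rho_i = i/T$ as one of many valid increasing choices (the Gaussian endpoints are illustrative):

```python
import numpy as np

# Geometric bridge pi_i ∝ pi_0^(1-rho_i) * pi_T^(rho_i), i.e.
# log pi_i = (1 - rho_i) log pi_0 + rho_i log pi_T, with rho_T = 1.
def make_bridge(log_pi0, log_piT, T):
    rhos = np.arange(1, T + 1) / T   # linear increasing schedule, rho_T = 1
    return [lambda x, r=r: (1 - r) * log_pi0(x) + r * log_piT(x) for r in rhos]

# toy check: bridging N(0, 1) towards N(5, 1), unnormalized log densities
log_pi0 = lambda x: -0.5 * x**2
log_piT = lambda x: -0.5 * (x - 5.0)**2

bridge = make_bridge(log_pi0, log_piT, T=10)
# the last bridge density coincides with the target (rho_T = 1)
print(bridge[-1](1.0) == log_piT(1.0))   # True
```

The `r=r` default argument pins each $\rho_i$ at definition time; without it every closure would see the final value of the loop variable.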
Stochastic approximation tuning of $\nu^2$
- The free KASS scaling parameter $\nu^2$ can be tuned for optimal scaling
- Fearnhead and Taylor (2013) use an auxiliary variable approach with an ESJD criterion
- We used the stochastic approximation framework of Andrieu and Thoms (2008) instead
- The asymptotically optimal acceptance rate for random walk proposals is $\alpha_{\mathrm{opt}} = 0.234$ (Rosenthal, 2011)
- After rejuvenation, a Rao-Blackwellized estimator $\hat{\alpha}_i$ is available by averaging the MH acceptance probabilities
- Tune $\nu^2$ by $\nu^2_{i+1} = \nu^2_i + \lambda_i (\hat{\alpha}_i - \alpha_{\mathrm{opt}})$ for a non-increasing sequence $\lambda_1, \dots, \lambda_T$
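The update rule on this slide is a one-liner. The acceptance-rate trace and learning-rate schedule below are made up for illustration:

```python
# Stochastic approximation update from the slide:
# nu2 <- nu2 + lambda_i * (alpha_hat_i - alpha_opt),
# where alpha_hat_i averages the MH acceptance probabilities of iteration i.
ALPHA_OPT = 0.234

def update_scale(nu2, alpha_hat, lam):
    # acceptance too high -> steps too timid -> grow nu2, and vice versa
    return nu2 + lam * (alpha_hat - ALPHA_OPT)

nu2 = 1.0
# toy trace of acceptance-rate estimates, with a decaying learning rate
for i, alpha_hat in enumerate([0.5, 0.4, 0.3, 0.25, 0.22], start=1):
    nu2 = update_scale(nu2, alpha_hat, lam=1.0 / i)
```

Because the observed acceptance rates sit mostly above 0.234, the scale grows, exactly the self-correcting behaviour the stochastic approximation framework is meant to provide.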
Section 5: Evaluation
Synthetic nonlinear target (Banana)
Synthetic target: Banana distribution in 8 dimensions, i.e. a Gaussian with a twisted second dimension
[Figure: scatter plot of the Banana target]
Synthetic nonlinear target (Banana)
- Compare the performance of random walk rejuvenation with asymptotically optimal scaling ($\nu = 2.38/\sqrt{d}$), ASMC, and KASS with a Gaussian RBF kernel
- Fixed learning rate $\lambda = 0.1$ to adapt the scale parameter using stochastic approximation
- Geometric bridge of length 20
- 30 Monte Carlo runs
- Report the Maximum Mean Discrepancy (MMD) using a polynomial kernel of order 3: a distance between the moments up to order 3 of ground-truth samples and samples produced by each method
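An MMD estimate with a polynomial kernel of order 3 can be computed directly from the two sample sets. This sketch uses the (biased) V-statistic estimator and assumes the common kernel form $k(x, y) = (x^\top y + c)^3$ with $c = 1$, which the slides do not specify:

```python
import numpy as np

def poly_kernel(A, B, degree=3, c=1.0):
    # polynomial kernel k(x, y) = (x^T y + c)^degree
    return (A @ B.T + c) ** degree

def mmd2(X, Y, degree=3):
    # biased (V-statistic) estimator of squared MMD between sample sets X, Y
    Kxx = poly_kernel(X, X, degree)
    Kyy = poly_kernel(Y, Y, degree)
    Kxy = poly_kernel(X, Y, degree)
    return Kxx.mean() + Kyy.mean() - 2 * Kxy.mean()

rng = np.random.default_rng(3)
X       = rng.normal(0, 1, size=(2000, 2))
Y_same  = rng.normal(0, 1, size=(2000, 2))   # same distribution as X
Y_shift = rng.normal(1, 1, size=(2000, 2))   # mean-shifted distribution
print(mmd2(X, Y_same) < mmd2(X, Y_shift))    # True
```

With a degree-3 polynomial kernel, a small MMD means the two samples agree in all mixed moments up to order 3, which is why it serves as the convergence criterion here.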
Synthetic nonlinear target (Banana)
[Figure: MMD to benchmark sample versus population size, for KASS, RW-SMC, and ASMC]
Figure: Improved convergence of all mixed moments up to order 3 for KASS compared to ASMC and RW-SMC.
Sensor network localization
- Applied problem: infer the locations of $S = 3$ sensors in a sensor network measuring distances to each other
- Known positions for $B = 2$ base sensors
- Measurements are successful with probability decaying exponentially in the squared distance (otherwise unobserved):
  $Z_{i,j} \sim \mathrm{Binom}\!\left(1, \exp\!\left(-\frac{\|x_i - x_j\|^2}{2 \cdot 0.3^2}\right)\right)$
- Successful measurements are corrupted by Gaussian noise:
  $Y_{i,j} \sim \mathcal{N}(\|x_i - x_j\|,\, 0.02)$ if $Z_{i,j} = 1$, and $Y_{i,j} = 0$ otherwise
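The generative model above is easy to simulate, which is also how synthetic data for this experiment would be produced. A sketch assuming the 0.02 in the slide is the noise variance and using made-up sensor coordinates:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate the observation model from the slide: a pairwise distance is
# observed with probability exp(-||x_i - x_j||^2 / (2 * 0.3^2)); observed
# distances carry Gaussian noise with variance 0.02 (assumed, not stated).
def simulate_observations(locations, obs_scale=0.3, noise_var=0.02):
    S = len(locations)
    Z = np.zeros((S, S), dtype=int)   # observation indicators
    Y = np.zeros((S, S))              # noisy distance measurements
    for i in range(S):
        for j in range(i + 1, S):
            dist = np.linalg.norm(locations[i] - locations[j])
            p_obs = np.exp(-dist**2 / (2 * obs_scale**2))
            Z[i, j] = rng.binomial(1, p_obs)
            if Z[i, j] == 1:
                Y[i, j] = dist + rng.normal(0.0, np.sqrt(noise_var))
    return Z, Y

# 3 unknown sensors plus 2 base sensors, at toy positions in the unit square
locations = rng.uniform(0.0, 1.0, size=(5, 2))
Z, Y = simulate_observations(locations)
```

The exponentially decaying observation probability is what makes the posterior multimodal: a missing measurement is weak evidence that two sensors are far apart, so several location configurations can explain the data.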
Sensor network localization
- Run KASS and ASMC with a geometric bridge of length 50 and 10,000 particles, fixed learning rate $\lambda_i = 1$
- Run KAMH for $50 \times 10{,}000$ iterations, discarding the first half as burn-in, with diminishing adaptation $\lambda_i = 1/\sqrt{i}$
- Initialize both algorithms with samples from the prior
- Qualitative comparison of KASS and the closest adaptive MCMC algorithm, KAMH
Sensor network localization: KAMH adaptive MCMC
Figure: Posterior samples of the unknown sensor locations (in color) by KAMH. The set-up of the true sensor locations (black dots) and base sensors (black stars) causes uncertainty in the posterior.
Sensor network localization: KASS adaptive SMC
Figure: Posterior samples of the unknown sensor locations (in color) by KASS. The set-up of the true sensor locations (black dots) and base sensors (black stars) causes uncertainty in the posterior.
Sensor network localization
- The MCMC algorithm is not able to traverse all the modes without special care (e.g. Wormhole HMC by Lan et al., 2014)
- KASS and ASMC perform similarly in this setup
- With $S = 2$ (higher uncertainty) and 1000 particles: MMD of $0.76 \pm 0.4$ for KASS vs. $0.94 \pm 0.7$ for ASMC
Evidence approximation for intractable likelihoods
- In classification using Gaussian Processes (GPs), the logistic transformation renders the likelihood intractable
- The likelihood can be unbiasedly estimated using Importance Sampling from an EP approximation
- Estimating the model evidence when using an ARD kernel in the GP is particularly hard, because noisy likelihoods mean noisy importance weights
- Ground truth obtained by averaging the evidence estimate over 20 long-running SMC algorithms
Evidence approximation for intractable likelihoods
Figure: Ground truth in red, KASS in blue, ASMC in green.
Section 6: Conclusion
Conclusion (1)
- Developed a Kernel Adaptive SMC sampler (KASS) for static models
- KASS exploits the local covariance of the target through RKHS-informed rejuvenation proposals
- Combines these with the general SMC advantages for multimodal targets and evidence estimation
- Especially attractive when likelihoods are intractable
Conclusion (2)
- Evaluated on a strongly twisted Banana, where KASS was clearly better than ASMC
- KASS enables exploring multiple modes in the nonlinear sensor network localization posterior
- KASS exhibits less variance than ASMC in evidence estimation for GP classification, even in the case of intractable likelihoods
Thanks!
Literature I
Andrieu, C. and Thoms, J. (2008). A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373.
Chopin, N. (2002). A sequential particle filter method for static models. Biometrika, 89(3):539–552.
Chopin, N., Jacob, P. E., and Papaspiliopoulos, O. (2011). SMC^2: an efficient algorithm for sequential analysis of state-space models. arXiv preprint, pages 1–27.
Fearnhead, P. and Taylor, B. M. (2013). An Adaptive Sequential Monte Carlo Sampler. Bayesian Analysis, 8(2):411–438.
Givens, G. H. and Raftery, A. E. (1996). Local Adaptive Importance Sampling for Multivariate Densities with Strong Nonlinear Relationships. Journal of the American Statistical Association, 91(433):132–141.
Literature II
Lan, S., Streets, J., and Shahbaba, B. (2014). Wormhole Hamiltonian Monte Carlo. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
Neal, R. (1998). Annealed Importance Sampling. Technical report, University of Toronto.
Rosenthal, J. S. (2011). Optimal Proposal Distributions and Adaptive MCMC. In Handbook of Markov Chain Monte Carlo, chapter 4, pages 93–112. Chapman & Hall.
Sejdinovic, D., Strathmann, H., Lomeli, M. G., Andrieu, C., and Gretton, A. (2014). Kernel Adaptive Metropolis-Hastings. In International Conference on Machine Learning (ICML), pages 1665–1673.
Literature III
Tran, M.-N., Scharth, M., Pitt, M. K., and Kohn, R. (2013). Importance sampling squared for Bayesian inference in latent variable models. arXiv preprint, pages 1–39.