Inference in state-space models with multiple paths from conditional SMC
Sinan Yıldırım (Sabancı University), joint work with Christophe Andrieu (Bristol), Arnaud Doucet (Oxford) and Nicolas Chopin (ENSAE)
September 20, 2018
State-space models

[Diagram: graphical model of the hidden chain $X_1 \to X_2 \to \cdots \to X_t$ with emissions $Y_1, \ldots, Y_t$.]

- $\{X_t, t \ge 1\}$ is a Markov chain: $X_1 \sim \mu_\theta(\cdot)$ and $X_t \mid (X_{t-1} = x_{t-1}) \sim f_\theta(\cdot \mid x_{t-1})$.
- We only have access to a conditionally independent observation process $\{Y_t, t \ge 1\}$: $Y_t \mid (X_t = x_t) \sim g_\theta(\cdot \mid x_t)$.
The Metropolis-Hastings algorithm

We are interested in sampling from the posterior distribution of $\theta$ given $y_{1:n}$:
$$p(\theta \mid y_{1:n}) \propto \eta(\theta)\, p_\theta(y_{1:n}),$$
where $\eta(\theta)$ is the prior density for $\theta$.

Metropolis-Hastings (MH). Given $\theta$:
1. Propose $\theta' \sim q(\theta, \cdot)$.
2. Accept $\theta'$ with probability
$$\min\left\{1, \frac{q(\theta', \theta)\, \eta(\theta')\, p_{\theta'}(y_{1:n})}{q(\theta, \theta')\, \eta(\theta)\, p_\theta(y_{1:n})}\right\};$$
otherwise reject and keep $\theta$.
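As a concrete sketch of the two steps above, a minimal random-walk MH sampler in Python; `log_target` stands for $\log(\eta(\theta)\, p_\theta(y_{1:n}))$ and is assumed computable here (which it is not for state-space models; that is the point of the following slides):

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_iters, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings targeting exp(log_target)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = theta0
    chain = np.empty(n_iters)
    for i in range(n_iters):
        prop = theta + step * rng.standard_normal()  # symmetric proposal: q-ratio cancels
        log_ratio = log_target(prop) - log_target(theta)
        if np.log(rng.uniform()) < log_ratio:
            theta = prop                             # accept
        chain[i] = theta                             # on rejection, keep current theta
    return chain
```

For example, `metropolis_hastings(lambda t: -0.5 * t * t, 0.0, 20000)` draws an approximately standard-normal chain.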
Intractability

The MH acceptance ratio:
$$r(\theta, \theta') := \frac{q(\theta', \theta)\, \eta(\theta')\, p_{\theta'}(y_{1:n})}{q(\theta, \theta')\, \eta(\theta)\, p_\theta(y_{1:n})}$$

Intractable likelihood:
$$p_\theta(y_{1:n}) = \int_{\mathcal{X}^n} \mu_\theta(x_1)\, g_\theta(y_1 \mid x_1) \prod_{i=2}^{n} g_\theta(y_i \mid x_i)\, f_\theta(x_i \mid x_{i-1})\, dx_{1:n}$$

There are effective methods to estimate $p_\theta(y_{1:n})$ unbiasedly, based on sequential Monte Carlo (SMC), a.k.a. the particle filter.
Pseudo-marginal approach for state-space models

Replace $p_\theta(y_{1:n})$ with a non-negative unbiased estimator obtained from SMC:
$$\mathbb{E}[\hat{p}_\theta(y_{1:n})] = p_\theta(y_{1:n}), \quad \forall \theta \in \Theta$$

Pseudo-marginal MH (Andrieu and Roberts, 2009; Andrieu et al., 2010). Given $\theta$ and $\hat{p}_\theta(y_{1:n})$:
1. Propose $\theta' \sim q(\theta, \cdot)$.
2. Run an SMC algorithm to obtain $\hat{p}_{\theta'}(y_{1:n})$.
3. Accept $(\theta', \hat{p}_{\theta'}(y_{1:n}))$ with probability
$$\min\left\{1, \frac{q(\theta', \theta)\, \eta(\theta')\, \hat{p}_{\theta'}(y_{1:n})}{q(\theta, \theta')\, \eta(\theta)\, \hat{p}_\theta(y_{1:n})}\right\};$$
otherwise reject and keep $(\theta, \hat{p}_\theta(y_{1:n}))$.
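A sketch of the bootstrap particle filter that produces such an estimator; the callback names `sample_x1`, `sample_f`, `log_g` are placeholders for the model ingredients $\mu_\theta$, $f_\theta$, $g_\theta$ at a fixed $\theta$ (multinomial resampling at every step, for simplicity). The exponential of the returned value is the unbiased estimate $\hat{p}_\theta(y_{1:n})$:

```python
import numpy as np

def bootstrap_log_likelihood(y, sample_x1, sample_f, log_g, M, rng):
    """Bootstrap particle filter: returns an estimate of log p_theta(y_{1:n}).

    exp() of the result is an unbiased estimator of the likelihood itself."""
    x = sample_x1(M)                                 # M particles from mu_theta
    log_lik = 0.0
    for t, y_t in enumerate(y):
        if t > 0:
            x = sample_f(x)                          # propagate through f_theta
        logw = log_g(y_t, x)                         # weight by g_theta(y_t | x_t)
        c = logw.max()
        w = np.exp(logw - c)
        log_lik += c + np.log(w.mean())              # accumulate log p_hat(y_t | y_{1:t-1})
        x = x[rng.choice(M, size=M, p=w / w.sum())]  # multinomial resampling
    return log_lik
```

Plugging `np.exp(bootstrap_log_likelihood(...))` into the acceptance ratio above gives the pseudo-marginal MH chain (in practice one works with the log-ratio directly).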
Scalability issues with pseudo-marginal algorithms

- In pseudo-marginal algorithms, one estimates $p_\theta(y_{1:n})$ and $p_{\theta'}(y_{1:n})$ independently.
- For a fixed number of particles, the variability of $\hat{p}_\theta(y_{1:n})$ increases with $n$; as $\theta' \to \theta$ the variability in $\hat{p}_{\theta'}(y_{1:n}) / \hat{p}_\theta(y_{1:n})$ does not vanish.
- Variability of the acceptance ratio
$$\frac{q(\theta', \theta)\, \eta(\theta')\, \hat{p}_{\theta'}(y_{1:n})}{q(\theta, \theta')\, \eta(\theta)\, \hat{p}_\theta(y_{1:n})}$$
has a big impact on performance (Andrieu and Vihola, 2014, 2015).

Alternatives:
- The correlated pseudo-marginal algorithm of Deligiannidis et al. (2018).
- Estimate $p_{\theta'}(y_{1:n}) / p_\theta(y_{1:n})$ directly.
Estimating the likelihood ratio

Shorthand notation: let $x = x_{1:n}$ and $y = y_{1:n}$.

Given $\theta$ and $\theta'$, choose $\tilde{\theta} \in \Theta$, e.g. $\tilde{\theta} = (\theta + \theta')/2$. If $R_{\tilde{\theta}}$ is a Markov transition kernel leaving $p_{\tilde{\theta}}(\cdot \mid y)$ invariant, then
$$\frac{p_{\theta'}(y)}{p_\theta(y)} = \frac{p_{\tilde{\theta}}(y)}{p_\theta(y)} \frac{p_{\theta'}(y)}{p_{\tilde{\theta}}(y)} = \int \frac{p_{\tilde{\theta}}(x, y)}{p_\theta(x, y)} \frac{p_{\theta'}(x', y)}{p_{\tilde{\theta}}(x', y)}\, p_\theta(x \mid y)\, R_{\tilde{\theta}}(x, x')\, dx\, dx'$$

Variance reduction (Crooks, 1998; Jarzynski, 1997; Neal, 2001, 1996).

If further we have reversibility, $p_{\tilde{\theta}}(x \mid y)\, R_{\tilde{\theta}}(x, x') = p_{\tilde{\theta}}(x' \mid y)\, R_{\tilde{\theta}}(x', x)$, then we have a valid algorithm for $p(\theta, x \mid y) \propto \eta(\theta)\, p_\theta(x, y)$ (Neal, 2004).
AIS-based MCMC with one-step annealing

AIS MCMC for SSM:
1. Given $(\theta, x)$, sample $\theta' \sim q(\theta, \cdot)$.
2. Let $\tilde{\theta} = (\theta + \theta')/2$ and sample $x' \sim R_{\tilde{\theta}}(x, \cdot)$.
3. Accept $(\theta', x')$ with probability
$$\min\left\{1, \frac{q(\theta', \theta)\, \eta(\theta')}{q(\theta, \theta')\, \eta(\theta)} \frac{p_{\tilde{\theta}}(x, y)}{p_\theta(x, y)} \frac{p_{\theta'}(x', y)}{p_{\tilde{\theta}}(x', y)}\right\};$$
otherwise reject and keep $(\theta, x)$.
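The move above can be sketched as a generic function of user-supplied callbacks; the names (`log_eta`, `log_p_joint`, `sample_R`, etc.) are illustrative, and `sample_R` is assumed to be reversible with respect to $p_{\tilde{\theta}}(\cdot \mid y)$ (e.g. csmc + backward sampling in the state-space setting):

```python
import numpy as np

def ais_mh_step(theta, x, log_eta, log_q, sample_q, log_p_joint, sample_R, rng):
    """One AIS MCMC move with one-step annealing (sketch).

    log_p_joint(theta, x) = log p_theta(x, y) for the fixed data y;
    sample_R(theta_mid, x) must be reversible w.r.t. p_theta_mid(. | y)."""
    theta_p = sample_q(theta)                      # propose theta'
    theta_mid = 0.5 * (theta + theta_p)            # intermediate parameter
    x_p = sample_R(theta_mid, x)                   # move the latent path under theta_mid
    log_r = (log_q(theta_p, theta) - log_q(theta, theta_p)
             + log_eta(theta_p) - log_eta(theta)
             + log_p_joint(theta_mid, x) - log_p_joint(theta, x)
             + log_p_joint(theta_p, x_p) - log_p_joint(theta_mid, x_p))
    if np.log(rng.uniform()) < log_r:
        return theta_p, x_p                        # accept
    return theta, x                                # reject
```

With an exactly $p_{\tilde{\theta}}(\cdot \mid y)$-distributed (hence reversible) `sample_R`, this chain targets $p(\theta, x \mid y)$ on a toy conjugate Gaussian model, which gives a quick sanity check of the acceptance ratio.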
Reversible kernel: conditional SMC (Andrieu et al., 2010)
csmc and path degeneracy

[Figure: two particle-genealogy plots (state vs. time index, 25 time steps) illustrating path degeneracy in conditional SMC. Picture created by Olivier Cappé.]
csmc with backward sampling (Whiteley, 2010)

- csmc$(M, \theta, x)$: csmc at $\theta$ with $M$ particles, conditional on the path $x$.
- Let $R^M_\theta(x, \cdot)$ be the Markov kernel of csmc$(M, \theta, x)$ + backward sampling.
- $R^M_\theta(x, \cdot)$ is a $p_\theta(x \mid y)$-invariant Markov kernel (Andrieu et al., 2010).
- More importantly, $R^M_\theta$ is reversible (Chopin and Singh, 2015).
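A sketch of the backward-sampling pass alone, given the stored particles and filtering log-weights of an SMC (or csmc) run; the interface (`particles`, `logW`, `log_f`) is illustrative, and a full csmc forward pass is assumed to have produced these arrays:

```python
import numpy as np

def backward_sample(particles, logW, log_f, rng):
    """Draw one trajectory by backward sampling from SMC output.

    particles: (n, M) array of particle values; logW: (n, M) filtering log-weights;
    log_f(x_next, x_prev_array): transition log-density, vectorised over x_prev_array."""
    n, M = particles.shape
    path = np.empty(n)
    # sample the final state from the terminal weights
    w = np.exp(logW[-1] - logW[-1].max())
    k = rng.choice(M, p=w / w.sum())
    path[-1] = particles[-1, k]
    # walk backwards, reweighting by the transition density to the chosen successor
    for t in range(n - 2, -1, -1):
        logb = logW[t] + log_f(path[t + 1], particles[t])
        b = np.exp(logb - logb.max())
        k = rng.choice(M, p=b / b.sum())
        path[t] = particles[t, k]
    return path
```

Repeated calls with the same particle system draw the independent paths used later in the subsampling variant.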
Numerical experiments

Non-linear benchmark model. The latent variables and observations of the state-space model are defined as
$$X_t = X_{t-1}/2 + 25 X_{t-1}/(1 + X_{t-1}^2) + 8\cos(1.2 t) + V_t, \quad t \ge 2,$$
$$Y_t = X_t^2/20 + W_t, \quad t \ge 1,$$
where $X_1 \sim \mathcal{N}(0, 10)$, $V_t \overset{iid}{\sim} \mathcal{N}(0, \sigma_v^2)$, $W_t \overset{iid}{\sim} \mathcal{N}(0, \sigma_w^2)$. The static parameter of the model is $\theta = (\sigma_v^2, \sigma_w^2)$.
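A minimal simulator for this benchmark model (variances, not standard deviations, are passed in, matching the parameterisation above):

```python
import numpy as np

def simulate_benchmark(n, sigma2_v, sigma2_w, rng):
    """Simulate n steps of the nonlinear benchmark state-space model."""
    x = np.empty(n)
    x[0] = rng.normal(0.0, np.sqrt(10.0))            # X_1 ~ N(0, 10)
    for t in range(1, n):                            # array index t holds time t+1
        x[t] = (x[t - 1] / 2.0
                + 25.0 * x[t - 1] / (1.0 + x[t - 1] ** 2)
                + 8.0 * np.cos(1.2 * (t + 1))
                + rng.normal(0.0, np.sqrt(sigma2_v)))
    y = x ** 2 / 20.0 + rng.normal(0.0, np.sqrt(sigma2_w), size=n)
    return x, y
```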
Comparisons with a fixed number of particles and varying n

$M = 200$ for MwPG (Lindsten et al., 2015) and AIS MCMC, and $M = 2000$ for PMMH. Integrated autocorrelation (IAC) times averaged over 200 runs:

              AIS MCMC (csmc+BS)      MwPG               PMMH
              σ_v²     σ_w²           σ_v²     σ_w²      σ_v²     σ_w²
  n = 1000    177      235            209      294       713      592
  n = 2000    175      237            206      294       7590     7579
  n = 5000    176      237            207      296       58086    56635
  n = 10000   176      240            207      302       73681    71709

[Figure: IAC times for σ_v² and σ_w² as n grows, for MwPG (M = 200), AIS MCMC (M = 200) and PMMH (M = 2000).]
Comparison on a sticky model (Pitt and Shephard, 1999)
$$X_t = 0.98 X_{t-1} + V_t, \quad V_t \sim \mathcal{N}(0, 0.02),$$
$$Y_t = X_t + \mu + W_t, \quad W_t \sim \mathcal{N}(0, 0.1)$$

For this comparison, $R_{\theta,\theta',k}(x, \cdot) = \pi_{\theta,\theta',k}$ is used, i.e. an exact draw from the intermediate target.

[Figure: IAC times, averaged over 100 runs, against the standard deviation $\sigma_\mu$ of the symmetric RW proposal, for $n = 1000, 5000, 10000, 50000$; AIS MCMC (csmc+BS with $K = 1$) compared with MwG.]
Preliminary theoretical results

We considered the situation
$$p_\theta(y_{1:n}) := \prod_{t=1}^{n} p_\theta(y_t) = \prod_{t=1}^{n} \int_{\mathcal{X}} p_\theta(x_t, y_t)\, dx_t$$

We make some assumptions, including:
1. $\Theta \subset \mathbb{R}$ is compact.
2. For any $(x, y) \in \mathcal{X} \times \mathcal{Y}$, $\theta \mapsto \log p_\theta(x, y)$ is three times differentiable, with derivatives uniformly bounded in $\theta, x, y$.
3. $\theta' = \theta + \epsilon/\sqrt{n}$ with $\epsilon \sim q_0(\cdot)$, a symmetric distribution with density bounded away from 0.
4. $\tilde{\theta} = (\theta + \theta')/2$.
5. AIS MCMC uses a product of $p_\theta(\cdot \mid y)$-invariant reversible Markov transition kernels $R^M_{\theta,y}$ with
$$\lim_{M \to \infty} \sup_{(\theta, x, y) \in \Theta \times \mathcal{X} \times \mathcal{Y}} \left\| R^M_{\theta,y}(x, \cdot) - p_\theta(\cdot \mid y) \right\|_{tv} = 0$$
Some asymptotics

- $\xi = \{X_t, X_t'\}_{t \ge 1}$, $\omega = \{Y_t\}_{t \ge 1}$; $\mathbb{E}_n^\omega$: conditional expectation given the observations $\omega$.
- $r_n(\theta, \theta'; \xi)$: the estimated acceptance ratio used in AIS MCMC.
- $r_n(\theta, \theta')$: the exact acceptance ratio of MH.

Theorem. Under some assumptions, $\mathbb{P}$-a.s., for any $\varepsilon_0 > 0$ there exist $n_0, M_0 \in \mathbb{N}$ such that for any sequence $\{M_n\} \subset \mathbb{N}$ satisfying $M_n \ge M_0$ for $n \ge n_0$,
$$\sup_{n \ge n_0} \left| \mathbb{E}_n^\omega\left[ \min\{1, r_n(\theta, \theta'; \xi)\} \right] - \mathbb{E}_n^\omega\left[ \min\{1, r_n(\theta, \theta') \exp(Z)\} \right] \right| \le \varepsilon_0,$$
where
$$Z \mid (\theta, \theta', \omega) \sim \mathcal{N}\left( -\frac{\sigma^2(\theta, \theta')}{2},\ \sigma^2(\theta, \theta') \right), \quad \sigma^2(\theta, \theta') := (\theta' - \theta)^2\, \mathbb{E}\left[ -\frac{\partial^2}{\partial \theta^2} \log p_{\tilde{\theta}}(X, Y) \right],$$
and $\tilde{\theta} = (\theta + \theta')/2$.
More general AIS MCMC with $K \ge 1$ intermediate steps

Consider $\mathcal{P}_{\theta,\theta',K} := \{\pi_{\theta,\theta',k},\ k = 0, \ldots, K+1\}$, where
$$\pi_{\theta,\theta',k}(x) := p_{\theta_k}(x \mid y) \propto p_{\theta_k}(x, y) =: \gamma_{\theta,\theta',k}(x),$$
with the annealing $\theta_k = \theta'\, k/(K+1) + \theta\, (1 - k/(K+1))$.

Also, consider the associated Markov kernels $\mathcal{R}_{\theta,\theta',K} := \{ R_{\theta,\theta',k} : \mathcal{X}^n \times \mathcal{X}^n \to [0, 1],\ k = 1, \ldots, K \}$.

In this case the estimator is of the form
$$\prod_{k=0}^{K} \frac{\gamma_{\theta,\theta',k+1}(x_k)}{\gamma_{\theta,\theta',k}(x_k)},$$
where $x_1 \sim R_{\theta,\theta',1}(x_0, \cdot)$ and $x_k \sim R_{\theta,\theta',k}(x_{k-1}, \cdot)$ for $k = 2, \ldots, K$.

With $(\theta', x_K)$ being the proposal, we can design a correct algorithm.
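The schedule and the telescoping-product estimator above can be sketched as follows; `log_gamma` and `sample_R` are illustrative callbacks standing for $\log\gamma_{\theta,\theta',k}$ (here parameterised directly by $\theta_k$) and the $\pi_{\theta,\theta',k}$-invariant kernels:

```python
import numpy as np

def annealing_schedule(theta, theta_p, K):
    """theta_k = theta' * k/(K+1) + theta * (1 - k/(K+1)), for k = 0, ..., K+1."""
    lam = np.arange(K + 2) / (K + 1)
    return (1.0 - lam) * theta + lam * theta_p

def ais_log_estimator(x0, thetas, log_gamma, sample_R):
    """log of prod_{k=0}^{K} gamma_{k+1}(x_k) / gamma_k(x_k), with
    x_k ~ R_k(x_{k-1}, .) for k >= 1 (sample_R is an illustrative callback)."""
    K = len(thetas) - 2
    x = x0
    log_est = 0.0
    for k in range(K + 1):
        if k > 0:
            x = sample_R(thetas[k], x)   # kernel invariant for the k-th intermediate target
        log_est += log_gamma(thetas[k + 1], x) - log_gamma(thetas[k], x)
    return log_est
```

With $K = 0$ this reduces to the one-step ratio $\gamma_1(x_0)/\gamma_0(x_0)$ of the earlier slide.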
Response to increasing budget: fixed n and varying M, K

We fixed $n = 500$ and ran each combination 200 times. On each row: AIS MCMC with $M_0 = 500$ particles and $K$ intermediate steps, versus the MwPG and PMMH algorithms with $M = K M_0$ particles.

            AIS MCMC (csmc+BS)      MwPG               PMMH
            σ_v²     σ_w²           σ_v²     σ_w²      σ_v²     σ_w²
  K = 1     177      209            229      298       1619     3093
  K = 2     145      157            221      284       418      435
  K = 3     139      156            228      281       226      216
  K = 4     150      159            200      311       190      193
  K = 5     134      149            204      258       189      175
  K = 6     130      131            208      263       169      160
  K = 7     137      124            183      265       166      141
  K = 8     137      126            227      276       143      137
  K = 9     120      122            219      298       163      140
  K = 10    135      137            227      267       149      140
Using all paths of csmc

- Law of $k := (k_1, \ldots, k_n)$ in backward sampling conditional on $\zeta = x_{1:n}^{(1:M)}$: $\phi_\theta(k \mid \zeta)$. Resulting path: $\zeta(k)$.
- For any $\theta, \theta', \tilde{\theta} \in \Theta$ and paths $x, x' \in \mathcal{X}^n$, define
$$r_{x,x'}(\theta, \theta'; \tilde{\theta}) = \frac{q(\theta', \theta)\, \eta(\theta')}{q(\theta, \theta')\, \eta(\theta)} \frac{p_{\tilde{\theta}}(x, y)\, p_{\theta'}(x', y)}{p_\theta(x, y)\, p_{\tilde{\theta}}(x', y)}$$

Theorem: unbiased estimator of the acceptance ratio. Let $x \sim p_\theta(\cdot \mid y)$ and $\zeta \mid x \sim \text{csmc}(M, \tilde{\theta}, x)$. Then
$$\sum_{k \in [M]^n} \phi_{\tilde{\theta}}(k \mid \zeta)\, r_{x,\zeta(k)}(\theta, \theta'; \tilde{\theta})$$
is an unbiased estimator of $r(\theta, \theta')$.
1. Sample $\theta' \sim q(\theta, \cdot)$ and $v \sim \mathcal{U}(0, 1)$.
2. Set $\tilde{\theta} = (\theta + \theta')/2$.
3. If $v \le 1/2$:
4.   Run a csmc$(M, \tilde{\theta}, x)$ to obtain $\zeta$.
5.   Set $x' = \zeta(k)$ with probability proportional to $\phi_{\tilde{\theta}}(k \mid \zeta)\, r_{x,\zeta(k)}(\theta, \theta'; \tilde{\theta})$.
6.   Accept $(\theta', x')$ with probability
$$\min\left\{1,\ \sum_{k \in [M]^n} \phi_{\tilde{\theta}}(k \mid \zeta)\, r_{x,\zeta(k)}(\theta, \theta'; \tilde{\theta})\right\};$$
otherwise reject.
7. Else:
8.   Run a csmc$(M, \tilde{\theta}, x)$ to obtain $\zeta$.
9.   Set $x' = \zeta(k)$ with probability $\phi_{\tilde{\theta}}(k \mid \zeta)$.
10.  Accept $(\theta', x')$ with probability
$$\min\left\{1,\ \left[ \sum_{k \in [M]^n} \phi_{\tilde{\theta}}(k \mid \zeta)\, r_{x',\zeta(k)}(\theta', \theta; \tilde{\theta}) \right]^{-1}\right\};$$
otherwise reject.
Theoretical results

Theorem: exactness of the algorithm. The presented algorithm targets $p(\theta, x \mid y)$.

Corollary (to the exactness of the algorithm). Let $x \sim p_\theta(\cdot \mid y)$ and $\zeta \mid x \sim \text{csmc}(M, \tilde{\theta}, x)$. Let $x'$ be drawn with backward sampling conditional upon $\zeta$. Then
$$\sum_{k \in [M]^n} \phi_{\tilde{\theta}}(k \mid \zeta)\, r_{x',\zeta(k)}(\theta', \theta; \tilde{\theta})$$
is an unbiased estimator of $r(\theta, \theta')^{-1}$.
Computational load

- The computations needed can be performed with a complexity of $O(M^2 n)$.
- The acceptance ratios can be computed by a sum-product algorithm.
- Sampling a path $\zeta(k)$ can be performed with a forward-filtering backward-sampling algorithm.
- However, $O(M^2 n)$ can still be overwhelming, especially when $M$ is large.
Reduced computation via subsampling

Let $u^{(1)}, \ldots, u^{(N)}$ be independently sampled paths via backward sampling conditional on $\zeta$. We can still target $p(\theta, x \mid y)$ using
$$\frac{1}{N} \sum_{i=1}^{N} r_{x,u^{(i)}}(\theta, \theta'; \tilde{\theta}),$$
which is an unbiased estimator of $r(\theta, \theta')$.

Computational complexity: $O(NMn)$ per iteration instead of $O(M^2 n)$; moreover, sampling the $N$ paths can be parallelised.
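The averaging step can be sketched as follows, working on the log scale for numerical stability; `draw_backward_path` and `log_r` are illustrative callbacks standing for backward sampling from the csmc particle system and for $\log r_{x,u}(\theta, \theta'; \tilde{\theta})$:

```python
import numpy as np

def subsampled_log_ratio(x, draw_backward_path, log_r, N, rng):
    """log of (1/N) * sum_i r_{x, u^(i)}, averaged over N backward-sampled paths.

    draw_backward_path(rng) yields one path from the csmc particle system;
    log_r(x, u) = log r_{x,u}(theta, theta'; theta_mid)."""
    paths = [draw_backward_path(rng) for _ in range(N)]   # embarrassingly parallel
    log_terms = np.array([log_r(x, u) for u in paths])
    c = log_terms.max()
    # stable log-mean-exp of the N ratio terms
    return c + np.log(np.mean(np.exp(log_terms - c))), paths
```

The $N$ draws are mutually independent given the particle system, so in practice the list comprehension can be replaced by a parallel map.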
1. Sample $\theta' \sim q(\theta, \cdot)$ and $v \sim \mathcal{U}(0, 1)$.
2. Set $\tilde{\theta} = (\theta + \theta')/2$.
3. If $v \le 1/2$:
4.   Run a csmc$(M, \tilde{\theta}, x)$ to obtain the particles $\zeta$.
5.   Draw $N$ paths $u^{(1)}, \ldots, u^{(N)}$ with backward sampling.
6.   Set $x' = u^{(k)}$ with probability proportional to $r_{x,u^{(k)}}(\theta, \theta'; \tilde{\theta})$.
7.   Accept $(\theta', x')$ with probability
$$\min\left\{1,\ \frac{1}{N} \sum_{i=1}^{N} r_{x,u^{(i)}}(\theta, \theta'; \tilde{\theta})\right\};$$
otherwise reject and keep $(\theta, x)$.
8. Else:
9.   Run a csmc$(M, \tilde{\theta}, x)$ to obtain the particles $\zeta$.
10.  Set $u^{(1)} = x$.
11.  Draw $N$ paths $x', u^{(2)}, u^{(3)}, \ldots, u^{(N)}$ with backward sampling.
12.  Accept $(\theta', x')$ with probability
$$\min\left\{1,\ \left[ \frac{1}{N} \sum_{i=1}^{N} r_{x',u^{(i)}}(\theta', \theta; \tilde{\theta}) \right]^{-1}\right\};$$
otherwise reject and keep $(\theta, x)$.
Theorem: exactness. The presented algorithm targets $p(\theta, x \mid y)$.

Corollary (to the exactness of the algorithm). Let $x \sim p_\theta(\cdot \mid y)$ and $\zeta \mid x \sim \text{csmc}(M, \tilde{\theta}, x)$. Let $u^{(1)} = x$, and let $x', u^{(2)}, u^{(3)}, \ldots, u^{(N)}$ be $N$ independent paths sampled with backward sampling conditioned on $\zeta$. Then
$$\left[ \frac{1}{N} \sum_{i=1}^{N} r_{x',u^{(i)}}(\theta', \theta; \tilde{\theta}) \right]^{-1}$$
is an unbiased estimator of $r(\theta, \theta')$.
Numerical example: IAC time

[Figure: IAC times of the algorithms for $\sigma_v^2$ and $\sigma_w^2$, with $n = 500$ and $M = 100$, comparing MwPG against the subsampled algorithm with $N = 1, 10, 50$.]
Numerical example: convergence vs. IAC time

[Figure: panels for $N = 1, 10, 100$: ensemble averages of $\sigma_w^2(k)$ and $\sigma_w^4(k)$ showing convergence to the posterior mean, posterior mean square, and posterior median (single csmc), alongside IAC times for $\sigma_w^2$ and $\sigma_w^4$ with single and multiple csmcs.]
Discussion

- Discussed AIS-based MCMC for state-space models: "Scalable Monte Carlo inference for state-space models", arXiv:1809.02527.
- Discussed the use of multiple (or all possible) paths from csmc in AIS-based MCMC for state-space models. If done in parallel, using multiple paths from csmc can be beneficial in terms of convergence time and IAC time (?).
- The methodology presented for state-space models is a special case of a more general framework: designing MH algorithms with asymmetric acceptance ratios can be useful in other models, such as general latent variable models, trans-dimensional models, and doubly intractable models. "On the utility of Metropolis-Hastings with asymmetric acceptance ratio", arXiv:1803.09527.
Thank you!
References:
Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72:269–342.
Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Annals of Statistics, 37(2):697–725.
Andrieu, C. and Vihola, M. (2014). Establishing some order amongst exact approximations of MCMCs. arXiv:1404.6909.
Andrieu, C. and Vihola, M. (2015). Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms. The Annals of Applied Probability, 25(2):1030–1077.
Chopin, N. and Singh, S. (2015). On particle Gibbs sampling. Bernoulli, 21(3):1855–1883.
Crooks, G. (1998). Nonequilibrium measurements of free energy differences for microscopically reversible Markovian systems. Journal of Statistical Physics, 90(5-6):1481–1487.
Deligiannidis, G., Doucet, A., and Pitt, M. K. (2018). The correlated pseudomarginal method. Journal of the Royal Statistical Society: Series B (Statistical Methodology) (forthcoming).
Jarzynski, C. (1997). Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach. Phys. Rev. E, 56:5018–5035.
Lindsten, F., Douc, R., and Moulines, E. (2015). Uniform ergodicity of the particle Gibbs sampler. Scandinavian Journal of Statistics, 42(3):775–797.
Neal, R. M. (1996). Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6(4):353–366.
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11:125–139.
Neal, R. M. (2004). Taking bigger Metropolis steps by dragging fast variables. Technical report, University of Toronto.
Pitt, M. K. and Shephard, N. (1999). Analytic convergence rates and parametrization issues for the Gibbs sampler applied to state space models. Journal of Time Series Analysis, 20(1):63–85.
Whiteley, N. (2010). Discussion of "Particle Markov chain Monte Carlo methods" by Andrieu et al. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):306–307.