Divide-and-Conquer Sequential Monte Carlo

Joint work with: John Aston, Alexandre Bouchard-Côté, Brent Kirkpatrick, Fredrik Lindsten, Christian Næsseth, Thomas Schön

University of Warwick
a.m.johansen@warwick.ac.uk
http://go.warwick.ac.uk/amjohansen/talks/

Algorithms and Computationally Intensive Statistics, November 25th, 2016
Outline

- Importance Sampling to Sequential Monte Carlo (SMC)
- SMC to Divide-and-Conquer SMC (DC-SMC)
- Some Theoretical Properties of DC-SMC
- Illustrative Applications
- Conclusions and Some (Open) Questions
Essential Problem

The Abstract Problem: given a density π(x) = γ(x)/Z, such that γ(x) can be evaluated pointwise, how can we approximate π, and what about Z? Can we do so robustly? In a distributed setting?
Importance Sampling

Simple identity: provided γ ≪ µ,

    Z = ∫ γ(x) dx = ∫ [γ(x)/µ(x)] µ(x) dx.

So, if X_1, X_2, ... ~ iid µ, then:

- unbiasedness:  E[ (1/N) Σ_{i=1}^N γ(X_i)/µ(X_i) ] = Z
- SLLN:  lim_{N→∞} (1/N) Σ_{i=1}^N [γ(X_i)/µ(X_i)] φ(X_i) = γ(φ)  a.s.
- CLT:  lim_{N→∞} √N [ (1/N) Σ_{i=1}^N [γ(X_i)/µ(X_i)] φ(X_i) − γ(φ) ] = W  in distribution,

where W ~ N(0, Var[γ(X_1)/µ(X_1) φ(X_1)]).
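The identity above translates directly into code; the following is a minimal sketch, in which the unnormalised target γ, the proposal µ, and the sample size are all illustrative assumptions (γ is chosen as 2 × N(0, 1) so that Z = 2 is known for checking):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Hypothetical unnormalised target: gamma(x) = 2 * N(x; 0, 1), so Z = 2.
gamma = lambda x: 2.0 * np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
# Proposal mu = N(0, 2^2), which dominates gamma.
mu_pdf = lambda x: np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2.0 * np.pi))

x = 2.0 * rng.standard_normal(N)        # X_i ~ iid mu
w = gamma(x) / mu_pdf(x)                # importance weights gamma(X_i)/mu(X_i)

z_hat = w.mean()                        # unbiased estimate of Z
phi_hat = np.sum(w * x**2) / np.sum(w)  # self-normalised estimate of E_pi[X^2] = 1
```

The unbiased estimator of Z uses the plain average of the weights; estimating π(φ) requires the self-normalised ratio, as in the SLLN/CLT statements above.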
Sequential Importance Sampling

Write γ(x_{1:n}) = γ(x_1) ∏_{p=2}^n γ(x_p | x_{1:p−1}); define, for p = 1, ..., n,

    γ_p(x_{1:p}) = γ_1(x_1) ∏_{q=2}^p γ(x_q | x_{1:q−1});

then

    γ(x_{1:n})/µ(x_{1:n}) = W_n(x_{1:n})
                          = [γ_1(x_1)/µ_1(x_1)] ∏_{p=2}^n γ_p(x_{1:p}) / [γ_{p−1}(x_{1:p−1}) µ_p(x_p | x_{1:p−1})]
                          = w_1(x_1) ∏_{p=2}^n w_p(x_{1:p}),

and we can sequentially approximate Z_p = ∫ γ_p(x_{1:p}) dx_{1:p}.
Sequential Importance Resampling (SIR)

Given a sequence γ_1(x_1), γ_2(x_{1:2}), ...:

Initialisation, n = 1:
- Sample X_1^1, ..., X_1^N ~ iid µ_1.
- Compute W_1^i = γ_1(X_1^i)/µ_1(X_1^i).
- Obtain Ẑ_1^N = (1/N) Σ_{i=1}^N W_1^i and π_1^N = Σ_{i=1}^N W_1^i δ_{X_1^i} / Σ_{j=1}^N W_1^j.

[This is just (self-normalized) importance sampling.]
Iteration, n → n + 1:
- Resample: sample (X_{n,1:n−1}^1, ..., X_{n,1:n−1}^N) ~ iid Σ_{i=1}^N W_{n−1}^i δ_{X_{n−1}^i} / Σ_{j=1}^N W_{n−1}^j.
- Sample X_{n,n}^i ~ q_n(· | X_{n,1:n−1}^i).
- Compute W_n^i = γ_n(X_{n,1:n}^i) / [γ_{n−1}(X_{n,1:n−1}^i) q_n(X_{n,n}^i | X_{n,1:n−1}^i)].
- Obtain Ẑ_n^N = Ẑ_{n−1}^N (1/N) Σ_{i=1}^N W_n^i and π_n^N = Σ_{i=1}^N W_n^i δ_{X_n^i} / Σ_{j=1}^N W_n^j.
SIR: Theoretical Justification

Under regularity conditions we still have:

- unbiasedness:  E[Ẑ_n^N] = Z_n
- SLLN:  lim_{N→∞} π_n^N(φ) = π_n(φ)  a.s.
- CLT:  for a normal random variable W_n of appropriate variance,
    lim_{N→∞} √N [π_n^N(φ) − π_n(φ)] = W_n  in distribution,

although establishing this becomes a little harder (cf., e.g., Del Moral (2004), Andrieu et al. (2010)).
Simple Particle Filters: One Family of SIR Algorithms

[Graphical model: hidden Markov chain x_1, ..., x_6 with observations y_1, ..., y_6.]

- Unobserved Markov chain {X_n} with transition density f.
- Observed process {Y_n} with conditional density g.

The joint density is available:

    p(x_{1:n}, y_{1:n} | θ) = f_1^θ(x_1) g^θ(y_1 | x_1) ∏_{i=2}^n f^θ(x_i | x_{i−1}) g^θ(y_i | x_i).

Natural SIR target distributions:

    π_n^θ(x_{1:n}) := p(x_{1:n} | y_{1:n}, θ) ∝ p(x_{1:n}, y_{1:n} | θ) =: γ_n^θ(x_{1:n}),
    Z_n^θ = ∫ p(x_{1:n}, y_{1:n} | θ) dx_{1:n} = p(y_{1:n} | θ).
Bootstrap PFs and Similar

Choosing

    π_n^θ(x_{1:n}) := p(x_{1:n} | y_{1:n}, θ) ∝ p(x_{1:n}, y_{1:n} | θ) =: γ_n^θ(x_{1:n}),
    Z_n^θ = ∫ p(x_{1:n}, y_{1:n} | θ) dx_{1:n} = p(y_{1:n} | θ),

and q_p(x_p | x_{1:p−1}) = f^θ(x_p | x_{p−1}) yields the bootstrap particle filter of Gordon et al. (1993), whereas q_p(x_p | x_{1:p−1}) = p(x_p | x_{p−1}, y_p, θ) yields the locally optimal particle filter.

Note: many alternative particle filters are SIR algorithms with other targets. Cf. J. and Doucet (2008); Doucet and J. (2011).
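As a concrete check on this construction, here is a minimal bootstrap particle filter for a hypothetical linear-Gaussian random-walk model, chosen so the exact marginal likelihood is available from the Kalman filter for comparison; the model, seed, and particle count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 20, 5000
# Simulate from x_n = x_{n-1} + v_n, y_n = x_n + e_n, with v_n, e_n ~ N(0, 1).
x_true = np.cumsum(rng.standard_normal(T))
y = x_true + rng.standard_normal(T)

def bootstrap_pf(y, N, rng):
    """Bootstrap PF: propose from the transition f, weight by g(y_n | x_n)."""
    logZ = 0.0
    x = rng.standard_normal(N)                    # X_1 ~ f_1 = N(0, 1)
    for obs in y:
        logw = -0.5 * (obs - x)**2 - 0.5 * np.log(2 * np.pi)  # g(y_n | x_n)
        logZ += np.log(np.mean(np.exp(logw)))     # incremental Z-hat factor
        w = np.exp(logw - logw.max()); w /= w.sum()
        x = x[rng.choice(N, N, p=w)]              # multinomial resampling
        x = x + rng.standard_normal(N)            # propagate via f(. | x_n)
    return logZ

def kalman_loglik(y):
    """Exact log p(y_{1:T}) for this linear-Gaussian model, for comparison."""
    m, P, ll = 0.0, 1.0, 0.0
    for obs in y:
        S = P + 1.0                               # predictive variance of y_n
        ll += -0.5 * (np.log(2 * np.pi * S) + (obs - m)**2 / S)
        K = P / S                                 # Kalman gain
        m, P = m + K * (obs - m), (1 - K) * P + 1.0  # update, then predict
    return ll

lz = bootstrap_pf(y, N, rng)
kl = kalman_loglik(y)   # exact value the PF estimate should be close to
```

Because the proposal is the transition density, the incremental weight reduces to g(y_n | x_n), exactly as the general SIR weight formula prescribes.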
SMC Samplers: Another SIR Class

Given a sequence of targets π_1, ..., π_n on arbitrary spaces, Del Moral et al. (2006) extend the space:

    π̃_n(x_{1:n}) = π_n(x_n) ∏_{p=1}^{n−1} L_p(x_{p+1}, x_p),
    γ̃_n(x_{1:n}) = γ_n(x_n) ∏_{p=1}^{n−1} L_p(x_{p+1}, x_p),
    Z̃_n = ∫ γ̃_n(x_{1:n}) dx_{1:n} = ∫ γ_n(x_n) ∏_{p=1}^{n−1} L_p(x_{p+1}, x_p) dx_{1:n} = ∫ γ_n(x_n) dx_n = Z_n,

since the backward kernels L_p each integrate to one.
A Simple SMC Sampler

Given γ_1, ..., γ_n on (E, ℰ):

For i = 1, ..., N: sample X_1^i ~ iid µ_1; compute W_1^i = γ_1(X_1^i)/µ_1(X_1^i) and Ẑ_1^N = (1/N) Σ_{i=1}^N W_1^i.

For p = 2, ..., n:
- Resample: X̂_{p−1}^{1:N} ~ iid Σ_{i=1}^N W_{p−1}^i δ_{X_{p−1}^i} / Σ_{j=1}^N W_{p−1}^j.
- Sample: X_p^i ~ K_p(X̂_{p−1}^i, ·), where π_p K_p = π_p.
- Compute: W_p^i = γ_p(X̂_{p−1}^i) / γ_{p−1}(X̂_{p−1}^i).

Then Ẑ_n^N = Ẑ_{n−1}^N (1/N) Σ_{i=1}^N W_n^i and π_n^N = Σ_{i=1}^N W_n^i δ_{X_n^i} / Σ_{j=1}^N W_n^j.
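A minimal sketch of this sampler on a toy problem, in which the path of distributions, proposal scale, temperature grid, and number of MH sweeps are all illustrative assumptions: we move from γ_1 = N(0, 3²) (normalised, so Z_1 = 1) to γ_T(x) = exp(−x²/2), whose normalising constant √(2π) the algorithm should recover.

```python
import numpy as np

rng = np.random.default_rng(2)
N, betas = 2000, np.linspace(0.0, 1.0, 21)

def log_gamma(x, beta):
    """Geometric path between gamma_1 = N(0, 9) (normalised) and
    gamma_T(x) = exp(-x^2/2), whose normalising constant is sqrt(2*pi)."""
    lg1 = -0.5 * x**2 / 9.0 - 0.5 * np.log(2 * np.pi * 9.0)
    lgT = -0.5 * x**2
    return (1 - beta) * lg1 + beta * lgT

x = rng.normal(0.0, 3.0, N)                  # X_1 ~ mu_1 = gamma_1, so W_1 = 1
logZ = 0.0
for bprev, b in zip(betas[:-1], betas[1:]):
    logw = log_gamma(x, b) - log_gamma(x, bprev)   # incremental weight
    logZ += np.log(np.mean(np.exp(logw)))
    w = np.exp(logw - logw.max()); w /= w.sum()
    x = x[rng.choice(N, N, p=w)]                   # resample
    for _ in range(2):                             # pi_p-invariant RW-MH move
        prop = x + rng.normal(0.0, 1.0, N)
        accept = np.log(rng.uniform(size=N)) < log_gamma(prop, b) - log_gamma(x, b)
        x = np.where(accept, prop, x)

z_hat = np.exp(logZ)   # estimates Z_T = sqrt(2*pi)
```

Note that the incremental weight γ_p/γ_{p−1} is evaluated at the pre-move particles, matching the algorithm above; the MH kernel only needs to leave π_p invariant.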
Bayesian Inference (Chopin, 2002; Del Moral et al., 2006)

In a Bayesian context, given a prior p(θ) and likelihood l(θ; y_{1:m}), one could specify:

- Data tempering: γ_p(θ) = p(θ) l(θ; y_{1:m_p}) for 0 = m_1 < m_2 < ... < m_T = m.
- Likelihood tempering: γ_p(θ) = p(θ) l(θ; y_{1:m})^{β_p} for 0 = β_1 < β_2 < ... < β_T = 1.
- Something else?

Here Z_T = ∫ p(θ) l(θ; y_{1:m}) dθ and γ_T(θ) ∝ p(θ | y_{1:m}).

Specifying (m_1, ..., m_T), (β_1, ..., β_T) or (γ_1, ..., γ_T) is hard¹.

¹ Just ask Lewis...
Illustrative Sequences of Targets

[Figure: two panels of normal densities on (−10, 10), illustrating intermediate sequences of target distributions.]
One Adaptive Scheme (Zhou, J. & Aston, 2016) + Refs

Resample: when ESS(W_n^{1:N}) = (Σ_{i=1}^N (W_n^i)²)^{−1} (for normalised weights) falls below a threshold.

Likelihood tempering: at iteration n, set β_n such that

    CESS = N (Σ_{j=1}^N W_{n−1}^{(j)} w_n^{(j)})² / Σ_{k=1}^N W_{n−1}^{(k)} (w_n^{(k)})²,

which controls the χ²-discrepancy between successive distributions.

Proposals: follow Jasra et al. (2010) — adapt to keep the acceptance rate about right.

Question: are there better, practical approaches to specifying a sequence of distributions?
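One way to implement the temperature choice is bisection on the CESS criterion above; the following sketch assumes likelihood tempering with incremental weights w^(i) = l(θ_i)^{β_n − β_{n−1}}, and the per-particle log-likelihood values and target fraction are illustrative:

```python
import numpy as np

def cess(loglik, W, beta_prev, beta):
    """CESS of incremental weights w^(i) = l(theta_i)^(beta - beta_prev),
    given normalised previous-iteration weights W."""
    N = len(W)
    logw = (beta - beta_prev) * loglik
    w = np.exp(logw - logw.max())       # CESS is scale-invariant; shift for stability
    return N * np.sum(W * w)**2 / np.sum(W * w**2)

def next_beta(loglik, W, beta_prev, frac, tol=1e-8):
    """Bisect for the largest beta with CESS >= frac * N (the CESS typically
    decreases as beta moves away from beta_prev)."""
    N = len(W)
    if cess(loglik, W, beta_prev, 1.0) >= frac * N:
        return 1.0                      # can jump straight to the final target
    lo, hi = beta_prev, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if cess(loglik, W, beta_prev, mid) >= frac * N else (lo, mid)
    return 0.5 * (lo + hi)

rng = np.random.default_rng(3)
loglik = rng.normal(-50.0, 5.0, 1000)   # hypothetical per-particle log-likelihoods
W = np.full(1000, 1e-3)                 # equal normalised weights
beta_1 = next_beta(loglik, W, 0.0, frac=0.99)
```

A high target fraction (e.g. 0.99N) forces small temperature increments when the likelihood values are spread out, and allows large jumps when they are nearly constant.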
Divide-and-Conquer SMC (Lindsten, J. et al., 2016)

Many models admit natural decompositions:

[Figure: a three-level tree. Level 0: root variable x_5; Level 1: two copies of x_4; Level 2: leaves (x_1, x_2, x_3) with observations (y_1, y_2, y_3).]

To these we can apply a divide-and-conquer strategy over a tree of distributions, from leaves π_{c_1}, ..., π_{c_C} through intermediate nodes π_t up to the root π_r.
A few formalities...

Use a tree T of models (with rootward variable inclusion):

[Tree of distributions: leaves π_{c_1}, ..., π_{c_C} up to the root π_r.]

- Let t ∈ T denote a node; r ∈ T is the root.
- Let C(t) = {c_1, ..., c_C} denote the children of t.
- Let X̃_t denote the space of the variables included in t but not in its children.

dc-smc can be viewed as a recursion over this tree.
dc-smc(t): an extension of SIR

1. For c ∈ C(t):
   1.1 ({x_c^i, w_c^i}_{i=1}^N, Ẑ_c^N) ← dc-smc(c).
   1.2 Resample {x_c^i, w_c^i}_{i=1}^N to obtain the equally weighted particle system {x̂_c^i, 1}_{i=1}^N.
2. For particle i = 1 : N:
   2.1 If X̃_t ≠ ∅, simulate x̃_t^i ~ q_t(· | x̂_{c_1}^i, ..., x̂_{c_C}^i), where (c_1, c_2, ..., c_C) = C(t); else x̃_t^i = ∅.
   2.2 Set x_t^i = (x̂_{c_1}^i, ..., x̂_{c_C}^i, x̃_t^i).
   2.3 Compute w_t^i = γ_t(x_t^i) / [∏_{c∈C(t)} γ_c(x̂_c^i) · q_t(x̃_t^i | x̂_{c_1}^i, ..., x̂_{c_C}^i)].
3. Compute Ẑ_t^N = [(1/N) Σ_{i=1}^N w_t^i] ∏_{c∈C(t)} Ẑ_c^N.
4. Return ({x_t^i, w_t^i}_{i=1}^N, Ẑ_t^N).
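A minimal instance of this recursion, for a hypothetical two-leaf tree in which each leaf targets γ_c(x) = exp(−x²/2) and the root adds only a pairwise interaction (no new variable, so X̃_r = ∅ and the q_r term drops out); the interaction strength λ and particle count are illustrative, and Z_r = 2π/√(1 − λ²) is known in closed form, so the estimate can be checked:

```python
import numpy as np

rng = np.random.default_rng(4)
N, lam = 50_000, 0.4

def dc_smc_leaf():
    """Leaf: gamma_c(x) = exp(-x^2/2), proposal q_c = N(0, 1)."""
    x = rng.standard_normal(N)
    w = np.full(N, np.sqrt(2.0 * np.pi))   # gamma_c(x)/q_c(x) is constant here
    return x, w, np.mean(w)                # particles, weights, Z_c-hat

def dc_smc_root():
    """Root: gamma_r(x1, x2) = exp(-x1^2/2 - x2^2/2 - lam*x1*x2)."""
    x1, w1, Z1 = dc_smc_leaf()
    x2, w2, Z2 = dc_smc_leaf()
    # Step 1.2: resample each child to equal weights (trivial: weights constant).
    x1 = x1[rng.choice(N, N, p=w1 / w1.sum())]
    x2 = x2[rng.choice(N, N, p=w2 / w2.sum())]
    # Step 2.3: w_r = gamma_r / (gamma_c1 * gamma_c2); no q_r term, X_r empty.
    w = np.exp(-lam * x1 * x2)
    # Step 3: combine with the children's normalising-constant estimates.
    return np.column_stack([x1, x2]), w, np.mean(w) * Z1 * Z2

_, _, Z_hat = dc_smc_root()   # estimates 2*pi / sqrt(1 - lam**2)
```

Each leaf call is just importance sampling; the root call merges the two child populations and reweights them towards the joint target, exactly the coalescence step that distinguishes dc-smc from ordinary SIR.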
Theoretical Properties I

Unbiasedness of normalising constant estimates: provided that γ_t ≪ [∏_{c∈C(t)} γ_c] q_t for every t ∈ T, and an unbiased, exchangeable resampling scheme is applied to every population at every iteration, we have for any N ≥ 1:

    E[Ẑ_r^N] = Z_r = ∫ γ_r(x_r) dx_r.
Strong Law of Large Numbers

Under regularity conditions, the weighted particle system (x_r^{N,i}, w_r^{N,i})_{i=1}^N generated by dc-smc(r) is consistent, in that for all functions φ satisfying certain assumptions:

    Σ_{i=1}^N [w_r^{N,i} / Σ_{j=1}^N w_r^{N,j}] φ(x_r^{N,i}) → π(φ)  a.s. as N → ∞.
Some (Important) Extensions

1. Mixture resampling: merging children c_1, c_2 of t, we could resample from (up to normalisation)

    Σ_{i=1}^N Σ_{j=1}^N w_{c_1}^i w_{c_2}^j [dγ_t / d(γ_{c_1} ⊗ γ_{c_2})](x_{c_1}^i, x_{c_2}^j) δ_{(x_{c_1}^i, x_{c_2}^j)}.

2. Tempering (Del Moral et al., 2006): between γ_{c_1} ⊗ γ_{c_2} and γ_t.
3. Adaptation (Zhou, J. and Aston, 2016): of the tempering sequence and MCMC kernel parameters.
An Ising Model

- k indexes vertices v_k ∈ V ⊂ ℤ².
- Spins x_k ∈ {−1, +1}.
- p(x) ∝ e^{βE(x)}, β ≥ 0, with E(x) = Σ_{(k,l)∈E} x_k x_l.

We consider a grid of size 64 × 64 with β = 0.4407 (the critical temperature).
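For reference, the energy E(x) above sums x_k x_l over nearest-neighbour edges of the grid; a minimal sketch, assuming free boundary conditions:

```python
import numpy as np

def ising_energy(x):
    """E(x) = sum over nearest-neighbour pairs (k, l) of x_k * x_l,
    for a 2-D array of +/-1 spins with free boundary conditions."""
    return int((x[:-1, :] * x[1:, :]).sum() + (x[:, :-1] * x[:, 1:]).sum())

x_up = np.ones((64, 64), dtype=int)   # all-up configuration
# A 64x64 grid has 2 * 64 * 63 = 8064 nearest-neighbour pairs, each
# contributing +1 here; flipping one corner spin breaks its 2 bonds.
```

Computing the energy via two shifted element-wise products counts every edge exactly once, which is also the natural way to evaluate single-flip energy differences for the MH comparison below.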
A sequence of decompositions
[Figure: estimates of log Z and of E[E(x)] against wall-clock time (s, log scale, 10² to 10⁴) for SMC, D&C SMC (ann), D&C SMC (ann+mix), and single-flip MH.]

Summaries over 50 independent runs of each algorithm.
New York Schools Maths Test: Data

- Data organised into a tree T. A root-to-leaf path is: NYC (the root, denoted r ∈ T), borough, school district, school, year.
- Each leaf t ∈ T comes with an observation of m_t exam successes out of M_t trials.
- In total: 278,399 test instances; five boroughs (Manhattan, The Bronx, Brooklyn, Queens, Staten Island); 32 distinct districts; 710 distinct schools.
New York Schools Maths Test: Bayesian Model

- The number of successes m_t at a leaf t is Bin(M_t, p_t), where p_t = logistic(θ_t) and θ_t is a latent parameter.
- Internal nodes of the tree also have a latent θ_t.
- Model the change in θ along an edge e = (t → t′) as θ_{t′} = θ_t + ε_e, where ε_e ~ N(0, σ_e²).
- We put an improper prior (uniform on (−∞, ∞)) on θ_r.
- We also make the variances random, shared across siblings: σ_t² ~ Exp(1).
New York Schools Maths Test: Implementation

- The basic SIR implementation of dc-smc, using the natural hierarchical structure provided by the model.
- Given σ_t² and the θ_t at the leaves, the other random variables are multivariate normal, so we instantiate values for θ_t only at the leaves; at an internal node t, we sample only σ_t² and marginalize out θ_t.
- Each step of dc-smc is therefore either:
  i. at a leaf: sample p_t ~ Beta(1 + m_t, 1 + M_t − m_t) and set θ_t = logit(p_t);
  ii. at an internal node: sample σ_t² ~ Exp(1).
- Java implementation: https://github.com/alexandrebouchard/multilevelsmc
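The leaf step (i) amounts to proposing from the conjugate Beta posterior under a flat prior and mapping to the latent scale; a minimal sketch with hypothetical counts m_t and M_t:

```python
import numpy as np

rng = np.random.default_rng(5)
m_t, M_t = 120, 200                 # hypothetical successes / trials at one leaf

# Leaf proposal: p_t ~ Beta(1 + m_t, 1 + M_t - m_t), then theta_t = logit(p_t).
p = rng.beta(1 + m_t, 1 + M_t - m_t, size=10_000)
theta = np.log(p / (1.0 - p))       # logit transform to the latent scale
# The Beta posterior has mean (1 + m_t) / (2 + M_t).
```

Proposing from this conditionally conjugate distribution means each leaf particle is drawn exactly where the data put the mass, leaving the tree-structured prior terms to be accounted for in the weights at internal nodes.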
New York Schools Maths Test: Results

[Figure: posterior density approximations of the mean parameter for NY overall and each borough (Bronx, Kings, Manhattan, Queens, Richmond), DC with 10,000 particles.]

- Bronx County has the highest fraction (41%) of children (under 18) living below the poverty level.²
- Queens has the second lowest (19.7%), after Richmond (Staten Island, 16.7%).
- Staten Island contains a single school district, so the posterior distribution is much flatter for this borough.

² Statistics from the New York State Poverty Report 2013, http://ams.nyscommunityaction.org/resources/documents/news/nyscaas 2013 Poverty Report.pdf
Normalising Constant Estimates

[Figure: estimates of log Z against number of particles (10² to 10⁶, log scale) for the DC and standard (STD) sampling methods.]
Distributed Implementation

[Figure: wall-clock time (s) and speedup against number of compute nodes.]

Xeon X5650 2.66 GHz processors connected by a non-blocking Infiniband 4X QDR network.
Conclusions

- SMC ⊇ SIR; D&C-SMC ⊇ SIR + coalescence of populations.
- Distributed implementation is straightforward.
- The D&C strategy can improve even serial performance.
- Some questions remain unanswered:
  - How can we construct (near) optimal tree decompositions?
  - How much standard SMC theory can be extended to this setting?
- Some application areas are appealing:
  - Inference for phylogenetic trees in linguistics.
  - Principled aggregation of mass univariate analyses from neuroimaging.
References

[1] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo. Journal of the Royal Statistical Society B, 72(3):269–342, 2010.
[2] N. Chopin. A sequential particle filter method for static models. Biometrika, 89(3):539–551, 2002.
[3] P. Del Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Probability and Its Applications. Springer Verlag, New York, 2004.
[4] P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo methods for Bayesian computation. In Bayesian Statistics 8. Oxford University Press, 2006.
[5] A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan and B. Rozovsky, editors, The Oxford Handbook of Nonlinear Filtering, pages 656–704. Oxford University Press, 2011.
[6] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings-F, 140(2):107–113, April 1993.
[7] A. Jasra, D. A. Stephens, A. Doucet, and T. Tsagaris. Inference for Lévy-driven stochastic volatility models via adaptive sequential Monte Carlo. Scandinavian Journal of Statistics, 38(1):1–22, Dec. 2010.
[8] A. M. Johansen and A. Doucet. A note on the auxiliary particle filter. Statistics and Probability Letters, 78(12):1498–1504, September 2008. URL http://dx.doi.org/10.1016/j.spl.2008.01.032.
[9] F. Lindsten, A. M. Johansen, C. A. Naesseth, B. Kirkpatrick, T. Schön, J. A. D. Aston, and A. Bouchard-Côté. Divide and conquer with sequential Monte Carlo samplers. Journal of Computational and Graphical Statistics, 2016. In press.
[10] Y. Zhou, A. M. Johansen, and J. A. D. Aston. Towards automatic model comparison: An adaptive sequential Monte Carlo approach. Journal of Computational and Graphical Statistics, 25(3):701–726, 2016. doi: 10.1080/10618600.2015.1060885.