Sequential Monte Carlo Methods

University of Pennsylvania EABCN Training School May 10, 2016

Introduction Unfortunately, standard MCMC can be inaccurate, especially in medium and large-scale DSGE models: disentangling importance of internal versus external propagation mechanism; determining the relative importance of shocks. Previously: Modify MCMC algorithms to overcome weaknesses: blocking of parameters; tailoring of (mixture) proposal densities Now, we use sequential Monte Carlo (SMC) (more precisely, sequential importance sampling) instead: Better suited to handle irregular and multimodal posteriors associated with large DSGE models. Algorithms can be easily parallelized. SMC = Importance Sampling on Steriods. We build on Theoretical work: Chopin (2004); Del Moral, Doucet, Jasra (2006) Applied work: Creal (2007); Durham and Geweke (2011, 2012)

Importance Sampling Approximate π( ) by using a different, tractable density g(θ) that is easy to sample from. For more general problems, posterior density may be non-normalized. So we write π(θ) = p(y θ)p(θ) p(y ) = f (θ) f (θ)dθ. Importance sampling is based on the identity The ratio E π [h(θ)] = h(θ)π(θ)dθ = w(θ) = f (θ) g(θ) f (θ) h(θ) Θ Θ g(θ) g(θ)dθ f (θ) g(θ) g(θ)dθ. is called the (unnormalized) importance weight.

Importance Sampling 1 For i = 1 to N, draw θ i iid g(θ) and compute the unnormalized importance weights w i = w(θ i ) = f (θi ) g(θ i ). 2 Compute the normalized importance weights W i = 1 N w i N i=1 w i. An approximation of E π [h(θ)] is given by h N = 1 N N W i h(θ i ). i=1

Illustration If θ i s are draws from g( ) then E π [h] 1 N N i=1 h(θi )w(θ i ) 1 N N i=1 w(θi ), w(θ) = f (θ) g(θ). 0.40 0.35 f 0.30 g 1 0.25 g 0.20 2 0.15 0.10 0.05 0.00 6 4 2 0 2 4 6 weights f/g 1 f/g 2 6 4 2 0 2 4 6

Accuracy Since we are generating iid draws from g(θ), it s fairly straightforward to derive a CLT: It can be shown that N( hn E π [h]) = N ( 0, Ω(h) ), where Ω(h) = V g [(π/g)(h E π [h])]. Using a crude approximation (see, e.g., Liu (2008)), we can factorize Ω(h) as follows: Ω(h) V π [h] ( V g [π/g] + 1 ). The approximation highlights that the larger the variance of the importance weights, the less accurate the Monte Carlo approximation relative to the accuracy that could be achieved with an iid sample from the posterior. Users often monitor ESS = N V π[h] Ω(h) N 1 + V g [π/g].

From Importance Sampling to Sequential Importance Sampling In general, it s hard to construct a good proposal density g(θ), especially if the posterior has several peaks and valleys. Idea - Part 1: it might be easier to find a proposal density for π n (θ) = [p(y θ)]φn p(θ) [p(y θ)] φ np(θ)dθ = f n(θ) Z n. at least if φ n is close to zero. Idea - Part 2: We can try to turn a proposal density for π n into a proposal density for π n+1 and iterate, letting φ n φ N = 1.

Illustration: Our state-space model: [ y t = [1 1]s t, s t = Innovation: ɛ t iidn(0, 1). ] [ θ1 2 0 1 (1 θ1 2) θ 1θ 2 (1 θ1 2) s t 1 + 0 Prior: uniform on the square 0 θ 1 1 and 0 θ 2 1. Simulate T = 200 observations given θ = [0.45, 0.45], which is observationally equivalent to θ = [0.89, 0.22] ] ɛ t.

Illustration: Tempered Posteriors of θ 1 5 4 3 2 1 0 50 40 30 n 20 10 0.0 0.2 0.4 0.6 θ 1 0.8 1.0 π n (θ) = [p(y θ)]φn p(θ) = f ( ) λ n(θ) n, φ [p(y θ)] φ np(θ)dθ n = Z n N φ

Illustration: Posterior Draws θ 2 1.0 0.8 0.6 0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 θ 1

SMC Algorithm: A Graphical Illustration 10 5 0 5 10 C S M C S M C S M φ 0 φ 1 φ 2 φ 3 π n (θ) is represented by a swarm of particles {θ i n, W i n} N i=1 : h n,n = 1 N N i=1 W i nh(θ i n) a.s. E πn [h(θ n )]. C is Correction; S is Selection; and M is Mutation.

SMC Algorithm 1 Initialization. (φ 0 = 0). Draw the initial particles from the prior: iid p(θ) and W i 1 = 1, i = 1,..., N. θ i 1 2 Recursion. For n = 1,..., N φ, 1 Correction. Reweight the particles from stage n 1 by defining the incremental weights w i n = [p(y θ i n 1)] φn φ n 1 (1) and the normalized weights W n i w i = nwn 1 i 1 N N i=1 w, i = 1,..., N. (2) nw i n 1 i An approximation of E πn [h(θ)] is given by h n,n = 1 N 2 Selection. N i=1 W i nh(θ i n 1). (3)

SMC Algorithm 1 Initialization. 2 Recursion. For n = 1,..., N φ, 1 Correction. 2 Selection. (Optional Resampling) Let {ˆθ} N i=1 denote N iid draws from a multinomial distribution characterized by support points and weights {θ i n 1, W i n} N i=1 and set W i n = 1. An approximation of E πn [h(θ)] is given by ĥ n,n = 1 N N Wnh(ˆθ i n). i (4) i=1 3 Mutation. Propagate the particles {ˆθ i, W i n} via N MH steps of a MH algorithm with transition density θ i n K n(θ n ˆθ i n; ζ n) and stationary distribution π n(θ). An approximation of E πn [h(θ)] is given by h n,n = 1 N N h(θn)w i n. i (5) i=1

Remarks Correction Step: reweight particles from iteration n 1 to create importance sampling approximation of E πn [h(θ)] Selection Step: the resampling of the particles (good) equalizes the particle weights and thereby increases accuracy of subsequent importance sampling approximations; (not good) adds a bit of noise to the MC approximation. Mutation Step: adapts particles to posterior π n(θ); imagine we don t do it: then we would be using draws from prior p(θ) to approximate posterior π(θ), which can t be good! 5 4 3 2 1 0 50 40 30 n 20 10 0.4 0.2 0.0 0.6 θ1 0.8 1.0

Theoretical Properties Goal: strong law of large numbers (SLLN) and central limit theorem (CLT) as N for every iteration n = 1,..., N φ. Regularity conditions: proper prior; bounded likelihood function; 2 + δ posterior moments of h(θ). Idea of proof (Chopin, 2004): proceed recursively Initialization: SLLN and CLT for iid random variables because we sample from prior. Assume that n 1 approximation (with normalized weights) yields N ( 1 N Show that N ( 1 N ) N h(θn 1)W i n 1 i E πn 1 [h(θ)] = N ( 0, Ω ) n 1(h) i=1 ) N h(θn)w i n i E πn [h(θ)] = N ( 0, Ω ) n(h) i=1

Theoretical Properties: Correction Step Suppose that the n 1 approximation (with normalized weights) yields ( ) 1 N N h(θ i N n 1)Wn 1 i E πn 1 [h(θ)] = N ( 0, Ω n 1 (h) ) i=1 Then ( 1 N N N i=1 h(θi n 1 )[p(y ) θi n 1 )]φn φn 1 Wn 1 i 1 N N i=1 [p(y E θi n 1 )]φn φn 1 Wn 1 i πn [h(θ)] = N ( 0, Ω n (h) ) where Ω n (h) = Ω n 1 ( vn 1 (θ)(h E πn [h]) ) v n 1 (θ) = [p(y θ)] φn φn 1 Z n 1 Z n This step relies on likelihood evaluations from iteration n 1 that are already stored in memory.

Theoretical Properties: Selection / Resampling After resampling by drawing from iid multinomial distribution we obtain ( ) 1 N N h(ˆθ i )Wn i E πn [h] = N ( 0, N ˆΩ(h) ), where i=1 ˆΩ n (h) = Ω(h) + V πn [h] Disadvantage of resampling: it adds noise. Advantage of resampling: it equalizes the particle weights, reducing the variance of v n (θ) in Ω n+1 (h) = Ω n ( vn (θ)(h E πn+1 [h]).

Theoretical Properties: Mutation We are using the Markov transition kernel K n (θ ˆθ) to transform draws ˆθ i n into draws θ i n. To preserve the distribution of the ˆθ n s i it has to be the case that π n (θ) = K n (θ ˆθ)π n (ˆθ)d ˆθ. It can be shown that the overall asymptotic variance after the mutation is the sum of the variance of the approximation of the conditional mean E Kn( θ n 1 )[h(θ)] which is given by ˆΩ ( E Kn( θ n 1 )[h(θ)] ) ; a weighted average of the conditional variance V Kn( θ n 1 )[h(θ)]: W n 1(θ n 1)v n 1(θ n 1)V Kn( θ n 1 )[h(θ)]π n 1(θ n 1). This step is easily parallelizable.

More on Transition Kernel in Mutation Step Transition kernel K n (θ ˆθ n 1 ; ζ n ): generated by running M steps of a Metropolis-Hastings algorithm. Lessons from DSGE model MCMC: blocking of parameters can reduces persistence of Markov chain; mixture proposal density avoids getting stuck. Blocking: Partition the parameter vector θ n into N blocks equally sized blocks, denoted by θ n,b, b = 1,..., N blocks. (We generate the blocks for n = 1,..., N φ randomly prior to running the SMC algorithm.) Example: random walk proposal density: ( ) ϑ b (θn,b,m 1, i θn, b,m, i Σ n,b) N θn,b,m 1, i cnσ 2 n,b.

Adaptive Choice of ζ n = (Σ n, c n ) Infeasible adaption: Let Σ n = V πn [θ]. Adjust scaling factor according to c n = c n 1f ( 1 R n 1(ζ n 1) ), where R n 1( ) is population rejection rate from iteration n 1 and e16(x 0.25) f (x) = 0.95 + 0.10 1 + e. 16(x 0.25) Feasible adaption use output from stage n 1 to replace ζ n by ˆζ n : Use particle approximations of E πn [θ] and V πn [θ] based on {θ i n 1, W i n} N i=1. Use actual rejection rate from stage n 1 to calculate ĉ n = ĉ n 1f ( ˆRn 1(ˆζ n 1) ). Result: under suitable regularity conditions replacing ζ n by ˆζ n where n(ˆζ n ζ n ) = O p (1) does not affect the asymptotic variance of the MC approximation.

Adaption of SMC Algorithm for Stylized State-Space Model 1.0 Acceptance Rate 0.5 0.0 1.50 Scale Parameter c 0.75 0.00 1000 Effective Sample Size 500 0 n 10 20 30 40 50 Notes: The dashed line in the top panel indicates the target acceptance rate of 0.25.

Convergence of SMC Approximation for Stylized State-Space Model 0.12 0.10 0.08 0.06 0.04 0.02 NV [θ 1 ] NV [θ 2 ] 0.00 N 10000 15000 20000 25000 30000 35000 40000 Notes: The figure shows NV[ θ j ] for each parameter as a function of the number of particles N. V[ θ j ] is computed based on N run = 1, 000 runs of the SMC algorithm with N φ = 100. The width of the bands is (2 1.96) 3/N run (NV[ θ j ]).

More on Resampling So far, we have used multinomial resampling. It s fairly intuitive and it is straightforward to obtain a CLT. But: multinominal resampling is not particularly efficient. The book contains a section on alternative resampling schemes (stratified resampling, residual resampling...) These alternative techniques are designed to achieve a variance reduction. Most resampling algorithms are not parallelizable because they rely on the normalized particle weights.

Application 1: Small Scale New Keynesian Model We will take a look at the effect of various tuning choices on accuracy: Tempering schedule λ: λ = 1 is linear, λ > 1 is convex. Number of stages N φ versus number of particles N. Number of blocks in mutation step versus number of particles.

Effect of λ on Inefficiency Factors InEff N [ θ] 10 4 10 3 10 2 10 1 10 0 1 2 3 4 5 6 7 8 λ Notes: The figure depicts hairs of InEff N [ θ] as function of λ. The inefficiency factors are computed based on N run = 50 runs of the SMC algorithm. Each hair corresponds to a DSGE model parameter.

Number of Stages N φ vs Number of Particles N 10 1 10 0 10 1 10 2 10 3 ρ g σ r ρ r ρ z σ g κ σ z π (A) ψ 1 γ (Q) ψ 2 r (A) τ N φ = 400, N = 250 N φ = 200, N = 500 N φ = 100, N = 1000 N φ = 50, N = 2000 N φ = 25, N = 4000 Notes: Plot of V[ θ]/v π [θ] for a specific configuration of the SMC algorithm. The inefficiency factors are computed based on N run = 50 runs of the SMC algorithm. N blocks = 1, λ = 2, N MH = 1.

Number of blocks N blocks in Mutation Step vs Number of Particles N 10 0 10 1 10 2 10 3 ρ g σ g σ z σ r ρ r κ ρ z π (A) ψ 1 τ γ (Q) r (A) ψ 2 Nblocks = 4, N = 250 Nblocks = 2, N = 500 Nblocks = 1, N = 1000 Notes: Plot of V[ θ]/v π [θ] for a specific configuration of the SMC algorithm. The inefficiency factors are computed based on N run = 50 runs of the SMC algorithm. N φ = 100, λ = 2, N MH = 1.

A Few Words on Posterior Model Probabilities Posterior model probabilities π i,t = π i,0 p(y 1:T M i ) M j=1 π j,0p(y 1:T M j ) where p(y 1:T M i ) = p(y 1:T θ (i), M i )p(θ (i) M i )dθ (i) For any model: ln p(y 1:T M i ) = T t=1 ln p(y t θ (i), Y 1:t 1, M i )p(θ (i) Y 1:t 1, M i )dθ (i) Marginal data density p(y 1:T M i ) arises as a by-product of SMC.

Marginal Likelihood Approximation Recall w i n = [p(y θ i n 1 )]φn φn 1. Then 1 N N w nw i n 1 i i=1 = p φn 1 (Y θ)p(θ) [p(y θ)] φn φn 1 dθ p φ n 1(Y θ)p(θ)dθ p(y θ) φ n p(θ)dθ p(y θ) φ n 1p(θ)dθ Thus, N φ ( ) 1 N w i N nwn 1 i n=1 i=1 p(y θ)p(θ)dθ.

SMC Marginal Data Density Estimates N φ = 100 N φ = 400 N Mean(ln ˆp(Y )) SD(ln ˆp(Y )) Mean(ln ˆp(Y )) SD(ln ˆp(Y )) 500-352.19 (3.18) -346.12 (0.20) 1,000-349.19 (1.98) -346.17 (0.14) 2,000-348.57 (1.65) -346.16 (0.12) 4,000-347.74 (0.92) -346.16 (0.07) Notes: Table shows mean and standard deviation of log marginal data density estimates as a function of the number of particles N computed over N run = 50 runs of the SMC sampler with N blocks = 4, λ = 2, and N MH = 1.

Application 2: Estimation of Smets and Wouters (2007) Model Benchmark macro model, has been estimated many (many) times. Core of many larger-scale models. 36 estimated parameters. RWMH: 10 million draws (5 million discarded); SMC: 500 stages with 12,000 particles. We run the RWM (using a particular version of a parallelized MCMC) and the SMC algorithm on 24 processors for the same amount of time. We estimate the SW model twenty times using RWM and SMC and get essentially identical results.

Application 2: Estimation of Smets and Wouters (2007) Model More interesting question: how does quality of posterior simulators change as one makes the priors more diffuse? Replace Beta by Uniform distributions; increase variances of parameters with Gamma and Normal prior by factor of 3.

SW Model with DIFFUSE Prior: Estimation stability RWH (black) versus SMC (red) 4 3 2 1 0 1 2 3 4 l ι w µ p µ w ρ w ξ w r π

A Measure of Effective Number of Draws Suppose we could generate iid N eff ( Ê π [θ] approx N E π [θ], 1 N eff V π [θ] draws from posterior, then ). We can measure the variance of Ê π [θ] by running SMC and RWM algorithm repeatedly. Then, N eff V π[θ] V [ Ê π [θ] ]

Effective Number of Draws SMC RWMH Parameter Mean STD(Mean) N eff Mean STD(Mean) N eff σ l 3.06 0.04 1058 3.04 0.15 60 l -0.06 0.07 732-0.01 0.16 177 ι p 0.11 0.00 637 0.12 0.02 19 h 0.70 0.00 522 0.69 0.03 5 Φ 1.71 0.01 514 1.69 0.04 10 r π 2.78 0.02 507 2.76 0.03 159 ρ b 0.19 0.01 440 0.21 0.08 3 ϕ 8.12 0.16 266 7.98 1.03 6 σ p 0.14 0.00 126 0.15 0.04 1 ξ p 0.72 0.01 91 0.73 0.03 5 ι w 0.73 0.02 87 0.72 0.03 36 µ p 0.77 0.02 77 0.80 0.10 3 ρ w 0.69 0.04 49 0.69 0.09 11 µ w 0.63 0.05 49 0.63 0.09 11 ξ w 0.93 0.01 43 0.93 0.02 8

A Closer Look at the Posterior: Two Modes Parameter Mode 1 Mode 2 ξ w 0.844 0.962 ι w 0.812 0.918 ρ w 0.997 0.394 µ w 0.978 0.267 Log Posterior -804.14-803.51 Mode 1 implies that wage persistence is driven by extremely exogenous persistent wage markup shocks. Mode 2 implies that wage persistence is driven by endogenous amplification of shocks through the wage Calvo and indexation parameter. SMC is able to capture the two modes.

A Closer Look at the Posterior: Internal ξ w versus External ρ w Propagation 1 0.9 0.8 0.7 0.6 ρ w 0.5 0.4 0.3 0.2 0.1 0 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 ξ w

Stability of Posterior Computations: RWH (black) versus SMC (red) 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 P (ξ w > ρ w) P (ρ w > µ w) P (ξ w > µ w) P (ξ p > ρ p) P (ρ p > µ p) P (ξ p > µ p)

Marginal Data Density Bayesian model evaluation is based on marginal data density p n (Y ) = [p(y θ)] φn p(θ)dθ. Recall that marginal data density is a by-product of the importance sampler: ˆp n (Y ) = Ẑn = 1 N N i=1 W i n. Algorithm, Method MDD Estimate Standard Deviation SMC, Particle Estimate -873.46 0.24 RWM, Harmonic Mean -874.56 1.20

Application 3: News Shock Model of Schmitt-Grohe and Uribe (2012) Real business cycle model; investment adjustment costs; variable capacity utilitization; Jaimovich-Rebelo (2009) preferences; permanent and stationary neutral and investment-specific technology shocks, goverment spending, wage mark-up, and preference shocks. Data: Real GDP, consumption, investment, government spending, hours, TFP, price of investment growth rates from 1955:Q2-2006:Q4.

Key Model Features News Shocks: each of the 7 exogenous processes has two anticipated components z t = ρz t 1 + ɛ 0 t }{{} unanticipated + ɛ 4 t 4 }{{} anticipated 4 qtrs ahead + ɛ 8 t 8 }{{} anticipated 8 qtrs ahead Jaimovich-Rebelo Pref: Households have CRRA preferences over consumption bundle V t : V t = C t bc t ψh θ t S t S t is a geometric average of current and past habit-adjusted consumption levels. S t = (C t bc t 1 ) γ (S t 1 ) 1 γ γ 0 : GHH preferences. Wealth effect on labor is small. γ 1 : Standard preferences.

Effective Number of Draws SMC RWMH Parameter Mean STD(Mean) N eff Mean STD(Mean) N eff σ 4 z 3.14 0.04 4190 3.08 0.24 108 i σg 8 0.41 0.01 1830 0.41 0.03 108 σ 8 z 5.59 0.09 1124 5.68 0.30 102 i θ 4.13 0.02 671 4.14 0.05 146 σ 0 z 12.27 0.09 640 12.36 0.09 616 i σµ 0 1.04 0.04 626 0.92 0.04 625 σg 0 0.62 0.01 609 0.60 0.03 111 κ 9.32 0.05 578 9.33 0.09 208 σζ 4 2.43 0.09 406 2.44 0.04 2066 σζ 0 3.82 0.10 384 3.80 0.22 73 σζ 8 2.65 0.11 335 2.62 0.18 126 σµ 4 4.26 0.24 49 4.33 0.49 12 σµ 8 1.36 0.24 46 1.34 0.49 11

Histogram of Wage Markup Process Parameters: RWH (black) versus SMC (red) 0 ρ µ σ µ 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0.9 0.92 0.94 0.96 0.98 1 0 0 2 4 6 0.4 σ µ 4 0.4 σ µ 8 0.3 0.3 0.2 0.2 0.1 0.1 0 0 2 4 6 0 0 2 4 6

Histogram of Preference Parameter γ: RWH (black) versus SMC (red) 1 8 x 10 3 γ γ 0.9 0.8 0.7 0.6 7 6 5 0.5 0.4 0.3 0.2 0.1 4 3 2 1 0 0 0 0.1 0.1 0.2 0.3 0.4 0.5 0.5 0.6 0.6 0.7 0.7 0.8 0.8 0.9 0.9 γ 0.6 implies that the importance of anticipated shocks for hours is substantially diminished.

An Alternative Prior for γ SGU use a Uniform prior for γ. We now change the prior to Beta(2, 1) (density is straight line between 0 and 1).

Histogram of Preference Parameter γ Alternative Prior: RWH (black) versus SMC (red) 1 γ 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

Anticipated Shocks Variance Shares for Hours: RWH (black) versus SMC (red) SGU Prior Beta(2, 1) Prior 1 Hours 1 Hours 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1