Particle Learning and Smoothing


1 Particle Learning and Smoothing. Hedibert Freitas Lopes, The University of Chicago Booth School of Business. September 18th, 2009, Instituto Nacional de Pesquisas Espaciais, São José dos Campos, Brazil. Joint work with Nick Polson, Carlos Carvalho and Mike Johannes.

2 Outline of the talk. Let the general dynamic model be
Observation equation: p(y_{t+1} | x_{t+1}, θ)
System equation: p(x_{t+1} | x_t, θ)
We will talk about:
- MCMC in normal dynamic linear models;
- Particle filters: learning states x_{t+1};
- Particle filters: learning parameters θ;
- The Particle Learning (PL) framework;
- More general dynamic models;
- Final remarks.

3 Normal dynamic linear model (NDLM). West and Harrison (1997): hidden Markovian structure
y_{t+1} | x_{t+1} ~ N(F_{t+1} x_{t+1}, V_{t+1})
x_{t+1} | x_t ~ N(G_{t+1} x_t, W_{t+1})
Sequential learning cycles through
p(x_t | y^t) → p(x_{t+1} | y^t) → p(y_{t+1} | y^t) → p(x_{t+1} | y^{t+1})
where y^t = (y_1, ..., y_t).

4 Example i. Local level model. Let θ = (σ², τ²) and x_0 ~ N(m_0, C_0), with
y_{t+1} | x_{t+1}, θ ~ N(x_{t+1}, σ²)
x_{t+1} | x_t, θ ~ N(x_t, τ²)
Kalman filter recursions:
Posterior at t: (x_t | y^t) ~ N(m_t, C_t)
Prior at t+1: (x_{t+1} | y^t) ~ N(m_t, R_{t+1})
Predictive at t+1: (y_{t+1} | y^t) ~ N(m_t, Q_{t+1})
Posterior at t+1: (x_{t+1} | y^{t+1}) ~ N(m_{t+1}, C_{t+1})
where R_{t+1} = C_t + τ², Q_{t+1} = R_{t+1} + σ², A_{t+1} = R_{t+1}/Q_{t+1}, C_{t+1} = A_{t+1} σ², and m_{t+1} = (1 - A_{t+1}) m_t + A_{t+1} y_{t+1}.
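For concreteness, here is a minimal Python/NumPy sketch of these recursions for the local level model; the names kalman_local_level, sig2, tau2, m0 and C0 are illustrative and not from the slides.

    import numpy as np

    def kalman_local_level(y, sig2, tau2, m0, C0):
        """Forward filtering for y_t ~ N(x_t, sig2), x_t ~ N(x_{t-1}, tau2)."""
        n = len(y)
        m = np.empty(n)   # posterior means m_t
        C = np.empty(n)   # posterior variances C_t
        m_prev, C_prev = m0, C0
        for t in range(n):
            R = C_prev + tau2      # prior variance R_{t+1} = C_t + tau2
            Q = R + sig2           # predictive variance Q_{t+1} = R_{t+1} + sig2
            A = R / Q              # adaptive coefficient A_{t+1}
            m[t] = (1 - A) * m_prev + A * y[t]
            C[t] = A * sig2        # C_{t+1} = A_{t+1} * sig2
            m_prev, C_prev = m[t], C[t]
        return m, C

    # Example usage: data as in Example i (n = 100, sig2 = 1.0, tau2 = 0.5)
    rng = np.random.default_rng(0)
    x = np.cumsum(rng.normal(0.0, np.sqrt(0.5), 100))
    y = x + rng.normal(0.0, 1.0, 100)
    m, C = kalman_local_level(y, 1.0, 0.5, 0.0, 100.0)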

5 Example i. Backward smoothing. For t = n, x_n | y^n ~ N(m_n^n, C_n^n), where
m_n^n = m_n and C_n^n = C_n.
For t < n, x_t | y^n ~ N(m_t^n, C_t^n), where
m_t^n = (1 - B_t) m_t + B_t m_{t+1}^n
C_t^n = (1 - B_t) C_t + B_t² C_{t+1}^n
and B_t = C_t / (C_t + τ²).

6 Example i. Backward sampling. For t = n, x_n | y^n ~ N(a_n^n, R_n^n), where
a_n^n = m_n and R_n^n = C_n.
For t < n, x_t | x_{t+1}, y^n ~ N(a_t^n, R_t^n), where
a_t^n = (1 - B_t) m_t + B_t x_{t+1}
R_t^n = B_t τ²
and B_t = C_t / (C_t + τ²).
This is basically the forward filtering, backward sampling (FFBS) algorithm commonly used to sample from p(x^n | y^n) (Carter and Kohn, 1994, and Frühwirth-Schnatter, 1994).
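A matching backward-sampling pass, again only a sketch under the same local-level assumptions (it reuses the filtered moments m_t, C_t from the sketch above), could look like this:

    import numpy as np

    def ffbs_local_level(m, C, tau2, rng):
        """Backward sampling: draw one path x_{1:n} from p(x^n | y^n)."""
        n = len(m)
        x = np.empty(n)
        x[-1] = rng.normal(m[-1], np.sqrt(C[-1]))      # x_n | y^n ~ N(m_n, C_n)
        for t in range(n - 2, -1, -1):
            B = C[t] / (C[t] + tau2)                   # B_t = C_t / (C_t + tau2)
            mean = (1 - B) * m[t] + B * x[t + 1]       # a_t^n
            var = B * tau2                             # R_t^n = B_t * tau2
            x[t] = rng.normal(mean, np.sqrt(var))
        return x

    # draws = np.array([ffbs_local_level(m, C, 0.5, rng) for _ in range(1000)])
    # (reusing m, C and rng from the Kalman filter sketch above)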

7 Example i. Simulated data with n = 100, σ² = 1.0, τ² = 0.5 and x_0 = 0. [Figure: time series of y(t) and x(t).]

8 Example i. p(x_t | y^t, θ) and p(x_t | y^n, θ) for t ≤ n, with prior mean m_0 = 0.0. [Figure: forward filtering and backward smoothing distributions over time.]

9 Non-Gaussian, nonlinear dynamic models. The dynamic model is given by
p(y_{t+1} | x_{t+1}) and p(x_{t+1} | x_t)
for t = 1, ..., n, together with a prior p(x_0). The prior and posterior at t+1, i.e.
p(x_{t+1} | y^t) = ∫ p(x_{t+1} | x_t) p(x_t | y^t) dx_t
p(x_{t+1} | y^{t+1}) ∝ p(y_{t+1} | x_{t+1}) p(x_{t+1} | y^t)
are usually unavailable in closed form. Over the last 20 years: Carlin, Polson and Stoffer (1992) for more general DMs; Carter and Kohn (1994) and Frühwirth-Schnatter (1994) for conditionally Gaussian DLMs; Gamerman (1998) for generalized DLMs.

10 The Bayesian bootstrap filter (BBF). Gordon, Salmond and Smith (1993) use a propagate/resample scheme to go from p(x_t | y^t) to p(x_{t+1} | y^t) to p(x_{t+1} | y^{t+1}).
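One propagate/resample step of this bootstrap filter, sketched for the local-level model with known parameters (all names illustrative):

    import numpy as np

    def bbf_step(particles, y_next, sig2, tau2, rng):
        """One propagate/resample step of the bootstrap filter (Gordon et al., 1993)."""
        N = len(particles)
        # Propagate: draw x_{t+1}^(i) ~ p(x_{t+1} | x_t^(i))
        prop = particles + rng.normal(0.0, np.sqrt(tau2), N)
        # Weight by the likelihood p(y_{t+1} | x_{t+1}^(i))
        logw = -0.5 * (y_next - prop) ** 2 / sig2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Resample with replacement
        idx = rng.choice(N, size=N, p=w)
        return prop[idx]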

11 Resample or not resample? [Figure: particle paths over time.]

12 Weights. [Figure: particle weights over time.]

13 Fully adapted BBF (FABBF).
Posterior at t: {x_t^(i)}_{i=1}^N ~ p(x_t | y^t).
Propagate: draw {x̃_{t+1}^(i)}_{i=1}^N from p(x_{t+1} | x_t^(i), y_{t+1}).
Resample: draw {x_{t+1}^(j)}_{j=1}^N from {x̃_{t+1}^(i)}_{i=1}^N with weights w_{t+1}^(i) ∝ p(y_{t+1} | x_t^(i)).
Posterior at t+1: {x_{t+1}^(i)}_{i=1}^N ~ p(x_{t+1} | y^{t+1}).

14 The auxiliary particle filter (APF).
Posterior at t: {x_t^(i)}_{i=1}^N ~ p(x_t | y^t).
Resample: draw {x̃_t^(j)}_{j=1}^N from {x_t^(i)}_{i=1}^N with weights w_{t+1}^(i) ∝ p(y_{t+1} | g(x_t^(i))), where g(x_t) = E(x_{t+1} | x_t).
Propagate: draw {x_{t+1}^(i)}_{i=1}^N from p(x_{t+1} | x̃_t^(i)) and resample (SIR) with weights ω_{t+1}^(j) ∝ p(y_{t+1} | x_{t+1}^(j)) / p(y_{t+1} | g(x_t^(k_j))).
Posterior at t+1: {x_{t+1}^(i)}_{i=1}^N ~ p(x_{t+1} | y^{t+1}).
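A sketch of one APF step for the same local-level model, taking g(x_t) = E(x_{t+1} | x_t) = x_t; the two-stage weights mirror the slide, while the implementation details and names are assumptions of this sketch.

    import numpy as np

    def apf_step(particles, y_next, sig2, tau2, rng):
        """One resample/propagate step of the auxiliary particle filter (Pitt and Shephard, 1999)."""
        N = len(particles)
        g = particles                                  # g(x_t) = E(x_{t+1} | x_t) = x_t here
        # First-stage weights: p(y_{t+1} | g(x_t^(i))), evaluated under N(g, sig2)
        logw1 = -0.5 * (y_next - g) ** 2 / sig2
        w1 = np.exp(logw1 - logw1.max())
        w1 /= w1.sum()
        k = rng.choice(N, size=N, p=w1)                # indices k_j
        # Propagate from the transition p(x_{t+1} | x_t^(k_j))
        prop = particles[k] + rng.normal(0.0, np.sqrt(tau2), N)
        # Second-stage weights: p(y_{t+1} | x_{t+1}^(j)) / p(y_{t+1} | g(x_t^(k_j)))
        logw2 = -0.5 * (y_next - prop) ** 2 / sig2 - logw1[k]
        w2 = np.exp(logw2 - logw2.max())
        w2 /= w2.sum()
        idx = rng.choice(N, size=N, p=w2)
        return prop[idx]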

15 How about p(θ | y^n)? A two-step strategy. In the first step, approximate p(θ | y^n) by
p^N(θ | y^n) = p^N(y^n | θ) p(θ) / p(y^n) ∝ p^N(y^n | θ) p(θ)
where p^N(y^n | θ) is an SMC approximation to p(y^n | θ). Then, in the second step, sample θ via an MCMC scheme or a SIR scheme.
Problem 1: SMC loses its appealing sequential nature.
Problem 2: The overall sampling scheme is sensitive to p^N(y^n | θ).
Footnote: See Fernández-Villaverde and Rubio-Ramírez (2007) Estimating Macroeconomic Models: A Likelihood Approach and DeJong, Dharmarajan, Liesenfeld, Moura and Richard (2009) Efficient Likelihood Evaluation of State-Space Representations for applications of this two-step strategy to DSGE and related models.

16 Example ii. Exact integrated likelihood p(y^n | σ², τ²), with n = 100, x_0 = 0, σ² = 1, τ² = 0.5 and prior x_0 ~ N(0.0, 100). Grid: σ² in (0.1, ..., 2) and τ² in (0.1, ..., 3). [Figure: likelihood contours at t = 10, 20, ..., 100.]

17 Example ii. Approximated p^N(y^n | σ², τ²), based on N = 1000 particles. [Figure: likelihood contours at t = 10, 20, ..., 100.]

18 Another (perhaps more natural) idea: sequentially learning x_t and θ.
Posterior at t: p(x_t | θ, y^t) p(θ | y^t)
Prior at t+1: p(x_{t+1} | θ, y^t) p(θ | y^t)
Posterior at t+1: p(x_{t+1} | θ, y^{t+1}) p(θ | y^{t+1})
Advantages:
- Sequential updates of p(θ | y^t), p(x_t | y^t) and p(θ, x_t | y^t);
- Sequential h-steps-ahead forecasts p(y_{t+h} | y^t);
- Sequential approximations to p(y_t | y^{t-1});
- Sequential Bayes factors (a small sketch follows)
B_{12,t} = Π_{j=1}^t p(y_j | y^{j-1}, M_1) / Π_{j=1}^t p(y_j | y^{j-1}, M_2)
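Given the one-step-ahead predictive densities p(y_j | y^{j-1}, M_k) produced by each model's filter, the sequential Bayes factor is a running product; a minimal sketch (array names illustrative):

    import numpy as np

    def sequential_log_bayes_factor(pred_m1, pred_m2):
        """Cumulative log Bayes factor of M1 vs M2 from one-step predictive densities."""
        log_bf = np.cumsum(np.log(pred_m1)) - np.cumsum(np.log(pred_m2))
        return log_bf   # B_{12,t} = exp(log_bf[t-1])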

19 Liu and West filter (LWF).
Start with {(x_t, θ)^(i)}_{i=1}^N ~ p(x_t, θ | y^t). Compute θ̄, V and m(θ^(i)) = a θ^(i) + (1 - a) θ̄, for all i.
Resampling:
- Compute g(x_t^(i)) = E(x_{t+1} | x_t^(i), m(θ^(i))), for all i.
- Compute w_{t+1}^(i) ∝ p(y_{t+1} | g(x_t^(i)), m(θ^(i))), for all i.
- Resample (x̃_t, θ̃)^(i) from {(x_t, θ, w_{t+1})^(j)}_{j=1}^N.
Sampling:
- Sample θ̃^(i) ~ N(m(θ̃^(i)), (1 - a²)V), for all i.
- Sample x_{t+1}^(i) ~ p(x_{t+1} | x̃_t^(i), θ̃^(i)), for all i.
- Compute ω_{t+1}^(i) ∝ p(y_{t+1} | x_{t+1}^(i), θ̃^(i)) / p(y_{t+1} | g(x̃_t^(i)), m(θ̃^(i))), for all i.
- Resample (x_{t+1}, θ)^(i) from {(x_{t+1}, θ̃, ω_{t+1})^(j)}_{j=1}^N.
End with {(x_{t+1}, θ)^(i)}_{i=1}^N ~ p(x_{t+1}, θ | y^{t+1}).
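For concreteness, one Liu-West step for the local-level model with unknown σ², carried on the log scale (theta = log σ²), might be sketched as follows; the parameterization and all names are assumptions of this sketch, not the authors' code.

    import numpy as np

    def liu_west_step(x, theta, y_next, tau2, a, rng):
        """One Liu-West step: x are state particles, theta = log(sig2) particles."""
        N = len(x)
        V = np.var(theta)
        m = a * theta + (1 - a) * theta.mean()         # shrunk kernel locations m(theta^(i))
        # Resampling stage: weights p(y_{t+1} | g(x_t^(i)), m(theta^(i))), with g(x_t) = x_t
        sig2_m = np.exp(m)
        logw = -0.5 * np.log(sig2_m) - 0.5 * (y_next - x) ** 2 / sig2_m
        w = np.exp(logw - logw.max())
        w /= w.sum()
        k = rng.choice(N, size=N, p=w)
        # Sampling stage: jitter parameters, then propagate states
        theta_new = rng.normal(m[k], np.sqrt((1 - a**2) * V))
        sig2_new = np.exp(theta_new)
        x_new = x[k] + rng.normal(0.0, np.sqrt(tau2), N)
        # Second-stage weights and final resample
        logw2 = (-0.5 * np.log(sig2_new) - 0.5 * (y_next - x_new) ** 2 / sig2_new) - logw[k]
        w2 = np.exp(logw2 - logw2.max())
        w2 /= w2.sum()
        idx = rng.choice(N, size=N, p=w2)
        return x_new[idx], theta_new[idx]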

20 Choosing a. They approximate p(θ | y^t) by an N-component mixture of normals,
p(θ | y^t) ≈ Σ_{i=1}^N ω_t^(i) f_N(θ; a θ^(i) + (1 - a) θ̄, (1 - a²) V)
where θ̄ and V approximate the mean and variance of (θ | y^t). A discount factor argument is used to set the shrinkage quantity
a = (3δ - 1) / (2δ).
For example, δ = 0.50 leads to a = 0.50, δ = 0.75 leads to a ≈ 0.833, δ = 0.95 leads to a ≈ 0.974, and δ = 1.00 leads to a = 1. When a = 1.0, the particles of θ degenerate over time.
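The mapping from the discount factor δ to the shrinkage constant a is a one-liner; the values quoted above follow directly from it:

    def shrinkage(delta):
        """a = (3*delta - 1) / (2*delta), the Liu-West shrinkage constant."""
        return (3 * delta - 1) / (2 * delta)

    for d in (0.50, 0.75, 0.95, 1.00):
        print(d, round(shrinkage(d), 4))   # 0.5, 0.8333, 0.9737, 1.0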

21 Example iii. N = 2000, x_0 = 25, with variances 0.1 and 0.05. Four Liu-West variants are compared: LW1 (log scale), LW2, LW3 (LW1 + o.p.) and LW4 (LW2 + o.p.), where o.p. = optimal propagation. [Figure: panels for LW1 to LW4 with delta = 0.75 and delta = 0.95.]

22 Particle Learning (PL).
Advantages:
- Sequential learning about (x_t, θ);
- A real/practical alternative to MCMC methods;
- Sequential model assessment;
- Applicable in a wide range of dynamic and static models.
Key issues in our approach:
- Reverse the Kalman filter logic: resample/propagate;
- Estimation of fixed, unknown parameters;
- Work with conditional sufficient statistics;
- Use SMC for smoothing.

23 Parameter learning. It is assumed that p(θ | x^t, y^t) = p(θ | s_t), where s_t is a recursively defined sufficient statistic (SS), s_{t+1} = S(s_t, x_{t+1}, y_{t+1}).
- SS are just another state with a deterministic evolution;
- Filtering SS provides a mechanism for replenishing the parameters, avoiding degeneracy;
- Reduction of the variance of the resampling weights: Rao-Blackwellization.
Footnote: Storvik (2002) and Fearnhead (2002).
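As a simple instance of s_{t+1} = S(s_t, x_{t+1}, y_{t+1}), consider (as an assumption for this sketch) the local-level model with unknown observation variance σ² and a conjugate IG(a_0, b_0) prior; the conditional sufficient statistics are then s_t = (a_t, b_t), updated as follows:

    def update_suff_stats(a_t, b_t, x_next, y_next):
        """IG(a, b) sufficient-statistic recursion for sigma^2 given the state path:
        p(sigma^2 | x^{t+1}, y^{t+1}) = IG(a_{t+1}, b_{t+1})."""
        a_next = a_t + 0.5
        b_next = b_t + 0.5 * (y_next - x_next) ** 2
        return a_next, b_next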

24 Reverse the Kalman filter logic. Traditional logic for filtering (Kalman, 1960): predict/update
p(x_{t+1} | y^t) = ∫ p(x_{t+1} | x_t) p(x_t | y^t) dx_t
p(x_{t+1} | y^{t+1}) ∝ p(y_{t+1} | x_{t+1}) p(x_{t+1} | y^t).
By Bayes' rule, we can reverse this logic:
p(x_t | y^{t+1}) ∝ p(y_{t+1} | x_t) p(x_t | y^t)
p(x_{t+1} | y^{t+1}) = ∫ p(x_{t+1} | x_t, y_{t+1}) p(x_t | y^{t+1}) dx_t.
We will first resample (smooth) and then propagate. Since the information in y_{t+1} is used in both steps, the algorithm will be more efficient.

25 The general approach. Assume that at time t, {(x_t, s_t)^(i)}_{i=1}^N approximates p^N(x_t, s_t | y^t). Once y_{t+1} is observed, the resample/propagate rule is
Resampling: p(x_t, s_t | y^{t+1}) ∝ p(y_{t+1} | x_t, s_t) p(x_t, s_t | y^t)
Propagation: p(x_{t+1} | y^{t+1}) = ∫ p(x_{t+1} | x_t, s_t, y_{t+1}) p(x_t, s_t | y^{t+1}) dx_t ds_t

26 Perfect adaptation. Target: p(x_{t+1}, x_t | y^{t+1}). The importance sampling weights are
w ∝ p(x_{t+1} | x_t, y_{t+1}) p(y_{t+1} | x_t) p(x_t | y^t) / [q_1(x_t | y^{t+1}) q_2(x_{t+1} | x_t, y_{t+1})].
Choosing q_1(x_t | y^{t+1}) = p(x_t | y^{t+1}) ∝ p(y_{t+1} | x_t) p(x_t | y^t) and q_2(x_{t+1} | x_t, y_{t+1}) = p(x_{t+1} | x_t, y_{t+1}) gives w ∝ 1.
This is the ideal scenario and should serve as a guiding principle in the construction of sequential Monte Carlo filters. Marginalization is key!
Footnote: Perfect adaptation is also discussed in Pitt and Shephard (1999) and Johansen and Doucet (2008).

27 The general PL algorithm.
Posterior at t: Φ_t = {(x_t, θ)^(i)}_{i=1}^N ~ p(x_t, θ | y^t).
1. Compute, for i = 1, ..., N, the weights w_{t+1}^(i) ∝ p(y_{t+1} | x_t^(i), θ^(i)).
2. Resample from Φ_t with weights w_{t+1}: Φ̃_t = {(x̃_t, θ̃)^(i)}_{i=1}^N.
3. Propagate states: x_{t+1}^(i) ~ p(x_{t+1} | x̃_t^(i), θ̃^(i), y_{t+1}).
4. Update sufficient statistics: s_{t+1}^(i) = S(s_t^(i), x_{t+1}^(i), y_{t+1}).
5. Sample parameters: θ^(i) ~ p(θ | s_{t+1}^(i)).
Posterior at t+1: {(x_{t+1}, θ)^(i)}_{i=1}^N ~ p(x_{t+1}, θ | y^{t+1}).
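Putting the pieces together, here is a compact sketch of one PL step for the local-level model with unknown σ² (IG prior) and known τ²; it follows the resample/propagate/update/sample steps above, with illustrative names throughout.

    import numpy as np

    def pl_step(x, a, b, y_next, tau2, rng):
        """One PL step: x = state particles, (a, b) = IG sufficient statistics for sigma^2."""
        N = len(x)
        # Parameter draw sigma^2 ~ IG(a, b); equivalently, this could be step 5 of the
        # previous iteration, as on the slide.
        sig2 = 1.0 / rng.gamma(a, 1.0 / b)
        # 1. Predictive weights p(y_{t+1} | x_t, theta) = N(y_{t+1}; x_t, sig2 + tau2)
        var_pred = sig2 + tau2
        logw = -0.5 * np.log(var_pred) - 0.5 * (y_next - x) ** 2 / var_pred
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # 2. Resample states, sufficient statistics and parameters together
        k = rng.choice(N, size=N, p=w)
        x, a, b, sig2 = x[k], a[k], b[k], sig2[k]
        # 3. Propagate states from p(x_{t+1} | x_t, theta, y_{t+1})
        C = 1.0 / (1.0 / tau2 + 1.0 / sig2)
        m = C * (x / tau2 + y_next / sig2)
        x_new = rng.normal(m, np.sqrt(C))
        # 4. Update sufficient statistics s_{t+1} = S(s_t, x_{t+1}, y_{t+1})
        a_new = a + 0.5
        b_new = b + 0.5 * (y_next - x_new) ** 2
        return x_new, a_new, b_new

Starting from x^(i) ~ N(m_0, C_0) and (a^(i), b^(i)) = (a_0, b_0), repeated calls to pl_step over y_1, ..., y_n deliver sequential particle approximations to p(x_t, σ² | y^t).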

28 Example ii. p^N(θ | y^t) via PL for x_t and MCMC for (σ², τ²). Priors: σ² ~ U(0.1, 2) and τ² ~ U(0.1, 3). Based on N = 1000 particles. [Figure: posterior contours at t = 10, 20, ..., 100.]

29 Example ii. Trace plots for σ² and τ². [Figure: MCMC trace plots at t = 10, 50 and 100 for each parameter.]

30 Example ii. p^N(θ | y^t) via PL for (x_t, σ², τ²). Priors: σ² ~ U(0.1, 2) and τ² ~ U(0.1, 3). Based on N = 1000 particles. [Figure: posterior contours at t = 10, 20, ..., 100.]

31 Example iv. Comparing BBF, APF, FABBF and PL. Simulation: M = 20 data sets with n = 100 observations each, generated from
y_{t+1} | x_{t+1} ~ N(x_{t+1}, σ²)
x_{t+1} | x_t ~ N(x_t, τ²)
with x_0 = 0, τ² = 0.013 and σ² ∈ {5, 0.13, 0.013, 0.0013, ...}.
Particle filters: R = 20 replications of N = 1000 particles. Prior set up: x_0 ~ N(0, 10).

32 Example iv. Log relative mean square error. Let q_t^α be such that Pr(x_t < q_t^α | y^t) = α, for α ∈ {0.05, 0.25, 0.5, 0.75, 0.95}. Then the average MSE for filter f, time t and 100α-percentile is
MSE_{t,f}^α = (1/(MR)) Σ_{i=1}^M Σ_{j=1}^R (q̂_{tij,f}^α - q_{it}^α)².
We compare filters f_1 and f_2 based on the log relative MSE, i.e.
LRMSE_{t,f_1,f_2}^α = log(MSE_{t,f_1}^α / MSE_{t,f_2}^α).

33 Example iv. σ² = 5.0 and τ/σ = 0.05. Log relative MSE of (APF, BBF), (FABBF, BBF) and (PL, BBF). [Figure: panels for the 5th, 25th, 50th, 75th and 95th percentiles, each showing APF, FABBF and PL against BBF over time.]

34 Example iv. σ² = 0.13 and τ/σ = 0.32. [Figure: log relative MSE panels for the 5th, 25th, 50th, 75th and 95th percentiles, each showing APF, FABBF and PL against BBF over time.]

35 Example iv. σ² = 0.013 and τ/σ = 1.00. [Figure: log relative MSE panels for the 5th, 25th, 50th, 75th and 95th percentiles, each showing APF, FABBF and PL against BBF over time.]

36 Example iv. σ² = 0.0013 and τ/σ = 3.16. [Figure: log relative MSE panels for the 5th, 25th, 50th, 75th and 95th percentiles, each showing APF, FABBF and PL against BBF over time.]

37 SMC smoothers. SMC smoothers are alternatives to MCMC in state-space models. Godsill, Doucet and West (2004), Monte Carlo Smoothing for Nonlinear Time Series, introduced an O(TN²) algorithm that relies on forward particles and backward re-weighting via the evolution equation. See also Briers, Doucet and Maskell (2009), Smoothing Algorithms for State-Space Models, for other O(TN²) smoothers. An O(TN) smoothing algorithm is introduced by Fearnhead, Wyncoll and Tawn (2008), A Sequential Smoothing Algorithm with Linear Computational Cost.

38 Example v. PL and MCMC (filtering and smoothing). n = 100, σ² = 1, τ² = 0.5, x_0 = 0 and prior x_0 ~ N(0, 100). N = 1000 PL particles; M = 1000 MCMC draws, after discarding the first M_0 draws. [Figure: smoothed distributions (TRUE, MCMC, PL) and Kullback-Leibler divergences over time (PL, MCMC).]

39 Example v. Computing time (in seconds). [Table: PL versus MCMC times as n varies, with N = M_0 = M = 500; and as N varies, with n = 100 and N = M_0 = M.] MacBook, 2.4GHz Intel Duo processor, 4GB 1067MHz memory.

40 Example vi. Sample-resample or PL? Three series of length T = 1000 were simulated from
y_t | x_t, σ² ~ N(x_t, σ²)
x_t | x_{t-1}, τ² ~ N(x_{t-1}, τ²)
with x_0 = 0 and (σ², τ²) ∈ {(0.1, 0.01), (0.01, 0.01), (0.01, 0.1)}. Throughout, one of the two variances is kept fixed while the other is learned. The independent prior distributions for x_0 and the learned variance are x_0 ~ N(m_0, V_0) and IG(a, b), with a = 10, b = (a + 1) times the true value of the learned variance for a given study, m_0 = 0 and V_0 = 1. We also include BBF in the comparison, for completeness. In all filters the learned variance is sampled offline from p(· | S_t), where S_t is the vector of conditional sufficient statistics.

41 Example vi. Mean absolute error. The three filters are rerun R = 100 times, all with the same seed within a run, for each of the three simulated data sets. Five different numbers of particles N were considered: 250, 500, 1000, 2000 and 5000. Mean absolute errors (MAE), taken over the 100 replications, are constructed by comparing percentiles of the true sequential distributions of the state and of the learned variance with percentiles of their estimated sequential counterparts. For α ∈ {0.1, 0.5, 0.9}, true and estimated α-quantiles q_{t,α}^a were computed, where a indexes either the state x_t or the learned variance and Pr(x_t < q_{t,α}^x | y^t) = α (and analogously for the variance). Then
MAE_{t,α}^a = (1/R) Σ_{r=1}^R |q_{t,α}^a - q̂_{t,α,r}^a|.

42 Example vi. M = 500, learning the variance. BBF, sample-resample and PL. [Figure: MAE by (tau2, sig2) configuration and percentile q ∈ {10%, 50%, 90%}.]

43 Example vi. M = 5000, learning the variance. [Figure: MAE by (tau2, sig2) configuration and percentile q ∈ {10%, 50%, 90%}.]

44 Example vi. M = 500, learning x_t. [Figure: MAE by (tau2, sig2) configuration and percentile q ∈ {10%, 50%, 90%}.]

45 Example vi. M = 5000, learning x_t. [Figure: MAE by (tau2, sig2) configuration and percentile q ∈ {10%, 50%, 90%}.]

46 Example vii. Computing sequential Bayes factors. A series y_t is simulated from an AR(1) plus noise model:
y_{t+1} | x_{t+1}, θ ~ N(x_{t+1}, σ²)
x_{t+1} | x_t, θ ~ N(βx_t, τ²)
for t = 1, ..., T. We set T = 100, x_0 = 0 and θ = (β, σ², τ²) = (0.9, 1.0, 0.5). σ² and τ² are kept known, and the independent prior distributions for β and x_0 are both N(0, 1).

47 Example vii. Simulated data. [Figure: y(t) and x(t) over time.]

48 Example vii. PL pure filter versus PL. We run two filters:
- PL pure filter: our particle learning algorithm for learning x_t, keeping β fixed;
- PL: our particle learning algorithm for learning x_t and β sequentially.
The filters are based on N = 10,000 particles.

49 Example vii. PL pure filter versus PL, with β fixed at its true value. [Figure: filtered x_t over time, PL pure filter versus PL.]

50 Example vii. PL, learning β. [Figure: sequential posterior of β over time.]

51 Example vii. PL, learning β: comparing p^N(β | y^t) with the true p(β | y^t). [Figure: densities at t = 10, 20, ..., 100.]

52 Example vii. Sequential Bayes factor: PL versus PL pure filter. [Figure: Bayes factor over time.]

53 Example vii. Posterior model probabilities: 4 models. [Figure: posterior model probabilities over time.]

54 Example vii. Posterior model probabilities: 31 models. [Figure: posterior probabilities and density, PL pure filter versus PL.]

55 PL in Conditional Dynamic Linear Models (CDLM). The model is
y_{t+1} = F_{λ_{t+1}} x_{t+1} + ε_{t+1}, where ε_{t+1} ~ N(0, V_{λ_{t+1}})
x_{t+1} = G_{λ_{t+1}} x_t + ε_{t+1}^x, where ε_{t+1}^x ~ N(0, W_{λ_{t+1}})
The error distribution is
p(ε_{t+1}) = ∫ N(0, V_{λ_{t+1}}) p(λ_{t+1}) dλ_{t+1}
and the augmented latent state evolves as λ_{t+1} ~ p(λ_{t+1} | λ_t). PL extends Liu and Chen's (2000) mixture of Kalman filters.

56 Algorithm.
Step 1 (Resample): generate an index k(i) ~ Multi(w^(i)), where w^(i) ∝ p(y_{t+1} | (s_t^x, θ)^(i)).
Step 2 (Propagate states):
λ_{t+1} ~ p(λ_{t+1} | (λ_t, θ)^{k(i)}, y_{t+1})
x_{t+1} ~ p(x_{t+1} | (x_t, θ)^{k(i)}, λ_{t+1}, y_{t+1})
Step 3 (Propagate sufficient statistics):
s_{t+1}^x = K(s_t^x, θ, λ_{t+1}, y_{t+1})
s_{t+1} = S(s_t, x_{t+1}, λ_{t+1}, y_{t+1})

57 Example A. Dynamic factor model with switching loadings. For t = 1, ..., T, the model is defined as follows.
Observation equation: y_t | z_t, θ ~ N(γ_t x_t, σ² I_2)
State equations:
x_t | x_{t-1}, θ ~ N(x_{t-1}, σ_x²)
λ_t | λ_{t-1}, θ ~ Ber((1 - p)^{1-λ_{t-1}} q^{λ_{t-1}})
where z_t = (x_t, λ_t). Factor loadings: γ_t = (1, β_{λ_t})'. Parameters: θ = (β_1, β_2, σ², σ_x², p, q).
Footnote: Lopes and Carvalho (2007) and Lopes, Salazar and Gamerman (2008).

58 Example A. Conditionally conjugate prior:
β_i ~ N(b_{i0}, B_{i0}), for i = 1, 2
σ² ~ IG(ν_{00}/2, d_{00}/2)
σ_x² ~ IG(ν_{10}/2, d_{10}/2)
p ~ Beta(p_1, p_2)
q ~ Beta(q_1, q_2)
x_0 ~ N(m_0, C_0)

59 Example A. Particle representation. At time t, the particles {(x_t, λ_t, θ, s_t^x, s_t)^(i)}_{i=1}^N approximate p(x_t, λ_t, θ, s_t^x, s_t | y^t), where
s_t^x = S(s_{t-1}^x, θ) are state sufficient statistics,
s_t = S(s_{t-1}, x_t, λ_t) are fixed-parameter sufficient statistics.

60 Example A. Resampling (x_t, λ_t, θ, s_t^x, s_t). Let us redefine β_i = (1, β_i)' whenever necessary. Draw an index k(i) ~ Multi(ω^(i)) with weights ω^(i) ∝ p(y_{t+1} | (s_t^x, λ_t, θ)^(i)), where
p(y_{t+1} | m_t, C_t, λ_t, θ) = Σ_{j=1}^2 f_N(y_{t+1}; β_j m_t, V_j) Pr(λ_{t+1} = j | λ_t, θ)
with V_j = (C_t + σ_x²) β_j β_j' + σ² I_2, m_t and C_t the components of s_t^x, and f_N the normal density function.

61 Example A. Propagating states. Draw the auxiliary state λ_{t+1}:
λ_{t+1}^(i) ~ p(λ_{t+1} | (s_t^x, λ_t, θ)^{k(i)}, y_{t+1})
where Pr(λ_{t+1} = j | s_t^x, λ_t, θ, y_{t+1}) ∝ f_N(y_{t+1}; β_j m_t, V_j) Pr(λ_{t+1} = j | λ_t, θ).
Then draw the state x_{t+1} conditionally on λ_{t+1}:
x_{t+1}^(i) ~ p(x_{t+1} | λ_{t+1}^(i), (s_t^x, θ)^{k(i)}, y_{t+1})
by a simple Kalman filter update.

62 Example A. Updating the sufficient statistics for states, s_{t+1}^x. The Kalman filter recursions yield
m_{t+1} = m_t + A_{t+1} (y_{t+1} - β_{λ_{t+1}} m_t)
C_{t+1} = C_t + σ_x² - A_{t+1} Q_{t+1} A_{t+1}'
where
Q_{t+1} = (C_t + σ_x²) γ_{t+1} γ_{t+1}' + σ² I_2
A_{t+1} = (C_t + σ_x²) γ_{t+1}' Q_{t+1}^{-1}

63 Example A. Updating the sufficient statistics for parameters, s_{t+1}. Recall that s_{t+1} = S(s_t, x_{t+1}, λ_{t+1}). Then
(β_i | s_{t+1}) ~ N(b_{i,t+1}, B_{i,t+1}), for i = 1, 2
(σ² | s_{t+1}) ~ IG(ν_{0,t+1}/2, d_{0,t+1}/2)
(σ_x² | s_{t+1}) ~ IG(ν_{1,t+1}/2, d_{1,t+1}/2)
(p | s_{t+1}) ~ Beta(p_{1,t+1}, p_{2,t+1})
(q | s_{t+1}) ~ Beta(q_{1,t+1}, q_{2,t+1})
where I_{λ_{t+1}=i} = I_i, I_{λ_t=i, λ_{t+1}=j} = I_{ij}, ν_{i,t+1} = ν_{i,t} + 1, B_{i,t+1}^{-1} = B_{i,t}^{-1} + x_{t+1}² I_i, B_{i,t+1}^{-1} b_{i,t+1} = B_{i,t}^{-1} b_{i,t} + x_{t+1} y_{t+1,2} I_i, p_{i,t+1} = p_{i,t} + I_{1i} (similarly for q_{i,t+1}), for i = 1, 2,
d_{0,t+1} = d_{0,t} + (y_{t+1,1} - x_{t+1})² + Σ_{j=1}^2 [(y_{t+1,2} - b_{j,t+1} x_{t+1}) y_{t+1,2} + B_{j,t+1}^{-1} b_{j,t+1}] I_j, and
d_{1,t+1} = d_{1,t} + (x_{t+1} - x_t)².

64 Example A. Filtering and smoothing for states. [Figure.]

65 Example A. Sequential parameter learning. [Figure.]

66 PL in (state) nonlinear normal dynamic models. The model now is
y_{t+1} = F_{λ_{t+1}} x_{t+1} + ε_{t+1}, where ε_{t+1} ~ N(0, V_{λ_{t+1}})
x_{t+1} = G_{λ_{t+1}} Z(x_t) + ω_{t+1}, where ω_{t+1} ~ N(0, W_{λ_{t+1}})
and ε_{t+1} and λ_{t+1} are modeled as before.
Algorithm:
Step 1 (Resample): generate an index k(i) ~ Multi(w^(i)), where w^(i) ∝ p(y_{t+1} | (x_t, θ)^(i)).
Step 2 (Propagate):
λ_{t+1} ~ p(λ_{t+1} | (λ_t, θ)^{k(i)}, y_{t+1})
x_{t+1} ~ p(x_{t+1} | (x_t, θ)^{k(i)}, λ_{t+1}, y_{t+1})
s_{t+1} = S(s_t, x_{t+1}, λ_{t+1}, y_{t+1})

67 Example B. Fat-tailed nonlinear model. Let
y_{t+1} = x_{t+1} + σ √λ_{t+1} ε_{t+1}, where λ_{t+1} ~ IG(ν/2, ν/2)
x_{t+1} = g(x_t) β + σ_x u_{t+1}, where g(x_t) = x_t / (1 + x_t²)
ε_{t+1} and u_{t+1} are independent standard normals and ν is known. The observation error term is non-normal: √λ_{t+1} ε_{t+1} ~ t_ν.
Footnote: from DeJong et al. (2007).

68 Example B. Sequential inference. [Figure.]

69 Example C. Dynamic multinomial logit model. Let us study the multinomial logit model
P(y_{t+1} = 1 | β_{t+1}) = e^{F_t β_t} / (1 + e^{F_t β_t})
with β_{t+1} = φ β_t + σ_x ε_{t+1}^β and β_0 ~ N(0, σ_x²/(1 - ρ²)).
Scott's (2007) data augmentation structure leads to a mixture Kalman filter model
y_{t+1} = I(z_{t+1} ≥ 0)
z_{t+1} = Z_t β + ε_{t+1}, where ε_{t+1} ~ -ln E(1)
Here ε_t follows an extreme value distribution of type 1, with E(1) an exponential of mean one. The key is that it is easy to simulate from p(z_t | β, y_t) using
z_{t+1} = -ln( -ln U_i / (1 + e^{Z_t β}) - (ln V_i / e^{Z_t β}) I_{y_{t+1}=0} )
for independent uniforms U_i and V_i.
Footnote: Carvalho, Lopes and Polson (2008).

70 Example C. 10-component mixture of normals. Frühwirth-Schnatter and Frühwirth (2007) use a 10-component mixture of normals,
p(ε_t) = e^{-ε_t} e^{-e^{-ε_t}} ≈ Σ_{j=1}^{10} w_j N(μ_j, s_j²).
Hence, conditional on an indicator λ_t, we can analyze
y_t = I(z_t ≥ 0) and z_t = μ_{λ_t} + Z_t β + s_{λ_t} ε_t
where ε_t ~ N(0, 1) and Pr(λ_t = j) = w_j. Also,
s_{t+1}^β = K(s_t^β, z_{t+1}, λ_{t+1}, θ, y_{t+1})
p(y_{t+1} | s_t^β, θ) = Σ_{λ_{t+1}} p(y_{t+1} | s_t^β, λ_{t+1}, θ)
for resampling. Propagation now requires
λ_{t+1} ~ p(λ_{t+1} | (s_t^β, θ)^{k(i)}, y_{t+1})
z_{t+1} ~ p(z_{t+1} | (s_t^β, θ)^{k(i)}, λ_{t+1}, y_{t+1})
β_{t+1} ~ p(β_{t+1} | (s^β, θ)^{k(i)}, λ_{t+1}, z_{t+1})

71 Example C. Simulated exercise. [Figure: panels for Y_t, β_t, α and σ over time.] PL based on 30,000 particles.

72 Example D. Sequential Bayesian Lasso. We develop a sequential version of the Bayesian Lasso for a simple problem of signal detection. The model takes the form
y_t | θ_t ~ N(θ_t, 1)
p(θ_t | τ) = (2τ)^{-1} exp(-|θ_t|/τ)
for t = 1, ..., n, with τ² ~ IG(a_0, b_0).
Data augmentation: it is easy to see that
p(θ_t | τ) = ∫ p(θ_t | τ, λ_t) p(λ_t) dλ_t, where λ_t ~ Exp(2) and θ_t | τ, λ_t ~ N(0, τ² λ_t).
Footnote: Carlin and Polson (1991) and Hans (2008).

73 Example D. Data augmentation. The natural set of latent variables is given by the augmentation variable λ_{n+1} and the conditional sufficient statistics, leading to Z_n = (λ_{n+1}, a_n, b_n). The variables λ_{n+1} are i.i.d. and so can be propagated directly from p(λ_{n+1}). The conditional sufficient statistics (a_{n+1}, b_{n+1}) are deterministically determined by the parameters (θ_{n+1}, λ_{n+1}) and the previous values (a_n, b_n).

74 Example D. PL algorithm.
1. After n observations: {(Z_n, τ²)^(i)}_{i=1}^N.
2. Draw λ_{n+1}^(i) ~ Exp(2).
3. Resample the old particles with weights w_{n+1}^(i) ∝ f_N(y_{n+1}; 0, 1 + τ²^(i) λ_{n+1}^(i)).
4. Sample θ_{n+1}^(i) ~ N(m_n^(i), C_n^(i)), where C_n^(i) = (1 + τ^{-2(i)} λ_{n+1}^{-1(i)})^{-1} and m_n^(i) = C_n^(i) y_{n+1}.
5. Sufficient statistics (using the resampled particles): a_{n+1}^(i) = ã_n^(i) + 1/2 and b_{n+1}^(i) = b̃_n^(i) + θ_{n+1}^{2(i)} / (2 λ_{n+1}^(i)).
6. Sample (offline) τ²^(i) ~ IG(a_{n+1}^(i), b_{n+1}^(i)).
7. Let Z_{n+1}^(i) = (λ_{n+1}^(i), a_{n+1}^(i), b_{n+1}^(i)).
8. After n + 1 observations: {(Z_{n+1}, τ²)^(i)}_{i=1}^N.
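A sketch of one such update in Python, under the parameterization above (the Exp(2) mixing distribution is taken to have mean 2, an assumption of this sketch); the predictive returned at the end is what the Bayes factor on the next slide averages.

    import numpy as np

    def lasso_pl_step(a, b, y_next, rng):
        """One PL step for the sequential Bayesian lasso: (a, b) are IG suff. stats for tau^2."""
        N = len(a)
        tau2 = 1.0 / rng.gamma(a, 1.0 / b)           # tau^2 ~ IG(a, b), offline draw
        lam = rng.exponential(2.0, N)                # lambda_{n+1} ~ Exp with mean 2 (assumed)
        # Predictive weights N(y_{n+1}; 0, 1 + tau2 * lambda)
        var_pred = 1.0 + tau2 * lam
        logw = -0.5 * np.log(var_pred) - 0.5 * y_next**2 / var_pred
        pred = np.mean(np.exp(logw)) / np.sqrt(2.0 * np.pi)   # (1/N) sum of N(y_{n+1}; 0, 1 + tau2*lambda)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        k = rng.choice(N, size=N, p=w)               # resample
        a, b, tau2, lam = a[k], b[k], tau2[k], lam[k]
        # Propagate theta_{n+1} | y_{n+1}, tau2, lambda
        C = 1.0 / (1.0 + 1.0 / (tau2 * lam))
        theta = rng.normal(C * y_next, np.sqrt(C))
        # Update sufficient statistics
        a_new = a + 0.5
        b_new = b + theta**2 / (2.0 * lam)
        return a_new, b_new, pred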

75 Example D. Sequential Bayes factor. As the Lasso is a model for sparsity, we would expect the evidence for it to increase when we observe y_t = 0. We can sequentially estimate p(y_{n+1} | y^n, lasso) via
p(y_{n+1} | y^n, lasso) ≈ (1/N) Σ_{i=1}^N p(y_{n+1} | (λ_{n+1}, τ²)^(i))
with predictive p(y_{n+1} | λ_{n+1}, τ²) ~ N(0, 1 + τ² λ_{n+1}). This leads to a sequential Bayes factor
BF_{n+1} = p(y_{n+1} | lasso) / p(y_{n+1} | normal).

76 Example D. Simulated data. Data based on θ = (0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1) and priors IG(2, 1) for the double exponential case and IG(2, 3) for the normal case, reflecting the ratio of variances between those two distributions. [Figure: Bayes factor against the number of observations.]

77 Final remarks.
- PL is a general framework for sequential Bayesian inference in dynamic and static models.
- PL is able to deal with filtering and learning and to reduce the accumulation of error.
- The loose definition of sufficient statistics and the flexibility to freely augment x_t make PL a competitive alternative to MCMC in highly structured models.
- A powerful by-product of PL (and SMC in general) over MCMC schemes is its ability to sequentially produce model comparison and assessment indicators.
