Bayesian Estimation of DSGE Models
Chapter 3: A Crash Course in Bayesian Inference
Ed Herbst (Federal Reserve Board), Frank Schorfheide (University of Pennsylvania)
February 5, 2016
The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System.
Topics
- Deriving a posterior distribution in a linear regression model
- Direct sampling
- Bayesian decision making
Bayesian Inference
Ingredients of Bayesian analysis:
- Likelihood function p(Y|φ)
- Prior density p(φ)
- Marginal data density p(Y) = ∫ p(Y|φ)p(φ) dφ
Bayes Theorem:
p(φ|Y) = p(Y|φ)p(φ) / p(Y)
Linear Regression / AR Models
Consider the AR(1) model: y_t = y_{t-1}φ + u_t, u_t ~ iid N(0, 1). Let x_t = y_{t-1} and write
y_t = x_t'φ + u_t, or in matrix form Y = Xφ + U.
We can easily allow for multiple regressors; assume φ is k × 1. Notice: we treat the variance of the errors as known. The generalization to unknown variance is straightforward but tedious.
Likelihood function:
p(Y|φ) = (2π)^{-T/2} exp{ -(1/2)(Y - Xφ)'(Y - Xφ) }.
A Convenient Prior
Prior: φ ~ N(0_{k×1}, τ² I_{k×k}), with density
p(φ) = (2πτ²)^{-k/2} exp{ -(1/(2τ²)) φ'φ }.
Large τ means a diffuse prior; small τ means a tight prior.
Deriving the Posterior
Bayes Theorem:
p(φ|Y) ∝ p(Y|φ)p(φ) ∝ exp{ -(1/2)[(Y - Xφ)'(Y - Xφ) + τ^{-2} φ'φ] }.
Guess: what if φ|Y ~ N(φ̄_T, V̄_T)? Then
p(φ|Y) ∝ exp{ -(1/2)(φ - φ̄_T)' V̄_T^{-1} (φ - φ̄_T) }.
Rewrite the exponential term:
Y'Y - φ'X'Y - Y'Xφ + φ'X'Xφ + τ^{-2} φ'φ
= Y'Y - φ'X'Y - Y'Xφ + φ'(X'X + τ^{-2}I)φ
= (φ - (X'X + τ^{-2}I)^{-1}X'Y)' (X'X + τ^{-2}I) (φ - (X'X + τ^{-2}I)^{-1}X'Y)
  + Y'Y - Y'X(X'X + τ^{-2}I)^{-1}X'Y.
Deriving the Posterior
The exponential term is a quadratic function of φ. Deduce: the posterior distribution of φ must be a multivariate normal,
φ|Y ~ N(φ̄_T, V̄_T), with φ̄_T = (X'X + τ^{-2}I)^{-1} X'Y, V̄_T = (X'X + τ^{-2}I)^{-1}.
Limiting cases:
- τ → ∞ (diffuse prior): φ|Y is approximately N(φ̂_mle, (X'X)^{-1}).
- τ → 0 (dogmatic prior): φ|Y collapses to a point mass at 0.
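As a concrete illustration, the posterior moments above can be computed in a few lines of numpy. This is a sketch for the scalar case (k = 1) with simulated data; the function name and the simulated series are illustrative, not part of the slides.

```python
import numpy as np

def ar1_posterior(y, tau):
    """Posterior mean and variance of phi in y_t = phi*y_{t-1} + u_t,
    u_t ~ iid N(0,1), under the prior phi ~ N(0, tau^2)."""
    Y = y[1:]                        # left-hand side
    X = y[:-1]                       # regressor x_t = y_{t-1}
    prec = X @ X + 1.0 / tau**2      # X'X + tau^{-2}
    V_bar = 1.0 / prec               # posterior variance V_bar_T
    phi_bar = V_bar * (X @ Y)        # posterior mean phi_bar_T
    return phi_bar, V_bar

# Simulate an AR(1) with phi = 0.8 and compare diffuse vs. tight priors
rng = np.random.default_rng(0)
y = np.empty(200); y[0] = 0.0
for t in range(1, 200):
    y[t] = 0.8 * y[t-1] + rng.standard_normal()

phi_diffuse, _ = ar1_posterior(y, tau=100.0)  # close to the OLS/MLE estimate
phi_tight, _ = ar1_posterior(y, tau=0.01)     # shrunk heavily toward 0
```

As the limiting cases suggest, the diffuse-prior estimate is near the MLE while the tight-prior estimate is pulled toward the prior mean of zero.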
Marginal Data Density
Plays an important role in Bayesian model selection and averaging. Write
p(Y) = p(Y|φ)p(φ) / p(φ|Y)
= (2π)^{-T/2} |I + τ²X'X|^{-1/2} exp{ -(1/2)[Y'Y - Y'X(X'X + τ^{-2}I)^{-1}X'Y] }.
The exponential term measures goodness-of-fit; |I + τ²X'X| is a penalty for model complexity.
Posterior
We will often abbreviate the posterior distribution p(φ|Y) by π(φ) and posterior expectations of h(φ) by
E_π[h] = E_π[h(φ)] = ∫ h(φ)π(φ) dφ = ∫ h(φ)p(φ|Y) dφ.
We will focus on algorithms that generate draws {φ^i}_{i=1}^N from posterior distributions of parameters in time series models. These draws can then be transformed into objects of interest, h(φ^i), and under suitable conditions a Monte Carlo average of the form
h̄_N = (1/N) Σ_{i=1}^N h(φ^i) → E_π[h].
Strong law of large numbers (SLLN), central limit theorem (CLT)...
Direct Sampling
In the simple linear regression model with Gaussian posterior it is possible to sample directly: for i = 1 to N, draw φ^i from N(φ̄_T, V̄_T).
Provided that V_π[h(φ)] < ∞, we can deduce from Kolmogorov's SLLN and the Lindeberg-Levy CLT that
h̄_N → E_π[h] a.s. and √N (h̄_N - E_π[h]) ⇒ N(0, V_π[h(φ)]).
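Direct sampling is then immediate. A minimal sketch, using hypothetical stand-in values for the posterior moments φ̄_T and V̄_T:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior moments for a scalar phi (stand-ins for the
# phi_bar_T and V_bar_T that would be computed from data)
phi_bar, V_bar = 0.8, 0.05**2

# Direct sampling: iid draws from the Gaussian posterior
N = 100_000
draws = phi_bar + np.sqrt(V_bar) * rng.standard_normal(N)

# Monte Carlo approximations of posterior expectations E_pi[h]
h_bar = draws.mean()          # h(phi) = phi,   approximates E_pi[phi]
h2_bar = (draws**2).mean()    # h(phi) = phi^2, approximates E_pi[phi^2]
```

Because the draws are iid, the SLLN and CLT above apply directly, and the Monte Carlo error shrinks at rate 1/√N.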
Decision Making
The posterior expected loss associated with a decision δ(·) is given by
ρ(δ(·)|Y) = ∫_Θ L(θ, δ(Y)) p(θ|Y) dθ.
A Bayes decision is a decision that minimizes the posterior expected loss:
δ*(Y) = argmin_d ρ(d|Y).
Since in most applications it is not feasible to derive the posterior expected loss analytically, we replace ρ(δ(·)|Y) by a Monte Carlo approximation of the form
ρ̄_N(δ(·)|Y) = (1/N) Σ_{i=1}^N L(θ^i, δ(·)).
A numerical approximation to the Bayes decision δ*(·) is then given by
δ*_N(Y) = argmin_d ρ̄_N(d|Y).
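The Monte Carlo approximation of a Bayes decision can be sketched as follows; the posterior draws and the grid of candidate decisions are illustrative stand-ins. Under quadratic loss the minimizer should (approximately) equal the posterior mean:

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in posterior draws of theta (playing the role of draws from p(theta|Y))
theta_draws = 2.0 + rng.standard_normal(100_000)

# Quadratic loss L(theta, d) = (theta - d)^2; approximate the posterior
# expected loss on a grid of candidate decisions and minimize
grid = np.linspace(0.0, 4.0, 401)
risk = [np.mean((theta_draws - d)**2) for d in grid]
d_star = grid[int(np.argmin(risk))]
# Under quadratic loss, the Bayes decision is the posterior mean
```

The grid search is only for illustration; for quadratic or absolute-error loss the minimizer is available in closed form (posterior mean or median of the draws).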
Inference
Point estimation:
- Quadratic loss: posterior mean
- Absolute error loss: posterior median
Interval/set estimation, P_π{θ ∈ C(Y)} = 1 - α:
- highest posterior density sets
- equal-tail-probability intervals
Forecasting
Example:
y_{T+h} = θ^h y_T + Σ_{s=0}^{h-1} θ^s u_{T+h-s}.
h-step-ahead conditional distribution:
y_{T+h} | (Y_{1:T}, θ) ~ N( θ^h y_T, (1 - θ^{2h})/(1 - θ²) ).
Posterior predictive distribution:
p(y_{T+h}|Y_{1:T}) = ∫ p(y_{T+h}|y_T, θ) p(θ|Y_{1:T}) dθ.
For each draw θ^i from the posterior distribution p(θ|Y_{1:T}), sample a sequence of innovations u^i_{T+1}, ..., u^i_{T+h} and compute y^i_{T+h} as a function of θ^i, u^i_{T+1}, ..., u^i_{T+h}, and Y_{1:T}.
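The simulation approach just described can be sketched as follows, with a stand-in Gaussian sample playing the role of the posterior draws θ^i:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed inputs: posterior draws of theta and the last observation y_T
# (the Gaussian sample below is an illustrative stand-in for p(theta|Y_{1:T}))
theta_draws = 0.8 + 0.05 * rng.standard_normal(10_000)
y_T, h = 1.5, 4

# For each posterior draw theta^i, simulate innovations u^i_{T+1..T+h}
# and iterate the AR(1) forward h steps
preds = np.empty(theta_draws.size)
for i, theta in enumerate(theta_draws):
    y = y_T
    for _ in range(h):
        y = theta * y + rng.standard_normal()
    preds[i] = y

# preds is a sample from the posterior predictive p(y_{T+h} | Y_{1:T})
pred_mean = preds.mean()
```

The resulting sample reflects both the innovation uncertainty in u_{T+1}, ..., u_{T+h} and the parameter uncertainty in θ.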
Model Uncertainty
Assign prior probabilities γ_{j,0} to models M_j, j = 1, ..., J. Posterior model probabilities are given by
γ_{j,T} = γ_{j,0} p(Y|M_j) / Σ_{j=1}^J γ_{j,0} p(Y|M_j), where p(Y|M_j) = ∫ p(Y|θ^(j), M_j) p(θ^(j)|M_j) dθ^(j).
Log marginal data densities are sums of one-step-ahead predictive scores:
ln p(Y|M_j) = Σ_{t=1}^T ln ∫ p(y_t|θ^(j), Y_{1:t-1}, M_j) p(θ^(j)|Y_{1:t-1}, M_j) dθ^(j).
Model averaging:
p(h|Y) = Σ_{j=1}^J γ_{j,T} p(h_j(θ^(j))|Y, M_j).
A Non-Gaussian Posterior
Suppose that y_t is determined by the AR(1) model, but the object of interest is θ, which can only be bounded based on φ: φ ≤ θ ≤ φ + 1. The parameter θ is set-identified; the interval Θ(φ) = [φ, φ + 1] is called the identified set. Prior for θ conditional on φ: θ|φ ~ U[φ, φ + 1].
A Non-Gaussian Posterior
Joint posterior of θ and φ:
p(θ, φ|Y) = p(φ|Y) p(θ|φ, Y) ∝ p(Y|φ) p(θ|φ) p(φ).
Since θ does not enter the likelihood function, we deduce that
p(φ|Y) = p(Y|φ)p(φ) / ∫ p(Y|φ)p(φ) dφ and p(θ|φ, Y) = p(θ|φ).
In our example the marginal posterior density of θ is given by
π(θ) = ∫_{θ-1}^{θ} p(φ|Y) p(θ|φ) dφ = Φ_N( (θ - φ̄)/√V̄ ) - Φ_N( (θ - 1 - φ̄)/√V̄ ),
where Φ_N(x) is the cumulative distribution function of a N(0, 1).
What if the Posterior is Non-Gaussian?
[Figure: posterior distribution π(θ) for φ̄ = 0.5 and V̄ equal to 1/4 (dotted), 1/20 (dashed), and 1/100 (solid).]
Importance Sampling
Approximate π(·) by using a different, tractable density g(θ) that is easy to sample from. For more general problems, the posterior density may be known only up to a normalization constant, so we write
π(θ) = p(Y|θ)p(θ)/p(Y) = f(θ)/Z.
Importance sampling is based on the identity
E_π[h(θ)] = ∫ h(θ)π(θ) dθ = (1/Z) ∫_Θ h(θ) (f(θ)/g(θ)) g(θ) dθ.
The ratio w(θ) = f(θ)/g(θ) is called the (unnormalized) importance weight.
Importance Sampling
1. For i = 1 to N, draw θ^i iid from g(θ) and compute the unnormalized importance weights w^i = w(θ^i) = f(θ^i)/g(θ^i).
2. Compute the normalized importance weights
W^i = w^i / ( (1/N) Σ_{i=1}^N w^i ).
An approximation of E_π[h(θ)] is given by
h̄_N = (1/N) Σ_{i=1}^N W^i h(θ^i).
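A minimal sketch of the algorithm, with an illustrative target (an unnormalized standard normal kernel f) and a diffuse Gaussian proposal g; here E_π[θ²] = 1, which the weighted average should recover:

```python
import numpy as np

rng = np.random.default_rng(3)

# Unnormalized target kernel f (here a standard normal, as an illustration)
f = lambda th: np.exp(-0.5 * th**2)

# Proposal g: N(0, 2^2), with its normalized density
sig_g = 2.0
g = lambda th: np.exp(-0.5 * (th / sig_g)**2) / (sig_g * np.sqrt(2.0 * np.pi))

N = 200_000
theta = sig_g * rng.standard_normal(N)   # step 1: iid draws from g
w = f(theta) / g(theta)                  # step 1: unnormalized weights w^i
W = w / w.mean()                         # step 2: normalized weights W^i
h_bar = np.mean(W * theta**2)            # approximates E_pi[theta^2] = 1
```

Note that Z never appears: it cancels in the self-normalized estimator, which is why only the kernel f is needed.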
Importance Sampling Distribution
[Figure: posterior density π(θ) (solid) as well as two importance sampling densities g(θ): concentrated (dashed) and diffuse (dotted).]
Accuracy
Since we are generating iid draws from g(θ), it is fairly straightforward to derive a CLT. It can be shown that
√N (h̄_N - E_π[h]) ⇒ N(0, Ω(h)), where Ω(h) = V_g[(π/g)(h - E_π[h])].
Using a crude approximation (see, e.g., Liu (2008)), we can factorize Ω(h) as follows:
Ω(h) ≈ V_π[h] ( V_g[π/g] + 1 ).
The approximation highlights that the larger the variance of the importance weights, the less accurate the Monte Carlo approximation relative to the accuracy that could be achieved with an iid sample from the posterior. Users often monitor the effective sample size
ESS = N V_π[h]/Ω(h) ≈ N / (1 + V_g[π/g]).
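The ESS is commonly estimated directly from the unnormalized weights via the sample analogue (Σ_i w^i)² / Σ_i (w^i)², which equals N when all weights are equal and 1 when a single weight dominates; a small sketch:

```python
import numpy as np

def ess(w):
    """Effective sample size implied by unnormalized importance weights:
    the sample analogue of N / (1 + V_g[pi/g]), computed as
    (sum_i w^i)^2 / sum_i (w^i)^2."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Equal weights: no efficiency loss relative to iid posterior sampling
ess_equal = ess(np.ones(1000))
# One dominant weight: the sample is effectively a single draw
ess_degenerate = ess([5.0, 0.0, 0.0, 0.0])
```

A low ESS relative to N signals that the proposal g is poorly matched to the posterior and the Monte Carlo approximation is unreliable.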
Inefficiency Factors for Concentrated IS Density
[Figure: large-sample inefficiency factors InEff = Ω(h)/V_π[h] (dashed) and their small-sample approximations (solid) based on N_run = 1,000, plotted against N, for h(θ) = θ (triangles) and h(θ) = θ² (circles). The solid line (no symbols) depicts the approximate inefficiency factor 1 + V_g[π/g].]
Inefficiency Factors for Diffuse IS Density
[Figure: large-sample inefficiency factors InEff = Ω(h)/V_π[h] (dashed) and their small-sample approximations (solid) based on N_run = 1,000, plotted against N, for h(θ) = θ (triangles) and h(θ) = θ² (circles). The solid line (no symbols) depicts the approximate inefficiency factor 1 + V_g[π/g].]
Markov Chain Monte Carlo (MCMC)
Main idea: create a sequence of serially correlated draws such that the distribution of θ^i converges to the posterior distribution p(θ|Y).
Generic Metropolis-Hastings Algorithm
For i = 1 to N:
1. Draw ϑ from a density q(ϑ|θ^{i-1}).
2. Set θ^i = ϑ with probability
α(ϑ|θ^{i-1}) = min{ 1, [p(Y|ϑ)p(ϑ)/q(ϑ|θ^{i-1})] / [p(Y|θ^{i-1})p(θ^{i-1})/q(θ^{i-1}|ϑ)] }
and θ^i = θ^{i-1} otherwise.
Recall p(θ|Y) ∝ p(Y|θ)p(θ). We draw θ^i conditional on the previous parameter draw θ^{i-1}: this leads to a Markov transition kernel K(θ|θ̃).
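A sketch of a random-walk variant (the proposal q is symmetric, so the q terms cancel in α), targeting an illustrative standard normal kernel; the step size and seed are arbitrary choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(4)

# Unnormalized log posterior kernel log[p(Y|theta)p(theta)]; as an
# illustration, a standard normal kernel
log_kernel = lambda th: -0.5 * th**2

# Random-walk MH: proposal q(theta'|theta) = N(theta, step^2) is symmetric,
# so alpha reduces to min{1, kernel(prop)/kernel(current)}
N, step = 50_000, 0.5
chain = np.empty(N)
theta = 0.0
for i in range(N):
    prop = theta + step * rng.standard_normal()
    log_alpha = log_kernel(prop) - log_kernel(theta)
    if np.log(rng.random()) < log_alpha:
        theta = prop          # accept the proposed draw
    chain[i] = theta          # otherwise keep the previous draw
```

Working with log kernels avoids numerical underflow; the chain's draws are serially correlated but their distribution converges to the target.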
Invariance Property
It can be shown that
p(θ|Y) = ∫ K(θ|θ̃) p(θ̃|Y) dθ̃.
Write K(θ|θ̃) = u(θ|θ̃) + r(θ̃) δ_{θ̃}(θ).
u(θ|θ̃) is the density kernel for accepted draws (note that u(·|θ̃) does not integrate to one):
u(θ|θ̃) = α(θ|θ̃) q(θ|θ̃).
Rejection probability:
r(θ̃) = ∫ [1 - α(θ|θ̃)] q(θ|θ̃) dθ = 1 - ∫ u(θ|θ̃) dθ.
Invariance Property
Reversibility: conditional on the sampler not rejecting the proposed draw, the density associated with a transition from θ̃ to θ is identical to the density associated with a transition from θ to θ̃:
p(θ̃|Y) u(θ|θ̃) = p(θ̃|Y) q(θ|θ̃) min{ 1, [p(θ|Y)/q(θ|θ̃)] / [p(θ̃|Y)/q(θ̃|θ)] }
= min{ p(θ̃|Y) q(θ|θ̃), p(θ|Y) q(θ̃|θ) }
= p(θ|Y) q(θ̃|θ) min{ [p(θ̃|Y)/q(θ̃|θ)] / [p(θ|Y)/q(θ|θ̃)], 1 }
= p(θ|Y) u(θ̃|θ).
Using the reversibility result, we can now verify the invariance property:
∫ K(θ|θ̃) p(θ̃|Y) dθ̃ = ∫ u(θ|θ̃) p(θ̃|Y) dθ̃ + ∫ r(θ̃) δ_{θ̃}(θ) p(θ̃|Y) dθ̃
= ∫ u(θ̃|θ) p(θ|Y) dθ̃ + r(θ) p(θ|Y)
= p(θ|Y).
A Discrete Example
Suppose the parameter θ is scalar and takes only two values: Θ = {τ₁, τ₂}. The posterior distribution p(θ|Y) can be represented by a set of probabilities collected in the vector π = [π₁, π₂], with π₂ > π₁. Suppose we obtain the proposal ϑ based on the transition matrix
Q = [ q, 1-q ; 1-q, q ].
Discrete MH Algorithm
Iteration i: suppose that θ^{i-1} = τ_j. Based on the transition matrix Q, determine a proposed state ϑ = τ_s. With probability α(τ_s|τ_j) the proposed state is accepted: set θ^i = ϑ = τ_s. With probability 1 - α(τ_s|τ_j) stay in the old state: set θ^i = θ^{i-1} = τ_j. Choose (the Q terms cancel because of symmetry)
α(τ_s|τ_j) = min{ 1, π_s/π_j }.
Discrete MH Algorithm: Transition Matrix
The resulting chain's transition matrix is
K = [ q, 1-q ; (1-q) π₁/π₂, q + (1-q)(1 - π₁/π₂) ].
Straightforward calculations reveal that the transition matrix K has eigenvalues
λ₁(K) = 1, λ₂(K) = q - (1-q) π₁/(1-π₁).
The equilibrium distribution is the eigenvector associated with the unit eigenvalue. For q ∈ [0, 1) the equilibrium distribution is unique.
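The transition matrix and its eigenvalues can be verified numerically; a small sketch (the values of q and π₁ are illustrative):

```python
import numpy as np

def mh_kernel(q, pi1):
    """Transition matrix K of the discrete MH chain for the two-point
    target pi = [pi_1, pi_2] with pi_2 = 1 - pi_1 > pi_1."""
    pi2 = 1.0 - pi1
    return np.array([
        [q, 1.0 - q],
        [(1.0 - q) * pi1 / pi2, q + (1.0 - q) * (1.0 - pi1 / pi2)],
    ])

q, pi1 = 0.3, 0.2
K = mh_kernel(q, pi1)

# The target [pi_1, pi_2] is the left eigenvector with unit eigenvalue,
# i.e. the equilibrium distribution of the chain
pi = np.array([pi1, 1.0 - pi1])
lam2 = q - (1.0 - q) * pi1 / (1.0 - pi1)   # second eigenvalue of K
```

Checking `pi @ K == pi` confirms invariance, and the computed eigenvalues of K match {1, λ₂}.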
Convergence
The persistence of the Markov chain depends on the second eigenvalue, which in turn depends on the proposal distribution Q. Define the transformed parameter
ξ^i = (θ^i - τ₁)/(τ₂ - τ₁).
We can represent the Markov chain associated with ξ^i as the first-order autoregressive process
ξ^i = (1 - k₁₁) + λ₂(K) ξ^{i-1} + ν^i.
Conditional on the current state (ξ^{i-1} = 0 or 1, corresponding to the diagonal element k_jj of K), the innovation ν^i takes one of two values; its conditional mean is equal to zero and its conditional variance is equal to k_jj(1 - k_jj).
Convergence
Autocovariance function of h(θ^i):
COV( h(θ^i), h(θ^{i-l}) ) = ( h(τ₂) - h(τ₁) )² π₁(1-π₁) [ q - (1-q) π₁/(1-π₁) ]^l
= V_π[h] [ q - (1-q) π₁/(1-π₁) ]^l.
If q = π₁, then the autocovariances are equal to zero and the draws h(θ^i) are serially uncorrelated (in fact, in our simple discrete setting they are also independent).
Convergence
Define the Monte Carlo estimate
h̄_N = (1/N) Σ_{i=1}^N h(θ^i).
Deduce from the CLT that √N (h̄_N - E_π[h]) ⇒ N(0, Ω(h)), where Ω(h) is the long-run variance
Ω(h) = lim_{L→∞} V_π[h] ( 1 + 2 Σ_{l=1}^L ((L-l)/L) [ q - (1-q) π₁/(1-π₁) ]^l ).
In turn, the asymptotic inefficiency factor is given by
InEff = Ω(h)/V_π[h] = 1 + 2 lim_{L→∞} Σ_{l=1}^L ((L-l)/L) [ q - (1-q) π₁/(1-π₁) ]^l.
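Since the autocovariances are geometric in λ₂ = q - (1-q)π₁/(1-π₁), the limit sums to the closed form InEff = 1 + 2λ₂/(1 - λ₂); a small sketch verifying the two polar cases discussed above (the specific values of q and π₁ are illustrative):

```python
def ineff_factor(q, pi1):
    """Asymptotic inefficiency factor Omega(h)/V_pi[h] for the discrete
    MH chain: 1 + 2 * sum_{l>=1} lam2^l = 1 + 2*lam2/(1 - lam2)."""
    lam2 = q - (1.0 - q) * pi1 / (1.0 - pi1)   # second eigenvalue of K
    return 1.0 + 2.0 * lam2 / (1.0 - lam2)

# Persistent chain: q = 0.8, pi_1 = 0.2 gives lam2 = 0.75 and InEff = 7,
# i.e. each draw is worth about 1/7 of an iid posterior draw
high = ineff_factor(0.8, 0.2)

# q = pi_1: lam2 = 0, the draws are serially uncorrelated, InEff = 1
iid = ineff_factor(0.2, 0.2)
```

This makes concrete the message of the autocovariance slide: the closer λ₂ is to one, the more draws are needed for a given Monte Carlo accuracy.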
Autocorrelation Function of θ^i
[Figure: autocorrelation functions at lags 0 through 9 for q = 0.00, 0.20, 0.50, and 0.99.]
Asymptotic Inefficiency
[Figure: asymptotic inefficiency factor InEff (log scale) as a function of q ∈ [0, 1].]
Small Sample Variance V[h̄_N] versus HAC Estimates of Ω(h)
[Figure: scatter plot of small-sample variances against HAC estimates, both on log scales.]