Bayesian Estimation of DSGE Models, Chapter 3: A Crash Course in Bayesian Inference


Ed Herbst (Federal Reserve Board) and Frank Schorfheide (University of Pennsylvania)
February 5, 2016

The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System.

Topics
- Deriving a posterior distribution in a linear regression model
- Direct sampling
- Bayesian decision making

Bayesian Inference
Ingredients of Bayesian analysis:
- Likelihood function $p(Y \mid \phi)$
- Prior density $p(\phi)$
- Marginal data density $p(Y) = \int p(Y \mid \phi) p(\phi) \, d\phi$
Bayes' Theorem:
$$p(\phi \mid Y) = \frac{p(Y \mid \phi) p(\phi)}{p(Y)}.$$

Linear Regression / AR Models
Consider the AR(1) model $y_t = y_{t-1}\phi + u_t$, $u_t \sim iid\,N(0,1)$. Let $x_t = y_{t-1}$ and write
$$y_t = x_t'\phi + u_t, \quad u_t \sim iid\,N(0,1), \quad \text{or} \quad Y = X\phi + U.$$
We can easily allow for multiple regressors; assume $\phi$ is $k \times 1$. Notice that we treat the variance of the errors as known; the generalization to unknown variance is straightforward but tedious. Likelihood function:
$$p(Y \mid \phi) = (2\pi)^{-T/2} \exp\left\{ -\frac{1}{2}(Y - X\phi)'(Y - X\phi) \right\}.$$
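As a concrete illustration, here is a minimal Python sketch (the function name `ar1_log_likelihood` and the simulation settings are ours, not from the chapter) that simulates an AR(1) sample and evaluates this Gaussian likelihood:

```python
import numpy as np

def ar1_log_likelihood(phi, Y, X):
    """Gaussian log-likelihood of Y = X phi + U with unit-variance errors."""
    resid = Y - X @ phi
    T = Y.shape[0]
    return -0.5 * T * np.log(2.0 * np.pi) - 0.5 * (resid @ resid)

# Simulate T = 200 observations from y_t = 0.9 y_{t-1} + u_t (illustrative values)
rng = np.random.default_rng(0)
T, phi_true = 200, 0.9
y = np.zeros(T + 1)
for t in range(1, T + 1):
    y[t] = phi_true * y[t - 1] + rng.standard_normal()
Y, X = y[1:], y[:-1].reshape(-1, 1)   # regress y_t on x_t = y_{t-1}

print(ar1_log_likelihood(np.array([phi_true]), Y, X))
```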

A Convenient Prior
Prior: $\phi \sim N(0_{k \times 1}, \tau^2 I_{k \times k})$, with density
$$p(\phi) = (2\pi\tau^2)^{-k/2} \exp\left\{ -\frac{1}{2\tau^2}\phi'\phi \right\}.$$
Large $\tau$ means a diffuse prior; small $\tau$ means a tight prior.

Deriving the Posterior
Bayes' Theorem:
$$p(\phi \mid Y) \propto p(Y \mid \phi) p(\phi) \propto \exp\left\{ -\frac{1}{2}\left[ (Y - X\phi)'(Y - X\phi) + \tau^{-2}\phi'\phi \right] \right\}.$$
Guess: what if $\phi \mid Y \sim N(\bar\phi_T, \bar V_T)$? Then
$$p(\phi \mid Y) \propto \exp\left\{ -\frac{1}{2}(\phi - \bar\phi_T)'\bar V_T^{-1}(\phi - \bar\phi_T) \right\}.$$
Rewrite the exponential term:
$$\begin{aligned}
& Y'Y - \phi'X'Y - Y'X\phi + \phi'X'X\phi + \tau^{-2}\phi'\phi \\
&\quad = Y'Y - \phi'X'Y - Y'X\phi + \phi'(X'X + \tau^{-2}I)\phi \\
&\quad = \left(\phi - (X'X + \tau^{-2}I)^{-1}X'Y\right)'\left(X'X + \tau^{-2}I\right)\left(\phi - (X'X + \tau^{-2}I)^{-1}X'Y\right) \\
&\qquad + Y'Y - Y'X(X'X + \tau^{-2}I)^{-1}X'Y.
\end{aligned}$$

Deriving the Posterior
The exponential term is a quadratic function of $\phi$. Deduce: the posterior distribution of $\phi$ must be multivariate normal, $\phi \mid Y \sim N(\bar\phi_T, \bar V_T)$, with
$$\bar\phi_T = (X'X + \tau^{-2}I)^{-1}X'Y, \qquad \bar V_T = (X'X + \tau^{-2}I)^{-1}.$$
Limiting cases:
- $\tau \to \infty$: $\phi \mid Y \stackrel{approx}{\sim} N\big(\hat\phi_{mle}, (X'X)^{-1}\big)$.
- $\tau \to 0$: $\phi \mid Y \stackrel{approx}{\sim}$ point mass at 0.
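These closed-form moments are easy to verify numerically. A minimal sketch (function name ours) on simulated data, illustrating both limiting cases:

```python
import numpy as np

def posterior_moments(Y, X, tau):
    """Moments of phi | Y ~ N(phi_T, V_T) under the N(0, tau^2 I) prior."""
    k = X.shape[1]
    V_T = np.linalg.inv(X.T @ X + np.eye(k) / tau**2)   # (X'X + tau^{-2} I)^{-1}
    phi_T = V_T @ (X.T @ Y)
    return phi_T, V_T

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1))
Y = 0.9 * X[:, 0] + rng.standard_normal(100)
for tau in (100.0, 1.0, 0.01):   # diffuse prior -> near OLS/MLE; tight prior -> shrunk to 0
    print(tau, posterior_moments(Y, X, tau)[0])
```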

Marginal Data Density
The marginal data density plays an important role in Bayesian model selection and averaging. Write
$$p(Y) = \frac{p(Y \mid \phi) p(\phi)}{p(\phi \mid Y)} = (2\pi)^{-T/2} \, |I + \tau^2 X'X|^{-1/2} \exp\left\{ -\frac{1}{2}\left[ Y'Y - Y'X(X'X + \tau^{-2}I)^{-1}X'Y \right] \right\}.$$
The exponential term measures goodness-of-fit; $|I + \tau^2 X'X|$ is a penalty for model complexity.
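The fit/complexity trade-off can be seen by evaluating the log marginal data density over a grid of prior scales. A minimal sketch (function name ours) implementing the formula above:

```python
import numpy as np

def log_mdd(Y, X, tau):
    """Log marginal data density for the known-variance regression model."""
    T, k = X.shape
    fit = Y @ Y - Y @ X @ np.linalg.inv(X.T @ X + np.eye(k) / tau**2) @ (X.T @ Y)
    _, logdet = np.linalg.slogdet(np.eye(k) + tau**2 * (X.T @ X))  # |I + tau^2 X'X|
    return -0.5 * T * np.log(2.0 * np.pi) - 0.5 * logdet - 0.5 * fit

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1))
Y = 0.9 * X[:, 0] + rng.standard_normal(100)
for tau in (0.01, 1.0, 100.0):
    print(tau, log_mdd(Y, X, tau))   # fit improves with tau, but the penalty grows
```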

Posterior
We will often abbreviate the posterior distribution $p(\phi \mid Y)$ by $\pi(\phi)$ and posterior expectations of $h(\phi)$ by
$$E_\pi[h] = E_\pi[h(\phi)] = \int h(\phi)\pi(\phi) \, d\phi = \int h(\phi) p(\phi \mid Y) \, d\phi.$$
We will focus on algorithms that generate draws $\{\phi^i\}_{i=1}^N$ from posterior distributions of parameters in time series models. These draws can then be transformed into objects of interest, $h(\phi^i)$, and under suitable conditions a Monte Carlo average of the form
$$\bar h_N = \frac{1}{N}\sum_{i=1}^N h(\phi^i)$$
approximates $E_\pi[h]$. Justification: strong law of large numbers (SLLN), central limit theorem (CLT).

Direct Sampling
In the simple linear regression model with Gaussian posterior it is possible to sample directly: for $i = 1$ to $N$, draw $\phi^i$ from $N(\bar\phi_T, \bar V_T)$. Provided that $V_\pi[h(\phi)] < \infty$, we can deduce from Kolmogorov's SLLN and the Lindeberg-Levy CLT that
$$\bar h_N \stackrel{a.s.}{\longrightarrow} E_\pi[h], \qquad \sqrt{N}\left(\bar h_N - E_\pi[h]\right) \Longrightarrow N\left(0, V_\pi[h(\phi)]\right).$$
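A minimal direct-sampling sketch (the posterior moments below are illustrative stand-ins, not values computed in the chapter):

```python
import numpy as np

rng = np.random.default_rng(1)
phi_T, V_T = np.array([0.85]), np.array([[0.002]])   # stand-in posterior moments
N = 10_000
draws = rng.multivariate_normal(phi_T, V_T, size=N)  # iid phi^i ~ N(phi_T, V_T)

h = draws[:, 0] ** 2                      # object of interest h(phi) = phi^2
h_bar = h.mean()                          # SLLN: converges to E_pi[h]
nse = h.std(ddof=1) / np.sqrt(N)          # CLT-based numerical standard error
print(f"h_bar = {h_bar:.4f}, numerical std. error = {nse:.5f}")
```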

Decision Making
The posterior expected loss associated with a decision $\delta(\cdot)$ is given by
$$\rho\left(\delta(\cdot) \mid Y\right) = \int_\Theta L\left(\theta, \delta(Y)\right) p(\theta \mid Y) \, d\theta.$$
A Bayes decision is a decision that minimizes the posterior expected loss:
$$\delta^*(Y) = \operatorname{argmin}_d \; \rho\left(\delta(\cdot) \mid Y\right).$$
Since in most applications it is not feasible to derive the posterior expected loss analytically, we replace $\rho\left(\delta(\cdot) \mid Y\right)$ by a Monte Carlo approximation of the form
$$\bar\rho_N\left(\delta(\cdot) \mid Y\right) = \frac{1}{N}\sum_{i=1}^N L\left(\theta^i, \delta(\cdot)\right).$$
A numerical approximation to the Bayes decision $\delta^*(\cdot)$ is then given by
$$\delta^*_N(Y) = \operatorname{argmin}_d \; \bar\rho_N\left(\delta(\cdot) \mid Y\right).$$

Inference
Point estimation:
- Quadratic loss: posterior mean
- Absolute error loss: posterior median
Interval/set estimation, $P_\pi\{\theta \in C(Y)\} = 1 - \alpha$:
- highest posterior density sets
- equal-tail-probability intervals

Forecasting
Example:
$$y_{T+h} = \theta^h y_T + \sum_{s=0}^{h-1} \theta^s u_{T+h-s}.$$
$h$-step-ahead conditional distribution:
$$y_{T+h} \mid (Y_{1:T}, \theta) \sim N\left(\theta^h y_T, \; \frac{1 - \theta^{2h}}{1 - \theta^2}\right).$$
Posterior predictive distribution:
$$p(y_{T+h} \mid Y_{1:T}) = \int p(y_{T+h} \mid y_T, \theta) p(\theta \mid Y_{1:T}) \, d\theta.$$
For each draw $\theta^i$ from the posterior distribution $p(\theta \mid Y_{1:T})$, sample a sequence of innovations $u^i_{T+1}, \ldots, u^i_{T+h}$ and compute $y^i_{T+h}$ as a function of $\theta^i$, $u^i_{T+1}, \ldots, u^i_{T+h}$, and $Y_{1:T}$.
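The simulation scheme in the last step can be sketched in a few lines of Python (here the posterior draws of $\theta$ are replaced by an illustrative normal stand-in):

```python
import numpy as np

rng = np.random.default_rng(2)
y_T, h = 1.0, 4                                    # last observation, forecast horizon
theta_draws = rng.normal(0.85, 0.05, size=10_000)  # stand-in for posterior draws of theta

y_pred = np.empty(theta_draws.size)
for i, theta in enumerate(theta_draws):
    y = y_T
    for _ in range(h):                 # iterate y_{t+1} = theta * y_t + u_{t+1}
        y = theta * y + rng.standard_normal()
    y_pred[i] = y                      # one draw from the posterior predictive

# Point forecast and equal-tail 90% predictive interval
print(y_pred.mean(), np.percentile(y_pred, [5, 95]))
```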

Model Uncertainty
Assign prior probabilities $\gamma_{j,0}$ to models $M_j$, $j = 1, \ldots, J$. Posterior model probabilities are given by
$$\gamma_{j,T} = \frac{\gamma_{j,0} \, p(Y \mid M_j)}{\sum_{j=1}^J \gamma_{j,0} \, p(Y \mid M_j)}, \quad \text{where} \quad p(Y \mid M_j) = \int p(Y \mid \theta_{(j)}, M_j) p(\theta_{(j)} \mid M_j) \, d\theta_{(j)}.$$
Log marginal data densities are sums of one-step-ahead predictive scores:
$$\ln p(Y \mid M_j) = \sum_{t=1}^T \ln \int p(y_t \mid \theta_{(j)}, Y_{1:t-1}, M_j) p(\theta_{(j)} \mid Y_{1:t-1}, M_j) \, d\theta_{(j)}.$$
Model averaging:
$$p(h \mid Y) = \sum_{j=1}^J \gamma_{j,T} \, p\left(h_j(\theta_{(j)}) \mid Y, M_j\right).$$

A Non-Gaussian Posterior
Suppose that $y_t$ is determined by the AR(1) model, but the object of interest is $\theta$, which can only be bounded based on $\phi$: $\phi \le \theta$ and $\theta \le \phi + 1$. The parameter $\theta$ is set-identified, and the interval $\Theta(\phi) = [\phi, \phi + 1]$ is called the identified set. We use a prior for $\theta$ conditional on $\phi$ of the form $\theta \mid \phi \sim U[\phi, \phi + 1]$.

A Non-Gaussian Posterior
Joint posterior of $\theta$ and $\phi$:
$$p(\theta, \phi \mid Y) = p(\phi \mid Y) p(\theta \mid \phi, Y) \propto p(Y \mid \phi) p(\theta \mid \phi) p(\phi).$$
Since $\theta$ does not enter the likelihood function, we deduce that
$$p(\phi \mid Y) = \frac{p(Y \mid \phi) p(\phi)}{\int p(Y \mid \phi) p(\phi) \, d\phi}, \qquad p(\theta \mid \phi, Y) = p(\theta \mid \phi).$$
In our example the marginal posterior distribution of $\theta$ is given by
$$\pi(\theta) = \int_{\theta - 1}^{\theta} p(\phi \mid Y) p(\theta \mid \phi) \, d\phi = \Phi_N\left(\frac{\theta - \bar\phi}{\sqrt{\bar V}}\right) - \Phi_N\left(\frac{\theta - 1 - \bar\phi}{\sqrt{\bar V}}\right),$$
where $\Phi_N(x)$ is the cumulative distribution function of a $N(0, 1)$.
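A short sketch (function name ours) that evaluates this marginal posterior on a grid, using the parameter values from the figure that follows:

```python
import numpy as np
from scipy.stats import norm

def posterior_theta(theta, phi_bar, V):
    """Marginal posterior density of the set-identified parameter theta."""
    s = np.sqrt(V)
    return norm.cdf((theta - phi_bar) / s) - norm.cdf((theta - 1.0 - phi_bar) / s)

theta_grid = np.linspace(-2.0, 2.0, 401)
for V in (1 / 4, 1 / 20, 1 / 100):            # variances from the figure below
    pi = posterior_theta(theta_grid, 0.5, V)  # phi_bar = 0.5 as in the figure
    print(V, pi.max())  # as V -> 0, pi approaches the U[0.5, 1.5] density (height 1)
```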

What if the Posterior is Non-Gaussian?
[Figure: posterior distribution $\pi(\theta)$ for $\bar\phi = 0.5$ and $\bar V$ equal to 1/4 (dotted), 1/20 (dashed), and 1/100 (solid).]

Importance Sampling
Approximate $\pi(\cdot)$ by using a different, tractable density $g(\theta)$ that is easy to sample from. For more general problems, the posterior density may be known only up to a normalization constant, so we write
$$\pi(\theta) = \frac{p(Y \mid \theta) p(\theta)}{p(Y)} = \frac{f(\theta)}{Z}.$$
Importance sampling is based on the identity
$$E_\pi[h(\theta)] = \int_\Theta h(\theta)\pi(\theta) \, d\theta = \frac{1}{Z}\int_\Theta h(\theta)\frac{f(\theta)}{g(\theta)} g(\theta) \, d\theta.$$
The ratio $w(\theta) = \frac{f(\theta)}{g(\theta)}$ is called the (unnormalized) importance weight.

Importance Sampling
1. For $i = 1$ to $N$, draw $\theta^i \stackrel{iid}{\sim} g(\theta)$ and compute the unnormalized importance weights
$$w^i = w(\theta^i) = \frac{f(\theta^i)}{g(\theta^i)}.$$
2. Compute the normalized importance weights
$$W^i = \frac{w^i}{\frac{1}{N}\sum_{i=1}^N w^i}.$$
An approximation of $E_\pi[h(\theta)]$ is given by
$$\bar h_N = \frac{1}{N}\sum_{i=1}^N W^i h(\theta^i).$$
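A minimal importance sampling sketch. To keep it self-checking we use a standard normal as the unnormalized target $f$ rather than the chapter's non-Gaussian posterior, so the answer $E_\pi[\theta^2] = 1$ is known:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
N = 100_000

f = lambda th: np.exp(-0.5 * th**2)           # unnormalized target f(theta)
g_scale = 2.0                                 # diffuse Gaussian proposal g
theta = rng.normal(0.0, g_scale, size=N)      # step 1: theta^i ~ g, iid
w = f(theta) / norm.pdf(theta, 0.0, g_scale)  # unnormalized weights w^i

W = w / w.mean()                              # step 2: normalized weights W^i
h_bar = np.mean(W * theta**2)                 # approximates E_pi[theta^2] = 1
print(h_bar)
```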

Importance Sampling Distribution
[Figure: posterior density $\pi(\theta)$ (solid) and two importance sampling densities $g(\theta)$: concentrated (dashed) and diffuse (dotted).]

Accuracy
Since we are generating iid draws from $g(\theta)$, it is fairly straightforward to derive a CLT: it can be shown that
$$\sqrt{N}\left(\bar h_N - E_\pi[h]\right) \Longrightarrow N\left(0, \Omega(h)\right), \quad \text{where} \quad \Omega(h) = V_g\left[(\pi/g)(h - E_\pi[h])\right].$$
Using a crude approximation (see, e.g., Liu (2008)), we can factorize $\Omega(h)$ as follows:
$$\Omega(h) \approx V_\pi[h]\left(V_g[\pi/g] + 1\right).$$
The approximation highlights that the larger the variance of the importance weights, the less accurate the Monte Carlo approximation relative to the accuracy that could be achieved with an iid sample from the posterior. Users often monitor the effective sample size
$$ESS = N\frac{V_\pi[h]}{\Omega(h)} \approx \frac{N}{1 + V_g[\pi/g]}.$$
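The ESS diagnostic is easy to compute from the unnormalized weights. A minimal sketch (function name ours):

```python
import numpy as np

def effective_sample_size(w):
    """ESS approx N / (1 + V_g[pi/g]), estimated from unnormalized weights w^i."""
    W = np.asarray(w, dtype=float)
    W = W / W.mean()                 # normalized weights have mean one
    return len(W) / (1.0 + W.var())  # sample analogue of N / (1 + V_g[pi/g])

# Uneven (lognormal) weights yield an ESS well below the nominal sample size N.
w = np.random.default_rng(5).lognormal(sigma=1.0, size=10_000)
print(effective_sample_size(w))
```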

Inefficiency Factors for Concentrated IS Density
[Figure: large-sample inefficiency factors $InEff = \Omega(h)/V_\pi[h]$ (dashed) and their small-sample approximations (solid) based on $N_{run} = 1{,}000$, plotted against $N$, for $h(\theta) = \theta$ (triangles) and $h(\theta) = \theta^2$ (circles). The solid line (no symbols) depicts the approximate inefficiency factor $1 + V_g[\pi/g]$.]

Inefficiency Factors for Diffuse IS Density
[Figure: large-sample inefficiency factors $InEff = \Omega(h)/V_\pi[h]$ (dashed) and their small-sample approximations (solid) based on $N_{run} = 1{,}000$, plotted against $N$, for $h(\theta) = \theta$ (triangles) and $h(\theta) = \theta^2$ (circles). The solid line (no symbols) depicts the approximate inefficiency factor $1 + V_g[\pi/g]$.]

Markov Chain Monte Carlo (MCMC)
Main idea: create a sequence of serially correlated draws such that the distribution of $\theta^i$ converges to the posterior distribution $p(\theta \mid Y)$.

Generic Metropolis-Hastings Algorithm
For $i = 1$ to $N$:
1. Draw $\vartheta$ from a density $q(\vartheta \mid \theta^{i-1})$.
2. Set $\theta^i = \vartheta$ with probability
$$\alpha(\vartheta \mid \theta^{i-1}) = \min\left\{1, \; \frac{p(Y \mid \vartheta)p(\vartheta)/q(\vartheta \mid \theta^{i-1})}{p(Y \mid \theta^{i-1})p(\theta^{i-1})/q(\theta^{i-1} \mid \vartheta)}\right\}$$
and $\theta^i = \theta^{i-1}$ otherwise.
Recall $p(\theta \mid Y) \propto p(Y \mid \theta)p(\theta)$. Drawing $\theta^i$ conditional on the previous draw $\theta^{i-1}$ leads to a Markov transition kernel $K(\theta \mid \tilde\theta)$.
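A minimal sketch of the common random walk special case, where the symmetric proposal makes the $q$ terms cancel in $\alpha$ (the function name and the standard normal target are ours, for illustration):

```python
import numpy as np

def rwmh(log_post, theta0, scale, N, rng):
    """Random walk Metropolis-Hastings for a scalar parameter."""
    draws = np.empty(N)
    theta, lp = theta0, log_post(theta0)
    for i in range(N):
        prop = theta + scale * rng.standard_normal()  # draw vartheta from q(.|theta)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:      # accept with prob min{1, ratio}
            theta, lp = prop, lp_prop                 # move to the proposed draw
        draws[i] = theta                              # otherwise keep the old draw
    return draws

rng = np.random.default_rng(4)
out = rwmh(lambda th: -0.5 * th**2, theta0=0.0, scale=1.0, N=50_000, rng=rng)
print(out.mean(), out.var())   # should be near 0 and 1 for the N(0,1) target
```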

Invariance Property
It can be shown that
$$p(\theta \mid Y) = \int K(\theta \mid \tilde\theta) p(\tilde\theta \mid Y) \, d\tilde\theta.$$
Write
$$K(\theta \mid \tilde\theta) = u(\theta \mid \tilde\theta) + r(\tilde\theta)\delta_{\tilde\theta}(\theta),$$
where $u(\theta \mid \tilde\theta)$ is the density kernel for accepted draws (note that $u(\theta \mid \cdot)$ does not integrate to one):
$$u(\theta \mid \tilde\theta) = \alpha(\theta \mid \tilde\theta) q(\theta \mid \tilde\theta).$$
Rejection probability:
$$r(\tilde\theta) = \int \left[1 - \alpha(\theta \mid \tilde\theta)\right] q(\theta \mid \tilde\theta) \, d\theta = 1 - \int u(\theta \mid \tilde\theta) \, d\theta.$$

Invariance Property
Reversibility: conditional on the sampler not rejecting the proposed draw, the density associated with a transition from $\tilde\theta$ to $\theta$ is identical to the density associated with a transition from $\theta$ to $\tilde\theta$:
$$\begin{aligned}
p(\tilde\theta \mid Y) u(\theta \mid \tilde\theta) &= p(\tilde\theta \mid Y) q(\theta \mid \tilde\theta)\min\left\{1, \; \frac{p(\theta \mid Y)/q(\theta \mid \tilde\theta)}{p(\tilde\theta \mid Y)/q(\tilde\theta \mid \theta)}\right\} \\
&= \min\left\{ p(\tilde\theta \mid Y) q(\theta \mid \tilde\theta), \; p(\theta \mid Y) q(\tilde\theta \mid \theta) \right\} \\
&= p(\theta \mid Y) q(\tilde\theta \mid \theta)\min\left\{ \frac{p(\tilde\theta \mid Y)/q(\tilde\theta \mid \theta)}{p(\theta \mid Y)/q(\theta \mid \tilde\theta)}, \; 1 \right\} = p(\theta \mid Y) u(\tilde\theta \mid \theta).
\end{aligned}$$
Using the reversibility result, we can now verify the invariance property:
$$\begin{aligned}
\int K(\theta \mid \tilde\theta) p(\tilde\theta \mid Y) \, d\tilde\theta &= \int u(\theta \mid \tilde\theta) p(\tilde\theta \mid Y) \, d\tilde\theta + \int r(\tilde\theta)\delta_{\tilde\theta}(\theta) p(\tilde\theta \mid Y) \, d\tilde\theta \\
&= \int u(\tilde\theta \mid \theta) p(\theta \mid Y) \, d\tilde\theta + r(\theta) p(\theta \mid Y) \\
&= p(\theta \mid Y).
\end{aligned}$$

A Discrete Example
Suppose the parameter vector $\theta$ is scalar and takes only two values: $\Theta = \{\tau_1, \tau_2\}$. The posterior distribution $p(\theta \mid Y)$ can be represented by a set of probabilities collected in the vector $\pi = [\pi_1, \pi_2]$, with $\pi_2 > \pi_1$. Suppose we obtain the proposal $\vartheta$ based on the transition matrix
$$Q = \begin{bmatrix} q & (1-q) \\ (1-q) & q \end{bmatrix}.$$

Discrete MH Algorithm
Iteration $i$: suppose that $\theta^{i-1} = \tau_j$. Based on the transition matrix $Q$, determine a proposed state $\vartheta = \tau_s$. With probability $\alpha(\tau_s \mid \tau_j)$ the proposed state is accepted: set $\theta^i = \vartheta = \tau_s$. With probability $1 - \alpha(\tau_s \mid \tau_j)$ stay in the old state and set $\theta^i = \theta^{i-1} = \tau_j$. Choose ($Q$ terms cancel because of symmetry)
$$\alpha(\tau_s \mid \tau_j) = \min\left\{1, \; \frac{\pi_s}{\pi_j}\right\}.$$

Discrete MH Algorithm: Transition Matrix
The resulting chain's transition matrix is
$$K = \begin{bmatrix} q & (1-q) \\ (1-q)\frac{\pi_1}{\pi_2} & q + (1-q)\left(1 - \frac{\pi_1}{\pi_2}\right) \end{bmatrix}.$$
Straightforward calculations reveal that the transition matrix $K$ has eigenvalues
$$\lambda_1(K) = 1, \qquad \lambda_2(K) = q - (1-q)\frac{\pi_1}{1 - \pi_1}.$$
The equilibrium distribution is the eigenvector associated with the unit eigenvalue. For $q \in [0, 1)$ the equilibrium distribution is unique.
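The eigenvalue and equilibrium claims can be checked numerically. A short sketch (parameter values are illustrative):

```python
import numpy as np

q, pi1 = 0.3, 0.2       # illustrative proposal and posterior probabilities
pi2 = 1.0 - pi1
K = np.array([[q, 1.0 - q],
              [(1.0 - q) * pi1 / pi2, q + (1.0 - q) * (1.0 - pi1 / pi2)]])

eig = np.sort(np.real(np.linalg.eigvals(K)))[::-1]
print(eig)                                # lambda_1 = 1 and lambda_2
print(q - (1.0 - q) * pi1 / (1.0 - pi1))  # matches the lambda_2(K) formula

# The left eigenvector for the unit eigenvalue recovers the posterior [pi1, pi2].
vals, vecs = np.linalg.eig(K.T)
stat = np.real(vecs[:, np.argmax(np.real(vals))])
print(stat / stat.sum())
```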

Convergence
The persistence of the Markov chain depends on the second eigenvalue, which in turn depends on the proposal distribution $Q$. Define the transformed parameter
$$\xi^i = \frac{\theta^i - \tau_1}{\tau_2 - \tau_1}.$$
We can represent the Markov chain associated with $\xi^i$ as a first-order autoregressive process
$$\xi^i = (1 - k_{11}) + \lambda_2(K)\xi^{i-1} + \nu^i.$$
Conditional on $\xi^{i-1} = j$, $j = 0, 1$, the innovation $\nu^i$ has support on $-k_{jj}$ and $(1 - k_{jj})$; its conditional mean is equal to zero, and its conditional variance is equal to $k_{jj}(1 - k_{jj})$.

Convergence
Autocovariance function of $h(\theta^i)$:
$$COV\left(h(\theta^i), h(\theta^{i-l})\right) = \left(h(\tau_2) - h(\tau_1)\right)^2 \pi_1(1 - \pi_1)\left(q - (1-q)\frac{\pi_1}{1 - \pi_1}\right)^l = V_\pi[h]\left(q - (1-q)\frac{\pi_1}{1 - \pi_1}\right)^l.$$
If $q = \pi_1$ then the autocovariances are equal to zero and the draws $h(\theta^i)$ are serially uncorrelated (in fact, in our simple discrete setting they are also independent).

Convergence
Define the Monte Carlo estimate
$$\bar h_N = \frac{1}{N}\sum_{i=1}^N h(\theta^i).$$
Deduce from the CLT that $\sqrt{N}(\bar h_N - E_\pi[h]) \Longrightarrow N(0, \Omega(h))$, where $\Omega(h)$ is the long-run variance
$$\Omega(h) = \lim_{L \to \infty} V_\pi[h]\left(1 + 2\sum_{l=1}^L \frac{L - l}{L}\left(q - (1-q)\frac{\pi_1}{1 - \pi_1}\right)^l\right).$$
In turn, the asymptotic inefficiency factor is given by
$$InEff = \frac{\Omega(h)}{V_\pi[h]} = 1 + 2\lim_{L \to \infty}\sum_{l=1}^L \frac{L - l}{L}\left(q - (1-q)\frac{\pi_1}{1 - \pi_1}\right)^l.$$
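A small sketch (function name ours) that evaluates this inefficiency factor at the $q$ values used in the figures below:

```python
import numpy as np

def ineff(q, pi1, L=10_000):
    """Asymptotic inefficiency factor 1 + 2 sum_l ((L-l)/L) lambda_2^l, large L."""
    lam2 = q - (1.0 - q) * pi1 / (1.0 - pi1)
    l = np.arange(1, L + 1)
    return 1.0 + 2.0 * np.sum((L - l) / L * lam2**l)

# InEff equals 1 at q = pi1 (uncorrelated draws) and explodes as q -> 1.
for q in (0.0, 0.2, 0.5, 0.99):
    print(q, ineff(q, pi1=0.2))
```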

Autocorrelation Function of $\theta^i$
[Figure: autocorrelation functions of $\theta^i$ for $q = 0.00$, $0.20$, $0.50$, and $0.99$, plotted over lags 0 through 9.]

Asymptotic Inefficiency
[Figure: asymptotic inefficiency factor $InEff$ (log scale, roughly $10^{-1}$ to $10^2$) as a function of $q \in [0, 1]$.]

Small Sample Variance $V[\bar h_N]$ versus HAC Estimates of $\Omega(h)$
[Figure: scatter plot of the small-sample variance $V[\bar h_N]$ against HAC estimates of $\Omega(h)$, both on log scales.]