The Metropolis-Hastings Algorithm. June 8, 2012


1 The Metropolis-Hastings Algorithm June 8, 2012

2 The Plan 1. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings algorithm Most of today's lecture can be found in Chib (2001) and An and Schorfheide (2007)

3 Bayesian statistics: From last time Variance of estimator (frequentist) vs variance of parameter (Bayesian) Subjective view of probability Probabilities are statements about our knowledge Treating model parameters as random variables thus does not mean that we think that they necessarily vary over time.

4 Bayesian statistics: From last time Bayesian procedures can be derived from 4 Bayesian Principles: 1. The Likelihood Principle 2. The Sufficiency Principle 3. The Conditionality Principle 4. The Stopping Rule Principle

5 Main concepts and notation The main components in Bayesian inference are: Data (observables) Z_T ∈ R^(T×n) A model: Parameters θ ∈ R^k A prior distribution p(θ): R^k → R_+ Likelihood function p(Z|θ): R^(T×n) × R^k → R_+ Posterior density p(θ|Z): R^(T×n) × R^k → R_+ We need a method to construct the posterior density

6 The end product of Bayesian statistics Most of Bayesian econometrics consists of simulating distributions of parameters using numerical methods. A simulated posterior is a numerical approximation to the distribution p(Z|θ)p(θ). This is useful since the distribution p(Z|θ)p(θ) is (by Bayes rule) proportional to p(θ|Z): p(θ|Z) = p(Z|θ)p(θ)/p(Z) We rely on ergodicity, i.e. that the moments of the constructed sample correspond to the moments of the distribution p(θ|Z). The most popular (and general) procedure to simulate the posterior is called the Metropolis-Hastings Algorithm

7 Metropolis-Hastings Metropolis-Hastings is a way to simulate a sample from a target distribution. In practice, the target distribution will in most cases be the posterior density p(θ|Z), but it doesn't have to be. But what does it mean to sample from a given distribution?

8 Sampling from a distribution Example: Standard Normal distribution N(0, 1) If the distribution is ergodic, the sample shares all the properties of the true distribution asymptotically. Why don't we just compute the moments directly? When we can, we should (as in the Normal distribution's case). Not always possible, either for tractability reasons or because of the computational burden
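
As a concrete illustration (a sketch assuming NumPy, not part of the lecture): draw a large sample from N(0, 1) and compare the sample moments with the known true moments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw an i.i.d. sample from the standard normal distribution
sample = rng.standard_normal(100_000)

# Sample moments approximate the true moments as the sample grows
print("sample mean:", sample.mean())                      # close to 0
print("sample variance:", sample.var())                   # close to 1
print("P(x < 1.96) in sample:", (sample < 1.96).mean())   # close to 0.975
```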

9 Sample from Standard Normal

10 Sample from Standard Normal

11 Sample from Standard Normal

12 Sample from Standard Normal

13 Sampling from a distribution Example: Standard Normal distribution N(0, 1) If the distribution is ergodic, the sample shares all the properties of the true distribution asymptotically. Why don't we just compute the moments directly? When we can, we should (as in the Normal distribution's case). Not always possible, either for tractability reasons or because of the computational burden

14 What is a Markov Chain? A process for which the distribution of next period's variables is independent of the past, once we condition on the current state. A Markov chain is a stochastic process with the Markov property. Markov chain is sometimes taken to mean only processes with a countable number of states; here, the term will be used in the broader sense. Markov Chain Monte Carlo methods provide a way of generating samples that share the properties of the target density (i.e. the object of interest)

15 What is a Markov Chain? Markov property: p(θ_{j+1} | θ_j) = p(θ_{j+1} | θ_j, θ_{j-1}, θ_{j-2}, ..., θ_1) The transition density function p(θ_{j+1} | θ_j) describes the distribution of θ_{j+1} conditional on θ_j. In most applications, we know the conditional transition density and can figure out unconditional properties like E(θ) and E(θ^2). MCMC methods can be used to do the opposite: determine a particular conditional transition density such that the unconditional distribution converges to that of the target distribution. Let's have a first look at the Metropolis-Hastings Algorithm

16 The Random-Walk Metropolis Algorithm
1. Start with an arbitrary value θ_1
2. Update from θ_j to θ_{j+1} (j = 1, 2, ..., J) by
2.1 Generate θ* ~ N(θ_j, Σ)
2.2 Define α = min( p(θ*)/p(θ_j), 1 )   (1)
2.3 Take θ_{j+1} = θ* with probability α, θ_j otherwise
3. Repeat Step 2 J times.
But why does it work?
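
Before turning to why it works, here is what the mechanics look like in code. This is an illustrative sketch (not the lecture's own code): a random-walk Metropolis sampler for a target density known up to a constant, with a standard normal target so the output can be checked against known moments.

```python
import numpy as np

def rw_metropolis(log_p, theta0, scale, J, seed=0):
    """Random-walk Metropolis: log_p is the log target density (up to a constant)."""
    rng = np.random.default_rng(seed)
    draws = np.empty(J)
    theta = theta0
    for j in range(J):
        candidate = theta + scale * rng.standard_normal()       # theta* ~ N(theta_j, scale^2)
        log_alpha = min(log_p(candidate) - log_p(theta), 0.0)   # alpha = min(p(theta*)/p(theta_j), 1)
        if np.log(rng.uniform()) < log_alpha:
            theta = candidate                                   # accept with probability alpha
        draws[j] = theta
    return draws

# Target: standard normal, supplied only up to a normalizing constant
draws = rw_metropolis(lambda th: -0.5 * th**2, theta0=0.0, scale=1.0, J=50_000)
print(draws.mean(), draws.var())   # approximately 0 and 1
```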

17 Simulating distributions using MCMC methods We can look at a simple discrete state space example: θ can take two values, θ ∈ {θ^L, θ^H} Probabilities p(θ^L) = 0.4 and p(θ^H) = 0.6 How can we construct a sequence θ^(1), θ^(2), θ^(3), ..., θ^(J) such that the relative number of occurrences of θ^L and θ^H in the sample correspond to those of the density described by p(θ)?

18 A two-state Markov Chain for θ Transition probabilities are defined as π_{i,k} = p(θ_{j+1} = θ^i | θ_j = θ^k), i, k ∈ {L, H} The unconditional probabilities solve the equation [π_L π_H] = [π_L π_H] [π_LL, π_HL; π_LH, π_HH] Problem: Define transition probabilities such that the unconditional probabilities equal those of the target distribution

19 A two-state Markov Chain for θ We want the chain to spend more time in the more likely state θ^H, but not all the time. If θ_j = θ^L then θ_{j+1} = θ^H If θ_j = θ^H then θ_{j+1} = θ^L with probability α and θ_{j+1} = θ^H with probability (1 - α) Finding the right α: [π_L π_H] = [π_L π_H] [0, 1; α, (1 - α)] We get two equations, π_L = α π_H and π_H = π_L + (1 - α) π_H; solving for α gives α = π_L / π_H

20 Simulating distributions using MCMC methods Let's do an example by hand.
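
The same example can also be checked by simulation. A minimal sketch (mine, not the lecture's): with p(θ^L) = 0.4 and p(θ^H) = 0.6, the derivation above gives α = π_L/π_H = 2/3.

```python
import numpy as np

rng = np.random.default_rng(0)
pi_L, pi_H = 0.4, 0.6
alpha = pi_L / pi_H          # move H -> L with probability 2/3

state = "L"
counts = {"L": 0, "H": 0}
for _ in range(100_000):
    if state == "L":
        state = "H"          # from L always move to H
    elif rng.uniform() < alpha:
        state = "L"          # from H move to L with probability alpha
    counts[state] += 1

total = sum(counts.values())
print(counts["L"] / total, counts["H"] / total)   # approximately 0.4 and 0.6
```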

21 Simulating distributions using MCMC methods Using the ratio of relative probability/density to decide whether to accept a candidate turns out to be extremely general: exactly the same condition ensures that the Markov chain converges to the target distribution also for continuous parameter spaces. Loosely speaking, it is the condition that makes sure that the Markov chain spends just the right amount of time at each point in the parameter space. How the candidate is generated matters very little, as long as the chain is: Irreducible Aperiodic Whether these properties hold partly depends on the choices that determine how the proposal density is generated

22 Proposal density Any proposal density with infinite support would do (e.g. normal). The choice is a matter of efficiency. Some options: RW-MH Adaptive Random Walk Independent M-H

23 The Random-Walk Metropolis Algorithm
1. Start with an arbitrary value θ_1
2. Update from θ_j to θ_{j+1} (j = 1, 2, ..., J) by
2.1 Generate θ* ~ N(θ_j, Σ)
2.2 Define α = min( L(Z|θ*) / L(Z|θ_j), 1 )   (2)
2.3 Take θ_{j+1} = θ* with probability α, θ_j otherwise
3. Repeat Step 2 J times.
We can use this to estimate the likelihood function of a simple DSGE model

24 A simple DSGE model
x_t = ρ x_{t-1} + u^x_t
y_t = E_t(y_{t+1}) - (1/γ) [r_t - E_t(π_{t+1})] + u^y_t
π_t = E_t(π_{t+1}) + κ [y_t - x_t] + u^π_t
r_t = φ_π π_t

25 A simple DSGE model Substitute the interest rate rule into the Euler equation:
x_t = ρ x_{t-1} + u^x_t
y_t = E_t(y_{t+1}) - (1/γ) [φ_π π_t - E_t(π_{t+1})] + u^y_t
π_t = E_t(π_{t+1}) + κ [y_t - x_t] + u^π_t

26 Solving model using method of undetermined coefficients Conjecture that the model solution can be put in the form
x_t = ρ x_{t-1} + u^x_t
y_t = a x_t + u^y_t
π_t = b x_t + u^π_t
Why is this a good guess?

27 Solving model using method of undetermined coefficients Substitute the conjectured form of the solution (ignoring the shocks u^y_t and u^π_t for now) into the structural equations:
a x_t = a ρ x_t - (1/γ) [φ_π b x_t - b ρ x_t]
b x_t = b ρ x_t + κ [a x_t - x_t]
where we used that x_t = ρ x_{t-1} + u^x_t implies that E[x_{t+1} | x_t] = ρ x_t

28 Solving model using method of undetermined coefficients Equate coefficients on the right and left hand sides:
a = a ρ - (1/γ) φ_π b + (1/γ) b ρ
b = b ρ + κ [a - 1]
or
[(1 - ρ), (φ_π - ρ)/γ; -κ, (1 - ρ)] [a; b] = [0; -κ]

29 Solving model using method of undetermined coefficients Solve for a and b:
[a; b] = [(1 - ρ), (φ_π - ρ)/γ; -κ, (1 - ρ)]^(-1) [0; -κ]
or
a = κ(φ_π - ρ)/c,  b = -κγ(1 - ρ)/c
where c = γ - κρ - 2γρ + κφ_π + γρ^2
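
As a quick numerical check of this solution, here is a sketch that solves the 2×2 system directly and compares it with the closed form, using illustrative parameter values (ρ = 0.9, γ = 2, κ = 0.1, φ_π = 1.5, consistent with the calibration used later in the lecture).

```python
import numpy as np

rho, gamma, kappa, phi = 0.9, 2.0, 0.1, 1.5

# Linear system from equating coefficients:
# [(1-rho), (phi-rho)/gamma] [a]   [   0   ]
# [ -kappa,     (1-rho)    ] [b] = [-kappa ]
M = np.array([[1 - rho, (phi - rho) / gamma],
              [-kappa,  1 - rho]])
rhs = np.array([0.0, -kappa])
a, b = np.linalg.solve(M, rhs)

# Closed form: a = kappa*(phi-rho)/c, b = -kappa*gamma*(1-rho)/c
c = gamma - kappa * rho - 2 * gamma * rho + kappa * phi + gamma * rho**2
print(a, kappa * (phi - rho) / c)         # both 0.75
print(b, -kappa * gamma * (1 - rho) / c)  # both -0.25
```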

30 A simple DSGE model The solved model:
x_t = ρ x_{t-1} + u^x_t
y_t = [κ(φ_π - ρ)/c] x_t + u^y_t
π_t = -[κγ(1 - ρ)/c] x_t + u^π_t
where c = γ - κρ - 2γρ + κφ_π + γρ^2
We want to estimate the distributions of θ = {ρ, γ, κ, φ_π, σ_x, σ_y, σ_π}

31 A simple DSGE model Put the solved model in state space form:
X_t = A X_{t-1} + C u_t
Z_t = D X_t + v_t
where X_t = x_t, A = ρ, C u_t = u^x_t,
Z_t = [y_t; π_t], D = [κ(φ_π - ρ)/c; -κγ(1 - ρ)/c], v_t = [u^y_t; u^π_t]

32 The log likelihood function of a state space system For a given state space system
X_t = A X_{t-1} + C u_t
Z_t (p × 1) = D X_t + v_t
we can evaluate the log likelihood by computing
L(Z|Θ) = -0.5 Σ_{t=1}^{T} [ p ln(2π) + ln|Ω_t| + Z̃_t' Ω_t^(-1) Z̃_t ]
where Z̃_t are the innovations from the Kalman filter and Ω_t their covariance matrix
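
A compact sketch of this computation for the scalar-state system above (my own illustration; the function and variable names are assumptions, not the lecture's code). Each pass predicts the state, forms the innovation Z̃_t and its covariance Ω_t, adds the corresponding term to the log likelihood, and updates the state estimate.

```python
import numpy as np

def kalman_loglike(Z, A, C, D, sig_u, sig_v):
    """Log likelihood of Z (T x p) for X_t = A X_{t-1} + C u_t, Z_t = D X_t + v_t, scalar state."""
    T, p = Z.shape
    Q = (C * sig_u) ** 2                 # state innovation variance
    R = np.diag(sig_v ** 2)              # measurement error covariance (p x p)
    x, P = 0.0, Q / (1 - A ** 2)         # start from the unconditional distribution of x
    loglike = 0.0
    for t in range(T):
        x_pred = A * x                   # predict the state
        P_pred = A * P * A + Q
        Omega = np.outer(D, D) * P_pred + R
        ztilde = Z[t] - D * x_pred       # innovation
        Omega_inv = np.linalg.inv(Omega)
        loglike += -0.5 * (p * np.log(2 * np.pi)
                           + np.log(np.linalg.det(Omega))
                           + ztilde @ Omega_inv @ ztilde)
        K = P_pred * D @ Omega_inv       # Kalman gain
        x = x_pred + K @ ztilde          # update the state estimate
        P = P_pred - (P_pred * D) @ Omega_inv @ (D * P_pred)
    return loglike
```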

33 The parameter vector θ and the variance of the random walk innovations in the MCMC, Σ Parameterize the model according to θ = {ρ, γ, κ, φ_π, σ_x, σ_y, σ_π} = {0.9, 2, 0.1, 1.5, 1, 1, 1} Generate a sample of length T from the true model Set the starting value θ^(1) equal to the true parameter vector (OK, this option is not available in practice.) Set the covariance matrix of the random walk increments in the Metropolis Algorithm proportional to the absolute values of the true parameters: Σ = ε diag(abs(θ))

34 The Random-Walk Metropolis Algorithm We now have all we need:
1. Start with an arbitrary value θ_1
2. Update from θ_j to θ_{j+1} (j = 1, 2, ..., J) by
2.1 Generate θ* ~ N(θ_j, Σ)
2.2 Define α = min( L(Z|θ*) / L(Z|θ_j), 1 )   (3)
2.3 Take θ_{j+1} = θ* with probability α, θ_j otherwise
3. Repeat Step 2 J times
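
An illustrative sketch of how these pieces fit together (not the lecture's code: `log_likelihood` is a placeholder standing in for the Kalman-filter evaluation of the simulated data, and ε = 0.01 is an assumed tuning value).

```python
import numpy as np

# Calibration as on the previous slide: {rho, gamma, kappa, phi_pi, sig_x, sig_y, sig_pi}
theta_true = np.array([0.9, 2.0, 0.1, 1.5, 1.0, 1.0, 1.0])

def log_likelihood(theta):
    # Placeholder for the Kalman-filter log likelihood of the simulated data;
    # a simple quadratic stand-in is used here so the sketch runs on its own.
    return -0.5 * np.sum(((theta - theta_true) / 0.1) ** 2)

eps = 0.01                                        # assumed tuning constant
Sigma = eps * np.diag(np.abs(theta_true))         # Sigma = eps * diag(abs(theta))
chol = np.linalg.cholesky(Sigma)

rng = np.random.default_rng(0)
J = 10_000
draws = np.empty((J, theta_true.size))
theta = theta_true.copy()                         # start at the true values (not available in practice)
logL = log_likelihood(theta)

for j in range(J):
    candidate = theta + chol @ rng.standard_normal(theta.size)   # theta* ~ N(theta_j, Sigma)
    logL_cand = log_likelihood(candidate)
    if np.log(rng.uniform()) < min(logL_cand - logL, 0.0):       # accept with probability alpha
        theta, logL = candidate, logL_cand
    draws[j] = theta

print(draws.mean(axis=0))                         # means of the simulated chain
```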

35 40 The MCMC draws for increasing values of J (figures)

41 Identification Some posterior distributions look flat Can be a short sample issue Nothing to do if we are estimating a model on real data, but here we can check Can be a problem with the mapping between parameters and likelihood

42 The MCMC with a larger sample T (figure)

43 Identification The posterior distributions for γ and κ still look flat: probably a true identification issue Little can be said about identification a priori

44 Convergence How many draws do we need? The optimal J increases with the number of parameters One informal check is to plot the diagonal of the recursive covariance matrix of the MCMC draws, (1/j) Σ_{i=1}^{j} θ^(i) θ^(i)', for j = 1, 2, ..., J
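
A sketch of this diagnostic (illustrative; `draws` stands in for the J × k array of MCMC output): compute the recursive second moments of the draws and check whether they settle down as j grows.

```python
import numpy as np

def recursive_second_moments(draws):
    """Diagonal of (1/j) sum_{i<=j} theta_i theta_i' for j = 1, ..., J (draws has shape J x k)."""
    cumsq = np.cumsum(draws ** 2, axis=0)
    j = np.arange(1, draws.shape[0] + 1)[:, None]
    return cumsq / j                     # row j holds the recursive estimate after j draws

# Example with dummy draws: the recursive moments should flatten out as j grows
rng = np.random.default_rng(0)
draws = rng.normal(size=(10_000, 3))
rec = recursive_second_moments(draws)
print(rec[-1])                           # close to [1, 1, 1] once the chain has "converged"
```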

45 Checking for convergence (figure)

46 Checking for convergence with a larger J (figure)

47 Combining prior and sample information Sometimes we know more about the parameters than what the data tells us, i.e. we have some prior information.

48 What do we mean by prior information? For a DSGE model, we may have information about deep parameters The range of some parameters is restricted by theory, e.g. risk aversion should be positive The discount rate is the inverse of the average real interest rate Price stickiness can be measured by surveys We may know something about the mean of a process

49 How do we combine prior and sample information? Bayes theorem: P(θ|Z) P(Z) = P(Z|θ) P(θ), so P(θ|Z) = P(Z|θ) P(θ) / P(Z) Since P(Z) is a constant, we can use P(Z|θ) P(θ) as the posterior likelihood (a likelihood function is any function that is proportional to the probability). We now need to choose P(θ)

50 Choosing prior distributions The beta distribution is a good choice when the parameter is in [0, 1]:
P(x) = (1 - x)^(b-1) x^(a-1) / B(a, b)
where B(a, b) = (a - 1)!(b - 1)! / (a + b - 1)!
Easier to parameterize using expressions for the mean, mode and variance:
μ = a/(a + b), mode = (a - 1)/(a + b - 2), σ^2 = ab / [(a + b)^2 (a + b + 1)]
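
In practice one inverts these moment expressions to recover (a, b) from a desired mean and s.d. A small sketch (the helper name and the example values are mine):

```python
from scipy import stats

def beta_from_mean_sd(mu, sd):
    """Map a desired mean and s.d. to the Beta(a, b) shape parameters."""
    s = mu * (1 - mu) / sd**2 - 1          # since var = mu*(1-mu)/(a+b+1) with s = a+b
    return mu * s, (1 - mu) * s

a, b = beta_from_mean_sd(0.75, 0.05)
print(a, b)                                               # shape parameters (55.5, 18.5)
print(stats.beta(a, b).mean(), stats.beta(a, b).std())    # recovers 0.75 and 0.05
```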

51 Examples of beta distributions holding mean fixed

52 Examples of beta distributions holding s.d. fixed

53 Choosing prior distributions The inverse gamma distribution is a good choice when the parameter is positive:
P(x) = [b^a / Γ(a)] (1/x)^(a+1) exp(-b/x)
where Γ(a) = (a - 1)!
Again, easier to parameterize using expressions for the mean, mode and variance:
μ = b/(a - 1) for a > 1, mode = b/(a + 1), σ^2 = b^2 / [(a - 1)^2 (a - 2)] for a > 2
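
The same idea works for the inverse gamma: a sketch that backs out (a, b) from an assumed mean and s.d. using μ = b/(a - 1) and σ^2 = b^2/[(a - 1)^2(a - 2)].

```python
from scipy import stats

def invgamma_from_mean_sd(mu, sd):
    """Map a desired mean and s.d. to the inverse gamma parameters (a, b)."""
    a = 2 + (mu / sd) ** 2                 # since var = mu^2 / (a - 2)
    b = mu * (a - 1)                       # since mean = b / (a - 1)
    return a, b

a, b = invgamma_from_mean_sd(1.0, 0.5)
dist = stats.invgamma(a, scale=b)          # SciPy's invgamma with shape a and scale b
print(dist.mean(), dist.std())             # recovers 1.0 and 0.5
```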

54 Examples of inverse gamma distributions

55 The Random-Walk Metropolis Algorithm with priors
1. Start with an arbitrary value θ_1
2. Update from θ_j to θ_{j+1} (j = 1, 2, ..., J) by
2.1 Generate θ* ~ N(θ_j, Σ)
2.2 Define α = min( L(Z|θ*) P(θ*) / [L(Z|θ_j) P(θ_j)], 1 )   (4)
2.3 Take θ_{j+1} = θ* with probability α, θ_j otherwise
The only difference compared to (3) is that the priors now appear in the ratio.

56 Practical implementation Since we compute the log-likelihood we can use the log of the likelihood ratio in the Random Walk Metropolis Algorithm:
ln α = min[ ln L(Z|θ*) + ln P(θ*) - ln L(Z|θ_j) - ln P(θ_j), 0 ]
If the priors across parameters are independent we have that
ln P(θ_j) = ln P(θ_{1,j}) + ln P(θ_{2,j}) + ... + ln P(θ_{q,j})
where θ_j = [θ_{1,j} θ_{2,j} ... θ_{q,j}]
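
In code, the accept/reject step in logs might look like the following sketch (illustrative names, not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def accept(loglike_cand, logprior_cand, loglike_curr, logprior_curr):
    """Accept/reject in logs: ln alpha = min(ln L* + ln P* - ln L_j - ln P_j, 0)."""
    log_alpha = min(loglike_cand + logprior_cand - loglike_curr - logprior_curr, 0.0)
    return np.log(rng.uniform()) < log_alpha      # True with probability alpha

# Example: a candidate that is 2 log points better is always accepted
print(accept(-100.0, -3.0, -102.0, -3.0))
```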

57 Let's get deep with an old friend:
x_t = ρ x_{t-1} + u^x_t
y_t = E_t(y_{t+1}) - (1/γ) [r_t - E_t(π_{t+1})] + u^y_t
π_t = E_t(π_{t+1}) + κ [y_t - x_t] + u^π_t
r_t = φ_π π_t
The parameter κ is in the benchmark 3-equation NK model given by κ = (1 - δ)(1 - δβ)/δ, where δ is the Calvo parameter of price stickiness and β is the discount factor. We now have a new parameter vector θ = {ρ, γ, δ, β, φ_π, σ_x, σ_y, σ_π}

58 The priors The prior on relative risk aversion γ is truncated Normal with mean 2 and s.d. 0.5. The prior on the discount factor β is Beta with mean 0.99. The prior on the Calvo parameter δ is Beta with mean 0.75.

59 The Random-Walk Metropolis Algorithm with priors
1. Start with an arbitrary value θ_1
2. Update from θ_j to θ_{j+1} (j = 1, 2, ..., J) by
2.1 Generate θ* ~ N(θ_j, Σ)
2.2 Define α = min( L(Z|θ*) P(θ*) / [L(Z|θ_j) P(θ_j)], 1 )   (5)
2.3 Take θ_{j+1} = θ* with probability α, θ_j otherwise
3. Repeat Step 2 J times

60 The log prior The log prior is given by
ln P(θ_j) = ln P(θ_{1,j}) + ln P(θ_{2,j}) + ... + ln P(θ_{q,j}) = ln P(γ_j) + ln P(δ_j) + ln P(β_j)
since we can ignore the (constant) probabilities of the uniform priors. We then have that
ln [L(Z|θ*) P(θ*)] = ln L(Z|θ*) + ln P(θ*)
so that α = min( exp[ln L(Z|θ*) + ln P(θ*)] / exp[ln L(Z|θ_j) + ln P(θ_j)], 1 )
i.e. all we need for the RWMA
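
For these particular priors the log prior can be assembled from standard SciPy densities. In the sketch below the Beta standard deviations (0.05 and 0.005) are illustrative assumptions, since the slides only pin down the means; the truncation of the Normal at zero is also an assumption.

```python
from scipy import stats

def beta_params(mu, sd):
    s = mu * (1 - mu) / sd**2 - 1
    return mu * s, (1 - mu) * s

# Truncated normal for gamma (truncated at 0), Beta for delta and beta.
prior_gamma = stats.truncnorm(a=(0 - 2) / 0.5, b=float("inf"), loc=2, scale=0.5)
prior_delta = stats.beta(*beta_params(0.75, 0.05))    # assumed s.d. 0.05
prior_beta = stats.beta(*beta_params(0.99, 0.005))    # assumed s.d. 0.005

def log_prior(gamma, delta, beta):
    # Uniform priors on the remaining parameters only add a constant, so they are dropped
    return (prior_gamma.logpdf(gamma)
            + prior_delta.logpdf(delta)
            + prior_beta.logpdf(beta))

print(log_prior(2.0, 0.75, 0.99))
```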

61 Posterior with uniform and informative priors

62 Inference about parameters Compute sample equivalents: Central tendencies: mean, mode Variance Compute frequencies of occurrence: How likely is it that θ_1 and θ_2 are both smaller than some given value?
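
For example, such event probabilities are just frequencies in the simulated sample (a sketch with stand-in draws; in practice `draws` would be the MCMC output, with θ_1 and θ_2 in the first two columns):

```python
import numpy as np

rng = np.random.default_rng(0)
draws = rng.normal(loc=[0.8, 1.2], scale=0.3, size=(50_000, 2))   # stand-in for posterior draws

post_mean = draws.mean(axis=0)                                    # central tendency
post_var = draws.var(axis=0)                                      # spread
prob_both_below_1 = np.mean((draws[:, 0] < 1) & (draws[:, 1] < 1))
print(post_mean, post_var, prob_both_below_1)
```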

63 That's it for today.
