36-724 Spring 2006: Introduction to Markov Chain Monte Carlo (MCMC)
Brian Junker
February 16, 2006

Outline:
- Hierarchical Normal Model
- Direct Simulation
- An Alternative Approach: MCMC
- Complete Conditionals for Hierarchical Normal
- Aside: Marginal Density Estimates
- Summary
Hierarchical Normal Model

[Diagram: a three-level tree. At the top, the prior $p(\mu, \tau^2)$; in the middle, the group means $\theta_1, \ldots, \theta_J$, each with $\theta_j \sim N(\mu, \tau^2)$; at the bottom, the observations $y_{1j}, \ldots, y_{n_j j}$ within each group $j$, each with $y_{ij} \sim N(\theta_j, \sigma^2)$.]

With $\sigma_j^2 = \sigma^2 / n_j$ known, the model is

$$p(y, \theta, \mu, \tau^2 \mid \sigma^2) = \underbrace{\prod_j N(\bar{y}_j \mid \theta_j, \sigma_j^2)}_{\text{Level 1}} \; \underbrace{\prod_j N(\theta_j \mid \mu, \tau^2)}_{\text{Level 2}} \; \underbrace{p(\mu, \tau^2)}_{\text{Level 3}},$$

where $\bar{y}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ij}$.
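Since later slides point to R code, here is a minimal R sketch of simulating one data set from this three-level model; all object names and parameter values (J, nj, mu, tau2, sigma2, theta, sigma2j, ybar) are illustrative choices, not part of the lecture's own code:

```r
## Simulate one data set from the hierarchical normal model (a sketch).
set.seed(1)
J  <- 8; nj <- rep(5, J)                    # J groups, n_j observations per group
mu <- 0; tau2 <- 1; sigma2 <- 4             # level-3 quantities, fixed for illustration
theta   <- rnorm(J, mu, sqrt(tau2))         # level 2: theta_j ~ N(mu, tau2)
sigma2j <- sigma2 / nj                      # known variances of the group means
ybar    <- rnorm(J, theta, sqrt(sigma2j))   # level 1: ybar_j ~ N(theta_j, sigma2_j)
```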
Direct Simulation

Taking $p(\mu, \tau^2) \propto p(\tau^2)$, we have

$$p(\theta, \mu, \tau^2 \mid y) = p(\theta \mid \mu, \tau^2, y)\, p(\mu \mid \tau^2, y)\, p(\tau^2 \mid y),$$

where

$$p(\theta_j \mid \mu, \tau^2, y_j) = N(\theta_j \mid \hat{\theta}_j, V_j), \qquad \hat{\theta}_j = \frac{\bar{y}_j / \sigma_j^2 + \mu / \tau^2}{1/\sigma_j^2 + 1/\tau^2}, \qquad V_j = \frac{1}{1/\sigma_j^2 + 1/\tau^2};$$

$$p(\mu \mid \tau^2, y) = N(\mu \mid \hat{\mu}, V_\mu), \qquad \hat{\mu} = \frac{\sum_j \bar{y}_j / (\tau^2 + \sigma_j^2)}{\sum_j 1/(\tau^2 + \sigma_j^2)}, \qquad V_\mu = \frac{1}{\sum_j 1/(\tau^2 + \sigma_j^2)};$$

$$p(\tau^2 \mid y) \propto V_\mu^{1/2} \left[ \prod_j N(\bar{y}_j \mid \hat{\mu}, \tau^2 + \sigma_j^2) \right] p(\tau^2).$$

This wasn't bad, but it requires cleverness and lacks flexibility.
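A minimal R sketch of this recipe: sample $\tau^2$ from $p(\tau^2 \mid y)$ on a grid, then $\mu \mid \tau^2, y$, then $\theta \mid \mu, \tau^2, y$. The flat prior $p(\tau^2) \propto 1$, the grid range, and the data objects ybar and sigma2j (from the simulation sketch above) are all assumptions for illustration:

```r
## Direct (non-MCMC) posterior simulation via a grid on tau2 (a sketch).
direct_sim <- function(ybar, sigma2j, M = 1000,
                       grid = seq(0.01, 10, length.out = 500)) {
  J <- length(ybar)
  # log p(tau2 | y) on the grid, up to a constant (flat prior on tau2 assumed)
  lp <- sapply(grid, function(t2) {
    w     <- 1 / (t2 + sigma2j)
    muhat <- sum(w * ybar) / sum(w)
    -0.5 * log(sum(w)) +                               # log V_mu^{1/2}
      sum(dnorm(ybar, muhat, sqrt(t2 + sigma2j), log = TRUE))
  })
  tau2 <- sample(grid, M, replace = TRUE, prob = exp(lp - max(lp)))
  draws <- t(sapply(tau2, function(t2) {
    w     <- 1 / (t2 + sigma2j)
    mu    <- rnorm(1, sum(w * ybar) / sum(w), sqrt(1 / sum(w)))  # mu | tau2, y
    Vj    <- 1 / (1 / sigma2j + 1 / t2)
    theta <- rnorm(J, Vj * (ybar / sigma2j + mu / t2), sqrt(Vj)) # theta | mu, tau2, y
    c(theta, mu, t2)
  }))
  colnames(draws) <- c(paste0("theta", 1:J), "mu", "tau2")
  draws
}
```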
An Alternative Approach: MCMC

The problem: Learn about $\pi(\tau) = \pi(\theta, \mu, \tau^2)$, where $\tau = (\theta, \mu, \tau^2)$ is some high-dimensional set of variables (parameters).

The essential idea: Define a (stationary) Markov chain $M_0, M_1, M_2, \ldots$ with states $M_m = (\theta^m, \mu^m, (\tau^2)^m)$. Simulate the $M_m$'s; under regularity conditions (e.g., Tierney, 1994), $M_m$ will converge in distribution to a stationary distribution $\pi(\theta, \mu, \tau^2)$ satisfying

$$\pi(\theta^1, \mu^1, (\tau^2)^1) = \int P[M_{m+1} = (\theta^1, \mu^1, (\tau^2)^1) \mid M_m = (\theta^0, \mu^0, (\tau^2)^0)] \, \pi(\theta^0, \mu^0, (\tau^2)^0) \, d(\theta^0, \mu^0, (\tau^2)^0).$$

For Bayes, design the chain so that $\pi(\theta, \mu, \tau^2)$ turns out to be the posterior, $p(\theta, \mu, \tau^2 \mid y)$.
Typical MCMC paradigm: Write $p(\theta, \mu, \tau^2 \mid y) = p(\tau_1, \tau_2, \ldots, \tau_d \mid y)$, where $(\tau_1, \ldots, \tau_d)$ is a disjoint partition of the original model parameters $(\theta, \mu, \tau^2)$.

General theory of MCMC (e.g., Tierney, 1994; Chib and Greenberg, 1995): construct state $M_m = (\tau_1^{(m)}, \ldots, \tau_d^{(m)})$ by sampling each $\tau_k$ from its complete conditional distribution. To step from $M_{m-1} = (\tau_1^{(m-1)}, \ldots, \tau_d^{(m-1)})$ to $M_m = (\tau_1^{(m)}, \ldots, \tau_d^{(m)})$:

1. $\tau_1^{(m)} \sim p(\tau_1 \mid \tau_2^{(m-1)}, \ldots, \tau_d^{(m-1)}, y) \equiv p(\tau_1 \mid \text{rest})$;
2. $\tau_2^{(m)} \sim p(\tau_2 \mid \tau_1^{(m)}, \tau_3^{(m-1)}, \ldots, \tau_d^{(m-1)}, y) \equiv p(\tau_2 \mid \text{rest})$;
3. $\tau_3^{(m)} \sim p(\tau_3 \mid \tau_1^{(m)}, \tau_2^{(m)}, \tau_4^{(m-1)}, \ldots, \tau_d^{(m-1)}, y) \equiv p(\tau_3 \mid \text{rest})$;
   $\vdots$
d. $\tau_d^{(m)} \sim p(\tau_d \mid \tau_1^{(m)}, \tau_2^{(m)}, \tau_3^{(m)}, \ldots, \tau_{d-1}^{(m)}, y) \equiv p(\tau_d \mid \text{rest})$.

This is the basic idea behind all so-called Gibbs samplers.
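In code, one sweep of this scheme is just a loop over the components, always conditioning on the newest values of the others. A schematic R sketch, with hypothetical names (state is a list $(\tau_1, \ldots, \tau_d)$ and draw[[k]] is a user-supplied sampler for $p(\tau_k \mid \text{rest})$):

```r
## One Gibbs sweep: update each component from its complete conditional.
gibbs_sweep <- function(state, draw) {
  for (k in seq_along(state))
    state[[k]] <- draw[[k]](state)   # tau_k given the current values of the rest
  state
}
```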
Markov Chain Monte Carlo (MCMC) vs. independent-draws MC: replace draws from the higher-dimensional $p(\tau \mid y)$ with a sequence of low-dimensional draws from $p(\tau_1 \mid \text{rest})$, $p(\tau_2 \mid \text{rest})$, etc.

For Bayes, one can often exploit partial conjugacy to simplify the problem of drawing from $p(\tau_k \mid \text{rest})$.

Costs: dependent rather than iid samples, and approximate rather than exact sampling from $p(\tau \mid y)$.
Sampling schemes

Once a set of complete conditionals is decided upon:

- If you can sample $\tau_k$ directly from $p(\tau_k \mid \text{rest})$, do so: this is a Gibbs step.
- Otherwise, perform a Metropolis step: sample a proposal value $\tau_k^{*}$ from any convenient proposal distribution $q_m(\tau_k^{*} \mid \tau_k^{(m-1)})$, and compute the acceptance probability $\alpha$ from $q_m(\tau_k^{*} \mid \tau_k^{(m-1)})$ and the complete conditional $p(\tau_k \mid \text{rest}) \equiv p(\tau_k \mid \tau_1^{(m)}, \ldots, \tau_{k-1}^{(m)}, \tau_{k+1}^{(m-1)}, \ldots, \tau_d^{(m-1)}, y)$:

$$\alpha = \min\left\{ \frac{p(\tau_k^{*} \mid \text{rest}) \, q_m(\tau_k^{(m-1)} \mid \tau_k^{*})}{p(\tau_k^{(m-1)} \mid \text{rest}) \, q_m(\tau_k^{*} \mid \tau_k^{(m-1)})}, \; 1 \right\}.$$

Accept $\tau_k^{(m)} = \tau_k^{*}$ with probability $\alpha$; otherwise set $\tau_k^{(m)} = \tau_k^{(m-1)}$.

This is often called Metropolis-Hastings within Gibbs.
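A sketch of one such Metropolis step in R, for a scalar component. It assumes a symmetric random-walk proposal, so the $q$-ratio in $\alpha$ cancels; for an asymmetric $q$, the two $q$ terms must be kept. The names log_post_k (the log complete conditional, up to a constant) and prop_sd are hypothetical:

```r
## One random-walk Metropolis step for a scalar component tau_k.
metropolis_step <- function(tau_k, log_post_k, prop_sd = 0.5) {
  tau_star  <- rnorm(1, tau_k, prop_sd)                  # symmetric: q(a|b) = q(b|a)
  log_alpha <- log_post_k(tau_star) - log_post_k(tau_k)  # log of the p-ratio
  if (log(runif(1)) < log_alpha) tau_star else tau_k     # accept w.p. min(alpha, 1)
}
```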
Identifying the complete conditionals

Suppose $y$ depends only on $\theta$, so the likelihood is $p(y \mid \theta)$, and suppose the $\theta$'s, $\mu$'s and $\tau^2$'s have prior distributions $p(\theta \mid \mu, \tau^2)$, $p(\mu \mid \tau^2)$ and $p(\tau^2)$. Then the complete conditional for $\theta$, for example, is

$$p(\theta \mid \text{rest}) = p(\theta \mid y, \mu, \tau^2) = \frac{p(y, \theta, \mu, \tau^2)}{\int p(y, t, \mu, \tau^2)\, dt} = \frac{p(y \mid \theta)\, p(\theta \mid \mu, \tau^2)\, p(\mu \mid \tau^2)\, p(\tau^2)}{\int p(y \mid t)\, p(t \mid \mu, \tau^2)\, p(\mu \mid \tau^2)\, p(\tau^2)\, dt} \propto p(y \mid \theta)\, p(\theta \mid \mu, \tau^2).$$

Key observation: the shape of $p(\theta \mid y, \mu, \tau^2)$ is determined by just the parts of the joint model that depend explicitly on $\theta$. Similarly for the other parameters.
Convergence

$$\underbrace{M_0, M_1, M_2, \ldots, M_B}_{\text{Burn-in segment}}, \; \underbrace{M_{B+1}, \ldots, M_{B+M}}_{\text{Usable MCMC sample}}$$

How large should B be?

- Use time-series (trace) plots to see when the chain has stabilized.
- Use acf() in Splus/R to check when Corr($M_m$, $M_{m+b}$) is small.
- Run several chains from different starting points; let B be so large that the between-chain variation is no larger than the within-chain variation.
- The CODA subroutine package for Splus/R offers a menu of such checks (www.mrc-bsu.cam.ac.uk/bugs/).

Even after burn-in, the MC sample may have to be grouped or sub-sampled to reduce autocorrelation within the sample.
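A few of these checks in R, assuming chains is a list of draw matrices (one per starting point), each with a column named "mu"; the object names are hypothetical, and coda is the CODA package mentioned above:

```r
library(coda)                              # CODA diagnostics package
mu1 <- chains[[1]][, "mu"]
plot(mu1, type = "l")                      # trace plot: has the chain stabilized?
acf(mu1)                                   # autocorrelation at increasing lags b
mlist <- mcmc.list(lapply(chains, mcmc))   # bundle the parallel chains
gelman.diag(mlist)                         # compares between- vs. within-chain variation
```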
Complete Conditionals for Hierarchical Normal

$$p(y, \theta, \mu, \tau^2 \mid \sigma^2) \propto \prod_j N(\bar{y}_j \mid \theta_j, \sigma_j^2) \; \prod_j N(\theta_j \mid \mu, \tau^2) \; p(\mu, \tau^2)$$

$$p(\theta_j \mid \text{rest}) \propto N(\bar{y}_j \mid \theta_j, \sigma_j^2)\, N(\theta_j \mid \mu, \tau^2) = N\left(\theta_j \,\middle|\, \frac{\bar{y}_j/\sigma_j^2 + \mu/\tau^2}{1/\sigma_j^2 + 1/\tau^2}, \; \frac{1}{1/\sigma_j^2 + 1/\tau^2}\right)$$

$$p(\mu \mid \text{rest}) \propto \prod_j N(\theta_j \mid \mu, \tau^2)\, p(\mu, \tau^2) \propto N(\mu \mid \bar{\theta}, \tau^2/J)\, p(\mu, \tau^2)$$

$$p(\tau^2 \mid \text{rest}) \propto (\tau^2)^{-J/2} \exp\left\{ -\tfrac{1}{2} \sum_j (\theta_j - \mu)^2 / \tau^2 \right\} p(\mu, \tau^2) \propto \text{Inv-Gamma}\left(\tau^2 \,\middle|\, \tfrac{J}{2} - 1, \; \tfrac{1}{2} \sum_j (\theta_j - \mu)^2\right) p(\mu, \tau^2)$$

If $p(\mu, \tau^2) \propto \text{IG}(\tau^2 \mid \alpha, \beta)$, we can use Gibbs steps (i.e., sample these complete conditionals directly)... see the R code for this lecture.
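A minimal sketch of such a Gibbs sampler, along the lines of (but not identical to) the R code for the lecture. It assumes the IG($\alpha$, $\beta$) prior on $\tau^2$ with $\mu$ flat, and the ybar and sigma2j objects from the simulation sketch earlier:

```r
## Gibbs sampler for the hierarchical normal model (a sketch).
gibbs_hnorm <- function(ybar, sigma2j, M = 5000, alpha = 1, beta = 1) {
  J <- length(ybar)
  theta <- ybar; mu <- mean(ybar); tau2 <- var(ybar)   # crude starting values
  out <- matrix(NA_real_, M, J + 2)
  colnames(out) <- c(paste0("theta", 1:J), "mu", "tau2")
  for (m in 1:M) {
    Vj    <- 1 / (1 / sigma2j + 1 / tau2)              # theta_j | rest
    theta <- rnorm(J, Vj * (ybar / sigma2j + mu / tau2), sqrt(Vj))
    mu    <- rnorm(1, mean(theta), sqrt(tau2 / J))     # mu | rest (flat prior on mu)
    tau2  <- 1 / rgamma(1, shape = alpha + J / 2,      # tau2 | rest ~ Inv-Gamma
                        rate  = beta + sum((theta - mu)^2) / 2)
    out[m, ] <- c(theta, mu, tau2)
  }
  out
}

out <- gibbs_hnorm(ybar, sigma2j)   # run on the simulated data from earlier
```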
Aside: Marginal Density Estimates

MCMC (and other Monte Carlo methods) can be used to obtain a sample from the joint posterior $p(\tau_1, \ldots, \tau_d \mid y)$. What if we just want $p(\tau_1 \mid y)$? Let $(\tau_1^{(m)}, \ldots, \tau_d^{(m)})$, $m = 1, \ldots, M$, be a sample from $p(\tau_1, \ldots, \tau_d \mid y)$.

Method 1: $P[\tau_1 \le t \mid y] = E[1_{\{\tau_1 \le t\}} \mid y] \approx \frac{1}{M} \sum_m 1_{\{\tau_1^{(m)} \le t\}}$, so $\tau_1^{(1)}, \ldots, \tau_1^{(M)}$ is a sample from $p(\tau_1 \mid y)$. A histogram or density estimate based on the $\tau_1^{(m)}$ thus estimates $p(\tau_1 \mid y)$, with $\text{Var}\left(\frac{1}{M} \sum_m 1_{\{\tau_1^{(m)} \le t\}} \mid y\right) = \frac{1}{M} \text{Var}(1_{\{\tau_1 \le t\}} \mid y)$.

Method 2: $P[\tau_1 \le t \mid y] = E\{P[\tau_1 \le t \mid y, \tau_2, \ldots, \tau_d] \mid y\} \approx \frac{1}{M} \sum_m P[\tau_1 \le t \mid y, \tau_2^{(m)}, \ldots, \tau_d^{(m)}]$; differentiating, we see that $\frac{1}{M} \sum_m p(\tau_1 \mid y, \tau_2^{(m)}, \ldots, \tau_d^{(m)})$ also estimates $p(\tau_1 \mid y)$, with $\text{Var}\left(\frac{1}{M} \sum_m P[\tau_1 \le t \mid y, \tau_2^{(m)}, \ldots, \tau_d^{(m)}] \mid y\right) = \frac{1}{M} \text{Var}(P[\tau_1 \le t \mid y, \tau_2, \ldots, \tau_d] \mid y)$.

Which is better? By the law of total variance,

$$\text{Var}(1_{\{\tau_1 \le t\}} \mid y) = \text{Var}(P[\tau_1 \le t \mid y, \tau_2, \ldots, \tau_d] \mid y) + E[\text{Var}(1_{\{\tau_1 \le t\}} \mid y, \tau_2, \ldots, \tau_d) \mid y],$$

so the second method can be expected to have lower variance (the Rao-Blackwellized density estimate; Casella & Robert, 1996, Biometrika).
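For example, in the hierarchical normal model with a flat prior on $\mu$, the complete conditional is $p(\mu \mid y, \theta, \tau^2) = N(\bar{\theta}, \tau^2/J)$, so the Rao-Blackwellized estimate of $p(\mu \mid y)$ averages these normal densities over the draws. A sketch, assuming out holds the Gibbs draws from the previous sketch:

```r
## Two estimates of p(mu | y): kernel density of the mu draws vs.
## the Rao-Blackwellized average of complete conditional densities.
theta_m  <- out[, grep("^theta", colnames(out)), drop = FALSE]
thetabar <- rowMeans(theta_m)
J        <- ncol(theta_m)
mu_grid  <- seq(min(out[, "mu"]), max(out[, "mu"]), length.out = 200)
rb <- sapply(mu_grid, function(m)
  mean(dnorm(m, thetabar, sqrt(out[, "tau2"] / J))))   # mixture of N(thetabar, tau2/J)
plot(density(out[, "mu"]), main = "Two estimates of p(mu | y)")
lines(mu_grid, rb, lty = 2)   # Rao-Blackwellized estimate, typically smoother
```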
Summary

- As an alternative to direct iid sampling from the joint posterior, one can do successive substitution sampling of the complete conditionals. This sets up a Monte Carlo Markov chain whose stationary distribution is the joint posterior.
- The complete conditionals are often easy to identify. Even if they cannot be sampled directly with Gibbs steps, they can be sampled using the Metropolis-Hastings rejection method.
- The complete conditionals are also useful for computing Rao-Blackwellized marginal density estimates.
- Heavy autocorrelation between draws in MCMC means we have to throw away an initial burn-in segment of draws, and we may also have to subsample or block-average the draws we keep, to reduce autocorrelation.
- MCMC is easy to set up but slow to operate. Much of the art of MCMC is choosing good parametrizations so that the autocorrelation between MCMC draws is low.