Markov Chain Monte Carlo

Recall: To compute the expectation E(h(Y)) we use the approximation

    E(h(Y)) ≈ (1/n) Σ_{t=1}^n h(Y^(t))

with Y^(1), ..., Y^(n) ~ f(y). Thus our aim is to sample Y^(1), ..., Y^(n) from f(y).

Problem: Independent sampling from f(y) may be difficult.

Markov chain Monte Carlo (MCMC) approach: Generate a Markov chain {Y^(t)} with stationary distribution f(y).

- Early iterations Y^(1), ..., Y^(m) reflect the starting value Y^(0). These iterations are called burn-in.
- After the burn-in, we say the chain has converged.
- Omit the burn-in from averages:

    (1/(n − m)) Σ_{t=m+1}^n h(Y^(t))

[Figure: sample path of a chain, with the burn-in phase marked off from the stationary phase.]

How do we construct a Markov chain {Y^(t)} which has stationary distribution f(y)?

- Gibbs sampler
- Metropolis-Hastings algorithm (Metropolis et al. 1953; Hastings 1970)

MCMC, April 9, 4 - 1 -
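The sample-average approximation above can be illustrated with a short simulation (a minimal sketch in Python/NumPy, not part of the original slides; the choice h(y) = y² with Y ~ N(0, 1), so that E(h(Y)) = 1, is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def h(y):
    # Illustrative choice of h; for Y ~ N(0, 1) we know E(Y^2) = 1.
    return y ** 2

n = 100_000
y = rng.standard_normal(n)   # independent draws Y^(1), ..., Y^(n) from f(y)
estimate = h(y).mean()       # (1/n) * sum_t h(Y^(t))
print(estimate)              # close to E(Y^2) = 1
```

With independent draws no burn-in is needed; for MCMC output the same average would be taken over t = m+1, ..., n only.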
Gibbs Sampler

Let Y = (Y_1, ..., Y_d) be d-dimensional with d ≥ 2 and distribution f(y). The full conditional distribution of Y_i is given by

    f(y_i | y_1, ..., y_{i−1}, y_{i+1}, ..., y_d)
        = f(y_1, ..., y_{i−1}, y_i, y_{i+1}, ..., y_d) / ∫ f(y_1, ..., y_{i−1}, y_i, y_{i+1}, ..., y_d) dy_i

Gibbs sampling: sample, or update, in turn:

    1. Y_1 ~ f(y_1 | Y_2, Y_3, ..., Y_d)
    2. Y_2 ~ f(y_2 | Y_1, Y_3, ..., Y_d)
    3. Y_3 ~ f(y_3 | Y_1, Y_2, Y_4, ..., Y_d)
    ...
    d. Y_d ~ f(y_d | Y_1, Y_2, ..., Y_{d−1})

Always use the most recent values.

[Figure: sample path of the Gibbs sampler in two dimensions (iterations t = 1, ..., 7), moving in axis-parallel steps.]
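The update-in-turn scheme can be sketched generically (Python/NumPy rather than the R used later on these slides; the function and variable names are illustrative, and the bivariate normal full conditionals in the usage example are a standard special case):

```python
import numpy as np

def gibbs(y0, full_conditionals, n_iter, rng):
    """Run a Gibbs sampler.

    full_conditionals[i](y, rng) must draw Y_i from
    f(y_i | y_1, ..., y_{i-1}, y_{i+1}, ..., y_d).
    """
    y = np.array(y0, dtype=float)
    path = np.empty((n_iter, len(y)))
    for t in range(n_iter):
        for i, draw in enumerate(full_conditionals):
            y[i] = draw(y, rng)   # always use the most recent values
        path[t] = y
    return path

# Usage: bivariate standard normal with correlation rho, whose full
# conditionals are N(rho * y_other, 1 - rho^2).
rho = 0.5
conds = [
    lambda y, rng: rng.normal(rho * y[1], np.sqrt(1 - rho**2)),
    lambda y, rng: rng.normal(rho * y[0], np.sqrt(1 - rho**2)),
]
path = gibbs([0.0, 0.0], conds, 5000, np.random.default_rng(1))
```

The inner loop overwrites `y[i]` in place, so later components are updated conditional on the values already drawn in the current sweep, exactly as in steps 1 to d above.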
Gibbs Sampler

Detailed balance for the Gibbs sampler: For simplicity, let Y = (Y_1, Y_2)^T. The update at time t + 1 is obtained from the previous state Y^(t) in two steps:

    1. Y_1' ~ p(y_1 | Y_2^(t))
    2. Y_2' ~ p(y_2 | Y_1')

Accordingly the transition matrix P(y, y') = P(Y^(t+1) = y' | Y^(t) = y) can be factorized into two separate transition matrices,

    P(y, y') = P_1(y, ỹ) P_2(ỹ, y'),

where ỹ = (y_1', y_2)^T is the intermediate result after the first step. Obviously we have P_1(y, ỹ) = p(y_1' | y_2) and P_2(ỹ, y') = p(y_2' | y_1'). Note that for any y, y', we have P_1(y, y') = 0 if y_2' ≠ y_2 and P_2(y, y') = 0 if y_1' ≠ y_1.

Since detailed balance with respect to p implies stationarity of p, and stationarity under each factor implies stationarity under the composition, it suffices to show detailed balance for each of the transition matrices separately. For any states y, y' such that y_2' = y_2,

    p(y) P_1(y, y') = p(y_1, y_2) p(y_1' | y_2) = p(y_1 | y_2) p(y_2) p(y_1' | y_2)
                    = p(y_1 | y_2) p(y_1', y_2) = P_1(y', y) p(y'),

while for y, y' with y_2' ≠ y_2 the equation is trivially fulfilled. Similarly we obtain for y, y' such that y_1' = y_1

    p(y) P_2(y, y') = p(y_1, y_2) p(y_2' | y_1) = p(y_2 | y_1) p(y_1) p(y_2' | y_1)
                    = p(y_2 | y_1) p(y_1, y_2') = P_2(y', y) p(y'),

while for y, y' with y_1' ≠ y_1 the equation trivially holds. Altogether this shows that p(y) is indeed the stationary distribution of the Gibbs sampler. Note that for the combined kernel we only get

    p(y) P(y, y') = p(y) P_1(y, ỹ) P_2(ỹ, y') = p(y') P_2(y', ỹ) P_1(ỹ, y) ≠ p(y') P(y', y)

in general.

Explanation: Markov chains {Y^(t)} which satisfy the detailed balance equation p(y) P(y, y') = p(y') P(y', y) are called time-reversible, since it can be shown that then

    P(Y^(t+1) = y' | Y^(t) = y) = P(Y^(t) = y' | Y^(t+1) = y).

The two-step Gibbs kernel itself is not reversible: for the above Gibbs sampler, to go back in time we have to update the two components in reverse order, first Y_2 and then Y_1.
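The two detailed-balance identities can be checked numerically on a small discrete example (a hypothetical 2×2 joint distribution p(y_1, y_2); Python/NumPy, not part of the original slides):

```python
import numpy as np

# Hypothetical joint distribution p[y1, y2] over {0,1} x {0,1}, sums to 1.
p = np.array([[0.1, 0.2],
              [0.3, 0.4]])

states = [(a, b) for a in (0, 1) for b in (0, 1)]
n = len(states)
P1 = np.zeros((n, n))   # update Y1 from p(y1 | y2), keep y2 fixed
P2 = np.zeros((n, n))   # update Y2 from p(y2 | y1), keep y1 fixed
for i, (y1, y2) in enumerate(states):
    for j, (z1, z2) in enumerate(states):
        if z2 == y2:                                  # P1 changes only Y1
            P1[i, j] = p[z1, y2] / p[:, y2].sum()     # p(z1 | y2)
        if z1 == y1:                                  # P2 changes only Y2
            P2[i, j] = p[y1, z2] / p[y1, :].sum()     # p(z2 | y1)

pi = np.array([p[a, b] for (a, b) in states])
# Detailed balance pi_i * P_k[i, j] = pi_j * P_k[j, i] for each factor:
assert np.allclose(pi[:, None] * P1, (pi[:, None] * P1).T)
assert np.allclose(pi[:, None] * P2, (pi[:, None] * P2).T)
# Hence pi is stationary for the combined Gibbs kernel P = P1 @ P2:
assert np.allclose(pi @ (P1 @ P2), pi)
print("detailed balance and stationarity verified")
```

Note the combined kernel `P1 @ P2` passes the stationarity check even though, as argued above, it need not satisfy detailed balance itself.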
Gibbs Sampler

Example: Bayes inference for a univariate normal sample

Consider normally distributed observations Y = (Y_1, ..., Y_n)^T with

    Y_i iid ~ N(µ, σ²).

Likelihood function:

    f(Y | µ, σ²) ∝ (1/σ²)^(n/2) exp( −(1/(2σ²)) Σ_{i=1}^n (Y_i − µ)² )

Prior distribution (noninformative prior):

    π(µ, σ²) ∝ 1/σ²

Posterior distribution:

    π(µ, σ² | Y) ∝ (1/σ²)^(n/2 + 1) exp( −(1/(2σ²)) Σ_{i=1}^n (Y_i − µ)² )

Define τ = 1/σ². Then we can show that

    π(µ | σ², Y) = N( Ȳ, σ²/n )
    π(τ | µ, Y) = Γ( n/2, (1/2) Σ_{i=1}^n (Y_i − µ)² )

Gibbs sampler:

    µ^(t+1) ~ N( Ȳ, (n τ^(t))^(−1) )
    τ^(t+1) ~ Γ( n/2, (1/2) Σ_{i=1}^n (Y_i − µ^(t+1))² ),    with σ²^(t+1) = 1/τ^(t+1)
Gibbs Sampler

Implementation in R (several numeric constants were lost in the source; the values marked "placeholder" below are reconstructions):

    n<-20                                      #Data: sample size (placeholder)
    Y<-rnorm(n,0,2)                            #Simulated data (placeholder parameters)
    MC<-2;N<-1000                              #Run MC=2 chains of length N=1000
    p<-rep(0,2*MC*N)                           #Allocate memory for results
    dim(p)<-c(2,MC,N)
    for (j in (1:MC)) {                        #Loop over chains
      p2<-rgamma(1,n/2,1/2)                    #Starting value for tau
      for (i in (1:N)) {                       #Gibbs iterations
        p1<-rnorm(1,mean(Y),sqrt(1/(p2*n)))    #Update mu
        p2<-rgamma(1,n/2,sum((Y-p1)^2)/2)      #Update tau
        p[1,j,i]<-p1                           #Save results
        p[2,j,i]<-p2
      }
    }

Results: Bayes inference for a univariate normal sample

[Figure: two runs of the Gibbs sampler: trace plots and autocorrelation functions (ACF) for µ and τ.]

[Figure: marginal and joint posterior distributions of µ and τ.]
Markov Chain Monte Carlo

Example: Bivariate normal distribution

Let Y = (Y_1, Y_2)^T be normally distributed with mean µ = (0, 0)^T and covariance matrix

    Σ = ( 1  ρ
          ρ  1 ).

The conditional distributions are

    Y_1 | Y_2 ~ N( ρ Y_2, 1 − ρ² )
    Y_2 | Y_1 ~ N( ρ Y_1, 1 − ρ² )

Thus the steps of the Gibbs sampler are

    1. Y_1^(t+1) ~ N( ρ Y_2^(t), 1 − ρ² )
    2. Y_2^(t+1) ~ N( ρ Y_1^(t+1), 1 − ρ² )

Note: We can obtain an independent sample Y = (Y_1, Y_2)^T by

    1. Y_1 ~ N(0, 1)
    2. Y_2 ~ N( ρ Y_1, 1 − ρ² )
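The independent-sampling construction in the note can be verified directly (a sketch in Python/NumPy, not part of the original slides; ρ = 0.8 is an arbitrary illustrative value):

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8
n = 200_000
y1 = rng.standard_normal(n)                      # Y1 ~ N(0, 1)
y2 = rng.normal(rho * y1, np.sqrt(1 - rho**2))   # Y2 | Y1 ~ N(rho*Y1, 1 - rho^2)

# The pair (Y1, Y2) then has the target covariance matrix [[1, rho], [rho, 1]]:
corr = np.corrcoef(y1, y2)[0, 1]
print(corr)   # close to rho = 0.8
```

Unlike the Gibbs draws, these pairs are independent across samples, which is what the comparison on the next slide illustrates.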
Markov Chain Monte Carlo

Comparison of MCMC and independent draws

[Figure: sample paths of the Gibbs sampler after growing numbers of iterations, shown side by side with independent samples of increasing size n from the same bivariate normal distribution.]
Markov Chain Monte Carlo

Convergence diagnostics

- Plot the chain for each quantity of interest.
- Plot the auto-correlation function (ACF)

      ρ_i(h) = corr( Y_i^(t), Y_i^(t+h) ),

  which measures the correlation of values h lags apart.
  - Slow decay of the ACF indicates slow convergence and bad mixing.
  - It can be used to find an (approximately) independent subsample.
- Run multiple, independent chains (e.g. 3-10).
  - Several long runs (Gelman and Rubin 1992): gives an indication of convergence and a sense of statistical security.
  - One very long run (Geyer 1992): reaches parts other schemes cannot reach.
  - Widely dispersed starting values are particularly helpful to detect slow convergence.

[Figure: trace plots and ACF of µ illustrating convergence.]

If not satisfied, try some other diagnostics (→ literature).
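A sample version of the ACF above is easy to compute (a sketch in Python/NumPy, not part of the original slides; the AR(1) chain used for illustration is hypothetical, chosen because its true ACF is known to decay as φ^h, so a φ near 1 mimics a badly mixing sampler):

```python
import numpy as np

def acf(y, max_lag):
    """Sample ACF: rho_hat(h) = sum_t (y_t - ybar)(y_{t+h} - ybar) / sum_t (y_t - ybar)^2."""
    y = np.asarray(y, dtype=float)
    yc = y - y.mean()
    denom = (yc ** 2).sum()
    return np.array([(yc[:len(y) - h] * yc[h:]).sum() / denom
                     for h in range(max_lag + 1)])

rng = np.random.default_rng(3)
phi = 0.9
y = np.empty(100_000)
y[0] = 0.0
for t in range(1, len(y)):            # AR(1) chain: y_t = phi * y_{t-1} + eps_t
    y[t] = phi * y[t - 1] + rng.standard_normal()

r = acf(y, 5)
print(r[:3])   # approximately 1, 0.9, 0.81
```

The slow geometric decay seen here is the pattern that, in MCMC output, signals bad mixing; thinning to every k-th draw with φ^k small yields an approximately independent subsample.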
Markov Chain Monte Carlo

Note: Even after the chain has reached convergence, it might not yet be good enough for estimating E(h(Y)).

[Figure: trace plot of a converged but slowly mixing chain.]

Problem: The chain should show good mixing (transitions between the states)
→ run the chain for a longer period.

[Figure: trace plots and histograms of the chain after successively longer runs.]

Monte Carlo error

Suppose we want to estimate E(h(Y)) by

    ĥ = (1/N) Σ_{t=1}^N h(Y^(t))

with Y^(t) ~ f(y). The error of this approximation (the Monte Carlo error) is var(ĥ).

Estimation of the Monte Carlo error: Let {Y^(i,t)} be I independent Markov chains. Then var(ĥ) can be estimated by

    (1/(I(I − 1))) Σ_{i=1}^I ( ĥ^(i) − ĥ )²,

where

- ĥ^(i) is the MCMC estimate based on the ith chain,
- ĥ is the average of the ĥ^(i) (the overall estimate).
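The multiple-chain estimator of the Monte Carlo error can be sketched as follows (Python/NumPy, not part of the original slides; for simplicity the "chains" here are i.i.d. N(0, 1) draws with h(y) = y, so the true variance of the overall estimate, 1/(I·N), is known):

```python
import numpy as np

rng = np.random.default_rng(4)
I, N = 50, 1_000
chains = rng.standard_normal((I, N))   # Y^(i,t), i = 1..I, t = 1..N

h_hat = chains.mean(axis=1)            # per-chain estimates h_hat^(i) of E(Y)
h_bar = h_hat.mean()                   # overall estimate (average of the h_hat^(i))

# Monte Carlo error estimate: (1/(I(I-1))) * sum_i (h_hat^(i) - h_bar)^2
var_hat = ((h_hat - h_bar) ** 2).sum() / (I * (I - 1))
print(var_hat)   # roughly 1/(I*N) = 2e-5
```

For genuinely dependent MCMC output the within-chain draws are autocorrelated, so this between-chain estimator is typically preferred over the naive i.i.d. formula applied to a single chain.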