MH I
- The Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution.
- A lot of Bayesian methods rely on the use of the MH algorithm and its famous cousin, the Gibbs sampler.
MH II
- Goal is to sample from the target density: π(x) ∝ exp[−H(x)/β].
- The above form is known as the Boltzmann form of a distribution:
  - H(x) is called the fitness or energy function,
  - β is called the temperature.
- EXAMPLE: target density for X ~ Normal_1(µ, σ²):
  π(x) = (1/(σ√(2π))) exp[−(x − µ)²/(2σ²)] ∝ exp[−(x − µ)²/(2σ²)],
  here H(x) = (x − µ)²/(2σ²) and β = 1.
MH III
- Going to use a proposal distribution (pdf) to generate guesses or proposals for the draws from the target: T(x, ·).
- EXAMPLE: given x, Normal proposal pdf for Y ~ Normal_1(x, τ²):
  T(x, ·) ≡ Normal_1(x, τ²; ·).
MH IV
- Going to have to evaluate the proposal density (pdf): T(x, y).
- Make sure T(y, x) > 0 whenever T(x, y) > 0, otherwise the sampler will not work.
- Also, don't just assume T(y, x) = T(x, y); this is a very common trap for beginners.
- EXAMPLE: Normal proposal density:
  T(x, y) ≡ Normal_1(x, τ²; y) ∝ exp[−(y − x)²/(2τ²)].
MH V
- Acceptance probability used in the Metropolis-Hastings algorithm:
  α(x, y) = min{1, π(y)T(y, x) / (π(x)T(x, y))}.
- If the proposal is symmetric, i.e., if T(x, y) = T(y, x), then we have:
  α(x, y) = min{1, π(y)/π(x)},
  - if π(y) ≥ π(x) then α(x, y) = 1,
  - if π(y) < π(x) then α(x, y) < 1.
- Aside: if the proposal is symmetric then the algorithm is called the Metropolis algorithm.
- Note: since we only deal with ratios above, it's enough to know π(·) and T(·, ·) up to a proportionality constant.
MH VI
- The MH algorithm (for N-many iterations):
  1. initialize: set t = 0 and get a starting value x^(t)
  2. propose: generate y from T(x^(t), ·)
  3. eval: evaluate the acceptance probability α(x^(t), y)
  4. move: generate u from Uniform(0, 1) and set
     x^(t+1) = y if u ≤ α(x^(t), y), and x^(t+1) = x^(t) otherwise
  5. if t ≥ N stop, otherwise set t = t + 1 and go to step 2
- Aside: it's enough to compute α(·, ·) without the min part, because u ≤ 1 (why?)
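- A minimal sketch of steps 1-5 in Python, assuming user-supplied functions log_target, propose, and log_proposal standing in for log π(·), a draw from T(x, ·), and log T(x, y); all names are illustrative and the acceptance step is done on the log scale:

    import numpy as np

    def mh_sample(log_target, propose, log_proposal, x0, n_iter, rng=None):
        """Minimal Metropolis-Hastings loop (steps 1-5 above), done on the log scale.

        log_target(x)      -- log pi(x), up to an additive constant
        propose(x, rng)    -- a draw y from T(x, .)
        log_proposal(x, y) -- log T(x, y), up to an additive constant
        """
        rng = np.random.default_rng() if rng is None else rng
        x = x0                                                  # step 1: starting value
        chain = [x]
        for _ in range(n_iter):
            y = propose(x, rng)                                 # step 2: propose
            log_alpha = (log_target(y) + log_proposal(y, x)
                         - log_target(x) - log_proposal(x, y))  # step 3: log acceptance ratio
            if np.log(rng.uniform()) <= log_alpha:              # step 4: accept / reject
                x = y
            chain.append(x)                                     # step 5 handled by the for loop
        return np.array(chain)

- Note that the min in α(·, ·) is not needed here: since log u ≤ 0, comparing log u with the (possibly positive) log ratio gives exactly the same decision, which is the aside above.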
MH VII
- Process the samples: {x^(t) : t = 0, 1, ..., N}.
- Discard some initial samples, say the first N/10, as the burn-in period; for notational ease, reindex the rest as {x^(t) : t = 1, 2, ..., M}.
- Use the rest for inference.
- EXAMPLE: to estimate the mean of the target density, use the estimator (1/M) Σ_{t=1}^M x^(t).
MH VIII
- A simple example, set up:
  - target: X ~ Normal_1(µ, σ²)
  - proposal: Y ~ Normal_1(x, τ²)
- So we have:
  π(x) ∝ exp[−(x − µ)²/(2σ²)]
  T(x, ·) ≡ Normal_1(x, τ²; ·)
  T(x, y) ∝ exp[−(y − x)²/(2τ²)], note it's symmetric!
  α(x, y) = min{1, π(y)/π(x)} = min{1, exp[−(y − µ)²/(2σ²) + (x − µ)²/(2σ²)]}
- Important aside: all the above expressions are nice and fine, but while implementing, do all your computations on the log scale.
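- A minimal sketch of this example in Python, assuming the mh_sample sketch given after slide MH VI; µ, σ, τ, the starting value, and the run length are illustrative choices, and the processing from slide MH VII (burn-in, then averaging) is included:

    # target: Normal(mu, sigma^2); proposal: Normal(x, tau^2), which is symmetric
    mu, sigma, tau = 2.0, 1.5, 1.0                  # illustrative values

    def log_target(x):
        return -0.5 * (x - mu) ** 2 / sigma ** 2    # log pi(x) up to a constant

    def propose(x, rng):
        return rng.normal(x, tau)                   # a draw from T(x, .)

    def log_proposal(x, y):
        return -0.5 * (y - x) ** 2 / tau ** 2       # symmetric, so it cancels in the ratio

    chain = mh_sample(log_target, propose, log_proposal, x0=0.0, n_iter=20000)
    burn_in = len(chain) // 10                      # discard roughly N/10 as in MH VII
    kept = chain[burn_in:]
    print("estimated mean:", kept.mean())           # should be close to mu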
MH IX
- Some general guidelines while implementing a typical MH sampler:
  - tweak your proposal T(x, y) so that (see [2, Gelman et al.]) the acceptance rate is roughly in [40%, 50%] if x, y ∈ R^1 and in [20%, 30%] if x, y ∈ R^d, d > 1
  - too high (above 70%) or too low (below 10%) acceptance rates are a sign of a bad choice of T(x, y)
  - start your sampler from dispersed starting values and check that you converge around the same region of the sample space
  - propose to move very highly correlated variables together
  - do not use very high-dimensional proposals; such proposals are rarely accepted
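- One way to check the first two guidelines in practice is to look at the empirical acceptance rate of the chain; a small sketch, assuming a one-dimensional chain like the one returned by the mh_sample sketch above:

    import numpy as np

    def acceptance_rate(chain):
        # fraction of moves where x^(t+1) differs from x^(t);
        # for a continuous target this equals the fraction of accepted proposals (a.s.)
        chain = np.asarray(chain)
        return np.mean(chain[1:] != chain[:-1])

- If the rate is far above the quoted ranges, a larger proposal scale τ usually helps; if far below, a smaller one.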
EM I
- Goal is to find the Maximum Likelihood Estimator (MLE) or the Maximum A Posteriori (MAP) estimator.
- The Expectation-Maximization (EM) algorithm is the most popular method for the above.
- The above maximization problem involves two steps:
  - the Expectation step, or the E-step,
  - the Maximization step, or the M-step.
EM II
- Set up:
  - data: y := (y_1, y_2, ..., y_n)
  - parameter of interest: θ
  - nuisance parameter or missing data: z
- E-step:
  Q(θ | θ^(t)) := E_{θ^(t)}[log p(θ, z | y)] = ∫ log p(θ, z | y) p(z | θ^(t), y) dz   for MAP
  Q(θ | θ^(t)) := E_{θ^(t)}[log p(z, y | θ)] = ∫ log p(z, y | θ) p(z | θ^(t), y) dz   for MLE
- M-step:
  θ^(t+1) := arg max_θ Q(θ | θ^(t))
EM III
- The EM algorithm with ε-close stopping:
  1. initialize: set t = 0 and get a starting value θ^(t)
  2. E-step: get Q(θ | θ^(t))
  3. M-step: get θ^(t+1) = arg max_θ Q(θ | θ^(t))
  4. if |θ^(t+1) − θ^(t)| ≤ ε stop, otherwise set t = t + 1 and go to step 2
- In some easy cases you could combine the E-step and the M-step, if you have a closed-form expression for Q(θ | θ^(t)) and (hence) for arg max_θ Q(θ | θ^(t)).
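- A minimal sketch of this loop in Python, assuming the E-step and M-step have been combined into a single user-supplied update function mapping θ^(t) to θ^(t+1); all names are illustrative:

    import numpy as np

    def em(update, theta0, eps=1e-8, max_iter=1000):
        """Generic EM iteration with eps-close stopping.

        update(theta) -- the combined E-step and M-step, returning theta^(t+1)
        theta0        -- the starting value theta^(0)
        """
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            theta_new = np.asarray(update(theta), dtype=float)
            if np.max(np.abs(theta_new - theta)) <= eps:   # |theta^(t+1) - theta^(t)| <= eps
                return theta_new
            theta = theta_new
        return theta                                       # fell through: return the last iterate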
EM IV
- EXAMPLE: we want the MAP estimator of µ from (with σ² unknown):
  y_i ~ Normal_1(µ, σ²), i = 1, 2, ..., n
  µ ~ Normal_1(µ_0, τ_0²)
  p(log σ) ∝ 1
- So we have:
  - data: y := (y_1, y_2, ..., y_n)
  - parameter of interest: θ = µ
  - nuisance parameter: z = σ²
EM V
- We observe:
  log p(θ, z | y) = log p(µ, σ² | y)
                  = const − (µ − µ_0)²/(2τ_0²) − (n + 1) log σ − (1/(2σ²)) Σ_{i=1}^n (y_i − µ)²
- We also note:
  p(z | θ^(t), y) = p(σ² | µ^(t), y) ≡ Inv-χ²(n, (1/n) Σ_{i=1}^n (y_i − µ^(t))²)
EM VI
- E-step: only compute the expectations of the terms which involve θ, because the other terms are not useful in the M-step.
- So we note:
  Q(θ | θ^(t)) = Q(µ | µ^(t))
              = const − (µ − µ_0)²/(2τ_0²) − E_{µ^(t)}[(1/(2σ²)) Σ_{i=1}^n (y_i − µ)²]
              = const − (µ − µ_0)²/(2τ_0²) − (1/2) {(1/n) Σ_{i=1}^n (y_i − µ^(t))²}^{-1} Σ_{i=1}^n (y_i − µ)²,
  where the last step uses E_{µ^(t)}[1/σ²] = {(1/n) Σ_{i=1}^n (y_i − µ^(t))²}^{-1} for the scaled Inv-χ² distribution above.
- We are ignoring the term −(n + 1) E_{µ^(t)}[log σ] for the reason mentioned above.
EM VII
- M-step: note Q(θ | θ^(t)) = Q(µ | µ^(t)) is a quadratic in µ and hence easy to maximize.
- Taking derivatives once (and then twice) one can show:
  θ^(t+1) := arg max_θ Q(θ | θ^(t)) = arg max_µ Q(µ | µ^(t))
          = [ µ_0/τ_0² + n ȳ / ((1/n) Σ_{i=1}^n (y_i − µ^(t))²) ] / [ 1/τ_0² + n / ((1/n) Σ_{i=1}^n (y_i − µ^(t))²) ]
          =: µ^(t+1)
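- A minimal sketch of this update in Python, reusing the em sketch given after slide EM III; the data y and the hyperparameters µ_0, τ_0² are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(2.0, 1.5, size=100)          # illustrative data
    mu0, tau0_sq = 0.0, 10.0                    # illustrative prior hyperparameters
    n, ybar = len(y), y.mean()

    def update(mu_t):
        # E-step: E[1/sigma^2 | mu^(t), y] = 1 / ((1/n) sum_i (y_i - mu^(t))^2)
        s_sq = np.mean((y - mu_t) ** 2)
        # M-step: maximizer of the quadratic Q(mu | mu^(t)) from slide EM VII
        return (mu0 / tau0_sq + n * ybar / s_sq) / (1.0 / tau0_sq + n / s_sq)

    mu_map = em(update, theta0=ybar)
    print("MAP estimate of mu:", mu_map)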
EM VIII
- EXAMPLE: we want the MLE for the mixture proportions (π_1, π_2, ..., π_k):
  - we have k-many known densities f_j(·), j = 1, 2, ..., k
  - there are k-many unknown proportions π_j, j = 1, 2, ..., k, with Σ_{j=1}^k π_j = 1
  - y_i ~ Σ_{j=1}^k π_j f_j(·), i = 1, 2, ..., n
- So we have:
  - data: y := (y_1, y_2, ..., y_n)
  - parameter of interest: θ := (π_1, π_2, ..., π_k)
  - introduce missing data: z := (z_1, z_2, ..., z_n) such that [z_i | θ] ~ Multinomial(1, θ), i = 1, 2, ..., n
- Note: here we need to cook up the missing data in such a way that integrating / summing it out gives us back our original model, see next slide.
EM IX
- Now we can rewrite our model as:
  [y_i | z_i = e_j, θ] ~ f_j(·), i = 1, 2, ..., n, j = 1, 2, ..., k
  p(z_i = e_j | θ) = π_j, i = 1, 2, ..., n, j = 1, 2, ..., k
  here e_j is the j-th canonical vector for j = 1, 2, ..., k (e.g. e_1 = (1, 0, 0, ..., 0), etc.)
- Check that: Σ_z p(y, z | θ) = p(y | θ).
- So we have:
  log p(z, y | θ) = Σ_{i=1}^n Σ_{j=1}^k z_ij log{π_j f_j(y_i)}, and
  p(z_ij = 1 | y, θ^(t)) = p(z_i = e_j | y, θ^(t)) = π_j^(t) f_j(y_i) / Σ_{j'=1}^k π_{j'}^(t) f_{j'}(y_i) = a_ij^(t), say
EM X
- E-step:
  Q(θ | θ^(t)) = Σ_{i=1}^n Σ_{j=1}^k E_{θ^(t)}(z_ij) log{π_j f_j(y_i)}
              = Σ_{i=1}^n Σ_{j=1}^k a_ij^(t) log{π_j f_j(y_i)}
- M-step: it's a constrained maximization problem with Σ_{j=1}^k π_j = 1, which gives θ^(t+1) := arg max_θ Q(θ | θ^(t)) with j-th component
  π_j^(t+1) = Σ_{i=1}^n a_ij^(t) / Σ_{i=1}^n Σ_{j=1}^k a_ij^(t) = (1/n) Σ_{i=1}^n a_ij^(t)
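- A minimal sketch of these two steps in Python, with two known Normal component densities as an illustrative choice of the f_j(·); the data, the component parameters, and the stopping rule (as in slide EM III) are all illustrative:

    import numpy as np

    def normal_pdf(y, m, s):
        return np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

    # two known component densities f_1, f_2 (illustrative)
    densities = [lambda y: normal_pdf(y, -2.0, 1.0), lambda y: normal_pdf(y, 3.0, 1.0)]

    rng = np.random.default_rng(1)
    y = np.concatenate([rng.normal(-2.0, 1.0, 30), rng.normal(3.0, 1.0, 70)])  # illustrative data

    n, k = len(y), len(densities)
    f = np.column_stack([f_j(y) for f_j in densities])   # n x k matrix of f_j(y_i)

    pi = np.full(k, 1.0 / k)                             # starting value pi^(0)
    for _ in range(500):
        # E-step: responsibilities a_ij = pi_j f_j(y_i) / sum_j' pi_j' f_j'(y_i)
        a = pi * f
        a = a / a.sum(axis=1, keepdims=True)
        # M-step: pi_j^(t+1) = (1/n) sum_i a_ij
        pi_new = a.mean(axis=0)
        diff = np.max(np.abs(pi_new - pi))
        pi = pi_new
        if diff <= 1e-10:                                # eps-close stopping
            break

    print("estimated mixture proportions:", pi)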
EM XI
- The tricky (theoretical) part of the EM algorithm is that many missing-data schemes may give rise to the same model under consideration, but not all of them are helpful.
- EXAMPLE: in the mixture-proportions example, defining z the following way is not helpful at all (although it satisfies Σ_z p(y, z | θ) = p(y | θ)):
  p(z_i = j | θ) = π_j, i = 1, 2, ..., n, j = 1, 2, ..., k
  note here z_i is of dimension 1, as opposed to k as before.
- Finding the best missing-data scheme is an art, really; check out The Art of Data Augmentation [5, van Dyk et al.].
References
[1] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (C/R: p22-37). Journal of the Royal Statistical Society, Series B, Methodological, 39:1-22, 1977.
[2] A. Gelman, G. O. Roberts, and W. R. Gilks. Efficient Metropolis jumping rules. In Bayesian Statistics 5: Proceedings of the Fifth Valencia International Meeting, pages 599-607, 1996.
[3] Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences (Disc: p483-501, 503-511). Statistical Science, 7:457-472, 1992.
[4] Charles J. Geyer. Practical Markov chain Monte Carlo (Disc: p483-503). Statistical Science, 7:473-483, 1992.
[5] David A. van Dyk and Xiao-Li Meng. The art of data augmentation (Pkg: p1-111). Journal of Computational and Graphical Statistics, 10(1):1-50, 2001.