Parallel Tempering I

- this is a fancy Metropolis-Hastings (MH) algorithm
- it is also called Metropolis-Coupled MCMC, i.e., MCMCMC!
- (as the name suggests) it consists of running multiple MH chains in parallel
- invented by Charles J. Geyer [1, Geyer, 1991]

March 21, 2006, (c) 2006 Gopi Goswami (goswami@stat.harvard.edu)
Parallel Tempering II

- we want samples from the target density $g(z)$, $z \in \mathbb{R}^d$
- let $H(z) = -\log g(z)$; then we have $g(z) = \exp\{-H(z)/1.0\}$
- $H(\cdot)$, the negative of the log density, is called the fitness function
- in general, one might be interested in sampling from $g(z) \propto \exp\{-H(z)/\tau_{\min}\}$, $z \in \mathbb{R}^d$, assuming $\int \exp\{-H(z)/\tau_{\min}\}\, dz < \infty$
- note $H(\tilde{u}) \le H(\tilde{v}) \iff g(\tilde{u}) \ge g(\tilde{v})$, so low fitness values correspond to good, i.e., high-probability, samples
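As a concrete illustration of the fitness function, the sketch below uses a hypothetical 1-d bimodal target (the mixture of normals is our own illustrative choice, not a density from the slides) to show that low fitness corresponds to high probability:

```python
import math

# Hypothetical 1-d target: an equal mixture of N(-3, 1) and N(3, 1)
# (an illustrative assumption, not a density used in these slides).
def g(z):
    return (0.5 * math.exp(-0.5 * (z + 3.0) ** 2)
            + 0.5 * math.exp(-0.5 * (z - 3.0) ** 2))

# Fitness function: H(z) = -log g(z), so that g(z) = exp{-H(z)/1.0}.
def H(z):
    return -math.log(g(z))

# Low fitness <=> high probability: the mode at z = 3 beats the valley at z = 0.
print(H(3.0) < H(0.0))  # True
```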
Parallel Tempering III

- consider a temperature ladder (just a decreasing sequence of positive numbers): $t_1 > t_2 > \cdots > t_N > 0.0$, where $t_N = \tau_{\min}$
- extend the sample space: $x := (x_1; \ldots; x_i; \ldots; x_N) \in \mathbb{R}^{Nd}$
- terminology:
  - population or state of the chain: $(x_1, t_1; \ldots; x_i, t_i; \ldots; x_N, t_N)$
  - $i$th chromosome: $x_i$
- modified target density: $f(x) \propto \prod_{i=1}^{N} f_i(x_i)$, where $f_i(x_i) \propto \exp\{-H(x_i)/t_i\}$, $i = 1, 2, \ldots, N$, and note $f_N(\cdot) = g(\cdot)$ because $t_N = \tau_{\min}$
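A minimal sketch of the tempered densities $f_i$, using a hypothetical geometric temperature ladder (the standard-normal fitness and the factor-of-2 spacing are illustrative assumptions, not the concrete recipe of [2]):

```python
import math

def H(z):
    # fitness of a hypothetical 1-d standard-normal target, up to a constant
    return 0.5 * z * z

def log_f(z, t):
    # log of the unnormalized tempered density f_i(z) proportional to exp{-H(z)/t_i}
    return -H(z) / t

# a geometric ladder t_1 > t_2 > ... > t_N > 0 with t_N = tau_min
N, tau_min = 5, 1.0
ladder = [tau_min * 2.0 ** (N - i) for i in range(1, N + 1)]  # [16, 8, 4, 2, 1]

# hotter levels flatten the density: the log-density penalty at z = 2
# is much milder at t_1 = 16 than at t_N = 1
print(log_f(2.0, ladder[0]), log_f(2.0, ladder[-1]))  # -0.125 -2.0
```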
Parallel Tempering IV

Parallel Tempering (PT) consists of two types of moves:

- MH update (local move): apply MH updates to the individual chains at the different temperature levels, i.e., to the chromosomes; also called the Mutation move
- Exchange update (global move): propose to swap the states of the chains at two neighboring temperature levels, i.e., two neighboring chromosomes; also called the Random Exchange move
Mutation I

1. choose $i \in \{1, \ldots, N\}$ using some distribution $p(I = i \mid x)$, which could be random or deterministic
2. for simple Random Walk Metropolis (RWM), propose $\tilde{y}_i = x_i + \epsilon_i$, where $\epsilon_i$ is suitably chosen from a symmetric, mean-zero proposal distribution $T_i(x_i, \cdot)$
   - note we could choose $T_i(x_i, \cdot) = \mathrm{Normal}_d(x_i, \sigma_i^2 I_d)$; the $\sigma_i^2$ values may need some tweaking after observing the level-specific acceptance rates of the Mutation move
   - one can also do block-wise or coordinate-wise Gibbs, or use a general MH update on $x_i$ here
3. accept $(\tilde{y}, t) = (x_1, t_1; \ldots; \tilde{y}_i, t_i; \ldots; x_N, t_N)$ with probability $\alpha_m = \min(1, r_m)$, where
   $$r_m = \frac{f_i(\tilde{y}_i)}{f_i(x_i)} \cdot \frac{p(I = i \mid \tilde{y})}{p(I = i \mid x)}$$
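Steps 2-3 can be sketched as follows (a 1-d RWM update; the standard-normal fitness and the deterministic choice of level, which makes the $p(I = i \mid \cdot)$ factors cancel, are our own illustrative choices):

```python
import math
import random

def H(z):
    return 0.5 * z * z  # fitness of a hypothetical 1-d standard-normal target

def mutate(x, t, sigma, rng):
    """One RWM Mutation update of a single chromosome at temperature t."""
    y = x + rng.gauss(0.0, sigma)       # symmetric, mean-zero proposal T(x, .)
    log_r = (H(x) - H(y)) / t           # log r_m; the T_i's cancel for RWM
    if math.log(rng.random()) < log_r:  # accept with probability min(1, r_m)
        return y
    return x

rng = random.Random(0)
x = 0.0
for _ in range(1000):
    x = mutate(x, 1.0, 1.0, rng)        # a short run at the coldest level
```

Tuning `sigma` per level, as the slide suggests, amounts to watching each level's acceptance rate and widening or narrowing the proposal accordingly.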
Mutation II

computation of $r_m$: here we are using a mixture of updaters

- note the only change between $x$ and $\tilde{y}$ is in the $i$th chromosome: $x_i$ has changed to $\tilde{y}_i$
- so $T(x, \tilde{y}) = p(I = i \mid x)\, T_i(x_i, \tilde{y}_i)$ and $T(\tilde{y}, x) = p(I = i \mid \tilde{y})\, T_i(\tilde{y}_i, x_i)$
- hence we have:
  $$r_m = \frac{f(\tilde{y})\, T(\tilde{y}, x)}{f(x)\, T(x, \tilde{y})} = \frac{f_i(\tilde{y}_i)\, p(I = i \mid \tilde{y})\, T_i(\tilde{y}_i, x_i)}{f_i(x_i)\, p(I = i \mid x)\, T_i(x_i, \tilde{y}_i)} = \frac{f_i(\tilde{y}_i)}{f_i(x_i)} \cdot \frac{p(I = i \mid \tilde{y})}{p(I = i \mid x)}$$
- here the $T_i(\cdot, \cdot)$'s cancel because it's a RWM; in general, they may not
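A quick numeric check of the cancellation (the Gaussian proposal density and the particular values are illustrative): for a symmetric RWM proposal, $T_i(x, y) = T_i(y, x)$, so those factors drop out of $r_m$.

```python
import math

def normal_pdf(y, mean, sd):
    # density of a 1-d Normal(mean, sd^2) proposal evaluated at y
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# symmetry: T_i(x, y) = T_i(y, x), so the T_i's cancel in r_m
x, y, sd = 0.3, 1.1, 0.8
print(normal_pdf(y, x, sd) == normal_pdf(x, y, sd))  # True
```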
Mutation III

- at higher temperature levels, Mutation moves are easily accepted because the distribution is flat, and thus hotter chains travel around the sample space a lot
- at lower temperature levels, Mutation moves are rarely accepted because the distribution is very spiky, and hence colder chains tend to get stuck around a mode
- thus Mutation does local exploration at lower temperatures, and since the lowest temperature is the temperature of interest, doing only Mutation doesn't help; one needs to consider the next move
- but the sticky nature of the Mutation move at lower temperatures is a plus point as well: it tends to foster finer local exploration
Random Exchange I

1. select $i \in \{1, \ldots, N\}$ with $p(I_1 = i \mid x) = \frac{1}{N}$; also select $j \ne i$ s.t. $p(I_2 = 2 \mid x, I_1 = 1) = 1$, $p(I_2 = N-1 \mid x, I_1 = N) = 1$, and for $i \ne 1, N$, $p(I_2 = i \pm 1 \mid x, I_1 = i) = 0.5$
2. propose to exchange $x_i$ and $x_j$
3. accept $(\tilde{y}, t) = (x_1, t_1; \ldots; x_j, t_i; \ldots; x_i, t_j; \ldots; x_N, t_N)$ with probability $\alpha_{re} = \min(1, r_{re})$, where
   $$r_{re} = \frac{f_i(x_j)\, f_j(x_i)}{f_i(x_i)\, f_j(x_j)} = \exp[(H(x_j) - H(x_i))(1/t_j - 1/t_i)]$$
- what are $T(x, \tilde{y})$ and $T(\tilde{y}, x)$ here?
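The selection rule and acceptance ratio above can be sketched as follows (the quadratic fitness and the two-level ladder are illustrative assumptions):

```python
import math
import random

def H(z):
    return 0.5 * z * z  # hypothetical 1-d fitness

def exchange(chain, ladder, rng):
    """Propose to swap the states at two neighboring temperature levels."""
    N = len(chain)
    i = rng.randrange(N)                 # I_1 uniform on the N levels
    if i == 0:
        j = 1                            # level 1 must pick level 2
    elif i == N - 1:
        j = N - 2                        # level N must pick level N-1
    else:
        j = i + rng.choice((-1, 1))      # interior: either neighbor, prob 0.5
    # log r_re = (H(x_j) - H(x_i)) * (1/t_j - 1/t_i)
    log_r = (H(chain[j]) - H(chain[i])) * (1.0 / ladder[j] - 1.0 / ladder[i])
    if math.log(rng.random()) < log_r:   # accept with probability min(1, r_re)
        chain[i], chain[j] = chain[j], chain[i]

# a good sample (H = 0) sitting at the hot level is transported down
rng = random.Random(0)
chain, ladder = [0.0, 5.0], [10.0, 1.0]
for _ in range(100):
    exchange(chain, ladder, rng)
print(chain)  # the low-fitness state almost surely ends at the cold level
```

(On the question in the slide: the pair-selection probability depends only on the indices, not the states, so $T(x, \tilde{y})$ and $T(\tilde{y}, x)$ cancel.)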
Random Exchange II

- for $i > j$, $H(x_j) \le H(x_i)$ implies $r_{re} \ge 1$, because $1/t_j \le 1/t_i$
- so good samples are brought down the ladder, and in the process the bad guys are pushed up
- so this move probabilistically transports good samples down and bad samples up the ladder
- this can cause jumps between two widely separated modes; thus this move has a global nature
Parallel Tempering Algorithm I

- we initialize the population to $\{x_i^{(0)}, i = 1, 2, \ldots, N\}$, giving $x^{(0)}$
- we take a suitably chosen temperature ladder $\{t_i, i = 1, 2, \ldots, N\}$; for a concrete recipe see [2, Goswami et al.]
- choose a move-mixture probability vector $(q, 1 - q)$, $q \in (0, 1)$

Algorithm 0.1 (PT: one iteration).

1. with probability $q$, apply the Mutation move $N$ times on the population
2. with probability $1 - q$, apply the Random Exchange move $N$ times on the resultant population

- thus we get draws: $x^{(0)} \to x^{(1)} \to x^{(2)} \to \cdots \to x^{(t)} \to \cdots$
- samples of interest: upon convergence, we look at $\{x_N^{(t)}, t = 1, 2, \ldots, m\}$ out of $\{x^{(t)}, t = 1, 2, \ldots, m\}$
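Algorithm 0.1 can be sketched end-to-end as follows (the bimodal target, the four-level ladder, and $q = 0.7$ are all illustrative assumptions; the Mutation step sweeps the levels deterministically rather than choosing a level at random):

```python
import math
import random

def H(z):
    # hypothetical bimodal fitness: equal mixture of N(-3, 1) and N(3, 1)
    return -math.log(0.5 * math.exp(-0.5 * (z + 3.0) ** 2)
                     + 0.5 * math.exp(-0.5 * (z - 3.0) ** 2))

def pt_iteration(chain, ladder, q, sigma, rng):
    """One PT iteration: Mutation with prob q, else Random Exchange, N times."""
    N = len(chain)
    if rng.random() < q:
        for i in range(N):                         # Mutation at every level
            y = chain[i] + rng.gauss(0.0, sigma)
            if math.log(rng.random()) < (H(chain[i]) - H(y)) / ladder[i]:
                chain[i] = y
    else:
        for _ in range(N):                         # Random Exchange, N times
            i = rng.randrange(N)
            j = 1 if i == 0 else N - 2 if i == N - 1 else i + rng.choice((-1, 1))
            log_r = (H(chain[j]) - H(chain[i])) * (1.0 / ladder[j] - 1.0 / ladder[i])
            if math.log(rng.random()) < log_r:
                chain[i], chain[j] = chain[j], chain[i]
    return chain

rng = random.Random(1)
ladder = [8.0, 4.0, 2.0, 1.0]                      # t_N = tau_min = 1
chain = [0.0, 0.0, 0.0, 0.0]
draws = [pt_iteration(chain, ladder, 0.7, 1.0, rng)[-1] for _ in range(5000)]
# 'draws' holds the coldest-level samples x_N^(t), the samples of interest;
# they should visit both modes near -3 and +3
```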
Parallel Tempering V

- is doing all this extra work worth the effort? PT isn't stuck in one mode like MH!

[Figure 1: MH vs. PT comparison]
Parallel Tempering VI

- is doing all this extra work worth the effort? PT yields less auto-correlation than MH (BTW, these plots alone are not enough; one needs to look at the AIAT. Relying on the plots is what others do; we shouldn't.)

[Figure 2: MH vs. PT comparison]
Parallel Tempering VII

- PT is a computationally expensive method, so it is generally used for harder problems where simple MH cannot possibly jump between modes in a finite amount of time and the draws produced by MH are very highly correlated
- intuitively, why does PT work?
  - the Mutation move at higher temperatures helps to cover the whole space, and at lower temperatures fosters finer local exploration
  - the Exchange move does the transportation job, facilitating global exploration: good ones go down, bad guys go up
References

[1] C. J. Geyer. Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proc. 23rd Symp. Interface, pages 156-163, 1991.

[2] Gopika R. Goswami and Jun S. Liu. On real-parameter evolutionary Monte Carlo algorithm. Statistics and Computing, 2006 (just accepted).