Markov Chain Monte Carlo Lecture 6


An actively pursued research direction for alleviating the local-trap problem suffered by the Metropolis-Hastings (MH) algorithm is population-based MCMC, where a population of Markov chains are run in parallel, each equipped with a possibly different but related invariant distribution. Information exchange between different chains provides a means for the target chains to learn from past samples, and this in turn improves the convergence of the target chains. Mathematically, population-based MCMC may be described as follows. In order to simulate from a target distribution f(x), one simulates an augmented system with the invariant distribution

f(x_1, ..., x_N) = ∏_{i=1}^N f_i(x_i),    (1)

where (x_1, ..., x_N) ∈ X^N, N is called the population size, f_i(x) = f(x) for at least one i ∈ {1, 2, ..., N}, and the distributions different from f(x) are called the trial distributions in terms of importance sampling. Different ways of specifying the trial distributions and of updating the population of Markov chains lead to different algorithms, such as adaptive direction sampling (Gilks et al., 1994), conjugate gradient Monte Carlo (Liu, Liang and Wong, 2000), parallel tempering (Geyer, 1991; Hukushima and Nemoto, 1996), evolutionary Monte Carlo (Liang and Wong, 2000, 2001), sequential parallel tempering (Liang, 2003), and the equi-energy sampler (Kou, Zhou and Wong, 2006).

Adaptive direction sampling

Adaptive direction sampling (ADS) (Gilks et al., 1994) is an early population-based MCMC method, in which each distribution f_i(x) is identical to the target distribution and, at each iteration, one sample is randomly selected from the current population to undergo an update along a direction toward another sample randomly selected from the remaining set of the current population. An important form of ADS is the snooker algorithm:

1. Select one individual, say x_c^{(t)}, at random from the current population x^{(t)}. The point x_c^{(t)} is called the current point.

2. Select another individual, say x_a^{(t)}, from the remaining set of the current population, i.e., {x_i^{(t)} : i ≠ c}, and form a direction e_t = x_c^{(t)} − x_a^{(t)}. The individual x_a^{(t)} is called the anchor point.

3. Set y_c = x_a^{(t)} + r_t e_t, where r_t is a scalar sampled from the density

   f(r) ∝ |r|^{d−1} f(x_a^{(t)} + r e_t),    (2)

   where d is the dimension of x, and the factor |r|^{d−1} is derived from a transformation Jacobian (Roberts and Gilks, 1994).

4. Form the new population x^{(t+1)} by replacing x_c^{(t)} by y_c and leaving all other individuals unchanged (i.e., set x_i^{(t+1)} = x_i^{(t)} for i ≠ c).
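As a concrete illustration, the snooker update can be sketched in a few lines of Python. This is only a sketch under simplifying assumptions: the target density f is user-supplied (possibly unnormalized), `rng` is a `numpy.random.Generator` such as `np.random.default_rng(0)`, and the scalar r is drawn from (2) by a crude griddy (discretized inverse-CDF) approximation on a fixed grid rather than by an exact one-dimensional sampler.

```python
import numpy as np

def snooker_update(pop, f, rng, grid=np.linspace(-3.0, 3.0, 601)):
    """One ADS snooker move on a population `pop` (an N x d array).

    `f` is the (possibly unnormalized) target density; the scalar r is drawn
    from a density proportional to |r|^(d-1) f(x_a + r*e) by a crude griddy
    (discretized inverse-CDF) approximation on `grid`.
    """
    N, d = pop.shape
    c = rng.integers(N)                                  # current point index
    a = rng.choice([i for i in range(N) if i != c])      # anchor point index
    e = pop[c] - pop[a]                                  # direction e_t = x_c - x_a
    dens = np.array([abs(r) ** (d - 1) * f(pop[a] + r * e) for r in grid])
    r = rng.choice(grid, p=dens / dens.sum())            # draw r from density (2)
    new_pop = pop.copy()
    new_pop[c] = pop[a] + r * e                          # y_c = x_a + r*e_t
    return new_pop
```

In practice the draw of r is carried out exactly or with a Metropolis-type step rather than on a fixed grid; the grid here only keeps the sketch short. The validity of drawing r from (2) is the content of Lemma 0.1 below.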

To show that the sampler is proper, we need to show that, at equilibrium, the new sample y_c is independent of the remaining individuals x_i^{(t)}, i ≠ c, and is distributed as f(x). This fact follows directly from the following lemma, which is a generalized version of Lemma 3.1 of Roberts and Gilks (1994) and was proved by Liu, Liang and Wong (2000).

Lemma 0.1 (Liu, Liang and Wong, 2000) Suppose x ~ π(x) and y is any fixed point in a d-dimensional space. Let e = x − y. If r is drawn from the distribution f(r) ∝ |r|^{d−1} π(y + re), then x' = y + re follows the distribution π(x). If y is generated from a distribution independent of x, then x' is independent of y.

Conjugate gradient Monte Carlo (Liu, Liang and Wong, 2000)

Let x^{(t)} = (x_1^{(t)}, ..., x_N^{(t)}) denote the current population of samples. One iteration of the CGMC sampler consists of the following steps.

1. Select one individual, say x_c^{(t)}, at random from the current population x^{(t)}.

2. Select another individual, say x_a^{(t)}, at random from the remaining set of the population, i.e., {x_i^{(t)} : i ≠ c}. Starting from x_a^{(t)}, conduct a deterministic search, using the conjugate gradient method or the steepest descent method, to find a local mode of f(x). Denote the local mode by z_a^{(t)}, which is called the anchor point.

3. Set y_c = z_a^{(t)} + r_t e_t, where e_t = x_c^{(t)} − z_a^{(t)}, and r_t is a scalar sampled from the density

   f(r) ∝ |r|^{d−1} f(z_a^{(t)} + r e_t),    (3)

   where d is the dimension of x, and the factor |r|^{d−1} is derived from the transformation Jacobian.

4. Form the new population x^{(t+1)} by replacing x_c^{(t)} by y_c and leaving all other individuals unchanged (i.e., set x_i^{(t+1)} = x_i^{(t)} for i ≠ c).

The gradient-based optimization procedure performed in step 2 can be replaced by some other optimization procedure, for example, a short run of simulated annealing (Kirkpatrick et al., 1983). Since the local optimization step is usually expensive in computation, Liu, Liang and Wong (2000) proposed the multiple-try MH algorithm for the line sampling step, which enables effective use of the local modal information of the distribution and thus improves the convergence of the algorithm.
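The anchor-point search in step 2 is an ordinary deterministic optimization started from x_a^{(t)}. The sketch below is illustrative only: it assumes a user-supplied density `f` and its negative log-density `neg_log_f`, uses scipy's conjugate-gradient routine for the mode search, and reuses the same griddy approximation as in the ADS sketch for the line-sampling step (which Liu, Liang and Wong instead carry out with a multiple-try MH step).

```python
import numpy as np
from scipy.optimize import minimize

def cgmc_update(pop, f, neg_log_f, rng, grid=np.linspace(-3.0, 3.0, 601)):
    """One CGMC move (illustrative sketch only).

    Step 2 replaces the ADS anchor point by a local mode of f, found here by
    a conjugate-gradient search started from a randomly chosen individual;
    the line-sampling step reuses the griddy approximation.
    """
    N, d = pop.shape
    c = rng.integers(N)
    a = rng.choice([i for i in range(N) if i != c])
    z_a = minimize(neg_log_f, pop[a], method="CG").x     # local mode of f (anchor point)
    e = pop[c] - z_a                                     # e_t = x_c - z_a
    dens = np.array([abs(r) ** (d - 1) * f(z_a + r * e) for r in grid])
    r = rng.choice(grid, p=dens / dens.sum())            # draw r from density (3)
    new_pop = pop.copy()
    new_pop[c] = z_a + r * e                             # y_c = z_a + r*e_t
    return new_pop
```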

Sample MH Algorithm (Lewandowski and Liu, 2008)

In adaptive direction sampling and conjugate gradient Monte Carlo, when updating the population, one first selects an individual from the population and then updates the selected individual using the standard Metropolis-Hastings procedure. If the candidate state is of high quality relative to the whole population, one certainly wants to keep it in the population. However, the acceptance of the candidate state depends on the quality of the individual that is selected for updating. To improve the acceptance rate of high quality candidates and to improve the set {x_i^{(t)} : i = 1, ..., N} as a sample of size N from f(x), Lewandowski and Liu (2008) proposed the sampling Metropolis-Hastings (SMH) algorithm.

Sample MH Algorithm

Take one candidate draw x_0^{(t)} from a proposal distribution g(x) on X, and compute the acceptance probability

α_0^{(t)} = [ Σ_{i=1}^N g(x_i^{(t)})/f(x_i^{(t)}) ] / [ Σ_{i=0}^N g(x_i^{(t)})/f(x_i^{(t)}) − min_{0≤k≤N} g(x_k^{(t)})/f(x_k^{(t)}) ].

Draw U ~ Unif[0, 1], and set

S_{t+1} = {x_1^{(t+1)}, ..., x_N^{(t+1)}}
        = S_t,                                                                    if U > α_0^{(t)};
        = {x_1^{(t)}, ..., x_{i−1}^{(t)}, x_0^{(t)}, x_{i+1}^{(t)}, ..., x_N^{(t)}},   if U ≤ α_0^{(t)},

where i is chosen from {1, ..., N} with probability weights

( g(x_1^{(t)})/f(x_1^{(t)}), ..., g(x_N^{(t)})/f(x_N^{(t)}) ).

Thus, x^{(t+1)} and x^{(t)} differ by one element at most. It is easy to see that in the case of N = 1, SMH reduces to the traditional MH algorithm with independence proposals. The merit of SMH is that, to accept a candidate state, it compares the candidate with the whole population instead of a single individual randomly selected from the current population. Lewandowski and Liu (2008) show that, under mild conditions, SMH converges to the target distribution ∏_{i=1}^N f(x_i) for {x_1, ..., x_N}, and can be more efficient than the traditional MH algorithm and adaptive direction sampling.
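A minimal sketch of one SMH step is given below. The names `g_sample` and `g_density` are placeholders for a user-supplied independence proposal (a sampler returning a length-d numpy array and its density); the importance ratios g/f only need f up to a normalizing constant, which cancels in the acceptance probability and in the replacement weights.

```python
import numpy as np

def smh_step(pop, f, g_sample, g_density, rng):
    """One sampling Metropolis-Hastings (SMH) step; a minimal sketch.

    `pop` is an N x d array, `f` the (possibly unnormalized) target density,
    and `g_sample` / `g_density` draw from and evaluate the independence
    proposal g (all user-supplied; the names are assumptions of this sketch).
    """
    N = pop.shape[0]
    x0 = g_sample(rng)                                   # candidate draw x_0 ~ g
    w = np.array([g_density(z) / f(z) for z in np.vstack([x0[None, :], pop])])
    # acceptance probability: sum_{i=1..N} w_i / (sum_{i=0..N} w_i - min_k w_k)
    alpha = w[1:].sum() / (w.sum() - w.min())
    if rng.uniform() <= alpha:
        i = rng.choice(N, p=w[1:] / w[1:].sum())         # weights prop. to g(x_i)/f(x_i)
        pop = pop.copy()
        pop[i] = x0                                      # replace x_i by the candidate
    return pop
```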

Parallel tempering (Geyer, 1991)

Parallel tempering simulates in parallel a sequence of distributions

f_i(x) ∝ exp(−H(x)/T_i),  i = 1, ..., N,    (4)

where T_i is the temperature associated with the distribution f_i(x). The temperatures form a ladder T_1 > T_2 > ... > T_{N−1} > T_N ≡ 1, so that f_N(x) ≡ f(x) corresponds to the target distribution. The idea underlying this algorithm can be explained as follows: raising the temperature flattens the energy landscape of the distribution and thus eases the MH traversal of the sample space; the high density samples generated at the high temperature levels can be transmitted to the target temperature level through the exchange operations, and this in turn improves the convergence of the target Markov chain.

Let x^{(t)} = (x_1^{(t)}, ..., x_N^{(t)}) denote the current population of samples. One iteration of parallel tempering consists of the following steps.

1. Parallel MH step: Update each x_i^{(t)} to x_i^{(t+1)} using the MH algorithm.

2. State swapping step: Try to exchange x_i^{(t+1)} with its neighbors: set j = i − 1 or i + 1 according to the probabilities q_e(i, j), where q_e(i, i+1) = q_e(i, i−1) = 0.5 for 1 < i < N and q_e(1, 2) = q_e(N, N−1) = 1, and accept the swap with probability

   min{ 1, exp( [H(x_i^{(t+1)}) − H(x_j^{(t+1)})] [1/T_i − 1/T_j] ) }.    (5)
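The two steps can be sketched as follows. The Gaussian random-walk proposal and its scale are arbitrary choices made for this sketch, and only one randomly chosen neighbouring pair is proposed for swapping per iteration, for brevity.

```python
import numpy as np

def parallel_tempering_step(pop, H, temps, rng, step=0.5):
    """One parallel tempering iteration: parallel MH updates, then a swap try.

    `pop` is an N x d array, `H` the energy function (f_i prop. to exp(-H/T_i)),
    `temps` the temperature ladder T_1 > ... > T_N = 1.
    """
    pop = pop.copy()
    N, d = pop.shape
    # 1. Parallel MH step at each temperature level
    for i in range(N):
        prop = pop[i] + step * rng.normal(size=d)
        if rng.uniform() < np.exp(-(H(prop) - H(pop[i])) / temps[i]):
            pop[i] = prop
    # 2. State swapping step with a neighbouring level, accepted by rule (5)
    i = rng.integers(N)
    j = i + 1 if i == 0 else (i - 1 if i == N - 1 else i + rng.choice([-1, 1]))
    r = np.exp((H(pop[i]) - H(pop[j])) * (1.0 / temps[i] - 1.0 / temps[j]))
    if rng.uniform() < min(1.0, r):
        pop[[i, j]] = pop[[j, i]]                        # swap the two states
    return pop
```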

Evolutionary Monte Carlo (Liang and Wong, 2000, 2001)

The genetic algorithm (Holland, 1975) has been successfully applied to many hard optimization problems, such as the traveling salesman problem, protein folding, and machine learning, among others. It is known that its crossover operator is the key to the power of the genetic algorithm, which makes it possible to explore a far greater range of potential solutions to a problem than conventional optimization algorithms. Motivated by the genetic algorithm, Liang and Wong (2000, 2001) proposed the evolutionary Monte Carlo (EMC) algorithm, which incorporates the most attractive features of the genetic algorithm into the framework of Markov chain Monte Carlo. EMC works in a fashion similar to parallel tempering: a population of Markov chains are simulated in parallel, with each chain having a different temperature. The difference between the two algorithms is that EMC includes a genetic operator, namely the crossover operator, in its simulation. The numerical results indicate that the crossover operator improves the convergence of the simulation and that EMC can outperform parallel tempering in almost all scenarios.

Suppose the target distribution of interest is written in the form f(x) ∝ exp{−H(x)}, x ∈ X ⊆ R^d, where the dimension d > 1, and H(x) is called the fitness function in terms of genetic algorithms. Let x = {x_1, ..., x_N} denote a population of size N, with x_i drawn from the distribution with density f_i(x) ∝ exp{−H(x)/T_i}. In terms of genetic algorithms, x_i is called a chromosome or an individual, each element of x_i is called a gene, and a realization of the element is called a genotype. As in parallel tempering, the temperatures form a decreasing ladder T_1 > T_2 > ... > T_N ≡ 1, with f_N(x) being the target distribution.

Mutation

The mutation operator is defined as an additive Metropolis-Hastings move. One chromosome, say x_k, is randomly selected from the current population x. A new chromosome is generated by adding a random vector e_k, so that

y_k = x_k + e_k,    (6)

where the scale of e_k is chosen such that the operation has a moderate acceptance rate, e.g., 0.2 to 0.5, as suggested by Gelman, Roberts and Gilks (1996). The new population y = {x_1, ..., x_{k−1}, y_k, x_{k+1}, ..., x_N} is accepted with probability min(1, r_m), where

r_m = [f(y)/f(x)] [T(x|y)/T(y|x)] = exp{ −[H(y_k) − H(x_k)]/T_k } T(x|y)/T(y|x),    (7)

and T(·|·) denotes the transition probability between populations.
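A minimal sketch of the mutation move, assuming a symmetric Gaussian proposal so that the ratio T(x|y)/T(y|x) in (7) cancels; the proposal scale is a placeholder that should be tuned to give a moderate acceptance rate (roughly 0.2 to 0.5).

```python
import numpy as np

def emc_mutation(pop, H, temps, rng, scale=0.5):
    """EMC mutation: additive MH update of one randomly chosen chromosome.

    With a symmetric proposal e_k, the move is accepted with probability
    min(1, exp(-[H(y_k) - H(x_k)] / T_k)), as in (7).
    """
    pop = pop.copy()
    k = rng.integers(pop.shape[0])
    y_k = pop[k] + scale * rng.normal(size=pop.shape[1])  # y_k = x_k + e_k
    if rng.uniform() < np.exp(-(H(y_k) - H(pop[k])) / temps[k]):
        pop[k] = y_k
    return pop
```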

Crossover

One type of crossover operator that works for real-coded chromosomes is the so-called real crossover, which includes the k-point and uniform crossover operators. They were called real crossover by Wright (1991) to indicate that they are applied to real-coded chromosomes. In addition to the real crossover, Liang and Wong (2001a) proposed the snooker crossover operator, which works as follows:

1. Randomly select one chromosome, say x_i, from the current population x.

2. Select another chromosome, say x_j, from the sub-population x \ {x_i} with a probability proportional to exp{−H(x_j)/T_s}, where T_s is called the selection temperature.

3. Let e = x_i − x_j, and set y = x_j + r e, where r ∈ (−∞, ∞) is a random variable sampled from the density

   f(r) ∝ |r|^{d−1} f(x_j + r e).    (8)

4. Construct the new population by replacing x_i with the offspring y and leaving the other chromosomes unchanged.

Exchange

This operation is the same as that used in parallel tempering (Geyer, 1991; Hukushima and Nemoto, 1996). Given the current population x and the temperature ladder t, with (x, t) = (x_1, T_1, ..., x_N, T_N), one tries to make an exchange between x_i and x_j without changing the T's. The new population x' is accepted with probability min(1, r_e), where

r_e = [f(x')/f(x)] [T(x|x')/T(x'|x)] = exp{ (H(x_i) − H(x_j)) (1/T_i − 1/T_j) }.    (9)

Typically, the exchange is only performed on neighboring temperature levels.

The Algorithm

Based on the operators described above, the algorithm can be summarized as follows. Given an initial population x = {x_1, ..., x_N} and a temperature ladder t = {T_1, T_2, ..., T_N}, EMC iterates between the following two steps:

1. Apply either the mutation or the crossover operator to the population with probability q_m and 1 − q_m, respectively. The q_m is called the mutation rate.

2. Try to exchange x_i with x_j for N pairs (i, j), with i being sampled uniformly on {1, ..., N} and j = i ± 1 with probability q_e(i, j), where q_e(i, i+1) = q_e(i, i−1) = 0.5 and q_e(1, 2) = q_e(N, N−1) = 1.
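Putting the pieces together, one EMC iteration can be sketched as below. The `mutation` and `crossover` arguments are user-supplied operators, e.g. the `emc_mutation` sketch above and a snooker or real crossover built along the lines of the ADS sketch; they are assumptions of this sketch, not part of the original algorithm description.

```python
import numpy as np

def emc_iteration(pop, H, temps, rng, q_m=0.25, mutation=None, crossover=None):
    """One EMC iteration (sketch): mutation or crossover, then N exchange tries.

    `mutation` and `crossover` are user-supplied operators with the signature
    op(pop, H, temps, rng); q_m is the mutation rate.
    """
    N = pop.shape[0]
    # 1. Mutation with probability q_m, otherwise crossover
    if rng.uniform() < q_m:
        pop = mutation(pop, H, temps, rng)
    else:
        pop = crossover(pop, H, temps, rng)
    # 2. N exchange attempts between neighbouring temperature levels, rule (9)
    pop = pop.copy()
    for _ in range(N):
        i = rng.integers(N)
        j = i + 1 if i == 0 else (i - 1 if i == N - 1 else i + rng.choice([-1, 1]))
        r_e = np.exp((H(pop[i]) - H(pop[j])) * (1.0 / temps[i] - 1.0 / temps[j]))
        if rng.uniform() < min(1.0, r_e):
            pop[[i, j]] = pop[[j, i]]
    return pop
```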

Consider simulating from a 2D mixture normal distribution

f(x) = (1/(2πσ²)) Σ_{k=1}^{20} w_k exp{ −(1/(2σ²)) (x − µ_k)'(x − µ_k) },    (10)

where σ = 0.1 and w_1 = ... = w_20 = 0.05. The mean vectors µ_1, µ_2, ..., µ_20 (given in Table 1) were uniformly drawn from the rectangle [0, 10] × [0, 10]. Among them, components 2, 4, and 15 are well separated from the others. The distance between component 4 and its nearest neighboring component is 3.15, and the distance between component 15 and its nearest neighboring component (other than component 2) is 3.84; these are 31.5 and 38.4 times the standard deviation, respectively. Mixing the components across such long distances poses a great challenge for EMC.

Table 1: Mean vectors of the 20 components of the mixture normal distribution (Liang and Wong, 2001).

 k   µ_k1   µ_k2  |  k   µ_k1   µ_k2  |  k   µ_k1   µ_k2  |  k   µ_k1   µ_k2
 1   2.18   5.76  |  6   3.25   3.47  | 11   5.41   2.65  | 16   4.93   1.50
 2   8.67   9.59  |  7   1.70   0.50  | 12   2.70   7.88  | 17   1.83   0.09
 3   4.24   8.48  |  8   4.59   5.60  | 13   4.98   3.70  | 18   2.26   0.31
 4   8.41   1.68  |  9   6.91   5.81  | 14   1.14   2.39  | 19   5.54   6.86
 5   3.93   8.82  | 10   6.87   5.40  | 15   8.33   9.50  | 20   1.69   8.11
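For reference, the target density (10) and its energy H(x) = −log f(x) can be coded directly from Table 1; the names `MU`, `SIGMA`, `W`, `f`, and `H` below are conventions of this sketch.

```python
import numpy as np

# Mean vectors from Table 1 (components 1-20)
MU = np.array([
    [2.18, 5.76], [8.67, 9.59], [4.24, 8.48], [8.41, 1.68], [3.93, 8.82],
    [3.25, 3.47], [1.70, 0.50], [4.59, 5.60], [6.91, 5.81], [6.87, 5.40],
    [5.41, 2.65], [2.70, 7.88], [4.98, 3.70], [1.14, 2.39], [8.33, 9.50],
    [4.93, 1.50], [1.83, 0.09], [2.26, 0.31], [5.54, 6.86], [1.69, 8.11],
])
SIGMA, W = 0.1, 0.05           # common scale and equal weights w_1 = ... = w_20

def f(x):
    """Mixture density (10) evaluated at a 2-vector x."""
    sq = np.sum((x - MU) ** 2, axis=1)
    return np.sum(W / (2 * np.pi * SIGMA ** 2) * np.exp(-sq / (2 * SIGMA ** 2)))

def H(x):
    """Energy H(x) = -log f(x), the fitness function used by EMC."""
    return -np.log(f(x))
```

These are the functions the mutation, crossover, exchange, and parallel tempering sketches above would be applied to in this example.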

Table 2: Comparison of EMC and parallel tempering for the mixture normal example (Liang and Wong, 2001).

                         EMC-A            EMC-B             PT
parameter  true value   est.    SD       est.    SD       est.    SD
µ_1          4.48       4.48   0.004     4.44   0.026     3.78   0.032
µ_2          4.91       4.91   0.008     4.86   0.023     4.34   0.044
Σ_11         5.55       5.55   0.006     5.54   0.051     3.66   0.111
Σ_22         9.86       9.84   0.010     9.78   0.048     8.55   0.049
Σ_12         2.61       2.59   0.011     2.58   0.043     1.29   0.084

[Figure 1: The sample paths of the first 10,000 iterations at the temperature level T = 1 (x and y axes span [0, 10]). (a) Evolutionary Monte Carlo (evolutionary sampling). (b) Parallel tempering. (Liang and Wong, 2001a)]

[Figure 2: The plots of the whole set of samples (x and y axes span [0, 10]). (a) Evolutionary Monte Carlo (evolutionary sampling). (b) Parallel tempering. (Liang and Wong, 2001a)]