Decomposition Methods and Sampling Circuits in the Cartesian Lattice


Dana Randall
College of Computing and School of Mathematics
Georgia Institute of Technology, Atlanta, GA
(Supported in part by NSF Grant No. CCR)

Abstract. Decomposition theorems are useful tools for bounding the convergence rates of Markov chains. The theorems relate the mixing rate of a Markov chain to smaller, derivative Markov chains, defined by a partition of the state space, and can be useful when standard, direct methods fail. Not only does this simplify the chain being analyzed, but it allows a hybrid approach whereby different techniques for bounding convergence rates can be used on different pieces. We demonstrate this approach by giving bounds on the mixing time of a chain on circuits of length $2n$ in $\mathbb{Z}^d$.

1 Introduction

Suppose that you want to sample from a large set of combinatorial objects. A popular method for doing this is to define a Markov chain whose state space $\Omega$ consists of the elements of the set, and use it to perform a random walk. We first define a graph $H$ connecting pairs of states that are close under some metric. This underlying graph on the state space, representing allowable transitions, is known as the Markov kernel. To define the transition probabilities of the Markov chain, we need to consider the desired stationary distribution $\pi$ on $\Omega$. A method known as the Metropolis algorithm assigns probabilities to the edges of $H$ so that the resulting Markov chain will converge to this distribution. In particular, if $\Delta$ is the maximum degree of any vertex in $H$, and $(x, y)$ is any edge,
$$P(x, y) = \frac{1}{2\Delta} \min\left(1, \frac{\pi(y)}{\pi(x)}\right).$$
We then assign self-loops all remaining probability at each vertex, so $P(x, x) \geq 1/2$ for all $x \in \Omega$. If $H$ is connected, $\pi$ will be the unique stationary distribution of this Markov chain. We can see this by verifying that detailed balance is satisfied on every edge $(x, y)$, i.e., $\pi(x)P(x, y) = \pi(y)P(y, x)$. As a result, if we start at any vertex in $\Omega$ and perform a random walk according to the transition probabilities defined by $P$, and we walk long enough, we will converge to the desired distribution. For this to be useful, we need to converge rapidly to $\pi$, so that after a small, polynomial number of steps, our samples are chosen from a distribution which is provably arbitrarily close to stationarity. A Markov chain with this property is rapidly mixing.
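The Metropolis rule above translates directly into a one-step sampling routine. Below is a minimal sketch in Python, assuming an abstract `neighbors` function for the kernel $H$ and a `weight` function proportional to $\pi$ (the names and the proposal-slot scheme are ours; the normalizing constant of $\pi$ cancels in the ratio):

```python
import random

def metropolis_step(x, neighbors, weight, max_degree):
    """One step of the Metropolis chain on the kernel H.

    neighbors(x) -- the states adjacent to x in H
    weight(x)    -- any function proportional to pi(x)
    """
    if random.random() < 0.5:
        return x  # lazy half-step ensures P(x, x) >= 1/2
    nbrs = neighbors(x)
    i = random.randrange(max_degree)  # one of Delta proposal slots
    if i >= len(nbrs):
        return x  # dummy slot: vertices of degree < Delta hold
    y = nbrs[i]
    # accept with probability min(1, pi(y)/pi(x))
    if random.random() < min(1.0, weight(y) / weight(x)):
        return y
    return x
```

Each edge $(x, y)$ of $H$ is then traversed with probability $\frac{1}{2\Delta}\min(1, \pi(y)/\pi(x))$, as required.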

Consider, for example, the set of independent sets $\mathcal{I}$ of some graph $G$. Taking the Hamming metric, we can define $H$ by connecting any two independent sets that differ by the addition or deletion of a single vertex. A popular stationary distribution is the Gibbs distribution, which assigns weight $\pi(I) = \gamma^{|I|}/Z_\gamma$ to $I$, where $\gamma > 0$ is an input parameter of the system, $|I|$ is the size of the independent set $I$, and $Z_\gamma = \sum_{J \in \mathcal{I}} \gamma^{|J|}$ is the normalizing constant known as the partition function. In the Metropolis chain, we have $P(I, I') = \frac{1}{2n}\min(1, \gamma)$ if $I'$ is formed by adding a vertex to $I$, and $P(I, I') = \frac{1}{2n}\min(1, \gamma^{-1})$ if $I'$ is formed by deleting a vertex from $I$.

Recently there has been great progress in the design and analysis of Markov chains which are provably efficient. One of the most popular proof techniques is coupling. Informally, coupling says that if two copies of the Markov chain can be simultaneously simulated so that they end up in the same state very quickly, regardless of the starting states, then the chain is rapidly mixing. In many instances this is not hard to establish, which gives a very easy proof of fast convergence. Despite the appeal of these simple coupling arguments, a major drawback is that many Markov chains which appear to be rapidly mixing do not seem to admit coupling proofs. In fact, the complexity of typical Markov chains often makes it difficult to use any of the standard techniques, which include bounding the conductance, the log-Sobolev constant or the spectral gap, all closely related to the mixing rate.

The decomposition method offers a way to systematically simplify the Markov chain by breaking it into more manageable pieces. The idea is that it should be easier to apply some of these techniques to the simplified Markov chains and then infer a bound on the original Markov chain. In this survey we will concentrate on the state decomposition theorem, which utilizes some partition of the state space. It says that if the Markov chain is rapidly mixing when restricted to each piece of the partition, and if there is sufficient flow between the pieces (defined by a projection of the chain), then the original Markov chain must be rapidly mixing as well. This allows us to take a top-down approach to mixing rate analysis, whereby we need only consider the mixing rates of the restrictions and the projection. In many cases it is easier to define good couplings on these simpler Markov chains, or to use one of the other known methods of analysis. We note, however, that using indirect methods such as decomposition or comparison (defined later) invariably adds orders of magnitude to the bounds on the running time of the algorithm. Hence it is wise to use these methods judiciously unless the goal is simply to establish a polynomial bound on the mixing rate.
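Before turning to the machinery, here is a minimal sketch of the insert/delete Metropolis chain on independent sets described above (Python; the set-based representation and the function name are our own illustrative choices):

```python
import random

def independent_set_step(ind_set, graph, gamma):
    """One Metropolis move for pi(I) proportional to gamma^|I|.

    graph -- dict mapping each vertex to the set of its neighbors
    """
    if random.random() < 0.5:
        return ind_set  # laziness: P(I, I) >= 1/2
    v = random.choice(list(graph))  # uniform vertex, probability 1/n
    if v in ind_set:
        # deletion, accepted with probability min(1, 1/gamma)
        if random.random() < min(1.0, 1.0 / gamma):
            return ind_set - {v}
    elif not (graph[v] & ind_set):
        # addition keeps the set independent; accept with min(1, gamma)
        if random.random() < min(1.0, gamma):
            return ind_set | {v}
    return ind_set
```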

2 Mixing machinery

In what follows, we assume that $\mathcal{M}$ is an ergodic (i.e., irreducible and aperiodic), reversible Markov chain with finite state space $\Omega$, transition probability matrix $P$, and stationary distribution $\pi$. The time a Markov chain takes to converge to its stationary distribution, i.e., the mixing time of the chain, is measured in terms of the distance between the distribution at time $t$ and the stationary distribution. Letting $P^t(x, y)$ denote the $t$-step probability of going from $x$ to $y$, the total variation distance at time $t$ is
$$\|P^t, \pi\|_{tv} = \max_{x \in \Omega} \frac{1}{2} \sum_{y \in \Omega} |P^t(x, y) - \pi(y)|.$$
For $\varepsilon > 0$, the mixing time $\tau(\varepsilon)$ is
$$\tau(\varepsilon) = \min\{t : \|P^{t'}, \pi\|_{tv} \leq \varepsilon \ \forall\, t' \geq t\}.$$
We say a Markov chain is rapidly mixing if the mixing time is bounded above by a polynomial in $n$ and $\log \frac{1}{\varepsilon}$, where $n$ is the size of each configuration in the state space.

It is well known that the mixing rate is related to the spectral gap of the transition matrix. For the transition matrix $P$, we let $\mathrm{Gap}(P) = \lambda_0 - \lambda_1$ denote its spectral gap, where $\lambda_0, \lambda_1, \ldots, \lambda_{|\Omega|-1}$ are the eigenvalues of $P$ and $1 = \lambda_0 > \lambda_1 \geq |\lambda_i|$ for all $i \geq 2$. The following result relates the spectral gap and mixing times of a chain (see, e.g., [18]).

Theorem 1. Let $\pi_* = \min_{x \in \Omega} \pi(x)$. For all $\varepsilon > 0$ we have
(a) $\tau(\varepsilon) \leq \frac{1}{\mathrm{Gap}(P)} \log\left(\frac{1}{\pi_* \varepsilon}\right)$;
(b) $\tau(\varepsilon) \geq \frac{\lambda_1}{2\,\mathrm{Gap}(P)} \log\left(\frac{1}{2\varepsilon}\right)$.

Hence, if $1/\mathrm{Gap}(P)$ is bounded above by a polynomial, we are guaranteed fast (polynomial time) convergence. For most of what follows we will rely on the spectral gap bound on mixing. Conversely, Theorem 1 is useful for deriving a bound on the spectral gap from a coupling proof, which provides bounds on the mixing time.
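For toy examples these quantities can be checked numerically against Theorem 1. A minimal sketch, assuming the chain is small enough that $P$ fits in memory as a dense matrix:

```python
import numpy as np

def spectral_gap(P):
    """Gap(P) = lambda_0 - lambda_1 for a transition matrix P.
    For a reversible chain the eigenvalues are real; .real discards
    numerical noise from the general eigensolver."""
    eigs = np.sort(np.linalg.eigvals(P).real)[::-1]  # descending order
    return eigs[0] - eigs[1]

def tv_distance(P, pi, t):
    """Total variation distance after t steps from the worst start:
    max_x (1/2) sum_y |P^t(x, y) - pi(y)|."""
    Pt = np.linalg.matrix_power(P, t)
    return 0.5 * np.abs(Pt - pi).sum(axis=1).max()
```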

We now review some of the main techniques used to bound the mixing rate of a chain, including the decomposition theorem.

2.1 Path Coupling

One of the most popular methods for bounding mixing times has been the coupling method. A coupling is a Markov chain on $\Omega \times \Omega$ with the following properties. Instead of updating the pair of configurations independently, the coupling updates them so that (i) the two processes tend to coalesce, or move together under some measure of distance, yet (ii) each process, viewed in isolation, is performing transitions exactly according to the original Markov chain. A valid coupling ensures that once the pair of configurations coalesce, they agree from that time forward. The mixing time can be bounded by the expected time for configurations to coalesce under any valid coupling. The method of path coupling simplifies our goal by letting us bound the mixing rate of a Markov chain by considering only a small subset of $\Omega \times \Omega$ [3, 6].

Theorem 2. (Dyer and Greenhill [6]) Let $\Phi$ be an integer-valued metric defined on $\Omega \times \Omega$ which takes values in $\{0, \ldots, B\}$. Let $U$ be a subset of $\Omega \times \Omega$ such that for all $(x, y) \in \Omega \times \Omega$ there exists a path $x = z_0, z_1, \ldots, z_r = y$ between $x$ and $y$ such that $(z_i, z_{i+1}) \in U$ for $0 \leq i < r$ and
$$\sum_{i=0}^{r-1} \Phi(z_i, z_{i+1}) = \Phi(x, y).$$
Define a coupling $(x, y) \mapsto (x', y')$ of the Markov chain $\mathcal{M}$ on all pairs $(x, y) \in U$. Suppose that there exists $\alpha < 1$ such that $\mathrm{E}[\Phi(x', y')] \leq \alpha\, \Phi(x, y)$ for all $(x, y) \in U$. Then the mixing time of $\mathcal{M}$ satisfies
$$\tau(\varepsilon) \leq \frac{\log(B \varepsilon^{-1})}{1 - \alpha}.$$
Useful bounds can also be derived in the case that $\alpha = 1$ in the theorem (see [6]).

2.2 The disjoint decomposition method

Madras and Randall [12] introduced two decomposition theorems which relate the mixing rate of a Markov chain to the mixing rates of related Markov chains. The state decomposition theorem allows the state space to be decomposed into overlapping subsets; the mixing rate of the original chain can be bounded by the mixing rates of the restricted Markov chains, which are forced to stay within the pieces, and the ergodic flow between these sets. The density decomposition theorem is of a similar flavor, but relates a Markov chain to a family of other Markov chains with the same Markov kernel, where the transition probabilities of the original chain can be described as a weighted average of the transition probabilities of the chains in the family. We will concentrate on the state decomposition theorem, and will present a newer version of the theorem due to Martin and Randall [15] which allows the decomposition of the state space to be a partition, rather than requiring that the pieces overlap.

Suppose that the state space is partitioned into $m$ disjoint pieces $\Omega_1, \ldots, \Omega_m$. For each $i = 1, \ldots, m$, define $P_i = P\{\Omega_i\}$ as the restriction of $P$ to $\Omega_i$ which rejects moves that leave $\Omega_i$. In particular, the restriction to $\Omega_i$ is a Markov chain $\mathcal{M}_i$, where the transition matrix $P_i$ is defined as follows: if $x \neq y$ and $x, y \in \Omega_i$ then $P_i(x, y) = P(x, y)$; if $x \in \Omega_i$ then $P_i(x, x) = 1 - \sum_{y \in \Omega_i,\, y \neq x} P_i(x, y)$. Let $\pi_i$ be the normalized restriction of $\pi$ to $\Omega_i$, i.e., $\pi_i(A) = \pi(A \cap \Omega_i)/\pi(\Omega_i)$. Notice that if $\Omega_i$ is connected then $\pi_i$ is the stationary distribution of $P_i$. Next, define $\bar{P}$ to be the following aggregated transition matrix on the state space $[m]$:
$$\bar{P}(i, j) = \frac{1}{\pi(\Omega_i)} \sum_{x \in \Omega_i,\, y \in \Omega_j} \pi(x) P(x, y).$$

Theorem 3. (Martin and Randall [15]) Let $P_i$ and $\bar{P}$ be as above. Then the spectral gaps satisfy
$$\mathrm{Gap}(P) \geq \frac{1}{2}\, \mathrm{Gap}(\bar{P}) \min_{i \in [m]} \mathrm{Gap}(P_i).$$
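When a chain is small enough to tabulate, the restrictions and the projection can be formed explicitly and Theorem 3 checked numerically. A sketch, where `blocks` is our own bookkeeping: a list of index lists, one per piece $\Omega_i$:

```python
import numpy as np

def restriction(P, block):
    """P_i: keep transitions inside the block, push rejected
    moves onto the diagonal as self-loops."""
    Pi = P[np.ix_(block, block)].copy()
    np.fill_diagonal(Pi, 0.0)
    np.fill_diagonal(Pi, 1.0 - Pi.sum(axis=1))
    return Pi

def projection(P, pi, blocks):
    """Aggregated chain P_bar(i, j) = (1/pi(B_i)) * sum over
    x in B_i, y in B_j of pi(x) P(x, y)."""
    m = len(blocks)
    Pbar = np.zeros((m, m))
    for i, Bi in enumerate(blocks):
        for j, Bj in enumerate(blocks):
            Pbar[i, j] = (pi[Bi][:, None] * P[np.ix_(Bi, Bj)]).sum() / pi[Bi].sum()
    return Pbar
```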

A useful corollary allows us to replace $\bar{P}$ in the theorem with the Metropolis chain defined on the same Markov kernel, provided some simple conditions are satisfied. Since the transitions of the Metropolis chain are fully defined by the stationary distribution $\pi$, this is often easier to analyze than the true projection. Define $P_M$ on the set $[m]$, with Metropolis transitions $P_M(i, j) = \min\{1, \pi(\Omega_j)/\pi(\Omega_i)\}$. Let $\partial_i(\Omega_j) = \{y \in \Omega_j : \exists x \in \Omega_i \text{ with } P(x, y) > 0\}$.

Corollary 1. [15] With $P_M$ as above, suppose there exist $\beta > 0$ and $\gamma > 0$ such that
(a) $P(x, y) \geq \beta$ whenever $P(x, y) > 0$;
(b) $\pi(\partial_i(\Omega_j)) \geq \gamma\, \pi(\Omega_j)$ whenever $\bar{P}(i, j) > 0$.
Then
$$\mathrm{Gap}(P) \geq \frac{1}{2}\, \beta \gamma\, \mathrm{Gap}(P_M) \min_{i = 1, \ldots, m} \mathrm{Gap}(P_i).$$

2.3 The comparison method

When applying the decomposition theorem, we reduce the analysis of a Markov chain to bounding the convergence times of smaller related chains. In many cases it will be much simpler to analyze variants of these auxiliary Markov chains instead of the true restrictions and projections. The comparison method tells us ways in which we can slightly modify one of these Markov chains without qualitatively changing the mixing time. For instance, it allows us to add additional transition edges or to amplify some of the transition probabilities, which can be useful tricks for simplifying the analysis of a chain.

Let $P$ and $\tilde{P}$ be two reversible Markov chains on the same state space $\Omega$ with the same stationary distribution $\pi$. The comparison method allows us to relate the mixing times of these two chains (see [4] and [17]). In what follows, suppose that $\mathrm{Gap}(\tilde{P})$, the spectral gap of $\tilde{P}$, is known (or suitably bounded) and we desire a bound on $\mathrm{Gap}(P)$, the spectral gap of $P$, which is unknown. Following [4], we let $E(P) = \{(x, y) : P(x, y) > 0\}$ and $E(\tilde{P}) = \{(x, y) : \tilde{P}(x, y) > 0\}$ denote the sets of edges of the two chains, viewed as directed graphs. For each $(x, y) \in E(\tilde{P})$, define a path $\gamma_{xy}$ using a sequence of states $x = x_0, x_1, \ldots, x_k = y$ with $(x_i, x_{i+1}) \in E(P)$, and let $|\gamma_{xy}|$ denote the length of the path. Let $\Gamma(z, w) = \{(x, y) \in E(\tilde{P}) : (z, w) \in \gamma_{xy}\}$ be the set of paths that use the transition $(z, w)$ of $P$. Finally, define
$$A = \max_{(z, w) \in E(P)} \frac{1}{\pi(z) P(z, w)} \sum_{(x, y) \in \Gamma(z, w)} |\gamma_{xy}|\, \pi(x) \tilde{P}(x, y).$$

Theorem 4. (Diaconis and Saloff-Coste [4]) With the above notation, the spectral gaps satisfy
$$\mathrm{Gap}(P) \geq \frac{1}{A}\, \mathrm{Gap}(\tilde{P}).$$

It is worthwhile to note that there are several other comparison theorems which turn out to be useful, especially when applying decomposition techniques. The following lemma helps us reason about a Markov chain by slightly modifying the transition probabilities (see, e.g., [10]). We use this trick in our main application, sampling circuits.

Lemma 1. Suppose $P$ and $\tilde{P}$ are Markov chains on the same state space, each reversible with respect to the distribution $\pi$. Suppose there are constants $c_1$ and $c_2$ such that $c_1 \tilde{P}(x, y) \leq P(x, y) \leq c_2 \tilde{P}(x, y)$ for all $x \neq y$. Then
$$c_1\, \mathrm{Gap}(\tilde{P}) \leq \mathrm{Gap}(P) \leq c_2\, \mathrm{Gap}(\tilde{P}).$$
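Given an explicit choice of canonical paths, the congestion constant $A$ of Theorem 4 can be computed mechanically for small chains. A sketch, assuming `paths` maps each edge of $\tilde{P}$ to its path (a representation chosen here for illustration):

```python
from collections import defaultdict

def comparison_constant(P, P_tilde, pi, paths):
    """A = max over edges (z, w) of P of
    (1 / (pi(z) P(z, w))) * sum over paths gamma_xy through (z, w)
    of |gamma_xy| * pi(x) * P_tilde(x, y).

    paths -- dict mapping an edge (x, y) of P_tilde to the state
             sequence x = x_0, ..., x_k = y along edges of P
    """
    load = defaultdict(float)
    for (x, y), path in paths.items():
        length = len(path) - 1  # |gamma_xy|
        for z, w in zip(path, path[1:]):
            load[(z, w)] += length * pi[x] * P_tilde[x, y]
    return max(load[(z, w)] / (pi[z] * P[z, w]) for (z, w) in load)
```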

3 Sampling circuits in the Cartesian lattice

A circuit in $\mathbb{Z}^d$ is a walk along lattice edges which starts and ends at the origin. Our goal is to sample from $\mathcal{C}$, the set of circuits of length $2n$. It is useful to represent each walk as a string of $2n$ letters using $\{a_1, \ldots, a_d\}$ and their inverses $\{a_1^{-1}, \ldots, a_d^{-1}\}$, where $a_i$ represents a positive step in the $i$th direction and $a_i^{-1}$ represents a negative step. Since these are closed circuits, the number of times $a_i$ appears must equal the number of times $a_i^{-1}$ appears, for all $i$. We will show how to uniformly sample from the set of all circuits of length $2n$ using an efficient Markov chain. The primary tool will be finding an appropriate decomposition of the state space. We outline the proof here and refer the reader to [16] for complete details.

Using a similar strategy, Martin and Randall showed how to use a Markov chain to sample circuits in regular $d$-ary trees, i.e., paths of length $2n$ which trace edges of the tree starting and ending at the origin [15]. This problem generalizes to sampling Dyck paths according to a distribution which favors walks that hit the $x$-axis a large number of times, known in the statistical physics community as adsorbing staircase walks. Here too the decomposition method was the basis of the analysis. We note that there are other simple algorithms for sampling circuits on trees which do not require Markov chains. In contrast, to our knowledge, the Markov chain based algorithm discussed in this paper is the first efficient method for sampling circuits on $\mathbb{Z}^d$.

3.1 The Markov chain on circuits

The Markov chain on $\mathcal{C}$ is based on two types of moves: transpositions of neighboring letters in the word (which keep the numbers of each letter fixed) and rotations, which replace an adjacent pair $(a_i, a_i^{-1})$ with $(a_j, a_j^{-1})$, for some pair of letters $a_i$ and $a_j$. We now define the transition probabilities $P$ of $\mathcal{M}$, where we write $x \in_u X$ to mean that we choose $x$ from the set $X$ uniformly. Starting at $\sigma$, do the following. With probability $1/2$, pick $i \in_u [2n-1]$ and transpose $\sigma_i$ and $\sigma_{i+1}$. With probability $1/2$, pick $i \in_u [2n-1]$ and $k \in_u [d]$, and if $\sigma_i$ and $\sigma_{i+1}$ are inverses (where $\sigma_i$ is a step in the positive direction), then replace them with $(a_k, a_k^{-1})$. Otherwise keep $\sigma$ unchanged. The chain is aperiodic, ergodic and reversible, and the transitions are symmetric, so the stationary distribution of this Markov chain is the uniform distribution on $\mathcal{C}$.
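For concreteness, one transition of $\mathcal{M}$ can be sketched as follows, encoding $a_i$ as $+i$ and $a_i^{-1}$ as $-i$ (the encoding is our own):

```python
import random

def circuit_step(sigma, d):
    """One transition of the circuits chain M.

    sigma -- list of length 2n over {1..d} and {-1..-d},
             where -i encodes the inverse letter a_i^{-1}
    """
    sigma = list(sigma)                   # work on a copy
    i = random.randrange(len(sigma) - 1)  # neighboring positions i, i+1
    if random.random() < 0.5:
        # transposition: swap neighboring letters
        sigma[i], sigma[i + 1] = sigma[i + 1], sigma[i]
    else:
        # rotation: replace an adjacent (a_j, a_j^{-1}) with (a_k, a_k^{-1})
        k = random.randrange(1, d + 1)
        if sigma[i] > 0 and sigma[i + 1] == -sigma[i]:
            sigma[i], sigma[i + 1] = k, -k
    return sigma
```

Transpositions preserve the trace defined in the next subsection; only rotations move between pieces of the decomposition.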

3.2 Bounding the mixing rate of the circuits Markov chain

We bound the mixing rate of $\mathcal{M}$ by appealing to the decomposition theorem. Let $\sigma \in \mathcal{C}$ and let $x_i$ equal the number of occurrences of $a_i$, and hence of $a_i^{-1}$, in $\sigma$, for all $i$. Define the trace $\mathrm{Tr}(\sigma)$ to be the vector $X = (x_1, \ldots, x_d)$. This defines a partition of the state space $\mathcal{C} = \cup_X\, \mathcal{C}_X$, where the union is over all partitions of $n$ into $d$ pieces and $\mathcal{C}_X$ is the set of words $\sigma \in \mathcal{C}$ such that $\mathrm{Tr}(\sigma) = X = (x_1, \ldots, x_d)$. The cardinality of the set $\mathcal{C}_X$ is $\binom{2n}{x_1, x_1, \ldots, x_d, x_d}$, the number of distinct words (or permutations) of length $2n$ using the letters with these prescribed multiplicities. The number of sets in the partition of the state space is exactly the number of partitions of $n$ into $d$ pieces, $D = \binom{n+d-1}{d-1}$.

Each restricted Markov chain consists of all the words which have a fixed trace. Hence, transitions in the restricted chains consist of only transpositions, as rotations would change the trace. The projection $\bar{P}$ consists of a simplex containing $D$ vertices, each representing a distinct partition of $n$. Letting $\Phi(X, Y) = \frac{1}{2}\|X - Y\|_1$, two points $X$ and $Y$ are connected by an edge of $\bar{P}$ iff $\Phi(X, Y) = 1$, where $\|\cdot\|_1$ denotes the $\ell_1$ metric. In the following we make no attempt to optimize the running time, and instead simply provide polynomial bounds on the convergence rates.

Step 1. The restricted Markov chains: Consider any of the restricted chains $P_X$ on the set of configurations with trace $X$. We need to show that this simpler chain, connecting pairs of words differing by a transposition of adjacent letters, converges quickly for any fixed trace. We can analyze the transposition moves on this set by mapping $\mathcal{C}_X$ to the set of linear extensions of a particular partial order. Consider the alphabet $\cup_i \{\{a_{i,1}, \ldots, a_{i,x_i}\} \cup \{A_{i,1}, \ldots, A_{i,x_i}\}\}$, and the partial order defined by the relations $a_{i,1} \prec a_{i,2} \prec \cdots \prec a_{i,x_i}$ and $A_{i,1} \prec A_{i,2} \prec \cdots \prec A_{i,x_i}$, for all $i$. It is straightforward to see that there is a bijection between the set of circuits in $\mathcal{C}_X$ and the set of linear extensions of this partial order (mapping $a^{-1}$ to $A$). Furthermore, this bijection preserves transpositions. We appeal to the following theorem due to Bubley and Dyer [2]:

Theorem 5. The transposition Markov chain on the set of linear extensions of a partial order on $n$ elements has mixing time $O(n^4(\log^2 n + \log \varepsilon^{-1}))$.

Referring to Theorem 1, we can derive the following bound.

Corollary 2. The Markov chain $P_X$ has spectral gap $\mathrm{Gap}(P_X) \geq 1/(c n^4 \log^2 n)$ for some constant $c$.

Step 2. The projection of the Markov chain: The states $\bar{\Omega}$ of the projection consist of partitions of $n$ into $d$ pieces, so $|\bar{\Omega}| = D$. The stationary probability of $X = (x_1, \ldots, x_d)$ is $\bar{\pi}(X) \propto \binom{2n}{x_1, x_1, \ldots, x_d, x_d}$, proportional to the number of words with these multiplicities. The Markov kernel is defined by connecting two partitions $X$ and $Y$ if the distance $\Phi(X, Y) = \|X - Y\|_1/2 = 1$. Before applying Corollary 1 we first need to bound the mixing rate of the Markov chain defined by Metropolis probabilities. In particular, if $X = (x_1, \ldots, x_d)$ and $Y = (x_1, \ldots, x_i + 1, \ldots, x_j - 1, \ldots, x_d)$, then
$$P_M(X, Y) = \frac{1}{2n^2} \min\left(1, \frac{\bar{\pi}(Y)}{\bar{\pi}(X)}\right) = \frac{1}{2n^2} \min\left(1, \frac{x_j^2}{(x_i + 1)^2}\right).$$
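A sketch of one move of this projection chain follows. Here the proposal picks an ordered pair of coordinates uniformly, so the proposal constant is $1/d^2$ rather than the $1/(2n^2)$ above; only the Metropolis acceptance ratio is faithful to the text:

```python
import random

def projection_step(X):
    """One Metropolis move on traces X = (x_1, ..., x_d) with sum n:
    move one unit from coordinate j to coordinate i, accepting with
    probability min(1, x_j^2 / (x_i + 1)^2), the ratio pi(Y)/pi(X)
    of the multinomial weights."""
    d = len(X)
    i, j = random.randrange(d), random.randrange(d)
    if i == j or X[j] == 0:
        return X  # nothing to move
    if random.random() < min(1.0, X[j] ** 2 / (X[i] + 1) ** 2):
        Y = list(X)
        Y[i] += 1
        Y[j] -= 1
        return tuple(Y)
    return X
```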

We analyze this Metropolis chain indirectly by first considering a variant $\tilde{P}_M$ which admits a simpler path coupling proof. Using the same Markov kernel, define the transitions
$$\tilde{P}_M(X, Y) = \frac{1}{2n^2 (x_i + 1)^2}.$$
In particular, the $x_i$ in the denominator is the value which would be increased by the rotation. Notice that detailed balance is satisfied:
$$\frac{\bar{\pi}(X)}{\bar{\pi}(Y)} = \frac{(x_i + 1)^2}{x_j^2} = \frac{\tilde{P}_M(Y, X)}{\tilde{P}_M(X, Y)}.$$
This guarantees that $\tilde{P}_M$ has the same stationary distribution as $P_M$, namely $\bar{\pi}$. The mixing rate of this chain can be bounded directly using path coupling. Let $U \subseteq \bar{\Omega} \times \bar{\Omega}$ be pairs of states $X$ and $Y$ such that $\Phi(X, Y) = 1$. We couple by choosing the same pair of indices $i$ and $j$, and the same bit $b \in \{-1, 1\}$, to update each of $X$ and $Y$, where the probability for accepting each of these moves is dictated by the transitions of $\tilde{P}_M$.

Lemma 2. For any pair $(X_t, Y_t) \in U$, the expected change in distance after one step of the coupled chain is
$$\mathrm{E}[\Phi(X_{t+1}, Y_{t+1})] \leq \left(1 - \frac{1}{n^6}\right) \Phi(X_t, Y_t).$$

Proof. If $(X_t, Y_t) \in U$, then there exist coordinates $k$ and $k'$ such that $y_k = x_k + 1$ and $y_{k'} = x_{k'} - 1$. Without loss of generality, assume that $k = 1$ and $k' = 2$. We need to determine the expected change in distance after one step of the coupled chain. Suppose that in this move we try to add 1 to $x_i$ and $y_i$ and subtract 1 from $x_j$ and $y_j$. We consider three cases.

Case 1: If $|\{i, j\} \cap \{1, 2\}| = 0$, then both processes accept the move with the same probability and $\Phi(X_{t+1}, Y_{t+1}) = 1$.

Case 2: If $|\{i, j\} \cap \{1, 2\}| = 1$, then we shall see that the expected change is also zero. Assume without loss of generality that $i = 1$ and $j = 3$, and first consider the case $b = 1$. Then we move from $X$ to $X' = (x_1 + 1, x_2, x_3 - 1, \ldots, x_d)$ with probability $\frac{1}{2n^2(x_1 + 1)^2}$ and from $Y$ to $Y' = (x_1 + 2, x_2 - 1, x_3 - 1, \ldots, x_d)$ with probability $\frac{1}{2n^2(x_1 + 2)^2}$. Since $\tilde{P}_M(X, X') > \tilde{P}_M(Y, Y')$, with probability $\tilde{P}_M(Y, Y')$ we update both $X$ and $Y$; with probability $\tilde{P}_M(X, X') - \tilde{P}_M(Y, Y')$ we update just $X$; and with all remaining probability we update neither. In the first case we end up with $X'$ and $Y'$, in the second we end up with $X'$ and $Y$, and in the final case we stay at $X$ and $Y$. All of these pairs are unit distance apart, so the expected change in distance is zero. If $b = -1$, then $\tilde{P}_M(X, X') = \tilde{P}_M(Y, Y') = \frac{1}{2n^2(x_3 + 1)^2}$ and again the coupling keeps the configurations unit distance apart.

Case 3: If $|\{i, j\} \cap \{1, 2\}| = 2$, then we shall see that the expected change is at most zero. Assume without loss of generality that $i = 1$, $j = 2$ and $b = 1$. The probability of moving from $X$ to $X' = (x_1 + 1, x_2 - 1, \ldots, x_d) = Y$ is $\tilde{P}_M(X, X') = \frac{1}{2n^2(x_1 + 1)^2}$. The probability of moving from $Y$ to $Y' = (x_1 + 2, x_2 - 2, \ldots, x_d)$ is $\tilde{P}_M(Y, Y') = \frac{1}{2n^2(x_1 + 2)^2}$. So with probability $\tilde{P}_M(Y, Y')$ we update both configurations, keeping them unit distance apart, and with probability $\tilde{P}_M(X, X') - \tilde{P}_M(Y, Y') \geq \frac{1}{2n^6}$ we update just $X$, decreasing the distance to zero.

When $b = -1$ the symmetric argument shows that we again have a small chance of decreasing the distance. Summing over all of these possibilities yields the lemma.

The path coupling theorem implies that the mixing time is bounded by $\tau(\varepsilon) = O(n^6 \log n)$. Furthermore, we get the following bound on the spectral gap.

Theorem 6. The Markov chain $\tilde{P}_M$ on $\bar{\Omega}$ has spectral gap $\mathrm{Gap}(\tilde{P}_M) \geq c'/(n^6 \log n)$ for some constant $c'$.

This bounds the spectral gap of the modified Metropolis chain $\tilde{P}_M$, but we can readily compare the spectral gaps of $P_M$ and $\tilde{P}_M$ using Lemma 1. Since all the transitions of $P_M$ are at least as large as those of $\tilde{P}_M$, we find:

Corollary 3. The Markov chain $P_M$ on $\bar{\Omega}$ has spectral gap $\mathrm{Gap}(P_M) \geq c'/(n^6 \log n)$.

Step 3. Putting the pieces together: These bounds on the spectral gaps of the restrictions $P_i$ and the Metropolis projection $P_M$ enable us to apply the decomposition theorem to derive a bound on the spectral gap of $P$, the original chain.

Theorem 7. The Markov chain $P$ is rapidly mixing on $\mathcal{C}$ and the spectral gap satisfies $\mathrm{Gap}(P) \geq c''/(n^{12} d \log^3 n)$, for some constant $c''$.

Proof. To apply Corollary 1, we need to bound the parameters $\beta$ and $\gamma$. We find that $\beta \geq \frac{1}{4nd}$, the minimum probability of a transition. To bound $\gamma$ we need to determine what fraction of the words in $\mathcal{C}_X$ are neighbors of a word in $\mathcal{C}_Y$ when $\Phi(X, Y) = 1$ (since $\pi$ is uniform within each of these sets). If $X = (x_1, \ldots, x_d)$ and $Y = (x_1, \ldots, x_i + 1, \ldots, x_j - 1, \ldots, x_d)$, this fraction is exactly the likelihood that a word in $\mathcal{C}_X$ has an $a_i$ followed by an $a_i^{-1}$, and this is easily determined to be at least $1/n$. Combining $\beta \geq \frac{1}{4nd}$ and $\gamma \geq \frac{1}{n}$ with the bounds from Corollaries 2 and 3, Corollary 1 gives the claimed lower bound on the spectral gap.

4 Other applications of decomposition

The key step to applying the decomposition theorem is finding an appropriate partition of the state space. In most examples a natural choice seems to be to cluster configurations of equal probability together so that the distribution for each of the restricted chains is uniform, or so that the restrictions share some essential feature which will make it easy to bound the mixing rate. In the example of Section 3, the state space is divided into subsets, each representing a partition of $n$ into $d$ parts. It followed that the vertices of the projection formed a $d$-dimensional simplex, where the Markov kernel was formed by connecting vertices which are neighbors in the simplex. We briefly outline two other recent applications of the decomposition theorem where we get other natural graphs for the projection. In the first case the graph defining the Markov kernel of the projection is one-dimensional, and in the second it is a hypercube.

4.1 Independent sets

Our first example is sampling independent sets of a graph according to the Gibbs measure. Recall that $\pi(I) = \gamma^{|I|}/Z_\gamma$, where $\gamma > 0$ is an input parameter and $Z_\gamma$ normalizes the distribution. There has been much activity in studying how to sample independent sets for various values of $\gamma$ using a simple, natural Markov chain based on inserting, deleting or exchanging vertices at each step. Works of Luby and Vigoda [9] and Dyer and Greenhill [5] imply that this chain is rapidly mixing if $\gamma \leq 2/(\Delta - 2)$, where $\Delta$ is the maximum number of neighbors of any vertex in $G$. It was shown by Borgs et al. [1] that this chain is slowly mixing on some graphs for $\gamma$ sufficiently large. Alternatively, Madras and Randall [12] showed that this algorithm is fast for every value of $\gamma$, provided we restrict the state space to independent sets of size at most $n^* = |V|/(2(\Delta + 1))$. This relies heavily on earlier work of Dyer and Greenhill [6] showing that a Markov chain defined by exchanges is rapidly mixing on the set of independent sets of fixed size $k$, whenever $k \leq n^*$.

The decomposition here is quite natural: we partition $\mathcal{I}$, the set of independent sets of $G$, into pieces $\mathcal{I}_k$ according to their size. The restrictions arising from this partition permit exchanges, but disallow insertions or deletions, as these exit the state space of the restricted Markov chain. These are now exactly the Markov chains proven to be rapidly mixing by Dyer and Greenhill (but with slightly greater self-loop probabilities) and hence can also be seen to be rapidly mixing. Consequently, we need only bound the mixing rate of the projection. Here the projection is a one-dimensional graph on $\{0, \ldots, n^*\}$. Further calculation determines that the stationary distribution $\bar{\pi}(k)$ of the projection is unimodal in $k$, implying that the projection is also rapidly mixing. We refer the reader to [12] for details.

4.2 The swapping algorithm

To further demonstrate the versatility and potential of the decomposition method, we review an application of a very different flavor. In recent work, Madras and Zheng [13] show that the swapping algorithm is rapidly mixing for the mean-field Ising model (i.e., the Ising model on the complete graph), as well as for a simpler toy model. Given a graph $G = (V, E)$, the ferromagnetic Ising model consists of a graph $G$ whose vertices represent particles and whose edges represent interactions between particles. A spin configuration is an assignment of spins, either $+$ or $-$, to each of the vertices, where adjacent vertices prefer to have the same spin. Let $J_{x,y} > 0$ be the interaction energy between vertices $x$ and $y$, where $(x, y) \in E$. Let $\sigma \in \Omega = \{+, -\}^V$ be any assignment of $\{+, -\}$ to each of the vertices. The Hamiltonian of $\sigma$ is
$$H(\sigma) = \sum_{(x, y) \in E} J_{x,y}\, 1_{\sigma_x \neq \sigma_y},$$
where $1_A$ is the indicator function which is 1 when the event $A$ is true and 0 otherwise. The probability that the Ising spin state is $\sigma$ is given by the Gibbs distribution:
$$\pi(\sigma) = \frac{e^{-\beta H(\sigma)}}{Z(G)},$$

where $\beta$ is inverse temperature and $Z(G) = \sum_\sigma e^{-\beta H(\sigma)}$. It is well known that at sufficiently low temperatures the distribution is bimodal (as a function of the number of vertices assigned $+$), and any local dynamics will be slowly mixing. The simplest local dynamics, Glauber dynamics, is the Markov chain defined by choosing a vertex at random and flipping the spin at that vertex with the appropriate Metropolis probability.

Simulated tempering, which varies the temperature during the runtime of an algorithm, appears to be a useful way to circumvent this difficulty [8, 14]. The chain moves between $m$ close temperatures that interpolate between the temperature of interest and a very high temperature, where the local dynamics converges rapidly. The swapping algorithm is a variant of tempering, introduced by Geyer [7], where the state space is $\Omega^m$ and each configuration $S = (\sigma_1, \ldots, \sigma_m) \in \Omega^m$ consists of one sample at each temperature. The stationary distribution is $\pi(S) = \prod_{i=1}^m \pi_i(\sigma_i)$, where $\pi_i$ is the distribution at temperature $i$. The transitions of the swapping algorithm consist of two types of moves: with probability $1/2$ choose $i \in_u [m]$ and perform a local update of $\sigma_i$ (using Glauber dynamics at this fixed temperature); with probability $1/2$ choose $i \in_u [m-1]$ and move from $S = (\sigma_1, \ldots, \sigma_m)$ to $S' = (\sigma_1, \ldots, \sigma_{i+1}, \sigma_i, \ldots, \sigma_m)$, i.e., swap configurations $i$ and $i+1$, with the appropriate Metropolis probability. The idea behind the swapping algorithm, and other versions of tempering, is that, in the long run, the trajectory of each Ising configuration will spend equal time at each temperature, potentially greatly speeding up mixing. Experimentally, this appears to overcome obstacles to sampling at low temperatures.

Madras and Zheng show that the swapping algorithm is rapidly mixing on the mean-field Ising model at all temperatures. Let $\Omega^+ \subseteq \Omega$ be the set of configurations that are predominantly $+$, and similarly $\Omega^-$. Define the trace of a configuration $S$ to be $\mathrm{Tr}(S) = (v_1, \ldots, v_m) \in \{+, -\}^m$, where $v_i = +$ if $\sigma_i \in \Omega^+$ and $v_i = -$ if $\sigma_i \in \Omega^-$. The analysis of the swapping chain uses decomposition by partitioning the state space according to the trace. The projection for this decomposition is the $m$-dimensional hypercube, where each vertex represents a distinct trace. The stationary distribution is uniform on the hypercube because, at each temperature, the likelihoods of being in $\Omega^+$ and $\Omega^-$ are equal due to symmetry. Relying on the comparison method, it suffices to analyze the following simplification of the projection: starting at any vertex $V = (v_1, \ldots, v_m)$ in the hypercube, pick $i \in_u [m]$. If $i = 1$, then with probability $1/2$ flip the first bit; if $i > 1$, then with probability $1/2$ transpose $v_{i-1}$ and $v_i$; and with all remaining probability do nothing. This chain is easily seen to be rapidly mixing on the hypercube and can be used to infer a bound on the spectral gap of the projection chain.

To analyze the restrictions, Madras and Zheng first prove that the simple, single-flip dynamics on $\Omega^+$ is rapidly mixing at any temperature; this result is analytical, relying on the fact that the underlying graph is complete for the mean-field model. Using simple facts about Markov chains on product spaces, it can be shown that each of the restricted chains must also be rapidly mixing (even without including any swap moves). Once again decomposition completes the proof of rapid mixing, and we can conclude that the swapping algorithm is efficient on the complete graph.
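For concreteness, one transition of the swapping chain can be sketched as follows (Python; `glauber_step` and `hamiltonian` are assumed supplied for the model at hand, and the function names are ours):

```python
import math
import random

def swapping_step(S, betas, glauber_step, hamiltonian):
    """One move of the swapping chain on Omega^m.

    S     -- list of m configurations, one per temperature
    betas -- the m inverse temperatures
    """
    m = len(S)
    if random.random() < 0.5:
        i = random.randrange(m)      # local update at level i
        S[i] = glauber_step(S[i], betas[i])
    else:
        i = random.randrange(m - 1)  # propose swapping levels i and i+1
        # Metropolis ratio pi_i(s_{i+1}) pi_{i+1}(s_i) / (pi_i(s_i) pi_{i+1}(s_{i+1}))
        # = exp((beta_i - beta_{i+1}) * (H(s_i) - H(s_{i+1})))
        dH = hamiltonian(S[i]) - hamiltonian(S[i + 1])
        if random.random() < min(1.0, math.exp((betas[i] - betas[i + 1]) * dH)):
            S[i], S[i + 1] = S[i + 1], S[i]
    return S
```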

Acknowledgements

I wish to thank Russell Martin for many useful discussions, especially regarding the technical details of Section 3, and Dimitris Achlioptas for improving an earlier draft of this paper.

References

1. C. Borgs, J.T. Chayes, A. Frieze, J.H. Kim, P. Tetali, E. Vigoda, and V.H. Vu. Torpid mixing of some MCMC algorithms in statistical physics. Proc. 40th IEEE Symposium on Foundations of Computer Science.
2. R. Bubley and M. Dyer. Faster random generation of linear extensions. Discrete Mathematics, 201:81-88.
3. R. Bubley and M. Dyer. Path coupling: A technique for proving rapid mixing in Markov chains. Proc. 38th Annual IEEE Symposium on Foundations of Computer Science.
4. P. Diaconis and L. Saloff-Coste. Comparison theorems for reversible Markov chains. Annals of Applied Probability, 3.
5. M. Dyer and C. Greenhill. On Markov chains for independent sets. Journal of Algorithms, 35:17-49.
6. M. Dyer and C. Greenhill. A more rapidly mixing Markov chain for graph colorings. Random Structures and Algorithms, 13.
7. C.J. Geyer. Markov chain Monte Carlo maximum likelihood. Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface (E.M. Keramidas, ed.), Interface Foundation, Fairfax Station.
8. C.J. Geyer and E.A. Thompson. Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Amer. Statist. Assoc.
9. M. Luby and E. Vigoda. Fast convergence of the Glauber dynamics for sampling independent sets. Random Structures and Algorithms, 15.
10. N. Madras and M. Piccioni. Importance sampling for families of distributions. Ann. Appl. Probab., 9.
11. N. Madras and D. Randall. Factoring graphs to bound mixing rates. Proc. 37th Annual IEEE Symposium on Foundations of Computer Science.
12. N. Madras and D. Randall. Markov chain decomposition for convergence rate analysis. Annals of Applied Probability (to appear).
13. N. Madras and Z. Zheng. On the swapping algorithm. Preprint.
14. E. Marinari and G. Parisi. Simulated tempering: a new Monte Carlo scheme. Europhys. Lett.
15. R.A. Martin and D. Randall. Sampling adsorbing staircase walks using a new Markov chain decomposition method. Proceedings of the 41st Symposium on the Foundations of Computer Science (FOCS 2000).
16. R.A. Martin and D. Randall. Disjoint decomposition with applications to sampling circuits in some Cayley graphs. Preprint.
17. D. Randall and P. Tetali. Analyzing Glauber dynamics by comparison of Markov chains. Journal of Mathematical Physics, 41.
18. A.J. Sinclair. Algorithms for random generation & counting: a Markov chain approach. Birkhäuser, Boston, 1993.


More information

DIFFERENTIAL POSETS SIMON RUBINSTEIN-SALZEDO

DIFFERENTIAL POSETS SIMON RUBINSTEIN-SALZEDO DIFFERENTIAL POSETS SIMON RUBINSTEIN-SALZEDO Abstract. In this paper, we give a sampling of the theory of differential posets, including various topics that excited me. Most of the material is taken from

More information

Definition A finite Markov chain is a memoryless homogeneous discrete stochastic process with a finite number of states.

Definition A finite Markov chain is a memoryless homogeneous discrete stochastic process with a finite number of states. Chapter 8 Finite Markov Chains A discrete system is characterized by a set V of states and transitions between the states. V is referred to as the state space. We think of the transitions as occurring

More information

Learning convex bodies is hard

Learning convex bodies is hard Learning convex bodies is hard Navin Goyal Microsoft Research India navingo@microsoft.com Luis Rademacher Georgia Tech lrademac@cc.gatech.edu Abstract We show that learning a convex body in R d, given

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Testing the Expansion of a Graph

Testing the Expansion of a Graph Testing the Expansion of a Graph Asaf Nachmias Asaf Shapira Abstract We study the problem of testing the expansion of graphs with bounded degree d in sublinear time. A graph is said to be an α- expander

More information

Analysis of Markov chain algorithms on spanning trees, rooted forests, and connected subgraphs

Analysis of Markov chain algorithms on spanning trees, rooted forests, and connected subgraphs Analysis of Markov chain algorithms on spanning trees, rooted forests, and connected subgraphs Johannes Fehrenbach and Ludger Rüschendorf University of Freiburg Abstract In this paper we analyse a natural

More information

MONOTONE COUPLING AND THE ISING MODEL

MONOTONE COUPLING AND THE ISING MODEL MONOTONE COUPLING AND THE ISING MODEL 1. PERFECT MATCHING IN BIPARTITE GRAPHS Definition 1. A bipartite graph is a graph G = (V, E) whose vertex set V can be partitioned into two disjoint set V I, V O

More information

Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines

Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines Christopher Tosh Department of Computer Science and Engineering University of California, San Diego ctosh@cs.ucsd.edu Abstract The

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Two-coloring random hypergraphs

Two-coloring random hypergraphs Two-coloring random hypergraphs Dimitris Achlioptas Jeong Han Kim Michael Krivelevich Prasad Tetali December 17, 1999 Technical Report MSR-TR-99-99 Microsoft Research Microsoft Corporation One Microsoft

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Markov Chains on Orbits of Permutation Groups

Markov Chains on Orbits of Permutation Groups Markov Chains on Orbits of Permutation Groups Mathias Niepert Universität Mannheim mniepert@gmail.com Abstract We present a novel approach to detecting and utilizing symmetries in probabilistic graphical

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018 Graphical Models Markov Chain Monte Carlo Inference Siamak Ravanbakhsh Winter 2018 Learning objectives Markov chains the idea behind Markov Chain Monte Carlo (MCMC) two important examples: Gibbs sampling

More information

Exact Goodness-of-Fit Testing for the Ising Model

Exact Goodness-of-Fit Testing for the Ising Model Exact Goodness-of-Fit Testing for the Ising Model The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher

More information

Flip dynamics on canonical cut and project tilings

Flip dynamics on canonical cut and project tilings Flip dynamics on canonical cut and project tilings Thomas Fernique CNRS & Univ. Paris 13 M2 Pavages ENS Lyon November 5, 2015 Outline 1 Random tilings 2 Random sampling 3 Mixing time 4 Slow cooling Outline

More information

Proclaiming Dictators and Juntas or Testing Boolean Formulae

Proclaiming Dictators and Juntas or Testing Boolean Formulae Proclaiming Dictators and Juntas or Testing Boolean Formulae Michal Parnas The Academic College of Tel-Aviv-Yaffo Tel-Aviv, ISRAEL michalp@mta.ac.il Dana Ron Department of EE Systems Tel-Aviv University

More information

Algorithmic Problems Arising in Posets and Permutations

Algorithmic Problems Arising in Posets and Permutations Algorithmic Problems Arising in Posets and Permutations A Thesis Submitted to the Faculty in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science by Shahrzad

More information

Random Dyadic Tilings of the Unit Square

Random Dyadic Tilings of the Unit Square Random Dyadic Tilings of the Unit Square Svante Janson, Dana Randall and Joel Spencer June 24, 2002 Abstract A dyadic rectangle is a set of the form R = [a2 s, (a+1)2 s ] [b2 t, (b+1)2 t ], where s and

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

The Monte Carlo Method

The Monte Carlo Method The Monte Carlo Method Example: estimate the value of π. Choose X and Y independently and uniformly at random in [0, 1]. Let Pr(Z = 1) = π 4. 4E[Z] = π. { 1 if X Z = 2 + Y 2 1, 0 otherwise, Let Z 1,...,

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information