Markov Chains, Conductance and Mixing Time

Lecturer: Santosh Vempala
Scribe: Kevin Zatloukal

Markov Chains

A random walk is a Markov chain. As before, we have a state space $\Omega$. We also have a set $\mathcal{A}$ of subsets of $\Omega$ that forms a $\sigma$-algebra, i.e., $\mathcal{A}$ is closed under complements and countable unions. It is an immediate consequence that $\mathcal{A}$ is closed under countable intersections and that $\emptyset, \Omega \in \mathcal{A}$. For any $u \in \Omega$ and $A \in \mathcal{A}$, we have the one-step probability $P_u(A)$, which tells us the probability of being in $A$ after taking one step from $u$. Lastly, we have a starting distribution $Q_0$ on $\Omega$, which gives us a probability $Q_0(A)$ of starting in the set $A \in \mathcal{A}$.

With this setup, a Markov chain is a sequence of points $w_0, w_1, w_2, \ldots$ such that $\Pr(w_0 \in A) = Q_0(A)$ and
\[
\Pr(w_{i+1} \in A \mid w_0 = u_0, \ldots, w_i = u_i) = \Pr(w_{i+1} \in A \mid w_i = u_i) = P_{u_i}(A),
\]
for any $A \in \mathcal{A}$.

A distribution $Q$ is called stationary if, for all $A \in \mathcal{A}$,
\[
Q(A) = \int_\Omega P_u(A)\, dQ(u).
\]
In other words, $Q$ is stationary if the probability of being in $A$ is the same after one step. We also have a generalized version of the symmetry we saw above in the transition probabilities ($p_{xy} = p_{yx}$). The Markov chain is called time-reversible if, for all $A, B \in \mathcal{A}$,
\[
\Pr(w_{i+1} \in B \mid w_i \in A) = \Pr(w_{i+1} \in A \mid w_i \in B).
\]

Ball Walk

The following algorithm, called Ball-Walk, is a continuous random walk on a convex body $K$. In this case, the set corresponding to the neighborhood of $x$ is $B_\delta(x) = x + \delta B_n$, where $B_n$ is the unit ball in $\mathbb{R}^n$.

Algorithm Ball-Walk($\delta$):
1. Let $x$ be a starting point in $K$.
2. Repeat sufficiently many times: Choose a random $y \in B_\delta(x)$. If $y \in K$, set $x = y$.
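The loop above translates directly into code. The following is a minimal sketch, not part of the notes: `in_K` is an assumed membership oracle for $K$, and the unit square is used purely as an example body.

```python
import math
import random

def ball_walk(x, delta, steps, in_K, rng=random):
    """Run Ball-Walk(delta) from x for the given number of steps.

    in_K is a membership oracle for the convex body K (an assumption of
    this sketch); a proposal that lands outside K is rejected, and the
    walk stays where it is.
    """
    n = len(x)
    for _ in range(steps):
        # Uniform point in B_delta(x) = x + delta*B_n: a uniform
        # direction (normalized Gaussian) times a radius distributed
        # as delta * U**(1/n).
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(v * v for v in g))
        r = delta * rng.random() ** (1.0 / n)
        y = [xi + r * gi / norm for xi, gi in zip(x, g)]
        if in_K(y):
            x = y
    return x

# Example: K = the unit square, starting from its center.
in_cube = lambda p: all(0.0 <= c <= 1.0 for c in p)
random.seed(0)
x_final = ball_walk([0.5, 0.5], delta=0.2, steps=1000, in_K=in_cube)
```

The membership oracle is the only access to $K$ the walk needs, which is what makes the ball walk useful for sampling from convex bodies given only such an oracle.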
Here, the state space is all of the set, so $\Omega = K$. The $\sigma$-algebra $\mathcal{A}$ consists of the measurable subsets of $K$, as usual. We can define a density function for the probability of transitioning from $u \in K$ to $v$, provided that $u \neq v$:
\[
p(u, v) = \begin{cases} \dfrac{1}{\mathrm{Vol}(\delta B_n)} & \text{if } v \in K \cap B_\delta(u), \\[4pt] 0 & \text{otherwise.} \end{cases}
\]
The probability of staying at $u$ is
\[
P_u(\{u\}) = 1 - \frac{\mathrm{Vol}(K \cap B_\delta(u))}{\mathrm{Vol}(\delta B_n)}.
\]
Putting these together, the probability of transitioning from $u$ to any measurable subset $A$ is
\[
P_u(A) = \begin{cases} \dfrac{\mathrm{Vol}(A \cap K \cap B_\delta(u))}{\mathrm{Vol}(\delta B_n)} + P_u(\{u\}) & \text{if } u \in A, \\[6pt] \dfrac{\mathrm{Vol}(A \cap K \cap B_\delta(u))}{\mathrm{Vol}(\delta B_n)} & \text{if } u \notin A. \end{cases}
\]
Since the density function is symmetric, it is easy to see that the uniform distribution is stationary. We can also verify this directly. We can compute the probability of being in $A$ after one step by adding up the probability that we transition to $u$ for each $u \in A$; by the symmetry $p(u,v) = p(v,u)$, the flow into $A$ equals the flow out of each point of $A$. Thus, after one step from the uniform distribution $dQ(u) = \frac{du}{\mathrm{Vol}(K)}$, the probability of being in $A$ is
\begin{align*}
\int_A P_u(\{u\})\, dQ(u) + \int_A \int_{v \in K \cap B_\delta(u)} p(u,v)\, dv\, dQ(u)
&= \int_A \left(1 - \frac{\mathrm{Vol}(K \cap B_\delta(u))}{\mathrm{Vol}(\delta B_n)}\right) dQ(u) + \int_A \frac{\mathrm{Vol}(K \cap B_\delta(u))}{\mathrm{Vol}(\delta B_n)}\, dQ(u) \\
&= \int_A dQ(u) = Q(A).
\end{align*}
Another way to look at the one-step probability distributions is in terms of flow. For any subset $A \in \mathcal{A}$, the ergodic flow $\Phi(A)$ is the probability of transitioning from $A$ to $\Omega \setminus A$, i.e.,
\[
\Phi(A) = \int_A P_u(\Omega \setminus A)\, dQ(u).
\]
Intuitively, in a stationary distribution, we should have $\Phi(A) = \Phi(\Omega \setminus A)$. In fact, this is a characterization of the stationary distribution.
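The holding probability $P_u(\{u\})$ can be estimated by simulation. The sketch below is my own illustration (the unit square and the corner point are assumptions): it samples proposals from $B_\delta(u)$ and counts rejections.

```python
import math
import random

def stay_probability(u, delta, in_K, samples=200_000, seed=1):
    """Monte Carlo estimate of the holding probability
    P_u({u}) = 1 - Vol(K intersect B_delta(u)) / Vol(delta B_n)."""
    rng = random.Random(seed)
    n = len(u)
    rejected = 0
    for _ in range(samples):
        # Uniform proposal in the ball B_delta(u).
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(v * v for v in g))
        r = delta * rng.random() ** (1.0 / n)
        y = [ui + r * gi / norm for ui, gi in zip(u, g)]
        if not in_K(y):
            rejected += 1
    return rejected / samples

# At a corner of the unit square, only a quarter of the delta-ball lies
# inside K, so the estimate should be close to 1 - 1/4 = 3/4.
in_square = lambda p: all(0.0 <= c <= 1.0 for c in p)
p_stay = stay_probability([0.0, 0.0], delta=0.1, in_K=in_square)
```

Near corners the walk rejects most proposals, which is exactly why laziness and a small $\delta$ matter for the analysis below.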
Theorem 1. A distribution $Q$ is stationary iff $\Phi(A) = \Phi(\Omega \setminus A)$ for all $A \in \mathcal{A}$.

Proof. Consider their difference:
\begin{align*}
\Phi(A) - \Phi(\Omega \setminus A) &= \int_A P_u(\Omega \setminus A)\, dQ(u) - \int_{\Omega \setminus A} P_u(A)\, dQ(u) \\
&= \int_A (1 - P_u(A))\, dQ(u) - \int_{\Omega \setminus A} P_u(A)\, dQ(u) \\
&= Q(A) - \int_\Omega P_u(A)\, dQ(u).
\end{align*}
The latter quantity is the probability of being in $A$ after one step, so $\Phi(A) - \Phi(\Omega \setminus A) = 0$ for all $A$ iff $Q$ is stationary. $\Box$

Now, let us return to the question of how quickly (and whether) this random walk converges to the stationary distribution. As before, it is convenient to make the walk lazy by giving it probability $\frac12$ of staying in place instead of taking a step. This means that $P_u(\{u\}) \geq \frac12$ and, more generally, for any $A \in \mathcal{A}$, $P_u(A) \geq \frac12$ if $u \in A$ and $P_u(A) \leq \frac12$ if $u \notin A$.

Also as before, the notion of conductance will be useful. In this general context, we define the conductance as
\[
\phi = \min_{A \in \mathcal{A},\; 0 < Q(A) \leq \frac12} \frac{\Phi(A)}{Q(A)},
\]
where $Q$ is the stationary distribution. The ratio $\Phi(A)/Q(A)$ is the probability of transitioning to $\Omega \setminus A$ given that we start in $A$ (under the stationary distribution).

We must also generalize the notion of the distance of a distribution from stationary. The straightforward generalization of our previous definition is
\[
|Q_t - Q| = \int_\Omega |dQ_t(u) - dQ(u)| = \int_{Q_t^+} (dQ_t(u) - dQ(u)) + \int_{Q_t^-} (dQ(u) - dQ_t(u)),
\]
where $Q_t^+ = \{u \in \Omega : dQ_t(u) \geq dQ(u)\}$ and $Q_t^- = \{u \in \Omega : dQ_t(u) < dQ(u)\}$. Since we know that
\[
1 = \int_{Q_t^+} dQ_t(u) + \int_{Q_t^-} dQ_t(u) = \int_{Q_t^+} dQ(u) + \int_{Q_t^-} dQ(u),
\]
we can rearrange to get
\[
\int_{Q_t^+} (dQ_t(u) - dQ(u)) = \int_{Q_t^-} (dQ(u) - dQ_t(u)),
\]
which shows that the two terms from above are equal. Therefore, we can just as well define
\[
d(Q_t, Q) = \int_{Q_t^+} (dQ_t(u) - dQ(u)) = \sup_{A \in \mathcal{A}} \big(Q_t(A) - Q(A)\big).
\]
We will take this (half of $|Q_t - Q|$ above) as our definition of the distance $d(Q_t, Q)$ between the distributions $Q_t$ and $Q$.

We would like to know how large $t$ must be before $d(Q_t, Q) \leq \frac12\, d(Q_0, Q)$. This is one definition of the mixing time. It would be nice to get a bound by showing that $Q_t(A) - Q(A)$ drops quickly for every $A$. This is not the case. Instead, we can look at $\sup_{Q(A)=x} (Q_t(A) - Q(A))$ for each fixed $x \in [0,1]$. A bound for every $x$ would imply the bound we need. To prove a bound on this by induction, we will define it in a formally weaker way. Our upper bound is
\[
h_t(x) = \sup_{g \in F_x} \int_\Omega g(u)\, (dQ_t(u) - dQ(u)) = \sup_{g \in F_x} \int_\Omega g(u)\, dQ_t(u) - x,
\]
where $F_x$ is the set of functions
\[
F_x = \left\{ g : \Omega \to [0,1] \;:\; \int_\Omega g(u)\, dQ(u) = x \right\}.
\]
It is clear that $h_t(x)$ is an upper bound on $\sup_{Q(A)=x} (Q_t(A) - Q(A))$, since $g$ could be the characteristic function of $A$. The following lemma shows that these two quantities are in fact equal as long as $Q$ is atom-free, i.e., there is no $x \in \Omega$ such that $Q(\{x\}) > 0$.

Lemma 2. If $Q$ is atom-free, then $h_t(x) = \sup_{Q(A)=x} (Q_t(A) - Q(A))$.

Proof (Sketch). Consider the set of points $X$ that maximize $dQ_t/dQ$, their "value density." (This step would not be possible if $Q$ had an atom.) Put $g(x) = 1$ for all $x \in X$. These points give the maximum payoff per unit of weight from $g$, so it is optimal to put as much weight on them as possible. Now, find the set of maximizing points in $\Omega \setminus X$, and set $g(x) = 1$ at these points as well. Continue until the set of points with $g(x) = 1$ has measure $x$. $\Box$
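For a finite chain, the greedy procedure in this proof sketch can be run directly: sort the states by value density $q_t(i)/q(i)$, set $g = 1$ on the best states until measure $x$ is used up, and split the last state fractionally (the finite-state analogue of handling an atom). The distributions below are made-up numbers, chosen only for illustration.

```python
def h(qt, q, x):
    """h_t(x) for a finite chain, via the greedy of Lemma 2: fill g
    greedily on states of highest value density qt[i]/q[i], splitting
    the last state fractionally if needed."""
    order = sorted(range(len(q)), key=lambda i: qt[i] / q[i], reverse=True)
    total, budget = 0.0, x
    for i in order:
        w = min(q[i], budget)                  # measure of state i used
        total += (w / q[i]) * (qt[i] - q[i])   # contribution of g on state i
        budget -= w
        if budget <= 1e-15:
            break
    return total

q = [0.25, 0.25, 0.25, 0.25]    # stationary distribution (uniform)
qt = [0.55, 0.25, 0.05, 0.15]   # some distribution at time t
vals = [h(qt, q, x) for x in (0.25, 0.5, 0.75)]
# h_t(x) dominates Q_t(A) - Q(A) over all sets with Q(A) = x.
```

With these numbers, $h_t(0.25) = h_t(0.5) = 0.3$ and $h_t(0.75) = 0.2$, which also illustrates the concavity proved in the next lemma.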
In fact, this argument shows that when $Q$ is atom-free, we can find a set $A$ that achieves the supremum. When $Q$ has atoms, we can include the high-value atoms and use this procedure on the non-atomic part of $\Omega$; however, we may need to include a fraction of one atom to achieve the supremum. This shows that $h_t(x)$ can be achieved by a function $g$ that is $0/1$-valued everywhere except at most one point.

Another important fact about $h_t$ is the following.

Lemma 3. The function $h_t$ is concave.

Proof. Let $g_1 \in F_x$ and $g_2 \in F_y$. Then, we can see that $\alpha g_1 + (1-\alpha) g_2 \in F_{\alpha x + (1-\alpha) y}$, which implies that
\begin{align*}
h_t(\alpha x + (1-\alpha) y) &\geq \int_\Omega \big(\alpha g_1(u) + (1-\alpha) g_2(u)\big)\, dQ_t(u) - (\alpha x + (1-\alpha) y) \\
&= \alpha \left( \int_\Omega g_1(u)\, dQ_t(u) - x \right) + (1-\alpha) \left( \int_\Omega g_2(u)\, dQ_t(u) - y \right).
\end{align*}
Since this holds for any such $g_1$ and $g_2$, it must hold if we take the sup, which gives us
\[
h_t(\alpha x + (1-\alpha) y) \geq \alpha h_t(x) + (1-\alpha) h_t(y).
\]
Thus, $h_t$ is concave. $\Box$

Now, we come to the main lemma, relating $h_t$ to $h_{t-1}$. This will allow us to put a bound on $d(Q_t, Q)$.

Lemma 4. Let $Q$ be atom-free, and let $y = \min\{x, 1-x\}$. Then
\[
h_t(x) \leq \tfrac12 h_{t-1}(x - 2\phi y) + \tfrac12 h_{t-1}(x + 2\phi y).
\]

Proof. Assume that $y = x$, i.e., $x \leq \frac12$. The other case is similar. We will construct two functions, $g_1$ and $g_2$, and use these to bound $h_t(x)$. Let $A \in \mathcal{A}$ be a subset to be chosen later with $Q(A) = x$. Let
\[
g_1(u) = \begin{cases} 2 P_u(A) - 1 & \text{if } u \in A, \\ 0 & \text{if } u \notin A, \end{cases}
\qquad
g_2(u) = \begin{cases} 1 & \text{if } u \in A, \\ 2 P_u(A) & \text{if } u \notin A. \end{cases}
\]
(Since the walk is lazy, $P_u(A) \geq \frac12$ when $u \in A$ and $P_u(A) \leq \frac12$ when $u \notin A$, so both functions take values in $[0,1]$.) First, note that $\frac12(g_1 + g_2)(u) = P_u(A)$ for all $u \in \Omega$, which means that
\[
\tfrac12 \int_\Omega g_1(u)\, dQ_{t-1}(u) + \tfrac12 \int_\Omega g_2(u)\, dQ_{t-1}(u) = \int_\Omega P_u(A)\, dQ_{t-1}(u) = Q_t(A).
\]
On the other hand,
\[
\tfrac12 \int_\Omega g_1(u)\, dQ(u) + \tfrac12 \int_\Omega g_2(u)\, dQ(u) = \int_\Omega P_u(A)\, dQ(u) = Q(A)
\]
since $Q$ is stationary. Hence $\frac12(g_1 + g_2) \in F_x$. Putting these together, we have
\[
\tfrac12 \int_\Omega g_1(u)\, (dQ_{t-1}(u) - dQ(u)) + \tfrac12 \int_\Omega g_2(u)\, (dQ_{t-1}(u) - dQ(u)) = Q_t(A) - Q(A).
\]
Since $Q$ is atom-free, there is a subset $A \in \mathcal{A}$ with $Q(A) = x$ such that
\begin{align*}
h_t(x) = Q_t(A) - Q(A) &= \tfrac12 \int_\Omega g_1(u)\, (dQ_{t-1}(u) - dQ(u)) + \tfrac12 \int_\Omega g_2(u)\, (dQ_{t-1}(u) - dQ(u)) \\
&\leq \tfrac12 h_{t-1}(x_1) + \tfrac12 h_{t-1}(x_2),
\end{align*}
where $x_1 = \int_\Omega g_1(u)\, dQ(u)$ and $x_2 = \int_\Omega g_2(u)\, dQ(u)$. Now, we know that $x_1 + x_2 = 2x$. Specifically, we can see that
\begin{align*}
x_1 = \int_A (2 P_u(A) - 1)\, dQ(u) &= \int_A (1 - 2 P_u(\Omega \setminus A))\, dQ(u) \\
&= x - 2 \int_A P_u(\Omega \setminus A)\, dQ(u) = x - 2\Phi(A) \leq x - 2\phi x = x(1 - 2\phi).
\end{align*}
This implies that $x_2 \geq x(1 + 2\phi)$. Since $h_{t-1}$ is concave, the chord from $x_1$ to $x_2$ on $h_{t-1}$ lies below the chord from $x(1-2\phi)$ to $x(1+2\phi)$. (See Figure 1.) Therefore,
\[
h_t(x) \leq \tfrac12 h_{t-1}(x(1-2\phi)) + \tfrac12 h_{t-1}(x(1+2\phi)). \qquad \Box
\]
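The last step of this proof uses only concavity: a symmetric chord with wider endpoints lies below one with narrower endpoints, and both lie below the function at their common midpoint. A quick numeric illustration, with $\sqrt{\cdot}$ standing in as an arbitrary concave function (the specific numbers are arbitrary):

```python
import math

# For a concave h and symmetric pairs around x with
# x1 <= a <= x <= b <= x2 and x1 + x2 = a + b = 2x, the midpoint value
# of the wider chord is at most that of the narrower chord, which in
# turn is at most h(x) itself.
h = math.sqrt          # arbitrary concave function, for illustration
x, phi = 0.3, 0.1
x1, x2 = x * (1 - 3 * phi), x * (1 + 3 * phi)   # wider symmetric pair
a, b = x * (1 - 2 * phi), x * (1 + 2 * phi)     # narrower symmetric pair
wide = 0.5 * (h(x1) + h(x2))
narrow = 0.5 * (h(a) + h(b))
```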
Figure 1: The value of $h_t(x)$ lies below the line between $h_{t-1}(x_1)$ and $h_{t-1}(x_2)$. Since $x_1 \leq x(1-2\phi) \leq x \leq x(1+2\phi) \leq x_2$ and $h_{t-1}$ is concave, this implies that $h_t(x)$ lies below the line between $h_{t-1}(x(1-2\phi))$ and $h_{t-1}(x(1+2\phi))$.

Given some information about $Q_0$, this lemma allows us to put bounds on the rate of convergence to the stationary distribution.

Theorem 5. Let $0 \leq s \leq \frac12$, and let $C_0$ and $C_1$ be such that $h_0(x) \leq C_0 + C_1 \min\{\sqrt{x - s}, \sqrt{1 - x - s}\}$. Then
\[
h_t(x) \leq C_0 + C_1 \min\{\sqrt{x - s}, \sqrt{1 - x - s}\} \left(1 - \frac{\phi^2}{2}\right)^t.
\]

Proof. We argue by induction on $t$, for $s = 0$. The inequality is true for $t = 0$ by our hypothesis. Now, suppose that $x \leq \frac12$ and the inequality holds for all values less than $t$. From the lemma, we know that
\begin{align*}
h_t(x) &\leq \tfrac12 h_{t-1}(x(1-2\phi)) + \tfrac12 h_{t-1}(x(1+2\phi)) \\
&\leq C_0 + \frac{C_1}{2}\left(\sqrt{x(1-2\phi)} + \sqrt{x(1+2\phi)}\right)\left(1 - \frac{\phi^2}{2}\right)^{t-1} \\
&= C_0 + \frac{C_1}{2}\,\sqrt{x}\left(\sqrt{1-2\phi} + \sqrt{1+2\phi}\right)\left(1 - \frac{\phi^2}{2}\right)^{t-1} \\
&\leq C_0 + C_1 \sqrt{x}\left(1 - \frac{\phi^2}{2}\right)^t,
\end{align*}
provided that
\[
\frac{\sqrt{1-2\phi} + \sqrt{1+2\phi}}{2} \leq 1 - \frac{\phi^2}{2}.
\]
The latter inequality can be verified by squaring both sides, rearranging, and squaring again. $\Box$
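The inequality at the end of the proof can also be checked numerically. The grid of conductance values below is arbitrary; $\phi$ always lies in $(0, \frac12]$.

```python
import math

# The inequality needed at the end of the proof of Theorem 5:
#   (sqrt(1 - 2*phi) + sqrt(1 + 2*phi)) / 2  <=  1 - phi**2 / 2,
# checked on a grid of conductance values in (0, 1/2].
ok = all(
    (math.sqrt(1 - 2 * p) + math.sqrt(1 + 2 * p)) / 2 <= 1 - p ** 2 / 2
    for p in (k / 1000 for k in range(1, 501))
)
```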
Corollary 6. If $M = \sup_{A \in \mathcal{A}} Q_0(A)/Q(A)$, then we have
\[
d(Q_t, Q) \leq \sqrt{M} \left(1 - \frac{\phi^2}{2}\right)^t.
\]

Proof. By the definition of $M$, we know that $h_0(x) \leq \min\{Mx, 1\} - x \leq \min\{Mx, 1\}$. Next, we will show that $\min\{Mx, 1\} \leq \sqrt{Mx}$. If $Mx = \min\{Mx, 1\}$, then $Mx \leq 1$, which implies that $Mx \leq \sqrt{Mx}$. If $1 = \min\{Mx, 1\}$, then $Mx \geq 1$, which implies that $1 \leq \sqrt{Mx}$. So we have shown that
\[
h_0(x) \leq \min\{Mx, 1\} \leq \sqrt{Mx} = \sqrt{M}\sqrt{x}
\]
(and similarly $h_0(x) \leq 1 - x \leq \sqrt{M}\sqrt{1-x}$). Thus, by the last theorem (with $s = 0$, $C_0 = 0$, and $C_1 = \sqrt{M}$), we know that
\[
d(Q_t, Q) \leq \max_x h_t(x) \leq \sqrt{M}\left(1 - \frac{\phi^2}{2}\right)^t. \qquad \Box
\]
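The corollary can be sanity-checked on a small discrete chain. The sketch below (my own example, not from the notes) uses the lazy walk on a 4-cycle started at a single state: it computes the conductance by brute force over all subsets, and verifies that $d(Q_t, Q)$ stays below $\sqrt{M}\,(1 - \phi^2/2)^t$.

```python
from itertools import combinations

# Lazy random walk on a 4-cycle: stay with probability 1/2, otherwise
# move to one of the two neighbors.
n = 4
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    P[i][i] = 0.5
    P[i][(i - 1) % n] = 0.25
    P[i][(i + 1) % n] = 0.25
Q = [0.25] * n   # uniform is stationary, since P is symmetric

# Conductance phi by brute force over all sets A with Q(A) <= 1/2.
flow = lambda A: sum(Q[i] * P[i][j] for i in A
                     for j in range(n) if j not in A)   # ergodic flow
phi = min(flow(A) / (len(A) / n)
          for s in (1, 2) for A in combinations(range(n), s))

# Start at state 0, so M = sup_A Q_0(A)/Q(A) = 1/Q({0}) = 4.
M = 4.0
q = [1.0, 0.0, 0.0, 0.0]
bound_holds = True
for t in range(20):
    d = sum(max(a - b, 0.0) for a, b in zip(q, Q))   # d(Q_t, Q)
    if d > M ** 0.5 * (1 - phi ** 2 / 2) ** t + 1e-12:
        bound_holds = False
    q = [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]
```

For this chain the minimizing sets are pairs of adjacent states, giving $\phi = \frac14$, and the observed distance decays much faster than the conductance bound, as the bound is not tight for such a small chain.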