Chapter 6

Markov Chains

6.1 Existence and notation

Along with the discussion of martingales, we have introduced the concept of a discrete-time stochastic process. In this chapter we will study a particular class of such stochastic processes called Markov chains. Informally, a Markov chain is a discrete-time stochastic process for which, given the present state, the future and the past are independent. The formal definition is as follows:

Definition 6.1 Consider a probability space $(\Omega, \mathcal{F}, P)$ with filtration $(\mathcal{F}_n)_{n \ge 0}$ and let $(\mathcal{S}, \mathcal{B}(\mathcal{S}))$ be a standard Borel space. Let $(X_n)_{n \ge 0}$ be an $\mathcal{S}$-valued stochastic process adapted to $(\mathcal{F}_n)$. We call this process a Markov chain with state space $\mathcal{S}$ if for every $B \in \mathcal{B}(\mathcal{S})$ and every $n \ge 0$,

$$P(X_{n+1} \in B \mid \mathcal{F}_n) = P(X_{n+1} \in B \mid X_n), \quad P\text{-a.s.} \qquad (6.1)$$

Here and henceforth, $P(\cdot \mid X_n)$ abbreviates $P(\cdot \mid \sigma(X_n))$; the conditional distributions exist by an earlier theorem. The relation (6.1) expresses the fact that, in order to know the future, we only need to know the present state. In practical situations, in order to prove that (6.1) is true, we will typically have to calculate $P(X_{n+1} \in B \mid \mathcal{F}_n)$ and show that it depends only on $X_n$. Since $\sigma(X_n) \subset \mathcal{F}_n$, eq. (6.1) then follows by the "smaller always wins" principle for conditional expectations.

If we turn the above definition around, we can view (6.1) as a way to define the Markov chain. Indeed, suppose we know the distribution of $X_0$. Then (6.1) allows us to calculate the joint distribution of $(X_0, X_1)$. Similarly, if we know the distribution of $(X_0, \dots, X_n)$, the above lets us calculate that of $(X_0, \dots, X_{n+1})$. The object on the right-hand side of (6.1), when viewed as a measure-valued function of $X_n$, then falls into the following category:

Definition 6.2 A function $p \colon \mathcal{S} \times \mathcal{B}(\mathcal{S}) \to [0,1]$ is called a transition probability if

(1) $B \mapsto p(x, B)$ is a probability measure on $\mathcal{B}(\mathcal{S})$ for all $x \in \mathcal{S}$.

(2) $x \mapsto p(x, B)$ is $\mathcal{B}(\mathcal{S})$-measurable for all $B \in \mathcal{B}(\mathcal{S})$.

Our first item of concern is the existence of a Markov chain with given transition probabilities and initial distribution:

Theorem 6.3 Let $(p_n)_{n \ge 0}$ be a sequence of transition probabilities and let $\mu$ be a probability measure on $\mathcal{B}(\mathcal{S})$. Then there exists a unique probability measure $P_\mu$ on the measurable space $(\mathcal{S}^{\mathbb{N}_0}, \mathcal{B}(\mathcal{S}^{\mathbb{N}_0}))$, where $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$, such that $\omega \mapsto X_n(\omega) = \omega_n$ is a Markov chain with respect to the filtration $\mathcal{F}_n = \sigma(X_0, \dots, X_n)$ and such that for all $B \in \mathcal{B}(\mathcal{S})$,

$$P_\mu(X_0 \in B) = \mu(B) \qquad (6.2)$$

and

$$P_\mu(X_{n+1} \in B \mid X_n) = p_n(X_n, B), \quad P_\mu\text{-a.s.} \qquad (6.3)$$

for all $n \ge 0$. In other words, $(X_n)$ defined by the coordinate maps on $(\mathcal{S}^{\mathbb{N}_0}, \mathcal{B}(\mathcal{S}^{\mathbb{N}_0}), P_\mu)$ is a Markov chain with transition probabilities $(p_n)$ and initial distribution $\mu$.

Proof. This result is a direct consequence of Kolmogorov's Extension Theorem. Indeed, recall that $\mathcal{B}(\mathcal{S}^{\mathbb{N}_0})$ is the least $\sigma$-algebra containing all sets $A_0 \times \dots \times A_n \times \mathcal{S} \times \dots$ where $A_j \in \mathcal{B}(\mathcal{S})$. We will define a consistent family of probability measures on the measurable spaces $(\mathcal{S}^n, \mathcal{B}(\mathcal{S}^n))$ by putting

$$P^{(n)}_\mu(A_0 \times \dots \times A_n) = \int_{A_0} \mu(\mathrm{d}x_0) \int_{A_1} p_0(x_0, \mathrm{d}x_1) \cdots \int_{A_n} p_{n-1}(x_{n-1}, \mathrm{d}x_n), \qquad (6.4)$$

and extending this to $\mathcal{B}(\mathcal{S}^n)$. It is easy to see that these measures are consistent because if $A \in \mathcal{B}(\mathcal{S}^n)$ then

$$P^{(n+1)}_\mu(A \times \mathcal{S}) = P^{(n)}_\mu(A). \qquad (6.5)$$

By Kolmogorov's Extension Theorem, there exists a unique probability measure $P_\mu$ on the infinite product space $(\mathcal{S}^{\mathbb{N}_0}, \mathcal{B}(\mathcal{S}^{\mathbb{N}_0}))$ such that $P_\mu(A \times \mathcal{S} \times \dots) = P^{(n)}_\mu(A)$ for every $n \ge 0$ and every $A \in \mathcal{B}(\mathcal{S}^n)$.

It remains to show that the coordinate maps $X_n(\omega) = \omega_n$ define a Markov chain such that (6.2) and (6.3) hold true. The proof of (6.2) is easy:

$$P_\mu(X_0 \in B) = P_\mu(B \times \mathcal{S} \times \dots) = P^{(0)}_\mu(B) = \mu(B). \qquad (6.6)$$

In order to prove (6.3), we claim that for all $B \in \mathcal{B}(\mathcal{S})$ and all $A \in \mathcal{F}_n$,

$$E\bigl(1_{\{X_{n+1} \in B\}}\, 1_A\bigr) = E\bigl(p_n(X_n, B)\, 1_A\bigr). \qquad (6.7)$$

To this end we first note that, by interpreting both sides as probability measures on $\mathcal{B}(\mathcal{S}^{\mathbb{N}_0})$, it suffices to prove this just for $A \in \mathcal{F}_n$ of the form $A = A_0 \times \dots \times A_n \times \mathcal{S} \times \dots$. But for such $A$ we have

$$E\bigl(1_{\{X_{n+1} \in B\}}\, 1_A\bigr) = P^{(n+1)}_\mu(A_0 \times \dots \times A_n \times B), \qquad (6.8)$$

which, by inspection of (6.4) (one only has to integrate over the last coordinate), simply equals $E\bigl(1_A\, p_n(X_n, B)\bigr)$, as desired. Once (6.7) is proved, it remains to note that $p_n(X_n, B)$ is $\mathcal{F}_n$-measurable. Hence $p_n(X_n, B)$ is a version of $P_\mu(X_{n+1} \in B \mid \mathcal{F}_n)$. But $\sigma(X_n) \subset \mathcal{F}_n$ and since $p_n(X_n, B)$ is $\sigma(X_n)$-measurable, the "smaller always wins" principle implies that (6.1) holds. Thus $(X_n)$ is a Markov chain satisfying (6.2) and (6.3), as we desired to prove.

While the existence argument was carried out in a continuum setup, the remainder of this chapter will be specialized to the case when the $X_n$'s can take only a countable set of values with positive probability. Denoting the (countable) state space by $\mathcal{S}$, the transition probabilities $p_n(x, \mathrm{d}y)$ will become functions $p_n \colon \mathcal{S} \times \mathcal{S} \to [0,1]$ with the property

$$\sum_{y \in \mathcal{S}} p_n(x, y) = 1 \qquad (6.9)$$

for all $x \in \mathcal{S}$. (Clearly, $p_n(x, y)$ is an abbreviation of $p_n(x, \{y\})$.) We will call such a $p_n$ a stochastic matrix. Similarly, the initial distribution $\mu$ will become a function $\mu \colon \mathcal{S} \to [0,1]$ with the property

$$\sum_{x \in \mathcal{S}} \mu(x) = 1. \qquad (6.10)$$

(Again, $\mu(x)$ abbreviates $\mu(\{x\})$.)

The subindex $n$ on the transition matrix $p_n(x, y)$ reflects the possibility that a different transition matrix is used at each step. This would correspond to a time-inhomogeneous Markov chain. While this is sometimes a useful generalization, an overwhelming majority of the Markov chains that are ever considered are time-homogeneous. In light of Theorem 6.3, such Markov chains are then determined by two objects: the initial distribution $\mu$ and the transition matrix $p(x, y)$, satisfying (6.10) and (6.9), respectively. For the rest of this chapter we will focus on time-homogeneous Markov chains.

We finish with a remark on general notation. As used before, the object $P_\mu$ denotes the Markov chain with initial distribution $\mu$. If $\mu$ is the point mass at state $x \in \mathcal{S}$, then we denote the resulting measure by $P_x$.

6.2 Examples

We proceed by a list of examples of countable-state time-homogeneous Markov chains.

Example 6.4 (SRW starting at $x$) Let $\mathcal{S} = \mathbb{Z}^d$ and, using $x \sim y$ to denote that $x$ and $y$ are nearest neighbors on $\mathbb{Z}^d$, let

$$p(x, y) = \begin{cases} \frac{1}{2d}, & \text{if } x \sim y, \\ 0, & \text{otherwise.} \end{cases} \qquad (6.11)$$

Consider the measure $P_x$ generated from the initial distribution $\mu(y) = \delta_x(y)$ using the above transition matrix. As is easy to check, the resulting Markov chain is simply the simple random walk started at $x$.

Example 6.5 (Galton-Watson branching process) Consider i.i.d. random variables $(\xi_n)_{n \ge 1}$ taking values in $\mathbb{N} \cup \{0\}$ and define the stochastic matrix $p(n, m)$ by

$$p(n, m) = P(\xi_1 + \dots + \xi_n = m), \qquad n, m \ge 0. \qquad (6.12)$$

(Here we use the interpretation $p(0, m) = 0$ unless $m = 0$.) Let $\mathcal{S} = \mathbb{N} \cup \{0\}$ and let $P_1$ be the corresponding Markov chain started at $x = 1$. As is easy to verify, the result is exactly the Galton-Watson branching process introduced in Chapter 5.

Example 6.6 (Ehrenfest chain) In his study of convergence to equilibrium in thermodynamics (and perhaps in relation to the so-called Maxwell's demon), Ehrenfest introduced the following simple model: Consider two boxes with altogether $m$ labeled balls. At each time step we pick one ball at random and move it to the other box. To formulate the problem more precisely, we will only mark down the number of balls in one of the boxes. Thus the set of possible values, i.e., the state space, is simply $\mathcal{S} = \{0, 1, \dots, m\}$. To calculate the transition probabilities, we note that the number of balls always changes only by one. The resulting probabilities are thus

$$p(n, n-1) = \frac{n}{m} \quad \text{and} \quad p(n, n+1) = \frac{m-n}{m}. \qquad (6.13)$$

Note that this automatically assigns zero probability to moves out of the allowed range, namely below the state with no balls left in the box and above the state where all balls are in it. The initial distribution can be whatever is of interest.

Example 6.7 (Birth-death chain) A generalization of the Ehrenfest chain is the situation where we think of a population evolving in discrete time steps. At each time only one of three things can happen: either an individual is born, or one dies, or nothing happens. The state space of such a chain will be $\mathcal{S} = \mathbb{N} \cup \{0\}$. The transition probabilities will then be determined by three sequences $(\alpha_x)$, $(\beta_x)$ and $(\gamma_x)$ via

$$p(x, x+1) = \alpha_x, \qquad p(x, x-1) = \beta_x, \qquad p(x, x) = \gamma_x. \qquad (6.14)$$

Clearly, we need that $\alpha_x + \beta_x + \gamma_x = 1$ for all $x \in \mathcal{S}$ and, since the number of individuals is always non-negative, we also require that $\beta_0 = 0$.

Example 6.8 (Stochastic matrix) An abstract example of a Markov chain arises whenever we are given a square stochastic matrix. For instance,

$$p = \begin{pmatrix} 3/4 & 1/4 \\ 1/5 & 4/5 \end{pmatrix} \qquad (6.15)$$

defines a Markov chain on $\mathcal{S} = \{0, 1\}$ with the matrix elements corresponding to the values of $p(x, y)$.
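To make Example 6.8 concrete, here is a minimal simulation sketch (not part of the original notes), assuming Python with numpy: it encodes the matrix in (6.15), checks that each row sums to one, and samples a short trajectory. The helper name `simulate` is ours.

```python
import numpy as np

# The stochastic matrix from (6.15); each row p(x, .) must sum to one.
p = np.array([[3/4, 1/4],
              [1/5, 4/5]])
assert np.allclose(p.sum(axis=1), 1.0)

def simulate(p, x0, n_steps, seed=0):
    """Sample a trajectory X_0, ..., X_{n_steps} of the chain started at x0."""
    rng = np.random.default_rng(seed)
    x, path = x0, [x0]
    for _ in range(n_steps):
        x = rng.choice(len(p), p=p[x])   # draw X_{n+1} from the row p(x, .)
        path.append(x)
    return np.array(path)

print(simulate(p, x0=0, n_steps=10))
```

The same loop works for any finite stochastic matrix, which is all that the later finite-state examples require.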

Example 6.9 (Random walk on a graph) Consider a graph $G = (V, E)$, where $V$ is the countable vertex set and $E \subset V \times V$ is a binary relation, the edge set of $G$. We will assume that $G$ is unoriented, i.e., $E$ is symmetric, and that there are no self-loops, i.e., $E$ is antireflexive. Let $d(x)$ denote the degree of vertex $x$, which is simply the number of $y \in V$ with $(x, y) \in E$, i.e., the number of neighbors of $x$. Suppose that there are no isolated vertices, which means that $d(x) > 0$ for all $x$. We will define a random walk on $V$ as a Markov chain on $\mathcal{S} = V$ with transition probabilities

$$p(x, y) = \frac{a_{xy}}{d(x)}, \qquad (6.16)$$

where $(a_{xy})_{x, y \in V}$ is the adjacency matrix,

$$a_{xy} = \begin{cases} 1, & \text{if } (x, y) \in E, \\ 0, & \text{otherwise.} \end{cases} \qquad (6.17)$$

It is easy to verify that $\sum_{y \in V} a_{xy} = d(x)$ and so $p(x, y)$ is indeed a stochastic matrix. (Here we assumed without saying that $G$ is locally finite, i.e., $d(x) < \infty$ for all $x \in V$.) This extends the usual simple random walk on $\mathbb{Z}^d$, which we defined in terms of sums of i.i.d. random variables, to any locally finite graph. The initial distribution is typically concentrated at one point, namely, the starting point of the walk; see Example 6.4.

Example 6.10 (Simple exclusion) Next we add a twist to the previous example. Consider a system of particles on the vertices of a finite graph $G$ and assume that there can be at most one particle at each site, an exclusion constraint. If it weren't for the exclusion constraint, the particles would like to perform independent random walks. With the constraint in place, at each unit of time a random particle attempts a jump to a randomly chosen neighbor. If the neighbor is empty, the move is accepted; if it is occupied, the move is discarded.

We will define a Markov chain mimicking this process. The state space of the chain will be $\mathcal{S} = \{0, 1\}^V$, i.e., the set of configurations of particles on $V$. At each unit of time, we pick an edge at random and interchange whatever there is at the endpoints. (If both endpoints are either occupied or empty, this results in no change; if one is occupied and the other empty, the particle at the occupied end is moved to the vacant end.) This algorithm is easy to implement on the computer; writing down the transition probability will require a little effort. Let $\eta$ and $\eta'$ be the states of the chain before and after the jump. We have selected an edge $e = (x, y) \in E$ with probability $1/|E|$, and the move could have taken $\eta$ to $\eta'$ if and only if $\eta_x = \eta'_y$, $\eta_y = \eta'_x$ and $\eta_z = \eta'_z$ for $z \ne x, y$. Let $E(\eta, \eta')$ denote the set of edges $(x, y) \in E$ with these properties. Then the transition probability is given by

$$p(\eta, \eta') = \frac{|E(\eta, \eta')|}{|E|}. \qquad (6.18)$$

We leave it as an exercise to show that $p$ is really a transition matrix, which boils down to showing that $\sum_{\eta' \in \mathcal{S}} |E(\eta, \eta')| = |E|$.

Having amassed a bunch of examples, we begin investigating some general properties of countable-state time-homogeneous Markov chains. (To keep the statements of theorems and lemmas concise, we will not state these as our assumptions any more.)

6.3 Stationary and reversible measures

The first question we will try to explore is that of stationarity. To motivate the forthcoming definitions, consider a Markov chain $(X_n)_{n \ge 0}$ with transition probability $p$ and $X_0$ distributed according to $\mu_0$. As is easy to check, the law of $X_1$ is then described by the measure $\mu_1$ which is computed by

$$\mu_1(y) = \sum_{x \in \mathcal{S}} \mu_0(x)\, p(x, y). \qquad (6.19)$$

We are interested in the situation when the distribution of $X_1$ is the same as that of $X_0$. This leads us to the following, somewhat more general, definition:

Definition 6.11 A (positive) measure $\nu$ on the state space $\mathcal{S}$ is called stationary if

$$\nu(y) = \sum_{x \in \mathcal{S}} \nu(x)\, p(x, y), \qquad y \in \mathcal{S}. \qquad (6.20)$$

If $\nu$ has total mass one, we call it a stationary distribution.

Remark 6.12 While we allow ourselves to consider measures $\mu$ on $\mathcal{S}$ that are not normalized, we will always assume that the measure assigns finite mass to every element of $\mathcal{S}$.

Clearly, once the laws of $X_0$ and $X_1$ are the same, then all $X_n$'s have the same law (provided the chain is time-homogeneous). Let us find stationary measures for the Ehrenfest and birth-death chains:

Lemma 6.13 Consider the Ehrenfest chain with state space $\mathcal{S} = \{0, 1, \dots, m\}$ and let the transition matrix be as in (6.13). Then

$$\nu(k) = \binom{m}{k} 2^{-m}, \qquad k = 0, \dots, m, \qquad (6.21)$$

is a stationary distribution.

Proof. We have to show that $\nu$ satisfies (6.20). First we note that

$$\sum_{k \in \mathcal{S}} \nu(k)\, p(k, l) = \nu(l-1)\, p(l-1, l) + \nu(l+1)\, p(l+1, l). \qquad (6.22)$$

Then we calculate

$$\text{rhs of (6.22)} = 2^{-m}\Bigl[\binom{m}{l-1}\frac{m-l+1}{m} + \binom{m}{l+1}\frac{l+1}{m}\Bigr] = 2^{-m}\binom{m}{l}\Bigl[\frac{l}{m-l+1}\cdot\frac{m-l+1}{m} + \frac{m-l}{l+1}\cdot\frac{l+1}{m}\Bigr]. \qquad (6.23)$$

The proof is finished by noting that, after a cancellation, the bracket is simply one.

Concerning the birth-death chain, we state the following:

Lemma 6.14 Consider the birth-death chain on $\mathbb{N} \cup \{0\}$ characterized by sequences $(\alpha_n)$, $(\beta_n)$ and $(\gamma_n)$, cf. (6.14). Suppose that $\beta_n > 0$ for all $n \ge 1$. Then $\nu(0) = 1$ and

$$\nu(n) = \prod_{k=1}^{n} \frac{\alpha_{k-1}}{\beta_k}, \qquad n \ge 1, \qquad (6.24)$$

defines a stationary measure of the chain.

In order to prove this lemma, we will introduce the following interesting concept:

Definition 6.15 A measure $\nu$ of a Markov chain on $\mathcal{S}$ is called reversible if

$$\nu(x)\, p(x, y) = \nu(y)\, p(y, x) \qquad (6.25)$$

holds for all $x, y \in \mathcal{S}$.

As we will show later, reversibility owes its name to the fact that, if we run the chain backwards (whatever this means for now) starting from $\nu$, we would get the same Markov chain. A simple consequence of reversibility is stationarity:

Lemma 6.16 A reversible measure is automatically stationary.

Proof. We just have to sum both sides of (6.25) over $y$. Since $p(x, \cdot)$ is stochastic, the left-hand side produces $\nu(x)$ while the right-hand side gives $\sum_{y \in \mathcal{S}} \nu(y)\, p(y, x)$.

Equipped with these observations, the proof of Lemma 6.14 is a piece of cake:

Proof of Lemma 6.14. We claim that $\nu$ in (6.24) is reversible. To prove this we observe that (6.24) implies, for all $k \ge 0$,

$$\nu(k+1) = \nu(k)\, \frac{\alpha_k}{\beta_{k+1}} = \nu(k)\, \frac{p(k, k+1)}{p(k+1, k)}. \qquad (6.26)$$

This shows that (6.25) holds for $x$ and $y$ differing by one. The case $x = y$ is always satisfied, and in the remaining cases $p(x, y) = 0$, so (6.25) is proved in general. Hence $\nu$ is reversible and thus stationary.
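Lemma 6.13 and the detailed-balance condition (6.25) are easy to verify numerically for the Ehrenfest chain. The following sketch is not from the notes; it assumes numpy and scipy (for the binomial coefficients) and checks both $\nu P = \nu$ and $\nu(x)p(x,y) = \nu(y)p(y,x)$.

```python
import numpy as np
from scipy.special import comb

m = 10
states = np.arange(m + 1)

# Ehrenfest transition matrix (6.13): p(n, n-1) = n/m, p(n, n+1) = (m-n)/m.
P = np.zeros((m + 1, m + 1))
for n in states:
    if n > 0:
        P[n, n - 1] = n / m
    if n < m:
        P[n, n + 1] = (m - n) / m

# Candidate stationary distribution (6.21): nu(k) = binom(m, k) * 2^{-m}.
nu = comb(m, states) * 2.0 ** (-m)

print(np.allclose(nu @ P, nu))       # stationarity (6.20)
D = nu[:, None] * P                  # D[x, y] = nu(x) p(x, y)
print(np.allclose(D, D.T))           # reversibility (6.25)
```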

Remark 6.17 Recall that stationary measures need not be finite and thus the existence of a stationary measure does not imply the existence of a stationary distribution. (The distinction is really between finite and infinite, because a finite stationary measure can always be normalized.) For the Ehrenfest chain we immediately produced a stationary distribution. However, for the birth-death chain the question whether $\nu$ is finite or infinite depends sensitively on the asymptotic properties of the ratios $\alpha_{k-1}/\beta_k$.

Next we will address the underlying meaning of the reversible measure. We will do this by showing that by reversing a Markov chain we obtain another Markov chain, which in the reversible situation will be the same as the original chain.

Theorem 6.18 Consider a Markov chain $(X_n)_{n \ge 0}$ started from a stationary initial distribution $\mu$ and with transition matrix $p(x, y)$. Fix $N$ large and for $n = 0, 1, \dots, N$, let

$$Y_n = X_{N-n}. \qquad (6.27)$$

Then $(Y_n)_{n=0}^{N}$ is a (time-homogeneous) Markov chain, called the reversed chain, with initial distribution $\mu$ and transition matrix $q(x, y)$ defined by

$$q(x, y) = \frac{\mu(y)}{\mu(x)}\, p(y, x), \qquad \mu(x) > 0. \qquad (6.28)$$

(The values $q(x, y)$ for $x$ such that $\mu(x) = 0$ are immaterial, since such $x$ will never be visited starting from the initial distribution $\mu$.)

Proof. Fix a collection of values $y_0, y_1, \dots, y_N \in \mathcal{S}$ and consider the probability $P(Y_n = y_n,\ n = 0, \dots, N)$, where the $Y_n$ are defined from the $X_n$ as in (6.27). Then we have

$$P(Y_n = y_n,\ n = 0, \dots, N) = \mu(y_N)\, p(y_N, y_{N-1}) \cdots p(y_1, y_0). \qquad (6.29)$$

A simple argument now shows that either this probability vanishes or $\mu(y_k) > 0$ for all $k = 0, \dots, N$. Since we do not care about sequences with zero probability, assuming the latter we can rewrite this as follows:

$$P(Y_n = y_n,\ n = 0, \dots, N) = \frac{\mu(y_N)}{\mu(y_{N-1})}\, p(y_N, y_{N-1}) \cdots \frac{\mu(y_1)}{\mu(y_0)}\, p(y_1, y_0)\, \mu(y_0) = \mu(y_0)\, q(y_0, y_1) \cdots q(y_{N-1}, y_N). \qquad (6.30)$$

Hence $(Y_n)$ is a Markov chain with initial distribution $\mu$ and transition matrix $q(x, y)$.

Clearly, a reversible measure $\mu$ would imply that $q(x, y) = p(x, y)$, i.e., the dual chain $(Y_n)$ is identical to $(X_n)$. This allows us to extend any stationary Markov chain to negative infinity into a two-sided sequence $(X_n)_{n \in \mathbb{Z}}$.

As an additional exercise we will apply these concepts to the random walk on a locally finite graph.

Lemma 6.19 Consider a locally finite unoriented graph $G = (V, E)$ and let $d(x)$ denote the degree of vertex $x$. Suppose that there are no isolated vertices, i.e., $d(x) > 0$ for every $x$. Then

$$\nu(x) = d(x), \qquad x \in V, \qquad (6.31)$$

is a reversible, and hence stationary, measure for the random walk on $G$.

Proof. We have $p(x, y) = a_{xy}/d(x)$, where $a_{xy}$ is the adjacency matrix. Since $G$ is unoriented, the adjacency matrix is symmetric. This allows us to calculate

$$\nu(x)\, p(x, y) = d(x)\, \frac{a_{xy}}{d(x)} = a_{xy} = a_{yx} = d(y)\, \frac{a_{yx}}{d(y)} = \nu(y)\, p(y, x). \qquad (6.32)$$

Thus $\nu$ is reversible and hence stationary.

Clearly, $\nu$ is finite if and only if $E$ is finite, which, by the fact that no vertex is isolated, implies that $V$ is finite. However, the measure $\nu$ may not be unique. Indeed, if $G$ has two separate components, even the restriction of $\nu$ to one of the components would be stationary. We proceed by analyzing the question of uniqueness (and existence) of stationary measures.

6.4 Existence/uniqueness of stationary measures

As alluded to in the example of the random walk on a general graph, a simple obstruction to uniqueness is when there are parts of the state space $\mathcal{S}$ for which the transition from one onto the other happens with zero probability. We define a proper name for this situation:

Definition 6.20 We call the transition matrix $p$ (or the Markov chain itself) irreducible if for all $x, y \in \mathcal{S}$ there exists a number $n$ such that

$$p^n(x, y) = \sum_{y_1, \dots, y_{n-1} \in \mathcal{S}} p(x, y_1)\, p(y_1, y_2) \cdots p(y_{n-1}, y) > 0. \qquad (6.33)$$

The object $p^n(x, y)$, not to be confused with $p_n(x, y)$, is the $n$-th power of the transition matrix $p$. As is easy to check, $p^n(x, y)$ simply equals the probability that the Markov chain started at $x$ is at $y$ at time $n$, i.e., $P_x(X_n = y) = p^n(x, y)$.

Irreducibility can be characterized in terms of a stopping time.

Lemma 6.21 Consider a Markov chain $(X_n)_{n \ge 0}$ on $\mathcal{S}$ and let $T_y = \inf\{n \ge 1 \colon X_n = y\}$ be the first time the chain visits $y$ (note that we do not count the initial state $X_0$ in this definition). Then the chain is irreducible if and only if for all $x, y$,

$$P_x(T_y < \infty) > 0. \qquad (6.34)$$

Proof. This is a trivial consequence of the fact that $P_x(X_n = y) = p^n(x, y)$.
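For a finite chain, (6.33) can be tested directly: $p^n$ is just the $n$-th matrix power, and it suffices to look at powers up to $|\mathcal{S}|$, since any state that is reachable at all is reachable along a path of at most $|\mathcal{S}|$ steps. A sketch (not from the notes, assuming numpy):

```python
import numpy as np

def is_irreducible(P):
    """Check (6.33): for every pair (x, y) some power p^n(x, y), n <= |S|, is positive."""
    k = len(P)
    acc, Pn = np.zeros_like(P), np.eye(k)
    for _ in range(k):
        Pn = Pn @ P              # Pn now holds the n-step probabilities p^n(x, y)
        acc += Pn
    return bool((acc > 0).all())

P = np.array([[0.0, 1.0],
              [0.5, 0.5]])
print(is_irreducible(P))                 # True: the two states communicate
print(np.linalg.matrix_power(P, 5))      # p^5(x, y) = P_x(X_5 = y)
```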

However, irreducibility alone is not sufficient to guarantee the existence and uniqueness of a stationary measure. The principal concept here is recurrence, which we have already encountered in the context of random walks:

Definition 6.22 A state $x \in \mathcal{S}$ is called recurrent if $P_x(T_x < \infty) = 1$. A Markov chain is recurrent if every state is recurrent.

Theorem 6.23 Consider an irreducible Markov chain $(X_n)_{n \ge 0}$ and let $x \in \mathcal{S}$ be a recurrent state. Then

$$\nu_x(y) = E_x\Bigl(\sum_{n=1}^{T_x} 1_{\{X_n = y\}}\Bigr) = \sum_{n \ge 1} P_x(X_n = y,\ T_x \ge n) \qquad (6.35)$$

is finite for all $y \in \mathcal{S}$ and defines a stationary measure on $\mathcal{S}$. Moreover, any other stationary measure is a multiple of $\nu_x$.

The crux of the proof, and the principal reason why we need recurrence, is the following observation: if $T_x < \infty$ almost surely, we can also write $\nu_x$ as follows:

$$\nu_x(y) = E_x\Bigl(\sum_{n=0}^{T_x - 1} 1_{\{X_n = y\}}\Bigr) = \sum_{n \ge 1} P_x(X_{n-1} = y,\ T_x \ge n). \qquad (6.36)$$

The first equality comes from the fact that if $y \ne x$ then $X_n \ne y$ for $n = 0$ and $n = T_x$ anyway, while if $y = x$ then the sum in the first expectation in (6.35) and in (6.36) equals one in both cases. The second equality in (6.36) follows by a convenient relabeling.

Proof of existence. Let $x \in \mathcal{S}$ be a recurrent state. Then

$$P_x(X_n = y,\ T_x \ge n) = \sum_{z \in \mathcal{S}} P_x(X_{n-1} = z,\ X_n = y,\ T_x \ge n) = \sum_{z \in \mathcal{S}} E_x\bigl(P_x(X_n = y \mid \mathcal{F}_{n-1})\, 1_{\{X_{n-1} = z\}}\, 1_{\{T_x \ge n\}}\bigr) = \sum_{z \in \mathcal{S}} p(z, y)\, P_x(X_{n-1} = z,\ T_x \ge n), \qquad (6.37)$$

where we used that $\{X_{n-1} = z\}$ and $\{T_x \ge n\}$ are both $\mathcal{F}_{n-1}$-measurable to derive the second equality. The third equality is a consequence of the fact that on $\{X_{n-1} = z\}$ we have $P_x(X_n = y \mid \mathcal{F}_{n-1}) = P_x(X_n = y \mid X_{n-1}) = p(z, y)$.

Summing the above over $n \ge 1$, applying discrete Fubini (everything is positive) and invoking (6.35) on the left and (6.36) on the right-hand side gives us $\nu_x(y) = \sum_{z \in \mathcal{S}} \nu_x(z)\, p(z, y)$.

It remains to show that $\nu_x(y) < \infty$ for all $y \in \mathcal{S}$. First note that $\nu_x(x) = 1$ by definition. Next we note that we actually have

$$1 = \nu_x(x) = \sum_{z \in \mathcal{S}} \nu_x(z)\, p^n(z, x) \ge \nu_x(y)\, p^n(y, x) \qquad (6.38)$$

for all $n \ge 1$ and all $y \in \mathcal{S}$. Thus $\nu_x(y) < \infty$ whenever $p^n(y, x) > 0$. By irreducibility this will happen for some $n$ for every $y \in \mathcal{S}$, and so $\nu_x(y) < \infty$ for all $y \in \mathcal{S}$. Hence $\nu_x$ is a stationary measure.
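The construction (6.35) can also be watched by simulation: start excursions at a fixed state $x$, count the visits to each $y$ up to and including the return time $T_x$, and average over excursions. The sketch below is not from the notes; it assumes numpy, and the three-state matrix is just an arbitrary small irreducible (hence recurrent) example.

```python
import numpy as np

P = np.array([[0.2, 0.8, 0.0],
              [0.3, 0.0, 0.7],
              [0.5, 0.5, 0.0]])
rng = np.random.default_rng(1)
x, n_excursions = 0, 20000
visits = np.zeros(3)

for _ in range(n_excursions):
    state = x
    while True:                          # one excursion: n = 1, ..., T_x
        state = rng.choice(3, p=P[state])
        visits[state] += 1
        if state == x:                   # returned to x, excursion over
            break

nu_x = visits / n_excursions             # Monte Carlo estimate of nu_x(y) in (6.35)
print(nu_x)                              # nu_x(x) = 1 by construction
print(nu_x @ P)                          # approximately equal to nu_x (stationarity)
```

Note that the total mass `nu_x.sum()` estimates the mean return time $E_x T_x$, a fact that reappears in Section 6.6.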

The proof of uniqueness provides some motivation for how $\nu_x$ was constructed:

Proof of uniqueness. Suppose $x$ is a recurrent state and let $\nu_x$ be the stationary measure in (6.35). Let $\mu$ be another stationary measure (we require that $\mu(y) < \infty$ for all $y \in \mathcal{S}$, even though, as we will see, it is enough to assume that $\mu(x) < \infty$). Stationarity of $\mu$ can also be written as

$$\mu(y) = \mu(x)\, p(x, y) + \sum_{z \ne x} \mu(z)\, p(z, y). \qquad (6.39)$$

Plugging this in for $\mu(z)$ in the second term and iterating gives us

$$\mu(y) = \mu(x)\Bigl[p(x, y) + \sum_{z \ne x} p(x, z)\, p(z, y) + \dots + \sum_{z_1, \dots, z_n \ne x} p(x, z_1) \cdots p(z_n, y)\Bigr] + \sum_{z_0, \dots, z_n \ne x} \mu(z_0)\, p(z_0, z_1) \cdots p(z_n, y). \qquad (6.40)$$

We would like to pass to the limit and conclude that the last term tends to zero. However, a direct proof of this appears unlikely, and so we proceed by using inequalities. Noting that the $k$-th term in the bracket equals $P_x(X_k = y,\ T_x \ge k)$, we have

$$\mu(y) \ge \mu(x) \sum_{k=1}^{n+1} P_x(X_k = y,\ T_x \ge k) \xrightarrow[n \to \infty]{} \mu(x)\, \nu_x(y). \qquad (6.41)$$

In particular, we have $\mu(y) \ge \mu(x)\, \nu_x(y)$ for all $y \in \mathcal{S}$. Our goal is to show that equality holds. Suppose that for some $x, y$ we have $\mu(y) > \mu(x)\, \nu_x(y)$. By irreducibility, there exists $n \ge 1$ such that $p^n(y, x) > 0$ and so

$$\mu(x) = \sum_{z \in \mathcal{S}} \mu(z)\, p^n(z, x) > \mu(x) \sum_{z \in \mathcal{S}} \nu_x(z)\, p^n(z, x) = \mu(x)\, \nu_x(x) = \mu(x), \qquad (6.42)$$

a contradiction. So $\mu(y) = \mu(x)\, \nu_x(y)$, i.e., $\mu$ is a rescaled version of $\nu_x$.

6.5 Strong Markov property

In the proof of existence and uniqueness of the stationary measure we have barely touched upon the recurrence property. Before we delve deeper into that subject (which is what we will do in Section 6.6), let us state and prove an interesting consequence of the definition of a Markov chain.

In order to be more explicit about the whole setup, suppose that our Markov chain is defined on the measurable space $(\Omega, \mathcal{F})$, where $\Omega = \mathcal{S}^{\mathbb{N}_0}$ and $\mathcal{F}$ is the product $\sigma$-algebra. Then we can represent $X_n$ using the coordinate maps $X_n(\omega) = \omega_n$. Recall the definition of the shift operator $\theta$, which acts on sequences $\omega$ by $(\theta\omega)_n = \omega_{n+1}$. For any $n \ge 1$ we define $\theta^n$ to be the $n$-fold composition of $\theta$, i.e., $(\theta^n\omega)_k = \omega_{k+n}$. If $N$ is a stopping time of the filtration $\mathcal{F}_n = \sigma(X_0, \dots, X_n)$, then we let $\theta^N$ be the operator $\theta^n$ on $\{N = n\}$. On $\{N = \infty\}$ we leave $\theta^N$ undefined.

Theorem 6.24 [Strong Markov property] Consider a Markov chain $(X_n)_{n \ge 0}$ with initial distribution $\mu$ and let $N$ be a stopping time. Suppose that $P_\mu(N < \infty) > 0$ and let $\theta^N$ be as defined above. Then for all $B \in \mathcal{F}$,

$$P_\mu(1_B \circ \theta^N \mid \mathcal{F}_N) = P_{X_N}(B) \qquad (6.43)$$

almost surely on $\{N < \infty\}$. Here $\mathcal{F}_N = \{A \in \mathcal{F} \colon A \cap \{N = n\} \in \mathcal{F}_n,\ n \ge 0\}$ and $X_N$ is defined to be $X_n$ on $\{N = n\}$.

This property is called strong because it is a strengthening of the Markov property

$$P_\mu(1_B \circ \theta^n \mid \mathcal{F}_n) = P_{X_n}(B) \qquad (6.44)$$

to random $n$. In our case the proofs of the Markov property and the strong Markov property amount more or less to the same. In particular, no additional assumptions are needed. This is not true in continuous time, where the strong Markov property typically fails in the absence of (rather natural) continuity conditions.

Proof of Theorem 6.24. Let $A \in \mathcal{F}_N$ be such that $A \subset \{N < \infty\}$. First we will partition according to the values of $N$; this is what Durrett calls the "divide & conquer" principle:

$$E_\mu\bigl(1_A\, (1_B \circ \theta^N)\bigr) = \sum_{n \ge 0} E_\mu\bigl(1_{A \cap \{N = n\}}\, (1_B \circ \theta^n)\bigr). \qquad (6.45)$$

(Note that, by our assumptions about $A$, we do not have to include the value $N = \infty$. Once we are on $\{N = n\}$ we can replace $\theta^N$ by $\theta^n$.) Now $A \cap \{N = n\} \in \mathcal{F}_n$ while $1_B \circ \theta^n$ is $\sigma(X_n, X_{n+1}, \dots)$-measurable. This allows us to condition on $\mathcal{F}_n$ and use (6.3):

$$E_\mu\bigl(1_{A \cap \{N = n\}}\, (1_B \circ \theta^n)\bigr) = E_\mu\bigl(1_{A \cap \{N = n\}}\, E_\mu(1_B \circ \theta^n \mid \mathcal{F}_n)\bigr) = E_\mu\bigl(1_{A \cap \{N = n\}}\, P_{X_n}(B)\bigr). \qquad (6.46)$$

Plugging this back into (6.45), we conclude that

$$E_\mu\bigl(1_A\, (1_B \circ \theta^N)\bigr) = E_\mu\bigl(1_A\, P_{X_N}(B)\bigr) \qquad (6.47)$$

for all $A \in \mathcal{F}_N$ with $A \subset \{N < \infty\}$. Since $P_{X_N}(B)$ is $\mathcal{F}_N$-measurable, a standard argument implies that $P_\mu(1_B \circ \theta^N \mid \mathcal{F}_N)$ equals $P_{X_N}(B)$ almost surely on $\{N < \infty\}$, as claimed.

We proceed by listing some applications of the strong Markov property. Consider the stopping time $T_x$ defined by

$$T_x = \inf\{n \ge 1 \colon X_n = x\}. \qquad (6.48)$$

Here we deliberately omit $n = 0$ from the infimum, so that even under $P_x$ we may have $T_x = \infty$ almost surely. Let $T_x^n$ denote the $n$-th iteration of $T_x$, by which we mean simply $T_x$ for the sequence $\theta^{T_x^{n-1}}\omega$; see the corresponding definition in the context of random walks. Then we have the following variant of Lemma 3.23:

Lemma 6.25 Consider a Markov chain with state space $\mathcal{S}$ and let $T_y$ be as defined above. Then for all $x, y \in \mathcal{S}$,

$$P_x(T_y^n < \infty) = P_x(T_y < \infty)\, P_y(T_y < \infty)^{n-1}. \qquad (6.49)$$

Proof. The event $\{T_y^n < \infty\}$ can be written as $\{T_y < \infty\}$ intersected with the shift of $\{T_y^{n-1} < \infty\}$ by $T_y$. Applying the strong Markov property we thus get

$$P_x(T_y^n < \infty) = P_x(T_y < \infty)\, P_y(T_y^{n-1} < \infty). \qquad (6.50)$$

Using the same identity for the second term on the right-hand side, we arrive at (6.49).

As for the random walk, this statement allows us to characterize recurrence of $x$ in terms of the expected number of visits of the chain back to $x$.

Corollary 6.26 Let $N(y) = \sum_{n \ge 1} 1_{\{X_n = y\}}$. Then

$$E_x N(y) = \frac{P_x(T_y < \infty)}{1 - P_y(T_y < \infty)}. \qquad (6.51)$$

Here the right-hand side is to be interpreted as zero if the numerator vanishes and as infinity if the numerator is positive and the denominator vanishes.

Proof. This is a consequence of the fact that $E_x(N(y)) = \sum_{n \ge 1} P_x(T_y^n < \infty)$ and the formula (6.49).

Corollary 6.27 A state $x$ is recurrent if and only if $E_x(N(x)) = \infty$. In particular, for an irreducible Markov chain either all states are recurrent or none of them are. Finally, an irreducible finite-state Markov chain is recurrent.

Proof. A state $x$ is recurrent iff $P_x(T_x < \infty) = 1$, which by (6.51) is true iff $E_x(N(x)) = \infty$.

To show the second claim, suppose that $x$ is recurrent and let us show that so is any $y \in \mathcal{S}$. To that end, let $k$ and $l$ be numbers such that $p^k(y, x) > 0$ and $p^l(x, y) > 0$; these numbers exist by irreducibility. Then

$$p^{n+k+l}(y, y) \ge p^k(y, x)\, p^n(x, x)\, p^l(x, y), \qquad (6.52)$$

which implies

$$E_y N(y) = \sum_{m \ge 1} p^m(y, y) \ge \sum_{n \ge 1} p^k(y, x)\, p^n(x, x)\, p^l(x, y) = p^k(y, x)\, p^l(x, y)\, E_x N(x). \qquad (6.53)$$

But $p^k(y, x)\, p^l(x, y) > 0$ and so $E_x(N(x)) = \infty$ implies $E_y(N(y)) = \infty$. Hence, all states of an irreducible Markov chain are recurrent if one of them is.

Finally, if $\mathcal{S}$ is finite, the trivial relation $\sum_{x \in \mathcal{S}} N(x) = \infty$ implies $E_x(N(x)) = \infty$ for at least one $x \in \mathcal{S}$.
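Formula (6.51) can be checked on a tiny chain with a transient part. In the hypothetical three-state example below, states 0 and 1 feed an absorbing state 2; one computes $P_0(T_1 < \infty) = 1/2$ and $P_1(T_1 < \infty) = 1/4$, so (6.51) gives $E_0 N(1) = (1/2)/(3/4) = 2/3$. The sketch is not from the notes and assumes numpy.

```python
import numpy as np

# States 0 and 1 are transient; state 2 is absorbing.
P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])
rng = np.random.default_rng(2)

def mean_visits_to_1(start, n_runs=50000):
    """Monte Carlo estimate of E_start N(1), the expected number of visits to state 1."""
    total = 0
    for _ in range(n_runs):
        state = start
        while state != 2:                    # absorption at state 2 happens a.s.
            state = rng.choice(3, p=P[state])
            if state == 1:
                total += 1
    return total / n_runs

print(mean_visits_to_1(0))    # close to 2/3, as predicted by (6.51)
```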

6.6 Recurrence, transience and stationary distributions

In the previous section we concluded that, for irreducible Markov chains, recurrence is a class property, i.e., a property that either holds for all states or for none. We have also shown that, once the chain is recurrent (on top of irreducibility), there exists a stationary measure. In this section we will give conditions under which the stationary measure has finite mass, which means it can be normalized to produce a stationary distribution. To that end we introduce the following definitions:

Definition 6.28 A state $x \in \mathcal{S}$ of a Markov chain is said to be

(1) transient if $P_x(T_x < \infty) < 1$,
(2) null recurrent if $P_x(T_x < \infty) = 1$ but $E_x T_x = \infty$,
(3) positive recurrent if $E_x T_x < \infty$.

We will justify the terminology later. Our goal is to show that a stationary distribution exists if and only if every state of the (irreducible) chain is positive recurrent. The principal result is formulated as follows:

Theorem 6.29 Consider a Markov chain with state space $\mathcal{S}$. If there exists a stationary measure $\mu$ with $0 < \mu(\mathcal{S}) < \infty$, then every $x$ with $\mu(x) > 0$ is recurrent. If the chain is irreducible, then

$$\mu(x) = \frac{\mu(\mathcal{S})}{E_x T_x} \qquad (6.54)$$

for all $x \in \mathcal{S}$. In particular, $E_x T_x < \infty$ for all $x$, i.e., every state is positive recurrent.

Proof. Let $x$ be such that $\mu(x) > 0$. The fact that $\mu$ is stationary implies that $\mu(x) = \sum_{z \in \mathcal{S}} \mu(z)\, p^n(z, x)$ for all $n \ge 1$. Therefore

$$\infty = \sum_{n \ge 1} \mu(x) = \sum_{n \ge 1} \sum_{z \in \mathcal{S}} \mu(z)\, p^n(z, x) \overset{\text{Fubini}}{=} \sum_{z \in \mathcal{S}} \mu(z) \sum_{n \ge 1} p^n(z, x) = \sum_{z \in \mathcal{S}} \mu(z)\, E_z N(x). \qquad (6.55)$$

But (6.51) implies $E_z(N(x)) \le [1 - P_x(T_x < \infty)]^{-1}$ and so

$$\infty \le \frac{\mu(\mathcal{S})}{1 - P_x(T_x < \infty)}. \qquad (6.56)$$

Since $\mu(\mathcal{S}) < \infty$, we must have $P_x(T_x < \infty) = 1$, i.e., $x$ is recurrent.

In order to prove the second part of the claim, we note that irreducibility implies that $\mu(x) > 0$ for all $x$ (unless $\mu \equiv 0$, which we do not consider worth discussing), and so all states are recurrent. From (the proof of) Theorem 6.23 we glean the relation $\mu(y) = \mu(x)\, \nu_x(y)$, which implies

$$\mu(\mathcal{S}) = \mu(x) \sum_{y \in \mathcal{S}} \nu_x(y) = \mu(x)\, E_x T_x, \qquad (6.57)$$

which implies (6.54). But $\mu(\mathcal{S}) < \infty$ and so we must have $E_x T_x < \infty$.

We summarize the interesting part of the result in a corollary:

Corollary 6.30 If a Markov chain with state space $\mathcal{S}$ is irreducible, then the following are equivalent:

(1) Some state is positive recurrent.
(2) There exists a stationary measure $\mu$ with $\mu(\mathcal{S}) < \infty$.
(3) Every state is positive recurrent.

Proof. (1)$\Rightarrow$(2): Let $x$ be positive recurrent. Then $\nu_x$ is a stationary measure with $\nu_x(\mathcal{S}) = E_x T_x < \infty$. (2)$\Rightarrow$(3): This is the content of Theorem 6.29. (3)$\Rightarrow$(1): Trivial.

We finish this section by providing a justification for the terminology of positive and null recurrent states/Markov chains (both are class properties):

Theorem 6.31 Consider a Markov chain with state space $\mathcal{S}$. Let $N_n(y) = \sum_{m=1}^{n} 1_{\{X_m = y\}}$. If $y$ is recurrent, then for all $x \in \mathcal{S}$,

$$\lim_{n \to \infty} \frac{N_n(y)}{n} = \frac{1}{E_y T_y}\, 1_{\{T_y < \infty\}}, \qquad P_x\text{-a.s.} \qquad (6.58)$$

Proof. Let us first consider the case $x = y$. Then recurrence implies $1_{\{T_y < \infty\}} = 1$ almost surely. Define the sequence of times $\tau_n = T_y^n - T_y^{n-1}$, where $\tau_1 = T_y$. By the strong Markov property, the $(\tau_n)$ are i.i.d. with the same distribution as $T_y$. In terms of the $\tau_n$'s, we have

$$N_n(y) = \sup\{k \ge 0 \colon \tau_1 + \dots + \tau_k \le n\}, \qquad (6.59)$$

i.e., $N_n(y)$ is a renewal sequence. The Renewal Theorem then gives us

$$\lim_{n \to \infty} \frac{N_n(y)}{n} = \frac{1}{E_y \tau_1} = \frac{1}{E_y T_y}, \qquad (6.60)$$

$P_y$-almost surely.

Now we will look at the case $x \ne y$. If $P_x(T_y = \infty) = 1$ then $N_n(y) = 0$ almost surely for all $n$ and there is nothing to prove. We can thus assume that $P_x(T_y < \infty) > 0$ and decompose according to the values of $T_y$. We will use the Markov property, which tells us that for any $A \in \mathcal{F}$ we have $P_x\bigl((\theta^m)^{-1}(A) \mid T_y = m\bigr) = P_y(A)$. We will apply this to the event

$$A = \Bigl\{\lim_{n \to \infty} \frac{N_n(y)}{n} = \frac{1}{E_y T_y}\Bigr\}. \qquad (6.61)$$

Indeed, this event occurs almost surely under $P_y$ and so we have $P_x\bigl((\theta^m)^{-1}(A) \mid T_y = m\bigr) = 1$. But on $(\theta^m)^{-1}(A) \cap \{T_y = m\}$ we have

$$\lim_{n \to \infty} \frac{N_{n+m}(y) - N_m(y)}{n} = \frac{1}{E_y T_y}, \qquad (6.62)$$

which implies that $N_n(y)/n \to 1/E_y T_y$. Therefore, $P_x(A \mid T_y = m) = 1$ for all $m$ with $P_x(T_y = m) > 0$. It follows that $A$ occurs almost surely under $P_x(\cdot \mid T_y < \infty)$. Hence, the limit in (6.58) equals $1/E_y T_y$ almost surely on $\{T_y < \infty\}$ and zero almost surely on $\{T_y = \infty\}$. This proves the claim.

The upshot of this theorem is that a state is positive recurrent if it is visited at a positive density of times, and null recurrent if it is visited infinitely often but the density of visits tends to zero. Formulas of the form (6.54) and (6.58) are reminiscent of Kac's recurrence theorem from ergodic theory.
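Theorem 6.31, combined with (6.54), says that for a positive recurrent chain the long-run fraction of time spent at $y$ is $1/E_y T_y$, i.e., the stationary weight of $y$. A quick empirical check on a small chain (a sketch, not from the notes, assuming numpy):

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
rng = np.random.default_rng(3)

# Empirical occupation frequencies N_n(y)/n along one long trajectory.
n_steps, state = 200000, 0
counts = np.zeros(3)
for _ in range(n_steps):
    state = rng.choice(3, p=P[state])
    counts[state] += 1
print(counts / n_steps)

# Stationary distribution (left eigenvector of P for eigenvalue 1) for comparison.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()
print(pi)            # matches the frequencies; pi(y) = 1 / E_y T_y by (6.54)
```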

6.7 Convergence to equilibrium

Markov chains are often run on a computer in order to sample from a complicated distribution on a large state space. The idea is to define a Markov chain for which the desired distribution is stationary and then wait long enough for the chain to equilibrate. The last aspect of Markov chains we wish to examine is thus the convergence to equilibrium. When run on a computer, only one state of the Markov chain is stored at each time (this is why Markov chains are relatively easy to implement), and so we are asking about the convergence of the distribution $P_\mu(X_n \in \cdot)$. Noting that this is captured by the quantities $p^n(x, y)$, we will thus study the convergence of $p^n(x, \cdot)$ as $n \to \infty$.

For irreducible Markov chains, we can generally guarantee convergence in the Cesaro sense:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} p^m(x, y) = \mu(y). \qquad (6.63)$$

Indeed, subsequential limits produce stationary measures, which are zero unless the chain is positive recurrent. In the latter case the stationary measure is unique, and so every subsequential limit is the same, i.e., the Cesaro averages converge. Unfortunately, the $p^n(x, y)$ themselves may not converge. For instance, if we have a chain that hops between two states, $p^n(x, y)$ will oscillate between zero and one as $n$ changes. The obstruction is clearly related to periodicity: if there were the slightest probability of not hopping, the chain would soon get out of sync and equilibrium would be reached.

In order to classify the periodic situation, let $I_x = \{n \ge 1 \colon p^n(x, x) > 0\}$. By standard arguments, $I_x$ is an additive semigroup (a set closed under addition). This allows us to define a number $d_x$ as the largest integer that divides all $n \in I_x$. (Since one divides all integers, such a number indeed exists.) We call $d_x$ the period of $x$.

Lemma 6.32 If the Markov chain is irreducible, then $d_x$ is the same for all $x$.

Proof. See the textbook.

Definition 6.33 A Markov chain is called aperiodic if $d_x = 1$ for all $x$.
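The period $d_x$ can be computed for a finite chain by taking the gcd of the times at which return to $x$ has positive probability; a finite horizon is enough in practice once the gcd has stabilized. A sketch, not from the notes, assuming numpy; the cutoff `n_max` and the tolerance are ad hoc choices.

```python
import numpy as np
from math import gcd
from functools import reduce

def period(P, x, n_max=50):
    """gcd of {n <= n_max : p^n(x, x) > 0}; returns 0 if no return is seen."""
    returns, Pn = [], np.eye(len(P))
    for n in range(1, n_max + 1):
        Pn = Pn @ P
        if Pn[x, x] > 1e-12:          # tolerance guards against rounding noise
            returns.append(n)
    return reduce(gcd, returns, 0)

P_cycle = np.array([[0.0, 1.0], [1.0, 0.0]])   # deterministic 2-cycle
P_lazy  = np.array([[0.1, 0.9], [0.9, 0.1]])   # "slightest probability not to hop"
print(period(P_cycle, 0), period(P_lazy, 0))   # 2 and 1
```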

Lemma 6.34 An aperiodic Markov chain with state space $\mathcal{S}$ satisfies the following: for all $x, y \in \mathcal{S}$ there exists $n_0 = n_0(x, y)$ such that $p^n(x, y) > 0$ for all $n \ge n_0$.

Proof. See Lemma (5.4) on page 314 of the textbook.

Our goal is to prove the following result:

Theorem 6.35 Consider an irreducible, aperiodic Markov chain on a state space $\mathcal{S}$. Suppose there exists a stationary distribution $\pi$. Then for all $x \in \mathcal{S}$,

$$p^n(x, y) \xrightarrow[n \to \infty]{} \pi(y), \qquad y \in \mathcal{S}. \qquad (6.64)$$

The proof of this theorem will be based on a general technique called coupling. The idea is as follows: We will run one Markov chain started at $x$ and another started at a state $z$ which was itself chosen at random from the distribution $\pi$. As long as the chains stay away from each other, we keep generating them independently. The first moment they collide, we glue them together and from that time on move both of them synchronously. The upshot is that, if we observe only the chain started at $x$, we see a chain started at $x$, while if we observe only the chain started at $z$, we see a chain started at $z$. But the latter was started from the stationary distribution and so it will be stationary at each time. It follows that, provided the chains have glued, also the one started from $x$ will eventually be stationary.

To make this precise, we will have to define both chains on the same probability space. We will generalize the initial distributions to any two measures $\mu$ and $\nu$ on $\mathcal{S}$. Let us therefore consider a Markov chain on $\mathcal{S} \times \mathcal{S}$ with transition probabilities

$$\bar p\bigl((x_1, x_2), (y_1, y_2)\bigr) = \begin{cases} p(x_1, y_1)\, p(x_2, y_2), & \text{if } x_1 \ne x_2, \\ p(x_1, y_1), & \text{if } x_1 = x_2 \text{ and } y_1 = y_2, \\ 0, & \text{otherwise,} \end{cases} \qquad (6.65)$$

and initial distribution $\mu \otimes \nu$. We will use $P_{\mu \otimes \nu}$ to denote the corresponding probability measure, called the coupling measure, and $(X_n^{(1)}, X_n^{(2)})$ to denote the coupled process. First we will verify that each of the marginals is the original Markov chain:

Lemma 6.36 Let $(X_n^{(1)}, X_n^{(2)})$ denote the coupled process under the measure $P_{\mu \otimes \nu}$. Then $(X_n^{(1)})$ is the original Markov chain on $\mathcal{S}$ with initial distribution $\mu$, while $(X_n^{(2)})$ is the original Markov chain on $\mathcal{S}$ with initial distribution $\nu$.

Proof. Let $A = \{X_k^{(1)} = x_k,\ k = 0, \dots, n\}$. Abusing the notation slightly, we want to show that $P_{\mu \otimes \nu}(A) = P_\mu(A)$. Since $A$ fixes only the $X_k^{(1)}$'s, we can calculate the probability of $A$ by summing over the possible values of the $X_k^{(2)}$'s:

$$P_{\mu \otimes \nu}(A) = \sum_{(y_k)} \mu(x_0)\, \nu(y_0) \prod_{k=0}^{n-1} \bar p\bigl((x_k, y_k), (x_{k+1}, y_{k+1})\bigr). \qquad (6.66)$$

Next we note that

$$\sum_{y' \in \mathcal{S}} \bar p\bigl((x, y), (x', y')\bigr) = \begin{cases} \sum_{y' \in \mathcal{S}} p(x, x')\, p(y, y'), & \text{if } x \ne y, \\ p(x, x'), & \text{if } x = y. \end{cases} \qquad (6.67)$$

In both cases the sum equals $p(x, x')$, which we note is independent of $y$. Therefore, the sums in (6.66) can be performed one by one, with the result

$$P_{\mu \otimes \nu}(A) = \mu(x_0) \prod_{k=0}^{n-1} p(x_k, x_{k+1}), \qquad (6.68)$$

which is exactly $P_\mu(A)$. The second marginal is handled analogously.

Our next item of interest is the time when the chains first collide:

Lemma 6.37 Let $T = \inf\{n \ge 0 \colon X_n^{(1)} = X_n^{(2)}\}$. Under the conditions of Theorem 6.35,

$$P_{\mu \otimes \nu}(T < \infty) = 1 \qquad (6.69)$$

for any pair of initial distributions $\mu$ and $\nu$.

Proof. We will consider an uncoupled chain on $\mathcal{S} \times \mathcal{S}$ in which both original Markov chains move independently forever. This chain has the transition probability

$$q\bigl((x_1, x_2), (y_1, y_2)\bigr) = p(x_1, y_1)\, p(x_2, y_2). \qquad (6.70)$$

As a moment's thought reveals, the time $T$ has the same distribution in both the coupled and the uncoupled chain. Therefore, we just need to prove the lemma for the uncoupled chain.

First, let us note that the uncoupled chain is irreducible (this is where aperiodicity is needed). Indeed, by Lemma 6.34, aperiodicity implies that $p^n(x_1, y_1) > 0$ and $p^n(x_2, y_2) > 0$ for $n$ sufficiently large, and so we also have $q^n((x_1, x_2), (y_1, y_2)) > 0$ for $n$ sufficiently large. Second, we observe that the uncoupled chain is recurrent. Indeed, $\hat\pi(x, y) = \pi(x)\pi(y)$ is a stationary distribution and, using irreducibility, every state of the chain is thus recurrent. But then, for any $x \in \mathcal{S}$, the first hitting time of $(x, x)$ is finite almost surely, which implies the same for $T$, which is the first hitting time of the diagonal in $\mathcal{S} \times \mathcal{S}$.

The principal idea behind coupling now reduces to the following lemma:

Lemma 6.38 [Coupling inequality] Consider the coupled Markov chain with initial distribution $\mu \otimes \nu$ and let $T = \inf\{n \ge 0 \colon X_n^{(1)} = X_n^{(2)}\}$. Let $\mu_n(\cdot) = P_{\mu \otimes \nu}(X_n^{(1)} \in \cdot)$ and $\nu_n(\cdot) = P_{\mu \otimes \nu}(X_n^{(2)} \in \cdot)$ be the marginals at time $n$. Then

$$\|\mu_n - \nu_n\| \le P_{\mu \otimes \nu}(T > n), \qquad (6.71)$$

where $\|\mu_n - \nu_n\| = \sup_{A \subset \mathcal{S}} |\mu_n(A) - \nu_n(A)|$ is the variational distance of $\mu_n$ and $\nu_n$.

Proof. Let $\mathcal{S}^+ = \{x \in \mathcal{S} \colon \mu_n(x) > \nu_n(x)\}$. The proof is based on the fact that

$$\|\mu_n - \nu_n\| = \mu_n(\mathcal{S}^+) - \nu_n(\mathcal{S}^+). \qquad (6.72)$$

This makes it reasonable to evaluate the difference

$$\mu_n(\mathcal{S}^+) - \nu_n(\mathcal{S}^+) = P_{\mu \otimes \nu}(X_n^{(1)} \in \mathcal{S}^+) - P_{\mu \otimes \nu}(X_n^{(2)} \in \mathcal{S}^+) = E_{\mu \otimes \nu}\bigl(1_{\{X_n^{(1)} \in \mathcal{S}^+\}} - 1_{\{X_n^{(2)} \in \mathcal{S}^+\}}\bigr) = E_{\mu \otimes \nu}\Bigl(1_{\{T > n\}}\bigl[1_{\{X_n^{(1)} \in \mathcal{S}^+\}} - 1_{\{X_n^{(2)} \in \mathcal{S}^+\}}\bigr]\Bigr). \qquad (6.73)$$

Here we have noted that if $T \le n$ then either both $\{X_n^{(1)} \in \mathcal{S}^+\}$ and $\{X_n^{(2)} \in \mathcal{S}^+\}$ occur or both don't. Estimating the difference of the two indicators by one, we thus get $\mu_n(\mathcal{S}^+) - \nu_n(\mathcal{S}^+) \le P_{\mu \otimes \nu}(T > n)$. Plugging this into (6.72), the desired estimate follows.

Now we are ready to prove the convergence to equilibrium:

Proof of Theorem 6.35. Consider two Markov chains, one started from $\mu$ and the other from $\nu$. By Lemmas 6.36 and 6.38, the variational distance between the distributions $\mu_n$ and $\nu_n$ of $X_n$ in these two chains is bounded by $P_{\mu \otimes \nu}(T > n)$. But Lemma 6.37 implies that $P_{\mu \otimes \nu}(T > n)$ tends to zero as $n \to \infty$, and so $\|\mu_n - \nu_n\| \to 0$.

To get (6.64) we now let $\mu = \delta_x$ and $\nu = \pi$. Then $\mu_n(\cdot) = p^n(x, \cdot)$ while $\nu_n = \pi$ for all $n$. Hence we have $\|p^n(x, \cdot) - \pi\| \to 0$, which means that $p^n(x, \cdot) \to \pi$ in the variational norm. This implies (6.64).

The method of proof is quite general and can be adapted to other circumstances. See Lindvall's book "Lectures on the Coupling Method." We observe that Lemmas 6.38 and 6.36 allow us to estimate the time it takes for the two marginals to get closer than prescribed. On the basis of the proof of Lemma 6.36, the coupling time can be studied in terms of the uncoupled process, which is slightly easier to handle.
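Theorem 6.35 can be watched numerically on the two-state chain of Example 6.8: the rows of $p^n$ converge to the stationary distribution $\pi = (4/9, 5/9)$, and the variational distance shrinks by the factor $|3/4 + 4/5 - 1| = 0.55$ (the second eigenvalue) at every step. A sketch, not from the notes, assuming numpy:

```python
import numpy as np

P = np.array([[3/4, 1/4],
              [1/5, 4/5]])            # the stochastic matrix from (6.15)

pi = np.array([4/9, 5/9])             # solves pi P = pi with pi(0) + pi(1) = 1
assert np.allclose(pi @ P, pi)

Pn = np.eye(2)
for n in range(1, 11):
    Pn = Pn @ P
    tv = 0.5 * np.abs(Pn[0] - pi).sum()   # ||p^n(0, .) - pi|| = sup_A |difference|
    print(n, tv)
```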

6.8 Discrete harmonic analysis

We finish with a brief section on discrete harmonic analysis. The reasons for studying harmonic functions (to be defined below) on discrete structures come from the striking connection with the simple random walk that we want to demonstrate. In the continuum a similar connection exists with Brownian motion.

Harmonic analysis is a subject whose primary (or initial) focus is the solutions of Laplace's equation. In the discrete setting, the usual Laplacian takes a function $f \colon \mathbb{Z}^d \to \mathbb{R}$ and assigns to it the function

$$(\Delta f)(x) = \frac{1}{2d} \sum_{y \sim x} \bigl[f(y) - f(x)\bigr], \qquad (6.74)$$

where $y \sim x$ means that $y$ is a nearest neighbor of $x$ on $\mathbb{Z}^d$. The Laplacian is directly related to the notion of harmonic functions:

Definition 6.39 A function $f \colon \mathbb{Z}^d \to \mathbb{R}$ is called harmonic in $\Lambda \subset \mathbb{Z}^d$ if

$$f(x) = \frac{1}{2d} \sum_{y \sim x} f(y) \qquad (6.75)$$

holds for all $x \in \Lambda$. (I.e., $f$ is harmonic in $\Lambda$ if $(\Delta f)(x) = 0$ for all $x \in \Lambda$.) Similarly, we call $f$ subharmonic if "$\le$" holds, and superharmonic if "$\ge$" holds, for all $x \in \Lambda$.

In the following we will show that harmonic functions and the simple random walk have a lot in common. For that we will need the following notation: For each $x \in \mathbb{Z}^d$, let $P_x$ denote the probability measure on simple random walks on $\mathbb{Z}^d$ started at $x$. More specifically, $P_x$ is a measure on sequences $(S_n)_{n \ge 0}$ such that $S_n - S_0$ is the usual simple random walk and $S_0 = x$ almost surely.

Lemma 6.40 Let $x \in \mathbb{Z}^d$ and let $\Lambda \subset \mathbb{Z}^d$ be an arbitrary set. Let $T = \inf\{n \ge 0 \colon S_n \notin \Lambda\}$ and let $f \colon \mathbb{Z}^d \to \mathbb{R}$ be a function. Then on $(\Omega, \mathcal{F}, P_x)$ we have: if $f$ is subharmonic in $\Lambda$, then

$$M_n = f(S_{T \wedge n}) \qquad (6.76)$$

is a submartingale with respect to the filtration $\mathcal{F}_n = \sigma(S_1, \dots, S_n)$. In particular, if $f$ is harmonic in $\Lambda$ then $M_n$ is a martingale.

Remark 6.41 If we knew that $f$ were harmonic everywhere, it would suffice to show that $f(S_n)$ is a martingale, because then the result would follow by an earlier lemma. However, in the present case an independent proof is easier.

Proof. Clearly, it suffices to prove the statement for subharmonic $f$. Here we decompose according to the values of $T$:

$$E(M_{n+1} \mid \mathcal{F}_n) = E\bigl(f(S_{T \wedge (n+1)})\, 1_{\{T \ge n+1\}} \mid \mathcal{F}_n\bigr) + E\bigl(f(S_{T \wedge (n+1)})\, 1_{\{T \le n\}} \mid \mathcal{F}_n\bigr). \qquad (6.77)$$

Now the first term can be written as

$$E\bigl(f(S_{T \wedge (n+1)})\, 1_{\{T \ge n+1\}} \mid \mathcal{F}_n\bigr) = E\bigl(f(S_n + X_{n+1})\, 1_{\{T \ge n+1\}} \mid \mathcal{F}_n\bigr) = \frac{1}{2d} \sum_{y \sim S_n} f(y)\, 1_{\{T \ge n+1\}} \ge f(S_n)\, 1_{\{T \ge n+1\}}. \qquad (6.78)$$

Here we used that $1_{\{T \ge n+1\}}$ is $\mathcal{F}_n$-measurable to take it out of the expectation; then we wrote $S_{n+1} = S_n + X_{n+1}$, where $S_n$ is $\mathcal{F}_n$-measurable and $X_{n+1}$ is independent of $\mathcal{F}_n$. This allows us to take the expectation with respect to $X_{n+1}$. The final inequality comes from the fact that, on $\{T \ge n+1\}$, we have $S_n \in \Lambda$ and so $f$ is subharmonic at $S_n$.

In order to address the second term, we notice that $f(S_{T \wedge (n+1)})\, 1_{\{T \le n\}} = f(S_T)\, 1_{\{T \le n\}}$, which is $\mathcal{F}_n$-measurable. Hence we get

$$E(M_{n+1} \mid \mathcal{F}_n) \ge f(S_n)\, 1_{\{T \ge n+1\}} + f(S_T)\, 1_{\{T \le n\}}, \qquad (6.79)$$

which is exactly $M_n$. Hence, $M_n$ is a submartingale.

As already mentioned, the core problem of harmonic analysis is to study the solutions of Laplace's equation. However, even in a finite domain, there will be plenty of solutions unless we prescribe a boundary condition. This leads us to the following (discrete) Dirichlet problem for a function $f$:

$$\begin{cases} (\Delta f)(x) = 0, & x \in \Lambda, \\ f(x) = g(x), & x \in \partial\Lambda, \end{cases} \qquad (6.80)$$

where $\partial\Lambda$ is the set of sites in $\Lambda^{\mathrm{c}}$ that have a neighbor in $\Lambda$. The function $g$ is the boundary condition, which we need in order to make sense of the Laplacian at all sites of $\Lambda$. The advertised link with the theory of random walks is then provided by:

Theorem 6.42 Let $\Lambda \subset \mathbb{Z}^d$ be finite and let $T = \inf\{n \ge 0 \colon S_n \notin \Lambda\}$. Then $T < \infty$ almost surely under $P_x$ for every $x$, and

$$f(x) = E_x\, g(S_T), \qquad x \in \Lambda, \qquad (6.81)$$

is the unique solution to the Dirichlet problem (6.80) with boundary condition $g$.

Proof. By an argument similar to that used in a homework assignment, we have $P(T > n) \le e^{-\delta n}$ for some $\delta > 0$ and all $n \ge 1$. Thus $T < \infty$ almost surely.

First we will prove that the above $f$ is a solution to (6.80). To that end we let $x \in \Lambda$ and pick a nearest neighbor $y$ of $x$. Crucial for the argument will be the observation that the sequence $(S_2, S_3, \dots)$ under the measure $P_x(\cdot \mid S_1 = y)$ has the same distribution as $(S_1, S_2, \dots)$ under $P_y$. In particular, $S_T$ will be the same in both sequences. Therefore,

$$f(x) = E_x\, g(S_T) = \sum_{y \sim x} E_x\bigl(g(S_T)\, 1_{\{S_1 = y\}}\bigr) = \sum_{y \sim x} \frac{1}{2d}\, E_y\, g(S_T) = \frac{1}{2d} \sum_{y \sim x} f(y). \qquad (6.82)$$

Hence, $f$ is harmonic in $\Lambda$. But $f(x) = g(x)$ for all $x \in \partial\Lambda$, and so $f$ solves (6.80).

Next we want to show uniqueness. Let $f$ be a solution to (6.80). Then $f$ is harmonic in $\Lambda$ and so, by Lemma 6.40, $M_n = f(S_{T \wedge n})$ is a bounded martingale. Since $T < \infty$ almost surely, the Optional Stopping Theorem gives us that $E M_T = E M_0$. But $E M_0 = E f(S_0) = f(x)$, while $E M_T = E f(S_T) = E g(S_T)$. The function $f$ thus satisfies (6.81) and hence the solution to (6.80) is unique.

Remark 6.43 In dimension $d \le 2$, a similar argument allows us to conclude that the same holds even for infinite domains $\Lambda \subset \mathbb{Z}^d$. However, in dimensions $d \ge 3$ the solution to (6.80) in infinite $\Lambda$ can pick up an extra factor due to the possibility that $T = \infty$. We refer to, e.g., Lawler's book "Intersections of Random Walks" for more information.
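Theorem 6.42 translates directly into a Monte Carlo solver for the discrete Dirichlet problem: run simple random walks from $x$ until they leave $\Lambda$ and average $g$ over the exit points. The sketch below (not from the notes, assuming numpy) uses a hypothetical two-dimensional box with boundary data equal to 1 on the right side and 0 elsewhere.

```python
import numpy as np

rng = np.random.default_rng(4)
steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]        # nearest-neighbour moves on Z^2

Lambda = {(i, j) for i in range(1, 9) for j in range(1, 9)}   # the domain
def g(z):
    return 1.0 if z[0] >= 9 else 0.0              # boundary condition on the exterior

def dirichlet(x, n_walks=20000):
    """Monte Carlo estimate of f(x) = E_x g(S_T), cf. (6.81)."""
    total = 0.0
    for _ in range(n_walks):
        z = x
        while z in Lambda:                        # T = first exit time from Lambda
            di, dj = steps[rng.integers(4)]
            z = (z[0] + di, z[1] + dj)
        total += g(z)
    return total / n_walks

print(dirichlet((5, 5)))     # value of the harmonic interpolation at an interior site
```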


More information

A D VA N C E D P R O B A B I L - I T Y

A D VA N C E D P R O B A B I L - I T Y A N D R E W T U L L O C H A D VA N C E D P R O B A B I L - I T Y T R I N I T Y C O L L E G E T H E U N I V E R S I T Y O F C A M B R I D G E Contents 1 Conditional Expectation 5 1.1 Discrete Case 6 1.2

More information

25.1 Markov Chain Monte Carlo (MCMC)

25.1 Markov Chain Monte Carlo (MCMC) CS880: Approximations Algorithms Scribe: Dave Andrzejewski Lecturer: Shuchi Chawla Topic: Approx counting/sampling, MCMC methods Date: 4/4/07 The previous lecture showed that, for self-reducible problems,

More information

P(X 0 = j 0,... X nk = j k )

P(X 0 = j 0,... X nk = j k ) Introduction to Probability Example Sheet 3 - Michaelmas 2006 Michael Tehranchi Problem. Let (X n ) n 0 be a homogeneous Markov chain on S with transition matrix P. Given a k N, let Z n = X kn. Prove that

More information

4 Sums of Independent Random Variables

4 Sums of Independent Random Variables 4 Sums of Independent Random Variables Standing Assumptions: Assume throughout this section that (,F,P) is a fixed probability space and that X 1, X 2, X 3,... are independent real-valued random variables

More information

Automorphism groups of wreath product digraphs

Automorphism groups of wreath product digraphs Automorphism groups of wreath product digraphs Edward Dobson Department of Mathematics and Statistics Mississippi State University PO Drawer MA Mississippi State, MS 39762 USA dobson@math.msstate.edu Joy

More information

Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains.

Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains. Institute for Applied Mathematics WS17/18 Massimiliano Gubinelli Markov processes Course note 2. Martingale problems, recurrence properties of discrete time chains. [version 1, 2017.11.1] We introduce

More information

12 Markov chains The Markov property

12 Markov chains The Markov property 12 Markov chains Summary. The chapter begins with an introduction to discrete-time Markov chains, and to the use of matrix products and linear algebra in their study. The concepts of recurrence and transience

More information

Measure Theory and Lebesgue Integration. Joshua H. Lifton

Measure Theory and Lebesgue Integration. Joshua H. Lifton Measure Theory and Lebesgue Integration Joshua H. Lifton Originally published 31 March 1999 Revised 5 September 2004 bstract This paper originally came out of my 1999 Swarthmore College Mathematics Senior

More information

Lebesgue measure and integration

Lebesgue measure and integration Chapter 4 Lebesgue measure and integration If you look back at what you have learned in your earlier mathematics courses, you will definitely recall a lot about area and volume from the simple formulas

More information

Stochastic processes. MAS275 Probability Modelling. Introduction and Markov chains. Continuous time. Markov property

Stochastic processes. MAS275 Probability Modelling. Introduction and Markov chains. Continuous time. Markov property Chapter 1: and Markov chains Stochastic processes We study stochastic processes, which are families of random variables describing the evolution of a quantity with time. In some situations, we can treat

More information

2. Introduction to commutative rings (continued)

2. Introduction to commutative rings (continued) 2. Introduction to commutative rings (continued) 2.1. New examples of commutative rings. Recall that in the first lecture we defined the notions of commutative rings and field and gave some examples of

More information

z x = f x (x, y, a, b), z y = f y (x, y, a, b). F(x, y, z, z x, z y ) = 0. This is a PDE for the unknown function of two independent variables.

z x = f x (x, y, a, b), z y = f y (x, y, a, b). F(x, y, z, z x, z y ) = 0. This is a PDE for the unknown function of two independent variables. Chapter 2 First order PDE 2.1 How and Why First order PDE appear? 2.1.1 Physical origins Conservation laws form one of the two fundamental parts of any mathematical model of Continuum Mechanics. These

More information

ALGEBRA. 1. Some elementary number theory 1.1. Primes and divisibility. We denote the collection of integers

ALGEBRA. 1. Some elementary number theory 1.1. Primes and divisibility. We denote the collection of integers ALGEBRA CHRISTIAN REMLING 1. Some elementary number theory 1.1. Primes and divisibility. We denote the collection of integers by Z = {..., 2, 1, 0, 1,...}. Given a, b Z, we write a b if b = ac for some

More information

1 Random walks: an introduction

1 Random walks: an introduction Random Walks: WEEK Random walks: an introduction. Simple random walks on Z.. Definitions Let (ξ n, n ) be i.i.d. (independent and identically distributed) random variables such that P(ξ n = +) = p and

More information

6 Markov Chain Monte Carlo (MCMC)

6 Markov Chain Monte Carlo (MCMC) 6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution

More information

Continuum Probability and Sets of Measure Zero

Continuum Probability and Sets of Measure Zero Chapter 3 Continuum Probability and Sets of Measure Zero In this chapter, we provide a motivation for using measure theory as a foundation for probability. It uses the example of random coin tossing to

More information

Stochastic process. X, a series of random variables indexed by t

Stochastic process. X, a series of random variables indexed by t Stochastic process X, a series of random variables indexed by t X={X(t), t 0} is a continuous time stochastic process X={X(t), t=0,1, } is a discrete time stochastic process X(t) is the state at time t,

More information

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Anthony Trubiano April 11th, 2018 1 Introduction Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability

More information

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales

Lecture 6. 2 Recurrence/transience, harmonic functions and martingales Lecture 6 Classification of states We have shown that all states of an irreducible countable state Markov chain must of the same tye. This gives rise to the following classification. Definition. [Classification

More information

A NEW SET THEORY FOR ANALYSIS

A NEW SET THEORY FOR ANALYSIS Article A NEW SET THEORY FOR ANALYSIS Juan Pablo Ramírez 0000-0002-4912-2952 Abstract: We present the real number system as a generalization of the natural numbers. First, we prove the co-finite topology,

More information

Graph Theory. Thomas Bloom. February 6, 2015

Graph Theory. Thomas Bloom. February 6, 2015 Graph Theory Thomas Bloom February 6, 2015 1 Lecture 1 Introduction A graph (for the purposes of these lectures) is a finite set of vertices, some of which are connected by a single edge. Most importantly,

More information

Markov Chains on Countable State Space

Markov Chains on Countable State Space Markov Chains on Countable State Space 1 Markov Chains Introduction 1. Consider a discrete time Markov chain {X i, i = 1, 2,...} that takes values on a countable (finite or infinite) set S = {x 1, x 2,...},

More information

Irreducibility. Irreducible. every state can be reached from every other state For any i,j, exist an m 0, such that. Absorbing state: p jj =1

Irreducibility. Irreducible. every state can be reached from every other state For any i,j, exist an m 0, such that. Absorbing state: p jj =1 Irreducibility Irreducible every state can be reached from every other state For any i,j, exist an m 0, such that i,j are communicate, if the above condition is valid Irreducible: all states are communicate

More information

Homogeneous Linear Systems and Their General Solutions

Homogeneous Linear Systems and Their General Solutions 37 Homogeneous Linear Systems and Their General Solutions We are now going to restrict our attention further to the standard first-order systems of differential equations that are linear, with particular

More information

2. Two binary operations (addition, denoted + and multiplication, denoted

2. Two binary operations (addition, denoted + and multiplication, denoted Chapter 2 The Structure of R The purpose of this chapter is to explain to the reader why the set of real numbers is so special. By the end of this chapter, the reader should understand the difference between

More information

Building Infinite Processes from Finite-Dimensional Distributions

Building Infinite Processes from Finite-Dimensional Distributions Chapter 2 Building Infinite Processes from Finite-Dimensional Distributions Section 2.1 introduces the finite-dimensional distributions of a stochastic process, and shows how they determine its infinite-dimensional

More information

Last Update: March 1 2, 201 0

Last Update: March 1 2, 201 0 M ath 2 0 1 E S 1 W inter 2 0 1 0 Last Update: March 1 2, 201 0 S eries S olutions of Differential Equations Disclaimer: This lecture note tries to provide an alternative approach to the material in Sections

More information

Math 115 Spring 11 Written Homework 10 Solutions

Math 115 Spring 11 Written Homework 10 Solutions Math 5 Spring Written Homework 0 Solutions. For following its, state what indeterminate form the its are in and evaluate the its. (a) 3x 4x 4 x x 8 Solution: This is in indeterminate form 0. Algebraically,

More information

Sequence convergence, the weak T-axioms, and first countability

Sequence convergence, the weak T-axioms, and first countability Sequence convergence, the weak T-axioms, and first countability 1 Motivation Up to now we have been mentioning the notion of sequence convergence without actually defining it. So in this section we will

More information

Tree sets. Reinhard Diestel

Tree sets. Reinhard Diestel 1 Tree sets Reinhard Diestel Abstract We study an abstract notion of tree structure which generalizes treedecompositions of graphs and matroids. Unlike tree-decompositions, which are too closely linked

More information

3 The language of proof

3 The language of proof 3 The language of proof After working through this section, you should be able to: (a) understand what is asserted by various types of mathematical statements, in particular implications and equivalences;

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

MS&E 321 Spring Stochastic Systems June 1, 2013 Prof. Peter W. Glynn Page 1 of 10

MS&E 321 Spring Stochastic Systems June 1, 2013 Prof. Peter W. Glynn Page 1 of 10 MS&E 321 Spring 12-13 Stochastic Systems June 1, 2013 Prof. Peter W. Glynn Page 1 of 10 Section 3: Regenerative Processes Contents 3.1 Regeneration: The Basic Idea............................... 1 3.2

More information

Applied Stochastic Processes

Applied Stochastic Processes Applied Stochastic Processes Jochen Geiger last update: July 18, 2007) Contents 1 Discrete Markov chains........................................ 1 1.1 Basic properties and examples................................

More information

Countable state discrete time Markov Chains

Countable state discrete time Markov Chains Countable state discrete time Markov Chains Tuesday, March 18, 2014 2:12 PM Readings: Lawler Ch. 2 Karlin & Taylor Chs. 2 & 3 Resnick Ch. 1 Countably infinite state spaces are of practical utility in situations

More information

14 Branching processes

14 Branching processes 4 BRANCHING PROCESSES 6 4 Branching processes In this chapter we will consider a rom model for population growth in the absence of spatial or any other resource constraints. So, consider a population of

More information

the time it takes until a radioactive substance undergoes a decay

the time it takes until a radioactive substance undergoes a decay 1 Probabilities 1.1 Experiments with randomness Wewillusethetermexperimentinaverygeneralwaytorefertosomeprocess that produces a random outcome. Examples: (Ask class for some first) Here are some discrete

More information

IEOR 6711: Stochastic Models I, Fall 2003, Professor Whitt. Solutions to Final Exam: Thursday, December 18.

IEOR 6711: Stochastic Models I, Fall 2003, Professor Whitt. Solutions to Final Exam: Thursday, December 18. IEOR 6711: Stochastic Models I, Fall 23, Professor Whitt Solutions to Final Exam: Thursday, December 18. Below are six questions with several parts. Do as much as you can. Show your work. 1. Two-Pump Gas

More information

Stochastic process for macro

Stochastic process for macro Stochastic process for macro Tianxiao Zheng SAIF 1. Stochastic process The state of a system {X t } evolves probabilistically in time. The joint probability distribution is given by Pr(X t1, t 1 ; X t2,

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

Summary of Results on Markov Chains. Abstract

Summary of Results on Markov Chains. Abstract Summary of Results on Markov Chains Enrico Scalas 1, 1 Laboratory on Complex Systems. Dipartimento di Scienze e Tecnologie Avanzate, Università del Piemonte Orientale Amedeo Avogadro, Via Bellini 25 G,

More information

1 Random Walks and Electrical Networks

1 Random Walks and Electrical Networks CME 305: Discrete Mathematics and Algorithms Random Walks and Electrical Networks Random walks are widely used tools in algorithm design and probabilistic analysis and they have numerous applications.

More information

Notes on ordinals and cardinals

Notes on ordinals and cardinals Notes on ordinals and cardinals Reed Solomon 1 Background Terminology We will use the following notation for the common number systems: N = {0, 1, 2,...} = the natural numbers Z = {..., 2, 1, 0, 1, 2,...}

More information

Spanning, linear dependence, dimension

Spanning, linear dependence, dimension Spanning, linear dependence, dimension In the crudest possible measure of these things, the real line R and the plane R have the same size (and so does 3-space, R 3 ) That is, there is a function between

More information

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

Connection to Branching Random Walk

Connection to Branching Random Walk Lecture 7 Connection to Branching Random Walk The aim of this lecture is to prepare the grounds for the proof of tightness of the maximum of the DGFF. We will begin with a recount of the so called Dekking-Host

More information

The Markov Chain Monte Carlo Method

The Markov Chain Monte Carlo Method The Markov Chain Monte Carlo Method Idea: define an ergodic Markov chain whose stationary distribution is the desired probability distribution. Let X 0, X 1, X 2,..., X n be the run of the chain. The Markov

More information

(b) What is the variance of the time until the second customer arrives, starting empty, assuming that we measure time in minutes?

(b) What is the variance of the time until the second customer arrives, starting empty, assuming that we measure time in minutes? IEOR 3106: Introduction to Operations Research: Stochastic Models Fall 2006, Professor Whitt SOLUTIONS to Final Exam Chapters 4-7 and 10 in Ross, Tuesday, December 19, 4:10pm-7:00pm Open Book: but only

More information

Lecture 2. We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales.

Lecture 2. We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales. Lecture 2 1 Martingales We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales. 1.1 Doob s inequality We have the following maximal

More information

6.842 Randomness and Computation March 3, Lecture 8

6.842 Randomness and Computation March 3, Lecture 8 6.84 Randomness and Computation March 3, 04 Lecture 8 Lecturer: Ronitt Rubinfeld Scribe: Daniel Grier Useful Linear Algebra Let v = (v, v,..., v n ) be a non-zero n-dimensional row vector and P an n n

More information

LECTURE 10: REVIEW OF POWER SERIES. 1. Motivation

LECTURE 10: REVIEW OF POWER SERIES. 1. Motivation LECTURE 10: REVIEW OF POWER SERIES By definition, a power series centered at x 0 is a series of the form where a 0, a 1,... and x 0 are constants. For convenience, we shall mostly be concerned with the

More information

Lecture 5. 1 Chung-Fuchs Theorem. Tel Aviv University Spring 2011

Lecture 5. 1 Chung-Fuchs Theorem. Tel Aviv University Spring 2011 Random Walks and Brownian Motion Tel Aviv University Spring 20 Instructor: Ron Peled Lecture 5 Lecture date: Feb 28, 20 Scribe: Yishai Kohn In today's lecture we return to the Chung-Fuchs theorem regarding

More information

DIFFERENTIAL EQUATIONS

DIFFERENTIAL EQUATIONS DIFFERENTIAL EQUATIONS Basic Concepts Paul Dawkins Table of Contents Preface... Basic Concepts... 1 Introduction... 1 Definitions... Direction Fields... 8 Final Thoughts...19 007 Paul Dawkins i http://tutorial.math.lamar.edu/terms.aspx

More information

Introduction to Algebra: The First Week

Introduction to Algebra: The First Week Introduction to Algebra: The First Week Background: According to the thermostat on the wall, the temperature in the classroom right now is 72 degrees Fahrenheit. I want to write to my friend in Europe,

More information

Modern Discrete Probability Spectral Techniques

Modern Discrete Probability Spectral Techniques Modern Discrete Probability VI - Spectral Techniques Background Sébastien Roch UW Madison Mathematics December 22, 2014 1 Review 2 3 4 Mixing time I Theorem (Convergence to stationarity) Consider a finite

More information