MARKOV CHAINS AND HIDDEN MARKOV MODELS

MERYL SEAH

Abstract. This is an expository paper outlining the basics of Markov chains. We start the paper by explaining what a finite Markov chain is. Then we describe what a stationary distribution is and show that every irreducible and aperiodic Markov chain has a unique stationary distribution. Next we talk about mixing. Then we briefly talk about an application of Markov chains, which is the use of hidden Markov models.

Contents

1. Finite Markov Chains
2. Stationary Distributions
3. Mixing
4. Ergodic Theorem
Acknowledgments
References

1. Finite Markov Chains

We will start with a definition and then dive into an example that will make the definition easier to picture. This expository paper follows the book on Markov chains by Levin, Peres, and Wilmer, listed in the references.

Definition 1.1. A sequence of random variables $(X_t)$ is a Markov chain with state space $\Omega$ and transition matrix $P$ if for all $x, y \in \Omega$, all $t \geq 1$, and all events $H_{t-1} = \bigcap_{s=0}^{t-1} \{X_s = x_s\}$ satisfying $\mathbf{P}(H_{t-1} \cap \{X_t = x\}) > 0$, we have

(1.2)  $\mathbf{P}\{X_{t+1} = y \mid H_{t-1} \cap \{X_t = x\}\} = \mathbf{P}\{X_{t+1} = y \mid X_t = x\} = P(x, y).$

The above equation means that the probability of moving to state $y$, given that we are currently in state $x$, does not depend on the sequence of states preceding $x$.

A finite Markov chain is best explained through an example. We will parameterize the space of all two-state Markov chains by using the classic example of a frog jumping between two lily pads. We will denote one lily pad $l$ for left and the other $r$ for right. Suppose that every morning the frog either stays on the lily pad it is on or jumps to the other one.

If the frog is on the right lily pad, then it will jump to the left lily pad with probability $p$. If the frog is on the left lily pad, then it will jump to the right lily pad with probability $q$. Then $\Omega = \{l, r\}$. Let $(X_0, X_1, X_2, \dots)$ be the sequence of lily pads that the frog sat on on day 0, day 1, day 2, and so on. Based on the probabilities set up in the problem, the sequence $(X_0, X_1, \dots)$ is a Markov chain with transition matrix

(1.3)  $P = \begin{pmatrix} P(r, r) & P(r, l) \\ P(l, r) & P(l, l) \end{pmatrix} = \begin{pmatrix} 1 - p & p \\ q & 1 - q \end{pmatrix}.$

Suppose the frog starts day 0 on the right lily pad. We can store our distribution information in a row vector

(1.4)  $\mu_t = (\mathbf{P}\{X_t = r \mid X_0 = r\}, \ \mathbf{P}\{X_t = l \mid X_0 = r\}).$

Then $\mu_1 = \mu_0 P$ and $\mu_{t+1} = \mu_t P$. Continuing to multiply by $P$ gives us

(1.5)  $\mu_t = \mu_0 P^t$ for all $t \geq 0$.

2. Stationary Distributions

Suppose we have the matrix $\begin{pmatrix} 1/3 & 2/3 \\ 2/3 & 1/3 \end{pmatrix}$. When we raise this matrix to a high power, the 100th for example, we notice that it approaches $\begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}$. We use this observation to begin our discussion of stationary distributions.

Definition 2.1. A stationary distribution of a Markov chain is a probability distribution $\pi$ satisfying

(2.2)  $\pi = \pi P,$

where $P$ is the transition matrix of the Markov chain.

Definition 2.3. For $x \in \Omega$, the hitting time for $x$ is $\tau_x = \min\{t \geq 0 : X_t = x\}$. In other words, it is the time at which the chain first visits state $x$. (When the chain starts at $x$ itself, we interpret $\tau_x$ as the first return time $\min\{t \geq 1 : X_t = x\}$, so that $E_x(\tau_x)$ is the expected return time to $x$.)

We will now demonstrate that stationary distributions exist. First, we begin with a lemma about irreducible chains and expected hitting times.

Definition 2.4. A chain $P$ is irreducible if for any two states $x, y \in \Omega$ there exists an integer $t$ such that $P^t(x, y) > 0$. In other words, starting from any state, it is possible to get to any other state using transitions of positive probability.

Lemma 2.5. For any states $x$ and $y$ of an irreducible chain, $E_x(\tau_y) < \infty$.

Proof. By the definition of irreducibility, there exist an integer $r > 0$ and a real $\varepsilon > 0$ such that for any states $z, w \in \Omega$ there is a $j \leq r$ with $P^j(z, w) > \varepsilon$. Hence, wherever the chain currently is, it visits $y$ within the next $r$ steps with probability at least $\varepsilon$, so $\mathbf{P}_x\{\tau_y > kr\} \leq (1 - \varepsilon)^k$ for all $k \geq 0$. Since $\mathbf{P}_x\{\tau_y > t\}$ is non-increasing in $t$,

$E_x(\tau_y) = \sum_{t=0}^{\infty} \mathbf{P}_x\{\tau_y > t\} \leq r \sum_{k=0}^{\infty} \mathbf{P}_x\{\tau_y > kr\} \leq r \sum_{k=0}^{\infty} (1 - \varepsilon)^k < \infty.$ □

Proposition 2.6 (Existence of a Stationary Distribution). Let $P$ be the transition matrix of an irreducible Markov chain. Then (1) there exists a probability distribution $\pi$ on $\Omega$ such that $\pi = \pi P$ and $\pi(x) > 0$ for all $x \in \Omega$, and (2) $\pi(x) = \frac{1}{E_x(\tau_x)}$.

Proof. Let $z \in \Omega$ be an arbitrary state of the Markov chain. This proof will look at the average time the chain spends at each state before returning to $z$. We define

$\tilde\pi(y) := E_z(\text{number of visits to } y \text{ before returning to } z) = \sum_{t=0}^{\infty} \mathbf{P}_z\{X_t = y, \ \tau_z > t\}.$

Then $\tilde\pi(y) \leq E_z(\tau_z)$, so by the lemma we know that $\tilde\pi(y) < \infty$ for all $y \in \Omega$. From how we defined $\tilde\pi(y)$, we know that

(2.7)  $\sum_{x \in \Omega} \tilde\pi(x) P(x, y) = \sum_{x \in \Omega} \sum_{t=0}^{\infty} \mathbf{P}_z\{X_t = x, \ \tau_z > t\} P(x, y).$

Since the event $\{\tau_z \geq t + 1\} = \{\tau_z > t\}$ is determined by $X_0, \dots, X_t$, we know that

(2.8)  $\mathbf{P}_z\{X_t = x, \ X_{t+1} = y, \ \tau_z \geq t + 1\} = \mathbf{P}_z\{X_t = x, \ \tau_z \geq t + 1\} P(x, y).$

Therefore, combining the two equations,

$\sum_{x \in \Omega} \tilde\pi(x) P(x, y) = \sum_{t=0}^{\infty} \mathbf{P}_z\{X_{t+1} = y, \ \tau_z \geq t + 1\} = \sum_{t=1}^{\infty} \mathbf{P}_z\{X_t = y, \ \tau_z \geq t\}.$

We know that

$\sum_{t=1}^{\infty} \mathbf{P}_z\{X_t = y, \ \tau_z \geq t\} = \tilde\pi(y) - \mathbf{P}_z\{X_0 = y, \ \tau_z > 0\} + \sum_{t=1}^{\infty} \mathbf{P}_z\{X_t = y, \ \tau_z = t\} = \tilde\pi(y) - \mathbf{P}_z\{X_0 = y\} + \mathbf{P}_z\{X_{\tau_z} = y\}.$

Suppose $y = z$. Then since $X_0 = z$ and $X_{\tau_z} = z$, the probabilities $\mathbf{P}_z\{X_0 = y\}$ and $\mathbf{P}_z\{X_{\tau_z} = y\}$ are both 1, so they cancel each other out. Now suppose $y \neq z$. Then both $\mathbf{P}_z\{X_0 = y\}$ and $\mathbf{P}_z\{X_{\tau_z} = y\}$ are 0. So,

(2.9)  $\sum_{t=1}^{\infty} \mathbf{P}_z\{X_t = y, \ \tau_z \geq t\} = \tilde\pi(y).$

Therefore, it follows that $\tilde\pi = \tilde\pi P$. We then normalize by $\sum_{x} \tilde\pi(x) = E_z(\tau_z)$, so

(2.10)  $\pi(x) = \frac{\tilde\pi(x)}{E_z(\tau_z)},$

which satisfies $\pi = \pi P$, showing that $\pi$ is a stationary distribution and proving the first part of the proposition. Moreover, taking $z = x$, the number of visits to $x$ before returning to $x$ is exactly 1, so $\tilde\pi(x) = 1$ and, for any $x \in \Omega$, the stationary probability of $x$ is

(2.11)  $\pi(x) = \frac{1}{E_x(\tau_x)}.$ □
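Part (2) of Proposition 2.6 is easy to check numerically. The following is a minimal simulation sketch, not part of the original paper: it assumes Python, uses the frog chain of Section 1 with illustrative values $p = 0.3$, $q = 0.4$ (my choice, not the paper's), estimates the expected return time $E_x(\tau_x)$ by simulation, and compares $1/E_x(\tau_x)$ against the exact stationary distribution $\pi = (q, p)/(p + q)$ of the two-state chain.

```python
import random

# Frog chain of Section 1, with states 0 = right, 1 = left.
# The values of p and q are illustrative, not taken from the paper.
p, q = 0.3, 0.4
P = [[1 - p, p],
     [q, 1 - q]]

def return_time(x, rng):
    """Run the chain from x until it first returns to x; report that time."""
    state, t = x, 0
    while True:
        state = 0 if rng.random() < P[state][0] else 1
        t += 1
        if state == x:
            return t

rng = random.Random(0)
trials = 100_000
for x in (0, 1):
    mean_tau = sum(return_time(x, rng) for _ in range(trials)) / trials
    print(f"state {x}: 1/E_x(tau_x) ~ {1 / mean_tau:.4f}")

# Exact stationary distribution of the two-state chain, for comparison.
print("exact pi:", (q / (p + q), p / (p + q)))
```

With these values $\pi \approx (0.5714, 0.4286)$, and both simulated reciprocals should land within simulation error of the exact entries.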

Now we will demonstrate the uniqueness of the stationary distribution. First, we begin with some definitions.

Definition 2.12. A function $h : \Omega \to \mathbb{R}$ is harmonic at $x$ if $h(x) = \sum_{y \in \Omega} P(x, y) h(y)$.

Definition 2.13. A function is harmonic on $D \subseteq \Omega$ if it is harmonic at every state $x \in D$.

Remark 2.14. If $h$ is regarded as a column vector, then a function which is harmonic on $\Omega$ satisfies $Ph = h$.

Lemma 2.15. Suppose that $P$ is irreducible. A function $h$ which is harmonic at every point of $\Omega$ is constant.

Proof. $\Omega$ is finite, so there exists a state $x_0$ such that $h(x_0) = M$ is maximal. Suppose there exists some state $z$ with $P(x_0, z) > 0$ for which $h(z) < M$. Then

(2.16)  $h(x_0) = P(x_0, z) h(z) + \sum_{y \neq z} P(x_0, y) h(y) < M.$

However, since $h(x_0) = M$, this is a contradiction. So we know that $h(z) = M$ for all states $z$ such that $P(x_0, z) > 0$. Let $y \in \Omega$. Since the chain is irreducible, there exists a sequence $x_0, x_1, x_2, \dots, x_n = y$ such that $P(x_i, x_{i+1}) > 0$. Following the same logic that we used to show that $h(z) = M$, it follows that $h(y) = h(x_{n-1}) = \dots = h(x_0) = M$. Therefore, $h$ is constant. □

Corollary 2.17. Let $P$ be the transition matrix of an irreducible Markov chain. Then there exists a unique probability distribution $\pi$ satisfying $\pi = \pi P$.

Proof. We already know that there exists a probability distribution satisfying $\pi = \pi P$ because we proved the existence of a stationary distribution. By the lemma, the only harmonic functions are the constants, so the kernel of $P - I$, where $I$ is the identity matrix, has dimension 1. By the rank-nullity theorem, the column rank of $P - I$ is $|\Omega| - 1$. We know that the row rank of a square matrix is equal to its column rank. This means that the space of row vectors $v$ solving $v = vP$ also has dimension 1, so there is only one solution vector whose entries sum to 1. □

3. Mixing

Definition 3.1. The total variation distance between two probability distributions $\mu$ and $\nu$ on $\Omega$ is defined by

(3.2)  $\|\mu - \nu\|_{TV} = \max_{A \subseteq \Omega} |\mu(A) - \nu(A)|.$

In other words, the total variation distance between two probability distributions is the maximum difference between the probabilities the two distributions assign to a single event.

Definition 3.3. The variance of a random variable $X$ is defined to be

(3.4)  $\mathrm{Var}(X) = E((X - E(X))^2).$
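Corollary 2.17 also gives a practical recipe for computing $\pi$: the solution space of $v = vP$ is one-dimensional, so appending the normalization $\sum_x \pi(x) = 1$ pins $\pi$ down. The sketch below is an illustration under stated assumptions (Python with NumPy; the frog chain with the same illustrative $p$, $q$ as before), and it also evaluates the total variation distance of Definition 3.1 by brute force over all events $A \subseteq \Omega$.

```python
import itertools
import numpy as np

# Frog chain again, with the same illustrative p and q.
p, q = 0.3, 0.4
P = np.array([[1 - p, p],
              [q, 1 - q]])

# Corollary 2.17: solve pi (P - I) = 0 together with sum(pi) = 1.
A = np.vstack([(P - np.eye(2)).T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print("stationary pi:", pi)

# Definition 3.1: ||mu - nu||_TV = max over events A of |mu(A) - nu(A)|,
# computed here by enumerating every subset of the state space.
def tv(mu, nu):
    states = range(len(mu))
    return max(abs(sum(mu[i] - nu[i] for i in event))
               for r in range(len(mu) + 1)
               for event in itertools.combinations(states, r))

mu0 = np.array([1.0, 0.0])  # the frog starts on the right lily pad
for t in (0, 1, 5, 10):
    mu_t = mu0 @ np.linalg.matrix_power(P, t)
    print(f"t = {t:2d}: ||mu_t - pi||_TV = {tv(mu_t, pi):.6f}")
```

Enumerating subsets is only viable for tiny $\Omega$; Proposition 3.12 below shows the same quantity equals half an $\ell^1$ distance, which is how one computes it in practice.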

Definition 3.5. The period of state $x$ is defined to be the greatest common divisor of $\mathcal{T}(x) := \{t \geq 1 : P^t(x, x) > 0\}$, the set of times at which it is possible for the chain to return to the starting position $x$.

Remark 3.6. The chain is called aperiodic if all states have period 1. The chain is periodic if it is not aperiodic.

Now we state a lemma from number theory that we will not prove, as it is outside the scope of this paper.

Lemma 3.7. Any set of non-negative integers which is closed under addition and has greatest common divisor 1 must contain all but finitely many of the non-negative integers.

Proposition 3.8. If $P$ is aperiodic and irreducible, then there exists an integer $r$ such that $P^t(x, y) > 0$ for all $x, y \in \Omega$ and all $t \geq r$.

Proof. We know from the lemma that any set of non-negative integers which is closed under addition and has greatest common divisor 1 must contain all but finitely many of the non-negative integers. Since the chain is aperiodic, the greatest common divisor of $\mathcal{T}(x)$ is 1. Let $s, t \in \mathcal{T}(x)$. Then $P^{s+t}(x, x) \geq P^s(x, x) P^t(x, x) > 0$, so $s + t \in \mathcal{T}(x)$. Therefore, the set $\mathcal{T}(x)$ is closed under addition. It then follows that there exists a $t(x)$ such that $t \geq t(x)$ implies $t \in \mathcal{T}(x)$. By the definition of irreducibility, we know that for all $y \in \Omega$ there exists $r = r(x, y)$ such that $P^r(x, y) > 0$. Therefore, for $t \geq t(x) + r$,

(3.9)  $P^t(x, y) \geq P^{t-r}(x, x) P^r(x, y) > 0.$

So for $t \geq t'(x) := t(x) + \max_{y \in \Omega} r(x, y)$, we have $P^t(x, y) > 0$ for all $y \in \Omega$. If $t \geq \max_{x \in \Omega} t'(x)$, then $P^t(x, y) > 0$ for all $x, y \in \Omega$. □

Definition 3.10. A matrix $P$ is stochastic if its entries are all non-negative and

(3.11)  $\sum_{y \in \Omega} P(x, y) = 1$ for all $x \in \Omega$.

Proposition 3.12. Let $\mu$ and $\nu$ be two probability distributions on $\Omega$. Then

(3.13)  $\|\mu - \nu\|_{TV} = \frac{1}{2} \sum_{x \in \Omega} |\mu(x) - \nu(x)|.$

Proof. Let $B = \{x : \mu(x) \geq \nu(x)\}$ and let $A \subseteq \Omega$. Since any $x \in A \cap B^c$ satisfies $\mu(x) - \nu(x) < 0$, eliminating the elements of $A \cap B^c$ cannot decrease the difference in probability, and adding the remaining elements of $B$ cannot decrease it either. So

(3.14)  $\mu(A) - \nu(A) \leq \mu(A \cap B) - \nu(A \cap B) \leq \mu(B) - \nu(B).$

Using the same logic, it follows that

(3.15)  $\nu(A) - \mu(A) \leq \nu(B^c) - \mu(B^c).$

Since $\nu(B^c) - \mu(B^c) = \mu(B) - \nu(B)$, taking $A = B$ makes $|\mu(A) - \nu(A)|$ equal to this upper bound. Thus,

(3.16)  $\|\mu - \nu\|_{TV} = \frac{1}{2} \left[ \mu(B) - \nu(B) + \nu(B^c) - \mu(B^c) \right] = \frac{1}{2} \sum_{x \in \Omega} |\mu(x) - \nu(x)|.$ □

Remark 3.17. It follows from this proposition and the triangle inequality for real numbers that total variation distance satisfies the triangle inequality. In other words,

(3.18)  $\|\mu - \nu\|_{TV} \leq \|\mu - \eta\|_{TV} + \|\eta - \nu\|_{TV}.$

Theorem 3.19 (Convergence Theorem). Suppose that $P$ is irreducible and aperiodic, with stationary distribution $\pi$. Then there exist constants $\alpha \in (0, 1)$ and $C > 0$ such that

(3.20)  $\max_{x \in \Omega} \|P^t(x, \cdot) - \pi\|_{TV} \leq C \alpha^t.$

Proof. Since $P$ is irreducible and aperiodic, there exists an $r$ such that $P^r$ has strictly positive entries. Let $\Pi$ be the matrix with $|\Omega|$ rows, each of them equal to the row vector $\pi$. For small enough $\delta > 0$ we have, for all $x, y \in \Omega$,

(3.21)  $P^r(x, y) \geq \delta \pi(y).$

Let $\theta = 1 - \delta$. Then a stochastic matrix $Q$ is defined by the equation

(3.22)  $P^r = (1 - \theta) \Pi + \theta Q.$

Since $\Pi$ is made up of the row vector $\pi$, we know that $\Pi M = \Pi$ for any matrix $M$ such that $\pi M = \pi$. We also know that $M \Pi = \Pi$ if $M$ is a stochastic matrix, because its rows sum to 1. Now, we will show by induction that for $k \geq 1$,

(3.23)  $P^{rk} = (1 - \theta^k) \Pi + \theta^k Q^k.$

If $k = 1$, this equation holds because of how we defined $Q$. Suppose that the claim is true for $k = n$, so that

(3.24)  $P^{rn} = (1 - \theta^n) \Pi + \theta^n Q^n.$

Then

$P^{r(n+1)} = P^{rn} P^r = [(1 - \theta^n) \Pi + \theta^n Q^n] P^r = (1 - \theta^n) \Pi P^r + \theta^n Q^n [(1 - \theta) \Pi + \theta Q] = (1 - \theta^n) \Pi P^r + (1 - \theta) \theta^n Q^n \Pi + \theta^{n+1} Q^{n+1}.$

Since $\pi P^r = \pi$ and $Q^n$ is stochastic, we know that $\Pi P^r = \Pi$ and $Q^n \Pi = \Pi$, so

(3.25)  $P^{r(n+1)} = [1 - \theta^{n+1}] \Pi + \theta^{n+1} Q^{n+1}.$

Therefore, for all $k \geq 1$, $P^{rk} = (1 - \theta^k) \Pi + \theta^k Q^k$.

Now we multiply by $P^j$ and rearrange to get

(3.26)  $P^{rk+j} - \Pi = \theta^k (Q^k P^j - \Pi).$

Now we add the absolute values of the entries in row $x_0$ for the matrix on each side of the equation and divide by 2. The left-hand side becomes $\|P^{rk+j}(x_0, \cdot) - \pi\|_{TV}$, and on the right-hand side $\|Q^k P^j(x_0, \cdot) - \pi\|_{TV}$ is at most 1, the largest possible total variation distance. So

(3.27)  $\|P^{rk+j}(x_0, \cdot) - \pi\|_{TV} \leq \theta^k.$

Writing any $t$ as $t = rk + j$ with $0 \leq j < r$, this gives (3.20) with $\alpha = \theta^{1/r}$ and a suitable constant $C$. □

Definition 3.28. The maximum distance over $x_0 \in \Omega$ between $P^t(x_0, \cdot)$ and $\pi$ is denoted by

(3.29)  $d(t) := \max_{x \in \Omega} \|P^t(x, \cdot) - \pi\|_{TV}.$

We also define

(3.30)  $\bar{d}(t) := \max_{x, y \in \Omega} \|P^t(x, \cdot) - P^t(y, \cdot)\|_{TV}.$

Lemma 3.31. If $d(t)$ and $\bar{d}(t)$ are defined as above, then

(3.32)  $d(t) \leq \bar{d}(t) \leq 2 d(t).$

Proof. The bound $\bar{d}(t) \leq 2 d(t)$ follows from the triangle inequality (3.18). For $d(t) \leq \bar{d}(t)$: since $\pi$ is stationary, we know that

(3.33)  $\pi(A) = \sum_{y \in \Omega} \pi(y) P^t(y, A)$

for any set $A$. So it follows that

$\|P^t(x, \cdot) - \pi\|_{TV} = \max_{A \subseteq \Omega} |P^t(x, A) - \pi(A)| = \max_{A \subseteq \Omega} \Big| \sum_{y \in \Omega} \pi(y) [P^t(x, A) - P^t(y, A)] \Big|.$

So by the triangle inequality, this is at most

$\sum_{y \in \Omega} \pi(y) \max_{A \subseteq \Omega} |P^t(x, A) - P^t(y, A)| = \sum_{y \in \Omega} \pi(y) \|P^t(x, \cdot) - P^t(y, \cdot)\|_{TV}.$

Since the average of a set of numbers cannot be greater than the maximum of that set, $\sum_{y} \pi(y) \|P^t(x, \cdot) - P^t(y, \cdot)\|_{TV}$ is bounded by $\max_{y} \|P^t(x, \cdot) - P^t(y, \cdot)\|_{TV} \leq \bar{d}(t)$. □

Mixing time is a way to measure the time it takes for the Markov chain to get close to the stationary distribution.

Definition 3.34. The mixing time of a Markov chain is defined by

(3.35)  $t_{mix}(\varepsilon) := \min\{t : d(t) \leq \varepsilon\},$

and

(3.36)  $t_{mix} := t_{mix}(1/4).$

The $1/4$ is an arbitrary number, chosen because it is less than $1/2$.
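To make Definition 3.34 concrete, here is a numerical sketch under stated assumptions: Python with NumPy, and a lazy simple random walk on a 12-cycle, a chain chosen purely for illustration (the paper does not treat it). The code computes $d(t)$ via the half-$\ell^1$ form of Proposition 3.12 and reads off $t_{mix}(1/4)$; the printed ratios also exhibit the geometric decay promised by Theorem 3.19.

```python
import numpy as np

# Lazy simple random walk on an n-cycle: stay put with probability 1/2,
# otherwise step to a uniformly chosen neighbor. Irreducible and aperiodic.
n = 12
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[i, (i - 1) % n] = 0.25
    P[i, (i + 1) % n] = 0.25
pi = np.full(n, 1.0 / n)  # P is doubly stochastic, so pi is uniform

def d(t):
    """d(t) = max_x ||P^t(x, .) - pi||_TV, via Proposition 3.12."""
    Pt = np.linalg.matrix_power(P, t)
    return max(0.5 * np.abs(Pt[x] - pi).sum() for x in range(n))

t_mix = 0
while d(t_mix) > 0.25:  # d is non-increasing (cf. Lemma 3.37 below)
    t_mix += 1
print("t_mix(1/4) =", t_mix)

for t in range(t_mix, t_mix + 5):
    print(f"d({t}) = {d(t):.5f}, d({t+1})/d({t}) = {d(t + 1) / d(t):.4f}")
```

For this reversible chain the ratios settle near the second-largest eigenvalue of $P$, which plays the role of $\alpha$ in Theorem 3.19.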

Lemma 3.37. Let $P$ be the transition matrix of a Markov chain with state space $\Omega$. Let $\mu$ and $\nu$ be two distributions on $\Omega$. Then

(3.38)  $\|\mu P - \nu P\|_{TV} \leq \|\mu - \nu\|_{TV}.$

Proof.

$\|\mu P - \nu P\|_{TV} = \frac{1}{2} \sum_{x} |\mu P(x) - \nu P(x)| = \frac{1}{2} \sum_{x} \Big| \sum_{y} \mu(y) P(y, x) - \sum_{y} \nu(y) P(y, x) \Big| = \frac{1}{2} \sum_{x} \Big| \sum_{y} P(y, x) [\mu(y) - \nu(y)] \Big| \leq \frac{1}{2} \sum_{x} \sum_{y} P(y, x) |\mu(y) - \nu(y)| = \frac{1}{2} \sum_{y} |\mu(y) - \nu(y)| \sum_{x} P(y, x) = \frac{1}{2} \sum_{y} |\mu(y) - \nu(y)| = \|\mu - \nu\|_{TV}.$ □

Remark 3.39. By this lemma, it is clear that when $c$ and $t$ are non-negative integers,

(3.40)  $d(ct) \leq \bar{d}(ct) \leq \bar{d}(t)^c.$

Thus, from the above remark and Lemma 3.31, it follows that

(3.41)  $d(\ell \, t_{mix}(\varepsilon)) \leq \bar{d}(\ell \, t_{mix}(\varepsilon)) \leq \bar{d}(t_{mix}(\varepsilon))^{\ell} \leq (2\varepsilon)^{\ell}.$

Plugging in $\varepsilon = 1/4$, we get

(3.42)  $d(\ell \, t_{mix}) \leq 2^{-\ell},$

and from the definition of mixing time, we get that

(3.43)  $t_{mix}(\varepsilon) \leq \lceil \log_2 \varepsilon^{-1} \rceil \, t_{mix}.$

4. Ergodic Theorem

Theorem 4.1 (Strong Law of Large Numbers). Let $Z_1, Z_2, \dots$ be a sequence of random variables with $E(Z_s) = 0$ for all $s$ and

(4.2)  $\mathrm{Var}(Z_{s+1} + \dots + Z_{s+k}) \leq Ck$

for all $s$ and $k$. Then

(4.3)  $\mathbf{P}\Big\{ \lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} Z_s = 0 \Big\} = 1.$

We will not be proving the Strong Law of Large Numbers, as it is outside the scope of this paper.
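Before developing the proof machinery in Lemma 4.4 below, here is a numerical preview of what the Ergodic Theorem (Theorem 4.9) asserts: the time average of $f$ along a single trajectory converges to $E_\pi(f)$. This sketch again assumes Python and reuses the frog chain with the illustrative $p$, $q$ from before, taking $f$ to be the indicator of the right lily pad, so that $E_\pi(f) = \pi(r) = q/(p+q)$.

```python
import random

# One long trajectory of the frog chain; f is the indicator of state 0.
p, q = 0.3, 0.4
P = [[1 - p, p], [q, 1 - q]]
f = [1.0, 0.0]

rng = random.Random(1)
state, total, steps = 0, 0.0, 200_000
for _ in range(steps):
    total += f[state]
    state = 0 if rng.random() < P[state][0] else 1

print("time average of f:", total / steps)
print("E_pi(f) = q/(p+q):", q / (p + q))
```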

Lemma 4.4. Let $(a_n)$ be a bounded sequence. If, for a sequence of integers $(n_k)$ satisfying $\lim_{k \to \infty} n_k / n_{k+1} = 1$, we have

(4.5)  $\lim_{k \to \infty} \frac{a_1 + \dots + a_{n_k}}{n_k} = a,$

then

(4.6)  $\lim_{n \to \infty} \frac{a_1 + \dots + a_n}{n} = a.$

Proof. We begin by defining $A_n := \frac{1}{n} \sum_{k=1}^{n} a_k$. Let $n_k \leq m < n_{k+1}$. Then

(4.7)  $A_m = \frac{n_k}{m} A_{n_k} + \frac{1}{m} \sum_{j = n_k + 1}^{m} a_j.$

The fraction $n_k / m$ tends to 1, since $n_k / n_{k+1} \leq n_k / m \leq 1$. So if $|a_j|$ is bounded by $B$, then $\frac{1}{m} \big| \sum_{j = n_k + 1}^{m} a_j \big|$ is bounded by

(4.8)  $B \Big( \frac{n_{k+1} - n_k}{n_k} \Big),$

which tends to 0. So $A_m \to a$. □

Theorem 4.9 (Ergodic Theorem). Let $f$ be a real-valued function defined on $\Omega$. If $(X_t)$ is an irreducible, aperiodic Markov chain, then for any starting distribution $\mu$,

(4.10)  $\mathbf{P}_\mu \Big\{ \lim_{t \to \infty} \frac{1}{t} \sum_{s=0}^{t-1} f(X_s) = E_\pi(f) \Big\} = 1.$

Proof. Suppose that the chain starts at state $x$. We define

(4.11)  $\tau_{x,k} := \min\{t > \tau_{x,(k-1)} : X_t = x\},$

and set $\tau_{x,0} := 0$. Every time the chain visits state $x$, it starts afresh, so the blocks $(X_{\tau_{x,(k-1)}}, X_{\tau_{x,(k-1)}+1}, \dots, X_{\tau_{x,k}-1})$ for $k \geq 1$ are independent of each other. So, if

(4.12)  $Y_k := \sum_{s = \tau_{x,(k-1)}}^{\tau_{x,k} - 1} f(X_s),$

then the sequence $(Y_k)$ is independent and identically distributed. If $S_t = \sum_{s=0}^{t-1} f(X_s)$, then $S_{\tau_{x,n}} = \sum_{k=1}^{n} Y_k$, and by the Strong Law of Large Numbers,

(4.13)  $\mathbf{P}_x \Big\{ \lim_{n \to \infty} \frac{S_{\tau_{x,n}}}{n} = E_x(Y_1) \Big\} = 1.$

Using the Strong Law of Large Numbers again,

(4.14)  $\mathbf{P}_x \Big\{ \lim_{n \to \infty} \frac{\tau_{x,n}}{n} = E_x(\tau_x) \Big\} = 1.$

So by division,

(4.15)  $\mathbf{P}_x \Big\{ \lim_{n \to \infty} \frac{S_{\tau_{x,n}}}{\tau_{x,n}} = \frac{E_x(Y_1)}{E_x(\tau_x)} \Big\} = 1.$

Since

$E_x(Y_1) = E_x \Big( \sum_{s=0}^{\tau_x - 1} f(X_s) \Big) = E_x \Big( \sum_{y \in \Omega} f(y) \sum_{s=0}^{\tau_x - 1} \mathbf{1}\{X_s = y\} \Big) = \sum_{y \in \Omega} f(y) \, E_x \Big( \sum_{s=0}^{\tau_x - 1} \mathbf{1}\{X_s = y\} \Big),$

and from the proof of Proposition 2.6 we know that the inner expectation is $\tilde\pi(y) = \pi(y) E_x(\tau_x)$, it follows that

(4.16)  $E_x(Y_1) = E_\pi(f) \, E_x(\tau_x).$

So,

(4.17)  $\mathbf{P}_x \Big\{ \lim_{n \to \infty} \frac{S_{\tau_{x,n}}}{\tau_{x,n}} = E_\pi(f) \Big\} = 1.$

By Lemma 4.4, applied along the subsequence $n_k = \tau_{x,k}$, the theorem is true when $\mu$ equals the probability distribution with unit mass at $x$. The proof is then completed by averaging over the starting state. □

Acknowledgments. It is a pleasure to thank my mentor, Jonathan DeWitt, for guiding me through this process. I would also like to thank the professors who taught me in this program, as well as the authors of the papers I referenced throughout the writing process.

References

[1] David Levin, Yuval Peres, and Elizabeth Wilmer. Markov Chains and Mixing Times. http://pages.uoregon.edu/dlevin/markov/markovmixing.pdf
[2] L. R. Rabiner and B. H. Juang. An Introduction to Hidden Markov Models. http://www.cs.umb.edu/~rvetro/vetrobiocomp/hmm/rabiner1986
[3] Persi Diaconis. The Markov Chain Monte Carlo Revolution. https://math.uchicago.edu/~shmuel/network-course-readings/mcmcrev.pdf
[4] J. Chang. Stochastic Processes. http://www.stat.yale.edu/~pollard/courses/251.spring09/handouts/changnotes.pdf