MARKOV CHAINS AND HIDDEN MARKOV MODELS

MERYL SEAH

Abstract. This is an expository paper outlining the basics of Markov chains. We start the paper by explaining what a finite Markov chain is. Then we describe what a stationary distribution is and show that every irreducible and aperiodic Markov chain has a unique stationary distribution. Next we talk about mixing. Then we briefly talk about an application of Markov chains, which is the use of hidden Markov models.

Contents

1. Finite Markov Chains
2. Stationary Distributions
3. Mixing
4. Ergodic Theorem
Acknowledgments
References

1. Finite Markov Chains

We will start with a definition and then dive into an example that will make the definition easier to picture. This expository paper follows the book on Markov chains by Levin, Peres, and Wilmer, listed in the references.

Definition 1.1. A sequence of random variables $(X_t)$ is a Markov chain with state space $\Omega$ and transition matrix $P$ if for all $x, y \in \Omega$, all $t \geq 1$, and all events $H_{t-1} = \bigcap_{s=0}^{t-1} \{X_s = x_s\}$ satisfying $\mathbf{P}(H_{t-1} \cap \{X_t = x\}) > 0$, we have

(1.2)  $\mathbf{P}\{X_{t+1} = y \mid H_{t-1} \cap \{X_t = x\}\} = \mathbf{P}\{X_{t+1} = y \mid X_t = x\} = P(x, y).$

The above equation means that the probability of moving to state $y$, given that we are currently in state $x$, does not depend on the sequence of states preceding $x$.

A finite Markov chain is best explained through an example. We will parameterize the space of all two-state Markov chains by using the classic example of a frog jumping between two lily pads. We will denote one lily pad $l$ for left and the other $r$ for right. Suppose that every morning the frog either stays on the lily pad it is on or jumps to the other one.

If the frog is on the right lily pad, then it will jump to the left lily pad with probability $p$. If the frog is on the left lily pad, then it will jump to the right lily pad with probability $q$. Then $\Omega = \{l, r\}$. Let $(X_0, X_1, X_2, \dots)$ be the sequence of lily pads that the frog sat on on day 0, day 1, day 2, and so on. Based on the probabilities set up in the problem, the sequence $(X_0, X_1, \dots)$ is a Markov chain with transition matrix

(1.3)  $P = \begin{pmatrix} P(r, r) & P(r, l) \\ P(l, r) & P(l, l) \end{pmatrix} = \begin{pmatrix} 1 - p & p \\ q & 1 - q \end{pmatrix}.$

Suppose the frog starts day 0 on the right lily pad. We can store our distribution information in a row vector

(1.4)  $\mu_t = (\mathbf{P}\{X_t = r \mid X_0 = r\}, \ \mathbf{P}\{X_t = l \mid X_0 = r\}).$

Then $\mu_1 = \mu_0 P$ and $\mu_{t+1} = \mu_t P$. Continuing to multiply by $P$ gives us

(1.5)  $\mu_t = \mu_0 P^t$ for all $t \geq 0$.

2. Stationary Distributions

Suppose we have the matrix $\begin{pmatrix} 1/3 & 2/3 \\ 2/3 & 1/3 \end{pmatrix}$. When we raise this matrix to a high power, the 100th for example, we notice that it approaches $\begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}$. We use this observation to begin our discussion of stationary distributions.

Definition 2.1. A stationary distribution of a Markov chain is a probability distribution $\pi$ satisfying

(2.2)  $\pi = \pi P,$

where $P$ is the transition matrix of the Markov chain.

Definition 2.3. For $x \in \Omega$, the hitting time for $x$ is $\tau_x = \min\{t \geq 0 : X_t = x\}$. In other words, it is the time at which the chain first visits state $x$. (When the chain starts at $x$ itself, we interpret $\tau_x$ as the first return time $\min\{t \geq 1 : X_t = x\}$, so that $E_x(\tau_x)$ is the expected return time to $x$.)

We will now demonstrate that stationary distributions exist. First, we begin with a lemma about irreducible chains and expected hitting times.

Definition 2.4. A chain $P$ is irreducible if for any two states $x, y \in \Omega$ there exists an integer $t$ such that $P^t(x, y) > 0$. In other words, starting from any state, it is possible to get to any other state using transitions of positive probability.

Lemma 2.5. For any states $x$ and $y$ of an irreducible chain, $E_x(\tau_y) < \infty$.

Proof. By the definition of irreducibility, there exist an integer $r > 0$ and a real $\varepsilon > 0$ such that for any states $z, w \in \Omega$ there is a $j \leq r$ with $P^j(z, w) > \varepsilon$. Hence, wherever the chain currently is, it visits $y$ within the next $r$ steps with probability at least $\varepsilon$, so $\mathbf{P}_x\{\tau_y > kr\} \leq (1 - \varepsilon)^k$ for all $k \geq 0$. Since $\mathbf{P}_x\{\tau_y > t\}$ is non-increasing in $t$,

$E_x(\tau_y) = \sum_{t=0}^{\infty} \mathbf{P}_x\{\tau_y > t\} \leq r \sum_{k=0}^{\infty} \mathbf{P}_x\{\tau_y > kr\} \leq r \sum_{k=0}^{\infty} (1 - \varepsilon)^k < \infty.$ □

Proposition 2.6 (Existence of a Stationary Distribution). Let $P$ be the transition matrix of an irreducible Markov chain. Then (1) there exists a probability distribution $\pi$ on $\Omega$ such that $\pi = \pi P$ and $\pi(x) > 0$ for all $x \in \Omega$, and (2) $\pi(x) = \frac{1}{E_x(\tau_x)}$.

Proof. Let $z \in \Omega$ be an arbitrary state of the Markov chain. This proof will look at the average time the chain spends at each state before returning to $z$. We define

$\tilde\pi(y) := E_z(\text{number of visits to } y \text{ before returning to } z) = \sum_{t=0}^{\infty} \mathbf{P}_z\{X_t = y, \ \tau_z > t\}.$

Then $\tilde\pi(y) \leq E_z(\tau_z)$, so by the lemma we know that $\tilde\pi(y) < \infty$ for all $y \in \Omega$. From how we defined $\tilde\pi(y)$, we know that

(2.7)  $\sum_{x \in \Omega} \tilde\pi(x) P(x, y) = \sum_{x \in \Omega} \sum_{t=0}^{\infty} \mathbf{P}_z\{X_t = x, \ \tau_z > t\} P(x, y).$

Since the event $\{\tau_z \geq t + 1\} = \{\tau_z > t\}$ is determined by $X_0, \dots, X_t$, we know that

(2.8)  $\mathbf{P}_z\{X_t = x, \ X_{t+1} = y, \ \tau_z \geq t + 1\} = \mathbf{P}_z\{X_t = x, \ \tau_z \geq t + 1\} P(x, y).$

Therefore, combining the two equations,

$\sum_{x \in \Omega} \tilde\pi(x) P(x, y) = \sum_{t=0}^{\infty} \mathbf{P}_z\{X_{t+1} = y, \ \tau_z \geq t + 1\} = \sum_{t=1}^{\infty} \mathbf{P}_z\{X_t = y, \ \tau_z \geq t\}.$

We know that

$\sum_{t=1}^{\infty} \mathbf{P}_z\{X_t = y, \ \tau_z \geq t\} = \tilde\pi(y) - \mathbf{P}_z\{X_0 = y, \ \tau_z > 0\} + \sum_{t=1}^{\infty} \mathbf{P}_z\{X_t = y, \ \tau_z = t\} = \tilde\pi(y) - \mathbf{P}_z\{X_0 = y\} + \mathbf{P}_z\{X_{\tau_z} = y\}.$

Suppose $y = z$. Then since $X_0 = z$ and $X_{\tau_z} = z$, the probabilities $\mathbf{P}_z\{X_0 = y\}$ and $\mathbf{P}_z\{X_{\tau_z} = y\}$ are both 1, so they cancel each other out. Now suppose $y \neq z$. Then both $\mathbf{P}_z\{X_0 = y\}$ and $\mathbf{P}_z\{X_{\tau_z} = y\}$ are 0. So,

(2.9)  $\sum_{t=1}^{\infty} \mathbf{P}_z\{X_t = y, \ \tau_z \geq t\} = \tilde\pi(y).$

Therefore, it follows that $\tilde\pi = \tilde\pi P$. We then normalize by $\sum_{x} \tilde\pi(x) = E_z(\tau_z)$, so

(2.10)  $\pi(x) = \frac{\tilde\pi(x)}{E_z(\tau_z)},$

which satisfies $\pi = \pi P$, showing that $\pi$ is a stationary distribution and proving the first part of the proposition. Moreover, taking $z = x$, the number of visits to $x$ before returning to $x$ is exactly 1, so $\tilde\pi(x) = 1$ and, for any $x \in \Omega$, the stationary probability of $x$ is

(2.11)  $\pi(x) = \frac{1}{E_x(\tau_x)}.$ □
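Part (2) of Proposition 2.6 is easy to check numerically. The following is a minimal simulation sketch, not part of the original paper: it assumes Python, uses the frog chain of Section 1 with illustrative values $p = 0.3$, $q = 0.4$ (my choice, not the paper's), estimates the expected return time $E_x(\tau_x)$ by simulation, and compares $1/E_x(\tau_x)$ against the exact stationary distribution $\pi = (q, p)/(p + q)$ of the two-state chain.

```python
import random

# Frog chain of Section 1, with states 0 = right, 1 = left.
# The values of p and q are illustrative, not taken from the paper.
p, q = 0.3, 0.4
P = [[1 - p, p],
     [q, 1 - q]]

def return_time(x, rng):
    """Run the chain from x until it first returns to x; report that time."""
    state, t = x, 0
    while True:
        state = 0 if rng.random() < P[state][0] else 1
        t += 1
        if state == x:
            return t

rng = random.Random(0)
trials = 100_000
for x in (0, 1):
    mean_tau = sum(return_time(x, rng) for _ in range(trials)) / trials
    print(f"state {x}: 1/E_x(tau_x) ~ {1 / mean_tau:.4f}")

# Exact stationary distribution of the two-state chain, for comparison.
print("exact pi:", (q / (p + q), p / (p + q)))
```

With these values $\pi \approx (0.5714, 0.4286)$, and both simulated reciprocals should land within simulation error of the exact entries.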

Now we will demonstrate the uniqueness of the stationary distribution. First, we begin with some definitions.

Definition 2.12. A function $h : \Omega \to \mathbb{R}$ is harmonic at $x$ if $h(x) = \sum_{y \in \Omega} P(x, y) h(y)$.

Definition 2.13. A function is harmonic on $D \subseteq \Omega$ if it is harmonic at every state $x \in D$.

Remark 2.14. If $h$ is regarded as a column vector, then a function which is harmonic on $\Omega$ satisfies $Ph = h$.

Lemma 2.15. Suppose that $P$ is irreducible. A function $h$ which is harmonic at every point of $\Omega$ is constant.

Proof. $\Omega$ is finite, so there exists a state $x_0$ such that $h(x_0) = M$ is maximal. Suppose there exists some state $z$ with $P(x_0, z) > 0$ for which $h(z) < M$. Then

(2.16)  $h(x_0) = P(x_0, z) h(z) + \sum_{y \neq z} P(x_0, y) h(y) < M.$

However, since $h(x_0) = M$, this is a contradiction. So we know that $h(z) = M$ for all states $z$ such that $P(x_0, z) > 0$. Let $y \in \Omega$. Since the chain is irreducible, there exists a sequence $x_0, x_1, x_2, \dots, x_n = y$ such that $P(x_i, x_{i+1}) > 0$. Following the same logic that we used to show that $h(z) = M$, it follows that $h(y) = h(x_{n-1}) = \dots = h(x_0) = M$. Therefore, $h$ is constant. □

Corollary 2.17. Let $P$ be the transition matrix of an irreducible Markov chain. Then there exists a unique probability distribution $\pi$ satisfying $\pi = \pi P$.

Proof. We already know that there exists a probability distribution satisfying $\pi = \pi P$ because we proved the existence of a stationary distribution. By the lemma, the only harmonic functions are the constants, so the kernel of $P - I$, where $I$ is the identity matrix, has dimension 1. By the rank-nullity theorem, the column rank of $P - I$ is $|\Omega| - 1$. We know that the row rank of a square matrix is equal to its column rank. This means that the space of row vectors $v$ solving $v = vP$ also has dimension 1, so there is only one solution vector whose entries sum to 1. □

3. Mixing

Definition 3.1. The total variation distance between two probability distributions $\mu$ and $\nu$ on $\Omega$ is defined by

(3.2)  $\|\mu - \nu\|_{TV} = \max_{A \subseteq \Omega} |\mu(A) - \nu(A)|.$

In other words, the total variation distance between two probability distributions is the maximum difference between the probabilities the two distributions assign to a single event.

Definition 3.3. The variance of a random variable $X$ is defined to be

(3.4)  $\mathrm{Var}(X) = E((X - E(X))^2).$
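Corollary 2.17 also gives a practical recipe for computing $\pi$: the solution space of $v = vP$ is one-dimensional, so appending the normalization $\sum_x \pi(x) = 1$ pins $\pi$ down. The sketch below is an illustration under stated assumptions (Python with NumPy; the frog chain with the same illustrative $p$, $q$ as before), and it also evaluates the total variation distance of Definition 3.1 by brute force over all events $A \subseteq \Omega$.

```python
import itertools
import numpy as np

# Frog chain again, with the same illustrative p and q.
p, q = 0.3, 0.4
P = np.array([[1 - p, p],
              [q, 1 - q]])

# Corollary 2.17: solve pi (P - I) = 0 together with sum(pi) = 1.
A = np.vstack([(P - np.eye(2)).T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print("stationary pi:", pi)

# Definition 3.1: ||mu - nu||_TV = max over events A of |mu(A) - nu(A)|,
# computed here by enumerating every subset of the state space.
def tv(mu, nu):
    states = range(len(mu))
    return max(abs(sum(mu[i] - nu[i] for i in event))
               for r in range(len(mu) + 1)
               for event in itertools.combinations(states, r))

mu0 = np.array([1.0, 0.0])  # the frog starts on the right lily pad
for t in (0, 1, 5, 10):
    mu_t = mu0 @ np.linalg.matrix_power(P, t)
    print(f"t = {t:2d}: ||mu_t - pi||_TV = {tv(mu_t, pi):.6f}")
```

Enumerating subsets is only viable for tiny $\Omega$; Proposition 3.12 below shows the same quantity equals half an $\ell^1$ distance, which is how one computes it in practice.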

Definition 3.5. The period of state $x$ is defined to be the greatest common divisor of $\mathcal{T}(x) := \{t \geq 1 : P^t(x, x) > 0\}$, the set of times at which it is possible for the chain to return to the starting position $x$.

Remark 3.6. The chain is called aperiodic if all states have period 1. The chain is periodic if it is not aperiodic.

Now we state a lemma from number theory that we will not prove, as it is outside the scope of this paper.

Lemma 3.7. Any set of non-negative integers which is closed under addition and has greatest common divisor 1 must contain all but finitely many of the non-negative integers.

Proposition 3.8. If $P$ is aperiodic and irreducible, then there exists an integer $r$ such that $P^t(x, y) > 0$ for all $x, y \in \Omega$ and all $t \geq r$.

Proof. We know from the lemma that any set of non-negative integers which is closed under addition and has greatest common divisor 1 must contain all but finitely many of the non-negative integers. Since the chain is aperiodic, the greatest common divisor of $\mathcal{T}(x)$ is 1. Let $s, t \in \mathcal{T}(x)$. Then $P^{s+t}(x, x) \geq P^s(x, x) P^t(x, x) > 0$, so $s + t \in \mathcal{T}(x)$. Therefore, the set $\mathcal{T}(x)$ is closed under addition. It then follows that there exists a $t(x)$ such that $t \geq t(x)$ implies $t \in \mathcal{T}(x)$. By the definition of irreducibility, we know that for all $y \in \Omega$ there exists $r = r(x, y)$ such that $P^r(x, y) > 0$. Therefore, for $t \geq t(x) + r$,

(3.9)  $P^t(x, y) \geq P^{t-r}(x, x) P^r(x, y) > 0.$

So for $t \geq t'(x) := t(x) + \max_{y \in \Omega} r(x, y)$, we have $P^t(x, y) > 0$ for all $y \in \Omega$. If $t \geq \max_{x \in \Omega} t'(x)$, then $P^t(x, y) > 0$ for all $x, y \in \Omega$. □

Definition 3.10. A matrix $P$ is stochastic if its entries are all non-negative and

(3.11)  $\sum_{y \in \Omega} P(x, y) = 1$ for all $x \in \Omega$.

Proposition 3.12. Let $\mu$ and $\nu$ be two probability distributions on $\Omega$. Then

(3.13)  $\|\mu - \nu\|_{TV} = \frac{1}{2} \sum_{x \in \Omega} |\mu(x) - \nu(x)|.$

Proof. Let $B = \{x : \mu(x) \geq \nu(x)\}$ and let $A \subseteq \Omega$. Since any $x \in A \cap B^c$ satisfies $\mu(x) - \nu(x) < 0$, eliminating the elements of $A \cap B^c$ cannot decrease the difference in probability, and adding the remaining elements of $B$ cannot decrease it either. So

(3.14)  $\mu(A) - \nu(A) \leq \mu(A \cap B) - \nu(A \cap B) \leq \mu(B) - \nu(B).$

Using the same logic, it follows that

(3.15)  $\nu(A) - \mu(A) \leq \nu(B^c) - \mu(B^c).$

Since $\nu(B^c) - \mu(B^c) = \mu(B) - \nu(B)$, taking $A = B$ makes $|\mu(A) - \nu(A)|$ equal to this upper bound. Thus,

(3.16)  $\|\mu - \nu\|_{TV} = \frac{1}{2} \left[ \mu(B) - \nu(B) + \nu(B^c) - \mu(B^c) \right] = \frac{1}{2} \sum_{x \in \Omega} |\mu(x) - \nu(x)|.$ □

Remark 3.17. It follows from this proposition and the triangle inequality for real numbers that total variation distance satisfies the triangle inequality. In other words,

(3.18)  $\|\mu - \nu\|_{TV} \leq \|\mu - \eta\|_{TV} + \|\eta - \nu\|_{TV}.$

Theorem 3.19 (Convergence Theorem). Suppose that $P$ is irreducible and aperiodic, with stationary distribution $\pi$. Then there exist constants $\alpha \in (0, 1)$ and $C > 0$ such that

(3.20)  $\max_{x \in \Omega} \|P^t(x, \cdot) - \pi\|_{TV} \leq C \alpha^t.$

Proof. Since $P$ is irreducible and aperiodic, there exists an $r$ such that $P^r$ has strictly positive entries. Let $\Pi$ be the matrix with $|\Omega|$ rows, each of them equal to the row vector $\pi$. For small enough $\delta > 0$ we have, for all $x, y \in \Omega$,

(3.21)  $P^r(x, y) \geq \delta \pi(y).$

Let $\theta = 1 - \delta$. Then a stochastic matrix $Q$ is defined by the equation

(3.22)  $P^r = (1 - \theta) \Pi + \theta Q.$

Since $\Pi$ is made up of the row vector $\pi$, we know that $\Pi M = \Pi$ for any matrix $M$ such that $\pi M = \pi$. We also know that $M \Pi = \Pi$ if $M$ is a stochastic matrix, because its rows sum to 1. Now, we will show by induction that for $k \geq 1$,

(3.23)  $P^{rk} = (1 - \theta^k) \Pi + \theta^k Q^k.$

If $k = 1$, this equation holds because of how we defined $Q$. Suppose that the claim is true for $k = n$, so that

(3.24)  $P^{rn} = (1 - \theta^n) \Pi + \theta^n Q^n.$

Then

$P^{r(n+1)} = P^{rn} P^r = [(1 - \theta^n) \Pi + \theta^n Q^n] P^r = (1 - \theta^n) \Pi P^r + \theta^n Q^n [(1 - \theta) \Pi + \theta Q] = (1 - \theta^n) \Pi P^r + (1 - \theta) \theta^n Q^n \Pi + \theta^{n+1} Q^{n+1}.$

Since $\pi P^r = \pi$ and $Q^n$ is stochastic, we know that $\Pi P^r = \Pi$ and $Q^n \Pi = \Pi$, so

(3.25)  $P^{r(n+1)} = [1 - \theta^{n+1}] \Pi + \theta^{n+1} Q^{n+1}.$

Therefore, for all $k \geq 1$, $P^{rk} = (1 - \theta^k) \Pi + \theta^k Q^k$.

Now we multiply by $P^j$ and rearrange to get

(3.26)  $P^{rk+j} - \Pi = \theta^k (Q^k P^j - \Pi).$

Now we add the absolute values of the entries in row $x_0$ for the matrix on each side of the equation and divide by 2. The left-hand side becomes $\|P^{rk+j}(x_0, \cdot) - \pi\|_{TV}$, and on the right-hand side $\|Q^k P^j(x_0, \cdot) - \pi\|_{TV}$ is at most 1, the largest possible total variation distance. So

(3.27)  $\|P^{rk+j}(x_0, \cdot) - \pi\|_{TV} \leq \theta^k.$

Writing any $t$ as $t = rk + j$ with $0 \leq j < r$, this gives (3.20) with $\alpha = \theta^{1/r}$ and a suitable constant $C$. □

Definition 3.28. The maximum distance over $x_0 \in \Omega$ between $P^t(x_0, \cdot)$ and $\pi$ is denoted by

(3.29)  $d(t) := \max_{x \in \Omega} \|P^t(x, \cdot) - \pi\|_{TV}.$

We also define

(3.30)  $\bar{d}(t) := \max_{x, y \in \Omega} \|P^t(x, \cdot) - P^t(y, \cdot)\|_{TV}.$

Lemma 3.31. If $d(t)$ and $\bar{d}(t)$ are defined as above, then

(3.32)  $d(t) \leq \bar{d}(t) \leq 2 d(t).$

Proof. The bound $\bar{d}(t) \leq 2 d(t)$ follows from the triangle inequality (3.18). For $d(t) \leq \bar{d}(t)$: since $\pi$ is stationary, we know that

(3.33)  $\pi(A) = \sum_{y \in \Omega} \pi(y) P^t(y, A)$

for any set $A$. So it follows that

$\|P^t(x, \cdot) - \pi\|_{TV} = \max_{A \subseteq \Omega} |P^t(x, A) - \pi(A)| = \max_{A \subseteq \Omega} \Big| \sum_{y \in \Omega} \pi(y) [P^t(x, A) - P^t(y, A)] \Big|.$

So by the triangle inequality, this is at most

$\sum_{y \in \Omega} \pi(y) \max_{A \subseteq \Omega} |P^t(x, A) - P^t(y, A)| = \sum_{y \in \Omega} \pi(y) \|P^t(x, \cdot) - P^t(y, \cdot)\|_{TV}.$

Since the average of a set of numbers cannot be greater than the maximum of that set, $\sum_{y} \pi(y) \|P^t(x, \cdot) - P^t(y, \cdot)\|_{TV}$ is bounded by $\max_{y} \|P^t(x, \cdot) - P^t(y, \cdot)\|_{TV} \leq \bar{d}(t)$. □

Mixing time is a way to measure the time it takes for the Markov chain to get close to the stationary distribution.

Definition 3.34. The mixing time of a Markov chain is defined by

(3.35)  $t_{mix}(\varepsilon) := \min\{t : d(t) \leq \varepsilon\},$

and

(3.36)  $t_{mix} := t_{mix}(1/4).$

The $1/4$ is an arbitrary number, chosen because it is less than $1/2$.
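To make Definition 3.34 concrete, here is a numerical sketch under stated assumptions: Python with NumPy, and a lazy simple random walk on a 12-cycle, a chain chosen purely for illustration (the paper does not treat it). The code computes $d(t)$ via the half-$\ell^1$ form of Proposition 3.12 and reads off $t_{mix}(1/4)$; the printed ratios also exhibit the geometric decay promised by Theorem 3.19.

```python
import numpy as np

# Lazy simple random walk on an n-cycle: stay put with probability 1/2,
# otherwise step to a uniformly chosen neighbor. Irreducible and aperiodic.
n = 12
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[i, (i - 1) % n] = 0.25
    P[i, (i + 1) % n] = 0.25
pi = np.full(n, 1.0 / n)  # P is doubly stochastic, so pi is uniform

def d(t):
    """d(t) = max_x ||P^t(x, .) - pi||_TV, via Proposition 3.12."""
    Pt = np.linalg.matrix_power(P, t)
    return max(0.5 * np.abs(Pt[x] - pi).sum() for x in range(n))

t_mix = 0
while d(t_mix) > 0.25:  # d is non-increasing (cf. Lemma 3.37 below)
    t_mix += 1
print("t_mix(1/4) =", t_mix)

for t in range(t_mix, t_mix + 5):
    print(f"d({t}) = {d(t):.5f}, d({t+1})/d({t}) = {d(t + 1) / d(t):.4f}")
```

For this reversible chain the ratios settle near the second-largest eigenvalue of $P$, which plays the role of $\alpha$ in Theorem 3.19.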

Lemma 3.37. Let $P$ be the transition matrix of a Markov chain with state space $\Omega$. Let $\mu$ and $\nu$ be two distributions on $\Omega$. Then

(3.38)  $\|\mu P - \nu P\|_{TV} \leq \|\mu - \nu\|_{TV}.$

Proof.

$\|\mu P - \nu P\|_{TV} = \frac{1}{2} \sum_{x} |\mu P(x) - \nu P(x)| = \frac{1}{2} \sum_{x} \Big| \sum_{y} \mu(y) P(y, x) - \sum_{y} \nu(y) P(y, x) \Big| = \frac{1}{2} \sum_{x} \Big| \sum_{y} P(y, x) [\mu(y) - \nu(y)] \Big| \leq \frac{1}{2} \sum_{x} \sum_{y} P(y, x) |\mu(y) - \nu(y)| = \frac{1}{2} \sum_{y} |\mu(y) - \nu(y)| \sum_{x} P(y, x) = \frac{1}{2} \sum_{y} |\mu(y) - \nu(y)| = \|\mu - \nu\|_{TV}.$ □

Remark 3.39. By this lemma, it is clear that when $c$ and $t$ are non-negative integers,

(3.40)  $d(ct) \leq \bar{d}(ct) \leq \bar{d}(t)^c.$

Thus, from the above remark and Lemma 3.31, it follows that

(3.41)  $d(\ell \, t_{mix}(\varepsilon)) \leq \bar{d}(\ell \, t_{mix}(\varepsilon)) \leq \bar{d}(t_{mix}(\varepsilon))^{\ell} \leq (2\varepsilon)^{\ell}.$

Plugging in $\varepsilon = 1/4$, we get

(3.42)  $d(\ell \, t_{mix}) \leq 2^{-\ell},$

and from the definition of mixing time, we get that

(3.43)  $t_{mix}(\varepsilon) \leq \lceil \log_2 \varepsilon^{-1} \rceil \, t_{mix}.$

4. Ergodic Theorem

Theorem 4.1 (Strong Law of Large Numbers). Let $Z_1, Z_2, \dots$ be a sequence of random variables with $E(Z_s) = 0$ for all $s$ and

(4.2)  $\mathrm{Var}(Z_{s+1} + \dots + Z_{s+k}) \leq Ck$

for all $s$ and $k$. Then

(4.3)  $\mathbf{P}\Big\{ \lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} Z_s = 0 \Big\} = 1.$

We will not be proving the Strong Law of Large Numbers, as it is outside the scope of this paper.
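Before developing the proof machinery in Lemma 4.4 below, here is a numerical preview of what the Ergodic Theorem (Theorem 4.9) asserts: the time average of $f$ along a single trajectory converges to $E_\pi(f)$. This sketch again assumes Python and reuses the frog chain with the illustrative $p$, $q$ from before, taking $f$ to be the indicator of the right lily pad, so that $E_\pi(f) = \pi(r) = q/(p+q)$.

```python
import random

# One long trajectory of the frog chain; f is the indicator of state 0.
p, q = 0.3, 0.4
P = [[1 - p, p], [q, 1 - q]]
f = [1.0, 0.0]

rng = random.Random(1)
state, total, steps = 0, 0.0, 200_000
for _ in range(steps):
    total += f[state]
    state = 0 if rng.random() < P[state][0] else 1

print("time average of f:", total / steps)
print("E_pi(f) = q/(p+q):", q / (p + q))
```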

Lemma 4.4. Let $(a_n)$ be a bounded sequence. If, for a sequence of integers $(n_k)$ satisfying $\lim_{k \to \infty} n_k / n_{k+1} = 1$, we have

(4.5)  $\lim_{k \to \infty} \frac{a_1 + \dots + a_{n_k}}{n_k} = a,$

then

(4.6)  $\lim_{n \to \infty} \frac{a_1 + \dots + a_n}{n} = a.$

Proof. We begin by defining $A_n := \frac{1}{n} \sum_{k=1}^{n} a_k$. Let $n_k \leq m < n_{k+1}$. Then

(4.7)  $A_m = \frac{n_k}{m} A_{n_k} + \frac{1}{m} \sum_{j = n_k + 1}^{m} a_j.$

The fraction $n_k / m$ tends to 1, since $n_k / n_{k+1} \leq n_k / m \leq 1$. So if $|a_j|$ is bounded by $B$, then $\frac{1}{m} \big| \sum_{j = n_k + 1}^{m} a_j \big|$ is bounded by

(4.8)  $B \Big( \frac{n_{k+1} - n_k}{n_k} \Big),$

which tends to 0. So $A_m \to a$. □

Theorem 4.9 (Ergodic Theorem). Let $f$ be a real-valued function defined on $\Omega$. If $(X_t)$ is an irreducible, aperiodic Markov chain, then for any starting distribution $\mu$,

(4.10)  $\mathbf{P}_\mu \Big\{ \lim_{t \to \infty} \frac{1}{t} \sum_{s=0}^{t-1} f(X_s) = E_\pi(f) \Big\} = 1.$

Proof. Suppose that the chain starts at state $x$. We define

(4.11)  $\tau_{x,k} := \min\{t > \tau_{x,(k-1)} : X_t = x\},$

and set $\tau_{x,0} := 0$. Every time the chain visits state $x$, it starts afresh, so the blocks $(X_{\tau_{x,(k-1)}}, X_{\tau_{x,(k-1)}+1}, \dots, X_{\tau_{x,k}-1})$ for $k \geq 1$ are independent of each other. So, if

(4.12)  $Y_k := \sum_{s = \tau_{x,(k-1)}}^{\tau_{x,k} - 1} f(X_s),$

then the sequence $(Y_k)$ is independent and identically distributed. If $S_t = \sum_{s=0}^{t-1} f(X_s)$, then $S_{\tau_{x,n}} = \sum_{k=1}^{n} Y_k$, and by the Strong Law of Large Numbers,

(4.13)  $\mathbf{P}_x \Big\{ \lim_{n \to \infty} \frac{S_{\tau_{x,n}}}{n} = E_x(Y_1) \Big\} = 1.$

Using the Strong Law of Large Numbers again,

(4.14)  $\mathbf{P}_x \Big\{ \lim_{n \to \infty} \frac{\tau_{x,n}}{n} = E_x(\tau_x) \Big\} = 1.$

So by division,

(4.15)  $\mathbf{P}_x \Big\{ \lim_{n \to \infty} \frac{S_{\tau_{x,n}}}{\tau_{x,n}} = \frac{E_x(Y_1)}{E_x(\tau_x)} \Big\} = 1.$

Since

$E_x(Y_1) = E_x \Big( \sum_{s=0}^{\tau_x - 1} f(X_s) \Big) = E_x \Big( \sum_{y \in \Omega} f(y) \sum_{s=0}^{\tau_x - 1} \mathbf{1}\{X_s = y\} \Big) = \sum_{y \in \Omega} f(y) \, E_x \Big( \sum_{s=0}^{\tau_x - 1} \mathbf{1}\{X_s = y\} \Big),$

and from the proof of Proposition 2.6 we know that the inner expectation is $\tilde\pi(y) = \pi(y) E_x(\tau_x)$, it follows that

(4.16)  $E_x(Y_1) = E_\pi(f) \, E_x(\tau_x).$

So,

(4.17)  $\mathbf{P}_x \Big\{ \lim_{n \to \infty} \frac{S_{\tau_{x,n}}}{\tau_{x,n}} = E_\pi(f) \Big\} = 1.$

By Lemma 4.4, applied along the subsequence $n_k = \tau_{x,k}$, the theorem is true when $\mu$ equals the probability distribution with unit mass at $x$. The proof is then completed by averaging over the starting state. □

Acknowledgments. It is a pleasure to thank my mentor, Jonathan DeWitt, for guiding me through this process. I would also like to thank the professors who taught me in this program, as well as the authors of the papers I referenced throughout the writing process.

References

[1] David Levin, Yuval Peres, and Elizabeth Wilmer. Markov Chains and Mixing Times. http://pages.uoregon.edu/dlevin/markov/markovmixing.pdf
[2] L. R. Rabiner and B. H. Juang. An Introduction to Hidden Markov Models. http://www.cs.umb.edu/~rvetro/vetrobiocomp/hmm/rabiner1986
[3] Persi Diaconis. The Markov Chain Monte Carlo Revolution. https://math.uchicago.edu/~shmuel/network-course-readings/mcmcrev.pdf
[4] J. Chang. Stochastic Processes. http://www.stat.yale.edu/~pollard/courses/251.spring09/handouts/changnotes.pdf