Summary of Results on Markov Chains

Enrico Scalas
Laboratory on Complex Systems, Dipartimento di Scienze e Tecnologie Avanzate, Università del Piemonte Orientale Amedeo Avogadro, Via Bellini 25 G, 15100 Alessandria, Italy
(Dated: August 30, 2007)

Abstract: These short lecture notes contain a summary of results on the elementary theory of Markov chains. The purpose of these notes is to let the reader understand as quickly as possible the concept of statistical equilibrium, based on the stationary distribution of homogeneous Markov chains. Some exercises related to these notes can be found in a separate document.

PACS numbers: 02.50.-r, 02.50.Ey, 05.40.-a, 05.40.Jc
Electronic address: enrico.scalas@mfn.unipmn.it; URL: www.mfn.unipmn.it/~scalas
I. INTRODUCTION

Many models used in Economics, in Physics or in other sciences are instances of Markov chains. This is the case of Schelling's model [1] or of the closely related Ising model [2] with the usual Monte Carlo dynamics [3]. Economists will find further motivation to study Markov chains in a recent book by Aoki and Yoshikawa [4]. Markov chains have the advantage that their theory can be introduced and many results can be proven in the framework of the elementary theory of probability, without extensively using measure-theoretic tools. In order to compile the present summary, the books by Hoel et al., by Kemeny and Laurie Snell, by Durrett and by Çinlar [5-8] have been consulted. These notes can be considered as a summary of the first two chapters of Hoel et al.

In this summary, random variables will be denoted by capital letters X, Y, ... and their values by small letters x, y, .... In order to define a Markov chain, a random variable X_n will be considered that can assume values in a finite or at most denumerable set of states S at instants denoted by the subscript n = 0, 1, 2, .... This subscript will always be interpreted as a discrete-time index. It will be further assumed that

P(X_{n+1} = x_{n+1} | X_0 = x_0, ..., X_n = x_n) = P(X_{n+1} = x_{n+1} | X_n = x_n),   (1)

for every choice of the non-negative integer n and of the values x_0, ..., x_n belonging to S. P(· | ·) is a conditional probability. The meaning of equation (1) is that the probability of X_{n+1} does not depend on the past history, but only on the value of X_n; this equation, the so-called Markov property, can be used to define Markov chains. The conditional probabilities P(X_{n+1} = x_{n+1} | X_n = x_n) are called transition probabilities. If they do not depend on n, they are stationary (or homogeneous) transition probabilities and the corresponding Markov chains are stationary (or homogeneous) Markov chains.

II. PROPERTIES OF MARKOV CHAINS

A. Transitions and initial distribution

The transition function P(x, y) of a Markov chain X_n is defined as

P(x, y) = P(X_1 = y | X_0 = x),   x, y ∈ S.   (2)
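The transition-probability structure above is easy to make concrete. Below is a minimal sketch (with a made-up two-state chain; the names `P`, `step` and `sample_path` are ours, not from the notes) of how a homogeneous chain is simulated: each new state is drawn using only the current state, which is exactly the Markov property.

```python
import random

# A hypothetical two-state chain on S = {0, 1}; rows of P sum to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def step(x, rng):
    """Draw X_{n+1} given X_n = x by inverting the row P(x, .)."""
    u, acc = rng.random(), 0.0
    for y, p in enumerate(P[x]):
        acc += p
        if u < acc:
            return y
    return len(P[x]) - 1  # guard against floating-point round-off

def sample_path(x0, n, seed=0):
    """Simulate X_0, X_1, ..., X_n starting from X_0 = x0."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n):
        path.append(step(path[-1], rng))
    return path

path = sample_path(0, 10)
```

Because the same transition function P is used at every step, the simulated chain is homogeneous; a non-homogeneous chain would need a different transition function at each n.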
The values of P(x, y) are non-negative and the sum over the final states y of P(x, y) is 1. In the finite case with M states, this function can be represented as a square M × M matrix with non-negative matrix elements and with rows summing up to 1. For a stationary Markov chain, one has that

P(X_{n+1} = y | X_n = x) = P(x, y),   n ≥ 1,   (3)

the initial distribution is

π_0(x) = P(X_0 = x),   (4)

and the joint probability distribution P(X_0 = x_0, X_1 = x_1, ..., X_n = x_n) can be expressed as a product of π_0(x) and P(x, y)'s in the following way:

P(X_0 = x_0, X_1 = x_1, ..., X_n = x_n) = π_0(x_0) P(x_0, x_1) ··· P(x_{n-1}, x_n).   (5)

The m-step transition function P^m(x, y) is the probability of going from state x to state y in m steps. It is given by

P^m(x, y) = Σ_{y_1} ··· Σ_{y_{m-1}} P(x, y_1) P(y_1, y_2) ··· P(y_{m-2}, y_{m-1}) P(y_{m-1}, y)   (6)

for m ≥ 2; for m = 1, it coincides with P(x, y) and for m = 0, it is 1 if x = y and 0 otherwise. The following three formulae involving P^m(x, y) are useful in the theory of Markov chains:

P^{n+m}(x, y) = Σ_z P^n(x, z) P^m(z, y),   (7)

P(X_n = y) = Σ_x π_0(x) P^n(x, y),   (8)

P(X_{n+1} = y) = Σ_x P(X_n = x) P(x, y).   (9)

B. Hitting times and classification of states

Given a subset of states A, the hitting time T_A is defined as

T_A = min{n > 0 : X_n ∈ A}.   (10)

Thanks to the concept of hitting time, it is possible to classify the states of Markov chains in a very useful way. Let P_x(·) denote the probability of an event for a Markov chain starting at state x. Then one has the following formula for the n-step transition function:

P^n(x, y) = Σ_{m=1}^{n} P_x(T_y = m) P^{n-m}(y, y).   (11)
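Equation (7), the Chapman-Kolmogorov equation, says that the m-step transition function is the m-th matrix power of P. A small self-contained numerical check (on a hypothetical two-state chain of our own; no claim is made beyond this example):

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, m):
    """m-step transition matrix; P^0 is the identity, matching the m = 0 convention."""
    n = len(A)
    R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(m):
        R = mat_mul(R, A)
    return R

P = [[0.9, 0.1],
     [0.5, 0.5]]

lhs = mat_pow(P, 5)                          # P^{n+m} with n = 2, m = 3
rhs = mat_mul(mat_pow(P, 2), mat_pow(P, 3))  # sum_z P^n(x,z) P^m(z,y)
```

Each row of P^m still sums to 1, as it must for a transition function.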
An absorbing state of a Markov chain is a state a for which P(a, a) = 1 or, equivalently, P(a, y) = 0 for any state y ≠ a. If the chain reaches such a state, it is trapped there and will never leave. For an absorbing state, it turns out that P^n(x, a) = P_x(T_a ≤ n) for n ≥ 1. The quantity

ρ_xy = P_x(T_y < ∞)   (12)

can be used to introduce two classes of states. ρ_yy is the probability that a chain starting at y will ever return to y. A state y is recurrent if ρ_yy = 1 and transient if ρ_yy < 1. For a transient state, there is a positive probability of never returning. An absorbing state is recurrent. The indicator function I_y(z) helps in defining the counting random variable N(y). The indicator function I_y(X_n) is 1 if X_n = y and 0 otherwise; therefore

N(y) = Σ_{n=1}^{∞} I_y(X_n)   (13)

counts the number of times the chain reaches state y. The event {N(y) ≥ 1} coincides with the event {T_y < ∞}. Therefore, one can write

P_x(N(y) ≥ 1) = P_x(T_y < ∞) = ρ_xy.   (14)

By induction, one can prove that for m ≥ 1

P_x(N(y) ≥ m) = ρ_xy ρ_yy^{m-1},   (15)

hence

P_x(N(y) = m) = ρ_xy ρ_yy^{m-1} (1 − ρ_yy),   (16)

and finally

P_x(N(y) = 0) = 1 − P_x(N(y) ≥ 1) = 1 − ρ_xy.   (17)

One can define G(x, y) = E_x(N(y)), the average number of visits to state y for a Markov chain started at x. It turns out that

G(x, y) = E_x(N(y)) = Σ_{n=1}^{∞} P^n(x, y).   (18)

It is now possible to state the following

Theorem 1.

1. Let y be a transient state. Then P_x(N(y) < ∞) = 1 and

G(x, y) = ρ_xy / (1 − ρ_yy),   x ∈ S,   (19)

finite for all states.
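Equation (19) can be verified numerically on a toy chain of our own choosing: take state 0 transient with ρ_00 = 0.5 (the chain returns to 0 only by staying there at the first step) and state 1 absorbing. The theorem then predicts G(0, 0) = 0.5/(1 − 0.5) = 1, and the series (18) can be truncated because its tail is geometric.

```python
# Our toy example: state 0 transient (rho_00 = 0.5), state 1 absorbing.
P = [[0.5, 0.5],
     [0.0, 1.0]]

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Truncated series G(0, 0) = sum_{n>=1} P^n(0, 0); here P^n(0,0) = 0.5^n,
# so a couple of hundred terms are far more than enough.
Pn = [row[:] for row in P]   # P^1
G00 = Pn[0][0]
for _ in range(200):
    Pn = mat_mul(Pn, P)
    G00 += Pn[0][0]
# G00 is now numerically indistinguishable from rho_00 / (1 - rho_00) = 1.
```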
2. Let y be a recurrent state. Then P_y(N(y) = ∞) = 1 and G(y, y) = ∞, and one also has

P_x(N(y) = ∞) = P_x(T_y < ∞) = ρ_xy,   x ∈ S.   (20)

Finally, if ρ_xy = 0, then G(x, y) = 0, whereas if ρ_xy > 0, then G(x, y) = ∞.

This theorem tells us that a Markov chain pays only a finite number of visits to a transient state, whereas if it starts from a recurrent state it will come back there an infinite number of times. If the Markov chain starts at an arbitrary state x, it may well be that it will never visit the recurrent state y, but if it gets there, it will come back infinitely many times. A Markov chain is called transient if it has only transient states and recurrent if all of its states are recurrent. A finite Markov chain has at least one recurrent state and therefore cannot be transient.

C. The decomposition of state space

A state x leads to another state y if ρ_xy > 0 or, equivalently, if there exists a positive integer n for which P^n(x, y) > 0. If x leads to y and y leads to z, then x leads to z. Based on this concept, there is the following

Theorem 2. Let x be a recurrent state and suppose that x leads to y. Then y is recurrent and ρ_xy = ρ_yx = 1.

A set of states C is said to be closed if no state in C leads to a state outside C. An absorbing state a defines the closed set {a}. There are several characterizations of closed sets, but they will not be included here. A closed set C is irreducible (or ergodic) if, for any choice of two states x and y in C, x leads to y. It is a consequence of Theorem 2 that if C is an irreducible closed set, either every state in C is transient or every state in C is recurrent. Another consequence of Theorems 1 and 2 is the following

Corollary 1. For an irreducible closed set C of recurrent states, one has ρ_xy = 1, P_x(N(y) = ∞) = 1, and G(x, y) = ∞ for all choices of x and y in C.

Finally, one has the following important result as a direct consequence of the above theorems and corollaries:

Theorem 3.
If C is a finite irreducible closed set, then any state in C is recurrent.
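In practice, "x leads to y" is a statement about reachability: some P^n(x, y) > 0 exactly when y can be reached from x along positive-probability transitions, i.e. along edges of a directed graph. A sketch of such a check (the function name and the three-state example are ours, not from the notes):

```python
from collections import deque

def leads_to(P, x, y):
    """True if P^n(x, y) > 0 for some n >= 1, by breadth-first search
    on the directed graph whose edges are the positive transitions."""
    seen, queue = set(), deque([x])
    while queue:
        z = queue.popleft()
        for w, p in enumerate(P[z]):
            if p > 0 and w not in seen:
                seen.add(w)       # w is reachable from x in >= 1 steps
                queue.append(w)
    return y in seen

# Hypothetical 3-state chain: {0, 1} is closed and irreducible,
# state 2 is transient and leads into {0, 1}.
P = [[0.5, 0.5, 0.0],
     [0.5, 0.5, 0.0],
     [0.3, 0.3, 0.4]]
```

Here `leads_to(P, 2, 0)` holds but `leads_to(P, 0, 2)` does not, which is how one detects that {0, 1} is closed.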
Given a finite Markov chain, it is often possible to verify directly whether the process is irreducible (or ergodic) by using the transition function (matrix) and checking whether every state leads to every other state. Finally, one can prove the following decomposition into irreducible (ergodic) components:

Theorem 4. A non-empty set S_R of recurrent states is the union of a finite or countably infinite number of disjoint irreducible closed sets C_1, C_2, ....

If the initial state of the Markov chain is within one of the sets C_i, the time evolution will take place within this set and the chain will visit each of its states an infinite number of times. If the chain starts within the set of transient states S_T, either it will stay in this set, visiting any transient state only a finite number of times, or, if it reaches one of the C_i, it will stay there and visit every state of that irreducible closed set infinitely many times. The problem arises of determining the hitting-time distribution of the various ergodic components for a chain that starts in a transient state, as well as the absorption probability ρ_C(x) = P_x(T_C < ∞) for x ∈ S_T. The latter problem has the following solution when S_T is finite.

Theorem 5. Let the set S_T be finite and let C be a closed irreducible set of recurrent states. Then the system of equations

f(x) = Σ_{y∈C} P(x, y) + Σ_{y∈S_T} P(x, y) f(y),   x ∈ S_T,   (21)

has the unique solution f(x) = ρ_C(x), x ∈ S_T.

III. THE PATH TO STATISTICAL EQUILIBRIUM

A. The stationary distribution

The stationary distribution π(x) is a function on the Markov chain state space such that its values are non-negative, its sum over state space is 1, and

Σ_x π(x) P(x, y) = π(y),   y ∈ S.   (22)

It is interesting to notice that, for all n,

Σ_x π(x) P^n(x, y) = π(y),   y ∈ S.   (23)
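Returning to Theorem 5 for a moment: the system (21) can be solved by simple fixed-point iteration. The sketch below uses a hypothetical four-state "gambler's ruin" chain (states 0 and 3 absorbing, fair moves in between) with C = {0}, for which the exact absorption probabilities ρ_C(1) = 2/3 and ρ_C(2) = 1/3 are classical.

```python
# Gambler's ruin on {0, 1, 2, 3}: 0 and 3 absorbing, fair coin in between.
# Transition function stored sparsely as dicts of positive entries.
P = {
    0: {0: 1.0},
    1: {0: 0.5, 2: 0.5},
    2: {1: 0.5, 3: 0.5},
    3: {3: 1.0},
}
S_T, C = [1, 2], {0}   # transient states; target irreducible closed set

# Iterate equation (21): f(x) = sum_{y in C} P(x,y) + sum_{y in S_T} P(x,y) f(y).
f = {x: 0.0 for x in S_T}
for _ in range(200):
    f = {x: sum(p for y, p in P[x].items() if y in C)
            + sum(p * f[y] for y, p in P[x].items() if y in S_T)
         for x in S_T}
# f[1] -> 2/3 and f[2] -> 1/3, the probabilities of absorption at state 0.
```

The iteration converges geometrically here because the restriction of P to the transient states has spectral radius below 1; for a larger S_T one would solve the same linear system directly.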
Moreover, if X_0 follows the stationary distribution, then, for all n, the distribution of X_n is also the stationary distribution. Indeed, the distribution of X_n does not depend on n if and only if π_0(x) = π(x). If π(x) is a stationary distribution and lim_{n→∞} P^n(x, y) = π(y) holds for every initial state x and for every state y, then one can conclude that lim_{n→∞} P(X_n = y) = π(y) irrespective of the initial distribution. This means that, after a transient period, the distribution of chain states reaches a stationary distribution, which can then be interpreted as an equilibrium distribution in the statistical sense. For the reasons discussed above, it is important to see under which conditions π(x) exists and is unique, and to study the convergence properties of P^n(x, y).

B. How many times is a recurrent state visited on average?

Let N_n(y) denote the number of visits to a state y up to time step n. This random variable is defined as

N_n(y) = Σ_{m=1}^{n} I_y(X_m).   (24)

One can also define the average number of visits to state y, starting from x, up to step n:

G_n(x, y) = E_x(N_n(y)) = Σ_{m=1}^{n} P^m(x, y).   (25)

If m_y = E_y(T_y) denotes the mean return (recurrence) time to y for a chain starting at y, then, as an application of the strong law of large numbers, one has

Theorem 6. Let y be a recurrent state. Then

lim_{n→∞} N_n(y)/n = I_{T_y < ∞} / m_y   (26)

with probability one, and

lim_{n→∞} G_n(x, y)/n = ρ_xy / m_y,   x ∈ S.   (27)

The meaning of this theorem is that if a chain reaches a recurrent state y, then it returns there with frequency 1/m_y. Note that the quantity N_n(y)/n is immediately accessible from Monte Carlo simulation of Markov chains. A corollary is of immediate relevance to finite Markov chains:
Corollary 2. Let x, y be two generic states in an irreducible closed set of recurrent states C. Then

lim_{n→∞} G_n(x, y)/n = 1/m_y,   (28)

and if P(X_0 ∈ C) = 1, then with probability one, for any state y in C,

lim_{n→∞} N_n(y)/n = 1/m_y.   (29)

If m_y = ∞, the right-hand sides are both 0. A null recurrent state y is a recurrent state for which m_y = ∞. A positive recurrent state y is a recurrent state for which m_y < ∞. The following result characterizes positive recurrent states:

Theorem 7. If x is a positive recurrent state and x leads to y, then y is also positive recurrent.

In a finite irreducible closed set of states there is no null recurrent state:

Theorem 8. If C is a finite irreducible closed set of states, every state in C is positive recurrent.

The following corollaries are immediate consequences of the above theorems and corollary:

Corollary 3. An irreducible Markov chain having a finite number of states is positive recurrent.

Corollary 4. A Markov chain having a finite number of states has no null recurrent states.

As a final remark of this subsection, note that Theorem 6 and Corollary 2 connect time averages, defined by N_n(y)/n, to ensemble averages, defined by G_n(x, y)/n, and they can be called ergodic theorems. Ergodic theorems are related to the so-called strong law of large numbers, one of the important results of probability theory.

Theorem 9. Let ξ_1, ξ_2, ... be independent and identically distributed random variables with finite mean µ. Then

lim_{n→∞} (ξ_1 + ξ_2 + ··· + ξ_n)/n = µ

with probability one. If these random variables are positive with infinite mean, the theorem still holds with µ = +∞.
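Corollary 2 and equation (29) suggest a direct Monte Carlo check: simulate a chain and compare the visit frequency N_n(y)/n with 1/m_y. For the made-up two-state chain below, solving the stationary equations by hand gives π(0) = 5/6, hence m_0 = 6/5; all numbers are specific to this toy example.

```python
import random

# Toy two-state chain; for this P one finds pi(0) = 5/6, so m_0 = 6/5
# and the visit frequency N_n(0)/n should approach 5/6.
P = [[0.9, 0.1],
     [0.5, 0.5]]

rng = random.Random(42)
x, visits0, n = 0, 0, 200_000
for _ in range(n):
    # With two states, one threshold decides the next state.
    x = 0 if rng.random() < P[x][0] else 1
    if x == 0:
        visits0 += 1

freq0 = visits0 / n   # time average N_n(0)/n, close to 1/m_0 = 5/6
```

The fluctuation of `freq0` around 5/6 is of order 1/sqrt(n) (up to an autocorrelation factor), which is why a fairly long run is used.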
C. Existence, uniqueness and convergence to the stationary distribution

Eventually, the main results on the existence and uniqueness of π(x) and the limiting behaviour of P^n(x, y) can be stated. The ergodic theorems discussed in the previous subsection provide a rule for the Monte Carlo approximation of π(x) that can be used to prove its existence and uniqueness. First of all, the stationary weight of both transient states and null recurrent states is zero.

Theorem 10. If π(x) is a stationary distribution and x is a transient state or a null recurrent state, then π(x) = 0.

This means that a Markov chain without positive recurrent states cannot have a stationary probability distribution. However,

Theorem 11. An irreducible positive recurrent Markov chain has a unique stationary distribution π(x), given by

π(x) = 1/m_x.   (30)

This theorem provides the ultimate justification for the use of Markov chain Monte Carlo simulations to sample the stationary distribution if the hypotheses of the theorem are fulfilled. In order to get an approximate value for π(y), one lets the system equilibrate (and, to fully justify this step, the convergence theorem will be necessary), then counts the number of occurrences N_n(y) of state y in a long enough simulation of the Markov chain and divides it by the number of Monte Carlo steps n. This program can be carried out when the state space is not too large. In a typical Monte Carlo simulation of the Ising model with K sites, the number of states is 2^K and soon grows to become intractable. In such a simulation, many states will never be sampled even if the Markov chain is irreducible. For this reason, Metropolis et al. introduced the importance sampling trick, whose explanation is outside the scope of the present notes [3, 9]. The next corollary provides a nice characterization of positive recurrent Markov chains.

Corollary 5. An irreducible Markov chain is positive recurrent if and only if it has a stationary distribution.
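Theorem 11 can also be illustrated without simulation: iterating equation (23) as a power iteration π_{k+1} = π_k P converges, for an aperiodic irreducible chain, to the stationary distribution, whose entries are the reciprocals of the mean return times. A sketch on the same kind of two-state toy chain (our example, not from the notes):

```python
# Toy two-state chain; the exact stationary distribution, obtained by
# solving pi P = pi by hand, is pi = (5/6, 1/6).
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi = [0.5, 0.5]           # any probability vector works as a start
for _ in range(500):      # power iteration of equation (23)
    pi = [sum(pi[x] * P[x][y] for x in range(2)) for y in range(2)]

# pi -> (5/6, 1/6); by Theorem 11 the mean return times are
# m_0 = 1/pi(0) = 6/5 and m_1 = 1/pi(1) = 6.
```

Convergence is geometric at the rate of the second-largest eigenvalue modulus of P (here 0.4), so 500 iterations are vastly more than needed.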
For chains with a finite number of states, the existence and uniqueness of the stationary distribution is guaranteed if they are irreducible.
Corollary 6. If a Markov chain having a finite number of states is irreducible, it has a unique stationary distribution.

Finally, here is the corollary discussed above, where the recipe was given to estimate π(x) from Monte Carlo simulations:

Corollary 7. For an irreducible positive recurrent Markov chain having stationary distribution π, one has, with probability one,

lim_{n→∞} N_n(x)/n = π(x).   (31)

For reducible Markov chains the following results hold:

Theorem 12. Let S_P denote the set of positive recurrent states of a Markov chain.

1. If S_P is empty, the stationary distribution does not exist.
2. If S_P is not empty and irreducible, the chain has a unique stationary distribution.
3. If S_P is not empty and reducible, the chain has an infinite number of stationary distributions.

Case 3 occurs when the chain reaches one of the closed irreducible sets and then stays there forever. It is a subtle case, where Monte Carlo simulations may not give proper results if the chain's reducibility is not studied. If x is a state of a Markov chain such that P^n(x, x) > 0 for some n ≥ 1, its period d_x can be defined as the greatest common divisor of the set {n ≥ 1 : P^n(x, x) > 0}. For two states x and y leading to each other, d_x = d_y. States in an irreducible Markov chain have a common period d. The chain is called periodic of period d if d > 1 and aperiodic if d = 1. The following theorem gives the conditions for the convergence of P^n(x, y) to the stationary distribution:

Theorem 13. For an aperiodic irreducible positive recurrent Markov chain with stationary distribution π(x),

lim_{n→∞} P^n(x, y) = π(y),   x, y ∈ S.   (32)

For a periodic chain with the same properties and with period d, for each pair of states x, y in S, there is an integer r, 0 ≤ r < d, such that P^n(x, y) = 0 unless n = md + r for some non-negative integer m, and

lim_{m→∞} P^{md+r}(x, y) = d π(y),   x, y ∈ S.   (33)
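The periodic case of Theorem 13 is easy to see on the two-state "flip" chain with P(0, 1) = P(1, 0) = 1 (our toy example): it has period d = 2 and π = (1/2, 1/2), so P^n(0, 0) oscillates between 1 and 0 and has no plain limit, while along even times P^{2m}(0, 0) = 1 = d·π(0), exactly as equation (33) predicts.

```python
# Two-state flip chain: period d = 2, stationary distribution pi = (1/2, 1/2).
P = [[0.0, 1.0],
     [1.0, 0.0]]

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, m):
    R = [[1.0, 0.0], [0.0, 1.0]]   # P^0 = identity
    for _ in range(m):
        R = mat_mul(R, A)
    return R

# P^n(0, 0) is 1 for even n and 0 for odd n: the plain limit of equation
# (32) fails, but the subsequence limit d * pi(0) = 2 * 0.5 = 1 holds.
even = mat_pow(P, 10)[0][0]
odd = mat_pow(P, 11)[0][0]
```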
This theorem is the only one in the list that needs (mild) number-theoretic tools to be proven.

Acknowledgements

These notes were written during a visit to Marburg University supported by an Erasmus fellowship. The author wishes to thank Prof. Guido Germano and his group for their warm hospitality.

[1] T.C. Schelling (1971) Dynamic Models of Segregation, Journal of Mathematical Sociology 1, 143-186.
[2] E. Ising (1924) Beitrag zur Theorie des Ferro- und Paramagnetismus, Dissertation, Mathematisch-Naturwissenschaftliche Fakultät der Hamburgischen Universität, Hamburg.
[3] D. Landau and K. Binder (1995) A Guide to Monte Carlo Simulations in Statistical Physics, Cambridge University Press.
[4] M. Aoki and H. Yoshikawa (2007) Reconstructing Macroeconomics. A Perspective from Statistical Physics and Combinatorial Stochastic Processes, Cambridge University Press.
[5] P.G. Hoel, S.C. Port, and C.J. Stone (1972) Introduction to Stochastic Processes, Houghton Mifflin, Boston.
[6] J.G. Kemeny and J. Laurie Snell (1976) Finite Markov Chains, Springer, New York.
[7] R. Durrett (1999) Essentials of Stochastic Processes, Springer, New York.
[8] E. Çinlar (1975) Introduction to Stochastic Processes, Prentice Hall, Englewood Cliffs.
[9] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller and E. Teller (1953) Equation of State Calculations by Fast Computing Machines, Journal of Chemical Physics 21, 1087-1092.