
1 APM 541: Stochastic Modelling in Biology. Discrete-time Markov Chains. Jay Taylor (ASU), Fall 2013.

2 Outline: 1 Motivation; 2 Markov Processes; 3 Markov Chains: Basic Properties; 4 Asymptotics: Class Structure; 5 Asymptotics: Hitting Times and Probabilities; 6 Asymptotics: Stationary Distributions; 7 Time Reversal; 8 Hidden Markov Models.

3 Motivation: Stochastic Processes. Definition: A stochastic process is simply a collection of random variables {X_t : t ∈ T} indexed by some set T, where all of the variables are defined on the same underlying probability space. We say that the process is E-valued if all of the variables take values in the set E. We say that the process is a discrete-time stochastic process if the index set T is a discrete subset of R, e.g., T ⊆ Z, or a continuous-time stochastic process if the index set is an interval T = (a, b) ⊆ R. Interpretation: Usually, T ⊆ R is a subset of the real line and we think of X_t as specifying the state of some random system at time t ∈ T. Thus the stochastic process as a whole describes the noisy or random evolution of the system over time.

4 Motivation. To motivate our consideration of Markov processes, let us compare two models of density-dependent population regulation in a finite population of an asexual (or hermaphroditic) species. In both models, we will assume that reproduction is seasonal and we will let X_t denote the number of individuals alive at the beginning of year t. Model 1 will make the following assumptions: Individuals become reproductively mature in the year following their birth, i.e., an individual born in year t will begin to reproduce in year t + 1. Conditional on the current population size X_t, the number of surviving offspring born to an individual alive at time t is Poisson-distributed with mean r exp(-γ_b X_t). Furthermore, the numbers of offspring born to different individuals are independent of one another. Following reproduction, each individual dies, independently of the rest, with probability 1 - (1 - µ) exp(-γ_d X_t).

5 Motivation. Assume that the individuals alive at the beginning of year t are labeled i = 1, ..., X_t and let η_{i,t} denote the number of offspring born to the i-th individual in that season. Since η_{1,t}, ..., η_{X_t,t} are independent Poisson-distributed random variables, it follows that if we condition on X_t = n, then the total number of offspring born at time t is also Poisson-distributed, with mean n r e^{-γ_b n}: η_t ≡ Σ_{i=1}^n η_{i,t} ~ Poisson(n r e^{-γ_b n}). Furthermore, under this same condition, the number of surviving adults S_t is binomially-distributed with parameters n and (1 - µ)e^{-γ_d n}: S_t ~ Binomial(n, (1 - µ)e^{-γ_d n}). Combining these two observations, it follows that the number of individuals alive at the beginning of year t + 1 is X_{t+1} = S_t + η_t.

6 Motivation. This process can be simulated using the following procedure: 1 Ask the user to input values for the parameters r, µ, γ_b, γ_d as well as the initial number of individuals X_0 alive at time t = 0. 2 Given that X_t = n, generate independent random variables η_t ~ Poisson(n r e^{-γ_b n}) and S_t ~ Binomial(n, (1 - µ)e^{-γ_d n}). 3 Set X_{t+1} = S_t + η_t. 4 Increase t to t + 1 and return to step 2. Several features of this model contribute to the ease with which the process can be simulated. One is that we are able to directly sample the total number of individuals born in each season, rather than having to sample each of the individual offspring numbers separately. Equally importantly, to determine the number of individuals alive at time t + 1, we only need to know the number alive at time t. A simulation sketch is given below.
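
The following Python sketch (added here; it is not part of the original notes, and the function name simulate_model1 and the parameter values in the usage line are illustrative choices) implements the four-step procedure above using numpy.

```python
import numpy as np

def simulate_model1(r, mu, gamma_b, gamma_d, X0, T, rng=None):
    """Simulate Model 1 for T seasons; returns the trajectory X_0, ..., X_T."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.zeros(T + 1, dtype=int)
    X[0] = X0
    for t in range(T):
        n = X[t]
        # Total offspring this season: Poisson with mean n * r * exp(-gamma_b * n)
        eta = rng.poisson(n * r * np.exp(-gamma_b * n))
        # Surviving adults: Binomial(n, (1 - mu) * exp(-gamma_d * n))
        S = rng.binomial(n, (1.0 - mu) * np.exp(-gamma_d * n))
        X[t + 1] = S + eta
    return X

# Example usage with illustrative parameter values
traj = simulate_model1(r=2.0, mu=0.4, gamma_b=0.01, gamma_d=0.005, X0=10, T=50)
print(traj)
```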

7 Motivation. In the second model, we will assume that the density of the population at time t affects not only the survival and reproduction of adults alive at that time, but also the development of any juveniles that are born in that year. Specifically, we will assume that density has a lasting negative effect on the future reproduction and survival of such juveniles, as described below: Suppose that an individual is τ years old. Conditional on the population density X_t in the present and on the population density X_{t-τ} in the year when that individual was born, the number of surviving offspring born to that individual at time t is Poisson-distributed with mean r e^{-γ_b X_t} e^{-λ_b X_{t-τ}}. Following reproduction, the probability that this individual dies is 1 - µ e^{-γ_d X_t} e^{-λ_d X_{t-τ}}. Apart from these modifications, we will assume that Model 2 coincides with Model 1.

8 Motivation. As innocuous as these modifications may seem, they make the task of simulating Model 2 much more computationally demanding than Model 1. Specifically, we must deal with the following complications: Because reproduction and survival now depend on the density of the population in each individual's natal year, we must keep track of the age of each individual in the population so that we can determine when it was born. This also necessitates sampling Poisson and binomially-distributed random variables for every age class rather than for the population as a whole. More seriously, we must know the population density not only at the present time t, but potentially at every time 0, 1, 2, ..., t - 1 in the past. In other words, the entire history of the population is needed to propagate the process forward from time t to t + 1. The history-dependence exhibited by Model 2 is a major obstacle to both analytical and Monte Carlo investigations. Although this may sometimes be unavoidable, it is usually preferable to begin with a model that can be propagated forward using only the present state of the system.

9 Markov Processes. Stochastic models that evolve in a manner that depends only on the current state are known as Markov processes. The formal definition is given below. Definition: A stochastic process X = (X_n : n ≥ 0) with values in a set E is said to be a discrete-time Markov process if for every n ≥ 0 and every set of values x_0, x_1, ..., x_n ∈ E, we have P(X_{n+1} ∈ A | X_0 = x_0, X_1 = x_1, ..., X_n = x_n) = P(X_{n+1} ∈ A | X_n = x_n), whenever A is a subset of E such that {X_{n+1} ∈ A} is an event. In this case, the functions defined by p_n(x, A) = P(X_{n+1} ∈ A | X_n = x) are called the one-step transition probabilities of X. If the functions p_n(x, A) do not depend on n, i.e., if there is a function p such that p(x, A) = P(X_{n+1} ∈ A | X_n = x) for every n ≥ 0, then we say that X is a time-homogeneous Markov process with transition function p. Otherwise, X is said to be time-inhomogeneous.

10 Markov Processes. Remarks: Markov processes are sometimes said to lack memory in the sense that if we know the current state of the process X_t = x, then the values assumed by the process at future times, say X_{t+s} = y, do not depend on how the process arrived at state x at time t. In other words, the process is only aware of its current location and forgets how it arrived there. Equivalently, it can be shown that a Markov process has the following property, which is known as the Markov property: if we condition on the event {X_n = x}, then the variables (X_{n+k} : k ≥ 1) are independent of the variables (X_{n-k} : k ≥ 1), i.e., the future is conditionally independent of the past given the present. Although time-inhomogeneous processes have useful applications, here we will assume that all of our processes are time-homogeneous. Indeed, in some cases, we can convert a time-inhomogeneous process to one that is time-homogeneous by explicitly modeling the variables that are responsible for the changing transition probabilities.

11 Markov Processes. Example: Discrete-time Random Walks. Let ξ_0, ξ_1, ... be an i.i.d. sequence of R^d-valued random variables with density f and define the process X = (X_n : n ≥ 0) by setting X_0 = 0 and X_{n+1} = X_n + ξ_n. X is said to be a d-dimensional discrete-time random walk, and a simple calculation shows that X is a time-homogeneous Markov process with transition function P(X_{n+1} ∈ A | X_0 = x_0, ..., X_n = x_n) = P(X_{n+1} ∈ A | X_n = x_n) = P(x_n + ξ_n ∈ A | X_n = x_n) = P(ξ_n ∈ A - x_n) = ∫_A f(y - x_n) dy. Interpretation: We think of X_n as the location at time n of an individual animal or particle that moves by taking random jumps at fixed time intervals.
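
A short simulation sketch of such a walk (added; the choice of standard Gaussian jumps is purely illustrative, any density f could be substituted):

```python
import numpy as np

def random_walk(n_steps, d=2, rng=None):
    """Simulate a d-dimensional random walk with i.i.d. standard Gaussian jumps."""
    rng = np.random.default_rng() if rng is None else rng
    steps = rng.standard_normal((n_steps, d))                   # xi_0, ..., xi_{n-1}
    path = np.vstack([np.zeros(d), np.cumsum(steps, axis=0)])   # X_0 = 0, X_{n+1} = X_n + xi_n
    return path

path = random_walk(1000)
print(path[-1])   # location after 1000 jumps
```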

12 Markov Chains: Basic Properties. Markov Chains. Markov processes that take values in a discrete state space E are important enough to merit their own theory and terminology. In particular, it will be convenient for us to adopt the following two conventions: Without loss of generality, we will assume that the state space is either E = {1, ..., m} when E is finite or E = {1, 2, ...} when E is countably infinite. This amounts to assigning integer labels to the original objects that were in the state space. Probability distributions on E will be represented by row vectors, e.g., we will use the vector ν = (ν_1, ν_2, ...) to represent the distribution that assigns probability ν_i to state i ∈ E. If E is infinite, then we will regard ν as an infinite-dimensional vector.

13 Markov Chains: Basic Properties. With these conventions in force, we have the following definition. Definition: A stochastic process X = (X_n : n ≥ 0) with values in the countable set E = {1, 2, ...} is said to be a time-homogeneous discrete-time Markov chain with initial distribution ν and transition matrix P = (p_ij) if (1) for every i ∈ E, P(X_0 = i) = ν_i; and (2) for every n ≥ 0 and every set of values x_0, ..., x_{n+1} ∈ E, we have P(X_{n+1} = x_{n+1} | X_0 = x_0, X_1 = x_1, ..., X_n = x_n) = P(X_{n+1} = x_{n+1} | X_n = x_n) = p_{x_n x_{n+1}}. Caveat: Some authors (in particular, demographers) define the transition matrix of a Markov chain to be the transpose of the matrix P that we have introduced above. In this case, p_{ji} is the probability that the process moves from state i to state j in a single step.

14 Markov Chains: Basic Properties. Simulation of Discrete-time Markov Chains. In principle, a discrete-time Markov chain X = (X_n : n ≥ 0) with initial distribution ν and transition matrix P can be simulated using the following algorithm: 1 First, choose the initial state of the Markov chain X_0 = x_0 by sampling the value x_0 from the distribution ν on E. 2 If X_t = i ∈ E, choose the next state of the Markov chain by sampling a value j ∈ E from the probability distribution (p_{i1}, p_{i2}, ...) on E. Set X_{t+1} = j. 3 Increase t to t + 1 and return to step 2. In practice, the ease with which we can simulate a Markov chain depends on how difficult it is to generate random samples from the initial distribution ν and from the transition probabilities specified by the rows of the transition matrix P. A sketch of this algorithm for a finite state space appears below.
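
A minimal Python sketch of the algorithm above (added; states are labeled 0, ..., m-1 here rather than 1, ..., m as in the notes, and the function name simulate_chain is illustrative):

```python
import numpy as np

def simulate_chain(P, nu, n_steps, rng=None):
    """Simulate a finite-state DTMC with transition matrix P (a numpy array whose
    rows sum to 1) and initial distribution nu; states are labeled 0, ..., m-1."""
    rng = np.random.default_rng() if rng is None else rng
    m = len(nu)
    X = np.empty(n_steps + 1, dtype=int)
    X[0] = rng.choice(m, p=nu)               # sample X_0 from nu
    for t in range(n_steps):
        X[t + 1] = rng.choice(m, p=P[X[t]])  # sample X_{t+1} from row X_t of P
    return X
```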

15 Markov Chains: Basic Properties. Example: Seasonal Fecundity in Birds. Etterson et al. (2009) propose a simple Markov chain model to describe the reproductive activities of a single female bird in a single breeding season. In these models, a female can occupy one of four states: state 1: she may be actively nesting; state 2: she may have successfully fledged a brood; state 3: she may have failed to fledge a brood; state 4: she may have completed all nesting activities for the season. Thus, the state space can be represented as E = {1, 2, 3, 4}, where each number corresponds to one of the four states described above, and the variable X_n ∈ E describes the state of the female following the n-th change in state.

16 Markov Chains: Basic Properties. To complete the description of the Markov chain X = (X_t : t ≥ 0), we need to specify both the initial distribution ν = (1, 0, 0, 0) of the female's state at the beginning of the season (i.e., she will initiate breeding) as well as the transition matrix P = (p_ij) determining how this state changes over time. Etterson and colleagues assume that the transition matrix can be written in the following form:
P =
( 0        s^a   1 - s^a   0   )
( 1 - q_s  0     0         q_s )
( 1 - q_f  0     0         q_f )
( 0        0     0         1   )
where s is the daily nest survival probability, a is the average time from first egg to fledging, q_s is the probability that a female quits breeding following a successful breeding attempt, and q_f is the probability that a female quits breeding following a failed breeding attempt. Numerical estimates of these parameters were obtained from field studies of a population of Eastern Meadowlarks in Illinois.

17 Markov Chains: Basic Properties. The Eastern Meadowlark (Sturnella magna) is a member of the blackbird family (Icteridae). It is medium-sized, with a streaked brown back and yellow underparts with a black V on the breast. It is most common in grasslands, prairies, and agricultural settings, and eats mainly insects, as well as some seeds.

18 Markov Chains: Basic Properties. The matrices that occur as transition matrices of Markov chains are sufficiently important in their own right to warrant the following definition. Definition: A matrix P = (p_ij), indexed by a countable set E, is said to be a stochastic matrix if all of the entries are non-negative, p_ij ≥ 0, and all of the row sums are equal to one, i.e., Σ_{j ∈ E} p_ij = 1 for every i. Every transition matrix is a stochastic matrix, and it can be shown that every stochastic matrix is the transition matrix of some Markov chain. Furthermore, stochastic matrices satisfy a number of useful properties that are stated in the next lemma.

19 Markov Chains: Basic Properties. Lemma: (1) The product of any two stochastic matrices, P and Q, of the same dimension is itself a stochastic matrix. (2) Every stochastic matrix P has λ = 1 as an eigenvalue, with corresponding right eigenvector v = (1, 1, ...)^T: Pv = v. (3) Every eigenvalue λ of a stochastic matrix lies in the unit disk: |λ| ≤ 1. Proof: (1) Suppose that P = (p_ij) and Q = (q_ij) are stochastic matrices. Since all of the elements p_ij and q_ij are non-negative, it is evident that every element of PQ, (PQ)_ij = Σ_{k ∈ E} p_ik q_kj ≥ 0, is non-negative. Furthermore, every row sum of PQ is equal to 1 since Σ_{j ∈ E} (PQ)_ij = Σ_{j ∈ E} Σ_{k ∈ E} p_ik q_kj = Σ_{k ∈ E} p_ik ( Σ_{j ∈ E} q_kj ) = Σ_{k ∈ E} p_ik = 1.

20 Markov Chains: Basic Properties. To establish (2), we simply need to show that v = (1, 1, ...)^T is a right eigenvector for P corresponding to eigenvalue 1. However, using the fact that each of the row sums of P is equal to 1, we have (Pv)_i = Σ_{k ∈ E} p_ik v_k = Σ_{k ∈ E} p_ik = 1, which shows that Pv = v. Lastly, to establish (3), we can either invoke Gershgorin's disk theorem (look it up!) or observe that if v = (v_1, v_2, ...) is any column vector with finite sup norm ||v|| ≡ sup_{i ∈ E} |v_i| < ∞, then |(Pv)_i| = |Σ_{k ∈ E} p_ik v_k| ≤ Σ_{k ∈ E} p_ik |v_k| ≤ Σ_{k ∈ E} p_ik ||v|| = ||v||. This shows that ||Pv|| ≤ ||v||, and so there can be no vector v with Pv = λv when |λ| > 1.
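
As a quick numerical illustration (added, not part of the original notes), properties (2) and (3) can be checked for a small, arbitrarily chosen stochastic matrix:

```python
import numpy as np

P = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])              # rows sum to 1

eigvals = np.linalg.eigvals(P)
print(np.allclose(P @ np.ones(3), np.ones(3)))   # Pv = v for v = (1,1,1)^T: True
print(np.max(np.abs(eigvals)) <= 1 + 1e-12)      # all eigenvalues lie in the unit disk: True
```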

21 Markov Chains: Basic Properties. One consequence of the Markov property is that the joint distribution of the values assumed by a Markov chain from time 0 through time n takes a particularly simple form. Theorem: Let X be a time-homogeneous discrete-time Markov chain with transition matrix P = (p_ij) and initial distribution ν on E. Then P(X_0 = x_0, X_1 = x_1, ..., X_n = x_n) = ν(x_0) Π_{i=0}^{n-1} p_{x_i, x_{i+1}}.

22 Markov Chains: Basic Properties. Proof: By repeated conditioning and use of the Markov property, we have P(X_0 = x_0, X_1 = x_1, ..., X_n = x_n) = P(X_0 = x_0, ..., X_{n-1} = x_{n-1}) P(X_n = x_n | X_0 = x_0, ..., X_{n-1} = x_{n-1}) = P(X_0 = x_0, ..., X_{n-1} = x_{n-1}) P(X_n = x_n | X_{n-1} = x_{n-1}) = P(X_0 = x_0, ..., X_{n-1} = x_{n-1}) p_{x_{n-1}, x_n} = ... = P(X_0 = x_0, X_1 = x_1) p_{x_1, x_2} ··· p_{x_{n-1}, x_n} = P(X_0 = x_0) P(X_1 = x_1 | X_0 = x_0) p_{x_1, x_2} ··· p_{x_{n-1}, x_n} = ν(x_0) Π_{i=0}^{n-1} p_{x_i, x_{i+1}}, where ν(x_0) is the probability of x_0 under the initial distribution ν.

23 Markov Chains: Basic Properties. Example: Recall that Etterson et al. (2009) estimated a transition matrix of the form given above for the breeding cycle of Eastern Meadowlarks. Here, state 1 corresponds to active breeding, state 2 corresponds to fledging, state 3 corresponds to nesting failure, and state 4 corresponds to cessation of breeding activities in that season. If we assume that all females begin the season with an active breeding attempt (X_0 = 1), then the probability that a female successfully rears her first brood, fails to fledge her second brood, and then successfully rears a third brood before stopping in the season is the probability of the path 1 → 2 → 1 → 3 → 1 → 2 → 4, namely P(1, 2, 1, 3, 1, 2, 4) = ν_1 p_12 p_21 p_13 p_31 p_12 p_24.

24 Markov Chains: Basic Properties. Multi-step Transition Probabilities. While the transition matrix specifies how a Markov chain will evolve from one time period to the next, for some applications we need to be able to calculate the transition probabilities over multiple time steps. As the name suggests, the n-step transition probabilities p^(n)_ij of a DTMC X are defined for any n ≥ 1 by p^(n)_ij = P(X_n = j | X_0 = i). In fact, our next theorem will show that the n-step transition probabilities are independent of time whenever X is time-homogeneous, i.e., for every t ≥ 0, p^(n)_ij = P(X_{t+n} = j | X_t = i), which means that p^(n)_ij is just the probability that the chain moves from state i to state j in n time steps. The next theorem expresses an important relationship that holds between the n-step transition probabilities of a DTMC X and its r-step and (n - r)-step transition probabilities.

25 Markov Chains: Basic Properties. Theorem (Chapman-Kolmogorov Equations): Assume that X is a time-homogeneous discrete-time Markov chain with n-step transition probabilities p^(n)_ij. Then, for any non-negative integer r < n, the identities p^(n)_ij = Σ_{k ∈ E} p^(r)_ik p^(n-r)_kj hold for all i, j ∈ E. Proof: By using first the law of total probability and then the Markov property, we have p^(n)_ij = P(X_n = j | X_0 = i) = Σ_{k ∈ E} P(X_n = j, X_r = k | X_0 = i) = Σ_{k ∈ E} P(X_n = j | X_r = k, X_0 = i) P(X_r = k | X_0 = i) = Σ_{k ∈ E} P(X_n = j | X_r = k) P(X_r = k | X_0 = i) = Σ_{k ∈ E} p^(r)_ik p^(n-r)_kj.

26 Markov Chains: Basic Properties. Many computations involving Markov chains can be carried out using techniques from linear algebra. To understand why this is so, let P^(n) = (p^(n)_ij) be the matrix containing the n-step transition probabilities and observe that the Chapman-Kolmogorov equations can be expressed as the following matrix equation: P^(n) = P^(r) P^(n-r). In particular, if we take n = 2 and r = 1, then since P^(1) = P, we see that P^(2) = PP = P^2. This, in turn, implies that P^(3) = P P^(2) = P P^2 = P^3, and continuing in this fashion we find that P^(n) = P^n. In other words, the matrix of n-step transition probabilities is equal to the n-th power of the one-step transition matrix. This leads to the following formula for the marginal distribution of a Markov chain at time n.
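
A short numerical illustration (added here; the matrix and initial distribution are arbitrary examples): the n-step transition probabilities can be computed with a matrix power.

```python
import numpy as np

P = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])

P6 = np.linalg.matrix_power(P, 6)   # 6-step transition probabilities P^(6) = P^6
nu = np.array([1.0, 0.0, 0.0])      # initial distribution concentrated on the first state
print(nu @ P6)                      # distribution of X_6
```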

27 Markov Chains: Basic Properties. Theorem: Suppose that X is a time-homogeneous discrete-time Markov chain with transition matrix P and initial distribution ν. Then the distribution of X_n is given by the row vector νP^n. Proof: The result follows from the law of total probability: P(X_n = j) = Σ_{i ∈ E} P(X_n = j | X_0 = i) P(X_0 = i) = Σ_{i ∈ E} ν_i p^(n)_ij = (νP^n)_j.

28 Markov Chains: Basic Properties. Example: To find the probability that a female meadowlark initiates at least 4 breeding attempts in a single season, we need to calculate the probability of the event X_6 = 1. Since the initial distribution is ν = (1, 0, 0, 0), the distribution of X_6 is given by νP^6 = (1, 0, 0, 0) P^6 = (0.116, 0, 0, 0.884). Thus, the probability that a female initiates breeding at least four times is equal to 0.116, while the probability that she initiates breeding three or fewer times is 0.884.

29 Asymptotics: Class Structure. Long-term Behavior of Markov Chains. When studying stochastic models of biological processes, we are often interested in the long-term behavior of the model. Such considerations are important, for example, in the following questions: What is the probability that a new mutation will spread in a population? What is the probability that a tumor cell will escape immune surveillance and grow to form a tumor? What is the probability that an infectious disease will persist in a population? What is the steady-state distribution of transcription factors and other proteins within a single cell? In the following sections, we will be concerned with two kinds of asymptotic behavior: absorption into a single state and long-term fluctuations in an equilibrium process.

30 Asymptotics: Class Structure. We begin by introducing some useful terminology. Definition: Let X be a discrete-time Markov chain on E with transition matrix P. (1) We say that i leads to j, written i → j, if for some integer n ≥ 0, p^(n)_ij = P_i(X_n = j) > 0. In other words, i → j if the process X beginning at X_0 = i has some positive probability of eventually arriving at j. (2) We say that i communicates with j, written i ↔ j, if i leads to j and j leads to i.

31 Asymptotics: Class Structure. Notice that the relation i ↔ j is an equivalence relation on E: (1) each element communicates with itself: i ↔ i; (2) i communicates with j if and only if j communicates with i; (3) if i communicates with j and j communicates with k, then i communicates with k. In other words, a Markov chain with values in a state space E induces a partition of E into disjoint subsets E_1, E_2, ..., with E = E_1 ∪ E_2 ∪ ..., where all of the elements contained in the subset E_i communicate with each other and no two elements contained in different subsets, say E_i and E_j, communicate with each other. This structure is formalized in the following definition.

32 Asymptotics: Class Structure. Definition: Let X be a DTMC on E with transition matrix P. (1) A nonempty subset C ⊆ E is said to be a communicating class if it is an equivalence class under the relation i ↔ j. In other words, each pair of elements in C communicates, and whenever i ∈ C and j ∈ E communicate, j ∈ C. (2) A communicating class C is said to be closed if whenever i ∈ C and i → j, we also have j ∈ C. If C is a closed communicating class for a Markov chain X, then once X enters C, it never leaves C. (3) A state i is said to be absorbing if {i} is a closed class, i.e., once the process enters state i, it is trapped there forever. (4) A Markov chain is said to be irreducible if the entire state space E is a communicating class.

33 Asymptotics: Class Structure. Example: The transition matrix for the Markov chain studied by Etterson and colleagues is:
P =
( 0        s^a   1 - s^a   0   )
( 1 - q_s  0     0         q_s )
( 1 - q_f  0     0         q_f )
( 0        0     0         1   )
Provided that s ∈ (0, 1), a > 0, and q_s, q_f ∈ (0, 1), the state space E = {1, 2, 3, 4} can be decomposed into two communicating classes, C_1 = {1, 2, 3} and C_2 = {4}. This shows that the Markov chain is not irreducible. Furthermore, C_1 is not closed, since states 2 and 3 lead out of C_1 into C_2. On the other hand, C_2 is closed and 4 is an absorbing state.

34 Asymptotics: Hitting Times and Probabilities. Suppose that X is a discrete-time Markov chain and let C ⊆ E be a closed communicating class for X. In this section we will consider the following two problems: (1) Given that X_0 = i, what is the probability that X is eventually absorbed by C? (2) Given that X_0 = i, what is the average time to absorption in C? Since we will be conditioning on the initial state of the Markov chain, it will be convenient to introduce the following notation. We will use P_i to denote the conditional distribution of the chain given X_0 = i, P_i(A) = P(A | X_0 = i), where A is any event defined in terms of the chain. Similarly, we will use E_i to denote conditional expectations given X_0 = i, E_i[Y] = E[Y | X_0 = i], where Y is any random variable defined in terms of the chain.

35 Asymptotics: Hitting Times and Probabilities. The main quantities of interest are defined below. Definition: Let X be a discrete-time Markov chain on E with transition matrix P and let C ⊆ E be a closed communicating class for X. (1) The absorption time of C is the random variable T^C ∈ {0, 1, ...} ∪ {∞} defined by T^C = min{n ≥ 0 : X_n ∈ C} if X_n ∈ C for some n ≥ 0, and T^C = ∞ if X_n ∉ C for all n. (2) The absorption probability of C starting from i is the probability h^C_i = P_i(T^C < ∞). (3) The mean absorption time of C starting from i is the expectation τ^C_i = E_i[T^C].

36 Asymptotics: Hitting Times and Probabilities. The following theorem asserts that the absorption probabilities h^C_i can be calculated by solving a certain system of linear equations. Theorem: The vector of absorption probabilities h^C = (h^C_1, h^C_2, ...) is the unique minimal non-negative solution of the system of linear equations h^C_i = 1 if i ∈ C, and h^C_i = Σ_{j ∈ E} p_ij h^C_j if i ∉ C. To say that h^C is a minimal non-negative solution means that there is no other non-negative solution with at least one component smaller than that of h^C. In fact, it can be shown that the second set of equations is satisfied even when i ∈ C, which means that the vector h^C is a right eigenvector for P corresponding to the eigenvalue λ = 1: h^C = P h^C.

37 Asymptotics: Hitting Times and Probabilities. Proof: I will show that h^C is a solution to this system of equations. For a proof of minimality, see the text Markov Chains by Norris (1997). A strategy that is often successful when working with Markov chains is to condition on what the chain does in a single step. Here we will condition on the state of the chain at time t = 1. However, we first note that h^C_i = 1 if i ∈ C, since the chain will clearly absorb in C if it starts there. On the other hand, if i ∉ C, then the law of total probability and the Markov property imply that h^C_i = P_i(X_n ∈ C for some n < ∞) = Σ_{j ∈ E} P_i(X_1 = j and X_n ∈ C for some n < ∞) = Σ_{j ∈ E} P_i(X_n ∈ C for some n < ∞ | X_1 = j) P_i(X_1 = j) = Σ_{j ∈ E} P_j(X_n ∈ C for some n < ∞) p_ij = Σ_{j ∈ E} p_ij h^C_j.

38 Asymptotics: Hitting Times and Probabilities. Example: Suppose that X is a Markov chain on the state space E = {1, 2, 3, 4} with the transition matrix
P =
( 1      0      0      0 )
( a      0      1 - a  0 )
( 0      1 - b  0      b )
( 0      0      0      1 )
where a, b ∈ (0, 1). Then both 1 and 4 are absorbing states and the chain can absorb in either one of these two states when it is started in state 2 or 3. Let h_i be the probability that it eventually absorbs in state 4 when started in state i: h_i = P_i(X_n = 4 for some n < ∞). From the last theorem, we know that h = (h_1, h_2, h_3, h_4) is the minimal non-negative solution to the system of equations h_i = Σ_{j ∈ E} p_ij h_j with h_4 = 1.

39 Asymptotics: Hitting Times and Probabilities. Written explicitly, this system of equations is: h_1 = h_1, h_2 = a h_1 + (1 - a) h_3, h_3 = (1 - b) h_2 + b, which does not have a unique solution (since h_1 is unconstrained). However, if we insist on a minimal non-negative solution, then h_1 = 0, which in turn uniquely determines h_2 and h_3: h_2 = (1 - a)b / (a + b - ab) and h_3 = b / (a + b - ab).
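
As a numerical sanity check (added, not from the original slides), one can solve the reduced linear system for the non-absorbing states directly and compare with the closed-form answer; the values a = 0.3 and b = 0.6 below are arbitrary illustrative choices.

```python
import numpy as np

a, b = 0.3, 0.6

# Unknowns h_2, h_3 with h_1 = 0 and h_4 = 1:
#   h_2 - (1 - a) h_3 = 0,    -(1 - b) h_2 + h_3 = b
A = np.array([[1.0, -(1.0 - a)],
              [-(1.0 - b), 1.0]])
rhs = np.array([0.0, b])
h2, h3 = np.linalg.solve(A, rhs)

print(h2, (1 - a) * b / (a + b - a * b))   # should agree
print(h3, b / (a + b - a * b))             # should agree
```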

40 Asymptotics: Hitting Times and Probabilities. The mean absorption times τ^C_i can also be calculated by solving a system of linear equations. Theorem: Suppose that C is a closed communicating class and that the absorption probability h^C_i = 1 for every i ∈ E, i.e., the chain is certain to absorb in C no matter where it is started. Then the vector of mean absorption times τ^C = (τ^C_1, τ^C_2, ...) is the minimal non-negative solution of the system of linear equations τ^C_i = 0 if i ∈ C, and τ^C_i = 1 + Σ_{j ∈ E} p_ij τ^C_j if i ∉ C. In this case, the system of equations is inhomogeneous, i.e., there are constant terms in the equations. Furthermore, the condition h^C_i = 1 is important, since otherwise some of the mean absorption times will be infinite (since we set T^C = ∞ if absorption never occurs).
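
For a finite chain this theorem reduces to a small linear solve over the states outside C. The following sketch (added; the helper name mean_absorption_times is hypothetical) sets up that system for a given transition matrix P and closed class C, assuming absorption in C is certain.

```python
import numpy as np

def mean_absorption_times(P, C):
    """Mean absorption times into the closed class C (a collection of state indices),
    assuming absorption in C is certain from every state."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    outside = [i for i in range(n) if i not in C]
    # Since tau_j = 0 for j in C, the equations tau_i = 1 + sum_j p_ij tau_j
    # restricted to the states outside C read (I - Q) tau = 1, with Q = P restricted.
    Q = P[np.ix_(outside, outside)]
    tau_out = np.linalg.solve(np.eye(len(outside)) - Q, np.ones(len(outside)))
    tau = np.zeros(n)
    tau[outside] = tau_out
    return tau
```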

41 Asymptotics: Hitting Times and Probabilities. Example: We can use this last result to calculate the mean number of breeding attempts by a female Eastern Meadowlark in the Etterson model. In this model, state 4 (cessation of breeding activities) is the unique absorbing state and the probability of absorption there is equal to 1. Using the transition matrix estimated by these authors, we obtain the following system of equations: τ_1 = 1 + p_12 τ_2 + p_13 τ_3, τ_2 = 1 + p_21 τ_1 + p_24 τ_4, τ_3 = 1 + p_31 τ_1 + p_34 τ_4, τ_4 = 0, where the p_ij are the estimated transition probabilities. This system has a unique solution for τ_1, τ_2, τ_3, which are the mean numbers of transitions that occur prior to absorption. To determine the mean number of breeding attempts, we take X_0 = 1 and observe that every breeding attempt passes through two events (active breeding followed by fledging or failure). Thus the mean number of breeding attempts is τ_1/2.

42 Asymptotics: Hitting Times and Probabilities. Example: Birth-Death Processes. A discrete-time Markov chain X = (X_n : n ≥ 0) is said to be a birth-death process if X takes values in a set E = {0, 1, ..., N} with a tridiagonal transition matrix P whose only nonzero entries are p_{0,0} = 1 - b_0 and p_{0,1} = b_0; p_{k,k-1} = d_k, p_{k,k} = 1 - (b_k + d_k), and p_{k,k+1} = b_k for 1 ≤ k ≤ N - 1; and p_{N,N-1} = d_N and p_{N,N} = 1 - d_N. Such processes are often used to model the dynamics of populations in which individuals reproduce or die one at a time. However, a more general characterization of a birth-death process is that it is a Markov chain on a linearly ordered set in which transitions can only take place from a state to itself or to an immediately adjacent state. For example, birth-death processes can be used to model the movement of a motor protein such as dynein along a microtubule.

43 Asymptotics: Hitting Times and Probabilities. Suppose that a birth-death process X is used to model a population of individuals that are reproducing and dying. In this case, X_n will denote the number of individuals present in the population at time n. If X_n = k with 1 ≤ k ≤ N - 1, then there are three possibilities for X_{n+1}. With probability b_k, one of the k individuals reproduces and the population size increases to k + 1. Here we are assuming that each reproduction results in the birth of exactly one offspring. With probability d_k, one of the k individuals dies and the population size decreases to k - 1. With probability 1 - b_k - d_k, no individual reproduces or dies during that time step and the population size remains at k. Alternatively, we could stipulate that this is the probability that exactly one individual gives birth and one individual dies, in which case there is still no change in the population size.

44 Asymptotics: Hitting Times and Probabilities. The boundary states 0 and N need special consideration. If b_0 = 0, then 0 is an absorbing state corresponding to extinction of the population. If b_0 > 0, then 0 is no longer absorbing and we can think of this parameter as the probability (per unit time) that an extinct population is rescued by a migrant arriving from outside. If N is finite, then b_N = 0 and we are tacitly assuming that density dependence is so strong that no individual reproduces when the population is this large. However, we can also take N = ∞, in which case the population may grow without bound. Birth-death processes have been used extensively in the biological literature in part because they have many nice analytical properties that allow one to do explicit computations without having to resort to Monte Carlo simulations.

45 Asymptotics: Hitting Times and Probabilities. Suppose that N is finite, that b_0 = b_N = 0, and that the remaining probabilities b_1, ..., b_{N-1} and d_1, ..., d_N are all positive. In this case 0 is the unique absorbing state for the Markov chain and it can be shown that the population is certain to go extinct at some finite time, which we will denote T_extinct. Our goal is to calculate the expected time to extinction, τ_k = E_k[T_extinct], as a function of the initial number of individuals k = 0, ..., N in the population. By the preceding theorem, we know that the τ_k satisfy the following system of equations: τ_0 = 0; τ_k = 1 + d_k τ_{k-1} + (1 - b_k - d_k) τ_k + b_k τ_{k+1} for 1 ≤ k ≤ N - 1; and τ_N = 1 + d_N τ_{N-1} + (1 - d_N) τ_N.

46 Asymptotics: Hitting Times and Probabilities. By bringing all of the τ_k over to the left-hand side, we can rewrite this system in the following more concise form: d_k(τ_{k-1} - τ_k) + b_k(τ_{k+1} - τ_k) = -1 for 1 ≤ k ≤ N - 1, and d_N(τ_{N-1} - τ_N) = -1. Notice that this is a two-term recursion (i.e., τ_{k+2} depends only on τ_{k+1} and τ_k), which is analogous to a second-order ODE and can be solved analogously, effectively by integrating twice. For our first integration, we will define a new variable Δ_k = τ_{k+1} - τ_k for k = 0, ..., N - 1. Substituting this variable into the two equations above gives: b_k Δ_k - d_k Δ_{k-1} = -1 for 1 ≤ k ≤ N - 1, and d_N Δ_{N-1} = 1.

47 Asymptotics: Hitting Times and Probabilities. This latter system can be solved by starting at the upper boundary (k = N - 1) and working down. Indeed, we have: Δ_{N-1} = 1/d_N and Δ_{k-1} = (b_k/d_k) Δ_k + 1/d_k for 1 ≤ k ≤ N - 1. This shows that Δ_{N-2} = 1/d_{N-1} + b_{N-1}/(d_{N-1} d_N), Δ_{N-3} = 1/d_{N-2} + b_{N-2}/(d_{N-2} d_{N-1}) + b_{N-2} b_{N-1}/(d_{N-2} d_{N-1} d_N), and so on, with the general expression Δ_k = 1/d_{k+1} + Σ_{j=k+2}^{N} (b_{k+1} ··· b_{j-1})/(d_{k+1} ··· d_j) for k = 0, ..., N - 1.

48 Asymptotics: Hitting Times and Probabilities. For our second integration, we will start at the lower boundary k = 0 and work up. Since τ_0 = 0, we have τ_1 = τ_1 - τ_0 = Δ_0. Also, τ_2 = τ_1 + Δ_1 = Δ_0 + Δ_1, τ_3 = τ_2 + Δ_2 = Δ_0 + Δ_1 + Δ_2, and in general τ_k = Σ_{i=0}^{k-1} Δ_i for k = 1, ..., N. Using our result for Δ_i, it follows that the mean time to extinction when there are k individuals is τ_k = Σ_{i=0}^{k-1} [ 1/d_{i+1} + Σ_{j=i+2}^{N} (b_{i+1} ··· b_{j-1})/(d_{i+1} ··· d_j) ].
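
The sketch below (added; the parameter values are arbitrary illustrative choices) evaluates this formula numerically and checks it against a direct linear solve of the system τ_k = 1 + d_k τ_{k-1} + (1 - b_k - d_k) τ_k + b_k τ_{k+1}.

```python
import numpy as np

N = 10
b = np.array([0.0] + [0.2] * (N - 1) + [0.0])   # b_0, ..., b_N with b_0 = b_N = 0
d = np.array([0.0] + [0.3] * N)                  # d_1, ..., d_N (d[0] is unused)

def tau_formula(k):
    """tau_k = sum_{i=0}^{k-1} [ 1/d_{i+1} + sum_{j=i+2}^{N} (b_{i+1}...b_{j-1})/(d_{i+1}...d_j) ]"""
    total = 0.0
    for i in range(k):
        delta = 1.0 / d[i + 1]
        for j in range(i + 2, N + 1):
            delta += np.prod(b[i + 1:j]) / np.prod(d[i + 1:j + 1])
        total += delta
    return total

# Direct solve of the linear system for tau_1, ..., tau_N (with tau_0 = 0)
A = np.zeros((N, N))
for k in range(1, N + 1):
    A[k - 1, k - 1] = b[k] + d[k] if k < N else d[N]
    if k > 1:
        A[k - 1, k - 2] = -d[k]
    if k < N:
        A[k - 1, k] = -b[k]
tau_solve = np.linalg.solve(A, np.ones(N))

print(tau_formula(3), tau_solve[2])   # the two values should agree
```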

49 Asymptotics: Hitting Times and Probabilities. Remarks: The results described in the previous two theorems are most useful when the state space is small or the transition matrix is sparse, i.e., most of the transition probabilities p_ij are equal to 0. Indeed, sparseness is part of the reason that birth-death processes are so amenable to analytical calculations. If these conditions are not satisfied, then it may be impractical to solve these systems of linear equations. In such cases, absorption probabilities and mean absorption times can be estimated by conducting stochastic simulations of the model and observing the proportion of simulations that end in absorption in a closed class C, as well as the times required to arrive in C.
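
A Monte Carlo sketch of that idea (added; the function name estimate_absorption and the replicate counts are illustrative, and the estimate of the mean time is conditional on runs that absorb within max_steps steps):

```python
import numpy as np

def estimate_absorption(P, start, C, n_reps=10000, max_steps=10000, rng=None):
    """Monte Carlo estimates of the probability of absorption in the closed set C
    within max_steps steps, and of the mean absorption time among absorbed runs."""
    rng = np.random.default_rng() if rng is None else rng
    m = P.shape[0]
    hits, times = 0, []
    for _ in range(n_reps):
        x = start
        for t in range(max_steps):
            if x in C:
                hits += 1
                times.append(t)
                break
            x = rng.choice(m, p=P[x])
    prob = hits / n_reps
    mean_time = np.mean(times) if times else np.nan
    return prob, mean_time
```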

50 Asymptotics: Stationary Distributions. Stationary Distributions. Recall that we say that a Markov chain is irreducible if the entire state space is a communicating class, i.e., if every pair of states communicates. In particular, because irreducible chains lack absorbing states, we usually cannot predict the states that will be occupied by the chain at distant times in the future, especially if the initial distribution is unknown. On the other hand, sometimes we can determine the distribution of the chain at such future times even when we don't know the initial distribution. The following definition is key. Definition: Let X be a discrete-time Markov chain with values in E and transition matrix P. A distribution π on E is said to be a stationary distribution for X if πP = π. In other words, π is a left eigenvector for P with corresponding eigenvalue 1.

51 Asymptotics: Stationary Distributions. The probabilistic meaning of stationary distributions is revealed by the following theorem. Theorem: Suppose that π is a stationary distribution for a Markov chain X with transition matrix P. If π is the distribution of X_0, then π is also the distribution of X_n for every n ≥ 0. Proof: Since X has initial distribution π and transition matrix P, we know that the distribution of X_n is πP^n = (πP)P^{n-1} = πP^{n-1} = ··· = π. In other words, any stationary distribution of a Markov chain is time invariant: if ever the process has π as its distribution, then it will retain this distribution at all future times.

52 Asymptotics: Stationary Distributions. Example: Suppose that X is a Markov chain on the state space E = {1, 2} with the transition matrix P = ( a  1 - a ; 1 - b  b ), where a, b ∈ (0, 1). Since π = (p, 1 - p) is a stationary distribution for X if and only if p satisfies the system of equations ap + (1 - b)(1 - p) = p and (1 - a)p + b(1 - p) = 1 - p, it follows that π = ( (1 - b)/(2 - a - b), (1 - a)/(2 - a - b) ) is the unique stationary distribution for X.
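
Numerically, the stationary distribution can be obtained as the normalized left eigenvector of P for the eigenvalue 1; the sketch below (added, with arbitrary illustrative values of a and b) checks this against the closed form from the slide.

```python
import numpy as np

a, b = 0.7, 0.4
P = np.array([[a, 1 - a],
              [1 - b, b]])

# Left eigenvector of P for eigenvalue 1 (i.e. a right eigenvector of P.T), normalized
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

print(pi)
print([(1 - b) / (2 - a - b), (1 - a) / (2 - a - b)])   # closed form from the slide
```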

53 Asymptotics: Stationary Distributions. As the following examples illustrate, neither existence nor uniqueness of stationary distributions is assured. Example: Let Z_1, Z_2, ... be a sequence of i.i.d. Bernoulli random variables with success probability p > 0 and let X = (X_n : n ≥ 0) be the random walk defined by setting X_0 = 0 and X_n = Z_1 + ··· + Z_n for n ≥ 1. By the strong law of large numbers, we know that P(lim_{n→∞} X_n/n = p) = 1. However, this implies that X_n tends to infinity almost surely and so X has no stationary distribution on the integers.

54 Asymptotics: Stationary Distributions. Example: Suppose that X is a Markov chain on the state space E = {1, 2, 3, 4} with the transition matrix
P =
( 1      0      0      0 )
( a      0      1 - a  0 )
( 0      1 - b  0      b )
( 0      0      0      1 )
where a, b ∈ (0, 1). As we saw earlier, 1 and 4 are both absorbing states. Furthermore, any distribution of the form (p, 0, 0, 1 - p) with p ∈ [0, 1] is a stationary distribution of the chain, and so there are uncountably many stationary distributions. Indeed, any chain with more than one absorbing state will have infinitely many stationary distributions.

55 Asymptotics: Stationary Distributions. Under certain conditions, we can guarantee not only that a Markov chain will have a unique stationary distribution π, but also that the distribution of the chain at time n will tend to π as n → ∞ no matter what the initial distribution may be. For this to be true, we need to exclude certain types of oscillatory behavior, hence the following definition. Definition: A discrete-time Markov chain with values in E and transition matrix P is said to be aperiodic if for every state i ∈ E we have p^(n)_ii > 0 for all sufficiently large n. The need for aperiodicity is illustrated by the following example. If X is the Markov chain with transition matrix P = ( 0  1 ; 1  0 ), then π = (1/2, 1/2) is the unique stationary distribution. However, the distribution of the chain will be equal to the initial distribution at all future times n with n even. Indeed, this chain is periodic with period 2.

56 Asymptotics: Stationary Distributions. The following theorem provides sufficient conditions for a chain to converge to a unique stationary distribution from any initial distribution. Theorem: Suppose that X is an irreducible, aperiodic Markov chain with values in a finite set E and transition matrix P. Then X has a unique stationary distribution π and, for any initial distribution µ, we have lim_{n→∞} P(X_n = j) = π_j for every j ∈ E. In particular, lim_{n→∞} p^(n)_ij = π_j for all i, j ∈ E. Remark: Although existence of a stationary distribution is not guaranteed if we allow E to be infinite, the uniqueness and convergence parts of this theorem remain true for arbitrary state spaces whenever a stationary distribution can be shown to exist.

57 Asymptotics: Stationary Distributions. Example: If X is a Markov chain on the state space E = {1, 2} with transition matrix P = ( a  1 - a ; 1 - b  b ), where a, b ∈ (0, 1), then X is aperiodic and irreducible and
P^n = 1/(2 - a - b) ( 1 - b  1 - a ; 1 - b  1 - a ) + (a + b - 1)^n/(2 - a - b) ( 1 - a  -(1 - a) ; -(1 - b)  1 - b )
for every n ≥ 0. Furthermore, since |a + b - 1| < 1, it is clear that lim_{n→∞} P^n = ( (1 - b)/(2 - a - b)  (1 - a)/(2 - a - b) ; (1 - b)/(2 - a - b)  (1 - a)/(2 - a - b) ), where each row of the limiting matrix is the stationary distribution π that we found by direct calculation. In particular, we see that the rate of convergence to the stationary distribution is geometric and depends on the magnitude of a + b - 1.
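
A quick numerical check of this spectral decomposition (added; the values of a, b, and n are arbitrary illustrative choices):

```python
import numpy as np

a, b = 0.7, 0.4
P = np.array([[a, 1 - a], [1 - b, b]])
n = 5

M1 = np.array([[1 - b, 1 - a], [1 - b, 1 - a]]) / (2 - a - b)
M2 = np.array([[1 - a, -(1 - a)], [-(1 - b), 1 - b]]) / (2 - a - b)
print(np.allclose(np.linalg.matrix_power(P, n), M1 + (a + b - 1) ** n * M2))   # True
```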

58 Asymptotics: Stationary Distributions. Convergence to Stationarity. It is often (although not always) the case that convergence to the stationary distribution occurs at a geometric rate. For example, let P be the transition matrix of a Markov chain on a finite space E = {1, ..., n} with stationary distribution π and suppose that P has n distinct eigenvalues λ_1, ..., λ_n with 1 = λ_1 ≥ |λ_2| ≥ ··· ≥ |λ_n|. In this case, P is diagonalizable and there is a basis of left eigenvectors, say π^(1), ..., π^(n), with π^(1) = π, such that π^(i) P = λ_i π^(i). In particular, any distribution ν on E can be expanded as a linear combination of left eigenvectors, say ν = c_1 π^(1) + ··· + c_n π^(n).

59 Asymptotics: Stationary Distributions. Suppose that ν is the initial distribution of the chain and that |λ_2| < 1. Then the distribution of X_t is given by νP^t = (c_1 π^(1) + ··· + c_n π^(n)) P^t = c_1 λ_1^t π^(1) + c_2 λ_2^t π^(2) + ··· + c_n λ_n^t π^(n) = c_1 π + O(|λ_2|^t) → π as t → ∞. This shows that c_1 = 1 and that νP^t converges to π geometrically, at a rate which depends on the leading non-unit eigenvalue λ_2.

60 Asymptotics: Stationary Distributions. Example: Suppose that X is a birth-death process with N < ∞ and that the probabilities b_0, ..., b_{N-1} and d_1, ..., d_N all lie in the interval (0, 1), i.e., X is a birth-death process with immigration into extinct populations. In this case, X is also an irreducible, aperiodic Markov chain on a finite state space and so X has a unique stationary distribution π which satisfies the following system of equations: (1 - b_0)π_0 + d_1 π_1 = π_0; b_{k-1} π_{k-1} + (1 - b_k - d_k) π_k + d_{k+1} π_{k+1} = π_k for k = 1, ..., N - 1; and b_{N-1} π_{N-1} + (1 - d_N) π_N = π_N. By subtracting the probabilities on the right-hand side from both sides, this system can be rewritten in the following form: -b_0 π_0 + d_1 π_1 = 0; b_{k-1} π_{k-1} - (b_k + d_k) π_k + d_{k+1} π_{k+1} = 0 for k = 1, ..., N - 1; and b_{N-1} π_{N-1} - d_N π_N = 0.

61 Asymptotics: Stationary Distributions. Because each equation depends only on two or three successive probabilities, it is possible to solve this system recursively. The first equation can be rewritten as d_1 π_1 = b_0 π_0, which implies that π_1 = (b_0/d_1) π_0. Taking k = 1, we have b_0 π_0 - d_1 π_1 - b_1 π_1 + d_2 π_2 = 0. However, since the first two terms cancel, this reduces to -b_1 π_1 + d_2 π_2 = 0, which shows that π_2 = (b_1/d_2) π_1 = (b_1 b_0)/(d_2 d_1) π_0.

62 Asymptotics: Stationary Distributions. Continuing in this way, we find that π_k = (b_{k-1}/d_k) π_{k-1} = (b_{k-1} ··· b_0)/(d_k ··· d_1) π_0 for k = 1, ..., N. All that remains to be determined is π_0. However, since π is a probability distribution on E, the probabilities must sum to one. Consequently, 1 = Σ_{k=0}^{N} π_k = π_0 ( 1 + Σ_{k=1}^{N} (b_{k-1} ··· b_0)/(d_k ··· d_1) ), which shows that π_0 = ( 1 + Σ_{k=1}^{N} (b_{k-1} ··· b_0)/(d_k ··· d_1) )^{-1}.

63 Asymptotics: Stationary Distributions. Combining these results, we see that the stationary distribution for this birth-death chain is π_k = (b_{k-1} ··· b_0)/(d_k ··· d_1) ( 1 + Σ_{j=1}^{N} (b_{j-1} ··· b_0)/(d_j ··· d_1) )^{-1} for k = 1, ..., N, with π_0 as above. For example, if b_0 = m, b_1 = ··· = b_{N-1} = b, and d_1 = ··· = d_N = d, then π_0 = ( 1 + (m/d)(1 - γ^N)/(1 - γ) )^{-1} and π_k = (m/d) γ^{k-1} π_0 for k = 1, ..., N, provided γ ≡ b/d ≠ 1.
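
A small sketch (added; the helper name bd_stationary and the parameter values are illustrative) that computes π directly from the product formula and checks the geometric special case above.

```python
import numpy as np

def bd_stationary(b, d):
    """Stationary distribution of a birth-death chain on {0, ..., N}.
    b[k] = birth probability in state k (k = 0, ..., N-1),
    d[k-1] = death probability d_k in state k (k = 1, ..., N)."""
    N = len(d)
    w = np.ones(N + 1)                            # unnormalized weights, w[0] = 1
    for k in range(1, N + 1):
        w[k] = w[k - 1] * b[k - 1] / d[k - 1]     # pi_k / pi_0 = (b_{k-1}...b_0)/(d_k...d_1)
    return w / w.sum()

# Geometric special case: b_0 = m, b_1 = ... = b_{N-1} = b, d_k = d
N, m, bb, dd = 8, 0.1, 0.2, 0.3
pi = bd_stationary([m] + [bb] * (N - 1), [dd] * N)
gamma = bb / dd
pi0 = 1.0 / (1.0 + (m / dd) * (1 - gamma ** N) / (1 - gamma))
print(np.allclose(pi[0], pi0))   # True
```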

64 Time Reversal. Time Reversal of Markov Chains. Suppose that X = (X_t : t ∈ N) is a discrete-time Markov chain, fix a time horizon T > 0, and for each 0 ≤ t ≤ T let X̃_t = X_{T-t}. Then X̃ = (X̃_t : t ≥ 0) is itself a Markov process on E, called the time reversal of X, and the transition probabilities of this Markov chain are given by p̃_xy(t) ≡ P(X̃_{t+1} = y | X̃_t = x) = P(X_{T-(t+1)} = y | X_{T-t} = x) = P(X_{T-t} = x | X_{T-t-1} = y) P(X_{T-t-1} = y)/P(X_{T-t} = x) = p_yx P(X_{T-t-1} = y)/P(X_{T-t} = x). As is clear from the last line, the transition probabilities of the time reversal of a Markov chain are usually time-inhomogeneous even if the original chain is time-homogeneous.

65 Time Reversal. One setting in which the time reversal of a Markov chain is certain to be time-homogeneous is when the original chain is stationary, say with distribution π. In this case, the distribution of X_t is π for all t and so the transition probabilities of the time-reversed process are p̃_xy = p_yx P(X_{T-t-1} = y)/P(X_{T-t} = x) = p_yx π_y/π_x. Furthermore, if we let Π be the diagonal matrix with the entries of π along the diagonal (and zeros everywhere else), then the transition matrix P̃ of the time-reversed chain is P̃ = Π^{-1} P^T Π, where P^T denotes the transpose of P.

66 Time Reversal. It is usually the case that a stationary Markov chain and its time reversal have different transition matrices and therefore are fundamentally different processes. The one exception occurs when the condition described in the following definition is satisfied. Definition: Suppose that X is a discrete-time Markov chain on E with transition matrix P and let π be a stationary distribution for X. The chain is said to satisfy the detailed balance conditions with respect to π if the identities π_x p_xy = π_y p_yx are satisfied for every pair of states x, y ∈ E. If the detailed balance conditions are satisfied, then p̃_xy = p_yx π_y/π_x = p_xy π_x/π_x = p_xy, which shows that P̃ = P, i.e., the chain and its time reversal have the same transition matrix and so the process looks the same run forwards or backwards in time.

67 Time Reversal. Example: Let X be the birth-death process on E = {0, 1, ..., N} with stationary distribution π_0 = ( 1 + Σ_{k=1}^{N} (b_{k-1} ··· b_0)/(d_k ··· d_1) )^{-1} and π_k = (b_{k-1} ··· b_0)/(d_k ··· d_1) π_0 for k = 1, ..., N. Then X satisfies the detailed balance conditions with respect to π. Indeed, since most of the transition probabilities are zero, we need only check the following cases: π_0 p_01 = π_0 b_0 = π_1 d_1 = π_1 p_10, and π_k p_{k,k+1} = π_k b_k = π_{k+1} d_{k+1} = π_{k+1} p_{k+1,k} for k = 1, ..., N - 1. This shows that the time reversal of a stationary birth-death process is also a birth-death process with the same birth and death probabilities.

68 Hidden Markov Models. Models and Observations. Suppose that we are interested in the spread of a communicable infectious disease through a population and that we decide to investigate this using the stochastic SIR model introduced previously: S_{t+1} = S_t - w_t, I_{t+1} = I_t + w_t - r_t, R_{t+1} = R_t + r_t, where w_t ~ Binomial(S_t, 1 - (1 - γ + γ(1 - β))^{I_t}) and r_t ~ Binomial(I_t, ρ). Here, w_t and r_t are the numbers of new infections and new recoveries between day t and day t + 1, respectively, while γ is the probability (per day) of a contact between two individuals, β is the probability of disease transmission given that there is a contact between a susceptible and an infected individual, and ρ is the probability (per day) of recovery.
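
A simulation sketch of this SIR chain (added here; the function name simulate_sir and the parameter values in the usage lines are illustrative assumptions, not estimates from the notes):

```python
import numpy as np

def simulate_sir(S0, I0, R0, gamma, beta, rho, T, rng=None):
    """Simulate the stochastic SIR chain for T days; returns arrays S, I, R."""
    rng = np.random.default_rng() if rng is None else rng
    S = np.zeros(T + 1, dtype=int)
    I = np.zeros(T + 1, dtype=int)
    R = np.zeros(T + 1, dtype=int)
    S[0], I[0], R[0] = S0, I0, R0
    for t in range(T):
        p_inf = 1.0 - (1.0 - gamma + gamma * (1.0 - beta)) ** I[t]  # per-susceptible infection probability
        w = rng.binomial(S[t], p_inf)    # new infections on day t
        r = rng.binomial(I[t], rho)      # new recoveries on day t
        S[t + 1] = S[t] - w
        I[t + 1] = I[t] + w - r
        R[t + 1] = R[t] + r
    return S, I, R

S, I, R = simulate_sir(S0=990, I0=10, R0=0, gamma=0.01, beta=0.5, rho=0.2, T=100)
print(I.max(), R[-1])   # peak prevalence and final epidemic size
```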

69 Hidden Markov Models. Since S_t + I_t + R_t = N is constant for all t ≥ 0, the variables {(S_t, I_t) : t ≥ 0} form a discrete-time Markov chain with values in the state space E = {(s, i) : i, s ≥ 0, i + s ≤ N} and transition probabilities p_{(s,i),(s',i')} = P((S_{t+1}, I_{t+1}) = (s', i') | (S_t, I_t) = (s, i)) = C(s, s - s') π_i^{s-s'} (1 - π_i)^{s'} C(i, i + s - s' - i') ρ^{i+s-s'-i'} (1 - ρ)^{i'+s'-s}, where C(n, k) denotes the binomial coefficient, π_i = 1 - (1 - γ + γ(1 - β))^i, and we require that 0 ≤ s' ≤ s and 0 ≤ i + s - s' - i' ≤ i. While the transition matrix is complicated, at the very least we can conduct Monte Carlo simulations of this process to investigate quantities such as the duration of the epidemic and the total number of infections.

70 Hidden Markov Models. However, what if the values of the parameters γ, β, and ρ are not known in advance, but instead must be estimated from data? There are two scenarios. Scenario I: Suppose that we have access to time series data consisting of the numbers of susceptible and infected individuals at each time in some period [0, T], i.e., we have data D = {(s_t, i_t) : 0 ≤ t ≤ T}, and we believe this data to be accurate. In this case, we can use the Markov property to calculate the probability of the data for each possible set of parameter values, and we will call the resulting quantity the likelihood of the parameter values Θ = (γ, β, ρ): L(Θ | D) ≡ P((S_t, I_t) = (s_t, i_t), 0 ≤ t ≤ T | Θ) = Π_{t=0}^{T-1} p_{(s_t,i_t),(s_{t+1},i_{t+1})}(Θ).
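
Because this likelihood is a product of one-step binomial transition probabilities, it can be evaluated directly; the sketch below (added, with the hypothetical function name sir_loglik, using scipy's binomial log-pmf) computes the log-likelihood of an observed trajectory, returning negative infinity for trajectories the model cannot produce.

```python
import numpy as np
from scipy.stats import binom

def sir_loglik(theta, s_obs, i_obs):
    """Log-likelihood of an observed (s_t, i_t) trajectory under the SIR chain."""
    gamma, beta, rho = theta
    ll = 0.0
    for t in range(len(s_obs) - 1):
        s, i = s_obs[t], i_obs[t]
        w = s - s_obs[t + 1]        # new infections on day t
        r = i + w - i_obs[t + 1]    # new recoveries on day t
        if w < 0 or w > s or r < 0 or r > i:
            return -np.inf          # impossible transition under the model
        p_inf = 1.0 - (1.0 - gamma + gamma * (1.0 - beta)) ** i
        ll += binom.logpmf(w, s, p_inf) + binom.logpmf(r, i, rho)
    return ll
```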

71 Hidden Markov Models. The true values of the parameters can then be estimated either by finding the value Θ̂_ML that maximizes the likelihood function, Θ̂_ML ≡ argmax_Θ L(Θ | D), or by choosing a prior distribution for the parameters, say p(Θ), and then using Bayes' formula in conjunction with the likelihood function to calculate the posterior distribution of the parameters given the data D: p(Θ | D) = p(Θ) L(Θ | D) / P(D), where P(D) = ∫ p(Θ) L(Θ | D) dΘ = ∫ p(Θ) P(D | Θ) dΘ is the so-called prior predictive probability of the data. The first approach is called maximum likelihood estimation, while the second is Bayesian inference, which we will discuss in some depth later in the semester.


More information

Markov Chains. Sarah Filippi Department of Statistics TA: Luke Kelly

Markov Chains. Sarah Filippi Department of Statistics  TA: Luke Kelly Markov Chains Sarah Filippi Department of Statistics http://www.stats.ox.ac.uk/~filippi TA: Luke Kelly With grateful acknowledgements to Prof. Yee Whye Teh's slides from 2013 14. Schedule 09:30-10:30 Lecture:

More information

Markov Chains. Arnoldo Frigessi Bernd Heidergott November 4, 2015

Markov Chains. Arnoldo Frigessi Bernd Heidergott November 4, 2015 Markov Chains Arnoldo Frigessi Bernd Heidergott November 4, 2015 1 Introduction Markov chains are stochastic models which play an important role in many applications in areas as diverse as biology, finance,

More information

2. Transience and Recurrence

2. Transience and Recurrence Virtual Laboratories > 15. Markov Chains > 1 2 3 4 5 6 7 8 9 10 11 12 2. Transience and Recurrence The study of Markov chains, particularly the limiting behavior, depends critically on the random times

More information

http://www.math.uah.edu/stat/markov/.xhtml 1 of 9 7/16/2009 7:20 AM Virtual Laboratories > 16. Markov Chains > 1 2 3 4 5 6 7 8 9 10 11 12 1. A Markov process is a random process in which the future is

More information

8. Statistical Equilibrium and Classification of States: Discrete Time Markov Chains

8. Statistical Equilibrium and Classification of States: Discrete Time Markov Chains 8. Statistical Equilibrium and Classification of States: Discrete Time Markov Chains 8.1 Review 8.2 Statistical Equilibrium 8.3 Two-State Markov Chain 8.4 Existence of P ( ) 8.5 Classification of States

More information

Stochastic Processes (Week 6)

Stochastic Processes (Week 6) Stochastic Processes (Week 6) October 30th, 2014 1 Discrete-time Finite Markov Chains 2 Countable Markov Chains 3 Continuous-Time Markov Chains 3.1 Poisson Process 3.2 Finite State Space 3.2.1 Kolmogrov

More information

1 Types of stochastic models

1 Types of stochastic models 1 Types of stochastic models Models so far discussed are all deterministic, meaning that, if the present state were perfectly known, it would be possible to predict exactly all future states. We have seen

More information

LIMITING PROBABILITY TRANSITION MATRIX OF A CONDENSED FIBONACCI TREE

LIMITING PROBABILITY TRANSITION MATRIX OF A CONDENSED FIBONACCI TREE International Journal of Applied Mathematics Volume 31 No. 18, 41-49 ISSN: 1311-178 (printed version); ISSN: 1314-86 (on-line version) doi: http://dx.doi.org/1.173/ijam.v31i.6 LIMITING PROBABILITY TRANSITION

More information

MARKOV CHAIN MONTE CARLO

MARKOV CHAIN MONTE CARLO MARKOV CHAIN MONTE CARLO RYAN WANG Abstract. This paper gives a brief introduction to Markov Chain Monte Carlo methods, which offer a general framework for calculating difficult integrals. We start with

More information

A matrix over a field F is a rectangular array of elements from F. The symbol

A matrix over a field F is a rectangular array of elements from F. The symbol Chapter MATRICES Matrix arithmetic A matrix over a field F is a rectangular array of elements from F The symbol M m n (F ) denotes the collection of all m n matrices over F Matrices will usually be denoted

More information

Lecture 5: Random Walks and Markov Chain

Lecture 5: Random Walks and Markov Chain Spectral Graph Theory and Applications WS 20/202 Lecture 5: Random Walks and Markov Chain Lecturer: Thomas Sauerwald & He Sun Introduction to Markov Chains Definition 5.. A sequence of random variables

More information

Stat 516, Homework 1

Stat 516, Homework 1 Stat 516, Homework 1 Due date: October 7 1. Consider an urn with n distinct balls numbered 1,..., n. We sample balls from the urn with replacement. Let N be the number of draws until we encounter a ball

More information

Stochastic Processes

Stochastic Processes Stochastic Processes 8.445 MIT, fall 20 Mid Term Exam Solutions October 27, 20 Your Name: Alberto De Sole Exercise Max Grade Grade 5 5 2 5 5 3 5 5 4 5 5 5 5 5 6 5 5 Total 30 30 Problem :. True / False

More information

STAT STOCHASTIC PROCESSES. Contents

STAT STOCHASTIC PROCESSES. Contents STAT 3911 - STOCHASTIC PROCESSES ANDREW TULLOCH Contents 1. Stochastic Processes 2 2. Classification of states 2 3. Limit theorems for Markov chains 4 4. First step analysis 5 5. Branching processes 5

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

IEOR 6711: Professor Whitt. Introduction to Markov Chains

IEOR 6711: Professor Whitt. Introduction to Markov Chains IEOR 6711: Professor Whitt Introduction to Markov Chains 1. Markov Mouse: The Closed Maze We start by considering how to model a mouse moving around in a maze. The maze is a closed space containing nine

More information

Recap. Probability, stochastic processes, Markov chains. ELEC-C7210 Modeling and analysis of communication networks

Recap. Probability, stochastic processes, Markov chains. ELEC-C7210 Modeling and analysis of communication networks Recap Probability, stochastic processes, Markov chains ELEC-C7210 Modeling and analysis of communication networks 1 Recap: Probability theory important distributions Discrete distributions Geometric distribution

More information

18.175: Lecture 30 Markov chains

18.175: Lecture 30 Markov chains 18.175: Lecture 30 Markov chains Scott Sheffield MIT Outline Review what you know about finite state Markov chains Finite state ergodicity and stationarity More general setup Outline Review what you know

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Markov chain Monte Carlo

Markov chain Monte Carlo 1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

MATH 56A: STOCHASTIC PROCESSES CHAPTER 2

MATH 56A: STOCHASTIC PROCESSES CHAPTER 2 MATH 56A: STOCHASTIC PROCESSES CHAPTER 2 2. Countable Markov Chains I started Chapter 2 which talks about Markov chains with a countably infinite number of states. I did my favorite example which is on

More information

At the boundary states, we take the same rules except we forbid leaving the state space, so,.

At the boundary states, we take the same rules except we forbid leaving the state space, so,. Birth-death chains Monday, October 19, 2015 2:22 PM Example: Birth-Death Chain State space From any state we allow the following transitions: with probability (birth) with probability (death) with probability

More information

6 Markov Chain Monte Carlo (MCMC)

6 Markov Chain Monte Carlo (MCMC) 6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution

More information

Ch5. Markov Chain Monte Carlo

Ch5. Markov Chain Monte Carlo ST4231, Semester I, 2003-2004 Ch5. Markov Chain Monte Carlo In general, it is very difficult to simulate the value of a random vector X whose component random variables are dependent. In this chapter we

More information

ISM206 Lecture, May 12, 2005 Markov Chain

ISM206 Lecture, May 12, 2005 Markov Chain ISM206 Lecture, May 12, 2005 Markov Chain Instructor: Kevin Ross Scribe: Pritam Roy May 26, 2005 1 Outline of topics for the 10 AM lecture The topics are: Discrete Time Markov Chain Examples Chapman-Kolmogorov

More information

Some Results on the Ergodicity of Adaptive MCMC Algorithms

Some Results on the Ergodicity of Adaptive MCMC Algorithms Some Results on the Ergodicity of Adaptive MCMC Algorithms Omar Khalil Supervisor: Jeffrey Rosenthal September 2, 2011 1 Contents 1 Andrieu-Moulines 4 2 Roberts-Rosenthal 7 3 Atchadé and Fort 8 4 Relationship

More information

The SIS and SIR stochastic epidemic models revisited

The SIS and SIR stochastic epidemic models revisited The SIS and SIR stochastic epidemic models revisited Jesús Artalejo Faculty of Mathematics, University Complutense of Madrid Madrid, Spain jesus_artalejomat.ucm.es BCAM Talk, June 16, 2011 Talk Schedule

More information

Markov Chains and Hidden Markov Models

Markov Chains and Hidden Markov Models Chapter 1 Markov Chains and Hidden Markov Models In this chapter, we will introduce the concept of Markov chains, and show how Markov chains can be used to model signals using structures such as hidden

More information

[POLS 8500] Review of Linear Algebra, Probability and Information Theory

[POLS 8500] Review of Linear Algebra, Probability and Information Theory [POLS 8500] Review of Linear Algebra, Probability and Information Theory Professor Jason Anastasopoulos ljanastas@uga.edu January 12, 2017 For today... Basic linear algebra. Basic probability. Programming

More information

Chapter 16 focused on decision making in the face of uncertainty about one future

Chapter 16 focused on decision making in the face of uncertainty about one future 9 C H A P T E R Markov Chains Chapter 6 focused on decision making in the face of uncertainty about one future event (learning the true state of nature). However, some decisions need to take into account

More information

Eigenvalues and Eigenvectors

Eigenvalues and Eigenvectors LECTURE 3 Eigenvalues and Eigenvectors Definition 3.. Let A be an n n matrix. The eigenvalue-eigenvector problem for A is the problem of finding numbers λ and vectors v R 3 such that Av = λv. If λ, v are

More information

88 CONTINUOUS MARKOV CHAINS

88 CONTINUOUS MARKOV CHAINS 88 CONTINUOUS MARKOV CHAINS 3.4. birth-death. Continuous birth-death Markov chains are very similar to countable Markov chains. One new concept is explosion which means that an infinite number of state

More information

Markov Chains. As part of Interdisciplinary Mathematical Modeling, By Warren Weckesser Copyright c 2006.

Markov Chains. As part of Interdisciplinary Mathematical Modeling, By Warren Weckesser Copyright c 2006. Markov Chains As part of Interdisciplinary Mathematical Modeling, By Warren Weckesser Copyright c 2006 1 Introduction A (finite) Markov chain is a process with a finite number of states (or outcomes, or

More information

A REVIEW AND APPLICATION OF HIDDEN MARKOV MODELS AND DOUBLE CHAIN MARKOV MODELS

A REVIEW AND APPLICATION OF HIDDEN MARKOV MODELS AND DOUBLE CHAIN MARKOV MODELS A REVIEW AND APPLICATION OF HIDDEN MARKOV MODELS AND DOUBLE CHAIN MARKOV MODELS Michael Ryan Hoff A Dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Chapter 5 Markov Chain Monte Carlo MCMC is a kind of improvement of the Monte Carlo method By sampling from a Markov chain whose stationary distribution is the desired sampling distributuion, it is possible

More information

Monte Carlo Methods. Leon Gu CSD, CMU

Monte Carlo Methods. Leon Gu CSD, CMU Monte Carlo Methods Leon Gu CSD, CMU Approximate Inference EM: y-observed variables; x-hidden variables; θ-parameters; E-step: q(x) = p(x y, θ t 1 ) M-step: θ t = arg max E q(x) [log p(y, x θ)] θ Monte

More information

INTRODUCTION TO MARKOV CHAINS AND MARKOV CHAIN MIXING

INTRODUCTION TO MARKOV CHAINS AND MARKOV CHAIN MIXING INTRODUCTION TO MARKOV CHAINS AND MARKOV CHAIN MIXING ERIC SHANG Abstract. This paper provides an introduction to Markov chains and their basic classifications and interesting properties. After establishing

More information

CS145: Probability & Computing Lecture 18: Discrete Markov Chains, Equilibrium Distributions

CS145: Probability & Computing Lecture 18: Discrete Markov Chains, Equilibrium Distributions CS145: Probability & Computing Lecture 18: Discrete Markov Chains, Equilibrium Distributions Instructor: Erik Sudderth Brown University Computer Science April 14, 215 Review: Discrete Markov Chains Some

More information

Markov Chains and Pandemics

Markov Chains and Pandemics Markov Chains and Pandemics Caleb Dedmore and Brad Smith December 8, 2016 Page 1 of 16 Abstract Markov Chain Theory is a powerful tool used in statistical analysis to make predictions about future events

More information

Linear Programming Redux

Linear Programming Redux Linear Programming Redux Jim Bremer May 12, 2008 The purpose of these notes is to review the basics of linear programming and the simplex method in a clear, concise, and comprehensive way. The book contains

More information

Stochastic Models: Markov Chains and their Generalizations

Stochastic Models: Markov Chains and their Generalizations Scuola di Dottorato in Scienza ed Alta Tecnologia Dottorato in Informatica Universita di Torino Stochastic Models: Markov Chains and their Generalizations Gianfranco Balbo e Andras Horvath Outline Introduction

More information

Bayesian Methods with Monte Carlo Markov Chains II

Bayesian Methods with Monte Carlo Markov Chains II Bayesian Methods with Monte Carlo Markov Chains II Henry Horng-Shing Lu Institute of Statistics National Chiao Tung University hslu@stat.nctu.edu.tw http://tigpbp.iis.sinica.edu.tw/courses.htm 1 Part 3

More information

6 Evolution of Networks

6 Evolution of Networks last revised: March 2008 WARNING for Soc 376 students: This draft adopts the demography convention for transition matrices (i.e., transitions from column to row). 6 Evolution of Networks 6. Strategic network

More information

Modelling of Dependent Credit Rating Transitions

Modelling of Dependent Credit Rating Transitions ling of (Joint work with Uwe Schmock) Financial and Actuarial Mathematics Vienna University of Technology Wien, 15.07.2010 Introduction Motivation: Volcano on Iceland erupted and caused that most of the

More information

Chapter 2: Markov Chains and Queues in Discrete Time

Chapter 2: Markov Chains and Queues in Discrete Time Chapter 2: Markov Chains and Queues in Discrete Time L. Breuer University of Kent 1 Definition Let X n with n N 0 denote random variables on a discrete space E. The sequence X = (X n : n N 0 ) is called

More information

n α 1 α 2... α m 1 α m σ , A =

n α 1 α 2... α m 1 α m σ , A = The Leslie Matrix The Leslie matrix is a generalization of the above. It is a matrix which describes the increases in numbers in various age categories of a population year-on-year. As above we write p

More information

Course 495: Advanced Statistical Machine Learning/Pattern Recognition

Course 495: Advanced Statistical Machine Learning/Pattern Recognition Course 495: Advanced Statistical Machine Learning/Pattern Recognition Lecturer: Stefanos Zafeiriou Goal (Lectures): To present discrete and continuous valued probabilistic linear dynamical systems (HMMs

More information

Convex Optimization CMU-10725

Convex Optimization CMU-10725 Convex Optimization CMU-10725 Simulated Annealing Barnabás Póczos & Ryan Tibshirani Andrey Markov Markov Chains 2 Markov Chains Markov chain: Homogen Markov chain: 3 Markov Chains Assume that the state

More information

12 Markov chains The Markov property

12 Markov chains The Markov property 12 Markov chains Summary. The chapter begins with an introduction to discrete-time Markov chains, and to the use of matrix products and linear algebra in their study. The concepts of recurrence and transience

More information

Mean-field dual of cooperative reproduction

Mean-field dual of cooperative reproduction The mean-field dual of systems with cooperative reproduction joint with Tibor Mach (Prague) A. Sturm (Göttingen) Friday, July 6th, 2018 Poisson construction of Markov processes Let (X t ) t 0 be a continuous-time

More information

Stochastic processes and

Stochastic processes and Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl wieringen@vu nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University

More information

Probability, Random Processes and Inference

Probability, Random Processes and Inference INSTITUTO POLITÉCNICO NACIONAL CENTRO DE INVESTIGACION EN COMPUTACION Laboratorio de Ciberseguridad Probability, Random Processes and Inference Dr. Ponciano Jorge Escamilla Ambrosio pescamilla@cic.ipn.mx

More information

Chapter 5. Continuous-Time Markov Chains. Prof. Shun-Ren Yang Department of Computer Science, National Tsing Hua University, Taiwan

Chapter 5. Continuous-Time Markov Chains. Prof. Shun-Ren Yang Department of Computer Science, National Tsing Hua University, Taiwan Chapter 5. Continuous-Time Markov Chains Prof. Shun-Ren Yang Department of Computer Science, National Tsing Hua University, Taiwan Continuous-Time Markov Chains Consider a continuous-time stochastic process

More information

Markov Processes Hamid R. Rabiee

Markov Processes Hamid R. Rabiee Markov Processes Hamid R. Rabiee Overview Markov Property Markov Chains Definition Stationary Property Paths in Markov Chains Classification of States Steady States in MCs. 2 Markov Property A discrete

More information

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

More information

MCMC notes by Mark Holder

MCMC notes by Mark Holder MCMC notes by Mark Holder Bayesian inference Ultimately, we want to make probability statements about true values of parameters, given our data. For example P(α 0 < α 1 X). According to Bayes theorem:

More information

Any live cell with less than 2 live neighbours dies. Any live cell with 2 or 3 live neighbours lives on to the next step.

Any live cell with less than 2 live neighbours dies. Any live cell with 2 or 3 live neighbours lives on to the next step. 2. Cellular automata, and the SIRS model In this Section we consider an important set of models used in computer simulations, which are called cellular automata (these are very similar to the so-called

More information

1 Continuous-time chains, finite state space

1 Continuous-time chains, finite state space Université Paris Diderot 208 Markov chains Exercises 3 Continuous-time chains, finite state space Exercise Consider a continuous-time taking values in {, 2, 3}, with generator 2 2. 2 2 0. Draw the diagramm

More information

Eigenvalues and Eigenvectors

Eigenvalues and Eigenvectors Eigenvalues and Eigenvectors MAT 67L, Laboratory III Contents Instructions (1) Read this document. (2) The questions labeled Experiments are not graded, and should not be turned in. They are designed for

More information

LECTURE 15 Markov chain Monte Carlo

LECTURE 15 Markov chain Monte Carlo LECTURE 15 Markov chain Monte Carlo There are many settings when posterior computation is a challenge in that one does not have a closed form expression for the posterior distribution. Markov chain Monte

More information

CDA6530: Performance Models of Computers and Networks. Chapter 3: Review of Practical Stochastic Processes

CDA6530: Performance Models of Computers and Networks. Chapter 3: Review of Practical Stochastic Processes CDA6530: Performance Models of Computers and Networks Chapter 3: Review of Practical Stochastic Processes Definition Stochastic process X = {X(t), t2 T} is a collection of random variables (rvs); one rv

More information

LECTURE 3. Last time:

LECTURE 3. Last time: LECTURE 3 Last time: Mutual Information. Convexity and concavity Jensen s inequality Information Inequality Data processing theorem Fano s Inequality Lecture outline Stochastic processes, Entropy rate

More information

18.600: Lecture 32 Markov Chains

18.600: Lecture 32 Markov Chains 18.600: Lecture 32 Markov Chains Scott Sheffield MIT Outline Markov chains Examples Ergodicity and stationarity Outline Markov chains Examples Ergodicity and stationarity Markov chains Consider a sequence

More information

Applied Stochastic Processes

Applied Stochastic Processes Applied Stochastic Processes Jochen Geiger last update: July 18, 2007) Contents 1 Discrete Markov chains........................................ 1 1.1 Basic properties and examples................................

More information

The Leslie Matrix. The Leslie Matrix (/2)

The Leslie Matrix. The Leslie Matrix (/2) The Leslie Matrix The Leslie matrix is a generalization of the above. It describes annual increases in various age categories of a population. As above we write p n+1 = Ap n where p n, A are given by:

More information

Lecture 9 Classification of States

Lecture 9 Classification of States Lecture 9: Classification of States of 27 Course: M32K Intro to Stochastic Processes Term: Fall 204 Instructor: Gordan Zitkovic Lecture 9 Classification of States There will be a lot of definitions and

More information

INTRODUCTION TO MARKOV CHAIN MONTE CARLO

INTRODUCTION TO MARKOV CHAIN MONTE CARLO INTRODUCTION TO MARKOV CHAIN MONTE CARLO 1. Introduction: MCMC In its simplest incarnation, the Monte Carlo method is nothing more than a computerbased exploitation of the Law of Large Numbers to estimate

More information

Bayesian RL Seminar. Chris Mansley September 9, 2008

Bayesian RL Seminar. Chris Mansley September 9, 2008 Bayesian RL Seminar Chris Mansley September 9, 2008 Bayes Basic Probability One of the basic principles of probability theory, the chain rule, will allow us to derive most of the background material in

More information