MARKOV CHAINS AND MARKOV DECISION THEORY

ARINDRIMA DATTA

Abstract. In this paper, we begin with a formal introduction to probability and explain the concept of random variables and stochastic processes. After this, we present some characteristics of a finite-state Markov chain, which is a discrete-time stochastic process. Later, we introduce the concept of Markov chains with rewards and Markov decision theory, and apply them through various examples. The paper ends with a description of the dynamic programming algorithm, a set of rules that maximizes the aggregate reward over a given number of trials in a multi-trial Markov chain with rewards.

Contents
1. Preliminaries and Definitions
2. Finite State Markov Chains
3. Classification of States of a Markov Chain
4. Matrix Representation and the Steady State, $[P^n]$ for large $n$
5. Markov Chains with Rewards
5.1. The expected aggregate reward over multiple transitions
5.2. The expected aggregate reward with an additional final reward
6. Markov decision theory
7. Dynamic programming algorithm
Acknowledgments
References

1. Preliminaries and Definitions

In this section, we provide a formal treatment of various concepts of statistics, like the notion of events and probability mapping. We also define a random variable, which is the building block of any stochastic process, of which the Markov process is one.

Axioms of events. Given a sample space $\Omega$, the class $\mathcal{F}$ of subsets of $\Omega$ that constitute the set of events satisfies the following axioms:
1. $\Omega$ is an event.
2. For every sequence of events $A_1, A_2, \dots$, the union $\bigcup_n A_n$ is an event.
3. For every event $A$, the complement $A^c$ is an event.
$\mathcal{F}$ is called the event space.

Axioms of probability. Given any sample space $\Omega$ and any event space $\mathcal{F}$, a probability rule is a function $P\{\cdot\}$ mapping each event $A \in \mathcal{F}$ to a (finite) real number in such a way that the following three probability axioms hold:
1. $P\{\Omega\} = 1$.
2. For every event $A$, $P\{A\} \ge 0$.
3. The probability of the union of any sequence $A_1, A_2, \dots$ of disjoint events is given by the sum of the individual probabilities:
$$ P\Big\{\bigcup_{n=1}^{\infty} A_n\Big\} = \sum_{n=1}^{\infty} P\{A_n\}. \tag{1.1} $$

With this definition of the probability mapping of an event, we will now characterize a random variable, which is itself a very important concept.

Definition 1.2. A random variable is a function $X$ from the sample space $\Omega$ of a probability model to the set of real numbers $\mathbb{R}$, denoted by $X(\omega)$ for $\omega \in \Omega$, where the mapping $X(\omega)$ must have the property that $\{\omega \in \Omega : X(\omega) \le x\}$ is an event for each $x \in \mathbb{R}$.

Thus, a random variable can be looked upon as a real-valued function on a set of possible outcomes (the sample space $\Omega$), but only if its distribution function, defined as $F_X(x) = P\{\omega \in \Omega : X(\omega) \le x\}$, exists. In other words, a real-valued function on $\Omega$ is a random variable only if every set $\{\omega : X(\omega) \le x\}$ is an event, so that it can be assigned a probability.

A stochastic process (or random process) is an infinite collection of random variables. Any such stochastic process is usually indexed by a real number, often interpreted as time, so that each sample point maps to a function of time, giving rise to a sample path. These sample paths might vary continuously with time or might vary only at discrete times. In this paper, we will be working with stochastic processes that vary only at discrete times.

2. Finite State Markov Chains

Stochastic processes that are defined only at integer values of time are called integer-time processes, of which a finite-state Markov chain is an example. Thus, at each integer time $n \ge 0$, there is an integer-valued random variable $X_n$, called the state at time $n$, and a Markov chain is the collection of these random variables $\{X_n;\ n \ge 0\}$. In addition to being an integer-time process, what really makes a Markov chain special is that it must also satisfy the following Markov property.

Definition 2.1. The Markov property of an integer-time process $\{X_n,\ n \ge 0\}$ is the property by which the sample values of each random variable $X_n$, $n \ge 1$, lie in a countable set $S$ and depend on the past only through the most recent random variable $X_{n-1}$. More specifically, for all positive integers $n$, and for all $i, j, k, \dots, m$ in $S$,
$$ P(X_n = j \mid X_{n-1} = i, X_{n-2} = k, \dots, X_0 = m) = P(X_n = j \mid X_{n-1} = i). \tag{2.2} $$

Definition 2.3. A homogeneous Markov chain has the property that $P\{X_n = j \mid X_{n-1} = i\}$ depends only on $i$ and $j$ and not on $n$, and is denoted by
$$ P\{X_n = j \mid X_{n-1} = i\} = P_{ij}. \tag{2.4} $$

Figure 1. Graphical and matrix representation of a 6-state Markov chain [1].

The initial state $X_0$ can have an arbitrary probability distribution. A finite-state Markov chain is a Markov chain in which $S$ is finite. Markov chains are often described by a directed graph, as in Figure 1(a). In this graphical representation, there is one node for each state and a directed arc for each non-zero transition probability. If $P_{ij} = 0$, then the arc from node $i$ to node $j$ is omitted. A finite-state Markov chain is also often described by a matrix $[P]$, as in Figure 1(b). If the chain has $M$ states, then $[P]$ is an $M \times M$ matrix with elements $P_{ij}$.

3. Classification of States of a Markov Chain

An ($n$-step) walk is an ordered string of nodes $(i_0, i_1, \dots, i_n)$, $n \ge 1$, in which there is a directed arc from $i_{m-1}$ to $i_m$ for each $m$, $1 \le m \le n$. A path is a walk in which no nodes are repeated. A cycle is a walk in which the first and last nodes are the same and no other node is repeated.

Definition 3.1. A state $j$ is accessible from $i$ (abbreviated as $i \to j$) if there is a walk in the graph from $i$ to $j$.

For example, in Figure 1(a), there is a walk from node 1 to node 3 (passing through node 2), so state 3 is accessible from 1.

Remark 3.2. We see that $i \to j$ if and only if $P\{X_n = j \mid X_0 = i\} > 0$ for some $n \ge 1$. We denote $P\{X_n = j \mid X_0 = i\}$ by $P^n_{ij}$. Thus $i \to j$ if and only if $P^n_{ij} > 0$ for some $n \ge 1$. For example, in Figure 1(a), $P^2_{13} = P_{12} P_{23} > 0$.

Two distinct states $i$ and $j$ communicate (denoted by $i \leftrightarrow j$) if $i$ is accessible from $j$ and $j$ is accessible from $i$.

Definition 3.3. For finite-state Markov chains, a recurrent state is a state $i$ which is accessible from all the states that are, in turn, accessible from $i$. Thus $i$ is recurrent if and only if $i \to j$ implies $j \to i$. A transient state is a state that is not recurrent.

A transient state $i$, therefore, has the property that if we start in state $i$, there is a non-zero probability that we will never return to $i$.

Definition 3.4. A class $C$ of states is a non-empty set of states such that each $i \in C$ communicates with every other state $j \in C$ and communicates with no $j \notin C$.

With this definition of a class, we now specify some characteristics of states belonging to the same class.

Theorem 3.5. For finite-state Markov chains, either all states in a class are transient or all are recurrent.

Proof. Let $C$ be a class, and let $i, m \in C$ be states in the same class (i.e., $i \leftrightarrow m$). Assume that state $i$ is transient (i.e., for some state $j$, $i \to j$ but $j \not\to i$). Since $m \to i$ and $i \to j$, we have $m \to j$. Now if $j \to m$, then the walk from $j$ to $m$ could be extended to $i$, giving $j \to i$, which contradicts the choice of $j$. Therefore there can be no walk from $j$ to $m$, which makes the state $m$ transient. Since we have just shown that all states in a class are transient if any one state in the class is, it follows that the states in a class are either all recurrent or all transient.

Definition 3.6. The period of a state $i$, denoted $d(i)$, is the greatest common divisor (gcd) of those values of $n$ for which $P^n_{ii} > 0$. If the period is 1, the state is called aperiodic.

Theorem 3.7. For any Markov chain, all states in a class have the same period.

Proof. Let $i$ and $j$ be any distinct pair of states in a class $C$. Then $i \leftrightarrow j$, and there is some $r$ such that $P^r_{ij} > 0$ and some $s$ such that $P^s_{ji} > 0$. Since there is a walk of length $r + s$ from $i$ to $j$ and back to $i$, $r + s$ must be divisible by $d(i)$. Let $t$ be any integer such that $P^t_{jj} > 0$. Since there is a walk of length $r + t + s$ from $i$ to $j$, then back to $j$, and then to $i$, $r + t + s$ is divisible by $d(i)$, and thus $t$ is divisible by $d(i)$. Since this is true for any $t$ such that $P^t_{jj} > 0$, $d(j)$ is divisible by $d(i)$. Reversing the roles of $i$ and $j$, $d(i)$ is divisible by $d(j)$, so $d(i) = d(j)$.
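The definitions of accessibility, classes, recurrence and period can be checked mechanically on a small chain. The following is a minimal sketch, not part of the original paper: the 4-state matrix `P` is a made-up example, states are numbered from 0 rather than 1, and the function names are illustrative. The period is computed from return times of length at most $3M$, which is enough to determine the gcd for an $M$-state chain.

```python
from math import gcd

def accessibility(P):
    """reach[i][j] is True when some walk of length >= 1 leads from i to j."""
    M = len(P)
    reach = [[P[i][j] > 0 for j in range(M)] for i in range(M)]
    for k in range(M):  # Warshall's transitive closure
        for i in range(M):
            for j in range(M):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True
    return reach

def classes(P):
    """Partition the states into classes of mutually communicating states."""
    reach, M = accessibility(P), len(P)
    seen, result = set(), []
    for i in range(M):
        if i in seen:
            continue
        cls = [j for j in range(M) if j == i or (reach[i][j] and reach[j][i])]
        seen.update(cls)
        result.append(cls)
    return result

def is_recurrent(P, i):
    """State i is recurrent iff every state accessible from i leads back to i."""
    reach = accessibility(P)
    return all(reach[j][i] for j in range(len(P)) if reach[i][j])

def period(P, i):
    """gcd of the return times n <= 3*M with P^n_ii > 0 (0 if i never returns)."""
    M = len(P)
    current, d = {i}, 0
    for n in range(1, 3 * M + 1):
        # states reachable from i in exactly n steps
        current = {j for k in current for j in range(M) if P[k][j] > 0}
        if i in current:
            d = gcd(d, n)
    return d

# Hypothetical chain: state 0 is transient, states 1, 2, 3 form a 3-cycle.
P = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
]
print(classes(P))
print([is_recurrent(P, i) for i in range(4)])
print([period(P, i) for i in range(4)])
```

In this example the three states of the cycle share the same period and are all recurrent, in line with Theorems 3.5 and 3.7.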
Since the states in a class $C$ all have the same period and are either all recurrent or all transient, we refer to the class $C$ itself as having the period of its states and as being recurrent or transient.

Definition 3.8. For a finite-state Markov chain, an ergodic class of states is a class that is both recurrent and aperiodic. A Markov chain consisting entirely of one ergodic class is called an ergodic chain.

Definition 3.9. A unichain is a finite-state Markov chain that contains a single recurrent class and, possibly, some transient states. An ergodic unichain is a unichain whose recurrent class is ergodic, that is, aperiodic.

4. Matrix Representation and the Steady State, $[P^n]$ for large $n$

The matrix $[P]$ of transition probabilities of a Markov chain is called a stochastic matrix; that is, a square matrix of nonnegative terms in which the elements in each row sum to 1. We first consider the $n$-step transition probabilities $P^n_{ij}$ in terms of $[P]$. The probability of reaching state $j$ from state $i$ in two steps is the sum over $k$ of the probability of transition from $i$ first to $k$ and then to $j$. Thus
$$ P^2_{ij} = \sum_{k=1}^{M} P_{ik} P_{kj}. \tag{4.1} $$
Notice that this is just the $(i, j)$ term of the product of the matrix $[P]$ with itself. If we denote $[P][P]$ as $[P^2]$, this means that $P^2_{ij}$ is the $(i, j)$ element of the matrix $[P^2]$. Similarly, it can be shown that $[P]^n = [P^n]$ and $[P^{m+n}] = [P^m][P^n]$. The last equality can be written explicitly as an equation known as the Chapman-Kolmogorov equation:
$$ P^{m+n}_{ij} = \sum_{k=1}^{M} P^m_{ik} P^n_{kj}. \tag{4.2} $$

Theorem 4.3. For an aperiodic Markov chain, there exists an $N < \infty$ such that $P^n_{ii} > 0$ for all $i \in \{1, \dots, k\}$ and all $n \ge N$.

Lemma 4.4. Let $A = \{a_1, a_2, \dots\}$ be a set of positive integers which are (i) relatively prime and (ii) closed under addition. Then there is some $N < \infty$ such that $n \in A$ for all $n \ge N$.

Proof. The proof of this lemma can be found in Olle Häggström, Finite Markov Chains and Algorithmic Applications, Cambridge University Press, 2002 [2]. Because it is a technical number theory lemma, we will not reproduce the proof here.

Proof (of Theorem 4.3). Let $A_i = \{n \ge 1 : P^n_{ii} > 0\}$ be the set of return times to state $i$ starting from state $i$. By the aperiodicity of the Markov chain, $A_i$ has greatest common divisor 1, satisfying part (i) of Lemma 4.4. Next, let $a_1, a_2 \in A_i$. Then $P^{a_1}_{ii} > 0$ and $P^{a_2}_{ii} > 0$, so by the Chapman-Kolmogorov equation
$$ P^{a_1 + a_2}_{ii} = \sum_{k=1}^{M} P^{a_1}_{ik} P^{a_2}_{ki} \ge P^{a_1}_{ii} P^{a_2}_{ii} > 0, $$
which in turn implies that $a_1 + a_2 \in A_i$. Hence $A_i$ is closed under addition and satisfies part (ii) of Lemma 4.4. The theorem then follows by applying Lemma 4.4 to each state $i$ and taking $N$ to be the largest of the resulting bounds.

Corollary 4.5. For ergodic Markov chains there exists an $M < \infty$ such that $P^n_{ij} > 0$ for all $i, j \in \{1, \dots, k\}$ and all $n \ge M$.

Proof. Using the aperiodicity of ergodic Markov chains and applying Theorem 4.3, we are able to find an integer $N < \infty$ such that $P^n_{ii} > 0$ for all $i \in \{1, \dots, k\}$ and all $n \ge N$. Next, we pick two arbitrary states $i$ and $j$. Since an ergodic Markov chain consists of a single recurrent class, states $i$ and $j$ must belong to the same class and hence communicate with each other. Thus, there is some $n_{i,j}$ such that $P^{n_{i,j}}_{ij} > 0$. Let $M_{i,j} = N + n_{i,j}$. Then, for any $m \ge M_{i,j}$ we have
$$ \begin{aligned} P(X_m = j \mid X_0 = i) &\ge P(X_m = j,\ X_{m - n_{i,j}} = i \mid X_0 = i) && \text{(the event on the right is a subset of the event on the left)} \\ &= P(X_{m - n_{i,j}} = i \mid X_0 = i)\, P(X_m = j \mid X_{m - n_{i,j}} = i) && \text{(by the Markov property)} \\ &> 0. \end{aligned} $$
The last step holds because $m - n_{i,j} \ge N$, so the first factor is positive, and the second factor equals $P^{n_{i,j}}_{ij} > 0$.

Therefore $P^m_{ij} > 0$ for all $m \ge M_{i,j}$. Repeating this process for all pairs of states $i$ and $j$, we get $\{M_{1,1}, \dots, M_{1,k}, M_{2,1}, \dots, M_{k,k}\}$. Now we set $M = \max\{M_{1,1}, \dots, M_{k,k}\}$, and this $M$ satisfies the required property as stated in the corollary.

The transition matrix $[P^n]$. The matrix $[P^n]$ is very important, as the $(i, j)$ element of this matrix is $P^n_{ij} = P\{X_n = j \mid X_0 = i\}$. Due to the Markov property, each state in a Markov chain remembers only the most recent history. Thus, we would expect the memory of the past to be lost with increasing $n$, and the dependence of $P^n_{ij}$ on both $n$ and $i$ to disappear as $n \to \infty$. This has two implications: first, $[P^n]$ should converge to a limit as $n \to \infty$, and, second, for each column $j$, the elements in that column, namely $P^n_{1j}, P^n_{2j}, \dots, P^n_{Mj}$, should all tend toward the same value. We call this limit $\pi_j$. If $P^n_{ij} \to \pi_j$, each row of the limiting matrix converges to $(\pi_1, \dots, \pi_M)$, i.e., every row becomes the same as every other row. We will now prove this convergence property for an ergodic finite-state Markov chain.

Theorem 4.6. Let $[P]$ be the matrix of an ergodic finite-state Markov chain. Then there is a unique steady-state vector $\pi$, which is positive and satisfies
$$ \lim_{n \to \infty} P^n_{ij} = \pi_j \quad \text{for each } i, j, \tag{4.7} $$
or, in compact notation,
$$ \lim_{n \to \infty} [P^n] = e\pi, \quad \text{where } e = (1, 1, \dots, 1)^T. \tag{4.8} $$

Proof. For each $i$, $j$, $k$ and $n$, we use the Chapman-Kolmogorov equation, along with $P^n_{kj} \le \max_\ell P^n_{\ell j}$ and $\sum_k P_{ik} = 1$. This gives us
$$ P^{n+1}_{ij} = \sum_k P_{ik} P^n_{kj} \le \sum_k P_{ik} \max_\ell P^n_{\ell j} = \max_\ell P^n_{\ell j}. $$
Similarly, we also have
$$ P^{n+1}_{ij} = \sum_k P_{ik} P^n_{kj} \ge \sum_k P_{ik} \min_\ell P^n_{\ell j} = \min_\ell P^n_{\ell j}. $$
Now, let $\alpha = \min_{i,j} P_{ij}$ and let $\ell_{\min}$ be the value of $\ell$ that minimizes $P^n_{\ell j}$. Then
$$ \begin{aligned} P^{n+1}_{ij} &= \sum_k P_{ik} P^n_{kj} \\ &= \sum_{k \ne \ell_{\min}} P_{ik} P^n_{kj} + P_{i \ell_{\min}} \min_\ell P^n_{\ell j} \\ &\le \sum_{k \ne \ell_{\min}} P_{ik} \max_\ell P^n_{\ell j} + P_{i \ell_{\min}} \min_\ell P^n_{\ell j} \\ &= \max_\ell P^n_{\ell j} - P_{i \ell_{\min}} \big( \max_\ell P^n_{\ell j} - \min_\ell P^n_{\ell j} \big) \\ &\le \max_\ell P^n_{\ell j} - \alpha \big( \max_\ell P^n_{\ell j} - \min_\ell P^n_{\ell j} \big), \end{aligned} $$
which further implies that
$$ \max_i P^{n+1}_{ij} \le \max_\ell P^n_{\ell j} - \alpha \big( \max_\ell P^n_{\ell j} - \min_\ell P^n_{\ell j} \big). $$

By a similar set of inequalities, we have
$$ \min_i P^{n+1}_{ij} \ge \min_\ell P^n_{\ell j} + \alpha \big( \max_\ell P^n_{\ell j} - \min_\ell P^n_{\ell j} \big). $$
Next, we subtract the two inequalities to obtain
$$ \max_i P^{n+1}_{ij} - \min_i P^{n+1}_{ij} \le \big( \max_\ell P^n_{\ell j} - \min_\ell P^n_{\ell j} \big)(1 - 2\alpha). $$
Then, using induction on $n$, we obtain
$$ \max_i P^n_{ij} - \min_i P^n_{ij} \le (1 - 2\alpha)^n. $$
Now, if we assume that $P_{ij} > 0$ for all $i, j$, then $\alpha > 0$, and since $(1 - 2\alpha) < 1$, in the limit $n \to \infty$ we have $\max_i P^n_{ij} - \min_i P^n_{ij} \to 0$, or
$$ \lim_{n \to \infty} \max_\ell P^n_{\ell j} = \lim_{n \to \infty} \min_\ell P^n_{\ell j} > 0. \tag{4.9} $$
But $\alpha$ might not always be positive. However, due to the ergodicity of our Markov chain and Corollary 4.5, we know that there exists some integer $h > 0$ such that $P^h_{ij} > 0$ for all $i, j$. Carrying out a similar process as before and replacing $\alpha$ by $\min_{i,j} P^h_{ij}$, which is now positive, we obtain
$$ \max_i P^n_{ij} - \min_i P^n_{ij} \le (1 - 2\alpha)^{\lfloor n/h \rfloor}, $$
from which we get the same limit as in equation (4.9).

Now, define the vector $\pi > 0$ by $\pi_j = \lim_{n \to \infty} \max_\ell P^n_{\ell j} = \lim_{n \to \infty} \min_\ell P^n_{\ell j} > 0$. Since $P^n_{ij}$ lies between the minimum and the maximum of $P^n_{\ell j}$ over $\ell$ for each $n$, in this limit $\pi_j = \lim_{n \to \infty} P^n_{ij} > 0$. This can be represented in more compact notation as
$$ \lim_{n \to \infty} [P^n] = e\pi, \quad \text{where } e = (1, \dots, 1)^T, $$
which proves the existence of the limit in the theorem.

To complete the proof, we also need to show that $\pi$ as defined above is the unique steady-state vector. That $\pi$ is a steady-state vector follows by letting $n \to \infty$ in $[P^{n+1}] = [P^n][P]$, which gives $e\pi = e\pi[P]$ and hence $\pi = \pi[P]$. Now let $\mu$ be any steady-state vector, i.e., any probability vector solution to $\mu[P] = \mu$. Then $\mu$ must satisfy $\mu = \mu[P^n]$ for all $n > 1$. In the limit $n \to \infty$,
$$ \mu = \mu \lim_{n \to \infty} [P^n] = \mu e \pi = (\mu e)\pi = \pi, $$
since $\mu e = 1$. Thus, $\pi$ is the steady-state vector and is unique.

5. Markov Chains with Rewards

In this section, we look into a more interesting problem, namely the Markov chain with rewards. We now associate each state $i$ of a Markov chain with a reward $r_i$. The reward $r_i$ associated with a state could also be viewed as a cost or some real-valued function of the state. The concept of a reward in each state is useful for modeling corporate profits or portfolio performance. It is also useful for studying queuing delay, the time until some given state is entered, and similar phenomena. It is clear from the setup that the sequence of rewards associated with the transitions between the states of the Markov chain is not independent, but is related by the statistics of the Markov chain. The gain is defined to be the steady-state expected reward per unit time, assuming a single recurrent class of states, and is denoted by $g = \sum_i \pi_i r_i$, where $\pi_i$ is the steady-state probability of being in state $i$. Let us now explore the concept of Markov chains with rewards with an example.
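The convergence in Theorem 4.6 and the definition of the gain can be illustrated numerically. The following is a rough sketch under stated assumptions: the 2-state matrix `P` and the reward vector `r` are invented for illustration and do not come from the paper, and $[P^n]$ is formed by plain repeated multiplication rather than any optimized routine.

```python
def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(P, n):
    """[P^n] by repeated multiplication; [P^0] is the identity matrix."""
    M = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(M)] for i in range(M)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

# Hypothetical ergodic chain (all entries positive) and hypothetical rewards.
P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [1.0, 7.0]

Pn = mat_pow(P, 60)          # for large n every row approaches pi
pi = Pn[0]                   # read the steady-state vector off the first row
gain = sum(pi[i] * r[i] for i in range(len(r)))   # g = sum_i pi_i r_i

print(Pn)
print(pi, gain)
```

For this particular matrix, solving $\pi[P] = \pi$ by hand gives $\pi = (5/6, 1/6)$, so the rows of `Pn` should agree with each other and with that vector up to rounding.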

Figure 2. The conversion of a recurrent Markov chain with $M = 4$ into a chain for which state 1 is a trapping state [1].

Example 5.1. The first-passage time of a set $A$ with respect to a stochastic process is the time until the stochastic process first enters $A$. It is an interesting concept, for we might want to know the average number of steps it takes to go from one given state, say $i$, to a fixed state, say 1, in a Markov chain. Here we calculate the expected value of the first-passage time.

Since the first-passage time to a given state (say state 1) is independent of the transitions made after the first entry into that state, we can modify any given Markov chain to convert this state into a trapping state, so that there is no exit from it. That is, we set $P_{11}$ to 1 and $P_{1j}$ to 0 for all $j \ne 1$. We leave $P_{ij}$ unchanged for all $i \ne 1$ and all $j$. We show such a modification in Figure 2. This modification does not change the probability of any sequence of states up to the point that state 1 is first entered, and so the essential behavior of the Markov chain is preserved.

Let $v_i$ be the expected number of steps to first reach state 1 starting in state $i \ne 1$. This is the required expected first-passage time to state 1. It can be computed by considering the first step and then adding the expected remaining steps to reach state 1 from the state that is entered next. For example, for the chain in Figure 2, we have the equations
$$ \begin{aligned} v_2 &= 1 + P_{23} v_3 + P_{24} v_4, \\ v_3 &= 1 + P_{32} v_2 + P_{33} v_3 + P_{34} v_4, \\ v_4 &= 1 + P_{42} v_2 + P_{43} v_3. \end{aligned} $$
Similarly, for an arbitrary chain of $M$ states where 1 is a trapping state and all other states are transient, this set of equations becomes
$$ v_i = 1 + \sum_{j \ne 1} P_{ij} v_j, \quad i \ne 1. \tag{5.2} $$
We can now define $r_i = 1$ for $i \ne 1$ and $r_i = 0$ for $i = 1$, so that $r_i$ is a unit reward for each step in which the trapping state has not yet been entered. This makes intuitive sense because in a real-life situation, we would expect the reward to cease once the trapping state is entered. With this definition of $r_i$, $v_i$ becomes the expected aggregate reward before entering the trapping state, or the expected transient reward. If we take $v_1$ to be 0 (i.e., zero reward in the recurrent state), Equation (5.2), along with $v_1 = 0$, has the vector form
$$ v = r + [P]v. \tag{5.3} $$
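Equation (5.3) can be solved numerically by iterating $v \leftarrow r + [P]v$ while holding the trapping-state entry at 0. The sketch below assumes a made-up 3-state chain (it is not the chain of Figure 2), numbers states from 0 so that index 0 plays the role of state 1, and uses fixed-point iteration rather than a direct linear solve; the iteration settles because all non-trapping states are transient.

```python
def first_passage_times(P, trap=0, iterations=500):
    """Approximate the solution of v = r + [P]v with v[trap] = 0.

    r is 1 for every non-trapping state and 0 for the trapping state,
    so v[i] is the expected number of steps to reach `trap` from i.
    """
    M = len(P)
    r = [0.0 if i == trap else 1.0 for i in range(M)]
    v = [0.0] * M
    for _ in range(iterations):
        v = [r[i] + sum(P[i][j] * v[j] for j in range(M)) for i in range(M)]
        v[trap] = 0.0  # no reward accrues once the trapping state is entered
    return v

# Hypothetical chain in which state 0 has already been made trapping.
P = [
    [1.0,  0.0,  0.0],
    [0.5,  0.0,  0.5],
    [0.25, 0.25, 0.5],
]
v = first_passage_times(P)
print(v)
```

Solving the two equations $v_1 = 1 + 0.5\,v_2$ and $v_2 = 1 + 0.25\,v_1 + 0.5\,v_2$ by hand gives $v_1 = 8/3$ and $v_2 = 10/3$ for this example, which the iteration should reproduce up to rounding.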

In the next two subsections we will explore more general cases of expected aggregate rewards in a Markov chain with rewards.

5.1. The expected aggregate reward over multiple transitions. In the general case, we let $X_m$ be the state at time $m$ and $R_m = R(X_m)$ the reward at time $m$; that is, if the sample value of $X_m$ is $i$, then $r_i$ is the sample value of $R_m$. Taking $X_m = i$, the aggregate expected reward $v_i(n)$ over $n$ trials from $X_m$ to $X_{m+n-1}$ is
$$ v_i(n) = E[R(X_m) + R(X_{m+1}) + \cdots + R(X_{m+n-1}) \mid X_m = i] = r_i + \sum_j P_{ij} r_j + \cdots + \sum_j P^{n-1}_{ij} r_j. $$
In the case of a homogeneous Markov chain, this expression does not depend on the starting time $m$. Considering the expected reward for each initial state $i$, this expression can be written compactly in the following vector notation:
$$ v(n) = r + [P]r + \cdots + [P^{n-1}]r = \sum_{h=0}^{n-1} [P^h] r, \tag{5.4} $$
where $v(n) = (v_1(n), v_2(n), \dots, v_M(n))^T$, $r = (r_1, \dots, r_M)^T$ and $[P^0]$ is the identity matrix. Now if the Markov chain is an ergodic unichain, we have $\lim_{n \to \infty} [P^n] = e\pi$. Multiplying both sides of the limit by the vector $r$, we obtain $\lim_{n \to \infty} [P^n] r = e\pi r = ge$, where $g$ is the steady-state reward per unit time; by definition, $g$ is equal to $\pi r$. If $g \ne 0$, then from equation (5.4), $v(n)$ changes by approximately $ge$ for each unit increase in $n$ when $n$ is large. Thus, $v(n)$ does not have a limit as $n \to \infty$. However, as shown below, $v(n) - nge$ does have a limit, given by
$$ \lim_{n \to \infty} [v(n) - nge] = \lim_{n \to \infty} \sum_{h=0}^{n-1} [P^h - e\pi] r, \quad \text{since } e\pi r = ge. $$
For an ergodic unichain, the limit exists (that is, the infinite sum converges) because it can be shown that $|P^n_{ij} - \pi_j| < o(\exp(-n\varepsilon))$ for some small $\varepsilon > 0$ and for all $i, j, n$ [1, p. 126]. Thus,
$$ \sum_{h=n}^{\infty} (P^h_{ij} - \pi_j) < o(\exp(-n\varepsilon)). $$
This limit is a vector over the states of the Markov chain, which gives the asymptotic relative expected advantage of starting the chain in one state relative to another. It is also called the relative-gain vector and is denoted by $w$.

Theorem 5.5. Let $[P]$ be the transition matrix for an ergodic unichain. Then the relative-gain vector $w$ satisfies the following linear vector equation:
$$ w + ge = [P]w + r. \tag{5.6} $$
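Equation (5.6) can be sanity-checked numerically by approximating $w$ as $v(n) - nge$ for a large $n$. This is only an illustrative sketch: the chain `P` and rewards `r` are hypothetical, $\pi$ is approximated by repeatedly applying $\pi \leftarrow \pi[P]$, and $v(n)$ is built with the recursion $v(n) = r + [P]v(n-1)$, which follows from (5.4).

```python
def mat_vec(P, x):
    """Matrix-vector product [P]x for a matrix stored as a list of rows."""
    return [sum(P[i][j] * x[j] for j in range(len(x))) for i in range(len(P))]

def steady_state(P, iterations=200):
    """Approximate pi by iterating the row vector pi <- pi [P]."""
    M = len(P)
    pi = [1.0 / M] * M
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(M)) for j in range(M)]
    return pi

def relative_gain(P, r, n=200):
    """Return (w, g) with w approximated by v(n) - n*g*e for large n."""
    pi = steady_state(P)
    g = sum(p * x for p, x in zip(pi, r))
    v = [0.0] * len(r)                      # v(0) = 0
    for _ in range(n):                      # v(n) = r + [P] v(n-1)
        v = [ri + pv for ri, pv in zip(r, mat_vec(P, v))]
    return [vi - n * g for vi in v], g

# Hypothetical ergodic chain and rewards (not taken from the paper).
P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [1.0, 7.0]

w, g = relative_gain(P, r)
lhs = [wi + g for wi in w]                              # w + g e
rhs = [pw + ri for pw, ri in zip(mat_vec(P, w), r)]     # [P] w + r
print(w, g, lhs, rhs)
```

For this example the two sides of (5.6) should agree up to rounding, and the difference $w_2 - w_1$ measures how much better it is, in the long run, to start in the second state than in the first.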

Proof. Multiplying both sides of the equation defining $w$ on the left by $[P]$, we get
$$ \begin{aligned} [P]w &= \lim_{n \to \infty} \sum_{h=0}^{n-1} [P^{h+1} - e\pi] r && \text{since } e\pi = [P]e\pi \\ &= \lim_{n \to \infty} \sum_{h=1}^{n} [P^h - e\pi] r \\ &= \lim_{n \to \infty} \Big( \sum_{h=0}^{n} [P^h - e\pi] r \Big) - [P^0 - e\pi] r \\ &= w - [P^0] r + e\pi r \\ &= w - r + ge. \end{aligned} $$
Rearranging the terms, we get the required result.

5.2. The expected aggregate reward with an additional final reward. A variation on the previous situation is the case when an added final reward is assigned to the final state. We can view this final reward, say $u_i$, as a function of the final state $i$. For example, it might be particularly advantageous to end in one particular state rather than another. As before, we set $R(X_{m+h})$ to be the reward at time $m + h$ for $0 \le h \le n - 1$, and define $U(X_{m+n})$ to be the final reward at time $m + n$, where $U(X) = u_i$ for $X = i$. Let $v_i(n, u)$ be the expected reward from time $m$ to $m + n$, using the reward $r$ from time $m$ to $m + n - 1$ and the final reward $u$ at time $m + n$. Then the expected reward is obtained by modifying Equation (5.4):
$$ v(n, u) = r + [P]r + \cdots + [P^{n-1}]r + [P^n]u = \sum_{h=0}^{n-1} [P^h] r + [P^n] u. \tag{5.7} $$
This simplifies if $u$ is taken to be the relative-gain vector $w$.

Theorem 5.8. Let $[P]$ be the transition matrix of a unichain and let $w$ be the corresponding relative-gain vector. For each $n \ge 1$, if $u = w$, then
$$ v(n, w) = nge + w. \tag{5.9} $$
For an arbitrary final reward vector $u$,
$$ v(n, u) = nge + w + [P^n](u - w). \tag{5.10} $$

Proof. We use induction to prove the theorem. For $n = 1$, we obtain from (5.7) and Theorem 5.5 that
$$ v(1, w) = r + [P]w = ge + w, \tag{5.11} $$
so the induction hypothesis is satisfied for $n = 1$. For $n > 1$,
$$ \begin{aligned} v(n, w) &= \sum_{h=0}^{n-1} [P^h] r + [P^n] w \\ &= \sum_{h=0}^{n-2} [P^h] r + [P^{n-1}] r + [P^n] w \\ &= \sum_{h=0}^{n-2} [P^h] r + [P^{n-1}] (r + [P]w) \\ &= \sum_{h=0}^{n-2} [P^h] r + [P^{n-1}] (ge + w) \\ &= \Big( \sum_{h=0}^{n-2} [P^h] r + [P^{n-1}] w \Big) + [P^{n-1}] ge \\ &= v(n-1, w) + ge, \end{aligned} $$
where the last step uses $[P^{n-1}]ge = ge$, since $ge = e\pi r$ and $e\pi = [P^{n-1}]e\pi$. Using induction on $n$, we obtain (5.9). To establish (5.10), note from (5.7) that
$$ v(n, u) - v(n, w) = [P^n](u - w). \tag{5.12} $$
Then (5.10) follows by using (5.9) for the value of $v(n, w)$.

6. Markov decision theory

Until now, we have only analyzed the behavior of a Markov chain with rewards. In this section, we consider a more intricate situation where a decision maker can choose among various possible rewards and transition probabilities. At each time $m$, the decision maker, given $X_m = i$, selects one of the $K_i$ possible choices for state $i$, and each choice $k$ is associated with a reward $r^{(k)}_i$ and a set of transition probabilities $P^{(k)}_{ij}$ for all $j$. We also assume that if decision $k$ is selected at time $m$, the probability of entering state $j$ at time $m + 1$ is $P^{(k)}_{ij}$, independent of earlier states and decisions.

Example 6.1. An example is given in Figure 3, in which the decision maker has a choice between two possible decisions in state 2 ($K_2 = 2$) and has a single choice in state 1 ($K_1 = 1$). Such a situation might arise when there is a trade-off between instant gain (alternative 2) and long-term gain (alternative 1). We see that decision 2 is the best choice in state 2 at the $n$th (that is, the last) of $n$ trials, because of the huge reward associated with this decision. However, at an earlier step, it is less obvious what to do. We will address this question in the next section, when we derive the algorithm to choose the right decision at each trial to maximize the aggregate reward.

The set of rules used by the decision maker in selecting an alternative at each time is called a policy. We might be interested in calculating the expected aggregate reward over $n$ steps of the Markov chain as a function of the policy used by the decision maker.
A familiar situation would be where, for each state $i$, the policy uses the same decision, say $k_i$, at each occurrence of $i$. Such a policy is called a stationary policy. Since both rewards and transition probabilities in a stationary policy depend only on the state and the corresponding decision, and not on time, such a policy corresponds to a homogeneous Markov chain with transition probabilities $P^{(k_i)}_{ij}$. We denote the resulting transition probability matrix of the Markov chain as $[P^k]$, where $k = (k_1, \dots, k_M)$. The aggregate gain for any such policy was found in the previous section.

Figure 3. A Markov decision problem with two alternatives in state 2 [1].

7. Dynamic programming algorithm

In the more general case, where the decision chosen may vary as a function of time, we might want to derive an algorithm to choose the optimal policy for maximizing the expected aggregate reward over an arbitrary number $n$ of trials from time $m$ to $m + n - 1$. It turns out that the problem is simplified if we include a final reward $\{u_i;\ 1 \le i \le M\}$ at time $m + n$. This final reward $u$ is chosen as a fixed vector, rather than as part of the choice of policy. The optimized strategy, as a function of the number of steps $n$ and the final reward $u$, is called an optimal dynamic policy for that $u$. This policy is found from the dynamic programming algorithm.

First let us consider the optimal decision with $n = 1$. Given $X_m = i$, a decision $k$ is made with immediate reward $r^{(k)}_i$. If the next state $X_{m+1}$ is state $j$, then the transition probability is $P^{(k)}_{ij}$ and the final reward is then $u_j$. The expected aggregate reward over times $m$ and $m + 1$, maximized over the decision $k$, is then
$$ v^*_i(1, u) = \max_k \Big\{ r^{(k)}_i + \sum_j P^{(k)}_{ij} u_j \Big\}. \tag{7.1} $$
Next, we look at $v^*_i(2, u)$, i.e., the maximal expected aggregate reward starting at $X_m = i$ with decisions made at times $m$ and $m + 1$ and a final reward at time $m + 2$. The key to dynamic programming is that an optimal decision at time $m + 1$ can be selected based only on the state $j$ at time $m + 1$. That this decision is optimal independent of the decision at time $m$ can be shown using the following argument.
Regardless of what decision is made at time $m$, the maximal expected reward at times $m + 1$ and $m + 2$ (given $X_{m+1} = j$) is $\max_k \big( r^{(k)}_j + \sum_\ell P^{(k)}_{j\ell} u_\ell \big)$. This is equal to $v^*_j(1, u)$, as found in (7.1).

Using this optimized decision at time $m + 1$, it is seen that if $X_m = i$ and decision $k$ is made at time $m$, then the sum of expected rewards at times $m + 1$ and $m + 2$ is $\sum_j P^{(k)}_{ij} v^*_j(1, u)$. Adding the expected reward at time $m$ and maximizing over decisions at time $m$,
$$ v^*_i(2, u) = \max_k \Big\{ r^{(k)}_i + \sum_j P^{(k)}_{ij} v^*_j(1, u) \Big\}. \tag{7.2} $$
Continuing this way, we find, after $n$ steps, that
$$ v^*_i(n, u) = \max_k \Big\{ r^{(k)}_i + \sum_j P^{(k)}_{ij} v^*_j(n - 1, u) \Big\}. \tag{7.3} $$
Noteworthy is the fact that the algorithm is independent of the starting time $m$. The parameter $n$, usually referred to as stage $n$, is the number of decisions over which the aggregate gain is being optimized. So we obtain the optimal dynamic policy for any fixed final reward vector $u$ and any given number of trials.

Example 7.4. The dynamic programming algorithm can be illustrated with a short example. We reconsider the case in Example 6.1 with final reward $u = 0$. Since $r_1 = 0$ and $u_1 = u_2 = 0$, the aggregate gain in state 1 at stage 1 is
$$ v^*_1(1, u) = r_1 + \sum_j P_{1j} u_j = 0. $$
Similarly, since decision 1 has an immediate reward $r^{(1)}_2 = 1$ and decision 2 has an immediate reward $r^{(2)}_2 = 50$ in state 2,
$$ v^*_2(1, u) = \max\Big\{ \Big[ r^{(1)}_2 + \sum_j P^{(1)}_{2j} u_j \Big], \Big[ r^{(2)}_2 + \sum_j P^{(2)}_{2j} u_j \Big] \Big\} = \max\{1, 50\} = 50. $$
To go on to stage 2, we use the results above for $v^*_j(1, u)$:
$$ v^*_1(2, u) = r_1 + P_{11} v^*_1(1, u) + P_{12} v^*_2(1, u) = P_{12} v^*_2(1, u) = 0.5, $$
$$ v^*_2(2, u) = \max\Big\{ \Big[ r^{(1)}_2 + \sum_j P^{(1)}_{2j} v^*_j(1, u) \Big], \Big[ r^{(2)}_2 + P^{(2)}_{21} v^*_1(1, u) \Big] \Big\} = \max\{ 1 + P^{(1)}_{22} v^*_2(1, u),\ 50 \} = \max\{50.5, 50\} = 50.5. $$
Thus for a two-trial situation like this, decision 1 is optimal in state 2 for the first trial (stage 2), and decision 2 is optimal in state 2 for the second trial (stage 1). This is because the choice of decision 2 at stage 1 has made it very profitable to be in state 2 at stage 1. Thus if the chain is in state 2 at stage 2, it is preferable to choose decision 1 (i.e., the small unit gain) at stage 2, with the corresponding high probability of remaining in state 2 at stage 1. For larger $n$, continuing the recursion from the stage-2 values above gives $v^*_1(n, u) = (n - 1)/2$ and $v^*_2(n, u) = 50 + (n - 1)/2$. The optimum dynamic policy (for $u = 0$) would then be decision 2 for stage 1 (i.e., for the last decision to be made) and decision 1 for all stages $n > 1$ (i.e., for all decisions before the last).

From this example we also see that the maximization of expected gain is not always what is most desirable in all applications. For instance, someone who is risk-averse might well prefer decision 2 at the next-to-final decision (stage 2), as this guarantees a reward of 50, rather than taking a small chance of losing that reward.
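The recursion (7.1)-(7.3) translates directly into a short program. The sketch below is an illustration rather than the author's implementation. The transition probabilities are not printed in the text, so they are assumptions inferred from the worked numbers: $P_{12} = 0.01$ (from $0.5 = P_{12} \cdot 50$), $P^{(1)}_{22} = 0.99$ (from $50.5 = 1 + P^{(1)}_{22} \cdot 50$), $P^{(2)}_{21} = 1$ (only the $P^{(2)}_{21}$ term appears for decision 2), and the remaining entries chosen so that each row sums to 1. States and decisions are numbered from 0 in the code.

```python
def dynamic_program(alternatives, u, n):
    """Run the recursion (7.1)-(7.3) for n stages.

    alternatives[i] is a list of (reward, transition_row) pairs, one per
    decision available in state i. Returns (values, policy), where
    values[s-1][i] is v*_i(s, u) and policy[s-1][i] is the 0-based index
    of the maximizing decision in state i at stage s.
    """
    v = list(u)                      # stage 0: only the final reward remains
    values, policy = [], []
    for _ in range(n):
        new_v, best = [], []
        for choices in alternatives:
            scores = [reward + sum(p * vj for p, vj in zip(row, v))
                      for reward, row in choices]
            k = max(range(len(scores)), key=scores.__getitem__)
            new_v.append(scores[k])
            best.append(k)
        v = new_v
        values.append(v)
        policy.append(best)
    return values, policy

# Assumed data for Example 6.1 (inferred from the worked numbers, see above).
alternatives = [
    [(0.0, [0.99, 0.01])],                          # state 1: single decision
    [(1.0, [0.01, 0.99]), (50.0, [1.0, 0.0])],      # state 2: decisions 1 and 2
]
values, policy = dynamic_program(alternatives, u=[0.0, 0.0], n=3)
for stage, (v, k) in enumerate(zip(values, policy), start=1):
    print(stage, v, k)
```

Under these assumed probabilities the first two stages reproduce the values computed in Example 7.4, with decision 2 chosen in state 2 only at stage 1.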

Acknowledgments. I would like to thank Peter May for organizing the REU and my mentor, Yan Zhang, for painstakingly reviewing my paper. This paper has only been possible because of their help, for which I am indebted to them.

References
[1] Robert Gallager. Course Notes, MIT OCW. http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-262-discrete-stochastic-processes-spring-2011/course-notes/MIT6_262S11_chap03.pdf
[2] Olle Häggström. Finite Markov Chains and Algorithmic Applications. Cambridge University Press, 2002.
