Math 46, Lecture 8. Infinite-horizon discounted reward problem.

From the last lecture: the value function of policy $u$ for the infinite-horizon problem with discount factor $\alpha$ and initial state $i$ is
$$W(i,u) = E\Big[\sum_{n=0}^{\infty} \alpha^n\, r(X_n, u(X_n)) \;\Big|\; X_0 = i\Big].$$
The optimal value function is $W(i) = \max_u W(i,u)$. $u^*$ is an optimal policy iff for each initial state $i$, $W(i,u^*) = W(i)$.

RECURSION EQUATION FOR $W(i,u)$:
$$W(i,u) = r(i,u(i)) + \alpha \sum_j T_{u(i)}(i,j)\, W(j,u).$$
DYNAMIC PROGRAMMING EQUATION FOR $W(i)$:
$$W(i) = \max_{a \in A}\Big[\, r(i,a) + \alpha \sum_j T_a(i,j)\, W(j) \Big].$$
Let $u^*$ be the policy such that $u^*(i)$ is the $a$ which gives the max value in the dynamic programming equation. Then $u^*$ is optimal and $W(i,u^*) = W(i)$.
Approximation procedure for $W(i)$: Suppose the states are $S = \{i_0, i_1, i_2, \ldots\}$. Let $w$ be the column vector of values $W(i)$:
$$w = (w_0, w_1, w_2, \ldots)^T = (W(i_0), W(i_1), W(i_2), \ldots)^T$$
(superscript $T$ means transpose). Let $w^0 = (0, 0, 0, \ldots)^T$. Define $w^{n+1}$ by
$$w^{n+1}(i) = \max_a \Big[\, r(i,a) + \alpha \sum_j T_a(i,j)\, w^n(j) \Big].$$
This sequence converges to $W(i)$ (by Lecture 8). Each $w^{n+1}(i)$ determines a maximizing action $u(i) = a$. Continue the approximations until the maximizing action at one stage is the same as at the next. Then check whether the policy $u(i)$ is an optimal policy: $u$ is optimal iff $W(i,u) = W(i)$, iff $W(i,u)$ satisfies the dynamic programming equation for $W(i)$, iff
$$W(i,u) = \max_{a \in A}\Big[\, r(i,a) + \alpha \sum_j T_a(i,j)\, W(j,u) \Big].$$
If not optimal, continue the approximation procedure. If the dynamic programming equation does hold for all states $i$, then $u(i)$ is the optimal policy $u^*$ and $W(i) = W(i,u)$ is the exact value of $W(i)$.
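This procedure is just fixed-point iteration on the dynamic programming equation. Here is a minimal sketch in code, assuming finite state and action sets; the name `value_iterate` and the array layout are illustrative choices, not from the lecture:

```python
import numpy as np

def value_iterate(r, T, alpha, n_steps):
    """r[a, i] = reward r(i, a); T[a, i, j] = transition probability T_a(i, j);
    alpha = discount factor. Returns w^n and the maximizing action u(i) for each state."""
    n_actions, n_states = r.shape
    w = np.zeros(n_states)                 # w^0 = (0, 0, ..., 0)^T
    u = np.zeros(n_states, dtype=int)
    for _ in range(n_steps):
        q = r + alpha * (T @ w)            # q[a, i] = r(i, a) + alpha * sum_j T_a(i, j) w(j)
        w = q.max(axis=0)                  # w^{n+1}(i) = max_a q[a, i]
        u = q.argmax(axis=0)               # the maximizing action u(i)
    return w, u
```

In practice one stops when the maximizing actions stabilize and then runs the optimality check described above.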
Now we compute $W(i,u)$ from $u$ via matrix algebra. Suppose the state space is $S = \{0, 1, 2, \ldots\}$.

LEMMA. Suppose $u$ is the policy. Let $T_u(i,j) = T_{u(i)}(i,j)$ be the transition matrix for going from state $i$ to state $j$ according to $u$. Let $r_u = (r(0,u(0)), r(1,u(1)), r(2,u(2)), \ldots)^T$ be the rewards when following policy $u$. Let $X = (x_0, x_1, x_2, \ldots)^T = (W(0,u), W(1,u), W(2,u), \ldots)^T$ be the expected values. Then $X$ is the solution to the matrix equation
$$(I - \alpha T_u)\,X = r_u.$$
Proof. Since $X(i) = W(i,u)$, the equation $W(i,u) = r(i,u(i)) + \alpha \sum_j T_{u(i)}(i,j)\, W(j,u)$ for all $i$ becomes
$$X(i) = r(i,u(i)) + \alpha \sum_j T_{u(i)}(i,j)\, X(j),$$
and hence
$$(\ldots, X(i), \ldots)^T = \Big(\ldots,\; r(i,u(i)) + \alpha \sum_j T_{u(i)}(i,j)\, X(j),\; \ldots\Big)^T.$$
Note that $\sum_j T_{u(i)}(i,j)\, X(j)$ is the $i$-th entry of the matrix product $T_u X$. Hence $X = r_u + \alpha T_u X$, and hence $(I - \alpha T_u)X = r_u$.
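The lemma translates directly into a linear solve. A sketch, assuming a finite state space and the same array conventions as before (`evaluate_policy` is an illustrative name, not from the lecture):

```python
import numpy as np

def evaluate_policy(u, r, T, alpha):
    """u[i] = index of the action chosen in state i; r, T as in value_iterate.
    Returns X with X[i] = W(i, u), by solving (I - alpha T_u) X = r_u."""
    n = len(u)
    idx = np.arange(n)
    T_u = T[u, idx, :]                     # row i of T_u is row i of T_{u(i)}
    r_u = r[u, idx]                        # r_u(i) = r(i, u(i))
    return np.linalg.solve(np.eye(n) - alpha * T_u, r_u)
```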
EXAMPLE. A Markov decision problem has two states $i = 1, 2$ and two actions $a = c, d$. The discount is $\alpha = 0.9$. The rewards $r(i,a)$ are $r(1,c) = 5$, $r(2,c) = 1$, $r(1,d) = 4$, $r(2,d) = 3$. The transition matrices for actions $a = c, d$ are
$$T_c = \begin{pmatrix} 1/2 & 1/2 \\ 2/3 & 1/3 \end{pmatrix}, \qquad T_d = \begin{pmatrix} 1/4 & 3/4 \\ 1/3 & 2/3 \end{pmatrix}.$$
Which state does $c$ favor? Which state does $d$ favor? Find an optimal policy $u(i)$.

Solution. The recursion is
$$w^{n+1}(i) = \max_a \Big[\, r(i,a) + 0.9 \sum_j T_a(i,j)\, w^n(j) \Big],$$
that is,
$$w^{n+1}(1) = \max\Big\{5 + 0.9\big(\tfrac12 w^n(1) + \tfrac12 w^n(2)\big),\; 4 + 0.9\big(\tfrac14 w^n(1) + \tfrac34 w^n(2)\big)\Big\},$$
$$w^{n+1}(2) = \max\Big\{1 + 0.9\big(\tfrac23 w^n(1) + \tfrac13 w^n(2)\big),\; 3 + 0.9\big(\tfrac13 w^n(1) + \tfrac23 w^n(2)\big)\Big\},$$
with $w^0(1) = w^0(2) = 0$.
Starting from $w^0 = (0, 0)^T$:
$$w^1(1) = \max\{5 + 0.9 \cdot 0,\; 4 + 0.9 \cdot 0\} = \max\{5, 4\} = 5, \quad\text{with action } u(1) = c,$$
$$w^1(2) = \max\{1 + 0.9 \cdot 0,\; 3 + 0.9 \cdot 0\} = \max\{1, 3\} = 3, \quad\text{with action } u(2) = d.$$
Now apply the recursion once more to get $w^2$:
$$w^2(1) = \max\Big\{5 + 0.9\big(\tfrac12 \cdot 5 + \tfrac12 \cdot 3\big),\; 4 + 0.9\big(\tfrac14 \cdot 5 + \tfrac34 \cdot 3\big)\Big\} = \max\Big\{5 + 0.9 \cdot 4,\; 4 + 0.9 \cdot \tfrac{14}{4}\Big\} = \max\{8.6,\; 7.15\} = 8.6,$$
with action $u(1) = c$.
$$w^2(2) = \max\Big\{1 + 0.9\big(\tfrac23 \cdot 5 + \tfrac13 \cdot 3\big),\; 3 + 0.9\big(\tfrac13 \cdot 5 + \tfrac23 \cdot 3\big)\Big\} = \max\Big\{1 + 0.9 \cdot \tfrac{13}{3},\; 3 + 0.9 \cdot \tfrac{11}{3}\Big\} = \max\{4.9,\; 6.3\} = 6.3,$$
with action $u(2) = d$.

For both $w^1$ and $w^2$, the maximizing actions $u(1)$ and $u(2)$ are the same. Hence we test whether $u(1) = c$, $u(2) = d$ is an optimal policy. For this $u$,
$$T_u = \begin{pmatrix} 1/2 & 1/2 \\ 1/3 & 2/3 \end{pmatrix}.$$
The first row of $T_u(i,j) = T_{u(i)}(i,j)$ is from $T_c$ since $u(1) = c$; the second row is from $T_d$ since $u(2) = d$.
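These iterates can be double-checked numerically. The following sketch maps states 1, 2 to indices 0, 1 and actions c, d to 0, 1 (all names illustrative):

```python
import numpy as np

r = np.array([[5.0, 1.0],                    # r(1,c), r(2,c)
              [4.0, 3.0]])                   # r(1,d), r(2,d)
T = np.array([[[1/2, 1/2], [2/3, 1/3]],      # T_c
              [[1/4, 3/4], [1/3, 2/3]]])     # T_d
alpha = 0.9

w = np.zeros(2)
for n in (1, 2):
    q = r + alpha * (T @ w)                  # q[a, i] = r(i,a) + 0.9 sum_j T_a(i,j) w(j)
    w = q.max(axis=0)
    print(n, w, q.argmax(axis=0))
# 1 [5. 3.] [0 1]     -> w^1 = (5, 3),   actions (c, d)
# 2 [8.6 6.3] [0 1]   -> w^2 = (8.6, 6.3), actions (c, d)
```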
Recall the rewards $r(1,c) = 5$, $r(2,c) = 1$, $r(1,d) = 4$, $r(2,d) = 3$, and the policy $u(1) = c$, $u(2) = d$. Thus
$$r_u = (r(1,u(1)),\; r(2,u(2)))^T = \;?$$
$$r_u = (r(1,u(1)),\; r(2,u(2)))^T = (r(1,c),\; r(2,d))^T = (5, 3)^T.$$
The equation $(I - \alpha T_u)X = r_u$ becomes
$$\begin{pmatrix} 1 - 0.45 & -0.45 \\ -0.3 & 1 - 0.6 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 5 \\ 3 \end{pmatrix},$$
which is the system
$$\tfrac{11}{20}x - \tfrac{9}{20}y = 5, \quad\text{i.e.}\quad 11x - 9y = 100,$$
$$-\tfrac{3}{10}x + \tfrac{4}{10}y = 3, \quad\text{i.e.}\quad -3x + 4y = 30,$$
with solution $x = \tfrac{670}{17} = W(1,u)$, $y = \tfrac{630}{17} = W(2,u)$.

Now check whether $W(i,u)$ satisfies the dynamic programming equation for $W(i)$:
$$W(i,u) = \max_{a \in A}\Big[\, r(i,a) + \alpha \sum_j T_a(i,j)\, W(j,u) \Big].$$
If so, then $W(i) = W(i,u)$ and $u$ is optimal. For $i = 1$ we get
$$W(1,u) = \max_{a \in A}\Big[\, r(1,a) + 0.9 \sum_j T_a(1,j)\, W(j,u) \Big]$$
iff
$$\tfrac{670}{17} = \max\Big\{5 + 0.9\big(\tfrac12 \cdot \tfrac{670}{17} + \tfrac12 \cdot \tfrac{630}{17}\big),\; 4 + 0.9\big(\tfrac14 \cdot \tfrac{670}{17} + \tfrac34 \cdot \tfrac{630}{17}\big)\Big\} = \max\{39.4,\; 37.9\} = 39.4,$$
which is true since $\tfrac{670}{17} \approx 39.4$. Likewise for $i = 2$.
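The linear solve and the optimality check can likewise be verified numerically; a sketch under the same index conventions as before:

```python
import numpy as np

alpha = 0.9
T_u = np.array([[1/2, 1/2],                  # row 1 from T_c, since u(1) = c
                [1/3, 2/3]])                 # row 2 from T_d, since u(2) = d
r_u = np.array([5.0, 3.0])
X = np.linalg.solve(np.eye(2) - alpha * T_u, r_u)
print(X)                                     # [39.4117... 37.0588...] = [670/17, 630/17]

# dynamic programming check: X(i) = max_a { r(i,a) + alpha sum_j T_a(i,j) X(j) } ?
r = np.array([[5.0, 1.0], [4.0, 3.0]])
T = np.array([[[1/2, 1/2], [2/3, 1/3]],
              [[1/4, 3/4], [1/3, 2/3]]])
print(np.allclose(X, (r + alpha * (T @ X)).max(axis=0)))   # True -> u is optimal
```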
THE OPTIMAL STOPPING PROBLEM. Suppose you have invested in a stock. When should you sell it? To make money, you must sell it for more than the amount you originally paid. When the stock is selling at a high price, should you sell or wait for a higher price? For an unpredictable stock market the problem is intractable; we solve it for Markov chains with known transition probabilities. Let $S = \{0, 1, 2, \ldots\}$ be the set of states of a Markov chain. $T(i,j)$ is the probability of going from $i$ to $j$, assuming you choose to continue. $r(i)$ is the reward of state $i$; this reward is collected iff you stop at that state. At each state $i$ we choose an action $a \in \{s, c\}$. Choose $a = s$ if you wish to stop and collect the state's reward; thus $r(i,s) = r(i)$. Choose $a = c$ if you wish to continue; you skip the state's reward, so $r(i,c) = 0$.
You continue from state to state according to the transition probabilities until you stop at a state and collect its reward. A solution to a stopping problem is a policy $u(i)$ which determines, for each state $i$, whether you should stop ($u(i) = s$) and collect the reward $r(i,s) = r(i)$, or continue ($u(i) = c$) and skip the reward $r(i,c) = 0$ (hoping for something better). Let $W(i)$ be the expected reward if you start at $i$ and act optimally. Clearly $W(i) \ge r(i)$, since you can always stop at $i$ immediately; $W(i) = r(i)$ if it is optimal to stop at $i$ instead of continuing on to other states.
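Since $W(i) = \max\{r(i),\; \sum_j T(i,j) W(j)\}$, a stopping problem can be solved by the same successive-approximation idea. A minimal sketch, assuming a finite chain on which the iteration converges (e.g. stopping is eventually forced, as in the exercises below); `stopping_values` is an illustrative name:

```python
import numpy as np

def stopping_values(r, T, n_steps=1000):
    """r[i] = reward collected if you stop at i; T[i, j] = transition probabilities
    when you continue. Returns W and the stop/continue decision for each state."""
    r = np.asarray(r, dtype=float)
    w = r.copy()                          # W(i) >= r(i) always, so start there
    for _ in range(n_steps):
        w = np.maximum(r, T @ w)          # W(i) = max{ r(i), sum_j T(i,j) W(j) }
    return w, r >= T @ w                  # stop exactly where r(i) achieves the max
```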
EXERCISE. For each state $i$ decide if we should stop and collect the reward, or continue. The bracketed numbers $[r(i)]$ are the rewards. Put the correct action $u(i)$ in the space between the parentheses. At the bottom of each node, list the expected value $W(i)$ if you start at $i$ and act optimally. Convention: unlabeled arrows have probability 1. You must stop sometime; infinite loops are not allowed.

[Diagram: a Markov chain whose nodes carry rewards [3], [0], [0], [5], [6] (and others), with transition probabilities such as 1/3, 1/2, 1/4, 3/4 on the arrows; blanks ( ) are for the action $u(i)$ and value $W(i)$ at each node.]
[Diagram: a second Markov chain exercise with node rewards [6], [6], [5], [4], [0], [0], [0] and transition probabilities of 1/2 on several arrows; blanks ( ) again take $u(i)$ and $W(i)$.]

RULES FOR INITIAL KNOWN VALUES. What should you do in each case?
- If $r(i)$ is the maximum reward.
- If $r(i)$ is > all rewards reachable from $i$.
- If $i$ is a dead-end state (you can't leave).
- If $i$ is in an irreducible class.
- If all next states $j$ have known values $W(j)$.