Administrivia
CSE 190: Reinforcement Learning: An Introduction
Any email sent to me about the course should have "CSE 190" in the subject line!

Chapter 4: Dynamic Programming
Acknowledgment: A good number of these slides are cribbed from Rich Sutton.

Goals for this chapter
- Overview of a collection of classical solution methods for MDPs, known as dynamic programming (DP)
- Show how DP can be used to compute value functions, and hence, optimal policies
- Discuss the efficiency and utility of DP

Last Time: Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy.

State-value function for policy π:
V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\{ \sum_{k=0}^\infty \gamma^k r_{t+k+1} \mid s_t = s \}

The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π.

Action-value function for policy π:
Q^\pi(s,a) = E_\pi\{ R_t \mid s_t = s, a_t = a \} = E_\pi\{ \sum_{k=0}^\infty \gamma^k r_{t+k+1} \mid s_t = s, a_t = a \}
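As a concrete reminder of what the return is, here is a tiny Python sketch (mine, not from the slides) that computes R_t from a recorded reward sequence using the recursion R_t = r_{t+1} + γR_{t+1}:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...

    `rewards` lists r_{t+1}, r_{t+2}, ... for one episode; the sum is
    accumulated right-to-left via R_t = r_{t+1} + gamma * R_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0, 0, 1]))  # 0 + 0.9*0 + 0.81*1 = 0.81
```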
Last Time: Bellman Equation for a Policy π
The basic idea: the return telescopes.
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots
    = r_{t+1} + \gamma ( r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots )
    = r_{t+1} + \gamma R_{t+1}

So:
V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \}

Or, without the expectation operator:
V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ]

Last Time: More on the Bellman Equation
This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution.
Backup diagrams: one for V^π, one for Q^π.

Last Time: Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:
V^*(s) = \max_a Q^{\pi^*}(s,a)
       = \max_a E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}
       = \max_a \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^*(s') ]
The relevant backup diagram takes a max over the actions. V^* is the unique solution of this system of nonlinear equations.

Last Time: Bellman Optimality Equation for Q*
Q^*(s,a) = E\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \}
         = \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') ]
The relevant backup diagram takes a max over the actions available in the successor state. Q^* is the unique solution of this system of nonlinear equations.
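Because the Bellman equations for a fixed policy are linear, V^π can in principle be computed exactly with one linear solve. A minimal numpy sketch, assuming the policy-averaged transition matrix P_pi and expected one-step rewards r_pi have already been built (both names are illustrative, not from the slides):

```python
import numpy as np

def solve_v_pi(P_pi, r_pi, gamma=0.9):
    """Solve the linear Bellman system V = r_pi + gamma * P_pi @ V.

    P_pi[s, s'] -- probability of s -> s' with actions averaged over pi
    r_pi[s]     -- expected one-step reward from s under pi
    """
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Tiny two-state check: the states alternate; only state 0 pays reward 1.
P_pi = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
r_pi = np.array([1.0, 0.0])
print(solve_v_pi(P_pi, r_pi))  # about [5.26, 4.74]
```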
This Time
- How to solve these equations using iteration
- We can solve for the optimal V*
- But it is often faster to evaluate and improve the policy, alternating between figuring out V^π and improving π

Policy Evaluation
Policy evaluation: for a given policy π, compute the state-value function V^π.
Recall the state-value function for policy π:
V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\{ \sum_{k=0}^\infty \gamma^k r_{t+k+1} \mid s_t = s \}
and the Bellman equation for V^π:
V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ]
This is a system of |S| simultaneous linear equations.

Iterative Methods
V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^\pi
where each arrow is a sweep. A sweep consists of applying a backup operation to each state. A full policy-evaluation backup:
V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V_k(s') ]
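A minimal Python sketch of this sweep, with the MDP passed in as nested dicts (the pi/P/R layout here is my own convention, not from the slides); it updates the values in place, which also converges:

```python
def policy_evaluation(pi, P, R, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation by repeated sweeps.

    pi[s][a]    -- probability of taking action a in state s
                   (terminal states get an empty dict, so they keep value 0)
    P[s][a][s2] -- transition probability of s -a-> s2
    R[s][a][s2] -- expected reward on that transition
    Sweeps until the largest single-state change is below theta.
    """
    V = {s: 0.0 for s in pi}
    while True:
        delta = 0.0
        for s in pi:
            v = sum(pi[s][a] * sum(p * (R[s][a][s2] + gamma * V[s2])
                                   for s2, p in P[s][a].items())
                    for a in pi[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V
```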
Iterative Policy Evaluation
(Figure: the complete iterative policy evaluation algorithm.)

A Small Gridworld
- An undiscounted, episodic task
- Nonterminal states: 1, 2, ..., 14; one terminal state (shown twice, as the shaded squares)
- Actions that would take the agent off the grid leave the state unchanged
- Reward is -1 until the terminal state is reached

Note here that the actions are deterministic, so this equation:
V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V_k(s') ]
becomes (with s' now the single state that action a leads to from s):
V_{k+1}(s) \leftarrow \sum_a \pi(s,a) [ R^a_{ss'} + \gamma V_k(s') ]
And it is undiscounted (γ = 1), so this becomes:
V_{k+1}(s) \leftarrow \sum_a \pi(s,a) [ R^a_{ss'} + V_k(s') ]
A Small Gridworld (continued)
Expanding the sum over the four actions:
V_{k+1}(s) \leftarrow \pi(s, UP) [ R_{UP} + V_k(s'_{UP}) ] + \pi(s, RIGHT) [ R_{RIGHT} + V_k(s'_{RIGHT}) ] + \pi(s, DOWN) [ R_{DOWN} + V_k(s'_{DOWN}) ] + \pi(s, LEFT) [ R_{LEFT} + V_k(s'_{LEFT}) ]

Under the equiprobable random policy, every action has probability 0.25 and every reward is -1:
V_{k+1}(s) \leftarrow 0.25 [ -1 + V_k(s'_{UP}) ] + 0.25 [ -1 + V_k(s'_{RIGHT}) ] + 0.25 [ -1 + V_k(s'_{DOWN}) ] + 0.25 [ -1 + V_k(s'_{LEFT}) ]

For state 4, for example, we have:
V_{k+1}(4) \leftarrow 0.25 [ -1 + V_k(terminal) ] + 0.25 [ -1 + V_k(5) ] + 0.25 [ -1 + V_k(8) ] + 0.25 [ -1 + V_k(4) ]
(UP from state 4 reaches the terminal state, RIGHT reaches state 5, DOWN reaches state 8, and LEFT would leave the grid, so the state is unchanged.)
A Small Gridworld (continued)
Starting from V_0 = 0 everywhere, the first sweep gives V_1(s) = -1 for every nonterminal state. The second sweep, for state 4:
V_2(4) \leftarrow 0.25 [ -1 + 0 ] + 0.25 [ -1 + (-1) ] + 0.25 [ -1 + (-1) ] + 0.25 [ -1 + (-1) ] = -1.75

Iterative Policy Evaluation for the Small Gridworld
π = equiprobable random action choices
(Figure: the sequence of value functions V_0, V_1, V_2, ... for the random policy.)
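To check the arithmetic, here is a runnable sketch of this example. The cell indexing is mine: cells 0 and 15 are the two terminal corners, so the cell index matches the state numbers 1-14 used above:

```python
# 4x4 gridworld, equiprobable random policy, reward -1, undiscounted.
N = 4
TERMINAL = {0, 15}
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # UP, RIGHT, DOWN, LEFT

def step(cell, move):
    """Deterministic move; off-grid actions leave the state unchanged."""
    r, c = divmod(cell, N)
    nr, nc = r + move[0], c + move[1]
    return cell if not (0 <= nr < N and 0 <= nc < N) else nr * N + nc

V = [0.0] * (N * N)
for sweep in range(1000):
    new_V = [0.0 if s in TERMINAL else
             sum(0.25 * (-1 + V[step(s, m)]) for m in MOVES)
             for s in range(N * N)]
    delta = max(abs(a - b) for a, b in zip(new_V, V))
    V = new_V
    if sweep == 1:
        print("V_2(4) =", V[4])              # -1.75, as computed above
    if delta < 1e-10:
        break
print([round(v, 1) for v in V])  # settles at the -14/-18/-20/-22 pattern
```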
Iterative Policy Evaluation for the Small Gridworld (continued)
π = equiprobable random action choices
But look what happens if these values are used to make a new policy! (Note: this won't always happen.)
Exercise for the reader: what are the values of the states under the optimal policy?

Policy Improvement
Suppose we have computed V^π for a deterministic policy π. For a given state s, would it be better to do an action a ≠ π(s)?
The value of doing a in state s is:
Q^\pi(s,a) = E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a \}
           = \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ]
It is better to switch to action a for state s if and only if Q^\pi(s,a) > V^\pi(s).

Policy Improvement (cont.)
Do this for all states to get a new policy π' that is greedy with respect to V^π:
\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ]
Then V^{\pi'} \ge V^\pi.
What if V^{\pi'} = V^\pi? That is, for all s \in S,
V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V^\pi(s') ]?
But this is the Bellman Optimality Equation. So V^{\pi'} = V^* and both π and π' are optimal policies.
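A short sketch of this greedification step, in the same illustrative dict format as the policy-evaluation sketch above:

```python
def greedy_policy(V, P, R, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to V:
    pi'(s) = argmax_a sum_{s'} P[s][a][s'] * (R[s][a][s'] + gamma * V[s'])."""
    pi = {}
    for s in P:
        if not P[s]:              # terminal state: no action to choose
            continue
        q = {a: sum(p * (R[s][a][s2] + gamma * V[s2])
                    for s2, p in P[s][a].items())
             for a in P[s]}
        pi[s] = max(q, key=q.get)
    return pi
```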
Policy Iteration
\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*
alternating policy evaluation (compute V^{\pi_i}) and policy improvement ("greedification").

Jack's Car Rental
- $10 for each car rented (the car must be available when the request is received)
- Two locations, maximum of 20 cars at each
- Cars are returned and requested randomly
  - Poisson distribution: n returns/requests with probability \lambda^n e^{-\lambda} / n!, where λ is the expected number
  - 1st location: average requests = 3, average returns = 3
  - 2nd location: average requests = 4, average returns = 2 (note that moving cars makes sense: location 2 on average loses 2 cars per day)
- Can move up to 5 cars between locations overnight, at $2 per car
- States, actions, rewards? Transition probabilities?
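Putting the two earlier sketches together gives the full policy-iteration loop. This reuses the illustrative policy_evaluation and greedy_policy functions from above and the same hypothetical P/R layout; a problem like Jack's car rental would first need its Poisson transition probabilities tabulated into that form:

```python
def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedification until the policy
    is stable, which (for a finite MDP) means it is optimal."""
    # arbitrary initial deterministic policy: the first listed action
    pi = {s: next(iter(P[s])) for s in P if P[s]}
    while True:
        # evaluate the current deterministic policy...
        stochastic = {s: ({pi[s]: 1.0} if s in pi else {}) for s in P}
        V = policy_evaluation(stochastic, P, R, gamma)
        # ...then greedify; stop when greedification changes nothing
        new_pi = greedy_policy(V, P, R, gamma)
        if new_pi == pi:
            return pi, V
        pi = new_pi
```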
Jack's CR Exercise
- Suppose the first car moved is free
  - from the 1st to the 2nd location
  - because an employee travels that way anyway, by bus
- Suppose only 10 cars can be parked for free at each location
  - more than 10 costs $4 for using an extra parking lot
- Such arbitrary nonlinearities are common in real problems

Policy Iteration: Can we do better?
- Each iteration involves policy evaluation, which is itself an iterative process.
- From the previous example, it looks like policy evaluation may converge long after the greedy policy based on the values has converged.
- Can we skip steps somehow?
- Yes: policy evaluation can be stopped early, and in most cases convergence is still guaranteed!
- A very special case: stopping after one sweep of policy evaluation. This is called value iteration.

Value Iteration
Recall the full policy-evaluation backup:
V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V_k(s') ]
Here is the full value-iteration backup:
V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} [ R^a_{ss'} + \gamma V_k(s') ]
Note how this combines policy improvement and evaluation: it is simply the Bellman optimality equation turned into an update equation!
In practice, several policy-evaluation (sum) sweeps are often performed between policy-improvement (max) sweeps.
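A sketch of value iteration in the same hypothetical dict layout. Unlike policy iteration, each sweep applies the max backup directly, and a greedy policy is only read off at the end:

```python
def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Sweep the Bellman optimality backup until V stops changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:                      # terminal state stays at 0
                continue
            v = max(sum(p * (R[s][a][s2] + gamma * V[s2])
                        for s2, p in P[s][a].items())
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # the optimal policy is greedy with respect to the converged values
    pi = {s: max(P[s], key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                         for s2, p in P[s][a].items()))
          for s in P if P[s]}
    return pi, V
```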
Gambler's Problem
- A gambler can repeatedly bet money on a coin flip: heads he wins his stake, tails he loses it.
- Initial capital ∈ {$1, $2, ..., $99}
- The gambler wins if his capital becomes $100 and loses if it becomes $0.
- The coin is unfair: heads (the gambler wins) with probability p = 0.4.
- States, actions, rewards?

Gambler's Problem Solution
(Figure: the value function and final policy found by DP.)

Herd Management
- You are a consultant to a farmer managing a herd of cows.
- The herd consists of 5 kinds of cows: young, milking, breeding, old, sick.
- The number of each kind is the state; the number sold of each kind is the action.
- Cows transition from one kind to another, and young cows can be born.

Asynchronous DP
- All the DP methods described so far require exhaustive sweeps of the entire state set.
- Asynchronous DP does not use sweeps. Instead it works like this (see the sketch after this list): repeat until a convergence criterion is met: pick a state at random and apply the appropriate backup.
- Still needs lots of computation, but does not get locked into hopelessly long sweeps.
- Can you select the states to back up intelligently? YES: an agent's experience can act as a guide.
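As a concrete instance, here is a small sketch of asynchronous DP applied to the gambler's problem above: each step backs up one randomly chosen state with the Bellman optimality backup, with no systematic sweeps. The constants and names are mine; setting V($100) = 1 folds the +1 reward for winning into the terminal value, so V(s) is the probability of winning from capital s:

```python
import random

P_HEADS = 0.4        # unfair coin: the gambler wins a flip with probability 0.4
GOAL = 100

V = [0.0] * (GOAL + 1)
V[GOAL] = 1.0        # terminal values: V[0] = 0 (loss), V[100] = 1 (win)

def backup(s):
    """In-place Bellman optimality backup for one state (undiscounted)."""
    V[s] = max(P_HEADS * V[s + stake] + (1 - P_HEADS) * V[s - stake]
               for stake in range(1, min(s, GOAL - s) + 1))

# Asynchronous DP: back up randomly chosen states, no exhaustive sweeps.
for _ in range(200_000):
    backup(random.randint(1, GOAL - 1))

print(round(V[50], 3))   # chance of winning from $50 under the greedy policy
```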
Generalized Policy Iteration
GPI: any interaction of policy evaluation and policy improvement, independent of their granularity.
(Figure: a geometric metaphor for the convergence of GPI.)

Efficiency of DP
- Finding an optimal policy is polynomial in the number of states...
- BUT the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
- In practice, classical DP can be applied to problems with a few million states.
- Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation.
- It is surprisingly easy to come up with MDPs for which DP methods are not practical.

Summary
- Policy evaluation: backups without a max
- Policy improvement: form a greedy policy, if only locally
- Policy iteration: alternate the above two processes
- Value iteration: backups with a max
- Full backups (to be contrasted later with sample backups)
- Asynchronous DP: a way to avoid exhaustive sweeps
- Generalized Policy Iteration (GPI)
- Bootstrapping: updating estimates based on other estimates

END