Chapter 4: Dynamic Programming

Objectives of this chapter:
- Overview of a collection of classical solution methods for MDPs known as dynamic programming (DP)
- Show how DP can be used to compute value functions, and hence, optimal policies
- Discuss the efficiency and utility of DP

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
Policy Evaluation

Policy evaluation: for a given policy π, compute the state-value function V^π.

Recall: the state-value function for policy π is
$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\}$$

The Bellman equation for V^π:
$$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^\pi(s') \big]$$

This is a system of |S| simultaneous linear equations.
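Because the Bellman equation is linear in V^π, small problems can be solved exactly rather than iteratively. A minimal sketch in NumPy, assuming the policy-conditioned transition matrix P_pi and expected-reward vector r_pi have already been built (both names are hypothetical inputs, not defined on the slides):

```python
import numpy as np

def evaluate_policy_exact(P_pi, r_pi, gamma):
    """Solve the Bellman equation V = r_pi + gamma * P_pi @ V directly.

    P_pi[s, s'] -- probability of moving from s to s' under the policy
    r_pi[s]     -- expected immediate reward from s under the policy
    """
    n = len(r_pi)
    # Rearranged as a linear system: (I - gamma * P_pi) V = r_pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```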
Iterative Methods

$$V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^\pi$$

Each arrow denotes a sweep. A sweep consists of applying a backup operation to each state.

A full policy-evaluation backup:
$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V_k(s') \big]$$
Iterative Policy Evaluation

[Figure: pseudocode for the iterative policy evaluation algorithm, which sweeps all states until the largest change in a sweep falls below a small threshold θ.]
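A sketch of that algorithm in code, assuming a tabular model where P[s][a] is a list of (prob, next_state, reward, done) transitions and pi[s][a] gives action probabilities (this encoding is my own, not from the slides):

```python
import numpy as np

def iterative_policy_evaluation(P, pi, gamma=1.0, theta=1e-6):
    """Iterative policy evaluation by repeated full sweeps.

    P[s][a]  -- list of (prob, next_state, reward, done) transitions
    pi[s][a] -- probability of taking action a in state s
    """
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            v = 0.0
            for a, action_prob in enumerate(pi[s]):
                for prob, s_next, reward, done in P[s][a]:
                    v += action_prob * prob * (reward + gamma * (0.0 if done else V[s_next]))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:   # largest change in this sweep is below the threshold
            return V
```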
A Small Gridworld

- An undiscounted, episodic task
- Nonterminal states: 1, 2, ..., 14; one terminal state (shown twice, as the shaded squares)
- Actions that would take the agent off the grid leave the state unchanged
- Reward is -1 on all transitions until the terminal state is reached
Iterative Policy Evaluation for the Small Gridworld

[Figure: value estimates V_k over successive sweeps of iterative policy evaluation, for equiprobable random action choices, alongside the greedy policy with respect to each V_k.]
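A compact sketch reproducing this computation, with the grid encoded as states 0..15 where 0 and 15 are the two renderings of the single terminal state (the encoding is my own):

```python
import numpy as np

N = 16
ACTIONS = [-4, +4, -1, +1]  # up, down, left, right

def step(s, a):
    """Deterministic gridworld move; off-grid actions leave the state unchanged."""
    s_next = s + a
    if s_next < 0 or s_next > 15:   # off the top or bottom edge
        return s
    if a == -1 and s % 4 == 0:      # off the left edge
        return s
    if a == +1 and s % 4 == 3:      # off the right edge
        return s
    return s_next

V = np.zeros(N)
for sweep in range(1000):
    V_new = np.zeros(N)
    for s in range(1, 15):
        # Equiprobable random policy, reward -1 per step, undiscounted.
        V_new[s] = sum(0.25 * (-1 + V[step(s, a)]) for a in ACTIONS)
    V = V_new
print(V.reshape(4, 4).round(1))  # approaches 0, -14, -20, -22, ... as in the figure
```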
Policy Improvement

Suppose we have computed V^π for a deterministic policy π.

For a given state s, would it be better to do an action a ≠ π(s)? The value of doing a in state s is:
$$Q^\pi(s,a) = E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a \} = \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^\pi(s') \big]$$

It is better to switch to action a for state s if and only if $Q^\pi(s,a) > V^\pi(s)$.
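The one-step lookahead in code, reusing the hypothetical P[s][a] transition encoding from the earlier sketch:

```python
def q_from_v(P, V, s, gamma=1.0):
    """Compute Q(s, a) for every action a by a one-step lookahead from V."""
    q = [0.0] * len(P[s])
    for a in range(len(P[s])):
        for prob, s_next, reward, done in P[s][a]:
            q[a] += prob * (reward + gamma * (0.0 if done else V[s_next]))
    return q
```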
Policy Improvement Cont.

Do this for all states to get a new policy π' that is greedy with respect to V^π:
$$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^\pi(s') \big]$$

Then $V^{\pi'} \ge V^\pi$.
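Greedification as a short sketch, built on the q_from_v helper above:

```python
import numpy as np

def greedy_policy(P, V, gamma=1.0):
    """Return the deterministic policy that is greedy with respect to V."""
    return np.array([int(np.argmax(q_from_v(P, V, s, gamma)))
                     for s in range(len(P))])
```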
Policy Improvement Cont.

What if $V^{\pi'} = V^\pi$? That is, for all s ∈ S,
$$V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^{\pi'}(s') \big]?$$

But this is the Bellman optimality equation. So $V^{\pi'} = V^*$ and both π and π' are optimal policies.
Policy Iteration

$$\pi_0 \xrightarrow{\text{E}} V^{\pi_0} \xrightarrow{\text{I}} \pi_1 \xrightarrow{\text{E}} V^{\pi_1} \xrightarrow{\text{I}} \cdots \xrightarrow{\text{I}} \pi^* \xrightarrow{\text{E}} V^*$$

where E denotes policy evaluation and I denotes policy improvement ("greedification").
Policy Iteration

[Figure: pseudocode for policy iteration, alternating a full policy evaluation with greedy policy improvement until the policy no longer changes.]
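A sketch of the full loop, reusing the hypothetical P[s][a] encoding and the greedy_policy helper from the earlier sketches:

```python
import numpy as np

def policy_iteration(P, gamma=1.0, theta=1e-8):
    """Alternate policy evaluation and greedy improvement until stable."""
    n_states = len(P)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation for the current deterministic policy (in place).
        V = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(prob * (reward + gamma * (0.0 if done else V[s_next]))
                        for prob, s_next, reward, done in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: greedify with respect to V.
        new_policy = greedy_policy(P, V, gamma)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```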
Jack's Car Rental

- $10 for each car rented (must be available when the request is received)
- Two locations, maximum of 20 cars at each
- Cars returned and requested randomly, following a Poisson distribution: n returns/requests with probability $\frac{\lambda^n}{n!} e^{-\lambda}$
  - 1st location: average requests = 3, average returns = 3
  - 2nd location: average requests = 4, average returns = 2
- Can move up to 5 cars between locations overnight

States, Actions, Rewards? Transition probabilities?
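The Poisson probabilities from the slide, as a small sketch of the pieces that would feed a transition model:

```python
from math import exp, factorial

def poisson(n, lam):
    """P(N = n) for a Poisson random variable with mean lam."""
    return lam ** n / factorial(n) * exp(-lam)

# Rates from the slide: (average requests, average returns) per location.
LOC1 = (3, 3)
LOC2 = (4, 2)

print(poisson(3, LOC1[0]))  # probability of exactly 3 requests at location 1, ~0.224
```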
Jack's Car Rental

[Figure: the sequence of policies found by policy iteration on Jack's car rental problem (cars moved as a function of the two locations' car counts), and the final state-value function.]
Jack's Car Rental Exercise

Suppose the first car moved is free:
- from the 1st to the 2nd location,
- because an employee travels that way anyway (by bus).

Suppose only 10 cars can be parked for free at each location:
- more than 10 cost $4 for using an extra parking lot.

Such arbitrary nonlinearities are common in real problems.
Value Iteration

Recall the full policy-evaluation backup:
$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V_k(s') \big]$$

Here is the full value-iteration backup:
$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V_k(s') \big]$$
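In code, the only change from policy evaluation is the max over actions. A sketch reusing the hypothetical P[s][a] encoding and the q_from_v and greedy_policy helpers above:

```python
import numpy as np

def value_iteration(P, gamma=1.0, theta=1e-8):
    """Sweep all states with the max backup until V stops changing,
    then read off the greedy policy."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            v = max(q_from_v(P, V, s, gamma))   # max over actions, not a policy average
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return greedy_policy(P, V, gamma), V
```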
Value Iteration Cont.

[Figure: pseudocode for value iteration, sweeping all states with the max backup until convergence and then outputting the greedy policy.]
Gambler's Problem

- A gambler can repeatedly bet $ on a coin flip
- Heads he wins his stake, tails he loses it
- Initial capital ∈ {$1, $2, ..., $99}
- The gambler wins if his capital becomes $100, loses if it becomes $0
- The coin is unfair: heads (gambler wins) with probability p = 0.4

States, Actions, Rewards?
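The problem is small enough to solve directly with value iteration. A self-contained sketch, assuming a reward of +1 only on reaching $100 (as implied by the slide) and no discounting:

```python
import numpy as np

# States are capital levels 1..99; V[0] = 0 (loss) and V[100] = 1 (win)
# encode the terminal outcomes, so no separate reward term is needed.
P_HEADS = 0.4
V = np.zeros(101)
V[100] = 1.0

while True:
    delta = 0.0
    for s in range(1, 100):
        # Legal stakes are 1..min(s, 100 - s).
        v = max(P_HEADS * V[s + a] + (1 - P_HEADS) * V[s - a]
                for a in range(1, min(s, 100 - s) + 1))
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < 1e-12:
        break

# Greedy stakes; np.argmax breaks ties toward the smallest stake.
stakes = [1 + int(np.argmax([P_HEADS * V[s + a] + (1 - P_HEADS) * V[s - a]
                             for a in range(1, min(s, 100 - s) + 1)]))
          for s in range(1, 100)]
```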
Gambler's Problem Solution

[Figure: value estimates over successive sweeps of value iteration, and the final policy (stake as a function of capital), which is sharply discontinuous.]
Herd Management

- You are a consultant to a farmer managing a herd of cows
- The herd consists of 5 kinds of cows: young, milking, breeding, old, sick
- The number of each kind is the state
- The number sold of each kind is the action
- Cows transition from one kind to another
- Young cows can be born
Asynchronous DP

- All the DP methods described so far require exhaustive sweeps of the entire state set.
- Asynchronous DP does not use sweeps. Instead it works like this:
  - Repeat until the convergence criterion is met:
    - Pick a state at random and apply the appropriate backup
- Still needs lots of computation, but does not get locked into hopelessly long sweeps
- Can you select states to back up intelligently? YES: an agent's experience can act as a guide.
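A minimal sketch of the random-state variant, again reusing the hypothetical P[s][a] encoding and the q_from_v helper (convergence checking is elided; a fixed backup budget stands in for the criterion):

```python
import random
import numpy as np

def async_value_iteration(P, gamma=1.0, n_backups=100_000):
    """Asynchronous DP: back up one randomly chosen state at a time
    instead of sweeping the whole state set."""
    V = np.zeros(len(P))
    for _ in range(n_backups):
        s = random.randrange(len(P))    # a smarter choice could follow the agent's experience
        V[s] = max(q_from_v(P, V, s, gamma))
    return V
```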
Generalized Policy Iteration

Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.

[Figure: a geometric metaphor for the convergence of GPI, with evaluation and improvement pulling toward the joint fixed point where V = V^π and π is greedy with respect to V.]
Efficiency of DP

- Finding an optimal policy is polynomial in the number of states...
- BUT the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
- In practice, classical DP can be applied to problems with a few million states.
- Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation.
- It is surprisingly easy to come up with MDPs for which DP methods are not practical.
Summary

- Policy evaluation: backups without a max
- Policy improvement: form a greedy policy, if only locally
- Policy iteration: alternate the above two processes
- Value iteration: backups with a max
- Full backups (to be contrasted later with sample backups)
- Generalized Policy Iteration (GPI)
- Asynchronous DP: a way to avoid exhaustive sweeps
- Bootstrapping: updating estimates based on other estimates