Efficient Planning. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction



1 Efficient Planning

2 Tuesday class summary. Planning: any computational process that uses a model to create or improve a policy. Dyna framework: [diagram not reproduced]

3 Questions during class. Why use simulated experience? Can't you directly compute a solution based on the model? Wouldn't it be better to plan backwards from the goal?

4 How to Achieve Efficient Planning? What type of backup is better? Sample vs. full backups. Incremental vs. less incremental backups. How to order the backups?

5 What is Efficient Planning? Planning algorithm A is more efficient than planning algorithm B if: it can compute the optimal policy (or value function) in less time; or, given the same amount of computation time, it improves the policy (or value function) more.

6 What backup type is best?

7 Full vs. Sample Backups. Backups are classified by the value estimated and by whether they are full backups (DP) or sample backups (one-step TD):
$v_\pi(s)$: policy evaluation (full); TD(0) (sample)
$v_*(s)$: value iteration (full); no sample backup shown
$q_\pi(s,a)$: Q-policy evaluation (full); Sarsa (sample)
$q_*(s,a)$: Q-value iteration (full); Q-learning (sample)
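The contrast between the last row's two backups can be sketched in a few lines. This is only an illustration under stated assumptions: the two-successor model, the successor action values, and the discount `GAMMA` are made-up numbers, and the `1/t` step-size schedule is one common choice, not something the slide specifies.

```python
import random

# Hypothetical model for one state-action pair (s, a):
# each entry is (probability, reward, next_state). Numbers are invented.
MODEL = [(0.5, 1.0, "x"), (0.5, 0.0, "y")]
# Hypothetical current action values at the successor states.
Q_NEXT = {"x": [2.0, 0.0], "y": [1.0, 3.0]}
GAMMA = 0.9  # assumed discount factor


def full_backup(model, q_next, gamma):
    """Q-value-iteration backup: exact expectation over all successors."""
    return sum(p * (r + gamma * max(q_next[s2])) for p, r, s2 in model)


def sample_backup(q_sa, model, q_next, gamma, alpha, rng):
    """Q-learning backup: move toward the target of one sampled successor."""
    probs = [p for p, _, _ in model]
    _, r, s2 = rng.choices(model, weights=probs, k=1)[0]
    target = r + gamma * max(q_next[s2])
    return q_sa + alpha * (target - q_sa)


def averaged_sample_backups(n, model, q_next, gamma, seed=0):
    """Apply n sample backups with step size 1/t (a running average)."""
    rng = random.Random(seed)
    q = 0.0
    for t in range(1, n + 1):
        q = sample_backup(q, model, q_next, gamma, 1.0 / t, rng)
    return q


if __name__ == "__main__":
    print("full backup:", full_backup(MODEL, Q_NEXT, GAMMA))
    print("200 sample backups:", averaged_sample_backups(200, MODEL, Q_NEXT, GAMMA))
```

The full backup touches every successor once and has no sampling error; each sample backup touches one successor and only approaches the same value as samples accumulate.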

8 Full vs. Sample Backups. [Figure: RMS error in value estimate against the number of $\max_{a'} Q(s',a')$ computations (from 0 to 2b), comparing full backups with sample backups for several branching factors b, starting at b = 2.] Setup: b successor states, equally likely; initial error = 1; assume all next states' values are correct.

9 Small Backups. Small backups are single-successor backups based on the model. Small backups have the same computational complexity as sample backups. Small backups have no sampling error. Small backups require storage for old values.

10 Main Idea behind Small Backups. What can we do if we know that only a single successor has changed value since the last backup? Consider an estimate A that is constructed from a sum of other estimates $X_i$. The estimate A can be computed using a full backup: $A \leftarrow \sum_i X_i$. If the estimates $X_i$ are updated, A can be recomputed by redoing the above backup. Alternatively, if we know that only $X_j$ received a significant value change, we might want to update A for only $X_j$. Let us indicate the old value of $X_j$, used to construct the current value of A, as $x_j$. A can then be updated by subtracting this old value and adding the new value: $A \leftarrow A - x_j + X_j$. For a weighted sum the full backup is $A \leftarrow \sum_i w_i X_i$, and the small backup adds the weighted difference between the new and old value: $A \leftarrow A + w_j (X_j - x_j)$.
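The small-backup idea, updating a weighted sum for a single changed component while remembering the old component values, can be sketched as follows. The class name `SmallBackupEstimate` and the example numbers are hypothetical, chosen only for illustration.

```python
def full_backup(weights, xs):
    """Recompute A = sum_i w_i * X_i from scratch; cost grows with len(xs)."""
    return sum(w * x for w, x in zip(weights, xs))


class SmallBackupEstimate:
    """Keeps A together with the component values x_i last used to build it."""

    def __init__(self, weights, xs):
        self.weights = list(weights)
        self.old = list(xs)  # x_i: stored old values (the extra storage cost)
        self.value = full_backup(self.weights, self.old)

    def small_backup(self, j, new_xj):
        """A <- A + w_j * (X_j - x_j); then store X_j as the new x_j."""
        self.value += self.weights[j] * (new_xj - self.old[j])
        self.old[j] = new_xj
        return self.value


if __name__ == "__main__":
    est = SmallBackupEstimate([0.2, 0.3, 0.5], [1.0, 2.0, 3.0])
    print(est.value)                  # value after the initial full backup
    print(est.small_backup(1, 4.0))   # only component 1 changed
    print(full_backup([0.2, 0.3, 0.5], [1.0, 4.0, 3.0]))  # same result, full cost
```

The small backup does constant work per changed component and has no sampling error, at the price of storing the old values.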

11 Small vs. Sample Backups. Small backup: single-successor backup with cost that is a fraction of the cost of a full backup. Advantage of Small Backups over Sample Backups: No Step-Size is Required. [Figure: normalized RMS error against step size / step size decay on 2 evaluation tasks with random transitions (rewards $r_{left} = +1$, $r_{right} = -1$ and $r_{left} = +1$, $r_{right} = +1$); curves: sample backup TD(0) with constant step size, sample backup TD(0) with decaying step size, and small backup.] Take-Home Message: smaller backups, more planning [...]

12 Small vs. Sample Backups. [Figure: states A, B and C; panels show the transition probability and the state values of state A and state B.]

13 Backup Ordering

14 Backup Ordering: Asynchronous Value Iteration. Do forever: 1) Select a state $s \in S$ according to some selection strategy H. 2) Apply a full backup to s: $V(s) \leftarrow \max_a \big[ \hat r(s,a) + \gamma \sum_{s'} \hat p(s' \mid s,a)\, V(s') \big]$. For every selection strategy H that selects each state infinitely often, the values V converge to the optimal value function $V^*$. The rate of convergence depends strongly on the selection strategy H.
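A minimal sketch of asynchronous value iteration shows how much the selection strategy matters. The three-state chain, the discount `GAMMA`, and the two selection orders are assumptions made for illustration only.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical chain MDP; state 2 is terminal.
# MODEL[s][a] = list of (probability, reward, next_state).
MODEL = {
    0: {"stay": [(1.0, 0.0, 0)], "go": [(1.0, 0.0, 1)]},
    1: {"stay": [(1.0, 0.0, 1)], "go": [(1.0, 1.0, 2)]},
    2: {},
}


def full_backup(V, s):
    """max over actions of expected reward plus discounted next-state value."""
    if not MODEL[s]:  # terminal state keeps value 0
        return 0.0
    return max(
        sum(p * (r + GAMMA * V[s2]) for p, r, s2 in outcomes)
        for outcomes in MODEL[s].values()
    )


def async_value_iteration(select, n_backups):
    """Apply n_backups full backups; select(k) is the selection strategy H."""
    V = {s: 0.0 for s in MODEL}
    for k in range(n_backups):
        s = select(k)
        V[s] = full_backup(V, s)
    return V


if __name__ == "__main__":
    backward = lambda k: [1, 0][k % 2]  # goal-adjacent state first
    forward = lambda k: [0, 1][k % 2]   # start state first
    print(async_value_iteration(backward, 2))  # already optimal
    print(async_value_iteration(forward, 2))   # state 0 still at its initial value
    print(async_value_iteration(forward, 3))   # needs one more backup
```

On this chain the backward order reaches the optimal values in two backups, while the forward order wastes its first backup on a state whose successor value has not changed yet.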

15 The Trade-Off. For any effective ordering strategy, the cost that is saved by having to perform fewer backups should outweigh the cost of maintaining the ordering: cost to maintain ordering < cost savings due to fewer backups.

16 Prioritized Sweeping. Which states or state-action pairs should be generated during planning? Work backwards from states whose values have just changed: Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change. When a new backup occurs, insert predecessors according to their priorities. Always perform backups from first in queue. Moore & Atkeson 1993; Peng & Williams 1993; improved by McMahan & Gordon 2005; Van Seijen

17 Moore and Atkeson's Prioritized Sweeping. Published in

18 Prioritized Sweeping vs. Dyna-Q. Both use n=5 backups per environmental interaction.

19 Bellman Error Ordering. Bellman error is a measure for the difference between the current value and the value after a full backup: $BE(s) = \big| V(s) - \max_a \big[ \hat r(s,a) + \gamma \sum_{s'} \hat p(s' \mid s,a)\, V(s') \big] \big|$

20 Bellman Error Ordering
initialize $V(s)$ arbitrarily for all $s$
compute $BE(s)$ for all $s$
loop {until convergence}
  select state $s'$ with worst Bellman error
  perform full backup of $s'$
  $BE(s') \leftarrow 0$
  for all predecessor states $\bar s$ of $s'$ do
    recompute $BE(\bar s)$
  end for
end loop
To get a positive trade-off: comp. time Bellman error << comp. time full backup
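The loop above can be sketched on a small model. The four-state chain and the discount `GAMMA` are assumptions for illustration, the Bellman error is taken as an absolute difference, and the worst state is found with a plain linear scan (a real implementation would more likely use a priority queue so that selection stays cheap relative to a full backup).

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical chain; state 3 is terminal.
# MODEL[s][a] = list of (probability, reward, next_state).
MODEL = {
    0: {"go": [(1.0, 0.0, 1)]},
    1: {"go": [(1.0, 0.0, 2)]},
    2: {"go": [(1.0, 1.0, 3)]},
    3: {},
}


def backed_up_value(V, s):
    """Value of s after a full backup (0 for the terminal state)."""
    if not MODEL[s]:
        return 0.0
    return max(
        sum(p * (r + GAMMA * V[s2]) for p, r, s2 in outs)
        for outs in MODEL[s].values()
    )


def bellman_error(V, s):
    """Absolute difference between current value and full-backup value."""
    return abs(V[s] - backed_up_value(V, s))


def predecessors():
    """Map each state to the set of states that can transition into it."""
    pred = {s: set() for s in MODEL}
    for s, actions in MODEL.items():
        for outs in actions.values():
            for p, _, s2 in outs:
                if p > 0:
                    pred[s2].add(s)
    return pred


def bellman_error_ordering(tol=1e-9, max_backups=1000):
    V = {s: 0.0 for s in MODEL}
    pred = predecessors()
    BE = {s: bellman_error(V, s) for s in MODEL}
    n_backups = 0
    while n_backups < max_backups:
        s = max(BE, key=BE.get)        # state with the worst Bellman error
        if BE[s] <= tol:               # converged
            break
        V[s] = backed_up_value(V, s)   # full backup
        BE[s] = 0.0
        n_backups += 1
        for sp in pred[s]:             # only predecessors can be affected
            BE[sp] = bellman_error(V, sp)
    return V, n_backups


if __name__ == "__main__":
    print(bellman_error_ordering())
```

On this chain every backup is useful: the error starts at the state next to the reward and is pushed backwards one predecessor at a time.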

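21 Prioritized Sweeping with Small Backups
initialize $V(s)$ arbitrarily for all $s$
initialize $U(s) = V(s)$ for all $s$
initialize $Q(s,a) = V(s)$ for all $s, a$
initialize $N_{sa}$, $N_{sa}^{s'}$ to 0 for all $s, a, s'$
loop {over episodes}
  initialize $s$
  repeat {for each step in the episode}
    select action $a$, based on $Q(s,\cdot)$
    take action $a$, observe $r$ and $s'$
    $N_{sa} \leftarrow N_{sa} + 1$; $N_{sa}^{s'} \leftarrow N_{sa}^{s'} + 1$
    $Q(s,a) \leftarrow \big[ Q(s,a)(N_{sa} - 1) + r + \gamma V(s') \big] / N_{sa}$
    $V(s) \leftarrow \max_b Q(s,b)$
    $p \leftarrow |V(s) - U(s)|$
    if $s$ is on queue, set its priority to $p$; otherwise, add it with priority $p$
    for a number of update cycles do
      remove top state $\bar s'$ from queue
      $\Delta U \leftarrow V(\bar s') - U(\bar s')$
      $U(\bar s') \leftarrow V(\bar s')$
      for all $(\bar s, \bar a)$ pairs with $N_{\bar s \bar a}^{\bar s'} > 0$ do
        $Q(\bar s, \bar a) \leftarrow Q(\bar s, \bar a) + \gamma \, (N_{\bar s \bar a}^{\bar s'} / N_{\bar s \bar a}) \, \Delta U$
        $V(\bar s) \leftarrow \max_b Q(\bar s, b)$
        $p \leftarrow |V(\bar s) - U(\bar s)|$
        if $\bar s$ is on queue, set its priority to $p$; otherwise, add it with priority $p$
      end for
    end for
    $s \leftarrow s'$
  until $s$ is terminal
end loop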
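A compact sketch of the per-observation part of this algorithm follows. Several things are assumptions rather than part of the slide: the class name `SmallBackupPS`, a discount factor `GAMMA`, all values initialised to 0, action selection left to the caller, and a plain dictionary scanned with `max` standing in for a real priority queue.

```python
from collections import defaultdict

GAMMA = 0.9  # assumed discount factor


class SmallBackupPS:
    """Prioritized sweeping with small backups (model-update and planning part)."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.V = defaultdict(float)    # current state values
        self.U = defaultdict(float)    # state values last propagated to predecessors
        self.Q = defaultdict(float)    # keyed by (s, a)
        self.N_sa = defaultdict(int)   # visit counts, keyed by (s, a)
        self.N_sas = defaultdict(int)  # transition counts, keyed by (s, a, s2)
        self.pred = defaultdict(set)   # s2 -> {(s, a)} with N_sas > 0
        self.queue = {}                # state -> priority

    def _refresh(self, s):
        """Recompute V(s) and set (or add) the priority |V(s) - U(s)|."""
        self.V[s] = max(self.Q[(s, b)] for b in self.actions)
        self.queue[s] = abs(self.V[s] - self.U[s])

    def observe(self, s, a, r, s2, n_cycles=5):
        """Process one transition, then run up to n_cycles update cycles."""
        self.N_sa[(s, a)] += 1
        self.N_sas[(s, a, s2)] += 1
        self.pred[s2].add((s, a))
        n = self.N_sa[(s, a)]
        self.Q[(s, a)] = (self.Q[(s, a)] * (n - 1) + r + GAMMA * self.V[s2]) / n
        self._refresh(s)
        for _ in range(n_cycles):
            if not self.queue:
                break
            top = max(self.queue, key=self.queue.get)  # highest-priority state
            del self.queue[top]
            delta_u = self.V[top] - self.U[top]
            self.U[top] = self.V[top]
            for sb, ab in self.pred[top]:
                # Small backup: correct Q(sb, ab) for the change in one successor.
                weight = self.N_sas[(sb, ab, top)] / self.N_sa[(sb, ab)]
                self.Q[(sb, ab)] += GAMMA * weight * delta_u
                self._refresh(sb)


if __name__ == "__main__":
    agent = SmallBackupPS(["go"])
    agent.observe(0, "go", 0.0, 1)  # nothing to propagate yet
    agent.observe(1, "go", 1.0, 2)  # reward found; change flows back to state 0
    print(dict(agent.V))
```

In the usage example the second observation changes V(1), and a single small backup in the first update cycle carries that change to Q(0, go) without recomputing any expectation.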

22 Empirical Comparison. [...] results in the best planning efficiency? Prioritized Sweeping (PS) with small backups outperform [...] [Figure: RMS error (averaged over early observations) against comp. time per observation [s], with the initial error marked; curves: PS, Moore & Atkeson; PS, Wiering & Schmidhuber; PS, Peng & Williams; PS, small backups; value iteration.]

23 Trajectory Sampling. Trajectory sampling: perform backups along simulated trajectories. This samples from the on-policy distribution. Advantages when function approximation is used (Chapter 8). Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored. [Figure labels: initial states; reachable under optimal control; irrelevant states.]

24 Trajectory Sampling Experiment. One-step full tabular backups. Uniform: cycled through all state-action pairs. On-policy: backed up along simulated trajectories. 200 randomly generated undiscounted episodic tasks. 2 actions for each state, each with b equally likely next states. 0.1 prob of transition to terminal state. Expected reward on each transition selected from a mean 0, variance 1 Gaussian.
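The two ways of distributing backups in this experiment can be sketched as below. This is a rough illustration, not the original experimental code: the function names are hypothetical, termination is assumed to give reward 0, trajectories are assumed to start in state 0, and the on-policy behaviour is assumed to be epsilon-greedy with epsilon = 0.1.

```python
import random

P_TERM = 0.1    # probability that any transition ends the episode
N_ACTIONS = 2


def make_task(n_states, b, rng):
    """task[s][a] = (expected_rewards, next_states) for b equally likely successors."""
    return [
        [
            ([rng.gauss(0.0, 1.0) for _ in range(b)],
             [rng.randrange(n_states) for _ in range(b)])
            for _ in range(N_ACTIONS)
        ]
        for _ in range(n_states)
    ]


def full_backup(Q, task, s, a):
    """One-step full tabular backup; terminating transitions contribute 0 (assumed)."""
    rewards, nexts = task[s][a]
    expected = sum(r + max(Q[s2]) for r, s2 in zip(rewards, nexts)) / len(nexts)
    Q[s][a] = (1.0 - P_TERM) * expected


def uniform_sweep(Q, task):
    """Uniform: cycle once through all state-action pairs."""
    for s in range(len(task)):
        for a in range(N_ACTIONS):
            full_backup(Q, task, s, a)
    return len(task) * N_ACTIONS  # number of backups performed


def on_policy_trajectory(Q, task, rng, epsilon=0.1, start=0):
    """On-policy: back up the pairs met along one simulated trajectory."""
    s, n_backups = start, 0
    while True:
        if rng.random() < epsilon:
            a = rng.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda x: Q[s][x])
        full_backup(Q, task, s, a)
        n_backups += 1
        if rng.random() < P_TERM:  # episode ends
            return n_backups
        s = rng.choice(task[s][a][1])


if __name__ == "__main__":
    rng = random.Random(0)
    task = make_task(n_states=1000, b=3, rng=rng)
    Q = [[0.0] * N_ACTIONS for _ in range(len(task))]
    print("backups in one trajectory:", on_policy_trajectory(Q, task, rng))
```

The on-policy variant spends its backups only on states the current policy actually reaches from the start state, which is the focusing effect the experiment is meant to measure.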

25 Heuristic Search. Used for action selection, not for changing a value function (= heuristic evaluation function). Backed-up values are computed, but typically discarded. Extension of the idea of a greedy policy, only deeper. Also suggests ways to select states to back up: smart focusing.
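The "greedy policy, only deeper" idea can be sketched with a depth-limited lookahead. The deterministic model, the all-zero heuristic evaluation function `V`, and the discount `GAMMA` are hypothetical values picked so that a deeper search changes the chosen action.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical deterministic model: MODEL[s][a] = (reward, next_state).
MODEL = {
    "s0": {"left": (0.0, "s1"), "right": (0.5, "s2")},
    "s1": {"left": (2.0, "end"), "right": (0.0, "end")},
    "s2": {"left": (0.0, "end"), "right": (0.0, "end")},
    "end": {},
}
# Heuristic evaluation function; the search reads it but never writes it.
V = {"s0": 0.0, "s1": 0.0, "s2": 0.0, "end": 0.0}


def lookahead_value(s, depth):
    """Backed-up value of s from a depth-limited search; V is used at the leaves."""
    if depth == 0 or not MODEL[s]:
        return V[s]
    return max(r + GAMMA * lookahead_value(s2, depth - 1)
               for r, s2 in MODEL[s].values())


def heuristic_search_action(s, depth):
    """Pick the action with the best backed-up value; the values are then discarded."""
    def action_value(a):
        r, s2 = MODEL[s][a]
        return r + GAMMA * lookahead_value(s2, depth - 1)
    return max(MODEL[s], key=action_value)


if __name__ == "__main__":
    print(heuristic_search_action("s0", depth=1))  # plain greedy choice
    print(heuristic_search_action("s0", depth=2))  # one step deeper
```

With depth 1 this is exactly the greedy policy with respect to `V`; with depth 2 the search sees the larger reward behind `left` and switches action, while `V` itself stays untouched.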

26 Summary. Efficient planning is about trying to spend the available computation time in the most effective way. Backup types: full/sample/small. Backup ordering: gain/loss trade-off; prioritized sweeping; prioritized sweeping with small backups; Bellman error ordering; trajectory sampling (backup along trajectories); heuristic search.


Bellman Optimality Equation for V*

The value of a state under an optimal policy must equal the expected return for the best action from that state: $V^*(s) = \max_{a \in A(s)} Q^*(s,a) = \max_{a \in A(s)} E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}$