Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Reinforcement Learning

Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Copyright 2002, Andrew W. Moore
April 23rd, 2002

Slide 2: Predicting Delayed Rewards

[Diagram: a discounted Markov system with states S1-S6; each state has a reward (R = -5, 0, 0, 3, 0, 1) and the arrows carry transition probabilities (0.2, 0.4, 0.5, 0.6, 0.8, ...). For example, Prob(next state = S5 | this state = S4) = 0.8, etc.]

What is the expected sum of future rewards (discounted)?

  E[ Σ_{t=0}^{∞} γ^t R(S[t]) | S[0] = S_i ]

Just solve it! We use standard Markov System Theory.
Slide 3: Learning Delayed Rewards

[Diagram: the same six states S1-S6, but now every reward and transition is unknown ("R = ???").]

All you can see is a series of states and rewards:

  S1 (R=0)  S2 (R=0)  S3 (R=4)  S2 (R=0)  S4 (R=0)  S5 (R=0)

Task: based on this sequence, estimate J*(S1), J*(S2), ..., J*(S6).

Slide 4: Idea 1: Supervised Learning

Assume γ = 1/2.

  S1 (R=0)  S2 (R=0)  S3 (R=4)  S2 (R=0)  S4 (R=0)  S5 (R=0)

At t=1 we were in state S1 and eventually got a long-term discounted reward of 0 + γ·0 + γ²·4 + γ³·0 + γ⁴·0 = 1.
At t=2, in state S2, ltdr = 2.    At t=3, in state S3, ltdr = 4.
At t=4, in state S2, ltdr = 0.    At t=5, in state S4, ltdr = 0.    At t=6, in state S5, ltdr = 0.

  State   Observations of LTDR   Mean LTDR
  S1      1                      1 = Ĵ(S1)
  S2      2, 0                   1 = Ĵ(S2)
  S3      4                      4 = Ĵ(S3)
  S4      0                      0 = Ĵ(S4)
  S5      0                      0 = Ĵ(S5)
Slide 5: Supervised Learning ALG

Watch a trajectory S[0] r[0] S[1] r[1] ... S[T] r[T].

For t = 0, 1, ..., T, compute

  J[t] = Σ_{i=0}^{T-t} γ^i r[t+i]

Let MATCHES(S_i) = { t | S[t] = S_i }, the set of all transitions on the trajectory beginning in state S_i. Then define

  Ĵ(S_i) = mean value of J[t] among transitions beginning in S_i
         = (1 / |MATCHES(S_i)|) Σ_{t ∈ MATCHES(S_i)} J[t]

You're done!

Slide 6: Supervised Learning ALG for the timid

If you have an anxious personality you may be worried about edge effects for some of the final transitions. With large trajectories these are negligible.
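The algorithm above can be sketched in a few lines of Python (the helper name is mine, not the slides'). Run on the slide's example sequence with γ = 1/2, it reproduces the LTDR table from the previous slide.

```python
def supervised_J(states, rewards, gamma):
    """Estimate J(s) as the mean long-term discounted reward (LTDR)
    over every visit to s in a single trajectory."""
    T = len(states)
    # J[t] = sum_i gamma^i * r[t+i], computed right-to-left in one pass
    ltdr = [0.0] * T
    acc = 0.0
    for t in range(T - 1, -1, -1):
        acc = rewards[t] + gamma * acc
        ltdr[t] = acc
    J = {}
    for s in set(states):
        matches = [ltdr[t] for t in range(T) if states[t] == s]
        J[s] = sum(matches) / len(matches)
    return J

# The slide's observed sequence: S1(0) S2(0) S3(4) S2(0) S4(0) S5(0)
J = supervised_J([1, 2, 3, 2, 4, 5], [0, 0, 4, 0, 0, 0], gamma=0.5)
```

The right-to-left accumulation avoids recomputing each discounted tail from scratch.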
Slide 7: Online Supervised Learning

Initialize:
  Count[S_i] = 0        ∀ S_i
  Sum[S_i] = 0          ∀ S_i
  Eligibility[S_i] = 0  ∀ S_i

Observe: when we experience S_i with reward r, do this:
  ∀ j, Elig[S_j] ← γ Elig[S_j]
  Elig[S_i] ← Elig[S_i] + 1
  ∀ j, Sum[S_j] ← Sum[S_j] + r × Elig[S_j]
  Count[S_i] ← Count[S_i] + 1

Then at any time, Ĵ(S_i) = Sum[S_i] / Count[S_i].

Slide 8: Online Supervised Learning Economics

Given N states S1 ... SN, OSL needs O(N) memory. Each update needs O(N) work since we must update all Elig[] array elements.

Idea: be sparse, and only update/process Elig[] elements with values > ξ for tiny ξ. There are only log ξ / log γ such elements.

Easy to prove: as T → ∞, Ĵ(S_i) → J*(S_i) ∀ S_i.
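A sketch of the online version (variable names are mine; every eligibility decays by γ on each step, exactly as above). On the same example sequence it agrees with the batch supervised-learning table.

```python
from collections import defaultdict

def online_supervised_J(transitions, gamma):
    """transitions: iterable of (state, reward) pairs, in trajectory order."""
    count = defaultdict(int)
    total = defaultdict(float)   # Sum[s]
    elig = defaultdict(float)    # Eligibility[s]
    for s, r in transitions:
        for j in elig:           # decay every eligibility by gamma
            elig[j] *= gamma
        elig[s] += 1.0
        for j in elig:           # credit reward r to all eligible states
            total[j] += r * elig[j]
        count[s] += 1
    return {s: total[s] / count[s] for s in count}

J = online_supervised_J([(1, 0), (2, 0), (3, 4), (2, 0), (4, 0), (5, 0)],
                        gamma=0.5)
```

This is the dense O(N)-per-update variant; the sparse trick from the slide would skip eligibilities below ξ.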
Slide 9: Online Supervised Learning

Let's grab OSL off the street, bundle it into a black van, take it to a bunker and interrogate it under 600-Watt lights.

  S1 (r=0)  S2 (r=0)  S3 (r=4)  S2 (r=0)  S4 (r=0)  S5 (r=0)

  State   Observations of LTDR   Ĵ(S_i)
  S1      1                      1
  S2      2, 0                   1
  S3      4                      4
  S4      0                      0
  S5      0                      0

COMPLAINT: there is something a little suspicious about this (efficiency-wise).

Slide 10: Certainty-Equivalent (CE) Learning

Idea: use your data to estimate the underlying Markov system, instead of trying to estimate J directly.

  S1 (r=0)  S2 (r=0)  S3 (r=4)  S2 (r=0)  S4 (r=0)  S5 (r=0)

Estimated Markov system (you draw in the transitions + probs):

  S1 r=0    S2 r=0    S3 r=4    S4 r=0    S5 r=0

What are the estimated J values?
Slide 11: C.E. Method for Markov Systems

Initialize:
  Count[S_i] = 0       (# times visited S_i)        ∀ S_i
  SumR[S_i] = 0        (sum of rewards from S_i)    ∀ S_i
  Trans[S_i, S_j] = 0  (# times transitioned S_i → S_j)

When we are in state S_i, and we receive reward r, and we move to S_j:
  Count[S_i] ← Count[S_i] + 1
  SumR[S_i] ← SumR[S_i] + r
  Trans[S_i, S_j] ← Trans[S_i, S_j] + 1

Then at any time:
  r̂(S_i) = SumR[S_i] / Count[S_i]
  P̂_ij = estimated Prob(next = S_j | this = S_i) = Trans[S_i, S_j] / Count[S_i]

Slide 12: C.E. for Markov Systems (continued)

So at any time we have r̂(S_i) and P̂(next = S_j | this = S_i) = P̂_ij, and at any time we can solve the set of linear equations

  Ĵ(S_i) = r̂(S_i) + γ Σ_j P̂(S_j | S_i) Ĵ(S_j)

[In vector notation: Ĵ = r̂ + γ P̂ Ĵ  ⇒  Ĵ = (I − γ P̂)⁻¹ r̂, where Ĵ, r̂ are vectors of length N, P̂ is an N×N matrix, and N = # states.]
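Here is a minimal certainty-equivalent estimator in Python (names are mine; numpy's linear solver stands in for the matrix inversion). Fed the same six-observation sequence with γ = 1/2, it builds r̂ and P̂ and solves Ĵ = (I − γP̂)⁻¹ r̂. Notice it squeezes more out of the data than supervised learning did: it knows S2 sometimes leads to S3.

```python
import numpy as np

def ce_estimate(observations, n_states, gamma):
    """observations: (i, r, j) triples; j may be None at the trajectory's end."""
    count = np.zeros(n_states)
    sum_r = np.zeros(n_states)
    trans = np.zeros((n_states, n_states))
    for i, r, j in observations:
        count[i] += 1
        sum_r[i] += r
        if j is not None:
            trans[i, j] += 1
    safe = np.where(count > 0, count, 1)      # avoid divide-by-zero for unvisited states
    r_hat = sum_r / safe
    P_hat = trans / safe[:, None]
    J = np.linalg.solve(np.eye(n_states) - gamma * P_hat, r_hat)
    return r_hat, P_hat, J

# S1(0)->S2(0)->S3(4)->S2(0)->S4(0)->S5(0); index 0 unused
obs = [(1, 0, 2), (2, 0, 3), (3, 4, 2), (2, 0, 4), (4, 0, 5), (5, 0, None)]
r_hat, P_hat, J = ce_estimate(obs, n_states=6, gamma=0.5)
```

Solving by hand: Ĵ(S3) = 4 + γ·Ĵ(S2) with Ĵ(S2) = γ(Ĵ(S3)+Ĵ(S4))/2, giving Ĵ(S3) = 32/7 ≈ 4.57.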
Slide 13: C.E. Online Economics

Memory: O(N²).
Time to update counters: O(1).
Time to re-evaluate Ĵ:
  O(N³) if we use matrix inversion
  O(N² k_CRIT) if we use value iteration and need k_CRIT iterations to converge
  O(N k_CRIT) if we use value iteration, need k_CRIT iterations, and the M.S. is sparse (i.e. the mean number of successors is constant)

Slide 14: Certainty Equivalent Learning

COMPLAINT: memory use could be O(N²)! And time per update could be O(N k_CRIT), up to O(N³)! Too expensive for some people.

Prioritized sweeping will help (see later), but first let's review a very inexpensive approach.
Slide 15: Why this obsession with onlineness?

I really care about supplying up-to-date estimates all the time.

Can you guess why?

If not, all will be revealed in good time...

Slide 16: Less Time: More Data (Limited Backups)

Do the previous C.E. algorithm. At each time step we observe S_i (r) → S_j and update Count[S_i], SumR[S_i], Trans[S_i, S_j], and thus also update the estimates r̂_i and P̂_ij for the outcomes of S_i.

But instead of re-solving for Ĵ, do much less work: just do one backup of Ĵ[S_i]:

  Ĵ[S_i] ← r̂_i + γ Σ_j P̂_ij Ĵ[S_j]
Slide 17: One Backup C.E. Economics

Space: O(N²) — NO IMPROVEMENT THERE!
Time to update statistics: O(1).
Time to update Ĵ: O(1).

Good news: much cheaper per transition.
Good news: a (modified) contraction mapping proof promises convergence to optimal.
Bad news: wastes data.

Slide 18: Prioritized Sweeping [Moore + Atkeson, '93]

Tries to be almost as data-efficient as full CE, but not much more expensive than one-backup CE.

On every transition, some number (β) of states may have a backup applied. Which ones? The most deserving. We keep a priority queue of which states have the biggest potential for changing their Ĵ(S) value.
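A compact sketch of the prioritized-sweeping idea (class and variable names are mine, and the priority here is simply the size of the last value change, a simplification of the Moore + Atkeson priority rule). On each observed transition it updates the model, backs up the current state, and spends up to β further backups on the states most likely to be stale.

```python
import heapq
from collections import defaultdict

class PrioritizedSweeper:
    """Prioritized sweeping for a Markov system (no actions)."""
    def __init__(self, n_states, gamma, beta=5, theta=1e-6):
        self.gamma, self.beta, self.theta = gamma, beta, theta
        self.count = [0] * n_states
        self.sum_r = [0.0] * n_states
        self.trans = [defaultdict(int) for _ in range(n_states)]  # i -> {j: count}
        self.preds = [set() for _ in range(n_states)]
        self.J = [0.0] * n_states
        self.pq = []  # max-heap via negated priorities: (-priority, state)

    def _backup(self, i):
        """One CE backup of J[i]; returns how far J[i] moved."""
        r_hat = self.sum_r[i] / self.count[i]
        exp_next = sum(c * self.J[j] for j, c in self.trans[i].items()) / self.count[i]
        new_J = r_hat + self.gamma * exp_next
        delta = abs(new_J - self.J[i])
        self.J[i] = new_J
        return delta

    def observe(self, i, r, j):
        self.count[i] += 1
        self.sum_r[i] += r
        self.trans[i][j] += 1
        self.preds[j].add(i)
        heapq.heappush(self.pq, (-float("inf"), i))  # always process current state
        for _ in range(self.beta):
            if not self.pq:
                break
            _, s = heapq.heappop(self.pq)
            delta = self._backup(s)
            if delta > self.theta:
                for p in self.preds[s]:  # predecessors may now be stale
                    heapq.heappush(self.pq, (-delta, p))

# Chain 0 -> 1 -> 2 with state 2 absorbing and r(2)=4; gamma=1/2
ps = PrioritizedSweeper(n_states=3, gamma=0.5)
for _ in range(40):
    ps.observe(0, 0, 1)
    ps.observe(1, 0, 2)
    ps.observe(2, 4, 2)
```

On this chain the true values are J = (2, 4, 8), and the sweeper's estimates converge to them without ever re-solving the full linear system.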
Slide 19: Where Are We?

Trying to do online prediction from streams of transitions.

  Method                 Space     Update Cost
  Supervised Learning    O(N_s)    O(log ξ / log γ)
  Full C.E. Learning     O(N_so)   O(N_so N_s) to O(N_so k_CRIT)
  One Backup C.E.        O(N_so)   O(1)
  Prioritized Sweeping   O(N_so)   O(β)

  N_so = # state-outcomes (number of arrows on the M.S. diagram)
  N_s = # states

[The original slide also rates each method's data efficiency graphically.]

What next? Sample backups!!!

Slide 20: Temporal Difference Learning [Sutton 1988]

Only maintain a Ĵ array... nothing else!

So you've got Ĵ(S1), Ĵ(S2), ..., Ĵ(S_N) and you observe S_i (r) → S_j — a transition from i that receives an immediate reward of r and jumps to j. What should you do? Can you guess?
Slide 21: TD Learning

On observing S_i (r) → S_j, we update Ĵ(S_i): we nudge it to be closer to the expected future rewards.

  Ĵ(S_i) ← (1−α) Ĵ(S_i) + α [ r + γ Ĵ(S_j) ]

(a WEIGHTED SUM; the bracketed term estimates the expected future rewards)

α is called a learning rate parameter. (See α in the neural net lecture.)

Slide 22: Simplified TD Analysis

[Diagram: from start state S0 (r=0) you transition to one of M terminal states; the k-th is reached with probability p(k) = ? and pays reward r(k) = ?, then the trial TERMINATES.]

Suppose you always begin in S0.
You then transition at random to one of M places. You don't know the transition probs.
You then get a place-dependent reward (unknown in advance).
Then the trial terminates.

Define J*(S0) = expected reward. Let's estimate it with TD.
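The nudge rule is one line of code. Below, a sketch of the simplified setup (the outcome probabilities and rewards are made up for illustration) with a fixed learning rate: the estimate wanders around J* = E[r] rather than converging, which is exactly the variance phenomenon analysed on the next slides.

```python
import random

def td_estimate(outcomes, alpha, n_trials, seed=0):
    """outcomes: list of (prob, reward) pairs for the one-step terminal model."""
    rng = random.Random(seed)
    probs = [p for p, _ in outcomes]
    rewards = [r for _, r in outcomes]
    J = 0.0
    for _ in range(n_trials):
        r = rng.choices(rewards, weights=probs)[0]
        J = (1 - alpha) * J + alpha * r   # the TD update (gamma plays no role here)
    return J

# Reward 10 w.p. 0.5, reward 0 w.p. 0.5, so J* = 5
J = td_estimate([(0.5, 10.0), (0.5, 0.0)], alpha=0.01, n_trials=20000)
```

With α = 0.01 the steady-state standard deviation is σ√(α/(2−α)) ≈ 0.35 here, so the final estimate sits close to 5.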
Slide 23:

[Diagram: S0 (r=0) branches to terminal states with probabilities p(1), p(2), ..., p(M) and rewards r(1), r(2), ..., r(M).]

r(k) = reward of the k-th terminal state; p(k) = prob of the k-th terminal state.

We'll do a series of trials. Reward on the t-th trial is r_t.

  E[r_t] = Σ_{k=1}^{M} p(k) r(k)      [note E[r_t] is independent of t]

Define J*(S0) = J* = E[r_t].

Slide 24:

Let's run TD-Learning, where J_t = estimate of Ĵ(S0) before the t-th trial.

From the definition of TD-Learning: J_{t+1} = (1−α) J_t + α r_t.

Useful quantity: define the variance of the reward,

  σ² = E[(r_t − J*)²] = Σ_{k=1}^{M} p(k) (r(k) − J*)²
Slide 25:

Remember J* = E[r_t], σ² = E[(r_t − J*)²], and J_{t+1} = α r_t + (1−α) J_t.

  E[J_{t+1}] = E[α r_t + (1−α) J_t] = α J* + (1−α) E[J_t]

Thus...  lim_{t→∞} E[J_t] = J*      WHY?

Is this impressive??

Slide 26:

Remember J* = E[r_t], σ² = E[(r_t − J*)²], J_{t+1} = α r_t + (1−α) J_t.

Write S_t = expected squared error between J_t and J* before the t-th iteration.

  S_{t+1} = E[(J_{t+1} − J*)²]
          = E[(α r_t + (1−α) J_t − J*)²]
          = E[(α[r_t − J*] + (1−α)[J_t − J*])²]
          = E[α²(r_t − J*)² + 2α(1−α)(r_t − J*)(J_t − J*) + (1−α)²(J_t − J*)²]
          = α² E[(r_t − J*)²] + 2α(1−α) E[(r_t − J*)(J_t − J*)] + (1−α)² E[(J_t − J*)²]
          = α² σ² + (1−α)² S_t      WHY?
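You can check numerically that the recursion S_{t+1} = α²σ² + (1−α)² S_t settles at the fixed point ασ²/(2−α) derived on the next slide (the parameter values below are arbitrary):

```python
def squared_error_limit(alpha, sigma2, s0=10.0, n=2000):
    """Iterate S_{t+1} = alpha^2 * sigma^2 + (1-alpha)^2 * S_t from S_0 = s0."""
    s = s0
    for _ in range(n):
        s = alpha**2 * sigma2 + (1 - alpha)**2 * s
    return s

limit = squared_error_limit(alpha=0.1, sigma2=4.0)
# fixed point: solve S = a^2*s2 + (1-a)^2*S  =>  S = a*s2/(2-a)
```

Solving the fixed-point equation directly: S = α²σ² / (1 − (1−α)²) = ασ²/(2−α), which is 0.1·4/1.9 ≈ 0.2105 for these values.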
Slide 27:

And it's thus easy to show that

  lim_{t→∞} S_t = lim_{t→∞} E[(J_t − J*)²] = α σ² / (2 − α)

What do you think of TD learning? How would you improve it?

Slide 28: Decaying Learning Rate

[Dayan 1991ish] showed for general TD learning of a Markov system (not just our simple model) that if you use the update rule

  Ĵ(S_i) ← α_t [ r + γ Ĵ(S_j) ] + (1−α_t) Ĵ(S_i)

then, as the number of observations goes to infinity, Ĵ(S_i) → J*(S_i) for all S_i, PROVIDED:

  • All states are visited ∞ly often
  • Σ_{t=1}^{∞} α_t = ∞      [this means: ∀k, ∃T s.t. Σ_{t=1}^{T} α_t > k]
  • Σ_{t=1}^{∞} α_t² < ∞     [this means: ∃k s.t. ∀T, Σ_{t=1}^{T} α_t² < k]
Slide 29: Decaying Learning Rate

This works:   α_t = 1/t.                         This doesn't:  α_t = α_0.
This works:   α_t = β/(β+t)  [e.g. β = 1000].    This doesn't:  α_t = β α_{t−1}  (β < 1).

IN OUR EXAMPLE... use α_t = 1/t.

Remember J* = E[r_t], σ² = E[(r_t − J*)²]. Write

  J_{t+1} = α_t r_t + (1−α_t) J_t = (1/t) r_t + ((t−1)/t) J_t

Defining C_t = t J_{t+1}, this becomes C_t = r_t + C_{t−1} with C_0 = 0, and so you'll see that

  J_{t+1} = (1/t) Σ_{i=1}^{t} r_i

the estimate is just the sample mean of the rewards so far.

Slide 30: Decaying Learning Rate contd

Since J_{t+1} is the mean of t independent rewards, its expected squared error E[(J_{t+1} − J*)²] shrinks like σ²/t (the initial error (J_0 − J*)² is washed out even faster), so, ultimately,

  lim_{t→∞} E[(J_t − J*)²] = 0
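With α_t = 1/t the TD recursion is exactly the running sample mean, which a couple of lines confirm (the example rewards are arbitrary):

```python
def td_decaying(rewards):
    """Run J_{t+1} = (1/t) r_t + ((t-1)/t) J_t, starting from J = 0."""
    J = 0.0
    for t, r in enumerate(rewards, start=1):
        alpha = 1.0 / t
        J = alpha * r + (1 - alpha) * J
    return J

J = td_decaying([3.0, 5.0, 10.0])   # the sample mean of these rewards is 6
```

Note that α_1 = 1, so the first update overwrites the initial value entirely; that is why J_0 plays no role in the limit.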
Slide 31: A Fancier TD

Write S[t] = state at time t. Suppose α = 1/4, γ = 1/2.

Assume Ĵ(S23) = 0, Ĵ(S17) = 0, Ĵ(S44) = 16, and t = 405 with S[t] = S23.

Observe S23 (r=0) → S17 with reward 0.
Now t = 406, S[t] = S17, S[t−1] = S23.    Ĵ(S23) = ?, Ĵ(S17) = ?, Ĵ(S44) = ?

Observe S17 (r=0) → S44.
Now t = 407, S[t] = S44.    Ĵ(S23) = ?, Ĵ(S17) = ?, Ĵ(S44) = ?

INSIGHT: Ĵ(S23) might think "I gotta get me some of that!!!"

Slide 32: TD(λ) Comments

• TD(λ=0) is the original TD.
• TD(λ=1) is almost the same as supervised learning (except it uses a learning rate instead of explicit counts).
• TD(λ=0.7) is often empirically the best performer.
• Dayan's proof holds for all 0 ≤ λ ≤ 1.
• Updates can be made more computationally efficient with eligibility traces (similar to O.S.L.).
• Questions: can you invent a problem that would make TD(0) look bad and TD(1) look good? How about TD(0) look good & TD(1) bad??
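An eligibility-trace sketch of TD(λ) (accumulating traces; variable names are mine), replayed on the S23/S17/S44 example from the previous slide with λ = 1/2. Unlike TD(0), after the second transition S23 has already picked up some of S44's value.

```python
from collections import defaultdict

def td_lambda_step(J, elig, s, r, s_next, alpha, gamma, lam):
    """One TD(lambda) update with accumulating eligibility traces."""
    for k in elig:
        elig[k] *= gamma * lam        # decay all traces
    elig[s] += 1.0                    # bump the trace of the visited state
    delta = r + gamma * J[s_next] - J[s]
    for k, e in elig.items():
        J[k] += alpha * delta * e     # credit the TD error back along the trajectory

J = defaultdict(float, {"S44": 16.0})
elig = defaultdict(float)
td_lambda_step(J, elig, "S23", 0.0, "S17", alpha=0.25, gamma=0.5, lam=0.5)
td_lambda_step(J, elig, "S17", 0.0, "S44", alpha=0.25, gamma=0.5, lam=0.5)
```

On the second step the TD error is δ = 0 + γ·16 − 0 = 8; S17 (trace 1) gains αδ = 2, and S23 (trace γλ = 1/4) gains αδ/4 = 0.5, whereas TD(0) would have left S23 at 0.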
Slide 33: Learning M.S. Summary

  MODEL-BASED             Space     Update Cost
  Supervised Learning     O(N_s)    O(log ξ / log γ)
  Full C.E. Learning      O(N_so)   O(N_so N_s) to O(N_so k_CRIT)
  One Backup C.E.         O(N_so)   O(1)
  Prioritized Sweeping    O(N_so)   O(β)

  MODEL FREE
  TD(0)                   O(N_s)    O(1)
  TD(λ), 0 < λ ≤ 1        O(N_s)    O(log ξ / log(γλ))

[The original slide also rates each method's data efficiency graphically.]

Slide 34: Learning Policies for MDPs

See the previous lecture slides for the definition of and computation with MDPs.

[Diagram: "The Heart of REINFORCEMENT Learning" — an agent/world loop labelled with "state".]
Slide 35: The task:

  World: You are in state 34. Your immediate reward is 3. You have 3 actions.
  Robot: I'll take action 2.
  World: You are in state 77. Your immediate reward is -7. You have 2 actions.
  Robot: I'll take action 1.
  World: You're in state 34 (again). Your immediate reward is 3. You have 3 actions.

The Markov property means that once you've selected an action, the P.D.F. of your next state is the same as the last time you tried that action in this state.

Slide 36: The Credit Assignment Problem

  I'm in state 43,  reward = 0,    action = 2
             39,           = 0,           = 4
             22,           = 0,           = 1
             21,           = 0,           = 1
             21,           = 0,           = 1
             13,           = 0,           = 2
             54,           = 0,           = 2
             26,           = 100,

Yippee! I got to a state with a big reward! But which of my actions along the way actually helped me get there??

This is the Credit Assignment problem. It makes Supervised Learning approaches (e.g. Boxes [Michie & Chambers]) very, very slow.

Using the MDP assumption helps avoid this problem.
Slide 37: MDP Policy Learning

  Method                 Space      Update Cost
  Full C.E. Learning     O(N_sao)   O(N_sao k_CRIT)
  One Backup C.E.        O(N_sao)   O(N_ao)
  Prioritized Sweeping   O(N_sao)   O(β N_ao)

(We'll think about model-free in a moment.)

The C.E. methods are very similar to the M.S. case, except that we now do value-iteration-for-MDP backups:

  Ĵ(S_i) ← max_a [ r̂_i^a + γ Σ_{S_j ∈ SUCCS(S_i)} P̂(S_j | S_i, a) Ĵ(S_j) ]

Slide 38: Choosing Actions

We're in state S_i. We can estimate:
  r̂_i^a,   P̂(next = S_j | this = S_i, action a),   Ĵ(next = S_j)

So what action should we choose?

  IDEA 1:  a = argmax_a [ r̂_i^a + γ Σ_j P̂(S_j | S_i, a) Ĵ(S_j) ]
  IDEA 2:  a = random

Any problems with these ideas? Any other suggestions? Could we be optimal?
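Applying the max-over-actions backup until it stops changing is just value iteration. A tiny sketch on a made-up two-state MDP (in state 0 you can "stay" for reward 1 or "leave" for reward 3 into the absorbing state 1; γ = 1/2):

```python
def value_iteration(R, P, gamma, n_iters=100):
    """R[s][a]: immediate reward; P[s][a]: dict mapping next_state -> prob."""
    n = len(R)
    J = [0.0] * n
    for _ in range(n_iters):
        # one synchronous sweep of the max-over-actions backup
        J = [max(R[s][a] + gamma * sum(p * J[s2] for s2, p in P[s][a].items())
                 for a in range(len(R[s])))
             for s in range(n)]
    return J

R = [[1.0, 3.0], [0.0]]                  # state 0: stay/leave; state 1: one action
P = [[{0: 1.0}, {1: 1.0}], [{1: 1.0}]]   # state 1 is absorbing with reward 0
J = value_iteration(R, P, gamma=0.5)
```

Staying forever is worth 1/(1−γ) = 2, leaving is worth 3, so J*(0) = 3 and J*(1) = 0.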
Slide 39: Model-Free R.L.

Why not use T.D.? Observe S_i (r, action a) → S_j, and update

  Ĵ(S_i) ← α (r + γ Ĵ(S_j)) + (1−α) Ĵ(S_i)

What's wrong with this?

Slide 40: Q-Learning: Model-Free R.L. [Watkins, 1988]

Define Q*(S_i, a) = expected sum of discounted future rewards if I start in state S_i, if I then take action a, and if I'm subsequently optimal.

Questions:
  • Define Q*(S_i, a) in terms of J*.
  • Define J*(S_i) in terms of Q*.
Slide 41: Q-Learning Update

Note that

  Q*(S_i, a) = r_i^a + γ Σ_{S_j ∈ SUCCS(S_i)} P(S_j | S_i, a) max_{a'} Q*(S_j, a')

In Q-learning we maintain a table of Q̂ values instead of Ĵ values. When you see S_i (reward r, action a) → S_j, do

  Q̂(S_i, a) ← α [ r + γ max_{a'} Q̂(S_j, a') ] + (1−α) Q̂(S_i, a)

This is even cleverer than it looks: the Q̂ values are not biased by any particular exploration policy. It avoids the Credit Assignment problem.

Slide 42: Q-Learning: Choosing Actions

Same issues as for CE choosing actions:

• Don't always be greedy — don't always choose argmax_a Q̂(s, a).
• Don't always be random (otherwise it will take a long time to reach somewhere exciting).
• Boltzmann exploration [Watkins]:  Prob(choose action a) ∝ exp( Q̂(s, a) / K_t ).
• Optimism in the face of uncertainty [Sutton '90, Kaelbling '90]: initialize Q-values optimistically high to encourage exploration, or take into account how often each (s, a) pair has been tried.
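The tabular update as code (a sketch with my own names; α = 1/2, γ = 0.9, and the two transitions below are made up), checking the arithmetic by hand:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, n_actions, alpha, gamma):
    """Tabular Q-learning backup for one observed transition (s, a, r, s_next)."""
    best_next = max(Q[(s_next, a2)] for a2 in range(n_actions))
    Q[(s, a)] = alpha * (r + gamma * best_next) + (1 - alpha) * Q[(s, a)]

Q = defaultdict(float)   # unseen (state, action) pairs default to 0
q_update(Q, s=0, a=0, r=1.0, s_next=1, n_actions=2, alpha=0.5, gamma=0.9)
q_update(Q, s=1, a=1, r=0.0, s_next=0, n_actions=2, alpha=0.5, gamma=0.9)
```

First update: Q(0,0) = 0.5·(1 + 0.9·0) = 0.5. Second: the best action back in state 0 is now worth 0.5, so Q(1,1) = 0.5·(0 + 0.9·0.5) = 0.225. Note the backup uses max over next actions regardless of which action the explorer actually takes next; that is what makes the estimates exploration-policy-free.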
Slide 43: Q-Learning Comments

• [Watkins] proved that Q-learning will eventually converge to an optimal policy.
• Empirically it's cute.
• Empirically it's very slow.
• Why not do Q(λ)? It would not make much sense [it would reintroduce the credit assignment problem]. Some people (e.g. Peng & Williams) have tried to work their way around this.

Slide 44: If we had time...

Value function approximation:
• Use a neural net to represent Ĵ [e.g. Tesauro]
• Use a neural net to represent Q̂ [e.g. Crites]
• Use a decision tree: with Q-learning [Chapman + Kaelbling '91], with C.E. learning [Moore '91]

How to split up space?
• Significance test on Q values [Chapman + Kaelbling]
• Execution accuracy monitoring [Moore '91]
• Game theory [Moore + Atkeson '95]
• New influence/variance criteria [Munos '99]
Slide 45: R.L. Theory — If we had time...

• Counterexamples [Boyan + Moore], [Baird]
• Value function approximators with averaging will converge to something [Gordon]
• Neural nets can fail [Baird]
• Neural nets with residual gradient updates will converge to something
• Linear approximators for TD learning will converge to something useful [Tsitsiklis + Van Roy]

Slide 46: What You Should Know

• Supervised learning for predicting delayed rewards
• Certainty-equivalent learning for predicting delayed rewards
• Model-free learning (TD) for predicting delayed rewards
• Reinforcement learning with MDPs: What's the task? Why is it hard to choose actions?
• Q-learning (including being able to work through small simulated examples of RL)