Reinforcement Learning
  Control learning
  Control policies that choose optimal actions
  Q learning
  Convergence
Chapter 13: Reinforcement Learning

Control Learning
Consider learning to choose actions, e.g.,
  Robot learning to dock on battery charger
  Learning to choose actions to optimize factory output
  Learning to play Backgammon
Note several problem characteristics:
  Delayed reward
  Opportunity for active exploration
  Possibility that state is only partially observable
  Possible need to learn multiple tasks with same sensors/effectors

One Example: TD-Gammon [Tesauro, 1995]
Learn to play Backgammon
Immediate reward:
  +100 if win
  -100 if lose
  0 for all other states
Trained by playing 1.5 million games against itself
Now approximately equal to best human player

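Reinforcement Learning Problem
[Figure: agent-environment loop. The environment sends the current state and reward to the agent; the agent sends an action back. This produces the sequence $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots$]
Goal: learn to choose actions that maximize
$r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $0 \le \gamma < 1$
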
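The goal above is a discounted sum of rewards. As a minimal sketch (the function name and the sample reward list are illustrative assumptions, not part of the slides), it can be computed for a finite reward sequence like this:

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite list of rewards."""
    if not 0 <= gamma < 1:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    total = 0.0
    # Work backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: no reward for two steps, then 100 on the third.
example = discounted_return([0, 0, 100], gamma=0.9)
print(example)
```

With a smaller discount factor the same delayed reward is worth less, which is why the agent prefers short paths to a reward.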
Markov Decision Process
Assume
  finite set of states $S$
  set of actions $A$
  at each discrete time $t$, agent observes state $s_t \in S$ and chooses action $a_t \in A$
  then receives immediate reward $r_t$
  and state changes to $s_{t+1}$
Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$
  i.e., $r_t$ and $s_{t+1}$ depend only on current state and action
  functions $\delta$ and $r$ may be nondeterministic
  functions $\delta$ and $r$ not necessarily known to agent

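Agent's Learning Task
Execute actions in environment, observe results, and
  learn action policy $\pi : S \to A$ that maximizes
  $E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots]$
  from any starting state in $S$
  here $0 \le \gamma < 1$ is the discount factor for future rewards
Note something new:
  target function is $\pi : S \to A$
  but we have no training examples of form $\langle s, a \rangle$
  training examples are of form $\langle \langle s, a \rangle, r \rangle$
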
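Value Function
To begin, consider deterministic worlds.
For each possible policy $\pi$ the agent might adopt, we can define an evaluation function over states
$V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$
where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$ starting at state $s$
Restated, the task is to learn the optimal policy $\pi^*$
$\pi^* \equiv \arg\max_{\pi} V^{\pi}(s), \ (\forall s)$
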
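In a deterministic world, the value of a policy at a state can be found by following the policy and summing discounted rewards. The sketch below assumes a hypothetical 2x3 grid with an absorbing goal cell, a reward of 100 for entering the goal, and $\gamma = 0.9$; the layout, names, and the sample policy are illustrative assumptions.

```python
GAMMA = 0.9
GOAL = (0, 2)  # assumed absorbing goal cell (row, col) of a 2x3 grid
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def delta(s, a):
    """Deterministic transition; moves that leave the grid keep the state unchanged."""
    if s == GOAL:
        return s
    r, c = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (r, c) if 0 <= r < 2 and 0 <= c < 3 else s


def reward(s, a):
    """Immediate reward: 100 for entering the goal, 0 otherwise."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0


def value_of_policy(policy, s, horizon=100):
    """V^pi(s): discounted sum of rewards obtained by following policy from s."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        if s == GOAL:
            break
        a = policy[s]
        total += discount * reward(s, a)
        discount *= GAMMA
        s = delta(s, a)
    return total


# A hypothetical policy: head right along each row, then up into the goal.
policy = {(0, 0): "right", (0, 1): "right",
          (1, 0): "right", (1, 1): "right", (1, 2): "up"}
values = {s: value_of_policy(policy, s) for s in policy}
```

The `horizon` cut-off only matters for policies that never reach the goal; for those the sum of zero rewards stays at 0.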
[Figure: a simple deterministic grid world with goal state G. Four panels: $r(s, a)$ (immediate reward) values, which are 100 for actions entering G and 0 otherwise; $Q(s, a)$ values (72, 81, 90, 100); $V^*(s)$ values (81, 90, 100); one optimal policy.]

What to Learn
We might try to have agent learn the evaluation function $V^{\pi^*}$ (which we write as $V^*$)
We could then do a lookahead search to choose best action from any state $s$ because
$\pi^*(s) = \arg\max_{a} [r(s, a) + \gamma V^*(\delta(s, a))]$
A problem:
  This works well if agent knows $\delta : S \times A \to S$, and $r : S \times A \to \mathbb{R}$
  But when it doesn't, it can't choose actions this way

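The one-step lookahead can be sketched as follows. The three-state chain, its transition and reward tables, and the $V^*$ values are hypothetical and chosen to be mutually consistent for $\gamma = 0.9$. Note that the function has to read both the transition table and the reward table, which is exactly the knowledge the agent may lack.

```python
GAMMA = 0.9

# Hypothetical deterministic chain: s0 -> s1 -> goal.
DELTA = {("s0", "stay"): "s0", ("s0", "go"): "s1",
         ("s1", "stay"): "s1", ("s1", "go"): "goal"}
R = {("s0", "stay"): 0, ("s0", "go"): 0,
     ("s1", "stay"): 0, ("s1", "go"): 100}
V_STAR = {"s0": 90.0, "s1": 100.0, "goal": 0.0}


def lookahead_action(s, actions=("stay", "go")):
    """pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]."""
    return max(actions, key=lambda a: R[(s, a)] + GAMMA * V_STAR[DELTA[(s, a)]])


print(lookahead_action("s0"), lookahead_action("s1"))
```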
Q Function
Define new function very similar to $V^*$
$Q(s, a) \equiv r(s, a) + \gamma V^*(\delta(s, a))$
If agent learns $Q$, it can choose optimal action even without knowing $\delta$!
$\pi^*(s) = \arg\max_{a} [r(s, a) + \gamma V^*(\delta(s, a))]$
$\pi^*(s) = \arg\max_{a} Q(s, a)$
$Q$ is the evaluation function the agent will learn

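Once a Q table is available, action selection needs nothing but a table lookup. A minimal sketch, with a hypothetical table for a single state:

```python
# Hypothetical learned Q values for one state with three available actions.
Q = {("s1", "left"): 63.0, ("s1", "up"): 81.0, ("s1", "right"): 100.0}


def greedy_action(q_table, state, actions):
    """pi*(s) = argmax_a Q(s, a); uses only the table, never delta or r."""
    return max(actions, key=lambda a: q_table[(state, a)])


best = greedy_action(Q, "s1", ["left", "up", "right"])
print(best)
```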
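Training Rule to Learn Q
Note $Q$ and $V^*$ closely related:
$V^*(s) = \max_{a'} Q(s, a')$
Which allows us to write $Q$ recursively as
$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t))$
$Q(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$
Let $\hat{Q}$ denote learner's current approximation to $Q$. Consider training rule
$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
where $s'$ is the state resulting from applying action $a$ in state $s$
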
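Q Learning for Deterministic Worlds
For each $s, a$ initialize table entry $\hat{Q}(s, a) \leftarrow 0$
Observe current state $s$
Do forever:
  Select an action $a$ and execute it
  Receive immediate reward $r$
  Observe the new state $s'$
  Update the table entry for $\hat{Q}(s, a)$ as follows:
    $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
  $s \leftarrow s'$
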
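The loop above can be sketched in Python. Several details are assumptions made only for illustration: a 2x3 grid with an absorbing goal cell, a reward of 100 for entering the goal, purely random action selection, and restarting from a random state whenever the goal is reached (instead of a literal "forever" loop).

```python
import random

GAMMA = 0.9
GOAL = (0, 2)  # assumed absorbing goal cell (row, col) of a 2x3 grid
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STATES = [(r, c) for r in range(2) for c in range(3)]


def legal_actions(s):
    """Actions that keep the agent on the grid (an assumption of this sketch)."""
    return [a for a, (dr, dc) in MOVES.items()
            if 0 <= s[0] + dr < 2 and 0 <= s[1] + dc < 3]


def step(s, a):
    """Environment: returns (reward, next_state). The learner only sees the outputs."""
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return (100 if nxt == GOAL else 0), nxt


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    # For each s, a initialize the table entry to zero.
    q = {(s, a): 0.0 for s in STATES if s != GOAL for a in legal_actions(s)}
    starts = [s for s in STATES if s != GOAL]
    for _ in range(episodes):
        s = rng.choice(starts)                      # observe current state
        while s != GOAL:
            a = rng.choice(legal_actions(s))        # select an action and execute it
            r, s_next = step(s, a)                  # receive reward, observe new state
            best_next = 0.0 if s_next == GOAL else max(
                q[(s_next, b)] for b in legal_actions(s_next))
            q[(s, a)] = r + GAMMA * best_next       # deterministic training rule
            s = s_next
    return q


q_table = q_learning()
```

Random exploration is the simplest way to visit every state-action pair repeatedly; the slides leave the selection strategy open.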
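Updating $\hat{Q}$
[Figure: the agent moves right from initial state $s_1$ to next state $s_2$. The entry $\hat{Q}(s_1, a_{right})$ is 72 before the update and 90 after it; the actions available in $s_2$ have $\hat{Q}$ values 63, 81, and 100.]
$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a')$
$\leftarrow 0 + 0.9 \max\{63, 81, 100\}$
$\leftarrow 90$
Notice that if rewards are non-negative, then
$(\forall s, a, n) \ \hat{Q}_{n+1}(s, a) \ge \hat{Q}_n(s, a)$
and
$(\forall s, a, n) \ 0 \le \hat{Q}_n(s, a) \le Q(s, a)$
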
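The single update on this slide can be checked numerically. The variable names below are illustrative; the numbers are the ones shown on the slide.

```python
GAMMA = 0.9

q_next = [63, 81, 100]   # Q-hat values of the actions available in s2
r = 0                    # immediate reward for moving right from s1
old_estimate = 72        # Q-hat(s1, a_right) before the update

new_estimate = r + GAMMA * max(q_next)
print(old_estimate, "->", new_estimate)
```

The estimate rises from 72 to 90, which agrees with the observation that estimates never decrease when rewards are non-negative.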
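Convergence
$\hat{Q}$ converges to $Q$. Consider case of deterministic world where each $\langle s, a \rangle$ is visited infinitely often.
Proof: define a full interval to be an interval during which each $\langle s, a \rangle$ is visited. During each full interval the largest error in the $\hat{Q}$ table is reduced by a factor of $\gamma$
Let $\hat{Q}_n$ be the table after $n$ updates, and $\Delta_n$ be the maximum error in $\hat{Q}_n$; that is
$\Delta_n = \max_{s, a} |\hat{Q}_n(s, a) - Q(s, a)|$
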
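Convergence (cont)
For any table entry $\hat{Q}_n(s, a)$ updated on iteration $n+1$, the error in the revised estimate $\hat{Q}_{n+1}(s, a)$ is
$|\hat{Q}_{n+1}(s, a) - Q(s, a)| = |(r + \gamma \max_{a'} \hat{Q}_n(s', a')) - (r + \gamma \max_{a'} Q(s', a'))|$
$= \gamma \, |\max_{a'} \hat{Q}_n(s', a') - \max_{a'} Q(s', a')|$
$\le \gamma \max_{a'} |\hat{Q}_n(s', a') - Q(s', a')|$
$\le \gamma \max_{s'', a'} |\hat{Q}_n(s'', a') - Q(s'', a')|$
$|\hat{Q}_{n+1}(s, a) - Q(s, a)| \le \gamma \Delta_n$
Note we used the general fact that
$|\max_{a} f_1(a) - \max_{a} f_2(a)| \le \max_{a} |f_1(a) - f_2(a)|$
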
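To close the argument, the per-update bound can be chained over full intervals. A sketch, assuming the rewards and the initial table are bounded so that $\Delta_0$ is finite, and writing $n_k$ (notation introduced here, not in the slides) for the number of updates completed by the end of the $k$-th full interval:

```latex
\[
\Delta_{n_k} \;\le\; \gamma\,\Delta_{n_{k-1}} \;\le\; \cdots \;\le\; \gamma^{k}\,\Delta_0
\quad\Longrightarrow\quad
\lim_{k \to \infty} \Delta_{n_k} = 0
\qquad (0 \le \gamma < 1)
\]
```

Every entry is updated at least once per full interval, and no update can raise an entry's error above the current maximum, so each full interval shrinks the maximum error by at least a factor of $\gamma$. Visiting every pair infinitely often yields infinitely many full intervals.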
Nondeterministic Case
What if reward and next state are non-deterministic?
We redefine $V$, $Q$ by taking expected values
$V^{\pi}(s) \equiv E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots] \equiv E\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\right]$
$Q(s, a) \equiv E[r(s, a) + \gamma V^*(\delta(s, a))]$

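Nondeterministic Case
Q learning generalizes to nondeterministic worlds
Alter training rule to
$\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n) \hat{Q}_{n-1}(s, a) + \alpha_n [r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a')]$
where
$\alpha_n = \frac{1}{1 + visits_n(s, a)}$
Can still prove convergence of $\hat{Q}$ to $Q$ [Watkins and Dayan, 1992]
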
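The altered rule blends the old estimate with the new sample instead of overwriting it. The sketch below is a minimal tabular version; the class name is hypothetical, and it assumes that the visit count includes the current visit, so the first update of a pair uses $\alpha = 1/2$.

```python
from collections import defaultdict

GAMMA = 0.9


class TabularQ:
    """Q-hat table with a decaying, per-entry learning rate."""

    def __init__(self):
        self.q = defaultdict(float)     # Q-hat(s, a), initialised to 0
        self.visits = defaultdict(int)  # number of updates of each (s, a)

    def update(self, s, a, r, s_next, next_actions):
        # Assumption: the count includes the current visit.
        self.visits[(s, a)] += 1
        alpha = 1.0 / (1 + self.visits[(s, a)])
        # An empty action list marks a terminal next state with value 0.
        best_next = max((self.q[(s_next, b)] for b in next_actions), default=0.0)
        sample = r + GAMMA * best_next
        # Move the old estimate part of the way towards the new sample.
        self.q[(s, a)] = (1 - alpha) * self.q[(s, a)] + alpha * sample
        return self.q[(s, a)]


table = TabularQ()
first = table.update("s", "a", 100, "end", [])   # noisy reward: 100 this time
second = table.update("s", "a", 0, "end", [])    # and 0 the next time
```

Because $\alpha_n$ shrinks as a pair is revisited, later samples move the estimate less, which damps the noise from random rewards and transitions.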
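Temporal Difference Learning
Q learning: reduce discrepancy between successive Q estimates
One step time difference:
$Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_{a} \hat{Q}(s_{t+1}, a)$
Why not two steps?
$Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_{a} \hat{Q}(s_{t+2}, a)$
Or $n$?
$Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a} \hat{Q}(s_{t+n}, a)$
Blend all of these:
$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda) \left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots \right]$
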
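The n-step estimates and their blend can be sketched for a recorded trajectory. Two simplifications here are assumptions of this sketch and not part of the slides: the infinite sum is truncated at the length of the trajectory, and the weights are renormalised so they still sum to 1. The function names and sample data are hypothetical.

```python
GAMMA = 0.9


def n_step_estimate(rewards, max_q, n, gamma=GAMMA):
    """Q^(n)(s_t, a_t): n observed rewards, then gamma^n * max_a Q-hat(s_{t+n}, a).

    rewards[i] is r_{t+i}; max_q[i] is max_a Q-hat(s_{t+i}, a) (index 0 is unused).
    """
    observed = sum(gamma ** i * rewards[i] for i in range(n))
    return observed + gamma ** n * max_q[n]


def lambda_blend(rewards, max_q, lam, gamma=GAMMA):
    """Truncated, renormalised version of (1 - lam) * sum_n lam^(n-1) * Q^(n)."""
    if not 0 <= lam < 1:
        raise ValueError("this truncated sketch needs 0 <= lam < 1")
    steps = range(1, len(rewards) + 1)
    weights = [(1 - lam) * lam ** (n - 1) for n in steps]
    estimates = [n_step_estimate(rewards, max_q, n, gamma) for n in steps]
    return sum(w * e for w, e in zip(weights, estimates)) / sum(weights)


# Hypothetical trajectory: two zero rewards, then 100 on reaching a terminal state.
rewards = [0, 0, 100]
max_q = [None, 90.0, 100.0, 0.0]   # max_a Q-hat at s_{t+1}, s_{t+2}, s_{t+3}
blended = lambda_blend(rewards, max_q, lam=0.5)
```

In this example the table is already consistent with the rewards along the path, so every n-step estimate agrees and the blend equals each of them. When the table is still inaccurate, larger n leans more on observed rewards and less on the current estimates.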
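Temporal Difference Learning
$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda) \left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots \right]$
Equivalent expression:
$Q^{\lambda}(s_t, a_t) = r_t + \gamma \left[ (1 - \lambda) \max_{a} \hat{Q}(s_{t+1}, a) + \lambda Q^{\lambda}(s_{t+1}, a_{t+1}) \right]$
TD($\lambda$) algorithm uses above training rule
  Sometimes converges faster than Q learning
  Converges for learning $V^*$ for any $0 \le \lambda \le 1$ (Dayan, 1992)
  Tesauro's TD-Gammon uses this algorithm
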
Subtleties and Ongoing Research
  Replace $\hat{Q}$ table with neural network or other generalizer
  Handle case where state is only partially observable
  Design optimal exploration strategies
  Extend to continuous action, state
  Learn and use $\hat{\delta} : S \times A \to S$, an approximation to $\delta$
  Relationship to dynamic programming

RL Summary
Reinforcement learning (RL)
  control learning
  delayed reward
  possible that the state is only partially observable
  possible that the relationship between states/actions is unknown
Temporal Difference Learning
  learn discrepancies between successive estimates
  used in TD-Gammon
$V(s)$ - state value function
  needs known reward/state transition functions

RL Summary
$Q(s, a)$ - state/action value function
  related to $V$
  does not need reward/state transition functions
  training rule
  related to dynamic programming
  measure actual reward received for action and future value using current Q function
  deterministic - replace existing estimate
  nondeterministic - move table estimate towards measured estimate
  convergence - can be shown in both cases