Reinforcement Learning and Policy Reuse
Manuela M. Veloso
PEL Fall 2016

Readings:
Reinforcement Learning: An Introduction, R. Sutton and A. Barto.
Probabilistic policy reuse in a reinforcement learning agent, Fernando Fernández and Manuela Veloso. In Proceedings of AAMAS 2006. (Thanks to Fernando Fernández.)

Learning
Learning from experience.
Supervised learning: labeled examples.
Reward/reinforcement: something good/bad (a positive/negative reward) happens.
An agent gets reward as part of the input percept, but it is programmed to understand it as reward.
Reinforcement has been extensively studied by animal psychologists.
Reinforcement Learning
The problem of getting an agent to act in the world so as to maximize its rewards.
Teaching a dog a new trick: you cannot tell it what to do, but you can reward/punish it if it does the right/wrong thing.
Learning: to figure out what it did that made it get the reward/punishment: the credit assignment problem.
RL: similar methods to train computers to do many tasks.

Reinforcement Learning Task
Assume the world is a Markov Decision Process.
States and actions known; transitions and rewards unknown.
Full observability.
Objective: Learn an action policy π : S → A that maximizes the expected reward E[r_t + γ r_{t+1} + γ² r_{t+2} + ...] from any starting state in S.
0 ≤ γ < 1 is the discount factor for future rewards.
Reinforcement Learning Problem
The agent sees the state, selects an action, and gets a reward.
Goal: Learn to choose actions that maximize r_0 + γ r_1 + γ² r_2 + ..., where 0 ≤ γ < 1.

Online Learning Approaches
Capabilities: execute actions in the world; observe the state of the world.
Two learning approaches: model-based and model-free.
Model-Based Reinforcement Learning
Approach: learn the MDP, then solve the MDP to determine the optimal policy.
Appropriate when the model is unknown, but small enough to solve feasibly.

Learning the MDP
Estimate the rewards and the transition distribution.
Try every action some number of times.
Keep counts (frequentist approach): R(s,a) = R_{s,a} / N_{s,a} and T(s,a,s') = N_{s,a,s'} / N_{s,a}.
Solve using value or policy iteration.

Iterative Learning and Action
Maintain statistics incrementally.
Solve the model periodically.
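A minimal sketch of the estimate-then-solve loop described above. The environment interface env.sample(s, a), the sample counts, and the parameter values are assumptions for illustration, not from the slides:

```python
from collections import defaultdict

def estimate_mdp(env, states, actions, n_samples=100):
    """Frequentist estimates: T(s,a,s') = N_{s,a,s'} / N_{s,a}, R(s,a) = sum of r / N_{s,a}."""
    N_sa = defaultdict(int)
    N_sas = defaultdict(int)
    R_sum = defaultdict(float)
    for s in states:
        for a in actions:
            for _ in range(n_samples):       # try every action some number of times
                s2, r = env.sample(s, a)     # assumed simulator interface
                N_sa[(s, a)] += 1
                N_sas[(s, a, s2)] += 1
                R_sum[(s, a)] += r
    T = {k: N_sas[k] / N_sa[(k[0], k[1])] for k in N_sas}
    R = {k: R_sum[k] / N_sa[k] for k in N_sa}
    return T, R

def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Solve the estimated MDP for V* by iterating the Bellman optimality update."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(R.get((s, a), 0.0) +
                    gamma * sum(T.get((s, a, s2), 0.0) * V[s2] for s2 in states)
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < eps:
            return V
```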
Model-Free Reinforcement Learning
Learn a policy mapping directly.
Appropriate when the model is too large to store, solve, or learn.
Does not need to try every state/action in order to get a good policy.
Converges to an optimal policy.

Value Function
For each possible policy π, define an evaluation function over states:
V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{i=0..∞} γ^i r_{t+i}
where r_t, r_{t+1}, ... are generated by following policy π starting at state s.
π* ≡ argmax_π V^π(s), ∀s
Learning task: learn the OPTIMAL policy.
Learn a Value Function
Learn the evaluation function V^{π*} (i.e., V*).
Select the optimal action from any state s, i.e., have an optimal policy, by using V* with one-step lookahead:
π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]
But the reward and transition functions are unknown.

Q Function
Define a new function very similar to V*:
Q(s,a) ≡ r(s,a) + γ V*(δ(s,a))
Learn the Q function: Q-learning.
If the agent learns Q, it can choose the optimal action even without knowing δ or r:
π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))] = argmax_a Q(s,a)
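To see why Q removes the need for the model, compare greedy action selection from V* (which requires r and δ) with greedy selection from Q. A minimal sketch; the dict/function representations are implementation assumptions:

```python
def greedy_from_v(s, actions, r, delta, V, gamma):
    """One-step lookahead on V*: requires the (unknown) models r and delta."""
    return max(actions, key=lambda a: r(s, a) + gamma * V[delta(s, a)])

def greedy_from_q(s, actions, Q):
    """Greedy on Q: needs only the learned table, no model of r or delta."""
    return max(actions, key=lambda a: Q[(s, a)])
```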
Q-Learning
Q and V*:
V*(s) = max_{a'} Q(s,a')
We can write Q recursively:
Q(s_t,a_t) = r(s_t,a_t) + γ V*(δ(s_t,a_t)) = r(s_t,a_t) + γ max_{a'} Q(s_{t+1},a')
Q-learning actively generates examples. It processes examples by updating its Q values. While learning, the Q values are approximations.

Training Rule to Learn Q (Deterministic Example)
Let Q̂ denote the current approximation to Q. Then Q-learning uses the following training rule:
Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')
where s' is the state resulting from applying action a in state s, and r is the reward that is returned.
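The training rule transcribes directly into code. In this sketch the tabular dict representation and the default Q value of 0 are implementation choices, not from the slides:

```python
def q_update(Q, s, a, r, s2, actions, gamma=0.9):
    """Deterministic Q-learning rule: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    Q[(s, a)] = r + gamma * max(Q.get((s2, a2), 0.0) for a2 in actions)
    return Q
```

Each observed transition (s, a, r, s') triggers one such update.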
Deterministic Case Example
Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')
Q̂(s₁, a_right) ← r + γ max_{a'} Q̂(s₂, a') = 0 + 0.9 max{63, 81, 100} = 90
Q-Learning Iteration
Start at the top left corner, with a fixed clockwise policy.
Initially Q(s,a) = 0; γ = 0.8.
Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')
Updates: Q(1,E), Q(2,E), Q(3,S), Q(4,W), ...
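The slide's actual grid and rewards do not survive in the text, so the 3×2 layout and the single non-zero reward below are stand-ins; the point of the sketch is how each clockwise pass propagates value one step further back around the loop:

```python
# Hypothetical 3x2 grid, states 1..6 visited clockwise (1,2,3 across the top,
# then down and back along the bottom). Layout and rewards are assumptions.
GAMMA = 0.8
ACTIONS = ['N', 'E', 'S', 'W']
policy = {1: 'E', 2: 'E', 3: 'S', 4: 'W', 5: 'W', 6: 'N'}  # fixed clockwise policy
next_state = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 1}
reward = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 100.0}  # placeholder reward

Q = {}
s = 1                                       # start at the top-left corner
for step in range(60):                      # repeated passes around the loop
    a = policy[s]
    s2 = next_state[s]
    # Each pass backs the value up one more state along the cycle.
    Q[(s, a)] = reward[s] + GAMMA * max(Q.get((s2, a2), 0.0) for a2 in ACTIONS)
    s = s2
print(Q)
```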
Nondeterministic Case
Q-learning in a nondeterministic world.
Redefine V and Q by taking expected values:
V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + ...] = E[Σ_{i=0..∞} γ^i r_{t+i}]
Q(s,a) ≡ E[r(s,a) + γ V*(δ(s,a))]

Nondeterministic Case
Q-learning training rule:
Q̂_n(s,a) ← (1 − α_n) Q̂_{n−1}(s,a) + α_n [r + γ max_{a'} Q̂_{n−1}(s',a')]
where α_n = 1 / (1 + visits_n(s,a)) and s' = δ(s,a).
Q̂ still converges to Q* (Watkins and Dayan, 1992).
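A sketch of this rule with the decaying learning rate; the tabular dicts are implementation assumptions as before:

```python
from collections import defaultdict

Q = defaultdict(float)
visits = defaultdict(int)

def q_update_stochastic(s, a, r, s2, actions, gamma=0.9):
    """Nondeterministic rule: blend the old estimate with the new sample,
    using alpha_n = 1 / (1 + visits_n(s, a))."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * sample
```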
Exploration vs. Exploitation
Tension between learning the optimal strategy and using what you know so far to maximize expected reward.
Convergence theorems depend on visiting each state a sufficient number of times.
Typically we use reinforcement learning while performing the task.

Exploration policies:
Wacky approach: act randomly, in hopes of eventually exploring the entire environment.
Greedy approach: act to maximize utility using the current estimate.
Balanced approach: act more wacky when the agent has little knowledge of the environment, and more greedy when the agent has acted in the environment longer.
One-armed bandit problems.
Exploration Strategies
ε-greedy: exploit with probability 1 − ε; choose among the remaining actions uniformly; adjust ε as learning continues.
Boltzmann: choose action a with probability
p(a|s) = e^{Q(s,a)/t} / Σ_{a'} e^{Q(s,a')/t}
where t cools over time (simulated annealing).
All methods are sensitive to parameter choices and changes. (A code sketch of both strategies follows the Policy Reuse overview below.)

Policy Reuse
Impact of a change of the reward function: we do not want to learn from scratch.
Transfer learning:
Learn macros of the MDP: options.
Value function transfer.
Exploration bias.
Reuse complete policies.
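A minimal sketch of the two exploration strategies described above; parameter values are placeholders:

```python
import math
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Exploit with probability 1 - eps; otherwise pick an action uniformly."""
    if random.random() < 1 - eps:
        return max(actions, key=lambda a: Q.get((s, a), 0.0))
    return random.choice(actions)

def boltzmann(Q, s, actions, t=1.0):
    """Sample a with probability e^{Q(s,a)/t} / sum_a' e^{Q(s,a')/t};
    t is cooled over time, as in simulated annealing."""
    weights = [math.exp(Q.get((s, a), 0.0) / t) for a in actions]
    return random.choices(actions, weights=weights)[0]
```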
Episodic
MDPs with absorbing goal states.
The transition probability from a goal state to the same goal state is 1 (and therefore to any other state it is 0).
Episode: start in a random state, end in an absorbing state.
Reward per episode (K episodes, H steps each).

Domains and Tasks
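One way to make "reward per episode" precise, following my reading of the Fernández and Veloso paper cited above (a reconstruction, not verbatim from the slides): the average gain over K episodes of at most H steps each is

W = (1/K) Σ_{k=0..K−1} Σ_{h=0..H} γ^h r_{k,h}

where r_{k,h} is the reward received at step h of episode k.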
Policy Library and Reuse

π-Reuse Exploration
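A sketch of the π-reuse exploration strategy as described in the cited AAMAS 2006 paper: with some probability follow the past policy, otherwise act ε-greedily on the new task's Q, decaying the reuse probability within the episode. The mixing parameter ψ, its per-step decay υ, and the environment interface are my reading of the paper rather than the slides; epsilon_greedy and q_update_stochastic are the helpers sketched earlier:

```python
import random

def pi_reuse_episode(env, Q, past_policy, actions, psi=1.0, upsilon=0.95,
                     eps=0.1, gamma=0.95, H=100):
    """One episode of pi-reuse exploration: with probability psi_h follow the
    past policy, otherwise act eps-greedily on the new Q; psi decays each step."""
    s = env.reset()                                  # assumed environment interface
    psi_h = psi
    gain, discount = 0.0, 1.0
    for h in range(H):
        if random.random() < psi_h:
            a = past_policy[s]                       # reuse the old policy's action
        else:
            a = epsilon_greedy(Q, s, actions, eps)   # exploit/explore the new task
        s2, r, done = env.step(a)                    # assumed to return (s', r, done)
        q_update_stochastic(s, a, r, s2, actions, gamma)
        gain += discount * r
        discount *= gamma
        psi_h *= upsilon                             # decay the reuse probability
        s = s2
        if done:
            break
    return gain                                      # contributes to the reuse gain W_i
```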
π-reue Policy Lerning Experimentl Reult 5
Results: Policy Reuse in Q-Learning
Interestingly, the π-reuse strategy also contributes a similarity metric between policies:
The gain W_i obtained while executing the π-reuse exploration strategy, reusing the past policy Π_i, is an estimation of how similar the policy Π_i is to the new one!
The set of W_i values for each of the policies in the library is unknown a priori, but it can be estimated on-line while the new policy is computed over the different episodes.
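As a hedged sketch of how those on-line estimates can drive policy selection, in the spirit of the paper's approach: pick which past policy to reuse by a softmax over the estimated gains W_i, and update each estimate incrementally. The temperature τ and the exact update form are my reading, not verbatim from the slides:

```python
import math
import random

def choose_past_policy(W, tau):
    """Select library policy j with probability proportional to exp(tau * W[j]):
    policies with higher estimated gain (more similar to the new task) are
    reused more often."""
    weights = [math.exp(tau * w) for w in W]
    return random.choices(range(len(W)), weights=weights)[0]

def update_gain_estimate(W, counts, j, episode_gain):
    """Incremental average of the gain observed when reusing policy j."""
    counts[j] += 1
    W[j] += (episode_gain - W[j]) / counts[j]
```

Increasing τ over episodes shifts selection from exploring across the library toward committing to the best-matching policy.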
Learning to Use a Policy Library
Similarity between policies can be learned.
Gains of using each policy.
Explore different policies.
Learn the domain structure: eigen-policies.
Summary
Reinforcement learning.
Q-learning.
Policy Reuse.
Next class: other reinforcement learning algorithms. (There are many.)