Outline. Reinforcement Learning. What is RL? Reinforcement learning is learning what to do so as to maximize a numerical reward signal


Leslie Ball
6 years ago

Reinforcement Learning
June 2005
CS 486/686, University of Waterloo

Outline
  Russell & Norvig, Sect. [...]
  What is reinforcement learning
  Temporal-difference learning
  Q-learning

Machine Learning
  Supervised learning: teacher tells learner what to remember
  Reinforcement learning: environment provides hints to learner
  Unsupervised learning: learner discovers on its own

What is RL?
  Reinforcement learning is learning what to do so as to maximize a numerical reward signal
  Learner is not told what actions to take, but must discover them by trying them out and seeing what the reward is

What is RL
  Reinforcement learning differs from supervised learning
  Supervised learning: "Don't touch. You will get burnt."
  Reinforcement learning: "Ouch!"

Animal Psychology
  Negative reinforcements: pain and hunger
  Positive reinforcements: pleasure and food
  Reinforcements used to train animals
  Let's do the same with computers!

RL Examples
  Game playing (backgammon, solitaire)
  Operations research (pricing, vehicle routing)
  Elevator scheduling
  Helicopter control

Reinforcement Learning
  Definition: Markov decision process with unknown transition and reward models
  Set of states S
  Set of actions A
    Actions may be stochastic
  Set of reinforcement signals (rewards)
    Rewards may be delayed

Policy optimization
  Markov decision process:
    Find optimal policy given transition and reward model
    Execute policy found
  Reinforcement learning:
    Learn an optimal policy while interacting with the environment

Reinforcement Learning Problem
  [Figure: agent-environment loop. The agent sends an action to the environment; the environment returns a state and a reward, giving the trajectory $s_0, a_0, r_0, s_1, a_1, r_1, s_2, \ldots$]
  Goal: learn to choose actions that maximize $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $0 \le \gamma < 1$

Example: Inverted Pendulum
  State: $x(t), x'(t), \theta(t), \theta'(t)$
  Action: force F
  Reward: 1 for any step where pole balanced
  Problem: find $\delta : S \to A$ that maximizes rewards

RL Characteristics
  Reinforcements: rewards
  Temporal credit assignment: when a reward is received, which action should be credited?
  Exploration/exploitation tradeoff: as agent learns, should it exploit its current knowledge to maximize rewards or explore to refine its knowledge?
  Lifelong learning: reinforcement learning
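The goal stated above, maximizing the discounted sum of rewards, can be made concrete with a small sketch. The function name `discounted_return` and the sample reward list are illustrative assumptions, not part of the slides; the sum is truncated to a finite reward sequence.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list.

    `discounted_return` is a hypothetical helper name chosen for this sketch.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    total = 0.0
    # Work backwards through the rewards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # Three steps with reward 1 each (as in the pole-balancing reward), gamma = 0.5.
    print(discounted_return([1, 1, 1], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

Because gamma is strictly below 1, rewards far in the future contribute little, which keeps the sum bounded even for very long runs.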

Types of RL
  Passive vs active learning
    Passive learning: the agent executes a fixed policy and tries to evaluate it
    Active learning: the agent updates its policy as it learns
  Model based vs model free
    Model-based: learn transition and reward model and use it to determine optimal policy
    Model free: derive optimal policy without learning the model

Passive Learning
  Transition and reward model known:
    Evaluate δ: $V^\delta(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \delta(s)) \, V^\delta(s')$
  Transition and reward model unknown:
    Estimate policy value as agent executes policy: $V^\delta(s) = E_\delta\left[\sum_t \gamma^t R(s_t)\right]$
    Model based vs model free

Passive learning
  $\gamma = 1$; $r_i = [...]$ for non-terminal states
  Do not know the transition probabilities
  [Figure: three sample trials through the grid world; two end in a +1 terminal state and one ends in a -1 terminal state]
  What is the value V(s) of being in state s?

Passive ADP
  Adaptive dynamic programming (ADP)
  Model-based
  Learn transition probabilities and rewards from observations
  Then update the values of the states

ADP Example
  [Figure: the same three sample trials, with two transition probabilities $P(s' \mid s, a)$ estimated from them as fractions of observed transitions]
  Use this information in $V^\delta(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \delta(s)) \, V^\delta(s')$
  We need to learn all the transition probabilities!

Passive TD
  Temporal difference (TD)
  Model free
  At each time step
    Observe: s, a, s', r
    Update $V^\delta(s)$ after each move:
      $V^\delta(s) = V^\delta(s) + \alpha \left( R(s) + \gamma V^\delta(s') - V^\delta(s) \right)$
  Here $\alpha$ is the learning rate and $R(s) + \gamma V^\delta(s') - V^\delta(s)$ is the temporal difference
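The passive TD update above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the course's implementation: the name `td0_evaluate`, the three-state chain in the usage example, the fixed learning rate, and the convention that the last state of each episode is terminal with value equal to its reward are all assumptions made for the sketch.

```python
from collections import defaultdict


def td0_evaluate(episodes, rewards, gamma=1.0, alpha=0.1):
    """Passive TD: estimate V for the fixed policy that generated the episodes.

    episodes: list of state sequences observed while following the policy.
    rewards:  dict mapping each state s to R(s).
    Assumption of this sketch: the last state of an episode is terminal,
    so its value is simply its reward.
    """
    V = defaultdict(float)  # unseen states start at 0
    for states in episodes:
        terminal = states[-1]
        V[terminal] = rewards[terminal]
        for s, s_next in zip(states, states[1:]):
            # Temporal difference: R(s) + gamma * V(s') - V(s)
            td_error = rewards[s] + gamma * V[s_next] - V[s]
            V[s] += alpha * td_error
    return dict(V)


if __name__ == "__main__":
    # Hypothetical chain A -> B -> T with reward 1 only in the terminal state T.
    rewards = {"A": 0.0, "B": 0.0, "T": 1.0}
    episodes = [["A", "B", "T"], ["A", "B", "T"]]
    print(td0_evaluate(episodes, rewards, gamma=1.0, alpha=0.5))
```

No transition probabilities appear anywhere in the code, which is the sense in which TD is model free: each observed move nudges V(s) toward the value suggested by its successor.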

TD Convergence
  Thm: if α is appropriately decreased with the number of times a state is visited, then $V^\delta(s)$ converges to the correct value
  α must satisfy:
    $\sum_t \alpha_t = \infty$
    $\sum_t (\alpha_t)^2 < \infty$
  Often $\alpha(s) = 1/n(s)$, where n(s) = # of times s is visited

Active Learning
  Ultimately, we are interested in improving δ
  Transition and reward model known:
    $V^*(s) = \max_a \left[ R(s) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \right]$
  Transition and reward model unknown:
    Improve policy as agent executes policy
    Model based vs model free

Q-learning (aka active temporal difference)
  Q-function: $Q : S \times A \to \mathbb{R}$
    Value of state-action pair
    Policy $\delta(s) = \arg\max_a Q^*(s, a)$ is the optimal policy
  Bellman's equation:
    $Q^*(s, a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')$

Q-learning
  For each state s and action a, initialize Q(s,a) (0 or random)
  Observe current state s
  Loop
    Select action a and execute it
    Receive immediate reward r
    Observe new state s'
    Update Q(s,a):
      $Q(s, a) = Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$
    s = s'

Q-learning example
  [Figure: states $s_1$ and $s_2$; the current estimate is $Q(s_1, \text{right}) = 73$ and the action values in $s_2$ are 66, 81 and 100]
  r = 0 for non-terminal states, γ = 0.9, α = 0.5
  $Q(s_1, \text{right}) = Q(s_1, \text{right}) + \alpha \left( r + \gamma \max_{a'} Q(s_2, a') - Q(s_1, \text{right}) \right)$
  $= 73 + 0.5 \, (0 + 0.9 \cdot \max[66, 81, 100] - 73)$
  $= 73 + 0.5 \, (17)$
  $= 81.5$
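The Q-learning update rule can be checked numerically with a short sketch. It uses the numbers of the worked example as reconstructed here (old estimate 73 for moving right from the first state, successor action values 66, 81 and 100, r = 0, γ = 0.9, α = 0.5). The function name `q_update`, the nested-dict table layout, and the action labels attached to the successor state are assumptions for illustration only.

```python
def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one tabular Q-learning step and return the new Q(s, a).

    Q is a dict of dicts: Q[state][action] -> value (a layout assumed here).
    A successor with no recorded actions is treated as having value 0.
    """
    next_values = Q.get(s_next, {})
    best_next = max(next_values.values()) if next_values else 0.0
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]


if __name__ == "__main__":
    # Action labels in s2 are hypothetical; only the three values matter.
    Q = {
        "s1": {"right": 73.0},
        "s2": {"up": 66.0, "left": 81.0, "right": 100.0},
    }
    new_value = q_update(Q, "s1", "right", r=0.0, s_next="s2", alpha=0.5, gamma=0.9)
    print(new_value)  # 73 + 0.5 * (0 + 0.9 * 100 - 73) = 81.5
```

With α = 0.5 the new estimate lands halfway between the old value 73 and the one-step target 90; a smaller α would move it less per observation.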

Exploration vs Exploitation
  If an agent always chooses the action with the highest value then it is exploiting
    The learned model is not the real model
    Leads to suboptimal results
  By taking random actions (pure exploration) an agent may learn the model
    But what is the use of learning a complete model if parts of it are never used?
  Need a balance between exploitation and exploration

Common exploration methods
  ε-greedy:
    With probability ε execute random action
    Otherwise execute best action a*, where $a^* = \arg\max_a Q(s, a)$
  Boltzmann exploration:
    $P(a) = \dfrac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}}$

Exploration and Q-learning
  Q-learning converges to optimal Q-values if
    Every state is visited infinitely often (due to exploration)
    The action selection becomes greedy as time approaches infinity
    The learning rate α is decreased fast enough but not too fast

A Triumph for Reinforcement Learning: TD-Gammon
  Backgammon player: TD learning with a neural network representation of the value function

Next Class
  Machine learning
  Decision trees
  Russell and Norvig: chapter 18
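The two exploration methods named in these slides, ε-greedy and Boltzmann exploration, can be sketched as follows. The function names, the dict-of-action-values input, and the optional `rng` argument are assumptions of this sketch. Subtracting the largest Q-value before exponentiating is a numerical-stability step added here; it leaves the probabilities unchanged and is not something the slides mention.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best one.

    q_values: dict mapping action -> Q(s, a) for the current state s.
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)          # explore
    return max(actions, key=q_values.get)   # exploit: argmax_a Q(s, a)


def boltzmann_probabilities(q_values, temperature):
    """Return P(a) = exp(Q(s,a)/T) / sum_a' exp(Q(s,a')/T) for each action."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Shift by the maximum so exp() cannot overflow; the ratio is unaffected.
    q_max = max(q_values.values())
    weights = {a: math.exp((q - q_max) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


if __name__ == "__main__":
    q = {"left": 1.0, "right": 2.0}
    print(epsilon_greedy(q, epsilon=0.1))
    print(boltzmann_probabilities(q, temperature=1.0))
```

A high temperature T makes the Boltzmann probabilities nearly uniform (more exploration), while a low T concentrates them on the highest-valued action; shrinking ε or T over time is one way to make action selection greedy in the limit.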


Value Prediction with FA. Chapter 8: Generalization and Function Approximation. Adapt Supervised Learning Algorithms. Backups as Training Examples [ ]

Chapter 8: Generalization and Function Approximation
Objectives of this chapter:
  Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part.
  Overview of function