Making Complex Decisions: Markov Decision Processes
Vasant Honavar
Bioinformatics and Computational Biology Program
Center for Computational Intelligence, Learning, & Discovery
honavar@cs.iastate.edu
www.cs.iastate.edu/~honavar/
www.cild.iastate.edu/
www.bcb.iastate.edu/
www.igert.iastate.edu
Vasant Honavar, 2006.

Making Complex Decisions: Markov Decision Problem
- How to use knowledge about the world to make decisions when there is uncertainty about the consequences of actions
- Rewards are delayed

The Solution
Sequential decision problems in uncertain environments can be solved by calculating a policy that associates an optimal decision with every environmental state: a Markov Decision Process (MDP).
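
Example: The world
[Figure: grid world with columns 1 to 4 and rows 1 to 3, a start square, a +1 terminal square, and a -1 terminal square.]
Actions have uncertain consequences.
[Figure: action outcome probabilities 0.8, 0.1, and 0.1.]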

Cumulative Discounted Reward
Suppose rewards are bounded by $M$. The cumulative discounted reward is then bounded by
$M + \gamma M + \gamma^2 M + \cdots + \gamma^n M = M \frac{1 - \gamma^{n+1}}{1 - \gamma} \le \frac{M}{1 - \gamma}$
Note: for the geometric series to converge, $0 \le \gamma < 1$.

Utility of a State Sequence
- Additive rewards: $U_h([s_0, s_1, s_2, \ldots]) = R(s_0) + R(s_1) + R(s_2) + \cdots$
- Discounted rewards: $U_h([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots$
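
The bound on the cumulative discounted reward follows from the standard geometric-series argument. A short derivation, writing $S_n$ (a symbol introduced here only for illustration) for the partial sum:

```latex
\begin{aligned}
S_n &= M + \gamma M + \gamma^2 M + \cdots + \gamma^n M \\
\gamma S_n &= \gamma M + \gamma^2 M + \cdots + \gamma^{n+1} M \\
(1 - \gamma)\, S_n &= M - \gamma^{n+1} M \\
S_n &= M\,\frac{1 - \gamma^{n+1}}{1 - \gamma}
      \;\xrightarrow[n \to \infty]{}\; \frac{M}{1 - \gamma}
      \qquad (0 \le \gamma < 1)
\end{aligned}
```

For example, with $M = 1$ and $\gamma = 0.9$ no reward sequence can accumulate more than 10.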

Utility of a State
- The utility of each state is the expected sum of discounted rewards if the agent executes the policy $\pi$:
  $U^{\pi}(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi, s_0 = s \right]$
- The true utility of a state corresponds to the optimal policy $\pi^*$.

Calculating the Optimal Policy
- Value iteration
- Policy iteration

Value Iteration
- Calculate the utility of each state
- Then use the state utilities to select an optimal action in each state:
  $\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s') \, U(s')$

Value Iteration Algorithm
function value-iteration(mdp) returns a utility function
  local variables: U, U', initially identical to R
  repeat
    U ← U'
    for each state s do
      U'(s) ← R(s) + γ max_a Σ_s' T(s, a, s') U(s')     (Bellman update)
    end
  until close-enough(U, U')
  return U
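
The value-iteration pseudocode above can be sketched in Python as follows. This is a minimal sketch, not the slides' own implementation: the three-state MDP (`s0`, `s1`, `goal`), its rewards, and the dictionary layout of `T` are assumptions made up for illustration, and states with no actions are treated as terminal.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """T[s][a] is a list of (probability, next_state); R[s] is the reward in s."""
    U = {s: R[s] for s in states}          # U initially identical to R
    while True:
        U_new = {}
        for s in states:
            if not actions[s]:             # assumption: no actions means terminal
                U_new[s] = R[s]
                continue
            # Bellman update: R(s) + gamma * max_a sum_s' T(s,a,s') U(s')
            best = max(sum(p * U[s2] for p, s2 in T[s][a]) for a in actions[s])
            U_new[s] = R[s] + gamma * best
        delta = max(abs(U_new[s] - U[s]) for s in states)
        U = U_new
        if delta < tol:                    # close-enough(U, U')
            return U

def greedy_policy(states, actions, T, U):
    """Select in each non-terminal state the action with the best expected utility."""
    return {s: max(actions[s], key=lambda a: sum(p * U[s2] for p, s2 in T[s][a]))
            for s in states if actions[s]}

# Hypothetical MDP used only to exercise the functions.
states = ["s0", "s1", "goal"]
actions = {"s0": ["stay", "go"], "s1": ["stay", "go"], "goal": []}
T = {
    "s0": {"stay": [(1.0, "s0")], "go": [(0.8, "s1"), (0.2, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(0.8, "goal"), (0.2, "s1")]},
}
R = {"s0": -0.1, "s1": -0.1, "goal": 1.0}

U = value_iteration(states, actions, T, R)
policy = greedy_policy(states, actions, T, U)
```

In this toy problem the greedy policy moves toward the goal from both non-terminal states, because staying only accumulates the step penalty.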

Value Iteration Algorithm: Example
The utilities of the states obtained after value iteration (rows 3 to 1, columns 1 to 4):

      1      2      3      4
3   0.812  0.868  0.912   +1
2   0.762    --   0.660   -1
1   0.705  0.655  0.611  0.388

Policy Iteration
- Pick a policy, then calculate the utility of each state given that policy (value determination step)
- Update the policy at each state using the utilities of the successor states
- Repeat until the policy stabilizes

Policy Iteration Algorithm
function policy-iteration(mdp) returns a policy
  local variables: U, a utility function; π, a policy
  repeat
    U ← value-determination(π, U, mdp, R)
    unchanged? ← true
    for each state s do
      if max_a Σ_s' T(s, a, s') U(s') > Σ_s' T(s, π(s), s') U(s') then
        π(s) ← argmax_a Σ_s' T(s, a, s') U(s')
        unchanged? ← false
    end
  until unchanged?
  return π
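
The policy-iteration loop above can be sketched in Python as follows. This is a hedged sketch under stated assumptions: value determination is approximated here by a fixed number of sweeps rather than an exact linear solve, and the three-state MDP, the `sweeps` parameter, and the tie-breaking tolerance are all invented for illustration.

```python
def evaluate_policy(states, T, R, policy, gamma, sweeps=200):
    """Approximate value determination: repeated sweeps of the fixed-policy update."""
    U = {s: 0.0 for s in states}
    for _ in range(sweeps):
        # States absent from the policy are terminal: their utility is just R(s).
        U = {s: R[s] + (gamma * sum(p * U[s2] for p, s2 in T[s][policy[s]])
                        if s in policy else 0.0)
             for s in states}
    return U

def policy_iteration(states, actions, T, R, gamma=0.9):
    policy = {s: actions[s][0] for s in states if actions[s]}  # arbitrary start
    while True:
        U = evaluate_policy(states, T, R, policy, gamma)
        unchanged = True
        for s in policy:
            def q(a):
                return sum(p * U[s2] for p, s2 in T[s][a])
            best = max(actions[s], key=q)
            if q(best) > q(policy[s]) + 1e-12:   # strict improvement only
                policy[s] = best
                unchanged = False
        if unchanged:
            return policy, U

# Hypothetical MDP used only to exercise the functions.
states = ["s0", "s1", "goal"]
actions = {"s0": ["stay", "go"], "s1": ["stay", "go"], "goal": []}
T = {
    "s0": {"stay": [(1.0, "s0")], "go": [(0.8, "s1"), (0.2, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(0.8, "goal"), (0.2, "s1")]},
}
R = {"s0": -0.1, "s1": -0.1, "goal": 1.0}

policy, U = policy_iteration(states, actions, T, R)
```

Starting from the all-`stay` policy, the improvement step first switches `s1`, then `s0`, and the loop stops once a full pass leaves the policy unchanged.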

Value Determination
- Simplification of the value iteration algorithm because the policy is fixed
- Linear equations because the max() operator has been removed
- Solve exactly for the utilities using standard linear algebra

Optimal Policy (policy iteration with linear equations)
[Figure: grid world with columns 1 to 4 and rows 1 to 3, showing the +1 and -1 terminal squares.]
$u(1,1) = 0.8\,u(1,2) + 0.1\,u(1,1) + 0.1\,u(2,1)$
$u(1,2) = 0.8\,u(1,3) + 0.2\,u(1,2)$

Partially Observable MDP (POMDP)
In an inaccessible environment, the percept does not provide enough information to determine the state or the transition probability.
POMDP:
- State transition function: $P(s_{t+1} \mid s_t, a_t)$
- Observation function: $P(o_t \mid s_t, a_t)$
- Reward function: $E(r_t \mid s_t, a_t)$
Approach: calculate a probability distribution over the possible states given all previous percepts, and base decisions on this distribution.
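
The POMDP approach of maintaining a probability distribution over states can be sketched as a Bayes-filter belief update. This is a minimal sketch under assumptions not stated in the slides: the observation model is indexed by the state reached after the action, and the two-state "listen" example with its 0.85 accuracy is hypothetical.

```python
def belief_update(belief, action, observation, T, O):
    """Return the new belief after taking `action` and seeing `observation`.

    belief[s]        : current probability of state s
    T[s][action][s2] : probability of moving from s to s2
    O[s2][action][o] : probability of observing o in the resulting state s2 (assumed form)
    """
    unnormalized = {}
    for s2 in belief:
        # Prediction step: probability of landing in s2 before seeing the percept.
        predicted = sum(belief[s] * T[s][action].get(s2, 0.0) for s in belief)
        # Correction step: weight by how well s2 explains the percept.
        unnormalized[s2] = O[s2][action].get(observation, 0.0) * predicted
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s2: p / total for s2, p in unnormalized.items()}

# Hypothetical two-state example: listening does not change the state,
# and the percept is correct with probability 0.85.
T = {"left": {"listen": {"left": 1.0}}, "right": {"listen": {"right": 1.0}}}
O = {
    "left": {"listen": {"hear_left": 0.85, "hear_right": 0.15}},
    "right": {"listen": {"hear_left": 0.15, "hear_right": 0.85}},
}
b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "hear_left", T, O)
b2 = belief_update(b1, "listen", "hear_left", T, O)
```

Each consistent percept sharpens the belief, so decisions can be based on `b1` or `b2` rather than on a single unobserved state.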

Learning from Interaction with the World
- An agent receives sensations or percepts from the environment through its sensors, acts on the environment through its effectors, and occasionally receives rewards or punishments from the environment
- The goal of the agent is to maximize its reward (pleasure) or minimize its punishment (or pain) as it stumbles along in an a priori unknown, uncertain environment

Supervised Learning
Experience = labeled examples
Inputs -> Supervised Learning System -> Outputs
Objective: minimize the error between desired and actual outputs

Reinforcement Learning
Experience = action-induced state transitions and rewards
Inputs -> Reinforcement Learning System -> Outputs = actions
Objective: maximize reward

Reinforcement learning
- Learner is not told which actions to take
- Rewards and punishments may be delayed
- Sacrifice short-term gains for greater long-term gains
- The need to trade off between exploration and exploitation
- Environment may not be observable, or only partially observable
- Environment may be deterministic or stochastic

Reinforcement learning
[Figure: agent-environment loop. The environment sends the state and reward to the agent; the agent sends an action to the environment.]

Key elements of an RL System
- Policy: what to do
- Reward: what is good
- Value: what is good because it predicts reward
- Model (of the environment): what follows what

An Extended Example: Tic-Tac-Toe
[Figure: game tree of tic-tac-toe positions, with levels alternating between X's moves and O's moves.]
Assume an imperfect opponent: he/she sometimes makes mistakes.

A Simple RL Approach to Tic-Tac-Toe
Make a table with one entry per state:
- State: a board position
- V(s): estimated probability of winning
- Entries shown in the figure: 0.5 (initial estimate), 1 (win), 0 (loss), 0.5 (draw)
Now play lots of games. To pick our moves, look ahead one step from the current state to the possible next states:
- Pick the next state with the highest estimated probability of winning, the largest V(s): a greedy move
- Occasionally pick a move at random: an exploratory move

RL Learning Rule for Tic-Tac-Toe
[Figure: sequence of positions from the starting position, alternating the opponent's moves and our moves; one of our moves is marked as an exploratory move.]
- $s$: the state before our greedy move
- $s'$: the state after our greedy move
We increment each $V(s)$ toward $V(s')$, a backup:
$V(s) \leftarrow V(s) + \alpha \left[ V(s') - V(s) \right]$

Why is Tic-Tac-Toe Too Easy?
- Number of states is small and finite
- One-step look-ahead is always possible
- State is completely observable

Some Notable RL Applications
- TD-Gammon: world's best backgammon program (Tesauro)
- Elevator control (Crites & Barto)
- Inventory management: 10-15% improvement over industry standard methods (Van Roy, Bertsekas, Lee and Tsitsiklis)
- Dynamic channel assignment: high-performance assignment of radio channels to mobile telephone calls (Singh and Bertsekas)

The n-armed Bandit Problem
- Choose repeatedly from one of n actions; each choice is called a play
- After each play $a_t$, you get a reward $r_t$, where $E\{r_t \mid a_t\} = Q^*(a_t)$
- The distribution of $r_t$ depends only on $a_t$
- Objective is to maximize the reward in the long term, e.g., over 1000 plays

The Exploration/Exploitation Dilemma
- Suppose you form action-value estimates $Q_t(a) \approx Q^*(a)$
- The greedy action at $t$ is $a_t^* = \arg\max_a Q_t(a)$
- $a_t = a_t^*$: exploitation; $a_t \ne a_t^*$: exploration
- You can't exploit all the time; you can't explore all the time
- You can never stop exploring, but you could reduce exploring

Action-Value Methods
- Stateless
- Adapt action-value estimates and nothing else
- Suppose that by the $t$-th play, action $a$ had been chosen $k_a$ times, producing rewards $r_1, r_2, \ldots, r_{k_a}$; then
  $Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}$
  $\lim_{k_a \to \infty} Q_t(a) = Q^*(a)$
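
ε-greedy Action Selection
- Greedy: $a_t = a_t^* = \arg\max_a Q_t(a)$
- ε-greedy: $a_t = a_t^*$ with probability $1 - \varepsilon$; a random action with probability $\varepsilon$
- Boltzmann: $\Pr(\text{choosing action } a \text{ at time } t) = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}$, where $\tau$ is the computational temperature

Incremental Implementation
Recall the sample-average estimation method. The average of the first $k$ rewards is
$Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}$
The incremental update rule does not require storing past rewards:
$Q_{k+1} = Q_k + \frac{1}{k+1} \left[ r_{k+1} - Q_k \right]$

Tracking a Nonstationary Environment
Choosing $Q_k$ to be a sample average is appropriate in a stationary environment, in which the dependence of rewards on actions is time invariant, i.e., when none of the $Q^*(a)$ change over time.
In a nonstationary environment, it is better to use an exponential, recency-weighted average:
$Q_{k+1} = Q_k + \alpha \left[ r_{k+1} - Q_k \right]$ for constant $\alpha$, $0 < \alpha \le 1$
$\phantom{Q_{k+1}} = (1 - \alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1 - \alpha)^{k-i} r_i$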
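
The action-selection rules and incremental updates above can be sketched in Python as follows. This is an illustrative sketch: the Gaussian reward model, the arm means, the seed, and all function names are assumptions, not part of the slides. Passing `alpha=None` uses the sample-average step size, while a constant `alpha` gives the recency-weighted average.

```python
import math
import random

def epsilon_greedy(Q, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

def boltzmann_probabilities(Q, tau):
    """Softmax action probabilities with computational temperature tau."""
    weights = [math.exp(q / tau) for q in Q]
    total = sum(weights)
    return [w / total for w in weights]

def incremental_average(rewards):
    """Sample average computed without storing past rewards."""
    q = 0.0
    for k, r in enumerate(rewards):
        q += (r - q) / (k + 1)
    return q

def run_bandit(true_means, plays=1000, epsilon=0.1, alpha=None, seed=0):
    """Play an n-armed bandit whose rewards are Gaussian around true_means (assumed)."""
    rng = random.Random(seed)
    Q = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(plays):
        a = epsilon_greedy(Q, epsilon, rng)
        r = rng.gauss(true_means[a], 1.0)
        counts[a] += 1
        step = alpha if alpha is not None else 1.0 / counts[a]
        Q[a] += step * (r - Q[a])      # Q <- Q + step * [r - Q]
    return Q, counts

Q, counts = run_bandit([0.0, 0.5, 2.0], plays=500, seed=1)
```

The only difference between the stationary and nonstationary variants is the step size, which is why a single update line covers both.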
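
Reinforcement learning when the agent can sense and respond to environmental states
[Figure: agent-environment loop. The agent receives state $s_t$ and reward $r_t$ and emits action $a_t$; the environment returns $r_{t+1}$ and $s_{t+1}$.]
Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$
- Agent observes state at step $t$: $s_t \in S$
- produces action at step $t$: $a_t \in A(s_t)$
- gets resulting reward: $r_{t+1} \in \mathbb{R}$
- and resulting next state: $s_{t+1}$
[Figure: trajectory $\ldots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$]

The Agent Learns a Policy
- Policy at step $t$, $\pi_t$: a mapping from states to action probabilities
- $\pi_t(s, a)$ = probability that $a_t = a$ when $s_t = s$
- Reinforcement learning methods specify how the agent changes its policy as a result of experience
- Roughly, the agent's goal is to get as much reward as it can over the long run

Agent-Environment Interface: Goals and Rewards
- Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible
- A goal should specify what we want to achieve, not how we want to achieve it
- A goal is typically outside the agent's direct control
- The agent must be able to measure success: explicitly, and frequently during its lifespan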

Rewards
Suppose the sequence of rewards after step $t$ is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$ What do we want to maximize?
In general, we want to maximize the expected return, $E\{R_t\}$, for each step $t$.
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$,
where $T$ is a final time step at which a terminal state is reached, ending an episode.

Rewards for Continuing Tasks
Continuing tasks: interaction does not have natural episodes.
Discounted return:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,
where $\gamma$, $0 \le \gamma \le 1$, is the discount rate.
shortsighted: $\gamma \to 0$; farsighted: $\gamma \to 1$

Example: Pole Balancing Task
Avoid failure: the pole falling beyond a critical angle or the cart hitting the end of the track.
- As an episodic task where the episode ends upon failure: reward = +1 for each step before failure, so return = number of steps before failure
- As a continuing task with discounted return: reward = -1 upon failure and 0 otherwise, so return = $-\gamma^k$ for $k$ steps before failure
In either case, return is maximized by avoiding failure for as long as possible.
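
The two pole-balancing formulations can be checked numerically. A minimal sketch, assuming the reward sequence is given as a finite list `[r_{t+1}, r_{t+2}, ...]`; the function name and episode lengths are illustrative:

```python
def discounted_return(rewards, gamma):
    """Compute r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ... by backward recursion."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g          # R_t = r_{t+1} + gamma * R_{t+1}
    return g

# Episodic formulation: +1 per step, no discounting, failure after 5 steps.
episodic = discounted_return([1.0] * 5, gamma=1.0)

# Continuing formulation: 0 on every step, then -1 upon failure after k = 3 steps.
continuing = discounted_return([0.0] * 3 + [-1.0], gamma=0.9)
```

Surviving longer raises the episodic return and pushes the continuing return closer to zero, so both formulations reward the same behavior.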

Example: Driving Task
Get to the top of the hill as quickly as possible.
reward = -1 for each step when not at the top of the hill, so return = -(number of steps before reaching the top of the hill)
Return is maximized by minimizing the number of steps taken to reach the top of the hill.

The Markov Property
- By the state at step $t$, we mean whatever information is available to the agent at step $t$ about its environment
- The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations
- Ideally, a state should summarize past sensations so as to retain all essential information, i.e., it should have the Markov Property:
  $\Pr\{ s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0 \} = \Pr\{ s_{t+1} = s', r_{t+1} = r \mid s_t, a_t \}$
  for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$.

Markov Decision Processes
- If a reinforcement learning task has the Markov Property, it is called a Markov Decision Process (MDP)
- If the state and action sets are finite, it is a finite MDP
- To define a finite MDP, you need to specify:
  - state and action sets
  - one-step dynamics defined by transition probabilities:
    $P^{a}_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}$ for all $s, s' \in S$, $a \in A(s)$
  - expected rewards:
    $R^{a}_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}$ for all $s, s' \in S$, $a \in A(s)$
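
Recycling Robot: A Finite MDP Example
- At each step, the robot has to decide whether it should (a) actively search for a can, (b) wait for someone to bring it a can, or (c) go to home base and recharge
- Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad)
- Decisions are made on the basis of the current energy level: high, low
- Reward = number of cans collected

Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy.
State-value function for policy $\pi$:
$V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \right\}$
The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$.
Action-value function for policy $\pi$:
$Q^{\pi}(s, a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \} = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a \right\}$

Bellman Equation for a Policy π
The basic idea:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots$
$\phantom{R_t} = r_{t+1} + \gamma \left( r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots \right)$
$\phantom{R_t} = r_{t+1} + \gamma R_{t+1}$
So:
$V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} = E_{\pi}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \}$
Or, without the expectation operator:
$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]$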
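
The Bellman equation for a policy can be turned into an iterative evaluation procedure by applying its right-hand side repeatedly until the values stop changing. A minimal sketch, assuming nested dictionaries for $\pi(s,a)$, $P^{a}_{ss'}$ and $R^{a}_{ss'}$; the one-state example and the tolerance are hypothetical:

```python
def policy_evaluation(states, pi, P, R, gamma, tol=1e-10):
    """Iterate V(s) <- sum_a pi(s,a) sum_s' P[s][a][s'] (R[s][a][s'] + gamma V(s')).

    pi[s][a]    : probability of taking action a in state s
    P[s][a][s2] : transition probability
    R[s][a][s2] : expected one-step reward
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                   for s2 in P[s][a])
                    for a in pi[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Hypothetical one-state MDP: action "a" pays 1, action "b" pays 0, both loop back.
states = ["s"]
pi = {"s": {"a": 0.5, "b": 0.5}}
P = {"s": {"a": {"s": 1.0}, "b": {"s": 1.0}}}
R = {"s": {"a": {"s": 1.0}, "b": {"s": 0.0}}}

V = policy_evaluation(states, pi, P, R, gamma=0.5)
```

Here the Bellman equation reads $V = 0.5 + 0.5V$, so the fixed point is $V(s) = 1$, which is what the iteration approaches.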
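
Optimal Value Functions
- For finite MDPs, policies can be partially ordered: $\pi \ge \pi'$ if and only if $V^{\pi}(s) \ge V^{\pi'}(s)$ for all $s \in S$
- There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all $\pi^*$
- Optimal policies share the same optimal state-value function:
  $V^*(s) = \max_{\pi} V^{\pi}(s)$ for all $s \in S$
- Optimal policies also share the same optimal action-value function:
  $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$ for all $s \in S$ and $a \in A(s)$
  This is the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy.

Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:
$V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s, a)$
$\phantom{V^*(s)} = \max_{a \in A(s)} E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}$
$\phantom{V^*(s)} = \max_{a \in A(s)} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^*(s') \right]$
[Figure: backup diagram (a): from state $s$, a max over actions $a$, then reward $r$ and successor state $s'$.]
$V^*$ is the unique solution of this system of nonlinear equations.

Bellman Optimality Equation for Q*
$Q^*(s, a) = E\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \}$
$\phantom{Q^*(s, a)} = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]$
[Figure: backup diagram (b): from the pair $(s, a)$, reward $r$ and successor state $s'$, then a max over next actions $a'$.]
$Q^*$ is the unique solution of this system of nonlinear equations.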

Why Optimal State-Value Functions are Useful
- Any policy that is greedy with respect to $V^*$ is an optimal policy
- Therefore, given $V^*$, one-step-ahead search produces the long-term optimal actions

What About Optimal Action-Value Functions?
Given $Q^*$, the agent does not even have to do a one-step-ahead search:
$\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)$

Solving the Bellman Optimality Equation
- Finding an optimal policy by solving the Bellman Optimality Equation requires:
  - accurate knowledge of environment dynamics
  - enough space and time to do the computation
  - the Markov Property
- How much space and time do we need? Polynomial in the number of states (via dynamic programming methods), BUT the number of states is often huge
- We usually have to settle for approximations
- Many RL methods can be understood as approximately solving the Bellman Optimality Equation

Efficiency of DP
- Finding an optimal policy is polynomial in the number of states
- BUT the number of states often grows exponentially with the number of state variables
- In practice, classical DP can be applied to problems with a few million states
- Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation
- It is surprisingly easy to come up with MDPs for which DP methods are not practical

Reinforcement learning
[Figure: agent-environment loop. The environment sends the state and reward to the agent; the agent sends an action to the environment.]

Markov Decision Processes
Assume
- a finite set of states $S$
- a set of actions $A$
- at each discrete time step the agent observes state $s_t \in S$ and chooses action $a_t \in A$
- it then receives immediate reward $r_t$, and the state changes to $s_{t+1}$
- Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$
  - i.e., $r_t$ and $s_{t+1}$ depend only on the current state and action
  - the functions $\delta$ and $r$ may be nondeterministic
  - the functions $\delta$ and $r$ are not necessarily known to the agent
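
Agent's Learning Task
Execute actions in the environment, observe the results, and
- learn an action policy $\pi : S \to A$ that maximizes $E[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots ]$ from any starting state in $S$
- here $0 \le \gamma < 1$ is the discount factor for future rewards
Note something new:
- The target function is $\pi : S \to A$
- but we have no training examples of the form $\langle s, a \rangle$
- training examples are of the form $\langle \langle s, a \rangle, r \rangle$

Reinforcement Learning Problem
Goal: learn to choose actions that maximize $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $0 \le \gamma < 1$

Learning an Action-Value Function
Estimate $Q^{\pi}$ for the current behavior policy $\pi$.
[Figure: trajectory $s_t, a_t \to r_{t+1}, s_{t+1}, a_{t+1} \to r_{t+2}, s_{t+2}, a_{t+2} \to \cdots$]
After every transition from a nonterminal state $s_t$, do:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$
If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$.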
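
The on-policy update above takes only a few lines of code. A minimal sketch, assuming a dictionary-backed table keyed by (state, action) pairs; the state and action names and the `terminal` flag are illustrative choices, not part of the slides:

```python
from collections import defaultdict

def on_policy_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """Apply Q(s,a) <- Q(s,a) + alpha [r + gamma Q(s',a') - Q(s,a)].

    When s' is terminal, Q(s',a') is taken to be 0, so the target is just r.
    """
    target = r if terminal else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)              # unseen pairs start at 0
Q[("s1", "right")] = 2.0

# Nonterminal transition: target = 1 + 0.9 * 2.0 = 2.8, so Q moves halfway to 2.8.
v1 = on_policy_update(Q, "s0", "right", 1.0, "s1", "right", alpha=0.5, gamma=0.9)

# Terminal transition: target = 1, so Q(s1, right) moves halfway from 2.0 to 1.0.
v2 = on_policy_update(Q, "s1", "right", 1.0, None, None, alpha=0.5, gamma=0.9,
                      terminal=True)
```

Because the target uses the action actually chosen in the next state, the estimate tracks the behavior policy rather than the greedy one.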
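
Value Function
To begin, consider deterministic worlds...
For each possible policy $\pi$ the agent might adopt, we can define an evaluation function over states
$V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$
where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$ starting at state $s$.
Restated, the task is to learn the optimal policy $\pi^*$:
$\pi^* \equiv \arg\max_{\pi} V^{\pi}(s), \ (\forall s)$

What to Learn
We might try to have the agent learn the evaluation function $V^{\pi^*}$ (which we write as $V^*$).
It could then do a look-ahead search to choose the best action from any state $s$ because
$\pi^*(s) = \arg\max_a \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$
A problem:
- This works well if the agent knows $\delta : S \times A \to S$ and $r : S \times A \to \mathbb{R}$
- But when it doesn't, it can't choose actions this way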
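
Action-Value Function (Q Function)
Define a new function very similar to $V^*$:
$Q(s, a) \equiv r(s, a) + \gamma V^*(\delta(s, a))$
If the agent learns $Q$, it can choose the optimal action even without knowing $\delta$!
$\pi^*(s) = \arg\max_a \left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$
$\pi^*(s) = \arg\max_a Q(s, a)$
$Q$ is the evaluation function the agent will learn.

Training Rule to Learn Q
Note that $Q$ and $V^*$ are closely related:
$V^*(s) = \max_{a'} Q(s, a')$
which allows us to write $Q$ recursively as
$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$
Let $\hat{Q}$ denote the learner's current approximation to $Q$. Consider the training rule
$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
where $s'$ is the state resulting from applying action $a$ in state $s$.

Q-Learning
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$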
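
The Q-learning update can be sketched on a tiny deterministic world. This is an illustrative sketch under assumptions: the four-state corridor, the reward of 100 for entering the goal, and the use of systematic sweeps over all state-action pairs (instead of an exploring agent) are choices made here so that the run is reproducible. Setting `alpha=1` recovers the deterministic training rule.

```python
from collections import defaultdict

ACTIONS = ("left", "right")
GOAL = 3  # hypothetical terminal state at the right end of a 4-state corridor

def step(s, a):
    """Deterministic corridor dynamics: reward 100 on entering the goal, else 0."""
    s_next = min(s + 1, GOAL) if a == "right" else max(s - 1, 0)
    reward = 100.0 if s_next == GOAL else 0.0
    return reward, s_next

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]."""
    best_next = 0.0 if s_next == GOAL else max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def train(sweeps=1000, alpha=0.5, gamma=0.9):
    Q = defaultdict(float)
    for _ in range(sweeps):
        # Visit every nonterminal (state, action) pair once per sweep.
        for s in range(GOAL):
            for a in ACTIONS:
                r, s_next = step(s, a)
                q_update(Q, s, a, r, s_next, alpha, gamma)
    return Q

Q = train()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

The learned values fall off by a factor of $\gamma$ per step of distance from the goal (100, 90, 81 for moving right), so the greedy policy heads right from every state without the agent ever consulting a model of `step`.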
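
Q Learning for Deterministic Worlds
For each $s, a$ initialize the table entry $\hat{Q}(s, a) \leftarrow 0$
Observe the current state $s$
Do forever:
- Select an action $a$ and execute it
- Receive immediate reward $r$
- Observe the new state $s'$
- Update the table entry for $\hat{Q}(s, a)$ as follows:
  $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
- $s \leftarrow s'$

Updating Q
$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') \leftarrow 0 + 0.9 \max\{63, 81, 100\} \leftarrow 90$
Notice that if rewards are non-negative, then
$(\forall s, a, n) \ \hat{Q}_{n+1}(s, a) \ge \hat{Q}_n(s, a)$
and
$(\forall s, a, n) \ 0 \le \hat{Q}_n(s, a) \le Q(s, a)$

Convergence Theorem
Theorem: $\hat{Q}$ converges to $Q$. Consider the case of a deterministic world, with bounded immediate rewards, where each $\langle s, a \rangle$ is visited infinitely often.
Proof: Define a full interval to be an interval during which each $\langle s, a \rangle$ is visited. During each full interval the largest error in the $\hat{Q}$ table is reduced by a factor of $\gamma$.
Let $\hat{Q}_n$ be the table after $n$ updates, and $\Delta_n$ be the maximum error in $\hat{Q}_n$; that is,
$\Delta_n = \max_{s, a} \left| \hat{Q}_n(s, a) - Q(s, a) \right|$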
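
Convergence Theorem (continued)
For any table entry $\hat{Q}_n(s, a)$ updated on iteration $n + 1$, the error in the revised estimate $\hat{Q}_{n+1}(s, a)$ is
$\left| \hat{Q}_{n+1}(s, a) - Q(s, a) \right| = \left| \left( r + \gamma \max_{a'} \hat{Q}_n(s', a') \right) - \left( r + \gamma \max_{a'} Q(s', a') \right) \right|$
$\phantom{|\hat{Q}_{n+1}|} = \gamma \left| \max_{a'} \hat{Q}_n(s', a') - \max_{a'} Q(s', a') \right|$
$\phantom{|\hat{Q}_{n+1}|} \le \gamma \max_{a'} \left| \hat{Q}_n(s', a') - Q(s', a') \right|$
$\phantom{|\hat{Q}_{n+1}|} \le \gamma \max_{s'', a'} \left| \hat{Q}_n(s'', a') - Q(s'', a') \right|$
$\phantom{|\hat{Q}_{n+1}|} = \gamma \Delta_n$
Note that we used the general fact that
$\left| \max_a f_1(a) - \max_a f_2(a) \right| \le \max_a \left| f_1(a) - f_2(a) \right|$

Non-deterministic Case
What if the reward and next state are non-deterministic?
We redefine $V$ and $Q$ by taking expected values:
$V^{\pi}(s) \equiv E\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \right] \equiv E\left[ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \right]$
$Q(s, a) \equiv E\left[ r(s, a) + \gamma V^*(\delta(s, a)) \right]$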

Nondeterministic Case
Q learning generalizes to nondeterministic worlds.
Alter the training rule to
$\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n) \hat{Q}_{n-1}(s, a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \right]$
where
$\alpha_n = \frac{1}{1 + visits_n(s, a)}$
Convergence of $\hat{Q}$ to $Q$ can be proved [Watkins and Dayan, 1992].

Temporal Difference Learning
Temporal Difference (TD) learning methods
- Can be used when accurate models of the environment are unavailable: neither the state transition function nor the reward function is known
- Can be extended to work with implicit representations of action-value functions
- Are among the most useful reinforcement learning methods

Example: TD-Gammon
Learn to play Backgammon (Tesauro, 1995)
Immediate reward:
- +100 if win
- -100 if lose
- 0 for all other states
Trained by playing 1.5 million games against itself. Now comparable to the best human player.

Temporal Difference Learning
Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference:
$Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_a \hat{Q}(s_{t+1}, a)$
Why not two steps?
$Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_a \hat{Q}(s_{t+2}, a)$
Or $n$?
$Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a \hat{Q}(s_{t+n}, a)$
Blend all of these:
$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda) \left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots \right]$

Temporal Difference Learning
Equivalent expression:
$Q^{\lambda}(s_t, a_t) = r_t + \gamma \left[ (1 - \lambda) \max_a \hat{Q}(s_{t+1}, a) + \lambda Q^{\lambda}(s_{t+1}, a_{t+1}) \right]$
The TD(λ) algorithm uses the above training rule.
- Sometimes converges faster than Q learning
- Converges for learning $V^*$ for any $0 \le \lambda \le 1$ (Dayan, 1992)
- Tesauro's TD-Gammon uses this algorithm

Handling Large State Spaces
- Replace the $\hat{Q}$ table with a neural net or other function approximator
- Virtually any function approximator would work provided it can be updated in an online fashion
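
Learning State-Action Values
Training examples of the form: $\{ \text{description of } (s_t, a_t), \ v_t \}$
The general gradient-descent rule:
$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ v_t - Q_t(s_t, a_t) \right] \nabla_{\vec{\theta}} Q_t(s_t, a_t)$

Linear Gradient Descent Watkins' Q(λ)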
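
For a linear approximator the gradient in the rule above is simply the feature vector, which makes the update easy to sketch. This is a minimal illustration, not Watkins' Q(λ) itself: the feature vector, the target value, and the function names are assumptions, and eligibility traces are omitted.

```python
def linear_q(theta, features):
    """Q(s,a) approximated as the dot product of weights and features of (s,a)."""
    return sum(w * x for w, x in zip(theta, features))

def gradient_step(theta, features, target, alpha):
    """theta <- theta + alpha [v - Q(s,a)] * grad Q, where grad Q = features."""
    error = target - linear_q(theta, features)
    return [w + alpha * error * x for w, x in zip(theta, features)]

# Hypothetical training example: features describing one (s, a) pair, target v = 5.
features = [1.0, 2.0]
theta = gradient_step([0.0, 0.0], features, target=5.0, alpha=0.1)

# Repeating the same example drives the prediction toward the target.
theta_many = [0.0, 0.0]
for _ in range(50):
    theta_many = gradient_step(theta_many, features, target=5.0, alpha=0.1)
```

With these numbers each step halves the prediction error, since the error is scaled by $1 - \alpha \lVert \phi \rVert^2 = 0.5$ for the feature vector $\phi$ used here.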