Compulsory Flow Q-Learning: an RL algorithm for robot navigation based on partial-policy and macro-states

Journal of the Brazilian Computer Society, 2009; 15(3)

Valdinei Freire da Silva*, Anna Helena Reali Costa
Laboratório de Técnicas Inteligentes (LTI), Departamento de Engenharia de Computação e Sistemas Digitais (PCS), Escola Politécnica da Universidade de São Paulo (EPUSP), São Paulo - SP, Brasil
*e-mail: valdinei.freire@gmail.com

Received: July 7, 2009; Accepted: August 27, 2009

Abstract: Reinforcement Learning is carried out on-line, through trial-and-error interactions of the agent with the environment, which can be very time consuming when considering robots. In this paper we contribute a new learning algorithm, CFQ-Learning, which uses macro-states, a low-resolution discretisation of the state space, and a partial-policy to get around obstacles, both of them based on the complexity of the environment structure. The use of macro-states breaks the convergence guarantees of the algorithms, but can accelerate the learning process. On the other hand, partial-policies can guarantee that an agent fulfils its task, even through macro-states. Experiments show that CFQ-Learning achieves a good balance between policy quality and learning rate.

Keywords: machine learning, reinforcement learning, abstraction, partial-policy, macro-states.

1. Introduction

A common class of tasks in mobile robotics is planning an action policy to reach a desired goal state, usually through maximisation of a value function which designates sub-objectives and helps choosing the best path: for instance, the shortest path, the path with the shortest time, the safest path, or any combination of different sub-objectives [5, 20]. The definition of a task in this class may contain, besides the value function, some a priori knowledge about the domain, e.g., environment map, environment dynamics, goal position. Such knowledge allows robot planning, while the lack of such knowledge obliges the robot either to learn it previously or to make use of heuristic strategies, such as moving in the goal direction while avoiding obstacles [9]. While the problem of mapping the environment has received great attention from the robotics community, mainly under the simultaneous localisation and mapping approach [1, 3], less attention has been given to learning the environment dynamics.

Given a map and the robot localisation, if a goal position is given, it is possible through path planning to determine a path free of obstacles from the robot position to such a goal. However, even if a priori knowledge about moving directions in the Euclidean space is considered so that an action policy can be computed, minor variations in the environment dynamics, such as slippery, oblique, or crushed ground, are not captured, and more generic sub-objectives are not inferred.

Reinforcement Learning (RL) [2] is a learning method that can be applied to the task of learning the dynamic environment and planning an action policy altogether. In RL, an autonomous agent learns an action policy based on its own experience. This policy is inferred from a process of trial and error, which is guided by the agent itself and by received reinforcements that indicate a partial evaluation of executed actions, besides perceiving transitions among different situations (formally, states) that evidence the environment dynamics. The sequence of received reinforcements determines the value of each executed trajectory. Reinforcements can indicate walked distance, time elapsed or any desirable local situation faced by the robot.

Whereas the robotic task of reaching a goal state in an environment populated with obstacles can be solved through planning, robots based on RL can learn and recover from big changes in the environment, like the appearance of new obstacles, or small ones, like the appearance of oil on the ground or of crushed ground [2]. Moreover, RL does not need to start learning from scratch: a partial solution can be considered so that an RL algorithm fills the gaps, or a suboptimal solution can be considered so that an RL algorithm improves it.

Within the last fifteen years, many works about RL have been published [4, 8, 10, 14, 22] extending Sutton's article [22], which brought mathematical formalism to RL. However, most methods depend strongly on the size of the state space in which the learning process is done, which gives rise to a trade-off between policy quality and learning speed. Recent works in RL attempt to find methods that accelerate the learning rate without degenerating the policy quality. In such methods three objectives are pursued: scalability, so that no exponential increase occurs in the complexity of solving tasks when increasing the size of the state space; knowledge transfer, so that most of the common knowledge can be shared among different tasks; and stability, so that a method can be applied to different domains.

In this paper we propose a method that concerns the scalability and knowledge transfer properties, so that an increase in the learning speed for a specific task in mobile robotics can be reached. On the other hand, we restrict our algorithm to a specific domain, that of reaching a goal state within an environment that contains obstacles which robots cannot walk through. The proposed method uses a discretisation of the state space combined with a previously learnt partial-policy [7], both defined in accordance with the complexity of the environment structure. This method is implemented in the CFQ-Learning algorithm, which stands for Compulsory Flow Q-Learning.

We use both temporal and spatial abstraction in order to accelerate the learning process. Spatial abstraction is applied through a low-resolution discretisation of the state space [6], and similar states are grouped such that they share characteristics which will be learnt equally for all of them. Temporal abstraction is applied through macro-actions [8], which are sequences of actions or sub-policies that are applied for more than one step, so that the robot is left with fewer occasions to choose actions. However, since there are discontinuities in the state space because of obstacles, we may use either a high-resolution discretisation near such discontinuities, or a macro-action to overcome the obstacles. We have chosen the second option, using the compulsory flow as a partial-policy, which takes control of the robot near obstacles to get around them.

Based on theoretical and experimental analysis, we show that CFQ-Learning achieves a better balance between policy quality and learning speed than the Q-Learning algorithm does when applied to a discretised continuous state space.

The remainder of this paper is organised as follows. Section 2 presents the RL formalisation together with Q-Learning, the most usual RL algorithm; then the task domain of interest is presented, followed by a reinforcement learning algorithm that can solve it.
In Section 3 we formally define the partial policy, named compulsory flow, and describe how to learn such a flow. We then present the CFQ-Learning algorithm in Section 4, which uses the compulsory flow and a discretisation of the state space defined for the current task environment to learn action policies. In Section 5 we compare the performance of the CFQ-Learning algorithm with the Q-Learning algorithm when different discretisations of the state space are considered. We describe the experiments performed and present the results obtained. Finally, Section 6 summarises our conclusions.

2. Reinforcement Learning and Task Domain

In works concerning RL, Markovian Decision Processes (MDPs) [7] are adopted as simplified models of real problems. MDP models are built under a well-established mathematical formalism, which compensates for the simplifying conditions used to describe the environment, as there are optimal algorithms to solve problems expressed as MDPs [7]. An MDP is defined by a tuple $\langle A, S, P(s_{t+1} \mid s_t, a_t), r(s_t, a_t) \rangle$, where $A$ is a finite set of possible actions $a$, $S$ is a finite set of possible states $s$, $P(s_{t+1} \mid s_t, a_t)$ represents the transition probabilities and $r(s_t, a_t)$ is a bounded expected reinforcement function.

2.1. Q-Learning algorithm

The basic idea behind RL is that the learning agent can learn how to solve an MDP task through repeated interactions with the environment. Note that all that is known by the agent is the set of actions $A$ and the set of states $S$, whereas the functions $P(s_{t+1} \mid s_t, a_t)$ and $r(s_t, a_t)$ must be learnt through interaction with the environment. The environment is described by the set of possible states $S$, and the agent can perform any action from $A$. Each time $t$ it performs an action $a_t$ in some state $s_t$, the environment reaches a new state $s_{t+1}$ and the agent receives a reinforcement $r_t$ that indicates the immediate value of this state-action transition (see Figure 1). The agent must find a stationary policy of actions $a_t^* = \pi^*(s_t)$ that maximises the expected value function $V^{\pi}(s_t)$, which represents the expected reinforcement incurred for a policy $\pi$, with $\pi^*(s_t) = \arg\max_{\pi} [V^{\pi}(s_t)]$ [7]. It is common to assume the discounted-reinforcement value function, which makes use of a discount factor $\gamma \in (0,1]$ that forces recent reinforcements to be more important than remote ones.
$V^{\pi}(s_t)$ is thus defined by:

$$V^{\pi}(i) = \lim_{N \to \infty} E\left[\sum_{t=0}^{N} \gamma^{t}\, r(s_t, \pi(s_t)) \,\middle|\, s_0 = i\right] \qquad (1)$$

The RL problem modelled as an MDP can be solved by the Q-Learning algorithm [24], which finds an optimal policy incrementally without considering the transition probabilities of the environment model. Q-Learning is based on the TD(0) algorithm [22], and estimates a value function $Q(s,a)$ for each state-action pair. This value function is recursively calculated by:

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha\left[r(s_t, a_t) + \gamma \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t)\right] \qquad (2)$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. During the learning process, at the time of choosing action $a_t$ it is necessary to select one of two strategies: exploration, which diversifies the policy in order to reach unknown state-action pairs and may improve the best currently known policy, or exploitation, which chooses the best currently known policy. Frequently a combination of both strategies is used ($\varepsilon$-greedy), where an exploration rate $\varepsilon$ is defined.

Figure 1. An RL agent interacting with its environment.

2.2. Task domain

Goal-state tasks have many applications in robotics: going to a desired room, holding an object, changing the environment, and so on. Frequently it is required that the robot plans the best possible path to solve the task within a continuous state space. Although RL algorithms can suboptimally solve these tasks (for instance, by considering a high-resolution grid world and using a unitary cost for each action choice), too much time may be needed to obtain a reasonably good policy, which makes them an inefficient alternative in many cases.

The interest here resides in applications where a set of goal-state tasks is defined for the same kind of environment, so that it is worth acquiring in advance some knowledge about this kind of environment, and then reusing this knowledge in future tasks, where different goal positions or different environments are defined. A mobile robot navigating in a one-floor house is the kind of environment considered in this paper. Figure 2 shows the environment used in the experiments described in Section 5, where a mobile robot can move in any direction.

The domain considered in this paper can be defined in a continuous space. In this space we can define a set of continuous states $X$ that represents every possible position of the robot in the environment. One of the characteristics of such a space is the notion of neighbourhood.
For example, if a position in a plane is considered, the Euclidean distance can be used to define a neighbourhood of each position, meaning that the robot can reach a state in this neighbourhood in the near future.

Although the continuous state space presents some important characteristics when planning, the solutions discussed in this paper (RL algorithms) are only applied to discrete spaces. This way, we can consider a high-resolution discrete space $S$ that represents the continuous space of the domain through a map $s(x): X \to S$. We must also consider a set of discrete actions $A$. The chosen discretisation should respect the following constraints:

1. The set of continuous states that is mapped into the same discrete state must be convex, i.e., if $s(x_i) = s(x_j) = s$ then $s(\alpha x_i + (1-\alpha) x_j) = s$ for all $\alpha \in (0,1)$. This guarantees that the notion of neighbourhood is maintained in the discrete state space when we consider the continuous mean position of every continuous state mapped into the same discrete state;
2. The agent moves only to neighbour states in the discrete state space, i.e., $P(s_{t+1} = s' \mid s_t = s, a_t) > 0$ if and only if $s$ and $s'$ are neighbours. This guarantees that the notion of neighbourhood in the continuous state space can be extended to the discrete state space; and
3. There are actions that can move the agent, with higher probability, in any direction in the state space, except to places where obstacles exist, i.e., for all neighbouring states $s, s' \in S$, there is an action $a \in A$ such that $P(s_{t+1} = s' \mid s_t = s, a_t = a) = \max_{s'' \in S} P(s_{t+1} = s'' \mid s_t = s, a_t = a)$. This implies that if the set of discrete states allows $k$ neighbour states, then $|A| \ge k$. This guarantees that the agent can move from any state to its neighbours.

Figure 3 shows two patterns of discrete states that respect such constraints. In the hexagon pattern, there are 6 possible actions, whereas in the quadratic pattern, there are 8 possible actions.

Figure 2. The task environment used in the experiments. The goal region is localised in the top-left corner.

Figure 3. Examples of discrete patterns.
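As a minimal sketch of Sections 2.1 and 2.2 taken together, the Python fragment below pairs a square-cell map s(x) with the ε-greedy choice and the update of Equation 2. The function names, the dictionary-based Q table and the `(dx, dy)` encoding of the eight actions of the quadratic pattern are illustrative assumptions, not part of the original description.

```python
import random

# Eight unit moves of the quadratic pattern (assumed encoding: (dx, dy) offsets).
ACTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def discretise(x, y, cell=1.0):
    """Map a continuous position to a square grid cell, a simple choice of s(x)."""
    return (int(x // cell), int(y // cell))


def epsilon_greedy(q, state, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    values = [q.get((state, a), 0.0) for a in range(len(ACTIONS))]
    return values.index(max(values))


def q_update(q, s, a, r, s_next, alpha, gamma):
    """Apply Equation 2 once to the table q, a dict keyed by (state, action)."""
    best_next = max(q.get((s_next, b), 0.0) for b in range(len(ACTIONS)))
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return q[(s, a)]
```

Square cells are convex and each cell has eight neighbours, so this particular map is one way to satisfy the three constraints listed above; unseen state-action pairs are read as 0, which is also an assumption of the sketch.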

2.3. Q-Learning and continuous space

The Q-Learning algorithm, as described in Section 2.1, is restricted to discrete spaces (states and actions), and when applied to a continuous state (or action) space, a discretisation process is necessary. It is usual to use a uniform discretisation of the space (states and actions) as shown in Figure 3, such that a discrete action is chosen and performed (for a constant period of time, until the agent makes a transition between discrete states, or until another condition occurs) in a considered discrete state, which encompasses the current real state.

In continuous space, when applying a continuous control $u(x(t))$, its respective value function $V^{u}(x(t))$ is obtained such that:

$$V^{u}(x(0)) = E\left[\lim_{T \to \infty} \int_{0}^{T} \gamma^{t}\, r(x(t), u(x(t)))\,dt\right] = E\left[\int_{0}^{\tau} \gamma^{t}\, r(x(t), u(x(t)))\,dt + \gamma^{\tau}\, V^{u}(x(\tau))\right] \qquad (3)$$

where $x(t)$ is the current state, $r(x,u)$ is the current reinforcement per time and $\gamma$ is the discount factor.

In the discretisation process, a set of continuous states is associated with a discrete state $s$, $s: X \to S$, where $X$ is the set of continuous states and $S$ is the set of discrete states. The function $s(x)$ relates each continuous state $x$ to a discrete state $s = s(x)$, and for each discrete state in $S$ it is supposed that the value function of all its continuous states has similar values and a similar optimal policy, which means that if $s(x_1) = s(x_2)$ then $V^*(x_1) \approx V^*(x_2)$ and $\pi^*(x_1) \approx \pi^*(x_2)$ [4].

In general, the set of discrete actions $A$ is chosen in such a way that the agent can move to all its discrete neighbours, as seen in Section 2.2, the reinforcement function $r(s,a)$ is derived from the continuous state space, and the value function is recursively calculated by (Munos and Moore [11]):

$$Q_{t+\tau}(s_{x(t)}, a_t) = Q_t(s_{x(t)}, a_t) + \alpha\left[\int_{t}^{t+\tau} \gamma^{k-t}\, r(x(k), a_t)\,dk + \gamma^{\tau} \max_{a} Q_t(s_{x(t+\tau)}, a) - Q_t(s_{x(t)}, a_t)\right] \qquad (4)$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $s_{x(t)}$ is the discrete state mapped from the continuous state $x(t)$, and $\tau$ is the time taken to execute action $a_t$.

As a result of the discretisation process, a finite cardinality is obtained. However, the performance of the Q-Learning algorithm is completely dependent on such cardinality and on the way the discretisation is made.
If the cardinality is high, a good policy can be obtained, but the learning speed is low, whereas if the cardinality is low, the learning speed is high, but the learnt policy is of lower quality. A trade-off between learning speed and policy quality must therefore be considered for the discretisation process. When discretisation is used, the convergence of the Q-Learning algorithm is also corrupted, since it is not possible to guarantee a stationary transition function $P(s' \mid s, a)$: it will depend on the policy executed or, more specifically, on the previous discrete state visited.

In most domains, uniform discretisation is not the best solution, since the environment structure is not considered. Munos and Moore [11] presented an algorithm that makes a non-uniform discretisation based on the value function variance of continuous states belonging to the same discrete state and on the influence of the value function of a discrete state on other discrete states. However, the system dynamics is considered to be known and deterministic. Reynolds [16] proposed an adaptive, policy-based algorithm for the non-uniform discretisation process, which acquires the dynamics of the environment whilst executing an on-line discretisation of the state space.

One reason for having discontinuities in optimal value functions and policies is the existence of obstacles and prohibited state transitions in the environment. Both methods cited above discretise the space in a more useful way than the uniform discretisation frequently used. However, when applied to an environment with many obstacles, a high resolution is used near obstacles, since they produce high variations in the value function and in the policy, which decreases the learning speed.

We propose an alternative way to deal with this problem. The idea is to previously define i) an obligatory partial policy, named compulsory flow, that should be performed by the learning agent when near obstacles, and ii) a low-resolution discretisation of the state space based on the environment structure together with constraints on the action policy to be used in regions free of obstacles.
Once the compulsory flow and the low-resolution discretisation of the state space are defined, this information can be reused in the policy learning process for different tasks defined for the same environment. Based on these definitions (compulsory flow and low-resolution discretisation), we contribute a new algorithm, called Compulsory Flow Q-Learning, aiming at a better balance between learning speed and policy quality.

3. Compulsory Flow

A partial policy is a mapping from an environmental region to a subset of possible actions [3, 7, 5], and it helps incorporating a priori knowledge into RL methods. Differently from previous work in the literature, we consider an a priori partial policy that reaches a desired domain-dependent behaviour in the environment. In this paper, this action subset has only one possible action for each state.

In order to define the compulsory flow, a high-resolution state space discretisation is used. Although it can take a lot of time to determine the compulsory flow, it is calculated only once for each environment. The same environment structure can be used for different tasks and the compulsory flow can be reused so that the learning speed can be increased. Also, we will see that the compulsory flow can be defined locally, meaning that it is not sensitive to the global environment, but to situations faced by the robot.

The compulsory flow is a partial policy used when the agent is near obstacles, so that the robot can get around obstacles [6]. In this sense, a hard-coded partial policy can be used to implement a get-around behaviour, the only requirement being that the agent keeps some inertial movement. The compulsory flow is defined by the tangential-flow region $R^{TF}$ and the tangential-flow policy $\pi^{TF}(s_t, a_{t-1})$, a function that defines an action $a_t$ for each state $s_t$ in $R^{TF}$, which is reached by performing $a_{t-1}$ in $s_{t-1}$. The previous action is used in order to guarantee inertial behaviour, trying to keep the same movement direction when avoiding obstacles.

Definition 1: Let $N^{TF}(s, a)$ be the expected number of actions performed by an agent to reach an obstacle when executing the action $a$ in the state $s$ and then following a random policy; $a_{t-1}$ and $a_t$ be the last action executed and the new action to be performed, respectively; $\bar{N}^{TF}$ be the number that determines the size of $R^{TF}$; and $\vec{a}$ be the vector which represents action $a$ in the continuous space. Then the following are defined:

Tangential-flow region $R^{TF}$: $s_i \in R^{TF}$ if and only if $\min_{a} N^{TF}(s_i, a) \le \bar{N}^{TF}$, which means that $s_i$ is near some obstacle, and

Tangential-flow policy $\pi^{TF}(\cdot)$:

$$\pi^{TF}(s_t, a_{t-1}) = \arg\min_{a \,:\, \langle \vec{a}, \vec{a}_{t-1} \rangle > 0} \left| N^{TF}(s_t, a) - \bar{N}^{TF} \right|$$

where $\langle \vec{a}, \vec{a}_{t-1} \rangle$ represents the inner product of $\vec{a}$ and $\vec{a}_{t-1}$. This means that the angle between the vector of $\pi^{TF}(s_t, a_{t-1})$ and the vector $\vec{a}_{t-1}$ is less than 90°, which keeps the agent in a movement direction similar to that given by $a_{t-1}$ and near the border of $R^{TF}$, making the agent get around obstacles.

Figure 4 illustrates a tangential-flow region (grey region) and the corresponding tangential-flow policy when the agent starts at point 1 and performs an action $a_i$ that drives it into $R^{TF}$ and activates $\pi^{TF}(\cdot)$ (see Definition 1), which avoids collision with the wall by conducting the agent through the compulsory flow until point 2, when it is released and a new action can be chosen to be performed. When the agent is released from the compulsory flow will be explained in the next section.
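One possible reading of Definition 1 can be sketched in Python as follows. The sketch assumes that the values of N^TF for the current state are available as a list indexed by action, that the eight actions are encoded as `(dx, dy)` vectors, and that the direction constraint is a strictly positive inner product; the function names are hypothetical.

```python
import math

# Assumed encoding of the eight actions as direction vectors (dx, dy).
ACTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def in_tangential_region(n_tf_row, n_bar):
    """A state is in the region when some action is expected to hit an obstacle within n_bar steps."""
    return min(n_tf_row) <= n_bar


def tangential_flow_action(n_tf_row, prev_action, n_bar, vectors=ACTIONS):
    """Pick the action closest to the region border among those keeping the movement direction."""
    px, py = vectors[prev_action]
    best, best_gap = None, math.inf
    for a, (ax, ay) in enumerate(vectors):
        # Keep only actions whose angle to the previous action is below 90 degrees.
        if ax * px + ay * py <= 0:
            continue
        gap = abs(n_tf_row[a] - n_bar)
        if gap < best_gap:
            best, best_gap = a, gap
    return best
```

The previous action always satisfies the direction constraint with itself, so the function returns an action for any input row. Ties are broken in favour of the lowest action index, which is an arbitrary choice of this sketch.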
In order for a learning agent to autonomously define (by exploration) the tangential-flow region $R^{TF}$ for an unknown environment, we propose the use of the following modification of the Q-Learning update rule:

$$N^{TF}_{t+1}(s_t, a_t) = N^{TF}_{t}(s_t, a_t) + \alpha\left[r(s_t, a_t) + \gamma \frac{1}{|A|} \sum_{a \in A} N^{TF}_{t}(s_{t+1}, a) - N^{TF}_{t}(s_t, a_t)\right] \qquad (5)$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. We use the average instead of the maximisation. The same rule can be adapted for different RL algorithms. The reinforcement function must be defined to detect obstacles, as it is used in this work, but it can also be used to detect other undesirable regions, such as cliffs [2], strong magnetic fields, high temperatures, moisture, etc.

Figure 5 shows the values $N^{TF}(s, a)$ obtained by using the described modification of the Q-Learning algorithm (Equation 5) with $r(s_t, a_t) = 0$ when hitting an obstacle and $r(s_t, a_t) = 1$ otherwise, and discount factor $\gamma = 1$. The tone of grey represents how far the agent is from reaching an obstacle when walking randomly (black is closer, white is further). The tangential-flow region $R^{TF}$ of this environment can be obtained by defining a desired $\bar{N}^{TF}$. Figure 6 shows the tangential-flow region $R^{TF}$ obtained with $\bar{N}^{TF} = 7$.

Figure 4. An agent's movement, which starts at point 1 and follows the compulsory flow until reaching point 2, when it is released, reaching point 3.

Figure 5. The values $N^{TF}(s, a)$ obtained through the modification of the Q-Learning algorithm with $\gamma = 1$, $r(s_t, a_t) = 0$ when hitting an obstacle and $r(s_t, a_t) = 1$ otherwise, in the discretised state space, after $10^6$ iterations.
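The averaged update of Equation 5 and the thresholding that yields the region can be sketched as below. This is only an illustration under stated assumptions: the table is a dictionary keyed by `(state, action)`, a collision is treated as terminal (nothing is bootstrapped after it, which the text does not state explicitly), and the names `n_tf_update`, `tangential_region` and the `default` value for unseen pairs are hypothetical.

```python
def n_tf_update(n_tf, s, a, s_next, hit_obstacle, alpha,
                gamma=1.0, n_actions=8, default=0.0):
    """One averaged update of the expected number of actions before reaching an obstacle."""
    old = n_tf.get((s, a), default)
    if hit_obstacle:
        # r = 0 on collision; assumed terminal, so no value is bootstrapped.
        target = 0.0
    else:
        # r = 1 otherwise, plus the average (not the maximum) over next actions.
        mean_next = sum(n_tf.get((s_next, b), default)
                        for b in range(n_actions)) / n_actions
        target = 1.0 + gamma * mean_next
    n_tf[(s, a)] = old + alpha * (target - old)
    return n_tf[(s, a)]


def tangential_region(n_tf, states, n_bar, n_actions=8, default=0.0):
    """States whose smallest expected count does not exceed the threshold n_bar."""
    return {s for s in states
            if min(n_tf.get((s, a), default) for a in range(n_actions)) <= n_bar}
```

With the reinforcements and discount factor quoted for Figure 5, the learnt value approximates a step count under a random walk, so a larger `n_bar` gives a wider region around the obstacles.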

Figure 6. Tangential-flow region using $\bar{N}^{TF} = 7$.

Notice that regions with similar structure (corners, U-like forms, gates) present similar tangential-flow regions. This characteristic is important, since it allows an agent to learn the tangential-flow region even before taking any action in the environment where a task must be done. The tangential-flow region, and consequently the tangential-flow policy, can be learnt beforehand if all typical situations can be experienced by the agent and if such a policy is defined in the space of local situations.

It is worth noticing again that the tangential-flow policy is just one possible compulsory flow. As already mentioned, a hard-coded behaviour of getting around obstacles can be programmed in the agent, even with other prohibited regions. Also, the compulsory flow can be used not only to get around regions where the agent cannot get through, but also as a way of guaranteeing that the agent does not damage itself.

4. Compulsory Flow Q-Learning

The CFQ-Learning algorithm addresses applications where previous information about the structure of the environment can be gathered and reused. This may happen when the robot has already had access to the environment in a previous task or if the environment is of some previously known kind. In our approach, while a high-resolution discretisation is used for the a priori definition of the compulsory flow for a task environment, a low-resolution discretisation is used in the CFQ-Learning algorithm to learn the task policy, which increases the learning speed while still keeping the agent safe in dangerous regions.

Similarly to the discretisation process described in Section 2.3, a set $M$ of macro-states $m$ is defined by a function $m: S \to M$, where in general the region in a macro-state $m \in M$ is much larger than the region in a discrete state $s \in S$. The set of actions $A$ for macro-states is the same as that defined for discrete states.
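As an illustration of the map m: S → M, the sketch below groups the cells of a rectangular grid into uniform rectangular blocks. This is a simplifying assumption: the paper builds its macro-states from the environment structure, not from a uniform tiling, and the name `make_macro_map` and its parameters are hypothetical.

```python
def make_macro_map(width, height, block_w, block_h):
    """Return (m, count): m maps a grid cell (col, row) to a macro-state index."""
    cols = -(-width // block_w)   # ceiling division: blocks per row
    rows = -(-height // block_h)  # ceiling division: blocks per column

    def m(s):
        col, row = s
        if not (0 <= col < width and 0 <= row < height):
            raise ValueError("state outside the grid")
        return (row // block_h) * cols + (col // block_w)

    return m, rows * cols
```

For example, `make_macro_map(100, 100, 25, 25)` covers a 100 × 100 grid with 16 blocks, so a policy learnt at the macro level needs one table row per block instead of one per cell.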
The CFQ-Learning algorithm considers as input: 1) the set $S$ of high-resolution discrete states with the function $s: X \to S$, where $X$ is the set of continuous states; 2) the set $M$ of low-resolution macro-states with the function $m: S \to M$; 3) the set $A$ of discrete actions; and 4) the tangential-flow region $R^{TF}$ and policy $\pi^{TF}$ with the function $N^{TF}: S \times A \to \mathbb{R}$.

In the algorithm there are three levels of states: 1) the continuous level $X$, where the real interaction of the agent with the environment occurs; 2) the high-resolution discrete level $S$, where the CFQ-Learning algorithm controls the real agent; and 3) the low-resolution discrete level $M$, where the policy is learnt. When an action is chosen in the $S$-level, a corresponding action is performed in the $X$-level for a discrete time $t_{n+1} - t_n$. When an action is chosen in the $M$-level, the agent can operate in two modes: 1) obstacle free: this mode is used in regions free of obstacles and the action is executed in the $S$-level; and 2) compulsory flow: this mode is used when the agent reaches the tangential-flow region and the action determined by $\pi^{TF}$ is executed. The agent enters mode 1 every time a macro-state transition occurs or when the chosen action takes the agent away from obstacles and there is no great change in the movement direction (the angle between the directions of the previous and the actual actions is less than or equal to 90°). The agent enters mode 2 every time it enters the tangential-flow region.

The idea behind the CFQ-Learning algorithm is that, once an action is chosen to be performed at the macro-state $m$, the action will be executed whilst the agent is in the same macro-state and this action does not drive the agent into the compulsory-flow region $R^{TF}$ (previously defined for the environment). Every time the agent invades the $R^{TF}$ region, the compulsory flow $\pi^{TF}(\cdot)$ drives the agent until it can either perform the original action again or a macro-state transition occurs. In the latter case, the learning agent chooses a new action. The compulsory flow $\pi^{TF}(\cdot)$ is defined on the basis of the $N^{TF}(s, a)$ previously calculated and stored for being used in the current environment. Table 1 describes the proposed CFQ-Learning algorithm.
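The mode switch and the macro-state update can be sketched in Python as below. This is one possible reading, not the authors' implementation: "takes the agent away from obstacles" is interpreted as the macro action having a larger N^TF value than the smaller of the threshold and the value of the previous action, the tangential-flow policy is passed in as a callable, and all names are hypothetical.

```python
# Assumed encoding of the eight actions as direction vectors (dx, dy).
ACTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def next_action(a_macro, a_prev, n_tf_row, n_bar, flow_policy, vectors=ACTIONS):
    """Low-level action inside a macro-state: resume a_macro or follow the flow."""
    mx, my = vectors[a_macro]
    px, py = vectors[a_prev]
    # Mode 1 needs an angle of at most 90 degrees to the previous action...
    keeps_direction = mx * px + my * py >= 0
    # ...and a macro action that leads away from obstacles (assumed test).
    leads_away = n_tf_row[a_macro] > min(n_bar, n_tf_row[a_prev])
    if keeps_direction and leads_away:
        return a_macro
    # Mode 2: the compulsory flow takes control.
    return flow_policy(n_tf_row, a_prev)


def macro_q_update(q, m, a_macro, r_macro, m_next, elapsed, alpha, gamma, n_actions=8):
    """Update Q(m, a_macro) when the agent leaves macro-state m after `elapsed` time."""
    best_next = max(q.get((m_next, b), 0.0) for b in range(n_actions))
    old = q.get((m, a_macro), 0.0)
    target = r_macro + gamma ** elapsed * best_next
    q[(m, a_macro)] = old + alpha * (target - old)
    return q[(m, a_macro)]
```

Here `r_macro` stands for the reinforcement accumulated inside the macro-state, already discounted relative to the entrance time, and `elapsed` for the time spent in it, so a single table entry is updated per macro-state visit regardless of how many low-level actions were executed.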
Variable $t^{macro}$ keeps the entrance time step of the macro-state, variable $R^{macro}$ keeps the cumulative reinforcement within the current macro-state and $a^{macro}$ keeps the first action chosen when entering the current macro-state ($a^{macro}$ is the action that causes the whole sequence of actions within the current macro-state, i.e., the action that can fire the partial policy). As said in Section 2.3, each possible action $a_i \in A$ represents a movement in some direction in the state space $X$.

The quality of a policy learnt using CFQ-Learning depends on the sets $A$ and $M$. Let $p^*(s)$ be the number of states visited from the state $s$ when applying an optimal policy and $P(s)$ be the number of states visited from the state $s$ when applying an optimal policy learnt under CFQ-Learning. The error $(P(s) - p^*(s))$ can be minimised if the set $A$ represents well all directions of movement: the larger the number of possible actions, the smaller the distance between $P(s)$ and $p^*(s)$, since it will be easier to choose an action similar to the optimal policy. The same happens with the discretisation process, because it can help minimising the bounding error, enlarging macro-states when possible (for instance, narrow corridors) or making them smaller when there are no obvious sub-optimal actions.

Table 1. The CFQ-Learning algorithm.

In the beginning of each episode:
1. $Q(m, a) = 0$ for all $m$ and all $a$,
2. $t^{macro} = 0$, $R^{macro} = 0$,
3. $s_0 = s(x(t_0))$, $m_0 = m(s_0)$,
4. Choose $a_0$ according to the current policy and do $a^{macro} = a_0$.

At any discrete time $t_n$, $0 \le n < \infty$, the agent:
1. Executes action $a_n$ during the interval $[t_n, t_{n+1}]$ and calculates the reinforcement $r(s_n, a_n) = \int_{t_n}^{t_{n+1}} \gamma^{t - t^{macro}}\, r(x(t), a_n)\,dt$,
2. Does $R^{macro}_{n+1} = R^{macro}_{n} + r(s_n, a_n)$,
3. Observes the next continuous state $x(t_{n+1})$,
4. Does $s_{n+1} = s(x(t_{n+1}))$ and $m_{n+1} = m(s_{n+1})$,
5. If $m_{n+1} \ne m_n$
   Then updates $Q(m_n, a^{macro})$ according to:
   $$Q_{n+1}(m_n, a^{macro}) = Q_n(m_n, a^{macro}) + \alpha\left[R^{macro}_{n+1} + \gamma^{(t_{n+1} - t^{macro})} \max_{a} Q_n(m_{n+1}, a) - Q_n(m_n, a^{macro})\right],$$
   chooses an action $a_{n+1}$ according to the current policy and does $t^{macro} = t_{n+1}$, $R^{macro} = 0$, $a^{macro} = a_{n+1}$
   Else if $\langle \vec{a}^{macro}, \vec{a}_n \rangle \ge 0$ and $N^{TF}(s_{n+1}, a^{macro}) > \min\{\bar{N}^{TF}, N^{TF}(s_{n+1}, a_n)\}$
   * Then $a_{n+1} = a^{macro}$
   * Else $a_{n+1} = \pi^{TF}(s_{n+1}, a_n)$

5. Experimental Results

In the previous section nothing has been said about the convergence of the CFQ-Learning algorithm or about the optimal policy found under CFQ-Learning. Experiments have been conducted to show empirically that a high-quality policy can be found, i.e., one close in performance to an optimal policy, and that convergence to such a high-quality policy does in fact occur.

CFQ-Learning is compared with a modified version of Q-Learning, here called Coarse Q-Learning, using the same low-resolution discretisation of the state space. The Coarse Q-Learning algorithm used in these experiments is applied to the same set of macro-states used by the CFQ-Learning algorithm. In Coarse Q-Learning, once an action is chosen inside a macro-state, the same action is executed until a transition between macro-states occurs or the agent collides with an obstacle. In the first case a new action is chosen after the macro-state transition. In the latter case, a random action is selected.

In order to evaluate the algorithm defined here, we chose to experiment in a discrete simulated environment. Figure 7 shows the original high-resolution state space, which has 10,000 states (100 × 100), whereas Figure 8 shows four different discretisations for the same environment used to compare CFQ-Learning and Coarse Q-Learning. These discretisations were made respecting the three properties of the set of macro-states $M$ in Section 2.2.

Figure 7. The original high-resolution discretisation of the environment with 10,000 states (100 × 100). The goal region is localised in the top-left corner.

Figure 8. The 4 different discretisations of the environment used to compare the Q-Learning, Coarse Q-Learning and CFQ-Learning algorithms.

Parameters used in the experiments are: discount factor $\gamma = 0.99$, learning rate $\alpha = 0.3$ decreasing in each episode, and exploration rate $\varepsilon = 0.2$ decreasing in each episode. The Best Q-Learning results (bold line in Figures 9, 10 and 11) were found with the following procedure: 1) execute Q-Learning for 90,000 episodes in the high-resolution discretisation (100 × 100) and find the best policy; 2) apply the best policy to 10,000 episodes and calculate the average performance among those 10,000 episodes; and 3) repeat this procedure for 200 runs and calculate the average performance among those 200 runs. This result is shown in the graphics as the Best Q-Learning, which is used as a reference for very good performance.

Figure 9 shows the results using Coarse Q-Learning for different discretisations compared with high-resolution Q-Learning (100 × 100). It is possible to see the great dependence of the policy quality on the number of states. The number of steps to reach the goal region taken by Coarse Q-Learning using 36 macro-states is more than 2 times greater than the number of steps taken by the high-resolution Q-Learning, after 10,000 episodes. If the number of macro-states used is 6, the result is 5 times worse.

Experiments were conducted with CFQ-Learning and its on-line version (used when there is no knowledge about the environment and the tangential-flow policy must be obtained during the learning process). In the first steps of on-line CFQ-Learning, for all $s \in S$ and $a \in A$, $N^{TF}(s, a) = \bar{N}^{TF} + \varepsilon$, where $\varepsilon > 0$, which means that no state $s$ is in the tangential-flow region $R^{TF}$ and the tangential-flow policy $\pi^{TF}$ does not take control of the agent. As the agent collides with obstacles, Equation 5 is applied and the function $N^{TF}(\cdot)$ is learnt, defining the real tangential-flow region and policy. In this version of CFQ-Learning the macro-state policy and the tangential-flow policy are learnt concurrently.

Differently from Coarse Q-Learning, CFQ-Learning obtains a policy with results closer to high-resolution Q-Learning.
Figure 10 and Figure 11 show the results for CFQ-Learning and on-line CFQ-Learning, respectively. Both of them show that CFQ-Learning does not have a great dependence on the number of states: even when 6 states are considered, the number of steps is only 20 percent worse than the number of steps obtained by Q-Learning. When CFQ-Learning and on-line CFQ-Learning are compared, the greatest differences are in the first 1,000 episodes, when on-line CFQ-Learning is learning the tangential-flow policy and tangential-flow region, whereas the final performance of the policies is similar.

It is worth mentioning that the policy considered as the Best Q-Learning is not really optimal, since it was learnt and there is no guarantee of its convergence. Also, the value shown in Figures 9, 10 and 11 is a sample of learnt policies. Statistical variance is therefore possible, and some policy may reach a better result than the Best Q-Learning policy.

Figure 9. Comparison of the Best Q-Learning performance with the learning rate of the algorithms: Q-Learning and Coarse Q-Learning. Each value in the graphic represents the average performance over 200 runs and 50 episodes. Vertical axis: average number of steps to goal.

Figure 10. Comparison of the Best Q-Learning performance with the learning rate of the algorithms: Q-Learning and CFQ-Learning. Each value in the graphic represents the average performance over 200 runs and 50 episodes. Vertical axis: average number of steps to goal.

Figure 11. Comparison of the Best Q-Learning performance with the learning rate of the algorithms: Q-Learning and on-line CFQ-Learning. Each value in the graphic represents the average performance over 200 runs and 50 episodes. Vertical axis: average number of steps to goal.

6. Conclusion

In this paper we presented a new learning algorithm that uses a high-resolution discretisation of the state space in the control process and a low-resolution discretisation in the policy-learning process. With this algorithm the learning agent reaches the goal and finds a good policy faster than with algorithms based on a high-resolution discretisation of the state space. In the experiments conducted, the proposed CFQ-Learning algorithm performed close to the optimal policy, even with a low-resolution discretisation of the state space.

Although previous knowledge about the environment is necessary, such knowledge can be extracted during the execution of the first tasks in the environment and reused later on to accelerate the learning process for future tasks. In cases where the tangential-flow region and policy cannot be defined a priori, different solutions can be adopted. One is to use sensors (sonars, laser) to sense the distance and the direction from the robot to the undesirable regions and then, based on this sensing, to create the compulsory flow. In that case the tangential-flow region must still be defined a priori in terms of distance, together with the tangential-flow policy for getting around undesirable regions. Another option is to learn N_TF(s, a) in the sensor space, which can be generalised to different parts of the environment or even to different environments. This sensor space depends only on the regions near the agent and their relative positions, and does not consider the global position of the agent in the environment.

Acknowledgements

This research was conducted under the CAPES/GRICES Project MultiBot (Grant no. 099/03), FAPESP project Logprop (Grant no. 2008/ ) and CNPq project Ob-SLAM (Grant no /2008-7). Valdinei F. Silva is grateful to FAPESP (proc. 02/3678-0) and Anna H. R. Costa is grateful to CNPq (Grant No /2008-0).
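The sensor-based option mentioned in the conclusion (sensing distance and direction to an undesirable region, then imposing a flow around it) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's tangential-flow policy: the eight-heading action set `ACTIONS`, the `radius` threshold, the use of a preferred direction `goal_bearing` to choose between the two tangents, and the function names are all hypothetical.

```python
import math

# Hypothetical action set: eight compass headings, in radians.
ACTIONS = [k * math.pi / 4 for k in range(8)]


def angular_gap(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def tangential_flow_action(distance, bearing, goal_bearing, radius=1.0):
    """Return a heading that skirts a nearby undesirable region, or None.

    distance, bearing: sensed range and direction to the region.
    goal_bearing: direction the learned policy currently prefers (assumed).
    radius: assumed extent of the tangential-flow region.
    """
    if distance >= radius:
        return None  # outside the region: defer to the learned policy
    # Headings tangent to the region are perpendicular to its bearing.
    tangents = [bearing + math.pi / 2, bearing - math.pi / 2]
    # Pick the tangent that deviates least from the preferred direction.
    target = min(tangents, key=lambda t: angular_gap(t, goal_bearing))
    # Snap to the closest available discrete action.
    return min(ACTIONS, key=lambda a: angular_gap(a, target))


# Region straight ahead and close; preferred direction is ahead-left.
near = tangential_flow_action(0.5, 0.0, math.pi / 4)
# Same geometry, but the region is farther than the assumed radius.
far = tangential_flow_action(2.0, 0.0, math.pi / 4)
```

Because the rule uses only the sensed distance and relative direction, it does not depend on the global position of the agent, which is the property the conclusion attributes to the sensor space.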
