SCALED GRADIENT DESCENT LEARNING RATE Reinforcement learning with light-seeking robot


Kary Främling
Helsinki University of Technology, P.O. Box 5400, FI-02015 HUT, Finland

Keywords: Linear function approximation, Gradient descent, Learning rate, Reinforcement learning, Light-seeking robot

Abstract: Adaptive behaviour through machine learning is challenging in many real-world applications such as robotics. This is because learning has to be rapid enough to be performed in real time and to avoid damage to the robot. Models using linear function approximation are interesting in such tasks because they offer rapid learning and have small memory and processing requirements. Adalines are a simple model for gradient descent learning with linear function approximation. However, the performance of gradient descent learning even with a linear model greatly depends on identifying a good value for the learning rate to use. In this paper it is shown that the learning rate should be scaled as a function of the current input values. A scaled learning rate makes it possible to avoid weight oscillations without slowing down learning. The advantages of using the scaled learning rate are illustrated using a robot that learns to navigate towards a light source. This light-seeking robot performs a Reinforcement Learning task, where the robot collects training samples by exploring the environment, i.e. taking actions and learning from their results by a trial-and-error procedure.

1 INTRODUCTION

The use of machine learning in real-world control applications is challenging. Real-world tasks, such as those using real robots, involve noise coming from sensors, non-deterministic actions and uncontrollable changes in the environment. In robotics, experiments are also longer than simulated ones, so learning must be relatively rapid and possible to perform without causing damage to the robot. Only information that is available from robot sensors can be used for learning. This means that the learning methods have to be able to handle partially missing information and sensor noise, which may be difficult to take into account in simulated environments. Artificial neural networks (ANNs) are a well-known technique for machine learning in noisy environments.
In real robotics applications, however, ANN learning may become too slow to be practical, especially if the robot has to explore the environment and collect training samples by itself. Learning by autonomous exploration of the environment by a learning agent is often performed using reinforcement learning (RL) methods. Due to these requirements, one-layer linear function approximation ANNs (often called Adalines (Widrow & Hoff, 1960)) are an interesting alternative. Their training is much faster than for non-linear ANNs and their convergence properties are also better. Finally, they have small memory and computing power requirements.

However, when Adaline inputs come from sensors that give values of different magnitude, it becomes difficult to determine what learning rate to use in order to avoid weight oscillation. Furthermore, as shown in the experiments section of this paper, using a fixed learning rate may be problematic also because the optimal learning rate changes depending on the state of the agent and the environment. This is why the use of a scaled learning rate is proposed, where the learning rate value is modified according to the Adaline input values. The scaled learning rate makes learning steps of similar magnitude independently of the input values. It is also significantly easier to determine a suitable value for the scaled learning rate than it is for a fixed learning rate.

After this introduction, Section 2 gives background information about gradient descent learning and RL. Section 3 defines the scaled learning rate, followed by experimental results in Section 4. Related work is treated in Section 5, followed by conclusions.

2 GRADIENT DESCENT REINFORCEMENT LEARNING

In gradient descent learning, the free parameters of a model are gradually modified so that the difference between the output given by the model and the corresponding correct or target value becomes as small as possible for all training samples available. In such supervised learning, each training sample consists of input values and the corresponding target values. Real-world training samples typically involve noise, which means that it is not possible to obtain a model that would give the exact target value for all training samples. The goal of learning is rather to minimize a statistical error measure, e.g. the Root Mean Square Error (RMSE)

    RMSE_i = sqrt( (1/M) * sum_{k=1}^{M} (t_{i,k} - y_{i,k})^2 )    (1)

where M is the number of training examples, t_{i,k} is the target value for output i and training sample k, and y_{i,k} is the model output for output i and training sample k. In RL tasks, each output typically corresponds to one possible action.

RL differs from supervised learning at least in the following ways:
- The agent has to collect the training samples by exploring the environment, which forces it to keep a balance between exploring the environment for new training samples and exploiting what it has learned from the existing ones (the exploration/exploitation trade-off). In supervised learning, all training samples are usually pre-collected into a training set, so learning can be performed off-line.
- Target values are only available for the used actions. In supervised learning, target values are typically provided for all actions.
- The target value is not always available directly; it may be available only after the agent has performed several actions. Then we speak about a delayed reward learning task.

RL methods usually model the environment as a Markov Decision Process (MDP), where every state of the environment needs to be uniquely identifiable. This is why the model used for RL learning is often a simple lookup-table, where each environment state corresponds to one row (or column) in the table and the columns (or rows) correspond to possible actions. The values of the table express how good each action is in the given state.
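As a small illustration of the error measure (1) above, the RMSE for one output over M training samples can be computed as follows (the function and variable names are my own, not from the paper):

```python
import math

def rmse(targets, outputs):
    """Root Mean Square Error, equation (1), for one output over M samples."""
    m = len(targets)
    return math.sqrt(sum((t - y) ** 2 for t, y in zip(targets, outputs)) / m)

# Two noisy samples: squared errors 9 and 16, so the mean squared error is 12.5
print(rmse([10.0, 20.0], [13.0, 16.0]))  # sqrt(12.5) = 3.5355...
```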
Lookup-tables are not suitable for tasks involving sensor noise or other reasons for the agent not being able to uniquely identify the current state of the environment (such tasks are called hidden state tasks). This is one of the reasons for using state generalization techniques instead of lookup-tables. Generalization in RL is based on the assumption that an action that is good in some state is probably good also in similar states. Various classification techniques have been used for identifying similar states. Some kind of ANN is typically used for the generalization. ANNs can handle any state descriptions, not only discrete ones. Therefore they are well adapted for problems involving continuous-valued state variables and noise, which is usually the case in robotics applications.

2.1 Gradient descent learning with Adalines

The simplest ANN is the linear Adaline (Widrow & Hoff, 1960), where neurons calculate their output value as a weighted sum of their input values

    y_i(s) = sum_{j=1}^{N} w_{i,j} * s_j    (2)

where w_{i,j} is the weight of neuron i associated with the neuron's input j, y_i(s) is the output value of neuron i, s_j is the value of input j and N is the number of inputs. Adalines are trained using the Widrow-Hoff training rule (Widrow & Hoff, 1960)

    w_{i,j}^new = w_{i,j} + alpha * (t_i - y_i) * s_j    (3)

where alpha is a learning rate. The Widrow-Hoff learning rule is obtained by inserting equation (2) into the RMSE expression and taking the partial derivative with respect to the weights w_{i,j}. It can easily be shown that there is only one optimal solution for the error as a function of the Adaline weights. Therefore gradient descent is guaranteed to converge if the learning rate is selected sufficiently small. When the back-propagation rule for gradient descent in multi-layer ANNs was developed (Rumelhart et al., 1988), it became possible to learn non-linear function approximations and classifications. Unfortunately, learning non-linear functions by gradient descent tends to be slow and to converge to locally optimal solutions.

2.2 Reinforcement learning

RL methods often assume that the environment can be modelled as an MDP. A (finite) MDP is a tuple M = (S, A, T, R), where:

S is a finite set of states; A = {a_1, ..., a_k} is a set of k actions; T = {P_sa(s') : s in S, a in A} are the next-state transition probabilities, with P_sa(s') giving the probability of transitioning to state s' upon taking action a in state s; and R specifies the reward values given in different states s in S.

RL methods are based on the notion of value functions. Value functions are either state-values (i.e. the value of a state) or action-values (i.e. the value of taking an action in a given state). The value of a state s in S can be defined formally as

    V^pi(s) = E_pi{ sum_{k=0}^{inf} gamma^k * r_{t+k+1} | s_t = s }    (4)

where V^pi(s) is the state value that corresponds to the expected return when starting in s and following policy pi thereafter. The factor r_{t+k+1} is the reward obtained when arriving into states s_{t+1}, s_{t+2} etc., and gamma is a discounting factor that determines to what degree future rewards affect the value of state s. Action value functions are usually denoted Q(s,a), where a in A. In control applications, the goal of RL is to learn an action-value function that allows the agent to use a policy that is as close to optimal as possible. However, since the action-values are initially unknown, the agent first has to explore the environment in order to learn it.

2.2.1 Exploring the environment for training samples

Although convergence is guaranteed for Widrow-Hoff learning in Adalines, in RL tasks convergence of gradient descent cannot always be guaranteed even for Adalines (Boyan & Moore, 1995). This is because the agent itself has to explore the environment and collect training samples by doing so. If the action selection policy pi does not make the agent collect relevant and representative training samples, then learning may fail to converge to a good solution. Therefore, the action selection policy must provide sufficient exploration of the environment to ensure that good training samples are collected. At the same time, the goal of training is to improve the performance of the agent, i.e. the action selection policy, so that the learned model can be exploited. Balancing the exploration/exploitation trade-off is one of the most difficult problems in RL for control (Thrun, 1992). A random search policy achieves maximal exploration, while a greedy policy gives maximal exploitation by always taking the action that has the highest action value.
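The random and greedy policies above, together with the ε-greedy and Boltzmann rules discussed below, can be sketched as follows (a minimal illustration with my own function names, assuming action values are held in a list):

```python
import math
import random

def greedy(q_values):
    """Maximal exploitation: pick the action with the highest value estimate."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_greedy(q_values, epsilon):
    """Greedy with probability 1 - epsilon, uniform random otherwise."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return greedy(q_values)

def boltzmann(q_values, temperature):
    """Selection probabilities proportional to exp(Q/T), as in equation (5)."""
    weights = [math.exp(q / temperature) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]

random.seed(0)
q = [0.2, 0.8, 0.5]
picks = [epsilon_greedy(q, 0.1) for _ in range(1000)]
print(picks.count(greedy(q)) / 1000)  # mostly the greedy action 1
```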
A commonly used method for balancing exploration and exploitation is ε-greedy exploration, where the greedy action is selected with probability (1 - ε) and an arbitrary action is selected with probability ε using a uniform probability distribution. This method is an undirected exploration method in the sense that it does not use any task-specific information. Another undirected exploration method selects actions according to Boltzmann-distributed probabilities

    Prob(a_i) = exp(Q(s, a_i)/T) / sum_k exp(Q(s, a_k)/T)    (5)

where T (called temperature) adjusts the randomness of action selection. The main difference between this method and ε-greedy exploration is that non-greedy actions are selected with a probability that is proportional to their current value estimate, while ε-greedy exploration selects non-greedy actions randomly.

Directed exploration uses task-specific knowledge for guiding exploration. Many such methods guide the exploration so that the entire state space would be explored in order to learn the value function as well as possible. In real-world tasks exhaustive exploration may be impossible or dangerous. However, a technique called optimistic initial values offers a possibility of encouraging exploration of previously un-encountered states mainly in the beginning of exploration. It can be implemented by using initial value function estimates that are bigger than the expected ones. This gives the effect that unused actions have bigger action value estimates than used ones, so unused actions tend to be selected rather than already used actions. When all actions have been used a sufficient number of times, the true value function overrides the initial value function estimates. In this paper, ε-greedy exploration is used for exploration. The effect of using optimistic initial values on learning is also studied.

2.2.2 Delayed reward

When a reward is not immediate for every state transition, rewards somehow need to be propagated backwards through the state history. Temporal Difference (TD) methods (Sutton, 1988) are currently the most used RL methods for handling delayed reward. TD methods update the value of a state not only based on immediate reward, but also based on the discounted value of the next state of the agent.
Therefore TD methods update the value function on every state transition, not only after transitions that result in a direct reward.

(Footnote: Thrun (1992) calls this semi-uniform distributed exploration.)

When a reward has been temporally back-propagated a sufficient number of times, these discounted reward values can be used as target values for gradient descent learning (Barto, Sutton & Watkins, 1990). Such gradient descent learning allows using almost any ANN as the model for RL, but unfortunately the MDP assumption underlying TD methods often gives convergence problems. Delayed reward tasks are out of the scope of this paper, which is the reason why such tasks are not analysed more deeply here. Good overviews on delayed reward are (Kaelbling, Littman & Moore, 1996) and (Sutton & Barto, 1998). The main goal of this paper is to show how scaling the Adaline learning rate improves learning, which is illustrated using an immediate reward RL task.

3 SCALED LEARNING RATE

In methods based on gradient descent, the learning rate has a great influence on the learning speed and on whether learning succeeds at all. In this section it is shown why the learning rate should be scaled as a function of the input values of Adaline-type ANNs. Scaling the learning rate of other ANNs is also discussed.

3.1 Adaline learning rate

If we combine equations (2) and (3), we obtain the following expression for the new output value after updating the Adaline weights using the Widrow-Hoff learning rule:

    y_i^new(s) = sum_j (w_{i,j} + alpha*(t_i - y_i)*s_j) * s_j
               = sum_j w_{i,j}*s_j + alpha*(t_i - y_i) * sum_j s_j^2
               = y_i + alpha * (t_i - y_i) * sum_j s_j^2    (6)

where y_i^new(s) is the new output value for output i after the weight update when the input values s are presented again. If the learning rate is set to alpha = 1, then y_i^new(s) becomes exactly equal to t_i if the expression

    s_j * (t_i - y_i)    (7)

is multiplied by

    1 / sum_j s_j^2    (8)

Then, by continuing from equation (6):

    y_i^new(s) = y_i + alpha * (t_i - y_i) * (sum_j s_j^2 / sum_j s_j^2)
               = y_i + alpha * (t_i - y_i)    (9)

when the learning rate alpha is multiplied by expression (8). Multiplying (7) by (8) scales the weight modification in such a way that the new output value will approach the target value with the ratio given by the learning rate, independently of the input values s_j. In the rest of the paper, the term scaled learning rate (slr) will be used for denoting a learning rate that is multiplied by expression (8).
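As a sketch (the function names and numbers are my own, not from the paper's implementation), one Widrow-Hoff update (3) with the learning rate multiplied by expression (8) behaves exactly as equation (9) states: the output moves towards the target by the fraction slr, regardless of the input magnitude:

```python
def scaled_update(weights, inputs, target, slr):
    """Widrow-Hoff update (3) with the rate divided by sum(s_j^2), expr. (8)."""
    y = sum(w * s for w, s in zip(weights, inputs))
    lr = slr / sum(s * s for s in inputs)   # the squared sum must not be zero
    return [w + lr * (target - y) * s for w, s in zip(weights, inputs)]

for inputs in ([10.0, 12.0, 9.0], [80.0, 75.0, 78.0]):  # small vs large inputs
    w0 = [0.0, 0.0, 0.0]
    w1 = scaled_update(w0, inputs, target=1.0, slr=0.8)
    y1 = sum(w * s for w, s in zip(w1, inputs))
    print(round(y1, 6))  # 0.8 of the way from 0 to the target, for both scales
```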
If the value of the (un-scaled) learning rate is greater than the value given by expression (8), then the weights are modified so that the new output value will be on the opposite side of the target value in the gradient direction. Such overshooting easily leads to uncontrolled weight modifications, where weight values tend to oscillate towards infinity. This kind of weight oscillation usually makes learning fail completely.

The squared sum of input values in expression (8) cannot be allowed to be zero. This can only happen if all input values s_j are zero. However, the bias term used in most Adaline implementations avoids this. A bias term is a supplementary input with a constant value of one, by which expression (2) can represent any linear function, no matter what the input space dimensionality is. Most multi-layer ANNs implicitly also avoid situations where all Adaline input values would be zero at the same time. This is studied in more detail in the following section.

3.2 Non-linear, multi-layer ANNs

In ANNs, neurons are usually organized into layers, where the output values of neurons in a layer are independent of each other and can therefore be calculated simultaneously. Figure 1 shows a feed-forward ANN with one input and one output (there may be an arbitrary number of both). ANN input values are distributed as input values to the neurons of the hidden layer. Hidden neurons usually have a non-linear output function with values limited to the range [0, 1]. Some non-linear output functions have values limited to the range [-1, 1].

Figure 1. Three-layer feed-forward ANN with sigmoid outputs in the hidden layer and a linear output layer.

A commonly used output function for hidden neurons is the sigmoid function

    o = f(A) = 1 / (1 + e^(-A))    (10)

where o is the output value and A is the weighted sum of the inputs (2). The output neurons of the output layer may be linear or non-linear. If they are linear, they are usually Adalines. Learning in multi-layer ANNs can be performed in many ways. For ANNs like the one in Figure 1, a gradient descent method called back-propagation is the most used one (Rumelhart et al., 1988). For such learning, two questions arise:
- Is it possible to use a scaled learning rate also for non-linear neurons?
- If the output layer (or some other layer) is an Adaline, is it then useful to use the scaled learning rate?

The answer to the first question is probably no. The reason for this is that functions like the sigmoid function (10) do not allow arbitrary output values. Therefore, if the target value is outside the range of possible output values, then it is not possible to find weights that would modify the output value so that it would become exactly equal to the target value. Instead, the learning rate of non-linear neurons is usually dynamically adapted depending on an estimation of the steepness of the gradient over several training steps (Haykin, 1999).

The answer to the second question is maybe. If Adaline inputs are limited to the range [0, 1], then the squared sum in (8) remains limited. Still, in the beginning of training, hidden neuron outputs are generally very small. Then a scaled learning rate might accelerate output layer learning, while slowing down when hidden neurons become more activated. This could be an interesting direction of future research to investigate.

4 EXPERIMENTAL RESULTS

Experiments were performed using a robot built with the Lego Mindstorms Robotics Invention System (RIS). The RIS offers a cheap, standard and simple platform for performing real-world tests.
In addition to Lego building blocks, it includes two electrical motors, two touch sensors and one light sensor. The main block contains a small computer (RCX) with connectors for motors and sensors. Among others, the Java programming language can be used for programming the RCX.

Figure 2. Lego Mindstorms robot. The light sensor is at the top in the front, directed forwards. One touch sensor is installed at the front and another at the rear.

The robot had one motor on each side, touch sensors in the front and in the back, and a light sensor directed straight forward mounted in the front (Figure 2). Robots usually have more than one light sensor, which were simulated here by turning the robot around and getting light readings from three different directions. One light reading was from the direction straight forward and the two others about 25 degrees left/right, obtained by letting one motor go forward and the other motor backward for 250 milliseconds and then inversing the operation. The light sensor reading from the forward direction after performing an action is directly used as the reward value, thus avoiding hand tuning of the reward function. Five actions are used, which consist in doing one of the following motor commands for 450 milliseconds: 1) both motors forward, 2/3) one forward, the other stopped, 4/5) one forward, the other backward. Going straight forward means advancing about 5 cm, actions 2/3 going forward about 2 cm and turning about 15 degrees, and actions 4/5 turning about 40 degrees without advancing.
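The learning setup just described (three light readings as state variables, five actions, the forward light reading as immediate reward) can be sketched as a single learning step. This is my own illustrative reconstruction, not the paper's code: `read_reward` stands in for the robot's light sensor, and all names and numbers are assumptions.

```python
import random

N_ACTIONS, N_INPUTS = 5, 3   # 5 motor commands, 3 light readings

def one_learning_step(weights, state, epsilon, slr, read_reward):
    """One immediate-reward RL step: epsilon-greedy action choice, then a
    Widrow-Hoff update with the scaled learning rate on the output of the
    chosen action only (target values exist only for used actions)."""
    q = [sum(w * s for w, s in zip(weights[a], state)) for a in range(N_ACTIONS)]
    if random.random() < epsilon:
        action = random.randrange(N_ACTIONS)
    else:
        action = max(range(N_ACTIONS), key=lambda a: q[a])
    reward = read_reward(action)             # forward light reading after acting
    lr = slr / sum(s * s for s in state)     # scaled learning rate
    weights[action] = [w + lr * (reward - q[action]) * s
                       for w, s in zip(weights[action], state)]
    return action

random.seed(1)
# optimistic initial weights in [0, 1)
weights = [[random.random() for _ in range(N_INPUTS)] for _ in range(N_ACTIONS)]
state = [10.0, 12.0, 9.0]                    # left, middle, right light readings
one_learning_step(weights, state, epsilon=0.1, slr=0.5,
                  read_reward=lambda a: 11.0)
```

After the update, the chosen action's value estimate has moved the fraction slr of the way towards the observed reward, as in equation (9).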

The robot starts about 100 centimetres from the lamp, initially directed straight towards it. Reaching a light value of 80 out of 100 signifies that the goal is reached, which means one to fifteen centimetres from the lamp depending on the approach direction and sensor noise. In order to reach the goal light value, the robot has to be very precisely directed straight towards the lamp. The lamp is on the floor level and gives directed light in a half-sphere in front of it. If the robot hits an obstacle or drives behind the lamp, then it is manually put back to the start position and direction.

The test room is an office room with noise due to floor reflections, walls and shelves with different colours etc. The robot itself is also a source of noise due to imprecise motor movements, battery charge etc. However, the light sensor is clearly the biggest source of noise, as shown in Figure 3, where light sensor samples are indicated for two different levels of luminosity.

Figure 3. 50 light samples for two different light conditions, taken at 50-millisecond intervals. Average values are approximately 10 and 60. Raw values are shown to the left, value distributions to the right.

Table 1. Hand-coded weights. One row per action, one column per state variable (light sensor reading).

    Action          Left   Middle   Right
    Forward         0.1    0.8      0.1
    Left/forward    0.6    0.3      0.1
    Right/forward   0.1    0.3      0.6
    Left            0.6    0.2      0.2
    Right           0.2    0.2      0.6

When using an ANN there is one output per action, where the output value corresponds to the action-value estimate of the corresponding action. With five actions and three state variables, a 5x3 weight matrix can represent the weights (no bias input is used here). A hand-coded agent with predefined weights (Table 1) was used in order to prove that an Adaline linear function approximator can solve the control task, and as a reference for judging how good the performance is for learning agents. These weights were determined based on the principle that if the light value is greatest in the middle, then make the forward-going action have the biggest output value.
In the same way, if the light value is greater to the left, then favour some left-turning action, and vice versa for the right side. The hand-coded agent reached the goal after a 30-episode average of about 70 steps.

Learning agents used the same Adaline architecture as the hand-coded agent. Weights are modified by the Widrow-Hoff training rule (3). All agents used ε-greedy exploration with ε = 0.1, which seemed to be the best value after experimentation. Tests were performed both with weights initialised to small random values in the range [0, 0.1) and with weights having optimistic initial values in the range [0, 1). Such weights are optimistic because their expected sum per action neuron is 1.5, while weight values after training should converge to values whose sum is close to one. This is because state variable values and reward values are all light sensor readings, so the estimated reward value should be close to at least one of the state variable values. If the RL is successful, then the estimated reward should even be a little bit bigger, since the goal by definition of RL is to make the agent move towards states giving higher reward.

An un-scaled learning rate value of 0.0001 was used after a lot of experimentation. This value is a compromise. Far from the lamp, light sensor values are about 10, so expression (8) gives the value 1/(10^2 + 10^2 + 10^2) = 0.00333. Close to the lamp, light sensor values approach 80, so the corresponding value would be about 0.00005. According to expression (8), the un-scaled learning rate value 0.0001 corresponds to light sensor values around 58, which is already close to the lamp. Therefore, excessive weight modifications probably occur close to the lamp, but then the number of steps remaining to reach the goal is usually so small that the weights do not have time to oscillate.

Figure 4. Comparison between static (lr = 0.0001) and scaled (slr = 0.8) learning rate. Averages over multiple runs.

Figure 4 compares the performance of an agent using the un-scaled learning rate 0.0001 and an agent using slr = 0.8. With this scaled learning rate, the first episode is slightly faster. Convergence is also much smoother with the scaled learning rate than with the un-scaled learning rate. The statistics shown in Table 2 further emphasize the advantage of using a scaled learning rate. The total number of steps is slightly smaller, but the average length of the last five episodes is clearly lower for the agents using the scaled learning rate. This is most probably because the un-scaled learning rate sometimes causes excessive weight modifications that prevent the agent from converging to optimal weights. The number of manual resets is a further indication of excessive weight modifications. One bad light sensor reading may be sufficient to make the robot get stuck into using the same action for several steps. If that action happens to be going straight forward, then the robot usually hits a wall after a few steps. Reducing the value of the un-scaled learning rate could reduce this phenomenon, but it would also make learning slower.

Table 2. Statistics for agents using different learning rates. Averages over multiple runs.

    Agent         Total   Aver. 5 last   Man. resets
    lr = 0.0001     …          …              …
    slr = 0.1       …          …              …
    slr = 0.5       …          …              …
    slr = 0.8       …          …              …

Figure 5 compares the performance of agents using different values for the scaled learning rate. All graphs are smooth and converge nearly as rapidly. This result shows that the scaled learning rate is tolerant to different values. The meaning of the scaled learning rate is also easy to understand (i.e. the percentage of modification of the output value towards the target value), so determining a good value for the scaled learning rate is easier than for the un-scaled learning rate.

Figure 5. Results for different values of the scaled learning rate (slr = 0.1, 0.5, 0.8). Averages over multiple runs.

Figure 6 and Table 3 compare the performance of agents whose weights are initialised with random values from the range [0, 0.1) and agents whose weights are initialised with optimistic initial values, i.e. random values from the range [0, 1). Using optimistic initial values clearly gives faster initial exploration.
The number of manual resets with optimistic initial values is also lower for agents using slr = 0.5 and slr = 0.8, but it is instead higher for lr = 0.0001 and slr = 0.1. Finally, when setting ε to 0 after the training episodes, i.e. always taking the greedy action, the trained agents had identical performance to the hand-coded agent. However, one should remember that learning agents could also adapt to changes in the environment or differences in sensor sensitivity, for instance.

Figure 6. Results for different initial weights. Averages over multiple runs.

Table 3. Statistics for different initial weights. Averages over multiple runs.

    Initial weights   Total   Aver. 5 last   Man. resets
    [0, 0.1)            …          …              …
    [0, 1)              …          …              …

5 RELATED WORK

The amount of literature on gradient descent learning is abundant. One of the most recent and exhaustive sources on the subject is (Haykin, 1999). Adjusting the Adaline learning rate has been studied previously at least by Luo (1991), who shows that the Adaline learning rate should be reduced during learning in order to avoid cyclically jumping around the optimal solution. References in (Luo, 1991) also offer a good overview of research concerning the gradient descent learning rate. However, to the author's knowledge, the concept of a scaled learning rate introduced in this paper is new.

RL has been used in many robotic tasks, but most of them have been performed in simulated environments. Only a few results have been reported

on the use of RL on real robots. The experimental setting used here resembles behaviour learning performed by Lin (1991) and Mahadevan & Connell (1992). Behavioural tasks treated by them include wall following, going through a door, docking into a charger (guided by light sensors), finding boxes, pushing boxes and getting un-wedged from stalled states. Some of these behaviours are more challenging than the light-seeking behaviour used in this paper, but the simple linear Adaline model used here for state generalization greatly simplifies the learning task compared to previous work. An example of non-RL work on light-seeking robots is Lebeltel et al. (2004).

6 CONCLUSIONS

One of the most important advantages of the scaled learning rate presented in this paper is that it is easy to understand the signification of the values used for it. Evidence is also shown that the scaled learning rate improves learning because it makes the network output values approach the corresponding target values by a similar amount independently of the input values. Experimental results with a real-world light-seeking robot illustrate the improvement in learning results obtained by using the scaled learning rate.

It seems rather surprising that a scaled learning rate has not been used before, to the author's best knowledge. One explanation might be that in supervised learning tasks, the training samples are usually available beforehand, which makes it possible to normalize them into suitable values. In real-world RL tasks, with constraints on learning time and the availability of training samples, this may not be possible. Using multi-layer non-linear ANNs also might reduce the utility of scaling the learning rate, as explained in Section 3.2.

In addition to the scaled learning rate, the RL exploration/exploitation trade-off is also addressed in the paper. The exploration policy used determines the quality of collected training samples and therefore greatly affects learning speed and the quality of learned solutions. Empirical results are shown mainly on the advantages of using optimistic initial values for the network weights when possible. Future work includes improving exploration policies and handling delayed reward.
Obtaining further results on the use of the scaled learning rate for other than RL tasks would also be useful.

ACKNOWLEDGEMENTS

I would like to thank Brian Bagnall for writing the article "Building a Light-Seeking Robot with Q-Learning", published on-line on April 9th. It gave me the idea to use the Lego Mindstorms kit, and his source code was of valuable help.

REFERENCES

Barto, A.G., Sutton, R.S., Watkins, C.J.C.H. (1990). Learning and Sequential Decision Making. In M. Gabriel and J. Moore (eds.), Learning and computational neuroscience: foundations of adaptive networks. M.I.T. Press.

Boyan, J.A., Moore, A.W. (1995). Generalization in Reinforcement Learning: Safely Approximating the Value Function. In Tesauro, G., Touretzky, D., Leen, T. (eds.), NIPS 1994 proceedings, Vol. 7. MIT Press.

Haykin, S. (1999). Neural Networks - a comprehensive foundation. Prentice-Hall, New Jersey, USA.

Kaelbling, L.P., Littman, M.L., Moore, A.W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, Vol. 4, 237-285.

Lebeltel, O., Bessière, P., Diard, J., Mazer, E. (2004). Bayesian Robot Programming. Autonomous Robots, Vol. 16.

Lin, L.-J. (1991). Programming robots using reinforcement learning and teaching. In Proc. of the Ninth National Conference on Artificial Intelligence (AAAI).

Luo, Z. (1991). On the convergence of the LMS algorithm with adaptive learning rate for linear feedforward networks. Neural Computation, Vol. 3.

Mahadevan, S., Connell, J. (1992). Automatic Programming of Behavior-based Robots using Reinforcement Learning. Artificial Intelligence, Vol. 55, Nos. 2-3.

Rumelhart, D.E., McClelland, J.L. et al. (1988). Parallel Distributed Processing, Vol. 1. MIT Press, Massachusetts.

Sutton, R.S. (1988). Learning to predict by the method of temporal differences. Machine Learning, Vol. 3, 9-44.

Sutton, R.S., Barto, A.G. (1998). Reinforcement Learning. MIT Press, Cambridge, MA.

Thrun, S.B. (1992). The role of exploration in learning control. In D.A. White & D.A. Sofge (eds.), Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches. Van Nostrand Reinhold, New York.

Widrow, B., Hoff, M.E. (1960). Adaptive switching circuits. 1960 WESCON Convention Record, Part IV, Institute of Radio Engineers, New York, 96-104.


More information

ESCI 342 Atmospheric Dynamics I Lesson 1 Vectors and Vector Calculus

ESCI 342 Atmospheric Dynamics I Lesson 1 Vectors and Vector Calculus ESI 34 tmospherc Dnmcs I Lesson 1 Vectors nd Vector lculus Reference: Schum s Outlne Seres: Mthemtcl Hndbook of Formuls nd Tbles Suggested Redng: Mrtn Secton 1 OORDINTE SYSTEMS n orthonorml coordnte sstem

More information

Variable time amplitude amplification and quantum algorithms for linear algebra. Andris Ambainis University of Latvia

Variable time amplitude amplification and quantum algorithms for linear algebra. Andris Ambainis University of Latvia Vrble tme mpltude mplfcton nd quntum lgorthms for lner lgebr Andrs Ambns Unversty of Ltv Tlk outlne. ew verson of mpltude mplfcton;. Quntum lgorthm for testng f A s sngulr; 3. Quntum lgorthm for solvng

More information

INTRODUCTION TO COMPLEX NUMBERS

INTRODUCTION TO COMPLEX NUMBERS INTRODUCTION TO COMPLEX NUMBERS The numers -4, -3, -, -1, 0, 1,, 3, 4 represent the negtve nd postve rel numers termed ntegers. As one frst lerns n mddle school they cn e thought of s unt dstnce spced

More information

In this Chapter. Chap. 3 Markov chains and hidden Markov models. Probabilistic Models. Example: CpG Islands

In this Chapter. Chap. 3 Markov chains and hidden Markov models. Probabilistic Models. Example: CpG Islands In ths Chpter Chp. 3 Mrov chns nd hdden Mrov models Bontellgence bortory School of Computer Sc. & Eng. Seoul Ntonl Unversty Seoul 5-74, Kore The probblstc model for sequence nlyss HMM (hdden Mrov model)

More information

523 P a g e. is measured through p. should be slower for lesser values of p and faster for greater values of p. If we set p*

523 P a g e. is measured through p. should be slower for lesser values of p and faster for greater values of p. If we set p* R. Smpth Kumr, R. Kruthk, R. Rdhkrshnn / Interntonl Journl of Engneerng Reserch nd Applctons (IJERA) ISSN: 48-96 www.jer.com Vol., Issue 4, July-August 0, pp.5-58 Constructon Of Mxed Smplng Plns Indexed

More information

Two Coefficients of the Dyson Product

Two Coefficients of the Dyson Product Two Coeffcents of the Dyson Product rxv:07.460v mth.co 7 Nov 007 Lun Lv, Guoce Xn, nd Yue Zhou 3,,3 Center for Combntorcs, LPMC TJKLC Nnk Unversty, Tnjn 30007, P.R. Chn lvlun@cfc.nnk.edu.cn gn@nnk.edu.cn

More information

Demand. Demand and Comparative Statics. Graphically. Marshallian Demand. ECON 370: Microeconomic Theory Summer 2004 Rice University Stanley Gilbert

Demand. Demand and Comparative Statics. Graphically. Marshallian Demand. ECON 370: Microeconomic Theory Summer 2004 Rice University Stanley Gilbert Demnd Demnd nd Comrtve Sttcs ECON 370: Mcroeconomc Theory Summer 004 Rce Unversty Stnley Glbert Usng the tools we hve develoed u to ths ont, we cn now determne demnd for n ndvdul consumer We seek demnd

More information

Online Appendix to. Mandating Behavioral Conformity in Social Groups with Conformist Members

Online Appendix to. Mandating Behavioral Conformity in Social Groups with Conformist Members Onlne Appendx to Mndtng Behvorl Conformty n Socl Groups wth Conformst Members Peter Grzl Andrze Bnk (Correspondng uthor) Deprtment of Economcs, The Wllms School, Wshngton nd Lee Unversty, Lexngton, 4450

More information

International Journal of Pure and Applied Sciences and Technology

International Journal of Pure and Applied Sciences and Technology Int. J. Pure Appl. Sc. Technol., () (), pp. 44-49 Interntonl Journl of Pure nd Appled Scences nd Technolog ISSN 9-67 Avlle onlne t www.jopst.n Reserch Pper Numercl Soluton for Non-Lner Fredholm Integrl

More information

A Tri-Valued Belief Network Model for Information Retrieval

A Tri-Valued Belief Network Model for Information Retrieval December 200 A Tr-Vlued Belef Networ Model for Informton Retrevl Fernndo Ds-Neves Computer Scence Dept. Vrgn Polytechnc Insttute nd Stte Unversty Blcsburg, VA 24060. IR models t Combnng Evdence Grphcl

More information

Reinforcement learning II

Reinforcement learning II CS 1675 Introduction to Mchine Lerning Lecture 26 Reinforcement lerning II Milos Huskrecht milos@cs.pitt.edu 5329 Sennott Squre Reinforcement lerning Bsics: Input x Lerner Output Reinforcement r Critic

More information

Improving Anytime Point-Based Value Iteration Using Principled Point Selections

Improving Anytime Point-Based Value Iteration Using Principled Point Selections In In Proceedngs of the Twenteth Interntonl Jont Conference on Artfcl Intellgence (IJCAI-7) Improvng Anytme Pont-Bsed Vlue Iterton Usng Prncpled Pont Selectons Mchel R. Jmes, Mchel E. Smples, nd Dmtr A.

More information

Smart Motorways HADECS 3 and what it means for your drivers

Smart Motorways HADECS 3 and what it means for your drivers Vehcle Rentl Smrt Motorwys HADECS 3 nd wht t mens for your drvers Vehcle Rentl Smrt Motorwys HADECS 3 nd wht t mens for your drvers You my hve seen some news rtcles bout the ntroducton of Hghwys Englnd

More information

3/6/00. Reading Assignments. Outline. Hidden Markov Models: Explanation and Model Learning

3/6/00. Reading Assignments. Outline. Hidden Markov Models: Explanation and Model Learning 3/6/ Hdden Mrkov Models: Explnton nd Model Lernng Brn C. Wllms 6.4/6.43 Sesson 2 9/3/ courtesy of JPL copyrght Brn Wllms, 2 Brn C. Wllms, copyrght 2 Redng Assgnments AIMA (Russell nd Norvg) Ch 5.-.3, 2.3

More information

A Reinforcement Learning System with Chaotic Neural Networks-Based Adaptive Hierarchical Memory Structure for Autonomous Robots

A Reinforcement Learning System with Chaotic Neural Networks-Based Adaptive Hierarchical Memory Structure for Autonomous Robots Interntonl Conference on Control, Automton nd ystems 008 Oct. 4-7, 008 n COEX, eoul, Kore A Renforcement ernng ystem wth Chotc Neurl Networs-Bsed Adptve Herrchcl Memory tructure for Autonomous Robots Msno

More information

Electrochemical Thermodynamics. Interfaces and Energy Conversion

Electrochemical Thermodynamics. Interfaces and Energy Conversion CHE465/865, 2006-3, Lecture 6, 18 th Sep., 2006 Electrochemcl Thermodynmcs Interfces nd Energy Converson Where does the energy contrbuton F zϕ dn come from? Frst lw of thermodynmcs (conservton of energy):

More information

SUMMER KNOWHOW STUDY AND LEARNING CENTRE

SUMMER KNOWHOW STUDY AND LEARNING CENTRE SUMMER KNOWHOW STUDY AND LEARNING CENTRE Indices & Logrithms 2 Contents Indices.2 Frctionl Indices.4 Logrithms 6 Exponentil equtions. Simplifying Surds 13 Opertions on Surds..16 Scientific Nottion..18

More information

Simultaneous estimation of rewards and dynamics from noisy expert demonstrations

Simultaneous estimation of rewards and dynamics from noisy expert demonstrations Smultneous estmton of rewrds nd dynmcs from nosy expert demonstrtons Mchel Hermn,2, Tobs Gndele, Jo rg Wgner, Felx Schmtt, nd Wolfrm Burgrd2 - Robert Bosch GmbH - 70442 Stuttgrt - Germny 2- Unversty of

More information

Statistics 423 Midterm Examination Winter 2009

Statistics 423 Midterm Examination Winter 2009 Sttstcs 43 Mdterm Exmnton Wnter 009 Nme: e-ml: 1. Plese prnt your nme nd e-ml ddress n the bove spces.. Do not turn ths pge untl nstructed to do so. 3. Ths s closed book exmnton. You my hve your hnd clcultor

More information

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite Unit #8 : The Integrl Gols: Determine how to clculte the re described by function. Define the definite integrl. Eplore the reltionship between the definite integrl nd re. Eplore wys to estimte the definite

More information

M/G/1/GD/ / System. ! Pollaczek-Khinchin (PK) Equation. ! Steady-state probabilities. ! Finding L, W q, W. ! π 0 = 1 ρ

M/G/1/GD/ / System. ! Pollaczek-Khinchin (PK) Equation. ! Steady-state probabilities. ! Finding L, W q, W. ! π 0 = 1 ρ M/G//GD/ / System! Pollcze-Khnchn (PK) Equton L q 2 2 λ σ s 2( + ρ ρ! Stedy-stte probbltes! π 0 ρ! Fndng L, q, ) 2 2 M/M/R/GD/K/K System! Drw the trnston dgrm! Derve the stedy-stte probbltes:! Fnd L,L

More information

Computing a complete histogram of an image in Log(n) steps and minimum expected memory requirements using hypercubes

Computing a complete histogram of an image in Log(n) steps and minimum expected memory requirements using hypercubes Computng complete hstogrm of n mge n Log(n) steps nd mnmum expected memory requrements usng hypercubes TAREK M. SOBH School of Engneerng, Unversty of Brdgeport, Connectcut, USA. Abstrct Ths work frst revews

More information

THE COMBINED SHEPARD ABEL GONCHAROV UNIVARIATE OPERATOR

THE COMBINED SHEPARD ABEL GONCHAROV UNIVARIATE OPERATOR REVUE D ANALYSE NUMÉRIQUE ET DE THÉORIE DE L APPROXIMATION Tome 32, N o 1, 2003, pp 11 20 THE COMBINED SHEPARD ABEL GONCHAROV UNIVARIATE OPERATOR TEODORA CĂTINAŞ Abstrct We extend the Sheprd opertor by

More information

Using Predictions in Online Optimization: Looking Forward with an Eye on the Past

Using Predictions in Online Optimization: Looking Forward with an Eye on the Past Usng Predctons n Onlne Optmzton: Lookng Forwrd wth n Eye on the Pst Nngjun Chen Jont work wth Joshu Comden, Zhenhu Lu, Anshul Gndh, nd Adm Wermn 1 Predctons re crucl for decson mkng 2 Predctons re crucl

More information

Review of linear algebra. Nuno Vasconcelos UCSD

Review of linear algebra. Nuno Vasconcelos UCSD Revew of lner lgebr Nuno Vsconcelos UCSD Vector spces Defnton: vector spce s set H where ddton nd sclr multplcton re defned nd stsf: ) +( + ) (+ )+ 5) λ H 2) + + H 6) 3) H, + 7) λ(λ ) (λλ ) 4) H, - + 8)

More information

Solution of Tutorial 5 Drive dynamics & control

Solution of Tutorial 5 Drive dynamics & control ELEC463 Unversty of New South Wles School of Electrcl Engneerng & elecommunctons ELEC463 Electrc Drve Systems Queston Motor Soluton of utorl 5 Drve dynmcs & control 500 rev/mn = 5.3 rd/s 750 rted 4.3 Nm

More information

Administrivia CSE 190: Reinforcement Learning: An Introduction

Administrivia CSE 190: Reinforcement Learning: An Introduction Administrivi CSE 190: Reinforcement Lerning: An Introduction Any emil sent to me bout the course should hve CSE 190 in the subject line! Chpter 4: Dynmic Progrmming Acknowledgment: A good number of these

More information

1 Probability Density Functions

1 Probability Density Functions Lis Yn CS 9 Continuous Distributions Lecture Notes #9 July 6, 28 Bsed on chpter by Chris Piech So fr, ll rndom vribles we hve seen hve been discrete. In ll the cses we hve seen in CS 9, this ment tht our

More information

Chapter 0. What is the Lebesgue integral about?

Chapter 0. What is the Lebesgue integral about? Chpter 0. Wht is the Lebesgue integrl bout? The pln is to hve tutoril sheet ech week, most often on Fridy, (to be done during the clss) where you will try to get used to the ides introduced in the previous

More information

Introduction to Numerical Integration Part II

Introduction to Numerical Integration Part II Introducton to umercl Integrton Prt II CS 75/Mth 75 Brn T. Smth, UM, CS Dept. Sprng, 998 4/9/998 qud_ Intro to Gussn Qudrture s eore, the generl tretment chnges the ntegrton prolem to ndng the ntegrl w

More information

Properties of Integrals, Indefinite Integrals. Goals: Definition of the Definite Integral Integral Calculations using Antiderivatives

Properties of Integrals, Indefinite Integrals. Goals: Definition of the Definite Integral Integral Calculations using Antiderivatives Block #6: Properties of Integrls, Indefinite Integrls Gols: Definition of the Definite Integrl Integrl Clcultions using Antiderivtives Properties of Integrls The Indefinite Integrl 1 Riemnn Sums - 1 Riemnn

More information

Physics 121 Sample Common Exam 2 Rev2 NOTE: ANSWERS ARE ON PAGE 7. Instructions:

Physics 121 Sample Common Exam 2 Rev2 NOTE: ANSWERS ARE ON PAGE 7. Instructions: Physcs 121 Smple Common Exm 2 Rev2 NOTE: ANSWERS ARE ON PAGE 7 Nme (Prnt): 4 Dgt ID: Secton: Instructons: Answer ll 27 multple choce questons. You my need to do some clculton. Answer ech queston on the

More information

Non-Linear Data for Neural Networks Training and Testing

Non-Linear Data for Neural Networks Training and Testing Proceedngs of the 4th WSEAS Int Conf on Informton Securty, Communctons nd Computers, Tenerfe, Spn, December 6-8, 005 (pp466-47) Non-Lner Dt for Neurl Networks Trnng nd Testng ABDEL LATIF ABU-DALHOUM MOHAMMED

More information

Effects of polarization on the reflected wave

Effects of polarization on the reflected wave Lecture Notes. L Ros PPLIED OPTICS Effects of polrzton on the reflected wve Ref: The Feynmn Lectures on Physcs, Vol-I, Secton 33-6 Plne of ncdence Z Plne of nterfce Fg. 1 Y Y r 1 Glss r 1 Glss Fg. Reflecton

More information

State Estimation in TPN and PPN Guidance Laws by Using Unscented and Extended Kalman Filters

State Estimation in TPN and PPN Guidance Laws by Using Unscented and Extended Kalman Filters Stte Estmton n PN nd PPN Gudnce Lws by Usng Unscented nd Extended Klmn Flters S.H. oospour*, S. oospour**, mostf.sdollh*** Fculty of Electrcl nd Computer Engneerng, Unversty of brz, brz, Irn, *s.h.moospour@gml.com

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Lerning Tom Mitchell, Mchine Lerning, chpter 13 Outline Introduction Comprison with inductive lerning Mrkov Decision Processes: the model Optiml policy: The tsk Q Lerning: Q function Algorithm

More information

On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization On-lne Renforcement Lernng Usng Incrementl Kernel-Bsed Stochstc Fctorzton André M. S. Brreto School of Computer Scence McGll Unversty Montrel, Cnd msb@cs.mcgll.c Don Precup School of Computer Scence McGll

More information

Two Activation Function Wavelet Network for the Identification of Functions with High Nonlinearity

Two Activation Function Wavelet Network for the Identification of Functions with High Nonlinearity Interntonl Journl of Engneerng & Computer Scence IJECS-IJENS Vol:1 No:04 81 Two Actvton Functon Wvelet Network for the Identfcton of Functons wth Hgh Nonlnerty Wsm Khld Abdulkder Abstrct-- The ntegrton

More information

Chapter 5 Supplemental Text Material R S T. ij i j ij ijk

Chapter 5 Supplemental Text Material R S T. ij i j ij ijk Chpter 5 Supplementl Text Mterl 5-. Expected Men Squres n the Two-fctor Fctorl Consder the two-fctor fxed effects model y = µ + τ + β + ( τβ) + ε k R S T =,,, =,,, k =,,, n gven s Equton (5-) n the textook.

More information

Bellman Optimality Equation for V*

Bellman Optimality Equation for V* Bellmn Optimlity Eqution for V* The vlue of stte under n optiml policy must equl the expected return for the best ction from tht stte: V (s) mx Q (s,) A(s) mx A(s) mx A(s) Er t 1 V (s t 1 ) s t s, t s

More information

Investigation phase in case of Bragg coupling

Investigation phase in case of Bragg coupling Journl of Th-Qr Unversty No.3 Vol.4 December/008 Investgton phse n cse of Brgg couplng Hder K. Mouhmd Deprtment of Physcs, College of Scence, Th-Qr, Unv. Mouhmd H. Abdullh Deprtment of Physcs, College

More information

CONTEXTUAL MULTI-ARMED BANDIT ALGORITHMS FOR PERSONALIZED LEARNING ACTION SELECTION. Indu Manickam, Andrew S. Lan, and Richard G.

CONTEXTUAL MULTI-ARMED BANDIT ALGORITHMS FOR PERSONALIZED LEARNING ACTION SELECTION. Indu Manickam, Andrew S. Lan, and Richard G. CONTEXTUAL MULTI-ARMED BANDIT ALGORITHMS FOR PERSONALIZED LEARNING ACTION SELECTION Indu Mnckm, Andrew S. Ln, nd Rchrd G. Brnuk Rce Unversty ABSTRACT Optmzng the selecton of lernng resources nd prctce

More information

7.2 Volume. A cross section is the shape we get when cutting straight through an object.

7.2 Volume. A cross section is the shape we get when cutting straight through an object. 7. Volume Let s revew the volume of smple sold, cylnder frst. Cylnder s volume=se re heght. As llustrted n Fgure (). Fgure ( nd (c) re specl cylnders. Fgure () s rght crculr cylnder. Fgure (c) s ox. A

More information

CALIBRATION OF SMALL AREA ESTIMATES IN BUSINESS SURVEYS

CALIBRATION OF SMALL AREA ESTIMATES IN BUSINESS SURVEYS CALIBRATION OF SMALL AREA ESTIMATES IN BUSINESS SURVES Rodolphe Prm, Ntle Shlomo Southmpton Sttstcl Scences Reserch Insttute Unverst of Southmpton Unted Kngdom SAE, August 20 The BLUE-ETS Project s fnnced

More information

Research Article On the Upper Bounds of Eigenvalues for a Class of Systems of Ordinary Differential Equations with Higher Order

Research Article On the Upper Bounds of Eigenvalues for a Class of Systems of Ordinary Differential Equations with Higher Order Hndw Publshng Corporton Interntonl Journl of Dfferentl Equtons Volume 0, Artcle ID 7703, pges do:055/0/7703 Reserch Artcle On the Upper Bounds of Egenvlues for Clss of Systems of Ordnry Dfferentl Equtons

More information

CIS587 - Artificial Intelligence. Uncertainty CIS587 - AI. KB for medical diagnosis. Example.

CIS587 - Artificial Intelligence. Uncertainty CIS587 - AI. KB for medical diagnosis. Example. CIS587 - rtfcl Intellgence Uncertnty K for medcl dgnoss. Exmple. We wnt to uld K system for the dgnoss of pneumon. rolem descrpton: Dsese: pneumon tent symptoms fndngs, l tests: Fever, Cough, leness, WC

More information

arxiv: v2 [cs.lg] 9 Nov 2017

arxiv: v2 [cs.lg] 9 Nov 2017 Renforcement Lernng under Model Msmtch Aurko Roy 1, Hun Xu 2, nd Sebstn Pokutt 2 rxv:1706.04711v2 cs.lg 9 Nov 2017 1 Google Eml: urkor@google.com 2 ISyE, Georg Insttute of Technology, Atlnt, GA, USA. Eml:

More information

Machine Learning Support Vector Machines SVM

Machine Learning Support Vector Machines SVM Mchne Lernng Support Vector Mchnes SVM Lesson 6 Dt Clssfcton problem rnng set:, D,,, : nput dt smple {,, K}: clss or lbel of nput rget: Construct functon f : X Y f, D Predcton of clss for n unknon nput

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 9

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 9 CS434/541: Pttern Recognton Prof. Olg Veksler Lecture 9 Announcements Fnl project proposl due Nov. 1 1-2 prgrph descrpton Lte Penlt: s 1 pont off for ech d lte Assgnment 3 due November 10 Dt for fnl project

More information

10/20/2009. Announcements. Last Time: Q-Learning. Example: Pacman. Function Approximation. Feature-Based Representations

10/20/2009. Announcements. Last Time: Q-Learning. Example: Pacman. Function Approximation. Feature-Based Representations Introducton to Artfcl Intellgence V.047-00 Fll 009 Lecture : Renforcement Lernng Announcement Agnment due next Mondy t mdnght Plee end eml to me bout fnl exm Rob Fergu Dept of Computer Scence, Cournt Inttute,

More information

The Schur-Cohn Algorithm

The Schur-Cohn Algorithm Modelng, Estmton nd Otml Flterng n Sgnl Processng Mohmed Njm Coyrght 8, ISTE Ltd. Aendx F The Schur-Cohn Algorthm In ths endx, our m s to resent the Schur-Cohn lgorthm [] whch s often used s crteron for

More information

Bi-level models for OD matrix estimation

Bi-level models for OD matrix estimation TNK084 Trffc Theory seres Vol.4, number. My 2008 B-level models for OD mtrx estmton Hn Zhng, Quyng Meng Abstrct- Ths pper ntroduces two types of O/D mtrx estmton model: ME2 nd Grdent. ME2 s mxmum-entropy

More information

4.4 Areas, Integrals and Antiderivatives

4.4 Areas, Integrals and Antiderivatives . res, integrls nd ntiderivtives 333. Ares, Integrls nd Antiderivtives This section explores properties of functions defined s res nd exmines some connections mong res, integrls nd ntiderivtives. In order

More information

Let us look at a linear equation for a one-port network, for example some load with a reflection coefficient s, Figure L6.

Let us look at a linear equation for a one-port network, for example some load with a reflection coefficient s, Figure L6. ECEN 5004, prng 08 Actve Mcrowve Crcut Zoy Popovc, Unverty of Colordo, Boulder LECURE 5 IGNAL FLOW GRAPH FOR MICROWAVE CIRCUI ANALYI In mny text on mcrowve mplfer (e.g. the clc one by Gonzlez), gnl flow-grph

More information

Reinforcement Learning with a Gaussian Mixture Model

Reinforcement Learning with a Gaussian Mixture Model Renforcement Lernng wth Gussn Mxture Model Alejndro Agostn, Member, IEEE nd Enrc Cely Abstrct Recent pproches to Renforcement Lernng (RL) wth functon pproxmton nclude Neurl Ftted Q Iterton nd the use of

More information

Fitting a Polynomial to Heat Capacity as a Function of Temperature for Ag. Mathematical Background Document

Fitting a Polynomial to Heat Capacity as a Function of Temperature for Ag. Mathematical Background Document Fttng Polynol to Het Cpcty s Functon of Teperture for Ag. thetcl Bckground Docuent by Theres Jul Zelnsk Deprtent of Chestry, edcl Technology, nd Physcs onouth Unversty West ong Brnch, J 7764-898 tzelns@onouth.edu

More information

Identification of Robot Arm s Joints Time-Varying Stiffness Under Loads

Identification of Robot Arm s Joints Time-Varying Stiffness Under Loads TELKOMNIKA, Vol.10, No.8, December 2012, pp. 2081~2087 e-issn: 2087-278X ccredted by DGHE (DIKTI), Decree No: 51/Dkt/Kep/2010 2081 Identfcton of Robot Arm s Jonts Tme-Vryng Stffness Under Lods Ru Xu 1,

More information

A Family of Multivariate Abel Series Distributions. of Order k

A Family of Multivariate Abel Series Distributions. of Order k Appled Mthemtcl Scences, Vol. 2, 2008, no. 45, 2239-2246 A Fmly of Multvrte Abel Seres Dstrbutons of Order k Rupk Gupt & Kshore K. Ds 2 Fculty of Scence & Technology, The Icf Unversty, Agrtl, Trpur, Ind

More information

Department of Mechanical Engineering, University of Bath. Mathematics ME Problem sheet 11 Least Squares Fitting of data

Department of Mechanical Engineering, University of Bath. Mathematics ME Problem sheet 11 Least Squares Fitting of data Deprtment of Mechncl Engneerng, Unversty of Bth Mthemtcs ME10305 Prolem sheet 11 Lest Squres Fttng of dt NOTE: If you re gettng just lttle t concerned y the length of these questons, then do hve look t

More information

1 Online Learning and Regret Minimization

1 Online Learning and Regret Minimization 2.997 Decision-Mking in Lrge-Scle Systems My 10 MIT, Spring 2004 Hndout #29 Lecture Note 24 1 Online Lerning nd Regret Minimiztion In this lecture, we consider the problem of sequentil decision mking in

More information

Course Review Introduction to Computer Methods

Course Review Introduction to Computer Methods Course Revew Wht you hopefully hve lerned:. How to nvgte nsde MIT computer system: Athen, UNIX, emcs etc. (GCR). Generl des bout progrmmng (GCR): formultng the problem, codng n Englsh trnslton nto computer

More information

19 Optimal behavior: Game theory

19 Optimal behavior: Game theory Intro. to Artificil Intelligence: Dle Schuurmns, Relu Ptrscu 1 19 Optiml behvior: Gme theory Adversril stte dynmics hve to ccount for worst cse Compute policy π : S A tht mximizes minimum rewrd Let S (,

More information

Study of Trapezoidal Fuzzy Linear System of Equations S. M. Bargir 1, *, M. S. Bapat 2, J. D. Yadav 3 1

Study of Trapezoidal Fuzzy Linear System of Equations S. M. Bargir 1, *, M. S. Bapat 2, J. D. Yadav 3 1 mercn Interntonl Journl of Reserch n cence Technology Engneerng & Mthemtcs vlble onlne t http://wwwsrnet IN (Prnt: 38-349 IN (Onlne: 38-3580 IN (CD-ROM: 38-369 IJRTEM s refereed ndexed peer-revewed multdscplnry

More information

7.2 The Definite Integral

7.2 The Definite Integral 7.2 The Definite Integrl the definite integrl In the previous section, it ws found tht if function f is continuous nd nonnegtive, then the re under the grph of f on [, b] is given by F (b) F (), where

More information

8. INVERSE Z-TRANSFORM

8. INVERSE Z-TRANSFORM 8. INVERSE Z-TRANSFORM The proce by whch Z-trnform of tme ere, nmely X(), returned to the tme domn clled the nvere Z-trnform. The nvere Z-trnform defned by: Computer tudy Z X M-fle trn.m ued to fnd nvere

More information

Soft Set Theoretic Approach for Dimensionality Reduction 1

Soft Set Theoretic Approach for Dimensionality Reduction 1 Interntonl Journl of Dtbse Theory nd pplcton Vol No June 00 Soft Set Theoretc pproch for Dmensonlty Reducton Tutut Herwn Rozd Ghzl Mustf Mt Ders Deprtment of Mthemtcs Educton nversts hmd Dhln Yogykrt Indones

More information

Linear and Nonlinear Optimization

Linear and Nonlinear Optimization Lner nd Nonlner Optmzton Ynyu Ye Deprtment of Mngement Scence nd Engneerng Stnford Unversty Stnford, CA 9430, U.S.A. http://www.stnford.edu/~yyye http://www.stnford.edu/clss/msnde/ Ynyu Ye, Stnford, MS&E

More information

1B40 Practical Skills

1B40 Practical Skills B40 Prcticl Skills Comining uncertinties from severl quntities error propgtion We usully encounter situtions where the result of n experiment is given in terms of two (or more) quntities. We then need

More information

Pyramid Algorithms for Barycentric Rational Interpolation

Pyramid Algorithms for Barycentric Rational Interpolation Pyrmd Algorthms for Brycentrc Rtonl Interpolton K Hormnn Scott Schefer Astrct We present new perspectve on the Floter Hormnn nterpolnt. Ths nterpolnt s rtonl of degree (n, d), reproduces polynomls of degree

More information

An Introduction to Support Vector Machines

An Introduction to Support Vector Machines An Introducton to Support Vector Mchnes Wht s good Decson Boundry? Consder two-clss, lnerly seprble clssfcton problem Clss How to fnd the lne (or hyperplne n n-dmensons, n>)? Any de? Clss Per Lug Mrtell

More information

SVMs for regression Multilayer neural networks

SVMs for regression Multilayer neural networks Lecture SVMs for regresson Muter neur netors Mos Husrecht mos@cs.ptt.edu 539 Sennott Squre Support vector mchne SVM SVM mmze the mrgn round the seprtng hperpne. he decson functon s fu specfed suset of

More information

2008 Mathematical Methods (CAS) GA 3: Examination 2

2008 Mathematical Methods (CAS) GA 3: Examination 2 Mthemticl Methods (CAS) GA : Exmintion GENERAL COMMENTS There were 406 students who st the Mthemticl Methods (CAS) exmintion in. Mrks rnged from to 79 out of possible score of 80. Student responses showed

More information

6.6 The Marquardt Algorithm

6.6 The Marquardt Algorithm 6.6 The Mqudt Algothm lmttons of the gdent nd Tylo expnson methods ecstng the Tylo expnson n tems of ch-sque devtves ecstng the gdent sech nto n tetve mtx fomlsm Mqudt's lgothm utomtclly combnes the gdent

More information

Riemann is the Mann! (But Lebesgue may besgue to differ.)

Riemann is the Mann! (But Lebesgue may besgue to differ.) Riemnn is the Mnn! (But Lebesgue my besgue to differ.) Leo Livshits My 2, 2008 1 For finite intervls in R We hve seen in clss tht every continuous function f : [, b] R hs the property tht for every ɛ >

More information

We consider a finite-state, finite-action, infinite-horizon, discounted reward Markov decision process and

We consider a finite-state, finite-action, infinite-horizon, discounted reward Markov decision process and MANAGEMENT SCIENCE Vol. 53, No. 2, Februry 2007, pp. 308 322 ssn 0025-1909 essn 1526-5501 07 5302 0308 nforms do 10.1287/mnsc.1060.0614 2007 INFORMS Bs nd Vrnce Approxmton n Vlue Functon Estmtes She Mnnor

More information

Recitation 3: More Applications of the Derivative

Recitation 3: More Applications of the Derivative Mth 1c TA: Pdric Brtlett Recittion 3: More Applictions of the Derivtive Week 3 Cltech 2012 1 Rndom Question Question 1 A grph consists of the following: A set V of vertices. A set E of edges where ech

More information

The Number of Rows which Equal Certain Row

The Number of Rows which Equal Certain Row Interntonl Journl of Algebr, Vol 5, 011, no 30, 1481-1488 he Number of Rows whch Equl Certn Row Ahmd Hbl Deprtment of mthemtcs Fcult of Scences Dmscus unverst Dmscus, Sr hblhmd1@gmlcom Abstrct Let be X

More information

Advanced Machine Learning. An Ising model on 2-D image

Advanced Machine Learning. An Ising model on 2-D image Advnced Mchne Lernng Vrtonl Inference Erc ng Lecture 12, August 12, 2009 Redng: Erc ng Erc ng @ CMU, 2006-2009 1 An Isng model on 2-D mge odes encode hdden nformton ptchdentty. They receve locl nformton

More information

Limited Dependent Variables

Limited Dependent Variables Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages

More information

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below.

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below. Dulity #. Second itertion for HW problem Recll our LP emple problem we hve been working on, in equlity form, is given below.,,,, 8 m F which, when written in slightly different form, is 8 F Recll tht we

More information

For the percentage of full time students at RCC the symbols would be:

For the percentage of full time students at RCC the symbols would be: Mth 17/171 Chpter 7- ypothesis Testing with One Smple This chpter is s simple s the previous one, except it is more interesting In this chpter we will test clims concerning the sme prmeters tht we worked

More information

CS 188 Introduction to Artificial Intelligence Fall 2018 Note 7

CS 188 Introduction to Artificial Intelligence Fall 2018 Note 7 CS 188 Introduction to Artificil Intelligence Fll 2018 Note 7 These lecture notes re hevily bsed on notes originlly written by Nikhil Shrm. Decision Networks In the third note, we lerned bout gme trees

More information