International Conference on Control, Automation and Systems 2008
Oct. 14-17, 2008 in COEX, Seoul, Korea

A Reinforcement Learning System with Chaotic Neural Networks-Based Adaptive Hierarchical Memory Structure for Autonomous Robots

Masanao Obayashi, Kenichiro Narita, Takashi Kuremoto and Kunikazu Kobayashi
Division of Computer Science & Design Engineering, Yamaguchi University, Ube, Japan
(Tel: +81-836-85-958; E-mail: {m.obayas, wu, koba}@yamaguchi-u.ac.jp)

Abstract: A human learns incidents through its own actions and reflects them in subsequent actions as its own experiences. These experiences are memorized in the brain and recollected when necessary. This research incorporates such an intelligent information processing mechanism and applies it to an autonomous agent that has three main functions: learning, memorization and associative recollection. In the proposed system, an actor-critic type reinforcement learning method is used for learning. An auto-associative chaotic neural network is also used as a mutual associative memory system. Moreover, the memory part has an adaptive hierarchical layered structure of memory modules consisting of chaotic neural networks, in consideration of the adjustment to non-MDP (Markov Decision Process) environments. Finally, the effectiveness of the proposed method is verified through simulations applied to the maze-searching problem.

Keywords: Reinforcement learning, chaotic neural network, hierarchical memory structure, autonomous robot

1. INTRODUCTION

Reinforcement learning (R.L.) is a framework for an agent to learn the choice of an optimal action based on a reinforcement signal [1]. It has been applied to a variety of problems such as autonomous robot navigation, non-linear control and so on. However, so far, most systems with R.L. have been built for use on only one task. Research on systems that memorize the results of learning many tasks and apply them to other tasks without relearning has been scarce. In this study, we use the associative chaotic neural network (ACNN) proposed by Aihara et al. [2] as a storage mechanism for the results of R.L. However, since the storage capacity of an ACNN is small, it is not suitable for working alone.
So, to resolve this problem, we built a hierarchical memory structure by making use of ACNNs: a short-term memory for the present learning result, and a long-term memory for many useful learning results. Another characteristic of the proposed system is that it is capable of dealing with non-MDP problems to some degree, because of the chaotic association ability of the ACNN. Finally, it is verified that our proposed method is useful through computer simulations of maze-searching problems.

2. PROPOSED SYSTEM STRUCTURE

The proposed system consists of two parts: memory and learning. The memory consists of a short-term memory (S.T.M.) and a long-term memory (L.T.M.). Fig. 1 shows the overall structure.

Learning sector: an actor-critic system is adopted. It learns the choice of the action a that maximizes the total predictive rewards obtained over the future, considering the environmental information (s) and the reward (r) given as a result of the action (a).

S.T.M. sector: it memorizes the learning path of the information (environmental information and action) obtained in the learning part. Unnecessary information is forgotten and useful information is stored.

L.T.M. sector: it memorizes only the sufficiently sophisticated and useful experience in the S.T.M.

Fig. 1 Proposed system (autonomous agent: learning sector (actor-critic system), S.T.M. and L.T.M. exchanging their contents; environmental input s(t), reward r(t), action a(t), information about pairs of actions and environmental states)

Fig. 2 The construction of the actor-critic system

3. ACTOR-CRITIC REINFORCEMENT LEARNING SYSTEM

The actor-critic reinforcement learning system is shown in Fig. 2.
3.1 Structure and learning of the critic
3.1.1 Structure
The function of the critic is the calculation of P(t), the prediction value of the sum of the discounted rewards that will be obtained over the future, and of its prediction error. These are briefly explained as follows. The sum of the discounted rewards that will be obtained over the future is defined as V(t):

  V(t) = Σ_{n=0}^{∞} γ^n r(t+n),  (1)

where γ (0 ≤ γ < 1) is a constant called the discount rate. Eq. (1) is rewritten as

  V(t) = r(t) + γ V(t+1).  (2)

Here the prediction value of V(t) is defined as P(t). The prediction error r̂(t) is expressed as follows:

  r̂(t) = r(t) + γ P(t+1) − P(t).  (3)

The parameters of the critic are adjusted to reduce this prediction error r̂(t). The prediction value P(t) is calculated as follows:

  P(t) = Σ_{j=0}^{J} ω_j^c y_j(t),  (4)

  y_j(t) = exp( −Σ_{i=1}^{n} (x_i(t) − m_ij)^2 / σ_ij^2 ).  (5)

Here, ω_j^c: weight of the jth output, y_j: jth output of the middle layer of the critic, x_i: ith input, m_ij, σ_ij: center and dispersion of the jth basis function for the ith input, respectively, J: the number of nodes in the middle layer of the critic. The critic is thus constructed as an RBFN, as shown in Fig. 3.

3.1.2 Learning
Learning of the critic is done by the commonly used back-propagation method, which drives the prediction error r̂(t) toward zero. The updating rule of the parameters is as follows:

  Δω_j^c = η_c r̂(t) ∂P(t)/∂ω_j^c, (j = 1, ..., J).  (6)

3.2 Structure and learning of the actor
3.2.1 Structure
Fig. 4 shows the construction of the actor. The actor also basically consists of a radial basis function network. The jth basis function of the middle layer is as follows:

  y_j(t) = exp( −Σ_{i=1}^{n} (x_i(t) − m_ij)^2 / σ_ij^2 ),  (7)

  u_k(t) = Σ_{j=1}^{J} ω_jk y_j(t) + n_k(t), (k = 1, ..., K).  (8)

Here y_j: jth output of the middle layer of the actor, m_ij, σ_ij: center and dispersion of the jth basis function for the ith input, respectively, K: the number of actions, n_k: additive noise, u_k: representative value of the kth action, ω_jk: connection weight from the jth node of the middle layer to the kth output.

3.2.2 Noise generator
The noise generator gives the output of the actor diversity by adding noise to it, which realizes trial-and-error learning.
The noise n_k(t) is calculated as follows:

  n_k(t) = noise_k(t) min(1, exp(−P(t))),  (9)

where noise_k(t) is a uniformly random number in [−1, 1]. As P(t) becomes larger, the noise becomes smaller. This leads to stable learning of the actor.

3.2.3 Learning
The parameters of the actor, ω_jk (j = 1, ..., J, k = 1, ..., K), are adjusted by using the output u_k of the actor and the noise n_k:

  Δω_jk = η_a n_k(t) r̂(t) ∂u_k(t)/∂ω_jk,  (10)

where η_a (> 0) is the learning coefficient. Eq. (10) means that (n_k(t) r̂(t)) is regarded as an error signal, and ω_jk is adjusted according to its sign.

Fig. 3 Structure of the critic

Fig. 4 Structure of the actor
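As a concrete illustration, the critic of Eqs. (4)-(6) and the actor of Eqs. (7)-(10) can be sketched as below. This is a minimal sketch, not the authors' implementation: the class and parameter names (RBFActorCritic, eta_c, eta_a, seed) and the random RBF centers are assumptions; only the dimensions n = 5, J = 3, K = 4 and γ = 0.5 follow the simulation conditions of Sec. 5.

```python
import numpy as np

# A minimal sketch of the critic of Eqs. (4)-(6) and the actor of
# Eqs. (7)-(10). Names (RBFActorCritic, eta_c, eta_a, seed) and the
# random centers are illustrative assumptions, not the paper's code.

class RBFActorCritic:
    def __init__(self, n_inputs=5, n_hidden=3, n_actions=4,
                 gamma=0.5, eta_c=0.3, eta_a=0.3, sigma=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.m = rng.uniform(0.0, 1.0, (n_hidden, n_inputs))  # centers m_ij
        self.s2 = np.full((n_hidden, n_inputs), sigma) ** 2   # dispersions
        self.w_c = np.zeros(n_hidden)               # critic weights w_j^c
        self.w_a = np.zeros((n_hidden, n_actions))  # actor weights w_jk
        self.gamma, self.eta_c, self.eta_a = gamma, eta_c, eta_a
        self.rng = rng

    def _basis(self, x):
        # Eqs. (5)/(7): Gaussian radial basis outputs y_j(t)
        return np.exp(-(((x[None, :] - self.m) ** 2) / self.s2).sum(axis=1))

    def predict(self, x):
        # Eq. (4): P(t) = sum_j w_j^c y_j(t)
        return self.w_c @ self._basis(x)

    def act_values(self, x):
        # Eqs. (8)-(9): u_k(t) with additive noise that shrinks as P(t) grows
        y = self._basis(x)
        noise = self.rng.uniform(-1.0, 1.0, self.w_a.shape[1])
        n = noise * min(1.0, np.exp(-self.predict(x)))
        return self.w_a.T @ y + n, n

    def update(self, x, r, x_next, n):
        # Eq. (3): TD error; Eqs. (6) and (10): weight updates
        y = self._basis(x)
        r_hat = r + self.gamma * self.predict(x_next) - self.predict(x)
        self.w_c += self.eta_c * r_hat * y               # dP/dw_j^c = y_j
        self.w_a += self.eta_a * r_hat * np.outer(y, n)  # du_k/dw_jk = y_j
        return r_hat
```

Calling act_values, executing the chosen action, and then calling update with the observed reward reproduces one actor-critic step; stochastic action selection over the u_k values is a separate step (Sec. 3.3).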
3.3 Action selection
The action a_b at time t is selected stochastically using the Gibbs distribution of Eq. (11):

  P(a_b | x(t)) = exp(u_b(t)/T) / Σ_{k=1}^{K} exp(u_k(t)/T).  (11)

Here, P(a_b | x(t)): selection probability of the bth action a_b, T: a positive constant called the temperature constant.

4. A HIERARCHICAL MEMORY SYSTEM

4.1 Associative Chaotic Neural Network (ACNN)
A CNN is constructed with chaotic neuron models that have refractoriness and a continuous output value. One useful usage of it is as an associative memory network, named ACNN. The dynamics of the ACNN are:

  x_i(t+1) = f( y_i(t+1) + z_i(t+1) ),  (12)

  y_i(t+1) = k_r y_i(t) − α x_i(t) + a_i,  (13)

  z_i(t+1) = k_f z_i(t) + Σ_j ϖ_ij x_j(t),  (14)

  ϖ_ij = (1/P) Σ_{p=1}^{P} (2x_i^p − 1)(2x_j^p − 1),  (15)

where x_i(t): output of the ith neuron at time t, y_i(t): internal state with respect to the refractoriness of the ith neuron at time t, z_i(t): internal state with respect to the mutual interaction of the ith neuron at time t, f(·): sigmoid function, ϖ_ij: connection weight from the jth neuron to the ith neuron, x_i^p: ith element of the pth stored pattern, k_r, k_f: decay parameters, a_i: bias term.

4.2 Network control
Here, network control is defined as control which makes the network transition from a chaotic state to a non-chaotic one and vice versa. The network control algorithm of the ACNN is shown in Fig. 5. The state of the ACNN is evaluated by Δx(t), the total temporal change of the internal state x(t); when Δx(t) is less than a threshold value θ, the chaotic retrieval of the ACNN is stopped by changing the parameter k_r to a small value. As a result, the network converges to a stored pattern near the present network state.

4.3 Mutual associative type ACNN
4.3.1 Short-term memory (S.T.M.)
We make use of the ACNN as a mutual associative memory system; namely, an auto-associative matrix is constructed with the environmental inputs s(t) and their corresponding actions a(t). When s(t) is set as the initial state of the ACNN, the ACNN retrieves a(t) from s(t) (refer to Fig. 6). l is a random vector used to weaken the correlation between s(t) and a(t). The memory matrix W_s is described by Eq. (16), where λ_s is a forgetting coefficient and η_s is a learning coefficient.

Fig. 5 Network control algorithm

Fig. 6 Memory configuration of the ACNN

Fig. 7 Adaptive hierarchical memory structure
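For illustration, the neuron dynamics of Eqs. (12)-(15) and the network control of Sec. 4.2 can be sketched as follows. This is a simplified reading, not the authors' exact algorithm of Fig. 5: the parameter names (k_r, k_f, alpha, a, eps, theta) and their default values stand in for the constants of Table 2.

```python
import numpy as np

# A sketch of the ACNN dynamics of Eqs. (12)-(15) and the network control
# of Sec. 4.2. Default constants are illustrative stand-ins for Table 2.

def hebbian_weights(patterns):
    # Eq. (15): autocorrelation weights over P binary (0/1) stored patterns
    P = len(patterns)
    s = 2.0 * np.asarray(patterns, dtype=float) - 1.0  # map {0,1} -> {-1,+1}
    return s.T @ s / P

def acnn_step(x, y, z, W, k_r=0.98, k_f=0.2, alpha=1.0, a=0.0, eps=0.05):
    # Eqs. (13)-(14): refractoriness and mutual-interaction internal states
    y = k_r * y - alpha * x + a
    z = k_f * z + W @ x
    # Eq. (12): sigmoid output f(.) with steepness eps
    x = 1.0 / (1.0 + np.exp(-(y + z) / eps))
    return x, y, z

def retrieve(x0, W, steps=200, theta=1e-3):
    # Network control: wander chaotically; once the total change of state
    # Delta x(t) drops below theta, damp the refractoriness (small k_r)
    # so that the network settles onto a stored pattern near its state.
    x, y, z = x0.copy(), np.zeros_like(x0), np.zeros_like(x0)
    k_r = 0.98
    for _ in range(steps):
        x_new, y, z = acnn_step(x, y, z, W, k_r=k_r)
        if np.abs(x_new - x).sum() < theta:
            k_r = 0.1  # switch to the non-chaotic parameter set
        x = x_new
    return x
```

Damping k_r removes the refractoriness that drives the chaotic wandering, which is what lets the network relax onto a nearby stored pattern.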
λ_s is set to a small value, because at the initial learning stage s(t) does not yet correspond to the optimal a(t).

  W_s^new = λ_s W_s^old + η_s [s l a]^T [s l a].  (16)

(Figure labels — Fig. 6: input s, output a, stored pattern [s(t) l(t) a(t)], l: additional random memory units for weakening the correlations between s and a, ACNN as a mutual retrieval system. Fig. 7: actor-critic system, environment, S.T.M. (unit type memory structure), L.T.M. (unit type memory structures (0), (1), ...), input s(t), output a(t).)
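A minimal sketch of the update of Eq. (16) and of action recall from a state cue is given below. The one-shot linear pattern completion used here stands in for the full chaotic retrieval of the ACNN, and the function names and dimensions are illustrative.

```python
import numpy as np

# A sketch of the S.T.M. memory matrix of Eq. (16): the state s, the
# decorrelating random vector l and the action a are concatenated into
# one pattern, and W_s is updated Hebbian-style with forgetting. The
# one-shot completion in recall_action replaces the chaotic retrieval.

def store(W_s, s, l, a, lam_s=0.89, eta_s=1.0):
    # Eq. (16): W_s_new = lam_s * W_s_old + eta_s * [s l a]^T [s l a]
    v = np.concatenate([s, l, a])[None, :]  # row vector [s l a]
    return lam_s * W_s + eta_s * v.T @ v

def recall_action(W_s, s, n_l, n_a):
    # Cue with s (the l and a parts zeroed), complete, threshold the a part.
    cue = np.concatenate([s, np.zeros(n_l), np.zeros(n_a)])
    out = W_s @ cue
    a_part = out[-n_a:]
    return (a_part > a_part.mean()).astype(float)
```

With λ_s < 1, patterns stored during early, unrefined stages of learning gradually fade from W_s, matching the forgetting behaviour described above.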
The S.T.M., as one unit, consists of plural ACNNs, and one ACNN memorizes the information for one environmental input pattern (refer to Fig. 7). The S.T.M. holds the path information from start to goal of only one maze-searching problem.

4.3.2 Long-term memory (L.T.M.)
The L.T.M. consists of plural units. The L.T.M. memorizes the sufficiently refined information of the S.T.M. as one unit (refer to Fig. 7). Namely, when actor-critic learning has been accomplished for a certain maze problem, the information in the L.T.M. is updated as follows. In case the present maze problem has not been experienced, the stored matrix W_L is set by Eq. (17):

  W_L = W_s.  (17)

In case the present maze has been experienced and the present learning is additive learning, the stored matrix is updated by Eq. (18):

  W_L^new = λ_L W_L^old + η_L W_s.  (18)

λ_L is a forgetting coefficient, and η_L is a learning coefficient. λ_L is set to a large value, the same as that of η_L, so as not to forget previously stored patterns.

4.4 Adaptive hierarchical memory structure
Fig. 7 shows the whole configuration of the adaptive hierarchical memory structure. When an environmental state is input to the agent, it is first sent to the L.T.M. to confirm whether it is stored information or not. If it is stored information, the action corresponding to it is retrieved and executed; otherwise, it is used for learning by the actor-critic system. The pair of sufficiently refined and trained environmental state s and action a in the S.T.M. is sent to the L.T.M. to be stored. If it is similar to a stored pattern, the information of the L.T.M. is used for relearning by the actor-critic system via the S.T.M.

5. COMPUTER SIMULATION

5.1 Simulation conditions
The agent can perceive whether there is an aisle or not at the forward, right-forward, left-forward, right and left positions as the environment s (refer to Fig. 8). The agent can move one lattice forward, back, left or right as the action a (refer to Table 1). Therefore, in the actor-critic, the state s of the environment consists of 5 inputs (= n), and the number of kinds of actions is 4 (= K). The number of hidden nodes of the R.B.F. networks is 3 (= J) in Figs. 3 and 4, and the number of additional units l is as shown in Fig. 6. When the agent reaches the goal, the agent is given a reward of 1.0. For the case of a collision with a wall, the reward is -1.0, and for each action except a collision it is -0.1.
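The perception and reward scheme described above can be encoded as a toy environment, for example as follows; the maze representation, coordinate convention and helper names are invented for illustration and are not from the paper.

```python
# A toy encoding of the simulation conditions of Sec. 5.1: the
# five-direction aisle perception, the four lattice moves, and the reward
# scheme (+1.0 at the goal, -1.0 for a collision, -0.1 per ordinary step).

GOAL_REWARD, COLLISION_REWARD, STEP_REWARD = 1.0, -1.0, -0.1
# Action codes (cf. Table 1): forward, back, left, right; the posture is
# fixed (Sec. 5.2.2), so "forward" is a fixed grid direction (+y here).
ACTIONS = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}

def perceive(maze, pos):
    # 5 binary inputs: aisle (1) or wall (0) at forward, right-forward,
    # left-forward, right and left of the agent
    x, y = pos
    probes = [(0, 1), (1, 1), (-1, 1), (1, 0), (-1, 0)]
    return [0 if maze[y + dy][x + dx] == '#' else 1 for dx, dy in probes]

def step(maze, pos, action, goal):
    x, y = pos
    dx, dy = ACTIONS[action]
    nx, ny = x + dx, y + dy
    if maze[ny][nx] == '#':              # collision with a wall: stay put
        return pos, COLLISION_REWARD, False
    if (nx, ny) == goal:                 # reached the goal
        return (nx, ny), GOAL_REWARD, True
    return (nx, ny), STEP_REWARD, False
```

The five perceive outputs form the actor-critic input vector x(t), and the reward returned by step is the r(t) of Eq. (3).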
Other parameters used in this simulation are shown in Table 2.

5.2 Simulation and results
5.2.1 In the case of a simple maze

Fig. 8 Perceptible area of the agent: shaded area

Table 1 Action code of the agent

Table 2 Parameters of the simulation
  Actor-critic: σ 0., ξ 0.7, η 0.3, γ 0.5, T 0.3
  Forgetting and learning coefficients: λ_s 0.89, η_s 1.00, λ_L 1.00, η_L 1.00
  Chaos control parameters of the ACNN (chaos / non-chaos): α 0.0/.00, k_r 0.98/0.0, ε 0.05/0.05, k_f 0.0/00, T 0.3/0.3

Fig. 9 Experimental maze and results: (a) the maze and the path obtained, (b) the number of stored patterns in each layer of the L.T.M. (layers 0-9 and the total)

At first there is no data in the L.T.M.; the agent learns the shortest path of the maze of Fig. 9(a) by using the actor-critic system and stores the learning result in the ACNN of the S.T.M. corresponding to the state s, in the form of Eq. (16). The final refined result for the maze
is sent to be stored in the first layer (= unit(0)) of the L.T.M. After learning, the agent was restarted from the initial position, obtained the information from each layer of the L.T.M., and reached the goal along the arrow line in Fig. 9(a). Fig. 9(b) shows that the number of stored patterns concentrates at layer 0. This is because, when the agent goes through this maze again, the agent uses the information in unit(0) of the L.T.M.; but when retrieval in the ACNN failed on the way, all the information W_L of unit(0) was moved to the S.T.M., additional actor-critic learning was done, the learning results were written additively in the form of Eq. (18), and all the information W_s^new was sent to unit(1) of the L.T.M. as newly experienced information. In Fig. 9(b), storing by Eq. (16) happened twice and the failure happened at layer 0, so the number of stored patterns concentrated at layer 0.

5.2.2 In the case of aliasing
The agent moves while keeping its posture such that the front of the agent always points to the top of the page. In Fig. 10, the optimal path at state A is to the right; however, the optimal path at B is to the left. The agent perceives states A and B as the same state, but their optimal actions are different; this is called aliasing. In our case, both patterns are stored as different patterns. Our method handles this problem by using the chaos control of the ACNN: namely, for the same state, i.e., the same input to the ACNN, the ACNN outputs either left or right as the agent's action; consequently the agent can move right at A.

Fig. 10 Experimental maze: aliasing happens at the areas A and B

Fig. 11 Experimental mazes and results: (a) non-stored path, (b) stored path 1, (c) stored path 2

5.2.3 In the case of use of stored path information
At first there is no data in the L.T.M.; the agent learns the shortest path of the maze of Fig. 11(b) by using the actor-critic system and stores the learning result in the S.T.M. in the form of Eq. (15) for each action. The final refined result is sent to be stored in the first layer (= unit(0)) of the L.T.M. shown in Fig. 7. Second, for the maze of Fig. 11(c), the agent tries to get the action using the ACNN of unit(0) in the L.T.M., but fails because there is no information corresponding to this environment. The agent then also learns this maze.

Fig. 12 Experimental large scale maze: (a) stored path 1, (b) stored path 2, (c) stored path 3, (d) non-stored path
The final refined result of this maze is sent to the second layer (= unit(1)) of the L.T.M. Fig. 11(a) shows the result that the agent moves along the optimal path by making use of its experiences (memory), that is, (b) and (c). The colored path in Fig. 11(a) corresponds to those of (b) and (c).

5.2.4 In the case of a large scale maze
After learning plural small-size mazes, Fig. 12(a) to (c), the agent tried to reach the goal in the maze of Fig. 12(d). The agent could not associate its action at the position of the arrow head in Fig. 12(d). To reach the goal in such a large scale maze, many experienced mazes are needed.

6. CONCLUSION

We proposed a reinforcement learning system with a chaotic neural networks-based adaptive hierarchical memory structure for autonomous robots, and showed its effectiveness through goal-searching problems in plural mazes. In future work, we would like to expand this method to the case of continuous environments.

Acknowledgements
A part of this study was supported by JSPS KAKENHI (No.850030, No.050077 and No.050007).

REFERENCES
[1] R. S. Sutton and A. G. Barto: "Reinforcement Learning", The MIT Press, 1998.
[2] M. Adachi and K. Aihara: "Associative Dynamics in a Chaotic Neural Network", Neural Networks, Vol. 10, No. 1, pp. 83-98, 1997.