A Modular On-line Profit Sharing Approach in Multiagent Domains

Size: px

Start display at page:

Download "A Modular On-line Profit Sharing Approach in Multiagent Domains"

Dorcas Johnson
5 years ago
Views:

1 A Modular O-le Prof Sharg Approach Mulage Domas Pucheg Zhou, ad Bgrog Hog Absrac How o coordae he behavors of he ages hrough learg s a challegg problem wh mul-age domas. Because of s complexy, rece work has focused o how coordaed sraeges ca be leared. Here we are eresed usg reforceme learg echques o lear he coordaed acos of a group of age whou requrg explc commucao amog hem. However, radoal reforceme learg mehods are based o he assumpo ha he evrome ca be modeled as Markov Decso Proces whch usually cao be sasfed whe mulple ages coexs he same evrome. Moreover, o effecvely coordae each age s behavor so as o acheve he goal, s ecessary o augme he sae of each age wh he formao abou oher exsg ages. Wherea as he umber of ages a mulage evrome crease he sae space of each age grows expoeally, whch wll cause he combaoal exploso problem. Prof sharg s oe of he reforceme learg mehods ha allow ages o lear effecve behavors from her expereces eve wh o-markova evromes. I hs paper, o remedy he drawback of he orgal prof sharg approach ha eeds much memory o sore each sae-aco par durg he learg proces we frsly address a kd of o-le raoal prof sharg algorhm. he, we egrae he advaages of modular learg archecure wh o-le raoal prof sharg algorhm, ad propose a ew modular reforceme learg model. he effecveess of he echque s demosraed usg he pursu problem. Keywords Mul-age learg; reforceme learg; raoal prof sharg; modular archecure. I. INRODUCION mulage sysem (MAS) whch here s a umber of Aauoomous ages eracg, wh each affecg he acos of he ohers esseally cosues a complex sysem. Performg ad compleg asks such a evrome ca be exremely dffcul. I hs paper, he cooperave MAS s cocered, whch several ages aemp, hrough her eraco, o joly solve asks or o maxmze her uly[]. Learg eables MAS o be more flexble ad robu ad makes hem beer able o hadle ucera ad chagg crcumsaces. hus how o coordae dffere ages Mauscrp receved Jue 26, 6. hs work was suppored par by he Naoal Scece Foudao uder Gra Pucheg Zhou s wh he School of Compuer Scece ad echology, Harb Isue of echology, Harb 5000, HeLogJag Provce, P.R.Cha(correspodg auhor o provde phoe: ; fax: ; e-mal: zhoupc@h.edu.c). Bgrog Hog s wh he School of Compuer Scece ad echology, Harb Isue of echology, Harb 5000, HeLogJag Provce, P.R.Cha (e-mal: hogbr@h.edu.c). behavors by learg so as o acheve he commo goal s a mpora heme mulage domas. (e.g., see [2, 3]) Reforceme learg (RL)[4, 5] s he problem faced by a age ha mus lear behavor hrough ral-ad-error eracos wh dyamc evrome ad has bee appled successfully may sgle age sysems. Learg from he evrome s robus because ages are drecly affeced by he dyamcs of he evrome. Because of hese characer RL has become oe of he mpora learg approaches for mulage learg. (e.g., see [6, 7]) I a mulage doma, o acheve he sharg goal, learg age eeds o augme s sae space o clude more effecve formao. A he same me, he creme of sae space wll lead o slow dow he learg process. Based o he dea of modular archecure, Oo[8] proposed a modular Q-learg model, whch ca reduce he sae space effecvely. However, Q-learg s based o he evrome whch should obey he Markova propery, ha he evrome ca by modeled by a Markov Decso Proces bu for mulage doma hese cao be me wh. he raoal prof sharg, whch s proposed by Myazak[9], s a cred assgme mechasm ha bases o rals ad errors. Learg ages erac wh he evrome hrough ral ad error, ad reforce effecve sae-aco pars ad suppress effecve sae-aco pars by learg epsode. Prof sharg coras wh oher reforceme learg approaches whch are based o Dyamc Programmg, such as emporal Dfferece mehod ad Q-learg, ha prof sharg guaraees covergece o a effecve polcy eve domas ha do o obey he Markova propery, f a ask s epsodc ad a cred s assged a approprae maer[0, ]. Based o hese researche hs paper we egrae he advaages of modular archecure wh raoal prof sharg, ad propose a ew modular reforceme learg approach. II. MODULAR REINFORCEMEN LEARNING A. Markov Decso Processes ad Reforceme Learg Markov Decso Process (MDP) s geerally regarded as he mahemacal foudao for RL. A fully observable MDP s a quadruple (S, A,, R), where S s a fe se of sae A s a se of aco : SAS [0,] s he sae raso fuco ha descrbes he probably ps ( ' sa, ) of edg up sae s whe performg aco a sae ad 424

2 R: SA s reward fuco ha reurs a umerc value afer akg aco a sae s. A age s polcy s a mappg : S A. he goal of he age s o fd a opmal polcy ha maxmzes he expeced sum of dscoued rewards V(, s ) E( r, s) () 0 where r s he scalar reward a me sep, [0,] s dscou facor. Equao () ca be rewre as V(, s ) r(, s a) p( s' a) V( s', ) (2) s' S where a s he aco deermed by polcy. Dyamc Programmg (DP) heory[2] guaraees here exss a leas a deermsc opmal polcy *, ha for ay s S, sasfes * * V(, s ) max{ r p(',) s s a V( s', )} (3) aa s' S Q-learg[3]s oe kd of model-free RL algorhm s ha based o DP. I essece Q-learg s a emporal-dfferece learg mehod. Q-learg drecly compues he approxmao of a opmal aco-value fuco, called Q-value, by usg he followg updae rule Qs (, a) Qs (, a) (4) ( ) r maxq( s, a) Q( s, a ) here [0,] s learg rae. B. Modular Q-learg a he modular Q-learg archecure[8] cosss of a medaor module ad learg modules whch amou o he umber of ages volved he ask, show as Fg. Each age he learg module carres ou Q-learg he evrome. I each learg module, learg coceraes o a sgle age ad he learg of oher ages s o cosdered. o complee he global goal, eeds a medaor module o arbrae he resuls of he learg modules. he medaor module combes learg modules decso polces usg a smple heursc decso procedure ad deermes a fal decso by he correspodg age. Reward Evrome Aco Sae Age Learg Module Medaor Module Q Q 2 Q L Learg Module 2 x x 2 x L Learg Module L Fg. he modular archecure for a learg age he learg ages ca decde her aco by usg he Greaes Mass (GM) sraegy proposed by Whehead[4], L k arg max Q ( xk, a). (7) aa k where Q k deoes he aco-value fuco susaed by he learg module k, L s he umber of learg modules. Iuvely, hs s a seleco b majory amog acos proposed by dvdual learg-modules. III. RAIONAL PROFI SHARING Prof sharg s orgally proposed by Grefesee[5]. he orgal verso used prof sharg as a cred assgme mehod based o ral-ad-error experece whou ulzg ay form of value esmao. However, hs approach does o guaraee he raoaly of a acqured polcy. o guaraee covergece o a raoal polcy a o-markova doma, Myazak[9, 6] roduces he raoaly heorem, whch he cred assgme fuco should sasfy, here we call raoal prof sharg. I a mulage evrome, a me sep he learg age observes a sae s, whch s geerally he parally avalable sae of s evrome. A aco a s he seleced from he aco se. Afer he aco s execued, he age deermes wheher a reward r s geeraed. If here s o reward, he age sores he sae-aco par so-called rule (s, a ) s epsodc memory, ad repeas hs cycle ul a reward s geeraed. he process of movg from a al sae o he fal reward sae s called a epsode. Whe a reward s gve o he age, he rules sored s epsodc memory are reforced a oce. he fuco ha maps saes o acos s called a polcy. A polcy s called raoal f ad oly f he expeced reward per a aco s larger ha zero. We call a subsequece of a epsode a deour whe dffere acos are seleced for he same sae a epsode. he rules o a deour are called effecve rules. o corol he effecve rule Myazak[6] gves he followg raoaly heorem of prof sharg. Lemma. Ay effecve rule ca be suppressed ff W, 2,..., W, C f j f (8) where C s a upper boud of he umber of coflcg effecve rule ad W s a upper boud of he legh of epsodes. I pracce, we ca le C=M-, where M s he umber of acos. f deoes a reforceme value for he rule seleced a sep before a reward s acqured. he equaly (8) s called suppresso codo. Lemma 2. Prof sharg ca lear a raoal polcy ff sasfes he suppresso codo of lemma. he algorhm for he prof sharg based o lookup able s followed. Algorhm. he prof sharg algorhm Ialze, for all ss, aa( s) : wsa (, ) a small cosa Repea (for each epsode) 0, alze s j 425

3 Repea (for each sep of epsode) Accordg o s, choose a usg weghed-roulee: a) Pr( a a s s) a' A( s) a' ) Save sae-aco par (s, a ) o epsodc memory Perform aco a ; observe reward r +, ad ex sae s + ul s + s ermal sae, for each par (s, a ) epsodc memory: ws (, a) ws (, a) f( r, ), 0,,..., ul he algorhm holds he soppg crera IV. MODULAR ON-LINE PROFI SHARING APPROACH Wh he modular-q model, each learg module oly cocers wh a par of formao of he evrome, hs may reduce he sze of sae space. However, wll easly lead o percepual alasg problem[7],.e., he learg age wrogly rea wo or more dffere saes as oe. A he same me, a mul-age doma, he succeeded sae of he learg age wll o oly be effeced by s aco ad he evrome, bu also be effeced by oher ages behavor oher word he evrome cao obey Markova propery. However, eve wh o-markova evrome, prof sharg mehod ca lear effecve polcy. So, here we adop prof sharg as a basc algorhm for each learg module. A. O-le Prof Sharg Algorhm Prof sharg algorhm show Algorhm s a off-le updag mehod, whch does o chage weghs of s sae-aco pars ul he ed of a ask. Oe drawback s ha wll requre ubouded memory space o sore all of he seleced sae-aco pars durg he ask. Based o he dea of elgbly races[5,8], here we propose a o-le prof sharg algorhm. he dea behd elgbly races s very smple. Each me a sae s vsed aes a shor-erm memory proces a race, whch he decays gradually over me. hs race marks he sae as elgble for learg. here are maly wo kds of elgbly race ha s accumulag ad replacg elgbly race as defed by equao (9) ad (0), respecvely. e (, s a) s aa e (, s a) (9) e (, s a) oherwse s a a e(, s a) 0 s a a (0) e (, s a) s s where s he decay parameer, ad 0. We apply elgbly races for mplemeg prof sharg mehod cremeally, ad propose a kd of o-le prof sharg algorhm, as show Algorhm 2. Algorhm 2. he o-le prof sharg algorhm Ialze, for all ss, aa( s) : wsa (, ) a small cosa Repea (for each epsode) for all a : esa (, ) 0 0, alze s Repea (for each sep of epsode) Accordg o s, choose a usg weghed-roulee: a) Pr( a a s s) a' A( s) a' ) Updae elgbly races: f use accumulag elgbly race he e( a) e ( a) f use replacg elgbly race he e( a) e( a) 0,for aa( s), a a ake aco a ; observe reward, r +, ad ex sae, s + For all a: wsa (, ) wsa (, ) r e(, sa) e(, s a) e(, s a) ul s + s he ermal sae ul he algorhm holds he soppg crera heorem. If he followg hold:. Algorhm adops cred assgme fuco f(, r) r where, ad. C 2. Algorhm 2 uses accumulag elgbly races. 3. he ermedae rewards are zero ul he ed of each epsode. he, Algorhm s equvale o Algorhm 2, boh of hem ca lear a raoal polcy. Proof. Fr we wll prove ha hese wo algorhms are equvale. Accordg o Kroecker Dela fuco:, j (, j) 0, j Equao (9) ca be rewre as e(, s a) e (, s a) (, s s)(, a a) (, ss0)(, aa0) (, ss)(, aa) (, ss )(, aa )) (, ss)(, aa) 0 (, ss)(, aa) I Algorhm 2, for ay sae-aco par ( a), afer oe erao he cremeal wegh s wsa (, ) r e(, sa), so a me sep whe a epsode ed he oal updae s w (, s a) r e (, s a) o 0 r s s a a 0 0 (, )(, ) 426

4 (, ) (, ) 2 (, ) (, ) 0 r s s a a r s s a a 0 r (, s s )(, a a ) Because all he ermedae rewards ul he ed of a epsode are zero, ha r r 2 r 0, so o(, ) (, )(, ) 0 w s a r s s a a () O he oher had, for Algorhm, a me sep whe a epsode s over, for ay sae-aco par, he sum of cremeal wegh ca be wre as Sce w (, s a) f(, r )(, s s )(, a a ) (2) f(, r ) off 0 r, equao (2) ca be rasformed o off (, ) (, )(, ) o(, ) 0 w s a r s s a a w s a herefore, for ay sae-aco par, he sums of all he updaes hese wo algorhms are same afer learg, so hey are equvale. Nex, we wll show ha hese wo algorhms ca lear a raoal polcy. For Algorhm, accordg o he cred assgme fuco, we wll have - C( ) C f(, r) C r r 0 0 C r Because C, C so - C C f(, r) r r f(, r). 0 herefore, sasfes he suppresso codo lemma, ad accordg o lemma 2, wll lear a raoal polcy. Because of he equvalece of Algorhm ad 2, we ca coclude ha Algorhm wll also lear a raoal polcy. heorem 2. If he followg hold: W. 2, where s he decay parameer equao (0), ad W s a upper boud of he legh of epsodes Lemma. 2. Algorhm 2 uses replacg elgbly races. 3. he ermedae rewards are zero ul he ed of each epsode. he, Algorhm 2 ca lear a raoal polcy. Proof. Fr defe a fuco 0, j f(, j) (, j), j he equao (0) ca be rewre as e ( a) f( s ) e ( a) ( s ) ( a, a ) f(, s s) f(, s s ) e(, s a) f(, s s )(, s s )(, a a ) (, s s )(, a a ) s s aa f ss ss aa 0 (, )(, ) (, ) (, )(, ) For ay sae-aco par (a), afer oe erao he cremeal wegh s wsa (, ) r e(, sa), so a me sep whe a epsode ed he oal updae s w (, s a) r e (, s a) o 0 Because all he ermedae rewards ul he ed of a epsode are zero, ha r r 2 r 0, so w (, s a) r e (, s a) o Assume a me sep - he vsed sae-aco par s (s c, a d ), he w ( s, a ) r e ( s, a ) r o c d c d A he same me, for ay oher uvsed sae-aco par a me -, whch s deoed by ( s, a ) U ( s, a ) s S, a A( s ) ( s, a ) we have u v u v u v u c d w ( s, a ) r e ( s, a ) o u v u v he for all uvsed sae-aco pars a me -, he sum of cremeal weghs s w ( s, a ) r e ( s, a ) o u v u v U U r ( s, s ) ( a, a ) f( s, s ) (3) 2 u v u 0 U r ( s, s ) ( a, a ) u v U Because a me - for ay (s u, a v ), holds ( s, s ) ( a, a ) 0 so u v r ( su, s ) ( av, a ) 0 (4) U Accordg o he defo, s appare ha so f( s, s ) u ( s, s ) ( a, a ) f( s, s ) ( s, s ) ( a, a ) u v u u v For each me sep a =0,,, -2, wh he se U here s a mos oe sae-aco par would be vsed, so ( s, s ) ( a, a ) Hece U u v ( s, s ) ( a, a ) f( s, s ) u v u U (5) Iegrae equao(4) ad equao (5) wh equao(3), we have 427

5 2 wo( su, av ) r r U 0 Because W W 2 herefore w ( s, a ) r w ( s, a ) U o u v o c d herefore, sasfes he suppresso codo lemma, accordg o lemma 2 Algorhm 2 ca lear a raoal polcy. B. he Proposed Modular Reforceme Learg Model Based o he aalyss meoed above, here we propose a modular prof sharg model, as show Fg 2. he learg model of he age cosss of hree module hey are Sae Decomposo Module (SDM), Learg Module (LM) ad Medaor Module (MM), respecvely. SDM deermes he sae of each LM, ad he each LM seds s local polcy o MM. MM wll deerme o choose a approprae aco for he learg age o ake. SDM Reward Sae Age LM ASM SAP PSM Evromes ad oher Ages LM 2 ASM SAP PSM MM Aco LM L ASM SAP PSM Fg. 2 he modular o-le prof sharg model of he age Each LM cosss of hree submodules. ASM: Aco Seleco Module, whch wll sed he local polcy o MM, cludg he avalable acos ad her weghs for he curre sae. SAP: Sae-Aco Par wegh able, whch keeps all of he sae-aco pars ad s wegh, also sores he elgbly race of each sae-aco par. PSM: Prof Sharg Module. Accordg o he receved reward r, he curre sae s ad he chose aco a of MM, PSM updaes he wegh of each sae-aco par usg o-le prof sharg algorhm. he proposed learg algorhm s show as Algorhm 3. Algorhm 3. he proposed modular o-le prof sharg algorhm. Ialze, se learg parameer for all ss, aa( s) each LM: wsa (, ) a small cosa 2. Repea (for each epsode): ) for all ss, a A( s) each LM, esa (, ) 0, 0 2) Repea (for each sep of epsode) () Observe curre sae s() (2) Deerme he sae of each LM, s ()(=,,L, L s he umber of LM) (3) Each LM seds s avalable acos aa(s ()) ad wegh s (), a) o MM (4) Based o receved polces from each LM, MM chooses aco a() usg GM sraegy: k a () arg max w ( s (), a) L aa( sk( )) k (5) Each LM updaes he elgbly race of he vsed sae-aco par. (6) ake aco a(); observe reward, r(+), ad ex sae, s(+) (7) For all of he sae-aco par each LM, updae s wegh ad elgbly race: wsa (, ) wsa (, ) r e(, sa) e(, s a) e(, s a) (8) If sae s(+) s a ermal sae, he go o sep 3); oherwse, +, go o sep (2) 3) If he algorhm holds he soppg crera, he he learg process s over; oherwse, go o sep ). A. Problem Doma V. SIMULAION o evaluae he learg approach proposed hs paper, we choose a modfed verso of he pursu problem[9] as he es bed. I a 8 8 orodal grd world, a sgle prey ad four huer ages are placed a radom posos he grd, as show Fg 3(a). A each me sep, ages sychroously execue oe ou of fve acos: sayg a he curre poso or movg orh, souh, we or eas. he prey ad he huer cao share a cell, bu more ha oe huer ca be a he same poso. Also, a age s o allowed o move off he evrome. Every age ca see objecs a a cera dsace. he dsace ad he cells covers are, respecvely, called he vsual deph ad he vsual evrome of he age. A age havg vsual deph d ca see all he cells sde he (2d+) (2d+) square aroud. hese cells are he vsual evrome of he age. Each age s assged a uque defer. A huer age ca locae he relave poso ad recogze he defer of ay oher ages s vsual evrome. he prey s capured, whe all of s eghbor cell are occuped by he huer as show Fg 3(b). he all of he ages are relocaed a ew radom posos. k 428

700 Average seps ul capure 600 500 400 300 Modular-Q, =0. Modular-Q, =0.8 MPS-AE, =0. MPS-RE, =0.5 (a) Fg. 3 (a) Example of he pursu problem a grd world.

2) he huers ages are learg age whle he prey selecs s ow aco based o a radom sraegy. 3) he huer ages are self-eresed, ad here s o commucao bewee each oher. B.

Q-learg (Modular-Q). Accordg o heorem, MPS-AE we se decay parameer 0., MPS-RE le 0.5. For Modular-Q, he learg rae 0. or 0.8, dscou facor 0.9.

Upo capurg he prey, all he huers surroudg he prey receve a reward of 00, ad accordgly all of her cosues learg modules uformly receve he same reward regardless of wha acos hey have jus proposed.

6 700 Average seps ul capure Modular-Q, =0. Modular-Q, =0.8 MPS-AE, =0. MPS-RE, =0.5 (a) Fg. 3 (a) Example of he pursu problem a grd world. (b) Examples of capurg saes he pursu problem has he followg characerscs. ) he evrome s fully dyamc, parally observable ad o-deermsc. 2) he huers ages are learg age whle he prey selecs s ow aco based o a radom sraegy. 3) he huer ages are self-eresed, ad here s o commucao bewee each oher. B. Expermeal Seg ad Smulao Resuls o es he proposed algorhm, we compare modular prof sharg based o accumulag elgbly race (MPS-AE), wh modular prof sharg based o replacg elgbly race (MPS-RE) ad Modular Q-learg (Modular-Q). Accordg o heorem, MPS-AE we se decay parameer 0., MPS-RE le 0.5. For Modular-Q, he learg rae 0. or 0.8, dscou facor 0.9. he oal learg epsodes are 5000, ad he maxmum eraos wh a epsode are 500 me seps. he vsual deph of he huer s d=3. Upo capurg he prey, all he huers surroudg he prey receve a reward of 00, ad accordgly all of her cosues learg modules uformly receve he same reward regardless of wha acos hey have jus proposed. I ay oher case huers receve a reward of zero. Sce he huer cao fsh capure he prey aloe, herefore, o capure he prey, he huer should ake he formao of he prey ad oher huers o accou, so as o beer coordae s behavor wh oher huers. For he problem of (N+) huers capurg oe prey, accordg o coveoal sae codg, he sae of Huer ca be deoed as s:=<p, H,, H N>, where P s he relave poso of he prey o Huer, ad H k (k=,, N) deoes he relave poso of Huer k o Huer. Whe he huer wh vsual deph d, he sze of he sae space s S =(v s +) N+. For module archecure, he sae space of each learg module for he huer s a combao of he relave posos of he prey ad oe oher huer, ha he sze of sae space s S =(v s +) 2, accordgly, he sze of oal sae space s S =N (v s +) 2, oher word he relaoshp bewee he sze of he sae space ad he umber of learg ages s rasformed from expoeal o power, whch meas grealy reduce he sze of sae space. Fg 4 s he learg curve of he hree learg algorhm uder dffere me afer several ral Fg 5 gves he hsogram of he learg resuls for hree algorhms. (b) Fg. 4 he graph of comparso of hree learg algorhms Average seps ul capure Modular-Q, =0. Modular-Q, =0.8 MPS-AE, =0. MPS-RE, = Fg. 5 he hsogram of hree learg algorhms From hese wo graph we ca fd ha, boh MPS-AE ad MPS-RE are beer ha Modular Q-learg. hese ca be explaed as followg. For he pursu problem dscussed here, he evrome cao obey Markova propery, whch meas he premse of Q-learg covergg o he opmal polcy cao be sasfed aymore. Moreover, because of he lmed sese ably, he huer wll wrogly rea wo or more dffere saes as oe, whch lead o he percepual alasg problem. Sce he learg age wll choose approprae aco, whch wll slow dow he learg speed. I addo, Q-learg s a off-polcy learg mehod. hs meas ha Q-learg s guaraeed o coverge o he opmal soluo uder MDP evrome regardless of wha polcy s followed durg rag, as log as each sae-aco pars s vsed fely of he lm. However, Modular-Q, he oe-sep value updaes for each learg module are compued uder he assumpo ha all fuure acos wll be chose opmally for ha Markova evrome. hs assumpo s vald uder he aco seleco mechasms descrbed above, whch uses he heursc polcy such as GM sraegy. Because he chose aco by medaor module wll represe some compromse polcy whch he dffere learg modules share corol. hs meas ha he compued Q-values do o coverge o he acual expeced reur uder he compose polcy. Modular prof sharg mehod wll suppress effecve sae-aco pars ha may rap o loop whou reward, ad reforce 429

7 effecve sae-aco par so wll gradually coverge o a raoal polcy, ad oba beer learg resuls. Addoally, we ca fd ha based o he parameers seg above, MPS-RE s a lle beer ha MPS-AE, whch seems someway lke he reforceme learg wh replacg elgbly races compares o reforceme learg wh accumulag elgbly races. o es performaces of proposed algorhms uder dffere parameer seg we le decay parameer wh dffere values. Fg 6 ad 7 are average learg resuls for MPS-AE ad MPS-RE uder dffere decay parameer respecvely. Whe decay parameer does hold he codos of heorem, he he performace of he algorhm decle he average capurg me s loger. hs s because whe decay parameer does o hold he codos of heorem, durg he learg proces some sae-aco pars ha wll lead o rap o loop whou ay reward also are reforced, so wll affec he performace of he learg algorhm. I addo, from he learg curves we ca fd ha, dffere decay parameer has dffere effec o he learg algorhm. As for he smulao hs paper, for MPS-AE, whe decay parameer 0., he learg algorhm performs bes. For MPS-RE, geerally, he algorhm performs beer wh he creme of decay parameers. Average seps ul capure MPS-AE, =0.05 MPS-AE, =0.0 MPS-AE, =0.5 MPS-AE, =0.20 MPS-AE, =0.50 Fg. 6 he learg resul of MPS-AE algorhm wh dffere decay parameers. Average seps ul capure MPS-RE,=0.0 MPS-RE,=0.20 MPS-RE,=0.30 MPS-RE,=0.50 MPS-RE,=0.80 Fg. 7 he learg resul of MPS-RE algorhm wh dffere decay parameers VI. CONCLUSION I order o solve he problem of huge saes mulage learg, as well as he percepual alasg problem because of he lmed sese ably of learg age we proposed a ew modular reforceme learg approach. Frsly, a kd of o-le prof sharg algorhm s proposed, ad he based o he advaages of modular archecure ad o-le prof sharg mehod, a modular o-le prof sharg approach s preseed. Expermeal resuls showed ha he proposed learg algorhm could be used o acheve opmal soluo beer, sead of radoal reforceme learg mehods. REFERENCES [] Paa L, Luke S, Cooperave Mul-Age Learg: he sae of he ar. Auoomous Ages ad Mul-Age Sysem 5, (3): [2] Ho F, Kamel M. Learg coordag sraeges for cooperave mulage sysems. Mache Learg, 998, 33(2-3): 55-77, [3] Garlad A, Alerma R. Auoomous ages ha lear o beer coordae. Auoomous Ages ad Mul-Age Sysem, 4, 8(3): [4] Kaelbg L P, Lma M L, Moore A W. Reforceme learg: A survey. Joural of Arfcal Research, 996, 4: [5] Suo R S, Baro A G. Reforceme learg: A roduco. Cambrdge, MA: MI Pres 998 [6] Excelee-oledo CB, Jegs NR. Usg reforceme learg o coordae beer. Compuaoal Iellgece, Vol. 2 No. 3, pp Blackwell Publshg 5 [7] CHEN G, YANG ZH. Coordag Mulple Ages va Reforceme Learg. Auoomous Ages ad Mul-Age Sysem 5, 0(3): [8] Oo N, Fukumoo K. Mul-age reforceme learg: A modular approach. I Proceedgs of he Secod Ieraoal Coferece o Mul-age Sysems. Porlad, Orego, USA. 996, pp: , AAAI Press [9] Myazak K, Yamamura M, Kobayash S. O he raoaly of prof sharg reforceme learg. I Proceedgs of he hrd Ieraoal Coferece o Fuzzy Logc, Neural Nes ad Sof Compug, pages Fuzzy Logc Sysems Isue, 994 [0] Ara S, Sycara K. Effecve learg approach for plag ad schedulg mul-age doma. I Proceedgs of he 6h Ieraoal Coferece o Smulao of Adapve Behavor. Par Frace. Sepember 0, pp: [] Ara S, Sycara K P, Paye R. Experece-based reforceme learg o acqure effecve behavor a mul-age doma. I Proceedgs of he 6h Pacfc Rm Ieraoal Coferece o Arfcal Iellgece. Melboure, Ausrala. 0, pp: [2] Bellma R. Dyamc programmg. Prceo, NJ: Prceo Uversy Pres 957 [3] Waks C J, Daya P. echcal Noe: Q-learg. Mache learg, 992, 8: [4] Whehead S, Karlsso J, eeberg J. Learg mulple goal behavor va ask decomposo ad dyamc polcy mergg. Robo Learg, Norwell, MA: Kluwer Academc Pres 993 [5] Grefesee J J. Cred assgme rule dscovery sysems based o geec algorhms. Mache Learg, 988, 3: [6] Myazak K, Kobayash S. O he raoaly of prof sharg parally observable markov decso process. I Proceedgs of he ffh Ieraoal Coferece o Iformao Sysems Aalyss ad Syhess. Orlado, FL, USA. 999, pp: [7] Whehead S D, Ballad D H. Acve percepo ad reforceme learg. I Proceedgs of 7h Ieraoal Coferece o Mache Learg. 990, pp: [8] Sgh S P, Suo R S. Reforceme learg wh replacg elgbly races. Mache Learg, 996, 22: [9] Beda M, Jagaaha V, Dodhawalla R. O opmal cooperao of kowledge source. echcal Repor No. BCS-G-28, Boeg Advaced echology Ceer, Boeg Compuer Servce Seale, WA,

The Poisson Process Properties of the Poisson Process

The Poisson Process Properties of the Poisson Process Posso Processes Summary The Posso Process Properes of he Posso Process Ierarrval mes Memoryless propery ad he resdual lfeme paradox Superposo of Posso processes Radom seleco of Posso Pos Bulk Arrvals ad