Tournament selection in zeroth-level classifier systems based on average reward reinforcement learning


Zang Zhaoxiang, Li Zhao, Wang Junying, Dan Zhiping
(Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, China Three Gorges University, Yichang, Hubei, China; College of Computer and Information Technology, China Three Gorges University, Yichang, Hubei, China)

Abstract: As a genetics-based machine learning technique, the zeroth-level classifier system (ZCS) is based on a discounted reward reinforcement learning algorithm, the bucket-brigade algorithm, which optimizes the discounted total reward received by an agent but is not suitable for all multi-step problems, especially large ones. Undiscounted reinforcement learning methods are available, such as R-learning, which optimize the average reward per time step. In this paper, R-learning is used as the reinforcement learning component of ZCS, replacing its discounted reward reinforcement learning approach, and tournament selection is used to replace roulette wheel selection in ZCS. The modification results in classifier systems that can support long action chains and are thus able to solve large multi-step problems.

Key words: average reward; reinforcement learning; R-learning; learning classifier systems (LCS); zeroth-level classifier system (ZCS); multi-step problems

1 Introduction

Learning Classifier Systems (LCSs) are rule-based adaptive systems which use a Genetic Algorithm (GA) and other machine learning methods to facilitate rule discovery and rule learning [1]. LCSs are competitive with other techniques on classification tasks, data mining [2, 3] and robot control applications [4, 5]. In general, an LCS is a model of an intelligent agent interacting with an environment. Its ability to choose the best policy for acting in the environment, namely its adaptability, improves with experience. The source of the improvement is learning from the reinforcement, i.e. payoff, provided by the environment. The aim of an LCS is to maximize the achieved environmental payoffs. To do this, LCSs try to evolve a population of compact and maximally general "condition-action-payoff" rules, called classifiers, which tell the system in each state (identified by the condition) the amount of payoff for any available action. LCSs can therefore be seen as a special kind of reinforcement learning method that provides a different approach to generalization.

The original Learning Classifier System framework proposed by Holland is now referred to as the traditional framework. Later, Wilson proposed the strength-based Zeroth-level Classifier System (ZCS) [6] and the accuracy-based X Classifier System (XCS) [7]. The XCS classifier system solved the former main

shortcoming of LCSs, the problem of strong over-generals, through its accuracy-based fitness approach. Bull and Hurst [8] have recently shown that, despite its relative simplicity, ZCS is able to perform optimally through its use of fitness sharing. That is, with appropriate parameters, ZCS was shown to perform as well as the more complex XCS on a number of tasks. Although current research has focused on the use of accuracy of rule predictions as the fitness measure, the present work departs from this popular approach and takes a step backward, aiming to uncover the potential of strength-based LCSs (and particularly ZCS) in sequential decision problems. In this direction, we discuss the use of average reward in ZCS and introduce an undiscounted reinforcement learning technique called R-learning [9, 10] for ZCS to optimize average reward, which is a different metric from the discounted reward optimized by the original ZCS. In particular, we apply R-learning-based ZCS to large multi-step problems and compare it with ZCS. Experimental results are encouraging, in that ZCS with R-learning can perform optimally or near optimally in these problems. In the following, we refer to our proposal as "ZCSAR", where "AR" stands for "average reward".

The rest of the paper is structured as follows: Section 2 provides some necessary background on reinforcement learning, including Sarsa and R-learning. Section 3 provides a brief description of ZCS and of maze environments. How ZCS can be modified to include average reward reinforcement learning is described in Section 4, while Section 5 analyzes the trouble resulting from our modification to ZCS and presents a solution to it. Experiments with our proposal and some related discussion are given in Section 6. Finally, Section 7 ends the paper, presents our main conclusions and gives some directions for future research.

2 Reinforcement learning

Reinforcement learning is a formal framework in which an agent manipulates its environment through a series of actions and receives some rewards as feedback to its actions, but is not told what the correct actions would have been. The agent stores its knowledge about how to make decisions that maximize rewards or minimize costs over a period of time. A reinforcement learner must learn to perform a task by trial and error from a reinforcement signal (the reward values) that is not as informative as might be desired. In reinforcement learning for multi-step problems, the reinforcement signal usually gives delayed reward, which typically comes at the end of a series of actions. Delayed reward makes learning much more difficult.

Generally, the reinforcement learning framework consists of:
- a discrete set of environment states, S;
- a discrete set of available actions, A;
- an immediate reinforcement function R, mapping S × A into the real value r, where r is the expected environmental payoff after performing the action a, from A, in a particular state s, from S.

On each step of interaction the agent perceives the environment to be in state s_t; the agent then chooses an action a_t in the set A, and the action a_t is performed in the environment. As a result of taking action a_t, the agent receives a reward r_t and observes a new state s_{t+1}.
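To make this interaction cycle concrete, the following minimal Python sketch runs one trial under an arbitrary policy. The Environment interface (reset, step) and the policy function are our own placeholders, not something defined in the paper.

```python
def run_trial(env, policy, max_steps=500):
    """One trial of the generic agent-environment loop described above.

    `env` is assumed to expose reset() -> state and step(action) -> (reward, next_state, done);
    `policy` maps a state to an action. Both are hypothetical placeholders.
    """
    s = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        a = policy(s)                  # agent chooses an action from A
        r, s_next, done = env.step(a)  # environment returns reward and next state
        total_reward += r
        s = s_next
        if done:                       # e.g. the goal (food) has been reached
            break
    return total_reward
```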

The agent's job is to find a policy, mapping states to actions, that maximizes some long-run measure of reinforcement. There are mainly two measures for valuing a policy: discounted reward optimality and average reward optimality. In discounted reinforcement learning, the performance measure being optimized is usually the infinite-horizon discounted model [11], which takes the long-run reward of the agent into account, but rewards received in the future are geometrically discounted according to a discount factor γ, 0 ≤ γ < 1:

\lim_{N \to \infty} E\left[ \sum_{t=0}^{N} \gamma^t r_t(s) \right]    (1)

where E denotes the expected value and r_t(s) is the reward received at time t, starting from state s, under a policy π. An optimal discounted policy maximizes the above infinite-horizon discounted reward.

On the other hand, undiscounted reinforcement learning usually optimizes the average reward model [9], in which the agent is supposed to take actions that maximize its long-run average reward per step:

\rho^{\pi}(s) = \lim_{N \to \infty} \frac{1}{N} E\left[ \sum_{t=0}^{N-1} r_t^{\pi}(s) \right]    (2)

If a policy maximizes the average reward over all states, it is referred to as a gain optimal policy. Usually the average reward ρ^π(s) can be denoted simply as ρ, since it is state independent [12], which greatly simplifies the design of average reward algorithms.

How does the agent find a policy that maximizes the long-run measure of reinforcement? Most reinforcement learning algorithms are based on estimating a state-action value function (called the action value function) that indicates how good it is for the agent to perform a given action in a given state. Here, "how good" is defined in terms of future expected reward value, usually as (1) or (2), corresponding to discounted reward and average reward optimality. We now give a brief description of two typical reinforcement learning algorithms, based on discounted reward and on average reward optimality respectively.

2.1 Sarsa Algorithm

Sarsa is a well-known reinforcement learning algorithm that can be seen as a variant of the Q-learning algorithm [11]. It is based on iteratively approximating the table of all action values Q(s, a), named the Q-table. Initially, all the Q(s, a) values are set to 0. At time step t, the agent perceives the environment state s_t and chooses an action a_t by the ε-greedy policy. The action a_t is performed in the environment, and the agent receives an immediate reward r_{imm}(s_t, a_t) for doing action a_t, together with a new environment state s_{t+1}. Then, the entry Q(s_t, a_t) is updated using the following rule:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ \hat{Q}(s_t, a_t) - Q(s_t, a_t) \right]    (3)

Here, 0 < α ≤ 1 is the learning rate controlling how quickly errors in the estimated action values are corrected; \hat{Q}(s_t, a_t) is the new estimate of Q(s_t, a_t), and is computed as

\hat{Q}(s_t, a_t) = r_{imm}(s_t, a_t) + \gamma Q(s_{t+1}, a_{t+1})    (4)

where r_{imm}(s_t, a_t) is the immediate reward received for performing a_t in state s_t.

2.2 R-learning

Since Q-learning discounts future rewards, it prefers actions that result in short-term ordinary rewards to those that result in long-term sustained or considerable rewards. In contrast, the R-learning algorithm [9] proposed by Schwartz maximizes the average reward per time step. R-learning is similar to Q-learning in form. It is based on iteratively approximating the action values R(s, a), which represent the average-adjusted reward of doing an action a in state s once and then following the corresponding policy thereafter. The R-learning algorithm consists of the following steps:

1) Initialize all the R(s, a) values to zero, and initialize the average reward variable ρ to zero as well.
2) Let the current time step be t. From the current state s_t, choose an action a_t by some exploration/action-selection mechanism, such as the ε-greedy policy.
3) Perform the action a_t, and observe the immediate reward r_{imm}(s_t, a_t) received and the subsequent state s_{t+1}.
4) Update the R values using the following rule:

R(s_t, a_t) \leftarrow R(s_t, a_t) + \alpha_R \left[ r_{imm}(s_t, a_t) - \rho + \max_{a \in A} R(s_{t+1}, a) - R(s_t, a_t) \right]    (5)

5) If R(s_t, a_t) = \max_{a \in A} R(s_t, a) (i.e. if a greedy/non-random action a_t was chosen), then update the average reward ρ according to the rule:

\rho \leftarrow \rho + \alpha_\rho \left[ r_{imm}(s_t, a_t) - \rho + \max_{a \in A} R(s_{t+1}, a) - \max_{a \in A} R(s_t, a) \right]    (6)

6) Set t ← t + 1 and go to step 2.

Here, 0 < α_R ≤ 1 is the learning rate for updating the action values R(·, ·), and 0 < α_ρ ≤ 1 is the learning rate for updating the average reward ρ. The update rule for the action values R(·, ·) differs from the rule of Q-learning in subtracting the average reward ρ from the immediate reward, and in not discounting the next maximum action value. The estimation of the average reward ρ is a critical task in R-learning. As mentioned above, the average reward, under some conditions, does not depend on any state and is constant over the whole state space [12]. This facilitates the use of average reward algorithms.
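For concreteness, the following Python sketch contrasts the two tabular update rules, Equations (3)-(4) for Sarsa and (5)-(6) for R-learning. The dictionary-based tables, function names and default parameter values are our own illustrative choices, not part of the original algorithm descriptions.

```python
from collections import defaultdict

# Tables of action values; missing entries default to 0, as required by step 1).
Q = defaultdict(float)
R = defaultdict(float)

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Discounted Sarsa update, Equations (3) and (4)."""
    q_hat = r + gamma * Q[(s_next, a_next)]        # new estimate of Q(s_t, a_t)
    Q[(s, a)] += alpha * (q_hat - Q[(s, a)])

def r_learning_update(R, rho, s, a, r, s_next, actions, alpha_R=0.1, alpha_rho=0.005):
    """Average-reward R-learning update, Equations (5) and (6). Returns the new rho."""
    max_next = max(R[(s_next, b)] for b in actions)
    max_curr = max(R[(s, b)] for b in actions)
    greedy = R[(s, a)] == max_curr                 # was a greedy/non-random action chosen?
    R[(s, a)] += alpha_R * (r - rho + max_next - R[(s, a)])
    if greedy:
        rho += alpha_rho * (r - rho + max_next - max_curr)
    return rho
```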

Following the basic R-learning algorithm, [10] proposed some variations. The variations mainly focus on different ways to update the average reward, corresponding to step 5 given above.

3 ZCS Classifier System and Its Testing Environments

3.1 A Brief Description of ZCS

The following is a brief description of ZCS; further information can be found in [6] and [8]. The ZCS architecture was introduced by Stewart Wilson in 1994. It is a Michigan-style LCS without internal memory, which periodically receives a binary-encoded input from its environment. The system determines an appropriate response based on this input and performs the indicated action, usually altering the state of the environment. The action is rewarded by a scalar reinforcement. Internally the system cycles through a sequence of performance, reinforcement and discovery.

The ZCS rule base consists of a population of classifiers, denoted [P]. This population has a fixed maximum size N. Each classifier is a condition-action-strength rule (c, a, sr). The rule condition c is a string of characters from the ternary alphabet {0, 1, #}, where # acts as a wildcard allowing a classifier to generalize over different input messages. The action a ∈ {a_1, ..., a_n} is represented by a binary string, and both conditions and actions are initialized randomly. The strength scalar sr acts as an indication of the perceived utility of that rule within the system. The strength of each rule is initialized to a predetermined value termed S_0.

On receipt of an environmental input message s_t, the rule base is scanned and any classifier whose condition matches the input message s_t is placed in the match set [M]. The match set [M] is a subset of the whole population [P] of classifiers. If, on some time step, [M] is empty or has a total strength Sr_{[M]} that is less than a fixed fraction φ (0 < φ ≤ 1) of the mean strength of the population [P], then a covering operator is invoked. A new rule is created with a condition that matches the environmental input and a randomly selected action. The rule's condition is then made less specific by the random inclusion of #'s at a probability of P_# per bit. The new rule is given a strength equal to the population average and inserted into the population, overwriting a rule selected for deletion. The deleted rules are chosen using roulette wheel selection based on the reciprocal of strength.

A particular action a is then selected from the match set by a roulette wheel selection policy based on the total strength Sr(s_t, a) of the classifiers in [M] that advocate that action. For each action a ∈ {a_1, ..., a_n} present in [M], Sr(s_t, a) is called the system strength and is computed as:

Sr(s_t, a) = \sum_{cl \in [M],\, cl.a = a} cl.sr    (7)

where cl stands for a classifier, cl.sr for the strength of cl, and cl.a for its action.
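The performance cycle just described can be sketched as follows in Python. The Classifier class and the helper names are our own simplifications; in particular, the sketch assumes the match set is non-empty (covering guarantees this in ZCS) and that all strengths are positive, as roulette wheel selection requires.

```python
import random
from dataclasses import dataclass

@dataclass
class Classifier:
    condition: str    # string over the ternary alphabet {'0', '1', '#'}
    action: int
    strength: float

def matches(condition, message):
    """A condition matches a binary message if every non-# position agrees."""
    return all(c == '#' or c == m for c, m in zip(condition, message))

def build_match_set(population, message):
    return [cl for cl in population if matches(cl.condition, message)]

def system_strength(match_set, action):
    """Equation (7): total strength of the classifiers in [M] advocating `action`."""
    return sum(cl.strength for cl in match_set if cl.action == action)

def roulette_action(match_set):
    """Select an action with probability proportional to its system strength."""
    actions = sorted({cl.action for cl in match_set})
    strengths = [system_strength(match_set, a) for a in actions]
    pick = random.uniform(0.0, sum(strengths))
    acc = 0.0
    for a, s in zip(actions, strengths):
        acc += s
        if pick <= acc:
            return a
    return actions[-1]
```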

When an action has been selected, all rules in [M] that advocate this action are placed in the action set [A] and the system executes the action. Depending on environmental circumstances, a scalar reward r (possibly null) is supplied to ZCS as a consequence of executing a, together with a new input configuration s_{t+1}. Reinforcement in ZCS consists of redistributing payoff between subsequent action sets. In each cycle, a "bucket-brigade" credit-assignment policy similar to Sarsa is employed:

1) A fixed fraction β (0 < β ≤ 1) of the strength of each member of [A]_t at the current time step t is deducted and placed in a common bucket B_t: sr_{[A]_t}(i) ← (1 − β) sr_{[A]_t}(i), B_t = β Σ_i sr_{[A]_t}(i), where sr_{[A]_t}(i) stands for the strength of the i-th classifier of [A]_t. B_t is initially set to zero.
2) If a reward r is received from the environment as a consequence of executing a_{t-1} at the previous time step t−1, then a fixed fraction β of r is distributed evenly amongst the members of [A]_{t-1}: sr_{[A]_{t-1}}(i) ← sr_{[A]_{t-1}}(i) + β r / |[A]_{t-1}|, where |[A]_{t-1}| is the number of classifiers in [A]_{t-1}.
3) Classifiers in [A]_{t-1} (if it is non-empty) have their strengths incremented by γ B_t / |[A]_{t-1}|: sr_{[A]_{t-1}}(i) ← sr_{[A]_{t-1}}(i) + γ B_t / |[A]_{t-1}|, where γ is a predetermined discount factor (0 < γ ≤ 1) and B_t is the total amount put in the current bucket in step 1.
4) Finally, the bucket B_t is emptied, and all classifiers in the set difference [M]_t − [A]_t have their strengths reduced by a small fraction τ (0 < τ ≤ 1), which acts as a "tax" to encourage exploitation of strong classifier sets: ∀ cl ∈ [M]_t, cl ∉ [A]_t: cl.sr ← (1 − τ) cl.sr.

The above process can then be written as a re-assignment:

Sr_{[A]_{t-1}} \leftarrow Sr_{[A]_{t-1}} + \beta \left( r + \gamma Sr_{[A]_t} - Sr_{[A]_{t-1}} \right)    (8)

where Sr_{[A]_{t-1}} is the total strength of the members of [A]_{t-1}, also known as Sr(s_{t-1}, a_{t-1}), and Sr_{[A]_t} is the total strength of the members of [A]_t, also known as Sr(s_t, a_t). So, Equation (8) can be rewritten as

Sr(s_{t-1}, a_{t-1}) \leftarrow Sr(s_{t-1}, a_{t-1}) + \beta \left( r + \gamma Sr(s_t, a_t) - Sr(s_{t-1}, a_{t-1}) \right)    (9)

ZCS employs a GA as its discovery mechanism, operating over the whole rule set [P] (panmictic). On each cycle there is a fixed probability of GA invocation (the GA rate). When called, the GA uses roulette wheel selection to determine the parent rules based on strength. Two offspring are produced via crossover (single point, with probability χ) and mutation (with probability μ). The parents then donate half their strength to their offspring, who replace existing members of the population. The deleted rules are chosen based on the reciprocal of strength.
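A compact Python sketch of the bucket-brigade credit assignment of steps 1)-4) above, reusing the hypothetical Classifier objects from the previous sketch; the parameter values are only illustrative defaults, not the settings used in the paper's experiments.

```python
def bucket_brigade(prev_action_set, action_set, match_set, reward,
                   beta=0.2, gamma=0.71, tau=0.1):
    """One reinforcement cycle of ZCS (steps 1-4, summarized by Equations (8)-(9)).

    prev_action_set is [A]_{t-1}, action_set is [A]_t, match_set is [M]_t.
    """
    # 1) deduct a fraction beta of each member of [A]_t into the common bucket
    bucket = 0.0
    for cl in action_set:
        bucket += beta * cl.strength
        cl.strength *= (1.0 - beta)
    if prev_action_set:
        n = len(prev_action_set)
        for cl in prev_action_set:
            # 2) distribute a fraction beta of the external reward evenly over [A]_{t-1}
            cl.strength += beta * reward / n
            # 3) pass the discounted bucket back to [A]_{t-1}
            cl.strength += gamma * bucket / n
    # 4) tax classifiers that matched but did not advocate the chosen action
    in_action_set = {id(cl) for cl in action_set}
    for cl in match_set:
        if id(cl) not in in_action_set:
            cl.strength *= (1.0 - tau)
```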

3.2 Maze Environments

Maze problems, usually represented as grid-like two-dimensional areas that may contain different objects of any quantity and with different properties (for example obstacle, goal, or empty), serve as a simplified virtual model of a real environment, and can be used for developing the core algorithms of many real-world applications related to the problem of navigation. The agent should learn the shortest path to the goal states, without knowing the environmental model in advance.

[Figure 1. (a) The Maze6 environment; (b) the Woods14 environment. The food object is marked with F, and obstacles are marked with T.]

LCSs have been the most widely used class of algorithms for reinforcement learning in mazes for the last twenty years, and have presented the most promising performance results [6, 13]. A typical example is the Woods1 maze environment [6]. A maze may contain different obstacles in any quantity, such as T standing for tree, and some objects for learning purposes, like the virtual food F, which is the agent's goal to reach. It must be noted that, if a maze does not have enough obstacles to mark its boundary, the left and right edges of the maze are connected, as are the top and bottom. In this paper, the agent is randomly placed in the maze on an empty cell, and the agent has two boolean sensors for each of the eight adjacent squares. The agent can move into any adjacent square that is free.

4 Adding R-learning to ZCS

In this section we show how ZCS can be modified to include R-learning [9, 10] in order to optimize average reward, which is different from the discounted reward optimized by Sarsa-style learning. The implementation of our system, ZCSAR, is also discussed here.

As mentioned above, ZCS uses a "bucket-brigade" credit-assignment policy similar to Sarsa to update the classifier population. From Equation (9), the bucket-brigade algorithm in ZCS is indeed similar to the Sarsa update rule (3). In addition, the comparison shows that (i) ZCS represents each entry of the Q-table by a set of classifiers, i.e. Q(s_{t-1}, a_{t-1}) is represented by the classifiers in [A]_{t-1}, and Q(s_t, a_t) is represented by the classifiers in [A]_t; (ii) the system strength Sr(s_{t-1}, a_{t-1}),

specified in Equation (7) and also written Sr_{[A]_{t-1}}, corresponds to the value Q(s_t, a_t) in Equation (3), and r + γ Sr(s_t, a_t) in Equation (9) corresponds to the estimate \hat{Q}(s_t, a_t) of the value Q(s_t, a_t) in Equation (4); (iii) only one entry Q(s_t, a_t) is updated by the tabular Sarsa algorithm at time step t, while in ZCS a set of classifiers is usually updated in one time step.

R-learning was introduced in Section 2 and is a different type of reinforcement learning. R-learning and the Sarsa algorithm are similar in form but not in meaning, since the Sarsa algorithm is based on discounted reward optimality, while R-learning, based on average reward optimality, maximizes the average reward per step. In R-learning, we can define the estimate of R(s_t, a_t) as

\hat{R}(s_t, a_t) = r_{imm}(s_t, a_t) - \rho + \max_{a \in A} R(s_{t+1}, a)    (10)

Thus, Equation (5) can be rewritten as

R(s_t, a_t) \leftarrow R(s_t, a_t) + \alpha_R \left[ \hat{R}(s_t, a_t) - R(s_t, a_t) \right]    (11)

The major difference between Equations (11) and (3) is that they use different methods to compute the estimates \hat{R}(s_t, a_t) and \hat{Q}(s_t, a_t). Additionally, R-learning needs to estimate the average reward ρ, which is extra work compared with the Sarsa algorithm.

The analogies between Sarsa and ZCS, and the differences and similarities between Sarsa and R-learning, have now been presented. From them we conclude that the system strength Sr(s_{t-1}, a_{t-1}) in ZCS corresponds to the action value R(s_t, a_t), and r + γ Sr(s_t, a_t) in ZCS corresponds to the new estimate \hat{R}(s_t, a_t) of R(s_t, a_t). In order to add R-learning to ZCS, we only need to focus on how \hat{R}(s_t, a_t) in Equation (10) and r + γ Sr(s_t, a_t) in Equation (9) are computed. Given the correspondence between the system strength Sr(s_{t-1}, a_{t-1}) and the action value R(s_t, a_t), the average reward approach replaces r + γ Sr(s_t, a_t) in Equation (9) with r − ρ + Sr(s_t, a_t). Thus, Equation (9) is changed to:

Sr(s_{t-1}, a_{t-1}) \leftarrow Sr(s_{t-1}, a_{t-1}) + \beta \left( r - \rho + Sr(s_t, a_t) - Sr(s_{t-1}, a_{t-1}) \right)    (12)

Equation (12) replaces Equation (9) in ZCS and changes the whole reinforcement learning mechanism employed by the original ZCS. Concerning the specific update rule for the classifiers in [A]_{t-1}, steps 2 and 3 of Section 3.1 are modified as follows:

2) If a reward r is received from the environment as a consequence of executing a_{t-1} at the previous time step t−1, and the estimate of the average reward is ρ, then a fixed fraction β of (r − ρ) is distributed evenly amongst the members of [A]_{t-1}: sr_{[A]_{t-1}}(i) ← sr_{[A]_{t-1}}(i) + β (r − ρ) / |[A]_{t-1}|, where |[A]_{t-1}| is the number of classifiers in [A]_{t-1}.
3) Classifiers in [A]_{t-1} (if it is non-empty) have their strengths incremented by B_t / |[A]_{t-1}|: sr_{[A]_{t-1}}(i) ← sr_{[A]_{t-1}}(i) + B_t / |[A]_{t-1}|, where B_t is the total amount put in the current bucket.
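The modified credit assignment (Equation (12) together with the revised steps 2) and 3)) might look as follows in code. This is an illustration built on the earlier hypothetical Classifier sketch, not the authors' implementation; in particular, it assumes the tax of step 4) is kept unchanged from ZCS.

```python
def zcsar_credit_assignment(prev_action_set, action_set, match_set, reward, rho,
                            beta=0.2, tau=0.1):
    """Average-reward variant of the bucket brigade (Equation (12)).

    Differences from standard ZCS: rho is subtracted from the external reward,
    and the bucket passed back to [A]_{t-1} is no longer discounted by gamma.
    """
    # 1) deduct a fraction beta of each member of [A]_t into the bucket (unchanged)
    bucket = 0.0
    for cl in action_set:
        bucket += beta * cl.strength
        cl.strength *= (1.0 - beta)
    if prev_action_set:
        n = len(prev_action_set)
        for cl in prev_action_set:
            cl.strength += beta * (reward - rho) / n   # revised step 2)
            cl.strength += bucket / n                  # revised step 3): undiscounted bucket
    # step 4), the tax on [M]_t - [A]_t, is assumed unchanged from ZCS
    in_action_set = {id(cl) for cl in action_set}
    for cl in match_set:
        if id(cl) not in in_action_set:
            cl.strength *= (1.0 - tau)
```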

Next, a procedure to estimate the average reward ρ needs to be added to ZCS. Step 5 in the description of the R-learning algorithm in Section 2.2 can be moved to ZCS with some modifications. To do so, step 5 of Section 2.2 is rewritten as follows: if Sr_{[A]_{t-1}} = max_{a∈A} Sr(s_{t-1}, a) (i.e. if a greedy/non-random action a_{t-1} was chosen), then update the average reward ρ according to the rule:

\rho \leftarrow \rho + \alpha_\rho \left[ r - \rho + \max_{a \in A} Sr(s_t, a) - \max_{a \in A} Sr(s_{t-1}, a) \right]    (13)

This new form of step 5 is inserted into the procedure of ZCS, located just before the update of the classifiers in [A]_{t-1}. It must be noted that, at the first time step of each trial in an experiment, there is no need to update the average reward, since no previous environmental reward is available at that time. At the beginning of an experiment, ρ is initialized to zero. In addition, the updated value of the average reward ρ is not used in Equation (12) directly. Instead, its more stable moving average is adopted, to avoid the heavy oscillations of the raw updates, since the average reward is updated with the immediate reward r, which is stochastic and fluctuates greatly. The window size for the moving average is 100, i.e. the moving average is computed as the average of the last 100 updated values. If the window size is too small, the moving average has no effect; if the window size is too big, the changing trend of the average reward is hidden, which limits the immediate feedback function of the average reward.

Through the two steps above, we have replaced the Sarsa-like algorithm in ZCS with R-learning, obtaining the new system ZCSAR. However, in order to speed up convergence in ZCSAR, the fluctuation of the estimate ρ needs to be reduced over time. We therefore make the learning rate α_ρ in Equation (13) decay over time using a simple rule:

\alpha_\rho \leftarrow \alpha_\rho - \frac{\alpha_\rho^{max} - \alpha_\rho^{min}}{NumOfTrials}    (14)

where α_ρ^{max} is the initial value of α_ρ, α_ρ^{min} is the minimum learning rate required, and NumOfTrials is the number of exploration trials (problems) in an experiment. α_ρ is updated at the beginning of each exploration trial using Equation (14), not at each time step.
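The ρ-estimation machinery just described (Equation (13), the 100-value moving average, and the decay rule of Equation (14)) could be organized as in the following sketch. The class and method names are our own, and the default parameter values only mirror the experimental settings reported in Section 6, with the minimum learning rate being an assumption.

```python
from collections import deque

class AverageRewardEstimator:
    """Tracks rho via Equation (13), smoothed by a moving average of window 100."""

    def __init__(self, alpha_max=0.005, alpha_min=0.0001, num_trials=10000, window=100):
        self.rho = 0.0
        self.alpha = alpha_max
        self.alpha_min = alpha_min
        self.decrement = (alpha_max - alpha_min) / num_trials   # Equation (14)
        self.history = deque(maxlen=window)

    def start_trial(self):
        # alpha_rho is decayed once per exploration trial, not at each time step
        self.alpha = max(self.alpha - self.decrement, self.alpha_min)

    def update(self, reward, max_sr_next, max_sr_prev, greedy):
        # Equation (13): rho is only updated after a greedy (non-random) action
        if greedy:
            self.rho += self.alpha * (reward - self.rho + max_sr_next - max_sr_prev)
        self.history.append(self.rho)

    def smoothed(self):
        """Moving average of the last `window` values of rho, used in Equation (12)."""
        return sum(self.history) / len(self.history) if self.history else 0.0
```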

5 Subtraction Trouble and Tournament Selection

When ZCSAR uses Equation (12) as its reinforcement learning mechanism, some issues arise. The update rule for the system strength Sr(·, ·) differs from the rule of Sarsa-style learning in subtracting the average reward ρ from the immediate reward and in not discounting the next system strength (action value). The subtraction may make the system strength Sr(·, ·) negative, which does not occur in the original ZCS or in the discounted reward reinforcement learning algorithm Sarsa. A negative Sr(·, ·) occurs when the value of r − ρ + Sr(s_t, a_t) is continuously negative over a number of time steps. In most time steps the reward is delayed, so r is zero. Estimating the average reward ρ is not an easy task in sparse reward domains; the estimate may differ greatly from the true average reward in the early stages of learning. Thus, whether the value of r − ρ + Sr(s_t, a_t) is negative or not depends mainly on the difference between ρ and Sr(s_t, a_t). If Sr(·, ·) is negative, the sum of the strengths of the classifiers in the action set is also less than zero, which means that some classifiers in the action set have negative strength. However, all components of ZCS were designed on the supposition that a classifier's strength is greater than zero. In particular, roulette wheel selection (proportionate selection) based on a classifier's strength (or its reciprocal) is adopted as the action selection method in the match set [M], as the parent selection method in the GA, and as the classifier selection method for GA deletion and covering-operator deletion. It is known that a classifier's strength must be positive for roulette wheel selection. ZCS is in line with this requirement, but ZCSAR is not. This is the problem caused by the subtraction.

To address this problem, an easy way is to set negative values to zero, i.e. to keep a classifier's strength from falling below zero. We call this method "truncation". In other words, if ZCSAR still uses roulette wheel selection, truncation is an easy way to adapt it. However, is the truncation method proper and effective for ZCSAR? Is there any alternative way to tackle this problem?

A promising proposal is to replace roulette wheel selection with tournament selection in ZCSAR. Tournament selection with tournament sizes proportionate to the actual set size has been shown to outperform roulette wheel selection in the widely used classifier system XCS [14]. So it is expected that tournament selection can also improve the performance of ZCSAR. Importantly, in contrast to roulette wheel selection, tournament selection is independent of fitness scaling and does not require positive classifier strengths, so classifier strengths can be less than zero in ZCSAR with tournament selection.

In tournament selection, classifiers are not selected in proportion to their strength; instead, tournaments are held in which the classifier with the highest strength wins. Stochastic tournaments are not considered here. Participants for a tournament are chosen at random from the classifier set in which selection is applied. The tournament size depends on the size of that classifier set: each tournament contains a fraction, in (0, 1], of the corresponding classifier set, and this fraction controls the selection pressure. Instead of roulette wheel selection for action selection in the match set [M], parent selection in the GA, and classifier deletion selection in the GA and the covering operator, three independent kinds of tournaments are held in which the classifier with the highest (or lowest) strength is selected; the corresponding fractions are 0.1, 0.4 and 0.6 respectively. In the remainder of this work we refer to our proposals as "ZCSAR+Roulette" and "ZCSAR+Tournament", indicating ZCSAR with roulette wheel selection and the truncation method, and ZCSAR with tournament selection, respectively.
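A minimal sketch of such strength-based tournament selection is given below (Python, reusing the hypothetical Classifier objects from the earlier sketches). The exact way the winner's action is used for action selection in [M] is our assumption, since the paper specifies only the tournament-size fractions.

```python
import random

def tournament_select(classifier_set, fraction, lowest=False):
    """Pick one classifier by a deterministic tournament.

    `fraction` is the share of the set entering the tournament (e.g. 0.1 for action
    selection, 0.4 for GA parent selection, 0.6 for deletion). Set lowest=True when
    selecting a classifier for deletion, so that the weakest participant wins.
    Negative strengths are handled naturally, unlike in roulette wheel selection.
    """
    size = max(1, int(round(fraction * len(classifier_set))))
    participants = random.sample(list(classifier_set), size)
    if lowest:
        return min(participants, key=lambda cl: cl.strength)
    return max(participants, key=lambda cl: cl.strength)

def tournament_action(match_set, fraction=0.1):
    """Action selection in [M]: execute the action of the tournament winner."""
    return tournament_select(match_set, fraction).action
```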

6 Experiments in Maze Environments

Two maze problems are tested and studied here to illustrate the generality and effectiveness of our approaches, with ZCS used for comparison.

6.1 Experimental Setup

Each experiment consists of 12000 problems (trials) that the agent must solve. For each problem, the agent is placed in a randomly chosen empty square of the maze. The agent then moves under the control of the classifier system, avoiding obstacles, until either it reaches the food or it has taken 500 steps, at which point the problem ends unconditionally. The agent does not change its position if it chooses an action that would move it into a square containing an obstacle, though one time step still elapses. When the agent reaches the food, it receives a constant reward of 1000; otherwise, it receives a reward of 0. In order to evaluate the final policy evolved, in each experiment exploration is turned off during the last 2000 problems and the system works only in exploitation. In exploitation problems, the action which predicts the highest payoff is always selected in the match set [M], and the genetic algorithm is turned off. System performance is computed as the average number of steps to food over the last 50 problems. Every statistical result presented in this paper is averaged over 10 experiments.

The following classifier structure was used for the LCSs in the experiments: each classifier has 16 binary bits in the condition field, two bits for each of the 8 neighbouring squares, with 00 representing the situation that the square is empty, 11 that it contains food (F), and 10 that it is an obstacle (T). The general LCS parameters used for ZCS, ZCSAR+Roulette and ZCSAR+Tournament are set as follows: β = 0.6, τ = 0.1, GA rate = 0.25, φ = 0.5, χ = 0.5, μ = 0.002, S_0 = 20.0, P_# = 0.33, N = 800. Some specific parameters are set as follows: for ZCSAR+Roulette and ZCSAR+Tournament, α_ρ^{max} = 0.005 and α_ρ^{min} = 0.0001; and in ZCS, γ = 0.71. A detailed description of these parameters is available in [6] and [8].

6.2 Experimental Results and Discussion

In the first experiment, we applied ZCS, ZCSAR+Roulette and ZCSAR+Tournament to the Maze6 environment (Figure 1(a)). Maze6 is a typical and somewhat difficult environment for testing learning systems, since the goal position the agent has to reach is hidden behind obstacles and there is no regularity in the maze. Almost every sensory-action pair in this maze needs a special classifier to cover it (i.e. it allows only few generalizations), so ZCS is likely to produce over-general classifiers in it. Besides, the optimal solution in Maze6 requires the agent to perform long sequences of actions to reach the goal state. The optimal average path to the food in Maze6 is 5.19 steps. This experiment is used to show that ZCS with average reward reinforcement learning can solve a general maze problem.

Figure 2 reports the performance of ZCS, ZCSAR+Roulette and ZCSAR+Tournament in the Maze6 environment. In all three cases the results converge to near optimum during the last 2000 exploitation problems, and there is almost no difference between them: about 5.85, 6.2 and 6.02 steps respectively. ZCSAR+Roulette and ZCSAR+Tournament perform almost as well as ZCS in this environment. During the learning period (the first 10000 problems), the performance of the three systems deviates from the optimum, since the GA continues to function and probabilistic action selection (roulette wheel selection or tournament selection) is

used. In addition, ZCSAR+Tournament changes continuously and oscillates heavily within the first 10000 learning problems, which is possibly caused by tournament selection being used as the action selection mechanism in the match set [M].

[Figure 2. Performance (number of steps to goal versus number of problems) of ZCSAR+Roulette and ZCSAR+Tournament in Maze6, compared with ZCS and the optimum. Error bars represent the standard error. Curves are averages over 10 experiments.]

[Figure 3. Performance (number of steps to goal versus number of problems) of ZCSAR+Roulette and ZCSAR+Tournament in Woods14, compared with ZCS and the optimum. Error bars represent the standard error. Curves are averages over 10 experiments.]

In the second experiment, the testing environment is Woods14 (Figure 1(b)), a corridor of 18 blank cells with a food cell at the end. The optimal average path to the food in Woods14 is 9.5 steps. The agent needs longer sequences of actions to reach the goal position, resulting in a sparser reception of delayed reward, so it is a complex problem for most LCSs [15].

It can be seen from Figure 3 that, in Woods14, the performance of the three systems oscillates above the optimum during the training period, while promising solutions evolve during the last 2000 exploitation problems. ZCSAR+Tournament takes about 9.50 steps to find the food, and ZCS takes about 10.70 steps. ZCSAR+Roulette performs less well, converging to about 12.36 steps. ZCSAR+Tournament can thus obtain the optimal solution in Woods14. This seems to be due to the average reward reinforcement learning and the tournament selection employed by ZCSAR+Tournament, which ensure that the system can effectively disambiguate the early states in long action chains.

7 Conclusions

In this paper, owing to the similarity between Sarsa and the bucket-brigade algorithm in ZCS, and the similarity in form between the Sarsa algorithm and R-learning, the bucket-brigade algorithm in ZCS was replaced with R-learning through some modifications. R-learning is an undiscounted reinforcement learning technique that optimizes average reward, a different metric from the discounted reward optimized by the bucket-brigade algorithm. Thus ZCS with R-learning, called ZCSAR, is able to maximize the average reward per time step rather than the cumulative discounted reward. This helps to support long action chains in large multi-step learning problems.

However, R-learning causes the strength of some classifiers in ZCSAR to become negative. This violates the supposition, made throughout ZCS, that a classifier's strength is greater than zero. In particular, roulette wheel selection based on a classifier's strength (or its reciprocal), as used in ZCS, requires that a classifier's strength be positive. To address this problem, two extended systems were presented: "ZCSAR+Roulette" and "ZCSAR+Tournament". ZCSAR+Roulette denotes ZCSAR with roulette wheel selection and the truncation method, while ZCSAR+Tournament denotes ZCSAR with tournament selection. Truncation means cutting off negative strength values, i.e. setting them to zero.

We tested ZCSAR+Roulette and ZCSAR+Tournament on two well-known multi-step problems and compared them with ZCS. Overall, the experiments show that ZCSAR+Tournament can evolve optimal or near-optimal solutions in these typically difficult multi-step environments, while ZCSAR+Roulette only reaches a suboptimum in the Woods14 environment. Especially in the Woods14 environment the performance of ZCSAR+Tournament is very good, whereas ZCS only reaches a near-optimal performance. Because of the basic change to the reinforcement learning employed by ZCS, and because tournament selection replaces roulette wheel selection, ZCSAR+Tournament still needs extra testing to study its performance on other problems. Additionally, we plan to consider the impact of average reward reinforcement learning in ZCS when the environment is stochastic.

References:

[1] Bull, L. A brief history of learning classifier systems: from CS-1 to XCS and its variants. Evolutionary Intelligence, 2015.
[2] Ebadi, T., et al. Human-interpretable Feature Pattern Classification System using Learning

Classifier Systems. Evolutionary Computation.
[3] Tzima, F.A. and Mitkas, P.A. ZCS Revisited: Zeroth-Level Classifier Systems for Data Mining. In: Proceedings of the 2008 IEEE International Conference on Data Mining Workshops. IEEE Computer Society, Washington, DC, USA, 2008.
[4] Cádrik, T. and Mach, M. Control of agents in a multi-agent system using ZCS evolutionary classifier systems. In: 2014 IEEE 12th International Symposium on Applied Machine Intelligence and Informatics (SAMI). IEEE, Herl'any, Slovakia, 2014.
[5] Cádrik, T. and Mach, M. Usage of ZCS Evolutionary Classifier System as a Rule Maker for Cleaning Robot Task. In: Sinčák, P., et al. (eds.), Emergent Trends in Robotics and Intelligent Systems. Springer International Publishing, 2015.
[6] Wilson, S.W. ZCS: A zeroth level classifier system. Evolutionary Computation, 1994, 2(1): 1-18.
[7] Wilson, S.W. Classifier Fitness Based on Accuracy. Evolutionary Computation, 1995, 3(2): 149-175.
[8] Bull, L. and Hurst, J. ZCS Redux. Evolutionary Computation, 2002, 10(2): 185-205.
[9] Schwartz, A. A reinforcement learning method for maximizing undiscounted rewards. In: Utgoff, P. (ed.), Proceedings of the Tenth International Conference on Machine Learning. Morgan Kaufmann, 1993.
[10] Singh, S.P. Reinforcement learning algorithms for average-payoff Markovian decision processes. In: Proceedings of the Twelfth National Conference on Artificial Intelligence (vol. 1). American Association for Artificial Intelligence, Menlo Park, CA, USA, 1994.
[11] Sutton, R.S. and Barto, A.G. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 1998.
[12] Mahadevan, S. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 1996, 22: 159-195.
[13] Zatuchna, Z. and Bagnall, A. A learning classifier system for mazes with aliasing clones. Natural Computing.
[14] Butz, M.V., Sastry, K. and Goldberg, D.E. Strong, Stable, and Reliable Fitness Pressure in XCS due to Tournament Selection. Genetic Programming and Evolvable Machines, 2005, 6(1): 53-77.
[15] Zang, Z., et al. Learning classifier system with average reward reinforcement learning. Knowledge-Based Systems, 2013.


More information

Deep Learning: Theory, Techniques & Applications - Recurrent Neural Networks -

Deep Learning: Theory, Techniques & Applications - Recurrent Neural Networks - Deep Learning: Theory, Techniques & Applicaions - Recurren Neural Neworks - Prof. Maeo Maeucci maeo.maeucci@polimi.i Deparmen of Elecronics, Informaion and Bioengineering Arificial Inelligence and Roboics

More information

Solutions from Chapter 9.1 and 9.2

Solutions from Chapter 9.1 and 9.2 Soluions from Chaper 9 and 92 Secion 9 Problem # This basically boils down o an exercise in he chain rule from calculus We are looking for soluions of he form: u( x) = f( k x c) where k x R 3 and k is

More information

Basic Circuit Elements Professor J R Lucas November 2001

Basic Circuit Elements Professor J R Lucas November 2001 Basic Circui Elemens - J ucas An elecrical circui is an inerconnecion of circui elemens. These circui elemens can be caegorised ino wo ypes, namely acive and passive elemens. Some Definiions/explanaions

More information

Rapid Termination Evaluation for Recursive Subdivision of Bezier Curves

Rapid Termination Evaluation for Recursive Subdivision of Bezier Curves Rapid Terminaion Evaluaion for Recursive Subdivision of Bezier Curves Thomas F. Hain School of Compuer and Informaion Sciences, Universiy of Souh Alabama, Mobile, AL, U.S.A. Absrac Bézier curve flaening

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

A Reinforcement Learning Approach for Collaborative Filtering

A Reinforcement Learning Approach for Collaborative Filtering A Reinforcemen Learning Approach for Collaboraive Filering Jungkyu Lee, Byonghwa Oh 2, Jihoon Yang 2, and Sungyong Park 2 Cyram Inc, Seoul, Korea jklee@cyram.com 2 Sogang Universiy, Seoul, Korea {mrfive,yangjh,parksy}@sogang.ac.kr

More information

Fishing limits and the Logistic Equation. 1

Fishing limits and the Logistic Equation. 1 Fishing limis and he Logisic Equaion. 1 1. The Logisic Equaion. The logisic equaion is an equaion governing populaion growh for populaions in an environmen wih a limied amoun of resources (for insance,

More information

Some Ramsey results for the n-cube

Some Ramsey results for the n-cube Some Ramsey resuls for he n-cube Ron Graham Universiy of California, San Diego Jozsef Solymosi Universiy of Briish Columbia, Vancouver, Canada Absrac In his noe we esablish a Ramsey-ype resul for cerain

More information

Chapter Floating Point Representation

Chapter Floating Point Representation Chaper 01.05 Floaing Poin Represenaion Afer reading his chaper, you should be able o: 1. conver a base- number o a binary floaing poin represenaion,. conver a binary floaing poin number o is equivalen

More information

Air Traffic Forecast Empirical Research Based on the MCMC Method

Air Traffic Forecast Empirical Research Based on the MCMC Method Compuer and Informaion Science; Vol. 5, No. 5; 0 ISSN 93-8989 E-ISSN 93-8997 Published by Canadian Cener of Science and Educaion Air Traffic Forecas Empirical Research Based on he MCMC Mehod Jian-bo Wang,

More information

In this chapter the model of free motion under gravity is extended to objects projected at an angle. When you have completed it, you should

In this chapter the model of free motion under gravity is extended to objects projected at an angle. When you have completed it, you should Cambridge Universiy Press 978--36-60033-7 Cambridge Inernaional AS and A Level Mahemaics: Mechanics Coursebook Excerp More Informaion Chaper The moion of projeciles In his chaper he model of free moion

More information

Shiva Akhtarian MSc Student, Department of Computer Engineering and Information Technology, Payame Noor University, Iran

Shiva Akhtarian MSc Student, Department of Computer Engineering and Information Technology, Payame Noor University, Iran Curren Trends in Technology and Science ISSN : 79-055 8hSASTech 04 Symposium on Advances in Science & Technology-Commission-IV Mashhad, Iran A New for Sofware Reliabiliy Evaluaion Based on NHPP wih Imperfec

More information

Applying Genetic Algorithms for Inventory Lot-Sizing Problem with Supplier Selection under Storage Space

Applying Genetic Algorithms for Inventory Lot-Sizing Problem with Supplier Selection under Storage Space Inernaional Journal of Indusrial and Manufacuring Engineering Applying Geneic Algorihms for Invenory Lo-Sizing Problem wih Supplier Selecion under Sorage Space Vichai Rungreunganaun and Chirawa Woarawichai

More information

Stability and Bifurcation in a Neural Network Model with Two Delays

Stability and Bifurcation in a Neural Network Model with Two Delays Inernaional Mahemaical Forum, Vol. 6, 11, no. 35, 175-1731 Sabiliy and Bifurcaion in a Neural Nework Model wih Two Delays GuangPing Hu and XiaoLing Li School of Mahemaics and Physics, Nanjing Universiy

More information

Announcements: Warm-up Exercise:

Announcements: Warm-up Exercise: Fri Apr 13 7.1 Sysems of differenial equaions - o model muli-componen sysems via comparmenal analysis hp//en.wikipedia.org/wiki/muli-comparmen_model Announcemens Warm-up Exercise Here's a relaively simple

More information

Waveform Transmission Method, A New Waveform-relaxation Based Algorithm. to Solve Ordinary Differential Equations in Parallel

Waveform Transmission Method, A New Waveform-relaxation Based Algorithm. to Solve Ordinary Differential Equations in Parallel Waveform Transmission Mehod, A New Waveform-relaxaion Based Algorihm o Solve Ordinary Differenial Equaions in Parallel Fei Wei Huazhong Yang Deparmen of Elecronic Engineering, Tsinghua Universiy, Beijing,

More information

Notes on Kalman Filtering

Notes on Kalman Filtering Noes on Kalman Filering Brian Borchers and Rick Aser November 7, Inroducion Daa Assimilaion is he problem of merging model predicions wih acual measuremens of a sysem o produce an opimal esimae of he curren

More information

Particle Swarm Optimization

Particle Swarm Optimization Paricle Swarm Opimizaion Speaker: Jeng-Shyang Pan Deparmen of Elecronic Engineering, Kaohsiung Universiy of Applied Science, Taiwan Email: jspan@cc.kuas.edu.w 7/26/2004 ppso 1 Wha is he Paricle Swarm Opimizaion

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

UNIVERSITY OF OSLO DEPARTMENT OF ECONOMICS

UNIVERSITY OF OSLO DEPARTMENT OF ECONOMICS UNIVERSITY OF OSLO DEPARTMENT OF ECONOMICS Exam: ECON4325 Moneary Policy Dae of exam: Tuesday, May 24, 206 Grades are given: June 4, 206 Time for exam: 2.30 p.m. 5.30 p.m. The problem se covers 5 pages

More information

Chapter 21. Reinforcement Learning. The Reinforcement Learning Agent

Chapter 21. Reinforcement Learning. The Reinforcement Learning Agent CSE 47 Chaper Reinforcemen Learning The Reinforcemen Learning Agen Agen Sae u Reward r Acion a Enironmen CSE AI Faculy Why reinforcemen learning Programming an agen o drie a car or fly a helicoper is ery

More information

Estimation of Poses with Particle Filters

Estimation of Poses with Particle Filters Esimaion of Poses wih Paricle Filers Dr.-Ing. Bernd Ludwig Chair for Arificial Inelligence Deparmen of Compuer Science Friedrich-Alexander-Universiä Erlangen-Nürnberg 12/05/2008 Dr.-Ing. Bernd Ludwig (FAU

More information

Macroeconomic Theory Ph.D. Qualifying Examination Fall 2005 ANSWER EACH PART IN A SEPARATE BLUE BOOK. PART ONE: ANSWER IN BOOK 1 WEIGHT 1/3

Macroeconomic Theory Ph.D. Qualifying Examination Fall 2005 ANSWER EACH PART IN A SEPARATE BLUE BOOK. PART ONE: ANSWER IN BOOK 1 WEIGHT 1/3 Macroeconomic Theory Ph.D. Qualifying Examinaion Fall 2005 Comprehensive Examinaion UCLA Dep. of Economics You have 4 hours o complee he exam. There are hree pars o he exam. Answer all pars. Each par has

More information

Ensamble methods: Boosting

Ensamble methods: Boosting Lecure 21 Ensamble mehods: Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Schedule Final exam: April 18: 1:00-2:15pm, in-class Term projecs April 23 & April 25: a 1:00-2:30pm in CS seminar room

More information

Intermediate Macro In-Class Problems

Intermediate Macro In-Class Problems Inermediae Macro In-Class Problems Exploring Romer Model June 14, 016 Today we will explore he mechanisms of he simply Romer model by exploring how economies described by his model would reac o exogenous

More information