An Analysis of Actor/Critic Algorithms using Eligibility Traces: Reinforcement Learning with Imperfect Value Functions


Hajime Kimura*
Tokyo Institute of Technology
gen@fe.dis.titech.ac.jp

Shigenobu Kobayashi
Tokyo Institute of Technology
kobayasi@dis.titech.ac.jp

* Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama 226-852, Japan.

Abstract

We present an analysis of actor/critic algorithms in which the actor updates its policy using eligibility traces of the policy parameters. Most of the theoretical results for eligibility traces have been obtained only for the critic's value-iteration algorithms. This paper investigates what the actor's eligibility trace does. The results show that the algorithm is an extension of Williams' REINFORCE algorithms to infinite-horizon reinforcement tasks, and that the critic provides an appropriate reinforcement baseline for the actor. Thanks to the actor's eligibility trace, the actor improves its policy by using a gradient of the actual return, not a gradient of the estimated return in the critic. This enables the agent to learn a fairly good policy even under the condition that the approximated value function in the critic is hopelessly inaccurate for conventional actor/critic algorithms. Also, if an accurate value function is estimated by the critic, the actor's learning is dramatically accelerated in our test cases. The behavior of the algorithm is demonstrated through simulations of a linear quadratic control problem and a pole balancing problem.

1 Introduction

Actor/critic architecture is an adaptive version of policy iteration [Kaelbling et al. 96]. In general, policy iteration alternates two phases: a policy evaluation phase and a policy improvement phase. The actor implements a stochastic policy that maps from a representation of a state to a probability distribution over actions. The critic attempts to estimate the evaluation function for the current policy. The actor improves its control policy using the critic's temporal difference (TD) as an effective reinforcement. In many cases, the policy improvement is executed concurrently with the policy evaluation, because it is not feasible to wait for the policy evaluation to converge. Actor/critic algorithms have been successfully applied to a variety of delayed reinforcement tasks: the ASE/ACE architecture for pole balancing [Barto et al. 83] [Gullapalli 92], RFALCON for pole balancing and for control of a ball-beam system [Lin et al. 96], and a cart-pole swing-up task [Doya 96]. Although convergence results for actor/critic algorithms (e.g. [Williams et al. 90] and [Gullapalli 92]) are weaker than those for value-iteration based algorithms such as Q-learning [Watkins et al. 92], actor/critic algorithms have the following practical advantages.

- It is easy to implement multidimensional continuous action, which is often mixed with discrete action [Gullapalli 92]. Because the actor selects actions by its stochastic policy, the action-selection problems that arise in Q-learning do not exist.
- Q-learning needs to estimate returns for all state-action pairs, whereas the critic needs to estimate only the return of each state.
- Memoryless stochastic policies can be considerably better than memoryless deterministic policies in partially observable Markov decision processes (POMDPs) [Singh 94] [Jaakkola 94] or multi-player games [Littman 94].
- It is easy to incorporate an expert's knowledge into the learning system by applying conventional supervised learning techniques to the actor [Clouse et al. 92].

Eligibility traces are a fundamental mechanism that has been widely used to handle delayed reward [Singh 96].

The traces are also often used to overcome non-Markovian effects [Sutton 95] [Pendrith et al. 96]. In Barto, Sutton and Anderson's ASE/ACE architecture, both the critic and the actor make use of the eligibility trace. Theoretical results for eligibility traces in the context of TD(λ) [Sutton 88] have been obtained, but in actor/critic algorithms the effect of the actor's trace has not been investigated. This paper presents an analysis of an actor/critic algorithm in which the actor improves its policy using eligibility traces of the policy parameters. This may be the first analysis of the actor's eligibility traces.

2 Discounted Reward Criteria

At each discrete time t, the agent observes x_t containing information about its current state, selects action a_t, and then receives an instantaneous reward r_t resulting from the state transition in the environment. In general, the reward and the next state may be random, but their probability distributions are assumed to depend only on x_t and a_t in Markov decision processes (MDPs), in which many reinforcement learning algorithms are studied. The objective of reinforcement learning is to construct a policy that maximizes the agent's performance. A natural performance measure for infinite-horizon tasks is the cumulative discounted reward:

V_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k},   (1)

where the discount factor \gamma (0 \le \gamma < 1) specifies the importance of future rewards. V_t is called the actual return; it specifies how good the reward sequence after time t is. With this notation, the goal of learning is to maximize the expected return. In MDPs, the expected return can be defined for all states as:

V_{\pi}(x) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k r_k \mid x_0 = x \right],   (2)

where E_{\pi} denotes the expectation assuming the agent always uses the stationary policy \pi. V_{\pi}(x) is called the value function; it specifies how good the given state x is. In MDPs, the goal of learning is to find an optimal policy that maximizes the value of each state x defined by Equation 2. Although similar value functions can be given in POMDPs, the difficulties in defining them have been pointed out in [Singh 94].

3 Actor/Critic Algorithms

Figures 1 and 2 give an overview of actor/critic algorithms [Sutton 90] [Crites et al. 94]. There are many ways to implement the policy and its updating scheme in the actor. The algorithms for the critic are mostly TD methods. We should notice the following two points: one is that the actor implements a stochastic policy; the other is that the actor improves its policy using the TD-error. This paper especially investigates an algorithm for the actor.

Figure 1: A generic actor/critic framework. The agent consists of an actor (stochastic policy) and a critic (estimated value function \hat{V}(x)); the agent exchanges observations x, actions a and rewards r with the environment, and the critic feeds the TD-error r_t + \gamma \hat{V}(x_{t+1}) - \hat{V}(x_t) to the actor as a reinforcement.

1. The agent observes x_t in the environment, and the actor executes action a_t according to the current stochastic policy.
2. The critic receives the immediate reward r_t, and then observes the resulting next state x_{t+1}. The critic provides the TD-error as a useful reinforcement feedback to the actor, according to (TD-error) = r_t + \gamma \hat{V}(x_{t+1}) - \hat{V}(x_t), where \gamma (0 \le \gamma < 1) is the discount factor and \hat{V}(x) is the value function estimated by the critic.
3. The actor updates the stochastic policy using the TD-error. If (TD-error) > 0, action a_t performed relatively well and its probability should be increased. If (TD-error) < 0, action a_t performed relatively poorly and its probability should be decreased.
4. The critic updates the estimated value function \hat{V}(x) according to TD methods; e.g., the TD(0) algorithm adjusts \hat{V}(x_t) \leftarrow \hat{V}(x_t) + \alpha (TD-error), where \alpha is the learning rate.
5. Go to step 1.

Figure 2: Main loop of the generic actor/critic algorithm.
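The main loop of Figure 2 is compact enough to sketch directly. The following is a minimal illustrative sketch, not the authors' implementation: it assumes a tabular critic over an already-discretized observation, a hypothetical policy object exposing sample(x) and update(x, a, td_error), and a hypothetical environment with reset() and step(a) returning (next_state, reward, done).

```python
import numpy as np

def actor_critic_loop(env, policy, n_states, gamma=0.9, alpha=0.5, n_steps=10000):
    """Generic actor/critic main loop (Figure 2) with a tabular TD(0) critic.

    `policy` and `env` are assumed interfaces, not objects from the paper.
    """
    V = np.zeros(n_states)                       # critic's estimated value function
    x = env.reset()                              # step 1: observe the initial state
    for t in range(n_steps):
        a = policy.sample(x)                     # actor draws an action from its stochastic policy
        x_next, r, done = env.step(a)            # step 2: immediate reward and next state
        td_error = r + gamma * V[x_next] - V[x]  # TD-error = r_t + gamma*V(x_{t+1}) - V(x_t)
        policy.update(x, a, td_error)            # step 3: actor update driven by the TD-error
        V[x] += alpha * td_error                 # step 4: TD(0) critic update
        x = env.reset() if done else x_next      # step 5: continue (restart on failure)
    return V
```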

4 Adding an Eligibility Trace to the Actor

4.1 Function Approximation for Stochastic Policies

In this paper, \pi(a; W, x) denotes the probability of selecting action a under the policy \pi given the observation x. \pi(a; W, x) is taken to be a probability density function when the set of possible actions is continuous. The policy is represented by a parametric function approximator using the internal variable vector W. The agent can improve the policy by modifying W. For example, W corresponds to synaptic weights when the action-selection probability is represented by a neural network, or W means the weights of rules in classifier systems. The advantage of using the notation of the parametric function \pi() is that the computational restrictions and mechanisms of the agent can be specified simply by the form of the function, so that we can provide a sound theory of learning algorithms for arbitrary types of the actor.

4.2 Details of the Algorithm

Figure 3 specifies the actor/critic algorithm that uses the eligibility trace in the actor. The ASE/ACE system configured for pole balancing [Barto et al. 83] is just an instance of this algorithm. The actor's eligibility in step 3 is the same variable defined in Williams' REINFORCE algorithms [Williams 92]. The eligibility e_i(t) specifies a correlation between the associated policy parameter w_i and the executed action a_t. The eligibility trace D_i(t) is a discounted running average of the eligibility; it accumulates the agent's history. When a positive reinforcement is given, the actor updates W so that the probability of the actions recorded in the history is increased. It means the TD-error at time t affects not only the action a_t but also a_{t-1}, a_{t-2}, and so on. At first glance this idea seems senseless for improving the policy, but it has very interesting features, given in detail later. Note that the algorithm shown in Figure 3 is identical to a stochastic gradient ascent for discounted reward [Kimura et al. 97] when the actor's discount factor \beta equals \gamma and the \hat{V}(x) in the critic equals a constant b for all observations. The actor requires memory to implement W for the policy and to implement D_i for the eligibility trace; the amount of memory for D_i is equal to that for W.

1. The agent observes x_t, and the actor executes action a_t with probability \pi(a_t; W, x_t).
2. The critic receives the immediate reward r_t, and then observes the resulting next state x_{t+1}. The critic provides the TD-error to the actor according to

   (TD-error) = r_t + \gamma \hat{V}(x_{t+1}) - \hat{V}(x_t),   (3)

   where \gamma (0 \le \gamma < 1) is the discount factor and \hat{V}(x) is the value function estimated by the critic.
3. The actor updates the stochastic policy using the TD-error according to:

   Eligibility:       e_i(t) = \frac{\partial}{\partial w_i} \ln \pi(a_t; W, x_t)
   Eligibility trace: D_i(t) = e_i(t) + \beta D_i(t-1)
   \Delta w_i(t) = (TD-error) \, D_i(t)
   W \leftarrow W + \alpha_p \Delta W(t),

   where w_i denotes the i-th component of W, e_i and D_i are the associated eligibility and eligibility trace respectively, \beta (0 \le \beta < 1) is a discount factor for the eligibility trace, and \alpha_p is the learning rate for the actor.
4. The critic updates the estimated value function \hat{V}(x) according to TD methods; e.g., the TD(0) algorithm adjusts \hat{V}(x_t) \leftarrow \hat{V}(x_t) + \alpha (TD-error), where \alpha is the learning rate.
5. Go to step 1.

Figure 3: The actor/critic algorithm adding the eligibility trace to the actor.
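Step 3 of Figure 3 is the only part that differs from the generic loop. Below is a minimal sketch of such an actor update (sampling from \pi is left to the particular policy family); grad_log_pi is a hypothetical user-supplied function returning the eligibility vector e(t) = \partial/\partial W \ln \pi(a_t; W, x_t), and the TD-error is assumed to come from the critic as before.

```python
import numpy as np

class TraceActor:
    """Actor of Figure 3: policy parameters W with an eligibility trace D.

    grad_log_pi is an assumed helper returning d/dW ln pi(a | W, x);
    it is not from the paper.
    """
    def __init__(self, n_params, grad_log_pi, beta=0.9, alpha_p=0.01):
        self.W = np.zeros(n_params)      # policy parameters
        self.D = np.zeros(n_params)      # eligibility trace D_i(t), initially zero
        self.grad_log_pi = grad_log_pi
        self.beta = beta                 # discount factor for the trace
        self.alpha_p = alpha_p           # actor learning rate

    def update(self, x, a, td_error):
        e = self.grad_log_pi(self.W, x, a)           # eligibility e_i(t)
        self.D = e + self.beta * self.D              # D_i(t) = e_i(t) + beta * D_i(t-1)
        self.W += self.alpha_p * td_error * self.D   # Delta w_i(t) = (TD-error) * D_i(t)
```

Setting beta = 0 recovers the conventional actor/critic update of Figure 2, while beta = gamma gives the behavior analyzed in Section 4.3.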

4.3 An Analysis of the Algorithm

Assume that the actor's discount factor \beta equals \gamma, and that D_i(t) = 0 for all t < 0. Then the algorithm shown in Figure 3 updates the policy parameters as:

\sum_{t=0}^{\infty} \Delta w_i(t)
 = \sum_{t=0}^{\infty} \left( r_t + \gamma \hat{V}(x_{t+1}) - \hat{V}(x_t) \right) \sum_{k=0}^{t} \gamma^{t-k} e_i(k)   (4)
 = \sum_{t=0}^{\infty} e_i(t) \sum_{k=0}^{\infty} \gamma^k \left( r_{t+k} + \gamma \hat{V}(x_{t+k+1}) - \hat{V}(x_{t+k}) \right)
 = \sum_{t=0}^{\infty} e_i(t) \left( V_t - \hat{V}(x_t) \right).   (5)

Equation 5 is given by Equations 1 and 4: exchanging the order of summation attaches to each eligibility e_i(t) the discounted sum of all subsequent TD-errors, in which the \hat{V} terms telescope, leaving the actual return V_t minus \hat{V}(x_t). Here we assume that the statistics of the random variable V_t depend only on the current policy parameters; that is, E\{V_t\} is a deterministic function of W, where E denotes the expectation operator. This assumption may be right if the policy has converged to an equilibrium point. The critic's estimate \hat{V}(x_t) is obviously independent of the action at time t. From the theory of Williams' REINFORCE algorithms [Williams 92], the value V_t and \hat{V}(x_t) in Equation 5 can be seen as a reinforcement signal and a reinforcement baseline respectively, so that we have

E\{ e_i(t) ( V_t - \hat{V}(x_t) ) \} = \frac{\partial}{\partial w_i} E\{ V_t \}.

It says that the algorithm updates the policy parameters statistically in a direction that increases the actual return V_t, not in the direction of a gradient of the estimated value function in the critic. It can also be seen as an extension of reinforcement comparison methods [Sutton et al. 98], where \hat{V}(x_t) corresponds to the reference reward.

From the above analysis and Figure 3, we can explain what the actor's eligibility trace does. At time t, the algorithm reinforces a_t using the TD-error r_t + \gamma \hat{V}(x_{t+1}) - \hat{V}(x_t) as a temporary expedient; thereafter the actor's eligibility trace replaces \gamma \hat{V}(x_{t+1}) with the actual return (\gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + ...) in order. The critic does not affect the direction of the average update vector, because the critic works as a reinforcement baseline. Therefore, the actor can improve its policy whether or not the critic is able to learn the value function. If the critic approximates the value function well, the actor's learning is accelerated.

The above results hold under the special condition \beta = \gamma. If \beta = 0, the actor updates W in the direction of the gradient of the approximated value function in the critic. A \beta with 0 < \beta < \gamma interpolates between these two limiting cases. The characteristics of \beta are similar to those of \lambda in TD(\lambda) [Sutton 88] and Q(\lambda)-learning [Peng et al. 94].

5 Preliminary Experiments

This section demonstrates the performance of the algorithm applied to a simple linear control problem.

5.1 A Linear Quadratic Regulator (LQR)

The following linear control problem can serve as a benchmark for delayed reinforcement tasks [Baird 94]. At a given discrete time t, the state of the environment is the real value x_t. The agent chooses a control action a_t that is also a real value. The dynamics of the environment are:

x_{t+1} = x_t + a_t + noise_t,   (6)

where noise_t is drawn from the normal distribution with standard deviation \sigma_{noise} = 0.5. The immediate reward is given by

r_t = - x_t^2 - a_t^2.   (7)

The goal is to maximize the total discounted reward, defined by Equation 1 or 2, for all x. Because the task is a linear quadratic regulator (LQR) problem, it is possible to calculate the optimal control rule. From the discrete-time Riccati equation, the regulator is given by

a_t = - k_1 x_t, where k_1 = \frac{\sqrt{4\gamma^2 + 1} - 1}{2\gamma},   (8)

and the value function is given by V^*(x_t) = - k_2 x_t^2, where k_2 is some positive constant. In this experiment, the set of possible states is constrained to lie in the range [-4, 4]. When the state transition given by Equation 6 does not result in that range, x is truncated. When the agent chooses an action that does not lie in the range [-4, 4], the action executed in the environment is also truncated.
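As a quick sanity check of Equation 8 (a check of ours, not an experiment from the paper), the same gain can be obtained by iterating the scalar discounted Riccati recursion for the dynamics of Equation 6 and the cost of Equation 7; both routes give k_1 ≈ 0.5884 for γ = 0.9, the gain the learned policies are compared against in Section 5.3.

```python
import math

def lqr_optimal_gain(gamma, iters=1000):
    """Optimal gain for x_{t+1} = x_t + a_t, r_t = -(x_t^2 + a_t^2),
    obtained by iterating the scalar discounted Riccati equation."""
    k2 = 0.0                                    # value-function coefficient: V*(x) = -k2 * x^2
    for _ in range(iters):
        k1 = gamma * k2 / (1.0 + gamma * k2)    # greedy feedback gain for the current k2
        k2 = 1.0 + k1 ** 2 + gamma * k2 * (1.0 - k1) ** 2
    return gamma * k2 / (1.0 + gamma * k2)

closed_form = (math.sqrt(4 * 0.9 ** 2 + 1) - 1) / (2 * 0.9)   # Equation 8 with gamma = 0.9
print(lqr_optimal_gain(0.9), closed_form)                     # both approximately 0.5884
```
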
5.2 Implementation for the LQR Problem

The Actor: Remember that the policy \pi(a; W, x) is a probability density function when the set of possible actions is continuous. The normal distribution is a simple multi-parameter distribution for a continuous random variable; it has two parameters, the mean \mu and the standard deviation \sigma. When the policy function is given by Equation 9, the eligibilities of \mu and \sigma are

\pi(a; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( - \frac{(a - \mu)^2}{2\sigma^2} \right),   (9)

e_{\mu} = \frac{a - \mu}{\sigma^2},   (10)

e_{\sigma} = \frac{(a - \mu)^2 - \sigma^2}{\sigma^3}.   (11)

One useful feature of such a Gaussian unit [Williams 92] is that the agent has the potential to control its degree of exploratory behavior. We must draw attention to the fact that the eligibility is apt to diverge when \sigma goes close to 0, because the parameter \sigma occupies the denominators of Equations 10 and 11. The divergence of the eligibility has a bad influence on the algorithm. One way to overcome this problem is to control the step size of the parameter update using \sigma. It is obtained by setting the learning rate proportional to \sigma^2; then the eligibilities can be seen as

e_{\mu} = a - \mu, \qquad e_{\sigma} = \frac{(a - \mu)^2 - \sigma^2}{\sigma}.   (12)

The actor first computes \mu and \sigma deterministically and then draws its output from the normal distribution with mean \mu and standard deviation \sigma. The actor has two internal variables, w_1 and w_2, and computes the values of \mu and \sigma according to

\mu = w_1 x_t, \qquad \sigma = \frac{1}{1 + \exp(-w_2)}.   (13)

Then w_1 can be seen as a feedback gain. The reason for this calculation of \sigma is to guarantee that \sigma stays positive. e_1 and e_2 are the characteristic eligibilities of w_1 and w_2 respectively. From Equation 12, e_1 and e_2 are given by

e_1 = (a_t - \mu) x_t,   (14)
e_2 = \left( (a_t - \mu)^2 - \sigma^2 \right)(1 - \sigma).   (15)

The parameter w_1 is initialized to -0.35 \pm 0.15, and w_2 to 0, i.e., \sigma = 0.5. The learning rate \alpha_p for the actor is held fixed.

The Critic: The critic quantizes the continuous state space (-4 \le x \le 4) into an array of boxes. We have tried two types of quantizing: one discretizes x evenly into 3 boxes, the other into 100 boxes. The critic attempts to store in each box a prediction of the value \hat{V} by using TD(0) [Sutton 88]. The learning rate \alpha for TD(0) is also held fixed.
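Putting Equations 13-15 together with the box critic gives the complete agent for the LQR task. The sketch below is our illustration of that agent, not the authors' code; the step sizes alpha and alpha_p are assumed values (the paper's exact settings are not legible here), and the initialization follows the description above.

```python
import numpy as np

def run_lqr(gamma=0.9, beta=0.9, n_boxes=3, alpha=0.5, alpha_p=0.01,
            n_steps=100_000, seed=0):
    """Actor/critic with an actor-side eligibility trace on the LQR task.

    alpha and alpha_p are assumed step sizes; the rest follows Eqs. 6, 7, 13-15.
    """
    rng = np.random.default_rng(seed)
    w1 = rng.uniform(-0.5, -0.2)        # feedback-gain parameter, -0.35 +/- 0.15
    w2 = 0.0                            # sigma parameter, so sigma starts at 0.5
    D = np.zeros(2)                     # actor eligibility trace
    V = np.zeros(n_boxes)               # box critic
    box = lambda s: min(n_boxes - 1, int((s + 4.0) / 8.0 * n_boxes))
    x = 0.0
    for _ in range(n_steps):
        mu, sigma = w1 * x, 1.0 / (1.0 + np.exp(-w2))                     # Equation 13
        a = float(np.clip(rng.normal(mu, sigma), -4.0, 4.0))              # truncated action
        x_next = float(np.clip(x + a + rng.normal(0.0, 0.5), -4.0, 4.0))  # Equation 6
        r = -x * x - a * a                                                # Equation 7
        delta = r + gamma * V[box(x_next)] - V[box(x)]                    # TD-error
        e = np.array([(a - mu) * x,                                       # Equation 14
                      ((a - mu) ** 2 - sigma ** 2) * (1.0 - sigma)])      # Equation 15
        D = e + beta * D                                                  # eligibility trace
        w1 += alpha_p * delta * D[0]                                      # actor update
        w2 += alpha_p * delta * D[1]
        V[box(x)] += alpha * delta                                        # TD(0) critic update
        x = x_next
    return w1   # with suitable step sizes this drifts toward -k1 ~ -0.588
```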

5.3 Simulation Results

Figures 4 through 8 show the performance of the trials in the LQR problem with the discount rate \gamma = 0.9. Figure 4 shows the performance of the algorithm in which the critic uses 3 boxes and the actor does not use eligibility traces, i.e., \beta = 0. Figure 6 shows the performance where the critic uses 100 boxes and the actor does not use the traces. The algorithm in Figure 6 converged close to the optimal feedback gain; in contrast, the one in Figure 4 did not. The reason is that the ability of the function approximation (3 boxes) is insufficient for learning the policy without the trace. Figure 5 shows the performance where the critic uses 3 boxes and the actor uses the trace with \beta = 0.9. It achieved much better results, in terms of both learning efficiency and the quality of the mean of the converged policy, than the algorithm in Figure 4 or 6. Obviously, the actor's eligibility trace is responsible for these two advantages. The reason for the learning efficiency in this case may be that the actor's trace accelerates the propagation of information. The better quality of the policy is clearly owing to the property that the actor improves its policy by using a gradient of the actual return, shown in Section 4.3. Therefore, the algorithm using the trace was not influenced by the critic's ability in terms of the quality of the mean of the policy. We can also see this property in Figure 8, but its deviation is considerably large. Figure 9 shows the value function defined by Equations 1 and 7 over the actor's parameter space (the feedback gain and \sigma). The performance is fairly flat around the optimal solution; this is the reason that the deviation of the policy is large in Figure 8. This example makes it clear that the critic controls the step size of the actor's backups so that the step size becomes smaller around the local maximum. The algorithm in Figure 7 achieved the best results in terms of both the mean and the deviation of the policy; the reason may be the critic's nearly perfect value estimation.

In this preliminary experiment, we can see that the algorithm using the actor's eligibility trace performed better than the algorithm without the trace, with the same computational resources. Here we presented the results of the actor/critic that uses only TD(0) in the critic, but we have also experimented with TD(\lambda) where \lambda > 0. Roughly speaking, performance becomes poor when \lambda approaches 1. It follows that the eligibility trace in the critic cannot make up for the critic's poor ability of function approximation. The details of the experiments using TD(\lambda) will appear in other papers.

Figure 4: The average performance without the actor's eligibility trace (\beta = 0). The critic uses 3 boxes.

Figure 5: The average performance using the actor's trace, \beta = 0.9. The critic uses 3 boxes.

Figure 6: The average performance without the actor's trace (\beta = 0). The critic uses 100 boxes.

Figure 7: The average performance using the actor's trace, \beta = 0.9. The critic uses 100 boxes.

Figure 8: The average performance, \beta = 0.9. The agent learns without the critic, i.e., the critic provides \hat{V}(x) = 0 for all x.

Figure 9: Value function over the actor's parameter space in the LQR problem, where \gamma = 0.9. It is fairly flat around the optimum point, k_1 = 0.5884.

6 Applying to a Cart-Pole Problem

The behavior of this algorithm is demonstrated through a computer simulation of a cart-pole control task, which is a multi-dimensional, nonlinear, non-quadratic problem. We modified the cart-pole problem described in [Barto et al. 83] so that the action is taken to be continuous.

Figure 10: The cart-pole problem. A force F is applied to the cart at position x, and the pole makes an angle \theta with the vertical.

6.1 Problem Formulation

The dynamics of the cart-pole system are modeled by

\ddot{\theta} = \frac{ g \sin\theta + \cos\theta \left( \dfrac{ -F - m \ell \dot{\theta}^2 \sin\theta + \mu_c \,\mathrm{sgn}(\dot{x}) }{ M + m } \right) - \dfrac{ \mu_p \dot{\theta} }{ m \ell } }{ \ell \left( \dfrac{4}{3} - \dfrac{ m \cos^2\theta }{ M + m } \right) },

\ddot{x} = \frac{ F + m \ell \left( \dot{\theta}^2 \sin\theta - \ddot{\theta} \cos\theta \right) - \mu_c \,\mathrm{sgn}(\dot{x}) }{ M + m },

where M = 1.0 (kg) denotes the mass of the cart, m = 0.1 (kg) is the mass of the pole, 2\ell = 1.0 (m) is the length of the pole, g = 9.8 (m/sec^2) is the acceleration of gravity, F (N) denotes the force applied to the cart's center of mass, \mu_c = 0.0005 is the coefficient of friction of the cart, and \mu_p = 0.000002 is the coefficient of friction of the pole. In this simulation, we use a discrete-time system to approximate these equations, where \Delta t = 0.02 sec. At each discrete time step, the agent observes (x, \dot{x}, \theta, \dot{\theta}) and controls the force F. The agent can execute actions in an arbitrary range, but the possible action in the cart-pole system is constrained to lie in the range [-20, 20] (N). When the agent chooses an action that does not lie in that range, the action executed in the system is truncated. The system begins with (x, \dot{x}, \theta, \dot{\theta}) = (0, 0, 0, 0). The system fails and receives a reward (penalty) signal of -1 when the pole falls over \pm 12 degrees or the cart runs over the bounds of its track (-2.4 \le x \le 2.4); the cart-pole system is then reset to the initial state.

6.2 Details of the Agent

In this experiment, the actor adopts an implementation similar to that shown in Equations 9 and 12. The state space is constrained to the range (x, \dot{x}, \theta, \dot{\theta}) \in (\pm 2.4 m, \pm 2 m/sec, \pm 0.21 rad, \pm 1.5 rad/sec). The actor has five internal variables w_1, ..., w_5, and computes \mu and \sigma according to

\mu = w_1 \frac{x}{2.4} + w_2 \frac{\dot{x}}{2} + w_3 \frac{\theta}{0.21} + w_4 \frac{\dot{\theta}}{1.5}, \qquad \sigma = \frac{1}{1 + \exp(-w_5)}.   (16)

Similarly to Equations 14 and 15, the eligibilities e_1, ..., e_5 are given by

e_1 = (a_t - \mu) \frac{x}{2.4}, \quad e_2 = (a_t - \mu) \frac{\dot{x}}{2}, \quad e_3 = (a_t - \mu) \frac{\theta}{0.21}, \quad e_4 = (a_t - \mu) \frac{\dot{\theta}}{1.5}, \quad e_5 = \left( (a_t - \mu)^2 - \sigma^2 \right)(1 - \sigma).

The critic discretizes the normalized state space evenly into boxes, and attempts to store \hat{V} in each box by using the TD(0) algorithm [Sutton 88]. The parameters are set to \gamma = 0.95 and \alpha = 0.5; the actor's learning rate \alpha_p is held fixed.
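For reference, the discrete-time approximation of the dynamics above can be written directly from the equations; the sketch below is our Euler-integration reading with \Delta t = 0.02 sec and the \pm 12 degree / \pm 2.4 m failure test, not code from the paper.

```python
import math

# Physical constants from Section 6.1
M, m, L = 1.0, 0.1, 0.5            # cart mass (kg), pole mass (kg), pole half-length (m)
G, MU_C, MU_P = 9.8, 0.0005, 0.000002
DT = 0.02                          # integration time step (sec)

def step(state, F):
    """One Euler step of the cart-pole dynamics; F is the applied force in N."""
    x, x_dot, th, th_dot = state
    F = max(-20.0, min(20.0, F))   # truncate the executed action
    sin, cos = math.sin(th), math.cos(th)
    temp = (-F - m * L * th_dot ** 2 * sin + MU_C * math.copysign(1.0, x_dot)) / (M + m)
    th_acc = (G * sin + cos * temp - MU_P * th_dot / (m * L)) / \
             (L * (4.0 / 3.0 - m * cos ** 2 / (M + m)))
    x_acc = (F + m * L * (th_dot ** 2 * sin - th_acc * cos)
             - MU_C * math.copysign(1.0, x_dot)) / (M + m)
    x, x_dot = x + DT * x_dot, x_dot + DT * x_acc
    th, th_dot = th + DT * th_dot, th_dot + DT * th_acc
    failed = abs(th) > math.radians(12.0) or abs(x) > 2.4
    reward = -1.0 if failed else 0.0
    return (x, x_dot, th, th_dot), reward, failed
```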

6.3 Simulation Results

Figure 11 shows the performance of three learning algorithms in which the policy representation is the same. The actor/critic algorithm using the actor's trace achieved the best results. In contrast, the algorithm without the trace could not learn the control policy because of the poor ability of function approximation in the critic.

Figure 11: The average performance of three algorithms: actor/critic using the actor's eligibility trace, actor only using the actor's eligibility trace, and actor/critic without the actor's eligibility trace. The vertical axis is time steps until failure; the horizontal axis is trials. The critic uses a grid of boxes. A trial means an attempt from the initial state to a failure.

7 Discussion

Representation of Policies: First of all, actor/critic algorithms should have sufficient ability to approximate policies. If that is satisfied, use of the actor's eligibility trace (\beta = \gamma) makes it possible to learn an acceptable policy at less cost than increasing the critic's ability of function approximation, in our test cases. The reason is that the policy function representation would require less memory than the representation of the state-action value function in many cases.

Controlling the Step Size of Backups: It is analytically shown in Section 4.3 that the critic provides an appropriate reinforcement baseline to the actor. The adaptive baseline controls the step size of the actor's backups so that the step size becomes smaller around the local maximum. This property contributes to the better learning efficiency and the suppression of harmful drift of the policy shown in the experiments.

Overcoming Non-Markovian Effects: There are many ways to implement the critic's learning scheme. [Peng et al. 94] and [Sutton 95] pointed out that increasing \lambda makes TD(\lambda) less sensitive to non-Markovian effects. The actor's eligibility traces are also useful in getting over non-Markovian problems [Kimura et al. 97]. Therefore, the combination of TD(\lambda) and the actor's eligibility trace will be more robust in non-Markovian problems.

Combining with Efficient DP-based Methods: If the hidden state is relatively small in the state space, the agent may perform well when efficient DP-based algorithms are adopted for the critic. The DP-based algorithms accelerate the actor's learning in completely observable states, and the actor's stochastic policy and its trace (\beta = \gamma) would make up for the non-Markovian effects owing to the hidden state or the function approximation.

8 Conclusions

This paper presented an analysis of actor/critic algorithms in which the actor updates its policy using the eligibility trace of the policy parameters. The results show that when the discount rate of the value function equals the discount factor of the actor's trace, the actor improves its policy by using a gradient of the actual return, not a gradient of the estimated return in the critic. The critic then provides an adaptive reinforcement baseline to the actor, controlling the step size of the actor's backups. This enables the agent to learn a fairly good policy even under the condition that the approximated value function in the critic is hopelessly imperfect. The behavior is demonstrated through simulations showing that the trace contributes to the learning efficiency and the suppression of undesirable drifts of the policy. Analysis of the algorithm in non-Markovian environments is future work.

Acknowledgements

We would like to thank Andrew Barto, Jing Peng, Jeff Schneider, Satinder Singh, Richard Sutton, and the reviewers for many helpful comments and suggestions.

References

[Baird 94] Baird, L. C.: Reinforcement Learning in Continuous Time: Advantage Updating, Proceedings of the IEEE International Conference on Neural Networks, Vol. IV (1994).

[Barto et al. 83] Barto, A. G., Sutton, R. S. and Anderson, C. W.: Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems, IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-13, No. 5, September/October 1983.

[Clouse et al. 92] Clouse, J. A. & Utgoff, P. E.: A Teaching Method for Reinforcement Learning, Proceedings of the 9th International Conference on Machine Learning (1992).

[Crites et al. 94] Crites, R. H. and Barto, A. G.: An Actor/Critic Algorithm that is Equivalent to Q-Learning, Advances in Neural Information Processing Systems 7 (1994).

[Doya 96] Doya, K.: Efficient Nonlinear Control with Actor-Tutor Architecture, Advances in Neural Information Processing Systems 9 (1996).

[Gullapalli 92] Gullapalli, V.: Reinforcement Learning and Its Application to Control, PhD Thesis, University of Massachusetts, Amherst, COINS Technical Report 92 (1992).

[Jaakkola 94] Jaakkola, T., Singh, S. P., & Jordan, M. I.: Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems, Advances in Neural Information Processing Systems 7 (1994).

[Kaelbling et al. 96] Kaelbling, L. P., Littman, M. L., & Moore, A. W.: Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research, Vol. 4, pp. 237-277 (1996).

[Kimura et al. 95] Kimura, H., Yamamura, M., & Kobayashi, S.: Reinforcement Learning by Stochastic Hill Climbing on Discounted Reward, Proceedings of the 12th International Conference on Machine Learning (1995).

[Kimura et al. 97] Kimura, H., Miyazaki, K. and Kobayashi, S.: Reinforcement Learning in POMDPs with Function Approximation, Proceedings of the 14th International Conference on Machine Learning, pp. 152-160 (1997).

[Lin et al. 96] Lin, C. J. and Lin, C. T.: Reinforcement Learning for an ART-Based Fuzzy Adaptive Learning Control Network, IEEE Transactions on Neural Networks, Vol. 7, No. 3 (1996).

[Littman 94] Littman, M. L.: Markov Games as a Framework for Multi-Agent Reinforcement Learning, Proceedings of the 11th International Conference on Machine Learning (1994).

[Pendrith et al. 96] Pendrith, M. D. & Ryan, M. R. K.: Actual Return Reinforcement Learning versus Temporal Differences: Some Theoretical and Experimental Results, Proceedings of the 13th International Conference on Machine Learning, pp. 373-381 (1996).

[Peng et al. 94] Peng, J. and Williams, R. J.: Incremental Multi-Step Q-Learning, Proceedings of the 11th International Conference on Machine Learning (1994).

[Singh 94] Singh, S. P., Jaakkola, T., & Jordan, M. I.: Learning Without State-Estimation in Partially Observable Markovian Decision Processes, Proceedings of the 11th International Conference on Machine Learning (1994).

[Singh 96] Singh, S. P., & Sutton, R. S.: Reinforcement Learning with Replacing Eligibility Traces, Machine Learning 22 (1996).

[Sutton 88] Sutton, R. S.: Learning to Predict by the Methods of Temporal Differences, Machine Learning 3 (1988).

[Sutton 90] Sutton, R. S.: Reinforcement Learning Architectures for Animats, Proceedings of the 1st International Conference on Simulation of Adaptive Behavior (1990).

[Sutton 95] Sutton, R. S.: TD Models: Modeling the World at a Mixture of Time Scales, Proceedings of the 12th International Conference on Machine Learning (1995).

[Sutton et al. 98] Sutton, R. S. & Barto, A. G.: Reinforcement Learning: An Introduction, A Bradford Book, The MIT Press (1998).

[Watkins et al. 92] Watkins, C. J. C. H., & Dayan, P.: Technical Note: Q-Learning, Machine Learning 8 (1992).

[Williams et al. 90] Williams, R. J. & Baird, L. C.: A Mathematical Analysis of Actor-Critic Architectures for Learning Optimal Controls through Incremental Dynamic Programming, Proceedings of the Sixth Yale Workshop on Adaptive and Learning Systems, Center for Systems Science, Dunham Laboratory, Yale University, New Haven (1990).
[Williams 92] Williams, R. J.: Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, Machine Learning 8 (1992).


More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

On-line Adaptive Optimal Timing Control of Switched Systems

On-line Adaptive Optimal Timing Control of Switched Systems On-line Adapive Opimal Timing Conrol of Swiched Sysems X.C. Ding, Y. Wardi and M. Egersed Absrac In his paper we consider he problem of opimizing over he swiching imes for a muli-modal dynamic sysem when

More information

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1 SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision

More information

OBJECTIVES OF TIME SERIES ANALYSIS

OBJECTIVES OF TIME SERIES ANALYSIS OBJECTIVES OF TIME SERIES ANALYSIS Undersanding he dynamic or imedependen srucure of he observaions of a single series (univariae analysis) Forecasing of fuure observaions Asceraining he leading, lagging

More information

Chapter 2. First Order Scalar Equations

Chapter 2. First Order Scalar Equations Chaper. Firs Order Scalar Equaions We sar our sudy of differenial equaions in he same way he pioneers in his field did. We show paricular echniques o solve paricular ypes of firs order differenial equaions.

More information

Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach

Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach 1 Decenralized Sochasic Conrol wih Parial Hisory Sharing: A Common Informaion Approach Ashuosh Nayyar, Adiya Mahajan and Demoshenis Tenekezis arxiv:1209.1695v1 [cs.sy] 8 Sep 2012 Absrac A general model

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Noes for EE7C Spring 018: Convex Opimizaion and Approximaion Insrucor: Moriz Hard Email: hard+ee7c@berkeley.edu Graduae Insrucor: Max Simchowiz Email: msimchow+ee7c@berkeley.edu Ocober 15, 018 3

More information

Tracking. Announcements

Tracking. Announcements Tracking Tuesday, Nov 24 Krisen Grauman UT Ausin Announcemens Pse 5 ou onigh, due 12/4 Shorer assignmen Auo exension il 12/8 I will no hold office hours omorrow 5 6 pm due o Thanksgiving 1 Las ime: Moion

More information

EECE251. Circuit Analysis I. Set 4: Capacitors, Inductors, and First-Order Linear Circuits

EECE251. Circuit Analysis I. Set 4: Capacitors, Inductors, and First-Order Linear Circuits EEE25 ircui Analysis I Se 4: apaciors, Inducors, and Firs-Order inear ircuis Shahriar Mirabbasi Deparmen of Elecrical and ompuer Engineering Universiy of Briish olumbia shahriar@ece.ubc.ca Overview Passive

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

A Shooting Method for A Node Generation Algorithm

A Shooting Method for A Node Generation Algorithm A Shooing Mehod for A Node Generaion Algorihm Hiroaki Nishikawa W.M.Keck Foundaion Laboraory for Compuaional Fluid Dynamics Deparmen of Aerospace Engineering, Universiy of Michigan, Ann Arbor, Michigan

More information