Natural Temporal Difference Learning


Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence

William Dabney and Philip S. Thomas
School of Computer Science, University of Massachusetts Amherst, 140 Governors Dr., Amherst, MA

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

In this paper we investigate the application of natural gradient descent to Bellman error based reinforcement learning algorithms. This combination is interesting because natural gradient descent is invariant to the parameterization of the value function. This invariance property means that natural gradient descent adapts its update directions to correct for poorly conditioned representations. We present and analyze quadratic and linear time natural temporal difference learning algorithms, and prove that they are covariant. We conclude with experiments which suggest that the natural algorithms can match or outperform their non-natural counterparts using linear function approximation, and drastically improve upon their non-natural counterparts when using non-linear function approximation.

Introduction

Much recent research has focused on problems with continuous actions. For these problems, a significant leap in performance occurred when Kakade (2002) suggested the application of natural gradients (Amari 1998) to policy gradient algorithms. This suggestion has resulted in many successful natural gradient based policy search algorithms (Morimura, Uchibe, and Doya 2005; Peters and Schaal 2008; Bhatnagar et al. 2009; Degris, Pilarski, and Sutton 2012). Despite the successful applications of natural gradients to reinforcement learning in the context of policy search, it has not been applied to Bellman-error based algorithms like residual gradient and Sarsa(λ), which are the de facto algorithms for problems with discrete action sets. A common complaint is that these Bellman-error based algorithms learn slowly when using function approximation. Natural gradients are a quasi-Newton approach that is known to speed up gradient descent, and thus the synthesis of natural gradients with TD has the potential to improve upon this drawback of reinforcement learning. Additionally, we show in the appendix that the natural TD methods are covariant, which makes them more robust to the choice of representation than ordinary TD methods.

In this paper we provide a simple quadratic-time natural temporal difference learning algorithm, show how the idea of compatible function approximation can be leveraged to achieve linear time complexity, and prove that our algorithms are covariant. We conclude with empirical comparisons on three canonical domains (mountain car, cart-pole balancing, and acrobot) and one novel challenging domain (playing Tic-tac-toe using handwritten letters as input). When not otherwise specified, we assume the notation of Sutton and Barto (1998).

Residual Gradient

The residual gradient (RG) algorithm is the direct application of stochastic gradient descent to the problem of minimizing the mean squared Bellman error (MSBE) (Baird 1995). It is given by the following update equations:

δ_t = r_t + γ Q_θ(s_{t+1}, a_{t+1}) − Q_θ(s_t, a_t),   (1)

θ_{t+1} = θ_t − α_t δ_t ∇δ_t,   (2)

where Q_θ : S × A → ℝ is a function approximator with parameter vector θ. Residual gradient only follows unbiased estimates of the gradient of the MSBE if it uses double sampling or when the domain has deterministic state transitions (Sutton and Barto 1998). In this paper we evaluate using standard reinforcement learning domains with deterministic transitions, so the above formulation of RG is unbiased.
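To make the update concrete, here is a minimal sketch of one residual-gradient step implementing Eqs. (1) and (2). The callables q and grad_q, which return Q_θ(s, a) and its gradient with respect to θ, are illustrative placeholders rather than anything specified in the paper.

```python
import numpy as np

def residual_gradient_step(theta, q, grad_q, s, a, r, s_next, a_next, gamma, alpha):
    """One residual-gradient (RG) update, following Eqs. (1)-(2).

    q(theta, s, a) returns the scalar estimate Q_theta(s, a);
    grad_q(theta, s, a) returns its gradient with respect to theta.
    """
    delta = r + gamma * q(theta, s_next, a_next) - q(theta, s, a)            # Eq. (1)
    grad_delta = gamma * grad_q(theta, s_next, a_next) - grad_q(theta, s, a)
    return theta - alpha * delta * grad_delta                                 # Eq. (2)
```

With linear function approximation, grad_q(theta, s, a) simply returns the feature vector φ(s, a), so ∇δ_t reduces to γφ(s_{t+1}, a_{t+1}) − φ(s_t, a_t).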
One significant drawback of residual gradient is that it is not covariant. Consider the algorithm at two different levels, as depicted in Figure 1. At one level we can consider how it moves through the space of possible Q functions. At another level, we can consider how it moves through two different parameter spaces, each corresponding to a different representation of Q. Although these two representations may produce different update directions in parameter space, we would expect a good algorithm to result in both representations producing the same update direction in the space of Q functions.¹ Such an algorithm would be called covariant. Because residual gradient is not covariant, the choice of how to represent Q_θ influences the direction that RG moves in the space of Q functions. Other temporal difference (TD) learning algorithms like Sarsa(λ) and TDC (Sutton et al. 2009) are also not covariant.

¹ For technical correctness, we must assume that both representations can represent the same set of Q functions.

Figure 1: Q-space denotes the space of possible Q functions, while θ-space and h-space denote two different parameter spaces. The circles denote different locations in θ-space and h-space that correspond to the same Q function. The blue and red arrows denote possible directions that a non-covariant algorithm might attempt to change the parameters, which correspond to different directions in Q-space. The purple arrow denotes the update direction that a covariant algorithm might produce, regardless of the parameterization of Q.

Natural gradients can be viewed as a way to correct the direction of an update to account for a particular parameterization. Although natural gradients do not always result in covariant updates, they frequently do (Bagnell and Schneider 2003). Formally, consider the direction of steepest ascent of a function, L(θ), where L : ℝⁿ → ℝ. If we assume that θ resides in Euclidean space, then the gradient, ∇L(θ), gives the direction of steepest ascent. However, if we assume that θ resides in a Riemannian space with metric tensor G(θ), then the direction of steepest ascent is given by G(θ)⁻¹ ∇L(θ) (Amari 1998).

Natural Residual Gradient

In this section we describe how natural gradient descent can be applied to the residual gradient algorithm. The natural RG update is

θ_{t+1} = θ_t + α_t G(θ_t)⁻¹ δ_t g_t,   (3)

where G(θ_t) is the metric tensor for the parameter space and g_t = ∇Q_θ(s_t, a_t) − γ ∇Q_θ(s_{t+1}, a_{t+1}). In most reinforcement learning applications of natural gradients, the metric tensor is used to correct for the parameterization of a probability distribution. In these cases the Fisher information matrix is a natural choice for the metric tensor (Amari and Douglas 1998). However, we are using natural gradients to correct for the parameterization of a value function, which is not a distribution. For a related application, Amari (1998) suggests a transformation of a parameterized function to a parameterized probability distribution. Using this transformation, the Fisher information matrix is

G(θ_t) = E[δ_t² g_t g_t⊤].   (4)

In the appendix we prove that the class of metric tensors to which Equation 4 belongs all result in covariant gradient algorithms.

Algorithms

Quadratic Computational Complexity

A straightforward implementation of the natural residual gradient algorithm would maintain an estimate of G(θ) and compute G(θ)⁻¹ at each time step. Due to the matrix inversion, this naïve algorithm has per-time-step computational complexity O(|θ|³), where we ignore the complexity of differentiating Q_θ. This can be improved to O(|θ|²) using the Sherman-Morrison formula to maintain an estimate of G(θ_t)⁻¹ directly. The resulting quadratic time natural algorithm is given by Algorithm 1, where {α_t} is a step size schedule satisfying Σ_t α_t = ∞ and Σ_t α_t² < ∞.

Algorithm 1: Natural Residual Gradient
  Initialize G₀⁻¹ = I, θ₀ = 0
  For each step t:
    δ_t = r_t + γ Q_θ(s_{t+1}, a_{t+1}) − Q_θ(s_t, a_t)
    g_t = ∇Q_θ(s_t, a_t) − γ ∇Q_θ(s_{t+1}, a_{t+1})
    G_t⁻¹ = G_{t−1}⁻¹ − (δ_t² G_{t−1}⁻¹ g_t g_t⊤ G_{t−1}⁻¹) / (1 + δ_t² g_t⊤ G_{t−1}⁻¹ g_t)
    θ_{t+1} = θ_t + α_t δ_t G_t⁻¹ g_t
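As a concrete illustration, the following is a minimal sketch of one step of Algorithm 1, with the inverse metric tensor maintained via the Sherman-Morrison formula; the q and grad_q callables are the same illustrative placeholders as before, not part of the paper's specification.

```python
import numpy as np

def natural_rg_step(theta, G_inv, q, grad_q, s, a, r, s_next, a_next, gamma, alpha):
    """One step of Algorithm 1 (quadratic-time Natural Residual Gradient).

    G_inv is the running estimate of G(theta)^{-1}, initialized to the identity
    and updated with the Sherman-Morrison formula for the rank-one term
    (delta_t g_t)(delta_t g_t)^T, so no explicit matrix inversion is needed.
    """
    delta = r + gamma * q(theta, s_next, a_next) - q(theta, s, a)
    g = grad_q(theta, s, a) - gamma * grad_q(theta, s_next, a_next)
    Gg = G_inv @ g
    G_inv = G_inv - (delta ** 2) * np.outer(Gg, Gg) / (1.0 + (delta ** 2) * (g @ Gg))
    theta = theta + alpha * delta * (G_inv @ g)
    return theta, G_inv
```

Starting from G_inv = I, the first few updates closely follow the ordinary (non-natural) residual gradient direction, which is why the quadratic-time algorithm still makes meaningful progress before the metric tensor estimate has formed.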
Linear Computational Complexity

To achieve linear computational complexity, we leverage the idea of compatible function approximation.² We begin by estimating the TD-error, δ_t, with a linear function approximator w⊤(δ_t g_t), where w are the tunable parameters of the linear function approximator and δ_t g_t are the compatible features. Specifically, we search for a w that is a local minimum of the loss function L:

L(w) = E[(1 − δ_t w⊤ g_t)²].   (5)

At a local minimum of L, ∂L(w)/∂w = 0, so

E[(1 − δ_t w⊤ g_t) δ_t g_t] = 0,   (6)

E[δ_t g_t] = E[δ_t² g_t g_t⊤] w.   (7)

Notice that the left hand side of Eq. 7 is the expected update to θ in the non-natural algorithms. We can therefore write the expected update to θ as

θ_{t+1} = θ_t + α_t E[δ_t g_t] = θ_t + α_t E[δ_t² g_t g_t⊤] w.   (8)

Therefore the expected natural residual gradient update is

θ_{t+1} = θ_t + α_t G(θ)⁻¹ E[δ_t g_t],   (9)
        = θ_t + α_t w.   (10)

The challenge remains that a locally optimal w must be attained. For this we propose a two-timescale approach identical to that of Bhatnagar et al. (2009). That is, we perform stochastic gradient descent on L(w) using a step size schedule {β_t} that decays faster than the step size schedule {α_t} for updates to θ. The resulting linear-complexity two-timescale natural algorithm is given by Algorithm 2.

² The compatible features that we present are compatible with Q_θ, whereas the compatible features originally defined by Sutton et al. (2000) are compatible with a parameterized policy. Although related, these two types of compatible features are not the same.

Algorithm 2: Natural Linear-Time Residual Gradient
  Initialize w₀ = 0, θ₀ = 0
  For each step t:
    δ_t = r_t + γ Q_θ(s_{t+1}, a_{t+1}) − Q_θ(s_t, a_t)
    g_t = ∇Q_θ(s_t, a_t) − γ ∇Q_θ(s_{t+1}, a_{t+1})
    w_{t+1} = w_t + β_t (1 − δ_t w_t⊤ g_t) δ_t g_t
    θ_{t+1} = θ_t + α_t w_{t+1}
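The following sketch mirrors Algorithm 2: the compatible-feature weights w are updated by stochastic gradient descent on L(w) in Eq. (5) on the fast timescale β_t, and θ simply follows w on the slow timescale α_t, as in Eq. (10). Again, q and grad_q are illustrative placeholders.

```python
import numpy as np

def natural_rg_linear_step(theta, w, q, grad_q, s, a, r, s_next, a_next,
                           gamma, alpha, beta):
    """One step of Algorithm 2 (linear-time Natural Residual Gradient)."""
    delta = r + gamma * q(theta, s_next, a_next) - q(theta, s, a)
    g = grad_q(theta, s, a) - gamma * grad_q(theta, s_next, a_next)
    # Fast timescale: stochastic gradient step on L(w) = E[(1 - delta w^T g)^2].
    w = w + beta * (1.0 - delta * (w @ g)) * delta * g
    # Slow timescale: theta follows the compatible-feature weights (Eq. 10).
    theta = theta + alpha * w
    return theta, w
```

Because θ only moves along w, the algorithm produces little useful learning until w has become a reasonable estimate of the natural gradient direction, which is the initial slowness discussed in the experiments.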

The convergence properties of these two-timescale algorithms have been well studied, and they have been shown to converge under appropriate assumptions (Bhatnagar et al. 2009; Kushner and Yin 2003). To summarize, with certain smoothness assumptions, if

Σ_{t=0}^∞ α_t = Σ_{t=0}^∞ β_t = ∞;   Σ_{t=0}^∞ α_t² < ∞;   Σ_{t=0}^∞ β_t² < ∞;   β_t = o(α_t),

then, since β_t → 0 faster than α_t, θ converges as though it were following the true expected natural gradient. As a result, the linear complexity algorithms maintain the convergence guarantees of their non-natural counterparts. Unfortunately, unlike compatible function approximation for natural policy gradient algorithms (Bhatnagar et al. 2009), it is not clear how a useful baseline could be added to the stochastic gradient descent updates of w. The baseline, b, would have to satisfy E[b δ_t g_t] = 0, which is not even satisfied by a constant non-zero b.

Extensions

The metric tensor that we derived for RG can be applied to other similar algorithms. For example, Sarsa(λ) is not a gradient method; however, in many ways it is similar to residual gradient. We therefore propose the use of G(θ), derived for RG, with Sarsa(λ). Although not as principled as its use with RG, in both cases it corrects for the curvature of the squared Bellman error and the parameterization of Q. This straightforward extension gives us the algorithm for Natural Sarsa(λ) (Algorithm 3), and a linear time Natural Sarsa(λ) algorithm can be defined similarly to Algorithm 2.

Algorithm 3: Natural Sarsa(λ)
  Initialize G₀⁻¹ = I, e₀ = 0, θ₀ = 0
  For each step t:
    δ_t = r_t + γ Q_θ(s_{t+1}, a_{t+1}) − Q_θ(s_t, a_t)
    g_t = ∇Q_θ(s_t, a_t)
    e_t = γλ e_{t−1} + g_t
    G_t⁻¹ = G_{t−1}⁻¹ − (δ_t² G_{t−1}⁻¹ g_t g_t⊤ G_{t−1}⁻¹) / (1 + δ_t² g_t⊤ G_{t−1}⁻¹ g_t)
    θ_{t+1} = θ_t + α_t δ_t G_t⁻¹ e_t

Another temporal difference learning algorithm which is closely related to residual gradient is the TDC algorithm (Sutton et al. 2009). TDC is a linear time gradient descent algorithm for TD-learning with linear function approximation, and supports off-policy learning. The TDC algorithm is given by

θ_{t+1} = θ_t + α_t δ_t φ_t − α_t γ φ_{t+1} (φ_t⊤ w_t),   (11)

w_{t+1} = w_t + β_t (δ_t − φ_t⊤ w_t) φ_t,   (12)

where φ_t = ∇Q_θ(s_t, a_t) are the basis functions of the linear function approximation. TDC minimizes the mean squared projected Bellman error (MSPBE) using a projection operator that minimizes the value function approximation error. With a different projection operator the same derivation results in the standard residual gradient algorithm. Applying the TD metric tensor we get Natural TDC (Algorithm 4).

Algorithm 4: Natural TDC
  Initialize G₀⁻¹ = I, θ₀ = 0, w₀ = 0
  For each step t:
    δ_t = r_t + γ Q_θ(s_{t+1}, a_{t+1}) − Q_θ(s_t, a_t)
    g_t = φ_t − γ φ_{t+1}
    G_t⁻¹ = G_{t−1}⁻¹ − (δ_t² G_{t−1}⁻¹ g_t g_t⊤ G_{t−1}⁻¹) / (1 + δ_t² g_t⊤ G_{t−1}⁻¹ g_t)
    θ_{t+1} = θ_t + α_t G_t⁻¹ (δ_t φ_t − γ φ_{t+1} (φ_t⊤ w_t))
    w_{t+1} = w_t + β_t (δ_t − φ_t⊤ w_t) φ_t
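To make the extensions concrete, here is a sketch of one step of Natural Sarsa(λ) following Algorithm 3; Natural TDC (Algorithm 4) reuses the same inverse-metric recursion with g_t = φ_t − γφ_{t+1} and the TDC updates for θ and w. The q and grad_q callables remain illustrative placeholders.

```python
import numpy as np

def natural_sarsa_step(theta, e, G_inv, q, grad_q, s, a, r, s_next, a_next,
                       gamma, lam, alpha):
    """One step of Algorithm 3 (Natural Sarsa(lambda)).

    e is the eligibility trace; G_inv is the running inverse metric tensor,
    updated with the same Sherman-Morrison recursion as Algorithm 1, and the
    final update is applied along the trace rather than along g alone.
    """
    delta = r + gamma * q(theta, s_next, a_next) - q(theta, s, a)
    g = grad_q(theta, s, a)
    e = gamma * lam * e + g
    Gg = G_inv @ g
    G_inv = G_inv - (delta ** 2) * np.outer(Gg, Gg) / (1.0 + (delta ** 2) * (g @ Gg))
    theta = theta + alpha * delta * (G_inv @ e)
    return theta, e, G_inv
```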
Experimental Results

Our goal is to show that natural TD methods improve upon their non-natural counterparts, not to promote one TD method over another. So, we focus our experiments on comparing the quadratic and linear time natural variants of temporal difference learning algorithms with the original TD algorithms they build upon. To evaluate the performance of natural residual gradient and natural Sarsa(λ), we performed experiments on two canonical domains, mountain car and cart-pole balancing, as well as one new challenging domain that we call visual Tic-tac-toe. We used an ε-greedy policy for all TD-learning algorithms. TDC is not a control algorithm, and thus to evaluate the performance of natural TDC we generate experience from a fixed policy in the acrobot domain and measure the mean squared error (MSE) of the learned value function compared with Monte Carlo rollouts of the fixed policy.

For mountain car, cart-pole balancing, and acrobot we used linear function approximation with a third-order Fourier basis (Konidaris et al. 2012). On visual Tic-tac-toe we used a fully-connected feed-forward artificial neural network with one hidden layer of 20 nodes. This allows us to show the benefits of natural gradients when the value function parameterization is non-linear and more complex. We optimized the algorithm parameters for all experiments using a randomized search as suggested by Bergstra and Bengio (2012). We selected the hyper-parameters that resulted in the largest mean discounted return over 20 episodes for mountain car, 50 episodes for cart-pole balancing, and 100,000 episodes for visual Tic-tac-toe. Each parameter set was tested 10 times and the performance averaged. For mountain car and cart pole each algorithm's performance is an average over 50 and 30 trials respectively, with standard deviations shown in the shaded regions. For visual Tic-tac-toe and acrobot, algorithm performance is averaged over 10 trials, again with standard deviations shown by the shaded regions. For the Sarsa(λ) experiments we include results for Natural Actor-Critic (Peters and Schaal 2008), to provide a comparison with another approach to applying natural gradients to reinforcement learning. However, for these experiments we do not include the standard deviations because they make the figures much harder to read. We used a soft-max policy with Natural Actor-Critic (NAC).
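For reference, the sketch below shows one way to build third-order Fourier basis features of the kind used here, with the state rescaled to [0, 1]^d; stacking one copy of the features per discrete action is our assumption about how state-action features were formed, not a detail given in the paper.

```python
import itertools
import numpy as np

def fourier_features(state, order=3):
    """Fourier basis features cos(pi * c . s) for every integer coefficient
    vector c with entries in {0, ..., order}; state must lie in [0, 1]^d."""
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=len(state))))
    return np.cos(np.pi * coeffs @ np.asarray(state, dtype=float))

def state_action_features(state, action, n_actions, order=3):
    """Stack one copy of the state features per discrete action (our assumption)."""
    base = fourier_features(state, order)
    phi = np.zeros(n_actions * base.size)
    phi[action * base.size:(action + 1) * base.size] = base
    return phi
```

For the two-dimensional mountain car state this gives (3 + 1)² = 16 state features per action.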

Figure 2: Mountain Car (Residual Gradient). Figure 3: Mountain Car (Sarsa(λ)). Figure 4: Car Pole (Residual Gradient); same legend as Figure 2.

Mountain Car

Mountain car is a simple simulation of an underpowered car stuck in a valley; full details of the domain can be found in the work of Sutton and Barto (1998). Figures 2 and 3 give the results for each algorithm on mountain car. The linear time natural residual gradient and Sarsa(λ) algorithms take longer to learn good policies than the quadratic time natural algorithms. One reason for the slower initial learning of the linear algorithms is that they must first build up an estimate of the w vector before updates to the value function parameters become meaningful. Out of all the algorithms we found that the quadratic time Natural Sarsa(λ) algorithm performed the best in mountain car, reaching the best policy after just two episodes.

Cart Pole Balancing

Cart pole balancing simulates a cart on a short one dimensional track with a pole attached by a rotational hinge, and is also referred to as the inverted pendulum problem. There are many varieties of the cart pole balancing domain, and we refer the reader to Barto, Sutton, and Anderson (1983) for complete details. Figures 4 and 5 give the results for each algorithm on cart pole balancing. In the cart pole balancing domain the two quadratic algorithms, Natural Sarsa(λ) and Natural RG, perform the best. Again, the linear algorithm takes a slower start as it builds up an estimate of w, but converges well above the non-natural algorithms and very close to the quadratic ones. Natural Sarsa(λ) reaches a near optimal policy within the first couple of episodes, and compares favorably with the heavily optimized Sarsa(λ), which does not even reach the same level of performance after 100 episodes.

Visual Tic-Tac-Toe

Visual Tic-Tac-Toe is a novel challenging decision problem in which the agent plays Tic-tac-toe (noughts and crosses) against an opponent that makes random legal moves. The game board is a 3×3 grid of handwritten letters (X, O, and B for blank) from the UCI Letter Recognition Data Set (Slate 1991), examples of which are shown in Figure 8. At every step of the episode, each letter of the game board is drawn randomly with replacement from the set of available handwritten letters (787 X's, 753 O's, and 766 B's). Thus, it is easily possible for the agent to never see the same handwritten X, O, or B letter in a given episode. The agent's state features are the 16 integer-valued attributes for each of the letters on the board. Details of the data set and the attributes can be found in the UCI repository.

Figure 5: Car Pole (Sarsa(λ)). Figure 6: Visual Tic-Tac-Toe experiments. Figure 7: Acrobot experiments (TDC). Figure 8: Visual Tic-Tac-Toe example letters.

There are nine possible actions available to the agent, but attempting to play on a non-blank square is considered an illegal move and results in the agent losing its turn. This is particularly challenging because blank squares are marked by a B, making recognizing legal moves challenging in and of itself. The opponent only plays legal moves, but chooses randomly among them. The reward is 100 for winning, −100 for losing, and 0 otherwise.

Figure 6 gives the results comparing Natural-LT Sarsa and Sarsa(λ) on the visual Tic-tac-toe domain using the artificial neural network described previously. These results show linear natural Sarsa(λ) in a setting where it is able to account for the shape of a more complex value function parameterization, and thus confer greater improvement in convergence speed over non-natural algorithms. We do not compare quadratic time algorithms due to computational limits.

Acrobot

Acrobot is another commonly studied reinforcement learning task in which the agent controls a two-link underactuated robot by applying torque to the lower joint, with the goal of raising the top of the lower link above a certain point. See Sutton and Barto (1998) for a full specification of the domain and its equations of motion. To evaluate the off-policy Natural TDC algorithm we first generated a fixed policy by online training of a hand-tuned Sarsa(λ) agent for 200 episodes. We then trained TDC and Natural TDC in acrobot following the previously learned fixed policy. We evaluated an algorithm's learned value function every 100 episodes by sampling states and actions randomly and computing the true expected undiscounted return using Monte Carlo rollouts following the fixed policy. Figure 7 shows the MSE between the learned values and the true expected return. Natural TDC clearly outperforms TDC, and in this experiment converged to a much lower MSE. Additionally, we found TDC to be sensitive to the step-sizes used, and saw that Natural TDC was much less sensitive to these parameters. These results show that the benefits of natural temporal difference learning, already observed in the context of control learning, extend to TD-learning for value function estimation as well.
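The evaluation described above can be sketched as follows; the env, policy, and q_learned callables are hypothetical stand-ins for the acrobot simulator, the fixed policy, and the value function under evaluation, and are not part of the paper.

```python
import numpy as np

def value_estimation_mse(q_learned, policy, env, n_points=100, horizon=1000):
    """Estimate the MSE between learned action-values and undiscounted Monte
    Carlo returns of a fixed policy, roughly as in the acrobot evaluation."""
    errors = []
    for _ in range(n_points):
        s0, a0 = env.sample_state_action()     # random state-action pair
        state, action, ret = s0, a0, 0.0
        for _ in range(horizon):               # one Monte Carlo rollout
            state, reward, done = env.step(state, action)
            ret += reward                      # undiscounted return
            if done:
                break
            action = policy(state)
        errors.append((q_learned(s0, a0) - ret) ** 2)
    return float(np.mean(errors))
```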

Discussion and Conclusion

We have presented the natural residual gradient algorithm and proved that it is covariant. We suggested that the temporal difference learning metric tensor, derived for natural residual gradient, can be used to create other natural temporal difference learning algorithms like natural Sarsa(λ) and natural TDC. The resulting algorithms begin with the identity matrix as their estimate of the (inverse) metric tensor. This means that before an estimate of the (inverse) metric tensor has been formed, they still provide meaningful updates: they follow estimates of the non-natural gradient. We showed how the concept of compatible function approximation can be leveraged to create linear-time natural residual gradient and natural Sarsa(λ) algorithms. However, unlike the quadratic-time variants, these linear-time variants do not provide meaningful updates until the natural gradient has been estimated. As a result, learning is initially slower using the linear-time algorithms. In our empirical studies, the natural variants of all three algorithms outperformed their non-natural counterparts on all three domains. Additionally, the quadratic-time variants learn faster initially, as expected. Lastly, we showed empirically that the benefits of natural gradients are amplified when using non-linear function approximation.

Appendix A: Proof of Covariance Theorem

The following theorem and its proof closely follow and extend the foundations laid by Bagnell and Schneider (2003) and later clarified by Peters and Schaal (2008) when proving that the natural policy gradient is covariant. No algorithm can be covariant for all parameterizations. Thus, constraints on the parameterized functions that we consider are required.

Property 1. Functions g : Φ × X → ℝ and h : Θ × X → ℝ are two instantaneous loss functions parameterized by φ ∈ Φ and θ ∈ Θ respectively. These correspond to the loss functions ĝ(φ) = E_{x∈X}[g(φ, x)] and ĥ(θ) = E_{x∈X}[h(θ, x)]. For brevity, hereafter we suppress the x inputs to g and h. There exists a differentiable function, Ψ : Φ → Θ, such that for some φ ∈ Φ we have g(φ) = h(Ψ(φ)), and the Jacobian of Ψ is full rank.

Definition 1. Algorithm A is covariant if, for all g, h, Ψ, and φ satisfying Property 1,

g(φ + Δφ) = h(Ψ(φ) + Δθ),   (13)

where φ + Δφ and Ψ(φ) + Δθ are the parameters after an update of algorithm A.

Lemma 1. An algorithm A is covariant for sufficiently small step-sizes if

Δθ = J_{Ψ(φ)} Δφ.   (14)

Proof. Let J_{Ψ(φ)} be the Jacobian of Ψ(φ), i.e., J_{Ψ(φ)} = ∂Ψ(φ)/∂φ. As such, it maps tangent vectors of h to tangent vectors of g, such that

∇g(φ) = J_{Ψ(φ)}⊤ ∇h(Ψ(φ)),   (15)

when g(φ) = h(Ψ(φ)), as J_{Ψ(φ)} is a tangent map (Lee 2003, p. 63). Taking the first order Taylor expansion of both sides of (13), we obtain

h(Ψ(φ)) + ∇h(Ψ(φ))⊤ Δθ + O(‖Δθ‖²) = g(φ) + ∇g(φ)⊤ Δφ + O(‖Δφ‖²).

For small step-sizes, α > 0, the squared norms become negligible, and because g(φ) = h(Ψ(φ)), this simplifies to

∇h(Ψ(φ))⊤ Δθ = ∇g(φ)⊤ Δφ = (J_{Ψ(φ)}⊤ ∇h(Ψ(φ)))⊤ Δφ = ∇h(Ψ(φ))⊤ J_{Ψ(φ)} Δφ.   (16)

Notice that (16) is satisfied by Δθ = J_{Ψ(φ)} Δφ, and thus if this equality holds then A is covariant.

Theorem 1. The natural gradient update Δθ = −α G_θ⁻¹ ∇h(θ) is covariant when the metric tensor G_θ is given by

G_θ = E_{x∈X}[∇h(θ) ∇h(θ)⊤].   (17)

Proof. First, notice that the metric tensor G_φ is equivalent to G_θ with J_{Ψ(φ)} appearing twice as a factor:

G_φ = E_{x∈X}[∇g(φ) ∇g(φ)⊤]
    = E_{x∈X}[(J_{Ψ(φ)}⊤ ∇h(Ψ(φ))) (J_{Ψ(φ)}⊤ ∇h(Ψ(φ)))⊤]
    = E_{x∈X}[J_{Ψ(φ)}⊤ ∇h(Ψ(φ)) ∇h(Ψ(φ))⊤ J_{Ψ(φ)}]
    = J_{Ψ(φ)}⊤ E_{x∈X}[∇h(Ψ(φ)) ∇h(Ψ(φ))⊤] J_{Ψ(φ)}
    = J_{Ψ(φ)}⊤ G_θ J_{Ψ(φ)}.   (18)

We show that the right hand side of (14) is equal to the left, which, by Lemma 1, implies that the natural gradient update is covariant:

J_{Ψ(φ)} Δφ = −J_{Ψ(φ)} α G_φ⁻¹ ∇g(φ)
            = −J_{Ψ(φ)} α G_φ⁺ ∇g(φ)   (19)
            = −α J_{Ψ(φ)} (J_{Ψ(φ)}⊤ G_θ J_{Ψ(φ)})⁺ J_{Ψ(φ)}⊤ ∇h(Ψ(φ))
            = −α J_{Ψ(φ)} J_{Ψ(φ)}⁺ G_θ⁺ (J_{Ψ(φ)}⊤)⁺ J_{Ψ(φ)}⊤ ∇h(Ψ(φ)).

Since J_{Ψ(φ)} is full rank, J_{Ψ(φ)}⁺ is a left inverse, and thus

J_{Ψ(φ)} Δφ = −α G_θ⁻¹ ∇h(Ψ(φ)) = Δθ.
Notice that, unlike the proof that the natural actor-critic using LSTD is covariant (Peters and Schaal 2008), our proof does not assume that J_{Ψ(φ)} is invertible. Our proof is therefore more general, since it allows the dimensions of φ and θ to differ.
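As an informal sanity check of Theorem 1 (not part of the paper), the snippet below verifies numerically that, for a toy quadratic loss and a linear reparameterization θ = Ψ(φ) = Aφ, the natural-gradient updates computed in the two parameter spaces agree once mapped through the Jacobian, i.e., J_{Ψ(φ)} Δφ = Δθ. The small ridge term added to the rank-one metric tensor is purely for numerical invertibility in this toy example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

H = np.diag(rng.uniform(0.5, 2.0, size=n))   # toy loss h(theta) = 0.5 theta^T H theta
A = rng.normal(size=(n, n))                  # Jacobian of Psi(phi) = A phi (full rank a.s.)
theta = rng.normal(size=n)
phi = np.linalg.solve(A, theta)              # so that Psi(phi) = theta

grad_theta = H @ theta                       # gradient of h at theta
grad_phi = A.T @ grad_theta                  # chain rule: gradient of g = h(Psi(.)) at phi

G_theta = np.outer(grad_theta, grad_theta) + 1e-6 * np.eye(n)  # Eq. (17), ridge for invertibility
G_phi = A.T @ G_theta @ A                                       # Eq. (18)

alpha = 0.1
delta_theta = -alpha * np.linalg.solve(G_theta, grad_theta)     # natural update in theta-space
delta_phi = -alpha * np.linalg.solve(G_phi, grad_phi)           # natural update in phi-space

print(np.allclose(A @ delta_phi, delta_theta))                  # True: the updates agree
```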

References

Amari, S., and Douglas, S. 1998. Why natural gradient? In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), volume 2.

Amari, S. 1998. Natural gradient works efficiently in learning. Neural Computation 10.

Bagnell, J. A., and Schneider, J. 2003. Covariant policy search. In Proceedings of the International Joint Conference on Artificial Intelligence.

Baird, L. 1995. Residual algorithms: reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning.

Barto, A. G.; Sutton, R. S.; and Anderson, C. W. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 13(5).

Bergstra, J., and Bengio, Y. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research.

Bhatnagar, S.; Sutton, R. S.; Ghavamzadeh, M.; and Lee, M. 2009. Natural actor-critic algorithms. Automatica 45(11).

Degris, T.; Pilarski, P. M.; and Sutton, R. S. 2012. Model-free reinforcement learning with continuous action in practice. In Proceedings of the 2012 American Control Conference.

Kakade, S. 2002. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 14.

Konidaris, G. D.; Kuindersma, S. R.; Grupen, R. A.; and Barto, A. G. 2012. Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research 31.

Kushner, H. J., and Yin, G. 2003. Stochastic Approximation and Recursive Algorithms and Applications. Springer.

Lee, J. M. 2003. Introduction to Smooth Manifolds. Springer.

Morimura, T.; Uchibe, E.; and Doya, K. 2005. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and its Application.

Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71.

Slate, D. 1991. Letter Recognition Data Set. UCI Machine Learning Repository.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12.

Sutton, R. S.; Maei, H. R.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesvári, C.; and Wiewiora, E. 2009. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM.
