
True Online Temporal-Difference Learning

Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Marlos C. Machado, Richard S. Sutton
Reinforcement Learning and Artificial Intelligence Laboratory
Department of Computing Science, University of Alberta, T6G 2E8, Canada

Abstract

The temporal-difference methods TD(λ) and Sarsa(λ) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, new versions of these methods were introduced, called true online TD(λ) and true online Sarsa(λ), respectively (van Seijen and Sutton, 2014). Algorithmically, these true online methods only make two small changes to the update rules of the regular methods, and the extra computational cost is negligible in most cases. However, they follow the ideas underlying the forward view much more closely. In particular, they maintain an exact equivalence with the forward view at all times, whereas the traditional versions only approximate it for small step-sizes. We hypothesize that these true online methods not only have better theoretical properties, but also dominate the regular methods empirically. In this article, we put this hypothesis to the test by performing an extensive empirical comparison. Specifically, we compare the performance of true online TD(λ)/Sarsa(λ) with regular TD(λ)/Sarsa(λ) on random MRPs, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment. We use linear function approximation with tabular, binary, and non-binary features. Our results suggest that the true online methods indeed dominate the regular methods. Across all domains/representations the learning speed of the true online methods is often better, but never worse, than that of the regular methods. An additional advantage is that no choice between traces has to be made for the true online methods. We show that new true online temporal-difference methods can be derived by making changes to the real-time forward view and then rewriting the update equations.

1. Introduction

Temporal-difference (TD) learning is a core learning technique in modern reinforcement learning (Sutton, 1988; Kaelbling et al., 1996; Sutton and Barto, 1998; Szepesvári, 2010). One of the main challenges in reinforcement learning is to make predictions, in an initially unknown environment, about the (discounted) sum of future rewards, the return, based on currently observed feature values and a certain behaviour policy. With TD learning it is

possible to learn good estimates of the expected return quickly by bootstrapping from other expected-return estimates. TD(λ) (Sutton, 1988) is a popular TD algorithm that combines basic TD learning with eligibility traces to further speed learning. The popularity of TD(λ) can be explained by its simple implementation, its low computational complexity, and its conceptually straightforward interpretation, given by its forward view. The forward view of TD(λ) is that the estimate at each time step is moved toward an update target known as the λ-return, where the λ-parameter determines the trade-off between bias and variance of the update target. This trade-off has a large influence on the speed of learning and its optimal setting varies from domain to domain. The ability to improve this trade-off by adjusting the value of λ is what underlies the performance advantage of eligibility traces.

Although the forward view provides a clear intuition, TD(λ) closely approximates the forward view only for appropriately small step-sizes. Until recently, this was considered an unfortunate, but unavoidable part of the theory behind TD(λ). This changed with the introduction of true online TD(λ) (van Seijen and Sutton, 2014), which allows for full control over the bias-variance trade-off at any time. In particular, true online TD(1) can achieve fully unbiased updates. Moreover, true online TD(λ) only requires small modifications to the TD(λ) update equations, and the extra computational cost is negligible in most cases.

We hypothesize that true online TD(λ), and its control version true online Sarsa(λ), not only have better theoretical properties than their regular counterparts, but also dominate them empirically. We test this hypothesis by performing an extensive empirical comparison between true online TD(λ), TD(λ) with accumulating traces and TD(λ) with replacing traces, as well as true online Sarsa(λ) and Sarsa(λ) (with accumulating and replacing traces). The domains we use include random Markov reward processes, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment (Bellemare et al., 2013). The representations we consider range from tabular values to linear function approximation with binary and non-binary features.

Besides the empirical study, we show how true online TD(λ) can be derived. The derivation is based on an extended version of the forward view. Whereas the updates of the traditional forward view can only be computed at the end of an episode, the updates of this extended forward view can be computed in real-time, making it applicable even to non-episodic tasks. By rewriting the updates of this real-time forward view, the true online TD(λ) updates can be derived. This derivation forms a blueprint for the derivation of other true online methods. By making variations to the real-time forward view and following the same derivation as for true online TD(λ), we derive several other true online methods.

This article is organized as follows. We start by presenting the required background on Markov decision processes and introducing TD(λ), true online TD(λ), and true online Sarsa(λ). We then present our empirical study. After this study, we analyze on what type of domains a large performance difference can be expected. This is followed by the introduction of the real-time forward view and the derivation of true online TD(λ). Finally, we present several other true online methods.

2. Markov Decision Processes

Here, we present the main learning framework. As a convention, we indicate random variables by capital letters (e.g., S_t, R_t), vectors by bold letters (e.g., θ, φ), functions by lowercase letters (e.g., v), and sets by calligraphic font (e.g., S, A).

Reinforcement learning (RL) problems are often formalized as Markov decision processes (MDPs), which can be described as 5-tuples of the form ⟨S, A, p, r, γ⟩, consisting of: S, the set of all states; A, the set of all actions; p(s' | s, a), the transition probability function, giving for each state s ∈ S and action a ∈ A the probability of a transition to state s' ∈ S at the next step; r(s, a, s'), the reward function, giving the expected reward for a transition from (s, a) to s'; and γ, the discount factor, specifying how future rewards are weighted with respect to the immediate reward. Some MDPs contain terminal states, which divide the sequence of state transitions into episodes. When a terminal state is reached the current episode ends and the state is reset to the initial state.

The return at time t is defined as the discounted sum of rewards observed after t:

    G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{i=1}^∞ γ^{i−1} R_{t+i},

where R_{t+1} is the reward received after taking action A_t in state S_t. For an episodic MDP, the return is defined as the discounted sum of rewards until the end of the episode:

    G_t = Σ_{i=1}^{T−t} γ^{i−1} R_{t+i},

where T is the time step at which the terminal state is reached. Actions are taken at discrete time steps t = 0, 1, 2, ... according to a policy π : S × A → [0, 1], which defines for each action the selection probability conditioned on the state. Each policy π has a corresponding state-value function v_π(s), which maps each state s ∈ S to the expected value of the return G_t from that state, when following policy π:

    v_π(s) = E{G_t | S_t = s, π}.

In addition, the action-value function q_π(s, a) gives the expected return for policy π, given that action a ∈ A is taken in state s ∈ S:

    q_π(s, a) = E{G_t | S_t = s, A_t = a, π}.

A core task in RL is that of estimating the state-value function v_π of some policy π from data. In general, the learner does not have access to state s directly, but can only observe a feature vector φ(s) ∈ R^n. We estimate the value function using linear function approximation, in which case the value of a state s is the inner product between a weight vector θ and its feature vector φ(s):

    v̂(s, θ) = θ^⊤ φ(s) = Σ_{i=1}^n θ_i φ_i(s).
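To make the linear scheme concrete, here is a minimal sketch (our own illustration, with made-up numbers) of the state-value estimate as an inner product; the action-value estimate introduced below works the same way:

    import numpy as np

    def v_hat(phi, theta):
        # Linear state-value estimate: v_hat(s, theta) = theta^T phi(s)
        return float(np.dot(theta, phi))

    # Hypothetical numbers, n = 3 features:
    theta = np.array([0.5, -0.2, 1.0])
    phi_s = np.array([1.0, 0.0, 2.0])
    print(v_hat(phi_s, theta))  # 0.5*1 + (-0.2)*0 + 1.0*2 = 2.5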

If s is a terminal state, then by definition φ(s) := 0, and hence v̂(s, θ) = 0. As a shorthand, we will indicate φ(S_t), the feature vector of the state visited at time step t, by φ_t.

Similarly, the action-value function q_π can be estimated using linear function approximation. In this case, the estimate is the inner product between a weight vector and an action-feature vector ψ(s, a):

    q̂(s, a, θ) = θ^⊤ ψ(s, a) = Σ_{i=1}^n θ_i ψ_i(s, a).

If s is a terminal state, then by definition ψ(s, a) := 0 for all actions a. As a convention, we will use ψ to indicate action-feature vectors and φ to indicate state-feature vectors. As a shorthand, we will indicate ψ(S_t, A_t) by ψ_t.

A general model-free update rule for linear function approximation is:

    θ_{t+1} = θ_t + α [U_t − θ_t^⊤ φ_t] φ_t,    (1)

where U_t, the update target, is some estimate of the expected return at time step t. There are many ways to construct an update target. For example, the TD(0) update target is:

    U_t = R_{t+1} + γ θ_t^⊤ φ_{t+1}.    (2)

Update (1) is referred to as an online update, meaning that the weight vector changes at every time step t. Alternatively, an update target can be used for offline updating. In this case, the weight vector stays constant during an episode, and instead all weight corrections are added at once at the end of the episode. Online updating not only has the advantage that it can be applied to non-episodic tasks, but it will generally produce better value-function estimates, even when only considering the estimates at the end of an episode (see Sutton & Barto, 1998, Sections 7.1-7.3). Hence, offline updating is primarily used as an analytical tool; it is rarely used in practice.

3. Algorithms

In this section, we present the algorithms that we will compare: TD(λ) with accumulating as well as replacing traces, and true online TD(λ). We also present the control version of true online TD(λ): true online Sarsa(λ). Finally, we discuss several other variations of TD(λ).

3.1 Conventional TD(λ)

The conventional TD(λ) algorithm is defined by the following update equations:

    δ_t = R_{t+1} + γ θ_t^⊤ φ_{t+1} − θ_t^⊤ φ_t    (3)
    e_t = γλ e_{t−1} + φ_t    (4)
    θ_{t+1} = θ_t + α δ_t e_t    (5)

for t ≥ 0, and with e_{−1} = 0. The scalar δ_t is called the TD error. The vector e_t is called the eligibility-trace vector, and the parameter λ ∈ [0, 1] is called the trace-decay parameter. The update of e_t shown above is referred to as the accumulating-trace update.

As a shorthand, we will refer to this version of TD(λ) as accumulate TD(λ). Algorithm 1 shows the corresponding pseudocode.

Algorithm 1: accumulate TD(λ)
    INPUT: α, λ, γ, θ_init
    θ ← θ_init
    Loop (over episodes):
        obtain initial φ
        e ← 0
        While terminal state has not been reached, do:
            obtain next feature vector φ' and reward R
            δ ← R + γ θ^⊤ φ' − θ^⊤ φ
            e ← γλ e + φ
            θ ← θ + α δ e
            φ ← φ'

Accumulate TD(λ) can be very sensitive with respect to the α and λ parameters. Especially, a large value of λ combined with a large value of α can easily cause divergence, even on simple tasks with bounded rewards. For this reason, a variant of TD(λ) is often used that is more robust with respect to these parameters. This variant, which assumes binary features, uses a different trace-update equation:

    e_t(i) = γλ e_{t−1}(i)   if φ_t(i) = 0
    e_t(i) = 1               if φ_t(i) = 1

for all features i. This is referred to as the replacing-trace update. In this article, we use a simple generalization of this update rule that allows us to apply it to domains with non-binary features as well:

    e_t(i) = γλ e_{t−1}(i)   if φ_t(i) = 0
    e_t(i) = φ_t(i)          if φ_t(i) ≠ 0    (6)

for all features i. Note that for binary features this generalized trace update reduces to the default replacing-trace update. We will refer to the version of TD(λ) that uses Equation 6 as replace TD(λ).
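For concreteness, the following NumPy sketch mirrors Algorithm 1 together with the generalized replacing-trace update (6). It is our own rendering, not the paper's code; in particular, the environment interface (env.reset() returning an initial feature vector, env.step() returning the next feature vector, the reward, and a termination flag) is a hypothetical stand-in.

    import numpy as np

    def td_lambda_episode(env, theta, alpha, lam, gamma, trace="accumulate"):
        # One episode of conventional TD(lambda) with linear function approximation.
        # At a terminal state, env.step() returns phi_next = 0.
        phi = env.reset()
        e = np.zeros_like(theta)                 # eligibility trace, e_{-1} = 0
        done = False
        while not done:
            phi_next, R, done = env.step()
            delta = R + gamma * theta.dot(phi_next) - theta.dot(phi)   # Eq. (3)
            if trace == "accumulate":
                e = gamma * lam * e + phi                              # Eq. (4)
            else:                                # generalized replacing trace, Eq. (6)
                e = np.where(phi == 0, gamma * lam * e, phi)
            theta = theta + alpha * delta * e                          # Eq. (5)
            phi = phi_next
        return theta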

3.2 True Online TD(λ)

The true online TD(λ) update equations are:

    δ_t = R_{t+1} + γ θ_t^⊤ φ_{t+1} − θ_t^⊤ φ_t    (7)
    e_t = γλ e_{t−1} + φ_t − αγλ [e_{t−1}^⊤ φ_t] φ_t    (8)
    θ_{t+1} = θ_t + α δ_t e_t + α [θ_t^⊤ φ_t − θ_{t−1}^⊤ φ_t][e_t − φ_t]    (9)

for t ≥ 0, and with e_{−1} = 0. Compared to accumulate TD(λ) (equations (3), (4) and (5)), both the trace update and the weight update have an additional term. We call a trace updated in this way a dutch trace; we call the term α[θ_t^⊤ φ_t − θ_{t−1}^⊤ φ_t][e_t − φ_t] the TD-error time-step correction, or simply the δ-correction. Algorithm 2 shows pseudocode that implements equations (7), (8) and (9).¹

Algorithm 2: true online TD(λ)
    INPUT: α, λ, γ, θ_init
    θ ← θ_init, v̂_old ← 0
    Loop (over episodes):
        obtain initial φ
        e ← 0
        While terminal state has not been reached, do:
            obtain next feature vector φ' and reward R
            v̂ ← θ^⊤ φ
            v̂' ← θ^⊤ φ'
            δ ← R + γ v̂' − v̂
            e ← γλ e + φ − αγλ (e^⊤ φ) φ
            θ ← θ + α(δ + v̂ − v̂_old) e − α(v̂ − v̂_old) φ
            v̂_old ← v̂'
            φ ← φ'

1. We provide pseudocode for true online TD(λ) with a time-dependent step-size in Section 7.1. For reasons explained in that section, this requires a modified trace update. In addition, for reference purposes, we provide pseudocode for the special case of tabular features in a later section.
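In the same hedged style (and with the same hypothetical environment interface as above), Algorithm 2 differs from the sketch of Algorithm 1 only in the dutch-trace update and the δ-correction:

    import numpy as np

    def true_online_td_lambda_episode(env, theta, alpha, lam, gamma):
        # One episode of true online TD(lambda), following Algorithm 2.
        phi = env.reset()
        e = np.zeros_like(theta)
        v_old = 0.0
        done = False
        while not done:
            phi_next, R, done = env.step()
            v = theta.dot(phi)
            v_next = theta.dot(phi_next)
            delta = R + gamma * v_next - v
            # dutch trace, Eq. (8)
            e = gamma * lam * e + phi - alpha * gamma * lam * e.dot(phi) * phi
            # weight update with the delta-correction, Eq. (9)
            theta = theta + alpha * (delta + v - v_old) * e - alpha * (v - v_old) * phi
            v_old = v_next
            phi = phi_next
        return theta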

3.3 Computational Comparison

Using the pseudocode and update equations, we can compare the computational cost of the three versions of TD(λ). Let n be the total number of features and m the number of features with a non-zero value. Then, the number of basic operations (addition and multiplication) per time step for accumulate TD(λ) is 3n + 5m. For replace TD(λ) this number is 3n + 4m (the replacing-trace update takes (n − m) + m operations, instead of n + m for an accumulating trace). True online TD(λ) takes 3n + 11m operations in total (computing and subtracting the vector αγλ(e^⊤φ)φ requires 4m operations; adding the δ-correction requires 2m operations). Hence, if sparse feature vectors are used (that is, if m << n), the computational overhead of true online TD(λ) is minimal compared to accumulate/replace TD(λ). If non-sparse feature vectors are used (that is, if m = n), accumulate TD(λ), replace TD(λ) and true online TD(λ) require 8n, 7n and 14n operations, respectively. So in this case, true online TD(λ) is roughly twice as expensive as conventional TD(λ).

3.4 True Online Sarsa(λ)

TD(λ) and true online TD(λ) are policy evaluation methods. However, they can be turned into control methods in a straightforward way. From a learning perspective, the main difference is that an estimate of the action-value function q_π should be learned, rather than of the state-value function v_π. In other words, action feature-vectors instead of state feature-vectors have to be used. Another difference is that instead of having a fixed policy that generates the behaviour, the policy depends on the action-value estimates. Because these estimates typically improve over time, so does the policy.

The (on-policy) control counterpart of TD(λ) is the popular Sarsa(λ) algorithm. The control counterpart of true online TD(λ) is true online Sarsa(λ). Algorithm 3 shows pseudocode for true online Sarsa(λ).

Algorithm 3: true online Sarsa(λ)
    INPUT: α, λ, γ, θ_init
    θ ← θ_init, q̂_old ← 0
    Loop (over episodes):
        obtain initial state S
        select action A based on state S (for example, ε-greedy)
        ψ ← features corresponding to S, A
        e ← 0
        While terminal state has not been reached, do:
            take action A, observe next state S' and reward R
            select action A' based on state S'
            ψ' ← features corresponding to S', A' (if S' is a terminal state, ψ' ← 0)
            q̂ ← θ^⊤ ψ
            q̂' ← θ^⊤ ψ'
            δ ← R + γ q̂' − q̂
            e ← γλ e + ψ − αγλ [e^⊤ ψ] ψ
            θ ← θ + α(δ + q̂ − q̂_old) e − α(q̂ − q̂_old) ψ
            q̂_old ← q̂'
            ψ ← ψ' ; A ← A'

To ensure accurate estimates for all state-action values are obtained, some exploration strategy has to be used. A simple, but often sufficient strategy is to use an ε-greedy behaviour policy. That is, given current state S, with probability ε a random action is selected, and with probability 1 − ε the greedy action is selected:

    A_greedy = argmax_{a ∈ A} θ^⊤ ψ(S, a).

A common way to derive an action-feature vector ψ(s, a) from a state-feature vector φ(s) involves an action-feature vector of size n|A|, where n is the number of state features and |A| is the number of actions. Each action corresponds with a block of n features in this action-feature vector. The features in ψ(s, a) that correspond to action a take on the values of the state features; the features corresponding to other actions have a value of 0.
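As a small sketch of the block construction just described (function names are ours), together with ε-greedy action selection over the resulting action values:

    import numpy as np

    def action_features(phi, a, num_actions):
        # psi(s, a): |A| blocks of n features; only the block of action `a` is non-zero
        n = len(phi)
        psi = np.zeros(n * num_actions)
        psi[a * n:(a + 1) * n] = phi
        return psi

    def epsilon_greedy(phi, theta, num_actions, epsilon, rng):
        # With probability epsilon pick a random action, otherwise the greedy one
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        q = [theta.dot(action_features(phi, a, num_actions)) for a in range(num_actions)]
        return int(np.argmax(q))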

3.5 Other Variations on TD(λ)

Several variations on TD(λ) other than those treated in this paper have been suggested in the literature. Schapire and Warmuth (1996) introduced a variation of TD(λ) for which upper and lower bounds on performance can be derived and proven. Maei, Szepesvári, Sutton, and others (Maei, 2011; Sutton et al., 2009a,b, 2014) have explored generalizations of TD(λ)-like algorithms to off-policy learning, in which the behavior policy (generating the data) and the evaluation policy (whose value function is being learned) are allowed to be different.

4. Empirical Study

This section contains our main empirical study, comparing TD(λ), as well as Sarsa(λ), with their true online counterparts. For each method and each domain, a scan over the step-size α and the trace-decay parameter λ is performed such that the optimal performance can be determined. In Section 4.4, we discuss the results.

4.1 Random MRPs

For our first series of experiments we used randomly constructed Markov Reward Processes (MRPs).² An MRP can be interpreted as an MDP with only a single action per state (consequently, there is only one policy possible). We represent a random MRP as a 3-tuple (k, b, σ), consisting of k, the number of states; b, the branching factor (that is, the number of possible next states per transition); and σ, the standard deviation of the reward. The next states for a particular state are drawn from the total set of states at random, and without replacement. The transition probabilities to those states are randomized as well (by partitioning the unit interval at b − 1 random cut points). The expected value of the reward for a transition is drawn from a normal distribution with zero mean and unit variance. The actual reward is drawn from a normal distribution with mean equal to this expected reward and standard deviation σ. Our random MRPs do not contain terminal states.³

We compared the performance of TD(λ) on three different MRPs: one with a small number of states, (10, 3, 0.1), one with a large number of states, (100, 10, 0.1), and one with a large number of states but a low branching factor and no stochasticity in reward generation, (100, 3, 0). γ = 0.99 for all three MRPs.

Each MRP is evaluated using three different representations. The first representation consists of tabular features, that is, each state is represented with a unique standard-basis vector of k dimensions. The second representation is based on binary features. The binary representation is constructed by first assigning indices, from 1 to k, to all states. Then, the binary encoding of the index of a state is used as a feature vector to represent that state. The length of a feature vector is determined by the total number of states: for k = 10, the length is 4; for k = 100, the length is 7. As an example, for k = 10 the feature vectors of states 1, 2 and 3 are (0, 0, 0, 1), (0, 0, 1, 0) and (0, 0, 1, 1), respectively. Finally, the third representation uses non-binary, normalized features. For this representation each state is mapped to a 5-dimensional feature vector, with the value of each feature drawn from a normal distribution with zero mean and unit variance. After all the feature values for a state are drawn, they are normalized such that the feature vector has unit length. Once generated, the feature vectors are kept fixed for each state. We refer to this last representation as the normal representation.

2. The process we used to construct these MRPs is based on the process used by Bhatnagar, Sutton, Ghavamzadeh and Lee (2009).
3. The code for the MRP experiments is published online at: https://github.com/armahmood/totd-rndmdp-experiments
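The following sketch, assembled by us from the description above, constructs such an MRP and the binary representation; the paper's exact sampling details may differ.

    import numpy as np

    def random_mrp(k, b, sigma, rng):
        # Random MRP (k states, branching factor b, reward std sigma), no terminal states
        P = np.zeros((k, k))                        # transition probabilities
        r_mean = rng.normal(0.0, 1.0, size=(k, k))  # expected rewards ~ N(0, 1)
        for s in range(k):
            next_states = rng.choice(k, size=b, replace=False)
            cuts = np.sort(rng.random(b - 1))       # b - 1 random cut points
            P[s, next_states] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
        return P, r_mean

    def mrp_step(s, P, r_mean, sigma, rng):
        # Sample a transition; the actual reward ~ N(expected reward, sigma)
        s_next = rng.choice(len(P), p=P[s])
        return s_next, rng.normal(r_mean[s, s_next], sigma)

    def binary_features(state_index, length):
        # Binary encoding of the 1-based state index, e.g. state 3, k = 10 -> (0, 0, 1, 1)
        return np.array([int(b) for b in np.binary_repr(state_index, width=length)], float)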

Figure 1: MSE during early learning for three different MRPs, indicated by (k, b, σ), and three different representations (tabular, binary and normal features), comparing accumulate TD(λ), replace TD(λ) and true online TD(λ). The error shown is at the optimal α value, normalized by the MSE under the initial weight estimate.

In each experiment, we performed a scan over α and λ. Specifically, between 0 and 0.1, α is varied according to 10^i with i varying from −3 to −1 with steps of 0.2, and from 0.1 to 2.0 (linearly) with steps of 0.1. In addition, λ is varied from 0 to 0.9 with steps of 0.1, and from 0.9 to 1.0 with finer steps. The initial weight vector is the zero vector in all domains. As performance metric we used the mean-squared error (MSE) with respect to the LMS solution during early learning (for k = 10, we averaged over the first 100 time steps; for k = 100, we averaged over the first 1000 time steps). We normalized this error by dividing it by the MSE under the initial weight estimate.

Figure 1 shows the results for different λ at the best value of α. In Appendix A, the results for all α values are shown. A number of observations can be made. First of all, the straightforward generalization of the replacing-trace update rule, Equation (6), is not effective. For all three domains, when replacing traces are combined with normal features, all λ values result in the same performance. The reason is that normal features practically never become zero, and hence e_t = φ_t almost all the time. A second observation is that the optimal performance of true online TD(λ) is, on all domains and for all representations, at least as good as the optimal performance of accumulate TD(λ) or replace TD(λ). A more in-depth discussion of these results is provided in Section 4.4.
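For concreteness, the evaluation metric can be computed along the following lines. This is our own sketch: it obtains the true values from the MRP model, fits the LMS solution under a state distribution d, and measures the weighted mean-squared distance of a weight vector from that solution; the paper's exact weighting of states is an assumption here.

    import numpy as np

    def true_values(P, r_mean, gamma):
        # v = (I - gamma P)^(-1) r_bar, with r_bar(s) the expected immediate reward from s
        r_bar = (P * r_mean).sum(axis=1)
        return np.linalg.solve(np.eye(len(P)) - gamma * P, r_bar)

    def lms_solution(Phi, v, d):
        # Least-squares fit of v under state weights d (sqrt-weighting trick)
        w = np.sqrt(d)
        return np.linalg.lstsq(Phi * w[:, None], v * w, rcond=None)[0]

    def mse_vs_lms(theta, theta_lms, Phi, d):
        # Weighted mean-squared distance between the estimate and the LMS solution
        diff = Phi @ (theta - theta_lms)
        return float(np.sum(d * diff ** 2))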

Figure 2: Source of the input data stream and predicted signals used in this experiment: a participant with an amputation performing a simple grasping task using a myoelectrically controlled robot arm, as described in Pilarski et al. (2013). More detail on the subject and experimental setting can be found in Hebert et al. (2014).

4.2 Predicting Signals from a Myoelectric Prosthetic Arm

In this experiment, we compared the performance of true online TD(λ) and TD(λ) on a real-world data set consisting of sensorimotor signals measured during the human control of an electromechanical robot arm. The source of the data is a series of manipulation tasks performed by a participant with an amputation, as presented by Pilarski et al. (2013). In this study, an amputee participant used signals recorded from the muscles of their residual limb to control a robot arm with multiple degrees-of-freedom (Figure 2). Interactions of this kind are known as myoelectric control (c.f., Parker et al., 2006).

For consistency and comparison of results, we used the same source data and prediction learning architecture as published in Pilarski et al. (2013). In total, two signals are predicted: grip force and motor angle signals from the robot's hand. Specifically, the target for the prediction is a discounted sum of each signal over time, similar to return predictions (c.f., general value functions and nexting; Sutton et al., 2011; Modayil et al., 2014). Where possible, we used the same implementation and code base as Pilarski et al. (2013).

Data for this experiment consisted of 58,000 time steps of recorded sensorimotor information, sampled at 40 Hz (i.e., approximately 25 minutes of experimental data). The state space consisted of a tile-coded representation of the robot gripper's position, velocity, recorded gripping force, and two muscle contraction signals from the human user. A standard implementation of tile-coding was used, with ten bins per signal, eight overlapping tilings, and a single active bias unit. This results in a state space with 800,001 features, 9 of which were active at any given time. Hashing was used to reduce this space down to a vector of 200,000 features that are then presented to the learning system.
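To make the prediction target concrete: for a recorded signal, the ideal return-like target at time t is the discounted sum of the signal's future values. A small sketch of computing such targets offline (our own illustration; the online algorithms above learn to approximate these targets):

    import numpy as np

    def discounted_sum_targets(signal, gamma):
        # Ideal targets G_t = sum_{i>=1} gamma^(i-1) * signal[t+i], computed
        # backwards; the sum is truncated at the end of the recording.
        G = np.zeros(len(signal))
        acc = 0.0
        for t in range(len(signal) - 2, -1, -1):
            acc = signal[t + 1] + gamma * acc
            G[t] = acc
        return G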

All signals were normalized between 0 and 1 before being provided to the function approximation routine. The discount factor for predictions of both force and angle was γ = 0.97, as in the results presented by Pilarski et al. (2013). Parameter sweeps over λ and α are conducted for all three methods. The performance metric is the mean absolute return error over all 58,000 time steps of learning, normalized by dividing by the error for λ = 0.

Figure 3 shows the performance for the angle as well as the force predictions at the best α value for different values of λ. In Appendix B, the results for all α values are shown. The relative performance of replace TD(λ) and accumulate TD(λ) depends on the predictive question being asked. For predicting the robot's grip force signal (a signal with small magnitude and rapid changes), replace TD(λ) is better than accumulate TD(λ) at all non-zero λ values. However, for predicting the robot's hand actuator position, a smoothly changing signal that varies across its normalized range, accumulate TD(λ) dominates replace TD(λ) over all non-zero λ values. True online TD(λ) dominates both methods for all non-zero λ values on both prediction tasks (force and angle).

Figure 3: Performance as a function of λ at the optimal α value, for the prediction of the servo motor angle (left), as well as the grip force (right), comparing accumulate TD(λ), replace TD(λ) and true online TD(λ).

4.3 Control in the ALE Domain Asterix

In this experiment, we compared the performance of true online Sarsa(λ) with that of accumulate Sarsa(λ) and replace Sarsa(λ), on a domain from the Arcade Learning Environment (ALE) (Bellemare et al., 2013; Defazio and Graepel, 2014; Mnih et al., 2015), called Asterix.⁴ The ALE is a general testbed that provides an interface to hundreds of Atari 2600 games in which one has access, at each frame, to the game screen, the current RAM state and to a reward signal obtained from the transition between game frames. At each frame the agent provides one of the 18 possible actions in the game (equivalent to the 18 different actions allowed on the joystick) with the goal of maximizing the (discounted) sum of rewards.

4. The code for the ALE experiments is published online at: https://github.com/mcmachado/trueonlinesarsa

In the Asterix domain (see Figure 4 for a screenshot), the agent controls a yellow avatar, which has to collect potion objects, while avoiding harp objects. Both potions and harps move across the screen horizontally. Every time the agent collects a potion it receives a reward of 50 points, and every time it touches a harp it loses a life (it has three lives in total). The game ends after the agent has lost three lives, or after 5 minutes, whichever comes first.⁵

Figure 4: Screenshot of the game Asterix.

The agent can use the actions up, right, down, and left to move across the screen, a no-op action, as well as combinations of two directions, resulting in a diagonal move (e.g., up-right). This results in 9 actions in total. The state-space representation is based on linear function approximation. We use what Bellemare et al. (2013) called the Basic feature set, which encodes the presence of colours on the Atari 2600 screen. It is obtained by first subtracting the game screen background (see Bellemare et al., 2013) and then dividing the remaining screen into tiles. Finally, for each tile, one binary feature is generated for each of the 128 available colours, encoding whether a colour is active or not in that tile. This generates 28,672 features (besides a bias term that is always active).

Because episode lengths can vary hugely (basically, from about 10 seconds all the way up to 5 minutes), constructing a fair performance metric is non-trivial. For example, comparing the average return on the first N episodes of two methods is only fair if they have seen roughly the same amount of samples in those episodes, which is not guaranteed for this domain. On the other hand, looking at the total reward collected for the first X samples is also not a good metric, because there is no negative reward associated with dying. To resolve this, we look at the return per episode, averaged over the first n(X) episodes, where n(X) is the number of episodes observed in the first X samples. More specifically, our metric consists of the average score per episode while learning for 20 hours (4,320,000 frames). In addition, we averaged the resulting number over 400 independent runs.

As with the evaluation experiments, we performed a scan over the step-size α and the trace-decay parameter λ. Specifically, we looked at all combinations of α ∈ {0.20, 0.50, 0.80, 1.10, 1.40, 1.70, 2.00} and λ ∈ {0.00, 0.50, 0.80, 0.90, 0.95, 0.99} (these values were determined during a preliminary parameter sweep).

5. We added the 5 minute time limit ourselves as in previous work (Bellemare et al., 2013); the original game has no time limit.

We used a fixed discount factor γ and ε-greedy exploration with a fixed ε. The weight vector was initialized to the zero vector. Also, as Bellemare et al. (2013), we take an action every 5 frames; this decreases the algorithms' running time and it also tries to avoid super-human reflexes in our agents. The results are shown in Figure 5. On this domain, the optimal performance of all three versions of Sarsa(λ) is similar.

Figure 5: Return per episode, averaged over the first 4,320,000 frames as well as 400 independent runs, as a function of λ, at optimal α, on the Asterix domain, for accumulate Sarsa(λ), replace Sarsa(λ) and true online Sarsa(λ).

4.4 Discussion

Figure 6 summarizes the performance of the different TD(λ) versions on all evaluation domains. Specifically, it shows the error for each method at its best settings of α and λ. The error is normalized by dividing it by the error at λ = 0 (remember that all versions of TD(λ) behave the same for λ = 0). Because λ = 0 lies in the parameter range that is being optimized over, the normalized error can never be higher than 1. If for a method/domain the normalized error is equal to 1, this means that setting λ higher than 0 either has no effect, or that the error gets worse. In either case, eligibility traces are not effective for that method/domain.

Overall, true online TD(λ) is clearly better than accumulate TD(λ) and replace TD(λ) in terms of optimal performance. Specifically, on each considered domain, the error for true online TD(λ) is either smaller than or equal to the error of accumulate/replace TD(λ). This is especially impressive, given the wide variety of domains, and the fact that the computational overhead for true online TD(λ) is small (see Section 3.3 for details). Comparing accumulate TD(λ) with replace TD(λ), it can be seen that, when considering tabular or binary features, on some domains accumulate TD(λ) performs best, while on others replace TD(λ) performs best. When normal features are used, our naive generalization of replace TD(λ) is not effective (standard replace TD(λ) is not defined for normal features).

Figure 6: Summary of the evaluation results: error at optimal (α, λ)-settings for accumulate TD(λ), replace TD(λ) and true online TD(λ) on all domains ((10, 3, 0.1), (100, 10, 0.1) and (100, 3, 0) with tabular, binary and normal features, plus the prosthetic angle and force predictions), normalized with the TD(0) error.

On the Asterix domain, the performance of the three Sarsa(λ) versions is similar. This is in accordance with the evaluation results, which showed that the size of the performance difference is domain dependent. In the worst case, the performance of the true online method is similar to that of the regular method.

The optimal performance is not the only factor that determines how good a method is; what also matters is how easy it is to find this performance. The detailed plots in Appendix A and B reveal that the parameter sensitivity of accumulate TD(λ) is much higher than that of true online TD(λ) and replace TD(λ). This is clearly visible in the first MRP task (Figure 10), as well as the experiments with the myoelectric prosthetic arm (Figure 13).

There is one more thing to take away from the experiments. In the first MRP, (10, 3, 0.1), with normal features, accumulate TD(λ), as well as replace TD(λ), are ineffective (see Figure 6: the normalized performance of accumulate/replace TD(λ) is 1, meaning that the performance at optimized λ is equal to the performance of TD(0)). However, true online TD(λ) was able to obtain a considerable performance advantage with respect to TD(0). This demonstrates that true online TD(λ) expands the set of domains/representations where eligibility traces are effective. This could potentially have far-reaching consequences. Specifically, using non-binary features becomes a lot more interesting. Replacing traces are not feasible or not effective for such representations, while using accumulating traces can easily result in divergence of values. However, for true online TD(λ), non-binary features are not necessarily more challenging than binary features. Exploring new, non-binary representations could potentially further improve the performance of true online TD(λ) on domains such as the myoelectric prosthetic arm or the Asterix domain.

5. Analytical Comparison

The empirical study suggests that true online TD(λ) performs at least as well as accumulate TD(λ) and replace TD(λ). In this section, we try to answer the question on what kind of domains a large difference in performance can be expected, and similarly, when no difference is expected. The following three theorems provide some insights into this.

Theorem 1 For λ = 0, accumulate TD(λ), replace TD(λ) and true online TD(λ) behave the same.

Proof For λ = 0, the accumulating-trace update, the (generalized) replacing-trace update and the dutch-trace update all reduce to e_t = φ_t. In addition, because e_t = φ_t, the δ-correction of true online TD(λ) is 0.

A feature i is visited at time t if φ_t(i) > 0. The following theorem shows that any difference in behaviour between the three versions of TD(λ) is due to how revisits of features are handled.

Theorem 2 When no features are revisited within the same episode, accumulate TD(λ), replace TD(λ) and true online TD(λ) behave the same (for any λ).

Proof Because at the start of an episode all trace values are 0, and because a feature is only visited once within an episode, if φ_t(i) ≠ 0 then e_{t−1}(i) = 0, and if e_{t−1}(i) ≠ 0 then φ_t(i) = 0. Hence, the accumulating-trace update and the generalized replacing-trace update have the same effect. It also means that e_{t−1}^⊤ φ_t is always zero. Hence, the dutch-trace update reduces to the accumulating-trace update. In addition, because the weight of a feature does not get updated until the feature is visited, if φ_t(i) ≠ 0 then θ_t(i) − θ_{t−1}(i) = 0, and if θ_t(i) − θ_{t−1}(i) ≠ 0 then φ_t(i) = 0. It follows that θ_t^⊤ φ_t − θ_{t−1}^⊤ φ_t is always 0, and hence the δ-correction as well.

Finally, our third theorem states that for small step-sizes the behaviour of true online TD(λ) approximates that of accumulate TD(λ):

Theorem 3 Let Δ_t^acc be the weight update at time t due to accumulate TD(λ) and Δ_t^true the weight update due to true online TD(λ). If γλ < 1 and the feature vectors and TD errors are bounded, then Δ_t^acc / Δ_t^true → 1 if α → 0.

Proof The update equations specify that:

    Δ_t^acc := α e_t^acc δ_t,
    Δ_t^true := α e_t^du δ_t + α [θ_t^⊤ φ_t − θ_{t−1}^⊤ φ_t][e_t^du − φ_t],

where e_t^acc is an accumulating trace, and e_t^du is a dutch trace. We will prove the theorem by showing that Δ_t^true can be written as:

    Δ_t^true = α [e_t^acc δ_t + c_t(α)]

with c_t(α) → 0 if α → 0. More specifically, Δ_t^true can be written as:

    Δ_t^true = α [ e_t^acc δ_t + (e_t^du − e_t^acc) δ_t + (θ_t − θ_{t−1})^⊤ φ_t (e_t^du − φ_t) ].

We will show that e_t^du − e_t^acc → 0 if α → 0, and that (θ_t − θ_{t−1})^⊤ φ_t (e_t^du − φ_t) → 0 if α → 0.

The non-incremental expression for e_t^acc is:

    e_0^acc = φ_0
    e_1^acc = γλ φ_0 + φ_1
    e_2^acc = (γλ)² φ_0 + γλ φ_1 + φ_2
    ...
    e_t^acc = Σ_{i=0}^{t} (γλ)^{t−i} φ_i

Let the value of feature i be bounded by C, that is, |φ_t(i)| < C for all i, t. Then, |e_t^acc(i)| < C/(1 − γλ) for all i, t. Because γλ < 1, this is some finite value.

The dutch-trace update can be re-written as:

    e_t^du = γλ (I − α φ_t φ_t^⊤) e_{t−1}^du + φ_t.

Using this, the non-incremental expression for e_t^du becomes:

    e_0^du = φ_0
    e_1^du = γλ (I − α φ_1 φ_1^⊤) φ_0 + φ_1
    e_2^du = (γλ)² (I − α φ_2 φ_2^⊤)(I − α φ_1 φ_1^⊤) φ_0 + γλ (I − α φ_2 φ_2^⊤) φ_1 + φ_2
    ...

Because the feature vectors are bounded, if α → 0, then (I − α φ_i φ_i^⊤) → I, and e_t^du → e_t^acc (because the trace values are bounded, this is true even if t → ∞). Finally, we need to show that (θ_t − θ_{t−1})^⊤ φ_t (e_t^du − φ_t) → 0 if α → 0. Because the feature vectors and trace values are bounded, it suffices to show that θ_t − θ_{t−1} = Δ_{t−1}^true → 0 if α → 0, which follows from the definition of Δ_t^true (given the condition that the TD error is bounded).

Based on these three theorems, we expect a large difference on domains for which the optimal α and optimal λ are relatively large, and where features are frequently revisited. Domains with a relatively large optimal α and optimal λ are typically domains with relatively low stochasticity. So as a rule of thumb, a large difference can be expected on domains with relatively low stochasticity and frequent revisits of features.
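The key step of this proof, that the dutch trace approaches the accumulating trace as α → 0, is easy to check numerically; the following self-contained snippet (our own illustration) runs both trace updates on a random, bounded feature sequence:

    import numpy as np

    rng = np.random.default_rng(0)
    gamma, lam = 0.99, 0.9
    phis = rng.normal(size=(50, 5))        # a bounded sequence of feature vectors

    for alpha in (0.1, 0.01, 0.001):
        e_acc = np.zeros(5)
        e_du = np.zeros(5)
        for phi in phis:
            e_acc = gamma * lam * e_acc + phi
            e_du = gamma * lam * e_du + phi - alpha * gamma * lam * e_du.dot(phi) * phi
        print(alpha, np.max(np.abs(e_du - e_acc)))   # the difference shrinks with alpha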

6. Derivation of True Online TD(λ)

The defining property of a true online method is that it maintains an exact equivalence with an online forward view at all times. This means that at every moment in time, the weight vector can be interpreted as the result of a sequence of updates with multi-step update targets. To achieve this step-by-step equivalence, the regular forward view has to be extended, because it only specifies what the weights at the end of an episode should be. In this section, we present the extended forward view, and we derive the true online TD(λ) update equations from it.

6.1 The Forward View of TD(λ)

In Section 2, the general update rule for linear function approximation was presented (Equation 1), which is based on the update rule for stochastic gradient descent. The update equations for TD(λ), however, are of a different form (Equations 3, 4 and 5). The forward view of TD(λ) relates the TD(λ) equations to Equation 1. Specifically, the forward view of TD(λ) specifies that TD(λ) approximates the λ-return algorithm. This algorithm performs a series of updates of the form of Equation 1 with the λ-return as update target:

    θ_{t+1} = θ_t + α [G_t^λ − θ_t^⊤ φ_t] φ_t,   for 0 ≤ t < T,

where T is the end of the episode, and G_t^λ is the λ-return at time t. The λ-return is a multi-step update target based on a weighted average of all future state values, with λ determining the weight distribution. Specifically, the λ-return at time t is defined as:

    G_t^λ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} G_t^{(n)}(θ_t),

with G_t^{(n)}(θ), the n-step return, defined as:

    G_t^{(n)}(θ) = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... + γ^{n−1} R_{t+n} + γ^n θ^⊤ φ_{t+n}.

For episodic tasks, G_t^{(n)}(θ) is equal to the full return, G_t, if t + n ≥ T, and the λ-return can be written as:

    G_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} G_t^{(n)}(θ_t) + λ^{T−t−1} G_t.    (10)

The forward view offers a particularly straightforward interpretation of the λ-parameter. For λ = 0, G_t^λ reduces to the TD(0) update target, while for λ = 1, G_t^λ reduces to the full return. In other words, for λ = 0 the update target has maximum bias and minimum variance, while for λ = 1, the update target is unbiased, but has maximum variance. For λ in between 0 and 1, the bias and variance are between these two extremes. So, λ enables control over the trade-off between bias and variance.
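As a sketch of definition (10) (our own rendering, with rewards indexed so that rewards[t+1] is R_{t+1}, and phis[T] the terminal, all-zero feature vector):

    import numpy as np

    def n_step_return(rewards, phis, theta, t, n, gamma):
        # G_t^(n)(theta): n discounted rewards plus the bootstrap value gamma^n * theta . phi_{t+n}
        G = sum(gamma ** (i - 1) * rewards[t + i] for i in range(1, n + 1))
        return G + gamma ** n * theta.dot(phis[t + n])

    def lambda_return(rewards, phis, theta, t, gamma, lam, T):
        # Episodic lambda-return, Eq. (10)
        G_full = sum(gamma ** (i - 1) * rewards[t + i] for i in range(1, T - t + 1))
        out = (1 - lam) * sum(lam ** (n - 1) * n_step_return(rewards, phis, theta, t, n, gamma)
                              for n in range(1, T - t))
        return out + lam ** (T - t - 1) * G_full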

While the λ-return algorithm has a very clear intuition, there is only an exact equivalence for the offline case. That is, the offline variant of TD(λ) computes the same value estimates as the offline variant of the λ-return algorithm. For the online case, there is only an approximate equivalence. Specifically, the weight vector at time T computed by accumulate TD(λ) closely approximates the weight vector at time T computed by the online λ-return algorithm for appropriately small values of the step-size parameter (Sutton and Barto, 1998).

That the forward view only applies to the weight vector at the end of an episode, even in the online case, is a limitation that is often overlooked. It is related to the fact that the λ-return for S_t is constructed from data stretching from time t + 1 all the way to time T, the time that the terminal state is reached. A consequence is that the λ-return algorithm can compute its weight vectors only in hindsight, at the end of an episode. This is illustrated by Figure 7, which maps each weight vector to the earliest time that it can be computed. Time in this case refers to the time of data-collection: time t is defined as the moment that sample φ_t is observed. By contrast, TD(λ) uses only data up to time t to compute the weight vector θ_t. Hence, TD(λ) can compute its weight vectors without delay (see Figure 8). To denote this important property, we use the term real-time. TD(λ) is a real-time algorithm, while the λ-return algorithm is not. A consequence is that even though both algorithms compute a sequence of T weight vectors, a meaningful comparison can only be made for θ_T, because only at time T does TD(λ) have access to the same data as the λ-return algorithm. This limits the usefulness of the λ-return algorithm as an intuitive way to view TD(λ). In the next section, we address this limitation.

Figure 7: The weight vectors θ_1, θ_2, θ_3, ..., θ_T of the λ-return algorithm mapped to the earliest time that they can be computed (all only at time T).

Figure 8: The weight vectors θ_1, θ_2, θ_3, ..., θ_T of TD(λ) mapped to the earliest time that they can be computed (at times 1, 2, 3, ..., T).

6.2 The Real-Time Forward View

The conventional forward view explains how the weight vector at the end of an episode, computed by TD(λ), can be interpreted as the result of a sequence of updates with a particular multi-step update target, the λ-return. We want to give a similar explanation for weight vectors during an episode. In other words, we want to construct a real-time forward view that explains the weight vectors, computed by TD(λ), at all time steps.

The dilemma that arises when trying to construct a real-time forward view is that the update targets should contain data from many time steps ahead, but the real-time aspect prohibits the use of data beyond the current time step. The solution to this dilemma is to have update targets that grow over time. In other words, rather than defining a fixed update target for each visited state, the update target depends on the time step up to which data is observed. We call such an update target an interim update target, and the time step up to which data is observed the data-horizon. We will use a superscript to indicate the data-horizon h of an update target: U_t^h. A simple example of an interim update target is an update target that consists of the discounted sum of rewards up to the data-horizon:

    U_t^h = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... + γ^{h−t−1} R_h.

A direct consequence of having update targets that depend on the data-horizon is that a real-time forward view specifies an update sequence for each data-horizon. Below, we show the update sequences based on an interim update target U_t^h for horizons 1, 2 and 3 (θ_0^h := θ_init, for all h).

    h = 1:  θ_1^1 = θ_0^1 + α [U_0^1(θ_0^1) − (θ_0^1)^⊤ φ_0] φ_0

    h = 2:  θ_1^2 = θ_0^2 + α [U_0^2(θ_0^2) − (θ_0^2)^⊤ φ_0] φ_0
            θ_2^2 = θ_1^2 + α [U_1^2(θ_1^2) − (θ_1^2)^⊤ φ_1] φ_1

    h = 3:  θ_1^3 = θ_0^3 + α [U_0^3(θ_0^3) − (θ_0^3)^⊤ φ_0] φ_0
            θ_2^3 = θ_1^3 + α [U_1^3(θ_1^3) − (θ_1^3)^⊤ φ_1] φ_1
            θ_3^3 = θ_2^3 + α [U_2^3(θ_2^3) − (θ_2^3)^⊤ φ_2] φ_2

More generally, the update sequence for horizon h is defined by:

    θ_{t+1}^h = θ_t^h + α [U_t^h(θ_t^h) − (θ_t^h)^⊤ φ_t] φ_t,   for 0 ≤ t < h.    (11)

Figure 9 maps each weight vector to the earliest time it can be computed. Ultimately, the weight-vector sequence of interest is not the sequence at a particular horizon. Rather, it is the sequence consisting of the final weight vector at each horizon: θ_1^1, θ_2^2, θ_3^3, ..., θ_T^T. Because θ_t^t can be computed at time t, we call the forward view a real-time forward view.

In principle, Equation (11) can be combined with any interim update target definition to form a real-time forward view. However, to get the real-time forward view that belongs to TD(λ), a horizon-dependent version of the λ-return is needed. A version of the λ-return that corresponds with horizon h should not use data beyond this horizon. In other words, the highest n-step return that should be involved is the (h − t)-step return. This can be achieved by replacing each n-step return with n > h − t with the (h − t)-step return.

Figure 9: The weight vectors θ_t^h of the new forward view mapped to the earliest time that they can be computed (at time t, all vectors with data-horizon h ≤ t are available).

We call this version of the λ-return the interim λ-return, and use the notation G_t^{λ|h} to indicate the interim λ-return with data-horizon h. G_t^{λ|h} can be written as follows:

    G_t^{λ|h} = (1 − λ) Σ_{n=1}^{h−t−1} λ^{n−1} G_t^{(n)} + (1 − λ) Σ_{n=h−t}^{∞} λ^{n−1} G_t^{(h−t)}
              = (1 − λ) Σ_{n=1}^{h−t−1} λ^{n−1} G_t^{(n)} + G_t^{(h−t)} [ (1 − λ) Σ_{n=h−t}^{∞} λ^{n−1} ]
              = (1 − λ) Σ_{n=1}^{h−t−1} λ^{n−1} G_t^{(n)} + G_t^{(h−t)} [ λ^{h−t−1} (1 − λ) Σ_{k=0}^{∞} λ^k ]
              = (1 − λ) Σ_{n=1}^{h−t−1} λ^{n−1} G_t^{(n)} + λ^{h−t−1} G_t^{(h−t)}    (12)

Equation 12 fully specifies the interim λ-return, except for one small detail: the weight vector that should be used for the value estimates in the n-step returns has not been specified yet. The regular λ-return uses G_t^{(n)}(θ_t) (see Equation 10). For the real-time forward view, however, all weight vectors have two indices, so simply using θ_t does not work in this case. So which double-indexed weight vector should be used? The two guiding principles in deciding which weight vector to use are that we want the forward view to be an approximation of accumulate TD(λ) and that an efficient implementation should be possible. One option is to use G_t^{(n)}(θ_t^h). While with this definition the update sequence at data-horizon T is exactly the same as the sequence of updates from the λ-return algorithm (basically, the λ-return implicitly uses a data-horizon of T), it prohibits efficient computation of θ_{h+1}^{h+1} from θ_h^h. For this reason, we use G_t^{(n)}(θ_{t+n−1}^{t+n−1}), which does allow for efficient computation, and forms a good approximation of accumulate TD(λ) as well (as we show below). Using this weight vector, the full definition of G_t^{λ|h} becomes:

    G_t^{λ|h} := (1 − λ) Σ_{n=1}^{h−t−1} λ^{n−1} G_t^{(n)}(θ_{t+n−1}^{t+n−1}) + λ^{h−t−1} G_t^{(h−t)}(θ_{h−1}^{h−1}).    (13)

We call this the interim λ-return. We call the algorithm that combines the interim λ-return with Equation 11 the interim λ-return algorithm.
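A direct, deliberately inefficient sketch of the interim λ-return algorithm, Equations (11) and (13), written by us: it recomputes each horizon's update sequence from scratch, which is exactly the redundancy that the derivation in the next subsection removes. True online TD(λ) computes the same sequence θ_1^1, ..., θ_T^T incrementally.

    import numpy as np

    def n_step_return(rewards, phis, theta, t, n, gamma):
        # G_t^(n)(theta); phis[T] is the terminal (all-zero) feature vector
        G = sum(gamma ** (i - 1) * rewards[t + i] for i in range(1, n + 1))
        return G + gamma ** n * theta.dot(phis[t + n])

    def interim_lambda_return(rewards, phis, thetas, t, h, gamma, lam):
        # Eq. (13): the n-step return uses the final weight vector of horizon t + n - 1
        out = (1 - lam) * sum(
            lam ** (n - 1) * n_step_return(rewards, phis, thetas[t + n - 1], t, n, gamma)
            for n in range(1, h - t))
        return out + lam ** (h - t - 1) * n_step_return(rewards, phis, thetas[h - 1], t, h - t, gamma)

    def real_time_forward_view(rewards, phis, theta_init, alpha, gamma, lam, T):
        # Eq. (11), run for every data-horizon h; returns [theta_0^0, theta_1^1, ..., theta_T^T]
        thetas = [theta_init]
        for h in range(1, T + 1):
            theta = theta_init.copy()
            for t in range(h):
                U = interim_lambda_return(rewards, phis, thetas, t, h, gamma, lam)
                theta = theta + alpha * (U - theta.dot(phis[t])) * phis[t]
            thetas.append(theta)
        return thetas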

6.3 Derivation

In this subsection, we derive the update equations of true online TD(λ) directly from the real-time forward view, defined by equations (11) and (13) (and θ_0^h := θ_init). The derivation is based on expressing θ_{h+1}^{h+1} in terms of θ_h^h. We start by writing θ_h^h directly in terms of the initial weight vector and the interim λ-returns. First, we rewrite (11), with the interim λ-return as update target, as:

    θ_{t+1}^h = (I − α φ_t φ_t^⊤) θ_t^h + α φ_t G_t^{λ|h},

with I the identity matrix. Now, consider θ_t^h for t = 1 and t = 2:

    θ_1^h = (I − α φ_0 φ_0^⊤) θ_init + α φ_0 G_0^{λ|h}
    θ_2^h = (I − α φ_1 φ_1^⊤) θ_1^h + α φ_1 G_1^{λ|h}
          = (I − α φ_1 φ_1^⊤)(I − α φ_0 φ_0^⊤) θ_init + α (I − α φ_1 φ_1^⊤) φ_0 G_0^{λ|h} + α φ_1 G_1^{λ|h}

For general t, we can write:

    θ_t^h = A_0^{t−1} θ_init + α Σ_{i=1}^{t} A_i^{t−1} φ_{i−1} G_{i−1}^{λ|h},

where A_i^j is defined as:

    A_i^j := (I − α φ_j φ_j^⊤)(I − α φ_{j−1} φ_{j−1}^⊤) ... (I − α φ_i φ_i^⊤),   for j ≥ i,

and A_{j+1}^j := I. We are now able to express θ_h^h as:

    θ_h^h = A_0^{h−1} θ_init + α Σ_{i=1}^{h} A_i^{h−1} φ_{i−1} G_{i−1}^{λ|h}.    (14)

Because for the derivation of true online TD(λ) we only need (14) and the definition of G_t^{λ|h}, we can drop the double indices for the weight vectors and use θ_h := θ_h^h.

We now derive a compact expression for the difference G_t^{λ|h+1} − G_t^{λ|h}:

    G_t^{λ|h+1} − G_t^{λ|h}
      = (1 − λ) Σ_{n=1}^{h−t} λ^{n−1} G_t^{(n)}(θ_{t+n−1}) + λ^{h−t} G_t^{(h+1−t)}(θ_h)
        − (1 − λ) Σ_{n=1}^{h−t−1} λ^{n−1} G_t^{(n)}(θ_{t+n−1}) − λ^{h−t−1} G_t^{(h−t)}(θ_{h−1})
      = (1 − λ) λ^{h−t−1} G_t^{(h−t)}(θ_{h−1}) + λ^{h−t} G_t^{(h+1−t)}(θ_h) − λ^{h−t−1} G_t^{(h−t)}(θ_{h−1})
      = λ^{h−t} G_t^{(h+1−t)}(θ_h) − λ^{h−t} G_t^{(h−t)}(θ_{h−1})
      = λ^{h−t} [ G_t^{(h+1−t)}(θ_h) − G_t^{(h−t)}(θ_{h−1}) ]
      = λ^{h−t} [ Σ_{i=1}^{h+1−t} γ^{i−1} R_{t+i} + γ^{h+1−t} θ_h^⊤ φ_{h+1} − Σ_{i=1}^{h−t} γ^{i−1} R_{t+i} − γ^{h−t} θ_{h−1}^⊤ φ_h ]
      = λ^{h−t} [ γ^{h−t} R_{h+1} + γ^{h+1−t} θ_h^⊤ φ_{h+1} − γ^{h−t} θ_{h−1}^⊤ φ_h ]
      = (λγ)^{h−t} [ R_{h+1} + γ θ_h^⊤ φ_{h+1} − θ_{h−1}^⊤ φ_h ]

Note that the difference G_t^{λ|h+1} − G_t^{λ|h} is naturally expressed using a term that looks like a TD error but with a modified time step. We call this the modified TD error, δ'_h:

    δ'_h := R_{h+1} + γ θ_h^⊤ φ_{h+1} − θ_{h−1}^⊤ φ_h.

Using this definition, the difference G_t^{λ|h+1} − G_t^{λ|h} can be compactly written as:

    G_t^{λ|h+1} − G_t^{λ|h} = (λγ)^{h−t} δ'_h.    (15)

Note that δ'_h relates to the regular TD error, δ_h, as follows:

    δ'_h = R_{h+1} + γ θ_h^⊤ φ_{h+1} − θ_{h−1}^⊤ φ_h
         = R_{h+1} + γ θ_h^⊤ φ_{h+1} − θ_h^⊤ φ_h + θ_h^⊤ φ_h − θ_{h−1}^⊤ φ_h
         = δ_h + θ_h^⊤ φ_h − θ_{h−1}^⊤ φ_h.    (16)

To get the update rule, we have to express θ_{h+1} in terms of θ_h. This is done below, using (14), (15) and (16):

    θ_{h+1} = A_0^h θ_init + α Σ_{i=1}^{h+1} A_i^h φ_{i−1} G_{i−1}^{λ|h+1}
      = A_0^h θ_init + α Σ_{i=1}^{h} A_i^h φ_{i−1} G_{i−1}^{λ|h+1} + α φ_h G_h^{λ|h+1}
      = A_0^h θ_init + α Σ_{i=1}^{h} A_i^h φ_{i−1} G_{i−1}^{λ|h} + α Σ_{i=1}^{h} A_i^h φ_{i−1} [G_{i−1}^{λ|h+1} − G_{i−1}^{λ|h}] + α φ_h G_h^{λ|h+1}
      = (I − α φ_h φ_h^⊤) [ A_0^{h−1} θ_init + α Σ_{i=1}^{h} A_i^{h−1} φ_{i−1} G_{i−1}^{λ|h} ] + α Σ_{i=1}^{h} A_i^h φ_{i−1} [G_{i−1}^{λ|h+1} − G_{i−1}^{λ|h}] + α φ_h G_h^{λ|h+1}
      = (I − α φ_h φ_h^⊤) θ_h + α Σ_{i=1}^{h} A_i^h φ_{i−1} [G_{i−1}^{λ|h+1} − G_{i−1}^{λ|h}] + α φ_h G_h^{λ|h+1}
      = (I − α φ_h φ_h^⊤) θ_h + α Σ_{i=1}^{h} A_i^h φ_{i−1} (γλ)^{h+1−i} δ'_h + α φ_h [R_{h+1} + γ θ_h^⊤ φ_{h+1}]
      = θ_h + α Σ_{i=1}^{h} A_i^h φ_{i−1} (γλ)^{h+1−i} δ'_h + α φ_h [R_{h+1} + γ θ_h^⊤ φ_{h+1} − θ_h^⊤ φ_h]
      = θ_h + α Σ_{i=1}^{h} A_i^h φ_{i−1} (γλ)^{h+1−i} δ'_h + α φ_h [R_{h+1} + γ θ_h^⊤ φ_{h+1} − θ_{h−1}^⊤ φ_h + θ_{h−1}^⊤ φ_h − θ_h^⊤ φ_h]
      = θ_h + α Σ_{i=1}^{h} A_i^h φ_{i−1} (γλ)^{h+1−i} δ'_h + α φ_h δ'_h + α φ_h [θ_{h−1}^⊤ φ_h − θ_h^⊤ φ_h]
      = θ_h + α Σ_{i=1}^{h+1} A_i^h φ_{i−1} (γλ)^{h+1−i} δ'_h + α φ_h [θ_{h−1}^⊤ φ_h − θ_h^⊤ φ_h]
      = θ_h + α e_h δ'_h + α φ_h [θ_{h−1}^⊤ φ_h − θ_h^⊤ φ_h],   with e_h := Σ_{i=1}^{h+1} A_i^h φ_{i−1} (γλ)^{h+1−i}
      = θ_h + α e_h [δ_h + θ_h^⊤ φ_h − θ_{h−1}^⊤ φ_h] + α φ_h [θ_{h−1}^⊤ φ_h − θ_h^⊤ φ_h]
      = θ_h + α e_h δ_h + α [θ_h^⊤ φ_h − θ_{h−1}^⊤ φ_h][e_h − φ_h]    (17)

We now have the update rule for θ_h, in addition to an explicit definition of e_h. Next, using this explicit definition, we derive an update rule to compute e_h from e_{h−1}:

    e_h = Σ_{i=1}^{h+1} A_i^h φ_{i−1} (γλ)^{h+1−i}
        = Σ_{i=1}^{h} A_i^h φ_{i−1} (γλ)^{h+1−i} + φ_h
        = (I − α φ_h φ_h^⊤) γλ Σ_{i=1}^{h} A_i^{h−1} φ_{i−1} (γλ)^{h−i} + φ_h
        = (I − α φ_h φ_h^⊤) γλ e_{h−1} + φ_h
        = γλ e_{h−1} + φ_h − αγλ (e_{h−1}^⊤ φ_h) φ_h    (18)

Equations (17) and (18), together with the definition of δ'_h, form the true online TD(λ) update equations.

7. Other True Online Methods

In the previous section, we showed that the true online TD(λ) equations can be derived directly from the real-time forward view equations. By using different real-time forward views, new true online methods can be derived. Sometimes, small changes in the real-time forward view, like using a time-dependent step-size, can result in surprising changes in the true online equations. In this section, we look at a number of such variations.

7.1 True Online TD(λ) with Time-Dependent Step-Size

When using a time-dependent step-size in the base equation of the forward view (Equation 11) and deriving the update equations following the procedure from Section 6.3, it turns out that a slightly different trace definition appears. We indicate this new trace using a + superscript: e^+. For a fixed step-size, this new trace definition is equal to e_t^+ = α e_t, for all t. Of course, using e_t^+ instead of e_t also changes the weight vector update slightly. Below, the full set of update equations is shown:

    δ_t = R_{t+1} + γ θ_t^⊤ φ_{t+1} − θ_t^⊤ φ_t
    e_t^+ = γλ e_{t−1}^+ + α_t φ_t − α_t γλ [(e_{t−1}^+)^⊤ φ_t] φ_t
    θ_{t+1} = θ_t + δ_t e_t^+ + [θ_t^⊤ φ_t − θ_{t−1}^⊤ φ_t][e_t^+ − α_t φ_t]

In addition, e_{−1}^+ := 0. We can simplify the weight update equation slightly, by using

    δ'_t = δ_t + θ_t^⊤ φ_t − θ_{t−1}^⊤ φ_t,


CHAPTER 10 VALIDATION OF TEST WITH ARTIFICAL NEURAL NETWORK 175 CHAPTER 10 VALIDATION OF TEST WITH ARTIFICAL NEURAL NETWORK 10.1 INTRODUCTION Amongs he research work performed, he bes resuls of experimenal work are validaed wih Arificial Neural Nework. From he

More information

Ensamble methods: Bagging and Boosting

Ensamble methods: Bagging and Boosting Lecure 21 Ensamble mehods: Bagging and Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Ensemble mehods Mixure of expers Muliple base models (classifiers, regressors), each covers a differen par

More information

Two Coupled Oscillators / Normal Modes

Two Coupled Oscillators / Normal Modes Lecure 3 Phys 3750 Two Coupled Oscillaors / Normal Modes Overview and Moivaion: Today we ake a small, bu significan, sep owards wave moion. We will no ye observe waves, bu his sep is imporan in is own

More information

Lecture 33: November 29

Lecture 33: November 29 36-705: Inermediae Saisics Fall 2017 Lecurer: Siva Balakrishnan Lecure 33: November 29 Today we will coninue discussing he boosrap, and hen ry o undersand why i works in a simple case. In he las lecure

More information

Lecture Notes 2. The Hilbert Space Approach to Time Series

Lecture Notes 2. The Hilbert Space Approach to Time Series Time Series Seven N. Durlauf Universiy of Wisconsin. Basic ideas Lecure Noes. The Hilber Space Approach o Time Series The Hilber space framework provides a very powerful language for discussing he relaionship

More information

Temporal Abstraction in Temporal-difference Networks

Temporal Abstraction in Temporal-difference Networks Temporal Absracion in Temporal-difference Neworks Richard S. Suon, Eddie J. Rafols, Anna Koop Deparmen of Compuing Science Universiy of Albera Edmonon, AB, Canada T6G 2E8 {suon,erafols,anna}@cs.ualbera.ca

More information

GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II Roland Siegwar Margaria Chli Paul Furgale Marco Huer Marin Rufli Davide Scaramuzza ETH Maser Course: 151-0854-00L Auonomous Mobile Robos Localizaion II ACT and SEE For all do, (predicion updae / ACT),

More information

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H.

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H. ACE 56 Fall 5 Lecure 8: The Simple Linear Regression Model: R, Reporing he Resuls and Predicion by Professor Sco H. Irwin Required Readings: Griffihs, Hill and Judge. "Explaining Variaion in he Dependen

More information

Ordinary dierential equations

Ordinary dierential equations Chaper 5 Ordinary dierenial equaions Conens 5.1 Iniial value problem........................... 31 5. Forward Euler's mehod......................... 3 5.3 Runge-Kua mehods.......................... 36

More information

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin ACE 56 Fall 005 Lecure 4: Simple Linear Regression Model: Specificaion and Esimaion by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Simple Regression: Economic and Saisical Model

More information

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1.

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1. Roboics I April 11, 017 Exercise 1 he kinemaics of a 3R spaial robo is specified by he Denavi-Harenberg parameers in ab 1 i α i d i a i θ i 1 π/ L 1 0 1 0 0 L 3 0 0 L 3 3 able 1: able of DH parameers of

More information

EXPLICIT TIME INTEGRATORS FOR NONLINEAR DYNAMICS DERIVED FROM THE MIDPOINT RULE

EXPLICIT TIME INTEGRATORS FOR NONLINEAR DYNAMICS DERIVED FROM THE MIDPOINT RULE Version April 30, 2004.Submied o CTU Repors. EXPLICIT TIME INTEGRATORS FOR NONLINEAR DYNAMICS DERIVED FROM THE MIDPOINT RULE Per Krysl Universiy of California, San Diego La Jolla, California 92093-0085,

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecure Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/i2ml3e CHAPTER 2: SUPERVISED LEARNING Learning a Class

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

Exponential Weighted Moving Average (EWMA) Chart Under The Assumption of Moderateness And Its 3 Control Limits

Exponential Weighted Moving Average (EWMA) Chart Under The Assumption of Moderateness And Its 3 Control Limits DOI: 0.545/mjis.07.5009 Exponenial Weighed Moving Average (EWMA) Char Under The Assumpion of Moderaeness And Is 3 Conrol Limis KALPESH S TAILOR Assisan Professor, Deparmen of Saisics, M. K. Bhavnagar Universiy,

More information

Robust estimation based on the first- and third-moment restrictions of the power transformation model

Robust estimation based on the first- and third-moment restrictions of the power transformation model h Inernaional Congress on Modelling and Simulaion, Adelaide, Ausralia, 6 December 3 www.mssanz.org.au/modsim3 Robus esimaion based on he firs- and hird-momen resricions of he power ransformaion Nawaa,

More information

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé Bias in Condiional and Uncondiional Fixed Effecs Logi Esimaion: a Correcion * Tom Coupé Economics Educaion and Research Consorium, Naional Universiy of Kyiv Mohyla Academy Address: Vul Voloska 10, 04070

More information

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon 3..3 INRODUCION O DYNAMIC OPIMIZAION: DISCREE IME PROBLEMS A. he Hamilonian and Firs-Order Condiions in a Finie ime Horizon Define a new funcion, he Hamilonian funcion, H. H he change in he oal value of

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

arxiv: v1 [cs.ai] 1 Jul 2015

arxiv: v1 [cs.ai] 1 Jul 2015 arxiv:507.00353v [cs.ai] Jul 205 Harm van Seijen harm.vanseijen@ualberta.ca A. Rupam Mahmood ashique@ualberta.ca Patrick M. Pilarski patrick.pilarski@ualberta.ca Richard S. Sutton sutton@cs.ualberta.ca

More information

Random Walk with Anti-Correlated Steps

Random Walk with Anti-Correlated Steps Random Walk wih Ani-Correlaed Seps John Noga Dirk Wagner 2 Absrac We conjecure he expeced value of random walks wih ani-correlaed seps o be exacly. We suppor his conjecure wih 2 plausibiliy argumens and

More information

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS NA568 Mobile Roboics: Mehods & Algorihms Today s Topic Quick review on (Linear) Kalman Filer Kalman Filering for Non-Linear Sysems Exended Kalman Filer (EKF)

More information

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles Diebold, Chaper 7 Francis X. Diebold, Elemens of Forecasing, 4h Ediion (Mason, Ohio: Cengage Learning, 006). Chaper 7. Characerizing Cycles Afer compleing his reading you should be able o: Define covariance

More information

Ensamble methods: Boosting

Ensamble methods: Boosting Lecure 21 Ensamble mehods: Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Schedule Final exam: April 18: 1:00-2:15pm, in-class Term projecs April 23 & April 25: a 1:00-2:30pm in CS seminar room

More information

Longest Common Prefixes

Longest Common Prefixes Longes Common Prefixes The sandard ordering for srings is he lexicographical order. I is induced by an order over he alphabe. We will use he same symbols (,

More information

This document was generated at 1:04 PM, 09/10/13 Copyright 2013 Richard T. Woodward. 4. End points and transversality conditions AGEC

This document was generated at 1:04 PM, 09/10/13 Copyright 2013 Richard T. Woodward. 4. End points and transversality conditions AGEC his documen was generaed a 1:4 PM, 9/1/13 Copyrigh 213 Richard. Woodward 4. End poins and ransversaliy condiions AGEC 637-213 F z d Recall from Lecure 3 ha a ypical opimal conrol problem is o maimize (,,

More information

) were both constant and we brought them from under the integral.

) were both constant and we brought them from under the integral. YIELD-PER-RECRUIT (coninued The yield-per-recrui model applies o a cohor, bu we saw in he Age Disribuions lecure ha he properies of a cohor do no apply in general o a collecion of cohors, which is wha

More information

Matlab and Python programming: how to get started

Matlab and Python programming: how to get started Malab and Pyhon programming: how o ge sared Equipping readers he skills o wrie programs o explore complex sysems and discover ineresing paerns from big daa is one of he main goals of his book. In his chaper,

More information

Notes for Lecture 17-18

Notes for Lecture 17-18 U.C. Berkeley CS278: Compuaional Complexiy Handou N7-8 Professor Luca Trevisan April 3-8, 2008 Noes for Lecure 7-8 In hese wo lecures we prove he firs half of he PCP Theorem, he Amplificaion Lemma, up

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1 SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision

More information

Lecture 20: Riccati Equations and Least Squares Feedback Control

Lecture 20: Riccati Equations and Least Squares Feedback Control 34-5 LINEAR SYSTEMS Lecure : Riccai Equaions and Leas Squares Feedback Conrol 5.6.4 Sae Feedback via Riccai Equaions A recursive approach in generaing he marix-valued funcion W ( ) equaion for i for he

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

Expert Advice for Amateurs

Expert Advice for Amateurs Exper Advice for Amaeurs Ernes K. Lai Online Appendix - Exisence of Equilibria The analysis in his secion is performed under more general payoff funcions. Wihou aking an explici form, he payoffs of he

More information

Introduction D P. r = constant discount rate, g = Gordon Model (1962): constant dividend growth rate.

Introduction D P. r = constant discount rate, g = Gordon Model (1962): constant dividend growth rate. Inroducion Gordon Model (1962): D P = r g r = consan discoun rae, g = consan dividend growh rae. If raional expecaions of fuure discoun raes and dividend growh vary over ime, so should he D/P raio. Since

More information

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3 and d = c b - b c c d = c b - b c c This process is coninued unil he nh row has been compleed. The complee array of coefficiens is riangular. Noe ha in developing he array an enire row may be divided or

More information

Overview. COMP14112: Artificial Intelligence Fundamentals. Lecture 0 Very Brief Overview. Structure of this course

Overview. COMP14112: Artificial Intelligence Fundamentals. Lecture 0 Very Brief Overview. Structure of this course OMP: Arificial Inelligence Fundamenals Lecure 0 Very Brief Overview Lecurer: Email: Xiao-Jun Zeng x.zeng@mancheser.ac.uk Overview This course will focus mainly on probabilisic mehods in AI We shall presen

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

Single and Double Pendulum Models

Single and Double Pendulum Models Single and Double Pendulum Models Mah 596 Projec Summary Spring 2016 Jarod Har 1 Overview Differen ypes of pendulums are used o model many phenomena in various disciplines. In paricular, single and double

More information

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17 EES 16A Designing Informaion Devices and Sysems I Spring 019 Lecure Noes Noe 17 17.1 apaciive ouchscreen In he las noe, we saw ha a capacior consiss of wo pieces on conducive maerial separaed by a nonconducive

More information

NCSS Statistical Software. , contains a periodic (cyclic) component. A natural model of the periodic component would be

NCSS Statistical Software. , contains a periodic (cyclic) component. A natural model of the periodic component would be NCSS Saisical Sofware Chaper 468 Specral Analysis Inroducion This program calculaes and displays he periodogram and specrum of a ime series. This is someimes nown as harmonic analysis or he frequency approach

More information

RC, RL and RLC circuits

RC, RL and RLC circuits Name Dae Time o Complee h m Parner Course/ Secion / Grade RC, RL and RLC circuis Inroducion In his experimen we will invesigae he behavior of circuis conaining combinaions of resisors, capaciors, and inducors.

More information

EKF SLAM vs. FastSLAM A Comparison

EKF SLAM vs. FastSLAM A Comparison vs. A Comparison Michael Calonder, Compuer Vision Lab Swiss Federal Insiue of Technology, Lausanne EPFL) michael.calonder@epfl.ch The wo algorihms are described wih a planar robo applicaion in mind. Generalizaion

More information

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 38, No. 2, May 2013, pp. 209 227 ISSN 0364-765X (prin) ISSN 1526-5471 (online) hp://dx.doi.org/10.1287/moor.1120.0562 2013 INFORMS On Boundedness of Q-Learning Ieraes

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

CSE/NB 528 Lecture 14: From Supervised to Reinforcement Learning (Chapter 9) R. Rao, 528: Lecture 14

CSE/NB 528 Lecture 14: From Supervised to Reinforcement Learning (Chapter 9) R. Rao, 528: Lecture 14 CSE/NB 58 Lecure 14: From Supervised o Reinforcemen Learning Chaper 9 1 Recall from las ime: Sigmoid Neworks Oupu v T g w u g wiui w Inpu nodes u = u 1 u u 3 T i Sigmoid oupu funcion: 1 g a 1 a e 1 ga

More information

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Kriging Models Predicing Arazine Concenraions in Surface Waer Draining Agriculural Waersheds Paul L. Mosquin, Jeremy Aldworh, Wenlin Chen Supplemenal Maerial Number

More information

BU Macro BU Macro Fall 2008, Lecture 4

BU Macro BU Macro Fall 2008, Lecture 4 Dynamic Programming BU Macro 2008 Lecure 4 1 Ouline 1. Cerainy opimizaion problem used o illusrae: a. Resricions on exogenous variables b. Value funcion c. Policy funcion d. The Bellman equaion and an

More information

= ( ) ) or a system of differential equations with continuous parametrization (T = R

= ( ) ) or a system of differential equations with continuous parametrization (T = R XIII. DIFFERENCE AND DIFFERENTIAL EQUATIONS Ofen funcions, or a sysem of funcion, are paramerized in erms of some variable, usually denoed as and inerpreed as ime. The variable is wrien as a funcion of

More information

Math Week 14 April 16-20: sections first order systems of linear differential equations; 7.4 mass-spring systems.

Math Week 14 April 16-20: sections first order systems of linear differential equations; 7.4 mass-spring systems. Mah 2250-004 Week 4 April 6-20 secions 7.-7.3 firs order sysems of linear differenial equaions; 7.4 mass-spring sysems. Mon Apr 6 7.-7.2 Sysems of differenial equaions (7.), and he vecor Calculus we need

More information

( ) a system of differential equations with continuous parametrization ( T = R + These look like, respectively:

( ) a system of differential equations with continuous parametrization ( T = R + These look like, respectively: XIII. DIFFERENCE AND DIFFERENTIAL EQUATIONS Ofen funcions, or a sysem of funcion, are paramerized in erms of some variable, usually denoed as and inerpreed as ime. The variable is wrien as a funcion of

More information

Learning to Take Concurrent Actions

Learning to Take Concurrent Actions Learning o Take Concurren Acions Khashayar Rohanimanesh Deparmen of Compuer Science Universiy of Massachuses Amhers, MA 0003 khash@cs.umass.edu Sridhar Mahadevan Deparmen of Compuer Science Universiy of

More information

5.1 - Logarithms and Their Properties

5.1 - Logarithms and Their Properties Chaper 5 Logarihmic Funcions 5.1 - Logarihms and Their Properies Suppose ha a populaion grows according o he formula P 10, where P is he colony size a ime, in hours. When will he populaion be 2500? We

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION DOI: 0.038/NCLIMATE893 Temporal resoluion and DICE * Supplemenal Informaion Alex L. Maren and Sephen C. Newbold Naional Cener for Environmenal Economics, US Environmenal Proecion

More information

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates Biol. 356 Lab 8. Moraliy, Recruimen, and Migraion Raes (modified from Cox, 00, General Ecology Lab Manual, McGraw Hill) Las week we esimaed populaion size hrough several mehods. One assumpion of all hese

More information

Multi-scale 2D acoustic full waveform inversion with high frequency impulsive source

Multi-scale 2D acoustic full waveform inversion with high frequency impulsive source Muli-scale D acousic full waveform inversion wih high frequency impulsive source Vladimir N Zubov*, Universiy of Calgary, Calgary AB vzubov@ucalgaryca and Michael P Lamoureux, Universiy of Calgary, Calgary

More information

Lecture 4 Kinetics of a particle Part 3: Impulse and Momentum

Lecture 4 Kinetics of a particle Part 3: Impulse and Momentum MEE Engineering Mechanics II Lecure 4 Lecure 4 Kineics of a paricle Par 3: Impulse and Momenum Linear impulse and momenum Saring from he equaion of moion for a paricle of mass m which is subjeced o an

More information

Types of Exponential Smoothing Methods. Simple Exponential Smoothing. Simple Exponential Smoothing

Types of Exponential Smoothing Methods. Simple Exponential Smoothing. Simple Exponential Smoothing M Business Forecasing Mehods Exponenial moohing Mehods ecurer : Dr Iris Yeung Room No : P79 Tel No : 788 8 Types of Exponenial moohing Mehods imple Exponenial moohing Double Exponenial moohing Brown s

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

Analyze patterns and relationships. 3. Generate two numerical patterns using AC

Analyze patterns and relationships. 3. Generate two numerical patterns using AC envision ah 2.0 5h Grade ah Curriculum Quarer 1 Quarer 2 Quarer 3 Quarer 4 andards: =ajor =upporing =Addiional Firs 30 Day 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 andards: Operaions and Algebraic Thinking

More information

23.5. Half-Range Series. Introduction. Prerequisites. Learning Outcomes

23.5. Half-Range Series. Introduction. Prerequisites. Learning Outcomes Half-Range Series 2.5 Inroducion In his Secion we address he following problem: Can we find a Fourier series expansion of a funcion defined over a finie inerval? Of course we recognise ha such a funcion

More information

Announcements: Warm-up Exercise:

Announcements: Warm-up Exercise: Fri Apr 13 7.1 Sysems of differenial equaions - o model muli-componen sysems via comparmenal analysis hp//en.wikipedia.org/wiki/muli-comparmen_model Announcemens Warm-up Exercise Here's a relaively simple

More information

Lab 10: RC, RL, and RLC Circuits

Lab 10: RC, RL, and RLC Circuits Lab 10: RC, RL, and RLC Circuis In his experimen, we will invesigae he behavior of circuis conaining combinaions of resisors, capaciors, and inducors. We will sudy he way volages and currens change in

More information

Numerical Dispersion

Numerical Dispersion eview of Linear Numerical Sabiliy Numerical Dispersion n he previous lecure, we considered he linear numerical sabiliy of boh advecion and diffusion erms when approimaed wih several spaial and emporal

More information

Energy Storage Benchmark Problems

Energy Storage Benchmark Problems Energy Sorage Benchmark Problems Daniel F. Salas 1,3, Warren B. Powell 2,3 1 Deparmen of Chemical & Biological Engineering 2 Deparmen of Operaions Research & Financial Engineering 3 Princeon Laboraory

More information

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes Some common engineering funcions 2.7 Inroducion This secion provides a caalogue of some common funcions ofen used in Science and Engineering. These include polynomials, raional funcions, he modulus funcion

More information