
True Online Temporal-Difference Learning

Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Marlos C. Machado, Richard S. Sutton
Reinforcement Learning and Artificial Intelligence Laboratory
Department of Computing Science, University of Alberta, T6G 2E8, Canada

Abstract

The temporal-difference methods TD(λ) and Sarsa(λ) form a core part of modern reinforcement learning. Their appeal comes from their good performance, low computational cost, and their simple interpretation, given by their forward view. Recently, new versions of these methods were introduced, called true online TD(λ) and true online Sarsa(λ), respectively (van Seijen and Sutton, 2014). Algorithmically, these true online methods only make two small changes to the update rules of the regular methods, and the extra computational cost is negligible in most cases. However, they follow the ideas underlying the forward view much more closely. In particular, they maintain an exact equivalence with the forward view at all times, whereas the traditional versions only approximate it for small step-sizes. We hypothesize that these true online methods not only have better theoretical properties, but also dominate the regular methods empirically. In this article, we put this hypothesis to the test by performing an extensive empirical comparison. Specifically, we compare the performance of true online TD(λ)/Sarsa(λ) with regular TD(λ)/Sarsa(λ) on random MRPs, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment. We use linear function approximation with tabular, binary, and non-binary features. Our results suggest that the true online methods indeed dominate the regular methods. Across all domains/representations the learning speed of the true online methods is often better, but never worse, than that of the regular methods. An additional advantage is that no choice between traces has to be made for the true online methods. We show that new true online temporal-difference methods can be derived by making changes to the real-time forward view and then rewriting the update equations.

1. Introduction

Temporal-difference (TD) learning is a core learning technique in modern reinforcement learning (Sutton, 1988; Kaelbling et al., 1996; Sutton and Barto, 1998; Szepesvári, 2010). One of the main challenges in reinforcement learning is to make predictions, in an initially unknown environment, about the (discounted) sum of future rewards, the return, based on currently observed feature values and a certain behaviour policy. With TD learning it is

possible to learn good estimates of the expected return quickly by bootstrapping from other expected-return estimates. TD(λ) (Sutton, 1988) is a popular TD algorithm that combines basic TD learning with eligibility traces to further speed learning. The popularity of TD(λ) can be explained by its simple implementation, its low computational complexity, and its conceptually straightforward interpretation, given by its forward view. The forward view of TD(λ) is that the estimate at each time step is moved toward an update target known as the λ-return, where the λ-parameter determines the trade-off between bias and variance of the update target. This trade-off has a large influence on the speed of learning and its optimal setting varies from domain to domain. The ability to improve this trade-off by adjusting the value of λ is what underlies the performance advantage of eligibility traces.

Although the forward view provides a clear intuition, TD(λ) closely approximates the forward view only for appropriately small step-sizes. Until recently, this was considered an unfortunate, but unavoidable part of the theory behind TD(λ). This changed with the introduction of true online TD(λ) (van Seijen and Sutton, 2014), which allows for full control over the bias-variance trade-off at any time. In particular, true online TD(1) can achieve fully unbiased updates. Moreover, true online TD(λ) only requires small modifications to the TD(λ) update equations, and the extra computational cost is negligible in most cases.

We hypothesize that true online TD(λ), and its control version true online Sarsa(λ), not only have better theoretical properties than their regular counterparts, but also dominate them empirically. We test this hypothesis by performing an extensive empirical comparison between true online TD(λ), TD(λ) with accumulating traces and TD(λ) with replacing traces, as well as true online Sarsa(λ) and Sarsa(λ) (with accumulating and replacing traces). The domains we use include random Markov reward processes, a real-world myoelectric prosthetic arm, and a domain from the Arcade Learning Environment (Bellemare et al., 2013). The representations we consider range from tabular values to linear function approximation with binary and non-binary features.

Besides the empirical study, we show how true online TD(λ) can be derived. The derivation is based on an extended version of the forward view. Whereas the updates of the traditional forward view can only be computed at the end of an episode, the updates of this extended forward view can be computed in real-time, making it applicable even to non-episodic tasks. By rewriting the updates of this real-time forward view, the true online TD(λ) updates can be derived. This derivation forms a blueprint for the derivation of other true online methods. By making variations to the real-time forward view and following the same derivation as for true online TD(λ), we derive several other true online methods.

This article is organized as follows. We start by presenting the required background on Markov decision processes and introducing TD(λ), true online TD(λ), and true online Sarsa(λ). We then present our empirical study. After this study, we analyze on what type of domains a large performance difference can be expected. This is followed by the introduction of the real-time forward view and the derivation of true online TD(λ). Finally, we present several other true online methods.

2. Markov Decision Processes

Here, we present the main learning framework. As a convention, we indicate random variables by capital letters (e.g., S_t, R_t), vectors by bold letters (e.g., θ, φ), functions by lowercase letters (e.g., v), and sets by calligraphic font (e.g., S, A).

Reinforcement learning (RL) problems are often formalized as Markov decision processes (MDPs), which can be described as 5-tuples of the form ⟨S, A, p, r, γ⟩, consisting of: S, the set of all states; A, the set of all actions; p(s' | s, a), the transition probability function, giving for each state s ∈ S and action a ∈ A the probability of a transition to state s' ∈ S at the next step; r(s, a, s'), the reward function, giving the expected reward for a transition from (s, a) to s'; and γ, the discount factor, specifying how future rewards are weighted with respect to the immediate reward. Some MDPs contain terminal states, which divide the sequence of state transitions into episodes. When a terminal state is reached the current episode ends and the state is reset to the initial state.

The return at time t is defined as the discounted sum of rewards observed after t:

    G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{i=1}^∞ γ^{i−1} R_{t+i},

where R_{t+1} is the reward received after taking action A_t in state S_t. For an episodic MDP, the return is defined as the discounted sum of rewards until the end of the episode:

    G_t = Σ_{i=1}^{T−t} γ^{i−1} R_{t+i},

where T is the time step at which the terminal state is reached. Actions are taken at discrete time steps t = 0, 1, 2, ... according to a policy π : S × A → [0, 1], which defines for each action the selection probability conditioned on the state. Each policy π has a corresponding state-value function v_π(s), which maps each state s ∈ S to the expected value of the return G_t from that state, when following policy π:

    v_π(s) = E{G_t | S_t = s, π}.

In addition, the action-value function q_π(s, a) gives the expected return for policy π, given that action a ∈ A is taken in state s ∈ S:

    q_π(s, a) = E{G_t | S_t = s, A_t = a, π}.

A core task in RL is that of estimating the state-value function v_π of some policy π from data. In general, the learner does not have access to state s directly, but can only observe a feature vector φ(s) ∈ R^n. We estimate the value function using linear function approximation, in which case the value of a state s is the inner product between a weight vector θ and its feature vector φ(s):

    v̂(s, θ) = θ^⊤ φ(s) = Σ_{i=1}^n θ_i φ_i(s).
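To make the linear scheme concrete, here is a minimal sketch (our own illustration, with made-up numbers) of the state-value estimate as an inner product; the action-value estimate introduced below works the same way:

    import numpy as np

    def v_hat(phi, theta):
        # Linear state-value estimate: v_hat(s, theta) = theta^T phi(s)
        return float(np.dot(theta, phi))

    # Hypothetical numbers, n = 3 features:
    theta = np.array([0.5, -0.2, 1.0])
    phi_s = np.array([1.0, 0.0, 2.0])
    print(v_hat(phi_s, theta))  # 0.5*1 + (-0.2)*0 + 1.0*2 = 2.5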

If s is a terminal state, then by definition φ(s) := 0, and hence v̂(s, θ) = 0. As a shorthand, we will indicate φ(S_t), the feature vector of the state visited at time step t, by φ_t.

Similarly, the action-value function q_π can be estimated using linear function approximation. In this case, the estimate is the inner product between a weight vector and an action-feature vector ψ(s, a):

    q̂(s, a, θ) = θ^⊤ ψ(s, a) = Σ_{i=1}^n θ_i ψ_i(s, a).

If s is a terminal state, then by definition ψ(s, a) := 0 for all actions a. As a convention, we will use ψ to indicate action-feature vectors and φ to indicate state-feature vectors. As a shorthand, we will indicate ψ(S_t, A_t) by ψ_t.

A general model-free update rule for linear function approximation is:

    θ_{t+1} = θ_t + α [U_t − θ_t^⊤ φ_t] φ_t,    (1)

where U_t, the update target, is some estimate of the expected return at time step t. There are many ways to construct an update target. For example, the TD(0) update target is:

    U_t = R_{t+1} + γ θ_t^⊤ φ_{t+1}.    (2)

Update (1) is referred to as an online update, meaning that the weight vector changes at every time step t. Alternatively, an update target can be used for offline updating. In this case, the weight vector stays constant during an episode, and instead all weight corrections are added at once at the end of the episode. Online updating not only has the advantage that it can be applied to non-episodic tasks, but it will generally produce better value-function estimates, even when only considering the estimates at the end of an episode (see Sutton & Barto, 1998, Sections 7.1-7.3). Hence, offline updating is primarily used as an analytical tool; it is rarely used in practice.

3. Algorithms

In this section, we present the algorithms that we will compare: TD(λ) with accumulating as well as replacing traces, and true online TD(λ). We also present the control version of true online TD(λ): true online Sarsa(λ). Finally, we discuss several other variations of TD(λ).

3.1 Conventional TD(λ)

The conventional TD(λ) algorithm is defined by the following update equations:

    δ_t = R_{t+1} + γ θ_t^⊤ φ_{t+1} − θ_t^⊤ φ_t    (3)
    e_t = γλ e_{t−1} + φ_t    (4)
    θ_{t+1} = θ_t + α δ_t e_t    (5)

for t ≥ 0, and with e_{−1} = 0. The scalar δ_t is called the TD error. The vector e_t is called the eligibility-trace vector, and the parameter λ ∈ [0, 1] is called the trace-decay parameter. The update of e_t shown above is referred to as the accumulating-trace update.

As a shorthand, we will refer to this version of TD(λ) as accumulate TD(λ). Algorithm 1 shows the corresponding pseudocode.

Algorithm 1: accumulate TD(λ)
    INPUT: α, λ, γ, θ_init
    θ ← θ_init
    Loop (over episodes):
        obtain initial φ
        e ← 0
        While terminal state has not been reached, do:
            obtain next feature vector φ' and reward R
            δ ← R + γ θ^⊤ φ' − θ^⊤ φ
            e ← γλ e + φ
            θ ← θ + α δ e
            φ ← φ'

Accumulate TD(λ) can be very sensitive with respect to the α and λ parameters. Especially, a large value of λ combined with a large value of α can easily cause divergence, even on simple tasks with bounded rewards. For this reason, a variant of TD(λ) is often used that is more robust with respect to these parameters. This variant, which assumes binary features, uses a different trace-update equation:

    e_t(i) = γλ e_{t−1}(i)   if φ_t(i) = 0
    e_t(i) = 1               if φ_t(i) = 1

for all features i. This is referred to as the replacing-trace update. In this article, we use a simple generalization of this update rule that allows us to apply it to domains with non-binary features as well:

    e_t(i) = γλ e_{t−1}(i)   if φ_t(i) = 0
    e_t(i) = φ_t(i)          if φ_t(i) ≠ 0    (6)

for all features i. Note that for binary features this generalized trace update reduces to the default replacing-trace update. We will refer to the version of TD(λ) that uses Equation 6 as replace TD(λ).
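For concreteness, the following NumPy sketch mirrors Algorithm 1 together with the generalized replacing-trace update (6). It is our own rendering, not the paper's code; in particular, the environment interface (env.reset() returning an initial feature vector, env.step() returning the next feature vector, the reward, and a termination flag) is a hypothetical stand-in.

    import numpy as np

    def td_lambda_episode(env, theta, alpha, lam, gamma, trace="accumulate"):
        # One episode of conventional TD(lambda) with linear function approximation.
        # At a terminal state, env.step() returns phi_next = 0.
        phi = env.reset()
        e = np.zeros_like(theta)                 # eligibility trace, e_{-1} = 0
        done = False
        while not done:
            phi_next, R, done = env.step()
            delta = R + gamma * theta.dot(phi_next) - theta.dot(phi)   # Eq. (3)
            if trace == "accumulate":
                e = gamma * lam * e + phi                              # Eq. (4)
            else:                                # generalized replacing trace, Eq. (6)
                e = np.where(phi == 0, gamma * lam * e, phi)
            theta = theta + alpha * delta * e                          # Eq. (5)
            phi = phi_next
        return theta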

3.2 True Online TD(λ)

The true online TD(λ) update equations are:

    δ_t = R_{t+1} + γ θ_t^⊤ φ_{t+1} − θ_t^⊤ φ_t    (7)
    e_t = γλ e_{t−1} + φ_t − αγλ [e_{t−1}^⊤ φ_t] φ_t    (8)
    θ_{t+1} = θ_t + α δ_t e_t + α [θ_t^⊤ φ_t − θ_{t−1}^⊤ φ_t][e_t − φ_t]    (9)

for t ≥ 0, and with e_{−1} = 0. Compared to accumulate TD(λ) (equations (3), (4) and (5)), both the trace update and the weight update have an additional term. We call a trace updated in this way a dutch trace; we call the term α[θ_t^⊤ φ_t − θ_{t−1}^⊤ φ_t][e_t − φ_t] the TD-error time-step correction, or simply the δ-correction. Algorithm 2 shows pseudocode that implements equations (7), (8) and (9).¹

Algorithm 2: true online TD(λ)
    INPUT: α, λ, γ, θ_init
    θ ← θ_init, v̂_old ← 0
    Loop (over episodes):
        obtain initial φ
        e ← 0
        While terminal state has not been reached, do:
            obtain next feature vector φ' and reward R
            v̂ ← θ^⊤ φ
            v̂' ← θ^⊤ φ'
            δ ← R + γ v̂' − v̂
            e ← γλ e + φ − αγλ (e^⊤ φ) φ
            θ ← θ + α(δ + v̂ − v̂_old) e − α(v̂ − v̂_old) φ
            v̂_old ← v̂'
            φ ← φ'

1. We provide pseudocode for true online TD(λ) with a time-dependent step-size in Section 7.1. For reasons explained in that section, this requires a modified trace update. In addition, for reference purposes, we provide pseudocode for the special case of tabular features in a later section.
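In the same hedged style (and with the same hypothetical environment interface as above), Algorithm 2 differs from the sketch of Algorithm 1 only in the dutch-trace update and the δ-correction:

    import numpy as np

    def true_online_td_lambda_episode(env, theta, alpha, lam, gamma):
        # One episode of true online TD(lambda), following Algorithm 2.
        phi = env.reset()
        e = np.zeros_like(theta)
        v_old = 0.0
        done = False
        while not done:
            phi_next, R, done = env.step()
            v = theta.dot(phi)
            v_next = theta.dot(phi_next)
            delta = R + gamma * v_next - v
            # dutch trace, Eq. (8)
            e = gamma * lam * e + phi - alpha * gamma * lam * e.dot(phi) * phi
            # weight update with the delta-correction, Eq. (9)
            theta = theta + alpha * (delta + v - v_old) * e - alpha * (v - v_old) * phi
            v_old = v_next
            phi = phi_next
        return theta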

3.3 Computational Comparison

Using the pseudocode and update equations, we can compare the computational cost of the three versions of TD(λ). Let n be the total number of features and m the number of features with a non-zero value. Then, the number of basic operations (addition and multiplication) per time step for accumulate TD(λ) is 3n + 5m. For replace TD(λ) this number is 3n + 4m (the replacing-trace update takes (n − m) + m operations, instead of n + m for an accumulating trace). True online TD(λ) takes 3n + 11m operations in total (computing and subtracting the vector αγλ(e^⊤φ)φ requires 4m operations; adding the δ-correction requires 2m operations). Hence, if sparse feature vectors are used (that is, if m << n), the computational overhead of true online TD(λ) is minimal compared to accumulate/replace TD(λ). If non-sparse feature vectors are used (that is, if m = n), accumulate TD(λ), replace TD(λ) and true online TD(λ) require 8n, 7n and 14n operations, respectively. So in this case, true online TD(λ) is roughly twice as expensive as conventional TD(λ).

3.4 True Online Sarsa(λ)

TD(λ) and true online TD(λ) are policy evaluation methods. However, they can be turned into control methods in a straightforward way. From a learning perspective, the main difference is that an estimate of the action-value function q_π should be learned, rather than of the state-value function v_π. In other words, action feature-vectors instead of state feature-vectors have to be used. Another difference is that instead of having a fixed policy that generates the behaviour, the policy depends on the action-value estimates. Because these estimates typically improve over time, so does the policy.

The (on-policy) control counterpart of TD(λ) is the popular Sarsa(λ) algorithm. The control counterpart of true online TD(λ) is true online Sarsa(λ). Algorithm 3 shows pseudocode for true online Sarsa(λ).

Algorithm 3: true online Sarsa(λ)
    INPUT: α, λ, γ, θ_init
    θ ← θ_init, q̂_old ← 0
    Loop (over episodes):
        obtain initial state S
        select action A based on state S (for example, ε-greedy)
        ψ ← features corresponding to S, A
        e ← 0
        While terminal state has not been reached, do:
            take action A, observe next state S' and reward R
            select action A' based on state S'
            ψ' ← features corresponding to S', A' (if S' is a terminal state, ψ' ← 0)
            q̂ ← θ^⊤ ψ
            q̂' ← θ^⊤ ψ'
            δ ← R + γ q̂' − q̂
            e ← γλ e + ψ − αγλ [e^⊤ ψ] ψ
            θ ← θ + α(δ + q̂ − q̂_old) e − α(q̂ − q̂_old) ψ
            q̂_old ← q̂'
            ψ ← ψ' ; A ← A'

To ensure accurate estimates for all state-action values are obtained, some exploration strategy has to be used. A simple, but often sufficient strategy is to use an ε-greedy behaviour policy. That is, given current state S, with probability ε a random action is selected, and with probability 1 − ε the greedy action is selected:

    A_greedy = argmax_{a ∈ A} θ^⊤ ψ(S, a).

A common way to derive an action-feature vector ψ(s, a) from a state-feature vector φ(s) involves an action-feature vector of size n|A|, where n is the number of state features and |A| is the number of actions. Each action corresponds with a block of n features in this action-feature vector. The features in ψ(s, a) that correspond to action a take on the values of the state features; the features corresponding to other actions have a value of 0.
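As a small sketch of the block construction just described (function names are ours), together with ε-greedy action selection over the resulting action values:

    import numpy as np

    def action_features(phi, a, num_actions):
        # psi(s, a): |A| blocks of n features; only the block of action `a` is non-zero
        n = len(phi)
        psi = np.zeros(n * num_actions)
        psi[a * n:(a + 1) * n] = phi
        return psi

    def epsilon_greedy(phi, theta, num_actions, epsilon, rng):
        # With probability epsilon pick a random action, otherwise the greedy one
        if rng.random() < epsilon:
            return int(rng.integers(num_actions))
        q = [theta.dot(action_features(phi, a, num_actions)) for a in range(num_actions)]
        return int(np.argmax(q))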

3.5 Other Variations on TD(λ)

Several variations on TD(λ) other than those treated in this paper have been suggested in the literature. Schapire and Warmuth (1996) introduced a variation of TD(λ) for which upper and lower bounds on performance can be derived and proven. Maei, Szepesvári, Sutton, and others (Maei, 2011; Sutton et al., 2009a,b, 2014) have explored generalizations of TD(λ)-like algorithms to off-policy learning, in which the behavior policy (generating the data) and the evaluation policy (whose value function is being learned) are allowed to be different.

4. Empirical Study

This section contains our main empirical study, comparing TD(λ), as well as Sarsa(λ), with their true online counterparts. For each method and each domain, a scan over the step-size α and the trace-decay parameter λ is performed such that the optimal performance can be determined. In Section 4.4, we discuss the results.

4.1 Random MRPs

For our first series of experiments we used randomly constructed Markov Reward Processes (MRPs).² An MRP can be interpreted as an MDP with only a single action per state (consequently, there is only one policy possible). We represent a random MRP as a 3-tuple (k, b, σ), consisting of k, the number of states; b, the branching factor (that is, the number of possible next states per transition); and σ, the standard deviation of the reward. The next states for a particular state are drawn from the total set of states at random, and without replacement. The transition probabilities to those states are randomized as well (by partitioning the unit interval at b − 1 random cut points). The expected value of the reward for a transition is drawn from a normal distribution with zero mean and unit variance. The actual reward is drawn from a normal distribution with mean equal to this expected reward and standard deviation σ. Our random MRPs do not contain terminal states.³

We compared the performance of TD(λ) on three different MRPs: one with a small number of states, (10, 3, 0.1), one with a large number of states, (100, 10, 0.1), and one with a large number of states but a low branching factor and no stochasticity in reward generation, (100, 3, 0). γ = 0.99 for all three MRPs.

Each MRP is evaluated using three different representations. The first representation consists of tabular features, that is, each state is represented with a unique standard-basis vector of k dimensions. The second representation is based on binary features. The binary representation is constructed by first assigning indices, from 1 to k, to all states. Then, the binary encoding of the index of a state is used as a feature vector to represent that state. The length of a feature vector is determined by the total number of states: for k = 10, the length is 4; for k = 100, the length is 7. As an example, for k = 10 the feature vectors of states 1, 2 and 3 are (0, 0, 0, 1), (0, 0, 1, 0) and (0, 0, 1, 1), respectively. Finally, the third representation uses non-binary, normalized features. For this representation each state is mapped to a 5-dimensional feature vector, with the value of each feature drawn from a normal distribution with zero mean and unit variance. After all the feature values for a state are drawn, they are normalized such that the feature vector has unit length. Once generated, the feature vectors are kept fixed for each state. We refer to this last representation as the normal representation.

2. The process we used to construct these MRPs is based on the process used by Bhatnagar, Sutton, Ghavamzadeh and Lee (2009).
3. The code for the MRP experiments is published online at: https://github.com/armahmood/totd-rndmdp-experiments
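The following sketch, assembled by us from the description above, constructs such an MRP and the binary representation; the paper's exact sampling details may differ.

    import numpy as np

    def random_mrp(k, b, sigma, rng):
        # Random MRP (k states, branching factor b, reward std sigma), no terminal states
        P = np.zeros((k, k))                        # transition probabilities
        r_mean = rng.normal(0.0, 1.0, size=(k, k))  # expected rewards ~ N(0, 1)
        for s in range(k):
            next_states = rng.choice(k, size=b, replace=False)
            cuts = np.sort(rng.random(b - 1))       # b - 1 random cut points
            P[s, next_states] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
        return P, r_mean

    def mrp_step(s, P, r_mean, sigma, rng):
        # Sample a transition; the actual reward ~ N(expected reward, sigma)
        s_next = rng.choice(len(P), p=P[s])
        return s_next, rng.normal(r_mean[s, s_next], sigma)

    def binary_features(state_index, length):
        # Binary encoding of the 1-based state index, e.g. state 3, k = 10 -> (0, 0, 1, 1)
        return np.array([int(b) for b in np.binary_repr(state_index, width=length)], float)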

Figure 1: MSE during early learning for three different MRPs, indicated by (k, b, σ), and three different representations (tabular, binary and normal features), comparing accumulate TD(λ), replace TD(λ) and true online TD(λ). The error shown is at the optimal α value, normalized by the MSE under the initial weight estimate.

In each experiment, we performed a scan over α and λ. Specifically, between 0 and 0.1, α is varied according to 10^i with i varying from −3 to −1 with steps of 0.2, and from 0.1 to 2.0 (linearly) with steps of 0.1. In addition, λ is varied from 0 to 0.9 with steps of 0.1, and from 0.9 to 1.0 with finer steps. The initial weight vector is the zero vector in all domains. As performance metric we used the mean-squared error (MSE) with respect to the LMS solution during early learning (for k = 10, we averaged over the first 100 time steps; for k = 100, we averaged over the first 1000 time steps). We normalized this error by dividing it by the MSE under the initial weight estimate.

Figure 1 shows the results for different λ at the best value of α. In Appendix A, the results for all α values are shown. A number of observations can be made. First of all, the straightforward generalization of the replacing-trace update rule, Equation (6), is not effective. For all three domains, when replacing traces are combined with normal features, all λ values result in the same performance. The reason is that normal features practically never become zero, and hence e_t = φ_t almost all the time. A second observation is that the optimal performance of true online TD(λ) is, on all domains and for all representations, at least as good as the optimal performance of accumulate TD(λ) or replace TD(λ). A more in-depth discussion of these results is provided in Section 4.4.
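For concreteness, the evaluation metric can be computed along the following lines. This is our own sketch: it obtains the true values from the MRP model, fits the LMS solution under a state distribution d, and measures the weighted mean-squared distance of a weight vector from that solution; the paper's exact weighting of states is an assumption here.

    import numpy as np

    def true_values(P, r_mean, gamma):
        # v = (I - gamma P)^(-1) r_bar, with r_bar(s) the expected immediate reward from s
        r_bar = (P * r_mean).sum(axis=1)
        return np.linalg.solve(np.eye(len(P)) - gamma * P, r_bar)

    def lms_solution(Phi, v, d):
        # Least-squares fit of v under state weights d (sqrt-weighting trick)
        w = np.sqrt(d)
        return np.linalg.lstsq(Phi * w[:, None], v * w, rcond=None)[0]

    def mse_vs_lms(theta, theta_lms, Phi, d):
        # Weighted mean-squared distance between the estimate and the LMS solution
        diff = Phi @ (theta - theta_lms)
        return float(np.sum(d * diff ** 2))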

Figure 2: Source of the input data stream and predicted signals used in this experiment: a participant with an amputation performing a simple grasping task using a myoelectrically controlled robot arm, as described in Pilarski et al. (2013). More detail on the subject and experimental setting can be found in Hebert et al. (2014).

4.2 Predicting Signals from a Myoelectric Prosthetic Arm

In this experiment, we compared the performance of true online TD(λ) and TD(λ) on a real-world data set consisting of sensorimotor signals measured during the human control of an electromechanical robot arm. The source of the data is a series of manipulation tasks performed by a participant with an amputation, as presented by Pilarski et al. (2013). In this study, an amputee participant used signals recorded from the muscles of their residual limb to control a robot arm with multiple degrees-of-freedom (Figure 2). Interactions of this kind are known as myoelectric control (c.f., Parker et al., 2006).

For consistency and comparison of results, we used the same source data and prediction learning architecture as published in Pilarski et al. (2013). In total, two signals are predicted: grip force and motor angle signals from the robot's hand. Specifically, the target for the prediction is a discounted sum of each signal over time, similar to return predictions (c.f., general value functions and nexting; Sutton et al., 2011; Modayil et al., 2014). Where possible, we used the same implementation and code base as Pilarski et al. (2013).

Data for this experiment consisted of 58,000 time steps of recorded sensorimotor information, sampled at 40 Hz (i.e., approximately 25 minutes of experimental data). The state space consisted of a tile-coded representation of the robot gripper's position, velocity, recorded gripping force, and two muscle contraction signals from the human user. A standard implementation of tile-coding was used, with ten bins per signal, eight overlapping tilings, and a single active bias unit. This results in a state space with 800,001 features, 9 of which were active at any given time. Hashing was used to reduce this space down to a vector of 200,000 features that are then presented to the learning system.
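To make the prediction target concrete: for a recorded signal, the ideal return-like target at time t is the discounted sum of the signal's future values. A small sketch of computing such targets offline (our own illustration; the online algorithms above learn to approximate these targets):

    import numpy as np

    def discounted_sum_targets(signal, gamma):
        # Ideal targets G_t = sum_{i>=1} gamma^(i-1) * signal[t+i], computed
        # backwards; the sum is truncated at the end of the recording.
        G = np.zeros(len(signal))
        acc = 0.0
        for t in range(len(signal) - 2, -1, -1):
            acc = signal[t + 1] + gamma * acc
            G[t] = acc
        return G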

All signals were normalized between 0 and 1 before being provided to the function approximation routine. The discount factor for predictions of both force and angle was γ = 0.97, as in the results presented by Pilarski et al. (2013). Parameter sweeps over λ and α are conducted for all three methods. The performance metric is the mean absolute return error over all 58,000 time steps of learning, normalized by dividing by the error for λ = 0.

Figure 3 shows the performance for the angle as well as the force predictions at the best α value for different values of λ. In Appendix B, the results for all α values are shown. The relative performance of replace TD(λ) and accumulate TD(λ) depends on the predictive question being asked. For predicting the robot's grip force signal (a signal with small magnitude and rapid changes), replace TD(λ) is better than accumulate TD(λ) at all non-zero λ values. However, for predicting the robot's hand actuator position, a smoothly changing signal that varies across its normalized range, accumulate TD(λ) dominates replace TD(λ) over all non-zero λ values. True online TD(λ) dominates both methods for all non-zero λ values on both prediction tasks (force and angle).

Figure 3: Performance as a function of λ at the optimal α value, for the prediction of the servo motor angle (left), as well as the grip force (right), comparing accumulate TD(λ), replace TD(λ) and true online TD(λ).

4.3 Control in the ALE Domain Asterix

In this experiment, we compared the performance of true online Sarsa(λ) with that of accumulate Sarsa(λ) and replace Sarsa(λ), on a domain from the Arcade Learning Environment (ALE) (Bellemare et al., 2013; Defazio and Graepel, 2014; Mnih et al., 2015), called Asterix.⁴ The ALE is a general testbed that provides an interface to hundreds of Atari 2600 games in which one has access, at each frame, to the game screen, the current RAM state and to a reward signal obtained from the transition between game frames. At each frame the agent provides one of the 18 possible actions in the game (equivalent to the 18 different actions allowed on the joystick) with the goal of maximizing the (discounted) sum of rewards.

4. The code for the ALE experiments is published online at: https://github.com/mcmachado/trueonlinesarsa

In the Asterix domain (see Figure 4 for a screenshot), the agent controls a yellow avatar, which has to collect potion objects, while avoiding harp objects. Both potions and harps move across the screen horizontally. Every time the agent collects a potion it receives a reward of 50 points, and every time it touches a harp it loses a life (it has three lives in total). The game ends after the agent has lost three lives, or after 5 minutes, whichever comes first.⁵

Figure 4: Screenshot of the game Asterix.

The agent can use the actions up, right, down, and left to move across the screen, a no-op action, as well as combinations of two directions, resulting in a diagonal move (e.g., up-right). This results in 9 actions in total. The state-space representation is based on linear function approximation. We use what Bellemare et al. (2013) called the Basic feature set, which encodes the presence of colours on the Atari 2600 screen. It is obtained by first subtracting the game screen background (see Bellemare et al., 2013) and then dividing the remaining screen into tiles. Finally, for each tile, one binary feature is generated for each of the 128 available colours, encoding whether a colour is active or not in that tile. This generates 28,672 features (besides a bias term that is always active).

Because episode lengths can vary hugely (basically, from about 10 seconds all the way up to 5 minutes), constructing a fair performance metric is non-trivial. For example, comparing the average return on the first N episodes of two methods is only fair if they have seen roughly the same amount of samples in those episodes, which is not guaranteed for this domain. On the other hand, looking at the total reward collected for the first X samples is also not a good metric, because there is no negative reward associated with dying. To resolve this, we look at the return per episode, averaged over the first n(X) episodes, where n(X) is the number of episodes observed in the first X samples. More specifically, our metric consists of the average score per episode while learning for 20 hours (4,320,000 frames). In addition, we averaged the resulting number over 400 independent runs.

As with the evaluation experiments, we performed a scan over the step-size α and the trace-decay parameter λ. Specifically, we looked at all combinations of α ∈ {0.20, 0.50, 0.80, 1.10, 1.40, 1.70, 2.00} and λ ∈ {0.00, 0.50, 0.80, 0.90, 0.95, 0.99} (these values were determined during a preliminary parameter sweep).

5. We added the 5 minute time limit ourselves as in previous work (Bellemare et al., 2013); the original game has no time limit.

We used a fixed discount factor γ and ε-greedy exploration with a fixed ε. The weight vector was initialized to the zero vector. Also, as Bellemare et al. (2013), we take an action every 5 frames; this decreases the algorithms' running time and it also tries to avoid super-human reflexes in our agents. The results are shown in Figure 5. On this domain, the optimal performance of all three versions of Sarsa(λ) is similar.

Figure 5: Return per episode, averaged over the first 4,320,000 frames as well as 400 independent runs, as a function of λ, at optimal α, on the Asterix domain, for accumulate Sarsa(λ), replace Sarsa(λ) and true online Sarsa(λ).

4.4 Discussion

Figure 6 summarizes the performance of the different TD(λ) versions on all evaluation domains. Specifically, it shows the error for each method at its best settings of α and λ. The error is normalized by dividing it by the error at λ = 0 (remember that all versions of TD(λ) behave the same for λ = 0). Because λ = 0 lies in the parameter range that is being optimized over, the normalized error can never be higher than 1. If for a method/domain the normalized error is equal to 1, this means that setting λ higher than 0 either has no effect, or that the error gets worse. In either case, eligibility traces are not effective for that method/domain.

Overall, true online TD(λ) is clearly better than accumulate TD(λ) and replace TD(λ) in terms of optimal performance. Specifically, on each considered domain, the error for true online TD(λ) is either smaller than or equal to the error of accumulate/replace TD(λ). This is especially impressive, given the wide variety of domains, and the fact that the computational overhead for true online TD(λ) is small (see Section 3.3 for details). Comparing accumulate TD(λ) with replace TD(λ), it can be seen that, when considering tabular or binary features, on some domains accumulate TD(λ) performs best, while on others replace TD(λ) performs best. When normal features are used, our naive generalization of replace TD(λ) is not effective (standard replace TD(λ) is not defined for normal features).

Figure 6: Summary of the evaluation results: error at optimal (α, λ)-settings for accumulate TD(λ), replace TD(λ) and true online TD(λ) on all domains ((10, 3, 0.1), (100, 10, 0.1) and (100, 3, 0) with tabular, binary and normal features, plus the prosthetic angle and force predictions), normalized with the TD(0) error.

On the Asterix domain, the performance of the three Sarsa(λ) versions is similar. This is in accordance with the evaluation results, which showed that the size of the performance difference is domain dependent. In the worst case, the performance of the true online method is similar to that of the regular method.

The optimal performance is not the only factor that determines how good a method is; what also matters is how easy it is to find this performance. The detailed plots in Appendix A and B reveal that the parameter sensitivity of accumulate TD(λ) is much higher than that of true online TD(λ) and replace TD(λ). This is clearly visible in the first MRP task (Figure 10), as well as the experiments with the myoelectric prosthetic arm (Figure 13).

There is one more thing to take away from the experiments. In the first MRP, (10, 3, 0.1), with normal features, accumulate TD(λ), as well as replace TD(λ), are ineffective (see Figure 6: the normalized performance of accumulate/replace TD(λ) is 1, meaning that the performance at optimized λ is equal to the performance of TD(0)). However, true online TD(λ) was able to obtain a considerable performance advantage with respect to TD(0). This demonstrates that true online TD(λ) expands the set of domains/representations where eligibility traces are effective. This could potentially have far-reaching consequences. Specifically, using non-binary features becomes a lot more interesting. Replacing traces are not feasible or not effective for such representations, while using accumulating traces can easily result in divergence of values. However, for true online TD(λ), non-binary features are not necessarily more challenging than binary features. Exploring new, non-binary representations could potentially further improve the performance of true online TD(λ) on domains such as the myoelectric prosthetic arm or the Asterix domain.

5. Analytical Comparison

The empirical study suggests that true online TD(λ) performs at least as well as accumulate TD(λ) and replace TD(λ). In this section, we try to answer the question on what kind of domains a large difference in performance can be expected, and similarly, when no difference is expected. The following three theorems provide some insights into this.

Theorem 1 For λ = 0, accumulate TD(λ), replace TD(λ) and true online TD(λ) behave the same.

Proof For λ = 0, the accumulating-trace update, the (generalized) replacing-trace update and the dutch-trace update all reduce to e_t = φ_t. In addition, because e_t = φ_t, the δ-correction of true online TD(λ) is 0.

A feature i is visited at time t if φ_t(i) > 0. The following theorem shows that any difference in behaviour between the three versions of TD(λ) is due to how revisits of features are handled.

Theorem 2 When no features are revisited within the same episode, accumulate TD(λ), replace TD(λ) and true online TD(λ) behave the same (for any λ).

Proof Because at the start of an episode all trace values are 0, and because a feature is only visited once within an episode, if φ_t(i) ≠ 0 then e_{t−1}(i) = 0, and if e_{t−1}(i) ≠ 0 then φ_t(i) = 0. Hence, the accumulating-trace update and the generalized replacing-trace update have the same effect. It also means that e_{t−1}^⊤ φ_t is always zero. Hence, the dutch-trace update reduces to the accumulating-trace update. In addition, because the weight of a feature does not get updated until the feature is visited, if φ_t(i) ≠ 0 then θ_t(i) − θ_{t−1}(i) = 0, and if θ_t(i) − θ_{t−1}(i) ≠ 0 then φ_t(i) = 0. It follows that θ_t^⊤ φ_t − θ_{t−1}^⊤ φ_t is always 0, and hence the δ-correction as well.

Finally, our third theorem states that for small step-sizes the behaviour of true online TD(λ) approximates that of accumulate TD(λ):

Theorem 3 Let Δ_t^acc be the weight update at time t due to accumulate TD(λ) and Δ_t^true the weight update due to true online TD(λ). If γλ < 1 and the feature vectors and TD errors are bounded, then Δ_t^acc / Δ_t^true → 1 if α → 0.

Proof The update equations specify that:

    Δ_t^acc := α e_t^acc δ_t,
    Δ_t^true := α e_t^du δ_t + α [θ_t^⊤ φ_t − θ_{t−1}^⊤ φ_t][e_t^du − φ_t],

where e_t^acc is an accumulating trace, and e_t^du is a dutch trace. We will prove the theorem by showing that Δ_t^true can be written as:

    Δ_t^true = α [e_t^acc δ_t + c_t(α)]

with c_t(α) → 0 if α → 0. More specifically, Δ_t^true can be written as:

    Δ_t^true = α [ e_t^acc δ_t + (e_t^du − e_t^acc) δ_t + (θ_t − θ_{t−1})^⊤ φ_t (e_t^du − φ_t) ].

We will show that e_t^du − e_t^acc → 0 if α → 0, and that (θ_t − θ_{t−1})^⊤ φ_t (e_t^du − φ_t) → 0 if α → 0.

The non-incremental expression for e_t^acc is:

    e_0^acc = φ_0
    e_1^acc = γλ φ_0 + φ_1
    e_2^acc = (γλ)² φ_0 + γλ φ_1 + φ_2
    ...
    e_t^acc = Σ_{i=0}^{t} (γλ)^{t−i} φ_i

Let the value of feature i be bounded by C, that is, |φ_t(i)| < C for all i, t. Then, |e_t^acc(i)| < C/(1 − γλ) for all i, t. Because γλ < 1, this is some finite value.

The dutch-trace update can be re-written as:

    e_t^du = γλ (I − α φ_t φ_t^⊤) e_{t−1}^du + φ_t.

Using this, the non-incremental expression for e_t^du becomes:

    e_0^du = φ_0
    e_1^du = γλ (I − α φ_1 φ_1^⊤) φ_0 + φ_1
    e_2^du = (γλ)² (I − α φ_2 φ_2^⊤)(I − α φ_1 φ_1^⊤) φ_0 + γλ (I − α φ_2 φ_2^⊤) φ_1 + φ_2
    ...

Because the feature vectors are bounded, if α → 0, then (I − α φ_i φ_i^⊤) → I, and e_t^du → e_t^acc (because the trace values are bounded, this is true even if t → ∞). Finally, we need to show that (θ_t − θ_{t−1})^⊤ φ_t (e_t^du − φ_t) → 0 if α → 0. Because the feature vectors and trace values are bounded, it suffices to show that θ_t − θ_{t−1} = Δ_{t−1}^true → 0 if α → 0, which follows from the definition of Δ_t^true (given the condition that the TD error is bounded).

Based on these three theorems, we expect a large difference on domains for which the optimal α and optimal λ are relatively large, and where features are frequently revisited. Domains with a relatively large optimal α and optimal λ are typically domains with relatively low stochasticity. So as a rule of thumb, a large difference can be expected on domains with relatively low stochasticity and frequent revisits of features.
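The key step of this proof, that the dutch trace approaches the accumulating trace as α → 0, is easy to check numerically; the following self-contained snippet (our own illustration) runs both trace updates on a random, bounded feature sequence:

    import numpy as np

    rng = np.random.default_rng(0)
    gamma, lam = 0.99, 0.9
    phis = rng.normal(size=(50, 5))        # a bounded sequence of feature vectors

    for alpha in (0.1, 0.01, 0.001):
        e_acc = np.zeros(5)
        e_du = np.zeros(5)
        for phi in phis:
            e_acc = gamma * lam * e_acc + phi
            e_du = gamma * lam * e_du + phi - alpha * gamma * lam * e_du.dot(phi) * phi
        print(alpha, np.max(np.abs(e_du - e_acc)))   # the difference shrinks with alpha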

6. Derivation of True Online TD(λ)

The defining property of a true online method is that it maintains an exact equivalence with an online forward view at all times. This means that at every moment in time, the weight vector can be interpreted as the result of a sequence of updates with multi-step update targets. To achieve this step-by-step equivalence, the regular forward view has to be extended, because it only specifies what the weights at the end of an episode should be. In this section, we present the extended forward view, and we derive the true online TD(λ) update equations from it.

6.1 The Forward View of TD(λ)

In Section 2, the general update rule for linear function approximation was presented (Equation 1), which is based on the update rule for stochastic gradient descent. The update equations for TD(λ), however, are of a different form (Equations 3, 4 and 5). The forward view of TD(λ) relates the TD(λ) equations to Equation 1. Specifically, the forward view of TD(λ) specifies that TD(λ) approximates the λ-return algorithm. This algorithm performs a series of updates of the form of Equation 1 with the λ-return as update target:

    θ_{t+1} = θ_t + α [G_t^λ − θ_t^⊤ φ_t] φ_t,   for 0 ≤ t < T,

where T is the end of the episode, and G_t^λ is the λ-return at time t. The λ-return is a multi-step update target based on a weighted average of all future state values, with λ determining the weight distribution. Specifically, the λ-return at time t is defined as:

    G_t^λ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} G_t^{(n)}(θ_t),

with G_t^{(n)}(θ), the n-step return, defined as:

    G_t^{(n)}(θ) = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... + γ^{n−1} R_{t+n} + γ^n θ^⊤ φ_{t+n}.

For episodic tasks, G_t^{(n)}(θ) is equal to the full return, G_t, if t + n ≥ T, and the λ-return can be written as:

    G_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} G_t^{(n)}(θ_t) + λ^{T−t−1} G_t.    (10)

The forward view offers a particularly straightforward interpretation of the λ-parameter. For λ = 0, G_t^λ reduces to the TD(0) update target, while for λ = 1, G_t^λ reduces to the full return. In other words, for λ = 0 the update target has maximum bias and minimum variance, while for λ = 1, the update target is unbiased, but has maximum variance. For λ in between 0 and 1, the bias and variance are between these two extremes. So, λ enables control over the trade-off between bias and variance.
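As a sketch of definition (10) (our own rendering, with rewards indexed so that rewards[t+1] is R_{t+1}, and phis[T] the terminal, all-zero feature vector):

    import numpy as np

    def n_step_return(rewards, phis, theta, t, n, gamma):
        # G_t^(n)(theta): n discounted rewards plus the bootstrap value gamma^n * theta . phi_{t+n}
        G = sum(gamma ** (i - 1) * rewards[t + i] for i in range(1, n + 1))
        return G + gamma ** n * theta.dot(phis[t + n])

    def lambda_return(rewards, phis, theta, t, gamma, lam, T):
        # Episodic lambda-return, Eq. (10)
        G_full = sum(gamma ** (i - 1) * rewards[t + i] for i in range(1, T - t + 1))
        out = (1 - lam) * sum(lam ** (n - 1) * n_step_return(rewards, phis, theta, t, n, gamma)
                              for n in range(1, T - t))
        return out + lam ** (T - t - 1) * G_full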

While the λ-return algorithm has a very clear intuition, there is only an exact equivalence for the offline case. That is, the offline variant of TD(λ) computes the same value estimates as the offline variant of the λ-return algorithm. For the online case, there is only an approximate equivalence. Specifically, the weight vector at time T computed by accumulate TD(λ) closely approximates the weight vector at time T computed by the online λ-return algorithm for appropriately small values of the step-size parameter (Sutton and Barto, 1998).

That the forward view only applies to the weight vector at the end of an episode, even in the online case, is a limitation that is often overlooked. It is related to the fact that the λ-return for S_t is constructed from data stretching from time t + 1 all the way to time T, the time that the terminal state is reached. A consequence is that the λ-return algorithm can compute its weight vectors only in hindsight, at the end of an episode. This is illustrated by Figure 7, which maps each weight vector to the earliest time that it can be computed. Time in this case refers to the time of data-collection: time t is defined as the moment that sample φ_t is observed. By contrast, TD(λ) uses only data up to time t to compute the weight vector θ_t. Hence, TD(λ) can compute its weight vectors without delay (see Figure 8). To denote this important property, we use the term real-time. TD(λ) is a real-time algorithm, while the λ-return algorithm is not. A consequence is that even though both algorithms compute a sequence of T weight vectors, a meaningful comparison can only be made for θ_T, because only at time T does TD(λ) have access to the same data as the λ-return algorithm. This limits the usefulness of the λ-return algorithm as an intuitive way to view TD(λ). In the next section, we address this limitation.

Figure 7: The weight vectors θ_1, θ_2, θ_3, ..., θ_T of the λ-return algorithm mapped to the earliest time that they can be computed (all only at time T).

Figure 8: The weight vectors θ_1, θ_2, θ_3, ..., θ_T of TD(λ) mapped to the earliest time that they can be computed (at times 1, 2, 3, ..., T).

6.2 The Real-Time Forward View

The conventional forward view explains how the weight vector at the end of an episode, computed by TD(λ), can be interpreted as the result of a sequence of updates with a particular multi-step update target, the λ-return. We want to give a similar explanation for weight vectors during an episode. In other words, we want to construct a real-time forward view that explains the weight vectors, computed by TD(λ), at all time steps.

The dilemma that arises when trying to construct a real-time forward view is that the update targets should contain data from many time steps ahead, but the real-time aspect prohibits the use of data beyond the current time step. The solution to this dilemma is to have update targets that grow over time. In other words, rather than defining a fixed update target for each visited state, the update target depends on the time step up to which data is observed. We call such an update target an interim update target, and the time step up to which data is observed the data-horizon. We will use a superscript to indicate the data-horizon h of an update target: U_t^h. A simple example of an interim update target is an update target that consists of the discounted sum of rewards up to the data-horizon:

    U_t^h = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... + γ^{h−t−1} R_h.

A direct consequence of having update targets that depend on the data-horizon is that a real-time forward view specifies an update sequence for each data-horizon. Below, we show the update sequences based on an interim update target U_t^h for horizons 1, 2 and 3 (θ_0^h := θ_init, for all h).

    h = 1:  θ_1^1 = θ_0^1 + α [U_0^1(θ_0^1) − (θ_0^1)^⊤ φ_0] φ_0

    h = 2:  θ_1^2 = θ_0^2 + α [U_0^2(θ_0^2) − (θ_0^2)^⊤ φ_0] φ_0
            θ_2^2 = θ_1^2 + α [U_1^2(θ_1^2) − (θ_1^2)^⊤ φ_1] φ_1

    h = 3:  θ_1^3 = θ_0^3 + α [U_0^3(θ_0^3) − (θ_0^3)^⊤ φ_0] φ_0
            θ_2^3 = θ_1^3 + α [U_1^3(θ_1^3) − (θ_1^3)^⊤ φ_1] φ_1
            θ_3^3 = θ_2^3 + α [U_2^3(θ_2^3) − (θ_2^3)^⊤ φ_2] φ_2

More generally, the update sequence for horizon h is defined by:

    θ_{t+1}^h = θ_t^h + α [U_t^h(θ_t^h) − (θ_t^h)^⊤ φ_t] φ_t,   for 0 ≤ t < h.    (11)

Figure 9 maps each weight vector to the earliest time it can be computed. Ultimately, the weight-vector sequence of interest is not the sequence at a particular horizon. Rather, it is the sequence consisting of the final weight vector at each horizon: θ_1^1, θ_2^2, θ_3^3, ..., θ_T^T. Because θ_t^t can be computed at time t, we call the forward view a real-time forward view.

In principle, Equation (11) can be combined with any interim update target definition to form a real-time forward view. However, to get the real-time forward view that belongs to TD(λ), a horizon-dependent version of the λ-return is needed. A version of the λ-return that corresponds with horizon h should not use data beyond this horizon. In other words, the highest n-step return that should be involved is the (h − t)-step return. This can be achieved by replacing each n-step return with n > h − t with the (h − t)-step return.

Figure 9: The weight vectors θ_t^h of the new forward view mapped to the earliest time that they can be computed (at time t, all vectors with data-horizon h ≤ t are available).

We call this version of the λ-return the interim λ-return, and use the notation G_t^{λ|h} to indicate the interim λ-return with data-horizon h. G_t^{λ|h} can be written as follows:

    G_t^{λ|h} = (1 − λ) Σ_{n=1}^{h−t−1} λ^{n−1} G_t^{(n)} + (1 − λ) Σ_{n=h−t}^{∞} λ^{n−1} G_t^{(h−t)}
              = (1 − λ) Σ_{n=1}^{h−t−1} λ^{n−1} G_t^{(n)} + G_t^{(h−t)} [ (1 − λ) Σ_{n=h−t}^{∞} λ^{n−1} ]
              = (1 − λ) Σ_{n=1}^{h−t−1} λ^{n−1} G_t^{(n)} + G_t^{(h−t)} [ λ^{h−t−1} (1 − λ) Σ_{k=0}^{∞} λ^k ]
              = (1 − λ) Σ_{n=1}^{h−t−1} λ^{n−1} G_t^{(n)} + λ^{h−t−1} G_t^{(h−t)}    (12)

Equation 12 fully specifies the interim λ-return, except for one small detail: the weight vector that should be used for the value estimates in the n-step returns has not been specified yet. The regular λ-return uses G_t^{(n)}(θ_t) (see Equation 10). For the real-time forward view, however, all weight vectors have two indices, so simply using θ_t does not work in this case. So which double-indexed weight vector should be used? The two guiding principles in deciding which weight vector to use are that we want the forward view to be an approximation of accumulate TD(λ) and that an efficient implementation should be possible. One option is to use G_t^{(n)}(θ_t^h). While with this definition the update sequence at data-horizon T is exactly the same as the sequence of updates from the λ-return algorithm (basically, the λ-return implicitly uses a data-horizon of T), it prohibits efficient computation of θ_{h+1}^{h+1} from θ_h^h. For this reason, we use G_t^{(n)}(θ_{t+n−1}^{t+n−1}), which does allow for efficient computation, and forms a good approximation of accumulate TD(λ) as well (as we show below). Using this weight vector, the full definition of G_t^{λ|h} becomes:

    G_t^{λ|h} := (1 − λ) Σ_{n=1}^{h−t−1} λ^{n−1} G_t^{(n)}(θ_{t+n−1}^{t+n−1}) + λ^{h−t−1} G_t^{(h−t)}(θ_{h−1}^{h−1}).    (13)

We call this the interim λ-return. We call the algorithm that combines the interim λ-return with Equation 11 the interim λ-return algorithm.
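A direct, deliberately inefficient sketch of the interim λ-return algorithm, Equations (11) and (13), written by us: it recomputes each horizon's update sequence from scratch, which is exactly the redundancy that the derivation in the next subsection removes. True online TD(λ) computes the same sequence θ_1^1, ..., θ_T^T incrementally.

    import numpy as np

    def n_step_return(rewards, phis, theta, t, n, gamma):
        # G_t^(n)(theta); phis[T] is the terminal (all-zero) feature vector
        G = sum(gamma ** (i - 1) * rewards[t + i] for i in range(1, n + 1))
        return G + gamma ** n * theta.dot(phis[t + n])

    def interim_lambda_return(rewards, phis, thetas, t, h, gamma, lam):
        # Eq. (13): the n-step return uses the final weight vector of horizon t + n - 1
        out = (1 - lam) * sum(
            lam ** (n - 1) * n_step_return(rewards, phis, thetas[t + n - 1], t, n, gamma)
            for n in range(1, h - t))
        return out + lam ** (h - t - 1) * n_step_return(rewards, phis, thetas[h - 1], t, h - t, gamma)

    def real_time_forward_view(rewards, phis, theta_init, alpha, gamma, lam, T):
        # Eq. (11), run for every data-horizon h; returns [theta_0^0, theta_1^1, ..., theta_T^T]
        thetas = [theta_init]
        for h in range(1, T + 1):
            theta = theta_init.copy()
            for t in range(h):
                U = interim_lambda_return(rewards, phis, thetas, t, h, gamma, lam)
                theta = theta + alpha * (U - theta.dot(phis[t])) * phis[t]
            thetas.append(theta)
        return thetas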

6.3 Derivation

In this subsection, we derive the update equations of true online TD(λ) directly from the real-time forward view, defined by equations (11) and (13) (and θ_0^h := θ_init). The derivation is based on expressing θ_{h+1}^{h+1} in terms of θ_h^h. We start by writing θ_h^h directly in terms of the initial weight vector and the interim λ-returns. First, we rewrite (11), with the interim λ-return as update target, as:

    θ_{t+1}^h = (I − α φ_t φ_t^⊤) θ_t^h + α φ_t G_t^{λ|h},

with I the identity matrix. Now, consider θ_t^h for t = 1 and t = 2:

    θ_1^h = (I − α φ_0 φ_0^⊤) θ_init + α φ_0 G_0^{λ|h}
    θ_2^h = (I − α φ_1 φ_1^⊤) θ_1^h + α φ_1 G_1^{λ|h}
          = (I − α φ_1 φ_1^⊤)(I − α φ_0 φ_0^⊤) θ_init + α (I − α φ_1 φ_1^⊤) φ_0 G_0^{λ|h} + α φ_1 G_1^{λ|h}

For general t, we can write:

    θ_t^h = A_0^{t−1} θ_init + α Σ_{i=1}^{t} A_i^{t−1} φ_{i−1} G_{i−1}^{λ|h},

where A_i^j is defined as:

    A_i^j := (I − α φ_j φ_j^⊤)(I − α φ_{j−1} φ_{j−1}^⊤) ... (I − α φ_i φ_i^⊤),   for j ≥ i,

and A_{j+1}^j := I. We are now able to express θ_h^h as:

    θ_h^h = A_0^{h−1} θ_init + α Σ_{i=1}^{h} A_i^{h−1} φ_{i−1} G_{i−1}^{λ|h}.    (14)

Because for the derivation of true online TD(λ) we only need (14) and the definition of G_t^{λ|h}, we can drop the double indices for the weight vectors and use θ_h := θ_h^h.

We now derive a compact expression for the difference G_t^{λ|h+1} − G_t^{λ|h}:

    G_t^{λ|h+1} − G_t^{λ|h}
      = (1 − λ) Σ_{n=1}^{h−t} λ^{n−1} G_t^{(n)}(θ_{t+n−1}) + λ^{h−t} G_t^{(h+1−t)}(θ_h)
        − (1 − λ) Σ_{n=1}^{h−t−1} λ^{n−1} G_t^{(n)}(θ_{t+n−1}) − λ^{h−t−1} G_t^{(h−t)}(θ_{h−1})
      = (1 − λ) λ^{h−t−1} G_t^{(h−t)}(θ_{h−1}) + λ^{h−t} G_t^{(h+1−t)}(θ_h) − λ^{h−t−1} G_t^{(h−t)}(θ_{h−1})
      = λ^{h−t} G_t^{(h+1−t)}(θ_h) − λ^{h−t} G_t^{(h−t)}(θ_{h−1})
      = λ^{h−t} [ G_t^{(h+1−t)}(θ_h) − G_t^{(h−t)}(θ_{h−1}) ]
      = λ^{h−t} [ Σ_{i=1}^{h+1−t} γ^{i−1} R_{t+i} + γ^{h+1−t} θ_h^⊤ φ_{h+1} − Σ_{i=1}^{h−t} γ^{i−1} R_{t+i} − γ^{h−t} θ_{h−1}^⊤ φ_h ]
      = λ^{h−t} [ γ^{h−t} R_{h+1} + γ^{h+1−t} θ_h^⊤ φ_{h+1} − γ^{h−t} θ_{h−1}^⊤ φ_h ]
      = (λγ)^{h−t} [ R_{h+1} + γ θ_h^⊤ φ_{h+1} − θ_{h−1}^⊤ φ_h ]

Note that the difference G_t^{λ|h+1} − G_t^{λ|h} is naturally expressed using a term that looks like a TD error but with a modified time step. We call this the modified TD error, δ'_h:

    δ'_h := R_{h+1} + γ θ_h^⊤ φ_{h+1} − θ_{h−1}^⊤ φ_h.

Using this definition, the difference G_t^{λ|h+1} − G_t^{λ|h} can be compactly written as:

    G_t^{λ|h+1} − G_t^{λ|h} = (λγ)^{h−t} δ'_h.    (15)

Note that δ'_h relates to the regular TD error, δ_h, as follows:

    δ'_h = R_{h+1} + γ θ_h^⊤ φ_{h+1} − θ_{h−1}^⊤ φ_h
         = R_{h+1} + γ θ_h^⊤ φ_{h+1} − θ_h^⊤ φ_h + θ_h^⊤ φ_h − θ_{h−1}^⊤ φ_h
         = δ_h + θ_h^⊤ φ_h − θ_{h−1}^⊤ φ_h.    (16)

To get the update rule, we have to express θ_{h+1} in terms of θ_h. This is done below, using (14), (15) and (16):

    θ_{h+1} = A_0^h θ_init + α Σ_{i=1}^{h+1} A_i^h φ_{i−1} G_{i−1}^{λ|h+1}
      = A_0^h θ_init + α Σ_{i=1}^{h} A_i^h φ_{i−1} G_{i−1}^{λ|h+1} + α φ_h G_h^{λ|h+1}
      = A_0^h θ_init + α Σ_{i=1}^{h} A_i^h φ_{i−1} G_{i−1}^{λ|h} + α Σ_{i=1}^{h} A_i^h φ_{i−1} [G_{i−1}^{λ|h+1} − G_{i−1}^{λ|h}] + α φ_h G_h^{λ|h+1}
      = (I − α φ_h φ_h^⊤) [ A_0^{h−1} θ_init + α Σ_{i=1}^{h} A_i^{h−1} φ_{i−1} G_{i−1}^{λ|h} ] + α Σ_{i=1}^{h} A_i^h φ_{i−1} [G_{i−1}^{λ|h+1} − G_{i−1}^{λ|h}] + α φ_h G_h^{λ|h+1}
      = (I − α φ_h φ_h^⊤) θ_h + α Σ_{i=1}^{h} A_i^h φ_{i−1} [G_{i−1}^{λ|h+1} − G_{i−1}^{λ|h}] + α φ_h G_h^{λ|h+1}
      = (I − α φ_h φ_h^⊤) θ_h + α Σ_{i=1}^{h} A_i^h φ_{i−1} (γλ)^{h+1−i} δ'_h + α φ_h [R_{h+1} + γ θ_h^⊤ φ_{h+1}]
      = θ_h + α Σ_{i=1}^{h} A_i^h φ_{i−1} (γλ)^{h+1−i} δ'_h + α φ_h [R_{h+1} + γ θ_h^⊤ φ_{h+1} − θ_h^⊤ φ_h]
      = θ_h + α Σ_{i=1}^{h} A_i^h φ_{i−1} (γλ)^{h+1−i} δ'_h + α φ_h [R_{h+1} + γ θ_h^⊤ φ_{h+1} − θ_{h−1}^⊤ φ_h + θ_{h−1}^⊤ φ_h − θ_h^⊤ φ_h]
      = θ_h + α Σ_{i=1}^{h} A_i^h φ_{i−1} (γλ)^{h+1−i} δ'_h + α φ_h δ'_h + α φ_h [θ_{h−1}^⊤ φ_h − θ_h^⊤ φ_h]
      = θ_h + α Σ_{i=1}^{h+1} A_i^h φ_{i−1} (γλ)^{h+1−i} δ'_h + α φ_h [θ_{h−1}^⊤ φ_h − θ_h^⊤ φ_h]
      = θ_h + α e_h δ'_h + α φ_h [θ_{h−1}^⊤ φ_h − θ_h^⊤ φ_h],   with e_h := Σ_{i=1}^{h+1} A_i^h φ_{i−1} (γλ)^{h+1−i}
      = θ_h + α e_h [δ_h + θ_h^⊤ φ_h − θ_{h−1}^⊤ φ_h] + α φ_h [θ_{h−1}^⊤ φ_h − θ_h^⊤ φ_h]
      = θ_h + α e_h δ_h + α [θ_h^⊤ φ_h − θ_{h−1}^⊤ φ_h][e_h − φ_h]    (17)

We now have the update rule for θ_h, in addition to an explicit definition of e_h. Next, using this explicit definition, we derive an update rule to compute e_h from e_{h−1}:

    e_h = Σ_{i=1}^{h+1} A_i^h φ_{i−1} (γλ)^{h+1−i}
        = Σ_{i=1}^{h} A_i^h φ_{i−1} (γλ)^{h+1−i} + φ_h
        = (I − α φ_h φ_h^⊤) γλ Σ_{i=1}^{h} A_i^{h−1} φ_{i−1} (γλ)^{h−i} + φ_h
        = (I − α φ_h φ_h^⊤) γλ e_{h−1} + φ_h
        = γλ e_{h−1} + φ_h − αγλ (e_{h−1}^⊤ φ_h) φ_h    (18)

Equations (17) and (18), together with the definition of δ'_h, form the true online TD(λ) update equations.

7. Other True Online Methods

In the previous section, we showed that the true online TD(λ) equations can be derived directly from the real-time forward view equations. By using different real-time forward views, new true online methods can be derived. Sometimes, small changes in the real-time forward view, like using a time-dependent step-size, can result in surprising changes in the true online equations. In this section, we look at a number of such variations.

7.1 True Online TD(λ) with Time-Dependent Step-Size

When using a time-dependent step-size in the base equation of the forward view (Equation 11) and deriving the update equations following the procedure from Section 6.3, it turns out that a slightly different trace definition appears. We indicate this new trace using a + superscript: e^+. For a fixed step-size, this new trace definition is equal to e_t^+ = α e_t, for all t. Of course, using e_t^+ instead of e_t also changes the weight vector update slightly. Below, the full set of update equations is shown:

    δ_t = R_{t+1} + γ θ_t^⊤ φ_{t+1} − θ_t^⊤ φ_t
    e_t^+ = γλ e_{t−1}^+ + α_t φ_t − α_t γλ [(e_{t−1}^+)^⊤ φ_t] φ_t
    θ_{t+1} = θ_t + δ_t e_t^+ + [θ_t^⊤ φ_t − θ_{t−1}^⊤ φ_t][e_t^+ − α_t φ_t]

In addition, e_{−1}^+ := 0. We can simplify the weight update equation slightly, by using

    δ'_t = δ_t + θ_t^⊤ φ_t − θ_{t−1}^⊤ φ_t,


CHAPTER 10 VALIDATION OF TEST WITH ARTIFICAL NEURAL NETWORK 175 CHAPTER 10 VALIDATION OF TEST WITH ARTIFICAL NEURAL NETWORK 10.1 INTRODUCTION Amongs he research work performed, he bes resuls of experimenal work are validaed wih Arificial Neural Nework. From he

More information

Ensamble methods: Bagging and Boosting

Ensamble methods: Bagging and Boosting Lecure 21 Ensamble mehods: Bagging and Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Ensemble mehods Mixure of expers Muliple base models (classifiers, regressors), each covers a differen par

More information

Two Coupled Oscillators / Normal Modes

Two Coupled Oscillators / Normal Modes Lecure 3 Phys 3750 Two Coupled Oscillaors / Normal Modes Overview and Moivaion: Today we ake a small, bu significan, sep owards wave moion. We will no ye observe waves, bu his sep is imporan in is own

More information

Lecture 33: November 29

Lecture 33: November 29 36-705: Inermediae Saisics Fall 2017 Lecurer: Siva Balakrishnan Lecure 33: November 29 Today we will coninue discussing he boosrap, and hen ry o undersand why i works in a simple case. In he las lecure

More information

Lecture Notes 2. The Hilbert Space Approach to Time Series

Lecture Notes 2. The Hilbert Space Approach to Time Series Time Series Seven N. Durlauf Universiy of Wisconsin. Basic ideas Lecure Noes. The Hilber Space Approach o Time Series The Hilber space framework provides a very powerful language for discussing he relaionship

More information

Temporal Abstraction in Temporal-difference Networks

Temporal Abstraction in Temporal-difference Networks Temporal Absracion in Temporal-difference Neworks Richard S. Suon, Eddie J. Rafols, Anna Koop Deparmen of Compuing Science Universiy of Albera Edmonon, AB, Canada T6G 2E8 {suon,erafols,anna}@cs.ualbera.ca

More information

GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II Roland Siegwar Margaria Chli Paul Furgale Marco Huer Marin Rufli Davide Scaramuzza ETH Maser Course: 151-0854-00L Auonomous Mobile Robos Localizaion II ACT and SEE For all do, (predicion updae / ACT),

More information

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H.

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H. ACE 56 Fall 5 Lecure 8: The Simple Linear Regression Model: R, Reporing he Resuls and Predicion by Professor Sco H. Irwin Required Readings: Griffihs, Hill and Judge. "Explaining Variaion in he Dependen

More information

Ordinary dierential equations

Ordinary dierential equations Chaper 5 Ordinary dierenial equaions Conens 5.1 Iniial value problem........................... 31 5. Forward Euler's mehod......................... 3 5.3 Runge-Kua mehods.......................... 36

More information

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin ACE 56 Fall 005 Lecure 4: Simple Linear Regression Model: Specificaion and Esimaion by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Simple Regression: Economic and Saisical Model

More information

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1.

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1. Roboics I April 11, 017 Exercise 1 he kinemaics of a 3R spaial robo is specified by he Denavi-Harenberg parameers in ab 1 i α i d i a i θ i 1 π/ L 1 0 1 0 0 L 3 0 0 L 3 3 able 1: able of DH parameers of

More information

EXPLICIT TIME INTEGRATORS FOR NONLINEAR DYNAMICS DERIVED FROM THE MIDPOINT RULE

EXPLICIT TIME INTEGRATORS FOR NONLINEAR DYNAMICS DERIVED FROM THE MIDPOINT RULE Version April 30, 2004.Submied o CTU Repors. EXPLICIT TIME INTEGRATORS FOR NONLINEAR DYNAMICS DERIVED FROM THE MIDPOINT RULE Per Krysl Universiy of California, San Diego La Jolla, California 92093-0085,

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecure Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/i2ml3e CHAPTER 2: SUPERVISED LEARNING Learning a Class

More information

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t...

t is a basis for the solution space to this system, then the matrix having these solutions as columns, t x 1 t, x 2 t,... x n t x 2 t... Mah 228- Fri Mar 24 5.6 Marix exponenials and linear sysems: The analogy beween firs order sysems of linear differenial equaions (Chaper 5) and scalar linear differenial equaions (Chaper ) is much sronger

More information

Exponential Weighted Moving Average (EWMA) Chart Under The Assumption of Moderateness And Its 3 Control Limits

Exponential Weighted Moving Average (EWMA) Chart Under The Assumption of Moderateness And Its 3 Control Limits DOI: 0.545/mjis.07.5009 Exponenial Weighed Moving Average (EWMA) Char Under The Assumpion of Moderaeness And Is 3 Conrol Limis KALPESH S TAILOR Assisan Professor, Deparmen of Saisics, M. K. Bhavnagar Universiy,

More information

Robust estimation based on the first- and third-moment restrictions of the power transformation model

Robust estimation based on the first- and third-moment restrictions of the power transformation model h Inernaional Congress on Modelling and Simulaion, Adelaide, Ausralia, 6 December 3 www.mssanz.org.au/modsim3 Robus esimaion based on he firs- and hird-momen resricions of he power ransformaion Nawaa,

More information

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé Bias in Condiional and Uncondiional Fixed Effecs Logi Esimaion: a Correcion * Tom Coupé Economics Educaion and Research Consorium, Naional Universiy of Kyiv Mohyla Academy Address: Vul Voloska 10, 04070

More information

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon 3..3 INRODUCION O DYNAMIC OPIMIZAION: DISCREE IME PROBLEMS A. he Hamilonian and Firs-Order Condiions in a Finie ime Horizon Define a new funcion, he Hamilonian funcion, H. H he change in he oal value of

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

arxiv: v1 [cs.ai] 1 Jul 2015

arxiv: v1 [cs.ai] 1 Jul 2015 arxiv:507.00353v [cs.ai] Jul 205 Harm van Seijen harm.vanseijen@ualberta.ca A. Rupam Mahmood ashique@ualberta.ca Patrick M. Pilarski patrick.pilarski@ualberta.ca Richard S. Sutton sutton@cs.ualberta.ca

More information

Random Walk with Anti-Correlated Steps

Random Walk with Anti-Correlated Steps Random Walk wih Ani-Correlaed Seps John Noga Dirk Wagner 2 Absrac We conjecure he expeced value of random walks wih ani-correlaed seps o be exacly. We suppor his conjecure wih 2 plausibiliy argumens and

More information

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS NA568 Mobile Roboics: Mehods & Algorihms Today s Topic Quick review on (Linear) Kalman Filer Kalman Filering for Non-Linear Sysems Exended Kalman Filer (EKF)

More information

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles Diebold, Chaper 7 Francis X. Diebold, Elemens of Forecasing, 4h Ediion (Mason, Ohio: Cengage Learning, 006). Chaper 7. Characerizing Cycles Afer compleing his reading you should be able o: Define covariance

More information

Ensamble methods: Boosting

Ensamble methods: Boosting Lecure 21 Ensamble mehods: Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Schedule Final exam: April 18: 1:00-2:15pm, in-class Term projecs April 23 & April 25: a 1:00-2:30pm in CS seminar room

More information

Longest Common Prefixes

Longest Common Prefixes Longes Common Prefixes The sandard ordering for srings is he lexicographical order. I is induced by an order over he alphabe. We will use he same symbols (,

More information

This document was generated at 1:04 PM, 09/10/13 Copyright 2013 Richard T. Woodward. 4. End points and transversality conditions AGEC

This document was generated at 1:04 PM, 09/10/13 Copyright 2013 Richard T. Woodward. 4. End points and transversality conditions AGEC his documen was generaed a 1:4 PM, 9/1/13 Copyrigh 213 Richard. Woodward 4. End poins and ransversaliy condiions AGEC 637-213 F z d Recall from Lecure 3 ha a ypical opimal conrol problem is o maimize (,,

More information

) were both constant and we brought them from under the integral.

) were both constant and we brought them from under the integral. YIELD-PER-RECRUIT (coninued The yield-per-recrui model applies o a cohor, bu we saw in he Age Disribuions lecure ha he properies of a cohor do no apply in general o a collecion of cohors, which is wha

More information

Matlab and Python programming: how to get started

Matlab and Python programming: how to get started Malab and Pyhon programming: how o ge sared Equipping readers he skills o wrie programs o explore complex sysems and discover ineresing paerns from big daa is one of he main goals of his book. In his chaper,

More information

Notes for Lecture 17-18

Notes for Lecture 17-18 U.C. Berkeley CS278: Compuaional Complexiy Handou N7-8 Professor Luca Trevisan April 3-8, 2008 Noes for Lecure 7-8 In hese wo lecures we prove he firs half of he PCP Theorem, he Amplificaion Lemma, up

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1 SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision

More information

Lecture 20: Riccati Equations and Least Squares Feedback Control

Lecture 20: Riccati Equations and Least Squares Feedback Control 34-5 LINEAR SYSTEMS Lecure : Riccai Equaions and Leas Squares Feedback Conrol 5.6.4 Sae Feedback via Riccai Equaions A recursive approach in generaing he marix-valued funcion W ( ) equaion for i for he

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

Expert Advice for Amateurs

Expert Advice for Amateurs Exper Advice for Amaeurs Ernes K. Lai Online Appendix - Exisence of Equilibria The analysis in his secion is performed under more general payoff funcions. Wihou aking an explici form, he payoffs of he

More information

Introduction D P. r = constant discount rate, g = Gordon Model (1962): constant dividend growth rate.

Introduction D P. r = constant discount rate, g = Gordon Model (1962): constant dividend growth rate. Inroducion Gordon Model (1962): D P = r g r = consan discoun rae, g = consan dividend growh rae. If raional expecaions of fuure discoun raes and dividend growh vary over ime, so should he D/P raio. Since

More information

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3 and d = c b - b c c d = c b - b c c This process is coninued unil he nh row has been compleed. The complee array of coefficiens is riangular. Noe ha in developing he array an enire row may be divided or

More information

Overview. COMP14112: Artificial Intelligence Fundamentals. Lecture 0 Very Brief Overview. Structure of this course

Overview. COMP14112: Artificial Intelligence Fundamentals. Lecture 0 Very Brief Overview. Structure of this course OMP: Arificial Inelligence Fundamenals Lecure 0 Very Brief Overview Lecurer: Email: Xiao-Jun Zeng x.zeng@mancheser.ac.uk Overview This course will focus mainly on probabilisic mehods in AI We shall presen

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

Single and Double Pendulum Models

Single and Double Pendulum Models Single and Double Pendulum Models Mah 596 Projec Summary Spring 2016 Jarod Har 1 Overview Differen ypes of pendulums are used o model many phenomena in various disciplines. In paricular, single and double

More information

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17 EES 16A Designing Informaion Devices and Sysems I Spring 019 Lecure Noes Noe 17 17.1 apaciive ouchscreen In he las noe, we saw ha a capacior consiss of wo pieces on conducive maerial separaed by a nonconducive

More information

NCSS Statistical Software. , contains a periodic (cyclic) component. A natural model of the periodic component would be

NCSS Statistical Software. , contains a periodic (cyclic) component. A natural model of the periodic component would be NCSS Saisical Sofware Chaper 468 Specral Analysis Inroducion This program calculaes and displays he periodogram and specrum of a ime series. This is someimes nown as harmonic analysis or he frequency approach

More information

RC, RL and RLC circuits

RC, RL and RLC circuits Name Dae Time o Complee h m Parner Course/ Secion / Grade RC, RL and RLC circuis Inroducion In his experimen we will invesigae he behavior of circuis conaining combinaions of resisors, capaciors, and inducors.

More information

EKF SLAM vs. FastSLAM A Comparison

EKF SLAM vs. FastSLAM A Comparison vs. A Comparison Michael Calonder, Compuer Vision Lab Swiss Federal Insiue of Technology, Lausanne EPFL) michael.calonder@epfl.ch The wo algorihms are described wih a planar robo applicaion in mind. Generalizaion

More information

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 38, No. 2, May 2013, pp. 209 227 ISSN 0364-765X (prin) ISSN 1526-5471 (online) hp://dx.doi.org/10.1287/moor.1120.0562 2013 INFORMS On Boundedness of Q-Learning Ieraes

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

CSE/NB 528 Lecture 14: From Supervised to Reinforcement Learning (Chapter 9) R. Rao, 528: Lecture 14

CSE/NB 528 Lecture 14: From Supervised to Reinforcement Learning (Chapter 9) R. Rao, 528: Lecture 14 CSE/NB 58 Lecure 14: From Supervised o Reinforcemen Learning Chaper 9 1 Recall from las ime: Sigmoid Neworks Oupu v T g w u g wiui w Inpu nodes u = u 1 u u 3 T i Sigmoid oupu funcion: 1 g a 1 a e 1 ga

More information

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Kriging Models Predicing Arazine Concenraions in Surface Waer Draining Agriculural Waersheds Paul L. Mosquin, Jeremy Aldworh, Wenlin Chen Supplemenal Maerial Number

More information

BU Macro BU Macro Fall 2008, Lecture 4

BU Macro BU Macro Fall 2008, Lecture 4 Dynamic Programming BU Macro 2008 Lecure 4 1 Ouline 1. Cerainy opimizaion problem used o illusrae: a. Resricions on exogenous variables b. Value funcion c. Policy funcion d. The Bellman equaion and an

More information

= ( ) ) or a system of differential equations with continuous parametrization (T = R

= ( ) ) or a system of differential equations with continuous parametrization (T = R XIII. DIFFERENCE AND DIFFERENTIAL EQUATIONS Ofen funcions, or a sysem of funcion, are paramerized in erms of some variable, usually denoed as and inerpreed as ime. The variable is wrien as a funcion of

More information

Math Week 14 April 16-20: sections first order systems of linear differential equations; 7.4 mass-spring systems.

Math Week 14 April 16-20: sections first order systems of linear differential equations; 7.4 mass-spring systems. Mah 2250-004 Week 4 April 6-20 secions 7.-7.3 firs order sysems of linear differenial equaions; 7.4 mass-spring sysems. Mon Apr 6 7.-7.2 Sysems of differenial equaions (7.), and he vecor Calculus we need

More information

( ) a system of differential equations with continuous parametrization ( T = R + These look like, respectively:

( ) a system of differential equations with continuous parametrization ( T = R + These look like, respectively: XIII. DIFFERENCE AND DIFFERENTIAL EQUATIONS Ofen funcions, or a sysem of funcion, are paramerized in erms of some variable, usually denoed as and inerpreed as ime. The variable is wrien as a funcion of

More information

Learning to Take Concurrent Actions

Learning to Take Concurrent Actions Learning o Take Concurren Acions Khashayar Rohanimanesh Deparmen of Compuer Science Universiy of Massachuses Amhers, MA 0003 khash@cs.umass.edu Sridhar Mahadevan Deparmen of Compuer Science Universiy of

More information

5.1 - Logarithms and Their Properties

5.1 - Logarithms and Their Properties Chaper 5 Logarihmic Funcions 5.1 - Logarihms and Their Properies Suppose ha a populaion grows according o he formula P 10, where P is he colony size a ime, in hours. When will he populaion be 2500? We

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION DOI: 0.038/NCLIMATE893 Temporal resoluion and DICE * Supplemenal Informaion Alex L. Maren and Sephen C. Newbold Naional Cener for Environmenal Economics, US Environmenal Proecion

More information

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates Biol. 356 Lab 8. Moraliy, Recruimen, and Migraion Raes (modified from Cox, 00, General Ecology Lab Manual, McGraw Hill) Las week we esimaed populaion size hrough several mehods. One assumpion of all hese

More information

Multi-scale 2D acoustic full waveform inversion with high frequency impulsive source

Multi-scale 2D acoustic full waveform inversion with high frequency impulsive source Muli-scale D acousic full waveform inversion wih high frequency impulsive source Vladimir N Zubov*, Universiy of Calgary, Calgary AB vzubov@ucalgaryca and Michael P Lamoureux, Universiy of Calgary, Calgary

More information

Lecture 4 Kinetics of a particle Part 3: Impulse and Momentum

Lecture 4 Kinetics of a particle Part 3: Impulse and Momentum MEE Engineering Mechanics II Lecure 4 Lecure 4 Kineics of a paricle Par 3: Impulse and Momenum Linear impulse and momenum Saring from he equaion of moion for a paricle of mass m which is subjeced o an

More information

Types of Exponential Smoothing Methods. Simple Exponential Smoothing. Simple Exponential Smoothing

Types of Exponential Smoothing Methods. Simple Exponential Smoothing. Simple Exponential Smoothing M Business Forecasing Mehods Exponenial moohing Mehods ecurer : Dr Iris Yeung Room No : P79 Tel No : 788 8 Types of Exponenial moohing Mehods imple Exponenial moohing Double Exponenial moohing Brown s

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

Analyze patterns and relationships. 3. Generate two numerical patterns using AC

Analyze patterns and relationships. 3. Generate two numerical patterns using AC envision ah 2.0 5h Grade ah Curriculum Quarer 1 Quarer 2 Quarer 3 Quarer 4 andards: =ajor =upporing =Addiional Firs 30 Day 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 andards: Operaions and Algebraic Thinking

More information

23.5. Half-Range Series. Introduction. Prerequisites. Learning Outcomes

23.5. Half-Range Series. Introduction. Prerequisites. Learning Outcomes Half-Range Series 2.5 Inroducion In his Secion we address he following problem: Can we find a Fourier series expansion of a funcion defined over a finie inerval? Of course we recognise ha such a funcion

More information

Announcements: Warm-up Exercise:

Announcements: Warm-up Exercise: Fri Apr 13 7.1 Sysems of differenial equaions - o model muli-componen sysems via comparmenal analysis hp//en.wikipedia.org/wiki/muli-comparmen_model Announcemens Warm-up Exercise Here's a relaively simple

More information

Lab 10: RC, RL, and RLC Circuits

Lab 10: RC, RL, and RLC Circuits Lab 10: RC, RL, and RLC Circuis In his experimen, we will invesigae he behavior of circuis conaining combinaions of resisors, capaciors, and inducors. We will sudy he way volages and currens change in

More information

Numerical Dispersion

Numerical Dispersion eview of Linear Numerical Sabiliy Numerical Dispersion n he previous lecure, we considered he linear numerical sabiliy of boh advecion and diffusion erms when approimaed wih several spaial and emporal

More information

Energy Storage Benchmark Problems

Energy Storage Benchmark Problems Energy Sorage Benchmark Problems Daniel F. Salas 1,3, Warren B. Powell 2,3 1 Deparmen of Chemical & Biological Engineering 2 Deparmen of Operaions Research & Financial Engineering 3 Princeon Laboraory

More information

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes Some common engineering funcions 2.7 Inroducion This secion provides a caalogue of some common funcions ofen used in Science and Engineering. These include polynomials, raional funcions, he modulus funcion

More information