Off-policy TD(λ) with a true online equivalence

Hado van Hasselt, A. Rupam Mahmood, Richard S. Sutton
Reinforcement Learning and Artificial Intelligence Laboratory
University of Alberta, Edmonton, AB T6G 2E8, Canada

Abstract

Van Seijen and Sutton (2014) recently proposed a new version of the linear TD(λ) learning algorithm that is exactly equivalent to an online forward view and that empirically performed better than its classical counterpart in both prediction and control problems. However, their algorithm is restricted to on-policy learning. In the more general case of off-policy learning, in which the policy whose outcome is predicted and the policy used to generate data may be different, their algorithm cannot be applied. One reason for this is that the algorithm bootstraps and thus is subject to instability problems when function approximation is used. A second reason true online TD(λ) cannot be used for off-policy learning is that the off-policy case requires sophisticated importance sampling in its eligibility traces. To address these limitations, we generalize their equivalence result and use this generalization to construct the first online algorithm to be exactly equivalent to an off-policy forward view. We show that this algorithm, named true online GTD(λ), empirically outperforms GTD(λ) (Maei, 2011), which was derived from the same objective as our forward view but lacks the exact online equivalence. In the general theorem that allows us to derive this new algorithm, we encounter a new general eligibility-trace update.

1 Temporal difference learning

Eligibility traces improve learning in temporal-difference (TD) algorithms by efficiently propagating credit for later observations back to update earlier predictions (Sutton, 1988), and can speed up learning significantly. A good way to interpret these traces, the extent of which is regulated by a trace parameter λ ∈ [0, 1], is to consider the eventual updates to each prediction. For λ = 1 the update for the prediction at time t is similar to a Monte Carlo update towards the full return following t. For λ = 0 the prediction is updated toward only the immediate (reward) signal, and the rest of the return is estimated with the prediction at the next state. Such an interpretation is called a forward view, because it considers the effect of future observations on the updates. In practice, learning is often fastest for intermediate values of λ (Sutton & Barto, 1998).

Traditionally, the equivalence to a forward view was known to hold only when the predictions are updated offline. In practice TD algorithms are more commonly used online, during learning, but then this equivalence was only approximate. Recently, van Seijen and Sutton (2014) developed true online TD(λ), the first algorithm to be exactly equivalent to a forward view under online updating. For λ = 1 the updates by true online TD(λ) eventually become exactly equivalent to a Monte Carlo update towards the full return. As demonstrated by van Seijen and Sutton, such an online equivalence is more than a theoretical curiosity, and it leads to lower prediction errors than the traditional TD(λ) algorithm, which only achieves an offline equivalence. In this paper, we generalize this result and show that exact online equivalences are possible for a wide range of forward views, leading to computationally efficient online algorithms that exploit a new generic trace update.

A limitation of the true online TD(λ) algorithm by van Seijen and Sutton (2014) is that it is only applicable to on-policy learning, when the learned predictions correspond to the policy that is used to generate the data. Off-policy learning is important to be able to learn from demonstrations, to learn about many things at the same time (Sutton et al., 2011), and ultimately to learn about the unknown optimal policy. A natural next step is therefore to apply our general equivalence result to an off-policy forward view. We construct such a forward view and derive an equivalent new off-policy gradient TD algorithm, which we call true online GTD(λ). This algorithm is constructed to be equivalent for λ = 0, by design, to the existing GTD(λ) algorithm (Maei, 2011). We demonstrate empirically that for higher λ the new algorithm is much better behaved due to its exact equivalence to a desired forward view.

In addition to the practical potential of the new algorithm, this demonstrates the usefulness of our general equivalence result and the resulting new trace update.

2 Problem setting

We consider a learning agent in an unknown environment where at each time step t the agent performs an action A_t, after which the environment transitions from the current state S_t to the next state S_{t+1}. We do not assume the state itself can be observed; the agent instead observes a feature vector φ_t ∈ R^n, which is typically a function of the state S_t such that φ_t = φ(S_t). The agent selects its actions according to a behavior policy b, such that b(a|S_t) denotes the probability of selecting action A_t = a in state S_t. Typically b(a|s) depends on s through φ(s). After performing A_t, the agent observes a scalar (reward) signal R_{t+1} and the process can either terminate or continue. We allow for soft terminations, defined by a potentially time-varying state-dependent termination factor γ_t ∈ [0, 1] (cf. Sutton, Mahmood, Precup & van Hasselt, 2014). With weight 1 − γ_{t+1} the process terminates at time t+1 and R_{t+1} is considered the last reward in this episode. With weight γ_{t+1} we continue to the next state and observe φ_{t+1} = φ(S_{t+1}). The agent then selects a new action A_{t+1} and this process repeats. A special case is the episodic setting where γ_t = 1 for all non-terminating times and γ_T = 0 when the episode ends at time T. The termination factors are commonly called discount factors, because they discount the effect of later rewards.

The goal is to predict the sum of future rewards, discounted by the probabilities of termination, under a target policy π. The optimal prediction is thus defined for each state s by

    v_π(s) = E_π[ Σ_{t=1}^∞ ( Π_{k=1}^{t−1} γ_k ) R_t | S_0 = s ],

where E_π[·] = E[· | A_t ∼ π(·|S_t), ∀t] denotes the expectation conditional on the policy π. We estimate the values v_π(s) with a parameterized function of the observed features. In particular, we consider linear functions of the features, such that θ⊤φ_t ≈ v_π(S_t) is the estimated value of the state at time t according to a weight vector θ. The goal is then to improve the predictions by updating θ.

We desire online algorithms with a constant O(n) per-step complexity, where n is the number of features in φ_t. Such computational considerations are important in settings with a lot of data or when φ_t is a large vector. For instance, we want our algorithms to be able to run on a robot with many sensors and limited on-board processing power.

3 General online equivalence between forward and backward views

We can think about what the ideal update would be for a prediction after observing all relevant future rewards and states. Such an update is called a forward view, because it depends on observations from the future. A concrete example is the on-policy Monte Carlo return, consisting of the discounted sum of all future rewards. In practice, full Monte Carlo updates can have high variance. It can be better to augment the return with the then-current predictions at the visited states. When we continue after some time step t, with weight γ_{t+1}, we replace a portion 1 − λ_{t+1} of the remaining return with our current prediction of this return at S_{t+1}. Making use of later predictions to update earlier predictions in this way is called bootstrapping. The process then continues to the next action and reward with total weight γ_{t+1} λ_{t+1}, where again we terminate with 1 − γ_{t+2} and then bootstrap with 1 − λ_{t+2}, and so on. When λ_{t+1} = 0 we get the usual one-step TD return R_{t+1} + γ_{t+1} φ_{t+1}⊤θ. If λ_t = 1 for all t, we obtain a full (discounted) Monte Carlo return. In the on-policy setting, when we do not have to worry about deviations from the target policy, we can then update the prediction made at time t towards the on-policy λ-return defined by

    G_t^λ = R_{t+1} + γ_{t+1} [ (1 − λ_{t+1}) φ_{t+1}⊤θ + λ_{t+1} G_{t+1}^λ ].

The discount factors γ_t are normally considered a property of the problem, but the bootstrap parameters λ_t can be considered tunable parameters. The full return (obtained for λ = 1) is an unbiased estimate of the value of the behavior policy, but its variance can be high. The value estimates are typically not unbiased, but can be considerably less variable. As such, one can interpret the λ parameters as trading off bias and variance. Typically, learning is fastest for intermediate values of λ.

If termination never occurs, G_t^λ is never fully defined. To construct a well-defined forward view, we can truncate the recursion at the current data horizon (van Seijen & Sutton, 2014; Sutton et al., 2014) to obtain interim λ-returns. If we have data up to time h, all returns are truncated as if λ_h = 0 and we bootstrap on the most recent value estimate φ_h⊤θ_{h−1} of the current state. This gives us, for each t < h,

    G_t^{λ,h} = R_{t+1} + γ_{t+1} [ (1 − λ_{t+1}) φ_{t+1}⊤θ_t + λ_{t+1} G_{t+1}^{λ,h} ],   with   G_h^{λ,h} = φ_h⊤θ_{h−1}.

In this definition of G_t^{λ,h}, for each time step j with t < j ≤ h the value of state S_j is estimated using φ_j⊤θ_{j−1}, because θ_{j−1} is the most up-to-date weight vector at the moment we reach this state.
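
To make the recursion concrete, the following sketch computes an interim λ-return directly from this definition. It is a minimal reference implementation under our own naming and indexing conventions (the function name and argument layout are ours, not from the paper), useful mainly for checking an efficient implementation against the forward view.

```python
import numpy as np

def interim_lambda_return(t, h, R, phi, gamma, lam, thetas):
    """On-policy interim lambda-return G_t^{lambda,h}, computed by its recursion.

    Indexing convention (ours): R[k] = R_{k+1}, gamma[k] = gamma_{k+1},
    lam[k] = lambda_{k+1}, phi[k] = feature vector at time k, and
    thetas[k] = weight vector theta_k (theta_{-1} is taken to be theta_0).
    """
    if t == h:
        # Truncation: bootstrap on the most recent estimate of the state at the horizon.
        return phi[h] @ thetas[max(h - 1, 0)]
    # Continue with weight gamma_{t+1}; bootstrap a fraction (1 - lambda_{t+1}) on the
    # estimate of S_{t+1} and keep a fraction lambda_{t+1} of the remaining return.
    return R[t] + gamma[t] * (
        (1.0 - lam[t]) * (phi[t + 1] @ thetas[t])
        + lam[t] * interim_lambda_return(t + 1, h, R, phi, gamma, lam, thetas))
```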

Using these interim returns, we can construct an interim forward view which, in contrast to conventional forward views, can be computed before an episode has concluded, or even if the episode never fully terminates. For instance, when we have data up to time h, the following set of linear updates for all times t < h is an interim forward view:

    θ_{t+1}^h = θ_t^h + α_t ( G_t^{λ,h} − φ_t⊤θ_t^h ) φ_t,   t < h,   (1)

where θ_0^h = θ_0 is the initial weight vector. The subscript on θ_t^h (first index on G_t^{λ,h}) corresponds to the state for the t-th update; the superscript (second index on G_t^{λ,h}) denotes the current data horizon. The forward view (1) is well-defined and computable at every time h, but it is not very computationally efficient. For each new observation, when h increments to h+1, we potentially have to recompute all the updates, as G_t^{λ,h+1} might differ from G_t^{λ,h} for arbitrarily many t. The resulting computational complexity grows with h, which is problematic when h becomes large. Therefore, forward views are not meant to be implemented as is. They serve as a conceptual update, in which we formulate what we want to achieve after observing the relevant data. In the next theorem, we prove that for many forward views an efficient and fully equivalent backward view exists that exploits eligibility traces to construct online updates that use only O(n) computation per time step, but that still result in exactly the same weight vectors. The theorem is constructive, allowing us to find such backward views automatically for a given forward view.

Theorem 1 (Equivalence between forward and backward views). Consider any forward view that updates towards some interim targets Y_t^h with

    θ_{t+1}^h = θ_t^h + η_t ( Y_t^h − φ_t⊤θ_t^h ) φ_t + x_t,   0 ≤ t < h,

where θ_0^h = θ_0 for some initial θ_0 and where x_t ∈ R^n is any vector that does not depend on h. Assume that the temporal differences Y_t^{h+1} − Y_t^h for different t are related through

    Y_t^{h+1} − Y_t^h = c_t ( Y_{t+1}^{h+1} − Y_{t+1}^h ),   ∀ t < h,   (2)

where c_t is a scalar that does not depend on h. Then, the final weights θ_t^t at each t are equal to the weights θ_t as defined by e_0 = η_0 φ_0 and the backward view

    θ_{t+1} = θ_t + ( Y_t^{t+1} − Y_t^t ) e_t + η_t ( Y_t^t − φ_t⊤θ_t ) φ_t + x_t,
    e_t = c_{t−1} e_{t−1} + η_t ( 1 − c_{t−1} φ_t⊤e_{t−1} ) φ_t,   t > 0.   (3)

Proof. We introduce the fading matrix F_t = I − η_t φ_t φ_t⊤, such that θ_{t+1}^h = F_t θ_t^h + η_t Y_t^h φ_t + x_t. Assume, as induction hypothesis, that θ_t^t = θ_t; this holds at t = 0 because θ_0^h = θ_0 for all h. We subtract θ_t^t from θ_{t+1}^{t+1} to find the change when t increments. Expanding θ_{t+1}^{t+1}, we get

    θ_{t+1}^{t+1} − θ_t^t = F_t θ_t^{t+1} − θ_t^t + η_t Y_t^{t+1} φ_t + x_t
                          = F_t ( θ_t^{t+1} − θ_t^t ) + η_t Y_t^{t+1} φ_t + ( F_t − I ) θ_t^t + x_t
                          = F_t ( θ_t^{t+1} − θ_t^t ) + η_t Y_t^{t+1} φ_t − η_t φ_t φ_t⊤θ_t^t + x_t
                          = F_t ( θ_t^{t+1} − θ_t^t ) + η_t ( Y_t^{t+1} − φ_t⊤θ_t^t ) φ_t + x_t.   (4)

We now repeatedly expand θ_t^{t+1} − θ_t^t to get

    θ_t^{t+1} − θ_t^t = F_{t−1} ( θ_{t−1}^{t+1} − θ_{t−1}^t ) + η_{t−1} ( Y_{t−1}^{t+1} − Y_{t−1}^t ) φ_{t−1}
                      = F_{t−1} F_{t−2} ( θ_{t−2}^{t+1} − θ_{t−2}^t ) + η_{t−2} ( Y_{t−2}^{t+1} − Y_{t−2}^t ) F_{t−1} φ_{t−2} + η_{t−1} ( Y_{t−1}^{t+1} − Y_{t−1}^t ) φ_{t−1}
                      = ⋯   (expand until reaching θ_0^{t+1} − θ_0^t = 0)
                      = Σ_{k=0}^{t−1} η_k F_{t−1} ⋯ F_{k+1} ( Y_k^{t+1} − Y_k^t ) φ_k
                      = Σ_{k=0}^{t−1} η_k ( Π_{j=k}^{t−1} c_j ) F_{t−1} ⋯ F_{k+1} φ_k ( Y_t^{t+1} − Y_t^t )   (applying (2) repeatedly)
                      = c_{t−1} e_{t−1} ( Y_t^{t+1} − Y_t^t ),   (5)

where we defined e_t = Σ_{k=0}^{t} η_k ( Π_{j=k}^{t−1} c_j ) F_t ⋯ F_{k+1} φ_k. This vector can be computed with the recursion

    e_t = c_{t−1} F_t Σ_{k=0}^{t−1} η_k ( Π_{j=k}^{t−2} c_j ) F_{t−1} ⋯ F_{k+1} φ_k + η_t φ_t
        = c_{t−1} F_t e_{t−1} + η_t φ_t
        = c_{t−1} e_{t−1} + η_t ( 1 − c_{t−1} φ_t⊤e_{t−1} ) φ_t.

We plug (5) back into (4) and, using c_{t−1} F_t e_{t−1} = e_t − η_t φ_t, obtain

    θ_{t+1}^{t+1} − θ_t^t = ( e_t − η_t φ_t ) ( Y_t^{t+1} − Y_t^t ) + η_t ( Y_t^{t+1} − φ_t⊤θ_t^t ) φ_t + x_t
                          = ( Y_t^{t+1} − Y_t^t ) e_t + η_t ( Y_t^t − φ_t⊤θ_t^t ) φ_t + x_t,

which is exactly the backward-view update (3) applied to θ_t = θ_t^t. Because θ_0^h = θ_0 for all h, the desired result follows by induction.
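
The backward view (3) is generic: it only needs the per-step quantities η_t, c_{t−1}, x_t, and the two interim targets Y_t^{t+1} and Y_t^t. The sketch below is a minimal illustration of this generic update with our own class and argument names (not code from the paper); specific algorithms such as true online TD(λ) are obtained by plugging in the appropriate targets and scalars.

```python
import numpy as np

class DutchTraceBackwardView:
    """Generic backward view of Theorem 1 with the trace update (3)."""

    def __init__(self, theta0):
        self.theta = np.array(theta0, dtype=float)
        self.e = None  # eligibility trace; initialized on the first step

    def step(self, phi, eta, c_prev, Y_new, Y_old, x=0.0):
        """One online update.

        phi    : feature vector phi_t
        eta    : step size eta_t
        c_prev : the scalar c_{t-1} relating successive interim targets
        Y_new  : Y_t^{t+1}, the interim target at the new data horizon
        Y_old  : Y_t^{t},   the interim target at the previous horizon
        x      : the extra vector x_t (zero for true online TD(lambda))
        """
        if self.e is None:
            self.e = eta * phi                      # e_0 = eta_0 * phi_0
        else:
            # Shrink the trace by c_{t-1}, then update the trace of the
            # current features towards one with step size eta_t.
            self.e = c_prev * self.e + eta * (1.0 - c_prev * (phi @ self.e)) * phi
        self.theta = (self.theta
                      + (Y_new - Y_old) * self.e
                      + eta * (Y_old - phi @ self.theta) * phi
                      + x)
        return self.theta
```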

The theorem shows that under condition (2) we can turn a general forward view into an equivalent online algorithm that only uses O(n) computation per time step. Compared to previous work on forward/backward equivalences, this grants us two important things. First, the obtained equivalence is both online and exact; most previous equivalences were only exact under offline updating, when the weights are not updated during learning (Sutton & Barto, 1998; Sutton et al., 2014). Second, the theorem is constructive, and gives an equivalent backward view directly from a desired forward view, rather than having to prove such an equivalence in hindsight (as in, e.g., van Seijen & Sutton, 2014). This is perhaps the main benefit of the theorem: rather than relying on insight and intuition to construct efficient online algorithms, Theorem 1 can be used to derive an exact backward view directly from a desired forward view. We exploit this in Section 6 when we turn a desired off-policy forward view into an efficient new online off-policy algorithm.

We refer to traces of the general form (3) as dutch traces. The trace update can be interpreted as first shrinking the traces with c_{t−1}, for instance c_{t−1} = γλ, and then updating the trace of the current state, φ_t⊤e_t, towards one with a step size of η_t. In contrast, traditional accumulating traces, defined by e_t = c_{t−1} e_{t−1} + φ_t, add to the trace value of the current state rather than updating it toward one. This can cause the accumulating traces to grow large, potentially resulting in high-variance updates.

To demonstrate one advantage of Theorem 1, we apply it to the on-policy TD(λ) forward view defined by (1).

Theorem 2 (Equivalence for true online TD(λ)). Define θ_0^h = θ_0 for all h. Then, θ_t^t as defined by (1) equals θ_t as defined by the backward view

    δ_t = R_{t+1} + γ_{t+1} φ_{t+1}⊤θ_t − φ_t⊤θ_{t−1},
    e_t = γ_t λ_t e_{t−1} + α_t ( 1 − γ_t λ_t φ_t⊤e_{t−1} ) φ_t,   with e_0 = α_0 φ_0,
    θ_{t+1} = θ_t + δ_t e_t + α_t ( φ_t⊤θ_{t−1} − φ_t⊤θ_t ) φ_t.

Proof. In Theorem 1, we substitute x_t = 0, c_t = γ_{t+1} λ_{t+1} and Y_t^h = G_t^{λ,h}, such that Y_t^{t+1} − Y_t^t = δ_t and Y_t^t = φ_t⊤θ_{t−1}. The desired result follows immediately.

The backward view in Theorem 2 is true online TD(λ), as proposed by van Seijen and Sutton (2014). Using Theorem 1, we have proved equivalence to its forward view with a few simple substitutions, whereas the original proof is much longer and more complex.
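
As an illustration, the backward view of Theorem 2 translates into the following per-step routine. This is a sketch with our own naming, written for scalar, time-constant α, γ, and λ (so the per-step subscripts above do not have to be threaded through); it is not the authors' reference implementation.

```python
import numpy as np

class TrueOnlineTD:
    """True online TD(lambda) (Theorem 2), on-policy, linear function approximation."""

    def __init__(self, n, alpha):
        self.theta = np.zeros(n)
        self.e = np.zeros(n)
        self.alpha = alpha
        # v_old holds phi_t' theta_{t-1}; with theta initialized to zeros this starts at 0.
        self.v_old = 0.0

    def update(self, phi, R, gamma, lam, phi_next):
        v = phi @ self.theta              # phi_t' theta_t
        v_next = phi_next @ self.theta    # phi_{t+1}' theta_t
        delta = R + gamma * v_next - self.v_old
        # Dutch trace; the first call yields e_0 = alpha * phi_0.
        self.e = (gamma * lam * self.e
                  + self.alpha * (1.0 - gamma * lam * (phi @ self.e)) * phi)
        self.theta = self.theta + delta * self.e + self.alpha * (self.v_old - v) * phi
        self.v_old = v_next               # becomes phi_{t+1}' theta_t for the next step
        return delta
```
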
4 Off-policy learning

In this section, we turn to off-policy learning with function approximation. In constructing an off-policy forward view, two issues arise that are not present in the on-policy setting. First, we need to estimate the value of a policy that is different from the one used to obtain the observations. Second, using a forward view such as (1) under off-policy sampling can cause it to be unstable, potentially resulting in divergence of the weights (Sutton et al., 2008). These issues can be avoided by constructing our off-policy algorithms to minimize a mean-squared projected Bellman error (MSPBE) with gradient descent (Sutton et al., 2009; Maei & Sutton, 2010; Maei, 2011). The MSPBE was previously used to derive GTD(λ) (Maei, 2011), which is an online algorithm that can be used to learn off-policy predictions. GTD(λ) was not constructed to be exactly equivalent to any forward view, and it is a natural question whether the algorithm can be improved by having such an equivalence, just as was the case with TD(λ) and true online TD(λ). In this section, we introduce an off-policy MSPBE and show how GTD(λ) can be derived. In the next section, we use the same MSPBE to construct a new off-policy forward view from which we will derive an exactly equivalent online backward view.

To obtain estimates for one distribution when the samples are generated under another distribution, we can weight the observations by the relative probabilities of these observations occurring under the target policy, as compared to the behavior distribution. This is called importance sampling (Rubinstein, 1981; Precup, Sutton & Singh, 2000). Recall that b(a|s) and π(a|s) denote the probabilities of selecting action a in state s according to the behavior policy and the target policy, respectively. After selecting an action A_t in a state S_t according to b, we observe a reward R_{t+1}. The expected value of this reward is E_b[R_{t+1}], but if we multiply the reward with the importance-sampling ratio ρ_t = π(A_t|S_t)/b(A_t|S_t) the expected value is

    E_b[ ρ_t R_{t+1} | S_t ] = Σ_a b(a|S_t) (π(a|S_t)/b(a|S_t)) E[ R_{t+1} | S_t, A_t = a ]
                             = Σ_a π(a|S_t) E[ R_{t+1} | S_t, A_t = a ]
                             = E_π[ R_{t+1} | S_t ].

Therefore ρ_t R_{t+1} is an unbiased sample of the reward under the target policy. This technique can be applied to all the rewards and value estimates in a given λ-return.
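
This identity is easy to verify numerically. The snippet below is a small sanity check with made-up action probabilities and expected rewards (all numbers are hypothetical), confirming that weighting by ρ = π(a|s)/b(a|s) turns a behavior-policy expectation into the target-policy expectation.

```python
import numpy as np

# Hypothetical two-action example in a single state.
r = np.array([1.0, 3.0])    # E[R_{t+1} | S_t = s, A_t = a] for a in {0, 1}
b = np.array([0.7, 0.3])    # behavior policy b(a|s)
pi = np.array([0.2, 0.8])   # target policy pi(a|s)
rho = pi / b                # importance-sampling ratios

# The importance-weighted expectation under b equals the expectation under pi.
print((b * rho * r).sum(), (pi * r).sum())   # both print 2.6
```
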

For instance, if we want to obtain an unbiased sample of the reward under the target policy n steps after the current state S_t, the total weight applied to this reward should be ρ_t ρ_{t+1} ⋯ ρ_{t+n−1}. An off-policy λ-return starting from state S_t is given by

    G_t^{λρ}(θ) = ρ_t ( R_{t+1} + γ_{t+1} [ (1 − λ_{t+1}) φ_{t+1}⊤θ + λ_{t+1} G_{t+1}^{λρ}(θ) ] ).   (6)

In contrast to G_t^λ, this return is defined as a function of a single weight vector θ. This is useful later, when we wish to determine the gradient of this return with respect to θ.

When using function approximation it is generally not possible to estimate the value of each state with full accuracy or, equivalently, to reduce the conditional expected TD error for each state to zero at the same time. More formally, let v_θ be a parameterized value function defined by v_θ(s) = θ⊤φ(s) and let T_π^λ be a parameterized Bellman operator defined, for any v : S → R, by

    (T_π^λ v)(s) = E_π[ R_1 + γ_1 (1 − λ_1) v(S_1) + γ_1 λ_1 (T_π^λ v)(S_1) | S_0 = s ].

In general, we then cannot achieve v_θ = T_π^λ v_θ, because T_π^λ v_θ is not guaranteed to be a function that we can represent with our chosen function approximation. It is, however, possible to find the fixed point defined by

    v_θ = Π T_π^λ v_θ,   (7)

where Π v is a projection of v into the space of representable functions {v_θ | θ ∈ R^n}. Let d be the steady-state distribution of states under the behavior policy. The projection of any v is then defined by Π v = v_{θ_v}, where θ_v = arg min_θ ‖ v_θ − v ‖_d^2, and where ‖·‖_d^2 is a norm defined by ‖f‖_d^2 = Σ_s d(s) f(s)^2. Following Maei (2011), the projection is defined in terms of the steady-state distribution resulting from the behavior policy, which means that d(s) = lim_{t→∞} P(S_t = s | A_j ∼ b(·|S_j), ∀j). This implies our objective weights the importance of the accuracy of the prediction in each state according to the relative frequency with which this state occurs under the behavior policy, which is a natural choice for online learning.

The fixed point in (7) can be found by minimizing the MSPBE defined by (Maei, 2011)

    J(θ) = ‖ v_θ − Π T_π^λ v_θ ‖_d^2   (8)
         = E_b[ δ_t^π(θ) φ_t ]⊤ E_b[ φ_t φ_t⊤ ]^{−1} E_b[ δ_t^π(θ) φ_t ],

where δ_t^π(θ) = (T_π^λ v_θ)(S_t) − v_θ(S_t) and where the expectations are with respect to the steady-state distribution d, as induced by the behavior policy b. The ideal gradient update for time step t is then

    θ_{t+1} = θ_t − (1/2) α_t ∇_θ J(θ)|_{θ_t},   (9)

where

    −(1/2) ∇_θ J(θ)|_{θ_t} = −E_b[ ∇_θ δ_t^π(θ) φ_t⊤ ] E_b[ φ_t φ_t⊤ ]^{−1} E_b[ δ_t^π(θ_t) φ_t ]
                            = E_b[ ( φ_t − ∇_θ G_t^{λρ}(θ) ) φ_t⊤ ] E_b[ φ_t φ_t⊤ ]^{−1} E_b[ δ_t^π(θ_t) φ_t ]
                            = E_b[ δ_t^π(θ_t) φ_t ] − E_b[ ∇_θ G_t^{λρ}(θ) φ_t⊤ ] w,   (10)

with G_t^{λρ} as defined in (6), and where w = E_b[ φ_t φ_t⊤ ]^{−1} E_b[ δ_t^π(θ_t) φ_t ]. Update (9) can be interpreted as an expected forward view.

The derivation of the GTD(λ) algorithm proceeds by exploiting the expected equivalences (Maei, 2011)

    E_b[ ∇_θ G_t^{λρ}(θ) φ_t⊤ ] = E_b[ ρ_t γ_{t+1} (1 − λ_{t+1}) φ_{t+1} φ_t⊤ ] + E_b[ ρ_t γ_{t+1} λ_{t+1} ∇_θ G_{t+1}^{λρ}(θ) φ_t⊤ ]
                                = E_b[ ρ_t γ_{t+1} (1 − λ_{t+1}) φ_{t+1} φ_t⊤ ] + E_b[ ρ_{t−1} γ_t λ_t ∇_θ G_t^{λρ}(θ) φ_{t−1}⊤ ]
                                = ⋯   (repeat until we reach φ_0)
                                = E_b[ γ_{t+1} (1 − λ_{t+1}) φ_{t+1} Σ_{j=0}^{t} ( Π_{i=j+1}^{t} ρ_{i−1} γ_i λ_i ) ρ_t φ_j⊤ ]
                                = E_b[ γ_{t+1} (1 − λ_{t+1}) φ_{t+1} ē_t⊤ ],   (11)

and, similarly, E_b[ δ_t^π(θ_t) φ_t ] = E_b[ δ_t(θ_t) ē_t ], where

    ē_t = ρ_t ( γ_t λ_t ē_{t−1} + φ_t ),   (12)
    δ_t(θ) = R_{t+1} + γ_{t+1} φ_{t+1}⊤θ − φ_t⊤θ.

The auxiliary vector w can be updated with least mean squares (LMS) (Sutton et al., 2009; Maei, 2011), using the sample δ_t(θ_t) ē_t, because E_b[ δ_t(θ_t) ē_t ] = E_b[ δ_t^π(θ_t) φ_t ], and the update

    w_{t+1} = w_t + β_t δ_t(θ_t) ē_t − β_t ( φ_t⊤w_t ) φ_t.

The complete GTD(λ) algorithm¹ is then defined by

    δ_t = R_{t+1} + γ_{t+1} φ_{t+1}⊤θ_t − φ_t⊤θ_t,
    ē_t = ρ_t ( γ_t λ_t ē_{t−1} + φ_t ),
    θ_{t+1} = θ_t + α_t δ_t ē_t − α_t γ_{t+1} (1 − λ_{t+1}) ( w_t⊤ē_t ) φ_{t+1},
    w_{t+1} = w_t + β_t δ_t ē_t − β_t ( φ_t⊤w_t ) φ_t.

¹ Dann, Neumann and Peters (2014) call this algorithm TDC(λ), but we use the original name by Maei (2011).
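
The following sketch transcribes these four updates directly (our own class and parameter names; the per-step quantities γ_t, λ_t, γ_{t+1}, λ_{t+1}, and ρ_t are passed in by the caller, and the step sizes are held constant):

```python
import numpy as np

class GTDLambda:
    """GTD(lambda) (Maei, 2011) with linear function approximation; a sketch."""

    def __init__(self, n, alpha, beta):
        self.theta = np.zeros(n)
        self.w = np.zeros(n)
        self.e = np.zeros(n)   # accumulating trace, Eq. (12)
        self.alpha, self.beta = alpha, beta

    def update(self, phi, R, phi_next, gamma, lam, gamma_next, lam_next, rho):
        # TD error with both value estimates computed from the current theta.
        delta = R + gamma_next * (phi_next @ self.theta) - phi @ self.theta
        # Importance-weighted accumulating trace.
        self.e = rho * (gamma * lam * self.e + phi)
        # Main weights: TD update along the trace plus the gradient-correction term.
        self.theta += (self.alpha * delta * self.e
                       - self.alpha * gamma_next * (1.0 - lam_next)
                         * (self.w @ self.e) * phi_next)
        # Auxiliary weights: LMS update towards the expected TD error.
        self.w += self.beta * delta * self.e - self.beta * (phi @ self.w) * phi
        return delta
```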

5 An off-policy forward view

In this section, we define an off-policy forward view which we turn into a fully equivalent backward view in the next section, using Theorem 1. GTD(λ) is derived by first turning an expected forward view into an expected backward view, and then sampling. We propose instead to sample the expected forward view directly and then invert the sampled forward view into an equivalent online backward view. This way we obtain an exact equivalence between forward and backward views instead of the expected equivalence of GTD(λ). This was previously not known to be possible, but it has the advantage that we can use the precise (potentially discounted and bootstrapped) sample returns consisting of all future rewards and state values in each update. This can result in more accurate predictions, as confirmed by our experiments in Section 7.

The new forward view derives from the MSPBE, as defined in (8), and more specifically from the gradient update defined by (9) and (10). To find an implementable interim forward view, we need sampled estimates of all three parts in (10). We discuss each of these parts separately. Our interim forward view is defined in terms of a data horizon h, so the gradient of the MSPBE is taken at the horizon weights rather than at θ_t. Furthermore, δ_t^π is defined as the error between a λ-return and a current estimate, and therefore we need to construct an interim λ-return. To estimate the first term of (10), we therefore need a sampled estimate of E_b[ δ_t^π(θ) φ_t ], based on some suitably defined interim return G_t^{λρ,h}.

The variance of off-policy updates is often lower when we weight the errors (that is, the difference between the return and the current estimate) with the importance-sampling ratios, rather than weighting the returns (Sutton et al., 2014). Let δ_t = R_{t+1} + γ_{t+1} φ_{t+1}⊤θ_t − φ_t⊤θ_{t−1} denote a one-step TD error. The on-policy return used in the forward view (1) can then be written as a sum of such errors:

    G_t^{λ,h} = φ_t⊤θ_{t−1} + Σ_{j=t}^{h−1} ( Π_{i=t+1}^{j} γ_i λ_i ) δ_j.

We apply the importance-sampling weights to the one-step TD errors, rather than just to the reward and bootstrapped value estimate.² This does not affect the expected value, because E_b[ ρ_t φ_t⊤θ_{t−1} | S_t ] = E_b[ φ_t⊤θ_{t−1} | S_t ], but it can have a beneficial effect on the variance of the resulting updates. A sampled off-policy error is then

    G_t^{λρ,h} − ρ_t φ_t⊤θ_{t−1},   (13)

where

    G_t^{λρ,h} = ρ_t φ_t⊤θ_{t−1} + ρ_t Σ_{j=t}^{h−1} ( Π_{i=t+1}^{j} γ_i λ_i ρ_i ) δ_j.

An equivalent recursive definition for G_t^{λρ,h} is

    G_t^{λρ,h} = ρ_t ( R_{t+1} + γ_{t+1} [ (1 − λ_{t+1} ρ_{t+1}) φ_{t+1}⊤θ_t + λ_{t+1} G_{t+1}^{λρ,h} ] )   for t < h,   (14)

and G_h^{λρ,h} = ρ_h φ_h⊤θ_{h−1}. In the on-policy case, when ρ_t = 1 for all t, G_t^{λρ,h} reduces exactly to G_t^{λ,h}, as used in the forward view (1) for true online TD(λ). Furthermore, E_b[ G_t^{λρ,h} | S_t = s ] = E_π[ G_t^{λ,h} | S_t = s ] for any s.

For the second term in (10), which can be thought of as the gradient-correction term, we need an estimate of w. As in the derivation of GTD(λ), we use an LMS update. Assuming we have data up to time h, the ideal forward-view update for w is then

    w_{t+1}^h = w_t^h + β_t ( δ_t^{λρ,h} − φ_t⊤w_t^h ) φ_t,   (15)

for some appropriate sample δ_t^{λρ,h} of E_b[ δ_t^π(θ_t) ]. A natural interim estimate is defined by

    δ_t^{λρ,h} = ρ_t ( δ_t + γ_{t+1} λ_{t+1} δ_{t+1}^{λρ,h} ),   with δ_h^{λρ,h} = 0,   (16)

where δ_t in (16) is the TD error R_{t+1} + γ_{t+1} θ_t⊤φ_{t+1} − θ_t⊤φ_t, in which both value estimates use θ_t. This is not the only possible way to estimate w, but this choice ensures the resulting algorithm is equivalent to GTD(0) when λ = 0, allowing us to investigate the effects of the true online equivalence and the resulting new trace updates in some isolation, without having to worry about other potential differences between the algorithms. In the next section we construct an equivalent backward view for (15) to compute the sequence {w_t}, where w_t = w_t^t.

² For the PTD(λ) and PQ(λ) algorithms, Sutton et al. (2014) propose another weighting, based on weighting flat return errors containing multiple rewards. In contrast, our weighting is chosen to be consistent with GTD(λ). True online versions of PTD(λ) and PQ(λ) exist, but we do not consider them further in this paper.
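
As with the on-policy case, the recursion (14) gives a direct (if inefficient) way to compute the interim off-policy return, which is handy for testing a backward-view implementation against the forward view. Below is a sketch under the same hypothetical indexing convention as before (our naming; rho[k] = ρ_k).

```python
import numpy as np

def interim_off_policy_return(t, h, R, phi, gamma, lam, rho, thetas):
    """Off-policy interim lambda-return G_t^{lambda rho, h} of Eq. (14) (our naming).

    Indexing convention (ours): R[k] = R_{k+1}, gamma[k] = gamma_{k+1},
    lam[k] = lambda_{k+1}, rho[k] = rho_k, thetas[k] = theta_k
    (theta_{-1} is taken to be theta_0).
    """
    if t == h:
        # Truncation at the data horizon.
        return rho[h] * (phi[h] @ thetas[max(h - 1, 0)])
    cont = ((1.0 - lam[t] * rho[t + 1]) * (phi[t + 1] @ thetas[t])
            + lam[t] * interim_off_policy_return(t + 1, h, R, phi, gamma, lam, rho, thetas))
    return rho[t] * (R[t] + gamma[t] * cont)
```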

Finally, we use the expected equivalence proved in (11), and then sample to obtain

    γ_{t+1} (1 − λ_{t+1}) φ_{t+1} ē_t⊤ ≈ E_b[ ∇_θ G_t^{λρ}(θ) φ_t⊤ ],   (17)

with ē_t as defined in (12). We now have all the pieces to state the off-policy forward view for θ. We approximate the expected forward view as defined by (9) and (10) by using the sampled estimates (13) and (17) and w_t = w_t^t, with w_t^h as defined by (15). This gives us the interim forward view

    θ_{t+1}^h = θ_t^h + α_t ( G_t^{λρ,h} − ρ_t φ_t⊤θ_t^h ) φ_t − α_t γ_{t+1} (1 − λ_{t+1}) ( w_t⊤ē_t ) φ_{t+1},   (18)

with G_t^{λρ,h} as defined in (14).

6 Backward view: true online GTD(λ)

In this section, we apply Theorem 1 to convert the off-policy forward view given by (18) into an efficient online backward view. First, we consider w_t.

Theorem 3 (Auxiliary vectors). The vector w_t^t, as defined by the forward view in (15), is equal to w_t as defined by the backward view

    e_t^w = ρ_{t−1} γ_t λ_t e_{t−1}^w + β_t ( 1 − ρ_{t−1} γ_t λ_t φ_t⊤e_{t−1}^w ) φ_t,
    w_{t+1} = w_t + ρ_t δ_t e_t^w − β_t ( φ_t⊤w_t ) φ_t,

where e_0^w = β_0 φ_0, w_0 = w_0^h, and δ_t = R_{t+1} + γ_{t+1} φ_{t+1}⊤θ_t − φ_t⊤θ_t.

Proof. We apply Theorem 1 by substituting θ → w, η_t = β_t, x_t = 0 and Y_t^h = δ_t^{λρ,h}, as defined in (16). Then

    δ_t^{λρ,h+1} − δ_t^{λρ,h} = ρ_t γ_{t+1} λ_{t+1} ( δ_{t+1}^{λρ,h+1} − δ_{t+1}^{λρ,h} ),

which implies c_t = ρ_t γ_{t+1} λ_{t+1}. Finally, Y_t^t = δ_t^{λρ,t} = 0 and Y_t^{t+1} − Y_t^t = δ_t^{λρ,t+1} = ρ_t δ_t. Inserting these substitutions into the backward view in Theorem 1 immediately yields the backward view in the current theorem.

Theorem 4 (True online GTD(λ)). For any t, the weight vector θ_t^t as defined by the forward view in (18) is equal to θ_t as defined by the backward view

    e_t = ρ_t ( γ_t λ_t e_{t−1} + α_t ( 1 − ρ_t γ_t λ_t φ_t⊤e_{t−1} ) φ_t ),
    ē_t = ρ_t ( γ_t λ_t ē_{t−1} + φ_t ),
    θ_{t+1} = θ_t + δ_t e_t + ( e_t − α_t ρ_t φ_t ) ( (θ_t − θ_{t−1})⊤φ_t ) − α_t γ_{t+1} (1 − λ_{t+1}) ( w_t⊤ē_t ) φ_{t+1},

with w_t and δ_t as defined in Theorem 3.

Proof. Again, we apply Theorem 1. Substitute η_t = ρ_t α_t, x_t = −α_t γ_{t+1} (1 − λ_{t+1}) ( w_t⊤ē_t ) φ_{t+1}, Y_t^t = θ_{t−1}⊤φ_t and

    Y_t^h = R_{t+1} + γ_{t+1} [ (1 − λ_{t+1} ρ_{t+1}) θ_t⊤φ_{t+1} + λ_{t+1} G_{t+1}^{λρ,h} ].

This last substitution implies Y_t^{h+1} − Y_t^h = γ_{t+1} λ_{t+1} ρ_{t+1} ( Y_{t+1}^{h+1} − Y_{t+1}^h ), so that c_t = γ_{t+1} λ_{t+1} ρ_{t+1}. Furthermore,

    Y_t^{t+1} − Y_t^t = R_{t+1} + γ_{t+1} θ_t⊤φ_{t+1} − θ_{t−1}⊤φ_t = δ_t + ( θ_t − θ_{t−1} )⊤φ_t.

Applying Theorem 1 with these substitutions, and replacing w_t^t with the equivalent w_t, yields the backward view

    θ_{t+1} = θ_t + ( δ_t + (θ_t − θ_{t−1})⊤φ_t ) e_t + α_t ρ_t ( θ_{t−1}⊤φ_t − θ_t⊤φ_t ) φ_t − α_t γ_{t+1} (1 − λ_{t+1}) ( w_t⊤ē_t ) φ_{t+1}
            = θ_t + δ_t e_t + ( e_t − α_t ρ_t φ_t ) ( (θ_t − θ_{t−1})⊤φ_t ) − α_t γ_{t+1} (1 − λ_{t+1}) ( w_t⊤ē_t ) φ_{t+1},

where e_0 = α_0 ρ_0 φ_0 and

    e_t = ρ_t γ_t λ_t e_{t−1} + α_t ρ_t ( 1 − ρ_t γ_t λ_t φ_t⊤e_{t−1} ) φ_t = ρ_t ( γ_t λ_t e_{t−1} + α_t ( 1 − ρ_t γ_t λ_t φ_t⊤e_{t−1} ) φ_t ).

The true online GTD(λ) algorithm is then defined by

    δ_t = R_{t+1} + γ_{t+1} φ_{t+1}⊤θ_t − φ_t⊤θ_t,
    e_t = ρ_t ( γ_t λ_t e_{t−1} + α_t ( 1 − ρ_t γ_t λ_t φ_t⊤e_{t−1} ) φ_t ),
    ē_t = ρ_t ( γ_t λ_t ē_{t−1} + φ_t ),
    e_t^w = ρ_{t−1} γ_t λ_t e_{t−1}^w + β_t ( 1 − ρ_{t−1} γ_t λ_t φ_t⊤e_{t−1}^w ) φ_t,
    θ_{t+1} = θ_t + δ_t e_t + ( e_t − α_t ρ_t φ_t ) ( (θ_t − θ_{t−1})⊤φ_t ) − α_t γ_{t+1} (1 − λ_{t+1}) ( w_t⊤ē_t ) φ_{t+1},
    w_{t+1} = w_t + ρ_t δ_t e_t^w − β_t ( φ_t⊤w_t ) φ_t.

The traces e_t and e_t^w are dutch traces. The trace ē_t is an accumulating trace that follows from the gradient correction, as discussed in Section 4. It might be possible to adapt the forward view to replace ē_t with e_t. This is already possible in practice, and in preliminary experiments the resulting algorithm performed similarly to true online GTD(λ). A more detailed investigation of this possibility is left for future work. For λ = 0 the algorithm reduces to

    θ_{t+1} = θ_t + α_t ρ_t δ_t φ_t − α_t ρ_t γ_{t+1} ( w_t⊤φ_t ) φ_{t+1},
    w_{t+1} = w_t + β_t ρ_t δ_t φ_t − β_t ( φ_t⊤w_t ) φ_t,

which is precisely GTD(0).³

³ The on-policy variant of this algorithm, with ρ_t = 1 for all t, is known as TDC (Sutton et al., 2009; Maei, 2011).
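
For concreteness, here is a sketch of these six updates as a single per-step routine. All names are ours; the per-step quantities γ_t, λ_t, γ_{t+1}, λ_{t+1}, and ρ_t are supplied by the caller, the step sizes are held constant, and the traces, previous weights, and ρ_{t−1} are carried between calls.

```python
import numpy as np

class TrueOnlineGTD:
    """True online GTD(lambda): sketch of the backward view in Section 6 (our naming)."""

    def __init__(self, n, alpha, beta):
        self.theta = np.zeros(n)
        self.theta_prev = np.zeros(n)   # theta_{t-1}
        self.w = np.zeros(n)
        self.e = np.zeros(n)            # dutch trace e_t
        self.e_bar = np.zeros(n)        # accumulating trace, Eq. (12)
        self.e_w = np.zeros(n)          # dutch trace for the auxiliary vector w
        self.rho_prev = 0.0             # rho_{t-1}; e_w shrinks with rho_{t-1} gamma_t lambda_t
        self.alpha, self.beta = alpha, beta

    def update(self, phi, R, phi_next, gamma, lam, gamma_next, lam_next, rho):
        alpha, beta = self.alpha, self.beta
        delta = R + gamma_next * (phi_next @ self.theta) - phi @ self.theta
        # Dutch trace for theta (first call gives e_0 = alpha * rho_0 * phi_0).
        self.e = rho * (gamma * lam * self.e
                        + alpha * (1.0 - rho * gamma * lam * (phi @ self.e)) * phi)
        # Accumulating trace used in the gradient-correction term.
        self.e_bar = rho * (gamma * lam * self.e_bar + phi)
        # Dutch trace for w (note the shift: it shrinks with rho_{t-1}).
        c_w = self.rho_prev * gamma * lam
        self.e_w = c_w * self.e_w + beta * (1.0 - c_w * (phi @ self.e_w)) * phi
        # Main weight update, including the extra term involving theta_{t-1}.
        new_theta = (self.theta + delta * self.e
                     + (self.e - alpha * rho * phi) * ((self.theta - self.theta_prev) @ phi)
                     - alpha * gamma_next * (1.0 - lam_next) * (self.w @ self.e_bar) * phi_next)
        # Auxiliary weight update.
        self.w += rho * delta * self.e_w - beta * (phi @ self.w) * phi
        self.theta_prev, self.theta = self.theta, new_theta
        self.rho_prev = rho
        return delta
```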

7 Experiments

We compare true online GTD(λ) to GTD(λ) empirically in various settings. The main goal of the experiments is to test the intuition that true online GTD(λ) should be more robust to high step sizes and high λ, due to its true online equivalence and better-behaved traces. This was shown to be the case for true online TD(λ) (van Seijen & Sutton, 2014), and the experiments serve to verify that this extends to the off-policy setting with true online GTD(λ). This is relevant because it implies true online GTD(λ) should then be easier to tune in practice, and because these parameters can affect the limiting performance of the algorithms as well.

Both algorithms optimize the MSPBE, as given in (8), which is a function of λ. When the state representation is of poor quality, the solution that minimizes the MSPBE can still have a high mean-squared error (MSE): ‖ v_θ − v_π ‖_d^2. This means that with a low λ we are not always guaranteed to reach a low MSE, even asymptotically. The closer λ is to one, the closer the MSPBE becomes to the MSE, with equality for λ = 1. In practice this implies that sometimes we need a high λ to be able to obtain sufficiently accurate predictions, even if we run the algorithms a long time.

To illustrate these points, we investigate a fairly simple problem. The problem setting is a random walk consisting of 15 states that can be thought of as lying on a horizontal line. In each state we have two actions: move one state to the left, or one state to the right. If we move left in the left-most state, s_1, we bounce back into that state. If we move right in the right-most state, s_15, the episode ends and we get a reward of +1. On all other time steps, the reward is zero. Each episode starts in s_1, which is the left-most state. This problem setting is similar to the one used by van Seijen and Sutton (2014), with three differences. First, we use 15 rather than 11 states, but this makes little difference to the conclusions. Second, we turn it into an off-policy learning problem, as we describe in a moment. Third, we use different state representations. This last point is because we want to test the performance of the algorithm not just with features that can accurately represent the value function, as used by van Seijen and Sutton, but also with features that cannot reduce the MSE all the way to zero.

In the original problem, there was a 0.9 probability of moving right in each state (van Seijen & Sutton, 2014). Here, we interpret these probabilities as being due to a behavior policy that selects the right action with probability 0.9. Then, we formulate a target policy that moves right more often, with probability 0.95. The stochastic target policy demonstrates that our algorithm is applicable to arbitrary off-policy learning tasks, and that the results do not depend on the target policy being deterministic. We did also test the performance for a deterministic policy that always moves right, and the results are similar to those given below. Because this is an episodic task, γ = 1.

[Figure 1: The MSE on the random walk of GTD(λ) (left column) and true online GTD(λ) (right column). The x-axis shows α, and the different lines are for different λ, with λ = 0 in blue and λ = 1 in orange. The top row is for 15 tabular features, the middle row for 4 binary features, and the bottom row for 2 monotonic features. The MSE is minimized over β.]

As stated above, we define three different state representations. In the first task, we use tabular features, such that φ(s_i) is a vector of 15 elements, with the i-th element equal to one and all other elements equal to zero. In the second task the state number is turned into a binary representation, such that φ(s_1) = (0, 0, 0, 1), φ(s_2) = (0, 0, 1, 0), φ(s_3) = (0, 0, 1, 1), and so on up to φ(s_15) = (1, 1, 1, 1). The features are then normalized to be unit vectors, such that for instance φ(s_3) = (0, 0, 1/√2, 1/√2) and φ(s_15) = (1/2, 1/2, 1/2, 1/2). In our final representation, we use one monotonically increasing feature and one monotonically decreasing feature, such that φ(s_i) = ((15 − i)/14, (i − 1)/14) for all i. These features were not normalized.
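
These three representations are straightforward to reproduce. The sketch below is our own code following the description above; the exact constants in the monotonic features are reconstructed from the text, so treat them as an assumption.

```python
import numpy as np

def tabular_features(n_states=15):
    # One-hot feature vector per state.
    return np.eye(n_states)

def binary_features(n_states=15, n_bits=4):
    # State i (1-based) written in binary, then normalized to a unit vector.
    phi = np.array([[(i >> k) & 1 for k in reversed(range(n_bits))]
                    for i in range(1, n_states + 1)], dtype=float)
    return phi / np.linalg.norm(phi, axis=1, keepdims=True)

def monotonic_features(n_states=15):
    # One monotonically decreasing and one increasing feature (not normalized).
    i = np.arange(1, n_states + 1)
    return np.stack([(n_states - i) / (n_states - 1),
                     (i - 1) / (n_states - 1)], axis=1)
```
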
For α the range of parameters was from 2^{−8} to 1 with steps of 0.25 in the exponent, so that α ∈ {2^{−8}, 2^{−7.75}, …, 1}. The secondary step size β was varied over the same range, with the addition of β = 0. The trace parameter λ was varied from 0 to 1 with steps of 1 in the exponent of the distance to one, and with the addition of λ = 1, such that λ ∈ {0, 1 − 2^{−1}, …, 1 − 2^{−9}, 1}. The MSE (averaged over 20 repetitions) after 10 episodes for all three representations is shown in Figure 1.

The left graphs all correspond to GTD(λ) and the plots on the right are for true online GTD(λ). Each graph shows the MSE as a function of α, with different lines for different values of λ, of which the extremes are highlighted (λ = 0 is blue; λ = 1 is orange). In all cases, the MSE was minimized over β, but this secondary step size had little impact on the performance in these problems. Note that the blue lines in the pair of graphs in each row are exactly equal, because by design the algorithms are equivalent for λ = 0.

In the top plots, the tabular representation was used and we see that especially with high λ both algorithms reach low prediction errors. This demonstrates that indeed learning can be faster with higher λ. When using function approximation, in the middle and bottom graphs, the benefit of having an online equivalence to a well-defined forward view becomes apparent. For both representations, the performance of GTD(λ) with higher λ begins to deteriorate around α = 0.2. In contrast, true online GTD(λ) performs well even for α = λ = 1. Note the log scale of the y-axis; the difference in MSE is many orders of magnitude.

In practice it is not always possible to fully tune the algorithmic parameters, and therefore the robustness of true online GTD(λ) to different settings is important. However, it is still interesting to see what the best performance could be for a fully tuned algorithm. Therefore, in Figure 2 we show the MSE as a function of λ when minimized over both α and β. For all λ, true online GTD(λ) outperforms GTD(λ).

[Figure 2: The MSE on the random walk for different λ of GTD(λ) and true online GTD(λ), for optimized α and β and binary features.]

8 Discussion

The main theoretical contribution of this paper is a general theorem for equivalences between forward and backward views. The theorem allows us to find an efficient, fully equivalent online algorithm for a desired forward view. The theorem is as general as required and as specific as possible for all applications of it in this paper, and in its current form it is limited to forward views for which an O(n) backward view exists. The theorem can be generalized further, to include recursive (off-policy) LSTD(λ) (Boyan, 1999) and other algorithms that can be formulated in terms of forward views (cf. Geist & Scherrer, 2014; Dann, Neumann & Peters, 2014), but we did not investigate these extensions.

We used Theorem 1 to construct a new off-policy algorithm named true online GTD(λ), which is the first TD algorithm to have an exact online equivalence to an off-policy forward view. We constructed this forward view to maintain equivalence to the existing GTD(λ) algorithm for λ = 0. The forward view we proposed is not the only one possible, and in particular it will be interesting to investigate different methods of importance sampling. We could for instance use the importance sampling as proposed by Sutton et al. (2014). We did construct the resulting online algorithm, and in preliminary tests its performance was similar to true online GTD(λ). Likewise, if desired, it is possible to obtain a full online equivalence to off-policy Monte Carlo for λ = 1 by constructing a forward view that achieves this. For instance, we could use a similar forward view as used in this paper, but then apply the importance-sampling ratios only to the returns rather than to the errors. For now, it remains an open question what the best off-policy forward view is.

True online GTD(λ) is limited to state-value estimates. It is straightforward to construct a corresponding algorithm for action values, similar to the correspondence between GTD(λ) and GQ(λ) (Maei & Sutton, 2010; Maei, 2011) and between PTD(λ) and PQ(λ) (Sutton et al., 2014). We leave such an extension for future work.

Acknowledgments

The authors thank Joseph Modayil, Harm van Seijen and Adam White for fruitful discussions that helped improve the quality of this work. This work was supported by grants from Alberta Innovates Technology Futures, the National Science and Engineering Research Council of Canada, and the Alberta Innovates Centre for Machine Learning.

References

Boyan, J. A. (1999). Least-squares temporal difference learning. In Proceedings of the 16th International Conference on Machine Learning.

Dann, C., Neumann, G., & Peters, J. (2014). Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15.

Geist, M., & Scherrer, B. (2014). Off-policy learning with eligibility traces: A survey. Journal of Machine Learning Research, 15.

Maei, H. R., & Sutton, R. S. (2010). GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence. Atlantis Press.

Maei, H. R. (2011). Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta.

Precup, D., Sutton, R. S., & Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning. Morgan Kaufmann.

Rubinstein, R. Y. (1981). Simulation and the Monte Carlo Method. New York: Wiley.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., Mahmood, A. R., Precup, D., & van Hasselt, H. (2014). A new Q(λ) with interim forward view and Monte Carlo equivalence. In Proceedings of the 31st International Conference on Machine Learning. JMLR W&CP 32(2).

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, Cs., & Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM.

Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., & Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems.

Sutton, R. S., Szepesvári, Cs., & Maei, H. R. (2008). A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. In Advances in Neural Information Processing Systems 21. MIT Press.

van Seijen, H., & Sutton, R. S. (2014). True online TD(λ). In Proceedings of the 31st International Conference on Machine Learning. JMLR W&CP 32(1).


More information

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Kriging Models Predicing Arazine Concenraions in Surface Waer Draining Agriculural Waersheds Paul L. Mosquin, Jeremy Aldworh, Wenlin Chen Supplemenal Maerial Number

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin ACE 56 Fall 005 Lecure 4: Simple Linear Regression Model: Specificaion and Esimaion by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Simple Regression: Economic and Saisical Model

More information

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle Chaper 2 Newonian Mechanics Single Paricle In his Chaper we will review wha Newon s laws of mechanics ell us abou he moion of a single paricle. Newon s laws are only valid in suiable reference frames,

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION DOI: 0.038/NCLIMATE893 Temporal resoluion and DICE * Supplemenal Informaion Alex L. Maren and Sephen C. Newbold Naional Cener for Environmenal Economics, US Environmenal Proecion

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Noes for EE7C Spring 018: Convex Opimizaion and Approximaion Insrucor: Moriz Hard Email: hard+ee7c@berkeley.edu Graduae Insrucor: Max Simchowiz Email: msimchow+ee7c@berkeley.edu Ocober 15, 018 3

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

Principle of Least Action

Principle of Least Action The Based on par of Chaper 19, Volume II of The Feynman Lecures on Physics Addison-Wesley, 1964: pages 19-1 hru 19-3 & 19-8 hru 19-9. Edwin F. Taylor July. The Acion Sofware The se of exercises on Acion

More information

) were both constant and we brought them from under the integral.

) were both constant and we brought them from under the integral. YIELD-PER-RECRUIT (coninued The yield-per-recrui model applies o a cohor, bu we saw in he Age Disribuions lecure ha he properies of a cohor do no apply in general o a collecion of cohors, which is wha

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

Unsteady Flow Problems

Unsteady Flow Problems School of Mechanical Aerospace and Civil Engineering Unseady Flow Problems T. J. Craf George Begg Building, C41 TPFE MSc CFD-1 Reading: J. Ferziger, M. Peric, Compuaional Mehods for Fluid Dynamics H.K.

More information

5.1 - Logarithms and Their Properties

5.1 - Logarithms and Their Properties Chaper 5 Logarihmic Funcions 5.1 - Logarihms and Their Properies Suppose ha a populaion grows according o he formula P 10, where P is he colony size a ime, in hours. When will he populaion be 2500? We

More information

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé Bias in Condiional and Uncondiional Fixed Effecs Logi Esimaion: a Correcion * Tom Coupé Economics Educaion and Research Consorium, Naional Universiy of Kyiv Mohyla Academy Address: Vul Voloska 10, 04070

More information

EXERCISES FOR SECTION 1.5

EXERCISES FOR SECTION 1.5 1.5 Exisence and Uniqueness of Soluions 43 20. 1 v c 21. 1 v c 1 2 4 6 8 10 1 2 2 4 6 8 10 Graph of approximae soluion obained using Euler s mehod wih = 0.1. Graph of approximae soluion obained using Euler

More information

Single and Double Pendulum Models

Single and Double Pendulum Models Single and Double Pendulum Models Mah 596 Projec Summary Spring 2016 Jarod Har 1 Overview Differen ypes of pendulums are used o model many phenomena in various disciplines. In paricular, single and double

More information

GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

Deep Learning: Theory, Techniques & Applications - Recurrent Neural Networks -

Deep Learning: Theory, Techniques & Applications - Recurrent Neural Networks - Deep Learning: Theory, Techniques & Applicaions - Recurren Neural Neworks - Prof. Maeo Maeucci maeo.maeucci@polimi.i Deparmen of Elecronics, Informaion and Bioengineering Arificial Inelligence and Roboics

More information

( ) a system of differential equations with continuous parametrization ( T = R + These look like, respectively:

( ) a system of differential equations with continuous parametrization ( T = R + These look like, respectively: XIII. DIFFERENCE AND DIFFERENTIAL EQUATIONS Ofen funcions, or a sysem of funcion, are paramerized in erms of some variable, usually denoed as and inerpreed as ime. The variable is wrien as a funcion of

More information

A Reinforcement Learning Approach for Collaborative Filtering

A Reinforcement Learning Approach for Collaborative Filtering A Reinforcemen Learning Approach for Collaboraive Filering Jungkyu Lee, Byonghwa Oh 2, Jihoon Yang 2, and Sungyong Park 2 Cyram Inc, Seoul, Korea jklee@cyram.com 2 Sogang Universiy, Seoul, Korea {mrfive,yangjh,parksy}@sogang.ac.kr

More information

Lecture 4 Kinetics of a particle Part 3: Impulse and Momentum

Lecture 4 Kinetics of a particle Part 3: Impulse and Momentum MEE Engineering Mechanics II Lecure 4 Lecure 4 Kineics of a paricle Par 3: Impulse and Momenum Linear impulse and momenum Saring from he equaion of moion for a paricle of mass m which is subjeced o an

More information

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model Modal idenificaion of srucures from roving inpu daa by means of maximum likelihood esimaion of he sae space model J. Cara, J. Juan, E. Alarcón Absrac The usual way o perform a forced vibraion es is o fix

More information

10. State Space Methods

10. State Space Methods . Sae Space Mehods. Inroducion Sae space modelling was briefly inroduced in chaper. Here more coverage is provided of sae space mehods before some of heir uses in conrol sysem design are covered in he

More information

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details!

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details! MAT 257, Handou 6: Ocober 7-2, 20. I. Assignmen. Finish reading Chaper 2 of Spiva, rereading earlier secions as necessary. handou and fill in some missing deails! II. Higher derivaives. Also, read his

More information

1 Differential Equation Investigations using Customizable

1 Differential Equation Investigations using Customizable Differenial Equaion Invesigaions using Cusomizable Mahles Rober Decker The Universiy of Harford Absrac. The auhor has developed some plaform independen, freely available, ineracive programs (mahles) for

More information

Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence

Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence Supplemen for Sochasic Convex Opimizaion: Faser Local Growh Implies Faser Global Convergence Yi Xu Qihang Lin ianbao Yang Proof of heorem heorem Suppose Assumpion holds and F (w) obeys he LGC (6) Given

More information

Ordinary dierential equations

Ordinary dierential equations Chaper 5 Ordinary dierenial equaions Conens 5.1 Iniial value problem........................... 31 5. Forward Euler's mehod......................... 3 5.3 Runge-Kua mehods.......................... 36

More information

18 Biological models with discrete time

18 Biological models with discrete time 8 Biological models wih discree ime The mos imporan applicaions, however, may be pedagogical. The elegan body of mahemaical heory peraining o linear sysems (Fourier analysis, orhogonal funcions, and so

More information

Chapter 8 The Complete Response of RL and RC Circuits

Chapter 8 The Complete Response of RL and RC Circuits Chaper 8 The Complee Response of RL and RC Circuis Seoul Naional Universiy Deparmen of Elecrical and Compuer Engineering Wha is Firs Order Circuis? Circuis ha conain only one inducor or only one capacior

More information

Notes for Lecture 17-18

Notes for Lecture 17-18 U.C. Berkeley CS278: Compuaional Complexiy Handou N7-8 Professor Luca Trevisan April 3-8, 2008 Noes for Lecure 7-8 In hese wo lecures we prove he firs half of he PCP Theorem, he Amplificaion Lemma, up

More information

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17

Designing Information Devices and Systems I Spring 2019 Lecture Notes Note 17 EES 16A Designing Informaion Devices and Sysems I Spring 019 Lecure Noes Noe 17 17.1 apaciive ouchscreen In he las noe, we saw ha a capacior consiss of wo pieces on conducive maerial separaed by a nonconducive

More information

IMPLICIT AND INVERSE FUNCTION THEOREMS PAUL SCHRIMPF 1 OCTOBER 25, 2013

IMPLICIT AND INVERSE FUNCTION THEOREMS PAUL SCHRIMPF 1 OCTOBER 25, 2013 IMPLICI AND INVERSE FUNCION HEOREMS PAUL SCHRIMPF 1 OCOBER 25, 213 UNIVERSIY OF BRIISH COLUMBIA ECONOMICS 526 We have exensively sudied how o solve sysems of linear equaions. We know how o check wheher

More information

Lecture 4 Notes (Little s Theorem)

Lecture 4 Notes (Little s Theorem) Lecure 4 Noes (Lile s Theorem) This lecure concerns one of he mos imporan (and simples) heorems in Queuing Theory, Lile s Theorem. More informaion can be found in he course book, Bersekas & Gallagher,

More information

= ( ) ) or a system of differential equations with continuous parametrization (T = R

= ( ) ) or a system of differential equations with continuous parametrization (T = R XIII. DIFFERENCE AND DIFFERENTIAL EQUATIONS Ofen funcions, or a sysem of funcion, are paramerized in erms of some variable, usually denoed as and inerpreed as ime. The variable is wrien as a funcion of

More information