Bias in Natural Actor-Critic Algorithms


Philip S. Thomas
Department of Computer Science, University of Massachusetts, Amherst, MA, USA
Technical Report UM-CS

Abstract

We show that two popular discounted reward natural actor-critics, NAC-LSTD and eNAC, follow biased estimates of the natural policy gradient. We derive the first unbiased discounted reward natural actor-critics using batch and iterative approaches to gradient estimation, and prove their convergence to globally optimal policies for discrete problems and locally optimal policies for continuous problems. Finally, we argue that the bias makes the existing algorithms more appropriate for the average reward setting.

1. Introduction

We show that two popular discounted reward natural actor-critics, NAC-LSTD and eNAC (Peters & Schaal, 2008), do not produce unbiased estimates of the natural policy gradient as purported. We prove that, for a set of Markov decision processes, these biased discounted reward natural actor-critics are actually unbiased average reward natural actor-critics, even though they use estimates of discounted reward value functions. Another algorithm, INAC (Degris et al., 2012), which is a variant of the NTD algorithm (Morimura et al., 2005), was originally presented as a biased discounted reward algorithm. We suggest that it is more appropriate to think of it as an average reward algorithm. We derive the unbiased discounted reward NAC-LSTD, eNAC, and NAC-S algorithms, where NAC-S is a linear-time algorithm similar to NTD and INAC. We prove that unbiased policy gradient and natural policy gradient algorithms, like those presented, are convergent to globally optimal policies for discrete problems. However, the unbiased discounted reward algorithms suffer from updates that rapidly decay to zero, which causes poor data efficiency.

2. Problem

We are interested in the problem of finding optimal decision rules, or policies, for sequential decision tasks formulated as Markov decision processes (MDPs). An MDP is a tuple, M = (S, A, P, R, d_0, γ). S and A denote the sets of possible states and actions, which may be countable (discrete) or uncountable (continuous).[1] P is called the transition function, where P_{ss'}^a = Pr(s_{t+1} = s' | s_t = s, a_t = a), where t ∈ N_0 denotes the time step, s, s' ∈ S, and a ∈ A. R is the reward function, where R_{s_t}^{a_t} = r_t, where s_t ∈ S, a_t ∈ A, and r_t ∈ [−r_max, r_max] for some uniformly bounding constant r_max. The initial state distribution is d_0, where d_0(s) = Pr(s_0 = s), and γ is a discount factor.

A policy, or stochastic policy, π ∈ Π, is a distribution over actions given a state: π(s, a) = Pr(a_t = a | s_t = s), where Π is the set of all possible policies. A parameterized policy µ with parameters θ ∈ R^n is a function that maps its parameters to policies, i.e., µ : R^n → Π and µ(θ)(s, a) = Pr(a_t = a | s_t = s, θ_t = θ). For brevity, we write µ_θ for µ(θ). We assume that, for all s, a, and θ, µ_θ(s, a) is differentiable with respect to θ.

The state value function, V^π, for a policy π, is a function mapping states to the expected sum of discounted rewards (or expected return) that would be accrued therefrom if π were executed on M. That is, V^π(s) = E[∑_{t=0}^∞ γ^t r_t | s_0 = s, π, M].[2] Similarly, the state-action value function is Q^π(s, a) = E[∑_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a, π, M]. The discounted state distribution, d^π, gives the probability of each state under policy π, with a discount applied to states that occur at later times: d^π(s) = (1 − γ) ∑_{t=0}^∞ γ^t Pr(s_t = s | s_0, π, M). The objective functional, J, gives the expected discounted return for running the provided policy on M for one episode: J(π) = E[∑_{t=0}^∞ γ^t r_t | π, M], where an episode is one sequence of states, actions, and rewards, starting from a state sampled from d_0 and following the dynamics specified by P and R.

[1] We abuse notation by writing summations and probabilities over S and A. If these sets are continuous, the summations and probabilities should be replaced with integrals and probability densities.
[2] To avoid clutter, we may suppress a function's dependencies on M. For example, V^π is a function of M.
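As a concrete illustration of the objective functional J, the following sketch estimates J(π) by averaging sampled discounted returns over episodes. The env and policy interfaces (reset, step, sample) are illustrative assumptions standing in for M = (S, A, P, R, d_0, γ) and π; they are not part of the paper.

    import numpy as np

    def estimate_J(env, policy, gamma, n_episodes=1000, max_steps=1000):
        """Monte Carlo estimate of J(pi) = E[sum_t gamma^t r_t | pi, M]."""
        returns = []
        for _ in range(n_episodes):
            s = env.reset()                   # s_0 ~ d_0
            G, discount = 0.0, 1.0
            for _ in range(max_steps):
                a = policy.sample(s)          # a_t ~ pi(s_t, .)
                s, r, terminal = env.step(a)  # dynamics P and reward R
                G += discount * r
                discount *= gamma
                if terminal:
                    break
            returns.append(G)
        return float(np.mean(returns))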

We call an MDP episodic if there are one or more states in which the process terminates, and, for all policies, every episode reaches a terminal state within a finite number of steps. To model episodic MDPs in a unified manner with non-episodic MDPs, we follow the formulation specified by Sutton & Barto (1998), in which only one action is admissible in a terminal state, and it causes a transition to an absorbing state with zero reward, which we call the post-terminal absorbing state. This absorbing state also has only one admissible action, which causes a self-transition with zero reward. We allow γ ∈ [0, 1], where γ = 1 only when the MDP is episodic.[3]

If S and A are countable, then the goal is to find an optimal policy, π*, which maximizes the objective functional: π* ∈ argmax_{π ∈ Π} J(π). If S or A is continuous, we search for locally optimal policy parameters, θ*, that is, parameters satisfying ∇J(θ*) = 0, where J = J ∘ µ, and where we assume ∇J is Lipschitz.

3. Policy Gradient

Gradient ascent algorithms for maximizing the objective functional are called policy gradient algorithms. Their basic update is θ_{t+1} ← θ_t + α_t ∇J(θ_t), where {α_t} is a scalar step size schedule. Policy gradient methods may also use unbiased estimates of the gradient, making them stochastic gradient ascent algorithms. Stochastic gradient ascent is guaranteed to converge to a local maximum if ∇J is Lipschitz, ∑_t α_t = ∞, and ∑_t α_t² < ∞ (Bertsekas & Tsitsiklis, 2000). We assume that all step size schedules hereafter satisfy these constraints.

The policy gradient, ∇J(θ), is the direction Δθ that maximizes J(θ + Δθ) under the constraint that ‖Δθ‖² = ε², for sufficiently small ε, where ‖·‖ denotes the Euclidean (L2) norm. Amari (1998) suggested that Riemannian distance may be a more appropriate metric than Euclidean distance for parameter space. He calls the direction satisfying this modified constraint the natural gradient. Kakade (2002) suggested the application of natural gradients to policy gradients to get the natural policy gradient. Bagnell & Schneider (2003) then derived a proper Riemannian distance metric,[4] based on Amari's and Kakade's work, and showed that the natural policy gradient is covariant. Bhatnagar et al. (2009) built on this foundation to create several provably convergent policy gradient and natural policy gradient algorithms for the average reward setting.

At this point, it was known that if

    ∑_s d^π(s) ∑_a µ_θ(s, a) [Q^{µ_θ}(s, a) − f_ϖ(s, a)] ∂f_ϖ(s, a)/∂ϖ = 0,    (1)

where f_ϖ(s, a) is a linear function approximator with parameter vector ϖ = [w^T, v^T]^T, |w| = |θ|, feature vector ψ_{s,a} = [(∂ log µ_θ(s, a)/∂θ)^T, φ(s)^T]^T for arbitrary uniformly bounded φ(s), and f_ϖ(s, a) = ϖ^T ψ_{s,a}, then the natural policy gradient is ∇̃J(θ) = w (Sutton et al., 2000; Kakade, 2002).[5] The challenge was then to devise methods for finding w satisfying Equation 1.

[3] If γ = 1, every episode reaches a terminal state within some finite time, T, so d^π(s) sums to T.
[4] Recent work has proposed the use of a different metric that accounts not only for how the distribution over actions (the policy) changes as the parameters change, but also for how the state distribution changes as the parameters change (Morimura et al., 2009).
[5] Notice that if φ(s) = 0, we can drop v from Equation 1 to get the exact constraint specified by Sutton et al. (2000). Equation 1 follows immediately since ∑_a (∂µ_θ(s, a)/∂θ) v^T φ(s) = 0 for all s, µ_θ, φ, ϖ, and M. Also, for simplicity later, we assume that φ(s) = 0 for the post-terminal absorbing state.
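To make Equation 1 concrete, the sketch below builds the compatible feature vector ψ_{s,a} = [(∂ log µ_θ(s, a)/∂θ)^T, φ(s)^T]^T for a tabular Gibbs (softmax) policy over discrete states and actions. The softmax parameterization and the indicator choice of φ are illustrative assumptions, not prescribed by the paper.

    import numpy as np

    def softmax_policy(theta, s):
        """mu_theta(s, a) proportional to exp(theta[s, a])."""
        prefs = theta[s] - np.max(theta[s])
        p = np.exp(prefs)
        return p / p.sum()

    def grad_log_mu(theta, s, a):
        """d log mu_theta(s, a) / d theta for the tabular softmax policy."""
        g = np.zeros_like(theta)
        g[s] = -softmax_policy(theta, s)
        g[s, a] += 1.0
        return g.ravel()

    def compatible_features(theta, s, a, n_states):
        """psi_{s,a} = [(d log mu / d theta)^T, phi(s)^T]^T with indicator phi."""
        phi = np.zeros(n_states)
        phi[s] = 1.0
        return np.concatenate([grad_log_mu(theta, s, a), phi])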
4. Finding w

To satisfy Equation 1, Sutton et al. (2000), working in the v = 0 setting, suggest letting f_ϖ : S × A → R be an approximation to Q^{µ_θ} with parameter vector ϖ = w. They claim that learning f_ϖ by following µ_θ and updating ϖ by a rule such as Δϖ_t ∝ ∂/∂ϖ [Q̂^{µ_θ}(s_t, a_t) − f_ϖ(s_t, a_t)]², where Q̂^{µ_θ}(s, a) is some unbiased estimate of Q^{µ_θ}(s, a), will result in a satisfactory w. However, this is only true for the average reward setting, or the discounted setting when γ = 1, because, in the discounted setting, d^π in Equation 1 is the discounted weighting of states encountered, whereas the states observed when merely following µ_θ come from the undiscounted state distribution.

Peters & Schaal (2006; 2008) observed that the scheme proposed by Sutton et al. (2000) is a forward TD(1) algorithm. Because forward and backward TD(λ) are approximately equivalent, they suggest using least squares temporal difference learning (LSTD), a backward TD(λ) method, to approximate Q^{µ_θ} with f_ϖ, where λ = 1. They call the resulting algorithms the natural actor-critic using LSTD (NAC-LSTD) and the episodic natural actor-critic (eNAC). Because the scheme proposed by Sutton et al., and thus TD(1), does not incorporate the γ^t weighting in the discounted state distribution, this results in w that do not satisfy Equation 1, and thus a bias in the natural policy gradient estimates.
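The discrepancy described above can be checked empirically: counting visits while following µ_θ estimates the undiscounted state distribution, whereas d^π weights a visit at time t by (1 − γ)γ^t. A minimal sketch, using the same hypothetical env and policy interfaces as before:

    import numpy as np

    def state_distributions(env, policy, gamma, n_states, n_episodes=1000, max_steps=200):
        """Empirical undiscounted visitation vs. discounted state distribution d^pi."""
        undiscounted = np.zeros(n_states)
        discounted = np.zeros(n_states)
        for _ in range(n_episodes):
            s = env.reset()
            for t in range(max_steps):
                undiscounted[s] += 1.0                      # plain visit count
                discounted[s] += (1.0 - gamma) * gamma**t   # gamma^t weighting in d^pi
                a = policy.sample(s)
                s, r, terminal = env.step(a)
                if terminal:
                    break
        return undiscounted / undiscounted.sum(), discounted / discounted.sum()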

One solution would be to convert the discounted MDP into an equivalent undiscounted MDP, as described in Section 2.3 of Bertsekas & Tsitsiklis (1996). To do this, each observed trajectory must be truncated after each transition with probability 1 − γ. Notice that NAC-LSTD is not biased when γ = 1 because then the discounted and undiscounted state distributions are identical.[6] So, after the trajectories are truncated, the existing NAC-LSTD algorithm could be used with γ = 1 to find a policy for the original MDP. However, this approach may discard a significant amount of data when truncating episodes.

[6] It is unclear whether eNAC would be unbiased in this situation, as described in Section 7.

Instead, we propose the use of all of the observed data with proper discounting in order to produce unbiased gradient estimates. We present a new objective functional, H, and prove that the local minima of this objective give w satisfying Equation 1. We then provide the stochastic gradient descent update for this objective. When following µ_θ, the discounting from the discounted state distribution can be shifted into the objective functional in order to properly satisfy Equation 1. We select the w that is a component of a local minimum of the objective functional H:

    H(ϖ) = ∑_{t=0}^∞ ∑_s Pr(s_t = s | M, µ_θ) ∑_a µ_θ(s, a) [γ^t (Q^{µ_θ}(s, a) − f_ϖ(s, a))²]    (2)
         = ∑_{t=0}^∞ E_{s_t, a_t} [γ^t (Q̂^{µ_θ}(s_t, a_t) − f_ϖ(s_t, a_t))²].

The objective functional is always finite because either γ < 1 or the MDP is episodic. If the MDP is episodic, it must enter the post-terminal absorbing state within a finite number of steps. In this state, ψ_{s,a} = 0 and Q^π(s, a) = 0 for all π and the one admissible a, so ∑_a µ_θ(s, a) γ^t (Q^{µ_θ}(s, a) − f_ϖ(s, a))² = 0 for all ϖ. Hence, if the MDP is episodic, only a finite number of terms in the infinite sum will be non-zero. We propose performing stochastic gradient descent on H to obtain a local minimum where ∇H(ϖ) = 0, so

    ∑_{t=0}^∞ γ^t ∑_s Pr(s_t = s | M, µ_θ) ∑_a µ_θ(s, a) [Q^{µ_θ}(s, a) − f_ϖ(s, a)] ∂f_ϖ(s, a)/∂ϖ = 0.    (3)

By the definition of d^π, this is equivalent to Equation 1. Hence, when gradient descent on H has converged, the resulting w component of ϖ satisfies Equation 1. Notice that the expectation in Equation 2 is over the observed probabilities of states and actions at time t if executing µ_θ on M. Hence, we can update ϖ via stochastic gradient descent:

    ϖ ← ϖ + η γ^t [Q̂^{µ_θ}(s_t, a_t) − f_ϖ(s_t, a_t)] ∂f_ϖ(s_t, a_t)/∂ϖ,    (4)

where Q̂^{µ_θ} is an unbiased estimate of Q^{µ_θ} and η is a step size satisfying the typical decay constraints. The substitution of Q̂^{µ_θ} for Q^{µ_θ} does not influence convergence (Bertsekas & Tsitsiklis, 2000). Because ∂f_ϖ(s, a)/∂ϖ is zero for terminal states and the post-terminal absorbing state, the above update need only be performed for the pre-terminal states. With v = 0, this differs from the method proposed by Sutton et al. (2000) only by the sum over time and the γ^t term.

5. Algorithms

A simple algorithm to find w would be to execute episodes and then perform the updates in Equation 4 using the Monte Carlo return, Q̂^{µ_θ}(s_t, a_t) = ∑_{τ=0}^∞ γ^τ r_{t+τ}, as the unbiased estimate of Q^{µ_θ}(s_t, a_t). This is a forward TD(1) algorithm, with an additional discount applied to updates based on the time at which they occur. However, this algorithm requires that entire trajectories be stored in memory. To overcome this, we can derive the equivalent backward update by following Sutton and Barto's derivation of backward TD(λ) (Sutton & Barto, 1998). The resulting on-policy backward algorithm for estimating Q^{µ_θ} for a fixed µ_θ is:

    e_{t+1} = γλ e_t + γ^t ∂f_ϖ(s_t, a_t)/∂ϖ    (5)
    δ_t = r_t + γ f_ϖ(s_{t+1}, a_{t+1}) − f_ϖ(s_t, a_t)    (6)
    ϖ_{t+1} = ϖ_t + η_t δ_t e_{t+1},    (7)

where λ is the decay parameter for eligibility traces, as in TD(λ), and s_t, a_t, and r_t come from running µ_θ on M.
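A minimal sketch of the backward update in Equations 5–7 for a linear f_ϖ(s, a) = ϖ^T ψ_{s,a}; the psi feature function, the step size handling, and the env/policy interfaces are illustrative assumptions.

    import numpy as np

    def discounted_critic_episode(env, policy, psi, varpi, gamma, lam, eta):
        """One episode of the gamma^t-discounted backward TD(lambda) critic (Eqs. 5-7)."""
        s = env.reset()
        a = policy.sample(s)
        e = np.zeros_like(varpi)                                    # eligibility trace e_t
        t, terminal = 0, False
        while not terminal:
            s_next, r, terminal = env.step(a)
            a_next = policy.sample(s_next)
            e = gamma * lam * e + gamma**t * psi(s, a)              # Eq. 5 (df/dvarpi = psi)
            f_next = 0.0 if terminal else varpi @ psi(s_next, a_next)
            delta = r + gamma * f_next - varpi @ psi(s, a)          # Eq. 6
            varpi = varpi + eta * delta * e                         # Eq. 7
            s, a, t = s_next, a_next, t + 1
        return varpi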
Although the backward and forward algorithms are only approximately equivalent (Sutton & Barto, 1998), their convergence guarantees are the same (Bertsekas & Tsitsiklis, 1996). Hence, if λ = 1 and η_t is decayed appropriately, the modified backward TD(λ) algorithm above will produce w satisfying Equation 1. The only difference between this algorithm and Sarsa(λ) is the γ^t in Equation 5. One can then reproduce the work of Bradtke & Barto (1996) to create LSTD in this new setting, which approximates V^{µ_θ} in a least squares manner. This can be extended along the lines of Lagoudakis & Parr (2001) to create LSQ, which approximates Q^{µ_θ} in a least squares manner. The resulting LSQ algorithm in Peters and Schaal's NAC-LSTD changes only by the introduction of a γ^t term: z_{t+1} = λ z_t + γ^t φ̂_t. We omit the complete pseudocode for NAC-LSTD due to space constraints.

To create an episodic algorithm, we convert Equation 1 into a system of linear equations using the assumption that all episodes terminate within T steps, for some finite number T. We rewrite Equation 1 by replacing the infinite sum in d^π with a finite one, because ∂f_ϖ(s, a)/∂ϖ is zero for absorbing states:

    ∑_s ∑_a µ_θ(s, a) ∑_{t=0}^T γ^t Pr(s_t = s) (Q^{µ_θ}(s, a) − ϖ^T ψ_{s,a}) ψ_{s,a} = 0.    (8)

By collecting the terms with ϖ on the left and the others on the right, we get

    [∑_{s,a} ∑_{t=0}^T Pr(s_t = s) µ_θ(s, a) γ^t ψ_{s,a} ψ_{s,a}^T] ϖ = b,    (9)

where b = ∑_{s,a} ∑_{t=0}^T Pr(s_t = s) µ_θ(s, a) γ^t Q^{µ_θ}(s, a) ψ_{s,a}. If we let A = ∑_{s,a} ∑_{t=0}^T Pr(s_t = s) µ_θ(s, a) γ^t ψ_{s,a} ψ_{s,a}^T, then we get the system of linear equations Aϖ = b, where A is a |ψ| by |ψ| square matrix. We can then generate unbiased estimates of A and b from sample trajectories. As the number of observed trajectories grows, our estimates of A and b converge to their true values, giving an unbiased estimate of the natural gradient. The resulting episodic natural actor-critic algorithm, eNAC2, is presented in Algorithm 1.

Algorithm 1: episodic Natural Actor-Critic 2 (eNAC2)
1: Input: MDP M, parameterized policy µ_θ(s, a) with initial parameters θ, basis function φ(s) for the state-value estimation, update frequency parameter k, discount parameter γ, decay constant β, learning rate schedule {η_t}, and maximum episode duration T.
2: A ← 0; b ← 0; τ ← 0
3: for ep = 0, 1, 2, ... do
4:   Run an episode and remember the trajectory, {s_t, a_t, s_{t+1}, r_t}, t ∈ [0, T − 1].
5:   Update Statistics:
6:   A ← A + ∑_{t=0}^T f(t) ψ_{s_t, a_t} ψ_{s_t, a_t}^T
7:   b ← b + ∑_{t=0}^T f(t) ψ_{s_t, a_t} ∑_{t̂=t}^T γ^{t̂−t} r_{t̂}
8:   [w_ep^T, v_ep^T]^T = (A^T A)^{−1} A^T b   // If Type2, this need only be done every k episodes.
9:   Update Actor (Natural Policy Gradient):
10:  if (Type1, ep − k ≥ 0, and the angle between w_ep and w_{ep−k} is at most ε) or
11:     (Type2 and (ep + 1) mod k = 0) then
12:    θ ← θ + η_τ w_ep / ‖w_ep‖_2
13:    τ ← τ + 1; A ← βA; b ← βb

For both algorithms presented, the user must select either Type1 or Type2 updates. In the former, which emulates the update scheme proposed by Peters & Schaal (2008), the policy is updated when the gradient estimate has converged, while in the latter, which emulates the two-timescale update scheme proposed by Bhatnagar et al. (2009), the policy is updated after a constant number of time steps. The user must also select f(t) = γ^t to get the unbiased algorithms or f(t) = 1 to get the biased algorithms. The unbiased algorithms are only truly unbiased when λ = 1, β = 0 (if β is present), and ε → 0 (Type1) or k → ∞ (Type2), in which case they compute and ascend the exact natural policy gradient.
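The statistics update in Algorithm 1 amounts to accumulating sample estimates of A and b (Equation 9) from each trajectory and solving the resulting least squares system for [w^T, v^T]^T. A minimal batch sketch; the trajectory format and the psi feature function are illustrative assumptions.

    import numpy as np

    def enac2_gradient_estimate(trajectories, psi, gamma, f):
        """Estimate A and b from trajectories and solve A varpi = b (Eqs. 8-9).
        Each trajectory is a list of (s_t, a_t, r_t); f(t) = gamma**t (unbiased) or 1 (biased)."""
        dim = len(psi(*trajectories[0][0][:2]))
        A = np.zeros((dim, dim))
        b = np.zeros(dim)
        for traj in trajectories:
            rewards = [r for (_, _, r) in traj]
            for t, (s, a, _) in enumerate(traj):
                ps = psi(s, a)
                G_t = sum(gamma**k * rewards[t + k] for k in range(len(traj) - t))
                A += f(t) * np.outer(ps, ps)
                b += f(t) * G_t * ps
        varpi = np.linalg.lstsq(A, b, rcond=None)[0]   # least squares solve, as in line 8
        return varpi                                   # the first |theta| entries are w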
NAC-LSTD and eNAC2 have computational complexity proportional to |ϖ|² per time step just to update statistics, and |ϖ|³ to compute the natural policy gradient estimate for a policy improvement step. This complexity can be improved to linear by using the modified Sarsa(λ) algorithm in place of LSTD to find w satisfying Equation 1. We call the resulting algorithm the Natural Actor-Critic using Sarsa(λ), or NAC-S. Notice that some mean-zero terms can be removed from the Sarsa(λ) update, and the resulting algorithm, provided in Algorithm 2, can be viewed as the discounted reward and eligibility trace extension of the Natural-Gradient Actor-Critic with Advantage Parameters (Bhatnagar et al., 2009).[7] NAC-S can also be viewed as INAC (Degris et al., 2012) or NTD (Morimura et al., 2005) corrected to include the γ^t term, and with the option of computing exact gradient estimates or using two-timescales.

[7] To get Bhatnagar's algorithm, select Type2 updates with k = 1, f(t) = 1, and replace the discounted TD error with the average reward TD error.

Algorithm 2: Natural Actor-Critic using Sarsa(λ) — NAC-S(λ)
1: Input: MDP M, parameterized policy µ_θ(s, a) with initial parameters θ, basis function φ(s) for the state-value estimation, update frequency parameter k, discount parameter γ, eligibility decay rate λ, and learning rate schedules {α_t^w}, {α_t^v}, and {η_t}.
2: w_0 ← 0; v_0 ← 0; count ← 0
3: for episode = 0, 1, 2, ... do
4:   Draw initial state s_0 ~ d_0(·)
5:   e^w_{−1} = 0; e^v_{−1} = 0; τ_1 = 0; τ_2 = 0
6:   for t = 0, 1, 2, ... do
7:     a_t ~ µ_θ(s_t, ·); s_{t+1} ~ P(s_t, a_t, ·); r_t = R_{s_t}^{a_t}
8:     count ← count + 1
9:     Update Critic (Sarsa):
10:    δ_t = r_t + γ v_t^T φ(s_{t+1}) − v_t^T φ(s_t)
11:    e^w_t = γλ e^w_{t−1} + f(t) [∂ log µ_θ(s_t, a_t)/∂θ]
12:    e^v_t = γλ e^v_{t−1} + f(t) φ(s_t)
13:    w_{t+1} = w_t + α^w_{t−τ_1} [δ_t − w_t^T (∂ log µ_θ(s_t, a_t)/∂θ)] e^w_t
14:    v_{t+1} = v_t + α^v_{t−τ_1} δ_t e^v_t
15:    Update Actor (Natural Policy Gradient):
16:    if (Type1, t − k ≥ 0, and the angle between w_t and w_{t−k} is at most ε) or
17:       (Type2 and (count mod k = 0)) then
18:      θ ← θ + η_{τ_2} w_{t+1} / ‖w_{t+1}‖_2; τ_1 = t; τ_2 = τ_2 + 1
19:    if s_{t+1} is terminal then break out of the loop over t

Notice that in all algorithms presented in this paper, the natural gradient is normalized. This normalization is optional. It may void convergence guarantees, and it often makes it difficult to achieve empirical convergence. However, in practice we find it easier to find a fixed step size that works on difficult problems when using normalized updates to θ. Amari defined the natural gradient only as a direction, and even discarded a scaling constant in his derivation of a closed form for the natural gradient (Amari, 1998).
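A compact sketch of the NAC-S(λ) inner loop of Algorithm 2 with Type2 updates: the Sarsa-style critic with f(t)-weighted traces, followed by the normalized natural gradient actor step. The step size handling is simplified relative to the {α_t^w}, {α_t^v}, {η_t} schedules, and the env/policy/phi interfaces are illustrative assumptions.

    import numpy as np

    def nac_s_episode(env, policy, phi, theta, w, v, gamma, lam, f,
                      alpha_w, alpha_v, eta, k, count):
        """One episode of NAC-S(lambda) with Type2 updates (actor step every k samples)."""
        s = env.reset()
        e_w, e_v = np.zeros_like(w), np.zeros_like(v)
        t, terminal = 0, False
        while not terminal:
            a = policy.sample(s, theta)
            s_next, r, terminal = env.step(a)
            count += 1
            glm = policy.grad_log(s, a, theta)                # d log mu_theta(s, a) / d theta
            v_next = 0.0 if terminal else v @ phi(s_next)
            delta = r + gamma * v_next - v @ phi(s)           # critic TD error
            e_w = gamma * lam * e_w + f(t) * glm
            e_v = gamma * lam * e_v + f(t) * phi(s)
            w = w + alpha_w * (delta - w @ glm) * e_w
            v = v + alpha_v * delta * e_v
            if count % k == 0:                                # actor: normalized natural gradient step
                theta = theta + eta * w / (np.linalg.norm(w) + 1e-12)
            s, t = s_next, t + 1
        return theta, w, v, count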

6. Convergence

The natural actor-critics compute and ascend the natural gradient of J, and thus will converge to a locally optimal policy, at which point ∇J(θ) = 0, assuming the step size schedules are properly decayed and that the natural actor-critics' estimates of the natural gradient are unbiased (Amari, 1998). As stated previously, when λ = 1, β = 0 (if β is present), and ε → 0 (Type1) or k → ∞ (Type2), the natural gradient estimates will be exact. In practice, a large k or small ε and a small fixed step size usually result in convergence.

Policy gradient approaches are typically purported to have one significant drawback: whereas Q-based methods converge to globally optimal policies for problems with discrete states and actions, policy gradient algorithms can become stuck in arbitrarily bad local optima (e.g., Peters & Bagnell, 2010; Peters, 2010). We argue that with assumptions similar to those required by Q-learning and Sarsa, ascending the policy gradient results in convergence to a globally optimal policy as well.[8] First, we assume that S and A are countable and that every state-action pair is observed infinitely often. Second, we assume that for all θ, all states s, and all actions a and â, where a ≠ â, there is a direction dθ of change to θ that causes the probability of a in state s to increase while that of â decreases, while all other action probabilities remain unchanged. These two assumptions are satisfied by policy parameterizations such as tabular Gibbs softmax action selection (Sutton & Barto, 1998).

We argue that at all suboptimal θ, the policy gradient will be non-zero. For any policy that is not globally optimal, there exists a reachable state s for which increasing the probability of a specific action a while decreasing the probability of â would increase J (see Section 4.2 of Sutton & Barto (1998)). By our first assumption, this state-action pair is reached by the policy, and by our second assumption, there is a direction, dθ, of change to θ that can make exactly this change. So, the directional derivative of J at θ in the direction dθ is non-zero, and therefore the gradient of J at θ must also be non-zero. Hence, θ cannot be a local optimum.

Policy gradient is typically applied to problems with continuous state or action sets, in which case the assumptions above cannot be satisfied, so convergence to only a local optimum can be guaranteed. However, the above argument suggests that, in practice and on continuous problems, local optima can be avoided by increasing exploration and the representational power of the policy parameterization. However, if one desires a specific low-dimensional policy parameterization, such as a proportional-derivative controller with limited exploration, then increasing the exploration and representational power of the policy may not be an acceptable option, in which case local optima may be unavoidable.

[8] Notice that this applies to all algorithms that ascend the policy gradient or natural policy gradient.

7. Analysis of Biased Algorithms

In this section we analyze how the bias changes performance.
Recall that, without the correct discounting, w are the weights that minimize the squared error in the Q^{µ_θ} estimates, with states sampled from actual episodes. With the proper discounting, states that are visited at later times factor less into w. Because w will be the change to the policy parameters, this means that in the biased algorithms the change to the policy parameters considers states that are visited at later times just as much as states that are visited earlier. This suggests that the biased algorithms may be optimizing a different objective functional, similar to

    J̄(θ) = (1 − γ) ∑_s d̄^{µ_θ}(s) V^{µ_θ}(s),    (10)

where d̄^π is the stationary distribution of the Markov chain induced by the policy π. More formally, we assume d̄^π(s) = lim_{t→∞} Pr(s_t = s | s_0, π, M) exists and is independent of s_0 for all policies. Notice that J̄ is not interesting for episodic MDPs since, for all policies, d̄^π(s) is non-zero only for the post-terminal absorbing state. So, henceforth, our discussion is limited to the non-episodic setting. For comparison, we can write J in the same form: J(θ) = ∑_s d_0(s) V^{µ_θ}(s).

The original objective functional, J, gives the expected return from an episode. This means that for small γ, it barely considers the quality of the policy at states that are visited late in a trajectory. On the other hand, J̄ considers states based on their visitation frequency, regardless of when they are visited. Kakade (2001) showed that J̄, which includes discounting in V^{µ_θ}, is the typical average reward objective functional.

To see that the biased algorithms appear to optimize something closer to this average reward objective, consider an MDP with S = [0, 1], where s_0 = 0, s = 1 is terminal, s_{t+1} = s_t + c for a small constant c, and R_s^a = −(s − a)². The optimal policy is to select a_t = s_t. We parameterize the policy with one parameter, such that µ_θ selects actions a_t ~ N(θ, σ²) for all states, where N is a normal distribution with a small constant variance, σ². If γ = 1, the optimal parameter, θ*, is θ* = 0.5. Both the biased and unbiased algorithms converge to this θ*. However, for smaller γ (e.g., γ = 0.5), the optimal θ decreases in order to receive more reward initially. We found that the unbiased natural actor-critics properly converge to the new optimal θ, as does a simple hill-climbing algorithm that we implemented as a control.[9] However, the biased algorithms still converge to approximately θ = 0.5. We found that eNAC converges to θ that differ from those of all other algorithms when γ ≠ 1, which suggests that eNAC, but not eNAC2, may have additional bias. These results are presented in Figure 1.

[Figure 1: plot of the selected action as a function of the state, with reference lines Action = State and Action = 0.5.] Figure 1. The optimal policy (optimal), the actions selected by the biased NAC-LSTD, eNAC2, and INAC (biased), the actions selected by the unbiased NAC-LSTD, eNAC2, and NAC-S, as well as a random restart hill-climbing algorithm (unbiased), and the actions selected by eNAC (eNAC).

[9] We used random restarts for all methods and observed no local optima.
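For the one-parameter example above, the discounted objective can be evaluated in closed form: with a_t ~ N(θ, σ²) the expected per-step reward is −((s_t − θ)² + σ²), so the maximizer of J(θ) is the γ^t-weighted mean of the visited states. The sketch below assumes a concrete state increment of 0.05 purely for illustration; that constant is an assumption, not specified here.

    import numpy as np

    def optimal_theta(gamma, increment=0.05):
        """Maximizer of J(theta) = -sum_t gamma^t ((s_t - theta)^2 + sigma^2) for the
        1-D example MDP: the discounted-visitation-weighted mean of the states."""
        states = np.arange(0.0, 1.0 + 1e-9, increment)   # s_0 = 0 up to the terminal state
        weights = gamma ** np.arange(len(states))
        return float(np.sum(weights * states) / np.sum(weights))

    print(optimal_theta(gamma=1.0))   # 0.5: here the biased and unbiased algorithms agree
    print(optimal_theta(gamma=0.5))   # < 0.5: the unbiased algorithms track this smaller optimum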

This difference raises the question of whether the biased algorithms actually compute the natural policy gradient in the average reward setting. In the remainder of this section, we prove that they do whenever

    ∑_s V^{µ_θ}(s) ∂d̄^{µ_θ}(s)/∂θ = 0.    (11)

To derive Equation 11, we first review results concerning the average reward natural policy gradient. The typical objective for average reward learning is

    J̄(θ) = lim_{n→∞} (1/n) ∑_{t=0}^{n−1} E[r_t | µ_θ, M].    (12)

As mentioned previously, Kakade (2001) showed that this is equivalent to the definition in Equation 10. The state-action value function is defined as

    Q̄^{µ_θ}(s, a) = ∑_{t=0}^∞ E[r_t − J̄(θ) | s_0 = s, a_0 = a, µ_θ, M].    (13)

Kakade (2002) stated that if

    ∑_s d̄^π(s) ∑_a µ_θ(s, a) [Q̄^{µ_θ}(s, a) − f_ϖ(s, a)] ∂f_ϖ(s, a)/∂ϖ = 0,    (14)

then the natural gradient of J̄ is

    ∇̃J̄(θ) = w.    (15)

Thus, the unbiased average reward natural policy gradient is given by the w satisfying Equation 14. The biased algorithms perform stochastic gradient descent according to the scheme proposed by Sutton et al. (2000). They sample states, s, from d̄^{µ_θ} and actions, a, from µ_θ, and perform gradient descent on the squared difference between Q^{µ_θ}(s, a) and f_ϖ(s, a). Thus, they select w satisfying

    ∑_s d̄^{µ_θ}(s) ∑_a µ_θ(s, a) [Q^{µ_θ}(s, a) − f_ϖ(s, a)] ∂f_ϖ(s, a)/∂ϖ = 0.    (16)

Notice that Equation 16 uses the discounted state-action value function while Equation 14 uses the average reward state-action value function. To determine if and when the biased algorithms compute ∇̃J̄(θ), we must determine when a constant multiple of the solution to Equation 16 satisfies Equation 14. To do this, we solve Equation 16 for w and substitute a constant, k > 0, times these w into Equation 14 to generate a constraint that, when satisfied, results in the biased algorithms producing the same direction (but not necessarily magnitude) as the average reward natural policy gradient. When doing so, we assume that v = 0, since it does not influence the solution to either equation. First, we must establish a lemma that relates the policy gradient theorem using the average reward state distribution but the discounted reward state-action value function (the left hand side of Lemma 1) to the derivative of J̄ without proper application of the chain rule:

Lemma 1.  ∑_s d̄^{µ_θ}(s) ∑_a (∂µ_θ(s, a)/∂θ) Q^{µ_θ}(s, a) = (1 − γ) ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ, for all θ, µ, and M.

For a proof of Lemma 1, see the appendix.
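Lemma 1 can be verified numerically on a small random MDP by solving for V^{µ_θ} and Q^{µ_θ} exactly, estimating the stationary distribution, and taking finite-difference derivatives with respect to θ. Everything in the sketch below (the random MDP, the tabular softmax policy, the finite-difference step) is an illustrative assumption used only to check the identity.

    import numpy as np

    def lemma1_check(n_s=4, n_a=3, gamma=0.9, seed=0, eps=1e-6):
        rng = np.random.default_rng(seed)
        P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s']
        R = rng.normal(size=(n_s, n_a))                    # R[s, a]
        theta = rng.normal(size=(n_s, n_a))                # tabular softmax parameters

        def pi(th):                                        # mu_theta(s, a)
            e = np.exp(th - th.max(axis=1, keepdims=True))
            return e / e.sum(axis=1, keepdims=True)

        def V(th):                                         # solve V = sum_a mu (R + gamma P V)
            p = pi(th)
            P_pi = np.einsum('sa,sat->st', p, P)
            r_pi = np.einsum('sa,sa->s', p, R)
            return np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

        p = pi(theta)
        d_bar = np.linalg.matrix_power(np.einsum('sa,sat->st', p, P), 1000)[0]  # stationary dist.
        Q = R + gamma * P @ V(theta)                       # Q[s, a]

        lhs = np.zeros_like(theta)                         # sum_s d_bar sum_a dmu/dtheta Q
        rhs = np.zeros_like(theta)                         # (1 - gamma) sum_s d_bar dV/dtheta
        for i in range(n_s):
            for j in range(n_a):
                dth = np.zeros_like(theta)
                dth[i, j] = eps
                dmu = (pi(theta + dth) - pi(theta - dth)) / (2 * eps)
                dV = (V(theta + dth) - V(theta - dth)) / (2 * eps)
                lhs[i, j] = np.sum(d_bar[:, None] * dmu * Q)
                rhs[i, j] = (1 - gamma) * np.sum(d_bar * dV)
        assert np.allclose(lhs, rhs, atol=1e-4)
        return lhs, rhs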

Solving Equation 16 for w, which gives the direction of the biased algorithms, we get

    w = (∑_s d̄^{µ_θ}(s) ∑_a µ_θ(s, a) ψ_{s,a} ψ_{s,a}^T)^{−1} (∑_s d̄^{µ_θ}(s) ∑_a µ_θ(s, a) Q^{µ_θ}(s, a) ψ_{s,a}).    (17)

Notice that the first factor is the inverse of the (average) Fisher information matrix (Bagnell & Schneider, 2003). Substituting k times this w into Equation 14 for w and canceling the product of the Fisher information matrix and its inverse gives

    0 = ∑_s d̄^{µ_θ}(s) ∑_a µ_θ(s, a) Q̄^{µ_θ}(s, a) ψ_{s,a} − k ∑_s d̄^{µ_θ}(s) ∑_a µ_θ(s, a) Q^{µ_θ}(s, a) ψ_{s,a}    (18)
      = ∇J̄(θ) − k (1 − γ) ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ,

by substitution of the policy gradient theorem (Sutton et al., 2000) and Lemma 1. Thus, when, for some k,

    ∇J̄(θ) = k (1 − γ) ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ,    (19)

the biased algorithms produce the direction of the unbiased average reward natural policy gradient. If we let k = 1, we will still get a constraint that results in the two directions being the same, although if the constraint is not satisfied, it does not mean the two are different (since a different k may result in Equation 19 being satisfied). Setting k = 1 and substituting Equation 10 for ∇J̄(θ), we get:

    (1 − γ) ∂/∂θ ∑_s d̄^{µ_θ}(s) V^{µ_θ}(s) = (1 − γ) ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ
    ⟹ ∑_s (∂d̄^{µ_θ}(s)/∂θ) V^{µ_θ}(s) + ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ = ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ
    ⟹ ∑_s (∂d̄^{µ_θ}(s)/∂θ) V^{µ_θ}(s) = 0.    (20)

We have shown that when Equation 11 holds, the biased algorithms compute the average reward natural policy gradient.

8. Discussion and Conclusion

We have shown that NAC-LSTD and eNAC produce biased estimates of the natural gradient. We argued that they, and INAC and NTD, act more like average reward natural actor-critics that do not properly account for how changes to θ change the expected return via d^{µ_θ}. We proved that in certain situations the biased algorithms produce unbiased estimates of the natural policy gradient for the average reward setting. The bias stems from improper discounting when approximating the state-action value function using compatible function approximation. We derived the properly discounted algorithms to produce the unbiased NAC-LSTD and eNAC2, as well as the biased and unbiased NAC-S, a linear time complexity alternative to the squared to cubic time complexity NAC-LSTD and eNAC2.

However, the unbiased algorithms have a critical drawback that limits their practicality. The unbiased algorithms discount their updates by γ^t. For small γ, the updates will decay to zero rapidly, causing the unbiased algorithms to ignore data collected after a short burn-in period. Consider an MDP like the one presented earlier, where the set of states that occur early and those that occur later are disjoint. In this setting, the discounted reward objective mandates that data recorded late in trajectories must be ignored. In this situation, the rapid decay of updates is a curse of the choice of objective function. However, if the states that are visited early in a trajectory are also visited later in the trajectory, off-policy methods may be able to take advantage of data from late in an episode to provide meaningful updates even for the discounted reward setting. They may also be able to properly use data from previous policies to improve the estimates of the natural policy gradient in a principled manner. These are possible avenues for future research.

Another interesting extension would be to determine how γ should be selected in the biased algorithms. Recall that Equation 10 is the average reward objective, for all γ. This suggests that in the biased algorithms γ may be selected by the researcher. Smaller values of γ are known to result in faster convergence of value function estimates (Szepesvári, 1997); however, larger γ typically result in smoother value functions that may be easier to approximate accurately with few features.
Lastly, we argued that, with certain policy parameterizations, policy gradient methods converge to globally optimal policies for discrete problems, and suggested that local optima may be avoided in continuous problems by increasing exploration and the policy's representational power. Future work may attempt to provide global convergence guarantees for a subset of the continuous-action setting by intelligently increasing the representational power of the policy when it becomes stuck in a local optimum.

References

Amari, S. Natural gradient works efficiently in learning. Neural Computation, 10, 1998.
Bagnell, J. A. and Schneider, J. Covariant policy search. In Proceedings of the International Joint Conference on Artificial Intelligence, 2003.
Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
Bertsekas, D. P. and Tsitsiklis, J. N. Gradient convergence in gradient methods. SIAM Journal on Optimization, 10, 2000.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor-critic algorithms. Automatica, 45(11), 2009.
Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.
Degris, T., Pilarski, P. M., and Sutton, R. S. Model-free reinforcement learning with continuous action in practice. In Proceedings of the 2012 American Control Conference, 2012.
Kakade, S. Optimizing average reward using discounted rewards. In Proceedings of the 14th Annual Conference on Computational Learning Theory, 2001.
Kakade, S. A natural policy gradient. In Advances in Neural Information Processing Systems, volume 14, 2002.
Lagoudakis, M. and Parr, R. Model-free least-squares policy iteration. In Neural Information Processing Systems: Natural and Synthetic, 2001.
Morimura, T., Uchibe, E., and Doya, K. Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and its Applications, 2005.
Morimura, T., Uchibe, E., Yoshimoto, J., and Doya, K. A generalized natural actor-critic algorithm. In Neural Information Processing Systems: Natural and Synthetic, 2009.
Peters, J. Policy gradient methods. Scholarpedia, 5(11):3698, 2010.
Peters, J. and Bagnell, J. A. Policy gradient methods. Encyclopedia of Machine Learning, 2010.
Peters, J. and Schaal, S. Policy gradient methods for robotics. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006.
Peters, J. and Schaal, S. Natural actor-critic. Neurocomputing, 71, 2008.
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, 2000.
Szepesvári, C. The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems, volume 10, 1997.

Appendix: Proof of Lemma 1

    ∂V^{µ_θ}(s)/∂θ = ∂/∂θ ∑_a µ_θ(s, a) Q^{µ_θ}(s, a)    (21)
    = ∑_a [ (∂µ_θ(s, a)/∂θ) Q^{µ_θ}(s, a) + µ_θ(s, a) ∂Q^{µ_θ}(s, a)/∂θ ]
    = ∑_a [ (∂µ_θ(s, a)/∂θ) Q^{µ_θ}(s, a) + µ_θ(s, a) ∂/∂θ ( R_s^a + γ ∑_{s'} P_{ss'}^a V^{µ_θ}(s') ) ]
    = ∑_a [ (∂µ_θ(s, a)/∂θ) Q^{µ_θ}(s, a) + µ_θ(s, a) γ ∑_{s'} P_{ss'}^a ∂V^{µ_θ}(s')/∂θ ].

Solving for ∑_a (∂µ_θ(s, a)/∂θ) Q^{µ_θ}(s, a) yields

    ∑_a (∂µ_θ(s, a)/∂θ) Q^{µ_θ}(s, a) = ∂V^{µ_θ}(s)/∂θ − γ ∑_a µ_θ(s, a) ∑_{s'} P_{ss'}^a ∂V^{µ_θ}(s')/∂θ.    (22)

Summing both sides over all states, weighted by d̄^{µ_θ}(s), gives

    ∑_s d̄^{µ_θ}(s) ∑_a (∂µ_θ(s, a)/∂θ) Q^{µ_θ}(s, a)    (23)
    = ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ − γ ∑_s d̄^{µ_θ}(s) ∑_a µ_θ(s, a) ∑_{s'} P_{ss'}^a ∂V^{µ_θ}(s')/∂θ
    = ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ − γ ∑_{s'} d̄^{µ_θ}(s') ∂V^{µ_θ}(s')/∂θ
    = (1 − γ) ∑_s d̄^{µ_θ}(s) ∂V^{µ_θ}(s)/∂θ,

where the third line uses the fact that d̄^{µ_θ} is the stationary distribution, so ∑_s d̄^{µ_θ}(s) ∑_a µ_θ(s, a) P_{ss'}^a = d̄^{µ_θ}(s').


More information

Calculus I-II Review Sheet

Calculus I-II Review Sheet Clculus I-II Review Sheet 1 Definitions 1.1 Functions A function is f is incresing on n intervl if x y implies f(x) f(y), nd decresing if x y implies f(x) f(y). It is clled monotonic if it is either incresing

More information

Uncertain Dynamic Systems on Time Scales

Uncertain Dynamic Systems on Time Scales Journl of Uncertin Sytem Vol.9, No.1, pp.17-30, 2015 Online t: www.ju.org.uk Uncertin Dynmic Sytem on Time Scle Umber Abb Hhmi, Vile Lupulecu, Ghu ur Rhmn Abdu Slm School of Mthemticl Science, GCU Lhore

More information

Conservation Law. Chapter Goal. 5.2 Theory

Conservation Law. Chapter Goal. 5.2 Theory Chpter 5 Conservtion Lw 5.1 Gol Our long term gol is to understnd how mny mthemticl models re derived. We study how certin quntity chnges with time in given region (sptil domin). We first derive the very

More information

Lecture 19: Continuous Least Squares Approximation

Lecture 19: Continuous Least Squares Approximation Lecture 19: Continuous Lest Squres Approximtion 33 Continuous lest squres pproximtion We begn 31 with the problem of pproximting some f C[, b] with polynomil p P n t the discrete points x, x 1,, x m for

More information

Transfer Functions. Chapter 5. Transfer Functions. Derivation of a Transfer Function. Transfer Functions

Transfer Functions. Chapter 5. Transfer Functions. Derivation of a Transfer Function. Transfer Functions 5/4/6 PM : Trnfer Function Chpter 5 Trnfer Function Defined G() = Y()/U() preent normlized model of proce, i.e., cn be ued with n input. Y() nd U() re both written in devition vrible form. The form of

More information

Lecture 14: Quadrature

Lecture 14: Quadrature Lecture 14: Qudrture This lecture is concerned with the evlution of integrls fx)dx 1) over finite intervl [, b] The integrnd fx) is ssumed to be rel-vlues nd smooth The pproximtion of n integrl by numericl

More information

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as

Improper Integrals. Type I Improper Integrals How do we evaluate an integral such as Improper Integrls Two different types of integrls cn qulify s improper. The first type of improper integrl (which we will refer to s Type I) involves evluting n integrl over n infinite region. In the grph

More information

EE Control Systems LECTURE 8

EE Control Systems LECTURE 8 Coyright F.L. Lewi 999 All right reerved Udted: Sundy, Ferury, 999 EE 44 - Control Sytem LECTURE 8 REALIZATION AND CANONICAL FORMS A liner time-invrint (LTI) ytem cn e rereented in mny wy, including: differentil

More information

Scalable Learning in Stochastic Games

Scalable Learning in Stochastic Games Sclble Lerning in Stochstic Gmes Michel Bowling nd Mnuel Veloso Computer Science Deprtment Crnegie Mellon University Pittsburgh PA, 15213-3891 Abstrct Stochstic gmes re generl model of interction between

More information

Review of basic calculus

Review of basic calculus Review of bsic clculus This brief review reclls some of the most importnt concepts, definitions, nd theorems from bsic clculus. It is not intended to tech bsic clculus from scrtch. If ny of the items below

More information

positive definite (symmetric with positive eigenvalues) positive semi definite (symmetric with nonnegative eigenvalues)

positive definite (symmetric with positive eigenvalues) positive semi definite (symmetric with nonnegative eigenvalues) Chter Liner Qudrtic Regultor Problem inimize the cot function J given by J x' Qx u' Ru dt R > Q oitive definite ymmetric with oitive eigenvlue oitive emi definite ymmetric with nonnegtive eigenvlue ubject

More information

USA Mathematical Talent Search Round 1 Solutions Year 21 Academic Year

USA Mathematical Talent Search Round 1 Solutions Year 21 Academic Year 1/1/21. Fill in the circles in the picture t right with the digits 1-8, one digit in ech circle with no digit repeted, so tht no two circles tht re connected by line segment contin consecutive digits.

More information

LINEAR STOCHASTIC DIFFERENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS

LINEAR STOCHASTIC DIFFERENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS Communiction on Stochtic Anlyi Vol. 7, No. 2 213 245-253 Seril Publiction www.erilpubliction.com LINEA STOCHASTIC DIFFEENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS NAJESS KHALIFA, HUI-HSIUNG KUO,

More information

Near-Bayesian Exploration in Polynomial Time

Near-Bayesian Exploration in Polynomial Time J. Zico Kolter kolter@cs.stnford.edu Andrew Y. Ng ng@cs.stnford.edu Computer Science Deprtment, Stnford University, CA 94305 Abstrct We consider the explortion/exploittion problem in reinforcement lerning

More information