Policy Gradient Methods for Reinforcement Learning with Function Approximation


Richard S. Sutton, David McAllester, Satinder Singh, Yishay Mansour
AT&T Labs Research, 180 Park Avenue, Florham Park, NJ

Abstract

Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.

Large applications of reinforcement learning (RL) require the use of generalizing function approximators such as neural networks, decision trees, or instance-based methods. The dominant approach for the last decade has been the value-function approach, in which all function approximation effort goes into estimating a value function, with the action-selection policy represented implicitly as the "greedy" policy with respect to the estimated values (e.g., as the policy that selects in each state the action with highest estimated value). The value-function approach has worked well in many applications, but has several limitations. First, it is oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities (e.g., see Singh, Jaakkola, and Jordan, 1994). Second, an arbitrarily small change in the estimated value of an action can cause it to be, or not be, selected. Such discontinuous changes have been identified as a key obstacle to establishing convergence assurances for algorithms following the value-function approach (Bertsekas and Tsitsiklis, 1996). For example, Q-learning, Sarsa, and dynamic programming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and Van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the best approximation is found at each step before changing the policy, and whether the notion of "best" is in the mean-squared-error sense or the slightly different senses of residual-gradient, temporal-difference, and dynamic-programming methods.

In this paper we explore an alternative approach to function approximation in RL. Rather than approximating a value function and using that to compute a deterministic policy, we approximate a stochastic policy directly using an independent function approximator with its own parameters. For example, the policy might be represented by a neural network whose input is a representation of the state, whose output is action selection probabilities, and whose weights are the policy parameters. Let θ denote the vector of policy parameters and ρ the performance of the corresponding policy (e.g., the average reward per step). Then, in the policy gradient approach, the policy parameters are updated approximately proportional to the gradient:

    Δθ ≈ α ∂ρ/∂θ,    (1)

where α is a positive-definite step size. If the above can be achieved, then θ can usually be assured to converge to a locally optimal policy in the performance measure ρ. Unlike the value-function approach, here small changes in θ can cause only small changes in the policy and in the state-visitation distribution.

In this paper we prove that an unbiased estimate of the gradient (1) can be obtained from experience using an approximate value function satisfying certain properties. Williams's (1988, 1992) REINFORCE algorithm also finds an unbiased estimate of the gradient, but without the assistance of a learned value function. REINFORCE learns much more slowly than RL methods using value functions and has received relatively little attention. Learning a value function and using it to reduce the variance of the gradient estimate appears to be essential for rapid learning. Jaakkola, Singh, and Jordan (1995) proved a result very similar to ours for the special case of function approximation corresponding to a tabular POMDP. Our result strengthens theirs and generalizes it to arbitrary differentiable function approximators. Our result also suggests a way of proving the convergence of a wide variety of algorithms based on "actor-critic" or policy-iteration architectures (e.g., Barto, Sutton, and Anderson, 1983; Sutton, 1984; Kimura and Kobayashi, 1998). In this paper we take the first step in this direction by proving for the first time that a version of policy iteration with general differentiable function approximation is convergent to a locally optimal policy. Baird and Moore (1999) obtained a weaker but superficially similar result for their VAPS family of methods. Like policy-gradient methods, VAPS includes separately parameterized policy and value functions updated by gradient methods. However, VAPS methods do not climb the gradient of performance (expected long-term reward), but of a measure combining performance and value-function accuracy. As a result, VAPS does not converge to a locally optimal policy except in the case that no weight is put upon value-function accuracy, in which case VAPS degenerates to REINFORCE. Similarly, Gordon's (1995) fitted value iteration is also convergent and value-based, but does not find a locally optimal policy.

1 Policy Gradient Theorem

We consider the standard reinforcement learning framework (see, e.g., Sutton and Barto, 1998), in which a learning agent interacts with a Markov decision process (MDP). The state, action, and reward at each time t ∈ {0, 1, 2, ...} are denoted s_t ∈ S, a_t ∈ A, and r_t ∈ ℝ respectively. The environment's dynamics are characterized by state transition probabilities P^a_{ss'} = Pr{s_{t+1} = s' | s_t = s, a_t = a} and expected rewards R^a_s = E{r_{t+1} | s_t = s, a_t = a}, for all s, s' ∈ S, a ∈ A. The agent's decision-making procedure at each time is characterized by a policy, π(s, a, θ) = Pr{a_t = a | s_t = s, θ}, for all s ∈ S, a ∈ A, where θ ∈ ℝ^l, for l ≪ |S|, is a parameter vector. We assume that π is differentiable with respect to its parameter, i.e., that ∂π(s,a)/∂θ exists. We also usually write just π(s, a) for π(s, a, θ).
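As a concrete illustration of a differentiable policy parameterization and of the update rule (1), the following minimal Python sketch (not from the paper; the feature matrix, step size, and function names are illustrative assumptions) implements a Gibbs/softmax policy π(s, a, θ) over state-action features and a single gradient-ascent step on θ.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """Gibbs/softmax policy pi(s, a, theta) over state-action features.

    phi_s: array of shape (num_actions, num_features), one feature vector per action.
    Returns a vector of action probabilities for the given state.
    """
    prefs = phi_s @ theta
    prefs -= prefs.max()          # subtract the max preference for numerical stability
    expp = np.exp(prefs)
    return expp / expp.sum()

def policy_gradient_step(theta, grad_rho_estimate, alpha=0.01):
    """One application of update (1): theta <- theta + alpha * estimate of d(rho)/d(theta)."""
    return theta + alpha * grad_rho_estimate

# Illustrative use: 3 actions, 4 features, arbitrary numbers.
rng = np.random.default_rng(0)
theta = np.zeros(4)
phi_s = rng.normal(size=(3, 4))
print(softmax_policy(theta, phi_s))   # uniform probabilities when theta = 0
```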

3 With function pproximtion, two wy of formulting the gent objective re ueful. One i the verge rewrd formultion, in which policie re rnked ccording to their long-term expected rewrd per tep, ρ(π): 1 ρ(π) = lim n n E {r 1 + r r n π} = d π () π(, )R, where d π () = lim t Pr{ t = 0,π} i the ttionry ditribution of tte under π, which we ume exit nd i independent of 0 for ll policie. In the verge rewrd formultion, the vlue of tte ction pir given policy i defined Q π (, ) = E {r t ρ(π) 0 =, 0 =, π}, S, A. t=1 The econd formultion we cover i tht in which there i deignted trt tte 0, nd we cre only bout the long-term rewrd obtined from it. We will give our reult only once, but they will pply to thi formultion well under the definition { } { } ρ(π) =E γ t 1 r t 0,π nd Q π (, ) =E γ k 1 r t+k t =, t =, π. t=1 where γ 0, 1] i dicount rte (γ = 1 i llowed only in epiodic tk). In thi formultion, we define d π () dicounted weighting of tte encountered trting t 0 nd then following π: d π () = t=0 γt Pr{ t = 0,π}. Our firt reult concern the grdient of the performnce metric with repect to the policy prmeter: Theorem 1 (Policy Grdient). For ny MDP, in either the verge-rewrd or trt-tte formultion, Proof: See the ppendix. = d π () k=1 Q π (, ). (2) Mrbch nd Titikli (1998) decribe relted but different expreion for the grdient in term of the tte-vlue function, citing Jkkol, Singh, nd Jordn (1995) nd Co nd Chen (1997). In both tht expreion nd our, the key point i tht their re no term of the form dπ () : the effect of policy chnge on the ditribution of tte doe not pper. Thi i convenient for pproximting the grdient by mpling. For exmple, if w mpled from the ditribution obtined by following π, then π(,) Q π (, ) would be n unbied etimte of. Of coure, Q π (, ) i lo not normlly known nd mut be etimted. One pproch i to ue the ctul return, R t = k=1 r t+k ρ(π) (or R t = k=1 γk 1 r t+k in the trt-tte formultion) n pproximtion for ech Q π ( t, t ). Thi led to Willim epiodic REINFORCE lgorithm, θ t π(t,t) 1 R t π( (the 1 t, t) π( t, t) correct for the overmpling of ction preferred by π), which i known to follow in expected vlue (Willim, 1988, 1992). 2 Policy Grdient with Approximtion Now conider the ce in which Q π i pproximted by lerned function pproximtor. If the pproximtion i ufficiently good, we might hope to ue it in plce of Q π in (2) nd till point roughly in the direction of the grdient. For exmple, Jkkol,
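The episodic REINFORCE estimator just described can be sketched as follows. This is an illustrative Python sketch for the start-state formulation, not the paper's own code; the environment interface (reset()/step(a) returning (next_state, reward, done)) and the feature map phi are assumptions, and a softmax policy is assumed so that (∂π/∂θ)/π equals φ_{sa} − Σ_b π(s,b) φ_{sb}.

```python
import numpy as np

def softmax_probs(theta, phi_s):
    """Action probabilities for a softmax policy over features phi_s (num_actions x num_features)."""
    prefs = phi_s @ theta
    prefs -= prefs.max()
    expp = np.exp(prefs)
    return expp / expp.sum()

def reinforce_episode(env, theta, phi, gamma=0.99, alpha=0.01):
    """One episode of Williams's episodic REINFORCE (start-state formulation).

    Follows pi for a whole episode, then for each visited (s_t, a_t) accumulates
    R_t * (dpi/dtheta)/pi, which for a softmax policy is
    R_t * (phi(s_t)[a_t] - sum_b pi(s_t,b) phi(s_t)[b]).
    `env` and `phi(s)` (state-action feature matrix) are assumptions.
    """
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        probs = softmax_probs(theta, phi(s))
        a = np.random.choice(len(probs), p=probs)
        next_s, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = next_s

    grad = np.zeros_like(theta)
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G                        # return R_t from time t onward
        feats = phi(states[t])
        probs = softmax_probs(theta, feats)
        grad_log_pi = feats[actions[t]] - probs @ feats   # (dpi/dtheta)/pi for the softmax policy
        grad += grad_log_pi * G
    return theta + alpha * grad                           # update (1) with this episode's estimate
```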

2 Policy Gradient with Approximation

Now consider the case in which Q^π is approximated by a learned function approximator. If the approximation is sufficiently good, we might hope to use it in place of Q^π in (2) and still point roughly in the direction of the gradient. For example, Jaakkola, Singh, and Jordan (1995) proved that for the special case of function approximation arising in a tabular POMDP one could assure positive inner product with the gradient, which is sufficient to ensure improvement for moving in that direction. Here we extend their result to general function approximation and prove equality with the gradient.

Let f_w : S × A → ℝ be our approximation to Q^π, with parameter w. It is natural to learn f_w by following π and updating w by a rule such as Δw_t ∝ ∂/∂w [Q̂^π(s_t,a_t) − f_w(s_t,a_t)]² ∝ [Q̂^π(s_t,a_t) − f_w(s_t,a_t)] ∂f_w(s_t,a_t)/∂w, where Q̂^π(s_t,a_t) is some unbiased estimator of Q^π(s_t,a_t), perhaps R_t. When such a process has converged to a local optimum, then

    Σ_s d^π(s) Σ_a π(s,a) [Q^π(s,a) − f_w(s,a)] ∂f_w(s,a)/∂w = 0.    (3)

Theorem 2 (Policy Gradient with Function Approximation). If f_w satisfies (3) and is compatible with the policy parameterization in the sense that

    ∂f_w(s,a)/∂w = ∂π(s,a)/∂θ · 1/π(s,a),    (4)

then

    ∂ρ/∂θ = Σ_s d^π(s) Σ_a ∂π(s,a)/∂θ f_w(s,a).    (5)

Proof: Combining (3) and (4) gives

    Σ_s d^π(s) Σ_a ∂π(s,a)/∂θ [Q^π(s,a) − f_w(s,a)] = 0,    (6)

which tells us that the error in f_w(s,a) is orthogonal to the gradient of the policy parameterization. Because the expression above is zero, we can subtract it from the policy gradient theorem (2) to yield

    ∂ρ/∂θ = Σ_s d^π(s) Σ_a ∂π(s,a)/∂θ Q^π(s,a) − Σ_s d^π(s) Σ_a ∂π(s,a)/∂θ [Q^π(s,a) − f_w(s,a)]
          = Σ_s d^π(s) Σ_a ∂π(s,a)/∂θ [Q^π(s,a) − Q^π(s,a) + f_w(s,a)]
          = Σ_s d^π(s) Σ_a ∂π(s,a)/∂θ f_w(s,a).    ∎
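A minimal sketch of the critic update described above (illustrative, not from the paper): w is adjusted by stochastic gradient descent on [Q̂^π(s_t,a_t) − f_w(s_t,a_t)]², using a linear f_w(s,a) = wᵀψ(s,a) so that ∂f_w/∂w = ψ(s,a). The feature map psi and the sample stream are assumptions; condition (3) is what this process satisfies at convergence when the samples are generated by following π.

```python
import numpy as np

def critic_update(w, psi_sa, q_hat, beta=0.05):
    """One stochastic-gradient step on [Q_hat - f_w(s,a)]^2 for a linear f_w(s,a) = w . psi(s,a).

    psi_sa: feature vector psi(s,a) (equal to df_w/dw in the linear case).
    q_hat:  an unbiased sample of Q^pi(s,a), e.g. the observed return R_t.
    """
    f_sa = w @ psi_sa
    return w + beta * (q_hat - f_sa) * psi_sa

# Illustrative use: fit w from a stream of (psi(s_t,a_t), R_t) samples gathered while following pi.
rng = np.random.default_rng(1)
w = np.zeros(4)
for _ in range(1000):
    psi_sa = rng.normal(size=4)                                       # stand-in for psi(s_t, a_t)
    q_hat = psi_sa @ np.array([1.0, -0.5, 0.0, 2.0]) + rng.normal()   # noisy stand-in return
    w = critic_update(w, psi_sa, q_hat)
```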

3 Application to Deriving Algorithms and Advantages

Given a policy parameterization, Theorem 2 can be used to derive an appropriate form for the value-function parameterization. For example, consider a policy that is a Gibbs distribution in a linear combination of features:

    π(s,a) = e^{θᵀφ_{sa}} / Σ_b e^{θᵀφ_{sb}},    for all s ∈ S, a ∈ A,

where each φ_{sa} is an l-dimensional feature vector characterizing state-action pair s, a. Meeting the compatibility condition (4) requires that

    ∂f_w(s,a)/∂w = ∂π(s,a)/∂θ · 1/π(s,a) = φ_{sa} − Σ_b π(s,b) φ_{sb},

so that the natural parameterization of f_w is

    f_w(s,a) = wᵀ [ φ_{sa} − Σ_b π(s,b) φ_{sb} ].

In other words, f_w must be linear in the same features as the policy, except normalized to be mean zero for each state. Other algorithms can easily be derived for a variety of nonlinear policy parameterizations, such as multi-layer backpropagation networks.

The careful reader will have noticed that the form given above for f_w requires that it have zero mean for each state: Σ_a π(s,a) f_w(s,a) = 0, for all s ∈ S. In this sense it is better to think of f_w as an approximation of the advantage function, A^π(s,a) = Q^π(s,a) − V^π(s) (much as in Baird, 1993), rather than of Q^π. Our convergence requirement (3) is really that f_w get the relative value of the actions correct in each state, not the absolute value, nor the variation from state to state. Our results can be viewed as a justification for the special status of advantages as the target for value function approximation in RL. In fact, our (2), (3), and (5) can all be generalized to include an arbitrary function of state added to the value function or its approximation. For example, (5) can be generalized to ∂ρ/∂θ = Σ_s d^π(s) Σ_a ∂π(s,a)/∂θ [f_w(s,a) + v(s)], where v : S → ℝ is an arbitrary function. (This follows immediately because Σ_a ∂π(s,a)/∂θ = 0, for all s ∈ S.) The choice of v does not affect any of our theorems, but can substantially affect the variance of the gradient estimators. The issues here are entirely analogous to those in the use of reinforcement baselines in earlier work (e.g., Williams, 1992; Dayan, 1991; Sutton, 1984). In practice, v should presumably be set to the best available approximation of V^π. Our results establish that that approximation process can proceed without affecting the expected evolution of f_w and π.

4 Convergence of Policy Iteration with Function Approximation

Given Theorem 2, we can prove for the first time that a form of policy iteration with function approximation is convergent to a locally optimal policy.

Theorem 3 (Policy Iteration with Function Approximation). Let π and f_w be any differentiable function approximators for the policy and value function respectively that satisfy the compatibility condition (4) and for which max_{θ,s,a,i,j} |∂²π(s,a)/∂θ_i∂θ_j| < B < ∞. Let {α_k}_{k=0}^∞ be any step-size sequence such that lim_{k→∞} α_k = 0 and Σ_k α_k = ∞. Then, for any MDP with bounded rewards, the sequence {(θ_k, w_k)}, defined by any θ_0, π_k = π(·,·,θ_k), and

    w_k = w such that Σ_s d^{π_k}(s) Σ_a π_k(s,a) [Q^{π_k}(s,a) − f_w(s,a)] ∂f_w(s,a)/∂w = 0,

    θ_{k+1} = θ_k + α_k Σ_s d^{π_k}(s) Σ_a ∂π_k(s,a)/∂θ f_{w_k}(s,a),

converges such that lim_{k→∞} ∂ρ(π_k)/∂θ = 0.

Proof: Our Theorem 2 assures that the θ_k update is in the direction of the gradient. The bounds on ∂²π(s,a)/∂θ_i∂θ_j and on the MDP's rewards together assure us that ∂²ρ/∂θ_i∂θ_j is also bounded. These, together with the step-size requirements, are the necessary conditions to apply Proposition 3.5 from page 96 of Bertsekas and Tsitsiklis (1996), which assures convergence to a local optimum. ∎
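For the Gibbs/softmax policy above, the compatible features and a sampled actor-critic step in the spirit of Theorem 3 can be sketched as follows. This is an illustrative Python sketch; the feature matrix phi_s, the sampling of (s, a), and the step sizes are assumptions. The actor step uses the fact that the sum in (5) is an expectation, over s ~ d^π and a ~ π, of ((∂π/∂θ)/π) f_w(s,a).

```python
import numpy as np

def softmax_pi(theta, phi_s):
    """pi(s, ., theta) for a Gibbs policy; phi_s has shape (num_actions, num_features)."""
    prefs = phi_s @ theta
    prefs -= prefs.max()
    expp = np.exp(prefs)
    return expp / expp.sum()

def compatible_features(theta, phi_s, a):
    """psi(s,a) = phi_sa - sum_b pi(s,b) phi_sb, i.e. (dpi/dtheta)/pi for the Gibbs policy."""
    probs = softmax_pi(theta, phi_s)
    return phi_s[a] - probs @ phi_s

def actor_critic_step(theta, w, phi_s, a, q_hat, alpha=0.01, beta=0.05):
    """One sampled update with a compatible critic f_w(s,a) = w . psi(s,a).

    Critic: move w toward the sampled return q_hat (a sample-based step toward condition (3)).
    Actor:  move theta along psi(s,a) * f_w(s,a), a single-sample estimate of the sum in (5).
    """
    psi = compatible_features(theta, phi_s, a)
    f_sa = w @ psi
    w_new = w + beta * (q_hat - f_sa) * psi
    theta_new = theta + alpha * psi * f_sa
    return theta_new, w_new
```

Note that Theorem 3 assumes w_k is fitted to convergence (satisfying (3)) for each policy π_k before θ is updated; the per-sample interleaving above is the usual practical approximation to that idealized loop.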

Acknowledgements

The authors wish to thank Martha Steenstrup and Doina Precup for comments, and Michael Kearns for insights into the notion of optimal policy under function approximation.

References

Baird, L. C. (1993). Advantage updating. Wright Laboratory Technical Report WL-TR.
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. Proceedings of the Twelfth International Conference on Machine Learning. Morgan Kaufmann.
Baird, L. C., Moore, A. W. (1999). Gradient descent for general reinforcement learning. NIPS 11. MIT Press.
Barto, A. G., Sutton, R. S., Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 13:835.
Bertsekas, D. P., Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
Cao, X.-R., Chen, H.-F. (1997). Perturbation realization, potentials, and sensitivity analysis of Markov processes. IEEE Transactions on Automatic Control 42(10).
Dayan, P. (1991). Reinforcement comparison. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton (eds.), Connectionist Models: Proceedings of the 1990 Summer School. Morgan Kaufmann.
Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth International Conference on Machine Learning. Morgan Kaufmann.
Gordon, G. J. (1996). Chattering in SARSA(λ). CMU Learning Lab Technical Report.
Jaakkola, T., Singh, S. P., Jordan, M. I. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. NIPS 7. Morgan Kaufmann.
Kimura, H., Kobayashi, S. (1998). An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value functions. Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann.
Marbach, P., Tsitsiklis, J. N. (1998). Simulation-based optimization of Markov reward processes. Technical Report LIDS-P-2411, Massachusetts Institute of Technology.
Singh, S. P., Jaakkola, T., Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision problems. Proceedings of the Eleventh International Conference on Machine Learning. Morgan Kaufmann.
Sutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning. Ph.D. thesis, University of Massachusetts, Amherst.
Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Tsitsiklis, J. N., Van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning 22.
Williams, R. J. (1988). Toward a theory of reinforcement-learning connectionist systems. Technical Report NU-CCS-88-3, Northeastern University, College of Computer Science.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8.

Appendix: Proof of Theorem 1

We prove the theorem first for the average-reward formulation and then for the start-state formulation.

    ∂V^π(s)/∂θ ≝ ∂/∂θ Σ_a π(s,a) Q^π(s,a),    for all s ∈ S
               = Σ_a [ ∂π(s,a)/∂θ Q^π(s,a) + π(s,a) ∂Q^π(s,a)/∂θ ]
               = Σ_a [ ∂π(s,a)/∂θ Q^π(s,a) + π(s,a) ∂/∂θ ( R^a_s − ρ(π) + Σ_{s'} P^a_{ss'} V^π(s') ) ]
               = Σ_a [ ∂π(s,a)/∂θ Q^π(s,a) + π(s,a) ( −∂ρ/∂θ + Σ_{s'} P^a_{ss'} ∂V^π(s')/∂θ ) ]

Therefore,

    ∂ρ/∂θ = Σ_a [ ∂π(s,a)/∂θ Q^π(s,a) + π(s,a) Σ_{s'} P^a_{ss'} ∂V^π(s')/∂θ ] − ∂V^π(s)/∂θ.

Summing both sides over the stationary distribution d^π,

    Σ_s d^π(s) ∂ρ/∂θ = Σ_s d^π(s) Σ_a ∂π(s,a)/∂θ Q^π(s,a) + Σ_s d^π(s) Σ_a π(s,a) Σ_{s'} P^a_{ss'} ∂V^π(s')/∂θ − Σ_s d^π(s) ∂V^π(s)/∂θ,

but since d^π is stationary,

    ∂ρ/∂θ = Σ_s d^π(s) Σ_a ∂π(s,a)/∂θ Q^π(s,a) + Σ_{s'} d^π(s') ∂V^π(s')/∂θ − Σ_s d^π(s) ∂V^π(s)/∂θ
          = Σ_s d^π(s) Σ_a ∂π(s,a)/∂θ Q^π(s,a).    ∎

For the start-state formulation:

    ∂V^π(s)/∂θ ≝ ∂/∂θ Σ_a π(s,a) Q^π(s,a),    for all s ∈ S
               = Σ_a [ ∂π(s,a)/∂θ Q^π(s,a) + π(s,a) ∂Q^π(s,a)/∂θ ]
               = Σ_a [ ∂π(s,a)/∂θ Q^π(s,a) + π(s,a) ∂/∂θ ( R^a_s + Σ_{s'} γ P^a_{ss'} V^π(s') ) ]
               = Σ_a [ ∂π(s,a)/∂θ Q^π(s,a) + π(s,a) Σ_{s'} γ P^a_{ss'} ∂V^π(s')/∂θ ]    (7)
               = Σ_x Σ_{k=0}^∞ γ^k Pr(s → x, k, π) Σ_a ∂π(x,a)/∂θ Q^π(x,a),

after several steps of unrolling (7), where Pr(s → x, k, π) is the probability of going from state s to state x in k steps under policy π. It is then immediate that

    ∂ρ/∂θ = ∂/∂θ E{ Σ_{t=1}^∞ γ^{t−1} r_t | s_0, π } = ∂V^π(s_0)/∂θ
          = Σ_s Σ_{k=0}^∞ γ^k Pr(s_0 → s, k, π) Σ_a ∂π(s,a)/∂θ Q^π(s,a)
          = Σ_s d^π(s) Σ_a ∂π(s,a)/∂θ Q^π(s,a).    ∎
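As a quick numerical sanity check of Theorem 1 (not part of the paper), the following illustrative Python script builds a small random MDP, computes d^π, Q^π, and ρ(π) exactly for a tabular softmax policy, and compares the gradient given by (2) with a finite-difference gradient of ρ. The MDP, the policy parameterization, and all names are assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA = 4, 3
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)   # P[s,a,s'] transition probabilities
R = rng.random((nS, nA))                                          # expected rewards R^a_s

def pi_of(theta):
    """Tabular softmax policy: pi[s,a] = exp(theta[s,a]) / sum_b exp(theta[s,b])."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def avg_reward_terms(theta):
    """Return rho(pi), stationary distribution d^pi, differential Q^pi, and pi."""
    pi = pi_of(theta)
    P_pi = np.einsum('sa,sap->sp', pi, P)          # state-to-state transition matrix under pi
    r_pi = (pi * R).sum(axis=1)                    # expected one-step reward per state under pi
    # stationary distribution: d = d P_pi with sum(d) = 1
    A = np.vstack([P_pi.T - np.eye(nS), np.ones(nS)])
    b = np.zeros(nS + 1); b[-1] = 1.0
    d = np.linalg.lstsq(A, b, rcond=None)[0]
    rho = d @ r_pi
    # differential state values: (I - P_pi) V = r_pi - rho (any particular solution suffices)
    V = np.linalg.lstsq(np.eye(nS) - P_pi, r_pi - rho, rcond=None)[0]
    Q = R - rho + P @ V                            # Q[s,a] = R^a_s - rho + sum_s' P[s,a,s'] V(s')
    return rho, d, Q, pi

theta = rng.normal(size=(nS, nA))
rho, d, Q, pi = avg_reward_terms(theta)

# Eq. (2) specialized to the tabular softmax:
# d(rho)/d(theta[s,a]) = d(s) * pi(s,a) * (Q(s,a) - V(s)), with V(s) = sum_a pi(s,a) Q(s,a).
V_pi = (pi * Q).sum(axis=1, keepdims=True)
grad_theorem = d[:, None] * pi * (Q - V_pi)

# finite-difference check of d(rho)/d(theta)
eps = 1e-5
grad_fd = np.zeros_like(theta)
for s in range(nS):
    for a in range(nA):
        tp, tm = theta.copy(), theta.copy()
        tp[s, a] += eps; tm[s, a] -= eps
        grad_fd[s, a] = (avg_reward_terms(tp)[0] - avg_reward_terms(tm)[0]) / (2 * eps)

print(np.max(np.abs(grad_theorem - grad_fd)))   # should be on the order of 1e-6 or smaller
```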
