A Generalized Path Integral Control Approach to Reinforcement Learning


Journal of Machine Learning Research 11 (2010). Submitted 1/10; Revised 7/10; Published 11/10

Evangelos A. Theodorou (ETHEODOR@USC.EDU), Jonas Buchli (JONAS@BUCHLI.ORG), Stefan Schaal* (SSCHAAL@USC.EDU)
Department of Computer Science, University of Southern California, Los Angeles, CA, USA

Editor: Daniel Lee

Abstract

With the goal to generate more scalable algorithms with higher efficiency and fewer open parameters, reinforcement learning (RL) has recently moved towards combining classical techniques from optimal control and dynamic programming with modern learning techniques from statistical estimation theory. In this vein, this paper suggests to use the framework of stochastic optimal control with path integrals to derive a novel approach to RL with parameterized policies. While solidly grounded in value function estimation and optimal control based on the stochastic Hamilton-Jacobi-Bellman (HJB) equations, policy improvements can be transformed into an approximation problem of a path integral which has no open algorithmic parameters other than the exploration noise. The resulting algorithm can be conceived of as model-based, semi-model-based, or even model free, depending on how the learning problem is structured. The update equations have no danger of numerical instabilities as neither matrix inversions nor gradient learning rates are required. Our new algorithm demonstrates interesting similarities with previous RL research in the framework of probability matching and provides intuition why the slightly heuristically motivated probability matching approach can actually perform well. Empirical evaluations demonstrate significant performance improvements over gradient-based policy learning and scalability to high-dimensional control problems. Finally, a learning experiment on a simulated 12 degree-of-freedom robot dog illustrates the functionality of our algorithm in a complex robot learning scenario. We believe that Policy Improvement with Path Integrals (PI²) offers currently one of the most efficient, numerically robust, and easy to implement algorithms for RL based on trajectory roll-outs.

Keywords: stochastic optimal control, reinforcement learning, parameterized policies

1. Introduction

While reinforcement learning (RL) is among the most general frameworks of learning control to create truly autonomous learning systems, its scalability to high-dimensional continuous state-action systems, for example, humanoid robots, remains problematic. Classical value-function based methods with function approximation offer one possible approach, but function approximation under the non-stationary iterative learning process of the value-function remains difficult when one exceeds about 5-10 dimensions. Alternatively, direct policy learning from trajectory roll-outs has recently made significant progress (Peters, 2007), but can still become numerically brittle and full of open tuning parameters in complex learning problems.

(* Also at ATR Computational Neuroscience Laboratories, Kyoto, Japan.)

© 2010 Evangelos Theodorou, Jonas Buchli and Stefan Schaal.

In new developments, RL researchers have started to combine the well-developed methods from statistical learning and empirical inference with classical RL approaches in order to minimize tuning parameters and numerical problems, such that ultimately more efficient algorithms can be developed that scale to significantly more complex learning systems (Dayan and Hinton, 1997; Kober and Peters, 2008; Peters and Schaal, 2008c; Toussaint and Storkey, 2006; Ghavamzadeh and Yaakov, 2007; Deisenroth et al., 2009; Vlassis et al., 2009; Jetchev and Toussaint, 2009).

In the spirit of these latter ideas, this paper addresses a new method of probabilistic reinforcement learning derived from the framework of stochastic optimal control and path integrals, based on the original work of Kappen (2007) and Broek et al. (2008). As will be detailed in the sections below, this approach makes an appealing theoretical connection between value function approximation using the stochastic HJB equations and direct policy learning by approximating a path integral, that is, by solving a statistical inference problem from sample roll-outs. The resulting algorithm, called Policy Improvement with Path Integrals (PI²), takes on a surprisingly simple form, has no open algorithmic tuning parameters besides the exploration noise, and has numerically robust performance in high dimensional learning problems. It also makes an interesting connection to previous work on RL based on probability matching (Dayan and Hinton, 1997; Peters and Schaal, 2008c; Kober and Peters, 2008) and motivates why probability matching algorithms can be successful.

This paper is structured into several major sections. Section 2 addresses the theoretical development of stochastic optimal control with path integrals. This is a fairly theoretical section; for a quick reading, we would recommend Section 2.1 for our basic notation, and Table 1 for the final results. Exposing the reader to a sketch of the details of the derivations opens the possibility to derive path integral optimal control solutions for other dynamical systems than the one we address in Section 2.1. The main steps of the theoretical development include:

- Problem formulation of stochastic optimal control with the stochastic Hamilton-Jacobi-Bellman (HJB) equation
- The transformation of the HJB into a linear PDE
- The generalized path integral formulation for control systems with controlled and uncontrolled differential equations
- General derivation of optimal controls for the path integral formalism
- Path integral optimal control applied to special cases of control systems

Section 3 relates path integral optimal control to reinforcement learning. Several main issues are addressed:

- Reinforcement learning with parameterized policies
- Dynamic Movement Primitives (DMPs) as a special case of parameterized policies, which matches the problem formulation of path integral optimal control
- Derivation of Policy Improvement with Path Integrals (PI²), which is an application of path integral optimal control to DMPs

Section 4 discusses related work.

Section 5 illustrates several applications of PI² to control problems in robotics. Section 6 addresses several important issues and characteristics of RL with PI².

2. Stochastic Optimal Control with Path Integrals

The goal in the stochastic optimal control framework is to control a stochastic dynamical system while minimizing a performance criterion. Therefore, stochastic optimal control can be thought of as a constrained optimization problem in which the constraints correspond to the stochastic dynamical system. The analysis and derivations of stochastic optimal control and path integrals in the next sections rely on the Bellman Principle of optimality (Bellman and Kalaba, 1964) and the HJB equation.

2.1 Stochastic Optimal Control Definition and Notation

For our technical developments, we will largely use a control theoretic notation from trajectory-based optimal control, however, with an attempt to have as much overlap as possible with the standard RL notation (Sutton and Barto, 1998). Let us define a finite horizon cost function for a trajectory τ_i (which can also be a piece of a trajectory) starting at time t_i in state x_{t_i} and ending at time t_N:

R(τ_i) = φ_{t_N} + ∫_{t_i}^{t_N} r_t dt,    (1)

with φ_{t_N} = φ(x_{t_N}) denoting a terminal reward at time t_N and r_t denoting the immediate cost at time t. (If we need to emphasize a particular time, we denote it by t_i, which also simplifies a transition to discrete time notation later. We use t without subscript when no emphasis is needed on when this time slice occurs, t_0 for the start of a trajectory, and t_N for the end of a trajectory.) In stochastic optimal control (Stengel, 1994), the goal is to find the controls u_t that minimize the value function:

V(x_{t_i}) = V_{t_i} = min_{u_{t_i:t_N}} E_{τ_i}[ R(τ_i) ],    (2)

where the expectation E_{τ_i}[·] is taken over all trajectories starting at x_{t_i}. We consider the rather general class of control systems:

ẋ_t = f(x_t, t) + G(x_t)( u_t + ε_t ) = f_t + G_t ( u_t + ε_t ),    (3)

with x_t ∈ R^n denoting the state of the system, G_t = G(x_t) ∈ R^{n×p} the control matrix, f_t = f(x_t) ∈ R^n the passive dynamics, u_t ∈ R^p the control vector and ε_t ∈ R^p Gaussian noise with variance Σ_ε. As immediate cost we consider

r_t = r(x_t, u_t, t) = q_t + (1/2) u_t^T R u_t,    (4)

where q_t = q(x_t, t) is an arbitrary state-dependent cost function, and R is the positive semi-definite weight matrix of the quadratic control cost. The stochastic HJB equation (Stengel, 1994; Fleming and Soner, 2006) associated with this stochastic optimal control problem is expressed as follows:

−∂_t V_t = min_{u} ( r_t + (∇_x V_t)^T F_t + (1/2) trace( (∇_{xx} V_t) G_t Σ_ε G_t^T ) ),    (5)

where F_t is defined as F_t = f(x_t, t) + G(x_t) u_t.
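For concreteness, the finite horizon cost (1) with the immediate cost (4) can be evaluated on a discretized roll-out. The following is a minimal sketch under an Euler discretization; the helper name and array shapes are our own illustration, not part of the paper:

```python
import numpy as np

def trajectory_cost(q_t, u_t, R, phi_tN, dt):
    """Discretized version of R(tau) = phi_tN + int r_t dt with
    r_t = q_t + 0.5 * u_t^T R u_t, cf. Equations (1) and (4)."""
    control_cost = 0.5 * np.einsum('ti,ij,tj->t', u_t, R, u_t)
    return phi_tN + np.sum((q_t + control_cost) * dt)

# Example: 100 time steps, 2-D controls, identity control cost weight R
q = np.zeros(100)                    # state-dependent cost q_t
u = 0.1 * np.ones((100, 2))          # control sequence u_t
print(trajectory_cost(q, u, np.eye(2), phi_tN=0.0, dt=0.01))
```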

To find the minimum, the cost function (4) is inserted into (5) and the gradient of the expression inside the parenthesis is taken with respect to the controls u and set to zero. The corresponding optimal control is given by the equation:

u(x_t) = u_t = −R^{−1} G_t^T (∇_{x_t} V_t).

Substitution of the optimal control above into the stochastic HJB (5) results in the following nonlinear and second order Partial Differential Equation (PDE):

−∂_t V_t = q_t + (∇_x V_t)^T f_t − (1/2) (∇_x V_t)^T G_t R^{−1} G_t^T (∇_x V_t) + (1/2) trace( (∇_{xx} V_t) G_t Σ_ε G_t^T ).

The ∇_x and ∇_{xx} symbols refer to the Jacobian and Hessian, respectively, of the value function with respect to the state x, while ∂_t is the partial derivative with respect to time. For notational compactness, we will mostly use subscripted symbols to denote time and state dependencies, as introduced in the equations above.

2.2 Transformation of HJB into a Linear PDE

In order to find a solution to the PDE above, we use an exponential transformation of the value function:

V_t = −λ log Ψ_t.

Given this logarithmic transformation, the partial derivatives of the value function with respect to time and state are expressed as follows:

∂_t V_t = −λ (1/Ψ_t) ∂_t Ψ_t,
∇_x V_t = −λ (1/Ψ_t) ∇_x Ψ_t,
∇_{xx} V_t = λ (1/Ψ_t²) ∇_x Ψ_t (∇_x Ψ_t)^T − λ (1/Ψ_t) ∇_{xx} Ψ_t.

Inserting the logarithmic transformation and the derivatives of the value function, we obtain:

(λ/Ψ_t) ∂_t Ψ_t = q_t − (λ/Ψ_t) (∇_x Ψ_t)^T f_t − (λ²/(2Ψ_t²)) (∇_x Ψ_t)^T G_t R^{−1} G_t^T (∇_x Ψ_t) + (1/2) trace(Γ),    (6)

where the term Γ is expressed as:

Γ = ( (λ/Ψ_t²) ∇_x Ψ_t (∇_x Ψ_t)^T − (λ/Ψ_t) ∇_{xx} Ψ_t ) G_t Σ_ε G_t^T.

The trace of Γ is therefore:

trace(Γ) = (λ/Ψ_t²) trace( (∇_x Ψ_t)^T G_t Σ_ε G_t^T (∇_x Ψ_t) ) − (λ/Ψ_t) trace( (∇_{xx} Ψ_t) G_t Σ_ε G_t^T ).    (7)

Comparing the quadratic ∇_x Ψ_t terms in (6) and (7), one can recognize that these terms will cancel under the assumption λ R^{−1} = Σ_ε, which implies the simplification:

λ G_t R^{−1} G_t^T = G_t Σ_ε G_t^T = Σ(x_t) = Σ_t.    (8)

The intuition behind this assumption (cf. also Kappen, 2007; Broek et al., 2008) is that, since the weight control matrix R is inverse proportional to the variance of the noise, a high variance control input implies cheap control cost, while small variance control inputs have high control cost. From a control theoretic standpoint such a relationship makes sense due to the fact that under a large disturbance (= high variance) significant control authority is required to bring the system back to a desirable state. This control authority can be achieved with a correspondingly low control cost in R.

With this simplification, (6) reduces to the following form:

−∂_t Ψ_t = −(1/λ) q_t Ψ_t + f_t^T (∇_x Ψ_t) + (1/2) trace( (∇_{xx} Ψ_t) G_t Σ_ε G_t^T ),    (9)

with boundary condition Ψ_{t_N} = exp( −(1/λ) φ_{t_N} ).
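The cancellation that leads from the nonlinear PDE to (9) can be checked symbolically. Below is a small sketch for the scalar (1-D) case using sympy; the symbol names are ours, and the check confirms that, under λR⁻¹ = Σ_ε, the transformed nonlinear PDE and the linear PDE (9) agree up to the factor −λ/Ψ:

```python
import sympy as sp

x, t = sp.symbols('x t')
lam, q, f, G, R = sp.symbols('lambda q f G R', positive=True)
Psi = sp.Function('Psi')(x, t)
V = -lam * sp.log(Psi)          # exponential transformation of the value function
Sigma_eps = lam / R             # assumption (8), scalar case

# Residual of the nonlinear PDE after substituting V = -lambda*log(Psi)
res_nonlinear = -sp.diff(V, t) - (q + sp.diff(V, x) * f
                                  - sp.Rational(1, 2) * sp.diff(V, x)**2 * G**2 / R
                                  + sp.Rational(1, 2) * sp.diff(V, x, 2) * G**2 * Sigma_eps)

# Residual of the linear Chapman-Kolmogorov PDE (9)
res_linear = -sp.diff(Psi, t) - (-(q / lam) * Psi + f * sp.diff(Psi, x)
                                 + sp.Rational(1, 2) * G**2 * Sigma_eps * sp.diff(Psi, x, 2))

# The two residuals agree up to the factor -lambda/Psi; this prints 0
print(sp.simplify(res_nonlinear + (lam / Psi) * res_linear))
```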

The partial differential equation (PDE) in (9) corresponds to the so-called Chapman-Kolmogorov PDE, which is of second order and linear. Analytical solutions of (9) cannot be found in general for arbitrary nonlinear systems and cost functions. However, there is a connection between solutions of PDEs and their representation as stochastic differential equations (SDEs), which is mathematically expressed by the Feynman-Kac formula (Øksendal, 2003; Yong, 1997). The Feynman-Kac formula (see Appendix B) can be used to find distributions of random processes which solve certain SDEs, as well as to propose numerical methods for solving certain PDEs. Applying the Feynman-Kac theorem, the solution of (9) is:

Ψ_{t_i} = E_{τ_i}( Ψ_{t_N} e^{ −∫_{t_i}^{t_N} (1/λ) q_t dt } ) = E_{τ_i}[ exp( −(1/λ) φ_{t_N} − (1/λ) ∫_{t_i}^{t_N} q_t dt ) ].    (10)

Thus, we have transformed our stochastic optimal control problem into the approximation problem of a path integral. With a view towards a discrete time approximation, which will be needed for numerical implementations, the solution (10) can be formulated as:

Ψ_{t_i} = lim_{dt→0} ∫ p(τ_i | x_{t_i}) exp( −(1/λ) [ φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} dt ] ) dτ_i,    (11)

where τ_i = (x_{t_i}, ..., x_{t_N}) is a sample path (or trajectory piece) starting at state x_{t_i} and the term p(τ_i | x_{t_i}) is the probability of sample path τ_i conditioned on the start state x_{t_i}. Since Equation (11) provides the exponential cost to go Ψ_{t_i} in state x_{t_i}, the integration above is taken with respect to sample paths τ_i = (x_{t_i}, x_{t_{i+1}}, ..., x_{t_N}). The differential term dτ_i is defined as dτ_i = (dx_{t_i}, ..., dx_{t_N}). Evaluation of the stochastic integral in (11) requires the specification of p(τ_i | x_{t_i}), which is the topic of our analysis in the next section.
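Equation (10) suggests a direct Monte Carlo approximation of Ψ: simulate the uncontrolled diffusion forward and average the exponentiated costs. A minimal sketch for a scalar system is shown below; the functions f, q, and phi and all numerical values are illustrative assumptions:

```python
import numpy as np

def psi_feynman_kac(x0, f, q, phi, lam, sigma_eps, dt, N, n_samples=10000,
                    rng=np.random.default_rng(0)):
    """Monte Carlo estimate of Psi_{t_i} in (10) for a scalar system
    dx = f(x) dt + sqrt(dt) * eps, with eps ~ N(0, sigma_eps^2)."""
    x = np.full(n_samples, x0, dtype=float)
    cost = np.zeros(n_samples)
    for _ in range(N):
        cost += q(x) * dt            # accumulate int q_t dt along each path
        x += f(x) * dt + np.sqrt(dt) * sigma_eps * rng.standard_normal(n_samples)
    return np.mean(np.exp(-(phi(x) + cost) / lam))

# Example: passive dynamics f(x) = -x with quadratic state and terminal costs
psi = psi_feynman_kac(x0=1.0, f=lambda x: -x, q=lambda x: 0.5 * x**2,
                      phi=lambda x: 0.5 * x**2, lam=1.0, sigma_eps=1.0,
                      dt=0.01, N=100)
print(psi)
```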

2.3 Generalized Path Integral Formulation

To develop our algorithms, we will need to consider a more general development of the path integral approach to stochastic optimal control than presented in Kappen (2007) and Broek et al. (2008). In particular, we have to address that in many stochastic dynamical systems, the control transition matrix G_t is state dependent and its structure depends on the partition of the state into directly and non-directly actuated parts. Since only some of the states are directly controlled, the state vector is partitioned into x = [x^{(m)T} x^{(c)T}]^T with x^{(m)} ∈ R^k the non-directly actuated part and x^{(c)} ∈ R^l the directly actuated part. Subsequently, the passive dynamics term and the control transition matrix can be partitioned as f_t = [f_t^{(m)T} f_t^{(c)T}]^T with f^{(m)} ∈ R^k, f^{(c)} ∈ R^l, and G_t = [0_{k×p} G_t^{(c)T}]^T with G_t^{(c)} ∈ R^{l×p}. The discretized state space representation of such systems is given as:

x_{t_{i+1}} = x_{t_i} + f_{t_i} dt + G_{t_i} ( u_{t_i} dt + √dt ε_{t_i} ),

or, in partitioned vector form:

[ x^{(m)}_{t_{i+1}} ; x^{(c)}_{t_{i+1}} ] = [ x^{(m)}_{t_i} ; x^{(c)}_{t_i} ] + [ f^{(m)}_{t_i} ; f^{(c)}_{t_i} ] dt + [ 0_{k×p} ; G^{(c)}_{t_i} ] ( u_{t_i} dt + √dt ε_{t_i} ).    (12)

Essentially the stochastic dynamics are partitioned into controlled equations in which the state x^{(c)}_{t_{i+1}} is directly actuated and uncontrolled equations in which the state x^{(m)}_{t_{i+1}} is not directly actuated. Since stochasticity is only added in the directly actuated terms (c) of (12), we can develop p(τ_i | x_{t_i}) as follows:

p(τ_i | x_{t_i}) = p(τ_{i+1} | x_{t_i}) = p( x_{t_N}, ..., x_{t_{i+1}} | x_{t_i} ) = Π_{j=i}^{N−1} p( x_{t_{j+1}} | x_{t_j} ),

where we exploited the fact that the start state x_{t_i} of a trajectory is given and does not contribute to its probability. For systems where the control has lower dimensionality than the state (12), the transition probabilities p( x_{t_{j+1}} | x_{t_j} ) are factorized as follows:

p( x_{t_{j+1}} | x_{t_j} ) = p( x^{(m)}_{t_{j+1}} | x_{t_j} ) p( x^{(c)}_{t_{j+1}} | x_{t_j} )
  = p( x^{(m)}_{t_{j+1}} | x^{(m)}_{t_j}, x^{(c)}_{t_j} ) p( x^{(c)}_{t_{j+1}} | x^{(m)}_{t_j}, x^{(c)}_{t_j} )
  ∝ p( x^{(c)}_{t_{j+1}} | x_{t_j} ),    (13)

where we have used the fact that p( x^{(m)}_{t_{j+1}} | x^{(m)}_{t_j}, x^{(c)}_{t_j} ) is the Dirac delta function, since x^{(m)}_{t_{j+1}} can be computed deterministically from x^{(m)}_{t_j}, x^{(c)}_{t_j}. For all practical purposes (the delta functions will all integrate to 1 in the path integral), the transition probability of the stochastic dynamics is reduced to the transition probability of the directly actuated part of the state:

p(τ_i | x_{t_i}) = Π_{j=i}^{N−1} p( x_{t_{j+1}} | x_{t_j} ) ∝ Π_{j=i}^{N−1} p( x^{(c)}_{t_{j+1}} | x_{t_j} ).    (14)

Since we assume that the noise ε is zero mean Gaussian distributed with variance Σ_ε, where Σ_ε ∈ R^{l×l}, the transition probability of the directly actuated part of the state is defined as (for notational simplicity, we write weighted square norms, or Mahalanobis distances, as v^T M v = ‖v‖²_M):

p( x^{(c)}_{t_{j+1}} | x_{t_j} ) = ( 1 / ( (2π)^{l/2} |Σ_{t_j}|^{1/2} ) ) exp( −(1/2) ‖ x^{(c)}_{t_{j+1}} − x^{(c)}_{t_j} − f^{(c)}_{t_j} dt ‖²_{Σ_{t_j}^{−1}} ),    (15)

where the covariance Σ_{t_j} ∈ R^{l×l} is expressed as Σ_{t_j} = G^{(c)}_{t_j} Σ_ε G^{(c)T}_{t_j} dt. Combining (15) and (14) results in the probability of a path expressed as:

p(τ_i | x_{t_i}) ∝ Π_{j=i}^{N−1} ( 1 / ( (2π)^{l/2} |Σ_{t_j}|^{1/2} ) ) exp( −(1/2) Σ_{j=i}^{N−1} ‖ x^{(c)}_{t_{j+1}} − x^{(c)}_{t_j} − f^{(c)}_{t_j} dt ‖²_{Σ_{t_j}^{−1}} ).

Finally, we incorporate the assumption (8) about the relation between the control cost and the variance of the noise, which needs to be adjusted to the controlled space as Σ_{t_j} = G^{(c)}_{t_j} Σ_ε G^{(c)T}_{t_j} dt = λ G^{(c)}_{t_j} R^{−1} G^{(c)T}_{t_j} dt = λ H_{t_j} dt, with H_{t_j} = G^{(c)}_{t_j} R^{−1} G^{(c)T}_{t_j}. Thus, we obtain:

p(τ_i | x_{t_i}) ∝ Π_{j=i}^{N−1} ( 1 / ( (2π)^{l/2} |Σ_{t_j}|^{1/2} ) ) exp( −(1/(2λ)) Σ_{j=i}^{N−1} ‖ (x^{(c)}_{t_{j+1}} − x^{(c)}_{t_j})/dt − f^{(c)}_{t_j} ‖²_{H_{t_j}^{−1}} dt ).

With this formulation of the probability of a trajectory, we can rewrite the path integral (11) as:

Ψ_{t_i} = lim_{dt→0} ∫ ( exp( −(1/λ) [ φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} dt + (1/2) Σ_{j=i}^{N−1} ‖ (x^{(c)}_{t_{j+1}} − x^{(c)}_{t_j})/dt − f^{(c)}_{t_j} ‖²_{H_{t_j}^{−1}} dt ] ) / Π_{j=i}^{N−1} (2π)^{l/2} |Σ_{t_j}|^{1/2} ) dτ^{(c)}_i
  = lim_{dt→0} ∫ ( 1 / D(τ_i) ) exp( −(1/λ) S(τ_i) ) dτ^{(c)}_i,    (16)

where we defined

S(τ_i) = φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} dt + (1/2) Σ_{j=i}^{N−1} ‖ (x^{(c)}_{t_{j+1}} − x^{(c)}_{t_j})/dt − f^{(c)}_{t_j} ‖²_{H_{t_j}^{−1}} dt,

and D(τ_i) = Π_{j=i}^{N−1} (2π)^{l/2} |Σ_{t_j}|^{1/2}. Note that the integration is over dτ^{(c)}_i = (dx^{(c)}_{t_i}, ..., dx^{(c)}_{t_N}), as the non-directly actuated states can be integrated out due to the fact that the state transition of the non-directly actuated states is deterministic, and just adds Dirac delta functions in the integral (cf. Equation 13). Equation (16) is written in a more compact form as:

Ψ_{t_i} = lim_{dt→0} ∫ exp( −(1/λ) S(τ_i) − log D(τ_i) ) dτ^{(c)}_i = lim_{dt→0} ∫ exp( −(1/λ) Z(τ_i) ) dτ^{(c)}_i,    (17)

where Z(τ_i) = S(τ_i) + λ log D(τ_i). It can be shown that this term is factorized in path dependent and path independent terms of the form:

Z(τ_i) = S̃(τ_i) + ( λ (N−i) l / 2 ) log( 2π dt λ ),

where S̃(τ_i) = S(τ_i) + (λ/2) Σ_{j=i}^{N−1} log |H_{t_j}|. This formula is a required step for the derivation of optimal controls in the next section. The constant term ( λ (N−i) l / 2 ) log( 2π dt λ ) can be the source of numerical instabilities, especially in cases where a fine discretization dt of the stochastic dynamics is required. However, in the next section, and in great detail in Appendix A, Lemma 1, we show how this term drops out of the equations.

2.4 Optimal Controls

For every moment of time, the optimal controls are given as u_{t_i} = −R^{−1} G_{t_i}^T (∇_{x_{t_i}} V_{t_i}). Due to the exponential transformation of the value function, the equation of the optimal controls can be written as

u_{t_i} = λ R^{−1} G_{t_i} ( ∇_{x_{t_i}} Ψ_{t_i} / Ψ_{t_i} ).

After substituting Ψ_{t_i} with (17) and canceling the state independent terms of the cost we have:

u_{t_i} = lim_{dt→0} λ R^{−1} G_{t_i}^{(c)T} ( ∇_{x^{(c)}_{t_i}} ∫ e^{ −(1/λ) S̃(τ_i) } dτ^{(c)}_i / ∫ e^{ −(1/λ) S̃(τ_i) } dτ^{(c)}_i ).

Further analysis of the equation above leads to a simplified version for the optimal controls as

u_{t_i} = ∫ P(τ_i) u_L(τ_i) dτ^{(c)}_i,    (18)

with the probability P(τ_i) and local controls u_L(τ_i) defined as

P(τ_i) = e^{ −(1/λ) S̃(τ_i) } / ∫ e^{ −(1/λ) S̃(τ_i) } dτ_i,    (19)

u_L(τ_i) = −R^{−1} G^{(c)T}_{t_i} lim_{dt→0} ( ∇_{x^{(c)}_{t_i}} S̃(τ_i) ).

The path cost S̃(τ_i) is a generalized version of the path cost in Kappen (2005a) and Kappen (2007), which only considered systems with state independent control transition G_{t_i}. (More precisely, if G^{(c)}_{t_i} = G^{(c)}, then the term (λ/2) Σ_{j=i}^{N−1} log |H_{t_j}| disappears, since it is state independent and appears in both numerator and denominator in (19); in this case, the path cost reduces to S̃(τ_i) = S(τ_i).) To find the local controls u_L(τ_i) we have to calculate lim_{dt→0} ∇_{x^{(c)}_{t_i}} S̃(τ_i). Appendix A, and more precisely Lemma 2, shows in detail the derivation of the final result:

lim_{dt→0} ( ∇_{x^{(c)}_{t_i}} S̃(τ_i) ) = −H_{t_i}^{−1} ( G^{(c)}_{t_i} ε_{t_i} − b_{t_i} ),

where the new term b_{t_i} is expressed as b_{t_i} = λ H_{t_i} Φ_{t_i} and Φ_{t_i} ∈ R^l is a vector with the j-th element defined as:

[Φ_{t_i}]_j = (1/2) trace( H_{t_i}^{−1} ( ∂_{[x^{(c)}_{t_i}]_j} H_{t_i} ) ).

The local control can now be expressed as:

u_L(τ_i) = R^{−1} G^{(c)T}_{t_i} H_{t_i}^{−1} ( G^{(c)}_{t_i} ε_{t_i} − b_{t_i} ).

By substituting H_{t_i} = G^{(c)}_{t_i} R^{−1} G^{(c)T}_{t_i} in the equation above, we get our main result for the local controls of the sampled path for the generalized path integral formulation:

u_L(τ_i) = R^{−1} G^{(c)T}_{t_i} ( G^{(c)}_{t_i} R^{−1} G^{(c)T}_{t_i} )^{−1} ( G^{(c)}_{t_i} ε_{t_i} − b_{t_i} ).    (20)

The equations (18), (19) and (20) form the solution for the generalized path integral stochastic optimal control problem. Given that this result is of general value and constitutes the foundation for deriving our reinforcement learning algorithm in the next section, but also since many other special cases can be derived from it, we summarize all relevant equations in Table 1:

Given:
- The system dynamics ẋ_t = f_t + G_t ( u_t + ε_t ) (cf. 3)
- The immediate cost r_t = q_t + (1/2) u_t^T R u_t (cf. 4)
- A terminal cost term φ_{t_N} (cf. 1)
- The variance Σ_ε of the mean-zero noise ε_t
- A trajectory starting at t_i and ending at t_N: τ_i = (x_{t_i}, ..., x_{t_N})
- A partitioning of the system dynamics into (c) controlled and (m) uncontrolled equations, where n = c + m is the dimensionality of the state x_t (cf. Section 2.3)

Optimal Controls:
- Optimal controls at every time step t_i: u_{t_i} = ∫ P(τ_i) u_L(τ_i) dτ^{(c)}_i
- Probability of a trajectory: P(τ_i) = e^{−(1/λ) S̃(τ_i)} / ∫ e^{−(1/λ) S̃(τ_i)} dτ_i
- Generalized trajectory cost: S̃(τ_i) = S(τ_i) + (λ/2) Σ_{j=i}^{N−1} log |H_{t_j}|, where
  S(τ_i) = φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} dt + (1/2) Σ_{j=i}^{N−1} ‖ (x^{(c)}_{t_{j+1}} − x^{(c)}_{t_j})/dt − f^{(c)}_{t_j} ‖²_{H_{t_j}^{−1}} dt and H_{t_j} = G^{(c)}_{t_j} R^{−1} G^{(c)T}_{t_j}
- Local controls: u_L(τ_i) = R^{−1} G^{(c)T}_{t_i} ( G^{(c)}_{t_i} R^{−1} G^{(c)T}_{t_i} )^{−1} ( G^{(c)}_{t_i} ε_{t_i} − b_{t_i} ), where b_{t_i} = λ H_{t_i} Φ_{t_i} and [Φ_{t_i}]_j = (1/2) trace( H_{t_i}^{−1} ( ∂_{[x^{(c)}_{t_i}]_j} H_{t_i} ) )

Table 1: Summary of optimal control derived from the path integral formalism.

The Given components of Table 1 include a model of the system dynamics, the cost function, knowledge of the system's noise process, and a mechanism to generate trajectories τ_i. It is important to realize that this is a model-based approach, as the computation of the optimal controls requires knowledge of ε_i. ε_i can be obtained in two ways. First, the trajectories τ_i can be generated purely in simulation, where the noise is generated from a random number generator. Second, trajectories could be generated by a real system, and the noise ε would be computed from the difference between the actual and the predicted system behavior, that is, G^{(c)}_{t_i} ε_{t_i} = ẋ_{t_i} − x̂̇_{t_i} = ẋ_{t_i} − ( f_{t_i} + G_{t_i} u_{t_i} ). Computing the prediction x̂̇_{t_i} also requires a model of the system dynamics.
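In a sampling-based implementation, the integrals (18) and (19) are approximated with a finite set of K roll-outs: each sampled path receives a softmax weight computed from its generalized cost S̃, and the optimal control is the weighted average of the local controls (20). A minimal sketch, with array shapes and example values of our own choosing:

```python
import numpy as np

def optimal_control_estimate(S_tilde, u_local, lam):
    """Monte Carlo version of (18)-(19): S_tilde is a (K,) vector of path
    costs for K roll-outs, u_local is (K, p) local controls from (20)."""
    S = S_tilde - np.min(S_tilde)        # subtracting a constant cancels in (19)
    P = np.exp(-S / lam)
    P /= np.sum(P)                       # discrete probabilities P(tau_i)
    return P @ u_local                   # u_{t_i} = sum_k P_k u_L_k, cf. (18)

# Example with K = 4 sampled paths and 2-D controls
S = np.array([10.0, 12.0, 9.5, 11.0])
uL = np.random.default_rng(1).normal(size=(4, 2))
print(optimal_control_estimate(S, uL, lam=1.0))
```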

Previous results in Kappen (2005a), Kappen (2007), Kappen (2005b) and Broek et al. (2008) are special cases of our generalized formulation. In the next section we show how our generalized formulation is specialized to different classes of stochastic dynamical systems, and we provide the corresponding formula of local controls for each class.

2.5 Special Cases

The purpose of this section is twofold. First, it demonstrates how to apply the path integral approach to specialized forms of dynamical systems, and how the local controls in (20) simplify for these cases. Second, this section prepares the special case which we will need for our reinforcement learning algorithm in Section 3.

2.5.1 Systems with One Dimensional Directly Actuated State

The generalized formulation of stochastic optimal control with path integrals in Table 1 can be applied to a variety of stochastic dynamical systems with different types of control transition matrices. One case of particular interest is where the dimensionality of the directly actuated part of the state is 1D, while the dimensionality of the control vector is 1D or higher dimensional. As will be seen below, this situation arises when the controls are generated by a linearly parameterized function approximator. The control transition matrix thus becomes a row vector, G^{(c)}_{t_i} = g^{(c)T}_{t_i} ∈ R^{1×p}. According to (20), the local controls for such systems are expressed as follows:

u_L(τ_i) = ( R^{−1} g^{(c)}_{t_i} / ( g^{(c)T}_{t_i} R^{−1} g^{(c)}_{t_i} ) ) ( g^{(c)T}_{t_i} ε_{t_i} − b_{t_i} ).

Since the directly actuated part of the state is 1D, the vector x^{(c)}_{t_i} collapses into a scalar, which appears in the partial differentiation above. In the case that g^{(c)}_{t_i} does not depend on x^{(c)}_{t_i}, the differentiation with respect to x^{(c)}_{t_i} results in zero and the local controls simplify to:

u_L(τ_i) = ( R^{−1} g^{(c)}_{t_i} g^{(c)T}_{t_i} / ( g^{(c)T}_{t_i} R^{−1} g^{(c)}_{t_i} ) ) ε_{t_i}.
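For this 1-D directly actuated case, the simplified local control is just a projection of the sampled noise onto the basis vector. A minimal sketch, assuming a generic invertible R (the numerical values are illustrative):

```python
import numpy as np

def local_control_1d(g, eps, R_inv):
    """u_L = R^{-1} g (g^T eps) / (g^T R^{-1} g) for a row-vector control
    transition g (shape (p,)) and a noise sample eps (shape (p,))."""
    Rg = R_inv @ g
    return Rg * (g @ eps) / (g @ Rg)

g = np.array([0.2, 0.7, 0.1])                # basis vector g_t^{(c)}
eps = np.array([0.05, -0.1, 0.3])            # exploration noise sample
print(local_control_1d(g, eps, np.eye(3)))   # R = I for this example
```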

2.5.2 Systems with Partially Actuated State

The generalized formula of the local controls (20) was derived for the case where the control transition matrix is state dependent and its dimensionality is G^{(c)}_t ∈ R^{l×p} with l < n and p the dimensionality of the control. There are many special cases of stochastic dynamical systems in optimal control and robotic applications that belong to this general class. More precisely, for systems having a state dependent control transition matrix that is square (G^{(c)}_{t_i} ∈ R^{l×l} with l = p) the local controls based on (20) are reformulated as:

u_L(τ_i) = ε_{t_i} − G^{(c)−1}_{t_i} b_{t_i}.    (21)

Interestingly, a rather general class of mechanical systems such as rigid-body and multi-body dynamics falls into this category. When these mechanical systems are expressed in state space formulation, the control transition matrix is equal to the inverse of the rigid body inertia matrix, G^{(c)}_{t_i} = M(θ_{t_i})^{−1} (Sciavicco and Siciliano, 2000). Future work will address this special topic of path integral control for multi-body dynamics.

Another special case of systems with partially actuated state is when the control transition matrix is state independent and has dimensionality G^{(c)}_t = G^{(c)} ∈ R^{l×p}. The local controls, according to (20), become:

u_L(τ_i) = R^{−1} G^{(c)T} ( G^{(c)} R^{−1} G^{(c)T} )^{−1} G^{(c)} ε_{t_i}.    (22)

If G^{(c)} is square and state independent, G^{(c)} ∈ R^{l×l}, we will have:

u_L(τ_i) = ε_{t_i}.    (23)

This special case was explored in Kappen (2005a), Kappen (2007), Kappen (2005b) and Broek et al. (2008). Our generalized formulation allows a broader application of path integral control in areas like robotics and other control systems, where the control transition matrix is typically partitioned into directly and non-directly actuated states, and typically also state dependent.

2.5.3 Systems with Fully Actuated State Space

In this class of stochastic systems, the control transition matrix is not partitioned and, therefore, the control u directly affects all the states. The local controls for such systems are provided by simply substituting G^{(c)}_{t_i} ∈ R^{l×p} in (20) with G_{t_i} ∈ R^{n×n}. Since G_{t_i} is a square matrix we obtain:

u_L(τ_i) = ε_{t_i} − G_{t_i}^{−1} b_{t_i},

with b_{t_i} = λ H_{t_i} Φ_{t_i} and [Φ_{t_i}]_j = (1/2) trace( H_{t_i}^{−1} ( ∂_{[x_{t_i}]_j} H_{t_i} ) ), where the differentiation is not taken with respect to [x^{(c)}_{t_i}]_j but with respect to the full state [x_{t_i}]_j. For this fully actuated state space, there are subclasses of dynamical systems with square and/or state independent control transition matrix. The local controls for these cases are found by just substituting G^{(c)}_{t_i} with G_{t_i} in (21), (22) and (23).

3. Reinforcement Learning with Parameterized Policies

Equipped with the theoretical framework of stochastic optimal control with path integrals, we can now turn to its application to reinforcement learning with parameterized policies. Since the beginning of actor-critic algorithms (Barto et al., 1983), one goal of reinforcement learning has been to learn compact policy representations, for example, with neural networks as in the early days of machine learning (Miller et al., 1990), or with general parameterizations (Peters, 2007; Deisenroth et al., 2009).

Parameterized policies have much fewer parameters than the classical time-indexed approach of optimal control, where every time step has its own set of parameters, that is, the optimal controls at this time step. Usually, function approximation techniques are used to represent the optimal controls, and the open parameters of the function approximator become the policy parameters. Function approximators use a state representation as input, and not an explicit time dependent representation. This allows generalization across states and promises better generalization of the control policy to a larger state space, such that policies become re-usable and do not have to be recomputed in every new situation.

The path integral approach from the previous sections also follows the classical time-based optimal control strategy, as can be seen from the time dependent solution for optimal controls in (33) below. However, a minor re-interpretation of the approach and some small mathematical adjustments allow us to carry it over to parameterized policies and reinforcement learning, which results in a new algorithm called Policy Improvement with Path Integrals (PI²).

3.1 Parameterized Policies

We are focusing on direct policy learning, where the parameters of the policy are adjusted by a learning rule directly, and not indirectly as in value function approaches of classical reinforcement learning (Sutton and Barto, 1998); see Peters (2007) for a discussion of pros and cons of direct vs. indirect policy learning. Direct policy learning usually assumes a general cost function (Sutton et al., 2000; Peters, 2007) in the form of

J(x_0) = ∫ p(τ_0) R(τ_0) dτ_0,    (24)

which is optimized over state-action trajectories τ_0 = (x_{t_0}, a_{t_0}, ..., x_{t_N}). (We use a_t to denote actions here in order to avoid using the symbol u in a conflicting way in the equations below, and to emphasize that an action does not necessarily coincide with the control command to a physical system.) Under the first order Markov property, the probability of a trajectory is

p(τ_i) = p(x_{t_i}) Π_{j=i}^{N−1} p( x_{t_{j+1}} | x_{t_j}, a_{t_j} ) p( a_{t_j} | x_{t_j} ).

Both the state transition and the policy are assumed to be stochastic. The particular formulation of the stochastic policy is a design parameter, motivated by the application domain, analytical convenience, and the need to inject exploration during learning. For continuous state-action domains, Gaussian distributions are most commonly chosen (Gullapalli, 1990; Williams, 1992; Peters, 2007). An interesting generalized stochastic policy was suggested in Rueckstiess et al. (2008) and applied in Kober and Peters (2008), where the stochastic policy p(a_t | x_t) is linearly parameterized as:

a_t = g_t^T ( θ + ε_t ),    (25)

with g_t denoting a vector of basis functions and θ the parameter vector. This policy has state dependent noise, which can contribute to faster learning as the signal-to-noise ratio becomes adaptive, since it is a function of g_t. It should be noted that a standard additive-noise policy can be expressed in this formulation, too, by choosing one constant basis function [g_t]_j = 1. For Gaussian noise ε the probability of an action is p(a_t | x_t) = N(θ^T g_t, Σ_t) with Σ_t = g_t^T Σ_ε g_t. Comparing the policy formulation in (25) with the control term in (3), one recognizes that the control policy formulation (25) should fit into the framework of path integral optimal control.
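For illustration, sampling an action from the linearly parameterized policy (25) looks as follows; the basis activations, parameters, and noise covariance are made-up example values:

```python
import numpy as np

def sample_action(g_t, theta, Sigma_eps, rng):
    """Sample a_t = g_t^T (theta + eps_t) with eps_t ~ N(0, Sigma_eps), cf. (25).
    The resulting action is Gaussian with mean g_t^T theta and the state
    dependent variance g_t^T Sigma_eps g_t."""
    eps = rng.multivariate_normal(np.zeros(len(theta)), Sigma_eps)
    return g_t @ (theta + eps)

rng = np.random.default_rng(0)
g_t = np.array([0.1, 0.5, 0.4])      # basis function activations at state x_t
theta = np.array([1.0, -2.0, 0.5])   # policy parameters
print(sample_action(g_t, theta, 0.1 * np.eye(3), rng))
```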

3.2 Generalized Parameterized Policies

Before going into more detail of our proposed reinforcement learning algorithm, it is worthwhile contemplating what the action a_t actually represents. In many applications of stochastic optimal control there are three main problems to be considered: trajectory planning, feedforward control, and feedback control. The result of optimization could thus be an optimal kinematic trajectory, the corresponding feedforward commands to track the desired trajectory accurately in the face of the system's nonlinearities, and/or time varying linear feedback gains (gain scheduling) for a negative feedback controller that compensates for perturbations from accurate trajectory tracking.

There are very few optimal control algorithms which compute all three issues simultaneously, such as Differential Dynamic Programming (DDP) (Jacobson and Mayne, 1970), or its simpler version, the Iterative Linear Quadratic Regulator (iLQR) (Todorov, 2005). However, these are model based methods which require rather accurate knowledge of the dynamics and make restrictive assumptions concerning the differentiability of the system dynamics and the cost function.

Path integral optimal control allows more flexibility than these related methods. The concept of an action can be viewed in a broader sense. Essentially, we consider any input to the control system as an action, not unlike the inputs to a transfer function in classical linear control theory. The input can be a motor command, but it can also be anything else, for instance, a desired state that is subsequently converted to a motor command by some tracking controller, or a control gain (Buchli et al., 2010). As an example, consider a robotic system with rigid body dynamics (RBD) equations (Sciavicco and Siciliano, 2000) using a parameterized policy:

q̈ = M(q)^{−1} ( −C(q, q̇) − v(q) ) + M(q)^{−1} u,    (26)
u = G(q) ( θ + ε_t ),    (27)

where M is the RBD inertia matrix, C are Coriolis and centripetal forces, and v denotes gravity forces. The state of the robot is described by the joint angles q and joint velocities q̇. The policy (27) is linearly parameterized by θ, with basis function matrix G; one would assume that the dimensionality of θ is significantly larger than that of q to assure sufficient expressive power of this parameterized policy. Inserting (27) into (26) results in a differential equation that is compatible with the system equations (3) for path integral optimal control:

q̈ = f(q, q̇) + G̃(q) ( θ + ε_t ),    (28)

where f(q, q̇) = M(q)^{−1} ( −C(q, q̇) − v(q) ) and G̃(q) = M(q)^{−1} G(q). This is a typical example where the policy directly represents motor commands. Alternatively, we could create another form of control structure for the RBD system:

q̈ = M(q)^{−1} ( −C(q, q̇) − v(q) ) + M(q)^{−1} u,
u = K_P ( q_d − q ) + K_D ( q̇_d − q̇ ),
q̇_d = G(q_d, q̇_d) ( θ + ε_t ).    (29)

Here, a Proportional-Derivative (PD) controller with positive definite gain matrices K_P and K_D converts a desired trajectory (q_d, q̇_d) into a motor command u. In contrast to the previous example, the parameterized policy generates the desired trajectory in (29), and the differential equation for the desired trajectory is compatible with the path integral formalism.

What we would like to emphasize is that the control system's structure is left to the creativity of its designer, and that path integral optimal control can be applied on various levels. Importantly, as developed in Section 2.3, only the controlled differential equations of the entire control system contribute to the path integral formalism, that is, (28) in the first example, or (29) in the second example. And only these controlled differential equations need to be known for applying path integral optimal control; none of the variables of the uncontrolled equations is ever used.

At this point, we make a very important transition from model-based to model-free learning. In the example of (28), the dynamics model of the control system needs to be known to apply path integral optimal control, as this is a controlled differential equation. In contrast, in (29), the system dynamics are in an uncontrolled differential equation, and are thus irrelevant for applying path integral optimal control. In this case, only knowledge of the desired trajectory dynamics is needed, which is usually created by the system designer. Thus, we obtain a model-free learning system.

3.3 Dynamic Movement Primitives as Generalized Policies

As we are interested in model-free learning, we follow the control structure of the 2nd example of the previous section, that is, we optimize control policies which represent desired trajectories. We use Dynamic Movement Primitives (DMPs) (Ijspeert et al., 2003) as a special case of parameterized policies, which are expressed by the differential equations:

τ ż_t = f_t + g_t^T ( θ + ε_t ),    (30)
τ ẏ_t = z_t,
τ ẋ_t = −α x_t,
f_t = α_z ( β_z ( g − y_t ) − z_t ).

Essentially, these policies code a learnable point attractor for a movement from y_{t_0} to the goal g, where θ determines the shape of the attractor; y_t, ẏ_t denote the position and velocity of the trajectory, while z_t, x_t are internal states; α_z, β_z, τ are time constants. The basis functions g_t ∈ R^p are defined by a piecewise linear function approximator with Gaussian weighting kernels, as suggested in Schaal and Atkeson (1998):

[g_t]_j = ( w_j x_t / Σ_{k=1}^{p} w_k ) ( g − y_0 ),    (31)
w_j = exp( −0.5 h_j ( x_t − c_j )² ),

with bandwidth h_j and center c_j of the Gaussian kernels; for more details see Ijspeert et al. (2003). The DMP representation is advantageous as it guarantees attractor properties towards the goal while remaining linear in the parameters θ of the function approximator. By varying the parameter θ the shape of the trajectory changes while the goal state g and initial state y_{t_0} remain fixed. These properties facilitate learning (Peters and Schaal, 2008a).
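A minimal sketch of integrating the DMP (30) with the basis functions (31) is given below. The gains, time constants, and kernel placements are illustrative choices of ours, not values prescribed by the paper:

```python
import numpy as np

def integrate_dmp(theta, centers, bandwidths, g, y0, tau=0.5, dt=0.002,
                  alpha=8.0, alpha_z=12.0, beta_z=3.0, eps=None):
    """Euler integration of the DMP (30)-(31); eps is optional per-step noise."""
    N = int(tau / dt)
    x, z, y = 1.0, 0.0, y0                        # phase, scaled velocity, position
    Y = np.zeros(N)
    for i in range(N):
        w = np.exp(-0.5 * bandwidths * (x - centers) ** 2)
        g_t = w * x / np.sum(w) * (g - y0)        # basis functions, cf. (31)
        e = np.zeros_like(theta) if eps is None else eps[i]
        f = alpha_z * (beta_z * (g - y) - z)
        z += dt * (f + g_t @ (theta + e)) / tau   # tau*zdot = f + g^T(theta+eps)
        y += dt * z / tau                         # tau*ydot = z
        x += dt * (-alpha * x) / tau              # tau*xdot = -alpha*x
        Y[i] = y
    return Y

# Example: 10 basis functions, zero parameters -> plain point attractor to g
traj = integrate_dmp(theta=np.zeros(10), centers=np.linspace(1, 0.01, 10),
                     bandwidths=np.full(10, 50.0), g=1.0, y0=0.0)
print(traj[-1])   # approaches the goal g = 1
```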

3.4 Policy Improvements with Path Integrals: The PI² Algorithm

As can be easily recognized, the DMP equations are of the form of our control system (3), with only one controlled equation and a one dimensional actuated state. This case has been treated in Section 2.5.1. The motor commands are replaced with the parameters θ (the issue of time dependent vs. constant parameters will be addressed below). More precisely, the DMP equations can be written as:

( ẋ_t ; ẏ_t ; ż_t ) = ( −α x_t ; z_t ; α_z ( β_z ( g − y_t ) − z_t ) ) + ( 0_{1×p} ; 0_{1×p} ; g_t^{(c)T} ) ( θ_t + ε_t ).

The state of the DMP is partitioned into the controlled part x^{(c)}_t = z_t and the uncontrolled part x^{(m)}_t = (x_t, y_t)^T. The control transition matrix depends on the state; however, it depends only on one of the state variables of the uncontrolled part of the state, that is, x_t. The path cost for the stochastic dynamics of the DMPs is given by:

S(τ_i) = φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} dt + (1/2) Σ_{j=i}^{N−1} ‖ (x^{(c)}_{t_{j+1}} − x^{(c)}_{t_j})/dt − f^{(c)}_{t_j} ‖²_{H_{t_j}^{−1}} dt
  ∝ φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} + (1/2) Σ_{j=i}^{N−1} ‖ g^{(c)T}_{t_j} ( θ + ε_{t_j} ) ‖²_{H_{t_j}^{−1}}
  = φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} + (1/2) Σ_{j=i}^{N−1} ( θ + ε_{t_j} )^T g^{(c)}_{t_j} H_{t_j}^{−1} g^{(c)T}_{t_j} ( θ + ε_{t_j} )
  = φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} + (1/2) Σ_{j=i}^{N−1} ( θ + ε_{t_j} )^T M^T_{t_j} R M_{t_j} ( θ + ε_{t_j} ),    (32)

with M_{t_j} = ( R^{−1} g_{t_j} g^T_{t_j} ) / ( g^T_{t_j} R^{−1} g_{t_j} ); H_{t_j} becomes a scalar given by H_{t_j} = g^{(c)T}_{t_j} R^{−1} g^{(c)}_{t_j}. Interestingly, the term (λ/2) Σ_{j=i}^{N−1} log |H_{t_j}| for the case of DMPs depends only on x_t, which is a deterministic variable and can therefore be ignored, since it is the same for all sampled paths. We also absorbed, without loss of generality, the time step dt in the cost terms. Consequently, the fundamental result of the path integral stochastic optimal control problem for the case of DMPs is expressed as:

u_{t_i} = ∫ P(τ_i) u_L(τ_i) dτ^{(c)}_i,    (33)

where the probability P(τ_i) and local controls u_L(τ_i) are defined as

P(τ_i) = e^{ −(1/λ) S(τ_i) } / ∫ e^{ −(1/λ) S(τ_i) } dτ_i,
u_L(τ_i) = ( R^{−1} g^{(c)}_{t_i} g^{(c)T}_{t_i} / ( g^{(c)T}_{t_i} R^{−1} g^{(c)}_{t_i} ) ) ε_{t_i},

and the path cost given as

S(τ_i) = φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} + (1/2) Σ_{j=i}^{N−1} ε^T_{t_j} M^T_{t_j} R M_{t_j} ε_{t_j}.

Note that θ = 0 in these equations, that is, the parameters are initialized to zero. These equations correspond to the case where the stochastic optimal control problem is solved with one evaluation of the optimal controls (33) using dense sampling of the whole state space under the passive dynamics (i.e., θ = 0), which requires a significant amount of exploration noise. Such an approach was pursued in the original work by Kappen (2007) and Broek et al. (2008), where a potentially large number of sample trajectories was needed to achieve good results. Extending this sampling approach to high dimensional spaces, however, is daunting, as with very high probability we would sample primarily rather useless trajectories. Thus, biasing sampling towards good initial conditions seems to be mandatory for high dimensional applications.

Thus, we consider only local sampling and an iterative update procedure. Given a current guess of θ, we generate sample roll-outs using stochastic parameters θ + ε_t at every time step. To see how the generalized path integral formulation is modified for the case of iterative updating, we start with the equations of the update of the parameter vector θ, which can be written as:

θ^{(new)}_{t_i} = ∫ P(τ_i) ( R^{−1} g_{t_i} g^T_{t_i} ( θ + ε_{t_i} ) / ( g^T_{t_i} R^{−1} g_{t_i} ) ) dτ_i
  = ∫ P(τ_i) ( R^{−1} g_{t_i} g^T_{t_i} ε_{t_i} / ( g^T_{t_i} R^{−1} g_{t_i} ) ) dτ_i + ( R^{−1} g_{t_i} g^T_{t_i} / ( g^T_{t_i} R^{−1} g_{t_i} ) ) θ
  = δθ_{t_i} + M_{t_i} θ.    (34)

The correction parameter vector δθ_{t_i} is defined as δθ_{t_i} = ∫ P(τ_i) ( R^{−1} g_{t_i} g^T_{t_i} ε_{t_i} / ( g^T_{t_i} R^{−1} g_{t_i} ) ) dτ_i. It is important to note that θ^{(new)}_{t_i} is now time dependent, that is, for every time step t_i, a different optimal parameter vector is computed. In order to return to one single time independent parameter vector θ^{(new)}, the vectors θ^{(new)}_{t_i} need to be averaged over time. We start with a first tentative suggestion of averaging over time, and then explain why it is inappropriate and what the correct way of time averaging has to look like. The tentative and most intuitive time average is:

θ^{(new)} = (1/N) Σ_{i=0}^{N−1} θ^{(new)}_{t_i} = (1/N) Σ_{i=0}^{N−1} δθ_{t_i} + (1/N) Σ_{i=0}^{N−1} M_{t_i} θ.

Thus, we would update θ based on two terms. The first term is the average of δθ_{t_i}, which is reasonable as it reflects the knowledge we gained from the exploration noise. However, there would be a second update term due to the average over projected mean parameters θ from every time step. It should be noted that M_{t_i} is a projection matrix onto the range space of g_{t_i} under the metric R^{−1}, such that a multiplication with M_{t_i} can only shrink the norm of θ. From the viewpoint of having optimal parameters for every time step, this update component is reasonable as it trivially eliminates the part of the parameter vector that lies in the null space of g_{t_i} and which contributes to the command cost of a trajectory in a useless way.

From the viewpoint of a parameter vector that is constant and time independent and that is updated iteratively, however, this second update is undesirable, as the multiplication of the parameter vector θ with M_{t_i} in (34) and the averaging operation over the time horizon reduce the L2 norm of the parameters at every iteration, potentially in an uncontrolled way. (To be precise, θ would be projected and continue shrinking until it lies in the intersection of all null spaces of the g_{t_i} basis functions; this null space can easily be of measure zero.) What we rather want is to achieve convergence when the average of δθ_{t_i} becomes zero, and we do not want to continue updating due to the second term.

The problem is avoided by eliminating the projection matrix in the second term of the averaging, such that it becomes:

θ^{(new)} = (1/N) Σ_{i=0}^{N−1} δθ_{t_i} + (1/N) Σ_{i=0}^{N−1} θ = (1/N) Σ_{i=0}^{N−1} δθ_{t_i} + θ.

The meaning of this reduced update is simply that we keep a component in θ that is irrelevant and contributes to our trajectory cost in a useless way. However, this irrelevant component will not prevent us from reaching the optimal effective solution, that is, the solution that lies in the range space of g_{t_i}. Given this modified update, it is, however, also necessary to derive a compatible cost function. As mentioned before, in the unmodified scenario, the last term of (32) is:

(1/2) Σ_{j=i}^{N−1} ( θ + ε_{t_j} )^T M^T_{t_j} R M_{t_j} ( θ + ε_{t_j} ).

To avoid a projection of θ, we modify this cost term to be:

(1/2) Σ_{j=i}^{N−1} ( θ + M_{t_j} ε_{t_j} )^T R ( θ + M_{t_j} ε_{t_j} ).

With this modified cost term, the path integral formalism results in the desired θ^{(new)}_{t_i} without the M_{t_i} projection of θ. The main equations of the iterative version of the generalized path integral formulation, called Policy Improvement with Path Integrals (PI²), can be summarized as:

P(τ_i) = e^{ −(1/λ) S(τ_i) } / ∫ e^{ −(1/λ) S(τ_i) } dτ_i,    (35)

S(τ_i) = φ_{t_N} + Σ_{j=i}^{N−1} q_{t_j} dt + (1/2) Σ_{j=i}^{N−1} ( θ + M_{t_j} ε_{t_j} )^T R ( θ + M_{t_j} ε_{t_j} ) dt,    (36)

δθ_{t_i} = ∫ P(τ_i) M_{t_i} ε_{t_i} dτ_i,    (37)

[δθ]_j = ( Σ_{i=0}^{N−1} (N − i) w_{j,t_i} [δθ_{t_i}]_j ) / ( Σ_{i=0}^{N−1} w_{j,t_i} (N − i) ),    (38)

θ^{(new)} = θ^{(old)} + δθ.

Essentially, (35) computes a discrete probability at time t_i of each trajectory roll-out with the help of the cost (36). For every time step of the trajectory, a parameter update is computed in (37) based on a probability weighted average over trajectories.

The parameter updates at every time step are finally averaged in (38). Note that we chose a weighted average by giving every parameter update a weight according to the time steps left in the trajectory and the activation of the kernel in (31). (The use of the kernel weights in the basis functions (31) for the purpose of time averaging has shown better performance with respect to other weighting approaches across all of our experiments; therefore this is the weighting that we suggest, and users may develop other weighting schemes more suitable to their needs.) This average can be interpreted as using a function approximator with only a constant offset parameter vector to approximate the time dependent parameters. Giving early points in the trajectory a higher weight is useful since their parameters affect a larger time horizon and thus higher trajectory costs. Other function approximation (or averaging) schemes could be used to arrive at a final parameter update; we preferred this simple approach as it gave very good learning results. The final parameter update is θ^{(new)} = θ^{(old)} + δθ.

The parameter λ regulates the sensitivity of the exponentiated cost and can automatically be optimized for every time step i to maximally discriminate between the experienced trajectories. More precisely, a constant term can be subtracted from (36) as long as all S(τ_i) remain positive; this constant term cancels in (35). Thus, for a given number of roll-outs, we compute the exponential term in (35) as

exp( −(1/λ) S(τ_i) ) = exp( −h ( S(τ_i) − min S(τ_i) ) / ( max S(τ_i) − min S(τ_i) ) ),

with h set to a constant, which we chose to be h = 10 in all our evaluations; the max and min operators are over all sample roll-outs. (In fact, the term inside the exponent results from adding h min S(τ_i) / ( max S(τ_i) − min S(τ_i) ), which cancels in (35), to the term −h S(τ_i) / ( max S(τ_i) − min S(τ_i) ), which is equal to −(1/λ) S(τ_i).) This procedure eliminates λ and leaves the variance of the exploration noise ε as the only open algorithmic parameter for PI². It should be noted that the equations for PI² have no numerical pitfalls: no matrix inversions and no learning rates (R is a user design parameter and usually chosen to be diagonal and invertible), rendering PI² very easy to use in practice.

The pseudocode for the final PI² algorithm for a one dimensional control system with function approximation is given in Table 2. A tutorial Matlab example of applying PI² can be found online.
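The equations (35)-(38), together with the λ-free exponentiation just described, translate almost line by line into code. The following is a minimal sketch of one PI² parameter update; for brevity it assumes R = I (so that M_t ε_t reduces to g (gᵀε)/(gᵀg)) and it omits the kernel weights w_{j,t_i} in the time averaging, keeping only the (N − i) weighting:

```python
import numpy as np

def pi2_update(theta, S, eps, g, h=10.0):
    """One PI^2 parameter update, cf. (35)-(38) and Table 2, assuming R = I.
    S:   (K, N) cost-to-go S(tau_i) for K roll-outs and N time steps
    eps: (K, N, p) exploration noise; g: (K, N, p) basis activations."""
    K, N, p = eps.shape
    # (35) with the lambda-free exponentiation, h = 10
    expS = np.exp(-h * (S - S.min(0)) / (S.max(0) - S.min(0) + 1e-10))
    P = expS / expS.sum(0)                                   # P(tau_i), shape (K, N)
    # (37): delta_theta_{t_i} = sum_k P(tau_{i,k}) M_{t_i,k} eps_{t_i,k},
    # with M eps = g (g^T eps) / (g^T g) for R = I
    Meps = g * (np.einsum('knp,knp->kn', g, eps) /
                (np.einsum('knp,knp->kn', g, g) + 1e-10))[..., None]
    dtheta_t = np.einsum('kn,knp->np', P, Meps)              # shape (N, p)
    # (38): time average weighted by the steps left in the trajectory
    w = (N - np.arange(N)).astype(float)
    return theta + (w[:, None] * dtheta_t).sum(0) / w.sum()

# Toy usage with K = 8 roll-outs, N = 20 steps, p = 5 parameters
rng = np.random.default_rng(0)
print(pi2_update(np.zeros(5), rng.random((8, 20)),
                 0.1 * rng.standard_normal((8, 20, 5)), rng.random((8, 20, 5))))
```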

Given:
- An immediate cost function r_t = q_t + θ_t^T R θ_t (cf. 1)
- A terminal cost term φ_{t_N} (cf. 1)
- A stochastic parameterized policy a_t = g_t^T ( θ + ε_t ) (cf. 25)
- The basis function g_{t_i} from the system dynamics (cf. 3 and Section 2.5.1)
- The variance Σ_ε of the mean-zero noise ε_t
- The initial parameter vector θ

Repeat until convergence of the trajectory cost R:
- Create K roll-outs of the system from the same start state x_0 using stochastic parameters θ + ε_t at every time step
- For k = 1...K, compute:
  P(τ_{i,k}) = e^{ −(1/λ) S(τ_{i,k}) } / Σ_{k=1}^{K} e^{ −(1/λ) S(τ_{i,k}) }
  S(τ_{i,k}) = φ_{t_N,k} + Σ_{j=i}^{N−1} q_{t_j,k} + (1/2) Σ_{j=i+1}^{N−1} ( θ + M_{t_j,k} ε_{t_j,k} )^T R ( θ + M_{t_j,k} ε_{t_j,k} )
  M_{t_j,k} = ( R^{−1} g_{t_j,k} g^T_{t_j,k} ) / ( g^T_{t_j,k} R^{−1} g_{t_j,k} )
- For i = 1...(N−1), compute:
  δθ_{t_i} = Σ_{k=1}^{K} [ P(τ_{i,k}) M_{t_i,k} ε_{t_i,k} ]
- Compute [δθ]_j = ( Σ_{i=0}^{N−1} (N − i) w_{j,t_i} [δθ_{t_i}]_j ) / ( Σ_{i=0}^{N−1} w_{j,t_i} (N − i) )
- Update θ ← θ + δθ
- Create one noiseless roll-out to check the trajectory cost R = φ_{t_N} + Σ_{i=0}^{N−1} r_{t_i} (in case the noise cannot be turned off, that is, a stochastic system, multiple roll-outs need to be averaged)

Table 2: Pseudocode of the PI² algorithm for a 1D parameterized policy. Note that the discrete time step dt was absorbed as a constant multiplier in the cost terms.

4. Related Work

In the next sections we discuss related work in the areas of stochastic optimal control and reinforcement learning, and analyze the connections and differences with the PI² algorithm and the generalized path integral control formulation.

4.1 Stochastic Optimal Control and Path Integrals

The path integral formalism for optimal control was introduced in Kappen (2005a,b). In this work, the role of noise in symmetry breaking phenomena was investigated in the context of stochastic optimal control. In Kappen et al. (2007), Wiegerinck et al. (2006), and Broek et al. (2008), the path integral formalism is extended to the stochastic optimal control of multi-agent systems.

Recent work on stochastic optimal control by Todorov (2008), Todorov (2007) and Todorov (2009b) shows that for a class of discrete stochastic optimal control problems, the Bellman equation can be written as the KL divergence between the probability distributions of the controlled and uncontrolled dynamics. Furthermore it is shown that the class of discrete KL divergence control problems is equivalent to the continuous stochastic optimal control formalism with quadratic control cost and under the presence of Gaussian noise. In Kappen et al. (2009), the KL divergence control formalism is considered and is transformed into a probabilistic inference problem. In all this aforementioned work, both in the path integral formalism as well as in KL divergence control, the class of stochastic dynamical systems under consideration is rather restrictive since the control transition matrix is state independent.

20 THEODOROU, BUCHLI AND SCHAAL control transton matrx s state ndependent Moreover, the connecton to drect polcy learnng n RL and model-free learnng was not made n any of the prevous projects Our PI 2 algorthm dffers wth respect to the aforementoned work n the followng ponts In Todorov 2009b the stochastc optmal control problem s nvestgated for dscrete acton - state spaces and therefore s treated as Markov Decson Process MDP To apply our PI 2 algorthm, we do not dscretze the state space and we do not treat the problem as an MDP Instead we work n contnuous state - acton spaces whch are sutable for performng RL n hgh dmensonal robotc systems To the best of our knowledge, our results present RL n one of the most hgh dmensonal contnuous state acton spaces In our dervatons, the probablstc nterpretaton of control comes drectly from the Feynman- Kac Lemma Thus we do not have to mpose any artfcal pseudo-probablty treatment of the cost as n Todorov 2009b In addton, for the contnuous state - acton spaces we do not have to learn the value functon as s suggested n Todorov 2009b va Z-learnng Instead we drectly fnd the controls based on our generalzaton of optmal controls In the prevous work, the problem of how to sample trajectores s not addressed Samplng s performed at once wth the hope to cover the all state space We follow a rather dfferent approach that allows to attack robotc learnng problems of the complexty and dmensonalty of the lttle dog robot The work n Todorov 2009a consders stochastc dynamcs wth state dependent control matrx However, the way of how the stochastc optmal control problem s solved s by mposng strong assumptons on the structure of the cost functon and, therefore, restrctons of the proposed soluton to specal cases of optmal control problems The use of ths specfc cost functon allows transformng the stochastc optmal control problem to a determnstc optmal control problem Under ths transformaton, the stochastc optmal control problem can be solved by usng determnstc algorthms Wth respect to the work n Broek et al 2008, Wegernck et al 2006 and Kappen et al 2009 our PI 2 algorthm has been derved for a rather general class of systems wth control transton matrx thas state dependent In ths general class, Rgd body and mult-body dynamcs as well as the DMPs are ncluded Furthermore we have shown how our results generalze prevous work 42 Renforcement Learnng of Parameterzed Polces There are two man classes of related algorthms: Polcy Gradent algorthms and probablstc algorthms Polcy Gradent algorthms Peters and Schaal, 2006a,b compute the gradent of the cost functon 24 at every teraton and the polcy parameters are updated accordng to θ new = θ old + α θ J Some well-establshed algorthms, whch we wll also use for comparsons, are as follows see also Peters and Schaal, 2006a,b 42 REINFORCE Wllams 992 ntroduced the epsodc REINFORCE algorthm, whch s derved from takng the dervatve of 24 wth respect to the polcy parameters Ths algorthm has rather slow convergence 356

It is also very sensitive to a reward baseline parameter b_k (see below). Recent work derived the optimal baseline for REINFORCE (cf. Peters and Schaal, 2008a), which improved the performance significantly. The episodic REINFORCE update equations are:

∇_{θ_k} J = E_{τ_0}[ ( R(τ_0) − b_k ) Σ_{i=0}^{N−1} ∇_{θ_k} ln p( a_{t_i} | x_{t_i} ) ],
b_k = E_{τ_0}[ ( Σ_{i=0}^{N−1} ∇_{θ_k} ln p( a_{t_i} | x_{t_i} ) )² R(τ_0) ] / E_{τ_0}[ ( Σ_{i=0}^{N−1} ∇_{θ_k} ln p( a_{t_i} | x_{t_i} ) )² ],

where k denotes the k-th coefficient of the parameter vector and R(τ_0) = (1/N) Σ_{i=0}^{N−1} r_{t_i}.

4.2.2 GPOMDP and the Policy Gradient Theorem Algorithm

In their GPOMDP algorithm, Baxter and Bartlett (2001) introduced several improvements over REINFORCE that made the gradient estimates more efficient. GPOMDP can also be derived from the policy gradient theorem (Sutton et al., 2000; Peters and Schaal, 2008a), and an optimal reward baseline can be added (cf. Peters and Schaal, 2008a). In our context, the GPOMDP learning algorithm can be written as:

∇_{θ_k} J = E_{τ_0}[ Σ_{j=0}^{N−1} ( r_{t_j} − b_k^{t_j} ) Σ_{i=0}^{j} ∇_{θ_k} ln p( a_{t_i} | x_{t_i} ) ],
b_k^{t_j} = E_{τ_0}[ ( ∇_{θ_k} ln p( a_{t_j} | x_{t_j} ) )² r_{t_j} ] / E_{τ_0}[ ( ∇_{θ_k} ln p( a_{t_j} | x_{t_j} ) )² ].

4.2.3 The Episodic Natural Actor Critic

One of the most efficient policy gradient algorithms was introduced in Peters and Schaal (2008b), called the Episodic Natural Actor Critic (eNAC). In essence, the method uses the Fisher Information Matrix to project the REINFORCE gradient onto a more effective update direction, which is motivated by the theory of natural gradients by Amari (1999). The eNAC algorithm takes the form of:

ξ_{t_i} = [ ∇_θ ln p( a_{t_i} | x_{t_i} )^T, 1 ]^T,
[ ∇_θ J^T, J_0 ]^T = E_{τ_0}[ Σ_{i=0}^{N−1} ξ_{t_i} ξ_{t_i}^T ]^{−1} E_{τ_0}[ R(τ_0) Σ_{i=0}^{N−1} ξ_{t_i} ],

where J_0 is a constant offset term.
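For comparison with the PI² update, the episodic REINFORCE gradient above can be estimated from sampled roll-outs as follows. The sketch assumes the policy (25) with a fixed scalar action variance and omits the baseline b_k; all names and values are our own illustration:

```python
import numpy as np

def reinforce_gradient(returns, actions, means, g, sigma2):
    """Episodic REINFORCE estimate of grad_theta J (baseline omitted).
    returns: (K,) return R(tau_0) per roll-out; actions, means: (K, N);
    g: (K, N, p) basis activations; sigma2: scalar action variance."""
    # grad_theta ln p(a_t|x_t) = g_t (a_t - g_t^T theta) / sigma^2 for a Gaussian policy
    grad_logp = g * ((actions - means) / sigma2)[..., None]      # (K, N, p)
    return np.mean(returns[:, None] * grad_logp.sum(axis=1), axis=0)

rng = np.random.default_rng(1)
K, N, p = 10, 50, 3
g = rng.random((K, N, p)); theta = np.zeros(p)
means = np.einsum('knp,p->kn', g, theta)
actions = means + 0.3 * rng.standard_normal((K, N))
returns = -np.sum(actions ** 2, axis=1)                          # toy return
print(reinforce_gradient(returns, actions, means, g, sigma2=0.09))
```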

4.2.4 PoWER

The PoWER algorithm (Kober and Peters, 2008) is a probabilistic policy improvement method, not a gradient algorithm. It is derived from an Expectation-Maximization framework using probability matching (Dayan and Hinton, 1997; Peters and Schaal, 2008c). Using the notation of this paper, the parameter update of PoWER becomes:

δθ = E_{τ_0}[ Σ_{i=0}^{N−1} ( g_{t_i} g^T_{t_i} / ( g^T_{t_i} g_{t_i} ) ) R_{t_i} ]^{−1} E_{τ_0}[ Σ_{t_i=t_0}^{t_N} ( g_{t_i} g^T_{t_i} / ( g^T_{t_i} g_{t_i} ) ) R_{t_i} ε_{t_i} ],    (39)

where R_{t_i} = Σ_{j=i}^{N−1} r_{t_j}. If we set R = c I in the update (37) of PI², and set g_{t_i} g^T_{t_i} = I in the matrix inversion term of (39), the two algorithms look essentially identical. But it should be noted that the rewards r_{t_i} in PoWER need to behave like an improper probability, that is, be strictly positive and integrate to a constant number; this property can make the design of suitable cost functions more complicated. PI², in contrast, uses the exponentiated sum of reward terms, where the immediate reward can be arbitrary, and only the cost on the motor commands needs to be quadratic. Our empirical evaluations revealed that, for cost functions that share the same optimum in the PoWER pseudo-probability formulation and the PI² notation, both algorithms perform essentially identically, indicating that the matrix inversion term in PoWER may be unimportant for many systems. It should be noted that in Vlassis et al. (2009), PoWER was extended to the discounted infinite horizon case, where PoWER is the special case of a non-discounted finite horizon problem.

5. Evaluations

We evaluated PI² on several synthetic examples in comparison with REINFORCE, GPOMDP, eNAC, and, when possible, PoWER. Except for PoWER, all algorithms are suitable for optimizing immediate reward functions of the kind r_t = q_t + u_t^T R u_t. As mentioned above, PoWER requires that the immediate reward behaves like an improper probability. This property is incompatible with r_t = q_t + u_t^T R u_t and requires special nonlinear transformations, which usually change the nature of the optimization problem, such that PoWER optimizes a different cost function. Thus, only one of the examples below has a cost function compatible with all algorithms, including PoWER. In all examples below, exploration noise and, when applicable, learning rates were tuned for every individual algorithm to achieve the best possible numerically stable performance. Exploration noise was only added to the maximally activated basis function in a motor primitive (that is, the noise vector in (25) has only one non-zero component), and the noise was kept constant for the entire time that this basis function had the highest activation (empirically, this trick helped improve the learning speed of all algorithms).

5.1 Learning Optimal Performance of a 1 DOF Reaching Task

The first evaluation considers learning optimal parameters for a 1 DOF DMP (cf. Equation 30). The immediate cost and terminal cost are, respectively:

r_t = 0.5 f_t² + 0.5 θ^T θ,    φ_{t_N} = 10000 ẏ²_{t_N} + 10 ( g − y_{t_N} )²,

with y_{t_0} = 0 and g = 1 (we use radians as units, motivated by our interest in robotics applications, but we could also avoid units entirely). The interpretation of this cost is that we would like to reach the goal g with high accuracy while minimizing the acceleration of the movement and while keeping the parameter vector short. Each algorithm was run for 15 trials to compute a parameter update, and a total of 1000 updates were performed.

Note that 15 trials per update were chosen as the DMP had 10 basis functions, and the eNAC requires at least 11 trials to perform a numerically stable update due to its matrix inversion. The motor primitives were initialized to approximate a 5th order polynomial as point-to-point movement (cf. Figure 1a,b), called a minimum-jerk trajectory in the motor control literature; the movement duration was 0.5 seconds, which is similar to normal human reaching movements. Gaussian noise of N(0, 0.1) was added to the initial parameters of the movement primitives in order to have different initial conditions for every run of the algorithms. The results are given in Figure 1. Figures 1a,b show the initial (before learning) trajectory generated by the DMP together with the learning results of the four different algorithms after learning; essentially, all algorithms achieve the same result, such that all trajectories lie on top of each other. In Figure 1c, however, it can be seen that PI² outperforms the gradient algorithms by an order of magnitude. Figure 1d illustrates learning curves for the same task as in Figure 1c, just that parameter updates are computed already after two roll-outs (the eNAC was excluded from this evaluation as it would be too heuristic to stabilize its ill-conditioned matrix inversion that results from such few roll-outs). PI² continues to converge much faster than the other algorithms even in this special scenario. However, there are some noticeable fluctuations after convergence. This noise around the convergence baseline is caused by using only two noisy roll-outs to continue updating the parameters, which causes continuous parameter fluctuations around the optimal parameters. Annealing the exploration noise, or just adding the optimal trajectory from the previous parameter update as one of the roll-outs for the next parameter update, can alleviate this issue (we do not illustrate such little tricks in this paper as they really only affect the fine tuning of the algorithm).

5.2 Learning Optimal Performance of a 1 DOF Via-Point Task

The second evaluation was identical to the first evaluation, just that the cost function now forced the movement to pass through an intermediate via-point at t = 300ms. This evaluation is an abstract approximation of hitting a target, for example, as in playing tennis, and requires a significant change in how the movement is performed relative to the initial trajectory (Figure 2a). The cost function was

r_{300ms} = ( G − y_{t_{300ms}} )²,    φ_{t_N} = 0,

with G = 0.25. Only this single reward was given. For this cost function, the PoWER algorithm can be applied, too, with cost function r̃_{300ms} = exp( −(1/λ) r_{300ms} ) and r̃_t = 0 otherwise. This transformed cost function has the same optimum as r_{300ms}. The resulting learning curves are given in Figure 2 and resemble the previous evaluation: PI² outperforms the gradient algorithms by roughly an order of magnitude, while all the gradient algorithms have almost identical learning curves. As was expected from the similarity of the update equations, PoWER and PI² have in this special case the same performance and are hardly distinguishable in Figure 2. Figure 2a demonstrates that all algorithms pass through the desired target G, but that there are remaining differences between the algorithms in how they approach the target G; these differences have a small numerical effect on the final cost (where PI² and PoWER have the lowest cost), but the differences are hardly task relevant.

5.3 Learning Optimal Performance of a Multi-DOF Via-Point Task

A third evaluation examined the scalability of our algorithms to a high-dimensional and highly redundant learning problem. Again, the learning task was to pass through an intermediate target G, just that a d = 2, 10, or 50 dimensional motor primitive was employed.

Figure 1: Comparison of reinforcement learning of an optimized movement with motor primitives. (a) Position trajectories [rad] over time [s] of the initial trajectory (before learning) and the results of all algorithms (PI², REINFORCE, PG, NAC) after learning; the different algorithms are essentially indistinguishable. (b) The same as (a), just using the velocity trajectories [rad/s]. (c) Average learning curves (cost vs. number of roll-outs) for the different algorithms with 1 std error bars from averaging 10 runs for each of the algorithms. (d) Learning curves for the different algorithms when only two roll-outs are used per update (note that the eNAC cannot work in this case and is omitted).

We assume that the multi-DOF systems model planar robot arms, where d links of equal length l = 1/d are connected in an open chain with revolute joints. Essentially, these robots look like a multi-segment snake in a plane, where the tail of the snake is fixed at the origin of the 2D coordinate system, and the head of the snake can be moved in the 2D plane by changing the joint angles between all the links. Figures 3b,d,f illustrate the movement over time of these robots: the initial position of the robots is when all joint angles are zero and the robot arm completely coincides with the x-axis of the coordinate frame. The goal states of the motor primitives command each DOF to move to a joint angle such that the entire robot configuration afterwards looks like a semi-circle where the most distal link of the robot (the end-effector) touches the y-axis.

Figure 2: Comparison of reinforcement learning of an optimized movement with motor primitives for passing through an intermediate target G. (a) Position trajectories of the initial trajectory (before learning) and the results of all algorithms after learning. (b) Average learning curves for the different algorithms with std error bars from averaging 10 runs for each of the algorithms.

The higher priority task, however, is to move the end-effector through a via-point G = (0.5, 0.5). To formalize this task as a reinforcement learning problem, we denote the joint angles of the robots as $\xi_i$, with $i = 1, 2, \ldots, d$, such that the first line of (30) now reads as $\dot{\xi}_{i,t} = f_{i,t} + g_{i,t}^T(\theta_i + \epsilon_{i,t})$ (this small change of notation is to avoid a clash of variables with the x, y task space of the robot). The end-effector position is computed as:

$x_t = \frac{1}{d} \sum_{i=1}^{d} \cos\Big(\sum_{j=1}^{i} \xi_{j,t}\Big), \qquad y_t = \frac{1}{d} \sum_{i=1}^{d} \sin\Big(\sum_{j=1}^{i} \xi_{j,t}\Big).$

The immediate reward function for this problem is defined as

$r_t = \frac{\sum_{i=1}^{d} (d+1-i)\,\big(0.1\, f_{i,t}^2 + 0.5\, \theta_i^T \theta_i\big)}{\sum_{i=1}^{d} (d+1-i)},$    (39)

$r_{300ms} = \big(0.5 - x_{t_{300ms}}\big)^2 + \big(0.5 - y_{t_{300ms}}\big)^2, \qquad \phi_{t_N} = 0,$

where $r_{300ms}$ is added to $r_t$ at time t = 300 ms, that is, we would like to pass through the via-point at this time. The individual DOFs of the motor primitive were initialized as in the 1 DOF examples above. The cost term in (39) penalizes each DOF for using high accelerations and large parameter vectors, which is a critical component for achieving a good resolution of redundancy in the arm. Equation (39) also has a weighting term (d+1-i) that penalizes DOFs proximal to the origin more than those that are distal to the origin; intuitively, applied to human arm movements, this would mean that wrist movements are cheaper than shoulder movements, which is motivated by the fact that the wrist has much lower mass and inertia and is thus energetically more efficient to move.
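For concreteness, here is a small Python sketch (ours, not from the paper) of the planar-arm forward kinematics and the cost terms above; `f_t` stands for the per-DOF acceleration-like forcing term, `theta` holds the d stacked parameter vectors, and the via-point term is simply the squared end-effector distance to G = (0.5, 0.5), as written above.

```python
import numpy as np

def end_effector(xi_t):
    """End-effector (x, y) of a d-link planar arm with link lengths 1/d."""
    d = len(xi_t)
    cum = np.cumsum(xi_t)                # inner sums  sum_{j=1}^{i} xi_{j,t}
    return np.cos(cum).sum() / d, np.sin(cum).sum() / d

def immediate_cost(f_t, theta):
    """Weighted cost of Eq. (39): proximal DOFs (small i) cost more."""
    d = len(f_t)
    w = d + 1 - np.arange(1, d + 1)      # weighting term (d + 1 - i)
    per_dof = 0.1 * f_t**2 + 0.5 * np.einsum('ij,ij->i', theta, theta)
    return (w * per_dof).sum() / w.sum()

def viapoint_cost(xi_300ms, G=(0.5, 0.5)):
    """Squared end-effector distance to the via-point, added at t = 300 ms."""
    x, y = end_effector(xi_300ms)
    return (G[0] - x) ** 2 + (G[1] - y) ** 2

# Example with a 10-DOF arm in a semi-circle-like configuration:
d = 10
xi = np.full(d, np.pi / d)               # equal relative joint angles
print(end_effector(xi))                  # distal link ends up near the y-axis
print(viapoint_cost(xi))
print(immediate_cost(np.zeros(d), np.zeros((d, 10))))
```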

The results of this experiment are summarized in Figure 3. The learning curves in the left column demonstrate again that PI² has an order of magnitude faster learning performance than the other algorithms, irrespective of the dimensionality. PI² also converges to the lowest cost in all examples. (Table: final costs, mean ± std, for PI², REINFORCE, PG, and NAC on the 2-, 10-, and 50-DOF tasks; the numeric entries were not recoverable from this transcription.)

Figure 3 also illustrates the path taken by the end-effector before and after learning. All algorithms manage to pass through the via-point G appropriately, although the path, particularly before reaching the via-point, can be quite different across the algorithms. Given that PI² reached the lowest cost with low variance in all examples, it appears to have found the best solution. We also added a stroboscopic sketch of the robot arm for the PI² solution, which proceeds from the very right to the left as a function of time. It should be emphasized that there was absolutely no parameter tuning needed to achieve the PI² results, while all gradient algorithms required readjusting of learning rates for every example to achieve their best performance.

5.4 Application to Robot Learning

Figure 4 illustrates our application to a robot learning problem. The robot dog is to jump across a gap. The jump should make as much forward progress as possible, as this is a maneuver in a legged locomotion competition which scores the speed of the robot (note that we only used a physical simulator of the robot for this experiment, as the actual robot was not available). The robot has three DOFs per leg, and thus a total of d = 12 DOFs. Each DOF was represented as a DMP with 50 basis functions. An initial seed behavior (Figure 5, top) was taught by learning from demonstration, which allowed the robot to barely reach the other side of the gap without falling into it (the demonstration was generated from a manual adjustment of spline nodes in a spline-based trajectory plan for each leg). PI² learning used primarily the forward progress as a reward, and slightly penalized the squared acceleration of each DOF and the length of the parameter vector. Additionally, a penalty was incurred if the yaw or the roll exceeded a threshold value; these penalties encouraged the robot to jump straight forward and not to the side, and not to fall over.
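As a rough illustration of this policy representation, the sketch below (ours; a simplified single-DOF forcing term with normalized Gaussian bases, not the paper's full DMP equations, which also include a phase variable and transformation-system dynamics) shows how a 50-basis-function DMP parameterization is evaluated as $g_t^T(\theta + \epsilon_t)$ with parameter-space exploration noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n_basis = 50                                  # as used per DOF for the dog

centers = np.linspace(0.0, 1.0, n_basis)      # basis centers over a phase s
bandwidth = 2.0 * n_basis                     # shared width (our choice)

def basis(s):
    """Normalized Gaussian basis activations g_t at phase s in [0, 1]."""
    psi = np.exp(-bandwidth * (s - centers) ** 2)
    return psi / psi.sum()

def policy(s, theta, sigma=0.1):
    """Noisy policy evaluation g_t^T (theta + eps) for one DOF.

    eps is parameter-space exploration noise, the only open parameter of
    PI2-style learning.
    """
    eps = sigma * rng.standard_normal(n_basis)
    return basis(s) @ (theta + eps), eps

theta = np.zeros(n_basis)
value, eps = policy(0.3, theta)
```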

Figure 3: Comparison of learning multi-DOF movements (2, 10, and 50 DOFs) with planar robot arms passing through a via-point G. (a, c, e) illustrate the learning curves for the different RL algorithms, while (b, d, f) illustrate the end-effector movement after learning for all algorithms. Additionally, (b, d, f) also show the initial end-effector movement before learning to pass through G, and a stroboscopic visualization of the arm movement for the final result of PI² (the movements proceed in time, starting at the very right and ending by almost touching the y-axis).

Figure 4: Reinforcement learning of optimizing to jump over a gap with a robot dog. (a) Real and simulated robot dog. (b) Learning curve for the dog jump with PI² (±std). The improvement in cost corresponds to about a 15 cm improvement in jump distance, which changed the robot's behavior from an initial barely successful jump to a jump that completely traversed the gap with the entire body. This learned behavior allowed the robot to traverse a gap at much higher speed in a competition on learning locomotion. The experiments for this paper were conducted only on the robot simulator.

The exact cost function is:

$r_t = r_{roll} + r_{yaw} + \sum_{i=1}^{d} \big(a_1 f_{i,t}^2 + 0.5\, a_2\, \theta_i^T \theta_i\big), \qquad a_1 = 10^{-6}, \; a_2 = 10^{-8},$

$r_{roll} = \begin{cases} 100\,(|roll_t| - 0.3)^2, & \text{if } |roll_t| > 0.3 \\ 0, & \text{otherwise,} \end{cases}$

$r_{yaw} = \begin{cases} 100\,(|yaw_t| - 0.1)^2, & \text{if } |yaw_t| > 0.1 \\ 0, & \text{otherwise,} \end{cases}$

$\phi_{t_N} = 50000\,(goal - x_{nose})^2,$

where $roll_t$ and $yaw_t$ are the roll and yaw angles of the robot's body, and $x_{nose}$ is the position of the front tip (the "nose") of the robot in the forward direction, which is the direction towards the goal. The multipliers for each reward component were tuned to have a balanced influence of all terms. Ten learning trials were performed initially for the first parameter update. The best 5 trials were kept, and five additional new trials were performed for the second and all subsequent updates. Essentially, this method performs importance sampling, as the rewards for the 5 trials in memory were re-computed with the latest parameter vectors.
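The re-use of previous roll-outs can be sketched as follows (Python; a schematic of the keep-the-best-5 scheme described above, with `evaluate_cost` standing in for a full simulator roll-out and the weighting kept deliberately generic):

```python
import numpy as np

rng = np.random.default_rng(1)

def update(theta, memory, evaluate_cost,
           n_new=5, n_keep=5, sigma=0.05, lam=0.1):
    """One parameter update with roll-out re-use.

    memory holds (eps, cost) pairs from earlier updates; their costs are
    re-computed under the current theta, which is what makes the re-use
    an importance-sampling scheme.
    """
    # Re-evaluate the kept roll-outs with the latest parameter vector.
    memory = [(eps, evaluate_cost(theta + eps)) for eps, _ in memory]
    # Add fresh exploration roll-outs.
    for _ in range(n_new):
        eps = sigma * rng.standard_normal(theta.shape)
        memory.append((eps, evaluate_cost(theta + eps)))
    # Keep only the best n_keep roll-outs for the next update.
    memory = sorted(memory, key=lambda m: m[1])[:n_keep]
    # Cost-weighted averaging of the perturbations.
    S = np.array([c for _, c in memory])
    w = np.exp(-(S - S.min()) / lam)
    w /= w.sum()
    theta = theta + sum(wi * eps for wi, (eps, _) in zip(w, memory))
    return theta, memory
```

A first update would seed `memory` with the ten initial roll-outs described above.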

Figure 5: Sequence of images from the simulated robot dog jumping over a 14 cm gap. Top: before learning. Bottom: after learning. While the two sequences look quite similar at first glance, it is apparent that in the 4th frame the robot's body is significantly higher in the air, such that after landing, the body of the dog made about 15 cm more forward progress than before. In particular, the entire robot's body comes to rest on the other side of the gap, which allows for an easy transition to walking. In contrast, before learning, the robot's body and its hind legs are still on the right side of the gap, which does not allow for a successful continuation of walking.

A total of 100 trials was performed per run, and ten runs were collected for computing the means and standard deviations of the learning curves. Figure 4 illustrates that after about 30 trials (i.e., 5 updates), the performance of the robot had converged and significantly improved, such that after the jump, almost the entire body was lying on the other side of the gap. Figure 5 captures the temporal performance in a sequence of snapshots of the robot. It should be noted that applying PI² was algorithmically very simple, and that manual tuning only focused on generating a good cost function, which is a different research topic beyond the scope of this paper.

6 Discussion

This paper derived a more general version of stochastic optimal control with path integrals, based on the original work by Kappen (2007) and Broek et al. (2008). The key results were presented in Table 1 and Section 2.5, which considered how to compute the optimal controls for a general class of stochastic control systems with state-dependent control transition matrix. One important class of these systems can be interpreted in the framework of reinforcement learning with parameterized policies. For this class, we derived Policy Improvement with Path Integrals (PI²) as a novel algorithm for learning a parameterized policy. PI² inherits its sound foundation in first-order principles of stochastic optimal control from the path integral formalism. It is a probabilistic learning method without open algorithmic tuning parameters, except for the exploration noise. In our evaluations, PI² outperformed gradient algorithms significantly. It is also numerically simpler, and allows easier cost function design, than previous probabilistic RL methods that require that immediate rewards are pseudo-probabilities. The similarity of PI² to algorithms based on probability matching indicates that the principle of probability matching seems to approximate a stochastic optimal control framework. Our evaluations demonstrated that PI² can scale to high-dimensional control systems, unlike many other reinforcement learning systems. Some issues, however, deserve more detailed discussion in the following paragraphs.
