Reinforcement Learning for Robotic Locomotion

Bo Liu, Stanford University, 121 Campus Drive, Stanford, CA 94305, USA, bliuxix@stanford.edu
Huanzhong Xu, Stanford University, 121 Campus Drive, Stanford, CA 94305, USA, xuhunvc@stanford.edu
Songze Li, Stanford University, 121 Campus Drive, Stanford, CA 94305, USA, songzeli@stanford.edu

Abstract

In Reinforcement Learning, it is usually more convenient to optimize in the policy space $\pi(a \mid s)$ directly than to form a policy indirectly by accurately evaluating the state-action function $Q(s, a)$ or the value function $V(s)$: the value function does not prescribe actions, and the state-action function can be very hard to estimate in continuous or large action spaces. Therefore, in an environment that suffers from a high cost of evaluation, such as a robotic locomotion system, a policy optimization method is a better option due to the significantly smaller policy space. In this project, we investigate a particular policy optimization method, the Trust Region Policy Optimization algorithm, explore possible variants of it, and propose a new method based on our understanding of this algorithm.

1 Introduction

The original motivation of this project comes from the desire to understand human motor control units in the field of bioengineering. Controlling a simulated human body to achieve complex motions using an artificial brain would provide not only a better understanding of how the human motor system works but also a theoretical foundation for surgeries for people with physical disabilities. We therefore focus on finding efficient Reinforcement Learning algorithms for environments with a high evaluation cost, and we eventually decided to explore policy gradient methods because searching in policy space can be much more efficient.

In particular, we investigate the recent work on Trust Region Policy Optimization (Schulman et al., 2015), in which the authors formulate the reinforcement learning objective as an optimization problem subject to a trust region constraint. This algorithm works well in practice and has a theoretical guarantee of improvement in each episode if the objective and constraint are evaluated exactly. In the original paper, the trust region corresponds to the region of policy space that is close to the old policy. The authors define closeness in terms of the Kullback-Leibler (KL) divergence between the old and new policy distributions, i.e. $D_{KL}(\pi_{old}(\cdot \mid s) \,\|\, \pi_{new}(\cdot \mid s))$. In fact, during the optimization step they use a second-order approximation of this KL constraint, which involves the Hessian of the KL divergence. Although using the conjugate gradient method, as the authors suggest, allows us to avoid computing the Hessian exactly, this is in general still computationally expensive. We therefore ask to what extent the KL constraint can outperform other easy-to-compute constraints, or even no constraint at all, to compensate for its cost. Based on our understanding of this model, we also view the problem from another perspective and propose a new method that estimates advantage values with a neural net and updates the policy directly.
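As mentioned above, the conjugate gradient method lets TRPO work with the KL Hessian through Hessian-vector products alone, without ever forming the matrix. The following is a minimal sketch of that solver, assuming Python/NumPy and a user-supplied Hessian-vector product routine `hvp` (e.g. obtained via automatic differentiation); it is an illustration, not our training code.

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products hvp(v) = H v."""
    x = np.zeros_like(g)
    r = g.copy()              # residual g - H x (x starts at 0)
    p = r.copy()              # current search direction
    r_dot = r.dot(r)
    for _ in range(iters):
        Hp = hvp(p)
        alpha = r_dot / (p.dot(Hp) + 1e-12)
        x += alpha * p
        r -= alpha * Hp
        new_r_dot = r.dot(r)
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

# Toy check with an explicit positive-definite matrix standing in for the KL Hessian.
H = np.array([[2.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -1.0])
x = conjugate_gradient(lambda v: H @ v, g)
print(np.allclose(H @ x, g, atol=1e-6))  # True
```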

2 Related Work

There are mainly two approaches to the proposed problem in reinforcement learning: policy gradient methods and Q-function methods. A policy gradient method directly optimizes a parametrized control policy by gradient descent. It belongs to the class of policy search techniques that maximize the expected return of a policy within a fixed policy class, while traditional value function approximation approaches derive policies from a value function. Policy gradient methods allow the straightforward incorporation of domain knowledge in the policy parametrization and require significantly fewer parameters to represent the optimal policy than the corresponding value function. They are guaranteed to converge to a policy that is at least locally optimal. Furthermore, they can handle continuous states and actions, even including imperfect state information. Besides the vanilla policy gradient method, there exist variants such as the natural policy gradient (Kakade, 2002) and algorithms that use a trust region (Schulman et al., 2015).

Instead of parameterizing the policy, Q-function methods focus on the state-action function $Q^\pi(s_t, a_t)$ and update it with the Bellman equation. The optimal Q is obtained via value iteration, in which we repeatedly apply the Bellman operator until it converges. It has been shown that, under some mild assumptions, this value iteration algorithm is guaranteed to converge to $Q^{\pi^*}$ for any initial Q, where $\pi^*$ is the optimal policy (Baird and others, 1995). Variants of this basic value iteration algorithm include neural fitted Q-iteration, which parameterizes the Q-function with a neural network and replaces the Bellman operator with minimizing the MSE between two Q-functions. Q-function methods do not work as generally as policy gradient methods, although they are more sample-efficient when they do work.

Algorithm               | Simple & Scalable | Data Efficient
Vanilla Policy Gradient | Good              | Bad
Natural Policy Gradient | Bad               | OK
Q-learning              | Good              | OK

Table 1: Comparison of different algorithms.

3 Notation

To make the derivations and descriptions that follow clear, we provide our notation for the classic Markov Decision Process (MDP) and review the conventional reinforcement learning objective. An MDP is a 6-tuple $(S, A, T, r, \rho_0, \gamma)$, where $S$ is the set of possible states, $A$ is the set of possible actions, $T: S \times A \times S \to \mathbb{R}$ is the transition probability, $r: S \to \mathbb{R}$ is the reward function, $\rho_0: S \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma$ is the discount factor. Let $\pi$ denote a stochastic policy $\pi: S \times A \to [0, 1]$. The objective of classic RL is to maximize the expected future reward:

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big]$$

where $s_0 \sim \rho_0(s_0)$, $a_t \sim \pi(a_t \mid s_t)$, and $s_{t+1} \sim T(s_{t+1} \mid s_t, a_t)$. Let $Q^\pi(s, a)$ and $V^\pi(s)$ denote the standard state-action function and value function, where

$$Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\Big[\sum_{l=t}^{\infty} \gamma^{l-t} r(s_l)\Big], \qquad V^\pi(s_t) = \mathbb{E}_{a_t}\big[Q^\pi(s_t, a_t)\big]$$

Define the advantage as

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

In addition, we define the discounted visitation frequency function $\rho_\pi: S \to \mathbb{R}$:

$$\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots$$

4 Method

4.1 Policy Gradient Methods

Policy gradient methods, as the name suggests, try to estimate the derivative of the objective with respect to the policy parameters directly. The most commonly used gradient estimator has the form

$$\hat{g} = \mathbb{E}_t\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\big]$$

The TRPO algorithm is a specific instance of this class of algorithms.
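In code, this estimator is usually obtained by differentiating the surrogate $\mathbb{E}_t[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t]$ with automatic differentiation. The sketch below is illustrative only, assuming PyTorch, a diagonal Gaussian policy, and random stand-in roll-out data (the shapes and the linear mean network are placeholders, not our actual model).

```python
import torch
from torch.distributions import Normal

# Stand-in roll-out data; shapes and the linear mean network are illustrative only.
T, obs_dim, act_dim = 128, 10, 3
mean_net = torch.nn.Linear(obs_dim, act_dim)        # outputs the Gaussian mean
log_std = torch.nn.Parameter(torch.zeros(act_dim))  # state-independent log std

obs = torch.randn(T, obs_dim)
actions = torch.randn(T, act_dim)
advantages = torch.randn(T)                         # stand-in for \hat{A}_t, e.g. from GAE

# Build the surrogate E_t[log pi_theta(a_t|s_t) * A_hat_t]; its gradient is the estimator above.
dist = Normal(mean_net(obs), log_std.exp())
log_prob = dist.log_prob(actions).sum(dim=-1)       # joint log-prob of the action vector
surrogate = (log_prob * advantages).mean()
surrogate.backward()                                # gradients now sit in the .grad fields

print(mean_net.weight.grad.shape)   # torch.Size([3, 10])
print(log_std.grad.shape)           # torch.Size([3])
```

A vanilla policy gradient step would simply move the parameters along this gradient with some step size; TRPO instead constrains how far each step may go, as described next.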

4.2 Trust Region Policy Optimization

The Trust Region Policy Optimization (TRPO) method combines the idea of Minorize-Maximization with policy gradient methods. It defines a surrogate function which is easier to optimize and provides a strict lower bound for the original objective function; but in order to use this surrogate, the optimization has to be done in a region near the previous policy. Let $\pi_0$ denote the baseline policy and $\pi$ denote any policy. The following identity expresses the expected return of $\pi$ (Kakade and Langford, 2002):

$$\eta(\pi) = \eta(\pi_0) + \mathbb{E}_{s_0, a_0, \ldots \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t A_{\pi_0}(s_t, a_t)\Big]$$
$$= \eta(\pi_0) + \sum_{t=0}^{\infty} \sum_s P(s_t = s \mid \pi) \sum_a \pi(a \mid s)\, \gamma^t A_{\pi_0}(s, a)$$
$$= \eta(\pi_0) + \sum_s \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi) \sum_a \pi(a \mid s)\, A_{\pi_0}(s, a)$$
$$= \eta(\pi_0) + \sum_s \rho_\pi(s) \sum_a \pi(a \mid s)\, A_{\pi_0}(s, a)$$

By substituting $\rho_\pi$ with $\rho_{\pi_0}$, we get an approximation of the discounted future reward under the new policy $\pi$. The new objective is

$$L(\pi) = \eta(\pi_0) + \sum_s \rho_{\pi_0}(s) \sum_a \pi(a \mid s)\, A_{\pi_0}(s, a) \qquad (1)$$

It has been pointed out that this approximation matches the original objective $\eta(\pi)$ to first order (Schulman et al., 2015). Therefore, a small enough step $\pi_{\theta_{old}} \to \pi$ that improves $L(\pi_{\theta_{old}})$ also improves $\eta(\pi)$. The problem is what "small" means here. The authors suggest using the KL divergence as the measure and provide a rigorous theoretical proof. For simplicity, we do not include the full derivation here but only the eventual surrogate objective:

$$\max_\theta \; L_{\theta_{old}}(\theta) - C\, D_{KL}(\theta_{old}, \theta)$$

where $C$ is some properly chosen constant. However, in practice, if we used the penalty coefficient $C$ recommended by the theory, the step size would be too small. To obtain a faster and more robust optimization, a trust region is introduced and the objective becomes maximizing $L$ subject to the KL divergence lying inside the trust region:

$$\max_\theta \; L_{\theta_{old}}(\theta) \quad \text{subject to} \quad D_{KL}(\theta_{old}, \theta) \le \delta$$

We can further estimate the objective using importance sampling based on the $\pi_{old}$ distribution. The constraint can likewise be estimated with Monte-Carlo sampling and a second-order approximation.

4.3 Model for TRPO

For TRPO and its variants, we use neural networks for both the policy and value models. For both networks, the input is the state observation. The output of the policy model is a multivariate normal distribution over actions, parameterized by its mean and standard deviation. The output of the value model is a single number, the estimate of the value at this particular state observation.

4.4 Substitution for the KL Divergence

The basic idea of TRPO is to optimize the surrogate loss function under the restriction that the new policy is close enough to the old one. However, one may wonder why we have to use the KL divergence to measure the closeness between the two policies. A natural substitute for the KL divergence is the mean-squared error (MSE) $\|\theta_{old} - \theta\|_2^2$. Replacing the KL constraint with the MSE significantly speeds up the training procedure, since the MSE is easier to estimate. Moreover, we recognize that using the MSE corresponds to the original policy gradient method, because when the MSE is sufficiently small, the direction that maximizes the objective corresponds to its gradient. We experiment with TRPO under both the KL and MSE constraints. As a baseline, we also implement a method that purely optimizes the objective without any constraint. We analyze the comparison further in the following sections with figures.
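To make the two closeness measures concrete: for the diagonal Gaussian policies of Section 4.3 the per-state KL divergence has a closed form, while the MSE constraint only looks at the parameter vector. The NumPy sketch below is purely illustrative (the numbers are not from our experiments); it shows how two parameter updates with the same MSE can have very different KL divergences, which is exactly the effect analyzed in the results section.

```python
import numpy as np

def kl_diag_gaussians(mu_old, std_old, mu_new, std_new):
    """KL(pi_old || pi_new) for diagonal Gaussian action distributions at one state.
    In TRPO this quantity is averaged over states sampled from the old policy."""
    var_old, var_new = std_old ** 2, std_new ** 2
    return np.sum(
        np.log(std_new / std_old)
        + (var_old + (mu_old - mu_new) ** 2) / (2.0 * var_new)
        - 0.5
    )

def mse_distance(theta_old, theta_new):
    """The cheap substitute we experiment with: squared distance in parameter space."""
    return np.sum((theta_old - theta_new) ** 2)

mu, std = np.zeros(3), np.ones(3)
# Two updates that look identical to the MSE constraint...
print(mse_distance(np.concatenate([mu, std]), np.concatenate([mu + 0.5, std])))   # 0.75
print(mse_distance(np.concatenate([mu, std]), np.concatenate([mu, std - 0.5])))   # 0.75
# ...but move the action distribution by very different amounts.
print(kl_diag_gaussians(mu, std, mu + 0.5, std))   # ~0.375
print(kl_diag_gaussians(mu, std, mu, std - 0.5))   # ~2.42: same MSE, much larger KL
```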

4.5 Neural Network Advantage Estimation

Going back to equation (1),

$$L(\pi) = \eta(\pi_0) + \sum_s \rho_{\pi_0}(s) \sum_a \pi(a \mid s)\, A_{\pi_0}(s, a)$$

Since $\eta(\pi_0)$ is a constant, maximizing $L(\pi)$ is equivalent to making $L(\pi) - \eta(\pi_0)$ as positive as possible, i.e. pushing

$$\sum_s \rho_{\pi_0}(s) \sum_a \pi(a \mid s)\, A_{\pi_0}(s, a) > 0$$

The intuition is that the advantage value $A_{\pi_0}(s, a)$ indicates how good a certain action is, and we would like to increase the probabilities of better actions, those with large positive advantage values. In other words, if advantage values can be estimated accurately, we can optimize the problem by directly maximizing the probabilities of good sampled actions. In practice, our actions are sampled from a multivariate Gaussian in a continuous space. We estimate the advantages of sampled actions directly from the Monte-Carlo sampling trajectory using generalized advantage estimation (GAE) (Schulman et al., 2015b), but we also want an estimate of the advantage value that the mean action would have had at each time step. Then, by calculating the difference, whenever the mean action's advantage is smaller (in practice, we require the difference to be larger than a threshold to remove some noise), we select (mask) the probabilities of those sampled actions and maximize them. Hence, the only remaining problem is how to estimate advantage values for the mean actions, the actions that we did not take during the roll-out. We solve this problem with a 3-layer feed-forward neural network. The input of the network is the concatenation of the state observation $s$ and the action $a$; the output is the estimated advantage value. In our implementation, we use TRPO for the first few episodes until the MSE between the calculated and estimated advantages falls below a threshold; the method above starts after this.
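What follows is a minimal sketch of such an advantage estimator, assuming PyTorch; the hidden width, activation, dropout rate, and weight decay are illustrative placeholders rather than our exact training configuration (the results section discusses why the L2 and dropout settings matter).

```python
import torch
import torch.nn as nn

class AdvantageNet(nn.Module):
    """3-layer feed-forward net mapping concat(state, action) to a scalar advantage."""
    def __init__(self, obs_dim, act_dim, hidden=64, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # Input is the concatenation of state observation and action; output is A(s, a).
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

adv_net = AdvantageNet(obs_dim=10, act_dim=3)
# L2 regularization via weight_decay, one of the knobs tuned in the results section.
opt = torch.optim.Adam(adv_net.parameters(), lr=1e-3, weight_decay=1e-4)

obs, act = torch.randn(32, 10), torch.randn(32, 3)   # stand-in roll-out batch
gae_targets = torch.randn(32)                        # stand-in GAE advantages of sampled actions
loss = ((adv_net(obs, act) - gae_targets) ** 2).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

At update time, one would compare `adv_net(obs, mean_action)` against the GAE estimate for the sampled action and only push up the log-probabilities of sampled actions that win by more than the chosen threshold, as described above.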

5 Experiments

We test the different methods in the MuJoCo simulator, a physics engine available through OpenAI Gym. The three environments we use have increasing complexity:

Swimmer: 10-dimensional state space, linear reward.
Hopper: 12-dimensional state space, same reward as Swimmer, with a positive bonus for being in a non-terminal state.
Walker: 18-dimensional state space, same reward as Hopper, with an added penalty for strong impact of the feet against the ground.

6 Results and Analysis

Figure 1: Swimmer: Average reward vs. episode.
Figure 2: Hopper: Average reward vs. episode.
Figure 3: Walker: Average reward vs. episode.

As shown above, the KL divergence does outperform its variants. In spite of the high cost, the KL constraint is better for measuring the distance between policies: in the MSE case, although intuitively closer parameters correspond to similar policies, in a larger policy space even a small update in the parameters can result in very different policies. In contrast, the KL constraint is estimated directly on the different $\pi(\theta)$, so roll-outs from different policies are guaranteed to be similar. However, the MSE can significantly shorten the training time. Our suggestion is therefore a combination: apply TRPO with the MSE constraint to let the model quickly climb to a certain point, and then use the original TRPO afterwards for faster convergence.

Our advantage estimation method did not work at first, and we spent a lot of time debugging. One thing we noticed is that the algorithm tended to show a large MSE loss for the advantage estimation after some consecutive epochs of improvement; in other words, this provides evidence that our model works when the estimation is accurate. In addition, we conjecture that the unexpected drops might result from over-fitting of our advantage neural net: whenever it sees a new observation and action that differ from previous experience, it gives a wrong estimate of the advantage value, and the policy then updates along the wrong gradient direction. Therefore, by carefully tuning the L2-regularization constant and dropout for the neural network, the model indeed improves, as shown in the figure below.

Figure 4: TRPO and advantage estimation.

In addition to the analysis above, we also notice an interesting fact about TRPO: despite the fact that the average reward increases in each episode, the algorithm is highly sensitive to initialization. Even when only the random seed differs, the model's performance varies widely.

Figure 5: TRPO with different random seeds.

7 Conclusion and Future Work

In this project, we explore the cutting-edge TRPO algorithm within the class of policy gradient methods and try out some possible variants of it. We also experiment with our own modification using an advantage estimation neural network. We may continue our investigation into other possible substitutes for the KL divergence. In addition, we would also like to do more error analysis on why even the TRPO model is not robust enough and find methods to lower the variance.

8 Contributions

Bo Liu: responsible for the coding of TRPO, its variants, and the advantage estimation method; participated in the write-up of the project proposal, milestone, poster, and final report; presented the poster.

Huanzhong Xu: responsible for analyzing the cost of the conjugate gradient in the KL constraint and estimating the feasibility of its variants; helped with debugging; collected data and plotted all figures; participated in the milestone, poster, and final report write-up.

Songze Li: responsible for the project idea and running experiments; participated in the milestone and final report write-up.

References

Leemon Baird et al. 1995. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30-37.

Sham Kakade and John Langford. 2002. Approximately optimal approximate reinforcement learning.

Sham M. Kakade. 2002. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531-1538.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889-1897.

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2015b. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.