Trust Region Policy Optimization

John Schulman JOSCHU@EECS.BERKELEY.EDU
Sergey Levine SLEVINE@EECS.BERKELEY.EDU
Philipp Moritz PCMORITZ@EECS.BERKELEY.EDU
Michael Jordan JORDAN@CS.BERKELEY.EDU
Pieter Abbeel PABBEEL@CS.BERKELEY.EDU
University of California, Berkeley, Department of Electrical Engineering and Computer Sciences

Proceedings of the 31st International Conference on Machine Learning, Lille, France. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.

1 Introduction

Most algorithms for policy optimization can be classified into three broad categories: (1) policy iteration methods, which alternate between estimating the value function under the current policy and improving the policy (Bertsekas, 2005); (2) policy gradient methods, which use an estimator of the gradient of the expected return (total reward) obtained from sample trajectories (Peters & Schaal, 2008a) (and which, as we later discuss, have a close connection to policy iteration); and (3) derivative-free optimization methods, such as the cross-entropy method (CEM) and covariance matrix adaptation (CMA), which treat the return as a black box function to be optimized in terms of the policy parameters (Szita & Lörincz, 2006).

General derivative-free stochastic optimization methods such as CEM and CMA are preferred on many problems, because they achieve good results while being simple to understand and implement. For example, while Tetris is a classic benchmark problem for approximate dynamic programming (ADP) methods, stochastic optimization methods are difficult to beat on this task (Gabillon et al., 2013). For continuous control problems, methods like CMA have been successful at learning control policies for challenging tasks like locomotion when provided with hand-engineered policy classes with low-dimensional parameterizations (Wampler & Popović, 2009). The inability of ADP and gradient-based methods to consistently beat gradient-free random search is unsatisfying, since gradient-based optimization algorithms enjoy much better sample complexity guarantees than gradient-free methods (Nemirovski, 2005). Continuous gradient-based optimization has been very successful at learning function approximators for supervised learning tasks with huge numbers of parameters, and extending their success to reinforcement learning would allow for efficient training of complex and powerful policies.

In this article, we first prove that minimizing a certain surrogate objective function guarantees policy improvement with non-trivial step sizes. Then we make a series of approximations to the theoretically-justified algorithm, yielding a practical algorithm, which we call trust region policy optimization (TRPO).
We describe two variants of this algorithm: first, the single-path method, which can be applied in the model-free setting; second, the vine method, which requires the system to be restored to particular states, which is typically only possible in simulation. These algorithms are scalable and can optimize nonlinear policies with tens of thousands of parameters, which have previously posed a major challenge for model-free policy search (Deisenroth et al., 2013). In our experiments, we show that the same TRPO methods can learn complex policies for swimming, hopping, and walking, as well as playing Atari games directly from raw images.

2 Preliminaries

Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple $(S, A, P, r, \rho_0, \gamma)$, where $S$ is a finite set of states, $A$ is a finite set of actions, $P : S \times A \times S \to \mathbb{R}$ is the transition probability distribution, $r : S \to \mathbb{R}$ is the reward function, $\rho_0 : S \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor.

Let $\pi$ denote a stochastic policy $\pi : S \times A \to [0, 1]$, and let $\eta(\pi)$ denote its expected discounted reward:

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \dots}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big], \quad \text{where } s_0 \sim \rho_0(s_0),\; a_t \sim \pi(a_t \mid s_t),\; s_{t+1} \sim P(s_{t+1} \mid s_t, a_t).$$

We will use the following standard definitions of the state-action value function $Q_\pi$, the value function $V_\pi$, and the advantage function $A_\pi$:

$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big], \quad V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \dots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big], \quad A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s),$$

where $a_t \sim \pi(a_t \mid s_t)$ and $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$ for $t \geq 0$.

The following useful identity expresses the expected return of another policy $\tilde\pi$ in terms of the advantage over $\pi$, accumulated over timesteps (see Kakade & Langford (2002) or Appendix A for proof):

$$\eta(\tilde\pi) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \dots \sim \tilde\pi}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big] \qquad (1)$$

where the notation $\mathbb{E}_{s_0, a_0, \dots \sim \tilde\pi}[\dots]$ indicates that actions are sampled $a_t \sim \tilde\pi(\cdot \mid s_t)$. Let $\rho_\pi$ be the (unnormalized) discounted visitation frequencies

$$\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \dots,$$

where $s_0 \sim \rho_0$ and the actions are chosen according to $\pi$. We can rewrite Equation (1) with a sum over states instead of timesteps:

$$\begin{aligned}
\eta(\tilde\pi) &= \eta(\pi) + \sum_{t=0}^{\infty}\sum_s P(s_t = s \mid \tilde\pi) \sum_a \tilde\pi(a \mid s)\, \gamma^t A_\pi(s, a) \\
&= \eta(\pi) + \sum_s \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde\pi) \sum_a \tilde\pi(a \mid s)\, A_\pi(s, a) \\
&= \eta(\pi) + \sum_s \rho_{\tilde\pi}(s) \sum_a \tilde\pi(a \mid s)\, A_\pi(s, a). \qquad (2)
\end{aligned}$$

This equation implies that any policy update $\pi \to \tilde\pi$ that has a nonnegative expected advantage at every state $s$, i.e., $\sum_a \tilde\pi(a \mid s) A_\pi(s, a) \geq 0$, is guaranteed to increase the policy performance $\eta$, or leave it constant in the case that the expected advantage is zero everywhere. This implies the classic result that the update performed by exact policy iteration, which uses the deterministic policy $\tilde\pi(s) = \arg\max_a A_\pi(s, a)$, improves the policy if there is at least one state-action pair with a positive advantage value and nonzero state visitation probability; otherwise the algorithm has converged to the optimal policy. However, in the approximate setting, it will typically be unavoidable, due to estimation and approximation error, that there will be some states $s$ for which the expected advantage is negative, that is, $\sum_a \tilde\pi(a \mid s) A_\pi(s, a) < 0$.

The complex dependency of $\rho_{\tilde\pi}(s)$ on $\tilde\pi$ makes Equation (2) difficult to optimize directly. Instead, we introduce the following local approximation to $\eta$:

$$L_\pi(\tilde\pi) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde\pi(a \mid s)\, A_\pi(s, a). \qquad (3)$$

Note that $L_\pi$ uses the visitation frequency $\rho_\pi$ rather than $\rho_{\tilde\pi}$, ignoring changes in state visitation density due to changes in the policy. However, if we have a parameterized policy $\pi_\theta$, where $\pi_\theta(a \mid s)$ is a differentiable function of the parameter vector $\theta$, then $L_\pi$ matches $\eta$ to first order (see Kakade & Langford (2002)). That is, for any parameter value $\theta_0$,

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta=\theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta=\theta_0}. \qquad (4)$$

Equation (4) implies that a sufficiently small step $\pi_{\theta_0} \to \tilde\pi$ that improves $L_{\pi_{\theta_{\mathrm{old}}}}$ will also improve $\eta$, but does not give us any guidance on how big of a step to take.

To address this issue, Kakade & Langford (2002) proposed a policy updating scheme called conservative policy iteration, for which they could provide explicit lower bounds on the improvement of $\eta$. To define the conservative policy iteration update, let $\pi_{\mathrm{old}}$ denote the current policy, and let $\pi' = \arg\max_{\pi'} L_{\pi_{\mathrm{old}}}(\pi')$. The new policy $\pi_{\mathrm{new}}$ was defined to be the following mixture:

$$\pi_{\mathrm{new}}(a \mid s) = (1-\alpha)\,\pi_{\mathrm{old}}(a \mid s) + \alpha\,\pi'(a \mid s). \qquad (5)$$

Kakade and Langford derived the following lower bound:

$$\eta(\pi_{\mathrm{new}}) \geq L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2, \quad \text{where } \epsilon = \max_s \big|\mathbb{E}_{a \sim \pi'(a \mid s)}[A_\pi(s, a)]\big|. \qquad (6)$$

(We have modified it to make it slightly weaker but simpler.) Note, however, that so far this bound only applies to mixture policies generated by Equation (5).
This policy class is unwieldy and restrictive in practice, and it is desirable for a practical policy update scheme to be applicable to all general stochastic policy classes.

3 Monotonic Improvement Guarantee for General Stochastic Policies

Equation (6), which applies to conservative policy iteration, implies that a policy update that improves the right-hand side is guaranteed to improve the true performance $\eta$.

Our principal theoretical result is that the policy improvement bound in Equation (6) can be extended to general stochastic policies, rather than just mixture policies, by replacing $\alpha$ with a distance measure between $\pi$ and $\tilde\pi$, and changing the constant $\epsilon$ appropriately. Since mixture policies are rarely used in practice, this result is crucial for extending the improvement guarantee to practical problems. The particular distance measure we use is the total variation divergence, which is defined by $D_{TV}(p \,\|\, q) = \frac{1}{2}\sum_i |p_i - q_i|$ for discrete probability distributions $p, q$.¹ Define $D_{TV}^{\max}(\pi, \tilde\pi)$ as

$$D_{TV}^{\max}(\pi, \tilde\pi) = \max_s D_{TV}\big(\pi(\cdot \mid s) \,\|\, \tilde\pi(\cdot \mid s)\big). \qquad (7)$$

¹ Our result is straightforward to extend to continuous states and actions by replacing the sums with integrals.

Theorem 1. Let $\alpha = D_{TV}^{\max}(\pi_{\mathrm{old}}, \pi_{\mathrm{new}})$. Then the following bound holds:

$$\eta(\pi_{\mathrm{new}}) \geq L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2, \quad \text{where } \epsilon = \max_{s,a} |A_\pi(s, a)|. \qquad (8)$$

We provide two proofs in the appendix. The first proof extends Kakade and Langford's result using the fact that the random variables from two distributions with total variation divergence less than $\alpha$ can be coupled, so that they are equal with probability $1 - \alpha$. The second proof uses perturbation theory.

Next, we note the following relationship between the total variation divergence and the KL divergence (Pollard (2000), Ch. 3): $D_{TV}(p \,\|\, q)^2 \leq D_{KL}(p \,\|\, q)$. Let $D_{KL}^{\max}(\pi, \tilde\pi) = \max_s D_{KL}\big(\pi(\cdot \mid s) \,\|\, \tilde\pi(\cdot \mid s)\big)$. The following bound then follows directly from Theorem 1:

$$\eta(\tilde\pi) \geq L_\pi(\tilde\pi) - C\,D_{KL}^{\max}(\pi, \tilde\pi), \quad \text{where } C = \frac{4\epsilon\gamma}{(1-\gamma)^2}. \qquad (9)$$

Algorithm 1 describes an approximate policy iteration scheme based on the policy improvement bound in Equation (9). Note that for now, we assume exact evaluation of the advantage values $A_\pi$.

Algorithm 1 Policy iteration algorithm guaranteeing non-decreasing expected return $\eta$
  Initialize $\pi_0$.
  for $i = 0, 1, 2, \dots$ until convergence do
    Compute all advantage values $A_{\pi_i}(s, a)$.
    Solve the constrained optimization problem
      $\pi_{i+1} = \arg\max_\pi \big[L_{\pi_i}(\pi) - C\,D_{KL}^{\max}(\pi_i, \pi)\big]$
      where $C = 4\epsilon\gamma/(1-\gamma)^2$ and $L_{\pi_i}(\pi) = \eta(\pi_i) + \sum_s \rho_{\pi_i}(s)\sum_a \pi(a \mid s) A_{\pi_i}(s, a)$
  end for

It follows from Equation (9) that Algorithm 1 is guaranteed to generate a monotonically improving sequence of policies $\eta(\pi_0) \leq \eta(\pi_1) \leq \eta(\pi_2) \leq \dots$. To see this, let $M_i(\pi) = L_{\pi_i}(\pi) - C\,D_{KL}^{\max}(\pi_i, \pi)$. Then

$$\eta(\pi_{i+1}) \geq M_i(\pi_{i+1}) \text{ by Equation (9)}, \qquad \eta(\pi_i) = M_i(\pi_i), \qquad \text{therefore } \eta(\pi_{i+1}) - \eta(\pi_i) \geq M_i(\pi_{i+1}) - M_i(\pi_i). \qquad (10)$$

Thus, by maximizing $M_i$ at each iteration, we guarantee that the true objective $\eta$ is non-decreasing. This algorithm is a type of minorization-maximization (MM) algorithm (Hunter & Lange, 2004), which is a class of methods that also includes expectation maximization. In the terminology of MM algorithms, $M_i$ is the surrogate function that minorizes $\eta$ with equality at $\pi_i$. This algorithm is also reminiscent of proximal gradient methods and mirror descent.

Trust region policy optimization, which we propose in the following section, is an approximation to Algorithm 1, which uses a constraint on the KL divergence rather than a penalty to robustly allow large updates.

4 Optimization of Parameterized Policies

In the previous section, we considered the policy optimization problem independently of the parameterization of $\pi$ and under the assumption that the policy can be evaluated at all states. We now describe how to derive a practical algorithm from these theoretical foundations, under finite sample counts and arbitrary parameterizations.

Since we consider parameterized policies $\pi_\theta(a \mid s)$ with parameter vector $\theta$, we will overload our previous notation to use functions of $\theta$ rather than $\pi$, e.g. $\eta(\theta) := \eta(\pi_\theta)$, $L_\theta(\tilde\theta) := L_{\pi_\theta}(\pi_{\tilde\theta})$, and $D_{KL}(\theta \,\|\, \tilde\theta) := D_{KL}(\pi_\theta \,\|\, \pi_{\tilde\theta})$.
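Before turning to the parameterized setting, the identity in Equation (2) and the lower bound in Equation (9) can be checked numerically. The following sketch is an illustration added alongside the text, not part of the original algorithm: it builds a small random tabular MDP with NumPy, computes $\eta$, $A_\pi$, $\rho_\pi$, and $L_\pi$ exactly by solving linear systems, and verifies both relations. All sizes, seeds, and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Random tabular MDP: transitions P[s, a, s'], state reward r[s], start distribution rho0.
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random(nS)
rho0 = np.full(nS, 1.0 / nS)

def random_policy():
    pi = rng.random((nS, nA))
    return pi / pi.sum(axis=1, keepdims=True)

def exact_quantities(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)                       # state-to-state transitions under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r)           # V_pi
    Q = r[:, None] + gamma * P @ V                              # Q_pi[s, a]
    A = Q - V[:, None]                                          # advantage A_pi
    eta = rho0 @ V                                              # expected discounted return
    rho = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho0)    # discounted visitation frequencies
    return eta, A, rho

pi, pi_new = random_policy(), random_policy()
eta_old, A_old, rho_old = exact_quantities(pi)
eta_new, _, rho_new = exact_quantities(pi_new)

# Equation (2): eta(pi_new) = eta(pi) + sum_s rho_{pi_new}(s) sum_a pi_new(a|s) A_pi(s, a)
rhs = eta_old + rho_new @ np.sum(pi_new * A_old, axis=1)
print(eta_new, rhs)                       # agree to numerical precision

# Local approximation (3) and the lower bound of Equation (9)
L = eta_old + rho_old @ np.sum(pi_new * A_old, axis=1)
eps = np.abs(A_old).max()
C = 4 * eps * gamma / (1 - gamma) ** 2
kl_max = np.max(np.sum(pi * np.log(pi / pi_new), axis=1))      # max_s KL(pi(.|s) || pi_new(.|s))
print(eta_new >= L - C * kl_max)          # Theorem 1 / Equation (9); prints True
```

Because every quantity is computed exactly here, the two printed values for Equation (2) agree to machine precision, and the bound check prints True for any pair of policies.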
We will use $\theta_{\mathrm{old}}$ to denote the previous policy parameters that we want to improve upon. The preceding section showed that $\eta(\theta) \geq L_{\theta_{\mathrm{old}}}(\theta) - C\,D_{KL}^{\max}(\theta_{\mathrm{old}}, \theta)$, with equality at $\theta = \theta_{\mathrm{old}}$. Thus, by performing the following maximization, we are guaranteed to improve the true objective $\eta$:

$$\underset{\theta}{\text{maximize}}\ \big[L_{\theta_{\mathrm{old}}}(\theta) - C\,D_{KL}^{\max}(\theta_{\mathrm{old}}, \theta)\big].$$

In practice, if we used the penalty coefficient $C$ recommended by the theory above, the step sizes would be very small. One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:

$$\underset{\theta}{\text{maximize}}\ L_{\theta_{\mathrm{old}}}(\theta) \quad \text{subject to } D_{KL}^{\max}(\theta_{\mathrm{old}}, \theta) \leq \delta. \qquad (11)$$

This problem imposes a constraint that the KL divergence is bounded at every point in the state space. While it is motivated by the theory, this problem is impractical to solve due to the large number of constraints. Instead, we can use a heuristic approximation which considers the average KL divergence:

$$\bar D_{KL}^{\rho}(\theta_1, \theta_2) := \mathbb{E}_{s \sim \rho}\big[D_{KL}\big(\pi_{\theta_1}(\cdot \mid s) \,\|\, \pi_{\theta_2}(\cdot \mid s)\big)\big].$$

We therefore propose solving the following optimization problem to generate a policy update:

$$\underset{\theta}{\text{maximize}}\ L_{\theta_{\mathrm{old}}}(\theta) \quad \text{subject to } \bar D_{KL}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \leq \delta. \qquad (12)$$

Similar policy updates have been proposed in prior work (Bagnell & Schneider, 2003; Peters & Schaal, 2008b; Peters et al., 2010), and we compare our approach to prior methods in Section 7 and in the experiments in Section 8. Our experiments also show that this type of constrained update has similar empirical performance to the maximum KL divergence constraint in Equation (11).

5 Sample-Based Estimation of the Objective and Constraint

The previous section proposed a constrained optimization problem on the policy parameters (Equation (12)), which optimizes an estimate of the expected total reward $\eta$ subject to a constraint on the change in the policy at each update. This section describes how the objective and constraint functions can be approximated using Monte Carlo simulation.

We seek to solve the following optimization problem, obtained by expanding $L_{\theta_{\mathrm{old}}}$ in Equation (12):

$$\underset{\theta}{\text{maximize}}\ \sum_s \rho_{\theta_{\mathrm{old}}}(s) \sum_a \pi_\theta(a \mid s)\, A_{\theta_{\mathrm{old}}}(s, a) \quad \text{subject to } \bar D_{KL}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \leq \delta. \qquad (13)$$

We first replace $\sum_s \rho_{\theta_{\mathrm{old}}}(s)[\dots]$ in the objective by the expectation $\frac{1}{1-\gamma}\mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}[\dots]$. Next, we replace the advantage values $A_{\theta_{\mathrm{old}}}$ by the Q-values $Q_{\theta_{\mathrm{old}}}$ in Equation (13), which only changes the objective by a constant. Last, we replace the sum over the actions by an importance sampling estimator. Using $q$ to denote the sampling distribution, the contribution of a single $s_n$ to the loss function is

$$\sum_a \pi_\theta(a \mid s_n)\, A_{\theta_{\mathrm{old}}}(s_n, a) = \mathbb{E}_{a \sim q}\Big[\frac{\pi_\theta(a \mid s_n)}{q(a \mid s_n)}\, A_{\theta_{\mathrm{old}}}(s_n, a)\Big].$$

Our optimization problem in Equation (13) is exactly equivalent to the following one, written in terms of expectations:

$$\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim q}\Big[\frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, Q_{\theta_{\mathrm{old}}}(s, a)\Big] \quad \text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\big[D_{KL}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)\big] \leq \delta. \qquad (14)$$

Figure 1. Left: illustration of single path procedure. Here, we generate a set of trajectories via simulation of the policy and incorporate all state-action pairs $(s_n, a_n)$ into the objective. Right: illustration of vine procedure. We generate a set of "trunk" trajectories, and then generate "branch" rollouts from a subset of the reached states. For each of these states $s_n$, we perform multiple actions ($a_1$ and $a_2$ here) and perform a rollout after each action, using common random numbers (CRN) to reduce the variance.

All that remains is to replace the expectations by sample averages and replace the Q value by an empirical estimate. The following sections describe two different schemes for performing this estimation. The first sampling scheme, which we call single path, is the one that is typically used for policy gradient estimation (Bartlett & Baxter, 2011), and is based on sampling individual trajectories. The second scheme, which we call vine, involves constructing a rollout set and then performing multiple actions from each state in the rollout set. This method has mostly been explored in the context of policy iteration methods (Lagoudakis & Parr, 2003; Gabillon et al., 2013).

5.1 Single Path

In this estimation procedure, we collect a sequence of states by sampling $s_0 \sim \rho_0$ and then simulating the policy $\pi_{\theta_{\mathrm{old}}}$ for some number of timesteps to generate a trajectory $s_0, a_0, s_1, a_1, \dots, s_{T-1}, a_{T-1}, s_T$. Hence, $q(a \mid s) = \pi_{\theta_{\mathrm{old}}}(a \mid s)$. $Q_{\theta_{\mathrm{old}}}(s, a)$ is computed at each state-action pair $(s_t, a_t)$ by taking the discounted sum of future rewards along the trajectory.
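As a concrete illustration of the single-path scheme and the sample-average form of Equation (14), the following sketch computes the discounted suffix-sum Q estimates, the importance-weighted surrogate, and the mean KL for a discrete action space. It is a minimal sketch only: the synthetic arrays stand in for a logged rollout, and in practice the per-state action distributions would come from the old and candidate policies evaluated at the visited states.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, T, nA = 0.99, 200, 4

# Synthetic single-path data (placeholders for a real rollout).
rewards = rng.normal(size=T)
pi_old  = rng.dirichlet(np.ones(nA), size=T)   # pi_theta_old(. | s_t) at each visited state
pi_new  = rng.dirichlet(np.ones(nA), size=T)   # pi_theta(. | s_t), the candidate policy
actions = np.array([rng.choice(nA, p=p) for p in pi_old])   # a_t ~ pi_old, so q = pi_old

# Q estimates: discounted sum of future rewards along the trajectory (Section 5.1).
Q_hat = np.zeros(T)
running = 0.0
for t in reversed(range(T)):
    running = rewards[t] + gamma * running
    Q_hat[t] = running

# Sample-average form of Equation (14): importance-weighted surrogate and mean KL.
ratio = pi_new[np.arange(T), actions] / pi_old[np.arange(T), actions]
surrogate = np.mean(ratio * Q_hat)
mean_kl = np.mean(np.sum(pi_old * np.log(pi_old / pi_new), axis=1))
print(surrogate, mean_kl)   # a step is acceptable only if mean_kl <= delta
```

In an actual implementation the candidate policy changes during optimization, so `surrogate` and `mean_kl` would be re-evaluated as functions of the policy parameters rather than fixed arrays.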
5.2 Vine

In this estimation procedure, we first sample $s_0 \sim \rho_0$ and simulate the policy $\pi_{\theta_i}$ to generate a number of trajectories. We then choose a subset of $N$ states along these trajectories, denoted $s_1, s_2, \dots, s_N$, which we call the "rollout set". For each state $s_n$ in the rollout set, we sample $K$ actions according to $a_{n,k} \sim q(\cdot \mid s_n)$. Any choice of $q(\cdot \mid s_n)$ with a support that includes the support of $\pi_{\theta_i}(\cdot \mid s_n)$ will produce a consistent estimator. In practice, we found that $q(\cdot \mid s_n) = \pi_{\theta_i}(\cdot \mid s_n)$ works well on continuous problems, such as robotic locomotion, while the uniform distribution works well on discrete tasks, such as the Atari games, where it can sometimes achieve better exploration.

For each action $a_{n,k}$ sampled at each state $s_n$, we estimate $\hat Q_{\theta_i}(s_n, a_{n,k})$ by performing a rollout (i.e., a short trajectory) starting with state $s_n$ and action $a_{n,k}$. We can greatly reduce the variance of the Q-value differences between rollouts by using the same random number sequence for the noise in each of the $K$ rollouts, i.e., common random numbers. See (Bertsekas, 2005) for additional discussion on Monte Carlo estimation of Q-values and (Ng & Jordan, 2000) for a discussion of common random numbers in reinforcement learning.

In small, finite action spaces, we can generate a rollout for every possible action from a given state. The contribution to $L_{\theta_{\mathrm{old}}}$ from a single state $s_n$ is as follows:

$$L_n(\theta) = \sum_{k=1}^{K} \pi_\theta(a_k \mid s_n)\, \hat Q(s_n, a_k), \qquad (15)$$

where the action space is $A = \{a_1, a_2, \dots, a_K\}$. In large or continuous state spaces, we can construct an estimator of the surrogate objective using importance sampling. The self-normalized estimator (Owen (2013), Chapter 9) of $L_{\theta_{\mathrm{old}}}$ obtained at a single state $s_n$ is

$$L_n(\theta) = \frac{\sum_{k=1}^{K} \frac{\pi_\theta(a_{n,k} \mid s_n)}{\pi_{\theta_{\mathrm{old}}}(a_{n,k} \mid s_n)}\, \hat Q(s_n, a_{n,k})}{\sum_{k=1}^{K} \frac{\pi_\theta(a_{n,k} \mid s_n)}{\pi_{\theta_{\mathrm{old}}}(a_{n,k} \mid s_n)}}, \qquad (16)$$

assuming that we performed $K$ actions $a_{n,1}, a_{n,2}, \dots, a_{n,K}$ from state $s_n$. This self-normalized estimator removes the need to use a baseline for the Q-values (note that the gradient is unchanged by adding a constant to the Q-values). Averaging over $s_n \sim \rho(\pi)$, we obtain an estimator for $L_{\theta_{\mathrm{old}}}$, as well as its gradient.

The vine and single path methods are illustrated in Figure 1. We use the term vine, since the trajectories used for sampling can be likened to the stems of vines, which branch at various points (the rollout set) into several short offshoots (the rollout trajectories).

The benefit of the vine method over the single path method is that our local estimate of the objective has much lower variance given the same number of Q-value samples in the surrogate objective. That is, the vine method gives much better estimates of the advantage values. The downside of the vine method is that we must perform far more calls to the simulator for each of these advantage estimates. Furthermore, the vine method requires us to generate multiple trajectories from each state in the rollout set, which limits this algorithm to settings where the system can be reset to an arbitrary state. In contrast, the single path algorithm requires no state resets and can be directly implemented on a physical system (Peters & Schaal, 2008b).

6 Practical Algorithm

Here we present two practical policy optimization algorithms based on the ideas above, which use either the single path or vine sampling scheme from the preceding section. The algorithms repeatedly perform the following steps:

1. Use the single path or vine procedures to collect a set of state-action pairs along with Monte Carlo estimates of their Q-values.
2. By averaging over samples, construct the estimated objective and constraint in Equation (14).
3. Approximately solve this constrained optimization problem to update the policy's parameter vector $\theta$. We use the conjugate gradient algorithm followed by a line search, which is altogether only slightly more expensive than computing the gradient itself. See Appendix C for details.

With regard to (3), we construct the Fisher information matrix (FIM) by analytically computing the Hessian of the KL divergence, rather than using the covariance matrix of the gradients. That is, we estimate $A_{ij}$ as

$$\frac{1}{N}\sum_{n=1}^{N} \frac{\partial^2}{\partial\theta_i\, \partial\theta_j} D_{KL}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_n) \,\|\, \pi_\theta(\cdot \mid s_n)\big), \quad \text{rather than} \quad \frac{1}{N}\sum_{n=1}^{N} \frac{\partial}{\partial\theta_i}\log \pi_\theta(a_n \mid s_n)\, \frac{\partial}{\partial\theta_j}\log \pi_\theta(a_n \mid s_n).$$

The analytic estimator integrates over the action at each state $s_n$, and does not depend on the action $a_n$ that was sampled.
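The relationship between the two estimators can be illustrated at a single state for a categorical (softmax) policy, where both have closed forms. This is a minimal sketch, not the paper's implementation: in the paper the policy is a neural network and the analytic form is averaged over sampled states, but the single-state case shows that the KL-Hessian estimator and the covariance-of-gradients estimator target the same matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
nA = 5
logits = rng.normal(size=nA)
p = np.exp(logits - logits.max()); p /= p.sum()

# Analytic Fisher information w.r.t. the logits: the Hessian of
# KL(pi_old || pi_theta) at theta = theta_old, which equals diag(p) - p p^T.
F_analytic = np.diag(p) - np.outer(p, p)

# Empirical estimate: covariance of the score grad log pi(a) = e_a - p, with a ~ pi.
N = 200_000
a = rng.choice(nA, size=N, p=p)
scores = np.eye(nA)[a] - p            # each row is grad_logits log pi(a_n)
F_empirical = scores.T @ scores / N

print(np.abs(F_analytic - F_empirical).max())   # small, and shrinks as N grows
```

The empirical estimate fluctuates with the sampled actions, whereas the analytic form does not, which is the point made in the text above.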
As described in Appendix C, this analytic estimator has computational benefits in the large-scale setting, since it removes the need to store a dense Hessian or all policy gradients from a batch of trajectories. The rate of improvement in the policy is similar to the empirical FIM, as shown in the experiments.

Let us briefly summarize the relationship between the theory from Section 3 and the practical algorithm we have described:

- The theory justifies optimizing a surrogate objective with a penalty on KL divergence. However, the large penalty coefficient $C$ leads to prohibitively small steps, so we would like to decrease this coefficient. Empirically, it is hard to robustly choose the penalty coefficient, so we use a hard constraint instead of a penalty, with parameter $\delta$ (the bound on KL divergence).
- The constraint on $D_{KL}^{\max}(\theta_{\mathrm{old}}, \theta)$ is hard for numerical optimization and estimation, so instead we constrain $\bar D_{KL}(\theta_{\mathrm{old}}, \theta)$.
- Our theory ignores estimation error for the advantage function. Kakade & Langford (2002) consider this error in their derivation, and the same arguments would hold in the setting of this paper, but we omit them for simplicity.

7 Connections with Prior Work

As mentioned in Section 4, our derivation results in a policy update that is related to several prior methods, providing a unifying perspective on a number of policy update schemes.

The natural policy gradient (Kakade, 2002) can be obtained as a special case of the update in Equation (12) by using a linear approximation to $L$ and a quadratic approximation to the $\bar D_{KL}$ constraint, resulting in the following problem:

$$\underset{\theta}{\text{maximize}}\ \big[\nabla_\theta L_{\theta_{\mathrm{old}}}(\theta)\big|_{\theta=\theta_{\mathrm{old}}} \cdot (\theta - \theta_{\mathrm{old}})\big] \quad \text{subject to } \tfrac{1}{2}(\theta_{\mathrm{old}} - \theta)^T A(\theta_{\mathrm{old}})(\theta_{\mathrm{old}} - \theta) \leq \delta, \qquad (17)$$

where

$$A(\theta_{\mathrm{old}})_{ij} = \frac{\partial}{\partial\theta_i}\frac{\partial}{\partial\theta_j}\, \mathbb{E}_{s \sim \rho_\pi}\big[D_{KL}\big(\pi(\cdot \mid s, \theta_{\mathrm{old}}) \,\|\, \pi(\cdot \mid s, \theta)\big)\big]\Big|_{\theta=\theta_{\mathrm{old}}}.$$

The update is $\theta_{\mathrm{new}} = \theta_{\mathrm{old}} + \frac{1}{\lambda} A(\theta_{\mathrm{old}})^{-1}\, \nabla_\theta L(\theta)\big|_{\theta=\theta_{\mathrm{old}}}$, where the stepsize $\frac{1}{\lambda}$ is typically treated as an algorithm parameter. This differs from our approach, which enforces the constraint at each update. Though this difference might seem subtle, our experiments demonstrate that it significantly improves the algorithm's performance on larger problems.

We can also obtain the standard policy gradient update by using an $\ell_2$ constraint or penalty:

$$\underset{\theta}{\text{maximize}}\ \big[\nabla_\theta L_{\theta_{\mathrm{old}}}(\theta)\big|_{\theta=\theta_{\mathrm{old}}} \cdot (\theta - \theta_{\mathrm{old}})\big] \quad \text{subject to } \tfrac{1}{2}\|\theta - \theta_{\mathrm{old}}\|^2 \leq \delta. \qquad (18)$$

The policy iteration update can also be obtained by solving the unconstrained problem $\text{maximize}_\pi\ L_{\pi_{\mathrm{old}}}(\pi)$, using $L$ as defined in Equation (3).

Several other methods employ an update similar to Equation (12). Relative entropy policy search (REPS) (Peters et al., 2010) constrains the state-action marginals $p(s, a)$, while TRPO constrains the conditionals $p(a \mid s)$. Unlike REPS, our approach does not require a costly nonlinear optimization in the inner loop. Levine and Abbeel (2014) also use a KL divergence constraint, but its purpose is to encourage the policy not to stray from regions where the estimated dynamics model is valid, while we do not attempt to estimate the system dynamics explicitly. Pirotta et al. (2013) also build on and generalize Kakade and Langford's results, and they derive different algorithms from the ones here.

8 Experiments

We designed our experiments to investigate the following questions:

1. What are the performance characteristics of the single path and vine sampling procedures?
2. TRPO is related to prior methods (e.g. natural policy gradient) but makes several changes, most notably by using a fixed KL divergence rather than a fixed penalty coefficient. How does this affect the performance of the algorithm?
3. Can TRPO be used to solve challenging large-scale problems? How does TRPO compare with other methods when applied to large-scale problems, with regard to final performance, computation time, and sample complexity?

Figure 2. 2D robot models used for locomotion experiments. From left to right: swimmer, hopper, walker. The hopper and walker present a particular challenge, due to underactuation and contact discontinuities.

Figure 3. Neural networks used for the locomotion task (top) and for playing Atari games (bottom).

To answer (1) and (2), we compare the performance of the single path and vine variants of TRPO, several ablated variants, and a number of prior policy optimization algorithms. With regard to (3), we show that both the single path and vine algorithm can obtain high-quality locomotion controllers from scratch, which is considered to be a hard problem. We also show that these algorithms produce competitive results when learning policies for playing Atari games from images using convolutional neural networks with tens of thousands of parameters.

8.1 Simulated Robotic Locomotion

We conducted the robotic locomotion experiments using the MuJoCo simulator (Todorov et al., 2012). The three simulated robots are shown in Figure 2.
The states of the robots are their generalized positions and velocities, and the controls are joint torques. Underactuation, high dimensionality, and non-smooth dynamics due to contacts make these tasks very challenging.

The following models are included in our evaluation:

1. Swimmer. 10-dimensional state space, linear reward for forward progress and a quadratic penalty on joint effort to produce the reward $r(x, u) = v_x - 10^{-5}\|u\|^2$. The swimmer can propel itself forward by making an undulating motion.
2. Hopper. 12-dimensional state space, same reward as the swimmer, with a bonus of +1 for being in a non-terminal state. We ended the episodes when the hopper fell over, which was defined by thresholds on the torso height and angle.
3. Walker. 18-dimensional state space. For the walker, we added a penalty for strong impacts of the feet against the ground to encourage a smooth walk rather than a hopping gait.

We used $\delta = 0.01$ for all experiments. See Table 2 in the Appendix for more details on the experimental setup and parameters used. We used neural networks to represent the policy, with the architecture shown in Figure 3, and further details provided in Appendix D. To establish a standard baseline, we also included the classic cart-pole balancing problem, based on the formulation from Barto et al. (1983), using a linear policy with six parameters that is easy to optimize with derivative-free black-box optimization methods.

The following algorithms were considered in the comparison: single path TRPO; vine TRPO; cross-entropy method (CEM), a gradient-free method (Szita & Lörincz, 2006); covariance matrix adaptation (CMA), another gradient-free method (Hansen & Ostermeier, 1996); natural gradient, the classic natural policy gradient algorithm (Kakade, 2002), which differs from single path by the use of a fixed penalty coefficient (Lagrange multiplier) instead of the KL divergence constraint; empirical FIM, identical to single path, except that the FIM is estimated using the covariance matrix of the gradients rather than the analytic estimate; and max KL, which was only tractable on the cart-pole problem, and uses the maximum KL divergence in Equation (11), rather than the average divergence, allowing us to evaluate the quality of this approximation. The parameters used in the experiments are provided in Appendix E. For the natural gradient method, we swept through the possible values of the stepsize in factors of three, and took the best value according to the final performance.

Learning curves showing the total reward averaged across five runs of each algorithm are shown in Figure 4. Single path and vine TRPO solved all of the problems, yielding the best solutions. Natural gradient performed well on the two easier problems, but was unable to generate hopping and walking gaits that made forward progress. These results provide empirical evidence that constraining the KL divergence is a more robust way to choose step sizes and make fast, consistent progress, compared to using a fixed penalty.

Figure 4. Learning curves for the locomotion tasks (cartpole, swimmer, hopper, walker), averaged across five runs of each algorithm with random initializations; the curves compare vine, single path, natural gradient, max KL, empirical FIM, CEM, CMA, and RWR. Note that for the hopper and walker, a score of 1 is achievable without any forward velocity, indicating a policy that simply learned balanced standing, but not walking.

CEM and CMA are derivative-free algorithms, hence their sample complexity scales unfavorably with the number of parameters, and they performed poorly on the larger problems.
The max KL method learned somewhat more slowly than our final method, due to the more restrictive form of the constraint, but overall the result suggests that the average KL divergence constraint has a similar effect as the theoretically justified maximum KL divergence. Videos of the policies learned by TRPO may be viewed on the project website: site/trpopaper/.

Note that TRPO learned all of the gaits with general-purpose policies and simple reward functions, using minimal prior knowledge. This is in contrast with most prior methods for learning locomotion, which typically rely on hand-architected policy classes that explicitly encode notions of balance and stepping (Tedrake et al., 2004; Geng et al., 2006; Wampler & Popović, 2009).

8.2 Playing Games from Images

To evaluate TRPO on a partially observed task with complex observations, we trained policies for playing Atari games, using raw images as input. The games require learning a variety of behaviors, such as dodging bullets and hitting balls with paddles. Aside from the high dimensionality, challenging elements of these games include delayed rewards (no immediate penalty is incurred when a life is lost in Breakout or Space Invaders); complex sequences of behavior (Q*bert requires a character to hop on 21 different platforms); and non-stationary image statistics (Enduro involves a changing and flickering background). We tested our algorithms on the same seven games reported on in (Mnih et al., 2013) and (Guo et al., 2014), which are made available through the Arcade Learning Environment (Bellemare et al., 2013).

Table 1. Performance comparison for vision-based RL algorithms on the Atari domain, over the games B. Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, and S. Invaders; the rows compare Random, Human (Mnih et al., 2013), Deep Q Learning (Mnih et al., 2013), UCC-I (Guo et al., 2014), TRPO single path, and TRPO vine. Our algorithms (bottom rows) were run once on each task, with the same architecture and parameters. Performance varies substantially from run to run (with different random initializations of the policy), but we could not obtain error statistics due to time constraints.

The images were preprocessed following the protocol in Mnih et al. (2013), and the policy was represented by the convolutional neural network shown in Figure 3, with two convolutional layers with 16 channels and stride 2, followed by one fully-connected layer with 20 units, yielding 33,500 parameters.

The results of the vine and single path algorithms are summarized in Table 1, which also includes an expert human performance and two recent methods: deep Q-learning (Mnih et al., 2013), and a combination of Monte-Carlo Tree Search with supervised training (Guo et al., 2014), called UCC-I. The 500 iterations of our algorithm took about 30 hours (with a slight variation between games) on a 16-core computer. While our method only outperformed the prior methods on some of the games, it consistently achieved reasonable scores. Unlike the prior methods, our approach was not designed specifically for this task. The ability to apply the same policy search method to tasks as diverse as robotic locomotion and image-based game playing demonstrates the generality of TRPO.

9 Discussion

We proposed and analyzed trust region methods for optimizing stochastic control policies. We proved monotonic improvement for an algorithm that repeatedly optimizes a local approximation to the expected return of the policy with a KL divergence penalty, and we showed that an approximation to this method that incorporates a KL divergence constraint achieves good empirical results on a range of challenging policy learning tasks, outperforming prior methods. Our analysis also provides a perspective that unifies policy gradient and policy iteration methods, and shows them to be special limiting cases of an algorithm that optimizes a certain objective subject to a trust region constraint.

In the domain of robotic locomotion, we successfully learned controllers for swimming, walking and hopping in a physics simulator, using general purpose neural networks and minimally informative rewards. To our knowledge, no prior work has learned controllers from scratch for all of these tasks, using a generic policy search method and non-engineered, general-purpose policy representations. In the game-playing domain, we learned convolutional neural network policies that used raw images as inputs. This requires optimizing extremely high-dimensional policies, and only two prior methods report successful results on this task.

Since the method we proposed is scalable and has strong theoretical foundations, we hope that it will serve as a jumping-off point for future work on training large, rich function approximators for a range of challenging problems. At the intersection of the two experimental domains we explored, there is the possibility of learning robotic control policies that use vision and raw sensory data as input, providing a unified scheme for training robotic controllers that perform both perception and control. The use of more sophisticated policies, including recurrent policies with hidden state, could further make it possible to roll state estimation and control into the same policy in the partially observed setting.
By combining our method with model learning, it would also be possible to substantially reduce its sample complexity, making it applicable to real-world settings where samples are expensive.

Acknowledgements

We thank Emo Todorov and Yuval Tassa for providing the MuJoCo simulator; Bruno Scherrer, Tom Erez, Greg Wayne, and the anonymous ICML reviewers for insightful comments; and Vitchyr Pong and Shane Gu for pointing out errors in a previous version of the manuscript. This research was funded in part by the Office of Naval Research through a Young Investigator Award and under grant number N, by DARPA through a Young Faculty Award, and by the Army Research Office through the MAST program.

References

Bagnell, J. A. and Schneider, J. Covariant policy search. IJCAI, 2003.

Bartlett, P. L. and Baxter, J. Infinite-horizon policy-gradient estimation. arXiv preprint, 2011.

Barto, A., Sutton, R., and Anderson, C. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, (5), 1983.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, June 2013.

Bertsekas, D. Dynamic programming and optimal control. 2005.

Deisenroth, M., Neumann, G., and Peters, J. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1-142, 2013.

Gabillon, Victor, Ghavamzadeh, Mohammad, and Scherrer, Bruno. Approximate dynamic programming finally performs well in the game of Tetris. In Advances in Neural Information Processing Systems, 2013.

Geng, T., Porr, B., and Wörgötter, F. Fast biped walking with a reflexive controller and realtime policy searching. In Advances in Neural Information Processing Systems (NIPS), 2006.

Guo, X., Singh, S., Lee, H., Lewis, R. L., and Wang, X. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems, 2014.

Hansen, Nikolaus and Ostermeier, Andreas. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Evolutionary Computation, 1996, Proceedings of IEEE International Conference on. IEEE, 1996.

Hunter, David R and Lange, Kenneth. A tutorial on MM algorithms. The American Statistician, 58(1):30-37, 2004.

Kakade, Sham. A natural policy gradient. In Advances in Neural Information Processing Systems. MIT Press, 2002.

Kakade, Sham and Langford, John. Approximately optimal approximate reinforcement learning. In ICML, volume 2, 2002.

Lagoudakis, Michail G and Parr, Ronald. Reinforcement learning as classification: Leveraging modern classifiers. In ICML, volume 3, 2003.

Levin, D. A., Peres, Y., and Wilmer, E. L. Markov chains and mixing times. American Mathematical Society, 2009.

Levine, Sergey and Abbeel, Pieter. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, 2014.

Martens, J. and Sutskever, I. Training deep and recurrent networks with hessian-free optimization. In Neural Networks: Tricks of the Trade. Springer, 2012.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint, 2013.

Nemirovski, Arkadi. Efficient methods in convex programming. 2005.

Ng, A. Y. and Jordan, M. PEGASUS: A policy search method for large MDPs and POMDPs. In Uncertainty in Artificial Intelligence (UAI), 2000.

Owen, Art B. Monte Carlo theory, methods and examples. 2013.

Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. arXiv preprint, 2013.

Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 2008a.

Peters, J., Mülling, K., and Altün, Y. Relative entropy policy search. In AAAI Conference on Artificial Intelligence, 2010.

Peters, Jan and Schaal, Stefan. Natural actor-critic. Neurocomputing, 71(7), 2008b.

Pirotta, Matteo, Restelli, Marcello, Pecorino, Alessio, and Calandriello, Daniele. Safe policy iteration. In Proceedings of The 30th International Conference on Machine Learning, 2013.

Pollard, David. Asymptopia: an exposition of statistical asymptotic theory. 2000. URL pollard/books/asymptopia.

Szita, István and Lörincz, András. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12), 2006.

Tedrake, R., Zhang, T., and Seung, H. Stochastic policy gradient reinforcement learning on a simple 3d biped. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2004.

Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012.

Wampler, Kevin and Popović, Zoran. Optimal gait and form for animal locomotion. In ACM Transactions on Graphics (TOG), volume 28, pp. 60. ACM, 2009.

Wright, Stephen J and Nocedal, Jorge. Numerical optimization, volume 2. Springer New York, 1999.

A Proof of Policy Improvement Bound

This proof (of Theorem 1) uses techniques from the proof of Theorem 4.1 in (Kakade & Langford, 2002), adapting them to the more general setting considered in this paper. An informal overview is as follows. Our proof relies on the notion of coupling, where we jointly define the policies $\pi$ and $\tilde\pi$ so that they choose the same action with high probability $(= 1 - \alpha)$. The surrogate loss $L_\pi(\tilde\pi)$ accounts for the advantage of $\tilde\pi$ the first time that it disagrees with $\pi$, but not subsequent disagreements. Hence, the error in $L_\pi$ is due to two or more disagreements between $\pi$ and $\tilde\pi$; hence, we get an $O(\alpha^2)$ correction term, where $\alpha$ is the probability of disagreement.

We start out with a lemma from Kakade & Langford (2002) that shows that the difference in policy performance $\eta(\tilde\pi) - \eta(\pi)$ can be decomposed as a sum of per-timestep advantages.

Lemma 1. Given two policies $\pi, \tilde\pi$,

$$\eta(\tilde\pi) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde\pi}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big]. \qquad (19)$$

This expectation is taken over trajectories $\tau := (s_0, a_0, s_1, a_1, \dots)$, and the notation $\mathbb{E}_{\tau \sim \tilde\pi}[\dots]$ indicates that actions are sampled from $\tilde\pi$ to generate $\tau$.

Proof. First note that

$$A_\pi(s, a) = \mathbb{E}_{s' \sim P(s' \mid s, a)}\big[r(s) + \gamma V_\pi(s') - V_\pi(s)\big]. \qquad (20)$$

Therefore,

$$\begin{aligned}
\mathbb{E}_{\tau \sim \tilde\pi}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big]
&= \mathbb{E}_{\tau \sim \tilde\pi}\Big[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t) + \gamma V_\pi(s_{t+1}) - V_\pi(s_t)\big)\Big] \qquad (21) \\
&= \mathbb{E}_{\tau \sim \tilde\pi}\Big[-V_\pi(s_0) + \sum_{t=0}^{\infty} \gamma^t r(s_t)\Big] \qquad (22) \\
&= -\mathbb{E}_{s_0}\big[V_\pi(s_0)\big] + \mathbb{E}_{\tau \sim \tilde\pi}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big] \qquad (23) \\
&= -\eta(\pi) + \eta(\tilde\pi). \qquad (24)
\end{aligned}$$

Rearranging, the result follows.

Define $\bar A(s)$ to be the expected advantage of $\tilde\pi$ over $\pi$ at state $s$:

$$\bar A(s) = \mathbb{E}_{a \sim \tilde\pi(\cdot \mid s)}\big[A_\pi(s, a)\big]. \qquad (25)$$

Now Lemma 1 can be written as follows:

$$\eta(\tilde\pi) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde\pi}\Big[\sum_{t=0}^{\infty} \gamma^t \bar A(s_t)\Big]. \qquad (26)$$

Note that $L_\pi$ can be written as

$$L_\pi(\tilde\pi) = \eta(\pi) + \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t \bar A(s_t)\Big]. \qquad (27)$$

The difference in these equations is whether the states are sampled using $\pi$ or $\tilde\pi$. To bound the difference between $\eta(\tilde\pi)$ and $L_\pi(\tilde\pi)$, we will bound the difference arising from each timestep. To do this, we first need to introduce a measure of how much $\pi$ and $\tilde\pi$ agree. Specifically, we'll couple the policies, so that they define a joint distribution over pairs of actions.

Definition 1. $(\pi, \tilde\pi)$ is an $\alpha$-coupled policy pair if it defines a joint distribution $(a, \tilde a) \mid s$, such that $P(a \neq \tilde a \mid s) \leq \alpha$ for all $s$. $\pi$ and $\tilde\pi$ will denote the marginal distributions of $a$ and $\tilde a$, respectively.

Computationally, $\alpha$-coupling means that if we randomly choose a seed for our random number generator, and then we sample from each of $\pi$ and $\tilde\pi$ after setting that seed, the results will agree for at least fraction $1 - \alpha$ of seeds.

Lemma 2. Given that $\pi, \tilde\pi$ are $\alpha$-coupled policies, for all $s$,

$$|\bar A(s)| \leq 2\alpha \max_{s,a} |A_\pi(s, a)|. \qquad (28)$$

Proof.

$$\begin{aligned}
\bar A(s) &= \mathbb{E}_{\tilde a \sim \tilde\pi}\big[A_\pi(s, \tilde a)\big] = \mathbb{E}_{(a, \tilde a) \sim (\pi, \tilde\pi)}\big[A_\pi(s, \tilde a) - A_\pi(s, a)\big] \quad \text{since } \mathbb{E}_{a \sim \pi}[A_\pi(s, a)] = 0 \qquad (29) \\
&= P(a \neq \tilde a \mid s)\, \mathbb{E}_{(a, \tilde a) \sim (\pi, \tilde\pi) \mid a \neq \tilde a}\big[A_\pi(s, \tilde a) - A_\pi(s, a)\big] \qquad (30) \\
|\bar A(s)| &\leq \alpha \cdot 2 \max_{s,a} |A_\pi(s, a)|. \qquad (31)
\end{aligned}$$

Lemma 3. Let $(\pi, \tilde\pi)$ be an $\alpha$-coupled policy pair. Then

$$\big|\mathbb{E}_{s_t \sim \tilde\pi}[\bar A(s_t)] - \mathbb{E}_{s_t \sim \pi}[\bar A(s_t)]\big| \leq 4\alpha\big(1 - (1-\alpha)^t\big)\max_{s,a} |A_\pi(s, a)|. \qquad (32)$$

Proof. Given the coupled policy pair $(\pi, \tilde\pi)$, we can also obtain a coupling over the trajectory distributions produced by $\pi$ and $\tilde\pi$, respectively. Namely, we have pairs of trajectories $\tau, \tilde\tau$, where $\tau$ is obtained by taking actions from $\pi$, and $\tilde\tau$ is obtained by taking actions from $\tilde\pi$, where the same random seed is used to generate both trajectories. We will consider the advantage of $\tilde\pi$ over $\pi$ at timestep $t$, and decompose this expectation based on whether $\pi$ agrees with $\tilde\pi$ at all timesteps $i < t$. Let $n_t$ denote the number of times that $a_i \neq \tilde a_i$ for $i < t$, i.e., the number of times that $\pi$ and $\tilde\pi$ disagree before timestep $t$.

$$\mathbb{E}_{s_t \sim \tilde\pi}[\bar A(s_t)] = P(n_t = 0)\, \mathbb{E}_{s_t \sim \tilde\pi \mid n_t = 0}[\bar A(s_t)] + P(n_t > 0)\, \mathbb{E}_{s_t \sim \tilde\pi \mid n_t > 0}[\bar A(s_t)]. \qquad (33)$$

The expectation decomposes similarly when actions are sampled using $\pi$:

$$\mathbb{E}_{s_t \sim \pi}[\bar A(s_t)] = P(n_t = 0)\, \mathbb{E}_{s_t \sim \pi \mid n_t = 0}[\bar A(s_t)] + P(n_t > 0)\, \mathbb{E}_{s_t \sim \pi \mid n_t > 0}[\bar A(s_t)]. \qquad (34)$$

Note that the $n_t = 0$ terms are equal:

$$\mathbb{E}_{s_t \sim \tilde\pi \mid n_t = 0}[\bar A(s_t)] = \mathbb{E}_{s_t \sim \pi \mid n_t = 0}[\bar A(s_t)], \qquad (35)$$

because $n_t = 0$ indicates that $\pi$ and $\tilde\pi$ agreed on all timesteps less than $t$. Subtracting Equations (33) and (34), we get

$$\mathbb{E}_{s_t \sim \tilde\pi}[\bar A(s_t)] - \mathbb{E}_{s_t \sim \pi}[\bar A(s_t)] = P(n_t > 0)\big(\mathbb{E}_{s_t \sim \tilde\pi \mid n_t > 0}[\bar A(s_t)] - \mathbb{E}_{s_t \sim \pi \mid n_t > 0}[\bar A(s_t)]\big). \qquad (36)$$

By definition of $\alpha$, $P(\pi, \tilde\pi \text{ agree at timestep } i) \geq 1 - \alpha$, so $P(n_t = 0) \geq (1-\alpha)^t$, and

$$P(n_t > 0) \leq 1 - (1-\alpha)^t. \qquad (37)$$

Next, note that

$$\big|\mathbb{E}_{s_t \sim \tilde\pi \mid n_t > 0}[\bar A(s_t)] - \mathbb{E}_{s_t \sim \pi \mid n_t > 0}[\bar A(s_t)]\big| \leq \big|\mathbb{E}_{s_t \sim \tilde\pi \mid n_t > 0}[\bar A(s_t)]\big| + \big|\mathbb{E}_{s_t \sim \pi \mid n_t > 0}[\bar A(s_t)]\big| \qquad (38)$$
$$\leq 4\alpha \max_{s,a} |A_\pi(s, a)|, \qquad (39)$$

where the second inequality follows from Lemma 2. Plugging Equation (37) and Equation (39) into Equation (36), we get

$$\big|\mathbb{E}_{s_t \sim \tilde\pi}[\bar A(s_t)] - \mathbb{E}_{s_t \sim \pi}[\bar A(s_t)]\big| \leq 4\alpha\big(1 - (1-\alpha)^t\big)\max_{s,a} |A_\pi(s, a)|. \qquad (40)$$

The preceding Lemma bounds the difference in expected advantage at each timestep $t$. We can sum over time to bound the difference between $\eta(\tilde\pi)$ and $L_\pi(\tilde\pi)$. Subtracting Equation (26) and Equation (27), and defining $\epsilon = \max_{s,a} |A_\pi(s, a)|$,

$$\begin{aligned}
|\eta(\tilde\pi) - L_\pi(\tilde\pi)| &= \Big|\sum_{t=0}^{\infty} \gamma^t \big(\mathbb{E}_{\tau \sim \tilde\pi}[\bar A(s_t)] - \mathbb{E}_{\tau \sim \pi}[\bar A(s_t)]\big)\Big| \qquad (41) \\
&\leq \sum_{t=0}^{\infty} \gamma^t \cdot 4\epsilon\alpha\big(1 - (1-\alpha)^t\big) \qquad (42) \\
&= 4\epsilon\alpha\Big(\frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\alpha)}\Big) \qquad (43) \\
&= \frac{4\alpha^2\gamma\epsilon}{(1-\gamma)(1-\gamma(1-\alpha))} \qquad (44) \\
&\leq \frac{4\alpha^2\gamma\epsilon}{(1-\gamma)^2}. \qquad (45)
\end{aligned}$$

Last, to replace $\alpha$ by the total variation divergence, we need to use the correspondence between TV divergence and coupled random variables: Suppose $p_X$ and $p_Y$ are distributions with $D_{TV}(p_X \,\|\, p_Y) = \alpha$. Then there exists a joint distribution $(X, Y)$ whose marginals are $p_X, p_Y$, for which $X = Y$ with probability $1 - \alpha$. See (Levin et al., 2009), Proposition 4.7. It follows that if we have two policies $\pi$ and $\tilde\pi$ such that $\max_s D_{TV}(\pi(\cdot \mid s) \,\|\, \tilde\pi(\cdot \mid s)) \leq \alpha$, then we can define an $\alpha$-coupled policy pair $(\pi, \tilde\pi)$ with appropriate marginals. Taking $\alpha = \max_s D_{TV}(\pi(\cdot \mid s) \,\|\, \tilde\pi(\cdot \mid s))$ in Equation (45), Theorem 1 follows.

B Perturbation Theory Proof of Policy Improvement Bound

We also provide an alternative proof of Theorem 1 using perturbation theory.

Proof. Let $G = (1 + \gamma P_\pi + (\gamma P_\pi)^2 + \dots) = (1 - \gamma P_\pi)^{-1}$, and similarly let $\tilde G = (1 + \gamma P_{\tilde\pi} + (\gamma P_{\tilde\pi})^2 + \dots) = (1 - \gamma P_{\tilde\pi})^{-1}$. We will use the convention that $\rho$ (a density on state space) is a vector and $r$ (a reward function on state space) is a dual vector (i.e., a linear functional on vectors), thus $r\rho$ is a scalar meaning the expected reward under density $\rho$. Note that $\eta(\pi) = rG\rho_0$, and $\eta(\tilde\pi) = r\tilde G\rho_0$. Let $\Delta = P_{\tilde\pi} - P_\pi$. We want to bound $\eta(\tilde\pi) - \eta(\pi) = r(\tilde G - G)\rho_0$. We start with some standard perturbation theory manipulations:

$$G^{-1} - \tilde G^{-1} = (1 - \gamma P_\pi) - (1 - \gamma P_{\tilde\pi}) = \gamma\Delta. \qquad (46)$$

Left multiplying by $G$ and right multiplying by $\tilde G$ gives

$$\tilde G - G = \gamma G \Delta \tilde G, \quad \text{i.e.,} \quad \tilde G = G + \gamma G \Delta \tilde G. \qquad (47)$$

Substituting the right-hand side into $\tilde G$ gives

$$\tilde G = G + \gamma G \Delta G + \gamma^2 G \Delta G \Delta \tilde G. \qquad (48)$$

So we have

$$\eta(\tilde\pi) - \eta(\pi) = r(\tilde G - G)\rho_0 = \gamma r G \Delta G \rho_0 + \gamma^2 r G \Delta G \Delta \tilde G \rho_0. \qquad (49)$$

Let us first consider the leading term $\gamma r G \Delta G \rho_0$. Note that $rG = v$, i.e., the infinite-horizon state-value function (as a dual vector). Also note that $G\rho_0 = \rho_\pi$. Thus we can write $\gamma r G \Delta G \rho_0 = \gamma v \Delta \rho_\pi$. We will show that this expression equals the expected advantage $L_\pi(\tilde\pi) - L_\pi(\pi)$:

$$\begin{aligned}
L_\pi(\tilde\pi) - L_\pi(\pi) &= \sum_s \rho_\pi(s) \sum_a \big(\tilde\pi(a \mid s) - \pi(a \mid s)\big) A_\pi(s, a) \\
&= \sum_s \rho_\pi(s) \sum_a \big(\tilde\pi(a \mid s) - \pi(a \mid s)\big)\Big[r(s) + \sum_{s'} p(s' \mid s, a)\,\gamma v(s') - v(s)\Big] \\
&= \sum_s \rho_\pi(s) \sum_a \big(\tilde\pi(a \mid s) - \pi(a \mid s)\big) \sum_{s'} p(s' \mid s, a)\,\gamma v(s') \\
&= \sum_s \rho_\pi(s) \sum_{s'} \big(p_{\tilde\pi}(s' \mid s) - p_\pi(s' \mid s)\big)\,\gamma v(s') \\
&= \gamma v \Delta \rho_\pi. \qquad (50)
\end{aligned}$$

Next let us bound the $O(\Delta^2)$ term $\gamma^2 r G \Delta G \Delta \tilde G \rho_0$. First we consider the product $\gamma r G \Delta = \gamma v \Delta$. Consider the component $s$ of this dual vector:

$$\begin{aligned}
|(\gamma v \Delta)_s| &= \Big|\sum_a \big(\tilde\pi(a \mid s) - \pi(a \mid s)\big) Q_\pi(s, a)\Big| = \Big|\sum_a \big(\tilde\pi(a \mid s) - \pi(a \mid s)\big) A_\pi(s, a)\Big| \\
&\leq \sum_a \big|\tilde\pi(a \mid s) - \pi(a \mid s)\big| \cdot \max_a |A_\pi(s, a)| \leq 2\alpha\epsilon, \qquad (51)
\end{aligned}$$

where the last line used the definition of the total-variation divergence, and the definition of $\epsilon = \max_{s,a} |A_\pi(s, a)|$. We bound the other portion $G \Delta \tilde G \rho_0$ using the $\ell_1$ operator norm

$$\|A\|_1 = \sup_\rho \Big\{\frac{\|A\rho\|_1}{\|\rho\|_1}\Big\}, \qquad (52)$$

where we have that $\|G\|_1 = \|\tilde G\|_1 = 1/(1-\gamma)$ and $\|\Delta\|_1 = 2\alpha$. That gives

$$\|G \Delta \tilde G \rho_0\|_1 \leq \|G\|_1 \|\Delta\|_1 \|\tilde G\|_1 \|\rho_0\|_1 = \frac{1}{1-\gamma}\cdot 2\alpha \cdot \frac{1}{1-\gamma}\cdot 1 = \frac{2\alpha}{(1-\gamma)^2}. \qquad (53)$$

So we have that

$$\gamma^2 \big|r G \Delta G \Delta \tilde G \rho_0\big| \leq \gamma \,\big\|\gamma r G \Delta\big\|_\infty \big\|G \Delta \tilde G \rho_0\big\|_1 \leq \gamma \cdot 2\alpha\epsilon \cdot \frac{2\alpha}{(1-\gamma)^2} = \frac{4\gamma\epsilon}{(1-\gamma)^2}\alpha^2. \qquad (54)$$

C Efficiently Solving the Trust-Region Constrained Optimization Problem

This section describes how to efficiently approximately solve the following constrained optimization problem, which we must solve at each iteration of TRPO:

$$\text{maximize } L(\theta) \quad \text{subject to } \bar D_{KL}(\theta_{\mathrm{old}}, \theta) \leq \delta. \qquad (55)$$

The method we will describe involves two steps: (1) compute a search direction, using a linear approximation to the objective and a quadratic approximation to the constraint; and (2) perform a line search in that direction, ensuring that we improve the nonlinear objective while satisfying the nonlinear constraint.

The search direction is computed by approximately solving the equation $Ax = g$, where $A$ is the Fisher information matrix, i.e., the quadratic approximation to the KL divergence constraint: $\bar D_{KL}(\theta_{\mathrm{old}}, \theta) \approx \frac{1}{2}(\theta - \theta_{\mathrm{old}})^T A (\theta - \theta_{\mathrm{old}})$, where $A_{ij} = \frac{\partial}{\partial\theta_i}\frac{\partial}{\partial\theta_j} \bar D_{KL}(\theta_{\mathrm{old}}, \theta)$. In large-scale problems, it is prohibitively costly (with respect to computation and memory) to form the full matrix $A$ (or $A^{-1}$). However, the conjugate gradient algorithm allows us to approximately solve the equation $Ax = b$ without forming this full matrix, when we merely have access to a function that computes matrix-vector products $y \mapsto Ay$. Appendix C.1 describes the most efficient way to compute matrix-vector products with the Fisher information matrix. For additional exposition on the use of Hessian-vector products for optimizing neural network objectives, see (Martens & Sutskever, 2012) and (Pascanu & Bengio, 2013).

Having computed the search direction $s \approx A^{-1}g$, we next need to compute the maximal step length $\beta$ such that $\theta + \beta s$ will satisfy the KL divergence constraint. To do this, let $\delta = \bar D_{KL} \approx \frac{1}{2}(\beta s)^T A (\beta s) = \frac{1}{2}\beta^2 s^T A s$. From this, we obtain $\beta = \sqrt{2\delta / (s^T A s)}$, where $\delta$ is the desired KL divergence. The term $s^T A s$ can be computed through a single Hessian vector product, and it is also an intermediate result produced by the conjugate gradient algorithm.

Last, we use a line search to ensure improvement of the surrogate objective and satisfaction of the KL divergence constraint, both of which are nonlinear in the parameter vector $\theta$ (and thus depart from the linear and quadratic approximations used to compute the step). We perform the line search on the objective $L_{\theta_{\mathrm{old}}}(\theta) - \mathcal{X}\big[\bar D_{KL}(\theta_{\mathrm{old}}, \theta) \leq \delta\big]$, where $\mathcal{X}[\dots]$ equals zero when its argument is true and $+\infty$ when it is false. Starting with the maximal value of the step length $\beta$ computed in the previous paragraph, we shrink $\beta$ exponentially until the objective improves. Without this line search, the algorithm occasionally computes large steps that cause a catastrophic degradation of performance.

C.1 Computing the Fisher-Vector Product

Here we will describe how to compute the matrix-vector product between the averaged Fisher information matrix and arbitrary vectors. This matrix-vector product enables us to perform the conjugate gradient algorithm. Suppose that the parameterized policy maps from the input $x$ to the distribution parameter vector $\mu_\theta(x)$, which parameterizes the distribution $\pi(u \mid x)$. Now the KL divergence for a given input $x$ can be written as follows:

$$D_{KL}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid x) \,\|\, \pi_\theta(\cdot \mid x)\big) = \mathrm{kl}\big(\mu_\theta(x), \mu_{\mathrm{old}}\big), \qquad (56)$$

where $\mathrm{kl}$ is the KL divergence between the distributions corresponding to the two mean parameter vectors. Differentiating $\mathrm{kl}$ twice with respect to $\theta$, we obtain

$$\frac{\partial \mu_a(x)}{\partial \theta_i}\frac{\partial \mu_b(x)}{\partial \theta_j}\,\mathrm{kl}''_{ab}\big(\mu_\theta(x), \mu_{\mathrm{old}}\big) + \frac{\partial^2 \mu_a(x)}{\partial \theta_i\, \partial \theta_j}\,\mathrm{kl}'_a\big(\mu_\theta(x), \mu_{\mathrm{old}}\big), \qquad (57)$$

where the primes ($'$) indicate differentiation with respect to the first argument, and there is an implied summation over indices $a, b$. The second term vanishes, leaving just the first term. Let $J := \frac{\partial \mu_a(x)}{\partial \theta_i}$ (the Jacobian); then the Fisher information matrix can be written in matrix form as $J^T M J$, where $M = \mathrm{kl}''_{ab}\big(\mu_\theta(x), \mu_{\mathrm{old}}\big)$ is the Fisher information matrix of the distribution in terms of the mean parameter $\mu$ (as opposed to the parameter $\theta$). This has a simple form for most parameterized distributions of interest.

The Fisher-vector product can now be written as a function $y \mapsto J^T M J y$.
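As a concrete, minimal instance of $y \mapsto J^T M J y$, assume a linear-Gaussian policy with fixed standard deviation sigma; this architecture is an illustrative assumption, not the one used in the paper's experiments. For a Gaussian with fixed variance, $M$ is simply $I/\sigma^2$, and the product averaged over a batch of inputs can be written directly:

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_out, N = 6, 2, 64
sigma = 0.5                                  # fixed action noise (illustrative assumption)
W = rng.normal(size=(d_out, d_in))           # hypothetical linear policy: mu_theta(x) = W x
X = rng.normal(size=(N, d_in))               # batch of inputs over which the FIM is averaged

def fisher_vector_product(y):
    # theta = vec(W); J = d mu / d theta; M = I / sigma^2 for a fixed-variance Gaussian.
    Y = y.reshape(d_out, d_in)
    fvp = np.zeros_like(W)
    for x in X:
        Jy = Y @ x                           # J y
        MJy = Jy / sigma**2                  # M J y
        fvp += np.outer(MJy, x)              # J^T M J y
    return (fvp / N).ravel()

y = rng.normal(size=d_out * d_in)
print(fisher_vector_product(y)[:5])
```

With a neural network policy, the products $Jy$ and $J^T v$ are not written out by hand as above but obtained from the forward- and reverse-mode operations described next.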
Multiplication by $J^T$ and $J$ can be performed by most automatic differentiation and neural network packages (multiplication by $J^T$ is the well-known backprop operation), and the operation for multiplication by $M$ can be derived for the distribution of interest. Note that this Fisher-vector product is straightforward to average over a set of datapoints, i.e., inputs $x$ to $\mu$.

One could alternatively use a generic method for calculating Hessian-vector products using reverse mode automatic differentiation ((Wright & Nocedal, 1999), chapter 8), computing the Hessian of $D_{KL}$ with respect to $\theta$. This method would be slightly less efficient as it does not exploit the fact that the second derivatives of $\mu(x)$ (i.e., the second term in Equation (57)) can be ignored, but may be substantially easier to implement.

We have described a procedure for computing the Fisher-vector product $y \mapsto Ay$, where the Fisher information matrix is averaged over a set of inputs to the function $\mu$. Computing the Fisher-vector product is typically about as expensive as computing the gradient of an objective that depends on $\mu(x)$ (Wright & Nocedal, 1999). Furthermore, we need to compute
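Putting the pieces of this appendix together, the following is a minimal sketch of the update: conjugate gradient on Fisher-vector products to obtain the search direction, the maximal step length $\beta = \sqrt{2\delta / (s^T A s)}$, and the exponential backtracking line search. The function names (conjugate_gradient, trpo_step) are hypothetical, and the quadratic toy objective and synthetic SPD matrix at the bottom are stand-ins for the sampled surrogate and the true Fisher information matrix; this is an illustration under those assumptions, not the paper's released code.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve A x = g using only Fisher-vector products fvp(y) = A y."""
    x = np.zeros_like(g)
    r = g.copy()                      # residual g - A x with x = 0
    p = g.copy()
    r_dot = r @ r
    for _ in range(iters):
        Ap = fvp(p)
        alpha = r_dot / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def trpo_step(theta_old, grad, fvp, surrogate, kl, delta=0.01, backtracks=10):
    s = conjugate_gradient(fvp, grad)             # search direction, approximately A^{-1} g
    beta = np.sqrt(2.0 * delta / (s @ fvp(s)))    # maximal step: (1/2) beta^2 s^T A s = delta
    old_obj = surrogate(theta_old)
    for k in range(backtracks):                   # shrink exponentially until acceptable
        theta = theta_old + (0.5 ** k) * beta * s
        if kl(theta) <= delta and surrogate(theta) > old_obj:
            return theta
    return theta_old                              # no acceptable step found

# Toy check on a quadratic surrogate with a synthetic SPD stand-in for the FIM.
rng = np.random.default_rng(4)
n = 8
A = rng.normal(size=(n, n)); A = A @ A.T + np.eye(n)
g = rng.normal(size=n)
theta0 = np.zeros(n)
theta1 = trpo_step(theta0, g,
                   fvp=lambda y: A @ y,
                   surrogate=lambda th: g @ (th - theta0) - 0.05 * (th - theta0) @ (th - theta0),
                   kl=lambda th: 0.5 * (th - theta0) @ A @ (th - theta0))
print(theta1)
```

In a full implementation, `grad`, `surrogate`, and `kl` are all estimated from the sampled state-action pairs of Section 5, and `fvp` averages the analytic Fisher-vector product of Appendix C.1 over the same batch.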


More information

8 Laplace s Method and Local Limit Theorems

8 Laplace s Method and Local Limit Theorems 8 Lplce s Method nd Locl Limit Theorems 8. Fourier Anlysis in Higher DImensions Most of the theorems of Fourier nlysis tht we hve proved hve nturl generliztions to higher dimensions, nd these cn be proved

More information

Generation of Lyapunov Functions by Neural Networks

Generation of Lyapunov Functions by Neural Networks WCE 28, July 2-4, 28, London, U.K. Genertion of Lypunov Functions by Neurl Networks Nvid Noroozi, Pknoosh Krimghee, Ftemeh Sfei, nd Hmed Jvdi Abstrct Lypunov function is generlly obtined bsed on tril nd

More information

Solution for Assignment 1 : Intro to Probability and Statistics, PAC learning

Solution for Assignment 1 : Intro to Probability and Statistics, PAC learning Solution for Assignment 1 : Intro to Probbility nd Sttistics, PAC lerning 10-701/15-781: Mchine Lerning (Fll 004) Due: Sept. 30th 004, Thursdy, Strt of clss Question 1. Bsic Probbility ( 18 pts) 1.1 (

More information

Learning to Serve and Bounce a Ball

Learning to Serve and Bounce a Ball Sndr Amend Gregor Gebhrdt Technische Universität Drmstdt Abstrct In this pper we investigte lerning the tsks of bll serving nd bll bouncing. These tsks disply chrcteristics which re common in vriety of

More information

Genetic Programming. Outline. Evolutionary Strategies. Evolutionary strategies Genetic programming Summary

Genetic Programming. Outline. Evolutionary Strategies. Evolutionary strategies Genetic programming Summary Outline Genetic Progrmming Evolutionry strtegies Genetic progrmming Summry Bsed on the mteril provided y Professor Michel Negnevitsky Evolutionry Strtegies An pproch simulting nturl evolution ws proposed

More information

7.2 The Definite Integral

7.2 The Definite Integral 7.2 The Definite Integrl the definite integrl In the previous section, it ws found tht if function f is continuous nd nonnegtive, then the re under the grph of f on [, b] is given by F (b) F (), where

More information

Chapter 5 : Continuous Random Variables

Chapter 5 : Continuous Random Variables STAT/MATH 395 A - PROBABILITY II UW Winter Qurter 216 Néhémy Lim Chpter 5 : Continuous Rndom Vribles Nottions. N {, 1, 2,...}, set of nturl numbers (i.e. ll nonnegtive integers); N {1, 2,...}, set of ll

More information

Math& 152 Section Integration by Parts

Math& 152 Section Integration by Parts Mth& 5 Section 7. - Integrtion by Prts Integrtion by prts is rule tht trnsforms the integrl of the product of two functions into other (idelly simpler) integrls. Recll from Clculus I tht given two differentible

More information

ODE: Existence and Uniqueness of a Solution

ODE: Existence and Uniqueness of a Solution Mth 22 Fll 213 Jerry Kzdn ODE: Existence nd Uniqueness of Solution The Fundmentl Theorem of Clculus tells us how to solve the ordinry differentil eqution (ODE) du = f(t) dt with initil condition u() =

More information

State space systems analysis (continued) Stability. A. Definitions A system is said to be Asymptotically Stable (AS) when it satisfies

State space systems analysis (continued) Stability. A. Definitions A system is said to be Asymptotically Stable (AS) when it satisfies Stte spce systems nlysis (continued) Stbility A. Definitions A system is sid to be Asymptoticlly Stble (AS) when it stisfies ut () = 0, t > 0 lim xt () 0. t A system is AS if nd only if the impulse response

More information

p-adic Egyptian Fractions

p-adic Egyptian Fractions p-adic Egyptin Frctions Contents 1 Introduction 1 2 Trditionl Egyptin Frctions nd Greedy Algorithm 2 3 Set-up 3 4 p-greedy Algorithm 5 5 p-egyptin Trditionl 10 6 Conclusion 1 Introduction An Egyptin frction

More information

W. We shall do so one by one, starting with I 1, and we shall do it greedily, trying

W. We shall do so one by one, starting with I 1, and we shall do it greedily, trying Vitli covers 1 Definition. A Vitli cover of set E R is set V of closed intervls with positive length so tht, for every δ > 0 nd every x E, there is some I V with λ(i ) < δ nd x I. 2 Lemm (Vitli covering)

More information

Review of basic calculus

Review of basic calculus Review of bsic clculus This brief review reclls some of the most importnt concepts, definitions, nd theorems from bsic clculus. It is not intended to tech bsic clculus from scrtch. If ny of the items below

More information

CS 188 Introduction to Artificial Intelligence Fall 2018 Note 7

CS 188 Introduction to Artificial Intelligence Fall 2018 Note 7 CS 188 Introduction to Artificil Intelligence Fll 2018 Note 7 These lecture notes re hevily bsed on notes originlly written by Nikhil Shrm. Decision Networks In the third note, we lerned bout gme trees

More information

Lecture 1. Functional series. Pointwise and uniform convergence.

Lecture 1. Functional series. Pointwise and uniform convergence. 1 Introduction. Lecture 1. Functionl series. Pointwise nd uniform convergence. In this course we study mongst other things Fourier series. The Fourier series for periodic function f(x) with period 2π is

More information

Math 270A: Numerical Linear Algebra

Math 270A: Numerical Linear Algebra Mth 70A: Numericl Liner Algebr Instructor: Michel Holst Fll Qurter 014 Homework Assignment #3 Due Give to TA t lest few dys before finl if you wnt feedbck. Exercise 3.1. (The Bsic Liner Method for Liner

More information

Chapter 14. Matrix Representations of Linear Transformations

Chapter 14. Matrix Representations of Linear Transformations Chpter 4 Mtrix Representtions of Liner Trnsformtions When considering the Het Stte Evolution, we found tht we could describe this process using multipliction by mtrix. This ws nice becuse computers cn

More information

ECO 317 Economics of Uncertainty Fall Term 2007 Notes for lectures 4. Stochastic Dominance

ECO 317 Economics of Uncertainty Fall Term 2007 Notes for lectures 4. Stochastic Dominance Generl structure ECO 37 Economics of Uncertinty Fll Term 007 Notes for lectures 4. Stochstic Dominnce Here we suppose tht the consequences re welth mounts denoted by W, which cn tke on ny vlue between

More information

Math 8 Winter 2015 Applications of Integration

Math 8 Winter 2015 Applications of Integration Mth 8 Winter 205 Applictions of Integrtion Here re few importnt pplictions of integrtion. The pplictions you my see on n exm in this course include only the Net Chnge Theorem (which is relly just the Fundmentl

More information

Travelling Profile Solutions For Nonlinear Degenerate Parabolic Equation And Contour Enhancement In Image Processing

Travelling Profile Solutions For Nonlinear Degenerate Parabolic Equation And Contour Enhancement In Image Processing Applied Mthemtics E-Notes 8(8) - c IN 67-5 Avilble free t mirror sites of http://www.mth.nthu.edu.tw/ men/ Trvelling Profile olutions For Nonliner Degenerte Prbolic Eqution And Contour Enhncement In Imge

More information

Numerical Integration

Numerical Integration Chpter 1 Numericl Integrtion Numericl differentition methods compute pproximtions to the derivtive of function from known vlues of the function. Numericl integrtion uses the sme informtion to compute numericl

More information

Student Activity 3: Single Factor ANOVA

Student Activity 3: Single Factor ANOVA MATH 40 Student Activity 3: Single Fctor ANOVA Some Bsic Concepts In designed experiment, two or more tretments, or combintions of tretments, is pplied to experimentl units The number of tretments, whether

More information

Recitation 3: Applications of the Derivative. 1 Higher-Order Derivatives and their Applications

Recitation 3: Applications of the Derivative. 1 Higher-Order Derivatives and their Applications Mth 1c TA: Pdric Brtlett Recittion 3: Applictions of the Derivtive Week 3 Cltech 013 1 Higher-Order Derivtives nd their Applictions Another thing we could wnt to do with the derivtive, motivted by wht

More information

Jack Simons, Henry Eyring Scientist and Professor Chemistry Department University of Utah

Jack Simons, Henry Eyring Scientist and Professor Chemistry Department University of Utah 1. Born-Oppenheimer pprox.- energy surfces 2. Men-field (Hrtree-Fock) theory- orbitls 3. Pros nd cons of HF- RHF, UHF 4. Beyond HF- why? 5. First, one usully does HF-how? 6. Bsis sets nd nottions 7. MPn,

More information

MAA 4212 Improper Integrals

MAA 4212 Improper Integrals Notes by Dvid Groisser, Copyright c 1995; revised 2002, 2009, 2014 MAA 4212 Improper Integrls The Riemnn integrl, while perfectly well-defined, is too restrictive for mny purposes; there re functions which

More information

Testing categorized bivariate normality with two-stage. polychoric correlation estimates

Testing categorized bivariate normality with two-stage. polychoric correlation estimates Testing ctegorized bivrite normlity with two-stge polychoric correltion estimtes Albert Mydeu-Olivres Dept. of Psychology University of Brcelon Address correspondence to: Albert Mydeu-Olivres. Fculty of

More information

Math 426: Probability Final Exam Practice

Math 426: Probability Final Exam Practice Mth 46: Probbility Finl Exm Prctice. Computtionl problems 4. Let T k (n) denote the number of prtitions of the set {,..., n} into k nonempty subsets, where k n. Argue tht T k (n) kt k (n ) + T k (n ) by

More information

Credibility Hypothesis Testing of Fuzzy Triangular Distributions

Credibility Hypothesis Testing of Fuzzy Triangular Distributions 666663 Journl of Uncertin Systems Vol.9, No., pp.6-74, 5 Online t: www.jus.org.uk Credibility Hypothesis Testing of Fuzzy Tringulr Distributions S. Smpth, B. Rmy Received April 3; Revised 4 April 4 Abstrct

More information

Riemann is the Mann! (But Lebesgue may besgue to differ.)

Riemann is the Mann! (But Lebesgue may besgue to differ.) Riemnn is the Mnn! (But Lebesgue my besgue to differ.) Leo Livshits My 2, 2008 1 For finite intervls in R We hve seen in clss tht every continuous function f : [, b] R hs the property tht for every ɛ >

More information

Numerical Integration

Numerical Integration Chpter 5 Numericl Integrtion Numericl integrtion is the study of how the numericl vlue of n integrl cn be found. Methods of function pproximtion discussed in Chpter??, i.e., function pproximtion vi the

More information

P 3 (x) = f(0) + f (0)x + f (0) 2. x 2 + f (0) . In the problem set, you are asked to show, in general, the n th order term is a n = f (n) (0)

P 3 (x) = f(0) + f (0)x + f (0) 2. x 2 + f (0) . In the problem set, you are asked to show, in general, the n th order term is a n = f (n) (0) 1 Tylor polynomils In Section 3.5, we discussed how to pproximte function f(x) round point in terms of its first derivtive f (x) evluted t, tht is using the liner pproximtion f() + f ()(x ). We clled this

More information

How to simulate Turing machines by invertible one-dimensional cellular automata

How to simulate Turing machines by invertible one-dimensional cellular automata How to simulte Turing mchines by invertible one-dimensionl cellulr utomt Jen-Christophe Dubcq Déprtement de Mthémtiques et d Informtique, École Normle Supérieure de Lyon, 46, llée d Itlie, 69364 Lyon Cedex

More information

Lecture Note 9: Orthogonal Reduction

Lecture Note 9: Orthogonal Reduction MATH : Computtionl Methods of Liner Algebr 1 The Row Echelon Form Lecture Note 9: Orthogonl Reduction Our trget is to solve the norml eution: Xinyi Zeng Deprtment of Mthemticl Sciences, UTEP A t Ax = A

More information

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams Chpter 4 Contrvrince, Covrince, nd Spcetime Digrms 4. The Components of Vector in Skewed Coordintes We hve seen in Chpter 3; figure 3.9, tht in order to show inertil motion tht is consistent with the Lorentz

More information

13: Diffusion in 2 Energy Groups

13: Diffusion in 2 Energy Groups 3: Diffusion in Energy Groups B. Rouben McMster University Course EP 4D3/6D3 Nucler Rector Anlysis (Rector Physics) 5 Sept.-Dec. 5 September Contents We study the diffusion eqution in two energy groups

More information

Data Assimilation. Alan O Neill Data Assimilation Research Centre University of Reading

Data Assimilation. Alan O Neill Data Assimilation Research Centre University of Reading Dt Assimiltion Aln O Neill Dt Assimiltion Reserch Centre University of Reding Contents Motivtion Univrite sclr dt ssimiltion Multivrite vector dt ssimiltion Optiml Interpoltion BLUE 3d-Vritionl Method

More information

5.7 Improper Integrals

5.7 Improper Integrals 458 pplictions of definite integrls 5.7 Improper Integrls In Section 5.4, we computed the work required to lift pylod of mss m from the surfce of moon of mss nd rdius R to height H bove the surfce of the

More information

Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 17

Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 17 EECS 70 Discrete Mthemtics nd Proility Theory Spring 2013 Annt Shi Lecture 17 I.I.D. Rndom Vriles Estimting the is of coin Question: We wnt to estimte the proportion p of Democrts in the US popultion,

More information

Chapters 4 & 5 Integrals & Applications

Chapters 4 & 5 Integrals & Applications Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions

More information

Week 10: Line Integrals

Week 10: Line Integrals Week 10: Line Integrls Introduction In this finl week we return to prmetrised curves nd consider integrtion long such curves. We lredy sw this in Week 2 when we integrted long curve to find its length.

More information

Ordinary differential equations

Ordinary differential equations Ordinry differentil equtions Introduction to Synthetic Biology E Nvrro A Montgud P Fernndez de Cordob JF Urchueguí Overview Introduction-Modelling Bsic concepts to understnd n ODE. Description nd properties

More information

Frobenius numbers of generalized Fibonacci semigroups

Frobenius numbers of generalized Fibonacci semigroups Frobenius numbers of generlized Fiboncci semigroups Gretchen L. Mtthews 1 Deprtment of Mthemticl Sciences, Clemson University, Clemson, SC 29634-0975, USA gmtthe@clemson.edu Received:, Accepted:, Published:

More information

x = b a N. (13-1) The set of points used to subdivide the range [a, b] (see Fig. 13.1) is

x = b a N. (13-1) The set of points used to subdivide the range [a, b] (see Fig. 13.1) is Jnury 28, 2002 13. The Integrl The concept of integrtion, nd the motivtion for developing this concept, were described in the previous chpter. Now we must define the integrl, crefully nd completely. According

More information

Quadratic Forms. Quadratic Forms

Quadratic Forms. Quadratic Forms Qudrtic Forms Recll the Simon & Blume excerpt from n erlier lecture which sid tht the min tsk of clculus is to pproximte nonliner functions with liner functions. It s ctully more ccurte to sy tht we pproximte

More information

CS 188: Artificial Intelligence Spring 2007

CS 188: Artificial Intelligence Spring 2007 CS 188: Artificil Intelligence Spring 2007 Lecture 3: Queue-Bsed Serch 1/23/2007 Srini Nrynn UC Berkeley Mny slides over the course dpted from Dn Klein, Sturt Russell or Andrew Moore Announcements Assignment

More information

Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1

Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1 Exm, Mthemtics 471, Section ETY6 6:5 pm 7:4 pm, Mrch 1, 16, IH-115 Instructor: Attil Máté 1 17 copies 1. ) Stte the usul sufficient condition for the fixed-point itertion to converge when solving the eqution

More information

Chapter 3 Polynomials

Chapter 3 Polynomials Dr M DRAIEF As described in the introduction of Chpter 1, pplictions of solving liner equtions rise in number of different settings In prticulr, we will in this chpter focus on the problem of modelling

More information

Best Approximation. Chapter The General Case

Best Approximation. Chapter The General Case Chpter 4 Best Approximtion 4.1 The Generl Cse In the previous chpter, we hve seen how n interpolting polynomil cn be used s n pproximtion to given function. We now wnt to find the best pproximtion to given

More information

Lecture 14: Quadrature

Lecture 14: Quadrature Lecture 14: Qudrture This lecture is concerned with the evlution of integrls fx)dx 1) over finite intervl [, b] The integrnd fx) is ssumed to be rel-vlues nd smooth The pproximtion of n integrl by numericl

More information

APPROXIMATE INTEGRATION

APPROXIMATE INTEGRATION APPROXIMATE INTEGRATION. Introduction We hve seen tht there re functions whose nti-derivtives cnnot be expressed in closed form. For these resons ny definite integrl involving these integrnds cnnot be

More information

Generalized Fano and non-fano networks

Generalized Fano and non-fano networks Generlized Fno nd non-fno networks Nildri Ds nd Brijesh Kumr Ri Deprtment of Electronics nd Electricl Engineering Indin Institute of Technology Guwhti, Guwhti, Assm, Indi Emil: {d.nildri, bkri}@iitg.ernet.in

More information

Chapter 0. What is the Lebesgue integral about?

Chapter 0. What is the Lebesgue integral about? Chpter 0. Wht is the Lebesgue integrl bout? The pln is to hve tutoril sheet ech week, most often on Fridy, (to be done during the clss) where you will try to get used to the ides introduced in the previous

More information

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior Reversls of Signl-Posterior Monotonicity for Any Bounded Prior Christopher P. Chmbers Pul J. Hely Abstrct Pul Milgrom (The Bell Journl of Economics, 12(2): 380 391) showed tht if the strict monotone likelihood

More information

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite Unit #8 : The Integrl Gols: Determine how to clculte the re described by function. Define the definite integrl. Eplore the reltionship between the definite integrl nd re. Eplore wys to estimte the definite

More information

Conservation Law. Chapter Goal. 5.2 Theory

Conservation Law. Chapter Goal. 5.2 Theory Chpter 5 Conservtion Lw 5.1 Gol Our long term gol is to understnd how mny mthemticl models re derived. We study how certin quntity chnges with time in given region (sptil domin). We first derive the very

More information

Advanced Calculus: MATH 410 Uniform Convergence of Functions Professor David Levermore 11 December 2015

Advanced Calculus: MATH 410 Uniform Convergence of Functions Professor David Levermore 11 December 2015 Advnced Clculus: MATH 410 Uniform Convergence of Functions Professor Dvid Levermore 11 December 2015 12. Sequences of Functions We now explore two notions of wht it mens for sequence of functions {f n

More information

Math 360: A primitive integral and elementary functions

Math 360: A primitive integral and elementary functions Mth 360: A primitive integrl nd elementry functions D. DeTurck University of Pennsylvni October 16, 2017 D. DeTurck Mth 360 001 2017C: Integrl/functions 1 / 32 Setup for the integrl prtitions Definition:

More information

4.4 Areas, Integrals and Antiderivatives

4.4 Areas, Integrals and Antiderivatives . res, integrls nd ntiderivtives 333. Ares, Integrls nd Antiderivtives This section explores properties of functions defined s res nd exmines some connections mong res, integrls nd ntiderivtives. In order

More information

CS667 Lecture 6: Monte Carlo Integration 02/10/05

CS667 Lecture 6: Monte Carlo Integration 02/10/05 CS667 Lecture 6: Monte Crlo Integrtion 02/10/05 Venkt Krishnrj Lecturer: Steve Mrschner 1 Ide The min ide of Monte Crlo Integrtion is tht we cn estimte the vlue of n integrl by looking t lrge number of

More information

New data structures to reduce data size and search time

New data structures to reduce data size and search time New dt structures to reduce dt size nd serch time Tsuneo Kuwbr Deprtment of Informtion Sciences, Fculty of Science, Kngw University, Hirtsuk-shi, Jpn FIT2018 1D-1, No2, pp1-4 Copyright (c)2018 by The Institute

More information

Physics 116C Solution of inhomogeneous ordinary differential equations using Green s functions

Physics 116C Solution of inhomogeneous ordinary differential equations using Green s functions Physics 6C Solution of inhomogeneous ordinry differentil equtions using Green s functions Peter Young November 5, 29 Homogeneous Equtions We hve studied, especilly in long HW problem, second order liner

More information

Discrete Mathematics and Probability Theory Summer 2014 James Cook Note 17

Discrete Mathematics and Probability Theory Summer 2014 James Cook Note 17 CS 70 Discrete Mthemtics nd Proility Theory Summer 2014 Jmes Cook Note 17 I.I.D. Rndom Vriles Estimting the is of coin Question: We wnt to estimte the proportion p of Democrts in the US popultion, y tking

More information

different methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s).

different methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s). Mth 1A with Professor Stnkov Worksheet, Discussion #41; Wednesdy, 12/6/217 GSI nme: Roy Zho Problems 1. Write the integrl 3 dx s limit of Riemnn sums. Write it using 2 intervls using the 1 x different

More information

1 Probability Density Functions

1 Probability Density Functions Lis Yn CS 9 Continuous Distributions Lecture Notes #9 July 6, 28 Bsed on chpter by Chris Piech So fr, ll rndom vribles we hve seen hve been discrete. In ll the cses we hve seen in CS 9, this ment tht our

More information

Scalable Learning in Stochastic Games

Scalable Learning in Stochastic Games Sclble Lerning in Stochstic Gmes Michel Bowling nd Mnuel Veloso Computer Science Deprtment Crnegie Mellon University Pittsburgh PA, 15213-3891 Abstrct Stochstic gmes re generl model of interction between

More information

Sufficient condition on noise correlations for scalable quantum computing

Sufficient condition on noise correlations for scalable quantum computing Sufficient condition on noise correltions for sclble quntum computing John Presill, 2 Februry 202 Is quntum computing sclble? The ccurcy threshold theorem for quntum computtion estblishes tht sclbility

More information

221B Lecture Notes WKB Method

221B Lecture Notes WKB Method Clssicl Limit B Lecture Notes WKB Method Hmilton Jcobi Eqution We strt from the Schrödinger eqution for single prticle in potentil i h t ψ x, t = [ ] h m + V x ψ x, t. We cn rewrite this eqution by using

More information

Introduction to Numerical Analysis

Introduction to Numerical Analysis Introduction to Numericl Anlysis Doron Levy Deprtment of Mthemtics nd Center for Scientific Computtion nd Mthemticl Modeling (CSCAMM) University of Mrylnd June 14, 2012 D. Levy CONTENTS Contents 1 Introduction

More information