Trust Region Policy Optimization

John Schulman JOSCHU@EECS.BERKELEY.EDU
Sergey Levine SLEVINE@EECS.BERKELEY.EDU
Philipp Moritz PCMORITZ@EECS.BERKELEY.EDU
Michael Jordan JORDAN@CS.BERKELEY.EDU
Pieter Abbeel PABBEEL@CS.BERKELEY.EDU
University of California, Berkeley, Department of Electrical Engineering and Computer Sciences

Proceedings of the 31st International Conference on Machine Learning, Lille, France. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.

1 Introduction

Most algorithms for policy optimization can be classified into three broad categories: (1) policy iteration methods, which alternate between estimating the value function under the current policy and improving the policy (Bertsekas, 2005); (2) policy gradient methods, which use an estimator of the gradient of the expected return (total reward) obtained from sample trajectories (Peters & Schaal, 2008a) (and which, as we later discuss, have a close connection to policy iteration); and (3) derivative-free optimization methods, such as the cross-entropy method (CEM) and covariance matrix adaptation (CMA), which treat the return as a black box function to be optimized in terms of the policy parameters (Szita & Lörincz, 2006).

General derivative-free stochastic optimization methods such as CEM and CMA are preferred on many problems, because they achieve good results while being simple to understand and implement. For example, while Tetris is a classic benchmark problem for approximate dynamic programming (ADP) methods, stochastic optimization methods are difficult to beat on this task (Gabillon et al., 2013). For continuous control problems, methods like CMA have been successful at learning control policies for challenging tasks like locomotion when provided with hand-engineered policy classes with low-dimensional parameterizations (Wampler & Popović, 2009). The inability of ADP and gradient-based methods to consistently beat gradient-free random search is unsatisfying, since gradient-based optimization algorithms enjoy much better sample complexity guarantees than gradient-free methods (Nemirovski, 2005). Continuous gradient-based optimization has been very successful at learning function approximators for supervised learning tasks with huge numbers of parameters, and extending their success to reinforcement learning would allow for efficient training of complex and powerful policies.

In this article, we first prove that minimizing a certain surrogate objective function guarantees policy improvement with non-trivial step sizes. Then we make a series of approximations to the theoretically-justified algorithm, yielding a practical algorithm, which we call trust region policy optimization (TRPO).
We describe two variants of this algorithm: first, the single-path method, which can be applied in the model-free setting; second, the vine method, which requires the system to be restored to particular states, which is typically only possible in simulation. These algorithms are scalable and can optimize nonlinear policies with tens of thousands of parameters, which have previously posed a major challenge for model-free policy search (Deisenroth et al., 2013). In our experiments, we show that the same TRPO methods can learn complex policies for swimming, hopping, and walking, as well as playing Atari games directly from raw images.

2 Preliminaries

Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple $(S, A, P, r, \rho_0, \gamma)$, where $S$ is a finite set of states, $A$ is a finite set of actions, $P : S \times A \times S \to \mathbb{R}$ is the transition probability distribution, $r : S \to \mathbb{R}$ is the reward function, $\rho_0 : S \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor.

Let $\pi$ denote a stochastic policy $\pi : S \times A \to [0, 1]$, and let $\eta(\pi)$ denote its expected discounted reward:

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \dots}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big], \quad \text{where } s_0 \sim \rho_0(s_0),\; a_t \sim \pi(a_t \mid s_t),\; s_{t+1} \sim P(s_{t+1} \mid s_t, a_t).$$

We will use the following standard definitions of the state-action value function $Q_\pi$, the value function $V_\pi$, and the advantage function $A_\pi$:

$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big], \quad V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \dots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big], \quad A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s),$$

where $a_t \sim \pi(a_t \mid s_t)$ and $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$ for $t \geq 0$.

The following useful identity expresses the expected return of another policy $\tilde\pi$ in terms of the advantage over $\pi$, accumulated over timesteps (see Kakade & Langford (2002) or Appendix A for proof):

$$\eta(\tilde\pi) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \dots \sim \tilde\pi}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big] \qquad (1)$$

where the notation $\mathbb{E}_{s_0, a_0, \dots \sim \tilde\pi}[\dots]$ indicates that actions are sampled $a_t \sim \tilde\pi(\cdot \mid s_t)$. Let $\rho_\pi$ be the (unnormalized) discounted visitation frequencies

$$\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \dots,$$

where $s_0 \sim \rho_0$ and the actions are chosen according to $\pi$. We can rewrite Equation (1) with a sum over states instead of timesteps:

$$\begin{aligned}
\eta(\tilde\pi) &= \eta(\pi) + \sum_{t=0}^{\infty}\sum_s P(s_t = s \mid \tilde\pi) \sum_a \tilde\pi(a \mid s)\, \gamma^t A_\pi(s, a) \\
&= \eta(\pi) + \sum_s \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde\pi) \sum_a \tilde\pi(a \mid s)\, A_\pi(s, a) \\
&= \eta(\pi) + \sum_s \rho_{\tilde\pi}(s) \sum_a \tilde\pi(a \mid s)\, A_\pi(s, a). \qquad (2)
\end{aligned}$$

This equation implies that any policy update $\pi \to \tilde\pi$ that has a nonnegative expected advantage at every state $s$, i.e., $\sum_a \tilde\pi(a \mid s) A_\pi(s, a) \geq 0$, is guaranteed to increase the policy performance $\eta$, or leave it constant in the case that the expected advantage is zero everywhere. This implies the classic result that the update performed by exact policy iteration, which uses the deterministic policy $\tilde\pi(s) = \arg\max_a A_\pi(s, a)$, improves the policy if there is at least one state-action pair with a positive advantage value and nonzero state visitation probability; otherwise the algorithm has converged to the optimal policy. However, in the approximate setting, it will typically be unavoidable, due to estimation and approximation error, that there will be some states $s$ for which the expected advantage is negative, that is, $\sum_a \tilde\pi(a \mid s) A_\pi(s, a) < 0$.

The complex dependency of $\rho_{\tilde\pi}(s)$ on $\tilde\pi$ makes Equation (2) difficult to optimize directly. Instead, we introduce the following local approximation to $\eta$:

$$L_\pi(\tilde\pi) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde\pi(a \mid s)\, A_\pi(s, a). \qquad (3)$$

Note that $L_\pi$ uses the visitation frequency $\rho_\pi$ rather than $\rho_{\tilde\pi}$, ignoring changes in state visitation density due to changes in the policy. However, if we have a parameterized policy $\pi_\theta$, where $\pi_\theta(a \mid s)$ is a differentiable function of the parameter vector $\theta$, then $L_\pi$ matches $\eta$ to first order (see Kakade & Langford (2002)). That is, for any parameter value $\theta_0$,

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta=\theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta=\theta_0}. \qquad (4)$$

Equation (4) implies that a sufficiently small step $\pi_{\theta_0} \to \tilde\pi$ that improves $L_{\pi_{\theta_{\mathrm{old}}}}$ will also improve $\eta$, but does not give us any guidance on how big of a step to take.

To address this issue, Kakade & Langford (2002) proposed a policy updating scheme called conservative policy iteration, for which they could provide explicit lower bounds on the improvement of $\eta$. To define the conservative policy iteration update, let $\pi_{\mathrm{old}}$ denote the current policy, and let $\pi' = \arg\max_{\pi'} L_{\pi_{\mathrm{old}}}(\pi')$. The new policy $\pi_{\mathrm{new}}$ was defined to be the following mixture:

$$\pi_{\mathrm{new}}(a \mid s) = (1-\alpha)\,\pi_{\mathrm{old}}(a \mid s) + \alpha\,\pi'(a \mid s). \qquad (5)$$

Kakade and Langford derived the following lower bound:

$$\eta(\pi_{\mathrm{new}}) \geq L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2, \quad \text{where } \epsilon = \max_s \big|\mathbb{E}_{a \sim \pi'(a \mid s)}[A_\pi(s, a)]\big|. \qquad (6)$$

(We have modified it to make it slightly weaker but simpler.) Note, however, that so far this bound only applies to mixture policies generated by Equation (5).
This policy class is unwieldy and restrictive in practice, and it is desirable for a practical policy update scheme to be applicable to all general stochastic policy classes.

3 Monotonic Improvement Guarantee for General Stochastic Policies

Equation (6), which applies to conservative policy iteration, implies that a policy update that improves the right-hand side is guaranteed to improve the true performance $\eta$.

Our principal theoretical result is that the policy improvement bound in Equation (6) can be extended to general stochastic policies, rather than just mixture policies, by replacing $\alpha$ with a distance measure between $\pi$ and $\tilde\pi$, and changing the constant $\epsilon$ appropriately. Since mixture policies are rarely used in practice, this result is crucial for extending the improvement guarantee to practical problems. The particular distance measure we use is the total variation divergence, which is defined by $D_{TV}(p \,\|\, q) = \frac{1}{2}\sum_i |p_i - q_i|$ for discrete probability distributions $p, q$.¹ Define $D_{TV}^{\max}(\pi, \tilde\pi)$ as

$$D_{TV}^{\max}(\pi, \tilde\pi) = \max_s D_{TV}\big(\pi(\cdot \mid s) \,\|\, \tilde\pi(\cdot \mid s)\big). \qquad (7)$$

¹ Our result is straightforward to extend to continuous states and actions by replacing the sums with integrals.

Theorem 1. Let $\alpha = D_{TV}^{\max}(\pi_{\mathrm{old}}, \pi_{\mathrm{new}})$. Then the following bound holds:

$$\eta(\pi_{\mathrm{new}}) \geq L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2, \quad \text{where } \epsilon = \max_{s,a} |A_\pi(s, a)|. \qquad (8)$$

We provide two proofs in the appendix. The first proof extends Kakade and Langford's result using the fact that the random variables from two distributions with total variation divergence less than $\alpha$ can be coupled, so that they are equal with probability $1 - \alpha$. The second proof uses perturbation theory.

Next, we note the following relationship between the total variation divergence and the KL divergence (Pollard (2000), Ch. 3): $D_{TV}(p \,\|\, q)^2 \leq D_{KL}(p \,\|\, q)$. Let $D_{KL}^{\max}(\pi, \tilde\pi) = \max_s D_{KL}\big(\pi(\cdot \mid s) \,\|\, \tilde\pi(\cdot \mid s)\big)$. The following bound then follows directly from Theorem 1:

$$\eta(\tilde\pi) \geq L_\pi(\tilde\pi) - C\,D_{KL}^{\max}(\pi, \tilde\pi), \quad \text{where } C = \frac{4\epsilon\gamma}{(1-\gamma)^2}. \qquad (9)$$

Algorithm 1 describes an approximate policy iteration scheme based on the policy improvement bound in Equation (9). Note that for now, we assume exact evaluation of the advantage values $A_\pi$.

Algorithm 1 Policy iteration algorithm guaranteeing non-decreasing expected return $\eta$
  Initialize $\pi_0$.
  for $i = 0, 1, 2, \dots$ until convergence do
    Compute all advantage values $A_{\pi_i}(s, a)$.
    Solve the constrained optimization problem
      $\pi_{i+1} = \arg\max_\pi \big[L_{\pi_i}(\pi) - C\,D_{KL}^{\max}(\pi_i, \pi)\big]$
      where $C = 4\epsilon\gamma/(1-\gamma)^2$ and $L_{\pi_i}(\pi) = \eta(\pi_i) + \sum_s \rho_{\pi_i}(s)\sum_a \pi(a \mid s) A_{\pi_i}(s, a)$
  end for

It follows from Equation (9) that Algorithm 1 is guaranteed to generate a monotonically improving sequence of policies $\eta(\pi_0) \leq \eta(\pi_1) \leq \eta(\pi_2) \leq \dots$. To see this, let $M_i(\pi) = L_{\pi_i}(\pi) - C\,D_{KL}^{\max}(\pi_i, \pi)$. Then

$$\eta(\pi_{i+1}) \geq M_i(\pi_{i+1}) \text{ by Equation (9)}, \qquad \eta(\pi_i) = M_i(\pi_i), \qquad \text{therefore } \eta(\pi_{i+1}) - \eta(\pi_i) \geq M_i(\pi_{i+1}) - M_i(\pi_i). \qquad (10)$$

Thus, by maximizing $M_i$ at each iteration, we guarantee that the true objective $\eta$ is non-decreasing. This algorithm is a type of minorization-maximization (MM) algorithm (Hunter & Lange, 2004), which is a class of methods that also includes expectation maximization. In the terminology of MM algorithms, $M_i$ is the surrogate function that minorizes $\eta$ with equality at $\pi_i$. This algorithm is also reminiscent of proximal gradient methods and mirror descent.

Trust region policy optimization, which we propose in the following section, is an approximation to Algorithm 1, which uses a constraint on the KL divergence rather than a penalty to robustly allow large updates.

4 Optimization of Parameterized Policies

In the previous section, we considered the policy optimization problem independently of the parameterization of $\pi$ and under the assumption that the policy can be evaluated at all states. We now describe how to derive a practical algorithm from these theoretical foundations, under finite sample counts and arbitrary parameterizations.

Since we consider parameterized policies $\pi_\theta(a \mid s)$ with parameter vector $\theta$, we will overload our previous notation to use functions of $\theta$ rather than $\pi$, e.g. $\eta(\theta) := \eta(\pi_\theta)$, $L_\theta(\tilde\theta) := L_{\pi_\theta}(\pi_{\tilde\theta})$, and $D_{KL}(\theta \,\|\, \tilde\theta) := D_{KL}(\pi_\theta \,\|\, \pi_{\tilde\theta})$.
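Before turning to the parameterized setting, the identity in Equation (2) and the lower bound in Equation (9) can be checked numerically. The following sketch is an illustration added alongside the text, not part of the original algorithm: it builds a small random tabular MDP with NumPy, computes $\eta$, $A_\pi$, $\rho_\pi$, and $L_\pi$ exactly by solving linear systems, and verifies both relations. All sizes, seeds, and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Random tabular MDP: transitions P[s, a, s'], state reward r[s], start distribution rho0.
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random(nS)
rho0 = np.full(nS, 1.0 / nS)

def random_policy():
    pi = rng.random((nS, nA))
    return pi / pi.sum(axis=1, keepdims=True)

def exact_quantities(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)                       # state-to-state transitions under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r)           # V_pi
    Q = r[:, None] + gamma * P @ V                              # Q_pi[s, a]
    A = Q - V[:, None]                                          # advantage A_pi
    eta = rho0 @ V                                              # expected discounted return
    rho = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho0)    # discounted visitation frequencies
    return eta, A, rho

pi, pi_new = random_policy(), random_policy()
eta_old, A_old, rho_old = exact_quantities(pi)
eta_new, _, rho_new = exact_quantities(pi_new)

# Equation (2): eta(pi_new) = eta(pi) + sum_s rho_{pi_new}(s) sum_a pi_new(a|s) A_pi(s, a)
rhs = eta_old + rho_new @ np.sum(pi_new * A_old, axis=1)
print(eta_new, rhs)                       # agree to numerical precision

# Local approximation (3) and the lower bound of Equation (9)
L = eta_old + rho_old @ np.sum(pi_new * A_old, axis=1)
eps = np.abs(A_old).max()
C = 4 * eps * gamma / (1 - gamma) ** 2
kl_max = np.max(np.sum(pi * np.log(pi / pi_new), axis=1))      # max_s KL(pi(.|s) || pi_new(.|s))
print(eta_new >= L - C * kl_max)          # Theorem 1 / Equation (9); prints True
```

Because every quantity is computed exactly here, the two printed values for Equation (2) agree to machine precision, and the bound check prints True for any pair of policies.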
We will use $\theta_{\mathrm{old}}$ to denote the previous policy parameters that we want to improve upon. The preceding section showed that $\eta(\theta) \geq L_{\theta_{\mathrm{old}}}(\theta) - C\,D_{KL}^{\max}(\theta_{\mathrm{old}}, \theta)$, with equality at $\theta = \theta_{\mathrm{old}}$. Thus, by performing the following maximization, we are guaranteed to improve the true objective $\eta$:

$$\underset{\theta}{\text{maximize}}\ \big[L_{\theta_{\mathrm{old}}}(\theta) - C\,D_{KL}^{\max}(\theta_{\mathrm{old}}, \theta)\big].$$

In practice, if we used the penalty coefficient $C$ recommended by the theory above, the step sizes would be very small. One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:

$$\underset{\theta}{\text{maximize}}\ L_{\theta_{\mathrm{old}}}(\theta) \quad \text{subject to } D_{KL}^{\max}(\theta_{\mathrm{old}}, \theta) \leq \delta. \qquad (11)$$

This problem imposes a constraint that the KL divergence is bounded at every point in the state space. While it is motivated by the theory, this problem is impractical to solve due to the large number of constraints. Instead, we can use a heuristic approximation which considers the average KL divergence:

$$\bar D_{KL}^{\rho}(\theta_1, \theta_2) := \mathbb{E}_{s \sim \rho}\big[D_{KL}\big(\pi_{\theta_1}(\cdot \mid s) \,\|\, \pi_{\theta_2}(\cdot \mid s)\big)\big].$$

We therefore propose solving the following optimization problem to generate a policy update:

$$\underset{\theta}{\text{maximize}}\ L_{\theta_{\mathrm{old}}}(\theta) \quad \text{subject to } \bar D_{KL}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \leq \delta. \qquad (12)$$

Similar policy updates have been proposed in prior work (Bagnell & Schneider, 2003; Peters & Schaal, 2008b; Peters et al., 2010), and we compare our approach to prior methods in Section 7 and in the experiments in Section 8. Our experiments also show that this type of constrained update has similar empirical performance to the maximum KL divergence constraint in Equation (11).

5 Sample-Based Estimation of the Objective and Constraint

The previous section proposed a constrained optimization problem on the policy parameters (Equation (12)), which optimizes an estimate of the expected total reward $\eta$ subject to a constraint on the change in the policy at each update. This section describes how the objective and constraint functions can be approximated using Monte Carlo simulation.

We seek to solve the following optimization problem, obtained by expanding $L_{\theta_{\mathrm{old}}}$ in Equation (12):

$$\underset{\theta}{\text{maximize}}\ \sum_s \rho_{\theta_{\mathrm{old}}}(s) \sum_a \pi_\theta(a \mid s)\, A_{\theta_{\mathrm{old}}}(s, a) \quad \text{subject to } \bar D_{KL}^{\rho_{\theta_{\mathrm{old}}}}(\theta_{\mathrm{old}}, \theta) \leq \delta. \qquad (13)$$

We first replace $\sum_s \rho_{\theta_{\mathrm{old}}}(s)[\dots]$ in the objective by the expectation $\frac{1}{1-\gamma}\mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}[\dots]$. Next, we replace the advantage values $A_{\theta_{\mathrm{old}}}$ by the Q-values $Q_{\theta_{\mathrm{old}}}$ in Equation (13), which only changes the objective by a constant. Last, we replace the sum over the actions by an importance sampling estimator. Using $q$ to denote the sampling distribution, the contribution of a single $s_n$ to the loss function is

$$\sum_a \pi_\theta(a \mid s_n)\, A_{\theta_{\mathrm{old}}}(s_n, a) = \mathbb{E}_{a \sim q}\Big[\frac{\pi_\theta(a \mid s_n)}{q(a \mid s_n)}\, A_{\theta_{\mathrm{old}}}(s_n, a)\Big].$$

Our optimization problem in Equation (13) is exactly equivalent to the following one, written in terms of expectations:

$$\underset{\theta}{\text{maximize}}\ \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\, a \sim q}\Big[\frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, Q_{\theta_{\mathrm{old}}}(s, a)\Big] \quad \text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}\big[D_{KL}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)\big] \leq \delta. \qquad (14)$$

Figure 1. Left: illustration of single path procedure. Here, we generate a set of trajectories via simulation of the policy and incorporate all state-action pairs $(s_n, a_n)$ into the objective. Right: illustration of vine procedure. We generate a set of "trunk" trajectories, and then generate "branch" rollouts from a subset of the reached states. For each of these states $s_n$, we perform multiple actions ($a_1$ and $a_2$ here) and perform a rollout after each action, using common random numbers (CRN) to reduce the variance.

All that remains is to replace the expectations by sample averages and replace the Q value by an empirical estimate. The following sections describe two different schemes for performing this estimation. The first sampling scheme, which we call single path, is the one that is typically used for policy gradient estimation (Bartlett & Baxter, 2011), and is based on sampling individual trajectories. The second scheme, which we call vine, involves constructing a rollout set and then performing multiple actions from each state in the rollout set. This method has mostly been explored in the context of policy iteration methods (Lagoudakis & Parr, 2003; Gabillon et al., 2013).

5.1 Single Path

In this estimation procedure, we collect a sequence of states by sampling $s_0 \sim \rho_0$ and then simulating the policy $\pi_{\theta_{\mathrm{old}}}$ for some number of timesteps to generate a trajectory $s_0, a_0, s_1, a_1, \dots, s_{T-1}, a_{T-1}, s_T$. Hence, $q(a \mid s) = \pi_{\theta_{\mathrm{old}}}(a \mid s)$. $Q_{\theta_{\mathrm{old}}}(s, a)$ is computed at each state-action pair $(s_t, a_t)$ by taking the discounted sum of future rewards along the trajectory.
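As a concrete illustration of the single-path scheme and the sample-average form of Equation (14), the following sketch computes the discounted suffix-sum Q estimates, the importance-weighted surrogate, and the mean KL for a discrete action space. It is a minimal sketch only: the synthetic arrays stand in for a logged rollout, and in practice the per-state action distributions would come from the old and candidate policies evaluated at the visited states.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, T, nA = 0.99, 200, 4

# Synthetic single-path data (placeholders for a real rollout).
rewards = rng.normal(size=T)
pi_old  = rng.dirichlet(np.ones(nA), size=T)   # pi_theta_old(. | s_t) at each visited state
pi_new  = rng.dirichlet(np.ones(nA), size=T)   # pi_theta(. | s_t), the candidate policy
actions = np.array([rng.choice(nA, p=p) for p in pi_old])   # a_t ~ pi_old, so q = pi_old

# Q estimates: discounted sum of future rewards along the trajectory (Section 5.1).
Q_hat = np.zeros(T)
running = 0.0
for t in reversed(range(T)):
    running = rewards[t] + gamma * running
    Q_hat[t] = running

# Sample-average form of Equation (14): importance-weighted surrogate and mean KL.
ratio = pi_new[np.arange(T), actions] / pi_old[np.arange(T), actions]
surrogate = np.mean(ratio * Q_hat)
mean_kl = np.mean(np.sum(pi_old * np.log(pi_old / pi_new), axis=1))
print(surrogate, mean_kl)   # a step is acceptable only if mean_kl <= delta
```

In an actual implementation the candidate policy changes during optimization, so `surrogate` and `mean_kl` would be re-evaluated as functions of the policy parameters rather than fixed arrays.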
5.2 Vine

In this estimation procedure, we first sample $s_0 \sim \rho_0$ and simulate the policy $\pi_{\theta_i}$ to generate a number of trajectories. We then choose a subset of $N$ states along these trajectories, denoted $s_1, s_2, \dots, s_N$, which we call the "rollout set". For each state $s_n$ in the rollout set, we sample $K$ actions according to $a_{n,k} \sim q(\cdot \mid s_n)$. Any choice of $q(\cdot \mid s_n)$ with a support that includes the support of $\pi_{\theta_i}(\cdot \mid s_n)$ will produce a consistent estimator. In practice, we found that $q(\cdot \mid s_n) = \pi_{\theta_i}(\cdot \mid s_n)$ works well on continuous problems, such as robotic locomotion, while the uniform distribution works well on discrete tasks, such as the Atari games, where it can sometimes achieve better exploration.

For each action $a_{n,k}$ sampled at each state $s_n$, we estimate $\hat Q_{\theta_i}(s_n, a_{n,k})$ by performing a rollout (i.e., a short trajectory) starting with state $s_n$ and action $a_{n,k}$. We can greatly reduce the variance of the Q-value differences between rollouts by using the same random number sequence for the noise in each of the $K$ rollouts, i.e., common random numbers. See (Bertsekas, 2005) for additional discussion on Monte Carlo estimation of Q-values and (Ng & Jordan, 2000) for a discussion of common random numbers in reinforcement learning.

In small, finite action spaces, we can generate a rollout for every possible action from a given state. The contribution to $L_{\theta_{\mathrm{old}}}$ from a single state $s_n$ is as follows:

$$L_n(\theta) = \sum_{k=1}^{K} \pi_\theta(a_k \mid s_n)\, \hat Q(s_n, a_k), \qquad (15)$$

where the action space is $A = \{a_1, a_2, \dots, a_K\}$. In large or continuous state spaces, we can construct an estimator of the surrogate objective using importance sampling. The self-normalized estimator (Owen (2013), Chapter 9) of $L_{\theta_{\mathrm{old}}}$ obtained at a single state $s_n$ is

$$L_n(\theta) = \frac{\sum_{k=1}^{K} \frac{\pi_\theta(a_{n,k} \mid s_n)}{\pi_{\theta_{\mathrm{old}}}(a_{n,k} \mid s_n)}\, \hat Q(s_n, a_{n,k})}{\sum_{k=1}^{K} \frac{\pi_\theta(a_{n,k} \mid s_n)}{\pi_{\theta_{\mathrm{old}}}(a_{n,k} \mid s_n)}}, \qquad (16)$$

assuming that we performed $K$ actions $a_{n,1}, a_{n,2}, \dots, a_{n,K}$ from state $s_n$. This self-normalized estimator removes the need to use a baseline for the Q-values (note that the gradient is unchanged by adding a constant to the Q-values). Averaging over $s_n \sim \rho(\pi)$, we obtain an estimator for $L_{\theta_{\mathrm{old}}}$, as well as its gradient.

The vine and single path methods are illustrated in Figure 1. We use the term vine, since the trajectories used for sampling can be likened to the stems of vines, which branch at various points (the rollout set) into several short offshoots (the rollout trajectories).

The benefit of the vine method over the single path method is that our local estimate of the objective has much lower variance given the same number of Q-value samples in the surrogate objective. That is, the vine method gives much better estimates of the advantage values. The downside of the vine method is that we must perform far more calls to the simulator for each of these advantage estimates. Furthermore, the vine method requires us to generate multiple trajectories from each state in the rollout set, which limits this algorithm to settings where the system can be reset to an arbitrary state. In contrast, the single path algorithm requires no state resets and can be directly implemented on a physical system (Peters & Schaal, 2008b).

6 Practical Algorithm

Here we present two practical policy optimization algorithms based on the ideas above, which use either the single path or vine sampling scheme from the preceding section. The algorithms repeatedly perform the following steps:

1. Use the single path or vine procedures to collect a set of state-action pairs along with Monte Carlo estimates of their Q-values.
2. By averaging over samples, construct the estimated objective and constraint in Equation (14).
3. Approximately solve this constrained optimization problem to update the policy's parameter vector $\theta$. We use the conjugate gradient algorithm followed by a line search, which is altogether only slightly more expensive than computing the gradient itself. See Appendix C for details.

With regard to (3), we construct the Fisher information matrix (FIM) by analytically computing the Hessian of the KL divergence, rather than using the covariance matrix of the gradients. That is, we estimate $A_{ij}$ as

$$\frac{1}{N}\sum_{n=1}^{N} \frac{\partial^2}{\partial\theta_i\, \partial\theta_j} D_{KL}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_n) \,\|\, \pi_\theta(\cdot \mid s_n)\big), \quad \text{rather than} \quad \frac{1}{N}\sum_{n=1}^{N} \frac{\partial}{\partial\theta_i}\log \pi_\theta(a_n \mid s_n)\, \frac{\partial}{\partial\theta_j}\log \pi_\theta(a_n \mid s_n).$$

The analytic estimator integrates over the action at each state $s_n$, and does not depend on the action $a_n$ that was sampled.
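The relationship between the two estimators can be illustrated at a single state for a categorical (softmax) policy, where both have closed forms. This is a minimal sketch, not the paper's implementation: in the paper the policy is a neural network and the analytic form is averaged over sampled states, but the single-state case shows that the KL-Hessian estimator and the covariance-of-gradients estimator target the same matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
nA = 5
logits = rng.normal(size=nA)
p = np.exp(logits - logits.max()); p /= p.sum()

# Analytic Fisher information w.r.t. the logits: the Hessian of
# KL(pi_old || pi_theta) at theta = theta_old, which equals diag(p) - p p^T.
F_analytic = np.diag(p) - np.outer(p, p)

# Empirical estimate: covariance of the score grad log pi(a) = e_a - p, with a ~ pi.
N = 200_000
a = rng.choice(nA, size=N, p=p)
scores = np.eye(nA)[a] - p            # each row is grad_logits log pi(a_n)
F_empirical = scores.T @ scores / N

print(np.abs(F_analytic - F_empirical).max())   # small, and shrinks as N grows
```

The empirical estimate fluctuates with the sampled actions, whereas the analytic form does not, which is the point made in the text above.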
As described in Appendix C, this analytic estimator has computational benefits in the large-scale setting, since it removes the need to store a dense Hessian or all policy gradients from a batch of trajectories. The rate of improvement in the policy is similar to the empirical FIM, as shown in the experiments.

Let us briefly summarize the relationship between the theory from Section 3 and the practical algorithm we have described:

- The theory justifies optimizing a surrogate objective with a penalty on KL divergence. However, the large penalty coefficient $C$ leads to prohibitively small steps, so we would like to decrease this coefficient. Empirically, it is hard to robustly choose the penalty coefficient, so we use a hard constraint instead of a penalty, with parameter $\delta$ (the bound on KL divergence).
- The constraint on $D_{KL}^{\max}(\theta_{\mathrm{old}}, \theta)$ is hard for numerical optimization and estimation, so instead we constrain $\bar D_{KL}(\theta_{\mathrm{old}}, \theta)$.
- Our theory ignores estimation error for the advantage function. Kakade & Langford (2002) consider this error in their derivation, and the same arguments would hold in the setting of this paper, but we omit them for simplicity.

7 Connections with Prior Work

As mentioned in Section 4, our derivation results in a policy update that is related to several prior methods, providing a unifying perspective on a number of policy update schemes.

The natural policy gradient (Kakade, 2002) can be obtained as a special case of the update in Equation (12) by using a linear approximation to $L$ and a quadratic approximation to the $\bar D_{KL}$ constraint, resulting in the following problem:

$$\underset{\theta}{\text{maximize}}\ \big[\nabla_\theta L_{\theta_{\mathrm{old}}}(\theta)\big|_{\theta=\theta_{\mathrm{old}}} \cdot (\theta - \theta_{\mathrm{old}})\big] \quad \text{subject to } \tfrac{1}{2}(\theta_{\mathrm{old}} - \theta)^T A(\theta_{\mathrm{old}})(\theta_{\mathrm{old}} - \theta) \leq \delta, \qquad (17)$$

where

$$A(\theta_{\mathrm{old}})_{ij} = \frac{\partial}{\partial\theta_i}\frac{\partial}{\partial\theta_j}\, \mathbb{E}_{s \sim \rho_\pi}\big[D_{KL}\big(\pi(\cdot \mid s, \theta_{\mathrm{old}}) \,\|\, \pi(\cdot \mid s, \theta)\big)\big]\Big|_{\theta=\theta_{\mathrm{old}}}.$$

The update is $\theta_{\mathrm{new}} = \theta_{\mathrm{old}} + \frac{1}{\lambda} A(\theta_{\mathrm{old}})^{-1}\, \nabla_\theta L(\theta)\big|_{\theta=\theta_{\mathrm{old}}}$, where the stepsize $\frac{1}{\lambda}$ is typically treated as an algorithm parameter. This differs from our approach, which enforces the constraint at each update. Though this difference might seem subtle, our experiments demonstrate that it significantly improves the algorithm's performance on larger problems.

We can also obtain the standard policy gradient update by using an $\ell_2$ constraint or penalty:

$$\underset{\theta}{\text{maximize}}\ \big[\nabla_\theta L_{\theta_{\mathrm{old}}}(\theta)\big|_{\theta=\theta_{\mathrm{old}}} \cdot (\theta - \theta_{\mathrm{old}})\big] \quad \text{subject to } \tfrac{1}{2}\|\theta - \theta_{\mathrm{old}}\|^2 \leq \delta. \qquad (18)$$

The policy iteration update can also be obtained by solving the unconstrained problem $\text{maximize}_\pi\ L_{\pi_{\mathrm{old}}}(\pi)$, using $L$ as defined in Equation (3).

Several other methods employ an update similar to Equation (12). Relative entropy policy search (REPS) (Peters et al., 2010) constrains the state-action marginals $p(s, a)$, while TRPO constrains the conditionals $p(a \mid s)$. Unlike REPS, our approach does not require a costly nonlinear optimization in the inner loop. Levine and Abbeel (2014) also use a KL divergence constraint, but its purpose is to encourage the policy not to stray from regions where the estimated dynamics model is valid, while we do not attempt to estimate the system dynamics explicitly. Pirotta et al. (2013) also build on and generalize Kakade and Langford's results, and they derive different algorithms from the ones here.

8 Experiments

We designed our experiments to investigate the following questions:

1. What are the performance characteristics of the single path and vine sampling procedures?
2. TRPO is related to prior methods (e.g. natural policy gradient) but makes several changes, most notably by using a fixed KL divergence rather than a fixed penalty coefficient. How does this affect the performance of the algorithm?
3. Can TRPO be used to solve challenging large-scale problems? How does TRPO compare with other methods when applied to large-scale problems, with regard to final performance, computation time, and sample complexity?

Figure 2. 2D robot models used for locomotion experiments. From left to right: swimmer, hopper, walker. The hopper and walker present a particular challenge, due to underactuation and contact discontinuities.

Figure 3. Neural networks used for the locomotion task (top) and for playing Atari games (bottom).

To answer (1) and (2), we compare the performance of the single path and vine variants of TRPO, several ablated variants, and a number of prior policy optimization algorithms. With regard to (3), we show that both the single path and vine algorithm can obtain high-quality locomotion controllers from scratch, which is considered to be a hard problem. We also show that these algorithms produce competitive results when learning policies for playing Atari games from images using convolutional neural networks with tens of thousands of parameters.

8.1 Simulated Robotic Locomotion

We conducted the robotic locomotion experiments using the MuJoCo simulator (Todorov et al., 2012). The three simulated robots are shown in Figure 2.
The states of the robots are their generalized positions and velocities, and the controls are joint torques. Underactuation, high dimensionality, and non-smooth dynamics due to contacts make these tasks very challenging.

The following models are included in our evaluation:

1. Swimmer. 10-dimensional state space, linear reward for forward progress and a quadratic penalty on joint effort to produce the reward $r(x, u) = v_x - 10^{-5}\|u\|^2$. The swimmer can propel itself forward by making an undulating motion.
2. Hopper. 12-dimensional state space, same reward as the swimmer, with a bonus of +1 for being in a non-terminal state. We ended the episodes when the hopper fell over, which was defined by thresholds on the torso height and angle.
3. Walker. 18-dimensional state space. For the walker, we added a penalty for strong impacts of the feet against the ground to encourage a smooth walk rather than a hopping gait.

We used $\delta = 0.01$ for all experiments. See Table 2 in the Appendix for more details on the experimental setup and parameters used. We used neural networks to represent the policy, with the architecture shown in Figure 3, and further details provided in Appendix D. To establish a standard baseline, we also included the classic cart-pole balancing problem, based on the formulation from Barto et al. (1983), using a linear policy with six parameters that is easy to optimize with derivative-free black-box optimization methods.

The following algorithms were considered in the comparison: single path TRPO; vine TRPO; cross-entropy method (CEM), a gradient-free method (Szita & Lörincz, 2006); covariance matrix adaptation (CMA), another gradient-free method (Hansen & Ostermeier, 1996); natural gradient, the classic natural policy gradient algorithm (Kakade, 2002), which differs from single path by the use of a fixed penalty coefficient (Lagrange multiplier) instead of the KL divergence constraint; empirical FIM, identical to single path, except that the FIM is estimated using the covariance matrix of the gradients rather than the analytic estimate; and max KL, which was only tractable on the cart-pole problem, and uses the maximum KL divergence in Equation (11), rather than the average divergence, allowing us to evaluate the quality of this approximation. The parameters used in the experiments are provided in Appendix E. For the natural gradient method, we swept through the possible values of the stepsize in factors of three, and took the best value according to the final performance.

Learning curves showing the total reward averaged across five runs of each algorithm are shown in Figure 4. Single path and vine TRPO solved all of the problems, yielding the best solutions. Natural gradient performed well on the two easier problems, but was unable to generate hopping and walking gaits that made forward progress. These results provide empirical evidence that constraining the KL divergence is a more robust way to choose step sizes and make fast, consistent progress, compared to using a fixed penalty.

Figure 4. Learning curves for the locomotion tasks (cartpole, swimmer, hopper, walker), averaged across five runs of each algorithm with random initializations; the curves compare vine, single path, natural gradient, max KL, empirical FIM, CEM, CMA, and RWR. Note that for the hopper and walker, a score of 1 is achievable without any forward velocity, indicating a policy that simply learned balanced standing, but not walking.

CEM and CMA are derivative-free algorithms, hence their sample complexity scales unfavorably with the number of parameters, and they performed poorly on the larger problems.
The max KL method learned somewhat more slowly than our final method, due to the more restrictive form of the constraint, but overall the result suggests that the average KL divergence constraint has a similar effect as the theoretically justified maximum KL divergence. Videos of the policies learned by TRPO may be viewed on the project website: site/trpopaper/.

Note that TRPO learned all of the gaits with general-purpose policies and simple reward functions, using minimal prior knowledge. This is in contrast with most prior methods for learning locomotion, which typically rely on hand-architected policy classes that explicitly encode notions of balance and stepping (Tedrake et al., 2004; Geng et al., 2006; Wampler & Popović, 2009).

8.2 Playing Games from Images

To evaluate TRPO on a partially observed task with complex observations, we trained policies for playing Atari games, using raw images as input. The games require learning a variety of behaviors, such as dodging bullets and hitting balls with paddles. Aside from the high dimensionality, challenging elements of these games include delayed rewards (no immediate penalty is incurred when a life is lost in Breakout or Space Invaders); complex sequences of behavior (Q*bert requires a character to hop on 21 different platforms); and non-stationary image statistics (Enduro involves a changing and flickering background). We tested our algorithms on the same seven games reported on in (Mnih et al., 2013) and (Guo et al., 2014), which are made available through the Arcade Learning Environment (Bellemare et al., 2013).

Table 1. Performance comparison for vision-based RL algorithms on the Atari domain, over the games B. Rider, Breakout, Enduro, Pong, Q*bert, Seaquest, and S. Invaders; the rows compare Random, Human (Mnih et al., 2013), Deep Q Learning (Mnih et al., 2013), UCC-I (Guo et al., 2014), TRPO single path, and TRPO vine. Our algorithms (bottom rows) were run once on each task, with the same architecture and parameters. Performance varies substantially from run to run (with different random initializations of the policy), but we could not obtain error statistics due to time constraints.

The images were preprocessed following the protocol in Mnih et al. (2013), and the policy was represented by the convolutional neural network shown in Figure 3, with two convolutional layers with 16 channels and stride 2, followed by one fully-connected layer with 20 units, yielding 33,500 parameters.

The results of the vine and single path algorithms are summarized in Table 1, which also includes an expert human performance and two recent methods: deep Q-learning (Mnih et al., 2013), and a combination of Monte-Carlo Tree Search with supervised training (Guo et al., 2014), called UCC-I. The 500 iterations of our algorithm took about 30 hours (with a slight variation between games) on a 16-core computer. While our method only outperformed the prior methods on some of the games, it consistently achieved reasonable scores. Unlike the prior methods, our approach was not designed specifically for this task. The ability to apply the same policy search method to tasks as diverse as robotic locomotion and image-based game playing demonstrates the generality of TRPO.

9 Discussion

We proposed and analyzed trust region methods for optimizing stochastic control policies. We proved monotonic improvement for an algorithm that repeatedly optimizes a local approximation to the expected return of the policy with a KL divergence penalty, and we showed that an approximation to this method that incorporates a KL divergence constraint achieves good empirical results on a range of challenging policy learning tasks, outperforming prior methods. Our analysis also provides a perspective that unifies policy gradient and policy iteration methods, and shows them to be special limiting cases of an algorithm that optimizes a certain objective subject to a trust region constraint.

In the domain of robotic locomotion, we successfully learned controllers for swimming, walking and hopping in a physics simulator, using general purpose neural networks and minimally informative rewards. To our knowledge, no prior work has learned controllers from scratch for all of these tasks, using a generic policy search method and non-engineered, general-purpose policy representations. In the game-playing domain, we learned convolutional neural network policies that used raw images as inputs. This requires optimizing extremely high-dimensional policies, and only two prior methods report successful results on this task.

Since the method we proposed is scalable and has strong theoretical foundations, we hope that it will serve as a jumping-off point for future work on training large, rich function approximators for a range of challenging problems. At the intersection of the two experimental domains we explored, there is the possibility of learning robotic control policies that use vision and raw sensory data as input, providing a unified scheme for training robotic controllers that perform both perception and control. The use of more sophisticated policies, including recurrent policies with hidden state, could further make it possible to roll state estimation and control into the same policy in the partially observed setting.
By combining our method with model learning, it would also be possible to substantially reduce its sample complexity, making it applicable to real-world settings where samples are expensive.

Acknowledgements

We thank Emo Todorov and Yuval Tassa for providing the MuJoCo simulator; Bruno Scherrer, Tom Erez, Greg Wayne, and the anonymous ICML reviewers for insightful comments; and Vitchyr Pong and Shane Gu for pointing out errors in a previous version of the manuscript. This research was funded in part by the Office of Naval Research through a Young Investigator Award and under grant number N, by DARPA through a Young Faculty Award, and by the Army Research Office through the MAST program.

References

Bagnell, J. A. and Schneider, J. Covariant policy search. IJCAI, 2003.

Bartlett, P. L. and Baxter, J. Infinite-horizon policy-gradient estimation. arXiv preprint, 2011.

Barto, A., Sutton, R., and Anderson, C. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, (5), 1983.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47, June 2013.

Bertsekas, D. Dynamic programming and optimal control. 2005.

Deisenroth, M., Neumann, G., and Peters, J. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1-142, 2013.

Gabillon, Victor, Ghavamzadeh, Mohammad, and Scherrer, Bruno. Approximate dynamic programming finally performs well in the game of Tetris. In Advances in Neural Information Processing Systems, 2013.

Geng, T., Porr, B., and Wörgötter, F. Fast biped walking with a reflexive controller and realtime policy searching. In Advances in Neural Information Processing Systems (NIPS), 2006.

Guo, X., Singh, S., Lee, H., Lewis, R. L., and Wang, X. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems, 2014.

Hansen, Nikolaus and Ostermeier, Andreas. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Evolutionary Computation, 1996, Proceedings of IEEE International Conference on. IEEE, 1996.

Hunter, David R and Lange, Kenneth. A tutorial on MM algorithms. The American Statistician, 58(1):30-37, 2004.

Kakade, Sham. A natural policy gradient. In Advances in Neural Information Processing Systems. MIT Press, 2002.

Kakade, Sham and Langford, John. Approximately optimal approximate reinforcement learning. In ICML, volume 2, 2002.

Lagoudakis, Michail G and Parr, Ronald. Reinforcement learning as classification: Leveraging modern classifiers. In ICML, volume 3, 2003.

Levin, D. A., Peres, Y., and Wilmer, E. L. Markov chains and mixing times. American Mathematical Society, 2009.

Levine, Sergey and Abbeel, Pieter. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, 2014.

Martens, J. and Sutskever, I. Training deep and recurrent networks with hessian-free optimization. In Neural Networks: Tricks of the Trade. Springer, 2012.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint, 2013.

Nemirovski, Arkadi. Efficient methods in convex programming. 2005.

Ng, A. Y. and Jordan, M. PEGASUS: A policy search method for large MDPs and POMDPs. In Uncertainty in Artificial Intelligence (UAI), 2000.

Owen, Art B. Monte Carlo theory, methods and examples. 2013.

Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. arXiv preprint, 2013.

Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 2008a.

Peters, J., Mülling, K., and Altün, Y. Relative entropy policy search. In AAAI Conference on Artificial Intelligence, 2010.

Peters, Jan and Schaal, Stefan. Natural actor-critic. Neurocomputing, 71(7), 2008b.

Pirotta, Matteo, Restelli, Marcello, Pecorino, Alessio, and Calandriello, Daniele. Safe policy iteration. In Proceedings of The 30th International Conference on Machine Learning, 2013.

Pollard, David. Asymptopia: an exposition of statistical asymptotic theory. 2000. URL pollard/books/asymptopia.

Szita, István and Lörincz, András. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12), 2006.

Tedrake, R., Zhang, T., and Seung, H. Stochastic policy gradient reinforcement learning on a simple 3d biped. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2004.

Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012.

Wampler, Kevin and Popović, Zoran. Optimal gait and form for animal locomotion. In ACM Transactions on Graphics (TOG), volume 28, pp. 60. ACM, 2009.

Wright, Stephen J and Nocedal, Jorge. Numerical optimization, volume 2. Springer New York, 1999.

A Proof of Policy Improvement Bound

This proof (of Theorem 1) uses techniques from the proof of Theorem 4.1 in (Kakade & Langford, 2002), adapting them to the more general setting considered in this paper. An informal overview is as follows. Our proof relies on the notion of coupling, where we jointly define the policies $\pi$ and $\tilde\pi$ so that they choose the same action with high probability $(= 1 - \alpha)$. The surrogate loss $L_\pi(\tilde\pi)$ accounts for the advantage of $\tilde\pi$ the first time that it disagrees with $\pi$, but not subsequent disagreements. Hence, the error in $L_\pi$ is due to two or more disagreements between $\pi$ and $\tilde\pi$; hence, we get an $O(\alpha^2)$ correction term, where $\alpha$ is the probability of disagreement.

We start out with a lemma from Kakade & Langford (2002) that shows that the difference in policy performance $\eta(\tilde\pi) - \eta(\pi)$ can be decomposed as a sum of per-timestep advantages.

Lemma 1. Given two policies $\pi, \tilde\pi$,

$$\eta(\tilde\pi) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde\pi}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big]. \qquad (19)$$

This expectation is taken over trajectories $\tau := (s_0, a_0, s_1, a_1, \dots)$, and the notation $\mathbb{E}_{\tau \sim \tilde\pi}[\dots]$ indicates that actions are sampled from $\tilde\pi$ to generate $\tau$.

Proof. First note that

$$A_\pi(s, a) = \mathbb{E}_{s' \sim P(s' \mid s, a)}\big[r(s) + \gamma V_\pi(s') - V_\pi(s)\big]. \qquad (20)$$

Therefore,

$$\begin{aligned}
\mathbb{E}_{\tau \sim \tilde\pi}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big]
&= \mathbb{E}_{\tau \sim \tilde\pi}\Big[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t) + \gamma V_\pi(s_{t+1}) - V_\pi(s_t)\big)\Big] \qquad (21) \\
&= \mathbb{E}_{\tau \sim \tilde\pi}\Big[-V_\pi(s_0) + \sum_{t=0}^{\infty} \gamma^t r(s_t)\Big] \qquad (22) \\
&= -\mathbb{E}_{s_0}\big[V_\pi(s_0)\big] + \mathbb{E}_{\tau \sim \tilde\pi}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big] \qquad (23) \\
&= -\eta(\pi) + \eta(\tilde\pi). \qquad (24)
\end{aligned}$$

Rearranging, the result follows.

Define $\bar A(s)$ to be the expected advantage of $\tilde\pi$ over $\pi$ at state $s$:

$$\bar A(s) = \mathbb{E}_{a \sim \tilde\pi(\cdot \mid s)}\big[A_\pi(s, a)\big]. \qquad (25)$$

Now Lemma 1 can be written as follows:

$$\eta(\tilde\pi) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde\pi}\Big[\sum_{t=0}^{\infty} \gamma^t \bar A(s_t)\Big]. \qquad (26)$$

Note that $L_\pi$ can be written as

$$L_\pi(\tilde\pi) = \eta(\pi) + \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t \bar A(s_t)\Big]. \qquad (27)$$

The difference in these equations is whether the states are sampled using $\pi$ or $\tilde\pi$. To bound the difference between $\eta(\tilde\pi)$ and $L_\pi(\tilde\pi)$, we will bound the difference arising from each timestep. To do this, we first need to introduce a measure of how much $\pi$ and $\tilde\pi$ agree. Specifically, we'll couple the policies, so that they define a joint distribution over pairs of actions.

Definition 1. $(\pi, \tilde\pi)$ is an $\alpha$-coupled policy pair if it defines a joint distribution $(a, \tilde a) \mid s$, such that $P(a \neq \tilde a \mid s) \leq \alpha$ for all $s$. $\pi$ and $\tilde\pi$ will denote the marginal distributions of $a$ and $\tilde a$, respectively.

Computationally, $\alpha$-coupling means that if we randomly choose a seed for our random number generator, and then we sample from each of $\pi$ and $\tilde\pi$ after setting that seed, the results will agree for at least fraction $1 - \alpha$ of seeds.

Lemma 2. Given that $\pi, \tilde\pi$ are $\alpha$-coupled policies, for all $s$,

$$|\bar A(s)| \leq 2\alpha \max_{s,a} |A_\pi(s, a)|. \qquad (28)$$

Proof.

$$\begin{aligned}
\bar A(s) &= \mathbb{E}_{\tilde a \sim \tilde\pi}\big[A_\pi(s, \tilde a)\big] = \mathbb{E}_{(a, \tilde a) \sim (\pi, \tilde\pi)}\big[A_\pi(s, \tilde a) - A_\pi(s, a)\big] \quad \text{since } \mathbb{E}_{a \sim \pi}[A_\pi(s, a)] = 0 \qquad (29) \\
&= P(a \neq \tilde a \mid s)\, \mathbb{E}_{(a, \tilde a) \sim (\pi, \tilde\pi) \mid a \neq \tilde a}\big[A_\pi(s, \tilde a) - A_\pi(s, a)\big] \qquad (30) \\
|\bar A(s)| &\leq \alpha \cdot 2 \max_{s,a} |A_\pi(s, a)|. \qquad (31)
\end{aligned}$$

Lemma 3. Let $(\pi, \tilde\pi)$ be an $\alpha$-coupled policy pair. Then

$$\big|\mathbb{E}_{s_t \sim \tilde\pi}[\bar A(s_t)] - \mathbb{E}_{s_t \sim \pi}[\bar A(s_t)]\big| \leq 4\alpha\big(1 - (1-\alpha)^t\big)\max_{s,a} |A_\pi(s, a)|. \qquad (32)$$

Proof. Given the coupled policy pair $(\pi, \tilde\pi)$, we can also obtain a coupling over the trajectory distributions produced by $\pi$ and $\tilde\pi$, respectively. Namely, we have pairs of trajectories $\tau, \tilde\tau$, where $\tau$ is obtained by taking actions from $\pi$, and $\tilde\tau$ is obtained by taking actions from $\tilde\pi$, where the same random seed is used to generate both trajectories. We will consider the advantage of $\tilde\pi$ over $\pi$ at timestep $t$, and decompose this expectation based on whether $\pi$ agrees with $\tilde\pi$ at all timesteps $i < t$. Let $n_t$ denote the number of times that $a_i \neq \tilde a_i$ for $i < t$, i.e., the number of times that $\pi$ and $\tilde\pi$ disagree before timestep $t$.

$$\mathbb{E}_{s_t \sim \tilde\pi}[\bar A(s_t)] = P(n_t = 0)\, \mathbb{E}_{s_t \sim \tilde\pi \mid n_t = 0}[\bar A(s_t)] + P(n_t > 0)\, \mathbb{E}_{s_t \sim \tilde\pi \mid n_t > 0}[\bar A(s_t)]. \qquad (33)$$

The expectation decomposes similarly when actions are sampled using $\pi$:

$$\mathbb{E}_{s_t \sim \pi}[\bar A(s_t)] = P(n_t = 0)\, \mathbb{E}_{s_t \sim \pi \mid n_t = 0}[\bar A(s_t)] + P(n_t > 0)\, \mathbb{E}_{s_t \sim \pi \mid n_t > 0}[\bar A(s_t)]. \qquad (34)$$

Note that the $n_t = 0$ terms are equal:

$$\mathbb{E}_{s_t \sim \tilde\pi \mid n_t = 0}[\bar A(s_t)] = \mathbb{E}_{s_t \sim \pi \mid n_t = 0}[\bar A(s_t)], \qquad (35)$$

because $n_t = 0$ indicates that $\pi$ and $\tilde\pi$ agreed on all timesteps less than $t$. Subtracting Equations (33) and (34), we get

$$\mathbb{E}_{s_t \sim \tilde\pi}[\bar A(s_t)] - \mathbb{E}_{s_t \sim \pi}[\bar A(s_t)] = P(n_t > 0)\big(\mathbb{E}_{s_t \sim \tilde\pi \mid n_t > 0}[\bar A(s_t)] - \mathbb{E}_{s_t \sim \pi \mid n_t > 0}[\bar A(s_t)]\big). \qquad (36)$$

By definition of $\alpha$, $P(\pi, \tilde\pi \text{ agree at timestep } i) \geq 1 - \alpha$, so $P(n_t = 0) \geq (1-\alpha)^t$, and

$$P(n_t > 0) \leq 1 - (1-\alpha)^t. \qquad (37)$$

Next, note that

$$\big|\mathbb{E}_{s_t \sim \tilde\pi \mid n_t > 0}[\bar A(s_t)] - \mathbb{E}_{s_t \sim \pi \mid n_t > 0}[\bar A(s_t)]\big| \leq \big|\mathbb{E}_{s_t \sim \tilde\pi \mid n_t > 0}[\bar A(s_t)]\big| + \big|\mathbb{E}_{s_t \sim \pi \mid n_t > 0}[\bar A(s_t)]\big| \qquad (38)$$
$$\leq 4\alpha \max_{s,a} |A_\pi(s, a)|, \qquad (39)$$

where the second inequality follows from Lemma 2. Plugging Equation (37) and Equation (39) into Equation (36), we get

$$\big|\mathbb{E}_{s_t \sim \tilde\pi}[\bar A(s_t)] - \mathbb{E}_{s_t \sim \pi}[\bar A(s_t)]\big| \leq 4\alpha\big(1 - (1-\alpha)^t\big)\max_{s,a} |A_\pi(s, a)|. \qquad (40)$$

The preceding Lemma bounds the difference in expected advantage at each timestep $t$. We can sum over time to bound the difference between $\eta(\tilde\pi)$ and $L_\pi(\tilde\pi)$. Subtracting Equation (26) and Equation (27), and defining $\epsilon = \max_{s,a} |A_\pi(s, a)|$,

$$\begin{aligned}
|\eta(\tilde\pi) - L_\pi(\tilde\pi)| &= \Big|\sum_{t=0}^{\infty} \gamma^t \big(\mathbb{E}_{\tau \sim \tilde\pi}[\bar A(s_t)] - \mathbb{E}_{\tau \sim \pi}[\bar A(s_t)]\big)\Big| \qquad (41) \\
&\leq \sum_{t=0}^{\infty} \gamma^t \cdot 4\epsilon\alpha\big(1 - (1-\alpha)^t\big) \qquad (42) \\
&= 4\epsilon\alpha\Big(\frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\alpha)}\Big) \qquad (43) \\
&= \frac{4\alpha^2\gamma\epsilon}{(1-\gamma)(1-\gamma(1-\alpha))} \qquad (44) \\
&\leq \frac{4\alpha^2\gamma\epsilon}{(1-\gamma)^2}. \qquad (45)
\end{aligned}$$

Last, to replace $\alpha$ by the total variation divergence, we need to use the correspondence between TV divergence and coupled random variables: Suppose $p_X$ and $p_Y$ are distributions with $D_{TV}(p_X \,\|\, p_Y) = \alpha$. Then there exists a joint distribution $(X, Y)$ whose marginals are $p_X, p_Y$, for which $X = Y$ with probability $1 - \alpha$. See (Levin et al., 2009), Proposition 4.7. It follows that if we have two policies $\pi$ and $\tilde\pi$ such that $\max_s D_{TV}(\pi(\cdot \mid s) \,\|\, \tilde\pi(\cdot \mid s)) \leq \alpha$, then we can define an $\alpha$-coupled policy pair $(\pi, \tilde\pi)$ with appropriate marginals. Taking $\alpha = \max_s D_{TV}(\pi(\cdot \mid s) \,\|\, \tilde\pi(\cdot \mid s))$ in Equation (45), Theorem 1 follows.

B Perturbation Theory Proof of Policy Improvement Bound

We also provide an alternative proof of Theorem 1 using perturbation theory.

Proof. Let $G = (1 + \gamma P_\pi + (\gamma P_\pi)^2 + \dots) = (1 - \gamma P_\pi)^{-1}$, and similarly let $\tilde G = (1 + \gamma P_{\tilde\pi} + (\gamma P_{\tilde\pi})^2 + \dots) = (1 - \gamma P_{\tilde\pi})^{-1}$. We will use the convention that $\rho$ (a density on state space) is a vector and $r$ (a reward function on state space) is a dual vector (i.e., a linear functional on vectors), thus $r\rho$ is a scalar meaning the expected reward under density $\rho$. Note that $\eta(\pi) = rG\rho_0$, and $\eta(\tilde\pi) = r\tilde G\rho_0$. Let $\Delta = P_{\tilde\pi} - P_\pi$. We want to bound $\eta(\tilde\pi) - \eta(\pi) = r(\tilde G - G)\rho_0$. We start with some standard perturbation theory manipulations:

$$G^{-1} - \tilde G^{-1} = (1 - \gamma P_\pi) - (1 - \gamma P_{\tilde\pi}) = \gamma\Delta. \qquad (46)$$

Left multiplying by $G$ and right multiplying by $\tilde G$ gives

$$\tilde G - G = \gamma G \Delta \tilde G, \quad \text{i.e.,} \quad \tilde G = G + \gamma G \Delta \tilde G. \qquad (47)$$

Substituting the right-hand side into $\tilde G$ gives

$$\tilde G = G + \gamma G \Delta G + \gamma^2 G \Delta G \Delta \tilde G. \qquad (48)$$

So we have

$$\eta(\tilde\pi) - \eta(\pi) = r(\tilde G - G)\rho_0 = \gamma r G \Delta G \rho_0 + \gamma^2 r G \Delta G \Delta \tilde G \rho_0. \qquad (49)$$

Let us first consider the leading term $\gamma r G \Delta G \rho_0$. Note that $rG = v$, i.e., the infinite-horizon state-value function (as a dual vector). Also note that $G\rho_0 = \rho_\pi$. Thus we can write $\gamma r G \Delta G \rho_0 = \gamma v \Delta \rho_\pi$. We will show that this expression equals the expected advantage $L_\pi(\tilde\pi) - L_\pi(\pi)$:

$$\begin{aligned}
L_\pi(\tilde\pi) - L_\pi(\pi) &= \sum_s \rho_\pi(s) \sum_a \big(\tilde\pi(a \mid s) - \pi(a \mid s)\big) A_\pi(s, a) \\
&= \sum_s \rho_\pi(s) \sum_a \big(\tilde\pi(a \mid s) - \pi(a \mid s)\big)\Big[r(s) + \sum_{s'} p(s' \mid s, a)\,\gamma v(s') - v(s)\Big] \\
&= \sum_s \rho_\pi(s) \sum_a \big(\tilde\pi(a \mid s) - \pi(a \mid s)\big) \sum_{s'} p(s' \mid s, a)\,\gamma v(s') \\
&= \sum_s \rho_\pi(s) \sum_{s'} \big(p_{\tilde\pi}(s' \mid s) - p_\pi(s' \mid s)\big)\,\gamma v(s') \\
&= \gamma v \Delta \rho_\pi. \qquad (50)
\end{aligned}$$

Next let us bound the $O(\Delta^2)$ term $\gamma^2 r G \Delta G \Delta \tilde G \rho_0$. First we consider the product $\gamma r G \Delta = \gamma v \Delta$. Consider the component $s$ of this dual vector:

$$\begin{aligned}
|(\gamma v \Delta)_s| &= \Big|\sum_a \big(\tilde\pi(a \mid s) - \pi(a \mid s)\big) Q_\pi(s, a)\Big| = \Big|\sum_a \big(\tilde\pi(a \mid s) - \pi(a \mid s)\big) A_\pi(s, a)\Big| \\
&\leq \sum_a \big|\tilde\pi(a \mid s) - \pi(a \mid s)\big| \cdot \max_a |A_\pi(s, a)| \leq 2\alpha\epsilon, \qquad (51)
\end{aligned}$$

where the last line used the definition of the total-variation divergence, and the definition of $\epsilon = \max_{s,a} |A_\pi(s, a)|$. We bound the other portion $G \Delta \tilde G \rho_0$ using the $\ell_1$ operator norm

$$\|A\|_1 = \sup_\rho \Big\{\frac{\|A\rho\|_1}{\|\rho\|_1}\Big\}, \qquad (52)$$

where we have that $\|G\|_1 = \|\tilde G\|_1 = 1/(1-\gamma)$ and $\|\Delta\|_1 = 2\alpha$. That gives

$$\|G \Delta \tilde G \rho_0\|_1 \leq \|G\|_1 \|\Delta\|_1 \|\tilde G\|_1 \|\rho_0\|_1 = \frac{1}{1-\gamma}\cdot 2\alpha \cdot \frac{1}{1-\gamma}\cdot 1 = \frac{2\alpha}{(1-\gamma)^2}. \qquad (53)$$

So we have that

$$\gamma^2 \big|r G \Delta G \Delta \tilde G \rho_0\big| \leq \gamma \,\big\|\gamma r G \Delta\big\|_\infty \big\|G \Delta \tilde G \rho_0\big\|_1 \leq \gamma \cdot 2\alpha\epsilon \cdot \frac{2\alpha}{(1-\gamma)^2} = \frac{4\gamma\epsilon}{(1-\gamma)^2}\alpha^2. \qquad (54)$$

C Efficiently Solving the Trust-Region Constrained Optimization Problem

This section describes how to efficiently approximately solve the following constrained optimization problem, which we must solve at each iteration of TRPO:

$$\text{maximize } L(\theta) \quad \text{subject to } \bar D_{KL}(\theta_{\mathrm{old}}, \theta) \leq \delta. \qquad (55)$$

The method we will describe involves two steps: (1) compute a search direction, using a linear approximation to the objective and a quadratic approximation to the constraint; and (2) perform a line search in that direction, ensuring that we improve the nonlinear objective while satisfying the nonlinear constraint.

The search direction is computed by approximately solving the equation $Ax = g$, where $A$ is the Fisher information matrix, i.e., the quadratic approximation to the KL divergence constraint: $\bar D_{KL}(\theta_{\mathrm{old}}, \theta) \approx \frac{1}{2}(\theta - \theta_{\mathrm{old}})^T A (\theta - \theta_{\mathrm{old}})$, where $A_{ij} = \frac{\partial}{\partial\theta_i}\frac{\partial}{\partial\theta_j} \bar D_{KL}(\theta_{\mathrm{old}}, \theta)$. In large-scale problems, it is prohibitively costly (with respect to computation and memory) to form the full matrix $A$ (or $A^{-1}$). However, the conjugate gradient algorithm allows us to approximately solve the equation $Ax = b$ without forming this full matrix, when we merely have access to a function that computes matrix-vector products $y \mapsto Ay$. Appendix C.1 describes the most efficient way to compute matrix-vector products with the Fisher information matrix. For additional exposition on the use of Hessian-vector products for optimizing neural network objectives, see (Martens & Sutskever, 2012) and (Pascanu & Bengio, 2013).

Having computed the search direction $s \approx A^{-1}g$, we next need to compute the maximal step length $\beta$ such that $\theta + \beta s$ will satisfy the KL divergence constraint. To do this, let $\delta = \bar D_{KL} \approx \frac{1}{2}(\beta s)^T A (\beta s) = \frac{1}{2}\beta^2 s^T A s$. From this, we obtain $\beta = \sqrt{2\delta / (s^T A s)}$, where $\delta$ is the desired KL divergence. The term $s^T A s$ can be computed through a single Hessian vector product, and it is also an intermediate result produced by the conjugate gradient algorithm.

Last, we use a line search to ensure improvement of the surrogate objective and satisfaction of the KL divergence constraint, both of which are nonlinear in the parameter vector $\theta$ (and thus depart from the linear and quadratic approximations used to compute the step). We perform the line search on the objective $L_{\theta_{\mathrm{old}}}(\theta) - \mathcal{X}\big[\bar D_{KL}(\theta_{\mathrm{old}}, \theta) \leq \delta\big]$, where $\mathcal{X}[\dots]$ equals zero when its argument is true and $+\infty$ when it is false. Starting with the maximal value of the step length $\beta$ computed in the previous paragraph, we shrink $\beta$ exponentially until the objective improves. Without this line search, the algorithm occasionally computes large steps that cause a catastrophic degradation of performance.

C.1 Computing the Fisher-Vector Product

Here we will describe how to compute the matrix-vector product between the averaged Fisher information matrix and arbitrary vectors. This matrix-vector product enables us to perform the conjugate gradient algorithm. Suppose that the parameterized policy maps from the input $x$ to the distribution parameter vector $\mu_\theta(x)$, which parameterizes the distribution $\pi(u \mid x)$. Now the KL divergence for a given input $x$ can be written as follows:

$$D_{KL}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid x) \,\|\, \pi_\theta(\cdot \mid x)\big) = \mathrm{kl}\big(\mu_\theta(x), \mu_{\mathrm{old}}\big), \qquad (56)$$

where $\mathrm{kl}$ is the KL divergence between the distributions corresponding to the two mean parameter vectors. Differentiating $\mathrm{kl}$ twice with respect to $\theta$, we obtain

$$\frac{\partial \mu_a(x)}{\partial \theta_i}\frac{\partial \mu_b(x)}{\partial \theta_j}\,\mathrm{kl}''_{ab}\big(\mu_\theta(x), \mu_{\mathrm{old}}\big) + \frac{\partial^2 \mu_a(x)}{\partial \theta_i\, \partial \theta_j}\,\mathrm{kl}'_a\big(\mu_\theta(x), \mu_{\mathrm{old}}\big), \qquad (57)$$

where the primes ($'$) indicate differentiation with respect to the first argument, and there is an implied summation over indices $a, b$. The second term vanishes, leaving just the first term. Let $J := \frac{\partial \mu_a(x)}{\partial \theta_i}$ (the Jacobian); then the Fisher information matrix can be written in matrix form as $J^T M J$, where $M = \mathrm{kl}''_{ab}\big(\mu_\theta(x), \mu_{\mathrm{old}}\big)$ is the Fisher information matrix of the distribution in terms of the mean parameter $\mu$ (as opposed to the parameter $\theta$). This has a simple form for most parameterized distributions of interest.

The Fisher-vector product can now be written as a function $y \mapsto J^T M J y$.
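As a concrete, minimal instance of $y \mapsto J^T M J y$, assume a linear-Gaussian policy with fixed standard deviation sigma; this architecture is an illustrative assumption, not the one used in the paper's experiments. For a Gaussian with fixed variance, $M$ is simply $I/\sigma^2$, and the product averaged over a batch of inputs can be written directly:

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_out, N = 6, 2, 64
sigma = 0.5                                  # fixed action noise (illustrative assumption)
W = rng.normal(size=(d_out, d_in))           # hypothetical linear policy: mu_theta(x) = W x
X = rng.normal(size=(N, d_in))               # batch of inputs over which the FIM is averaged

def fisher_vector_product(y):
    # theta = vec(W); J = d mu / d theta; M = I / sigma^2 for a fixed-variance Gaussian.
    Y = y.reshape(d_out, d_in)
    fvp = np.zeros_like(W)
    for x in X:
        Jy = Y @ x                           # J y
        MJy = Jy / sigma**2                  # M J y
        fvp += np.outer(MJy, x)              # J^T M J y
    return (fvp / N).ravel()

y = rng.normal(size=d_out * d_in)
print(fisher_vector_product(y)[:5])
```

With a neural network policy, the products $Jy$ and $J^T v$ are not written out by hand as above but obtained from the forward- and reverse-mode operations described next.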
Multiplication by $J^T$ and $J$ can be performed by most automatic differentiation and neural network packages (multiplication by $J^T$ is the well-known backprop operation), and the operation for multiplication by $M$ can be derived for the distribution of interest. Note that this Fisher-vector product is straightforward to average over a set of datapoints, i.e., inputs $x$ to $\mu$.

One could alternatively use a generic method for calculating Hessian-vector products using reverse mode automatic differentiation ((Wright & Nocedal, 1999), chapter 8), computing the Hessian of $D_{KL}$ with respect to $\theta$. This method would be slightly less efficient as it does not exploit the fact that the second derivatives of $\mu(x)$ (i.e., the second term in Equation (57)) can be ignored, but may be substantially easier to implement.

We have described a procedure for computing the Fisher-vector product $y \mapsto Ay$, where the Fisher information matrix is averaged over a set of inputs to the function $\mu$. Computing the Fisher-vector product is typically about as expensive as computing the gradient of an objective that depends on $\mu(x)$ (Wright & Nocedal, 1999). Furthermore, we need to compute
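Putting the pieces of this appendix together, the following is a minimal sketch of the update: conjugate gradient on Fisher-vector products to obtain the search direction, the maximal step length $\beta = \sqrt{2\delta / (s^T A s)}$, and the exponential backtracking line search. The function names (conjugate_gradient, trpo_step) are hypothetical, and the quadratic toy objective and synthetic SPD matrix at the bottom are stand-ins for the sampled surrogate and the true Fisher information matrix; this is an illustration under those assumptions, not the paper's released code.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve A x = g using only Fisher-vector products fvp(y) = A y."""
    x = np.zeros_like(g)
    r = g.copy()                      # residual g - A x with x = 0
    p = g.copy()
    r_dot = r @ r
    for _ in range(iters):
        Ap = fvp(p)
        alpha = r_dot / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def trpo_step(theta_old, grad, fvp, surrogate, kl, delta=0.01, backtracks=10):
    s = conjugate_gradient(fvp, grad)             # search direction, approximately A^{-1} g
    beta = np.sqrt(2.0 * delta / (s @ fvp(s)))    # maximal step: (1/2) beta^2 s^T A s = delta
    old_obj = surrogate(theta_old)
    for k in range(backtracks):                   # shrink exponentially until acceptable
        theta = theta_old + (0.5 ** k) * beta * s
        if kl(theta) <= delta and surrogate(theta) > old_obj:
            return theta
    return theta_old                              # no acceptable step found

# Toy check on a quadratic surrogate with a synthetic SPD stand-in for the FIM.
rng = np.random.default_rng(4)
n = 8
A = rng.normal(size=(n, n)); A = A @ A.T + np.eye(n)
g = rng.normal(size=n)
theta0 = np.zeros(n)
theta1 = trpo_step(theta0, g,
                   fvp=lambda y: A @ y,
                   surrogate=lambda th: g @ (th - theta0) - 0.05 * (th - theta0) @ (th - theta0),
                   kl=lambda th: 0.5 * (th - theta0) @ A @ (th - theta0))
print(theta1)
```

In a full implementation, `grad`, `surrogate`, and `kl` are all estimated from the sampled state-action pairs of Section 5, and `fvp` averages the analytic Fisher-vector product of Appendix C.1 over the same batch.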


More information

8 Laplace s Method and Local Limit Theorems

8 Laplace s Method and Local Limit Theorems 8 Lplce s Method nd Locl Limit Theorems 8. Fourier Anlysis in Higher DImensions Most of the theorems of Fourier nlysis tht we hve proved hve nturl generliztions to higher dimensions, nd these cn be proved

More information

Generation of Lyapunov Functions by Neural Networks

Generation of Lyapunov Functions by Neural Networks WCE 28, July 2-4, 28, London, U.K. Genertion of Lypunov Functions by Neurl Networks Nvid Noroozi, Pknoosh Krimghee, Ftemeh Sfei, nd Hmed Jvdi Abstrct Lypunov function is generlly obtined bsed on tril nd

More information

Solution for Assignment 1 : Intro to Probability and Statistics, PAC learning

Solution for Assignment 1 : Intro to Probability and Statistics, PAC learning Solution for Assignment 1 : Intro to Probbility nd Sttistics, PAC lerning 10-701/15-781: Mchine Lerning (Fll 004) Due: Sept. 30th 004, Thursdy, Strt of clss Question 1. Bsic Probbility ( 18 pts) 1.1 (

More information

Learning to Serve and Bounce a Ball

Learning to Serve and Bounce a Ball Sndr Amend Gregor Gebhrdt Technische Universität Drmstdt Abstrct In this pper we investigte lerning the tsks of bll serving nd bll bouncing. These tsks disply chrcteristics which re common in vriety of

More information

Genetic Programming. Outline. Evolutionary Strategies. Evolutionary strategies Genetic programming Summary

Genetic Programming. Outline. Evolutionary Strategies. Evolutionary strategies Genetic programming Summary Outline Genetic Progrmming Evolutionry strtegies Genetic progrmming Summry Bsed on the mteril provided y Professor Michel Negnevitsky Evolutionry Strtegies An pproch simulting nturl evolution ws proposed

More information

7.2 The Definite Integral

7.2 The Definite Integral 7.2 The Definite Integrl the definite integrl In the previous section, it ws found tht if function f is continuous nd nonnegtive, then the re under the grph of f on [, b] is given by F (b) F (), where

More information

Chapter 5 : Continuous Random Variables

Chapter 5 : Continuous Random Variables STAT/MATH 395 A - PROBABILITY II UW Winter Qurter 216 Néhémy Lim Chpter 5 : Continuous Rndom Vribles Nottions. N {, 1, 2,...}, set of nturl numbers (i.e. ll nonnegtive integers); N {1, 2,...}, set of ll

More information

Math& 152 Section Integration by Parts

Math& 152 Section Integration by Parts Mth& 5 Section 7. - Integrtion by Prts Integrtion by prts is rule tht trnsforms the integrl of the product of two functions into other (idelly simpler) integrls. Recll from Clculus I tht given two differentible

More information

ODE: Existence and Uniqueness of a Solution

ODE: Existence and Uniqueness of a Solution Mth 22 Fll 213 Jerry Kzdn ODE: Existence nd Uniqueness of Solution The Fundmentl Theorem of Clculus tells us how to solve the ordinry differentil eqution (ODE) du = f(t) dt with initil condition u() =

More information

State space systems analysis (continued) Stability. A. Definitions A system is said to be Asymptotically Stable (AS) when it satisfies

State space systems analysis (continued) Stability. A. Definitions A system is said to be Asymptotically Stable (AS) when it satisfies Stte spce systems nlysis (continued) Stbility A. Definitions A system is sid to be Asymptoticlly Stble (AS) when it stisfies ut () = 0, t > 0 lim xt () 0. t A system is AS if nd only if the impulse response

More information

p-adic Egyptian Fractions

p-adic Egyptian Fractions p-adic Egyptin Frctions Contents 1 Introduction 1 2 Trditionl Egyptin Frctions nd Greedy Algorithm 2 3 Set-up 3 4 p-greedy Algorithm 5 5 p-egyptin Trditionl 10 6 Conclusion 1 Introduction An Egyptin frction

More information

W. We shall do so one by one, starting with I 1, and we shall do it greedily, trying

W. We shall do so one by one, starting with I 1, and we shall do it greedily, trying Vitli covers 1 Definition. A Vitli cover of set E R is set V of closed intervls with positive length so tht, for every δ > 0 nd every x E, there is some I V with λ(i ) < δ nd x I. 2 Lemm (Vitli covering)

More information

Review of basic calculus

Review of basic calculus Review of bsic clculus This brief review reclls some of the most importnt concepts, definitions, nd theorems from bsic clculus. It is not intended to tech bsic clculus from scrtch. If ny of the items below

More information

CS 188 Introduction to Artificial Intelligence Fall 2018 Note 7

CS 188 Introduction to Artificial Intelligence Fall 2018 Note 7 CS 188 Introduction to Artificil Intelligence Fll 2018 Note 7 These lecture notes re hevily bsed on notes originlly written by Nikhil Shrm. Decision Networks In the third note, we lerned bout gme trees

More information

Lecture 1. Functional series. Pointwise and uniform convergence.

Lecture 1. Functional series. Pointwise and uniform convergence. 1 Introduction. Lecture 1. Functionl series. Pointwise nd uniform convergence. In this course we study mongst other things Fourier series. The Fourier series for periodic function f(x) with period 2π is

More information

Math 270A: Numerical Linear Algebra

Math 270A: Numerical Linear Algebra Mth 70A: Numericl Liner Algebr Instructor: Michel Holst Fll Qurter 014 Homework Assignment #3 Due Give to TA t lest few dys before finl if you wnt feedbck. Exercise 3.1. (The Bsic Liner Method for Liner

More information

Chapter 14. Matrix Representations of Linear Transformations

Chapter 14. Matrix Representations of Linear Transformations Chpter 4 Mtrix Representtions of Liner Trnsformtions When considering the Het Stte Evolution, we found tht we could describe this process using multipliction by mtrix. This ws nice becuse computers cn

More information

ECO 317 Economics of Uncertainty Fall Term 2007 Notes for lectures 4. Stochastic Dominance

ECO 317 Economics of Uncertainty Fall Term 2007 Notes for lectures 4. Stochastic Dominance Generl structure ECO 37 Economics of Uncertinty Fll Term 007 Notes for lectures 4. Stochstic Dominnce Here we suppose tht the consequences re welth mounts denoted by W, which cn tke on ny vlue between

More information

Math 8 Winter 2015 Applications of Integration

Math 8 Winter 2015 Applications of Integration Mth 8 Winter 205 Applictions of Integrtion Here re few importnt pplictions of integrtion. The pplictions you my see on n exm in this course include only the Net Chnge Theorem (which is relly just the Fundmentl

More information

Travelling Profile Solutions For Nonlinear Degenerate Parabolic Equation And Contour Enhancement In Image Processing

Travelling Profile Solutions For Nonlinear Degenerate Parabolic Equation And Contour Enhancement In Image Processing Applied Mthemtics E-Notes 8(8) - c IN 67-5 Avilble free t mirror sites of http://www.mth.nthu.edu.tw/ men/ Trvelling Profile olutions For Nonliner Degenerte Prbolic Eqution And Contour Enhncement In Imge

More information

Numerical Integration

Numerical Integration Chpter 1 Numericl Integrtion Numericl differentition methods compute pproximtions to the derivtive of function from known vlues of the function. Numericl integrtion uses the sme informtion to compute numericl

More information

Student Activity 3: Single Factor ANOVA

Student Activity 3: Single Factor ANOVA MATH 40 Student Activity 3: Single Fctor ANOVA Some Bsic Concepts In designed experiment, two or more tretments, or combintions of tretments, is pplied to experimentl units The number of tretments, whether

More information

Recitation 3: Applications of the Derivative. 1 Higher-Order Derivatives and their Applications

Recitation 3: Applications of the Derivative. 1 Higher-Order Derivatives and their Applications Mth 1c TA: Pdric Brtlett Recittion 3: Applictions of the Derivtive Week 3 Cltech 013 1 Higher-Order Derivtives nd their Applictions Another thing we could wnt to do with the derivtive, motivted by wht

More information

Jack Simons, Henry Eyring Scientist and Professor Chemistry Department University of Utah

Jack Simons, Henry Eyring Scientist and Professor Chemistry Department University of Utah 1. Born-Oppenheimer pprox.- energy surfces 2. Men-field (Hrtree-Fock) theory- orbitls 3. Pros nd cons of HF- RHF, UHF 4. Beyond HF- why? 5. First, one usully does HF-how? 6. Bsis sets nd nottions 7. MPn,

More information

MAA 4212 Improper Integrals

MAA 4212 Improper Integrals Notes by Dvid Groisser, Copyright c 1995; revised 2002, 2009, 2014 MAA 4212 Improper Integrls The Riemnn integrl, while perfectly well-defined, is too restrictive for mny purposes; there re functions which

More information

Testing categorized bivariate normality with two-stage. polychoric correlation estimates

Testing categorized bivariate normality with two-stage. polychoric correlation estimates Testing ctegorized bivrite normlity with two-stge polychoric correltion estimtes Albert Mydeu-Olivres Dept. of Psychology University of Brcelon Address correspondence to: Albert Mydeu-Olivres. Fculty of

More information

Math 426: Probability Final Exam Practice

Math 426: Probability Final Exam Practice Mth 46: Probbility Finl Exm Prctice. Computtionl problems 4. Let T k (n) denote the number of prtitions of the set {,..., n} into k nonempty subsets, where k n. Argue tht T k (n) kt k (n ) + T k (n ) by

More information

Credibility Hypothesis Testing of Fuzzy Triangular Distributions

Credibility Hypothesis Testing of Fuzzy Triangular Distributions 666663 Journl of Uncertin Systems Vol.9, No., pp.6-74, 5 Online t: www.jus.org.uk Credibility Hypothesis Testing of Fuzzy Tringulr Distributions S. Smpth, B. Rmy Received April 3; Revised 4 April 4 Abstrct

More information

Riemann is the Mann! (But Lebesgue may besgue to differ.)

Riemann is the Mann! (But Lebesgue may besgue to differ.) Riemnn is the Mnn! (But Lebesgue my besgue to differ.) Leo Livshits My 2, 2008 1 For finite intervls in R We hve seen in clss tht every continuous function f : [, b] R hs the property tht for every ɛ >

More information

Numerical Integration

Numerical Integration Chpter 5 Numericl Integrtion Numericl integrtion is the study of how the numericl vlue of n integrl cn be found. Methods of function pproximtion discussed in Chpter??, i.e., function pproximtion vi the

More information

P 3 (x) = f(0) + f (0)x + f (0) 2. x 2 + f (0) . In the problem set, you are asked to show, in general, the n th order term is a n = f (n) (0)

P 3 (x) = f(0) + f (0)x + f (0) 2. x 2 + f (0) . In the problem set, you are asked to show, in general, the n th order term is a n = f (n) (0) 1 Tylor polynomils In Section 3.5, we discussed how to pproximte function f(x) round point in terms of its first derivtive f (x) evluted t, tht is using the liner pproximtion f() + f ()(x ). We clled this

More information

How to simulate Turing machines by invertible one-dimensional cellular automata

How to simulate Turing machines by invertible one-dimensional cellular automata How to simulte Turing mchines by invertible one-dimensionl cellulr utomt Jen-Christophe Dubcq Déprtement de Mthémtiques et d Informtique, École Normle Supérieure de Lyon, 46, llée d Itlie, 69364 Lyon Cedex

More information

Lecture Note 9: Orthogonal Reduction

Lecture Note 9: Orthogonal Reduction MATH : Computtionl Methods of Liner Algebr 1 The Row Echelon Form Lecture Note 9: Orthogonl Reduction Our trget is to solve the norml eution: Xinyi Zeng Deprtment of Mthemticl Sciences, UTEP A t Ax = A

More information

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams

Chapter 4 Contravariance, Covariance, and Spacetime Diagrams Chpter 4 Contrvrince, Covrince, nd Spcetime Digrms 4. The Components of Vector in Skewed Coordintes We hve seen in Chpter 3; figure 3.9, tht in order to show inertil motion tht is consistent with the Lorentz

More information

13: Diffusion in 2 Energy Groups

13: Diffusion in 2 Energy Groups 3: Diffusion in Energy Groups B. Rouben McMster University Course EP 4D3/6D3 Nucler Rector Anlysis (Rector Physics) 5 Sept.-Dec. 5 September Contents We study the diffusion eqution in two energy groups

More information

Data Assimilation. Alan O Neill Data Assimilation Research Centre University of Reading

Data Assimilation. Alan O Neill Data Assimilation Research Centre University of Reading Dt Assimiltion Aln O Neill Dt Assimiltion Reserch Centre University of Reding Contents Motivtion Univrite sclr dt ssimiltion Multivrite vector dt ssimiltion Optiml Interpoltion BLUE 3d-Vritionl Method

More information

5.7 Improper Integrals

5.7 Improper Integrals 458 pplictions of definite integrls 5.7 Improper Integrls In Section 5.4, we computed the work required to lift pylod of mss m from the surfce of moon of mss nd rdius R to height H bove the surfce of the

More information

Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 17

Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 17 EECS 70 Discrete Mthemtics nd Proility Theory Spring 2013 Annt Shi Lecture 17 I.I.D. Rndom Vriles Estimting the is of coin Question: We wnt to estimte the proportion p of Democrts in the US popultion,

More information

Chapters 4 & 5 Integrals & Applications

Chapters 4 & 5 Integrals & Applications Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions

More information

Week 10: Line Integrals

Week 10: Line Integrals Week 10: Line Integrls Introduction In this finl week we return to prmetrised curves nd consider integrtion long such curves. We lredy sw this in Week 2 when we integrted long curve to find its length.

More information

Ordinary differential equations

Ordinary differential equations Ordinry differentil equtions Introduction to Synthetic Biology E Nvrro A Montgud P Fernndez de Cordob JF Urchueguí Overview Introduction-Modelling Bsic concepts to understnd n ODE. Description nd properties

More information

Frobenius numbers of generalized Fibonacci semigroups

Frobenius numbers of generalized Fibonacci semigroups Frobenius numbers of generlized Fiboncci semigroups Gretchen L. Mtthews 1 Deprtment of Mthemticl Sciences, Clemson University, Clemson, SC 29634-0975, USA gmtthe@clemson.edu Received:, Accepted:, Published:

More information

x = b a N. (13-1) The set of points used to subdivide the range [a, b] (see Fig. 13.1) is

x = b a N. (13-1) The set of points used to subdivide the range [a, b] (see Fig. 13.1) is Jnury 28, 2002 13. The Integrl The concept of integrtion, nd the motivtion for developing this concept, were described in the previous chpter. Now we must define the integrl, crefully nd completely. According

More information

Quadratic Forms. Quadratic Forms

Quadratic Forms. Quadratic Forms Qudrtic Forms Recll the Simon & Blume excerpt from n erlier lecture which sid tht the min tsk of clculus is to pproximte nonliner functions with liner functions. It s ctully more ccurte to sy tht we pproximte

More information

CS 188: Artificial Intelligence Spring 2007

CS 188: Artificial Intelligence Spring 2007 CS 188: Artificil Intelligence Spring 2007 Lecture 3: Queue-Bsed Serch 1/23/2007 Srini Nrynn UC Berkeley Mny slides over the course dpted from Dn Klein, Sturt Russell or Andrew Moore Announcements Assignment

More information

Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1

Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1 Exm, Mthemtics 471, Section ETY6 6:5 pm 7:4 pm, Mrch 1, 16, IH-115 Instructor: Attil Máté 1 17 copies 1. ) Stte the usul sufficient condition for the fixed-point itertion to converge when solving the eqution

More information

Chapter 3 Polynomials

Chapter 3 Polynomials Dr M DRAIEF As described in the introduction of Chpter 1, pplictions of solving liner equtions rise in number of different settings In prticulr, we will in this chpter focus on the problem of modelling

More information

Best Approximation. Chapter The General Case

Best Approximation. Chapter The General Case Chpter 4 Best Approximtion 4.1 The Generl Cse In the previous chpter, we hve seen how n interpolting polynomil cn be used s n pproximtion to given function. We now wnt to find the best pproximtion to given

More information

Lecture 14: Quadrature

Lecture 14: Quadrature Lecture 14: Qudrture This lecture is concerned with the evlution of integrls fx)dx 1) over finite intervl [, b] The integrnd fx) is ssumed to be rel-vlues nd smooth The pproximtion of n integrl by numericl

More information

APPROXIMATE INTEGRATION

APPROXIMATE INTEGRATION APPROXIMATE INTEGRATION. Introduction We hve seen tht there re functions whose nti-derivtives cnnot be expressed in closed form. For these resons ny definite integrl involving these integrnds cnnot be

More information

Generalized Fano and non-fano networks

Generalized Fano and non-fano networks Generlized Fno nd non-fno networks Nildri Ds nd Brijesh Kumr Ri Deprtment of Electronics nd Electricl Engineering Indin Institute of Technology Guwhti, Guwhti, Assm, Indi Emil: {d.nildri, bkri}@iitg.ernet.in

More information

Chapter 0. What is the Lebesgue integral about?

Chapter 0. What is the Lebesgue integral about? Chpter 0. Wht is the Lebesgue integrl bout? The pln is to hve tutoril sheet ech week, most often on Fridy, (to be done during the clss) where you will try to get used to the ides introduced in the previous

More information

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior Reversls of Signl-Posterior Monotonicity for Any Bounded Prior Christopher P. Chmbers Pul J. Hely Abstrct Pul Milgrom (The Bell Journl of Economics, 12(2): 380 391) showed tht if the strict monotone likelihood

More information

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite Unit #8 : The Integrl Gols: Determine how to clculte the re described by function. Define the definite integrl. Eplore the reltionship between the definite integrl nd re. Eplore wys to estimte the definite

More information

Conservation Law. Chapter Goal. 5.2 Theory

Conservation Law. Chapter Goal. 5.2 Theory Chpter 5 Conservtion Lw 5.1 Gol Our long term gol is to understnd how mny mthemticl models re derived. We study how certin quntity chnges with time in given region (sptil domin). We first derive the very

More information

Advanced Calculus: MATH 410 Uniform Convergence of Functions Professor David Levermore 11 December 2015

Advanced Calculus: MATH 410 Uniform Convergence of Functions Professor David Levermore 11 December 2015 Advnced Clculus: MATH 410 Uniform Convergence of Functions Professor Dvid Levermore 11 December 2015 12. Sequences of Functions We now explore two notions of wht it mens for sequence of functions {f n

More information

Math 360: A primitive integral and elementary functions

Math 360: A primitive integral and elementary functions Mth 360: A primitive integrl nd elementry functions D. DeTurck University of Pennsylvni October 16, 2017 D. DeTurck Mth 360 001 2017C: Integrl/functions 1 / 32 Setup for the integrl prtitions Definition:

More information

4.4 Areas, Integrals and Antiderivatives

4.4 Areas, Integrals and Antiderivatives . res, integrls nd ntiderivtives 333. Ares, Integrls nd Antiderivtives This section explores properties of functions defined s res nd exmines some connections mong res, integrls nd ntiderivtives. In order

More information

CS667 Lecture 6: Monte Carlo Integration 02/10/05

CS667 Lecture 6: Monte Carlo Integration 02/10/05 CS667 Lecture 6: Monte Crlo Integrtion 02/10/05 Venkt Krishnrj Lecturer: Steve Mrschner 1 Ide The min ide of Monte Crlo Integrtion is tht we cn estimte the vlue of n integrl by looking t lrge number of

More information

New data structures to reduce data size and search time

New data structures to reduce data size and search time New dt structures to reduce dt size nd serch time Tsuneo Kuwbr Deprtment of Informtion Sciences, Fculty of Science, Kngw University, Hirtsuk-shi, Jpn FIT2018 1D-1, No2, pp1-4 Copyright (c)2018 by The Institute

More information

Physics 116C Solution of inhomogeneous ordinary differential equations using Green s functions

Physics 116C Solution of inhomogeneous ordinary differential equations using Green s functions Physics 6C Solution of inhomogeneous ordinry differentil equtions using Green s functions Peter Young November 5, 29 Homogeneous Equtions We hve studied, especilly in long HW problem, second order liner

More information

Discrete Mathematics and Probability Theory Summer 2014 James Cook Note 17

Discrete Mathematics and Probability Theory Summer 2014 James Cook Note 17 CS 70 Discrete Mthemtics nd Proility Theory Summer 2014 Jmes Cook Note 17 I.I.D. Rndom Vriles Estimting the is of coin Question: We wnt to estimte the proportion p of Democrts in the US popultion, y tking

More information

different methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s).

different methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s). Mth 1A with Professor Stnkov Worksheet, Discussion #41; Wednesdy, 12/6/217 GSI nme: Roy Zho Problems 1. Write the integrl 3 dx s limit of Riemnn sums. Write it using 2 intervls using the 1 x different

More information

1 Probability Density Functions

1 Probability Density Functions Lis Yn CS 9 Continuous Distributions Lecture Notes #9 July 6, 28 Bsed on chpter by Chris Piech So fr, ll rndom vribles we hve seen hve been discrete. In ll the cses we hve seen in CS 9, this ment tht our

More information

Scalable Learning in Stochastic Games

Scalable Learning in Stochastic Games Sclble Lerning in Stochstic Gmes Michel Bowling nd Mnuel Veloso Computer Science Deprtment Crnegie Mellon University Pittsburgh PA, 15213-3891 Abstrct Stochstic gmes re generl model of interction between

More information

Sufficient condition on noise correlations for scalable quantum computing

Sufficient condition on noise correlations for scalable quantum computing Sufficient condition on noise correltions for sclble quntum computing John Presill, 2 Februry 202 Is quntum computing sclble? The ccurcy threshold theorem for quntum computtion estblishes tht sclbility

More information

221B Lecture Notes WKB Method

221B Lecture Notes WKB Method Clssicl Limit B Lecture Notes WKB Method Hmilton Jcobi Eqution We strt from the Schrödinger eqution for single prticle in potentil i h t ψ x, t = [ ] h m + V x ψ x, t. We cn rewrite this eqution by using

More information

Introduction to Numerical Analysis

Introduction to Numerical Analysis Introduction to Numericl Anlysis Doron Levy Deprtment of Mthemtics nd Center for Scientific Computtion nd Mthemticl Modeling (CSCAMM) University of Mrylnd June 14, 2012 D. Levy CONTENTS Contents 1 Introduction

More information