arXiv: v6 [stat.ML] 13 Apr 2018

Expected Policy Gradients

Kamil Ciosek and Shimon Whiteson
Department of Computer Science, University of Oxford
Wolfson Building, Parks Road, Oxford OX1 3QD

Abstract

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates across the action when estimating the gradient, instead of relying only on the action in the sampled trajectory. We establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. We also prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead. Finally, we show that it is optimal in a certain sense to explore with a Gaussian policy such that the covariance is proportional to $e^H$, where $H$ is the scaled Hessian of the critic with respect to the actions. We present empirical results confirming that this new form of exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic in four challenging MuJoCo domains.

Introduction

Policy gradient methods (Sutton et al., 2000; Peters and Schaal, 2006, 2008b; Silver et al., 2014), which optimise policies by gradient ascent, have enjoyed great success in reinforcement learning problems with large or continuous action spaces. The archetypal algorithm optimises an actor, i.e., a policy, by following a policy gradient that is estimated using a critic, i.e., a value function. The policy can be stochastic or deterministic, yielding stochastic policy gradients (SPG) (Sutton et al., 2000) or deterministic policy gradients (DPG) (Silver et al., 2014). The theory underpinning these methods is quite fragmented, as each approach has a separate policy gradient theorem guaranteeing the policy gradient is unbiased under certain conditions.

Furthermore, both approaches have significant shortcomings. For SPG, variance in the gradient estimates means that many trajectories are usually needed for learning. Since gathering trajectories is typically expensive, there is a great need for more sample efficient methods.
DPG's use of deterministic policies mitigates the problem of variance in the gradient but raises other difficulties. The theoretical support for DPG is limited since it assumes a critic that approximates $\nabla_a Q$ when in practice it approximates $Q$ instead. In addition, DPG learns off-policy¹, which is undesirable when we want learning to take the cost of exploration into account. More importantly, learning off-policy necessitates designing a suitable exploration policy, which is difficult in practice. In fact, efficient exploration in DPG is an open problem and most applications simply use independent Gaussian noise or the Ornstein-Uhlenbeck heuristic (Uhlenbeck and Ornstein, 1930; Lillicrap et al., 2015).

In this paper, we propose a new approach called expected policy gradients (EPG) that unifies policy gradients in a way that yields both theoretical and practical insights. Inspired by expected sarsa (Sutton and Barto, 1998; van Seijen et al., 2009), the main idea is to integrate across the action selected by the stochastic policy when estimating the gradient, instead of relying only on the action selected during the sampled trajectory.

EPG enables two theoretical contributions. First, we establish a number of equivalences between EPG and DPG, among which is a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. Second, we prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead over SPG.

EPG also enables a practical contribution: a principled exploration strategy for continuous problems. We show that it is optimal in a certain sense to explore with a Gaussian policy such that the covariance is proportional to $e^H$, where $H$ is the scaled Hessian of the critic with respect to the actions. We present empirical results confirming that this new approach to exploration substantially outperforms DPG with Ornstein-Uhlenbeck exploration in four challenging MuJoCo domains.

Copyright © 2018, Association for the Advancement of Artificial Intelligence. All rights reserved.
Background

A Markov decision process is a tuple $(S, A, R, p, p_0, \gamma)$ where $S$ is a set of states, $A$ is a set of actions (in practice either $A = \mathbb{R}^d$ or $A$ is finite), $R(a, s)$ is a reward function, $p(s' \mid a, s)$ is a transition kernel, $p_0$ is an initial state distribution, and $\gamma \in [0, 1)$ is a discount factor. A policy $\pi(a \mid s)$ is a distribution over actions given a state. We denote trajectories as $\tau^\pi = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$, where $s_0 \sim p_0$, $a_t \sim \pi(\cdot \mid s_t)$ and $r_t$ is a sample reward. A policy $\pi$ induces a Markov process with transition kernel

$$p_\pi(s' \mid s) = \int_a d\pi(a \mid s)\, p(s' \mid a, s),$$

where we use the symbol $d\pi(a \mid s)$ to denote Lebesgue integration against the measure $\pi(a \mid s)$ where $s$ is fixed. We assume the induced Markov process is ergodic with a single invariant measure defined for the whole state space. The value function is $V^\pi = E_\tau\left[\sum_i \gamma^i r_i\right]$ where actions are sampled from $\pi$. The Q-function is $Q^\pi(a, s) = E_R[r \mid s, a] + \gamma E_{p(s' \mid a, s)}[V^\pi(s')]$ and the advantage function is $A^\pi(a, s) = Q^\pi(a, s) - V^\pi(s)$. An optimal policy maximises the total return $J = \int_s dp_0(s)\, V^\pi(s)$. Since we consider only on-policy learning with just one current policy, we drop the $\pi$ super/subscript where it is redundant.

¹ We show in this paper that, in certain settings, off-policy DPG is equivalent to EPG, our on-policy method.

If $\pi$ is parameterised by $\theta$, then stochastic policy gradients (SPG) (Sutton et al., 2000; Peters and Schaal, 2006, 2008b) perform gradient ascent on $\nabla J$, the gradient of $J$ with respect to $\theta$ (gradients without a subscript are always with respect to $\theta$). For stochastic policies, we have:

$$\nabla J = \int_s d\rho(s) \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\,(Q(a, s) + b(s)), \tag{1}$$

where $\rho$ is the discounted-ergodic occupancy measure, defined in the supplement, and $b(s)$ is a baseline, which can be any function that depends on the state but not the action, since $\int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\, b(s) = 0$. Typically, (1) is approximated from samples from a trajectory $\tau$ of length $T$:

$$\hat{\nabla} J = \sum_{t=0}^{T} \gamma^t\, \nabla \log \pi(a_t \mid s_t)\,(\hat{Q}(a_t, s_t) + b(s_t)). \tag{2}$$

If the policy is deterministic (we denote it $\pi(s)$), we can use deterministic policy gradients (Silver et al., 2014) instead:

$$\nabla J = \int_s d\rho(s)\, \nabla \pi(s) \left[\nabla_a Q(a, s)\right]_{a = \pi(s)}. \tag{3}$$

This update is then approximated using samples:

$$\hat{\nabla} J = \sum_{t=0}^{T} \gamma^t\, \nabla \pi(s_t) \left[\nabla_a \hat{Q}(a, s_t)\right]_{a = \pi(s_t)}. \tag{4}$$

Since the policy is deterministic, the problem of exploration is addressed using an external source of noise, typically modeled using a zero-mean Ornstein-Uhlenbeck (OU) process (Uhlenbeck and Ornstein, 1930; Lillicrap et al., 2015) parametrized by $\psi$ and $\sigma$:

$$n_i \leftarrow -n_{i-1}\psi + \mathcal{N}(0, \sigma I), \qquad a \sim \pi(s) + n_i. \tag{5}$$

In (2) and (4), $\hat{Q}$ is a critic that approximates $Q$ and can be learned by sarsa (Rummery and Niranjan, 1994; Sutton, 1996):

$$\hat{Q}(a_t, s_t) \leftarrow \hat{Q}(a_t, s_t) + \alpha\left[r_{t+1} + \gamma \hat{Q}(a_{t+1}, s_{t+1}) - \hat{Q}(a_t, s_t)\right]. \tag{6}$$

Alternatively, we can use expected sarsa (Sutton and Barto, 1998; van Seijen et al., 2009), which marginalises out $a_{t+1}$, the distribution over which is specified by the known policy, to reduce the variance in the update:

$$\hat{Q}(a_t, s_t) \leftarrow \hat{Q}(a_t, s_t) + \alpha\left[r_{t+1} + \gamma \int_a d\pi(a \mid s_{t+1})\, \hat{Q}(a, s_{t+1}) - \hat{Q}(a_t, s_t)\right]. \tag{7}$$

We could also use advantage learning (Baird and others, 1995) or LSTDQ (Lagoudakis and Parr, 2003). If the critic's function approximator is compatible, then the actor, i.e., $\pi$, converges (Sutton et al., 2000).

Instead of learning $\hat{Q}$, we can set $b(s) = -V(s)$ so that $Q(a, s) + b(s) = A(a, s)$ and then use the TD error $\delta(r, s', s) = r + \gamma V(s') - V(s)$ as an estimate of $A(a, s)$ (Bhatnagar et al., 2008):

$$\hat{\nabla} J = \sum_{t=0}^{T} \gamma^t\, \nabla \log \pi(a_t \mid s_t)\,(r + \gamma \hat{V}(s') - \hat{V}(s)), \tag{8}$$

where $\hat{V}(s)$ is an approximate value function learned using any policy evaluation algorithm. (8) works because $E[\delta(r, s', s) \mid a, s] = A(a, s)$, i.e., the TD error is an unbiased estimate of the advantage function. The benefit of this approach is that it is sometimes easier to approximate $V$ than $Q$ and that the return in the TD error is unprojected, i.e., it is not distorted by function approximation. However, the TD error is noisy, introducing variance in the gradient.

To cope with this variance, we can reduce the learning rate when the variance of the gradient would otherwise explode, using, e.g., Adam (Kingma and Ba, 2014), natural policy gradients (Kakade, 2002; Amari, 1998; Peters and Schaal, 2008a), the adaptive step size method (Pirotta, Restelli, and Bascetta, 2013) or Newton's method (Furmston and Barber, 2012; Parisi, Pirotta, and Restelli, 2016). However, this results in slow learning when the variance is high. One can also use PGPE, which replaces the stochastic policy with a distribution over deterministic policies (Sehnke et al., 2010). However, PGPE precludes updating the current policy during the episode and makes it difficult to explore efficiently.
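The difference between the sarsa update (6) and the expected sarsa update (7) can be illustrated with a minimal tabular sketch. The tables `q` and `POLICY`, the helper `fresh_q`, and all numeric values are hypothetical and chosen only for illustration; with discrete actions the integral in (7) becomes a sum.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """Sarsa step as in (6): bootstrap from the single sampled next action."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]


def expected_sarsa_update(q, policy, s, a, r, s_next, alpha, gamma):
    """Expected sarsa step as in (7): average the critic over the next action."""
    expected_next = sum(p * v for p, v in zip(policy[s_next], q[s_next]))
    target = r + gamma * expected_next
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]


def fresh_q():
    # Hypothetical critic table, indexed as q[state][action].
    return [[0.0, 0.0], [1.0, 3.0]]


# Hypothetical policy table, POLICY[state][action] = probability.
POLICY = [[0.5, 0.5], [0.25, 0.75]]

# Sarsa yields a different updated value for each sampled next action ...
sarsa_results = [
    sarsa_update(fresh_q(), 0, 1, 1.0, 1, a_next, alpha=0.5, gamma=0.9)
    for a_next in (0, 1)
]
# ... while expected sarsa returns their policy-weighted average directly.
expected_result = expected_sarsa_update(
    fresh_q(), POLICY, 0, 1, 1.0, 1, alpha=0.5, gamma=0.9
)
```

In this sketch the two sarsa outcomes differ depending on which next action was sampled, whereas the expected sarsa outcome equals their average under the policy, which is the variance reduction that EPG later transfers from the critic update to the actor's gradient estimate.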
We can also eliminate all variance caused by the policy at the cost of making the policy deterministic and using the DPG update, which usually necessitates performing off-policy exploration. EPG, presented below, reduces to DPG in many useful cases, while providing a principled way to explore and also allowing for stochastic policies.

Yet another way to eliminate variance in the actor is not to have an actor at all, instead selecting actions soft-greedily with respect to $\hat{Q}$ learned using sarsa. This is trivial for discrete actions and can also be done with a one-step Newton method for Q-functions that are quadric in the actions (Gu et al., 2016b).

Expected Policy Gradients

In this section, we propose expected policy gradients (EPG).

Main Algorithm

First, we introduce $I^Q_\pi(s)$ to denote the inner integral in (1):

$$\nabla J = \int_s d\rho(s) \underbrace{\int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\,(Q(a, s) + b(s))}_{I^Q_\pi(s)} = \int_s d\rho(s)\, I^Q_\pi(s). \tag{9}$$

This suggests a new way to write the approximate gradient:

$$\hat{\nabla} J = \sum_{t=0}^{T} \underbrace{\gamma^t\, \hat{I}^{\hat{Q}}_\pi(s_t)}_{g_t}, \tag{10}$$

where $\hat{I}^{\hat{Q}}_\pi(s)$ is some approximation to $I^{\hat{Q}}_\pi(s) = \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\,(\hat{Q}(a, s) + b(s))$. This approach makes explicit that one step in estimating the gradient is to evaluate an integral to estimate $I^{\hat{Q}}_\pi(s)$. The main insight behind EPG is that, given a state, $I^{\hat{Q}}_\pi(s)$ is expressed fully in terms of known quantities. Hence we can manipulate it analytically to obtain a formula or we can just compute the integral using any numerical quadrature if an analytical solution is impossible.

SPG as given in (2) performs this quadrature using a simple one-sample Monte Carlo method. However, relying on such a method is unnecessary. In fact, the actions used to interact with the environment need not be used at all in the evaluation of $\hat{I}^Q_\pi(s)$ since $a$ is a bound variable in the definition of $I^Q_\pi(s)$. The motivation is thus similar to that of expected sarsa but applied to the actor's gradient estimate instead of the critic's update rule. EPG, shown in Algorithm 1, uses (10) to form a policy gradient algorithm that repeatedly estimates $\hat{I}^Q_\pi(s)$ with an integration subroutine.

Algorithm 1 Expected Policy Gradients
    s ← s_0, t ← 0
    initialise optimiser, initialise policy π parametrised by θ
    while not converged do
        g_t ← γ^t DO-INTEGRAL(Q̂, s, π_θ)        ▷ g_t is the estimated policy gradient as per (10)
        θ ← θ + optimiser.UPDATE(g_t)
        a ∼ π(·, s)
        s', r ← simulator.PERFORM-ACTION(a)
        Q̂.UPDATE(s, a, r, s')
        t ← t + 1
        s ← s'
    end while

EPG has benefits even when an analytical solution is not possible: if the action space is low dimensional, numerical quadrature is cheap; if it is high dimensional, it is still often worthwhile to balance the expense of simulating the system with the cost of quadrature. Actually, even in the extreme case of expensive quadrature but cheap simulation, the limited resources available for quadrature could still be better spent on EPG with smart quadrature than SPG with simple Monte Carlo.

One of the motivations of DPG was precisely that the simple one-sample Monte-Carlo quadrature implicitly used by SPG often yields high variance gradient estimates, even with a good baseline. To see why, consider Figure 1 (left).
A simple Monte Carlo method evaluates the integral by sampling one or more times from $\pi(a \mid s)$ (blue) and evaluating $\nabla_\mu \log \pi(a \mid s)\, Q(a, s)$ (red) as a function of $a$. A baseline can decrease the variance by adding a multiple of $\nabla_\mu \log \pi(a \mid s)$ to the red curve, but the problem remains that the red curve has high values where the blue curve is almost zero. Consequently, substantial variance persists, whatever the baseline, even with a simple linear Q-function, as shown in Figure 1 (right). DPG addressed this problem for deterministic policies but EPG extends it to stochastic ones.

[Figure 1: left panel plots the policy PDF and the SPG update against the action; right panel plots the variance of MC against the baseline.]

Figure 1: At left, $\pi(a \mid s)$ for a Gaussian policy with mean $\mu = \theta = 0$ at a given state and constant $\sigma^2$ (blue) and the SPG update $\nabla_\theta \log \pi(a \mid s)\, Q(a, s)$ (in red), obtained for Q = [...]. At right, the variance of a simple single-sample Monte Carlo estimator as a function of the baseline. In a simple multi-sample Monte Carlo method, the variance would go down as the number of samples.

Relationship to Other Methods

EPG has some similarities with VINE sampling (Schulman et al., 2015), which uses an (intrinsically noisy) Monte Carlo quadrature with many samples.² However, the example in Figure 1 shows that even with a computationally expensive many-sample Monte Carlo method, the problem of variance remains, regardless of the baseline.

EPG is also related to variance minimisation techniques that interpolate between two estimators, e.g., (Gu et al., 2016a, Eq. 7) is similar to Corollary 4. However, EPG uses a quadric (not linear) approximation to the critic, which is crucial for exploration. Furthermore, it completely eliminates variance in the inner integral, as opposed to just reducing it.

The idea behind EPG was also independently and concurrently developed as Mean Actor Critic (Asadi et al., 2017), though only for discrete actions and without a supporting theoretical analysis.

Gaussian Policies

EPG is particularly useful when we make the common assumption of a Gaussian policy: we can then perform the integration analytically under reasonable conditions. We show below (see Corollary 3) that the update to the policy mean computed by EPG is equivalent to the DPG update. Moreover, a simple formula for the covariance can be derived (see Lemma 2).
Algorithms 2 and 3 show the resulting special case of EPG, which we call Gaussian policy gradients (GPG). Surprisingly, GPG is on-policy but nonetheless fully equivalent to DPG, an off-policy method, with a particular form of exploration. Hence, GPG, by specifying the policy's covariance, can be seen as a derivation of an exploration strategy for DPG. In this way, GPG addresses an important open question. As we show later, this leads to improved performance in practice.

² VINE sampling also differs from EPG by performing independent rollouts of Q, requiring a simulator with reset.
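The variance argument illustrated by Figure 1 can be reproduced with a small sketch for a one-dimensional Gaussian policy. The linear critic $Q(a) = c_0 + c_1 a$ and every numeric value below are hypothetical choices for illustration, not the settings behind the figure. For this critic the one-sample SPG estimator has variance $k^2/\sigma^2 + 2c_1^2$ with $k = c_0 + c_1\mu + b$, so no baseline $b$ can push it below $2c_1^2$, whereas the exact inner integral is simply $c_1$ and has no variance.

```python
import random


def spg_sample(a, mu, sigma, c0, c1, baseline):
    """One-sample SPG estimate of the gradient with respect to the mean mu."""
    score = (a - mu) / sigma ** 2          # d/dmu of log N(a; mu, sigma^2)
    critic = c0 + c1 * a                   # hypothetical linear critic
    return score * (critic + baseline)


def spg_variance(mu, sigma, c0, c1, baseline):
    """Closed-form variance of spg_sample when a ~ N(mu, sigma^2)."""
    k = c0 + c1 * mu + baseline
    return k ** 2 / sigma ** 2 + 2.0 * c1 ** 2


def epg_gradient(c1):
    """Exact inner integral for the linear critic; it has zero variance."""
    return c1


def monte_carlo_check(mu, sigma, c0, c1, baseline, n=100_000, seed=0):
    """Empirical mean and variance of the one-sample SPG estimator."""
    rng = random.Random(seed)
    samples = [
        spg_sample(rng.gauss(mu, sigma), mu, sigma, c0, c1, baseline)
        for _ in range(n)
    ]
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, var
```

Calling `monte_carlo_check(0.0, 1.0, 0.5, 0.5, 0.0)` should give a mean close to `epg_gradient(0.5)` and a variance close to `spg_variance(0.0, 1.0, 0.5, 0.5, 0.0)`. Under these assumptions the best baseline is $b = -Q(\mu)$, which still leaves a variance of $2c_1^2$.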

Algorithm 2 Gaussian Policy Gradients
    s ← s_0, t ← 0
    initialise optimiser
    while not converged do
        g_t ← γ^t DO-INTEGRAL-GAUSS(Q̂, s, π_θ)
        θ ← θ + optimiser.UPDATE(g_t)              ▷ policy parameters θ are updated using gradient
        Σ_s ← GET-COVARIANCE(Q̂, s, π_θ)           ▷ Σ_s computed from scratch
        a ∼ π(· | s)                               ▷ π(· | s) = N(µ_s, Σ_s)
        s', r ← simulator.PERFORM-ACTION(a)
        Q̂.UPDATE(s, a, r, s')
        t ← t + 1
        s ← s'
    end while

Algorithm 3 Gaussian Integrals
    function DO-INTEGRAL-GAUSS(Q̂, s, π_θ)
        I^Q_{π(s),µ_s} ← (∇µ_s) ∇_a Q̂(a = µ_s, s)   ▷ Use Lemma 1
        return I^Q_{π(s),µ_s}
    end function

    function GET-COVARIANCE(Q̂, s, π_θ)
        H ← COMPUTE-HESSIAN(Q̂(µ_s, s))
        return σ_0² e^{cH}                          ▷ Use Lemma 2
    end function

The computational cost of GPG is small: while it must store a Hessian matrix $H(a, s) = \nabla^2_a \hat{Q}(a, s)$, its size is only $d \times d$, where $A = \mathbb{R}^d$, which is typically small, e.g., $d = 6$ for HalfCheetah-v1. This Hessian is the same size as the policy's covariance matrix, which any policy gradient must store anyway, and should not be confused with the Hessian with respect to the parameters of the neural network, as used with Newton's or natural gradient methods (Peters and Schaal, 2008a; Furmston, Lever, and Barber, 2016), which can easily have thousands of entries. Hence, GPG obtains EPG's variance reduction essentially for free.

Analysis

In this section, we analyse EPG, showing that it unifies SPG and DPG, that $\hat{I}^Q_\pi(s)$ can often be computed analytically, and that EPG has lower variance than SPG.

General Policy Gradient Theorem

We begin by stating our most general result, showing that EPG can be seen as a generalisation of both SPG and DPG. To do this, we first state a new general policy gradient theorem. We use the shorthand $\nabla$ without a subscript to denote the gradient with respect to policy parameters $\theta$.

Theorem 1 (General Policy Gradient Theorem). If $\pi(\cdot, s)$ is a normalised Lebesgue measure for all $s$, then

$$\nabla J = \int_s d\rho(s) \underbrace{\left[\nabla V(s) - \int_a d\pi(a \mid s)\, \nabla Q(a, s)\right]}_{I_G(s)}.$$

Proof. We begin by expanding the following expression.

$$\begin{aligned}
\int_s d\rho(s) \int_a d\pi(a \mid s)\, \nabla Q(a, s)
&= \int_s d\rho(s) \int_a d\pi(a \mid s)\, \nabla\Big(R(a, s) + \gamma \int_{s'} dp(s' \mid a, s)\, V(s')\Big) \\
&= \int_s d\rho(s) \int_a d\pi(a \mid s) \Big(\underbrace{\nabla R(a, s)}_{0} + \gamma \int_{s'} dp(s' \mid a, s)\, \nabla V(s')\Big) \\
&= \gamma \int_s d\rho(s) \int_{s'} dp_\pi(s' \mid s)\, \nabla V(s') \\
&= \int_s d\rho(s)\, \nabla V(s) - \underbrace{\int_s dp_0(s)\, \nabla V(s)}_{\nabla J} \\
&= \int_s d\rho(s)\, \nabla V(s) - \nabla J.
\end{aligned}$$
The first equality follows by expanding the definition of $Q$ and the penultimate one follows from Lemma B (in the supplement). Then the theorem follows by rearranging terms.

The crucial benefit of Theorem 1 is that it works for all policies, both stochastic and deterministic, unifying previously separate derivations for the two settings. To show this, in the following two corollaries, we use Theorem 1 to recover the stochastic policy gradient theorem (Sutton et al., 2000) and the deterministic policy gradient theorem (Silver et al., 2014), in each case by introducing additional assumptions to obtain a formula for $I_G(s)$ expressible in terms of known quantities.

Corollary 1 (Stochastic Policy Gradient Theorem). If $\pi(\cdot \mid s)$ is differentiable, then

$$\nabla J = \int_s d\rho(s)\, I_G(s) = \int_s d\rho(s) \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\, Q(a, s).$$

Proof. We obtain the following by expanding $\nabla V$.

$$\nabla V = \nabla \int_a d\pi(a \mid s)\, Q(a, s) = \int_a da\, (\nabla \pi(a \mid s))\, Q(a, s) + \int_a d\pi(a \mid s)\, (\nabla Q(a, s)).$$

We obtain $I_G(s) = \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\, Q(a, s) = I^Q_\pi(s)$ by plugging this into the definition of $I_G(s)$. We obtain $\nabla J$ by invoking Theorem 1 and plugging in the above expression for $I_G(s)$.

We now recover the DPG update introduced in (3).

Corollary 2 (Deterministic Policy Gradient Theorem). If $\pi(\cdot \mid s)$ is a Dirac-delta measure (i.e., a deterministic policy) and $Q(\cdot, s)$ is differentiable, then

$$\nabla J = \int_s d\rho(s)\, I_G(s) = \int_s d\rho(s)\, \nabla \pi(s) \left[\nabla_a Q(a, s)\right]_{a = \pi(s)}.$$

Proof. We begin by obtaining an expression for $I_G(s)$.

$$I_G(s) = \nabla V(s) - \int_a d\pi(a \mid s)\, \nabla Q(a, s) = \nabla V(s) - \gamma \int_{s'} dp_\pi(s' \mid s)\, \nabla V(s') = \nabla \pi(s) \left[\nabla_a Q(a, s)\right]_{a = \pi(s)}.$$

Here, the second equality follows by expanding the definition of $Q$ and the third follows from an established deterministic policy gradient result (Silver et al., 2014, Supplement, Eq. 1). We can then obtain $\nabla J$ by invoking Theorem 1 and plugging in the above expression for $I_G(s)$.

These corollaries show that the choice between deterministic and stochastic policy gradients is fundamentally a choice of quadrature method. Hence, the empirical success of DPG relative to SPG (Silver et al., 2014; Lillicrap et al., 2015) can be understood in a new light. In particular, it can be attributed, not to a fundamental limitation of stochastic policies (indeed, stochastic policies are sometimes preferred), but instead to superior quadrature. DPG integrates over a Dirac-delta measure, which is known to be easy, while SPG typically relies on simple Monte Carlo integration. Thanks to EPG, a deterministic approach is no longer required to obtain a method with low variance.

We add as a sideline that since Theorem 1 can be written as $I_G(s) = \nabla V(s) - \gamma \int_{s'} dp_\pi(s' \mid s)\, \nabla V(s')$, which involves the derivative of a value function, GPG resembles value gradients (Heess et al., 2015). However, in our case, we are learning $\nabla J$ directly and do not perform a recursive estimation of $\nabla V$ as value gradient methods do.

Analytical Quadrature - Gaussian Policy

We now derive a lemma supporting GPG.

Lemma 1 (Gaussian Policy Gradients). If the policy is Gaussian, i.e. $\pi(\cdot \mid s) \sim \mathcal{N}(\mu_s, \Sigma_s)$ with $\mu_s$ and $\Sigma_s^{1/2}$ parametrised by $\theta$, where $\Sigma_s^{1/2}$ is symmetric and $\Sigma_s^{1/2} \Sigma_s^{1/2} = \Sigma_s$, and the critic is of the form $Q(a, s) = a^\top A(s)\, a + a^\top B(s) + \text{const}$ where $A(s)$ is symmetric for every $s$, then $I^Q_\pi(s) = I^Q_{\pi(s),\mu_s} + I^Q_{\pi(s),\Sigma_s^{1/2}}$, where the mean and covariance components are given by

$$I^Q_{\pi(s),\mu_s} = (\nabla \mu_s)(2A(s)\mu_s + B(s)) \quad \text{and} \quad I^Q_{\pi(s),\Sigma_s^{1/2}} = (\nabla \Sigma_s^{1/2})\, 2A(s)\, \Sigma_s^{1/2}.$$

See Lemma 1 in the supplement for proof of this result. While Lemma 1 requires the critic to be quadric in the actions, this assumption is not very restrictive since the coefficients $B(s)$ and $A(s)$ can be arbitrary continuous functions of the state, e.g., a neural network.

Arbitrary Critics

If $Q$ does not meet the conditions of Lemma 1, we can approximate $Q$ with a quadric function in the neighbourhood of the policy mean.
This approximation is motivated by two arguments. First, in MDPs that model physical systems with reasonable reward functions, $Q$ is fairly smooth. Second, policy gradients are local, incremental methods anyway: since the policy mean changes slowly, the values of $Q$ for actions far from the policy mean are usually not relevant for the current update.

Corollary 3 (Approximate Gaussian Policy Gradients with an Arbitrary Critic). If the policy is Gaussian, i.e. $\pi(\cdot \mid s) \sim \mathcal{N}(\mu_s, \Sigma_s)$ with $\mu_s$ and $\Sigma_s^{1/2}$ parametrised by $\theta$ as in Lemma 1, and any critic $Q(a, s)$ doubly differentiable with respect to actions for each state, then

$$I^Q_{\pi(s),\mu_s} \approx (\nabla \mu_s)\, \nabla_a Q(a = \mu_s, s) \quad \text{and} \quad I^Q_{\pi(s),\Sigma_s^{1/2}} \approx (\nabla \Sigma_s^{1/2})\, H(\mu_s, s)\, \Sigma_s^{1/2},$$

where $H(\mu_s, s)$ is the Hessian of $Q$ with respect to $a$, evaluated at $\mu_s$ for a fixed $s$.

Proof. We begin by approximating the critic (for a given $s$) using the first two terms of the Taylor expansion of $Q$ in $\mu_s$.

$$\begin{aligned}
Q(a, s) &\approx Q(\mu_s, s) + (a - \mu_s)^\top \nabla_a Q(a = \mu_s, s) + \tfrac{1}{2}(a - \mu_s)^\top H(\mu_s, s)(a - \mu_s) \\
&= \tfrac{1}{2}\, a^\top H(\mu_s, s)\, a + a^\top\big(\nabla_a Q(a = \mu_s, s) - H(\mu_s, s)\mu_s\big) + \text{const}.
\end{aligned}$$

Because of the series truncation, the function on the right-hand side is quadric and we can then use Lemma 1:

$$\begin{aligned}
I^Q_{\pi(s),\mu_s} &= \nabla \mu_s\big(2 \cdot \tfrac{1}{2} H(\mu_s, s)\mu_s + \nabla_a Q(a = \mu_s, s) - H(\mu_s, s)\mu_s\big) = \nabla \mu_s\, \nabla_a Q(a = \mu_s, s), \\
I^Q_{\pi(s),\Sigma_s^{1/2}} &= (\nabla \Sigma_s^{1/2})\big(2 \cdot \tfrac{1}{2} H(\mu_s, s)\, \Sigma_s^{1/2}\big) = (\nabla \Sigma_s^{1/2})\, H(\mu_s, s)\, \Sigma_s^{1/2}.
\end{aligned}$$

To actually obtain the Hessian, we could use automatic differentiation to compute it analytically. Alternatively, we can observe that, if the critic really is quadric, we can just read off the coefficients of the quadric term directly. Therefore, we can approximate the Hessian by generating a number of random action-values around $\mu_s$, computing the Q values, and (locally) fitting a quadric. This process is typically more computationally expensive than automatic differentiation but has the advantage of working with ReLU networks (where the true Hessian is zero but we still have a kind of global curvature after smoothing) and leveraging more information from the critic (since the evaluation is at more than one point).

Linear GPG

We now state a consequence of Lemma 1 for the case when the critic $Q$ is linear in the actions, i.e., the quadric term is always zero.

Corollary 4 (Linear Gaussian Policy Gradients).
If the policy is Gaussian, i.e., $\pi(\cdot \mid s) \sim \mathcal{N}(\mu_s, \Sigma_s)$ with $\mu_s$ parametrised by $\theta$ and the critic is of the form $Q(a, s) = a^\top B(s) + \text{const}$, then $I^Q_\pi(s) = (\nabla \mu_s) B(s)$. Moreover, it is unnecessary to parameterise $\Sigma^{1/2}$ since the policy gradient w.r.t. $\Sigma^{1/2}$ is zero (i.e., a linear Q-function does not give any information about the exploration covariance).

We make Corollary 4 explicit for two reasons. First, it is useful for showing an equivalence between DPG and EPG (see below). Second, it may actually be useful for a non-trivial class of physical systems: if the time-sampling frequency is high enough (which implies acting in small steps), the critic is effectively only used to say if a small step one way is preferable to a small step the other way, a linear property.

Equivalence between EPG and DPG

The update for the policy mean obtained in Corollary 3 is the same as the DPG update, linking the two methods:

$$I^Q_{\pi(s),\mu_s} = (\nabla \mu_s)\, \nabla_a Q(a = \mu_s, s).$$
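Lemma 1 and Corollary 3 can be checked numerically in one dimension, which also illustrates what an integration subroutine based on plain numerical quadrature might look like. The sketch below is an illustration under stated assumptions rather than the paper's implementation: the policy parameters are taken to be $\mu$ and $\sigma = \Sigma^{1/2}$ themselves (so $\nabla\mu = \nabla\Sigma^{1/2} = 1$), the quadrature is a simple trapezoid rule, and the function names and test critics are hypothetical.

```python
import math


def gaussian_expectation(f, mu, sigma, points=4001, width=10.0):
    """Trapezoid-rule estimate of E[f(a)] for a ~ N(mu, sigma^2)."""
    lo = mu - width * sigma
    step = 2.0 * width * sigma / (points - 1)
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i in range(points):
        a = lo + i * step
        density = math.exp(-0.5 * ((a - mu) / sigma) ** 2) / norm
        weight = 0.5 if i in (0, points - 1) else 1.0
        total += weight * density * f(a)
    return total * step


def epg_components(critic, mu, sigma):
    """Inner integral of (1), split into the gradients w.r.t. mu and sigma."""
    # d/dmu log pi(a) = (a - mu) / sigma^2
    mean_part = gaussian_expectation(
        lambda a: (a - mu) / sigma ** 2 * critic(a), mu, sigma
    )
    # d/dsigma log pi(a) = (a - mu)^2 / sigma^3 - 1 / sigma
    std_part = gaussian_expectation(
        lambda a: ((a - mu) ** 2 / sigma ** 3 - 1.0 / sigma) * critic(a), mu, sigma
    )
    return mean_part, std_part
```

For a hypothetical quadric critic $Q(a) = Aa^2 + Ba + \text{const}$, `epg_components` should return values matching $2A\mu + B$ and $2A\sigma$, as Lemma 1 states. For a smooth non-quadric critic such as `math.tanh` and a small $\sigma$, the outputs should be close to $Q'(\mu)$ and $Q''(\mu)\sigma$, in line with Corollary 3.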

We now formalise the equivalences between EPG and DPG. First, on-policy GPG with a linear critic (or an arbitrary critic approximated by the first term in the Taylor expansion) is equivalent to DPG with a Gaussian exploration policy where the covariance stays the same. This follows from Corollary 4. Second, on-policy GPG with a quadric critic (or an arbitrary critic approximated by the first two terms in the Taylor expansion) is equivalent to DPG with a Gaussian exploration policy where the covariance is computed using the update (where $\alpha_n$ is a sequence of step-sizes):

$$\Sigma_s^{1/2} \leftarrow \Sigma_s^{1/2} + \alpha_n H(s)\, \Sigma_s^{1/2}. \tag{11}$$

This follows from Corollary 3. Third, and most generally, for any critic at all (not necessarily quadric), DPG is a kind of EPG for a particular choice of quadrature (using a Dirac measure). This follows from Theorem 1.

Surprisingly, this means that DPG, normally considered to be off-policy, can also be seen as on-policy when exploring with Gaussian noise. Furthermore, the compatible critic for DPG (Silver et al., 2014) is indeed linear in the actions. Hence, this relationship holds whenever DPG uses a compatible critic.³ Furthermore, Lemma 1 lends new legitimacy to the common practice of replacing the critic required by the DPG theory, which approximates $\nabla_a Q$, with one that approximates $Q$ itself, as done in SPG and EPG.

Exploration using the Hessian

The second equivalence given above suggests that we can include the covariance in the actor network and learn it along with the mean. However, another option is to compute it from scratch at each iteration by analytically computing the result of applying (11) infinitely many times.

Lemma 2 (Exploration Limit). The iterative procedure defined by the equation $\Sigma_s^{1/2} \leftarrow \Sigma_s^{1/2} + \alpha H(s)\, \Sigma_s^{1/2}$ applied $n$ times using the diminishing learning rate $\alpha = 1/n$ converges to $\Sigma_s^{1/2} \propto e^{H(s)}$ as $n \to \infty$.

Proof. Consider the sequence $(\Sigma_s^{1/2})_0 = \sigma_0 I$, $(\Sigma_s^{1/2})_n = (\Sigma_s^{1/2})_{n-1} + \alpha H(s)(\Sigma_s^{1/2})_{n-1}$. Expanding out the recursion, the $n$-th element of the sequence is given as $(\Sigma_s^{1/2})_n = (I + \alpha H(s))^n (\Sigma_s^{1/2})_0$. We diagonalise the Hessian as $H(s) = U \Lambda U^\top$ for some orthonormal matrix $U$ and obtain the following expression for $(\Sigma_s^{1/2})_n$.
$$(\Sigma_s^{1/2})_n = (I + \alpha U \Lambda U^\top)^n (\Sigma_s^{1/2})_0 = U (I + \alpha \Lambda)^n U^\top (\Sigma_s^{1/2})_0.$$

Since we have $\lim_{n \to \infty} (1 + \tfrac{1}{n}\lambda)^n = e^\lambda$ for each diagonal entry of $\Lambda$, we plug in $\alpha = \tfrac{1}{n}$ and obtain the identity:

$$\lim_{n \to \infty} (\Sigma_s^{1/2})_n = U e^{\Lambda} U^\top (\Sigma_s^{1/2})_0 = \sigma_0 e^{H(s)}.$$

³ The notion of compatibility of a critic is different for stochastic and deterministic policy gradients.

The practical implication of Lemma 2 is that, in a policy gradient method, it is justified to use Gaussian exploration with covariance proportional to $e^{cH}$ for some reward scaling constant $c$. Thus by exploring with (scaled) covariance $e^{cH}$, we obtain a principled alternative to the Ornstein-Uhlenbeck heuristic defined in (5). Our results below show that it also performs much better in practice.

Lemma 2 has an intuitive interpretation. If $H(s)$ has a large positive eigenvalue $\lambda$, then $\hat{Q}(\cdot, s)$ has a sharp minimum along the corresponding eigenvector, and the corresponding eigenvalue of $\Sigma$ is $e^\lambda$, i.e., also large. The result is a large exploration bonus along that direction, enabling the algorithm to leave local minima. Conversely, if $\lambda$ is negative, then $\hat{Q}(\cdot, s)$ has a maximum and so $e^\lambda$ is small, since exploration is not needed.

Variance Analysis

We now prove that for any policy, the EPG estimator of (10) has lower variance than the SPG estimator of (2).

Lemma 3. If for all $s \in S$, the random variable $\nabla \log \pi(a \mid s)\, \hat{Q}(a, s)$ where $a \sim \pi(\cdot \mid s)$ has nonzero variance, then

$$V_\tau\left[\sum_{t=0}^{\infty} \gamma^t\, \nabla \log \pi(a_t \mid s_t)\,(\hat{Q}(a_t, s_t) + b(s_t))\right] > V_\tau\left[\sum_{t=0}^{\infty} \gamma^t\, I^{\hat{Q}}_\pi(s_t)\right].$$

The proof is deferred to the supplement (see Lemma 3 there). Lemma 3's assumption is reasonable since the only way a random variable $\nabla \log \pi(a \mid s)\, \hat{Q}(a, s)$ could have zero variance is if it were the same for all actions in the policy's support (except for sets of measure zero), in which case optimising the policy would be unnecessary. Since we know that both the estimators of (2) and (10) are unbiased, the estimator with lower variance has lower MSE.

Extension to Entropy Regularisation

On-policy SPG sometimes includes an entropy term in the gradient in order to aid exploration by making the policy more stochastic. The gradient of the differential entropy⁴ $H(s)$ of the policy at state $s$ is defined as follows.
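$$\begin{aligned}
-\nabla H(s) &= \nabla \int_a d\pi(a \mid s) \log \pi(a \mid s) \\
&= \int_a da\, \nabla \pi(a \mid s) \log \pi(a \mid s) + \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s) \\
&= \int_a da\, \nabla \pi(a \mid s) \log \pi(a \mid s) + \int_a d\pi(a \mid s)\, \frac{1}{\pi(a \mid s)} \nabla \pi(a \mid s) \\
&= \int_a da\, \nabla \pi(a \mid s) \log \pi(a \mid s) + \underbrace{\nabla \int_a d\pi(a \mid s)\, 1}_{0} \\
&= \int_a da\, \nabla \pi(a \mid s) \log \pi(a \mid s) = \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s) \log \pi(a \mid s).
\end{aligned}$$

Typically, we weight the entropy update with the policy gradient update:

$$I^E_G(s) = I_G(s) + \alpha \nabla H(s) = \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\,(Q(a, s) - \alpha \log \pi(a \mid s)).$$

This equation makes clear that performing entropy regularisation is equivalent to using a different critic with Q-values shifted by $\alpha \log \pi(a \mid s)$; this holds for both SPG and EPG.

⁴ For discrete action spaces, the same derivation with integrals replaced by sums holds for the entropy.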
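As a sanity check of the entropy-gradient identity, consider a one-dimensional Gaussian policy $\pi(a) = \mathcal{N}(\mu, \sigma^2)$ and differentiate with respect to $\sigma$ only. This worked example is an illustrative assumption and not a setting taken from the paper; $z$ denotes the standardised action.

```latex
\begin{align*}
H &= \tfrac{1}{2}\log(2\pi e \sigma^2), \qquad \nabla_\sigma H = \frac{1}{\sigma},\\
\log \pi(a) &= -\tfrac{1}{2}\log(2\pi\sigma^2) - \tfrac{1}{2}z^2, \qquad z = \frac{a-\mu}{\sigma},\\
\nabla_\sigma \log \pi(a) &= \frac{z^2 - 1}{\sigma},\\
-\int_a d\pi(a)\, \nabla_\sigma \log\pi(a)\, \log\pi(a)
 &= \frac{1}{\sigma}\,\mathbb{E}\!\left[(z^2-1)\left(\tfrac{1}{2}\log(2\pi\sigma^2) + \tfrac{1}{2}z^2\right)\right]\\
 &= \frac{1}{2\sigma}\left(\mathbb{E}[z^4] - \mathbb{E}[z^2]\right) = \frac{1}{2\sigma}(3 - 1) = \frac{1}{\sigma}.
\end{align*}
```

Both routes give $\nabla_\sigma H = 1/\sigma$: the constant term drops out because $\mathbb{E}[z^2 - 1] = 0$, which is the same cancellation that removes the $\nabla \int_a d\pi(a \mid s)$ term in the general derivation.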

[Table 1 body: columns Domain, $\hat{\sigma}_{DPG}$, $\hat{\sigma}_{EPG}$; rows HalfCheetah-v1, InvertedPendulum-v1, Reacher2d-v1, Walker2d. Most numeric entries are missing; the surviving fragments are: [241.45, ...; [0.63, 2.31]; [450.58, ...; [875.54, ...; n/a; 0.13 [0.07, ...; [631.98, ...]

Table 1: Estimated standard deviation (mean and 90% interval) across runs after learning.

Experiments

While EPG has many potential uses, we focus on empirically evaluating one particular application: exploration driven by the Hessian exponential (as introduced in Algorithm 2 and Lemma 2), replacing the standard Ornstein-Uhlenbeck (OU) exploration in continuous action domains. To this end, we applied EPG to four domains modelled with the MuJoCo physics simulator (Todorov, Erez, and Tassa, 2012): HalfCheetah-v1, InvertedPendulum-v1, Reacher2d-v1 and Walker2d-v1 and compared its performance to DPG and SPG. In practice, EPG differed from deep DPG (Lillicrap et al., 2015; Silver et al., 2014) only in the exploration strategy, though their theoretical underpinnings are different. The hyperparameters for DPG and those of EPG that are not related to exploration were taken from an existing benchmark (Islam et al., 2017; Brockman et al., 2016). The exploration hyperparameters for EPG were $\sigma_0^2 = 0.2$ and $c = 1.0$ where the exploration covariance is $\sigma_0^2 e^{cH}$. These values were obtained using a grid search from the set $\{0.2, 0.5, 1\}$ for $\sigma_0^2$ and $\{0.5, 1.0, 2.0\}$ for $c$ over the HalfCheetah-v1 domain. Since $c$ is just a constant scaling the rewards, it is reasonable to set it to 1.0 whenever reward scaling is already used. Hence, our exploration strategy has just one hyperparameter $\sigma_0^2$ as opposed to specifying a pair of parameters (standard deviation and mean reversion constant) for OU. We used the same learning parameters for the other domains. For SPG⁵, we used OU exploration and a constant diagonal covariance of 0.2 in the actor update (this approximately corresponds to the average variance of the OU process over time). The other parameters for SPG are the same as for the rest of the algorithms. For the learning curves, we obtained 90% confidence intervals around the learning curves.
The learning curves show results of independent evaluation runs which used actions generated by the policy mean without any exploration noise.

The results (Figure 2) show that EPG's exploration strategy yields much better performance than DPG with OU. Furthermore, SPG does poorly, solving only the easiest domain (InvertedPendulum-v1) reasonably quickly, achieving slow progress on HalfCheetah-v1, and failing entirely on the other domains. This is not surprising, as DPG was introduced precisely to solve the problem of high variance SPG estimates on this type of problem. In InvertedPendulum-v1, SPG initially learns quickly, outperforming the other methods. This is because noisy gradient updates provide a crude, indirect form of exploration that happens to suit this problem. Clearly, this is inadequate for more complex domains: even for this simple domain it leads to subpar performance late in learning.

⁵ We tried learning the covariance for SPG but the covariance estimate was unstable; no regularisation hyperparameters we tested matched SPG's performance with OU even on the simplest domain.

Figure 2: Learning curves (mean and 90% interval) for HalfCheetah-v1 (top left), InvertedPendulum-v1 (top right), Reacher2d-v1 (bottom left, clipped at -14) and Walker2d-v1 (bottom right). The number of independent training runs is in parentheses. Horizontal axis is scaled in thousands of steps.

Figure 3: Three runs for EPG (left), DPG (middle) and SPG (right) for the InvertedPendulum-v1 domain, demonstrating that EPG shows much less unlearning.

In addition, EPG typically learns more consistently than DPG with OU. In two tasks, the empirical standard deviation across runs of EPG ($\hat{\sigma}_{EPG}$) was substantially lower than that of DPG ($\hat{\sigma}_{DPG}$) at the end of learning, as shown in Table 1. For the other two domains, the confidence intervals around the empirical standard deviations for DPG and EPG were too wide to draw conclusions.

Surprisingly, for InvertedPendulum-v1, DPG's learning curve declines late in learning. The reason can be seen in the individual runs shown in Figure 3: both DPG and SPG suffer from severe unlearning.
This unlearning cannot be explained by exploration noise since the evaluation runs just use the mean action, without exploring. Instead, OU exploration in DPG may be too coarse, causing the optimiser to exit good optima, while SPG unlearns due to noise in the gradients. The noise also helps speed initial learning, as described above, but this does not transfer to other domains. EPG avoids this problem by automatically reducing the noise when it finds a good optimum, i.e., a Hessian with large negative eigenvalues.
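The exploration covariance $\sigma_0^2 e^{cH}$ used in these experiments, and the limit in Lemma 2 that motivates it, can be sketched as follows. This is a minimal illustration, assuming a small symmetric Hessian stored as nested lists and a simple Taylor-series matrix exponential; the helper names and the example Hessians are hypothetical, and a real implementation would more likely rely on a linear algebra library.

```python
import math


def matmul(x, y):
    """Product of two square matrices stored as nested lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def expm(m, terms=20, squarings=10):
    """Matrix exponential via a Taylor series with scaling and squaring."""
    n = len(m)
    scaled = [[v / 2 ** squarings for v in row] for row in m]
    result, term = identity(n), identity(n)
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in matmul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


def exploration_covariance(hessian, sigma0_sq=0.2, c=1.0):
    """Covariance sigma_0^2 * exp(c * H), with the defaults reported above."""
    exp_ch = expm([[c * v for v in row] for row in hessian])
    return [[sigma0_sq * v for v in row] for row in exp_ch]


def iterate_update(hessian, n):
    """Apply update (11) n times with step size 1/n, starting from the identity."""
    size = len(hessian)
    sqrt_cov = identity(size)
    for _ in range(n):
        step = matmul(hessian, sqrt_cov)
        sqrt_cov = [[sqrt_cov[i][j] + step[i][j] / n for j in range(size)]
                    for i in range(size)]
    return sqrt_cov
```

For a hypothetical Hessian with eigenvalues $-2$ and $1$, `exploration_covariance` shrinks the noise along the first direction (near a maximum of the critic) and enlarges it along the second, and `iterate_update(hessian, n)` approaches `expm(hessian)` as `n` grows, which is the statement of Lemma 2 with $\sigma_0 = 1$.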

Conclusions

This paper proposed a new policy gradient method called expected policy gradients (EPG), that integrates across the action selected by the stochastic policy. We used EPG to prove a new general policy gradient theorem subsuming the stochastic and deterministic policy gradient theorems. We also showed that, under certain realistic conditions, the quadrature required by EPG can be performed analytically, allowing DPG with principled exploration. We presented empirical results confirming that this application of EPG outperforms DPG and SPG on four domains.

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement number ).

References

Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural computation 10(2).
Asadi, K.; Allen, C.; Roderick, M.; Mohamed, A.-r.; Konidaris, G.; and Littman, M. 2017. Mean Actor Critic. ArXiv e-prints.
Baird, L., et al. 1995. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the twelfth international conference on machine learning.
Bhatnagar, S.; Ghavamzadeh, M.; Lee, M.; and Sutton, R. S. 2008. Incremental natural actor-critic algorithms. In Advances in neural information processing systems.
Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI gym. arXiv preprint.
Furmston, T., and Barber, D. 2012. A unifying perspective of parametric policy search methods for Markov decision processes. In Advances in neural information processing systems.
Furmston, T.; Lever, G.; and Barber, D. 2016. Approximate Newton methods for policy search in Markov decision processes. Journal of Machine Learning Research 17(227):1-51.
Gu, S.; Lillicrap, T.; Ghahramani, Z.; Turner, R. E.; and Levine, S. 2016a. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint.
Gu, S.; Lillicrap, T.; Sutskever, I.; and Levine, S. 2016b. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning.
Heess, N.; Wayne, G.; Silver, D.; Lillicrap, T.; Erez, T.; and Tassa, Y. 2015. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems.
Islam, R.; Henderson, P.; Gomrokchi, M.; and Precup, D. 2017. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint.
Kakade, S. M. 2002. A natural policy gradient. In Advances in neural information processing systems.
Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint.
Lagoudakis, M. G., and Parr, R. 2003. Least-squares policy iteration. Journal of machine learning research 4(Dec).
Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint.
Parisi, S.; Pirotta, M.; and Restelli, M. 2016. Multi-objective reinforcement learning through continuous pareto manifold approximation. Journal of Artificial Intelligence Research 57.
Peters, J., and Schaal, S. 2006. Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on. IEEE.
Peters, J., and Schaal, S. 2008a. Natural actor-critic. Neurocomputing 71(7).
Peters, J., and Schaal, S. 2008b. Reinforcement learning of motor skills with policy gradients. Neural networks 21(4).
Pirotta, M.; Restelli, M.; and Bascetta, L. 2013. Adaptive step-size for policy gradient methods. In Advances in Neural Information Processing Systems.
Rummery, G. A., and Niranjan, M. 1994. On-line Q-learning using connectionist systems. University of Cambridge, Department of Engineering.
Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15).
Sehnke, F.; Osendorfer, C.; Rückstieß, T.; Graves, A.; Peters, J.; and Schmidhuber, J. 2010. Parameter-exploring policy gradients. Neural Networks 23(4).
Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In ICML.
Sutton, R. S., and Barto, A. G. 1998. Reinforcement learning: An introduction, volume 1. MIT press Cambridge.
Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems.
Sutton, R. S. 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in neural information processing systems.
Todorov, E.; Erez, T.; and Tassa, Y. 2012. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE.
Uhlenbeck, G. E., and Ornstein, L. S. 1930. On the theory of the Brownian motion. Physical review 36(5):823.
van Seijen, H.; van Hasselt, H.; Whiteson, S.; and Wiering, M. 2009. A theoretical and empirical analysis of expected sarsa. In ADPRL 2009: Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.

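Supplement

We first provide formal proofs for certain statements invoked by our paper. We then provide a brief discussion of the use of a learning rate that diminishes in the trajectory length in the computation of the covariance.

Proofs

First, we prove two lemmas concerning the discounted-ergodic measure $\rho(s)$, which have been implicitly relied on for some time but, as far as we could find, never proved explicitly.

Definition 1 (Time-dependent occupancy).
$$p(s \mid t = 0) = p_0(s), \qquad p(s' \mid t = i + 1) = \int_s p(s' \mid s)\, p(s \mid t = i)\, ds \quad \text{for } i \ge 0.$$

Definition 2 (Truncated trajectory). Define the trajectory truncated after $N$ steps as $\tau_N = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_N)$.

Observation 1 (Expectation with respect to a truncated trajectory). Since $\tau_N = (s_0, s_1, s_2, \dots, s_N)$ is associated with the density $\prod_{i=0}^{N-1} p(s_{i+1} \mid s_i)\, p_0(s_0)$, we have that
$$\begin{aligned}
\mathbb{E}_{\tau_N}\Big[\sum_{i=0}^{N} \gamma^i f(s_i)\Big]
&= \int_{s_0, s_1, \dots, s_N} \Big(\prod_{j=0}^{N-1} p(s_{j+1} \mid s_j)\Big)\, p_0(s_0) \Big(\sum_{i=0}^{N} \gamma^i f(s_i)\Big)\, ds_0\, ds_1 \dots ds_N \\
&= \sum_{i=0}^{N} \int_{s_0, s_1, \dots, s_N} \Big(p_0(s_0) \prod_{j=0}^{N-1} p(s_{j+1} \mid s_j)\Big)\, \gamma^i f(s_i)\, ds_0\, ds_1 \dots ds_N \\
&= \sum_{i=0}^{N} \int_s p(s \mid t = i)\, \gamma^i f(s)\, ds
\end{aligned}$$
for any function $f$.

Definition 3 (Expectation with respect to an infinite trajectory). For any bounded function $f$, we define
$$\mathbb{E}_{\tau}\Big[\sum_{i=0}^{\infty} \gamma^i f(s_i)\Big] \triangleq \lim_{N \to \infty} \mathbb{E}_{\tau_N}\Big[\sum_{i=0}^{N} \gamma^i f(s_i)\Big].$$
Here, the sum on the left-hand side is part of the symbol being defined.

Observation 2 (Property of expectation with respect to an infinite trajectory).
$$\mathbb{E}_{\tau}\Big[\sum_{i=0}^{\infty} \gamma^i f(s_i)\Big]
= \lim_{N \to \infty} \mathbb{E}_{\tau_N}\Big[\sum_{i=0}^{N} \gamma^i f(s_i)\Big]
= \lim_{N \to \infty} \sum_{i=0}^{N} \int_s p(s \mid t = i)\, \gamma^i f(s)\, ds
= \sum_{i=0}^{\infty} \int_s p(s \mid t = i)\, \gamma^i f(s)\, ds$$
for any bounded function $f$.

Definition 4 (Discounted-ergodic occupancy measure $\rho$).
$$\rho(s) = \sum_{i=0}^{\infty} \gamma^i\, p(s \mid t = i).$$
The measure $\rho$ is not normalised in general. Intuitively, it can be thought of as marginalising out the time in the system dynamics.

Lemma 4 (Discounted-ergodic property). For any bounded function $f$:
$$\int_s \rho(s) f(s)\, ds = \mathbb{E}_{\tau}\Big[\sum_{i=0}^{\infty} \gamma^i f(s_i)\Big].$$

Proof.
$$\mathbb{E}_{\tau}\Big[\sum_{i=0}^{\infty} \gamma^i f(s_i)\Big]
= \sum_{i=0}^{\infty} \int_s \gamma^i\, p(s \mid t = i)\, f(s)\, ds
= \int_s \underbrace{\Big(\sum_{i=0}^{\infty} \gamma^i\, p(s \mid t = i)\Big)}_{\rho(s)} f(s)\, ds.$$
Here, the first equality follows from Observation 2.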
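The discounted-ergodic property can be illustrated on a small finite-state chain, where the integrals become sums. The sketch below is not part of the paper: the three-state chain, the function `F`, the horizons and the sample size are all assumptions chosen for illustration. It builds $\rho$ from Definitions 1 and 4 (truncating the series) and compares $\sum_s \rho(s) f(s)$ with a Monte Carlo estimate of the discounted sum along sampled trajectories.

```python
import random

# Toy 3-state Markov chain; every number here is an illustrative assumption.
GAMMA = 0.9
STATES = [0, 1, 2]
P0 = [0.5, 0.3, 0.2]                 # initial distribution p_0(s)
P = [[0.1, 0.6, 0.3],                # P[s][s2] = p(s2 | s)
     [0.4, 0.4, 0.2],
     [0.3, 0.3, 0.4]]
F = [1.0, -2.0, 0.5]                 # an arbitrary bounded function f(s)


def step(dist):
    """Definition 1: p(s' | t = i + 1) = sum_s p(s' | s) p(s | t = i)."""
    return [sum(dist[s] * P[s][s2] for s in STATES) for s2 in STATES]


def occupancy(horizon=500):
    """Definition 4 with the series truncated after `horizon` terms."""
    rho = [0.0 for _ in STATES]
    dist, discount = list(P0), 1.0
    for _ in range(horizon):
        rho = [r + discount * d for r, d in zip(rho, dist)]
        dist, discount = step(dist), discount * GAMMA
    return rho


def sampled_discounted_sum(rng, horizon=100):
    """One sample of sum_i gamma^i f(s_i) along a truncated trajectory."""
    state = rng.choices(STATES, weights=P0)[0]
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        total += discount * F[state]
        discount *= GAMMA
        state = rng.choices(STATES, weights=P[state])[0]
    return total


rho = occupancy()
lhs = sum(r * f for r, f in zip(rho, F))      # sum_s rho(s) f(s)

rng = random.Random(0)
n_traj = 5000
mc = sum(sampled_discounted_sum(rng) for _ in range(n_traj)) / n_traj

print("sum_s rho(s) f(s)    =", round(lhs, 3))
print("Monte Carlo estimate =", round(mc, 3))
print("total mass of rho    =", round(sum(rho), 3))
```

The two estimates should agree up to sampling noise, and the total mass of $\rho$ comes out as $1/(1-\gamma)$ rather than 1, which is the sense in which the measure is not normalised.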

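This property is useful since the expression on the left can be easily manipulated while the expression on the right can be estimated from samples using Monte Carlo.

Lemma 5 (Generalised eigenfunction property). For any bounded function $f$ (writing $d\rho(s)$ for $\rho(s)\,ds$ and $dp(\cdot)$ for $p(\cdot)$ times the corresponding differential):
$$\gamma \int_s d\rho(s) \int_{s'} dp(s' \mid s)\, f(s') = \Big(\int_s d\rho(s)\, f(s)\Big) - \Big(\int_s dp_0(s)\, f(s)\Big).$$

Proof.
$$\begin{aligned}
\gamma \int_s d\rho(s) \int_{s'} dp(s' \mid s)\, f(s')
&= \gamma \int_{s, s'} \sum_{i=0}^{\infty} \gamma^i\, p(s \mid t = i)\, p(s' \mid s)\, f(s')\, ds\, ds' \\
&= \sum_{i=0}^{\infty} \gamma^{i+1} \int_{s'} dp(s' \mid t = i + 1)\, f(s') \\
&= \sum_{i=1}^{\infty} \gamma^{i} \int_{s'} dp(s' \mid t = i)\, f(s') \\
&= \Big(\sum_{i=0}^{\infty} \gamma^{i} \int_{s'} dp(s' \mid t = i)\, f(s')\Big) - \Big(\int_{s'} dp_0(s')\, f(s')\Big) \\
&= \Big(\int_s d\rho(s)\, f(s)\Big) - \Big(\int_s dp_0(s)\, f(s)\Big).
\end{aligned}$$
Here, the first equality follows from Definition 4 and the second one from Definition 1. The last equality follows again from Definition 4.

Definition 5 (Markov Reward Process). A Markov Reward Process is a tuple $(p, p_0, R, \gamma)$, where $p(s' \mid s)$ is a transition kernel, $p_0$ is the distribution over initial states, $R(\cdot \mid s)$ is a reward distribution conditioned on the state and $\gamma$ is the discount constant.

An MRP can be thought of as an MDP with a fixed policy and dynamics given by marginalising out the actions, $p_\pi(s' \mid s) = \int_a d\pi(a \mid s)\, p(s' \mid a, s)$. Since this paper considers the case of one policy, we abuse notation slightly by using the same symbol $\tau$ to denote trajectories including actions, i.e. $(s_0, a_0, r_0, s_1, a_1, r_1, \dots)$, and without them, i.e. $(s_0, r_0, s_1, r_1, \dots)$.

Lemma 6 (Second Moment Bellman Equation). Consider a Markov Reward Process $(p, p_0, X, \gamma)$ where $p(s' \mid s)$ is a Markov process and $X(\cdot \mid s)$ is some probability density function [6]. Denote the value function of the MRP as $V$. Denote the second moment function as $S$:
$$S(s) = \mathbb{E}_{\tau}\Big[\Big(\sum_{t=0}^{\infty} \gamma^t x_t\Big)^2 \,\Big|\, s_0 = s\Big], \qquad x_t \sim X(\cdot \mid s_t).$$
Then $S$ is the value function of the MRP $(p, p_0, u, \gamma^2)$, where $u(s)$ is a deterministic random variable given by
$$u(s) = \mathbb{V}_{X(x \mid s)}[x] + \big(\mathbb{E}_{X(x \mid s)}[x]\big)^2 + 2\gamma\, \mathbb{E}_{X(x \mid s)}[x]\; \mathbb{E}_{p(s' \mid s)}[V(s')].$$

Proof.
$$\begin{aligned}
S(s) &= \mathbb{E}_{\tau}\Big[\Big(x_0 + \sum_{t=1}^{\infty} \gamma^t x_t\Big)^2 \,\Big|\, s_0 = s\Big] \\
&= \mathbb{E}_{\tau}\Big[x_0^2 + 2 x_0 \Big(\sum_{t=1}^{\infty} \gamma^t x_t\Big) + \Big(\sum_{t=1}^{\infty} \gamma^t x_t\Big)^2 \,\Big|\, s_0 = s\Big] \\
&= \mathbb{E}_{\tau}\big[x_0^2 \mid s_0 = s\big] + \mathbb{E}_{\tau}\Big[2 x_0 \Big(\sum_{t=1}^{\infty} \gamma^t x_t\Big) \,\Big|\, s_0 = s\Big] + \mathbb{E}_{\tau}\Big[\Big(\sum_{t=1}^{\infty} \gamma^t x_t\Big)^2 \,\Big|\, s_0 = s\Big] \\
&= u(s) + \gamma^2\, \mathbb{E}_{p(s' \mid s)}[S(s')].
\end{aligned}$$
This is exactly the Bellman equation of the MRP $(p, p_0, u, \gamma^2)$. The lemma follows since the Bellman equation uniquely determines the value function.

Observation 3 (Dominated Value Functions). Consider two Markov Reward Processes $(p, p_0, X_1, \gamma)$ and $(p, p_0, X_2, \gamma)$, where $p(s' \mid s)$ is a Markov process (common to both MRPs) and $X_1(s)$, $X_2(s)$ are some deterministic random variables meeting the condition $X_1(s) \le X_2(s)$ for every $s$. Then the value functions $V_1$ and $V_2$ of the respective MRPs satisfy $V_1(s) \le V_2(s)$ for every $s$. Moreover, if we have that $X_1(s) < X_2(s)$ for all states, then the inequality between value functions is strict.

Proof. Follows trivially by expanding the value function as a series and comparing the series elementwise.

We now move our attention to proving the Gaussian Policy Gradients lemma.

[6] Note that while $X$ occupies a place in the definition of the MRP usually called the reward distribution, we are using the symbol $X$, not $R$, since we shall apply the lemma to $X$es which are constructions distinct from the reward of the MDP we are solving.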
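Returning to Lemmas 5 and 6, both identities can be checked numerically on a finite-state MRP. The sketch below is an illustration under assumed values, not the paper's code: the chain, the per-state reward means `MEAN` and standard deviations `STD`, the two-point reward distribution and the sample sizes are all hypothetical. Lemma 5 is checked exactly (up to series truncation). Lemma 6 is checked by solving the Bellman equation with reward $u$ and discount $\gamma^2$, then comparing against a Monte Carlo estimate of the second moment of the return.

```python
import random

# Toy MRP; every number here is an illustrative assumption.
GAMMA = 0.8
STATES = [0, 1, 2]
P0 = [0.5, 0.3, 0.2]
P = [[0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2],
     [0.3, 0.3, 0.4]]
MEAN = [1.0, 0.0, 2.0]   # E[x | s]
STD = [0.5, 1.0, 0.0]    # x = MEAN[s] +/- STD[s], each with probability 1/2


def expect_next(values):
    """E_{p(s'|s)}[values(s')] for every state s."""
    return [sum(P[s][s2] * values[s2] for s2 in STATES) for s in STATES]


def value_function(reward, gamma, sweeps=400):
    """Iterate V = reward + gamma * P V until it has converged."""
    values = [0.0 for _ in STATES]
    for _ in range(sweeps):
        nxt = expect_next(values)
        values = [reward[s] + gamma * nxt[s] for s in STATES]
    return values


# Lemma 5: gamma * sum_s rho(s) (P f)(s) = sum_s rho(s) f(s) - sum_s p_0(s) f(s).
rho = [0.0 for _ in STATES]
dist, discount = list(P0), 1.0
for _ in range(400):
    rho = [r + discount * d for r, d in zip(rho, dist)]
    dist = [sum(dist[s] * P[s][s2] for s in STATES) for s2 in STATES]
    discount *= GAMMA
f = MEAN                                   # any bounded f would do
pf = expect_next(f)
lemma5_lhs = GAMMA * sum(rho[s] * pf[s] for s in STATES)
lemma5_rhs = (sum(rho[s] * f[s] for s in STATES)
              - sum(P0[s] * f[s] for s in STATES))

# Lemma 6: S is the value function of the MRP (p, p_0, u, gamma^2).
V = value_function(MEAN, GAMMA)
PV = expect_next(V)
U = [STD[s] ** 2 + MEAN[s] ** 2 + 2.0 * GAMMA * MEAN[s] * PV[s] for s in STATES]
S = value_function(U, GAMMA ** 2)


def sampled_return(rng, state, horizon=60):
    """One sample of sum_t gamma^t x_t starting from `state`."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        total += discount * (MEAN[state] + rng.choice((-1.0, 1.0)) * STD[state])
        discount *= GAMMA
        state = rng.choices(STATES, weights=P[state])[0]
    return total


rng = random.Random(0)
n_samples = 8000
mc_second_moment = sum(sampled_return(rng, 0) ** 2 for _ in range(n_samples)) / n_samples

print("Lemma 5:", lemma5_lhs, "vs", lemma5_rhs)
print("Lemma 6: S(0) =", S[0], "vs Monte Carlo", mc_second_moment)
```

The Lemma 5 sides should match to floating-point precision, while the Lemma 6 comparison is only expected to agree up to Monte Carlo error.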

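Lemma 1 (Gaussian Policy Gradients). If the policy is Gaussian, i.e. $\pi(\cdot \mid s) \sim \mathcal{N}(\mu_s, \Sigma_s)$ with $\mu_s$ and $\Sigma_s^{1/2}$ parametrised by $\theta$, where $\Sigma_s^{1/2}$ is symmetric and $\Sigma_s^{1/2} \Sigma_s^{1/2} = \Sigma_s$, and the critic is of the form $Q(a, s) = a^\top A(s)\, a + a^\top B(s) + \text{const}$, where $A(s)$ is symmetric for every $s$, then
$$I^Q_\pi(s) = I^Q_{\pi(s), \mu_s} + I^Q_{\pi(s), \Sigma_s^{1/2}},$$
where the mean and covariance components are given by
$$I^Q_{\pi(s), \mu_s} = (\nabla \mu_s)\big(2 A(s) \mu_s + B(s)\big) \quad \text{and} \quad I^Q_{\pi(s), \Sigma_s^{1/2}} = (\nabla \Sigma_s^{1/2})\, 2 A(s)\, \Sigma_s^{1/2}.$$

Proof. First, we observe that the critic $Q$ defined in the statement of the lemma does not depend on the policy parameters $\theta$. This is because $Q$ is an approximation to the Q-function maintained by the algorithm, as opposed to the true Q-function, which is defined with respect to the policy and does depend on it. We can hence move the differentiation outside of the integral, as follows.
$$I^Q_\pi(s) = \int_a \nabla \pi(a \mid s)\, Q(a, s)\, da = \nabla \int_a \pi(a \mid s)\, Q(a, s)\, da = \nabla\, \mathbb{E}_{\pi}[Q(a, s)].$$
We now expand the expectation using known expressions for the expectation of a quadratic form:
$$\mathbb{E}_{\pi}[Q(a, s)] = \operatorname{trace}(A(s) \Sigma_s) + \mu_s^\top A(s) \mu_s + B(s)^\top \mu_s + \text{const}.$$
This gives way to the following derivatives.
$$\nabla_{\Sigma_s^{1/2}}\, \mathbb{E}_{\pi}[Q(a, s)] = \nabla_{\Sigma_s^{1/2}} \big(\operatorname{trace}(A(s) \Sigma_s) + \mu_s^\top A(s) \mu_s + B(s)^\top \mu_s + \text{const}\big) = 2 A(s)\, \Sigma_s^{1/2}$$
$$\nabla_{\mu_s}\, \mathbb{E}_{\pi}[Q(a, s)] = \nabla_{\mu_s} \big(\operatorname{trace}(A(s) \Sigma_s) + \mu_s^\top A(s) \mu_s + B(s)^\top \mu_s + \text{const}\big) = 2 A(s) \mu_s + B(s)$$
We now obtain the result by applying the chain rule.
$$I^Q_\pi(s) = I^Q_{\pi(s), \mu_s} + I^Q_{\pi(s), \Sigma_s^{1/2}} = (\nabla \mu_s)\big(2 A(s) \mu_s + B(s)\big) + (\nabla \Sigma_s^{1/2})\big(2 A(s)\, \Sigma_s^{1/2}\big)$$

Lemma 3. If for all $s \in S$, the random variable $\nabla \log \pi(a \mid s)\, \hat{Q}(a, s)$, where $a \sim \pi(\cdot \mid s)$, has nonzero variance, then
$$\mathbb{V}_{\tau}\Big[\sum_{t=0}^{\infty} \gamma^t\, \nabla \log \pi(a_t \mid s_t)\big(\hat{Q}(a_t, s_t) + b(s_t)\big)\Big] > \mathbb{V}_{\tau}\Big[\sum_{t=0}^{\infty} \gamma^t\, I^{\hat{Q}}_\pi(s_t)\Big].$$

Proof. Both random variables have the same mean, so we need only show that
$$\mathbb{E}_{\tau}\Big[\Big(\sum_{t=0}^{\infty} \gamma^t\, \nabla \log \pi(a_t \mid s_t)\big(\hat{Q}(a_t, s_t) + b(s_t)\big)\Big)^2\Big] > \mathbb{E}_{\tau}\Big[\Big(\sum_{t=0}^{\infty} \gamma^t\, I^{\hat{Q}}_\pi(s_t)\Big)^2\Big].$$
We start by applying Lemma 6 to the left-hand side, setting $X = X_1(s_t) = \nabla \log \pi(a_t \mid s_t)\big(\hat{Q}(a_t, s_t) + b(s_t)\big)$ where $a_t \sim \pi(\cdot \mid s_t)$; the discount factor $\gamma^t$ is applied by the sum in Lemma 6 and is therefore not part of $X_1$. This shows that the left-hand side is the total return of the MRP $(p, p_0, u_1, \gamma^2)$, where
$$u_1 = \mathbb{V}_{X_1(x \mid s)}[x] + \big(\mathbb{E}_{X_1(x \mid s)}[x]\big)^2 + 2\gamma\, \mathbb{E}_{X_1(x \mid s)}[x]\; \mathbb{E}_{p(s' \mid s)}[V(s')].$$
Likewise, applying Lemma 6 again to the right-hand side, instantiating $X$ as a deterministic random variable $X_2(s_t) = I^{\hat{Q}}_\pi(s_t)$, we have that the right-hand side is the total return of the MRP $(p, p_0, u_2, \gamma^2)$, where
$$u_2 = \big(\mathbb{E}_{X_2(x \mid s)}[x]\big)^2 + 2\gamma\, \mathbb{E}_{X_2(x \mid s)}[x]\; \mathbb{E}_{p(s' \mid s)}[V(s')].$$
Note that $\mathbb{E}_{X_1(x \mid s)}[x] = \mathbb{E}_{X_2(x \mid s)}[x]$, so the two MRPs share the same value function $V$ and $u_1 - u_2 = \mathbb{V}_{X_1(x \mid s)}[x] \ge 0$, i.e. $u_1 \ge u_2$. Furthermore, by the assumption of the lemma, the inequality is strict. The lemma then follows by applying Observation 3.

For convenience, Lemma 3 also assumes infinite-length trajectories. However, this is not a practical limitation since all policy gradient methods implicitly assume trajectories are long enough to be modelled as infinite. Furthermore, a finite-trajectory variant also holds, though the proof is messier.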
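The two derivatives used in the proof of Lemma 1 can be sanity-checked with finite differences. The sketch below is an illustration with assumed 2x2 values for $A(s)$, $B(s)$, $\mu_s$ and $\Sigma_s^{1/2}$ (none of them come from the paper). It evaluates the closed-form expectation $\operatorname{trace}(A\Sigma) + \mu^\top A \mu + B^\top \mu$ and compares central differences against $2A\mu + B$ and $2A\Sigma^{1/2}$. The covariance check perturbs $\Sigma^{1/2}$ along a symmetric direction, since $\Sigma_s^{1/2}$ is constrained to be symmetric in the lemma.

```python
# Illustrative 2x2 instance; all values are assumptions, not from the paper.
A = [[-1.0, 0.3], [0.3, -0.5]]          # symmetric A(s)
B = [0.7, -0.2]                          # B(s)
MU = [0.4, -1.1]                         # policy mean mu_s
ROOT = [[0.9, 0.2], [0.2, 0.6]]          # symmetric Sigma_s^{1/2}
DIRECTION = [[0.3, -0.4], [-0.4, 0.1]]   # symmetric perturbation of Sigma_s^{1/2}
EPS = 1e-5
IDX = range(2)


def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in IDX) for j in IDX] for i in IDX]


def expected_q(mu, root):
    """E_pi[Q] = trace(A Sigma) + mu^T A mu + B^T mu (constant term omitted)."""
    a_sigma = matmul(A, matmul(root, root))
    trace = sum(a_sigma[i][i] for i in IDX)
    quad = sum(mu[i] * A[i][j] * mu[j] for i in IDX for j in IDX)
    linear = sum(B[i] * mu[i] for i in IDX)
    return trace + quad + linear


# Closed forms from the proof of Lemma 1.
grad_mu = [2.0 * sum(A[i][j] * MU[j] for j in IDX) + B[i] for i in IDX]
a_root = matmul(A, ROOT)
grad_root = [[2.0 * a_root[i][j] for j in IDX] for i in IDX]


def perturbed_mu(i, delta):
    mu = list(MU)
    mu[i] += delta
    return mu


def perturbed_root(delta):
    return [[ROOT[i][j] + delta * DIRECTION[i][j] for j in IDX] for i in IDX]


# Central differences with respect to each component of mu.
fd_mu = [(expected_q(perturbed_mu(i, EPS), ROOT)
          - expected_q(perturbed_mu(i, -EPS), ROOT)) / (2.0 * EPS) for i in IDX]

# Directional derivative with respect to Sigma^{1/2} along DIRECTION.
fd_root = (expected_q(MU, perturbed_root(EPS))
           - expected_q(MU, perturbed_root(-EPS))) / (2.0 * EPS)
closed_root = sum(grad_root[i][j] * DIRECTION[i][j] for i in IDX for j in IDX)

print("mean gradient:", grad_mu, "finite differences:", fd_mu)
print("covariance directional derivative:", closed_root, "finite difference:", fd_root)
```

Because the expectation is quadratic in both $\mu_s$ and $\Sigma_s^{1/2}$, central differences are exact up to rounding, so the comparison does not depend on a careful choice of `EPS`.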

Remarks on the covariance limit

When we obtain $e^H$ as the limiting covariance matrix in Lemma 2 of the main paper, there is a slight modelling difficulty: is it justified to use the learning rate of $1/n$, which diminishes in the length of the trajectory, as opposed to a small finite number? We observe that the problem of choosing a step size is, in general, not specific to our method, since all policy gradient methods rely on stochastic optimisation and hence work with diminishing learning rates of some sort.

We do note, however, that the step size we use, which is $1/n$ for every point in the trajectory, is different from the step sizes typically used with Robbins-Monro procedures, which differ at each time step. This means that the sum of our step sizes is finite while the sum of the Robbins-Monro step sizes diverges. Hence our choice of step size does not give the guarantees typically associated with stochastic optimisation. We use this step sequence since it serves as a useful intermediate stage between simply taking one PG step of equation (11) and using finite step sizes, which would mean that the covariance would either converge to zero or diverge to infinity.
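The trade-off between step-size schedules can be seen in a scalar toy version of the covariance update. The update rule used below, `root <- root + step * h * root` with a fixed scalar `h` standing in for the scaled Hessian $H$, is an assumption made for illustration. The exact procedure is given by Lemma 2 of the main paper, which is not reproduced in this supplement. Under that assumption, $n$ steps of size $1/n$ approach $e^h$, whereas a fixed step size drives the value towards zero or infinity as the trajectory grows.

```python
import math


def apply_steps(h, step_sizes):
    """Apply root <- root + step * h * root for each step, starting from 1.

    Hypothetical scalar stand-in for the covariance update; `h` plays the
    role of the scaled Hessian H.
    """
    root = 1.0
    for step in step_sizes:
        root += step * h * root
    return root


H = -0.7       # illustrative curvature value
N = 10000      # trajectory length

constant_schedule = [1.0 / N] * N                         # 1/n at every point
harmonic_schedule = [1.0 / k for k in range(1, N + 1)]    # Robbins-Monro style

limit = apply_steps(H, constant_schedule)
print("n steps of size 1/n:", limit, " e^h:", math.exp(H))
print("sum of 1/n steps:", sum(constant_schedule))
print("sum of 1/k steps:", sum(harmonic_schedule))

# A fixed step size over a long trajectory collapses or explodes.
shrinking = apply_steps(-0.7, [0.01] * 5000)
exploding = apply_steps(0.7, [0.01] * 5000)
print("fixed step, h < 0:", shrinking)
print("fixed step, h > 0:", exploding)
```

The sum of the $1/n$ schedule stays at 1 for any trajectory length, while the harmonic sum keeps growing with $N$, which is the distinction from Robbins-Monro step sizes drawn in the text.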


More information

Package ifs. R topics documented: August 21, Version Title Iterated Function Systems. Author S. M. Iacus.

Package ifs. R topics documented: August 21, Version Title Iterated Function Systems. Author S. M. Iacus. Pckge if Augut 21, 2015 Verion 0.1.5 Title Iterted Function Sytem Author S. M. Icu Dte 2015-08-21 Mintiner S. M. Icu Iterted Function Sytem Etimtor. Licene GPL (>= 2) NeedCompiltion

More information

THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS.

THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS. THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS RADON ROSBOROUGH https://intuitiveexplntionscom/picrd-lindelof-theorem/ This document is proof of the existence-uniqueness theorem

More information

p-adic Egyptian Fractions

p-adic Egyptian Fractions p-adic Egyptin Frctions Contents 1 Introduction 1 2 Trditionl Egyptin Frctions nd Greedy Algorithm 2 3 Set-up 3 4 p-greedy Algorithm 5 5 p-egyptin Trditionl 10 6 Conclusion 1 Introduction An Egyptin frction

More information

Monte Carlo method in solving numerical integration and differential equation

Monte Carlo method in solving numerical integration and differential equation Monte Crlo method in solving numericl integrtion nd differentil eqution Ye Jin Chemistry Deprtment Duke University yj66@duke.edu Abstrct: Monte Crlo method is commonly used in rel physics problem. The

More information

Main topics for the First Midterm

Main topics for the First Midterm Min topics for the First Midterm The Midterm will cover Section 1.8, Chpters 2-3, Sections 4.1-4.8, nd Sections 5.1-5.3 (essentilly ll of the mteril covered in clss). Be sure to know the results of the

More information

Robot Planning in Partially Observable Continuous Domains

Robot Planning in Partially Observable Continuous Domains Robot Plnning in Prtilly Obervble Continuou Domin Joep M. Port Intitut de Robòtic i Informàtic Indutril (UPC-CSIC) Lloren i Artig 4-6, 828, Brcelon Spin Emil: port@iri.upc.edu Mtthij T. J. Spn Informtic

More information

MAA 4212 Improper Integrals

MAA 4212 Improper Integrals Notes by Dvid Groisser, Copyright c 1995; revised 2002, 2009, 2014 MAA 4212 Improper Integrls The Riemnn integrl, while perfectly well-defined, is too restrictive for mny purposes; there re functions which

More information

Module 6 Value Iteration. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo

Module 6 Value Iteration. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo Module 6 Vlue Itertion CS 886 Sequentil Decision Mking nd Reinforcement Lerning University of Wterloo Mrkov Decision Process Definition Set of sttes: S Set of ctions (i.e., decisions): A Trnsition model:

More information

APPROXIMATE INTEGRATION

APPROXIMATE INTEGRATION APPROXIMATE INTEGRATION. Introduction We hve seen tht there re functions whose nti-derivtives cnnot be expressed in closed form. For these resons ny definite integrl involving these integrnds cnnot be

More information

1.2. Linear Variable Coefficient Equations. y + b "! = a y + b " Remark: The case b = 0 and a non-constant can be solved with the same idea as above.

1.2. Linear Variable Coefficient Equations. y + b ! = a y + b  Remark: The case b = 0 and a non-constant can be solved with the same idea as above. 1 12 Liner Vrible Coefficient Equtions Section Objective(s): Review: Constnt Coefficient Equtions Solving Vrible Coefficient Equtions The Integrting Fctor Method The Bernoulli Eqution 121 Review: Constnt

More information

Introduction to the Calculus of Variations

Introduction to the Calculus of Variations Introduction to the Clculus of Vritions Jim Fischer Mrch 20, 1999 Abstrct This is self-contined pper which introduces fundmentl problem in the clculus of vritions, the problem of finding extreme vlues

More information

EE Control Systems LECTURE 8

EE Control Systems LECTURE 8 Coyright F.L. Lewi 999 All right reerved Udted: Sundy, Ferury, 999 EE 44 - Control Sytem LECTURE 8 REALIZATION AND CANONICAL FORMS A liner time-invrint (LTI) ytem cn e rereented in mny wy, including: differentil

More information

Properties of Integrals, Indefinite Integrals. Goals: Definition of the Definite Integral Integral Calculations using Antiderivatives

Properties of Integrals, Indefinite Integrals. Goals: Definition of the Definite Integral Integral Calculations using Antiderivatives Block #6: Properties of Integrls, Indefinite Integrls Gols: Definition of the Definite Integrl Integrl Clcultions using Antiderivtives Properties of Integrls The Indefinite Integrl 1 Riemnn Sums - 1 Riemnn

More information

LECTURE NOTE #12 PROF. ALAN YUILLE

LECTURE NOTE #12 PROF. ALAN YUILLE LECTURE NOTE #12 PROF. ALAN YUILLE 1. Clustering, K-mens, nd EM Tsk: set of unlbeled dt D = {x 1,..., x n } Decompose into clsses w 1,..., w M where M is unknown. Lern clss models p(x w)) Discovery of

More information

Numerical integration

Numerical integration 2 Numericl integrtion This is pge i Printer: Opque this 2. Introduction Numericl integrtion is problem tht is prt of mny problems in the economics nd econometrics literture. The orgniztion of this chpter

More information

Chapters 4 & 5 Integrals & Applications

Chapters 4 & 5 Integrals & Applications Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions

More information

The goal of this section is to learn how to use a computer to approximate definite integrals, i.e. expressions of the form. Z b

The goal of this section is to learn how to use a computer to approximate definite integrals, i.e. expressions of the form. Z b Lecture notes for Numericl Anlysis Integrtion Topics:. Problem sttement nd motivtion 2. First pproches: Riemnn sums 3. A slightly more dvnced pproch: the Trpezoid rule 4. Tylor series (the most importnt

More information

Theoretical foundations of Gaussian quadrature

Theoretical foundations of Gaussian quadrature Theoreticl foundtions of Gussin qudrture 1 Inner product vector spce Definition 1. A vector spce (or liner spce) is set V = {u, v, w,...} in which the following two opertions re defined: (A) Addition of

More information

Bellman Optimality Equation for V*

Bellman Optimality Equation for V* Bellmn Optimlity Eqution for V* The vlue of stte under n optiml policy must equl the expected return for the best ction from tht stte: V (s) mx Q (s,) A(s) mx A(s) mx A(s) Er t 1 V (s t 1 ) s t s, t s

More information

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a).

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a). The Fundmentl Theorems of Clculus Mth 4, Section 0, Spring 009 We now know enough bout definite integrls to give precise formultions of the Fundmentl Theorems of Clculus. We will lso look t some bsic emples

More information

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite

Goals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite Unit #8 : The Integrl Gols: Determine how to clculte the re described by function. Define the definite integrl. Eplore the reltionship between the definite integrl nd re. Eplore wys to estimte the definite

More information

Stuff You Need to Know From Calculus

Stuff You Need to Know From Calculus Stuff You Need to Know From Clculus For the first time in the semester, the stuff we re doing is finlly going to look like clculus (with vector slnt, of course). This mens tht in order to succeed, you

More information

Unit #9 : Definite Integral Properties; Fundamental Theorem of Calculus

Unit #9 : Definite Integral Properties; Fundamental Theorem of Calculus Unit #9 : Definite Integrl Properties; Fundmentl Theorem of Clculus Gols: Identify properties of definite integrls Define odd nd even functions, nd reltionship to integrl vlues Introduce the Fundmentl

More information

Robot Planning in Partially Observable Continuous Domains

Robot Planning in Partially Observable Continuous Domains Robot Plnning in Prtilly Obervble Continuou Domin Joep M. Port Intitut de Robòtic i Informàtic Indutril (UPC-CSIC) Lloren i Artig 4-6, 828, Brcelon Spin Emil: port@iri.upc.edu Mtthij T. J. Spn Informtic

More information

LINEAR STOCHASTIC DIFFERENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS

LINEAR STOCHASTIC DIFFERENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS Communiction on Stochtic Anlyi Vol. 7, No. 2 213 245-253 Seril Publiction www.erilpubliction.com LINEA STOCHASTIC DIFFEENTIAL EQUATIONS WITH ANTICIPATING INITIAL CONDITIONS NAJESS KHALIFA, HUI-HSIUNG KUO,

More information