arxiv: v6 [stat.ml] 13 Apr 2018


Expected Policy Gradients

Kamil Ciosek and Shimon Whiteson
Department of Computer Science, University of Oxford
Wolfson Building, Parks Road, Oxford OX1 3QD
{kamil.ciosek,shimon.whiteson}@cs.ox.ac.uk

arXiv:1706.05374v6 [stat.ML] 13 Apr 2018

Abstract

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates across the action when estimating the gradient, instead of relying only on the action in the sampled trajectory. We establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. We also prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead. Finally, we show that it is optimal in a certain sense to explore with a Gaussian policy such that the covariance is proportional to $e^H$, where $H$ is the scaled Hessian of the critic with respect to the actions. We present empirical results confirming that this new form of exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic in four challenging MuJoCo domains.

Introduction

Policy gradient methods (Sutton et al., 2000; Peters and Schaal, 2006, 2008b; Silver et al., 2014), which optimise policies by gradient ascent, have enjoyed great success in reinforcement learning problems with large or continuous action spaces. The archetypal algorithm optimises an actor, i.e., a policy, by following a policy gradient that is estimated using a critic, i.e., a value function. The policy can be stochastic or deterministic, yielding stochastic policy gradients (SPG) (Sutton et al., 2000) or deterministic policy gradients (DPG) (Silver et al., 2014). The theory underpinning these methods is quite fragmented, as each approach has a separate policy gradient theorem guaranteeing the policy gradient is unbiased under certain conditions. Furthermore, both approaches have significant shortcomings. For SPG, variance in the gradient estimates means that many trajectories are usually needed for learning.
Since gathering trajectories is typically expensive, there is great need for more sample efficient methods. DPG's use of deterministic policies mitigates the problem of variance in the gradient but raises other difficulties. The theoretical support for DPG is limited since it assumes a critic that approximates $\nabla_a Q$ when in practice it approximates $Q$ instead. In addition, DPG learns off-policy¹, which is undesirable when we want learning to take the cost of exploration into account. More importantly, learning off-policy necessitates designing a suitable exploration policy, which is difficult in practice. In fact, efficient exploration in DPG is an open problem and most applications simply use independent Gaussian noise or the Ornstein-Uhlenbeck heuristic (Uhlenbeck and Ornstein, 1930; Lillicrap et al., 2015).

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In this paper, we propose a new approach called expected policy gradients (EPG) that unifies policy gradients in a way that yields both theoretical and practical insights. Inspired by expected sarsa (Sutton and Barto, 1998; van Seijen et al., 2009), the main idea is to integrate across the action selected by the stochastic policy when estimating the gradient, instead of relying only on the action selected during the sampled trajectory.

EPG enables two theoretical contributions. First, we establish a number of equivalences between EPG and DPG, among which is a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. Second, we prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead over SPG.

EPG also enables a practical contribution: a principled exploration strategy for continuous problems. We show that it is optimal in a certain sense to explore with a Gaussian policy such that the covariance is proportional to $e^H$, where $H$ is the scaled Hessian of the critic with respect to the actions.
We present empirical results confirming that this new approach to exploration substantially outperforms DPG with Ornstein-Uhlenbeck exploration in four challenging MuJoCo domains.

¹ We show in this paper that, in certain settings, off-policy DPG is equivalent to EPG, our on-policy method.

Background

A Markov decision process is a tuple $(S, A, R, p, p_0, \gamma)$ where $S$ is a set of states, $A$ is a set of actions (in practice either $A = \mathbb{R}^d$ or $A$ is finite), $R(a, s)$ is a reward function, $p(s' \mid a, s)$ is a transition kernel, $p_0$ is an initial state distribution, and $\gamma \in [0, 1)$ is a discount factor. A policy $\pi(a \mid s)$ is a distribution over actions given state. We denote trajectories as $\tau^\pi = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$, where $s_0 \sim p_0$, $a_t \sim \pi(\cdot \mid s_{t-1})$ and $r_t$ is a sample reward. A policy $\pi$ induces a Markov process with transition kernel $p_\pi(s' \mid s) = \int_a d\pi(a \mid s)\, p(s' \mid a, s)$, where we use the symbol $d\pi(a \mid s)$ to denote Lebesgue integration against the measure $\pi(a \mid s)$ where $s$ is fixed. We assume the induced Markov process is ergodic with a single invariant measure defined for the whole state space.

The value function is $V^\pi = \mathbb{E}_\tau\left[\sum_i \gamma^i r_i\right]$ where actions are sampled from $\pi$. The Q-function is $Q^\pi(a \mid s) = \mathbb{E}_R[r \mid a, s] + \gamma\, \mathbb{E}_{p(s' \mid a, s)}[V^\pi(s')]$ and the advantage function is $A^\pi(a \mid s) = Q^\pi(a \mid s) - V^\pi(s)$. An optimal policy maximises the total return $J = \int_s dp_0(s)\, V^\pi(s)$. Since we consider only on-policy learning with just one current policy, we drop the $\pi$ super/subscript where it is redundant.

If $\pi$ is parameterised by $\theta$, then stochastic policy gradients (SPG) (Sutton et al., 2000; Peters and Schaal, 2006, 2008b) perform gradient ascent on $\nabla J$, the gradient of $J$ with respect to $\theta$ (gradients without a subscript are always with respect to $\theta$). For stochastic policies, we have:

$$\nabla J = \int_s d\rho(s) \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\,(Q(a, s) + b(s)), \tag{1}$$

where $\rho$ is the discounted-ergodic occupancy measure, defined in the supplement, and $b(s)$ is a baseline, which can be any function that depends on the state but not the action, since $\int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\, b(s) = 0$. Typically, (1) is approximated from samples from a trajectory $\tau$ of length $T$:

$$\hat\nabla J = \sum_{t=0}^{T} \gamma^t\, \nabla \log \pi(a_t \mid s_t)\,(\hat Q(a_t, s_t) + b(s_t)). \tag{2}$$

If the policy is deterministic (we denote it $\pi(s)$), we can use deterministic policy gradients (Silver et al., 2014) instead:

$$\nabla J = \int_s d\rho(s)\, \nabla \pi(s)\, \nabla_a Q(a = \pi(s), s). \tag{3}$$

This update is then approximated using samples:

$$\hat\nabla J = \sum_{t=0}^{T} \gamma^t\, \nabla \pi(s_t)\, \nabla_a \hat Q(a = \pi(s_t), s_t). \tag{4}$$

Since the policy is deterministic, the problem of exploration is addressed using an external source of noise, typically modelled using a zero-mean Ornstein-Uhlenbeck (OU) process (Uhlenbeck and Ornstein, 1930; Lillicrap et al., 2015) parametrised by $\psi$ and $\sigma$:

$$n_i \leftarrow -n_{i-1}\psi + \mathcal{N}(0, \sigma I), \qquad a \sim \pi(s) + n_i. \tag{5}$$

In (2) and (4), $\hat Q$ is a critic that approximates $Q$ and can be learned by sarsa (Rummery and Niranjan, 1994; Sutton, 1996):

$$\hat Q(a_t, s_t) \leftarrow \hat Q(a_t, s_t) + \alpha\left[r_{t+1} + \gamma \hat Q(a_{t+1}, s_{t+1}) - \hat Q(a_t, s_t)\right]. \tag{6}$$

Alternatively, we can use expected sarsa (Sutton and Barto, 1998; van Seijen et al., 2009), which marginalises out $a_{t+1}$, the distribution over which is specified by the known policy, to reduce the variance in the update:

$$\hat Q(a_t, s_t) \leftarrow \hat Q(a_t, s_t) + \alpha\left[r_{t+1} + \gamma \int_a d\pi(a \mid s_{t+1})\, \hat Q(a, s_{t+1}) - \hat Q(a_t, s_t)\right]. \tag{7}$$

We could also use advantage learning (Baird and others, 1995) or LSTDQ (Lagoudakis and Parr, 2003). If the critic's function approximator is compatible, then the actor, i.e., $\pi$, converges (Sutton et al., 2000).

Instead of learning $\hat Q$, we can set $b(s) = -V(s)$ so that $Q(a, s) + b(s) = A(a, s)$ and then use the TD error $\delta(r, s', s) = r + \gamma V(s') - V(s)$ as an estimate of $A(a, s)$ (Bhatnagar et al., 2008):

$$\hat\nabla J = \sum_{t=0}^{T} \gamma^t\, \nabla \log \pi(a_t \mid s_t)\,(r_{t+1} + \gamma \hat V(s_{t+1}) - \hat V(s_t)), \tag{8}$$

where $\hat V(s)$ is an approximate value function learned using any policy evaluation algorithm. (8) works because $\mathbb{E}[\delta(r, s', s) \mid a, s] = A(a, s)$, i.e., the TD error is an unbiased estimate of the advantage function. The benefit of this approach is that it is sometimes easier to approximate $V$ than $Q$ and that the return in the TD error is unprojected, i.e., it is not distorted by function approximation. However, the TD error is noisy, introducing variance in the gradient.

To cope with this variance, we can reduce the learning rate when the variance of the gradient would otherwise explode, using, e.g., Adam (Kingma and Ba, 2014), natural policy gradients (Kakade, 2002; Amari, 1998; Peters and Schaal, 2008a), the adaptive step size method (Pirotta, Restelli, and Bascetta, 2013) or Newton's method (Furmston and Barber, 2012; Parisi, Pirotta, and Restelli, 2016). However, this results in slow learning when the variance is high. One can also use PGPE, which replaces the stochastic policy with a distribution over deterministic policies (Sehnke et al., 2010). However, PGPE precludes updating the current policy during the episode and makes it difficult to explore efficiently.
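The difference between the sarsa update (6) and the expected sarsa update (7) can be sketched for a tabular critic. This is only an illustration under assumptions that are not in the paper: the two-state problem, the dictionary layout, and the function names `sarsa_target`, `expected_sarsa_target` and `td_update` are all hypothetical.

```python
def sarsa_target(q, r, s_next, a_next, gamma):
    """Sampled target of update (6): r + gamma * Q(a', s')."""
    return r + gamma * q[s_next][a_next]


def expected_sarsa_target(q, policy, r, s_next, gamma):
    """Target of update (7): the next action is marginalised out under the policy."""
    expected_q = sum(policy[s_next][a] * q[s_next][a] for a in q[s_next])
    return r + gamma * expected_q


def td_update(q, s, a, target, alpha):
    """Move Q(a, s) a fraction alpha of the way towards the target."""
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]


# Hypothetical tabular critic and policy (illustrative numbers only).
q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 1.0, "right": 3.0}}
policy = {"s1": {"left": 0.25, "right": 0.75}}
gamma, alpha, reward = 0.9, 0.5, 1.0

# Expected sarsa uses a single deterministic target for the transition ...
target = expected_sarsa_target(q, policy, reward, "s1", gamma)
# ... whereas sarsa's target depends on which next action happened to be sampled.
sampled_targets = {a: sarsa_target(q, reward, "s1", a, gamma) for a in q["s1"]}

new_q = td_update(q, "s0", "left", target, alpha)
```

Averaging the sampled sarsa targets under the policy recovers the expected sarsa target, which is why (7) has the same expectation as (6) but less variance.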
We can also eliminate all variance caused by the policy at the cost of making the policy deterministic and using the DPG update, which usually necessitates performing off-policy exploration. EPG, presented below, reduces to DPG in many useful cases, while providing a principled way to explore and also allowing for stochastic policies. Yet another way to eliminate variance in the actor is not to have an actor at all, instead selecting actions soft-greedily with respect to $\hat Q$ learned using sarsa. This is trivial for discrete actions and can also be done with a one-step Newton's method for Q-functions that are quadric in the actions (Gu et al., 2016b).

Expected Policy Gradients

In this section, we propose expected policy gradients (EPG).

Main Algorithm

First, we introduce $I^Q_\pi(s)$ to denote the inner integral in (1):

$$\nabla J = \int_s d\rho(s) \underbrace{\int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\,(Q(a, s) + b(s))}_{I^Q_\pi(s)} = \int_s d\rho(s)\, I^Q_\pi(s). \tag{9}$$

This suggests a new way to write the approximate gradient:

$$\hat\nabla J = \sum_{t=0}^{T} \underbrace{\gamma^t\, \hat I^{\hat Q}_\pi(s_t)}_{g_t}, \tag{10}$$

where $\hat I^{\hat Q}_\pi(s)$ is some approximation to $I^{\hat Q}_\pi(s) = \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\,(\hat Q(a, s) + b(s))$. This approach makes explicit that one step in estimating the gradient is to evaluate an integral to estimate $I^{\hat Q}_\pi(s)$. The main insight behind EPG is that, given a state, $I^{\hat Q}_\pi(s)$ is expressed fully in terms of known quantities. Hence we can manipulate it analytically to obtain a formula or we can just compute the integral using any numerical quadrature if an analytical solution is impossible.

SPG as given in (2) performs this quadrature using a simple one-sample Monte Carlo method. However, relying on such a method is unnecessary. In fact, the action used to interact with the environment need not be used at all in the evaluation of $\hat I^Q_\pi(s)$ since $a$ is a bound variable in the definition of $I^Q_\pi(s)$. The motivation is thus similar to that of expected sarsa but applied to the actor's gradient estimate instead of the critic's update rule. EPG, shown in Algorithm 1, uses (10) to form a policy gradient algorithm that repeatedly estimates $\hat I^Q_\pi(s)$ with an integration subroutine.

Algorithm 1 Expected Policy Gradients
    s ← s_0, t ← 0
    initialise optimiser, initialise policy π parametrised by θ
    while not converged do
        g_t ← γ^t DO-INTEGRAL(Q̂, s, π_θ)    // g_t is the estimated policy gradient as per (10)
        θ ← θ + optimiser.UPDATE(g_t)
        a ∼ π(·, s)
        s', r ← simulator.PERFORM-ACTION(a)
        Q̂.UPDATE(s, a, r, s')
        t ← t + 1
        s ← s'
    end while

EPG has benefits even when an analytical solution is not possible: if the action space is low dimensional, numerical quadrature is cheap; if it is high dimensional, it is still often worthwhile to balance the expense of simulating the system with the cost of quadrature. Actually, even in the extreme case of expensive quadrature but cheap simulation, the limited resources available for quadrature could still be better spent on EPG with smart quadrature than SPG with simple Monte Carlo.

One of the motivations of DPG was precisely that the simple one-sample Monte-Carlo quadrature implicitly used by SPG often yields high variance gradient estimates, even with a good baseline. To see why, consider Figure 1 (left).
A simple Monte Carlo method evaluates the integral by sampling one or more times from $\pi(a \mid s)$ (blue) and evaluating $\nabla_\mu \log \pi(a \mid s)\, Q(a, s)$ (red) as a function of $a$. A baseline can decrease the variance by adding a multiple of $\nabla_\mu \log \pi(a \mid s)$ to the red curve, but the problem remains that the red curve has high values where the blue curve is almost zero. Consequently, substantial variance persists, whatever the baseline, even with a simple linear Q-function, as shown in Figure 1 (right). DPG addressed this problem for deterministic policies but EPG extends it to stochastic ones.

[Figure 1 plots omitted. Left panel: policy PDF and SPG update against the action. Right panel: variance of MC against the baseline.]

Figure 1: At left, $\pi(a \mid s)$ for a Gaussian policy with mean $\mu = \theta = 0$ at a given state and constant $\sigma^2$ (blue) and the SPG update $\nabla_\theta \log \pi(a \mid s)\, Q(a, s)$ (in red), obtained for $Q = \frac{1}{2} + \frac{1}{2}a$. At right, the variance of a simple single-sample Monte Carlo estimator as a function of the baseline. In a simple multi-sample Monte Carlo method, the variance would go down as the number of samples.

Relationship to Other Methods

EPG has some similarities with VINE sampling (Schulman et al., 2015), which uses an (intrinsically noisy) Monte Carlo quadrature with many samples.² However, the example in Figure 1 shows that even with a computationally expensive many-sample Monte Carlo method, the problem of variance remains, regardless of the baseline.

EPG is also related to variance minimisation techniques that interpolate between two estimators, e.g., (Gu et al., 2016a, Eq. 7) is similar to Corollary 4. However, EPG uses a quadric (not linear) approximation to the critic, which is crucial for exploration. Furthermore, it completely eliminates variance in the inner integral, as opposed to just reducing it.

The idea behind EPG was also independently and concurrently developed as Mean Actor Critic (Asadi et al., 2017), though only for discrete actions and without a supporting theoretical analysis.

Gaussian Policies

EPG is particularly useful when we make the common assumption of a Gaussian policy: we can then perform the integration analytically under reasonable conditions. We show below (see Corollary 3) that the update to the policy mean computed by EPG is equivalent to the DPG update.
Moreover, a simple formula for the covariance can be derived (see Lemma 2). Algorithms 2 and 3 show the resulting special case of EPG, which we call Gaussian policy gradients (GPG). Surprisingly, GPG is on-policy but nonetheless fully equivalent to DPG, an off-policy method, with a particular form of exploration. Hence, GPG, by specifying the policy's covariance, can be seen as a derivation of an exploration strategy for DPG. In this way, GPG addresses an important open question. As we show later, this leads to improved performance in practice.

² VINE sampling also differs from EPG by performing independent rollouts of Q, requiring a simulator with reset.
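The variance argument around Figure 1 can be reproduced numerically. The sketch below uses the setting of the Figure 1 caption (one-dimensional Gaussian policy with mean 0 and critic $Q = \frac{1}{2} + \frac{1}{2}a$); the value `SIGMA = 0.5`, the trapezoidal rule, and all function names are assumptions made for illustration, so the numbers need not match the plotted figure.

```python
import math


def gaussian_pdf(a, mu, sigma):
    """Density of N(mu, sigma^2) at a."""
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def score_mu(a, mu, sigma):
    """Gradient of log N(a; mu, sigma^2) with respect to the mean mu."""
    return (a - mu) / sigma ** 2


def critic(a):
    """Linear critic from the Figure 1 caption: Q = 1/2 + a/2."""
    return 0.5 + 0.5 * a


def trapezoid(f, lo, hi, n=20001):
    """Composite trapezoidal rule with n equally spaced nodes."""
    h = (hi - lo) / (n - 1)
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n - 1):
        total += f(lo + i * h)
    return total * h


def spg_term(a, mu, sigma, baseline):
    """The quantity a one-sample SPG estimator evaluates at a sampled action."""
    return score_mu(a, mu, sigma) * (critic(a) + baseline)


def inner_integral(mu, sigma, baseline=0.0):
    """Inner integral of (9) by quadrature: no action sampling involved."""
    lo, hi = mu - 10.0 * sigma, mu + 10.0 * sigma
    return trapezoid(
        lambda a: gaussian_pdf(a, mu, sigma) * spg_term(a, mu, sigma, baseline), lo, hi
    )


def one_sample_variance(mu, sigma, baseline=0.0):
    """Variance of spg_term when a single action is drawn from the policy."""
    lo, hi = mu - 10.0 * sigma, mu + 10.0 * sigma
    second_moment = trapezoid(
        lambda a: gaussian_pdf(a, mu, sigma) * spg_term(a, mu, sigma, baseline) ** 2, lo, hi
    )
    return second_moment - inner_integral(mu, sigma, baseline) ** 2


MU, SIGMA = 0.0, 0.5  # SIGMA is an assumed value; the paper only says "constant"
exact = inner_integral(MU, SIGMA)
variances = {b: one_sample_variance(MU, SIGMA, b) for b in (-1.0, -0.5, 0.0, 0.5)}
```

Under these assumptions the quadrature returns the same value for every baseline, while the single-sample variance stays bounded away from zero even at the best baseline in the grid, which is the behaviour the text attributes to Figure 1 (right).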

Algorithm 2 Gaussian Policy Gradients
    s ← s_0, t ← 0
    initialise optimiser
    while not converged do
        g_t ← γ^t DO-INTEGRAL-GAUSS(Q̂, s, π_θ)
        θ ← θ + optimiser.UPDATE(g_t)          // policy parameters θ are updated using gradient
        Σ_s ← GET-COVARIANCE(Q̂, s, π_θ)        // Σ_s computed from scratch
        a ∼ π(· | s)                            // π(· | s) = N(µ_s, Σ_s)
        s', r ← simulator.PERFORM-ACTION(a)
        Q̂.UPDATE(s, a, r, s')
        t ← t + 1
        s ← s'
    end while

Algorithm 3 Gaussian Integrals
    function DO-INTEGRAL-GAUSS(Q̂, s, π_θ)
        I^Q_{π(s),µ_s} ← (∇µ_s) ∇_a Q̂(a = µ_s, s)    // Use Lemma 1
        return I^Q_{π(s),µ_s}
    end function

    function GET-COVARIANCE(Q̂, s, π_θ)
        H ← COMPUTE-HESSIAN(Q̂(µ_s, s))
        return σ_0^2 e^{cH}                     // Use Lemma 2
    end function

The computational cost of GPG is small: while it must store a Hessian matrix $H(a, s) = \nabla^2_a \hat Q(a, s)$, its size is only $d \times d$, where $A = \mathbb{R}^d$, which is typically small, e.g., $d = 6$ for HalfCheetah-v1. This Hessian is the same size as the policy's covariance matrix, which any policy gradient must store anyway, and should not be confused with the Hessian with respect to the parameters of the neural network, as used with Newton's or natural gradient methods (Peters and Schaal, 2008a; Furmston, Lever, and Barber, 2016), which can easily have thousands of entries. Hence, GPG obtains EPG's variance reduction essentially for free.

Analysis

In this section, we analyse EPG, showing that it unifies SPG and DPG, that $\hat I^Q_\pi(s)$ can often be computed analytically, and that EPG has lower variance than SPG.

General Policy Gradient Theorem

We begin by stating our most general result, showing that EPG can be seen as a generalisation of both SPG and DPG. To do this, we first state a new general policy gradient theorem. We use the shorthand $\nabla$ without a subscript to denote the gradient with respect to policy parameters $\theta$.

Theorem 1 (General Policy Gradient Theorem). If $\pi(\cdot, s)$ is a normalised Lebesgue measure for all $s$, then

$$\nabla J = \int_s d\rho(s) \underbrace{\left[\nabla V(s) - \int_a d\pi(a, s)\, \nabla Q(a, s)\right]}_{I_G(s)}.$$

Proof. We begin by expanding the following expression.

$$\begin{aligned}
\int_s d\rho(s) \int_a d\pi(a, s)\, \nabla Q(a, s)
&= \int_s d\rho(s) \int_a d\pi(a, s)\, \nabla\Big(R(a, s) + \gamma \int_{s'} dp(s' \mid a, s)\, V(s')\Big) \\
&= \int_s d\rho(s) \int_a d\pi(a, s)\, \Big(\underbrace{\nabla R(a, s)}_{0} + \gamma \int_{s'} dp(s' \mid a, s)\, \nabla V(s')\Big) \\
&= \gamma \int_s d\rho(s) \int_{s'} dp_\pi(s' \mid s)\, \nabla V(s') \\
&= \int_s d\rho(s)\, \nabla V(s) - \underbrace{\int_s dp_0(s)\, \nabla V(s)}_{\nabla J} \\
&= \int_s d\rho(s)\, \nabla V(s) - \nabla J.
\end{aligned}$$
The first equality follows by expanding the definition of $Q$ and the penultimate one follows from Lemma B (in the supplement). Then the theorem follows by rearranging terms.

The crucial benefit of Theorem 1 is that it works for all policies, both stochastic and deterministic, unifying previously separate derivations for the two settings. To show this, in the following two corollaries, we use Theorem 1 to recover the stochastic policy gradient theorem (Sutton et al., 2000) and the deterministic policy gradient theorem (Silver et al., 2014), in each case by introducing additional assumptions to obtain a formula for $I_G(s)$ expressible in terms of known quantities.

Corollary 1 (Stochastic Policy Gradient Theorem). If $\pi(\cdot \mid s)$ is differentiable, then

$$\nabla J = \int_s d\rho(s)\, I_G(s) = \int_s d\rho(s) \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\, Q(a, s).$$

Proof. We obtain the following by expanding $\nabla V$.

$$\nabla V = \nabla \int_a d\pi(a, s)\, Q(a, s) = \int_a da\, (\nabla \pi(a, s))\, Q(a, s) + \int_a d\pi(a, s)\, (\nabla Q(a, s)).$$

We obtain $I_G(s) = \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\, Q(a, s) = I^Q_\pi(s)$ by plugging this into the definition of $I_G(s)$. We obtain $\nabla J$ by invoking Theorem 1 and plugging in the above expression for $I_G(s)$.

We now recover the DPG update introduced in (3).

Corollary 2 (Deterministic Policy Gradient Theorem). If $\pi(\cdot \mid s)$ is a Dirac-delta measure (i.e., a deterministic policy) and $Q(\cdot, s)$ is differentiable, then

$$\nabla J = \int_s d\rho(s)\, I_G(s) = \int_s d\rho(s)\, \nabla \pi(s)\, \nabla_a Q(a = \pi(s), s).$$

Proof. We begin by obtaining an expression for $I_G(s)$.

$$I_G(s) = \nabla V(s) - \int_a d\pi(a, s)\, \nabla Q(a, s) = \nabla V(s) - \gamma \int_{s'} dp_\pi(s' \mid s)\, \nabla V(s') = \nabla \pi(s)\, \nabla_a Q(a = \pi(s), s).$$

Here, the second equality follows by expanding the definition of $Q$ and the third follows from an established deterministic policy gradient result (Silver et al., 2014, Supplement, Eq. 1). We can then obtain $\nabla J$ by invoking Theorem 1 and plugging in the above expression for $I_G(s)$.

These corollaries show that the choice between deterministic and stochastic policy gradients is fundamentally a choice of quadrature method. Hence, the empirical success of DPG relative to SPG (Silver et al., 2014; Lillicrap et al., 2015) can be understood in a new light. In particular, it can be attributed, not to a fundamental limitation of stochastic policies (indeed, stochastic policies are sometimes preferred), but instead to superior quadrature. DPG integrates over a Dirac-delta measure, which is known to be easy, while SPG typically relies on simple Monte Carlo integration. Thanks to EPG, a deterministic approach is no longer required to obtain a method with low variance.

We add as a sideline that since Theorem 1 can be written as $I_G(s) = \nabla V(s) - \gamma \int_{s'} dp_\pi(s' \mid s)\, \nabla V(s')$, which involves the derivative of the value function, GPG resembles value gradients (Heess et al., 2015). However, in our case, we are learning $\nabla J$ directly and do not perform a recursive estimation of $\nabla V$ as value gradient methods do.

Analytical Quadratures - Gaussian Policy

We now derive a lemma supporting GPG.

Lemma 1 (Gaussian Policy Gradients). If the policy is Gaussian, i.e. $\pi(\cdot \mid s) \sim \mathcal{N}(\mu_s, \Sigma_s)$ with $\mu_s$ and $\Sigma_s^{1/2}$ parametrised by $\theta$, where $\Sigma_s^{1/2}$ is symmetric and $\Sigma_s^{1/2} \Sigma_s^{1/2} = \Sigma_s$, and the critic is of the form $Q(a, s) = a^\top A(s)\, a + a^\top B(s) + \text{const}$ where $A(s)$ is symmetric for every $s$, then $I^Q_\pi(s) = I^Q_{\pi(s),\mu_s} + I^Q_{\pi(s),\Sigma_s^{1/2}}$, where the mean and covariance components are given by

$$I^Q_{\pi(s),\mu_s} = (\nabla \mu_s)(2A(s)\mu_s + B(s)) \quad\text{and}\quad I^Q_{\pi(s),\Sigma_s^{1/2}} = (\nabla \Sigma_s^{1/2})\, 2A(s)\, \Sigma_s^{1/2}.$$

See Lemma 1 in the supplement for a proof of this result. While Lemma 1 requires the critic to be quadric in the actions, this assumption is not very restrictive since the coefficients $B(s)$ and $A(s)$ can be arbitrary continuous functions of the state, e.g., a neural network.

Arbitrary Critics

If $Q$ does not meet the conditions of Lemma 1, we can approximate $Q$ with a quadric function in the neighbourhood of the policy mean.
This approximation is motivated by two arguments. First, in MDPs that model physical systems with reasonable reward functions, $Q$ is fairly smooth. Second, policy gradients are local, incremental methods anyway: since the policy mean changes slowly, the values of $Q$ for actions far from the policy mean are usually not relevant for the current update.

Corollary 3 (Approximate Gaussian Policy Gradients with an Arbitrary Critic). If the policy is Gaussian, i.e. $\pi(\cdot \mid s) \sim \mathcal{N}(\mu_s, \Sigma_s)$ with $\mu_s$ and $\Sigma_s^{1/2}$ parametrised by $\theta$ as in Lemma 1, and any critic $Q(a, s)$ is doubly differentiable with respect to actions for each state, then

$$I^Q_{\pi(s),\mu_s} \approx (\nabla \mu_s)\, \nabla_a Q(a = \mu_s, s) \quad\text{and}\quad I^Q_{\pi(s),\Sigma_s^{1/2}} \approx (\nabla \Sigma_s^{1/2})\, H(\mu_s, s)\, \Sigma_s^{1/2},$$

where $H(\mu_s, s)$ is the Hessian of $Q$ with respect to $a$, evaluated at $\mu_s$ for a fixed $s$.

Proof. We begin by approximating the critic (for a given $s$) using the first two terms of the Taylor expansion of $Q$ in $\mu_s$.

$$\begin{aligned}
Q(a, s) &\approx Q(\mu_s, s) + (a - \mu_s)^\top \nabla_a Q(a = \mu_s, s) + \tfrac{1}{2}(a - \mu_s)^\top H(\mu_s, s)(a - \mu_s) \\
&= \tfrac{1}{2} a^\top H(\mu_s, s)\, a + a^\top\big(\nabla_a Q(a = \mu_s, s) - H(\mu_s, s)\mu_s\big) + \text{const}.
\end{aligned}$$

Because of the series truncation, the function on the right-hand side is quadric and we can then use Lemma 1:

$$\begin{aligned}
I^Q_{\pi(s),\mu_s} &= (\nabla \mu_s)\big(2 \cdot \tfrac{1}{2} H(\mu_s, s)\mu_s + \nabla_a Q(a = \mu_s, s) - H(\mu_s, s)\mu_s\big) = (\nabla \mu_s)\, \nabla_a Q(a = \mu_s, s), \\
I^Q_{\pi(s),\Sigma_s^{1/2}} &= (\nabla \Sigma_s^{1/2})\big(2 \cdot \tfrac{1}{2} H(\mu_s, s)\, \Sigma_s^{1/2}\big) = (\nabla \Sigma_s^{1/2})\, H(\mu_s, s)\, \Sigma_s^{1/2}.
\end{aligned}$$

To actually obtain the Hessian, we could use automatic differentiation to compute it analytically. Alternatively, we can observe that, if the critic really is quadric, we can just read off the coefficients of the quadratic term directly. Therefore, we can approximate the Hessian by generating a number of random action-values around $\mu_s$, computing the $Q$ values, and (locally) fitting a quadric. This process is typically more computationally expensive than automatic differentiation but has the advantage of working with ReLU networks (where the true Hessian is zero but we still have a kind of global curvature after smoothing) and leveraging more information from the critic (since the evaluation is at more than one point).

Linear GPG

We now state a consequence of Lemma 1 for the case when the critic $Q$ is linear in the actions, i.e., the quadratic term is always zero.

Corollary 4 (Linear Gaussian Policy Gradients). If the policy is Gaussian, i.e., $\pi(\cdot \mid s) \sim \mathcal{N}(\mu_s, \Sigma_s)$ with $\mu_s$ parametrised by $\theta$ and the critic is of the form $Q(a, s) = a^\top B(s) + \text{const}$, then $I^Q_\pi(s) = (\nabla \mu_s) B(s)$. Moreover, it is unnecessary to parameterise $\Sigma^{1/2}$ since the policy gradient w.r.t. $\Sigma^{1/2}$ is zero (i.e., a linear Q-function does not give any information about the exploration covariance).

We make Corollary 4 explicit for two reasons. First, it is useful for showing an equivalence between DPG and EPG (see below). Second, it may actually be useful for a non-trivial class of physical systems: if the time-sampling frequency is high enough (which implies acting in small steps), the critic is effectively only used to say if a small step one way is preferable to a small step the other way, which is a linear property.

Equivalence between EPG and DPG

The update for the policy mean obtained in Corollary 3 is the same as the DPG update, linking the two methods:

$$I^Q_{\pi(s),\mu_s} = (\nabla \mu_s)\, \nabla_a Q(a = \mu_s, s).$$
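The identity above, together with the covariance component of Lemma 1, can be checked numerically in one dimension. The sketch below assumes the simplest parametrisation $\theta = (\mu, \sigma)$ with $\sigma = \Sigma^{1/2}$, so that $\nabla \mu_s = 1$ and $\nabla \Sigma_s^{1/2} = 1$; the critic coefficients, the finite-difference step, and all function names are hypothetical choices for illustration.

```python
import math


def trapezoid(f, lo, hi, n=20001):
    """Composite trapezoidal rule with n equally spaced nodes."""
    h = (hi - lo) / (n - 1)
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n - 1):
        total += f(lo + i * h)
    return total * h


def gaussian_pdf(a, mu, sigma):
    """Density of N(mu, sigma^2) at a."""
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def expect(f, mu, sigma):
    """Expectation of f(a) under a ~ N(mu, sigma^2), by quadrature."""
    return trapezoid(
        lambda a: gaussian_pdf(a, mu, sigma) * f(a), mu - 10.0 * sigma, mu + 10.0 * sigma
    )


def quadric_critic(a, coeff_a=-1.5, coeff_b=2.0, const=0.7):
    """Hypothetical critic Q(a) = A a^2 + B a + const for one fixed state."""
    return coeff_a * a * a + coeff_b * a + const


def mean_component(q, mu, sigma):
    """Inner integral with the score taken with respect to the mean."""
    return expect(lambda a: (a - mu) / sigma ** 2 * q(a), mu, sigma)


def sqrt_cov_component(q, mu, sigma):
    """Inner integral with the score taken with respect to sigma = Sigma^(1/2)."""
    return expect(lambda a: (-1.0 / sigma + (a - mu) ** 2 / sigma ** 3) * q(a), mu, sigma)


def grad_and_hessian(q, a, eps=1e-4):
    """Central finite differences for dQ/da and d^2Q/da^2 at a."""
    grad = (q(a + eps) - q(a - eps)) / (2.0 * eps)
    hess = (q(a + eps) - 2.0 * q(a) + q(a - eps)) / eps ** 2
    return grad, hess


MU, SIGMA = 0.3, 0.4
grad, hess = grad_and_hessian(quadric_critic, MU)
i_mu = mean_component(quadric_critic, MU, SIGMA)         # should match grad (DPG update)
i_sigma = sqrt_cov_component(quadric_critic, MU, SIGMA)  # should match hess * SIGMA
```

In this setting the quadrature over the whole policy agrees with a single gradient evaluation at the mean, which is the sense in which the DPG update computes the EPG integral for the mean.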

We now formalise the equivalences between EPG and DPG. First, on-policy GPG with a linear critic (or an arbitrary critic approximated by the first term in the Taylor expansion) is equivalent to DPG with a Gaussian exploration policy where the covariance stays the same. This follows from Corollary 4. Second, on-policy GPG with a quadric critic (or an arbitrary critic approximated by the first two terms in the Taylor expansion) is equivalent to DPG with a Gaussian exploration policy where the covariance is computed using the update (where $\alpha_n$ is a sequence of step-sizes):

$$\Sigma_s^{1/2} \leftarrow \Sigma_s^{1/2} + \alpha_n H(s)\, \Sigma_s^{1/2}. \tag{11}$$

This follows from Corollary 3. Third, and most generally, for any critic at all (not necessarily quadric), DPG is a kind of EPG for a particular choice of quadrature (using a Dirac measure). This follows from Theorem 1.

Surprisingly, this means that DPG, normally considered to be off-policy, can also be seen as on-policy when exploring with Gaussian noise. Furthermore, the compatible critic for DPG (Silver et al., 2014) is indeed linear in the actions. Hence, this relationship holds whenever DPG uses a compatible critic.³ Furthermore, Lemma 1 lends new legitimacy to the common practice of replacing the critic required by the DPG theory, which approximates $\nabla_a Q$, with one that approximates $Q$ itself, as done in SPG and EPG.

³ The notion of compatibility of a critic is different for stochastic and deterministic policy gradients.

Exploration using the Hessian

The second equivalence given above suggests that we can include the covariance in the actor network and learn it along with the mean. However, another option is to compute it from scratch at each iteration by analytically computing the result of applying (11) infinitely many times.

Lemma 2 (Exploration Limit). The iterative procedure defined by the equation $\Sigma_s^{1/2} \leftarrow \Sigma_s^{1/2} + \alpha H(s)\, \Sigma_s^{1/2}$ applied $n$ times using the diminishing learning rate $\alpha = 1/n$ converges to $\Sigma_s^{1/2} \propto e^{H(s)}$ as $n \to \infty$.

Proof. Consider the sequence $(\Sigma^{1/2})_0 = \sigma_0 I$, $(\Sigma^{1/2})_n = (\Sigma^{1/2})_{n-1} + \alpha H(s)(\Sigma^{1/2})_{n-1}$. Expanding out the recursion, the $n$-th element of the sequence is given as $(\Sigma^{1/2})_n = (I + \alpha H(s))^n (\Sigma^{1/2})_0$. We diagonalise the Hessian as $H(s) = U \Lambda U^\top$ for some orthonormal matrix $U$ and obtain the following expression for $(\Sigma^{1/2})_n$.

$$(\Sigma^{1/2})_n = (I + \alpha U \Lambda U^\top)^n (\Sigma^{1/2})_0 = U (I + \alpha \Lambda)^n U^\top (\Sigma^{1/2})_0.$$

Since we have $\lim_{n \to \infty} (1 + \frac{1}{n}\lambda)^n = e^\lambda$ for each diagonal entry of $\Lambda$, we plug in $\alpha = \frac{1}{n}$ and obtain the identity:

$$\lim_{n \to \infty} (\Sigma^{1/2})_n = U e^{\Lambda} U^\top (\Sigma^{1/2})_0 = \sigma_0 e^{H(s)}.$$

The practical implication of Lemma 2 is that, in a policy gradient method, it is justified to use Gaussian exploration with covariance proportional to $e^{cH}$ for some reward scaling constant $c$. Thus by exploring with (scaled) covariance $e^{cH}$, we obtain a principled alternative to the Ornstein-Uhlenbeck heuristic defined in (5). Our results below show that it also performs much better in practice.

Lemma 2 has an intuitive interpretation. If $H(s)$ has a large positive eigenvalue $\lambda$, then $\hat Q(\cdot, s)$ has a sharp minimum along the corresponding eigenvector, and the corresponding eigenvalue of $\Sigma$ is $e^\lambda$, i.e., also large. The result is a large exploration bonus along that direction, enabling the algorithm to leave a local minimum. Conversely, if $\lambda$ is negative, then $\hat Q(\cdot, s)$ has a maximum and so $e^\lambda$ is small, since exploration is not needed.

Variance Analysis

We now prove that for any policy, the EPG estimator of (10) has lower variance than the SPG estimator of (2).

Lemma 3. If for all $s \in S$, the random variable $\nabla \log \pi(a \mid s)\, \hat Q(a, s)$ where $a \sim \pi(\cdot \mid s)$ has nonzero variance, then

$$\mathbb{V}_\tau\left[\sum_{t=0}^{\infty} \gamma^t\, \nabla \log \pi(a_t \mid s_t)\,(\hat Q(a_t, s_t) + b(s_t))\right] > \mathbb{V}_\tau\left[\sum_{t=0}^{\infty} \gamma^t\, I^{\hat Q}_\pi(s_t)\right].$$

The proof is deferred to the supplement (see Lemma 3 there). Lemma 3's assumption is reasonable since the only way a random variable $\nabla \log \pi(a \mid s)\, \hat Q(a, s)$ could have zero variance is if it were the same for all actions in the policy's support (except for sets of measure zero), in which case optimising the policy would be unnecessary. Since we know that both the estimators of (2) and (10) are unbiased, the estimator with lower variance has lower MSE.

Extension to Entropy Regularisation

On-policy SPG sometimes includes an entropy term in the gradient in order to aid exploration by making the policy more stochastic. The gradient of the differential entropy⁴ $\mathcal{H}(s)$ of the policy at state $s$ is defined as follows.
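$$\begin{aligned}
-\nabla \mathcal{H}(s) &= \nabla \int_a d\pi(a \mid s)\, \log \pi(a \mid s) \\
&= \int_a da\, \nabla \pi(a \mid s)\, \log \pi(a \mid s) + \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s) \\
&= \int_a da\, \nabla \pi(a \mid s)\, \log \pi(a \mid s) + \int_a d\pi(a \mid s)\, \frac{1}{\pi(a \mid s)} \nabla \pi(a \mid s) \\
&= \int_a da\, \nabla \pi(a \mid s)\, \log \pi(a \mid s) + \underbrace{\nabla \int_a d\pi(a \mid s)\, 1}_{0} \\
&= \int_a da\, \nabla \pi(a \mid s)\, \log \pi(a \mid s) = \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\, \log \pi(a \mid s).
\end{aligned}$$

Typically, the entropy update is added to the policy gradient update with a weight $\alpha$:

$$I^E_G(s) = I_G(s) + \alpha \nabla \mathcal{H}(s) = \int_a d\pi(a \mid s)\, \nabla \log \pi(a \mid s)\,(Q(a, s) - \alpha \log \pi(a \mid s)).$$

This equation makes clear that performing entropy regularisation is equivalent to using a different critic with Q-values shifted by $\alpha \log \pi(a \mid s)$; this holds for both SPG and EPG.

⁴ For discrete action spaces, the same derivation with integrals replaced by sums holds for the entropy.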
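The entropy derivation can be sanity-checked for a one-dimensional Gaussian policy, whose differential entropy $\frac{1}{2}\log(2\pi e \sigma^2)$ has derivative $1/\sigma$ with respect to $\sigma$ and $0$ with respect to $\mu$. The sketch below is an illustration under assumptions: the parameters are taken to be $(\mu, \sigma)$ directly, the linear critic and the weight `ALPHA = 0.1` are hypothetical, and the integral is computed with a plain trapezoidal rule.

```python
import math


def trapezoid(f, lo, hi, n=20001):
    """Composite trapezoidal rule with n equally spaced nodes."""
    h = (hi - lo) / (n - 1)
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n - 1):
        total += f(lo + i * h)
    return total * h


def log_pdf(a, mu, sigma):
    """Log-density of N(mu, sigma^2) at a."""
    z = (a - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def expect(f, mu, sigma):
    """Expectation of f(a) under a ~ N(mu, sigma^2), by quadrature."""
    return trapezoid(
        lambda a: math.exp(log_pdf(a, mu, sigma)) * f(a), mu - 10.0 * sigma, mu + 10.0 * sigma
    )


def score_mu(a, mu, sigma):
    """d/dmu of the log-density."""
    return (a - mu) / sigma ** 2


def score_sigma(a, mu, sigma):
    """d/dsigma of the log-density."""
    return -1.0 / sigma + (a - mu) ** 2 / sigma ** 3


def entropy_gradient(score, mu, sigma):
    """grad H = - int dpi(a|s) grad log pi(a|s) log pi(a|s)."""
    return -expect(lambda a: score(a, mu, sigma) * log_pdf(a, mu, sigma), mu, sigma)


def regularised_integral(score, q, mu, sigma, alpha):
    """Inner integral with the critic shifted by -alpha * log pi(a|s)."""
    return expect(
        lambda a: score(a, mu, sigma) * (q(a) - alpha * log_pdf(a, mu, sigma)), mu, sigma
    )


def linear_critic(a):
    """Hypothetical linear critic Q = 1/2 + a/2."""
    return 0.5 + 0.5 * a


MU, SIGMA, ALPHA = 0.0, 0.5, 0.1
dh_dsigma = entropy_gradient(score_sigma, MU, SIGMA)  # analytic value: 1 / SIGMA
dh_dmu = entropy_gradient(score_mu, MU, SIGMA)        # analytic value: 0
reg_mu = regularised_integral(score_mu, linear_critic, MU, SIGMA, ALPHA)
reg_sigma = regularised_integral(score_sigma, linear_critic, MU, SIGMA, ALPHA)
```

With a linear critic the unregularised gradient with respect to $\sigma$ is zero (Corollary 4), so in this sketch the shifted critic is the only source of pressure on the exploration width.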

Domin ˆσ DPG ˆσ EPG HlfCheeth-v1 1336.39 [1107.85, 1614.51 InvertedPendulum-v1 291.26 [241.45, 351.88 Recher2d-v1 1.22 [0.63, 2.31 Wlker2d-1 543.54 [450.58, 656.65 1056.15 [875.54, 1275.94 0.00 n/ 0.13 [0.07, 0.26 762.35 [631.98, 921.00 0 5000 EPG (40 run) DPG (40 run) SPG (40 run) 40 120 200 280 0 1000 EPG (5 run) DPG (40 run) SPG (40 run) 20 40 60 80 Tble 1: Etimted tndrd devition (men nd 90% intervl) cro run fter lerning. Experiment While EPG h mny potentil ue, we focu on empiriclly evluting one prticulr ppliction: explortion driven by the Hein exponentil ( introduced in Algorithm 2 nd Lemm 2), replcing the tndrd Orntein-Uhlenbeck (OU) explortion in continuou ction domin. To thi end, we pplied EPG to four domin modelled with the Mu- JoCo phyic imultor (Todorov, Erez, nd T, 2012): HlfCheeth-v1, InvertedPendulum-v1, Recher2d-v1 nd Wlker2d-v1 nd compred it performnce to DPG nd SPG. In prctice, EPG differed from deep DPG (Lillicrp et l., 2015; Silver et l., 2014) only in the explortion trtegy, though their theoreticl underpinning re different. The hyperprmeter for DPG nd thoe of EPG tht re not relted to explortion were tken from n exiting benchmrk (Ilm et l., 2017; Brockmn et l., 2016). The explortion hyperprmeter for EPG were σ 2 0 = 0.2 nd c = 1.0 where the explortion covrince i σ 2 0e ch. Thee vlue were obtined uing grid erch from the et {0.2, 0.5, 1} for σ 2 0 nd {0.5, 1.0, 2.0} for c over the HlfCheeth-v1 domin. Since c i jut contnt cling the rewrd, it i reonble to et it to 1.0 whenever rewrd cling i lredy ued. Hence, our explortion trtegy h jut one hyperprmeter σ 2 0 oppoed pecifying pir of prmeter (tndrd devition nd men reverion contnt) for OU. We ued the me lerning prmeter for the other domin. For SPG 5, we ued OU explortion nd contnt digonl covrince of 0.2 in the ctor updte (thi pproximtely correpond to the verge vrince of the OU proce over time). The other prmeter for SPG re the me for the ret of the lgorithm. 
We computed 90% confidence intervals around the learning curves. The learning curves show results of independent evaluation runs which used actions generated by the policy mean without any exploration noise. The results (Figure 2) show that EPG's exploration strategy yields much better performance than DPG with OU. Furthermore, SPG does poorly, solving only the easiest domain (InvertedPendulum-v1) reasonably quickly, achieving slow progress on HalfCheetah-v1, and failing entirely on the other domains. This is not surprising, as DPG was introduced precisely to solve the problem of high variance SPG estimates on this type of problem. In InvertedPendulum-v1, SPG initially learns quickly, outperforming the other methods. This is because noisy gradient updates provide a crude, indirect form of exploration that happens to suit this problem. Clearly, this is inadequate for more complex domains: even for this simple domain it leads to subpar performance late in learning.

[Figure 2, bottom panels, legends: EPG (5 runs), DPG (5 runs), SPG (10 runs) for Reacher2d-v1; EPG (40 runs), DPG (40 runs), SPG (10 runs) for Walker2d-v1.]

Figure 2: Learning curves (mean and 90% interval) for HalfCheetah-v1 (top left), InvertedPendulum-v1 (top right), Reacher2d-v1 (bottom left, clipped at -14) and Walker2d-v1 (bottom right). The number of independent training runs is in parentheses. Horizontal axis is scaled in thousands of steps.

Figure 3: Three runs for EPG (left), DPG (middle) and SPG (right) for the InvertedPendulum-v1 domain, demonstrating that EPG shows much less unlearning.

In addition, EPG typically learns more consistently than DPG with OU. In two tasks, the empirical standard deviation across runs of EPG ($\hat{\sigma}_{\text{EPG}}$) was substantially lower than that of DPG ($\hat{\sigma}_{\text{DPG}}$) at the end of learning, as shown in Table 1.

[5] We tried learning the covariance for SPG but the covariance estimate was unstable; no regularisation hyperparameters we tested matched SPG's performance with OU even on the simplest domain.
For the other two domains, the confidence intervals around the empirical standard deviations for DPG and EPG were too wide to draw conclusions.

Surprisingly, for InvertedPendulum-v1, DPG's learning curve declines late in learning. The reason can be seen in the individual runs shown in Figure 3: both DPG and SPG suffer from severe unlearning. This unlearning cannot be explained by exploration noise since the evaluation runs just use the mean action, without exploring. Instead, OU exploration in DPG may be too coarse, causing the optimiser to exit good optima, while SPG unlearns due to noise in the gradients. The noise also helps speed initial learning, as described above, but this does not transfer to other domains. EPG avoids this problem by automatically reducing the noise when it finds a good optimum, i.e., a Hessian with large negative eigenvalues.

Conclusions

This paper proposed a new policy gradient method called expected policy gradients (EPG), which integrates across the actions selected by the stochastic policy. We used EPG to prove a new general policy gradient theorem subsuming the stochastic and deterministic policy gradient theorems. We also showed that, under certain realistic conditions, the quadrature required by EPG can be performed analytically, allowing DPG with principled exploration. We presented empirical results confirming that this application of EPG outperforms DPG and SPG on four domains.

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement number 637713).

References

Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural Computation 10(2):251-276.

Asadi, K.; Allen, C.; Roderick, M.; Mohamed, A.-r.; Konidaris, G.; and Littman, M. 2017. Mean Actor Critic. ArXiv e-prints.

Baird, L., et al. 1995. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, 30-37.

Bhatnagar, S.; Ghavamzadeh, M.; Lee, M.; and Sutton, R. S. 2008. Incremental natural actor-critic algorithms. In Advances in Neural Information Processing Systems, 105-112.

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.

Furmston, T., and Barber, D. 2012. A unifying perspective of parametric policy search methods for Markov decision processes. In Advances in Neural Information Processing Systems, 2717-2725.

Furmston, T.; Lever, G.; and Barber, D. 2016. Approximate Newton methods for policy search in Markov decision processes. Journal of Machine Learning Research 17(227):1-51.

Gu, S.; Lillicrap, T.; Ghahramani, Z.; Turner, R. E.; and Levine, S. 2016. Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247.

Gu, S.; Lillicrap, T.; Sutskever, I.; and Levine, S. 2016b. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, 2829-2838.

Heess, N.; Wayne, G.; Silver, D.; Lillicrap, T.; Erez, T.; and Tassa, Y. 2015. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, 2944-2952.

Islam, R.; Henderson, P.; Gomrokchi, M.; and Precup, D. 2017. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133.

Kakade, S. M. 2002. A natural policy gradient. In Advances in Neural Information Processing Systems, 1531-1538.

Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Lagoudakis, M. G., and Parr, R. 2003. Least-squares policy iteration. Journal of Machine Learning Research 4(Dec):1107-1149.

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Parisi, S.; Pirotta, M.; and Restelli, M. 2016. Multi-objective reinforcement learning through continuous Pareto manifold approximation. Journal of Artificial Intelligence Research 57:187-227.

Peters, J., and Schaal, S. 2006. Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, 2219-2225. IEEE.

Peters, J., and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71(7):1180-1190.

Peters, J., and Schaal, S. 2008b. Reinforcement learning of motor skills with policy gradients. Neural Networks 21(4):682-697.

Pirotta, M.; Restelli, M.; and Bascetta, L. 2013. Adaptive step-size for policy gradient methods. In Advances in Neural Information Processing Systems, 1394-1402.

Rummery, G. A., and Niranjan, M. 1994. On-line Q-learning using connectionist systems. University of Cambridge, Department of Engineering.

Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 1889-1897.

Sehnke, F.; Osendorfer, C.; Rückstieß, T.; Graves, A.; Peters, J.; and Schmidhuber, J. 2010. Parameter-exploring policy gradients. Neural Networks 23(4):551-559.

Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In ICML.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement learning: An introduction, volume 1. MIT Press Cambridge.

Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057-1063.

Sutton, R. S. 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems 1038-1044.

Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, 5026-5033. IEEE.

Uhlenbeck, G. E., and Ornstein, L. S. 1930. On the theory of the Brownian motion. Physical Review 36(5):823.

van Seijen, H.; van Hasselt, H.; Whiteson, S.; and Wiering, M. 2009. A theoretical and empirical analysis of Expected Sarsa. In ADPRL 2009: Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 177-184.

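Supplement

We first provide formal proofs for certain statements invoked by our paper. We then provide a brief discussion of the use of a learning rate that diminishes in the trajectory length in the computation of the covariance.

Proofs

First, we prove two lemmas concerning the discounted-ergodic measure $\rho(s)$ which have been implicitly relied on for some time but, as far as we could find, never proved explicitly.

Definition 1 (Time-dependent occupancy).

$$p(s \mid t = 0) = p_0(s), \qquad p(s' \mid t = i + 1) = \int_s p(s' \mid s)\, p(s \mid t = i)\, ds \quad \text{for } i \ge 0.$$

Definition 2 (Truncated trajectory). Define the trajectory truncated after $N$ steps as $\tau_N = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_N)$.

Observation 1 (Expectation with respect to truncated trajectory). Since $\tau_N = (s_0, s_1, s_2, \dots, s_N)$ is associated with the density $\prod_{j=0}^{N-1} p(s_{j+1} \mid s_j)\, p_0(s_0)$, we have that

$$
\begin{aligned}
\mathbb{E}_{\tau_N}\left[\sum_{i=0}^{N} \gamma^i f(s_i)\right]
&= \int_{s_0, s_1, \dots, s_N} \left(\prod_{j=0}^{N-1} p(s_{j+1} \mid s_j)\right) p_0(s_0) \left(\sum_{i=0}^{N} \gamma^i f(s_i)\right) ds_0\, ds_1 \dots ds_N \\
&= \sum_{i=0}^{N} \int_{s_0, s_1, \dots, s_N} \left(p_0(s_0) \prod_{j=0}^{N-1} p(s_{j+1} \mid s_j)\right) \gamma^i f(s_i)\, ds_0\, ds_1 \dots ds_N \\
&= \sum_{i=0}^{N} \int_s p(s \mid t = i)\, \gamma^i f(s)\, ds
\end{aligned}
$$

for any function $f$.

Definition 3 (Expectation with respect to infinite trajectory). For any bounded function $f$, we have

$$\mathbb{E}_{\tau}\left[\sum_{i=0}^{\infty} \gamma^i f(s_i)\right] \triangleq \lim_{N \to \infty} \mathbb{E}_{\tau_N}\left[\sum_{i=0}^{N} \gamma^i f(s_i)\right].$$

Here, the sum on the left-hand side is part of the symbol being defined.

Observation 2 (Property of expectation with respect to infinite trajectory).

$$\mathbb{E}_{\tau}\left[\sum_{i=0}^{\infty} \gamma^i f(s_i)\right] = \lim_{N \to \infty} \mathbb{E}_{\tau_N}\left[\sum_{i=0}^{N} \gamma^i f(s_i)\right] = \lim_{N \to \infty} \sum_{i=0}^{N} \int_s p(s \mid t = i)\, \gamma^i f(s)\, ds = \sum_{i=0}^{\infty} \int_s ds\, p(s \mid t = i)\, \gamma^i f(s)$$

for any bounded function $f$.

Definition 4 (Discounted-ergodic occupancy measure $\rho$).

$$\rho(s) = \sum_{i=0}^{\infty} \gamma^i p(s \mid t = i).$$

The measure $\rho$ is not normalised in general. Intuitively, it can be thought of as marginalising out the time in the system dynamics.

Lemma 4 (Discounted-ergodic property). For any bounded function $f$:

$$\int_s d\rho(s)\, f(s) = \mathbb{E}_{\tau}\left[\sum_{i=0}^{\infty} \gamma^i f(s_i)\right].$$

Proof.

$$\mathbb{E}_{\tau}\left[\sum_{i=0}^{\infty} \gamma^i f(s_i)\right] = \sum_{i=0}^{\infty} \int_s \gamma^i p(s \mid t = i)\, f(s)\, ds = \int_s \underbrace{\sum_{i=0}^{\infty} \gamma^i p(s \mid t = i)}_{\rho(s)} f(s)\, ds.$$

Here, the first equality follows from Observation 2. The second exchanges the sum and the integral, which is permitted because $f$ is bounded and the discount $\gamma < 1$ makes the series absolutely convergent.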
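For a finite state space, Definition 4 and Lemma 4 reduce to linear algebra, which gives a quick way to sanity-check them. The sketch below assumes a hypothetical three-state Markov chain; the transition matrix `P`, initial distribution `p0`, test function `f` and the helper names are illustrative and not taken from the paper. In this setting the series in Definition 4 sums to $\rho^\top = p_0^\top (I - \gamma P)^{-1}$.

```python
import numpy as np

gamma = 0.9
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])    # P[s, s'] = p(s' | s), hypothetical values
p0 = np.array([0.5, 0.3, 0.2])     # initial state distribution
f = np.array([1.0, -2.0, 0.5])     # an arbitrary bounded function of the state

def occupancy(P, p0, gamma):
    # rho = sum_i gamma^i p(. | t = i), i.e. rho^T = p0^T (I - gamma P)^{-1}.
    n = len(p0)
    return np.linalg.solve((np.eye(n) - gamma * P).T, p0)

def truncated_expectation(P, p0, gamma, f, horizon):
    # sum_{i=0}^{N} gamma^i * E[f(s_i)], using Definition 1 to advance p(s | t = i).
    total, p_t = 0.0, p0.copy()
    for i in range(horizon + 1):
        total += gamma**i * (p_t @ f)
        p_t = p_t @ P
    return total

rho = occupancy(P, p0, gamma)
lhs = rho @ f                                             # integral of f against rho
rhs = truncated_expectation(P, p0, gamma, f, horizon=500)  # long truncated sum
```

The total mass of `rho` equals $1/(1-\gamma)$ rather than 1, which illustrates the remark that the measure is not normalised.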

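This property is useful since the expression on the left can be easily manipulated while the expression on the right can be estimated from samples using Monte Carlo.

Lemma 5 (Generalised eigenfunction property). For any bounded function $f$:

$$\gamma \int_s d\rho(s) \int_{s'} dp(s' \mid s)\, f(s') = \left(\int_s d\rho(s)\, f(s)\right) - \left(\int_s dp_0(s)\, f(s)\right).$$

Proof.

$$
\begin{aligned}
\gamma \int_s d\rho(s) \int_{s'} dp(s' \mid s)\, f(s')
&= \gamma \sum_{i=0}^{\infty} \int_{s, s'} \gamma^i p(s \mid t = i)\, p(s' \mid s)\, f(s')\, ds\, ds' \\
&= \sum_{i=0}^{\infty} \gamma^{i+1} \int_{s'} dp(s' \mid t = i + 1)\, f(s') \\
&= \sum_{i=1}^{\infty} \gamma^{i} \int_{s'} dp(s' \mid t = i)\, f(s') \\
&= \left(\sum_{i=0}^{\infty} \gamma^{i} \int_{s'} dp(s' \mid t = i)\, f(s')\right) - \left(\int_s dp_0(s)\, f(s)\right) \\
&= \left(\int_s d\rho(s)\, f(s)\right) - \left(\int_s dp_0(s)\, f(s)\right).
\end{aligned}
$$

Here, the first equality follows from Definition 4, the second one from Definition 1. The last equality follows again from Definition 4.

Definition 5 (Markov Reward Process). A Markov Reward Process is a tuple $(p, p_0, R, \gamma)$, where $p(s' \mid s)$ is a transition kernel, $p_0$ is the distribution over initial states, $R(r \mid s)$ is a reward distribution conditioned on the state and $\gamma$ is the discount constant.

An MRP can be thought of as an MDP with a fixed policy and dynamics given by marginalising out the actions: $p_\pi(s' \mid s) = \int_a d\pi(a \mid s)\, p(s' \mid a, s)$. Since this paper considers the case of one policy, we abuse notation slightly by using the same symbol $\tau$ to denote trajectories including actions, i.e. $(s_0, a_0, r_0, s_1, a_1, r_1, \dots)$, and without them, i.e. $(s_0, r_0, s_1, r_1, \dots)$.

Lemma 6 (Second Moment Bellman Equation). Consider a Markov Reward Process $(p, p_0, X, \gamma)$ where $p(s' \mid s)$ is a Markov process and $X(x \mid s)$ is some probability density function [6]. Denote the value function of the MRP as $V$. Denote the second moment function as $S$:

$$S(s) = \mathbb{E}_{\tau}\left[\left(\sum_{t=0}^{\infty} \gamma^t x_t\right)^2 \,\middle|\, s_0 = s\right], \qquad x_t \sim X(\cdot \mid s_t).$$

Then $S$ is the value function of the MRP $(p, p_0, u, \gamma^2)$, where $u(s)$ is a deterministic random variable given by

$$u(s) = \mathbb{V}_{X(x \mid s)}[x] + \left(\mathbb{E}_{X(x \mid s)}[x]\right)^2 + 2\gamma\, \mathbb{E}_{X(x \mid s)}[x]\; \mathbb{E}_{p(s' \mid s)}[V(s')].$$

Proof.

$$
\begin{aligned}
S(s) &= \mathbb{E}_{\tau}\left[\left(x_0 + \sum_{t=1}^{\infty} \gamma^t x_t\right)^2 \,\middle|\, s_0 = s\right] \\
&= \mathbb{E}_{\tau}\left[x_0^2 + 2 x_0 \left(\sum_{t=1}^{\infty} \gamma^t x_t\right) + \left(\sum_{t=1}^{\infty} \gamma^t x_t\right)^2 \,\middle|\, s_0 = s\right] \\
&= \mathbb{E}_{\tau}\left[x_0^2 \mid s_0 = s\right] + \mathbb{E}_{\tau}\left[2 x_0 \left(\sum_{t=1}^{\infty} \gamma^t x_t\right) \,\middle|\, s_0 = s\right] + \mathbb{E}_{\tau}\left[\left(\sum_{t=1}^{\infty} \gamma^t x_t\right)^2 \,\middle|\, s_0 = s\right] \\
&= u(s) + \gamma^2\, \mathbb{E}_{p(s' \mid s)}[S(s')].
\end{aligned}
$$

In the last step, the first term equals $\mathbb{V}_{X(x \mid s)}[x] + (\mathbb{E}_{X(x \mid s)}[x])^2$, the second equals $2\gamma\, \mathbb{E}_{X(x \mid s)}[x]\, \mathbb{E}_{p(s' \mid s)}[V(s')]$, and the third equals $\gamma^2\, \mathbb{E}_{p(s' \mid s)}[S(s')]$. This is exactly the Bellman equation of the MRP $(p, p_0, u, \gamma^2)$. The lemma follows since the Bellman equation uniquely determines the value function.

Observation 3 (Dominated Value Functions). Consider two Markov Reward Processes $(p, p_0, X_1, \gamma)$ and $(p, p_0, X_2, \gamma)$, where $p(s' \mid s)$ is a Markov process (common to both MRPs) and $X_1(s)$, $X_2(s)$ are some deterministic random variables meeting the condition $X_1(s) \le X_2(s)$ for every $s$. Then the value functions $V_1$ and $V_2$ of the respective MRPs satisfy $V_1(s) \le V_2(s)$ for every $s$. Moreover, if we have that $X_1(s) < X_2(s)$ for all states, then the inequality between value functions is strict.

Proof. Follows trivially by expanding the value function as a series and comparing the series elementwise.

We now turn our attention to proving the Gaussian Policy Gradients lemma.

[6] Note that while $X$ occupies a place in the definition of the MRP usually called "reward distribution", we are using the symbol $X$, not $R$, since we shall apply the lemma to $X$-es which are constructions distinct from the reward of the MDP we are solving.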
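Lemma 5 and Lemma 6 can both be checked on a small finite MRP. The following is a sketch under stated assumptions: a hypothetical two-state chain with Gaussian per-state rewards, where `P`, `p0`, `mean`, `std`, `f` and the function `simulate_second_moment` are illustrative names and values that do not appear in the paper. Lemma 5 is checked exactly; Lemma 6 is compared against a seeded Monte Carlo estimate of the second moment of the discounted return, so that comparison only holds up to sampling error.

```python
import numpy as np

gamma = 0.5
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])           # P[s, s'] = p(s' | s), hypothetical values
p0 = np.array([0.5, 0.5])
mean = np.array([1.0, -1.0])         # E[x | s]
std = np.array([0.5, 1.0])           # standard deviation of x given s
I = np.eye(2)

# Lemma 5: gamma * rho^T P f = rho^T f - p0^T f.
rho = np.linalg.solve((I - gamma * P).T, p0)
f = np.array([2.0, -3.0])
lemma5_gap = gamma * rho @ P @ f - (rho @ f - p0 @ f)

# Lemma 6: S is the value function of the MRP (p, p0, u, gamma^2).
V = np.linalg.solve(I - gamma * P, mean)
u = std**2 + mean**2 + 2 * gamma * mean * (P @ V)
S = np.linalg.solve(I - gamma**2 * P, u)

def simulate_second_moment(start, n_paths=200_000, horizon=40, seed=0):
    # Monte Carlo estimate of E[(sum_t gamma^t x_t)^2 | s_0 = start],
    # truncated at a horizon where gamma^t is negligible.
    rng = np.random.default_rng(seed)
    s = np.full(n_paths, start)
    g = np.zeros(n_paths)
    for t in range(horizon):
        g += gamma**t * rng.normal(mean[s], std[s])
        # Two states only: move to state 1 with probability P[s, 1].
        s = (rng.random(n_paths) < P[s, 1]).astype(int)
    return np.mean(g**2)

S_mc = np.array([simulate_second_moment(0), simulate_second_moment(1)])
```

Solving the two linear systems is the finite-state analogue of the statement that the Bellman equation uniquely determines the value function.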

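Lemma 1 (Gaussian Policy Gradients). If the policy is Gaussian, i.e. $\pi(\cdot \mid s) \sim \mathcal{N}(\mu_s, \Sigma_s)$ with $\mu_s$ and $\Sigma_s^{1/2}$ parametrised by $\theta$, where $\Sigma_s^{1/2}$ is symmetric and $\Sigma_s^{1/2} \Sigma_s^{1/2} = \Sigma_s$, and the critic is of the form $Q(a, s) = a^\top A(s)\, a + a^\top B(s) + \text{const}$ where $A(s)$ is symmetric for every $s$, then

$$I^Q_\pi(s) = I^Q_{\pi(s), \mu_s} + I^Q_{\pi(s), \Sigma_s^{1/2}},$$

where the mean and covariance components are given by

$$I^Q_{\pi(s), \mu_s} = (\nabla \mu_s)\left(2 A(s) \mu_s + B(s)\right) \quad \text{and} \quad I^Q_{\pi(s), \Sigma_s^{1/2}} = (\nabla \Sigma_s^{1/2})\, 2 A(s) \Sigma_s^{1/2}.$$

Proof. First, we observe that the critic $Q$ defined in the statement of the lemma does not depend on the policy parameters $\theta$. This is because $Q$ is an approximation to the Q-function maintained by the algorithm, as opposed to the true Q-function, which is defined with respect to the policy and does depend on it. We can hence move the differentiation outside of the integral, as follows:

$$I^Q_\pi(s) = \int_a \nabla \pi(a \mid s)\, Q(a, s)\, da = \nabla\, \mathbb{E}_\pi[Q(a, s)].$$

We now expand the expectation using known expressions for the expectation of a quadratic form:

$$\mathbb{E}_\pi[Q(a, s)] = \operatorname{trace}(A(s) \Sigma_s) + \mu_s^\top A(s) \mu_s + B(s)^\top \mu_s + \text{const}.$$

This gives rise to the following derivatives, in which the constant vanishes:

$$\nabla_{\Sigma_s^{1/2}} \mathbb{E}_\pi[Q(a, s)] = \nabla_{\Sigma_s^{1/2}} \left(\operatorname{trace}(A(s) \Sigma_s) + \mu_s^\top A(s) \mu_s + B(s)^\top \mu_s\right) = 2 A(s) \Sigma_s^{1/2},$$

$$\nabla_{\mu_s} \mathbb{E}_\pi[Q(a, s)] = \nabla_{\mu_s} \left(\operatorname{trace}(A(s) \Sigma_s) + \mu_s^\top A(s) \mu_s + B(s)^\top \mu_s\right) = 2 A(s) \mu_s + B(s).$$

We now obtain the result by applying the chain rule:

$$I^Q_\pi(s) = I^Q_{\pi(s), \mu_s} + I^Q_{\pi(s), \Sigma_s^{1/2}} = (\nabla \mu_s)\left(2 A(s) \mu_s + B(s)\right) + (\nabla \Sigma_s^{1/2})\left(2 A(s) \Sigma_s^{1/2}\right).$$

Lemma 3. If for all $s \in S$, the random variable $\nabla \log \pi(a \mid s)\, \hat{Q}(a, s)$, where $a \sim \pi(\cdot \mid s)$, has nonzero variance, then

$$\mathbb{V}_\tau\left[\sum_{t=0}^{\infty} \gamma^t \nabla \log \pi(a_t \mid s_t) \left(\hat{Q}(a_t, s_t) + b(s_t)\right)\right] > \mathbb{V}_\tau\left[\sum_{t=0}^{\infty} \gamma^t I^{\hat{Q}}_\pi(s_t)\right].$$

Proof. Both random variables have the same mean, so we need only show that

$$\mathbb{E}_\tau\left[\left(\sum_{t=0}^{\infty} \gamma^t \nabla \log \pi(a_t \mid s_t) \left(\hat{Q}(a_t, s_t) + b(s_t)\right)\right)^2\right] > \mathbb{E}_\tau\left[\left(\sum_{t=0}^{\infty} \gamma^t I^{\hat{Q}}_\pi(s_t)\right)^2\right].$$

We start by applying Lemma 6 to the left-hand side, setting $X = X_1(s_t) = \nabla \log \pi(a_t \mid s_t) \left(\hat{Q}(a_t, s_t) + b(s_t)\right)$ where $a_t \sim \pi(\cdot \mid s_t)$; the discount factor $\gamma^t$ is supplied by Lemma 6 itself. This shows that $\mathbb{E}_\tau\left[\left(\sum_{t=0}^{\infty} \gamma^t \nabla \log \pi(a_t \mid s_t) (\hat{Q}(a_t, s_t) + b(s_t))\right)^2\right]$ is the total return of the MRP $(p, p_0, u_1, \gamma^2)$, where

$$u_1 = \mathbb{V}_{X_1(x \mid s)}[x] + \left(\mathbb{E}_{X_1(x \mid s)}[x]\right)^2 + 2\gamma\, \mathbb{E}_{X_1(x \mid s)}[x]\; \mathbb{E}_{p(s' \mid s)}[V(s')].$$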
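Likewise, applying Lemma 6 again to the right-hand side, instantiating $X$ as a deterministic random variable $X_2(s_t) = I^{\hat{Q}}_\pi(s_t)$, we have that $\mathbb{E}_\tau\left[\left(\sum_{t=0}^{\infty} \gamma^t I^{\hat{Q}}_\pi(s_t)\right)^2\right]$ is the total return of the MRP $(p, p_0, u_2, \gamma^2)$, where

$$u_2 = \left(\mathbb{E}_{X_2(x \mid s)}[x]\right)^2 + 2\gamma\, \mathbb{E}_{X_2(x \mid s)}[x]\; \mathbb{E}_{p(s' \mid s)}[V(s')].$$

Note that $\mathbb{E}_{X_1(x \mid s)}[x] = \mathbb{E}_{X_2(x \mid s)}[x]$, so the two expressions differ only by the variance term $\mathbb{V}_{X_1(x \mid s)}[x] \ge 0$ and therefore $u_1 \ge u_2$. Furthermore, by the assumption of the lemma, the inequality is strict. The lemma then follows by applying Observation 3.

For convenience, Lemma 3 also assumes infinite length trajectories. However, this is not a practical limitation since all policy gradient methods implicitly assume trajectories are long enough to be modelled as infinite. Furthermore, a finite trajectory variant also holds, though the proof is messier.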
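Returning to Lemma 1, the closed-form expectation of a quadratic critic under a Gaussian policy and the resulting mean gradient $2A\mu + B$ can be verified numerically. The sketch below is an illustration under our own assumptions: the matrices `A`, `sqrt_sigma` and vectors `B`, `mu` are hypothetical values for a single state, and the check uses a symmetric sigma-point rule (our choice, not something used in the paper), which matches the mean and covariance of the Gaussian and is therefore exact for quadratic integrands.

```python
import numpy as np

def critic(a, A, B):
    # Q(a) = a^T A a + a^T B, with the constant term omitted.
    return a @ A @ a + a @ B

def expected_q_closed_form(mu, sqrt_sigma, A, B):
    # trace(A Sigma) + mu^T A mu + B^T mu, with Sigma = Sigma^{1/2} Sigma^{1/2}.
    sigma = sqrt_sigma @ sqrt_sigma
    return np.trace(A @ sigma) + mu @ A @ mu + B @ mu

def expected_q_sigma_points(mu, sqrt_sigma, A, B):
    # 2d points mu +/- sqrt(d) * (column i of Sigma^{1/2}) with equal weights.
    # They reproduce the mean and covariance of N(mu, Sigma).
    d = len(mu)
    total = 0.0
    for i in range(d):
        offset = np.sqrt(d) * sqrt_sigma[:, i]
        total += critic(mu + offset, A, B) + critic(mu - offset, A, B)
    return total / (2 * d)

def mean_gradient(mu, A, B):
    # Gradient of the expected critic with respect to the mean (Lemma 1).
    return 2 * A @ mu + B

A = np.array([[-2.0, 0.5],
              [0.5, -1.0]])            # symmetric, hypothetical
B = np.array([1.0, -0.5])
mu = np.array([0.3, -0.2])
sqrt_sigma = np.array([[0.5, 0.1],
                       [0.1, 0.4]])    # symmetric square root of Sigma

closed = expected_q_closed_form(mu, sqrt_sigma, A, B)
quadrature = expected_q_sigma_points(mu, sqrt_sigma, A, B)

# Finite-difference gradient of the closed form with respect to mu.
eps = 1e-6
fd_grad = np.array([
    (expected_q_closed_form(mu + eps * e, sqrt_sigma, A, B)
     - expected_q_closed_form(mu - eps * e, sqrt_sigma, A, B)) / (2 * eps)
    for e in np.eye(2)
])
```

Only the mean component is differentiated here; the covariance component would additionally require fixing a convention for differentiating with respect to a symmetric matrix.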

Remarks on the covariance limit

When we obtain $e^{H}$ as the limiting covariance matrix in Lemma 2 of the main paper, there is a slight modelling difficulty: is it justified to use the learning rate of $\frac{1}{n}$, which diminishes in the length of the trajectory, as opposed to a small finite number? We observe that the problem of choosing a step size is, in general, not specific to our method since all policy gradient methods rely on stochastic optimisation and hence work with diminishing learning rates of some sort.

We do note, however, that the step size we use, which is $\frac{1}{n}$ for every point in the trajectory, differs from the step sizes typically used with Robbins-Monro procedures, which differ at each time step. This means that the sum of our step sizes is finite while the sum of the Robbins-Monro step sizes diverges. Hence our choice of step sizes does not give the guarantees typically associated with stochastic optimisation.

We use this step sequence since it serves as a useful intermediate stage between simply taking one PG step of equation (11) and using finite step sizes, which would mean that the covariance would either converge to zero or diverge to infinity.
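The contrast between the $\frac{1}{n}$ step size and a fixed finite step size can be illustrated numerically. The sketch below makes a simplifying assumption that is ours rather than the paper's: the repeated update is modelled as the linear map $M \leftarrow (I + \text{step} \cdot H) M$ starting from the identity, which may differ in detail from the update analysed in Lemma 2. The matrix `H` and the helper names are hypothetical.

```python
import numpy as np

def expm_symmetric(H):
    # Matrix exponential of a symmetric matrix via its eigendecomposition.
    vals, vecs = np.linalg.eigh(H)
    return (vecs * np.exp(vals)) @ vecs.T

def repeated_steps(H, n, step):
    # Apply M <- (I + step * H) M a total of n times, starting from M = I.
    return np.linalg.matrix_power(np.eye(len(H)) + step * H, n)

H = np.array([[-1.0, 0.3],
              [0.3, -0.5]])      # hypothetical Hessian with negative eigenvalues

n = 100_000
diminishing = repeated_steps(H, n, step=1.0 / n)   # n steps of size 1/n
fixed = repeated_steps(H, n, step=0.01)            # n steps of fixed size
limit = expm_symmetric(H)

total_diminishing = n * (1.0 / n)                      # step sizes sum to 1
harmonic = sum(1.0 / t for t in range(1, n + 1))       # a Robbins-Monro style sum
```

Under this simplified model, the $\frac{1}{n}$ schedule approaches the matrix exponential, the fixed step size collapses the matrix towards zero for this negative-definite `H`, and the harmonic sum keeps growing with `n`, in line with the remark that the two kinds of step sequences behave differently.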