Deep Reinforcement Learning via Policy Optimization
John Schulman
July 3, 2017
Introduction
Deep Reinforcement Learning: What to Learn?
- Policies (select next action)
- Value functions (measure goodness of states or state-action pairs)
- Models (predict next states and rewards)
Model-Free RL: (Rough) Taxonomy
- Policy Optimization: DFO / Evolution, Policy Gradients
- Dynamic Programming: Policy Iteration, Value Iteration, Q-Learning
- In between (modified policy iteration): Actor-Critic Methods
Policy Optimization vs Dynamic Programming
Conceptually...
- Policy optimization: optimize what you care about
- Dynamic programming: indirect, exploit the problem structure, self-consistency
Empirically...
- Policy optimization is more versatile; dynamic programming methods are more sample-efficient when they work
- Policy optimization methods are more compatible with rich architectures (including recurrence) and with auxiliary objectives (tasks other than control); dynamic programming methods are more compatible with exploration and off-policy learning
Parameterized Policies
- A family of policies indexed by parameter vector θ ∈ R^d
- Deterministic: a = π(s, θ)
- Stochastic: π(a | s, θ)
- Analogous to classification or regression with input s, output a:
  - Discrete action space: network outputs a vector of probabilities
  - Continuous action space: network outputs the mean and diagonal covariance of a Gaussian
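As a concrete illustration (a sketch, not from the slides), both parameterizations can be written in a few lines of PyTorch; the layer sizes and network structure here are arbitrary choices made for the example.

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """Discrete action space: network outputs a vector of action probabilities (via logits)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class GaussianPolicy(nn.Module):
    """Continuous action space: network outputs the mean; a learned per-dimension
    log-std gives the diagonal covariance."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

# Sampling an action from π(a | s, θ):
#   dist = GaussianPolicy(obs_dim=4, act_dim=2)(torch.zeros(4)); a = dist.sample()
```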
Episodic Setting
In each episode, the initial state is sampled from µ, and the agent acts until the terminal state is reached. For example:
- Taxi robot reaches its destination (termination = good)
- Waiter robot finishes a shift (fixed time)
- Walking robot falls over (termination = bad)
Goal: maximize expected return per episode
    maximize_π E[R | π]
Derivative Free Optimization / Evolution
Cross Entropy Method
Initialize µ ∈ R^d, σ ∈ R^d
for iteration = 1, 2, ... do
    Collect n samples of θ_i ∼ N(µ, diag(σ))
    Perform one episode with each θ_i, obtaining reward R_i
    Select the top p% of θ samples (e.g. p = 20), the elite set
    Fit a Gaussian distribution to the elite set, updating µ, σ
end for
Return the final µ
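A minimal NumPy sketch of this loop; the `episode_return` callback, population size, and iteration count are placeholders, not values from the slides.

```python
import numpy as np

def cross_entropy_method(episode_return, d, n=100, p=20, n_iters=50, seed=0):
    """episode_return(theta) -> reward of one episode run with parameter vector theta."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(d), np.ones(d)
    n_elite = max(1, int(n * p / 100))
    for _ in range(n_iters):
        # theta_i ~ per-coordinate Gaussian with mean mu and std sigma
        thetas = mu + sigma * rng.standard_normal((n, d))
        returns = np.array([episode_return(th) for th in thetas])
        elite = thetas[np.argsort(returns)[-n_elite:]]             # top p% by return
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-8   # refit Gaussian
    return mu
```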
Cross Entropy Method
Sometimes works embarrassingly well
I. Szita and A. Lörincz. Learning Tetris using the noisy cross-entropy method. Neural Computation (2006)
Stochastic Gradient Ascent on Distribution
- Let µ define a distribution over the parameter θ of policy π_θ: θ ∼ P_µ(θ)
- Return R depends on policy parameter θ and noise ζ:
      maximize_µ E_{θ,ζ}[R(θ, ζ)]
  R is unknown and possibly nondifferentiable
- Score function gradient estimator:
      ∇_µ E_{θ,ζ}[R(θ, ζ)] = E_{θ,ζ}[∇_µ log P_µ(θ) R(θ, ζ)] ≈ (1/N) Σ_{i=1}^N ∇_µ log P_µ(θ_i) R_i
Stochastic Gradient Ascent on Distribution
Compare with the cross-entropy method:
- Score function gradient:
      ∇_µ E_{θ,ζ}[R(θ, ζ)] ≈ (1/N) Σ_{i=1}^N ∇_µ log P_µ(θ_i) R_i
- Cross-entropy method:
      maximize_µ (1/N) Σ_{i=1}^N log P_µ(θ_i) f(R_i)
  where f(r) = 1[r above threshold]
Connection to Finite Differences
- Suppose P_µ is a Gaussian distribution with mean µ and covariance σ²I:
      log P_µ(θ) = −‖µ − θ‖² / (2σ²) + const
      ∇_µ log P_µ(θ) = (θ − µ)/σ²
      R_i ∇_µ log P_µ(θ_i) = R_i (θ_i − µ)/σ²
- Suppose we do antithetic sampling, where we use pairs of samples θ_+ = µ + σz, θ_− = µ − σz:
      (1/2) (R(µ + σz, ζ) ∇_µ log P_µ(θ_+) + R(µ − σz, ζ′) ∇_µ log P_µ(θ_−))
      = (1/(2σ)) (R(µ + σz, ζ) − R(µ − σz, ζ′)) z
- Using the same noise ζ for both evaluations reduces variance
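A NumPy sketch of this antithetic estimator; the black-box return function `R(theta, noise_seed)` and the sample sizes are placeholders introduced for the example.

```python
import numpy as np

def es_gradient(R, mu, sigma=0.1, n_pairs=50, seed=0):
    """Estimate grad_mu E[R(theta, zeta)], theta ~ N(mu, sigma^2 I), with
    antithetic pairs theta_+/- = mu +/- sigma*z and common random numbers."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(mu, dtype=float)
    for _ in range(n_pairs):
        z = rng.standard_normal(mu.shape)
        noise_seed = int(rng.integers(1 << 31))     # same noise zeta for both evaluations
        r_plus = R(mu + sigma * z, noise_seed)
        r_minus = R(mu - sigma * z, noise_seed)
        grad += (r_plus - r_minus) / (2 * sigma) * z
    return grad / n_pairs

# One step of stochastic gradient ascent on the distribution parameters:
#   mu = mu + step_size * es_gradient(R, mu)
```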
Deriving the Score Function Estimator
Score function gradient estimator:
    ∇_µ E_{θ,ζ}[R(θ, ζ)] = E_{θ,ζ}[∇_µ log P_µ(θ) R(θ, ζ)] ≈ (1/N) Σ_{i=1}^N ∇_µ log P_µ(θ_i) R_i
Derive by writing the expectation as an integral:
    ∇_µ ∫ dθ dζ P_µ(θ) R(θ, ζ)
    = ∫ dθ dζ ∇_µ P_µ(θ) R(θ, ζ)
    = ∫ dθ dζ P_µ(θ) ∇_µ log P_µ(θ) R(θ, ζ)
    = E_{θ,ζ}[∇_µ log P_µ(θ) R(θ, ζ)]
Literature on DFO
- Evolution strategies (Rechenberg and Eigen, 1973)
- Simultaneous perturbation stochastic approximation (Spall, 1992)
- Covariance matrix adaptation: a popular relative of CEM (Hansen, 2006)
- Reward weighted regression (Peters and Schaal, 2007), PoWER (Kober and Peters, 2007)
Success Stories
- CMA is very effective for optimizing low-dimensional locomotion controllers
- UT Austin Villa: RoboCup 2012 3D Simulation League Champion
- Evolution Strategies was shown to perform well on Atari, competitive with policy gradient methods (Salimans et al., 2017)
Policy Gradient Methods
Overview
Problem: maximize E[R | π_θ]
- Here, we'll use a fixed policy parameter θ (instead of sampling θ ∼ P_µ) and estimate the gradient with respect to θ
- Noise is in action space rather than parameter space
Overview
Problem: maximize E[R | π_θ]
Intuitions: collect a bunch of trajectories, and...
1. Make the good trajectories more probable
2. Make the good actions more probable
3. Push the actions towards better actions
Score Function Gradient Estimator for Policies
Now the random variable is a whole trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{T−1}, a_{T−1}, r_{T−1}, s_T):
    ∇_θ E_τ[R(τ)] = E_τ[∇_θ log P(τ | θ) R(τ)]
Just need to write out P(τ | θ):
    P(τ | θ) = µ(s_0) Π_{t=0}^{T−1} [π(a_t | s_t, θ) P(s_{t+1}, r_t | s_t, a_t)]
    log P(τ | θ) = log µ(s_0) + Σ_{t=0}^{T−1} [log π(a_t | s_t, θ) + log P(s_{t+1}, r_t | s_t, a_t)]
    ∇_θ log P(τ | θ) = Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ)
    ∇_θ E_τ[R] = E_τ[R Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ)]
Policy Gradient: Use Temporal Structure
Previous slide:
    ∇_θ E_τ[R] = E_τ[(Σ_{t=0}^{T−1} r_t)(Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ))]
We can repeat the same argument to derive the gradient estimator for a single reward term r_{t′}:
    ∇_θ E[r_{t′}] = E[r_{t′} Σ_{t=0}^{t′} ∇_θ log π(a_t | s_t, θ)]
Summing this formula over t′, we obtain
    ∇_θ E[R] = E[Σ_{t′=0}^{T−1} r_{t′} Σ_{t=0}^{t′} ∇_θ log π(a_t | s_t, θ)]
             = E[Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) Σ_{t′=t}^{T−1} r_{t′}]
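The inner sums Σ_{t′≥t} r_{t′} ("rewards to go") are easy to compute in a single backward pass; a small helper sketch (not from the slides) follows, and it is reused in later examples.

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """Return an array whose t-th entry is sum over t' >= t of gamma^(t'-t) * r_{t'}."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# rewards_to_go([1.0, 2.0, 3.0]) -> array([6., 5., 3.])
```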
Policy Gradient: Introduce Baseline
- Further reduce variance by introducing a baseline b(s):
      ∇_θ E_τ[R] = E_τ[Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) (Σ_{t′=t}^{T−1} r_{t′} − b(s_t))]
- For any choice of b, the gradient estimator is unbiased.
- A near-optimal choice is the expected return,
      b(s_t) ≈ E[r_t + r_{t+1} + r_{t+2} + ... + r_{T−1}]
- Interpretation: increase the logprob of action a_t proportionally to how much the return Σ_{t′=t}^{T−1} r_{t′} is better than expected
Discounts for Variance Reduction
- Introduce a discount factor γ, which ignores delayed effects between actions and rewards:
      ∇_θ E_τ[R] ≈ E_τ[Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) (Σ_{t′=t}^{T−1} γ^{t′−t} r_{t′} − b(s_t))]
- Now we want b(s_t) ≈ E[r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^{T−1−t} r_{T−1}]
- Write the gradient estimator more generally as
      ∇_θ E_τ[R] ≈ E_τ[Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) Â_t]
  where Â_t is the advantage estimate
Vanilla Policy Gradient Algorithm
Initialize policy parameter θ, baseline b
for iteration = 1, 2, ... do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute
        the return R̂_t = Σ_{t′=t}^{T−1} γ^{t′−t} r_{t′}, and
        the advantage estimate Â_t = R̂_t − b(s_t)
    Re-fit the baseline, by minimizing ‖b(s_t) − R̂_t‖², summed over all trajectories and timesteps
    Update the policy, using a policy gradient estimate ĝ, which is a sum of terms ∇_θ log π(a_t | s_t, θ) Â_t
end for
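A sketch of the two update steps in PyTorch, assuming the trajectories have already been collected into tensors (with R̂_t computed e.g. by the `rewards_to_go` helper above); the network classes, optimizers, and learning rates are placeholders, not part of the algorithm as stated.

```python
import torch

def vanilla_pg_update(policy, baseline, pi_opt, b_opt, obs, acts, returns):
    """obs: [N, obs_dim], acts: [N] (or [N, act_dim]), returns: [N] tensor of R_hat_t."""
    # Advantage estimate: A_hat_t = R_hat_t - b(s_t)
    with torch.no_grad():
        adv = returns - baseline(obs).squeeze(-1)

    # Policy step: ascend sum_t log pi(a_t | s_t) * A_hat_t (so minimize its negative)
    logp = policy(obs).log_prob(acts)      # for multi-dim Gaussian actions, .sum(-1)
    pi_loss = -(logp * adv).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Re-fit the baseline by minimizing ||b(s_t) - R_hat_t||^2
    b_loss = ((baseline(obs).squeeze(-1) - returns) ** 2).mean()
    b_opt.zero_grad(); b_loss.backward(); b_opt.step()
```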
Advantage Actor-Critic
Use a neural network that represents both the policy π_θ and a value function V_θ (approximating V^{π_θ})
Pseudocode:
for iteration = 1, 2, ... do
    Agent acts for T timesteps (e.g., T = 20)
    For each timestep t, compute
        R̂_t = r_t + γ r_{t+1} + ... + γ^{T−1−t} r_{T−1} + γ^{T−t} V_θ(s_T)
        Â_t = R̂_t − V_θ(s_t)
    R̂_t is the target value in a regression problem, Â_t is the estimated advantage function
    Compute the gradient g = ∇_θ Σ_{t=1}^{T} [log π_θ(a_t | s_t) Â_t − c (V_θ(s_t) − R̂_t)²]
    g is plugged into a stochastic gradient ascent algorithm, e.g., Adam
end for
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. Asynchronous methods for deep reinforcement learning. (2016)
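A sketch of the bootstrapped return and advantage computation inside this loop (NumPy); the function name and interface are assumptions made for the example.

```python
import numpy as np

def nstep_returns_and_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """rewards: r_t for t = 0..T-1; values: V_theta(s_t) for t = 0..T-1;
    bootstrap_value: V_theta(s_T), used to truncate the return after T steps."""
    T = len(rewards)
    returns = np.zeros(T)
    running = bootstrap_value
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running   # R_hat_t = r_t + gamma * R_hat_{t+1}
        returns[t] = running
    advantages = returns - np.asarray(values)    # A_hat_t = R_hat_t - V_theta(s_t)
    return returns, advantages
```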
Trust Region Policy Optimization
- Motivation: make policy gradients more robust and sample-efficient
- Unlike in supervised learning, the policy affects the distribution of its own inputs, so a large bad update can be disastrous
- Makes use of a surrogate objective that estimates the performance of the policy around the policy π_old used for sampling:
      L_{π_old}(π) = (1/N) Σ_{i=1}^N [π(a_i | s_i) / π_old(a_i | s_i)] Â_i     (1)
  Differentiating this objective gives the policy gradient
- L_{π_old}(π) is only accurate when the state distribution of π is close to that of π_old, so it makes sense to constrain or penalize the distance D_KL[π_old ‖ π]
Trust Region Policy Optimization
Pseudocode:
for iteration = 1, 2, ... do
    Run policy for T timesteps or N trajectories
    Estimate advantage function at all timesteps
    Solve
        maximize_θ   Σ_{n=1}^N [π_θ(a_n | s_n) / π_{θ_old}(a_n | s_n)] Â_n
        subject to   KL_{π_{θ_old}}(π_θ) ≤ δ
end for
- Can solve the constrained optimization problem efficiently by using conjugate gradient
- Closely related to natural policy gradients (Kakade, 2002), natural actor-critic (Peters and Schaal, 2005), REPS (Peters et al., 2010)
Proximal Policy Optimization
Use a penalty instead of a constraint:
    maximize_θ   Σ_{n=1}^N [π_θ(a_n | s_n) / π_{θ_old}(a_n | s_n)] Â_n − β · KL_{π_{θ_old}}(π_θ)
Pseudocode:
for iteration = 1, 2, ... do
    Run policy for T timesteps or N trajectories
    Estimate advantage function at all timesteps
    Do SGD on the above objective for some number of epochs
    If KL too high, increase β. If KL too low, decrease β.
end for
≈ same performance as TRPO, but only first-order optimization
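A sketch of one outer iteration of this penalized update in PyTorch; the KL target, the adjustment factors, and the sample-based KL estimate are assumptions chosen for the example rather than values given on the slide.

```python
import torch

def ppo_penalty_step(policy, optimizer, obs, acts, adv, logp_old, beta,
                     epochs=10, kl_target=0.01):
    """Maximize mean[ratio * A_hat] - beta * KL(pi_old || pi) by SGD, then adapt beta."""
    for _ in range(epochs):
        logp = policy(obs).log_prob(acts)
        ratio = torch.exp(logp - logp_old)
        approx_kl = (logp_old - logp).mean()        # sample-based KL(pi_old || pi)
        loss = -(ratio * adv).mean() + beta * approx_kl
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    with torch.no_grad():
        kl = (logp_old - policy(obs).log_prob(acts)).mean()
    if kl > 1.5 * kl_target:       # KL too high -> increase beta
        beta *= 2.0
    elif kl < kl_target / 1.5:     # KL too low -> decrease beta
        beta /= 2.0
    return beta
```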
Variance Reduction for Policy Gradients
Reward Shaping
Reward Shaping
- Reward shaping: δ(s, a, s′) = r(s, a, s′) + γΦ(s′) − Φ(s) for an arbitrary "potential" function Φ
- Theorem: δ admits the same optimal policies as r.¹
- Proof sketch: suppose Q satisfies the Bellman equation (T Q = Q). If we transform r → δ, the policy's value function satisfies Q̃(s, a) = Q(s, a) − Φ(s)
- Q satisfies the Bellman equation ⇒ Q̃ also satisfies the Bellman equation

¹ A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. ICML. 1999.
Reward Shaping
Theorem: δ admits the same optimal policies as r.
Alternative proof: the advantage function is invariant. Look at the effect on V^π and Q^π:
    E[δ_0 + γδ_1 + γ²δ_2 + ...]
    = E[(r_0 + γΦ(s_1) − Φ(s_0)) + γ(r_1 + γΦ(s_2) − Φ(s_1)) + γ²(r_2 + γΦ(s_3) − Φ(s_2)) + ...]
    = E[r_0 + γr_1 + γ²r_2 + ... − Φ(s_0)]
Thus
    Ṽ^π(s) = V^π(s) − Φ(s)
    Q̃^π(s, a) = Q^π(s, a) − Φ(s)
    Ã^π(s, a) = A^π(s, a)
and A^π(s, π(s)) = 0 at all states ⇔ π is optimal
A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. ICML. 1999.
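A tiny NumPy sanity check of the telescoping step above (a toy example, not from the slides); for a finite trajectory there is also a boundary term γ^T Φ(s_T).

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.99, 10
r = rng.normal(size=T)           # rewards r_0 ... r_{T-1}
phi = rng.normal(size=T + 1)     # arbitrary potential Phi(s_0) ... Phi(s_T)

delta = r + gamma * phi[1:] - phi[:-1]     # shaped rewards delta_t
disc = gamma ** np.arange(T)

# sum_t gamma^t delta_t = sum_t gamma^t r_t - Phi(s_0) + gamma^T Phi(s_T)
assert np.isclose(np.sum(disc * delta),
                  np.sum(disc * r) - phi[0] + gamma ** T * phi[-1])
```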
Reward Shaping and Problem Difficulty
- Shape with Φ = V* ⇒ problem is solved in one step of value iteration
- Shaping leaves the policy gradient invariant (and just adds a baseline to the estimator):
      E[∇_θ log π_θ(a_0 | s_0) ((r_0 + γΦ(s_1) − Φ(s_0)) + γ(r_1 + γΦ(s_2) − Φ(s_1)) + γ²(r_2 + γΦ(s_3) − Φ(s_2)) + ...)]
      = E[∇_θ log π_θ(a_0 | s_0) (r_0 + γr_1 + γ²r_2 + ... − Φ(s_0))]
      = E[∇_θ log π_θ(a_0 | s_0) (r_0 + γr_1 + γ²r_2 + ...)]
Reward Shaping and Policy Gradients
First note the connection between the shaped reward and the advantage function:
    E_{s_1}[r_0 + γV^π(s_1) − V^π(s_0) | s_0 = s, a_0 = a] = A^π(s, a)
Now consider the policy gradient, ignoring all but the first shaped reward (i.e., pretend γ = 0 after shaping):
    E[Σ_t ∇_θ log π_θ(a_t | s_t) δ_t] = E[Σ_t ∇_θ log π_θ(a_t | s_t) (r_t + γV^π(s_{t+1}) − V^π(s_t))]
                                      = E[Σ_t ∇_θ log π_θ(a_t | s_t) A^π(s_t, a_t)]
Reward Shaping and Policy Gradients
- Compromise: use a more aggressive discount γλ, with λ ∈ (0, 1); this is called generalized advantage estimation:
      Σ_t ∇_θ log π_θ(a_t | s_t) Σ_{k=0}^{∞} (γλ)^k δ_{t+k}
- Or alternatively, use a hard cutoff as in A3C:
      Σ_t ∇_θ log π_θ(a_t | s_t) Σ_{k=0}^{n−1} γ^k δ_{t+k}
      = Σ_t ∇_θ log π_θ(a_t | s_t) (Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n Φ(s_{t+n}) − Φ(s_t))
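A NumPy sketch of the generalized advantage estimator for one trajectory; the interface (passing V(s_T) separately) is an assumption made for the example.

```python
import numpy as np

def gae(rewards, values, final_value, gamma=0.99, lam=0.95):
    """A_hat_t = sum_k (gamma*lam)^k * delta_{t+k}, with
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), final_value)  # V(s_0) ... V(s_T)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```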
Reward Shaping Summary
- The reward shaping transformation leaves the policy gradient and the optimal policy invariant
- Shaping with Φ ≈ V^π makes the consequences of actions more immediate
- Shaping, and then ignoring all but the first term, gives the policy gradient
Aside: Reward Shaping is Crucial in Practice
- I. Mordatch, E. Todorov, and Z. Popović. Discovery of complex behaviors through contact-invariant optimization. ACM Transactions on Graphics (TOG) 31.4 (2012), p. 43
- Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE. 2012, pp. 4906–4913
Choosing parameters γ, λ
Performance as γ, λ are varied
[Figure: cost vs. number of policy iterations (0 to 500) on the 3D Biped task, for settings γ ∈ {0.96, 0.98, 0.99, 0.995, 1} and λ ∈ {0.92, 0.96, 0.98, 0.99, 1.0}, plus γ = 1 with no value function]
(Generalized Advantage Estimation for Policy Gradients, Schulman et al., ICLR 2016)
Pathwise Derivative Methods
Deriving the Policy Gradient, Reparameterized
- Episodic MDP (graphical model: θ and states s_1, ..., s_T feed into actions a_1, ..., a_T, which produce the return R_T). Want to compute ∇_θ E[R_T]. We'll use ∇_θ log π(a_t | s_t; θ)
- Reparameterize: a_t = π(s_t, z_t; θ), where z_t is noise from a fixed distribution (the stochastic nodes are now the z_t rather than the a_t)
- Only works if P(s_2 | s_1, a_1) is known
Using a Q-function
    d/dθ E[R_T] = E[Σ_{t=1}^{T} (dR_T/da_t)(da_t/dθ)]
                = E[Σ_{t=1}^{T} (d/da_t) E[R_T | a_t] (da_t/dθ)]
                = E[Σ_{t=1}^{T} (dQ(s_t, a_t)/da_t)(da_t/dθ)]
                = E[Σ_{t=1}^{T} (d/dθ) Q(s_t, π(s_t, z_t; θ))]
SVG(0) Algorithm
Learn Q_φ to approximate Q^{π,γ}, and use it to compute gradient estimates.
Pseudocode:
for iteration = 1, 2, ... do
    Execute policy π_θ to collect T timesteps of data
    Update π_θ using g ∝ ∇_θ Σ_{t=1}^{T} Q(s_t, π(s_t, z_t; θ))
    Update Q_φ using g ∝ ∇_φ Σ_{t=1}^{T} (Q_φ(s_t, a_t) − Q̂_t)², e.g. with TD(λ)
end for
N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. Learning Continuous Control Policies by Stochastic Value Gradients. arXiv preprint arXiv:1510.09142 (2015)
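A sketch of the two updates in PyTorch, assuming a reparameterized Gaussian actor (mean network plus learned log-std) and a critic taking (s, a) pairs; `q_targets` stands for the regression targets Q̂_t computed elsewhere, e.g. with TD(λ).

```python
import torch

def svg0_update(actor_mean, log_std, q_net, pi_opt, q_opt, obs, acts, q_targets):
    # Critic: regress Q_phi(s_t, a_t) towards the targets Q_hat_t
    q_loss = ((q_net(obs, acts) - q_targets) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Actor: ascend sum_t Q(s_t, pi(s_t, z_t; theta)), with a_t reparameterized
    z = torch.randn_like(actor_mean(obs))            # noise z_t from a fixed distribution
    a = actor_mean(obs) + log_std.exp() * z          # a_t = pi(s_t, z_t; theta)
    pi_loss = -q_net(obs, a).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```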
SVG(1) Algorithm
Instead of learning Q, we learn:
- State-value function V ≈ V^{π,γ}
- Dynamics model f, approximating s_{t+1} = f(s_t, a_t) + ζ_t
Given a transition (s_t, a_t, s_{t+1}), infer ζ_t = s_{t+1} − f(s_t, a_t)
    Q(s_t, a_t) = E[r_t + γV(s_{t+1})] = E[r_t + γV(f(s_t, a_t) + ζ_t)], and a_t = π(s_t, θ, ζ_t)
SVG(∞) Algorithm
- Just learn the dynamics model f
- Given a whole trajectory, infer all noise variables
- Freeze all policy and dynamics noise, differentiate through the entire deterministic computation graph
SVG Results
- Applied to 2D robotics tasks
- Overall: different gradient estimators behave similarly
N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. Learning Continuous Control Policies by Stochastic Value Gradients. arXiv preprint arXiv:1510.09142 (2015)
Deterministic Policy Gradient
- For Gaussian actions, the variance of the score function policy gradient estimator goes to infinity as the action variance goes to zero
- Intuition: finite difference gradient estimators
- But the SVG(0) gradient is fine as σ → 0:
      ∇_θ Σ_t Q(s_t, π(s_t, θ, ζ_t))
- Problem: there's no exploration.
- Solution: add noise to the policy, but estimate Q with TD(0), so it's valid off-policy
- The policy gradient is a little biased (even with Q = Q^π), but only because the state distribution is off; it gets the right gradient at every state
D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, et al. Deterministic Policy Gradient Algorithms. ICML. 2014
Deep Deterministic Policy Gradient
- Incorporate the replay buffer and target network ideas from DQN for increased stability
- Use lagged (Polyak-averaged) versions Q_{φ′} and π_{θ′} for fitting Q_φ (towards Q^{π,γ}) with TD(0):
      Q̂_t = r_t + γ Q_{φ′}(s_{t+1}, π(s_{t+1}; θ′))
Pseudocode:
for iteration = 1, 2, ... do
    Act for several timesteps, add data to the replay buffer
    Sample a minibatch
    Update π_θ using g ∝ ∇_θ Σ_{t=1}^{T} Q(s_t, π(s_t, z_t; θ))
    Update Q_φ using g ∝ ∇_φ Σ_{t=1}^{T} (Q_φ(s_t, a_t) − Q̂_t)²
end for
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
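A sketch of one minibatch update (PyTorch); the Polyak rate τ, the network interfaces (the critic returns a [batch] tensor of Q-values), and the omission of episode-termination masking are simplifying assumptions for the example.

```python
import torch

def ddpg_step(actor, critic, actor_targ, critic_targ, pi_opt, q_opt,
              obs, acts, rews, next_obs, gamma=0.99, tau=0.005):
    # TD(0) targets from the lagged (target) networks:
    # Q_hat_t = r_t + gamma * Q_phi'(s_{t+1}, pi_theta'(s_{t+1}))
    with torch.no_grad():
        q_targets = rews + gamma * critic_targ(next_obs, actor_targ(next_obs))

    # Fit Q_phi by regression on the targets
    q_loss = ((critic(obs, acts) - q_targets) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Deterministic policy gradient: ascend Q_phi(s, pi_theta(s))
    pi_loss = -critic(obs, actor(obs)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Polyak-average the target networks
    with torch.no_grad():
        for p, p_targ in zip(list(actor.parameters()) + list(critic.parameters()),
                             list(actor_targ.parameters()) + list(critic_targ.parameters())):
            p_targ.mul_(1 - tau).add_(tau * p)
```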
DDPG Results
Applied to 2D and 3D robotics tasks and driving with pixel input
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
Policy Gradient Methods: Comparison
Two kinds of policy gradient estimator:
- REINFORCE / score function estimator: ∇ log π(a | s) · Â
  - Learn Q or V for variance reduction, to estimate Â
- Pathwise derivative estimators (differentiate w.r.t. action)
  - SVG(0) / DPG: (d/da) Q(s, a) (learn Q)
  - SVG(1): (d/da) (r + γV(s′)) (learn f, V)
  - SVG(∞): (d/da_t) (r_t + γ r_{t+1} + γ² r_{t+2} + ...) (learn f)
- Pathwise derivative methods are more sample-efficient when they work (maybe), but work less generally due to high bias
Thanks Questions?