Deep Reinforcement Learning via Policy Optimization
John Schulman
July 3, 2017
Introduction
Deep Reinforcement Learning: What to Learn?
- Policies (select next action)
- Value functions (measure goodness of states or state-action pairs)
- Models (predict next states and rewards)
Model-Free RL: (Rough) Taxonomy
- Policy Optimization: DFO / Evolution, Policy Gradients
- Dynamic Programming: Policy Iteration, Value Iteration, Q-Learning
- In between (modified policy iteration): Actor-Critic Methods
Policy Optimization vs Dynamic Programming
Conceptually...
- Policy optimization: optimize what you care about
- Dynamic programming: indirect, exploit the problem structure, self-consistency
Empirically...
- Policy optimization is more versatile; dynamic programming methods are more sample-efficient when they work
- Policy optimization methods are more compatible with rich architectures (including recurrence) and with auxiliary objectives (tasks other than control); dynamic programming methods are more compatible with exploration and off-policy learning
Parameterized Policies
- A family of policies indexed by parameter vector θ ∈ R^d
- Deterministic: a = π(s, θ)
- Stochastic: π(a | s, θ)
- Analogous to classification or regression with input s, output a:
  - Discrete action space: network outputs a vector of probabilities
  - Continuous action space: network outputs the mean and diagonal covariance of a Gaussian
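As a concrete illustration (a sketch, not from the slides), both parameterizations can be written in a few lines of PyTorch; the layer sizes and network structure here are arbitrary choices made for the example.

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """Discrete action space: network outputs a vector of action probabilities (via logits)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class GaussianPolicy(nn.Module):
    """Continuous action space: network outputs the mean; a learned per-dimension
    log-std gives the diagonal covariance."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

# Sampling an action from π(a | s, θ):
#   dist = GaussianPolicy(obs_dim=4, act_dim=2)(torch.zeros(4)); a = dist.sample()
```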
Episodic Setting
In each episode, the initial state is sampled from µ, and the agent acts until the terminal state is reached. For example:
- Taxi robot reaches its destination (termination = good)
- Waiter robot finishes a shift (fixed time)
- Walking robot falls over (termination = bad)
Goal: maximize expected return per episode
    maximize_π E[R | π]
Derivative Free Optimization / Evolution
Cross Entropy Method
Initialize µ ∈ R^d, σ ∈ R^d
for iteration = 1, 2, ... do
    Collect n samples of θ_i ∼ N(µ, diag(σ))
    Perform one episode with each θ_i, obtaining reward R_i
    Select the top p% of θ samples (e.g. p = 20), the elite set
    Fit a Gaussian distribution to the elite set, updating µ, σ
end for
Return the final µ
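A minimal NumPy sketch of this loop; the `episode_return` callback, population size, and iteration count are placeholders, not values from the slides.

```python
import numpy as np

def cross_entropy_method(episode_return, d, n=100, p=20, n_iters=50, seed=0):
    """episode_return(theta) -> reward of one episode run with parameter vector theta."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(d), np.ones(d)
    n_elite = max(1, int(n * p / 100))
    for _ in range(n_iters):
        # theta_i ~ per-coordinate Gaussian with mean mu and std sigma
        thetas = mu + sigma * rng.standard_normal((n, d))
        returns = np.array([episode_return(th) for th in thetas])
        elite = thetas[np.argsort(returns)[-n_elite:]]             # top p% by return
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-8   # refit Gaussian
    return mu
```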
Cross Entropy Method
Sometimes works embarrassingly well
I. Szita and A. Lörincz. Learning Tetris using the noisy cross-entropy method. Neural Computation (2006)
Stochastic Gradient Ascent on Distribution
- Let µ define a distribution over the parameter θ of policy π_θ: θ ∼ P_µ(θ)
- Return R depends on policy parameter θ and noise ζ:
      maximize_µ E_{θ,ζ}[R(θ, ζ)]
  R is unknown and possibly nondifferentiable
- Score function gradient estimator:
      ∇_µ E_{θ,ζ}[R(θ, ζ)] = E_{θ,ζ}[∇_µ log P_µ(θ) R(θ, ζ)] ≈ (1/N) Σ_{i=1}^N ∇_µ log P_µ(θ_i) R_i
Stochastic Gradient Ascent on Distribution
Compare with the cross-entropy method:
- Score function gradient:
      ∇_µ E_{θ,ζ}[R(θ, ζ)] ≈ (1/N) Σ_{i=1}^N ∇_µ log P_µ(θ_i) R_i
- Cross-entropy method:
      maximize_µ (1/N) Σ_{i=1}^N log P_µ(θ_i) f(R_i)
  where f(r) = 1[r above threshold]
Connection to Finite Differences
- Suppose P_µ is a Gaussian distribution with mean µ and covariance σ²I:
      log P_µ(θ) = −‖µ − θ‖² / (2σ²) + const
      ∇_µ log P_µ(θ) = (θ − µ)/σ²
      R_i ∇_µ log P_µ(θ_i) = R_i (θ_i − µ)/σ²
- Suppose we do antithetic sampling, where we use pairs of samples θ_+ = µ + σz, θ_− = µ − σz:
      (1/2) (R(µ + σz, ζ) ∇_µ log P_µ(θ_+) + R(µ − σz, ζ′) ∇_µ log P_µ(θ_−))
      = (1/(2σ)) (R(µ + σz, ζ) − R(µ − σz, ζ′)) z
- Using the same noise ζ for both evaluations reduces variance
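A NumPy sketch of this antithetic estimator; the black-box return function `R(theta, noise_seed)` and the sample sizes are placeholders introduced for the example.

```python
import numpy as np

def es_gradient(R, mu, sigma=0.1, n_pairs=50, seed=0):
    """Estimate grad_mu E[R(theta, zeta)], theta ~ N(mu, sigma^2 I), with
    antithetic pairs theta_+/- = mu +/- sigma*z and common random numbers."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(mu, dtype=float)
    for _ in range(n_pairs):
        z = rng.standard_normal(mu.shape)
        noise_seed = int(rng.integers(1 << 31))     # same noise zeta for both evaluations
        r_plus = R(mu + sigma * z, noise_seed)
        r_minus = R(mu - sigma * z, noise_seed)
        grad += (r_plus - r_minus) / (2 * sigma) * z
    return grad / n_pairs

# One step of stochastic gradient ascent on the distribution parameters:
#   mu = mu + step_size * es_gradient(R, mu)
```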
Deriving the Score Function Estimator
Score function gradient estimator:
    ∇_µ E_{θ,ζ}[R(θ, ζ)] = E_{θ,ζ}[∇_µ log P_µ(θ) R(θ, ζ)] ≈ (1/N) Σ_{i=1}^N ∇_µ log P_µ(θ_i) R_i
Derive by writing the expectation as an integral:
    ∇_µ ∫ dθ dζ P_µ(θ) R(θ, ζ)
    = ∫ dθ dζ ∇_µ P_µ(θ) R(θ, ζ)
    = ∫ dθ dζ P_µ(θ) ∇_µ log P_µ(θ) R(θ, ζ)
    = E_{θ,ζ}[∇_µ log P_µ(θ) R(θ, ζ)]
Literature on DFO
- Evolution strategies (Rechenberg and Eigen, 1973)
- Simultaneous perturbation stochastic approximation (Spall, 1992)
- Covariance matrix adaptation: a popular relative of CEM (Hansen, 2006)
- Reward weighted regression (Peters and Schaal, 2007), PoWER (Kober and Peters, 2007)
Success Stories
- CMA is very effective for optimizing low-dimensional locomotion controllers
- UT Austin Villa: RoboCup 2012 3D Simulation League Champion
- Evolution Strategies was shown to perform well on Atari, competitive with policy gradient methods (Salimans et al., 2017)
Policy Gradient Methods
Overview
Problem: maximize E[R | π_θ]
- Here, we'll use a fixed policy parameter θ (instead of sampling θ ∼ P_µ) and estimate the gradient with respect to θ
- Noise is in action space rather than parameter space
Overview
Problem: maximize E[R | π_θ]
Intuitions: collect a bunch of trajectories, and...
1. Make the good trajectories more probable
2. Make the good actions more probable
3. Push the actions towards better actions
Score Function Gradient Estimator for Policies
Now the random variable is a whole trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{T−1}, a_{T−1}, r_{T−1}, s_T):
    ∇_θ E_τ[R(τ)] = E_τ[∇_θ log P(τ | θ) R(τ)]
Just need to write out P(τ | θ):
    P(τ | θ) = µ(s_0) Π_{t=0}^{T−1} [π(a_t | s_t, θ) P(s_{t+1}, r_t | s_t, a_t)]
    log P(τ | θ) = log µ(s_0) + Σ_{t=0}^{T−1} [log π(a_t | s_t, θ) + log P(s_{t+1}, r_t | s_t, a_t)]
    ∇_θ log P(τ | θ) = Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ)
    ∇_θ E_τ[R] = E_τ[R Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ)]
Policy Gradient: Use Temporal Structure
Previous slide:
    ∇_θ E_τ[R] = E_τ[(Σ_{t=0}^{T−1} r_t)(Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ))]
We can repeat the same argument to derive the gradient estimator for a single reward term r_{t′}:
    ∇_θ E[r_{t′}] = E[r_{t′} Σ_{t=0}^{t′} ∇_θ log π(a_t | s_t, θ)]
Summing this formula over t′, we obtain
    ∇_θ E[R] = E[Σ_{t′=0}^{T−1} r_{t′} Σ_{t=0}^{t′} ∇_θ log π(a_t | s_t, θ)]
             = E[Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) Σ_{t′=t}^{T−1} r_{t′}]
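The inner sums Σ_{t′≥t} r_{t′} ("rewards to go") are easy to compute in a single backward pass; a small helper sketch (not from the slides) follows, and it is reused in later examples.

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """Return an array whose t-th entry is sum over t' >= t of gamma^(t'-t) * r_{t'}."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# rewards_to_go([1.0, 2.0, 3.0]) -> array([6., 5., 3.])
```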
Policy Gradient: Introduce Baseline
- Further reduce variance by introducing a baseline b(s):
      ∇_θ E_τ[R] = E_τ[Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) (Σ_{t′=t}^{T−1} r_{t′} − b(s_t))]
- For any choice of b, the gradient estimator is unbiased.
- A near-optimal choice is the expected return,
      b(s_t) ≈ E[r_t + r_{t+1} + r_{t+2} + ... + r_{T−1}]
- Interpretation: increase the logprob of action a_t proportionally to how much the return Σ_{t′=t}^{T−1} r_{t′} is better than expected
Discounts for Variance Reduction
- Introduce a discount factor γ, which ignores delayed effects between actions and rewards:
      ∇_θ E_τ[R] ≈ E_τ[Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) (Σ_{t′=t}^{T−1} γ^{t′−t} r_{t′} − b(s_t))]
- Now we want b(s_t) ≈ E[r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^{T−1−t} r_{T−1}]
- Write the gradient estimator more generally as
      ∇_θ E_τ[R] ≈ E_τ[Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) Â_t]
  where Â_t is the advantage estimate
Vanilla Policy Gradient Algorithm
Initialize policy parameter θ, baseline b
for iteration = 1, 2, ... do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute
        the return R̂_t = Σ_{t′=t}^{T−1} γ^{t′−t} r_{t′}, and
        the advantage estimate Â_t = R̂_t − b(s_t)
    Re-fit the baseline, by minimizing ‖b(s_t) − R̂_t‖², summed over all trajectories and timesteps
    Update the policy, using a policy gradient estimate ĝ, which is a sum of terms ∇_θ log π(a_t | s_t, θ) Â_t
end for
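A sketch of the two update steps in PyTorch, assuming the trajectories have already been collected into tensors (with R̂_t computed e.g. by the `rewards_to_go` helper above); the network classes, optimizers, and learning rates are placeholders, not part of the algorithm as stated.

```python
import torch

def vanilla_pg_update(policy, baseline, pi_opt, b_opt, obs, acts, returns):
    """obs: [N, obs_dim], acts: [N] (or [N, act_dim]), returns: [N] tensor of R_hat_t."""
    # Advantage estimate: A_hat_t = R_hat_t - b(s_t)
    with torch.no_grad():
        adv = returns - baseline(obs).squeeze(-1)

    # Policy step: ascend sum_t log pi(a_t | s_t) * A_hat_t (so minimize its negative)
    logp = policy(obs).log_prob(acts)      # for multi-dim Gaussian actions, .sum(-1)
    pi_loss = -(logp * adv).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Re-fit the baseline by minimizing ||b(s_t) - R_hat_t||^2
    b_loss = ((baseline(obs).squeeze(-1) - returns) ** 2).mean()
    b_opt.zero_grad(); b_loss.backward(); b_opt.step()
```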
Advantage Actor-Critic
Use a neural network that represents both the policy π_θ and a value function V_θ (approximating V^{π_θ})
Pseudocode:
for iteration = 1, 2, ... do
    Agent acts for T timesteps (e.g., T = 20)
    For each timestep t, compute
        R̂_t = r_t + γ r_{t+1} + ... + γ^{T−1−t} r_{T−1} + γ^{T−t} V_θ(s_T)
        Â_t = R̂_t − V_θ(s_t)
    R̂_t is the target value in a regression problem, Â_t is the estimated advantage function
    Compute the gradient g = ∇_θ Σ_{t=1}^{T} [log π_θ(a_t | s_t) Â_t − c (V_θ(s_t) − R̂_t)²]
    g is plugged into a stochastic gradient ascent algorithm, e.g., Adam
end for
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. Asynchronous methods for deep reinforcement learning. (2016)
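A sketch of the bootstrapped return and advantage computation inside this loop (NumPy); the function name and interface are assumptions made for the example.

```python
import numpy as np

def nstep_returns_and_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """rewards: r_t for t = 0..T-1; values: V_theta(s_t) for t = 0..T-1;
    bootstrap_value: V_theta(s_T), used to truncate the return after T steps."""
    T = len(rewards)
    returns = np.zeros(T)
    running = bootstrap_value
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running   # R_hat_t = r_t + gamma * R_hat_{t+1}
        returns[t] = running
    advantages = returns - np.asarray(values)    # A_hat_t = R_hat_t - V_theta(s_t)
    return returns, advantages
```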
Trust Region Policy Optimization
- Motivation: make policy gradients more robust and sample-efficient
- Unlike in supervised learning, the policy affects the distribution of its own inputs, so a large bad update can be disastrous
- Makes use of a surrogate objective that estimates the performance of the policy around the policy π_old used for sampling:
      L_{π_old}(π) = (1/N) Σ_{i=1}^N [π(a_i | s_i) / π_old(a_i | s_i)] Â_i     (1)
  Differentiating this objective gives the policy gradient
- L_{π_old}(π) is only accurate when the state distribution of π is close to that of π_old, so it makes sense to constrain or penalize the distance D_KL[π_old ‖ π]
Trust Region Policy Optimization
Pseudocode:
for iteration = 1, 2, ... do
    Run policy for T timesteps or N trajectories
    Estimate advantage function at all timesteps
    Solve
        maximize_θ   Σ_{n=1}^N [π_θ(a_n | s_n) / π_{θ_old}(a_n | s_n)] Â_n
        subject to   KL_{π_{θ_old}}(π_θ) ≤ δ
end for
- Can solve the constrained optimization problem efficiently by using conjugate gradient
- Closely related to natural policy gradients (Kakade, 2002), natural actor-critic (Peters and Schaal, 2005), REPS (Peters et al., 2010)
Proximal Policy Optimization
Use a penalty instead of a constraint:
    maximize_θ   Σ_{n=1}^N [π_θ(a_n | s_n) / π_{θ_old}(a_n | s_n)] Â_n − β · KL_{π_{θ_old}}(π_θ)
Pseudocode:
for iteration = 1, 2, ... do
    Run policy for T timesteps or N trajectories
    Estimate advantage function at all timesteps
    Do SGD on the above objective for some number of epochs
    If KL too high, increase β. If KL too low, decrease β.
end for
≈ same performance as TRPO, but only first-order optimization
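A sketch of one outer iteration of this penalized update in PyTorch; the KL target, the adjustment factors, and the sample-based KL estimate are assumptions chosen for the example rather than values given on the slide.

```python
import torch

def ppo_penalty_step(policy, optimizer, obs, acts, adv, logp_old, beta,
                     epochs=10, kl_target=0.01):
    """Maximize mean[ratio * A_hat] - beta * KL(pi_old || pi) by SGD, then adapt beta."""
    for _ in range(epochs):
        logp = policy(obs).log_prob(acts)
        ratio = torch.exp(logp - logp_old)
        approx_kl = (logp_old - logp).mean()        # sample-based KL(pi_old || pi)
        loss = -(ratio * adv).mean() + beta * approx_kl
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    with torch.no_grad():
        kl = (logp_old - policy(obs).log_prob(acts)).mean()
    if kl > 1.5 * kl_target:       # KL too high -> increase beta
        beta *= 2.0
    elif kl < kl_target / 1.5:     # KL too low -> decrease beta
        beta /= 2.0
    return beta
```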
Variance Reduction for Policy Gradients
Reward Shaping
Reward Shaping
- Reward shaping: δ(s, a, s′) = r(s, a, s′) + γΦ(s′) − Φ(s) for an arbitrary "potential" function Φ
- Theorem: δ admits the same optimal policies as r.¹
- Proof sketch: suppose Q satisfies the Bellman equation (T Q = Q). If we transform r → δ, the policy's value function satisfies Q̃(s, a) = Q(s, a) − Φ(s)
- Q satisfies the Bellman equation ⇒ Q̃ also satisfies the Bellman equation

¹ A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. ICML. 1999.
Reward Shaping
Theorem: δ admits the same optimal policies as r.
Alternative proof: the advantage function is invariant. Look at the effect on V^π and Q^π:
    E[δ_0 + γδ_1 + γ²δ_2 + ...]
    = E[(r_0 + γΦ(s_1) − Φ(s_0)) + γ(r_1 + γΦ(s_2) − Φ(s_1)) + γ²(r_2 + γΦ(s_3) − Φ(s_2)) + ...]
    = E[r_0 + γr_1 + γ²r_2 + ... − Φ(s_0)]
Thus
    Ṽ^π(s) = V^π(s) − Φ(s)
    Q̃^π(s, a) = Q^π(s, a) − Φ(s)
    Ã^π(s, a) = A^π(s, a)
and A^π(s, π(s)) = 0 at all states ⇔ π is optimal
A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. ICML. 1999.
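A tiny NumPy sanity check of the telescoping step above (a toy example, not from the slides); for a finite trajectory there is also a boundary term γ^T Φ(s_T).

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.99, 10
r = rng.normal(size=T)           # rewards r_0 ... r_{T-1}
phi = rng.normal(size=T + 1)     # arbitrary potential Phi(s_0) ... Phi(s_T)

delta = r + gamma * phi[1:] - phi[:-1]     # shaped rewards delta_t
disc = gamma ** np.arange(T)

# sum_t gamma^t delta_t = sum_t gamma^t r_t - Phi(s_0) + gamma^T Phi(s_T)
assert np.isclose(np.sum(disc * delta),
                  np.sum(disc * r) - phi[0] + gamma ** T * phi[-1])
```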
Reward Shaping and Problem Difficulty
- Shape with Φ = V* ⇒ problem is solved in one step of value iteration
- Shaping leaves the policy gradient invariant (and just adds a baseline to the estimator):
      E[∇_θ log π_θ(a_0 | s_0) ((r_0 + γΦ(s_1) − Φ(s_0)) + γ(r_1 + γΦ(s_2) − Φ(s_1)) + γ²(r_2 + γΦ(s_3) − Φ(s_2)) + ...)]
      = E[∇_θ log π_θ(a_0 | s_0) (r_0 + γr_1 + γ²r_2 + ... − Φ(s_0))]
      = E[∇_θ log π_θ(a_0 | s_0) (r_0 + γr_1 + γ²r_2 + ...)]
Reward Shaping and Policy Gradients
First note the connection between the shaped reward and the advantage function:
    E_{s_1}[r_0 + γV^π(s_1) − V^π(s_0) | s_0 = s, a_0 = a] = A^π(s, a)
Now consider the policy gradient, ignoring all but the first shaped reward (i.e., pretend γ = 0 after shaping):
    E[Σ_t ∇_θ log π_θ(a_t | s_t) δ_t] = E[Σ_t ∇_θ log π_θ(a_t | s_t) (r_t + γV^π(s_{t+1}) − V^π(s_t))]
                                      = E[Σ_t ∇_θ log π_θ(a_t | s_t) A^π(s_t, a_t)]
Reward Shaping and Policy Gradients
- Compromise: use a more aggressive discount γλ, with λ ∈ (0, 1); this is called generalized advantage estimation:
      Σ_t ∇_θ log π_θ(a_t | s_t) Σ_{k=0}^{∞} (γλ)^k δ_{t+k}
- Or alternatively, use a hard cutoff as in A3C:
      Σ_t ∇_θ log π_θ(a_t | s_t) Σ_{k=0}^{n−1} γ^k δ_{t+k}
      = Σ_t ∇_θ log π_θ(a_t | s_t) (Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n Φ(s_{t+n}) − Φ(s_t))
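A NumPy sketch of the generalized advantage estimator for one trajectory; the interface (passing V(s_T) separately) is an assumption made for the example.

```python
import numpy as np

def gae(rewards, values, final_value, gamma=0.99, lam=0.95):
    """A_hat_t = sum_k (gamma*lam)^k * delta_{t+k}, with
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), final_value)  # V(s_0) ... V(s_T)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```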
Reward Shaping Summary
- The reward shaping transformation leaves the policy gradient and the optimal policy invariant
- Shaping with Φ ≈ V^π makes the consequences of actions more immediate
- Shaping, and then ignoring all but the first term, gives the policy gradient
Aside: Reward Shaping is Crucial in Practice
- I. Mordatch, E. Todorov, and Z. Popović. Discovery of complex behaviors through contact-invariant optimization. ACM Transactions on Graphics (TOG) 31.4 (2012), p. 43
- Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE. 2012, pp. 4906–4913
Choosing parameters γ, λ
Performance as γ, λ are varied
[Figure: cost vs. number of policy iterations (0 to 500) on the 3D Biped task, for settings γ ∈ {0.96, 0.98, 0.99, 0.995, 1} and λ ∈ {0.92, 0.96, 0.98, 0.99, 1.0}, plus γ = 1 with no value function]
(Generalized Advantage Estimation for Policy Gradients, Schulman et al., ICLR 2016)
Pathwise Derivative Methods
Deriving the Policy Gradient, Reparameterized
- Episodic MDP (graphical model: θ and states s_1, ..., s_T feed into actions a_1, ..., a_T, which produce the return R_T). Want to compute ∇_θ E[R_T]. We'll use ∇_θ log π(a_t | s_t; θ)
- Reparameterize: a_t = π(s_t, z_t; θ), where z_t is noise from a fixed distribution (the stochastic nodes are now the z_t rather than the a_t)
- Only works if P(s_2 | s_1, a_1) is known
Using a Q-function
    d/dθ E[R_T] = E[Σ_{t=1}^{T} (dR_T/da_t)(da_t/dθ)]
                = E[Σ_{t=1}^{T} (d/da_t) E[R_T | a_t] (da_t/dθ)]
                = E[Σ_{t=1}^{T} (dQ(s_t, a_t)/da_t)(da_t/dθ)]
                = E[Σ_{t=1}^{T} (d/dθ) Q(s_t, π(s_t, z_t; θ))]
SVG(0) Algorithm
Learn Q_φ to approximate Q^{π,γ}, and use it to compute gradient estimates.
Pseudocode:
for iteration = 1, 2, ... do
    Execute policy π_θ to collect T timesteps of data
    Update π_θ using g ∝ ∇_θ Σ_{t=1}^{T} Q(s_t, π(s_t, z_t; θ))
    Update Q_φ using g ∝ ∇_φ Σ_{t=1}^{T} (Q_φ(s_t, a_t) − Q̂_t)², e.g. with TD(λ)
end for
N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. Learning Continuous Control Policies by Stochastic Value Gradients. arXiv preprint arXiv:1510.09142 (2015)
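A sketch of the two updates in PyTorch, assuming a reparameterized Gaussian actor (mean network plus learned log-std) and a critic taking (s, a) pairs; `q_targets` stands for the regression targets Q̂_t computed elsewhere, e.g. with TD(λ).

```python
import torch

def svg0_update(actor_mean, log_std, q_net, pi_opt, q_opt, obs, acts, q_targets):
    # Critic: regress Q_phi(s_t, a_t) towards the targets Q_hat_t
    q_loss = ((q_net(obs, acts) - q_targets) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Actor: ascend sum_t Q(s_t, pi(s_t, z_t; theta)), with a_t reparameterized
    z = torch.randn_like(actor_mean(obs))            # noise z_t from a fixed distribution
    a = actor_mean(obs) + log_std.exp() * z          # a_t = pi(s_t, z_t; theta)
    pi_loss = -q_net(obs, a).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```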
SVG(1) Algorithm
Instead of learning Q, we learn:
- State-value function V ≈ V^{π,γ}
- Dynamics model f, approximating s_{t+1} = f(s_t, a_t) + ζ_t
Given a transition (s_t, a_t, s_{t+1}), infer ζ_t = s_{t+1} − f(s_t, a_t)
    Q(s_t, a_t) = E[r_t + γV(s_{t+1})] = E[r_t + γV(f(s_t, a_t) + ζ_t)], and a_t = π(s_t, θ, ζ_t)
SVG(∞) Algorithm
- Just learn the dynamics model f
- Given a whole trajectory, infer all noise variables
- Freeze all policy and dynamics noise, differentiate through the entire deterministic computation graph
SVG Results
- Applied to 2D robotics tasks
- Overall: different gradient estimators behave similarly
N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. Learning Continuous Control Policies by Stochastic Value Gradients. arXiv preprint arXiv:1510.09142 (2015)
Deterministic Policy Gradient
- For Gaussian actions, the variance of the score function policy gradient estimator goes to infinity as the action variance goes to zero
- Intuition: finite difference gradient estimators
- But the SVG(0) gradient is fine as σ → 0:
      ∇_θ Σ_t Q(s_t, π(s_t, θ, ζ_t))
- Problem: there's no exploration.
- Solution: add noise to the policy, but estimate Q with TD(0), so it's valid off-policy
- The policy gradient is a little biased (even with Q = Q^π), but only because the state distribution is off; it gets the right gradient at every state
D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, et al. Deterministic Policy Gradient Algorithms. ICML. 2014
Deep Deterministic Policy Gradient
- Incorporate the replay buffer and target network ideas from DQN for increased stability
- Use lagged (Polyak-averaged) versions Q_{φ′} and π_{θ′} for fitting Q_φ (towards Q^{π,γ}) with TD(0):
      Q̂_t = r_t + γ Q_{φ′}(s_{t+1}, π(s_{t+1}; θ′))
Pseudocode:
for iteration = 1, 2, ... do
    Act for several timesteps, add data to the replay buffer
    Sample a minibatch
    Update π_θ using g ∝ ∇_θ Σ_{t=1}^{T} Q(s_t, π(s_t, z_t; θ))
    Update Q_φ using g ∝ ∇_φ Σ_{t=1}^{T} (Q_φ(s_t, a_t) − Q̂_t)²
end for
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
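A sketch of one minibatch update (PyTorch); the Polyak rate τ, the network interfaces (the critic returns a [batch] tensor of Q-values), and the omission of episode-termination masking are simplifying assumptions for the example.

```python
import torch

def ddpg_step(actor, critic, actor_targ, critic_targ, pi_opt, q_opt,
              obs, acts, rews, next_obs, gamma=0.99, tau=0.005):
    # TD(0) targets from the lagged (target) networks:
    # Q_hat_t = r_t + gamma * Q_phi'(s_{t+1}, pi_theta'(s_{t+1}))
    with torch.no_grad():
        q_targets = rews + gamma * critic_targ(next_obs, actor_targ(next_obs))

    # Fit Q_phi by regression on the targets
    q_loss = ((critic(obs, acts) - q_targets) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Deterministic policy gradient: ascend Q_phi(s, pi_theta(s))
    pi_loss = -critic(obs, actor(obs)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Polyak-average the target networks
    with torch.no_grad():
        for p, p_targ in zip(list(actor.parameters()) + list(critic.parameters()),
                             list(actor_targ.parameters()) + list(critic_targ.parameters())):
            p_targ.mul_(1 - tau).add_(tau * p)
```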
DDPG Results
Applied to 2D and 3D robotics tasks and driving with pixel input
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
Policy Gradient Methods: Comparison
Two kinds of policy gradient estimator:
- REINFORCE / score function estimator: ∇ log π(a | s) · Â
  - Learn Q or V for variance reduction, to estimate Â
- Pathwise derivative estimators (differentiate w.r.t. action)
  - SVG(0) / DPG: (d/da) Q(s, a) (learn Q)
  - SVG(1): (d/da) (r + γV(s′)) (learn f, V)
  - SVG(∞): (d/da_t) (r_t + γ r_{t+1} + γ² r_{t+2} + ...) (learn f)
- Pathwise derivative methods are more sample-efficient when they work (maybe), but work less generally due to high bias
Thanks Questions?