Variance Reduction for Policy Gradient Methods. March 13, 2017

Gabriel Booth
2 Reward Shaping

4 Reward Shaping
Reward shaping: $\tilde{r}(s, a, s') = r(s, a, s') + \gamma \Phi(s') - \Phi(s)$ for an arbitrary potential $\Phi$.
Theorem (Ng, Harada, and Russell): $\tilde{r}$ admits the same optimal policies as $r$.
Proof sketch: suppose $Q$ satisfies the Bellman equation ($\mathcal{T} Q = Q$). If we transform $r \to \tilde{r}$, the policy's value function satisfies $\tilde{Q}(s, a) = Q(s, a) - \Phi(s)$. Since $Q$ satisfies the Bellman equation, $\tilde{Q}$ also satisfies the Bellman equation for the shaped reward.
Reference: A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. ICML

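5 Reward Shaping
Theorem (Ng, Harada, and Russell): $\tilde{r}$ admits the same optimal policies as $r$.
Alternative proof: the advantage function is invariant. Let's look at the effect on $V^\pi$ and $Q^\pi$, conditioning on either $s_0$ or $(s_0, a_0)$:
$\mathbb{E}\left[ \tilde{r}_0 + \gamma \tilde{r}_1 + \gamma^2 \tilde{r}_2 + \dots \right]$
$= \mathbb{E}\left[ (r_0 + \gamma \Phi(s_1) - \Phi(s_0)) + \gamma (r_1 + \gamma \Phi(s_2) - \Phi(s_1)) + \gamma^2 (r_2 + \gamma \Phi(s_3) - \Phi(s_2)) + \dots \right]$
$= \mathbb{E}\left[ r_0 + \gamma r_1 + \gamma^2 r_2 + \dots - \Phi(s_0) \right]$
The intermediate potentials cancel in pairs (telescoping), leaving only $-\Phi(s_0)$. Thus,
$\tilde{V}^\pi(s) = V^\pi(s) - \Phi(s)$
$\tilde{Q}^\pi(s, a) = Q^\pi(s, a) - \Phi(s)$
$\tilde{A}^\pi(s, a) = A^\pi(s, a)$
Since $\pi$ is optimal exactly when no action has positive advantage ($\max_a A^\pi(s, a) = A^\pi(s, \pi(s)) = 0$ at all states), the optimal policies are unchanged.
Reference: A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. ICML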
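The telescoping step can be checked numerically. The following is a minimal sketch, not taken from the slides: the rewards and potentials are random placeholder values, and the helper names `discounted_return` and `shape_rewards` are hypothetical. On a finite trajectory a residual term $\gamma^T \Phi(s_T)$ remains, so the sketch assumes $\Phi = 0$ at the final state.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

def shape_rewards(rewards, potentials, gamma):
    """Apply r~_t = r_t + gamma * Phi(s_{t+1}) - Phi(s_t).

    `potentials` holds Phi(s_0), ..., Phi(s_T): one more entry than `rewards`.
    """
    rewards = np.asarray(rewards, dtype=float)
    potentials = np.asarray(potentials, dtype=float)
    return rewards + gamma * potentials[1:] - potentials[:-1]

rng = np.random.default_rng(0)
gamma = 0.9
rewards = rng.normal(size=5)       # placeholder rewards r_0..r_4
potentials = rng.normal(size=6)    # placeholder Phi(s_0)..Phi(s_5)
potentials[-1] = 0.0               # assumption: Phi vanishes at the final state

plain = discounted_return(rewards, gamma)
shaped = discounted_return(shape_rewards(rewards, potentials, gamma), gamma)

# The shaped return differs from the plain return only by -Phi(s_0).
print(abs(shaped - (plain - potentials[0])) < 1e-12)
```

Because the offset depends only on the starting state, it shifts $V^\pi$ and $Q^\pi$ equally and cancels in the advantage.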

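6 Reward Shaping and Problem Difficulty
Shape with $\Phi = V^*$ $\Rightarrow$ the problem is solved in one step of value iteration.
Shaping leaves the policy gradient invariant (and just adds a baseline to the estimator):
$\mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a_0 \mid s_0) \left( (r_0 + \gamma \Phi(s_1) - \Phi(s_0)) + \gamma (r_1 + \gamma \Phi(s_2) - \Phi(s_1)) + \gamma^2 (r_2 + \gamma \Phi(s_3) - \Phi(s_2)) + \dots \right) \right]$
$= \mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a_0 \mid s_0) \left( r_0 + \gamma r_1 + \gamma^2 r_2 + \dots - \Phi(s_0) \right) \right]$
$= \mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a_0 \mid s_0) \left( r_0 + \gamma r_1 + \gamma^2 r_2 + \dots \right) \right]$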
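The last equality relies on the fact that a term depending only on the state has zero expectation when multiplied by the score function. A short derivation in the slide's notation, conditioning on $s_0$ and written for a discrete action space (an assumption of this sketch; for continuous actions the sum becomes an integral):

```latex
\begin{align*}
\mathbb{E}_{a_0 \sim \pi_\theta(\cdot \mid s_0)}
  \left[ \nabla_\theta \log \pi_\theta(a_0 \mid s_0)\, \Phi(s_0) \right]
&= \Phi(s_0) \sum_{a_0} \pi_\theta(a_0 \mid s_0)\,
   \nabla_\theta \log \pi_\theta(a_0 \mid s_0) \\
&= \Phi(s_0) \sum_{a_0} \nabla_\theta \pi_\theta(a_0 \mid s_0) \\
&= \Phi(s_0)\, \nabla_\theta \sum_{a_0} \pi_\theta(a_0 \mid s_0)
 = \Phi(s_0)\, \nabla_\theta 1 = 0 .
\end{align*}
```

This is the same argument that makes any state-dependent baseline $b(s_t)$ leave the policy gradient unbiased.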

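7 Reward Shaping and Policy Gradients
First note the connection between the shaped reward and the advantage function:
$\mathbb{E}_{s_1}\left[ r_0 + \gamma V^\pi(s_1) - V^\pi(s_0) \mid s_0 = s, a_0 = a \right] = A^\pi(s, a)$
Now consider the policy gradient and ignore all but the first shaped reward (i.e., pretend $\gamma = 0$ after shaping). We get
$\mathbb{E}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \tilde{r}_t \right] = \mathbb{E}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) \right) \right] = \mathbb{E}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t) \right]$
Compromise: use a more aggressive discount $\gamma\lambda$, with $\lambda \in (0, 1)$. This is called generalized advantage estimation:
$\mathbb{E}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{k=0}^{\infty} (\gamma\lambda)^k\, \tilde{r}_{t+k} \right]$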

8 Reward Shaping Summary
The reward shaping transformation leaves the policy gradient and the optimal policy invariant.
Shaping with $\Phi \approx V^\pi$ makes the consequences of actions more immediate.
Shaping, and then ignoring all but the first term, gives the policy gradient.

9 [Aside] Reward Shaping: Very Important in Practice
I. Mordatch, E. Todorov, and Z. Popović. Discovery of complex behaviors through contact-invariant optimization. ACM Transactions on Graphics (TOG) 31.4 (2012), p. 43
Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE. 2012, pp

10 Variance Reduction for Policy Gradients

11 Variance Reduction
We have the following policy gradient formula:
$\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, A^\pi(s_t, a_t) \right]$
$A^\pi$ is not known, but we can plug in $\hat{A}_t$, an advantage estimator.
Previously, we showed that taking $\hat{A}_t = r_t + r_{t+1} + r_{t+2} + \dots - b(s_t)$, for any function $b(s_t)$, gives an unbiased policy gradient estimator. Choosing $b(s_t) \approx V^\pi(s_t)$ gives variance reduction.

12 The Delayed Reward Problem
With policy gradient methods, we are confounding the effect of multiple actions:
$\hat{A}_t = r_t + r_{t+1} + r_{t+2} + \dots - b(s_t)$ mixes the effect of $a_t, a_{t+1}, a_{t+2}, \dots$
The SNR of $\hat{A}_t$ scales roughly as $1/T$: only $a_t$ contributes to the signal $A^\pi(s_t, a_t)$, but $a_{t+1}, a_{t+2}, \dots$ contribute to noise.

13 Variance Reduction with Discounts
A discount factor $\gamma$, $0 < \gamma < 1$, downweights the effect of rewards that are far in the future $\Rightarrow$ it ignores long-term dependencies.
We can form an advantage estimator using the discounted return:
$\hat{A}^\gamma_t = \underbrace{r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots}_{\text{discounted return}} - b(s_t)$
This reduces to our previous estimator when $\gamma = 1$.
So that the advantage has expectation zero, we should fit the baseline to the discounted value function:
$V^{\pi,\gamma}(s) = \mathbb{E}_\tau\left[ r_0 + \gamma r_1 + \gamma^2 r_2 + \dots \mid s_0 = s \right]$
Discounting with $\gamma$ is similar to using a horizon of $1/(1 - \gamma)$ timesteps.
$\hat{A}^\gamma_t$ is a biased estimator of the advantage function.
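As an illustration, the discounted return for every timestep of a finite trajectory can be computed with a single backward pass. This is a sketch under stated assumptions: the function names are hypothetical, the trajectory is finite, and the baseline values are supplied by the caller.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    rewards = np.asarray(rewards, dtype=float)
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns

def discounted_advantages(rewards, baselines, gamma):
    """A^gamma_t = (discounted return from t) - b(s_t)."""
    return discounted_returns(rewards, gamma) - np.asarray(baselines, dtype=float)

rewards = [1.0, 1.0, 1.0, 1.0]
undiscounted = discounted_returns(rewards, 1.0)   # values 4, 3, 2, 1
discounted = discounted_returns(rewards, 0.5)     # values 1.875, 1.75, 1.5, 1
effective_horizon = 1.0 / (1.0 - 0.99)            # about 100 timesteps for gamma = 0.99
```

With $\gamma = 1$ the function recovers the undiscounted reward-to-go used in the previous estimator.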

14 Value Functions in the Future
The baseline accounts for and removes the effect of past actions.
We can also use the value function to estimate future rewards:
$r_t + \gamma V(s_{t+1})$ (cut off at one timestep)
$r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$ (cut off at two timesteps)
...
$r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$ ($\infty$ timesteps, no $V$)

15 Value Functions in the Future
Subtracting out baselines, we get advantage estimators:
$\hat{A}^{(1)}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
$\hat{A}^{(2)}_t = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t)$
...
$\hat{A}^{(\infty)}_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots - V(s_t)$
$\hat{A}^{(1)}_t$ has low variance but high bias; $\hat{A}^{(\infty)}_t$ has high variance but low bias.
Using an intermediate $k$ (say, 20) gives an intermediate amount of bias and variance.
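A minimal sketch of the $k$-step estimator follows. The function name and the array layout are assumptions of this example: `values` holds $V(s_0), \dots, V(s_T)$, one more entry than `rewards`, and $k$ is clipped when it would run past the end of the trajectory.

```python
import numpy as np

def k_step_advantage(rewards, values, gamma, t, k):
    """A^(k)_t = r_t + gamma*r_{t+1} + ... + gamma^(k-1)*r_{t+k-1}
                 + gamma^k * V(s_{t+k}) - V(s_t).

    `values` has len(rewards) + 1 entries; the last one should be 0.0
    if the final state is terminal.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    k = min(k, len(rewards) - t)            # assumption: clip k at the trajectory end
    discounts = gamma ** np.arange(k)
    partial_return = np.dot(discounts, rewards[t:t + k])
    return float(partial_return + gamma ** k * values[t + k] - values[t])

rewards = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 1.5, 0.0]               # placeholder value estimates
one_step = k_step_advantage(rewards, values, 0.9, t=0, k=1)    # 1 + 0.9*1.0 - 0.5
two_step = k_step_advantage(rewards, values, 0.9, t=0, k=2)    # uses V(s_2)
full = k_step_advantage(rewards, values, 0.9, t=0, k=3)        # no bootstrapping left
```

Small $k$ leans on the value estimate (more bias, less variance); large $k$ leans on sampled rewards (less bias, more variance).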

16 Finite-Horizon Methods: Advantage Actor-Critic
A2C / A3C uses this fixed-horizon advantage estimator.
Pseudocode:
for iteration = 1, 2, ... do
    Agent acts for T timesteps (e.g., T = 20), visiting $s_0, \dots, s_T$
    For each timestep $t = 0, \dots, T-1$, compute
        $\hat{R}_t = r_t + \gamma r_{t+1} + \dots + \gamma^{T-1-t} r_{T-1} + \gamma^{T-t} V(s_T)$
        $\hat{A}_t = \hat{R}_t - V(s_t)$
    $\hat{R}_t$ is the target for the value function, in a regression problem
    $\hat{A}_t$ is the estimated advantage function
    Compute the loss gradient $g = \nabla_\theta \sum_{t=0}^{T-1} \left[ -\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t + c \left( V(s_t) - \hat{R}_t \right)^2 \right]$
    g is plugged into a stochastic gradient descent variant, e.g., Adam
end for
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. Asynchronous methods for deep reinforcement learning. (2016)
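The target and loss computation inside one iteration might look like the following sketch. It is not the A3C authors' implementation: the function names are hypothetical, the value coefficient `value_coef=0.5` is an assumed setting for $c$, and the loss is only evaluated, not differentiated. In a real implementation an autodiff framework would compute the gradient, treating $\hat{A}_t$ and $\hat{R}_t$ as constants.

```python
import numpy as np

def a2c_targets(rewards, values, bootstrap_value, gamma):
    """Compute R^_t and A^_t for one T-step rollout.

    rewards: r_0..r_{T-1}; values: V(s_0)..V(s_{T-1});
    bootstrap_value: V(s_T), or 0.0 if the episode ended at s_T.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    targets = np.zeros_like(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # R^_t = r_t + gamma * R^_{t+1}
        targets[t] = running
    advantages = targets - values
    return targets, advantages

def a2c_loss(log_probs, values, targets, advantages, value_coef=0.5):
    """Scalar loss: sum_t [ -log pi(a_t|s_t) * A^_t + c * (V(s_t) - R^_t)^2 ]."""
    log_probs = np.asarray(log_probs, dtype=float)
    values = np.asarray(values, dtype=float)
    policy_term = -np.sum(log_probs * advantages)
    value_term = value_coef * np.sum((values - targets) ** 2)
    return float(policy_term + value_term)

# Placeholder rollout of T = 3 steps.
values = [1.0, 0.5, 1.5]
targets, advantages = a2c_targets([1.0, 0.0, 2.0], values, bootstrap_value=1.0, gamma=0.5)
loss = a2c_loss([-1.0, -1.0, -1.0], values, targets, advantages)
```

The backward recursion avoids recomputing the discounted sum for every timestep, so the cost is linear in T.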

17 A3C Video

18 A3C Results
[Figure: score vs. training time (hours) on Beamrider, Breakout, Pong, Q*bert, and Space Invaders, comparing DQN, 1-step Q, 1-step SARSA, n-step Q, and A3C.]

19 Reward Shaping

20 TD(λ) Methods: Generalized Advantage Estimation
Recall the finite-horizon advantage estimators:
$\hat{A}^{(k)}_t = r_t + \gamma r_{t+1} + \dots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}) - V(s_t)$
Define the TD error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
By a telescoping sum, $\hat{A}^{(k)}_t = \delta_t + \gamma \delta_{t+1} + \dots + \gamma^{k-1} \delta_{t+k-1}$.
Take an exponentially weighted average of the finite-horizon estimators (the factor $1 - \lambda$ normalizes the weights so that they sum to one):
$\hat{A}^\lambda_t = (1 - \lambda) \left( \hat{A}^{(1)}_t + \lambda \hat{A}^{(2)}_t + \lambda^2 \hat{A}^{(3)}_t + \dots \right)$
We obtain
$\hat{A}^\lambda_t = \delta_t + (\gamma\lambda) \delta_{t+1} + (\gamma\lambda)^2 \delta_{t+2} + \dots$
This scheme is named generalized advantage estimation (GAE) in [1], though versions have appeared earlier, e.g., [2]. It is related to TD(λ).
[1] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. (2015)
[2] H. Kimura and S. Kobayashi. An Analysis of Actor/Critic Algorithms Using Eligibility Traces: Reinforcement Learning with Imperfect Value
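The sum $\hat{A}^\lambda_t = \sum_k (\gamma\lambda)^k \delta_{t+k}$ is commonly evaluated with a backward recursion over the TD errors. The sketch below assumes a finite rollout, so the sum is truncated at the last timestep; the function names and the layout of `values` (one more entry than `rewards`) are choices of this example, not taken from the slides.

```python
import numpy as np

def td_errors(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); `values` has len(rewards) + 1 entries."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

def gae(rewards, values, gamma, lam):
    """A^lambda_t = sum_k (gamma*lam)^k * delta_{t+k}, truncated at the rollout end."""
    deltas = td_errors(rewards, values, gamma)
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running   # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages

rewards = [1.0, 2.0, 3.0]
values = [0.5, 1.0, 1.5, 0.0]                 # placeholder estimates V(s_0)..V(s_3)
one_step = gae(rewards, values, 0.9, 0.0)     # lam = 0: equals the TD errors
full = gae(rewards, values, 0.9, 1.0)         # lam = 1: discounted return minus V(s_t)
```

The two limiting cases recover the estimators from the earlier slides: $\lambda = 0$ gives $\hat{A}^{(1)}_t$ and $\lambda = 1$ gives the longest available horizon.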

21 Choosing parameters γ, λ
Performance as γ, λ are varied.
[Figure: 3D Biped, cost vs. number of policy iterations. Curves: γ=0.96, λ=0.96; γ=0.98, λ=0.96; γ=0.99, λ=0.96; γ=0.995, λ=0.92; γ=0.995, λ=0.96; γ=0.995, λ=0.98; γ=0.995, λ=0.99; γ=0.995, λ=1.0; γ=1, λ=0.96; γ=1, no value function.]

22 TRPO+GAE Video
