Deep Reinforcement Learning via Policy Optimization


1 Deep Reinforcement Learning via Policy Optimization
John Schulman, July 3, 2017

2 Introduction

5 Deep Reinforcement Learning: What to Learn?
- Policies (select next action)
- Value functions (measure goodness of states or state-action pairs)
- Models (predict next states and rewards)

6 Model-Free RL: (Rough) Taxonomy
- Policy Optimization: DFO / Evolution, Policy Gradients
- Dynamic Programming: Policy Iteration, Value Iteration (linked by modified policy iteration), Q-Learning
- Actor-Critic Methods sit between the two families, combining policy gradients with learned value functions

12 Policy Optimization vs Dynamic Programming
Conceptually...
- Policy optimization: optimize what you care about
- Dynamic programming: indirect, exploit the problem structure, self-consistency
Empirically...
- Policy optimization is more versatile; dynamic programming methods are more sample-efficient when they work
- Policy optimization methods are more compatible with rich architectures (including recurrence) and with auxiliary objectives (tasks other than control); dynamic programming methods are more compatible with exploration and off-policy learning

18 Parameterized Policies
- A family of policies indexed by parameter vector θ ∈ R^d
- Deterministic: a = π(s, θ)
- Stochastic: π(a | s, θ)
- Analogous to classification or regression with input s, output a:
  - Discrete action space: network outputs a vector of probabilities
  - Continuous action space: network outputs mean and diagonal covariance of a Gaussian
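
The two output types above map directly onto standard network heads. A minimal PyTorch sketch (class names and layer sizes are illustrative, not from the slides):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class DiscretePolicy(nn.Module):
    """pi(a|s): network outputs a vector of action probabilities (via logits)."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))
    def forward(self, obs):
        return Categorical(logits=self.net(obs))

class GaussianPolicy(nn.Module):
    """pi(a|s): network outputs the mean; a state-independent log-std gives a diagonal covariance."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))
    def forward(self, obs):
        return Normal(self.mean_net(obs), self.log_std.exp())

# Sampling an action and its log-probability:
# dist = policy(torch.as_tensor(obs, dtype=torch.float32))
# a = dist.sample(); logp = dist.log_prob(a)
```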

23 Episodic Setting
In each episode, the initial state is sampled from µ, and the agent acts until the terminal state is reached. For example:
- Taxi robot reaches its destination (termination = good)
- Waiter robot finishes a shift (fixed time)
- Walking robot falls over (termination = bad)
Goal: maximize expected return per episode,
    maximize_π E[R | π]

24 Derivative Free Optimization / Evolution

25 Cross Entropy Method
Initialize µ ∈ R^d, σ ∈ R^d
for iteration = 1, 2, ... do
    Collect n samples of θ_i ~ N(µ, diag(σ))
    Perform one episode with each θ_i, obtaining reward R_i
    Select the top p% of θ samples (e.g. p = 20), the elite set
    Fit a Gaussian distribution to the elite set, updating µ, σ
end for
Return the final µ
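
A minimal NumPy sketch of this loop; `evaluate_episode`, which runs one episode with parameters θ and returns the total reward, is assumed to be supplied by the environment code:

```python
import numpy as np

def cross_entropy_method(evaluate_episode, dim, n_samples=50, elite_frac=0.2,
                         n_iters=100, init_std=1.0):
    """Cross Entropy Method over policy parameters theta in R^dim."""
    mu, sigma = np.zeros(dim), np.full(dim, init_std)
    n_elite = max(1, int(n_samples * elite_frac))
    for _ in range(n_iters):
        # Sample parameter vectors from the current Gaussian N(mu, diag(sigma^2))
        thetas = mu + sigma * np.random.randn(n_samples, dim)
        rewards = np.array([evaluate_episode(th) for th in thetas])
        # Keep the elite set: the top-p% of samples by episode reward
        elite = thetas[np.argsort(rewards)[-n_elite:]]
        # Refit the Gaussian to the elite set
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-8
    return mu
```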

29 Cross Entropy Method
Sometimes works embarrassingly well.
I. Szita and A. Lörincz. Learning Tetris using the noisy cross-entropy method. Neural Computation (2006)

32 Stochastic Gradient Ascent on Distribution
- Let µ define a distribution over policy parameters θ for policy π_θ: θ ~ P_µ(θ)
- Return R depends on policy parameter θ and noise ζ:
    maximize_µ E_{θ,ζ}[R(θ, ζ)]
  R is unknown and possibly nondifferentiable
- Score function gradient estimator:
    ∇_µ E_{θ,ζ}[R(θ, ζ)] = E_{θ,ζ}[∇_µ log P_µ(θ) R(θ, ζ)] ≈ (1/N) Σ_{i=1}^N ∇_µ log P_µ(θ_i) R_i

35 Stochastic Gradient Ascent on Distribution
Compare with the cross-entropy method.
- Score function gradient:
    ∇_µ E_{θ,ζ}[R(θ, ζ)] ≈ (1/N) Σ_{i=1}^N ∇_µ log P_µ(θ_i) R_i
- Cross entropy method:
    maximize_µ (1/N) Σ_{i=1}^N log P_µ(θ_i) f(R_i)
  where f(r) = 1[r above threshold]
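
For concreteness, a NumPy sketch of one score-function ascent step when P_µ is an isotropic Gaussian over policy parameters (so ∇_µ log P_µ(θ) = (θ − µ)/σ²); `evaluate_episode` is again an assumed helper:

```python
import numpy as np

def score_function_step(evaluate_episode, mu, sigma, n_samples=50, lr=1e-2):
    """One ascent step on E[R] using the score-function estimator with
    P_mu = N(mu, sigma^2 I), so grad_mu log P_mu(theta) = (theta - mu) / sigma^2."""
    thetas = mu + sigma * np.random.randn(n_samples, mu.size)
    rewards = np.array([evaluate_episode(th) for th in thetas])
    grad_logp = (thetas - mu) / sigma**2                  # shape (n_samples, dim)
    grad = (grad_logp * rewards[:, None]).mean(axis=0)    # (1/N) sum_i grad log P_mu(theta_i) R_i
    return mu + lr * grad
```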

38 Connection to Finite Differences
- Suppose P_µ is a Gaussian with mean µ and covariance σ²I:
    log P_µ(θ) = −‖µ − θ‖² / (2σ²) + const
    ∇_µ log P_µ(θ) = (θ − µ)/σ²
    R_i ∇_µ log P_µ(θ_i) = R_i (θ_i − µ)/σ²
- Suppose we do antithetic sampling, where we use pairs of samples θ_+ = µ + σz, θ_− = µ − σz:
    (1/2) [R(µ + σz, ζ) ∇_µ log P_µ(θ_+) + R(µ − σz, ζ′) ∇_µ log P_µ(θ_−)]
      = (1/(2σ)) [R(µ + σz, ζ) − R(µ − σz, ζ′)] z
- Using the same noise ζ for both evaluations reduces variance
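
A sketch of the resulting antithetic estimator; the `seed` keyword on the assumed `evaluate_episode` helper stands in for reusing the same environment noise ζ for both evaluations:

```python
import numpy as np

def antithetic_es_grad(evaluate_episode, mu, sigma, n_pairs=25):
    """Antithetic (mirrored) sampling estimate of grad_mu E[R], matching the
    finite-difference form (1/(2*sigma)) * (R(mu + sigma*z) - R(mu - sigma*z)) * z."""
    grads = []
    for _ in range(n_pairs):
        z = np.random.randn(mu.size)
        seed = np.random.randint(2**31)      # reuse the same environment noise for both evaluations
        r_plus = evaluate_episode(mu + sigma * z, seed=seed)
        r_minus = evaluate_episode(mu - sigma * z, seed=seed)
        grads.append((r_plus - r_minus) / (2 * sigma) * z)
    return np.mean(grads, axis=0)
```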

40 Deriving the Score Function Estimator
Score function gradient estimator:
    ∇_µ E_{θ,ζ}[R(θ, ζ)] = E_{θ,ζ}[∇_µ log P_µ(θ) R(θ, ζ)] ≈ (1/N) Σ_{i=1}^N ∇_µ log P_µ(θ_i) R_i
Derive by writing the expectation as an integral:
    ∇_µ ∫ dθ dζ P_µ(θ) R(θ, ζ)
      = ∫ dθ dζ ∇_µ P_µ(θ) R(θ, ζ)
      = ∫ dθ dζ P_µ(θ) ∇_µ log P_µ(θ) R(θ, ζ)
      = E_{θ,ζ}[∇_µ log P_µ(θ) R(θ, ζ)]

44 Literature on DFO
- Evolution strategies (Rechenberg and Eigen, 1973)
- Simultaneous perturbation stochastic approximation (Spall, 1992)
- Covariance matrix adaptation: a popular relative of CEM (Hansen, 2006)
- Reward-weighted regression (Peters and Schaal, 2007), PoWER (Kober and Peters, 2007)

47 Success Stories
- CMA is very effective for optimizing low-dimensional locomotion controllers
- UT Austin Villa: RoboCup 3D Simulation League Champion
- Evolution Strategies was shown to perform well on Atari, competitive with policy gradient methods (Salimans et al., 2017)

48 Policy Gradient Methods

50 Overview
Problem: maximize E[R | π_θ]
- Here, we'll use a fixed policy parameter θ (instead of sampling θ ~ P_µ) and estimate the gradient with respect to θ
- Noise is in action space rather than parameter space

53 Overview
Problem: maximize E[R | π_θ]
Intuitions: collect a bunch of trajectories, and
1. Make the good trajectories more probable
2. Make the good actions more probable
3. Push the actions towards better actions

55 Score Function Gradient Estimator for Policies
Now the random variable is a whole trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{T−1}, a_{T−1}, r_{T−1}, s_T):
    ∇_θ E_τ[R(τ)] = E_τ[∇_θ log P(τ | θ) R(τ)]
Just need to write out P(τ | θ):
    P(τ | θ) = µ(s_0) Π_{t=0}^{T−1} [π(a_t | s_t, θ) P(s_{t+1}, r_t | s_t, a_t)]
    log P(τ | θ) = log µ(s_0) + Σ_{t=0}^{T−1} [log π(a_t | s_t, θ) + log P(s_{t+1}, r_t | s_t, a_t)]
    ∇_θ log P(τ | θ) = Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ)
    ∇_θ E_τ[R] = E_τ[R Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ)]

58 Policy Gradient: Use Temporal Structure
Previous slide:
    ∇_θ E_τ[R] = E_τ[(Σ_{t=0}^{T−1} r_t)(Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ))]
We can repeat the same argument to derive the gradient estimator for a single reward term r_{t'}:
    ∇_θ E[r_{t'}] = E[r_{t'} Σ_{t=0}^{t'} ∇_θ log π(a_t | s_t, θ)]
Summing this formula over t', we obtain
    ∇_θ E[R] = E[Σ_{t'=0}^{T−1} r_{t'} Σ_{t=0}^{t'} ∇_θ log π(a_t | s_t, θ)]
             = E[Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) Σ_{t'=t}^{T−1} r_{t'}]
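
A minimal PyTorch sketch of this reward-to-go estimator, assuming `policy` maps an observation tensor to a torch.distributions object (as in the earlier policy sketch):

```python
import torch

def reward_to_go(rewards):
    """Compute sum_{t' >= t} r_{t'} for each t (the inner sum above)."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return list(reversed(out))

def pg_loss(policy, obs, acts, rewards):
    """Surrogate whose gradient is the mean over t of grad log pi(a_t|s_t) * sum_{t'>=t} r_{t'}."""
    dist = policy(torch.as_tensor(obs, dtype=torch.float32))
    logp = dist.log_prob(torch.as_tensor(acts))
    rtg = torch.as_tensor(reward_to_go(rewards), dtype=torch.float32)
    return -(logp * rtg).mean()   # minimize the negative, i.e. gradient ascent on E[R]
```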

62 Policy Gradient: Introduce Baseline
- Further reduce variance by introducing a baseline b(s):
    ∇_θ E_τ[R] = E_τ[Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) (Σ_{t'=t}^{T−1} r_{t'} − b(s_t))]
- For any choice of b, the gradient estimator is unbiased
- A near-optimal choice is the expected return,
    b(s_t) ≈ E[r_t + r_{t+1} + ... + r_{T−1}]
- Interpretation: increase the log-probability of action a_t proportionally to how much the return Σ_{t'=t}^{T−1} r_{t'} is better than expected

65 Discounts for Variance Reduction
- Introduce a discount factor γ, which ignores delayed effects between actions and rewards:
    ∇_θ E_τ[R] ≈ E_τ[Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) (Σ_{t'=t}^{T−1} γ^{t'−t} r_{t'} − b(s_t))]
- Now we want
    b(s_t) ≈ E[r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^{T−1−t} r_{T−1}]
- Write the gradient estimator more generally as
    ∇_θ E_τ[R] ≈ E_τ[Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) Â_t]
  where Â_t is the advantage estimate

66 Vanilla Policy Gradient Algorithm
Initialize policy parameter θ, baseline b
for iteration = 1, 2, ... do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute the return R̂_t = Σ_{t'=t}^{T−1} γ^{t'−t} r_{t'} and the advantage estimate Â_t = R̂_t − b(s_t)
    Re-fit the baseline by minimizing ‖b(s_t) − R̂_t‖², summed over all trajectories and timesteps
    Update the policy using a policy gradient estimate ĝ, which is a sum of terms ∇_θ log π(a_t | s_t, θ) Â_t
end for
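
A sketch of the full loop, under illustrative assumptions: an old-style Gym environment (reset() returns obs, step() returns (obs, reward, done, info)), discrete actions, a `policy` returning a torch distribution, and a `baseline` value network:

```python
import numpy as np
import torch

def vanilla_pg(env, policy, baseline, pi_opt, vf_opt, gamma=0.99,
               n_iters=100, episodes_per_iter=10):
    """Sketch of the vanilla policy gradient loop above."""
    for _ in range(n_iters):
        obs_buf, act_buf, ret_buf = [], [], []
        for _ in range(episodes_per_iter):
            obs, done, traj = env.reset(), False, []
            while not done:
                with torch.no_grad():
                    a = policy(torch.as_tensor(obs, dtype=torch.float32)).sample()
                next_obs, r, done, _ = env.step(int(a.item()))
                traj.append((obs, int(a.item()), r))
                obs = next_obs
            running = 0.0                       # discounted return-to-go R_t
            for o, a, r in reversed(traj):
                running = r + gamma * running
                obs_buf.append(o); act_buf.append(a); ret_buf.append(running)
        obs_t = torch.as_tensor(np.array(obs_buf), dtype=torch.float32)
        act_t = torch.as_tensor(act_buf)
        ret_t = torch.as_tensor(ret_buf, dtype=torch.float32)
        adv_t = ret_t - baseline(obs_t).squeeze(-1).detach()   # A_t = R_t - b(s_t)
        # Policy step: ascend sum_t grad log pi(a_t|s_t) * A_t
        logp = policy(obs_t).log_prob(act_t)
        pi_opt.zero_grad(); (-(logp * adv_t).mean()).backward(); pi_opt.step()
        # Re-fit the baseline by regressing on the empirical returns
        vf_loss = (baseline(obs_t).squeeze(-1) - ret_t).pow(2).mean()
        vf_opt.zero_grad(); vf_loss.backward(); vf_opt.step()
```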

67 Advantage Actor-Critic
Use a neural network that represents both the policy π_θ and the value function V_θ (approximating V^{π_θ}).
Pseudocode:
for iteration = 1, 2, ... do
    Agent acts for T timesteps (e.g., T = 20)
    For each timestep t, compute
        R̂_t = r_t + γ r_{t+1} + ... + γ^{T−1−t} r_{T−1} + γ^{T−t} V_θ(s_T)
        Â_t = R̂_t − V_θ(s_t)
    (R̂_t is the target value in a regression problem; Â_t is the estimated advantage)
    Compute the gradient g = ∇_θ Σ_{t=1}^{T} [log π_θ(a_t | s_t) Â_t − c (V_θ(s_t) − R̂_t)²]
    g is plugged into a stochastic gradient ascent algorithm, e.g., Adam
end for
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. Asynchronous methods for deep reinforcement learning. (2016)
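
A minimal sketch of the combined actor-critic loss for one rollout; `ac_net` is an assumed module returning (action distribution, value estimate), and shapes are illustrative:

```python
import torch

def a2c_loss(ac_net, obs, acts, rewards, last_obs, gamma=0.99, vf_coef=0.5):
    """Advantage actor-critic loss for one T-step rollout.
    ac_net(obs) is assumed to return (action_distribution, value_estimate)."""
    obs = torch.as_tensor(obs, dtype=torch.float32)
    dist, values = ac_net(obs)
    with torch.no_grad():
        _, last_v = ac_net(torch.as_tensor(last_obs, dtype=torch.float32))
    # n-step returns bootstrapped with V(s_T); use 0 instead if s_T is terminal
    returns, running = [], last_v.item()
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns = torch.as_tensor(list(reversed(returns)), dtype=torch.float32)
    adv = returns - values.squeeze(-1).detach()
    logp = dist.log_prob(torch.as_tensor(acts))
    pi_loss = -(logp * adv).mean()                        # ascend log pi * A
    vf_loss = (values.squeeze(-1) - returns).pow(2).mean()  # regress V towards R_hat
    return pi_loss + vf_coef * vf_loss
```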

71 Trust Region Policy Optimization
- Motivation: make policy gradients more robust and sample-efficient
- Unlike in supervised learning, the policy affects the distribution of its inputs, so a large bad update can be disastrous
- Makes use of a surrogate objective that estimates the performance of the policy around the policy π_old used for sampling:
    L_{π_old}(π) = (1/N) Σ_{i=1}^N [π(a_i | s_i) / π_old(a_i | s_i)] Â_i        (1)
  Differentiating this objective gives the policy gradient
- L_{π_old}(π) is only accurate when the state distribution of π is close to that of π_old, so it makes sense to constrain or penalize the distance D_KL[π_old ‖ π]

72 Trust Region Policy Optimization
Pseudocode:
for iteration = 1, 2, ... do
    Run the policy for T timesteps or N trajectories
    Estimate the advantage function at all timesteps
    Solve
        maximize_θ  Σ_{n=1}^N [π_θ(a_n | s_n) / π_{θ_old}(a_n | s_n)] Â_n
        subject to  KL_{π_{θ_old}}(π_θ) ≤ δ
end for
- The constrained optimization problem can be solved efficiently using conjugate gradient
- Closely related to natural policy gradients (Kakade, 2002), natural actor-critic (Peters and Schaal, 2005), REPS (Peters et al., 2010)

73 Proximal Policy Optimization
Use a penalty instead of a constraint:
    maximize_θ  Σ_{n=1}^N [π_θ(a_n | s_n) / π_{θ_old}(a_n | s_n)] Â_n − β · KL_{π_{θ_old}}(π_θ)
Pseudocode:
for iteration = 1, 2, ... do
    Run the policy for T timesteps or N trajectories
    Estimate the advantage function at all timesteps
    Do SGD on the above objective for some number of epochs
    If the KL is too high, increase β; if the KL is too low, decrease β
end for
Roughly the same performance as TRPO, but only first-order optimization.
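
A minimal sketch of one PPO update with the adaptive KL penalty; the β-adaptation thresholds and the sample-based KL estimate are illustrative choices, not prescribed by the slides:

```python
import torch

def ppo_penalty_update(policy, optimizer, obs, acts, advantages, old_logp,
                       beta, kl_target=0.01, epochs=10):
    """One PPO update: maximize ratio * A - beta * KL, then adapt beta."""
    obs = torch.as_tensor(obs, dtype=torch.float32)
    acts = torch.as_tensor(acts)
    adv = torch.as_tensor(advantages, dtype=torch.float32)
    old_logp = torch.as_tensor(old_logp, dtype=torch.float32)
    for _ in range(epochs):
        dist = policy(obs)
        logp = dist.log_prob(acts)
        ratio = torch.exp(logp - old_logp)          # pi_theta / pi_theta_old
        approx_kl = (old_logp - logp).mean()        # sample estimate of KL[pi_old || pi_theta]
        loss = -(ratio * adv).mean() + beta * approx_kl
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    # Adapt the penalty coefficient based on the final KL
    with torch.no_grad():
        kl = (old_logp - policy(obs).log_prob(acts)).mean().item()
    if kl > 1.5 * kl_target:
        beta *= 2.0
    elif kl < kl_target / 1.5:
        beta /= 2.0
    return beta
```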

74 Variance Reduction for Policy Gradients

79 Reward Shaping
- Reward shaping: δ(s, a, s′) = r(s, a, s′) + γΦ(s′) − Φ(s) for an arbitrary potential Φ
- Theorem: δ admits the same optimal policies as r.¹
- Proof sketch: suppose Q* satisfies the Bellman equation (TQ = Q). If we transform r → δ, the policy's value function satisfies Q̃(s, a) = Q*(s, a) − Φ(s).
  Q* satisfies the Bellman equation ⇒ Q̃ also satisfies the Bellman equation.

¹ A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. ICML 1999.

81 Reward Shaping
Theorem: δ admits the same optimal policies as r. (Ng, Harada, and Russell, ICML 1999)
Alternative proof: the advantage function is invariant. Look at the effect on V^π and Q^π:
    E[δ_0 + γδ_1 + γ²δ_2 + ...]
      = E[(r_0 + γΦ(s_1) − Φ(s_0)) + γ(r_1 + γΦ(s_2) − Φ(s_1)) + γ²(r_2 + γΦ(s_3) − Φ(s_2)) + ...]
      = E[r_0 + γr_1 + γ²r_2 + ...] − Φ(s_0)
Thus
    Ṽ^π(s) = V^π(s) − Φ(s)
    Q̃^π(s, a) = Q^π(s, a) − Φ(s)
    Ã^π(s, a) = A^π(s, a)
Since the advantage is unchanged, a policy with A^π(s, π(s)) = 0 at all states (i.e., an optimal policy) remains optimal.
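
The telescoping step above can be checked numerically for a finite horizon T, where the boundary term γ^T Φ(s_T) appears explicitly; Φ here is an arbitrary function of the state:

```python
import numpy as np

def shaped_rewards(rewards, states, potential, gamma):
    """delta_t = r_t + gamma * Phi(s_{t+1}) - Phi(s_t); states has length len(rewards) + 1."""
    phi = np.array([potential(s) for s in states])
    return np.array(rewards) + gamma * phi[1:] - phi[:-1]

def discounted_return(rs, gamma):
    return sum(r * gamma**t for t, r in enumerate(rs))

# Check the identity sum_t gamma^t delta_t = sum_t gamma^t r_t + gamma^T Phi(s_T) - Phi(s_0)
rng = np.random.default_rng(0)
states = rng.standard_normal((6, 3))          # 6 states, so T = 5 transitions
rewards = rng.standard_normal(5)
potential = lambda s: float(np.sum(s))        # arbitrary potential Phi
gamma = 0.9
lhs = discounted_return(shaped_rewards(rewards, states, potential, gamma), gamma)
rhs = (discounted_return(rewards, gamma)
       + gamma**5 * potential(states[-1]) - potential(states[0]))
assert np.isclose(lhs, rhs)
```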

83 Reward Shaping and Problem Difficulty
- Shape with Φ = V*: the problem is solved in one step of value iteration
- Shaping leaves the policy gradient invariant (and just adds a baseline to the estimator):
    E[∇_θ log π_θ(a_0 | s_0) ((r_0 + γΦ(s_1) − Φ(s_0)) + γ(r_1 + γΦ(s_2) − Φ(s_1)) + γ²(r_2 + γΦ(s_3) − Φ(s_2)) + ...)]
      = E[∇_θ log π_θ(a_0 | s_0) (r_0 + γr_1 + γ²r_2 + ... − Φ(s_0))]
      = E[∇_θ log π_θ(a_0 | s_0) (r_0 + γr_1 + γ²r_2 + ...)]

84 Reward Shaping and Policy Gradients
First note the connection between the shaped reward and the advantage function:
    E_{s_1}[r_0 + γV^π(s_1) − V^π(s_0) | s_0 = s, a_0 = a] = A^π(s, a)
Now, considering the policy gradient and ignoring all but the first shaped reward (i.e., pretend γ = 0 after shaping), we get
    E[Σ_t ∇_θ log π_θ(a_t | s_t) δ_t]
      = E[Σ_t ∇_θ log π_θ(a_t | s_t) (r_t + γV^π(s_{t+1}) − V^π(s_t))]
      = E[Σ_t ∇_θ log π_θ(a_t | s_t) A^π(s_t, a_t)]

86 Reward Shaping and Policy Gradients
- Compromise: use a more aggressive discount γλ, with λ ∈ (0, 1). This is called generalized advantage estimation:
    Σ_t ∇_θ log π_θ(a_t | s_t) Σ_{k=0}^{∞} (γλ)^k δ_{t+k}
- Or alternatively, use a hard cutoff as in A3C:
    Σ_t ∇_θ log π_θ(a_t | s_t) Σ_{k=0}^{n−1} γ^k δ_{t+k}
      = Σ_t ∇_θ log π_θ(a_t | s_t) (Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n Φ(s_{t+n}) − Φ(s_t))
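
A minimal NumPy sketch of the generalized advantage estimator above, assuming value estimates V(s_0), ..., V(s_T) have already been computed:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation: A_t = sum_k (gamma*lam)^k delta_{t+k},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` has length len(rewards) + 1 (it includes V(s_T), zero if terminal)."""
    rewards, values = np.asarray(rewards, float), np.asarray(values, float)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```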

89 Reward Shaping Summary
- The reward shaping transformation leaves the policy gradient and the optimal policy invariant
- Shaping with Φ ≈ V^π makes the consequences of actions more immediate
- Shaping, and then ignoring all but the first term, gives the policy gradient

91 Aside: Reward Shaping is Crucial in Practice
- I. Mordatch, E. Todorov, and Z. Popović. Discovery of complex behaviors through contact-invariant optimization. ACM Transactions on Graphics 31.4 (2012), p. 43
- Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. IROS 2012

92 Choosing Parameters γ, λ
Performance as γ, λ are varied.
[Figure: 3D Biped cost vs. number of policy iterations for settings γ ∈ {0.96, 0.98, 0.99, 0.995, 1} and λ ∈ {0.92, 0.96, 0.98, 0.99, 1.0}, plus a run with γ = 1 and no value function]
(Generalized Advantage Estimation for Policy Gradients, Schulman et al., ICLR 2016)

93 Pathwise Derivative Methods

96 Deriving the Policy Gradient, Reparameterized
[Diagram: episodic MDP as a stochastic computation graph: θ feeds the actions a_1, ..., a_T; states s_1, ..., s_T; fixed noise z_1, ..., z_T; total reward R_T]
- Want to compute ∇_θ E[R_T]. We'll use ∇_θ log π(a_t | s_t; θ)
- Reparameterize: a_t = π(s_t, z_t; θ), where z_t is noise from a fixed distribution
- Only works if P(s_2 | s_1, a_1) is known

97 Using a Q-function
[Same computation graph as before, with noise z_1, ..., z_T]
    d/dθ E[R_T] = E[Σ_{t=1}^T (dR_T/da_t)(da_t/dθ)]
                = E[Σ_{t=1}^T (d/da_t) E[R_T | a_t] (da_t/dθ)]
                = E[Σ_{t=1}^T (dQ(s_t, a_t)/da_t)(da_t/dθ)]
                = E[Σ_{t=1}^T (d/dθ) Q(s_t, π(s_t, z_t; θ))]

99 SVG(0) Algorithm
Learn Q_φ to approximate Q^{π,γ}, and use it to compute gradient estimates.
Pseudocode:
for iteration = 1, 2, ... do
    Execute policy π_θ to collect T timesteps of data
    Update π_θ using g ∝ ∇_θ Σ_{t=1}^T Q(s_t, π(s_t, z_t; θ))
    Update Q_φ using g ∝ ∇_φ Σ_{t=1}^T (Q_φ(s_t, a_t) − Q̂_t)², e.g. with TD(λ)
end for
N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. Learning Continuous Control Policies by Stochastic Value Gradients. arXiv preprint (2015)
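
A sketch of the actor objective: differentiate the learned Q through a reparameterized Gaussian action. The `q_net(obs, action)` signature and the fixed noise scale are assumptions:

```python
import torch

def svg0_policy_loss(q_net, policy_mean_net, obs, noise_std=0.1):
    """SVG(0)-style actor objective: differentiate Q(s, a) through the
    reparameterized action a = mu_theta(s) + sigma * z, with z fixed noise."""
    obs = torch.as_tensor(obs, dtype=torch.float32)
    mean = policy_mean_net(obs)
    z = torch.randn_like(mean)
    actions = mean + noise_std * z          # a = pi(s, z; theta)
    return -q_net(obs, actions).mean()      # ascend Q, so minimize -Q
```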

104 SVG(1) Algorithm
Instead of learning Q, we learn:
- A state-value function V ≈ V^{π,γ}
- A dynamics model f, approximating s_{t+1} = f(s_t, a_t) + ζ_t
Given a transition (s_t, a_t, s_{t+1}), infer ζ_t = s_{t+1} − f(s_t, a_t).
Then Q(s_t, a_t) = E[r_t + γV(s_{t+1})] = E[r_t + γV(f(s_t, a_t) + ζ_t)], with a_t = π(s_t, θ, ζ_t)

107 SVG(∞) Algorithm
- Just learn the dynamics model f
- Given a whole trajectory, infer all noise variables
- Freeze all policy and dynamics noise, and differentiate through the entire deterministic computation graph

109 SVG Results
- Applied to 2D robotics tasks
- Overall: the different gradient estimators behave similarly
N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. Learning Continuous Control Policies by Stochastic Value Gradients. arXiv preprint (2015)

115 Deterministic Policy Gradient
- For Gaussian actions, the variance of the score function policy gradient estimator goes to infinity as the action variance goes to zero
  Intuition: compare with finite-difference gradient estimators
- But the SVG(0) gradient is fine as σ → 0:
    Σ_t ∇_θ Q(s_t, π(s_t, θ, ζ_t))
- Problem: there's no exploration
- Solution: add noise to the policy, but estimate Q with TD(0), so it's valid off-policy
- The policy gradient is a little biased (even with Q = Q^π), but only because the state distribution is off; it gets the right gradient at every state
D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, et al. Deterministic Policy Gradient Algorithms. ICML 2014

116 Deep Deterministic Policy Gradient
- Incorporate the replay buffer and target network ideas from DQN for increased stability
- Use lagged (Polyak-averaged) versions Q_{φ′} and π_{θ′} for fitting Q_φ (towards Q^{π,γ}) with TD(0):
    Q̂_t = r_t + γ Q_{φ′}(s_{t+1}, π(s_{t+1}; θ′))
Pseudocode:
for iteration = 1, 2, ... do
    Act for several timesteps, add data to the replay buffer
    Sample a minibatch
    Update π_θ using g ∝ ∇_θ Σ_{t=1}^T Q(s_t, π(s_t, z_t; θ))
    Update Q_φ using g ∝ ∇_φ Σ_{t=1}^T (Q_φ(s_t, a_t) − Q̂_t)²
end for
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. Continuous control with deep reinforcement learning. arXiv preprint (2015)
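
A minimal PyTorch sketch of one DDPG update on a sampled minibatch; network and optimizer names, and the Polyak coefficient, are illustrative:

```python
import torch

def ddpg_update(q_net, q_target, pi_net, pi_target, q_opt, pi_opt,
                batch, gamma=0.99, polyak=0.995):
    """One DDPG update: TD(0) critic regression against the lagged target
    networks, pathwise actor update, then Polyak averaging of the targets."""
    obs, acts, rews, next_obs, done = batch   # tensors sampled from the replay buffer; done is a 0/1 float mask
    # Critic target: Q_hat = r + gamma * Q'(s', pi'(s'))
    with torch.no_grad():
        q_hat = rews + gamma * (1 - done) * q_target(next_obs, pi_target(next_obs))
    q_loss = (q_net(obs, acts) - q_hat).pow(2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()
    # Actor: ascend Q(s, pi(s)) by differentiating through the action
    pi_loss = -q_net(obs, pi_net(obs)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    # Polyak-average the target networks
    with torch.no_grad():
        for p, p_t in zip(q_net.parameters(), q_target.parameters()):
            p_t.mul_(polyak).add_((1 - polyak) * p)
        for p, p_t in zip(pi_net.parameters(), pi_target.parameters()):
            p_t.mul_(polyak).add_((1 - polyak) * p)
```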

117 DDPG Results
Applied to 2D and 3D robotics tasks and driving with pixel input.
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. Continuous control with deep reinforcement learning. arXiv preprint (2015)

125 Policy Gradient Methods: Comparison
Two kinds of policy gradient estimator:
- REINFORCE / score function estimator: ∇ log π(a | s) · Â
  - Learn Q or V for variance reduction, to estimate Â
- Pathwise derivative estimators (differentiate with respect to the action)
  - SVG(0) / DPG: d/da Q(s, a) (learn Q)
  - SVG(1): d/da (r + γV(s′)) (learn f, V)
  - SVG(∞): d/da_t (r_t + γr_{t+1} + γ²r_{t+2} + ...) (learn f)
- Pathwise derivative methods are more sample-efficient when they work (maybe), but work less generally due to high bias

126 Thanks Questions?


Reinforcement Learning with Function Approximation. Joseph Christian G. Noel

Reinforcement Learning with Function Approximation. Joseph Christian G. Noel Reinforcement Learning with Function Approximation Joseph Christian G. Noel November 2011 Abstract Reinforcement learning (RL) is a key problem in the field of Artificial Intelligence. The main goal is

More information

arxiv: v1 [cs.lg] 30 Oct 2015

arxiv: v1 [cs.lg] 30 Oct 2015 Learning Continuous Control Policies by Stochastic Value Gradients arxiv:1510.09142v1 cs.lg 30 Oct 2015 Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, Tom Erez Google DeepMind

More information

Decoupling deep learning and reinforcement learning for stable and efficient deep policy gradient algorithms

Decoupling deep learning and reinforcement learning for stable and efficient deep policy gradient algorithms Decoupling deep learning and reinforcement learning for stable and efficient deep policy gradient algorithms Alfredo Vicente Clemente Master of Science in Computer Science Submission date: August 2017

More information

Nonparametric Inverse Reinforcement Learning and Approximate Optimal Control with Temporal Logic Tasks. Siddharthan Rajasekaran

Nonparametric Inverse Reinforcement Learning and Approximate Optimal Control with Temporal Logic Tasks. Siddharthan Rajasekaran Nonparametric Inverse Reinforcement Learning and Approximate Optimal Control with Temporal Logic Tasks by Siddharthan Rajasekaran A dissertation submitted in partial satisfaction of the requirements for

More information

Reinforcement Learning II. George Konidaris

Reinforcement Learning II. George Konidaris Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2017 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes

More information

Variational Deep Q Network

Variational Deep Q Network Variational Deep Q Network Yunhao Tang Department of IEOR Columbia University yt2541@columbia.edu Alp Kucukelbir Department of Computer Science Columbia University alp@cs.columbia.edu Abstract We propose

More information

Reinforcement Learning: the basics

Reinforcement Learning: the basics Reinforcement Learning: the basics Olivier Sigaud Université Pierre et Marie Curie, PARIS 6 http://people.isir.upmc.fr/sigaud August 6, 2012 1 / 46 Introduction Action selection/planning Learning by trial-and-error

More information

ELEC-E8119 Robotics: Manipulation, Decision Making and Learning Policy gradient approaches. Ville Kyrki

ELEC-E8119 Robotics: Manipulation, Decision Making and Learning Policy gradient approaches. Ville Kyrki ELEC-E8119 Robotics: Manipulation, Decision Making and Learning Policy gradient approaches Ville Kyrki 9.10.2017 Today Direct policy learning via policy gradient. Learning goals Understand basis and limitations

More information

Human-level control through deep reinforcement. Liia Butler

Human-level control through deep reinforcement. Liia Butler Humanlevel control through deep reinforcement Liia Butler But first... A quote "The question of whether machines can think... is about as relevant as the question of whether submarines can swim" Edsger

More information

Reinforcement Learning II. George Konidaris

Reinforcement Learning II. George Konidaris Reinforcement Learning II George Konidaris gdk@cs.brown.edu Fall 2018 Reinforcement Learning π : S A max R = t=0 t r t MDPs Agent interacts with an environment At each time t: Receives sensor signal Executes

More information

Trust Region Evolution Strategies

Trust Region Evolution Strategies Trust Region Evolution Strategies Guoqing Liu, Li Zhao, Feidiao Yang, Jiang Bian, Tao Qin, Nenghai Yu, Tie-Yan Liu University of Science and Technology of China Microsoft Research Institute of Computing

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Cyber Rodent Project Some slides from: David Silver, Radford Neal CSC411: Machine Learning and Data Mining, Winter 2017 Michael Guerzhoy 1 Reinforcement Learning Supervised learning:

More information

Approximate Q-Learning. Dan Weld / University of Washington

Approximate Q-Learning. Dan Weld / University of Washington Approximate Q-Learning Dan Weld / University of Washington [Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley materials available at http://ai.berkeley.edu.] Q Learning

More information

Tutorial on Policy Gradient Methods. Jan Peters

Tutorial on Policy Gradient Methods. Jan Peters Tutorial on Policy Gradient Methods Jan Peters Outline 1. Reinforcement Learning 2. Finite Difference vs Likelihood-Ratio Policy Gradients 3. Likelihood-Ratio Policy Gradients 4. Conclusion General Setup

More information