Policy Gradient Methods. February 13, 2017

Size: px

Start display at page:

Download "Policy Gradient Methods. February 13, 2017"

Shonda Jefferson
5 years ago
Views:

1 Policy Gradient Methods February 13, 2017

2 Policy Optimization Problems maximize E π [expression] π Fixed-horizon episodic: T 1 Average-cost: lim T 1 T r t T 1 r t Infinite-horizon discounted: γt r t Variable-length undiscounted: T terminal 1 r t Infinite-horizon undiscounted: r t

3 Episodic Setting s 0 µ(s 0 ) a 0 π(a 0 s 0 ) s 1, r 0 P(s 1, r 0 s 0, a 0 ) a 1 π(a 1 s 1 ) s 2, r 1 P(s 2, r 1 s 1, a 1 )... a T 1 π(a T 1 s T 1 ) s T, r T 1 P(s T s T 1, a T 1 ) Objective: maximize η(π), where η(π) = E[r 0 + r r T 1 π]

4 Episodic Setting π Agent a0 a1 at-1 s0 s1 s2 st μ0 r0 r1 rt-1 P Environment Objective: maximize η(π), where η(π) = E[r 0 + r r T 1 π]

5 Parameterized Policies A family of policies indexed by parameter vector θ R d Deterministic: a = π(s, θ) Stochastic: π(a s, θ) Analogous to classification or regression with input s, output a. Discrete action space: network outputs vector of probabilities Continuous action space: network outputs mean and diagonal covariance of Gaussian

6 Policy Gradient Methods: Overview Problem: maximize E[R π θ ] Intuitions: collect a bunch of trajectories, and Make the good trajectories more probable 1 2. Make the good actions more probable 3. Push the actions towards good actions (DPG 2, SVG 3 ) 1 R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning (1992); R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. NIPS. MIT Press, D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, et al. Deterministic Policy Gradient Algorithms. ICML N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. Learning Continuous Control Policies by Stochastic Value Gradients. arxiv preprint arxiv: (2015).

7 Score Function Gradient Estimator Consider an expectation E x p(x θ) [f (x)]. Want to compute gradient wrt θ θ E x [f (x)] = θ dx p(x θ)f (x) = dx θ p(x θ)f (x) = dx p(x θ) θp(x θ) p(x θ) f (x) = dx p(x θ) θ log p(x θ)f (x) = E x [f (x) θ log p(x θ)]. Last expression gives us an unbiased gradient estimator. Just sample x i p(x θ), and compute ĝ i = f (x i ) θ log p(x i θ). Need to be able to compute and differentiate density p(x θ) wrt θ

8 Derivation via Importance Sampling Alternative Derivation Using Importance Sampling 4 [ ] p(x θ) E x θ [f (x)] = E x θold p(x θ old ) f (x) [ ] θ p(x θ) θ E x θ [f (x)] = E x θold p(x θ old ) f (x) θ E x θ [f (x)] [ θ p(x θ) θ=θold θ=θold = E x θold f (x) p(x θ old ) = E x θold [ θ log p(x θ) ] θ=θold f (x) ] 4 T. Jie and P. Abbeel. On a connection between importance sampling and the likelihood ratio policy gradient. Advances in Neural Information Processing Systems. 2010, pp

9 Score Function Gradient Estimator: Intuition ĝ i = f (x i ) θ log p(x i θ) Let s say that f (x) measures how good the sample x is. Moving in the direction ĝ i pushes up the logprob of the sample, in proportion to how good it is Valid even if f (x) is discontinuous, and unknown, or sample space (containing x) is a discrete set

10 Score Function Gradient Estimator: Intuition ĝ i = f (x i ) θ log p(x i θ)

11 Score Function Gradient Estimator: Intuition ĝ i = f (x i ) θ log p(x i θ)

12 Score Function Gradient Estimator for Policies Now random variable x is a whole trajectory τ = (s 0, a 0, r 0, s 1, a 1, r 1,..., s T 1, a T 1, r T 1, s T ) Just need to write out p(τ θ): θ E τ [R(τ)] = E τ [ θ log p(τ θ)r(τ)] T 1 p(τ θ) = µ(s 0 ) [π(a t s t, θ)p(s t+1, r t s t, a t)] T 1 log p(τ θ) = log µ(s 0 ) + [log π(a t s t, θ) + log P(s t+1, r t s t, a t)] T 1 θ log p(τ θ) = θ log π(a t s t, θ) [ ] T 1 θ E τ [R] = E τ R θ log π(a t s t, θ) Interpretation: using good trajectories (high R) as supervised examples in classification / regression

13 Policy Gradient: Use Temporal Structure Previous slide: [( T 1 )( T 1 )] θ E τ [R] = E τ r t θ log π(a t s t, θ) We can repeat the same argument to derive the gradient estimator for a single reward term r t. t θ E [r t ] = E r t θ log π(a t s t, θ) Sum this formula over t, we obtain T 1 θ E [R] = E = E [ T 1 r t t θ log π(a t s t, θ) T 1 θ log π(a t s t, θ) t =t r t ]

14 Policy Gradient: Introduce Baseline Further reduce variance by introducing a baseline b(s) [ T 1 ( T 1 )] θ E τ [R] = E τ θ log π(a t s t, θ) r t b(s t ) For any choice of b, gradient estimator is unbiased. Near optimal choice is expected return, b(s t ) E [r t + r t+1 + r t r T 1 ] Interpretation: increase logprob of action a t proportionally to how much returns T 1 t =t r t are better than expected t =t

15 Baseline Derivation E τ [ θ log π(a t s t, θ)b(s t )] ] = E s0:t,a 0:(t 1) [E s(t+1):t,a t:(t 1) [ θ log π(a t s t, θ)b(s t )] (break up expectation) ] = E s0:t,a 0:(t 1) [b(s t )E s(t+1):t,a t:(t 1) [ θ log π(a t s t, θ)] (pull baseline term out) = E s0:t,a 0:(t 1) [b(s t )E at [ θ log π(a t s t, θ)]] (remove irrelevant vars.) = E s0:t,a 0:(t 1) [b(s t ) 0] Last equality because 0 = θ E at π( s t) [1] = E at π( s t) [ θ log π θ (a t s t )]

16 Discounts for Variance Reduction Introduce discount factor γ, which ignores delayed effects between actions and rewards [ T 1 ( T 1 )] θ E τ [R] E τ θ log π(a t s t, θ) γ t t r t b(s t ) t =t Now, we want b(s t ) E [ r t + γr t+1 + γ 2 r t γ T 1 t r T 1 ]

17 Vanilla Policy Gradient Algorithm Initialize policy parameter θ, baseline b for iteration=1, 2,... do Collect a set of trajectories by executing the current policy At each timestep in each trajectory, compute the return R t = T 1 t =t t γt r t, and the advantage estimate Ât = R t b(s t ). Re-fit the baseline, by minimizing b(s t ) R t 2, summed over all trajectories and timesteps. Update the policy, using a policy gradient estimate ĝ, which is a sum of terms θ log π(a t s t, θ)â t. (Plug ĝ into SGD or ADAM) end for

18 Practical Implementation with Autodiff Usual formula t θ log π(a t s t ; θ)ât is inefficient want to batch data Define surrogate function using data from currecnt batch L(θ) = t log π(a t s t ; θ)â t Then policy gradient estimator ĝ = θ L(θ) Can also include value function fit error L(θ) = (log π(a t s t ; θ)ât V (s t ) ˆR ) t 2 t

19 Value Functions [ Q π,γ (s, a) = E π r0 + γr 1 + γ 2 r s 0 = s, a 0 = a ] Called Q-function or state-action-value function [ V π,γ (s) = E π r0 + γr 1 + γ 2 r s 0 = s ] = E a π [Q π,γ (s, a)] Called state-value function A π,γ (s, a) = Q π,γ (s, a) V π,γ (s) Called advantage function

20 Policy Gradient Formulas with Value Functions Recall: [ T 1 ( T 1 )] θ E τ [R] = E τ θ log π(a t s t, θ) r t b(s t) [ T 1 E τ θ log π(a t s t, θ) t =t ( T 1 t =t )] γ t t r t b(s t) Using value functions [ T 1 ] θ E τ [R] = E τ θ log π(a t s t, θ)q π (s t, a t) [ T 1 ] = E τ θ log π(a t s t, θ)a π (s t, a t) [ T 1 ] E τ θ log π(a t s t, θ)a π,γ (s t, a t) Can plug in advantage estimator Â for A π,γ Advantage estimators have the form Return V (s)

21 Value Functions in the Future Baseline accounts for and removes the effect of past actions Can also use the value function to estimate future rewards ˆR (1) t = r t + γv (s t+1 ) cut off at one timestep ˆR (2) t = r t + γr t+1 + γ 2 V (s t+2 ) cut off at two timesteps... ˆR ( ) t = r t + γr t+1 + γ 2 r t timesteps (no V )

22 Value Functions in the Future Subtracting out baselines, we get advantage estimators Â (1) t = r t + γv (s t+1 ) V (s t ) Â (2) t = r t + r t+1 + γ 2 V (s t+2 ) V (s t )... Â ( ) t = r t + γr t+1 + γ 2 r t V (s t ) Â (1) t has low variance but high bias, Â ( ) t has high variance but low bias. Using intermediate k (say, 20) gives an intermediate amount of bias and variance

23 Discounts: Connection to MPC MPC: maximize a Discounted policy gradient Q,T (s, a) maximize Q,γ (s, a) a E a π [Q π,γ (s, a) θ log π(a s; θ)] = 0 when a arg max Q π,γ (s, a)

24 Application: Robot Locomotion

25 Finite-Horizon Methods: Advantage Actor-Critic A2C / A3C uses this fixed-horizon advantage estimator. (NOTE: async is only for speed, doesn t improve performance) Pseudocode for iteration=1, 2,... do Agent acts for T timesteps (e.g., T = 20), For each timestep t, compute ˆR t = r t + γr t γ T t+1 r T 1 + γ T t V (s t ) Â t = ˆR t V (s t ) ˆR t is target value function, in regression problem Â t is estimated advantage function T Compute loss gradient g = θ t=1 [ log π θ (a t s t )Ât + c(v (s) ˆR ] t ) 2 g is plugged into a stochastic gradient descent variant, e.g., Adam. end for V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. Asynchronous methods for deep reinforcement learning. (2016)

26 A3C Video

27 A3C Results Beamrider DQN 1-step Q 1-step SARSA n-step Q A3C Breakout DQN 1-step Q 1-step SARSA n-step Q A3C Pong Q*bert DQN 1-step Q 1-step SARSA n-step Q A3C Space Invaders DQN 1-step Q 1-step SARSA n-step Q A3C Score Training time (hours) Score Training time (hours) Score 0 Score 10 DQN 1-step Q 1-step SARSA 20 n-step Q A3C Training time (hours) Score Training time (hours) Training time (hours)

28 Further Reading A nice intuitive explanation of policy gradients: R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning (1992); R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. NIPS. MIT Press, 2000 My thesis has a decent self-contained introduction to policy gradient methods: A3C paper: V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. Asynchronous methods for deep reinforcement learning. (2016)

Variance Reduction for Policy Gradient Methods. March 13, 2017

Variance Reduction for Policy Gradient Methods March 13, 2017 Reward Shaping Reward Shaping Reward Shaping Reward shaping: r(s, a, s ) = r(s, a, s ) + γφ(s ) Φ(s) for arbitrary potential Φ Theorem: r admits