Policy Gradient Methods February 13, 2017
Policy Optimization Problems

$\underset{\pi}{\text{maximize}}\ E_\pi[\text{expression}]$, where "expression" is one of:
- Fixed-horizon episodic: $\sum_{t=0}^{T-1} r_t$
- Average-cost: $\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} r_t$
- Infinite-horizon discounted: $\sum_{t=0}^{\infty} \gamma^t r_t$
- Variable-length undiscounted: $\sum_{t=0}^{T_{\text{terminal}}-1} r_t$
- Infinite-horizon undiscounted: $\sum_{t=0}^{\infty} r_t$
Episodic Setting

$s_0 \sim \mu(s_0)$
$a_0 \sim \pi(a_0 \mid s_0)$
$s_1, r_0 \sim P(s_1, r_0 \mid s_0, a_0)$
$a_1 \sim \pi(a_1 \mid s_1)$
$s_2, r_1 \sim P(s_2, r_1 \mid s_1, a_1)$
...
$a_{T-1} \sim \pi(a_{T-1} \mid s_{T-1})$
$s_T, r_{T-1} \sim P(s_T, r_{T-1} \mid s_{T-1}, a_{T-1})$

Objective: maximize $\eta(\pi)$, where $\eta(\pi) = E[r_0 + r_1 + \cdots + r_{T-1} \mid \pi]$
Episodic Setting

[Diagram: agent-environment loop. The agent (policy $\pi$) emits actions $a_0, a_1, \ldots, a_{T-1}$; the environment (initial state distribution $\mu_0$, transition distribution $P$) returns states $s_0, s_1, \ldots, s_T$ and rewards $r_0, r_1, \ldots, r_{T-1}$.]

Objective: maximize $\eta(\pi)$, where $\eta(\pi) = E[r_0 + r_1 + \cdots + r_{T-1} \mid \pi]$
Parameterized Policies

A family of policies indexed by parameter vector $\theta \in \mathbb{R}^d$
- Deterministic: $a = \pi(s, \theta)$
- Stochastic: $\pi(a \mid s, \theta)$

Analogous to classification or regression with input s, output a:
- Discrete action space: network outputs a vector of probabilities
- Continuous action space: network outputs the mean and diagonal covariance of a Gaussian
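To make this concrete, here is a minimal PyTorch sketch of both parameterizations; the class names and layer sizes are illustrative assumptions, not from the slides:

```python
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """Discrete actions: the network outputs a vector of probabilities (via logits)."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))
    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

class GaussianPolicy(nn.Module):
    """Continuous actions: the network outputs the mean; the diagonal
    covariance is a separate learned parameter."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                      nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))
    def forward(self, s):
        return torch.distributions.Normal(self.mean_net(s), self.log_std.exp())
```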
Policy Gradient Methods: Overview

Problem: maximize $E[R \mid \pi_\theta]$

Intuitions: collect a bunch of trajectories, and...
1. Make the good trajectories more probable¹
2. Make the good actions more probable
3. Push the actions towards good actions (DPG², SVG³)

¹ R. J. Williams. "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning (1992); R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. "Policy gradient methods for reinforcement learning with function approximation". NIPS. MIT Press, 2000.
² D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, et al. "Deterministic Policy Gradient Algorithms". ICML. 2014.
³ N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. "Learning Continuous Control Policies by Stochastic Value Gradients". arXiv preprint arXiv:1510.09142 (2015).
Score Function Gradient Estimator

Consider an expectation $E_{x \sim p(x \mid \theta)}[f(x)]$. We want to compute the gradient with respect to $\theta$:

$\nabla_\theta E_x[f(x)] = \nabla_\theta \int dx\; p(x \mid \theta) f(x)$
$= \int dx\; \nabla_\theta p(x \mid \theta) f(x)$
$= \int dx\; p(x \mid \theta) \frac{\nabla_\theta p(x \mid \theta)}{p(x \mid \theta)} f(x)$
$= \int dx\; p(x \mid \theta) \nabla_\theta \log p(x \mid \theta) f(x)$
$= E_x[f(x) \nabla_\theta \log p(x \mid \theta)]$

The last expression gives us an unbiased gradient estimator. Just sample $x_i \sim p(x \mid \theta)$ and compute $\hat{g}_i = f(x_i) \nabla_\theta \log p(x_i \mid \theta)$. We need to be able to compute and differentiate the density $p(x \mid \theta)$ with respect to $\theta$.
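As a sanity check, here is a minimal numpy sketch of the estimator for $x \sim N(\theta, 1)$ and $f(x) = x^2$, where the exact gradient is $2\theta$ (the function name and sample count are illustrative):

```python
import numpy as np

def score_function_grad(theta, f, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=theta, scale=1.0, size=n_samples)
    score = x - theta          # grad_theta log N(x; theta, 1) = (x - theta)
    return np.mean(f(x) * score)

theta = 1.5
print(score_function_grad(theta, f=lambda x: x**2))  # ~= 2*theta = 3.0
```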
Derivation via Importance Sampling

Alternative derivation using importance sampling⁴:

$E_{x \sim \theta}[f(x)] = E_{x \sim \theta_{\text{old}}}\left[\frac{p(x \mid \theta)}{p(x \mid \theta_{\text{old}})} f(x)\right]$
$\nabla_\theta E_{x \sim \theta}[f(x)] = E_{x \sim \theta_{\text{old}}}\left[\frac{\nabla_\theta p(x \mid \theta)}{p(x \mid \theta_{\text{old}})} f(x)\right]$
$\nabla_\theta E_{x \sim \theta}[f(x)]\Big|_{\theta = \theta_{\text{old}}} = E_{x \sim \theta_{\text{old}}}\left[\frac{\nabla_\theta p(x \mid \theta)\big|_{\theta = \theta_{\text{old}}}}{p(x \mid \theta_{\text{old}})} f(x)\right] = E_{x \sim \theta_{\text{old}}}\left[\nabla_\theta \log p(x \mid \theta)\big|_{\theta = \theta_{\text{old}}} f(x)\right]$

⁴ T. Jie and P. Abbeel. "On a connection between importance sampling and the likelihood ratio policy gradient". Advances in Neural Information Processing Systems. 2010, pp. 1000-1008.
Score Function Gradient Estimator: Intuition

$\hat{g}_i = f(x_i) \nabla_\theta \log p(x_i \mid \theta)$

Let's say that $f(x)$ measures how good the sample x is. Moving in the direction $\hat{g}_i$ pushes up the logprob of the sample, in proportion to how good it is. This is valid even if $f(x)$ is discontinuous or unknown, or the sample space (containing x) is a discrete set.
Score Function Gradient Estimator for Policies

Now the random variable x is a whole trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$:

$\nabla_\theta E_\tau[R(\tau)] = E_\tau[\nabla_\theta \log p(\tau \mid \theta) R(\tau)]$

Just need to write out $p(\tau \mid \theta)$:

$p(\tau \mid \theta) = \mu(s_0) \prod_{t=0}^{T-1} \left[\pi(a_t \mid s_t, \theta) P(s_{t+1}, r_t \mid s_t, a_t)\right]$
$\log p(\tau \mid \theta) = \log \mu(s_0) + \sum_{t=0}^{T-1} \left[\log \pi(a_t \mid s_t, \theta) + \log P(s_{t+1}, r_t \mid s_t, a_t)\right]$
$\nabla_\theta \log p(\tau \mid \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)$
$\nabla_\theta E_\tau[R] = E_\tau\left[R \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\right]$

Interpretation: using good trajectories (high R) as supervised examples in classification / regression.
Policy Gradient: Use Temporal Structure

Previous slide:

$\nabla_\theta E_\tau[R] = E_\tau\left[\left(\sum_{t=0}^{T-1} r_t\right)\left(\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\right)\right]$

We can repeat the same argument to derive the gradient estimator for a single reward term $r_{t'}$:

$\nabla_\theta E[r_{t'}] = E\left[r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\right]$

Summing this formula over $t'$, we obtain

$\nabla_\theta E[R] = E\left[\sum_{t'=0}^{T-1} r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\right] = E\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \sum_{t'=t}^{T-1} r_{t'}\right]$
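A small helper makes the reward-to-go weighting concrete; a numpy sketch (the function name is illustrative):

```python
import numpy as np

def rewards_to_go(rewards):
    """rtg[t] = sum of rewards[t:], the coefficient on grad log pi(a_t|s_t)."""
    return np.cumsum(rewards[::-1])[::-1]

print(rewards_to_go(np.array([1.0, 0.0, 2.0, 3.0])))  # [6. 5. 5. 3.]
```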
Policy Gradient: Introduce Baseline

Further reduce variance by introducing a baseline b(s):

$\nabla_\theta E_\tau[R] = E_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]$

For any choice of b, the gradient estimator is unbiased. A near-optimal choice is the expected return, $b(s_t) \approx E[r_t + r_{t+1} + r_{t+2} + \cdots + r_{T-1}]$.

Interpretation: increase the logprob of action $a_t$ proportionally to how much the return $\sum_{t'=t}^{T-1} r_{t'}$ is better than expected.
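Reusing the Gaussian example from the score-function slide, a quick numpy sketch shows that subtracting a near-optimal constant baseline leaves the estimate unbiased while shrinking its spread (the constant baseline is an illustrative simplification of $b(s_t)$):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
b = theta**2 + 1.0                      # ~= E[f(x)], a near-optimal constant
x = rng.normal(theta, 1.0, size=200_000)
f, score = x**2, (x - theta)

for name, g in [("no baseline", f * score), ("baseline", (f - b) * score)]:
    print(name, "mean:", round(g.mean(), 3), "std:", round(g.std(), 3))
# Both means estimate the true gradient 2*theta = 3.0;
# the baseline version has a noticeably smaller std.
```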
Baseline Derivation

$E_\tau[\nabla_\theta \log \pi(a_t \mid s_t, \theta) b(s_t)]$
$= E_{s_{0:t}, a_{0:(t-1)}}\left[E_{s_{(t+1):T},\, a_{t:(T-1)}}[\nabla_\theta \log \pi(a_t \mid s_t, \theta) b(s_t)]\right]$ (break up expectation)
$= E_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t)\, E_{s_{(t+1):T},\, a_{t:(T-1)}}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)]\right]$ (pull baseline term out)
$= E_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t)\, E_{a_t}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)]\right]$ (remove irrelevant variables)
$= E_{s_{0:t}, a_{0:(t-1)}}[b(s_t) \cdot 0]$

The last equality holds because $0 = \nabla_\theta E_{a_t \sim \pi(\cdot \mid s_t)}[1] = E_{a_t \sim \pi(\cdot \mid s_t)}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)]$.
Discounts for Variance Reduction

Introduce a discount factor $\gamma$, which ignores delayed effects between actions and rewards:

$\nabla_\theta E_\tau[R] \approx E_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\left(\sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'} - b(s_t)\right)\right]$

Now we want $b(s_t) \approx E\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-1-t} r_{T-1}\right]$.
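The discounted reward-to-go in this formula can be computed in one backward pass; a numpy sketch (names are illustrative):

```python
import numpy as np

def discounted_rewards_to_go(rewards, gamma=0.99):
    """out[t] = sum_{t'>=t} gamma**(t'-t) * rewards[t']."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(discounted_rewards_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5 1. 2.]
```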
Vanilla Policy Gradient Algorithm

Initialize policy parameter θ, baseline b
for iteration = 1, 2, ... do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute the return $R_t = \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'}$ and the advantage estimate $\hat{A}_t = R_t - b(s_t)$
    Re-fit the baseline by minimizing $\|b(s_t) - R_t\|^2$, summed over all trajectories and timesteps
    Update the policy, using a policy gradient estimate $\hat{g}$, which is a sum of terms $\nabla_\theta \log \pi(a_t \mid s_t, \theta) \hat{A}_t$ (plug $\hat{g}$ into SGD or Adam)
end for
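Here is a compact, runnable sketch of this loop on a toy one-state, two-action problem (reward 1 for action 1, 0 for action 0); the running-mean baseline and all names are illustrative choices, not prescribed by the algorithm box:

```python
import torch

torch.manual_seed(0)
logits = torch.zeros(2, requires_grad=True)   # policy parameters theta
baseline = 0.0                                # scalar baseline b
opt = torch.optim.Adam([logits], lr=0.1)

for iteration in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample((64,))              # collect one-step trajectories
    returns = actions.float()                 # R_t = 1 iff action 1 was taken
    adv = returns - baseline                  # A_hat_t = R_t - b
    loss = -(dist.log_prob(actions) * adv).mean()  # grad of -loss is g_hat
    opt.zero_grad()
    loss.backward()
    opt.step()
    baseline = 0.9 * baseline + 0.1 * returns.mean().item()  # re-fit b

print(torch.softmax(logits, dim=0))           # ~[0, 1]: prefers action 1
```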
Practical Implementation with Autodiff

The usual formula $\sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta) \hat{A}_t$ is inefficient; we want to batch data. Define a surrogate function using data from the current batch:

$L(\theta) = \sum_t \log \pi(a_t \mid s_t; \theta) \hat{A}_t$

Then the policy gradient estimator is $\hat{g} = \nabla_\theta L(\theta)$. Can also include the value function fit error:

$L(\theta) = \sum_t \left( \log \pi(a_t \mid s_t; \theta) \hat{A}_t - \|V(s_t) - \hat{R}_t\|^2 \right)$
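A PyTorch sketch of that surrogate, assuming tensors logp = $\log \pi(a_t \mid s_t; \theta)$, adv = $\hat{A}_t$ (held fixed), v = $V(s_t)$, and ret = $\hat{R}_t$; the weighting coefficient c is added for generality (the formula above uses c = 1):

```python
import torch

def surrogate_loss(logp, adv, v, ret, c=1.0):
    policy_term = (logp * adv.detach()).sum()  # autodiff of this yields g_hat
    value_term = ((v - ret) ** 2).sum()        # value-function fit error
    return -(policy_term - c * value_term)     # negate: optimizers minimize
```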
Value Functions

$Q^{\pi,\gamma}(s, a) = E_\pi\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, a_0 = a\right]$
Called the Q-function or state-action value function.

$V^{\pi,\gamma}(s) = E_\pi\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s\right] = E_{a \sim \pi}[Q^{\pi,\gamma}(s, a)]$
Called the state-value function.

$A^{\pi,\gamma}(s, a) = Q^{\pi,\gamma}(s, a) - V^{\pi,\gamma}(s)$
Called the advantage function.
Policy Gradient Formulas with Value Functions

Recall:

$\nabla_\theta E_\tau[R] = E_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]$
$\approx E_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\left(\sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'} - b(s_t)\right)\right]$

Using value functions:

$\nabla_\theta E_\tau[R] = E_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, Q^\pi(s_t, a_t)\right]$
$= E_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, A^\pi(s_t, a_t)\right]$
$\approx E_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, A^{\pi,\gamma}(s_t, a_t)\right]$

Can plug in an advantage estimator $\hat{A}$ for $A^{\pi,\gamma}$. Advantage estimators have the form Return $- V(s)$.
Value Functions in the Future

The baseline accounts for and removes the effect of past actions. We can also use the value function to estimate future rewards:

$\hat{R}_t^{(1)} = r_t + \gamma V(s_{t+1})$ (cut off at one timestep)
$\hat{R}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$ (cut off at two timesteps)
...
$\hat{R}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$ (infinitely many timesteps, no V)
Value Functions in the Future

Subtracting out baselines, we get advantage estimators:

$\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t)$
$\hat{A}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t)$
...
$\hat{A}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots - V(s_t)$

$\hat{A}_t^{(1)}$ has low variance but high bias; $\hat{A}_t^{(\infty)}$ has high variance but low bias. Using an intermediate k (say, 20) gives an intermediate amount of bias and variance.
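A numpy sketch of the k-step estimator $\hat{A}_t^{(k)}$, assuming rewards has length T and values has length T+1 with values[T] = 0 at a terminal state (the function and argument names are illustrative):

```python
import numpy as np

def k_step_advantage(rewards, values, t, k, gamma=0.99):
    """A_hat^(k)_t = sum_{i<k} gamma**i * r_{t+i} + gamma**k * V(s_{t+k}) - V(s_t)."""
    steps = min(k, len(rewards) - t)              # truncate at episode end
    ret = sum(gamma**i * rewards[t + i] for i in range(steps))
    ret += gamma**steps * values[t + steps]       # bootstrap from the critic
    return ret - values[t]
```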
Discounts: Connection to MPC

MPC: $\underset{a}{\text{maximize}}\ Q^{\pi,T}(s, a)$
Discounted policy gradient: $\underset{a}{\text{maximize}}\ Q^{\pi,\gamma}(s, a)$

$E_{a \sim \pi}\left[Q^{\pi,\gamma}(s, a)\, \nabla_\theta \log \pi(a \mid s; \theta)\right] = 0$ when $a \in \arg\max_a Q^{\pi,\gamma}(s, a)$
Application: Robot Locomotion
Finite-Horizon Methods: Advantage Actor-Critic

A2C / A3C uses this fixed-horizon advantage estimator. (Note: the asynchrony in A3C is only for speed; it doesn't improve performance.)

Pseudocode:
for iteration = 1, 2, ... do
    Agent acts for T timesteps (e.g., T = 20)
    For each timestep t, compute
        $\hat{R}_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-1-t} r_{T-1} + \gamma^{T-t} V(s_T)$
        $\hat{A}_t = \hat{R}_t - V(s_t)$
    $\hat{R}_t$ is the target value in a regression problem; $\hat{A}_t$ is the estimated advantage function
    Compute the loss gradient $g = \nabla_\theta \sum_{t=1}^{T} \left[ -\log \pi_\theta(a_t \mid s_t) \hat{A}_t + c(V(s_t) - \hat{R}_t)^2 \right]$
    g is plugged into a stochastic gradient descent variant, e.g., Adam
end for

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. "Asynchronous methods for deep reinforcement learning" (2016).
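The n-step targets $\hat{R}_t$ can be computed in one backward sweep; a numpy sketch assuming rewards has length T and v_last = $V(s_T)$ comes from the critic:

```python
import numpy as np

def a2c_targets(rewards, v_last, gamma=0.99):
    """targets[t] = sum_{t'=t}^{T-1} gamma**(t'-t)*r_{t'} + gamma**(T-t)*V(s_T)."""
    targets = np.zeros(len(rewards))
    running = v_last
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets
```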
A3C Video
A3C Results

[Figure: learning curves on five Atari games (Beamrider, Breakout, Pong, Q*bert, Space Invaders), plotting score against training time in hours (0-14) for DQN, 1-step Q, 1-step SARSA, n-step Q, and A3C.]
Further Reading

- A nice intuitive explanation of policy gradients: http://karpathy.github.io/2016/05/31/rl/
- R. J. Williams. "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning (1992); R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. "Policy gradient methods for reinforcement learning with function approximation". NIPS. MIT Press, 2000.
- My thesis has a decent self-contained introduction to policy gradient methods: http://joschu.net/docs/thesis.pdf
- A3C paper: V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. "Asynchronous methods for deep reinforcement learning" (2016).