Tutorial on Policy Gradient Methods. Jan Peters

1 Tutorial on Policy Gradient Methods Jan Peters

2 Outline 1. Reinforcement Learning 2. Finite Difference vs Likelihood-Ratio Policy Gradients 3. Likelihood-Ratio Policy Gradients 4. Conclusion

3 General Setup: Policy u = π(x) (deterministic) or u ~ π(u|x) (stochastic), with action u ∈ R^m, next state x' ∈ R^n generated by the system p(x'|x, u), and reward r ∈ R. Goal: Find a policy that maximizes the reward. 1. Reinforcement Learning

4 Goal of RL: What does maximizing your rewards mean? J(π) = (1/T) Σ_{t=1}^{T} r(x_t, u_t, t) ≈ E{r(x, u, t)}. Find a policy that maximizes the rewards! A policy tells you which actions to use in each state. 1. Reinforcement Learning

5 Policy Search vs Value Function Methods
Value Function View: Critic (Policy Evaluation) Q(x_t, u_t, t) = E{ Σ_{τ=t}^{T} r(x_τ, u_τ, τ) | x_t, u_t }; Actor (Compute Optimal Policy) u_t = π(x_t, t) = argmax_u Q(x_t, u, t).
Policy Search View: Critic (Policy Sensitivity) J(π) = E{ Σ_{t=0}^{T} r(x_t, u_t, t) }; Actor (Policy Improvement) π' = argmax_{π'} { J(π') − J(π) }. 1. Reinforcement Learning

6 Greedy vs Gradients
Greedy updates: π' = argmax_θ E_{π_θ}{ Q^π(x, u) }; a small change in V^π can cause a large change in π, giving a potentially unstable learning process with large policy jumps.
Policy gradient updates: θ_{π'} = θ_π + α dJ(θ)/dθ |_{θ=θ_π}; a small change in θ causes only a small change in π, giving a stable learning process with smooth policy improvement. 2. Value Function Methods

7 Outline 1. Reinforcement Learning for Motor Control? 2. Finite Difference vs Likelihood-Ratio Policy Gradients 3. Likelihood-Ratio Policy Gradients 4. Conclusion

8 Why Policy Gradient Methods? Smooth changes in the parameters result in stability. Prior information can be incorporated with ease. Works with incomplete state information. The exploration-exploitation dilemma is treated implicitly. The gradient estimate is unbiased. Requires far fewer samples. 3. FD vs LR gradients

9 Finite Difference Gradients (Blackbox Approach): Perturb the parameters of your policy, θ + δθ, run the system (policy u = π(x) or u ~ π(u|x), system p(x'|x, u), reward r ∈ R), measure the change in return, and estimate dJ/dθ ≈ (J(θ + δθ) − J(θ)) / δθ. 3. FD vs LR gradients

10 Finite Difference Gradients. Why use finite difference gradients? Only needs a black box! Works with any parameterization, including deterministic policies. Fast to estimate for deterministic systems. State of the art in the simulation community. Why not? Parameter perturbation can destroy your robot. Exploration is hard to include. For stochastic systems it is very slow. 3. FD vs LR gradients
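To make the finite-difference estimator of slide 9 concrete, here is a minimal sketch that perturbs one policy parameter at a time; the quadratic toy return stands in for a real rollout, and the perturbation size and learning rate are illustrative assumptions rather than values from the slides.

```python
# Minimal sketch of a finite-difference policy gradient (slide 9).
# The quadratic toy return stands in for the expected return of real rollouts.
import numpy as np

def rollout_return(theta):
    # Toy stand-in for J(theta); a real implementation would run the policy on the system.
    return -np.sum((theta - 1.0) ** 2)

def fd_gradient(theta, delta=1e-4):
    """Estimate dJ/dtheta with forward differences, one parameter at a time."""
    j_ref = rollout_return(theta)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_pert = theta.copy()
        theta_pert[i] += delta                # perturb a single parameter
        grad[i] = (rollout_return(theta_pert) - j_ref) / delta
    return grad

theta = np.zeros(3)
for _ in range(100):
    theta += 0.1 * fd_gradient(theta)         # gradient ascent on J
print(theta)                                  # converges toward the optimum [1, 1, 1]
```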

11 Likelihood Ratio Gradients (Whitebox Approach): Perturb the actions using a stochastic policy u ~ π(u|x); the noisy actions u ∈ R^m drive the system p(x'|x, u), which returns the next state x' ∈ R^n and the reward r ∈ R. 3. FD vs LR gradients

12 Likelihood Ratio Gradients (Whitebox Approach): Likelihood Ratio Trick
d/dθ J(θ) = d/dθ ∫_U π(u) r(u) du                                   (1)
          = ∫_U (dπ(u)/dθ) r(u) du                                  (2)
          = ∫_U π(u) (1/π(u)) (dπ(u)/dθ) r(u) du                    (3)
          = ∫_U π(u) (d log π(u)/dθ) r(u) du                        (4)
          = E{ (d log π(u)/dθ) r(u) } ≈ (1/N) Σ_{i=1}^{N} (d log π(u_i)/dθ) r(u_i)   (5)
3. FD vs LR gradients
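As a concrete illustration of Eq. (5), the following sketch estimates the gradient for a one-dimensional Gaussian policy π(u) = N(u | θ, σ²) by averaging (d log π(u_i)/dθ) r(u_i) over sampled actions; the toy reward, σ, and step size are assumptions for the example, not values from the slides.

```python
# Monte-Carlo likelihood-ratio gradient (Eq. 5) for a 1-D Gaussian policy
# pi(u) = N(u | theta, sigma^2). Reward and sigma are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5

def reward(u):
    return -(u - 2.0) ** 2                            # toy reward, maximal at u = 2

def lr_gradient(theta, n_samples=1000):
    u = rng.normal(theta, sigma, size=n_samples)      # sample actions from the policy
    dlogpi_dtheta = (u - theta) / sigma ** 2          # d log N(u|theta, sigma^2) / d theta
    return np.mean(dlogpi_dtheta * reward(u))         # Eq. (5)

theta = 0.0
for _ in range(200):
    theta += 0.01 * lr_gradient(theta)
print(theta)                                          # approaches 2, the reward-maximizing mean
```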

13 Likelihood Ratio Gradients. Why use likelihood ratio gradients? Fastest gradient method! We know the policy derivative, so these estimates are more efficient than the finite difference gradients used in the simulation community. Only the motor command is perturbed, so the policies remain stable! Why not? Requires a stochastic policy. Injects noise into the system. The theory is much more complex! 3. FD vs LR gradients

14 Outline 1. Reinforcement Learning for Motor Control? 2. Finite Difference vs Likelihood-Ratio Policy Gradients 3. Likelihood-Ratio Policy Gradients 4. Conclusion

15 Goal of RL Revisited. Goal: Optimize the expected return
J(θ) = ∫_X d^π(x) ∫_U π(u|x) r(x, u) du dx = (1 − γ) E{ Σ_{t=0}^{∞} γ^t r_t },
where d^π(x) is the state distribution, π(u|x) is the policy (which we can choose), and r is the reward. 4. Likelihood Ratio Gradients

16 Policy Gradient Methods. According to the Policy Gradient Theorem, the gradient of the expected return can be computed as
∇_θ J(θ) = ∫_X d^π(x) ∫_U ∇_θ π(u|x) (Q^π(x, u) − b^π(x)) du dx,
where the derivative is taken only of the policy, Q^π(x, u) is the state-action value function, and b^π(x) is an arbitrary baseline function. Problems: high variance, very slow convergence, dependence on the baseline! Originally discovered by Aleksandrov (1968) and Glynn (1986). Examples: episodic REINFORCE, SRV, GPOMDP. 4. Likelihood Ratio Gradients

18 Compatible Function Approximation. The state-action value function Q^π(x, u) can be replaced by the compatible function approximation
f^π_w(x, u) = (∂ log π(u|x) / ∂θ)^T w,
where w are the parameters of the function approximator, without biasing the gradient. Thus, the policy gradient becomes
∇_θ J(θ) = ∫_X d^π(x) ∫_U ∇_θ π(u|x) (f^π_w(x, u) − b^π(x)) du dx.
(Sutton et al., 2000; Konda & Tsitsiklis, 2000) 4. Likelihood Ratio Gradients
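A brief justification of why this substitution does not bias the gradient (the standard argument, sketched here under the assumption that w minimizes the squared approximation error; not verbatim from the slides):

```latex
% Setting the derivative of the weighted squared error to zero gives
\frac{\partial}{\partial w}
  \int_X d^\pi(x) \int_U \pi(u|x)\,
    \bigl(Q^\pi(x,u) - f^\pi_w(x,u)\bigr)^2\, du\, dx = 0
\;\Longrightarrow\;
\int_X d^\pi(x) \int_U \pi(u|x)\, \nabla_\theta \log \pi(u|x)\,
  \bigl(Q^\pi(x,u) - f^\pi_w(x,u)\bigr)\, du\, dx = 0 .
% Since \nabla_\theta \pi(u|x) = \pi(u|x)\,\nabla_\theta \log \pi(u|x), this integral is
% exactly the difference between the gradient computed with Q^\pi and the one computed
% with f^\pi_w, so both yield the same \nabla_\theta J(\theta).
```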

19 All-Action Gradient. By integrating over all possible actions in a state, the baseline can be integrated out, and the gradient becomes
∇_θ J(θ) = ∫_X d^π(x) ∫_U ∇_θ π(u|x) (f^π_w(x, u) − b(x)) du dx
         = ∫_X d^π(x) ∫_U π(u|x) ∇_θ log π(u|x) ∇_θ log π(u|x)^T w du dx
         = F(θ) w,
where F(θ) is the all-action matrix and w are the parameters. (Peters et al., 2003) 4. Likelihood Ratio Gradients

20 Natural Gradients. A more efficient gradient in learning problems is the natural gradient (Amari, 1998):
∇̃_θ J(θ) = G^{-1}(θ) ∇_θ J(θ),
where G(θ) is the Fisher information matrix
G(θ) = ∫_X d^π(x) ∫_U π(u|x) ∇_θ log(d^π(x) π(u|x)) ∇_θ log(d^π(x) π(u|x))^T du dx,
and the 'vanilla' gradient is
∇_θ J(θ) = ∫_X d^π(x) ∫_U ∇_θ π(u|x) (Q^π(x, u) − b^π(x)) du dx. 4. Likelihood Ratio Gradients
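The following sketch shows one way to form a sample estimate of the (policy-only) Fisher information and apply the natural gradient for a one-dimensional Gaussian policy; the toy reward, σ, and step size are assumptions used only for illustration.

```python
# Sketch: sample-based natural gradient for a 1-D Gaussian policy N(u | theta, sigma^2).
# The Fisher term here is the policy-only (all-action) matrix of slides 19/21.
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5

def reward(u):
    return -(u - 2.0) ** 2

def natural_gradient(theta, n_samples=2000):
    u = rng.normal(theta, sigma, size=n_samples)
    score = (u - theta) / sigma ** 2                  # d log pi / d theta
    vanilla = np.mean(score * reward(u))              # likelihood-ratio gradient
    fisher = np.mean(score ** 2)                      # F(theta) = E[score^2] (scalar case)
    return vanilla / fisher                           # G^{-1}(theta) * grad J

theta = 0.0
for _ in range(100):
    theta += 0.2 * natural_gradient(theta)
print(theta)                                          # approaches 2
```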

21 All Action = Fisher Information! How does the all-action matrix
F(θ) = ∫_X d^π(x) ∫_U π(u|x) ∇_θ log π(u|x) ∇_θ log π(u|x)^T du dx
relate to the Fisher information matrix
G(θ) = ∫_X d^π(x) ∫_U π(u|x) ∇_θ log(d^π(x) π(u|x)) ∇_θ log(d^π(x) π(u|x))^T du dx?
While Kakade (2002) suggested that F is an average of point Fisher information matrices, we could prove that F = G. (Peters et al., 2003; 2005; Bagnell et al., 2003) 4. Likelihood Ratio Gradients

22 Natural Gradient Updates. As G = F, the natural gradient simplifies to
∇̃_θ J(θ) = G^{-1}(θ) ∇_θ J(θ) = G^{-1}(θ) F(θ) w = w,
and the policy parameter update becomes θ_{t+1} = θ_t + α_t w_t. Important: estimating the gradient has been reduced to estimating the compatible function approximation / critic! (Kakade, 2002; Peters et al., 2003, 2005; Bagnell & Schneider, 2003) 4. Likelihood Ratio Gradients

23 Natural Policy Gradients: example problems. [Figures: linear quadratic regulation and a two-state problem with rewards r ∈ {0, 1, 2} and actions u ∈ {0, 1}.] 4. Likelihood Ratio Gradients

24 Compatible Function Approximation. To obtain the natural gradient ∇̃_θ J(θ) = w, we need to estimate the compatible function approximation
f^π_w(x, u) = (∂ log π(u|x) / ∂θ)^T w.
This function approximation is mean zero under the policy! Therefore it can ONLY represent the Advantage Function
f^π_w(x, u) = Q^π(x, u) − V^π(x) = A^π(x, u). 4. Likelihood Ratio Gradients
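A quick check of the mean-zero claim (a standard one-line derivation, added here for completeness rather than copied from the slides):

```latex
% Why f^\pi_w(x,u) has zero mean under the policy:
\int_U \pi(u|x)\, f^\pi_w(x,u)\, du
  = \int_U \pi(u|x)\, \nabla_\theta \log \pi(u|x)^{\top} w \, du
  = \Bigl( \nabla_\theta \int_U \pi(u|x)\, du \Bigr)^{\top} w
  = (\nabla_\theta 1)^{\top} w = 0 .
% Hence f^\pi_w can only represent zero-mean, action-dependent functions,
% i.e. the advantage A^\pi(x,u) = Q^\pi(x,u) - V^\pi(x), never Q^\pi itself.
```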

25 Compatible Function Approximation. The advantage function f^π_w(x, u) = Q^π(x, u) − V^π(x) = A^π(x, u) is very different from the value functions! [Figures: Q^π(x, u) and A^π(x, u) plotted over state x and action u.] ...and we cannot directly do temporal difference learning on this representation! 4. Likelihood Ratio Gradients

26 Natural Actor-Critic. We cannot do TD learning with f^π_w(x, u) = Q^π(x, u) − V^π(x) = A^π(x, u) alone. But when we add further basis functions V^π(x) = φ(x)^T v into the Bellman equation,
V^π(x_t) + ∇_θ log π(u_t|x_t)^T w = r(x_t, u_t) + γ V^π(x_{t+1}) + ε_t,
we get a linear regression problem which can be solved with the LSTD(λ) algorithm (Boyan, 1996) in one step! 4. Likelihood Ratio Gradients

27 Natural Actor-Critic.
Critic: LSTD-Q(λ) evaluation, i.e. Boyan's LSTD(λ) with the new basis functions
Γ_t = [φ(x_{t+1})^T, 0^T]^T,  Φ_t = [φ(x_t)^T, ∇_θ log π(u_t|x_t)^T]^T,
z_{t+1} = λ z_t + Φ_t,
A_{t+1} = A_t + z_{t+1} (Φ_t − γ Γ_t)^T,
b_{t+1} = b_t + z_{t+1} r_{t+1},
[w_{t+1}^T, v_{t+1}^T]^T = A_{t+1}^{-1} b_{t+1}.
Actor: Natural Policy Gradient Improvement θ_{t+1} = θ_t + α_t w_t. 4. Likelihood Ratio Gradients
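A compact sketch of the critic recursion on slide 27, assuming trajectories of (x_t, u_t, r_{t+1}, x_{t+1}) tuples and user-supplied feature and score functions; the function names and the small ridge term are illustrative assumptions, not part of the slides.

```python
# Sketch of the LSTD-Q(lambda) critic from slide 27. `phi` returns state features,
# `score` returns grad_theta log pi(u|x); the ridge term is added for numerical stability.
import numpy as np

def lstd_q_lambda(trajectory, phi, score, gamma=0.99, lam=0.9, ridge=1e-6):
    n_phi = phi(trajectory[0][0]).size
    n_w = score(trajectory[0][0], trajectory[0][1]).size
    dim = n_phi + n_w
    A = ridge * np.eye(dim)
    b = np.zeros(dim)
    z = np.zeros(dim)
    for (x, u, r, x_next) in trajectory:
        Phi_t = np.concatenate([phi(x), score(x, u)])            # [phi(x)^T, dlogpi^T]^T
        Gamma_t = np.concatenate([phi(x_next), np.zeros(n_w)])   # [phi(x')^T, 0^T]^T
        z = lam * z + Phi_t                                      # eligibility trace
        A += np.outer(z, Phi_t - gamma * Gamma_t)
        b += z * r
    sol = np.linalg.solve(A, b)                                  # [v^T, w^T]^T = A^{-1} b
    return sol[n_phi:], sol[:n_phi]                              # natural gradient w, critic weights v
```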

28 Algorithms Derivable from this Framework. A Gibbs policy π(u_t|x_t) = exp(θ_{xu}) / Σ_b exp(θ_{xb}) with additional basis functions φ(x) = [0, ..., 0, 1, 0, ..., 0]^T in the NAC framework yields Sutton et al.'s (1983) actor-critic. A linear Gaussian policy π(u|x) = N(u | θ_gain^T x, θ_explore) with additional basis functions φ(x) = x^T P x + p and learning rate α_i = 1 / J(θ_i) in the NAC framework yields Bradtke & Barto's (1993) Q-learning for LQR. 4. Likelihood Ratio Gradients

29 Episodic Natural Actor-Critic. ...but in many cases we don't have any good additional basis functions V^π(x) = φ(x)^T v! In this case, we can sum up the advantages along a trajectory and obtain one data point for a linear regression problem:
V^π(x_0) + ( Σ_{t=0}^{T} γ^t ∇_θ log π(u_t|x_t) )^T w = Σ_{t=0}^{T} γ^t r(x_t, u_t) + γ^{T+1} V^π(x_{T+1}),
where V^π(x_0) = J, the discounted sum of log-policy derivatives is ϕ_i, the discounted reward sum is R_i, and the final term γ^{T+1} V^π(x_{T+1}) ≈ 0. ...so an additional basis function of 1 suffices! 4. Likelihood Ratio Gradients

30 Episodic Natural Actor-Critic.
Critic: episodic evaluation. Sufficient statistics Φ = [ϕ_1, ϕ_2, ..., ϕ_N; 1, 1, ..., 1]^T and R = [R_1, R_2, ..., R_N]^T give the linear regression [w^T, J]^T = (Φ^T Φ)^{-1} Φ^T R.
Actor: Natural Policy Gradient Improvement θ_{t+1} = θ_t + α_t w_t. 4. Likelihood Ratio Gradients
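A small sketch of the episodic regression on slide 30, assuming the per-episode statistics have already been collected from rollouts; the rollout code is omitted and the names and learning rate are illustrative.

```python
# Episodic Natural Actor-Critic regression (slide 30). Each episode i contributes
# phi_i = sum_t gamma^t * grad_theta log pi(u_t|x_t) and R_i = sum_t gamma^t * r_t;
# collecting these statistics from rollouts is assumed to happen elsewhere.
import numpy as np

def episodic_nac_step(phis, returns, theta, alpha=0.1):
    """phis: (N, d) array of per-episode score sums; returns: (N,) array of returns."""
    N = len(returns)
    Phi = np.hstack([phis, np.ones((N, 1))])          # extra basis function of 1 estimates J
    sol, *_ = np.linalg.lstsq(Phi, returns, rcond=None)
    w, J = sol[:-1], sol[-1]                          # natural gradient w and baseline J
    return theta + alpha * w, J                       # actor update: theta <- theta + alpha * w
```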

31 Improving Motor Primitives. Ijspeert et al. (2002) suggested a nonlinear dynamics approach for motor primitives in imitation learning.
Trajectory plan dynamics: ż = α_z (β_z (g − y) − z), ẏ = α_y (f(x, v) + z).
Canonical dynamics: v̇ = α_v (β_v (g − x) − v), ẋ = α_x v.
Local linear model approximation: f(x, v) = ( Σ_{i=1}^{k} w_i b_i v ) / ( Σ_{i=1}^{k} w_i ), with w_i = exp( −(1/2) d_i (x̃ − c_i)^2 ) and x̃ = (x − x_0) / (g − x_0).
The parameters b_i can also be improved by reinforcement learning. 4. Evaluations
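To make the dynamics on slide 31 concrete, the following sketch integrates the trajectory-plan and canonical systems with simple Euler steps; the gain values, basis-function placement, and time step are illustrative assumptions, not values from the slides.

```python
# Euler integration of the motor-primitive dynamics on slide 31.
# Gains, basis-function centers/widths, and dt are illustrative assumptions.
import numpy as np

def rollout_primitive(b, y0=0.0, g=1.0, T=1.0, dt=0.001,
                      alpha_z=8.0, beta_z=2.0, alpha_y=1.0,
                      alpha_v=8.0, beta_v=2.0, alpha_x=1.0):
    k = len(b)
    c = np.linspace(0.0, 1.0, k)                     # basis-function centers in normalized phase
    d = np.full(k, 100.0)                            # basis-function widths
    y, z = y0, 0.0                                   # trajectory-plan state
    x, v = y0, 0.0                                   # canonical state
    ys = []
    for _ in range(int(T / dt)):
        x_tilde = (x - y0) / (g - y0)                # normalized phase
        w = np.exp(-0.5 * d * (x_tilde - c) ** 2)
        f = np.dot(w, b) * v / (np.sum(w) + 1e-10)   # f(x, v) = sum_i w_i b_i v / sum_i w_i
        z += dt * alpha_z * (beta_z * (g - y) - z)   # trajectory-plan dynamics
        y += dt * alpha_y * (f + z)
        v += dt * alpha_v * (beta_v * (g - x) - v)   # canonical dynamics
        x += dt * alpha_x * v
        ys.append(y)
    return np.array(ys)

trajectory = rollout_primitive(b=np.zeros(10))       # with b = 0 the primitive converges to g
```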

32 Improving Motor Primitives. [Figure: learning curves (expected return, roughly −10^2 to −10^2.7, vs. number of evaluations) for the minimum motor command task and for policies passing through two goals.] 4. Evaluations

33 Outline 1. Reinforcement Learning for Motor Control? 2. Finite Difference vs Likelihood-Ratio Policy Gradients 3. Likelihood-Ratio Policy Gradients 4. Conclusion

34 Conclusions. If you can explore your complete state-action space sufficiently... use value function methods. If you have access to the policy and its derivatives... use likelihood ratio policy gradient methods. If you have good additional basis functions... use the Natural Actor-Critic. If not... use the Episodic Natural Actor-Critic. If you want to explore a problem quickly... use vanilla likelihood ratio policy gradient methods. If you can only access your parameters... use finite difference policy gradient methods. 6. Conclusion

35 Projects... Learning Motor Primitives with Reward-Weighted Regression: this will be a nearly new method... very enthusiastic people needed! Applying Policy Gradients (methods of your choice!) to Oscillator Optimization: here we can create several projects if you like :) 6. Conclusion
