ELEC-E8119 Robotics: Manipulation, Decision Making and Learning Policy gradient approaches. Ville Kyrki

Size: px

Start display at page:

Download "ELEC-E8119 Robotics: Manipulation, Decision Making and Learning Policy gradient approaches. Ville Kyrki"

Clarissa McDonald
5 years ago
Views:

1 ELEC-E8119 Robotics: Manipulation, Decision Making and Learning Policy gradient approaches Ville Kyrki

2 Today Direct policy learning via policy gradient.

3 Learning goals Understand basis and limitations of policy gradient approaches.

4 Motivation Even with value function approximation, large state spaces can be problematic. Learning parametric policies p(u x,q) directly without learning value functions sometimes easier. Non-Markov (partially observable) or adversarial situations might benefit from stochastic policies.

5 Value-based vs policy-based RL VALUE FUNCTION POLICY Value-based Learnt value function. Implicit policy. Actor-critic Learnt value function. Learnt policy. Policy-based No value function. Learnt policy. - Can learn stochastic policies. - Usually locally optimal.

6 What is a good policy? How to measure policy quality? T R(q)=E [ t =0 γ t r t ] More generally, T R(q)=E [ t =0 a t r t ] Can also represent average reward per time step. Infinite horizon: Normalized expected return R(q)= X d p ( x) U p(u x )r( x,u) Stationary distribution of Markov chain How to optimize functions?

7 Policy gradient Use gradient ascent on R(q). Update policy parameters by q m+1 =q m +α m q R q=q m How to calculate gradient? T R(q)=E [ t =0 a t r ] t α m >0 m= 0 m=0 α m 2 < Guarantees convergence to local minimum. Depends on q. How to estimate gradient from data (if we have a chance to try different policies)?

8 Finite difference gradient estimation What is gradient? Vector of partial derivatives. How to estimate derivative? Finite difference: For policy gradient: Generate variation f ' (x) f ( x+dx) f (x) dx Δ q i H R(q+Δ q i ) ^R i = t =0 Estimate experimentally T Compute gradient [ g FD, R ref ] T =(Δ Θ T Δ Θ ) 1 Δ Θ T R Repeat until estimate converged Not easy to choose. at r t Δ Θ T = [ Δ q 1,, Δ q I 1,,1 ] R T =[ R 1,, R I ] Where does this come from? R i R ref + g T Δ q i

9 Likelihood-ratio approach Assume trajectories tau are generated by roll-outs, thus Expected return can then be written Gradient is thus Why do that? τ p q (τ )= p( τ q) R( τ )= t =0 R(q)=E τ [ R( τ )]= p q (τ ) R( τ )d τ q R (q)= q p q (τ ) R( τ )d τ = p q (τ ) q log p q (τ ) R( τ )d τ H p q (τ )= p( x 0 ) t =0 =E τ [ q log p q (τ) R( τ )] p( x t +1 x t, u t )p q (u t x t ) H at r t Likelihood ratio trick : Substitute q p q ( τ)= p q ( τ ) q log p q (τ ) Try substitution for log-gradient! H q log p q ( τ )= q p q (u t x t ) t =0 We know this!

10 Example differentiable policies Soft-max policy p q (u t x t ) e q T φ( x t, u t ) Log-policy (score function) Normalization constant missing. Probability portional to expontiated linear combination of features. q log p q (u t x t )=φ( x t,u t ) E pq [ φ (x t, )] Gaussian policy p q (u t x t ) N (q T φ( x t ), σ 2 ) Log-policy q log p q (u t x t )= ( u t q T φ (x t ) )φ (x t ) σ 2 Mean is linear combination of features. Can also be understood as linear policy plus exploration uncertainty p q (u t x t )=q T φ ( x t )+ϵ ϵ N (0,σ 2 )

11 MC policy gradient REINFORCE Episodic version shown here. Approach: Perform trial J (=1,2,3,...). Estimate gradient g R E = E τ [( t =0 = 1 J i =1 Repeat with new trial until convergence. No need to generate policy variations. Often single roll-out (trial) sufficient for policy update. J H q log p q (u t x Use empirical t )) R(i)] mean. [ i q log p q (u ] [ i t x ] t )) R(i)] H [( t=0 Reward for trial i.

12 Decreasing variance by adding baseline Constant baseline can be added to reduce variance of gradient estimate. q R (q)= E q [ q log p q (τ)( R(τ) b) ] = E q [ q log p q (τ) R( τ)] Does not cause bias because E τ [ q log p q ( τ )b ]= q p q (τ)b d τ=b q p q ( τ)d τ=b q 1=0

13 Episodic REINFORCE with optimal baseline Optimal baseline for episodic REINFORCE (minimize variance of estimator): b h = E τ [( t=0 E τ [( t =0 Approach: Perform trial J (=1,2,3,...). For each gradient element h Estimate optimal baseline Estimate gradient Repeat until convergence. H q h log p q (u t x t )) 2 R τ ] H qh log p q (u t x t ))] 2 b h g h = 1 J i =1 J Componentwise! Use empirical mean (average over trials). H [( t=0 )] [ i q h log p q (u ] [ i t x ] [ i t ))( R(i) b ] h Even with optimal baseline, variance can be an issue.

14 Policy gradient theorem/g(po)mdp Observation: Future actions do not depend on past rewards. E [ q log p q (u t x t )r k ]=0 t>k don't take into account past rewards when evaluating the effect of an action (causality, taking an action can only affect future rewards) PGT/G(PO)MDP: Two equivalent algorithms. Reduces variance of estimate Fewer samples needed on average. H g GMDP = E τ [ k =0 k ( t=0 q h log p q (u t x t ))(a k r k b k h )] Note: If only rewards at final time step, this is equivalent to REINFORCE.

15 Actor-critic approaches Critic estimates value function. Actor updates policy in direction of critic. For policy gradient, critic estimates approximate value function Q(x,u,w). See previous lectures.

16 Simple actor-critic example Critic linear TD(0), actor policy gradient, Q( x, u,w)=w T φ( x,u) Algorithm: Sample u p( u x) For each step Step forward, record r, x' u ' p(u x) δ=r +γq (x ', u', w) Q( x,u, w) Sample q=q+α q log p q (u x)q (x, u, w) w=w +βδ φ(x, u) x=x ', u=u ' Why Q corresponds to R? actor critic

17 Gradient vs natural gradient Gradient depends on parametrization. Natural gradient parametrization independent. q NG p q (u x)=g q 1 q p q (u x) Fisher information matrix Normalizes parameter influence. G q =E [ q log p q (u x) q log p q (u x) T ]

18 Episodic Natural Actor-Critic Bias in actor-critic PG can be avoided by compatible function approximation w Q( x, u, w)= q log p q (u x) Under natural gradient actor-critic PG simplifies. NG q R(q)=w Algorithm: Critic: H [ i q log p q (u ] [ i t x ] t ) ψ i = t=1 Ψ= [ ψ 1 ψ 2 ψ J ]T [w T R ref ] T = (Ψ T Ψ ) 1 Ψ T R Implies exact approximation under minimization of MSE. Actor: q m+1 =q m +α m w R=[ R (1) R(J )] T For details, see e.g. Peters & Schaal 2008.

19 Alternatives: non-episodic approaches, model-based policy learning There are also on-line and non-episodic approaches. More rare in robotics than episodic ones. Model-based learning: Learn first model dynamics, then use internal simulation to learn policy.

20 Challenges and future outlook Still need many samples. Learning rate needs to be tuned. Exploration rate difficult to learn/tune. Recent advances Weighted maximum-likelihood approaches. Hierarchical policies. Context-dependent policies.

21 Summary Episodic policy gradient effective for trajectory-based policies of complex skills. Finite-difference approaches approximate gradient by policy adjustments. Likelihood ratio-approaches calculate gradient through known policy.

22 Next: Partial observability and POMDPs What changes if we cannot observe x directly?

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning

REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari