Deterministic Policy Gradient, Advanced RL Algorithms

1 NPFL122, Lecture 9: Deterministic Policy Gradient, Advanced RL Algorithms
Milan Straka, December 10, 2018
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

2 REINFORCE with Baseline

The returns can be arbitrary: better-than-average and worse-than-average returns cannot be recognized from the absolute value of the return alone. Hopefully, we can generalize the policy gradient theorem to use a baseline $b(s)$:

$$\nabla_\theta J(\theta) \propto \sum_{s \in S} \mu(s) \sum_{a \in A} \big(q_\pi(s, a) - b(s)\big) \nabla_\theta \pi(a \mid s; \theta).$$

A good choice for $b(s)$ is $v_\pi(s)$, which can be shown to minimize the variance of the estimator. Such a baseline resembles centering of returns, given that $v_\pi(s) = \mathbb{E}_a\, q_\pi(s, a)$. Then, better-than-average returns are positive and worse-than-average returns are negative.

The resulting value $a_\pi(s, a) \stackrel{\text{def}}{=} q_\pi(s, a) - v_\pi(s)$ is also called the advantage function.

Of course, the baseline $v_\pi(s)$ can only be approximated. If neural networks are used to estimate $\pi(a \mid s; \theta)$, then some part of the network is usually shared between the policy and the value function estimate, which is trained using the mean squared error of the predicted and observed returns.
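
As a small illustration (all numbers below are invented for the example), the following sketch computes the centered returns and the two losses from one episode:

    import numpy as np

    # Hypothetical episode data: observed returns G_t, baseline estimates b(S_t),
    # and log pi(A_t|S_t; theta) of the taken actions.
    returns   = np.array([10.0, 12.0,  8.0, 11.0])
    baseline  = np.array([10.2, 10.4, 10.1, 10.3])   # approximation of v(s)
    log_probs = np.log(np.array([0.3, 0.5, 0.2, 0.4]))

    # Centered returns: better-than-average positive, worse-than-average negative.
    advantages = returns - baseline

    # REINFORCE with baseline scales each log-probability by (G_t - b(S_t)).
    policy_loss = -np.mean(advantages * log_probs)

    # The baseline itself is trained by MSE of predicted vs. observed returns.
    value_loss = np.mean((returns - baseline) ** 2)
    print(policy_loss, value_loss)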

3 Parallel Advantage Actor Critic

An alternative to independent workers is to train in a synchronous and centralized way, by having the workers only generate episodes. Such an approach was described in May 2017 by Clemente et al., who named their agent parallel advantage actor-critic (PAAC); a sketch of the synchronous loop follows the figure.

Figure 1 of the paper "Efficient Parallel Methods for Deep Reinforcement Learning" by Alfredo V. Clemente et al.: $n_w$ workers step $n_e$ environments and send states and rewards to a master, whose single DNN computes the actions and the targets and performs the learning.
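
The following is a schematic sketch of this loop (ToyEnv and policy are invented stand-ins, not the paper's implementation): the master steps all $n_e$ environments with a single batched network call, so there is only one copy of the parameters.

    import numpy as np

    class ToyEnv:
        """Stand-in environment: random rewards, 4-dimensional states."""
        def reset(self):
            return np.random.randn(4)
        def step(self, action):
            return np.random.randn(4), np.random.randn(), False

    def policy(states):
        """Stand-in for the master's single network: batched action selection."""
        return np.random.randint(0, 2, size=len(states))

    envs = [ToyEnv() for _ in range(8)]          # n_e parallel environments
    states = np.stack([env.reset() for env in envs])

    for step in range(5):
        actions = policy(states)                 # one forward pass for all envs
        results = [env.step(a) for env, a in zip(envs, actions)]
        next_states, rewards, dones = map(np.array, zip(*results))
        # ... the master would compute targets from (states, actions, rewards,
        # next_states) and perform a single centralized gradient update here ...
        states = next_states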

4 Continuous Action Space

Until now, the actions were discrete. However, many environments naturally accept actions from a continuous space. We now consider actions which come from a range $[a, b]$ for $a, b \in \mathbb{R}$, or more generally from a Cartesian product of several such ranges $\prod_i [a_i, b_i]$.

A simple way to parametrize the action distribution is to choose actions from the normal distribution. Given mean $\mu$ and variance $\sigma^2$, the probability density function of $N(\mu, \sigma^2)$ is

$$p(x) \stackrel{\text{def}}{=} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

Figure from Section 13.7 of "Reinforcement Learning: An Introduction, Second Edition": probability density functions of normal distributions for several values of $\mu$ and $\sigma^2$.

5 Continuous Action Space in Gradient Methods

Utilizing continuous action spaces in gradient-based methods is straightforward. Instead of the softmax distribution we suitably parametrize the action value, usually using the normal distribution. Considering only one real-valued action, we therefore have

$$\pi(a \mid s; \theta) \stackrel{\text{def}}{=} P\big(a \sim N(\mu(s; \theta), \sigma(s; \theta)^2)\big),$$

where $\mu(s; \theta)$ and $\sigma(s; \theta)$ are function approximations of the mean and standard deviation of the action distribution.

The mean and standard deviation are usually computed from a shared representation, with the mean being computed as a regular regression (i.e., one output neuron without activation), and the standard deviation (which must be positive) being computed again as a regression, followed most commonly by either $\exp$ or $\operatorname{softplus}$, where $\operatorname{softplus}(x) \stackrel{\text{def}}{=} \log(1 + e^x)$.

6 Continuous Action Space in Gradient Methods

During training, we compute $\mu(s; \theta)$ and $\sigma(s; \theta)$ and then sample the action value (clipping it to $[a, b]$ if required). To compute the loss, we utilize the probability density function of the normal distribution (and usually also add an entropy penalty).

    mu = tf.layers.dense(hidden_layer, 1)[:, 0]
    log_sd = tf.layers.dense(hidden_layer, 1)[:, 0]
    sd = tf.exp(log_sd)  # or sd = tf.nn.softplus(log_sd)

    normal_dist = tf.distributions.Normal(mu, sd)

    # Loss computed as -log pi(a|s) * return - entropy_regularization * entropy
    loss = - normal_dist.log_prob(self.actions) * self.returns \
           - args.entropy_regularization * normal_dist.entropy()

7 Deterministic Policy Gradient Theorem

Combining continuous actions and Deep Q Networks is not straightforward. In order to do so, we need a different variant of the policy gradient theorem. Recall that the policy gradient theorem states

$$\nabla_\theta J(\theta) \propto \sum_{s \in S} \mu(s) \sum_{a \in A} q_\pi(s, a) \nabla_\theta \pi(a \mid s; \theta).$$

Deterministic Policy Gradient Theorem. Assume that the policy $\pi(s; \theta)$ is deterministic and computes an action $a \in \mathbb{R}$. Then, under several assumptions about continuity, the following holds:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mu(s)} \Big[ \nabla_\theta \pi(s; \theta) \, \nabla_a q_\pi(s, a) \big|_{a = \pi(s; \theta)} \Big].$$

The theorem was first proven in the paper Deterministic Policy Gradient Algorithms by David Silver et al.
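
To see the chain rule of the theorem at work, consider an invented toy example: a linear deterministic policy $\pi(s; \theta) = \theta s$ and a known critic $q(s, a) = -(a - 2s)^2$, whose optimum is $\theta = 2$:

    import numpy as np

    theta = 0.5                       # policy parameter
    states = np.random.randn(1000)    # samples s ~ mu(s)

    def pi(s, theta):                 # deterministic policy pi(s; theta)
        return theta * s

    def dq_da(s, a):                  # grad_a q(s, a) for q(s, a) = -(a - 2s)^2
        return -2.0 * (a - 2.0 * s)

    a = pi(states, theta)
    # Deterministic policy gradient: E_s[grad_theta pi(s) * grad_a q(s, a)|a=pi(s)];
    # here grad_theta pi(s; theta) = s.
    grad = np.mean(states * dq_da(states, a))
    print(grad)  # positive: pushes theta from 0.5 toward the optimum theta = 2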

8 Deterministic Policy Gradient Theorem Proof

The proof is very similar to the proof of the original (stochastic) policy gradient theorem. We assume that $p(s' \mid s, a)$, $\nabla_a p(s' \mid s, a)$, $r(s, a)$, $\nabla_a r(s, a)$, $\pi(s; \theta)$ and $\nabla_\theta \pi(s; \theta)$ are continuous in all parameters.

$$\begin{aligned}
\nabla_\theta v_\pi(s) &= \nabla_\theta q_\pi(s, \pi(s; \theta)) \\
&= \nabla_\theta \Big( r(s, \pi(s; \theta)) + \gamma \int_{s'} p(s' \mid s, \pi(s; \theta))\, v_\pi(s') \,\mathrm{d}s' \Big) \\
&= \nabla_\theta \pi(s; \theta)\, \nabla_a r(s, a) \big|_{a = \pi(s; \theta)} + \gamma \nabla_\theta \int_{s'} p(s' \mid s, \pi(s; \theta))\, v_\pi(s') \,\mathrm{d}s' \\
&= \nabla_\theta \pi(s; \theta)\, \nabla_a \Big( r(s, a) + \gamma \int_{s'} p(s' \mid s, a)\, v_\pi(s') \,\mathrm{d}s' \Big) \Big|_{a = \pi(s; \theta)} + \gamma \int_{s'} p(s' \mid s, \pi(s; \theta))\, \nabla_\theta v_\pi(s') \,\mathrm{d}s' \\
&= \nabla_\theta \pi(s; \theta)\, \nabla_a q_\pi(s, a) \big|_{a = \pi(s; \theta)} + \gamma \int_{s'} p(s' \mid s, \pi(s; \theta))\, \nabla_\theta v_\pi(s') \,\mathrm{d}s'
\end{aligned}$$

Similarly to the stochastic policy gradient theorem, we finish the proof by continually expanding $\nabla_\theta v_\pi(s')$.

9 Deep Deterministic Policy Gradients

Note that the formulation of the deterministic policy gradient theorem allows an off-policy algorithm, because the loss no longer depends on actions sampled from the current policy (similarly to how expected Sarsa is also an off-policy algorithm).

We therefore train function approximations for both $\pi(s; \theta)$ and $q(s, a; \theta)$, training $q(s, a; \theta)$ using a deterministic variant of the Bellman equation

$$q(S_t, A_t; \theta) = \mathbb{E}_{R_{t+1}, S_{t+1}} \big[ R_{t+1} + \gamma q(S_{t+1}, \pi(S_{t+1}; \theta)) \big]$$

and $\pi(s; \theta)$ according to the deterministic policy gradient theorem.

The algorithm was first described in the paper Continuous Control with Deep Reinforcement Learning by Timothy P. Lillicrap et al. (2015).

The authors utilize a replay buffer, a target network (updated by exponential moving average with $\tau = 0.001$), batch normalization for CNNs, and perform exploration by adding normally-distributed noise to the predicted actions. Training is performed by Adam with learning rates of 1e-4 and 1e-3 for the policy and the critic network, respectively.

10 Deep Deterministic Policy Gradients

Algorithm 1: DDPG algorithm
    Randomly initialize critic network Q(s, a | θ^Q) and actor μ(s | θ^μ) with weights θ^Q and θ^μ.
    Initialize target networks Q′ and μ′ with weights θ^{Q′} ← θ^Q, θ^{μ′} ← θ^μ
    Initialize replay buffer R
    for episode = 1, …, M do
        Initialize a random process N for action exploration
        Receive initial observation state s_1
        for t = 1, …, T do
            Select action a_t = μ(s_t | θ^μ) + N_t according to the current policy and exploration noise
            Execute action a_t and observe reward r_t and new state s_{t+1}
            Store transition (s_t, a_t, r_t, s_{t+1}) in R
            Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R
            Set y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
            Update the critic by minimizing the loss L = 1/N Σ_i (y_i − Q(s_i, a_i | θ^Q))²
            Update the actor policy using the sampled policy gradient:
                ∇_{θ^μ} J ≈ 1/N Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s_i}
            Update the target networks:
                θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′}
                θ^{μ′} ← τ θ^μ + (1 − τ) θ^{μ′}
        end for
    end for

Algorithm 1 of the paper "Continuous Control with Deep Reinforcement Learning" by Timothy P. Lillicrap et al.
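
Below is a self-contained toy sketch of one DDPG update under strong simplifications (scalar states and actions, a linear actor, a critic quadratic in the action; all constants are arbitrary). It mirrors the structure of the algorithm above but is not the paper's implementation:

    import numpy as np

    gamma, tau, lr_actor, lr_critic = 0.99, 0.001, 1e-4, 1e-3

    # Critic q(s, a; w) = w0 + w1*s + w2*a + w3*a^2 (quadratic so grad_a is useful),
    # actor mu(s; th) = th0 + th1*s; target networks start as copies.
    w = np.array([0.0, 0.1, 0.1, -0.1]); w_targ = w.copy()
    th = np.array([0.0, 0.5]);           th_targ = th.copy()

    def q(s, a, w):     return w[0] + w[1]*s + w[2]*a + w[3]*a*a
    def dq_da(s, a, w): return w[2] + 2*w[3]*a
    def mu(s, th):      return th[0] + th[1]*s

    # One minibatch of transitions (s, a, r, s'), as if sampled from a replay buffer.
    s, a, r, s2 = (np.random.randn(64) for _ in range(4))

    # Critic: TD target from the target networks, then a step on the squared error.
    y = r + gamma * q(s2, mu(s2, th_targ), w_targ)
    td = q(s, a, w) - y
    features = np.stack([np.ones_like(s), s, a, a*a])     # grad_w q(s, a; w)
    w -= lr_critic * (features * td).mean(axis=1)

    # Actor: deterministic policy gradient, grad_th mu(s) * grad_a q(s, mu(s)).
    g = dq_da(s, mu(s, th), w)
    th += lr_actor * (np.stack([np.ones_like(s), s]) * g).mean(axis=1)

    # Target networks: exponential moving average with tau.
    w_targ  = tau * w  + (1 - tau) * w_targ
    th_targ = tau * th + (1 - tau) * th_targ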

11 Deep Deterministic Policy Gradients

Figure 2 of the paper "Continuous Control with Deep Reinforcement Learning" by Timothy P. Lillicrap et al.: performance curves for a selection of domains (Cart Pendulum Swing-up, Cartpole Swing-up, Fixed Reacher, Monoped Balancing, Gripper, Blockworld, Puck Shooting, Cheetah, Moving Gripper) using variants of DPG: the original DPG algorithm (minibatch NFQCA) with batch normalization (light grey), with target network (dark grey), with target networks and batch normalization (green), and with target networks from pixel-only inputs (blue). Target networks are crucial.

12 Deep Deterministic Policy Gradients

Table 1 of the paper "Continuous Control with Deep Reinforcement Learning" by Timothy P. Lillicrap et al.: results using the low-dimensional (lowd) version of the environment, the pixel representation (pix) and the DPG reference (cntrl), reporting average and best rewards ($R_{av}$, $R_{best}$) for each setting on environments including blockworld, canada, cart, cartpole variants, cheetah, fixedreacher variants, gripper, hopper, hyq, movinggripper, pendulum, reacher variants, walker2d and torcs.

13 Natural Policy Gradient

The following approach was introduced by Kakade (2002).

Using the policy gradient theorem, we are able to compute $\nabla v$. Normally, we update the parameters directly along this gradient. This choice is justified by the fact that the vector $d$ which maximizes $v_\pi(s; \theta + d)$, under the constraint that $|d|^2$ is bounded by a small constant, is exactly the gradient $\nabla v$.

Normally, the length $|d|^2$ is computed using the Euclidean metric. But in general, any metric could be used. Representing a metric by a positive-definite matrix $G$ (the identity matrix for the Euclidean metric), we can compute the distance as $|d|_G^2 = \sum_{ij} G_{ij} d_i d_j = d^T G d$. The steepest ascent direction is then given by $G^{-1} \nabla v$.

Note that when $G$ is the Hessian $H$, the above process is exactly Newton's method.
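
A tiny numerical illustration (the metric and gradient are invented): under a metric $G$, the steepest-ascent direction is $G^{-1} \nabla v$ rather than $\nabla v$ itself:

    import numpy as np

    grad = np.array([1.0, 1.0])                 # plain gradient of v
    G = np.array([[100.0, 0.0], [0.0, 1.0]])    # metric: first direction "expensive"

    d = np.linalg.solve(G, grad)                # steepest ascent under G: G^{-1} grad
    print(d)  # [0.01, 1.0] -- the metric shrinks the step in the expensive direction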

14 Natural Policy Gradient

Figure 3 of the paper "Reinforcement learning of motor skills with policy gradients" by Jan Peters et al.: (a) vanilla policy gradient vs. (b) natural policy gradient.

15 Natural Policy Gradient

A suitable choice for the metric is the Fisher information matrix, defined as

$$F_s(\theta) \stackrel{\text{def}}{=} \mathbb{E}_{\pi(a \mid s; \theta)} \Big[ \nabla_\theta \log \pi(a \mid s; \theta) \big( \nabla_\theta \log \pi(a \mid s; \theta) \big)^T \Big].$$

It can be shown that the Fisher information metric is the only Riemannian metric (up to rescaling) invariant to changes of parameters under sufficient statistics.

Recall the Kullback-Leibler distance (or relative entropy) defined as

$$D_{\mathrm{KL}}(p \,\|\, q) \stackrel{\text{def}}{=} \sum_i p_i \log \frac{p_i}{q_i} = H(p, q) - H(p).$$

The Fisher information matrix is also the Hessian of $D_{\mathrm{KL}}\big(\pi(a \mid s; \theta) \,\|\, \pi(a \mid s; \theta')\big)$:

$$F_s(\theta)_{ij} = \frac{\partial^2}{\partial \theta'_i \partial \theta'_j} D_{\mathrm{KL}}\big(\pi(a \mid s; \theta) \,\|\, \pi(a \mid s; \theta')\big) \Big|_{\theta' = \theta}.$$
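
As an empirical check (the logits are invented), the Fisher matrix of a categorical softmax policy can be computed exactly from the definition and compared with its closed form $\operatorname{diag}(\pi) - \pi \pi^T$:

    import numpy as np

    theta = np.array([0.2, -0.1, 0.5])               # logits for one state
    pi = np.exp(theta) / np.exp(theta).sum()         # pi(a|s; theta) = softmax

    # Score of a categorical softmax: grad_theta log pi(a) = onehot(a) - pi.
    scores = np.eye(3) - pi                          # one row per action

    # F_s(theta) = E_{a~pi}[ score(a) score(a)^T ]
    F = sum(pi[a] * np.outer(scores[a], scores[a]) for a in range(3))

    # Closed form for softmax: diag(pi) - pi pi^T; the two agree.
    print(np.allclose(F, np.diag(pi) - np.outer(pi, pi)))  # True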

16 Natural Policy Gradient

Using the metric $F$, we want to update the parameters using $d_F \stackrel{\text{def}}{=} F(\theta)^{-1} \nabla v$, where $F(\theta) = \mathbb{E}_{s \sim \mu} F_s(\theta)$.

An interesting property of using $d_F$ to update the parameters is that while updating using $\nabla v$ will choose an arbitrary better action in state $s$, updating using $F_s(\theta)^{-1} \nabla v$ chooses the best action (maximizing expected return), similarly to tabular greedy policy improvement.

However, computing $d_F$ in a straightforward way is too costly.

17 Truncated Natural Policy Gradient

Duan et al. (2016), in the paper Benchmarking Deep Reinforcement Learning for Continuous Control, propose a modification of NPG to compute $d_F$ efficiently. Following Schulman et al. (2015), they suggest using the conjugate gradient algorithm, which can solve a system of linear equations $Ax = b$ in an iterative manner, by using only matrix-vector products $Av$ for suitable $v$.

Therefore, $d_F$ is found as a solution of

$$F(\theta)\, d_F = \nabla v,$$

and using only 10 iterations of the algorithm seems to suffice according to the experiments.

Furthermore, Duan et al. suggest using a specific learning rate suggested by Peters et al. (2008) of

$$\frac{\alpha}{\sqrt{(\nabla v)^T F(\theta)^{-1} \nabla v}}.$$
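
A minimal sketch of the conjugate gradient solver, using only a matrix-vector product function as described above (the explicit matrix below merely stands in for the Fisher matrix in the check; in TNPG the product $F(\theta) v$ is computed without materializing $F$):

    import numpy as np

    def conjugate_gradient(Av, b, iterations=10):
        """Solve A x = b using only the product function Av(v)."""
        x = np.zeros_like(b)
        r = b.copy()            # residual b - A x (x = 0 initially)
        p = r.copy()            # search direction
        rr = r @ r
        for _ in range(iterations):
            Ap = Av(p)
            alpha = rr / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r
            if rr_new < 1e-12:  # converged early
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    F = np.array([[4.0, 1.0], [1.0, 3.0]])   # stand-in SPD "Fisher" matrix
    g = np.array([1.0, 2.0])
    d_F = conjugate_gradient(lambda v: F @ v, g)
    print(np.allclose(d_F, np.linalg.solve(F, g)))  # True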

18 Trust Region Policy Optimization

Schulman et al. in 2015 wrote an influential paper introducing TRPO as an improved variant of NPG.

Considering two policies $\pi$ and $\tilde\pi$, we can write

$$v_{\tilde\pi} = v_\pi + \mathbb{E}_{s \sim \mu(\tilde\pi)}\, \mathbb{E}_{a \sim \tilde\pi(a \mid s)} \big[ a_\pi(s, a) \big],$$

where $a_\pi(s, a)$ is the advantage function $q_\pi(s, a) - v_\pi(s)$ and $\mu(\tilde\pi)$ is the on-policy distribution of the policy $\tilde\pi$.

Analogously to policy improvement, we see that if $\mathbb{E}_{a \sim \tilde\pi(a \mid s)} \big[ a_\pi(s, a) \big] \geq 0$ in every state, the policy performance increases (or stays the same if the advantages are zero everywhere).

However, sampling states $s \sim \mu(\tilde\pi)$ is costly. Therefore, we instead consider

$$L_\pi(\tilde\pi) = v_\pi + \mathbb{E}_{s \sim \mu(\pi)}\, \mathbb{E}_{a \sim \tilde\pi(a \mid s)} \big[ a_\pi(s, a) \big].$$

19 Trust Region Policy Optimization

$$L_\pi(\tilde\pi) = v_\pi + \mathbb{E}_{s \sim \mu(\pi)}\, \mathbb{E}_{a \sim \tilde\pi(a \mid s)} \big[ a_\pi(s, a) \big]$$

It can be shown that for a parametrized policy $\pi(a \mid s; \theta)$, $L_\pi(\tilde\pi)$ matches $v_{\tilde\pi}$ to the first order.

Schulman et al. additionally prove that if we denote $\alpha = \max_s D_{\mathrm{KL}}\big(\pi_{\mathrm{old}}(\cdot \mid s) \,\|\, \pi_{\mathrm{new}}(\cdot \mid s)\big)$, then

$$v_{\pi_{\mathrm{new}}} \geq L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) - \frac{4 \varepsilon \gamma}{(1 - \gamma)^2} \alpha, \quad \text{where } \varepsilon = \max_{s, a} |a_\pi(s, a)|.$$

Therefore, TRPO maximizes $L_{\pi_{\theta_{\mathrm{old}}}}(\pi_\theta)$ subject to $D_{\mathrm{KL}}^{\theta_{\mathrm{old}}}(\theta) < \delta$, where

- $D_{\mathrm{KL}}^{\theta_{\mathrm{old}}}(\theta) = \mathbb{E}_{s \sim \mu(\pi_{\theta_{\mathrm{old}}})} \big[ D_{\mathrm{KL}}\big(\pi_{\mathrm{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \big]$ is used instead of $\max_s D_{\mathrm{KL}}$ for performance reasons;
- $\delta$ is a constant found empirically, as the one implied by the above equation is too small;
- importance sampling is used to account for sampling actions from $\pi_{\theta_{\mathrm{old}}}$.

20 Trust Region Policy Optimization

The parameters are updated using $d_F = F(\theta)^{-1} \nabla L_{\pi_{\theta_{\mathrm{old}}}}(\pi_\theta)$, utilizing the conjugate gradient algorithm as described earlier for TNPG (note that the algorithm was designed originally for TRPO and only later employed for TNPG).

To guarantee improvement and respect the $D_{\mathrm{KL}}$ constraint, a line search is in fact performed. We start with the learning rate of $\sqrt{\delta / (d_F^T F(\theta)\, d_F)}$ and shrink it exponentially until the constraint is satisfied and the objective improves.
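
A schematic sketch of the backtracking line search; surrogate_improvement and kl below are placeholder callables standing for estimates computed from sampled data, not functions from any library:

    import numpy as np

    def line_search(theta, d_F, delta, surrogate_improvement, kl,
                    initial_lr, shrink=0.5, max_steps=10):
        """Shrink the step exponentially until the KL constraint holds
        and the surrogate objective improves."""
        lr = initial_lr
        for _ in range(max_steps):
            theta_new = theta + lr * d_F
            if kl(theta, theta_new) < delta and \
                    surrogate_improvement(theta, theta_new) > 0:
                return theta_new
            lr *= shrink
        return theta  # no acceptable step found; keep the old parameters

    # Dummy check with an invented quadratic objective and Euclidean "KL":
    t = np.array([1.0, 0.0])
    obj = lambda th: -np.sum((th - t) ** 2)
    new_theta = line_search(np.zeros(2), np.array([1.0, 0.0]), delta=2.0,
                            surrogate_improvement=lambda a, b: obj(b) - obj(a),
                            kl=lambda a, b: np.sum((b - a) ** 2),
                            initial_lr=1.0)
    print(new_theta)  # [1. 0.]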

21 Trust Region Policy Optimization

Figure 1 of the paper "Benchmarking Deep Reinforcement Learning for Continuous Control" by Duan et al.: illustration of locomotion tasks: (a) Swimmer; (b) Hopper; (c) Walker; (d) Half-Cheetah; (e) Ant; (f) Simple Humanoid; and (g) Full Humanoid.

Table 1 of the paper "Benchmarking Deep Reinforcement Learning for Continuous Control" by Duan et al.: mean returns (± standard deviations) of Random, REINFORCE, TNPG, RWR, REPS, TRPO, CEM, CMA-ES and DDPG on Cart-Pole Balancing, Inverted Pendulum, Mountain Car, Acrobot, Double Inverted Pendulum, Swimmer, Hopper, 2D Walker, Half-Cheetah, Ant, Simple Humanoid and Full Humanoid.

22 Proximal Policy Optimization

A simplification of TRPO which can be implemented using a few lines of code.

Let $r_t(\theta) \stackrel{\text{def}}{=} \frac{\pi(A_t \mid S_t; \theta)}{\pi(A_t \mid S_t; \theta_{\mathrm{old}})}$. PPO maximizes the objective

$$L^{\mathrm{CLIP}}(\theta) \stackrel{\text{def}}{=} \mathbb{E}_t \Big[ \min\big( r_t(\theta) \hat A_t,\ \mathrm{clip}(r_t(\theta), 1 - \varepsilon, 1 + \varepsilon) \hat A_t \big) \Big].$$

Such an $L^{\mathrm{CLIP}}(\theta)$ is a lower (pessimistic) bound of the unclipped objective.

Figure 1 of the paper "Proximal Policy Optimization Algorithms" by Schulman et al.: a single term of $L^{\mathrm{CLIP}}$ as a function of $r$, for positive ($A > 0$) and negative ($A < 0$) advantages.
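
A minimal sketch of the clipped objective (the numbers in the check are invented):

    import numpy as np

    def ppo_clip_objective(log_prob_new, log_prob_old, advantages, epsilon=0.2):
        """Clipped PPO surrogate: a pessimistic lower bound on the ratio objective."""
        r = np.exp(log_prob_new - log_prob_old)          # r_t(theta)
        unclipped = r * advantages
        clipped = np.clip(r, 1 - epsilon, 1 + epsilon) * advantages
        return np.mean(np.minimum(unclipped, clipped))   # maximized wrt theta

    # A large ratio with a positive advantage gets clipped, so the objective
    # stops rewarding further increases of the action's probability.
    print(ppo_clip_objective(np.log([0.9, 0.2]), np.log([0.3, 0.4]),
                             np.array([1.0, -1.0])))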

23 Proximal Policy Optimization

The advantages $\hat A_t$ are additionally estimated using generalized advantage estimation. Instead of the usual

$$\hat A_t \stackrel{\text{def}}{=} \sum_{i=0}^{T-t-1} \gamma^i R_{t+1+i} + \gamma^{T-t} V(S_T) - V(S_t),$$

the authors employ

$$\hat A_t \stackrel{\text{def}}{=} \sum_{i=0}^{T-t-1} (\gamma\lambda)^i \delta_{t+i}, \quad \text{where } \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t).$$

Algorithm 1: PPO, Actor-Critic Style
    for iteration = 1, 2, … do
        for actor = 1, 2, …, N do
            Run policy π_{θ_old} in the environment for T timesteps
            Compute advantage estimates Â_1, …, Â_T
        end for
        Optimize the surrogate L wrt θ, with K epochs and minibatch size M ≤ NT
        θ_old ← θ
    end for

Algorithm 1 of the paper "Proximal Policy Optimization Algorithms" by Schulman et al.
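
A minimal sketch of generalized advantage estimation, exploiting the recursion $\hat A_t = \delta_t + \gamma\lambda \hat A_{t+1}$ (the inputs in the check are invented):

    import numpy as np

    def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
        """Generalized advantage estimation: discounted sum of TD errors delta_t."""
        values = np.append(values, last_value)       # V(S_0), ..., V(S_T)
        advantages = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            running = delta + gamma * lam * running  # A_t = delta_t + gamma*lam*A_{t+1}
            advantages[t] = running
        return advantages

    print(gae(np.array([1.0, 1.0, 1.0]), np.array([0.5, 0.5, 0.5]), last_value=0.5))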

24 Proximal Policy Optimization

Figure 3 of the paper "Proximal Policy Optimization Algorithms" by Schulman et al.

25 Soft Actor Critic

The paper Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor by Tuomas Haarnoja et al. introduces a different off-policy algorithm for continuous action spaces.

The general idea is to introduce the entropy directly into the value function we want to maximize.
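
Concretely, the maximum-entropy objective from the paper augments the expected return with the policy entropy at every visited state, weighted by a temperature $\alpha$:

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[ r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \big].$$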

26 Soft Actor Critic

Algorithm 1: Soft Actor-Critic
    Initialize parameter vectors ψ, ψ̄, θ, φ.
    for each iteration do
        for each environment step do
            a_t ∼ π_φ(a_t | s_t)
            s_{t+1} ∼ p(s_{t+1} | s_t, a_t)
            D ← D ∪ {(s_t, a_t, r(s_t, a_t), s_{t+1})}
        end for
        for each gradient step do
            ψ ← ψ − λ_V ∇̂_ψ J_V(ψ)
            θ_i ← θ_i − λ_Q ∇̂_{θ_i} J_Q(θ_i) for i ∈ {1, 2}
            φ ← φ − λ_π ∇̂_φ J_π(φ)
            ψ̄ ← τ ψ + (1 − τ) ψ̄
        end for
    end for

Algorithm 1 of the paper "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" by Tuomas Haarnoja et al.

27 Soft Actor Critic

Figure 1 of the paper "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" by Tuomas Haarnoja et al.: average return as a function of millions of environment steps for SAC, DDPG, PPO, SQL and TD3 (concurrent) on (a) Hopper-v1, (b) Walker2d-v1, (c) HalfCheetah-v1, (d) Ant-v1, (e) Humanoid-v1 and (f) Humanoid (rllab).
