Reinforcement Learning

Size: px

Start display at page:

Download "Reinforcement Learning"

Clyde Manning
5 years ago
Views:

1 Reinforcement Learning Policy Search: Actor-Critic and Gradient Policy search Mario Martin CS-UPC May 28, 2018 Mario Martin (CS-UPC) Reinforcement Learning May 28, 2018 / 63

2 Goal of this lecture So far we approximated the value or action-value function using parameters θ (e.g. neural networks) V θ V π Q θ (s, a) V π (s) A policy was generated directly from the value function e.g. using ɛ- greedy In this lecture we will directly parameterize the policy in a stochastic setting π θ (a s) = P θ (a s) and do a direct Policy search Again on model-free setting Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

Three approaches to RL Value based learning: Implicit policy Learn value function Q θ (s, a) and from there infer policy π(s) = arg max a Q(s, a) Policy based learning: No value function Explicitly

3 Three approaches to RL Value based learning: Implicit policy Learn value function Q θ (s, a) and from there infer policy π(s) = arg max a Q(s, a) Policy based learning: No value function Explicitly learn policy π θ (a s) that implicitly maximize reward over all policies Actor-Critic learning: Learn both Value Function and Policy Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

4 Advantges of Policy over Value approach Advantages: In some cases, computing Q-values is harder than picking optimal actions Better convergence properties Effective in high dimensional or continuous action spaces Can benefit from demonstrations Policy subspace can be chosen according to the task Exploration can be directly controlled Can learn stochastic policies Disadvantages: Typically converge to a local optimum rather than a global optimum Evaluating a policy is typically data inefficient and high variance Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

5 Stochastic Policies In general, two kinds of policies: Deterministic policy a = π θ (s) Stochastic policy P(a s) = π θ (a s) Not new, ɛ-greedy is stochastic, but different idea. Stochastic policy is good on its own, no because it is an approx. of a greedy policy Any example where stochastic could be better that deterministic? Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

Stochastic Policies: Rock-Paper-Scissors Two player game of rock paper scissors: Scissors beats paper Rock beats scissors Paper beats rock Consider policies for iterated rock

6 Stochastic Policies: Rock-Paper-Scissors Two player game of rock paper scissors: Scissors beats paper Rock beats scissors Paper beats rock Consider policies for iterated rock paper scissors A deterministic policy is easily exploited A uniform random policy is optimal (i.e., Nash equilibrium) Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

7 Stochastic Policies: Aliased World (POMDSPs) The agent cannot differentiate the grey states Consider features of the following form: φ d (s) = 1(wall to d) d {N, E, S, W } Compare value-based RL, using an approximate value function Q θ (s, a) = f θ (φ(s, a)) To policy-based RL, using a parametrized policy π θ (a s) = g θ (φ(s, a)) Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

8 Stochastic Policies: Aliased World (POMDSPs) Under aliasing, an optimal deterministic policy will either move W in both gray states move E in both gray states Either way, it can get stuck and never reach the money Value based RL learns a near deterministic policy So it will traverse the corridor for a long time Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

Stochastic Policies: Aliased World (POMDSPs) An optimal stochastic policy will randomly move E or W in gray states π θ (move E wall to N and S) = 0.5 π θ (move W wall to N and S) = 0.

9 Stochastic Policies: Aliased World (POMDSPs) An optimal stochastic policy will randomly move E or W in gray states π θ (move E wall to N and S) = 0.5 π θ (move W wall to N and S) = 0.5 It will reach the goal state in a few steps with high probability Policy based RL can learn the optimal stochastic policy Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

10 Policy Objective Functions Goal: given policy π θ (a s) with parameters θ, find best θ... but how do we measure the quality of a policy π θ? In episodic environments we can use the start value J 1 (θ) = V π θ (s 1 ) Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

11 Policy Objective Functions In continuing environments we can use the average value J avv (θ) = s d π θ (s)v π θ (s) where d π θ(s) is stationary distribution of Markov chain for π θ Or the average reward per time-step J avr (θ) = s d π θ (s) a π θ (a s)r(s, a) where d π θ(s) is the expected number of time steps on s in a randomly generated episode following π θ divided by time steps of trial For simplicity, we will mostly discuss the episodic case, but can easily extend to the continuing / infinite horizon case Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

12 Policy optimization Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

13 Policy optimization Goal: given policy π θ (a s) with parameters θ, find best θ Policy based reinforcement learning is an optimization problem Find policy parameters θ that maximize J(θ) Two approaches for solving the optimization problem Gradient-free Policy-gradient Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

14 Subsection 1 Gradient Free Policy Optimization Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

15 Gradient Free Policy Optimization Goal: given parametrized method (with parameters θ) to approximate policy π θ (a s), find best values for θ Policy based reinforcement learning is an optimization problem Find policy parameters θ that maximize J(θ) Some approaches do not use gradient Hill climbing Simplex / amoeba / Nelder Mead Genetic algorithms Cross-Entropy method (CEM) Covariance Matrix Adaptation (CMA) Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

16 Gradient Free Policy Optimization Goal: given parametrized method (with parameters θ) to approximate policy π θ (a s), find best values for θ Policy based reinforcement learning is an optimization problem Find policy parameters θ that maximize J(θ) Some approaches do not use gradient Hill climbing Simplex / amoeba / Nelder Mead Genetic algorithms Cross-Entropy method (CEM) Covariance Matrix Adaptation (CMA) Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

17 Cross-Entropy Method (CEM) A simplified version of Evolutionary algorithm Works embarrassingly well in some problems, f.i. Playing Tetris (Szita et al., 2006), (Gabillon et al., 2013) A variant of CEM called Covariance Matrix Adaptation has become standard in graphics (Wampler et al., 2009) Very simple idea: 1 From current policy, sample N trials (large) 2 Take the M trials with larger Long.return (we call the elite) 3 Fit new policy to behave as in M best sessions 4 Repeat until satisfied Policy improves gradually Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

18 Tabular Cross-Entropy Tabular Cross-Entropy Algorithm Given M (f.i, 20), N (f.i. 200) Initialize matrix policy π(a s) = A s,a randomly repeat Sample N roll-outs of the policy and collect for each R t elite = M best samples π(a s) = until convergence return π Notice! No value functions! [times in M samples took a in s] + λ [times in M samples was at s] + λ A Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

19 Tabular Cross-Entropy If you were in some state only once, you only take this action now Solution: Introduction of λ, a parameter to smooth probabilities Due to randomness, algorithm will prefer lucky sessions (training on lucky sessions is no good) Solution: run several simulations with these state-action pairs and average the results. Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

20 Approximated Cross-Entropy Method (CEM) Approximated Cross-Entropy Method Given M (f.i, 20), N (f.i. 200) and function approximation (f.i. NN) depending on θ Initialize θ randomly repeat Sample N roll-outs of the policy and collect for each R t elite = M best [ samples ] θ = θ + α s,a elite log π θ(a s) until convergence return π θ Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

21 Approximated Cross-Entropy Method (CEM) No Value function involved Notice that best policy is: arg max log π θ (a s) = arg max π θ π θ s,a elite s,a elite π θ (a s) so gradient goes in that direction (some theory about Entropy behind) Intuitively, is the policy that maximizes similarity with behavior of successful samples Tabular case is a particular case of this algorithm I promised no gradient, but notice that gradient is for the approximation, not for the policy Can easily be extended to continuous action spaces (f.i. robotics) Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

22 Grafient Free methods Often a great simple baseline to try Benefits Can work with any policy parameterizations, including non-differentiable Frequently very easy to parallelize (faster training time) Limitations Typically not very sample efficient because it ignores temporal structure Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

23 Subsection 2 Policy gradient Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

24 Policy gradient methods Policy based reinforcement learning is an optimization problem Find policy parameters θ that maximize V π θ We have seen gradient-free methods, but greater efficiency often possible using gradient in the optimization Pletora of methods: Gradient descent Conjugate gradient Quasi-newton We focus on gradient ascent, many extensions possible And on methods that exploit sequential structure Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

25 Policy gradient differences wrt Value methods With Value functions we use Greedy updates: V π 0 small change π 1 θ π = arg max E πθ [Q π (s, a)] θ large change V π 1 small change π 2 large change V π 2 Potentially unstable learning process with large policy jumps because arg max is not differentiable On the other hand, Policy Gradient updates are: θ π = θ π + α J(θ) θ Stable learning process with smooth policy improvement Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

26 Policy gradient method Define J(θ) = J π θ to make explicit the dependence of the evaluation policy on the policy parameters Assume episodic MDPs Policy gradient algorithms search for a local maximum in J(θ) by ascending the gradient of the policy, w.r.t parameters θ θ = α θ J(θ) Where θ J(θ) is the policy gradient and α is a step-size parameter Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

27 Computing the gradient analytically We now compute the policy gradient analytically Assume policy is differentiable whenever it is non-zero Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

28 Computing the gradient analytically We now compute the policy gradient analytically Assume policy is differentiable whenever it is non-zero and that we know the gradient θ π θ (a s) Denote a state-action trajectory (or trial) τ as τ = (s 0, a 0, r 1, s 1, a 1, r 2,... s T 1, a T 1, r T, s T ) Define long-term-reward to be the sum of rewards for the trajectory (R(τ)) T R(τ) = r(s t ) It can be discounted or not. Now not important because we will not use Bellman equations. t=1 Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

29 Computing the gradient analytically The value of the policy J(θ) is: J(θ) = E πθ [R(τ)] = τ P(τ θ)r(τ) where P(τ θ) denotes the probability of trajectory τ when following policy π θ Notice that sum is for all possible trajectories In this new notation, our goal is to find the policy parameters theta) that: arg max θ J(θ) = arg max θ P(τ θ)r(τ) τ Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

30 [Log-trick: a convenient equality] In general, assume we want to compute log f (x) : log f (x) = 1 f (x) f (x) f (x) log f (x) = f (x) It can be applied to any function and we can use the equality in any direction f (x) The term f (x) is called likelihood ratio and is used to analytically compute the gradients Btw. Notice the caveat... Assume policy is differentiable whenever it is non-zero. Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

31 Computing the gradient analytically In this new notation, our goal is to find the policy parameters θ that: arg max J(θ) = arg max P(τ θ)r(τ) θ θ So, taken the gradient wrt θ θ J(θ) = θ P(τ θ)r(τ) = τ τ τ θ P(τ θ)r(τ) = τ = τ = τ P(τ θ) P(τ θ) θp(τ θ)r(τ) P(τ θ)r(τ) θp(τ θ) P(τ θ) P(τ θ)r(τ) θ log P(τ θ) Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

32 Computing the gradient analytically Goal is to find the policy parameters θ that: arg max J(θ) = arg max P(τ θ)r(τ) θ θ So, taken the gradient wrt θ τ θ J(θ) = τ P(τ θ)r(τ) θ log P(τ θ) Of course we cannot compute all trajectories... Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

33 Computing the gradient analytically Goal is to find the policy parameters θ that: arg max J(θ) = arg max P(τ θ)r(τ) θ θ So, taken the gradient wrt θ τ θ J(θ) = τ P(τ θ)r(τ) θ log P(τ θ) Of course we cannot compute all trajectories...but we can sample m trajectories because of the form of the equation θ J(θ) (1/m) m R(τ i ) θ log P(τ i θ) i=1 Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

34 Computing the gradient analytically: at last! Sample m trajectories: θ J(θ) (1/m) m R(τ i ) θ log P(τ i θ) i=1 However, we still have a problem, we don t know the how to compute θ log P(τ θ) Fortunately, we can derive it from the stochastic policy θ log P(τ θ) = θ log [ µ(s 0 ) T 1 i=0 π θ (a i s i )P(s i+1 s i, a i ) ] T 1 = θ [log µ(s 0 ) + log π θ (a i s i ) + log P(s i+1 s i, a i ) = T 1 i=0 i=0 θ log π θ (a i s i ) }{{} No dynamics model required! ] Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

35 Computing the gradient analytically We assumed at the beginning that policy is differentiable and that we now the derivative wrt parameters θ So, we have the desired solution: θ J(θ) (1/m) m i=1 T 1 R(τ i ) θ log π θ (a i s i ) i=0 Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

36 Differentiable policies? Soft-max One popular way to do action selection instead of using ɛ-greedy is to assign probabilities to actions according to values: π(a s) = eq(s,a)/τ a eq(s,a )/τ eq(s,a)/τ where τ is parameter that controls exploration. Let s assume τ = 1 Let s consider the case where Q(s, a) = φ T (s, a)θ is approximated by a linear function e φt (s,a)θ θ log π θ (a i s i ) = θ log a eφt (s,a)θ = θ φ T (s, a)θ θ φ T (s, a)θ = φ(s, a) E πθ [φ(s, )] a Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

37 Differentiable policies? Gaussian Policy In continuous spaces of actions, action is generated by a random distribution with parameters. The most popular in Gaussian distribution Parameter of the Gaussian is a linear combination of feature (µ = φ T (s, a) θ). Variance σ 2 can be fixed or also approximated. Policy select actions following Gaussian distribution: In this case, a N (µ θ (s), σ 2 ) = 1 (a φt (s,a)θ) 2 2πσ 2 e 2σ 2 θ log π θ (a s) = θ (a φt (s, a) θ) 2 φ(s) 2σ 2 θ log σ 2 = (a φt (s, a) θ) φ(s) σ 2 Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

38 Differentiable policies? Deep Neural Network A very popular way to approximate the policy is to use a Deep NN with soft-max last layer with so many neurons as actions. In this case, use autodiff of the neural network package you use! In tensorflow: loss = - tf.reduce mean(tf.log(prob outputs) * reward) where prob outputs is the output layer of the DNN Backpropagation implemented will do the work for you. Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

39 Vanilla Policy Gradient Vanilla Policy Gradient Given architecture with parameters θ to implement π θ Initialize θ randomly repeat Generate episode {s 1, a 1, r 2,... s T 1, a T 1, r T, s T } π θ Get R long-term return for episode for all time steps t = 1 to T 1 do θ θ + α θ log π θ (a t s t )R end for until convergence Substitute θ log π θ (a t s t ) with appropriate equation Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

40 Vanilla Policy Gradient Remember: θ J(θ) (1/m) m i=1 Unbiased but very noisy Fixes that can make it practical Temporal structure Baseline T 1 R(τ i ) θ log π θ (a i s i ) i=0 Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

41 Subsection 3 Reduce variance using temporal structure: Reinforce and Actor-Critic architectures Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

42 Policy Gradient using Temporal structure Instead on focusing on reward of trajectories, J(θ) = E πθ [R(τ)] = τ P(τ θ)r(τ) We want to optimize the expected return J avv (θ) = s d π θ (s)v (s) = s d π θ π θ (a s)q(s, a) where d π θ(s) is the expected number of time steps on s in a randomly generated episode following π θ divided by time steps of trial Let s start with and MDP with one single step. J avr (θ) = d π θ (s)v (s) = d π θ π θ (a s)r(s, a) s s a a Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

43 Policy Gradient using Temporal structure J avr (θ) = d π θ π θ (a s)r(s, a) s a θ J avr (θ) = θ d π θ π θ (a s)r(s, a) s d π θ d π θ a = θ (π θ (a s)r(s, a)) s a = θ π θ (a s) log π θ (a s)r(s, a) s a = E πθ [ θ log π θ (a s)r(s, a)] And this expectation can be sampled Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

44 Policy Gradient theorem! The policy gradient theorem generalize the likelihood ratio approach to multi step MDPs Replaces instantaneous reward r with long term value Q(s, a) Policy gradient theorem applies to all objective functions we have seen Policy gradient theorem For any differentiable policy π θ (s, a), for any of the policy objective functions J = J 1, J avr or J avv, the policy gradient is: θ J(θ) = E πθ [ θ log π θ (a s)q π θ (s, a)] Simple proof in pag. 269 of (Sutton 2016) Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

45 REINFORCE algorithm REINFORCE algorithm (also called Monte Carlo Policy Gradient) use policy gradient theorem and long-term reward R as unbiased sample of Q π θ(s, a) REINFORCE algorithm Given architecture with parameters θ to implement π θ Initialize θ randomly repeat Generate episode {s 1, a 1, r 2,... s T 1, a T 1, r T, s T } π θ for all time steps t = 1 to T 1 do Get R t long-term return from step t to T θ θ + α θ log π θ (a t s t )R t end for until convergence Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

46 REINFORCE algorithm Let s analyze the update: Let s us rewrite is as follows θ θ + α θ log π θ (a t s t )R t θ θ + α θπ θ (a t s t ) π θ (a t s t ) R t Update is proportional to: the product of a return Rt and the gradient of the probability of taking the action actually taken, divided by the probability of taking that action. Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

47 REINFORCE algorithm Update: θ θ + α θπ θ (a t s t ) π θ (a t s t ) R t...move most in the directions that favor actions that yield the highest return...is inversely proportional to the action probability (actions that are selected frequently are at an advantage (the updates will be more often in their direction)) Is it necessary to change something in the algorithm for continuous actions? Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

48 REINFORCE algorithm Update: θ θ + α θπ θ (a t s t ) π θ (a t s t ) R t...move most in the directions that favor actions that yield the highest return...is inversely proportional to the action probability (actions that are selected frequently are at an advantage (the updates will be more often in their direction)) Is it necessary to change something in the algorithm for continuous actions? No! Just uses a continuous action policy mechanism and everything is the same! Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

49 REINFORCE algorithm with baseline Monte-Carlo policy gradient still has high variance because R t has a lot of variance We can reduce variance subtracting a baseline to the estimator θ θ + α θ log π θ (a t s t )(R t b(s t )) without introducing any bias when baseline does not depend on actions taken A good baseline is b(s t ) = V π θ(s t ) so we will use that Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

50 REINFORCE algorithm with baseline Monte-Carlo policy gradient still has high variance because R t has a lot of variance We can reduce variance subtracting a baseline to the estimator θ θ + α θ log π θ (a t s t )(R t b(s t )) without introducing any bias when baseline does not depend on actions taken A good baseline is b(s t ) = V π θ(s t ) so we will use that How to estimate V π θ? We ll use another set of parameters w to approximate Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

51 REINFORCE algorithm with baseline REINFORCE algorithm with baseline (aka MC Actor Critic) Given architecture with parameters θ to implement π θ and parameters w to approximate V Initialize θ randomly repeat Generate episode {s 1, a 1, r 2,... s T 1, a T 1, r T, s T } π θ for all time steps t = 1 to T 1 do Get R t long-term return from step t to T δ R t V w (s t ) w w + βδ w V w (s t ) θ θ + αδ θ log π θ (a t s t ) end for until convergence Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

52 Actor-Critic Architectures Monte-Carlo policy gradient has high variance So we used a baseline to reduce the variance R t V (s t ) Can we do something to speed up learning like we did with MC using TD? Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

53 Actor-Critic Architectures Monte-Carlo policy gradient has high variance So we used a baseline to reduce the variance R t V (s t ) Can we do something to speed up learning like we did with MC using TD? Yes, use different estimators of R t that do bootstrapping f.i. TD(0), n-steps, etc. These algorithms are called Actor Critic Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

54 Actor-Critic Architectures The Critic, evaluates the current policy and the result is used in the policy training The Actor implements the policy and is trained using Policy Gradient with estimations from the critic Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

55 Actor-Critic Architectures Actor-critic algorithms maintain two sets of parameters (like in REINFORCE with baseline): Critic parameters: approximation parameters w for action-value function under current policy Actor parameters: policy parameters θ Actor-critic algorithms follow an approximate policy gradient: Critic: Updates action-value function parameters w like in policy evaluation updates (you can apply everything we saw in FA for prediction) Actor: Updates policy gradient θ, in direction suggested by critic Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

56 Actor-Critic Architectures Actor updates are always in the same way: θ θ + α θ log π θ (a t s t )G t where G t is the evaluation of long-term returned by the critic for s t Critic updates are done to evaluate the current policy w w + αδ θ V w (a t s t ) where δ is the estimated error in evaluating the s state and that implements the kind of bootstrapping done. Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

57 One step Actor Critic (QAC) One step actor-critic: δ r + Q w (s, a ) Q w (s, a) One step Actor Critic Given architecture with parameters θ to implement π θ and parameters w to approximate Q Initialize θ randomly repeat Set s to initial state Get a from π θ repeat Take action a and observe reward r and new state s Get a from π θ δ r + Q w (s, a ) Q w (s, a) w w + βδ w Q w (s, a) θ θ + α θ log π θ (a s)q w (s, a) s s until s is terminal until convergence Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

58 Advantage Actor Critic (AAC or A2C) In this critic Advantage value function is used: A π θ (s, a) = Q π θ (s, a) V π θ (s) The advantage function can significantly reduce variance of policy gradient So the critic should really estimate the advantage function, for instance, estimating both V(s) and Q using two function approximators and two parameter vectors: V π θ (s) V v (s) (1) Q π θ (s, a) Q w (s, a) (2) A(s, a) = Q w (s, a) V v (s) (3) And updating both value functions by e.g. TD learning Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

59 From A2C to REINFORCE with baseline One way to implement A2C method without two different networks to estimate Q w (s, a) and V v (s) is the following. For the true value function V π θ(s), the TD error δ π θ(s) δ π θ(s) = r + γv π θ (s ) V π θ (s) is an unbiased estimate of the advantage function: E πθ [δ π θ s, a] = E π θ [ r + γv π θ (s ) s, a ] V π θ (s) = Q π θ (s, a) V π θ (s) = A π θ (s, a) So we can use the TD error to compute the policy gradient θ J(θ) = E πθ [ θ log π θ (a s)δ π θ ] In practice this approach only requires one set of critic parameters v to approximate TD error δ v = r + γv v (s ) V v (s) Notice this algorithm is exactly REINFORCE with baseline Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

60 Asyncrhonous Advantage Actor Critic (A3C) A3C (Mnih et al. 2016) idea: Sample for data can be parallelized using several copies of the same agent use N copies of the agents (workers) working in parallel collecting samples and computing gradients for policy and value function After some time, pass gradients to a main network that updates actor and critic using the gradients of all After some time the worker copy the weights of the global network This parallelism decorrelates the agents data, so no Experience Replay Buffer needed Even one can explicitly use different exploration policies in each actor-learner to maximize diversity Asynchronism can be extended to other update mechanisms (Sarsa, Q-learning...) but it works better in Advantage Actor critic setting Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

61 Generalized Advanced Estimator (GAE) Generalized Advanced Estimator (Schulman et al. 2016) Use a version of Advantage that consider weighted average of n-steps estimators of advantage like in TD(λ): A π GAE = (λγ) t t [r t +1 + γvθ π (s t +1) Vθ π (s t }{{ )] } t =t t -step advantage Used in continuous setting for locomotion tasks Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

62 Subsection 4 Conclusions and other approaches Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

63 Summary The policy gradient has many equivalent forms θ J(θ) = E πθ [ θ log π θ (a s)r t ] REINFORCE = E πθ [ θ log π θ (a s)q w (s, a)] Actor-Critic = E πθ [ θ log π θ (a s)a w (s, a)]] Advantage Actor-Critic = E πθ [ θ log π θ (a s)δ] TD Actor-Critic = E πθ [ θ log π θ (a s)δe] TD(λ) Actor-Critic Each leads a stochastic gradient ascent algorithm Critic uses policy evaluation (e.g. MC or TD learning) to estimate Q π (s, a), A π (s, a) or V π (s) Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

64 Compatible Function Approximation: Bias in AC Approximating the policy gradient introduces bias A biased policy gradient may not find the right solution Luckily, if we choose value function approximation carefully, then we can avoid bias If the following two conditions are satisfied: 1 Value function approximator is compatible to the policy w Q w (s, a) = θ logπ θ (a s) 2 Value function parameters w minimize the mean-squared error w E πθ [ (Q π θ (s, a) Q w (s, a)) 2] = 0 Then the policy gradient is without bias Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

65 Problems with Policy Gradient Directions Goal: Each step of policy gradient yields an updated policy π whose value is greater than or equal to the prior policy π: V π V π Several inefficiencies: Gradient ascent approaches update the weights a small step in direction of gradient Gradient ascent algorithms can follow any ascent direction (a good ascent direction can significantly speed convergence) Gradient is First order / linear approximation of the value function s dependence on the policy parameterization instead of actual policy 1 1 A policy can often be re parameterized without changing action probabilities (f.i., increasing score of all actions in a softmax policy). Vanilla gradient is sensitive to these re parameterizations. Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

66 About step size Step size is important in any problem involving finding the optima of a function Supervised learning: Step too far next updates will fix it But in Reinforcement learning Step too far bad policy Next batch: collected under bad policy Policy is determining data collect! Essentially controlling exploration and exploitation trade o due to particular policy parameters and the stochasticity of the policy May not be able to recover from a bad choice, collapse in performance! Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

67 Better Policy Gradient Directions: Natural Gradient A more efficient gradient in learning problems is the natural gradient It corresponds to steepest ascent in policy space and not in the parameter space with right step size Also, the natural policy gradient is parametrization independent Convergence to a local minimum is guaranteed It finds ascent direction that is closest to vanilla gradient, when changing policy by a small, fixed amount nat θ π θ (a s) = G 1 θ θ π θ (a s) Where G θ is the Fisher information matrix G θ = E πθ [ θ log π θ (a s) θ log π θ (a s) T ] Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

68 Natural Actor Critic (Peters et al 2005) Under linear model modelization of critic: A π θ (s, a) = φ(s, a) T w Using compatible function approximation, w A w (s, a) = θ log π θ (a s) The natural policy gradient nicely simplifies, θ J(θ) = E πθ [ θ log π θ (a s)a π θ (s, a)] ] = E πθ [ θ log π θ (a s) θ log π θ (a s) T w = G θ w nat θ J(θ) = w i.e. update actor parameters in direction of critic parameters Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

69 TRPO (Schulman et al 2017) Trust Region Policy Optimization (TRPO) maximize parameters that change the policy increasing advantage in action over wrt. old policy in proximal spaces to avoid too large step size. arg max θ L θold (θ) = arg max θ [ T 1 E s0: t=0 [ ] ] πθ (a t s t ) E a θ π θold (a t s t ) A θ(s t, a t ) Under penalizing constraint (using KL divergence of θ and θ old ) that ensures improvement of the policy in the proximity (small step size) Solves using Natural Gradient Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

70 TRPO (Schulman et al 2017) Trust Region Policy Optimization (TRPO) maximize parameters that change the policy increasing advantage in action over wrt. old policy in proximal spaces to avoid too large step size. arg max θ L θold (θ) = arg max θ [ T 1 E s0: t=0 [ ] ] πθ (a t s t ) E a θ π θold (a t s t ) A θ(s t, a t ) Under penalizing constraint (using KL divergence of θ and θ old ) that ensures improvement of the policy in the proximity (small step size) Solves using Natural Gradient Some TRPO videos here. Proximal Policy Optimization PPO inspired in TRPO simplifies computation Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

71 Deterministic Policy Gradient (DPGA) A different idea: Reparametrization Suppose problem has continuous action space, a R d Then Q(s,a) a tells use how to improve our action We can use reparameterization trick, so a is now a deterministic function a = f (s, z), where z is noise. Then, θ E τ [R] = θ Q π (s 0, a 0 ) + θ Q π (s 1, a 1 ) +... This method is called the deterministic policy gradient (Silver et al. 2014) A generalized version, which also uses a dynamics model, is described as the stochastic value gradient Mario Martin (CS-UPC) Reinforcement Learning May 28, / 63

Lecture 8: Policy Gradient

Lecture 8: Policy Gradient Hado van Hasselt Outline 1 Introduction 2 Finite Difference Policy Gradient 3 Monte-Carlo Policy Gradient 4 Actor-Critic Policy Gradient Introduction Vapnik s rule Never solve