Lecture 8: Policy Gradient I


1 Lecture 8: Policy Gradient I
Emma Brunskill, CS234 Reinforcement Learning, Winter 2018
Additional reading: Sutton and Barto 2018, Chp. 13
With many slides from or derived from David Silver and John Schulman and Pieter Abbeel

2 Last Time: We want RL Algorithms that Perform
Optimization
Delayed consequences
Exploration
Generalization
And do it statistically and computationally efficiently

3 Last Time: Generalization and Efficiency Can use structure and additional knowledge to help constrain and speed reinforcement learning Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 5 Winter / 62

4 Class Structure Last time: Imitation Learning This time: Policy Search Next time: Policy Search Cont. Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 6 Winter / 62

5 Table of Contents 1 Introduction 2 Policy Gradient 3 Score Function and Policy Gradient Theorem 4 Policy Gradient Algorithms and Reducing Variance Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 7 Winter / 62

6 Policy-Based Reinforcement Learning
In the last lecture we approximated the value or action-value function using parameters θ:
V_θ(s) ≈ V^π(s)
Q_θ(s, a) ≈ Q^π(s, a)
A policy was generated directly from the value function, e.g. using ε-greedy
In this lecture we will directly parametrize the policy:
π_θ(s, a) = P[a | s, θ]
Goal is to find a policy π with the highest value function V^π
We will focus again on model-free reinforcement learning

7 Value-Based and Policy-Based RL
Value-Based: learnt value function, implicit policy (e.g. ε-greedy)
Policy-Based: no value function, learnt policy
Actor-Critic: learnt value function, learnt policy

8 Advantages of Policy-Based RL
Advantages: better convergence properties; effective in high-dimensional or continuous action spaces; can learn stochastic policies
Disadvantages: typically converge to a local rather than global optimum; evaluating a policy is typically inefficient and high variance

9 Example: Rock-Paper-Scissors Two-player game of rock-paper-scissors Scissors beats paper Rock beats scissors Paper beats rock Consider policies for iterated rock-paper-scissors A deterministic policy is easily exploited A uniform random policy is optimal (i.e. Nash equilibrium) Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 11 Winter / 62

10 Example: Aliased Gridworld (1)
The agent cannot differentiate the grey states
Consider features of the following form (for all N, E, S, W):
φ(s, a) = 1(wall to N, a = move E)
Compare value-based RL, using an approximate value function Q_θ(s, a) = f(φ(s, a), θ), to policy-based RL, using a parametrised policy π_θ(s, a) = g(φ(s, a), θ)

11 Example: Aliased Gridworld (2) Under aliasing, an optimal deterministic policy will either move W in both grey states (shown by red arrows) move E in both grey states Either way, it can get stuck and never reach the money Value-based RL learns a near-deterministic policy e.g. greedy or ɛ-greedy So it will traverse the corridor for a long time Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 13 Winter / 62

12 Example: Aliased Gridworld (3)
An optimal stochastic policy will randomly move E or W in grey states:
π_θ(wall to N and S, move E) = 0.5
π_θ(wall to N and S, move W) = 0.5
It will reach the goal state in a few steps with high probability
Policy-based RL can learn the optimal stochastic policy

13 Policy Objective Functions
Goal: given a policy π_θ(s, a) with parameters θ, find best θ
But how do we measure the quality of a policy π_θ?
In episodic environments we can use the start value of the policy:
J_1(θ) = V^{π_θ}(s_1) = E_{π_θ}[v_1]
In continuing environments we can use the average value:
J_avV(θ) = Σ_s d^{π_θ}(s) V^{π_θ}(s)
where d^{π_θ}(s) is the stationary distribution of the Markov chain for π_θ
Or the average reward per time-step:
J_avR(θ) = Σ_s d^{π_θ}(s) Σ_a π_θ(s, a) R(s, a)
For simplicity, today will mostly discuss the episodic case, but can easily extend to the continuing / infinite horizon case

14 Policy optimization Policy based reinforcement learning is an optimization problem Find policy parameters θ that maximize V π θ Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 16 Winter / 62

15 Policy optimization Policy based reinforcement learning is an optimization problem Find policy parameters θ that maximize V π θ Can use gradient free optimization: Hill climbing Simplex / amoeba / Nelder Mead Genetic algorithms Cross-Entropy method (CEM) Covariance Matrix Adaptation (CMA) Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 17 Winter / 62

16 Recall Human-in-the-Loop Exoskeleton Optimization (Zhang et al. Science 2017)
Optimization was done using CMA-ES, a variant of Covariance Matrix Adaptation

17 Gradient Free Policy Optimization
Can often work embarrassingly well: "discovered that evolution strategies (ES), an optimization technique that's been known for decades, rivals the performance of standard reinforcement learning (RL) techniques on modern RL benchmarks (e.g. Atari/MuJoCo)"

18 Gradient Free Policy Optimization Often a great simple baseline to try Benefits Can work with any policy parameterizations, including non-differentiable Frequently very easy to parallelize Limitations Typically not very sample efficient because it ignores temporal structure Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 20 Winter / 62
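As a concrete illustration of the gradient-free approach, here is a minimal cross-entropy method (CEM) sketch, one of the methods listed two slides above. It assumes a hypothetical black-box evaluate_policy(theta) that rolls out the policy with parameters theta for a few episodes and returns the average return; that helper and all names below are illustrative, not from the slides.

```python
# A minimal cross-entropy method (CEM) sketch for gradient-free policy search.
# evaluate_policy(theta) is an assumed black-box returning an average episode
# return; it need not be differentiable.
import numpy as np

def cem_policy_search(evaluate_policy, dim, iterations=50,
                      pop_size=100, elite_frac=0.2, init_std=1.0):
    mean = np.zeros(dim)                      # mean of the sampling distribution
    std = np.full(dim, init_std)              # per-dimension standard deviation
    n_elite = int(pop_size * elite_frac)
    for _ in range(iterations):
        # Sample a population of candidate parameter vectors
        thetas = mean + std * np.random.randn(pop_size, dim)
        returns = np.array([evaluate_policy(th) for th in thetas])
        # Keep the top-performing candidates and refit the Gaussian to them
        elite = thetas[np.argsort(returns)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean
```

Because only whole-episode returns are used, this works even for non-differentiable policy parameterizations, but it ignores the temporal structure of the return, which is the sample-efficiency limitation noted on the slide above.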

19 Policy optimization
Policy based reinforcement learning is an optimization problem: find policy parameters θ that maximize V^{π_θ}
Can use gradient free optimization
Greater efficiency often possible using gradient:
Gradient descent
Conjugate gradient
Quasi-Newton
We focus on gradient descent, many extensions possible
And on methods that exploit sequential structure

20 Table of Contents 1 Introduction 2 Policy Gradient 3 Score Function and Policy Gradient Theorem 4 Policy Gradient Algorithms and Reducing Variance Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 22 Winter / 62

21 Policy Gradient
Define V(θ) = V^{π_θ} to make explicit the dependence of the value on the policy parameters
Assume episodic MDPs (easy to extend to related objectives, like average reward)

22 Policy Gradient
Define V(θ) = V^{π_θ} to make explicit the dependence of the value on the policy parameters
Assume episodic MDPs
Policy gradient algorithms search for a local maximum in V(θ) by ascending the gradient of the policy, w.r.t. parameters θ:
Δθ = α ∇_θ V(θ)
where ∇_θ V(θ) is the policy gradient and α is a step-size parameter
∇_θ V(θ) = (∂V(θ)/∂θ_1, ..., ∂V(θ)/∂θ_n)^T

23 Computing Gradients by Finite Differences
To evaluate the policy gradient of π_θ(s, a):
For each dimension k ∈ [1, n], estimate the kth partial derivative of the objective function w.r.t. θ by perturbing θ by a small amount ε in the kth dimension:
∂V(θ)/∂θ_k ≈ (V(θ + ε u_k) - V(θ)) / ε
where u_k is a unit vector with 1 in the kth component, 0 elsewhere

24 Computing Gradients by Finite Differences
To evaluate the policy gradient of π_θ(s, a):
For each dimension k ∈ [1, n], estimate the kth partial derivative of the objective function w.r.t. θ by perturbing θ by a small amount ε in the kth dimension:
∂V(θ)/∂θ_k ≈ (V(θ + ε u_k) - V(θ)) / ε
where u_k is a unit vector with 1 in the kth component, 0 elsewhere
Uses n evaluations to compute policy gradient in n dimensions
Simple, noisy, inefficient, but sometimes effective
Works for arbitrary policies, even if policy is not differentiable
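A short sketch of this finite-difference estimate, assuming a hypothetical black-box V(theta) that returns the (noisy) value of the policy with parameters theta, e.g. the average return over a few rollouts; the helper name is illustrative.

```python
# Finite-difference estimate of the policy gradient: one perturbation per
# parameter dimension, as on the slide above.
import numpy as np

def finite_difference_gradient(V, theta, eps=1e-2):
    grad = np.zeros_like(theta, dtype=float)
    base = V(theta)                       # V(theta)
    for k in range(len(theta)):
        u_k = np.zeros_like(theta, dtype=float)
        u_k[k] = 1.0                      # unit vector in the k-th dimension
        grad[k] = (V(theta + eps * u_k) - base) / eps
    return grad                           # n perturbed evaluations for n dimensions
```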

25 Training AIBO to Walk by Finite Difference Policy Gradient
Goal: learn a fast AIBO walk (useful for Robocup)
Adapt these parameters by finite difference policy gradient
Evaluate performance of policy by field traversal time
Kohl and Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. ICRA 2004. ai-lab/pubs/icra04.pdf

26 AIBO Policy Parameterization
AIBO walk policy is an open-loop policy: no state, choosing a set of action parameters that define an ellipse
Specified by 12 continuous parameters (elliptical loci):
The front locus (3 parameters: height, x-pos., y-pos.)
The rear locus (3 parameters)
Locus length
Locus skew multiplier in the x-y plane (for turning)
The height of the front of the body
The height of the rear of the body
The time each foot takes to move through its locus
The fraction of time each foot spends on the ground
New policies: for each parameter, randomly add (+ε, 0, or -ε)

27 AIBO Policy Experiments All of the policy evaluations took place on actual robots... only human intervention required during an experiment involved replacing discharged batteries... about once an hour. Ran on 3 Aibos at once Evaluated 15 policies per iteration. Each policy evaluated 3 times (to reduce noise) and averaged Each iteration took 7.5 minutes Used η = 2 (learning rate for their finite difference approach) Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 30 Winter / 62

28 Training AIBO to Walk by Finite Difference Policy Gradient Results Authors discuss that performance is likely impacted by: initial starting policy parameters, ɛ (how much policies are perturbed), η (how much to change policy), as well as policy parameterization Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 31 Winter / 62

29 AIBO Walk Policies

30 Table of Contents 1 Introduction 2 Policy Gradient 3 Score Function and Policy Gradient Theorem 4 Policy Gradient Algorithms and Reducing Variance Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 33 Winter / 62

31 Computing the gradient analytically
We now compute the policy gradient analytically
Assume policy π_θ is differentiable whenever it is non-zero and we know the gradient ∇_θ π_θ(s, a)

32 Likelihood Ratio Policies
Denote a state-action trajectory as τ = (s_0, a_0, r_0, ..., s_{T-1}, a_{T-1}, r_{T-1}, s_T)
Use R(τ) = Σ_{t=0}^{T-1} R(s_t, a_t) to be the sum of rewards for a trajectory τ

33 Likelihood Ratio Policies
Denote a state-action trajectory as τ = (s_0, a_0, r_0, ..., s_{T-1}, a_{T-1}, r_{T-1}, s_T)
Use R(τ) = Σ_{t=0}^{T-1} R(s_t, a_t) to be the sum of rewards for a trajectory τ
Policy value is
V(θ) = E_{π_θ}[ Σ_{t=0}^{T-1} R(s_t, a_t) ; π_θ ] = Σ_τ P(τ; θ) R(τ)    (1)
where P(τ; θ) is used to denote the probability over trajectories when executing policy π(θ)
In this new notation, our goal is to find the policy parameters θ:
arg max_θ V(θ) = arg max_θ Σ_τ P(τ; θ) R(τ)    (2)

34 Likelihood Ratio Policy Gradient
Goal is to find the policy parameters θ:
arg max_θ V(θ) = arg max_θ Σ_τ P(τ; θ) R(τ)    (3)
Take the gradient with respect to θ:
∇_θ V(θ) = ∇_θ Σ_τ P(τ; θ) R(τ)

35 Likelihood Ratio Policy Gradient
Goal is to find the policy parameters θ:
arg max_θ V(θ) = arg max_θ Σ_τ P(τ; θ) R(τ)    (4)
Take the gradient with respect to θ:
∇_θ V(θ) = ∇_θ Σ_τ P(τ; θ) R(τ)
= Σ_τ ∇_θ P(τ; θ) R(τ)
= Σ_τ (P(τ; θ) / P(τ; θ)) ∇_θ P(τ; θ) R(τ)
= Σ_τ P(τ; θ) R(τ) [∇_θ P(τ; θ) / P(τ; θ)]    (the likelihood ratio)
= Σ_τ P(τ; θ) R(τ) ∇_θ log P(τ; θ)

36 Likelihood Ratio Policy Gradient
Goal is to find the policy parameters θ:
arg max_θ V(θ) = arg max_θ Σ_τ P(τ; θ) R(τ)    (5)
Take the gradient with respect to θ:
∇_θ V(θ) = Σ_τ P(τ; θ) R(τ) ∇_θ log P(τ; θ)
Approximate with empirical estimate for m sample paths under policy π_θ:
∇_θ V(θ) ≈ ĝ = (1/m) Σ_{i=1}^{m} R(τ^(i)) ∇_θ log P(τ^(i); θ)

37 Score Function Gradient Estimator: Intuition
Consider the generic form of R(τ^(i)) ∇_θ log P(τ^(i); θ):
ĝ_i = f(x_i) ∇_θ log p(x_i | θ)
f(x) measures how good the sample x is
Moving in the direction ĝ_i pushes up the logprob of the sample, in proportion to how good it is
Valid even if f(x) is discontinuous and unknown, or the sample space (containing x) is a discrete set

38 Score Function Gradient Estimator: Intuition ĝ_i = f(x_i) ∇_θ log p(x_i | θ)

39 Score Function Gradient Estimator: Intuition ĝ_i = f(x_i) ∇_θ log p(x_i | θ)
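A toy numerical illustration of the estimator ĝ_i = f(x_i) ∇_θ log p(x_i | θ): take p = N(θ, 1), so ∇_θ log p(x | θ) = (x - θ), and treat f as a black box. The quadratic f below is just for illustration and is not from the slides.

```python
# Score function (likelihood ratio) gradient estimator on a toy problem:
# estimate d/dtheta E_{x ~ N(theta, 1)}[ f(x) ] using g_i = f(x_i) * (x_i - theta).
import numpy as np

def score_function_gradient(theta, f, num_samples=10000):
    x = theta + np.random.randn(num_samples)      # samples x_i ~ N(theta, 1)
    score = x - theta                             # grad_theta log p(x_i | theta)
    return np.mean(f(x) * score)                  # Monte Carlo average of g_i

f = lambda x: -(x - 3.0) ** 2                     # "goodness" of a sample, treated as a black box
print(score_function_gradient(theta=0.0, f=f))    # positive: pushes theta toward 3
```

With θ = 0 and f peaked at x = 3, the estimate is positive: it pushes up the logprob of good samples in proportion to how good they are, exactly the intuition above.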

40 Decomposing the Trajectories Into States and Actions
Approximate with empirical estimate for m sample paths under policy π_θ:
∇_θ V(θ) ≈ ĝ = (1/m) Σ_{i=1}^{m} R(τ^(i)) ∇_θ log P(τ^(i); θ)
∇_θ log P(τ^(i); θ) = ?

41 Decomposing the Trajectories Into States and Actions
Approximate with empirical estimate for m sample paths under policy π_θ:
∇_θ V(θ) ≈ ĝ = (1/m) Σ_{i=1}^{m} R(τ^(i)) ∇_θ log P(τ^(i); θ)
∇_θ log P(τ^(i); θ) = ∇_θ log [ μ(s_0) Π_{t=0}^{T-1} π_θ(a_t | s_t) P(s_{t+1} | s_t, a_t) ]
(μ(s_0): initial state distribution; π_θ(a_t | s_t): policy; P(s_{t+1} | s_t, a_t): dynamics model)
= ∇_θ [ log μ(s_0) + Σ_{t=0}^{T-1} log π_θ(a_t | s_t) + Σ_{t=0}^{T-1} log P(s_{t+1} | s_t, a_t) ]
= Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t | s_t)    (no dynamics model required!)

42 Score Function
Define the score function as ∇_θ log π_θ(s, a)

43 Likelihood Ratio / Score Function Policy Gradient
Putting this together, the goal is to find the policy parameters θ:
arg max_θ V(θ) = arg max_θ Σ_τ P(τ; θ) R(τ)    (6)
Approximate with empirical estimate for m sample paths under policy π_θ using the score function:
∇_θ V(θ) ≈ ĝ = (1/m) Σ_{i=1}^{m} R(τ^(i)) ∇_θ log P(τ^(i); θ)
= (1/m) Σ_{i=1}^{m} R(τ^(i)) Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t^(i) | s_t^(i))
Do not need to know the dynamics model

44 Policy Gradient Theorem
Theorem: The policy gradient theorem generalizes the likelihood ratio approach. For any differentiable policy π_θ(s, a), and for any of the policy objective functions J = J_1 (episodic reward), J_avR (average reward per time step), or (1/(1-γ)) J_avV (average value), the policy gradient is
∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(s, a) Q^{π_θ}(s, a) ]

45 Table of Contents 1 Introduction 2 Policy Gradient 3 Score Function and Policy Gradient Theorem 4 Policy Gradient Algorithms and Reducing Variance Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 48 Winter / 62

46 Likelihood Ratio / Score Function Policy Gradient
∇_θ V(θ) ≈ (1/m) Σ_{i=1}^{m} R(τ^(i)) Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t^(i) | s_t^(i))
Unbiased but very noisy
Fixes that can make it practical:
Temporal structure
Baseline
Next time will discuss some additional tricks

47 Policy Gradient: Use Temporal Structure
Previously:
∇_θ E_τ[R] = E_τ[ ( Σ_{t=0}^{T-1} r_t ) ( Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t | s_t) ) ]
We can repeat the same argument to derive the gradient estimator for a single reward term r_t':
∇_θ E[r_t'] = E[ r_t' Σ_{t=0}^{t'} ∇_θ log π_θ(a_t | s_t) ]
Summing this formula over t', we obtain
∇_θ E[R] = E[ Σ_{t'=0}^{T-1} r_t' Σ_{t=0}^{t'} ∇_θ log π_θ(a_t | s_t) ] = E[ Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t | s_t) Σ_{t'=t}^{T-1} r_t' ]

48 Policy Gradient: Use Temporal Structure
Recall for a particular trajectory τ^(i), Σ_{t'=t}^{T-1} r_t'^(i) is the return G_t^(i)
∇_θ E[R] ≈ (1/m) Σ_{i=1}^{m} Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t | s_t) G_t^(i)
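A small helper for the return G_t = Σ_{t'=t}^{T-1} r_t' (reward-to-go) used in the estimator above; undiscounted by default to match the slide, with an optional discount factor as an added assumption.

```python
# Compute the reward-to-go G_t for one trajectory, scanning the rewards backwards.
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):       # accumulate from the end of the episode
        running = rewards[t] + gamma * running
        G[t] = running
    return G

print(rewards_to_go([1.0, 0.0, 2.0]))             # [3. 2. 2.]
```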

49 Monte-Carlo Policy Gradient (REINFORCE)
Leverages the likelihood ratio / score function and temporal structure:
Δθ_t = α ∇_θ log π_θ(s_t, a_t) G_t    (7)
REINFORCE:
Initialize policy parameters θ arbitrarily
for each episode {s_1, a_1, r_2, ..., s_{T-1}, a_{T-1}, r_T} ~ π_θ do
for t = 1 to T - 1 do
θ ← θ + α ∇_θ log π_θ(s_t, a_t) G_t
end for
end for
return θ
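A minimal REINFORCE sketch with a linear softmax policy (the softmax policy class appears on the slides below). sample_episode(theta) is a hypothetical helper that rolls out one episode and returns the per-step feature vectors φ(s_t), the actions a_t, and the rewards r_t; it and the other names are assumptions for illustration, not part of the slides.

```python
# REINFORCE with a linear softmax policy: theta has one row of weights per action.
import numpy as np

def softmax_probs(theta, phi):
    logits = theta @ phi                              # one logit per action
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_softmax(theta, phi, a):
    probs = softmax_probs(theta, phi)
    grad = -np.outer(probs, phi)                      # -pi(b|s) * phi(s) for every action b
    grad[a] += phi                                    # +phi(s) for the taken action
    return grad                                       # = (1{b=a} - pi(b|s)) phi(s)

def reinforce(sample_episode, theta, alpha=0.01, episodes=1000):
    for _ in range(episodes):
        phis, actions, rewards = sample_episode(theta)
        G = np.cumsum(rewards[::-1])[::-1]            # returns-to-go G_t
        for phi, a, g in zip(phis, actions, G):
            theta = theta + alpha * grad_log_softmax(theta, phi, a) * g
    return theta
```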

50 Differentiable Policy Classes Many choices of differentiable policy classes including: Softmax Gaussian Neural networks Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 53 Winter / 62

51 Softmax Policy
Weight actions using a linear combination of features φ(s, a)^T θ
Probability of action is proportional to exponentiated weight:
π_θ(s, a) = e^{φ(s,a)^T θ} / Σ_{a'} e^{φ(s,a')^T θ}    (8)
The score function is
∇_θ log π_θ(s, a) = φ(s, a) - E_{π_θ}[φ(s, ·)]
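A direct sketch of this softmax policy and its score function, with phi_sa a (num_actions × d) array holding the feature vector φ(s, a) for each action in the current state; the names are illustrative.

```python
# Softmax policy pi_theta(s, a) proportional to exp(phi(s, a)^T theta) and its score function.
import numpy as np

def softmax_policy(phi_sa, theta):
    logits = phi_sa @ theta                           # phi(s, a)^T theta for each action a
    z = np.exp(logits - logits.max())
    return z / z.sum()

def score_softmax(phi_sa, theta, a):
    probs = softmax_policy(phi_sa, theta)
    expected_phi = probs @ phi_sa                     # E_{pi_theta}[ phi(s, .) ]
    return phi_sa[a] - expected_phi                   # phi(s, a) - E[phi(s, .)]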

52 Gaussian Policy
In continuous action spaces, a Gaussian policy is natural
Mean is a linear combination of state features: μ(s) = φ(s)^T θ
Variance may be fixed at σ², or can also be parametrised
Policy is Gaussian: a ~ N(μ(s), σ²)
The score function is
∇_θ log π_θ(s, a) = (a - μ(s)) φ(s) / σ²
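The corresponding sketch for the Gaussian policy and its score function, with phi_s the state feature vector φ(s); again the names are illustrative.

```python
# Gaussian policy with mean mu(s) = phi(s)^T theta and fixed variance sigma^2.
import numpy as np

def gaussian_policy_sample(phi_s, theta, sigma):
    mu = phi_s @ theta                                # mu(s) = phi(s)^T theta
    return np.random.normal(mu, sigma)                # a ~ N(mu(s), sigma^2)

def score_gaussian(phi_s, theta, sigma, a):
    mu = phi_s @ theta
    return (a - mu) * phi_s / sigma**2                # grad_theta log pi_theta(s, a)
```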

53 Likelihood Ratio / Score Function Policy Gradient
∇_θ V(θ) ≈ (1/m) Σ_{i=1}^{m} R(τ^(i)) Σ_{t=0}^{T-1} ∇_θ log π_θ(a_t^(i) | s_t^(i))
Unbiased but very noisy
Fixes that can make it practical:
Temporal structure
Baseline
Next time will discuss some additional tricks

54 Policy Gradient: Introduce Baseline
Reduce variance by introducing a baseline b(s):
∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T-1} ∇_θ log π(a_t | s_t, θ) ( Σ_{t'=t}^{T-1} r_t' - b(s_t) ) ]
For any choice of b(s), the gradient estimator is unbiased
Near optimal choice is the expected return, b(s_t) ≈ E[ r_t + r_{t+1} + ... + r_{T-1} ]
Interpretation: increase the logprob of action a_t proportionally to how much the returns Σ_{t'=t}^{T-1} r_t' are better than expected

55 Baseline b(s) Does Not Introduce Bias: Derivation
E_τ[ ∇_θ log π(a_t | s_t, θ) b(s_t) ]
= E_{s_{0:t}, a_{0:(t-1)}}[ E_{s_{(t+1):T}, a_{t:(T-1)}}[ ∇_θ log π(a_t | s_t, θ) b(s_t) ] ]

56 Baseline b(s) Does Not Introduce Bias: Derivation
E_τ[ ∇_θ log π(a_t | s_t, θ) b(s_t) ]
= E_{s_{0:t}, a_{0:(t-1)}}[ E_{s_{(t+1):T}, a_{t:(T-1)}}[ ∇_θ log π(a_t | s_t, θ) b(s_t) ] ]    (break up expectation)
= E_{s_{0:t}, a_{0:(t-1)}}[ b(s_t) E_{s_{(t+1):T}, a_{t:(T-1)}}[ ∇_θ log π(a_t | s_t, θ) ] ]    (pull baseline term out)
= E_{s_{0:t}, a_{0:(t-1)}}[ b(s_t) E_{a_t}[ ∇_θ log π(a_t | s_t, θ) ] ]    (remove irrelevant variables)
= E_{s_{0:t}, a_{0:(t-1)}}[ b(s_t) Σ_a π_θ(a_t | s_t) ( ∇_θ π(a_t | s_t, θ) / π_θ(a_t | s_t) ) ]    (likelihood ratio)
= E_{s_{0:t}, a_{0:(t-1)}}[ b(s_t) Σ_a ∇_θ π(a_t | s_t, θ) ]
= E_{s_{0:t}, a_{0:(t-1)}}[ b(s_t) ∇_θ Σ_a π(a_t | s_t, θ) ]
= E_{s_{0:t}, a_{0:(t-1)}}[ b(s_t) ∇_θ 1 ]
= E_{s_{0:t}, a_{0:(t-1)}}[ b(s_t) · 0 ] = 0

57 Vanilla Policy Gradient Algorithm
Initialize policy parameter θ, baseline b
for iteration = 1, 2, ... do
Collect a set of trajectories by executing the current policy
At each timestep in each trajectory, compute the return R_t = Σ_{t'=t}^{T-1} r_t', and the advantage estimate Â_t = R_t - b(s_t)
Re-fit the baseline, by minimizing ||b(s_t) - R_t||², summed over all trajectories and timesteps
Update the policy, using a policy gradient estimate ĝ, which is a sum of terms ∇_θ log π(a_t | s_t, θ) Â_t (plug ĝ into SGD or ADAM)
end for
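One iteration of the algorithm above as a sketch, with a linear baseline b(s) = w^T φ(s) re-fit by least squares. collect_trajectories and grad_log_pi are hypothetical helpers (rollout collection and the per-step score function, e.g. one of the policy classes above); they are assumptions, not part of the slides.

```python
# One iteration of vanilla policy gradient with a least-squares linear baseline.
import numpy as np

def vanilla_pg_iteration(collect_trajectories, grad_log_pi, theta, w, alpha=0.01):
    trajectories = collect_trajectories(theta)        # list of (phis, actions, rewards)
    all_phis, all_returns, grad = [], [], np.zeros_like(theta)
    for phis, actions, rewards in trajectories:
        R = np.cumsum(rewards[::-1])[::-1]            # returns R_t = sum_{t' >= t} r_t'
        for phi, a, R_t in zip(phis, actions, R):
            A_hat = R_t - w @ phi                     # advantage estimate A_t = R_t - b(s_t)
            grad += grad_log_pi(theta, phi, a) * A_hat
            all_phis.append(phi); all_returns.append(R_t)
    theta = theta + alpha * grad                      # policy gradient step (SGD here)
    # Re-fit the baseline weights by least squares on (phi(s_t), R_t) pairs
    w, *_ = np.linalg.lstsq(np.array(all_phis), np.array(all_returns), rcond=None)
    return theta, w
```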

58 Practical Implementation with Autodiff
Usual formula Σ_t ∇_θ log π(a_t | s_t; θ) Â_t is inefficient; want to batch data
Define a surrogate function using data from the current batch:
L(θ) = Σ_t log π(a_t | s_t; θ) Â_t
Then the policy gradient estimator is ĝ = ∇_θ L(θ)
Can also include the value function fit error:
L(θ) = Σ_t ( log π(a_t | s_t; θ) Â_t - ||V(s_t) - R̂_t||² )
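A sketch of this surrogate loss with an autodiff framework (PyTorch assumed here; any autodiff framework works). The vf_coef weighting on the value error is an added assumption for illustration, the slide simply sums the two terms.

```python
# Surrogate loss for batched policy gradient with autodiff.
# logits and values come from the current policy/value networks on a batch of states;
# actions, advantages, returns are tensors built from the collected batch.
import torch

def surrogate_loss(logits, values, actions, advantages, returns, vf_coef=0.5):
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    policy_term = (log_probs * advantages).sum()       # L(theta) = sum_t log pi(a_t|s_t) * A_t
    value_error = ((values - returns) ** 2).sum()      # value function fit error
    # We maximize the surrogate and minimize the value error, so return a loss to minimize
    return -(policy_term - vf_coef * value_error)

# loss = surrogate_loss(...); loss.backward() then yields ghat = grad_theta L(theta),
# which can be plugged into SGD or Adam.
```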

59 Value Functions
Recall the Q-function / state-action value function:
Q^{π,γ}(s, a) = E_π[ r_0 + γ r_1 + γ² r_2 + ... | s_0 = s, a_0 = a ]
The state-value function can serve as a great baseline:
V^{π,γ}(s) = E_π[ r_0 + γ r_1 + γ² r_2 + ... | s_0 = s ] = E_{a~π}[ Q^{π,γ}(s, a) ]
Advantage function: combining Q with baseline V
A^{π,γ}(s, a) = Q^{π,γ}(s, a) - V^{π,γ}(s)

60 N-step estimators
Can also consider blending between TD and MC estimators for the target to substitute for the true state-action value function:
R̂_t^(1) = r_t + γ V(s_{t+1})
R̂_t^(2) = r_t + γ r_{t+1} + γ² V(s_{t+2})
R̂_t^(∞) = r_t + γ r_{t+1} + γ² r_{t+2} + ...
If we subtract baselines from the above, we get advantage estimators:
Â_t^(1) = r_t + γ V(s_{t+1}) - V(s_t)
Â_t^(2) = r_t + γ r_{t+1} + γ² V(s_{t+2}) - V(s_t)
Â_t^(∞) = r_t + γ r_{t+1} + γ² r_{t+2} + ... - V(s_t)
Â_t^(1) has low variance & high bias; Â_t^(∞) has high variance but low bias. (Why? Like which model-free policy estimation techniques?)
Using an intermediate k (say, 20) can give an intermediate amount of bias and variance
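A sketch of the k-step advantage estimator Â_t^(k) defined above, where values holds V(s_0), ..., V(s_T) from the baseline/critic (V(s_T) should be 0 if s_T is terminal); the helper name is illustrative.

```python
# k-step advantage: k rewards, bootstrap with V(s_{t+k}), subtract the baseline V(s_t).
import numpy as np

def k_step_advantage(rewards, values, t, k, gamma=0.99):
    T = len(rewards)                                  # values has length T + 1
    steps = min(k, T - t)                             # truncate at the end of the episode
    target = sum(gamma**i * rewards[t + i] for i in range(steps))
    target += gamma**steps * values[t + steps]        # bootstrap with V(s_{t+steps})
    return target - values[t]                         # A_t^(k) = target - V(s_t)
```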

61 Application: Robot Locomotion Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 64 Winter / 62

62 Class Structure Last time: Imitation Learning This time: Policy Search Next time: Policy Search Cont. Emma Brunskill (CS234 Reinforcement Learning. ) Lecture 8: Policy Gradient I 65 Winter / 62
