Reinforcement Learning

Size: px

Start display at page:

Download "Reinforcement Learning"

Barrie Cummings
5 years ago
Views:

Reinforcement Learning Policy gradients Daniel Hennes 26.06.

1 Reinforcement Learning Policy gradients Daniel Hennes University Stuttgart - IPVS - Machine Learning & Robotics 1

2 Policy based reinforcement learning So far we approximated the action value function and generated a policy from it Approximation of the action value function ˆq(s, a, w) q π (s, a) Generation of policy by, e.g., ɛ greedy ˆq(s, a, w) ɛ-greedy π Now we directly parameterize the policy π 2

3 Policy based reinforcement learning Advantages: better convergence properties effective in high-dimensional or continuous action spaces can learn stochastic policies Disadvantages: typically converge to a local rather than global optimum evaluating a policy is typically inefficient and high variance 3

4 Stochastic policies Consider the iterated version of rock paper scissors What happens if you play a deterministic policy? Which policy is best? 4

5 Policy optimization Policy optimization: π = arg max J(π) π where J is some performance measure Discounted return: G 0 = T 1 k=0 γk R k+1 π = arg max π Undiscounted return: G 0 = T 1 k=0 R k+1, i.e., γ = 1 E π [G 0 ] = arg max E π [v π (s 0 ) S 0 = s 0 ] π π = arg max E π [R 0 + R R T 1 S 0 = s 0 ] π In continuing environments, we could use the average state value: µ π (s) π(a s) p(s s, a)r(s, a, s ) s a s where µ π is the steady state distribution under π 5

6 Parameterized policies Policies parameterized by parameter θ R d : deterministic: a = π(s, θ) stochastic: a π(a s, θ) = Pr{A t = a S t = s, θ} Objective becomes max θ J(θ) = max θ E πθ [G 0 ] We can parameterize π in any way, as long as it is differentiable wrt to θ In general we require π(a s, θ) (0, 1) for all s, a If the action space is discrete (and not too large): softmax policy π(a s, θ) = where h( ) is the action preference function Can this approach the deterministic policy? eh(s,a,θ) b eh(s,b,θ) How does it compare to ɛ-greedy or softmax with action values? 6

7 Policy gradient methods Problem: π = arg max E π [R 0 + R R T 1 S 0 = s 0 ] π Intuition: collect a bunch of trajectories, and Make the good trajectories more probable 2. Make the good actions more probable 3. Push the actions towards good actions Policy gradient methods: θ θ + α J(θ) where J(θ) is the policy gradient: J(θ) = ( J(θ) note that this is approximate gradient ascent θ 0,..., J(θ) ) T θ d 7

8 Score function gradient estimator Consider E x p(x θ) [f (x)] for some score function f We want to compute the gradient wrt θ θ E x [f (x)] = θ f (x)p(x θ)dx = f (x) θ p(x θ)dx = f (x) θp(x θ) p(x θ)dx p(x θ) = [f (x) θ log p(x θ) ] p(x θ)dx = E x [f (x) θ log p(x θ)] Note that log x = x x Unbiased gradient estimator: sample x i p(x θ) compute sample gradient f (x i) θ log p(x i θ) 8

9 Score function gradient estimator Sample gradient: f (x i ) θ logp(x i θ) f (x i ) is the score of sample x i Moving in the direction of the sample gradient pushes the log probability of the sample in proportion to its score (gradient ascent) Valid even if f is: discontinuous unknown or sample space is a discrete set 9

10 Score function gradient estimator for policies Recall: θ E x [f (x)] = E x [f (x) θ log p(x θ)] Let us consider a random trajectory τ in place of x: τ = (S 0, A 0, R 1, S 1, A 1, R 2,..., S T 1, A T 1, R T, S T ) and G 0 instead of f, thus we have: θ J(θ) = θ E τ [G 0 ] = E τ [G 0 θ log p(τ θ) T 1 p(τ θ) = Pr{S 0 } π(a t S t, θ)p(s t+1 S t, A t ) T 1 log p(τ θ) = log Pr{S 0 } + log π(a t S t, θ) + log p(s t+1 S t, A t ) T 1 θ log p(τ θ) = θ log π(a t S t, θ) [ T 1 ] θ E τ [G 0 ] = E τ G 0 θ log π(a t S t, θ) 10

11 Policy gradient theorem θ E τ [G 0 ] = E τ [ For a single reward R k we have: G 0 T 1 [ ( T 1 = E τ θ E τ [R k ] = E τ [ ] θ log π(a t S t, θ) ) T 1 R t+1 θ log R k k 1 ] π(a t S t, θ) ] θ log π(a t S t, θ) Summing over all k, we get: θ E τ [G 0 ] = E τ [ T k=1 R k k 1 ] θ log π(a t S t, θ) [ T 1 = E τ θ log π(a t S t, θ) T k=t+1 R k ] 11

12 Policy gradient theorem [ T 1 T ] θ E τ [G 0 ] = E τ θ log π(a t S t, θ) R k k=t+1 = E πθ [ qπ (S t, A t ) log π(a t S t, θ) ] θ J(θ) = E πθ [ qπ (S t, A t ) log π(a t S t, θ) ] 12

13 Monte Carlo policy gradient (REINFORCE) Update parameters θ by stochastic gradient ascent using the policy gradient theorem using return G t as an unbiased sample of q π(s t, A t) Update rule: θ θ + αγ t G t θ log π(a t S t, θ) 13

14 REINFORCE with baseline Monte Carlo policy gradient still suffers from high variance We use a baseline to estimate the value function Policy parameters: θ Value estimator parameters: w Update rule: Interpretation: ˆv(s, w) v π (s) δ G ˆv(S t, w) w w + α w γ t δ w ˆv(S t, w) θ θ + α θ γ t δ θ log π(a t S t, θ) increase log prob of action A t proportionally to how much G t is better than expected baseline accounts for and removes the effect of past actions 14

15 Estimating the action value function We are solving the prediction problem: policy evaluation How good is policy π θ with current parameters θ? Familiar toolset for fitting the baseline: Monte Carlo policy evaluation TD-learning TD(λ) LSPI 15

16 Actor-critic concept 16

17 Actor-critic vs. baseline Actor-critic methods use the value function as a baseline for policy gradients Delivers trade off between variance reduction of policy gradients with bias introduction from value function methods Critic: updates value function parameters w Actor: updates policy parameters θ using critic REINFORCE with baseline uses value function as baseline not as a critic not used for bootstrapping One step actor-critic update: δ R t+1 + γˆv(s t+1, w) ˆv(S t, w) w w + α w γ t δ w ˆv(S t, w) θ θ + α θ γ t δ θ log π(a t S t, θ) 17

18 Value functions for the critic State-value function: ˆv(s, w) v π (s) as in one step Actor Critic Action value function: ˆq(s, a, w) q π (s, a) gradient of the policy is the direction that most improves q: [ ] J(θ) ˆq(s, a, w) π(s, θ) = E θ a θ Advantage function: â(s, a) = q π (s, a) v(s) e.g., used in Asynchronous Advantage Actor-Critic (A3C) async updates are for speed not for performance 18

19 Bias in Actor Critic algorithms Approximating (bootstrapping) the policy gradient introduces bias Biased policy gradient might not find the right solution But reduces variance and makes learning substantially more efficient Compatible function approximation avoids this problem: compatible function approximation theorem policy gradients are exact 19

20 What about continuous action spaces? Softmax works for (not too large) discrete action spaces In continuous action spaces, a Gaussian policy is a common choice Gaussian policy: A t N (µ(s t, θ), σ 2 (S t, θ)) Variance may also be constant across the state space E.g., neural network outputs the mean of each action dimension as a function of the state 20

21 Summary Action value methods: learn values, select action accordingly Policy gradient methods: learn directly a parameterized Policy Advantages: learn specific probabilities for taking the actions learn appropriate levels of exploration, or... approach deterministic policies can handle continuous action spaces some policies are simpler to represent than value function policy gradient theorem (exact expression how performance is affected) REINFORCE with baseline reduces variance without adding bias Value functions for bootstrapping (Actor Critic) introduces bias,... but substantially reduces variance 21

Policy Gradient Methods. February 13, 2017

Policy Gradient Methods. February 13, 2017 Policy Gradient Methods February 13, 2017 Policy Optimization Problems maximize E π [expression] π Fixed-horizon episodic: T 1 Average-cost: lim T 1 T r t T 1 r t Infinite-horizon discounted: γt r t Variable-length