Lecture 9: Policy Gradient II

Lecture 9: Policy Gradient II. Emma Brunskill, CS234 Reinforcement Learning, Winter 2019. Additional reading: Sutton and Barto 2018, Chapter 13. With many slides from or derived from David Silver, John Schulman, and Pieter Abbeel.

Class Feedback. Thanks to those that participated! Of 79 responses, 26.6% thought the pace was too fast, 65.8% just right, and 7.6% too slow. 49% wanted to keep sessions on Thursday and Friday; 51% wanted to switch to office hours.

Class Feedback. Thanks to those that participated! Of 70 responses, 54% thought the pace was too fast and 43% just right. Good: derivations, homeworks. Do more of: big-picture explaining, speaking loudly throughout, more examples.

Class Structure. Last time: Policy Search. This time: Policy Search. Next time: Midterm review.

Recall: Policy-Based RL. Policy search: directly parametrize the policy, $\pi_\theta(s, a) = \mathbb{P}[a \mid s; \theta]$. The goal is to find a policy $\pi$ with the highest value function $V^\pi$. (Pure) policy-based methods: no value function, learned policy. Actor-critic methods: learned value function, learned policy.

Recall: Advantages of Policy-Based RL. Advantages: better convergence properties; effective in high-dimensional or continuous action spaces; can learn stochastic policies. Disadvantages: typically converge to a local rather than global optimum; evaluating a policy is typically inefficient and high variance.

Recall: Policy Gradient Defined. Define $V(\theta) = V^{\pi_\theta}$ to make explicit the dependence of the value on the policy parameters; we assume episodic MDPs. Policy gradient algorithms search for a local maximum in $V(\theta)$ by ascending the gradient of $V(\theta)$ with respect to the policy parameters $\theta$: $\Delta\theta = \alpha \nabla_\theta V(\theta)$, where $\alpha$ is a step-size parameter and $\nabla_\theta V(\theta) = \left(\frac{\partial V(\theta)}{\partial \theta_1}, \ldots, \frac{\partial V(\theta)}{\partial \theta_n}\right)^{\top}$ is the policy gradient.

Desired Properties of a Policy Gradient RL Algorithm. Goal: converge as quickly as possible to a local optimum. We incur reward/cost as we execute the policy, so we want to minimize the number of iterations/time steps until we reach a good policy. During policy search we alternate between evaluating the policy and changing (improving) the policy, just like in policy iteration, and we would like each policy update to be a monotonic improvement. With gradient descent we are only guaranteed to reach a local optimum; monotonic improvement will achieve this, and in the real world monotonic improvement is often beneficial.

Desired Properties of a Policy Gradient RL Algorithm. Goal: obtain large monotonic improvements to the policy at each update. Techniques to try to achieve this: (last time and today) get a better estimate of the gradient (intuition: this should improve the policy parameter update); (today) change how we update the policy parameters given the gradient.

Table of Contents: 1. Better Gradient Estimates; 2. Policy Gradient Algorithms and Reducing Variance; 3. Need for Automatic Step Size Tuning; 4. Updating the Parameters Given the Gradient: Local Approximation; 5. Updating the Parameters Given the Gradient: Trust Regions; 6. Updating the Parameters Given the Gradient: TRPO Algorithm.

Likelihood Ratio / Score Function Policy Gradient. Recall from last time (where $m$ is the number of sampled trajectories): $\nabla_\theta V(\theta) \approx \frac{1}{m}\sum_{i=1}^{m} R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$. This is an unbiased estimate of the gradient but very noisy. Fixes that can make it practical: temporal structure (discussed last time), baselines, and alternatives to using the Monte Carlo return $R(\tau^{(i)})$ as the target.
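As a concrete illustration (my addition, not from the slides), here is a minimal NumPy sketch of this score-function estimator for a linear-softmax policy; the feature dimension, action count, and trajectory format are assumptions made for the example.

```python
import numpy as np

def softmax_probs(theta, phi_s):
    """Action probabilities of a linear-softmax policy: pi(a|s) proportional to exp(theta[:, a] . phi(s))."""
    logits = phi_s @ theta                 # shape (n_actions,)
    logits -= logits.max()                 # numerical stability
    e = np.exp(logits)
    return e / e.sum()

def grad_log_pi(theta, phi_s, a):
    """grad_theta log pi(a|s): phi(s) on column a, minus pi(a'|s) phi(s) on every column a'."""
    probs = softmax_probs(theta, phi_s)
    g = -np.outer(phi_s, probs)
    g[:, a] += phi_s
    return g

def score_function_gradient(theta, trajectories):
    """(1/m) * sum_i R(tau_i) * sum_t grad_theta log pi(a_t | s_t), i.e. the estimator above.
    Each trajectory is a list of (phi_s, a, r) tuples."""
    g_hat = np.zeros_like(theta)
    for traj in trajectories:
        R = sum(r for _, _, r in traj)                                   # total trajectory return
        g_hat += R * sum(grad_log_pi(theta, phi_s, a) for phi_s, a, _ in traj)
    return g_hat / len(trajectories)

# Tiny made-up example: 3 features, 2 actions, one short trajectory.
theta = np.zeros((3, 2))
traj = [(np.array([1.0, 0.0, 0.5]), 1, 1.0), (np.array([0.0, 1.0, 0.5]), 0, 0.0)]
print(score_function_gradient(theta, [traj]))
```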

Policy Gradient: Introduce Baseline. Reduce variance by introducing a baseline $b(s)$: $\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t; \theta)\left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]$. For any choice of $b$, the gradient estimator is unbiased. A near-optimal choice is the expected return, $b(s_t) \approx \mathbb{E}[r_t + r_{t+1} + \cdots + r_{T-1}]$. Interpretation: increase the log-probability of action $a_t$ proportionally to how much the return $\sum_{t'=t}^{T-1} r_{t'}$ is better than expected.

Baseline $b(s)$ Does Not Introduce Bias: Derivation. $\mathbb{E}_\tau\left[\nabla_\theta \log \pi(a_t \mid s_t; \theta)\, b(s_t)\right] = \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[\mathbb{E}_{s_{(t+1):T},\, a_{t:(T-1)}}\!\left[\nabla_\theta \log \pi(a_t \mid s_t; \theta)\, b(s_t)\right]\right]$.

Baseline $b(s)$ Does Not Introduce Bias: Derivation.
$\mathbb{E}_\tau\left[\nabla_\theta \log \pi(a_t \mid s_t; \theta)\, b(s_t)\right]$
$= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[\mathbb{E}_{s_{(t+1):T},\, a_{t:(T-1)}}\!\left[\nabla_\theta \log \pi(a_t \mid s_t; \theta)\, b(s_t)\right]\right]$ (break up expectation)
$= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[b(s_t)\, \mathbb{E}_{s_{(t+1):T},\, a_{t:(T-1)}}\!\left[\nabla_\theta \log \pi(a_t \mid s_t; \theta)\right]\right]$ (pull baseline term out)
$= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[b(s_t)\, \mathbb{E}_{a_t}\!\left[\nabla_\theta \log \pi(a_t \mid s_t; \theta)\right]\right]$ (remove irrelevant variables)
$= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[b(s_t) \sum_a \pi_\theta(a \mid s_t)\, \frac{\nabla_\theta \pi(a \mid s_t; \theta)}{\pi_\theta(a \mid s_t)}\right]$ (likelihood ratio)
$= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[b(s_t) \sum_a \nabla_\theta \pi(a \mid s_t; \theta)\right]$
$= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[b(s_t)\, \nabla_\theta \sum_a \pi(a \mid s_t; \theta)\right]$
$= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[b(s_t)\, \nabla_\theta 1\right] = \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[b(s_t) \cdot 0\right] = 0$.
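A quick numerical sanity check (my addition, not from the slides) of the key step above, namely that the score has zero mean under the policy, so that $\sum_a \pi_\theta(a \mid s)\,\nabla_\theta \log \pi_\theta(a \mid s) = 0$; the softmax logits here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=5)                  # arbitrary logits of a 5-action softmax policy
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# For a softmax policy, the gradient of log pi(a|s) w.r.t. the logits is one_hot(a) - probs,
# so E_{a~pi}[grad log pi(a|s)] = sum_a pi(a|s) (one_hot(a) - probs).
score = np.eye(5) - probs                    # row a holds grad_logits log pi(a|s)
expected_score = probs @ score               # expectation of the score under the policy
print(np.allclose(expected_score, 0.0))      # True: the baseline-weighted term has zero mean
```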

Vanilla Policy Gradient Algorithm. Initialize the policy parameter $\theta$ and the baseline $b$. For iteration $= 1, 2, \ldots$: collect a set of trajectories by executing the current policy; at each timestep $t$ in each trajectory $\tau^i$, compute the return $G_t^i = \sum_{t'=t}^{T-1} r_{t'}^i$ and the advantage estimate $\hat{A}_t^i = G_t^i - b(s_t)$; re-fit the baseline by minimizing $\sum_i \sum_t \lVert b(s_t) - G_t^i \rVert^2$; update the policy using a policy gradient estimate $\hat{g}$, which is a sum of terms $\nabla_\theta \log \pi(a_t \mid s_t; \theta)\,\hat{A}_t$ (plug $\hat{g}$ into SGD or Adam).
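Below is a self-contained sketch of this loop on a made-up five-state corridor MDP with a tabular softmax policy; the environment, horizon, learning rate, and the choice to re-fit the baseline as the per-state mean return (the minimizer of the squared error above) are all assumptions for illustration, not course code.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, HORIZON = 5, 2, 10       # toy corridor: action 1 moves right, 0 moves left

def step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == N_STATES - 1)   # reward 1 for being at the rightmost state

def policy_probs(theta, s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def run_episode(theta):
    s, traj = 0, []
    for _ in range(HORIZON):
        a = rng.choice(N_ACTIONS, p=policy_probs(theta, s))
        s_next, r = step(s, a)
        traj.append((s, a, r))
        s = s_next
    return traj

theta = np.zeros((N_STATES, N_ACTIONS))       # tabular softmax policy parameters
baseline = np.zeros(N_STATES)                 # baseline b(s)
alpha, n_traj = 0.1, 20

for iteration in range(200):
    grad = np.zeros_like(theta)
    returns_per_state = [[] for _ in range(N_STATES)]
    for _ in range(n_traj):
        traj = run_episode(theta)
        rewards = [r for _, _, r in traj]
        for t, (s, a, _) in enumerate(traj):
            G_t = sum(rewards[t:])                        # return from time t (undiscounted)
            A_t = G_t - baseline[s]                       # advantage estimate
            g_log = -policy_probs(theta, s)
            g_log[a] += 1.0                               # grad of log softmax w.r.t. theta[s]
            grad[s] += g_log * A_t
            returns_per_state[s].append(G_t)
    for s in range(N_STATES):                             # re-fit baseline: mean return per state
        if returns_per_state[s]:
            baseline[s] = np.mean(returns_per_state[s])
    theta += alpha * grad / n_traj                        # plain SGD ascent step

print(policy_probs(theta, 0))    # should put most probability on "move right" (action 1)
```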

Practical Implementation with Autodiff. The usual formula $\sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\,\hat{A}_t$ is inefficient; we want to batch the data. Define a surrogate function using the data from the current batch, $L(\theta) = \sum_t \log \pi(a_t \mid s_t; \theta)\,\hat{A}_t$; then the policy gradient estimator is $\hat{g} = \nabla_\theta L(\theta)$. We can also include the value function fit error: $L(\theta) = \sum_t \left(\log \pi(a_t \mid s_t; \theta)\,\hat{A}_t - \lVert V(s_t) - \hat{G}_t \rVert^2\right)$.
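A minimal PyTorch sketch of this batched surrogate (my own illustration, not the course starter code); the network, batch contents, and advantage values are placeholders, and the value-fit term is omitted.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, batch = 4, 2, 32
policy = nn.Linear(obs_dim, n_actions)            # logits of a categorical policy

# Placeholder batch of states, sampled actions, and advantage estimates A_hat_t.
states = torch.randn(batch, obs_dim)
actions = torch.randint(n_actions, (batch,))
advantages = torch.randn(batch)

# Surrogate L(theta) = sum_t log pi(a_t | s_t; theta) * A_hat_t; autodiff of L gives g_hat.
dist = torch.distributions.Categorical(logits=policy(states))
surrogate = (dist.log_prob(actions) * advantages).sum()

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
optimizer.zero_grad()
(-surrogate).backward()                           # maximize L by minimizing -L
optimizer.step()
```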

Other Choices for Baseline? The vanilla policy gradient algorithm above fit the baseline $b(s_t)$ to the Monte Carlo returns; what other choices of baseline could we use?

Choosing the Baseline: Value Functions. Recall the Q-function / state-action value function: $Q^{\pi,\gamma}(s, a) = \mathbb{E}_\pi\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, a_0 = a\right]$. The state-value function can serve as a great baseline: $V^{\pi,\gamma}(s) = \mathbb{E}_\pi\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s\right] = \mathbb{E}_{a \sim \pi}\left[Q^{\pi,\gamma}(s, a)\right]$. Advantage function: combining $Q$ with the baseline $V$, $A^{\pi,\gamma}(s, a) = Q^{\pi,\gamma}(s, a) - V^{\pi,\gamma}(s)$.

Likelihood Ratio / Score Function Policy Gradient. Recall: $\nabla_\theta V(\theta) \approx \frac{1}{m}\sum_{i=1}^{m} R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$, an unbiased but very noisy estimate of the gradient. Fixes that can make it practical: temporal structure (discussed last time), baselines, and alternatives to using the Monte Carlo returns $G_t^i$ as targets.

Choosing the Target. $G_t^i$ is an estimate of the value function at $s_t$ from a single rollout: unbiased, but high variance. We can reduce variance by introducing bias through bootstrapping and function approximation, just as we saw for TD vs. MC and for value function approximation. The estimate of $V$/$Q$ is done by a critic. Actor-critic methods maintain an explicit representation of both the policy and the value function and update both. A3C (Mnih et al., ICML 2016) is a very popular actor-critic method.

Policy Gradient Formulas with Value Functions. Recall: $\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t; \theta)\left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]$. Replacing the sampled return with a learned Q-function gives $\nabla_\theta \mathbb{E}_\tau[R] \approx \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t; \theta)\left(Q(s_t, a_t; w) - b(s_t)\right)\right]$. Letting the baseline be an estimate of the value $V$, we can represent the gradient in terms of the state-action advantage function: $\nabla_\theta \mathbb{E}_\tau[R] \approx \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t; \theta)\,\hat{A}^\pi(s_t, a_t)\right]$.

Choosing the Target: N-step Estimators. $\nabla_\theta V(\theta) \approx \frac{1}{m}\sum_{i=1}^{m}\sum_{t=0}^{T-1} \hat{R}_t^i\, \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$. Note that the critic can select any blend between TD and MC estimators for the target that substitutes for the true state-action value function: $\hat{R}_t^{(1)} = r_t + \gamma V(s_{t+1})$, $\hat{R}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$, ..., $\hat{R}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$. Subtracting a baseline from these gives advantage estimators, e.g. $\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t)$ and $\hat{A}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots - V(s_t)$. $\hat{A}_t^{(1)}$ has low variance and high bias; $\hat{A}_t^{(\infty)}$ has high variance but low bias.
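A small helper (my addition, not from the course materials) that computes these n-step advantage estimates from a finished episode; the terminal-state handling and the toy numbers below are assumptions.

```python
import numpy as np

def n_step_advantages(rewards, values, n, gamma):
    """A_hat^(n)_t = r_t + gamma r_{t+1} + ... + gamma^{n-1} r_{t+n-1} + gamma^n V(s_{t+n}) - V(s_t).
    `values` has one extra entry, the value of the state reached after the final reward;
    past the end of the episode we simply stop accumulating (treating those values as 0)."""
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        G, discount = 0.0, 1.0
        for k in range(n):
            if t + k >= T:
                break
            G += discount * rewards[t + k]
            discount *= gamma
        else:                                   # used all n rewards, so bootstrap from V(s_{t+n})
            G += discount * values[t + n]
        adv[t] = G - values[t]
    return adv

# Toy check: with n=1 this is the TD error r_t + gamma V(s_{t+1}) - V(s_t).
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]                   # V(s_0), ..., V(s_3); s_3 is terminal
print(n_step_advantages(rewards, values, n=1, gamma=0.99))
```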

Vanilla Policy Gradient Algorithm. Initialize the policy parameter $\theta$ and the baseline $b$. For iteration $= 1, 2, \ldots$: collect a set of trajectories by executing the current policy; at each timestep $t$ in each trajectory $\tau^i$, compute the target $\hat{R}_t^i$ and the advantage estimate $\hat{A}_t^i = \hat{R}_t^i - b(s_t)$; re-fit the baseline by minimizing $\sum_i \sum_t \lVert b(s_t) - \hat{R}_t^i \rVert^2$; update the policy using a policy gradient estimate $\hat{g}$, which is a sum of terms $\nabla_\theta \log \pi(a_t \mid s_t; \theta)\,\hat{A}_t$ (plug $\hat{g}$ into SGD or Adam).

Policy Gradient and Step Sizes. Goal: each step of policy gradient yields an updated policy $\pi'$ whose value is greater than or equal to that of the prior policy $\pi$: $V^{\pi'} \geq V^{\pi}$. Gradient descent approaches update the weights by a small step in the direction of the gradient, a first-order / linear approximation of the value function's dependence on the policy parameterization. Locally this is a good approximation; further away it is less good.

Why Are Step Sizes a Big Deal in RL? Step size is important in any problem involving finding the optimum of a function. In supervised learning, if you step too far, the next updates will fix it. In reinforcement learning, stepping too far can produce a bad policy, and the next batch of data is then collected under that bad policy: the policy determines data collection, essentially controlling the exploration/exploitation trade-off through the particular policy parameters and the stochasticity of the policy. We may not be able to recover from a bad choice: a collapse in performance!

Simple Step-Sizing: Line Search in the Direction of the Gradient. Simple but expensive (it performs evaluations along the line), and naive: it ignores where the first-order approximation is good or bad.

Policy Gradient Methods with Auto-Step-Size Selection. Can we automatically ensure the updated policy $\pi'$ has value greater than or equal to the prior policy $\pi$, i.e. $V^{\pi'} \geq V^{\pi}$? We consider this for the policy gradient setting and hope to address it by modifying the step size.

Objective Function. Goal: find policy parameters that maximize the value function (for today we primarily consider discounted value functions): $V(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$, where $s_0 \sim \mu(s_0)$, $a_t \sim \pi(a_t \mid s_t)$, and $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$. We have access to samples from the current policy $\pi_\theta$ (parameterized by $\theta$), and want to predict the value of a different policy (off-policy learning!). We can express the expected return of another policy $\tilde{\pi} = \pi_{\tilde{\theta}}$ in terms of the advantage over the original policy: $V(\tilde{\theta}) = V(\theta) + \mathbb{E}_{\tilde{\pi}}\!\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right] = V(\theta) + \sum_s \mu_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$, where $\mu_{\tilde{\pi}}(s)$ is the discounted weighted frequency of state $s$ under policy $\tilde{\pi}$ (similar to the Imitation Learning lecture). We know the advantage $A_\pi$ and $\tilde{\pi}$, but we cannot compute the expression above because we do not know $\mu_{\tilde{\pi}}$, the state distribution under the new proposed policy.
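As a numerical check of this identity (my addition, not from the lecture), the snippet below builds a small random MDP, computes $A_\pi$ exactly, computes the discounted visitation frequencies of a second policy, and confirms that the two sides agree; every MDP detail here is invented.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9                                    # tiny made-up MDP

P = rng.random((nS, nA, nS)); P /= P.sum(-1, keepdims=True)  # transition probabilities P[s, a, s']
R = rng.random((nS, nA))                                     # rewards R(s, a)
mu0 = np.full(nS, 1.0 / nS)                                  # start-state distribution

def random_policy():
    pi = rng.random((nS, nA))
    return pi / pi.sum(-1, keepdims=True)

def value(pi):
    """Exact V^pi via the linear system (I - gamma P_pi) V = r_pi."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = (pi * R).sum(-1)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

pi, pi_new = random_policy(), random_policy()
V = value(pi)
Q = R + gamma * np.einsum('sap,p->sa', P, V)                 # Q^pi(s, a)
A = Q - V[:, None]                                           # A^pi(s, a)

# Discounted visitation frequencies of the NEW policy: mu^T = mu0^T (I - gamma P_pi_new)^{-1}
P_new = np.einsum('sa,sap->sp', pi_new, P)
mu = np.linalg.solve((np.eye(nS) - gamma * P_new).T, mu0)

lhs = mu0 @ value(pi_new) - mu0 @ V                          # V(theta_new) - V(theta)
rhs = (mu[:, None] * pi_new * A).sum()                       # sum_s mu(s) sum_a pi_new(a|s) A^pi(s,a)
print(np.isclose(lhs, rhs))                                  # True
```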

Local Approximation. Can we remove the dependency on the discounted visitation frequencies under the new policy? Substitute the discounted visitation frequencies under the current policy to define a new objective function: $L_\pi(\tilde{\pi}) = V(\theta) + \sum_s \mu_\pi(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$. Note that $L_{\pi_{\theta_0}}(\pi_{\theta_0}) = V(\theta_0)$, and the gradient of $L$ matches the gradient of the value function at $\theta_0$: $\nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta=\theta_0} = \nabla_\theta V(\theta)\big|_{\theta=\theta_0}$.

Conservative Policy Iteration. Is there a bound on the performance of a new policy obtained by optimizing the surrogate objective? Consider mixture policies that blend between an old policy and a different policy: $\pi_{\text{new}}(a \mid s) = (1-\alpha)\,\pi_{\text{old}}(a \mid s) + \alpha\, \pi'(a \mid s)$. In this case we can guarantee a lower bound on the value of the new policy $\pi_{\text{new}}$: $V^{\pi_{\text{new}}} \geq L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2$, where $\epsilon = \max_s \left|\mathbb{E}_{a \sim \pi'(\cdot \mid s)}\left[A_\pi(s, a)\right]\right|$.

Check your understanding: is this bound tight if $\pi_{\text{new}} = \pi_{\text{old}}$?

Can we remove the dependency on the discounted visitation frequencies under the new policy?

Find the Lower Bound for General Stochastic Policies. We would like to similarly obtain a lower bound on the potential performance for general stochastic policies (not just mixture policies). Recall $L_\pi(\tilde{\pi}) = V(\theta) + \sum_s \mu_\pi(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$.

Theorem. Let $D_{TV}^{\max}(\pi_1, \pi_2) = \max_s D_{TV}(\pi_1(\cdot \mid s), \pi_2(\cdot \mid s))$. Then $V^{\pi_{\text{new}}} \geq L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\left(D_{TV}^{\max}(\pi_{\text{old}}, \pi_{\text{new}})\right)^2$, where $\epsilon = \max_{s,a} \left|A_\pi(s, a)\right|$.

Note that $D_{TV}(p, q)^2 \leq D_{KL}(p, q)$ for probability distributions $p$ and $q$. The theorem above then immediately implies $V^{\pi_{\text{new}}} \geq L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\, D_{KL}^{\max}(\pi_{\text{old}}, \pi_{\text{new}})$, where $D_{KL}^{\max}(\pi_1, \pi_2) = \max_s D_{KL}(\pi_1(\cdot \mid s), \pi_2(\cdot \mid s))$.

Guaranteed Improvement. Goal: compute a policy that maximizes the objective function defining the lower bound, $M_i(\pi) = L_{\pi_i}(\pi) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\, D_{KL}^{\max}(\pi_i, \pi)$. Then $V^{\pi_{i+1}} \geq L_{\pi_i}(\pi_{i+1}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\, D_{KL}^{\max}(\pi_i, \pi_{i+1}) = M_i(\pi_{i+1})$, and $V^{\pi_i} = M_i(\pi_i)$, so $V^{\pi_{i+1}} - V^{\pi_i} \geq M_i(\pi_{i+1}) - M_i(\pi_i)$. So as long as the new policy $\pi_{i+1}$ is equal to or an improvement over the old policy $\pi_i$ with respect to the lower bound $M_i$, we are guaranteed to monotonically improve! This is a type of Minorization-Maximization (MM) algorithm.

Guaranteed Improvement. $V^{\pi_{\text{new}}} \geq L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\, D_{KL}^{\max}(\pi_{\text{old}}, \pi_{\text{new}})$. Figure: Source: John Schulman, Deep Reinforcement Learning, 2014.

Optimization of Parameterized Policies. Goal: optimize $\max_\theta\; L_{\theta_{\text{old}}}(\theta_{\text{new}}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\, D_{KL}^{\max}(\theta_{\text{old}}, \theta_{\text{new}}) = L_{\theta_{\text{old}}}(\theta_{\text{new}}) - C\, D_{KL}^{\max}(\theta_{\text{old}}, \theta_{\text{new}})$, where $C$ is the penalty coefficient. In practice, if we used the penalty coefficient recommended by the theory above, $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$, the step sizes would be very small. New idea: use a trust region constraint on step sizes, by imposing a constraint on the KL divergence between the new and old policies: $\max_\theta L_{\theta_{\text{old}}}(\theta)$ subject to $\bar{D}_{KL}^{\,s \sim \mu_{\theta_{\text{old}}}}(\theta_{\text{old}}, \theta) \leq \delta$. This uses the average KL instead of the max (the max requires the KL to be bounded at all states and yields an impractical number of constraints).

From Theory to Practice. The prior objective: $\max_\theta L_{\theta_{\text{old}}}(\theta)$ subject to $\bar{D}_{KL}^{\,s \sim \mu_{\theta_{\text{old}}}}(\theta_{\text{old}}, \theta) \leq \delta$, where $L_\pi(\tilde{\pi}) = V(\theta) + \sum_s \mu_\pi(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$. We know neither the visitation weights nor the true advantage function, so instead we make the following substitutions. First substitution: $\sum_s \mu_{\theta_{\text{old}}}(s)\,[\ldots] \;\rightarrow\; \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim \mu_{\theta_{\text{old}}}}[\ldots]$.

From Theory to Practice. Second substitution: $\sum_a \pi_\theta(a \mid s_n)\, A_{\theta_{\text{old}}}(s_n, a) \;\rightarrow\; \mathbb{E}_{a \sim q}\!\left[\frac{\pi_\theta(a \mid s_n)}{q(a \mid s_n)}\, A_{\theta_{\text{old}}}(s_n, a)\right]$, where $q$ is some sampling distribution over the actions and $s_n$ is a particular sampled state. This substitution uses importance sampling to estimate the desired sum, enabling the use of an alternate sampling distribution $q$ (other than the new policy $\pi_\theta$). Third substitution: $A_{\theta_{\text{old}}} \rightarrow Q_{\theta_{\text{old}}}$. Note that these substitutions do not change the solution to the above optimization problem.

Selecting the Sampling Policy. Optimize $\max_\theta\; \mathbb{E}_{s \sim \mu_{\theta_{\text{old}}},\, a \sim q}\!\left[\frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, Q_{\theta_{\text{old}}}(s, a)\right]$ subject to $\mathbb{E}_{s \sim \mu_{\theta_{\text{old}}}}\!\left[D_{KL}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\, \pi_\theta(\cdot \mid s)\right)\right] \leq \delta$. The standard approach: the sampling distribution $q(a \mid s)$ is simply $\pi_{\text{old}}(a \mid s)$; for the vine procedure, see the paper. Figure: Trust Region Policy Optimization, Schulman et al., 2015.
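A simplified PyTorch sketch of the importance-weighted surrogate and the average-KL term (my illustration, with made-up data); note that TRPO enforces the KL as a hard constraint via conjugate gradient plus a line search, so the fixed penalty coefficient used here is only a stand-in for that machinery.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, batch = 4, 3, 64
states = torch.randn(batch, obs_dim)                      # placeholder states s ~ mu_theta_old
actions = torch.randint(n_actions, (batch,))              # actions sampled from q = pi_old
q_values = torch.randn(batch)                             # stand-in for Q_theta_old(s, a)

policy_old = nn.Linear(obs_dim, n_actions)
policy_new = nn.Linear(obs_dim, n_actions)
policy_new.load_state_dict(policy_old.state_dict())       # start the new policy at the old one

dist_old = torch.distributions.Categorical(logits=policy_old(states).detach())
dist_new = torch.distributions.Categorical(logits=policy_new(states))

# Importance-sampled surrogate: E[ pi_theta(a|s) / pi_old(a|s) * Q_theta_old(s, a) ]
ratio = torch.exp(dist_new.log_prob(actions) - dist_old.log_prob(actions))
surrogate = (ratio * q_values).mean()

# Average KL between old and new policies over the sampled states.
kl = torch.distributions.kl_divergence(dist_old, dist_new).mean()

beta = 10.0                                                # made-up penalty weight
loss = -(surrogate - beta * kl)                            # penalty in place of the hard constraint
loss.backward()                                            # gradients for a simple ascent step on theta
```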

Searching for the Next Parameter. Use a linear approximation to the objective function and a quadratic approximation to the constraint, giving a constrained optimization problem solved with conjugate gradient (Trust Region Policy Optimization, Schulman et al., 2015).

Practical Algorithm: TRPO (Trust Region Policy Optimization, Schulman et al., 2015)
1: for iteration = 1, 2, ... do
2: Run the policy for T timesteps or N trajectories
3: Estimate the advantage function at all timesteps
4: Compute the policy gradient g
5: Use CG (with Hessian-vector products) to compute $F^{-1}g$, where $F$ is the Fisher information matrix
6: Do a line search on the surrogate loss and KL constraint
7: end for
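Step 5 only needs Fisher-vector products, never the matrix $F$ itself. Here is a generic conjugate gradient sketch (my addition, not the course implementation) that solves $Fx = g$ from such products, checked against a made-up symmetric positive-definite matrix standing in for $F$.

```python
import numpy as np

def conjugate_gradient(mvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given only the matrix-vector product mvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()                       # residual g - F x, with x = 0 initially
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Fp = mvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy check with a dense SPD matrix standing in for the Fisher information matrix.
rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
F = M @ M.T + 1e-2 * np.eye(6)
g = rng.normal(size=6)
x = conjugate_gradient(lambda v: F @ v, g, iters=50)
print(np.allclose(F @ x, g, atol=1e-6))    # True: x approximates F^{-1} g
```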

Practical Algorithm: TRPO. Applied to locomotion controllers in 2D and to Atari games with pixel input. Figure: Trust Region Policy Optimization, Schulman et al., 2015.

TRPO Results. Figures: Trust Region Policy Optimization, Schulman et al., 2015.

TRPO Summary. A policy gradient approach that uses a surrogate optimization function and automatically constrains the weight update to a trust region, approximating where the first-order approximation is valid. Empirically it consistently does well. Very influential: over 350 citations in the few years since it was introduced.

Common Template of Policy Gradient Algorithms
1: for iteration = 1, 2, ... do
2: Run the policy for T timesteps or N trajectories
3: At each timestep in each trajectory, compute the target $Q^\pi(s_t, a_t)$ and the baseline $b(s_t)$
4: Compute the estimated policy gradient $\hat{g}$
5: Update the policy using $\hat{g}$, potentially constrained to a local region
6: end for

Policy Gradient Summary. Policy gradient methods are an extremely popular and useful set of approaches, and prior knowledge can be put in through the choice of policy parameterization. You should be very familiar with REINFORCE and the policy gradient template on the prior slide, and understand where different estimators can be slotted in (and the implications for bias/variance). You do not have to be able to derive or remember the specific formulas in TRPO for approximating the objectives and constraints. You will have the opportunity to practice with these ideas in Homework 3.

Class Structure. Last time: Policy Search. This time: Policy Search. Next time: Midterm review.