Deep Reinforcement Learning via Policy Optimization


John Schulman. July 3, 2017

Introduction

Deep Reinforcement Learning: What to Learn?
Policies (select next action)
Value functions (measure goodness of states or state-action pairs)
Models (predict next states and rewards)

Model-Free RL: (Rough) Taxonomy
Policy Optimization: DFO / Evolution, Policy Gradients
Dynamic Programming: modified policy iteration, Policy Iteration, Value Iteration, Q-Learning
Actor-Critic Methods sit in between, combining policy gradients with learned value functions

Policy Optimization vs Dynamic Programming
Conceptually...
Policy optimization: optimize what you care about
Dynamic programming: indirect, exploit the problem structure, self-consistency
Empirically...
Policy optimization is more versatile; dynamic programming methods are more sample-efficient when they work
Policy optimization methods are more compatible with rich architectures (including recurrence) and with auxiliary objectives (tasks other than control); dynamic programming methods are more compatible with exploration and off-policy learning

Parameterized Policies
A family of policies indexed by parameter vector θ ∈ R^d
Deterministic: a = π(s, θ)
Stochastic: π(a | s, θ)
Analogous to classification or regression with input s, output a:
Discrete action space: network outputs a vector of probabilities
Continuous action space: network outputs the mean and diagonal covariance of a Gaussian

Episodic Setting
In each episode, the initial state is sampled from µ, and the agent acts until the terminal state is reached. For example:
Taxi robot reaches its destination (termination = good)
Waiter robot finishes a shift (fixed time)
Walking robot falls over (termination = bad)
Goal: maximize expected return per episode
    maximize_π E[R | π]

Derivative Free Optimization / Evolution

Cross Entropy Method
Initialize µ ∈ R^d, σ ∈ R^d
for iteration = 1, 2, ... do
    Collect n samples of θ_i ~ N(µ, diag(σ))
    Perform one episode with each θ_i, obtaining reward R_i
    Select the top p% of θ samples (e.g. p = 20), the elite set
    Fit a Gaussian distribution to the elite set, updating µ, σ
end for
Return the final µ.
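As a concrete illustration, here is a minimal NumPy sketch of the cross-entropy method as outlined above. The function evaluate_episode is a hypothetical stand-in for running one episode with parameters θ and returning its total reward; here σ is treated as a per-coordinate standard deviation rather than a covariance diagonal.

```python
import numpy as np

def cross_entropy_method(evaluate_episode, dim, n_iters=50, n_samples=100,
                         elite_frac=0.2, init_sigma=1.0):
    """Minimal CEM sketch: refit a diagonal Gaussian to the elite samples each iteration."""
    mu = np.zeros(dim)
    sigma = np.full(dim, init_sigma)
    n_elite = max(1, int(n_samples * elite_frac))
    for _ in range(n_iters):
        # Sample parameter vectors theta_i from the current Gaussian
        thetas = mu + sigma * np.random.randn(n_samples, dim)
        rewards = np.array([evaluate_episode(theta) for theta in thetas])
        # Keep the top-p% samples (the elite set) and refit mu, sigma
        elite = thetas[np.argsort(rewards)[-n_elite:]]
        mu = elite.mean(axis=0)
        sigma = elite.std(axis=0) + 1e-8  # small floor to avoid premature collapse
    return mu
```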

Cross Entropy Method
Sometimes works embarrassingly well
I. Szita and A. Lörincz. Learning Tetris using the noisy cross-entropy method. Neural Computation (2006)

Stochastic Gradient Ascent on Distribution
Let µ define a distribution over policy parameters: θ ~ P_µ(θ)
The return R depends on the policy parameter θ and noise ζ:
    maximize_µ E_{θ,ζ}[R(θ, ζ)]
R is unknown and possibly nondifferentiable
Score function gradient estimator:
    ∇_µ E_{θ,ζ}[R(θ, ζ)] = E_{θ,ζ}[∇_µ log P_µ(θ) R(θ, ζ)] ≈ (1/N) Σ_{i=1}^{N} ∇_µ log P_µ(θ_i) R_i

Stochastic Gradient Ascent on Distribution
Compare with the cross-entropy method:
Score function gradient:
    ∇_µ E_{θ,ζ}[R(θ, ζ)] ≈ (1/N) Σ_{i=1}^{N} ∇_µ log P_µ(θ_i) R_i
Cross-entropy method:
    maximize_µ (1/N) Σ_{i=1}^{N} log P_µ(θ_i) f(R_i),  where f(r) = 1[r above threshold]

Connection to Finite Differences
Suppose P_µ is a Gaussian distribution with mean µ and covariance σ^2 I:
    log P_µ(θ) = −||µ − θ||^2 / (2σ^2) + const
    ∇_µ log P_µ(θ) = (θ − µ)/σ^2
    R_i ∇_µ log P_µ(θ_i) = R_i (θ_i − µ)/σ^2
Suppose we do antithetic sampling, where we use pairs of samples θ_+ = µ + σz, θ_− = µ − σz:
    (1/2)[ R(µ + σz, ζ) ∇_µ log P_µ(θ_+) + R(µ − σz, ζ') ∇_µ log P_µ(θ_−) ]
        = (z/2σ)[ R(µ + σz, ζ) − R(µ − σz, ζ') ]
Using the same noise ζ for both evaluations reduces variance
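A small NumPy sketch of this antithetic (mirrored) sampling estimator, under the assumption that R(theta) is a black-box episode-return function; the names and sample counts here are illustrative.

```python
import numpy as np

def es_gradient(R, mu, sigma, n_pairs=50):
    """Antithetic score-function gradient estimate of grad_mu E_{theta~N(mu, sigma^2 I)}[R(theta)]."""
    grad = np.zeros_like(mu)
    for _ in range(n_pairs):
        z = np.random.randn(*mu.shape)
        r_plus = R(mu + sigma * z)    # evaluate at theta_+ = mu + sigma*z
        r_minus = R(mu - sigma * z)   # evaluate at theta_- = mu - sigma*z
        grad += (r_plus - r_minus) / (2.0 * sigma) * z
    return grad / n_pairs
```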

Deriving the Score Function Estimator
Score function gradient estimator:
    ∇_µ E_{θ,ζ}[R(θ, ζ)] = E_{θ,ζ}[∇_µ log P_µ(θ) R(θ, ζ)] ≈ (1/N) Σ_{i=1}^{N} ∇_µ log P_µ(θ_i) R_i
Derive by writing the expectation as an integral:
    ∇_µ ∫ dθ dζ P_µ(θ) R(θ, ζ)
        = ∫ dθ dζ ∇_µ P_µ(θ) R(θ, ζ)
        = ∫ dθ dζ P_µ(θ) ∇_µ log P_µ(θ) R(θ, ζ)
        = E_{θ,ζ}[∇_µ log P_µ(θ) R(θ, ζ)]
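As a quick sanity check of this identity, the following sketch compares the score-function estimate against an analytic gradient for a case where the expectation is known in closed form: R(θ) = θ² with θ ~ N(µ, 1), so E[R] = µ² + 1 and dE[R]/dµ = 2µ. The numbers are purely illustrative.

```python
import numpy as np

np.random.seed(0)
mu, N = 1.5, 200000
theta = mu + np.random.randn(N)      # theta ~ N(mu, 1)
R = theta ** 2                       # "return" R(theta) = theta^2
score = theta - mu                   # d/dmu log N(theta; mu, 1) = theta - mu
print(np.mean(score * R), 2 * mu)    # score-function estimate vs analytic gradient 2*mu
```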

Literature on DFO
Evolution strategies (Rechenberg and Eigen, 1973)
Simultaneous perturbation stochastic approximation (Spall, 1992)
Covariance matrix adaptation: a popular relative of CEM (Hansen, 2006)
Reward-weighted regression (Peters and Schaal, 2007), PoWER (Kober and Peters, 2007)

Success Stories
CMA is very effective for optimizing low-dimensional locomotion controllers
UT Austin Villa: RoboCup 2012 3D Simulation League Champion
Evolution Strategies was shown to perform well on Atari, competitive with policy gradient methods (Salimans et al., 2017)

Policy Gradient Methods

Overview
Problem: maximize E[R | π_θ]
Here we'll use a fixed policy parameter θ (instead of sampling θ ~ P_µ) and estimate the gradient with respect to θ
Noise is in action space rather than parameter space

Overview
Problem: maximize E[R | π_θ]
Intuitions: collect a bunch of trajectories, and...
1. Make the good trajectories more probable
2. Make the good actions more probable
3. Push the actions towards better actions

Score Function Gradient Estimator for Policies
Now the random variable is a whole trajectory τ = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_{T−1}, a_{T−1}, r_{T−1}, s_T):
    ∇_θ E_τ[R(τ)] = E_τ[∇_θ log P(τ | θ) R(τ)]
Just need to write out P(τ | θ):
    P(τ | θ) = µ(s_0) ∏_{t=0}^{T−1} [π(a_t | s_t, θ) P(s_{t+1}, r_t | s_t, a_t)]
    log P(τ | θ) = log µ(s_0) + Σ_{t=0}^{T−1} [log π(a_t | s_t, θ) + log P(s_{t+1}, r_t | s_t, a_t)]
    ∇_θ log P(τ | θ) = Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ)
    ∇_θ E_τ[R] = E_τ[ R Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) ]

Policy Gradient: Use Temporal Structure
Previous slide:
    ∇_θ E_τ[R] = E_τ[ (Σ_{t=0}^{T−1} r_t) (Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ)) ]
We can repeat the same argument to derive the gradient estimator for a single reward term r_{t'}:
    ∇_θ E[r_{t'}] = E[ r_{t'} Σ_{t=0}^{t'} ∇_θ log π(a_t | s_t, θ) ]
Summing this formula over t', we obtain:
    ∇_θ E[R] = E[ Σ_{t'=0}^{T−1} r_{t'} Σ_{t=0}^{t'} ∇_θ log π(a_t | s_t, θ) ]
             = E[ Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) Σ_{t'=t}^{T−1} r_{t'} ]

Policy Gradient: Introduce Baseline
Further reduce variance by introducing a baseline b(s):
    ∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) ( Σ_{t'=t}^{T−1} r_{t'} − b(s_t) ) ]
For any choice of b, the gradient estimator is unbiased.
A near-optimal choice is the expected return,
    b(s_t) ≈ E[ r_t + r_{t+1} + r_{t+2} + ... + r_{T−1} ]
Interpretation: increase the logprob of action a_t proportionally to how much the return Σ_{t'=t}^{T−1} r_{t'} is better than expected

Discounts for Variance Reduction
Introduce a discount factor γ, which ignores delayed effects between actions and rewards:
    ∇_θ E_τ[R] ≈ E_τ[ Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) ( Σ_{t'=t}^{T−1} γ^{t'−t} r_{t'} − b(s_t) ) ]
Now we want
    b(s_t) ≈ E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^{T−1−t} r_{T−1} ]
Write the gradient estimator more generally as
    ∇_θ E_τ[R] ≈ E_τ[ Σ_{t=0}^{T−1} ∇_θ log π(a_t | s_t, θ) Â_t ]
where Â_t is the advantage estimate

Vanilla Policy Gradient Algorithm
Initialize policy parameter θ, baseline b
for iteration = 1, 2, ... do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute the return R̂_t = Σ_{t'=t}^{T−1} γ^{t'−t} r_{t'} and the advantage estimate Â_t = R̂_t − b(s_t)
    Re-fit the baseline by minimizing Σ ||b(s_t) − R̂_t||^2, summed over all trajectories and timesteps
    Update the policy, using a policy gradient estimate ĝ, which is a sum of terms ∇_θ log π(a_t | s_t, θ) Â_t
end for
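A minimal PyTorch sketch of one update of this algorithm for a discrete-action policy, on data from a single episode. The network interfaces, shapes, and hyperparameters are illustrative assumptions, not part of the original slides; policy and baseline would typically be small nn.Module MLPs, with pi_opt and vf_opt e.g. torch.optim.Adam instances.

```python
import torch

def vanilla_pg_update(policy, baseline, pi_opt, vf_opt, obs, acts, rews, gamma=0.99):
    """obs: (T, obs_dim) float, acts: (T,) long, rews: (T,) float, all torch tensors."""
    # Discounted returns-to-go: R_t = sum_{t'>=t} gamma^(t'-t) r_t'
    returns = torch.zeros_like(rews)
    running = 0.0
    for t in reversed(range(len(rews))):
        running = rews[t] + gamma * running
        returns[t] = running
    # Advantage estimate: A_t = R_t - b(s_t)
    adv = (returns - baseline(obs).squeeze(-1)).detach()
    # Policy gradient step: ascend sum_t log pi(a_t|s_t) * A_t
    logp = torch.distributions.Categorical(logits=policy(obs)).log_prob(acts)
    pi_loss = -(logp * adv).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    # Re-fit the baseline by regression onto the empirical returns
    vf_loss = ((baseline(obs).squeeze(-1) - returns) ** 2).mean()
    vf_opt.zero_grad(); vf_loss.backward(); vf_opt.step()
```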

Advantage Actor-Critic
Use a neural network that represents both the policy π_θ and the value function V_θ (approximating V^{π_θ}).
Pseudocode:
for iteration = 1, 2, ... do
    Agent acts for T timesteps (e.g., T = 20)
    For each timestep t, compute
        R̂_t = r_t + γ r_{t+1} + ... + γ^{T−1−t} r_{T−1} + γ^{T−t} V_θ(s_T)
        Â_t = R̂_t − V_θ(s_t)
    R̂_t is the regression target for the value function; Â_t is the estimated advantage function
    Compute the loss gradient g = ∇_θ Σ_t [ log π_θ(a_t | s_t) Â_t − c (V_θ(s_t) − R̂_t)^2 ]
    g is plugged into a stochastic gradient ascent algorithm, e.g., Adam
end for
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. Asynchronous methods for deep reinforcement learning. (2016)
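A sketch of the n-step return and combined loss computation described above, in PyTorch. The shared-network interface (a model returning logits and a value estimate) and the coefficient c are illustrative assumptions; the loss is written as a quantity to minimize.

```python
import torch

def a2c_loss(model, obs, acts, rews, last_obs, gamma=0.99, c=0.5):
    """obs: (T, obs_dim), acts: (T,), rews: (T,); model(x) -> (logits, value)."""
    logits, values = model(obs)
    values = values.squeeze(-1)
    with torch.no_grad():
        _, last_value = model(last_obs.unsqueeze(0))   # bootstrap value V(s_T)
    # n-step returns: R_t = r_t + gamma r_{t+1} + ... + gamma^(T-t) V(s_T)
    returns = torch.zeros_like(rews)
    running = last_value.squeeze()
    for t in reversed(range(len(rews))):
        running = rews[t] + gamma * running
        returns[t] = running
    adv = (returns - values).detach()                  # A_t = R_t - V(s_t)
    logp = torch.distributions.Categorical(logits=logits).log_prob(acts)
    # Maximize log pi * A, minimize the value regression error
    return -(logp * adv).mean() + c * ((values - returns) ** 2).mean()
```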

Trust Region Policy Optimization
Motivation: make policy gradients more robust and sample efficient
Unlike in supervised learning, the policy affects the distribution of inputs, so a large bad update can be disastrous
Makes use of a surrogate objective that estimates the performance of the policy around the policy π_old used for sampling:
    L_{π_old}(π) = (1/N) Σ_{i=1}^{N} [ π(a_i | s_i) / π_old(a_i | s_i) ] Â_i    (1)
Differentiating this objective gives the policy gradient
L_{π_old}(π) is only accurate when the state distribution of π is close to that of π_old, thus it makes sense to constrain or penalize the distance D_KL[π_old || π]

Trust Region Policy Optimization
Pseudocode:
for iteration = 1, 2, ... do
    Run the policy for T timesteps or N trajectories
    Estimate the advantage function at all timesteps
    Solve
        maximize_θ Σ_{n=1}^{N} [ π_θ(a_n | s_n) / π_{θ_old}(a_n | s_n) ] Â_n
        subject to KL_{π_{θ_old}}(π_θ) ≤ δ
end for
Can solve the constrained optimization problem efficiently by using conjugate gradient
Closely related to natural policy gradients (Kakade, 2002), natural actor-critic (Peters and Schaal, 2005), REPS (Peters et al., 2010)

Proximal Policy Optimization
Use a penalty instead of a constraint:
    maximize_θ Σ_{n=1}^{N} [ π_θ(a_n | s_n) / π_{θ_old}(a_n | s_n) ] Â_n − β KL_{π_{θ_old}}(π_θ)
Pseudocode:
for iteration = 1, 2, ... do
    Run the policy for T timesteps or N trajectories
    Estimate the advantage function at all timesteps
    Do SGD on the above objective for some number of epochs
    If the KL is too high, increase β; if the KL is too low, decrease β
end for
Roughly the same performance as TRPO, but requires only first-order optimization
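A sketch of this adaptive-KL-penalty variant for a discrete-action policy in PyTorch. The target KL, penalty adaptation factors, and network interface are illustrative assumptions; the KL term is estimated from samples rather than computed exactly.

```python
import torch

def ppo_penalty_update(policy, optimizer, obs, acts, adv, old_logp, beta,
                       epochs=10, kl_target=0.01):
    """One PPO update with a KL penalty; obs (N, obs_dim), acts (N,), adv (N,), old_logp (N,)."""
    for _ in range(epochs):
        logp = torch.distributions.Categorical(logits=policy(obs)).log_prob(acts)
        ratio = torch.exp(logp - old_logp)               # pi_theta / pi_theta_old
        approx_kl = (old_logp - logp).mean()             # sample-based KL estimate
        loss = -(ratio * adv).mean() + beta * approx_kl  # maximize surrogate minus beta * KL
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    # Adapt the penalty coefficient based on the observed KL
    if approx_kl > 1.5 * kl_target:
        beta *= 2.0
    elif approx_kl < kl_target / 1.5:
        beta /= 2.0
    return beta
```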

Variance Reduction for Policy Gradients

Reward Shaping

Reward Shaping
Reward shaping: δ(s, a, s') = r(s, a, s') + γΦ(s') − Φ(s) for an arbitrary potential Φ
Theorem: δ admits the same optimal policies as r.¹
Proof sketch: suppose Q satisfies the Bellman equation (T Q = Q). If we transform r → δ, the policy's value function satisfies Q̃(s, a) = Q(s, a) − Φ(s).
Q satisfies the Bellman equation ⇒ Q̃ also satisfies the Bellman equation.
¹ A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. ICML, 1999.

Reward Shaping
Theorem: δ admits the same optimal policies as r. (Ng, Harada, and Russell, ICML 1999)
Alternative proof: the advantage function is invariant. Look at the effect on V^π and Q^π:
    E[δ_0 + γδ_1 + γ^2 δ_2 + ...]
        = E[(r_0 + γΦ(s_1) − Φ(s_0)) + γ(r_1 + γΦ(s_2) − Φ(s_1)) + γ^2(r_2 + γΦ(s_3) − Φ(s_2)) + ...]
        = E[r_0 + γ r_1 + γ^2 r_2 + ...] − Φ(s_0)
Thus
    Ṽ^π(s) = V^π(s) − Φ(s)
    Q̃^π(s, a) = Q^π(s, a) − Φ(s)
    Ã^π(s, a) = A^π(s, a)
and A^π(s, π(s)) = 0 at all states ⇒ π is optimal
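A tiny NumPy check of the telescoping identity used above: for any reward sequence and any potential Φ, the discounted sum of shaped rewards equals the discounted return plus γ^T Φ(s_T) − Φ(s_0), so values and advantages shift only by the potential. The numbers here are arbitrary and purely illustrative.

```python
import numpy as np

np.random.seed(1)
T, gamma = 10, 0.9
r = np.random.randn(T)          # arbitrary rewards r_0 ... r_{T-1}
phi = np.random.randn(T + 1)    # arbitrary potential Phi(s_0) ... Phi(s_T)

delta = r + gamma * phi[1:] - phi[:-1]                 # shaped rewards
disc = gamma ** np.arange(T)
lhs = np.sum(disc * delta)                             # discounted shaped return
rhs = np.sum(disc * r) + gamma ** T * phi[-1] - phi[0]
print(np.allclose(lhs, rhs))                           # True: the shaping terms telescope
```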

Reward Shaping and Problem Difficulty
Shape with Φ = V* ⇒ the problem is solved in one step of value iteration
Shaping leaves the policy gradient invariant (and just adds a baseline to the estimator):
    E[ ∇_θ log π_θ(a_0 | s_0) ( (r_0 + γΦ(s_1) − Φ(s_0)) + γ(r_1 + γΦ(s_2) − Φ(s_1)) + γ^2(r_2 + γΦ(s_3) − Φ(s_2)) + ... ) ]
    = E[ ∇_θ log π_θ(a_0 | s_0) ( r_0 + γ r_1 + γ^2 r_2 + ... − Φ(s_0) ) ]
    = E[ ∇_θ log π_θ(a_0 | s_0) ( r_0 + γ r_1 + γ^2 r_2 + ... ) ]

Reward Shaping and Policy Gradients
First note the connection between the shaped reward and the advantage function:
    E_{s_1}[ r_0 + γV^π(s_1) − V^π(s_0) | s_0 = s, a_0 = a ] = A^π(s, a)
Now, considering the policy gradient and ignoring all but the first shaped reward (i.e., pretend γ = 0 after shaping), we get
    E[ Σ_t ∇_θ log π_θ(a_t | s_t) δ_t ] = E[ Σ_t ∇_θ log π_θ(a_t | s_t) (r_t + γV^π(s_{t+1}) − V^π(s_t)) ]
    = E[ Σ_t ∇_θ log π_θ(a_t | s_t) A^π(s_t, a_t) ]

Reward Shaping and Policy Gradients
Compromise: use a more aggressive discount γλ, with λ ∈ (0, 1): called generalized advantage estimation
    Σ_t ∇_θ log π_θ(a_t | s_t) Σ_{k=0}^{∞} (γλ)^k δ_{t+k}
Or alternatively, use a hard cutoff as in A3C:
    Σ_t ∇_θ log π_θ(a_t | s_t) Σ_{k=0}^{n−1} γ^k δ_{t+k}
    = Σ_t ∇_θ log π_θ(a_t | s_t) ( Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n Φ(s_{t+n}) − Φ(s_t) )
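A short NumPy sketch of the generalized advantage estimate Â_t = Σ_k (γλ)^k δ_{t+k}, computed by the usual backward recursion; rewards and value estimates are assumed to come from a single trajectory, with values[T] the value of the final state (0 if terminal).

```python
import numpy as np

def gae_advantages(rews, values, gamma=0.99, lam=0.95):
    """rews: (T,), values: (T+1,); returns the advantage estimates A_0 ... A_{T-1}."""
    T = len(rews)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rews[t] + gamma * values[t + 1] - values[t]   # shaped reward / TD residual
        running = delta + gamma * lam * running               # A_t = delta_t + gamma*lam * A_{t+1}
        adv[t] = running
    return adv
```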

Reward Shaping Summary
The reward shaping transformation leaves the policy gradient and the optimal policy invariant
Shaping with Φ ≈ V^π makes the consequences of actions more immediate
Shaping, and then ignoring all but the first term, gives the policy gradient

Aside: Reward Shaping is Crucial in Practice
I. Mordatch, E. Todorov, and Z. Popović. Discovery of complex behaviors through contact-invariant optimization. ACM Transactions on Graphics (TOG) 31.4 (2012), p. 43
Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 4906–4913

Choosing Parameters γ, λ
[Figure: learning curves (cost vs. number of policy iterations, 0 to 500) on the 3D Biped task for various settings of γ ∈ {0.96, 0.98, 0.99, 0.995, 1} and λ ∈ {0.92, 0.96, 0.98, 0.99, 1.0}, including a run with no value function.]
(Generalized Advantage Estimation for Policy Gradients, Schulman et al., ICLR 2016)

Pathwise Derivative Methods

Deriving the Policy Gradient, Reparameterized
Episodic MDP as a computation graph: θ feeds the actions a_1, ..., a_T, the states evolve s_1 → s_2 → ... → s_T, and the trajectory produces total reward R_T.
We want to compute ∇_θ E[R_T]; so far we have used the score function ∇_θ log π(a_t | s_t; θ).
Reparameterize: a_t = π(s_t, z_t; θ), where z_t is noise from a fixed distribution. The graph now has a deterministic, differentiable path from θ through the actions to R_T.
This only works if P(s_{t+1} | s_t, a_t) is known, so that the path through the dynamics can be differentiated.

Using a Q-function
[Same reparameterized computation graph as above.]
    d/dθ E[R_T] = E[ Σ_{t=1}^{T} (dR_T/da_t)(da_t/dθ) ]
                = E[ Σ_{t=1}^{T} (d/da_t) E[R_T | a_t] (da_t/dθ) ]
                = E[ Σ_{t=1}^{T} (dQ(s_t, a_t)/da_t)(da_t/dθ) ]
                = E[ Σ_{t=1}^{T} d/dθ Q(s_t, π(s_t, z_t; θ)) ]

SVG(0) Algorithm
Learn Q_φ to approximate Q^{π,γ}, and use it to compute gradient estimates.
Pseudocode:
for iteration = 1, 2, ... do
    Execute policy π_θ to collect T timesteps of data
    Update π_θ using g ∝ ∇_θ Σ_{t=1}^{T} Q(s_t, π(s_t, z_t; θ))
    Update Q_φ using g ∝ ∇_φ Σ_{t=1}^{T} (Q_φ(s_t, a_t) − Q̂_t)^2, e.g. with TD(λ)
end for
N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. Learning Continuous Control Policies by Stochastic Value Gradients. arXiv preprint arXiv:1510.09142 (2015)
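A PyTorch sketch of the SVG(0)-style pathwise policy update, using a reparameterized Gaussian policy (a_t = mean_θ(s_t) + std_θ(s_t) · z_t) and a learned critic Q_φ; the network interfaces are illustrative assumptions.

```python
import torch

def svg0_policy_update(policy, q_func, pi_opt, obs):
    """policy(obs) -> (mean, log_std); q_func(obs, act) -> Q-value; obs: (N, obs_dim)."""
    mean, log_std = policy(obs)
    z = torch.randn_like(mean)              # fixed-distribution noise z_t
    actions = mean + log_std.exp() * z      # reparameterized a_t = pi(s_t, z_t; theta)
    # Ascend dQ/da * da/dtheta by minimizing -Q(s, pi(s, z; theta))
    loss = -q_func(obs, actions).mean()
    pi_opt.zero_grad(); loss.backward(); pi_opt.step()
```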

SVG(1) Algorithm
Instead of learning Q, we learn:
    a state-value function V ≈ V^{π,γ}
    a dynamics model f, approximating s_{t+1} = f(s_t, a_t) + ζ_t
Given a transition (s_t, a_t, s_{t+1}), infer ζ_t = s_{t+1} − f(s_t, a_t)
Then Q(s_t, a_t) = E[r_t + γV(s_{t+1})] = E[r_t + γV(f(s_t, a_t) + ζ_t)], with a_t = π(s_t, z_t; θ)

SVG(∞) Algorithm
Just learn the dynamics model f
Given a whole trajectory, infer all noise variables
Freeze all policy and dynamics noise, and differentiate through the entire deterministic computation graph

SVG Results
Applied to 2D robotics tasks
Overall: the different gradient estimators behave similarly
N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. Learning Continuous Control Policies by Stochastic Value Gradients. arXiv preprint arXiv:1510.09142 (2015)

Deterministic Policy Gradient
For Gaussian actions, the variance of the score function policy gradient estimator goes to infinity as the variance goes to zero
Intuition: finite difference gradient estimators
But the SVG(0) gradient is fine as σ → 0:
    Σ_t ∇_θ Q(s_t, π(s_t, θ, ζ_t))
Problem: there's no exploration.
Solution: add noise to the policy, but estimate Q with TD(0), so it's valid off-policy
The policy gradient is a little biased (even with Q = Q^π), but only because the state distribution is off; it gets the right gradient at every state
D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, et al. Deterministic Policy Gradient Algorithms. ICML, 2014

Deep Deterministic Policy Gradient
Incorporate the replay buffer and target network ideas from DQN for increased stability
Use lagged (Polyak-averaged) versions of Q_φ and π_θ to compute the targets for fitting Q_φ (towards Q^{π,γ}) with TD(0):
    Q̂_t = r_t + γ Q_{φ'}(s_{t+1}, π(s_{t+1}; θ'))
Pseudocode:
for iteration = 1, 2, ... do
    Act for several timesteps, add data to the replay buffer
    Sample a minibatch
    Update π_θ using g ∝ ∇_θ Σ_{t=1}^{T} Q(s_t, π(s_t, z_t; θ))
    Update Q_φ using g ∝ ∇_φ Σ_{t=1}^{T} (Q_φ(s_t, a_t) − Q̂_t)^2
end for
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
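A PyTorch sketch of the DDPG critic target and the Polyak-averaged (lagged) target-network update described above; target_q, target_pi, and the averaging coefficient tau are illustrative assumptions, and terminal-state masking is omitted for brevity.

```python
import torch

def ddpg_targets(target_q, target_pi, rews, next_obs, gamma=0.99):
    """Compute Q-hat_t = r_t + gamma * Q_phi'(s_{t+1}, pi_theta'(s_{t+1})) for a minibatch."""
    with torch.no_grad():
        next_actions = target_pi(next_obs)
        return rews + gamma * target_q(next_obs, next_actions).squeeze(-1)

def polyak_update(target_net, net, tau=0.005):
    """Lagged copy: phi' <- tau * phi + (1 - tau) * phi'."""
    for p_targ, p in zip(target_net.parameters(), net.parameters()):
        p_targ.data.mul_(1 - tau).add_(tau * p.data)
```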

DDPG Results
Applied to 2D and 3D robotics tasks and driving with pixel input
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, et al. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)

Policy Gradient Methods: Comparison
Two kinds of policy gradient estimator:
REINFORCE / score function estimator: ∇ log π(a | s) · Â
    Learn Q or V for variance reduction, to estimate Â
Pathwise derivative estimators (differentiate with respect to the action):
    SVG(0) / DPG: d/da Q(s, a) (learn Q)
    SVG(1): d/da (r + γV(s')) (learn f, V)
    SVG(∞): d/da_t (r_t + γ r_{t+1} + γ^2 r_{t+2} + ...) (learn f)
Pathwise derivative methods are more sample-efficient when they work (maybe), but work less generally due to high bias

Thanks! Questions?