Policy Gradient Methods. February 13, 2017


Policy Optimization Problems

maximize $\mathbb{E}_\pi[\text{expression}]$ over $\pi$

Fixed-horizon episodic: $\sum_{t=0}^{T-1} r_t$
Average-cost: $\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} r_t$
Infinite-horizon discounted: $\sum_{t=0}^{\infty} \gamma^t r_t$
Variable-length undiscounted: $\sum_{t=0}^{T_{\text{terminal}}-1} r_t$
Infinite-horizon undiscounted: $\sum_{t=0}^{\infty} r_t$

Episodic Setting

$s_0 \sim \mu(s_0)$
$a_0 \sim \pi(a_0 \mid s_0)$
$s_1, r_0 \sim P(s_1, r_0 \mid s_0, a_0)$
$a_1 \sim \pi(a_1 \mid s_1)$
$s_2, r_1 \sim P(s_2, r_1 \mid s_1, a_1)$
$\dots$
$a_{T-1} \sim \pi(a_{T-1} \mid s_{T-1})$
$s_T, r_{T-1} \sim P(s_T, r_{T-1} \mid s_{T-1}, a_{T-1})$

Objective: maximize $\eta(\pi)$, where $\eta(\pi) = \mathbb{E}[r_0 + r_1 + \cdots + r_{T-1} \mid \pi]$

Episodic Setting

(Diagram: the agent $\pi$ emits actions $a_0, \dots, a_{T-1}$; the environment $P$ returns states $s_1, \dots, s_T$ and rewards $r_0, \dots, r_{T-1}$, with the initial state drawn from $\mu_0$.)

Objective: maximize $\eta(\pi)$, where $\eta(\pi) = \mathbb{E}[r_0 + r_1 + \cdots + r_{T-1} \mid \pi]$

Parameterized Policies

A family of policies indexed by a parameter vector $\theta \in \mathbb{R}^d$:
Deterministic: $a = \pi(s, \theta)$
Stochastic: $\pi(a \mid s, \theta)$
Analogous to classification or regression with input $s$ and output $a$:
Discrete action space: the network outputs a vector of probabilities
Continuous action space: the network outputs the mean and diagonal covariance of a Gaussian
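
A minimal sketch of these two parameterizations, written in PyTorch rather than the lecture's own code; the network sizes and the names obs_dim, n_actions, and act_dim are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """Discrete actions: the network outputs unnormalized log-probabilities."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class GaussianPolicy(nn.Module):
    """Continuous actions: the network outputs the mean; a separate parameter
    holds the diagonal log standard deviation."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                      nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return torch.distributions.Normal(self.mean_net(obs), self.log_std.exp())

# Usage: dist = policy(obs); a = dist.sample(); logp = dist.log_prob(a)
# (for the Gaussian case, sum log_prob over the action dimension).
```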

Policy Gradient Methods: Overview

Problem: maximize $\mathbb{E}[R \mid \pi_\theta]$
Intuitions: collect a bunch of trajectories, and...
1. Make the good trajectories more probable [1]
2. Make the good actions more probable
3. Push the actions towards good actions (DPG [2], SVG [3])

[1] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning (1992); R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. NIPS. MIT Press, 2000.
[2] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, et al. Deterministic Policy Gradient Algorithms. ICML, 2014.
[3] N. Heess, G. Wayne, D. Silver, T. Lillicrap, Y. Tassa, et al. Learning Continuous Control Policies by Stochastic Value Gradients. arXiv preprint arXiv:1510.09142 (2015).

Score Function Gradient Estimator

Consider an expectation $\mathbb{E}_{x \sim p(x \mid \theta)}[f(x)]$. We want its gradient with respect to $\theta$:

$\nabla_\theta \mathbb{E}_x[f(x)] = \nabla_\theta \int dx\, p(x \mid \theta) f(x)$
$= \int dx\, \nabla_\theta p(x \mid \theta) f(x)$
$= \int dx\, p(x \mid \theta) \frac{\nabla_\theta p(x \mid \theta)}{p(x \mid \theta)} f(x)$
$= \int dx\, p(x \mid \theta) \nabla_\theta \log p(x \mid \theta) f(x)$
$= \mathbb{E}_x[f(x)\, \nabla_\theta \log p(x \mid \theta)]$

The last expression gives us an unbiased gradient estimator: just sample $x_i \sim p(x \mid \theta)$ and compute $\hat{g}_i = f(x_i)\, \nabla_\theta \log p(x_i \mid \theta)$.
We need to be able to compute and differentiate the density $p(x \mid \theta)$ with respect to $\theta$.
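
A quick numerical sanity check of this estimator (my own example, not from the slides): take $p(x \mid \theta) = \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, so the true gradient is $\frac{d}{d\theta}\mathbb{E}[x^2] = \frac{d}{d\theta}(\theta^2 + 1) = 2\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 200_000
x = rng.normal(theta, 1.0, size=n)   # x_i ~ p(x | theta)
f = x ** 2                           # f(x_i)
score = x - theta                    # grad_theta log N(x | theta, 1)
g_hat = np.mean(f * score)           # score function (likelihood ratio) estimator
print(g_hat, 2 * theta)              # ~3.0 vs. exactly 3.0
```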

Derivation via Importance Sampling

An alternative derivation uses importance sampling [4]:

$\mathbb{E}_{x \sim \theta}[f(x)] = \mathbb{E}_{x \sim \theta_{\text{old}}}\!\left[\frac{p(x \mid \theta)}{p(x \mid \theta_{\text{old}})}\, f(x)\right]$
$\nabla_\theta \mathbb{E}_{x \sim \theta}[f(x)] = \mathbb{E}_{x \sim \theta_{\text{old}}}\!\left[\frac{\nabla_\theta p(x \mid \theta)}{p(x \mid \theta_{\text{old}})}\, f(x)\right]$
$\nabla_\theta \mathbb{E}_{x \sim \theta}[f(x)]\Big|_{\theta = \theta_{\text{old}}} = \mathbb{E}_{x \sim \theta_{\text{old}}}\!\left[\frac{\nabla_\theta p(x \mid \theta)\big|_{\theta = \theta_{\text{old}}}}{p(x \mid \theta_{\text{old}})}\, f(x)\right] = \mathbb{E}_{x \sim \theta_{\text{old}}}\!\left[\nabla_\theta \log p(x \mid \theta)\big|_{\theta = \theta_{\text{old}}}\, f(x)\right]$

[4] J. Tang and P. Abbeel. On a connection between importance sampling and the likelihood ratio policy gradient. Advances in Neural Information Processing Systems, 2010, pp. 1000-1008.

Score Function Gradient Estimator: Intuition

$\hat{g}_i = f(x_i)\, \nabla_\theta \log p(x_i \mid \theta)$

Let's say that $f(x)$ measures how good the sample $x$ is.
Moving in the direction $\hat{g}_i$ pushes up the log-probability of the sample, in proportion to how good it is.
This is valid even if $f(x)$ is discontinuous or unknown, or the sample space (containing $x$) is a discrete set.


Score Function Gradient Estimator for Policies

Now the random variable $x$ is a whole trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$:

$\nabla_\theta \mathbb{E}_\tau[R(\tau)] = \mathbb{E}_\tau[\nabla_\theta \log p(\tau \mid \theta)\, R(\tau)]$

We just need to write out $p(\tau \mid \theta)$:

$p(\tau \mid \theta) = \mu(s_0) \prod_{t=0}^{T-1} \big[\pi(a_t \mid s_t, \theta)\, P(s_{t+1}, r_t \mid s_t, a_t)\big]$
$\log p(\tau \mid \theta) = \log \mu(s_0) + \sum_{t=0}^{T-1} \big[\log \pi(a_t \mid s_t, \theta) + \log P(s_{t+1}, r_t \mid s_t, a_t)\big]$
$\nabla_\theta \log p(\tau \mid \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)$
$\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\!\left[R \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\right]$

Interpretation: use good trajectories (high $R$) as supervised examples, as in classification / regression.

Policy Gradient: Use Temporal Structure

Previous slide:

$\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\!\left[\left(\sum_{t=0}^{T-1} r_t\right)\left(\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\right)\right]$

We can repeat the same argument to derive the gradient estimator for a single reward term $r_{t'}$:

$\nabla_\theta \mathbb{E}[r_{t'}] = \mathbb{E}\!\left[r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\right]$

Summing this formula over $t'$, we obtain

$\nabla_\theta \mathbb{E}[R] = \mathbb{E}\!\left[\sum_{t'=0}^{T-1} r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\right] = \mathbb{E}\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \sum_{t'=t}^{T-1} r_{t'}\right]$
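
A minimal sketch of the reward-to-go terms $\sum_{t' \ge t} r_{t'}$ that multiply each $\nabla_\theta \log \pi$ term above; it assumes the rewards of one episode are stored in a 1-D array, which is my own layout choice.

```python
import numpy as np

def rewards_to_go(rewards):
    """out[t] = r_t + r_{t+1} + ... + r_{T-1}, computed with a backward pass."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out

print(rewards_to_go(np.array([1.0, 0.0, 2.0])))  # [3.0, 2.0, 2.0]
```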

Policy Gradient: Introduce Baseline

Further reduce variance by introducing a baseline $b(s)$:

$\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]$

For any choice of $b$, the gradient estimator is unbiased.
A near-optimal choice is the expected return, $b(s_t) \approx \mathbb{E}[r_t + r_{t+1} + r_{t+2} + \cdots + r_{T-1}]$.
Interpretation: increase the log-probability of action $a_t$ proportionally to how much the return $\sum_{t'=t}^{T-1} r_{t'}$ is better than expected.
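
A small numerical illustration of the two claims above (my own toy example, not from the slides): for a 2-armed bandit with a softmax policy and deterministic per-action rewards, the estimator keeps the same mean with or without the baseline $b = \mathbb{E}[r]$, but its variance drops.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1])
p = np.exp(theta) / np.exp(theta).sum()        # pi(a) = softmax(theta)
r = np.array([1.0, 0.0])                       # reward of each action
b = (p * r).sum()                              # baseline = expected reward

a = rng.choice(2, size=200_000, p=p)           # a_i ~ pi
score = np.eye(2)[a] - p                       # grad_theta log pi(a) for softmax
g_plain = r[a][:, None] * score                # estimator without baseline
g_base = (r[a] - b)[:, None] * score           # estimator with baseline
print(g_plain.mean(axis=0), g_base.mean(axis=0))  # same mean (unbiased)
print(g_plain.var(axis=0), g_base.var(axis=0))    # baseline version has lower variance
```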

Baseline Derivation

$\mathbb{E}_\tau[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t)]$
$= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[\mathbb{E}_{s_{(t+1):T},\, a_{t:(T-1)}}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t)]\right]$   (break up expectation)
$= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[b(s_t)\, \mathbb{E}_{s_{(t+1):T},\, a_{t:(T-1)}}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)]\right]$   (pull baseline term out)
$= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\!\left[b(s_t)\, \mathbb{E}_{a_t}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)]\right]$   (remove irrelevant variables)
$= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}[b(s_t) \cdot 0]$

The last equality holds because $0 = \nabla_\theta \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}[1] = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)]$.

Discounts for Variance Reduction

Introduce a discount factor $\gamma$, which ignores delayed effects between actions and rewards:

$\nabla_\theta \mathbb{E}_\tau[R] \approx \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \left(\sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'} - b(s_t)\right)\right]$

Now we want $b(s_t) \approx \mathbb{E}\!\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-1-t} r_{T-1}\right]$.

Vanilla Policy Gradient Algorithm

Initialize policy parameter $\theta$, baseline $b$
for iteration = 1, 2, ... do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute the return $R_t = \sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'}$ and the advantage estimate $\hat{A}_t = R_t - b(s_t)$.
    Re-fit the baseline by minimizing $\|b(s_t) - R_t\|^2$, summed over all trajectories and timesteps.
    Update the policy using a policy gradient estimate $\hat{g}$, which is a sum of terms $\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, \hat{A}_t$. (Plug $\hat{g}$ into SGD or Adam.)
end for
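
A self-contained toy sketch of this loop, under assumptions of my own: a 3-state, 2-action MDP with random transitions, a linear-softmax policy, and a simple time-dependent mean-return baseline instead of a learned $b(s_t)$. It is purely illustrative, not the lecture's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T, gamma, lr = 3, 2, 10, 0.99, 0.1
theta = np.zeros((n_states, n_actions))       # one row of logits per state

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for iteration in range(100):
    # Collect a set of trajectories with the current policy
    episodes = []
    for _ in range(20):
        s, ep = 0, []
        for t in range(T):
            p = softmax(theta[s])
            a = rng.choice(n_actions, p=p)
            r = 1.0 if (a == 0) == (s == 0) else 0.0   # toy reward
            ep.append((s, a, r))
            s = int(rng.integers(n_states))            # toy random transitions
        episodes.append(ep)

    # Discounted returns-to-go R_t and a time-dependent baseline
    R = np.array([[sum(gamma**(k - t) * ep[k][2] for k in range(t, T))
                   for t in range(T)] for ep in episodes])
    baseline = R.mean(axis=0)                          # "re-fit the baseline"

    # Policy gradient estimate: sum of grad log pi(a_t|s_t) * A_hat_t
    grad = np.zeros_like(theta)
    for i, ep in enumerate(episodes):
        for t, (s, a, r) in enumerate(ep):
            adv = R[i, t] - baseline[t]
            g_logp = -softmax(theta[s])
            g_logp[a] += 1.0                           # grad of log softmax wrt theta[s]
            grad[s] += g_logp * adv
    theta += lr * grad / len(episodes)                 # plain gradient ascent step
```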

Practical Implementation with Autodiff

The usual formula $\sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t$ is inefficient: we want to batch the data.
Define a surrogate function using the data from the current batch:

$L(\theta) = \sum_t \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t$

Then the policy gradient estimator is $\hat{g} = \nabla_\theta L(\theta)$.
We can also include the value function fit error:

$L(\theta) = \sum_t \left(\log \pi(a_t \mid s_t; \theta)\, \hat{A}_t - \|V(s_t) - \hat{R}_t\|^2\right)$
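
A minimal PyTorch sketch of this surrogate-loss trick. Here `policy` is assumed to return a torch distribution object (as in the earlier policy sketch), `value_net`, `vf_coef`, and the batched tensors `obs`, `acts`, `advs`, `returns` are my own assumed names; the sign is flipped so that a minimizer performs ascent on $L(\theta)$.

```python
import torch

def surrogate_loss(policy, value_net, obs, acts, advs, returns, vf_coef=1.0):
    """Negated surrogate: minimizing this with SGD/Adam ascends L(theta)."""
    dist = policy(obs)                                  # e.g. a Categorical distribution
    logp = dist.log_prob(acts)                          # log pi(a_t | s_t; theta), shape (N,)
    pg_loss = -(logp * advs).mean()                     # -(1/N) sum log pi * A_hat
    vf_loss = ((value_net(obs).squeeze(-1) - returns) ** 2).mean()
    return pg_loss + vf_coef * vf_loss

# After loss = surrogate_loss(...), calling loss.backward() leaves the batched
# policy gradient estimate g_hat (negated) in the parameters' .grad fields.
```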

Value Functions

$Q^{\pi,\gamma}(s, a) = \mathbb{E}_\pi\!\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, a_0 = a\right]$ is called the Q-function or state-action value function.
$V^{\pi,\gamma}(s) = \mathbb{E}_\pi\!\left[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s\right] = \mathbb{E}_{a \sim \pi}[Q^{\pi,\gamma}(s, a)]$ is called the state-value function.
$A^{\pi,\gamma}(s, a) = Q^{\pi,\gamma}(s, a) - V^{\pi,\gamma}(s)$ is called the advantage function.

Policy Gradient Formulas with Value Functions

Recall:

$\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \left(\sum_{t'=t}^{T-1} r_{t'} - b(s_t)\right)\right]$
$\approx \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \left(\sum_{t'=t}^{T-1} \gamma^{t'-t} r_{t'} - b(s_t)\right)\right]$

Using value functions:

$\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, Q^\pi(s_t, a_t)\right]$
$= \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, A^\pi(s_t, a_t)\right]$
$\approx \mathbb{E}_\tau\!\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, A^{\pi,\gamma}(s_t, a_t)\right]$

We can plug in an advantage estimator $\hat{A}$ for $A^{\pi,\gamma}$; advantage estimators have the form (Return) $- V(s)$.

Value Functions in the Future

The baseline accounts for and removes the effect of past actions.
We can also use the value function to estimate future rewards:

$\hat{R}_t^{(1)} = r_t + \gamma V(s_{t+1})$   (cut off at one timestep)
$\hat{R}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$   (cut off at two timesteps)
...
$\hat{R}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$   (infinitely many timesteps, no $V$)

Value Functions in the Future

Subtracting out baselines, we get advantage estimators:

$\hat{A}_t^{(1)} = r_t + \gamma V(s_{t+1}) - V(s_t)$
$\hat{A}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t)$
...
$\hat{A}_t^{(\infty)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots - V(s_t)$

$\hat{A}_t^{(1)}$ has low variance but high bias; $\hat{A}_t^{(\infty)}$ has high variance but low bias.
Using an intermediate $k$ (say, 20) gives an intermediate amount of bias and variance.
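
A minimal numpy sketch of the $k$-step advantage estimator $\hat{A}_t^{(k)}$ defined above. The data layout is my own assumption: `rewards` holds one episode's rewards and `values` holds $V(s_0), \dots, V(s_T)$ (one extra entry for the final state, zero if terminal).

```python
import numpy as np

def k_step_advantage(rewards, values, gamma, k):
    """A_hat^(k)_t = r_t + ... + gamma^{k-1} r_{t+k-1} + gamma^k V(s_{t+k}) - V(s_t),
    truncated (and bootstrapped from the last available V) at the episode end."""
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        kk = min(k, T - t)                              # don't run past the last reward
        ret = sum(gamma**i * rewards[t + i] for i in range(kk))
        ret += gamma**kk * values[t + kk]               # bootstrap with the value function
        adv[t] = ret - values[t]
    return adv
```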

Discounts: Connection to MPC

MPC: $\text{maximize}_a\, Q^{\pi,T}(s, a)$
Discounted policy gradient: $\text{maximize}_a\, Q^{\pi,\gamma}(s, a)$

$\mathbb{E}_{a \sim \pi}[Q^{\pi,\gamma}(s, a)\, \nabla_\theta \log \pi(a \mid s; \theta)] = 0$ when $a \in \arg\max_a Q^{\pi,\gamma}(s, a)$

Application: Robot Locomotion

Finite-Horizon Methods: Advantage Actor-Critic

A2C / A3C uses this fixed-horizon advantage estimator. (Note: the asynchrony is only for speed; it doesn't improve performance.)

Pseudocode
for iteration = 1, 2, ... do
    Agent acts for $T$ timesteps (e.g., $T = 20$).
    For each timestep $t$, compute
        $\hat{R}_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{T-1-t} r_{T-1} + \gamma^{T-t} V(s_T)$
        $\hat{A}_t = \hat{R}_t - V(s_t)$
    $\hat{R}_t$ is the target value function in a regression problem; $\hat{A}_t$ is the estimated advantage function.
    Compute the loss gradient $g = \nabla_\theta \sum_{t=1}^{T} \left[-\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t + c\,(V(s_t) - \hat{R}_t)^2\right]$
    $g$ is plugged into a stochastic gradient descent variant, e.g., Adam.
end for

V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. Asynchronous methods for deep reinforcement learning. (2016)
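
A minimal numpy sketch of the $T$-step targets above, computed with the usual backward recursion $\hat{R}_t = r_t + \gamma \hat{R}_{t+1}$ seeded with $V(s_T)$. The layout (`rewards` of length $T$, `values` of length $T+1$) is my own assumption, not the paper's code.

```python
import numpy as np

def a2c_targets(rewards, values, gamma):
    """Return (R_hat, A_hat) for one T-step rollout; values[-1] = V(s_T)."""
    T = len(rewards)
    targets = np.zeros(T)
    running = values[T]                  # bootstrap from the value of the last state
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        targets[t] = running
    advantages = targets - values[:T]
    return targets, advantages
```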

A3C Video

A3C Results

(Plots: score vs. training time in hours on Beamrider, Breakout, Pong, Q*bert, and Space Invaders, comparing DQN, 1-step Q, 1-step SARSA, n-step Q, and A3C.)

Further Reading

A nice intuitive explanation of policy gradients: http://karpathy.github.io/2016/05/31/rl/
R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning (1992); R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. NIPS. MIT Press, 2000.
My thesis has a decent self-contained introduction to policy gradient methods: http://joschu.net/docs/thesis.pdf
A3C paper: V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, et al. Asynchronous methods for deep reinforcement learning. (2016)