Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More. March 8, 2017


Defining a Loss Function for RL

Let $\eta(\pi)$ denote the expected return of $\pi$:
$$\eta(\pi) = \mathbb{E}_{s_0 \sim \rho_0,\, a_t \sim \pi(\cdot \mid s_t)}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
We collect data with $\pi_{\text{old}}$, and want to optimize some objective to get a new policy $\pi$.

A useful identity [1]:
$$\eta(\pi) = \eta(\pi_{\text{old}}) + \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi_{\text{old}}}(s_t, a_t)\right]$$

[1] S. Kakade and J. Langford. Approximately Optimal Approximate Reinforcement Learning. ICML, 2002.

Proof of Useful Identity

First note that $A^{\pi_{\text{old}}}(s, a) = \mathbb{E}_{s' \sim P(s' \mid s, a)}\left[r(s) + \gamma V^{\pi_{\text{old}}}(s') - V^{\pi_{\text{old}}}(s)\right]$. Then
$$\begin{aligned}
\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi_{\text{old}}}(s_t, a_t)\right]
&= \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t) + \gamma V^{\pi_{\text{old}}}(s_{t+1}) - V^{\pi_{\text{old}}}(s_t)\right)\right] \\
&= \mathbb{E}_{\tau \sim \pi}\left[-V^{\pi_{\text{old}}}(s_0) + \sum_{t=0}^{\infty} \gamma^t r(s_t)\right] \\
&= -\mathbb{E}_{s_0}\left[V^{\pi_{\text{old}}}(s_0)\right] + \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right] \\
&= -\eta(\pi_{\text{old}}) + \eta(\pi)
\end{aligned}$$
where the second equality uses the fact that the sum telescopes.

Surrogate Loss Function

We want to manipulate $\eta(\pi)$ into an objective that we can estimate from sampled data:
$$\eta(\pi) = \text{const} + \mathbb{E}_{s \sim \pi,\, a \sim \pi}\left[A^{\pi_{\text{old}}}(s, a)\right] = \text{const} + \mathbb{E}_{s \sim \pi,\, a \sim \pi_{\text{old}}}\left[\frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)} A^{\pi_{\text{old}}}(s, a)\right]$$
Define $L_{\pi_{\text{old}}}(\pi)$ to be the surrogate objective that ignores the change in state distribution:
$$L_{\pi_{\text{old}}}(\pi) = \mathbb{E}_{s \sim \pi_{\text{old}},\, a \sim \pi}\left[A^{\pi_{\text{old}}}(s, a)\right] = \mathbb{E}_{s \sim \pi_{\text{old}},\, a \sim \pi_{\text{old}}}\left[\frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)} A^{\pi_{\text{old}}}(s, a)\right]$$
It matches $\eta$ to first order for a parameterized policy:
$$\nabla_\theta L_{\pi_{\text{old}}}(\pi_\theta)\big|_{\theta_{\text{old}}} = \mathbb{E}_{s, a \sim \pi_{\text{old}}}\left[\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)} A^{\pi_{\text{old}}}(s, a)\right]\bigg|_{\theta_{\text{old}}} = \mathbb{E}_{s, a \sim \pi_{\text{old}}}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_{\text{old}}}(s, a)\right]\bigg|_{\theta_{\text{old}}} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta = \theta_{\text{old}}}$$
So $L_{\pi_{\text{old}}}$ is a local approximation to the performance of the policy.

Improvement Theory

Theory: bound the difference between $L_{\pi_{\text{old}}}(\pi)$ and $\eta(\pi)$, the performance of the policy (the error comes from ignoring the change in state distribution).

Result: $\eta(\pi) \geq L_{\pi_{\text{old}}}(\pi) - C \max_s \text{KL}\left[\pi_{\text{old}}(\cdot \mid s),\, \pi(\cdot \mid s)\right]$, where $C = 2\epsilon\gamma/(1-\gamma)^2$.

Maximizing this lower bound guarantees monotonic improvement (an MM, i.e. minorize-maximize, algorithm).

Practical Approximations

Theory says we should maximize $L_{\pi_{\text{old}}}(\pi) - C \max_s \text{KL}\left[\pi_{\text{old}}(\cdot \mid s),\, \pi(\cdot \mid s)\right]$. Approximations:

Estimate $L_{\pi_{\text{old}}}(\pi)$ using trajectories sampled from $\pi_{\text{old}}$: $\hat{L}_{\pi_{\text{old}}}(\pi) = \sum_n \frac{\pi(a_n \mid s_n)}{\pi_{\text{old}}(a_n \mid s_n)} \hat{A}_n$. If just the gradient is needed, one can instead use $\hat{L}_{\pi_{\text{old}}}(\pi) = \sum_n \log \pi(a_n \mid s_n)\, \hat{A}_n$.

Use the mean KL divergence $\mathbb{E}_{s \sim \pi_{\text{old}}}\left[\text{KL}\left[\pi_{\text{old}}(\cdot \mid s),\, \pi(\cdot \mid s)\right]\right]$ instead of $\max_s \text{KL}[\ldots]$. Define $\overline{\text{KL}}_{\pi_{\text{old}}}(\pi) = \mathbb{E}_{s \sim \pi_{\text{old}}}\left[\text{KL}\left[\pi_{\text{old}}(\cdot \mid s),\, \pi(\cdot \mid s)\right]\right]$ and use the sample estimate $\sum_n \text{KL}\left[\pi_{\text{old}}(\cdot \mid s_n),\, \pi(\cdot \mid s_n)\right]$.

The theoretical C is too pessimistic (it provides a guarantee on the discounted return). Natural policy gradient and PPO use a fixed or adaptive coefficient C; TRPO instead uses a hard constraint on the KL divergence.
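As an illustration only (not code from the lecture), here is a minimal NumPy sketch of the sample-based estimators above. It assumes we have stored the log-probabilities of the sampled actions under the old and new policies and the advantage estimates; the function name and the single-sample KL estimator are my own choices.

import numpy as np

def estimate_surrogate_and_kl(logp_new, logp_old, adv):
    """Sample estimates of the surrogate objective and the mean KL (illustrative sketch).

    logp_new, logp_old: arrays of log pi(a_n|s_n) under the new and old policies
    adv: advantage estimates A_hat_n
    """
    ratio = np.exp(logp_new - logp_old)     # importance ratio pi / pi_old
    surrogate = np.mean(ratio * adv)        # L_hat, averaged over samples
    # Crude single-sample estimator of KL[pi_old || pi], valid because a_n ~ pi_old;
    # for discrete policies the exact per-state KL can be computed instead.
    kl = np.mean(logp_old - logp_new)
    return surrogate, kl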

Solving the KL-Penalized Problem

$$\text{maximize}_\theta \;\; L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) - C \cdot \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta)$$
Make a linear approximation to $L_{\pi_{\theta_{\text{old}}}}$ and a quadratic approximation to the KL term:
$$\text{maximize}_\theta \;\; g \cdot (\theta - \theta_{\text{old}}) - \frac{C}{2} (\theta - \theta_{\text{old}})^T F (\theta - \theta_{\text{old}})$$
where $g = \nabla_\theta L_{\pi_{\theta_{\text{old}}}}(\pi_\theta)\big|_{\theta = \theta_{\text{old}}}$ and $F = \nabla^2_\theta \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta)\big|_{\theta = \theta_{\text{old}}}$.

The quadratic part of L is negligible compared to the KL term. F is positive semidefinite, but that would not hold if we included the Hessian of L.

Solution: $\theta - \theta_{\text{old}} = \frac{1}{C} F^{-1} g$.
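A quick sanity check of the closed-form solution on a toy two-dimensional problem, with made-up values for g, F, and C (in practice F is never formed explicitly; see the conjugate gradient discussion next):

import numpy as np

g = np.array([1.0, -0.5])                # gradient of the surrogate at theta_old
F = np.array([[2.0, 0.3], [0.3, 1.0]])   # KL Hessian (Fisher) at theta_old
C = 10.0                                 # penalty coefficient

# Maximizer of g.(theta - theta_old) - (C/2)(theta - theta_old)^T F (theta - theta_old)
step = np.linalg.solve(F, g) / C         # equals (1/C) F^{-1} g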

Solving Linear Systems Using Conjugate Gradient

Previous slide: $\theta - \theta_{\text{old}} = \frac{1}{C} F^{-1} g$. We don't want to form the full Hessian matrix $F = \nabla^2_\theta \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta)\big|_{\theta = \theta_{\text{old}}}$ (memory and computation). We can compute $F^{-1} g$ approximately using the conjugate gradient algorithm without forming F explicitly.
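A minimal NumPy sketch of the conjugate gradient solver (not the lecture's code); matvec would be the Fisher/Hessian-vector product in TRPO, and the iteration count and tolerance are illustrative defaults:

import numpy as np

def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve A x = b given only the matrix-vector product x -> A x."""
    x = np.zeros_like(b)
    r = b.copy()                  # residual b - A x (x starts at zero)
    p = r.copy()                  # search direction
    rr = r.dot(r)
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rr / p.dot(Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r.dot(r)
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x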

Truncated Newton Method

The conjugate gradient algorithm approximately solves $x = A^{-1} b$ without explicitly forming the matrix A; it only reads A through matrix-vector products $v \mapsto Av$. After k iterations, CG has minimized $\frac{1}{2} x^T A x - b \cdot x$ in the subspace spanned by $b, Ab, A^2 b, \ldots, A^{k-1} b$.

Given a vector v with the same dimension as $\theta$, we want to compute $H^{-1} v$, where $H = \nabla^2_\theta f(\theta)$. To perform CG we only need Hessian-vector products $v \mapsto Hv$, which we can form using autodiff software like TensorFlow. Example (TensorFlow 1.x style):

import tensorflow as tf

theta = tf.placeholder(tf.float32, [dim_theta])        # parameter vector
f_of_theta = ...                                       # scalar objective built from theta
vector = tf.placeholder(tf.float32, [dim_theta])       # the v in v -> Hv
gradient = tf.gradients(f_of_theta, theta)[0]
gradient_vector_product = tf.reduce_sum(gradient * vector)
hessian_vector_product = tf.gradients(gradient_vector_product, theta)[0]

The Hessian-vector product computation takes 1-2 times as long as the gradient computation.

S. J. Wright and J. Nocedal. Numerical Optimization. Springer New York, 1999.

Truncated Newton Method (continued)

The Hessian-vector product function for F can be formed as follows when F is computed as the Hessian of the KL divergence:

import tensorflow as tf

theta = tf.placeholder(tf.float32, [dim_theta])        # parameter vector
actprob = ...                                          # batchsize x n_actions, e.g. softmax outputs of a neural network
oldactprob = tf.stop_gradient(actprob)                 # treat old-policy probabilities as constants
vector = tf.placeholder(tf.float32, [dim_theta])
kl = tf.reduce_mean(tf.reduce_sum(oldactprob * tf.log(oldactprob / actprob), axis=1))
gradient = tf.gradients(kl, theta)[0]
gradient_vector_product = tf.reduce_sum(gradient * vector)
hessian_vector_product = tf.gradients(gradient_vector_product, theta)[0]

S. J. Wright and J. Nocedal. Numerical Optimization. Springer New York, 1999.

Solving the KL-Penalized Problem: Summary

$$\text{maximize}_\theta \;\; L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) - C \cdot \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta)$$
Make a linear approximation to $L_{\pi_{\theta_{\text{old}}}}$ and a quadratic approximation to the KL term:
$$\text{maximize}_\theta \;\; g \cdot (\theta - \theta_{\text{old}}) - \frac{C}{2} (\theta - \theta_{\text{old}})^T F (\theta - \theta_{\text{old}})$$
where $g = \nabla_\theta L_{\pi_{\theta_{\text{old}}}}(\pi_\theta)\big|_{\theta = \theta_{\text{old}}}$ and $F = \nabla^2_\theta \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta)\big|_{\theta = \theta_{\text{old}}}$.

Solution: $\theta - \theta_{\text{old}} = \frac{1}{C} F^{-1} g$. Solve for $F^{-1} g$ approximately using the conjugate gradient algorithm, by forming a Hessian-vector product function.

Truncated Natural Policy Gradient Algorithm

for iteration = 1, 2, ... do
    Run policy for T timesteps or N trajectories
    Estimate advantage function at all timesteps
    Compute policy gradient g
    Use CG (with Hessian-vector products) to compute $F^{-1} g$
    Update policy parameter $\theta = \theta_{\text{old}} + \alpha F^{-1} g$
end for
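A sketch of the update direction inside this loop, reusing the conjugate_gradient function from the earlier sketch; policy_grad, fisher_vector_product, and step_size are assumed inputs (illustrative names, not from the lecture):

def natural_gradient_step(policy_grad, fisher_vector_product, step_size, cg_iters=10):
    """Compute the truncated natural gradient increment alpha * F^{-1} g (illustrative sketch).

    policy_grad: flattened policy gradient g
    fisher_vector_product: callable v -> F v (e.g. built from the KL Hessian-vector product code above)
    """
    step_dir = conjugate_gradient(fisher_vector_product, policy_grad, iters=cg_iters)
    return step_size * step_dir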

TRPO: KL-Constrained Problem

Unconstrained problem: $\text{maximize}_\theta \; L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) - C \cdot \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta)$

Constrained problem: $\text{maximize}_\theta \; L_{\pi_{\theta_{\text{old}}}}(\pi_\theta)$ subject to $\overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta) \leq \delta$

It is often easier to set the hyperparameter $\delta$ than C, and $\delta$ can remain fixed over the whole learning process.

We'll solve the constrained quadratic problem: compute $F^{-1} g$, and then rescale the step to get the correct KL. Take the linear objective and quadratic constraint:
$$\text{maximize}_\theta \;\; g \cdot (\theta - \theta_{\text{old}}) \quad \text{subject to} \quad \frac{1}{2} (\theta - \theta_{\text{old}})^T F (\theta - \theta_{\text{old}}) \leq \delta$$
Form the Lagrangian $\mathcal{L}(\theta, \lambda) = g \cdot (\theta - \theta_{\text{old}}) - \frac{\lambda}{2}\left[(\theta - \theta_{\text{old}})^T F (\theta - \theta_{\text{old}}) - \delta\right]$. Differentiating with respect to $\theta$ gives $\theta - \theta_{\text{old}} = \frac{1}{\lambda} F^{-1} g$.

To satisfy the constraint we want $\frac{1}{2} s^T F s = \delta$. Given a candidate step $s_{\text{unscaled}}$, rescale it to
$$s = \sqrt{\frac{2\delta}{s_{\text{unscaled}}^T F\, s_{\text{unscaled}}}}\; s_{\text{unscaled}}$$
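A NumPy sketch of the rescaling step (illustrative, not from the lecture), assuming a fisher_vector_product callable such as the one built from the KL Hessian-vector product code above:

import numpy as np

def rescale_step(step_unscaled, fisher_vector_product, delta):
    """Rescale a candidate step so that (1/2) s^T F s = delta."""
    sFs = step_unscaled.dot(fisher_vector_product(step_unscaled))
    return np.sqrt(2.0 * delta / sFs) * step_unscaled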

TRPO: KL-Constrained Problem (continued)

Compute $s_{\text{unscaled}} = F^{-1} g$, then rescale: $s = \sqrt{\frac{2\delta}{s_{\text{unscaled}}^T F\, s_{\text{unscaled}}}}\; s_{\text{unscaled}}$.

Now do a backtracking line search on the original problem (before the linear and quadratic approximations):
$$\text{maximize}_\theta \;\; L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) - \infty \cdot \mathbf{1}\left[\overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta) > \delta\right]$$
Try steps $s, s/2, s/4, \ldots$ until the line search objective improves.
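A sketch of the backtracking line search (illustrative, not the lecture's code); surrogate_kl_fn is an assumed helper that evaluates the sample-based surrogate and mean KL at a given parameter vector:

def line_search(theta_old, full_step, surrogate_kl_fn, delta, max_backtracks=10):
    """Accept the largest step s / 2^k that improves the surrogate while keeping KL <= delta."""
    L_old, _ = surrogate_kl_fn(theta_old)
    for k in range(max_backtracks):
        theta_new = theta_old + full_step / (2.0 ** k)
        L_new, kl = surrogate_kl_fn(theta_new)
        if L_new > L_old and kl <= delta:
            return theta_new
    return theta_old          # no acceptable step found; keep the old parameters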

TRPO Algorithm

for iteration = 1, 2, ... do
    Run policy for T timesteps or N trajectories
    Estimate advantage function at all timesteps
    Compute policy gradient g
    Use CG (with Hessian-vector products) to compute $F^{-1} g$
    Compute the rescaled step $s = \alpha F^{-1} g$, with $\alpha$ determined by the rescaling and line search above
    Apply update: $\theta = \theta_{\text{old}} + \alpha F^{-1} g$
end for

Alternative Method for Calculating Natural Gradients

Given a parameterized probability density $p_\theta(x)$, the Fisher information matrix is
$$\nabla^2_\theta \text{KL}\left[p_{\theta_{\text{old}}}, p_\theta\right]\big|_{\theta = \theta_{\text{old}}} = \mathbb{E}_{x \sim p_{\theta_{\text{old}}}}\left[\left(\nabla_\theta \log p_\theta(x)\right) \left(\nabla_\theta \log p_\theta(x)\right)^T\right]\Big|_{\theta = \theta_{\text{old}}}$$
The FIM forms a metric on the policy's parameter space, induced by the KL divergence. It makes the step invariant to reparameterization of the coordinates ($\theta' = f(\theta)$), whereas the plain gradient is not invariant.

In the policy optimization setting, instead of forming the Fisher matrix by differentiating the KL divergence, we can explicitly form
$$\sum_n \left(\nabla_\theta \log \pi_\theta(a_n \mid s_n)\right) \left(\nabla_\theta \log \pi_\theta(a_n \mid s_n)\right)^T$$
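A NumPy sketch of this outer-product construction (illustrative only; here the sum is averaged over the batch, and the function names are my own). In practice one implements only the matrix-vector product, which avoids forming F:

import numpy as np

def fisher_outer_product(score_vectors):
    """Empirical Fisher estimate from per-sample score vectors.

    score_vectors: array of shape [N, dim_theta]; row n is grad_theta log pi_theta(a_n | s_n).
    """
    N = score_vectors.shape[0]
    return score_vectors.T @ score_vectors / N

def fisher_vector_product(score_vectors, v):
    """Compute F v without forming F: F v = (1/N) G^T (G v), with G the score matrix."""
    return score_vectors.T @ (score_vectors @ v) / score_vectors.shape[0]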

Proximal Policy Optimization

Back to a penalty instead of a constraint:
$$\text{maximize}_\theta \;\; \sum_{n=1}^{N} \frac{\pi_\theta(a_n \mid s_n)}{\pi_{\theta_{\text{old}}}(a_n \mid s_n)} \hat{A}_n - C \cdot \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta)$$
Pseudocode:

for iteration = 1, 2, ... do
    Run policy for T timesteps or N trajectories
    Estimate advantage function at all timesteps
    Do SGD on the above objective for some number of epochs
    If KL too high, increase the penalty coefficient C; if KL too low, decrease it
end for

Gets the same performance as TRPO, but uses only first-order optimization.
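A sketch of the adaptive penalty update at the end of each iteration (illustrative; the 1.5 thresholds and factor of 2 are common choices, not values stated in the lecture):

def adapt_kl_coeff(C, kl, kl_target, factor=2.0):
    """Adjust the KL penalty coefficient after an iteration of SGD epochs.

    C: current penalty coefficient
    kl: mean KL measured between the old and updated policies
    kl_target: desired mean KL per policy update
    """
    if kl > 1.5 * kl_target:        # policy moved too far: penalize KL more
        C *= factor
    elif kl < kl_target / 1.5:      # policy barely moved: relax the penalty
        C /= factor
    return C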

Approximations in Supervised vs Reinforcement Learning

Supervised learning:
- Linear approximation given by the gradient: $f(\theta) \approx f(\theta_0) + (\theta - \theta_0) \cdot g$
- Training loss approximates test loss

Reinforcement learning (policy gradients):
- Linear approximation given by the gradient of the surrogate: $f(\theta) \approx f(\theta_0) + (\theta - \theta_0) \cdot g$
- Training surrogate approximates test surrogate (sampled data is representative of the visitation distribution)
- State distribution doesn't change much

Further Reading

S. Kakade. A Natural Policy Gradient. NIPS, 2001.
S. Kakade and J. Langford. Approximately Optimal Approximate Reinforcement Learning. ICML, 2002.
J. Peters and S. Schaal. Natural Actor-Critic. Neurocomputing, 2008.
J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. Trust Region Policy Optimization. ICML, 2015.
Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking Deep Reinforcement Learning for Continuous Control. ICML, 2016.
J. Martens and I. Sutskever. Training Deep and Recurrent Networks with Hessian-Free Optimization. Neural Networks: Tricks of the Trade. Springer, 2012.

That's all. Questions?