Advanced Policy Gradient Methods: Natural Gradient, TRPO, and More
March 8, 2017
Defining a Loss Function for RL

Let $\eta(\pi)$ denote the expected return of $\pi$:

    $\eta(\pi) = \mathbb{E}_{s_0 \sim \rho_0,\, a_t \sim \pi(\cdot \mid s_t)}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]$

We collect data with $\pi_{\text{old}}$. We want to optimize some objective to get a new policy $\pi$.

A useful identity [1]:

    $\eta(\pi) = \eta(\pi_{\text{old}}) + \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^t A^{\pi_{\text{old}}}(s_t, a_t) \right]$

[1] S. Kakade and J. Langford. "Approximately optimal approximate reinforcement learning." ICML, 2002.
Proof of Useful Identity

First note that $A^{\pi_{\text{old}}}(s, a) = \mathbb{E}_{s' \sim P(s' \mid s, a)}\left[ r(s) + \gamma V^{\pi_{\text{old}}}(s') - V^{\pi_{\text{old}}}(s) \right]$. Then:

    $\mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^t A^{\pi_{\text{old}}}(s_t, a_t) \right]$
    $= \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t) + \gamma V^{\pi_{\text{old}}}(s_{t+1}) - V^{\pi_{\text{old}}}(s_t) \right) \right]$
    $= \mathbb{E}_{\tau \sim \pi}\left[ -V^{\pi_{\text{old}}}(s_0) + \sum_{t=0}^{\infty} \gamma^t r(s_t) \right]$  (the value terms telescope)
    $= -\mathbb{E}_{s_0}\left[ V^{\pi_{\text{old}}}(s_0) \right] + \mathbb{E}_{\tau \sim \pi}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t) \right]$
    $= -\eta(\pi_{\text{old}}) + \eta(\pi)$
Surrogate Loss Function

We want to manipulate $\eta(\pi)$ into an objective that we can estimate from sampled data:

    $\eta(\pi) = \text{const} + \mathbb{E}_{s \sim \pi,\, a \sim \pi}\left[ A^{\pi_{\text{old}}}(s, a) \right]$
    $= \text{const} + \mathbb{E}_{s \sim \pi,\, a \sim \pi_{\text{old}}}\left[ \frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)} A^{\pi_{\text{old}}}(s, a) \right]$

Define $L_{\pi_{\text{old}}}(\pi)$ to be the surrogate objective, which ignores the change in state distribution:

    $L_{\pi_{\text{old}}}(\pi) = \mathbb{E}_{s \sim \pi_{\text{old}},\, a \sim \pi}\left[ A^{\pi_{\text{old}}}(s, a) \right] = \mathbb{E}_{s \sim \pi_{\text{old}},\, a \sim \pi_{\text{old}}}\left[ \frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)} A^{\pi_{\text{old}}}(s, a) \right]$

It matches $\eta$ to first order for a parameterized policy $\pi_\theta$:

    $\nabla_\theta L_{\pi_{\text{old}}}(\pi_\theta) \big|_{\theta_{\text{old}}} = \mathbb{E}_{s, a \sim \pi_{\text{old}}}\left[ \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)} A^{\pi_{\text{old}}}(s, a) \right] \Big|_{\theta_{\text{old}}}$
    $= \mathbb{E}_{s, a \sim \pi_{\text{old}}}\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_{\text{old}}}(s, a) \right] \Big|_{\theta_{\text{old}}} = \nabla_\theta \eta(\pi_\theta) \big|_{\theta = \theta_{\text{old}}}$

So $L_{\pi_{\text{old}}}$ is a local approximation to the performance of the policy.
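As a concrete illustration, here is a minimal NumPy sketch of the sampled surrogate (importance ratio times advantage); the function and argument names are ours, not from the lecture:

import numpy as np

def surrogate_loss(logp_new, logp_old, advantages):
    # Sample estimate of L_{pi_old}(pi): mean of importance ratio times advantage.
    # logp_new, logp_old: log pi(a_n|s_n) under the new and old policies.
    # advantages: estimates A-hat_n of A^{pi_old}(s_n, a_n).
    ratio = np.exp(logp_new - logp_old)  # pi(a|s) / pi_old(a|s)
    return np.mean(ratio * advantages)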
Improvement Theory

Theory: bound the difference between $L_{\pi_{\text{old}}}(\pi)$ and $\eta(\pi)$, the true performance of the policy (the error comes from ignoring the change in state distribution).

Result: $\eta(\pi) \geq L_{\pi_{\text{old}}}(\pi) - C \max_s \text{KL}\left[ \pi_{\text{old}}(\cdot \mid s), \pi(\cdot \mid s) \right]$, where $C = 2\epsilon\gamma/(1-\gamma)^2$.

Maximizing this lower bound guarantees monotonic improvement (an MM, i.e. minorization-maximization, algorithm).
Practical Approximations

Theory says we should maximize $L_{\pi_{\text{old}}}(\pi) - C \max_s \text{KL}\left[ \pi_{\text{old}}(\cdot \mid s), \pi(\cdot \mid s) \right]$. Approximations:

- Estimate $L_{\pi_{\text{old}}}(\pi)$ using trajectories sampled from $\pi_{\text{old}}$:
      $\hat{L}_{\pi_{\text{old}}}(\pi) = \sum_n \frac{\pi(a_n \mid s_n)}{\pi_{\text{old}}(a_n \mid s_n)} \hat{A}_n$
  If just the gradient is needed, can instead use $\sum_n \log \pi(a_n \mid s_n) \hat{A}_n$, which has the same gradient at $\pi = \pi_{\text{old}}$.
- Use the mean KL divergence $\overline{\text{KL}}_{\pi_{\text{old}}}(\pi) = \mathbb{E}_{s \sim \pi_{\text{old}}}\left[ \text{KL}[\pi_{\text{old}}(\cdot \mid s), \pi(\cdot \mid s)] \right]$ instead of $\max_s \text{KL}[\dots]$, estimated from samples as $\frac{1}{N}\sum_n \text{KL}[\pi_{\text{old}}(\cdot \mid s_n), \pi(\cdot \mid s_n)]$ (a sample-based sketch follows below).
- The theoretical $C$ is too pessimistic: it guarantees improvement of the discounted return, but yields tiny steps. Instead:
  - Natural policy gradient and PPO: use a fixed or adaptive penalty coefficient $C$.
  - TRPO: replace the penalty with a hard constraint $\overline{\text{KL}}_{\pi_{\text{old}}}(\pi) \leq \delta$.
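For discrete action spaces, the sample-based mean-KL estimate might look like this minimal NumPy sketch (names are illustrative):

import numpy as np

def mean_kl_categorical(oldprobs, newprobs):
    # oldprobs, newprobs: (N, n_actions) action probabilities at the N
    # sampled states, under pi_old and pi respectively.
    # Returns the sample estimate of E_{s ~ pi_old} KL[pi_old(.|s), pi(.|s)].
    kl_per_state = np.sum(oldprobs * np.log(oldprobs / newprobs), axis=1)
    return kl_per_state.mean()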
Solving the KL-Penalized Problem

    $\text{maximize}_\theta \; L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) - C \cdot \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta)$

Make a linear approximation to $L_{\pi_{\theta_{\text{old}}}}$ and a quadratic approximation to the KL term:

    $\text{maximize}_\theta \; g \cdot (\theta - \theta_{\text{old}}) - \frac{C}{2} (\theta - \theta_{\text{old}})^T F (\theta - \theta_{\text{old}})$

    where $g = \nabla_\theta L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) \big|_{\theta = \theta_{\text{old}}}$ and $F = \nabla^2_\theta \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta) \big|_{\theta = \theta_{\text{old}}}$

- The quadratic part of $L$ is negligible compared to the KL term.
- $F$ is positive semidefinite; the overall quadratic would not be if we also included the Hessian of $L$.

Solution: $\theta - \theta_{\text{old}} = \frac{1}{C} F^{-1} g$
Solving Linear Systems Using Conjugate Gradient

Previous slide: $\theta - \theta_{\text{old}} = \frac{1}{C} F^{-1} g$. We don't want to form the full Hessian matrix $F = \nabla^2_\theta \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta) \big|_{\theta = \theta_{\text{old}}}$ (too much memory and computation). We can compute $F^{-1} g$ approximately using the conjugate gradient algorithm, without forming $F$ explicitly; a minimal implementation follows below.
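Here is a minimal NumPy sketch of conjugate gradient that touches $F$ only through a matrix-vector product function; the default of 10 iterations is a typical choice in TRPO implementations, not prescribed by the slides:

import numpy as np

def conjugate_gradient(f_Fv, b, n_iters=10, residual_tol=1e-10):
    # Approximately solve F x = b, given only f_Fv: v -> F v.
    x = np.zeros_like(b)
    r = b.copy()              # residual b - F x (x starts at zero)
    p = r.copy()              # search direction
    r_dot_r = r.dot(r)
    for _ in range(n_iters):
        Fp = f_Fv(p)
        alpha = r_dot_r / p.dot(Fp)          # step length along p
        x += alpha * p
        r -= alpha * Fp
        new_r_dot_r = r.dot(r)
        if new_r_dot_r < residual_tol:
            break
        p = r + (new_r_dot_r / r_dot_r) * p  # next conjugate direction
        r_dot_r = new_r_dot_r
    return x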
Truncated Newton Method

The conjugate gradient algorithm approximately solves $x = A^{-1} b$ without explicitly forming the matrix $A$; it only reads $A$ through matrix-vector products $v \mapsto A v$. After $k$ iterations, CG has minimized $\frac{1}{2} x^T A x - b^T x$ in the subspace spanned by $b, Ab, A^2 b, \dots, A^{k-1} b$.

Given a vector $v$ with the same dimension as $\theta$, we want to compute $H^{-1} v$, where $H = \nabla^2_\theta f(\theta)$. To perform CG, we need Hessian-vector products $v \mapsto H v$. We can form this function using autodiff software like TensorFlow. Example:

theta = tf.placeholder(tf.float32, [dim_theta])  # parameter vector
f_of_theta = ...  # scalar function of theta
vector = tf.placeholder(tf.float32, [dim_theta])
gradient = tf.gradients(f_of_theta, theta)[0]
gradient_vector_product = tf.reduce_sum(gradient * vector)
hessian_vector_product = tf.gradients(gradient_vector_product, theta)[0]

A Hessian-vector product computation takes 1-2 times as long as a gradient computation.

S. J. Wright and J. Nocedal. Numerical Optimization. Springer, 1999.
Truncated Newton Method

The Fisher-vector product can be formed as follows when $F$ is computed as the Hessian of the KL divergence:

theta = tf.placeholder(tf.float32, [dim_theta])  # parameter vector
actprob = ...  # batchsize x n_actions, e.g. outputs of a neural network with softmax
oldactprob = tf.stop_gradient(actprob)  # same values, but treated as a constant
vector = tf.placeholder(tf.float32, [dim_theta])
kl = tf.reduce_sum(oldactprob * tf.log(oldactprob / actprob), axis=1)
gradient = tf.gradients(tf.reduce_sum(kl), theta)[0]
gradient_vector_product = tf.reduce_sum(gradient * vector)
hessian_vector_product = tf.gradients(gradient_vector_product, theta)[0]

Because oldactprob equals actprob at the current parameters but is held fixed by stop_gradient, the KL's gradient vanishes at $\theta_{\text{old}}$, while its Hessian is exactly the Fisher matrix.
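To plug this node into conjugate gradient, one might wrap it in a Python function. This sketch assumes a TensorFlow session sess, the graph nodes above, and a feed_dict holding the sampled batch (all names illustrative); the small damping term is a common practical addition, not from the slides, that keeps the product positive definite:

def fisher_vector_product(v, damping=1e-3):
    # v -> F v + damping * v, evaluated on the current batch.
    feed = dict(feed_dict)  # feeds for actprob's inputs (assumed defined)
    feed[vector] = v
    Fv = sess.run(hessian_vector_product, feed_dict=feed)
    return Fv + damping * v

step_dir = conjugate_gradient(fisher_vector_product, g)  # approximates F^{-1} g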
Solving the KL-Penalized Problem: Summary

    $\text{maximize}_\theta \; L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) - C \cdot \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta)$

Make a linear approximation to $L_{\pi_{\theta_{\text{old}}}}$ and a quadratic approximation to the KL term:

    $\text{maximize}_\theta \; g \cdot (\theta - \theta_{\text{old}}) - \frac{C}{2} (\theta - \theta_{\text{old}})^T F (\theta - \theta_{\text{old}})$

    where $g = \nabla_\theta L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) \big|_{\theta = \theta_{\text{old}}}$ and $F = \nabla^2_\theta \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta) \big|_{\theta = \theta_{\text{old}}}$

Solution: $\theta - \theta_{\text{old}} = \frac{1}{C} F^{-1} g$. Solve for $F^{-1} g$ approximately using the conjugate gradient algorithm, with the Hessian-vector product function in place of an explicit $F$.
Truncated Natural Policy Gradient Algorithm

for iteration = 1, 2, ... do
    Run policy for T timesteps or N trajectories
    Estimate advantage function at all timesteps
    Compute policy gradient g
    Use CG (with Hessian-vector products) to compute F^{-1} g
    Update policy parameter θ = θ_old + α F^{-1} g
end for
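Tying the pieces together, the update step of this loop might look like the following sketch, reusing the conjugate_gradient and fisher_vector_product helpers above (alpha is the step-size hyperparameter):

def npg_update(theta_old, g, fisher_vector_product, alpha, cg_iters=10):
    # One truncated natural policy gradient step: theta = theta_old + alpha * F^{-1} g.
    step_dir = conjugate_gradient(fisher_vector_product, g, n_iters=cg_iters)
    return theta_old + alpha * step_dir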
TRPO: KL-Constrained Problem

Unconstrained problem: $\text{maximize}_\theta \; L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) - C \cdot \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta)$
Constrained problem: $\text{maximize}_\theta \; L_{\pi_{\theta_{\text{old}}}}(\pi_\theta)$ subject to $\overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta) \leq \delta$

It is often easier to set the hyperparameter $\delta$ than $C$, and $\delta$ can remain fixed over the whole learning process.

We'll solve the constrained quadratic problem: compute $F^{-1} g$, and then rescale the step to satisfy the KL constraint exactly.

Take a linear approximation to the objective and a quadratic approximation to the constraint:

    $\text{maximize}_\theta \; g \cdot (\theta - \theta_{\text{old}})$ subject to $\frac{1}{2} (\theta - \theta_{\text{old}})^T F (\theta - \theta_{\text{old}}) \leq \delta$

Form the Lagrangian $\mathcal{L}(\theta, \lambda) = g \cdot (\theta - \theta_{\text{old}}) - \frac{\lambda}{2} \left[ (\theta - \theta_{\text{old}})^T F (\theta - \theta_{\text{old}}) - \delta \right]$.

Differentiating with respect to $\theta$ gives $\theta - \theta_{\text{old}} = \frac{1}{\lambda} F^{-1} g$.

To satisfy the constraint with equality, want $\frac{1}{2} s^T F s = \delta$. Given a candidate step $s_{\text{unscaled}}$, rescale to:

    $s = \sqrt{ \frac{2\delta}{ s_{\text{unscaled}}^T F s_{\text{unscaled}} } } \; s_{\text{unscaled}}$
TRPO: KL-Constrained Problem

- Compute $s_{\text{unscaled}} = F^{-1} g$ (via CG)
- Rescale: $s = \sqrt{ 2\delta / ( s_{\text{unscaled}}^T F s_{\text{unscaled}} ) } \; s_{\text{unscaled}}$
- Now do a backtracking line search on the original problem (from before the linear-quadratic approximation):

    $\text{maximize} \; L_{\pi_{\theta_{\text{old}}}}(\pi_\theta) - \infty \cdot \mathbb{1}\left[ \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta) > \delta \right]$

- Try steps $s, s/2, s/4, \dots$ until the line-search objective improves; a sketch follows below.
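A minimal sketch of the rescaling plus backtracking line search, assuming the conjugate_gradient and fisher_vector_product helpers above and a function surrogate_and_kl(theta) that evaluates (L-hat, mean KL) on the sampled batch (all names illustrative):

import numpy as np

def trpo_step(theta_old, g, fisher_vector_product, delta,
              surrogate_and_kl, max_backtracks=10):
    s_unscaled = conjugate_gradient(fisher_vector_product, g)
    sFs = s_unscaled.dot(fisher_vector_product(s_unscaled))
    s = np.sqrt(2.0 * delta / sFs) * s_unscaled   # now (1/2) s^T F s = delta
    L_old, _ = surrogate_and_kl(theta_old)
    for k in range(max_backtracks):
        theta_new = theta_old + s / 2.0 ** k      # try s, s/2, s/4, ...
        L_new, kl = surrogate_and_kl(theta_new)
        if kl <= delta and L_new > L_old:         # accept first improving, feasible step
            return theta_new
    return theta_old                              # line search failed; keep old params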
TRPO Algorithm

for iteration = 1, 2, ... do
    Run policy for T timesteps or N trajectories
    Estimate advantage function at all timesteps
    Compute policy gradient g
    Use CG (with Hessian-vector products) to compute s_unscaled = F^{-1} g
    Compute the final step s from s_unscaled via rescaling and line search
    Apply update: θ = θ_old + s
end for
Alternative Method for Calculating Natural Gradients

Given a parameterized probability density $p_\theta(x)$, the Fisher information matrix is

    $\nabla^2_\theta \text{KL}\left[ p_{\theta_{\text{old}}}, p_\theta \right] \big|_{\theta = \theta_{\text{old}}} = \mathbb{E}_{x \sim p_{\theta_{\text{old}}}}\left[ \left( \nabla_\theta \log p_\theta(x) \right) \left( \nabla_\theta \log p_\theta(x) \right)^T \right] \Big|_{\theta = \theta_{\text{old}}}$

The FIM defines a metric on the policy's parameter space, induced by the KL divergence. It makes the step invariant to reparameterization of the coordinates ($\theta' = f(\theta)$), whereas the plain gradient is not invariant.

In the policy optimization setting, instead of forming the Fisher matrix by differentiating the KL, we can explicitly form it from per-sample score vectors:

    $\frac{1}{N} \sum_n \left( \nabla_\theta \log \pi_\theta(a_n \mid s_n) \right) \left( \nabla_\theta \log \pi_\theta(a_n \mid s_n) \right)^T$
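As an illustration of this outer-product form, a NumPy sketch (in practice one would again use Fisher-vector products rather than materializing the matrix):

import numpy as np

def empirical_fisher(score_vectors):
    # score_vectors: (N, dim_theta), row n = grad_theta log pi_theta(a_n|s_n).
    # Returns the average outer product, an estimate of the Fisher matrix.
    N = score_vectors.shape[0]
    return score_vectors.T @ score_vectors / N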
Proximal Policy Optimization

Back to a penalty instead of a constraint:

    $\text{maximize}_\theta \; \sum_{n=1}^{N} \frac{\pi_\theta(a_n \mid s_n)}{\pi_{\theta_{\text{old}}}(a_n \mid s_n)} \hat{A}_n - \beta \cdot \overline{\text{KL}}_{\pi_{\theta_{\text{old}}}}(\pi_\theta)$

Pseudocode:

for iteration = 1, 2, ... do
    Run policy for T timesteps or N trajectories
    Estimate advantage function at all timesteps
    Do SGD on the above objective for some number of epochs
    If KL too high, increase β. If KL too low, decrease β.
end for

Achieves roughly the same performance as TRPO, but uses only first-order optimization. A sketch of the objective and the β adaptation follows below.
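A minimal NumPy sketch of the penalized objective and the adaptive-β heuristic; the 1.5 thresholds and factor of 2 follow common PPO practice and are illustrative, not taken from the slides:

import numpy as np

def ppo_kl_penalized_objective(logp_new, logp_old, advantages, kl_per_state, beta):
    # Importance-weighted advantage minus the KL penalty, averaged over the batch.
    ratio = np.exp(logp_new - logp_old)
    return np.mean(ratio * advantages) - beta * np.mean(kl_per_state)

def adapt_beta(beta, kl, kl_target):
    # Increase the penalty when the measured KL overshoots the target,
    # decrease it when the KL undershoots.
    if kl > 1.5 * kl_target:
        beta *= 2.0
    elif kl < kl_target / 1.5:
        beta /= 2.0
    return beta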
Approximations in Supervised vs. Reinforcement Learning

Supervised learning:
- Linear approximation given by the gradient: $f(\theta) \approx f(\theta_0) + (\theta - \theta_0) \cdot g$
- Training loss approximates test loss

Reinforcement learning (policy gradients):
- Linear approximation given by the gradient of the surrogate: $f(\theta) \approx f(\theta_0) + (\theta - \theta_0) \cdot g$
- Training surrogate approximates test surrogate (sampled data is representative of the visitation distribution)
- State distribution doesn't change much
Further Reading

- S. Kakade. "A Natural Policy Gradient." NIPS, 2001.
- S. Kakade and J. Langford. "Approximately optimal approximate reinforcement learning." ICML, 2002.
- J. Peters and S. Schaal. "Natural actor-critic." Neurocomputing, 2008.
- J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. "Trust Region Policy Optimization." ICML, 2015.
- Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. "Benchmarking Deep Reinforcement Learning for Continuous Control." ICML, 2016.
- J. Martens and I. Sutskever. "Training deep and recurrent networks with Hessian-free optimization." Neural Networks: Tricks of the Trade. Springer, 2012.
That's all. Questions?