Reinforcement Learning via Policy Optimization Hanxiao Liu November 22, 2017 1 / 27
Reinforcement Learning
Policy: a ∼ π(s) 2 / 27
Example - Mario 3 / 27
Example - ChatBot 4 / 27
Applications - Video Games 5 / 27
Applications - Robotics 6 / 27
Applications - Combinatorial Problems 7 / 27
Applications - Bioinformatics 8 / 27
Supervised Learning vs RL
Supervised setup: s_t ∼ P(·), a_t = π(s_t), immediate reward.
Reinforcement setup: s_t ∼ P(· | s_{t-1}, a_{t-1}), a_t ∼ π(s_t), delayed reward.
Both aim to maximize the total reward. 9 / 27
Markov Decision Process What is the underlying assumption? 10 / 27
Landscape 11 / 27
Monte Carlo
Return R(τ) = Σ_{t=1}^T R(s_t, a_t). Trajectory τ = (s_0, a_0, ..., s_T, a_T).
Goal: find a good π such that R is maximized.
Policy evaluation via MC
1. Sample a trajectory τ given π.
2. Record R(τ).
Repeat the above many times and take the average. 12 / 27
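A minimal sketch of MC policy evaluation, assuming a Gym-style environment with reset()/step() and a policy(state) -> action callable; these interfaces and all names are illustrative, not from the slides:

```python
import numpy as np

def mc_policy_evaluation(env, policy, num_episodes=1000, horizon=100):
    """Estimate the expected return of `policy` by averaging Monte Carlo rollouts."""
    returns = []
    for _ in range(num_episodes):
        state = env.reset()
        total_reward = 0.0
        for _ in range(horizon):
            action = policy(state)
            state, reward, done, _ = env.step(action)  # assumed Gym-style step
            total_reward += reward                     # accumulate R(tau) = sum_t R(s_t, a_t)
            if done:
                break
        returns.append(total_reward)
    return np.mean(returns)                            # average over sampled trajectories
```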
Policy Gradient
Let's parametrize π using a neural network: a ∼ π_θ(s).
Expected return: max_θ U(θ) = max_θ Σ_τ P(τ; π_θ) R(τ) (1)
Heuristic: raise the probability of good trajectories.
1. Sample a trajectory τ under π_θ.
2. Update θ using the gradient ∇_θ log P(τ; π_θ) R(τ). 13 / 27
Policy Gradient
The log-derivative trick
∇_θ U(θ) = Σ_τ ∇_θ P(τ; π_θ) R(τ) (2)
= Σ_τ (P(τ; π_θ) / P(τ; π_θ)) ∇_θ P(τ; π_θ) R(τ) (3)
= Σ_τ P(τ; π_θ) ∇_θ log P(τ; π_θ) R(τ) (4)
= E_{τ∼P(·; π_θ)} [∇_θ log P(τ; π_θ) R(τ)] (5)
≈ ∇_θ log P(τ; π_θ) R(τ), τ ∼ P(·; π_θ) (6) 14 / 27
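As a concrete instance of the single-sample estimate in (6), here is a sketch for a tabular softmax policy (an assumption; the slides use a neural network). It uses the fact that the dynamics do not depend on θ, so ∇_θ log P(τ; π_θ) = Σ_t ∇_θ log π_θ(a_t | s_t):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def score_function_gradient(theta, trajectory):
    """Single-sample estimate of grad_theta log P(tau; pi_theta) * R(tau).

    theta: (num_states, num_actions) table defining a softmax policy;
    trajectory: list of (state, action, reward) tuples.
    """
    grad = np.zeros_like(theta)
    total_return = sum(r for _, _, r in trajectory)   # R(tau)
    for s, a, _ in trajectory:
        grad_logpi = -softmax(theta[s])
        grad_logpi[a] += 1.0                          # grad of log pi(a|s): onehot(a) - pi(.|s)
        grad[s] += grad_logpi
    return grad * total_return
```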
Policy Gradient
∇_θ U(θ) ≈ ∇_θ log P(τ; π_θ) R(τ), τ ∼ P(·; π_θ) (7)
Analogous to SGD (so variance reduction is important), but the data distribution here is a moving target (so we may want a trust region). 15 / 27
Policy Gradient
We can subtract any constant from the reward:
Σ_τ ∇_θ P(τ; π_θ) (R(τ) - b) (8)
= Σ_τ ∇_θ P(τ; π_θ) R(τ) - Σ_τ ∇_θ P(τ; π_θ) b (9)
= Σ_τ ∇_θ P(τ; π_θ) R(τ) - b ∇_θ Σ_τ P(τ; π_θ) (10)
= Σ_τ ∇_θ P(τ; π_θ) R(τ) (11)
(the last term vanishes because Σ_τ P(τ; π_θ) = 1, so its gradient is zero) 16 / 27
Policy Gradient
The variance is magnified by R(τ). More fine-grained baseline?
∇_θ U(θ) ≈ ∇_θ log P(τ; π_θ) R(τ) (12)
= Σ_{t=0}^T ∇_θ log π_θ(a_t | s_t) Σ_{t'=0}^T R(s_{t'}, a_{t'}) (13)
≈ Σ_{t=0}^T ∇_θ log π_θ(a_t | s_t) Σ_{t'=t}^T R(s_{t'}, a_{t'}), with R_t := Σ_{t'=t}^T R(s_{t'}, a_{t'}) (14)
→ Σ_{t=0}^T ∇_θ log π_θ(a_t | s_t) (R_t - b_t) (15) 17 / 27
Policy Gradient
Σ_{t=0}^T ∇_θ log π_θ(a_t | s_t) (R_t - b_t) (16)
Good choice of b_t?
b_t = argmin_b E (R_t - b)^2 (17)
≈ V_φ(s_t) (18)
Σ_{t=0}^T ∇_θ log π_θ(a_t | s_t) (R_t - V_φ(s_t)), with A_t := R_t - V_φ(s_t) (19)
Promote actions that lead to positive advantage. 18 / 27
Policy Gradient
Σ_{t=0}^T ∇_θ log π_θ(a_t | s_t) A_t (20)
Sample a trajectory τ.
For each step in τ:
  R_t = Σ_{t'=t}^T r_{t'}
  A_t = R_t - V_φ(s_t)
Take a gradient step along ∇_φ Σ_{t=0}^T ‖V_φ(s_t) - R_t‖^2.
Take a gradient step along ∇_θ Σ_{t=0}^T log π_θ(a_t | s_t) A_t. 19 / 27
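A minimal sketch of this loop for a tabular softmax policy and a tabular value baseline (assumptions; the slides parametrize both with neural networks). Function and variable names are illustrative:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def pg_baseline_update(theta, V, trajectory, lr_theta=0.01, lr_v=0.1):
    """One policy-gradient update with a value baseline on a sampled trajectory.

    theta: (num_states, num_actions) softmax-policy table, V: (num_states,) value table,
    trajectory: list of (state, action, reward) tuples.
    """
    rewards = np.array([r for _, _, r in trajectory], dtype=float)
    reward_to_go = np.flip(np.cumsum(np.flip(rewards)))    # R_t = sum_{t' >= t} r_{t'}
    for t, (s, a, _) in enumerate(trajectory):
        advantage = reward_to_go[t] - V[s]                 # A_t = R_t - V_phi(s_t)
        V[s] += lr_v * (reward_to_go[t] - V[s])            # step on ||V(s_t) - R_t||^2 (factor 2 folded into lr_v)
        grad_logpi = -softmax(theta[s])
        grad_logpi[a] += 1.0                               # grad of log pi(a|s) for a softmax policy
        theta[s] += lr_theta * grad_logpi * advantage      # ascend sum_t log pi(a_t|s_t) A_t
    return theta, V
```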
TRPO
In PG, our optimization objective is a moving target:
  it is defined from trajectory samples under the current policy
  each sample is only used once
An objective that reuses all historical data? 20 / 27
TRPO
Policy gradient: Σ_{t=0}^T ∇_θ log π_θ(a_t | s_t) A_t (21)
Objective (on-policy): E_π [A(s, a)] (22)
Objective (off-policy), via importance sampling: E_{π_old} [ (π(a|s) / π_old(a|s)) A_old(s, a) ] (23) 21 / 27
TRPO
max_π E_{π_old} [ (π(a|s) / π_old(a|s)) A_old(s, a) ] ≈ max_π Σ_{n=1}^N (π(a_n|s_n) / π_old(a_n|s_n)) A_n (24)
s.t. KL(π_old, π) ≤ δ (25)
A quadratic approximation is used for the KL. Use conjugate gradient for the natural gradient direction F^{-1} g. 22 / 27
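A sketch of the conjugate-gradient solve that yields the natural gradient direction F^{-1} g. The Fisher-vector product callback `fvp` is an assumed helper (in practice obtained by differentiating the quadratic KL approximation), not something defined on the slides:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, where fvp(v) returns the Fisher-vector product F v.
    Returns the natural gradient direction F^{-1} g without forming F explicitly."""
    x = np.zeros_like(g)
    r = g.copy()                       # residual g - F x, with x = 0 initially
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p  # new conjugate search direction
        rs_old = rs_new
    return x
```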
PPO
max_π Σ_{n=1}^N (π(a_n|s_n) / π_old(a_n|s_n)) A_n - β KL(π_old, π) (26)
A fixed β does not work well:
  shrink β when KL(π_old, π) is small
  increase β otherwise 23 / 27
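A small sketch of the adaptive penalty coefficient; the target KL, the 1.5 band, and the factor of 2 are illustrative values, not from the slides:

```python
def update_kl_penalty(beta, observed_kl, kl_target=0.01, factor=2.0):
    """Adapt the KL penalty coefficient: shrink beta when the observed KL is
    well below the target, grow it when well above (thresholds are illustrative)."""
    if observed_kl < kl_target / 1.5:
        beta /= factor
    elif observed_kl > kl_target * 1.5:
        beta *= factor
    return beta
```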
PPO
Alternative ways to penalize large changes in π? Modify the A_n terms:
(π(a_n|s_n) / π_old(a_n|s_n)) A_n = r_n(θ) A_n (27)
min (r_n(θ) A_n, clip(1 - ɛ, 1 + ɛ, r_n(θ)) A_n) (28)
Be pessimistic whenever a large change in π leads to a better objective; keep the original objective otherwise. 24 / 27
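A sketch of the clipped surrogate in (28), computed from per-sample log-probabilities under the new and old policies; ɛ = 0.2 is a common choice, not given on the slides:

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective (to be maximized):
    mean_n min(r_n A_n, clip(1-eps, 1+eps, r_n) A_n), with r_n = pi/pi_old."""
    ratio = np.exp(logp_new - logp_old)                  # r_n(theta)
    clipped_ratio = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped_ratio * advantages))
```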
Advantage Estimation
Currently
Â_t = r_t + r_{t+1} + r_{t+2} + ... - V(s_t) (29)
More bias, less variance (ignoring long-term effects):
Â_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... - V(s_t) (30)
Bootstrapping
Â_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... - V(s_t) (31)
= r_t + γ (r_{t+1} + γ r_{t+2} + ...) - V(s_t) (32)
≈ r_t + γ V(s_{t+1}) - V(s_t) (33)
We may unroll for more than one step. 25 / 27
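A small sketch of the one-step bootstrapped estimate in (33); array names and γ = 0.99 are illustrative:

```python
import numpy as np

def one_step_advantages(rewards, values, next_values, gamma=0.99):
    """A_hat_t = r_t + gamma * V(s_{t+1}) - V(s_t), computed elementwise
    from per-step rewards, V(s_t), and V(s_{t+1})."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards + gamma * np.asarray(next_values, dtype=float) - np.asarray(values, dtype=float)
```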
A2C
Advantage Actor-Critic
Sample a trajectory τ.
For each step in τ:
  R̂_t = r_t + γ V(s_{t+1})
  Â_t = R̂_t - V(s_t)
Take a gradient step using ∇_{θ,φ} Σ_{t=0}^T [ -log π_θ(a_t | s_t) Â_t + ‖V_φ(s_t) - R̂_t‖^2 ]
May use TRPO or PPO for policy optimization. 26 / 27
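A PyTorch-flavoured sketch of the combined actor-critic loss above; tensor shapes, names, and the 0.5 value-loss weight are assumptions, not from the slides:

```python
import torch

def a2c_loss(logits, values, actions, rewards, next_values, gamma=0.99, vf_coef=0.5):
    """Joint loss to minimize over one trajectory:
    -sum_t log pi_theta(a_t|s_t) A_hat_t + vf_coef * sum_t (V_phi(s_t) - R_hat_t)^2.

    logits: (T, num_actions) policy-network outputs, values: (T,) critic outputs,
    actions: (T,) int64 actions, rewards / next_values: (T,) floats.
    """
    with torch.no_grad():
        returns = rewards + gamma * next_values        # R_hat_t = r_t + gamma V(s_{t+1})
        advantages = returns - values                  # A_hat_t = R_hat_t - V(s_t), detached
    log_probs = torch.log_softmax(logits, dim=-1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(log_pi_a * advantages).sum()
    value_loss = ((values - returns) ** 2).sum()
    return policy_loss + vf_coef * value_loss
```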
A3C
Variance reduction via multiple actors
Figure: Asynchronous Advantage Actor-Critic 27 / 27