REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning


Ronen Tamari, The Hebrew University of Jerusalem
Advanced Seminar in Deep Learning (#67679)
February 28, 2016

Outline
1 Reinforcement Learning: A Quick Refresher
2 Policy Gradient Methods
3 REINFORCE Framework
4 REINFORCE Based Algorithm
5 REINFORCE in Neural Networks
6 Comparison with Related Algorithms
7 Conclusions

Reinforcement Learning: A Quick Refresher

Reinforcement Learning Framework
The system is modeled as a Markov Decision Process:
$s \in S$ - the state of the system
$a \in A$ - the agent's action
$p(s' \mid s, a)$ - the dynamics of the system
$r : S \times A \to \mathbb{R}$ - the reward function (possibly stochastic)
$\pi(a \mid s)$ - the agent's policy (also possibly stochastic)

Reinforcement Learning Objective
Assuming an episodic problem with horizon $T$, the reward of an episode is
$\sum_{t=1}^{T} r(s_t, a_t) = \sum_{t=1}^{T} r(t)$
or, in the case of discounted rewards,
$\sum_{t=1}^{T} d(t)\, r(t)$
where $d$ is a timestep-dependent weighting, often set to $d(t) = \gamma^t$, $\gamma \in (0, 1)$.
Generally, the goal is to learn a policy optimizing the expected return
$\mathbb{E}\left[\sum_{t=1}^{T} d(t)\, r(t)\right]$
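As a small illustration of the discounted return with $d(t) = \gamma^t$, here is a minimal Python sketch (the function name and the convention that $t$ starts at 0 are my own, not from the slides):

import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for a single episode (t starting at 0)."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Example: a 3-step episode with rewards 0, 0, 1 and gamma = 0.9
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81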

Reinforcement Learning Methods
Value based - try to learn a value function satisfying the Bellman equation:
- Q-Learning
- Actor-Critic algorithms
Policy search - attempt to directly learn a policy maximizing rewards:
- Gradient based: REINFORCE (also known as likelihood-ratio methods)
- Gradient free: Simulated Annealing, Cross-Entropy Search

Policy Gradient Methods

Performance Measure
Denote a trajectory by $\tau = \{(a_t, s_t)\}_{t=1}^{T}$.
Assume the policy is parameterized by $K$ parameters: $\theta \in \mathbb{R}^K$.
The performance measure is defined as
$J(\theta) = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=1}^{T} r(t) \,\middle|\, \theta\right] = \mathbb{E}_{p_\theta(\tau)}\left[r(\tau) \mid \theta\right]$

Policy Gradient Methods
Update rule:
$\theta_{t+1} = \theta_t + \alpha_t \nabla_\theta J\big|_{\theta=\theta_t}$
where $\alpha_t$ is the learning rate.
The main problem is obtaining a good estimator of the policy gradient $\nabla_\theta J\big|_{\theta=\theta_t}$.
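As code, the update rule is a plain gradient-ascent loop. A minimal sketch, where estimate_policy_gradient is a hypothetical placeholder for whichever estimator gets plugged in (for example, the REINFORCE estimator derived below):

def policy_gradient_ascent(theta, estimate_policy_gradient, alpha=0.01, num_iters=1000):
    """Generic policy-gradient loop: theta <- theta + alpha * grad_J(theta).

    estimate_policy_gradient(theta) is assumed to return an estimate of
    grad_theta J evaluated at theta.
    """
    for _ in range(num_iters):
        grad_J = estimate_policy_gradient(theta)
        theta = theta + alpha * grad_J
    return theta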

REINFORCE Framework

Episodic REINFORCE (Williams, 1992)
REINFORCE: REward Increment = Nonnegative Factor times Offset Reinforcement times Characteristic Eligibility.
Formally, following Williams, a learning algorithm with a weight update of the form
$\Delta w_{ij} = \alpha_{ij}\,(r - b_{ij}) \sum_{t=1}^{T} e_{ij}(t)$
$w_{ij}$ - policy parameterization (think of the case where $\theta = W$ is an $N \times M$ weight matrix)
$\alpha_{ij}$ - learning rate factor
$r$ - reward received at the end of the trial or after each timestep; can be time dependent
$b_{ij}$ - reinforcement baseline, conditionally independent of the action $y_t$ given the policy
$e_{ij} = \frac{\partial \ln g}{\partial w_{ij}}$ - "characteristic eligibility", where $g = \Pr(y = \xi \mid w, x)$

Episodic REINFORCE (Williams, 1992)
From now on we adhere to contemporary notation, common today in the deep learning community, which adapts the above to MDP-type problems:
$\Delta w_{ij} = \alpha_{ij}\,(r - b_{ij}) \sum_{t=1}^{T} e_{ij}(t)$
becomes
$\Delta\theta = \alpha \,\Big(\underbrace{\textstyle\sum_{t=1}^{T} r_t}_{R} - b\Big) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x_t)$
$\theta$ - policy parameterization (for example, the weights of a neural network)
$\alpha$ - equivalent to the gradient descent learning rate
$\nabla_\theta \log \pi_\theta(y_t \mid x_t)$ - gradient of the log-likelihood of the policy

Theorem (Episodic REINFORCE)
For any episodic REINFORCE algorithm, the inner product of $\mathbb{E}[\Delta\theta \mid \theta]$ and $\nabla_\theta \mathbb{E}[R \mid \theta]$ is non-negative. Furthermore, if $\alpha > 0$, then this inner product is zero only when $\nabla_\theta \mathbb{E}[R \mid \theta] = 0$. Also, if $\alpha_{ij}$ is independent of $i, j$, then
$\alpha \nabla_\theta \mathbb{E}[R \mid \theta] = \mathbb{E}[\Delta\theta \mid \theta]$
We will show a formulation specifically for the MDP case:
$\alpha \nabla_\theta J(\theta) = \alpha \nabla_\theta \mathbb{E}[r(\tau) \mid \theta] = \mathbb{E}_{p_\theta(\tau)}\left[\alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(r(\tau) - b) \,\middle|\, \theta\right]$

Deriving the Policy Gradient
Proof:
$J(\theta) = \mathbb{E}_{p_\theta(\tau)}[r(\tau)] = \int_{\mathcal{T}} p_\theta(\tau)\, r(\tau)\, d\tau$
where $p_\theta(\tau) = p(s_0) \prod_{t=0}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$ due to the Markov assumption, and $\mathcal{T}$ denotes the space of trajectories.
The gradient is
$\nabla_\theta J(\theta) = \int_{\mathcal{T}} \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$
but how do we compute $\nabla_\theta p_\theta(\tau)$?

Deriving the Policy Gradient (cont.)
Use the "log-likelihood trick"
$\nabla_\theta \log p_\theta(\tau) = \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}$
to obtain
$\nabla_\theta J(\theta) = \int_{\mathcal{T}} \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau = \int_{\mathcal{T}} \nabla_\theta \log p_\theta(\tau)\; p_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right] = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(\tau)\right]$
since only the policy $\pi_\theta(a_t \mid s_t)$ depends on $\theta$!
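Spelling out the last step, using the factorization of $p_\theta(\tau)$ from the previous slide:

\begin{aligned}
\nabla_\theta \log p_\theta(\tau)
  &= \nabla_\theta \Big[ \log p(s_0) + \sum_{t=0}^{T} \log p(s_{t+1} \mid s_t, a_t)
     + \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t) \Big] \\
  &= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t),
\end{aligned}

because the initial-state distribution and the dynamics terms carry no dependence on $\theta$.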

Deriving the Policy Gradient (cont.)
$\mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(\tau)\right]$
The expectation can be replaced by sample averages to give the form
$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_t^i\right) R^i$
where $i = 1, \dots, M$ indexes episodes in which the agent is run with the current policy. A discount factor can also be incorporated into $R^i$ (in which case the reward depends on the length of the episode).
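A minimal numpy sketch of this Monte Carlo estimator for a linear-softmax policy over discrete actions; all names are illustrative, and the policy class is an assumption made for the example, not something specified in the slides:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax_policy(theta, s, a):
    """grad_theta log pi_theta(a | s) for a linear-softmax policy.

    theta: (num_actions, state_dim), s: (state_dim,), a: int action index.
    """
    probs = softmax(theta @ s)
    indicator = np.zeros(len(probs))
    indicator[a] = 1.0
    return np.outer(indicator - probs, s)  # same shape as theta

def reinforce_gradient(theta, episodes):
    """Monte Carlo estimate of grad_theta J(theta).

    episodes: list of (states, actions, rewards) tuples for M rollouts
    collected under the current policy pi_theta.
    """
    grad = np.zeros_like(theta)
    for states, actions, rewards in episodes:
        R = float(np.sum(rewards))  # episode return (discounting could be applied here)
        for s, a in zip(states, actions):
            grad += grad_log_softmax_policy(theta, s, a) * R
    return grad / len(episodes)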

Variance Reduction Using a Baseline
We are only estimating the expected gradient, so the variance may be high, as the following example demonstrates.
Assume a baseline $b = 0$ in a scenario with a single constant reward $r$ for all state-action pairs. The variance of the gradient grows cubically with the length of the horizon $T$:
$\mathrm{Var}\left[T\, r \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = T^2 r^2 \sum_{t=0}^{T} \mathrm{Var}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$
Subtracting a baseline $b$ from the estimate can reduce the variance:
$\mathrm{Var}\left[(T r - b) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(\cdot)\right] = (T r - b)^2 \sum_{t=0}^{T} \mathrm{Var}\left[\nabla_\theta \log \pi_\theta(\cdot)\right]$
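To see the cubic scaling explicitly, suppose (as an illustrative simplification) that the per-step terms are uncorrelated and each has variance $\sigma^2$ per component. Then

$\mathrm{Var}\left[T\, r \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = T^2 r^2 (T+1)\, \sigma^2 = O(T^3),$

whereas a well-chosen baseline shrinks the leading factor from $(Tr)^2$ to $(Tr - b)^2$ in front of the same sum.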

Variance Reduction Using a Baseline
This gives the form
$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_t^i\right) \left(\sum_{t'=0}^{T} r_{t'}^i - b\right)$

Variance Reduction Using a Baseline
The baseline doesn't introduce bias into the gradient estimate $\nabla_\theta J(\theta)$, since
$\mathbb{E}_{p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\,(r(\tau) - b)\big] = \int_{\mathcal{T}} \nabla_\theta \log p_\theta(\tau)\; p_\theta(\tau)\,(r(\tau) - b)\, d\tau = \int_{\mathcal{T}} \nabla_\theta p_\theta(\tau)\,(r(\tau) - b)\, d\tau = \int_{\mathcal{T}} \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau - b \underbrace{\int_{\mathcal{T}} \nabla_\theta p_\theta(\tau)\, d\tau}_{=0} = \nabla_\theta J(\theta)$
Here we used
$\int_{\mathcal{T}} p_\theta(\tau)\, d\tau = 1 \;\Rightarrow\; \int_{\mathcal{T}} \nabla_\theta p_\theta(\tau)\, d\tau = \nabla_\theta \int_{\mathcal{T}} p_\theta(\tau)\, d\tau = 0$

Choosing an Optimal Baseline
Williams' work doesn't detail how to choose the baseline, though in practice this choice is significant for lowering variance and thus achieving good convergence. Later works explore it in detail (see, for example, Sutton et al. (1999) and Peters & Schaal (2008) in the Further Reading section).

REINFORCE Based Algorithm

Episodic REINFORCE Algorithm (from Peters, J., Schaal, S. (2008))
Input: policy parameterization $\theta$
1. Perform trials $1, \dots, M$ (until converged):
- Given trial trajectories $\left\{\{(a_t^i, s_t^i, r_t^i)\}_{t=0}^{T}\right\}_{i=1}^{M}$
- Obtain the uncorrected gradient estimate
$g = \frac{1}{M} \sum_{i=1}^{M} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_t^i\right) \left(\sum_{t'=0}^{T} d_{t'}\, r_{t'}^i\right)$
- Estimate the optimal baseline per element, $b = (b_1, \dots, b_h, \dots, b_K)$:
$b_h = \frac{\sum_{i=1}^{M} \left(\sum_{t=0}^{T} \nabla_{\theta_h} \log \pi_\theta\!\left(a_t^i \mid s_t^i\right)\right)^2 \left(\sum_{t'=0}^{T} d_{t'}\, r_{t'}^i\right)}{\sum_{i=1}^{M} \left(\sum_{t=0}^{T} \nabla_{\theta_h} \log \pi_\theta\!\left(a_t^i \mid s_t^i\right)\right)^2}$
- Subtract the baseline:
$g = \frac{1}{M} \sum_{i=1}^{M} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_t^i\right) \left(\sum_{t'=0}^{T} d_{t'}\, r_{t'}^i - b\right)$
- Return $g = \nabla_\theta J(\theta)$ (one step of SGD)
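A hedged Python sketch of one such gradient step with the per-element baseline, reusing numpy and the grad_log_softmax_policy helper from the earlier snippet (the episode format and the 2-D shape of theta are assumptions of that example, not of the algorithm itself):

import numpy as np

def reinforce_gradient_with_baseline(theta, episodes, gamma=0.99):
    """One episodic REINFORCE gradient estimate with a per-element baseline.

    episodes: list of (states, actions, rewards) collected under pi_theta.
    Relies on grad_log_softmax_policy(theta, s, a) defined earlier.
    """
    M = len(episodes)
    elig = np.zeros((M,) + theta.shape)   # sum_t grad log pi, per episode
    returns = np.zeros(M)                 # discounted return, per episode
    for i, (states, actions, rewards) in enumerate(episodes):
        discounts = gamma ** np.arange(len(rewards))
        returns[i] = float(np.sum(discounts * np.asarray(rewards, dtype=float)))
        for s, a in zip(states, actions):
            elig[i] += grad_log_softmax_policy(theta, s, a)

    # Per-element baseline: returns weighted by the squared eligibility of each parameter.
    sq = elig ** 2
    b = (sq * returns[:, None, None]).sum(axis=0) / (sq.sum(axis=0) + 1e-12)

    # Baseline-corrected gradient estimate (averaged over the M episodes).
    return (elig * (returns[:, None, None] - b)).sum(axis=0) / M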

Algorithm Convergence
The original work doesn't contain an analysis of the asymptotic properties of REINFORCE algorithms.
Empirical simulations show episodic REINFORCE algorithms to be slower due to delayed feedback (in the case where the reward is only assigned at the end of an episode).
They are susceptible to convergence to local optima.
The choice of baseline significantly affects convergence properties.
Related work has shown that, for this type of method, theoretical convergence to the true gradient is at a rate of $O\!\left(\frac{1}{\sqrt{m}}\right)$, where $m$ denotes the number of episodes (Monte Carlo based analysis).

REINFORCE in Neural Networks

REINFORCE was originally designed (also) for use in neural networks and RNNs.
The agent is represented by stochastic output unit(s) behind which are deterministic hidden units.
For example, consider an RNN with a softmax output of probabilities according to which the action $a_t$ is chosen; the hidden units of the RNN, $\theta_t$, form the policy parameterization.
The REINFORCE algorithm is derived using "unfolding through time" of the RNN: the weights are frozen for the duration of an episode and updated at its end.
This is naturally compatible with backpropagation: $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is the gradient of the corresponding RNN evaluated at timestep $t$.
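A hedged sketch of how this meshes with backpropagation in a modern framework (PyTorch here; the small feedforward network stands in for the RNN, and all names are illustrative): the surrogate loss $-\log \pi_\theta(a_t \mid s_t)\,(R - b)$ is built so that autograd produces exactly the REINFORCE gradient.

import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """Small feedforward stand-in for the RNN policy discussed above."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, num_actions))

    def forward(self, states):
        return torch.distributions.Categorical(logits=self.net(states))

def reinforce_loss(policy, states, actions, episode_return, baseline=0.0):
    """Surrogate loss whose backprop gradient is the REINFORCE estimate."""
    dist = policy(states)               # categorical distribution per timestep
    log_probs = dist.log_prob(actions)  # log pi_theta(a_t | s_t) for each t
    return -(log_probs.sum() * (episode_return - baseline))

# Usage sketch (one episode collected under the current policy):
#   loss = reinforce_loss(policy, states, actions, R, baseline=b)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()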

Example: Recurrent Attention Model (Mnih et al., 2014)
The policy is parameterized by an RNN.
At each step there are two types of actions ($l_t$, the glimpse location, and $a_t$, the classification), controlled by two sub-networks.
The goal is to learn a stochastic policy $\pi((l_t, a_t) \mid s_{1:t}; \theta)$ maximizing rewards.

Example: Recurrent Attention Model (Mnih et al., 2014)
The trajectory is given by $s_{1:t} = x_1, l_1, a_1, \dots, x_{t-1}, l_{t-1}, a_{t-1}, x_t$.
At each step there are two types of actions ($l_t$, the glimpse location, and $a_t$, the classification), controlled by two sub-networks.
The reward is $R = \sum_{t=1}^{T} r_t$, where $r_T = 1$ for a correct classification and $0$ otherwise.

Example: Recurrent Attention Model (Mnih et al., 2014)
The gradient estimate has the form
$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_{1:t}^i\right) \left(\sum_{t'=1}^{T} r_{t'}^i - b_t\right)$

Example: Recurrent Attention Model (Mnih et al., 2014) - Dynamic Environment
The same approach was used to train an agent to play a simple game in a dynamic environment.

Comparison with Related Algorithms

Q-Learning: A Reminder*
The value function:
$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim p(\cdot \mid s, a)}\big[r(s, a) + \gamma\, V^\pi(s')\big]$
The Q-function is the value of taking action $a$ at state $s$:
$Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\big[V(s')\big]$
The optimal value satisfies the Bellman equation:
$V^*(s) = \max_\pi \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim p(\cdot \mid s, a)}\big[r(s, a) + \gamma\, V^*(s')\big]$
*Slides based on Noga Zaslavsky's presentation

Q-Learning: A Reminder*
If we know $V^*$, an optimal policy is to act deterministically:
$a^*(s) = \arg\max_a Q(s, a)$
Learning $Q$ means learning an optimal policy.
*Slides based on Noga Zaslavsky's presentation
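For concreteness, a minimal tabular Q-learning sketch (a standard epsilon-greedy version; the gym-like environment interface with reset() and step() returning (next_state, reward, done) is an assumption of this example):

import numpy as np

def q_learning(env, num_states, num_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Bellman backup toward r + gamma * max_a' Q(s', a').
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q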

Q-Learning
Elegant mathematical characterization; converges to the optimal Q-function for MDP problems.
Model free - uses only the Q-function, not the system dynamics $p(s_{t+1} \mid s_t, a_t)$.
Two main issues complicate the picture:
- What if the problem isn't Markovian (as often arises in multi-agent settings)? Or assume it is, but only partially observable (a POMDP)? Partial observability makes the learning problem much harder.
- Continuous or very large action-state spaces. Learning is still possible with function approximation (DeepMind's Atari-style games), but the max operator in the Bellman equation makes function approximation difficult; it is sensitive to noise and in practice hard to train.

Q-Learning vs. REINFORCE
Q-Learning is better adapted to the classic fully observable Markovian setting; REINFORCE-based learning becomes more relevant the further a problem is from that setting (hidden and continuous states).

Conclusions

Policy parameterization is well suited to complex dynamic systems.
Slow convergence; limited theoretical analysis.
Integrates naturally with deep learning and RNNs.
A general approach that serves as the basis for more effective algorithms.
Not coupled as tightly to Markovity assumptions as Q-Learning.

Further Reading
Sutton, R. S., McAllester, D., Singh, S., Mansour, Y. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. Advances in Neural Information Processing Systems 12, 1057-1063.
Peters, J., Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 682-697.
Kober, J., Bagnell, J. A., Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11), 1238-1274. http://doi.org/10.1177/0278364913495721
Shalev-Shwartz, S., Ben-Zrihem, N., Cohen, A., Shashua, A. (2016). Long-term Planning by Short-term Prediction. http://arxiv.org/abs/1602.01580


Thank You