REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning


Ronen Tamari, The Hebrew University of Jerusalem
Advanced Seminar in Deep Learning (#67679)
February 28, 2016

Outline
1 Reinforcement Learning: A Quick Refresher
2 Policy Gradient Methods
3 REINFORCE Framework
4 REINFORCE Based Algorithm
5 REINFORCE in Neural Networks
6 Comparison with Related Algorithms
7 Conclusions

Reinforcement Learning: A Quick Refresher

Reinforcement Learning Framework
The system is modeled as a Markov Decision Process:
$s \in S$ - the state of the system
$a \in A$ - the agent's action
$p(s' \mid s, a)$ - the dynamics of the system
$r : S \times A \to \mathbb{R}$ - the reward function (possibly stochastic)
$\pi(a \mid s)$ - the agent's policy (also possibly stochastic)

Reinforcement Learning Objective
Assuming an episodic problem with horizon $T$, the reward of an episode is
$\sum_{t=1}^{T} r(s_t, a_t) = \sum_{t=1}^{T} r(t)$
or, in the case of discounted rewards,
$\sum_{t=1}^{T} d(t)\, r(t)$
where $d$ is a timestep-dependent weighting, often set to $d(t) = \gamma^t$, $\gamma \in (0, 1)$.
Generally, the goal is to learn a policy optimizing the expected return
$\mathbb{E}\left[\sum_{t=1}^{T} d(t)\, r(t)\right]$
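As a small illustration of the discounted return with $d(t) = \gamma^t$, here is a minimal Python sketch (the function name and the convention that $t$ starts at 0 are my own, not from the slides):

import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for a single episode (t starting at 0)."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# Example: a 3-step episode with rewards 0, 0, 1 and gamma = 0.9
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81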

Reinforcement Learning Methods
Value based - try to learn a value function satisfying the Bellman equation:
- Q-Learning
- Actor-Critic algorithms
Policy search - attempt to directly learn a policy maximizing rewards:
- Gradient based: REINFORCE (also known as likelihood-ratio methods)
- Gradient free: Simulated Annealing, Cross-Entropy Search

Policy Gradient Methods

Performance Measure
Denote a trajectory by $\tau = \{(a_t, s_t)\}_{t=1}^{T}$.
Assume the policy is parameterized by $K$ parameters: $\theta \in \mathbb{R}^K$.
The performance measure is defined as
$J(\theta) = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=1}^{T} r(t) \,\middle|\, \theta\right] = \mathbb{E}_{p_\theta(\tau)}\left[r(\tau) \mid \theta\right]$

Policy Gradient Methods
Update rule:
$\theta_{t+1} = \theta_t + \alpha_t \nabla_\theta J\big|_{\theta=\theta_t}$
where $\alpha_t$ is the learning rate.
The main problem is obtaining a good estimator of the policy gradient $\nabla_\theta J\big|_{\theta=\theta_t}$.
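As code, the update rule is a plain gradient-ascent loop. A minimal sketch, where estimate_policy_gradient is a hypothetical placeholder for whichever estimator gets plugged in (for example, the REINFORCE estimator derived below):

def policy_gradient_ascent(theta, estimate_policy_gradient, alpha=0.01, num_iters=1000):
    """Generic policy-gradient loop: theta <- theta + alpha * grad_J(theta).

    estimate_policy_gradient(theta) is assumed to return an estimate of
    grad_theta J evaluated at theta.
    """
    for _ in range(num_iters):
        grad_J = estimate_policy_gradient(theta)
        theta = theta + alpha * grad_J
    return theta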

REINFORCE Framework

Episodic REINFORCE (Williams, 1992)
REINFORCE: REward Increment = Nonnegative Factor times Offset Reinforcement times Characteristic Eligibility.
Formally, following Williams, a learning algorithm with a weight update of the form
$\Delta w_{ij} = \alpha_{ij}\,(r - b_{ij}) \sum_{t=1}^{T} e_{ij}(t)$
$w_{ij}$ - policy parameterization (think of the case where $\theta = W$ is an $N \times M$ weight matrix)
$\alpha_{ij}$ - learning rate factor
$r$ - reward received at the end of the trial or after each timestep; can be time dependent
$b_{ij}$ - reinforcement baseline, conditionally independent of the action $y_t$ given the policy
$e_{ij} = \frac{\partial \ln g}{\partial w_{ij}}$ - "characteristic eligibility", where $g = \Pr(y = \xi \mid w, x)$

Episodic REINFORCE (Williams, 1992)
From now on we adhere to contemporary notation, common today in the deep learning community, which adapts the above to MDP-type problems:
$\Delta w_{ij} = \alpha_{ij}\,(r - b_{ij}) \sum_{t=1}^{T} e_{ij}(t)$
becomes
$\Delta\theta = \alpha \,\Big(\underbrace{\textstyle\sum_{t=1}^{T} r_t}_{R} - b\Big) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x_t)$
$\theta$ - policy parameterization (for example, the weights of a neural network)
$\alpha$ - equivalent to the gradient descent learning rate
$\nabla_\theta \log \pi_\theta(y_t \mid x_t)$ - gradient of the log-likelihood of the policy

Theorem (Episodic REINFORCE)
For any episodic REINFORCE algorithm, the inner product of $\mathbb{E}[\Delta\theta \mid \theta]$ and $\nabla_\theta \mathbb{E}[R \mid \theta]$ is non-negative. Furthermore, if $\alpha > 0$, then this inner product is zero only when $\nabla_\theta \mathbb{E}[R \mid \theta] = 0$. Also, if $\alpha_{ij}$ is independent of $i, j$, then
$\alpha \nabla_\theta \mathbb{E}[R \mid \theta] = \mathbb{E}[\Delta\theta \mid \theta]$
We will show a formulation specifically for the MDP case:
$\alpha \nabla_\theta J(\theta) = \alpha \nabla_\theta \mathbb{E}[r(\tau) \mid \theta] = \mathbb{E}_{p_\theta(\tau)}\left[\alpha \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,(r(\tau) - b) \,\middle|\, \theta\right]$

Deriving the Policy Gradient
Proof:
$J(\theta) = \mathbb{E}_{p_\theta(\tau)}[r(\tau)] = \int_{\mathcal{T}} p_\theta(\tau)\, r(\tau)\, d\tau$
where $p_\theta(\tau) = p(s_0) \prod_{t=0}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$ due to the Markov assumption, and $\mathcal{T}$ denotes the space of trajectories.
The gradient is
$\nabla_\theta J(\theta) = \int_{\mathcal{T}} \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$
but how do we compute $\nabla_\theta p_\theta(\tau)$?

Deriving the Policy Gradient (cont.)
Use the "log-likelihood trick"
$\nabla_\theta \log p_\theta(\tau) = \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}$
to obtain
$\nabla_\theta J(\theta) = \int_{\mathcal{T}} \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau = \int_{\mathcal{T}} \nabla_\theta \log p_\theta(\tau)\; p_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right] = \mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(\tau)\right]$
since only the policy $\pi_\theta(a_t \mid s_t)$ depends on $\theta$!
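Spelling out the last step, using the factorization of $p_\theta(\tau)$ from the previous slide:

\begin{aligned}
\nabla_\theta \log p_\theta(\tau)
  &= \nabla_\theta \Big[ \log p(s_0) + \sum_{t=0}^{T} \log p(s_{t+1} \mid s_t, a_t)
     + \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t) \Big] \\
  &= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t),
\end{aligned}

because the initial-state distribution and the dynamics terms carry no dependence on $\theta$.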

Deriving the Policy Gradient (cont.)
$\mathbb{E}_{p_\theta(\tau)}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r(\tau)\right]$
The expectation can be replaced by sample averages to give the form
$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_t^i\right) R^i$
where $i = 1, \dots, M$ indexes episodes in which the agent is run with the current policy. A discount factor can also be incorporated into $R^i$ (in which case the reward depends on the length of the episode).
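A minimal numpy sketch of this Monte Carlo estimator for a linear-softmax policy over discrete actions; all names are illustrative, and the policy class is an assumption made for the example, not something specified in the slides:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax_policy(theta, s, a):
    """grad_theta log pi_theta(a | s) for a linear-softmax policy.

    theta: (num_actions, state_dim), s: (state_dim,), a: int action index.
    """
    probs = softmax(theta @ s)
    indicator = np.zeros(len(probs))
    indicator[a] = 1.0
    return np.outer(indicator - probs, s)  # same shape as theta

def reinforce_gradient(theta, episodes):
    """Monte Carlo estimate of grad_theta J(theta).

    episodes: list of (states, actions, rewards) tuples for M rollouts
    collected under the current policy pi_theta.
    """
    grad = np.zeros_like(theta)
    for states, actions, rewards in episodes:
        R = float(np.sum(rewards))  # episode return (discounting could be applied here)
        for s, a in zip(states, actions):
            grad += grad_log_softmax_policy(theta, s, a) * R
    return grad / len(episodes)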

Variance Reduction Using a Baseline
We are only estimating the expected gradient, so the variance may be high, as the following example demonstrates.
Assume a baseline $b = 0$ in a scenario with a single constant reward $r$ for all state-action pairs. The variance of the gradient grows cubically with the length of the horizon $T$:
$\mathrm{Var}\left[T\, r \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = T^2 r^2 \sum_{t=0}^{T} \mathrm{Var}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$
Subtracting a baseline $b$ from the estimate can reduce the variance:
$\mathrm{Var}\left[(T r - b) \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(\cdot)\right] = (T r - b)^2 \sum_{t=0}^{T} \mathrm{Var}\left[\nabla_\theta \log \pi_\theta(\cdot)\right]$
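To see the cubic scaling explicitly, suppose (as an illustrative simplification) that the per-step terms are uncorrelated and each has variance $\sigma^2$ per component. Then

$\mathrm{Var}\left[T\, r \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = T^2 r^2 (T+1)\, \sigma^2 = O(T^3),$

whereas a well-chosen baseline shrinks the leading factor from $(Tr)^2$ to $(Tr - b)^2$ in front of the same sum.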

Variance Reduction Using a Baseline
This gives the form
$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_t^i\right) \left(\sum_{t'=0}^{T} r_{t'}^i - b\right)$

Variance Reduction Using a Baseline
The baseline doesn't introduce bias into the gradient estimate $\nabla_\theta J(\theta)$, since
$\mathbb{E}_{p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\,(r(\tau) - b)\big] = \int_{\mathcal{T}} \nabla_\theta \log p_\theta(\tau)\; p_\theta(\tau)\,(r(\tau) - b)\, d\tau = \int_{\mathcal{T}} \nabla_\theta p_\theta(\tau)\,(r(\tau) - b)\, d\tau = \int_{\mathcal{T}} \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau - b \underbrace{\int_{\mathcal{T}} \nabla_\theta p_\theta(\tau)\, d\tau}_{=0} = \nabla_\theta J(\theta)$
Here we used
$\int_{\mathcal{T}} p_\theta(\tau)\, d\tau = 1 \;\Rightarrow\; \int_{\mathcal{T}} \nabla_\theta p_\theta(\tau)\, d\tau = \nabla_\theta \int_{\mathcal{T}} p_\theta(\tau)\, d\tau = 0$

Choosing an Optimal Baseline
Williams' work doesn't detail how to choose the baseline, though in practice this choice is significant for lowering variance and thus achieving good convergence. Later works explore it in detail (see, for example, Sutton et al. (1999) and Peters & Schaal (2008) in the Further Reading section).

REINFORCE Based Algorithm

Episodic REINFORCE Algorithm (from Peters, J., Schaal, S. (2008))
Input: policy parameterization $\theta$
1. Perform trials $1, \dots, M$ (until converged):
- Given trial trajectories $\left\{\{(a_t^i, s_t^i, r_t^i)\}_{t=0}^{T}\right\}_{i=1}^{M}$
- Obtain the uncorrected gradient estimate
$g = \frac{1}{M} \sum_{i=1}^{M} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_t^i\right) \left(\sum_{t'=0}^{T} d_{t'}\, r_{t'}^i\right)$
- Estimate the optimal baseline per element, $b = (b_1, \dots, b_h, \dots, b_K)$:
$b_h = \frac{\sum_{i=1}^{M} \left(\sum_{t=0}^{T} \nabla_{\theta_h} \log \pi_\theta\!\left(a_t^i \mid s_t^i\right)\right)^2 \left(\sum_{t'=0}^{T} d_{t'}\, r_{t'}^i\right)}{\sum_{i=1}^{M} \left(\sum_{t=0}^{T} \nabla_{\theta_h} \log \pi_\theta\!\left(a_t^i \mid s_t^i\right)\right)^2}$
- Subtract the baseline:
$g = \frac{1}{M} \sum_{i=1}^{M} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_t^i\right) \left(\sum_{t'=0}^{T} d_{t'}\, r_{t'}^i - b\right)$
- Return $g = \nabla_\theta J(\theta)$ (one step of SGD)
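A hedged Python sketch of one such gradient step with the per-element baseline, reusing numpy and the grad_log_softmax_policy helper from the earlier snippet (the episode format and the 2-D shape of theta are assumptions of that example, not of the algorithm itself):

import numpy as np

def reinforce_gradient_with_baseline(theta, episodes, gamma=0.99):
    """One episodic REINFORCE gradient estimate with a per-element baseline.

    episodes: list of (states, actions, rewards) collected under pi_theta.
    Relies on grad_log_softmax_policy(theta, s, a) defined earlier.
    """
    M = len(episodes)
    elig = np.zeros((M,) + theta.shape)   # sum_t grad log pi, per episode
    returns = np.zeros(M)                 # discounted return, per episode
    for i, (states, actions, rewards) in enumerate(episodes):
        discounts = gamma ** np.arange(len(rewards))
        returns[i] = float(np.sum(discounts * np.asarray(rewards, dtype=float)))
        for s, a in zip(states, actions):
            elig[i] += grad_log_softmax_policy(theta, s, a)

    # Per-element baseline: returns weighted by the squared eligibility of each parameter.
    sq = elig ** 2
    b = (sq * returns[:, None, None]).sum(axis=0) / (sq.sum(axis=0) + 1e-12)

    # Baseline-corrected gradient estimate (averaged over the M episodes).
    return (elig * (returns[:, None, None] - b)).sum(axis=0) / M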

Algorithm Convergence
The original work doesn't contain an analysis of the asymptotic properties of REINFORCE algorithms.
Empirical simulations show episodic REINFORCE algorithms to be slower due to delayed feedback (in the case where the reward is only assigned at the end of an episode).
They are susceptible to convergence to local optima.
The choice of baseline significantly affects convergence properties.
Related work has shown that, for this type of method, theoretical convergence to the true gradient is at a rate of $O\!\left(\frac{1}{\sqrt{m}}\right)$, where $m$ denotes the number of episodes (Monte Carlo based analysis).

REINFORCE in Neural Networks

REINFORCE was originally designed (also) for use in neural networks and RNNs.
The agent is represented by stochastic output unit(s) behind which are deterministic hidden units.
For example, consider an RNN with a softmax output of probabilities according to which the action $a_t$ is chosen; the hidden units of the RNN, $\theta_t$, form the policy parameterization.
The REINFORCE algorithm is derived using "unfolding through time" of the RNN: the weights are frozen for the duration of an episode and updated at its end.
This is naturally compatible with backpropagation: $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is the gradient of the corresponding RNN evaluated at timestep $t$.
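A hedged sketch of how this meshes with backpropagation in a modern framework (PyTorch here; the small feedforward network stands in for the RNN, and all names are illustrative): the surrogate loss $-\log \pi_\theta(a_t \mid s_t)\,(R - b)$ is built so that autograd produces exactly the REINFORCE gradient.

import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """Small feedforward stand-in for the RNN policy discussed above."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, num_actions))

    def forward(self, states):
        return torch.distributions.Categorical(logits=self.net(states))

def reinforce_loss(policy, states, actions, episode_return, baseline=0.0):
    """Surrogate loss whose backprop gradient is the REINFORCE estimate."""
    dist = policy(states)               # categorical distribution per timestep
    log_probs = dist.log_prob(actions)  # log pi_theta(a_t | s_t) for each t
    return -(log_probs.sum() * (episode_return - baseline))

# Usage sketch (one episode collected under the current policy):
#   loss = reinforce_loss(policy, states, actions, R, baseline=b)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()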

Example: Recurrent Attention Model (Mnih et al., 2014)
The policy is parameterized by an RNN.
At each step there are two types of actions ($l_t$, the glimpse location, and $a_t$, the classification), controlled by two sub-networks.
The goal is to learn a stochastic policy $\pi((l_t, a_t) \mid s_{1:t}; \theta)$ maximizing rewards.

Example: Recurrent Attention Model (Mnih et al., 2014)
The trajectory is given by $s_{1:t} = x_1, l_1, a_1, \dots, x_{t-1}, l_{t-1}, a_{t-1}, x_t$.
At each step there are two types of actions ($l_t$, the glimpse location, and $a_t$, the classification), controlled by two sub-networks.
The reward is $R = \sum_{t=1}^{T} r_t$, where $r_T = 1$ for a correct classification and $0$ otherwise.

Example: Recurrent Attention Model (Mnih et al., 2014)
The gradient estimate has the form
$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_{1:t}^i\right) \left(\sum_{t'=1}^{T} r_{t'}^i - b_t\right)$

Example: Recurrent Attention Model (Mnih et al., 2014) - Dynamic Environment
The same approach was used to train an agent to play a simple game in a dynamic environment.

Comparison with Related Algorithms

Q-Learning: A Reminder*
The value function:
$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim p(\cdot \mid s, a)}\big[r(s, a) + \gamma\, V^\pi(s')\big]$
The Q-function is the value of taking action $a$ at state $s$:
$Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\big[V(s')\big]$
The optimal value satisfies the Bellman equation:
$V^*(s) = \max_\pi \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim p(\cdot \mid s, a)}\big[r(s, a) + \gamma\, V^*(s')\big]$
*Slides based on Noga Zaslavsky's presentation

Q-Learning: A Reminder*
If we know $V^*$, an optimal policy is to act deterministically:
$a^*(s) = \arg\max_a Q(s, a)$
Learning $Q$ means learning an optimal policy.
*Slides based on Noga Zaslavsky's presentation
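For concreteness, a minimal tabular Q-learning sketch (a standard epsilon-greedy version; the gym-like environment interface with reset() and step() returning (next_state, reward, done) is an assumption of this example):

import numpy as np

def q_learning(env, num_states, num_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Bellman backup toward r + gamma * max_a' Q(s', a').
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q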

Q-Learning
Elegant mathematical characterization; converges to the optimal Q-function for MDP problems.
Model free - uses only the Q-function, not the system dynamics $p(s_{t+1} \mid s_t, a_t)$.
Two main issues complicate the picture:
- What if the problem isn't Markovian (as often arises in multi-agent settings)? Or assume it is, but only partially observable (a POMDP)? Partial observability makes the learning problem much harder.
- Continuous or very large action-state spaces. Learning is still possible with function approximation (DeepMind's Atari-style games), but the max operator in the Bellman equation makes function approximation difficult; it is sensitive to noise and in practice hard to train.

Q-Learning vs. REINFORCE
Q-Learning is better adapted to the classic fully observable Markovian setting; REINFORCE-based learning becomes more relevant the further a problem is from that setting (hidden and continuous states).

Conclusions

Policy parameterization is well suited to complex dynamic systems.
Slow convergence; limited theoretical analysis.
Integrates naturally with deep learning and RNNs.
A general approach that serves as the basis for more effective algorithms.
Not coupled as tightly to Markovity assumptions as Q-Learning.

Further Reading
Sutton, R. S., McAllester, D., Singh, S., Mansour, Y. (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. Advances in Neural Information Processing Systems 12, 1057-1063.
Peters, J., Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 682-697.
Kober, J., Bagnell, J. A., Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11), 1238-1274. http://doi.org/10.1177/0278364913495721
Shalev-Shwartz, S., Ben-Zrihem, N., Cohen, A., Shashua, A. (2016). Long-term Planning by Short-term Prediction. http://arxiv.org/abs/1602.01580


Thank You