Approximation Methods in Reinforcement Learning

2018 CS420, Machine Learning, Lecture 12: Approximation Methods in Reinforcement Learning. Weinan Zhang, Shanghai Jiao Tong University. http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html

Reinforcement Learning Materials. Our course on RL is mainly based on materials from these masters. Prof. Richard Sutton (University of Alberta, Canada, http://incompleteideas.net/sutton/index.html): Reinforcement Learning: An Introduction (2nd edition), http://incompleteideas.net/sutton/book/the-book-2nd.html. Dr. David Silver (Google DeepMind and UCL, UK, http://www0.cs.ucl.ac.uk/staff/d.silver/web/home.html): UCL Reinforcement Learning Course, http://www0.cs.ucl.ac.uk/staff/d.silver/web/teaching.html. Prof. Andrew Ng (Stanford University, US, http://www.andrewng.org/): Machine Learning (CS229) Lecture Notes 12: RL, http://cs229.stanford.edu/materials.html

Last Lecture.
Model-based dynamic programming:
Value iteration: V(s) = R(s) + γ max_{a∈A} Σ_{s'∈S} P_sa(s') V(s')
Policy iteration: π(s) = argmax_{a∈A} Σ_{s'∈S} P_sa(s') V(s')
Model-free reinforcement learning:
On-policy MC: V(s_t) ← V(s_t) + α (G_t − V(s_t))
On-policy TD: V(s_t) ← V(s_t) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t))
On-policy TD SARSA: Q(s_t, a_t) ← Q(s_t, a_t) + α (r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))
Off-policy TD Q-learning: Q(s_t, a_t) ← Q(s_t, a_t) + α (r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t))
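As a quick reminder of the tabular setting this lecture moves away from, here is a minimal sketch of the off-policy Q-learning update above; the environment interface and hyperparameters are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular off-policy TD (Q-learning) update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example: a lookup table over 10 states and 4 actions.
Q = np.zeros((10, 4))
Q = q_learning_update(Q, s=0, a=2, r=1.0, s_next=3)
```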

Key Problem to Solve in This Lecture. In all previous models, we maintained a lookup table with one entry V(s) for each state or Q(s, a) for each state-action pair. What if we have a large MDP, i.e. the state or state-action space is too large, or the state or action space is continuous, so that we cannot maintain V(s) for each state or Q(s, a) for each state-action pair? For example: the game of Go (10^170 states); a helicopter or an autonomous car (continuous state space).

Content Solutions for large MDPs Discretize or bucketize states/actions Build parametric value function approximation Policy gradient Deep reinforcement learning and multi-agent RL

Discretization. Continuous MDP: for a continuous-state MDP, we can discretize the state space. For example, if we have 2D states (s_1, s_2), we can use a grid over the (s_1, s_2) plane to discretize the state space; each grid cell is a discrete state s̄. The discretized MDP is (S̄, A, {P_s̄a}, γ, R). Then solve this MDP with any of the previous solutions.
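A minimal sketch of grid discretization for a 2D continuous state; the grid bounds and resolution below are illustrative assumptions.

```python
import numpy as np

def discretize(state, low, high, bins):
    """Map a continuous 2D state to the index of its discrete grid cell.
    low/high give the range of each dimension; bins is the number of cells per dimension."""
    state = np.clip(state, low, high)
    ratios = (state - low) / (high - low)                 # scaled to [0, 1]
    cell = np.minimum((ratios * bins).astype(int), bins - 1)
    return cell[0] * bins + cell[1]                       # flatten the 2D cell to one id

# Example: states in [-1, 1] x [0, 5], a 10x10 grid -> 100 discrete states s_bar.
low, high = np.array([-1.0, 0.0]), np.array([1.0, 5.0])
s_bar = discretize(np.array([0.3, 2.7]), low, high, bins=10)
```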

Bucketize a Large Discrete MDP. For a large discrete-state MDP, we can bucketize the states to downsample the state space, using domain knowledge to merge similar discrete states, for example by clustering states with features extracted from domain knowledge.

Discretization/Bucketization. Pros: straightforward and off-the-shelf; efficient; can work well for many problems. Cons: a fairly naive representation for V; assumes a constant value over each discretized cell; curse of dimensionality: S = ℝ^n ⇒ S̄ = {1, ..., k}^n.

Parametric Value Function Approximation. Create parametric (thus learnable) functions to approximate the value function: V_θ(s) ≈ V^π(s), Q_θ(s, a) ≈ Q^π(s, a), where θ is the parameter vector of the approximation function, which can be updated by reinforcement learning. This generalizes from seen states to unseen states.

Main Types of Value Function Approximation. We approximate V_θ(s) (input: state s) and Q_θ(s, a) (inputs: state s and action a), with parameters θ. Many function approximators could be used: (generalized) linear models, neural networks, decision trees, nearest neighbors, Fourier/wavelet bases. We focus on differentiable functions: (generalized) linear models and neural networks. We also assume the model is suitable to be trained on non-stationary, non-iid data.

Value Function Approximation by SGD. Goal: find the parameter vector θ minimizing the mean-squared error between the approximate value function V_θ(s) and the true value V^π(s):
J(θ) = E_π[ 1/2 (V^π(s) − V_θ(s))^2 ]
Gradient to minimize the error:
∂J(θ)/∂θ = −E_π[ (V^π(s) − V_θ(s)) ∂V_θ(s)/∂θ ]
Stochastic gradient descent on one sample:
θ ← θ − α ∂J(θ)/∂θ = θ + α (V^π(s) − V_θ(s)) ∂V_θ(s)/∂θ

Featurize the State. Represent the state by a feature vector x(s) = (x_1(s), ..., x_k(s))^T. For example, for a helicopter: 3D location, 3D speed (derivative of location), 3D acceleration (derivative of speed).

Linear Value Function Approximation. Represent the value function by a linear combination of features:
V_θ(s) = θ^T x(s)
The objective function is quadratic in the parameters θ:
J(θ) = E_π[ 1/2 (V^π(s) − θ^T x(s))^2 ]
Thus stochastic gradient descent converges to the global optimum:
θ ← θ − α ∂J(θ)/∂θ = θ + α (V^π(s) − V_θ(s)) x(s)
(step size α, prediction error V^π(s) − V_θ(s), feature value x(s))

Monte-Carlo with Value Function Approximation. θ ← θ + α (V^π(s) − V_θ(s)) x(s). Now we specify the target value function V^π(s). We can apply supervised learning to the training data ⟨s_1, G_1⟩, ⟨s_2, G_2⟩, ..., ⟨s_T, G_T⟩. For each data instance ⟨s_t, G_t⟩:
θ ← θ + α (G_t − V_θ(s_t)) x(s_t)
MC evaluation converges at least to a local optimum; in the linear case it converges to the global optimum.

TD Learning with Value Function Approximation. θ ← θ + α (V^π(s) − V_θ(s)) x(s). The TD target r_{t+1} + γ V_θ(s_{t+1}) is a biased sample of the true target value V^π(s_t). Supervised learning from the training data ⟨s_1, r_2 + γ V_θ(s_2)⟩, ⟨s_2, r_3 + γ V_θ(s_3)⟩, ..., ⟨s_{T−1}, r_T⟩. For each data instance ⟨s_t, r_{t+1} + γ V_θ(s_{t+1})⟩:
θ ← θ + α (r_{t+1} + γ V_θ(s_{t+1}) − V_θ(s_t)) x(s_t)
Linear TD converges (close) to the global optimum.
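A minimal numpy sketch of the two updates above for a linear value function; the feature vectors and step size below are illustrative assumptions.

```python
import numpy as np

def v_hat(theta, x_s):
    """Linear value estimate V_theta(s) = theta^T x(s)."""
    return theta @ x_s

def mc_update(theta, x_s, G_t, alpha=0.01):
    """Monte-Carlo target: the observed return G_t."""
    return theta + alpha * (G_t - v_hat(theta, x_s)) * x_s

def td_update(theta, x_s, r, x_s_next, alpha=0.01, gamma=0.99):
    """TD(0) target: r + gamma * V_theta(s'); theta is treated as fixed
    inside the target (no gradient through the target)."""
    target = r + gamma * v_hat(theta, x_s_next)
    return theta + alpha * (target - v_hat(theta, x_s)) * x_s

# Example with k = 4 features.
theta = np.zeros(4)
x_s, x_s_next = np.array([1.0, 0.5, 0.0, 2.0]), np.array([0.0, 1.0, 1.0, 0.5])
theta = mc_update(theta, x_s, G_t=3.0)
theta = td_update(theta, x_s, r=1.0, x_s_next=x_s_next)
```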

Action-Value Function Approximation. Approximate the action-value function Q_θ(s, a) ≈ Q^π(s, a). Minimize the mean squared error:
J(θ) = E_π[ 1/2 (Q^π(s, a) − Q_θ(s, a))^2 ]
Stochastic gradient descent on one sample:
θ ← θ − α ∂J(θ)/∂θ = θ + α (Q^π(s, a) − Q_θ(s, a)) ∂Q_θ(s, a)/∂θ

Linear Action-Value Function Approximation. Represent a state-action pair by a feature vector x(s, a) = (x_1(s, a), ..., x_k(s, a))^T. Parametric Q function, e.g., the linear case: Q_θ(s, a) = θ^T x(s, a). Stochastic gradient descent update:
θ ← θ − α ∂J(θ)/∂θ = θ + α (Q^π(s, a) − θ^T x(s, a)) x(s, a)

MC and TD Learning with Action-Value Function Approximation. θ ← θ + α (Q^π(s, a) − Q_θ(s, a)) ∂Q_θ(s, a)/∂θ. For MC, the target is the return G_t:
θ ← θ + α (G_t − Q_θ(s, a)) ∂Q_θ(s, a)/∂θ
For TD, the target is r_{t+1} + γ Q_θ(s_{t+1}, a_{t+1}):
θ ← θ + α (r_{t+1} + γ Q_θ(s_{t+1}, a_{t+1}) − Q_θ(s, a)) ∂Q_θ(s, a)/∂θ

Control with Value Function Approximation. Start with Q_θ ≈ Q^π and iterate: Policy evaluation: approximate policy evaluation, Q_θ ≈ Q^π. Policy improvement: ε-greedy policy improvement, π = ε-greedy(Q_θ).
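A minimal sketch of this control loop with a linear Q function and ε-greedy action selection (on-policy SARSA-style updates); the environment interface, feature map, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def q_hat(theta, x_sa):
    """Linear action-value estimate Q_theta(s,a) = theta^T x(s,a)."""
    return theta @ x_sa

def epsilon_greedy(theta, features, s, actions, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if np.random.rand() < eps:
        return int(np.random.choice(actions))
    values = [q_hat(theta, features(s, a)) for a in actions]
    return actions[int(np.argmax(values))]

def sarsa_step(theta, features, s, a, r, s_next, a_next,
               alpha=0.01, gamma=0.99):
    """On-policy TD update with linear action-value function approximation."""
    x_sa = features(s, a)
    td_target = r + gamma * q_hat(theta, features(s_next, a_next))
    return theta + alpha * (td_target - q_hat(theta, x_sa)) * x_sa
```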

NOTE on the TD Update. For TD(0), the TD target is:
State value: θ ← θ + α (V^π(s_t) − V_θ(s_t)) ∂V_θ(s_t)/∂θ = θ + α (r_{t+1} + γ V_θ(s_{t+1}) − V_θ(s_t)) ∂V_θ(s_t)/∂θ
Action value: θ ← θ + α (Q^π(s_t, a_t) − Q_θ(s_t, a_t)) ∂Q_θ(s_t, a_t)/∂θ = θ + α (r_{t+1} + γ Q_θ(s_{t+1}, a_{t+1}) − Q_θ(s_t, a_t)) ∂Q_θ(s_t, a_t)/∂θ
Although θ appears in the TD target, we don't calculate the gradient through the target. Think about why.

Case Study: Mountain Car. The gravity is stronger than the car's engine. (Figure: the learned cost-to-go function.)

Deep Q-Network (DQN) Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al. Playing Atari with Deep Reinforcement Learning. NIPS 2013 workshop. Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al. Human-level control through deep reinforcement learning. Nature 2015.

Deep Q-Network (DQN) Implement Q function with deep neural network Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al. Playing Atari with Deep Reinforcement Learning. NIPS 2013 workshop. Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al. Human-level control through deep reinforcement learning. Nature 2015.

Deep Q-Network (DQN). The loss function of the Q-learning update at iteration i:
L_i(θ_i) = E_{(s,a,r,s')∼U(D)}[ ( r + γ max_{a'} Q(s', a'; θ_i^−) − Q(s, a; θ_i) )^2 ]
where r + γ max_{a'} Q(s', a'; θ_i^−) is the target Q value and Q(s, a; θ_i) is the estimated Q value.
θ_i are the network parameters to be updated at iteration i, updated with standard back-propagation algorithms.
θ_i^− are the target network parameters, only synchronized with θ_i every C steps.
(s, a, r, s') ∼ U(D): the samples are drawn uniformly from the experience replay pool D, which avoids overfitting to recent experiences.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al. Playing Atari with Deep Reinforcement Learning. NIPS 2013 workshop. Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al. Human-level control through deep reinforcement learning. Nature 2015.
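A minimal sketch of how the DQN target and loss above could be computed on a sampled minibatch, assuming generic q_net / target_net functions that map a batch of states to per-action Q values; all names and shapes here are illustrative assumptions, not the original implementation.

```python
import numpy as np

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared TD error on a minibatch drawn uniformly from the replay pool.
    batch: dict of numpy arrays with keys "s", "a", "r", "s_next", "done".
    q_net(states) and target_net(states) return arrays of shape (batch_size, n_actions)."""
    q_values = q_net(batch["s"])                                  # Q(s, .; theta_i)
    q_sa = q_values[np.arange(len(batch["a"])), batch["a"]]       # Q(s, a; theta_i)
    # Target uses the frozen parameters theta_i^- and bootstraps with max over a';
    # terminal transitions (done = 1) get no bootstrap term.
    next_q = target_net(batch["s_next"]).max(axis=1)
    targets = batch["r"] + gamma * (1.0 - batch["done"]) * next_q
    return np.mean((targets - q_sa) ** 2)
```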

Content Solutions for large MDPs Discretize or bucketize states/actions Build parametric value function approximation Policy gradient Deep reinforcement learning and multi-agent RL

Parametric Policy. We can parametrize the policy π_θ(a|s), which could be deterministic, a = π_θ(s), or stochastic, π_θ(a|s) = P(a|s; θ), where θ is the parameter vector of the policy. This generalizes from seen states to unseen states. We focus on model-free reinforcement learning.

Policy-based RL. Advantages: better convergence properties; effective in high-dimensional or continuous action spaces (the No. 1 reason: with a value function you have to take a max over actions); can learn stochastic policies. Disadvantages: typically converges to a local rather than the global optimum; evaluating a policy is typically inefficient and has high variance.

Policy Gradient. For a stochastic policy π_θ(a|s) = P(a|s; θ), the intuition is to lower the probability of actions that lead to low value/reward and raise the probability of actions that lead to high value/reward. A 5-action example: 1. Initialize θ. 2. Take action A2 and observe a positive reward. 3. Update θ by the policy gradient. 4. Take action A3 and observe a negative reward. 5. Update θ by the policy gradient. (Figures: bar charts of the action probabilities over A1-A5 after each step, with A2's probability raised after step 3 and A3's lowered after step 5.)

Policy Gradient in One-Step MDPs. Consider a simple class of one-step MDPs: starting in state s ∼ d(s), terminating after one time-step with reward r_sa. The policy's expected value:
J(θ) = E_{π_θ}[r] = Σ_{s∈S} d(s) Σ_{a∈A} π_θ(a|s) r_sa
∂J(θ)/∂θ = Σ_{s∈S} d(s) Σ_{a∈A} ∂π_θ(a|s)/∂θ r_sa

Likelihood Ratio. Likelihood ratios exploit the following identity:
∂π_θ(a|s)/∂θ = π_θ(a|s) (1/π_θ(a|s)) ∂π_θ(a|s)/∂θ = π_θ(a|s) ∂ log π_θ(a|s)/∂θ
Thus the policy's expected value
J(θ) = E_{π_θ}[r] = Σ_{s∈S} d(s) Σ_{a∈A} π_θ(a|s) r_sa
∂J(θ)/∂θ = Σ_{s∈S} d(s) Σ_{a∈A} ∂π_θ(a|s)/∂θ r_sa = Σ_{s∈S} d(s) Σ_{a∈A} π_θ(a|s) ∂ log π_θ(a|s)/∂θ r_sa = E_{π_θ}[ ∂ log π_θ(a|s)/∂θ r_sa ]
This can be approximated by sampling state s from d(s) and action a from π_θ.

Policy Gradient Theorem. The policy gradient theorem generalizes the likelihood ratio approach to multi-step MDPs, replacing the instantaneous reward r_sa with the long-term value Q^{π_θ}(s, a). The policy gradient theorem applies to the start state objective J_1, the average reward objective J_avR, and the average value objective J_avV. Theorem: for any differentiable policy π_θ(a|s) and for any of the policy objective functions J = J_1, J_avR, J_avV, the policy gradient is
∂J(θ)/∂θ = E_{π_θ}[ ∂ log π_θ(a|s)/∂θ Q^{π_θ}(s, a) ]
Please refer to the appendix of the slides for detailed proofs.

Monte-Carlo Policy Gradient (REINFORCE). Update parameters by stochastic gradient ascent, using the policy gradient theorem and the return G_t as an unbiased sample of Q^{π_θ}(s_t, a_t):
Δθ_t = α ∂ log π_θ(a_t|s_t)/∂θ G_t
REINFORCE Algorithm:
Initialize θ arbitrarily
for each episode {s_1, a_1, r_2, ..., s_{T−1}, a_{T−1}, r_T} ∼ π_θ do
  for t = 1 to T−1 do
    θ ← θ + α ∂ log π_θ(a_t|s_t)/∂θ G_t
  end for
end for
return θ
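A minimal sketch of the REINFORCE loop above; run_episode and grad_log_pi are assumed interfaces (an episode generator under the current policy and the score function of the policy), and the hyperparameters are illustrative.

```python
import numpy as np

def reinforce(run_episode, grad_log_pi, theta, alpha=0.01, gamma=1.0,
              n_episodes=1000):
    """Monte-Carlo policy gradient.
    run_episode(theta) -> list of (s_t, a_t, r_{t+1}) tuples for one episode.
    grad_log_pi(theta, s, a) -> gradient of log pi_theta(a|s) w.r.t. theta."""
    for _ in range(n_episodes):
        episode = run_episode(theta)
        # Compute the returns G_t backwards: G_t = r_{t+1} + gamma * G_{t+1}.
        G, returns = 0.0, []
        for (_, _, r) in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # Update theta for each step, as in the slide's pseudocode.
        for (s, a, _), G_t in zip(episode, returns):
            theta = theta + alpha * grad_log_pi(theta, s, a) * G_t
    return theta
```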

Puck World Example. Continuous actions exert a small force on the puck. The puck is rewarded for getting close to the target. The target location is reset every 30 seconds. The policy is trained using a variant of MC policy gradient.

Softmax Stochastic Policy. The softmax policy is a very commonly used stochastic policy:
π_θ(a|s) = e^{f_θ(s,a)} / Σ_{a'} e^{f_θ(s,a')}
where f_θ(s, a) is the score function of a state-action pair parametrized by θ, which can be defined with domain knowledge. The gradient of its log-likelihood:
∂ log π_θ(a|s)/∂θ = ∂f_θ(s,a)/∂θ − (1 / Σ_{a'} e^{f_θ(s,a')}) Σ_{a''} e^{f_θ(s,a'')} ∂f_θ(s,a'')/∂θ = ∂f_θ(s,a)/∂θ − E_{a'∼π_θ(a'|s)}[ ∂f_θ(s,a')/∂θ ]

Softmax Stochastic Policy. The softmax policy is a very commonly used stochastic policy:
π_θ(a|s) = e^{f_θ(s,a)} / Σ_{a'} e^{f_θ(s,a')}
where f_θ(s, a) is the score function of a state-action pair parametrized by θ, which can be defined with domain knowledge. For example, with the linear score function f_θ(s, a) = θ^T x(s, a):
∂ log π_θ(a|s)/∂θ = ∂f_θ(s,a)/∂θ − E_{a'∼π_θ(a'|s)}[ ∂f_θ(s,a')/∂θ ] = x(s, a) − E_{a'∼π_θ(a'|s)}[ x(s, a') ]
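A minimal numpy sketch of this linear softmax policy and its log-likelihood gradient, which could serve as the grad_log_pi used in the REINFORCE sketch earlier; the feature map and action set are assumed inputs.

```python
import numpy as np

def softmax_probs(theta, features, s, actions):
    """pi_theta(a|s) with linear scores f_theta(s,a) = theta^T x(s,a)."""
    scores = np.array([theta @ features(s, a) for a in actions])
    scores -= scores.max()                      # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def grad_log_pi(theta, features, s, a, actions):
    """grad log pi_theta(a|s) = x(s,a) - E_{a'~pi_theta(.|s)}[x(s,a')]."""
    probs = softmax_probs(theta, features, s, actions)
    expected_x = sum(p * features(s, b) for p, b in zip(probs, actions))
    return features(s, a) - expected_x
```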

Sequence Generation Example. The generator is a reinforcement learning policy G_θ(y_t | y_{1:t−1}) for generating a sequence: it decides the next word (a discrete action) to generate given the previous ones, implemented by a softmax policy. The discriminator provides the reward (i.e. the probability of being true data) for the whole sequence. G is trained by MC policy gradient (REINFORCE). [Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI 2017.]

Experiments on Synthetic Data. Evaluation measure with the oracle:
NLL_oracle = −E_{Y_{1:T} ∼ G_θ}[ Σ_{t=1}^T log G_oracle(y_t | y_{1:t−1}) ]
[Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI 2017.]

Content Solutions for large MDPs Discretize or bucketize states/actions Build parametric value function approximation Policy gradient Deep reinforcement learning and multi-agent RL By our invited speakers Yuhuai Wu, Shixiang Gu and Ying Wen

APPENDIX: Policy Gradient Theorem, Average Reward Setting.
Average reward objective:
J(π) = lim_{n→∞} (1/n) E[ r_1 + r_2 + ... + r_n | π ] = Σ_s d^π(s) Σ_a π(s, a) r(s, a)
Q^π(s, a) = Σ_{t=1}^∞ E[ r_t − J(π) | s_0 = s, a_0 = a, π ]
Then, for all s:
∂V^π(s)/∂θ = ∂/∂θ Σ_a π(s, a) Q^π(s, a)
= Σ_a [ ∂π(s, a)/∂θ Q^π(s, a) + π(s, a) ∂Q^π(s, a)/∂θ ]
= Σ_a [ ∂π(s, a)/∂θ Q^π(s, a) + π(s, a) ∂/∂θ ( r(s, a) − J(π) + Σ_{s'} P^a_{ss'} V^π(s') ) ]
= Σ_a [ ∂π(s, a)/∂θ Q^π(s, a) + π(s, a) ( −∂J(π)/∂θ + Σ_{s'} P^a_{ss'} ∂V^π(s')/∂θ ) ]
⇒ ∂J(π)/∂θ = Σ_a [ ∂π(s, a)/∂θ Q^π(s, a) + π(s, a) Σ_{s'} P^a_{ss'} ∂V^π(s')/∂θ ] − ∂V^π(s)/∂θ
Please refer to Chapter 13 of Rich Sutton's Reinforcement Learning: An Introduction (2nd edition).

APPENDIX: Policy Gradient Theorem, Average Reward Setting (continued).
From the previous slide:
∂J(π)/∂θ = Σ_a [ ∂π(s, a)/∂θ Q^π(s, a) + π(s, a) Σ_{s'} P^a_{ss'} ∂V^π(s')/∂θ ] − ∂V^π(s)/∂θ
Summing both sides weighted by the stationary distribution d^π(s):
Σ_s d^π(s) ∂J(π)/∂θ = Σ_s d^π(s) Σ_a ∂π(s, a)/∂θ Q^π(s, a) + Σ_s d^π(s) Σ_a π(s, a) Σ_{s'} P^a_{ss'} ∂V^π(s')/∂θ − Σ_s d^π(s) ∂V^π(s)/∂θ
For the middle term, by stationarity of d^π:
Σ_s d^π(s) Σ_a π(s, a) Σ_{s'} P^a_{ss'} ∂V^π(s')/∂θ = Σ_{s'} ( Σ_s Σ_a d^π(s) π(s, a) P^a_{ss'} ) ∂V^π(s')/∂θ = Σ_{s'} d^π(s') ∂V^π(s')/∂θ
⇒ Σ_s d^π(s) ∂J(π)/∂θ = Σ_s d^π(s) Σ_a ∂π(s, a)/∂θ Q^π(s, a) + Σ_{s'} d^π(s') ∂V^π(s')/∂θ − Σ_s d^π(s) ∂V^π(s)/∂θ
⇒ ∂J(π)/∂θ = Σ_s d^π(s) Σ_a ∂π(s, a)/∂θ Q^π(s, a)
Please refer to Chapter 13 of Rich Sutton's Reinforcement Learning: An Introduction (2nd edition).

APPENDIX: Policy Gradient Theorem, Start Value Setting.
Start state value objective:
J(π) = E[ Σ_{t=1}^∞ γ^{t−1} r_t | s_0, π ]
Q^π(s, a) = E[ Σ_{k=1}^∞ γ^{k−1} r_{t+k} | s_t = s, a_t = a, π ]
Then, for all s:
∂V^π(s)/∂θ = ∂/∂θ Σ_a π(s, a) Q^π(s, a)
= Σ_a [ ∂π(s, a)/∂θ Q^π(s, a) + π(s, a) ∂Q^π(s, a)/∂θ ]
= Σ_a [ ∂π(s, a)/∂θ Q^π(s, a) + π(s, a) ∂/∂θ ( r(s, a) + Σ_{s'} γ P^a_{ss'} V^π(s') ) ]
= Σ_a ∂π(s, a)/∂θ Q^π(s, a) + Σ_a π(s, a) Σ_{s'} γ P^a_{ss'} ∂V^π(s')/∂θ
Please refer to Chapter 13 of Rich Sutton's Reinforcement Learning: An Introduction (2nd edition).

APPENDIX: Policy Gradient Theorem, Start Value Setting (continued).
∂V^π(s)/∂θ = Σ_a ∂π(s, a)/∂θ Q^π(s, a) + Σ_a π(s, a) Σ_{s_1} γ P^a_{ss_1} ∂V^π(s_1)/∂θ
= γ^0 Pr(s → s, 0, π) Σ_a ∂π(s, a)/∂θ Q^π(s, a) + γ Σ_{s_1} Pr(s → s_1, 1, π) ∂V^π(s_1)/∂θ
where Pr(s → s_1, 1, π) = Σ_a π(s, a) P^a_{ss_1}. Expanding ∂V^π(s_1)/∂θ in the same way:
∂V^π(s_1)/∂θ = Σ_a ∂π(s_1, a)/∂θ Q^π(s_1, a) + γ Σ_{s_2} Pr(s_1 → s_2, 1, π) ∂V^π(s_2)/∂θ
Please refer to Chapter 13 of Rich Sutton's Reinforcement Learning: An Introduction (2nd edition).

APPENDIX: Policy Gradient Theorem, Start Value Setting (continued).
Unrolling the recursion:
∂V^π(s)/∂θ = γ^0 Pr(s → s, 0, π) Σ_a ∂π(s, a)/∂θ Q^π(s, a) + γ Σ_{s_1} Pr(s → s_1, 1, π) Σ_a ∂π(s_1, a)/∂θ Q^π(s_1, a) + γ^2 Σ_{s_1} Pr(s → s_1, 1, π) Σ_{s_2} Pr(s_1 → s_2, 1, π) ∂V^π(s_2)/∂θ
= γ^0 Pr(s → s, 0, π) Σ_a ∂π(s, a)/∂θ Q^π(s, a) + γ Σ_{s_1} Pr(s → s_1, 1, π) Σ_a ∂π(s_1, a)/∂θ Q^π(s_1, a) + γ^2 Σ_{s_2} Pr(s → s_2, 2, π) ∂V^π(s_2)/∂θ
= ... = Σ_{k=0}^∞ Σ_x γ^k Pr(s → x, k, π) Σ_a ∂π(x, a)/∂θ Q^π(x, a)
⇒ ∂J(π)/∂θ = ∂V^π(s_0)/∂θ = Σ_s Σ_{k=0}^∞ γ^k Pr(s_0 → s, k, π) Σ_a ∂π(s, a)/∂θ Q^π(s, a) = Σ_s d^π(s) Σ_a ∂π(s, a)/∂θ Q^π(s, a)
Please refer to Chapter 13 of Rich Sutton's Reinforcement Learning: An Introduction (2nd edition).