Off-Policy Actor-Critic
Ludovic Trottier, DAMAS Laboratory, Laval University
July 25, 2012
Table of Contents
1. Reinforcement Learning Theory
2. Off-PAC Algorithm
3. Experiments
4. Discussion
5. References
1. Reinforcement Learning Theory
On-Policy vs Off-Policy
On-policy learning: learns the value function of the policy that is used to make decisions (SARSA):
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$$
Off-policy learning: learns the value function of a policy different from the one used to make decisions (Q-learning):
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
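To make the distinction concrete, here is a minimal tabular sketch of the two updates, assuming Q is a NumPy array indexed by (state, action); the function names are ours, not from the slides:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap on a_next, the action the behavior policy actually takes.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap on the greedy action, whatever the behavior policy does.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```

The only difference is the bootstrap target, and that is exactly what separates evaluating the behavior policy from evaluating a different (here, greedy) target policy.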
Actor-Critic vs Off-Policy Actor-Critic (Off-PAC)
Actor-Critic: the critic evaluates the behavior policy and the actor improves the behavior policy; data is generated with the behavior policy.
Off-PAC: the critic evaluates the target policy and the actor improves the target policy; data is still generated with the behavior policy.
Figure: Actor-Critic Method (from Ref. [3])
Value Function Approximation
Because of the large state space, the value function needs to be approximated:
$$Q(s, a) = \sum_{s' \in S} \Pr(s' \mid s, a) \left[ r(s, a, s') + \gamma V(s') \right]$$
Off-policy learning + value function approximation $\Rightarrow$ fluctuation (possible divergence).
Based on the analysis of [2], it is better to use a linear approximation:
$$\hat{V}(s) = v^T x_s, \quad v, x_s \in \mathbb{R}^n$$
e.g. tile coding produces binary features $x_s$.
Mountain Car Tile Coding Simulation
$S = [-1.5, 0.5] \times [-0.07, 0.07]$, $A = \{-1, 0, 1\}$
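As an illustration of how tile coding yields the binary features $x_s$ used by the linear critic, here is a hedged sketch for the Mountain Car state space above; the number of tilings, grid size, and offset scheme are our assumptions, not the slide's:

```python
import numpy as np

# State bounds from the slide: position in [-1.5, 0.5], velocity in [-0.07, 0.07].
LOW  = np.array([-1.5, -0.07])
HIGH = np.array([0.5, 0.07])
N_TILINGS, N_TILES = 8, 10  # assumed: 8 offset grids of 10x10 tiles each

def features(s):
    """Binary feature vector x_s: exactly one active tile per tiling."""
    x = np.zeros(N_TILINGS * N_TILES * N_TILES)
    scaled = (s - LOW) / (HIGH - LOW) * N_TILES   # map state to tile units
    for t in range(N_TILINGS):
        offset = t / N_TILINGS                    # each tiling shifted by a tile fraction
        i, j = np.clip((scaled + offset).astype(int), 0, N_TILES - 1)
        x[t * N_TILES * N_TILES + i * N_TILES + j] = 1.0
    return x

v = np.zeros(N_TILINGS * N_TILES * N_TILES)       # weight vector of the linear critic

def V_hat(s):
    return v @ features(s)                        # V_hat(s) = v^T x_s
```

Each of the 8 tilings contributes exactly one active feature, so $\hat{V}(s) = v^T x_s$ reduces to summing 8 weights.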
Stochastic Gradient Descent
Given an example $(x_t, y_t)$ and an objective function $Q(w, x, y)$:
Normal (batch) gradient descent:
$$w_{t+1} = w_t + \alpha \frac{1}{n} \sum_{i=1}^{n} \frac{\partial Q(w, x_i, y_i)}{\partial w}$$
Stochastic gradient descent:
$$w_{t+1} = w_t + \alpha \frac{\partial Q(w, x_t, y_t)}{\partial w}$$
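A small sketch contrasting the two updates; `grad_Q` is a hypothetical callable returning $\partial Q / \partial w$ on a single example:

```python
import numpy as np

def sgd_step(w, x, y, grad_Q, alpha=0.01):
    """One stochastic step: follow the gradient of Q on a single example."""
    return w + alpha * grad_Q(w, x, y)

def batch_step(w, X, Y, grad_Q, alpha=0.01):
    """One batch step: average the gradient over all n examples."""
    grads = np.array([grad_Q(w, x, y) for x, y in zip(X, Y)])
    return w + alpha * grads.mean(axis=0)
```

The stochastic variant is what makes incremental, per-sample learning possible in the critic and actor updates that follow.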
2. Off-PAC Algorithm
Motivations
Three problems of off-policy action-value methods:
1. The target policy is deterministic.
2. Policy improvement is problematic for large action spaces.
3. Small changes in policy evaluation $\Rightarrow$ large changes in policy improvement.
Problem
The policy is a function parametrized by a weight vector $u$:
$$\pi_u : A \times S \to [0, 1]$$
Objective function to maximize [4]:
$$J(u) = \sum_{s \in S} \left[ \lim_{t \to \infty} \Pr(s_t = s \mid s_0, b) \right] V(s), \quad \text{with } V(s) = \sum_{a \in A} \pi(a \mid s) Q(s, a)$$
Figure: Chain domain
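The limiting distribution in $J(u)$ is the stationary distribution of the Markov chain induced by the behavior policy $b$; a hedged toy computation (the transition matrix and value vector are invented for illustration):

```python
import numpy as np

# Hypothetical 3-state chain: P_b[s, s'] is the state transition probability
# under the behavior policy b; V holds assumed values under the target policy.
P_b = np.array([[0.5, 0.5, 0.0],
                [0.5, 0.0, 0.5],
                [0.0, 0.5, 0.5]])
V = np.array([0.0, 1.0, 2.0])

# Stationary distribution d_b: iterate d <- d P_b until it stops changing.
d = np.ones(3) / 3
for _ in range(1000):
    d = d @ P_b

J = d @ V  # J(u) = sum_s d_b(s) V(s)
```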
Description of the Critic
Evaluate an approximation of the true value function $V(s)$,
$$\hat{V}(s) = v^T x_s, \quad v, x_s \in \mathbb{R}^n,$$
in an incremental way using gradient descent. Objective function to minimize:
$$\text{MSPBE}(v) = \| \hat{V} - \Pi T \hat{V} \|_D^2$$
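The slides develop the MSPBE next; as a preview of what the incremental critic update can look like, here is a hedged sketch of a gradient-TD step in the spirit of the GTD family analyzed in [2] (a TDC-style, $\lambda = 0$ variant with an importance weight `rho` for the off-policy correction; not the exact Off-PAC critic):

```python
import numpy as np

def critic_step(v, w, x, x_next, r, rho, alpha=0.01, beta=0.001, gamma=0.99):
    """One gradient-TD critic step on features x = x_s, x_next = x_{s'}.

    v : primary weights of V_hat(s) = v^T x_s
    w : secondary weights tracking the expected TD error
    """
    delta = r + gamma * v @ x_next - v @ x                         # TD error
    v = v + alpha * rho * (delta * x - gamma * (w @ x) * x_next)   # corrected TD update
    w = w + beta * rho * (delta - w @ x) * x                       # auxiliary estimator
    return v, w
```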
MSPBE Objective Function
Given a vector function $f$:
$$\|f\|_D^2 = \sum_{s \in S} \left[ \lim_{t \to \infty} \Pr(s_t = s \mid s_0, b) \right] f^2(s)$$
Define $V^\pi(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s, \pi \right]$ and we get:
$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} \Pr(s, a, s') \left[ R(s, a) + \gamma V^\pi(s') \right]$$
$$= \sum_a \pi(s, a) \left[ \sum_{s'} \Pr(s, a, s') R(s, a) + \gamma \sum_{s'} \Pr(s, a, s') V^\pi(s') \right]$$
In vector form: $V^\pi = E[R] + \gamma P V^\pi = T V^\pi$, which defines the Bellman operator $T$.
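To see $V^\pi = T V^\pi$ numerically, here is a toy fixed-point check on an assumed three-state chain (all numbers invented for illustration):

```python
import numpy as np

gamma = 0.9
P = np.array([[0.9, 0.1, 0.0],      # state transition matrix under pi
              [0.0, 0.9, 0.1],
              [0.0, 0.0, 1.0]])
r = np.array([0.0, 0.0, 1.0])       # expected immediate reward E[R] per state

T = lambda V: r + gamma * P @ V     # Bellman operator T

# V^pi is the unique fixed point of T: solve (I - gamma P) V = r.
V = np.linalg.solve(np.eye(3) - gamma * P, r)
assert np.allclose(T(V), V)         # TV = V at the fixed point
```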
MSPBE Objective Function
For a vector function $f$, we define the linear projection as $\Pi f := f_\theta = \theta^T \phi$ such that
$$\theta = \arg\min_\theta \| f_\theta - f \|_D^2 = \arg\min_\theta \sum_{s \in S} \left[ \lim_{t \to \infty} \Pr(s_t = s \mid s_0, b) \right] (f_\theta - f)^2(s)$$
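This is a weighted least-squares problem, so $\Pi$ has a closed form; a hedged sketch with an invented feature matrix and weighting:

```python
import numpy as np

def project(Phi, f, d):
    """Project f onto span(Phi) under the d-weighted norm, D = diag(d).

    Solves the weighted normal equations Phi^T D Phi theta = Phi^T D f.
    """
    D = np.diag(d)
    theta = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ f)
    return Phi @ theta              # Pi f, the projected function

Phi = np.array([[1.0], [2.0], [3.0]])   # assumed: one feature per state
f = np.array([1.0, 1.0, 4.0])
d = np.array([0.2, 0.3, 0.5])           # stationary distribution weights
print(project(Phi, f, d))
```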
Geometrical Interpretation (figure from Ref. [2])
Example: Iteration 0 (figure)
Example: Iteration 1 (figure, shown in three build steps)
Example: Iteration 2 (figure)
Eligibility Traces
$$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$$
$$R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$$
$$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$$
Figure: n-step backup diagrams (from Ref. [3])
Eligibility Traces
The return $R_t$ can be rewritten as any average of n-step returns, e.g.
$$R_t^{avg} = \frac{1}{2} R_t^{(2)} + \frac{1}{2} R_t^{(4)}$$
(Ref. [3])
Eligibility Traces
The $\lambda$-return used with TD methods is another way to define $R_t^{avg}$:
$$R_t^{avg} := R_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}, \quad 0 \le \lambda \le 1$$
(Ref. [3])
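A hedged forward-view computation of $R_t^\lambda$ at $t = 0$ on a finite trajectory; the truncation handling (leftover $\lambda$ weight on the final return) and all names are our assumptions:

```python
def lambda_return(rewards, values, gamma=0.99, lam=0.9):
    """Forward-view lambda-return at t = 0.

    rewards[k] = r_{k+1}, values[k] = V(s_k); the last n-step return
    (the full return) absorbs the remaining lambda weight.
    """
    T = len(rewards)
    R_n = []                                  # n-step returns R^(n), n = 1..T
    for n in range(1, T + 1):
        g = sum(gamma**k * rewards[k] for k in range(n))
        if n < T:
            g += gamma**n * values[n]         # bootstrap with V(s_n)
        R_n.append(g)
    # (1 - lam) sum_n lam^(n-1) R^(n), plus leftover weight on the final return
    out = sum((1 - lam) * lam**(n - 1) * R_n[n - 1] for n in range(1, T))
    out += lam**(T - 1) * R_n[-1]
    return out
```

The weights $(1-\lambda)\lambda^{n-1}$ plus the leftover $\lambda^{T-1}$ sum to 1, so this is a proper average of n-step returns.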
Off-Policy Policy-Gradient Theorem
Objective function to maximize:
$$J(u) = \sum_{s \in S} \left[ \lim_{t \to \infty} \Pr(s_t = s \mid s_0, b) \right] V(s)$$
$\nabla_u J$ is too difficult to estimate, so we approximate the gradient with $g(u)$:
$$\nabla_u J(u) \approx g(u) = \sum_{s \in S} \left[ \lim_{t \to \infty} \Pr(s_t = s \mid s_0, b) \right] \sum_{a \in A} \nabla_u \pi(a \mid s) Q(s, a)$$
(apply the product rule, then drop the term involving $\nabla_u Q(s, a)$)
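For the record, the product-rule step the slide alludes to reads as follows (our reconstruction of the argument in [1]):

```latex
\nabla_u V(s) = \nabla_u \sum_{a} \pi(a \mid s) \, Q(s, a)
  = \sum_{a} \Big[ \underbrace{\nabla_u \pi(a \mid s) \, Q(s, a)}_{\text{kept in } g(u)}
  + \underbrace{\pi(a \mid s) \, \nabla_u Q(s, a)}_{\text{dropped}} \Big]
```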
Description of the Actor
Uses stochastic gradient ascent on $g(u)$:
$$g(u) = E\left[ \sum_a \nabla_u \pi(a \mid s) Q(s, a) \;\middle|\; s \sim \lim_{t \to \infty} \Pr(\cdot) \right]$$
We can add any function of the state (a baseline) to the equation because $\sum_a \nabla_u \pi(s, a) = 0$ [4].
Intuition: $\nabla_u \pi(a \mid s)$ and $Q(s, a)$ are orthogonal.
Description of the Actor
Rewriting the terms as an expectation over the behavior policy:
$$g(u) = E\left[ \rho(s, a) \, \psi(s, a) \, Q(s, a) \;\middle|\; s \sim \lim_{t \to \infty} \Pr(\cdot),\; a \sim b(\cdot \mid s) \right],$$
where $\rho(s, a) = \frac{\pi(a \mid s)}{b(a \mid s)}$ and $\psi(s, a) = \frac{\nabla_u \pi(a \mid s)}{\pi(a \mid s)}$.
We further approximate $Q(s, a)$ with the off-policy $\lambda$-return $R_t^\lambda$:
$$R_t^\lambda = r_{t+1} + (1 - \lambda) \gamma \hat{V}(s_{t+1}) + \lambda \gamma \rho(s, a) R_{t+1}^\lambda$$
Forward view $\Rightarrow$ backward view (eligibility traces).
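Putting the pieces together, here is a hedged sketch of a one-step ($\lambda = 0$) Off-PAC-style actor update for a Gibbs target policy; the critic's TD error `delta` stands in for $Q(s,a)$, and all names are ours:

```python
import numpy as np

def actor_step(u, Phi, a, b_probs, delta, alpha_u=0.001):
    """One actor step.

    u       : weights of the Gibbs (softmax) target policy
    Phi     : per-action feature matrix, Phi[a] = phi(s, a)
    a       : action actually taken by the behavior policy b
    b_probs : behavior probabilities b(.|s)
    delta   : critic's TD error, used in place of Q(s, a)
    """
    prefs = Phi @ u
    pi = np.exp(prefs - prefs.max()); pi /= pi.sum()  # pi(.|s)
    rho = pi[a] / b_probs[a]                          # rho(s, a) = pi / b
    psi = Phi[a] - pi @ Phi                           # grad_u log pi(a|s) for softmax
    return u + alpha_u * rho * delta * psi            # ascend rho * psi * Q estimate
```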
3. Experiments
Parameters
4 methods: Q(λ), Greedy-GQ, Softmax-GQ, Off-PAC
$b(\cdot \mid s)$ = uniform distribution
$\gamma = 0.99$
# steps = 5000, with performance testing every 20 episodes
Gibbs distribution for the target policy:
$$\pi(a \mid s) = \frac{\exp\{u^T \phi_{s,a}\}}{\sum_{a'} \exp\{u^T \phi_{s,a'}\}}$$
$10 \times 10$ tile coding feature extraction
Parameter sweep
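Since Softmax-GQ and Off-PAC both use this Gibbs parameterization, a tiny hedged sketch of the distribution itself (the max-shift for numerical stability is our addition, not in the slides):

```python
import numpy as np

def gibbs_policy(u, phi_all):
    """pi(a|s) proportional to exp(u^T phi_{s,a}); phi_all has one row per action."""
    prefs = phi_all @ u
    p = np.exp(prefs - prefs.max())   # subtract the max to avoid overflow
    return p / p.sum()
```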
Results
Off-PAC is successful (performance figure from Ref. [1])
4. Discussion
Discussion
Why use approximation (value function + policy) on problems with small state/action spaces?
Isn't a linear approximation of the value function very naive?
A uniform behavior policy may cause problems for Q(λ).
Is a stochastic policy really necessary?
What is the complexity?
Is the Bellman operator T defined with a deterministic policy but applied to a stochastic one?
Conclusion
1. Off-policy vs on-policy methods
2. Actor-critic for on- and off-policy learning
3. Description of the critic (value function approximation, eligibility traces, stochastic gradient descent, GTD(λ))
4. Description of the actor (approximation of the gradient, forward/backward view, objective function)
5. Results
6. (Possible) problems
5. References
References
[1] Thomas Degris, Martha White, and Richard S. Sutton. Off-policy actor-critic. CoRR, abs/1205.4839, 2012.
[2] Hamid Reza Maei. Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta, Fall 2011.
[3] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[4] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057–1063, 1999.