Off-Policy Actor-Critic

Ludovic Trottier (DAMAS Laboratory)
Laval University
July 25, 2012

Table of Contents
1 Reinforcement Learning Theory
2 Off-PAC Algorithm
3 Experiments
4 Discussion
5 References

1 Reinforcement Learning Theory

On vs Off Policy

On-policy learning: learns the value function of the policy that is used to make decisions (SARSA).

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma Q(s',a') - Q(s,a) \right]$$

Off-policy learning: learns the value function of a policy different from the one used to make decisions (Q-learning).

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$
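The two update rules differ only in the bootstrap target. The sketch below assumes a tabular Q stored as a NumPy array indexed by integer states and actions; the function names, the step size, and the toy values are illustrative assumptions, not part of the slides.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action a_next actually taken in s_next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the greedy action in s_next."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Toy table (hypothetical values): state 1 has one poor and one good action.
Q_sarsa = np.zeros((2, 2))
Q_sarsa[1] = [1.0, 5.0]
Q_qlearn = Q_sarsa.copy()

# Same transition; the behavior policy happened to pick the poor action 0 in state 1.
sarsa_update(Q_sarsa, s=0, a=0, r=0.0, s_next=1, a_next=0)
q_learning_update(Q_qlearn, s=0, a=0, r=0.0, s_next=1)
```

With the same transition, SARSA moves Q(0, 0) toward the value of the action that was taken next, while Q-learning moves it toward the value of the best next action.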

Actor-Critic vs Off-Policy Actor-Critic (Off-PAC)

Actor-Critic (data generated with the behavior policy):

           evaluates   improves
  critic   behavior    -
  actor    -           behavior

Off-PAC (data generated with the behavior policy):

           evaluates   improves
  critic   target      -
  actor    -           target

Figure: Actor-Critic Method (Ref. [3])

Value Function Approximation

Because the state space is large, the value function needs to be approximated.

$$Q(s,a) = \sum_{s' \in S} \Pr(s' \mid s, a) \left[ r(s,a,s') + \gamma V(s') \right]$$

Off-policy learning combined with value function approximation leads to fluctuation.

Based on the analysis of [2], it is better to use linear approximation:

$$\hat{V}(s) = v^T x_s, \qquad v_i \in \mathbb{R},\ x_{s,i} \in \mathbb{R}$$

e.g. tile coding, which produces binary features $x_s$.

Mountain Car Tile Coding Simulation

$$S = [-1.5, 0.5] \times [-0.07, 0.07], \qquad A = \{-1, 0, 1\}$$
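A minimal sketch of tile coding over this state space, feeding the linear approximation $\hat{V}(s) = v^T x_s$. The number of tilings, the tiles per dimension, and the uniform offset scheme are assumptions chosen for illustration; the slides do not specify them.

```python
import numpy as np

LOW = np.array([-1.5, -0.07])    # lower bounds of S, as on the slide
HIGH = np.array([0.5, 0.07])     # upper bounds of S, as on the slide

def tile_features(state, n_tilings=4, n_tiles=8):
    """Binary feature vector x_s with exactly one active tile per tiling."""
    scaled = (np.asarray(state, dtype=float) - LOW) / (HIGH - LOW)  # each dim in [0, 1]
    x = np.zeros(n_tilings * n_tiles * n_tiles)
    for k in range(n_tilings):
        offset = k / (n_tilings * n_tiles)        # shift tiling k by a fraction of a tile
        idx = np.floor((scaled + offset) * n_tiles).astype(int)
        idx = np.clip(idx, 0, n_tiles - 1)        # keep shifted indices inside the grid
        x[k * n_tiles * n_tiles + idx[0] * n_tiles + idx[1]] = 1.0
    return x

def v_hat(v, state):
    """Linear value estimate v^T x_s."""
    return float(v @ tile_features(state))
```

Because each tiling activates exactly one tile, `v_hat` is the sum of `n_tilings` weights, and nearby states share most of their active tiles.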

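Stochastic Gradient Descent

Given an example $(x_t, y_t) \in X$ and an objective function $Q(w, x, y)$, the stochastic update uses a single example:

$$w_{t+1} = w_t + \alpha \frac{\partial Q(w, x_t, y_t)}{\partial w}$$

Normal gradient descent averages over all $n$ examples:

$$w_{t+1} = w_t + \alpha \frac{1}{n} \sum_{i=1}^{n} \frac{\partial Q(w, x_i, y_i)}{\partial w}$$

The sign is $+$ when $Q$ is maximized and $-$ when it is minimized.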
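The difference between the two updates can be sketched on a toy problem. The example assumes a squared-error objective that is minimized, so the step carries a minus sign; the data, the step size, and the function names are illustrative assumptions.

```python
import numpy as np

def grad(w, x, y):
    """Gradient of the squared error 0.5 * (w.x - y)^2 with respect to w."""
    return (w @ x - y) * x

def batch_step(w, X, Y, alpha):
    """Normal gradient descent: average the gradient over all n examples."""
    g = np.mean([grad(w, x, y) for x, y in zip(X, Y)], axis=0)
    return w - alpha * g

def sgd_step(w, x, y, alpha):
    """Stochastic gradient descent: use the gradient of a single example."""
    return w - alpha * grad(w, x, y)

# Synthetic, noise-free linear data (hypothetical).
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
Y = X @ w_true

w = np.zeros(2)
for epoch in range(20):
    for x, y in zip(X, Y):
        w = sgd_step(w, x, y, alpha=0.05)
```

Each stochastic step costs one gradient evaluation instead of n, which is what makes the incremental critic and actor updates practical.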

2 Off-PAC Algorithm

Motivations

3 problems of off-policy action-value methods:
1 The target policy is deterministic.
2 Policy improvement is problematic for large action spaces.
3 Small changes in policy evaluation can lead to large changes in policy improvement.

Problem

The policy is a function parametrized by a weight vector $u$:

$$\pi_u : A \times S \to [0, 1]$$

Objective function to maximise [4]:

$$J(u) = \sum_{s \in S} \left[ \lim_{t \to \infty} \Pr(s_t = s \mid s_0, b) \right] V(s)
\quad \text{with} \quad
V(s) = \sum_{a \in A} \pi(a \mid s)\, Q(s,a)$$

Figure: Chain

Description of the Critic

The critic evaluates an approximation of the true value function $V(s)$,

$$\hat{V}(s) = v^T x_s, \qquad v_i \in \mathbb{R},\ x_{s,i} \in \mathbb{R}$$

in an incremental way using gradient descent.

Objective function to minimize:

$$\mathrm{MSPBE}(v) = \left\| \hat{V} - \Pi T \hat{V} \right\|_D^2$$

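MSPBE Objective Function

Given a vector function $f$:

$$\|f\|_D^2 = \sum_{s \in S} \left[ \lim_{t \to \infty} \Pr(s_t = s \mid s_0, b) \right] f^2(s)$$

Define $V^\pi(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s, \pi \right]$, and we get:

$$\begin{aligned}
V^\pi(s) &= \sum_a \pi(s,a) \sum_{s'} \Pr(s,a,s') \left[ R(s,a) + \gamma V^\pi(s') \right] \\
&= \sum_{s'} \Pr(s,a,s') \left[ R(s,a) + \gamma V^\pi(s') \right] \qquad \text{(with } a = \pi(s) \text{ for a deterministic policy)} \\
&= \sum_{s'} \Pr(s,a,s') R(s,a) + \sum_{s'} \Pr(s,a,s')\, \gamma V^\pi(s') \\
V^\pi &= E[R] + \gamma P V^\pi = T V^\pi
\end{aligned}$$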
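The fixed-point relation $V^\pi = E[R] + \gamma P V^\pi = T V^\pi$ can be checked numerically on a small example. The 3-state transition matrix and reward vector below are hypothetical and assume the policy has already been folded into P and R.

```python
import numpy as np

gamma = 0.9

# Hypothetical 3-state chain under a fixed policy:
# P[s, s'] is the transition probability, R[s] the expected immediate reward.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])

def bellman_operator(V):
    """T V = E[R] + gamma * P V."""
    return R + gamma * P @ V

# V^pi is the fixed point of T, i.e. the solution of (I - gamma * P) V = R.
V_pi = np.linalg.solve(np.eye(3) - gamma * P, R)
```

Applying `bellman_operator` to `V_pi` returns `V_pi` again, which is exactly the statement $V^\pi = T V^\pi$.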

MSPBE Objective Function

For a vector function $f$, we define the linear projection as $\Pi f := f_\theta = \theta^T \phi$ such that

$$\begin{aligned}
\theta &= \arg\min_\theta \left\{ \| f_\theta - f \|_D^2 \right\} \\
&= \arg\min_\theta \left\{ \sum_{s \in S} \left[ \lim_{t \to \infty} \Pr(s_t = s \mid s_0, b) \right] (f_\theta - f)^2(s) \right\}
\end{aligned}$$
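For a finite state space, the weighted least-squares problem above has a closed form, which gives a small sketch of the projection and of the MSPBE. The feature matrix `Phi` (one row $\phi(s)^T$ per state), the weights `d` standing in for the limiting state distribution, and the toy MDP are all hypothetical.

```python
import numpy as np

def projection_matrix(Phi, d):
    """Pi = Phi (Phi^T D Phi)^{-1} Phi^T D with D = diag(d)."""
    D = np.diag(d)
    return Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

def weighted_sq_norm(f, d):
    """||f||_D^2 = sum_s d(s) f(s)^2."""
    return float(np.sum(d * f ** 2))

def mspbe(v, Phi, d, P, R, gamma):
    """|| V_hat - Pi T V_hat ||_D^2 with V_hat = Phi v and T V = R + gamma P V."""
    V_hat = Phi @ v
    TV = R + gamma * P @ V_hat
    return weighted_sq_norm(V_hat - projection_matrix(Phi, d) @ TV, d)

# Hypothetical example: 3 states, 2 features per state.
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
d = np.array([0.2, 0.3, 0.5])
f = np.array([0.0, 1.0, 5.0])
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([0.0, 0.0, 1.0])

Pi = projection_matrix(Phi, d)
f_proj = Pi @ f      # closest representable function to f in the D-weighted norm
```

The residual `f - f_proj` is orthogonal, in the D-weighted inner product, to every column of `Phi`, which is the geometric picture behind the projection.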

Geometrical Interpretation

[Figure: see Ref. [2]]

Example

[Figures: iteration 0, iteration 1, iteration 2]

Eligibility Traces

$$\begin{aligned}
R_t^{(1)} &= r_{t+1} + \gamma V_t(s_{t+1}) \\
R_t^{(2)} &= r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2}) \\
R_t^{(n)} &= r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})
\end{aligned}$$

Ref. [3]

Eligibility Traces

The return $R_t$ can be rewritten as any average of n-step returns, e.g.

$$R_t^{avg} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$$

Ref. [3]

Eligibility Traces

The λ-return used with TD methods is another way to define $R_t^{avg}$:

$$R_t^{avg} := R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}, \qquad 0 \le \lambda \le 1$$

Ref. [3]
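A small sketch of the n-step return and the λ-return for a finite episode. It assumes the episode terminates after T steps with a terminal value of zero, so every n-step return with n at least T - t equals the full return and the infinite sum collapses to a finite one; the indexing convention and the toy numbers are assumptions.

```python
def n_step_return(rewards, values, t, n, gamma):
    """R_t^(n); rewards[k] holds r_{k+1}, values[k] holds V(s_k), T = len(rewards)."""
    T = len(rewards)
    end = min(t + n, T)
    ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                          # bootstrap only if s_{t+n} is not terminal
        ret += gamma ** n * values[end]
    return ret

def lambda_return(rewards, values, t, gamma, lam):
    """R_t^lambda = (1 - lam) * sum_n lam^(n-1) R_t^(n), truncated at the episode end."""
    T = len(rewards)
    total = 0.0
    for n in range(1, T - t):
        total += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # The remaining weights (n >= T - t) sum to lam^(T-t-1) and all multiply the full return.
    total += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return total

# Hypothetical 3-step episode.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5]
gamma = 0.9
```

Setting `lam=0` recovers the one-step return $R_t^{(1)}$, and `lam=1` recovers the full discounted return of the episode.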

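Off-Policy Policy-gradient Theorem

Objective function to maximise:

$$J(u) = \sum_{s \in S} \left[ \lim_{t \to \infty} \Pr(s_t = s \mid s_0, b) \right] V(s)$$

$\nabla_u J$ is too difficult to estimate. We approximate the gradient with $g(u)$:

$$\nabla_u J(u) \approx g(u) = \sum_{s \in S} \left[ \lim_{t \to \infty} \Pr(s_t = s \mid s_0, b) \right] \sum_{a \in A} \nabla_u \pi(a \mid s)\, Q(s,a)$$

(product rule + approximation: differentiating $\pi(a \mid s) Q(s,a)$ gives two terms, and the term containing $\nabla_u Q(s,a)$ is dropped)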

Description of the Actor

Uses stochastic gradient ascent, following $g(u)$:

$$g(u) = E\left[ \sum_a \nabla_u \pi(a \mid s)\, Q(s,a) \;\middle|\; s \sim \lim_{t \to \infty} \Pr(s_t = \cdot \mid s_0, b) \right]$$

We can add an arbitrary function of the state to the equations because of [4]:

$$\sum_a \nabla_u \pi(a \mid s) = 0$$

Intuition: summed over actions, $\nabla_u \pi(a \mid s)$ is orthogonal to any quantity that depends only on the state.
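The reason a state-dependent term can be added or subtracted follows in two lines. Here $B(s)$ is a hypothetical name for an arbitrary function of the state; it does not appear on the slides.

```latex
\begin{align*}
\sum_{a \in A} \nabla_u \pi(a \mid s)\, B(s)
  &= B(s)\, \nabla_u \sum_{a \in A} \pi(a \mid s)
   = B(s)\, \nabla_u 1 = 0, \\
\sum_{a \in A} \nabla_u \pi(a \mid s)\, \bigl[ Q(s,a) - B(s) \bigr]
  &= \sum_{a \in A} \nabla_u \pi(a \mid s)\, Q(s,a).
\end{align*}
```

The expectation is unchanged, so $B(s)$ can be chosen freely, for instance to reduce the variance of the sampled update.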

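Description of the Actor

Rewriting the terms:

$$g(u) = E\left[ \rho(s,a)\, \psi(s,a)\, Q(s,a) \;\middle|\; s \sim \lim_{t \to \infty} \Pr(s_t = \cdot \mid s_0, b),\ a \sim b(\cdot \mid s) \right]$$

where

$$\rho(s,a) = \frac{\pi(a \mid s)}{b(a \mid s)} \quad \text{and} \quad \psi(s,a) = \frac{\nabla_u \pi(a \mid s)}{\pi(a \mid s)}$$

We further approximate $Q(s,a)$ with the off-policy λ-return $R_t^{\lambda}$:

$$R_t^{\lambda} = r_{t+1} + (1 - \lambda)\gamma \hat{V}(s_{t+1}) + \lambda \gamma \rho(s_{t+1}, a_{t+1}) R_{t+1}^{\lambda}$$

The forward view is then converted to a backward view.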
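The rewriting with $\rho$ and $\psi$ can be checked numerically for a single state. The sketch assumes a Gibbs (softmax) target policy over linear preferences and a uniform behavior policy, matching the experimental setup of the slides; the features, the Q-values, and all names are hypothetical, and Q is held fixed when differentiating.

```python
import numpy as np

def gibbs_policy(u, phi):
    """pi(a|s) proportional to exp(u . phi[a]); phi has one feature row per action."""
    prefs = phi @ u
    prefs = prefs - prefs.max()              # subtract the max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def psi(u, phi, a):
    """grad_u pi(a|s) / pi(a|s) = phi[a] - sum_b pi(b|s) phi[b] for the Gibbs policy."""
    return phi[a] - gibbs_policy(u, phi) @ phi

rng = np.random.default_rng(0)
n_actions, n_features = 3, 4
phi = rng.normal(size=(n_actions, n_features))   # hypothetical features for one state
u = rng.normal(size=n_features)
q = rng.normal(size=n_actions)                   # hypothetical Q(s, a), held fixed
b = np.full(n_actions, 1.0 / n_actions)          # uniform behavior policy

pi = gibbs_policy(u, phi)
rho = pi / b

# Exact expectation over a ~ b of rho * psi * Q for this state.
g_offpolicy = sum(b[a] * rho[a] * psi(u, phi, a) * q[a] for a in range(n_actions))

# Reference: central finite differences of sum_a pi(a|s) Q(s,a) with respect to u.
def weighted_value(u_):
    return gibbs_policy(u_, phi) @ q

eps = 1e-6
g_fd = np.array([(weighted_value(u + eps * e) - weighted_value(u - eps * e)) / (2 * eps)
                 for e in np.eye(n_features)])
```

In practice the expectation is replaced by one sampled action per step, so a single update uses `rho[a] * psi(u, phi, a)` times an estimate of Q(s, a).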

3 Experiments

Parameters

4 methods: Q(λ), Greedy-GQ, Softmax-GQ, Off-PAC
$b(\cdot \mid s)$ = uniform distribution
$\gamma = 0.99$
Number of steps = 5000, with performance testing every 20 episodes.
Gibbs distribution for the target policy:

$$\pi(a \mid s) = \frac{\exp\{u^T \phi_{s,a}\}}{\sum_{a'} \exp\{u^T \phi_{s,a'}\}}$$

Tile coding feature extraction.
Parameter sweep.

Results

Off-PAC is successful (Ref. [1]).

4 Discussion

Discussion

Why use approximation (value function + policy) in small state/action space problems?
Is linear approximation of the value function too naive?
A uniform behavior policy may cause problems for Q(λ).
Is a stochastic policy really necessary?
What is the complexity?
Is the Bellman operator T defined with a deterministic policy, but applied to a stochastic one?

Conclusion

1 Off-Policy vs On-Policy methods
2 Actor-Critic for on- and off-policy
3 Description of the Critic (value-function approximation, eligibility traces, stochastic gradient descent, GTD(λ))
4 Description of the Actor (approximation of the gradient, forward and backward view, objective function)
5 Results
6 (Possible) problems

5 References

[1] Thomas Degris, Martha White, and Richard S. Sutton. Off-policy actor-critic. CoRR.
[2] Hamid Reza Maei. Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta.
[3] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1st edition.
[4] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS.