Reinforcement Learning

1 Reinforcement Learning: Temporal Difference Learning
Temporal difference learning, TD prediction, Q-learning, eligibility traces. (Many slides from Marc Toussaint.)
Vien Ngo, MLR, University of Stuttgart

2 Outline
- Temporal Difference Learning
- Q-learning
- Eligibility Traces

4 Learning in MDPs
Assume an unknown MDP $\{S, A, P, R, \gamma\}$ (the agent does not know $P$ or $R$).
While interacting with the world, the agent collects data of the form $D = \{(s_t, a_t, r_t, s_{t+1})\}_{t=1}^{H}$ (state, action, immediate reward, next state).
What could we learn from that?
- learn to predict the next state: $P(s' \mid s, a)$
- learn to predict the immediate reward: $P(r \mid s, a)$ (or $R(s, a)$)
- learn to predict the value: $(s, a) \mapsto Q(s, a)$
- learn to predict the action: $\pi(s, a)$

5 Model-based versus Model-free
- Edward Tolman (1886-1959), Purposive Behavior in Animals and Men: stimulus-stimulus learning, not reinforcement driven.
- Clark Hull (1884-1952), Principles of Behavior (1943): learn stimulus-response mappings based on reinforcement.
- Wolfgang Köhler (1887-1967): learn facts about the world that they could subsequently use in a flexible manner, rather than simply learning automatic responses.

8 Introduction to Model-Free Methods
- Monte-Carlo methods, on-policy MC control
- Temporal difference learning, on-policy SARSA algorithm
- Off-policy Q-learning algorithm

9 Monte-Carlo Policy Evaluation
MC policy evaluation: first-visit and every-visit methods.
Algorithm 1 MC-PE: policy evaluation of policy $\pi$
1: Given $\pi$; Returns$(s) = \emptyset$, $\forall s \in S$
2: while (!converged) do
3:   Generate an episode $\tau = (s_0, a_0, r_0, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$ using $\pi$
4:   for each state $s_t$ in $\tau$ do
5:     Compute the return of $s_t$: $R(s_t) = \sum_{l=t}^{T-1} \gamma^{l-t} r_l$
6:     Either add $R(s_t)$ into Returns$(s_t)$ only at the first occurrence of $s_t$ in $\tau$ (first-visit), or add $R(s_t)$ into Returns$(s_t)$ at every occurrence (every-visit)
7: Return $V^\pi(s) = \text{average}(\text{Returns}(s))$
The estimate converges to $V^\pi$ as the number of visits to every $s$ goes to infinity. (Introduction to RL, Sutton & Barto 1998)
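The first-visit variant of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the slides' implementation: the function name `first_visit_mc` and the episode encoding (a list of `(state, reward)` pairs) are assumptions made for the example.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns.

    episodes: list of episodes, each a list of (s_t, r_t) pairs
    generated by the policy being evaluated (assumed encoding).
    """
    returns = defaultdict(list)
    for episode in episodes:
        # Backward pass: R(s_t) = r_t + gamma * R(s_{t+1})
        g = 0.0
        returns_to_go = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            g = episode[t][1] + gamma * g
            returns_to_go[t] = g
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:  # record only the first visit
                seen.add(state)
                returns[state].append(returns_to_go[t])
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}


# Two hypothetical episodes over states "A" and "B".
episodes = [[("A", 0.0), ("B", 1.0)], [("B", 0.0), ("A", 2.0)]]
values = first_visit_mc(episodes, gamma=0.5)
```

Dropping the `seen` check turns this into the every-visit variant.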

10 Monte-Carlo Policy Evaluation
Alternatively, use incremental updates for MC-PE when obtaining a return $R(s_t)$:
$V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha_t \big( R(s_t) - V^\pi(s_t) \big)$

11 Example: MC Method for Blackjack
- States are 3-dimensional: current sum (12-21), the dealer's one showing card (ace-10), having a usable ace (1/0).
- Actions: stick, hit.
- Rewards:
  $R(\cdot, \text{stick})$ = +1.0 if the current sum is better than the dealer's; 0.0 if it is equal to the dealer's; -1.0 if it is worse than the dealer's.
  $R(\cdot, \text{hit})$ = -1.0 if the current sum > 21; 0.0 otherwise.
- If the current sum < 12, automatically hit.

12 Example: MC Method for Blackjack
Figure: approximate value functions for the policy that sticks only on 20 or 21.

13 On-Policy MC Control Algorithm
Generalized policy iteration framework.
Algorithm 2 On-policy MC Control Algorithm
1: Initialize a policy $\pi_0$.
2: while (!converged) do
3:   Policy evaluation: $\pi_k \to Q^{\pi_k}(s, a)$, $\forall s, a$
4:   Policy improvement (greedy action selection): $\pi_{k+1}(s) = \arg\max_a Q^{\pi_k}(s, a)$, $\forall s$

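15 ɛ-greedy Policy
The ɛ-greedy policy chooses actions with probability
$\pi(a \mid s) = \epsilon/|A| + 1 - \epsilon$ if $a = \arg\max_{a'} Q(s, a')$, and $\pi(a \mid s) = \epsilon/|A|$ otherwise.
Policy improvement: the improvement of a new ɛ-greedy policy $\pi_{k+1}$ that is constructed upon the value function $Q^{\pi_k}(s, a)$ of the previous policy $\pi_k$ is
$Q^{\pi_k}(s, \pi_{k+1}(s)) = \sum_a \pi_{k+1}(a \mid s)\, Q^{\pi_k}(s, a)$
$= \frac{\epsilon}{|A|} \sum_a Q^{\pi_k}(s, a) + (1 - \epsilon) \max_a Q^{\pi_k}(s, a)$
$\ge \frac{\epsilon}{|A|} \sum_a Q^{\pi_k}(s, a) + (1 - \epsilon) \sum_a \frac{\pi_k(a \mid s) - \epsilon/|A|}{1 - \epsilon}\, Q^{\pi_k}(s, a)$
$= \sum_a \pi_k(a \mid s)\, Q^{\pi_k}(s, a) = V^{\pi_k}(s)$
The inequality holds because the weights $(\pi_k(a \mid s) - \epsilon/|A|)/(1 - \epsilon)$ are non-negative and sum to one for an ɛ-greedy $\pi_k$, so the maximum is at least this weighted average.
Therefore $V^{\pi_{k+1}}(s) \ge V^{\pi_k}(s)$.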
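As a small illustration of the ɛ-greedy definition, the following Python sketch builds the action distribution for one state and samples from it. The function names and the representation of $Q(s, \cdot)$ as a plain list are assumptions for the example; ties are broken toward the lowest action index.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a|s) for every action, given Q(s, .) as a list."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])  # ties: lowest index
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample an action index from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


# Hypothetical Q-values for four actions in one state.
probs = epsilon_greedy_probs([0.1, 0.5, 0.2, 0.0], epsilon=0.2)
```

With `epsilon=0.2` and four actions, the greedy action gets probability 0.85 and every other action 0.05.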

16 MC Control Algorithm with ɛ-greedy
Choose actions using ɛ-greedy. The algorithm converges if $\epsilon_k$ decreases to zero over time, e.g. $\epsilon_k = 1/k$ (a GLIE policy).
A policy is GLIE (Greedy in the Limit with Infinite Exploration) if
- all state-action pairs are visited infinitely often;
- in the limit ($k \to \infty$), the policy becomes greedy: $\lim_{k \to \infty} \pi_k(a \mid s) = \delta_a\big(\arg\max_{a'} Q^{\pi_k}(s, a')\big)$, where $\delta$ is a Dirac function.

17 MC Control Algorithm: Blackjack

18 Temporal difference (TD) learning
- TD Prediction (TD(0))
- On-policy TD Control (SARSA Algorithm)
- Off-policy TD Control (Q-Learning)
- Eligibility Traces (TD(λ))

21 Temporal difference (TD) Prediction
Recall: $V^\pi(s) = E\big[ R(\pi(s), s) + \gamma V^\pi(s') \big]$
TD learning: given a new experience $(s, a, r, s')$,
$V_{new}(s) = (1 - \alpha)\, V_{old}(s) + \alpha \underbrace{[r + \gamma V_{old}(s')]}_{\text{TD target}}$
$= V_{old}(s) + \alpha \underbrace{[r - V_{old}(s) + \gamma V_{old}(s')]}_{\text{TD error}}$
Reinforcement:
- more reward than expected ($r > V_{old}(s) - \gamma V_{old}(s')$): increase $V(s)$
- less reward than expected ($r < V_{old}(s) - \gamma V_{old}(s')$): decrease $V(s)$
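The TD(0) rule translates almost directly into code. The sketch below assumes a tabular value function stored in a dict; the function name `td0_update` and the `terminal` flag (which treats the value of a terminal next state as zero) are additions for the example, not part of the slides.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to the value table V (a dict) in place.

    Returns the TD error r + gamma * V(s') - V(s).
    """
    # Assumption: a terminal next state contributes no future value.
    target = r if terminal else r + gamma * V.get(s_next, 0.0)
    td_error = target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error


V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
```

Here the TD error is 0.5 + 0.9 * 1.0 - 0.0 = 1.4, so `V["A"]` moves from 0.0 to 0.14.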

22 TD Prediction vs. MC Prediction
Figures: TD backup diagram; MC backup diagram.

23 TD Prediction vs. MC Prediction: Driving Home Example
Table columns: State; Elapsed Time (minutes); Predicted Time to Go; Predicted Total Time.
Table rows (states): leaving office; reach car, raining; exiting highway; secondary road, behind truck; entering home street; arrive home.

24 TD Prediction vs. MC Prediction
Figure panels: Monte-Carlo method; TD(0). (Introduction to RL, Sutton & Barto 1998)

25 TD Prediction vs. MC Prediction
TD:
- TD can learn before the termination of episodes.
- TD can be used for either non-episodic or episodic tasks.
- The update depends on a single stochastic transition, hence lower variance.
- Updates use bootstrapping, so the estimate has some bias.
- TD updates exploit the Markov property.
MC:
- MC learning must wait until the end of episodes.
- MC only works for episodic tasks.
- The update depends on a sequence of many stochastic transitions, hence much larger variance.
- The estimate is unbiased.
- MC updates do not exploit the Markov property, hence they can be effective in non-Markov environments.

26 TD vs. MC: A Random Walk Example

27 On-Policy TD Control: SARSA

28 On-Policy TD Control: SARSA
Figure: learning on the tuple $(s, a, r, s', a')$: SARSA.
Q-value updates: $Q_{t+1}(s, a) = Q_t(s, a) + \alpha \big( r_t + \gamma Q_t(s', a') - Q_t(s, a) \big)$

29 On-Policy TD Control: SARSA
Algorithm 3 SARSA Algorithm
1: Initialize $Q(s, a)$, $\forall s \in S, a \in A$
2: while (!converged) do
3:   Initialize a starting state $s$
4:   Select an action $a$ from $s$ using a policy derived from $Q$ (e.g. ɛ-greedy)
5:   for each step of the episode do
6:     Execute $a$, observe $r$, $s'$
7:     Select an action $a'$ from $s'$ using a policy derived from $Q$ (e.g. ɛ-greedy)
8:     Update: $Q_{t+1}(s, a) = Q_t(s, a) + \alpha_t \big( r + \gamma Q_t(s', a') - Q_t(s, a) \big)$
9:     $s \leftarrow s'$; $a \leftarrow a'$
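To make Algorithm 3 concrete, here is a minimal tabular SARSA sketch in Python. The environment is a hypothetical five-cell corridor invented for this example (states 0..4, goal at 4, reward -1 per move), and the names `step`, `epsilon_greedy`, and `sarsa` are illustrative, not from the slides. A constant learning rate and a fixed ɛ are used for brevity, so the convergence conditions on $\alpha_t$ and GLIE are not met exactly.

```python
import random
from collections import defaultdict


def step(state, action):
    """Hypothetical corridor: action 1 moves right, action 0 moves left."""
    next_state = max(0, min(4, state + (1 if action == 1 else -1)))
    return next_state, -1.0, next_state == 4


def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(2)
    return max((0, 1), key=lambda a: Q[(state, a)])


def sarsa(episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q(s, a) initialised to 0
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            s_next, r, done = step(s, a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # Q at the terminal state is never updated, so it stays 0.
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q


Q = sarsa()
greedy_policy = [max((0, 1), key=lambda a: Q[(s, a)]) for s in range(4)]
```

The update uses the action `a_next` that the agent will actually execute, which is what makes SARSA on-policy.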

30 SARSA's Q-values converge w.p.1 to the optimal values as long as
- the learning rates satisfy $\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t^2(s, a) < \infty$;
- the policies $\pi_t(s, a)$ derived from $Q_t(s, a)$ are GLIE policies.

31 SARSA on Windy Gridworld Example
- The reward is -1.0 at every step until the goal state (cell G) is reached.
- Each move is shifted upward along the wind direction; the wind strength, in number of cells shifted upward, is given below each column.

32 The y-axis shows the accumulated number of times the goal was reached.

33 Off-Policy TD Control: Q-Learning

34 Off-Policy TD Control: Q-Learning
Off-policy MC? Use importance sampling to estimate the following expectation of returns:
$E_{\tau \sim P(\tau)}[\rho(\tau)] = \int \rho(\tau) P(\tau)\, d\tau = \int \rho(\tau) \frac{P(\tau)}{P'(\tau)} P'(\tau)\, d\tau = E_{\tau \sim P'(\tau)}\Big[ \rho(\tau) \frac{P(\tau)}{P'(\tau)} \Big] \approx \frac{1}{N} \sum_{i=1}^{N} \rho(\tau_i) \frac{P(\tau_i)}{P'(\tau_i)}$
Denote the trajectory distribution under a policy $\pi$ by $P^\pi(\tau)$, where $\tau = \{s_0, a_0, s_1, a_1, \ldots\}$:
$P^\pi(\tau) = P_0(s_0) \prod_t P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$

35 Assume that in MC control the control policy used to generate data is $\mu(a \mid s)$ (i.e. $P'(\tau)$). The target policy is $\pi(a \mid s)$ (i.e. $P(\tau)$).
Set the importance weights as: $w_t = \frac{P(\tau_t)}{P'(\tau_t)} = \prod_{i=t}^{T} \frac{\pi(a_i \mid s_i)}{\mu(a_i \mid s_i)}$
The MC value update becomes (when observing a return $\rho_t$): $V(s_t) \leftarrow V(s_t) + \alpha \big( w_t \rho_t - V(s_t) \big)$

36 Off-Policy TD Control: Q-Learning
Off-policy TD? The term $\rho = r_t + \gamma V(s_{t+1})$ is estimated by importance sampling.
The TD value update becomes (given a transition $(s_t, a_t, r_t, s_{t+1})$):
$V(s_t) \leftarrow V(s_t) + \alpha \Big( \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)} \big( r_t + \gamma V(s_{t+1}) \big) - V(s_t) \Big)$
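The importance weight and the importance-weighted TD update can be sketched as follows. The policies are passed as plain functions `pi(a, s)` and `mu(a, s)` returning action probabilities; this interface, the function names, and the two example policies are assumptions made for the illustration.

```python
def importance_weight(trajectory, pi, mu):
    """Product of pi(a|s) / mu(a|s) over the (state, action) pairs given."""
    w = 1.0
    for s, a in trajectory:
        w *= pi(a, s) / mu(a, s)
    return w


def off_policy_td_update(V, s, a, r, s_next, pi, mu, alpha=0.1, gamma=0.9):
    """One importance-weighted TD update of the value table V, in place."""
    rho = pi(a, s) / mu(a, s)
    V[s] += alpha * (rho * (r + gamma * V[s_next]) - V[s])


# Hypothetical policies: the target prefers action 1, the behavior is uniform.
def pi(a, s):
    return 0.9 if a == 1 else 0.1


def mu(a, s):
    return 0.5


w = importance_weight([("s0", 1), ("s1", 1), ("s2", 0)], pi, mu)
V = {"s0": 0.0, "s1": 1.0}
off_policy_td_update(V, "s0", 1, 1.0, "s1", pi, mu, alpha=0.1, gamma=0.9)
```

The trajectory weight is 1.8 * 1.8 * 0.2 = 0.648: actions the target policy favors are weighted up, the disfavored one down.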

37 Off-Policy TD Control: Q-Learning
- The target policy is greedy: $\pi(s_t) = \arg\max_a Q(s_t, a)$
- The control policy is $\mu$, e.g. ɛ-greedy w.r.t. $Q(s, a)$

38 Off-Policy TD Control: Q-Learning
Q-learning (Watkins, 1988): given a new experience $(s, a, r, s')$,
$Q_{new}(s, a) = (1 - \alpha)\, Q_{old}(s, a) + \alpha \big[ r + \gamma \max_{a'} Q_{old}(s', a') \big]$
$= Q_{old}(s, a) + \alpha \big[ r - Q_{old}(s, a) + \gamma \max_{a'} Q_{old}(s', a') \big]$
Reinforcement:
- more reward than expected ($r > Q_{old}(s, a) - \gamma \max_{a'} Q_{old}(s', a')$): increase $Q(s, a)$
- less reward than expected ($r < Q_{old}(s, a) - \gamma \max_{a'} Q_{old}(s', a')$): decrease $Q(s, a)$
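A single Q-learning update can be written as a short function. This is a sketch under the assumption of a tabular $Q$ stored in a dict keyed by `(state, action)`; the function name and the tiny example table are hypothetical.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the TD error.

    The target bootstraps from the greedy value max_a' Q(s', a'),
    regardless of which action the behavior policy takes next.
    """
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


# Hypothetical table with two states and two actions.
Q = {("s", "l"): 0.0, ("s", "r"): 0.0, ("s2", "l"): 1.0, ("s2", "r"): 2.0}
delta = q_learning_update(Q, "s", "r", 1.0, "s2", ("l", "r"), alpha=0.5, gamma=0.9)
```

The target is 1.0 + 0.9 * 2.0 = 2.8, so `Q[("s", "r")]` moves halfway there, to 1.4. Compared with SARSA, only the bootstrap term differs: a maximum over next actions instead of the action actually selected.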

39 Q-Learning (Introduction to RL, Sutton & Barto 1998)

40 Q-learning convergence with prob 1
Q-learning is a stochastic approximation of Q-Iteration:
- Q-learning: $Q_{new}(s, a) = (1 - \alpha)\, Q_{old}(s, a) + \alpha \big[ r + \gamma \max_{a'} Q_{old}(s', a') \big]$
- Q-Iteration: $\forall s, a: Q_{k+1}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q_k(s', a')$
We've shown convergence of Q-VI to $Q^*$.
Convergence of Q-learning:
- Q-Iteration is a deterministic update: $Q_{k+1} = T(Q_k)$
- Q-learning is a stochastic version: $Q_{k+1} = (1 - \alpha)\, Q_k + \alpha \big[ T(Q_k) + \eta_k \big]$
- $\eta_k$ is zero mean!

41 Q-learning convergence with prob 1
The Q-learning algorithm converges w.p.1 as long as the learning rates satisfy
$\sum_t \alpha_t(s, a) = \infty$ and $\sum_t \alpha_t^2(s, a) < \infty$
(Watkins and Dayan, Q-learning. Machine Learning 1992)
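As a quick worked check of the two learning-rate conditions (an illustration added here, not taken from the slides), consider the schedule $\alpha_t = 1/t$ and, for contrast, a constant rate $\alpha_t = c > 0$:

```latex
\begin{align*}
\alpha_t = \tfrac{1}{t}: \quad
  & \sum_{t=1}^{\infty} \frac{1}{t} = \infty \ \text{(harmonic series)}, \qquad
    \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty
    && \Rightarrow \text{both conditions hold} \\
\alpha_t = c: \quad
  & \sum_{t=1}^{\infty} c = \infty, \qquad
    \sum_{t=1}^{\infty} c^2 = \infty
    && \Rightarrow \text{the second condition fails}
\end{align*}
```

The first condition lets the updates overcome any initial error; the second makes the noise contributions shrink fast enough to average out.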

42 Q-learning vs. SARSA: The Cliff Example

43 Q-Learning impact
- Q-learning was the first provably convergent direct adaptive optimal control algorithm.
- Great impact on the field of reinforcement learning:
  - smaller representation than models;
  - automatically focuses attention where it is needed, i.e., no sweeps through the state space;
  - though it does not solve the exploration versus exploitation issue (ɛ-greedy, optimistic initialization, etc.).

44 Unified View

46 Eligibility traces
Temporal difference: based on a single experience $(s_0, r_0, s_1)$:
$V_{new}(s_0) = V_{old}(s_0) + \alpha \big[ r_0 + \gamma V_{old}(s_1) - V_{old}(s_0) \big]$
Longer experience sequence, e.g. $(s_0, r_0, r_1, r_2, s_3)$: temporal credit assignment, think further backwards. Receiving the later rewards $r_1, r_2$ also tells us something about $V(s_0)$:
$V_{new}(s_0) = V_{old}(s_0) + \alpha \big[ r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V_{old}(s_3) - V_{old}(s_0) \big]$

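47 Eligibility trace
The n-step TD update:
$V_t^n(s) \leftarrow V_t^n(s) + \alpha \big[ R_t^n - V_t^n(s) \big]$
where $R_t^n$ is the n-step return:
$R_t^n = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V_t(s_{t+n})$
The offline value update up to time $T$:
$V_t(s) \leftarrow V_t(s) + \alpha \sum_{t=0}^{T-1} \Delta V_t(s)$, where $\Delta V_t(s)$ denotes the n-step TD error at time $t$.
Error reduction:
$\| V_t^n - V^\pi \|_\infty \le \gamma^n \| V_t - V^\pi \|_\infty$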
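The n-step return can be computed directly from a recorded episode. In this sketch `rewards[k]` holds $r_k$ and `values[k]` holds the current estimate $V(s_k)$; the function name, the list-based encoding, and the truncation at the end of the episode are assumptions for the example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """R_t^n = r_t + ... + gamma^(n-1) r_{t+n-1} + gamma^n V(s_{t+n}).

    values must have one more entry than rewards (the final state).
    If t + n runs past the episode end, the return is truncated there.
    """
    end = min(t + n, len(rewards))
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return g + gamma ** (end - t) * values[end]


rewards = [1.0, 0.0, 2.0]
values = [0.0, 0.5, 1.0, 0.0]  # assumed V(s_3) = 0 at the terminal state
r1 = n_step_return(rewards, values, t=0, n=1, gamma=0.5)
r3 = n_step_return(rewards, values, t=0, n=3, gamma=0.5)
```

With `n=1` this is the TD(0) target (1.0 + 0.5 * 0.5 = 1.25); with `n=3` the episode ends, so it equals the Monte-Carlo return (1.0 + 0.25 * 2.0 = 1.5).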

48 TD(λ): Forward View
TD(λ) averages n-step backups with different $n$.
Look into the future, do MC evaluation, then take a weighted average:
$R_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^n$
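For an episodic task the infinite sum in the λ-return is finite in practice: every n-step return that reaches the end of the episode equals the full return, so those weights can be pooled into a single term $\lambda^{T-t-1}$. The sketch below implements this under the same list-based encoding as an n-step return helper; the names `n_step_return` and `lambda_return` and the example episode are assumptions.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return, truncated at the end of the episode."""
    end = min(t + n, len(rewards))
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    return g + gamma ** (end - t) * values[end]


def lambda_return(rewards, values, t, gamma, lam):
    """Episodic lambda-return R_t^lambda for a recorded episode."""
    horizon = len(rewards) - t  # smallest n that reaches the episode end
    total = 0.0
    for n in range(1, horizon):
        weight = (1 - lam) * lam ** (n - 1)
        total += weight * n_step_return(rewards, values, t, n, gamma)
    # All n >= horizon share the full return; their weights sum to lam^(horizon-1).
    total += lam ** (horizon - 1) * n_step_return(rewards, values, t, horizon, gamma)
    return total


rewards = [1.0, 0.0, 2.0]
values = [0.0, 0.5, 1.0, 0.0]  # assumed V(s_3) = 0 at the terminal state
g_half = lambda_return(rewards, values, t=0, gamma=0.5, lam=0.5)
```

Setting `lam=0` recovers the one-step TD target and `lam=1` recovers the Monte-Carlo return, which is the sense in which TD(λ) interpolates between the two.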

50 TD(λ): Forward View
19-State Random Walk Task.

51 TD(λ): Backward View
At each step, the eligibility traces for all states are updated:
$e_t(s) = \gamma \lambda e_{t-1}(s)$ if $s \ne s_t$, and $e_t(s) = \gamma \lambda e_{t-1}(s) + 1$ if $s = s_t$.

52 TD(λ): Backward View
TD(λ): remember where you've been recently ("eligibility trace") and update those values as well:
$e(s_t) \leftarrow e(s_t) + 1$
$\forall s: V_{new}(s) = V_{old}(s) + \alpha\, e(s) \big[ r_t + \gamma V_{old}(s_{t+1}) - V_{old}(s_t) \big]$
$\forall s: e(s) \leftarrow \gamma \lambda e(s)$
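The three backward-view steps map onto a short loop. The following Python sketch runs them over one recorded episode with accumulating traces; the function name `td_lambda_episode`, the `(s, r, s_next)` transition encoding, and the two-step example are assumptions for illustration.

```python
from collections import defaultdict


def td_lambda_episode(transitions, V, alpha, gamma, lam):
    """Run backward-view TD(lambda) over one episode, updating V in place.

    transitions: list of (s_t, r_t, s_next); V: defaultdict(float) of values.
    """
    e = defaultdict(float)  # eligibility traces, reset for each episode
    for s, r, s_next in transitions:
        delta = r + gamma * V[s_next] - V[s]  # TD error at the current state
        e[s] += 1.0  # accumulating trace for the visited state
        for state in list(e):
            V[state] += alpha * e[state] * delta  # credit recently visited states
            e[state] *= gamma * lam  # decay all traces
    return V


V = defaultdict(float)
td_lambda_episode([("A", 0.0, "B"), ("B", 1.0, "T")], V, alpha=0.5, gamma=1.0, lam=1.0)
```

In this example the reward received when leaving "B" is also credited to "A" through its trace, so both values become 0.5; with `lam=0.0` only "B" would change.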

54 TD(λ): Backward View

55 TD(λ): Backward vs. Forward
The two views provide equivalent offline updates (see the proof in Section 7.4 of Introduction to RL, Sutton & Barto).

56 SARSA(λ)

57 SARSA(λ): Example

58 Q(λ)
The n-step return: $r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a Q_t(s_{t+n}, a)$
Q(λ) algorithm by Watkins.
