Chapter 6: Temporal Difference Learning


1 Chapter 6: Temporal Difference Learning
Objectives of this chapter:
- Introduce Temporal Difference (TD) learning
- Focus first on policy evaluation, or prediction, methods
- Compare the efficiency of TD learning with MC learning
- Then extend to control methods

2 cf. Dynamic Programming
V(S_t) ← E_π[R_{t+1} + γ V(S_{t+1})] = Σ_a π(a|S_t) Σ_{s',r} p(s',r|S_t,a) [r + γ V(s')]
(Backup diagram: from S_t, branch over actions a, then over rewards r and next states s'.)
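As a concrete contrast with the sampled updates on the next slides, here is a minimal sketch of this full-width expected backup in Python. The MDP representation is an assumption for illustration: pi[s] maps actions to probabilities, and p[(s, a)] is a list of (prob, reward, next_state) transitions; none of these names come from the slides.

```python
# A minimal sketch of the DP expected update, under hypothetical tables
# `pi` (policy probabilities) and `p` (transition lists).
def dp_update(V, s, pi, p, gamma=1.0):
    """One expected (full-width) backup of V(s) under policy pi."""
    total = 0.0
    for a, pi_sa in pi[s].items():
        for prob, r, s_next in p[(s, a)]:
            # Weight each branch by policy and transition probabilities.
            total += pi_sa * prob * (r + gamma * V[s_next])
    return total
```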

3 Simple Monte Carlo
V(S_t) ← V(S_t) + α [G_t − V(S_t)]
where G_t is the actual return following state S_t.
(Backup diagram: the entire remainder of the episode starting from S_t.)
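A minimal sketch of this constant-α every-visit MC update in Python. The episode format is an assumption for illustration: a list of (state, reward) pairs, where the reward is the R_{t+1} received on leaving that state.

```python
# A minimal sketch of constant-alpha every-visit Monte Carlo prediction.
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    G = 0.0
    # Work backwards so G accumulates the return following each state.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])  # shift V toward the actual return
```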

4 Simplest TD Method
V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]
(Backup diagram: one sampled step, from S_t to S_{t+1} via reward R_{t+1}.)
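Putting this update inside an episode loop gives a complete prediction method. The sketch below is not code from the book; it assumes a hypothetical tabular `env` with reset() and step(action) returning (reward, next_state, done), and a fixed `policy` function.

```python
# A minimal TD(0) policy-evaluation sketch under the assumed env interface.
def td0_prediction(env, policy, V, num_episodes=100, alpha=0.1, gamma=1.0):
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            r, s_next, done = env.step(policy(s))
            # Terminal states have value 0 by definition.
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])  # update toward the bootstrapped target
            s = s_next
    return V
```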

5 TD methods bootstrap and sample
Bootstrapping: the update involves an estimate
- MC does not bootstrap
- DP bootstraps
- TD bootstraps
Sampling: the update does not involve an expected value
- MC samples
- DP does not sample
- TD samples

6 TD Prediction
Policy evaluation (the prediction problem): for a given policy π, compute the state-value function v_π.
Recall the simple every-visit Monte Carlo method:
V(S_t) ← V(S_t) + α [G_t − V(S_t)]
target: the actual return after time t
The simplest temporal-difference method, TD(0):
V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]
target: an estimate of the return

7 Example: Driving Home
                                Elapsed Time   Predicted    Predicted
State                           (minutes)      Time to Go   Total Time
leaving office, friday at 6          0             30           30
reach car, raining                   5             35           40
exiting highway                     20             15           35
2ndary road, behind truck           30             10           40
entering home street                40              3           43
arrive home                         43              0           43

8 Driving Home
(Figure: two plots of predicted total travel time over the states of the drive — changes recommended by Monte Carlo methods (α = 1), and changes recommended by TD methods (α = 1).)

9 Advantages of TD Learning
- TD methods do not require a model of the environment, only experience
- TD, but not MC, methods can be fully incremental
  - You can learn before knowing the final outcome: less memory, less peak computation
  - You can learn without the final outcome, from incomplete sequences
- Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

10 Random Walk Example
Values learned by TD after various numbers of episodes, using
V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]

11 TD and MC on the Random Walk
Data averaged over 100 sequences of episodes.

12 Batch Updating in TD and MC methods
Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
Compute updates according to TD or MC, but only update estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD converges for sufficiently small α.
Constant-α MC also converges under these conditions, but to a different answer!
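One way to realize batch TD(0) in code: accumulate the increments implied by the whole data set, apply them only after each full pass, and repeat until the value function stops changing. This sketch assumes episodes stored as (state, reward, next_state) transitions with None marking the terminal transition, and V initialized for every state; all names are illustrative.

```python
# A minimal batch-TD(0) sketch; alpha should be small for convergence.
def batch_td0(episodes, V, alpha=0.01, gamma=1.0, tol=1e-6):
    while True:
        increments = {s: 0.0 for s in V}
        for episode in episodes:
            for s, r, s_next in episode:
                target = r + (gamma * V[s_next] if s_next is not None else 0.0)
                increments[s] += alpha * (target - V[s])
        # Apply all increments at once, after the complete pass.
        for s, inc in increments.items():
            V[s] += inc
        if max(abs(inc) for inc in increments.values()) < tol:
            return V
```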

13 Random Walk under Batch Updating
After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All repeated 100 times.

14 You are the Predictor
Suppose you observe the following 8 episodes:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
V(B)? 0.75
V(A)? 0?
Assume Markov states, no discounting (γ = 1).

15 You are the Predictor
V(A)? 0.75

16 You are the Predictor
The prediction that best matches the training data is V(A) = 0.
- This minimizes the mean-square error on the training set.
- This is what a batch Monte Carlo method gets.
If we consider the sequentiality of the problem, then we would set V(A) = 0.75.
- This is correct for the maximum-likelihood estimate of a Markov model generating the data, i.e., if we fit the best Markov model, assume it is exactly correct, and then compute what it predicts (how?).
- This is called the certainty-equivalence estimate.
- This is what TD gets.
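The two answers can be checked directly. The snippet below is a sketch (not from the slides) that computes the batch-MC answer for V(A) and the certainty-equivalence answer from the best-fit Markov model of the 8 episodes: from A we always go to B with reward 0, and from B the episode ends with reward 1 six times out of eight.

```python
# The 8 observed episodes as lists of (state, reward) pairs.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch MC answer for A: average the returns of episodes that visit A.
returns_A = [sum(r for _, r in ep) for ep in episodes if any(s == "A" for s, _ in ep)]
print(sum(returns_A) / len(returns_A))  # 0.0

# Certainty-equivalence answer: V(B) from the fitted model, then V(A) = 0 + V(B).
V_B = sum(r for ep in episodes for s, r in ep if s == "B") / 8
print(V_B, 0 + V_B)  # 0.75 0.75
```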

17 Summary so far
- Introduced one-step tabular model-free TD methods
- These methods bootstrap and sample, combining aspects of DP and MC methods
- TD methods are computationally congenial
- If the world is truly Markov, then TD methods will learn faster than MC methods
- MC methods have lower error on past data, but higher error on future data

18 Learning An Action-Value Function
Estimate q_π for the current policy π.
(Trajectory: S_t, A_t → R_{t+1}, S_{t+1}, A_{t+1} → R_{t+2}, S_{t+2}, A_{t+2} → R_{t+3}, S_{t+3}, A_{t+3}, ...)
After every transition from a nonterminal state S_t, do this:
Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]
If S_{t+1} is terminal, then define Q(S_{t+1}, A_{t+1}) = 0.

19 Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:

Initialize Q(s, a), for all s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Choose A from S using policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of episode):
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., ε-greedy)
        Q(S, A) ← Q(S, A) + α [R + γ Q(S', A') − Q(S, A)]
        S ← S'; A ← A'
    until S is terminal
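A minimal Python sketch of the boxed algorithm, assuming the same hypothetical tabular `env` as the earlier sketches (step(a) returns (reward, next_state, done)) and a finite list `actions`; Q is a dict defaulting to zero.

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    """Pick a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def sarsa(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    Q = {}
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            r, s_next, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, eps)
            # Target bootstraps on the action actually taken next (on-policy).
            target = r if done else r + gamma * Q.get((s_next, a_next), 0.0)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
            s, a = s_next, a_next
    return Q
```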

20 Windy Gridworld
(Figure: gridworld with an upward wind of varying strength in the middle columns.)
Undiscounted, episodic, reward = −1 until goal.

21 Results of Sarsa on the Windy Gridworld

22 Q-Learning: Off-Policy TD Control
One of the most important breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning (Watkins, 1989). Its simplest form, one-step Q-learning, is defined by
Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]
In this case, the learned action-value function Q directly approximates q*, the optimal action-value function, independent of the policy being followed. This dramatically simplifies the analysis of the algorithm and enabled early convergence proofs. The policy still has an effect in that it determines which state–action pairs are visited and updated. However, all that is required for correct convergence is that all pairs continue to be updated. Under this assumption and a variant of the usual stochastic approximation conditions on the sequence of step-size parameters, Q has been shown to converge with probability 1 to q*. In the backup diagram for Q-learning, the root of the backup is a state–action node, with a maximization over the actions available in the next state.

Initialize Q(s, a), for all s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q (e.g., ε-greedy)
        Take action A, observe R, S'
        Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S', a) − Q(S, A)]
        S ← S'
    until S is terminal
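The same loop with the max-target gives one-step Q-learning in code. This sketch reuses the `epsilon_greedy` helper and the hypothetical env/actions interface from the Sarsa sketch above.

```python
# A minimal one-step Q-learning sketch under the assumed interface.
def q_learning(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    Q = {}
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, eps)  # behavior policy: ε-greedy
            r, s_next, done = env.step(a)
            # Target uses the greedy (max) action, not the action actually taken:
            best_next = 0.0 if done else max(Q.get((s_next, b), 0.0) for b in actions)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
            s = s_next
    return Q
```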

23 Cliff Walking
(Figure: gridworld with a cliff between start S and goal G, reward −1 per step and −100 for stepping into the cliff; Sarsa learns the safer path away from the cliff, Q-learning the optimal path along its edge. Lower plot: reward per episode under ε-greedy action selection, ε = 0.1.)

24 Expected Sarsa
Instead of the sample value of the next state, use the expectation!
Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ E[Q(S_{t+1}, A_{t+1}) | S_{t+1}] − Q(S_t, A_t)]
            = Q(S_t, A_t) + α [R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q(S_t, A_t)]
(Backup diagrams: Q-learning vs. Expected Sarsa.)
Expected Sarsa performs better than Sarsa (but costs more).
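In code, only the target changes relative to Sarsa: average Q over the policy's action probabilities instead of sampling A_{t+1}. A sketch for an ε-greedy target policy, under the same assumed interface as above:

```python
# A minimal Expected Sarsa target: expectation of Q over ε-greedy probabilities.
def expected_sarsa_target(Q, s_next, actions, r, gamma=1.0, eps=0.1):
    greedy = max(actions, key=lambda a: Q.get((s_next, a), 0.0))
    expectation = 0.0
    for a in actions:
        # ε-greedy probabilities: ε/|A| for every action, plus 1−ε on the greedy one.
        prob = eps / len(actions) + (1.0 - eps if a == greedy else 0.0)
        expectation += prob * Q.get((s_next, a), 0.0)
    return r + gamma * expectation
```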

25 Performance on the Cliff-Walking Task (van Seijen, van Hasselt, Whiteson, & Wiering, 2009)
(Figure: reward per episode as a function of α for Sarsa, Q-learning, and Expected Sarsa — interim performance after n = 100 episodes and asymptotic performance after n = 1E5 episodes.)

26 Off-policy Expected Sarsa
Expected Sarsa generalizes to arbitrary behavior policies μ:
Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ E[Q(S_{t+1}, A_{t+1}) | S_{t+1}] − Q(S_t, A_t)]
            = Q(S_t, A_t) + α [R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q(S_t, A_t)]
Nothing changes in the update itself, and it includes Q-learning as the special case in which π is the greedy policy.
This idea seems to be new.

27 Maximization Bias Example
(Figure: a small MDP — from start state A, action "right" ends the episode with reward 0, while action "left" leads to state B, from which every action ends the episode with reward drawn from N(−0.1, 1). Plot: % of episodes taking the wrong "left" action; Q-learning stays far above the 5% floor of an ε-greedy optimal policy for many episodes, Double Q-learning does not.)
Tabular Q-learning:
Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]
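The root cause is that a maximum over noisy estimates is itself biased upward. The short numpy experiment below is an illustration (not from the slides) that E[max_a Q̂(a)] lands well above the true maximum when every action's reward is N(−0.1, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_samples, n_trials = 10, 5, 10_000
max_of_estimates = []
for _ in range(n_trials):
    # Estimate each action's value from a few N(-0.1, 1) reward samples.
    estimates = rng.normal(-0.1, 1.0, size=(n_actions, n_samples)).mean(axis=1)
    max_of_estimates.append(estimates.max())
print(np.mean(max_of_estimates))  # well above the true max of -0.1
```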

28 Double Q-Learning (Hado van Hasselt, 2010)
- Train two action-value functions, Q_1 and Q_2.
- Do Q-learning on both, but never on the same time steps (Q_1 and Q_2 are independent).
- Pick Q_1 or Q_2 at random to be updated on each step.
- If updating Q_1, use Q_2 for the value of the next state:
Q_1(S_t, A_t) ← Q_1(S_t, A_t) + α [R_{t+1} + γ Q_2(S_{t+1}, argmax_a Q_1(S_{t+1}, a)) − Q_1(S_t, A_t)]
- Action selections are (say) ε-greedy with respect to the sum of Q_1 and Q_2.

29 Double Q-Learning (Hado van Hasselt, 2010)

Initialize Q_1(s, a) and Q_2(s, a), for all s ∈ S, a ∈ A(s), arbitrarily
Initialize Q_1(terminal-state, ·) = Q_2(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q_1 and Q_2 (e.g., ε-greedy in Q_1 + Q_2)
        Take action A, observe R, S'
        With 0.5 probability:
            Q_1(S, A) ← Q_1(S, A) + α [R + γ Q_2(S', argmax_a Q_1(S', a)) − Q_1(S, A)]
        else:
            Q_2(S, A) ← Q_2(S, A) + α [R + γ Q_1(S', argmax_a Q_2(S', a)) − Q_2(S, A)]
        S ← S'
    until S is terminal
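A Python sketch of the boxed algorithm, under the same hypothetical env/actions interface used in the earlier sketches; behavior is ε-greedy in Q_1 + Q_2.

```python
import random

# A minimal Double Q-learning sketch.
def double_q_learning(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    Q1, Q2 = {}, {}
    q_sum = lambda s, a: Q1.get((s, a), 0.0) + Q2.get((s, a), 0.0)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: q_sum(s, b))
            r, s_next, done = env.step(a)
            # Flip a coin to decide which estimate to update; the other one
            # evaluates the argmax action, decoupling selection from evaluation.
            Qa, Qb = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            if done:
                target = r
            else:
                a_star = max(actions, key=lambda b: Qa.get((s_next, b), 0.0))
                target = r + gamma * Qb.get((s_next, a_star), 0.0)
            Qa[(s, a)] = Qa.get((s, a), 0.0) + alpha * (target - Qa.get((s, a), 0.0))
            s = s_next
    return Q1, Q2
```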

30 Example of Maximization Bias
(Figure: the same A–B MDP with N(−0.1, 1) rewards; Double Q-learning drops to the 5% ε-greedy floor of wrong actions almost immediately, while Q-learning remains biased toward "left" far longer.)
Double Q-learning:
Q_1(S_t, A_t) ← Q_1(S_t, A_t) + α [R_{t+1} + γ Q_2(S_{t+1}, argmax_a Q_1(S_{t+1}, a)) − Q_1(S_t, A_t)]

31 Afterstates
Usually, a state-value function evaluates states in which the agent can take an action.
But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.
Why is this useful? What is this in general?

32 Summary
- Introduced one-step tabular model-free TD methods
- These methods bootstrap and sample, combining aspects of DP and MC methods
- TD methods are computationally congenial
- If the world is truly Markov, then TD methods will learn faster than MC methods
- MC methods have lower error on past data, but higher error on future data
- Extend prediction to control by employing some form of GPI
  - On-policy control: Sarsa, Expected Sarsa
  - Off-policy control: Q-learning, Expected Sarsa
- Avoiding maximization bias with Double Q-learning

33 Unified View
(Figure: the space of methods spanned by two dimensions — the width of the backup, from sample backups to full expected backups, and the height (depth) of the backup, from one-step bootstrapping to full returns. Temporal-difference learning, dynamic programming, Monte Carlo, and exhaustive search sit at the four corners.)
