Introduction to Reinforcement Learning. Part 5: Temporal-Difference Learning

Size: px

Start display at page:

Download "Introduction to Reinforcement Learning. Part 5: Temporal-Difference Learning"

Emma Dorsey
5 years ago
Views:

1 Introduction to Reinforcement Learning Part 5: emporal-difference Learning

2 What everybody should know about emporal-difference (D) learning Used to learn value functions without human input Learns a guess from a guess Applied by Samuel to play Checkers (1959) and by esauro to beat humans at Backgammon (1992-5) and Jeopardy! (2011) Explains (accurately models) the brain reward systems of primates, rats, bees, and many other animals (Schultz, Dayan & Montague 1997) Arguably solves Bellman s curse of dimensionality

3 Simple Monte Carlo V(S t ) V(S t ) + α [ G t V(S t )] St

4 Simplest D Method V(S t ) V(S t ) + α [ R t+1 + γ V(S t+1 ) V(S t )] S t S t+1 R t+1

5 D Prediction Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function vπ Recall: Simple every-visit Monte Carlo method: V(S t ) V(S t ) + α [ G t V(S t )] target: the actual return after time t he simplest temporal-difference method, D(0): V(S t ) V(S t ) + α [ R t+1 + γ V(S t+1 ) V(S t )] target: an estimate of the return

6 Example: Driving Home State Elapsed ime Predicted Predicted (minutes) ime to Go otal ime leaving office reach car, raining exit highway behind truck home street arrive home (5) (15) (10) (10) (3) bad good bad bad ok

7 Driving Home Changes recommended by Monte Carlo methods (α=1) Changes recommended by D methods (α=1)

8 Advantages of D Learning D methods do not require a model of the environment, only experience D, but not MC, methods can be fully incremental You can learn before knowing the final outcome Less memory Less peak computation You can learn without the final outcome From incomplete sequences Both MC and D converge (under certain assumptions to be detailed later), but which is faster?

9 Random Walk Example Values learned by D(0) after various numbers of episodes

10 D and MC on the Random Walk Data averaged over 100 sequences of episodes

11 Batch Updating in D and MC methods Batch Updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence. Compute updates according to D(0), but only update estimates after each complete pass through the data. For any finite Markov prediction task, under batch updating, D(0) converges for sufficiently small α. Constant-α MC also converges under these conditions, but to a difference answer!

12 Random Walk under Batch Updating After each new episode, all previous episodes were treated as a batch, and algorithm was trained until convergence. All repeated 100 times.

13 You are the Predictor Suppose you observe the following 8 episodes: A, 0, B, 0 B, 1 B, 1 B, 1 B, 1 B, 1 B, 1 B, 0 V(B)? 0.75 V(A)? 0?

14 You are the Predictor he prediction that best matches the training data is V(A)=0 his minimizes the mean-square-error on the training set his is what a batch Monte Carlo method gets If we consider the sequentiality of the problem, then we would set V(A)=.75 his is correct for the maximum likelihood estimate of a Markov model generating the data i.e, if we do a best fit Markov model, and assume it is exactly correct, and then compute what it predicts (how?) his is called the certainty-equivalence estimate his is what D(0) gets

15 Summary (so far) Introduced one-step tabular model-free D methods hese methods bootstrap and sample, combining aspects of DP and MC methods D prediction If the world is truly Markov, then D methods will learn faster than MC methods MC methods have lower error on past data, but higher error on future data Extend prediction to control by employing some form of GPI On-policy control: Sarsa Off-policy control: Q-learning

16 Learning An Action-Value Function Estimate qπ for the current policy π R t +2 R t R t +1 S t S t +1 S t, A t S, t+1 A t+1 S t +2 S t +3 S, t+2 A t+2 S, t+3 A t+3 After every transition from a nonterminal state, S t, do this: Q(S t, A t ) Q(S t, A t ) + α [ R t+1 + γ Q(S t+1, A t+1 ) Q(S t, A t )] If S t+1 is terminal, then define Q(S t+1, A t+1 ) = 0

17 Sarsa: On-Policy D Control urn this into a control method by always updating the policy to be greedy with respect to the current estimate: Initialize Q(s, a), 8s 2 S, a 2 A(s), arbitrarily, and Q(terminal-state, ) = 0 Repeat (for each episode): Initialize S Choose A from S using policy derived from Q (e.g., "-greedy) Repeat (for each step of episode): ake action A, observe R, S 0 Choose A 0 from S 0 using policy derived from Q (e.g., "-greedy) Q(S, A) Q(S, A)+ [R + Q(S 0,A 0 ) Q(S, A)] S S 0 ; A A 0 ; until S is terminal

18 Windy Gridworld Wind: undiscounted, episodic, reward = 1 until goal

19 Results of Sarsa on the Windy Gridworld

20 Q-Learning: Off-Policy D Control One-step Q-learning: Q(S t, A t ) Q(S t, A t ) + α % R t+1 + γ maxq(s t+1,a) Q(S t, A t )' & a ( Initialize Q(s, a), 8s 2 S, a 2 A(s), arbitrarily, and Q(terminal-state, ) = 0 Repeat (for each episode): Initialize S Repeat (for each step of episode): Choose A from S using policy derived from Q (e.g., "-greedy) ake action A, observe R, S 0 Q(S, A) Q(S, A)+ [R + max a Q(S 0,a) Q(S, A)] S S 0 ; until S is terminal

21 Cliffwalking R R ε greedy, ε = 0.1

22 Summary Introduced one-step tabular model-free D methods hese methods bootstrap and sample, combining aspects of DP and MC methods D prediction If the world is truly Markov, then D methods will learn faster than MC methods MC methods have lower error on past data, but higher error on future data Extend prediction to control by employing some form of GPI On-policy control: Sarsa Off-policy control: Q-learning

Chapter 6: Temporal Difference Learning

Chapter 6: Temporal Difference Learning Chapter 6: emporal Difference Learning Objectives of this chapter: Introduce emporal Difference (D) learning Focus first on policy evaluation, or prediction, methods hen extend to control methods R. S.