CSE 190: Reinforcement Learning: An Introduction. Chapter 6: Temporal Difference Learning.


CSE 190: Reinforcement Learning: An Introduction
Chapter 6: Temporal Difference Learning

Monte Carlo is important in practice
When there are just a few possibilities to value, out of a large state space, Monte Carlo is a big win: Backgammon, Go, ...
But it is still somewhat inefficient, because it figures out state values in isolation from other states (no bootstrapping).
Acknowledgment: a good number of these slides are cribbed from Rich Sutton.

Chapter 6: Temporal Difference Learning
Objectives of this chapter:
Introduce Temporal Difference (TD) learning.
As usual, we focus first on policy evaluation, or prediction, methods.
Then extend to control methods.

TD Prediction
Policy evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^\pi$.
Simple every-visit Monte Carlo method:
$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$
target: the actual return after time $t$.
The simplest TD method, TD(0):
$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$
target: an estimate of the return.

TD Prediction
Note that:
$V^\pi(s) = E_\pi\{ R_t \mid s_t = s \}$
$\quad = E_\pi\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \}$
$\quad = E_\pi\{ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s \}$
$\quad = E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \}$
Hence the two targets are the same in expectation. In practice, the results may differ due to finite training sets.

Simple Monte Carlo
$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$
where $R_t$ is the actual return following state $s_t$.

Simplest TD Method
$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$
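To make the first target concrete, here is a minimal sketch of constant-α every-visit Monte Carlo prediction. The `env` object (with `reset()` and `step(action)` returning `(next_state, reward, done)`) and the `policy` function are assumptions for illustration; they are not defined in the slides.

```python
from collections import defaultdict

def mc_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Constant-alpha every-visit MC: V(s) <- V(s) + alpha * [R_t - V(s)]."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        # Generate one complete episode before doing any update (no bootstrapping).
        s = env.reset()
        trajectory = []                    # (state, reward received on leaving it) pairs
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            trajectory.append((s, r))
            s = s_next
        # Work backwards to compute the actual return R_t following each visited state.
        g = 0.0
        for s, r in reversed(trajectory):
            g = r + gamma * g
            V[s] += alpha * (g - V[s])     # target: the actual return after time t
    return V
```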

cf. Dynamic Programming
$V(s_t) \leftarrow E_\pi\{ r_{t+1} + \gamma V(s_{t+1}) \}$

Comparing MC and Dynamic Programming
Bootstrapping: the update involves an estimate from a nearby state.
MC does not bootstrap: each state is estimated independently.
DP bootstraps: the estimate for each state depends on nearby states.
Sampling: the update involves an actual return.
MC samples: it learns from experience in the environment and gets an actual return R.
DP does not sample: it learns by spreading constraints until convergence through iterating the Bellman equations, without interaction with the environment!

TD methods bootstrap and sample
Bootstrapping: the update involves an estimate. MC does not bootstrap; DP bootstraps; TD bootstraps.
Sampling: the update involves an actual return. MC samples; DP does not sample; TD samples.

Policy Evaluation with TD(0)
Note the on-line nature: updates as you go.
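In contrast with the MC sketch above, TD(0) updates as you go, within the episode. The same hypothetical `env`/`policy` interface is assumed; the names are illustrative, not from the slides.

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])   # bootstraps on V(s')
            V[s] += alpha * (target - V[s])                     # on-line: update as you go
            s = s_next
    return V
```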

Example: Driving Home
Value of a state: the remaining time to get home, or "time to go".

State                 Elapsed Time (minutes)   Predicted Time to Go   Predicted Total Time
leaving office        0                        30                     30
reach car, raining    5                        35                     40
exit highway          20                       15                     35
behind truck          30                       10                     40
home street           40                       3                      43
arrive home           43                       0                      43

Driving Home
[Figure: changes recommended by Monte Carlo methods (α = 1) versus changes recommended by TD methods (α = 1).]

Advantages of TD Learning
TD methods do not require a model of the environment, only experience.
TD, but not MC, methods can be fully incremental:
You can learn before knowing the final outcome (less memory, less peak computation).
You can learn without the final outcome, from incomplete sequences.
Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

Random Walk Example
[Figure: the five-state random walk and the values learned by TD(0) after various numbers of episodes.]
What happened on the first episode?
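A self-contained sketch of the random walk example: five states A–E (indices 0–4), every episode starting in the middle state C, reward 0 everywhere except +1 for stepping off the right end, and all values initialized to 0.5 as in the figure. The function name and episode count are illustrative choices.

```python
import random

def td0_random_walk(num_episodes, alpha=0.1, gamma=1.0):
    """TD(0) on the five-state random walk; the true values are 1/6, 2/6, ..., 5/6."""
    V = [0.5] * 5                       # states A..E, all initialized to 0.5
    for _ in range(num_episodes):
        s = 2                           # every episode starts in C
        while True:
            s_next = s + random.choice([-1, 1])
            if s_next < 0:              # terminated off the left end: reward 0
                V[s] += alpha * (0.0 - V[s])
                break
            if s_next > 4:              # terminated off the right end: reward +1
                V[s] += alpha * (1.0 - V[s])
                break
            V[s] += alpha * (0.0 + gamma * V[s_next] - V[s])
            s = s_next
    return V

print(td0_random_walk(100))             # roughly [0.17, 0.33, 0.50, 0.67, 0.83]
```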

TD and MC on the Random Walk
[Figure: learning curves for TD(0) and constant-α MC on the random walk; data averaged over 100 sequences of episodes.]

Optimality of TD(0)
Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
Compute updates according to TD(0), but only update estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.
Constant-α MC also converges under these conditions, but to a different answer!

Random Walk under Batch Updating
[Figure: after each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence; all repeated 100 times.]

You are the Predictor
Suppose you observe the following 8 episodes (NOTE: these are not from the previous example!):
1. A, 0, B, 0, terminate
2. B, 1, terminate
3. B, 1, terminate
4. B, 1, terminate
5. B, 1, terminate
6. B, 1, terminate
7. B, 1, terminate
8. B, 0, terminate
V(A)? V(B)?
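One way to check the answer numerically before reading on: the sketch below runs batch TD(0) and batch constant-α MC to (approximate) convergence on these eight episodes. The `(state, reward, next_state)` encoding, the function names, and the choice γ = 1 are illustrative assumptions, not from the slides.

```python
# Each transition is (state, reward, next_state); next_state None marks termination.
episodes = [[("A", 0, "B"), ("B", 0, None)]] \
         + [[("B", 1, None)]] * 6 \
         + [[("B", 0, None)]]

def batch_td0(episodes, alpha=0.01, sweeps=20000, gamma=1.0):
    V = {"A": 0.0, "B": 0.0}
    for _ in range(sweeps):
        deltas = {s: 0.0 for s in V}             # accumulate increments over the whole batch
        for ep in episodes:
            for s, r, s_next in ep:
                target = r + (gamma * V[s_next] if s_next is not None else 0.0)
                deltas[s] += alpha * (target - V[s])
        for s in V:                              # apply only after a complete pass
            V[s] += deltas[s]
    return V

def batch_mc(episodes):
    # Batch constant-alpha MC converges to the average observed return from each state.
    returns = {"A": [], "B": []}
    for ep in episodes:
        g = 0.0
        for s, r, _ in reversed(ep):             # undiscounted return following each state
            g += r
            returns[s].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

print(batch_td0(episodes))   # approximately V(A) = 0.75, V(B) = 0.75
print(batch_mc(episodes))    # V(A) = 0.0,  V(B) = 0.75
```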

You are the Predictor
The prediction that best matches the training data is V(A) = 0.
This minimizes the mean-square error on the training set.
This is what a batch Monte Carlo method gets.
If we consider the sequentiality of the problem, then we would set V(A) = 0.75.
This is correct for the maximum-likelihood estimate of a Markov model generating the data; i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?).
This is called the certainty-equivalence estimate.
This is what TD(0) gets.

Learning an Action-Value Function
Estimate $Q^\pi$ for the current behavior policy $\pi$.

Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate.
After every transition from a nonterminal state $s_t$, do this:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$
If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$ (no action out of a terminal state!).
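A minimal sketch of tabular Sarsa with an ε-greedy behavior policy (one common way of being greedy with respect to the current estimate while still exploring). The `env` interface, the `actions` list, and the parameter values are assumptions for illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r                       # no action out of a terminal state
            else:
                a_next = epsilon_greedy(Q, s_next, actions, epsilon)
                target = r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if not done:
                s, a = s_next, a_next
    return Q
```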

Windy Gridworld
Undiscounted, episodic; reward = -1 on every step until the goal is reached.

Results of Sarsa on the Windy Gridworld
[Figure: learning curve of Sarsa on the windy gridworld.]

Q-Learning: Off-Policy TD Control
One-step Q-learning:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$
Compare to Sarsa:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$
This tiny difference in the update makes a BIG difference!
Q-learning learns $Q^*$, the optimal Q-values, with probability 1, no matter what policy is being followed (as long as it is exploratory and visits each state-action pair infinitely often, and the learning rates get smaller at the appropriate rate).
Hence this is off-policy: it finds $\pi^*$ while following another policy.
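A minimal sketch of one-step Q-learning, structured to highlight the only change from Sarsa: the bootstrap target uses the greedy next action rather than the action actually taken next. As before, the `env`/`actions` interface and the parameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy (any sufficiently exploratory policy works).
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Target uses the max over next actions: greedy target policy, hence off-policy.
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```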

Cliffwalking
[Figure: the cliff-walking task; ε-greedy with ε = 0.1.]

Actor-Critic Methods
Explicit representation of the policy as well as the value function.
Minimal computation to select actions.
Can learn an explicit stochastic policy.
Can put constraints on policies.
Appealing as psychological and neural models.

Actor-Critic Details
Suppose we take action $a_t$ from state $s_t$, reaching state $s_{t+1}$.
The critic evaluates the action using the TD error:
$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
If actions are determined by preferences $p(s,a)$ as follows:
$\pi_t(s,a) = \Pr\{ a_t = a \mid s_t = s \} = \frac{e^{p(s,a)}}{\sum_b e^{p(s,b)}}$ (a softmax policy),
then you can update the actor's preferences like this:
$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$

Dopamine Neurons and TD Error
[Figure: dopamine neuron recordings; W. Schultz et al., Université de Fribourg.]
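A minimal sketch of a tabular actor-critic following the updates above: the critic maintains V and computes δ, while the actor maintains preferences p(s, a) and samples actions from the softmax. The `env`/`actions` interface and the step sizes α (critic) and β (actor) are illustrative assumptions.

```python
import math
import random
from collections import defaultdict

def softmax_action(prefs, state, actions):
    """Sample an action with probability proportional to e^{p(s,a)}."""
    weights = [math.exp(prefs[(state, a)]) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

def actor_critic(env, actions, num_episodes=500, alpha=0.1, beta=0.1, gamma=1.0):
    V = defaultdict(float)        # critic: state-value estimates
    prefs = defaultdict(float)    # actor: action preferences p(s, a)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = softmax_action(prefs, s, actions)
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
            V[s] += alpha * delta              # critic update
            prefs[(s, a)] += beta * delta      # actor update for the action actually taken
            s = s_next
    return prefs, V
```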

Can we apply these methods to continuing tasks without discounting?
Yes: Average Reward Per Time Step
Average expected reward per time step under policy $\pi$:
$\rho^\pi = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E_\pi\{ r_t \}$ (the same for each state, if ergodic)
Value of a state relative to $\rho^\pi$:
$\tilde{V}^\pi(s) = \sum_{k=1}^{\infty} E_\pi\{ r_{t+k} - \rho^\pi \mid s_t = s \}$
Value of a state-action pair relative to $\rho^\pi$:
$\tilde{Q}^\pi(s,a) = \sum_{k=1}^{\infty} E_\pi\{ r_{t+k} - \rho^\pi \mid s_t = s, a_t = a \}$
These values code the transients away from the average...

R-Learning
[Figure: the R-learning algorithm; a sketch of its updates follows after the Afterstates slide below.]

Access-Control Queuing Task
n servers.
Customers have four different priorities, which pay a reward of 1, 2, 4, or 8 if served.
At each time step, the customer at the head of the queue is accepted (assigned to a server) or removed from the queue.
The proportion of randomly distributed high-priority customers in the queue is h.
A busy server becomes free with probability p on each time step.
Statistics of arrivals and departures are unknown.
Apply R-learning with n = 10, h = 0.5, p = 0.06.

Afterstates
Usually, a state-value function evaluates states in which the agent can take an action.
But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.
Why is this useful? What is this in general?
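The R-learning algorithm box itself is not in this transcription, so the following is a hedged sketch based on the standard description: Q-values relative to an average-reward estimate ρ, updated off-policy, with ρ adjusted only on greedy steps. The continuing-task `env` interface (`step` returns only `(next_state, reward)`), the names, and the step sizes are assumptions.

```python
import random
from collections import defaultdict

def r_learning(env, actions, num_steps=100000, alpha=0.1, beta=0.01, epsilon=0.1):
    """Tabular R-learning: action values are relative to the average reward rho."""
    Q = defaultdict(float)
    rho = 0.0                                    # running estimate of average reward per step
    s = env.reset()
    for _ in range(num_steps):                   # continuing task: no episode boundaries
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        best_here = max(Q[(s, act)] for act in actions)
        was_greedy = Q[(s, a)] == best_here      # was the chosen action a greedy one?
        s_next, r = env.step(a)
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r - rho + best_next - Q[(s, a)])
        if was_greedy:                           # adjust rho only on non-exploratory steps
            rho += beta * (r - rho + best_next - best_here)
        s = s_next
    return Q, rho
```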

Summary
TD prediction: introduced one-step tabular model-free TD methods.
Extend prediction to control by employing some form of GPI.
On-policy control: Sarsa.
Off-policy control: Q-learning and R-learning.
These methods bootstrap and sample, combining aspects of DP and MC methods.

Questions
What can I tell you about RL?
What is common to all three classes of methods: DP, MC, TD?
What are the principal strengths and weaknesses of each?
In what sense is our RL view complete? In what senses is it incomplete? What are the principal things missing?
The broad applicability of these ideas.
What does the term "bootstrapping" refer to?
What is the relationship between DP and learning?

The Book
Part I: The Problem
Introduction
Evaluative Feedback
The Reinforcement Learning Problem
Part II: Elementary Solution Methods
Dynamic Programming
Monte Carlo Methods
Temporal Difference Learning
Part III: A Unified View
Eligibility Traces
Generalization and Function Approximation
Planning and Learning
Dimensions of Reinforcement Learning
Case Studies

END

Unified View