CSE 190: Reinforcement Learning: An Introduction
Chapter 7: Eligibility Traces

Acknowledgment: A good number of these slides are cribbed from Rich Sutton.

The Book: Where we are and where we're going

Part I: The Problem
  Introduction
  Evaluative Feedback
  The Reinforcement Learning Problem
Part II: Elementary Solution Methods
  Dynamic Programming
  Monte Carlo Methods
  Temporal Difference Learning
Part III: A Unified View
  Eligibility Traces
  Generalization and Function Approximation
  Planning and Learning
  Dimensions of Reinforcement Learning
  Case Studies

Chapter 7: Eligibility Traces

Simple Monte Carlo

  V(s_t) <- V(s_t) + α [ R_t − V(s_t) ]

where R_t is the actual return following state s_t.
Simplest TD Method

  V(s_t) <- V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

Is there something in between?

N-step TD Prediction

Idea: look farther into the future when you do a TD backup (1, 2, 3, …, n steps).

Mathematics of N-step TD Prediction

Monte Carlo:
  R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … + γ^{T−t−1} r_T

TD:
  R_t^{(1)} = r_{t+1} + γ V_t(s_{t+1})
  i.e., use V to estimate the remaining return.

n-step TD:
  2-step return:  R_t^{(2)} = r_{t+1} + γ r_{t+2} + γ^2 V_t(s_{t+2})
  n-step return:  R_t^{(n)} = r_{t+1} + γ r_{t+2} + … + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})
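As an illustration, the n-step return above can be computed directly from a recorded episode. This is a sketch, not part of the slides; the function name and the indexing convention (rewards[k] holds r_{k+1}, values[k] holds V(s_k)) are assumptions:

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return R_t^(n): n discounted rewards, then bootstrap from V.

    rewards[k] is r_{k+1} (the reward received on leaving state s_k);
    values[k] is the current estimate V(s_k). If t+n runs past the end of
    the episode, the result is just the remaining discounted rewards
    (the Monte Carlo return).
    """
    T = len(rewards)                      # episode length
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):     # up to n reward terms
        G += discount * rewards[k]
        discount *= gamma
    if t + n < T:                         # bootstrap only if not past the end
        G += discount * values[t + n]
    return G
```

For n at least the episode length, no bootstrap term is added, so the n-step return degenerates into the Monte Carlo return, matching the slides' spectrum from TD (n = 1) to MC (n = infinity).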
Notation

The (n) at the top of the R means n-step return (easy to miss!). In
R_t^{(n)}, the first n discounted rewards are data; the final term
γ^n V_t(s_{t+n}) is an estimate.

Learning with N-step Backups

Backup (on-line or off-line):
  ΔV_t(s_t) = α [ R_t^{(n)} − V_t(s_t) ]

Error reduction property of n-step returns:
  max_s | E_π{ R_t^{(n)} | s_t = s } − V^π(s) |  ≤  γ^n max_s | V(s) − V^π(s) |

That is, the maximum error using the n-step return is at most γ^n times the
maximum error using V. Using this, you can show that n-step methods converge.
Random Walk Examples

How does 2-step TD work here? How about 3-step TD?

A Larger Example

Task: 19-state random walk.
Each curve corresponds to the error after 10 episodes; each curve uses a
different n; the x-axis is the learning rate.
Do you think there is an optimal n? For everything?

Averaging N-step Returns

(n-step methods were introduced mainly to help with understanding TD(λ); next!)

Idea: back up an average of several returns, e.g. back up half of the 2-step
and half of the 4-step return:
  R_t^{avg} = (1/2) R_t^{(2)} + (1/2) R_t^{(4)}
This is called a complex backup. To draw the backup diagram: draw each
component and label it with the weight for that component.

Forward View of TD(λ)

TD(λ) is a method for averaging all n-step backups, weighting each n-step
return by λ^{n−1} (time since visitation).

λ-return:
  R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^{(n)}

Backup using the λ-return:
  ΔV_t(s_t) = α [ R_t^λ − V_t(s_t) ]
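The λ-return above can be sketched for a finite episode, where the tail of the infinite sum collapses onto the full return with weight λ^{T−t−1} (the episodic form given on the next slide). Function names and the indexing convention (rewards[k] holds r_{k+1}, values[k] holds V(s_k)) are assumptions for illustration:

```python
def lambda_return(rewards, values, t, gamma, lam):
    """λ-return R_t^λ for an episode of length T = len(rewards).

    Finite n-step returns get weight (1−λ)λ^{n−1}; the full (Monte Carlo)
    return gets the remaining weight λ^{T−t−1}.
    """
    T = len(rewards)

    def n_step(n):  # R_t^{(n)}: n rewards, then bootstrap from V if needed
        G, d = 0.0, 1.0
        for k in range(t, min(t + n, T)):
            G += d * rewards[k]
            d *= gamma
        return G + (d * values[t + n] if t + n < T else 0.0)

    G = sum((1 - lam) * lam ** (n - 1) * n_step(n) for n in range(1, T - t))
    return G + lam ** (T - t - 1) * n_step(T - t)
```

Setting lam to 0 recovers the one-step TD target R_t^{(1)}, and lam equal to 1 recovers the Monte Carlo return, as the "Relation to TD(0) and MC" slide states.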
Forward View of TD(λ), cont.

Note: since the weights (1 − λ) λ^{n−1} add up to 1, r_{t+1} is not
overemphasized.

λ-return Weighting Function

For an episode that terminates at time T:
  R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)}  +  λ^{T−t−1} R_t
(the sum weights the n-step returns until termination; the last term is the
weight on the full return after termination).

Relation to TD(0) and MC

If λ = 1, you get MC:
  R_t^λ = (1 − 1) Σ_{n=1}^{T−t−1} 1^{n−1} R_t^{(n)} + 1^{T−t−1} R_t = R_t
If λ = 0, you get TD(0):
  R_t^λ = (1 − 0) Σ_{n=1}^{T−t−1} 0^{n−1} R_t^{(n)} + 0^{T−t−1} R_t = R_t^{(1)}

Forward View of TD(λ) II

Look forward from each state to determine its update from future states
and rewards.
λ-return on the Random Walk

Same 19-state random walk as before.
Why do you think intermediate values of λ are best?

Backward View

  δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)

Shout δ_t backwards over time: the strength of your voice decreases with
temporal distance by γλ.

Backward View of TD(λ)

The forward view was for theory; the backward view is for mechanism.
New variable called the eligibility trace, e_t(s) ≥ 0. On each step, decay
all traces by γλ and increment the trace for the current state by 1. This
gives the equivalent weighting as the forward algorithm when used
incrementally:

  e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
           γλ e_{t−1}(s) + 1    if s = s_t

Policy Evaluation Using TD(λ)

  Initialize V(s) arbitrarily
  Repeat (for each episode):
    e(s) = 0, for all s ∈ S
    Initialize s
    Repeat (for each step of episode):
      a <- action given by π for s
      Take action a, observe reward r and next state s′
      δ <- r + γ V(s′) − V(s)      /* Compute TD error */
      e(s) <- e(s) + 1             /* Increment trace for current state */
      For all s:                   /* Update ALL s's; really only ones we have visited */
        V(s) <- V(s) + α δ e(s)    /* Incrementally update V(s) according to TD(λ) */
        e(s) <- γλ e(s)            /* Decay trace */
      s <- s′                      /* Move to next state */
    Until s is terminal
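The inner loop of the pseudocode above can be sketched as a single-step update in Python. The helper name is hypothetical; V and e are plain dicts with missing entries read as 0:

```python
def td_lambda_update(V, e, s, r, s_next, done, alpha, gamma, lam):
    """One accumulating-trace TD(λ) step, mirroring the pseudocode.

    V maps state -> value estimate; e maps state -> eligibility trace.
    Terminal next states (done=True) contribute value 0.
    """
    v_next = 0.0 if done else V.get(s_next, 0.0)
    delta = r + gamma * v_next - V.get(s, 0.0)   # TD error δ
    e[s] = e.get(s, 0.0) + 1.0                   # increment current trace
    for state in list(e):                        # only traced states matter
        V[state] = V.get(state, 0.0) + alpha * delta * e[state]
        e[state] *= gamma * lam                  # decay all traces
```

Looping only over states present in e reflects the slide's remark that "for all s" really touches only the states visited so far; between episodes e should be cleared.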
Relation of the Backward View to MC & TD(0)

Using the update rule ΔV_t(s) = α δ_t e_t(s):
  As before, if you set λ to 0, you get TD(0).
  If you set λ to 1, you get MC, but in a better way:
    Can apply TD(1) to continuing tasks.
    Works incrementally and on-line (instead of waiting to the end of the episode).
Again, by the time you get to the terminal state, these increments add up to
the one you would have made via the forward (theoretical) view.

Forward View = Backward View

The forward (theoretical) view of TD(λ) is equivalent to the backward
(mechanistic) view for off-line updating. The book shows that the total
backward-view updates equal the total forward-view updates for every state s:

  Σ_{t=0}^{T−1} ΔV_t^{TD}(s) I_{s s_t}
    = Σ_{t=0}^{T−1} I_{s s_t} Σ_{k=t}^{T−1} (γλ)^{k−t} δ_k
    = Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{s s_t}

(algebra shown in the book). On-line updating with small α is similar.

On-line versus Off-line on the Random Walk

Same 19-state random walk. On-line performs better over a broader range of
parameters.

Control: Sarsa(λ)

Standard control idea: learn Q values, so save an eligibility trace for each
state-action pair instead of just for states:

  e_t(s,a) = γλ e_{t−1}(s,a) + 1   if s = s_t and a = a_t
             γλ e_{t−1}(s,a)       otherwise

  Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)
  δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)
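The off-line forward = backward equivalence stated above can be checked numerically: run both views over the same recorded episode with V held fixed, and compare the total increments per state. Everything here is an illustrative sketch (hypothetical names; an episode is a list of (state, reward) pairs, reward being the one received on leaving that state):

```python
def lambda_return_at(values, rewards, t, gamma, lam):
    """λ-return at time t from fixed per-step values (values[k] = V(s_k))."""
    T = len(rewards)
    def n_step(n):
        G, d = 0.0, 1.0
        for k in range(t, min(t + n, T)):
            G += d * rewards[k]
            d *= gamma
        return G + (d * values[t + n] if t + n < T else 0.0)
    G = sum((1 - lam) * lam ** (n - 1) * n_step(n) for n in range(1, T - t))
    return G + lam ** (T - t - 1) * n_step(T - t)

def offline_forward(episode, V, alpha, gamma, lam):
    """Total forward-view (λ-return) increments, V fixed all episode."""
    states = [s for s, _ in episode]
    rewards = [r for _, r in episode]
    values = [V.get(s, 0.0) for s in states]
    dV = {}
    for t, s in enumerate(states):
        g = lambda_return_at(values, rewards, t, gamma, lam)
        dV[s] = dV.get(s, 0.0) + alpha * (g - values[t])
    return dV

def offline_backward(episode, V, alpha, gamma, lam):
    """Total backward-view (eligibility-trace) increments, V fixed."""
    e, dV = {}, {}
    T = len(episode)
    for t, (s, r) in enumerate(episode):
        v_next = V.get(episode[t + 1][0], 0.0) if t + 1 < T else 0.0
        delta = r + gamma * v_next - V.get(s, 0.0)   # TD error δ_t
        e[s] = e.get(s, 0.0) + 1.0                   # accumulate trace
        for state in list(e):
            dV[state] = dV.get(state, 0.0) + alpha * delta * e[state]
            e[state] *= gamma * lam                  # decay by γλ
    return dV
```

For any episode the two dictionaries of increments agree to floating-point precision, which is exactly the identity the slide quotes from the book.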
Sarsa(λ) Algorithm

  Initialize Q(s,a) arbitrarily
  Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
      Take action a, observe r, s′
      Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
      δ <- r + γ Q(s′,a′) − Q(s,a)
      e(s,a) <- e(s,a) + 1
      For all s, a:
        Q(s,a) <- Q(s,a) + α δ e(s,a)
        e(s,a) <- γλ e(s,a)
      s <- s′; a <- a′
    Until s is terminal

Sarsa(λ) Gridworld Example

With one trial, the agent has much more information about how to get to the
goal (not necessarily the best way). Traces can considerably accelerate
learning.

Three Approaches to Q(λ)

How can we extend this to Q-learning? Recall Q-learning is an off-policy
method to learn Q*, and it uses the max of the Q values for a state in its
backup. What happens if we make an exploratory move? That is NOT the right
thing to back up over. What to do? Three answers: Watkins, Peng, and the
naïve answer.

If you mark every state-action pair as eligible, you back up over a
non-greedy policy. Watkins: zero out the eligibility trace after a
non-greedy action; do the max when backing up at the first non-greedy choice:

  e_t(s,a) = 1 + γλ e_{t−1}(s,a)   if s = s_t, a = a_t, and Q_{t−1}(s_t,a_t) = max_a Q_{t−1}(s_t,a)
             0                     if Q_{t−1}(s_t,a_t) ≠ max_a Q_{t−1}(s_t,a)
             γλ e_{t−1}(s,a)       otherwise

  Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)
  δ_t = r_{t+1} + γ max_{a′} Q_t(s_{t+1}, a′) − Q_t(s_t, a_t)
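The inner loop of the Sarsa(λ) algorithm above can be sketched as a single-step update (hypothetical helper; Q and e are plain dicts keyed by (state, action), missing entries read as 0):

```python
def sarsa_lambda_update(Q, e, s, a, r, s2, a2, done, alpha, gamma, lam):
    """One Sarsa(λ) step with accumulating state-action traces.

    a2 is the action actually chosen in s2 by the behavior policy
    (on-policy target: no max is taken).
    """
    target = r if done else r + gamma * Q.get((s2, a2), 0.0)
    delta = target - Q.get((s, a), 0.0)          # TD error δ
    e[(s, a)] = e.get((s, a), 0.0) + 1.0         # accumulate trace
    for sa in list(e):
        Q[sa] = Q.get(sa, 0.0) + alpha * delta * e[sa]
        e[sa] *= gamma * lam                     # decay by γλ
```

Because the trace for an earlier (s, a) stays positive, a single rewarding step propagates credit back along the whole recent trajectory, which is the point of the gridworld example above.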
Watkins's Q(λ)

  Initialize Q(s,a) arbitrarily
  Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
      Take action a, observe r, s′
      Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
      a* <- argmax_b Q(s′,b)  (if a′ ties for the max, then a* <- a′)
      δ <- r + γ Q(s′,a*) − Q(s,a)
      e(s,a) <- e(s,a) + 1
      For all s, a:
        Q(s,a) <- Q(s,a) + α δ e(s,a)
        If a′ = a*, then e(s,a) <- γλ e(s,a)
        else e(s,a) <- 0
      s <- s′; a <- a′
    Until s is terminal

Peng's Q(λ)

Disadvantage of Watkins's method: early in learning, the eligibility trace
will be cut (zeroed) frequently, resulting in little advantage from traces.
Peng: back up the max action except at the end; never cut traces.
Disadvantage: complicated to implement.

Naïve Q(λ)

Idea: is it really a problem to back up exploratory actions? Never zero
traces; always back up the max at the current action (unlike Peng's or
Watkins's). Is this truly naïve? Well, it works well in some empirical
studies.

Comparison Task

Compared Watkins's, Peng's, and naïve (called McGovern's here) Q(λ) on
several tasks. See McGovern and Sutton (1997), "Towards a Better Q(λ)", for
other tasks and results (stochastic tasks, continuing tasks, etc.).
Deterministic gridworld with obstacles: 10x10 gridworld, 25 randomly
generated obstacles, 30 runs; α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05,
accumulating traces. What is the backup diagram?

From McGovern and Sutton (1997), "Towards a Better Q(λ)".
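A single step of Watkins's Q(λ) from the pseudocode above can be sketched as follows. This is an illustrative helper, not the authors' code; Q and e are plain dicts keyed by (state, action), and `actions` (the finite action set) is an assumed parameter:

```python
def watkins_q_lambda_update(Q, e, actions, s, a, r, s2, a2, done,
                            alpha, gamma, lam):
    """One Watkins's Q(λ) step.

    The TD error always targets the greedy value in s2; after the update,
    every trace is zeroed if the action actually taken next (a2) is
    non-greedy, so traces never span an exploratory move.
    """
    greedy = 0.0 if done else max(Q.get((s2, b), 0.0) for b in actions)
    delta = r + gamma * greedy - Q.get((s, a), 0.0)
    e[(s, a)] = e.get((s, a), 0.0) + 1.0
    for sa in list(e):
        Q[sa] = Q.get(sa, 0.0) + alpha * delta * e[sa]
        e[sa] *= gamma * lam
    if not done and Q.get((s2, a2), 0.0) < greedy:  # exploratory move: cut traces
        for sa in e:
            e[sa] = 0.0
```

The final branch is exactly the "cut" that Peng's and the naïve variants remove: Peng's never zeroes traces, and the naïve version simply deletes that branch while keeping the greedy target.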
Comparison Results

From McGovern and Sutton (1997), "Towards a Better Q(λ)".

Convergence of the Q(λ)'s

None of the methods are proven to converge (much extra credit if you can
prove any of them). Watkins's is thought to converge to Q*. Peng's is
thought to converge to a mixture of Q^π and Q*. Naïve: Q*?

Eligibility Traces for Actor-Critic Methods

Critic: on-policy learning of V^π; use TD(λ) as described before.
Actor: needs eligibility traces for each state-action pair.

We change the update equation from

  p_{t+1}(s,a) = p_t(s,a) + α δ_t   if a = a_t and s = s_t
                 p_t(s,a)           otherwise

to

  p_{t+1}(s,a) = p_t(s,a) + α δ_t e_t(s,a)

Can also change the other actor-critic update from

  p_{t+1}(s,a) = p_t(s,a) + α δ_t [1 − π_t(s_t,a_t)]   if a = a_t and s = s_t
                 p_t(s,a)                              otherwise

to

  p_{t+1}(s,a) = p_t(s,a) + α δ_t e_t(s,a)

where

  e_t(s,a) = γλ e_{t−1}(s,a) + 1 − π_t(s_t,a_t)   if s = s_t and a = a_t
             γλ e_{t−1}(s,a)                      otherwise
Replacing Traces

Using accumulating traces, frequently visited states can have eligibilities
greater than 1. This can be a problem for convergence.
Replacing traces: instead of adding 1 when you visit a state, set that
trace to 1:

  e_t(s) = γλ e_{t−1}(s)   if s ≠ s_t
           1               if s = s_t

Replacing Traces Example

Same 19-state random walk task as before. Replacing traces perform better
than accumulating traces over more values of λ.

Why Replacing Traces?

Replacing traces can significantly speed learning, and they can make the
system perform well for a broader set of parameters. Accumulating traces
can do poorly on certain types of tasks. Why is this task particularly
onerous for accumulating traces?

More Replacing Traces

Off-line replacing-trace TD(1) is identical to first-visit MC.
Extension to action values (Q values): when you revisit a state, what
should you do with the traces for the actions you didn't take? Singh and
Sutton say to set them to zero:

  e_t(s,a) = 1                  if s = s_t and a = a_t
             0                  if s = s_t and a ≠ a_t
             γλ e_{t−1}(s,a)    if s ≠ s_t
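The replacing-trace rule above amounts to one small change to the trace bookkeeping: decay everything, then set (rather than increment) the visited state's trace. A minimal sketch, with hypothetical names (`states` is the set of states to maintain traces over):

```python
def decay_and_replace(e, s_visited, states, gamma, lam):
    """Replacing-trace update: decay every trace by γλ, then set the
    visited state's trace to 1 instead of adding 1 to it."""
    for s in states:
        e[s] = gamma * lam * e.get(s, 0.0)
    e[s_visited] = 1.0      # replace, so traces never exceed 1
    return e
```

With this rule a state visited on every step keeps a trace of exactly 1, whereas an accumulating trace would grow without bound, which is the convergence concern the slide raises.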
Implementation Issues with Traces

Could require much more computation, but most eligibility traces are VERY
close to zero. If you implement it in Matlab, the backup is only one line
of code and is very fast (Matlab is optimized for matrices).

Variable λ

Can generalize to a variable λ: here λ is a function of time, e.g.
λ_t = λ(s_t). The trace update becomes:

  e_t(s) = γ λ_t e_{t−1}(s)       if s ≠ s_t
           γ λ_t e_{t−1}(s) + 1   if s = s_t

Conclusions

The two views (forward and backward) provide an efficient, incremental way
to combine MC and TD:
  Includes the advantages of MC (can deal with lack of the Markov property).
  Includes the advantages of TD (uses the TD error, bootstraps).
  Can significantly speed learning.
  Does have a cost in computation.
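The observation above that most traces are very close to zero suggests one common trick (an implementation sketch, not from the slides): keep traces in a dict and drop any entry once it decays below a cutoff, so the per-step update loop only touches the few recently visited states. The function name and cutoff are assumptions:

```python
def decay_traces_sparse(e, gamma, lam, cutoff=1e-8):
    """Decay every trace by γλ and delete entries that have become
    negligible, keeping the trace dict (and update loops over it) small."""
    for s in list(e):        # list() copy: safe to delete while iterating
        e[s] *= gamma * lam
        if e[s] < cutoff:
            del e[s]
```

Since a trace decays geometrically, only roughly log(cutoff)/log(γλ) of the most recent states survive, so the cost per step is effectively constant rather than proportional to the state space.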
Unified View

END