
2 Eligibility traces Chapter 12, plus some extra stuff! Like n-step methods, but better!

3 Eligibility traces
A mechanism that allows TD, Sarsa and Q-learning to learn more efficiently
A way to move smoothly between TD(0) updates and Monte Carlo updates (using full returns)
- useful when we don't have the Markov property
- allows us to do Monte Carlo-like updates even when the task is continuing or episodes are very long
- methods in between are often better in practice
Allows TD methods to do multi-step updates
Simpler to implement than n-step methods

4 1-step prediction
Consider the update target of constant-α MC
- $G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$
- uses the full return, all future rewards until termination
Consider the update target of TD(0)
- $G_t^{(1)} \doteq R_{t+1} + \gamma V(S_{t+1})$
- it is based only on the next reward, and $\gamma V(S_{t+1})$ is a stand-in for the rest of the return, $\gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$
- this allows us to avoid waiting: we can update on every step

5 n-step prediction
We do something in between
- use more than one reward, but fewer than all rewards until termination
Use two rewards and then the value function:
- $G_t^{(2)} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
- now $\gamma^2 V(S_{t+2})$ takes the place of $\gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots$
- we call this a 2-step return
We could do 3-step, 4-step, ..., n-step returns
- $G_t^{(n)} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$
We call these n-step TD methods; TD(0) is a one-step TD method
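The n-step return defined above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `n_step_return`, the list layout (`rewards[k]` holds R_{k+1}, `values[k]` holds V(S_k)) and the sample numbers are assumptions made for the example, not part of the slides.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n): n discounted rewards, then a bootstrap from V(S_{t+n}).

    Assumed layout: rewards[k] holds R_{k+1}, values[k] holds V(S_k),
    and the episode length is T = len(rewards). If t + n reaches the end
    of the episode, the result is simply the full (Monte Carlo) return.
    """
    T = len(rewards)
    end = min(t + n, T)
    g = 0.0
    for k in range(t, end):
        g += gamma ** (k - t) * rewards[k]
    if t + n < T:  # bootstrap only when S_{t+n} is not terminal
        g += gamma ** n * values[t + n]
    return g


# Hypothetical episode with T = 4
rewards = [1.0, 0.0, 2.0, 3.0]        # R_1 .. R_4
values = [0.5, 1.0, 1.5, 2.0, 0.0]    # V(S_0) .. V(S_4); terminal value is 0
gamma = 0.9

one_step = n_step_return(rewards, values, 0, 1, gamma)  # R_1 + gamma * V(S_1)
two_step = n_step_return(rewards, values, 0, 2, gamma)  # adds R_2, bootstraps from V(S_2)
full = n_step_return(rewards, values, 0, 4, gamma)      # no bootstrap: Monte Carlo return
```

Setting n = 1 recovers the TD(0) target, and any n of at least T - t recovers the Monte Carlo target, which is the spectrum the next slide illustrates.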

6 Spectrum of returns
[Figure: backup diagrams for TD (1-step), 2-step, 3-step, ..., n-step, and Monte Carlo returns]

7 n-step returns
All n-step returns are approximations of the full return $G_t$
- truncated after n steps and corrected for the remaining missing terms using $V(S_{t+n})$
Monte Carlo methods are a special case: an ∞-step return

8 TD updates with n-step returns
We can do these updates online, during the episode, as soon as the information becomes available:
- we wait n-1 steps
We have already explored implementations of n-step methods
- they work better than TD(0)
- BUT the algorithms are more complex

9 Complex backups
Another valid option is averaging n-step returns
For example, we could average a 2-step and a 4-step return: $\frac{1}{2} G_t^{(2)} + \frac{1}{2} G_t^{(4)}$
In fact, we can average any set of returns this way, even an infinite set of returns, as long as the weights sum to 1
- still guaranteed to converge to correct predictions
The average of simple backups is called a complex backup; TD(λ) uses a complex backup

10 λ-returns
The TD(λ) algorithm averages n-step returns in a particular way
Each n-step return is weighted proportionally to $\lambda^{n-1}$, where $\lambda \in [0,1]$
For the episodic case the λ-return is:
$G_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$,
$G_t^{(n)} \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}) = \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n V(S_{t+n})$
When λ=1, $G_t^\lambda = G_t$, the full return: Monte Carlo
When λ=0, $G_t^\lambda = G_t^{(1)}$, the 1-step return: TD(0), hence the name!

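11 λ-return example
$G_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$
Let T = 4 and t = 0.
The one-step return (n = 1 of the sum):
$(1-\lambda)\lambda^{n-1} G_t^{(n)} = (1-\lambda)\lambda^{0} G_t^{(1)} = (1-\lambda)[R_{t+1} + \gamma V(S_{t+1})] = (1-\lambda)[R_1 + \gamma V(S_1)]$
The two-step return (n = 2 of the sum):
$(1-\lambda)\lambda^{1} G_t^{(2)} = (1-\lambda)\lambda [R_1 + \gamma R_2 + \gamma^2 V(S_2)]$
The three-step return (n = 3 of the sum):
$(1-\lambda)\lambda^{2} G_t^{(3)} = (1-\lambda)\lambda^2 [R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 V(S_3)]$
Now we stop, because T - t - 1 = 4 - 0 - 1 = 3.
Now we take care of the last component of the equation:
$\lambda^{T-t-1} G_t = \lambda^3 [R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 R_{T=4}]$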
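The worked example above (T = 4, t = 0) can be checked numerically. The sketch below is a minimal illustration under assumed conventions: `rewards[k]` holds R_{k+1}, `values[k]` holds V(S_k), and the reward and value numbers are made up for the example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n), with rewards[k] = R_{k+1} and values[k] = V(S_k) (assumed layout)."""
    T = len(rewards)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, min(t + n, T)))
    if t + n < T:  # bootstrap only from non-terminal states
        g += gamma ** n * values[t + n]
    return g


def lambda_return(rewards, values, t, gamma, lam):
    """Episodic lambda-return: weighted n-step returns plus the tail weight on G_t."""
    T = len(rewards)
    horizon = T - t - 1  # number of truncated n-step returns in the main sum
    g = 0.0
    for n in range(1, horizon + 1):
        g += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # Remaining weight lam^(T-t-1) goes to the full return G_t
    g += lam ** horizon * n_step_return(rewards, values, t, T - t, gamma)
    return g


# Hypothetical episode with T = 4, evaluated at t = 0
rewards = [1.0, 0.0, 2.0, 3.0]        # R_1 .. R_4
values = [0.5, 1.0, 1.5, 2.0, 0.0]    # V(S_0) .. V(S_4)
gamma = 0.9

td0_target = lambda_return(rewards, values, 0, gamma, 0.0)   # equals the 1-step return
mc_target = lambda_return(rewards, values, 0, gamma, 1.0)    # equals the full return
mixed_target = lambda_return(rewards, values, 0, gamma, 0.5)
```

With λ = 0 only the n = 1 term survives, and with λ = 1 the main sum vanishes and only the full return remains, matching the two limiting cases stated on the previous slide.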

12 λ-returns (textbook excerpt, Section 7.2, The Forward View of TD(λ))
... given the next largest weight, $(1-\lambda)\lambda$; the three-step return is given the weight $(1-\lambda)\lambda^2$; and so on. The weight fades by λ with each additional step. After a terminal state has been reached, all subsequent n-step returns are equal to $G_t$. If we want, we can separate these post-termination terms from the main sum, yielding
$G_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$, (7.6)
as indicated in the figures. This equation makes it clearer what happens when λ = 1. In this case the main sum goes to zero, and the remaining term reduces to the conventional return, $G_t$. Thus, for λ = 1, backing up according to the λ-return is the same as the Monte Carlo algorithm that we called constant-α MC (6.1).
[Figure 7.3: The backup diagram for TD(λ).]

13 λ-returns
[Figure 7.4: Weighting given in the λ-return to each of the n-step returns. Axes: weight against time. The weight decays by λ with each step; the weight given to the 3-step return is $(1-\lambda)\lambda^2$; the weight given to the actual, final return is $\lambda^{T-t-1}$; the total area is 1.]

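14 Can we show that the weighting sums to 1?
$G_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t$
Consider the weights in the first term: $(1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1}$
We can use the generalized formula for the sum of a geometric series:
$\sum_{k=a}^{b} r^k = \frac{r^a - r^{b+1}}{1-r}$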

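15 Consider the weights in the first term: $(1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1}$
Generalized formula: $\sum_{k=a}^{b} r^k = \frac{r^a - r^{b+1}}{1-r}$
Plug in the variables: $\sum_{n=1}^{T-t-1} \lambda^n = \frac{\lambda - \lambda^{T-t}}{1-\lambda}$
But we want $\lambda^{n-1}$, not $\lambda^n$: $\sum_{n=1}^{T-t-1} \lambda^{n-1} = \sum_{n=1}^{T-t-1} \lambda^n \lambda^{-1} = \frac{1}{\lambda} \sum_{n=1}^{T-t-1} \lambda^n$
Now the formula is in the correct form: $\sum_{n=1}^{T-t-1} \lambda^{n-1} = \frac{\lambda (1 - \lambda^{T-t-1})}{\lambda (1-\lambda)} = \frac{1 - \lambda^{T-t-1}}{1-\lambda}$ (pull λ out of the numerator and cancel λ)
So $(1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} = 1 - \lambda^{T-t-1}$
Add in the last term: $1 - \lambda^{T-t-1} + \lambda^{T-t-1} = 1$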
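The derivation above can also be confirmed numerically. The following is a small sanity check, not a proof; `lambda_weights` is a hypothetical helper name, and `horizon` stands for T - t - 1.

```python
def lambda_weights(horizon, lam):
    """Weights on G^(1) .. G^(horizon), followed by the weight on the full return.

    horizon plays the role of T - t - 1 in the lambda-return formula.
    """
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, horizon + 1)]
    weights.append(lam ** horizon)  # weight on the actual, final return
    return weights


# Check several (horizon, lambda) pairs, including the two extremes
totals = {
    (horizon, lam): sum(lambda_weights(horizon, lam))
    for horizon in (0, 1, 3, 50)
    for lam in (0.0, 0.4, 0.9, 1.0)
}
```

Every entry of `totals` should equal 1 up to floating-point rounding, regardless of how long the episode is.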

16 λ-return updates
The λ-return algorithm performs backups towards the λ-return as a target:
$\theta_{t+1} \doteq \theta_t + \alpha \left[ G_t^\lambda - \hat{v}(S_t, \theta_t) \right] \nabla \hat{v}(S_t, \theta_t)$
The off-line algorithm makes no changes to the weight vector during the episode
All updates are done at the end of episodes

17 19-state chain (no approximation = no FA)
[Figure 12.3: 19-state random walk results (Example 7.1): performance of the off-line λ-return algorithm alongside that of the n-step TD methods (from Chapter 7). Both panels show RMS error at the end of the episode over the first 10 episodes, with one curve per value of λ or n. In both cases, intermediate values ...]

18 The forward view
The λ-return algorithm uses what is called the forward view
From each state we visit, we look forward in time to all future rewards
[Figure 7.6: The forward or theoretical view. We decide how to update each state by looking forward to future rewards and states.]

19 λ-return and Monte Carlo
The λ-return algorithm uses the λ parameter to shift between TD(0) and Monte Carlo backups
- in the same way that n in n-step TD methods allowed us to shift between the two extremes
The λ-return is distinctive because there is a special algorithm for achieving it that does not require looking into the future
- instead we keep a memory and make updates to previously visited states
- this is what the TD(λ) algorithm does

20 The backward view
The backward view looks back to the recently visited states (marked by eligibility traces)
Shout the TD error backwards
The traces fade with temporal distance by γλ
Textbook excerpt: ... update treated in Chapter 9 (and, in the tabular case, to the simple TD rule (6.2)). This is why that algorithm was called TD(0). In terms of Figure 12.5, TD(0) is the case in which only the one state preceding the current one is changed by the TD error. For larger values of λ, but still λ < 1, more of the preceding states ...
[Figure 12.5: The backward or mechanistic view. Each update depends on the current TD error combined with eligibility traces of past events.]

21 The backward view
The forward view is theoretical; it's not directly implementable
- at each step it requires knowledge of what will happen on future steps
The backward view provides a mechanism for approximating the forward view
We need a memory variable called the eligibility trace

22 The TD(λ) algorithm
One of the oldest, best known, and most successful algorithms in RL
Three distinctive advantages over the off-line λ-return algorithm:
- updates the weight vector (value function) on every step, not just at the end of episodes
- computation is equally distributed in time (not just at the end)
- can be applied to continuing problems

23 Eligibility traces
The eligibility traces keep a record of which states have been visited recently
The traces indicate the degree to which a state is eligible for a learning update
[Figure: an accumulating eligibility trace plotted against the times of visits to a state]

24 Demo
Here we are marking state-action pairs with a replacing eligibility trace

25 Accumulating eligibility traces
New memory vector called the eligibility trace, $e_t \in \mathbb{R}^n$, $e_t \ge 0$ (same shape as θ)
- On each step, decay each component by γλ and increment the trace for the current state by 1
$e_0 \doteq 0$, $e_t \doteq \nabla \hat{v}(S_t, \theta_t) + \gamma\lambda e_{t-1}$
[Figure: an accumulating eligibility trace plotted against the times of visits to a state]

26 Combining eligibility traces and TD
We want to send information back in time
- but what do we send back?
The temporal difference error is defined as:
$\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t) - \hat{v}(S_t, \theta_t)$
Using the eligibility trace, we update the values of previously visited states with the TD error on this time step:
$\theta_{t+1} \doteq \theta_t + \alpha \delta_t e_t$
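Putting the trace and the TD error together, one episode of TD(λ) might look like the sketch below. It assumes the tabular case, where the gradient of the value estimate is a one-hot vector, so "incrementing by the gradient" means adding 1 to the current state's trace. The function name, the transition tuple layout and the two-state example are assumptions made for illustration.

```python
def td_lambda_episode(transitions, values, alpha, gamma, lam):
    """One episode of tabular TD(lambda) with accumulating traces.

    transitions: list of (state, reward, next_state, done) tuples (assumed format).
    values: dict mapping state -> estimated value, updated in place.
    """
    traces = {s: 0.0 for s in values}              # e_0 = 0
    for state, reward, next_state, done in transitions:
        for s in traces:                           # decay every trace by gamma * lambda
            traces[s] *= gamma * lam
        traces[state] += 1.0                       # one-hot gradient of a table lookup
        v_next = 0.0 if done else values[next_state]
        delta = reward + gamma * v_next - values[state]   # TD error
        for s in values:                           # theta <- theta + alpha * delta * e
            values[s] += alpha * delta * traces[s]
    return values


# Hypothetical two-step episode: A -> B -> terminal, reward 1 on the last step
episode = [("A", 0.0, "B", False), ("B", 1.0, "END", True)]

with_traces = td_lambda_episode(episode, {"A": 0.0, "B": 0.0}, alpha=0.5, gamma=1.0, lam=1.0)
without_traces = td_lambda_episode(episode, {"A": 0.0, "B": 0.0}, alpha=0.5, gamma=1.0, lam=0.0)
```

With λ = 1 the final reward reaches state A in the same episode through its trace, whereas with λ = 0 only state B is updated, which is the "send information back in time" effect described above.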

27 The semi-gradient version of TD(λ) with function approximation (textbook excerpt)
With function approximation, the eligibility trace is a vector with the same number of components as the weight vector $\theta_t$. The weight vector is a long-term memory, accumulating over the lifetime of the system; the eligibility trace is a short-term memory. Eligibility traces assist in the learning process; their only consequence is that they affect the weight vector, and then the weight vector determines the estimated value.
In TD(λ), the eligibility trace vector is initialized to zero at the beginning of the episode, is incremented on each time step by the value gradient, and then fades away by γλ:
$e_0 \doteq 0$, $e_t \doteq \nabla \hat{v}(S_t, \theta_t) + \gamma\lambda e_{t-1}$,
where γ is the discount rate and λ is the parameter introduced in the previous section. The eligibility trace keeps track of which components of the weight vector have contributed, positively or negatively, to recent state valuations. The trace is said to indicate the eligibility of each component of the weight vector for undergoing learning changes.
The reinforcing events we are concerned with are the one-step TD errors. The TD error for state-value prediction is
$\delta_t \doteq R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t) - \hat{v}(S_t, \theta_t)$.
In TD(λ), the weight vector is updated on each step in proportion to the TD error and the vector eligibility trace:
$\theta_{t+1} \doteq \theta_t + \alpha \delta_t e_t$.
On the next page, complete pseudocode for TD(λ) is given, and a picture of its operation is suggested by Figure 12.5. TD(λ) is oriented backward in time. At each moment we look at the current TD error and assign it backward to each prior state according to how much that state contributed to the current eligibility trace at that time.

28 TD(1)
The TD(1) algorithm is more general than Monte Carlo methods
- it can be applied to continuing tasks
- it can be performed incrementally and online
- it learns during episodes
- control methods based on TD(1) can learn from unusually good or bad rewards during the episode and alter their behavior during the episode

29 TD(λ) performs similarly to the off-line λ-return algorithm, but slightly worse, particularly at high α
Tabular 19-state random walk task
[Figure: RMS error at the end of the episode over the first 10 episodes, one curve per λ. Panels: TD(λ); off-line λ-return algorithm (from the previous section).]
Can we do better? Can we update online?

30 Eligibility in the linear case
Assuming linear FA
- TD(λ) with accumulating traces updates $e_t$:
- $e_t = \gamma\lambda e_{t-1} + \phi_t$
Assuming binary features (like tile coding)
- $e_t = \gamma\lambda e_{t-1}$
- $e_t(f) = e_t(f) + 1$ for all $f \in F$
where F is the set of indices of the active features

31 Replacing traces
The previous experiment suggests a weakness of TD(λ) with accumulating traces
With accumulating traces, multiple visits to a state can cause further increments to $e_t$
Instead of $e_t = \gamma\lambda e_{t-1} + \phi_t$, we could replace the trace value:
- $e_t = \max(\gamma\lambda e_{t-1}, \phi_t)$
This is called a replacing trace
If λ=1, TD(λ) with replacing traces is closely related to first-visit Monte Carlo

32 Replacing traces for binary features
$e_t = \gamma\lambda e_{t-1}$
$e_t(f) = 1$ for all $f \in F$
F contains the indices of the active features
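The difference between accumulating and replacing traces for binary features can be seen in a short sketch. The helper `update_trace` and its `kind` argument are hypothetical names chosen for the example; the decay-then-set order follows the two update rules stated on the slides.

```python
def update_trace(trace, active, gamma, lam, kind="accumulating"):
    """Decay all components by gamma * lambda, then mark the active binary features.

    trace: list of floats; active: iterable of active feature indices (the set F).
    """
    new_trace = [gamma * lam * e for e in trace]
    for f in active:
        if kind == "accumulating":
            new_trace[f] += 1.0      # e(f) <- e(f) + 1
        elif kind == "replacing":
            new_trace[f] = 1.0       # e(f) <- 1
        else:
            raise ValueError(f"unknown trace kind: {kind}")
    return new_trace


# Visit feature 0 twice in a row with gamma * lambda = 0.5
acc = [0.0, 0.0]
rep = [0.0, 0.0]
for _ in range(2):
    acc = update_trace(acc, [0], gamma=1.0, lam=0.5, kind="accumulating")
    rep = update_trace(rep, [0], gamma=1.0, lam=0.5, kind="replacing")
```

After the repeated visit the accumulating trace exceeds 1 while the replacing trace is capped at 1, which is the behavior the replacing rule is meant to enforce.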

33 Linear true online methods
The λ-return algorithm is ideally what we would like to do
It performs better than TD(λ) in practice
TD(λ) is only equivalent to the λ-return algorithm at the ends of episodes:
- after each episode, when each algorithm has done all its updates, the two give the same result
Recently a new version of TD(λ) was proposed that achieves equivalence during the episode!

34 Dutch traces, linear case
$e_t \doteq \gamma\lambda e_{t-1} + \left( 1 - \alpha\gamma\lambda\, e_{t-1}^\top \phi_t \right) \phi_t$
This is called a dutch trace
- somewhere in between accumulating and replacing traces
When used with TD(λ) we call it: true online TD(λ)

35 True online TD(λ) (textbook excerpt, Chapter 12, Eligibility Traces)
... for the linear case in which $\hat{v}(s, \theta) = \theta^\top \phi(s)$, we arrive at the true online TD(λ) algorithm:
$\theta_{t+1} \doteq \theta_t + \alpha \delta_t e_t + \alpha \left( \theta_t^\top \phi_t - \theta_{t-1}^\top \phi_t \right) (e_t - \phi_t)$,
where we have used the shorthand $\phi_t \doteq \phi(S_t)$, $\delta_t$ is defined as in TD(λ), and $e_t$ is the dutch trace, defined by
$e_t \doteq \gamma\lambda e_{t-1} + \left( 1 - \alpha\gamma\lambda\, e_{t-1}^\top \phi_t \right) \phi_t$.
This algorithm has been proven to produce exactly the same sequence of weight vectors, $\theta_t$, $0 \le t \le T$, as the on-line λ-return algorithm (van Seijen ...). Thus the results on the random walk task on the left of Figure 12.7 are also its results on that task. Now, however, the algorithm is much less expensive. The memory requirements of true online TD(λ) are identical to those of conventional TD(λ), while the per-step computation is increased by about 50% (there is one more ...
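As a rough illustration of the two equations above, a single true online TD(λ) step for linear function approximation could be written as follows. This is a sketch in plain Python lists, not an optimized implementation; the function and argument names are assumptions, with `e_prev` standing for $e_{t-1}$ and `theta_prev` for $\theta_{t-1}$.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def true_online_td_step(theta, theta_prev, e_prev, phi, phi_next, reward,
                        alpha, gamma, lam):
    """One step of linear true online TD(lambda); returns (theta_next, e_t)."""
    # Dutch trace: e_t = gamma*lam*e_{t-1} + (1 - alpha*gamma*lam * e_{t-1}.phi_t) * phi_t
    coeff = 1.0 - alpha * gamma * lam * dot(e_prev, phi)
    e = [gamma * lam * ei + coeff * pi for ei, pi in zip(e_prev, phi)]
    # Usual TD error computed with the current weights
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    # Extra correction term: (theta_t.phi_t - theta_{t-1}.phi_t) * (e_t - phi_t)
    correction = dot(theta, phi) - dot(theta_prev, phi)
    theta_next = [th + alpha * delta * ei + alpha * correction * (ei - pi)
                  for th, ei, pi in zip(theta, e, phi)]
    return theta_next, e


# First step of an episode with one-hot features (hypothetical numbers)
theta1, e1 = true_online_td_step(
    theta=[0.0, 0.0], theta_prev=[0.0, 0.0], e_prev=[0.0, 0.0],
    phi=[1.0, 0.0], phi_next=[0.0, 1.0], reward=1.0,
    alpha=0.5, gamma=1.0, lam=0.9)

# With lam = 0 the trace equals phi, so the correction term vanishes (plain TD(0))
theta2, e2 = true_online_td_step(
    theta=[1.0, 2.0], theta_prev=[0.5, 1.0], e_prev=[0.3, 0.7],
    phi=[1.0, 0.0], phi_next=[0.0, 1.0], reward=0.0,
    alpha=0.1, gamma=0.9, lam=0.0)
```

The only state carried between steps beyond ordinary TD(λ) is the previous weight vector, which is consistent with the excerpt's remark that memory requirements stay essentially the same.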

36 Accumulating, Dutch, and Replacing Traces
All traces fade the same, but increment differently!
[Figure: trace values plotted against the times of state visits. Panels: accumulating traces; dutch traces (α = 0.5); replacing traces.]

37 Problems with accumulating traces
[Figure: a chain of states, each with a "wrong" and a "right" action, ending in a reward of +1]
Consider the above MDP: learning a policy (and q)
Long sequence of states, so we would like to use long traces
Imagine that in the first episode, in some state s, we took the wrong action several times and then the right action
- the right action is more recent, but e corresponding to (s,wrong) > e corresponding to (s,right)
- at termination Q(s,wrong) > Q(s,right)
- which further compounds the problem
Not a problem for replacing traces and Sarsa

38 Chain
[Figure: RMS error over the first 10 episodes, one curve per λ between 0 and 1. Panels: on-line TD(λ), replacing traces; on-line TD(λ), dutch traces.]

39 Comparison with λ-return algorithm
[Figure: RMS error over the first 10 episodes, one curve per λ between 0 and 1. Panels: on-line TD(λ), replacing traces; on-line TD(λ), dutch traces; on-line λ-return algorithm; off-line λ-return algorithm (off-line TD(λ), accumulating traces).]

40 Comparison of all three traces (online)
[Figure: RMS error over the first 10 episodes, one curve per λ between 0 and 1. Panels: on-line TD(λ), accumulating traces; on-line TD(λ), replacing traces; off-line TD(λ), accumulating traces (off-line λ-return algorithm); on-line TD(λ), dutch traces.]

41 The winner is?
Dutch traces are the closest match to the λ-return algorithm under online updating
Replacing traces get lower error and do much better with larger λ values
Replacing traces don't extend in a general way to function approximation
You can create small MDPs where λ makes no difference for TD(λ) with replacing traces
You can create small MDPs where large λ and large α cause TD(λ) with accumulating traces to blow up!

42 TD(λ) + Generalized policy iteration + FA
As usual, we combine policy evaluation for q functions with a policy improvement step
TD(λ) handles the policy evaluation step
And we form the policy to be ϵ-greedy with respect to the current q estimate
The eligibility traces are now defined over (s,a) pairs

43 Linear semi-gradient Sarsa(λ) with binary features and ϵ-greedy policy
Let θ and e be vectors with one component for each possible feature
Let F_a, for every possible action a, be a set of feature indices, initially empty
Initialize θ as appropriate for the problem, e.g., θ = 0
Repeat (for each episode):
    e = 0
    S, A ← initial state and action of episode (e.g., ϵ-greedy)
    F_A ← set of features present in S, A
    Repeat (for each step of episode):
        For all i ∈ F_A:
            e_i ← e_i + 1 (accumulating traces)
            or e_i ← 1 (replacing traces)
        Take action A, observe reward, R, and next state, S'
        δ ← R − Σ_{i∈F_A} θ_i
        If S' is terminal, then θ ← θ + αδe; go to next episode
        For all a ∈ A(S'):
            F_a ← set of features present in S', a
            Q_a ← Σ_{i∈F_a} θ_i
        A' ← new action in S' (e.g., ϵ-greedy)
        δ ← δ + γQ_{A'}
        θ ← θ + αδe
        e ← γλe
        S ← S'
        A ← A'
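The pseudocode above translates fairly directly into Python. The sketch below is one possible rendering, not a reference implementation: the callables `features(s, a)` (returning the active feature indices) and `step(s, a)` (returning reward, next state and a termination flag), as well as the tiny chain environment used to exercise it, are assumptions made for the example.

```python
import random


def sarsa_lambda_episode(theta, start_state, actions, features, step,
                         alpha, gamma, lam, epsilon, replacing=True, rng=random):
    """Run one episode of linear Sarsa(lambda) with binary features.

    theta is a list of weights, updated in place and returned.
    """
    def q(s, a):
        return sum(theta[i] for i in features(s, a))

    def choose(s):  # epsilon-greedy; ties go to the first action listed
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: q(s, a))

    e = [0.0] * len(theta)
    s, a = start_state, choose(start_state)
    while True:
        active = features(s, a)
        for i in active:
            e[i] = 1.0 if replacing else e[i] + 1.0
        reward, s_next, done = step(s, a)
        delta = reward - sum(theta[i] for i in active)
        if done:
            for i in range(len(theta)):
                theta[i] += alpha * delta * e[i]
            return theta
        a_next = choose(s_next)
        delta += gamma * q(s_next, a_next)
        for i in range(len(theta)):
            theta[i] += alpha * delta * e[i]
            e[i] *= gamma * lam          # decay after the weight update
        s, a = s_next, a_next


# Hypothetical 3-state chain: "right" moves forward, state 2 is terminal with reward +1
ACTIONS = ["right", "left"]

def chain_features(s, a):
    return {2 * s + ACTIONS.index(a)}    # one binary feature per (state, action) pair

def chain_step(s, a):
    s_next = s + 1 if a == "right" else max(s - 1, 0)
    return (1.0, s_next, True) if s_next == 2 else (0.0, s_next, False)

weights = sarsa_lambda_episode([0.0] * 4, 0, ACTIONS, chain_features, chain_step,
                               alpha=0.5, gamma=1.0, lam=0.9, epsilon=0.0)
```

With ϵ = 0 the run is deterministic: the agent moves right twice, and the final reward updates both visited state-action pairs in a single episode, the earlier one scaled by γλ.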

44 Mountain Car
Sarsa + tile coding on Mountain Car
[Figure: the Mountain Car task with its goal, and steps per episode (averaged over the first trials and 30 runs) for several values of λ. Panels: replacing traces; accumulating traces.]

45 Mountain Car with Radial Basis Functions
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

46 Should we bootstrap?
In all cases lower is better
Red points are the cases of no bootstrapping
[Figure: performance as a function of λ, ranging from pure bootstrapping to no bootstrapping]

47 Multi-step updating vs one-step updates with Sarsa
[Figure panels: path taken; action values increased by one-step Sarsa; action values increased by Sarsa(λ) with λ=0.9]

48 Efficient trace implementations
Methods with traces are more complex than one-step methods!
- on every step we update e and q for every (s,a)
- not a problem on a GPU or vector processor
Most of the time, many $e_t$ are nearly zero
- decay is exponential in γλ
Could keep track of the traces with $e_t > \eta$ and only decay those and update the corresponding q(s,a) values
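One way to realize the thresholding idea is to store only the non-negligible traces in a dictionary. The sketch below is an assumption about how this could be organized, not the lecturer's implementation; `decay_and_prune`, `sparse_update` and the threshold value are hypothetical.

```python
def decay_and_prune(traces, gamma, lam, threshold):
    """Decay the stored traces and drop any that fall to or below the threshold (eta)."""
    decayed = {}
    for key, e in traces.items():
        e_new = gamma * lam * e
        if e_new > threshold:
            decayed[key] = e_new
    return decayed


def sparse_update(q, traces, alpha, delta):
    """Apply q(s,a) += alpha * delta * e(s,a) only for the stored traces."""
    for key, e in traces.items():
        q[key] = q.get(key, 0.0) + alpha * delta * e
    return q


# Hypothetical traces for two state-action pairs
traces = {("s0", "right"): 1.0, ("s1", "right"): 0.01}
q = sparse_update({}, traces, alpha=0.5, delta=2.0)
traces = decay_and_prune(traces, gamma=1.0, lam=0.5, threshold=0.01)
```

The per-step cost then scales with the number of recently visited pairs rather than with the size of the whole table, at the price of ignoring very small contributions.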

49 Other GPI TD algorithms that use traces
Watkins's Q(λ) = Q-learning + eligibility traces
- $e_t = 0$ whenever the behavior policy μ takes an action different from π
In practice this means Watkins's Q(λ) is hardly more efficient than regular Q-learning
- if the behavior policy explores a lot
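The trace-cutting rule can be illustrated in isolation. This sketch covers only the trace handling described above, not a full Q(λ) agent; the function name and the `took_greedy_action` flag are assumptions made for the example.

```python
def watkins_trace_step(traces, gamma, lam, took_greedy_action):
    """Decay traces after a greedy action; cut them all to zero after an exploratory one."""
    if not took_greedy_action:
        return {key: 0.0 for key in traces}      # behavior deviated from the greedy policy
    return {key: gamma * lam * e for key, e in traces.items()}


traces = {("s0", "a0"): 1.0, ("s1", "a1"): 0.5}
after_greedy = watkins_trace_step(traces, gamma=0.9, lam=1.0, took_greedy_action=True)
after_explore = watkins_trace_step(traces, gamma=0.9, lam=1.0, took_greedy_action=False)
```

If exploration is frequent, the traces are reset so often that few multi-step updates survive, which is why the slide notes that the method then gains little over one-step Q-learning.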

50 Variable λ
λ need not be fixed to a constant
We could make it a function of state
- $\lambda_t = \lambda(S_t)$
- different from a function of time
If we had some way to compute our confidence in $V(S_t)$ we could:
- make $\lambda_t$ large in states where our confidence is low: bootstrap less, trust V less, use more of the returns (MC)
- make $\lambda_t$ near zero in states where confidence is high: full bootstrapping, like a TD(0) update

51 Summary
Eligibility traces provide an efficient, incremental way to combine Monte Carlo (MC) and temporal-difference (TD) learning methods
- Include advantages of MC (can deal with lack of the Markov property)
- Include advantages of TD (using TD error, bootstrapping)
Can significantly speed learning, with a small cost in computation (roughly ×2)
Extend to on-policy control: Sarsa(λ)
Three varieties: accumulating, replacing, and dutch
True online TD(λ) is new and best
- It is exactly equivalent to the online λ-return algorithm

52 Unified View
[Figure: the space of methods laid out by width of backup and height (depth) of backup, with temporal-difference learning, dynamic programming, Monte Carlo, and exhaustive search at the corners]


Temporal difference learning AI & Agents for IET Lecturer: S Luz http://www.scss.tcd.ie/~luzs/t/cs7032/ February 4, 2014 Recall background & assumptions Environment is a finite MDP (i.e. A and S are finite).