RL Lecture 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1
2 N-step TD Prediction
Idea: Look farther into the future when you do the TD backup (1, 2, 3, …, n steps)
3 Mathematics of N-step TD Prediction
Monte Carlo: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … + γ^{T−t−1} r_T
TD: R_t^{(1)} = r_{t+1} + γ V_t(s_{t+1})   (use V to estimate the remaining return)
2-step return: R_t^{(2)} = r_{t+1} + γ r_{t+2} + γ² V_t(s_{t+2})
n-step return: R_t^{(n)} = r_{t+1} + γ r_{t+2} + … + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})
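The n-step return above can be sketched in a few lines of Python. This is an illustrative helper, not code from the slides: it assumes `rewards[k]` holds r_{t+k+1} and `values[k]` holds V(s_{t+k+1}), with the terminal state's value (the last entry) equal to 0, so asking for n past the end of the episode falls back to the full Monte Carlo return.

```python
def n_step_return(rewards, values, n, gamma):
    # R_t^{(n)} = r_{t+1} + γ r_{t+2} + ... + γ^{n-1} r_{t+n} + γ^n V(s_{t+n}).
    # rewards[k] = r_{t+k+1}; values[k] = V(s_{t+k+1}); values[-1] = 0 (terminal).
    n = min(n, len(rewards))                       # past the end: MC return
    ret = sum(gamma ** k * rewards[k] for k in range(n))
    return ret + gamma ** n * values[n - 1]
```

For example, with rewards [1, 2, 3], values [0.5, 0.25, 0.0] and γ = 0.5, the 1-step return is 1.25 and the full (3-step, MC) return is 2.75.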
4 Learning with N-step Backups
Backup (on-line or off-line): ΔV_t(s_t) = α [ R_t^{(n)} − V_t(s_t) ]
Error reduction property of n-step returns:
max_s | E_π{ R_t^{(n)} | s_t = s } − V^π(s) | ≤ γ^n max_s | V(s) − V^π(s) |
(maximum error using the n-step return ≤ γ^n × maximum error using V)
Using this, you can show that n-step methods converge
5 Random Walk Examples
How does 2-step TD work here? How about 3-step TD?
6 A Larger Example
Task: 19-state random walk
Do you think there is an optimal n (for everything)?
7 Averaging N-step Returns
n-step methods were introduced to help with understanding TD(λ)
Idea: back up an average of several returns, e.g. half of the 2-step and half of the 4-step return:
R_t^{avg} = ½ R_t^{(2)} + ½ R_t^{(4)}
This is one backup, called a complex backup: draw each component and label it with the weight for that component
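A minimal sketch of this complex backup in Python. `n_step_return` is an illustrative helper (not from the slides) that assumes `rewards[k]` = r_{t+k+1} and `values[k]` = V(s_{t+k+1}) with a terminal value of 0; the weights (½, ½) are the slide's example, and any weights summing to 1 give a valid complex backup.

```python
def n_step_return(rewards, values, n, gamma):
    # rewards[k] = r_{t+k+1}; values[k] = V(s_{t+k+1}); values[-1] = 0 (terminal)
    n = min(n, len(rewards))
    return sum(gamma ** k * rewards[k] for k in range(n)) + gamma ** n * values[n - 1]

def averaged_return(rewards, values, gamma):
    # R_avg = 1/2 R^(2) + 1/2 R^(4), the slide's example of a complex backup
    return 0.5 * n_step_return(rewards, values, 2, gamma) \
         + 0.5 * n_step_return(rewards, values, 4, gamma)
```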
8 Forward View of TD(λ)
TD(λ) is a method for averaging all n-step backups, weighting the n-step backup by λ^{n−1}
λ-return: R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^{(n)}
Backup using the λ-return: ΔV_t(s_t) = α [ R_t^λ − V_t(s_t) ]
9 λ-return Weighting Function
10 Relation to TD(0) and MC
The λ-return can be rewritten as:
R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t
(the sum covers the returns until termination; the last term covers the return after termination)
If λ = 1, you get MC: R_t^λ = R_t
If λ = 0, you get TD(0): R_t^λ = R_t^{(1)}
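The truncated form above is straightforward to compute directly, which also makes the two limits easy to check. A minimal sketch, under the same illustrative conventions as before: `rewards[k]` = r_{t+k+1}, `values[k]` = V(s_{t+k+1}), terminal value 0.

```python
def lambda_return(rewards, values, gamma, lam):
    # R^λ = (1−λ) Σ_{n=1}^{T−t−1} λ^{n−1} R^{(n)} + λ^{T−t−1} R_t
    T = len(rewards)                    # steps remaining until termination

    def n_step(n):                      # R_t^{(n)}, with values[-1] = 0
        return sum(gamma ** k * rewards[k] for k in range(n)) \
               + gamma ** n * values[n - 1]

    head = (1 - lam) * sum(lam ** (n - 1) * n_step(n) for n in range(1, T))
    return head + lam ** (T - 1) * n_step(T)
```

With λ = 0 this returns the one-step return (TD(0)); with λ = 1 it returns the full return (MC), as the slide states.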
11 Forward View of TD(λ) II
Look forward from each state to determine its update from future states and rewards:
12 λ-return on the Random Walk
Same 19-state random walk as before
Why do you think intermediate values of λ are best?
13 Backward View of TD(λ)
The forward view was for theory; the backward view is for mechanism
New variable called the eligibility trace, e_t(s)
On each step, decay all traces by γλ and increment the trace for the current state by 1 (accumulating trace):
e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1    if s = s_t
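One step of the accumulating-trace update can be sketched directly from the two cases above (the dict-of-floats representation is an illustrative choice, not from the slides):

```python
def accumulate_trace(e, s_t, gamma, lam):
    # Decay every trace by γλ, then add 1 to the current state's trace.
    for s in e:
        e[s] *= gamma * lam
    e[s_t] += 1.0
    return e
```

Note that a state visited twice in quick succession ends up with a trace greater than 1, which is what motivates replacing traces later in the lecture.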
14 On-line Tabular TD(λ)
Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
  Initialize s
  Repeat (for each step of episode):
    a ← action given by π for s
    Take action a, observe reward r and next state s′
    δ ← r + γ V(s′) − V(s)
    e(s) ← e(s) + 1
    For all s:
      V(s) ← V(s) + α δ e(s)
      e(s) ← γλ e(s)
    s ← s′
  Until s is terminal
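The loop above translates almost line for line into Python. A minimal sketch: the 5-state random walk environment, the equiprobable policy, and all parameter values here are illustrative choices, not from the slide (nonterminal states 1..5, terminals 0 and 6, reward +1 only on exiting right, V(terminal) = 0).

```python
import random

def td_lambda(episodes=500, alpha=0.1, gamma=1.0, lam=0.8, seed=0):
    # On-line tabular TD(λ) with accumulating traces, following the slide's loop.
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(7)}
    for _ in range(episodes):
        e = {s: 0.0 for s in range(7)}      # traces reset at episode start
        s = 3
        while s not in (0, 6):
            s2 = s + rng.choice((-1, 1))    # a <- action given by pi for s
            r = 1.0 if s2 == 6 else 0.0
            delta = r + gamma * V[s2] - V[s]
            e[s] += 1.0                     # accumulating trace
            for x in range(1, 6):           # for all nonterminal states
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam
            s = s2
    return V
```

Run with these defaults, the estimates approach the true values s/6 for states 1..5 (0 at the left terminal, 1 at the right).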
15 Backward View
δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)
Shout δ_t backwards over time
The strength of your voice decreases with temporal distance by γλ
16 Relation of Backwards View to MC & TD(0)
Using the update rule ΔV_t(s) = α δ_t e_t(s):
As before, if you set λ to 0, you get TD(0)
If you set λ to 1, you get MC, but in a better way:
  Can apply TD(1) to continuing tasks
  Works incrementally and on-line (instead of waiting until the end of the episode)
17 Forward View = Backward View
The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
The book shows:
Σ_{t=0}^{T−1} ΔV_t^{TD}(s) = Σ_{t=0}^{T−1} α I_{s s_t} Σ_{k=t}^{T−1} (γλ)^{k−t} δ_k = Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{s s_t}
(backward updates on the left, forward updates on the right; the algebra is shown in the book)
On-line updating with small α is similar
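The equivalence can be checked numerically on a single recorded episode. A sketch under illustrative assumptions: `states[t]` is the nonterminal state visited at time t, `rewards[t]` is r_{t+1}, V is held fixed for the whole episode (off-line updating), and V(terminal) = 0. The backward side accumulates trace updates; the forward side accumulates λ-return updates computed independently from the n-step returns.

```python
def forward_and_backward_updates(states, rewards, V, alpha, gamma, lam):
    T = len(rewards)
    v = lambda t: V[states[t]] if t < T else 0.0     # V(terminal) = 0
    delta = [rewards[t] + gamma * v(t + 1) - v(t) for t in range(T)]

    # Backward view: accumulating traces, updates summed but not applied.
    backward = {s: 0.0 for s in V}
    e = {s: 0.0 for s in V}
    for t in range(T):
        e[states[t]] += 1.0
        for s in V:
            backward[s] += alpha * delta[t] * e[s]
            e[s] *= gamma * lam

    # Forward view: λ-return computed from the n-step returns themselves.
    def n_step(t, n):
        g = sum(gamma ** k * rewards[t + k] for k in range(n))
        return g + gamma ** n * v(t + n)

    forward = {s: 0.0 for s in V}
    for t in range(T):
        rem = T - t
        lam_ret = (1 - lam) * sum(lam ** (n - 1) * n_step(t, n)
                                  for n in range(1, rem)) \
                  + lam ** (rem - 1) * n_step(t, rem)
        forward[states[t]] += alpha * (lam_ret - V[states[t]])
    return backward, forward
```

For off-line updating with accumulating traces the two totals agree exactly, even when states are revisited within the episode.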
18 On-line versus Off-line on Random Walk
Same 19-state random walk
On-line performs better over a broader range of parameters
19 Control: Sarsa(λ)
Save eligibility traces for state–action pairs instead of just states:
e_t(s,a) = γλ e_{t−1}(s,a) + 1   if s = s_t and a = a_t
e_t(s,a) = γλ e_{t−1}(s,a)       otherwise
Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)
δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)
20 Sarsa(λ) Algorithm
Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
  Initialize s, a
  Repeat (for each step of episode):
    Take action a, observe r, s′
    Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
    δ ← r + γ Q(s′, a′) − Q(s, a)
    e(s,a) ← e(s,a) + 1
    For all s, a:
      Q(s,a) ← Q(s,a) + α δ e(s,a)
      e(s,a) ← γλ e(s,a)
    s ← s′; a ← a′
  Until s is terminal
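A minimal Python sketch of the Sarsa(λ) loop. The environment (a 5-state corridor: states 0..4, actions −1/+1, the left wall reflects, reward +1 on reaching the terminal state 4) and all parameter values are illustrative, not the slide's gridworld.

```python
import random

def sarsa_lambda(episodes=300, alpha=0.2, gamma=0.9, lam=0.8, eps=0.1, seed=1):
    # Tabular Sarsa(λ) with accumulating traces, following the slide's loop.
    rng = random.Random(seed)
    actions = (-1, +1)
    Q = {(s, a): 0.0 for s in range(5) for a in actions}

    def eps_greedy(s):                      # ties broken at random
        if rng.random() < eps:
            return rng.choice(actions)
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    for _ in range(episodes):
        e = {sa: 0.0 for sa in Q}
        s, a = 0, eps_greedy(0)
        while True:
            s2 = max(0, min(4, s + a))
            r = 1.0 if s2 == 4 else 0.0
            if s2 == 4:                     # terminal: no bootstrap term
                delta, a2 = r - Q[(s, a)], None
            else:
                a2 = eps_greedy(s2)
                delta = r + gamma * Q[(s2, a2)] - Q[(s, a)]
            e[(s, a)] += 1.0
            for sa in Q:
                Q[sa] += alpha * delta * e[sa]
                e[sa] *= gamma * lam
            if s2 == 4:
                break
            s, a = s2, a2
    return Q
```

After training, the learned action values prefer moving right toward the goal, illustrating how one rewarded trial propagates credit back along the trace.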
21 Sarsa(λ) Gridworld Example
With one trial, the agent has much more information about how to get to the goal (though not necessarily the best way)
Can considerably accelerate learning
22 Three Approaches to Q(λ)
How can we extend this to Q-learning?
If you mark every state–action pair as eligible, you back up over a non-greedy policy
Watkins: zero out the eligibility trace after a non-greedy action; do the max when backing up at the first non-greedy choice:
e_t(s,a) = γλ e_{t−1}(s,a) + 1   if s = s_t, a = a_t, and Q_{t−1}(s_t,a_t) = max_a Q_{t−1}(s_t,a)
e_t(s,a) = 0                      if Q_{t−1}(s_t,a_t) ≠ max_a Q_{t−1}(s_t,a)
e_t(s,a) = γλ e_{t−1}(s,a)        otherwise
Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)
δ_t = r_{t+1} + γ max_{a′} Q_t(s_{t+1}, a′) − Q_t(s_t, a_t)
23 Watkins's Q(λ)
Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
  Initialize s, a
  Repeat (for each step of episode):
    Take action a, observe r, s′
    Choose a′ from s′ using policy derived from Q (e.g. ε-greedy)
    a* ← argmax_b Q(s′, b)   (if a′ ties for the max, then a* ← a′)
    δ ← r + γ Q(s′, a*) − Q(s, a)
    e(s,a) ← e(s,a) + 1
    For all s, a:
      Q(s,a) ← Q(s,a) + α δ e(s,a)
      If a′ = a*, then e(s,a) ← γλ e(s,a)
      else e(s,a) ← 0
    s ← s′; a ← a′
  Until s is terminal
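A sketch of Watkins's Q(λ) in Python, on an illustrative 5-state corridor (states 0..4, actions −1/+1, left wall reflects, reward +1 on reaching terminal state 4; environment and parameters are not from the slide). Here the trace cut is applied at the start of the step that takes a non-greedy action, which is equivalent to the slide's cut at the end of the previous step.

```python
import random

def watkins_q_lambda(episodes=300, alpha=0.2, gamma=0.9, lam=0.8, eps=0.1, seed=2):
    rng = random.Random(seed)
    actions = (-1, +1)
    Q = {(s, a): 0.0 for s in range(5) for a in actions}
    for _ in range(episodes):
        e = {sa: 0.0 for sa in Q}
        s = 0
        while s != 4:
            best = max(Q[(s, b)] for b in actions)
            if rng.random() < eps:
                a = rng.choice(actions)               # exploratory
            else:
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            if Q[(s, a)] != best:
                e = {sa: 0.0 for sa in Q}             # cut traces: non-greedy action
            s2 = max(0, min(4, s + a))
            r = 1.0 if s2 == 4 else 0.0
            best_next = 0.0 if s2 == 4 else max(Q[(s2, b)] for b in actions)
            delta = r + gamma * best_next - Q[(s, a)]  # back up toward the max
            e[(s, a)] += 1.0
            for sa in Q:
                Q[sa] += alpha * delta * e[sa]
                e[sa] *= gamma * lam
            s = s2
    return Q
```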
24 Peng's Q(λ)
Disadvantage of Watkins's method: early in learning, the eligibility trace will be cut (zeroed out) frequently, resulting in little advantage from traces
Peng: back up the max action except at the end; never cut traces
Disadvantage: complicated to implement
25 Naïve Q(λ)
Idea: is it really a problem to back up exploratory actions?
Never zero traces; always back up the max at the current action (unlike Peng's or Watkins's)
Is this truly naïve?
Works well in preliminary empirical studies
What is the backup diagram?
26 Comparison Task
Compared Watkins's, Peng's, and Naïve (called McGovern's here) Q(λ) on several tasks
See McGovern and Sutton (1997), "Towards a Better Q(λ)," for other tasks and results (stochastic tasks, continuing tasks, etc.)
Deterministic gridworld with obstacles: 10×10 gridworld, 25 randomly generated obstacles, 30 runs
α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces
From McGovern and Sutton (1997), "Towards a Better Q(λ)"
27 Comparison Results
From McGovern and Sutton (1997), "Towards a Better Q(λ)"
28 Convergence of the Q(λ)'s
None of the methods is proven to converge (much extra credit if you can prove any of them)
Watkins's is thought to converge to Q*
Peng's is thought to converge to a mixture of Q^π and Q*
Naïve: Q*?
29 Eligibility Traces for Actor-Critic Methods
Critic: on-policy learning of V^π; use TD(λ) as described before
Actor: needs eligibility traces for each state–action pair
We change the update equation
p_{t+1}(s,a) = p_t(s,a) + α δ_t   if a = a_t and s = s_t;   p_t(s,a) otherwise
to
p_{t+1}(s,a) = p_t(s,a) + α δ_t e_t(s,a)
We can change the other actor-critic update
p_{t+1}(s,a) = p_t(s,a) + α δ_t [1 − π_t(s,a)]   if a = a_t and s = s_t;   p_t(s,a) otherwise
to
p_{t+1}(s,a) = p_t(s,a) + α δ_t e_t(s,a), where
e_t(s,a) = γλ e_{t−1}(s,a) + 1 − π_t(s_t,a_t)   if s = s_t and a = a_t;   γλ e_{t−1}(s,a) otherwise
30 Replacing Traces
Using accumulating traces, frequently visited states can have eligibilities greater than 1
This can be a problem for convergence
Replacing traces: instead of adding 1 when you visit a state, set that trace to 1:
e_t(s) = γλ e_{t−1}(s)   if s ≠ s_t
e_t(s) = 1               if s = s_t
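The one-line difference from the accumulating trace is the assignment instead of the increment. A minimal sketch (the dict representation is an illustrative choice):

```python
def replacing_trace(e, s_t, gamma, lam):
    # Decay the other traces by γλ, but *set* the current state's trace
    # to 1 rather than adding 1, so no trace ever exceeds 1.
    for s in e:
        e[s] *= gamma * lam
    e[s_t] = 1.0
    return e
```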
31 Replacing Traces Example
Same 19-state random walk task as before
Replacing traces perform better than accumulating traces over more values of λ
32 Why Replacing Traces?
Replacing traces can significantly speed learning
They can make the system perform well for a broader set of parameters
Accumulating traces can do poorly on certain types of tasks
Why is this task particularly onerous for accumulating traces?
33 More Replacing Traces
Off-line TD(1) with replacing traces is identical to first-visit MC
Extension to action values: when you revisit a state, what should you do with the traces for the other actions?
Singh and Sutton say to set them to zero:
e_t(s,a) = 1                 if s = s_t and a = a_t
e_t(s,a) = 0                 if s = s_t and a ≠ a_t
e_t(s,a) = γλ e_{t−1}(s,a)   if s ≠ s_t
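The three cases translate directly; a minimal sketch over a dict keyed by (state, action) pairs (an illustrative representation):

```python
def replacing_trace_sa(e, s_t, a_t, gamma, lam):
    # On a visit to s_t: set the taken action's trace to 1, zero the traces
    # of the other actions at s_t, and decay everything else by γλ.
    for (s, a) in e:
        if s == s_t:
            e[(s, a)] = 1.0 if a == a_t else 0.0
        else:
            e[(s, a)] *= gamma * lam
    return e
```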
34 Implementation Issues
Could require much more computation
But most eligibility traces are VERY close to zero
If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices)
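The same point carries over to NumPy, to give a Python analogue of the slide's Matlab remark: with the value table and the traces stored as arrays, the backup over all states is just two vectorized statements (the function name and arguments are illustrative).

```python
import numpy as np

def vectorized_backup(V, e, delta, alpha, gamma, lam):
    # V(s) <- V(s) + α δ e(s) and e(s) <- γλ e(s), for all s at once.
    V += alpha * delta * e
    e *= gamma * lam
    return V, e
```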
35 Variable λ
Can generalize to a variable λ_t:
e_t(s) = γλ_t e_{t−1}(s)        if s ≠ s_t
e_t(s) = γλ_t e_{t−1}(s) + 1    if s = s_t
Here λ_t is a function of time; could define λ_t = λ(s_t) or λ_t = λ_τ
36 Conclusions
Eligibility traces provide an efficient, incremental way to combine MC and TD
  Include advantages of MC (can deal with lack of the Markov property)
  Include advantages of TD (use the TD error, bootstrap)
Can significantly speed learning
Do have a cost in computation
37 Something Here is Not Like the Other
More informationPolicy Evaluation Using the Ω-Return
Policy Evaluaion Using he Ω-Reurn Philip S. Thomas Universiy of Massachuses Amhers Carnegie Mellon Universiy Georgios Theocharous Adobe Research Sco Niekum Universiy of Texas a Ausin George Konidaris Duke
More informationMore Digital Logic. t p output. Low-to-high and high-to-low transitions could have different t p. V in (t)
EECS 4 Spring 23 Lecure 2 EECS 4 Spring 23 Lecure 2 More igial Logic Gae delay and signal propagaion Clocked circui elemens (flip-flop) Wriing a word o memory Simplifying digial circuis: Karnaugh maps
More informationKinematics Vocabulary. Kinematics and One Dimensional Motion. Position. Coordinate System in One Dimension. Kinema means movement 8.
Kinemaics Vocabulary Kinemaics and One Dimensional Moion 8.1 WD1 Kinema means movemen Mahemaical descripion of moion Posiion Time Inerval Displacemen Velociy; absolue value: speed Acceleraion Averages
More informationSZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1
SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision
More informationINSTANTANEOUS VELOCITY
INSTANTANEOUS VELOCITY I claim ha ha if acceleraion is consan, hen he elociy is a linear funcion of ime and he posiion a quadraic funcion of ime. We wan o inesigae hose claims, and a he same ime, work
More informationLet us start with a two dimensional case. We consider a vector ( x,
Roaion marices We consider now roaion marices in wo and hree dimensions. We sar wih wo dimensions since wo dimensions are easier han hree o undersand, and one dimension is a lile oo simple. However, our
More informationLecture 2 October ε-approximation of 2-player zero-sum games
Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion
More informationState-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter
Sae-Space Models Iniializaion, Esimaion and Smoohing of he Kalman Filer Iniializaion of he Kalman Filer The Kalman filer shows how o updae pas predicors and he corresponding predicion error variances when
More informationProbabilistic Robotics
Probabilisic Roboics Bayes Filer Implemenaions Gaussian filers Bayes Filer Reminder Predicion bel p u bel d Correcion bel η p z bel Gaussians : ~ π e p N p - Univariae / / : ~ μ μ μ e p Ν p d π Mulivariae
More informationReinforcement Learning: A Tutorial. Scope of Tutorial. 1 Introduction
Reinforcemen Learning: A Tuorial Mance E. Harmon WL/AACF 224 Avionics Circle Wrigh Laboraory Wrigh-Paerson AFB, OH 45433 mharmon@acm.org Sephanie S. Harmon Wrigh Sae Universiy 56-8 Mallard Glen Drive Cenerville,
More informationFrom Complex Fourier Series to Fourier Transforms
Topic From Complex Fourier Series o Fourier Transforms. Inroducion In he previous lecure you saw ha complex Fourier Series and is coeciens were dened by as f ( = n= C ne in! where C n = T T = T = f (e
More information4.2 The Fourier Transform
4.2. THE FOURIER TRANSFORM 57 4.2 The Fourier Transform 4.2.1 Inroducion One way o look a Fourier series is ha i is a ransformaion from he ime domain o he frequency domain. Given a signal f (), finding
More information15-889e Policy Search: Gradient Methods Emma Brunskill. All slides from David Silver (with EB adding minor modificafons), unless otherwise noted
15-889e Policy Search: Gradient Methods Emma Brunskill All slides from David Silver (with EB adding minor modificafons), unless otherwise noted Outline 1 Introduction 2 Finite Difference Policy Gradient
More informationa 10.0 (m/s 2 ) 5.0 Name: Date: 1. The graph below describes the motion of a fly that starts out going right V(m/s)
Name: Dae: Kinemaics Review (Honors. Physics) Complee he following on a separae shee of paper o be urned in on he day of he es. ALL WORK MUST BE SHOWN TO RECEIVE CREDIT. 1. The graph below describes he
More informationEchocardiography Project and Finite Fourier Series
Echocardiography Projec and Finie Fourier Series 1 U M An echocardiagram is a plo of how a porion of he hear moves as he funcion of ime over he one or more hearbea cycles If he hearbea repeas iself every
More informationFinish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details!
MAT 257, Handou 6: Ocober 7-2, 20. I. Assignmen. Finish reading Chaper 2 of Spiva, rereading earlier secions as necessary. handou and fill in some missing deails! II. Higher derivaives. Also, read his
More informationHamilton- J acobi Equation: Explicit Formulas In this lecture we try to apply the method of characteristics to the Hamilton-Jacobi equation: u t
M ah 5 2 7 Fall 2 0 0 9 L ecure 1 0 O c. 7, 2 0 0 9 Hamilon- J acobi Equaion: Explici Formulas In his lecure we ry o apply he mehod of characerisics o he Hamilon-Jacobi equaion: u + H D u, x = 0 in R n
More informationRandom Walk with Anti-Correlated Steps
Random Walk wih Ani-Correlaed Seps John Noga Dirk Wagner 2 Absrac We conjecure he expeced value of random walks wih ani-correlaed seps o be exacly. We suppor his conjecure wih 2 plausibiliy argumens and
More informationA Reinforcement Learning Approach for Collaborative Filtering
A Reinforcemen Learning Approach for Collaboraive Filering Jungkyu Lee, Byonghwa Oh 2, Jihoon Yang 2, and Sungyong Park 2 Cyram Inc, Seoul, Korea jklee@cyram.com 2 Sogang Universiy, Seoul, Korea {mrfive,yangjh,parksy}@sogang.ac.kr
More informationKINEMATICS IN ONE DIMENSION
KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec
More information