What we learned last time


1 What we learned last time
Value-function approximation by stochastic gradient descent enables RL to be applied to arbitrarily large state spaces
Most algorithms just carry over: targets from the tabular case
With bootstrapping (TD), we don't get true gradient-descent methods, but the linear, on-policy case is still guaranteed convergent, and learning is faster with n-step methods (n > 1), as before
For continuous state spaces, coarse/tile coding is a good strategy

2 Chapter 10: On-policy Control with Approximation

3 Value function approximation (VFA) replaces the table with a general parameterized form: the state $S_t$ and action $A_t$ are mapped to an estimate $\hat{q}(S_t, A_t, \boldsymbol{\theta})$, which is updated toward a target $U_t$

4 On-policy Control with Approximation
(Semi-)gradient methods carry over to control in the usual way
Mountain Car example
n-step methods carry over too, with the usual tradeoffs
A new average-reward setting, with differential value functions and differential algorithms
Queuing example (tabular)
The discounting setting is deprecated

5 (Semi-)gradient methods carry over to control in the usual on-policy GPI way
Always learn the action-value function of the current policy
Always act near-greedily wrt the current action-value estimates
The learning rule is the same as in Chapter 9:
$$\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t + \alpha \big[ U_t - \hat{q}(S_t, A_t, \boldsymbol{\theta}_t) \big] \nabla \hat{q}(S_t, A_t, \boldsymbol{\theta}_t)$$
where $U_t$ is an update target, e.g.,
$U_t = G_t$  (MC)
$U_t = R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \boldsymbol{\theta}_t)$  (Sarsa)
$U_t = R_{t+1} + \gamma \sum_{a'} \pi(a' \mid S_{t+1})\, \hat{q}(S_{t+1}, a', \boldsymbol{\theta}_t)$  (Expected Sarsa)
$U_t = \sum_{s',r} p(s', r \mid S_t, A_t) \big[ r + \gamma \sum_{a'} \pi(a' \mid s')\, \hat{q}(s', a', \boldsymbol{\theta}_t) \big]$  (DP)
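To make the shared structure concrete, here is a minimal Python sketch (our own, not from the slides) of the common semi-gradient step with linear function approximation, plus two of the targets; `features` and `pi` are hypothetical placeholders for a feature map and policy probabilities.

```python
import numpy as np

def semi_gradient_step(theta, features, s, a, target, alpha):
    """theta += alpha * (U_t - q_hat) * grad q_hat; with linear
    q_hat(s, a, theta) = theta @ x(s, a), the gradient is just x(s, a)."""
    x = features(s, a)
    return theta + alpha * (target - theta @ x) * x

def sarsa_target(theta, features, r, s_next, a_next, gamma):
    # U_t = R_{t+1} + gamma * q_hat(S_{t+1}, A_{t+1}, theta)
    return r + gamma * (theta @ features(s_next, a_next))

def expected_sarsa_target(theta, features, r, s_next, actions, pi, gamma):
    # U_t = R_{t+1} + gamma * sum_a pi(a | S_{t+1}) * q_hat(S_{t+1}, a, theta)
    return r + gamma * sum(pi(a, s_next) * (theta @ features(s_next, a)) for a in actions)
```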

6 (Semi-)gradient methods carry over to control
$$\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t + \alpha \big[ U_t - \hat{q}(S_t, A_t, \boldsymbol{\theta}_t) \big] \nabla \hat{q}(S_t, A_t, \boldsymbol{\theta}_t)$$
Episodic Semi-gradient Sarsa for Estimating $\hat{q} \approx q_*$
Input: a differentiable function $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^n \to \mathbb{R}$
Initialize value-function weights $\boldsymbol{\theta} \in \mathbb{R}^n$ arbitrarily (e.g., $\boldsymbol{\theta} = \mathbf{0}$)
Repeat (for each episode):
  S, A ← initial state and action of episode (e.g., ε-greedy)
  Repeat (for each step of episode):
    Take action A, observe R, S′
    If S′ is terminal:
      θ ← θ + α [R − q̂(S, A, θ)] ∇q̂(S, A, θ)
      Go to next episode
    Choose A′ as a function of q̂(S′, ·, θ) (e.g., ε-greedy)
    θ ← θ + α [R + γ q̂(S′, A′, θ) − q̂(S, A, θ)] ∇q̂(S, A, θ)
    S ← S′
    A ← A′
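A minimal runnable sketch of this algorithm with linear function approximation follows; the environment interface (`reset`/`step`) and `features` function are assumptions for illustration, not part of the slides.

```python
import numpy as np

def episodic_semi_gradient_sarsa(env, features, n_features, actions,
                                 alpha=0.1, gamma=1.0, epsilon=0.1,
                                 num_episodes=500, rng=np.random.default_rng(0)):
    """Episodic semi-gradient Sarsa with linear q_hat(s, a, theta) = theta @ features(s, a).
    `env` is assumed to expose reset() -> state and step(a) -> (state, reward, done)."""
    theta = np.zeros(n_features)

    def q(s, a):
        return theta @ features(s, a)

    def eps_greedy(s):
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: q(s, a))

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            x = features(s, a)
            if done:
                theta += alpha * (r - theta @ x) * x       # terminal: target is just R
                break
            a_next = eps_greedy(s_next)
            target = r + gamma * q(s_next, a_next)          # Sarsa target
            theta += alpha * (target - theta @ x) * x       # semi-gradient update
            s, a = s_next, a_next
    return theta
```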

7 Example: The Mountain-Car problem
SITUATIONS: the car's position and velocity
ACTIONS: three thrusts: forward, reverse, none
REWARDS: always −1 until the car reaches the goal
"Gravity wins": the car cannot accelerate straight up the slope, so it must first back away from the goal to build momentum
Episodic, no discounting, γ = 1
A Minimum-Time-to-Goal problem
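For reference, here is a minimal sketch of the classic Mountain-Car dynamics as given in the textbook; the `reset`/`step` interface names are our own choice, and the fixed start state is a simplification (the book samples the start position uniformly in [−0.6, −0.4)).

```python
import math

class MountainCar:
    """Classic Mountain-Car dynamics: position in [-1.2, 0.5], velocity in [-0.07, 0.07]."""
    def reset(self):
        self.position, self.velocity = -0.5, 0.0   # simplified fixed start near the valley bottom
        return (self.position, self.velocity)

    def step(self, action):            # action in {-1, 0, +1}: reverse, none, forward
        self.velocity += 0.001 * action - 0.0025 * math.cos(3 * self.position)
        self.velocity = min(max(self.velocity, -0.07), 0.07)
        self.position += self.velocity
        self.position = min(max(self.position, -1.2), 0.5)
        if self.position == -1.2:      # hit the left wall: velocity resets to zero
            self.velocity = 0.0
        done = self.position >= 0.5    # reached the goal at the top of the right hill
        return (self.position, self.velocity), -1.0, done   # reward is always -1
```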

8 Values learned while solving Mountain-Car with tile coding function approximation
[Figure: surfaces of −maxₐ q̂(s, a, θ) over position and velocity at several points during learning (an early step of the first episode and several later episodes); Demo]

9 Learning curves for semi-gradient Sarsa with tile coding (8×8 tilings, tiles3.py)
[Figure: Mountain Car, steps per episode (log scale, averaged over 100 runs) versus episode, for step sizes α = 0.1/8, 0.2/8, and 0.5/8]
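The slide references Sutton's tiles3.py tile-coding software. A minimal sketch of how Mountain-Car state-action features might be built with it, assuming tiles3.py is available locally; the hash-table size and scaling constants follow the usual convention (each state dimension stretched over the number of tilings), but they are our assumptions here.

```python
import numpy as np
from tiles3 import IHT, tiles   # Sutton's tile-coding software, assumed available on the path

NUM_TILINGS = 8
iht = IHT(4096)                 # index hash table; 4096 is a typical size for Mountain Car

def active_tiles(position, velocity, action):
    """Indices of the active tiles for one (state, action) pair; each state
    dimension is scaled so that one tiling spans its full range."""
    return tiles(iht, NUM_TILINGS,
                 [NUM_TILINGS * position / (0.5 + 1.2),
                  NUM_TILINGS * velocity / (0.07 + 0.07)],
                 [action])

def features(state, action, n_features=4096):
    """Binary feature vector with NUM_TILINGS ones (one per tiling)."""
    x = np.zeros(n_features)
    x[active_tiles(state[0], state[1], action)] = 1.0
    return x
```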

10 n-step methods carry over too: n-step semi-gradient Sarsa
The n-step update equation ($t < T$) is
$$\boldsymbol{\theta}_{t+n} \doteq \boldsymbol{\theta}_{t+n-1} + \alpha \big[ G_t^{(n)} - \hat{q}(S_t, A_t, \boldsymbol{\theta}_{t+n-1}) \big] \nabla \hat{q}(S_t, A_t, \boldsymbol{\theta}_{t+n-1}),$$
with $G_t^{(n)} = G_t$ if $t + n \geq T$
As we have seen before, performance is best if an intermediate level of bootstrapping is used, corresponding to an n larger than 1: the algorithm tends to learn faster and obtain a better asymptotic performance at n = 8 than at n = 1 on the Mountain Car task, so n-step semi-gradient Sarsa is better for n > 1
[Figure: effect of α and n on early performance of n-step semi-gradient Sarsa with tile-coding function approximation on the Mountain Car task, measured as steps per episode averaged over the first 50 episodes and 100 runs, plotted against α × number of tilings (8) for n = 1, 2, 4, 8, 16; an intermediate level of bootstrapping (n = 4) performed best, and the main effects are all statistically significant]
Exercise: give pseudocode for semi-gradient one-step Expected Sarsa for control
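A sketch of the single n-step update, written as a standalone Python function; the argument layout (a list of the n rewards plus an optional bootstrap pair) is our own framing of the equation above, not the book's pseudocode.

```python
import numpy as np

def n_step_sarsa_update(theta, features, s_tau, a_tau, rewards, tail, alpha, gamma=1.0):
    """One n-step semi-gradient Sarsa update for the pair (s_tau, a_tau).
    `rewards` holds R_{tau+1}, ..., R_{tau+n}; `tail` is (S_{tau+n}, A_{tau+n}),
    or None if the episode terminated inside the window (then G is just the reward sum)."""
    g = sum(gamma ** i * r for i, r in enumerate(rewards))           # discounted reward sum
    if tail is not None:
        s_n, a_n = tail
        g += gamma ** len(rewards) * (theta @ features(s_n, a_n))    # bootstrap with q_hat
    x = features(s_tau, a_tau)
    return theta + alpha * (g - theta @ x) * x                        # semi-gradient step
```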

11 On-policy Control with Approximation
(Semi-)gradient methods carry over to control in the usual way
Mountain Car example
n-step methods carry over too, with the usual tradeoffs
A new average-reward setting, with differential value functions and differential algorithms
Queuing example (tabular)
The discounting setting is deprecated

12 On-policy Control with Approximation
(Semi-)gradient methods carry over to control in the usual way
Mountain Car example
n-step methods carry over too, with the usual tradeoffs
A new average-reward setting, with differential value functions and differential algorithms
Queuing example (tabular)
The discounting setting is deprecated

13 Average reward: a new problem setting for continuing tasks
A new goal for continuing tasks: maximize the average reward per time step
In the average-reward setting, the quality of a policy $\pi$ is defined as the average rate of reward while following that policy, which we denote $\eta(\pi)$:
$$\eta(\pi) \doteq \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big[R_t \mid A_{0:t-1} \sim \pi\big] = \lim_{t \to \infty} \mathbb{E}\big[R_t \mid A_{0:t-1} \sim \pi\big] = \sum_s d_\pi(s) \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\, r \qquad (15)$$
where the expectations are conditioned on the prior actions $A_0, A_1, \ldots, A_{t-1}$ being taken according to $\pi$, and $d_\pi : \mathcal{S} \to [0, 1]$ is the steady-state distribution under $\pi$, also known as the on-policy distribution: $d_\pi(s) = \lim_{t \to \infty} \Pr\{S_t = s \mid A_{0:t-1} \sim \pi\}$, which is assumed to exist and to be independent of $S_0$
This property is known as ergodicity: wherever the MDP starts, and whatever early decisions the agent makes, they have only a temporary effect; in the long run the expectation of being in a state depends only on the policy and the MDP transition probabilities
Ergodicity is sufficient to guarantee the existence of the limits in the equations above
$\eta(\pi)$ is the average amount of reward received per time step
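A short numerical sketch of the last form of (15): compute the stationary distribution of the on-policy chain and weight the expected one-step rewards by it. The two-state chain below is made up purely for illustration.

```python
import numpy as np

def average_reward(P_pi, r_pi):
    """eta(pi) = sum_s d_pi(s) * r_pi(s) for an ergodic on-policy chain.
    P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a); r_pi[s] = expected one-step reward under pi."""
    evals, evecs = np.linalg.eig(P_pi.T)                 # stationary distribution is the
    d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])  # left eigenvector for eigenvalue 1
    d = d / d.sum()
    return float(d @ r_pi)

# Toy two-state example (illustrative only):
P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])
r_pi = np.array([1.0, 0.0])
print(average_reward(P_pi, r_pi))   # average reward per step under this policy
```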

14 In the average-reward setting, everything is new
The steady-state distribution satisfies
$$\sum_s d_\pi(s) \sum_a \pi(a \mid s, \boldsymbol{\theta})\, p(s' \mid s, a) = d_\pi(s') \qquad (16)$$
Returns: in the average-reward setting, returns are defined in terms of differences between rewards and the average reward:
$$G_t \doteq R_{t+1} - \eta(\pi) + R_{t+2} - \eta(\pi) + R_{t+3} - \eta(\pi) + \cdots \qquad (17)$$
This is known as the differential return, and the corresponding value functions are known as differential value functions; they are defined in the same way and we will use the same notation for them as we have all along: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ and $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$ (similarly for $v_*$ and $q_*$)
Bellman equations: differential value functions also have Bellman equations, just slightly different from those we have seen earlier (cf. Equations 3.14, 4.1, and 4.2); we simply remove all $\gamma$s and replace all rewards by the difference between the reward and the true average reward:
Prediction:
$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{r, s'} p(s', r \mid s, a) \Big[ r - \eta(\pi) + v_\pi(s') \Big]$$
$$q_\pi(s, a) = \sum_{r, s'} p(s', r \mid s, a) \Big[ r - \eta(\pi) + \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \Big]$$
Control:
$$v_*(s) = \max_a \sum_{r, s'} p(s', r \mid s, a) \Big[ r - \eta(\pi) + v_*(s') \Big]$$
$$q_*(s, a) = \sum_{r, s'} p(s', r \mid s, a) \Big[ r - \eta(\pi) + \max_{a'} q_*(s', a') \Big]$$
Update targets: $U_t = R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \boldsymbol{\theta})$ or $U_t = R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}, \boldsymbol{\theta})$, where $\bar{R}_t$ is an estimate of $\eta(\pi)$
There is also a differential form of the TD error:
$$\delta_t = R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \boldsymbol{\theta}) - \hat{v}(S_t, \boldsymbol{\theta}) \qquad (18)$$

15 Differential semi-gradient Sarsa for estimating $\hat{q} \approx q_*$
Input: a differentiable function $\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^n \to \mathbb{R}$
Parameters: step sizes $\alpha, \beta > 0$
Initialize value-function weights $\boldsymbol{\theta} \in \mathbb{R}^n$ arbitrarily (e.g., $\boldsymbol{\theta} = \mathbf{0}$)
Initialize average-reward estimate $\bar{R}$ arbitrarily (e.g., $\bar{R} = 0$)
Initialize state S and action A
Repeat (for each step):
  Take action A, observe R, S′
  Choose A′ as a function of q̂(S′, ·, θ) (e.g., ε-greedy)
  δ ← R − R̄ + q̂(S′, A′, θ) − q̂(S, A, θ)
  R̄ ← R̄ + β δ
  θ ← θ + α δ ∇q̂(S, A, θ)
  S ← S′
  A ← A′
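A minimal Python sketch of this algorithm with linear function approximation; as before, the continuing-task environment interface (`reset`, `step` returning only a state and reward) and the `features` function are assumptions, not part of the slides.

```python
import numpy as np

def differential_semi_gradient_sarsa(env, features, n_features, actions,
                                     alpha=0.01, beta=0.01, epsilon=0.1,
                                     num_steps=100_000, rng=np.random.default_rng(0)):
    """Differential semi-gradient Sarsa with linear q_hat and a running
    average-reward estimate r_bar (the R-bar of the pseudocode)."""
    theta = np.zeros(n_features)
    r_bar = 0.0                                           # estimate of eta(pi)

    def q(s, a):
        return theta @ features(s, a)

    def eps_greedy(s):
        if rng.random() < epsilon:
            return rng.choice(actions)
        return max(actions, key=lambda a: q(s, a))

    s = env.reset()
    a = eps_greedy(s)
    for _ in range(num_steps):
        s_next, r = env.step(a)                            # continuing: no terminal states
        a_next = eps_greedy(s_next)
        delta = r - r_bar + q(s_next, a_next) - q(s, a)    # differential TD error
        r_bar += beta * delta                              # update average-reward estimate
        theta += alpha * delta * features(s, a)            # semi-gradient update
        s, a = s_next, a_next
    return theta, r_bar
```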

16 Example: The access-control queuing problem, solved by tabular differential Sarsa
Customers wait in line to be served by one of k = 10 servers
Customers pay rewards of 1, 2, 4, or 8 (depending on their priority) for being served
On each step, the customer at the front of the queue is accepted (served) or rejected
The queue never empties; new customers have random priorities
Busy servers become free with probability p = 0.06 on each step
[Figure: the learned policy (ACCEPT/REJECT as a function of priority and number of free servers) and the differential value of the best action versus the number of free servers, one curve per priority]
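A sketch of the task as a continuing environment, suitable for plugging into the differential Sarsa sketch above; the state encoding (number of free servers, priority of the head-of-queue customer) and the class interface are our own choices for illustration.

```python
import numpy as np

class AccessControlQueue:
    """Access-control queuing task: 10 servers, customers of priority (reward) 1, 2, 4, or 8,
    each busy server becoming free with probability 0.06 on every step."""
    PRIORITIES = [1, 2, 4, 8]

    def __init__(self, n_servers=10, p_free=0.06, rng=np.random.default_rng(0)):
        self.n_servers, self.p_free, self.rng = n_servers, p_free, rng

    def reset(self):
        self.free = self.n_servers
        self.priority = self.rng.choice(self.PRIORITIES)
        return (self.free, self.priority)

    def step(self, accept):
        reward = 0.0
        if accept and self.free > 0:                 # serve the customer: collect its priority
            reward = float(self.priority)
            self.free -= 1
        busy = self.n_servers - self.free            # each busy server frees with prob p_free
        self.free += self.rng.binomial(busy, self.p_free)
        self.priority = self.rng.choice(self.PRIORITIES)   # queue never empties
        return (self.free, self.priority), reward
```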

17 Discounting is futile in continuing control settings with function approximation
The problem statement is broken! The goal is broken!
We can no longer give a useful ordering on policies; we can only order a few policies, those that dominate others in all states
It would be OK if we could say what states we care about, but in the control case we can't
Suppose we cared about states according to how often they occur? Surprisingly, discounting then becomes irrelevant!

18 The Futility of Discounting in Continuing Problems
Perhaps discounting can be saved by choosing an objective that sums discounted values over the distribution with which states occur under the policy (where $v_\pi^\gamma$ is the discounted value function):
$$
\begin{aligned}
J(\pi) &= \sum_s d_\pi(s)\, v_\pi^\gamma(s) \\
&= \sum_s d_\pi(s) \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a) \big[ r + \gamma\, v_\pi^\gamma(s') \big] && \text{(Bellman Eq.)} \\
&= \eta(\pi) + \sum_s d_\pi(s) \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\, \gamma\, v_\pi^\gamma(s') && \text{(from (15))} \\
&= \eta(\pi) + \gamma \sum_{s'} v_\pi^\gamma(s') \sum_s d_\pi(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a) && \text{(from (3.8))} \\
&= \eta(\pi) + \gamma \sum_{s'} v_\pi^\gamma(s')\, d_\pi(s') && \text{(from (16))} \\
&= \eta(\pi) + \gamma J(\pi) \\
&= \eta(\pi) + \gamma \eta(\pi) + \gamma^2 J(\pi) \\
&= \eta(\pi) + \gamma \eta(\pi) + \gamma^2 \eta(\pi) + \gamma^3 \eta(\pi) + \cdots \\
&= \frac{1}{1 - \gamma}\, \eta(\pi)
\end{aligned}
$$
The proposed discounted objective orders policies identically to the undiscounted (average-reward) objective; we have failed to save discounting!
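The identity $J(\pi) = \eta(\pi)/(1-\gamma)$ is easy to check numerically. The sketch below does so for a made-up three-state ergodic chain (transition matrix and rewards are illustrative, not from the slides): the discounted values are solved exactly, weighted by the stationary distribution, and compared against the average reward scaled by $1/(1-\gamma)$.

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],      # on-policy transition probabilities P[s, s']
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, 2.0])       # expected one-step reward in each state

# Discounted value function: v = (I - gamma P)^{-1} r
v_gamma = np.linalg.solve(np.eye(3) - gamma * P, r)

# Stationary distribution d_pi: left eigenvector of P with eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d /= d.sum()

eta = d @ r                          # average reward per step
J = d @ v_gamma                      # discounted values weighted by d_pi
print(J, eta / (1 - gamma))          # the two numbers agree
```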

19 Conclusions
Control is straightforward in the on-policy, episodic, linear case
For the continuing case, we need the average-reward setting, which is a lot like just replacing $R_t$ with $R_t - \eta(\pi)$ everywhere, where $\eta(\pi)$ is the average reward per step, or its estimate
We should probably never use discounting as a control objective
Formal results (bounds) exist for the linear, on-policy case; we get chattering near a good solution, not convergence
