Chapter 8: Generalization and Function Approximation
Objectives of this chapter:
- Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part.
- Overview of function approximation (FA) methods and how they can be adapted to RL.
Value Prediction with FA
As usual, policy evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^\pi$.
In earlier chapters, value functions were stored in lookup tables. Here, the value function estimate at time $t$, $V_t$, depends on a parameter vector $\vec{\theta}_t$, and only the parameter vector is updated.
e.g., $\vec{\theta}_t$ could be the vector of connection weights of a neural network.
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
Adapt Supervised Learning Algorithms
Training info = desired (target) outputs
Inputs -> Supervised Learning System -> Outputs
Training example = {input, target output}
Error = (target output - actual output)
Backups as Training Examples
e.g., the TD(0) backup:
$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$
As a training example:
$$\{ \text{description of } s_t,\; r_{t+1} + \gamma V(s_{t+1}) \}$$
where the description of $s_t$ is the input and $r_{t+1} + \gamma V(s_{t+1})$ is the target output.
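To make the mapping from backup to training example concrete, here is a minimal sketch. The function name `td0_training_example`, the tuple used as the state description, and the numbers are illustrative assumptions, not part of the slides.

```python
def td0_training_example(state_description, reward, next_state_value, gamma):
    """Turn one TD(0) backup into a supervised (input, target) pair.

    input  : the description of s_t (any representation; a tuple here)
    target : r_{t+1} + gamma * V(s_{t+1})
    """
    target = reward + gamma * next_state_value
    return state_description, target


# Hypothetical transition: features of s_t, r_{t+1} = -1, V(s_{t+1}) = 5.
x, y = td0_training_example((1.0, 0.0), reward=-1.0,
                            next_state_value=5.0, gamma=0.9)
print(x, y)
```

Any supervised learner that accepts (input, target) pairs could then be trained on `x` and `y`.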
Any FA Method?
In principle, yes:
- artificial neural networks
- decision trees
- multivariate regression methods
- etc.
But RL has some special requirements:
- usually want to learn while interacting
- ability to handle nonstationarity
- other?
Gradient Descent Methods
$$\vec{\theta}_t = (\theta_t(1), \theta_t(2), \ldots, \theta_t(n))^T$$
where $T$ denotes transpose.
Assume $V_t$ is a (sufficiently smooth) differentiable function of $\vec{\theta}_t$, for all $s \in S$.
Assume, for now, training examples of this form: $\{ \text{description of } s_t,\; V^\pi(s_t) \}$
Performance Measures
Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution $P$:
$$\mathrm{MSE}(\vec{\theta}_t) = \sum_{s \in S} P(s) \left[ V^\pi(s) - V_t(s) \right]^2$$
Why $P$? Why minimize MSE?
Let us assume that $P$ is always the distribution of states with which backups are done.
The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.
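A small sketch of the weighted error, assuming a finite state set stored in dictionaries. The state names and numbers are made up for illustration.

```python
def mse(P, v_pi, v_t):
    """MSE = sum over s of P(s) * (V_pi(s) - V_t(s))**2."""
    return sum(P[s] * (v_pi[s] - v_t[s]) ** 2 for s in P)


# Hypothetical three-state problem.
P = {"A": 0.5, "B": 0.25, "C": 0.25}      # state distribution
v_pi = {"A": 1.0, "B": 2.0, "C": 4.0}     # true values
v_t = {"A": 1.0, "B": 1.0, "C": 2.0}      # current estimates
error = mse(P, v_pi, v_t)
print(error)
```

Because each squared error is multiplied by $P(s)$, errors in frequently visited states count for more than errors in rarely visited ones.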
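Gradient Descent
Let $f$ be any function of the parameter space. Its gradient at any point $\vec{\theta}_t$ in this space is:
$$\nabla_{\vec{\theta}} f(\vec{\theta}_t) = \left( \frac{\partial f(\vec{\theta}_t)}{\partial \theta(1)}, \frac{\partial f(\vec{\theta}_t)}{\partial \theta(2)}, \ldots, \frac{\partial f(\vec{\theta}_t)}{\partial \theta(n)} \right)^T$$
[Figure: a two-dimensional parameter space with axes $\theta(1)$ and $\theta(2)$, showing the point $\vec{\theta}_t = (\theta_t(1), \theta_t(2))^T$.]
Iteratively move down the gradient:
$$\vec{\theta}_{t+1} = \vec{\theta}_t - \alpha \nabla_{\vec{\theta}} f(\vec{\theta}_t)$$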
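The iteration can be sketched on a toy function. The quadratic $f(\vec{\theta}) = (\theta(1) - 3)^2 + (\theta(2) + 1)^2$, the step size, and the number of steps are assumptions chosen only so that the minimum at $(3, -1)$ is easy to check.

```python
def grad_f(theta):
    """Gradient of the toy function f = (theta[0] - 3)^2 + (theta[1] + 1)^2."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


def gradient_descent(theta, alpha, steps):
    """Repeatedly apply theta <- theta - alpha * grad f(theta)."""
    for _ in range(steps):
        g = grad_f(theta)
        theta = [th - alpha * gi for th, gi in zip(theta, g)]
    return theta


theta_final = gradient_descent([0.0, 0.0], alpha=0.1, steps=200)
print(theta_final)   # approaches the minimiser (3, -1)
```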
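Gradient Descent Cont.
For the MSE given above and using the chain rule:
$$\begin{aligned}
\vec{\theta}_{t+1} &= \vec{\theta}_t - \tfrac{1}{2} \alpha \nabla_{\vec{\theta}} \mathrm{MSE}(\vec{\theta}_t) \\
&= \vec{\theta}_t - \tfrac{1}{2} \alpha \nabla_{\vec{\theta}} \sum_{s \in S} P(s) \left[ V^\pi(s) - V_t(s) \right]^2 \\
&= \vec{\theta}_t + \alpha \sum_{s \in S} P(s) \left[ V^\pi(s) - V_t(s) \right] \nabla_{\vec{\theta}} V_t(s)
\end{aligned}$$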
Gradient Descent Cont.
Use just the sample gradient instead:
$$\begin{aligned}
\vec{\theta}_{t+1} &= \vec{\theta}_t - \tfrac{1}{2} \alpha \nabla_{\vec{\theta}} \left[ V^\pi(s_t) - V_t(s_t) \right]^2 \\
&= \vec{\theta}_t + \alpha \left[ V^\pi(s_t) - V_t(s_t) \right] \nabla_{\vec{\theta}} V_t(s_t)
\end{aligned}$$
Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if $\alpha$ decreases appropriately with $t$:
$$E\left\{ \left[ V^\pi(s_t) - V_t(s_t) \right] \nabla_{\vec{\theta}} V_t(s_t) \right\} = \sum_{s \in S} P(s) \left[ V^\pi(s) - V_t(s) \right] \nabla_{\vec{\theta}} V_t(s)$$
But We Don't Have These Targets
Suppose we just have targets $v_t$ instead:
$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ v_t - V_t(s_t) \right] \nabla_{\vec{\theta}} V_t(s_t)$$
If each $v_t$ is an unbiased estimate of $V^\pi(s_t)$, i.e., $E\{v_t\} = V^\pi(s_t)$, then gradient descent converges to a local minimum (provided $\alpha$ decreases appropriately).
e.g., the Monte Carlo target $v_t = R_t$:
$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ R_t - V_t(s_t) \right] \nabla_{\vec{\theta}} V_t(s_t)$$
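A minimal sketch of the Monte Carlo version of this update. Everything specific here is an assumption for illustration: a five-state random walk (start in the centre, reward 1 on exiting to the right, 0 on exiting to the left, undiscounted), a two-feature linear approximator so that the gradient is simply the feature vector, and an arbitrarily chosen decreasing step-size schedule. For this walk the true values are $(s+1)/6$, which the two features can represent exactly.

```python
import random

N_STATES = 5                                       # non-terminal states 0..4
TRUE_V = [(s + 1) / 6 for s in range(N_STATES)]    # known for this walk


def features(s):
    """Two features: a bias term and the centred, scaled state index."""
    return (1.0, (s - 2) / 2.0)


def value(theta, s):
    return sum(th * f for th, f in zip(theta, features(s)))


def run_episode(rng):
    """Return (visited states, return R); reward 1 only on a right exit."""
    s, visited = 2, []
    while 0 <= s < N_STATES:
        visited.append(s)
        s += rng.choice((-1, 1))
    return visited, (1.0 if s == N_STATES else 0.0)


def gradient_mc(episodes=30000, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    n = 0
    for _ in range(episodes):
        visited, ret = run_episode(rng)
        for s in visited:
            alpha = 0.02 / (1.0 + n / 1000.0)      # decreasing step size
            n += 1
            err = ret - value(theta, s)            # v_t - V_t(s_t), with v_t = R_t
            grad = features(s)                     # gradient of a linear V_t
            theta = [th + alpha * err * g for th, g in zip(theta, grad)]
    return theta


theta = gradient_mc()
for s in range(N_STATES):
    print(s, round(value(theta, s), 3), round(TRUE_V[s], 3))
```

The return is a noisy but unbiased target for every visited state, so the estimates settle near the true values as the step size shrinks.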
What about TD(λ) Targets?
$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ R_t^\lambda - V_t(s_t) \right] \nabla_{\vec{\theta}} V_t(s_t)$$
Not unbiased for $\lambda < 1$.
But we do it anyway, using the backwards view:
$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \delta_t \vec{e}_t$$
where
$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$$
as usual, and
$$\vec{e}_t = \gamma \lambda \vec{e}_{t-1} + \nabla_{\vec{\theta}} V_t(s_t)$$
On-Line Gradient-Descent TD(λ)
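A minimal sketch of on-line gradient-descent TD(λ) using the backward-view updates $\delta_t$, $\vec{e}_t$ and $\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \delta_t \vec{e}_t$. It assumes a linear approximator (so the gradient is the feature vector), a hypothetical five-state random walk as the environment, and an arbitrary decreasing step-size schedule. None of these specifics come from the slides.

```python
import random

N_STATES = 5                                       # non-terminal states 0..4
TRUE_V = [(s + 1) / 6 for s in range(N_STATES)]    # known for this walk


def features(s):
    """Two features: a bias term and the centred, scaled state index."""
    return (1.0, (s - 2) / 2.0)


def value(theta, s):
    if s < 0 or s >= N_STATES:                     # terminal states have value 0
        return 0.0
    return sum(th * f for th, f in zip(theta, features(s)))


def td_lambda(episodes=20000, gamma=1.0, lam=0.8, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    n = 0
    for _ in range(episodes):
        s = 2
        e = [0.0, 0.0]                             # eligibility trace, reset per episode
        while 0 <= s < N_STATES:
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == N_STATES else 0.0
            alpha = 0.02 / (1.0 + n / 1000.0)      # decreasing step size
            n += 1
            # TD error: r_{t+1} + gamma * V_t(s_{t+1}) - V_t(s_t)
            delta = r + gamma * value(theta, s_next) - value(theta, s)
            # Accumulating trace: e_t = gamma * lambda * e_{t-1} + grad V_t(s_t)
            e = [gamma * lam * ei + g for ei, g in zip(e, features(s))]
            theta = [th + alpha * delta * ei for th, ei in zip(theta, e)]
            s = s_next
    return theta


theta = td_lambda()
for s in range(N_STATES):
    print(s, round(value(theta, s), 3), round(TRUE_V[s], 3))
```

The parameters are updated after every step, so learning happens during the episode and not only once it has ended.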
Linear Methods
Represent states as feature vectors: for each $s \in S$:
$$\vec{\phi}_s = (\phi_s(1), \phi_s(2), \ldots, \phi_s(n))^T$$
$$V_t(s) = \vec{\theta}_t^T \vec{\phi}_s = \sum_{i=1}^{n} \theta_t(i) \phi_s(i)$$
$$\nabla_{\vec{\theta}} V_t(s) = \; ?$$
Nice Properties of Linear FA Methods
The gradient is very simple: $\nabla_{\vec{\theta}} V_t(s) = \vec{\phi}_s$
For MSE, the error surface is simple: a quadratic surface with a single minimum.
Linear gradient-descent TD(λ) converges, provided:
- the step size decreases appropriately
- on-line sampling is used (states sampled from the on-policy distribution)
It converges to a parameter vector $\vec{\theta}_\infty$ with the property:
$$\mathrm{MSE}(\vec{\theta}_\infty) \le \frac{1 - \gamma \lambda}{1 - \gamma} \mathrm{MSE}(\vec{\theta}^*)$$
(Tsitsiklis & Van Roy, 1997), where $\vec{\theta}^*$ is the best parameter vector.
Coarse Coding
Learning and Coarse Coding
Tile Coding
- Binary feature for each tile
- Number of features present at any one time is constant
- Binary features mean the weighted sum is easy to compute
- Easy to compute indices of the features present
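These properties can be seen in a one-dimensional sketch. The layout is an assumption made for illustration: four tilings of eight tiles over $[0, 1]$, each shifted by a quarter of a tile width, with the step size divided by the number of tilings. The function names are hypothetical.

```python
def active_tiles(x, n_tilings=4, tiles_per_tiling=8, lo=0.0, hi=1.0):
    """Indices of the active tiles for a scalar x in [lo, hi].

    Each tiling is shifted by a fraction of one tile width, so each
    tiling needs one extra tile to cover the whole range.
    """
    tile_width = (hi - lo) / tiles_per_tiling
    indices = []
    for k in range(n_tilings):
        offset = k * tile_width / n_tilings
        tile = int((x - lo + offset) / tile_width)
        tile = min(tile, tiles_per_tiling)             # guard the upper edge
        indices.append(k * (tiles_per_tiling + 1) + tile)
    return indices


def tile_value(weights, x):
    """With binary features the weighted sum is just a sum of active weights."""
    return sum(weights[i] for i in active_tiles(x))


def update(weights, x, target, alpha):
    """One linear gradient-descent step; only the active weights change."""
    tiles = active_tiles(x)
    error = target - sum(weights[i] for i in tiles)
    for i in tiles:
        weights[i] += (alpha / len(tiles)) * error


weights = [0.0] * (4 * (8 + 1))
update(weights, 0.3, target=1.0, alpha=0.5)
print(active_tiles(0.3), tile_value(weights, 0.3), tile_value(weights, 0.32))
```

Exactly one tile per tiling is active, so the number of active features is always the number of tilings. The nearby input 0.32 shares three of its four tiles with 0.3 and therefore picks up most of the update, which is the generalization effect.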
Tile Coding Cont.
- Irregular tilings
- Hashing
- CMAC: "Cerebellar Model Arithmetic Computer" (Albus, 1971)
Radial Basis Functions (RBFs)
e.g., Gaussians:
$$\phi_s(i) = \exp\left( -\frac{\| s - c_i \|^2}{2 \sigma_i^2} \right)$$
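A short sketch of Gaussian RBF features for a continuous state. The centers $c_i$, widths $\sigma_i$, and the query state are arbitrary values chosen for illustration.

```python
import math


def rbf_features(s, centers, sigmas):
    """phi_s(i) = exp(-||s - c_i||^2 / (2 * sigma_i^2)) for each center."""
    feats = []
    for c, sigma in zip(centers, sigmas):
        sq_dist = sum((si - ci) ** 2 for si, ci in zip(s, c))
        feats.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    return feats


# Hypothetical 2-D state space with three centers of equal width.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
sigmas = [0.5, 0.5, 0.5]
phi = rbf_features((0.0, 0.0), centers, sigmas)
print(phi)
```

Unlike binary tile features, each feature varies smoothly between 0 and 1 with the distance to its center.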
Can you beat the "curse of dimensionality"?
Can you keep the number of features from going up exponentially with the dimension?
Function complexity, not dimensionality, is the problem.
Kanerva coding:
- Select a bunch of binary prototypes
- Use Hamming distance as the distance measure
- Dimensionality is no longer a problem, only complexity
"Lazy learning" schemes:
- Remember all the data
- To get a new value, find nearest neighbors and interpolate
- e.g., locally-weighted regression
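A minimal sketch of Kanerva-style features, assuming binary state vectors and a feature that is active when the state lies within a fixed Hamming radius of a prototype. The prototypes and radius are made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def kanerva_features(state, prototypes, radius):
    """Binary feature i is 1 when the state is within `radius` of prototype i."""
    return [1 if hamming(state, p) <= radius else 0 for p in prototypes]


# Hypothetical 6-bit state space with four prototypes.
prototypes = [
    (0, 0, 0, 0, 0, 0),
    (1, 1, 1, 0, 0, 0),
    (0, 0, 0, 1, 1, 1),
    (1, 1, 1, 1, 1, 1),
]
phi = kanerva_features((1, 1, 0, 0, 0, 0), prototypes, radius=2)
print(phi)
```

The number of features equals the number of prototypes, whatever the length of the state vector.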
Control with FA
Learning state-action values.
Training examples of the form: $\{ \text{description of } (s_t, a_t),\; v_t \}$
The general gradient-descent rule:
$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ v_t - Q_t(s_t, a_t) \right] \nabla_{\vec{\theta}} Q_t(s_t, a_t)$$
Gradient-descent Sarsa(λ) (backward view):
$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \delta_t \vec{e}_t$$
where
$$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$$
$$\vec{e}_t = \gamma \lambda \vec{e}_{t-1} + \nabla_{\vec{\theta}} Q_t(s_t, a_t)$$
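A minimal sketch of gradient-descent Sarsa(λ) with accumulating traces. The environment is a hypothetical five-state corridor (start at the left end, goal at the right end, reward -1 per step), the features are one-hot over state-action pairs (a special case of linear FA), and the policy is ε-greedy. The constant step size and other parameter values are assumptions chosen for illustration.

```python
import random

N_STATES = 5            # corridor states 0..4; state 4 is the terminal goal
ACTIONS = (-1, +1)      # index 0 = left, index 1 = right


def feature_index(s, a):
    """One-hot features over (state, action) pairs."""
    return s * len(ACTIONS) + a


def q(theta, s, a):
    return 0.0 if s == N_STATES - 1 else theta[feature_index(s, a)]


def epsilon_greedy(theta, s, eps, rng):
    if rng.random() < eps:
        return rng.randrange(len(ACTIONS))
    values = [q(theta, s, a) for a in range(len(ACTIONS))]
    best = max(values)
    return rng.choice([a for a, v in enumerate(values) if v == best])


def sarsa_lambda(episodes=5000, alpha=0.02, gamma=1.0, lam=0.8,
                 eps=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * (N_STATES * len(ACTIONS))
    for _ in range(episodes):
        e = [0.0] * len(theta)                   # traces reset each episode
        s = 0
        a = epsilon_greedy(theta, s, eps, rng)
        while s != N_STATES - 1:
            s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
            r = -1.0
            a_next = epsilon_greedy(theta, s_next, eps, rng)
            delta = r + gamma * q(theta, s_next, a_next) - q(theta, s, a)
            e = [gamma * lam * ei for ei in e]
            e[feature_index(s, a)] += 1.0        # gradient of a one-hot linear Q
            theta = [th + alpha * delta * ei for th, ei in zip(theta, e)]
            s, a = s_next, a_next
    return theta


theta = sarsa_lambda()
greedy = [max(range(len(ACTIONS)), key=lambda a: q(theta, s, a))
          for s in range(N_STATES - 1)]
print(greedy)   # greedy action index for each non-terminal state
```

The update has the same shape as the TD(λ) prediction case, with $Q_t(s_t, a_t)$ in place of $V_t(s_t)$ and the next action drawn from the policy being followed.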
Linear Gradient Descent Sarsa(λ)
GPI Linear Gradient Descent Watkins' Q(λ)
Mountain-Car Task
Mountain-Car Results
Baird's Counterexample
Baird's Counterexample Cont.
Should We Bootstrap?
Summary
- Generalization
- Adapting supervised-learning function approximation methods
- Gradient-descent methods
- Linear gradient-descent methods: radial basis functions, tile coding, Kanerva coding
- Nonlinear gradient-descent methods? Backpropagation?
- Subtleties involving function approximation, bootstrapping and the on-policy/off-policy distinction