Barycentric Interpolators for Continuous Space & Time Reinforcement Learning. Robotics Institute, Carnegie Mellon University

Rémi Munos & Andrew Moore
Robotics Institute, Carnegie Mellon University
Pittsburgh, PA 15213, USA.
E-mail: {munos, awm}@cs.cmu.edu

Category: Reinforcement Learning and Control
Preference: oral presentation

Abstract

In order to find the optimal control of continuous state-space and time reinforcement learning (RL) problems, we approximate the value function (VF) with a particular class of functions called the barycentric interpolators. We establish sufficient conditions under which a RL algorithm converges to the optimal VF, even when we use approximate models of the state dynamics and the reinforcement functions.

1 INTRODUCTION

In order to approximate the value function (VF) of a continuous state-space and time reinforcement learning (RL) problem, we define a particular class of functions called the barycentric interpolators, which use some interpolation process based on finite sets of points. This class of functions, including continuous or discontinuous piecewise linear and multi-linear functions, provides us with a general method for designing RL algorithms that converge to the optimal value function. Indeed, these functions permit us to discretize the HJB equation of the continuous control problem by a consistent (and thus convergent) approximation scheme, which is solved by using some model of the state dynamics and the reinforcement functions.

Section 2 defines the barycentric interpolators. Section 3 describes the optimal control problem in the deterministic continuous case. Section 4 states the convergence result for RL algorithms by giving sufficient conditions on the applied model. Section 5 gives some computational issues for this method, and Section 6 describes the approximation scheme used here and proves the convergence result.

2 DEFINITION OF BARYCENTRIC INTERPOLATORS

Let $\Sigma^\delta = \{\xi_i\}_i$ be a set of points distributed at some resolution $\delta$ (see (4) below) on the state space of dimension $d$.

For any state $x$ inside some simplex $(\xi_1, \dots, \xi_n)$, we say that $x$ is the barycenter of the $\{\xi_i\}_{i=1..n}$ inside this simplex with positive coefficients $p(x|\xi_i)$ of sum 1, called the barycentric coordinates, if $x = \sum_{i=1..n} p(x|\xi_i)\,\xi_i$.

Let $V^\delta(\xi_i)$ be the value of the function at the points $\xi_i$. $V^\delta$ is a barycentric interpolator if for any state $x$ which is the barycenter of the points $\{\xi_i\}_{i=1..n}$ for some simplex $(\xi_1, \dots, \xi_n)$, with the barycentric coordinates $p(x|\xi_i)$, we have:

$$V^\delta(x) = \sum_{i=1..n} p(x|\xi_i)\, V^\delta(\xi_i) \tag{1}$$

Moreover, we assume that the simplex $(\xi_1, \dots, \xi_n)$ is of diameter $O(\delta)$. Let us describe some simple barycentric interpolators:

- Piecewise linear functions defined by some triangulation on the state space (thus defining continuous functions), see figure 1.a, or defined at any $x$ by a linear combination of $(d+1)$ values at any points $(\xi_1, \dots, \xi_{d+1}) \ni x$ (such functions may be discontinuous at some boundaries), see figure 1.b.
- Piecewise multi-linear functions defined by a multi-linear combination of the $2^d$ values at the vertices of $d$-dimensional rectangles, see figure 1.c. In this case as well, we can build continuous interpolations or allow discontinuities at the boundaries of the rectangles.

An important point is that the convergence result stated in Section 4 does not require the continuity of the function. This permits us to build variable resolution triangulations (see figure 1.b) or grids (figure 1.c) easily.

Figure 1: Some examples of barycentric approximators. These are piecewise continuous (a) or discontinuous (b) linear or multi-linear (c) interpolators.

Remark 1 In the general case, for a given $x$, the choice of a simplex $(\xi_1, \dots, \xi_n) \ni x$ is not unique (see the two sets of grey and black points in figures 1.b and 1.c), and once the simplex $(\xi_1, \dots, \xi_n) \ni x$ is defined, if $n > d+1$ (for example in figure 1.c), then the choice of the barycentric coordinates $p(x|\xi_i)$ is also not unique.

Remark 2 Depending on the interpolation method we use, the time needed for computing the values will vary. Following [Dav96], the continuous multi-linear interpolation must process $2^d$ values, whereas the linear continuous interpolation inside a simplex processes $(d+1)$ values in $O(d \log d)$ time.
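As an illustration of property (1), the following minimal Python sketch computes barycentric coordinates for two of the interpolators described above in dimension $d = 2$: a triangle ($n = d + 1 = 3$ points) and a rectangle with multi-linear weights ($n = 2^d = 4$ points). The function names `triangle_coords`, `bilinear_coords` and `interpolate` are hypothetical and chosen here for illustration only; they are not taken from the paper.

```python
def triangle_coords(x, tri):
    """Barycentric coordinates of the 2-D point x in the triangle tri = (a, b, c)."""
    (ax, ay), (bx, by), (cx, cy) = tri
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    p_a = ((by - cy) * (x[0] - cx) + (cx - bx) * (x[1] - cy)) / det
    p_b = ((cy - ay) * (x[0] - cx) + (ax - cx) * (x[1] - cy)) / det
    return (p_a, p_b, 1.0 - p_a - p_b)


def bilinear_coords(x, lo, hi):
    """Multi-linear weights of x on the 4 corners of the rectangle [lo, hi].

    Corner order: (lo, lo), (hi, lo), (lo, hi), (hi, hi).
    """
    s = (x[0] - lo[0]) / (hi[0] - lo[0])
    t = (x[1] - lo[1]) / (hi[1] - lo[1])
    return ((1 - s) * (1 - t), s * (1 - t), (1 - s) * t, s * t)


def interpolate(coords, values):
    """Property (1): V(x) = sum_i p(x | xi_i) * V(xi_i)."""
    return sum(p * v for p, v in zip(coords, values))


# Usage: interpolate the vertex values of g(x, y) = 2x + 3y + 1 on a triangle.
tri = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
p_tri = triangle_coords((0.25, 0.5), tri)
v_tri = interpolate(p_tri, (1.0, 3.0, 4.0))

# The same point on the unit square: four weights instead of three.
p_rect = bilinear_coords((0.25, 0.5), (0.0, 0.0), (1.0, 1.0))
```

Both coordinate sets are nonnegative inside their cell, sum to 1 and reproduce $x$ as a weighted average of the vertices. On the rectangle there are four weights for three constraints, which is the non-uniqueness noted in Remark 1: the multi-linear weights are only one admissible choice.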

In comparison to [Gor95], the functions used here are averagers that satisfy the barycentric interpolation property (1). This additional geometric constraint permits us to prove the consistency (see (15) below) of the approximation scheme and thus the convergence to the optimal value in the continuous time case.

3 THE OPTIMAL CONTROL PROBLEM

Let us describe the optimal control problem in the deterministic and discounted case for continuous state-space and time variables, and define the value function that we intend to approximate. We consider a dynamical system whose state dynamics depends on the current state $x(t) \in O$ (the state-space, with $O$ an open subset of $\mathbb{R}^d$) and control $u(t) \in U$ (compact subset) by a differential equation:

$$\frac{dx}{dt} = f(x(t), u(t)) \tag{2}$$

From equation (2), the choice of an initial state $x$ and a control function $u(t)$ leads to a unique trajectory $x(t)$ (see figure 2). Let $\tau$ be the exit time from $O$ (with the convention that if $x(t)$ always stays in $O$, then $\tau = \infty$). Then, we define the functional $J$ as the discounted cumulative reinforcement:

$$J(x; u(\cdot)) = \int_0^{\tau} \gamma^t\, r(x(t), u(t))\, dt + \gamma^{\tau} R(x(\tau))$$

where $r(x, u)$ is the running reinforcement and $R(x)$ the boundary reinforcement. $\gamma$ is the discount factor ($0 \le \gamma < 1$). We assume that $f$, $r$ and $R$ are bounded and Lipschitzian, and that the boundary $\partial O$ is $C^2$.

RL uses the method of Dynamic Programming (DP), which introduces the value function (VF): the maximal value of $J$ as a function of the initial state $x$:

$$V(x) = \sup_{u(\cdot)} J(x; u(\cdot)).$$

From the DP principle, we deduce that $V$ satisfies a first-order differential equation, called the Hamilton-Jacobi-Bellman (HJB) equation (see [FS93] for a survey):

Theorem 1 If $V$ is differentiable at $x \in O$, let $DV(x)$ be the gradient of $V$ at $x$; then the following HJB equation holds at $x$:

$$H(V, DV, x) \stackrel{\text{def}}{=} V(x) \ln \gamma + \sup_{u \in U} \left[ DV(x) \cdot f(x, u) + r(x, u) \right] = 0 \tag{3}$$

The challenge of RL is to get a good approximation of the VF, because from $V$ we can deduce the optimal control: for state $x$, the control $u^*(x)$ that realizes the supremum in the HJB equation provides an optimal (feed-back) control law.

The following hypothesis is a sufficient condition for $V$ to be continuous within $O$ (see [Bar94]) and is required for proving the convergence result of the next section.

Hyp 1: For $x \in \partial O$, let $\vec{n}(x)$ be the outward normal of $O$ at $x$. We assume that:
- If $\exists u \in U$ s.t. $f(x, u) \cdot \vec{n}(x) \le 0$ then $\exists v \in U$ s.t. $f(x, v) \cdot \vec{n}(x) < 0$.
- If $\exists u \in U$ s.t. $f(x, u) \cdot \vec{n}(x) \ge 0$ then $\exists v \in U$ s.t. $f(x, v) \cdot \vec{n}(x) > 0$.

This means that at the states (if there exist any) where some trajectory is tangent to the boundary, there exists, for some control, a trajectory strictly coming inside and one strictly leaving the state space.
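To make the functional $J$ concrete, here is a minimal Python sketch that approximates it by explicit Euler integration of (2). It is an illustration under stated assumptions, not part of the method: the state space is taken to be a one-dimensional interval $O = (lo, hi)$, and the function name `discounted_return`, the step size `dt` and the horizon cap `t_max` are hypothetical choices introduced here.

```python
def discounted_return(x0, policy, f, r, R, gamma, lo, hi, dt=1e-3, t_max=50.0):
    """Euler approximation of J(x0; u(.)) on the 1-D state space O = (lo, hi).

    Returns the approximate value of J and the approximate exit time.
    Integration stops at t_max if the trajectory never leaves O.
    """
    x, t, total = x0, 0.0, 0.0
    while lo < x < hi and t < t_max:
        u = policy(x, t)
        total += (gamma ** t) * r(x, u) * dt   # running reinforcement, discounted
        x += f(x, u) * dt                      # one Euler step of dx/dt = f(x, u)
        t += dt
    if not (lo < x < hi):
        total += (gamma ** t) * R(x)           # boundary reinforcement at exit
    return total, t


# Usage: constant control u = 1 on dx/dt = u, starting from x0 = 0.5 in O = (0, 1).
J, exit_time = discounted_return(
    x0=0.5,
    policy=lambda x, t: 1.0,
    f=lambda x, u: u,
    r=lambda x, u: 1.0,
    R=lambda x: 2.0,
    gamma=0.5, lo=0.0, hi=1.0,
)
```

For this example the trajectory exits at $\tau = 0.5$ and the closed form is $J = (\gamma^{\tau} - 1)/\ln\gamma + 2\gamma^{\tau}$, which the sketch reproduces up to an error of order `dt`. The value function $V(x)$ is the supremum of such returns over all control functions.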

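Figure 2: The state space and the set of points $\Sigma^\delta$ (the black dots belong to the interior and the white ones to the boundary). The value at some point $\xi$ is updated, at step $n$, by the discounted value at point $\eta_n \in (\xi_1, \xi_2, \xi_3)$. The main requirement for convergence is that the points $\eta_n$ approximate $\eta$ in the sense: $p(\eta_n|\xi_i) = p(\eta|\xi_i) + O(\delta)$ (i.e. the $\eta_n$ belong to the grey area).

4 THE CONVERGENCE RESULT

Let us introduce the set of points $\Sigma^\delta = \{\xi_i\}$, composed of the interior ($\Sigma^\delta \cap O$) and the boundary ($\partial\Sigma^\delta = \Sigma^\delta \setminus O$), such that its convex hull covers the state space $O$, and performing a discretization at some resolution $\delta$:

$$\forall x \in O,\ \inf_{\xi_i \in \Sigma^\delta \cap O} \|x - \xi_i\| \le \delta \quad \text{and} \quad \forall x \in \partial O,\ \inf_{\xi_j \in \partial\Sigma^\delta} \|x - \xi_j\| \le \delta \tag{4}$$

Moreover, we approximate the control space $U$ by some finite control spaces $U^\delta \subset U$ such that for $\delta \le \delta'$, $U^{\delta'} \subset U^{\delta}$ and $\lim_{\delta \to 0} U^\delta = U$.

We would like to update the value of any:

- interior point $\xi \in \Sigma^\delta \cap O$ with the discounted values at state $\eta_n(\xi, u)$ (figure 2):

$$V^\delta_{n+1}(\xi) \leftarrow \sup_{u \in U^\delta} \left[ \gamma^{\tau_n(\xi, u)}\, V^\delta_n(\eta_n(\xi, u)) + \tau_n(\xi, u)\, r_n(\xi, u) \right] \tag{5}$$

for some state $\eta_n(\xi, u)$, some time delay $\tau_n(\xi, u)$ and some reinforcement $r_n(\xi, u)$.

- boundary point $\xi \in \partial\Sigma^\delta$ with some terminal reinforcement $R_n(\xi)$:

$$V^\delta_{n+1}(\xi) \leftarrow R_n(\xi) \tag{6}$$

The following theorem states that the values $V^\delta_n$ computed by a RL algorithm using the model $\eta_n(\xi, u)$, $\tau_n(\xi, u)$, $r_n(\xi, u)$ and $R_n(\xi)$ (a model is needed because of some a priori partial uncertainty of the state dynamics and the reinforcement functions) converge to the optimal value function as the number of iterations $n \to \infty$ and the resolution $\delta \to 0$.

Let us define the state $\eta(\xi, u)$ (see figure 2):

$$\eta(\xi, u) = \xi + \tau(\xi, u)\, f(\xi, u) \tag{7}$$

for some time delay $\tau(\xi, u)$ (with $k_1 \delta \le \tau(\xi, u) \le k_2 \delta$ for some constants $k_1 > 0$ and $k_2 > 0$), and let $p(\eta|\xi_i)$ (resp. $p(\eta_n|\xi_i)$) be the barycentric coordinate of $\eta$ inside a simplex containing it (resp. of $\eta_n$ inside the same simplex). We will write $\eta$, $\eta_n$, $\tau$, $r$, ..., instead of $\eta(\xi, u)$, $\eta_n(\xi, u)$, $\tau(\xi, u)$, $r(\xi, u)$, ... when no confusion is possible.

Theorem 2 Assume that the hypotheses of the previous sections hold, and that for any resolution $\delta$, we use barycentric interpolators $V^\delta$ defined on state spaces $\Sigma^\delta$ (satisfying (4)) such that all points of $\Sigma^\delta \cap O$ are regularly updated with rule (5) and all points of $\partial\Sigma^\delta$ are updated with rule (6) at least once. Suppose that $\eta_n$, $\tau_n$, $r_n$ and $R_n$ approximate $\eta$, $\tau$, $r$ and $R$ in the sense:

$$\forall \xi_i,\quad p(\eta_n|\xi_i) = p(\eta|\xi_i) + O(\delta) \tag{8}$$
$$\tau_n = \tau + O(\delta^2) \tag{9}$$
$$r_n = r + O(\delta) \tag{10}$$
$$R_n = R + O(\delta) \tag{11}$$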

then we have $\lim_{n \to \infty,\, \delta \to 0} V^\delta_n = V$ uniformly on any compact $\Omega \subset O$ (i.e. $\forall \varepsilon > 0$, $\forall \Omega$ compact $\subset O$, $\exists \Delta$, $\exists N$, such that $\forall \delta \le \Delta$, $\forall n \ge N$, $\sup_{\Sigma^\delta \cap \Omega} |V^\delta_n - V| \le \varepsilon$).

Remark 3 For a given value of $\delta$, the rule (5) is not a DP updating rule for some Markov Decision Problem (MDP) since the values $\eta_n$, $\tau_n$, $r_n$ depend on $n$. This point is important in the RL framework since it allows on-line improvement of the model of the state dynamics and the reinforcement functions.

Remark 4 This result extends the previous results of convergence obtained by Finite-Element or Finite-Difference methods (see [Mun97]).

This theoretical result can be applied by starting from a rough $\Sigma^\delta$ (high $\delta$) and by combining with the iteration process ($n \to \infty$) some learning process of the model ($\eta_n \to \eta$) and an increasing process of the number of points ($\delta \to 0$).

5 COMPUTATIONAL ISSUES

From (8) we deduce that the method will also converge if we use an approximate barycentric interpolator, defined at any state $x \in (\xi_1, \dots, \xi_n)$ by the value of the barycentric interpolator at some state $x' \in (\xi_1, \dots, \xi_n)$ such that $p(x'|\xi_i) = p(x|\xi_i) + O(\delta)$ (see figure 3).

Figure 3: The linear function and the approximation error around it (the grey area). The value of the approximate linear function plotted here at some state $x$ is equal to the value of the linear one at $x'$. Any such approximate barycentric interpolator can be used in (5).

The fact that we need not be completely accurate can be used to our advantage. First, the computation of barycentric coordinates can use very fast approximate matrix methods. Second, the model we use to integrate the dynamics need not be perfect. We can make an $O(\delta^2)$ error, which is useful if we are learning a model from data: we simply need to arrange not to gather more data than is necessary for the current $\delta$. For example, if we use nearest neighbor for our dynamics learning, we need to ensure enough data so that every observation is $O(\delta^2)$ from its nearest neighbor. If we use local regression, then a mere $O(\delta)$ density is all that is required [Omo87, AMS97].

6 PROOF OF THE CONVERGENCE RESULT

6.1 Description of the approximation scheme

We use a convergent scheme derived from Kushner (see [Kus90]) in order to approximate the continuous control problem by a finite MDP. The HJB equation is discretized, at some resolution $\delta$, into the following DP equation: for $\xi \in \Sigma^\delta \cap O$,

$$V^\delta(\xi) = F^\delta[V^\delta(\cdot)](\xi) \stackrel{\text{def}}{=} \sup_{u \in U^\delta} \Bigl\{ \gamma^{\tau} \sum_i p(\eta|\xi_i)\, V^\delta(\xi_i) + \tau\, r \Bigr\} \tag{12}$$

and for $\xi \in \partial\Sigma^\delta$, $V^\delta(\xi) = R(\xi)$. This is a fixed-point equation and we can prove that, thanks to the discount factor $\gamma$, it satisfies the "strong" contraction property:

$$\sup_{\Sigma^\delta} \bigl| V^\delta_{n+1} - V^\delta \bigr| \le \lambda \cdot \sup_{\Sigma^\delta} \bigl| V^\delta_n - V^\delta \bigr| \quad \text{for some } \lambda < 1. \tag{13}$$
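The DP equation (12) can be solved by iterating the operator $F^\delta$ until the values stop changing. The sketch below does this on a deliberately simple instance, under assumptions that are not in the paper: a one-dimensional state space $O = (0, 1)$ with a uniform grid $\xi_i = i\delta$, a piecewise-linear barycentric interpolator, dynamics satisfying $|f| \le 1$, and the time delay $\tau = \delta/2$ (so $k_1 = k_2 = 1/2$). The names `interp1d` and `value_iteration` are hypothetical.

```python
def interp1d(grid_values, delta, x):
    """Piecewise-linear barycentric interpolator on the grid xi_i = i * delta."""
    n = len(grid_values) - 1
    i = min(max(int(x / delta), 0), n - 1)           # index of the cell [xi_i, xi_{i+1}]
    p_right = min(max(x / delta - i, 0.0), 1.0)      # coordinate p(x | xi_{i+1})
    return (1.0 - p_right) * grid_values[i] + p_right * grid_values[i + 1]


def value_iteration(n_cells, gamma, f, r, R, controls, tol=1e-10, max_iter=100000):
    """Iterate V <- F[V] for the DP equation (12) on O = (0, 1)."""
    delta = 1.0 / n_cells
    tau = 0.5 * delta                     # assumed time delay, k1 = k2 = 1/2
    V = [0.0] * (n_cells + 1)
    V[0], V[-1] = R(0.0), R(1.0)          # boundary points keep V = R
    for _ in range(max_iter):
        new_V = V[:]
        for i in range(1, n_cells):       # interior points
            xi = i * delta
            new_V[i] = max(
                gamma ** tau * interp1d(V, delta, xi + tau * f(xi, u)) + tau * r(xi, u)
                for u in controls
            )
        change = max(abs(a - b) for a, b in zip(new_V, V))
        V = new_V
        if change < tol:
            break
    return V, delta


# Usage: dx/dt = u with u in {-1, +1}, no running reinforcement,
# boundary reinforcement R(0) = 1 and R(1) = 2, discount factor 1/2.
V, delta = value_iteration(
    n_cells=20, gamma=0.5,
    f=lambda x, u: u,
    r=lambda x, u: 0.0,
    R=lambda x: 1.0 if x < 0.5 else 2.0,
    controls=(-1.0, 1.0),
)
```

For this instance the optimal value function is $V(x) = 2^x$ (always move right and collect $R(1) = 2$ after a delay $1 - x$), so the grid values can be compared against it. Replacing `f`, `r`, `R` and `tau` by approximations that change with the iteration count would turn this loop into an instance of rule (5).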

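From this contraction property we deduce that there exists exactly one solution $V^\delta$ to the DP equation, which can be computed by some value iteration process: for any initial $V^\delta_0$, we iterate $V^\delta_{n+1} \leftarrow F^\delta[V^\delta_n]$. Thus for any resolution $\delta$, the values $V^\delta_n \to V^\delta$ as $n \to \infty$.

Moreover, as $V^\delta$ is a barycentric interpolator and from the definition (7) of $\eta$,

$$F^\delta[V^\delta(\cdot)](\xi) = \sup_{u \in U^\delta} \bigl\{ \gamma^{\tau}\, V^\delta(\xi + \tau\, f(\xi, u)) + \tau\, r \bigr\} \tag{14}$$

from which we deduce that the scheme $F^\delta$ is consistent: in a formal sense,

$$\limsup_{\delta \to 0} \frac{1}{\delta} \bigl| F^\delta[W](x) - W(x) \bigr| \sim H(W, DW, x) \tag{15}$$

and obtain, from the general convergence theorem of [BS91] (and a result of strong uniqueness obtained from Hyp 1), the convergence of the scheme: $V^\delta \to V$ as $\delta \to 0$.

6.2 Use of the "weak contraction" result of convergence

Since, in the RL approach used here, we only have an approximation $\eta_n$, $\tau_n$, ... of the true values $\eta$, $\tau$, ..., the strong contraction property (13) does not hold any more. However, in previous work ([Mun98]), we have proven the convergence for some weakened conditions, recalled here:

If the values $V^\delta_n$ updated by some algorithm satisfy the "weak" contraction property with respect to a solution $V^\delta$ of a convergent approximation scheme (such as the previous one (12)):

$$\sup_{\Sigma^\delta \cap O} \bigl| V^\delta_{n+1} - V^\delta \bigr| \le (1 - k\,\delta) \cdot \sup_{\Sigma^\delta} \bigl| V^\delta_n - V^\delta \bigr| + o(\delta) \tag{16}$$
$$\sup_{\partial\Sigma^\delta} \bigl| V^\delta_{n+1} - V^\delta \bigr| = O(\delta) \tag{17}$$

for some positive constant $k$ (with the notation $f(\delta) \le o(\delta)$ iff $\exists g(\delta) = o(\delta)$ with $f(\delta) \le g(\delta)$), then we have $\lim_{n \to \infty,\, \delta \to 0} V^\delta_n = V$ uniformly on any compact $\Omega \subset O$ (i.e. $\forall \varepsilon > 0$, $\forall \Omega$ compact $\subset O$, $\exists \Delta$ and $N$ such that $\forall \delta \le \Delta$, $\forall n \ge N$, $\sup_{\Sigma^\delta \cap \Omega} |V^\delta_n - V| \le \varepsilon$).

6.3 Proof of theorem 2

We are going to use the approximations (8), (9), (10) and (11) to deduce that the weak contraction property holds, and then use the result of the previous section to prove theorem 2.

The proof of (17) is immediate since, from (6) and (11), we have: $\forall \xi \in \partial\Sigma^\delta$,

$$\bigl| V^\delta_{n+1}(\xi) - V^\delta(\xi) \bigr| = \bigl| R_n(\xi) - R(\xi) \bigr| = O(\delta).$$

Now we need to prove (16). Let us estimate the error $E_n(\xi) = V^\delta(\xi) - V^\delta_n(\xi)$ between the value $V^\delta$ of the DP equation (12) and the values $V^\delta_n$ computed by rule (5) after one iteration. Since a difference of suprema is bounded by the supremum of the differences,

$$|E_{n+1}(\xi)| \le \sup_{u \in U^\delta} \Bigl| \gamma^{\tau} \sum_i p(\eta|\xi_i)\, V^\delta(\xi_i) - \gamma^{\tau_n} \sum_i p(\eta_n|\xi_i)\, V^\delta_n(\xi_i) + \tau\, r - \tau_n\, r_n \Bigr|$$

and the quantity inside the absolute value can be regrouped as

$$\gamma^{\tau} \sum_i \bigl[ p(\eta|\xi_i) - p(\eta_n|\xi_i) \bigr] V^\delta(\xi_i) + \bigl[ \gamma^{\tau} - \gamma^{\tau_n} \bigr] \sum_i p(\eta_n|\xi_i)\, V^\delta(\xi_i) + \gamma^{\tau_n} \sum_i p(\eta_n|\xi_i) \bigl[ V^\delta(\xi_i) - V^\delta_n(\xi_i) \bigr] + \tau_n \bigl[ r - r_n \bigr] + \bigl[ \tau - \tau_n \bigr] r.$$

By using (9) (from which we deduce: $\gamma^{\tau} = \gamma^{\tau_n} + O(\delta^2)$) and (10), we deduce:

$$|E_{n+1}(\xi)| \le \sup_{u \in U^\delta} \Bigl\{ \gamma^{\tau} \Bigl| \sum_i \bigl[ p(\eta|\xi_i) - p(\eta_n|\xi_i) \bigr] V^\delta(\xi_i) \Bigr| + \gamma^{\tau_n} \sum_i p(\eta_n|\xi_i) \bigl| V^\delta(\xi_i) - V^\delta_n(\xi_i) \bigr| \Bigr\} + O(\delta^2). \tag{18}$$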
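The $O(\delta^2)$ remainder in (18) gathers the three terms of the preceding decomposition that no longer appear explicitly. A short check, writing $M$ for a bound on $|V^\delta|$ (the symbol $M$ is introduced here for illustration; the existence of such a bound is assumed to follow from the boundedness of $r$ and $R$):

```latex
\begin{align*}
\bigl|\gamma^{\tau} - \gamma^{\tau_n}\bigr| \,
  \Bigl|\sum_i p(\eta_n|\xi_i)\, V^\delta(\xi_i)\Bigr|
  &\le M \, \bigl|\gamma^{\tau} - \gamma^{\tau_n}\bigr| = O(\delta^2)
  && \text{by (9), since } p(\eta_n|\xi_i) \ge 0,\ \textstyle\sum_i p(\eta_n|\xi_i) = 1, \\
\tau_n \, |r - r_n|
  &= O(\delta) \cdot O(\delta) = O(\delta^2)
  && \text{by (10) and } \tau_n \le k_2\,\delta + O(\delta^2), \\
|\tau - \tau_n| \, |r|
  &= O(\delta^2)
  && \text{by (9) and the boundedness of } r.
\end{align*}
```

Only the two remaining terms, the mismatch of barycentric coordinates and the propagated error $V^\delta - V^\delta_n$, need further treatment.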

From the basic properties of the coefficients $p(\eta|\xi_i)$ and $p(\eta_n|\xi_i)$ (both sets sum to 1), we have:

$$\sum_i \bigl[ p(\eta|\xi_i) - p(\eta_n|\xi_i) \bigr] V^\delta(\xi_i) = \sum_i \bigl[ p(\eta|\xi_i) - p(\eta_n|\xi_i) \bigr] \bigl[ V^\delta(\xi_i) - V^\delta(\xi) \bigr] \tag{19}$$

Moreover, $|V^\delta(\xi_i) - V^\delta(\xi)| \le |V^\delta(\xi_i) - V(\xi_i)| + |V(\xi_i) - V(\xi)| + |V(\xi) - V^\delta(\xi)|$. From the convergence of the scheme $V^\delta$, we have $\sup_{\Sigma^\delta \cap \Omega} |V^\delta - V| \to 0$ as $\delta \downarrow 0$ for any compact $\Omega \subset O$, and from the continuity of $V$ and the fact that the support of the simplex $\{\xi_i\} \ni \eta$ is $O(\delta)$, we have $\sup_{\Sigma^\delta \cap \Omega} |V(\xi_i) - V(\xi)| \to 0$ as $\delta \downarrow 0$, and deduce that $\sup_{\Sigma^\delta \cap \Omega} |V^\delta(\xi_i) - V^\delta(\xi)| \to 0$ as $\delta \downarrow 0$. Thus, from (19) and (8), we obtain:

$$\Bigl| \sum_i \bigl[ p(\eta|\xi_i) - p(\eta_n|\xi_i) \bigr] V^\delta(\xi_i) \Bigr| = o(\delta) \tag{20}$$

The "weak" contraction property (16) holds: from the property of the exponential function $\gamma^{\tau_n} \le 1 - \frac{\tau_n}{2} \ln\frac{1}{\gamma}$ for small values of $\tau_n$, from (9) and the fact that $\tau \ge k_1 \delta$, we deduce that $\gamma^{\tau_n} \le 1 - \frac{k_1}{2}\,\delta \ln\frac{1}{\gamma} + O(\delta^2)$, and from (18) and (20) we deduce that:

$$\bigl| V^\delta_{n+1}(\xi) - V^\delta(\xi) \bigr| \le (1 - k\,\delta) \sup_{\Sigma^\delta} \bigl| V^\delta_n - V^\delta \bigr| + o(\delta)$$

with $k = \frac{k_1}{2} \ln\frac{1}{\gamma}$, and the property (16) holds. Thus the "weak contraction" result of convergence (described in section 6.2) applies and convergence occurs.

FUTURE WORK

This work proves the convergence to the optimal value as the resolution tends to the limit, but does not provide us with the rate of convergence. Our future work will focus on defining upper bounds of the approximation error, especially for variable resolution discretizations, and we will also consider the stochastic case.

References

[AMS97] C. G. Atkeson, A. W. Moore, and S. A. Schaal. Locally Weighted Learning. AI Review, 11:11-73, April 1997.
[Bar94] Guy Barles. Solutions de viscosité des équations de Hamilton-Jacobi, volume 17 of Mathématiques et Applications. Springer-Verlag, 1994.
[BS91] Guy Barles and P.E. Souganidis. Convergence of approximation schemes for fully nonlinear second order equations. Asymptotic Analysis, 4:271-283, 1991.
[Dav96] Scott Davies. Multidimensional triangulation and interpolation for reinforcement learning. Advances in Neural Information Processing Systems, 8, 1996.
[FS93] Wendell H. Fleming and H. Mete Soner. Controlled Markov Processes and Viscosity Solutions. Applications of Mathematics. Springer-Verlag, 1993.
[Gor95] G. Gordon. Stable function approximation in dynamic programming. International Conference on Machine Learning, 1995.
[Kus90] Harold J. Kushner. Numerical methods for stochastic control problems in continuous time. SIAM J. Control and Optimization, 28:999-1048, 1990.
[Mun97] Rémi Munos. A convergent reinforcement learning algorithm in the continuous case based on a finite difference method. International Joint Conference on Artificial Intelligence, 1997.
[Mun98] Rémi Munos. A general convergence theorem for reinforcement learning in the continuous case. European Conference on Machine Learning, 1998.
[Omo87] S. M. Omohundro. Efficient Algorithms with Neural Network Behaviour. Journal of Complex Systems, 1(2):273-347, 1987.