On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization


On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

André M. S. Barreto (School of Computer Science, McGill University, Montreal, Canada), Doina Precup (School of Computer Science, McGill University, Montreal, Canada), Joelle Pineau (School of Computer Science, McGill University, Montreal, Canada)

Abstract

Kernel-based stochastic factorization (KBSF) is an algorithm for solving reinforcement-learning tasks with continuous state spaces which builds a Markov decision process (MDP) based on a set of sample transitions. What sets KBSF apart from other kernel-based approaches is the fact that the size of its MDP is independent of the number of transitions, which makes it possible to control the trade-off between the quality of the resulting approximation and the associated computational cost. However, KBSF's memory usage grows linearly with the number of transitions, precluding its application in scenarios where a large amount of data must be processed. In this paper we show that it is possible to construct KBSF's MDP in a fully incremental way, thus freeing the space complexity of this algorithm from its dependence on the number of sample transitions. The incremental version of KBSF is able to process an arbitrary amount of data, which results in a model-based reinforcement-learning algorithm that can be used to solve continuous MDPs in both off-line and on-line regimes. We present theoretical results showing that KBSF can approximate the value function that would be computed by conventional kernel-based learning with arbitrary precision. We empirically demonstrate the effectiveness of the proposed algorithm in the challenging three-pole balancing task, in which the ability to process a large number of transitions is crucial for success.

1 Introduction

The task of learning a policy for a sequential decision problem with a continuous state space is a long-standing challenge that has attracted the attention of the reinforcement-learning community for years. Among the many approaches that have been proposed to solve this problem, kernel-based reinforcement learning (KBRL) stands out for its good theoretical guarantees [1, 2]. KBRL solves a continuous state-space Markov decision process (MDP) using a finite model constructed based on sample transitions only.
By casting the problem as non-parametric approximation, it provides a statistically consistent way of approximating an MDP's value function. Moreover, since it comes down to the solution of a finite model, KBRL always converges to a unique solution. Unfortunately, the good theoretical properties of kernel-based learning come at a price: since the model constructed by KBRL grows with the amount of sample transitions, the number of operations performed by this algorithm quickly becomes prohibitively large as more data become available. Such a computational burden severely limits the applicability of KBRL to real reinforcement-learning (RL) problems. Realizing that, many researchers have proposed ways of turning KBRL into a more practical tool [3, 4, 5]. In this paper we focus on our own approach to leverage KBRL, an algorithm called kernel-based stochastic factorization (KBSF) [4]. KBSF uses KBRL's kernel-based strategy to perform a soft aggregation of the states of its MDP. By doing so, our algorithm is able to summarize the information contained in KBRL's model in an MDP whose size is independent of the number of sample transitions. KBSF enjoys good theoretical

guarantees and has shown excellent performance on several tasks [4]. The main limitation of the algorithm is the fact that, in order to construct its model, it uses an amount of memory that grows linearly with the number of sample transitions. Although this is a significant improvement over KBRL, it still hinders the application of KBSF in scenarios in which a large amount of data must be processed, such as in complex domains or in on-line reinforcement learning. In this paper we show that it is possible to construct KBSF's MDP in a fully incremental way, thus freeing the space complexity of this algorithm from its dependence on the number of sample transitions. In order to distinguish it from its original, batch counterpart, we call this new version of our algorithm incremental KBSF, or iKBSF for short. As will be seen, iKBSF is able to process an arbitrary number of sample transitions. This results in a model-based RL algorithm that can be used to solve continuous MDPs in both off-line and on-line regimes. A second important contribution of this paper is a theoretical analysis showing that it is possible to control the error in the value-function approximation performed by KBSF. In our previous experiments with KBSF, we defined the model used by this algorithm by clustering the sample transitions and then using the clusters' centers as the representative states in the reduced MDP [4]. However, we did not provide a theoretical justification for such a strategy. In this paper we fill this gap by showing that we can approximate KBRL's value function at any desired level of accuracy by minimizing the distance from a sampled state to the nearest representative state. Besides its theoretical interest, the bound is also relevant from a practical point of view, since it can be used in iKBSF to guide the on-line selection of representative states. Finally, a third contribution of this paper is an empirical demonstration of the performance of iKBSF in a new, challenging control problem: the triple pole-balancing task, an extension of the well-known double pole-balancing domain. Here, iKBSF's ability to process a large number of transitions is crucial for achieving a high success rate, which cannot be easily replicated with batch methods.
2 Background

In reinforcement learning, an agent interacts with an environment in order to find a policy that maximizes the discounted sum of rewards [6]. As usual, we assume that such an interaction can be modeled as a Markov decision process (MDP) [7]. An MDP is a tuple M ≡ (S, A, P^a, r^a, γ), where S is the state space and A is the finite action set. In this paper we are mostly concerned with MDPs with continuous state spaces, but our strategy will be to approximate such models as finite MDPs. In a finite MDP the matrix P^a ∈ R^{|S|×|S|} gives the transition probabilities associated with action a ∈ A and the vector r^a ∈ R^{|S|} stores the corresponding expected rewards. The discount factor γ ∈ [0, 1) is used to give smaller weights to rewards received further in the future.

Consider an MDP M with continuous state space S ⊆ [0, 1]^d. Kernel-based reinforcement learning (KBRL) uses sample transitions to derive a finite MDP that approximates the continuous model [1, 2]. Let S^a ≡ {(s^a_k, r^a_k, ŝ^a_k) | k = 1, 2, ..., n_a} be sample transitions associated with action a ∈ A, where s^a_k, ŝ^a_k ∈ S and r^a_k ∈ R. Let φ : R_+ → R_+ be a Lipschitz continuous function and let k_τ(s, s') be a kernel function defined as k_τ(s, s') = φ(‖s − s'‖/τ), where ‖·‖ is a norm in R^d and τ > 0. Finally, define the normalized kernel function associated with action a as

κ^a_τ(s, s^a_i) = k_τ(s, s^a_i) / Σ_{j=1}^{n_a} k_τ(s, s^a_j).

The model constructed by KBRL has the following transition and reward functions:

P̂^a(s' | s) = κ^a_τ(s, s^a_i) if s' = ŝ^a_i, and 0 otherwise;  R̂^a(s, s') = r^a_i if s' = ŝ^a_i, and 0 otherwise.  (1)

Since only transitions ending in the states ŝ^a_i have a non-zero probability of occurrence, one can define a finite MDP M̂ composed solely of these n = Σ_a n_a states [2, 3]. After V̂, the optimal value function of M̂, has been computed, the value of any state-action pair can be determined as

Q(s, a) = Σ_{i=1}^{n_a} κ^a_τ(s, s^a_i) [r^a_i + γ V̂(ŝ^a_i)],

where s ∈ S and a ∈ A. Ormoneit and Sen [1] proved that, if n_a → ∞ for all a ∈ A and the widths of the kernels τ shrink at an admissible rate, the probability of choosing a suboptimal action based on Q(s, a) converges to zero. Using dynamic programming, one can compute the optimal value function of M̂, but the time and space required to do so grow fast with the number of states n [7, 8].
Therefore, the use of KBRL leads to a dilemma: on the one hand, one wants to use as many transitions as possible to capture the dynamics of M, but on the other hand one wants to have an MDP M̂ of manageable size.
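The KBRL construction above can be condensed into a few lines. The following is a minimal sketch of ours, not the authors' code: it assumes a Gaussian kernel φ(x) = exp(−x²), and `samples[a]` is a hypothetical container holding, for action a, the start states s^a_k, the rewards r^a_k, and the value V̂ already computed at the end states ŝ^a_k.

```python
import numpy as np

def gaussian_kernel(s, states, tau):
    """k_tau(s, s') = phi(||s - s'||/tau) with phi(x) = exp(-x^2)."""
    d = np.linalg.norm(states - s, axis=1)  # ||s - s^a_k|| for every sample
    return np.exp(-(d / tau) ** 2)

def kbrl_q_value(s, a, samples, tau, gamma=0.9):
    """Q(s, a) = sum_i kappa^a_tau(s, s^a_i) * (r^a_i + gamma * V_hat(shat^a_i))."""
    starts, rewards, v_ends = samples[a]
    k = gaussian_kernel(s, starts, tau)
    kappa = k / k.sum()  # normalized kernel kappa^a_tau
    return float(kappa @ (rewards + gamma * v_ends))
```

Note the n_a-term sum: every stored transition for action a enters every query, which is precisely the cost that motivates KBSF below.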

Kernel-based stochastic factorization

KBSF provides a practical way of weighing these two conflicting objectives [4]. Our algorithm compresses the information contained in KBRL's model M̂ in an MDP M̄ whose size is independent of the number of transitions n. The fundamental idea behind KBSF is the stochastic-factorization trick, which we now summarize. Let P ∈ R^{n×n} be a transition-probability matrix and let P = DK be a factorization in which D ∈ R^{n×m} and K ∈ R^{m×n} are stochastic matrices. Then, swapping the factors D and K yields another transition matrix P̄ = KD that retains the basic topology of P, that is, the number of recurrent classes and their respective reducibilities and periodicities [9]. The insight is that, in some cases, one can work with P̄ instead of P; when m ≪ n, this replacement affects significantly the memory usage and computing time.

KBSF results from the application of the stochastic-factorization trick to M̂. Let S̄ ≡ {s̄_1, s̄_2, ..., s̄_m} be a set of representative states in S. KBSF computes matrices Ḋ^a ∈ R^{n_a×m} and K̇^a ∈ R^{m×n_a} with elements ḋ^a_{ij} = κ̄_τ(ŝ^a_i, s̄_j) and k̇^a_{ij} = κ^a_τ(s̄_i, s^a_j), where κ̄_τ is defined as κ̄_τ(s, s̄_i) = k_τ(s, s̄_i) / Σ_{j=1}^{m} k_τ(s, s̄_j). The basic idea of the algorithm is to replace the MDP M̂ with M̄ ≡ (S̄, A, P̄^a, r̄^a, γ), where P̄^a = K̇^a Ḋ^a and r̄^a = K̇^a r^a (r^a ∈ R^{n_a} is the vector composed of sample rewards r^a_i). Thus, instead of solving an MDP with n states, one solves a model with m states only. Let D ∈ R^{n×m} be the matrix obtained by stacking the matrices Ḋ^1, Ḋ^2, ..., Ḋ^{|A|}, and let K̄^a ∈ R^{m×n} be the matrix with all elements equal to zero except for those corresponding to matrix K̇^a (see [4] for details). Based on Q̄, the optimal action-value function of M̄, one can obtain an approximate value function for M̂ as ṽ = Γ(DQ̄), where Γ is the max operator applied row-wise, that is, ṽ_i = max_a (DQ̄)_{ia}. We have showed that the error in ṽ is bounded by

‖v̂ − ṽ‖_∞ ≤ (1/(1−γ)) max_a ‖r̂^a − D r̄^a‖_∞ + (γ/(1−γ)²) C̄ max_i (1 − max_j ḋ_{ij}) + (Ĉγ/(2(1−γ)²)) max_a ‖P̂^a − D K̄^a‖_∞,  (2)

where ‖·‖_∞ is the infinity norm, v̂ ∈ R^n is the optimal value function of KBRL's MDP, Ĉ ≡ max_{i,a} r̂^a_i − min_{i,a} r̂^a_i, and C̄ ≡ max_{i,a} r̄^a_i − min_{i,a} r̄^a_i.

3 Incremental kernel-based stochastic factorization

In the batch version of KBSF, described in Section 2, the matrices P̄^a and vectors r̄^a are determined using all the transitions in the corresponding sets S^a simultaneously. This has two undesirable consequences.
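The stochastic-factorization trick itself is easy to verify numerically. The sketch below is our own illustration with arbitrary random factors: it builds row-stochastic D and K, forms P = DK and the swapped product P̄ = KD, and checks that both are valid transition matrices; when m ≪ n, working with the m×m matrix P̄ is what saves memory and time.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 2  # m << n is the regime where the trick pays off

def random_stochastic(rows, cols):
    """Row-stochastic matrix: non-negative entries, each row sums to 1."""
    x = rng.random((rows, cols))
    return x / x.sum(axis=1, keepdims=True)

D = random_stochastic(n, m)  # D in R^{n x m}
K = random_stochastic(m, n)  # K in R^{m x n}

P = D @ K        # n x n transition matrix admitting the factorization
P_bar = K @ D    # m x m matrix obtained by swapping the factors

# Both products are valid transition matrices.
assert np.allclose(P.sum(axis=1), 1.0)
assert np.allclose(P_bar.sum(axis=1), 1.0)
```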
First, the construction of the MDP M̄ requires an amount of memory of O(n_max m), where n_max = max_a n_a. Although this is a significant improvement over KBRL's memory usage, which is O(n²_max), in more challenging domains even a linear dependence on n_max may be impractical. Second, with batch KBSF the only way to incorporate new data into the model M̄ is to recompute the multiplication P̄^a = K̇^a Ḋ^a for all actions a for which there are new sample transitions available. Even if we ignore the issue of memory usage, this is clearly inefficient in terms of computation. In this section we present an incremental version of KBSF that circumvents these important limitations.

Suppose we split the set of sample transitions S^a in two subsets S_1 and S_2 such that S_1 ∩ S_2 = ∅ and S_1 ∪ S_2 = S^a. Without loss of generality, suppose that the sample transitions are indexed so that S_1 ≡ {(s^a_k, r^a_k, ŝ^a_k) | k = 1, 2, ..., n_1} and S_2 ≡ {(s^a_k, r^a_k, ŝ^a_k) | k = n_1+1, n_1+2, ..., n_1+n_2 = n}. Let P̄^{S_1} and r̄^{S_1} be the matrix P̄^a and the vector r̄^a computed by KBSF using only the n_1 transitions in S_1 (if n_1 = 0, we define P̄^{S_1} = 0 ∈ R^{m×m} and r̄^{S_1} = 0 ∈ R^m for all a ∈ A). We want to compute P̄^{S_1∪S_2} and r̄^{S_1∪S_2} from P̄^{S_1}, r̄^{S_1}, and S_2, without using the set of sample transitions S_1.

We start with the transition matrices P̄^a. We know that

p̄^{S_1}_{ij} = Σ_{t=1}^{n_1} k̇^a_{it} ḋ^a_{tj} = Σ_{t=1}^{n_1} [k_τ(s̄_i, s^a_t) / Σ_{l=1}^{n_1} k_τ(s̄_i, s^a_l)] · [k_τ(ŝ^a_t, s̄_j) / Σ_{l=1}^{m} k_τ(ŝ^a_t, s̄_l)]
           = (1 / Σ_{l=1}^{n_1} k_τ(s̄_i, s^a_l)) Σ_{t=1}^{n_1} k_τ(s̄_i, s^a_t) k_τ(ŝ^a_t, s̄_j) / Σ_{l=1}^{m} k_τ(ŝ^a_t, s̄_l).

To simplify the notation, define w^{S_1}_i ≡ Σ_{l=1}^{n_1} k_τ(s̄_i, s^a_l), w^{S_2}_i ≡ Σ_{l=n_1+1}^{n_1+n_2} k_τ(s̄_i, s^a_l), and c^t_{ij} ≡ k_τ(s̄_i, s^a_t) k_τ(ŝ^a_t, s̄_j) / Σ_{l=1}^{m} k_τ(ŝ^a_t, s̄_l), with t ∈ {1, 2, ..., n_1+n_2}. Then,

p̄^{S_1∪S_2}_{ij} = (1/(w^{S_1}_i + w^{S_2}_i)) [Σ_{t=1}^{n_1} c^t_{ij} + Σ_{t=n_1+1}^{n_1+n_2} c^t_{ij}] = (1/(w^{S_1}_i + w^{S_2}_i)) [p̄^{S_1}_{ij} w^{S_1}_i + Σ_{t=n_1+1}^{n_1+n_2} c^t_{ij}].

Now, defining b^{S_2}_{ij} ≡ Σ_{t=n_1+1}^{n_1+n_2} c^t_{ij}, we have the simple update rule:

p̄^{S_1∪S_2}_{ij} = (1/(w^{S_1}_i + w^{S_2}_i)) (b^{S_2}_{ij} + p̄^{S_1}_{ij} w^{S_1}_i).  (3)

We can apply similar reasoning to derive an update rule for the rewards r̄^a_i. We know that

r̄^{S_1}_i = (1 / Σ_{l=1}^{n_1} k_τ(s̄_i, s^a_l)) Σ_{t=1}^{n_1} k_τ(s̄_i, s^a_t) r^a_t = (1/w^{S_1}_i) Σ_{t=1}^{n_1} k_τ(s̄_i, s^a_t) r^a_t.

Let h^t_i ≡ k_τ(s̄_i, s^a_t) r^a_t, with t ∈ {1, 2, ..., n_1+n_2}. Then,

r̄^{S_1∪S_2}_i = (1/(w^{S_1}_i + w^{S_2}_i)) [Σ_{t=1}^{n_1} h^t_i + Σ_{t=n_1+1}^{n_1+n_2} h^t_i] = (1/(w^{S_1}_i + w^{S_2}_i)) [r̄^{S_1}_i w^{S_1}_i + Σ_{t=n_1+1}^{n_1+n_2} h^t_i].

Defining e^{S_2}_i ≡ Σ_{t=n_1+1}^{n_1+n_2} h^t_i, we have the following update rule:

r̄^{S_1∪S_2}_i = (1/(w^{S_1}_i + w^{S_2}_i)) (e^{S_2}_i + r̄^{S_1}_i w^{S_1}_i).  (4)

Since b^{S_2}_{ij}, e^{S_2}_i, and w^{S_2}_i can be computed based on S_2 only, we can discard the sample transitions in S_1 after computing P̄^{S_1} and r̄^{S_1}. To do that, we only have to keep the variables w^{S_1}_i. These variables can be stored in |A| vectors w^a ∈ R^m, resulting in a modest memory overhead. Note that we can apply the ideas above recursively, further splitting the sets S_1 and S_2 in subsets of smaller size. Thus, we have a fully incremental way of computing KBSF's MDP which requires almost no extra memory. Algorithm 1 shows a step-by-step description of how to update M̄ based on a set of sample transitions. Using this method to update its model, KBSF's space complexity drops from O(nm) to O(m²). Since the amount of memory used by KBSF is now independent of n, it can process an arbitrary number of sample transitions.
Algorithm 1: Update KBSF's MDP
  Input: P̄^a, r̄^a, w^a for all a ∈ A; S^a for all a ∈ A
  Output: updated M̄ and w^a

  for each a ∈ A do
    for t = 1, ..., n_a do              // n_a = |S^a|
      z_t ← Σ_{l=1}^{m} k_τ(ŝ^a_t, s̄_l)
    for i = 1, 2, ..., m do
      w'_i ← Σ_{t=1}^{n_a} k_τ(s̄_i, s^a_t)
      for j = 1, 2, ..., m do
        b ← Σ_{t=1}^{n_a} k_τ(s̄_i, s^a_t) k_τ(ŝ^a_t, s̄_j) / z_t
        p̄^a_{ij} ← (1/(w^a_i + w'_i)) (b + p̄^a_{ij} w^a_i)
      e ← Σ_{t=1}^{n_a} k_τ(s̄_i, s^a_t) r^a_t
      r̄^a_i ← (1/(w^a_i + w'_i)) (e + r̄^a_i w^a_i)
      w^a_i ← w^a_i + w'_i

Algorithm 2: Incremental KBSF (iKBSF)
  Input: representative states s̄_1, s̄_2, ..., s̄_m; t_m, interval to update the model; t_v, interval to update the value function; n, total number of sample transitions
  Output: approximate value function Q̃(s, a) = Σ_{i=1}^{m} κ̄_τ(s, s̄_i) q̄_{ia}

  Q̄ ← arbitrary matrix in R^{m×|A|}
  P̄^a ← 0 ∈ R^{m×m}, r̄^a ← 0 ∈ R^m, w^a ← 0 ∈ R^m, S^a ← ∅, for all a ∈ A
  for t = 1, 2, ..., n do
    Select a based on Q̃(s_t, a) = Σ_{i=1}^{m} κ̄_τ(s_t, s̄_i) q̄_{ia}
    Execute a in s_t and observe r_t and ŝ_t
    S^a ← S^a ∪ {(s_t, r_t, ŝ_t)}
    if (t mod t_m = 0) then
      Add new representative states to M̄ using the S^a
      Update M̄ and the w^a using Algorithm 1 and the S^a
      S^a ← ∅ for all a ∈ A
    if (t mod t_v = 0) then update Q̄

Instead of assuming that S_1 and S_2 are a partition of a fixed dataset S^a, we can consider that S_2 was generated based on the policy learned by KBSF using the transitions in S_1. Thus, Algorithm 1 provides a flexible framework for integrating learning and planning within KBSF. A general description of the incremental version of KBSF is given in Algorithm 2. iKBSF updates the model M̄ and the value function Q̄ at fixed intervals t_m and t_v, respectively. When t_m = t_v = n, we recover the batch version of KBSF; when t_m = t_v = 1, we have an on-line method which stores no sample transitions. Note that Algorithm 2 also allows for the inclusion of new representative states in the model M̄. Using Algorithm 1 this is easy to do: given a new representative state s̄_{m+1}, it suffices to set w^a_{m+1} = 0, r̄^a_{m+1} = 0, and p̄^a_{m+1,j} = p̄^a_{j,m+1} = 0 for j = 1, 2, ..., m+1 and all a ∈ A. Then, in the following applications of Eqns (3) and (4), the dynamics of M̄ will naturally reflect the existence of state s̄_{m+1}.
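Update rules (3) and (4) and the bookkeeping of Algorithm 1 can be condensed into a short sketch. The code below is our illustration, not the authors' implementation: it handles a single action, assumes a Gaussian kernel, and also shows the zero-padding used to add a new representative state; the class and method names are ours.

```python
import numpy as np

class IncrementalModel:
    """Incremental accumulation of P_bar and r_bar over batches (Eqns 3-4),
    for one action, with k_tau(s, s') = exp(-(||s - s'||/tau)^2).
    Only P_bar (m x m), r_bar (m,), and kernel masses w (m,) are kept, so a
    processed batch of transitions can be discarded."""

    def __init__(self, rep_states, tau):
        self.S_bar = np.asarray(rep_states, dtype=float)  # m x d representative states
        self.tau = tau
        m = len(self.S_bar)
        self.P = np.zeros((m, m))  # P_bar^{S_1}
        self.r = np.zeros(m)       # r_bar^{S_1}
        self.w = np.zeros(m)       # w^{S_1}_i

    def _kernel(self, points):
        """k_tau(s_bar_i, x_t), shape (m, batch)."""
        d = np.linalg.norm(self.S_bar[:, None, :] - points[None, :, :], axis=2)
        return np.exp(-(d / self.tau) ** 2)

    def update(self, starts, rewards, ends):
        """Fold a batch S_2 = {(s_t, r_t, shat_t)} into P_bar and r_bar."""
        Ks = self._kernel(np.asarray(starts, dtype=float))  # k_tau(s_bar_i, s_t)
        Ke = self._kernel(np.asarray(ends, dtype=float))    # k_tau(s_bar_j, shat_t)
        z = Ke.sum(axis=0)                          # z_t, normalizer per transition
        w2 = Ks.sum(axis=1)                         # w^{S_2}_i
        b = Ks @ (Ke / z).T                         # b^{S_2}_{ij}
        e = Ks @ np.asarray(rewards, dtype=float)   # e^{S_2}_i
        total = self.w + w2
        safe = np.where(total > 0, total, 1.0)      # unvisited rows stay zero
        self.P = (b + self.P * self.w[:, None]) / safe[:, None]
        self.r = (e + self.r * self.w) / safe
        self.w = total

    def add_rep_state(self, s_new):
        """Grow the model: pad P_bar, r_bar, w with zeros for s_bar_{m+1}."""
        self.S_bar = np.vstack([self.S_bar, np.asarray(s_new, dtype=float)])
        m = len(self.S_bar)
        P = np.zeros((m, m))
        P[:m - 1, :m - 1] = self.P
        self.P = P
        self.r = np.append(self.r, 0.0)
        self.w = np.append(self.w, 0.0)
```

Processing a dataset in one batch or in several smaller chunks yields exactly the same P̄ and r̄, which is the point of Eqns (3) and (4).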

4 Theoretical Results

Our previous experiments with KBSF suggest that, at least empirically, the algorithm's performance improves as m → n [4]. In this section we present theoretical results that confirm this property. The results below are particularly useful for iKBSF because they provide practical guidance towards where and when to add new representative states.

Suppose we have a fixed set of sample transitions S^a. We will show that, if we are free to define the representative states, then we can use KBSF to approximate KBRL's solution to any desired level of accuracy. To be more precise, let d ≡ max_{a,i} min_j ‖ŝ^a_i − s̄_j‖, that is, d is the maximum distance from a sampled state ŝ^a_i to the closest representative state. We will show that, by minimizing d, we can make ‖v̂ − ṽ‖_∞ as small as desired (cf. Eqn (2)).

Let ŝ^a = ŝ^a_k with k = argmax_i min_j ‖ŝ^a_i − s̄_j‖, and let s̄^a = s̄_h where h = argmin_j ‖ŝ^a − s̄_j‖; that is, ŝ^a is the sampled state in S^a whose distance to the closest representative state is maximal, and s̄^a is the representative state that is closest to ŝ^a. Using these definitions, we can select the pair (ŝ, s̄) that maximizes ‖ŝ^a − s̄^a‖: ŝ ≡ ŝ^b and s̄ ≡ s̄^b where b = argmax_a ‖ŝ^a − s̄^a‖. Obviously, ‖ŝ − s̄‖ = d. We make the following simple assumptions: (i) ŝ^a and s̄^a are unique for all a ∈ A; (ii) 0 < ∫ φ(x) dx ≤ L_φ < ∞; (iii) φ(x) ≥ φ(y) if x < y; (iv) there exist A_φ, λ_φ > 0 and B_φ ≥ 0 such that A_φ exp(−x) ≤ φ(x) ≤ λ_φ A_φ exp(−x) if x ≥ B_φ. Assumption (iv) implies that the kernel function φ will eventually decay exponentially.

We start by introducing the following definition:

Definition 1. Given α ∈ (0, 1] and s, s' ∈ S, the α-radius of k_τ with respect to s and s' is defined as ρ(k_τ, s, s', α) ≡ max{x ∈ R_+ | φ(x/τ) ≥ α k_τ(s, s')}.

The existence of ρ(k_τ, s, s', α) is guaranteed by assumptions (iii) and (iv) and the fact that φ is continuous [1]. To provide some intuition on the meaning of the α-radius of k_τ, suppose that φ is strictly decreasing and let c = φ(‖s − s'‖/τ). Then, there is an s'' ∈ S such that φ(‖s − s''‖/τ) = αc. The radius of k_τ in this case is ‖s − s''‖. It should be thus obvious that ρ(k_τ, s, s', α) ≥ ‖s − s'‖. We can show that ρ has the following properties (proved in the supplementary material):

Property 1. If ‖s − s'‖ < ‖s − s''‖, then ρ(k_τ, s, s', α) ≤ ρ(k_τ, s, s'', α).
Property 2. If α < α', then ρ(k_τ, s, s', α) > ρ(k_τ, s, s', α').
Property 3. For α ∈ (0, 1) and ε > 0, there is a δ > 0 such that ρ(k_τ, s, s', α) − ‖s − s'‖ < ε if τ < δ.
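For a concrete instance of Definition 1, take the Gaussian kernel φ(x) = exp(−x²). Then φ(x/τ) ≥ α φ(‖s − s'‖/τ) holds exactly when x ≤ √(‖s − s'‖² − τ² ln α), so the α-radius has a closed form, and Properties 1–3 can be checked numerically. This is an illustration of ours, with arbitrary numbers:

```python
import math

def alpha_radius_gaussian(dist, tau, alpha):
    """alpha-radius of k_tau for phi(x) = exp(-x^2): the largest x with
    phi(x/tau) >= alpha * phi(dist/tau)."""
    return math.sqrt(dist ** 2 - tau ** 2 * math.log(alpha))

d, tau = 0.5, 0.2
assert alpha_radius_gaussian(d, tau, 0.9) >= d  # rho >= ||s - s'|| since alpha <= 1
# Property 1: a larger distance gives a larger (here: no smaller) radius
assert alpha_radius_gaussian(0.6, tau, 0.5) > alpha_radius_gaussian(d, tau, 0.5)
# Property 2: a smaller alpha gives a strictly larger radius
assert alpha_radius_gaussian(d, tau, 0.5) > alpha_radius_gaussian(d, tau, 0.9)
# Property 3: rho approaches ||s - s'|| as tau shrinks
assert alpha_radius_gaussian(d, 0.01, 0.5) - d < 1e-3
```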
We now introduce a notion of dissimilarity between two states s, s' ∈ S which is induced by a specific set of sample transitions S^a and the choice of kernel function:

Definition 2. Given β > 0, the β-dissimilarity between s and s' with respect to κ^a_τ is defined as

ψ(κ^a_τ, s, s', β) ≡ Σ_{k=1}^{n_a} |κ^a_τ(s, s^a_k) − κ^a_τ(s', s^a_k)| if ‖s − s'‖ ≤ β, and 0 otherwise.

The parameter β defines the volume of the ball within which we want to compare states. As we will see, this parameter links Definitions 1 and 2. Note that ψ(κ^a_τ, s, s', β) ∈ [0, 2]. It is possible to show that ψ satisfies the following property (see supplementary material):

Property 4. For β > 0 and ε > 0, there is a δ > 0 such that ψ(κ^a_τ, s, s', β) < ε if ‖s − s'‖ < δ.

Definitions 1 and 2 allow us to enunciate the following result:

Lemma 1. For any α ∈ (0, 1] and any t ≥ m, let ρ^a ≡ ρ(k_τ, ŝ^a, s̄^a, α/t), let ψ^a_ρ ≡ max_{i,j} ψ(κ^a_τ, ŝ^a_i, s̄_j, ρ^a), and let ψ^a_max ≡ max_{i,j} ψ(κ^a_τ, ŝ^a_i, s̄_j, ∞). Then,

‖P̂^a − DK̄^a‖_∞ ≤ (1/(1+α)) ψ^a_ρ + (α/(1+α)) ψ^a_max.  (5)

Proof. See supplementary material.

Since ψ^a_max ≥ ψ^a_ρ, one might think at first that the right-hand side of Eqn (5) decreases monotonically as α → 0. This is not necessarily true, though, because ψ^a_ρ → ψ^a_max as α → 0 (see Property 2). We are finally ready to prove the main result of this section.

Proposition 1. For any ε > 0, there are δ_1, δ_2 > 0 such that ‖v̂ − ṽ‖_∞ < ε if d < δ_1 and τ < δ_2.

Proof. Let ř ≡ [r^1; r^2; ...; r^{|A|}] ∈ R^n. From Eqn (1) and the definition of r̄^a, we can write

‖r̂^a − D r̄^a‖_∞ = ‖P̂^a ř − D K̇^a r^a‖_∞ = ‖P̂^a ř − D K̄^a ř‖_∞ ≤ ‖P̂^a − D K̄^a‖_∞ ‖ř‖_∞.  (6)

Thus, plugging Eqn (6) back into Eqn (2), it is clear that there is an η > 0 such that ‖v̂ − ṽ‖_∞ < ε if max_a ‖P̂^a − DK̄^a‖_∞ < η and max_i (1 − max_j ḋ_{ij}) < η.

We start by showing that if d and τ are small enough, then max_a ‖P̂^a − DK̄^a‖_∞ < η. From Lemma 1 we know that, for any set of m ≤ n representative states, and for any α ∈ (0, 1], the following must hold: max_a ‖P̂^a − DK̄^a‖_∞ ≤ (1/(1+α)) ψ_ρ + (α/(1+α)) ψ_MAX, where ψ_MAX ≡ max_{a,i,s'} ψ(κ^a_τ, ŝ^a_i, s', ∞) and ψ_ρ ≡ max_a ψ^a_ρ = max_{a,i,j} ψ(κ^a_τ, ŝ^a_i, s̄_j, ρ^a), with ρ^a ≡ ρ(k_τ, ŝ^a, s̄^a, α/n). Note that ψ_MAX is independent of the representative states. Define α such that (α/(1+α)) ψ_MAX < η. We have to show that, if we define the representative states in such a way that d is small enough, and set τ accordingly, then we can make ψ_ρ < (1+α)η − αψ_MAX ≡ η'. From Property 4 we know that there is a δ' > 0 such that ψ^a_ρ < η' if ρ^a < δ' for all a ∈ A. From Property 1 we know that ρ^a ≤ ρ(k_τ, ŝ, s̄, α/n) for all a ∈ A. From Property 3 we know that, for any ε' > 0, there is a δ'' > 0 such that ρ(k_τ, ŝ, s̄, α/n) < d + ε' if τ < δ''. Therefore, if d < δ', we can take any ε' < δ' − d to have an upper bound δ'' for τ.

It remains to show that there is a δ''' > 0 such that min_i max_j ḋ_{ij} > 1 − η if τ < δ'''. Recalling that ḋ_{ij} = k_τ(ŝ^a_i, s̄_j) / Σ_{k=1}^{m} k_τ(ŝ^a_i, s̄_k), let h = argmax_j k_τ(ŝ^a_i, s̄_j), and let y = k_τ(ŝ^a_i, s̄_h) and y̌ = max_{j≠h} k_τ(ŝ^a_i, s̄_j). Then, for any i, max_j ḋ_{ij} ≥ y / (y + Σ_{j≠h} k_τ(ŝ^a_i, s̄_j)) ≥ y / (y + (m−1)y̌). From Assumption (iv) and Property 3 we know that there is a δ''' > 0 such that y > (m−1)(1−η)y̌/η if τ < δ''', and in that case we can guarantee that min_i max_j ḋ_{ij} > 1 − η. Thus, by taking δ_1 = δ' and δ_2 = min(δ'', δ'''), the result follows.

Proposition 1 tells us that, regardless of the specific reinforcement-learning problem at hand, if the distances between sampled states and the respective nearest representative states are small enough, then we can make KBSF's approximation of KBRL's value function as accurate as desired by setting τ to a small value. How small d and τ should be depends on the particular choice of kernel k_τ and on the characteristics of the sets of transitions S^a.
Of course, a fixed number m of representative states imposes a minimum possible value for d, and if this value is not small enough decreasing τ may actually hurt the approximation. Again, the optimal value for τ in this case is problem-dependent. Our result supports the use of a local approximation based on representative states spread over the state space S. This is in line with the quantization strategies used in batch-mode kernel-based reinforcement learning to define the states s̄_j [4, 5]. In the case of on-line learning, we have to adaptively define the representative states s̄_j as the sample transitions come in. One can think of several ways of doing so [10]. In the next section we show a simple strategy for adding representative states which is based on the theoretical results presented in this section.

5 Empirical Results

We now investigate the empirical performance of the incremental version of KBSF. We start with a simple task in which iKBSF is contrasted with batch KBSF. Next we exploit the scalability of iKBSF to solve a difficult control task that, to the best of our knowledge, has never been solved before.

We use the puddle world problem as a proof of concept [11]. In this first experiment we show that iKBSF is able to recover the model that would be computed by its batch counterpart. In order to do so, we applied Algorithm 2 to the puddle-world task using a random policy to select actions. Figure 1 shows the result of such an experiment when we vary the parameters t_m and t_v. Note that the case in which t_m = t_v = 8000 corresponds to the batch version of KBSF. As expected, the performance of iKBSF's decision policies improves gradually as the algorithm goes through more sample transitions, and in general the intensity of the improvement is proportional to the amount of data processed. More important, the performance of the decision policies after all sample transitions have been processed is essentially the same for all values of t_m and t_v, which shows that iKBSF can be used as a tool to circumvent KBSF's memory demand (which is linear in n).
Thus, if one has a batch of sample transitions that does not fit in the available memory, it is possible to split the data in chunks of smaller sizes and still get the same value-function approximation that would

be computed if the entire data set were processed at once. As shown in Figure 1b, there is only a small computational overhead associated with such a strategy (this results from unnormalizing and normalizing the elements of P̄^a and r̄^a several times through update rules (3) and (4)).

[Figure 1: Results on the puddle-world task, averaged over 50 runs: (a) performance (return) and (b) run times (seconds), as a function of the number of sample transitions, for t_m = t_v = ι with ι ∈ {1000, 2000, 4000, 8000} (see legends). iKBSF used 100 representative states evenly distributed over the state space. Sample transitions were collected by a random policy. The agents were tested on two sets of states surrounding the "puddles": a 3×3 grid over [0.1, 0.3] × [0.3, 0.5] and the four states {0.1, 0.3} × {0.9, 1.0}.]

But iKBSF is more than just a tool for avoiding the memory limitations associated with batch learning. We illustrate this fact with a more challenging RL task. Pole balancing has a long history as a benchmark problem because it represents a rich class of unstable systems [12, 13, 14]. The objective in this task is to apply forces to a wheeled cart moving along a limited track in order to keep one or more poles hinged to the cart from falling over [15]. There are several variations of the problem with different levels of difficulty; among them, balancing two poles at the same time is particularly hard [16]. In this paper we raise the bar, and add a third pole to the pole-balancing task. We performed our simulations using the parameters usually adopted with the double-pole task, except that we added a third pole with the same length and mass as the longer pole [15]. This results in a problem with an 8-dimensional state space S.

In our experiments with the double-pole task, we used 200 representative states and 10^6 sample transitions collected by a random policy [4]. Here we start our experiment with triple pole-balancing using exactly the same configuration, and then we let iKBSF refine its model M̄ by incorporating more sample transitions through update rules (3) and (4). Specifically, we used Algorithm 2 with a 0.3-greedy policy, t_m = t_v = 10^6, and n = 10^7. Policy iteration was used to compute Q̄ at each value-function update.
As for the kernels, we adopted Gaussian functions with widths τ = 100 and τ̄ = 1 (to improve efficiency, we used a KD-tree to only compute the 50 largest values of k_τ(s̄_i, ·) and the 10 largest values of k_τ̄(ŝ^a_i, ·)). Representative states were added to the model on-line every time the agent encountered a sample state ŝ for which k_τ̄(ŝ, s̄_j) < 0.01 for all j ∈ {1, 2, ..., m} (this corresponds to setting the maximum allowed distance d from a sampled state to the closest representative state).

We compare iKBSF with fitted Q-iteration using an ensemble of 30 trees generated by Ernst et al.'s extra-trees algorithm [17]. We chose this algorithm because it has shown excellent performance in both benchmark and real-world reinforcement-learning tasks [17, 18]. (Another reason for choosing fitted Q-iteration was that some of the most natural competitors of KBSF have already been tested on the simpler double pole-balancing task, with disappointing results [19, 4].) Since this is a batch-mode learning method, we used its result on the initial set of 10^6 sample transitions as a baseline for our empirical evaluation. To build the trees, the number of cut-directions evaluated at each node was fixed at dim(S) = 8, and the minimum number of elements required to split a node, denoted here by η_min, was first set to 1000 and then to 100. The algorithm was run for 50 iterations, with the structure of the trees fixed after the 10th iteration.

As shown in Figure 2, both fitted Q-iteration and batch KBSF perform poorly in the triple pole-balancing task, with average success rates below 55%. This suggests that the amount of data used

by these algorithms is insufficient to describe the dynamics of the control task. Of course, we could give more sample transitions to fitted Q-iteration and batch KBSF. Note however that, since they are batch-learning methods, there is an inherent limit on the amount of data that these algorithms can use to construct their approximation. In contrast, the amount of memory required by iKBSF is independent of the number of sample transitions n. This fact, together with the fact that iKBSF's computational complexity is only linear in n, allows our algorithm to process a large amount of data within a reasonable time. This can be observed in Figure 2b, which shows that iKBSF can build an approximation using 10^7 transitions in under 20 minutes. As a reference for comparison, fitted Q-iteration using η_min = 1000 took an average of 1 hour and 18 minutes to process 10 times less data.

[Figure 2: Results on the triple pole-balancing task, averaged over 50 runs: (a) performance (fraction of successful episodes), (b) run times (seconds, log scale), and (c) the size of iKBSF's MDP (number of representative states), each as a function of the number of sample transitions (up to 10^7), for iKBSF, batch KBSF, and fitted Q-iteration (TREE-1000 and TREE-100). The values correspond to the fraction of episodes initiated from the test states in which the 3 poles could be balanced for 3000 steps (one minute of simulated time). The test set was composed of 256 states equally distributed over the hypercube defined by ±[1.2 m, 0.24 m/s, 18°, 75°/s, 18°, 150°/s, 18°, 75°/s]. Shadowed regions represent 99% confidence intervals.]

As shown in Figure 2, the ability of iKBSF to process a large number of sample transitions allows our algorithm to achieve a success rate of approximately 80%. This is similar to the performance of batch KBSF on the double-pole version of the problem [4]. The good performance of iKBSF on the triple pole-balancing task is especially impressive when we recall that the decision policies were evaluated on a set of test states representing all possible directions of inclination of the three poles.
In order to achieve the same level of performance with KBSF, approximately 2 Gb of memory would be necessary, even using sparse kernels, whereas iKBSF used less than 0.03 Gb of memory. To conclude, observe in Figure 2c how the number of representative states m grows as a function of the number of sample transitions processed by iKBSF. As expected, in the beginning of the learning process m grows fast, reflecting the fact that some relevant regions of the state space have not been visited yet. As more and more data come in, the number of representative states starts to stabilize.

6 Conclusion

This paper presented two contributions, one practical and one theoretical. The practical contribution is iKBSF, the incremental version of KBSF. iKBSF retains all the nice properties of its precursor: it is simple, fast, and enjoys good theoretical guarantees. However, since its memory complexity is independent of the number of sample transitions, iKBSF can be applied to datasets of any size, and it can also be used on-line. To show how iKBSF's ability to process large amounts of data can be useful in practice, we used the proposed algorithm to learn how to simultaneously balance three poles, a difficult control task that had never been solved before. As for the theoretical contribution, we showed that KBSF can approximate KBRL's value function at any level of accuracy by minimizing the distance between sampled states and the closest representative state. This supports the quantization strategies usually adopted in kernel-based RL, and also offers guidance towards where and when to add new representative states in on-line learning.

Acknowledgments

The authors would like to thank Amir-massoud Farahmand for helpful discussions regarding this work. Funding for this research was provided by the National Institutes of Health (grant R21 DA019800) and the NSERC Discovery Grant program.

References

[1] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49:161–178, 2002.
[2] D. Ormoneit and P. Glynn. Kernel-based reinforcement learning in average-cost problems. IEEE Transactions on Automatic Control, 47(10):1624–1636, 2002.
[3] N. Jong and P. Stone. Kernel-based models for reinforcement learning in continuous state spaces. In Proceedings of the International Conference on Machine Learning (ICML) Workshop on Kernel Machines and Reinforcement Learning, 2006.
[4] A. M. S. Barreto, D. Precup, and J. Pineau. Reinforcement learning using kernel-based stochastic factorization. In Advances in Neural Information Processing Systems (NIPS), 2011.
[5] B. Kveton and G. Theocharous. Kernel-based reinforcement learning on representative states. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 124–131, 2012.
[6] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[7] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
[8] M. L. Littman, T. L. Dean, and L. P. Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pages 394–402, 1995.
[9] A. M. S. Barreto and M. D. Fragoso. Computing the stationary distribution of a finite Markov chain through stochastic factorization. SIAM Journal on Matrix Analysis and Applications, 32, 2011.
[10] Y. Engel, S. Mannor, and R. Meir. The kernel recursive least-squares algorithm. IEEE Transactions on Signal Processing, 52:2275–2285, 2004.
[11] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems (NIPS), pages 1038–1044, 1996.
[12] D. Michie and R. Chambers. BOXES: An experiment in adaptive control. Machine Intelligence 2, pages 125–133, 1968.
[13] C. W. Anderson. Learning and Problem Solving with Multilayer Connectionist Systems. PhD thesis, Computer and Information Science, University of Massachusetts, 1986.
[14] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:834–846, 1983.
[15] F. J. Gomez. Robust Non-Linear Control through Neuroevolution. PhD thesis, The University of Texas at Austin, 2003.
[16] A. P. Wieland. Evolving neural network controllers for unstable systems. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pages 667–673, 1991.
[17] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
[18] D. Ernst, G. B. Stan, J. Gonçalves, and L. Wehenkel. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the IEEE Conference on Decision and Control (CDC), pages 124–131, 2006.
[19] F. Gomez, J. Schmidhuber, and R. Miikkulainen. Efficient non-linear control through neuroevolution. In Proceedings of the European Conference on Machine Learning (ECML), pages 654–662, 2006.


Improving Anytime Point-Based Value Iteration Using Principled Point Selections In In Proceedngs of the Twenteth Interntonl Jont Conference on Artfcl Intellgence (IJCAI-7) Improvng Anytme Pont-Bsed Vlue Iterton Usng Prncpled Pont Selectons Mchel R. Jmes, Mchel E. Smples, nd Dmtr A.

More information

Chapter Newton-Raphson Method of Solving a Nonlinear Equation

Chapter Newton-Raphson Method of Solving a Nonlinear Equation Chpter 0.04 Newton-Rphson Method o Solvng Nonlner Equton Ater redng ths chpter, you should be ble to:. derve the Newton-Rphson method ormul,. develop the lgorthm o the Newton-Rphson method,. use the Newton-Rphson

More information

Variable time amplitude amplification and quantum algorithms for linear algebra. Andris Ambainis University of Latvia

Variable time amplitude amplification and quantum algorithms for linear algebra. Andris Ambainis University of Latvia Vrble tme mpltude mplfcton nd quntum lgorthms for lner lgebr Andrs Ambns Unversty of Ltv Tlk outlne. ew verson of mpltude mplfcton;. Quntum lgorthm for testng f A s sngulr; 3. Quntum lgorthm for solvng

More information

Katholieke Universiteit Leuven Department of Computer Science

Katholieke Universiteit Leuven Department of Computer Science Updte Rules for Weghted Non-negtve FH*G Fctorzton Peter Peers Phlp Dutré Report CW 440, Aprl 006 Ktholeke Unverstet Leuven Deprtment of Computer Scence Celestjnenln 00A B-3001 Heverlee (Belgum) Updte Rules

More information

Statistics and Probability Letters

Statistics and Probability Letters Sttstcs nd Probblty Letters 79 (2009) 105 111 Contents lsts vlble t ScenceDrect Sttstcs nd Probblty Letters journl homepge: www.elsever.com/locte/stpro Lmtng behvour of movng verge processes under ϕ-mxng

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 9

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 9 CS434/541: Pttern Recognton Prof. Olg Veksler Lecture 9 Announcements Fnl project proposl due Nov. 1 1-2 prgrph descrpton Lte Penlt: s 1 pont off for ech d lte Assgnment 3 due November 10 Dt for fnl project

More information

18.7 Artificial Neural Networks

18.7 Artificial Neural Networks 310 18.7 Artfcl Neurl Networks Neuroscence hs hypotheszed tht mentl ctvty conssts prmrly of electrochemcl ctvty n networks of brn cells clled neurons Ths led McCulloch nd Ptts to devse ther mthemtcl model

More information

arxiv: v2 [cs.lg] 9 Nov 2017

arxiv: v2 [cs.lg] 9 Nov 2017 Renforcement Lernng under Model Msmtch Aurko Roy 1, Hun Xu 2, nd Sebstn Pokutt 2 rxv:1706.04711v2 cs.lg 9 Nov 2017 1 Google Eml: urkor@google.com 2 ISyE, Georg Insttute of Technology, Atlnt, GA, USA. Eml:

More information

Math 497C Sep 17, Curves and Surfaces Fall 2004, PSU

Math 497C Sep 17, Curves and Surfaces Fall 2004, PSU Mth 497C Sep 17, 004 1 Curves nd Surfces Fll 004, PSU Lecture Notes 3 1.8 The generl defnton of curvture; Fox-Mlnor s Theorem Let α: [, b] R n be curve nd P = {t 0,...,t n } be prtton of [, b], then the

More information

Online Appendix to. Mandating Behavioral Conformity in Social Groups with Conformist Members

Online Appendix to. Mandating Behavioral Conformity in Social Groups with Conformist Members Onlne Appendx to Mndtng Behvorl Conformty n Socl Groups wth Conformst Members Peter Grzl Andrze Bnk (Correspondng uthor) Deprtment of Economcs, The Wllms School, Wshngton nd Lee Unversty, Lexngton, 4450

More information

4. Eccentric axial loading, cross-section core

4. Eccentric axial loading, cross-section core . Eccentrc xl lodng, cross-secton core Introducton We re strtng to consder more generl cse when the xl force nd bxl bendng ct smultneousl n the cross-secton of the br. B vrtue of Snt-Vennt s prncple we

More information

In this Chapter. Chap. 3 Markov chains and hidden Markov models. Probabilistic Models. Example: CpG Islands

In this Chapter. Chap. 3 Markov chains and hidden Markov models. Probabilistic Models. Example: CpG Islands In ths Chpter Chp. 3 Mrov chns nd hdden Mrov models Bontellgence bortory School of Computer Sc. & Eng. Seoul Ntonl Unversty Seoul 5-74, Kore The probblstc model for sequence nlyss HMM (hdden Mrov model)

More information

Computing a complete histogram of an image in Log(n) steps and minimum expected memory requirements using hypercubes

Computing a complete histogram of an image in Log(n) steps and minimum expected memory requirements using hypercubes Computng complete hstogrm of n mge n Log(n) steps nd mnmum expected memory requrements usng hypercubes TAREK M. SOBH School of Engneerng, Unversty of Brdgeport, Connectcut, USA. Abstrct Ths work frst revews

More information

The Number of Rows which Equal Certain Row

The Number of Rows which Equal Certain Row Interntonl Journl of Algebr, Vol 5, 011, no 30, 1481-1488 he Number of Rows whch Equl Certn Row Ahmd Hbl Deprtment of mthemtcs Fcult of Scences Dmscus unverst Dmscus, Sr hblhmd1@gmlcom Abstrct Let be X

More information

DCDM BUSINESS SCHOOL NUMERICAL METHODS (COS 233-8) Solutions to Assignment 3. x f(x)

DCDM BUSINESS SCHOOL NUMERICAL METHODS (COS 233-8) Solutions to Assignment 3. x f(x) DCDM BUSINESS SCHOOL NUMEICAL METHODS (COS -8) Solutons to Assgnment Queston Consder the followng dt: 5 f() 8 7 5 () Set up dfference tble through fourth dfferences. (b) Wht s the mnmum degree tht n nterpoltng

More information

Soft Set Theoretic Approach for Dimensionality Reduction 1

Soft Set Theoretic Approach for Dimensionality Reduction 1 Interntonl Journl of Dtbse Theory nd pplcton Vol No June 00 Soft Set Theoretc pproch for Dmensonlty Reducton Tutut Herwn Rozd Ghzl Mustf Mt Ders Deprtment of Mthemtcs Educton nversts hmd Dhln Yogykrt Indones

More information

Using Predictions in Online Optimization: Looking Forward with an Eye on the Past

Using Predictions in Online Optimization: Looking Forward with an Eye on the Past Usng Predctons n Onlne Optmzton: Lookng Forwrd wth n Eye on the Pst Nngjun Chen Jont work wth Joshu Comden, Zhenhu Lu, Anshul Gndh, nd Adm Wermn 1 Predctons re crucl for decson mkng 2 Predctons re crucl

More information

Quiz: Experimental Physics Lab-I

Quiz: Experimental Physics Lab-I Mxmum Mrks: 18 Totl tme llowed: 35 mn Quz: Expermentl Physcs Lb-I Nme: Roll no: Attempt ll questons. 1. In n experment, bll of mss 100 g s dropped from heght of 65 cm nto the snd contner, the mpct s clled

More information

Statistics 423 Midterm Examination Winter 2009

Statistics 423 Midterm Examination Winter 2009 Sttstcs 43 Mdterm Exmnton Wnter 009 Nme: e-ml: 1. Plese prnt your nme nd e-ml ddress n the bove spces.. Do not turn ths pge untl nstructed to do so. 3. Ths s closed book exmnton. You my hve your hnd clcultor

More information

We consider a finite-state, finite-action, infinite-horizon, discounted reward Markov decision process and

We consider a finite-state, finite-action, infinite-horizon, discounted reward Markov decision process and MANAGEMENT SCIENCE Vol. 53, No. 2, Februry 2007, pp. 308 322 ssn 0025-1909 essn 1526-5501 07 5302 0308 nforms do 10.1287/mnsc.1060.0614 2007 INFORMS Bs nd Vrnce Approxmton n Vlue Functon Estmtes She Mnnor

More information

Numerical Analysis: Trapezoidal and Simpson s Rule

Numerical Analysis: Trapezoidal and Simpson s Rule nd Simpson s Mthemticl question we re interested in numericlly nswering How to we evlute I = f (x) dx? Clculus tells us tht if F(x) is the ntiderivtive of function f (x) on the intervl [, b], then I =

More information

1 Online Learning and Regret Minimization

1 Online Learning and Regret Minimization 2.997 Decision-Mking in Lrge-Scle Systems My 10 MIT, Spring 2004 Hndout #29 Lecture Note 24 1 Online Lerning nd Regret Minimiztion In this lecture, we consider the problem of sequentil decision mking in

More information

GAUSS ELIMINATION. Consider the following system of algebraic linear equations

GAUSS ELIMINATION. Consider the following system of algebraic linear equations Numercl Anlyss for Engneers Germn Jordnn Unversty GAUSS ELIMINATION Consder the followng system of lgebrc lner equtons To solve the bove system usng clsscl methods, equton () s subtrcted from equton ()

More information

INTRODUCTION TO COMPLEX NUMBERS

INTRODUCTION TO COMPLEX NUMBERS INTRODUCTION TO COMPLEX NUMBERS The numers -4, -3, -, -1, 0, 1,, 3, 4 represent the negtve nd postve rel numers termed ntegers. As one frst lerns n mddle school they cn e thought of s unt dstnce spced

More information

Reinforcement Learning with a Gaussian Mixture Model

Reinforcement Learning with a Gaussian Mixture Model Renforcement Lernng wth Gussn Mxture Model Alejndro Agostn, Member, IEEE nd Enrc Cely Abstrct Recent pproches to Renforcement Lernng (RL) wth functon pproxmton nclude Neurl Ftted Q Iterton nd the use of

More information

Review of linear algebra. Nuno Vasconcelos UCSD

Review of linear algebra. Nuno Vasconcelos UCSD Revew of lner lgebr Nuno Vsconcelos UCSD Vector spces Defnton: vector spce s set H where ddton nd sclr multplcton re defned nd stsf: ) +( + ) (+ )+ 5) λ H 2) + + H 6) 3) H, + 7) λ(λ ) (λλ ) 4) H, - + 8)

More information

LOCAL FRACTIONAL LAPLACE SERIES EXPANSION METHOD FOR DIFFUSION EQUATION ARISING IN FRACTAL HEAT TRANSFER

LOCAL FRACTIONAL LAPLACE SERIES EXPANSION METHOD FOR DIFFUSION EQUATION ARISING IN FRACTAL HEAT TRANSFER Yn, S.-P.: Locl Frctonl Lplce Seres Expnson Method for Dffuson THERMAL SCIENCE, Yer 25, Vol. 9, Suppl., pp. S3-S35 S3 LOCAL FRACTIONAL LAPLACE SERIES EXPANSION METHOD FOR DIFFUSION EQUATION ARISING IN

More information

Lecture 4: Piecewise Cubic Interpolation

Lecture 4: Piecewise Cubic Interpolation Lecture notes on Vrtonl nd Approxmte Methods n Appled Mthemtcs - A Perce UBC Lecture 4: Pecewse Cubc Interpolton Compled 6 August 7 In ths lecture we consder pecewse cubc nterpolton n whch cubc polynoml

More information

The Schur-Cohn Algorithm

The Schur-Cohn Algorithm Modelng, Estmton nd Otml Flterng n Sgnl Processng Mohmed Njm Coyrght 8, ISTE Ltd. Aendx F The Schur-Cohn Algorthm In ths endx, our m s to resent the Schur-Cohn lgorthm [] whch s often used s crteron for

More information

Pyramid Algorithms for Barycentric Rational Interpolation

Pyramid Algorithms for Barycentric Rational Interpolation Pyrmd Algorthms for Brycentrc Rtonl Interpolton K Hormnn Scott Schefer Astrct We present new perspectve on the Floter Hormnn nterpolnt. Ths nterpolnt s rtonl of degree (n, d), reproduces polynomls of degree

More information

Dynamic Power Management in a Mobile Multimedia System with Guaranteed Quality-of-Service

Dynamic Power Management in a Mobile Multimedia System with Guaranteed Quality-of-Service Dynmc Power Mngement n Moble Multmed System wth Gurnteed Qulty-of-Servce Qnru Qu, Qng Wu, nd Mssoud Pedrm Dept. of Electrcl Engneerng-Systems Unversty of Southern Clforn Los Angeles CA 90089 Outlne! Introducton

More information

An Introduction to Support Vector Machines

An Introduction to Support Vector Machines An Introducton to Support Vector Mchnes Wht s good Decson Boundry? Consder two-clss, lnerly seprble clssfcton problem Clss How to fnd the lne (or hyperplne n n-dmensons, n>)? Any de? Clss Per Lug Mrtell

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

6 Roots of Equations: Open Methods

6 Roots of Equations: Open Methods HK Km Slghtly modfed 3//9, /8/6 Frstly wrtten t Mrch 5 6 Roots of Equtons: Open Methods Smple Fed-Pont Iterton Newton-Rphson Secnt Methods MATLAB Functon: fzero Polynomls Cse Study: Ppe Frcton Brcketng

More information

p-adic Egyptian Fractions

p-adic Egyptian Fractions p-adic Egyptin Frctions Contents 1 Introduction 1 2 Trditionl Egyptin Frctions nd Greedy Algorithm 2 3 Set-up 3 4 p-greedy Algorithm 5 5 p-egyptin Trditionl 10 6 Conclusion 1 Introduction An Egyptin frction

More information

A Family of Multivariate Abel Series Distributions. of Order k

A Family of Multivariate Abel Series Distributions. of Order k Appled Mthemtcl Scences, Vol. 2, 2008, no. 45, 2239-2246 A Fmly of Multvrte Abel Seres Dstrbutons of Order k Rupk Gupt & Kshore K. Ds 2 Fculty of Scence & Technology, The Icf Unversty, Agrtl, Trpur, Ind

More information

The Regulated and Riemann Integrals

The Regulated and Riemann Integrals Chpter 1 The Regulted nd Riemnn Integrls 1.1 Introduction We will consider severl different pproches to defining the definite integrl f(x) dx of function f(x). These definitions will ll ssign the sme vlue

More information

W. We shall do so one by one, starting with I 1, and we shall do it greedily, trying

W. We shall do so one by one, starting with I 1, and we shall do it greedily, trying Vitli covers 1 Definition. A Vitli cover of set E R is set V of closed intervls with positive length so tht, for every δ > 0 nd every x E, there is some I V with λ(i ) < δ nd x I. 2 Lemm (Vitli covering)

More information

Study of Trapezoidal Fuzzy Linear System of Equations S. M. Bargir 1, *, M. S. Bapat 2, J. D. Yadav 3 1

Study of Trapezoidal Fuzzy Linear System of Equations S. M. Bargir 1, *, M. S. Bapat 2, J. D. Yadav 3 1 mercn Interntonl Journl of Reserch n cence Technology Engneerng & Mthemtcs vlble onlne t http://wwwsrnet IN (Prnt: 38-349 IN (Onlne: 38-3580 IN (CD-ROM: 38-369 IJRTEM s refereed ndexed peer-revewed multdscplnry

More information

Introduction to Numerical Integration Part II

Introduction to Numerical Integration Part II Introducton to umercl Integrton Prt II CS 75/Mth 75 Brn T. Smth, UM, CS Dept. Sprng, 998 4/9/998 qud_ Intro to Gussn Qudrture s eore, the generl tretment chnges the ntegrton prolem to ndng the ntegrl w

More information

523 P a g e. is measured through p. should be slower for lesser values of p and faster for greater values of p. If we set p*

523 P a g e. is measured through p. should be slower for lesser values of p and faster for greater values of p. If we set p* R. Smpth Kumr, R. Kruthk, R. Rdhkrshnn / Interntonl Journl of Engneerng Reserch nd Applctons (IJERA) ISSN: 48-96 www.jer.com Vol., Issue 4, July-August 0, pp.5-58 Constructon Of Mxed Smplng Plns Indexed

More information

Riemann is the Mann! (But Lebesgue may besgue to differ.)

Riemann is the Mann! (But Lebesgue may besgue to differ.) Riemnn is the Mnn! (But Lebesgue my besgue to differ.) Leo Livshits My 2, 2008 1 For finite intervls in R We hve seen in clss tht every continuous function f : [, b] R hs the property tht for every ɛ >

More information

7.2 Volume. A cross section is the shape we get when cutting straight through an object.

7.2 Volume. A cross section is the shape we get when cutting straight through an object. 7. Volume Let s revew the volume of smple sold, cylnder frst. Cylnder s volume=se re heght. As llustrted n Fgure (). Fgure ( nd (c) re specl cylnders. Fgure () s rght crculr cylnder. Fgure (c) s ox. A

More information

Best Approximation. Chapter The General Case

Best Approximation. Chapter The General Case Chpter 4 Best Approximtion 4.1 The Generl Cse In the previous chpter, we hve seen how n interpolting polynomil cn be used s n pproximtion to given function. We now wnt to find the best pproximtion to given

More information

Research Article On the Upper Bounds of Eigenvalues for a Class of Systems of Ordinary Differential Equations with Higher Order

Research Article On the Upper Bounds of Eigenvalues for a Class of Systems of Ordinary Differential Equations with Higher Order Hndw Publshng Corporton Interntonl Journl of Dfferentl Equtons Volume 0, Artcle ID 7703, pges do:055/0/7703 Reserch Artcle On the Upper Bounds of Egenvlues for Clss of Systems of Ordnry Dfferentl Equtons

More information

CALIBRATION OF SMALL AREA ESTIMATES IN BUSINESS SURVEYS

CALIBRATION OF SMALL AREA ESTIMATES IN BUSINESS SURVEYS CALIBRATION OF SMALL AREA ESTIMATES IN BUSINESS SURVES Rodolphe Prm, Ntle Shlomo Southmpton Sttstcl Scences Reserch Insttute Unverst of Southmpton Unted Kngdom SAE, August 20 The BLUE-ETS Project s fnnced

More information

Generalized Linear Methods

Generalized Linear Methods Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set

More information

CONTEXTUAL MULTI-ARMED BANDIT ALGORITHMS FOR PERSONALIZED LEARNING ACTION SELECTION. Indu Manickam, Andrew S. Lan, and Richard G.

CONTEXTUAL MULTI-ARMED BANDIT ALGORITHMS FOR PERSONALIZED LEARNING ACTION SELECTION. Indu Manickam, Andrew S. Lan, and Richard G. CONTEXTUAL MULTI-ARMED BANDIT ALGORITHMS FOR PERSONALIZED LEARNING ACTION SELECTION Indu Mnckm, Andrew S. Ln, nd Rchrd G. Brnuk Rce Unversty ABSTRACT Optmzng the selecton of lernng resources nd prctce

More information

CIS587 - Artificial Intelligence. Uncertainty CIS587 - AI. KB for medical diagnosis. Example.

CIS587 - Artificial Intelligence. Uncertainty CIS587 - AI. KB for medical diagnosis. Example. CIS587 - rtfcl Intellgence Uncertnty K for medcl dgnoss. Exmple. We wnt to uld K system for the dgnoss of pneumon. rolem descrpton: Dsese: pneumon tent symptoms fndngs, l tests: Fever, Cough, leness, WC

More information

THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS.

THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS. THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS RADON ROSBOROUGH https://intuitiveexplntionscom/picrd-lindelof-theorem/ This document is proof of the existence-uniqueness theorem

More information

Chapter 0. What is the Lebesgue integral about?

Chapter 0. What is the Lebesgue integral about? Chpter 0. Wht is the Lebesgue integrl bout? The pln is to hve tutoril sheet ech week, most often on Fridy, (to be done during the clss) where you will try to get used to the ides introduced in the previous

More information

A Tri-Valued Belief Network Model for Information Retrieval

A Tri-Valued Belief Network Model for Information Retrieval December 200 A Tr-Vlued Belef Networ Model for Informton Retrevl Fernndo Ds-Neves Computer Scence Dept. Vrgn Polytechnc Insttute nd Stte Unversty Blcsburg, VA 24060. IR models t Combnng Evdence Grphcl

More information

Bi-level models for OD matrix estimation

Bi-level models for OD matrix estimation TNK084 Trffc Theory seres Vol.4, number. My 2008 B-level models for OD mtrx estmton Hn Zhng, Quyng Meng Abstrct- Ths pper ntroduces two types of O/D mtrx estmton model: ME2 nd Grdent. ME2 s mxmum-entropy

More information

5.7 Improper Integrals

5.7 Improper Integrals 458 pplictions of definite integrls 5.7 Improper Integrls In Section 5.4, we computed the work required to lift pylod of mss m from the surfce of moon of mss nd rdius R to height H bove the surfce of the

More information

M/G/1/GD/ / System. ! Pollaczek-Khinchin (PK) Equation. ! Steady-state probabilities. ! Finding L, W q, W. ! π 0 = 1 ρ

M/G/1/GD/ / System. ! Pollaczek-Khinchin (PK) Equation. ! Steady-state probabilities. ! Finding L, W q, W. ! π 0 = 1 ρ M/G//GD/ / System! Pollcze-Khnchn (PK) Equton L q 2 2 λ σ s 2( + ρ ρ! Stedy-stte probbltes! π 0 ρ! Fndng L, q, ) 2 2 M/M/R/GD/K/K System! Drw the trnston dgrm! Derve the stedy-stte probbltes:! Fnd L,L

More information

Sequences of Intuitionistic Fuzzy Soft G-Modules

Sequences of Intuitionistic Fuzzy Soft G-Modules Interntonl Mthemtcl Forum, Vol 13, 2018, no 12, 537-546 HIKARI Ltd, wwwm-hkrcom https://doorg/1012988/mf201881058 Sequences of Intutonstc Fuzzy Soft G-Modules Velyev Kemle nd Huseynov Afq Bku Stte Unversty,

More information

SCALED GRADIENT DESCENT LEARNING RATE Reinforcement learning with light-seeking robot

SCALED GRADIENT DESCENT LEARNING RATE Reinforcement learning with light-seeking robot SCALED GRADIET DESCET LEARIG RATE Renforcement lernng wth lght-seekng robot Kry Främlng Helsnk Unversty of Technology, P.O. Box 54, FI-5 HUT, Fnlnd. Eml: Kry.Frmlng@hut.f Keywords: Abstrct: Lner functon

More information

Advanced Calculus: MATH 410 Notes on Integrals and Integrability Professor David Levermore 17 October 2004

Advanced Calculus: MATH 410 Notes on Integrals and Integrability Professor David Levermore 17 October 2004 Advnced Clculus: MATH 410 Notes on Integrls nd Integrbility Professor Dvid Levermore 17 October 2004 1. Definite Integrls In this section we revisit the definite integrl tht you were introduced to when

More information

Lecture 36. Finite Element Methods

Lecture 36. Finite Element Methods CE 60: Numercl Methods Lecture 36 Fnte Element Methods Course Coordntor: Dr. Suresh A. Krth, Assocte Professor, Deprtment of Cvl Engneerng, IIT Guwht. In the lst clss, we dscussed on the ppromte methods

More information

Bellman Optimality Equation for V*

Bellman Optimality Equation for V* Bellmn Optimlity Eqution for V* The vlue of stte under n optiml policy must equl the expected return for the best ction from tht stte: V (s) mx Q (s,) A(s) mx A(s) mx A(s) Er t 1 V (s t 1 ) s t s, t s

More information

LECTURE 5: LOGICS OF COOPERATION (II) 1.1 Examples of ATL Formulae. 1.2 Extended Example: Train Controller. 1 Alternating-time Temporal Logic

LECTURE 5: LOGICS OF COOPERATION (II) 1.1 Examples of ATL Formulae. 1.2 Extended Example: Train Controller. 1 Alternating-time Temporal Logic ctr out-of -gte reuest % " F F wll be true untl wll lwys be true 2 gwb nd tb hve jont strtegy for ensurng tht eventully there s pece gwb tb pece mp hs no strtegy for ensurng tht the udence s lwys excted

More information

Math 426: Probability Final Exam Practice

Math 426: Probability Final Exam Practice Mth 46: Probbility Finl Exm Prctice. Computtionl problems 4. Let T k (n) denote the number of prtitions of the set {,..., n} into k nonempty subsets, where k n. Argue tht T k (n) kt k (n ) + T k (n ) by

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Torsion, Thermal Effects and Indeterminacy

Torsion, Thermal Effects and Indeterminacy ENDS Note Set 7 F007bn orson, herml Effects nd Indetermncy Deformton n orsonlly Loded Members Ax-symmetrc cross sectons subjected to xl moment or torque wll remn plne nd undstorted. At secton, nternl torque

More information

SUMMER KNOWHOW STUDY AND LEARNING CENTRE

SUMMER KNOWHOW STUDY AND LEARNING CENTRE SUMMER KNOWHOW STUDY AND LEARNING CENTRE Indices & Logrithms 2 Contents Indices.2 Frctionl Indices.4 Logrithms 6 Exponentil equtions. Simplifying Surds 13 Opertions on Surds..16 Scientific Nottion..18

More information

1B40 Practical Skills

1B40 Practical Skills B40 Prcticl Skills Comining uncertinties from severl quntities error propgtion We usully encounter situtions where the result of n experiment is given in terms of two (or more) quntities. We then need

More information

CHAPTER - 7. Firefly Algorithm based Strategic Bidding to Maximize Profit of IPPs in Competitive Electricity Market

CHAPTER - 7. Firefly Algorithm based Strategic Bidding to Maximize Profit of IPPs in Competitive Electricity Market CHAPTER - 7 Frefly Algorthm sed Strtegc Bddng to Mxmze Proft of IPPs n Compettve Electrcty Mrket 7. Introducton The renovton of electrc power systems plys mjor role on economc nd relle operton of power

More information

Administrivia CSE 190: Reinforcement Learning: An Introduction

Administrivia CSE 190: Reinforcement Learning: An Introduction Administrivi CSE 190: Reinforcement Lerning: An Introduction Any emil sent to me bout the course should hve CSE 190 in the subject line! Chpter 4: Dynmic Progrmming Acknowledgment: A good number of these

More information

Online Learning Algorithms for Stochastic Water-Filling

Online Learning Algorithms for Stochastic Water-Filling Onlne Lernng Algorthms for Stochstc Wter-Fllng Y G nd Bhskr Krshnmchr Mng Hseh Deprtment of Electrcl Engneerng Unversty of Southern Clforn Los Angeles, CA 90089, USA Eml: {yg, bkrshn}@usc.edu Abstrct Wter-fllng

More information

Two Activation Function Wavelet Network for the Identification of Functions with High Nonlinearity

Two Activation Function Wavelet Network for the Identification of Functions with High Nonlinearity Interntonl Journl of Engneerng & Computer Scence IJECS-IJENS Vol:1 No:04 81 Two Actvton Functon Wvelet Network for the Identfcton of Functons wth Hgh Nonlnerty Wsm Khld Abdulkder Abstrct-- The ntegrton

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

A New Markov Chain Based Acceptance Sampling Policy via the Minimum Angle Method

A New Markov Chain Based Acceptance Sampling Policy via the Minimum Angle Method Irnn Journl of Opertons Reserch Vol. 3, No., 202, pp. 04- A New Mrkov Chn Bsed Acceptnce Smplng Polcy v the Mnmum Angle Method M. S. Fllh Nezhd * We develop n optmzton model bsed on Mrkovn pproch to determne

More information

Multiple view geometry

Multiple view geometry EECS 442 Computer vson Multple vew geometry Perspectve Structure from Moton - Perspectve structure from moton prolem - mgutes - lgerc methods - Fctorzton methods - Bundle djustment - Self-clrton Redng:

More information

Simultaneous estimation of rewards and dynamics from noisy expert demonstrations

Simultaneous estimation of rewards and dynamics from noisy expert demonstrations Smultneous estmton of rewrds nd dynmcs from nosy expert demonstrtons Mchel Hermn,2, Tobs Gndele, Jo rg Wgner, Felx Schmtt, nd Wolfrm Burgrd2 - Robert Bosch GmbH - 70442 Stuttgrt - Germny 2- Unversty of

More information

NUMERICAL MODELLING OF A CILIUM USING AN INTEGRAL EQUATION

NUMERICAL MODELLING OF A CILIUM USING AN INTEGRAL EQUATION NUEICAL ODELLING OF A CILIU USING AN INTEGAL EQUATION IHAI EBICAN, DANIEL IOAN Key words: Cl, Numercl nlyss, Electromgnetc feld, gnetton. The pper presents fst nd ccurte method to model the mgnetc behvour

More information

Chapter 5 Supplemental Text Material R S T. ij i j ij ijk

Chapter 5 Supplemental Text Material R S T. ij i j ij ijk Chpter 5 Supplementl Text Mterl 5-. Expected Men Squres n the Two-fctor Fctorl Consder the two-fctor fxed effects model y = µ + τ + β + ( τβ) + ε k R S T =,,, =,,, k =,,, n gven s Equton (5-) n the textook.

More information

Attribute reduction theory and approach to concept lattice

Attribute reduction theory and approach to concept lattice Scence n Chn Ser F Informton Scences 2005 Vol48 No6 713 726 713 Attrbute reducton theory nd pproch to concept lttce ZHANG Wenxu 1, WEI Lng 1,2 & QI Jnun 3 1 Insttute for Informton nd System Scences, Fculty

More information

UNIFORM CONVERGENCE. Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3

UNIFORM CONVERGENCE. Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3 UNIFORM CONVERGENCE Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3 Suppose f n : Ω R or f n : Ω C is sequence of rel or complex functions, nd f n f s n in some sense. Furthermore,

More information

Jean Fernand Nguema LAMETA UFR Sciences Economiques Montpellier. Abstract

Jean Fernand Nguema LAMETA UFR Sciences Economiques Montpellier. Abstract Stochstc domnnce on optml portfolo wth one rsk less nd two rsky ssets Jen Fernnd Nguem LAMETA UFR Scences Economques Montpeller Abstrct The pper provdes restrctons on the nvestor's utlty functon whch re

More information

4.4 Areas, Integrals and Antiderivatives

4.4 Areas, Integrals and Antiderivatives . res, integrls nd ntiderivtives 333. Ares, Integrls nd Antiderivtives This section explores properties of functions defined s res nd exmines some connections mong res, integrls nd ntiderivtives. In order

More information

A Reinforcement Learning System with Chaotic Neural Networks-Based Adaptive Hierarchical Memory Structure for Autonomous Robots

A Reinforcement Learning System with Chaotic Neural Networks-Based Adaptive Hierarchical Memory Structure for Autonomous Robots Interntonl Conference on Control, Automton nd ystems 008 Oct. 4-7, 008 n COEX, eoul, Kore A Renforcement ernng ystem wth Chotc Neurl Networs-Bsed Adptve Herrchcl Memory tructure for Autonomous Robots Msno

More information

Math 8 Winter 2015 Applications of Integration

Math 8 Winter 2015 Applications of Integration Mth 8 Winter 205 Applictions of Integrtion Here re few importnt pplictions of integrtion. The pplictions you my see on n exm in this course include only the Net Chnge Theorem (which is relly just the Fundmentl

More information

Reinforcement learning II

Reinforcement learning II CS 1675 Introduction to Mchine Lerning Lecture 26 Reinforcement lerning II Milos Huskrecht milos@cs.pitt.edu 5329 Sennott Squre Reinforcement lerning Bsics: Input x Lerner Output Reinforcement r Critic

More information

Group-based active query selection. for rapid diagnosis in time-critical situations

Group-based active query selection. for rapid diagnosis in time-critical situations Group-bsed ctve query selecton for rpd dgnoss n tme-crtcl stutons *Gowthm Belll, Student Member, IEEE, Suresh K. Bhvnn, nd Clyton Scott, Member, IEEE Abstrct In pplctons such s ctve lernng or dsese/fult

More information

Vol. 5, No. 5 May 2014 ISSN Journal of Emerging Trends in Computing and Information Sciences CIS Journal. All rights reserved.

Vol. 5, No. 5 May 2014 ISSN Journal of Emerging Trends in Computing and Information Sciences CIS Journal. All rights reserved. Vol. 5, No. 5 y 04 ISSN 079-8407 Jornl of Emergng Trends n Comptng nd Informton Scences 009-04 CIS Jornl. ll rghts reserved. http://www.csornl.org Notes on lt Soft trces D.Sngh, Onyeozl, I.., 3 lkl..j.,

More information

Online Stochastic Matching: New Algorithms with Better Bounds

Online Stochastic Matching: New Algorithms with Better Bounds Onlne Stochstc Mtchng: New Algorthms wth Better Bounds Ptrck Jllet Xn Lu My 202; revsed Jnury 203, June 203 Abstrct We consder vrnts of the onlne stochstc bprtte mtchng problem motvted by Internet dvertsng

More information

1 Probability Density Functions

1 Probability Density Functions Lis Yn CS 9 Continuous Distributions Lecture Notes #9 July 6, 28 Bsed on chpter by Chris Piech So fr, ll rndom vribles we hve seen hve been discrete. In ll the cses we hve seen in CS 9, this ment tht our

More information

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix Lectures - Week 4 Matrx norms, Condtonng, Vector Spaces, Lnear Independence, Spannng sets and Bass, Null space and Range of a Matrx Matrx Norms Now we turn to assocatng a number to each matrx. We could

More information

CISE 301: Numerical Methods Lecture 5, Topic 4 Least Squares, Curve Fitting

CISE 301: Numerical Methods Lecture 5, Topic 4 Least Squares, Curve Fitting CISE 3: umercl Methods Lecture 5 Topc 4 Lest Squres Curve Fttng Dr. Amr Khouh Term Red Chpter 7 of the tetoo c Khouh CISE3_Topc4_Lest Squre Motvton Gven set of epermentl dt 3 5. 5.9 6.3 The reltonshp etween

More information