Reinforcement Learning with a Gaussian Mixture Model


Alejandro Agostini, Member, IEEE, and Enric Celaya

Abstract: Recent approaches to Reinforcement Learning (RL) with function approximation include Neural Fitted Q Iteration and the use of Gaussian Processes. They belong to the class of fitted value iteration algorithms, which use a set of support points to fit the value function in a batch iterative process. These techniques make efficient use of a reduced number of samples by reusing them as needed, and are appropriate for applications where the cost of experiencing a new sample is higher than that of storing and reusing it; but this comes at the expense of increased computational effort, since these algorithms are not incremental. On the other hand, non-parametric models for function approximation, like Gaussian Processes, are preferred over parametric ones due to their greater flexibility. A further advantage of using Gaussian Processes for function approximation is that they make it possible to quantify the uncertainty of the estimation at each point. In this paper, we propose a new approach for RL in continuous domains based on probability density estimations. Our method combines the best features of the previous methods: it is non-parametric and provides an estimation of the variance of the approximated function at any point of the domain. In addition, our method is simple, incremental, and computationally efficient. All these features make this approach more appealing than Gaussian Processes and fitted value iteration algorithms in general.

I. INTRODUCTION

A crucial issue in Reinforcement Learning (RL) is how to deal with problems whose state and action spaces are continuous, or discrete but very large. In these cases, the application of classical tabular methods to store the Q-value of each possible state-action pair (or the value of each state, in model-based approaches) becomes infeasible. Moreover, if the number of states is too large, learning about them by visiting them all turns out to be impossible, so that it is necessary to infer the value of a state from the values of similar ones for which experiences have been collected.
To achieve this, RL must be used with some form of function approximation providing the necessary compactness in the representation and appropriate generalization over states and actions. In general, function approximation methods can be classified as parametric or non-parametric [1]. Parametric methods include neural nets, polynomials, and combinations of radial basis functions, among others. They define parameterized families of functions with a finite number of parameters (which, in the discrete case, is much smaller than the number of states), and try to find the values of the parameters for which the function best represents the available data. Parametric methods have been extensively used since they allow the application of gradient techniques for parameter optimization. One difficulty with parametric models resides in the selection of the parameterized family: if it is too restrictive, it may not be able to model the data with the necessary accuracy; if it is too general, there is a risk of overfitting the data and providing poor generalization. Non-parametric function approximators, instead, do not fix in advance the number or the nature of the parameters (despite their name, non-parametric approximators usually have parameters, but their number is not upper bounded), so that they can be endowed with unrestricted function approximation capabilities. Some examples are Gaussian Processes, tree-based methods, and Mixtures of Gaussians with a variable number of units. Since, in general, in a complex RL problem it is not possible to guess what kind of function representation will work, the more flexible non-parametric methods are preferred. In the last years, different non-parametric function approximators for RL have been proposed.

A. Agostini and E. Celaya are with the Institut de Robòtica i Informàtica Industrial (UPC - CSIC), c./ Llorens i Artigas 4-6, Barcelona, Spain. e-mails: agostini@iri.upc.edu and celaya@iri.upc.edu. This work was partially supported by the Spanish Ministry of Science and Innovation under project MIPRCV, Consolider Ingenio 2010 (CSD).
In [2], Rasmussen and Kuss proposed the use of Gaussian Processes (GP) for RL: using a model-based approach, a number of GPs (one for each dimension of the state space) is used to model the system dynamics, and a further GP represents the value function. In [3], the approach is extended to online learning using Bayesian active learning. An alternative application of GPs to RL is that of Engel et al. [4], who use a GP to directly represent the Q-function in a model-free approach. One benefit of using GPs for function approximation is that, besides providing the expected value of the function, they also provide its variance, which makes it possible to quantify the uncertainty of the predicted value. As pointed out in [5], this information may be very useful to direct the exploration in RL. All of these GP-based RL algorithms fall in the class of the so-called fitted value iteration algorithms [6], which, in order to approximate the desired function, take a finite number of samples, or support points, and try to fit the function to them in a batch iterative process. Fitted value iteration has been used with both parametric and non-parametric function approximators. For example, Ernst et al. [7], based on the previous work of Ormoneit and Sen [8] on kernel-based RL, proposed the fitted Q Iteration algorithm using (non-parametric) randomized trees for function approximation, while Riedmiller [9], [10] proposed Neural Fitted Q Iteration using a (parametric) multi-layer neural net. The main idea of fitted value iteration is to reuse the set of samples as much as needed to get all possible information from them. This allows learning with a minimal number of interactions with the real system, but it does not imply a lesser number of

function updates. Thus, it is appropriate when acquiring new data is more costly than just storing them for future use. However, assuming that data can be obtained at low cost, for example by simulation, this advantage disappears, and can even become a disadvantage when the dynamics evolves with time, since old data would no longer be valid. A key issue in fitted value iteration algorithms is the generation of the set of representative samples of the function to be approximated. When the model of the problem is known, they can be obtained by uniform sampling through the state space as in [7]. When the model is not available, samples can be generated by interacting with the system with random actions, but this strategy may not work when the complexity of the problem is such that reaching the interesting regions of the state space requires a long chain of lucky actions. For this reason, Riedmiller [9] uses a greedy heuristic, which consists in exploiting the policy learned in the previous learning stages to generate new samples for the next iterations of the algorithm. Even so, when the problem rises in complexity, he finds it necessary to use what he calls the hint-to-goal heuristic to provide specific exemplars within the goal region. Note that, in principle, fitted value iteration algorithms are not incremental, in the sense that each time new samples are introduced, the function approximation process must be repeated from scratch for the new dataset, which is computationally inefficient. In this paper, we propose an approach to RL for continuous state-action spaces with function approximation based on probability density estimations. The idea is to represent the density distribution of the observed samples in the joint space of states, actions, and q-values. To represent this density distribution we use a Gaussian Mixture Model with a variable number of units, so that the function approximation is non-parametric, which makes it general. With this approach, it is possible to obtain, for each given state and action, the probability distribution of q(s, a) as the conditional probability p(q|s, a). From this distribution we can obtain the value of Q(s, a) as the expected value of q(s, a).
Furthermore, we can obtain the variance of q(s, a) and estimate its confidence, so that our approach also presents what has been argued to be an important feature of GPs [4], [3]. The Gaussian Mixture Model can be updated with an incremental, low-complexity version of the Expectation-Maximization algorithm, which makes this approach more appealing than GPs and fitted value iteration algorithms in general. As a further benefit of using density estimations, it is possible, by marginalization over the state-action variables, to obtain the local sampling density at a point (s, a), which, in stochastic problems, may be used to evaluate how reliable the estimation is at this point. The rest of the paper is organized as follows: Section II briefly reviews the basics of RL. Section III introduces the GMM for multivariate density estimation and the EM algorithm in its batch version. In Section IV we define the on-line EM algorithm for the GMM. In Section V we develop our RL algorithm using density estimation of the Q-value function. Section VI shows the feasibility of the approach with an example, and Section VII concludes the paper.

II. REINFORCEMENT LEARNING

Reinforcement Learning is a paradigm in which an agent has to learn an optimal action policy by interacting with its environment [11]. The task is formally modelled as the solution of a Markov decision process in which, at each time step, the agent observes the current state of the environment, s_t, and chooses an allowed action a_t using some action policy, a_t = π(s_t). In response to this action, the environment changes to state s_{t+1} and produces an instantaneous reward r_t = r(s_t, a_t). Using the information collected in this way, the agent must find the policy that maximizes the expected sum of discounted rewards, also called return, defined as:

R = \sum_{t=0}^{\infty} \gamma^t r_t,   (1)

where γ is the discount rate, with values in [0, 1], that regulates the importance of future rewards with respect to immediate ones. One of the most popular algorithms used in RL is Q-Learning [12], which uses an action-value function Q(s, a) to estimate the maximum expected return that can be obtained by executing action a in situation s and acting optimally thereafter.
Q-learning uses the Bellman equation [13] to estimate sample values for Q(s, a), which we denote by q(s, a):

q(s_t, a_t) = r(s_t, a_t) + \gamma \max_a Q(s_{t+1}, a),   (2)

where \max_a Q(s_{t+1}, a) is the estimated maximum expected return corresponding to the next observed situation s_{t+1}. At a given stage of the learning, the temporary policy can be derived from the estimated Q-function as

\pi(s) = \arg\max_a Q(s, a).   (3)

In actor/critic architectures, a policy function (called the actor) is learned and explicitly stored, so that actions are directly decided by the actor and do not need to be computed through the maximization in (3). Despite this computational advantage, the learning of an actor may slow down convergence, since then the learning of the Q-function must be done on-policy instead of off-policy, and both functions, actor and critic, must adapt to each other to reach convergence. In our implementation we avoid the use of an actor, and thus we must face the problem of maximizing the Q(s, a) function in (3). The basic formulation of Q-learning assumes discrete state-action spaces, and the Q-function is stored in a tabular representation. For continuous domains, function approximation is required to represent the Q-function and generalize between similar situations. In the next sections we present our proposal for function approximation using density estimations.
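For reference, in the discrete tabular case the Bellman sample of eq. (2) and the greedy policy of eq. (3) reduce to a few lines. The following sketch is illustrative only: the two-state MDP is hypothetical, and the learning rate α it uses belongs to tabular Q-learning, not to the continuous-domain method developed in this paper.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q[s, a] toward the
    Bellman sample q = r + gamma * max_a' Q[s_next, a'] (eq. (2))."""
    q_sample = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (q_sample - Q[s, a])
    return Q

def greedy_policy(Q, s):
    """Greedy action of eq. (3): argmax over the action row."""
    return int(np.argmax(Q[s]))

# Tiny 2-state, 2-action example (hypothetical MDP).
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

In continuous spaces this table is exactly what becomes infeasible, motivating the density-based approximation of the following sections.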

III. DENSITY ESTIMATION WITH A GAUSSIAN MIXTURE MODEL

A Gaussian Mixture Model [14] is a weighted sum of multivariate Gaussian probability density functions, and is used to represent general probability density functions in multidimensional spaces. It is assumed that the samples of the distribution to be represented have been generated through the following process: first, one Gaussian is randomly selected with a priori given probabilities, and then a sample is randomly generated with the probability distribution of the selected Gaussian. According to this, the probability density function of generating sample x is:

p(x; \Theta) = \sum_{i=1}^{K} \alpha_i \, N(x; \mu_i, \Sigma_i),   (4)

where K is the number of Gaussians of the mixture; α_i, usually denoted as the mixing parameter, is the prior probability, P(i), of Gaussian i generating a sample; N(x; μ_i, Σ_i) is the multidimensional Gaussian function with mean vector μ_i and covariance matrix Σ_i; and Θ = {{α_1, μ_1, Σ_1}, ..., {α_K, μ_K, Σ_K}} is the whole set of parameters of the mixture. By allowing the adaptation of the number K of Gaussians in the mixture, any smooth density distribution can be approximated arbitrarily closely [15]. The parameters of the model can be estimated using a maximum-likelihood estimator (MLE). Given a set of samples X = {x_t ; t = 1, ..., N}, the likelihood function is given by

L[X; \Theta] = \prod_{t=1}^{N} p(x_t; \Theta).   (5)

The maximum-likelihood estimation of the model parameters is the Θ that maximizes the likelihood (5) for the data set X. Direct computation of the MLE requires complete information about which mixture component generated each instance. Since this information is missing, the EM algorithm, described in the next section, is often used.

A. The Expectation-Maximization algorithm

The Expectation-Maximization (EM) algorithm [16] is a general tool that permits estimating the parameters that maximize the likelihood function (5) for a broad class of problems with missing data. The EM method first produces an estimation of the expected values of the missing data using initial values of the parameters to be estimated (E step), and then computes the MLE of the parameters given the expected values of the missing data (M step). This process is repeated iteratively until a convergence criterion is fulfilled.
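The mixture density of eq. (4), which EM fits to the data, can be evaluated directly from the parameters Θ. A small sketch (illustrative parameters and hypothetical function names, not part of the paper's implementation):

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate normal density N(x; mu, cov)."""
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ inv @ diff) / norm

def gmm_pdf(x, alphas, mus, covs):
    """Mixture density of eq. (4): p(x) = sum_i alpha_i N(x; mu_i, Sigma_i)."""
    return sum(a * gaussian_pdf(x, m, c)
               for a, m, c in zip(alphas, mus, covs))

# Two-component 1-D mixture (illustrative parameters).
alphas = [0.5, 0.5]
mus = [np.array([0.0]), np.array([4.0])]
covs = [np.eye(1), np.eye(1)]
p = gmm_pdf(np.array([0.0]), alphas, mus, covs)
```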
In this section we briefly describe how EM is applied to the specific case of a GMM. The process starts with an initialization of the mean vectors and covariance matrices of the Gaussians. The E step consists in obtaining the probability P(i|x_t) of each component i having generated instance x_t, which we denote by w_{t,i}:

w_{t,i} = P(i|x_t) = \frac{P(i) p(x_t|i)}{\sum_{j=1}^{K} P(j) p(x_t|j)} = \frac{\alpha_i N(x_t; \mu_i, \Sigma_i)}{\sum_{j=1}^{K} \alpha_j N(x_t; \mu_j, \Sigma_j)},   (6)

where t = 1, ..., N and i = 1, ..., K. The maximization step consists in computing the MLE using the estimated w_{t,i}. It can be shown [17] that, for the case of a GMM, the mixing parameters, means, and covariances are given by

\alpha_i = \frac{1}{N} \sum_{t=1}^{N} w_{t,i},   (7)

\mu_i = \frac{\sum_{t=1}^{N} w_{t,i} \, x_t}{\sum_{t=1}^{N} w_{t,i}},   (8)

\Sigma_i = \frac{\sum_{t=1}^{N} w_{t,i} (x_t - \mu_i)(x_t - \mu_i)^T}{\sum_{t=1}^{N} w_{t,i}}.   (9)

IV. ON-LINE EM

Estimating a probability density function by means of the EM algorithm involves the iteration of E and M steps on the complete set of available data; that is, the mode of operation of EM is batch. However, in RL, sample data are not all available at once: they arrive sequentially and must be used online to improve the policy that will allow an efficient exploration-exploitation strategy. This prevents the use of the off-line EM algorithm, and requires an on-line, incremental version of it. Several incremental EM algorithms have been proposed for the Gaussian Mixture Model applied to clustering or classification of stationary data [18], [19]. The approach proposed in [18] is not strictly an on-line EM algorithm. It applies the conventional batch EM algorithm onto separate data streams corresponding to successive episodes. For each new stream, a new GMM model is trained in batch mode and then merged with the previous model. The number of components of each new GMM is defined using the Bayesian Information Criterion, and the merging process involves similarity comparisons between Gaussians. This method involves many computationally expensive processes at each episode and tends to generate more components than actually needed. The applicability of this method to RL seems limited, not only for its computational cost, but also because, due to the non-stationarity of the Q-estimation, old data should not be taken as equally valid during the whole process.
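The batch E and M steps of eqs. (6)-(9) can be sketched numerically. The following is an illustrative one-dimensional reduction of the multivariate case, with hypothetical data and initializations, not the paper's implementation:

```python
import numpy as np

def em_step(X, alphas, mus, sigmas):
    """One batch EM iteration for a 1-D GMM (eqs. (6)-(9))."""
    N, K = len(X), len(alphas)
    # E step: responsibilities w[t, i] (eq. (6)).
    w = np.zeros((N, K))
    for i in range(K):
        w[:, i] = alphas[i] * np.exp(-0.5 * ((X - mus[i]) / sigmas[i]) ** 2) / sigmas[i]
    w /= w.sum(axis=1, keepdims=True)
    # M step: re-estimate parameters (eqs. (7)-(9)).
    Wi = w.sum(axis=0)
    alphas = Wi / N
    mus = (w * X[:, None]).sum(axis=0) / Wi
    sigmas = np.sqrt((w * (X[:, None] - mus) ** 2).sum(axis=0) / Wi)
    return alphas, mus, sigmas

# Two well-separated clusters (illustrative data).
X = np.array([0.0, 0.1, -0.1, 5.0, 5.1, 4.9])
a, m, s = np.array([0.5, 0.5]), np.array([-1.0, 6.0]), np.array([1.0, 1.0])
for _ in range(20):
    a, m, s = em_step(X, a, m, s)
```

Each iteration alternates the soft assignment (6) with the weighted re-estimation (7)-(9); with these data the means converge to the two cluster centers.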
The work of [19] performs incremental updating of the density model using no historical data and assuming that consecutive data vary smoothly. The method maintains two GMMs: the current GMM estimation, and a previous GMM of the same complexity after which no model updating (i.e. no change in the number of Gaussians) has been done.

By comparing the current GMM with the historical one, it is determined whether new Gaussians are generated or some Gaussians are merged together. Two observed shortcomings of the algorithm are that the system fails when new data are well explained by the historical GMM, and when consecutive data violate the condition of smooth variation. In [20], an on-line EM algorithm is presented for the Normalized Gaussian Network (NGnet), a model closely related to the GMM. This algorithm is based on the works of [21], [22]. In [21] a method for the incremental adaptation of the model parameters using a forgetting factor and cumulative statistics is proposed, while in [22] the method of [21] is evaluated and contrasted with an incremental version which performs steps of EM over a fixed set of samples in an incremental way. The method proposed in [20] uses foundations of both works to elaborate an on-line learning algorithm that trains the NGnet for regression, where weighted averages of the model parameters are calculated using a learning rate that implicitly incorporates a forgetting factor to deal with non-stationarities. Inspired by this work, we developed an on-line EM algorithm for the GMM. Our approach uses cumulative statistics whose updating involves a forgetting factor explicitly.

A. On-line EM for the GMM

In the on-line EM approach, an E step and an M step are performed after the observation of each individual sample. The E step does not differ from the batch version (equation (6)), except that it is only computed for the new sample. For the M step, the parameters of all mixture components are updated with the new sample. For this, we define the following time-discounted weighted sums:

W_{t,i} = [[1]]_{t,i},   (10)

X_{t,i} = [[x]]_{t,i},   (11)

(XX)_{t,i} = [[x x^T]]_{t,i},   (12)

where we use the notation:

[[f(x)]]_{t,i} = \sum_{\tau=1}^{t} \left( \prod_{s=\tau+1}^{t} \lambda_s \right) f(x_\tau) \, w_{\tau,i},   (13)

where λ_t ∈ [0, 1] is a time-dependent discount factor introduced to forget the effect of old, possibly outdated values. Observe that for low values of λ_t the influence of old data decreases progressively, so that they are forgotten along time. This forgetting effect on old data is attenuated when λ_t approaches 1: in this case, old and new data have the same influence in the sum.
As learning proceeds and data values become more stable, forgetting them is no longer required, and λ_t can be made to progressively approach 1 to allow convergence. The sum W_{t,i} can be interpreted as the accumulated number of samples (composed of weights w_{t,i}) attributed to unit i along time, with forgetting. Similarly, X_{t,i} corresponds to the weighted sum with forgetting of the sample vectors x_τ attributed to unit i, which is used to derive the mean vector μ_i. In the same way, (XX)_{t,i} is the weighted sum with forgetting of the matrices obtained as the products x_τ x_τ^T of the sample vectors attributed to unit i, which will be used to find the covariance matrix Σ_i. From (13), we obtain the recursive formula:

[[f(x)]]_{t,i} = \lambda_t [[f(x)]]_{t-1,i} + f(x_t) w_{t,i}.   (14)

When a new sample x_t arrives, the accumulators (10), (11), and (12) are updated with the incremental formula (14), and new estimators for the GMM parameters are obtained as:

\alpha_i(t) = \frac{W_{t,i}}{\sum_{j=1}^{K} W_{t,j}},   (15)

\mu_i(t) = \frac{X_{t,i}}{W_{t,i}},   (16)

\Sigma_i(t) = \frac{(XX)_{t,i}}{W_{t,i}} - \mu_i(t) \mu_i(t)^T.   (17)

If the number K of Gaussians in the mixture is fixed, the GMM is a parametric function approximation method whose approximation capabilities are determined by K. Since we cannot determine the most appropriate K beforehand, we allow the number of Gaussians to be incremented on-line by a process of unit generation, so that the function approximation method as a whole becomes non-parametric. The process of unit generation is explained in Section V-B.

B. Weight-Dependent Forgetting

The factors λ_t in (13) were introduced by [20] with the purpose of progressively replacing (forgetting) old data by new, more reliable values. The effect of this is clearly seen in the incremental formula (14), which shows how, at each time step, all past data are multiplied by λ_t, and this is done for all units, no matter how much weight w_{t,i} is attributed to each of them. We observe that the real effect of applying (14) to units with low activation w_{t,i} is not to replace their past values by the new one but, essentially, to decrease their values by a factor λ_t.
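The accumulator recursion (14) and the estimators (15)-(17) introduced above can be sketched for a single unit. The class below is an illustrative one-dimensional reduction (the class name and scalar statistics are ours, not the paper's implementation):

```python
class OnlineGMMUnit:
    """Sufficient statistics of one mixture unit, updated with the
    time-discounted sums of eqs. (10)-(14) (1-D case for brevity)."""

    def __init__(self, W=0.1, X=0.0, XX=0.0):
        self.W, self.X, self.XX = W, X, XX   # [[1]], [[x]], [[x x^T]]

    def update(self, x, w, lam):
        # Recursive update of eq. (14): discount old sums, add new sample.
        self.W = lam * self.W + w
        self.X = lam * self.X + w * x
        self.XX = lam * self.XX + w * x * x

    def mean(self):
        return self.X / self.W                      # eq. (16)

    def var(self):
        return self.XX / self.W - self.mean() ** 2  # eq. (17)

# Feed one unit repeated samples at x = 2 with full activation.
unit = OnlineGMMUnit()
for _ in range(200):
    unit.update(x=2.0, w=1.0, lam=0.95)
```

Under the weight-dependent forgetting of eq. (19) introduced below, `lam` would be replaced by `lam ** w` in the three update lines.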
It can be seen that this is exactly the case when setting w_{t,i} = 0 in equation (14), which yields:

[[f(x)]]_{t,i} = \lambda_t [[f(x)]]_{t-1,i},   (18)

showing that the accumulators of units that are seldom activated will systematically decay to 0. This situation is particularly harmful in the case of online RL, for which it is very likely that highly valued regions of the state-action space will be sampled much more frequently than less promising ones, so that, in the long term, units covering low-valued regions will get their statistics lost. This can be avoided by modifying the updating formula (14) in this way:

[[f(x)]]_{t,i} = \lambda_t^{w_{t,i}} [[f(x)]]_{t-1,i} + f(x_t) w_{t,i}.   (19)

With this updating formula, the amount by which old data are forgotten is regulated by the amount w_{t,i} in which a new value is added to the sum, so that data are always replaced,

instead of simply forgotten. Effectively, if we now make w_{t,i} = 0 in (19), what we get is:

[[f(x)]]_{t,i} = [[f(x)]]_{t-1,i},   (20)

so that the values of the statistics of the inactive units remain unchanged. On the other hand, in the case of full activation of unit i, i.e., if w_{t,i} = 1, the effect of the new updating formula is exactly the same as that of (14). Therefore, we will prefer the updating formula (19) to keep better track of less sampled regions, noting that, by doing this, the definition given in (13) no longer holds.

V. THE GMM FOR Q-LEARNING

In this section, we describe how the GMM can be used for function approximation to estimate the expected Q-value, as well as its variance, at each point of the state-action space by means of a single representation of the probability density function in the joint space of states, actions, and Q-values:

p(s, a, q) = \sum_{i=1}^{K} \alpha_i \, N(s, a, q; \mu_i, \Sigma_i).   (21)

In online Q-learning, each sample is of the form x_t = (s_t, a_t, q(s_t, a_t)), corresponding to the visited state s_t, the executed action a_t, and the estimated value of q(s_t, a_t) as given by eq. (2). To obtain this estimation we need to evaluate \max_a Q(s_{t+1}, a), where Q(s, a) is defined as the expected value of q given s and a for the joint probability distribution (21) provided by the GMM:

Q(s, a) = E[q | s, a] = \mu(q | s, a).   (22)

To compute this, we must first obtain the distribution p(q | s, a). Decomposing the means μ_i and covariances Σ_i in the following way:

\mu_i = \begin{pmatrix} \mu_i^{(s,a)} \\ \mu_i^{q} \end{pmatrix},   (23)

\Sigma_i = \begin{pmatrix} \Sigma_i^{(s,a)(s,a)} & \Sigma_i^{(s,a),q} \\ \Sigma_i^{q,(s,a)} & \Sigma_i^{qq} \end{pmatrix},   (24)

the probability distribution of q, for a given state s and action a, can then be expressed as:

p(q | s, a) = \sum_{i=1}^{K} \beta_i(s, a) \, N(q; \mu_i(q | s, a), \sigma_i^2(q)),   (25)

where

\mu_i(q | s, a) = \mu_i^{q} + \Sigma_i^{q,(s,a)} \left( \Sigma_i^{(s,a)(s,a)} \right)^{-1} \left( (s, a) - \mu_i^{(s,a)} \right),   (26)

\sigma_i^2(q) = \Sigma_i^{qq} - \Sigma_i^{q,(s,a)} \left( \Sigma_i^{(s,a)(s,a)} \right)^{-1} \Sigma_i^{(s,a),q},   (27)

\beta_i(s, a) = \frac{\alpha_i N(s, a; \mu_i^{(s,a)}, \Sigma_i^{(s,a)(s,a)})}{\sum_{j=1}^{K} \alpha_j N(s, a; \mu_j^{(s,a)}, \Sigma_j^{(s,a)(s,a)})}.   (28)

From (25) we can obtain the conditional mean and variance, μ(q|s,a) and σ²(q|s,a), of the mixture at point (s, a) as:

\mu(q | s, a) = \sum_{i=1}^{K} \beta_i(s, a) \, \mu_i(q | s, a),   (29)

\sigma^2(q | s, a) = \sum_{i=1}^{K} \beta_i(s, a) \left( \sigma_i^2(q) + (\mu_i(q | s, a) - \mu(q | s, a))^2 \right).   (30)

Equation (29) is the estimated Q value for a given state and action, while (30) is its estimated variance.
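The computations in eqs. (26)-(30) amount to standard Gaussian conditioning plus a β-weighted combination. A sketch with a scalar u standing in for the state-action pair (s, a) (illustrative only; the function names and the scalar reduction are ours):

```python
import numpy as np

def conditional_q(mu, cov, u):
    """Condition one joint Gaussian over (u, q) on the input part u,
    giving mu_i(q|u) and sigma_i^2(q) as in eqs. (26)-(27)."""
    mu_u, mu_q = mu[0], mu[1]
    Suu, Suq, Squ, Sqq = cov[0, 0], cov[0, 1], cov[1, 0], cov[1, 1]
    mean = mu_q + Squ / Suu * (u - mu_u)
    var = Sqq - Squ * Suq / Suu
    return mean, var

def mixture_q(u, alphas, mus, covs):
    """Mixture conditional mean and variance, eqs. (28)-(30)."""
    dens = np.array([a * np.exp(-0.5 * (u - m[0]) ** 2 / c[0, 0])
                     / np.sqrt(2 * np.pi * c[0, 0])
                     for a, m, c in zip(alphas, mus, covs)])
    betas = dens / dens.sum()                                # eq. (28)
    conds = [conditional_q(m, c, u) for m, c in zip(mus, covs)]
    mean = sum(b * mq for b, (mq, _) in zip(betas, conds))   # eq. (29)
    var = sum(b * (vq + (mq - mean) ** 2)
              for b, (mq, vq) in zip(betas, conds))          # eq. (30)
    return mean, var

# Single unit with correlated (u, q): q tracks u on average.
mu = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
m, v = mixture_q(1.0, [1.0], [mu], [cov])
```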
Our purpose was to find the maximum of Q(s, a) over all actions for a given s, in order to compute (2). Unfortunately, this is hard to do analytically, but an approximate value can be obtained by numerical techniques. In our implementation, we take the simple approach of computing Q(s, a) for a finite number of actions, and then taking the largest Q value as the approximated maximum:

\max_a Q(s, a) \approx \max_{a \in A} Q(s, a),   (31)

where A is the set of actions that we take into consideration to find the approximated maximum.

A. Action Selection with Exploration

Action selection in RL must address the exploration/exploitation tradeoff. If we want just to exploit what has been learnt so far, with no exploration, we must execute the action a_g corresponding to the greedy policy given by eq. (3), which in our case is computed as:

a_g = \pi_g(s) = \arg\max_{a \in A} Q(s, a).   (32)

However, during learning, an exploration strategy is necessary that guarantees that no action is excluded from execution in any state. Two well-known ways to achieve this are ε-greedy and Boltzmann exploration. According to [23], these strategies are in the family of undirected exploration methods, meaning that exploration is based on randomness, and no exploration-specific knowledge is used to guide it. It is claimed that directed exploration techniques are often more efficient than undirected ones, so we propose a more directed method of exploration that takes into account the prediction error for each action, which is captured in the variance of the Q-values. For this, we define:

Q_{rand}(s, a) = Q(s, a) + \Delta_Q(\sigma^2(q | s, a)),   (33)

where \Delta_Q(\sigma^2(q | s, a)) is a value taken at random from a normal probability distribution with 0 mean and variance σ²(q|s,a). Then, action selection with exploration is made according to:

a_{explr} = \arg\max_{a \in A} Q_{rand}(s, a) + a_{rand},   (34)

where a_{rand} is an appropriately sized random perturbation of the action, introduced to allow the execution of arbitrary actions and not just those contained in A.
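The selection rule of eqs. (33)-(34) can be sketched as follows (an illustrative sketch with hypothetical action values; the `action_noise` parameter stands in for the perturbation a_rand):

```python
import numpy as np

def select_action(actions, q_means, q_vars, rng, action_noise=0.0):
    """Variance-aware selection of eqs. (33)-(34): perturb each Q(s, a)
    by a sample from N(0, sigma^2(q|s, a)), pick the argmax, and add a
    small random perturbation to the chosen action."""
    q_rand = q_means + rng.normal(0.0, np.sqrt(q_vars))   # eq. (33)
    best = int(np.argmax(q_rand))                         # eq. (34)
    return actions[best] + rng.uniform(-action_noise, action_noise)

rng = np.random.default_rng(0)
actions = np.array([-1.0, 0.0, 1.0])
# A confident high-valued action vs. equally confident low-valued ones.
a = select_action(actions, q_means=np.array([5.0, 0.0, 0.0]),
                  q_vars=np.array([1e-6, 1e-6, 1e-6]), rng=rng)
```

With small variances the rule behaves greedily; as the variance of an action's Q-estimate grows, so does its chance of winning the argmax.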

With this form of exploration, all actions always have a chance of getting Q_rand above their competitors, and hence a probability of being selected. Usually, higher-valued actions will have more chances of getting the highest Q_rand. However, a low-valued action may eventually receive a high Q_rand that surpasses the values of other actions. This provides a balance between exploration and exploitation that tends to take the greedy action when we are rather certain that it will result in a larger value, and increases the probability of exploring a non-greedy action when its predicted outcome is uncertain.

B. Unit Generation

The GMM is initialized with a small number of units, selected according to the expected complexity of the problem. However, if during training the model is found to be insufficient to represent the sample distribution with the required accuracy, it may be upgraded by generating new units. Since our main interest is to accurately represent the Q function, the generation of a new Gaussian is determined by the failure of the current GMM to account for an actually observed q value. Thus, a new Gaussian is generated when the two following conditions are satisfied:

1) The estimation error of the observed q value is larger than a predefined value δ:

(q(s, a) - \mu(q | s, a))^2 \geq \delta.   (35)

2) Units close to the experienced point have been sufficiently updated. We consider a unit i close to the point x = (s, a, q) if the Mahalanobis distance D_M^{(i)}, with covariance matrix Σ_i, between the unit mean and the point is less than 1,

I = \{ i \in \{1, ..., K\} \mid D_M^{(i)}(x, \mu_i) < 1 \},   (36)

and thus the criterion can be expressed as:

\sum_{\tau=1}^{t} w_{\tau,i} > N_{conf}, \quad \forall i \in I.   (37)

The purpose of this condition is to avoid the premature generation of new units in a region before the system has had the opportunity to adapt to the data in that region. Whenever both criteria are fulfilled, a Gaussian is generated with parameters given by:

W_{K+1} = 1,   (38)

\mu_{K+1}(s, a, q) = (s_t, a_t, q(s_t, a_t)),   (39)

\Sigma_{K+1} = C \, \mathrm{diag}\{d_1, ..., d_D, d_a, d_q\},   (40)

where d_i is the total range size of variable i, D is the dimension of the state space, and C is a positive value that sizes the variances of the new Gaussian.
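The two generation conditions (35)-(37) and the initialization (38)-(40) can be sketched as follows. This is an illustrative simplification: units are plain dictionaries, the stored scalar `W` stands in for the activation sum of eq. (37), and the test compares the squared Mahalanobis distance against 1 (equivalent to D_M < 1):

```python
import numpy as np

def should_generate(x, q_pred, q_obs, units, delta, n_conf):
    """Unit-generation test of Section V-B: large estimation error
    (eq. (35)) and all nearby units sufficiently updated (eqs. (36)-(37))."""
    if (q_obs - q_pred) ** 2 < delta:                 # eq. (35) not met
        return False
    for u in units:
        diff = x - u["mu"]
        d2 = diff @ np.linalg.inv(u["cov"]) @ diff    # squared Mahalanobis
        if d2 < 1.0 and u["W"] <= n_conf:             # a close, immature unit
            return False
    return True

def new_unit(x, d_ranges, C):
    """Parameters of the generated Gaussian, eqs. (38)-(40)."""
    return {"W": 1.0, "mu": x.copy(), "cov": C * np.diag(d_ranges)}

# One mature unit near the sample point; error above threshold.
units = [{"mu": np.zeros(3), "cov": np.eye(3), "W": 50.0}]
x = np.array([0.1, 0.0, 0.2])
gen = should_generate(x, q_pred=0.0, q_obs=2.0, units=units,
                      delta=1.0, n_conf=10.0)
u_new = new_unit(x, d_ranges=np.array([2.0, 2.0, 2.0]), C=0.5)
```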
VI. EXPERIMENTS

A classical benchmark problem for RL, the control of an inverted pendulum with limited torque [24], has been selected to test our algorithm. We addressed the problem of the swing-up and stabilization of the pendulum. The task consists in swinging the pendulum until reaching the upright position, and then staying there indefinitely. The optimal policy for this problem is not trivial to find since, due to the limited torques available, the controller has to swing the pendulum several times back and forth until its kinetic energy is large enough to overcome the load torque and reach the upright position, and then stabilize the pendulum there. The state space of this problem is two-dimensional and is formed by the angular position θ and the angular velocity θ': s = (θ, θ'), where θ takes values in the interval [-π, π], and θ' is limited to the interval [-8, 8] s^{-1}. The Gaussians of the mixture model are four-dimensional, and the GMM provides estimations of the probability densities in the joint space x = (θ, θ', a, q). As the reward signal we simply take the height of the tip of the pendulum, h = cos(θ), which ranges in the interval [-1, 1], and a constant discount coefficient γ is used in equation (2). We initialize the model with 20 Gaussians with random initial means μ_i for all dimensions except the q dimension, which is initialized to the maximum possible Q value to favor exploration of unvisited regions. The initial covariance matrices Σ_i are diagonal, and the variance of each variable is set to the range of that variable. The initial number of samples W_i of each Gaussian is set to 0.1. This small value makes the component have a small influence in the estimation while there has been no, or little, updating. The discount factor λ_t for the weighted sums with forgetting (Section IV) takes values from the equation

\lambda_t = 1 - 1/(a t + b),   (41)

where b regulates the initial value λ_0, and a determines its growth rate toward 1. In our experiments we set a = 0.001, and b = 1000 in the case of using the updating formula (14), and b = 10 in the case of using the updating formula (19), to compensate for the effect of the exponent w_{t,i} < 1. For the experiments, we adopt the set-up of [25]: we run episodes of 7 seconds with actuation intervals of 0.01 seconds.
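The schedule of eq. (41) can be written directly; the following sketch uses the a and b values reported for updating formula (14):

```python
def lambda_schedule(t, a=0.001, b=1000.0):
    """Discount-factor schedule of eq. (41): lambda_t = 1 - 1/(a*t + b),
    growing toward 1 so that forgetting fades out as learning stabilizes."""
    return 1.0 - 1.0 / (a * t + b)

lam0 = lambda_schedule(0)        # initial value, fixed by b
lam_late = lambda_schedule(10**6)
```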
At the beginning of each episode, the pendulum is randomly placed inside an arch centered on the upright position. This can be seen as a form of the hint-to-goal heuristic used in [9] and also in [3]. The length of the arch is steadily incremented with each episode until it covers the whole range, thus allowing any arbitrary initial position. To evaluate the performance of the learning system, we run 10 independent experiments of 120 episodes each. At the end of each episode, a 7-second test exploiting the policy learned so far is done, and the total accumulated reward is computed. Figure 1 shows the result of averaging the results of the 10 experiments using the updating formula (14). It can be observed that, although acceptable control is reached, there is some instability that persists even in the final episodes. This is caused by sporadic periods in which, after having

learned to correctly swing up and stabilize the pendulum, the system unlearns it and must re-learn it to recover the right policy. This effect is a consequence of the forgetting of the function approximation in low-valued regions, caused by the biased sampling that occurs when the system keeps learning after the right policy has already been found. It is to correct this effect that we introduced the updating formula (19) for weight-dependent forgetting. The results obtained using this formula are shown in Figure 2. It can be seen that in this case convergence is much faster and the behavior much more stable, which demonstrates the effectiveness of the approach. To illustrate the performance reached, Figure 3 shows a stroboscopic sequence of the pendulum starting from the initial position of the pendulum hanging down. Figures 4 and 5 show two projections of the Gaussians of a typical GMM obtained for this problem after training. It can be seen that they are not equally distributed over the whole configuration space, but concentrated in the most common trajectories of the system, making an efficient use of resources.

Fig. 1. Average of the accumulated reward per episode over 10 experiments, with uniform forgetting (λ_t).
Fig. 2. Average of the accumulated reward per episode over 10 experiments, with weight-dependent forgetting (λ_t^{w_{t,i}}).
Fig. 3. A stroboscopic sequence obtained by placing the pendulum in the downright position.
Fig. 4. Projection of the Gaussians of the GMM onto the state space.
Fig. 5. Projection of the Gaussians of the GMM onto the (θ, q) space.

VII. CONCLUSIONS

We have shown that estimating a probability density function in the joint space of states, actions, and q-values provides a useful tool for RL in continuous domains. The probability density function captures all the information available to the RL agent: in the first place, it provides a function approximation for the action-value function Q(s, a) as the mean of the sample values q(s, a); in the second place, as in the case of using GPs, the probability density function provides not only the mean value of q(s, a), but the full probability distribution of its possible values, and in particular its variance. We use this information to direct the exploration, a possibility suggested in [5], [4] but not implemented until now. Finally, the probability density in the joint space can be marginalized to obtain the density of samples in the state-action space, which may be used to measure the confidence we may have in the estimation at each point. To represent the probability density function we use a GMM with a variable number of units. This provides a general, non-parametric function approximation tool. By using an on-line version of the EM algorithm, the training of the GMM can be done incrementally and, thanks to the simplicity of the GMM, the update process is computationally efficient. The feasibility of the method is demonstrated on a standard benchmark for RL, the swing-up and balance of an inverted pendulum with limited torque, with good results. We believe that the simplicity and expressiveness of this approach make it a promising alternative for RL in continuous domains.

REFERENCES

[1] C. Stone, "Optimal global rates of convergence for nonparametric regression," The Annals of Statistics, vol. 10, no. 4, 1982.
[2] C. Rasmussen and M. Kuss, "Gaussian processes in reinforcement learning," Advances in Neural Information Processing Systems, vol. 16, 2004.
[3] M. Deisenroth, C. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, no.
7-9, 2009.
[4] Y. Engel, S. Mannor, and R. Meir, "Reinforcement learning with Gaussian processes," in ICML '05: Proceedings of the 22nd International Conference on Machine Learning. New York, NY, USA: ACM, 2005.
[5] Y. Engel, S. Mannor, and R. Meir, "Bayes meets Bellman: The Gaussian process approach to temporal difference learning," in Proc. of the 20th International Conference on Machine Learning, 2003.
[6] G. J. Gordon, "Stable function approximation in dynamic programming," in ICML, 1995.
[7] D. Ernst, P. Geurts, and L. Wehenkel, "Tree-based batch mode reinforcement learning," J. Mach. Learn. Res., vol. 6, 2005.
[8] D. Ormoneit and S. Sen, "Kernel-based reinforcement learning," Machine Learning, vol. 49, no. 2-3, 2002.
[9] M. Riedmiller, "Neural Reinforcement Learning to Swing-up and Balance a Real Pole," in Proceedings of the 2005 IEEE International Conference on Systems, Man and Cybernetics, vol. 4, 2005.
[10] M. Riedmiller, "Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method," Lecture Notes in Computer Science, vol. 3720, 2005.
[11] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[12] C. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, 1992.
[13] R. Bellman and S. Dreyfus, Applied Dynamic Programming. Princeton, New Jersey: Princeton University Press, 1962.
[14] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[15] M. Figueiredo, "On Gaussian radial basis function approximations: Interpretation, extensions, and learning strategies," in Proceedings of the International Conference on Pattern Recognition, vol. 2, 2000.
[16] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977.
[17] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York, USA: John Wiley and Sons, Inc., 2001.
[18] M. Song and H.
Wng, Hghly effcent ncrementl estmton of Gussn mxture models for onlne dt strem clusterng, n Proceedngs of SPIE: Intellgent Computng: Theory nd Applctons III, Orlndo, FL, USA, 2005, pp [19] O. Arndjelovc nd R. Cpoll, Incrementl lernng of temporllycoherent Gussn mxture models, n Techncl Ppers - Socety of Mnufcturng Engneers (SME), [20] M.-A. Sto nd S. Ish, On-lne em lgorthm for the normlzed Gussn network, Neurl Comput., vol. 12, no. 2, pp , [21] S. J. Nowln, Soft compettve dptton: neurl network lernng lgorthms bsed on fttng sttstcl mxtures, Ph.D. dssertton, Pttsburgh, PA, USA, [22] R. Nel nd G. Hnton, A vew of the em lgorthm tht justfes ncrementl, sprse, nd other vrnts, n Proceedngs of the NATO Advnced Study Insttute on Lernng n grphcl models. Norwell, MA, USA: Kluwer Acdemc Publshers, 1998, pp [23] S. Thrun, The role of explorton n lernng control, n Hndbook for Intellgent Control: Neurl, Fuzzy nd Adptve Approches, D. Whte nd D. Sofge, Eds. Florence, Kentucky 41022: Vn Nostrnd Renhold, [24] K. Doy, Renforcement lernng n contnuous tme nd spce, Neurl Comput., vol. 12, no. 1, pp , [25] M.-. Sto nd S. Ish, Renforcement lernng bsed on on-lne em lgorthm, n Proceedngs of the 1998 conference on Advnces n neurl nformton processng systems (NIPS 99). Cmbrdge, MA, USA: MIT Press, 1999, pp
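The conclusions above describe how a GMM over the joint vector (s, a, q) yields the action-value estimate Q(s, a), its variance, and the marginal sample density at (s, a). As a minimal sketch of that conditioning step (not the paper's actual implementation; the function names and NumPy formulation are ours), each Gaussian unit contributes a conditional mean and variance via the standard Gaussian conditioning formulas, which are then mixed with responsibilities computed from the marginal densities:

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x; mean, cov)."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def q_estimate(x, weights, means, covs):
    """Conditional mean and variance of q given x = (s, a), plus the
    marginal density p(x), from a GMM over z = (s, a, q) with q last."""
    d = len(x)  # dimension of the (s, a) part
    w, ms, vs = [], [], []
    for pi_i, mu, S in zip(weights, means, covs):
        mu_x, mu_q = mu[:d], mu[d]
        S_xx, S_xq, S_qq = S[:d, :d], S[:d, d], S[d, d]
        gain = np.linalg.solve(S_xx, S_xq)       # Sigma_xx^{-1} Sigma_xq
        w.append(pi_i * gaussian_pdf(x, mu_x, S_xx))
        ms.append(mu_q + gain @ (x - mu_x))      # conditional mean of unit i
        vs.append(S_qq - S_xq @ gain)            # conditional variance of unit i
    w, ms, vs = map(np.array, (w, ms, vs))
    p_x = w.sum()                # marginal sample density at (s, a)
    w = w / p_x                  # responsibilities of each unit
    q_mean = w @ ms              # Q(s, a): mixture of conditional means
    q_var = w @ (vs + ms ** 2) - q_mean ** 2  # law of total variance
    return q_mean, q_var, p_x
```

The returned variance can drive exploration (prefer actions whose q-estimate is uncertain), while p(x) indicates how well-supported the estimate is by past samples, as discussed in the conclusions.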


More information

Lesson 2. Thermomechanical Measurements for Energy Systems (MENR) Measurements for Mechanical Systems and Production (MMER)

Lesson 2. Thermomechanical Measurements for Energy Systems (MENR) Measurements for Mechanical Systems and Production (MMER) Lesson 2 Thermomechncl Mesurements for Energy Systems (MEN) Mesurements for Mechncl Systems nd Producton (MME) 1 A.Y. 2015-16 Zccr (no ) Del Prete A U The property A s clled: «mesurnd» the reference property

More information

ON SIMPSON S INEQUALITY AND APPLICATIONS. 1. Introduction The following inequality is well known in the literature as Simpson s inequality : 2 1 f (4)

ON SIMPSON S INEQUALITY AND APPLICATIONS. 1. Introduction The following inequality is well known in the literature as Simpson s inequality : 2 1 f (4) ON SIMPSON S INEQUALITY AND APPLICATIONS SS DRAGOMIR, RP AGARWAL, AND P CERONE Abstrct New neultes of Smpson type nd ther pplcton to udrture formule n Numercl Anlyss re gven Introducton The followng neulty

More information

2D1431 Machine Learning Lab 3: Reinforcement Learning

2D1431 Machine Learning Lab 3: Reinforcement Learning 2D1431 Mchine Lerning Lb 3: Reinforcement Lerning Frnk Hoffmnn modified by Örjn Ekeberg December 7, 2004 1 Introduction In this lb you will lern bout dynmic progrmming nd reinforcement lerning. It is ssumed

More information

Linear and Nonlinear Optimization

Linear and Nonlinear Optimization Lner nd Nonlner Optmzton Ynyu Ye Deprtment of Mngement Scence nd Engneerng Stnford Unversty Stnford, CA 9430, U.S.A. http://www.stnford.edu/~yyye http://www.stnford.edu/clss/msnde/ Ynyu Ye, Stnford, MS&E

More information

Reproducing Kernel Hilbert Space for. Penalized Regression Multi-Predictors: Case in Longitudinal Data

Reproducing Kernel Hilbert Space for. Penalized Regression Multi-Predictors: Case in Longitudinal Data Interntonl Journl of Mthemtcl Anlyss Vol. 8, 04, no. 40, 95-96 HIKARI Ltd, www.m-hkr.com http://dx.do.org/0.988/jm.04.47 Reproducng Kernel Hlbert Spce for Penlzed Regresson Mult-Predctors: Cse n Longudnl

More information

AIR FORCE INSTITUTE OF TECHNOLOGY Wright-Patterson Air Force Base, Ohio

AIR FORCE INSTITUTE OF TECHNOLOGY Wright-Patterson Air Force Base, Ohio 7 CHANGE-POINT METHODS FOR OVERDISPERSED COUNT DATA THESIS Brn A. Wlken, Cptn, Unted Sttes Ar Force AFIT/GOR/ENS/7-26 DEPARTMENT OF THE AIR FORCE AIR UNIVERSITY AIR FORCE INSTITUTE OF TECHNOLOGY Wrght-Ptterson

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Lerning Tom Mitchell, Mchine Lerning, chpter 13 Outline Introduction Comprison with inductive lerning Mrkov Decision Processes: the model Optiml policy: The tsk Q Lerning: Q function Algorithm

More information

Reinforcement learning II

Reinforcement learning II CS 1675 Introduction to Mchine Lerning Lecture 26 Reinforcement lerning II Milos Huskrecht milos@cs.pitt.edu 5329 Sennott Squre Reinforcement lerning Bsics: Input x Lerner Output Reinforcement r Critic

More information

The Dynamic Multi-Task Supply Chain Principal-Agent Analysis

The Dynamic Multi-Task Supply Chain Principal-Agent Analysis J. Servce Scence & Mngement 009 : 9- do:0.46/jssm.009.409 Publshed Onlne December 009 www.scp.org/journl/jssm) 9 he Dynmc Mult-sk Supply Chn Prncpl-Agent Anlyss Shnlng LI Chunhu WANG Dol ZHU Mngement School

More information

8. INVERSE Z-TRANSFORM

8. INVERSE Z-TRANSFORM 8. INVERSE Z-TRANSFORM The proce by whch Z-trnform of tme ere, nmely X(), returned to the tme domn clled the nvere Z-trnform. The nvere Z-trnform defned by: Computer tudy Z X M-fle trn.m ued to fnd nvere

More information