Robot Planning in Partially Observable Continuous Domains

Josep M. Porta
Institut de Robòtica i Informàtica Industrial (UPC-CSIC)
Llorens i Artigas 4-6, 08028, Barcelona, Spain
Email: porta@iri.upc.edu

Matthijs T. J. Spaan
Informatics Institute, University of Amsterdam
Kruislaan 403, 1098SJ, Amsterdam, The Netherlands
Email: mtjspaan@science.uva.nl

Nikos Vlassis
Informatics Institute, University of Amsterdam
Kruislaan 403, 1098SJ, Amsterdam, The Netherlands
Email: vlassis@science.uva.nl

Abstract

We present a value iteration algorithm for learning to act in Partially Observable Markov Decision Processes (POMDPs) with continuous state spaces. Mainstream POMDP research focuses on the discrete case and this complicates its application to, e.g., robotic problems that are naturally modeled using continuous state spaces. The main difficulty in defining a (belief-based) POMDP in a continuous state space is that expected values over states must be defined using integrals that, in general, cannot be computed in closed form. In this paper, we first show that the optimal finite-horizon value function over the continuous infinite-dimensional POMDP belief space is piecewise linear and convex, and is defined by a finite set of supporting α-functions that are analogous to the α-vectors (hyperplanes) defining the value function of a discrete-state POMDP. Second, we show that, for a fairly general class of POMDP models in which all functions of interest are modeled by Gaussian mixtures, all belief updates and value iteration backups can be carried out analytically and exactly. A crucial difference with respect to the α-vectors of the discrete case is that, in the continuous case, the α-functions will typically grow in complexity (e.g., in the number of components) in each value iteration. Finally, we demonstrate PERSEUS, our previously proposed randomized point-based value iteration algorithm, in a simple robot planning problem with a continuous domain, where encouraging results are observed.

I. INTRODUCTION

A popular formalism for decision making under uncertainty is the Markov Decision Process (MDP) framework [1]. In this paradigm, an agent interacts with a given system by executing actions that change the state of the system stochastically and that provide rewards or penalties to the agent.
The objective of the learning agent is to identify for each state the action that produces the most reward in the long term. When the decision making has to be performed based on uncertain information about the state of the system, the task is naturally formalized as a Partially Observable Markov Decision Process (POMDP) [2]-[6]. POMDPs have often been used as a framework for planning in robotics [7]-[10].

In general, computing the exact solution of a POMDP is an intractable problem [11], [12], even for the discrete case (i.e., discrete sets of states, actions, and observations). Two main factors cause this high computational cost [13]. The first one is the curse of history: the number of action-observation sequences to be considered increases exponentially as we extend the planning horizon. Fortunately, the curse of history can be mitigated by limiting ourselves to approximate solutions [13], [14]. The second factor that makes POMDP algorithms inefficient is the curse of dimensionality: the computational cost of discrete-state POMDP algorithms scales with the number of states. Therefore, the finer the granularity of the state space discretization, the higher the cost of solving the POMDP. One insight we can extract from this fact is that it would be desirable to avoid the discretization of the state space. Moreover, real-world problems are naturally formalized using continuous spaces. For instance, in a robot navigation problem, the state to be estimated is the pose of the robot that, for a robot moving on a planar surface, is naturally defined in the continuous space of the Cartesian coordinates of the robot and its orientation.

Linear POMDPs with continuous states and quadratic reward functions have closed solutions [15]. However, this is too restrictive a case for many practical purposes. Existing algorithms for continuous-state POMDPs with general reward functions are based on policy search [16], [17] or approximate (grid-based) value iteration [18], [19]. For discrete-state POMDPs, recent promising algorithms are based on point-based value iteration [13], [14]. In this paper, we present a novel approach to solve POMDPs in continuous state spaces via value iteration.
The main difficulty of working in continuous state spaces is that expected values over states must be defined using integrals. These integrals cannot be computed in closed form for general functions and, therefore, only approximation techniques can be used [19]. In our approach, we restrict all functions defined on the state space to a particular, although highly expressive, family of functions: linear combinations of Gaussians. This allows us to evaluate all integrals involved in the value iteration POMDP formulation in closed form. Using this fact, we can adapt to the continuous case the rich machinery developed for discrete-state POMDP value iteration, in particular the point-based algorithms.

This paper is organized as follows. First, in Section II, we review the POMDP framework and the value iteration process for discrete-state POMDPs. In Section III, we generalize the value function representation commonly used in discrete-state POMDPs to continuous-state ones. This allows us to do value iteration for the continuous case. In Section IV, we derive closed formulas for the elements involved in the value iteration framework introduced in Section III, assuming a Gaussian-based representation for the beliefs and the models defining the POMDP. In Section V, we use these closed formulas to define a point-based algorithm for Gaussian-based POMDPs. In Section VI, we present some results with the proposed algorithm and, in Section VII, we summarize our work and point to directions for further research.

II. PRELIMINARIES: POMDPS

A POMDP models an agent interacting with a system using the following elements:

- A set of system states, $S$.
- A set of agent actions, $A$.
- A set of observations, $O$.
- An action (or transition) model defined by $p(s' \mid s, a)$, the probability that the system changes from state $s$ to $s'$ when the agent executes action $a$.
- An observation model defined by $p(o \mid s)$, the probability that the agent observes $o$ when the system reaches state $s$.
- A reward function defined as $r_a(s) \in \mathbb{R}$, the reward obtained by the agent if it executes action $a$ when the system is in state $s$.

At a given moment, the system is in a state, $s$, and the agent executes an action, $a$. As a result, the agent receives a reward, $r$, the system state changes to $s'$ and, then, the agent observes $o$. The knowledge of the agent about the system state is represented as a belief, i.e., a probability distribution over the state space. The initial belief is assumed to be known and, for a discrete set of states, if $b$ is the belief of the agent about the state, the belief after executing action $a$ and observing $o$ is

$$b^{a,o}(s') = \frac{p(o \mid s')}{p(o \mid a, b)} \sum_{s \in S} p(s' \mid s, a)\, b(s). \qquad (1)$$

A function mapping beliefs to actions is called a policy. An optimal policy is one that, on average, generates as much reward as possible in the long term. The value function condenses the immediate and delayed reward that can be obtained from a given belief. This function can be expressed in a recursive way

$$V_n(b) = \max_a Q_n(b, a), \qquad (2)$$

with

$$Q_n(b, a) = \sum_{s \in S} r_a(s)\, b(s) + \gamma \sum_o p(o \mid b, a)\, V_{n-1}(b^{a,o}), \qquad (3)$$

where $n$ is the planning horizon, $S$ and $O$ are assumed discrete and $\gamma \in [0, 1)$ is a discount factor that trades off the importance of the immediate and the delayed reward. The above recursion is usually written in functional form

$$V_n = H\, V_{n-1} \qquad (4)$$

and it is known as the Bellman recursion [20].
This recursion converges to a fixed point $V^*$ that is the optimal value function [1]. An optimal policy $\pi^*$ can be defined as

$$\pi^*(b) = \arg\max_a Q^*(b, a)$$

for $Q^*$ the Q-function associated with the optimal value function, $V^*$. Value iteration for POMDPs [2], [6], [21] generates a sequence of functions $V_i$ using the recurrence in Eq. 4 that progressively approach $V^*$ and computes an approximately optimal policy from the final $V_i$. At first sight the value function seems intractable, but it can be expressed in a simple form [2]

$$V_n(b) = \max_{\{\alpha_n^i\}_i} \sum_s \alpha_n^i(s)\, b(s),$$

with $\{\alpha_n^i\}_i$ a set of vectors. Using this formulation, value iteration algorithms typically focus on the computation of the $\alpha_n$-vectors.

III. POMDPS IN CONTINUOUS STATE SPACES

In this section, we generalize POMDPs to continuous state spaces, while still assuming discrete action and observation spaces. With this formulation, we avoid the necessity of discretizing the state space and, thus, we reduce the chances of being affected by the curse of dimensionality.

In the discrete case, expectations for a given belief are computed by summing over the state space (see Eqs. 1 and 3). The generalization to the continuous case amounts to computing these expected values by integrating instead of summing. Thus we have

$$b^{a,o}(s') = \frac{p(o \mid s')}{p(o \mid a, b)} \int_s p(s' \mid s, a)\, b(s)\, ds, \qquad (5)$$

and

$$Q_n(b, a) = \int_s r_a(s)\, b(s)\, ds + \gamma \sum_o p(o \mid b, a)\, V_{n-1}(b^{a,o}), \qquad (6)$$

where $r_a : S \to \mathbb{R}$ is a continuous reward function for action $a$.

With continuous state spaces, the belief space is also continuous, as in the discrete case, but now with an infinite number of dimensions. However, there are several properties typical of value functions for discrete state spaces that still hold in the continuous case. Namely, we can prove [22] that (1) the optimal finite-horizon value function is piecewise linear and convex (PWLC) in the belief space, (2) the value function recursion is isotonic, and (3) this recursion is also a contraction (and thus, the iterative computation of the value function for increasing horizons will converge to the optimal value function $V^*$). The PWLC property is basic since it allows us to represent the value function using a small set of supporting elements.
This kind of representation is the key element to define the value iteration process. To prove this property, we first need to prove the following lemma.

Lemma 1: The value function in a continuous-state POMDP can be expressed as

$$V_n(b) = \max_{\{\alpha_n^i\}_i} \int_s \alpha_n^i(s)\, b(s)\, ds,$$

for appropriate α-functions $\alpha_n^i : S \to \mathbb{R}$.

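Proof: The proof, as in the discrete case, is done via induction. For planning horizon 0, we only have to take into account the immediate reward and, thus, we have that

$$V_0(b) = \max_a \int_s r_a(s)\, b(s)\, ds,$$

and, therefore, if we define

$$\{\alpha_0^i(s)\}_i = \{r_a(s)\}_{a \in A},$$

we have that, as desired,

$$V_0(b) = \max_{\{\alpha_0^i\}_i} \int_s \alpha_0^i(s)\, b(s)\, ds.$$

For the general case, we have that, using Eqs. 2 and 6,

$$V_n(b) = \max_a \Big\{ \int_s r_a(s)\, b(s)\, ds + \gamma \sum_o p(o \mid a, b)\, V_{n-1}(b^{a,o}) \Big\},$$

and, by the induction hypothesis,

$$V_{n-1}(b^{a,o}) = \max_{\{\alpha_{n-1}^j\}_j} \int_{s'} \alpha_{n-1}^j(s')\, b^{a,o}(s')\, ds'.$$

From Eq. 5,

$$V_{n-1}(b^{a,o}) = \max_{\{\alpha_{n-1}^j\}_j} \int_{s'} \alpha_{n-1}^j(s')\, \frac{p(o \mid s')}{p(o \mid a, b)} \int_s p(s' \mid s, a)\, b(s)\, ds\, ds'$$

$$= \frac{1}{p(o \mid a, b)} \max_{\{\alpha_{n-1}^j\}_j} \int_{s'} \alpha_{n-1}^j(s')\, p(o \mid s') \int_s p(s' \mid s, a)\, b(s)\, ds\, ds',$$

and, therefore, the factor $p(o \mid a, b)$ cancels and

$$V_n(b) = \max_a \Big\{ \int_s r_a(s)\, b(s)\, ds + \gamma \sum_o \max_{\{\alpha_{n-1}^j\}_j} \int_{s'} \alpha_{n-1}^j(s')\, p(o \mid s') \int_s p(s' \mid s, a)\, b(s)\, ds\, ds' \Big\}$$

$$= \max_a \Big\{ \int_s r_a(s)\, b(s)\, ds + \gamma \sum_o \max_{\{\alpha_{n-1}^j\}_j} \int_s \Big[ \int_{s'} \alpha_{n-1}^j(s')\, p(o \mid s')\, p(s' \mid s, a)\, ds' \Big] b(s)\, ds \Big\}.$$

At this point, we define

$$\alpha_{a,o}^j(s) = \int_{s'} \alpha_{n-1}^j(s')\, p(o \mid s')\, p(s' \mid s, a)\, ds'. \qquad (7)$$

With this, we have that

$$V_n(b) = \max_a \Big\{ \int_s r_a(s)\, b(s)\, ds + \gamma \sum_o \max_{\{\alpha_{a,o}^j\}_j} \int_s \alpha_{a,o}^j(s)\, b(s)\, ds \Big\},$$

and we define

$$\alpha_{a,o,b} = \arg\max_{\{\alpha_{a,o}^j\}_j} \int_s \alpha_{a,o}^j(s)\, b(s)\, ds. \qquad (8)$$

Observe that, for given $a$ and $o$, $\alpha_{a,o,b}$ is just one of the $M$ elements in the set $\{\alpha_{a,o}^j\}_j$ (one per α-function defining $V_{n-1}$). Using a reasoning parallel to that of the enumeration phase of the Monahan algorithm [3], choosing one $\alpha_{a,o,b}$ per observation for each action yields, at most, $|A|\, M^{|O|}$ different combinations. The finite cardinality of this set is the crucial point since it proves that we can represent $V_n(b)$ with a finite set of supporting α-functions, despite the infinite dimensionality of the belief space.

Using the above, we can write

$$V_n(b) = \max_a \Big\{ \int_s r_a(s)\, b(s)\, ds + \gamma \sum_o \int_s \alpha_{a,o,b}(s)\, b(s)\, ds \Big\}$$

$$= \max_a \int_s \Big[ r_a(s) + \gamma \sum_o \alpha_{a,o,b}(s) \Big] b(s)\, ds.$$

If we define

$$\{\alpha_n^i(s)\}_i = \Big\{ r_a(s) + \gamma \sum_o \alpha_{a,o,b}(s) \Big\}_{a \in A}, \qquad (9)$$

we have $V_n$ in the desired form

$$V_n(b) = \max_{\{\alpha_n^i\}_i} \int_s \alpha_n^i(s)\, b(s)\, ds, \qquad (10)$$

and, thus, the lemma holds.

Lemma 2: The value function is PWLC in the belief space.

Proof: It holds that

$$V_n(b) = \max_{\{\alpha_n^i\}_i} V_n^i(b),$$

with

$$V_n^i(b) = \int_s \alpha_n^i(s)\, b(s)\, ds.$$

For a particular $V_n^i$ it clearly holds that

$$V_n^i(\kappa\, b_1 + \lambda\, b_2) = \kappa\, V_n^i(b_1) + \lambda\, V_n^i(b_2),$$

for arbitrary $\kappa$ and $\lambda$. Therefore, each $V_n^i$ is a linear function in $b$. The piecewise linearity part of the property is given by the fact that the set $\{\alpha_n^i\}_i$ is of finite cardinality and, as shown above, $V_n$ is linear for each individual $\alpha_n^i$. Finally, the convexity is given by the fact that we take the maximum of convex (linear) functions when computing the value function and, thus, we obtain a convex function as a result.

Eqs. 7 to 9 constitute the value iteration process for continuous-state POMDPs since they provide a constructive way to determine the elements (i.e., the α-functions) defining $V_n$ from those defining $V_{n-1}$.

IV. GAUSSIAN-BASED POMDPS

In previous sections, we left open how to actually compute the belief update (Eq. 5), the steps in the value iteration process (Eqs. 7 to 9), and the value for a given belief point (Eq. 10). In this section, we show how these computations are possible assuming that the beliefs as well as the observation, action, and reward models are represented as linear combinations of Gaussians. We first formally introduce our assumptions on the models (Section IV-A) and then we define the belief update (Section IV-B) and the basic value iteration steps (Section IV-C) for Gaussian-based POMDPs. Note that other families of integrable functions could be used to determine the α-functions in closed form, but Gaussian-based models provide a high degree of flexibility and are of common use in many applications, including robotics [23], [24].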
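The closed forms derived in the remainder of this section all rest on two standard Gaussian identities, restated here as a reminder. We write $\phi(x \mid \mu, \Sigma)$ for a Gaussian density in $x$ with mean $\mu$ and covariance $\Sigma$; the symbols $x$, $\mu_1$, $\mu_2$, $\mu_{12}$, $\Sigma_1$, $\Sigma_2$ and $\Sigma_{12}$ are generic placeholders introduced only for this reminder and are not part of the notation used elsewhere in the paper.

```latex
\begin{align*}
% Product of two Gaussians: a scaled Gaussian
\phi(x \mid \mu_1, \Sigma_1)\,\phi(x \mid \mu_2, \Sigma_2)
  &= \phi(\mu_1 \mid \mu_2, \Sigma_1 + \Sigma_2)\;
     \phi(x \mid \mu_{12}, \Sigma_{12}), \\
\Sigma_{12} &= \left(\Sigma_1^{-1} + \Sigma_2^{-1}\right)^{-1}, \qquad
\mu_{12} = \Sigma_{12}\left(\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2\right), \\
% Integrating the product leaves only the scale factor
\int \phi(x \mid \mu_1, \Sigma_1)\,\phi(x \mid \mu_2, \Sigma_2)\,dx
  &= \phi(\mu_1 \mid \mu_2, \Sigma_1 + \Sigma_2).
\end{align*}
```

The first identity is what turns a product of mixtures into another (unnormalized) mixture; the second follows from it because the remaining Gaussian integrates to one.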

A. Models for Gaussian-based POMDPs

We will assume that belief points are represented as Gaussian mixtures

$$b(s) = \sum_j w_j\, \phi(s \mid s_j, \Sigma_j), \qquad (11)$$

with $\phi$ a Gaussian with mean $s_j$ and covariance matrix $\Sigma_j$ and where the mixing weights satisfy $w_j > 0$, $\sum_j w_j = 1$. In the extreme case, Gaussian mixtures with an infinite number of components would be necessary to represent a given point in the continuous, infinite-dimensional belief space. However, only Gaussian mixtures with few components are needed in practical situations.

We assume that our observation model is defined non-parametrically from a set of samples $T = \{(s_i, o_i) \mid i \in [1, N]\}$ with $o_i$ an observation obtained at state $s_i$. Using these samples, the observation model can be defined as

$$p(o \mid s) = \frac{p(s \mid o)\, p(o)}{p(s)},$$

and, assuming a uniform $p(s)$ in the space covered by $T$, and approximating $p(o)$ from the samples in the training set, we have

$$p(o \mid s) \approx \Big[ \frac{1}{N_o} \sum_{i=1}^{N_o} \lambda_i^o\, \phi(s \mid s_i^o, \Sigma_i^o) \Big] \frac{N_o}{N} = \sum_{i=1}^{N_o} w_i^o\, \phi(s \mid s_i^o, \Sigma_i^o),$$

with $s_i^o$ one of the $N_o$ points in $T$ with $o$ as associated observation and where $w_i^o = \lambda_i^o / N$ and $\Sigma_i^o$ are, respectively, a weighting factor and a covariance matrix associated with that training point. The sets $\{\lambda_i^o\}_i$ and $\{\Sigma_i^o\}_i$ are defined so that

$$p(s) = \sum_o p(s \mid o)\, p(o) = \sum_o \sum_{i=1}^{N_o} w_i^o\, \phi(s \mid s_i^o, \Sigma_i^o)$$

is (approximately) uniform in the area covered by $T$.

As far as the action model is concerned, we assume it is linear-Gaussian

$$p(s' \mid s, a) = \phi(s' \mid s + \Delta(a), \Sigma^a). \qquad (12)$$

Non-linear action models can be approximated as is done, for instance, in the extended Kalman filter or in the unscented Kalman filter [25]. The function $\Delta : A \to S$ implements the transition model of the system.

Finally, the reward can be seen as an observation with an associated scalar value. Therefore, assuming a finite set of possible rewards $R = \{r_i \mid i \in [1, M]\}$, the reward model $p(r \mid s, a)$ for each particular $a$ can be represented in the same way as the observation model

$$p(r \mid s, a) \approx \sum_{i=1}^{M_r} w_i^r\, \phi(s \mid s_i^r, \Sigma_i^r).$$

With that, we have that

$$r_a(s) = \sum_{r \in R} r\, p(r \mid s, a) \approx \sum_{r \in R} r \sum_{i=1}^{M_r} w_i^r\, \phi(s \mid s_i^r, \Sigma_i^r),$$

that is, an unnormalized Gaussian mixture.

B. Belief Updates for Gaussian-Based POMDPs

The belief update in Eq. 5 can be implemented in our model taking into account that it consists of two steps. The first one is the application of the action model on the current belief state.
This can be computed as the convolution of the Gaussians representing $b(s)$ (Eq. 11) with the Gaussian representing the action model (Eq. 12). This convolution results in

$$\int_s p(s' \mid s, a)\, b(s)\, ds = \sum_j w_j\, \phi(s' \mid s_j + \Delta(a), \Sigma_j + \Sigma^a).$$

In the second step of the belief update, the prediction obtained with the action model is corrected using the information provided by the observation model

$$b^{a,o}(s') \propto \Big[ \sum_i w_i^o\, \phi(s' \mid s_i^o, \Sigma_i^o) \Big] \Big[ \sum_j w_j\, \phi(s' \mid s_j + \Delta(a), \Sigma_j + \Sigma^a) \Big]$$

$$= \sum_{i,j} w_i^o\, w_j\, \phi(s' \mid s_i^o, \Sigma_i^o)\, \phi(s' \mid s_j + \Delta(a), \Sigma_j + \Sigma^a).$$

The product of two Gaussian functions is a scaled Gaussian. Therefore, we have that

$$b^{a,o}(s') \propto \sum_{i,j} w_i^o\, w_j\, \delta_{i,j}^{a,o}\, \phi(s' \mid s_{i,j}^{a,o}, \Sigma_{i,j}^{a,o}),$$

with

$$\delta_{i,j}^{a,o} = \phi(s_j + \Delta(a) \mid s_i^o, \Sigma_i^o + \Sigma_j + \Sigma^a),$$

$$\Sigma_{i,j}^{a,o} = \big( (\Sigma_i^o)^{-1} + (\Sigma_j + \Sigma^a)^{-1} \big)^{-1},$$

$$s_{i,j}^{a,o} = \Sigma_{i,j}^{a,o} \big( (\Sigma_i^o)^{-1} s_i^o + (\Sigma_j + \Sigma^a)^{-1} (s_j + \Delta(a)) \big).$$

The proportionality in the definition of $b^{a,o}(s')$ implies that the weights ($w_i^o\, w_j\, \delta_{i,j}^{a,o}$, $\forall i, j$) should be scaled to sum to one.

C. Backup Operator for Gaussian-Based POMDPs

The computation of the mapping $H$ (Eq. 4) for a given belief point $b$ is called a backup. This mapping determines the α-function (or α-vector in the discrete case) to be included in $V_n$ for the belief point under consideration (see Eqs. 7 to 9). A full backup, i.e., a backup for the whole belief space, involves the computation of all relevant α-functions for $V_n$. Full backups are computationally expensive (in the discrete case they involve the use of linear programming in order to determine a sufficient set of points on which to back up), but the backup for a single belief point is relatively cheap. This is exploited by the point-based POMDP algorithms to efficiently approximate $V_n$ on a fixed set of belief points [13], [14]. Next, we describe the backup operator on continuous state spaces that we will use later in the PERSEUS algorithm.

The backup for a given belief point $b$ is

$$\mathrm{backup}(b) = \arg\max_{\{\alpha_n^i\}_i} \int_s \alpha_n^i(s)\, b(s)\, ds,$$

where $\alpha_n^i(s)$ is defined in Eqs. 8 and 9 from the $\alpha_{a,o}$-functions (Eq. 7).
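The two-step belief update of Section IV-B can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions that are not in the paper: the state is one-dimensional (so covariances are scalar variances), mixtures are stored as lists of `(weight, mean, variance)` triples, and the names `gauss`, `belief_update`, `delta` and `var_a` are hypothetical.

```python
import math


def gauss(x, mean, var):
    """1-D Gaussian density phi(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def belief_update(belief, delta, var_a, obs_mixture):
    """Belief update for 1-D Gaussian mixtures (illustrative sketch).

    belief, obs_mixture: lists of (weight, mean, variance) triples.
    delta: displacement Delta(a) of the executed action (assumed scalar).
    var_a: variance of the action noise (assumed scalar).
    """
    # Step 1 (prediction): convolving with the linear-Gaussian action model
    # shifts every component by delta and adds the action variance.
    predicted = [(w, m + delta, v + var_a) for (w, m, v) in belief]

    # Step 2 (correction): multiply by the observation mixture, pair by pair.
    updated = []
    for (wo, mo, vo) in obs_mixture:
        for (wj, mj, vj) in predicted:
            scale = gauss(mj, mo, vo + vj)        # scale factor of the product
            var = 1.0 / (1.0 / vo + 1.0 / vj)     # variance of the product
            mean = var * (mo / vo + mj / vj)      # mean of the product
            updated.append((wo * wj * scale, mean, var))

    # The update is defined up to proportionality, so rescale the weights.
    total = sum(w for (w, _, _) in updated)
    return [(w / total, m, v) for (w, m, v) in updated]
```

Note that the number of components after one update is the product of the number of belief components and observation components, which is why some form of mixture compression is needed when the update is repeated over many time steps.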

Lemma 3: The functions $\alpha_n^i(s)$ can be expressed as linear combinations of Gaussians, assuming the sensor, action and reward models are also Gaussian-based.

Proof: This lemma can be proved via induction. For $n = 0$, $\alpha_0^i(s) = r_a(s)$ for a fixed $a$ and thus it is indeed an unnormalized Gaussian mixture. For $n > 0$, we assume that

$$\alpha_{n-1}^j(s') = \sum_k w_k^j\, \phi(s' \mid s_k^j, \Sigma_k^j).$$

Then, with our particular models, $\alpha_{a,o}^j(s)$ in Eq. 7 is the integral of the product of three linear combinations of Gaussians

$$\alpha_{a,o}^j(s) = \int_{s'} \Big[ \sum_k w_k^j\, \phi(s' \mid s_k^j, \Sigma_k^j) \Big] \Big[ \sum_l w_l^o\, \phi(s' \mid s_l^o, \Sigma_l^o) \Big] \phi(s' \mid s + \Delta(a), \Sigma^a)\, ds'$$

$$= \sum_{k,l} w_k^j\, w_l^o \int_{s'} \phi(s' \mid s_k^j, \Sigma_k^j)\, \phi(s' \mid s_l^o, \Sigma_l^o)\, \phi(s' \mid s + \Delta(a), \Sigma^a)\, ds'.$$

In this case, we have to perform the product of two Gaussians twice, once for $\phi(s' \mid s_k^j, \Sigma_k^j)$ and $\phi(s' \mid s_l^o, \Sigma_l^o)$ to get $\delta_{k,l}^{j,o}\, \phi(s' \mid s_{k,l}^{j,o}, \Sigma_{k,l}^{j,o})$ and once more for this result and $\phi(s' \mid s + \Delta(a), \Sigma^a)$ to get $\delta_{k,l}^{j,o}\, \beta_{k,l}^{j,o,a}(s)\, \phi(s' \mid \bar{s}, \bar{\Sigma})$, where the mean $\bar{s}$ and covariance $\bar{\Sigma}$ of the last Gaussian need not be made explicit because it integrates to one over $s'$. The terms $\delta_{k,l}^{j,o}$ and $\beta_{k,l}^{j,o,a}(s)$ can be expressed as

$$\delta_{k,l}^{j,o} = \phi(s_l^o \mid s_k^j, \Sigma_k^j + \Sigma_l^o),$$

$$\beta_{k,l}^{j,o,a}(s) = \phi(s \mid s_{k,l}^{j,o} - \Delta(a), \Sigma_{k,l}^{j,o} + \Sigma^a),$$

with

$$\Sigma_{k,l}^{j,o} = \big[ (\Sigma_k^j)^{-1} + (\Sigma_l^o)^{-1} \big]^{-1},$$

$$s_{k,l}^{j,o} = \Sigma_{k,l}^{j,o} \big[ (\Sigma_k^j)^{-1} s_k^j + (\Sigma_l^o)^{-1} s_l^o \big].$$

With this, we have

$$\alpha_{a,o}^j(s) = \sum_{k,l} w_k^j\, w_l^o \int_{s'} \delta_{k,l}^{j,o}\, \beta_{k,l}^{j,o,a}(s)\, \phi(s' \mid \bar{s}, \bar{\Sigma})\, ds'$$

$$= \sum_{k,l} w_k^j\, w_l^o\, \delta_{k,l}^{j,o}\, \beta_{k,l}^{j,o,a}(s) \int_{s'} \phi(s' \mid \bar{s}, \bar{\Sigma})\, ds'$$

$$= \sum_{k,l} w_k^j\, w_l^o\, \delta_{k,l}^{j,o}\, \beta_{k,l}^{j,o,a}(s).$$

Once we have the $\alpha_{a,o}^j$-functions, we can compute the $\alpha_n^i$-functions. To do that, we need to determine the $\alpha_{a,o}^j$ for which $\int_s \alpha_{a,o}^j(s)\, b(s)\, ds$ is maximized. Since the integral of the product of two Gaussian mixtures (in particular an α-function and a belief point) is a rather common operation in the continuous-state POMDP framework, we will denote it by

$$\langle \alpha, b \rangle = \int_s \alpha(s)\, b(s)\, ds.$$

This operator can be computed as

$$\langle \alpha, b \rangle = \int_s \Big[ \sum_k w_k\, \phi(s \mid s_k, \Sigma_k) \Big] \Big[ \sum_j w_j\, \phi(s \mid s_j, \Sigma_j) \Big] ds$$

$$= \sum_{k,j} w_k\, w_j \int_s \phi(s \mid s_k, \Sigma_k)\, \phi(s \mid s_j, \Sigma_j)\, ds$$

$$= \sum_{k,j} w_k\, w_j\, \phi(s_j \mid s_k, \Sigma_k + \Sigma_j).$$

Using this operator and Eqs. 8 and 9, we define

$$\{\alpha_n^i(s)\}_i = \Big\{ r_a(s) + \gamma \sum_o \arg\max_{\{\alpha_{a,o}^j\}_j} \langle \alpha_{a,o}^j, b \rangle \Big\}_{a \in A}.$$

Since all elements involved in the definition are linear combinations of Gaussians, so is the final result.

Using the above lemma, the backup function is

$$\mathrm{backup}(b) = \arg\max_{\{\alpha_n^i\}_i} \langle \alpha_n^i, b \rangle,$$

and the value of $V_n$ at $b$ (Eq. 10) is simply

$$V_n(b) = \langle \mathrm{backup}(b), b \rangle.$$

V. CONTINUOUS-STATE PERSEUS

In this section, we use the backup operator to extend to the continuous case the point-based value iteration algorithm PERSEUS [14], [26], which has been shown to be very efficient for discrete-state POMDPs. The continuous-state PERSEUS algorithm is shown in Table I. Point-based POMDP algorithms focus on identifying the α-functions (α-vectors in the discrete case) for a set of likely belief points. The α-functions for this restricted set of belief points generalize over the whole belief space and, thus, they can be used to approximate the value function for any belief point. The result is an approximation of the value function with less error in regions of the belief space where decisions are more likely to be taken.

TABLE I
THE PERSEUS ALGORITHM. THE backup FUNCTION IS DESCRIBED IN SECTION IV-C.

Perseus
Input: A continuous-state POMDP.
Output: $V_n$, an approximation to the optimal value function, $V^*$.
 1: Initialize
 2:   $B \leftarrow$ A set of randomly sampled belief points.
 3:   $\alpha \leftarrow \frac{\min\{R\}}{1 - \gamma}\, \phi(s \mid 0, \Sigma)$
 4:   $n \leftarrow 0$
 5:   $V_n \leftarrow \{\alpha\}$
 6: do
 7:   $\forall b \in B$,
 8:     $\mathrm{Function}_n(b) \leftarrow \arg\max_{\alpha \in V_n} \langle \alpha, b \rangle$
 9:     $\mathrm{Value}_n(b) \leftarrow \langle \mathrm{Function}_n(b), b \rangle$
10:   $V_{n+1} \leftarrow \emptyset$
11:   $\tilde{B} \leftarrow B$
12:   do
13:     $b \leftarrow$ Point sampled randomly from $\tilde{B}$.
14:     $\alpha \leftarrow \mathrm{backup}(b)$
15:     if $\langle \alpha, b \rangle < \mathrm{Value}_n(b)$
16:       $\alpha \leftarrow \mathrm{Function}_n(b)$
17:     endif
18:     $V_{n+1} \leftarrow V_{n+1} \cup \{\alpha\}$
19:     $\tilde{B} \leftarrow \tilde{B} \setminus \{b' \in \tilde{B} \mid \langle \alpha, b' \rangle \geq \mathrm{Value}_n(b')\}$
20:   until $\tilde{B} = \emptyset$
21:   $n \leftarrow n + 1$
22: until convergence

The value update scheme of PERSEUS implements a randomized approximate value function recursion $V_n = \tilde{H}\, V_{n-1}$ for a set of randomly sampled belief points $B$. First (Table I, line 2), we let the agent randomly explore the environment and collect a set $B$ of reachable belief points. Next (Table I, lines 3-5), we initialize the value function $V_0$ as a single weighted Gaussian with a large covariance and with weight $\min\{R\} / (1 - \gamma)$, with $R$ the set of possible rewards.

Starting with $V_0$, PERSEUS performs a number of approximate value function update stages. The definition of the value update process can be seen on lines 10-20 in Table I, where $\tilde{B}$ is a set of non-improved points: points for which $V_{n+1}(b)$ is still lower than $V_n(b)$. At the start of each update stage, $V_{n+1}$ is set to $\emptyset$ and $\tilde{B}$ is initialized to $B$. As long as $\tilde{B}$ is not empty, we sample a point $b$ from $\tilde{B}$ and compute the new α-function associated with this point using the backup operator (see Section IV-C and line 14 in Table I). If this α-function improves the value of $b$ (i.e., if $\langle \alpha, b \rangle \geq V_n(b)$, line 15), we add $\alpha$ to $V_{n+1}$ (line 18). The hope is that $\alpha$ improves the value of many other points, and all these points are removed from $\tilde{B}$ (line 19). Often, a small number of vectors will be sufficient to improve $V_n(b)$ $\forall b \in B$, especially in the first steps of value iteration. As long as $\tilde{B}$ is not empty we continue sampling belief points from it and trying to add their α-functions to $V_{n+1}$.

If the $\alpha$ computed by the backup operator does not at least maintain the value of $b$ (i.e., $\langle \alpha, b \rangle < V_n(b)$, see line 15 in Table I), we ignore $\alpha$ and insert a copy of the maximizing function of $b$ from $V_n$ in $V_{n+1}$ (lines 16 and 18). Point $b$ is now considered improved and is removed from $\tilde{B}$, together with any other belief points that have the same function as maximizing one in $V_n$ (line 19). This procedure ensures that $\tilde{B}$ shrinks at each iteration and that the value update stage terminates. PERSEUS stops when a given convergence criterion holds. This criterion can be based on the stability of the value function, on the stability of the associated policy, or simply on a maximum number of iterations.

One point that deserves special consideration when implementing the PERSEUS algorithm is the possible explosion of the number of components in the Gaussian mixtures defining the α-functions for increasing $n$, and of the number of components in the belief representation when the belief update (see Section IV-B) is repeated for many time steps. The larger the number of components, the slower the basic operations of the algorithm. To keep the number of components bounded, we adapted the procedure described in [27] that transforms a given Gaussian mixture with $k$ components to another Gaussian mixture with at most $m$ components, $m < k$, while retaining the initial component structure.

Fig. 1. A pictorial representation of the test problem (a), the corresponding observation model (b) and the reward model (c). Legend of the observation model: left end, right end, corridor, door. Legend of the reward model: move left, move right, enter door.

VI. EXPERIMENTS AND RESULTS

To demonstrate the viability of our method we carried out an experiment in a simulated robotic domain. In this problem (see Fig. 1-a), a robot is moving in a corridor with four doors. The robot can detect when it is in front of a door and when it is at the left or right end of the corridor. In any other situation, the robot just detects that it is in a corridor (see Fig. 1-b). The robot can move 2 units to the left or to the right (with $\Sigma^a$ = .5) and can try to enter a door at any point (even when not in front of a door). The target for the robot is to locate the second door from the right and to enter it. The robot only gets positive reward when it enters the target door (see Fig. 1-c). When the robot tries to move further than the end of the corridor (either at the right or at the left) or when it tries to enter the door at a wrong position it gets negative reward.

The set of beliefs $B$ used in the PERSEUS algorithm consists of unique belief points. Those belief points are collected using random walks departing from beliefs including 4 components that approximate a uniform distribution on the whole corridor. The walks of the robot along the corridor are organized in episodes where the robot executes actions until it tries to enter a door or until it executes 25 (movement) actions.
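The value-update stage of Table I (lines 7 to 20), together with the closed form of the $\langle \alpha, b \rangle$ operator from Section IV-C, can be sketched in Python as follows. This is an illustrative sketch and not the implementation used in the experiments: states are assumed one-dimensional, mixtures are lists of `(weight, mean, variance)` triples, and `mixture_inner`, `perseus_stage` and the `backup` callable passed in are hypothetical names (the backup itself is not implemented here).

```python
import math
import random


def gauss(x, mean, var):
    """1-D Gaussian density phi(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_inner(alpha, b):
    """<alpha, b>: integral of the product of two 1-D Gaussian mixtures."""
    return sum(wk * wj * gauss(mj, mk, vk + vj)
               for (wk, mk, vk) in alpha
               for (wj, mj, vj) in b)


def perseus_stage(beliefs, V, backup, inner, rng=random):
    """One PERSEUS value-update stage (Table I, lines 7-20), generic sketch.

    beliefs: list of belief points; V: non-empty list of alpha-functions.
    backup(b, V): caller-supplied function returning a new alpha-function for b.
    inner(alpha, b): the <alpha, b> operator.
    """
    # Lines 7-9: cache the maximizing function and the value of every belief.
    best = [max(V, key=lambda a: inner(a, b)) for b in beliefs]
    value = [inner(f, b) for f, b in zip(best, beliefs)]

    V_next = []                                   # line 10
    todo = list(range(len(beliefs)))              # line 11: non-improved points
    while todo:                                   # lines 12-20
        i = rng.choice(todo)                      # line 13
        alpha = backup(beliefs[i], V)             # line 14
        if inner(alpha, beliefs[i]) < value[i]:   # line 15
            alpha = best[i]                       # line 16: keep the old function
        V_next.append(alpha)                      # line 18
        # Line 19: keep only the points whose value alpha does not yet reach.
        todo = [k for k in todo if inner(alpha, beliefs[k]) < value[k]]
    return V_next
```

Each pass of the inner loop removes at least the sampled point from the set of non-improved points, so the stage terminates after at most as many backups as there are belief points. Passing `inner` as a parameter keeps the stage independent of the representation: with a dot product it reduces to the discrete-state case.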

Fig. 2. Top: Evolution of the value for all the beliefs in $B$ and the average accumulated discounted reward over episodes. Bottom: Number of vectors in $V_n$ and the number of policy changes. Results are averaged over repetitions and the bars represent the standard deviation.

Fig. 3. Evolution of the belief when following the discovered policy. The arrows under the snapshots represent the actions (moving right, moving left, and entering the door). On the x-axis the four door locations are indicated.

The experimental setup is completed by setting $\gamma$ to 0.95, compressing beliefs so that they never contain more than 4 components (i.e., the number of components of the initial beliefs) and compressing α-functions so that they never have more components than those used to represent the reward function.

Fig. 2 shows the average results obtained over repeated runs of the PERSEUS algorithm on this problem. The first plot (top-left) shows that the value computed as $\sum_{b \in B} V(b)$ converges. The second plot (top-right) shows the expected discounted reward averaged over episodes with the policy available at the corresponding time slice. The plot indicates that the robot successfully learns to find out its position and to distinguish between the four doors. The next plot (bottom-left) shows the number of α-functions used to represent the value function. We can see that the number of α-functions increases, but is far below $|B|$, the maximum possible number of α-functions (reached if we backed up each point in $B$). In the final plot (bottom-right) we show the number of changes in the policy from one time step to the next one. The changes in the policy are computed as the number of elements in $B$ with a different action from one time slice to the next. The number of policy changes drops to close to zero, indicating convergence with respect to the particular $B$.
Following the learned policy, the robot moves to one of the ends of the corridor to determine its position and then toward the correct door to enter it. The snapshots A to I in Fig. 3 show the evolution of the belief of the robot and the executed action in each case from the initial stages of the episode to the point at which the target door is reached.

In Fig. 4 we plot the value of beliefs that have only one component, parametrized by the mean and the covariance of this component. We can see that, as the uncertainty about the position of the robot grows (i.e., the covariance is larger), the value of the corresponding belief decreases. The colors/shadings in the figure correspond to the different actions: light-gray for moving to the right, white for entering the door, and dark-gray for moving to the left.

Fig. 4. Value function for single-component beliefs as a function of the mean ($\mu$) and the covariance ($\sigma$).

Observe that the advantage of using a continuous state space is that we obtain a scale-invariant solution. If we have to solve the same problem in a longer corridor, we can just scale the Gaussians used in the problem definition and we will obtain the solution with the same cost as we have now. The only difference is that more actions would be needed in each episode to reach the correct door. When discretizing the environment, the granularity has to be in accordance with the size of the actions taken by the robot (±2 left/right) and, thus, the number of states, and consequently the cost of the planning, grows as the environment grows.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we have shown how to generalize value iteration to continuous-state POMDPs and, in particular, for the case of Gaussian-based beliefs and models. This allowed us to define an efficient point-based value iteration algorithm that seems to be appropriate for planning problems that are often encountered in robotics.

An approach to continuous-state POMDPs that is closely related to ours is presented in [19]. In that work, a belief is represented by a set of weighted samples, which can be regarded as a degenerate version of our Gaussian mixture representation. Additionally, the value function is approximated by a nearest-neighbor interpolation, whereas in our case the value function achieves generalization through a set of α-functions. Also, in the above work a real-time dynamic programming approach is used for updating the value function, with the Bellman backup operator being approximated by sampling from the belief transition model. In our case, value iteration is applied to a pre-collected set of beliefs, while the Bellman backup operator is analytically computed given the particular value function representation. Although we have not directly compared our method to the method presented in [19], we expect our method to be faster (since it plans on a fixed set of belief points) and the value function to generalize better over the belief space (through the use of α-functions).

Ongoing work involves extending our framework to continuous action [26] and observation spaces [28], as well as defining approximate belief representations using Monte Carlo techniques [19].

ACKNOWLEDGMENTS

We would like to thank J.J. Verbeek and W. Zajdel for their contributions to the work reported here, and the four reviewers for their detailed comments. J.M. Porta has been partially supported by a Ramón y Cajal contract from the Spanish Ministry for Science and Technology. M.T.J. Spaan and N. Vlassis are supported by PROGRESS, the embedded systems research program of the Dutch organization for Scientific Research NWO, the Dutch Ministry of Economic Affairs and the Technology Foundation STW, project AES 5414. Authors are listed in alphabetical order.

REFERENCES

[1] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Mathematical Statistics. John Wiley and Sons, Inc., 1994.
[2] E. J. Sondik, "The Optimal Control of Partially Observable Markov Processes," Ph.D. dissertation, Stanford University, 1971.
[3] G. E. Monahan, "A Survey of Partially Observable Markov Decision Processes: Theory, Models, and Algorithms," Management Science, vol. 28, no. 1, pp. 1-16, 1982.
[4] H. T. Cheng, "Algorithms for Partially Observable Markov Decision Processes," Ph.D. dissertation, University of British Columbia, 1988.
[5] A. R. Cassandra, M. L. Littman, and N. L. Zhang, "Incremental Pruning: A Simple, Fast, Exact Algorithm for Partially Observable Markov Decision Processes," in Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI 97), 1997.
[6] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, "Planning and Acting in Partially Observable Stochastic Domains," Artificial Intelligence, vol. 101, no. 1-2, 1998.
[7] R. Simmons and S. Koenig, "Probabilistic Robot Navigation in Partially Observable Environments," in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1995.
[8] A. R. Cassandra, L. P. Kaelbling, and J. A. Kurien, "Acting under Uncertainty: Discrete Bayesian Models for Mobile-Robot Navigation," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1996.
[9] G. Theocharous and S. Mahadevan, "Approximate Planning with Hierarchical Partially Observable Markov Decision Processes for Robot Navigation," in IEEE International Conference on Robotics and Automation, 2002.
[10] J. Pineau, M. Montemerlo, M. Pollack, N. Roy, and S. Thrun, "Towards Robotic Assistants in Nursing Homes: Challenges and Results," Robotics and Autonomous Systems, vol. 42, no. 3-4, 2003.
[11] C. Papadimitriou and J. N. Tsitsiklis, "The Complexity of Markov Decision Processes," Mathematics of Operations Research, vol. 12, no. 3, 1987.
[12] O. Madani, S. Hanks, and A. Condon, "On the Undecidability of Probabilistic Planning and Infinite-Horizon Partially Observable Markov Decision Problems," in Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI), 1999.
[13] J. Pineau, G. Gordon, and S. Thrun, "Point-based Value Iteration: An Anytime Algorithm for POMDPs," in International Joint Conference on Artificial Intelligence (IJCAI), 2003.
[14] N. Vlassis and M. T. J. Spaan, "A Fast Point-Based Algorithm for POMDPs," in Proceedings of the Annual Machine Learning Conference of Belgium and the Netherlands, Brussels, Belgium, 2004.
[15] D. P. Bertsekas, Dynamic Programming and Optimal Control, 2nd ed. Belmont, MA: Athena Scientific.
[16] A. Y. Ng and M. Jordan, "PEGASUS: A Policy Search Method for Large MDPs and POMDPs," in Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI), 2000.
[17] D. Aberdeen and J. Baxter, "Scalable Internal-State Policy-Gradient Methods for POMDPs," in Proceedings of the International Conference on Machine Learning (ICML), 2002, pp. 3-10.
[18] N. Roy, G. Gordon, and S. Thrun, "Finding Approximate POMDP Solutions Through Belief Compression," Journal of Artificial Intelligence Research, vol. 23, pp. 1-40, 2005.
[19] S. Thrun, "Monte Carlo POMDPs," in Advances in Neural Information Processing Systems (NIPS), S. Solla, T. Leen, and K.-R. Müller, Eds. MIT Press, 2000.
[20] R. E. Bellman, Dynamic Programming. Princeton University Press, 1957.
[21] M. Hauskrecht, "Value Function Approximations for Partially Observable Markov Decision Processes," Journal of Artificial Intelligence Research, vol. 13, 2000.
[22] J. M. Porta, M. T. J. Spaan, and N. Vlassis, "Value Iteration for Continuous-State POMDPs," IAS Technical Report, University of Amsterdam, Tech. Rep. IAS-UVA-04-04, 2004.
[23] J. J. Leonard and H. F. Durrant-Whyte, "Mobile Robot Localization by Tracking Geometric Beacons," IEEE Transactions on Robotics and Automation, vol. 7, no. 3, 1991.
[24] P. Jensfelt and S. Kristensen, "Active Global Localization for a Mobile Robot Using Multiple Hypothesis Tracking," IEEE Transactions on Robotics and Automation, vol. 17, no. 5, 2001.
[25] S. J. Julier and J. K. Uhlmann, "A New Extension of the Kalman Filter to Nonlinear Systems," in Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls, 1997.
[26] M. T. J. Spaan and N. Vlassis, "Perseus: Randomized Point-based Value Iteration for POMDPs," Journal of Artificial Intelligence Research, 2005.
[27] J. Goldberger and S. Roweis, "Hierarchical Clustering of a Mixture Model," in Advances in Neural Information Processing Systems (NIPS), 2005.
[28] J. Hoey and P. Poupart, "Solving POMDPs with Continuous or Large Discrete Observation Spaces," in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2005.
