Value Iteration for Continuous-State POMDPs


Universiteit van Amsterdam
IAS technical report IAS-UVA

Josep M. Porta, Matthijs T.J. Spaan, and Nikos Vlassis

Institut de Robòtica i Informàtica Industrial (UPC-CSIC), Llorens i Artigas 4-6, 08028, Barcelona, Spain, porta@iri.upc.edu
Informatics Institute, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands, {mtjspaan,vlassis}@science.uva.nl

We present a value iteration algorithm for learning to act in Partially Observable Markov Decision Processes (POMDPs) with continuous state spaces. Mainstream POMDP research focuses on the discrete case and this complicates its application to, e.g., robotic problems that are naturally modeled using continuous state spaces. The main difficulty in defining a (belief-based) POMDP in a continuous state space is that expected values over states must be defined using integrals that, in general, cannot be computed in closed form. In this report, we provide three main contributions to the literature on continuous-state POMDPs. First, we show that the optimal finite-horizon value function over the continuous infinite-dimensional POMDP belief space is piecewise linear and convex, and is defined by a finite set of supporting α-functions that are analogous to the α-vectors (hyperplanes) defining the value function of a discrete-state POMDP. Second, we show that, for a fairly general class of POMDP models in which all functions of interest are modeled by Gaussian mixtures, all belief updates and value iteration backups can be carried out analytically and exactly. Contrary to the discrete case, in a continuous-state POMDP the α-functions may grow in size (e.g., in the number of Gaussian components) in each value iteration. Third, we show how the recent point-based value iteration algorithms for discrete POMDPs can be extended to the continuous case, allowing for efficient planning in practical problems. In particular, we demonstrate Perseus, our previously proposed randomized point-based value iteration algorithm, in a simple robot planning problem in a continuous domain, where encouraging results are observed.

IAS: intelligent autonomous systems

Contents

1 Introduction
2 Preliminaries: POMDPs
3 POMDPs in Continuous State Spaces
4 Gaussian-based POMDPs
4.1 Models for Gaussian-based POMDPs
4.2 Belief Updates for Gaussian-based POMDPs
4.3 Backup Operator for Gaussian-based POMDPs
5 C-Perseus: A Point-Based Continuous-State POMDP Solver
6 Experiments and Results
7 Conclusions and Future Work

Intelligent Autonomous Systems, Informatics Institute, Faculty of Science, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands. Tel (fax): (7490)
Corresponding author: J.M. Porta, porta@science.uva.nl
Copyright IAS, 2004

1 Introduction

A popular formalism for decision making under uncertainty is the Markov Decision Process (MDP) [20]. In this paradigm, an agent interacts with a given system by executing actions, and these actions have the effect of changing the state of the system stochastically, as well as providing rewards/penalties to the agent. The objective of the learning agent is to identify the actions that produce the most reward in the long term for each state. When the selection of actions has to be made with uncertain information about the state of the system, the task is naturally formalized as a Partially Observable Markov Decision Process (POMDP) [23, 15, 6, 5, 12]. POMDPs have often been used as a framework for planning in robotics [22, 4, 25, 19].

In general, computing the exact solution of a POMDP is an intractable problem [17, 14], even for the discrete case (i.e., discrete sets of states, actions, and observations). Two main factors cause this high computational cost [18]. The first one is the curse of history: the number of action-observation sequences to be considered increases exponentially as we extend the planning horizon. Fortunately, the curse of history can be mitigated by limiting ourselves to approximate solutions. Recently developed point-based algorithms [18, 27] are a promising alternative in this line. The second factor that makes POMDP algorithms inefficient is the curse of dimensionality: the computational cost of discrete-state POMDP algorithms scales with the number of states. Therefore, the finer the granularity of the state space discretization, the higher the cost of solving the POMDP. One insight we can extract from this fact is that it would be desirable to avoid the discretization of the state space. Moreover, real-world problems are naturally formalized using continuous spaces. For instance, in a robot navigation problem, the state to be estimated is the pose of the robot that, for a robot moving on a planar surface, is naturally defined in the continuous space of the Cartesian coordinates of the robot and its orientation.

Linear POMDPs with continuous states and quadratic reward functions have a closed-form solution [3]. Existing algorithms for continuous-state POMDPs with general reward functions are based on policy search [16, 1] or approximate (grid-based) value iteration [21, 26]. For discrete-state POMDPs, recent promising algorithms are based on point-based value iteration [18, 27].

In this report, we present a novel approach to solve POMDPs in continuous state spaces via value iteration. The main difficulty of working in continuous state spaces is that expected values over states must be defined using integrals. These integrals cannot be computed in closed form for general functions and, therefore, only approximation techniques can be used [26]. In our approach, we restrict all functions defined on the state space to a particular, although highly expressive, family of functions: linear combinations of Gaussians. This allows us to evaluate all integrals involved in the value iteration POMDP formulation in closed form. Using this fact, we can adapt to the continuous case the rich machinery developed for discrete-state POMDP value iteration, in particular the point-based algorithms.

This report is organized as follows. First, in Section 2, we review the POMDP framework and the value iteration process for discrete-state POMDPs. In Section 3, we generalize the value function representation commonly used in discrete-state POMDPs to continuous-state ones. This allows us to do value iteration for the continuous case. In Section 4, we derive closed formulas for the elements involved in the value iteration framework introduced in Section 3, assuming a Gaussian-based representation for the beliefs and the models defining the POMDP. In Section 5, we use these closed formulas to define a point-based algorithm for Gaussian-based POMDPs. In Section 6, we present some results with the proposed algorithm and, in Section 7, we summarize our work and point to directions for further research.

2 Preliminaries: POMDPs

A POMDP models an agent interacting with a system using the following elements:

- A set of system states, $S$.
- A set of agent actions, $A$.
- A set of observations, $O$.
- An action (or transition) model defined by $p(s' \mid s, a)$, the probability that the system changes from state $s$ to $s'$ when the agent executes action $a$.
- An observation model defined by $p(o \mid s')$, the probability that the agent observes $o$ when the system reaches state $s'$.
- A reward function defined as $r_a(s) \in \mathbb{R}$, the reward obtained by the agent if it executes action $a$ when the system is in state $s$.

At a given moment, the system is in a state $s$ and the agent executes an action $a$. As a result, the agent receives a reward $r$, the system state changes to $s'$ and, then, the agent observes $o$. The knowledge of the agent about the system state is represented as a belief, i.e., a probability distribution over the state space. The initial belief is assumed to be known and, for a discrete set of states, if $b$ is the belief of the agent about the state, the belief after executing action $a$ and observing $o$ is

$$b^{a,o}(s') = \frac{p(o \mid s')}{p(o \mid a, b)} \sum_{s \in S} p(s' \mid s, a)\, b(s). \tag{1}$$

A function mapping beliefs to actions is called a policy. An optimal policy is one that, on average, generates as much reward as possible. The value function condenses the immediate and delayed reward that can be obtained from a given belief. This function is expressed in a recursive way

$$V_n(b) = \max_a Q_n(b, a), \tag{2}$$

with

$$Q_n(b, a) = \sum_{s \in S} r_a(s)\, b(s) + \gamma \sum_o p(o \mid b, a)\, V_{n-1}(b^{a,o}), \tag{3}$$

where $n$ is the planning horizon, $S$ and $O$ are assumed discrete and $\gamma \in [0, 1)$ is a discount factor that trades off the importance of the immediate and the delayed reward. The above recursion is usually written in functional form

$$V_n = H V_{n-1} \tag{4}$$

and it is known as the Bellman recursion [2]. This recursion converges to a fixed point $V^*$ that is the optimal value function. An optimal policy $\pi^*$ can be defined as

$$\pi^*(b) = \arg\max_a Q^*(b, a)$$

for $Q^*$ the Q-function associated with the optimal value function, $V^*$.

Value iteration for POMDPs [23, 12, 8] generates a sequence of functions $V_i$ using the recurrence in Eq. 4 that progressively approaches $V^*$ and computes an approximately optimal policy from the final $V_i$. At first sight the value function seems intractable, but it can be expressed in a simple form [23]

$$V_n(b) = \max_{\{\alpha_n^i\}_i} \sum_s \alpha_n^i(s)\, b(s),$$

with $\{\alpha_n^i\}_i$ a set of vectors. Using this formulation, value iteration algorithms typically focus on the computation of the $\alpha_n$-vectors.
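The discrete belief update of Eq. 1 and the evaluation of the value function from a set of α-vectors can be illustrated with a minimal NumPy sketch. The array layout and the names `belief_update`, `value`, `T_a`, and `Z_o` are assumptions made for this illustration; they do not come from the report.

```python
import numpy as np

def belief_update(b, T_a, Z_o):
    """Discrete belief update (Eq. 1).

    b   : (S,) current belief, b[s]
    T_a : (S, S) transition matrix of the executed action, T_a[s, s2] = p(s2 | s, a)
    Z_o : (S,) likelihood of the received observation, Z_o[s2] = p(o | s2)
    Returns the updated belief and the normaliser p(o | a, b).
    """
    predicted = b @ T_a              # sum_s p(s' | s, a) b(s)
    unnormalised = Z_o * predicted   # p(o | s') times the prediction
    p_o = unnormalised.sum()         # p(o | a, b)
    return unnormalised / p_o, p_o

def value(b, alphas):
    """V_n(b) = max_i sum_s alpha_i(s) b(s) for a list of alpha-vectors."""
    return max(float(alpha @ b) for alpha in alphas)

# Toy two-state example (numbers chosen only for illustration).
b = np.array([0.5, 0.5])
T_a = np.array([[0.9, 0.1],
                [0.2, 0.8]])
Z_o = np.array([0.7, 0.1])
b_next, p_o = belief_update(b, T_a, Z_o)
v = value(b, [np.array([1.0, 0.0]), np.array([0.0, 2.0])])
```

Each α-vector defines one hyperplane over the belief simplex, so `value` simply takes the upper surface of those hyperplanes at `b`.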

The initial value function approximation can be the value function with planning horizon 0 [23, 15, 6, 12, 5, 18] or can be defined as a single α-vector that lower bounds the value function for any possible planning horizon. This second strategy has been shown to be more efficient in many benchmark problems.

3 POMDPs in Continuous State Spaces

In the previous section, we assumed discrete sets of states, actions, and observations. In this section, we generalize POMDPs to continuous state spaces, while still assuming discrete action and observation spaces. With this formulation, we avoid the necessity of discretizing the state space and, thus, we reduce the chance of being affected by the curse of dimensionality.

In the discrete case, expectations for a given belief are computed by summing over the state space (see Eq. 1 and 3). The generalization to the continuous case amounts to computing these expected values by integrating instead of summing. Thus we have

$$b^{a,o}(s') = \frac{p(o \mid s')}{p(o \mid a, b)} \int_{s \in S} p(s' \mid s, a)\, b(s)\, ds, \tag{5}$$

and

$$Q_n(b, a) = \int_{s \in S} r_a(s)\, b(s)\, ds + \gamma \sum_o p(o \mid b, a)\, V_{n-1}(b^{a,o}), \tag{6}$$

where $r_a : S \to \mathbb{R}$ is a continuous reward function for action $a$.

With a continuous state space, the belief space is also continuous, as in the discrete case, but now with an infinite number of dimensions. However, there are several properties typical of value functions for discrete state spaces that still hold in the continuous case. Namely, we can prove that (1) the optimal finite-horizon value function is piecewise linear and convex (PWLC) in the belief space, (2) the value function recursion is isotonic, and (3) this recursion is also a contraction (and thus, the iterative computation of the value function for increasing horizons will converge to the optimal value function $V^*$). Next, we prove these three properties.

The PWLC property is fundamental since it allows the value function to be represented using a small set of supporting elements. This kind of representation is the key element to define the value iteration process. To prove this property, we first need to prove the following lemma.

Lemma 1 The value function in a continuous-state POMDP can be expressed as

$$V_n(b) = \max_{\{\alpha_n^i\}_i} \int_s \alpha_n^i(s)\, b(s)\, ds, \tag{7}$$

for appropriate α-functions $\alpha_n^i : S \to \mathbb{R}$.

Proof: The proof, as in the discrete case, is done via induction. For planning horizon 0, we only have to take into account the immediate reward and, thus, we have that

$$V_0(b) = \max_a \int_s r_a(s)\, b(s)\, ds,$$

and, therefore, if we define

$$\{\alpha_0^i(s)\}_i = \{r_a(s)\}_{a \in A},$$

we have that, as desired,

$$V_0(b) = \max_{\{\alpha_0^i\}_i} \int_s \alpha_0^i(s)\, b(s)\, ds.$$

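For the general case, we have that, using Eq. 2 and 6,

$$V_n(b) = \max_a \Big\{ \int_s r_a(s)\, b(s)\, ds + \gamma \sum_o p(o \mid a, b)\, V_{n-1}(b^{a,o}) \Big\},$$

and, by the induction hypothesis,

$$V_{n-1}(b^{a,o}) = \max_{\{\alpha_{n-1}^j\}_j} \int_{s'} \alpha_{n-1}^j(s')\, b^{a,o}(s')\, ds'.$$

From Eq. 5,

$$V_{n-1}(b^{a,o}) = \max_{\{\alpha_{n-1}^j\}_j} \int_{s'} \alpha_{n-1}^j(s')\, \frac{p(o \mid s')}{p(o \mid a, b)} \Big[ \int_s p(s' \mid s, a)\, b(s)\, ds \Big] ds' = \frac{1}{p(o \mid a, b)} \max_{\{\alpha_{n-1}^j\}_j} \int_{s'} \alpha_{n-1}^j(s')\, p(o \mid s') \Big[ \int_s p(s' \mid s, a)\, b(s)\, ds \Big] ds',$$

and, therefore,

$$V_n(b) = \max_a \Big\{ \int_s r_a(s)\, b(s)\, ds + \gamma \sum_o \max_{\{\alpha_{n-1}^j\}_j} \int_{s'} \alpha_{n-1}^j(s')\, p(o \mid s') \Big[ \int_s p(s' \mid s, a)\, b(s)\, ds \Big] ds' \Big\}$$

$$\phantom{V_n(b)} = \max_a \Big\{ \int_s r_a(s)\, b(s)\, ds + \gamma \sum_o \max_{\{\alpha_{n-1}^j\}_j} \int_s \Big[ \int_{s'} \alpha_{n-1}^j(s')\, p(o \mid s')\, p(s' \mid s, a)\, ds' \Big] b(s)\, ds \Big\}.$$

At this point, we define

$$\alpha_{a,o}^j(s) = \int_{s'} \alpha_{n-1}^j(s')\, p(o \mid s')\, p(s' \mid s, a)\, ds', \tag{8}$$

and we assume that there is a closed form to compute $\alpha_{a,o}^j$ from $\alpha_{n-1}^j$ for a given action $a$ and observation $o$. If $M = |\{\alpha_{n-1}^j\}|$ then, for a given action $a$ and observation $o$, we can generate at most $M$ $\alpha_{a,o}^j$-functions. Observe that the $\alpha_{a,o}^j$ functions are independent of the belief point $b$ for which we are computing $V_n$. With this, we have that

$$V_n(b) = \max_a \Big\{ \int_s r_a(s)\, b(s)\, ds + \gamma \sum_o \max_{\{\alpha_{a,o}^j\}_j} \int_s \alpha_{a,o}^j(s)\, b(s)\, ds \Big\},$$

and we define

$$\alpha_{a,o,b} = \arg\max_{\{\alpha_{a,o}^j\}_j} \int_s \alpha_{a,o}^j(s)\, b(s)\, ds. \tag{9}$$

Observe that, for a given $a$ and $o$, $\alpha_{a,o,b}$ is just one of the $M$ elements in the set $\{\alpha_{a,o}^j\}_j$. Using a reasoning parallel to that of the enumeration phase of the Monahan algorithm [15], we can have, at most, $|A|\, M^{|O|}$ different $\alpha_{a,o,b}$-functions. The finite cardinality of this set is a crucial point since it proves that we can represent $V_n(b)$ with a finite set of supporting α-functions, despite the infinite dimensionality of the belief space. Using the above, we can write

$$V_n(b) = \max_a \Big\{ \int_s r_a(s)\, b(s)\, ds + \gamma \sum_o \int_s \alpha_{a,o,b}(s)\, b(s)\, ds \Big\}$$

$$\phantom{V_n(b)} = \max_a \Big\{ \int_s r_a(s)\, b(s)\, ds + \gamma \int_s \Big[ \sum_o \alpha_{a,o,b}(s) \Big] b(s)\, ds \Big\}$$

$$\phantom{V_n(b)} = \max_a \int_s \Big[ r_a(s) + \gamma \sum_o \alpha_{a,o,b}(s) \Big] b(s)\, ds.$$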

If we define

$$\{\alpha_n^i(s)\}_i = \Big\{ r_a(s) + \gamma \sum_o \alpha_{a,o,b}(s) \Big\}_{a \in A}, \tag{10}$$

we have

$$V_n(b) = \max_{\{\alpha_n^i\}_i} \int_s \alpha_n^i(s)\, b(s)\, ds, \tag{11}$$

that is, $V_n$ in the desired form and, thus, the lemma holds.

Lemma 2 The value function is PWLC in the belief space.

Proof: It holds that

$$V_n(b) = \max_{\{\alpha_n^i\}_i} V_n^i(b),$$

with

$$V_n^i(b) = \int_s \alpha_n^i(s)\, b(s)\, ds.$$

For a particular $V_n^i$ it clearly holds that

$$V_n^i(\kappa\, b_1 + \lambda\, b_2) = \kappa\, V_n^i(b_1) + \lambda\, V_n^i(b_2),$$

for arbitrary $\kappa$ and $\lambda$. Therefore, each $V_n^i$ is a linear function in $b$. The piecewise linearity part of the property is given by the fact that the set $\{\alpha_n^i\}_i$ is of finite cardinality and, as shown above, $V_n^i$ is linear for each individual $\alpha_n^i$. Finally, the convexity is given by the fact that we take the maximum of convex (linear) functions when computing the value function and, thus, we obtain a convex function as a result.

Eq. 8 to 10 constitute the value iteration process for continuous-state POMDPs since they provide a constructive way to determine the elements (i.e., the α-functions) defining $V_n$ from those defining $V_{n-1}$.

Lemma 3 The mapping $H$ in the value function iteration for continuous state spaces is isotonic.

Proof: $H$ is said to be isotonic if

$$V \le U \Rightarrow HV \le HU.$$

The $H$ mapping can be seen as

$$HV(b) = \max_a H^a V(b),$$

with

$$H^a V(b) = \int_s r_a(s)\, b(s)\, ds + \gamma \sum_o p(o \mid a, b)\, V(b^{a,o}).$$

Let us denote as $a_1$ the action that maximizes $HV$ at point $b$ and as $a_2$ the action that does so for $HU$:

$$HV(b) = H^{a_1} V(b), \qquad HU(b) = H^{a_2} U(b).$$

By definition, the value for action $a_1$ for $HU$ at $b$ is lower than (or equal to) that for $a_2$, that is

$$H^{a_1} U(b) \le H^{a_2} U(b).$$

From a given $b$ we can compute $b^{a_1,o}$, for an arbitrary $o$ and, then, the following holds:

$$V \le U \;\Rightarrow\; \forall b, o:\; V(b^{a_1,o}) \le U(b^{a_1,o})$$
$$\Rightarrow\; \gamma \sum_o p(o \mid a_1, b)\, V(b^{a_1,o}) \le \gamma \sum_o p(o \mid a_1, b)\, U(b^{a_1,o})$$
$$\Rightarrow\; H^{a_1} V(b) \le H^{a_1} U(b)$$
$$\Rightarrow\; H^{a_1} V(b) \le H^{a_2} U(b)$$
$$\Rightarrow\; HV(b) \le HU(b)$$
$$\Rightarrow\; HV \le HU.$$

Since $b$ and, from it, $b^{a_1,o}$ can be chosen arbitrarily, the mapping $H$ is isotonic.

Lemma 4 The mapping $H$ in the value function iteration for continuous state spaces is a contraction.

Proof: The mapping $H$ on the value function is said to be a contraction if

$$\|HV - HU\| \le \beta\, \|V - U\|,$$

with $0 \le \beta < 1$ and $\|\cdot\|$ the supremum norm. Assume that $|HV - HU|$ is maximum at point $b$, that $a_1$ is the optimal action for $HV$ at $b$, and that $a_2$ is the optimal one for $HU$:

$$HV(b) = H^{a_1} V(b), \qquad HU(b) = H^{a_2} U(b).$$

Then it holds that

$$|HV(b) - HU(b)| = |H^{a_1} V(b) - H^{a_2} U(b)|,$$

and we assume, without loss of generality, that $HV(b) \le HU(b)$. Since $a_1$ is the action that maximizes $HV$ at $b$, we have that

$$H^{a_2} V(b) \le H^{a_1} V(b) \le H^{a_2} U(b).$$

Therefore, we have that

$$\|HV - HU\| = |HV(b) - HU(b)| = |H^{a_1} V(b) - H^{a_2} U(b)| \le |H^{a_2} V(b) - H^{a_2} U(b)|$$
$$= \gamma\, \Big| \sum_o p(o \mid a_2, b)\, \big[ V(b^{a_2,o}) - U(b^{a_2,o}) \big] \Big| \le \gamma \sum_o p(o \mid a_2, b)\, \|V - U\| = \gamma\, \|V - U\|.$$

Since $\gamma$ is in $[0, 1)$, the lemma holds.

Since the value iteration mapping for the continuous state space is a contraction, it can be proved that exact value iteration converges in norm to the unique fixed point of $V = HV$, which is the optimal value function $V^*$, and that, from the approximation to $V^*$, we can derive (near) optimal policies (see the theorems in [20], including Theorem 6.2.3).

4 Gaussian-based POMDPs

In previous sections, we left as an open point how to actually compute the belief update (Eq. 5), the steps in the value iteration process (Eq. 8 to 10), and the value for a given belief point (Eq. 11). In this section, we show how these computations are possible assuming that the beliefs as well as the observation, action, and reward models are represented as linear combinations of Gaussians. We first formally introduce our assumptions on the models (Section 4.1) and then we define the belief update (Section 4.2) and the basic value iteration step (Section 4.3) for Gaussian-based POMDPs. Note that other families of integrable functions could be used to determine the α-functions in closed form, but Gaussian-based models provide a high degree of flexibility and are of common use in many applications, including robotics [13, 10].

4.1 Models for Gaussian-based POMDPs

We will assume that belief points are represented as Gaussian mixtures

$$b(s) = \sum_j w_j\, \phi(s \mid s_j, \Sigma_j), \tag{12}$$

with $\phi$ a Gaussian with mean $s_j$ and covariance matrix $\Sigma_j$ and where the mixing weights satisfy $w_j > 0$, $\sum_j w_j = 1$. In the extreme case, Gaussian mixtures with an infinite number of components would be necessary to represent a given point in the continuous, infinite-dimensional belief space. However, only Gaussian mixtures with few components are needed in practical situations.

We assume that our observation model is defined non-parametrically from a set of samples $T = \{(s_i, o_i) \mid i \in [1, N]\}$ with $o_i$ an observation obtained at state $s_i$. Using these samples, the observation model can be defined as

$$p(o \mid s) = \frac{p(s \mid o)\, p(o)}{p(s)},$$

and, assuming a uniform $p(s)$ in the space covered by $T$, and approximating $p(o)$ from the samples in the training set, we have

$$p(o \mid s) \approx \Big[ \frac{1}{N_o} \sum_{i=1}^{N_o} \lambda_i^o\, \phi(s \mid s_i^o, \Sigma_i^o) \Big] \frac{N_o}{N} = \sum_{i=1}^{N_o} w_i^o\, \phi(s \mid s_i^o, \Sigma_i^o),$$

with $s_i^o$ one of the $N_o$ points in $T$ with $o$ as an associated observation and where $w_i^o = \lambda_i^o / N$ and $\Sigma_i^o$ are, respectively, a weighting factor and a covariance matrix associated with that training point. The sets $\{\lambda_i^o\}_i$ and $\{\Sigma_i^o\}_i$ should be defined so that

$$p(s) = \sum_o p(s \mid o)\, p(o) = \sum_o \sum_{i=1}^{N_o} w_i^o\, \phi(s \mid s_i^o, \Sigma_i^o)$$

is (approximately) uniform in the area covered by $T$ or, in other words, so that

$$\sum_o p(o \mid s) \approx 1.$$

As far as the action model is concerned, we assume it is linear-Gaussian

$$p(s' \mid s, a) = \phi(s' \mid s + \Delta(a), \Sigma^a). \tag{13}$$

Here $\phi$ is a Gaussian centered at $s + \Delta(a)$ with covariance $\Sigma^a$. Non-linear action models can be approximated as is done, for instance, in the extended Kalman filter or in the unscented Kalman filter [11]. The function $\Delta : A \to S$ implements the transition model of the system.

Finally, the reward can be seen as an observation with an associated scalar value. Therefore, assuming a finite set of possible rewards $R = \{r_i \mid i \in [1, M]\}$, the reward model $p(r \mid s, a)$ for each particular $a$ can be represented in the same way as the observation model

$$p(r \mid s, a) \approx \sum_{i=1}^{M_r} w_i^r\, \phi(s \mid s_i^r, \Sigma_i^r).$$

With that, we have that

$$r_a(s) = \sum_{r \in R} r\, p(r \mid s, a) \approx \sum_{r \in R} r \sum_{i=1}^{M_r} w_i^r\, \phi(s \mid s_i^r, \Sigma_i^r),$$

that is an unnormalized Gaussian mixture.

4.2 Belief Updates for Gaussian-based POMDPs

The belief update of Eq. 5 can be implemented in our model taking into account that it consists of two steps. The first one is the application of the action model on the current belief state. This can be computed as the convolution of the Gaussians representing $b(s)$ (Eq. 12) with the Gaussian representing the action model (Eq. 13). This convolution results in

$$\int_s p(s' \mid s, a)\, b(s)\, ds = \sum_j w_j\, \phi(s' \mid s_j + \Delta(a), \Sigma_j + \Sigma^a).$$

In the second step of the belief update, the prediction obtained with the action model is corrected using the information provided by the observation model

$$b^{a,o}(s') \propto \Big[ \sum_i w_i^o\, \phi(s' \mid s_i^o, \Sigma_i^o) \Big] \Big[ \sum_j w_j\, \phi(s' \mid s_j + \Delta(a), \Sigma_j + \Sigma^a) \Big] = \sum_{i,j} w_i^o\, w_j\, \phi(s' \mid s_i^o, \Sigma_i^o)\, \phi(s' \mid s_j + \Delta(a), \Sigma_j + \Sigma^a).$$

The product of two Gaussian functions is a scaled Gaussian. Therefore, we have that

$$b^{a,o}(s') \propto \sum_{i,j} w_i^o\, w_j\, \delta_{i,j}^{a,o}\, \phi(s' \mid s_{i,j}^{a,o}, \Sigma_{i,j}^{a,o}),$$

with

$$\delta_{i,j}^{a,o} = \phi(s_j + \Delta(a) \mid s_i^o, \Sigma_i^o + \Sigma_j + \Sigma^a),$$
$$\Sigma_{i,j}^{a,o} = \big( (\Sigma_i^o)^{-1} + (\Sigma_j + \Sigma^a)^{-1} \big)^{-1},$$
$$s_{i,j}^{a,o} = \Sigma_{i,j}^{a,o} \big( (\Sigma_i^o)^{-1} s_i^o + (\Sigma_j + \Sigma^a)^{-1} (s_j + \Delta(a)) \big).$$

Finally, we can re-arrange the terms to get

$$b^{a,o}(s') \propto \sum_k w_k\, \phi(s' \mid s_k, \Sigma_k).$$

The proportionality in the definition of $b^{a,o}(s')$ implies that the weights ($w_k$, $\forall k$) should be scaled to sum to one.
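The two-step belief update can be sketched for a one-dimensional state space, where covariances reduce to scalar variances. This is only an illustrative sketch: the tuple-based mixture representation `(weight, mean, variance)` and the names `gauss` and `belief_update` are assumptions of this example, not part of the report.

```python
import numpy as np

def gauss(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def belief_update(belief, obs_model, delta_a, var_a):
    """1-D Gaussian-mixture belief update.

    belief    : list of (w_j, s_j, var_j), the components of b
    obs_model : list of (w_i, s_i, var_i), the components of p(o | s') for the received o
    delta_a   : shift Delta(a) of the linear-Gaussian action model
    var_a     : variance Sigma^a of the action model
    """
    updated = []
    for w_j, s_j, var_j in belief:
        # Prediction step: convolution with the action model.
        m_pred, v_pred = s_j + delta_a, var_j + var_a
        for w_i, s_i, var_i in obs_model:
            # Correction step: the product of two Gaussians is a scaled Gaussian.
            scale = gauss(m_pred, s_i, var_i + v_pred)
            var_ij = 1.0 / (1.0 / var_i + 1.0 / v_pred)
            mean_ij = var_ij * (s_i / var_i + m_pred / v_pred)
            updated.append((w_i * w_j * scale, mean_ij, var_ij))
    # Rescale the weights so that they sum to one.
    total = sum(w for w, _, _ in updated)
    return [(w / total, m, v) for w, m, v in updated]

# Illustrative numbers only.
belief = [(0.5, -1.0, 1.0), (0.5, 2.0, 0.5)]
obs_model = [(0.3, 0.0, 0.4), (0.2, 3.0, 0.4)]
new_belief = belief_update(belief, obs_model, delta_a=1.0, var_a=0.05)
```

Note that the number of components in the result is the product of the number of components in the belief and in the observation model, which is why a compression step is needed when the update is applied repeatedly.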

4.3 Backup Operator for Gaussian-based POMDPs

The computation of the mapping $H$ (Eq. 4) for a given belief point $b$ is called a backup. This mapping determines the α-function (or α-vector in the discrete case) to be included in $V_n$ for a belief point under consideration (see Eq. 8 to 10). A full backup, i.e., a backup for the whole belief space, involves the computation of all relevant α-functions for $V_n$. Full backups are computationally expensive (in the discrete case they involve the use of linear programming in order to determine a sufficient set of points on which to backup), but the backup for a single belief point is relatively cheap. This is exploited by the point-based POMDP algorithms to efficiently approximate $V_n$ on a fixed set of belief points [18, 27]. Next, we describe the backup operator on a continuous state space that we will use later in the Perseus algorithm.

The backup for a given belief point $b$ is

$$\mathrm{backup}(b) = \arg\max_{\{\alpha_n^i\}_i} \int_s \alpha_n^i(s)\, b(s)\, ds,$$

where $\alpha_n^i(s)$ is defined in Eq. 9 and 10 from the $\alpha_{a,o}$-functions (Eq. 8).

Lemma 5 The functions $\alpha_n^i(s)$ can be expressed as linear combinations of Gaussians, assuming the sensor, action and reward models are also Gaussian-based.

Proof: This lemma can be proved via induction. For $n = 0$, $\alpha_0^i(s) = r_a(s)$ for a fixed $a$ and thus it is indeed an unnormalized Gaussian mixture. For $n > 0$, we assume that

$$\alpha_{n-1}^j(s') = \sum_k w_k^j\, \phi(s' \mid s_k^j, \Sigma_k^j).$$

Then, with our particular models, $\alpha_{a,o}^j(s)$ in Eq. 8 is the integral of three linear combinations of Gaussians

$$\alpha_{a,o}^j(s) = \int_{s'} \Big[ \sum_k w_k^j\, \phi(s' \mid s_k^j, \Sigma_k^j) \Big] \Big[ \sum_l w_l^o\, \phi(s' \mid s_l^o, \Sigma_l^o) \Big] \phi(s' \mid s + \Delta(a), \Sigma^a)\, ds'$$
$$= \int_{s'} \sum_{k,l} w_k^j\, w_l^o\, \phi(s' \mid s_k^j, \Sigma_k^j)\, \phi(s' \mid s_l^o, \Sigma_l^o)\, \phi(s' \mid s + \Delta(a), \Sigma^a)\, ds'$$
$$= \sum_{k,l} w_k^j\, w_l^o \int_{s'} \phi(s' \mid s_k^j, \Sigma_k^j)\, \phi(s' \mid s_l^o, \Sigma_l^o)\, \phi(s' \mid s + \Delta(a), \Sigma^a)\, ds'.$$

In this case, we have to perform the product of two Gaussians twice, once for $\phi(s' \mid s_k^j, \Sigma_k^j)$ and $\phi(s' \mid s_l^o, \Sigma_l^o)$ to get $\delta_{k,l}^{j,o}\, \phi(s' \mid s_1, \Sigma_1)$, and once more for $\delta_{k,l}^{j,o}\, \phi(s' \mid s_1, \Sigma_1)$ and $\phi(s' \mid s + \Delta(a), \Sigma^a)$ to get $\delta_{k,l}^{j,o}\, \beta_{k,l}^{j,o,a}(s)\, \phi(s' \mid \bar{s}, \bar{\Sigma})$. The terms $\delta_{k,l}^{j,o}$ and $\beta_{k,l}^{j,o,a}(s)$ can be expressed as

$$\delta_{k,l}^{j,o} = \phi(s_l^o \mid s_k^j, \Sigma_k^j + \Sigma_l^o),$$
$$\beta_{k,l}^{j,o,a}(s) = \phi(s \mid s_{k,l}^{j,o} - \Delta(a), \Sigma_{k,l}^{j,o} + \Sigma^a),$$

with

$$\Sigma_{k,l}^{j,o} = \big[ (\Sigma_k^j)^{-1} + (\Sigma_l^o)^{-1} \big]^{-1},$$
$$s_{k,l}^{j,o} = \Sigma_{k,l}^{j,o} \big[ (\Sigma_k^j)^{-1} s_k^j + (\Sigma_l^o)^{-1} s_l^o \big].$$
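The role of the $\delta$ and $\beta$ terms can be checked numerically in one dimension. Since the remaining Gaussian in $s'$ integrates to one, each pair $(k, l)$ contributes a single Gaussian in $s$ with weight $w_k^j w_l^o \delta_{k,l}^{j,o}$. The sketch below builds $\alpha_{a,o}^j$ this way and compares it with a brute-force evaluation of the integral in Eq. 8. The tuple representation `(weight, mean, variance)` and the function names are assumptions of this illustration.

```python
import numpy as np

def gauss(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def evaluate(mixture, x):
    """Evaluate a (possibly unnormalised) Gaussian mixture at x."""
    return sum(w * gauss(x, m, v) for w, m, v in mixture)

def alpha_ao(alpha_prev, obs_model, delta_a, var_a):
    """Closed-form alpha_{a,o} (Eq. 8) as a list of (weight, mean, variance) over s."""
    out = []
    for w_k, s_k, var_k in alpha_prev:
        for w_l, s_l, var_l in obs_model:
            # delta: scale factor of the product of the two Gaussians in s'.
            delta = gauss(s_l, s_k, var_k + var_l)
            var_kl = 1.0 / (1.0 / var_k + 1.0 / var_l)
            mean_kl = var_kl * (s_k / var_k + s_l / var_l)
            # beta(s) = phi(s | mean_kl - Delta(a), var_kl + var_a)
            out.append((w_k * w_l * delta, mean_kl - delta_a, var_kl + var_a))
    return out

# Illustrative numbers; alpha-functions are unnormalised, so weights may be negative.
alpha_prev = [(-2.0, 0.0, 1.0), (5.0, 3.0, 0.5)]
obs_model = [(0.4, 1.0, 0.8), (0.1, 4.0, 0.8)]
a_ao = alpha_ao(alpha_prev, obs_model, delta_a=2.0, var_a=0.05)

# Brute-force check of Eq. 8 at one state s, integrating over s' on a fine grid.
s = 0.7
grid = np.linspace(-10.0, 15.0, 50001)
dx = grid[1] - grid[0]
integrand = evaluate(alpha_prev, grid) * evaluate(obs_model, grid) * gauss(grid, s + 2.0, 0.05)
numeric = integrand.sum() * dx
closed_form = evaluate(a_ao, s)
```

In this sketch `numeric` and `closed_form` should agree up to the discretization error of the grid.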

With this, we have

$$\alpha_{a,o}^j(s) = \sum_{k,l} w_k^j\, w_l^o \int_{s'} \delta_{k,l}^{j,o}\, \beta_{k,l}^{j,o,a}(s)\, \phi(s' \mid \bar{s}, \bar{\Sigma})\, ds' = \sum_{k,l} w_k^j\, w_l^o\, \delta_{k,l}^{j,o}\, \beta_{k,l}^{j,o,a}(s) \int_{s'} \phi(s' \mid \bar{s}, \bar{\Sigma})\, ds' = \sum_{k,l} w_k^j\, w_l^o\, \delta_{k,l}^{j,o}\, \beta_{k,l}^{j,o,a}(s).$$

Once we have the $\alpha_{a,o}^j$-functions, we can compute the $\alpha_n^i$-functions. To do that, we need to determine the $\alpha_{a,o}^j$ for which $\int_s \alpha_{a,o}^j(s)\, b(s)\, ds$ is maximized. Since the integral of the product of two Gaussian mixtures (in particular an α-function and a belief point) is a rather common operation in the continuous-state POMDP framework, we will denote it by

$$\langle \alpha, b \rangle = \int_s \alpha(s)\, b(s)\, ds.$$

This operator can be computed as

$$\langle \alpha, b \rangle = \int_s \Big[ \sum_k w_k\, \phi(s \mid s_k, \Sigma_k) \Big] \Big[ \sum_l w_l\, \phi(s \mid s_l, \Sigma_l) \Big] ds = \sum_{k,l} w_k\, w_l \int_s \phi(s \mid s_k, \Sigma_k)\, \phi(s \mid s_l, \Sigma_l)\, ds$$
$$= \sum_{k,l} w_k\, w_l\, \phi(s_l \mid s_k, \Sigma_k + \Sigma_l) \int_s \phi(s \mid s_{k,l}, \Sigma_{k,l})\, ds = \sum_{k,l} w_k\, w_l\, \phi(s_l \mid s_k, \Sigma_k + \Sigma_l).$$

Using this operator and Eq. 9 and 10, we define

$$\{\alpha_n^i(s)\}_i = \Big\{ r_a(s) + \gamma \sum_o \arg\max_{\{\alpha_{a,o}^j\}_j} \langle \alpha_{a,o}^j, b \rangle \Big\}_{a \in A}.$$

Since all elements involved in the definition are linear combinations of Gaussians, so is the final result.

Using the above lemma, the backup function is

$$\mathrm{backup}(b) = \arg\max_{\{\alpha_n^i\}_i} \langle \alpha_n^i, b \rangle,$$

and the value of $V_n$ at $b$ (Eq. 11) is simply

$$V_n(b) = \langle \mathrm{backup}(b), b \rangle.$$

5 C-Perseus: A Point-Based Continuous-State POMDP Solver

In this section, we use the backup operator to define a point-based approximate continuous-state POMDP solver. In particular, we show how to extend to the continuous case the point-based value iteration algorithm Perseus [27, 24], which has been shown to be very efficient for discrete-state POMDPs. The continuous-state Perseus algorithm is shown in Table 1.

Perseus
Input: A continuous-state POMDP.
Output: $V_n$, an approximation to the optimal value function, $V^*$.
1: Initialize
2:   $B \leftarrow$ A set of randomly sampled belief points.
3:   $\alpha \leftarrow \frac{\min\{R\}}{1 - \gamma}\, \phi(s \mid 0, \Sigma)$
4:   $n \leftarrow 0$
5:   $V_n \leftarrow \{\alpha\}$
6: do
7:   $\forall b \in B$,
8:     $\mathrm{Function}_n(b) \leftarrow \arg\max_{\alpha \in V_n} \langle \alpha, b \rangle$
9:     $\mathrm{Value}_n(b) \leftarrow \langle \mathrm{Function}_n(b), b \rangle$
10:   $V_{n+1} \leftarrow \emptyset$
11:   $\tilde{B} \leftarrow B$
12:   do
13:     $b \leftarrow$ Point sampled randomly from $\tilde{B}$.
14:     $\alpha \leftarrow \mathrm{backup}(b)$
15:     if $\langle \alpha, b \rangle < \mathrm{Value}_n(b)$
16:       $\alpha \leftarrow \mathrm{Function}_n(b)$
17:       $\tilde{B} \leftarrow \tilde{B} \setminus \{b' \in \tilde{B} \mid \mathrm{Function}_n(b') = \alpha\}$
18:     else
19:       $\tilde{B} \leftarrow \tilde{B} \setminus \{b' \in \tilde{B} \mid \langle \alpha, b' \rangle \ge \mathrm{Value}_n(b')\}$
20:     endif
21:     $V_{n+1} \leftarrow V_{n+1} \cup \{\alpha\}$
22:   until $\tilde{B} = \emptyset$
23:   $n \leftarrow n + 1$
24: until convergence

Table 1: The Perseus algorithm. The backup function is described in Section 4.3.

Point-based POMDP algorithms focus on identifying the α-functions (α-vectors in the discrete case) for the belief points where the agent is more likely to be. The α-functions for this restricted set of belief points generalize over the whole belief space and, thus, they can be used to approximate the value function for any belief point. The result is an approximation of the value function with less error in regions of the belief space where decisions are more likely to be taken.

The value update scheme of Perseus implements a randomized approximate value function recursion $V_n = H V_{n-1}$ for a set of randomly sampled belief points $B$. First (Table 1, line 2), we let the agent randomly explore the environment and collect a set $B$ of reachable belief points. Next (Table 1, lines 3-5), we initialize the value function $V_0$ as a single weighted Gaussian with large covariance and with weight $\min\{R\}/(1 - \gamma)$, with $R$ the set of possible rewards.

Starting with $V_0$, Perseus performs a number of approximate value function update stages. The definition of the value update process can be seen in the main loop of Table 1, where $\tilde{B}$ is a set of non-improved points: points for which $V_{n+1}(b)$ is still lower than $V_n(b)$. At the start of each update stage, $V_{n+1}$ is set to $\emptyset$ and $\tilde{B}$ is initialized to $B$.
As long as $\tilde{B}$ is not empty, we sample a point $b$ from $\tilde{B}$ and compute the new α-function associated with this point using the backup operator (see Section 4.3). If this α-function improves the value of $b$ (i.e., if $\langle \alpha, b \rangle \ge V_n(b)$), we add $\alpha$ to $V_{n+1}$. The hope is that $\alpha$ improves the value of many other points, and all these points are removed from $\tilde{B}$. Often, a small number of α-functions will be sufficient to improve $V_n(b)$ $\forall b \in B$, especially in the first steps of value iteration. As long as $\tilde{B}$ is not empty we continue sampling belief points from it and trying to add their α-functions to $V_{n+1}$.

If the $\alpha$ computed by the backup operator does not improve at least the value of $b$ (i.e., $\langle \alpha, b \rangle < V_n(b)$, see the if branch in Table 1), we ignore $\alpha$ and insert a copy of the maximizing function of $b$ from $V_n$ in $V_{n+1}$. Point $b$ is now considered improved and is removed from $\tilde{B}$, together with any other belief points that had the same function as maximizing one in $V_n$. This procedure ensures that $\tilde{B}$ shrinks at each iteration and that the value update stage terminates.

Perseus stops when a given convergence criterion holds. This criterion can be based on the stability of the value function, on the stability of the associated policy, or simply on a maximum number of iterations.

Gaussian Mixture Condensation($f$, $m$)
Input: A Gaussian mixture $f = \sum_{i=1}^{k} w_i\, f_i(x \mid \mu_i, \Sigma_i)$. The maximum number of components in the output mixture, $m$, $m < k$.
Output: A Gaussian mixture $g = \sum_{i=1}^{m} w'_i\, g_i(x \mid \mu'_i, \Sigma'_i)$ that locally minimizes $\sum_{i=1}^{k} w_i \min_{j \in [1,m]} KL(f_i \| g_j)$.
1: Initialize
2: for $j = 1$ to $m$
3:   $w'_j \leftarrow w_j$
4:   $\mu'_j \leftarrow \mu_j$
5:   $\Sigma'_j \leftarrow \Sigma_j$
6: $d \leftarrow \sum_{i=1}^{k} w_i \min_{j \in [1,m]} KL(f_i \| g_j)$
7: do
8:   Compute the mapping from $f$ to $g$
9:   for $i = 1$ to $k$
10:     $\pi(i) \leftarrow \arg\min_{j \in [1,m],\, w'_j > 0} KL(f_i \| g_j)$
11:   Define a new $g$
12:   for $j = 1$ to $m$
13:     $I_j \leftarrow \{i \mid \pi(i) = j,\ i \in [1, k]\}$
14:     $w'_j \leftarrow \sum_{i \in I_j} w_i$
15:     $\mu'_j \leftarrow \frac{1}{w'_j} \sum_{i \in I_j} w_i\, \mu_i$
16:     $\Sigma'_j \leftarrow \frac{1}{w'_j} \sum_{i \in I_j} w_i \big( \Sigma_i + (\mu_i - \mu'_j)(\mu_i - \mu'_j)^\top \big)$
17:   $d' \leftarrow d$
18:   $d \leftarrow \sum_{i=1}^{k} w_i\, KL(f_i \| g_{\pi(i)})$
19: until $|d' - d| / d' < \epsilon$

Table 2: Gaussian mixture condensation algorithm. $\epsilon$ is a sufficiently small threshold.
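The condensation procedure of Table 2 can be sketched for one-dimensional mixtures, where the KL divergence between two Gaussians reduces to a scalar expression. This is a rough sketch under several assumptions that are not in the report: the function names, the 1-D restriction, the iteration cap `max_iter`, and the simplified stopping test on the relative change of the cost.

```python
import numpy as np

def kl_gauss(mu_f, var_f, mu_g, var_g):
    """KL(f || g) for 1-D Gaussians f = N(mu_f, var_f) and g = N(mu_g, var_g)."""
    return 0.5 * (np.log(var_g / var_f) + var_f / var_g
                  + (mu_f - mu_g) ** 2 / var_g - 1.0)

def condense(w, mu, var, m, eps=1e-6, max_iter=100):
    """Reduce a k-component 1-D Gaussian mixture to at most m components."""
    w, mu, var = (np.asarray(x, dtype=float) for x in (w, mu, var))
    # Initialise g with the first m components of f (lines 2-5 of Table 2).
    gw, gmu, gvar = w[:m].copy(), mu[:m].copy(), var[:m].copy()
    d = np.inf
    for _ in range(max_iter):
        # kl[i, j] = KL(f_i || g_j); components of g with zero weight are ignored.
        kl = kl_gauss(mu[:, None], var[:, None], gmu[None, :], gvar[None, :])
        kl[:, gw <= 0] = np.inf
        assign = kl.argmin(axis=1)
        # Refit every g_j by moment matching over the components mapped to it.
        for j in range(m):
            idx = assign == j
            gw[j] = w[idx].sum()
            if gw[j] > 0:
                gmu[j] = (w[idx] * mu[idx]).sum() / gw[j]
                gvar[j] = (w[idx] * (var[idx] + (mu[idx] - gmu[j]) ** 2)).sum() / gw[j]
        d_prev, d = d, (w * kl_gauss(mu, var, gmu[assign], gvar[assign])).sum()
        # Stop when the cost is zero or its relative change is small.
        if d == 0 or abs(d_prev - d) < eps * d:
            break
    keep = gw > 0
    return gw[keep], gmu[keep], gvar[keep]

# Four components forming two well-separated pairs, reduced to two components.
gw, gmu, gvar = condense([0.25, 0.25, 0.25, 0.25],
                         [0.0, 0.2, 5.0, 5.2],
                         [1.0, 1.0, 1.0, 1.0], m=2)
```

The moment-matching refit preserves the total weight and the mean and variance of each group of merged components, which is what keeps the compressed mixture close to the original one.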
One point that deserves special consideration when implementing the Perseus algorithm is the possible explosion of the number of components in the Gaussian mixtures defining the α-functions for increasing $n$, and of the number of components in the belief representation when the belief update (see Section 4.2) is repeated for many time steps. If $C_o$ is the average number of components in the observation model and $C_b$ is the average number of components in the belief, the number of components in the α-functions or in the beliefs after $n$ iterations scales with $O(C_o^n\, C_b^n)$. Since the larger the number of components, the slower the basic operations of the algorithm, an efficient implementation requires keeping the number of components reasonably bounded. To achieve this objective, we use the procedure described in [7] that transforms a given Gaussian mixture with $k$ components to another Gaussian mixture with at most $m$ components, $m < k$, while retaining the initial component structure. The algorithm is detailed in Table 2. The algorithm uses the Kullback-Leibler (KL) distance between two Gaussian distributions $f_i = N(\mu, \Sigma)$ and $g_j = N(\mu', \Sigma')$, that is

$$KL(f_i \| g_j) = \frac{1}{2} \Big( \log \frac{|\Sigma'|}{|\Sigma|} + \mathrm{Tr}\big( (\Sigma')^{-1} \Sigma \big) + (\mu - \mu')^\top (\Sigma')^{-1} (\mu - \mu') - c \Big),$$

with $c$ the dimensionality of the space where the Gaussians are defined.

Observe that the above procedure is defined for normalized Gaussian mixtures and our α-functions are unnormalized Gaussian mixtures. Therefore, for the α-function compression, we use a modified version of the procedure just described where the weights are normalized after taking their absolute value (so that relevant reward peaks, either negative or positive, are preserved). After the compression, the reverse procedure is used to recover weights in the original scale. In our implementation, we limit the number of components in the α-functions to those in the α-functions in $V_0$.

A similar explosion in the number of components occurs when computing the belief update detailed in Section 4.2. In this case, we use the same Gaussian mixture condensation algorithm so that the number of components never exceeds that of the initial belief.

6 Experiments and Results

To demonstrate the viability of our method we carried out an experiment in a robotic domain. In this problem (see Fig. 1-a), a robot is moving in a corridor with four doors. The robot can detect when it is in front of a door and when it is at the left or right end of the corridor. In any other situation, the robot just detects that it is in a corridor (see Fig. 1-b). The robot can move 2 units to the left or to the right (with $\Sigma^a = 0.05$) and can try to enter a door at any point (even when not in front of a door). The target for the robot is to locate the second door from the right and to enter it.
The robot only get poitive reward when it enter the target door (ee Fig. 1-c). When the robot trie to mover further than the end of the corridor (either at the right or at the left) or when it trie to enter the door at a wrong poition it get negative reward. The et of belief B ued in the C-Pereu algorithm contain 1000 unique belief point. Thoe belief point are collected uing random walk departing from a belief including 4 component that approximate a uniform ditribution on the whole corridor. The walk of the robot along the corridor are organized in epiode where the robot execute action until it trie to enter a door or until it execute 25 (movement) action. The experimental etup i completed by etting γ to 0.95, compreing belief o that they never contain more than 4 component (i.e., the number of component of the initial belief) and compreing α-function o that they never have more component than thoe ued to repreent the reward function (11 component). Fig. 2 how the average reult obtained after 10 run of the C-Pereu algorithm on thi problem. The firt plot (top-left) how that the value computed a b V (b) converge. The econd plot (top-right) how the expected dicounted reward averaged for 100 epiode with the policy available at the correponding time lice. The plot indicate that the robot uccefully learn to find out it poition and to ditinguih between the four door. Next plot (bottom-left) how the number of α-function ued to repreent the value function. We can ee that the number of α-function ued increae, but i far below 1000, the maximum poible

number of α-functions (in the extreme case we would use a different α-function for each point in B). Finally (Fig. 2, bottom-right), we show the number of changes in the policy from one time step to the next. The changes in the policy are computed as the number of elements in B with a different action from one time slice to the next. The number of policy changes drops close to zero, indicating convergence with respect to the particular B.

Figure 1: A pictorial representation of the test problem (a), the corresponding observation model (b) and the reward model (c). Legend of (b): left end, right end, corridor, door. Legend of (c): move left, move right, enter door.

Following the learned policy, the robot moves to one of the ends of the corridor to determine its position and then towards the correct door to enter it. Fig. 3 shows the evolution of the belief of the robot and the executed actions in each case, from the initial stages of the episode to the point at which the target door is reached. Finally, Fig. 4 plots the value for beliefs with only one component, parametrized by the average and the covariance of this component. We can see that, as the uncertainty about the position of the robot grows (i.e., as the covariance becomes larger), the value of the corresponding belief decreases. The colors in the figure correspond to the different actions: light gray for moving to the right, white for entering the door, and dark gray for moving to the left.

Observe that the advantage of using a continuous state space is that we obtain a scale-invariant solution. If we have to solve the same problem in a longer corridor, we can just scale

the Gaussians used in the problem definition and we will obtain the solution at the same cost as we have now. The only difference is that more actions would be needed in each episode to reach the correct door. When discretizing the environment, by contrast, the granularity has to be in accordance with the size of the actions taken by the robot (±2 left/right) and, thus, the number of states and, consequently, the cost of planning grow as the environment grows.

Figure 2: Top: Evolution of the value for all the beliefs in B and the average accumulated discounted reward for 100 episodes. Bottom: Number of vectors in $V_n$ and the number of policy changes. All plots are shown against time (s). Results are averaged over 10 repetitions and the bars represent the standard deviation.

7 Conclusions and Future Work

In this paper we have shown how to generalize value iteration to continuous-state POMDPs, in particular for the case of Gaussian-based beliefs and models. This allowed us to define an efficient point-based value iteration algorithm that seems appropriate for the planning problems often encountered in robotics.

An approach to continuous-state POMDPs that is closely related to ours is presented in [26]. In that work, a belief is represented by a set of weighted samples, which can be regarded as a degenerate version of our Gaussian mixture representation. Additionally, the value function is approximated by nearest-neighbor interpolation, whereas in our case the value function achieves generalization through a set of α-functions. Also, in the above work a real-time dynamic programming approach is used for updating the value function, with the Bellman backup operator being approximated by sampling from the belief transition model. In our case, value iteration is applied to a pre-collected set of beliefs, while the Bellman backup operator is computed analytically given the particular value function representation.
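The analytic computation mentioned above rests on a standard identity: the integral of the product of two Gaussians is a Gaussian density evaluated at one of the means, $\int N(s \mid a, A)\, N(s \mid b, B)\, ds = N(a \mid b, A + B)$. The following sketch is a minimal one-dimensional illustration, not the report's implementation. It uses the identity to evaluate the expectation of an α-function under a belief when both are Gaussian mixtures, and takes the value of the belief as the maximum over a set of α-functions. The tuple layout `(weight, mean, variance)`, the function names, and the example numbers are assumptions made for this example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def expectation(alpha, belief):
    """Closed-form integral of alpha(s) * belief(s) over s.

    Both arguments are lists of (weight, mean, variance) tuples. The
    alpha weights may be negative (unnormalized mixture); the belief
    weights are expected to be positive and sum to one.
    """
    total = 0.0
    for w_a, m_a, v_a in alpha:
        for w_b, m_b, v_b in belief:
            # Product-of-Gaussians identity: the integral is N(m_a | m_b, v_a + v_b).
            total += w_a * w_b * gaussian_pdf(m_a, m_b, v_a + v_b)
    return total


def value(alphas, belief):
    """Value of a belief as the maximum expectation over the alpha-functions."""
    return max(expectation(alpha, belief) for alpha in alphas)


# Hypothetical alpha-functions and belief, for illustration only.
alphas = [
    [(2.0, 1.0, 0.5), (-1.0, -3.0, 2.0)],
    [(0.5, 0.0, 4.0)],
]
belief = [(0.4, 0.0, 1.0), (0.6, 2.0, 0.3)]
print(value(alphas, belief))
```

Because each term is available in closed form, no sampling over the state space is needed in this sketch; the cost is proportional to the product of the numbers of components in the two mixtures.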
Although we have not directly compared our method to the method presented in [26], we expect our method to be faster (since it plans on a fixed set of belief points) and the value function to generalize better over the belief space (through the use of α-functions).

Ongoing work involves extending our framework to continuous action [24] and observation spaces [9], as well as defining approximate belief representations using Monte Carlo techniques [26].

Figure 3: Evolution of the belief when following the discovered policy. The arrows under the snapshots represent the actions (moving right, moving left, and entering the door). On the x-axis the four door locations are indicated.

Figure 4: Value function for single-component beliefs as a function of the average (µ) and the covariance (σ).

References

[1] D. Aberdeen and J. Baxter. Scalable Internal-State Policy-Gradient Methods for POMDPs. In Proceedings of the International Conference on Machine Learning (ICML), pages 3-10.
[2] R.E. Bellman. Dynamic Programming. Princeton University Press.
[3] D.P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 2nd edition.
[4] A.R. Cassandra, L.P. Kaelbling, and J.A. Kurien. Acting under Uncertainty: Discrete Bayesian Models for Mobile-Robot Navigation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[5] A.R. Cassandra, M.L. Littman, and N.L. Zhang. Incremental Pruning: A Simple, Fast, Exact Algorithm for Partially Observable Markov Decision Processes. In Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI 97).
[6] H.T. Cheng. Algorithms for Partially Observable Markov Decision Processes. PhD thesis, University of British Columbia.
[7] J. Goldberger and S. Roweis. Hierarchical Clustering of a Mixture Model. In Advances in Neural Information Processing Systems (NIPS).
[8] M. Hauskrecht. Value Function Approximations for Partially Observable Markov Decision Processes. Journal of Artificial Intelligence Research, 13:33-95.
[9] J. Hoey and P. Poupart. Solving POMDPs with Continuous or Large Discrete Observation Spaces. In Proceedings of the International Joint Conference on Artificial Intelligence, 2005.

[10] P. Jensfelt and S. Kristensen. Active Global Localization for a Mobile Robot Using Multiple Hypothesis Tracking. IEEE Transactions on Robotics and Automation, 17(5).
[11] J. Julier and J. K. Uhlmann. A New Extension of the Kalman Filter to Nonlinear Systems. In Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls.
[12] L.P. Kaelbling, M.L. Littman, and A.R. Cassandra. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 101(1-2):99-134.
[13] J.J. Leonard and H.F. Durrant-Whyte. Mobile Robot Localization by Tracking Geometric Beacons. IEEE Transactions on Robotics and Automation, 7(3).
[14] O. Madani, S. Hanks, and A. Condon. On the Undecidability of Probabilistic Planning and Infinite-Horizon Partially Observable Markov Decision Problems. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI).
[15] G.E. Monahan. A Survey of Partially Observable Markov Decision Processes: Theory, Models, and Algorithms. Management Science, 28(1):1-16.
[16] A.Y. Ng and M. Jordan. PEGASUS: A Policy Search Method for Large MDPs and POMDPs. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI).
[17] C. Papadimitriou and J.N. Tsitsiklis. The Complexity of Markov Decision Processes. Mathematics of Operations Research, 12(3).
[18] J. Pineau, G. Gordon, and S. Thrun. Point-based Value Iteration: An Anytime Algorithm for POMDPs. In International Joint Conference on Artificial Intelligence (IJCAI).
[19] J. Pineau, M. Montemerlo, M. Pollack, N. Roy, and S. Thrun. Towards Robotic Assistants in Nursing Homes: Challenges and Results. Robotics and Autonomous Systems, volume 42.
[20] M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Mathematical Statistics. John Wiley and Sons, Inc.
[21] N. Roy, G. Gordon, and S. Thrun. Finding Approximate POMDP Solutions Through Belief Compression. Journal of Artificial Intelligence Research, 23:1-40.
[22] R. Simmons and S. Koenig. Probabilistic Robot Navigation in Partially Observable Environments. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).
[23] E.J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University.
[24] M.T.J. Spaan and N. Vlassis. Perseus: Randomized Point-based Value Iteration for POMDPs. Journal of Artificial Intelligence Research.
[25] G. Theocharous and S. Mahadevan. Approximate Planning with Hierarchical Partially Observable Markov Decision Processes for Robot Navigation. In IEEE International Conference on Robotics and Automation.
[26] S. Thrun. Monte Carlo POMDPs. In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems (NIPS). MIT Press, 2000.

[27] N. Vlassis and M.T.J. Spaan. A Fast Point-Based Algorithm for POMDPs. In Proceedings of the Annual Machine Learning Conference of Belgium and the Netherlands, Brussels, Belgium, 2004.


Acknowledgements

We would like to thank J.J. Verbeek and W. Zajdel for their contributions to the work reported here. This work was developed while J.M. Porta was visiting the IAS Group of the University of Amsterdam. He has been partially supported by a Ramón y Cajal contract from the Spanish Ministry for Science and Technology. M.T.J. Spaan and N. Vlassis are supported by PROGRESS, the embedded systems research program of the Dutch organization for Scientific Research NWO, the Dutch Ministry of Economic Affairs and the Technology Foundation STW, project AES.

Authors are listed in alphabetical order.

IAS reports

This report is in the series of IAS technical reports. The series editor is Stephan ten Hagen (stephanh@science.uva.nl). Within this series the following titles appeared:

W. Zajdel, A.T. Cemgil and B.J.A. Kröse. A Hybrid Graphical Model for Online Multi-camera Tracking. Technical Report IAS-UVA-04-03, Informatics Institute, University of Amsterdam, The Netherlands, November.

Matthijs T.J. Spaan and Nikos Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Technical Report IAS-UVA-04-02, Informatics Institute, University of Amsterdam, The Netherlands, November.

Nikos Vlassis and Jakob J. Verbeek. Gaussian mixture learning from noisy data. Technical Report IAS-UVA-04-01, Informatics Institute, University of Amsterdam, The Netherlands, September.

Jelle R. Kok and Nikos Vlassis. The Pursuit Domain Package. Technical Report IAS-UVA-03-03, Informatics Institute, University of Amsterdam, The Netherlands, August.

Joris Portegies Zwart, Ben Kröse, and Sjoerd Gelsema. Aircraft Classification from Estimated Models of Radar Scattering. Technical Report IAS-UVA-03-02, Informatics Institute, University of Amsterdam, The Netherlands, January.

Joris Portegies Zwart, René van der Heiden, Sjoerd Gelsema, and Frans Groen. Fast Translation Invariant Classification of HRR Range Profiles in a Zero Phase Representation. Technical Report IAS-UVA-03-01, Informatics Institute, University of Amsterdam, The Netherlands, January.

All IAS technical reports are available for download at the IAS website, science.uva.nl/research/ias/publications/reports/.
