Solving Continuous-State POMDPs via Density Projection

Enlu Zhou, Member, IEEE, Michael C. Fu, Fellow, IEEE, and Steven I. Marcus, Fellow, IEEE

Abstract: Research on numerical solution methods for partially observable Markov decision processes (POMDPs) has primarily focused on finite-state models, and these algorithms do not generally extend to continuous-state POMDPs, due to the infinite dimensionality of the belief space. In this paper, we develop a computationally viable and theoretically sound method for solving continuous-state POMDPs by effectively reducing the dimensionality of the belief space via density projection. The density projection technique is also incorporated into particle filtering to provide a filtering scheme for online decision making. We provide an error bound between the value function induced by the policy obtained by our method and the true value function of the POMDP, and also an error bound between projection particle filtering and exact filtering. Finally, we illustrate the effectiveness of our method through an inventory control problem.

Index Terms: Partially observable Markov decision processes, particle filtering, decision making, density projection, belief state, value function.

I. INTRODUCTION

Partially observable Markov decision processes (POMDPs) model sequential decision making under uncertainty with partially observed state information. At each stage or period, an action is taken based on a partial observation of the current state along with the history of observations and actions, and the state transitions probabilistically. The objective is to minimize (or maximize) a cost (or reward) function, where costs (or rewards) are accrued in each stage. Clearly, POMDPs suffer from the same curse of dimensionality as fully observable MDPs, so efficient numerical solution of problems with large state spaces is a challenging research area.
A POMDP can be converted to a continuous-state Markov decision process (MDP) by introducing the notion of the belief state [6], which is the conditional distribution of the current state given the history of observations and actions. For a finite-state POMDP, the belief space is finite dimensional (i.e., a probability simplex), whereas for a continuous-state POMDP, the belief space is an infinite-dimensional space of continuous probability distributions. This difference suggests that simple generalizations of many of the finite-state algorithms to continuous-state models are not appropriate or applicable. For example, discretization of the continuous state space may result in a finite-state POMDP of dimension either too large to solve computationally or not sufficiently precise. As another example, many algorithms for solving finite-state POMDPs (see [7] for a survey) are based on discretization of the finite-dimensional probability simplex; however, it is usually not feasible to discretize an infinite-dimensional probability distribution space. Throughout the paper, when we use the word dimension or dimensional, we refer to the dimension of the belief space/state.

E. Zhou is with the Department of Industrial & Enterprise Systems Engineering, University of Illinois at Urbana-Champaign, IL, 61801 USA (enluzhou@illinois.edu). S. I. Marcus is with the Department of Electrical and Computer Engineering, and Institute for Systems Research, University of Maryland, College Park, MD, USA (marcus@umd.edu). M. C. Fu is with the Robert H. Smith School of Business, and Institute for Systems Research, University of Maryland, College Park, MD, USA (mfu@umd.edu). This work has been supported in part by the National Science Foundation under Grants DMI and DMI, and by the Air Force Office of Scientific Research under Grant FA.
Despite the abundance of algorithms for finite-state POMDPs, the aforementioned difficulty has motivated some researchers to look for efficient algorithms for continuous-state POMDPs [24] [25] [3] [28] [8] [9] [10]. Assuming discrete observation and action spaces, Porta et al. [24] showed that the optimal finite-horizon value function is defined by a finite set of α-functions, and modeled all functions of interest by Gaussian mixtures. In later work [25], they extended their result and method to continuous observation and action spaces using sampling strategies. However, the number of Gaussian mixtures in representing belief states and α-functions grows exponentially in value iteration as the number of iterations increases. Thrun [3] addressed continuous-state POMDPs using particle filtering to simulate the propagation of belief states and represent the belief states by a finite number of samples. The number of samples determines the dimension of the belief space, and the dimension could be very high in order to approximate the belief states closely. Brunskill et al. [10] used weighted sums of Gaussians to approximate the belief states and value functions in a class of switching state models. Roy [28] and Brooks et al. [8] used sufficient statistics to reduce the dimension of the belief space, which is often referred to as belief compression in the Artificial Intelligence literature. Roy [28] proposed an augmented MDP (AMDP), characterizing belief states using the maximum likelihood state and entropy, which are usually not sufficient statistics except for a linear Gaussian model. As shown by Roy himself, the algorithm fails in a simple robot navigation problem, since the two statistics are not sufficient for distinguishing between a unimodal distribution and a bimodal distribution. Brooks et al.
[8] proposed a parametric POMDP, representing the belief state as a Gaussian distribution so as to convert the POMDP to a problem of computing the value function over a two-dimensional continuous space, and using the extended Kalman filter to estimate the transition of the approximated belief state.

The restriction to the Gaussian representation has the same problem as the AMDP. The algorithm recently proposed in Brooks and Williams [9] is similar to ours, in that they also approximate the belief state by a parameterized density and solve the approximate belief MDP on the parameter space using Monte Carlo simulation-based methods. However, they did not specify how to compute the parameters except for Gaussian densities, whereas we explicitly provide an analytical way to calculate the parameters for exponential families of densities. Moreover, we develop rigorous theoretical error bounds for our algorithm. There are some other belief compression algorithms designed for finite-state POMDPs, such as value-directed compression [26] and the exponential family principal components analysis (E-PCA) belief compression [29], but they cannot be directly generalized to continuous-state models, since they are based on a fixed set of support points. Motivated by the work of [3] [28] and [8], we develop a computationally tractable algorithm that effectively reduces the dimension of the belief state and has the flexibility to represent arbitrary belief states, such as multimodal or heavy-tailed distributions. The idea is to project the original high/infinite-dimensional belief space onto a low-dimensional family of parameterized distributions by minimizing the Kullback-Leibler (KL) divergence between the belief state and that family of distributions. For an exponential family, the minimization of KL divergence can be carried out in analytical form, making the method easy to implement. The projected belief MDP can then be solved on the parameter space by using simulation-based algorithms, or can be further approximated by a finite-state MDP via a suitable discretization of the parameter space and thus solved by using standard solution techniques such as value iteration and policy iteration.
Our method can be viewed as a generalization of the AMDP in [28] and the parametric POMDP in [8], which consider only the family of Gaussian distributions. In addition, we provide theoretical results on the error bounds of the value function and the performance of the policy generated by our method with respect to the optimal ones. We also develop a projection particle filter for online filtering and decision making, by incorporating the density projection technique into particle filtering. The projection particle filter we propose here is a modification of the projection particle filter in [2]. Unlike in [2], where the predicted conditional density is projected, we project the updated conditional density, so as to ensure the projected belief state remains in the given family of densities. Although seemingly a small modification of the algorithm, we prove under much less restrictive assumptions a similar bound on the error between our projection particle filter and the exact filter. The rest of the paper is organized as follows. Section II describes the formulation of a continuous-state POMDP and its transformation to a belief MDP. Section III describes the density projection technique, and uses it to develop the projected belief MDP. Section IV develops the projection particle filter. Section V computes error bounds for the value function approximation and the projection particle filter. Section VI discusses scalability and computational issues of the method, and applies the method to a simulation example of an inventory control problem. Section VII concludes the paper. Proofs of all results are contained in the Appendix.

II.
CONTINUOUS-STATE POMDP

Consider a discrete-time continuous-state POMDP:

x_k = f(x_{k-1}, a_{k-1}, u_{k-1}), k = 1, 2, ...,   (1)
y_k = h(x_k, a_{k-1}, v_k), k = 1, 2, ...,   (2)
y_0 = h_0(x_0, v_0),

where for all k, the state x_k is in a continuous state space S ⊆ R^{n_x}, the action a_k is in a finite action space A ⊂ R^{n_a}, the observation y_k is in a continuous observation space O ⊆ R^{n_y}, and the random disturbances u_k ∈ R^{n_u} and v_k ∈ R^{n_v} are sequences of i.i.d. continuous random vectors with known distributions. Assume that {u_k} and {v_k} are independent of each other, and are independent of x_0, which follows a distribution p_0. Also assume that f(x, a, u) is continuous in x for every a ∈ A and u ∈ R^{n_u}, h(x, a, v) is continuous in x for every a ∈ A and v ∈ R^{n_v}, and h_0(x, v) is continuous in x for every v ∈ R^{n_v}. Eqn. (1) is often referred to as the state equation, and (2) as the observation equation. All the information available to the decision maker at time k can be summarized by means of an information vector I_k, which is defined as

I_k = (y_0, y_1, ..., y_k, a_0, a_1, ..., a_{k-1}), k = 1, 2, ..., I_0 = y_0.

The objective is to find a policy π consisting of a sequence of functions π = {μ_0, μ_1, ...}, where each function μ_k maps the information vector I_k onto the action space A, to minimize the value function

J_π = lim_{H→∞} E_{x_0, {u_k}_{k=0}^H, {v_k}_{k=0}^H} [ Σ_{k=0}^H γ^k g(x_k, μ_k(I_k)) ],

where g : S × A → R is the one-step cost function, γ ∈ (0, 1) is the discount factor, and E_{x_0, {u_k}_{k=0}^H, {v_k}_{k=0}^H} denotes the expectation with respect to the joint distribution of x_0, u_0, ..., u_H, v_0, ..., v_H. For simplicity, we assume that the above limit exists. The optimal value function is defined by J* = min_{π ∈ Π} J_π, where Π is the set of all admissible policies. An optimal policy, denoted by π*, is an admissible policy that achieves J*. A stationary policy is an admissible policy of the form π = {μ, μ, ...}, referred to as the stationary policy μ for brevity, and its corresponding value function is denoted by J_μ.
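As a concrete illustration of the model (1)-(2), the following minimal sketch simulates a hypothetical scalar instance. The specific choices f(x, a, u) = x + a + u, h(x, a, v) = x + v, the Gaussian noise distributions, and the fixed action are illustrative assumptions, not the paper's example.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, a, u):
    """State equation (1): x_k = f(x_{k-1}, a_{k-1}, u_{k-1})."""
    return x + a + u

def h(x, a, v):
    """Observation equation (2): y_k = h(x_k, a_{k-1}, v_k)."""
    return x + v

x = rng.normal()                 # x_0 drawn from an assumed initial distribution
trajectory = []
for k in range(1, 6):
    a = 0.1                      # a fixed action, standing in for mu_k(I_k)
    x = f(x, a, rng.normal(0.0, 1.0))      # process noise u_{k-1}
    y = h(x, a, rng.normal(0.0, 0.5))      # observation noise v_k
    trajectory.append((x, y))    # the decision maker sees only the y's
```

Note that only the observations y_k (and the chosen actions) enter the information vector I_k; the states x_k remain hidden.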
The information vector I_k grows as the history expands. The standard approach to encode historical information is the use of the belief state, which is the conditional probability density of the current state x_k given the past history, i.e., b_k(x) ≜ p(x_k = x | I_k).

Given our assumptions on (1) and (2), b_k exists, and can be computed recursively via Bayes rule:

b_k(x) = p(x_k = x | I_{k-1}, a_{k-1}, y_k)
       = p(y_k | x_k = x, I_{k-1}, a_{k-1}) p(x_k = x | I_{k-1}, a_{k-1}) / p(y_k | I_{k-1}, a_{k-1})
       ∝ p(y_k | x_k = x, a_{k-1}) ∫_S p(x_k = x | a_{k-1}, x_{k-1}) p(x_{k-1} | I_{k-1}) dx_{k-1}
       ∝ p(y_k | x_k = x, a_{k-1}) ∫_S p(x_k = x | a_{k-1}, x_{k-1}) b_{k-1}(x_{k-1}) dx_{k-1}.   (3)

The third line follows from the Markovian property of y_k induced by (2), and the fact that the denominator p(y_k | I_{k-1}, a_{k-1}) does not explicitly depend on x_k; the fourth line follows from the Markovian property of x_k induced by (1), and the fact that a_{k-1} is a function of I_{k-1}. The right-hand side of (3) can be expressed in terms of b_{k-1}, a_{k-1} and y_k. Hence,

b_k = ψ(b_{k-1}, a_{k-1}, y_k),   (4)

where y_k is characterized by the time-homogeneous conditional distribution P_Y(y_k | b_{k-1}) that is induced by (1) and (2), and does not depend on {y_0, ..., y_{k-1}}. A POMDP can be converted to an MDP by conditioning on the information vectors ([6], Chapter 5), and the converted MDP is called the belief MDP. The states of the belief MDP are the belief states, which follow the system dynamics (4), where y_k can be viewed as the system noise with the distribution P_Y. The state space of the belief MDP is the belief space, denoted by B, which is the set of all belief states, i.e., a set of probability densities. A policy π is a sequence of functions π = {μ_0, μ_1, ...}, where each function μ_k maps the belief state b_k onto the action space A. Noticing that E_{x_0, {u_i}_{i=0}^k, {v_i}_{i=0}^k} [g(x_k, a_k)] = E{ E_{x_k} [g(x_k, a_k) | I_k] }, the one-step cost function can be written in terms of the belief state as the belief one-step cost function

ḡ(b_k, a_k) ≜ E_{x_k}[g(x_k, a_k) | I_k] = ∫_{x ∈ S} g(x, a_k) b_k(x) dx ≜ ⟨g(·, a_k), b_k⟩.
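The prediction-then-update structure of (3)-(4) can be sketched numerically on a fixed grid. The scalar model below (transition density N(x_{k-1} + a, 1) and likelihood N(y; x, 0.25)) is an illustrative assumption chosen only so the integral in (3) is easy to discretize.

```python
import numpy as np

grid = np.linspace(-6.0, 6.0, 241)
dx = grid[1] - grid[0]

def normal_pdf(z, mean, var):
    return np.exp(-(z - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def psi(b, a, y):
    """Belief update b_k = psi(b_{k-1}, a_{k-1}, y_k) of (4), on the grid."""
    # Prediction: integrate the transition density against b_{k-1}, as in (3).
    pred = np.array([(normal_pdf(x, grid + a, 1.0) * b).sum() * dx for x in grid])
    # Updating: reweight by the likelihood p(y_k | x_k), then normalize.
    post = normal_pdf(y, grid, 0.25) * pred
    return post / (post.sum() * dx)

b0 = normal_pdf(grid, 0.0, 1.0)        # prior belief b_0
b1 = psi(b0, a=0.5, y=1.0)             # posterior belief after observing y_1
```

The normalization in the last line of `psi` plays the role of the denominator p(y_k | I_{k-1}, a_{k-1}) dropped in (3).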
Assuming there exists a stationary optimal policy, the optimal value function is given by J*(b) = lim_{k→∞} T^k J(b), ∀ b ∈ B, where T is the dynamic programming (DP) mapping that operates on any bounded function J : B → R according to

T J(b) = min_{a ∈ A} [ ⟨g(·, a), b⟩ + γ E_Y { J(ψ(b, a, Y)) } ],   (5)

where E_Y denotes the expectation with respect to the distribution P_Y. For finite-state POMDPs, the belief state b is a vector with each entry being the probability of being at one of the states. Hence, the belief space B is a finite-dimensional probability simplex, and the value function is a piecewise linear convex function after a finite number of iterations, provided that the one-step cost function is piecewise linear and convex [30]. This feature has been exploited in various exact and approximate value iteration algorithms such as those found in [7], [22], and [30]. For continuous-state POMDPs, the belief state b is a continuous density, and thus the belief space B is an infinite-dimensional space that contains all sorts of continuous densities. For continuous-state POMDPs, the value function preserves convexity [32], but value iteration algorithms are not computationally feasible because the belief space is infinite dimensional. The infinite dimensionality of the belief space also creates difficulties in applying the approximate algorithms that were developed for finite-state POMDPs. For example, one straightforward and commonly used approach is to approximate a continuous-state POMDP by a finite-state one via discretization of the state space. In practice, this could lead to computational difficulties, either resulting in a belief space that is of huge dimension or in a solution that is not accurate enough. In addition, note that even for a relatively nice prior distribution b_k (e.g., a Gaussian distribution), the exact evaluation of the posterior distribution b_{k+1} is computationally intractable; moreover, the update b_{k+1} may not have any structure, and therefore can be very difficult to handle.
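To make the mapping (5) concrete in the tractable finite-state case discussed above, the sketch below runs value iteration J ← T J on a discretized two-state belief simplex. All model numbers (transition matrices, observation probabilities, costs) are invented for illustration; the expectation over Y becomes a finite sum, and off-grid beliefs are handled by interpolation.

```python
import numpy as np

P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),    # transition matrices per action
     1: np.array([[0.5, 0.5], [0.5, 0.5]])}
Q = np.array([[0.8, 0.2], [0.3, 0.7]])         # Q[s, y] = p(y | state s)
g = {0: np.array([1.0, 0.0]), 1: np.array([0.4, 0.4])}  # per-state one-step costs
gamma = 0.9
ps = np.linspace(0.0, 1.0, 101)                # grid over beliefs b = (p, 1-p)

def bayes(b, a, y):
    """Belief update psi(b, a, y) for the finite-state model."""
    post = (b @ P[a]) * Q[:, y]
    return post / post.sum()

def T(J):
    """One application of the DP mapping (5) on the belief grid."""
    out = np.empty_like(J)
    for i, p in enumerate(ps):
        b = np.array([p, 1.0 - p])
        vals = []
        for a in (0, 1):
            ey = 0.0
            for y in (0, 1):
                p_y = (b @ P[a]) @ Q[:, y]     # P_Y(y | b, a)
                ey += p_y * np.interp(bayes(b, a, y)[0], ps, J)
            vals.append(b @ g[a] + gamma * ey)
        out[i] = min(vals)                     # minimize over actions
    return out

J = np.zeros(len(ps))
for _ in range(60):                            # T is a gamma-contraction
    J = T(J)
```

Because T is a γ-contraction, the iterates converge geometrically; it is exactly this grid over the simplex that has no infinite-dimensional analogue in the continuous-state case.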
Therefore, for practical reasons, we often wish to have a low-dimensional belief space and to have a posterior distribution b_{k+1} that stays in the same distribution family as the prior b_k. To address the aforementioned difficulties, we apply the density projection technique to project the infinite-dimensional belief space onto a finite/low-dimensional parameterized family of densities, so as to derive a so-called projected belief MDP, which is an MDP with a finite/low-dimensional state space and therefore can be solved by many existing methods. In the next section, we describe density projection in detail and develop the formulation of a projected belief MDP.

III. PROJECTED BELIEF MDP

A projection mapping from the belief space B to a family of parameterized densities Ω, denoted as Proj_Ω : B → Ω, is defined by

Proj_Ω(b) ≜ arg min_{f ∈ Ω} D_KL(b ‖ f), b ∈ B,   (6)

where D_KL(b ‖ f) denotes the Kullback-Leibler (KL) divergence (or relative entropy) between b and f, which is

D_KL(b ‖ f) ≜ ∫ b(x) log (b(x) / f(x)) dx.   (7)

Hence, the projection of b on Ω has the minimum KL divergence from b among all the densities in Ω. When Ω is an exponential family of densities, the minimization (6) has an analytical solution and can be carried out easily. The exponential families include many common families of densities, such as Gaussian, binomial, Poisson, Gamma, etc. An exponential family of densities is defined as follows [3]:

Definition 1: Let {c_1(·), ..., c_m(·)} be affinely independent scalar functions defined on R^n, i.e., for distinct points

x_1, ..., x_{m+1}, Σ_{i=1}^{m+1} λ_i c(x_i) = 0 and Σ_{i=1}^{m+1} λ_i = 0 implies λ_1 = ... = λ_{m+1} = 0, where c(x) = [c_1(x), ..., c_m(x)]^T. Assuming that Θ_0 = {θ ∈ R^m : φ(θ) = log ∫ exp(θ^T c(x)) dx < ∞} is a convex set with a nonempty interior, then Ω defined by

Ω = {f(·, θ), θ ∈ Θ}, f(x, θ) = exp[θ^T c(x) − φ(θ)],

where Θ ⊆ Θ_0 is open, is called an exponential family of probability densities, with θ its parameter and c(x) its sufficient statistic. Substituting f(x) = f(x, θ) into (7) and expressing it further as

D_KL(b ‖ f(·, θ)) = ∫ b(x) log b(x) dx − ∫ b(x) log f(x, θ) dx,

we can see that the first term does not depend on f(·, θ); hence min_θ D_KL(b ‖ f(·, θ)) is equivalent to max_θ ∫ b(x) log f(x, θ) dx, which by Definition 1 is the same as

max_θ ∫ (θ^T c(x) − φ(θ)) b(x) dx.   (8)

Recall the fact that the log-likelihood l(θ) = θ^T c(x) − φ(θ) is strictly concave in θ [2]; therefore, ∫ (θ^T c(x) − φ(θ)) b(x) dx is also strictly concave in θ. Hence, (8) has a unique maximum, and the maximum is achieved when the first-order optimality condition is satisfied, i.e.,

∫ ( c_j(x) − ∫ c_j(x) exp(θ^T c(x)) dx / ∫ exp(θ^T c(x)) dx ) b(x) dx = 0.

With a little rearranging of the terms and the expression of f(x, θ), the above equation can be rewritten as

E_b[c_j(X)] = E_θ[c_j(X)], j = 1, ..., m,   (9)

where E_b and E_θ denote the expectations with respect to b and f(·, θ), respectively. Density projection is a useful idea to approximate an arbitrary (most likely, infinite-dimensional) density as accurately as possible by a density in a chosen family that is characterized by only a few parameters. Using this idea, we can transform the belief MDP to another MDP confined to a low-dimensional belief space, and then solve this MDP problem. We call such an MDP the projected belief MDP.
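For the Gaussian family the sufficient statistics are c(x) = (x, x²), so the moment-matching condition (9) says the KL-minimizing Gaussian simply matches the mean and variance of b. The sketch below applies this to a bimodal mixture, used here as an illustrative stand-in for a belief state.

```python
import numpy as np

rng = np.random.default_rng(1)

def project_gaussian(samples):
    """Projection (6) onto the Gaussian family: by (9), the minimizer of
    KL(b || N(mu, var)) matches E_b[X] and E_b[X^2]."""
    return samples.mean(), samples.var()

# "Belief state" b: equal mixture of N(-2, 0.49) and N(2, 0.49), represented
# here by samples for convenience.
b_samples = np.concatenate([rng.normal(-2.0, 0.7, 5000),
                            rng.normal(2.0, 0.7, 5000)])
mu, var = project_gaussian(b_samples)
# Moment matching gives a mean near 0 and a variance near 2^2 + 0.49 = 4.49;
# the projection keeps the spread of b but, by design, not its bimodality.
```

Richer exponential families (more sufficient statistics, e.g. higher moments or mixtures handled componentwise) reduce exactly the kind of information loss visible here, at the cost of a higher-dimensional parameter space.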
Its state is the projected belief state b̄_k ∈ Ω, which satisfies the system dynamics

b̄_0 = Proj_Ω(b_0), b̄_k = ψ̄(b̄_{k-1}, a_{k-1}, y_k), k = 1, 2, ...,

where ψ̄(b̄_{k-1}, a_{k-1}, y_k) = Proj_Ω(ψ(b̄_{k-1}, a_{k-1}, y_k)), and the dynamic programming mapping on the projected belief MDP is

T̄ J(b̄) = min_{a ∈ A} [ ⟨g(·, a), b̄⟩ + γ E_Y { J(ψ̄(b̄, a, Y)) } ].   (10)

For the projected belief MDP, a policy is denoted as π̄ = {μ̄_0, μ̄_1, ...}, where each function μ̄_k maps the projected belief state b̄_k onto the action space A. Similarly, a stationary policy is denoted as μ̄; an optimal stationary policy is denoted as μ̄*; and the optimal value function is denoted as J̄*(b̄). The projected belief MDP is in fact a low-dimensional continuous-state MDP, and can be solved in numerous ways. One common approach is to use value iteration or policy iteration by converting the projected belief MDP to a discrete-state MDP problem via a suitable discretization of the projected belief space (i.e., the parameter space) and then estimating the one-step cost function and transition probabilities on the discretized mesh. The effect of the discretization procedure on dynamic programming has been studied in [5]. We describe this approach in detail below. Discretization of the projected belief space Ω is equivalent to discretization of the parameter space Θ, which yields a set of grid points, denoted by G = {θ_i, i = 1, ..., N}. Let ḡ(θ_i, a) denote the one-step cost function associated with taking action a at the projected belief state b̄ = f(·, θ_i). Let P(θ_i, a)(θ_j) denote the transition probability from the current projected belief state b̄_k = f(·, θ_i) to the next projected belief state b̄_{k+1} = f(·, θ_j) by taking action a. Estimation of P(θ_i, a)(θ_j) is done using a variation of the projection particle filtering algorithm, to be described in the next section. ḡ(θ_i, a) can be estimated by its sample mean:

ḡ(θ_i, a) = (1/N) Σ_{j=1}^N g(x_j, a),   (11)

where x_1, ..., x_N are sampled i.i.d. from f(·, θ_i).
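The sample-mean estimate (11) is a one-line Monte Carlo computation. In the sketch below, f(·, θ_i) is taken to be N(1.0, 0.25) and g(x, a) = (x − a)², both hypothetical choices rather than the paper's inventory model.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100_000

def g(x, a):
    """Illustrative one-step cost, a quadratic penalty around the action."""
    return (x - a) ** 2

x = rng.normal(1.0, 0.5, size=N)        # x_1, ..., x_N i.i.d. from f(., theta_i)
g_bar = g(x, 0.0).mean()                # estimate (11) of <g(., a), f(., theta_i)>
# For this choice the exact value is E[X^2] = 1.0^2 + 0.25 = 1.25, so the
# estimate should be close, with O(1/sqrt(N)) Monte Carlo error.
```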
Remark 1: The approach for solving the projected belief MDP described here is probably the most intuitive, but not necessarily the most computationally efficient. Other more efficient techniques for solving continuous-state MDPs can be used to solve the projected belief MDP, such as the linear programming approach [5], neuro-dynamic programming methods [7], and simulation-based methods [2].

IV. PROJECTION PARTICLE FILTERING

Solving the projected belief MDP gives us a policy, which tells us what action to take at each projected belief state. In an online implementation, at each time k, the decision maker receives a new observation y_k, estimates the belief state b_k, and then chooses his action a_k according to b_k and that policy. Hence, implementing our approach requires addressing the problem of estimating the belief state. Estimation of b_k, or simply filtering, does not have an analytical solution in most cases except for linear Gaussian systems, but it can be solved using many approximation methods, such as the extended Kalman filter and particle filtering. Here we focus on particle filtering, because 1) it outperforms the extended Kalman filter in many nonlinear/non-Gaussian systems [], and 2) we will develop a projection particle filter to be used in conjunction with the projected belief MDP.

A. Particle Filtering

Particle filtering is a Monte Carlo simulation-based method that approximates the belief state by a finite number of particles/samples and mimics the propagation of the belief state [] [4]. As we have already shown in (3), the belief state evolves recursively as

b_k(x_k) ∝ p(y_k | x_k, a_{k-1}) ∫_S p(x_k | a_{k-1}, x_{k-1}) b_{k-1}(x_{k-1}) dx_{k-1}.   (12)

The integration in (12) can be approximated using Monte Carlo simulation, which is the essence of particle filtering. Specifically, suppose {x_{k-1}^i}_{i=1}^N are drawn i.i.d. from b_{k-1}, and x_{k|k-1}^i is drawn from p(x_k | a_{k-1}, x_{k-1}^i) for each i; then b_k(x_k) can be approximated by the probability mass function

b̂_k(x_k) = Σ_{i=1}^N w_k^i δ(x_k − x_{k|k-1}^i),   (13)

where

w_k^i ∝ p(y_k | x_{k|k-1}^i, a_{k-1}),   (14)

δ denotes the Kronecker delta function, {x_{k|k-1}^i}_{i=1}^N are the random support points, and {w_k^i}_{i=1}^N are the associated probabilities/weights, which sum up to 1. To avoid sample degeneracy, new samples {x_k^i}_{i=1}^N are sampled i.i.d. from the approximate belief state b̂_k. At the next time k+1, the above steps are repeated to yield {x_{k+1|k}^i}_{i=1}^N and corresponding weights {w_{k+1}^i}_{i=1}^N, which are used to approximate b_{k+1}. This is the basic form of particle filtering, which is also called the bootstrap filter [8]. (Please see [] for a rigorous and thorough derivation of a more general form of particle filtering.) The algorithm is as follows:

Algorithm 1: (Particle Filtering (Bootstrap Filter))
Input: a (stationary) policy μ on the belief MDP; a sequence of observations y_1, y_2, ... arriving sequentially at time k = 1, 2, ....
Output: a sequence of approximate belief states b̂_1, b̂_2, ....
Step 1. Initialization: Sample x_0^1, ..., x_0^N i.i.d. from the approximate initial belief state b̂_0. Set k = 1.
Step 2. Prediction: Compute x_{k|k-1}^1, ..., x_{k|k-1}^N by propagating x_{k-1}^1, ..., x_{k-1}^N according to the system dynamics (1) using the action a_{k-1} = μ(b̂_{k-1}) and randomly generated noise {u_{k-1}^i}_{i=1}^N, i.e., sample x_{k|k-1}^i from p(· | x_{k-1}^i, a_{k-1}), i = 1, ..., N. The empirical predicted belief state is

b̂_{k|k-1}(x) = (1/N) Σ_{i=1}^N δ(x − x_{k|k-1}^i).

Step 3.
Bayes updating: Receive a new observation y_k. The empirical updated belief state is

b̂_k(x) = Σ_{i=1}^N w_k^i δ(x − x_{k|k-1}^i), where w_k^i = p(y_k | x_{k|k-1}^i, a_{k-1}) / Σ_{i=1}^N p(y_k | x_{k|k-1}^i, a_{k-1}), i = 1, ..., N.

Step 4. Resampling: Sample x_k^1, ..., x_k^N i.i.d. from b̂_k.
Step 5. k ← k + 1 and go to Step 2.

It has been proved that the approximate belief state b̂_k converges to the true belief state b_k as the sample number N increases to infinity [3] [20]. However, uniform convergence in time has only been proved for the special case where the system dynamics has a mixing kernel, which ensures that any error is forgotten (exponentially) in time. Usually, as time k increases, an increasing number of samples is required to ensure a given precision of the approximation b̂_k for all k.

B. Projection Particle Filtering

To obtain a reasonable approximation of the belief state, particle filtering needs a large number of samples/particles. Since the number of samples/particles is the dimension of the approximate belief state b̂, particle filtering is not very helpful in reducing the dimensionality of the belief space. Moreover, particle filtering does not give us an approximate belief state in the projected belief space Ω, hence the policy we obtained by solving the projected belief MDP is not immediately applicable. We incorporate the idea of density projection into particle filtering, so as to approximate the belief state by a density in Ω. The projection particle filter we propose here is a modification of the one in [2]. Their projection particle filter projects the empirical predicted belief state, not the empirical updated belief state, onto a parametric family of densities, so after Bayes updating, the approximate belief state might not be in that family. We will project the empirical updated belief state onto a parametric family by minimizing the KL divergence between the empirical density and the projected one. In addition, we will need much less restrictive assumptions than [2] to obtain similar error bounds.
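Combining the bootstrap filter above with a projection of the updated empirical belief onto the Gaussian family gives the following sketch. The scalar model (random-walk transition, N(y; x, 0.25) likelihood, fixed action standing in for the policy) is an illustrative assumption; the projection step is weighted moment matching, the sample analogue of (9).

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2000
theta = (0.0, 1.0)                  # (mean, var) of the initial Gaussian belief

for y in [0.4, 0.9, 1.3]:           # observations arriving online
    mu, var = theta
    # Resampling: draw particles from the projected belief f(., theta).
    particles = rng.normal(mu, np.sqrt(var), N)
    # Prediction: propagate through assumed dynamics x + a + u, action a = 0.1.
    predicted = particles + 0.1 + rng.normal(0.0, 1.0, N)
    # Bayes updating: weight by the assumed likelihood N(y; x, 0.25).
    w = np.exp(-(y - predicted) ** 2 / (2 * 0.25))
    w /= w.sum()
    # Projection: match weighted sufficient statistics x and x^2, so the
    # updated empirical belief is replaced by the closest Gaussian in KL.
    m1 = np.sum(w * predicted)
    m2 = np.sum(w * predicted ** 2)
    theta = (m1, m2 - m1 ** 2)

posterior_mean, posterior_var = theta
```

Because each resampling step draws from a continuous density rather than the discrete empirical measure, repeated support points cannot accumulate, which is the sample-impoverishment remedy noted below.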
Since resampling is from a continuous distribution instead of an empirical (discrete) one, the proposed projection particle filter also overcomes the difficulty of sample impoverishment [] that occurs in the bootstrap filter. Applying the density projection technique we described in the last section, projecting the empirical belief state b̂_k onto an exponential family Ω involves finding an f(·, θ) with the parameter θ satisfying (9). Hence, plugging (13) into (9) yields

Σ_{i=1}^N w_k^i c_j(x_{k|k-1}^i) = E_θ[c_j], j = 1, ..., m,

which constitutes the projection step in projection particle filtering.

Algorithm 2: (Projection particle filtering for an exponential family of densities (PPF))
Input: a (stationary) policy μ̄ on the projected belief MDP; a family of exponential densities Ω = {f(·, θ), θ ∈ Θ}; a sequence of observations y_1, y_2, ... arriving sequentially at time k = 1, 2, ....
Output: a sequence of approximate belief states f(·, θ̂_1), f(·, θ̂_2), ....
Step 1. Initialization: Sample x_0^1, ..., x_0^N i.i.d. from the approximate initial belief state f(·, θ̂_0). Set k = 1.

Step 2. Prediction: Compute x_{k|k-1}^1, ..., x_{k|k-1}^N by propagating x_{k-1}^1, ..., x_{k-1}^N according to the system dynamics (1) using the action a_{k-1} = μ̄(f(·, θ̂_{k-1})) and randomly generated noise {u_{k-1}^i}_{i=1}^N, i.e., sample x_{k|k-1}^i from p(· | x_{k-1}^i, a_{k-1}), i = 1, ..., N.
Step 3. Bayes updating: Receive a new observation y_k. Compute weights according to

w_k^i = p(y_k | x_{k|k-1}^i, a_{k-1}) / Σ_{i=1}^N p(y_k | x_{k|k-1}^i, a_{k-1}), i = 1, ..., N.

Step 4. Projection: The approximate belief state is f(·, θ̂_k), where θ̂_k satisfies the equations

Σ_{i=1}^N w_k^i c_j(x_{k|k-1}^i) = E_{θ̂_k}[c_j], j = 1, ..., m.

Step 5. Resampling: Sample x_k^1, ..., x_k^N from f(·, θ̂_k).
Step 6. k ← k + 1 and go to Step 2.

In an online implementation, at each time k, the PPF approximates b_k by f(·, θ̂_k), and then decides an action a_k according to a_k = μ̄(f(·, θ̂_k)), where μ̄ is the policy solved for the projected belief MDP. As mentioned in the last section, the PPF can be varied slightly for estimating the transition probabilities of the discretized projected belief MDP, as follows:

Algorithm 3: (Estimation of the transition probabilities)
Input: θ_i, a, N. Output: P(θ_i, a)(θ_j), j = 1, ..., N.
Step 1. Sampling: Sample x^1, ..., x^N from f(·, θ_i).
Step 2. Prediction: Compute x̄^1, ..., x̄^N by propagating x^1, ..., x^N according to the system dynamics (1) using the action a and randomly generated noise {u^i}_{i=1}^N.
Step 3. Sampling observation: Compute y^1, ..., y^N from x̄^1, ..., x̄^N according to the observation equation (2) using randomly generated noise {v^i}_{i=1}^N.
Step 4. Bayes updating: For each y^k, k = 1, ..., N, the updated belief state is

b^k(x) = Σ_{i=1}^N w^k_i δ(x − x̄^i), where w^k_i = p(y^k | x̄^i, a) / Σ_{i=1}^N p(y^k | x̄^i, a), i = 1, ..., N.

Step 5. Projection: For k = 1, ..., N, project each b^k onto the exponential family, i.e., find θ^k that satisfies (9).
Step 6. Estimation: For k = 1, ..., N, find the nearest-neighbor grid point of θ^k in G. For each θ_j ∈ G, count the frequency P(θ_i, a)(θ_j) = (number of θ_j) / N.

V. ANALYSIS OF ERROR BOUNDS

A.
Value Function Approximation

Our method solves the projected belief MDP instead of the original belief MDP, and that raises two questions: How well does the optimal value function of the projected belief MDP approximate the optimal value function of the original belief MDP? How well does the optimal policy obtained by solving the projected belief MDP perform on the original belief MDP? To answer these questions, we first need to rephrase them mathematically. Here we assume perfect computation of the belief states and the projected belief states, and the following:

Assumption 1: There exist a stationary optimal policy for the belief MDP, denoted by μ*, and a stationary optimal policy for the projected belief MDP, denoted by μ̄*.

Assumption 1 holds under some mild conditions [6], [9]. Using the stationarity, and the dynamic programming mappings on the belief MDP and the projected belief MDP given by (5) and (10), the optimal value function J*(b) for the belief MDP can be obtained by J*(b) ≜ J_{μ*}(b) = lim_{k→∞} T^k J_0(b), and the optimal value function for the projected belief MDP obtained by J̄*(b̄) ≜ J̄_{μ̄*}(b̄) = lim_{k→∞} (T̄)^k J_0(b̄). Therefore, the questions posed at the beginning of this section can be formulated mathematically as:

1. How well the optimal value function of the projected belief MDP approximates the true optimal value function can be measured by |J*(b) − J̄*(b̄)|.
2. How well the optimal policy μ̄* for the projected belief MDP performs on the original belief space can be measured by |J*(b) − J_{μ̄*}(b)|, where μ̄*(b) ≜ μ̄* ∘ Proj_Ω(b) = μ̄*(b̄).

The next assumption bounds the difference between the belief state b and its projection b̄, and also the difference between their one-step evolutions ψ(b, a, y) and ψ̄(b̄, a, y). It is an assumption on the projection error.

Assumption 2: There exist ε_1 > 0 and δ_1 > 0 such that for all a ∈ A, y ∈ O and b ∈ B,

|⟨g(·, a), b − b̄⟩| ≤ ε_1, |⟨g(·, a), ψ(b, a, y) − ψ̄(b̄, a, y)⟩| ≤ δ_1.

The following assumption can be seen as a continuity property of the value function.
Assumption 3: For any δ > 0 that satisfies |⟨g(·, a), b − b′⟩| ≤ δ, ∀ b, b′ ∈ B, there exists ε > 0 such that |J_k(b) − J_k(b′)| ≤ ε, ∀ b, b′ ∈ B, ∀ k, and there exists ε′ > 0 such that |J_μ(b) − J_μ(b′)| ≤ ε′, ∀ b, b′ ∈ B, ∀ μ ∈ Π.

Now we present our main result.

Theorem 1: Under Assumptions 1, 2 and 3,

|J*(b) − J̄*(b̄)| ≤ (ε_1 + γ ε_2) / (1 − γ), ∀ b ∈ B,   (15)

|J*(b) − J_{μ̄*}(b)| ≤ (2 ε_1 + γ (ε_2 + ε_3)) / (1 − γ), ∀ b ∈ B,   (16)

where ε_1 is the constant in Assumption 2, and ε_2, ε_3 are the constants ε and ε′, respectively, in Assumption 3 corresponding to δ = δ_1, where δ_1 is the constant in Assumption 2.

Remark 2: In (15) and (16), ε_1 is a projection error, and ε_2 and ε_3 decrease as the projection error δ_1 decreases. Therefore, as the projection error decreases, the optimal value function of the projected belief MDP J̄*(b̄) converges to the true optimal value function J*(b), and the corresponding policy μ̄* converges to the true optimal policy μ*. Roughly speaking, the projection error decreases as the number of sufficient statistics in the chosen exponential family increases (for details, please see Section V-C: Validation of the Assumptions).

B. Projection Particle Filtering

In the above analysis, we assumed perfect computation of the belief states and the projected belief states. In this section, we consider the filtering error, and compute an error bound on the approximate belief state generated by the projection particle filter (PPF).

1) Notations: Let C_b(R^n) be the set of all continuous bounded functions on R^n. Let B(R^n) be the set of all bounded measurable functions on R^n. Let ‖·‖ denote the supremum norm on B(R^n), i.e., ‖φ‖ ≜ sup_{x ∈ R^n} |φ(x)|, φ ∈ B(R^n). Let M^+(R^n) and P(R^n) be the sets of nonnegative measures and probability measures on R^n, respectively. If η ∈ M^+(R^n) and φ : R^n → R is an integrable function with respect to η, then ⟨η, φ⟩ ≜ ∫ φ dη. Moreover, if η ∈ P(R^n), E_η[φ] = ⟨η, φ⟩, Var_η(φ) = ⟨η, φ^2⟩ − ⟨η, φ⟩^2. We will use the representations on the two sides of the above equalities interchangeably in the sequel. The belief state and the projected belief state are probability densities; however, we will prove our results in terms of their corresponding probability measures, which we refer to as conditional distributions (belief states are conditional densities). The two representations are essentially the same once we assume the probability measures admit probability densities.
Therefore, the notations used for probability densities before are used to denote their corresponding probability measures from now on. Namely, we use $b$ to denote a probability measure on $\mathbb{R}^{n_x}$ and assume it admits a probability density with respect to Lebesgue measure, which is the belief state. Similarly, we use $f(\cdot,\theta)$ to denote a probability measure on $\mathbb{R}^{n_x}$ and assume it admits a probability density, with respect to Lebesgue measure, in the chosen exponential family with parameter $\theta$. A probability transition kernel $K: P(\mathbb{R}^{n_x}) \times \mathbb{R}^{n_x} \to \mathbb{R}$ is defined by
$$K\eta(F) \triangleq \int_{\mathbb{R}^{n_x}} \eta(dx) K(F, x),$$
where $F$ is a set in the Borel $\sigma$-algebra on $\mathbb{R}^{n_x}$. For $\phi: \mathbb{R}^{n_x} \to \mathbb{R}$, an integrable function with respect to $K(\cdot, x)$,
$$K\phi(x) \triangleq \int_{\mathbb{R}^{n_x}} \phi(x') K(dx', x).$$

TABLE I: NOTATIONS OF DIFFERENT CONDITIONAL DISTRIBUTIONS
$b_k$: exact conditional distribution
$\hat b_k$: PPF conditional distribution before projection
$f(\cdot,\hat\theta_k)$: PPF projected conditional distribution
$b'_k$: CF conditional distribution before projection
$f(\cdot,\theta'_k)$: CF projected conditional distribution

Let $K_k(dx_k, x_{k-1})$ denote the probability transition kernel of the system (1) at time $k$, which satisfies
$$b_{k|k-1}(dx_k) = K_k b_{k-1}(dx_k) = \int_{\mathbb{R}^{n_x}} b_{k-1}(dx_{k-1}) K_k(dx_k, x_{k-1}).$$
We let $\Psi_k$ denote the likelihood function associated with the observation equation (2) at time $k$, and assume that $\Psi_k \in C_b(\mathbb{R}^{n_x})$. Hence,
$$b_k = \frac{\Psi_k b_{k|k-1}}{\langle b_{k|k-1}, \Psi_k \rangle}.$$

2) Main Idea: The exact filter (EF) at time $k$ can be described as
$$b_{k-1} \xrightarrow{\text{prediction}} b_{k|k-1} = K_k b_{k-1} \xrightarrow{\text{updating}} b_k = \frac{\Psi_k b_{k|k-1}}{\langle b_{k|k-1}, \Psi_k \rangle}.$$
The PPF at time $k$ can be described as
$$\hat f(\cdot,\hat\theta_{k-1}) \xrightarrow{\text{prediction}} \hat b_{k|k-1} = K_k \hat f(\cdot,\hat\theta_{k-1}) \xrightarrow{\text{updating}} \hat b_k = \frac{\Psi_k \hat b_{k|k-1}}{\langle \hat b_{k|k-1}, \Psi_k \rangle} \xrightarrow{\text{projection}} f(\cdot,\hat\theta_k) \xrightarrow{\text{resampling}} \hat f(\cdot,\hat\theta_k).$$
To facilitate our analysis, we introduce a conceptual filter (CF), which at each time $k$ is reinitialized by $f(\cdot,\hat\theta_{k-1})$, performs exact prediction and updating to yield $b'_{k|k-1}$ and $b'_k$, respectively, and does projection to obtain $f(\cdot,\theta'_k)$. It can be described as
$$f(\cdot,\hat\theta_{k-1}) \xrightarrow{\text{prediction}} b'_{k|k-1} = K_k f(\cdot,\hat\theta_{k-1}) \xrightarrow{\text{updating}} b'_k = \frac{\Psi_k b'_{k|k-1}}{\langle b'_{k|k-1}, \Psi_k \rangle} \xrightarrow{\text{projection}} f(\cdot,\theta'_k).$$
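The PPF cycle described above can be sketched concretely. The block below is a minimal illustration rather than the paper's implementation: it assumes a Gaussian exponential family (so the projection step reduces to moment matching), a scalar random-walk state equation, and noise levels chosen arbitrarily for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def ppf_step(theta, y, n_particles=200):
    """One PPF cycle with a Gaussian family f(., theta), theta = (mean, std):
    resample from the projected density, predict through the state equation,
    update by the likelihood, then project back onto the family."""
    mean, std = theta
    # resampling: draw N i.i.d. particles from f(., theta_{k-1})
    x = rng.normal(mean, std, n_particles)
    # prediction: propagate through illustrative dynamics x_k = x_{k-1} + u_k
    x = x + rng.normal(0.0, 1.0, n_particles)
    # updating: weight each particle by the likelihood of y_k = x_k + v_k,
    # with assumed observation noise v_k ~ N(0, 0.5^2)
    w = np.exp(-0.5 * ((y - x) / 0.5) ** 2)
    w /= w.sum()
    # projection: for the Gaussian family, minimizing KL divergence from the
    # weighted empirical distribution amounts to matching mean and variance
    m = float(np.sum(w * x))
    s = float(np.sqrt(np.sum(w * (x - m) ** 2)))
    return m, s

theta = (0.0, 1.0)
for y_k in [0.3, 0.8, 1.2]:
    theta = ppf_step(theta, y_k)
```

The two-parameter tuple returned each cycle is exactly the low-dimensional statistic that replaces the full particle cloud between stages.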
The CF serves as a bridge connecting the EF and the PPF, as we describe below. We are interested in the difference between the true conditional distribution $b_k$ and the PPF-generated projected conditional distribution $f(\cdot,\hat\theta_k)$ for each time $k$. The total error between the two can be decomposed as follows:
$$b_k - f(\cdot,\hat\theta_k) = (b_k - b'_k) + (b'_k - f(\cdot,\theta'_k)) + (f(\cdot,\theta'_k) - f(\cdot,\hat\theta_k)). \qquad (17)$$
The first term $(b_k - b'_k)$ is the error due to the inexact initial condition of the CF compared to the EF, i.e., $(b_{k-1} - f(\cdot,\hat\theta_{k-1}))$, which is also the total error at time $k-1$. The second

term $(b'_k - f(\cdot,\theta'_k))$ evaluates the minimum deviation from the exponential family generated by one step of exact filtering, since $f(\cdot,\theta'_k)$ is the projection of $b'_k$. The third term $(f(\cdot,\theta'_k) - f(\cdot,\hat\theta_k))$ is purely due to Monte Carlo simulation, since $f(\cdot,\theta'_k)$ and $f(\cdot,\hat\theta_k)$ are obtained using the same steps from $f(\cdot,\hat\theta_{k-1})$ and its empirical version $\hat f(\cdot,\hat\theta_{k-1})$, respectively. We will find error bounds on each of the three terms, and finally find the total error at time $k$ by induction.

3) Error Bound: We shall look at the case in which the observation process has an arbitrary but fixed value $y_{0:k} = \{y_0, \ldots, y_k\}$. Hence, all the expectations in this section are with respect to the sampling in the algorithm only. We consider a test function $\phi \in B(\mathbb{R}^{n_x})$. It is easy to see that $K\phi \in B(\mathbb{R}^{n_x})$ and $\|K\phi\| \le \|\phi\|$, since
$$|K\phi(x)| = \left|\int_{\mathbb{R}^{n_x}} \phi(x') K(dx', x)\right| \le \int_{\mathbb{R}^{n_x}} |\phi(x')| K(dx', x) \le \|\phi\| \int_{\mathbb{R}^{n_x}} K(dx', x) = \|\phi\|.$$
Since $\Psi \in C_b(\mathbb{R}^{n_x})$, we know that $\Psi \in B(\mathbb{R}^{n_x})$ and $\Psi\phi \in B(\mathbb{R}^{n_x})$. We also need the following assumptions.

Assumption 4: All the projected distributions are in a compact subset of the given exponential family. In other words, there exists a compact set $\bar\Theta \subset \Theta$ such that $\hat\theta_k \in \bar\Theta$ and $\theta'_k \in \bar\Theta$, $\forall k$.

Assumption 5: For all $k$, $\langle b_{k|k-1}, \Psi_k \rangle > 0$, $\langle b'_{k|k-1}, \Psi_k \rangle > 0$ w.p.1, and $\langle \hat b_{k|k-1}, \Psi_k \rangle > 0$ w.p.1.

Assumption 5 guarantees that the normalizing constant in the Bayes updating is nonzero, so that the conditional distribution is well defined. Under Assumption 4, the second inequality in Assumption 5 can be strengthened using the compactness of $\bar\Theta$. Since $f(x, a_k, u_k)$ in (1) is continuous in $x$, $K_k$ is weakly continuous (cf. [19]). Hence, $\langle b'_{k|k-1}, \Psi_k \rangle = \langle K_k f(\cdot,\hat\theta_{k-1}), \Psi_k \rangle = \langle f(\cdot,\hat\theta_{k-1}), K_k \Psi_k \rangle$ is continuous in $\hat\theta_{k-1}$, where $\hat\theta_{k-1} \in \bar\Theta$. Since $\bar\Theta$ is compact, there exists a constant $\delta > 0$ such that for each $k$,
$$\langle b'_{k|k-1}, \Psi_k \rangle \ge \frac{1}{\delta}, \quad \text{w.p.1.} \qquad (18)$$
The assumption below guarantees that the conditional distribution stays close to the given exponential family after one step of exact filtering if the initial distribution is in the exponential family.
Recall that starting with initial distribution $f(\cdot,\hat\theta_{k-1})$, one step of exact filtering yields $b'_k$, which is then projected to yield $f(\cdot,\theta'_k)$, where $\hat\theta_{k-1} \in \bar\Theta$, $\theta'_k \in \bar\Theta$.

Assumption 6: There exists a constant $\varepsilon > 0$ such that
$$E\left[\left|\langle b'_k, \phi \rangle - \langle f(\cdot,\theta'_k), \phi \rangle\right|\right] \le \varepsilon \|\phi\|, \quad \forall \phi \in B(\mathbb{R}^{n_x}),\ \forall k.$$

Remark 3: Assumption 6 is our main assumption, which essentially assumes an error bound on the projection error. Our assumptions are much less restrictive than those in [2], while our conclusion is similar to that in [2], as will be seen later. Although Assumption 6 appears similar to Assumption 3 in [2], it is essentially different. Assumption 3 in [2] says that the optimal conditional density stays close to the given exponential family for all time, whereas Assumption 6 only assumes that if the exact filter starts in the given exponential family, after one step the conditional distribution stays close to the family. Moreover, we do not need any assumption like the restrictive Assumption 4 in [2].

Lemma 1 considers the bound on the first term in (17).

Lemma 1: For each $k$, if $e_{k-1}$ is a positive constant such that $E[|\langle b_{k-1} - f(\cdot,\hat\theta_{k-1}), \phi \rangle|] \le e_{k-1} \|\phi\|$, $\forall \phi \in B(\mathbb{R}^{n_x})$, then under Assumptions 4 and 5, for each $k$ there exists a constant $\tau_k > 0$ such that
$$E[|\langle b_k - b'_k, \phi \rangle|] \le \tau_k e_{k-1} \|\phi\|, \quad \forall \phi \in B(\mathbb{R}^{n_x}). \qquad (19)$$

Lemma 2 considers the bound on the third term in (17) before projection.

Lemma 2: Under Assumptions 4 and 5, for each $k$,
$$E\left[\left|\langle \hat b_k - b'_k, \phi \rangle\right|\right] \le \frac{\tau_k}{\sqrt N} \|\phi\|, \quad \forall \phi \in B(\mathbb{R}^{n_x}),$$
where $\tau_k$ is the same constant as in Lemma 1.

Lemma 3 considers the bound on the third term in (17) based on the result of Lemma 2.

Lemma 3: Let $c_j$, $j = 1, \ldots, m$, be the sufficient statistics of the exponential family as defined in Definition 1, and assume $c_j \in B(\mathbb{R}^{n_x})$, $j = 1, \ldots, m$. Under Assumptions 4 and 5, there exists a constant $d > 0$ such that for each $k$,
$$E\left[\left|\langle f(\cdot,\hat\theta_k) - f(\cdot,\theta'_k), \phi \rangle\right|\right] \le \frac{d \tau_k}{\sqrt N} \|\phi\|, \quad \forall \phi \in B(\mathbb{R}^{n_x}), \qquad (20)$$
where $\tau_k$ is the same constant as in Lemmas 1 and 2.

Now we present our main result on the error bound of the projection particle filter.
Theorem 2: Let $e_0$ be a nonnegative constant such that $E[|\langle b_0 - f(\cdot,\hat\theta_0), \phi \rangle|] \le e_0 \|\phi\|$, $\forall \phi \in B(\mathbb{R}^{n_x})$. Under Assumptions 4, 5 and 6, and assuming that $c_j \in B(\mathbb{R}^{n_x})$, $j = 1, \ldots, m$, for each $k$,
$$E\left[\left|\langle b_k - f(\cdot,\hat\theta_k), \phi \rangle\right|\right] \le e_k \|\phi\|, \quad \forall \phi \in B(\mathbb{R}^{n_x}),$$
where
$$e_k = \tau_1^k e_0 + \left(1 + \sum_{i=2}^{k} \tau_i^k\right)\varepsilon + \frac{d}{\sqrt N} \sum_{i=1}^{k} \tau_i^k, \qquad (21)$$
$\tau_i^k = \prod_{j=i}^{k} \tau_j$ for $k \ge i$, $\tau_i^k = 0$ for $k < i$, $\tau_j$ is the constant in Lemmas 1, 2, and 3, $d$ is the constant in Lemma 3, and $\varepsilon$ is the constant in Assumption 6.

Remark 4: As we mentioned in Remark 2, the projection errors $e_0$ and $\varepsilon$ decrease as the number of sufficient statistics in the chosen exponential family, $m$, increases. The error $e_k$ decreases at the rate of $1/\sqrt N$ as we increase the number of samples $N$ in the projection particle filter. However, notice that the coefficient in front of $1/\sqrt N$ grows with time, so we have to use an increasing number of samples as time goes on in order to ensure a uniform error bound with respect to time.
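The closed form (21) is obtained by unrolling the one-step recursion $e_k = \tau_k e_{k-1} + \varepsilon + d\tau_k/\sqrt N$ established in the proof of Theorem 2 from Lemma 1, Assumption 6, and Lemma 3. A quick numerical sanity check, with constants chosen arbitrarily by us, confirms that the two expressions agree:

```python
import math
import random

def e_closed(k, e0, eps, d, N, tau):
    """Closed-form bound (21): tau_1^k * e0 + (1 + sum_{i=2}^k tau_i^k) * eps
    + (d/sqrt(N)) * sum_{i=1}^k tau_i^k, with tau_i^k = prod_{j=i}^k tau_j."""
    def tprod(i):  # tau_i^k
        p = 1.0
        for j in range(i, k + 1):
            p *= tau[j]
        return p
    return (tprod(1) * e0
            + (1 + sum(tprod(i) for i in range(2, k + 1))) * eps
            + d / math.sqrt(N) * sum(tprod(i) for i in range(1, k + 1)))

def e_recursive(k, e0, eps, d, N, tau):
    """One-step recursion from the proof: e_k = tau_k*e_{k-1} + eps + d*tau_k/sqrt(N)."""
    e = e0
    for j in range(1, k + 1):
        e = tau[j] * e + eps + d * tau[j] / math.sqrt(N)
    return e

random.seed(0)
tau = [None] + [1.0 + random.random() for _ in range(10)]  # arbitrary tau_1..tau_10
bound_closed = e_closed(10, 0.05, 0.01, 2.0, 200, tau)
bound_rec = e_recursive(10, 0.05, 0.01, 2.0, 200, tau)
```

With $\tau_j > 1$, rerunning `e_closed` for growing $k$ also makes visible the growth of the coefficient in front of $1/\sqrt N$ noted in Remark 4.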

C. Validation of the Assumptions

Assumptions 2 and 6 are the main assumptions of our analysis. They are assumptions on the projection error, positing that density projection introduces a small error. We will show that in certain cases these assumptions hold, and the projection error converges to 0 as the number of sufficient statistics, $m$, goes to infinity. We will first state a convergence result from [4]. However, as this convergence result is in the sense of KL divergence, we will further show the convergence in the sense employed in our assumptions by using an intermediate result in [4].

Consider a probability density function $b$ defined on a bounded interval, and approximate it by $\bar b$, a density function in an exponential family whose sufficient statistics consist of polynomials, splines or trigonometric series. The following theorem is proved in [4].

Theorem 3: If $\log b$ has $r$ square-integrable derivatives, i.e., $\int |D^r \log b|^2 < \infty$, then $D_{KL}(b \,\|\, \bar b)$ converges to 0 at rate $m^{-2r}$ as $m \to \infty$.

Theorem 3 says the projected density $\bar b$ converges to $b$ in the sense of KL divergence as $m$ goes to infinity. An intermediate result (see (6.6) in [4]) is $\|\log(b/\bar b)\| \le \nu_m$, where $\nu_m$ is a constant that depends on $m$, and $\nu_m \to 0$ as $m \to \infty$. Since $b$ is bounded and $\log(\cdot)$ is a continuously differentiable function, there exists a constant $\xi$ such that $|b - \bar b| \le \xi |\log b - \log \bar b|$. Hence, with the intermediate result above,
$$|\langle \phi, b - \bar b \rangle| \le \|\phi\| \int |b - \bar b|\, dx \le \|\phi\|\, \xi \int \left|\log \frac{b}{\bar b}\right| dx \le \|\phi\|\, \xi\, l\, \nu_m,$$
where $l$ is the length of the bounded interval on which $b$ is defined. Since $\nu_m$ can be made arbitrarily small by taking $m$ large enough, it is easy to see that Assumptions 2 and 6 hold in the cases that we consider.

VI. NUMERICAL EXPERIMENTS

A. Scalability and Computational Issues

Estimation of the one-step cost function (11) and of the transition probabilities (Algorithm 3) is executed for every belief-action pair in the discretized mesh $G$ and the action space $A$.
Hence, the algorithms scale as $O(|G||A|N)$ and $O(|G||A|N^2)$, respectively, where $|G|$ is the number of grid points, $|A|$ is the number of actions, and $N$ is the number of samples specified in the algorithms. In implementation, we found that most of the computation time is spent on executing Algorithm 3 over all belief-action pairs. However, the estimation of cost functions and transition probabilities can be pre-computed and stored, and hence only needs to be done once. The advantage of the algorithms is that the scalability is independent of the size of the actual state space, since $G$ is a grid mesh on the parameter space of the projected belief space. That is exactly what is desired by employing density projection. However, to get a better approximation, more parameters in the exponential family should be used, and that leads to a higher-dimensional parameter space to discretize. Increasing the number of parameters in the exponential family also makes sampling more difficult. Sampling from a general exponential family is usually not easy, and may require advanced techniques, such as adaptive rejection sampling (ARS) [16], and hence more computation time. This difficulty can be avoided by resampling from the discrete particles instead of the projected density, which is equivalent to using the plain particle filter and then doing projection outside the filter. However, this may lead to sample degeneracy. The trade-off between a better approximation and less computation time is complicated and requires more research. We plan to study how to appropriately choose the exponential family and how to improve the simulation efficiency in the future.

B. Simulation Results

Since most of the benchmark POMDP problems in the literature assume a discrete state space, it is difficult to compare against the state of the art. Here we consider an inventory control problem, formed by adding a partial observation equation to a fully observable inventory control problem.
The fully observable problem has an optimal threshold policy [27], which allows us to verify our method in the limiting case by setting the observation noise very close to 0. In our inventory control problem, the inventory level is reviewed at discrete times, and the observations are noisy because of, e.g., inventory spoilage, misplacement, or distributed storage. At each period, the inventory is either replenished by an order of a fixed amount or not replenished. Customer demands arrive randomly with a known distribution. The demand is filled if there is enough inventory remaining; otherwise, in the case of a shortage, excess demand is not satisfied and a penalty is issued on the lost sales amount. We assume that the demand and the observation noise are both continuous random variables; hence the state (i.e., the inventory level) and the observation are continuous random variables. Let $x_k$ denote the inventory level at period $k$, $u_k$ the i.i.d. random demand at period $k$, $a_k$ the replenish decision at period $k$ (i.e., $a_k = 0$ or $1$), $Q$ the fixed order amount, $y_k$ the observation of the inventory level $x_k$, $v_k$ the i.i.d. observation noise, $h$ the per-period per-unit inventory holding cost, and $s$ the per-period per-unit inventory shortage penalty cost. The system equations are as follows:
$$x_{k+1} = \max(x_k + a_k Q - u_k,\, 0), \quad k = 0, 1, \ldots,$$
$$y_k = x_k + v_k, \quad k = 0, 1, \ldots.$$
The cost incurred in period $k$ is
$$g_k(x_k, a_k, u_k) = h \max(x_k + a_k Q - u_k,\, 0) + s \max(u_k - x_k - a_k Q,\, 0).$$
We consider two objective functions, the average cost per period and the discounted total cost, given by
$$\limsup_{H \to \infty} \frac{1}{H}\, E\left[\sum_{k=0}^{H-1} g_k\right], \qquad \lim_{H \to \infty} E\left[\sum_{k=0}^{H-1} \gamma^k g_k\right],$$
where $\gamma \in (0, 1)$ is the discount factor.
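A minimal simulation of this model makes the cost structure concrete. The sketch below uses the parameter values from our experimental setup ($Q = 10$, $h = 1$, $s = 10$), reads the exponential demand as having mean 5 (an assumption on our part), and is an illustration rather than the simulator used for the reported results:

```python
import random

def simulate_cost(policy, Q=10.0, h=1.0, s=10.0, x0=5.0, sigma=1.0,
                  horizon=20000, seed=1):
    """Average per-period cost of the inventory system
    x_{k+1} = max(x_k + a_k*Q - u_k, 0), y_k = x_k + v_k, with per-period cost
    g_k = h*max(x_k + a_k*Q - u_k, 0) + s*max(u_k - x_k - a_k*Q, 0).
    policy maps the noisy observation y_k to an order decision a_k in {0, 1}."""
    rnd = random.Random(seed)
    x, total = x0, 0.0
    for _ in range(horizon):
        y = x + rnd.gauss(0.0, sigma)    # noisy observation of the inventory
        a = policy(y)
        u = rnd.expovariate(1.0 / 5.0)   # demand, assumed exponential with mean 5
        post = x + a * Q - u             # inventory after ordering and demand
        total += h * max(post, 0.0) + s * max(-post, 0.0)
        x = max(post, 0.0)
    return total / horizon

# A certainty-equivalence use of the threshold structure of the fully observed
# problem: order iff the observed level is below 7.7 (the threshold reported
# in our experiments). Compare it with never ordering under the same demands.
threshold_cost = simulate_cost(lambda y: 1 if y < 7.7 else 0)
never_order_cost = simulate_cost(lambda y: 0)
```

Because both runs share the same seed and random call pattern, they face identical demand streams, which mirrors the common-random-variables practice used in the experiments.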

In the simulation, we first choose an exponential family and specify a grid mesh on its parameter space, then implement (11) and Algorithm 3 on the grid mesh, and use value iteration to solve for a policy. These steps are done offline. In an online run, Algorithm 2 (PPF) is used for filtering and making decisions with the policy obtained offline. We also consider a small variation of this method: instead of using PPF, we use Algorithm 1 (PF) and do density projection outside the filter each time. We compare our two methods (called Ours 1 and Ours 2, respectively) to four other algorithms: (1) certainty equivalence using the mean estimate (CE); (2) certainty equivalence using the maximum likelihood estimate (CE-MLE); (3) the EKF-based parametric POMDP (EKF-PPOMDP) in [8]; (4) the greedy policy. CE treats the state estimate as the true state in the solution to the full observation problem. We use the bootstrap filter to obtain the mean estimate and the MLE of the states for CE. EKF-PPOMDP approximates the belief state by a Gaussian distribution, and uses the extended Kalman filter to estimate the transition of the belief state. Similar to our method, it also solves a discretized MDP defined on the parameter space of the Gaussian density. The greedy policy chooses an action $a_k$ that attains the minimum in the expression $\min_{a_k \in A} E_{x_k, u_k}[g_k(x_k, a_k, u_k) \mid I_k]$.

Numerical experiments are carried out in the following settings.

Problem parameters: initial inventory level $x_0 = 5$, holding cost $h = 1$, shortage penalty cost $s = 10$, fixed order amount $Q = 10$, random demand $u_k \sim \exp(5)$, discount factor $\gamma = 0.9$, inventory observation noise $v_k \sim N(0, \sigma^2)$ with $\sigma$ ranging from 0.1 to 3.3 in steps of 0.2.
Algorithm parameters: the number of particles in both the usual particle filter and the projection particle filter is $N = 200$; the exponential family in the projection particle filter is chosen as the Gaussian family; the set of grid points on the projected belief space is $G = \{\text{mean} = [0 : 0.5 : 5],\ \text{standard deviation} = [0 : 0.2 : 5]\}$ for both our methods and EKF-PPOMDP; one run of horizon length $H = 10^5$ for each average cost criterion case, and 1000 independent runs of horizon length $H = 40$ for each discounted total cost criterion case; nearest neighbor is used as the value function approximator in both our methods and EKF-PPOMDP.

Simulation issues: we use common random variables among different policies and different $\sigma$'s. In order to implement CE, we use Monte Carlo simulation to find the optimal threshold policy for the fully observed problem (i.e., $y_k = x_k$): if the inventory level is below the threshold $L$, the store/warehouse should order to replenish its inventory; otherwise, if the inventory level is above $L$, it should not order. That is,
$$a_k = \begin{cases} 0, & \text{if } x_k > L; \\ 1, & \text{if } x_k < L, \end{cases} \qquad (22)$$
The simulation results indicate that both the average and discounted cost functions are convex in the threshold, and that the minimum is achieved at $L = 7.7$ for both.

Table II and Table III list the simulated average costs and discounted total costs using different policies under different observation noises, respectively. Our methods generally outperform all the other algorithms under all observation noise levels. CE also performs very well, and slightly outperforms CE-MLE. EKF-PPOMDP performs better in the average cost case than in the discounted cost case. The greedy policy is much worse than all the other algorithms. While our methods and the EKF-PPOMDP involve offline computation, the more critical online computation time of all the simulated methods is approximately the same. For all the algorithms, the average cost/discounted cost increases as the observation noise increases.
That is consistent with the intuition that we cannot perform better with less information. Fig. 1 shows 1000 actions taken by our method versus the true inventory levels in the average cost case (the discounted total cost case is similar and is omitted here). The dotted vertical line is the optimal threshold under full observation, $L$. Our algorithm yields a policy that picks actions very close to those of the optimal threshold policy when the observation noise is small (cf. Fig. 1(a)), indicating that our algorithm is indeed finding the optimal policy. As the observation noise increases, more actions picked by our policy violate the optimal threshold, which again shows the value of information in determining the actions. The performances of our two methods are very close, with one slightly better than the other. Solely for the purpose of filtering, doing projection outside the filter is easier to implement if we want to use a general exponential family, and it also gives a better estimate of the belief state, since the projection error does not accumulate. However, for solving POMDPs, we conjecture that PPF works better in conjunction with the policy solved from the projected belief MDP, since the projected belief MDP assumes that the transition of the belief state is also projected; we do not know which variant is better in general. Our method outperforms the EKF-PPOMDP, mainly because the projection particle filter used in our method is better than the extended Kalman filter used in the EKF-PPOMDP for estimating the belief transition probabilities. This agrees with the results in [9], which also observed that Monte Carlo simulation of the belief transitions is better than the EKF estimate. Although the performance of CE is comparable to that of our methods for this particular example, CE is generally a suboptimal policy except in some special cases (cf. Section 6.1 in [6]), and it does not have a theoretical error bound.
Moreover, using CE requires solving the full observation problem, which is itself very difficult in many cases. In contrast, our method has a proven error bound on its performance, and works with the POMDP directly without having to solve the MDP problem under full observation.

VII. CONCLUSION

In this paper, we developed a method that effectively reduces the dimension of the belief space via the orthogonal projection of the belief states onto a parameterized family of

probability densities. For an exponential family, the density projection has an analytical form and can be carried out efficiently. The exponential family is fully represented by a finite (small) number of parameters; hence the belief space is mapped to a low-dimensional parameter space, and the resultant belief MDP is called the projected belief MDP. The projected belief MDP can then be solved in numerous ways, such as standard value iteration or policy iteration, to generate a policy. This policy is used in conjunction with the projection particle filter for online decision making. We analyzed the performance of the policy generated by solving the projected belief MDP in terms of the difference between the value function associated with this policy and the optimal value function of the POMDP. We also provided a bound on the error between our projection particle filter and exact filtering. We applied our method to an inventory control problem, and it generally outperformed the other methods. When the observation noise is small, our algorithm yields a policy that picks actions very close to those of the optimal threshold policy for the fully observed problem.

Fig. 1. Our method (Ours 1): actions taken for different inventory levels under different observation noise variances: (a) $\sigma = 0.1$, (b) $\sigma = 1.1$, (c) $\sigma = 3.1$. Each panel plots the action $a$ (replenish or not replenish) against the true inventory level $x$, with the optimal threshold under full observation shown as a dotted vertical line.

TABLE II: OPTIMAL AVERAGE COST ESTIMATES FOR THE INVENTORY CONTROL PROBLEM USING DIFFERENT METHODS. EACH ENTRY REPRESENTS THE AVERAGE COST OF A RUN OF HORIZON $10^5$. (Columns: $\sigma$, Ours 1, Ours 2, CE, CE-MLE, EKF-PPOMDP, Greedy Policy.)
Although we only proved theoretical results for discounted cost problems, the simulation results indicate that our method also works well on average cost problems. We should point out that our method is also applicable to finite-horizon problems, and is suitable for large-state POMDPs in addition to continuous-state POMDPs.

APPENDIX

Proof of Theorem 1: Denote $J_k(b) \triangleq T^k J_0(b)$ and $\bar J_k(\bar b) \triangleq \bar T^k \bar J_0(\bar b)$, $k = 0, 1, \ldots$, and define
$$Q_k(b, a) = \langle g(\cdot,a), b \rangle + \gamma E_Y\{J_{k-1}(\psi(b, a, Y))\}, \quad \mu_k(b) = \arg\min_{a \in A} Q_k(b, a),$$
$$\bar Q_k(\bar b, a) = \langle g(\cdot,a), \bar b \rangle + \gamma E_Y\{\bar J_{k-1}(\bar\psi(b, a, Y))\}, \quad \bar\mu_k(\bar b) = \arg\min_{a \in A} \bar Q_k(\bar b, a),$$
where $\bar\psi(b, a, y)$ denotes the projection of $\psi(b, a, y)$. Hence,
$$J_k(b) = \min_{a \in A} Q_k(b, a) = Q_k(b, \mu_k(b)), \quad \bar J_k(\bar b) = \min_{a \in A} \bar Q_k(\bar b, a) = \bar Q_k(\bar b, \bar\mu_k(\bar b)).$$
Denote $\mathrm{err}_k \triangleq \max_{b \in B} |J_k(b) - \bar J_k(\bar b)|$, $k = 1, 2, \ldots$. We consider the first iteration. Initialize with $J_0(b) = \bar J_0(\bar b) = 0$. By Assumption 2, $\forall a \in A$,
$$|Q_1(b, a) - \bar Q_1(\bar b, a)| = |\langle g(\cdot,a), b - \bar b \rangle| \le \varepsilon_1, \quad \forall b \in B. \qquad (23)$$

TABLE III: OPTIMAL DISCOUNTED COST ESTIMATES FOR THE INVENTORY CONTROL PROBLEM USING DIFFERENT METHODS. EACH ENTRY REPRESENTS THE DISCOUNTED COST (MEAN, WITH STANDARD ERROR IN PARENTHESES) BASED ON 1000 INDEPENDENT RUNS OF HORIZON 40. (Columns: $\sigma$, Ours 1, Ours 2, CE, CE-MLE, EKF-PPOMDP, Greedy Policy.)

Hence, with $a = \bar\mu_1(\bar b)$, inequality (23) yields $Q_1(b, \bar\mu_1(\bar b)) \le \bar J_1(\bar b) + \varepsilon_1$. Using $J_1(b) \le Q_1(b, \bar\mu_1(\bar b))$, we get
$$J_1(b) \le \bar J_1(\bar b) + \varepsilon_1, \quad \forall b \in B. \qquad (24)$$
With $a = \mu_1(b)$, (23) yields $\bar Q_1(\bar b, \mu_1(b)) - \varepsilon_1 \le J_1(b)$. Using $\bar J_1(\bar b) \le \bar Q_1(\bar b, \mu_1(b))$, we get
$$\bar J_1(\bar b) - \varepsilon_1 \le J_1(b), \quad \forall b \in B. \qquad (25)$$
From (24) and (25), we conclude $|J_1(b) - \bar J_1(\bar b)| \le \varepsilon_1$, $\forall b \in B$. Taking the maximum over $b$ on both sides of the above inequality yields
$$\mathrm{err}_1 \le \varepsilon_1. \qquad (26)$$
Now we consider the $(k+1)$th iteration. For a fixed $Y = y$, by Assumption 2, $|\langle g(\cdot,a), \psi(b, a, y) - \bar\psi(b, a, y) \rangle| \le \delta_1$. Let $\delta_1$ be the $\delta$ in Assumption 3 and denote the corresponding $\varepsilon$ by $\varepsilon_2$. Then
$$|J_k(\psi(b, a, y)) - J_k(\bar\psi(b, a, y))| \le \varepsilon_2, \quad \forall b \in B. \qquad (27)$$
Therefore, $\forall a \in A$,
$$|Q_{k+1}(b, a) - \bar Q_{k+1}(\bar b, a)| \le |\langle g(\cdot,a), b - \bar b \rangle| + \gamma \left|E_Y\{J_k(\psi(b, a, Y)) - \bar J_k(\bar\psi(b, a, Y))\}\right|$$
$$\le \varepsilon_1 + \gamma E_Y\{|J_k(\psi(b, a, Y)) - J_k(\bar\psi(b, a, Y))| + |J_k(\bar\psi(b, a, Y)) - \bar J_k(\bar\psi(b, a, Y))|\} \le \varepsilon_1 + \gamma(\varepsilon_2 + \mathrm{err}_k), \quad \forall b \in B.$$
The third inequality follows from (27) and the definition of $\mathrm{err}_k$. Using an argument similar to that used to prove (26) from (23), we conclude that
$$\mathrm{err}_{k+1} \le \varepsilon_1 + \gamma(\varepsilon_2 + \mathrm{err}_k). \qquad (28)$$
Using induction on (28) with initial condition (26) and taking $k \to \infty$, we obtain
$$|J^*(b) - \bar J^*(\bar b)| \le \sum_{k=0}^{\infty} \gamma^k \varepsilon_1 + \sum_{k=1}^{\infty} \gamma^k \varepsilon_2 = \frac{\varepsilon_1 + \gamma \varepsilon_2}{1-\gamma}. \qquad (29)$$
Therefore, (15) is proved.

Fixing a policy $\mu$ on the original belief MDP, define the mappings under policy $\mu$ on the belief MDP and the projected belief MDP as
$$T_\mu J(b) = \langle g(\cdot,\mu(b)), b \rangle + \gamma E_Y\{J(\psi(b, \mu(b), Y))\}, \qquad (30)$$
$$\bar T_\mu J(\bar b) = \langle g(\cdot,\mu(b)), \bar b \rangle + \gamma E_Y\{J(\bar\psi(b, \mu(b), Y))\}, \qquad (31)$$
respectively. Since $\bar\mu^*$ is a stationary policy for the projected belief MDP, $\bar\mu = \bar\mu^* \circ \mathrm{Proj}_\Omega$ is stationary for the original belief MDP. Hence,
$$\bar J^*(\bar b) = \bar T_{\bar\mu} \bar J^*(\bar b), \quad J_{\bar\mu}(b) = T_{\bar\mu} J_{\bar\mu}(b).$$
Subtracting the two equations, and substituting the definitions of $\bar T$ and $T$ (i.e., (31) and (30)) into the right-hand sides, respectively, we get
$$\bar J^*(\bar b) - J_{\bar\mu}(b) = \langle g(\cdot,\bar\mu(\bar b)), \bar b - b \rangle + \gamma E_Y\left\{\bar J^*(\bar\psi(b, \bar\mu(\bar b), Y)) - J_{\bar\mu}(\psi(b, \bar\mu(\bar b), Y))\right\}. \qquad (32)$$
For a fixed $Y = y$,
$$|\bar J^*(\bar\psi(b, \bar\mu(\bar b), y)) - J_{\bar\mu}(\psi(b, \bar\mu(\bar b), y))| \le |\bar J^*(\bar{\tilde b}) - J_{\bar\mu}(\tilde b)| + |J_{\bar\mu}(\bar\psi(b, \bar\mu(\bar b), y)) - J_{\bar\mu}(\psi(b, \bar\mu(\bar b), y))|,$$
where $\tilde b = \psi(b, \bar\mu(\bar b), y) \in B$. By Assumption 2, $|\langle g(\cdot,a), \psi(b, \bar\mu(\bar b), y) - \bar\psi(b, \bar\mu(\bar b), y) \rangle| \le \delta_1$; letting $\delta = \delta_1$ in Assumption 3 and denoting the corresponding $\varepsilon'$ by $\varepsilon_3$, we bound the second term as
$$|J_{\bar\mu}(\bar\psi(b, \bar\mu(\bar b), y)) - J_{\bar\mu}(\psi(b, \bar\mu(\bar b), y))| \le \varepsilon_3.$$
Denoting $\overline{\mathrm{err}} \triangleq \max_{b \in B} |\bar J^*(\bar b) - J_{\bar\mu}(b)|$, we obtain
$$|\bar J^*(\bar\psi(b, \bar\mu(\bar b), y)) - J_{\bar\mu}(\psi(b, \bar\mu(\bar b), y))| \le \overline{\mathrm{err}} + \varepsilon_3.$$

Therefore, (32) becomes
$$|\bar J^*(\bar b) - J_{\bar\mu}(b)| \le \varepsilon_1 + \gamma(\overline{\mathrm{err}} + \varepsilon_3).$$
Taking the maximum over $b$ on both sides of the above inequality yields $\overline{\mathrm{err}} \le \varepsilon_1 + \gamma(\overline{\mathrm{err}} + \varepsilon_3)$. Hence,
$$\overline{\mathrm{err}} \le \frac{\varepsilon_1 + \gamma \varepsilon_3}{1-\gamma}. \qquad (33)$$
With (29) and (33), we obtain
$$|J^*(b) - J_{\bar\mu}(b)| \le |J^*(b) - \bar J^*(\bar b)| + |\bar J^*(\bar b) - J_{\bar\mu}(b)| \le \frac{2\varepsilon_1 + \gamma(\varepsilon_2 + \varepsilon_3)}{1-\gamma}, \quad \forall b \in B.$$
Therefore, (16) is proved.

Proof of Lemma 1: $E[|\langle b_{k-1} - f(\cdot,\hat\theta_{k-1}), \phi \rangle|]$ is the error from time $k-1$, which is also the initial error for time $k$. Hence, the prediction step yields
$$E[|\langle b_{k|k-1} - b'_{k|k-1}, \phi \rangle|] = E[|\langle K_k(b_{k-1} - f(\cdot,\hat\theta_{k-1})), \phi \rangle|] = E[|\langle b_{k-1} - f(\cdot,\hat\theta_{k-1}), K_k \phi \rangle|] \le e_{k-1}\|K_k \phi\| \le e_{k-1}\|\phi\|. \qquad (34)$$
Under Assumptions 4 and 5, we have shown (18). Using (18) and (34), the Bayes updating step yields
$$E[|\langle b_k - b'_k, \phi \rangle|] = E\left[\left|\frac{\langle b_{k|k-1}, \Psi_k \phi \rangle}{\langle b_{k|k-1}, \Psi_k \rangle} - \frac{\langle b'_{k|k-1}, \Psi_k \phi \rangle}{\langle b'_{k|k-1}, \Psi_k \rangle}\right|\right]$$
$$\le E\left[\frac{|\langle b_{k|k-1}, \Psi_k \phi \rangle|}{\langle b_{k|k-1}, \Psi_k \rangle \langle b'_{k|k-1}, \Psi_k \rangle}\, |\langle b_{k|k-1} - b'_{k|k-1}, \Psi_k \rangle|\right] + E\left[\frac{|\langle b_{k|k-1} - b'_{k|k-1}, \Psi_k \phi \rangle|}{\langle b'_{k|k-1}, \Psi_k \rangle}\right]$$
$$\le \delta \|\phi\|\, E[|\langle b_{k|k-1} - b'_{k|k-1}, \Psi_k \rangle|] + \delta\, E[|\langle b_{k|k-1} - b'_{k|k-1}, \Psi_k \phi \rangle|] \le \delta e_{k-1} \|\phi\| \|\Psi_k\| + \delta e_{k-1} \|\Psi_k \phi\| \le 2\delta \|\Psi_k\|\, e_{k-1} \|\phi\| = \tau_k e_{k-1} \|\phi\|,$$
where $\tau_k = 2\delta \|\Psi_k\|$.

Proof of Lemma 2: This lemma uses essentially the same proof technique as Lemmas 3 and 4 in [13]; however, it is not quite obvious how those lemmas imply our lemma here, so we state the proof to make this paper more accessible. After the resampling step, $\hat f(\cdot,\hat\theta_{k-1}) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x^i_{k-1})$, where $x^i_{k-1}$, $i = 1, \ldots, N$, are i.i.d. samples from $f(\cdot,\hat\theta_{k-1})$. Using the Cauchy-Schwarz inequality, we have
$$\left(E\left[\langle \hat f(\cdot,\hat\theta_{k-1}) - f(\cdot,\hat\theta_{k-1}), \phi \rangle^2\right]\right)^{1/2} = \left(E\left[\left(\frac{1}{N}\sum_{i=1}^{N}\left(\phi(x^i_{k-1}) - \langle f(\cdot,\hat\theta_{k-1}), \phi \rangle\right)\right)^2\right]\right)^{1/2}$$
$$= \frac{1}{\sqrt N}\left(\langle f(\cdot,\hat\theta_{k-1}), \phi^2 \rangle - \langle f(\cdot,\hat\theta_{k-1}), \phi \rangle^2\right)^{1/2} \le \frac{1}{\sqrt N}\langle f(\cdot,\hat\theta_{k-1}), \phi^2 \rangle^{1/2} \le \frac{\|\phi\|}{\sqrt N}. \qquad (35)$$
The Bayes updating step yields
$$E[|\langle \hat b_k - b'_k, \phi \rangle|] = E\left[\left|\frac{\langle \hat b_{k|k-1}, \Psi_k \phi \rangle}{\langle \hat b_{k|k-1}, \Psi_k \rangle} - \frac{\langle b'_{k|k-1}, \Psi_k \phi \rangle}{\langle b'_{k|k-1}, \Psi_k \rangle}\right|\right]$$
$$\le E\left[\frac{|\langle \hat b_{k|k-1}, \Psi_k \phi \rangle|}{\langle \hat b_{k|k-1}, \Psi_k \rangle \langle b'_{k|k-1}, \Psi_k \rangle}\, |\langle \hat b_{k|k-1} - b'_{k|k-1}, \Psi_k \rangle|\right] + E\left[\frac{|\langle \hat b_{k|k-1} - b'_{k|k-1}, \Psi_k \phi \rangle|}{\langle b'_{k|k-1}, \Psi_k \rangle}\right].$$
Under Assumptions 4 and 5, we have shown (18).
Using the Cauchy-Schwarz inequality, (18), and (35), the first term can be simplified as
$$E\left[\frac{|\langle \hat b_{k|k-1}, \Psi_k \phi \rangle|}{\langle \hat b_{k|k-1}, \Psi_k \rangle \langle b'_{k|k-1}, \Psi_k \rangle}\, |\langle \hat b_{k|k-1} - b'_{k|k-1}, \Psi_k \rangle|\right] \le \delta \left(E\left[\frac{\langle \hat b_{k|k-1}, \Psi_k \phi \rangle^2}{\langle \hat b_{k|k-1}, \Psi_k \rangle^2}\right]\right)^{1/2} \left(E\left[\langle \hat b_{k|k-1} - b'_{k|k-1}, \Psi_k \rangle^2\right]\right)^{1/2}$$
$$= \delta \left(E\left[\frac{\langle \hat b_{k|k-1}, \Psi_k \phi \rangle^2}{\langle \hat b_{k|k-1}, \Psi_k \rangle^2}\right]\right)^{1/2} \left(E\left[\langle \hat f(\cdot,\hat\theta_{k-1}) - f(\cdot,\hat\theta_{k-1}), K_k \Psi_k \rangle^2\right]\right)^{1/2} \le \frac{\delta \|\phi\| \|\Psi_k\|}{\sqrt N},$$

and the second term can be simplified as
$$E\left[\frac{|\langle \hat b_{k|k-1} - b'_{k|k-1}, \Psi_k \phi \rangle|}{\langle b'_{k|k-1}, \Psi_k \rangle}\right] \le \delta \left(E\left[\langle \hat b_{k|k-1} - b'_{k|k-1}, \Psi_k \phi \rangle^2\right]\right)^{1/2} = \delta \left(E\left[\langle \hat f(\cdot,\hat\theta_{k-1}) - f(\cdot,\hat\theta_{k-1}), K_k(\Psi_k \phi) \rangle^2\right]\right)^{1/2} \le \frac{\delta \|\Psi_k \phi\|}{\sqrt N} \le \frac{\delta \|\Psi_k\| \|\phi\|}{\sqrt N}.$$
Therefore, adding these two terms yields
$$E[|\langle \hat b_k - b'_k, \phi \rangle|] \le \frac{2\delta \|\Psi_k\|}{\sqrt N} \|\phi\| = \frac{\tau_k}{\sqrt N} \|\phi\|,$$
where $\tau_k = 2\delta \|\Psi_k\|$, the same constant as in Lemma 1.

Proof of Lemma 3: The key idea of the proof of Lemma 4 in [2] is used here. From (9), we know that $E_{\hat\theta_k}[c_j(X)] = E_{\hat b_k}[c_j(X)]$ and $E_{\theta'_k}[c_j(X)] = E_{b'_k}[c_j(X)]$. Hence, we obtain
$$E\left[\left|E_{\hat\theta_k}[c_j(X)] - E_{\theta'_k}[c_j(X)]\right|\right] = E[|\langle \hat b_k - b'_k, c_j \rangle|], \quad j = 1, \ldots, m.$$
Summing over $j$, we obtain
$$\sum_{j=1}^{m} E\left[\left|E_{\hat\theta_k}[c_j(X)] - E_{\theta'_k}[c_j(X)]\right|\right] = \sum_{j=1}^{m} E[|\langle \hat b_k - b'_k, c_j \rangle|].$$
Since $c_j \in B(\mathbb{R}^{n_x})$, we apply Lemma 2 with $\phi = c_j$ and thus obtain
$$E[|\langle \hat b_k - b'_k, c_j \rangle|] \le \frac{\tau_k \|c_j\|}{\sqrt N}, \quad j = 1, \ldots, m.$$
Therefore,
$$E\left[\left\|E_{\hat\theta_k}[c(X)] - E_{\theta'_k}[c(X)]\right\|_1\right] \le \frac{\bar\tau_k}{\sqrt N},$$
where $\|\cdot\|_1$ denotes the $L_1$ norm on $\mathbb{R}^m$, $c = [c_1, \ldots, c_m]^T$, and $\bar\tau_k = \tau_k \sum_{j=1}^{m} \|c_j\|$. Since $\bar\Theta$ is compact and the Fisher information matrix $\left[E_\theta[c_i(X)c_j(X)] - E_\theta[c_i(X)] E_\theta[c_j(X)]\right]_{ij}$ is positive definite, we get (cf. Fact 2 in [2] for a detailed proof)
$$\|\hat\theta_k - \theta'_k\|_1 \le \alpha \left\|E_{\hat\theta_k}[c(X)] - E_{\theta'_k}[c(X)]\right\|_1.$$
Taking expectation on both sides yields
$$E[\|\hat\theta_k - \theta'_k\|_1] \le \alpha E\left[\left\|E_{\hat\theta_k}[c(X)] - E_{\theta'_k}[c(X)]\right\|_1\right] \le \frac{\alpha \bar\tau_k}{\sqrt N}.$$
On the other hand, taking the derivative of $E_\theta[\phi(X)]$ with respect to $\theta_i$ yields
$$\left|\frac{d E_\theta[\phi(X)]}{d\theta_i}\right| = \left|E_\theta[c_i(X)\phi(X)] - E_\theta[c_i(X)] E_\theta[\phi(X)]\right| \le \sqrt{\mathrm{Var}_\theta(c_i)\,\mathrm{Var}_\theta(\phi)} \le \sqrt{E_\theta[c_i^2]\, E_\theta[\phi^2]} \le \|c_i\| \|\phi\|.$$
Hence, $\left\|\frac{d}{d\theta} E_\theta[\phi(X)]\right\|_1 \le \left(\sum_{i=1}^{m} \|c_i\|\right)\|\phi\|$. Since $\bar\Theta$ is compact, there exists a constant $\beta > 0$ such that $E_\theta[\phi(X)]$ is Lipschitz over $\theta \in \bar\Theta$ with Lipschitz constant $\beta\|\phi\|$ (cf. the proof of Fact 2 in [2]), i.e.,
$$\left|E_{\hat\theta_k}[\phi] - E_{\theta'_k}[\phi]\right| \le \beta \|\phi\|\, \|\hat\theta_k - \theta'_k\|_1.$$
Taking expectation on both sides yields
$$E[|\langle f(\cdot,\hat\theta_k) - f(\cdot,\theta'_k), \phi \rangle|] \le \beta \|\phi\|\, E[\|\hat\theta_k - \theta'_k\|_1] \le \frac{\alpha \beta \bar\tau_k}{\sqrt N} \|\phi\| = \frac{d \tau_k}{\sqrt N} \|\phi\|,$$
where $d = \alpha \beta \sum_{j=1}^{m} \|c_j\|$.

Proof of Theorem 2: Applying Lemma 1, Assumption 6, and Lemma 3, we have that for each $k$,
$$E[|\langle b_k - f(\cdot,\hat\theta_k), \phi \rangle|] \le E[|\langle b_k - b'_k, \phi \rangle|] + E[|\langle b'_k - f(\cdot,\theta'_k), \phi \rangle|]$$
$$+\, E[|\langle f(\cdot,\theta'_k) - f(\cdot,\hat\theta_k), \phi \rangle|] \le \left(\tau_k e_{k-1} + \varepsilon + \frac{d \tau_k}{\sqrt N}\right)\|\phi\| = e_k \|\phi\|, \quad \forall \phi \in B(\mathbb{R}^{n_x}),$$
where $\tau_k$ is the constant in Lemmas 1 and 3, $d$ is the constant in Lemma 3, and $\varepsilon$ is the constant in Assumption 6. It is easy to deduce by induction that
$$e_k = \tau_1^k e_0 + \left(1 + \sum_{i=2}^{k} \tau_i^k\right)\varepsilon + \frac{d}{\sqrt N} \sum_{i=1}^{k} \tau_i^k,$$
where $\tau_i^k = \prod_{j=i}^{k} \tau_j$ for $k \ge i$, and $\tau_i^k = 0$ for $k < i$.

ACKNOWLEDGEMENT

We thank the associate editor and the anonymous reviewers for their careful reading of the paper and very constructive comments that led to a substantially improved paper. We thank one of the anonymous reviewers for bringing the work of Brooks and Williams [9] to our attention.

REFERENCES

[1] S. Arulampalam, S. Maskell, N. J. Gordon, and T. Clapp, "A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, 2002.
[2] B. Azimi-Sadjadi and P. S. Krishnaprasad, "Approximate Nonlinear Filtering and its Application in Navigation," Automatica, vol. 41, no. 6, 2005.
[3] O. E. Barndorff-Nielsen, Information and Exponential Families in Statistical Theory, Wiley, New York, 1978.
[4] A. R. Barron and C. Sheu, "Approximation of Density Functions by Sequences of Exponential Families," The Annals of Statistics, vol. 19, no. 3, 1991.
[5] D. P. Bertsekas, "Convergence of Discretization Procedures in Dynamic Programming," IEEE Transactions on Automatic Control, vol. 20, no. 3, 1975.
[6] D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, MA, 1995.
[7] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Optimization and Neural Computation Series, Athena Scientific, Belmont, MA, 1st edition, 1996.

[8] A. Brooks, A. Makarenko, S. Williams, and H. Durrant-Whyte, "Parametric POMDPs for Planning in Continuous State Spaces," Robotics and Autonomous Systems, vol. 54, no. 11, 2006.
[9] A. Brooks and S. Williams, "A Monte Carlo Update for Parametric POMDPs," International Symposium of Robotics Research, 2007.
[10] E. Brunskill, L. Kaelbling, T. Lozano-Perez, and N. Roy, "Continuous-State POMDPs with Hybrid Dynamics," in Proceedings of the Tenth International Symposium on Artificial Intelligence and Mathematics, 2008.
[11] A. R. Cassandra, Exact and Approximate Algorithms for Partially Observable Markov Decision Processes, Ph.D. thesis, Brown University, 1998.
[12] H. S. Chang, M. C. Fu, J. Hu, and S. I. Marcus, Simulation-based Algorithms for Markov Decision Processes, Communications and Control Engineering Series, Springer, New York, 2007.
[13] D. Crisan and A. Doucet, "A Survey of Convergence Results on Particle Filtering Methods for Practitioners," IEEE Transactions on Signal Processing, vol. 50, no. 3, 2002.
[14] A. Doucet, J. F. G. de Freitas, and N. J. Gordon, editors, Sequential Monte Carlo Methods in Practice, Springer, New York, 2001.
[15] D. de Farias and B. Van Roy, "The Linear Programming Approach to Approximate Dynamic Programming," Operations Research, vol. 51, no. 6, 2003.
[16] W. R. Gilks, "Derivative-Free Adaptive Rejection Sampling for Gibbs Sampling," in Bayesian Statistics 4, J. Bernardo, J. Berger, A. Dawid, and A. Smith, editors, Oxford University Press, Oxford, 1992.
[17] M. Hauskrecht, "Value-Function Approximations for Partially Observable Markov Decision Processes," Journal of Artificial Intelligence Research, vol. 13, 2000.
[18] N. J. Gordon, D. J. Salmond, and A. F. M. Smith, "Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation," in IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107-113, 1993.
[19] O. Hernandez-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria, Springer, New York, 1996.
[20] F. Le Gland and N.
Oudjane, Stability and Uniform Aroximation of onlinear Filters Using the Hilbert Metric and Alication to Particle Filter, The Annals of Alied Probability, vol. 4, no., , [2] E. L. Lemann, and G. Casella, Theory of Point Estimation, Sringer, ew York, 2nd edition, 998. [22] M. L. Littman, The Witness Algorithm: Solving Partially Observable Markov Decision Processes, TR CS-94-40, Deartment of Comuter Science, Brown University, Providence, RI, 994. [23] M. K. Murray, and J. W. Rice, Differential Geometry and Statistics, Chaman & Hall, ew York, 993. [24] J. M. Porta, M. T. J. Saan, and. Vlassis, Robot Planning in Partially Observable Continuous Domains, in Proc. Robotics: Science and Systems, [25] J. M. Porta,. Vlassis, M. T. J. Saan, and P. Pouart, Point-Based Value Iteration for Continuous POMDPs, Journal of Machine Learning Research, vol. 7, , [26] P. Pouart, and C. Boutilier, Value-Directed Comression of POMDPs, Advances in eural Information Processing Systems, vol. 5, , [27] M. Resh, and P. aor, An Inventory Problem with Discrete Time Review and Relenishment by Batches of Fixed Size, Management Science, vol. 0, no.,. 09-8, 963. [28]. Roy, Finding Aroximate POMDP Solutions through Belief Comression, Ph.D. thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, [29]. Roy, and G. Gordon, Exonential Family PCA for Belief Comression in POMDPs, Advances in eural Information Processing Systems, vol. 5, , [30] R. D. Smallwood, and E. J. Sondik, The Otimal Control of Partially Observable Markov Processes over a Finite Horizon, Oerations Research, vol. 2, no. 5, , 973. [3] S. Thrun, Monte Carlo POMDPs, Advances in eural Information Processing Systems, vol. 2, , [32] H. J. Yu, Aroximate Solution Methods for Partially Observable Markov and Semi-Markov Decision Processes, Ph.D. thesis, M.I.T., Cambridge, MA, Enlu Zhou (S 07-M 0) received the B.S. degree with highest honors in electrical engineering from Zhejiang University, China, in 2004, and received the Ph.D. 
degree in electrical engineering from the University of Maryland, College Park. She is currently an Assistant Professor in the Department of Industrial and Enterprise Systems Engineering at the University of Illinois at Urbana-Champaign. Her research interests include stochastic control, nonlinear filtering, and simulation optimization.

Michael C. Fu (S'89-M'89-SM'06-F'08) received a bachelor's degree in mathematics and bachelor's and master's degrees in electrical engineering and computer science from the Massachusetts Institute of Technology, Cambridge, MA, in 1985, and a master's and Ph.D. degree in applied mathematics from Harvard University, Cambridge, MA, in 1986 and 1989, respectively. Since 1989, he has been with the University of Maryland, College Park, currently as Ralph J. Tyser Professor of Management Science in the Robert H. Smith School of Business, with a joint appointment in the Institute for Systems Research and an affiliate faculty appointment in the Department of Electrical and Computer Engineering. His research interests include simulation optimization and applied probability, with applications in manufacturing, supply chain management, and financial engineering. He served as the Simulation Area Editor of Operations Research and as the Stochastic Models and Simulation Department Editor of Management Science. He is co-author (with J. Q. Hu) of the book Conditional Monte Carlo: Gradient Estimation and Optimization Applications (Kluwer, 1997), which received the INFORMS Simulation Society's Outstanding Publication Award in 1998, and co-author (with H. S. Chang, J. Hu, and S. I. Marcus) of the book Simulation-Based Algorithms for Markov Decision Processes (Springer, 2007).

Steven I. Marcus (S'70-M'75-SM'83-F'86) received the B.A. degree in electrical engineering and mathematics from Rice University in 1971 and the S.M. and Ph.D. degrees in electrical engineering from the Massachusetts Institute of Technology in 1972 and 1975, respectively. From 1975 to 1991, he was with the Department of Electrical and Computer Engineering at the University of Texas at Austin, where he was the L.B. (Preach) Meaders Professor in Engineering and also served as Associate Chairman of the Department. In 1991, he joined the University of Maryland, College Park, as Professor in the Electrical and Computer Engineering Department and the Institute for Systems Research. He was Director of the Institute for Systems Research from 1991 to 1996 and became Chair of the Electrical and Computer Engineering Department in 2000. Currently, his research is focused on stochastic control, estimation, and optimization.
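The induction in the Appendix simply unrolls the one-step bound $e_k = \tau_k e_{k-1} + \varepsilon + d\,\tau_k$. As a quick sanity check on the closed-form expression, the following Python sketch iterates the recursion and compares it against the closed form; the values chosen for $\tau_k$, $\varepsilon$, $d$, and $e_0$ are arbitrary illustrative numbers, not the constants from Lemmas 1 and 3 or Assumption 6, and the function names are ours.

```python
# Sanity check: iterating  e_k = tau_k * e_{k-1} + eps + d * tau_k
# must agree with the closed form obtained by induction,
#   e_k = tau_1^k e_0 + (sum_{i=2}^k tau_i^k + 1) eps + d sum_{i=1}^k tau_i^k,
# where tau_i^k = prod_{j=i}^k tau_j for k >= i and tau_i^k = 0 for k < i.

def iterate_bound(tau, e0, eps, d):
    """Unroll the one-step recursion over the horizon k = len(tau)."""
    e = e0
    for t in tau:
        e = t * e + eps + d * t
    return e

def closed_form(tau, e0, eps, d):
    """Evaluate the closed-form expression from the induction."""
    k = len(tau)

    def tau_ik(i):  # tau_i^k = prod_{j=i}^k tau_j, with 1-indexed i
        p = 1.0
        for j in range(i - 1, k):
            p *= tau[j]
        return p

    return (tau_ik(1) * e0
            + (sum(tau_ik(i) for i in range(2, k + 1)) + 1.0) * eps
            + d * sum(tau_ik(i) for i in range(1, k + 1)))

# Arbitrary illustrative constants.
tau = [1.3, 0.9, 1.7, 1.1, 0.8]
a = iterate_bound(tau, e0=0.3, eps=0.01, d=0.5)
b = closed_form(tau, e0=0.3, eps=0.01, d=0.5)
assert abs(a - b) < 1e-12  # the two agree to floating-point precision
```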


More information

Online Appendix to Accompany AComparisonof Traditional and Open-Access Appointment Scheduling Policies

Online Appendix to Accompany AComparisonof Traditional and Open-Access Appointment Scheduling Policies Online Aendix to Accomany AComarisonof Traditional and Oen-Access Aointment Scheduling Policies Lawrence W. Robinson Johnson Graduate School of Management Cornell University Ithaca, NY 14853-6201 lwr2@cornell.edu

More information

Solution sheet ξi ξ < ξ i+1 0 otherwise ξ ξ i N i,p 1 (ξ) + where 0 0

Solution sheet ξi ξ < ξ i+1 0 otherwise ξ ξ i N i,p 1 (ξ) + where 0 0 Advanced Finite Elements MA5337 - WS7/8 Solution sheet This exercise sheets deals with B-slines and NURBS, which are the basis of isogeometric analysis as they will later relace the olynomial ansatz-functions

More information

Generalized Coiflets: A New Family of Orthonormal Wavelets

Generalized Coiflets: A New Family of Orthonormal Wavelets Generalized Coiflets A New Family of Orthonormal Wavelets Dong Wei, Alan C Bovik, and Brian L Evans Laboratory for Image and Video Engineering Deartment of Electrical and Comuter Engineering The University

More information

1 Gambler s Ruin Problem

1 Gambler s Ruin Problem Coyright c 2017 by Karl Sigman 1 Gambler s Ruin Problem Let N 2 be an integer and let 1 i N 1. Consider a gambler who starts with an initial fortune of $i and then on each successive gamble either wins

More information

The analysis and representation of random signals

The analysis and representation of random signals The analysis and reresentation of random signals Bruno TOÉSNI Bruno.Torresani@cmi.univ-mrs.fr B. Torrésani LTP Université de Provence.1/30 Outline 1. andom signals Introduction The Karhunen-Loève Basis

More information

Finite Mixture EFA in Mplus

Finite Mixture EFA in Mplus Finite Mixture EFA in Mlus November 16, 2007 In this document we describe the Mixture EFA model estimated in Mlus. Four tyes of deendent variables are ossible in this model: normally distributed, ordered

More information

Optimal Random Access and Random Spectrum Sensing for an Energy Harvesting Cognitive Radio with and without Primary Feedback Leveraging

Optimal Random Access and Random Spectrum Sensing for an Energy Harvesting Cognitive Radio with and without Primary Feedback Leveraging 1 Otimal Random Access and Random Sectrum Sensing for an Energy Harvesting Cognitive Radio with and without Primary Feedback Leveraging Ahmed El Shafie, Member, IEEE, arxiv:1401.0340v3 [cs.it] 27 Ar 2014

More information

On the Chvatál-Complexity of Knapsack Problems

On the Chvatál-Complexity of Knapsack Problems R u t c o r Research R e o r t On the Chvatál-Comlexity of Knasack Problems Gergely Kovács a Béla Vizvári b RRR 5-08, October 008 RUTCOR Rutgers Center for Oerations Research Rutgers University 640 Bartholomew

More information

Age of Information: Whittle Index for Scheduling Stochastic Arrivals

Age of Information: Whittle Index for Scheduling Stochastic Arrivals Age of Information: Whittle Index for Scheduling Stochastic Arrivals Yu-Pin Hsu Deartment of Communication Engineering National Taiei University yuinhsu@mail.ntu.edu.tw arxiv:80.03422v2 [math.oc] 7 Ar

More information

Inference for Empirical Wasserstein Distances on Finite Spaces: Supplementary Material

Inference for Empirical Wasserstein Distances on Finite Spaces: Supplementary Material Inference for Emirical Wasserstein Distances on Finite Saces: Sulementary Material Max Sommerfeld Axel Munk Keywords: otimal transort, Wasserstein distance, central limit theorem, directional Hadamard

More information

Finding a sparse vector in a subspace: linear sparsity using alternating directions

Finding a sparse vector in a subspace: linear sparsity using alternating directions IEEE TRANSACTION ON INFORMATION THEORY VOL XX NO XX 06 Finding a sarse vector in a subsace: linear sarsity using alternating directions Qing Qu Student Member IEEE Ju Sun Student Member IEEE and John Wright

More information

AKRON: An Algorithm for Approximating Sparse Kernel Reconstruction

AKRON: An Algorithm for Approximating Sparse Kernel Reconstruction : An Algorithm for Aroximating Sarse Kernel Reconstruction Gregory Ditzler Det. of Electrical and Comuter Engineering The University of Arizona Tucson, AZ 8572 USA ditzler@email.arizona.edu Nidhal Carla

More information

Mollifiers and its applications in L p (Ω) space

Mollifiers and its applications in L p (Ω) space Mollifiers and its alications in L () sace MA Shiqi Deartment of Mathematics, Hong Kong Batist University November 19, 2016 Abstract This note gives definition of mollifier and mollification. We illustrate

More information

Bayesian Networks Practice

Bayesian Networks Practice Bayesian Networks Practice Part 2 2016-03-17 Byoung-Hee Kim, Seong-Ho Son Biointelligence Lab, CSE, Seoul National University Agenda Probabilistic Inference in Bayesian networks Probability basics D-searation

More information

1 Probability Spaces and Random Variables

1 Probability Spaces and Random Variables 1 Probability Saces and Random Variables 1.1 Probability saces Ω: samle sace consisting of elementary events (or samle oints). F : the set of events P: robability 1.2 Kolmogorov s axioms Definition 1.2.1

More information

Power Aware Wireless File Downloading: A Constrained Restless Bandit Approach

Power Aware Wireless File Downloading: A Constrained Restless Bandit Approach PROC. WIOP 204 Power Aware Wireless File Downloading: A Constrained Restless Bandit Aroach Xiaohan Wei and Michael J. Neely, Senior Member, IEEE Abstract his aer treats ower-aware throughut maximization

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

Bayesian inference & Markov chain Monte Carlo. Note 1: Many slides for this lecture were kindly provided by Paul Lewis and Mark Holder

Bayesian inference & Markov chain Monte Carlo. Note 1: Many slides for this lecture were kindly provided by Paul Lewis and Mark Holder Bayesian inference & Markov chain Monte Carlo Note 1: Many slides for this lecture were kindly rovided by Paul Lewis and Mark Holder Note 2: Paul Lewis has written nice software for demonstrating Markov

More information