Bayes-Adaptive POMDPs: Toward an Optimal Policy for Learning POMDPs with Parameter Uncertainty

Stéphane Ross
School of Computer Science, McGill University, Montreal (Qc), Canada, H3A 2A7

Abstract. Most of the POMDP literature has focused on developing new approximate algorithms to solve ever larger POMDPs, under the general assumption that the POMDP model is known a priori. In practice, however, this is rarely the case. For instance, robot navigation problems generally require that the parameters of the POMDP be well tuned to the robot's sensors and actuators in order for the POMDP to reflect reality, but the sensor and actuator parameters are rarely known precisely. Hence it is of crucial importance to develop new approaches that take the uncertainty on these parameters into account during the planning process and further refine the POMDP model as experience is acquired in the environment. To this end, we formulate a new Bayes-Adaptive POMDP model whose optimal policy provides an optimal exploration-exploitation tradeoff, maximizing long-term reward while taking the parameter uncertainty into account. However, since the Bayes-Adaptive POMDP has an infinite number of states, we propose an approximate algorithm that can solve the problem in a reasonable amount of time.

1 Introduction

In real world systems, uncertainty generally arises both in the prediction of the system's behaviour under different controls and in the observability of the current system state. Partially Observable Markov Decision Processes (POMDPs) take both kinds of uncertainty into account and provide a powerful model for sequential decision making under these conditions. However, most real world problems have huge state and observation spaces, such that exact solution approaches are completely intractable (finite-horizon POMDPs are PSPACE-complete [1] and infinite-horizon POMDPs are undecidable [2]). This has motivated most researchers to focus on elaborating approximate solution approaches in order to solve ever larger POMDPs. However, it is generally assumed in the community that the POMDP model is known a priori, which is rarely the case in practice. A typical example is the robot navigation problem.

POMDPs have been used extensively to solve robot navigation problems, but in practice, if we want to find the optimal policy that the robot should follow in the real world, the POMDP must exactly reflect the uncertainty on the robot's sensors and actuators. These parameters are rarely known exactly and are generally approximated by humans, so that even if the resulting POMDP is solved exactly, the resulting policy may not be optimal due to model (parameter) uncertainty. A more desirable approach would be to take the uncertainty on the model into account in the planning process and to learn the values of these unknown parameters from experience.

Several approaches have been explored to learn POMDP models. A first commonly used approach is the Baum-Welch algorithm [3], an Expectation-Maximization (EM) algorithm that uses a maximum likelihood approach to find the most likely model given the observed sequence of actions and observations. This approach converges to a local optimum and does not address the issue of planning with an uncertain model. Another recent approach, called Medusa [4], tries to address this problem in an active learning fashion. The POMDP is extended with an extra Query action which, when executed, provides full information on the current state of the environment. Using this information, the algorithm updates Dirichlet distributions over its unknown parameters. During the planning process, several models are sampled from the joint Dirichlet distribution and solved independently. The executed action is then chosen randomly among the best actions of the sampled models, with a probability proportional to the likelihood of the corresponding model. The drawback of this approach is that it requires an oracle, which might not always be available. Furthermore, because the sampled models are solved independently, as if each were the correct POMDP model, the resulting policy does not take the uncertainty on the model into account. Query actions in this approach are only planned according to specific heuristics.

The approaches most related to ours come from the field of Bayesian reinforcement learning, where Bayes-Adaptive MDPs [5] were formulated to provide a theoretically optimal exploration-exploitation tradeoff for learning MDPs. These approaches, as in Medusa, use Dirichlet distributions to maintain the uncertainty on the parameters of the model. To take the parameter uncertainty into account in the planning process, the state space is extended with the Dirichlet distribution parameters, which are known at all times, and the transition probabilities are computed according to the expected values of the Dirichlet distributions in the current state. Because the state is observable in an MDP, no oracle is needed to update the Dirichlet distributions after an action is taken in the environment.

In this report, we propose an extension of Bayes-Adaptive MDPs to POMDPs that does not require any oracle to learn the POMDP model. However, since the Bayes-Adaptive POMDP has an infinite number of states, belief state maintenance and value function representation become problematic. We propose different approximations that can be used to alleviate these problems. We first introduce the POMDP model and some approximate solution approaches. Then we introduce our new Bayes-Adaptive POMDP model and provide approximations that can be used to solve it with standard POMDP solving algorithms. We conclude with possible future extensions and improvements.

2 POMDP

In this section we introduce the POMDP model and present some approximate algorithms for solving POMDPs.

2.1 Model

A Partially Observable Markov Decision Process (POMDP) is a model for sequential decision making under uncertainty. Using such a model, an agent can plan an optimal sequence of actions according to its belief, taking into account the uncertainty associated with its actions and observations. A POMDP is generally defined by a tuple $(S, A, \Omega, T, R, O, \gamma, b_0)$ where S is the state space, A is the action set, $\Omega$ is the observation set, $T : S \times A \times S \to [0,1]$ is the transition function, where $T(s,a,s')$ specifies the probability of ending up in state $s'$ given that we were in state $s$ and did action $a$, $R : S \times A \to \mathbb{R}$ is the reward function, where $R(s,a)$ specifies the immediate reward obtained by doing action $a$ in state $s$, $O : S \times A \times \Omega \to [0,1]$ is the observation function, where $O(s',a,z)$ specifies the probability of observing $z$ given that we did action $a$ and ended in state $s'$, and $\gamma$ is the discount factor. Finally, $b_0$ is the initial belief state of the environment and specifies the probability distribution over the initial state of the environment.

In a POMDP, the agent does not know exactly in which state it currently is, since its observations of its current state are uncertain. Instead, the agent maintains a belief state $b$, a probability distribution over all states that specifies the probability that the agent is in each state. After the agent performs an action $a$ and perceives an observation $z$, it can update its current belief state $b$ using the belief update function $\tau(b,a,z)$ specified in equation 1.

$$b'(s') = \eta\, O(s',a,z) \sum_{s \in S} T(s,a,s')\, b(s) \quad (1)$$

Here, $b'$ is the new belief state and $b$ is the last belief state of the agent. The summation computes the expected probability of transiting to state $s'$, given that we performed action $a$ in belief state $b$. This expected probability is then weighted by the probability of observing $z$ in state $s'$ after doing action $a$. $\eta$ is a normalization constant such that the new probability distribution over all states sums to 1.
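
As a concrete illustration, the following is a minimal Python sketch of the belief update of equation 1, assuming a tiny hypothetical two-state POMDP; the transition and observation tables and the chosen action and observation are made up for illustration and are not part of this report.

import numpy as np

# Hypothetical model: 2 states, 1 action, 2 observations.
# T[s, a, s'] = P(s' | s, a),  O[s', a, z] = P(z | s', a)
T = np.array([[[0.9, 0.1]],
              [[0.2, 0.8]]])
O = np.array([[[0.8, 0.2]],
              [[0.3, 0.7]]])

def belief_update(b, a, z):
    """tau(b, a, z): b'(s') = eta * O(s', a, z) * sum_s T(s, a, s') b(s)."""
    b_next = O[:, a, z] * (T[:, a, :].T @ b)   # unnormalized new belief over s'
    eta = b_next.sum()                         # eta = P(z | b, a)
    return b_next / eta

b0 = np.array([0.5, 0.5])
print(belief_update(b0, a=0, z=1))             # belief after doing a=0 and observing z=1

Note that the normalization constant eta computed inside is exactly P(z | b, a), as made explicit below.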

Solving a POMDP consists in finding an optimal policy $\pi^*$ which specifies the best action to take in every belief state $b$. This optimal policy depends on the planning horizon and on the discount factor used. In order to find this optimal policy, we need to compute the optimal value of a belief state over the planning horizon. For the infinite horizon, the optimal value function is the fixed point of the Bellman equation (equation 2).

$$V^*(b) = \max_{a \in A} \left[ R(b,a) + \gamma \sum_{z \in \Omega} P(z \mid b,a)\, V^*(\tau(b,a,z)) \right] \quad (2)$$

In this equation, $R(b,a) = \sum_{s \in S} R(s,a)\, b(s)$ is the expected immediate reward of doing action $a$ in belief state $b$, and $P(z \mid b,a)$ is the probability of observing $z$ after doing action $a$ in belief state $b$. This probability can be computed using equation 3.

$$P(z \mid b,a) = \sum_{s' \in S} O(s',a,z) \sum_{s \in S} T(s,a,s')\, b(s) \quad (3)$$

This equation is very similar to the belief update function, except that it sums over all possible resulting states $s'$ in order to obtain the overall probability of observing $z$ over the whole state space. In fact, when computing the belief update function $\tau(b,a,z)$, the normalization constant is $\eta = P(z \mid b,a)$. Similarly to the definition of the optimal value function, we can define the optimal policy $\pi^*$ as in equation 4.

$$\pi^*(b) = \arg\max_{a \in A} \left[ R(b,a) + \gamma \sum_{z \in \Omega} P(z \mid b,a)\, V^*(\tau(b,a,z)) \right] \quad (4)$$

One problem with this formulation, however, is that there is an infinite number of belief states, so such a policy cannot be computed for all belief states in a finite amount of time. But since the optimal value function over a finite horizon has been shown to be piecewise linear and convex, we can define the optimal value function and policy of a finite-horizon POMDP using a finite set of $|S|$-dimensional hyperplanes, called α-vectors, over the belief space. This is how exact offline value iteration algorithms are able to compute a very close approximation to $V^*$ in a finite amount of time. However, exact value iteration algorithms can only be applied to small problems of 10 to 20 states due to their high complexity. For more detail, refer to Littman and Cassandra [6, 7].

2.2 Approximate algorithms

Contrary to exact value iteration algorithms, approximate value iteration algorithms keep only a subset of α-vectors after each iteration in order to limit the complexity of the algorithm. Pineau [8, 9] has developed a point-based value iteration algorithm (PBVI) which bounds the complexity of exact value iteration by the number of belief points in its set. Instead of keeping all the α-vectors as in exact value iteration, PBVI keeps at most one α-vector per belief point, the one that maximizes its value. Therefore, the precision of the algorithm depends on the number of belief points and on the location of the chosen belief points.
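
To make the α-vector machinery concrete, here is a minimal Python sketch of one point-based backup in the spirit of PBVI [8]: for each belief point in a fixed set B, the backup of equation 2 is evaluated over the current α-vectors and only the maximizing vector is kept. The tiny two-state model (T, O, R, γ) below is hypothetical and used only for illustration.

import numpy as np

# Hypothetical 2-state, 2-action, 2-observation model (not from this report).
T = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.5, 0.5]]])   # T[s, a, s'] = P(s' | s, a)
O = np.array([[[0.8, 0.2], [0.5, 0.5]],
              [[0.3, 0.7], [0.5, 0.5]]])   # O[s', a, z] = P(z | s', a)
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])                 # R[s, a]
gamma = 0.95
S, A = R.shape
Z = O.shape[2]

def point_based_backup(B, alphas):
    """One point-based backup: keep at most one alpha-vector per belief in B."""
    new_alphas = []
    for b in B:
        best_val, best_vec = -np.inf, None
        for a in range(A):
            vec = R[:, a].copy()
            for z in range(Z):
                # g_i(s) = sum_{s'} O(s', a, z) T(s, a, s') alpha_i(s')
                g = np.array([T[:, a, :] @ (O[:, a, z] * alpha) for alpha in alphas])
                vec = vec + gamma * g[np.argmax(g @ b)]   # best alpha for this (a, z) at b
            if vec @ b > best_val:
                best_val, best_vec = vec @ b, vec
        new_alphas.append(best_vec)
    return np.array(new_alphas)

B = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]
alphas = np.zeros((1, S))                  # start from the zero value function
for _ in range(50):
    alphas = point_based_backup(B, alphas)
print(np.max(alphas @ np.array([0.5, 0.5])))   # approximate V*([0.5, 0.5])

Repeating the backup until the values stabilize yields a lower-bound approximation of $V^*$ whose precision depends on the chosen belief set, as discussed above.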

Spaan [10] has adopted a similar approach (Perseus), but instead of updating all belief points at each iteration, Perseus only updates the belief points that have not yet been improved by a previous α-vector update in the current iteration. Since Perseus generally updates only a small subset of belief points at each iteration, it can converge more rapidly to an approximate policy, or use larger sets of belief points, which improves its precision. Another recent approach which has shown interesting efficiency is HSVI [11, 12], which maintains both an upper bound defined by a set of points and a lower bound defined by α-vectors. HSVI uses a heuristic that approximates the error of the belief points in order to select the belief point on which to perform value iteration updates. When it selects a belief to update, it also updates its upper bound using linear programming methods.

While these methods have in common the fact that they solve the problem offline, i.e. they compute a complete policy prior to execution, another strategy investigated in the literature is the online approach, which interleaves computation and execution steps. The advantage of the latter approach is that the policy only needs to be computed for the belief states encountered during execution; as a consequence, only belief states reachable from the current belief state need to be considered to find the next action to execute. Online algorithms generally proceed by doing a lookahead search in the space of reachable belief states over some finite horizon, and use approximate value functions of the infinite-horizon value of the belief states at the fringe nodes of the search tree or graph [13-16]. Branch-and-bound pruning techniques and factored representations have also been used to reduce the complexity of the search [15]. Furthermore, various heuristics have been proposed to guide the search toward more important regions of the belief space [13, 14, 16]. Some authors have also proposed sampling approaches to further reduce the complexity of the search in large action/observation spaces [17-19].

3 Bayes-Adaptive POMDP

In this section, we introduce the Bayes-Adaptive POMDP model, which takes into account the uncertainty on the parameters of a standard POMDP. Here we assume that the state space, action space and observation space are known, and that the transition and observation functions are unknown or partially known. We also assume that the reward function is known, as it is generally specified by the user for the specific task to accomplish, but the model can easily be generalised to learn the reward function as well. We will denote by $T^a_{ss'}$ the parameter for the transition probability $T(s,a,s')$ and by $O^a_{s'z}$ the parameter for the observation probability $O(s',a,z)$. To model the uncertainty on these parameters, we make extensive use of Dirichlet distributions. We therefore first introduce Dirichlet distributions and then provide a complete formalisation of the Bayes-Adaptive POMDP model and its solution.

3.1 Dirichlet Distribution

The Dirichlet distribution is the conjugate prior of the multinomial distribution; in other words, it is a probability distribution over the parameters of a multinomial distribution. The multinomial distribution is a generalization of the binomial distribution, where each trial results in one of k possible outcomes, and it represents the probability of observing each outcome a certain number of times over n trials, given the probability of each outcome. For example, consider the following problem: suppose we have a k-sided die and we want to determine whether the die is fair, i.e. whether each face has an equal probability of occurring when we roll the die. To determine this, we are able to roll the die a given number of times n. Each roll (trial) is considered independent and results in one of k possible outcomes, $f_1$ to $f_k$, where $f_i$ represents the outcome that face $i$ occurred after rolling the die. Let $p_i$ denote the unknown probability that $f_i$ occurs after a roll, and let $\alpha_i$ be the number of times we have observed $f_i$ after n rolls. In this example, the probability parameters $p_i$ follow a Dirichlet distribution, i.e. $(p_1,\ldots,p_k) \sim \text{Dir}(\alpha_1,\ldots,\alpha_k)$. This distribution represents the probability that the die behaves according to the probability distribution $(p_1,\ldots,p_k)$, given that we have observed the counts $(\alpha_1,\ldots,\alpha_k)$ over n rolls ($n = \sum_{i=1}^k \alpha_i$). The probability density function of the Dirichlet distribution is defined in equation 5.

$$f(p; \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{k} p_i^{\alpha_i - 1} \quad (5)$$

The normalization constant is the beta function, which is expressed in terms of the gamma function, i.e. $B(\alpha) = \prod_{i=1}^{k} \Gamma(\alpha_i) / \Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)$. The gamma function is a generalization of the factorial to complex numbers, and the equality $\Gamma(n+1) = n!$ holds for natural numbers.

For our particular POMDP with unknown parameters, we would be able to define our uncertainty on the distributions $T^a_s$ and $O^a_{s'}$ if we maintained counts $\alpha^a_{ss'}$, representing the number of times we have transited from state $s$ to state $s'$ by doing action $a$, and $\beta^a_{s'z}$, representing the number of times we have observed $z$ in state $s'$ after doing action $a$. With such counts, we would have $T^a_s \sim \text{Dir}(\alpha^a_{s s_1},\ldots,\alpha^a_{s s_{|S|}})$ and $O^a_{s'} \sim \text{Dir}(\beta^a_{s' z_1},\ldots,\beta^a_{s' z_{|\Omega|}})$. The problem here is that we need to observe the state of the environment in order to know which counts to increment every time a transition and observation happen in the environment. However, even though we do not observe the state, we can still consider all possible state transitions that could have occurred from our current state. Each state transition leads to different count values and has a different probability according to our current Dirichlet distributions. Thus, we end up with a probability distribution over the values of the count variables. This can be interpreted as representing the uncertainty on our unknown parameters by a mixture of Dirichlet distributions. In the next section, we provide a formal description of the Bayes-Adaptive POMDP model that allows us to take such uncertainty into account in the planning process.
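
The following minimal Python sketch illustrates the die example above with a hypothetical six-sided die whose true face probabilities are unknown to the learner: each observed roll increments one Dirichlet count, the expected value of the Dirichlet gives the current estimate of each face probability, and complete models can also be sampled from the distribution, as Medusa does [4]. The same count-based updates will be applied below to the α and β vectors of the POMDP.

import numpy as np

rng = np.random.default_rng(0)
k = 6
alpha = np.ones(k)                                   # uniform prior Dir(1, ..., 1)
true_p = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])    # unknown to the learner

for _ in range(200):                  # n = 200 rolls
    face = rng.choice(k, p=true_p)    # roll the die
    alpha[face] += 1                  # increment the count of the observed face

print("expected p   :", alpha / alpha.sum())   # E[p_i] = alpha_i / sum_j alpha_j
print("sampled model:", rng.dirichlet(alpha))  # one draw from Dir(alpha)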

3.2 Model

The Bayes-Adaptive POMDP is constructed from the model of the POMDP with unknown parameters. Let $(S, A, \Omega, T, O, R, \gamma, b_0)$ represent our POMDP with unknown transition and observation functions T and O. We first define counts $\alpha^a_{ss'}$, for all $(s,a,s') \in S \times A \times S$, that represent the number of times we have transited from state $s$ to state $s'$ by doing action $a$, and counts $\beta^a_{s'z}$, for all $(s',a,z) \in S \times A \times \Omega$, that represent the number of times we have observed $z$ when arriving in state $s'$ by doing action $a$. We refer to $\alpha$ as the vector of all transition counts and to $\beta$ as the vector of all observation counts. We also refer to $\mathcal{T} = \mathbb{R}^{|S|^2 |A|}$ as the vector space in which $\alpha$ lies and to $\mathcal{O} = \mathbb{R}^{|S||A||\Omega|}$ as the vector space in which $\beta$ lies.

In order to maintain our probability distribution over the values of the $\alpha$ and $\beta$ vectors, and to take it into account in the planning process, we include the $\alpha$ and $\beta$ vectors in the state of the Bayes-Adaptive POMDP. Thus, the state space $S'$ of the Bayes-Adaptive POMDP is defined as $S' = S \times \mathcal{T} \times \mathcal{O}$. The action and observation sets of the Bayes-Adaptive POMDP are the same as those of the original POMDP.

For the transition and observation functions of the Bayes-Adaptive POMDP, what we want to model is how the counts evolve as transitions and observations are made in the environment. Hence we want that, if we are in a particular state $s$ with count vectors $\alpha$ and $\beta$, and the agent performs action $a$, transits to state $s'$ and observes $z$, then the count vector after the transition should be $\alpha' = \alpha + \delta^a_{ss'}$, where $\delta^a_{ss'} \in \mathcal{T}$ is a vector of zeroes with a 1 at the position of the counter $\alpha^a_{ss'}$, and the count vector after the observation should be $\beta' = \beta + \delta^a_{s'z}$, where $\delta^a_{s'z} \in \mathcal{O}$ is a vector of zeroes with a 1 at the position of the counter $\beta^a_{s'z}$. Furthermore, the probabilities of such transitions and observations should be defined by considering all models and their probabilities as specified by the current Dirichlet distributions defined by $\alpha$ and $\beta$. This is exactly what the expected value of the Dirichlet gives; thus we only need to define the transition and observation probabilities using the expected values of the Dirichlet distributions. Hence we define the transition and observation functions $T'$ and $O'$ of the Bayes-Adaptive POMDP as follows:

$$T'((s,\alpha,\beta), a, (s',\alpha',\beta')) = \begin{cases} \dfrac{\alpha^a_{ss'}\, \beta^a_{s'z}}{\left(\sum_{s''} \alpha^a_{ss''}\right)\left(\sum_{z'} \beta^a_{s'z'}\right)} & \text{if } \alpha' = \alpha + \delta^a_{ss'} \text{ and } \beta' = \beta + \delta^a_{s'z} \\ 0 & \text{otherwise} \end{cases} \quad (6)$$

$$O'((s,\alpha,\beta), a, (s',\alpha',\beta'), z) = \begin{cases} 1 & \text{if } \alpha' = \alpha + \delta^a_{ss'} \text{ and } \beta' = \beta + \delta^a_{s'z} \\ 0 & \text{otherwise} \end{cases} \quad (7)$$

Notice that the observation probabilities defined by the Dirichlet distributions are taken into account in the transition function, since a state transition in the Bayes-Adaptive POMDP also specifies, via the way the counts are incremented, which observation will be observed after the transition. As a result, the observation function becomes deterministic. The other particularity is that the observation function depends on both the previous and the current state, since the way the counts are incremented specifies which observation was observed.
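
The following minimal Python sketch illustrates equations 6 and 7 for hypothetical count arrays: the probability of reaching the successor state with counts α + δ and β + δ while observing z is the product of the two Dirichlet expected values, and the successor count vectors are obtained by incrementing a single entry of α and of β. The array shapes and toy counts are illustrative only.

import numpy as np

S, A, Z = 2, 2, 2
alpha = np.ones((S, A, S))      # alpha[s, a, s'], transition counts
beta = np.ones((S, A, Z))       # beta[s', a, z], observation counts

def bapomdp_transition_prob(alpha, beta, s, a, s_next, z):
    """Equation 6: expected transition probability times expected observation probability."""
    p_trans = alpha[s, a, s_next] / alpha[s, a, :].sum()
    p_obs = beta[s_next, a, z] / beta[s_next, a, :].sum()
    return p_trans * p_obs

def increment_counts(alpha, beta, s, a, s_next, z):
    """Successor counts alpha' = alpha + delta^a_{ss'} and beta' = beta + delta^a_{s'z}."""
    alpha2, beta2 = alpha.copy(), beta.copy()
    alpha2[s, a, s_next] += 1
    beta2[s_next, a, z] += 1
    return alpha2, beta2

print(bapomdp_transition_prob(alpha, beta, s=0, a=1, s_next=1, z=0))   # 0.25 with uniform counts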

Since the counts do not affect the reward, we simply define the reward function of the Bayes-Adaptive POMDP as $R'((s,\alpha,\beta),a) = R(s,a)$. The discount factor of the Bayes-Adaptive POMDP is also the same. Finally, if the count vectors $\alpha_0$ and $\beta_0$ represent the prior knowledge on the POMDP model, then the initial belief state of the Bayes-Adaptive POMDP is defined as $b'_0(s,\alpha_0,\beta_0) = b_0(s)$, and $b'_0(s,\alpha,\beta) = 0$ everywhere else.

Using the definitions we just presented, the Bayes-Adaptive POMDP has a known model specified by the tuple $(S', A, \Omega, T', O', R', \gamma, b'_0)$. Using this model, we can compute the probability of observing a certain observation $z$ after doing a certain action $a$ in a belief state $b$ as follows:

$$\begin{aligned} P(z \mid b,a) &= \sum_{\sigma \in S'} b(\sigma) \sum_{\sigma' \in S'} O'(\sigma, a, \sigma', z)\, T'(\sigma, a, \sigma') \\ &= \sum_{(s,\alpha,\beta) \in S'_p(b)} b(s,\alpha,\beta) \sum_{s' \in S} T'((s,\alpha,\beta), a, (s', \alpha + \delta^a_{ss'}, \beta + \delta^a_{s'z})) \\ &= \sum_{(s,\alpha,\beta) \in S'_p(b)} b(s,\alpha,\beta) \sum_{s' \in S} \frac{\alpha^a_{ss'}\, \beta^a_{s'z}}{\left(\sum_{s''} \alpha^a_{ss''}\right)\left(\sum_{z'} \beta^a_{s'z'}\right)} \end{aligned}$$

where $S'_p(b) = \{\sigma \in S' \mid b(\sigma) > 0\}$. Furthermore, we can derive a simplification of the belief update function for the Bayes-Adaptive POMDP:

$$\begin{aligned} b'(s',\alpha',\beta') &= \eta \sum_{(s,\alpha,\beta) \in S'} b(s,\alpha,\beta)\, O'((s,\alpha,\beta), a, (s',\alpha',\beta'), z)\, T'((s,\alpha,\beta), a, (s',\alpha',\beta')) \\ &= \eta \sum_{s \in S} b(s, \alpha' - \delta^a_{ss'}, \beta' - \delta^a_{s'z})\, T'((s, \alpha' - \delta^a_{ss'}, \beta' - \delta^a_{s'z}), a, (s',\alpha',\beta')) \\ &= \eta \sum_{s \in S} b(s, \alpha' - \delta^a_{ss'}, \beta' - \delta^a_{s'z})\, \frac{(\alpha'^a_{ss'} - 1)(\beta'^a_{s'z} - 1)}{\left(\sum_{s''} \alpha'^a_{ss''} - 1\right)\left(\sum_{z'} \beta'^a_{s'z'} - 1\right)} \end{aligned}$$

where the normalization constant is $\eta = 1/P(z \mid b,a)$. Clearly, in practice these terms are computable only if the set $S'_p(b)$ is finite. We prove this in the following theorem.

Theorem 1. Let $(S', A, \Omega, T', O', R', \gamma, b'_0)$ be a Bayes-Adaptive POMDP constructed from the POMDP $(S, A, \Omega, T, O, R, \gamma, b_0)$. If S is finite, then at any time t the set $S'_p(b_t) = \{\sigma \in S' \mid b_t(\sigma) > 0\}$ is finite. Furthermore, $|S'_p(b_t)| \leq |S|^{t+1}$.

Proof. We proceed by induction. The base case is immediate: when $t = 0$, $b'_0(s,\alpha,\beta)$ is 0 except if $\alpha = \alpha_0$ and $\beta = \beta_0$. Hence $|S'_p(b_0)| \leq |S|$ and therefore $S'_p(b_0)$ is finite. For the induction step, assume that $S'_p(b_{t-1})$ is finite and that $|S'_p(b_{t-1})| \leq |S|^t$; we show that $|S'_p(b_t)| \leq |S|^{t+1}$. From the definition of the belief update function, we see that $b_t(s',\alpha',\beta')$ can be greater than 0 only if there is an $(s,\alpha,\beta)$ such that $b_{t-1}(s,\alpha,\beta) > 0$, $\alpha' = \alpha + \delta^a_{ss'}$ and $\beta' = \beta + \delta^a_{s'z}$. Hence, a particular $(s,\alpha,\beta)$ with $b_{t-1}(s,\alpha,\beta) > 0$ yields non-zero probabilities for at most $|S|$ different states in $b_t$, namely $\{(s', \alpha + \delta^a_{ss'}, \beta + \delta^a_{s'z}) \mid s' \in S, a = a_{t-1}, z = z_{t-1}\}$. Since by assumption $|S'_p(b_{t-1})| \leq |S|^t$, and each probable state in $S'_p(b_{t-1})$ generates at most $|S|$ probable states in $b_t$, it follows that $|S'_p(b_t)| \leq |S|^{t+1}$. Hence $S'_p(b_t)$ is also finite, since S is finite by assumption. Having proven the base case and the induction step, we conclude that $S'_p(b_t)$ is finite and bounded by $|S|^{t+1}$ for all t.

This proof suggests that we only need to iterate over S and $S'_p(b_t)$ in order to update the belief state $b_t$ when an action is taken and an observation is received in the environment. Hence we will generally use the following algorithm for the belief update function τ:

function τ(b, a, z)
    initialize b' as a zero vector
    η ← 0
    for all (s, α, β) ∈ S'_p(b) do
        η_T ← Σ_{s''} α^a_{s s''}
        for all s' ∈ S do
            α' ← α + δ^a_{s s'}
            β' ← β + δ^a_{s' z}
            η_O ← Σ_{z'} β^a_{s' z'}
            tmp ← b(s, α, β) · α^a_{s s'} · β^a_{s' z} / (η_T · η_O)
            b'(s', α', β') ← b'(s', α', β') + tmp
            η ← η + tmp
        end for
    end for
    return (1/η) · b'

Using these definitions of $\tau(b,a,z)$ and $P(z \mid b,a)$, we can characterize the optimal solution of the Bayes-Adaptive POMDP as in a standard POMDP, with equations 2 and 4. The only difference is that to compute the immediate reward $R'(b,a)$ we iterate over $S'_p(b)$ instead of S, i.e. $R'(b,a) = \sum_{(s,\alpha,\beta) \in S'_p(b)} R(s,a)\, b(s,\alpha,\beta)$.
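
For concreteness, here is a minimal Python sketch of the τ routine above. It assumes the belief is represented as a dictionary mapping (s, α, β) states to probabilities, with the count arrays stored as nested tuples so they can serve as dictionary keys; this representation and the toy model sizes are illustrative only.

import numpy as np

def totuple(x):
    """Turn a count array into nested tuples so it can be used as a dict key."""
    return tuple(totuple(r) for r in x) if isinstance(x, np.ndarray) else float(x)

def tau(b, a, z, S):
    """Exact BAPOMDP belief update; b maps (s, alpha, beta) to a probability."""
    b_next, eta = {}, 0.0
    for (s, alpha, beta), p in b.items():
        alpha_arr, beta_arr = np.array(alpha), np.array(beta)
        eta_T = alpha_arr[s, a, :].sum()                    # sum_{s''} alpha^a_{s s''}
        for s2 in range(S):
            eta_O = beta_arr[s2, a, :].sum()                # sum_{z'} beta^a_{s' z'}
            tmp = p * alpha_arr[s, a, s2] * beta_arr[s2, a, z] / (eta_T * eta_O)
            alpha2, beta2 = alpha_arr.copy(), beta_arr.copy()
            alpha2[s, a, s2] += 1                           # alpha' = alpha + delta^a_{s s'}
            beta2[s2, a, z] += 1                            # beta'  = beta  + delta^a_{s' z}
            key = (s2, totuple(alpha2), totuple(beta2))
            b_next[key] = b_next.get(key, 0.0) + tmp
            eta += tmp                                      # eta accumulates P(z | b, a)
    return {sigma: prob / eta for sigma, prob in b_next.items()}

# Initial belief over a hypothetical 2-state, 1-action, 2-observation model
# with uniform prior counts alpha_0 = beta_0 = 1.
S, A, Z = 2, 1, 2
alpha0, beta0 = totuple(np.ones((S, A, S))), totuple(np.ones((S, A, Z)))
b0 = {(0, alpha0, beta0): 0.5, (1, alpha0, beta0): 0.5}
b1 = tau(b0, a=0, z=1, S=S)
print(len(b1), sum(b1.values()))   # support size <= |S|^2 = 4, probabilities sum to 1

Running the example shows the belief support growing by at most a factor of |S| per update, as guaranteed by Theorem 1.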

3.3 Approximate solution

Since solving a Bayes-Adaptive POMDP is equivalent to solving a POMDP with an infinite number of states, it is clearly quite a challenging task. Standard offline approaches that use value iteration with piecewise linear function approximators do not seem applicable here, as a linear function over the infinite-dimensional belief space would require an infinite number of parameters to be specified. Hence, a discretization of the problem to a finite number of states would be required to use such methods. On the other hand, one particular approach that could work well is to simply plan online by doing a K-step lookahead search each time the agent must perform an action. This yields an approximate policy, since it only plans for a horizon of K instead of the infinite horizon.

However, another problem to address is that the complexity of updating the belief state grows exponentially with the length of the history, i.e. $O(|S|^{t+2}|\Omega|)$, where t is the current time step and S is the original finite state set of the POMDP with unknown parameters. Since t can grow arbitrarily large, we would like to eliminate this dependence on t by using some approximation. One way to do this is to limit the size of $S'_p(b)$ to a certain constant n. In that case the complexity is limited to $O(n|S||\Omega|)$. In order to limit the number of probable states in b, we can use different methods, such as sampling n probable states, or keeping the n most probable states in $\tau(b,a,z)$ and renormalizing the belief state over only those n probable states. We refer to $\hat\tau(b,a,z,n)$ as this approximate belief update function that limits the number of probable states to n. The following variant of the RTBSS algorithm [15] implements these ideas:

function RTBSS(b, k, n)
    inputs:  b: the current belief state
             k: the remaining depth of the search to perform
             n: the number of probable states we keep in the belief state
    static:  K: the total depth of the search
             actionToDo: the next action to perform in the environment
    if k = 0 then
        return max_{a ∈ A} R(b, a)
    end if
    maxQ ← −∞
    for all a ∈ A do
        Q_a ← R(b, a)
        for all z ∈ Ω do
            Q_a ← Q_a + γ · P(z | b, a) · RTBSS(τ̂(b, a, z, n), k − 1, n)
        end for
        if Q_a > maxQ then
            maxQ ← Q_a
            argmaxQ ← a
        end if
    end for
    if k = K then
        actionToDo ← argmaxQ
    end if
    return maxQ

The algorithm simply computes, for each action, the discounted sum of rewards over a horizon of K using the approximate belief update function, and performs the action that maximizes it. This algorithm is executed each time the agent must choose an action in the environment. Its complexity is in $O((|A||\Omega|)^K\, n|S||\Omega|)$. Because the complexity depends heavily on $|A|$ and $|\Omega|$, the depth of the search K will be limited when there are many actions and observations. However, the algorithm is expected to provide a good and efficient approximation when A and Ω are small.
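
As a minimal sketch of the truncated update τ̂(b, a, z, n) used by RTBSS above, the following Python function reuses the tau sketch given at the end of Section 3.2 and simply keeps the n most probable states before renormalizing, which is one of the two options mentioned above (the other being sampling).

def tau_hat(b, a, z, n, S):
    """Approximate BAPOMDP belief update: exact update, then keep only the
    n most probable (s, alpha, beta) states and renormalize over them."""
    b_next = tau(b, a, z, S)     # exact update from the sketch at the end of Section 3.2
    kept = sorted(b_next.items(), key=lambda kv: kv[1], reverse=True)[:n]
    total = sum(p for _, p in kept)
    return {sigma: p / total for sigma, p in kept}

Capping the support at n is what keeps the per-update cost at $O(n|S||\Omega|)$ as stated above, at the price of discarding some probability mass at each step.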

4 Conclusion

In conclusion, we have proposed a new mathematical model, the Bayes-Adaptive POMDP, that allows us to take into account uncertainty on the parameters of a standard POMDP model. The Bayes-Adaptive POMDP, when solved exactly, provides an optimal exploration-exploitation trade-off that maximizes reward over the infinite horizon while planning actions that gather information on the model when this is profitable. Because the Bayes-Adaptive POMDP has a very high complexity, we proposed a simple online lookahead search using an approximate belief update function to find an approximate solution to this problem. In future work, we would like to gather experimental results that tell us how efficient this approach is and what size of problems it can tackle. We would also like to explore other belief state approximations, such as using parametric distributions to represent the belief state. Finally, further theoretical analysis of these approximations will be required to determine error bounds on the performance of these approaches.

References

1. Papadimitriou, C., Tsitsiklis, J.N.: The complexity of Markov decision processes. Mathematics of Operations Research 12 (1987)
2. Madani, O., Hanks, S., Condon, A.: On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In: Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), The MIT Press (1999)
3. Koenig, S., Simmons, R.: Unsupervised learning of probabilistic models for robot navigation. In: Proceedings of the IEEE International Conference on Robotics and Automation (1996)
4. Jaulmes, R., Pineau, J., Precup, D.: Active learning in partially observable Markov decision processes. In: Proceedings of the 16th European Conference on Machine Learning (ECML) (2005)
5. Duff, M.: Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes. PhD thesis, University of Massachusetts, Amherst, USA (2002)
6. Littman, M.L.: Algorithms for sequential decision making. PhD thesis, Brown University (1996)
7. Cassandra, A., Littman, M.L., Zhang, N.L.: Incremental pruning: a simple, fast, exact method for partially observable Markov decision processes. In: Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (UAI-97) (1997)
8. Pineau, J., Gordon, G., Thrun, S.: Point-based value iteration: an anytime algorithm for POMDPs. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-03), Acapulco, Mexico (2003)
9. Pineau, J.: Tractable planning under uncertainty: exploiting structure. PhD thesis, Carnegie Mellon University, Pittsburgh, PA (2004)
10. Spaan, M.T.J., Vlassis, N.: Perseus: randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research 24 (2005)
11. Smith, T., Simmons, R.: Heuristic search value iteration for POMDPs. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI-04), Banff, Canada (2004)
12. Smith, T., Simmons, R.: Point-based POMDP algorithms: improved analysis and implementation. In: Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI-05), Edinburgh, Scotland (2005)
13. Washington, R.: BI-POMDP: bounded, incremental partially observable Markov model planning. In: Proceedings of the 4th European Conference on Planning. Volume 1348 of Lecture Notes in Computer Science, Toulouse, France, Springer (1997)
14. Satia, J.K., Lave, R.E.: Markovian decision processes with probabilistic observation of states. Management Science 20 (1973)
15. Paquet, S., Tobin, L., Chaib-draa, B.: An online POMDP algorithm for complex multiagent environments. In: Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multi Agent Systems (AAMAS-05), Utrecht, The Netherlands (2005)
16. Ross, S., Chaib-draa, B.: AEMS: an anytime online search algorithm for approximate policy refinement in large POMDPs. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI) (2007)
17. Kearns, M., Mansour, Y., Ng, A.Y.: A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning 49 (2002)
18. McAllester, D., Singh, S.: Approximate planning for factored POMDPs using belief state simplification. In: Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), San Francisco, CA, Morgan Kaufmann Publishers (1999)
19. Bertsekas, D.P., Castanon, D.A.: Rollout algorithms for stochastic scheduling problems. Journal of Heuristics 5 (1999)


More information

Towards Faster Planning with Continuous Resources in Stochastic Domains

Towards Faster Planning with Continuous Resources in Stochastic Domains Towards Faster Planning with Continuous Resources in Stochastic Domains Janusz Marecki and Milind Tambe Computer Science Department University of Southern California 941 W 37th Place, Los Angeles, CA 989

More information

CAP Plan, Activity, and Intent Recognition

CAP Plan, Activity, and Intent Recognition CAP6938-02 Plan, Activity, and Intent Recognition Lecture 10: Sequential Decision-Making Under Uncertainty (part 1) MDPs and POMDPs Instructor: Dr. Gita Sukthankar Email: gitars@eecs.ucf.edu SP2-1 Reminder

More information

Sensitivity Analysis of POMDP Value Functions

Sensitivity Analysis of POMDP Value Functions Sensitivity Analysis of POMDP Value Functions Stephane Ross, Carnegie Mellon Universiy Pittsburgh, USA Masoumeh Izadi, Mark Mercer, David Buckeridge McGill University Montreal, Canada Abstract In sequential

More information

Artificial Intelligence & Sequential Decision Problems

Artificial Intelligence & Sequential Decision Problems Artificial Intelligence & Sequential Decision Problems (CIV6540 - Machine Learning for Civil Engineers) Professor: James-A. Goulet Département des génies civil, géologique et des mines Chapter 15 Goulet

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon Optimal Control of Partiality Observable Markov Processes over a Finite Horizon Report by Jalal Arabneydi 04/11/2012 Taken from Control of Partiality Observable Markov Processes over a finite Horizon by

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Uncertainty & Probabilities & Bandits Daniel Hennes 16.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Uncertainty Probability

More information

Decayed Markov Chain Monte Carlo for Interactive POMDPs

Decayed Markov Chain Monte Carlo for Interactive POMDPs Decayed Markov Chain Monte Carlo for Interactive POMDPs Yanlin Han Piotr Gmytrasiewicz Department of Computer Science University of Illinois at Chicago Chicago, IL 60607 {yhan37,piotr}@uic.edu Abstract

More information

Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs

Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs Liam MacDermed College of Computing Georgia Institute of Technology Atlanta, GA 30332 liam@cc.gatech.edu Charles L. Isbell College

More information

Bayes Adaptive Reinforcement Learning versus Off-line Prior-based Policy Search: an Empirical Comparison

Bayes Adaptive Reinforcement Learning versus Off-line Prior-based Policy Search: an Empirical Comparison Bayes Adaptive Reinforcement Learning versus Off-line Prior-based Policy Search: an Empirical Comparison Michaël Castronovo University of Liège, Institut Montefiore, B28, B-4000 Liège, BELGIUM Damien Ernst

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Coarticulation in Markov Decision Processes

Coarticulation in Markov Decision Processes Coarticulation in Markov Decision Processes Khashayar Rohanimanesh Department of Computer Science University of Massachusetts Amherst, MA 01003 khash@cs.umass.edu Sridhar Mahadevan Department of Computer

More information

CS 570: Machine Learning Seminar. Fall 2016

CS 570: Machine Learning Seminar. Fall 2016 CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or

More information

Temporal Difference Learning & Policy Iteration

Temporal Difference Learning & Policy Iteration Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.

More information

Distributed Optimization. Song Chong EE, KAIST

Distributed Optimization. Song Chong EE, KAIST Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links

More information

Lecture 4: Approximate dynamic programming

Lecture 4: Approximate dynamic programming IEOR 800: Reinforcement learning By Shipra Agrawal Lecture 4: Approximate dynamic programming Deep Q Networks discussed in the last lecture are an instance of approximate dynamic programming. These are

More information

A Reinforcement Learning Algorithm with Polynomial Interaction Complexity for Only-Costly-Observable MDPs

A Reinforcement Learning Algorithm with Polynomial Interaction Complexity for Only-Costly-Observable MDPs A Reinforcement Learning Algorithm with Polynomial Interaction Complexity for Only-Costly-Observable MDPs Roy Fox Computer Science Department, Technion IIT, Israel Moshe Tennenholtz Faculty of Industrial

More information

Probabilistic inference for computing optimal policies in MDPs

Probabilistic inference for computing optimal policies in MDPs Probabilistic inference for computing optimal policies in MDPs Marc Toussaint Amos Storkey School of Informatics, University of Edinburgh Edinburgh EH1 2QL, Scotland, UK mtoussai@inf.ed.ac.uk, amos@storkey.org

More information

Learning in POMDPs with Monte Carlo Tree Search

Learning in POMDPs with Monte Carlo Tree Search Sammie Katt 1 Frans A. Oliehoek 2 Christopher Amato 1 Abstract The POMDP is a powerful framework for reasoning under outcome and information uncertainty, but constructing an accurate POMDP model is difficult.

More information