Bayes-Adaptive POMDPs: Toward an Optimal Policy for Learning POMDPs with Parameter Uncertainty

Stéphane Ross
School of Computer Science, McGill University, Montreal (Qc), Canada, H3A 2A7

Abstract. Most of the POMDP literature has focused on developing new approximate algorithms to solve ever larger POMDPs, under the general assumption that the POMDP model is known a priori. In practice, however, this is rarely the case. For instance, robot navigation problems generally require that the parameters of the POMDP be well tuned to the robot's sensors and actuators in order for the POMDP to reflect reality, but the sensor and actuator parameters are rarely known precisely. Hence it is of crucial importance to develop new approaches that take the uncertainty on these parameters into account during the planning process and further refine the POMDP model as experience is acquired in the environment. To this end, we formulate a new Bayes-Adaptive POMDP model whose optimal policy provides an optimal exploration-exploitation tradeoff, maximizing long-term reward while taking the parameter uncertainty into account. However, since the Bayes-Adaptive POMDP has an infinite number of states, we propose an approximate algorithm that can solve the problem in a reasonable amount of time.

1 Introduction

In real world systems, uncertainty generally arises both in the prediction of the system's behaviour under different controls and in the observability of the current system state. Partially Observable Markov Decision Processes (POMDPs) take both kinds of uncertainty into account and provide a powerful model for sequential decision making under these conditions. However, most real world problems have huge state and observation spaces, such that exact solution approaches are completely intractable (finite-horizon POMDPs are PSPACE-complete [1] and infinite-horizon POMDPs are undecidable [2]). This has motivated most researchers to focus on elaborating approximate solution approaches in order to solve ever larger POMDPs. However, it is generally assumed in the community that the POMDP model is known a priori, which is rarely the case in practice. A typical example is the robot navigation problem.

POMDPs have been used extensively to solve robot navigation problems, but in practice, if we want to find the optimal policy that the robot should follow in the real world, the POMDP must exactly reflect the uncertainty on the robot's sensors and actuators. These parameters are rarely known exactly and are generally approximated by humans, so that even if the resulting POMDP is solved exactly, the resulting policy may not be optimal due to model (parameter) uncertainty. A more desirable approach would be to take the uncertainty on the model into account in the planning process and to learn the values of these unknown parameters from experience.

Several approaches have been explored to learn POMDP models. A first commonly used approach is the Baum-Welch algorithm [3], an Expectation-Maximization (EM) algorithm that uses a maximum likelihood approach to find the most likely model given the observed sequence of actions and observations. This approach converges to a local optimum and does not address the issue of planning with an uncertain model. Another recent approach, called Medusa [4], tries to address this problem in an active learning fashion. The POMDP is extended with an extra Query action which, when executed, provides full information on the current state of the environment. Using this information, the algorithm updates Dirichlet distributions over its unknown parameters. During the planning process, several models are sampled from the joint Dirichlet distribution and solved independently. The executed action is then chosen randomly among the best actions of the sampled models, with a probability proportional to the likelihood of the corresponding model. The drawback of this approach is that it requires an oracle, which might not always be available. Furthermore, because the sampled models are solved independently, as if each were the correct POMDP model, the resulting policy does not take the uncertainty on the model into account. Query actions in this approach are only planned according to specific heuristics.

The approaches most related to ours come from the field of Bayesian reinforcement learning, where Bayes-Adaptive MDPs [5] were formulated to provide a theoretically optimal exploration-exploitation tradeoff for learning MDPs. These approaches, as in Medusa, use Dirichlet distributions to maintain the uncertainty on the parameters of the model. To take the parameter uncertainty into account in the planning process, the state space is extended with the Dirichlet distribution parameters, which are known at all times, and the transition probabilities are computed according to the expected values of the Dirichlet distributions in the current state. Because the state is observable in an MDP, no oracle is needed to update the Dirichlet distributions after an action is taken in the environment.

In this report, we propose an extension of Bayes-Adaptive MDPs to POMDPs that does not require any oracle to learn the POMDP model. However, since the Bayes-Adaptive POMDP has an infinite number of states, belief state maintenance and value function representation become problematic. We propose different approximations that can be used to alleviate these problems. We first introduce the POMDP model and some approximate solution approaches. Then we introduce our new Bayes-Adaptive POMDP model and provide approximations that can be used to solve it with standard POMDP solving algorithms. We conclude with possible future extensions and improvements.

2 POMDP

In this section we introduce the POMDP model and present some approximate algorithms for solving POMDPs.

2.1 Model

A Partially Observable Markov Decision Process (POMDP) is a model for sequential decision making under uncertainty. Using such a model, an agent can plan an optimal sequence of actions according to its belief, taking into account the uncertainty associated with its actions and observations. A POMDP is generally defined by a tuple $(S, A, \Omega, T, R, O, \gamma, b_0)$ where S is the state space, A is the action set, $\Omega$ is the observation set, $T : S \times A \times S \to [0,1]$ is the transition function, where $T(s,a,s')$ specifies the probability of ending up in state $s'$ given that we were in state $s$ and did action $a$, $R : S \times A \to \mathbb{R}$ is the reward function, where $R(s,a)$ specifies the immediate reward obtained by doing action $a$ in state $s$, $O : S \times A \times \Omega \to [0,1]$ is the observation function, where $O(s',a,z)$ specifies the probability of observing $z$ given that we did action $a$ and ended in state $s'$, and $\gamma$ is the discount factor. Finally, $b_0$ is the initial belief state of the environment and specifies the probability distribution over the initial state of the environment.

In a POMDP, the agent does not know exactly in which state it currently is, since its observations of its current state are uncertain. Instead, the agent maintains a belief state $b$, a probability distribution over all states that specifies the probability that the agent is in each state. After the agent performs an action $a$ and perceives an observation $z$, it can update its current belief state $b$ using the belief update function $\tau(b,a,z)$ specified in equation 1.

$$b'(s') = \eta\, O(s',a,z) \sum_{s \in S} T(s,a,s')\, b(s) \quad (1)$$

Here, $b'$ is the new belief state and $b$ is the last belief state of the agent. The summation computes the expected probability of transiting to state $s'$, given that we performed action $a$ in belief state $b$. This expected probability is then weighted by the probability of observing $z$ in state $s'$ after doing action $a$. $\eta$ is a normalization constant such that the new probability distribution over all states sums to 1.
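
As a concrete illustration, the following is a minimal Python sketch of the belief update of equation 1, assuming a tiny hypothetical two-state POMDP; the transition and observation tables and the chosen action and observation are made up for illustration and are not part of this report.

import numpy as np

# Hypothetical model: 2 states, 1 action, 2 observations.
# T[s, a, s'] = P(s' | s, a),  O[s', a, z] = P(z | s', a)
T = np.array([[[0.9, 0.1]],
              [[0.2, 0.8]]])
O = np.array([[[0.8, 0.2]],
              [[0.3, 0.7]]])

def belief_update(b, a, z):
    """tau(b, a, z): b'(s') = eta * O(s', a, z) * sum_s T(s, a, s') b(s)."""
    b_next = O[:, a, z] * (T[:, a, :].T @ b)   # unnormalized new belief over s'
    eta = b_next.sum()                         # eta = P(z | b, a)
    return b_next / eta

b0 = np.array([0.5, 0.5])
print(belief_update(b0, a=0, z=1))             # belief after doing a=0 and observing z=1

Note that the normalization constant eta computed inside is exactly P(z | b, a), as made explicit below.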

Solving a POMDP consists in finding an optimal policy $\pi^*$ which specifies the best action to take in every belief state $b$. This optimal policy depends on the planning horizon and on the discount factor used. In order to find this optimal policy, we need to compute the optimal value of a belief state over the planning horizon. For the infinite horizon, the optimal value function is the fixed point of the Bellman equation (equation 2).

$$V^*(b) = \max_{a \in A} \left[ R(b,a) + \gamma \sum_{z \in \Omega} P(z \mid b,a)\, V^*(\tau(b,a,z)) \right] \quad (2)$$

In this equation, $R(b,a) = \sum_{s \in S} R(s,a)\, b(s)$ is the expected immediate reward of doing action $a$ in belief state $b$, and $P(z \mid b,a)$ is the probability of observing $z$ after doing action $a$ in belief state $b$. This probability can be computed using equation 3.

$$P(z \mid b,a) = \sum_{s' \in S} O(s',a,z) \sum_{s \in S} T(s,a,s')\, b(s) \quad (3)$$

This equation is very similar to the belief update function, except that it sums over all possible resulting states $s'$ in order to obtain the overall probability of observing $z$ over the whole state space. In fact, when computing the belief update function $\tau(b,a,z)$, the normalization constant is $\eta = P(z \mid b,a)$. Similarly to the definition of the optimal value function, we can define the optimal policy $\pi^*$ as in equation 4.

$$\pi^*(b) = \arg\max_{a \in A} \left[ R(b,a) + \gamma \sum_{z \in \Omega} P(z \mid b,a)\, V^*(\tau(b,a,z)) \right] \quad (4)$$

One problem with this formulation, however, is that there is an infinite number of belief states, so such a policy cannot be computed for all belief states in a finite amount of time. But since the optimal value function over a finite horizon has been shown to be piecewise linear and convex, we can define the optimal value function and policy of a finite-horizon POMDP using a finite set of $|S|$-dimensional hyperplanes, called α-vectors, over the belief space. This is how exact offline value iteration algorithms are able to compute a very close approximation to $V^*$ in a finite amount of time. However, exact value iteration algorithms can only be applied to small problems of 10 to 20 states due to their high complexity. For more detail, refer to Littman and Cassandra [6, 7].

2.2 Approximate algorithms

Contrary to exact value iteration algorithms, approximate value iteration algorithms keep only a subset of α-vectors after each iteration in order to limit the complexity of the algorithm. Pineau [8, 9] has developed a point-based value iteration algorithm (PBVI) which bounds the complexity of exact value iteration by the number of belief points in its set. Instead of keeping all the α-vectors as in exact value iteration, PBVI keeps at most one α-vector per belief point, the one that maximizes its value. Therefore, the precision of the algorithm depends on the number of belief points and on the location of the chosen belief points.
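
To make the α-vector machinery concrete, here is a minimal Python sketch of one point-based backup in the spirit of PBVI [8]: for each belief point in a fixed set B, the backup of equation 2 is evaluated over the current α-vectors and only the maximizing vector is kept. The tiny two-state model (T, O, R, γ) below is hypothetical and used only for illustration.

import numpy as np

# Hypothetical 2-state, 2-action, 2-observation model (not from this report).
T = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.5, 0.5]]])   # T[s, a, s'] = P(s' | s, a)
O = np.array([[[0.8, 0.2], [0.5, 0.5]],
              [[0.3, 0.7], [0.5, 0.5]]])   # O[s', a, z] = P(z | s', a)
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])                 # R[s, a]
gamma = 0.95
S, A = R.shape
Z = O.shape[2]

def point_based_backup(B, alphas):
    """One point-based backup: keep at most one alpha-vector per belief in B."""
    new_alphas = []
    for b in B:
        best_val, best_vec = -np.inf, None
        for a in range(A):
            vec = R[:, a].copy()
            for z in range(Z):
                # g_i(s) = sum_{s'} O(s', a, z) T(s, a, s') alpha_i(s')
                g = np.array([T[:, a, :] @ (O[:, a, z] * alpha) for alpha in alphas])
                vec = vec + gamma * g[np.argmax(g @ b)]   # best alpha for this (a, z) at b
            if vec @ b > best_val:
                best_val, best_vec = vec @ b, vec
        new_alphas.append(best_vec)
    return np.array(new_alphas)

B = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]
alphas = np.zeros((1, S))                  # start from the zero value function
for _ in range(50):
    alphas = point_based_backup(B, alphas)
print(np.max(alphas @ np.array([0.5, 0.5])))   # approximate V*([0.5, 0.5])

Repeating the backup until the values stabilize yields a lower-bound approximation of $V^*$ whose precision depends on the chosen belief set, as discussed above.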

Spaan [10] has adopted a similar approach (Perseus), but instead of updating all belief points at each iteration, Perseus only updates the belief points that have not yet been improved by a previous α-vector update in the current iteration. Since Perseus generally updates only a small subset of belief points at each iteration, it can converge more rapidly to an approximate policy, or use larger sets of belief points, which improves its precision. Another recent approach which has shown interesting efficiency is HSVI [11, 12], which maintains both an upper bound defined by a set of points and a lower bound defined by α-vectors. HSVI uses a heuristic that approximates the error of the belief points in order to select the belief point on which to perform value iteration updates. When it selects a belief to update, it also updates its upper bound using linear programming methods.

While these methods have in common the fact that they solve the problem offline, i.e. they compute a complete policy prior to execution, another strategy investigated in the literature is the online approach, which interleaves computation and execution steps. The advantage of the latter approach is that the policy only needs to be computed for the belief states encountered during execution; as a consequence, only belief states reachable from the current belief state need to be considered to find the next action to execute. Online algorithms generally proceed by doing a lookahead search in the space of reachable belief states over some finite horizon, and use approximate value functions of the infinite-horizon value of the belief states at the fringe nodes of the search tree or graph [13-16]. Branch-and-bound pruning techniques and factored representations have also been used to reduce the complexity of the search [15]. Furthermore, various heuristics have been proposed to guide the search toward more important regions of the belief space [13, 14, 16]. Some authors have also proposed sampling approaches to further reduce the complexity of the search in large action/observation spaces [17-19].

3 Bayes-Adaptive POMDP

In this section, we introduce the Bayes-Adaptive POMDP model, which takes into account the uncertainty on the parameters of a standard POMDP. Here we assume that the state space, action space and observation space are known, and that the transition and observation functions are unknown or partially known. We also assume that the reward function is known, as it is generally specified by the user for the specific task to accomplish, but the model can easily be generalised to learn the reward function as well. We will denote by $T^a_{ss'}$ the parameter for the transition probability $T(s,a,s')$ and by $O^a_{s'z}$ the parameter for the observation probability $O(s',a,z)$. To model the uncertainty on these parameters, we make extensive use of Dirichlet distributions. We therefore first introduce Dirichlet distributions and then provide a complete formalisation of the Bayes-Adaptive POMDP model and its solution.

3.1 Dirichlet Distribution

The Dirichlet distribution is the conjugate prior of the multinomial distribution; in other words, it is a probability distribution over the parameters of a multinomial distribution. The multinomial distribution is a generalization of the binomial distribution, where each trial results in one of k possible outcomes, and it represents the probability of observing each outcome a certain number of times over n trials, given the probability of each outcome. For example, consider the following problem: suppose we have a k-sided die and we want to determine whether the die is fair, i.e. whether each face has an equal probability of occurring when we roll the die. To determine this, we are able to roll the die a given number of times n. Each roll (trial) is considered independent and results in one of k possible outcomes, $f_1$ to $f_k$, where $f_i$ represents the outcome that face $i$ occurred after rolling the die. Let $p_i$ denote the unknown probability that $f_i$ occurs after a roll, and let $\alpha_i$ be the number of times we have observed $f_i$ after n rolls. In this example, the probability parameters $p_i$ follow a Dirichlet distribution, i.e. $(p_1,\ldots,p_k) \sim \text{Dir}(\alpha_1,\ldots,\alpha_k)$. This distribution represents the probability that the die behaves according to the probability distribution $(p_1,\ldots,p_k)$, given that we have observed the counts $(\alpha_1,\ldots,\alpha_k)$ over n rolls ($n = \sum_{i=1}^k \alpha_i$). The probability density function of the Dirichlet distribution is defined in equation 5.

$$f(p; \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{k} p_i^{\alpha_i - 1} \quad (5)$$

The normalization constant is the beta function, which is expressed in terms of the gamma function, i.e. $B(\alpha) = \prod_{i=1}^{k} \Gamma(\alpha_i) / \Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)$. The gamma function is a generalization of the factorial to complex numbers, and the equality $\Gamma(n+1) = n!$ holds for natural numbers.

For our particular POMDP with unknown parameters, we would be able to define our uncertainty on the distributions $T^a_s$ and $O^a_{s'}$ if we maintained counts $\alpha^a_{ss'}$, representing the number of times we have transited from state $s$ to state $s'$ by doing action $a$, and $\beta^a_{s'z}$, representing the number of times we have observed $z$ in state $s'$ after doing action $a$. With such counts, we would have $T^a_s \sim \text{Dir}(\alpha^a_{s s_1},\ldots,\alpha^a_{s s_{|S|}})$ and $O^a_{s'} \sim \text{Dir}(\beta^a_{s' z_1},\ldots,\beta^a_{s' z_{|\Omega|}})$. The problem here is that we need to observe the state of the environment in order to know which counts to increment every time a transition and observation happen in the environment. However, even though we do not observe the state, we can still consider all possible state transitions that could have occurred from our current state. Each state transition leads to different count values and has a different probability according to our current Dirichlet distributions. Thus, we end up with a probability distribution over the values of the count variables. This can be interpreted as representing the uncertainty on our unknown parameters by a mixture of Dirichlet distributions. In the next section, we provide a formal description of the Bayes-Adaptive POMDP model that allows us to take such uncertainty into account in the planning process.
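
The following minimal Python sketch illustrates the die example above with a hypothetical six-sided die whose true face probabilities are unknown to the learner: each observed roll increments one Dirichlet count, the expected value of the Dirichlet gives the current estimate of each face probability, and complete models can also be sampled from the distribution, as Medusa does [4]. The same count-based updates will be applied below to the α and β vectors of the POMDP.

import numpy as np

rng = np.random.default_rng(0)
k = 6
alpha = np.ones(k)                                   # uniform prior Dir(1, ..., 1)
true_p = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])    # unknown to the learner

for _ in range(200):                  # n = 200 rolls
    face = rng.choice(k, p=true_p)    # roll the die
    alpha[face] += 1                  # increment the count of the observed face

print("expected p   :", alpha / alpha.sum())   # E[p_i] = alpha_i / sum_j alpha_j
print("sampled model:", rng.dirichlet(alpha))  # one draw from Dir(alpha)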

3.2 Model

The Bayes-Adaptive POMDP is constructed from the model of the POMDP with unknown parameters. Let $(S, A, \Omega, T, O, R, \gamma, b_0)$ represent our POMDP with unknown transition and observation functions T and O. We first define counts $\alpha^a_{ss'}$, for all $(s,a,s') \in S \times A \times S$, that represent the number of times we have transited from state $s$ to state $s'$ by doing action $a$, and counts $\beta^a_{s'z}$, for all $(s',a,z) \in S \times A \times \Omega$, that represent the number of times we have observed $z$ when arriving in state $s'$ by doing action $a$. We refer to $\alpha$ as the vector of all transition counts and to $\beta$ as the vector of all observation counts. We also refer to $\mathcal{T} = \mathbb{R}^{|S|^2 |A|}$ as the vector space in which $\alpha$ lies and to $\mathcal{O} = \mathbb{R}^{|S||A||\Omega|}$ as the vector space in which $\beta$ lies.

In order to maintain our probability distribution over the values of the $\alpha$ and $\beta$ vectors, and to take it into account in the planning process, we include the $\alpha$ and $\beta$ vectors in the state of the Bayes-Adaptive POMDP. Thus, the state space $S'$ of the Bayes-Adaptive POMDP is defined as $S' = S \times \mathcal{T} \times \mathcal{O}$. The action and observation sets of the Bayes-Adaptive POMDP are the same as those of the original POMDP.

For the transition and observation functions of the Bayes-Adaptive POMDP, what we want to model is how the counts evolve as transitions and observations are made in the environment. Hence we want that, if we are in a particular state $s$ with count vectors $\alpha$ and $\beta$, and the agent performs action $a$, transits to state $s'$ and observes $z$, then the count vector after the transition should be $\alpha' = \alpha + \delta^a_{ss'}$, where $\delta^a_{ss'} \in \mathcal{T}$ is a vector of zeroes with a 1 at the position of the counter $\alpha^a_{ss'}$, and the count vector after the observation should be $\beta' = \beta + \delta^a_{s'z}$, where $\delta^a_{s'z} \in \mathcal{O}$ is a vector of zeroes with a 1 at the position of the counter $\beta^a_{s'z}$. Furthermore, the probabilities of such transitions and observations should be defined by considering all models and their probabilities as specified by the current Dirichlet distributions defined by $\alpha$ and $\beta$. This is exactly what the expected value of the Dirichlet gives; thus we only need to define the transition and observation probabilities using the expected values of the Dirichlet distributions. Hence we define the transition and observation functions $T'$ and $O'$ of the Bayes-Adaptive POMDP as follows:

$$T'((s,\alpha,\beta), a, (s',\alpha',\beta')) = \begin{cases} \dfrac{\alpha^a_{ss'}\, \beta^a_{s'z}}{\left(\sum_{s''} \alpha^a_{ss''}\right)\left(\sum_{z'} \beta^a_{s'z'}\right)} & \text{if } \alpha' = \alpha + \delta^a_{ss'} \text{ and } \beta' = \beta + \delta^a_{s'z} \\ 0 & \text{otherwise} \end{cases} \quad (6)$$

$$O'((s,\alpha,\beta), a, (s',\alpha',\beta'), z) = \begin{cases} 1 & \text{if } \alpha' = \alpha + \delta^a_{ss'} \text{ and } \beta' = \beta + \delta^a_{s'z} \\ 0 & \text{otherwise} \end{cases} \quad (7)$$

Notice that the observation probabilities defined by the Dirichlet distributions are taken into account in the transition function, since a state transition in the Bayes-Adaptive POMDP also specifies, via the way the counts are incremented, which observation will be observed after the transition. As a result, the observation function becomes deterministic. The other particularity is that the observation function depends on both the previous and the current state, since the way the counts are incremented specifies which observation was observed.
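
The following minimal Python sketch illustrates equations 6 and 7 for hypothetical count arrays: the probability of reaching the successor state with counts α + δ and β + δ while observing z is the product of the two Dirichlet expected values, and the successor count vectors are obtained by incrementing a single entry of α and of β. The array shapes and toy counts are illustrative only.

import numpy as np

S, A, Z = 2, 2, 2
alpha = np.ones((S, A, S))      # alpha[s, a, s'], transition counts
beta = np.ones((S, A, Z))       # beta[s', a, z], observation counts

def bapomdp_transition_prob(alpha, beta, s, a, s_next, z):
    """Equation 6: expected transition probability times expected observation probability."""
    p_trans = alpha[s, a, s_next] / alpha[s, a, :].sum()
    p_obs = beta[s_next, a, z] / beta[s_next, a, :].sum()
    return p_trans * p_obs

def increment_counts(alpha, beta, s, a, s_next, z):
    """Successor counts alpha' = alpha + delta^a_{ss'} and beta' = beta + delta^a_{s'z}."""
    alpha2, beta2 = alpha.copy(), beta.copy()
    alpha2[s, a, s_next] += 1
    beta2[s_next, a, z] += 1
    return alpha2, beta2

print(bapomdp_transition_prob(alpha, beta, s=0, a=1, s_next=1, z=0))   # 0.25 with uniform counts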

Since the counts do not affect the reward, we simply define the reward function of the Bayes-Adaptive POMDP as $R'((s,\alpha,\beta),a) = R(s,a)$. The discount factor of the Bayes-Adaptive POMDP is also the same. Finally, if the count vectors $\alpha_0$ and $\beta_0$ represent the prior knowledge on the POMDP model, then the initial belief state of the Bayes-Adaptive POMDP is defined as $b'_0(s,\alpha_0,\beta_0) = b_0(s)$, and $b'_0(s,\alpha,\beta) = 0$ everywhere else.

Using the definitions we just presented, the Bayes-Adaptive POMDP has a known model specified by the tuple $(S', A, \Omega, T', O', R', \gamma, b'_0)$. Using this model, we can compute the probability of observing a certain observation $z$ after doing a certain action $a$ in a belief state $b$ as follows:

$$\begin{aligned} P(z \mid b,a) &= \sum_{\sigma \in S'} b(\sigma) \sum_{\sigma' \in S'} O'(\sigma, a, \sigma', z)\, T'(\sigma, a, \sigma') \\ &= \sum_{(s,\alpha,\beta) \in S'_p(b)} b(s,\alpha,\beta) \sum_{s' \in S} T'((s,\alpha,\beta), a, (s', \alpha + \delta^a_{ss'}, \beta + \delta^a_{s'z})) \\ &= \sum_{(s,\alpha,\beta) \in S'_p(b)} b(s,\alpha,\beta) \sum_{s' \in S} \frac{\alpha^a_{ss'}\, \beta^a_{s'z}}{\left(\sum_{s''} \alpha^a_{ss''}\right)\left(\sum_{z'} \beta^a_{s'z'}\right)} \end{aligned}$$

where $S'_p(b) = \{\sigma \in S' \mid b(\sigma) > 0\}$. Furthermore, we can derive a simplification of the belief update function for the Bayes-Adaptive POMDP:

$$\begin{aligned} b'(s',\alpha',\beta') &= \eta \sum_{(s,\alpha,\beta) \in S'} b(s,\alpha,\beta)\, O'((s,\alpha,\beta), a, (s',\alpha',\beta'), z)\, T'((s,\alpha,\beta), a, (s',\alpha',\beta')) \\ &= \eta \sum_{s \in S} b(s, \alpha' - \delta^a_{ss'}, \beta' - \delta^a_{s'z})\, T'((s, \alpha' - \delta^a_{ss'}, \beta' - \delta^a_{s'z}), a, (s',\alpha',\beta')) \\ &= \eta \sum_{s \in S} b(s, \alpha' - \delta^a_{ss'}, \beta' - \delta^a_{s'z})\, \frac{(\alpha'^a_{ss'} - 1)(\beta'^a_{s'z} - 1)}{\left(\sum_{s''} \alpha'^a_{ss''} - 1\right)\left(\sum_{z'} \beta'^a_{s'z'} - 1\right)} \end{aligned}$$

where the normalization constant is $\eta = 1/P(z \mid b,a)$. Clearly, in practice these terms are computable only if the set $S'_p(b)$ is finite. We prove this in the following theorem.

Theorem 1. Let $(S', A, \Omega, T', O', R', \gamma, b'_0)$ be a Bayes-Adaptive POMDP constructed from the POMDP $(S, A, \Omega, T, O, R, \gamma, b_0)$. If S is finite, then at any time t the set $S'_p(b_t) = \{\sigma \in S' \mid b_t(\sigma) > 0\}$ is finite. Furthermore, $|S'_p(b_t)| \leq |S|^{t+1}$.

Proof. We proceed by induction. The base case is immediate: when $t = 0$, $b'_0(s,\alpha,\beta)$ is 0 except if $\alpha = \alpha_0$ and $\beta = \beta_0$. Hence $|S'_p(b_0)| \leq |S|$ and therefore $S'_p(b_0)$ is finite. For the induction step, assume that $S'_p(b_{t-1})$ is finite and that $|S'_p(b_{t-1})| \leq |S|^t$; we show that $|S'_p(b_t)| \leq |S|^{t+1}$. From the definition of the belief update function, we see that $b_t(s',\alpha',\beta')$ can be greater than 0 only if there is an $(s,\alpha,\beta)$ such that $b_{t-1}(s,\alpha,\beta) > 0$, $\alpha' = \alpha + \delta^a_{ss'}$ and $\beta' = \beta + \delta^a_{s'z}$. Hence, a particular $(s,\alpha,\beta)$ with $b_{t-1}(s,\alpha,\beta) > 0$ yields non-zero probabilities for at most $|S|$ different states in $b_t$, namely $\{(s', \alpha + \delta^a_{ss'}, \beta + \delta^a_{s'z}) \mid s' \in S, a = a_{t-1}, z = z_{t-1}\}$. Since by assumption $|S'_p(b_{t-1})| \leq |S|^t$, and each probable state in $S'_p(b_{t-1})$ generates at most $|S|$ probable states in $b_t$, it follows that $|S'_p(b_t)| \leq |S|^{t+1}$. Hence $S'_p(b_t)$ is also finite, since S is finite by assumption. Having proven the base case and the induction step, we conclude that $S'_p(b_t)$ is finite and bounded by $|S|^{t+1}$ for all t.

This proof suggests that we only need to iterate over S and $S'_p(b_t)$ in order to update the belief state $b_t$ when an action is taken and an observation is received in the environment. Hence we will generally use the following algorithm for the belief update function τ:

function τ(b, a, z)
    initialize b' as a zero vector
    η ← 0
    for all (s, α, β) ∈ S'_p(b) do
        η_T ← Σ_{s''} α^a_{s s''}
        for all s' ∈ S do
            α' ← α + δ^a_{s s'}
            β' ← β + δ^a_{s' z}
            η_O ← Σ_{z'} β^a_{s' z'}
            tmp ← b(s, α, β) · α^a_{s s'} · β^a_{s' z} / (η_T · η_O)
            b'(s', α', β') ← b'(s', α', β') + tmp
            η ← η + tmp
        end for
    end for
    return (1/η) · b'

Using these definitions of $\tau(b,a,z)$ and $P(z \mid b,a)$, we can characterize the optimal solution of the Bayes-Adaptive POMDP as in a standard POMDP, with equations 2 and 4. The only difference is that to compute the immediate reward $R'(b,a)$ we iterate over $S'_p(b)$ instead of S, i.e. $R'(b,a) = \sum_{(s,\alpha,\beta) \in S'_p(b)} R(s,a)\, b(s,\alpha,\beta)$.
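
For concreteness, here is a minimal Python sketch of the τ routine above. It assumes the belief is represented as a dictionary mapping (s, α, β) states to probabilities, with the count arrays stored as nested tuples so they can serve as dictionary keys; this representation and the toy model sizes are illustrative only.

import numpy as np

def totuple(x):
    """Turn a count array into nested tuples so it can be used as a dict key."""
    return tuple(totuple(r) for r in x) if isinstance(x, np.ndarray) else float(x)

def tau(b, a, z, S):
    """Exact BAPOMDP belief update; b maps (s, alpha, beta) to a probability."""
    b_next, eta = {}, 0.0
    for (s, alpha, beta), p in b.items():
        alpha_arr, beta_arr = np.array(alpha), np.array(beta)
        eta_T = alpha_arr[s, a, :].sum()                    # sum_{s''} alpha^a_{s s''}
        for s2 in range(S):
            eta_O = beta_arr[s2, a, :].sum()                # sum_{z'} beta^a_{s' z'}
            tmp = p * alpha_arr[s, a, s2] * beta_arr[s2, a, z] / (eta_T * eta_O)
            alpha2, beta2 = alpha_arr.copy(), beta_arr.copy()
            alpha2[s, a, s2] += 1                           # alpha' = alpha + delta^a_{s s'}
            beta2[s2, a, z] += 1                            # beta'  = beta  + delta^a_{s' z}
            key = (s2, totuple(alpha2), totuple(beta2))
            b_next[key] = b_next.get(key, 0.0) + tmp
            eta += tmp                                      # eta accumulates P(z | b, a)
    return {sigma: prob / eta for sigma, prob in b_next.items()}

# Initial belief over a hypothetical 2-state, 1-action, 2-observation model
# with uniform prior counts alpha_0 = beta_0 = 1.
S, A, Z = 2, 1, 2
alpha0, beta0 = totuple(np.ones((S, A, S))), totuple(np.ones((S, A, Z)))
b0 = {(0, alpha0, beta0): 0.5, (1, alpha0, beta0): 0.5}
b1 = tau(b0, a=0, z=1, S=S)
print(len(b1), sum(b1.values()))   # support size <= |S|^2 = 4, probabilities sum to 1

Running the example shows the belief support growing by at most a factor of |S| per update, as guaranteed by Theorem 1.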

3.3 Approximate solution

Since solving a Bayes-Adaptive POMDP is equivalent to solving a POMDP with an infinite number of states, it is clearly quite a challenging task. Standard offline approaches that use value iteration with piecewise linear function approximators do not seem applicable here, as a linear function over the infinite-dimensional belief space would require an infinite number of parameters to be specified. Hence, a discretization of the problem to a finite number of states would be required to use such methods. On the other hand, one particular approach that could work well is to simply plan online by doing a K-step lookahead search each time the agent must perform an action. This yields an approximate policy, since it only plans for a horizon of K instead of the infinite horizon.

However, another problem to address is that the complexity of updating the belief state grows exponentially with the length of the history, i.e. $O(|S|^{t+2}|\Omega|)$, where t is the current time step and S is the original finite state set of the POMDP with unknown parameters. Since t can grow arbitrarily large, we would like to eliminate this dependence on t by using some approximation. One way to do this is to limit the size of $S'_p(b)$ to a certain constant n. In that case the complexity is limited to $O(n|S||\Omega|)$. In order to limit the number of probable states in b, we can use different methods, such as sampling n probable states, or keeping the n most probable states in $\tau(b,a,z)$ and renormalizing the belief state over only those n probable states. We refer to $\hat\tau(b,a,z,n)$ as this approximate belief update function that limits the number of probable states to n. The following variant of the RTBSS algorithm [15] implements these ideas:

function RTBSS(b, k, n)
    inputs:  b: the current belief state
             k: the remaining depth of the search to perform
             n: the number of probable states we keep in the belief state
    static:  K: the total depth of the search
             actionToDo: the next action to perform in the environment
    if k = 0 then
        return max_{a ∈ A} R(b, a)
    end if
    maxQ ← −∞
    for all a ∈ A do
        Q_a ← R(b, a)
        for all z ∈ Ω do
            Q_a ← Q_a + γ · P(z | b, a) · RTBSS(τ̂(b, a, z, n), k − 1, n)
        end for
        if Q_a > maxQ then
            maxQ ← Q_a
            argmaxQ ← a
        end if
    end for
    if k = K then
        actionToDo ← argmaxQ
    end if
    return maxQ

The algorithm simply computes, for each action, the discounted sum of rewards over a horizon of K using the approximate belief update function, and performs the action that maximizes it. This algorithm is executed each time the agent must choose an action in the environment. Its complexity is in $O((|A||\Omega|)^K\, n|S||\Omega|)$. Because the complexity depends heavily on $|A|$ and $|\Omega|$, the depth of the search K will be limited when there are many actions and observations. However, the algorithm is expected to provide a good and efficient approximation when A and Ω are small.
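
As a minimal sketch of the truncated update τ̂(b, a, z, n) used by RTBSS above, the following Python function reuses the tau sketch given at the end of Section 3.2 and simply keeps the n most probable states before renormalizing, which is one of the two options mentioned above (the other being sampling).

def tau_hat(b, a, z, n, S):
    """Approximate BAPOMDP belief update: exact update, then keep only the
    n most probable (s, alpha, beta) states and renormalize over them."""
    b_next = tau(b, a, z, S)     # exact update from the sketch at the end of Section 3.2
    kept = sorted(b_next.items(), key=lambda kv: kv[1], reverse=True)[:n]
    total = sum(p for _, p in kept)
    return {sigma: p / total for sigma, p in kept}

Capping the support at n is what keeps the per-update cost at $O(n|S||\Omega|)$ as stated above, at the price of discarding some probability mass at each step.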

4 Conclusion

In conclusion, we have proposed a new mathematical model, the Bayes-Adaptive POMDP, that allows us to take into account uncertainty on the parameters of a standard POMDP model. The Bayes-Adaptive POMDP, when solved exactly, provides an optimal exploration-exploitation trade-off that maximizes reward over the infinite horizon while planning actions that gather information on the model when this is profitable. Because the Bayes-Adaptive POMDP has a very high complexity, we proposed a simple online lookahead search using an approximate belief update function to find an approximate solution to this problem. In future work, we would like to gather experimental results that tell us how efficient this approach is and what size of problems it can tackle. We would also like to explore other belief state approximations, such as using parametric distributions to represent the belief state. Finally, further theoretical analysis of these approximations will be required to determine error bounds on the performance of these approaches.

References

1. Papadimitriou, C., Tsitsiklis, J.N.: The complexity of Markov decision processes. Mathematics of Operations Research 12 (1987)
2. Madani, O., Hanks, S., Condon, A.: On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In: Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), The MIT Press (1999)
3. Koenig, S., Simmons, R.: Unsupervised learning of probabilistic models for robot navigation. In: Proceedings of the IEEE International Conference on Robotics and Automation (1996)
4. Jaulmes, R., Pineau, J., Precup, D.: Active learning in partially observable Markov decision processes. In: Proceedings of the 16th European Conference on Machine Learning (ECML) (2005)
5. Duff, M.: Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes. PhD thesis, University of Massachusetts, Amherst, USA (2002)
6. Littman, M.L.: Algorithms for sequential decision making. PhD thesis, Brown University (1996)
7. Cassandra, A., Littman, M.L., Zhang, N.L.: Incremental pruning: a simple, fast, exact method for partially observable Markov decision processes. In: Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (UAI-97) (1997)
8. Pineau, J., Gordon, G., Thrun, S.: Point-based value iteration: an anytime algorithm for POMDPs. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-03), Acapulco, Mexico (2003)
9. Pineau, J.: Tractable planning under uncertainty: exploiting structure. PhD thesis, Carnegie Mellon University, Pittsburgh, PA (2004)
10. Spaan, M.T.J., Vlassis, N.: Perseus: randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research 24 (2005)
11. Smith, T., Simmons, R.: Heuristic search value iteration for POMDPs. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI-04), Banff, Canada (2004)
12. Smith, T., Simmons, R.: Point-based POMDP algorithms: improved analysis and implementation. In: Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI-05), Edinburgh, Scotland (2005)
13. Washington, R.: BI-POMDP: bounded, incremental partially observable Markov model planning. In: Proceedings of the 4th European Conference on Planning. Volume 1348 of Lecture Notes in Computer Science, Toulouse, France, Springer (1997)
14. Satia, J.K., Lave, R.E.: Markovian decision processes with probabilistic observation of states. Management Science 20 (1973)
15. Paquet, S., Tobin, L., Chaib-draa, B.: An online POMDP algorithm for complex multiagent environments. In: Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multi Agent Systems (AAMAS-05), Utrecht, The Netherlands (2005)
16. Ross, S., Chaib-draa, B.: AEMS: an anytime online search algorithm for approximate policy refinement in large POMDPs. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI) (2007)
17. Kearns, M., Mansour, Y., Ng, A.Y.: A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning 49 (2002)
18. McAllester, D., Singh, S.: Approximate planning for factored POMDPs using belief state simplification. In: Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), San Francisco, CA, Morgan Kaufmann Publishers (1999)
19. Bertsekas, D.P., Castanon, D.A.: Rollout algorithms for stochastic scheduling problems. Journal of Heuristics 5 (1999)


More information

Towards Faster Planning with Continuous Resources in Stochastic Domains

Towards Faster Planning with Continuous Resources in Stochastic Domains Towards Faster Planning with Continuous Resources in Stochastic Domains Janusz Marecki and Milind Tambe Computer Science Department University of Southern California 941 W 37th Place, Los Angeles, CA 989

More information

CAP Plan, Activity, and Intent Recognition

CAP Plan, Activity, and Intent Recognition CAP6938-02 Plan, Activity, and Intent Recognition Lecture 10: Sequential Decision-Making Under Uncertainty (part 1) MDPs and POMDPs Instructor: Dr. Gita Sukthankar Email: gitars@eecs.ucf.edu SP2-1 Reminder

More information

Sensitivity Analysis of POMDP Value Functions

Sensitivity Analysis of POMDP Value Functions Sensitivity Analysis of POMDP Value Functions Stephane Ross, Carnegie Mellon Universiy Pittsburgh, USA Masoumeh Izadi, Mark Mercer, David Buckeridge McGill University Montreal, Canada Abstract In sequential

More information

Artificial Intelligence & Sequential Decision Problems

Artificial Intelligence & Sequential Decision Problems Artificial Intelligence & Sequential Decision Problems (CIV6540 - Machine Learning for Civil Engineers) Professor: James-A. Goulet Département des génies civil, géologique et des mines Chapter 15 Goulet

More information

Reinforcement Learning and Control

Reinforcement Learning and Control CS9 Lecture notes Andrew Ng Part XIII Reinforcement Learning and Control We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make

More information

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon Optimal Control of Partiality Observable Markov Processes over a Finite Horizon Report by Jalal Arabneydi 04/11/2012 Taken from Control of Partiality Observable Markov Processes over a finite Horizon by

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Uncertainty & Probabilities & Bandits Daniel Hennes 16.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Uncertainty Probability

More information

Decayed Markov Chain Monte Carlo for Interactive POMDPs

Decayed Markov Chain Monte Carlo for Interactive POMDPs Decayed Markov Chain Monte Carlo for Interactive POMDPs Yanlin Han Piotr Gmytrasiewicz Department of Computer Science University of Illinois at Chicago Chicago, IL 60607 {yhan37,piotr}@uic.edu Abstract

More information

Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs

Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs Liam MacDermed College of Computing Georgia Institute of Technology Atlanta, GA 30332 liam@cc.gatech.edu Charles L. Isbell College

More information

Bayes Adaptive Reinforcement Learning versus Off-line Prior-based Policy Search: an Empirical Comparison

Bayes Adaptive Reinforcement Learning versus Off-line Prior-based Policy Search: an Empirical Comparison Bayes Adaptive Reinforcement Learning versus Off-line Prior-based Policy Search: an Empirical Comparison Michaël Castronovo University of Liège, Institut Montefiore, B28, B-4000 Liège, BELGIUM Damien Ernst

More information

Introduction to Reinforcement Learning. CMPT 882 Mar. 18

Introduction to Reinforcement Learning. CMPT 882 Mar. 18 Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Coarticulation in Markov Decision Processes

Coarticulation in Markov Decision Processes Coarticulation in Markov Decision Processes Khashayar Rohanimanesh Department of Computer Science University of Massachusetts Amherst, MA 01003 khash@cs.umass.edu Sridhar Mahadevan Department of Computer

More information

CS 570: Machine Learning Seminar. Fall 2016

CS 570: Machine Learning Seminar. Fall 2016 CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or

More information

Temporal Difference Learning & Policy Iteration

Temporal Difference Learning & Policy Iteration Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.

More information

Distributed Optimization. Song Chong EE, KAIST

Distributed Optimization. Song Chong EE, KAIST Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links

More information

Lecture 4: Approximate dynamic programming

Lecture 4: Approximate dynamic programming IEOR 800: Reinforcement learning By Shipra Agrawal Lecture 4: Approximate dynamic programming Deep Q Networks discussed in the last lecture are an instance of approximate dynamic programming. These are

More information

A Reinforcement Learning Algorithm with Polynomial Interaction Complexity for Only-Costly-Observable MDPs

A Reinforcement Learning Algorithm with Polynomial Interaction Complexity for Only-Costly-Observable MDPs A Reinforcement Learning Algorithm with Polynomial Interaction Complexity for Only-Costly-Observable MDPs Roy Fox Computer Science Department, Technion IIT, Israel Moshe Tennenholtz Faculty of Industrial

More information

Probabilistic inference for computing optimal policies in MDPs

Probabilistic inference for computing optimal policies in MDPs Probabilistic inference for computing optimal policies in MDPs Marc Toussaint Amos Storkey School of Informatics, University of Edinburgh Edinburgh EH1 2QL, Scotland, UK mtoussai@inf.ed.ac.uk, amos@storkey.org

More information

Learning in POMDPs with Monte Carlo Tree Search

Learning in POMDPs with Monte Carlo Tree Search Sammie Katt 1 Frans A. Oliehoek 2 Christopher Amato 1 Abstract The POMDP is a powerful framework for reasoning under outcome and information uncertainty, but constructing an accurate POMDP model is difficult.

More information