Heuristic Search Value Iteration for POMDPs


Trey Smith and Reid Simmons
Robotics Institute, Carnegie Mellon University

Abstract

We present a novel POMDP planning algorithm called heuristic search value iteration (HSVI). HSVI is an anytime algorithm that returns a policy and a provable bound on its regret with respect to the optimal policy. HSVI gets its power by combining two well-known techniques: attention-focusing search heuristics and piecewise linear convex representations of the value function. HSVI's soundness and convergence have been proven. On some benchmark problems from the literature, HSVI displays speedups of greater than 100 with respect to other state-of-the-art POMDP value iteration algorithms. We also apply HSVI to a new rover exploration problem 10 times larger than most POMDP problems in the literature.

1 INTRODUCTION

Partially observable Markov decision processes (POMDPs) constitute a powerful probabilistic model for planning problems that include hidden state and uncertainty in action effects. There are a wide variety of solution approaches. To date, problems of a few hundred states are at the limits of tractability.

The present work gathers a number of threads in the POMDP literature. Our HSVI algorithm draws on prior approaches that combine heuristic search and value iteration [Washington, 1997, Geffner and Bonet, 1998], and a multitude of algorithms that employ a piecewise linear convex value function representation and gradient backups [Cassandra et al., 1997, Pineau et al., 2003]. It keeps compact representations of both upper and lower bounds on the value function [Hauskrecht, 1997]. Making use of these bounds, HSVI incorporates a novel "excess uncertainty" observation heuristic that empirically outperforms the usual sampling, and allows us to derive a theoretical bound on time complexity.

By employing all of these techniques, HSVI gains a number of benefits. Its use of heuristic search forward from an initial belief (aided by the new observation heuristic) avoids unreachable or otherwise irrelevant parts of the belief space. Its representations for both bounds are compact and well-suited to generalizing local updates: improving the bounds at a specific belief also improves them at neighboring beliefs. Some weaknesses of HSVI are that it is relatively complicated, and its upper bound updates are a source of major overhead that only becomes worthwhile on the larger problems we studied.

This paper describes HSVI, discusses its soundness and convergence, and compares its performance with other state-of-the-art value iteration algorithms on four benchmark problems from the literature. On some of these problems, HSVI displays speedups of greater than 100. We provide additional results on the new RockSample rover exploration problem. Our largest instance of RockSample has 12,545 states, 10 times larger than most problems in the scalable POMDP literature.

2 POMDP INTRODUCTION

A POMDP models an agent acting under uncertainty. At each time step, the agent selects an action that has some stochastic result, then receives a noisy observation. The sequence of events can be viewed as a tree structure (fig. 1). Nodes of the tree are points where the agent must make a decision. We label each node with the belief $b$ that the agent would have if it reached that node. The root node is labeled with the initial belief, $b_0$.
Starting from node $b$, selecting action $a$, and receiving observation $o$, the agent proceeds to a new belief $\tau(b, a, o)$, corresponding to one of the children of $b$ in the tree structure.

Formally, the POMDP is a tuple $\langle S, A, O, T, O, R, \gamma, b_0 \rangle$, where $S$ is the set of states, $A$ is the set of actions, $O$ is the set of observations, $T$ is the stochastic transition function, $O$ is the stochastic observation function, $R$ is the reward function, $\gamma < 1$ is the discount factor, and $b_0$ is the agent's belief about the initial state.

Figure 1: POMDP tree structure. (Decision nodes labeled with beliefs, starting from $b_0$ at $t = 0$, branch on actions $a_1, \dots, a_n$ and observations $o_1, \dots, o_n$.)

Let $s_t$, $a_t$, and $o_t$ denote, respectively, the state, action, and observation at time $t$. Then we define
$$b_0(s) = \Pr(s_0 = s)$$
$$T(s, a, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$$
$$O(s', a, o) = \Pr(o_t = o \mid s_{t+1} = s', a_t = a).$$

Let $a^t = \{a_0, \dots, a_t\}$ denote the history of actions up to time $t$, and similarly define $o^t$. At time $t+1$, the agent does not know $s_{t+1}$, but does know the initial belief $b_0$ and the histories $a^t$ and $o^t$. The agent can act optimally on this information by conditioning its policy on its current belief at every step. The belief is recursively updated as $b_{t+1} = \tau(b_t, a_t, o_t)$, where if $b' = \tau(b, a, o)$, then
$$b'(s') = \eta\, O(s', a, o) \sum_{s} T(s, a, s')\, b(s),$$
and $\eta$ is a normalizing constant.

The agent's policy $\pi$ specifies an action $\pi(b)$ to follow given any current belief $b$. The (expected) long-term reward for a policy $\pi$, starting from a belief $b$, is defined to be
$$J^\pi(b) = E\!\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, b, \pi \right].$$
The optimal POMDP planning problem is to compute a policy $\pi^*$ that optimizes long-term reward:
$$\pi^* = \operatorname{argmax}_\pi J^\pi(b_0).$$
The usual goal for approximate POMDP planning is to minimize the regret of the returned policy $\pi$, defined to be
$$\mathrm{regret}(\pi, b) = J^{\pi^*}(b) - J^\pi(b).$$
In particular, we focus on minimizing $\mathrm{regret}(\pi, b_0)$ for the initial belief $b_0$ specified as part of the problem.

3 HEURISTIC SEARCH VALUE ITERATION

HSVI is an approximate POMDP solution algorithm that combines techniques for heuristic search with piecewise linear convex value function representations. HSVI stores upper and lower bounds on the optimal value function $V^*$. Its fundamental operation is to make a local update at a specific belief, where the beliefs to update are chosen by exploring forward in the search tree according to heuristics that select actions and observations.

HSVI makes asynchronous (Gauss-Seidel) updates to the value function bounds, and always bases its heuristics on the most recent bounds when choosing which successor to visit. It uses a depth-first exploration strategy. Beyond the usual memory vs. time trade-off, this choice makes sense because a breadth-first heuristic search typically employs a priority queue, and propagating the effects of asynchronous bounds updates to the priorities of queue elements would create substantial extra overhead.

We refer to the lower and upper bound functions as $\underline{V}$ and $\overline{V}$, respectively. We use the interval function $\hat{V}$ to refer to them collectively, such that
$$\hat{V}(b) = [\underline{V}(b), \overline{V}(b)], \qquad \mathrm{width}(\hat{V}(b)) = \overline{V}(b) - \underline{V}(b).$$
HSVI is outlined in algs. 1 and 2. The following subsections describe aspects of the algorithm in more detail.

3.1 VALUE FUNCTION REPRESENTATION

Most value iteration algorithms focus on storing and updating the lower bound. The vector set representation is commonly used. The value at a point $b$ is the maximum projection of $b$ onto a finite set $\Gamma_{\underline{V}}$ of vectors $\alpha$:
$$\underline{V}(b) = \max_{\alpha \in \Gamma_{\underline{V}}} (\alpha \cdot b).$$
For finite-horizon POMDPs, a finite vector set can represent $V^*$ exactly [Sondik, 1971]. Even for the discounted infinite-horizon formulation, a finite vector set can approximate $V^*$ arbitrarily closely. Equally important, when the value function is a lower bound, it is easy to perform a local update on the vector set by adding a new vector.
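As a concrete illustration of the definitions above, the following sketch implements the belief update $\tau(b, a, o)$ and the vector-set lower bound $\underline{V}(b) = \max_{\alpha}(\alpha \cdot b)$. It is an illustrative reading of the equations, not the authors' code; the array layouts and names (`T`, `O`, `Gamma`) are assumptions.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """tau(b, a, o): Bayes-filter belief update for a POMDP.

    Assumed array layout (illustration only):
      b : (|S|,)           current belief
      T : (|S|, |A|, |S|)  T[s, a, s'] = Pr(s' | s, a)
      O : (|S|, |A|, |O|)  O[s', a, o] = Pr(o | s', a)
    Returns the new belief and eta = Pr(o | b, a).
    """
    pred = b @ T[:, a, :]            # sum_s T(s, a, s') b(s)
    b_next = O[:, a, o] * pred       # unnormalized: O(s', a, o) * prediction
    eta = b_next.sum()               # normalizer, equal to Pr(o | b, a)
    if eta == 0.0:                   # observation o is impossible from (b, a)
        return b, 0.0
    return b_next / eta, eta

def lower_bound_value(b, Gamma):
    """Vector-set lower bound: V(b) = max_{alpha in Gamma} alpha . b."""
    return max(float(alpha @ b) for alpha in Gamma)
```

Note that the normalizer $\eta$ is exactly $\Pr(o \mid b, a)$, which the exploration heuristics of Section 3.4 also need.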
Unfortunately, if we represent the upper bound with a vector set, updating by adding a vector does not have the desired effect of improving the bound in the neighborhood of the local update. To accommodate the need for updates, we use a point set representation for the upper bound. The value at a point $b$ is the projection of $b$ onto the convex hull formed by a finite set $\Upsilon_{\overline{V}}$ of belief/value points $(b_i, v_i)$. Updates are performed by adding a new point to the set. The projection onto the convex hull is calculated with a linear program (LP). This upper bound representation and LP technique were suggested in [Hauskrecht, 2000], but in that work LP projection seems to have been rejected, without testing, on time complexity grounds.
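A minimal sketch of that LP follows. The paper solves it with ILOG CPLEX; here SciPy's `linprog` stands in, and the point-set layout is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def upper_bound_value(b, points):
    """Project belief b onto the convex hull of the upper-bound point set.

    points: list of (b_i, v_i) pairs, assumed to include the belief-simplex
    corners (from the MDP bound of Section 3.2) so the LP is always feasible.
    Solves  min_lambda  sum_i lambda_i v_i
            s.t.        sum_i lambda_i b_i = b,   lambda_i >= 0;
    sum_i lambda_i = 1 is implied because every belief sums to 1.
    """
    B = np.column_stack([b_i for b_i, _ in points])   # |S| x |Upsilon|
    v = np.array([v_i for _, v_i in points])
    res = linprog(c=v, A_eq=B, b_eq=b, bounds=(0, None), method="highs")
    return float(res.fun)
```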

Note that, with the high dimensionality of the belief space in our larger problems, LP projection is far more efficient than explicitly calculating the convex hull: an explicit representation would not even fit into available memory. We solve the LP using the commercial ILOG CPLEX software package.

Algorithm 1: $\pi$ = HSVI($\epsilon$). HSVI($\epsilon$) returns a policy $\pi$ such that $\mathrm{regret}(\pi, b_0) \leq \epsilon$.
1. Initialize the bounds $\hat{V}$.
2. While $\mathrm{width}(\hat{V}(b_0)) > \epsilon$, repeatedly invoke explore($b_0$, $\epsilon$, 0).
3. Having achieved the desired precision, return the direct-control policy $\pi$ corresponding to the lower bound.
(In fact, $\pi$ can be executed starting at any belief $b$. In general, $\mathrm{regret}(\pi, b) \leq \mathrm{width}(\hat{V}(b))$, which is guaranteed to be less than $\epsilon$ only at $b_0$.)

Algorithm 2: explore($b$, $\epsilon$, $t$). explore recursively follows a single path down the search tree until satisfying a termination condition based on the width of the bounds interval. It then performs a series of updates on its way back up to the initial belief.
1. If $\mathrm{width}(\hat{V}(b)) \leq \epsilon \gamma^{-t}$, return.
2. Select an action $a^*$ and observation $o^*$ according to the forward exploration heuristics.
3. Call explore($\tau(b, a^*, o^*)$, $\epsilon$, $t+1$).
4. Locally update the bounds $\hat{V}$ at belief $b$.

Figure 2: Locally updating at $b$. (Left: the upper-bound points with their convex hull and the lower-bound vectors, bracketing $V^*$; right: the result of an update at $b$.)

3.2 INITIALIZATION

HSVI requires initial bounds, which we would like to have the following properties:
1. Validity: $\underline{V}_0 \leq V^* \leq \overline{V}_0$. (Throughout this paper, inequalities between functions are universally quantified, i.e., $V \leq V'$ means $V(b) \leq V'(b)$ for all $b$.)
2. Uniform improvability: this property is explained in the section on theoretical results.
3. Precision: the bounds should be fairly close to $V^*$.
4. Efficiency: initialization should take a negligible proportion of the overall running time.

The following initialization procedure meets these requirements. We calculate $\underline{V}_0$ using the blind policy method [Hauskrecht, 1997]. Let $\pi_a$ be the policy of always selecting action $a$. We can calculate a lower bound $R_a$ on the long-term reward of $\pi_a$ by assuming that we are always in the worst state to choose action $a$ from:
$$R_a = \sum_{t=0}^{\infty} \gamma^t \min_s R(s, a) = \frac{\min_s R(s, a)}{1 - \gamma}.$$
We select the tightest of these bounds by maximizing: $\bar{R} = \max_a R_a$. Then the vector set for the initial lower bound $\underline{V}_0$ contains a single vector $\alpha$ such that every $\alpha(s) = \bar{R}$.

To initialize the upper bound, we assume full observability and solve the MDP version of the problem [Astrom, 1965]. This provides upper bound values at the corners of the belief simplex, which form the initial point set. We call the resulting upper bound $V_{MDP}$.

3.3 LOCAL UPDATES

The Bellman update, $H$, is the fundamental operation of value iteration. It is defined as follows:
$$Q_V(b, a) = \sum_s R(s, a)\, b(s) + \gamma \sum_o \Pr(o \mid b, a)\, V(\tau(b, a, o))$$
$$HV(b) = \max_a Q_V(b, a).$$
$Q_V(b, a)$ can be interpreted as the value of taking action $a$ from belief $b$. Exact value iteration calculates this update exactly over the entire belief space. HSVI, however, uses local update operators based on $H$. Because the lower and upper bound are represented differently, we have distinct local update operators $L_b$ and $U_b$. Locally updating at $b$ means applying both operators. To update the lower bound vector set, we add a vector. To update the upper bound point set, we add a point.
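The operators defined next, as well as the heuristics of Section 3.4, reuse $Q_V$ and $H$. A hedged sketch of those two computations, building on the `belief_update` sketch from Section 2 (names and array layouts remain illustrative assumptions, not the paper's implementation), is:

```python
def q_value(b, a, R, T, O, gamma, value_fn):
    """Q_V(b, a) = sum_s R(s, a) b(s) + gamma * sum_o Pr(o | b, a) V(tau(b, a, o)).

    R is assumed to be a (|S|, |A|) reward array; value_fn(b') evaluates the
    bound being backed up (e.g. lower_bound_value or upper_bound_value above).
    """
    value = float(R[:, a] @ b)                         # immediate expected reward
    for o in range(O.shape[2]):
        b_next, pr_o = belief_update(b, a, o, T, O)    # pr_o = Pr(o | b, a)
        if pr_o > 0.0:
            value += gamma * pr_o * value_fn(b_next)
    return value

def bellman(b, R, T, O, gamma, value_fn):
    """HV(b) = max_a Q_V(b, a)."""
    return max(q_value(b, a, R, T, O, gamma, value_fn) for a in range(R.shape[1]))
```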
The operators are defined as:
$$\Gamma_{L_b V} = \Gamma_V \cup \{\mathrm{backup}(V, b)\}$$
$$\Upsilon_{U_b V} = \Upsilon_V \cup \{(b, H\overline{V}(b))\},$$
where backup($V$, $b$) is the usual gradient backup, described in alg. 3.
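In the same illustrative style, applying both operators at $b$ amounts to one vector append and one point append. Here `backup` is Algorithm 3 (sketched in the next section), while `upper_bound_value` and `bellman` are the earlier sketches; pruning of dominated elements, discussed below, is omitted.

```python
def local_update(b, Gamma, points, R, T, O, gamma):
    """Locally update both bounds at belief b.

    L_b: append the gradient-backup vector to the lower-bound set Gamma.
    U_b: append the point (b, H V_upper(b)) to the upper-bound point set,
         with V_upper evaluated by the LP projection.
    """
    Gamma.append(backup(b, Gamma, R, T, O, gamma))
    v_upper = lambda bb: upper_bound_value(bb, points)
    points.append((b, bellman(b, R, T, O, gamma, v_upper)))
```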

Algorithm 3: $\beta$ = backup($V$, $b$). The backup function can be viewed as a generalization of the Bellman update that makes use of gradient information. The assignments are universally quantified, e.g., $\beta_{a,o}$ is computed for every $a$, $o$.
1. $\beta_{a,o} \leftarrow \operatorname{argmax}_{\alpha \in \Gamma_V} (\alpha \cdot \tau(b, a, o))$
2. $\beta_a(s) \leftarrow R(s, a) + \gamma \sum_{o, s'} \beta_{a,o}(s')\, O(s', a, o)\, T(s, a, s')$
3. $\beta \leftarrow \operatorname{argmax}_{\beta_a} (\beta_a \cdot b)$

Fig. 2 represents the structure of the bounds representations and the process of locally updating at $b$. In the left side of the figure, the points and dotted lines represent $\overline{V}$ (upper bound points and convex hull). Several solid lines represent the vectors of $\Gamma_{\underline{V}}$. In the right side of the figure, we see the result of updating both bounds at $b$, which involves adding a new point to $\Upsilon_{\overline{V}}$ and a new vector to $\Gamma_{\underline{V}}$, bringing both bounds closer to $V^*$.

HSVI periodically prunes dominated elements in both the lower bound vector set and the upper bound point set. Pruning occurs each time the size of the relevant set has increased by 10% since the last pruning episode. This pruning frequency was not carefully optimized, but there is not much to be gained by tuning it, since we do not see substantial overhead either from keeping around up to 10% too many elements or from the pruning operation itself. For the lower bound, we prune only vectors that are pointwise dominated (i.e., dominated by a single other vector). This type of pruning does not eliminate all redundant vectors, but it is simple and fast. For the upper bound, we prune all dominated points, defined as $(b_i, v_i)$ such that $H\overline{V}(b_i) < v_i$.

3.4 FORWARD EXPLORATION HEURISTICS

This section discusses the heuristics that are used to decide which child of the current node to visit as the algorithm works its way forward from the initial belief. Starting from parent node $b$, HSVI must choose an action $a^*$ and an observation $o^*$: the child node to visit is $\tau(b, a^*, o^*)$. Define the uncertainty at $b$ to mean the width of the bounds interval. Recalling that the regret of a policy returned by HSVI is bounded by the uncertainty at the root node $b_0$, our goal in designing the heuristics is to ensure that updates at the chosen child tend to reduce the uncertainty at the root.

First we consider the choice of action. Define the interval function $\hat{Q}$ as follows:
$$\hat{Q}(b, a) = [Q_{\underline{V}}(b, a),\ Q_{\overline{V}}(b, a)].$$

Figure 3: Relationship between $\hat{Q}(b, a_i)$ and $H\hat{V}(b)$.

Fig. 3 shows the relationship between the bounds $\hat{Q}(b, a)$ on each potential action and the bounds $H\hat{V}(b)$ at $b$ after a Bellman update. We see that the $H\hat{V}(b)$ interval is determined by only two of the $\hat{Q}(b, a)$ intervals: the ones with the maximal upper and lower bounds. This relationship immediately suggests that, among the $\hat{Q}$ intervals, we should choose to update one of these two most promising actions. But which one? It turns out we can guarantee convergence only by choosing the action with the greatest upper bound:
$$a^* = \operatorname{argmax}_a Q_{\overline{V}}(b, a).$$
This is sometimes called the IE-MAX heuristic [Kaelbling, 1993]. It works because, if we repeatedly choose an $a^*$ that is sub-optimal, we will eventually discover its sub-optimality when the $a^*$ upper bound drops below the upper bound of another action. However, if we were to choose $a^*$ according to the highest lower bound, we might never discover its sub-optimality, because further work could only cause its lower bound to rise.
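A line-by-line transcription of Algorithm 3 into the same illustrative Python (array layouts as in the earlier sketches; not the authors' implementation) is:

```python
import numpy as np

def backup(b, Gamma, R, T, O, gamma):
    """Gradient (alpha-vector) backup of the lower bound at belief b (Algorithm 3)."""
    num_S, num_A = R.shape
    num_O = O.shape[2]
    best_beta, best_val = None, -np.inf
    for a in range(num_A):
        # Step 1: beta_{a,o} = argmax_{alpha in Gamma} alpha . tau(b, a, o).
        beta_ao = np.zeros((num_O, num_S))
        for o in range(num_O):
            b_next, _ = belief_update(b, a, o, T, O)   # falls back to b if o is impossible
            beta_ao[o] = max(Gamma, key=lambda alpha: float(alpha @ b_next))
        # Step 2: beta_a(s) = R(s,a) + gamma * sum_{o,s'} beta_{a,o}(s') O(s',a,o) T(s,a,s').
        inner = (O[:, a, :] * beta_ao.T).sum(axis=1)   # sum_o beta_{a,o}(s') O(s', a, o)
        beta_a = R[:, a] + gamma * (T[:, a, :] @ inner)
        # Step 3: keep the beta_a that maximizes beta_a . b.
        if float(beta_a @ b) > best_val:
            best_beta, best_val = beta_a, float(beta_a @ b)
    return best_beta
```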
Next we need to select an observation $o^*$. Consider the relationship between $\hat{Q}(b, a^*)$ and the bounds at the various child nodes $\tau(b, a^*, o)$ that correspond to different observations. From the Bellman equation, we have
$$\mathrm{width}(\hat{Q}(b, a^*)) = \gamma \sum_o \Pr(o \mid b, a^*)\, \mathrm{width}(\hat{V}(\tau(b, a^*, o))).$$
Note that this explains the termination criterion of explore, $\mathrm{width}(\hat{V}(b)) \leq \epsilon \gamma^{-t}$. Because the uncertainty at a node $b$ after an update is at most $\gamma$ times a weighted average of the uncertainty at its child nodes, we have successively looser requirements on uncertainty at deeper nodes: we rely on the $\gamma$ factor at each layer on the way back up to make up the difference.

Given these facts, we can define excess uncertainty
$$\mathrm{excess}(b, t) = \mathrm{width}(\hat{V}(b)) - \epsilon \gamma^{-t}$$
such that a node with negative excess uncertainty satisfies the explore termination condition. We say such a node is finished. Conveniently, the excess uncertainty at $b$ is at most a probability-weighted sum of the excess uncertainties at its children:
$$\mathrm{excess}(b, t) \leq \sum_o \Pr(o \mid b, a^*)\, \mathrm{excess}(\tau(b, a^*, o), t+1).$$
Thus we can focus on ensuring early termination by selecting the depth $t+1$ child that most contributes to excess uncertainty at $b$:
$$o^* = \operatorname{argmax}_o \left[ \Pr(o \mid b, a^*)\, \mathrm{excess}(\tau(b, a^*, o), t+1) \right].$$
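Putting the two heuristics together with the termination test of Algorithm 2 gives the following sketch of explore, reusing the helper sketches above (the top-level HSVI($\epsilon$) loop of Algorithm 1 simply calls it repeatedly from $b_0$); as before this is an illustration, not the authors' implementation.

```python
def explore(b, eps, t, Gamma, points, R, T, O, gamma):
    """One depth-first pass of Algorithm 2, using the IE-MAX action heuristic
    and the weighted-excess-uncertainty observation heuristic."""
    if upper_bound_value(b, points) - lower_bound_value(b, Gamma) <= eps * gamma ** (-t):
        return                                           # node is finished
    # Action choice: greatest upper bound (IE-MAX).
    v_upper = lambda bb: upper_bound_value(bb, points)
    a_star = max(range(R.shape[1]),
                 key=lambda a: q_value(b, a, R, T, O, gamma, v_upper))
    # Observation choice: child contributing the most weighted excess uncertainty.
    def weighted_excess(o):
        b_next, pr_o = belief_update(b, a_star, o, T, O)
        width = upper_bound_value(b_next, points) - lower_bound_value(b_next, Gamma)
        return pr_o * (width - eps * gamma ** (-(t + 1)))
    o_star = max(range(O.shape[2]), key=weighted_excess)
    explore(belief_update(b, a_star, o_star, T, O)[0], eps, t + 1,
            Gamma, points, R, T, O, gamma)
    local_update(b, Gamma, points, R, T, O, gamma)       # update on the way back up
```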

Algorithm 4: $\pi$ = AnytimeHSVI(). AnytimeHSVI() is an anytime variant of HSVI. When interrupted, it returns a policy whose regret is bounded by the current value of $\mathrm{width}(\hat{V}(b_0))$. Implementation: as HSVI, but in step (2), in the call to explore($b_0$, $\epsilon$, 0), replace $\epsilon$ with $\zeta \cdot \mathrm{width}(\hat{V}(b_0))$, where $\zeta < 1$ is a scalar parameter. Empirically, performance is not very sensitive to $\zeta$; we used $\zeta = 0.95$ in the experiments, which gives good performance.

Past heuristic search approaches have usually either sampled from $\Pr(o \mid b, a^*)$ or maximized weighted uncertainty rather than weighted excess uncertainty. We find the excess uncertainty heuristic to be empirically superior. In addition, this heuristic allows us to derive a time bound on HSVI convergence.

3.5 ANYTIME USAGE

The definition of HSVI($\epsilon$) given above assumes that we know in advance that we want a policy with regret bounded by $\epsilon$. In practice, however, we often do not know what a reasonable $\epsilon$ is for a given problem; we just want the algorithm to do the best it can in the available time. In support of this goal, we define a variant algorithm called AnytimeHSVI (alg. 4). Where HSVI uses a fixed $\epsilon$, AnytimeHSVI adjusts $\epsilon$ at each top-level call to explore, setting it to be slightly smaller than the current uncertainty at $b_0$. Instead of having a fixed finish line, we have a finish line that is always just ahead, receding as we approach. AnytimeHSVI is used for all of the experiments in this paper. However, our theoretical analysis focuses on HSVI($\epsilon$), which is easier to handle mathematically.

4 THEORETICAL RESULTS

This section discusses some of the key soundness and convergence properties of HSVI($\epsilon$). The actual proofs are presented in [Smith and Simmons, 2004].

The initial lower and upper bound value functions are uniformly improvable, meaning that applying $H$ brings them everywhere closer to $V^*$. If $\underline{V}$ is uniformly improvable, then the corresponding direct-control policy $P_{\underline{V}}$ supports $\underline{V}$, meaning that $\underline{V} \leq J^{P_{\underline{V}}}$. (Direct and lookahead control policies corresponding to a value function are discussed in, e.g., [Hauskrecht, 2000].) If $\overline{V}$ is uniformly improvable, then it is valid, in the sense that $V^* \leq \overline{V}$.

Our local update operators preserve uniform improvability. Thus, throughout the execution of HSVI, the current best policy $P_{\underline{V}}$ supports $\underline{V}$, and $\overline{V}$ is valid. Together, these facts imply that HSVI has valid bounds on the direct-control policy, in the sense that
$$\underline{V} \leq J^{\pi} \leq V^* \leq \overline{V}.$$
This validity holds throughout execution and everywhere in the belief space. The $\mathrm{regret}(\pi, b_0)$ of the policy $\pi$ returned by HSVI($\epsilon$) is at most $\epsilon$. When AnytimeHSVI is interrupted, the $\mathrm{regret}(\pi, b_0)$ of the current best policy $\pi$ is at most $\mathrm{width}(\hat{V}(b_0))$.

There is a finite depth $t_{\max} = \log_\gamma\!\big(\epsilon / \|\overline{V}_0 - \underline{V}_0\|\big)$ such that all nodes with depth $t \geq t_{\max}$ are finished. The uncertainty at a node never increases (thus finished nodes never become unfinished). After each top-level call to explore, at least one previously unfinished node is finished. This property depends on our particular choice of heuristics. As a result, HSVI($\epsilon$) is guaranteed to terminate after performing at most $u_{\max}$ updates, where
$$u_{\max} = t_{\max}\, \frac{(|A||O|)^{t_{\max}+1} - 1}{|A||O| - 1}.$$
(Note this is a conservative theoretical bound; empirically, HSVI is much faster.)
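To see what the bound looks like numerically, the following snippet evaluates $t_{\max}$ and $u_{\max}$ for made-up inputs; the values of $\epsilon$, $\gamma$, the initial gap, $|A|$, and $|O|$ below are illustrative, not taken from the paper's experiments.

```python
import math

def hsvi_update_bound(eps, gamma, initial_gap, num_actions, num_obs):
    """Worst-case number of local updates for HSVI(eps), per Section 4.

    initial_gap stands for ||V_upper_0 - V_lower_0||; t_max is rounded up
    to an integer search depth.
    """
    t_max = math.ceil(math.log(eps / initial_gap, gamma))
    branching = num_actions * num_obs
    u_max = t_max * (branching ** (t_max + 1) - 1) // (branching - 1)
    return t_max, u_max

# Illustrative numbers only: eps = 1, gamma = 0.95, initial gap 20, |A| = 5, |O| = 2.
print(hsvi_update_bound(1.0, 0.95, 20.0, 5, 2))   # t_max = 59; u_max is astronomically large
```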
5 THE ROCKSAMPLE PROBLEM

Figure 4: RockSample[7, 8]. (A 7 x 7 grid with eight rocks and an Exit region along the right-hand edge of the map.)

To test HSVI, we have developed RockSample, a scalable problem that models rover science exploration (fig. 4). The rover can achieve reward by sampling rocks in the immediate area, and by continuing its traverse (reaching the exit at the right side of the map). The positions of the rover and the rocks are known, but only some of the rocks have scientific value; we will call these rocks "good". Sampling a rock is expensive, so the rover is equipped with a noisy long-range sensor that it can use to help determine whether a rock is good before choosing whether to approach and sample it.

An instance of RockSample with map size $n \times n$ and $k$ rocks is described as RockSample[$n$, $k$].

The POMDP model of RockSample[$n$, $k$] is as follows. The state space is the cross product of $k+1$ features: Position = {(1,1), (1,2), ..., ($n$,$n$)}, and $k$ binary features RockType$_i$ = {Good, Bad} that indicate which of the rocks are good. There is an additional terminal state, reached when the rover moves off the right-hand edge of the map. The rover can select from $k+5$ actions: {North, South, East, West, Sample, Check$_1$, ..., Check$_k$}. The first four are deterministic single-step motion actions. The Sample action samples the rock at the rover's current location. If the rock is good, the rover receives a reward of 10 and the rock becomes bad (indicating that nothing more can be gained by sampling it). If the rock is bad, it receives a penalty of 10. Moving into the exit area yields reward 10. All other moves have no cost or reward.

Each Check$_i$ action applies the rover's long-range sensor to rock $i$, returning a noisy observation from {Good, Bad}. The noise in the long-range sensor reading is determined by the efficiency $\eta$, which decreases exponentially as a function of Euclidean distance from the target. At $\eta = 1$, the sensor always returns the correct value. At $\eta = 0$, it has a 50/50 chance of returning Good or Bad. At intermediate values, these behaviors are combined linearly. The initial belief is that every rock has equal probability of being Good or Bad.
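The Check$_i$ sensor model is simple enough to state directly in code. The exponential-decay constant `d0` below is an assumed illustrative parameter; the paper specifies only that the efficiency decays exponentially with Euclidean distance.

```python
import math

def check_observation_probs(rover_xy, rock_xy, rock_is_good, d0=20.0):
    """Pr(observation | RockType_i) for RockSample's Check_i action.

    eta = 1: sensor always correct; eta = 0: uniform 50/50; intermediate
    efficiencies mix the two behaviors linearly. The half-efficiency
    distance d0 and the decay form 2**(-d/d0) are assumptions, not given
    in the paper text.
    """
    d = math.dist(rover_xy, rock_xy)
    eta = 2.0 ** (-d / d0)
    p_correct = eta + (1.0 - eta) * 0.5
    p_good = p_correct if rock_is_good else 1.0 - p_correct
    return {"Good": p_good, "Bad": 1.0 - p_good}
```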
6 EXPERIMENTAL RESULTS

We tested HSVI on several well-known problems from the scalable POMDP literature, as well as instances of RockSample. Our benchmark set follows [Pineau et al., 2003], which provides performance numbers for PBVI and some other value iteration algorithms. Note that all of the problems have $\gamma = 0.95$.

Experiments were conducted as follows. Periodically during each run, we interrupted HSVI and simulated its current best policy $\pi$, providing an estimate of the solution quality, $J^\pi(b_0)$. The reported quality is the average reward received over many simulation runs. Replicating earlier experiments, each simulation was terminated after 251 steps. For each problem, results are reported over a single run of the algorithm. In a few cases we made multiple runs, but since HSVI is not stochastic, successive runs are identical up to minuscule changes arising from varying background load on the platform we used, a Pentium-III running at 850 MHz with 256 MB of RAM.

Fig. 5 shows HSVI solution quality vs. time for four problems. In these plots, we also track the bounds $\underline{V}(b_0)$ and $\overline{V}(b_0)$. Recall that at every phase of the algorithm, we are guaranteed that $\underline{V}(b_0) \leq J^\pi(b_0) \leq \overline{V}(b_0)$. Fig. 5 should reflect this, at least up to the error in our estimate of $J^\pi(b_0)$ (errorbars are 95% confidence intervals). ThreeState is a trivial problem we generated, an example of HSVI running to convergence. On the larger problems, the bounds have not converged by the end of the run.

Figure 5: Solution quality vs. time. (Four panels, for ThreeState (3s 4a 3o), Tag (870s 5a 30o), RockSample[5,7] (3201s 12a 2o), and RockSample[7,8] (12545s 13a 2o), plotting the bounds and the simulated solution quality against wallclock time in seconds.)

Fig. 6 shows running times and final solution quality for HSVI and some other state-of-the-art algorithms. Unfortunately, not all competitive algorithms could be included in the comparison, because there is no widely accepted POMDP benchmark that we could use. Results for algorithms other than HSVI were computed on different platforms; running times are only very roughly comparable. Among the algorithms compared, HSVI's final solution quality is in every case within measurement error of the best reported so far, and in one case (the Tag problem) is significantly better.

Figure 6: Multi-algorithm performance comparison. (The table reports goal completion percentage, reward, running time in seconds, and final |Γ| for QMDP, Grid [Brafman, 1997], PBUA [Poon, 2001], PBVI [Pineau et al., 2003], and HSVI on Tiger-Grid (36s 5a 17o), Hallway (61s 5a 21o), Hallway2 (93s 5a 17o), Tag (870s 5a 30o), and RockSample instances from [4,4] up to [7,8] (12545s 13a 2o).)

Figure 7: Performance comparison, HSVI and PBVI. (For each problem — Tiger-Grid (36s 5a 17o), Hallway (61s 5a 21o), Hallway2 (93s 5a 17o), Tag (870s 5a 30o), and RockSample[4,4] (9a 2o) — the table reports the common value $v_c$, the time each algorithm took to reach it, and the resulting speedup.)

Fig. 6, which shows only a single time/quality data point for each problem, does not provide enough data for speed comparisons. Therefore we decided to make a closer comparison with one algorithm. PBVI was chosen both because it is a competitive algorithm, and because [Pineau et al., 2003] presents detailed solution quality vs. time curves for our benchmark problems. In order to control for differing lengths of runs, we report the time that each algorithm took to reach a common value $v_c$, defined to be the highest value that both algorithms were able to reach at some point during their run. There is uncertainty associated with some of the times for PBVI because they were derived from manual reading of published plots; this uncertainty is noted in our comparison table. PBVI and HSVI appear to have been run on comparable platforms. (PBVI performance on RockSample[4,4] and a rough performance estimate for the computer used in the PBVI experiments were provided courtesy of J. Pineau, personal communication.)

Fig. 7 compares PBVI and HSVI performance. The two algorithms show similar performance on smaller problems. As the problems scale up, however, HSVI provides dramatic speedup. A brief explanation of why this might be the case: recall that the policy returned by HSVI is based solely on the lower bound. The upper bound is used only to guide forward exploration. But upper bound updates, which involve the solution of several linear programs, often take longer than lower bound updates. Since PBVI keeps only a lower bound, its updates proceed much more quickly. HSVI can only have competitive performance to the extent that the intelligence of its heuristics outweighs the speed penalty of updating the upper bound. Apparently, the heuristics become relatively more important as problem size increases.

7 RELATED WORK

Because HSVI combines several existing solution techniques, it can be compared to a wide range of related work. Figure 8, although far from exhaustive, lists many relevant algorithms and some of the features they share with HSVI.

Figure 8: Relevant algorithms and features. (The features considered are: keeps upper and lower bounds; leverages value function convexity; uses an observation/outcome heuristic; uses an action heuristic; examines only reachable states/beliefs; asynchronous updates; applied to POMDPs. The algorithms listed are HSVI, ICUB/ICUL [Hauskrecht, 1997], BI-POMDP [Washington, 1997], RTDP-BEL [Geffner and Bonet, 1998], [Brafman, 1997], [Dearden and Boutilier, 1994], LAO* [Hansen and Zilberstein, 2001], PBVI [Pineau et al., 2003], PBDP [Zhang and Zhang, 2001], and incremental pruning [Cassandra et al., 1997].)

[Hauskrecht, 1997], perhaps the closest prior work, describes separate algorithms for incrementally calculating the upper bound (ICUB) and lower bound (ICLB). The ICUB upper bound is similar to that of HSVI in that it is initialized with the value function for the underlying MDP ($V_{MDP}$), improved with asynchronous backups, and used as an action heuristic. Unlike HSVI, ICUB uses a grid-based representation, and explores forward from belief space critical points rather than a specified initial belief.
The ICUL lower bound uses the same vector set representation as HSVI and adds the result of each gradient backup in the same way. But because ICUB and ICUL are separate algorithms, ICUL's forward exploration does not select actions based on the upper bound, and neither algorithm makes use of an uncertainty-based observation heuristic.

Other related work mostly falls into two camps. The first consists of algorithms that combine heuristic search with dynamic programming updates. RTDP-BEL [Geffner and Bonet, 1998], a POMDP extension of the well-known RTDP value iteration technique for MDPs [Barto et al., 1995], turns out to be very similar to ICUB. BI-POMDP [Washington, 1997] uses forward exploration based on AO* with $V_{MDP}$ as its heuristic. BI-POMDP keeps upper and lower bounds on nodes in the search tree; however, it does not explicitly represent the bounds as functions, so it is unable to generalize the value at a belief to neighboring beliefs. Some other algorithms in this group are [Dearden and Boutilier, 1994, Brafman, 1997, Hansen and Zilberstein, 2001].

The second camp includes algorithms that employ a piecewise linear convex value function representation and gradient backups. There are a host of algorithms along these lines, dating back to [Sondik, 1971]. Most differ from HSVI in that they perform gradient backups over the full belief space instead of focusing on relevant beliefs. One exception is PBVI [Pineau et al., 2003], which performs synchronous gradient backups on a growing subset of the belief space, designed such that it examines only reachable beliefs. Unlike HSVI, PBVI does not keep an upper bound and does not use a value-based action heuristic when expanding its belief set. Other algorithms in this group include incremental pruning [Cassandra et al., 1997] and point-based dynamic programming [Zhang and Zhang, 2001].

HSVI avoids examining unreachable beliefs using forward exploration.

[Boutilier et al., 1998] describe how to precompute reachability in order to eliminate states in an MDP context. In a POMDP context their technique would go beyond HSVI by explicitly reducing the dimensionality of the belief space, but the remaining space might still include unreachable beliefs never visited by HSVI.

Finally, there are many competitive POMDP solution approaches that do not employ heuristic search or a PWLC value function representation: too many to discuss here. We refer the reader to a survey [Aberdeen, 2002]. Hopefully, increased adoption of common benchmarks in the POMDP community will allow us to better compare HSVI with other algorithms in the future.

8 CONCLUSION

This paper presents HSVI, a POMDP solution algorithm that uses heuristics, based on upper and lower bounds of the optimal value function, to guide local updates. Experimentally, HSVI is able to find solutions with quality within measurement error of the best previous report on all of the benchmark problems we tried, and it did significantly better on the Tag problem. In time comparisons with the state-of-the-art PBVI algorithm, HSVI showed dramatic speedups on larger problems. We applied HSVI to an instance of the new RockSample domain with 12,545 states, more than 10 times larger than most problems presented in the scalable POMDP literature.

There are several ways that we would like to extend HSVI. First, it should be possible to speed up lower bound updates through the following observation: most beliefs are sparse, and most $\alpha$ vectors are optimal for only a few closely related beliefs. Therefore, only a few elements of any given $\alpha$ vector are relevant, and we can save effort if we avoid computing the rest. Second, we are working on reducing the number of LP calculations needed for the upper bound by pruning some actions early, and by reusing old LP solutions. Finally, we could leverage better data structures such as ADDs for representing beliefs, $\alpha$ vectors, and other objects used by the algorithm [Hoey et al., 1999].

In summary, this is an exciting time: recent progress in solution performance suggests that the POMDP planning model will soon be a feasible choice for robot decision-making on a much wider range of real problems.

References

[Aberdeen, 2002] Aberdeen, D. (2002). A survey of approximate methods for solving partially observable Markov decision processes. Technical report, Research School of Information Science and Engineering, Australian National University.

[Astrom, 1965] Astrom, K. J. (1965). Optimal control of Markov decision processes with incomplete state estimation. Journal of Mathematical Analysis and Applications, 10.

[Barto et al., 1995] Barto, A., Bradtke, S., and Singh, S. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2).

[Boutilier et al., 1998] Boutilier, C., Brafman, R., and Geib, C. (1998). Structured reachability analysis for Markov decision processes. In Proc. of UAI.

[Brafman, 1997] Brafman, R. I. (1997). A heuristic variable grid solution method for POMDPs. In Proc. of AAAI.

[Cassandra et al., 1997] Cassandra, A., Littman, M., and Zhang, N. (1997). Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proc. of UAI.

[Dearden and Boutilier, 1994] Dearden, R. and Boutilier, C. (1994). Integrating planning and execution in stochastic domains. In Proc. of the AAAI Spring Symposium on Decision Theoretic Planning, pages 55-61, Stanford, CA.
[Geffner and Bonet, 1998] Geffner, H. and Bonet, B. (1998). Solving large POMDPs by real time dynamic programming. In Working Notes, Fall AAAI Symposium on POMDPs.

[Hansen and Zilberstein, 2001] Hansen, E. and Zilberstein, S. (2001). LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129.

[Hauskrecht, 1997] Hauskrecht, M. (1997). Incremental methods for computing bounds in partially observable Markov decision processes. In Proc. of AAAI, Providence, RI.

[Hauskrecht, 2000] Hauskrecht, M. (2000). Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13.

[Hoey et al., 1999] Hoey, J., St-Aubin, R., Hu, A., and Boutilier, C. (1999). SPUDD: Stochastic planning using decision diagrams. In Proc. of UAI.

[Kaelbling, 1993] Kaelbling, L. P. (1993). Learning in Embedded Systems. The MIT Press.

[Pineau et al., 2003] Pineau, J., Gordon, G., and Thrun, S. (2003). Point-based value iteration: An anytime algorithm for POMDPs. In Proc. of IJCAI.

[Poon, 2001] Poon, K.-M. (2001). A fast heuristic algorithm for decision-theoretic planning. Master's thesis, The Hong Kong University of Science and Technology.

[Smith and Simmons, 2004] Smith, T. and Simmons, R. (2004). Heuristic search value iteration for POMDPs: Detailed theory and results. Technical report, Robotics Institute, Carnegie Mellon University. (In preparation.)

[Sondik, 1971] Sondik, E. J. (1971). The optimal control of partially observable Markov processes. PhD thesis, Stanford University.

[Washington, 1997] Washington, R. (1997). BI-POMDP: Bounded, incremental, partially-observable Markov-model planning. In Proc. of European Conf. on Planning (ECP), Toulouse, France.

[Zhang and Zhang, 2001] Zhang, N. L. and Zhang, W. (2001). Speeding up the convergence of value iteration in partially observable Markov decision processes. Journal of AI Research, 14:29-51.


Inverse Reinforcement Learning in Partially Observable Environments Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09) Inverse Reinforcement Learning in Partially Observable Environments Jaedeug Choi and Kee-Eung Kim Department

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information

Temporal Difference Learning & Policy Iteration

Temporal Difference Learning & Policy Iteration Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.

More information

Q-Learning for Markov Decision Processes*

Q-Learning for Markov Decision Processes* McGill University ECSE 506: Term Project Q-Learning for Markov Decision Processes* Authors: Khoa Phan khoa.phan@mail.mcgill.ca Sandeep Manjanna sandeep.manjanna@mail.mcgill.ca (*Based on: Convergence of

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Markov Decision Processes

Markov Decision Processes Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour

More information

Optimizing Memory-Bounded Controllers for Decentralized POMDPs

Optimizing Memory-Bounded Controllers for Decentralized POMDPs Optimizing Memory-Bounded Controllers for Decentralized POMDPs Christopher Amato, Daniel S. Bernstein and Shlomo Zilberstein Department of Computer Science University of Massachusetts Amherst, MA 01003

More information

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon Optimal Control of Partiality Observable Markov Processes over a Finite Horizon Report by Jalal Arabneydi 04/11/2012 Taken from Control of Partiality Observable Markov Processes over a finite Horizon by

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University

More information

Decentralized Decision Making!

Decentralized Decision Making! Decentralized Decision Making! in Partially Observable, Uncertain Worlds Shlomo Zilberstein Department of Computer Science University of Massachusetts Amherst Joint work with Martin Allen, Christopher

More information

arxiv: v2 [cs.gt] 4 Aug 2016

arxiv: v2 [cs.gt] 4 Aug 2016 Dynamic Programming for One-Sided Partially Observable Pursuit-Evasion Games Karel Horák, Branislav Bošanský {horak,bosansky}@agents.fel.cvut.cz arxiv:1606.06271v2 [cs.gt] 4 Aug 2016 Department of Computer

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Chapter 16 Planning Based on Markov Decision Processes

Chapter 16 Planning Based on Markov Decision Processes Lecture slides for Automated Planning: Theory and Practice Chapter 16 Planning Based on Markov Decision Processes Dana S. Nau University of Maryland 12:48 PM February 29, 2012 1 Motivation c a b Until

More information

Learning in Zero-Sum Team Markov Games using Factored Value Functions

Learning in Zero-Sum Team Markov Games using Factored Value Functions Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer

More information

Dialogue as a Decision Making Process

Dialogue as a Decision Making Process Dialogue as a Decision Making Process Nicholas Roy Challenges of Autonomy in the Real World Wide range of sensors Noisy sensors World dynamics Adaptability Incomplete information Robustness under uncertainty

More information

Decision Making As An Optimization Problem

Decision Making As An Optimization Problem Decision Making As An Optimization Problem Hala Mostafa 683 Lecture 14 Wed Oct 27 th 2010 DEC-MDP Formulation as a math al program Often requires good insight into the problem to get a compact well-behaved

More information

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon. Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,

More information

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm

Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu

More information

Distributed Optimization. Song Chong EE, KAIST

Distributed Optimization. Song Chong EE, KAIST Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links

More information

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. CS 188 Spring 2017 Introduction to Artificial Intelligence Midterm V2 You have approximately 80 minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. Mark

More information

University of Alberta

University of Alberta University of Alberta NEW REPRESENTATIONS AND APPROXIMATIONS FOR SEQUENTIAL DECISION MAKING UNDER UNCERTAINTY by Tao Wang A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment

More information

Probabilistic inference for computing optimal policies in MDPs

Probabilistic inference for computing optimal policies in MDPs Probabilistic inference for computing optimal policies in MDPs Marc Toussaint Amos Storkey School of Informatics, University of Edinburgh Edinburgh EH1 2QL, Scotland, UK mtoussai@inf.ed.ac.uk, amos@storkey.org

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and. Convergence of Indirect Adaptive. Andrew G.

In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and. Convergence of Indirect Adaptive. Andrew G. In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and J. Alspector, (Eds.). Morgan Kaufmann Publishers, San Fancisco, CA. 1994. Convergence of Indirect Adaptive Asynchronous

More information

Partially observable Markov decision processes. Department of Computer Science, Czech Technical University in Prague

Partially observable Markov decision processes. Department of Computer Science, Czech Technical University in Prague Partially observable Markov decision processes Jiří Kléma Department of Computer Science, Czech Technical University in Prague https://cw.fel.cvut.cz/wiki/courses/b4b36zui/prednasky pagenda Previous lecture:

More information

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty 2011 IEEE International Conference on Robotics and Automation Shanghai International Conference Center May 9-13, 2011, Shanghai, China A Decentralized Approach to Multi-agent Planning in the Presence of

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Coarticulation in Markov Decision Processes

Coarticulation in Markov Decision Processes Coarticulation in Markov Decision Processes Khashayar Rohanimanesh Department of Computer Science University of Massachusetts Amherst, MA 01003 khash@cs.umass.edu Sridhar Mahadevan Department of Computer

More information

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Christos Dimitrakakis Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands

More information