Learning in Depth-First Search: A Unified Approach to Heuristic Search in Deterministic, Non-Deterministic, Probabilistic, and Game Tree Settings


Learning in Depth-First Search: A Unified Approach to Heuristic Search in Deterministic, Non-Deterministic, Probabilistic, and Game Tree Settings

Blai Bonet and Héctor Geffner

Abstract

Dynamic Programming provides a convenient and unified framework for studying many state models used in AI, but no algorithms for handling large spaces. Heuristic-search methods, on the other hand, can handle large spaces but lack a common foundation. In this work, we combine the benefits of a general dynamic programming formulation with the power of heuristic-search techniques in an algorithmic framework, called Learning in Depth-First Search (LDFS), that aims to be both general and effective. The basic LDFS algorithm searches for solutions by combining iterative, bounded depth-first searches with learning in the sense of Korf's LRTA* and Barto et al.'s RTDP. In each iteration, if there is a solution with cost not exceeding a lower bound, the solution is found; else the process restarts with the lower bound and the value function updated. LDFS reduces to IDA* with Transposition Tables over deterministic models, but it also solves non-deterministic, probabilistic, and game tree models, over which a slight variation reduces to the state-of-the-art MTD algorithm. Over Max AND/OR graphs, on the other hand, LDFS is a new algorithm that appears to be quite competitive with AO*.

Introduction

Dynamic Programming provides a convenient and unified framework for studying many state models used in AI (Bellman 1957; Bertsekas 1995), but no algorithms for handling large spaces. Heuristic-search methods, on the other hand, can handle large spaces effectively, but lack a common foundation: algorithms like IDA* aim at deterministic models (Korf 1985), AO* at non-deterministic models (Martelli & Montanari 1973), Alpha-Beta at Game Trees (Newell, Shaw, & Simon 1963), and so on (see (Nilsson 1980; Pearl 1983)), and it is not always clear what these tasks and techniques have in common, nor how they can be generalized in a principled way to other models such as non-deterministic models with cycles or Markov decision processes. In this work, we combine the benefits of a general dynamic programming formulation with the effectiveness of heuristic-search techniques for developing an algorithmic framework, which we call Learning in Depth-First Search, that aims to be both general and effective. The basic LDFS algorithm searches for solutions by combining iterative, bounded depth-first searches with learning in the sense of (Korf 1990) and (Barto, Bradtke, & Singh 1995). In each iteration, if there is a solution with cost not exceeding a lower bound, the solution is found and reported; else the process restarts with the lower bound and the value function updated. LDFS reduces to IDA* with Transposition Tables (Reinefeld & Marsland 1994) over deterministic models, but it also solves non-deterministic, probabilistic, and game tree models, over which a slight variation reduces to the state-of-the-art MTD algorithm (Plaat et al. 1996). The LDFS framework makes explicit and generalizes two key ideas underlying a family of effective search algorithms across a variety of models: learning and lower bounds. We build on recent work that combines DP updates with the use of lower bounds and knowledge of the initial state for computing partial optimal policies for MDPs (Barto, Bradtke, & Singh 1995; Hansen & Zilberstein 2001; Bonet & Geffner 2003).
However, rather than developing another algorithm for MDPs, we use these notions to lay out a general framework covering a wide range of models, which we hope is transparent and useful. Preliminary experiments over Max AND/OR graphs suggest indeed that LDFS is quite competitive with AO* (Bonet & Geffner 2005).[1]

[1] Note for reviewers: (Bonet & Geffner 2005) and this paper both feature the LDFS algorithm, and both are submitted to AAAI-05, with IDs 294 and 296 respectively. This one introduces the algorithm and analyzes its scope and properties, while 294 evaluates performance over AND/OR graphs.

Models

All the models can be defined in terms of the following common elements:

1. a discrete and finite state space S,
2. an initial state s_0 ∈ S,
3. a non-empty set of terminal states S_T ⊆ S,
4. actions A(s) ⊆ A applicable in each non-terminal state s,
5. a function mapping non-terminal states s and actions a ∈ A(s) into sets of states F(a,s) ⊆ S,
6. action costs c(a,s) for non-terminal states s, and
7. terminal costs c_T(s) for terminal states.

We assume that both A(s) and F(a,s) are non-empty.
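These seven elements are all that the algorithms discussed later rely on. Purely as an illustration (not from the paper), the following Python sketch packages them into a small container class; all names (Model, A, F, c, cT) are hypothetical conveniences reused by the later sketches in this section.

```python
# Hypothetical container for the common model elements; field and method names
# are illustrative only. States and actions can be any hashable values.
class Model:
    def __init__(self, s0, terminals, actions, successors, action_cost, terminal_cost):
        self.s0 = s0                          # initial state s_0
        self.terminals = set(terminals)       # terminal states S_T
        self._A = actions                     # dict: s -> non-empty list of actions A(s)
        self._F = successors                  # dict: (a, s) -> non-empty set F(a, s)
        self._c = action_cost                 # dict: (a, s) -> action cost c(a, s) > 0
        self._cT = terminal_cost              # dict: s -> terminal cost c_T(s) >= 0

    def is_terminal(self, s):
        return s in self.terminals

    def A(self, s):                           # actions applicable in non-terminal s
        return self._A[s]

    def F(self, a, s):                        # possible successors of s under a
        return self._F[(a, s)]

    def c(self, a, s):                        # action cost
        return self._c[(a, s)]

    def cT(self, s):                          # terminal cost
        return self._cT[s]
```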

The various models correspond to:

Deterministic Models (DET): |F(a,s)| = 1;

Non-Deterministic Models (NON-DET): |F(a,s)| ≥ 1;

Markov Decision Processes (MDPs): with probabilities P_a(s′|s) for s′ ∈ F(a,s) such that Σ_{s′ ∈ F(a,s)} P_a(s′|s) = 1.

In addition, for DET, NON-DET, and MDPs, action costs c(a,s) are all positive, and terminal costs c_T(s) are non-negative. When terminal costs are all zero, terminal states are called goals. Finally,

Game Trees (GT): are non-deterministic models (NON-DET) with zero action costs, arbitrary terminal costs, and a tree structure.

A model has a tree structure when two different paths cannot lead to the same state. A path s_0, a_0, s_1, a_1, ..., a_{n-1}, s_n is a sequence of states and actions starting in the initial state s_0 such that each action a_i is applicable in s_i, a_i ∈ A(s_i), and each state s_{i+1} is a possible successor of s_i given action a_i, s_{i+1} ∈ F(a_i, s_i). We also define the acyclic models as those that do not accommodate cyclic paths, i.e., paths s_0, a_0, s_1, a_1, ..., a_{n-1}, s_n with s_i = s_j for i ≠ j. We write aNON-DET and aMDPs to refer to the subclasses of acyclic NON-DET and MDP models. For example, the problems in the scope of the AO* algorithm are those in aNON-DET.

Solutions

The solutions to the various models can be expressed in terms of the so-called Bellman equation that characterizes the optimal cost function (Bellman 1957; Bertsekas 1995):

    V(s) = c_T(s)                       if s is terminal
           min_{a ∈ A(s)} Q_V(a,s)      otherwise                  (1)

where the Q_V(a,s) values express the cost-to-go and are shorthand for:

    c(a,s) + V(s′), s′ ∈ F(a,s),                for DET,
    c(a,s) + max_{s′ ∈ F(a,s)} V(s′)            for NON-DET-Max,
    c(a,s) + Σ_{s′ ∈ F(a,s)} V(s′)              for NON-DET-Add,
    c(a,s) + Σ_{s′ ∈ F(a,s)} P_a(s′|s) V(s′)    for MDPs,
    max_{s′ ∈ F(a,s)} V(s′)                     for Game Trees.

We make a distinction between worst-case (Max) and additive (Add) non-deterministic models, as both are considered in AI and yet have slightly different properties. We will refer to the models (NON-DET-Max and GT) whose Q-values are defined with Max as Max models, and to the rest of the models, defined with sums, as Additive models.
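The cost-to-go term Q_V(a,s) is the only place where the model families differ algorithmically. A minimal sketch of these cases, assuming the hypothetical Model class from the earlier sketch, a dict V from states to values, and our own 'kind' tags (not the paper's notation); for MDPs, P[(a, s)][s1] is assumed to hold P_a(s1|s):

```python
# Cost-to-go Q_V(a, s) for each model family, under the assumptions stated above.
def Q(model, V, a, s, kind, P=None):
    succ = model.F(a, s)
    if kind == 'DET':                            # single successor
        (s1,) = tuple(succ)
        return model.c(a, s) + V[s1]
    if kind == 'NON-DET-MAX':                    # worst-case outcome
        return model.c(a, s) + max(V[s1] for s1 in succ)
    if kind == 'NON-DET-ADD':                    # sum over outcomes
        return model.c(a, s) + sum(V[s1] for s1 in succ)
    if kind == 'MDP':                            # expected cost-to-go
        return model.c(a, s) + sum(P[(a, s)][s1] * V[s1] for s1 in succ)
    if kind == 'GT':                             # game trees: zero action costs
        return max(V[s1] for s1 in succ)
    raise ValueError('unknown model kind %r' % kind)
```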
Under some conditions, there is a unique value function V*(s), the optimal cost function, that solves the Bellman equation, and the optimal solutions to all the models can be expressed in terms of the policies π that are greedy with respect to V*(s). A policy π is a function mapping states s ∈ S into actions a ∈ A(s), and a policy π_V is greedy with respect to a value function V(s), or simply greedy in V, iff π_V is the best policy assuming that the cost-to-go is given by V(s); i.e.,

    π_V(s) = argmin_{a ∈ A(s)} Q_V(a,s).                           (2)

Often, however, these conditions are not met, and the set of |S| Bellman equations has no solution. This happens, for example, in the presence of dead-ends. Also, for Max models, as we will see, it is not the case that optimal solutions must be greedy with respect to V*. For these reasons, we characterize optimal solutions in a slightly different way, taking into account the information about the initial state s_0 of the system, which is assumed to be available and known. We deal then with partial policies that map only some states into actions. We say that a partial policy π is closed (relative to s_0) if π prescribes the action to be done in all the (non-terminal) states reachable from s_0 and π; this set S′ ⊆ S is defined inductively as comprising s_0 and all the states s′ ∈ F(π(s), s) for s ∈ S′.

In particular, closed policies for deterministic models correspond to action sequences, for game trees to actual trees, and so on. Any closed policy π relative to a state s has a cost V_π(s) that expresses the cost of solving the problem starting from s. The costs V_π(s) are given by the solution of (1) but with the operator min_{a ∈ A(s)} removed and the action a replaced by π(s). These costs are thus well defined when the resulting equations have a solution over the subset of states reachable from s_0 and π. Moreover, for all models above except MDPs, it can be shown that (closed) policies π have a well-defined finite cost V_π(s_0) when they are acyclic, and for MDPs, when they are proper; otherwise V_π(s_0) = ∞. A closed policy π is cyclic if it gives rise to cyclic paths s_0, a_0, s_1, a_1, ..., a_{n-1}, s_n where a_i = π(s_i), and it is proper if a terminal state is reachable from every state s reachable from s_0 and π (Bertsekas 1995). For all models except MDPs, since solutions π are acyclic, the costs V_π(s_0) can also be defined recursively, starting with the terminal states s′ for which V_π(s′) = c_T(s′), and going up to the non-terminal states s reachable from s_0 and π, for which V_π(s) = Q_{V_π}(π(s), s). In all cases, we are interested in computing a solution π that minimizes V_π(s_0). The resulting value is the optimal problem cost V*(s_0).

A General Algorithm

We assume throughout the paper that we have an initial value (or heuristic) function V that for all non-terminal states is a lower bound, V(s) ≤ V*(s), and monotonic, V(s) ≤ min_{a ∈ A(s)} Q_V(a,s). Also, for simplicity, we assume V(s) = c_T(s) for all terminal states. We summarize these conditions by simply saying that V is admissible. This value function is then modified by learning in the sense of (Korf 1990) and (Barto, Bradtke, & Singh 1995), where the values of selected states are made consistent with the values of successor states, an operation that takes the form of a Bellman update:

    V(s) := min_{a ∈ A(s)} Q_V(a,s).                               (3)

If the initial value function is admissible, it remains so after one or more updates. Methods like value iteration perform the iteration V(s) := min_{a ∈ A(s)} Q_V(a,s) until the difference between the right- and left-hand sides does not exceed some ε ≥ 0. The difference min_{a ∈ A(s)} Q_V(a,s) − V(s), which is non-negative for monotonic value functions, is called the residual of V over s, denoted Res_V(s).
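Under the same assumptions as the earlier sketches, the Bellman update (3), the residual Res_V(s), and the greedy choice π_V(s) can be sketched as follows (an illustration, not the paper's code; ties in the greedy choice are broken by the iteration order of A(s)):

```python
# Bellman update (3), residual Res_V(s), and greedy policy pi_V(s), assuming
# the Model/Q sketches above and a dict V from states to values.
def residual(model, V, s, kind, P=None):
    if model.is_terminal(s):
        return 0.0
    return min(Q(model, V, a, s, kind, P) for a in model.A(s)) - V[s]

def update(model, V, s, kind, P=None):
    V[s] = min(Q(model, V, a, s, kind, P) for a in model.A(s))

def greedy_action(model, V, s, kind, P=None):
    # first action of A(s) achieving the minimum Q-value
    return min(model.A(s), key=lambda a: Q(model, V, a, s, kind, P))
```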

Clearly, a value function V is a solution to the Bellman equation, and is thus equal to V*, if it has zero residuals over all states. Given a fixed initial state s_0, however, it is not actually necessary to eliminate all residuals to ensure optimality:

Proposition 1. Let V be an admissible value function and let π be a policy greedy in V. Then π minimizes V_π(s_0), and hence is optimal, if Res_V(s) = 0 over all the states s reachable from s_0 and π.

This suggests a simple and general schema for solving all the models that avoids the strong assumptions required by standard DP methods and yields partial optimal policies. Since there may be many policies π greedy in V, we will assume an ordering on actions and let π_V refer to the particular greedy policy obtained by selecting in each state the action greedy in V that is minimal with respect to this ordering. Then, in order to obtain a policy and a value function satisfying Proposition 1, it is sufficient to search for a state s reachable from s_0 and π_V with residual Res_V(s) > ε and update that state, repeating this iteration until there are no such states left. If ε = 0 and the initial value function V is admissible, then the resulting (closed) greedy policy π_V is optimal. This FIND-and-REVISE schema, shown in Algorithm 1 and introduced in (Bonet & Geffner 2003) for solving MDPs, can be used to solve all the models, without the strong assumptions made by DP algorithms and without having to compute complete policies:[2]

Proposition 2. Starting with an admissible value function V, the FIND-and-REVISE schema for ε = 0 solves all the models (DET, NON-DET, GT, MDPs) provided they have a solution.

[2] It is assumed that the initial value function is represented intensionally and that the updated values are stored in a hash table.

    starting with an admissible V
    repeat
        FIND a state s reachable from s_0 and π_V with Res_V(s) > ε
        update V(s) to min_{a ∈ A(s)} Q_V(a,s)
    until no such state is found
    return V

Algorithm 1: The FIND-and-REVISE schema.

For the non-probabilistic models with integer action and terminal costs, the number of iterations of FIND-and-REVISE with ε = 0 is bounded by Σ_{s ∈ S} [min(V*(s), MaxV) − V(s)], where MaxV stands for any upper bound on the optimal costs V*(s) of the states with finite cost. This is because updates increase the value function by at least one in some states, decrease it in none, and preserve its admissibility. In addition, states with values above MaxV are not reachable from s_0 and the greedy policy. Since the FIND procedure can be implemented by a simple DFS procedure that keeps track of visited states in time O(|S|), it follows that the time complexity of FIND-and-REVISE over those models can be bounded by the same expression times O(|S|). For MDPs, the convergence of FIND-and-REVISE with ε = 0 is asymptotic and cannot be bounded in this way. However, for any ε > 0, the convergence is bounded by the same expression divided by ε.
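One way to realize the FIND step as the simple DFS just described is sketched below, again built on the earlier hypothetical helpers rather than the paper's implementation; termination over the various models follows the bounds discussed above.

```python
# Sketch of the FIND-and-REVISE schema of Algorithm 1: a DFS over the states
# reachable from s_0 and the greedy policy pi_V looks for a residual above eps;
# if one is found the state is updated, otherwise V and pi_V solve the model.
def find_and_revise(model, V, kind, eps=0.0, P=None):
    while True:
        found, visited, stack = None, set(), [model.s0]
        while stack:                              # FIND: DFS restricted to greedy actions
            s = stack.pop()
            if s in visited or model.is_terminal(s):
                continue
            visited.add(s)
            if residual(model, V, s, kind, P) > eps:
                found = s
                break
            a = greedy_action(model, V, s, kind, P)
            stack.extend(model.F(a, s))
        if found is None:                         # no inconsistent reachable state left
            return V
        update(model, V, found, kind, P)          # REVISE: a single Bellman update
```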
Learning DFS

We have seen that all the models admit a common formulation and a common algorithm. This algorithm, while not practical, will be useful for understanding and proving the correctness of other, more effective approaches. We will say that an iterative algorithm is an instance of FIND-and-REVISE[ε] if each iteration of the algorithm terminates, either identifying and updating a state reachable from s_0 and π_V with residual Res_V(s) > ε, or proving that no such state exists, and hence that the model is solved. Such algorithms will inherit the correctness of FIND-and-REVISE, but by performing more updates per iteration they will converge faster.

We focus first on the models whose solutions are necessarily acyclic, thus excluding MDPs but not acyclic MDPs (aMDPs). We are not excluding models with cycles, though; only models whose solutions may be cyclic. Hence the requirements are weaker than those of algorithms like AO*. We will say that a state s is consistent relative to a value function V if the residual of V over s is no greater than ε. Unless mentioned otherwise, we take ε to be 0.

The first practical instance of FIND-and-REVISE that we consider, LDFS, implements the FIND operation as a DFS that considers all the greedy actions in a state, backtracks on inconsistent states, and updates not only the inconsistent states that are found but, upon backtracking, their ancestors too. The code for LDFS is shown in Algorithm 2. The depth-first search is achieved by means of two loops: one over the (greedy) actions a ∈ A(s) in s; the other, nested, over the possible successors s′ ∈ F(a,s). The tip nodes in this search are the inconsistent states s (where Q_V(a,s) > V(s) for all the actions a), the terminal states, and the states that are labeled as solved. A state s is labeled as solved when the search beneath s did not find any inconsistent state. This is captured by the boolean flag. If s is consistent, and flag is true after searching beneath the successors s′ ∈ F(a,s) of a greedy action a, then s is labeled as solved, π(s) is set to a, and no more actions are tried at s. Otherwise, the next greedy action is tried, and if none is left, s is updated. Similarly, if the search beneath a successor state s′ ∈ F(a,s) reports an inconsistency, or a no longer satisfies the greedy condition Q_V(a,s) ≤ V(s), the rest of the successor states in F(a,s) are skipped. LDFS is called iteratively over s_0 from a driver routine that terminates when s_0 is solved, returning a value function V and a greedy policy π that satisfy Proposition 1, and hence are optimal. We show this by proving that LDFS is an instance of FIND-and-REVISE. First, since no model other than MDPs can accommodate a cycle of consistent states, we get that:

Proposition 3. For DET, NON-DET, GT, and aMDPs, a call to LDFS cannot enter into a loop and thus terminates.

Then, provided with the same ordering on actions as FIND-and-REVISE, it is simple to show that the first state s that is updated by LDFS is inconsistent and reachable from s_0 and π_V, and if there is no such state, LDFS terminates with π_V.

Proposition 4. Provided an initial admissible value function, LDFS is an instance of FIND-and-REVISE[ε = 0], and hence it terminates with a closed partial policy π that is optimal for DET, NON-DET, GT, and aMDPs.

    LDFS-DRIVER(s_0)
        repeat
            solved := LDFS(s_0)
        until solved
        return (V, π)

    LDFS(s)
        if s is SOLVED or terminal then
            if s is terminal then
                V(s) := c_T(s); mark s as SOLVED
            return true
        flag := false
        foreach a ∈ A(s) do
            if Q_V(a,s) > V(s) then continue
            flag := true
            foreach s′ ∈ F(a,s) do
                flag := LDFS(s′) & [Q_V(a,s) ≤ V(s)]
                if ¬flag then break
            if flag then break
        if flag then
            π(s) := a; mark s as SOLVED
        else
            V(s) := min_{a ∈ A(s)} Q_V(a,s)
        return flag

Algorithm 2: Learning DFS Algorithm (LDFS).
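For concreteness, here is a direct Python transcription of Algorithm 2 built on the hypothetical Model/Q helpers sketched earlier; it is a sketch for experimentation, not the authors' implementation.

```python
# Transcription of Algorithm 2: 'solved' is the set of labeled states and 'pi'
# the partial policy under construction.
def ldfs_driver(model, V, kind, P=None):
    solved, pi = set(), {}
    while not ldfs(model, V, model.s0, solved, pi, kind, P):
        pass                                   # repeat until s_0 is labeled solved
    return V, pi

def ldfs(model, V, s, solved, pi, kind, P=None):
    if s in solved or model.is_terminal(s):
        if model.is_terminal(s):
            V[s] = model.cT(s)
            solved.add(s)
        return True
    flag = False
    for a in model.A(s):                       # loop over the greedy actions only
        if Q(model, V, a, s, kind, P) > V[s]:
            continue
        flag = True
        for s1 in model.F(a, s):               # nested loop over possible successors
            flag = ldfs(model, V, s1, solved, pi, kind, P) and \
                   Q(model, V, a, s, kind, P) <= V[s]
            if not flag:
                break                          # inconsistency below, or a no longer greedy
        if flag:
            break                              # a works at s: stop trying actions
    if flag:
        pi[s] = a                              # label s as solved with action a
        solved.add(s)
    else:
        V[s] = min(Q(model, V, b, s, kind, P) for b in model.A(s))  # update on backtracking
    return flag
```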

In addition, for the models that are additive, it can be shown that all the updates performed by LDFS are effective, in the sense that they are all done on states that are inconsistent and which, as a result, strictly increase their values:

Proposition 5. Provided an initial admissible value function, all the updates in LDFS over the additive models DET, NON-DET-Add, and aMDPs strictly increase the value function.

An immediate consequence of this is that for DET and NON-DET-Add models with integer action and terminal costs, the bound on the number of iterations can be reduced to V*(s_0) − V(s_0), which corresponds to the maximum number of iterations in IDA* under the same conditions. Actually, provided that LDFS and IDA* (with transposition tables (Reinefeld & Marsland 1994)) consider the actions in the same order, it can be shown that they will both traverse the same paths and maintain the same value (heuristic) function in memory:

Proposition 6. Provided an admissible (and monotonic) value function V, and that actions are applied in the same order in every state, LDFS is equivalent to IDA* with Transposition Tables over the class of deterministic models (DET).

Actually, for Additive models, the workings of LDFS can be characterized as follows:

Proposition 7. Over the Additive models DET, NON-DET-Add, and aMDPs, LDFS tests whether there is a solution π with cost V_π(s_0) ≤ V(s_0) for an initial admissible value function V. If a solution exists, one such solution is found and reported; else V(s_0) is increased and the test is run again, until a solution is found. Since V remains a lower bound, the solution found is optimal.

This is indeed the idea underlying IDA* and the Memory-enhanced Test Driver algorithm MTD(−∞) for Game Trees (Plaat et al. 1996). Interestingly, however, while LDFS solves Game Trees, it does not exhibit this pattern: the reason is that over Max models, like GT and NON-DET-Max, updates in LDFS are not always effective. We discuss this next.

Local and Global Optimality

An optimal solution is one that minimizes V_π(s_0). If π also minimizes V_π(s) for all the states s reachable from s_0 and π, we say that π is globally optimal. A characteristic of additive models is that the first condition implies the second. In Max models, however, this is not true. This distinction arises because while all arguments count in a sum, not all arguments count in a maximization. Game Tree algorithms make use of this difference for computing optimal policies that are not necessarily globally optimal. LDFS and FIND-and-REVISE, on the other hand, compute only globally optimal policies. This is because they keep updating V until all inconsistencies over the states reachable from s_0 and the greedy policy π_V are eliminated. Some of these inconsistencies, however, are harmless in Max models.
Interestingly, the AO* algorithm has the same limitation and computes globally optimal solutions even in models like aNON-DET-Max where this is not required.

Bounded LDFS

The properties that LDFS exhibits over additive models, in particular the notion of effective updates, can be extended to Max models by adding an extra argument to LDFS: a Bound parameter. For simplicity, we will restrict our attention to Max models where no state can be reached through paths of different costs. This includes of course Game Trees and NON-DET-Max tree models. For the general case, see (Bonet & Geffner 2005). The key observation is that while an increase in V(s′) for some s′ ∈ F(a,s) does not necessarily translate into an increase of Q_V(a,s) in Max models, an increase of V(s′) above a certain bound will. Namely, Q_V(a,s) > Bound iff V(s′) > Bound′, for Bound′ equal to Bound in Game Trees, to Bound − c(a,s) in DET, to Bound − c(a,s) − Σ_{s′′ ∈ F(a,s)\{s′}} V(s′′) in NON-DET-Add, and so on. The procedure Bounded LDFS, shown in Algorithm 3, takes advantage of this, replacing the tests Q_V(a,s) ≤ V(s) in LDFS with Q_V(a,s) ≤ Bound, where Bound is the extra parameter, which is initialized to V(s_0) in the driver routine and passed as the Bound′ above in the recursive calls. For Additive models, this change has no effect because the following invariant holds before the loop over the actions:

Proposition 8. For the additive models, in all LDFS-BOUND calls, the invariant V(s) = Bound holds before the A-loop.
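The Bound′ passed down in the recursive calls is the only model-dependent ingredient; the cases just listed can be sketched as follows (the function name and 'kind' tags are ours, and the sketch reuses the earlier hypothetical Model class). In LDFS-BOUND, this value replaces V(s) in the two greedy tests of the earlier LDFS sketch.

```python
# Sketch of the Bound' computed for a successor s1 of action a in state s,
# following the cases described in the text; other models follow the same pattern.
def child_bound(model, V, bound, a, s, s1, kind):
    if kind == 'GT':                         # Game Trees: zero action costs
        return bound
    if kind == 'DET':                        # single successor
        return bound - model.c(a, s)
    if kind == 'NON-DET-ADD':                # subtract the siblings' current values
        siblings = sum(V[s2] for s2 in model.F(a, s) if s2 != s1)
        return bound - model.c(a, s) - siblings
    raise ValueError('no Bound rule sketched for kind %r' % kind)
```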

    LDFS-BOUND-DRIVER(s_0)
        repeat
            solved := LDFS-BOUND(s_0, V(s_0))
        until solved
        return (V, π)

    LDFS-BOUND(s, Bound)
        ...
        foreach a ∈ A(s) do
            if Q_V(a,s) > Bound then continue
            flag := true
            foreach s′ ∈ F(a,s) do
                Bound′ := (see text)
                flag := LDFS-BOUND(s′, Bound′) & [Q_V(a,s) ≤ Bound]
                if ¬flag then break
            if flag then break
        ...

Algorithm 3: LDFS-BOUND, fragment that differs from LDFS.

As a result, directly from the code, it can be established that:

Proposition 9. LDFS and LDFS-BOUND are equivalent over the Additive models.

For Max models, however, a weaker invariant holds:

Proposition 10. For Max models, in all LDFS-BOUND calls, the invariant V(s) ≤ Bound holds before the A-loop.

The difference between LDFS and LDFS-BOUND over Max models is that while the former regards min_{a ∈ A(s)} Q_V(a,s) > V(s) as an inconsistency that needs to be removed, LDFS-BOUND removes this inconsistency only when min_{a ∈ A(s)} Q_V(a,s) > Bound holds, which is weaker due to the invariant V(s) ≤ Bound. As a result, LDFS-BOUND, unlike LDFS, is not an instance of FIND-and-REVISE over Max models, and thus termination and correctness need to be proved in a different way. First, an LDFS-BOUND call terminates over Game Trees, as there cannot be cyclic paths, and over NON-DET-Max models, as the invariant 0 ≤ V(s) ≤ Bound holds and the Bound decreases monotonically along any path:

Proposition 11. A call to LDFS-BOUND over Max models cannot enter into a loop and thus it always terminates.

Then, with arguments similar to the ones used for FIND-and-REVISE, one can bound the number of iterations, so that upon termination LDFS-BOUND returns a policy π and a value function V such that Q_V(π(s), s) ≤ Bound holds in all the invocations LDFS-BOUND(s, Bound) over the states s reachable from s_0 and π. Yet since π must be acyclic, one can prove inductively that V_π(s_0) ≤ V(s_0), and hence:

Proposition 12. Provided an initial admissible value function, LDFS-BOUND terminates with a closed policy π that is optimal over the Max models GT and NON-DET-Max.

In addition, like the updates in LDFS over the additive models, updates in LDFS-BOUND are effective over Max models:

Proposition 13. Provided an initial admissible value function, all updates in LDFS-BOUND over Max models strictly increase the value function.

[Figure 1: Game Tree with MIN player playing first.]

As a result, LDFS-BOUND over Max models can also be described as an algorithm that finds a solution π with cost V_π(s_0) ≤ V(s_0) if there is one such solution, and else updates V(s_0) and other values, and tries again. It is thus not entirely surprising that over Game Trees, LDFS-BOUND reduces to the MTD(−∞) algorithm (Plaat et al. 1996); i.e., given the same ordering on actions and successor states, both algorithms traverse the same paths and maintain the same value function:[3]

Proposition 14. With the initial value function V(s) = −∞ for all s, LDFS-BOUND is equivalent to MTD(−∞).

The proof involves code transformations and invariants. For example, the MT algorithm works with both lower and upper bounds, yet it can be shown that when the initial value function is −∞, upper bounds (from the perspective of the MIN player) play the same role as labels in LDFS-BOUND. For the equivalence, it is necessary to assume that both the top nodes and the terminal nodes correspond to MIN-player moves. While MT is symmetric in this sense, LDFS-BOUND is not.
Finally, to keep the presentation simple, we have assumed that V(s) = c_T(s) for terminal states, yet this condition is not required. The reader is encouraged to try the LDFS-BOUND algorithm on the Game Tree shown in Fig. 1, modified from (Plaat et al. 1996) so that it retains the same solution but with the MIN player moving first (thus payoffs have been made negative). The top node a is the initial state, where there are two applicable actions, left and right. The action right in a has non-deterministic effects i and p, and so on (note that MAX nodes like b and h do not correspond to states but to AND nodes). With an initial value function V = −∞, LDFS-BOUND, like MTD(−∞), traverses the subtree formed by a, b, c, d, e, f, g, h, i, j, k, l, m in the first iteration and sets V(a) = −41. This value is used as the bound of the second iteration, which also returns unsuccessfully with a new bound V(a) = −36, updated in the third iteration to V(a) = −35. The fourth iteration ends successfully, proving this value optimal, with a solution π that chooses right at a and i, and left at p.

[3] As in MTD(−∞), −∞ refers to a large negative number; thus −∞ < −∞ + k for positive k.

MDPs

In order to handle MDPs, two features are needed: an ε > 0 bound on the size of the residuals allowed, for avoiding asymptotic convergence, and a bookkeeping mechanism for avoiding loops and recognizing states that are solved. To illustrate the subtleties involved, let us assume that there is a single action π(s) applicable in each state. By performing a single depth-first pass over the descendants of s, keeping track of visited states so as not to visit them twice, we want to know when a state s can be labeled as solved. The subtlety arises due to the presence of cycles. In particular, it is no longer correct to label a state s as solved when the variable flag indicates that all descendants of s are consistent (i.e., have residuals no greater than ε), as there may be ancestors of s that are also reachable from s with unexplored descendants. Yet, even in such a case, there must be states s in the DFS tree spanned by LDFS-MDP (recall that no states are visited twice) such that all the states that are reachable from s and π are beneath s, and for those states, it is correct to label them as solved when flag is true. Moreover, the labeling scheme becomes complete if at that point not only s is labeled but also its descendants. The question, of course, is how to recognize such top elements in the state graph during the depth-first search. The answer is given by Tarjan's strongly connected components algorithm (Tarjan 1972), which keeps track of two indices s.low and s.idx for each state s encountered. The top elements s are precisely those for which s.low = s.idx. The resulting algorithm for MDPs, LDFS-MDP, is LDFS + ε-residuals + TARJAN. We lack the space to show it here, but we can prove that:

Proposition 15. LDFS-MDP is an instance of FIND-and-REVISE[ε] and hence, for a sufficiently small ε, solves MDPs provided they have a solution with finite (expected) cost.

LDFS-MDP is similar to the HDP algorithm (Bonet & Geffner 2003) that introduced Tarjan's algorithm for labeling states in MDPs, yet by trying all greedy actions in every state, LDFS-MDP ensures that all updates remain effective.
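The paper does not show the LDFS-MDP code itself; purely as an illustration of the labeling idea just described, the sketch below (using the earlier hypothetical helpers) runs one depth-first pass over the states reachable from a fixed policy π, computes Tarjan's idx/low indices, and labels a whole strongly connected component as solved when its root satisfies low = idx and no state visited beneath it had a residual above ε.

```python
# Illustrative labeling pass for a FIXED policy pi (a dict s -> action), not
# the paper's LDFS-MDP code: Tarjan's idx/low indices identify the "top"
# states (low == idx); their component is labeled solved when everything
# visited beneath them was eps-consistent.
def label_solved(model, V, pi, root, solved, eps, kind, P=None):
    idx, low, ok = {}, {}, {}
    stack, onstack, counter = [], set(), [0]

    def dfs(s):
        if s in solved or model.is_terminal(s):
            return True
        if s in idx:                          # already visited in this pass
            return ok.get(s, True)            # states still on the stack count as ok here
        idx[s] = low[s] = counter[0]
        counter[0] += 1
        stack.append(s)
        onstack.add(s)
        flag = residual(model, V, s, kind, P) <= eps
        for s1 in model.F(pi[s], s):
            flag = dfs(s1) and flag
            if s1 in onstack:                 # edge back into the current component
                low[s] = min(low[s], low[s1])
        if low[s] == idx[s]:                  # s is a top element: pop its component
            component = []
            while True:
                t = stack.pop()
                onstack.discard(t)
                component.append(t)
                if t == s:
                    break
            for t in component:
                ok[t] = flag
            if flag:                          # label s and all its descendants solved
                solved.update(component)
        return flag

    return dfs(root)
```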
Discussion

We have developed a computational framework, LDFS, which makes explicit and generalizes two key ideas underlying a family of effective search algorithms across a variety of models: learning and lower bounds.[4] The same LDFS algorithm handles deterministic and non-deterministic models, with or without cycles, and a simple variation handles MDPs, where solutions can be cyclic. The framework also uncovers a key distinction between Additive and Max models that has apparently gone unnoticed: optimal solutions to Additive models are globally optimal, but optimal solutions to Max models need not be. Still, algorithms like AO* compute globally optimal solutions, which is adequate for Additive AND/OR graphs but is not required for Max AND/OR graphs. We have actually implemented the LDFS and Bounded LDFS algorithms and compared them with AO* and Value Iteration over Max AND/OR graphs. The results, reported in (Bonet & Geffner 2005), show that over a wide variety of instances and heuristic functions, LDFS and Bounded LDFS are almost never worse than either AO* or Value Iteration, and instead are often one or more orders of magnitude faster.

We do not expect similar practical gains over DET and GTs, where LDFS and Bounded LDFS reduce to well-known algorithms. The value of the proposed framework, however, goes beyond the particular algorithms obtained for the various models. For example, since all LDFS algorithms can be understood as extensions of IDA*, some of their limitations can be understood in terms of the limitations of IDA* itself. For example, it is well known that IDA* doesn't do well when action costs are real numbers. In MDPs, this problem arises even when action costs are integers because of the probabilities. We are thus currently exploring variations on the basic LDFS-MDP algorithm that exploit this parallelism.

[4] For other works emphasizing the common ideas between single-agent and two-player search (DET and GTs in our terms), see (Marsland & Reinefeld 1993) and (Schaeffer, Plaat, & Junghanns 2001).

References

Barto, A.; Bradtke, S.; and Singh, S. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence 72.
Bellman, R. 1957. Dynamic Programming. Princeton University Press.
Bertsekas, D. 1995. Dynamic Programming and Optimal Control, Vols. 1 and 2. Athena Scientific.
Bonet, B., and Geffner, H. 2003. Faster heuristic search algorithms for planning with uncertainty and full feedback. In Proc. IJCAI-03.
Bonet, B., and Geffner, H. 2005. An algorithm better than AO*?
Hansen, E., and Zilberstein, S. 2001. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence 129.
Korf, R. 1985. Depth-first iterative-deepening: an optimal admissible tree search. Artificial Intelligence 27(1).
Korf, R. 1990. Real-time heuristic search. Artificial Intelligence 42.
Marsland, T. A., and Reinefeld, A. 1993. Heuristic search in one and two player games. Technical Report TR 93-02, University of Alberta.
Martelli, A., and Montanari, U. 1973. Additive AND/OR graphs. In Proc. IJCAI-73.
Newell, A.; Shaw, J. C.; and Simon, H. 1963. Chess-playing programs and the problem of complexity. In Feigenbaum, E., and Feldman, J., eds., Computers and Thought. McGraw-Hill.
Nilsson, N. 1980. Principles of Artificial Intelligence. Tioga.
Pearl, J. 1983. Heuristics. Addison-Wesley.
Plaat, A.; Schaeffer, J.; Pijls, W.; and de Bruin, A. 1996. Best-first fixed-depth minimax algorithms. Artificial Intelligence 87(1-2).
Reinefeld, A., and Marsland, T. 1994. Enhanced iterative-deepening search. IEEE Trans. on Pattern Analysis and Machine Intelligence 16(7).
Schaeffer, J.; Plaat, A.; and Junghanns, A. 2001. Unifying single-agent and two-player search. Inf. Sci. 135(3-4).
Tarjan, R. 1972. Depth first search and linear graph algorithms. SIAM Journal on Computing 1(2).


More information

The efficiency of identifying timed automata and the power of clocks

The efficiency of identifying timed automata and the power of clocks The efficiency of identifying timed automata and the power of clocks Sicco Verwer a,b,1,, Mathijs de Weerdt b, Cees Witteveen b a Eindhoven University of Technology, Department of Mathematics and Computer

More information

IS-ZC444: ARTIFICIAL INTELLIGENCE

IS-ZC444: ARTIFICIAL INTELLIGENCE IS-ZC444: ARTIFICIAL INTELLIGENCE Lecture-07: Beyond Classical Search Dr. Kamlesh Tiwari Assistant Professor Department of Computer Science and Information Systems, BITS Pilani, Pilani, Jhunjhunu-333031,

More information

CSE 202 Homework 4 Matthias Springer, A

CSE 202 Homework 4 Matthias Springer, A CSE 202 Homework 4 Matthias Springer, A99500782 1 Problem 2 Basic Idea PERFECT ASSEMBLY N P: a permutation P of s i S is a certificate that can be checked in polynomial time by ensuring that P = S, and

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability Due: Thursday 10/15 in 283 Soda Drop Box by 11:59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators)

More information

Lecture notes for Analysis of Algorithms : Markov decision processes

Lecture notes for Analysis of Algorithms : Markov decision processes Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Shortest paths with negative lengths

Shortest paths with negative lengths Chapter 8 Shortest paths with negative lengths In this chapter we give a linear-space, nearly linear-time algorithm that, given a directed planar graph G with real positive and negative lengths, but no

More information

Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events

Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events Massimo Franceschet Angelo Montanari Dipartimento di Matematica e Informatica, Università di Udine Via delle

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information

Algorithms for Playing and Solving games*

Algorithms for Playing and Solving games* Algorithms for Playing and Solving games* Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599 * Two Player Zero-sum Discrete

More information

Lecture Notes on Emptiness Checking, LTL Büchi Automata

Lecture Notes on Emptiness Checking, LTL Büchi Automata 15-414: Bug Catching: Automated Program Verification Lecture Notes on Emptiness Checking, LTL Büchi Automata Matt Fredrikson André Platzer Carnegie Mellon University Lecture 18 1 Introduction We ve seen

More information

Principles of AI Planning

Principles of AI Planning Principles of 7. Planning as search: relaxed Malte Helmert and Bernhard Nebel Albert-Ludwigs-Universität Freiburg June 8th, 2010 How to obtain a heuristic STRIPS heuristic Relaxation and abstraction A

More information

Solving Fuzzy PERT Using Gradual Real Numbers

Solving Fuzzy PERT Using Gradual Real Numbers Solving Fuzzy PERT Using Gradual Real Numbers Jérôme FORTIN a, Didier DUBOIS a, a IRIT/UPS 8 route de Narbonne, 3062, Toulouse, cedex 4, France, e-mail: {fortin, dubois}@irit.fr Abstract. From a set of

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning

More information

Finite Automata. Mahesh Viswanathan

Finite Automata. Mahesh Viswanathan Finite Automata Mahesh Viswanathan In this lecture, we will consider different models of finite state machines and study their relative power. These notes assume that the reader is familiar with DFAs,

More information

Optimism in the Face of Uncertainty Should be Refutable

Optimism in the Face of Uncertainty Should be Refutable Optimism in the Face of Uncertainty Should be Refutable Ronald ORTNER Montanuniversität Leoben Department Mathematik und Informationstechnolgie Franz-Josef-Strasse 18, 8700 Leoben, Austria, Phone number:

More information