Efficient Reinforcement Learning in Factored MDPs


In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, Sweden, August 1999

Efficient Reinforcement Learning in Factored MDPs

Michael Kearns, AT&T Labs, mkearns@research.att.com
Daphne Koller, Stanford University, koller@cs.stanford.edu

Abstract

We present a provably efficient and near-optimal algorithm for reinforcement learning in Markov decision processes (MDPs) whose transition model can be factored as a dynamic Bayesian network (DBN). Our algorithm generalizes the recent $E^3$ algorithm of Kearns and Singh, and assumes that we are given both an algorithm for approximate planning, and the graphical structure (but not the parameters) of the DBN. Unlike the original $E^3$ algorithm, our new algorithm exploits the DBN structure to achieve a running time that scales polynomially in the number of parameters of the DBN, which may be exponentially smaller than the number of global states.

1 Introduction

Kearns and Singh (1998) recently presented a new algorithm for reinforcement learning in Markov decision processes (MDPs). Their $E^3$ algorithm (for Explicit Explore or Exploit) achieves near-optimal performance in a running time and a number of actions which are polynomial in the number of states and a parameter $T$, which is the horizon time in the case of discounted return, and the mixing time of the optimal policy in the case of infinite-horizon average return. The $E^3$ algorithm makes no assumptions on the structure of the unknown MDP, and the resulting polynomial dependence on the number of states makes $E^3$ impractical in the case of very large MDPs. In particular, it cannot be easily applied to MDPs in which the transition probabilities are represented in the factored form of a dynamic Bayesian network (DBN). MDPs with very large state spaces, and such DBN-MDPs in particular, are becoming increasingly important as reinforcement learning methods are applied to problems of growing difficulty [Boutilier et al., 1999]. In this paper, we extend the $E^3$ algorithm to the case of DBN-MDPs.

The original $E^3$ algorithm relies on the ability to find optimal strategies in a given MDP, that is, to perform planning. This ability is readily provided by algorithms such as value iteration in the case of small state spaces. While the general planning problem is intractable in large MDPs, significant progress has been made recently on approximate solution algorithms both for DBN-MDPs in particular [Boutilier et al., 1999], and for large state spaces in general [Kearns et al., 1999; Koller and Parr, 1999]. Our new DBN-$E^3$ algorithm therefore assumes the existence of a procedure for finding approximately optimal policies in any given DBN-MDP. Our algorithm also assumes that the qualitative structure of the transition model is known, i.e., the underlying graphical structure of the DBN. This assumption is often reasonable, as the qualitative properties of a domain are often understood.

Using the planning procedure as a subroutine, DBN-$E^3$ explores the state space, learning the parameters it considers relevant. It achieves near-optimal performance in a running time and a number of actions that are polynomial in $T$ and the number of parameters in the DBN-MDP, which in general is exponentially smaller than the number of global states. We further examine conditions under which the mixing time of a policy in a DBN-MDP is polynomial in the number of parameters of the DBN-MDP.
The anytime nature of DBN-$E^3$ allows it to compete with such policies in total running time that is bounded by a polynomial in the number of parameters.

2 Preliminaries

We begin by introducing some of the basic concepts of MDPs and factored MDPs. A Markov Decision Process (MDP) is defined as a tuple $(S, A, R, P)$ where: $S$ is a set of states; $A$ is a set of actions; $R$ is a reward function $R: S \rightarrow \mathbb{R}$, such that $R(s)$ represents the reward obtained by the agent in state $s$; $P$ is a transition model $P: S \times A \times S \rightarrow [0,1]$, such that $P(s' \mid s, a)$ represents the probability of landing in state $s'$ if the agent takes action $a$ in state $s$. (A reward function is sometimes associated with (state, action) pairs rather than with states. Our assumption that the reward depends only on the state is made purely to simplify the presentation; it has no effect on our results.)

Most simply, MDPs are described explicitly, by writing down a set of transition matrices and reward vectors, one for each action $a$. However, this approach is impractical for describing complex processes. Here, the set of states is typically described via a set of random variables $\mathbf{X} = \{X_1, \dots, X_n\}$, where each $X_i$ takes on values in some finite domain $\mathrm{Dom}(X_i)$. In general, for a set of variables $\mathbf{Y} \subseteq \mathbf{X}$, an instantiation $\mathbf{y}$ assigns a value $y \in \mathrm{Dom}(Y)$ to every $Y \in \mathbf{Y}$; we use $\mathrm{Dom}(\mathbf{Y})$ to denote the set of possible instantiations of $\mathbf{Y}$.

A state in this MDP is an assignment $\mathbf{x} \in \mathrm{Dom}(\mathbf{X})$; the total number of states is therefore exponentially large in the number of variables. Thus, it is impractical to represent the transition model explicitly using transition matrices. The framework of dynamic Bayesian networks (DBNs) allows us to describe a certain important class of such MDPs in a compact way. Processes whose state is described via a set of variables typically exhibit a weak form of decoupling: not all of the variables at time $t$ directly influence the transition of a variable from time $t$ to time $t+1$. For example, in a simple robotics domain, the location of the robot at time $t+1$ may depend on its position, velocity, and orientation at time $t$, but not on what it is carrying, or on the amount of paper in the printer. DBNs are designed to represent such processes compactly.

Let $a \in A$ be an action. We first want to specify the transition model $P(\cdot \mid \cdot, a)$. Let $X_i$ denote a state variable at the current time and $X_i'$ denote the same variable at the next time step. The transition model for action $a$ will consist of two parts: an underlying transition graph $G_a$ associated with $a$, and parameters associated with that graph. The transition graph $G_a$ is a two-layer directed acyclic graph whose nodes are $X_1, \dots, X_n, X_1', \dots, X_n'$. All edges in this graph are directed from nodes in $X_1, \dots, X_n$ to nodes in $X_1', \dots, X_n'$; note that we are assuming that there are no edges between variables within a time slice. We denote the parents of $X_i'$ in the graph by $\mathrm{Pa}_a(X_i')$. Intuitively, the transition graph for $a$ specifies the qualitative nature of probabilistic dependencies in a single time step: namely, the new setting of $X_i$ depends only on the current setting of the variables in $\mathrm{Pa}_a(X_i')$. To make this dependence quantitative, each node $X_i'$ is associated with a conditional probability table (CPT) $P_a(X_i' \mid \mathrm{Pa}_a(X_i'))$. The transition probability $P(\mathbf{x}' \mid \mathbf{x}, a)$ is then defined to be $\prod_i P_a(x_i' \mid \mathbf{u}_i)$, where $\mathbf{u}_i$ is the setting in $\mathbf{x}$ of the variables in $\mathrm{Pa}_a(X_i')$.

We also need to provide a compact representation of the reward function. As in the transition model, explicitly specifying a reward for each of the exponentially many states is impractical. Again, we use the idea of factoring the representation of the reward function into a set of localized reward functions, each of which only depends on a small set of variables. In our robot example, our reward might be composed of several subrewards: for example, one associated with location (for getting too close to a wall), one associated with the printer status (for letting paper run out), and so on. More precisely, let $R$ be a set of functions $r_1, \dots, r_m$; each function $r_j$ is associated with a cluster of variables $C_j \subseteq \{X_1, \dots, X_n\}$, such that $r_j$ is a function from $\mathrm{Dom}(C_j)$ to $\mathbb{R}$. Abusing notation, we will use $r_j(\mathbf{x})$ to denote the value that $r_j$ takes for the part of the state vector $\mathbf{x}$ corresponding to $C_j$. The reward function associated with the DBN-MDP at a state $\mathbf{x}$ is then defined to be $R(\mathbf{x}) = \sum_{j=1}^{m} r_j(\mathbf{x})$.

The following definitions for finite-length paths in MDPs will be of repeated technical use in the analysis. Let $M$ be a Markov decision process, and let $\pi$ be a policy in $M$. A $T$-path in $M$ is a sequence $p$ of $T+1$ states (that is, $T$ transitions) of $M$: $s_1, s_2, \dots, s_{T+1}$. The probability that $p$ is traversed in $M$ upon starting in state $s_1$ and executing policy $\pi$ is denoted $\Pr_M^\pi[p] = \prod_{k=1}^{T} P(s_{k+1} \mid s_k, \pi(s_k))$. There are three standard notions of the expected return enjoyed by a policy in an MDP: the asymptotic discounted return, the asymptotic average return, and the finite-time average return.
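
To make the factored representation concrete, here is a small Python sketch of a DBN-MDP encoding together with the factored transition probability and reward defined above. The class name, the field layout, and the sampling helper are our own illustrative choices and are not part of the paper.

```python
import random

class DBNMDP:
    """Illustrative encoding of a DBN-MDP (our own field layout, not the paper's).

    domains[i]      : list of values of variable X_i
    parents[a][i]   : tuple of indices of Pa_a(X_i')
    cpts[a][i]      : dict mapping a parent instantiation (tuple of values)
                      to a dict {value of X_i': probability}
    reward_clusters : list of (cluster_indices, local_reward_fn) pairs
    """
    def __init__(self, domains, parents, cpts, reward_clusters):
        self.domains = domains
        self.parents = parents
        self.cpts = cpts
        self.reward_clusters = reward_clusters

    def transition_prob(self, x, a, x_next):
        """P(x' | x, a) = prod_i P_a(x'_i | u_i), with u_i the setting of Pa_a(X_i') in x."""
        p = 1.0
        for i in range(len(self.domains)):
            u = tuple(x[j] for j in self.parents[a][i])
            p *= self.cpts[a][i][u].get(x_next[i], 0.0)
        return p

    def reward(self, x):
        """R(x) = sum_j r_j(x restricted to cluster C_j)."""
        return sum(r(tuple(x[j] for j in cluster))
                   for cluster, r in self.reward_clusters)

    def sample_next(self, x, a, rng=random):
        """Sample x' one variable at a time from the CPTs (used in later sketches)."""
        x_next = []
        for i in range(len(self.domains)):
            u = tuple(x[j] for j in self.parents[a][i])
            values, probs = zip(*self.cpts[a][i][u].items())
            x_next.append(rng.choices(values, weights=probs)[0])
        return tuple(x_next)
```
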
Like the original $E^3$ algorithm, our new generalization will apply to all three cases, and to convey the main ideas it suffices for the most part to concentrate on the finite-time average return. This is because our finite-time average return result can be applied to the asymptotic returns through either the horizon time (roughly the inverse of one minus the discount factor) in the discounted case, or the mixing time of the optimal policy in the average case. (We examine the properties of mixing times in a DBN-MDP in Section 5.)

Let $M$ be a Markov decision process, let $\pi$ be a policy in $M$, and let $p$ be a $T$-path in $M$. The average return along $p$ in $M$ is $U_M(p) = \frac{1}{T}\bigl(R(s_1) + \cdots + R(s_T)\bigr)$. The $T$-step (expected) average return from state $s$ is $U_M^\pi(s, T) = \sum_p \Pr_M^\pi[p]\, U_M(p)$, where the sum is over all $T$-paths $p$ in $M$ that start at $s$. Furthermore, we define the optimal $T$-step average return from $s$ in $M$ by $U_M^*(s, T) = \max_\pi U_M^\pi(s, T)$.

An important problem in MDPs is planning: finding the policy that achieves optimal return in a given MDP. In our case, we are interested in achieving the optimal $T$-step average return. The complexity of all exact MDP planning algorithms depends polynomially on the number of states; this property renders all of these algorithms impractical for DBN-MDPs, where the number of states grows exponentially in the size of the representation. However, there has been recent progress on algorithms for approximately solving MDPs with large state spaces [Kearns et al., 1999], particularly on ones represented in a factored way as a DBN-MDP [Boutilier et al., 1999; Koller and Parr, 1999]. The focus of our work is on the reinforcement learning task, so we simply assume that we have access to a black box that performs approximate planning for a DBN-MDP.

Definition 2.1 A $\beta$-approximation $T$-step planning algorithm for a DBN-MDP is one that, given a DBN-MDP $M$ and a state $s$, produces a (compactly represented) policy $\pi$ such that $U_M^\pi(s, T) \geq \beta\, U_M^*(s, T)$.

We will charge our learning algorithm a single step of computation for each call to the assumed approximate planning algorithm. One way of thinking about our result is as a reduction of the problem of efficient learning in DBN-MDPs to the problem of efficient planning in DBN-MDPs.

Our goal is to perform model-based reinforcement learning. Thus, we wish to learn an approximate model from experience, and then exploit it (or explore it) by planning given the approximate model. In this paper, we focus on the problem of learning the model parameters (the CPTs), assuming that the model structure (the transition graphs) is given to us. It is therefore useful to consider the set of parameters that we wish to estimate. As we assumed that the rewards are deterministic, we can focus on the probabilistic parameters. (Our results easily extend to the case of stochastic rewards.)
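
As a concrete reading of $U_M^\pi(s, T)$, the sketch below estimates it by Monte Carlo sampling of $T$-paths, using the illustrative DBNMDP class from the previous sketch. The exact definition sums over all $T$-paths, which sampling only approximates, and the function and parameter names here are our own.

```python
import random

def estimate_T_step_return(mdp, policy, s, T, num_paths=1000, seed=0):
    """Monte Carlo estimate of U_M^pi(s, T): average, over sampled T-paths
    starting at s and following `policy`, of the average reward along each path."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_paths):
        x, path_return = s, 0.0
        for _ in range(T):
            path_return += mdp.reward(x)
            x = mdp.sample_next(x, policy(x), rng)
        total += path_return / T
    return total / num_paths
```
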

We define a transition component of the DBN-MDP to be a distribution $P_a(X_i' \mid \mathbf{u})$ for some action $a$ and some particular instantiation $\mathbf{u}$ of the parents $\mathrm{Pa}_a(X_i')$ in the transition model. Note that the number of transition components is at most $|A| \cdot \sum_i |\mathrm{Dom}(\mathrm{Pa}_a(X_i'))|$, but may be much lower when a variable's behavior is identical for several actions.

3 Overview of the Original $E^3$

Since our algorithm for learning in DBN-MDPs will be a direct generalization of the $E^3$ algorithm of Kearns and Singh (hereafter abbreviated KS), we begin with an overview of that algorithm and its analysis. It is important to bear in mind that the original algorithm is designed only for the case where the total number of states is small, and the algorithm runs in time polynomial in that number.

$E^3$ is what is commonly referred to as an indirect or model-based algorithm: rather than maintaining only a current policy or value function, the algorithm maintains a model for the transition probabilities and the rewards for some subset of the states of the unknown MDP $M$. Although the algorithm maintains a partial model of $M$, it may choose to never build a complete model of $M$, if doing so is not necessary to achieve high return.

The algorithm starts off by doing balanced wandering: upon arriving in a state, the algorithm takes the action it has tried the fewest times from that state (breaking ties randomly). At each state it visits, the algorithm maintains the obvious statistics: the reward received at that state, and for each action, the empirical distribution of next states reached (that is, the estimated transition probabilities).

A crucial notion is that of a known state: a state that the algorithm has visited so many times that the transition probabilities for that state are very close to their true values in $M$. This definition is carefully balanced so that "so many times" is still polynomially bounded, yet "very close" suffices to meet the simulation requirements below. An important observation is that we cannot do balanced wandering indefinitely before at least one state becomes known: by the Pigeonhole Principle, we will soon start to accumulate accurate statistics at some state.

The most important construction of the analysis is the known-state MDP. If $S$ is the set of currently known states, the known-state MDP $M_S$ is simply an MDP that is naturally induced on $S$ by the full MDP $M$. Briefly, all transitions in $M$ between states in $S$ are preserved in $M_S$, while all other transitions in $M$ are redirected in $M_S$ to lead to a single new, absorbing state that intuitively represents all of the unknown and unvisited states. Although $E^3$ does not have direct access to $M_S$, by virtue of the definition of the known states, it does have a good approximation $\hat{M}_S$.
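
For the flat (non-factored) case, this known-state MDP construction can be sketched directly from the empirical statistics. The code below is our own illustration of the idea, not KS's implementation, and the dictionary-based representation of the estimated model is an assumption of the sketch.

```python
def build_known_state_mdp(actions, S_known, P_hat, R_hat, absorbing="s0"):
    """Approximate known-state MDP hat{M}_S: transitions between known states keep
    their empirical estimates P_hat[(s, a)] = {s': prob}; all probability mass that
    leaves the known set is redirected to a single absorbing state with zero reward."""
    P_S, R_S = {}, {absorbing: 0.0}
    for s in S_known:
        R_S[s] = R_hat[s]
        for a in actions:
            dist, leak = {}, 0.0
            for s_next, p in P_hat[(s, a)].items():
                if s_next in S_known:
                    dist[s_next] = p
                else:
                    leak += p
            dist[absorbing] = dist.get(absorbing, 0.0) + leak
            P_S[(s, a)] = dist
    for a in actions:
        P_S[(absorbing, a)] = {absorbing: 1.0}   # the absorbing state never leaves
    return P_S, R_S
```
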
The KS analysis hinges on two central technical lemmas. The first is called the Simulation Lemma, and it establishes that $\hat{M}_S$ has good simulation accuracy: that is, the expected $T$-step return of any policy in $\hat{M}_S$ is close to its expected $T$-step return in $M_S$. Thus, at any time, $\hat{M}_S$ is a useful partial model of $M$, for that part of $M$ that the algorithm knows very well. The second central technical lemma is the Explore or Exploit Lemma. It states that either the optimal ($T$-step) policy in $M$ achieves its high return by staying (with high probability) in the set $S$ of currently known states, or the optimal policy has significant probability of leaving $S$ within $T$ steps. Most importantly, the algorithm can detect which of these two is the case; in the first case, it can simulate the behavior of the optimal policy by finding a high-return exploitation policy in the partial model, and in the second case, it can replicate the behavior of the optimal policy by finding an exploration policy that quickly reaches the additional absorbing state of the partial model. Thus, by performing two offline planning computations on $\hat{M}_S$, the algorithm is guaranteed to find either a way to get near-optimal return for the next $T$ steps, or a way to improve the statistics at an unknown or unvisited state within the next $T$ steps. KS show that this algorithm ensures near-optimal return in time polynomial in the number of states and $T$.

4 The DBN-$E^3$ Algorithm

Our goal is to derive a generalization of $E^3$ for DBN-MDPs, and to prove for it a result analogous to that of KS, but with a polynomial dependence not on the number of states, but on the number of CPT parameters in the DBN model. Our analysis closely mirrors the original, but requires a significant generalization of the Simulation Lemma that exploits the structure of a DBN-MDP, a modified construction of $M_S$ that can be represented as a DBN-MDP, and a number of alterations of the details.

Like the original $E^3$ algorithm, DBN-$E^3$ will build a model of the unknown DBN-MDP on the basis of its experience, but now the model will be represented in a compact, factorized form. More precisely, suppose that our algorithm is in state $\mathbf{x}$, executes action $a$, and arrives in state $\mathbf{x}'$. This experience will be used to update all the appropriate CPT entries of our model: namely, all the estimates $\hat{P}_a(x_i' \mid \mathbf{u}_i)$ are updated in the obvious way, where as usual $\mathbf{u}_i$ is the setting of $\mathrm{Pa}_a(X_i')$ in $\mathbf{x}$. We will also maintain counts $\mathrm{count}_a(x_i' \mid \mathbf{u}_i)$ of the number of times $\hat{P}_a(x_i' \mid \mathbf{u}_i)$ has been updated.

Recall that a crucial element of the original $E^3$ analysis was the notion of a known state. In the original analysis, it was observed that if the total number of states is small, then after a correspondingly bounded number of experiences some state must become known by the Pigeonhole Principle. We cannot hope to use the same logic here, as we are now in a DBN-MDP with an exponentially large number of states. Rather, we must pigeonhole not on the number of states, but on the number of parameters required to specify the DBN-MDP. Towards this goal, we will say that the CPT entry $\hat{P}_a(x_i' \mid \mathbf{u})$ is known if it has been visited enough times to ensure that, with high probability,
$$|P_a(x_i' \mid \mathbf{u}) - \hat{P}_a(x_i' \mid \mathbf{u})| \leq \epsilon.$$

We now would like to establish that if, for an appropriate choice of $\epsilon$, all CPT entries are known, then our approximate DBN-MDP can be used to accurately estimate the expected return of any policy in the true DBN-MDP. This is the desired generalization of the original Simulation Lemma. As in the original analysis, we will eventually apply it to a generalization of the induced MDP $M_S$, in which we deliberately restrict attention to only the known CPT entries.
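
A minimal sketch of this bookkeeping follows, assuming the illustrative DBN-MDP encoding used earlier; the class and method names are ours, and the visit threshold m_known stands in for the Chernoff-based bound derived in the next subsection.

```python
from collections import defaultdict

class CPTEstimator:
    """Maintains counts and empirical CPT estimates hat{P}_a(x_i' | u)."""
    def __init__(self, n_vars, parents, m_known):
        self.n_vars = n_vars
        self.parents = parents            # parents[a][i]: indices of Pa_a(X_i')
        self.m_known = m_known            # visit threshold for 'known'
        # counts[(a, i, u)][x_i'] = number of times X_i' took value x_i' under u
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, x, a, x_next):
        """Record the experience (x, a, x') in every CPT it touches."""
        for i in range(self.n_vars):
            u = tuple(x[j] for j in self.parents[a][i])
            self.counts[(a, i, u)][x_next[i]] += 1

    def estimate(self, a, i, u):
        """Empirical conditional distribution hat{P}_a(. | u) for variable X_i'."""
        c = self.counts[(a, i, u)]
        total = sum(c.values())
        return {v: k / total for v, k in c.items()} if total else {}

    def component_known(self, a, i, u):
        """In this sketch a transition component counts as known once its parent
        instantiation has been sampled m_known times (all its entries share the count)."""
        return sum(self.counts[(a, i, u)].values()) >= self.m_known
```
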

4.1 The DBN-MDP Simulation Lemma

Let $M$ and $\hat{M}$ be two DBN-MDPs over the same state space with the same transition graphs for every action $a$, and with the same reward functions. Then we say that $\hat{M}$ is an $\epsilon$-approximation of $M$ if for every action $a$ and node $X_i'$ in the transition graphs, for every setting $\mathbf{u}$ of $\mathrm{Pa}_a(X_i')$, and for every possible value $x_i'$ of $X_i'$,
$$|P_a(x_i' \mid \mathbf{u}) - \hat{P}_a(x_i' \mid \mathbf{u})| \leq \epsilon,$$
where $P_a$ and $\hat{P}_a$ are the CPTs of $M$ and $\hat{M}$, respectively.

Lemma 4.1 Let $M$ be any DBN-MDP over $n$ state variables with $N$ CPT entries in the transition model, and let $\hat{M}$ be an $\epsilon$-approximation of $M$, where $\epsilon$ is chosen polynomially small in $\Delta$, $1/n$, $1/N$, and $1/T$ (the precise choice emerges from the proof). Then for any policy $\pi$, and for any state $\mathbf{x}$,
$$|U_{\hat{M}}^\pi(\mathbf{x}, T) - U_M^\pi(\mathbf{x}, T)| \leq \Delta.$$

Proof (Sketch) Let us fix a policy $\pi$ and state $\mathbf{x}$. Recall that for any next state $\mathbf{x}'$ and any action $a$, the transition probability factorizes via the CPTs as $P(\mathbf{x}' \mid \mathbf{x}, a) = \prod_i P_a(x_i' \mid \mathbf{u}_i)$, where $\mathbf{u}_i$ is the setting of $\mathrm{Pa}_a(X_i')$ in $\mathbf{x}$. Let us say that $P(\mathbf{x}' \mid \mathbf{x}, a)$ contains a small factor if any of its CPT factors $P_a(x_i' \mid \mathbf{u}_i)$ is smaller than a threshold $\epsilon_1$. Note that a transition probability may actually be quite small itself (exponentially small in $n$) without necessarily containing a small factor.

Our first goal is to show that trajectories in $M$ and $\hat{M}$ that cross transitions containing a small CPT factor can be thrown away without much error. Consider a random trajectory of $T$ steps in $M$ from state $\mathbf{x}$ following policy $\pi$. It can be shown that the probability that such a trajectory will cross at least one transition $P(\mathbf{x}' \mid \mathbf{x}, a)$ that contains a small factor is at most $NT\epsilon_1$. Essentially, the probability that at any step, any particular small transition (CPT factor) will be taken by any particular variable is at most $\epsilon_1$. A simple union argument over the CPT entries and the time steps gives the desired bound. Therefore, the total contribution to the difference $|U_{\hat{M}}^\pi(\mathbf{x}, T) - U_M^\pi(\mathbf{x}, T)|$ by these trajectories can be shown to be small (proportional to $NT\epsilon_1$). We will thus ignore such trajectories for now.

The key advantage of eliminating small factors is that we can convert additive approximation guarantees into multiplicative ones. Let $p$ be any path of length $T$. If all the relevant CPT factors are greater than $\epsilon_1$, and we let $\alpha = \epsilon/\epsilon_1$, it can be shown that
$$(1-\alpha)^{nT}\,\Pr_M^\pi[p] \;\leq\; \Pr_{\hat{M}}^\pi[p] \;\leq\; (1+\alpha)^{nT}\,\Pr_M^\pi[p].$$
In other words, ignoring small CPT factors, the distributions on paths induced by $\pi$ in $M$ and $\hat{M}$ are quite similar. From this it follows that, for the upper bound, $U_{\hat{M}}^\pi(\mathbf{x}, T)$ can exceed $U_M^\pi(\mathbf{x}, T)$ by only a small amount; for appropriate choices of $\epsilon_1$ and $\epsilon$ the lemma is obtained. The lower bound argument is entirely symmetric.

Returning to the main development, we can now give a precise definition of a known CPT entry. It is a simple application of Chernoff bounds to show that provided the count $\mathrm{count}_a(x_i' \mid \mathbf{u})$ exceeds $O\!\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$, the estimate $\hat{P}_a(x_i' \mid \mathbf{u})$ has additive error at most $\epsilon$ with probability at least $1-\delta$. We thus say that this CPT entry is known if its count exceeds the given bound for the choice of $\epsilon$ specified by the DBN-MDP Simulation Lemma.

The DBN-MDP Simulation Lemma shows that if all CPT entries are known, then our approximate model can be used to find a near-optimal policy in the true DBN-MDP. Note that we can identify which CPT entries are known via the counts $\mathrm{count}_a(x_i' \mid \mathbf{u})$. Thus, if we are at a state $\mathbf{x}$ for which at least one of the associated CPT entries $\hat{P}_a(x_i' \mid \mathbf{u}_i)$ is unknown, by taking action $a$ we then obtain an experience that will increase the corresponding count. Thus, in analogy with the original $E^3$, as long as we are encountering unknown CPT entries, we can continue taking actions that increase the quality of our model; but now, rather than increasing counts on a per-state basis, the DBN-MDP Simulation Lemma shows why it suffices to increase the counts on a per-CPT-entry basis, which is crucial for obtaining the running time we desire.
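
The $\epsilon$-approximation condition in the Simulation Lemma is a purely entrywise comparison of CPTs; the sketch below checks it directly for two models in the illustrative encoding assumed earlier (our own helper, not part of the algorithm itself).

```python
def is_eps_approximation(m, m_hat, eps):
    """True if m_hat is an eps-approximation of m: identical transition graphs and
    every corresponding CPT entry within additive eps."""
    for a in m.cpts:
        for i in range(len(m.domains)):
            if m.parents[a][i] != m_hat.parents[a][i]:
                return False                 # transition graphs must agree
            for u, dist in m.cpts[a][i].items():
                other = m_hat.cpts[a][i].get(u, {})
                for v in set(dist) | set(other):
                    if abs(dist.get(v, 0.0) - other.get(v, 0.0)) > eps:
                        return False
    return True
```
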
We can thus show that if we encounter unknown CPT entries for a number of steps that is polynomial in the total number of CPT entries and $T$, there can no longer be any unknown CPT entries, and we know the true DBN-MDP well enough to solve for a near-optimal policy. However, similar to the original algorithm, the real difficulty arises when we are in a state with no unknown CPT entries, yet there do remain unknown CPT entries elsewhere. Then we have no guarantee that we can improve our model at the next step. In the original algorithm, this was solved by defining the known-state MDP, and proving the aforementioned Explore or Exploit Lemma. Duplicating this step for DBN-MDPs will require another new idea.

4.2 The DBN-MDP Explore or Exploit Lemma

In our context, when we construct a known-state MDP, we must satisfy the additional requirement that the known-state MDP preserve the DBN structure of the original problem, so that if we have a planning algorithm for DBN-MDPs that exploits the structure, we can then apply it to the known-state MDP. (Certain approaches to approximate planning in large MDPs do not require any structural assumptions [Kearns et al., 1999], but we anticipate that the most effective DBN-MDP planning algorithms eventually will.) Therefore, we cannot just introduce a new sink state to represent that part of $M$ that is unknown to us; we must also show how this sink state can be represented as a setting of the state variables of a DBN-MDP.

We present a new construction, which extends the idea of known states to the idea of known transitions. We say that a transition component $P_a(X_i' \mid \mathbf{u})$ is known if all of its CPT entries are known. The basic idea is that, while it is impossible to check locally whether a state is known, it is easy to check locally whether a transition component is known. Let $S$ be the set of known transition components. We define the known-transition DBN-MDP $M_S$ as follows.

The model $M_S$ behaves identically to $M$ as long as only known transitions are taken. As soon as an unknown transition is taken for some variable $X_i$, the variable takes on a new wandering value $x_i^0$, which we introduce into the model. The transition model is defined so that, once a variable takes on the value $x_i^0$, its value never changes. The reward function is defined so that, once at least one variable takes on the wandering value, the total reward is non-positive. These two properties give us the same overall behavior that KS got by making a sink state for the set of unknown states.

Definition 4.2 Let $M$ be a DBN-MDP and let $S$ be any subset of the transition components in the model. The induced DBN-MDP on $S$, denoted $M_S$, is defined as follows. $M_S$ has the same set of state variables as $M$; however, in $M_S$, each variable $X_i$ has, in addition to its original set of values $\mathrm{Dom}(X_i)$, a new wandering value $x_i^0$. $M_S$ has the same transition graphs as $M$. For each $a$, $X_i'$, and $\mathbf{u} \in \mathrm{Dom}(\mathrm{Pa}_a(X_i'))$, we have that $P_a^{M_S}(X_i' \mid \mathbf{u}) = P_a^{M}(X_i' \mid \mathbf{u})$ if the corresponding transition component is in $S$; in all other cases, $P_a^{M_S}(x_i^0 \mid \mathbf{u}) = 1$, and $P_a^{M_S}(x_i' \mid \mathbf{u}) = 0$ for all $x_i' \in \mathrm{Dom}(X_i)$. $M_S$ has the same set of reward clusters as $M$. For each $j = 1, \dots, m$ and $\mathbf{c} \in \mathrm{Dom}(C_j)$, we have that $r_j^{M_S}(\mathbf{c}) = r_j^{M}(\mathbf{c})$. For other vectors (those involving a wandering value), we have that $r_j^{M_S}(\mathbf{c}) = 0$.

With this definition, we can prove the analogue to the Explore or Exploit Lemma (details omitted).

Lemma 4.3 Let $M$ be any DBN-MDP, let $S$ be any subset of the transition components of $M$, and let $M_S$ be the induced MDP on $S$. For any state $\mathbf{x}$, any $T$, and any $\alpha$, either there exists a policy $\pi$ in $M_S$ such that $U_{M_S}^\pi(\mathbf{x}, T) \geq U_M^*(\mathbf{x}, T) - \alpha$, or there exists a policy $\pi$ in $M_S$ such that the probability that a walk of $T$ steps following $\pi$ will take at least one transition not in $S$ exceeds $\alpha$ divided by the maximum achievable $T$-step return.

This lemma essentially asserts that either there exists a policy that already achieves near-optimal (global) return by staying only in the local model, or there exists a policy that quickly exits the local model.
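
Definition 4.2 translates almost line by line into the illustrative encoding used earlier. In the sketch below the wandering value is a sentinel token, `known` is a set of (action, variable, parent-instantiation) triples, and the handling of parent settings that themselves contain a wandering value is elided; all of this is our own rendering, not the authors' code.

```python
WANDER = "#wander"   # sentinel standing in for the wandering value x_i^0

def induced_dbn_mdp(mdp, known):
    """Build M_S from a DBNMDP `mdp`: known transition components keep their CPTs,
    unknown ones send the variable to WANDER with probability 1, and any local
    reward whose cluster touches a WANDER value is zero."""
    new_cpts = {}
    for a, cpt_list in mdp.cpts.items():
        new_cpts[a] = []
        for i, cpt in enumerate(cpt_list):
            new_cpt = {}
            for u, dist in cpt.items():
                new_cpt[u] = dict(dist) if (a, i, u) in known else {WANDER: 1.0}
            # Parent settings containing WANDER would also map to WANDER with
            # probability 1 (making it absorbing); elided here for brevity.
            new_cpts[a].append(new_cpt)

    def zero_on_wander(r):
        return lambda vals: 0.0 if WANDER in vals else r(vals)

    new_rewards = [(cluster, zero_on_wander(r)) for cluster, r in mdp.reward_clusters]
    return DBNMDP([list(d) + [WANDER] for d in mdp.domains],
                  mdp.parents, new_cpts, new_rewards)
```
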
4.3 Putting It All Together

We now have all the pieces to finish the description and analysis of the DBN-$E^3$ algorithm. The algorithm initially executes balanced wandering for some period of time. After some number of steps, by the Pigeonhole Principle one or more transition components become known. When the algorithm reaches a known state (one where all the transition components are known), it can no longer perform balanced wandering. At that point, the algorithm performs approximate offline policy computations for two different DBN-MDPs. The first corresponds to attempted exploitation, and the second to attempted exploration.

Let $S$ be the set of known transitions at this step. In the attempted exploitation computation, the DBN-$E^3$ algorithm would like to find the optimal policy on the induced DBN-MDP $M_S$. Clearly, this DBN-MDP is not known to the algorithm. Thus, we use its approximation $\hat{M}_S$, where the true transition probabilities are replaced with their current approximation in the model. The definition of $M_S$ uses only the CPT entries of known transition components. The Simulation Lemma now tells us that, for an appropriate choice of $\epsilon$ (a choice that will result in a definition of known transition that requires the corresponding count to be only polynomial in $N$, $T$, and the accuracy parameters), the return of any policy in $\hat{M}_S$ is within $\Delta$ of its return in $M_S$. We will specify a choice for $\Delta$ later (which in turn sets the choice of $\epsilon$ and the definition of known transition).

Let us now consider the two cases in the Explore or Exploit Lemma. In the exploitation case, there exists a policy $\pi$ in $M_S$ such that $U_{M_S}^\pi(\mathbf{x}, T) \geq U_M^*(\mathbf{x}, T) - \alpha$. (Again, we will discuss the choice of $\alpha$ below.) From the Simulation Lemma, we have that $U_{\hat{M}_S}^\pi(\mathbf{x}, T) \geq U_M^*(\mathbf{x}, T) - \alpha - \Delta$. Our approximate planning algorithm returns a policy whose value in $\hat{M}_S$ is guaranteed to be a multiplicative factor of at most $\beta$ away from the optimal policy in $\hat{M}_S$. Thus, we are guaranteed that the returned policy $\hat{\pi}$ satisfies $U_{\hat{M}_S}^{\hat{\pi}}(\mathbf{x}, T) \geq \beta\,\bigl(U_M^*(\mathbf{x}, T) - \alpha - \Delta\bigr)$. Therefore, in the exploitation case, our approximate planner is guaranteed to return a policy whose value is close to the optimal value.

In the exploration case, there exists a policy in $M_S$ (and therefore in $\hat{M}_S$) that is guaranteed to take an unknown transition within $T$ steps with some minimum probability. Our goal now is to use our approximate planner to find such a policy. In order to do that, we need to use a slightly different construction $\hat{M}_S'$. The transition structure of $\hat{M}_S'$ is identical to that of $\hat{M}_S$. However, the rewards are now different. Here, for each $j = 1, \dots, m$ and $\mathbf{c} \in \mathrm{Dom}(C_j)$, we have that $r_j(\mathbf{c}) = 0$; for other vectors (those involving a wandering value), we have that $r_j(\mathbf{c}) = 1$. Now let $\hat{\pi}$ be the policy returned by our approximate planner on the DBN-MDP $\hat{M}_S'$. It can be shown that the probability that a $T$-step walk following $\hat{\pi}$ will take at least one unknown transition is at least a $\beta$ fraction of the corresponding probability for the best exploration policy.

To summarize: our approximate planner either finds an exploitation policy in $\hat{M}_S$ that enjoys actual return close to $U_M^*(\mathbf{x}, T)$ from our current state, or it finds an exploration policy in $\hat{M}_S'$ that has significant probability of improving our statistics at an unknown transition in the next $T$ steps.
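
Schematically, one decision point of DBN-$E^3$ combines the pieces above as follows. This is a sketch of the control flow only: the estimator methods used here (unknown_action_at, induced_model, exploration_model, estimated_return) are hypothetical helpers that would be built from the bookkeeping and constructions sketched earlier, `planner` is the assumed black-box approximate planner, and the thresholds and constants of the analysis are not reproduced.

```python
def dbn_e3_step(x, estimator, env_step, planner, T, exploit_threshold):
    """One (schematic) decision point of DBN-E3.

    env_step(x, a)  : executes a in the real DBN-MDP and returns the next state.
    planner(mdp, T) : assumed black-box approximate T-step planner; returns a policy.
    """
    a = estimator.unknown_action_at(x)          # balanced wandering while possible
    if a is not None:
        return env_step(x, a)

    # Every transition component at x is known: plan offline in the induced model.
    m_hat_S = estimator.induced_model()         # hat{M}_S built from known components
    exploit_policy = planner(m_hat_S, T)
    if estimator.estimated_return(m_hat_S, exploit_policy, x, T) >= exploit_threshold:
        a = exploit_policy(x)                   # attempted exploitation
    else:
        m_explore = estimator.exploration_model()   # rewards favor unknown transitions
        a = planner(m_explore, T)(x)                # attempted exploration
    return env_step(x, a)
```
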

Appropriate choices for $\alpha$ and $\epsilon$ yield our main theorem, which we are now finally ready to describe. Recall that for expository purposes we have concentrated on the case of $T$-step average return. However, as for the original $E^3$, our main result can be stated in terms of the asymptotic discounted and average return cases. We omit the details of this translation, but it is a simple matter of arguing that it suffices to set $T$ to be either the horizon time (in the discounted case) or the mixing time of the optimal policy (in the average case).

Theorem 4.4 (Main Theorem) Let $M$ be a DBN-MDP with $N$ total entries in the CPTs.

(Undiscounted case) Let $T$ be the mixing time of the policy achieving the optimal average asymptotic return $U^*$ in $M$. There exists an algorithm DBN-$E^3$ that, given access to a $\beta$-approximation planning algorithm for DBN-MDPs, and given inputs $\epsilon$ and $\delta$, takes a number of actions and computation time bounded by a polynomial in $N$, $T$, $1/\epsilon$, and $1/\delta$, and with probability at least $1-\delta$, achieves total actual return exceeding $\beta U^* - \epsilon$.

(Discounted case) Let $V^*$ denote the value function for the policy with the optimal expected discounted return in $M$. There exists an algorithm DBN-$E^3$ that, given access to a $\beta$-approximation planning algorithm for DBN-MDPs, and given inputs $\epsilon$, $\delta$, and a state $\mathbf{x}$, takes a number of actions and computation time bounded by a polynomial in $N$, the horizon time $T$, $1/\epsilon$, and $1/\delta$, and with probability at least $1-\delta$, will halt in a state $\mathbf{x}'$, and output a policy $\hat{\pi}$, such that $V_M^{\hat{\pi}}(\mathbf{x}') \geq \beta\, V^*(\mathbf{x}') - \epsilon$.

Some remarks. The loss in policy quality induced by the approximate planning subroutine translates into a degradation in the running time of our algorithm. As with the original $E^3$, we can eliminate knowledge of the optimal returns in both cases via search techniques. Although we have stated our asymptotic undiscounted average return result in terms of the mixing time of the optimal policy, we can instead give an anytime algorithm that competes against policies with longer and longer mixing times the longer it is run. (We omit details, but the analysis is analogous to the original $E^3$ analysis.) This extension is especially important in light of the results of the following section, where we examine properties of mixing times in DBN-MDPs.

5 Mixing Time Bounds for DBN-MDPs

As in the original $E^3$ paper, our average-case result depends on the amount of time that it takes the target policy to mix. This dependence is unavoidable. If some of the probabilities are very small, so that the optimal policy cannot easily reach the high-reward parts of the space, it is unrealistic to expect the reinforcement learning algorithm to do any better. In the context of a DBN-MDP, however, this dependence is more troubling. The size of the state space is exponentially large, and virtually all of the probabilities for transitioning from one state to the next will be exponentially small (because a transition probability is the product of $n$ numbers that are at most 1). Indeed, one can construct very reasonable DBN-MDPs that have an exponentially long mixing time. For example, a DBN representing the Markov chain of an Ising model [Jerrum and Sinclair, 1993] has small parent sets (at most four parents per node), and CPT entries that are reasonably large. Nevertheless, the mixing time of such a DBN can be exponentially large in the number of variables $n$.

Given that even reasonable DBNs such as this can have exponential mixing times, one might think that this is the typical situation; that is, that most DBN-MDPs have an exponentially long mixing time, reintroducing the exponential dependence on $n$ that we have been trying so hard to avoid. We now show that this is not always the case. We provide a tool for analyzing the mixing time of a policy in a DBN-MDP, which can give us much better bounds on the mixing time. In particular, we demonstrate a class of DBN-MDPs and associated policies for which we can guarantee rapid mixing. Note that any fixed policy in a DBN-MDP defines a Markov chain whose transition model is represented as a DBN. We therefore begin by considering the mixing time of a pure DBN, with no actions. We then extend that analysis to the mixing rate for a fixed policy in a DBN-MDP.
Definition 5.1 Let $P$ be a transition model for a Markov chain over the state variables $X_1, \dots, X_n$, and let $\mathbf{X}^{(t)} = (X_1^{(t)}, \dots, X_n^{(t)})$ represent the state of the chain at time $t$. Let $\mathbf{x} = (x_1, \dots, x_n)$, and let $\mu(\mathbf{x})$ be the stationary probability of $\mathbf{x}$ in this Markov chain. We say that the Markov chain is $\epsilon$-mixed at time $t$ if, regardless of the starting state, the distribution of $\mathbf{X}^{(t)}$ is within variation distance $\epsilon$ of $\mu$.

Our bounds on mixing times make use of the coupling method [Lindvall, 1992]. The idea of the coupling method is as follows: we run two copies of the Markov chain in parallel, from different starting points. Our goal is to make the states of the two processes coalesce. Intuitively, the first time the states of the two copies are the same, the initial states have been forgotten, which corresponds to the processes having mixed. More precisely, consider a transition matrix $P$ over some state space. Let $Q$ be a transition matrix over the product space of pairs $(\mathbf{X}, \tilde{\mathbf{X}})$, such that if $(\mathbf{X}^{(t)}, \tilde{\mathbf{X}}^{(t)})$ is the Markov chain for $Q$, then the separated Markov chains $\mathbf{X}^{(t)}$ and $\tilde{\mathbf{X}}^{(t)}$ both evolve according to $P$. Let $T_c$ be the random variable that represents the coupling time: the smallest $t$ for which $\mathbf{X}^{(t)} = \tilde{\mathbf{X}}^{(t)}$. The following lemma establishes the correspondence between mixing and coupling times.

Lemma 5.2 For any $t$, let $\epsilon$ be such that for any pair of starting states, $\Pr[T_c > t] \leq \epsilon$. Then the Markov chain is $\epsilon$-mixed at time $t$.

Thus, to show that a Markov chain is mixed by some time $t$, we need only construct a coupled chain and show that the probability that this chain has not coupled by time $t$ decreases very rapidly in $t$.
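
The coupling idea can be illustrated concretely on an ordinary (dense) transition matrix: at each step the two copies are made to agree with as much probability as the two rows allow (the construction is made precise below), and otherwise they move independently on the leftover mass. This sketch is our own illustration, not code from the paper; by Lemma 5.2, if most simulated runs have coalesced by time $t$, the chain is close to mixed at $t$.

```python
import random

def coupled_step(P, x, y, rng):
    """One step of a greedily-correlated coupling of two copies of the chain P
    (P is a row-stochastic matrix given as a list of lists). Each copy's marginal
    still follows P: with probability sum_z min(P[x][z], P[y][z]) both copies move
    to the same z, otherwise each moves independently on its residual distribution."""
    n = len(P)
    if x == y:
        z = rng.choices(range(n), weights=P[x])[0]
        return z, z
    overlap = [min(P[x][z], P[y][z]) for z in range(n)]
    if rng.random() < sum(overlap):
        z = rng.choices(range(n), weights=overlap)[0]
        return z, z
    res_x = [P[x][z] - overlap[z] for z in range(n)]
    res_y = [P[y][z] - overlap[z] for z in range(n)]
    return (rng.choices(range(n), weights=res_x)[0],
            rng.choices(range(n), weights=res_y)[0])

def coupling_time(P, x0, y0, max_t=10_000, seed=0):
    """Simulate the coupled chain until the two copies coalesce (cf. Lemma 5.2)."""
    rng, x, y = random.Random(seed), x0, y0
    for t in range(1, max_t + 1):
        x, y = coupled_step(P, x, y, rng)
        if x == y:
            return t
    return max_t
```
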

The coupling method allows us to construct the joint chain over $(\mathbf{X}, \tilde{\mathbf{X}})$ in any way that we want, as long as each of the two chains in isolation has the same dynamics as the original Markov chain. In particular, we can correlate the transitions of the two processes, so as to make their states coincide faster than they would if each was picked independently of the other. That is, we choose $\mathbf{X}^{(t+1)}$ and $\tilde{\mathbf{X}}^{(t+1)}$ to be equal to each other whenever possible, subject to the constraints on the transition probabilities. More precisely, let $\mathbf{X}^{(t)} = \mathbf{x}$ and $\tilde{\mathbf{X}}^{(t)} = \tilde{\mathbf{x}}$. For any value $\mathbf{y}$, we can make the event $\mathbf{X}^{(t+1)} = \mathbf{y} \wedge \tilde{\mathbf{X}}^{(t+1)} = \mathbf{y}$ have a probability that is the smaller of $P(\mathbf{y} \mid \mathbf{x})$ and $P(\mathbf{y} \mid \tilde{\mathbf{x}})$. Compare this to the probability of this event if the two processes were independent, which is the product of these two numbers rather than their minimum. Overall, by correlating the two processes as much as possible, and considering the worst case over the current state of the process, we can guarantee that, at every step, the two processes couple with probability at least
$$\gamma \;=\; \min_{\mathbf{x}, \tilde{\mathbf{x}}} \sum_{\mathbf{y}} \min\bigl(P(\mathbf{y} \mid \mathbf{x}),\, P(\mathbf{y} \mid \tilde{\mathbf{x}})\bigr).$$
This quantity represents the amount of probability mass that any two transition distributions are guaranteed to have in common. It is called the Dobrushin coefficient, and it governs the contraction rate for the $\ell_1$ norm [Dobrushin, 1956] in Markov chains.

Now, consider a DBN over the state variables $\mathbf{X} = \{X_1, \dots, X_n\}$. As above, we create two copies of the process, letting $X_1, \dots, X_n$ denote the variables in the first component of the coupled Markov chain, and $\tilde{X}_1, \dots, \tilde{X}_n$ denote those in the second component. Our goal is to construct a Markov chain over $(\mathbf{X}, \tilde{\mathbf{X}})$ such that both $\mathbf{X}$ and $\tilde{\mathbf{X}}$ separately have the same dynamics as in the original DBN. Our construction of the joint Markov chain is very similar to the one used above, except that we will now choose the transition of each variable pair $X_i$ and $\tilde{X}_i$ so as to maximize the probability that they couple (assume the same value). As above, we can guarantee that $X_i$ and $\tilde{X}_i$ couple at any time with probability at least
$$\gamma_i \;=\; \min_{\mathbf{u}, \tilde{\mathbf{u}}} \sum_{x_i'} \min\bigl(P(x_i' \mid \mathbf{u}),\, P(x_i' \mid \tilde{\mathbf{u}})\bigr),$$
where $\mathbf{u}$ and $\tilde{\mathbf{u}}$ range over instantiations of the parents of $X_i'$. This coefficient was defined by [Boyen and Koller, 1998] in their analysis of the contraction rate of DBNs. Note that $\gamma_i$ depends only on the numbers in a single CPT of the DBN. Assuming that the transition probabilities in each CPT are not too extreme, the probability that any single variable couples will be reasonably high.

Unfortunately, this bound is not enough to show that all of the variable pairs couple within a short time. The problem is that it is not enough for two variables $X_i$ and $\tilde{X}_i$ to couple, as the process dynamics may force us to decouple them at subsequent time slices. To understand this issue, consider a simple process with two variables $X$ and $Y$, and a transition graph with the edges $X \to X'$, $X \to Y'$, $Y \to Y'$. Assume that at time $t$, the variable pair $(Y, \tilde{Y})$ has coupled with value $y$, but $(X, \tilde{X})$ has not, so that $X^{(t)} = x$ and $\tilde{X}^{(t)} = \tilde{x} \neq x$. At the next time slice, we must select $Y^{(t+1)}$ and $\tilde{Y}^{(t+1)}$ from two different distributions, $P(Y' \mid x, y)$ and $P(Y' \mid \tilde{x}, y)$, respectively. Thus, our sampling process may be forced to give them different values, decoupling them again.

As this example clearly illustrates, it is not enough for a variable pair to couple momentarily. In order to eventually couple the two processes as a whole, we need to make each variable pair a stable pair, i.e., we need to guarantee that our sampling process can keep them coupled from then on. In our example, the pair $(X, \tilde{X})$ is stable as soon as it first couples. And once $X$ is stable, then $Y$ will also be stable as soon as it couples. However, if $Y$ couples while $X$ is not yet stable, then the sampling process cannot guarantee stability. In general, a variable pair can only be stable if their parents are also stable. So what happens if we add the edge $Y \to X'$ to our transition model? In this case, neither $X$ nor $Y$ can stabilize in isolation. They can only stabilize if they couple simultaneously. This discussion leads to the following definition.

Definition 5.3 Consider a DBN over the state variables $X_1, \dots, X_n$. The dependency graph $D$ for the DBN is a directed graph (possibly containing cycles) whose nodes are $X_1, \dots, X_n$ and where there is a directed edge from $X_i$ to $X_j$ if there is an edge in the transition graph of the DBN from $X_i$ to $X_j'$. Hence, there is a directed path from $X_i$ to $X_j$ in $D$ iff $X_i$ at time $t$ influences $X_j$ at some later time step. We assume that the transition graph of the DBN always has the arcs $X_i \to X_i'$, so that every node in $D$ has a self-loop. Let $L_1, \dots, L_k$ be the maximal strongly connected components in $D$, sorted so that if $i < j$, there are no directed edges from $L_j$ to $L_i$. Our analysis will be based on stabilizing the $L_j$'s in succession.
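
The quantities driving the analysis below, the per-variable coupling coefficients $\gamma_i$ and the strongly connected components of the dependency graph, are easy to compute from the DBN structure alone. The sketch below assumes a single transition graph (i.e., a fixed action or policy) in the illustrative encoding used earlier; the quadratic-time SCC computation is deliberately simple, which is fine for the small variable counts typical of DBNs.

```python
from itertools import combinations

def coupling_coefficient(cpt):
    """gamma_i: minimum, over pairs of parent instantiations u, u~, of the overlap
    sum_v min(P(v | u), P(v | u~)) between the two conditional distributions."""
    gamma = 1.0
    for (u1, d1), (u2, d2) in combinations(cpt.items(), 2):
        overlap = sum(min(d1.get(v, 0.0), d2.get(v, 0.0)) for v in set(d1) | set(d2))
        gamma = min(gamma, overlap)
    return gamma

def dependency_graph(parents):
    """Edge i -> j whenever X_i is a parent of X_j' in the transition graph
    (self-loops are added, matching the assumption in Definition 5.3)."""
    n = len(parents)
    edges = {i: set() for i in range(n)}
    for j in range(n):
        for i in parents[j]:
            edges[i].add(j)
        edges[j].add(j)
    return edges

def strongly_connected_components(edges):
    """SCCs via pairwise reachability; simple rather than fast."""
    n = len(edges)
    reach = [set(edges[i]) | {i} for i in range(n)]
    for _ in range(n):                       # crude transitive closure
        for i in range(n):
            for j in list(reach[i]):
                reach[i] |= reach[j]
    comps, assigned = [], set()
    for i in range(n):
        if i not in assigned:
            comp = {j for j in range(n) if j in reach[i] and i in reach[j]}
            comps.append(sorted(comp))
            assigned |= comp
    return comps
```
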
(We note that we provide only a rough bound; a more refined analysis is possible.) Let $\gamma = \min_i \gamma_i$ and $c = \max_j |L_j|$. Assume that $L_1, \dots, L_{j-1}$ have all stabilized by time $t$. In order for $L_j$ to stabilize, all of its variables need to couple at exactly the same time. At any given time step, this event happens with probability at least $\gamma^{|L_j|} \geq \gamma^{c}$. As soon as $L_j$ stabilizes, we can move on to stabilizing $L_{j+1}$. When all the $L_j$'s have stabilized, we are done.

Theorem 5.4 For any $\epsilon > 0$, the Markov chain corresponding to a DBN as described above is $\epsilon$-mixed at time $t$ provided $t$ is at least on the order of $\frac{k}{\gamma^{c}} \ln \frac{k}{\epsilon}$.

Thus, the mixing time of a DBN grows exponentially with the size of the largest component in the dependency graph, which may be significantly smaller than the total number of variables in the DBN. Indeed, in two real-life DBNs, BAT [Forbes et al., 1995] with ten state variables and WATER [Jensen et al., 1989] with eight, the maximal cluster size is only 3 or 4.

It remains only to extend this analysis to DBN-MDPs, where we have a policy. Our stochastic coupling scheme must now deal with the fact that the actions taken at time $t$ in the two copies of the process may be different. The difficulty is that different actions at time $t$ correspond to different transition models. If a variable $X_i$ has a different transition model in different transition graphs, it will use a different transition distribution if the action is not the same. Hence $X_i$ cannot stabilize until we are guaranteed that the same action is taken in both copies. That is, the action must also stabilize. The action is only guaranteed to have stabilized when all of the variables on which the choice of action can possibly depend have stabilized. Otherwise, we might encounter a pair of states in which we are forced to use different actions in the two copies. We can analyze this behavior by extending the dependency graph to include a new node corresponding to the choice of action. We then see what assumptions allow us to bound the set of incoming and outgoing edges. We can then use the same analysis described above to bound the mixing time.

The outgoing edges correspond to the effect of an action. In many processes, the action only directly affects the transition model of a small number of state variables in the process. In other words, for many variables $X_i$, we have that $\mathrm{Pa}_a(X_i')$ and $P_a(X_i' \mid \mathrm{Pa}_a(X_i'))$ are the same for all $a$. In this case, the new action node will only have outgoing edges to the remaining variables (those for which the transition model might differ). We note that such localized influence models have a long history both for influence diagrams [Howard and Matheson, 1984] and for DBN-MDPs [Boutilier et al., 1999].

Now, consider the incoming edges. In general, the optimal policy might well be such that the action depends on every variable. However, the mere representation of such a policy may be very complex, rendering its use impractical in a DBN-MDP with many variables. Therefore, we often want to restrict attention to a simpler class of policies, such as a small finite state machine or a small decision tree. If our target policy is such that the choice of action only depends on a small number of variables, then there will only be a small number of incoming edges into the action node in the dependency graph.

Having integrated the action node into the dependency graph, our analysis above holds unchanged. The only difference from a random variable is that we do not have to include the action node when computing the size of the $L_j$ that contains it, as we do not have to stochastically make it couple; rather, it couples immediately once its parents have coupled. Finally, we note that this analysis easily accommodates DBN-MDPs where the decision about the action is decomposed into several independent decisions (e.g., as in [Meuleau et al., 1998]). Different component decisions can influence different subsets of variables, and the choice of action in each one can depend on different subsets of variables. Each decision forms a separate node in the dependency graph, and can stabilize independently of the other decisions.

The analysis above gives us techniques for estimating the mixing rate of policies in DBN-MDPs. In particular, if we want to focus on getting a good steady-state return from DBN-$E^3$ in a reasonable amount of time, this analysis shows us how to restrict attention to policies that are guaranteed to mix rapidly given the structure of the given DBN-MDP.

6 Conclusions

Structured probabilistic models, and particularly Bayesian networks, have revolutionized the field of reasoning under uncertainty by allowing compact representations of complex domains. Their success is built on the fact that this structure can be exploited effectively by inference and learning algorithms. This success leads one to hope that similar structure can be exploited in the context of planning and reinforcement learning under uncertainty. This paper, together with the recent work on representing and reasoning with factored MDPs [Boutilier et al., 1999], demonstrates that substantial computational gains can indeed be obtained from these compact, structured representations.

This paper leaves many interesting problems unaddressed. Of these, the most intriguing one is to allow the algorithm to learn the model structure as well as the parameters. The recent body of work on learning Bayesian networks from data [Heckerman, 1995] lays much of the foundation, but the integration of these ideas with the problems of exploration/exploitation is far from trivial.

Acknowledgements

We are grateful to the members of the DAGS group for useful discussions, and particularly to Brian Milch for pointing out a problem in an earlier version of this paper. The work of Daphne Koller was supported by the ARO under the MURI program "Integrated Approach to Intelligent Systems", by ONR contract N6619C8554 under DARPA's HPKB program, and by the generosity of the Powell Foundation and the Sloan Foundation.

References

[Boutilier et al., 1999] C. Boutilier, T. Dean, and S. Hanks. Decision theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 1999. To appear.

[Boyen and Koller, 1998] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proc. UAI, pages 33-42, 1998.

[Dobrushin, 1956] R. L. Dobrushin. Central limit theorem for nonstationary Markov chains. Theory of Probability and its Applications, pages 65-80, 1956.

[Forbes et al., 1995] J. Forbes, T. Huang, K. Kanazawa, and S. J. Russell. The BATmobile: Towards a Bayesian automated taxi. In Proc. IJCAI, 1995.

[Heckerman, 1995] D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, 1995.

[Howard and Matheson, 1984] R. A. Howard and J. E. Matheson. Influence diagrams. In R. A. Howard and J. E. Matheson, editors, Readings on the Principles and Applications of Decision Analysis. Strategic Decisions Group, Menlo Park, California, 1984.

[Jensen et al., 1989] F. V. Jensen, U. Kjærulff, K. G. Olesen, and J. Pedersen. An expert system for control of waste water treatment: a pilot project. Technical report, Judex Datasystemer A/S, Aalborg, 1989. In Danish.

[Jerrum and Sinclair, 1993] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 1993.

[Kearns and Singh, 1998] M. Kearns and S. P. Singh. Near-optimal performance for reinforcement learning in polynomial time. In Proc. ICML, pages 260-268, 1998.

[Kearns et al., 1999] M. Kearns, Y. Mansour, and A. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In these proceedings, 1999.

[Koller and Parr, 1999] D. Koller and R. Parr. Computing factored value functions for policies in structured MDPs. In these proceedings, 1999.

[Lindvall, 1992] T. Lindvall. Lectures on the Coupling Method. Wiley, 1992.

[Meuleau et al., 1998] N. Meuleau, M. Hauskrecht, K.-E. Kim, L. Peshkin, L. P. Kaelbling, T. Dean, and C. Boutilier. Solving very large weakly coupled Markov decision processes. In Proc. AAAI, pages 165-172, 1998.


More information

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016

Course 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016 Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the

More information

Inference in Bayesian Networks

Inference in Bayesian Networks Andrea Passerini passerini@disi.unitn.it Machine Learning Inference in graphical models Description Assume we have evidence e on the state of a subset of variables E in the model (i.e. Bayesian Network)

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

Applying Bayesian networks in the game of Minesweeper

Applying Bayesian networks in the game of Minesweeper Applying Bayesian networks in the game of Minesweeper Marta Vomlelová Faculty of Mathematics and Physics Charles University in Prague http://kti.mff.cuni.cz/~marta/ Jiří Vomlel Institute of Information

More information

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden 1 Selecting Efficient Correlated Equilibria Through Distributed Learning Jason R. Marden Abstract A learning rule is completely uncoupled if each player s behavior is conditioned only on his own realized

More information

Tensor rank-one decomposition of probability tables

Tensor rank-one decomposition of probability tables Tensor rank-one decomposition of probability tables Petr Savicky Inst. of Comp. Science Academy of Sciences of the Czech Rep. Pod vodárenskou věží 2 82 7 Prague, Czech Rep. http://www.cs.cas.cz/~savicky/

More information

Solving Factored MDPs with Continuous and Discrete Variables

Solving Factored MDPs with Continuous and Discrete Variables Appeared in the Twentieth Conference on Uncertainty in Artificial Intelligence, Banff, Canada, July 24. Solving Factored MDPs with Continuous and Discrete Variables Carlos Guestrin Milos Hauskrecht Branislav

More information

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes.

Final. Introduction to Artificial Intelligence. CS 188 Spring You have approximately 2 hours and 50 minutes. CS 188 Spring 2014 Introduction to Artificial Intelligence Final You have approximately 2 hours and 50 minutes. The exam is closed book, closed notes except your two-page crib sheet. Mark your answers

More information

Recoverabilty Conditions for Rankings Under Partial Information

Recoverabilty Conditions for Rankings Under Partial Information Recoverabilty Conditions for Rankings Under Partial Information Srikanth Jagabathula Devavrat Shah Abstract We consider the problem of exact recovery of a function, defined on the space of permutations

More information

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning

Today s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques

More information

Efficient Learning in Linearly Solvable MDP Models

Efficient Learning in Linearly Solvable MDP Models Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Efficient Learning in Linearly Solvable MDP Models Ang Li Department of Computer Science, University of Minnesota

More information

Solving Stochastic Planning Problems With Large State and Action Spaces

Solving Stochastic Planning Problems With Large State and Action Spaces Solving Stochastic Planning Problems With Large State and Action Spaces Thomas Dean, Robert Givan, and Kee-Eung Kim Thomas Dean and Kee-Eung Kim Robert Givan Department of Computer Science Department of

More information

Probabilistic inference for computing optimal policies in MDPs

Probabilistic inference for computing optimal policies in MDPs Probabilistic inference for computing optimal policies in MDPs Marc Toussaint Amos Storkey School of Informatics, University of Edinburgh Edinburgh EH1 2QL, Scotland, UK mtoussai@inf.ed.ac.uk, amos@storkey.org

More information

Control Theory : Course Summary

Control Theory : Course Summary Control Theory : Course Summary Author: Joshua Volkmann Abstract There are a wide range of problems which involve making decisions over time in the face of uncertainty. Control theory draws from the fields

More information

Tractable Inference for Complex Stochastic Processes

Tractable Inference for Complex Stochastic Processes In Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI98, pages 4, Madison, Wisconsin, July 998 Tractable Inference for Complex Stochastic Processes Xavier Boyen

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

Abstract. Three Methods and Their Limitations. N-1 Experiments Suffice to Determine the Causal Relations Among N Variables

Abstract. Three Methods and Their Limitations. N-1 Experiments Suffice to Determine the Causal Relations Among N Variables N-1 Experiments Suffice to Determine the Causal Relations Among N Variables Frederick Eberhardt Clark Glymour 1 Richard Scheines Carnegie Mellon University Abstract By combining experimental interventions

More information

Announcements. CS 188: Artificial Intelligence Fall Causality? Example: Traffic. Topology Limits Distributions. Example: Reverse Traffic

Announcements. CS 188: Artificial Intelligence Fall Causality? Example: Traffic. Topology Limits Distributions. Example: Reverse Traffic CS 188: Artificial Intelligence Fall 2008 Lecture 16: Bayes Nets III 10/23/2008 Announcements Midterms graded, up on glookup, back Tuesday W4 also graded, back in sections / box Past homeworks in return

More information

Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE

Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 55, NO. 9, SEPTEMBER 2010 1987 Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE Abstract

More information

Planning Under Uncertainty: Structural Assumptions and Computational Leverage

Planning Under Uncertainty: Structural Assumptions and Computational Leverage Planning Under Uncertainty: Structural Assumptions and Computational Leverage Craig Boutilier Dept. of Comp. Science Univ. of British Columbia Vancouver, BC V6T 1Z4 Tel. (604) 822-4632 Fax. (604) 822-5485

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

Computer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms

Computer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms Computer Science 385 Analysis of Algorithms Siena College Spring 2011 Topic Notes: Limitations of Algorithms We conclude with a discussion of the limitations of the power of algorithms. That is, what kinds

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon

Optimal Control of Partiality Observable Markov. Processes over a Finite Horizon Optimal Control of Partiality Observable Markov Processes over a Finite Horizon Report by Jalal Arabneydi 04/11/2012 Taken from Control of Partiality Observable Markov Processes over a finite Horizon by

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

CS 188: Artificial Intelligence Spring Announcements

CS 188: Artificial Intelligence Spring Announcements CS 188: Artificial Intelligence Spring 2011 Lecture 18: HMMs and Particle Filtering 4/4/2011 Pieter Abbeel --- UC Berkeley Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew Moore

More information

Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning

Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning JMLR: Workshop and Conference Proceedings vol:1 8, 2012 10th European Workshop on Reinforcement Learning Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning Michael

More information

Sampling Algorithms for Probabilistic Graphical models

Sampling Algorithms for Probabilistic Graphical models Sampling Algorithms for Probabilistic Graphical models Vibhav Gogate University of Washington References: Chapter 12 of Probabilistic Graphical models: Principles and Techniques by Daphne Koller and Nir

More information

1 Problem Formulation

1 Problem Formulation Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

Approximate Planning in Large POMDPs via Reusable Trajectories

Approximate Planning in Large POMDPs via Reusable Trajectories Approximate Planning in Large POMDPs via Reusable Trajectories Michael Kearns AT&T Labs mkearns@research.att.com Yishay Mansour Tel Aviv University mansour@math.tau.ac.il Andrew Y. Ng UC Berkeley ang@cs.berkeley.edu

More information

CPSC 467: Cryptography and Computer Security

CPSC 467: Cryptography and Computer Security CPSC 467: Cryptography and Computer Security Michael J. Fischer Lecture 14 October 16, 2013 CPSC 467, Lecture 14 1/45 Message Digest / Cryptographic Hash Functions Hash Function Constructions Extending

More information

Some AI Planning Problems

Some AI Planning Problems Course Logistics CS533: Intelligent Agents and Decision Making M, W, F: 1:00 1:50 Instructor: Alan Fern (KEC2071) Office hours: by appointment (see me after class or send email) Emailing me: include CS533

More information

Announcements. CS 188: Artificial Intelligence Fall Markov Models. Example: Markov Chain. Mini-Forward Algorithm. Example

Announcements. CS 188: Artificial Intelligence Fall Markov Models. Example: Markov Chain. Mini-Forward Algorithm. Example CS 88: Artificial Intelligence Fall 29 Lecture 9: Hidden Markov Models /3/29 Announcements Written 3 is up! Due on /2 (i.e. under two weeks) Project 4 up very soon! Due on /9 (i.e. a little over two weeks)

More information

Symmetry Breaking for Relational Weighted Model Finding

Symmetry Breaking for Relational Weighted Model Finding Symmetry Breaking for Relational Weighted Model Finding Tim Kopp Parag Singla Henry Kautz WL4AI July 27, 2015 Outline Weighted logics. Symmetry-breaking in SAT. Symmetry-breaking for weighted logics. Weighted

More information

A Nonlinear Predictive State Representation

A Nonlinear Predictive State Representation Draft: Please do not distribute A Nonlinear Predictive State Representation Matthew R. Rudary and Satinder Singh Computer Science and Engineering University of Michigan Ann Arbor, MI 48109 {mrudary,baveja}@umich.edu

More information

Lecture 1: March 7, 2018

Lecture 1: March 7, 2018 Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights

More information

Introduction to Artificial Intelligence (AI)

Introduction to Artificial Intelligence (AI) Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 9 Oct, 11, 2011 Slide credit Approx. Inference : S. Thrun, P, Norvig, D. Klein CPSC 502, Lecture 9 Slide 1 Today Oct 11 Bayesian

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and

More information

Probabilistic Reasoning. (Mostly using Bayesian Networks)

Probabilistic Reasoning. (Mostly using Bayesian Networks) Probabilistic Reasoning (Mostly using Bayesian Networks) Introduction: Why probabilistic reasoning? The world is not deterministic. (Usually because information is limited.) Ways of coping with uncertainty

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value

More information

Experts in a Markov Decision Process

Experts in a Markov Decision Process University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2004 Experts in a Markov Decision Process Eyal Even-Dar Sham Kakade University of Pennsylvania Yishay Mansour Follow

More information

CS Lecture 3. More Bayesian Networks

CS Lecture 3. More Bayesian Networks CS 6347 Lecture 3 More Bayesian Networks Recap Last time: Complexity challenges Representing distributions Computing probabilities/doing inference Introduction to Bayesian networks Today: D-separation,

More information

Planning With Information States: A Survey Term Project for cs397sml Spring 2002

Planning With Information States: A Survey Term Project for cs397sml Spring 2002 Planning With Information States: A Survey Term Project for cs397sml Spring 2002 Jason O Kane jokane@uiuc.edu April 18, 2003 1 Introduction Classical planning generally depends on the assumption that the

More information

Equivalence Notions and Model Minimization in Markov Decision Processes

Equivalence Notions and Model Minimization in Markov Decision Processes Equivalence Notions and Model Minimization in Markov Decision Processes Robert Givan, Thomas Dean, and Matthew Greig Robert Givan and Matthew Greig School of Electrical and Computer Engineering Purdue

More information

CS599 Lecture 1 Introduction To RL

CS599 Lecture 1 Introduction To RL CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming

More information

Heuristic Search Algorithms

Heuristic Search Algorithms CHAPTER 4 Heuristic Search Algorithms 59 4.1 HEURISTIC SEARCH AND SSP MDPS The methods we explored in the previous chapter have a serious practical drawback the amount of memory they require is proportional

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

Using learning for approximation in stochastic processes

Using learning for approximation in stochastic processes In Proceedings of the Fifteenth International Conference on Machine Learning (ICML-98), pages 287-295, Madison, Wisconsin, July 1998 Using learning for approximation in stochastic processes Daphne Koller

More information

STATISTICAL METHODS IN AI/ML Vibhav Gogate The University of Texas at Dallas. Bayesian networks: Representation

STATISTICAL METHODS IN AI/ML Vibhav Gogate The University of Texas at Dallas. Bayesian networks: Representation STATISTICAL METHODS IN AI/ML Vibhav Gogate The University of Texas at Dallas Bayesian networks: Representation Motivation Explicit representation of the joint distribution is unmanageable Computationally:

More information

Price of Stability in Survivable Network Design

Price of Stability in Survivable Network Design Noname manuscript No. (will be inserted by the editor) Price of Stability in Survivable Network Design Elliot Anshelevich Bugra Caskurlu Received: January 2010 / Accepted: Abstract We study the survivable

More information