Generalizing Plans to New Environments in Relational MDPs


In International Joint Conference on Artificial Intelligence (IJCAI-03), Acapulco, Mexico, August 2003.

Carlos Guestrin  Daphne Koller  Chris Gearhart  Neal Kanodia
Computer Science Department, Stanford University
{guestrin, koller, cg33,

Abstract

A longstanding goal in planning research is the ability to generalize plans developed for some set of environments to a new but similar environment, with minimal or no replanning. Such generalization can both reduce planning time and allow us to tackle larger domains than the ones tractable for direct planning. In this paper, we present an approach to the generalization problem based on a new framework of relational Markov Decision Processes (RMDPs). An RMDP can model a set of similar environments by representing objects as instances of different classes. In order to generalize plans to multiple environments, we define an approximate value function specified in terms of classes of objects and, in a multiagent setting, by classes of agents. This class-based approximate value function is optimized relative to a sampled subset of environments, and computed using an efficient linear programming method. We prove that a polynomial number of sampled environments suffices to achieve performance close to the performance achievable when optimizing over the entire space. Our experimental results show that our method generalizes plans successfully to new, significantly larger, environments, with minimal loss of performance relative to environment-specific planning. We demonstrate our approach on a real strategic computer war game.

1 Introduction

Most planning methods optimize the plan of an agent in a fixed environment. However, in many real-world settings, an agent will face multiple environments over its lifetime, and its experience with one environment should help it to perform well in another, even with minimal or no replanning.
Consider, for example, an agent designed to play a strategic computer war game, such as the Freecraft game shown in Fig. 1 (an open source version of the popular Warcraft game). In this game, the agent is faced with many scenarios. In each scenario, it must control a set of agents (or units) with different skills in order to defeat an opponent. Most scenarios share the same basic elements: resources, such as gold and wood; units, such as peasants, who collect resources and build structures, and footmen, who fight with enemy units; and structures, such as barracks, which are used to train footmen. Each scenario is composed of these same basic building blocks, but they differ in terms of the map layout, types of units available, amounts of resources, etc. We would like the agent to learn from its experience with playing some scenarios, enabling it to tackle new scenarios without significant amounts of replanning. In particular, we would like the agent to generalize from simple scenarios, allowing it to deal with other scenarios that are too complex for any effective planner.

Figure 1: Freecraft strategic domain with 9 peasants, a barrack, a castle, a forest, a gold mine, 3 footmen, and an enemy, executing the generalized policy computed by our algorithm.

The idea of generalization has been a longstanding goal in Markov Decision Process (MDP) and reinforcement learning research [15; 16], and even earlier in traditional planning [5]. This problem is a challenging one, because it is often unclear how to translate the solution obtained for one domain to another. MDP solutions assign values and/or actions to states. Two different MDPs (e.g., two Freecraft scenarios) are typically quite different, in that they have a different set (and even number) of states and actions. In cases such as this, the mapping of one solution to another is not well-defined. Our approach is based on the insight that many domains can be described in terms of objects and the relations between them.
A particular domain will involve multiple objects from several classes. Different tasks in the same domain will typically involve different sets of objects, related to each other in different ways. For example, in Freecraft, different tasks might involve different numbers of peasants, footmen, enemies, etc. We therefore define a notion of a relational MDP (RMDP), based on the probabilistic relational model (PRM) framework [10]. An RMDP for a particular domain provides a general schema for an entire suite of environments, or worlds, in that domain. It specifies a set of classes, and how the dynamics and rewards of an object in a given class depend on the state of that object and of related objects.

We use the class structure of the RMDP to define a value function that can be generalized from one domain to another. We begin with the assumption that the value function can be well-approximated as a sum of value subfunctions for the different objects in the domain. Thus, the value of a global Freecraft state is approximated as a sum of terms corresponding to the state of individual peasants, footmen, gold, etc. We then assume that individual objects in the same class have a very similar value function. Thus, we define the notion of a class-based value function, where each class is associated with a class subfunction. All objects in the same class have the value subfunction of their class, and the overall value function for a particular environment is the sum of value subfunctions for the individual objects in the domain. A set of value subfunctions for the different classes immediately determines a value function for any new environment in the domain, and can be used for acting. Thus, we can compute a set of class subfunctions based on a subset of environments, and apply them to another one without replanning.

We provide an optimality criterion for evaluating a class-based value function for a distribution over environments, and show how it can, in principle, be optimized using a linear program. We can also learn a value function by optimizing it relative to a sample of environments encountered by the agent. We prove that a polynomial number of sampled environments suffices to construct a class-based value function which is close to the one obtainable for the entire distribution over environments. Finally, we show how we can improve the quality of our approximation by automatically discovering subclasses of objects that have similar value functions.

We present experiments for a computer systems administration task and two Freecraft tasks. Our results show that we can successfully generalize class-based value functions. Importantly, our approach also obtains effective policies for problems significantly larger than our planning algorithm could handle otherwise.

2 Relational Markov Decision Processes

A relational MDP defines the system dynamics and rewards at the level of a template for a task domain. Given a particular environment within that domain, it defines a specific MDP instantiated for that environment. As in the PRM framework of [10], the domain is defined via a schema, which specifies a set of object classes C = {C_1, ..., C_c}. Each class C is also associated with a set of state variables S[C] = {C.S_1, ..., C.S_k}, which describe the state of an object in that class. Each state variable C.S has a domain of possible values Dom[C.S]. We define S_C to be the set of possible states for an object in C, i.e., the possible assignments to the state variables of C.
For example, our Freecraft domain might have classes such as Peasant, Footman, Gold; the class Peasant may have a state variable Task whose domain is Dom[Peasant.Task] = {Waiting, Mining, Harvesting, Building}, and a state variable Health whose domain has three values. In this case, S_Peasant would have 4 × 3 = 12 values, one for each combination of values for Task and Health.

The schema also specifies a set of links L[C] = {L_1, ..., L_l} for each class, representing links between objects in the domain. Each link C.L has a range ρ[C.L] = C′. For example, Peasant objects might be linked to Barrack objects (ρ[Peasant.BuildTarget] = Barrack), and to the global Gold and Wood resource objects. In a more complex situation, a link may relate C to many instances of a class C′, which we denote by ρ[C.L] = {C′}; for example, ρ[Enemy.My_Footmen] = {Footman} indicates that an instance of the enemy class may be related to many footman instances.

A particular instance of the schema is defined via a world ω, specifying the set of objects of each class; we use O[ω][C] to denote the objects in class C, and O[ω] to denote the total set of objects in ω. The world ω also specifies the links between objects, which we take to be fixed throughout time. Thus, for each link C.L, and for each o ∈ O[ω][C], ω specifies a set of objects o′ ∈ ρ[C.L], denoted o.L. For example, in a world containing two peasants, we would have O[ω][Peasant] = {Peasant1, Peasant2}; if Peasant1 is building a barracks, we would have that Peasant1.BuildTarget = Barrack1.

The dynamics and rewards of an RMDP are also defined at the schema level. For each class, the schema specifies an action C.A, which can take on one of several values Dom[C.A]. For example, Dom[Peasant.A] = {Wait, Mine, Harvest, Build}.
Each class C is also associated with a transition model P_C, which specifies the probability distribution over the next state of an object o in class C, given the current state of o, the action taken on o, and the states and actions of all of the objects linked to o:

P_C(S_C′ | S_C, C.A, S_{C.L_1}, C.L_1.A, ..., S_{C.L_l}, C.L_l.A).    (1)

For example, the status of a barrack, Barrack.Status, depends on its status in the previous time step, on the task performed by any peasant that could build it (Barrack.BuiltBy.Task), on the amount of wood and gold, etc. The transition model is conditioned on the state of C.L_i, which is, in general, an entire set of objects (e.g., the set of peasants linked to a barrack). Thus we must now provide a compact specification of the transition model that can depend on the state of an unbounded number of variables. We can deal with this issue using the idea of aggregation [10]. In Freecraft, our model uses the count aggregator, where the probability that Barrack.Status transitions from Unbuilt to Built depends on #[Barrack.BuiltBy.Task = Build], the number of peasants in Barrack.BuiltBy whose Task is Build.

Finally, we also define rewards at the class level. We assume for simplicity that rewards are associated only with the states of individual objects; adding more global dependencies is possible, but complicates planning significantly. We define a reward function R_C(S_C, C.A), which represents the contribution to the reward of any object in C. For example, we may have a reward function associated with the Enemy class, which specifies a reward of 10 if the state of an enemy object is Dead: R_Enemy(Enemy.State = Dead) = 10. We assume that the reward for each object is bounded by R_max.

Given a world, the RMDP uniquely defines a ground factored MDP Π_ω, whose transition model is specified (as usual) as a dynamic Bayesian network (DBN) [3]. The random variables in this factored MDP are the state variables of the individual objects o.S, for each o ∈ O[ω][C] and for each S ∈ S[C].
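To make the schema and the count aggregator concrete, the following is a minimal sketch of how they might be represented in code. All class names, domains, and probabilities here are illustrative inventions, not the paper's implementation; in particular, `p_barrack_built` and its `base` success probability are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ClassSchema:
    """Schema-level description of one object class: state variables
    with finite domains, plus links to other classes."""
    name: str
    state_vars: dict                               # variable name -> list of values
    links: dict = field(default_factory=dict)      # link name -> target class name

# Hypothetical Freecraft-style classes.
peasant = ClassSchema(
    "Peasant",
    state_vars={"Task": ["Waiting", "Mining", "Harvesting", "Building"],
                "Health": ["Healthy", "Wounded", "Dead"]},
    links={"BuildTarget": "Barrack"},
)
barrack = ClassSchema(
    "Barrack",
    state_vars={"Status": ["Unbuilt", "Built"]},
    links={"BuiltBy": "Peasant"},                  # set-valued in a ground world
)

def p_barrack_built(status, builder_tasks, base=0.2):
    """Count-aggregator transition template for Barrack.Status: the
    probability of becoming Built depends only on the aggregate
    #[Barrack.BuiltBy.Task = Building], not on which peasant builds."""
    if status == "Built":
        return 1.0                                 # stays built
    n = sum(1 for t in builder_tasks if t == "Building")
    return 1.0 - (1.0 - base) ** n                 # each builder succeeds independently

# The same template is shared by every barrack object; only the set of
# linked peasants (and hence the aggregated count) differs per object.
print(len(peasant.state_vars["Task"]) * len(peasant.state_vars["Health"]))  # 12, as in S_Peasant
print(p_barrack_built("Unbuilt", ["Building", "Building", "Mining"]))
```

The key design point mirrors the text: the transition template is defined once per class and conditions on an aggregate statistic of the linked objects, so it remains well-defined no matter how many peasants a given barrack is linked to.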
Thus, the state s of the system at a given point in time is a vector defining the states of the individual objects in the world. For any subset of variables X in the model, we define s[X] to be the part of the instantiation s that corresponds to the variables X.

The ground DBN for the transition dynamics specifies the dependence of the variables at time t + 1 on the variables at time t. The parents of a variable o.S are the state variables of the objects o′ that are linked to o. In our example with the two peasants, we might have the random variables Peasant1.Task, Peasant2.Task, Barrack1.Status, etc. The parents of the time t + 1 variable Barrack1.Status are the time t variables Barrack1.Status, Peasant1.Task, Peasant2.Task, Gold1.Amount and Wood1.Amount. The transition model is the same for all instances in the same class, as in (1). Thus, all of the o.Status variables for barrack objects o share the same conditional probability distribution. Note, however, that each specific barrack depends on the particular peasants linked to it. Thus, the actual parents in the DBN of the status variables for two different barrack objects can be different.

Figure 2: Freecraft tactical domain: (a) Schema; (b) Resulting factored MDP for a world with 2 footmen and 2 enemies.

The reward function is simply the sum of the reward functions for the individual objects:

R(s, a) = Σ_{C ∈ C} Σ_{o ∈ O[ω][C]} R_C(s[S_o], a[o.A]).

Thus, for the reward function for the Enemy class described above, our overall reward function in a given state will be 10 times the number of dead enemies in that state.

It remains to specify the actions in the ground MDP. The RMDP specifies a set of possible actions for every object in the world. In a setting where only a single action can be taken at any time step, the agent must choose both an object to act on, and which action to perform on that object. Here, the set of actions in the ground MDP is simply the union ∪_{o ∈ ω} Dom[o.A]. In a setting where multiple actions can be performed in parallel (say, in a multiagent setting), it might be possible to perform an action on every object in the domain at every step. Here, the set of actions in the ground MDP is a vector specifying an action for every object: ×_{o ∈ ω} Dom[o.A]. Intermediate cases, allowing degrees of parallelism, are also possible. For simplicity of presentation, we focus on the multiagent case, such as Freecraft, where an action is an assignment to the action of every unit.

Example 2.1 (Freecraft tactical domain) Consider a simplified version of Freecraft, whose schema is illustrated in Fig. 2(a), where only two classes of units participate in the game: C = {Footman, Enemy}.
Both the footman and the enemy classes have only one state variable each, Health, with domain Dom[Health] = {Healthy, Wounded, Dead}. The footman class contains one single-valued link: ρ[Footman.My_Enemy] = Enemy. Thus the transition model for a footman's health will depend on the health of its enemy: P_Footman(S_Footman′ | S_Footman, S_{Footman.My_Enemy}); i.e., if a footman's enemy is not dead, then the probability that the footman will become wounded, or die, is significantly higher. A footman can choose to attack any enemy. Thus each footman is associated with an action Footman.A which selects the enemy it is attacking.¹ As a consequence, an enemy could end up being linked to a set of footmen, ρ[Enemy.My_Footmen] = {Footman}. In this case, the transition model of the health of an enemy may depend on the number of footmen who are not dead and whose action choice is to attack this enemy: P_Enemy(S_Enemy′ | S_Enemy, #[S_{Enemy.My_Footmen}, Enemy.My_Footmen.A]). Finally, we must define the template for the reward function. Here there is only a reward when an enemy is dead: R_Enemy(S_Enemy).

We now have a template to describe any instance of the tactical Freecraft domain. In a particular world, we must define the instances of each class and the links between these instances. For example, a world with 2 footmen and 2 enemies will have 4 objects: {Footman1, Footman2, Enemy1, Enemy2}. Each footman will be linked to an enemy: Footman1.My_Enemy = Enemy1 and Footman2.My_Enemy = Enemy2. Each enemy will be linked to both footmen: Enemy1.My_Footmen = Enemy2.My_Footmen = {Footman1, Footman2}. The template, along with the number of objects and the links in this specific ("2 vs 2") world, yield a well-defined factored MDP, Π_2vs2, as shown in Fig. 2(b).

¹ A model where an action can change the link structure in the world requires a small extension of our basic representation. We omit details due to lack of space.

3 Approximately Solving Relational MDPs

There are many approaches to solving MDPs [15]. An effective one is based on linear programming (LP): Let S(Π) denote the states in an MDP Π and A(Π) the actions. If S(Π) = {s_1, ..., s_N}, our LP variables are V_1, ..., V_N, where V_i represents V(s_i), the value of state s_i. The LP formulation is:

Minimize: Σ_i α(s_i) V_i ;
Subject to: V_i ≥ R(s_i, a) + γ Σ_k P(s_k | s_i, a) V_k,  ∀ s_i ∈ S(Π), a ∈ A(Π).

The state relevance weights α(s_1), ..., α(s_N) in the objective function are any set of positive weights, α(s_i) > 0. In our setting, the state space is exponentially large, with one state for each joint assignment to the random variables o.S of every object (e.g., exponential in the number of units in the Freecraft scenario). In a multiagent problem, the number of actions is also exponential in the number of agents. Thus this LP has both an exponential number of variables and an exponential number of constraints. Therefore, solving this linear program exactly is infeasible.

We address this issue using the assumption that the value function can be well-approximated as a sum of local value subfunctions associated with the individual objects in the model. (This approximation is a special case of the factored linear value function approach used in [6].) Thus we associate a value subfunction V_o with every object in ω. Most simply, this local value function can depend only on the state of the individual object, S_o. In our example, the local value subfunction V_Enemy1 for enemy object Enemy1 might associate a numeric value with each assignment to the variable Enemy1.Health. A richer approximation might associate a value function with pairs, or even small subsets, of closely related objects. Thus, the local value subfunction V_Footman1 for Footman1 might be defined over the joint assignments of Footman1.Health and Enemy1.Health, where Footman1.My_Enemy = Enemy1.

We will represent the complete value function for a world as the sum of the local value subfunctions for each individual object in this world. In our example world ω = 2vs2 with 2 footmen and 2 enemies, the global value function will be:

V_2vs2(F1.Health, E1.Health, F2.Health, E2.Health) = V_Footman1(F1.Health, E1.Health) + V_Enemy1(E1.Health) + V_Footman2(F2.Health, E2.Health) + V_Enemy2(E2.Health).

Let T_o be the scope of the value subfunction of object o, i.e., the state variables that V_o depends on. Given the local subfunctions, we approximate the global value function as:

V_ω(s) = Σ_{o ∈ O[ω]} V_o(s[T_o]).    (2)

As for any linear approximation to the value function, the LP approach can be adapted to use this value function representation [14]. Our LP variables are now the local components of the individual local value functions:

{V_o(t_o) : o ∈ O[ω], t_o ∈ Dom[T_o]}.    (3)

In our example, there will be one LP variable for each joint assignment of F1.Health and E1.Health to represent the components of V_Footman1. Similar LP variables will be included for the components of V_Footman2, V_Enemy1, and V_Enemy2. As before, we have a constraint for each global state s and each global action a:

Σ_o V_o(s[T_o]) ≥ Σ_o R_o(s[S_o], a[o.A]) + γ Σ_{s′} P_ω(s′ | s, a) Σ_o V_o(s′[T_o]);  ∀ s, a.    (4)

This transformation has the effect of reducing the number of free variables in the LP to n (the number of objects) times the number of parameters required to describe an object's local value function. However, we still have a constraint for each global state and action, an exponentially large number. Guestrin, Koller and Parr [6] (GKP hereafter) show that, in certain cases, this exponentially large LP can be solved efficiently and exactly.
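The exact LP above always has the optimal value function V* as its solution: V* satisfies every constraint V_i ≥ R(s_i, a) + γ Σ_k P(s_k | s_i, a) V_k, with equality at the greedy action, so no feasible point can be smaller under positive weights α. The following dependency-free sketch illustrates this on a tiny two-state MDP with invented rewards and transition probabilities, computing V* by value iteration (rather than an LP solver, which the paper uses) and checking LP feasibility and tightness.

```python
import itertools

# Tiny hypothetical MDP: 2 states, 2 actions (all numbers invented).
# P[s][a] = list of (next_state, prob); R[s][a] = reward.
P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
     1: {0: [(1, 1.0)],           1: [(0, 0.5), (1, 0.5)]}}
R = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.0, 1: 2.0}}
gamma = 0.9

def q(V, s, a):
    """One-step lookahead value R(s,a) + gamma * sum_k P(s_k|s,a) V_k."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])

# Value iteration to (near) convergence.
V = {0: 0.0, 1: 0.0}
for _ in range(2000):
    V = {s: max(q(V, s, a) for a in (0, 1)) for s in (0, 1)}

# V* is feasible for the LP constraints V(s) >= Q(V, s, a) for all s, a,
# and tight at the greedy action -- hence it is the LP optimum for any
# positive state relevance weights alpha.
for s, a in itertools.product((0, 1), (0, 1)):
    assert V[s] >= q(V, s, a) - 1e-6
for s in (0, 1):
    assert abs(V[s] - max(q(V, s, a) for a in (0, 1))) < 1e-6
print(V)
```

The approximate LP of (2)-(4) keeps exactly these constraints but restricts V to a sum of local subfunctions, which is what makes the number of free variables manageable.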
In particular, this compact solution applies when the MDP is factored (i.e., represented as a DBN), and the approximate value function is decomposed as a weighted linear combination of local basis functions, as above. Under these assumptions, GKP present a decomposition of the LP which grows exponentially only in the induced tree width of a graph determined by the complexity of the process dynamics and the locality of the basis functions. This approach applies very easily here. The structure of the DBN representing the process dynamics is highly factored, defined via local interactions between objects. Similarly, the value functions are local, involving only single objects or groups of closely related objects. Often, the induced width of the resulting graph in such problems is quite small, allowing the techniques of GKP to be applied efficiently.

4 Generalizing Value Functions

Although this approach provides us with a principled way of decomposing a high-dimensional value function in certain types of domains, it does not help us address the generalization problem: a local value function for objects in a world ω does not help us provide a value function for objects in other worlds, especially worlds with different sets of objects.

To obtain generalization, we build on the intuition that different objects in the same class behave similarly: they share the transition model and reward function. Although they differ in their interactions with other objects, their local contribution to the value function is often similar. For example, it may be reasonable to assume that different footmen have a similar long-term chance of killing enemies. Thus, we restrict our class of value functions by requiring that all of the objects in a given class share the same local value subfunction. Formally, we define a class-based local value subfunction V_C for each class. We assume that the parameterization of this value function is well-defined for every object o in C.
This assumption holds trivially if the scope of V_C is simply S_C: we simply have a parameter for each assignment in Dom[S_C]. When the local value function can also depend on the states of neighboring objects, we must define the parameterization accordingly; for example, we might have a parameter for each possible joint state of a linked footman-enemy pair. Specifically, rather than defining separate subfunctions V_Footman1 and V_Footman2, we define a class-based subfunction V_Footman. Now the contribution of Footman1 to the global value function will be V_Footman(F1.Health, E1.Health). Similarly, Footman2 will contribute V_Footman(F2.Health, E2.Health).

A class-based value function defines a specific value function for each world ω, as the sum of the class-based local value functions for the objects in ω:

V_ω(s) = Σ_{C ∈ C} Σ_{o ∈ O[ω][C]} V_C(s[T_o]).    (5)

This value function depends both on the set of objects in the world and (when local value functions can involve related objects) on the links between them. Importantly, although objects in the same class contribute the same function to the summation of (5), the argument of the function for an object is the state of that specific object (and perhaps its neighbors). In any given state, the contributions of different objects of the same class can differ. Thus, every footman has the same local value subfunction parameters, but a dead footman will have a lower contribution than one which is alive.

5 Finding Generalized MDP Solutions

With a class-level value function, we can easily generalize from one or more worlds to another one. To do so, we assume that a single set of local class-based value functions V_C is a good approximation across a wide range of worlds ω. Assuming we have such a set of value functions, we can act in any new world ω without replanning, as described in Step 3 of Fig. 3. We simply define a world-specific value function as in (5), and use it to act. We must now optimize V_C in a way that maximizes the value over an entire set of worlds.
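Evaluating Eq. (5) in a new world only requires the per-class tables and each object's scope, which is what makes acting without replanning possible. A minimal sketch follows; the table entries are invented for illustration, not learned parameters.

```python
# Sketch of the class-based value function of Eq. (5): one shared table
# per class, applied to each object's local scope T_o. Values invented.

# V_Footman has scope (own Health, linked enemy's Health);
# V_Enemy has scope (own Health).
V_footman = {(f, e): 10.0 * (f != "Dead") + 5.0 * (e != "Healthy")
             for f in ("Healthy", "Wounded", "Dead")
             for e in ("Healthy", "Wounded", "Dead")}
V_enemy = {"Healthy": 0.0, "Wounded": 4.0, "Dead": 10.0}

# A world lists its objects, their class, and the scope T_o of each;
# the links (which enemy a footman's subfunction reads) come from omega.
world_2vs2 = [
    ("Footman1", "Footman", ("Footman1", "Enemy1")),
    ("Footman2", "Footman", ("Footman2", "Enemy2")),
    ("Enemy1", "Enemy", ("Enemy1",)),
    ("Enemy2", "Enemy", ("Enemy2",)),
]

def value(world, state):
    """V_omega(s) = sum over classes C, objects o in C, of V_C(s[T_o])."""
    total = 0.0
    for _, cls, scope in world:
        key = tuple(state[x] for x in scope)
        total += V_footman[key] if cls == "Footman" else V_enemy[key[0]]
    return total

s = {"Footman1": "Healthy", "Footman2": "Dead",
     "Enemy1": "Wounded", "Enemy2": "Healthy"}
print(value(world_2vs2, s))   # 15.0 + 0.0 + 4.0 + 0.0 = 19.0
```

Note how the sketch reflects the point made above: both footmen share the same table, yet the dead Footman2 contributes 0 while Footman1 contributes 15, because each object's contribution is evaluated at its own local state.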
To formalize this intuition, we assume that there is a probability distribution P(ω) over the worlds that the agent encounters. We want to find a single set of class-based local value functions {V_C}_{C ∈ C} that is a good fit for this distribution over worlds. We view this task as one of optimizing for a single meta-level MDP Π, where nature first chooses a world ω, and the rest of the dynamics are then determined by the MDP Π_ω. Precisely, the state space of Π is {s_0} ∪ ⋃_ω {(ω, s) : s ∈ S(Π_ω)}. The transition model is the obvious one: from the initial state s_0, nature chooses a world ω according to P(ω), and an initial state in ω according to the initial starting distribution P⁰_ω(s) over the states in ω. The remaining evolution is then done according to ω's dynamics. In our example, nature will choose the number of footmen and enemies, and define the links between them, which then yields a well-defined MDP, e.g., Π_2vs2.

5.1 LP Formulation

The meta-MDP Π allows us to formalize the task of finding a generalized solution to an entire class of MDPs. Specifically, we wish to optimize the class-level parameters for V_C, not for a single ground MDP Π_ω, but for the entire Π. We can address this problem using a similar LP solution to the one we used for a single world in Sec. 3. The variables are simply the parameters of the local class-level value subfunctions {V_C(t_C) : C ∈ C, t_C ∈ Dom[T_C]}. For the constraints, recall that our object-based LP formulation in (4) had a constraint for each state s and each action vector a = {a_o}_{o ∈ O[ω]}. In the generalized solution, the state space is the union of the state spaces of all possible worlds. Our constraint set for Π will, therefore, be a union of constraint sets, one for each world ω, each with its own actions:

V_ω(s) ≥ Σ_o R_o(s[S_o], a_o) + γ Σ_{s′} P_ω(s′ | s, a) V_ω(s′);  ∀ω, ∀s ∈ S(Π_ω), ∀a ∈ A(Π_ω);    (6)

where the value function for a world, V_ω(s), is defined at the class level as in Eq. (5). In principle, we should have an additional constraint for the state s_0. However, with a natural choice of state relevance weights α, this constraint is eliminated and the objective function becomes:

Minimize: (1 + γ) Σ_ω Σ_{s ∈ S_ω} P(ω) P⁰_ω(s) V_ω(s);    (7)

provided that P⁰_ω(s) > 0 for all s. In some models, the potential number of objects may be infinite, which could make the objective function unbounded.
To prevent this problem, we assume that P(ω) goes to zero sufficiently fast as the number of objects tends to infinity. To understand this assumption, consider the following generative process for selecting worlds: first, the number of objects n is chosen according to P(n); then, the classes and links of each object are chosen according to P(ω | n). Using this decomposition, we have that P(ω) = P(n) P(ω | n). The intuitive assumption described above can be formalized as: ∀n, P(n) ≤ κ e^{−λn}, for some κ, λ > 0. Thus, the distribution P(n) over the number of objects can be chosen arbitrarily, as long as it is bounded by some exponentially decaying function.

5.2 Sampling worlds

The main problem with this formulation is that the size of the LP (the size of the objective and the number of constraints) grows with the number of worlds, which, in most situations, grows exponentially with the number of possible objects, or may even be infinite. A practical approach to address this problem is to sample some reasonable number of worlds from the distribution P(ω), and then to solve the LP for these worlds only. The resulting class-based value function can then be used for worlds that were not sampled.

We will start by sampling a set D of worlds according to P(ω). We can now define our LP in terms of the worlds in D, rather than all possible worlds. For each world ω in D, our LP will contain a set of constraints of the form presented in Eq. (4). Note that in all worlds these constraints share the variables V_C, which represent our class-based value function. The complete LP is given by:

Variables: {V_C(t_C) : C ∈ C, t_C ∈ Dom[T_C]}.
Minimize: (1 + γ) Σ_{ω ∈ D} Σ_{C ∈ C} Σ_{o ∈ O[ω][C]} Σ_{t_o ∈ Dom[T_o]} P⁰_ω(t_o) V_C(t_o).
Subject to: Σ_{C ∈ C} Σ_{o ∈ O[ω][C]} V_C(s[T_o]) ≥ Σ_{C ∈ C} Σ_{o ∈ O[ω][C]} R_C(s[S_o], a[o.A]) + γ Σ_{s′} P_ω(s′ | s, a) Σ_{C ∈ C} Σ_{o ∈ O[ω][C]} V_C(s′[T_o]);
∀ω ∈ D, ∀s ∈ S(Π_ω), ∀a ∈ A(Π_ω);    (8)

where P⁰_ω(t_o) is the marginalization of P⁰_ω to the variables in T_o. For each world, the constraints have the same form as the ones in Sec. 3.
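The generative process P(ω) = P(n) P(ω | n) can be sketched directly. Below, a geometric distribution plays the role of P(n), since it satisfies the required bound P(n) ≤ κ e^{−λn}, and P(ω | n) is an invented toy distribution over class assignments and links in the style of the tactical domain; none of this structure is prescribed by the paper.

```python
import random

# Sketch of the generative world distribution P(omega) = P(n) P(omega | n).

def sample_num_objects(rng, q=0.5):
    """Geometric P(n) = q * (1 - q)^(n - 1), n = 1, 2, ...; this decays
    exponentially, satisfying P(n) <= kappa * e^(-lambda * n)."""
    n = 1
    while rng.random() >= q:
        n += 1
    return n

def sample_world(rng):
    """P(omega | n): assign each object a class; link each footman to a
    uniformly chosen enemy (if any exist). Purely illustrative."""
    n = sample_num_objects(rng)
    classes = [rng.choice(["Footman", "Enemy"]) for _ in range(n)]
    enemies = [i for i, c in enumerate(classes) if c == "Enemy"]
    links = {i: rng.choice(enemies)
             for i, c in enumerate(classes) if c == "Footman" and enemies}
    return {"classes": classes, "links": links}

rng = random.Random(0)
D = [sample_world(rng) for _ in range(30)]        # the sampled training set D
sizes = [len(w["classes"]) for w in D]
print(max(sizes), sum(sizes) / len(sizes))       # small worlds dominate
```

Each sampled world would then contribute one block of constraints of the form (4) to the LP in (8), all sharing the class-level variables V_C.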
Thus, once we have sampled worlds, we can apply the same LP decomposition techniques of GKP to each world to solve this LP efficiently. Our generalization algorithm is summarized in Step 2 of Fig. 3.

The solution obtained by the LP with sampled worlds will, in general, not be equal to the one obtained if all worlds are considered simultaneously. However, we can show that the quality of the two approximations is close, if a sufficient number of worlds is sampled. Specifically, with a polynomial number of sampled worlds, we can guarantee that, with high probability, the quality of the value function approximation obtained when sampling worlds is close to the one obtained when considering all possible worlds.

Theorem 5.1 Consider the following class-based value functions (each with k parameters): V̂, obtained from the LP over all possible worlds, by minimizing Eq. (7) subject to the constraints in Eq. (6); Ṽ, obtained from the LP with the sampled worlds in (8); and V*, the optimal value function of the meta-MDP Π. For a number of sampled worlds polynomial in 1/ε, ln(1/δ), 1/(1−γ), k, λ, and 1/κ, the error is bounded by:

‖V* − Ṽ‖_{1,P_Ω} ≤ ‖V* − V̂‖_{1,P_Ω} + ε R_max ;

with probability at least 1 − δ, for any δ > 0 and ε > 0; where ‖V‖_{1,P_Ω} = Σ_{ω, s ∈ S_ω} P(ω) P⁰_ω(s) |V_ω(s)|, and R_max is the maximum per-object reward.

Proof: See Appendix A.

The proof uses some of the techniques developed by de Farias and Van Roy [2] for analyzing constraint sampling in general MDPs. However, there are two important differences. First, our analysis includes the error introduced when sampling the objective, which in our case is a sum only over a subset of the worlds, rather than over all of them as in the LP for the full meta-MDP. This issue was not previously addressed. Second, the algorithm of de Farias and Van Roy relies on the assumption that constraints are sampled according to some ideal distribution (the stationary distribution of the optimal policy). Unfortunately, sampling from this distribution is as difficult as computing a near-optimal policy. In our analysis, after each world is sampled, our algorithm exploits the factored structure in the model to represent the constraints exactly, avoiding the dependency on the ideal distribution.

6 Learning Classes of Objects

The definition of a class-based value function assumes that all objects in a class have the same local value function. In many cases, even objects in the same class might play different roles in the model, and therefore have a different impact on the overall value. For example, if only one peasant has the capability to build barracks, his status may have a greater impact. Distinctions of this type are not usually known in advance, but are learned by an agent as it gains experience with a domain and detects regularities. We propose a procedure that takes exactly this approach: assume that we have been presented with a set D of worlds ω. For each world ω, an approximate value function V_ω = Σ_{o ∈ O[ω]} V_o was computed as described in Sec. 3. In addition, each object is associated with a set of features F_ω[o]. For example, the features may include local information, such as whether the object is a peasant linked to a barrack or not, as well as global information, such as whether this world contains archers in addition to footmen. We can define our training data D as {⟨F_ω[o], V_o⟩ : o ∈ O[ω], ω ∈ D}.

We now have a well-defined learning problem: given this training data, we would like to partition the objects into classes, such that objects of the same class have similar value functions. There are many approaches for tackling such a task. We choose to use decision tree regression, so as to construct a tree that predicts the local value function parameters given the features.
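A minimal sketch of this regression step follows: a single CART-style split rather than a full tree, choosing the binary feature whose two leaves best fit their training values in squared error. The data, feature names, and value parameters are invented for illustration.

```python
# One-level sketch of subclass discovery: pick the single binary feature
# that minimizes the squared error of the two leaf means. The paper grows
# a full regression tree; leaves then define the subclasses.

def best_stump(data):
    """data: list of (features dict, value parameter).
    Returns (split feature, {feature value: leaf mean})."""
    best = None
    for f in data[0][0]:
        left = [v for x, v in data if not x[f]]
        right = [v for x, v in data if x[f]]
        if not left or not right:
            continue                       # useless split: one empty leaf
        m_l, m_r = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - m_l) ** 2 for v in left)
               + sum((v - m_r) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, f, {False: m_l, True: m_r})
    _, f, leaves = best
    return f, leaves

# Hypothetical training data: per-object value parameters with features
# like those in the text (local link structure, global world properties).
data = [({"linked_to_barrack": True, "world_has_archers": False}, 9.8),
        ({"linked_to_barrack": True, "world_has_archers": True}, 10.1),
        ({"linked_to_barrack": False, "world_has_archers": False}, 2.1),
        ({"linked_to_barrack": False, "world_has_archers": True}, 1.9)]

feature, leaves = best_stump(data)
print(feature, leaves)   # splits on linked_to_barrack; leaves become subclasses
```

As in the full procedure, the leaf means here only identify which objects belong together; the actual subclass value parameters would then be re-optimized by the class-based LP.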
Thus, each split in the tree corresponds to a feature in F_ω[o]; each branch down the tree defines a subset of local value functions in D whose feature values are as defined by the path; the leaf at the end of the path is the average value function for this set. As the regression tree learning algorithm tries to construct a tree which is predictive about the local value function, it will aim to construct a tree where the mean at each leaf is very close to the training data assigned to that leaf. Thus, the leaves tend to correspond to objects whose local value functions are similar. We can thus take the leaves in the tree to define our subclasses, where each subclass is characterized by the combination of feature values specified by the path to the corresponding leaf. This algorithm is summarized in Step 1 of Fig. 3. Note that the mean subfunction at a leaf is not used as the value subfunction for the corresponding class; rather, the parameters of the value subfunction are optimized using the class-based LP in Step 2 of the algorithm.

1. Learning Subclasses:
   Input: A set of training worlds D. A set of features F_ω[o].
   Algorithm:
   (a) For each ω ∈ D, compute an object-based value function, as described in Sec. 3.
   (b) Apply regression tree learning on {⟨F_ω[o], V_o⟩ : o ∈ O[ω], ω ∈ D}.
   (c) Define a subclass for each leaf, characterized by the feature vector associated with its path.
2. Computing Class-Based Value Function:
   Input: A set of (sub)class definitions C. A template for {V_C : C ∈ C}. A set of training worlds D.
   Algorithm:
   (a) Compute the parameters for {V_C : C ∈ C} that optimize the LP in (8) relative to the worlds in D.
3. Acting in a New World:
   Input: A set of local value functions {V_C : C ∈ C}. A set of (sub)class definitions C. A world ω.
   Algorithm: Repeat
   (a) Obtain the current state s.
   (b) Determine the appropriate class C for each o ∈ O[ω] according to its features.
   (c) Define V_ω according to (5).
   (d) Use the coordination graph algorithm of GKP to compute an action a that maximizes R(s, a) + γ Σ_{s′} P(s′ | s, a) V_ω(s′).
   (e) Take action a in the world.

Figure 3: The overall generalization algorithm.

7 Experimental results

We evaluated our generalization algorithm on two domains: computer network administration and Freecraft.

7.1 Computer network administration

For this problem, we implemented our algorithm in Matlab, using CPLEX as the LP solver. Rather than using the full LP decomposition of GKP [6], we used the constraint generation extension proposed in [13], as the memory requirements were lower for this second approach. We experimented with the multiagent computer network examples in [6], using various network topologies and pair basis functions that involve states of neighboring machines (see [6]). In one of these problems, if we have n computers, then the underlying MDP has 9^n states and 2^n actions. However, the LP decomposition algorithm uses structure in the underlying factored model to solve such problems very efficiently [6].

We first tested the extent to which value functions are shared across objects. In Fig. 4(a), we plot the value each object gave to the assignment Status = working, for instances of the three legs topology. These values cluster into three classes. We used CART® to learn decision trees for our class partition. In this case, the learning algorithm partitioned the computers into three subclasses, illustrated in Fig. 4(b): server, intermediate, and leaf. In Fig. 4(a), we see that the server (third column) has the highest value, because a broken server can cause a chain reaction affecting the whole network, while the leaf value (first column) is lowest, as a leaf cannot affect any other computer.

We then evaluated the generalization quality of our class-based value function by comparing its performance to that of planning specifically for a new environment. For each topology, we computed the class-based value function with 5 sampled networks of up to 20 computers. We then sampled a

7 L L L Nuber of objects Value function paraeter value Interediate eaf eaf Interediate Server Interediate eaf Estiated policy value per agent Class-based value function 'Optial' approxiate value function Utopic expected axiu value Ring Star Three legs Max-nor error of value function No class learning Learnt classes Ring Star Three legs a) b) c) d) e) Figure 4: Network adinistrator results: a) Training data for learning classes; b) Classes learned for three legs ; c) Generalization quality evaluated by 0 Monte Carlo runs of 100 steps); d) Advantage of learning subclasses. Tactical Freecraft: e) 3 footen against 3 eneies. new network and coputed for it a value function that used the sae factorization, but with no class restrictions. This value function has ore paraeters different paraeters for each object, rather than for entire classes, which are optiized for this particular network. This process was repeated for 8 sets of networks. The results, shown in Fig. 4c), indicate that the value of the policy fro the class-based value function is very close to the value of replanning, suggesting that we can generalize well to new probles. We also coputed a utopic upper bound on the expected value of the optial policy by reoving the negative) effect of the neighbors on the status of the achines. Although this bound is loose, our approxiate policies still achieve a value close to it. Next, we wanted to deterine if our procedure for learning classes yields better approxiations than the ones obtained fro the default classes. Fig. 4d) copares the axnor error between our class-based value function and the one obtained by replanning. The graph suggests that, by learning classes using our decision trees regression tree procedure, we obtain a uch better approxiation of the value function we would have, had we replanned. 7. Freecraft In order to evaluate our algorith in the Freecraft gae, we ipleented the ethods in C++ and used CPLEX as the LP solver. 
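Throughout these experiments, policy quality is estimated by simulated runs (e.g., the 20 Monte Carlo runs of 100 steps behind Fig. 4(c)). A minimal sketch of such a rollout estimator, on a hypothetical two-state machine-repair chain rather than the actual domains:

```python
import random

# Sketch of Monte Carlo policy evaluation: estimate the discounted
# value of a policy by averaging seeded rollouts.  The two-state toy
# MDP here is a hypothetical stand-in, not the network domain.

def rollout(step, policy, s0, gamma=0.95, horizon=100, rng=None):
    """One simulated trajectory; returns its discounted return."""
    rng = rng or random.Random(0)
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, r = step(s, a, rng)
        total += discount * r
        discount *= gamma
    return total

def mc_value(step, policy, s0, runs=20, **kw):
    """Average discounted return over independent seeded rollouts."""
    return sum(rollout(step, policy, s0, rng=random.Random(i), **kw)
               for i in range(runs)) / runs

# Toy domain: state 1 = machine working (reward 1), 0 = broken.
def step(s, a, rng):
    if a == "reboot":
        return 1, 0.0                        # reboot restores, no reward
    nxt = s if rng.random() < 0.9 else 0     # working machines may fail
    return nxt, float(nxt)

policy = lambda s: "reboot" if s == 0 else "wait"
v = mc_value(step, policy, s0=1)
```

Seeding each run makes the estimate reproducible, which is convenient when comparing a class-based policy against a replanned one on the same trajectories.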
We created two tasks that evaluate two aspects of the game: long-term strategic decision making and local tactical battle maneuvers. Our Freecraft interface, and scenarios for these and other more complex tasks, are publicly available at the website given below. For each task we designed an RMDP model to represent the system, by consulting a domain expert. After planning, our policies were evaluated on the actual game. To better visualize our results, we direct the reader to view videos of our policies at: guestrin/research/generalization/. This website also contains details on our RMDP model. It is important to note that our policies were constructed relative to a very approximate model of the game, but evaluated against the real game.

In the tactical model, the goal is to take out an opposing enemy force with an equivalent number of units. At each time step, each footman decides which enemy to attack. The enemies are controlled using Freecraft's hand-built strategy. We modelled footmen and enemies as each having 5 health points, which can decrease as units are attacked. We used a simple aggregator to represent the effect of multiple attackers. To encourage coordination, each footman is linked to a buddy in a ring structure. The local value functions include terms over triples of linked variables. We solved this model for a world with 3 footmen and 3 enemies, shown in Fig. 4(e). The resulting policy (which is fairly complex) demonstrates successful coordination between our footmen: initially all three footmen focus on one enemy. When the enemy becomes injured, one footman switches its target. Finally, when the enemy is very weak, only one footman continues to attack it, while the others tackle a different enemy. Using this policy, our footmen defeat the enemies in Freecraft. The factors generated in our planning algorithm grow exponentially in the number of units, so planning in larger models is infeasible.
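Step 3(d) of Fig. 3 selects the joint action by maximizing a sum of local terms over the coordination graph. A minimal sketch of the underlying variable-elimination idea, on a hypothetical three-agent chain (the payoffs are illustrative, not the Freecraft model):

```python
from itertools import product

# Sketch of joint-action selection on a coordination graph: the global
# action value decomposes as a chain q12(a1,a2) + q23(a2,a3), so
# variable elimination maximizes it without enumerating joint actions.
# The payoff tables below are hypothetical.

ACTIONS = [0, 1]                       # e.g. which enemy to attack
q12 = {(a1, a2): float(a1 == a2) for a1, a2 in product(ACTIONS, repeat=2)}
q23 = {(a2, a3): 2.0 * (a2 == a3) for a2, a3 in product(ACTIONS, repeat=2)}

def eliminate_chain(q12, q23, actions):
    """Max-out a3, then a2, then a1; backtrack to recover the argmax."""
    # e3[a2] = max over a3 of q23, with the maximizing a3 recorded
    e3 = {a2: max((q23[a2, a3], a3) for a3 in actions) for a2 in actions}
    # e2[a1] = max over a2 of q12 + e3
    e2 = {a1: max((q12[a1, a2] + e3[a2][0], a2) for a2 in actions)
          for a1 in actions}
    best_val, a1 = max((e2[a1][0], a1) for a1 in actions)
    a2 = e2[a1][1]
    a3 = e3[a2][1]
    return best_val, (a1, a2, a3)

val, joint = eliminate_chain(q12, q23, ACTIONS)
# the coordinated optimum has all agents agreeing on a target
```

The cost of elimination is exponential only in the induced width of the graph, not in the number of agents, which is what makes action selection fast even when planning itself is infeasible.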
Fortunately, when executing a policy, we instantiate the current state at every time step, and action selection is significantly faster [6]. Thus, even though we cannot execute Step 2 in Fig. 3 of our algorithm for larger scenarios, we can generalize our class-based value function to a world with 4 footmen and 4 enemies, without replanning, using only Step 3 of our approach. The policy continues to demonstrate successful coordination between footmen, and we again beat Freecraft's policy. However, as the number of units increases, the position of enemies becomes increasingly important. Currently, our model does not consider this feature, and in a world with 5 footmen and 5 enemies, our policy loses to Freecraft in a close battle.

In the strategic model, the goal is to kill a strong enemy. The player starts with a few peasants, who can collect gold or wood, or attempt to build a barrack, which requires both gold and wood. All resources are consumed after each Build action. With a barrack and gold, the player can train a footman. The footmen can choose to attack the enemy. When attacked, the enemy loses health points, but fights back and may kill the footmen. We solved a model with 2 peasants, 1 barrack, 2 footmen, and an enemy. Every peasant was related to a central peasant and every footman had a buddy. The scope of our local value functions included triples between related objects. The resulting policy is quite interesting: the peasants gather gold and wood to build a barrack, then gold to build a footman. Rather than attacking the enemy at once, this footman waits until a second footman is built. Then, they attack the enemy together. The stronger enemy is able to kill both footmen, but it becomes quite weak. When the next footman is trained, rather than waiting for a second one, it attacks the now weak enemy, and is able to kill him. Again, planning in large scenarios is infeasible, but action selection can be performed efficiently.
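Step 3(c) instantiates the class-based value function in a new world by summing, over objects, the subfunction of each object's class, as in (5). A minimal sketch with hypothetical classes and local states:

```python
# Sketch of the class-based value function: in a new world, each
# object contributes the subfunction of its (learned) class, so
# V_omega(s) = sum over objects o of V_{C(o)}(s[o]).  The class
# assignments and subfunctions below are hypothetical.

# Per-class local value subfunctions over an object's local state.
V_class = {
    "peasant": lambda local: 1.0 if local["task"] == "gold" else 0.5,
    "footman": lambda local: 2.0 * local["health"],
}

def classify(obj):
    """Map an object to its (sub)class from its features (Step 3(b))."""
    return obj["class"]

def V_omega(world, state):
    """Value of a world state: sum of class subfunctions over objects."""
    return sum(V_class[classify(o)](state[o["name"]]) for o in world)

world = [{"name": "p1", "class": "peasant"},
         {"name": "p2", "class": "peasant"},
         {"name": "f1", "class": "footman"}]
state = {"p1": {"task": "gold"}, "p2": {"task": "wood"},
         "f1": {"health": 3}}
value = V_omega(world, state)   # 1.0 + 0.5 + 6.0 = 7.5
```

Because the subfunctions are tied to classes rather than to individual objects, the same table scores worlds with any number of peasants and footmen without replanning.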
Thus, we can use our generalized value function to tackle a world with 9 peasants and 3 footmen, without replanning. The 9 peasants coordinate to gather resources. Interestingly, rather than attacking with 2 footmen, the policy now waits for 3 to be trained before attacking. The 3 footmen kill the enemy, and only one of them dies. Thus, we have successfully generalized from a problem with about 10^6 joint state-action pairs to one that is many orders of magnitude larger.

8 Discussion and Conclusions

In this paper, we have tackled a longstanding goal in planning research: the ability to generalize plans to new environments. Such generalization has two complementary uses: first, we can tackle new environments with minimal or no replanning; second, it allows us to generalize plans from smaller tractable environments to significantly larger ones, which could not be solved directly with our planning algorithm. Our experimental results support the fact that our class-based value function generalizes well to new problems, and that the class and subclass structure discovered by our learning procedure improves the quality of the approximation. Furthermore, we successfully demonstrated our methods on a real strategic computer game, which contains many characteristics present in real-world dynamic resource allocation problems.

Several other papers consider the generalization problem. Some approaches can represent value functions in general terms, but usually require them to be hand-constructed for the particular task. Others [12; 8; 4] have focused on reusing solutions from isomorphic regions of state space. By comparison, our method exploits similarities between objects evolving in parallel. It would be very interesting to combine these two types of decomposition. The work of Boutilier et al. [1] on symbolic value iteration computes first-order value functions, which generalize over objects. However, it focuses on computing exact value functions, which are unlikely to generalize to a different world. Furthermore, it relies on the use of theorem proving tools, which adds to the complexity of the approach. Methods in deterministic planning have focused on generalizing from compactly described policies learned from many domains to incrementally build a first-order policy [9; 11]. Closest in spirit to our approach is the recent work of Yoon et al.
[17], which extends these approaches to stochastic domains. We perform a similar procedure to discover classes by finding structure in the value function. However, our approach finds regularities in compactly represented value functions rather than policies. Thus, we can tackle tasks such as multiagent planning, where the action space is exponentially large and compact policies often do not exist.

The key assumption in our method is interchangeability between objects of the same class. Our mechanism for learning subclasses allows us to deal with cases where objects in the domain can vary, but our generalizations will not be successful in very heterogeneous environments, where most objects have very different influences on the overall dynamics or rewards. Additionally, the efficiency of our LP solution algorithm depends on the connectivity of the underlying problem. In a domain with strong and constant interactions between many objects (e.g., Robocup), or when the reward function depends arbitrarily on the state of many objects (e.g., Blocksworld), the solution algorithm will probably not be efficient. In some cases, such as the Freecraft tactical domain, we can use generalization to scale up to larger problems. In others, we could combine our LP decomposition technique with constraint sampling [2] to address this high connectivity issue. In general, however, extending these techniques to highly connected problems is still an open problem. Finally, although we have successfully applied our class-based value functions to new environments without replanning, there are domains where such direct application would not be sufficient to obtain a good solution. In such domains, our generalized value functions can provide a good initial policy, which could be refined using a variety of local search methods.

We have assumed that relations do not change over time. In many domains (e.g., Blocksworld or Robocup), this assumption is false. In recent work, Guestrin et al.
[7] show that context-specific independence can allow for dynamically changing coordination structures in multiagent environments. Similar ideas may allow us to tackle dynamically changing relational structures.

In summary, we believe that the class-based value function methods presented here will significantly further the applicability of MDP models to large-scale real-world tasks.

Acknowledgements

We are very grateful to Ron Parr for many useful discussions. This work was supported by the DoD MURI program, administered by the Office of Naval Research under Grant N, and by Air Force contract F under DARPA's TASK program.

References

[1] C. Boutilier, R. Reiter, and B. Price. Symbolic dynamic programming for first-order MDPs. In IJCAI-01, 2001.
[2] D. P. de Farias and B. Van Roy. On constraint sampling for the linear programming approach to approximate dynamic programming. Submitted to Math. of Operations Research, 2001.
[3] T. Dean and K. Kanazawa. Probabilistic temporal reasoning. In AAAI-88, 1988.
[4] T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
[5] R. E. Fikes, P. E. Hart, and N. J. Nilsson. Learning and executing generalized robot plans. Artificial Intelligence, 3(4):251–288, 1972.
[6] C. E. Guestrin, D. Koller, and R. Parr. Multiagent planning with factored MDPs. In NIPS-14, 2001.
[7] C. E. Guestrin, S. Venkataraman, and D. Koller. Context specific multiagent coordination and planning with factored MDPs. In AAAI-02, 2002.
[8] M. Hauskrecht, N. Meuleau, L. Kaelbling, T. Dean, and C. Boutilier. Hierarchical solution of Markov decision processes using macro-actions. In UAI, 1998.
[9] R. Khardon. Learning action strategies for planning domains. Artificial Intelligence, 113:125–148, 1999.
[10] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In AAAI, 1998.
[11] M. Martin and H. Geffner. Learning generalized policies in planning using concept languages. In KR, 2000.
[12] R. Parr.
Flexible decomposition algorithms for weakly coupled Markov decision problems. In UAI-98, 1998.
[13] D. Schuurmans and R. Patrascu. Direct value-approximation for factored MDPs. In NIPS-14, 2001.
[14] P. Schweitzer and A. Seidmann. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110:568–582, 1985.
[15] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[16] S. Thrun and J. O'Sullivan. Discovering structure in multiple learning tasks: The TC algorithm. In ICML-96, 1996.
[17] S. W. Yoon, A. Fern, and B. Givan. Inductive policy selection for first-order MDPs. In UAI-02, 2002.

A Proof of Theorem 5.1

Notation: k, number of parameters; V̂, value function obtained from the LP with infinitely many worlds in constraints and objective; V̄, value function obtained from the LP with sampled worlds in constraints, but infinitely many in the objective; Ṽ, value function obtained from the LP with sampled worlds in constraints and objective; V_max, maximum value function over all possible worlds times the probability of that world; π*, optimal policy; µ*, stationary distribution of the optimal policy; µ*_ω, stationary distribution of the optimal policy for world ω; D, sampled worlds; [ω], number of objects in ω.

Assumption A.1 ∀ω, P([ω]) ≤ κ e^{−λ[ω]}, for some κ > 0.

Theorem A.2 Let V̂ be the value function obtained from the linear program with all of the constraints and the correct objective function; let Ṽ be the value function from the linear program with the sampled objective and constraints; and let V* be the optimal value function. If the number of sampled worlds is at least:

m ≥ 4/((1−γ)ε) { ln(8/δ) + k [ ln(2/((1−γ)ε)) + ln ln(2/((1−γ)ε)) ] } + (16e/ε) κ ln(16eκ/(δε)) ;

then:

‖Ṽ − V*‖_{1,P_Ω} ≤ ‖V̂ − V*‖_{1,P_Ω} + ε · 3κ R_max e/(1−γ) ;

with probability at least 1 − δ, for any δ > 0 and ε > 0.

Proof: We will start by proving an auxiliary lemma which considers only sampled constraints, but not the sampled objective:

Lemma A.3 If the number of sampled worlds is at least:

m ≥ 4/((1−γ)ε) { ln(8/δ) + k [ ln(2/((1−γ)ε)) + ln ln(2/((1−γ)ε)) ] } ;

then:

‖V̄ − V*‖_{1,P_Ω} ≤ ‖V̂ − V*‖_{1,P_Ω} + ε · κ R_max e/(1−γ) ;

with probability at least 1 − δ.

Proof: There are two main differences between our proof and the proof of de Farias and Van Roy's Theorem 5.1 [2] for standard MDPs: The first is that, in our relational models, the stationary distribution decomposes as the mixture of the stationary distributions of each world. The second is that we only sample part of the state; in particular, we sample the world, but represent the constraints for each world in closed form. For our generalization problem, the stationary distribution decomposes as: µ*(ω, s) = P(ω) µ*_ω(s).
We must now bound the probability that V̄ violates any constraints with respect to the constraints defined by the optimal policy:

Σ_{ω,s} µ*(ω, s) 1( V̄(s) < T^{π*} V̄(s) ) = Σ_ω P(ω) Σ_s µ*_ω(s) 1( V̄(s) < T^{π*} V̄(s) )
  ≤ Σ_ω P(ω) 1( ∃ s ∈ S[Π_ω], a ∈ A[Π_ω] : V̄(s) < T_a V̄(s) ).

If a world ω has been sampled, i.e., ω ∈ D, then the indicator over ∃ s ∈ S[Π_ω], a ∈ A[Π_ω] : V̄(s) < T_a V̄(s) is guaranteed to be zero. Thus, the last term is less than or equal to ψ( ω : V̄(s) < T_a V̄(s) ) (in de Farias and Van Roy's notation), which in turn is bounded in their Theorem 4.1.


More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Use of PSO in Parameter Estimation of Robot Dynamics; Part One: No Need for Parameterization

Use of PSO in Parameter Estimation of Robot Dynamics; Part One: No Need for Parameterization Use of PSO in Paraeter Estiation of Robot Dynaics; Part One: No Need for Paraeterization Hossein Jahandideh, Mehrzad Navar Abstract Offline procedures for estiating paraeters of robot dynaics are practically

More information

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information

Solutions of some selected problems of Homework 4

Solutions of some selected problems of Homework 4 Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks Intelligent Systes: Reasoning and Recognition Jaes L. Crowley MOSIG M1 Winter Seester 2018 Lesson 7 1 March 2018 Outline Artificial Neural Networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

Tracking using CONDENSATION: Conditional Density Propagation

Tracking using CONDENSATION: Conditional Density Propagation Tracking using CONDENSATION: Conditional Density Propagation Goal Model-based visual tracking in dense clutter at near video frae rates M. Isard and A. Blake, CONDENSATION Conditional density propagation

More information

List Scheduling and LPT Oliver Braun (09/05/2017)

List Scheduling and LPT Oliver Braun (09/05/2017) List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)

More information

IN modern society that various systems have become more

IN modern society that various systems have become more Developent of Reliability Function in -Coponent Standby Redundant Syste with Priority Based on Maxiu Entropy Principle Ryosuke Hirata, Ikuo Arizono, Ryosuke Toohiro, Satoshi Oigawa, and Yasuhiko Takeoto

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

Multi-Scale/Multi-Resolution: Wavelet Transform

Multi-Scale/Multi-Resolution: Wavelet Transform Multi-Scale/Multi-Resolution: Wavelet Transfor Proble with Fourier Fourier analysis -- breaks down a signal into constituent sinusoids of different frequencies. A serious drawback in transforing to the

More information

Department of Electronic and Optical Engineering, Ordnance Engineering College, Shijiazhuang, , China

Department of Electronic and Optical Engineering, Ordnance Engineering College, Shijiazhuang, , China 6th International Conference on Machinery, Materials, Environent, Biotechnology and Coputer (MMEBC 06) Solving Multi-Sensor Multi-Target Assignent Proble Based on Copositive Cobat Efficiency and QPSO Algorith

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

Qualitative Modelling of Time Series Using Self-Organizing Maps: Application to Animal Science

Qualitative Modelling of Time Series Using Self-Organizing Maps: Application to Animal Science Proceedings of the 6th WSEAS International Conference on Applied Coputer Science, Tenerife, Canary Islands, Spain, Deceber 16-18, 2006 183 Qualitative Modelling of Tie Series Using Self-Organizing Maps:

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

Curious Bounds for Floor Function Sums

Curious Bounds for Floor Function Sums 1 47 6 11 Journal of Integer Sequences, Vol. 1 (018), Article 18.1.8 Curious Bounds for Floor Function Sus Thotsaporn Thanatipanonda and Elaine Wong 1 Science Division Mahidol University International

More information

Equilibria on the Day-Ahead Electricity Market

Equilibria on the Day-Ahead Electricity Market Equilibria on the Day-Ahead Electricity Market Margarida Carvalho INESC Porto, Portugal Faculdade de Ciências, Universidade do Porto, Portugal argarida.carvalho@dcc.fc.up.pt João Pedro Pedroso INESC Porto,

More information

Linear Program Approximations for Factored Continuous-State Markov Decision Processes

Linear Program Approximations for Factored Continuous-State Markov Decision Processes Linear Progra Approiations for Factored Continuous-State Markov ecision Processes Milos Hauskrecht and Branislav Kveton epartent of Coputer Science and Intelligent Systes Progra University of Pittsburgh

More information

Chaotic Coupled Map Lattices

Chaotic Coupled Map Lattices Chaotic Coupled Map Lattices Author: Dustin Keys Advisors: Dr. Robert Indik, Dr. Kevin Lin 1 Introduction When a syste of chaotic aps is coupled in a way that allows the to share inforation about each

More information

MULTIAGENT Resource Allocation (MARA) is the

MULTIAGENT Resource Allocation (MARA) is the EDIC RESEARCH PROPOSAL 1 Designing Negotiation Protocols for Utility Maxiization in Multiagent Resource Allocation Tri Kurniawan Wijaya LSIR, I&C, EPFL Abstract Resource allocation is one of the ain concerns

More information

Decision-Theoretic Approach to Maximizing Observation of Multiple Targets in Multi-Camera Surveillance

Decision-Theoretic Approach to Maximizing Observation of Multiple Targets in Multi-Camera Surveillance Decision-Theoretic Approach to Maxiizing Observation of Multiple Targets in Multi-Caera Surveillance Prabhu Natarajan, Trong Nghia Hoang, Kian Hsiang Low, and Mohan Kankanhalli Departent of Coputer Science,

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies OPERATIONS RESEARCH Vol. 52, No. 5, Septeber October 2004, pp. 795 803 issn 0030-364X eissn 1526-5463 04 5205 0795 infors doi 10.1287/opre.1040.0130 2004 INFORMS TECHNICAL NOTE Lost-Sales Probles with

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies Approxiation in Stochastic Scheduling: The Power of -Based Priority Policies Rolf Möhring, Andreas Schulz, Marc Uetz Setting (A P p stoch, r E( w and (B P p stoch E( w We will assue that the processing

More information

International Scientific and Technological Conference EXTREME ROBOTICS October 8-9, 2015, Saint-Petersburg, Russia

International Scientific and Technological Conference EXTREME ROBOTICS October 8-9, 2015, Saint-Petersburg, Russia International Scientific and Technological Conference EXTREME ROBOTICS October 8-9, 215, Saint-Petersburg, Russia LEARNING MOBILE ROBOT BASED ON ADAPTIVE CONTROLLED MARKOV CHAINS V.Ya. Vilisov University

More information

lecture 36: Linear Multistep Mehods: Zero Stability

lecture 36: Linear Multistep Mehods: Zero Stability 95 lecture 36: Linear Multistep Mehods: Zero Stability 5.6 Linear ultistep ethods: zero stability Does consistency iply convergence for linear ultistep ethods? This is always the case for one-step ethods,

More information

Model-Free Reinforcement Learning as Mixture Learning

Model-Free Reinforcement Learning as Mixture Learning Model-Free Reinforceent Learning as Mixture Learning Nikos Vlassis vlassis@dpe.tuc.gr Technical University of Crete, Dept. of Production Engineering and Manageent, 73100 Chania, Greece Marc Toussaint toussai@cs.tu-berlin.de

More information

Introduction to Discrete Optimization

Introduction to Discrete Optimization Prof. Friedrich Eisenbrand Martin Nieeier Due Date: March 9 9 Discussions: March 9 Introduction to Discrete Optiization Spring 9 s Exercise Consider a school district with I neighborhoods J schools and

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee227c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee227c@berkeley.edu October

More information

Estimating Parameters for a Gaussian pdf

Estimating Parameters for a Gaussian pdf Pattern Recognition and achine Learning Jaes L. Crowley ENSIAG 3 IS First Seester 00/0 Lesson 5 7 Noveber 00 Contents Estiating Paraeters for a Gaussian pdf Notation... The Pattern Recognition Proble...3

More information

C na (1) a=l. c = CO + Clm + CZ TWO-STAGE SAMPLE DESIGN WITH SMALL CLUSTERS. 1. Introduction

C na (1) a=l. c = CO + Clm + CZ TWO-STAGE SAMPLE DESIGN WITH SMALL CLUSTERS. 1. Introduction TWO-STGE SMPLE DESIGN WITH SMLL CLUSTERS Robert G. Clark and David G. Steel School of Matheatics and pplied Statistics, University of Wollongong, NSW 5 ustralia. (robert.clark@abs.gov.au) Key Words: saple

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Tight Complexity Bounds for Optimizing Composite Objectives

Tight Complexity Bounds for Optimizing Composite Objectives Tight Coplexity Bounds for Optiizing Coposite Objectives Blake Woodworth Toyota Technological Institute at Chicago Chicago, IL, 60637 blake@ttic.edu Nathan Srebro Toyota Technological Institute at Chicago

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

An Algorithm for Posynomial Geometric Programming, Based on Generalized Linear Programming

An Algorithm for Posynomial Geometric Programming, Based on Generalized Linear Programming An Algorith for Posynoial Geoetric Prograing, Based on Generalized Linear Prograing Jayant Rajgopal Departent of Industrial Engineering University of Pittsburgh, Pittsburgh, PA 526 Dennis L. Bricer Departent

More information

Support Vector Machines MIT Course Notes Cynthia Rudin

Support Vector Machines MIT Course Notes Cynthia Rudin Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance

More information

REDUCTION OF FINITE ELEMENT MODELS BY PARAMETER IDENTIFICATION

REDUCTION OF FINITE ELEMENT MODELS BY PARAMETER IDENTIFICATION ISSN 139 14X INFORMATION TECHNOLOGY AND CONTROL, 008, Vol.37, No.3 REDUCTION OF FINITE ELEMENT MODELS BY PARAMETER IDENTIFICATION Riantas Barauskas, Vidantas Riavičius Departent of Syste Analysis, Kaunas

More information

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x)

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x) 7Applying Nelder Mead s Optiization Algorith APPLYING NELDER MEAD S OPTIMIZATION ALGORITHM FOR MULTIPLE GLOBAL MINIMA Abstract Ştefan ŞTEFĂNESCU * The iterative deterinistic optiization ethod could not

More information

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing

More information

Iterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel

Iterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel 1 Iterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel Rai Cohen, Graduate Student eber, IEEE, and Yuval Cassuto, Senior eber, IEEE arxiv:1510.05311v2 [cs.it] 24 ay 2016 Abstract In

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

Analyzing Simulation Results

Analyzing Simulation Results Analyzing Siulation Results Dr. John Mellor-Cruey Departent of Coputer Science Rice University johnc@cs.rice.edu COMP 528 Lecture 20 31 March 2005 Topics for Today Model verification Model validation Transient

More information