Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes

Dmitri Dolgov and Edmund Durfee
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor, MI 48109
{ddolgov,durfee}@umich.edu

Abstract

In multi-agent MDPs, it is generally necessary to consider the joint state space of all agents, making the size of the problem and the solution exponential in the number of agents. However, often the interactions between the agents are only local, which suggests a more compact problem representation. We consider a subclass of multi-agent MDPs with local interactions where dependencies between agents are asymmetric, meaning that agents can affect others in a unidirectional manner. This asymmetry, which often occurs in domains with authority-driven relationships between agents, allows us to make better use of the locality of agents' interactions. We present and analyze a graphical model of such problems and show that, for some classes of problems, it can be exploited to yield significant (sometimes exponential) savings in problem and solution size, as well as in the computational efficiency of solution algorithms.

1. Introduction

Markov decision processes [9] are widely used for devising optimal control policies for agents in stochastic environments. Moreover, MDPs are also being applied to multi-agent domains [1, 10, 11]. However, a weak spot of traditional MDPs that subjects them to the curse of dimensionality and presents significant computational challenges is the flat state-space model, which enumerates all states the agent can be in. This is especially significant for multi-agent MDPs, where, in general, it is necessary to consider the joint state and action spaces of all agents. Fortunately, there is often a significant amount of structure to MDPs, which can be exploited to devise more compact problem and solution representations, as well as efficient solution methods that take advantage of such representations. For example, a number of factored representations have been proposed [2, 3, 5] that model the state space as being factored into state variables, and use dynamic Bayesian network representations of the transition function to exploit the locality of the relationships between variables. (This work was supported, in part, by a grant from Honeywell Labs.)

We focus on multi-agent MDPs and on a particular form of problem structure that is due to the locality of interactions between agents. Let us note, however, that we analyze the structure and complexity of optimal solutions only, and the claims do not apply to approximate methods that exploit problem structure (e.g., [5]). Central to our problem representation are dependency graphs that describe the relationships between agents. The idea is very similar to other graphical models, e.g., graphical games [6], coordination graphs [5], and multi-agent influence diagrams [7], where graphs are used to more compactly represent the interactions between agents to avoid the exponential explosion in problem size. Similarly, our representation of a multi-agent MDP is exponential only in the degree of the dependency graph, and can be exponentially smaller than the size of the flat MDP defined on the joint state and action spaces of all agents. We focus on asymmetric dependency graphs, where the influences that agents exert on each other do not have to be mutual. Such interactions are characteristic of domains with authority-based relationships between agents, i.e., low-authority agents have no control over higher-authority ones.
Given the compact representation of multi-agent MDPs, an important question is whether the compactness of the problem representation can be maintained in the solutions, and, if so, whether it can be exploited to devise more efficient solution methods. We analyze the effects of the optimization criteria and the shape of the dependency graph on the structure of optimal policies, and, for problems where the compactness can be maintained in the solution, we present algorithms that make use of the graphical representation.

2. Preliminaries

In this section, we briefly review some background and introduce our compact representation of multi-agent MDPs.

2.1. Markov Decision Processes

A single-agent MDP can be defined as a tuple ⟨S, A, P, R⟩, where S = {i} and A = {a} are finite sets of states and actions, P : S × A × S → [0, 1] defines the transition function (the probability that the agent goes to state j if it executes action a in state i is P(i, a, j)), and R : S → ℝ defines the rewards (the agent gets a reward of R(i) for visiting state i). (Often, rewards are said to also depend on actions and future states. For simplicity, we define rewards as a function of the current state only, but our model can be generalized to the more general case.) A solution to an MDP is a policy, defined as a procedure for selecting an action. It is known [9] that, for such MDPs, there always exist policies that are uniformly optimal (optimal for all initial conditions), stationary (time-independent), deterministic (always select the same action for a given state), and Markov (history-independent); such policies (π) can be described as mappings of states to actions: π : S → A.

Let us now consider a multi-agent environment with a set of n agents M = {m} (|M| = n), each of whom has its own set of states S_m = {i_m} and actions A_m = {a_m}. The most straightforward and also the most general way to extend the concept of a single-agent MDP to the fully observable multi-agent case is to assume that all agents affect the transitions and rewards of all other agents. Under these conditions, a multi-agent MDP can be defined simply as a large MDP ⟨S_M, A_M, P_M, R_M⟩, where the joint state space S_M is defined as the cross product of the state spaces of all agents, S_M = S_1 × ... × S_n, and the joint action space is the cross product of the action spaces of all agents, A_M = A_1 × ... × A_n. The transition and the reward functions are defined on the joint state and action spaces of all agents in the standard way: P_M : S_M × A_M × S_M → [0, 1] and R_M : S_M → ℝ.

This representation, which we refer to as flat, is the most general one, in that, by considering the joint state and action spaces, it allows for arbitrary interactions between agents. The trouble is that the problem (and solution) size grows exponentially with the number of agents. Thus, very quickly it becomes impossible to even write down the problem, let alone solve it. Let us note that, if the state space of each agent is defined on a set of world features, there can be some overlap in features between the agents, in which case the joint state space would be smaller than the cross product of the state spaces of all agents, and would grow as a slower exponent. For simplicity, we ignore the possibility of overlapping features, but our results are directly applicable to that case as well.
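
To make the single-agent formulation concrete, the following minimal sketch (ours, not from the paper) encodes an MDP ⟨S, A, P, R⟩ as Python dictionaries and extracts a stationary deterministic policy π : S → A by value iteration. The particular states, actions, probabilities, rewards, and the discounted-reward criterion with discount factor gamma are all illustrative assumptions.

```python
# Minimal illustration (not from the paper): a single-agent MDP <S, A, P, R>
# solved by value iteration under expected total discounted reward.
# The particular S, A, P, R, and gamma below are hypothetical.

S = [0, 1, 2]                      # states
A = ['stay', 'move']               # actions
gamma = 0.9                        # assumed discount factor

# P[(i, a, j)] = probability of going to state j when executing a in state i
P = {(i, a, j): 0.0 for i in S for a in A for j in S}
for i in S:
    P[(i, 'stay', i)] = 1.0
    P[(i, 'move', (i + 1) % len(S))] = 0.8
    P[(i, 'move', i)] = 0.2

R = {0: 0.0, 1: 1.0, 2: 5.0}       # R(i): reward for visiting state i

def value_iteration(S, A, P, R, gamma, eps=1e-6):
    V = {i: 0.0 for i in S}
    while True:
        Q = {(i, a): R[i] + gamma * sum(P[(i, a, j)] * V[j] for j in S)
             for i in S for a in A}
        V_new = {i: max(Q[(i, a)] for a in A) for i in S}
        if max(abs(V_new[i] - V[i]) for i in S) < eps:
            # greedy policy: a stationary deterministic mapping pi : S -> A
            return {i: max(A, key=lambda a: Q[(i, a)]) for i in S}
        V = V_new

pi = value_iteration(S, A, P, R, gamma)
print(pi)   # e.g., {0: 'move', 1: 'move', 2: 'stay'} for these numbers
```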
2.2. Graphical Multi-Agent MDPs

In many multi-agent domains, the interactions between agents are only local, meaning that the rewards and transitions of an agent are not directly influenced by all other agents, but rather only by a small subset of them. To exploit the sparseness in agents' interactions, we propose a compact representation that is analogous to the Bayesian network representation of joint probability distributions of several random variables. Given its similarity to other graphical models, we label our representation a graphical multi-agent MDP (graphical MMDP).

Central to the definition of a graphical MMDP is the notion of a dependency graph (Figure 1), which shows how agents affect each other. The graph has a vertex for every agent in the multi-agent MDP. There is a directed edge from vertex k to vertex m if agent k has an influence on agent m. The concept is very similar to coordination graphs [5], but we distinguish between two ways agents can influence each other: (1) an agent can affect another agent's transitions, in which case we use a solid arrow to depict this relationship in the dependency graph, and (2) an agent can affect another agent's rewards, in which case we use a dashed arrow in the dependency graph.

[Figure 1. Agent Dependency Graphs]

To simplify the following discussion of graphical multi-agent MDPs, we also introduce some additional concepts and notation pertaining to the structure of the dependency graph. For every agent m ∈ M, let us label all agents that directly affect m's transitions as N_m^-(P) (parents of m with respect to the transition function P), and all agents whose transitions are directly affected by m as N_m^+(P) (children of m with respect to the transition function P). Similarly, we use N_m^-(R) to refer to agents that directly affect m's rewards, and N_m^+(R) to refer to agents whose rewards are directly affected by m. Thus, in the graph shown in Figure 1a, N_m^-(P) = {1, 4}, N_m^-(R) = {1, 2}, N_m^+(P) = {3}, and N_m^+(R) = {4}. We use the terms transition-related and reward-related parents and children to distinguish between the two categories. Sometimes, it will also be helpful to talk about the union of the transition-related and reward-related parents or children, in which case we use N_m^- = N_m^-(P) ∪ N_m^-(R) and N_m^+ = N_m^+(P) ∪ N_m^+(R). Furthermore, let us label the set of all ancestors of m (all agents from which m is reachable), with respect to transition-related and reward-related dependencies, as O_m^-(P) and O_m^-(R), respectively. Similarly, let us label the descendants of m (all agents reachable from m) as O_m^+(P) and O_m^+(R).
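
As an illustration of this notation, the sketch below (not from the paper) stores a dependency graph with edges labelled 'P' (solid, transition-related) or 'R' (dashed, reward-related) and computes parent, child, and ancestor sets by reachability. The specific edge list is a hypothetical graph chosen only to be consistent with the sets quoted above for agent m in Figure 1a; the actual figure is not reproduced here.

```python
# Illustrative sketch (not from the paper): a dependency graph with edges labelled
# 'P' (transition influence, solid arrows) or 'R' (reward influence, dashed arrows).
# The edge list is hypothetical but consistent with the sets quoted for agent m
# in Figure 1a: N_m^-(P) = {1, 4}, N_m^-(R) = {1, 2}, N_m^+(P) = {3}, N_m^+(R) = {4}.

edges = {
    (1, 'm'): {'P', 'R'},   # agent 1 affects m's transitions and rewards
    (4, 'm'): {'P'},
    (2, 'm'): {'R'},
    ('m', 3): {'P'},
    ('m', 4): {'R'},
}

def parents(m, kind):
    """N_m^-(kind): agents that directly affect m's transitions ('P') or rewards ('R')."""
    return {k for (k, c), kinds in edges.items() if c == m and kind in kinds}

def children(m, kind):
    """N_m^+(kind): agents whose transitions/rewards are directly affected by m."""
    return {c for (k, c), kinds in edges.items() if k == m and kind in kinds}

def ancestors(m, kind):
    """O_m^-(kind): all agents from which m is reachable via edges of the given kind."""
    result, frontier = set(), {m}
    while frontier:
        new = set()
        for v in frontier:
            new |= parents(v, kind)
        frontier = new - result
        result |= new
    return result

print(parents('m', 'P'), parents('m', 'R'))    # {1, 4} {1, 2}
print(children('m', 'P'), children('m', 'R'))  # {3} {4}
print(ancestors('m', 'P'))                     # {1, 4} for this hypothetical graph
```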

A graphical MMDP with a set of agents M is defined as follows. Associated with each agent m ∈ M is a tuple ⟨S_m, A_m, P_m, R_m⟩, where the state space S_m and the action space A_m are defined exactly as before, but the transition and the reward functions are defined as follows:

P_m : S_{N_m^-(P)} × S_m × A_m × S_m → [0, 1],
R_m : S_{N_m^-(R)} × S_m → ℝ,                                   (1)

where S_{N_m^-(P)} and S_{N_m^-(R)} are the joint state spaces of the transition-related and reward-related parents of m, respectively. In other words, the transition function of agent m specifies a probability distribution over its next states S_m as a function of its own current state S_m, the current states of its parents S_{N_m^-(P)}, and its own action A_m. That is, P_m(i_{N_m^-(P)}, i_m, a_m, j_m) is the probability that agent m goes to state j_m if it executes action a_m when its current state is i_m and the states of its transition-related parents are i_{N_m^-(P)}. The reward function is defined analogously on the current states of the agent itself and its reward-related parents.

Also notice that we allow cycles in the agent dependency graph, and, moreover, the same agent can both influence and be influenced by some other agent (e.g., agents 4 and m in Figure 1a). We also allow for asymmetric influences between agents, i.e., it could be the case that one agent affects the other, but not vice versa (e.g., agent m, in Figure 1a, is influenced by agent 1, but the opposite is not true). This is often the case in domains where the relationships between agents are authority-based. It turns out that the existence of such asymmetry has important implications on the compactness of the solution and the complexity of the solution algorithms. We return to a discussion of the consequences of this asymmetry in the following sections.

It is important to note that, in this representation, each transition and reward function only specifies the rewards and transition probabilities of agent m, and contains no information about the rewards and transitions of other agents. This implies that the reward and next state of agent m are conditionally independent of the rewards and the next states of other agents, given the current action of m and the states of m and its parents N_m^-. Therefore, this model does not allow for correlations between the rewards or the next states of different agents. For example, we cannot model a situation where two agents are trying to go through the same door and whether one agent makes it depends on whether the other one does; we can only represent, for each agent, the probability that it makes it, independently of the other. This limitation of the model can be overcome by lumping together groups of agents that are correlated in such ways into a single agent, as in the flat multi-agent MDP formulation. In fact, we could have allowed for such dependencies in our model, but it would have complicated the presentation. Instead, we assume that all such correlations have already been dealt with, and the resulting problem only consists of agents (perhaps composite ones) whose states and rewards have this conditional independence property.

It is easy to see that the size of a problem represented in this fashion is exponential in the maximum number of parents of any agent, but, unlike the flat model, it does not depend on the total number of agents. Therefore, for problems where agents have a small number of parents, the space savings can be significant. In particular, if the number of parents of any agent is bounded by a constant, the savings are exponential (in terms of the number of agents).
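
As a rough illustration of this size argument (with our own hypothetical numbers, not the paper's), the sketch below counts transition-table entries for the flat representation versus the graphical one when there are n agents, each with s states, a actions, and at most k transition-related parents.

```python
# Illustrative size comparison (hypothetical numbers, not from the paper).
# Flat representation: one table P_M : S_M x A_M x S_M -> [0,1] over joint spaces.
# Graphical representation: one table per agent over its own state/action space
# and the states of its (at most k) transition-related parents.

def flat_transition_entries(n, s, a):
    return (s ** n) * (a ** n) * (s ** n)

def graphical_transition_entries(n, s, a, k):
    # per agent: |S_parents| * |S_m| * |A_m| * |S_m| entries, parents bounded by k
    return n * (s ** k) * s * a * s

n, s, a, k = 10, 4, 2, 2
print(flat_transition_entries(n, s, a))          # 4**10 * 2**10 * 4**10, about 1.1e15
print(graphical_transition_entries(n, s, a, k))  # 10 * 16 * 4 * 2 * 4 = 5120
```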
3. Properties of Graphical Multi-Agent MDPs

Now that we have a compact representation of multi-agent MDPs, two important questions arise. First, can we compactly represent the solutions to these problems? And second, if so, can we exploit the compact representations of the problems and the solutions to improve the efficiency of the solution algorithms? Positive answers to these questions would be important indications of the value of our graphical problem representation. However, before we attempt to answer these questions and get into a more detailed analysis of the related issues, let us lay down some groundwork that will simplify the following discussion.

First of all, let us note that a graphical multi-agent MDP is just a compact representation, and any graphical MMDP can easily be converted to a flat multi-agent MDP, analogously to how a compact Bayesian network can be converted to a joint probability distribution. Therefore, all properties of solutions to flat multi-agent MDPs (e.g., stationarity, history-independence, etc.) also hold for equivalent problems that are formulated as graphical MMDPs. Thus, the following simple observation about the form of policies in graphical MMDPs holds.

Observation 1. For a graphical MMDP ⟨S_m, A_m, P_m, R_m⟩, m ∈ M, with an optimization criterion for which optimal policies are Markov, stationary, and deterministic (we will implicitly assume that optimal policies are Markov, stationary, and deterministic from now on), such policies can be represented as π_m : S_X → A_m, where S_X is the cross product of the state spaces of some subset of all agents (X ⊆ M).

Clearly, this observation does not say much about the compactness of policies, since it allows X = M, which corresponds to a solution where an agent has to consider the states of all other agents when deciding on an action. If that were always the case, using this compact graphical representation for the problem would not (by itself) be beneficial, because the solution would not retain the compactness and would be exponential in the number of agents. However, as it turns out, for some problems, X can be significantly smaller than M. Thus we are interested in determining, for every agent m, the minimal set of agents whose states m's policy has to depend on:

Definition 1. In a graphical MMDP, a set of agents X_m is a minimal domain of an optimal policy π_m : S_{X_m} → A_m of agent m iff, for any set of agents Y and any policy π'_m : S_Y → A_m, the following implications hold:

Y ⊉ X_m ⟹ U(π'_m) < U(π_m),
Y ⊇ X_m ⟹ U(π'_m) ≤ U(π_m),

where U(π) is the payoff that is being maximized.

Essentially, this definition allows us to talk about the sets of agents whose joint state space is necessary and sufficient for determining the optimal actions of agent m. From now on, whenever we use the notation π_m : S_{X_m} → A_m, we implicitly assume that X_m is the minimal domain of π_m.

3.1. Assumptions

As mentioned earlier, one of the main goals of the following sections will be to characterize the minimal domains of agents' policies under various conditions. Let us make a few observations and assumptions about properties of minimal domains that allow us to avoid some non-interesting degenerate special cases and to focus on the hardest cases in our analysis. These assumptions do not limit the general complexity results that follow, as the latter only require that there exist some problems for which the assumptions hold. In the rest of the paper, we implicitly assume that they hold.

Central to our future discussion will be an analysis of which random variables (rewards, states, etc.) depend on which others. It will be very useful to talk about the conditional independence of future values of some variables, given the current values of others.

Definition 2. We say that a random variable X is Markov on the joint state space S_Y of some set of agents Y if, given the current values of all states in S_Y, the future values of X are independent of any past information. If that property does not hold, we say that X is non-Markov on S_Y.

Assumption 1. For a minimal domain X_m of agent m's optimal policy, and a set of agents Y, the following hold:

1. X_m is unique;
2. m ∈ X_m;
3. l ∈ X_m ⟹ S_l is Markov on S_{X_m};
4. S_m is Markov on S_Y ⟺ Y ⊇ X_m.

The first assumption allows us to avoid some special cases with sets of agents with highly-correlated states, where equivalent policies can be constructed as functions of either of the sets. The second assumption implies that an optimal policy of every agent depends on its own state. The third assumption says that the state space of any agent l that is in the minimal domain of m must be Markov on the state space of the minimal domain. Since the state space of agent l is in the minimal domain of m, it must influence m's rewards in a non-trivial manner. Thus, if S_l is non-Markov on S_{X_m}, agent m should be able to expand the domain of its policy to make S_l Markov, since that, in general, would increase m's payoff. The fourth assumption says that the agent's state is Markov only on supersets of its minimal domain, because the agent would want to increase the domain of its policy just enough to make its state Markov. These assumptions are slightly redundant (e.g., 4 could be deduced from weaker conditions), but we use this form for brevity.

3.2. Transitivity

Using the results of the previous sections, we can now formulate an important claim that will significantly simplify the analysis that follows.

Proposition 1. Consider two agents m, l ∈ M, where the optimal policies of m and l have minimal domains X_m and X_l, respectively (π_m : S_{X_m} → A_m, π_l : S_{X_l} → A_l). Then, under Assumption 1, the following holds: l ∈ X_m ⟹ X_l ⊆ X_m.

Proof: We will show this by contradiction. Let us consider an agent from l's minimal domain: k ∈ X_l. Let us assume (contradicting the statement of the proposition) that l ∈ X_m, but k ∉ X_m. Consider the set of agents that consists of the union of the two minimal domains X_m and X_l, but with agent k removed: Y = X_m ∪ (X_l \ k).
Then, since Y ⊉ X_l, Assumption 1.4 implies that S_l is non-Markov on S_Y. Thus, Assumption 1.3 implies l ∉ X_m, which contradicts our earlier assumption.

Essentially, this proposition says that the minimal domains have a certain transitive property: if agent m needs to base its action choices on the state of agent l, then, in general, m also needs to base its actions on the states of all agents in the minimal domain of l. As such, this proposition will help us to establish lower bounds on policy sizes.

In the rest of the paper, we analyze some classes of problems to see how large the minimal domains are, under various conditions and assumptions, and, for domains where the minimal domains are not prohibitively large, we outline solution algorithms that exploit the graphical structure. In what follows, we focus on two common scenarios: one where the agents work as a team and aim to maximize the social welfare of the group (the sum of individual payoffs), and the other where each agent maximizes its own payoff.

4. Maximizing Social Welfare

The following proposition characterizes the structure of the optimal solutions to graphical multi-agent MDPs under the social welfare optimization criterion, and as such serves as an indication of whether the compactness of this particular representation can be exploited to devise an efficient solution algorithm for such problems. We demonstrate that, in general, when the social welfare of the group is considered, the optimal actions of each agent depend on the states of all other agents (unless the dependency graph is disconnected). Let us note that this case, where all agents are maximizing the same objective function, is equivalent to a single-agent factored MDP, and our results for this case are analogous to the well-known fact that the value function in a single-agent factored MDP does not, in general, retain the structure of the problem [8].

Proposition 2. For a graphical MMDP with a connected (ignoring edge directionality) dependency graph, under the optimization criterion that maximizes the social welfare of all agents, an optimal policy π_m of agent m, in general, depends on the states of all other agents, i.e., π_m : S_M → A_m.

Proof (Sketch): Agent m must, at the minimum, base its action decisions on the states of its immediate (both transition- and reward-related) parents and children. Indeed, agent m should worry about the states of its transition-related parents, N_m^-(P), because their states affect the one-step transition probabilities of m, which certainly have a bearing on m's payoff. Agent m should also include in the domain of its policy the states of its reward-related parents, N_m^-(R), because they affect m's immediate rewards, and agent m might need to act so as to synchronize its state with the states of its parents. Similarly, since the agent cares about the social welfare of all agents, it will need to consider the effect that its actions have on the states and rewards of its immediate children, and must thus base its policy on the states of its immediate children N_m^+(P) and N_m^+(R) to potentially set them up to get higher rewards. Having established that the minimal domain of each agent must include the immediate children and parents of the agent, we can use the transitivity property from the previous section to extend this result. Although Proposition 1 only holds under the conditions of Assumption 1, for our purpose of determining the complexity of policies in general, it is sufficient that there exist problems for which Assumption 1 holds. It follows from Proposition 1 that the minimal domain of agent m must include all parents and children of m's parents and children, and so forth. For a connected dependency graph, this expands the minimal domain of each agent to all other agents in M.

The above result should not be too surprising, as it makes clear, intuitive sense. Indeed, let us consider a simple example that has the flavor of a commonly-occurring production scenario. Suppose that there is a set of agents that can either cooperate to generate a certain product, yielding a very high reward, or concentrate on some local tasks that do not require cooperation, but which have a lower social payoff. Also, suppose that the interactions between the agents are only local; for example, the agents are operating an assembly line, where each agent receives the product from the previous agent, modifies it, and passes it on to the next agent. Let us now suppose that each agent has a certain probability of breaking down, and if that happens to at least one of the agents, the assembly line fails.
In such an example, the optimal policy for the agents would be to participate in the assembly-line production until one of them fails, at which point all agents should switch to working on their local tasks (perhaps processing items already in the pipeline). Clearly, in that example, the policy of each agent is a function of the states of all other agents. The take-home message of the above is that, when the agents care about the social welfare of the group, even when the interactions between the agents are only local, the agents' policies depend on the joint state space of all agents. The reason for this is that a state change of one agent might lead all other agents to want to immediately modify their behavior. Therefore, our particular type of compact graphical representation (by itself and without additional restrictions) cannot be used to compactly represent the solutions.

5. Maximizing Own Welfare

In this section, we analyze problems where each of the agents maximizes its own payoff. Under this assumption, unlike the discouraging scenario of the previous section, the complexity of agents' policies is slightly less frightening. The following result characterizes the size of the minimal domain of optimal policies for problems where each agent maximizes its own utility.

Proposition 3. For a graphical MMDP with an optimization criterion where each agent maximizes its own reward, the minimal domain of m's policy consists of m itself and all of its transition- and reward-related ancestors: X_m = E_m, where we define E_m = {m} ∪ O_m^-(P) ∪ O_m^-(R).

Proof (Sketch): To show the correctness of the proposition, we need to prove that (1) the minimal domain must include at least m itself and its ancestors (X_m ⊇ E_m), and (2) that X_m does not include any other agents (X_m ⊆ E_m). We can show (1) by once again applying the transitivity property. Clearly, an agent's policy should be a function of the states of the agent's reward-related and transition-related parents, because they affect the one-step transition probabilities and rewards of the agent. Then, by Proposition 1, the minimal domain of the agent's policy must also include all of its ancestors. We establish (2) as follows. We assume that it holds for all ancestors of m, and show that it must then hold for m. We then expand the statement to all agents by induction.

Let us fix the policies π_k of all agents except m. Consider the tuple ⟨S_{E_m}, A_m, P_{E_m}, R_{E_m}⟩, where P_{E_m} and R_{E_m} are defined as follows:

P_{E_m}(i_{E_m}, a_m, j_{E_m}) = P_m(i_{N_m^-(P)}, i_m, a_m, j_m) ∏_{k ∈ O_m^-} P_k(i_{N_k^-(P)}, i_k, π_k(i_{E_k}), j_k),
R_{E_m}(i_{E_m}) = R_m(i_{N_m^-(R)}, i_m).                                   (2)

The above constitutes a fully observable MDP on S_{E_m} and A_m with transition function P_{E_m} and reward function R_{E_m}. Let us label this decision process MDP_1. By properties of fully observable MDPs, there exists an optimal stationary deterministic solution π_m^1 of the form π_m^1 : S_{E_m} → A_m. Also consider the following MDP on an augmented state space that includes the joint state space of all the agents (and not just m's ancestors): MDP_2 = ⟨S_M, A_m, P_M, R_M⟩, where P_M and R_M are defined as follows:

P_M(i_M, a_m, j_M) = P_m(i_{N_m^-(P)}, i_m, a_m, j_m) ∏_{k ∈ O_m^-} P_k(i_{N_k^-(P)}, i_k, π_k(i_{E_k}), j_k) ∏_{k ∈ M \ ({m} ∪ O_m^-)} P_k(i_{N_k^-(P)}, i_k, π_k(i_M), j_k),
R_M(i_M) = R_m(i_{N_m^-(R)}, i_m).                                   (3)

Basically, we have now constructed two fully observable MDPs: MDP_1, which is defined on S_{E_m}, and MDP_2, which is defined on S_M, where MDP_1 is essentially a projection of MDP_2 onto S_{E_m}. We need to show that no solution to MDP_2 can have a higher value than the optimal solution to MDP_1. (The proof does not rely on the actual type of optimization criterion used by each agent and holds for any criterion that is a function only of the agents' trajectories.) Let us refer to the optimal solution to MDP_1 as π_m^1. Suppose there exists a solution π_m^2 to MDP_2 that has a higher value than π_m^1. The policy π_m^2 defines some stochastic trajectory for the system over the state space S_M. Let us label the distribution over the state space at time t as ρ(i_M, t). It can be shown that, under our assumptions, we can always construct a non-stationary policy π_m^1(t) : S_{E_m} → A_m for MDP_1 that yields the same distribution over the state space S_{E_m} as the one produced by π_m^2. Thus, there exists a non-stationary solution to MDP_1 that has a higher payoff than π_m^1, which is a contradiction, since we assumed that π_m^1 was optimal for MDP_1. We have therefore shown that, given that the policies of all ancestors of m depend only on their own states and the states of their ancestors, there always exists a policy that maps the state space of m and its ancestors (S_{E_m}) to m's actions (A_m) that is at least as good as any policy that maps the joint space of all agents (S_M) to m's actions. Then, by using induction, we can expand this statement to all agents (for acyclic graphs we use the root nodes as the base case, and for cyclic graphs, we use agents that do not have any ancestors that are not simultaneously their descendants).

The point of the above proposition is that, for situations where each agent maximizes its own utility, the optimal actions of each agent do not have to depend on the states of all other agents, but rather only on its own state and the states of its ancestors. In contrast to the conclusions of Section 4, this result is more encouraging. For example, for dependency graphs that are trees (typical of authority-driven organizational structures), the number of ancestors of any agent equals the depth of the tree, which is logarithmic in the number of agents. Therefore, if each agent maximizes its own welfare, the size of its policy will be exponential in the depth of the tree, but only linear in the number of agents.
5.1. Acyclic Dependency Graphs

Thus far we have shown that problems where agents optimize their own welfare can allow for more compact policy representations. We now describe an algorithm that exploits the compactness of the problem representation to more efficiently solve such policy-optimization problems for domains with acyclic dependency graphs. It is a distributed algorithm where the agents exchange information, and each one solves its own policy-optimization problem.

The algorithm is very straightforward and works as follows. First, the root nodes of the graph (the ones with no parents) compute their optimal policies, which are simply mappings of their own states to their own actions. Once a root agent computes a policy that maximizes its welfare, it sends the policy to all of its children. Each child waits to receive the policies π_k, k ∈ N_m^-, from its ancestors, then forms an MDP on the state space of itself and its ancestors as in (eq. 2). It then solves this MDP ⟨S_{E_m}, A_m, P_{E_m}, R_{E_m}⟩ to produce a policy π_m : S_{E_m} → A_m, at which point it sends this policy and the policies of its ancestors to its children. The process repeats until all agents compute their optimal policies. Essentially, this algorithm performs, in a distributed manner, a topological sort of the dependency graph, and computes a policy for every agent.
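
The following sketch (ours, not the paper's code) makes the message-passing structure of this procedure explicit for acyclic graphs. It assumes a helper, here called solve_mdp_over_ancestors, that builds and solves the induced MDP ⟨S_{E_m}, A_m, P_{E_m}, R_{E_m}⟩ of (eq. 2) once the ancestor policies are fixed; that helper is left abstract, so the code only shows the topological ordering and the flow of policies from ancestors to descendants.

```python
# Schematic sketch (not from the paper) of the distributed algorithm for acyclic
# dependency graphs: process agents in topological order of the dependency graph;
# each agent, once it has the policies of all its ancestors, solves an MDP over
# the joint state space S_{E_m} of itself and its ancestors (cf. eq. 2).

from collections import deque

def topological_order(agents, parents):
    """Kahn's algorithm over the (acyclic) dependency graph; parents[m] = N_m^-."""
    indegree = {m: len(parents[m]) for m in agents}
    queue = deque(m for m in agents if indegree[m] == 0)   # root agents
    order = []
    while queue:
        m = queue.popleft()
        order.append(m)
        for c in agents:
            if m in parents[c]:
                indegree[c] -= 1
                if indegree[c] == 0:
                    queue.append(c)
    return order

def solve_graphical_mmdp(agents, parents, ancestors, solve_mdp_over_ancestors):
    """
    solve_mdp_over_ancestors(m, ancestor_policies) stands in for constructing and
    solving the induced MDP <S_{E_m}, A_m, P_{E_m}, R_{E_m}> of eq. (2); it must
    return a policy pi_m : S_{E_m} -> A_m.  Its implementation is not shown here.
    """
    policies = {}
    for m in topological_order(agents, parents):
        # "receive" the policies of all ancestors (already computed upstream)
        ancestor_policies = {k: policies[k] for k in ancestors[m]}
        policies[m] = solve_mdp_over_ancestors(m, ancestor_policies)
        # in the distributed version, m would now send these policies to its children
    return policies
```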

5.2. Cyclic Dependency Graphs

We now turn our attention to the case of dependency graphs with cycles. Note that the complexity result of Proposition 3 still applies, because no assumptions about the cyclic or acyclic nature of the dependency graph were made in the statement or proof of the proposition. Thus, the minimal domain of an agent's policy is still the set of its ancestors.

The problem is, however, that the solution algorithm of the previous section is inappropriate for cyclic graphs, because it will deadlock on agents that are part of a cycle, since these agents will be waiting to receive policies from each other. Indeed, when self-interested agents mutually affect each other, it is not clear how they should go about constructing their policies. Moreover, in general, for such agents there might not even exist a set of stationary deterministic policies that are in equilibrium, i.e., since the agents mutually affect each other, the best responses of agents to each other's policies might not be in equilibrium. A careful analysis of this case falls in the realm of Markov games, and is beyond the scope of this paper. However, if we assume that there exists an equilibrium in stationary deterministic policies, and that the agents in a cycle have some black-box way of constructing their policies, we can formulate an algorithm for computing optimal policies by modifying the algorithm from the previous section as follows. The agents begin by finding the largest cycle they are a part of; then, after the agents receive policies from their parents who are not also their descendants, the agents proceed to devise an optimal joint policy for their cycle, which they then pass to their children.

6. Additive Rewards

In our earlier analysis, the reward function R_m of an agent could depend in an arbitrary way on the current states of the agent and its parents (eq. 1). In fact, this is why agents, in general, needed to synchronize their states with the states of their parents (and children in the social welfare case), which, in turn, was why the effects of reward-related dependencies propagated just as the transition-related ones did. In this section, we consider a subclass of reward functions whose effects remain local. Namely, we focus on additively-separable reward functions:

R_m(i_{N_m^-(R)}, i_m) = r_{mm}(i_m) + Σ_{k ∈ N_m^-(R)} r_{mk}(i_k),                                   (4)

where r_{mk} is a function (r_{mk} : S_k → ℝ) that specifies the contribution of agent k to m's reward. In order for all of our following results to hold, these functions have to be subject to the following condition:

r_{mk}(i_k) = l_{mk}(r_{kk}(i_k)),                                   (5)

where l_{mk} is a positive linear function (l_{mk}(x) = αx + β, α > 0, β ≥ 0). This condition implies that agents' preferences over each other's states are positively (and linearly) correlated, i.e., when an agent increases its local reward, its contribution to the rewards of its reward-related children also increases linearly.

Furthermore, the results of this section are only valid under certain assumptions about the optimization criteria the agents use. Let us say that if an agent receives a history of rewards H(r) = {r(t)} = {r(0), r(1), ...}, its payoff is U(H(r)) = U(r(0), r(1), ...). Then, in order for our results to hold, U has to be linear additive:

U(H(r_1 + r_2)) = U(H(r_1)) + U(H(r_2)).                                   (6)

Notice that this assumption holds for the commonly-used risk-neutral MDP optimization criteria, such as expected total reward, expected total discounted reward, and average per-step reward, and is, therefore, not greatly limiting.
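
As a tiny concrete illustration of (eq. 4)-(eq. 6), with our own hypothetical numbers, the sketch below builds agent 1's reward from its local reward r_11 and a positive linear transformation of agent 2's local reward r_22, and checks that the expected total discounted reward criterion is linear additive over reward histories.

```python
# Hypothetical illustration (not from the paper) of additively-separable rewards.
# r_kk is agent k's local reward; agent 2's contribution to agent 1's reward is a
# positive linear function of r_22, as required by eq. (5).

alpha, beta = 2.0, 1.0                      # l(x) = alpha*x + beta, alpha > 0, beta >= 0
r_11 = {0: 0.0, 1: 3.0}                     # agent 1's local reward, by its own state
r_22 = {0: 1.0, 1: 4.0}                     # agent 2's local reward, by its own state
r_12 = {i2: alpha * r_22[i2] + beta for i2 in r_22}   # agent 2's contribution to agent 1

def R_1(i1, i2):
    # eq. (4): agent 1's reward is its local reward plus its parents' contributions
    return r_11[i1] + r_12[i2]

gamma = 0.9
def U(history):
    # total discounted reward of a (deterministic, for simplicity) reward history
    return sum(gamma ** t * r for t, r in enumerate(history))

# eq. (6): U(H(r1 + r2)) == U(H(r1)) + U(H(r2)) for this criterion
h1, h2 = [1.0, 2.0, 3.0], [0.5, 0.5, 0.5]
assert abs(U([a + b for a, b in zip(h1, h2)]) - (U(h1) + U(h2))) < 1e-12
```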
In the rest of this section, for simplicity, we focus on problems with two agents, and more specifically on two interesting special cases, shown in Figures 1b and 1c. However, the results can be generalized to problems with multiple agents and arbitrary dependency graphs. First of all, let us note that both of these problems have cyclic dependency graphs. Therefore, if the reward functions of the agents were not additively-separable, per our earlier results of Section 5, there would be no guarantee that there exists an equilibrium in stationary deterministic policies. However, as we show below, our assumption about the additivity of the reward functions changes that and ensures that an equilibrium always exists.

Let us consider the case in Figure 1b. Clearly, the policy of neither agent affects the transition function of the other. Thus, given our assumptions about the additivity of rewards and utility functions, it is easy to see that the problem of maximizing the payoff is separable for each agent. For example, for agent 1 we have:

max_{π_1,π_2} U_1(H(R_1)) = max_{π_1,π_2} U_1(H(r_{11} + r_{12})) = max_{π_1} U_1(H(r_{11})) + max_{π_2} U_1(H(r_{12})).                                   (7)

Thus, regardless of what policy agent 2 chooses, agent 1 should adopt a policy that maximizes the first term in (eq. 7). In game-theoretic terms, each of the agents has a (weakly) dominant strategy, and will adopt that strategy regardless of what the other agent does. This is what guarantees the above-mentioned equilibrium. Also notice that this result does not rely on reward linearity (eq. 5) and holds for any additively-separable (eq. 4) reward functions. Now that we have demonstrated that, for each agent, it suffices to optimize a function of only that agent's own states and actions, it is clear that each agent can construct its optimal policy independently. Indeed, each agent has to solve a standard MDP on its own state and action space with a slightly modified reward function, R'_m(i_m) = r_{mm}(i_m), which differs from the original reward function (eq. 4) in that it ignores the contribution of m's parents to its reward.

Let us now analyze the case in Figure 1c, where the state of agent 1 affects the transition probabilities of agent 2, and the state of agent 2 affects the rewards of agent 1. Again, without the assumption that rewards are additive, this cycle would have caused the policies of both agents to depend on the cross product of their state spaces S_1 × S_2, and, furthermore, the existence of equilibria in stationary deterministic policies between self-interested agents would not be guaranteed. However, when rewards are additive, the problem is simpler. Indeed, due to our additivity assumptions, we can write the optimization problems of the two agents as:

max_{π_1,π_2} U_1(...) = max_{π_1} U_1(H(r_{11})) + max_{π_1,π_2} U_1(H(r_{12})),
max_{π_1,π_2} U_2(...) = max_{π_1} U_2(H(r_{21})) + max_{π_1,π_2} U_2(H(r_{22})).                                   (8)

Notice that here the problems are no longer separable (as in the previous case), so neither agent is guaranteed to have a dominant strategy. However, if we make use of the assumption that the rewards are positively and linearly correlated (eq. 5), we can show that there always exists an equilibrium in stationary deterministic policies. This is due to the fact that a positive linear transformation of the reward function does not change the optimal policy (we show this for discounted MDPs, but the statement holds more generally):

Observation 2. Consider two MDPs, Λ = ⟨S, A, R, P⟩ and Λ' = ⟨S, A, R', P⟩, where R'(s) = αR(s) + β, α > 0 and β ≥ 0. Then, a policy π is optimal for Λ under the total expected discounted reward optimization criterion iff it is optimal for Λ'.

Proof (Sketch): It is easy to see that the linear transformation R'(i) = αR(i) + β of the reward function will lead to a linear transformation of the Q function: Q'(i, a) = αQ(i, a) + β(1 − γ)^{-1}, where γ is the discount factor. Indeed, the multiplicative factor α just changes the scale of all rewards, and the additive factor β simply produces an extra discounted sequence of rewards that sums to β(1 − γ)^{-1} over an infinite horizon. Then, since the optimal policy is π(i) = argmax_a Q'(i, a) = argmax_a [αQ(i, a) + β(1 − γ)^{-1}] = argmax_a Q(i, a), a policy π is optimal for Λ iff it is optimal for Λ'.

Observation 2 implies that, for any policy π_1, a policy π_2 that maximizes the second term of U_1 in (eq. 8) will be simultaneously maximizing (given π_1) the second term of U_2 in (eq. 8). In other words, given any π_1, both agents will agree on the choice of π_2. Therefore, agent 1 can find the pair π_1, π_2 that maximizes its payoff U_1 and adopt that π_1. Then, agent 2 will adopt the corresponding π_2, since deviating from it cannot increase its utility.

To sum up, when rewards are additively-separable (eq. 4) and satisfy (eq. 5), for the purposes of determining the minimal domain of agents' policies (in two-agent problems), we can ignore reward-related edges in dependency graphs. Furthermore, for graphs where there are no cycles with transition-related edges, the agents can formulate their optimal policies via algorithms similar to the ones described in Section 5.1, and these policies will be in equilibrium.

7. Conclusions

We have analyzed the use of a particular compact, graphical representation for a class of multi-agent MDPs with local, asymmetric influences between agents. We have shown that, generally, because the effects of these influences propagate with time, the compactness of the representation is not fully preserved in the solution. We have shown this for multi-agent problems with the social welfare optimization criterion, which are equivalent to single-agent problems, and for which similar results are known. We have also analyzed problems with self-interested agents, and have shown the complexity of solutions to be less prohibitive in some cases (acyclic dependency graphs). We have demonstrated that under further restrictions on agents' effects on each other (positive-linear, additively-separable rewards), locality is preserved to a greater extent: equilibrium sets of stationary deterministic policies for self-interested agents always exist even in some classes of problems with reward-related cyclic relationships between agents. Our future work will combine the graphical representation of multi-agent MDPs with other forms of problem factorization, including constrained multi-agent MDPs [4].

References

[1] C. Boutilier. Sequential optimality and coordination in multiagent systems. In IJCAI-99, pages 478-485, 1999.
[2] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. JAIR, 11:1-94, 1999.
[3] C. Boutilier, R. Dearden, and M. Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1-2):49-107, 2000.
[4] D. A. Dolgov and E. H. Durfee. Optimal resource allocation and policy formulation in loosely-coupled Markov decision processes. In ICAPS-04, 2004. To appear.
[5] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399-468, 2003.
[6] M. Kearns, M. L. Littman, and S. Singh. Graphical models for game theory. In Proc. of UAI-01, pages 253-260, 2001.
[7] D. Koller and B. Milch. Multi-agent influence diagrams for representing and solving games. In IJCAI-01, pages 1027-1036, 2001.
[8] D. Koller and R. Parr. Computing factored value functions for policies in structured MDPs. In IJCAI-99, pages 1332-1339, 1999.
[9] M. L. Puterman. Markov Decision Processes. John Wiley & Sons, New York, 1994.
[10] D. Pynadath and M. Tambe. Multiagent teamwork: Analyzing the optimality and complexity of key theories and models. In AAMAS-02, 2002.
[11] S. Singh and D. Cohn. How to dynamically merge Markov decision processes. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, NIPS-98, volume 10. The MIT Press, 1998.
We have deonstrated that under further restrictions on agents effects on each other (positive-linear, additively-separable rewards), locality is preserved to a greater extent equilibriu sets of stationary deterinistic policies for self-interested agents always exist even in soe classes of probles with rewardrelated cyclic relationships between agents. Our future work will cobine the graphical representation of ulti-agent MDPs with other fors of proble factorization, including constrained ulti-agent MDPs [4]. References [1] C. Boutilier. Sequential optiality and coordination in ultiagent systes. In IJCAI-99, pages 478 485, 1999. [2] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assuptions and coputational leverage. JAIR, 11:1 94, 1999. [3] C. Boutilier, R. Dearden, and M. Goldszidt. Stochastic dynaic prograing with factored representations. Artificial Intelligence, 121(1-2):49 107, 2000. [4] D. A. Dolgov and E. H. Durfee. Optial resource allocation and policy forulation in loosely-coupled Markov decision processes. In ICAPS-04, 2004. To Appear. [5] C. Guestrin, D. Koller, R. Parr, and S. Venkataraan. Efficient solution algoriths for factored MDPs. Journal of Artificial Intelligence Research, 19:399 468, 2003. [6] M. Kearns, M. L. Littan, and S. Singh. Graphical odels for gae theory. In Proc. of UAI01, pages 253 260, 2001. [7] D. Koller and B. Milch. Multi-agent influence diagras for representing and solving gaes. In IJCAI-01, pages 1027 1036, 2001. [8] D. Koller and R. Parr. Coputing factored value functions for policies in structured MDPs. In IJCAI-99, pages 1332 1339, 1999. [9] M. L. Puteran. Markov Decision Processes. John Wiley & Sons, New York, 1994. [10] D. Pynadath and M. Tabe. Multiagent teawork: Analyzing the optiality and coplexity of key theories and odels. In AAMAS-02, 2002. [11] S. Singh and D. Cohn. How to dynaically erge Markov decision processes. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, NIPS-98, volue 10. The MIT Press, 1998.