Learning to Coordinate Efficiently: A Model-based Approach

Journal of Artificial Intelligence Research 19 (2003). Submitted 10/02; published 7/03.

Ronen I. Brafman (brafman@cs.bgu.ac.il)
Computer Science Department, Ben-Gurion University, Beer-Sheva, Israel

Moshe Tennenholtz (moshe@robotics.stanford.edu)
Faculty of Industrial Engineering and Management, Technion, Haifa 32000, Israel

Abstract

In common-interest stochastic games all players receive an identical payoff. Players participating in such games must learn to coordinate with each other in order to receive the highest-possible value. A number of reinforcement learning algorithms have been proposed for this problem, and some have been shown to converge to good solutions in the limit. In this paper we show that using very simple model-based algorithms, much better (i.e., polynomial) convergence rates can be attained. Moreover, our model-based algorithms are guaranteed to converge to the optimal value, unlike many of the existing algorithms.

1. Introduction

In some learning contexts, the learning system is actually a collection of components, or agents, that have some common goal or a common utility function. The distributed nature of such systems makes the problem of learning to act in an unknown environment more difficult, because the agents must coordinate both their learning process and their action choices. However, the need to coordinate is not restricted to distributed agents, as it arises naturally among self-interested agents in certain environments. A good model for such environments is that of a common-interest stochastic game (CISG). A stochastic game (Shapley, 1953) is a model of multi-agent interaction consisting of a finite or infinite sequence of stages, in each of which the agents play a one-shot strategic-form game. The identity of each stage depends stochastically on the previous stage and the actions performed by the agents in that stage. The goal of each agent is to maximize some function of its reward stream, either its average reward or its sum of discounted rewards. A CISG is a stochastic game in which at each point the payoff of all agents is identical.

Various algorithms for learning in CISGs have been proposed in the literature. These include specialized algorithms for certain classes of CISGs (e.g., Claus & Boutilier, 1997; Wang & Sandholm, 2002), as well as more general algorithms for learning in general stochastic games that converge to a Nash equilibrium (e.g., Littman, 1994; Hu & Wellman, 1998; Littman, 2001; Bowling & Veloso, 2001) and can be applied to CISGs. For many of these, results that show convergence in the limit to an equilibrium exist. However, all current algorithms suffer from at least one of the following when applied to CISGs: (1) when multiple equilibria exist, there is no guarantee of convergence to an optimal equilibrium; (2) they are not known to be efficient, i.e., at best they have convergence-in-the-limit guarantees. In addition, many of the algorithms make strong assumptions about the information obtained by the agents during the course of the game. In particular, the agents know what actions are available to each other in each stage of the game, and they can observe the choices made by the other agents. This latter assumption is called perfect monitoring in the game-theory literature. It is to be contrasted with imperfect monitoring, under which each agent can only observe its own payoff, and cannot observe the actions performed by the other players.

In this paper we show that using a simple model-based approach, we can provide algorithms with far better theoretical guarantees, and often in more general settings, than previous algorithms. More precisely, we provide a learning algorithm with polynomial-time convergence to a near-optimal value in common-interest stochastic games under imperfect monitoring. In the case of repeated games, our algorithm converges to the actual optimal value. The key to our result is a recent powerful learning algorithm, R-max (Brafman & Tennenholtz, 2002). In particular, the following aspect of R-max plays a crucial role: it is a deterministic learning algorithm that guarantees polynomial-time convergence to near-optimal value in the single-agent case. (In fact, R-max converges in zero-sum stochastic games as well, but we will not need this capability.) We note that in our treatment of stochastic games we make the standard assumption that the state is fully observable. That is, the agents recognize the situation they are in at each point in time.

In the following section, we provide the necessary background on stochastic games. In Section 3, we discuss the basic algorithm and its particular variants. In Section 4 we conclude the paper with a discussion of some important variants: n-player common-interest stochastic games and repeated games. Because an understanding of R-max and its convergence properties is needed for a more complete picture of our results, we provide this information in the Appendix.

2. Background

We start with a set of standard definitions and a discussion of the basic assumptions of our approach.

2.1 Game Definitions

A game is a model of multi-agent interaction. In a strategic-form game, we have a set of players, each of whom chooses some action to perform from a given set of actions. As a result of the players' combined choices, some outcome is obtained, which is described numerically in the form of a payoff vector, i.e., a vector of values, one for each of the players. We concentrate on two-player games. General n-person games can be treated similarly.

A common description of a two-player strategic-form game is in the form of a matrix. The rows of the matrix correspond to player 1's actions and the columns correspond to player 2's actions. The entry in row i and column j of the game matrix contains the rewards obtained by the players if player 1 plays his i-th action and player 2 plays his j-th action. The set of possible combined choices of the agents are called joint actions.

Thus, if player 1 has k possible actions and player 2 has l possible actions, there are k · l joint actions in the game. Without loss of generality, we assume that the payoff of a player in a game is taken from the interval of real numbers P = [0, R_max]. A common-interest game is a game in which the rewards obtained by all agents are identical. A coordination game is a common-interest game in which all agents have the same set A of actions and, in addition, for every i, the reward all agents receive for selecting the joint action (i, i) is greater than the reward they receive for selecting (i, j) for any j ≠ i.

In a repeated game the players play a given game G repeatedly. We can view a repeated game, with respect to a game G, as consisting of an infinite number of iterations, in each of which the players have to select an action of the game G. After playing each iteration, the players receive the appropriate payoffs, as dictated by that game's matrix, and move to the next iteration.

A stochastic game is a game in which the players play a possibly infinite sequence of one-shot, strategic-form games from some given set of games, which we shall take to be finite. After playing each game, the players receive the appropriate payoff, as dictated by that game's matrix, and move to a new game. The identity of this new game depends, stochastically, on the previous game and on the players' actions in that previous game. Formally:

Definition 1 A two-player stochastic game M on states S = {1, ..., N} and actions A = {a_1, ..., a_k} consists of:

Stage Games: each state s ∈ S is associated with a two-player game in strategic form, where the action set of each player is A.

Probabilistic Transition Function: tr(s, t, a, a') is the probability of a transition from state s to state t given that the first player plays a and the second player plays a'.

A stochastic game is similar to a Markov decision process (MDP). In both models actions lead to transitions between states of the world. The main difference is that in an MDP the transition depends on the action of a single player, whereas in a stochastic game the transition depends on the joint action of both players. In addition, in a stochastic game, the reward obtained by a player for performing an action depends on its action and the action of the other player. To model this, we associate a game with every state. Therefore, we often use the terms state, stage game, and game interchangeably.

Much like in regular one-shot games, we can restrict our attention to special classes of stochastic and repeated games. In particular, in this paper we are interested in common-interest stochastic games (CISGs), i.e., stochastic games in which each stage-game is a common-interest game. CISGs can be viewed as a special extension of MDPs. They correspond to an MDP in which the agent is actually a distributed system. That is, it has different components, all of which influence the choice of action, but all of which work toward a common goal. Thus, for every CISG, we define its induced MDP to be the MDP obtained from the CISG when we assume the existence of a single agent that controls all the agents in the CISG. That is, a single agent whose set of actions corresponds to the joint actions of the CISG agents. Because the payoffs for all agents are identical, this is a well-defined concept.
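To make the induced-MDP construction concrete, here is a minimal sketch in Python. It is illustrative only (not the paper's code); the class names and the dictionary-based representation of rewards and transitions are assumptions made for the example.

```python
# Illustrative sketch (not from the paper): a two-player CISG and its induced MDP.
# States are 0..N-1; each player's actions are 0..k-1; joint actions are pairs (a1, a2).
from itertools import product

class CISG:
    def __init__(self, n_states, n_actions, reward, transition):
        self.states = range(n_states)
        self.actions = range(n_actions)
        self.reward = reward          # reward[s][(a1, a2)] -> common payoff in [0, R_max]
        self.transition = transition  # transition[s][(a1, a2)][t] -> probability of moving to t

class MDP:
    def __init__(self, states, actions, reward, transition):
        self.states, self.actions = states, actions
        self.reward, self.transition = reward, transition

def induced_mdp(game):
    """The induced MDP: one fictitious agent whose actions are the joint actions."""
    joint_actions = list(product(game.actions, game.actions))
    reward = {s: {a: game.reward[s][a] for a in joint_actions} for s in game.states}
    transition = {s: {a: game.transition[s][a] for a in joint_actions} for s in game.states}
    return MDP(list(game.states), joint_actions, reward, transition)
```

Because the payoffs are common, the induced MDP is well defined: the single fictitious agent simply controls both components of each joint action.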

Each history of length t in a game with perfect monitoring consists of a sequence of t tuples of the form (stage-game, joint action, payoffs), followed by the last state reached. Thus, the set of possible histories of length t in a game with perfect monitoring is (S × A² × P)^t × S. We denote the set of histories of any length in this setting by H_p. Similarly, in the case of imperfect monitoring, each history is a sequence of tuples of the form (stage-game, single-agent action, single-agent payoff), followed by the last state reached. Thus, the set of possible histories of length t in a game with imperfect monitoring is (S × A × P)^t × S, and the set of possible histories in this setting is denoted H_imp.

Given a CISG with imperfect monitoring (respectively, perfect monitoring), a policy for the agent is a mapping from H_imp (respectively, H_p) to the set of possible probability distributions over A. Hence, a policy determines the probability of choosing each particular action for each possible history. A pure policy is a deterministic policy. A stationary policy is one in which the action is a function of the last state only.

In defining the value of a CISG, we treat it much like an MDP, because the ultimate goal is to provide a coordinated strategy for the agents that will lead to the maximal reward. We use the average expected reward criterion. Given a CISG M and a natural number T, U(s, π, ρ, T) denotes the expected T-step undiscounted average reward when player 1 follows policy π and player 2 follows policy ρ, starting from state s. The optimal T-step value for M starting at s is U(s, T) = max_(π,ρ) U(s, π, ρ, T). We define U(s) = lim inf_{T→∞} U(s, T), and refer to it as the value of s.

2.2 Assumptions, Complexity and Optimality

The setting we consider is one in which agents have a common interest. We assume that these agents are aware of the fact that they have a common interest and that they know the number of agents that participate in this game. This is most natural in a distributed system. Moreover, we assume that they all employ the same learning algorithm; this latter assumption is common to most work in coordination learning. It is quite natural if the agents are part of a distributed system, or if there exists a standard, commonly accepted algorithm for coordination learning.

We make a number of additional assumptions. First, we assume that the agent always recognizes the identity of the stage-game it has reached, but not its associated payoffs and transition probabilities. Second, we assume that the maximal possible reward, R_max, is known ahead of time. Finally, we restrict our attention to ergodic CISGs, which, in a sense, are the only class of CISGs for which the type of result we seek is possible.

Kearns and Singh introduced the ergodicity assumption in the context of their E^3 algorithm for learning in MDPs (Kearns & Singh, 1998). A similar assumption was used by Hoffman and Karp (1966) in the context of stochastic games, where it is referred to as irreducibility. An MDP is said to be ergodic if the Markov chain obtained by fixing any pure stationary policy is ergodic, that is, if any state is reachable from any other state. We restrict our attention to CISGs whose induced MDP is ergodic.

Irreducible stochastic games and ergodic MDPs have a number of nice properties, as shown by Hoffman and Karp (1966). First, the maximal long-term average reward is independent of the starting state, implying that U(s) is actually independent of s. Thus, from now on we use v(M) to denote the optimal value of the game M, which is identical to U(s) for all states s. Second, this optimal value can be obtained by a stationary policy, i.e., by a policy that depends on the current stage-game only. In particular, in the case of CISGs, this optimal value can be obtained using a pure stationary strategy.
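As a small illustration of the average-reward criterion, the following sketch estimates U(s, π, ρ, T) by Monte Carlo simulation for pure stationary policies, using the CISG representation sketched earlier. It is illustrative only; the function and parameter names are assumptions, not the paper's code.

```python
# Illustrative sketch (not from the paper): a Monte Carlo estimate of the T-step
# undiscounted average reward U(s, pi, rho, T) for pure stationary policies pi and rho.
import random

def average_reward(game, pi, rho, s, T, episodes=1000, seed=0):
    """pi[s] and rho[s] give each player's action in state s (pure stationary policies)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        state, ret = s, 0.0
        for _ in range(T):
            joint = (pi[state], rho[state])
            ret += game.reward[state][joint]
            next_states, probs = zip(*game.transition[state][joint].items())
            state = rng.choices(next_states, weights=probs)[0]
        total += ret / T
    return total / episodes   # estimates U(s, pi, rho, T); maximizing over (pi, rho) gives U(s, T)
```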

In a sense, there seems to be little point in theoretical discussions of learning in the context of MDPs and CISGs that are not ergodic. It is evident that in order to learn, the agent must perform some sort of implicit or explicit exploration. If the MDP is not ergodic, the agent's initial choices can have irreversible consequences on its long-term value. But one cannot expect the agent to always make correct initial choices when it has no knowledge of the game.

Next, we wish to discuss the central parameter in the analysis of the complexity: the mixing time, first identified by Kearns and Singh (1998). Kearns and Singh argue that it is unreasonable to refer to the efficiency of learning algorithms without referring to the efficiency of convergence to a desired value given complete information. They defined the ε-return mixing time of a policy π (in an MDP) to be the smallest value of T such that for all t ≥ T, and for any starting state s, the expected average reward of π is at least ε-close to the expected infinite-horizon average reward of π. That is, letting U(s, π, t) denote the expected t-step average reward of π starting at s, define U(π) = lim inf_{t→∞} U(s, π, t), where this latter value does not depend on the choice of s because of the ergodicity assumption. The ε-return mixing time of π is then T = min { T ≥ 1 : for all states s and all t ≥ T, U(s, π, t) > U(π) − ε }. Notice that the mixing time has nothing to do with learning. It is influenced by the stochastic nature of the model and its interaction with the policy, however obtained. We adopt their definition, but with respect to the MDP induced by the CISG of interest.

Finally, we define a near-optimal polynomial-time learning algorithm for CISGs to be a learning algorithm A such that given a CISG M, some ε > 0, γ > 0 and 0 < δ < 1, there exists some T > 0 that is polynomial in 1/ε, 1/δ, 1/γ, the ε-return mixing time of an optimal policy, and the size of the description of the game, such that for every t > T the actual t-step average reward obtained by A is at least (1 − γ)(v(M) − 2ε), with a probability of at least 1 − δ. Notice that the above definition requires that if we consider any particular, fixed t > T, then the actual average reward will be close to optimal with an overwhelming probability.

3. The Algorithms

The following is our basic result:

Theorem 1 There exists a near-optimal polynomial-time algorithm for learning in two-player CISGs under imperfect monitoring, provided that all agents have a shared polynomial bound on the number of actions of each other.

Formally, the shared bound means that there is a constant k ≥ 1 such that, if the number of actions available to agent i is f_i, then it is commonly known that any agent can have at most f = (max(f_1, f_2))^k actions. Other ways of formalizing this concept are treated similarly. We note that in certain special sub-cases, e.g., under perfect monitoring, there will be no need for the multiplicative factor of 1 − γ.
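For quick reference, the two quantities used in Theorem 1, the ε-return mixing time and the near-optimality criterion, can be restated in display form. This is a readability aid, not a verbatim formula from the paper.

```latex
% epsilon-return mixing time of a policy pi in the induced MDP
T_{\mathrm{mix}}(\pi,\varepsilon) \;=\; \min\bigl\{\, T \ge 1 \;:\; \forall s \in S,\ \forall t \ge T,\;\; U(s,\pi,t) > U(\pi) - \varepsilon \,\bigr\}

% near-optimal polynomial-time learning: for some T polynomial in 1/eps, 1/delta, 1/gamma,
% T_mix, and the game description, and for every fixed t > T, with probability at least
% 1 - delta the actual t-step average reward \bar{r}(t) of the algorithm satisfies
\bar{r}(t) \;\ge\; (1-\gamma)\bigl(v(M) - 2\varepsilon\bigr).
```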

An immediate corollary of Theorem 1 is:

Corollary 1 There exists a near-optimal polynomial-time algorithm for learning in CISGs under imperfect monitoring, provided that all agents know that all stage games are coordination games.

Proof: Coordination games are a special instance of CISGs. In particular, in these games all players have an identical number of actions, so each player has a polynomial bound on the number of actions available to the other players.

The proof of Theorem 1 will proceed in stages, using successively weaker assumptions. The reduction will be from a particular multi-agent case to the case of single-agent learning in MDPs. The settings in the first cases are quite natural for a distributed system, whereas the latter cases make quite weak assumptions on the agents' knowledge of each other. To succeed in our proof, the single-agent learning algorithm to which the problem is reduced must have the following properties:

It is a near-optimal polynomial-time algorithm. That is, given ε > 0 and 0 < δ < 1, then with a probability of 1 − δ, the algorithm, when applied to an MDP, leads to an actual average payoff that is ε-close to the optimal payoff for any t > T, where T is polynomial in 1/ε, 1/δ, the ε-return mixing time, and the description of the MDP.

The precise value of T can be computed ahead of time given ε, δ, the ε-return mixing time of the optimal policy, and the description of the MDP.

The algorithm is deterministic. That is, given a particular history, the algorithm will always select the same action.

One such algorithm is R-max (Brafman & Tennenholtz, 2002), described in detail in the Appendix. In order to make the presentation more concrete, and since, at the moment, R-max appears to be the only algorithm with these properties, we explicitly refer to it in the proof. However, any other algorithm with the above properties can be used instead. We now proceed with the description of the reductions.

Note: In Cases 1-5 below, we assume that the ε-return mixing time of an optimal policy, denoted by T_mix, is known. For example, in repeated games, T_mix is 1. In stochastic games, on the other hand, T_mix is unlikely to be known. However, this assumption is useful for later development. In Case 6, we remove this assumption using a standard technique. Finally, in all cases we assume that the learning accuracy parameters ε, δ, and γ are known to all agents.

Case 1: Common order over joint actions. This case is handled by reduction to the single-agent case. If all agents share an ordering over the set of joint actions in every stage game, then they can emulate the single-agent R-max algorithm on the induced MDP. This follows from the fact that R-max is deterministic. Thus, the agents start with an initially identical model. They all know which joint action a single agent would execute initially, and so they execute their part of that joint action. Since each agent always observes its own reward, and this reward is identical for all agents, the update of the model is the same for all agents. Thus, the model of all agents is always identical. Therefore, in essence we have a distributed, coordinated execution of the R-max algorithm (a sketch is given below), and the result follows immediately from the properties of R-max. Note that Case 1 does not require perfect monitoring.

We note that many cases can be reduced to this case. Indeed, below we present techniques, which work by reduction to Case 1, for handling a number of more difficult cases in which an ordering does not exist. We note that in special games in which there is sufficient asymmetry (e.g., different action-set sizes, different off-diagonal payoffs) it is possible to devise more efficient methods for generating an ordering (e.g., via randomization).
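A minimal sketch of this distributed execution follows. It assumes a generic deterministic single-agent learner with a `choose`/`update` interface; the class and method names are illustrative, not the paper's code.

```python
# Illustrative sketch (not from the paper) of Case 1: each agent runs its own copy of a
# deterministic single-agent learner (e.g., R-max) over the shared ordering of joint actions.
# Because the learner is deterministic and the reward is common, all copies stay in lockstep
# without any communication.

class CoordinatingAgent:
    def __init__(self, my_index, learner):
        self.my_index = my_index   # 0 for player 1, 1 for player 2
        self.learner = learner     # deterministic learner over the induced MDP
                                   # (assumed interface: choose(state), update(...))

    def act(self, state):
        self.last_joint = self.learner.choose(state)   # identical choice at every agent
        return self.last_joint[self.my_index]          # execute only my own component

    def observe(self, state, reward, next_state):
        # The reward is common to all agents, so every local copy of the deterministic
        # learner performs exactly the same model update and stays synchronized.
        self.learner.update(state, self.last_joint, reward, next_state)
```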

Case 2: A commonly known ordering over the agents, and knowledge of the action-set size of all agents. First, suppose that the agents not only know the action-set sizes of each other, but also share a common ordering over the actions of each agent. This induces a natural lexicographic ordering over the joint actions, based on the ordering of each agent's actions and the ordering over the agents. Thus, we are back in Case 1.

In fact, we do not really need a common ordering over the action set of each agent. To see this, consider the following algorithm: each agent represents each joint action as a vector whose k-th component is the index of the action performed by agent k in this joint action. Thus, (3, 4) denotes the joint action in which the first agent according to the common ordering performs its third action, and the second agent performs its fourth action. At each point, the agent chooses the appropriate joint action using R-max. The choice is identical for all agents, provided the model is identical. Because each agent only performs its own component of the joint action, there cannot be any confusion over the identities of the actions. Thus, if R-max determines that the current action is (3, 4), then agent 1 performs its third action and agent 2 performs its fourth action. The model is initialized in an identical manner by all agents. It is updated identically by all agents because at each point they all choose the same vector and their rewards are identical. The fact that when the chosen joint action is (3, 4) agent 1 does not know what precisely the fourth action of agent 2 is does not matter. A sketch of this joint-action indexing appears below.

Following Case 2, we can show that we can deal efficiently with a perfect-monitoring setup.

Case 3: Perfect Monitoring. We can reduce the case of perfect monitoring to Case 2 above as follows. First, we make the action-set sizes known to all agents: in the first phase of the game, each agent plays its actions one after the other, eventually returning to its first action. This exposes the number of actions. Next, we establish a common ordering over the agents: each agent randomly selects an action, and the agent that selected an action with a lower index becomes the first agent (ties are broken by repeating the random selection). Thus, we are now back in Case 2.
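To illustrate the index-based representation used in Cases 2 and 3, here is a small sketch (illustrative, not the paper's code) of how agents enumerate joint actions as vectors of per-agent indices and map a deterministic choice back to their own component.

```python
# Illustrative sketch (not from the paper): joint actions as index vectors (Case 2).
# Each agent knows only the *sizes* of the action sets and a common ordering over agents;
# it never needs to know what another agent's j-th action actually is.
from itertools import product

def joint_action_space(action_set_sizes):
    """All joint actions, in a fixed lexicographic order shared by every agent.

    action_set_sizes[k] is the number of actions of the k-th agent in the common ordering.
    """
    return list(product(*(range(size) for size in action_set_sizes)))

def my_component(joint_action, my_position):
    """The index of the action this agent should actually execute."""
    return joint_action[my_position]

# Example: the first agent has 3 actions, the second has 4.
space = joint_action_space([3, 4])   # [(0, 0), (0, 1), ..., (2, 3)] -- the same list for both agents
chosen = space[11]                   # a deterministic learner picks the same index everywhere
# The first agent plays my_component(chosen, 0); the second plays my_component(chosen, 1).
```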

We note that much of the work on learning coordination to date has dealt with Case 3 above, in the more restricted setting of a repeated game.

Case 4: Known action-set sizes. What prevents us from using the approach of Case 2 is the lack of a common ordering over the agents. We can deal with this problem as follows (a code sketch of this scheme appears below). In an initial order-exploration phase, each agent randomly selects an ordering over the agents and acts based on this ordering. This is repeated for m trials. The length of each trial is T, where T is such that R-max will return an actual average reward that is ε-close to optimal after T steps with a probability of at least 1 − δ/2, provided that an identical ordering was chosen by all agents. m is chosen so that, with high probability, in one of the trials all agents will choose the same ordering.

Once the order-exploration phase is done, each agent selects the ordering that led to the best average reward in the order-exploration phase and uses the policy generated for that ordering for Q steps. Q must be large enough to ensure that the losses during the order-exploration phase are compensated for; an explicit form is given below. Hopefully, the chosen ordering is indeed identical. However, it may be that it was not identical but simply led to a good average reward in the order-exploration phase; in the exploitation phase it could then lead to lower average rewards. Therefore, while exploiting this policy, the agent checks after each step that the actual average reward is similar to the average reward obtained by this policy in the order-exploration phase. T is chosen to be large enough to ensure that a policy with ε-return mixing time of T_mix will, with high probability, yield an actual reward that is ε-close to optimal. If the average is not as expected, then with high probability the chosen ordering is not identical for all agents. Once this happens, the agents move to the next-best ordering. Notice that such a change can take place at most m − 1 times. Notice also that whenever we realize, within T steps, that the actual average reward is too low, a loss of at most R_max · T is generated. By choosing Q such that R_max · T / Q < ε / m, we remain in the scope of the desired payoff (i.e., 2ε from optimal); the 1/m factor is needed because this error can happen at most m times.

We now show that Q and m can be chosen so that the running time of this algorithm is still polynomial in the problem parameters, and that the actual average reward will be ε-close to optimal, times 1 − γ, with probability of at least 1 − δ. First, notice that the algorithm can fail either in the order-exploration phase or in the exploitation phases. We will ensure that either failure occurs with a probability of less than δ/2.

Consider the order-exploration phase. We must choose m so that with probability of at least 1 − δ/2 an iteration occurs in which: (a) an identical ordering is generated by all agents; and (b) R-max leads to an ε-close to optimal average reward. In a two-player game, the probability of an identical ordering being generated by both agents is 1/2. With an appropriate, and still polynomial, choice of T, the probability of R-max succeeding given an identical ordering is at least 1 − δ/2. We now choose m so that [1 − (1/2)·(1 − δ/2)]^m < δ/2, which can be satisfied by an m that is polynomial in the problem parameters. Notice that this bounds the failure probability of selecting the right ordering and obtaining the desired payoff, and that the above inequality holds whenever m > log(δ/2) / log(1 − (1/2)·(1 − δ/2)), which is polynomial in the problem parameters.
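The following sketch shows the shape of the order-exploration and exploitation phases for Case 4. It is illustrative only; `run_rmax_trial`, `rewards_while_following`, and the mismatch test are assumed interfaces, not the paper's code.

```python
# Illustrative sketch (not from the paper) of Case 4: random orderings over the agents are
# tried for m trials of length T; the best-performing ordering is then exploited, falling
# back to the next-best candidate whenever the observed average reward is too low.
import random

def case4_coordinate(env, run_rmax_trial, m, T, Q, epsilon, seed=None):
    """run_rmax_trial(env, ordering, T) is assumed to run the Case-2 reduction for T steps
    under the given agent ordering and return (average_reward, policy)."""
    rng = random.Random(seed)
    orderings = [(0, 1), (1, 0)]                     # possible orderings of two agents

    # Order-exploration phase: m independent trials with randomly chosen orderings.
    trials = []
    for _ in range(m):
        ordering = rng.choice(orderings)
        avg, policy = run_rmax_trial(env, ordering, T)
        trials.append((avg, ordering, policy))
    trials.sort(reverse=True, key=lambda x: x[0])    # best observed average reward first

    # Exploitation phase: use the best candidate, switching (at most m - 1 times) on mismatch.
    for expected_avg, ordering, policy in trials:
        total, steps, mismatch = 0.0, 0, False
        for r in env.rewards_while_following(policy, Q):   # assumed: yields one reward per step
            total, steps = total + r, steps + 1
            if steps >= T and total / steps < expected_avg - epsilon:
                mismatch = True                            # ordering was probably not shared
                break
        if not mismatch:
            return ordering, policy                        # exploited Q steps without a mismatch
    return trials[0][1], trials[0][2]                      # fall back to the best candidate
```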

Next, consider the exploitation phases. Given that we succeeded in the exploration phase, at some point in the exploitation phase we will execute an ε-optimal policy for Q iterations. If Q is sufficiently large, we can ensure that this will yield an actual average reward that is ε-close to optimal with a probability larger than 1 − δ/2. Thus, each failure mode has less than δ/2 probability, and overall, the failure probability is less than δ.

Finally, we need to consider what is required to ensure that the average reward of this whole algorithm will be as high as we wanted. We know that during the Q exploitation steps an ε-close to optimal value is attained. However, there can be up to m − 1 repetitions of T-step iterations in which we obtain a lower expected reward. Thus, we must have that Q · (v(M) − ε) / (Q + 2mT) > (1 − γ) · (v(M) − ε). This is satisfied if Q > ((1 − γ)/γ) · 2mT, which is still polynomial in the problem parameters.

Case 5: Shared polynomial bound on action-set sizes. We repeat the technique used in Case 4, but for every chosen ordering over the agents, the agents systematically search the space of possible action-set sizes. That is, let Sizes be some arbitrary total ordering over the set of all possible action-set sizes consistent with the action-set size bounds. The actual choice of Sizes is fixed and part of the agents' algorithm. Now, for each action-set size in Sizes, the agents repeat the exploration phase of Case 4, as if these were the true action-set sizes. Again, the agents stick to the action-set sizes and ordering of agents that led to the best expected payoff, and they switch to the next-best ordering and sizes if the average reward is lower than expected.

Let b be the agents' bound on their action-set sizes. The only difference between this case and Case 4 is that the initial exploration phase must be longer by a factor of b², and that in the exploitation phase we will have to switch at most O(m · b²) policies. In particular, we now need to attempt the possible orderings for each of the b² possibilities for the size of the joint-action set. Thus, the failure probabilities are treated as above, and Q needs to be increased so that Q · (v(M) − ε) / (Q + 2m·b²·T) > (1 − γ) · (v(M) − ε). Q remains polynomial in the problem parameters.

Case 6: Polynomial bound on action-set sizes, unknown mixing time. In this case we reach the setting discussed in Theorem 1 in its full generality. To handle it, we use the following fact: R-max provides a concrete bound on the number of steps required to obtain near-optimal average reward for a given mixing time. Using this bound, we can provide a concrete bound for the running time of all the cases above, and in particular, Case 5. This enables us to use the standard technique for overcoming the problem of unknown mixing time that was used in E^3 (Kearns & Singh, 1998) and in R-max. Namely, the agents run the algorithm for increasing values of the mixing time (a sketch of this schedule is given below). For every fixed mixing time, we operate as in Case 5, but ensuring that we achieve a multiplicative factor of 1 − γ/2 instead of 1 − γ.

Suppose that the correct mixing time is T. When the agents run the algorithm for an assumed mixing time t smaller than T, we expect an average reward lower than optimal, and we need to compensate for these losses when we get to T. These losses are bounded by the optimal value times the number of steps we acted sub-optimally, and this number of steps is polynomial in the problem parameters. Suppose that when T is reached, we are suddenly told that we have reached the true mixing time. At that point, we can try to compensate for our previous losses by continuing to act optimally for some period of time, and the period of time required to compensate for our previous losses is polynomial in the problem parameters. Of course, no one tells us that we have reached the true mixing time. However, whenever we act based on the assumption that the mixing time is greater than T, but still polynomial in T, our actual average reward will remain near optimal. Thus, the stages of the algorithm in which the assumed mixing time is greater than T automatically compensate for the stages in which the assumed mixing time is lower than T. Hence, there exists some T' polynomial in T and the other problem parameters such that, by the time we run Case 5 under the assumption that T' is the mixing time, we will achieve an ε-close to optimal reward with a multiplicative factor of 1 − γ/2. In addition, once the time spent before reaching the true mixing time is a sufficiently small fraction of the total running time, the multiplicative loss due to what happened before we reached the mixing time is at most another 1 − γ/2 factor. Altogether, when we reach the assumed mixing time of T', we get the desired value with a multiplicative factor of 1 − γ.

Having shown how to handle Case 6, we are done with the proof of Theorem 1.
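A minimal sketch of the increasing-mixing-time schedule follows. It is illustrative only: the doubling schedule, the `run_case5` interface, and the `stage_length` bound are assumptions; the paper only requires that the assumed mixing time grow until it passes the true one, with stage lengths taken from R-max's concrete bound.

```python
# Illustrative sketch (not from the paper) of Case 6: run the Case-5 procedure for
# successively larger assumed mixing times. Stages with an assumed mixing time above the
# true one compensate for the losses incurred while the assumption was still too small.

def case6_coordinate(env, run_case5, total_steps, epsilon, delta, gamma):
    """run_case5(env, t_mix, n_steps, epsilon, delta, gamma) is assumed to execute the
    Case-5 algorithm for at most n_steps steps under assumed mixing time t_mix and to
    return the number of steps actually consumed."""
    steps_used = 0
    t_mix = 1                                   # assumed mixing time; 1 suffices for repeated games
    while steps_used < total_steps:
        budget = min(total_steps - steps_used, stage_length(t_mix, epsilon, delta, gamma))
        steps_used += run_case5(env, t_mix, budget, epsilon, delta / 2, gamma / 2)
        t_mix *= 2                              # assumed schedule: double the guess each stage

def stage_length(t_mix, epsilon, delta, gamma):
    # Placeholder for the concrete polynomial bound provided by R-max; its exact form
    # depends on the problem parameters and is not reproduced here.
    return 1000 * t_mix
```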

4. Discussion

We showed that polynomial-time coordination is possible under the very general assumption of imperfect monitoring and a shared polynomial bound on the size of the action sets. To achieve this result, we relied on the existence of a deterministic algorithm for learning near-optimal policies for MDPs in polynomial time. One such algorithm is R-max, and it may be possible to generate a deterministic variant of E^3 (Kearns & Singh, 1998). Our algorithms range from a very simple and extremely efficient approach to learning in perfect-monitoring games, to conceptually simple but computationally more complex algorithms for the most general case covered. Whether these latter algorithms are efficient in practice relative to algorithms based on, e.g., Q-learning, and whether they can form a basis for more practical algorithms, remains an interesting issue for future work which we plan to pursue. Regardless of this question, our algorithms demonstrate clearly that coordination-learning algorithms should strive for more than convergence in the limit to some equilibrium, and that more general classes of CISGs with a more restrictive information structure can be dealt with.

In Section 3, we restricted our attention to two-player games. However, we used a more general language that applies to n-player games, so the algorithms themselves extend naturally. The only issue is the effect on the running time of moving from a constant number of agents to n-player games. First, observe that the size of the game matrices is exponential in the number of agents. This implies that little changes in the analysis of the earlier cases, in which a shared fixed ordering over the set of agents exists. The complexity of these cases increases exponentially in the number of agents, but so does the size of the game, so the algorithm remains polynomial in the input parameters.

If we do not have a shared ordering over the agents, then we need to consider the set of n! possible orderings. Notice, however, that although the number of orderings of the players is exponential in the number of players, using the technique of Case 4 we get that, with overwhelming probability, the players will choose identical orderings after time that is polynomial in the number of such orderings.

More specifically, a Chernoff bound implies that after P(n!) trials, for a polynomial P implied by the bound, the same ordering will, with overwhelming probability, be selected by all agents. Hence, although this number might be large (when n is large), if the number of agents, n, is smaller than the number of actions available, a, this results in a number of iterations that is polynomial in the representation of the game, since the representation of the game requires a^n entries. Naturally, in these cases it will be of interest to consider succinct representations of the stochastic game, but this goes beyond the scope of this paper.

One special class of CISGs is that of common-interest repeated games. This class can be handled in a much simpler manner because there is no need to learn transition functions and to coordinate the transitions. For example, in the most general case (i.e., imperfect monitoring) the following extremely simple, yet efficient, algorithm can be used: the agents play randomly for some T steps, following which they select the first action that provided the best immediate reward. If T is chosen appropriately, we are guaranteed to obtain the optimal value with high probability, and T will be polynomial in the problem parameters. More specifically, if each agent has k available actions, then the probability of choosing an optimal pair of actions in a given step is 1/k². This implies that after k³ trials the optimal joint action will have been chosen with probability approaching 1 − e^(−k). A sketch of this procedure is given below.
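A minimal sketch of this repeated-game procedure follows; it is illustrative only, and the `play_step` environment interface is an assumption.

```python
# Illustrative sketch (not from the paper) of the repeated-game case: play uniformly at
# random for T steps, then commit to the own action from the earliest step that produced
# the best observed (common) reward. With T on the order of k^3, the optimal joint action
# is hit with probability approaching 1 - e^(-k).
import random

def repeated_game_coordinate(play_step, k, T, seed=None):
    """play_step(action) is assumed to execute this agent's action for one iteration and
    return the common payoff observed by the agent (imperfect monitoring)."""
    rng = random.Random(seed)
    best_reward, best_action = float("-inf"), 0
    for _ in range(T):
        action = rng.randrange(k)
        reward = play_step(action)
        if reward > best_reward:       # strict inequality keeps the *first* best step,
            best_reward = reward       # so all agents commit to components of the same
            best_action = action       # joint action (the reward sequence is common)
    return best_action                 # play this action from now on
```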

Acknowledgments

We wish to thank the anonymous referees for their useful comments. This work was done while the second author was at the Department of Computer Science, Stanford University. This research was supported by the Israel Science Foundation under grant #91/02-1. The first author is partially supported by the Paul Ivanier Center for Robotics and Production Management. The second author is partially supported by a DARPA grant.

Appendix A. The R-max Algorithm

Here we describe the MDP version of the R-max algorithm. R-max can handle zero-sum stochastic games as well, but this capability is not required for the coordination algorithms of this paper. For more details, consult Brafman and Tennenholtz (2002).

We consider an MDP M consisting of a set S = {s_1, ..., s_N} of states, in each of which the agent has a set A = {a_1, ..., a_k} of possible actions. A reward function R assigns an immediate reward to each of the agent's actions at every state. We use R(s, a) to denote the reward obtained by the agent after playing action a in state s. Without loss of generality, we assume all rewards are positive. In addition, we have a probabilistic transition function, tr, such that tr(s, a, t) is the probability of making a transition from state s to state t given that the agent played a. It is convenient to think of tr(s, a, ·) as a function associated with the action a in the stage-game s. This way, all model parameters, both rewards and transitions, are associated with an action at a particular state.

Recall that in our setting, each state of the MDP corresponds to a stage-game and each action of the agent corresponds to a joint action of our group of agents. R-max is a model-based algorithm. It maintains its own model M' of the real-world model M. M' is initialized in an optimistic fashion by making the payoffs for every action in every state maximal. What is special about R-max is that the agent always does the best thing according to its current model, i.e., it always exploits. However, if this behavior leads to new information, the model is updated. Because the model is initially inaccurate, exploitation with respect to it may turn out to be exploration with respect to the real model. This is a result of our optimistic initialization of the reward function, which makes state-action pairs that have not been attempted many times look attractive because of the high payoff they yield.

Let ε > 0, and let R_max denote the maximal possible reward, i.e., R_max = max_{s,a} R(s, a). For ease of exposition, we assume that T, the ε-return mixing time of the optimal policy, is known. The algorithm is as follows; a code sketch appears after the description.

Initialize: Construct the following MDP model M', consisting of N + 1 states, {s_0, s_1, ..., s_N}, and k actions, {a_1, ..., a_k}. Here, s_1, ..., s_N correspond to the real states, {a_1, ..., a_k} correspond to the real actions, and s_0 is an additional fictitious state. Initialize the reward function of M' such that R(s, a) = R_max for every state s and action a. Initialize tr(s_i, a, s_0) = 1 for all i = 0, ..., N and for all actions a (and therefore, tr(s_i, a, s_j) = 0 for all j ≠ 0). In addition, maintain the following information for each state-action pair: (1) a boolean value known/unknown, initialized to unknown; (2) the list of states reached by playing this action in this state, and how many times each state was reached; (3) the reward obtained when playing this action in this state. Items 2 and 3 are initially empty.

Repeat:

Compute and Act: Compute an optimal T-step policy for the current state, and execute it for T steps or until some state-action pair changes its value to known.

Observe and Update: Following each action, do as follows. Let a be the action you performed and s the state you were in. If the action a is performed for the first time in s, update the reward associated with (s, a), as observed. Update the set of states reached from s by playing a. If at this point your record of states reached from this entry contains K_1 = max((4NT·R_max/ε)³, 6·ln³(6Nk²/δ)) + 1 elements, mark this state-action pair as known, and update the transition probabilities for this entry according to the observed frequencies.
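The following sketch captures the structure of this MDP version of R-max. It is illustrative code, not the paper's text; the finite-horizon planner `solve_t_step` and the constructor parameters are assumed interfaces.

```python
# Illustrative sketch (not from the paper) of the MDP version of R-max.
# solve_t_step(R, tr, state, T) is assumed to be a deterministic finite-horizon MDP planner
# returning a T-step policy, with ties broken by a fixed ordering over actions.
from collections import defaultdict

class RMax:
    def __init__(self, n_states, n_actions, r_max, T, K1, solve_t_step):
        self.N, self.k, self.T, self.K1 = n_states, n_actions, T, K1
        self.solve = solve_t_step
        self.fictitious = n_states                     # the extra state s_0 (indexed N here)
        # Optimistic model: every action in every state yields R_max and leads to s_0,
        # which is absorbing and also yields R_max under every action.
        self.R = defaultdict(lambda: r_max)            # (s, a) -> modelled reward
        self.tr = defaultdict(lambda: {self.fictitious: 1.0})   # (s, a) -> {next state: prob}
        self.known = set()                             # state-action pairs marked "known"
        self.counts = defaultdict(lambda: defaultdict(int))     # (s, a) -> next state -> count

    def policy_for(self, state):
        return self.solve(self.R, self.tr, state, self.T)       # always exploit the current model

    def observe(self, s, a, reward, next_state):
        if (s, a) not in self.counts:                  # first visit: record the observed reward
            self.R[(s, a)] = reward
        self.counts[(s, a)][next_state] += 1
        total = sum(self.counts[(s, a)].values())
        if (s, a) not in self.known and total >= self.K1:
            self.known.add((s, a))                     # enough samples: switch to empirical model
            self.tr[(s, a)] = {t: c / total for t, c in self.counts[(s, a)].items()}
            return True                                # a pair became known, so the T-step policy
        return False                                   # should be recomputed
```

An outer loop would repeatedly call `policy_for`, execute the returned policy for T steps or until `observe` returns True, and then replan, mirroring the Compute-and-Act / Observe-and-Update cycle above.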

The basic property of R-max is captured by the following theorem:

Theorem 2 Let M be an MDP with N states and k actions. Let 0 < δ < 1 and ε > 0 be constants. Denote the policies for M whose ε-return mixing time is T by Π(ε, T), and denote the optimal expected return achievable by such policies by Opt(Π(ε, T)). Then, with probability of no less than 1 − δ, the R-max algorithm will attain an expected return of Opt(Π(ε, T)) − 2ε within a number of steps polynomial in N, k, T, 1/ε, and 1/δ.

Recall that we rely heavily on the fact that R-max is deterministic for MDPs. This follows from the existence of a deterministic algorithm for generating an optimal T-step policy. Algorithms for generating an optimal T-step policy for an MDP are deterministic, except for the fact that at certain choice points a number of actions may have the same value, and each choice is acceptable. In that case, one need only generate an arbitrary ordering over the actions and always choose, e.g., the first action. This is why the generation of a shared ordering over the actions plays an important role in our algorithm. Note that, without loss of generality, for MDPs we can restrict our attention to deterministic T-step policies. This fact is important in our case because our coordination algorithm would not work if mixed policies were required.

References

Bowling, M., & Veloso, M. (2001). Rational and convergent learning in stochastic games. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-01).

Brafman, R. I., & Tennenholtz, M. (2002). R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3.

Claus, C., & Boutilier, C. (1997). The dynamics of reinforcement learning in cooperative multi-agent systems. In Proceedings of the Workshop on Multi-Agent Learning.

Hoffman, A., & Karp, R. (1966). On nonterminating stochastic games. Management Science, 12(5).

Hu, J., & Wellman, M. (1998). Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML-98).

Kearns, M., & Singh, S. (1998). Near-optimal reinforcement learning in polynomial time. In Proceedings of the International Conference on Machine Learning.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning.

Littman, M. L. (2001). Friend-or-foe Q-learning in general-sum games. In Proceedings of the 18th International Conference on Machine Learning (ICML-01).

Shapley, L. (1953). Stochastic games. Proceedings of the National Academy of Sciences USA, 39.

Wang, X., & Sandholm, T. (2002). Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Proceedings of the 16th Conference on Neural Information Processing Systems (NIPS-02).


ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)

More information

NOTE. A 2 2 Game without the Fictitious Play Property

NOTE. A 2 2 Game without the Fictitious Play Property GAMES AND ECONOMIC BEHAVIOR 14, 144 148 1996 ARTICLE NO. 0045 NOTE A Game without the Fictitious Play Property Dov Monderer and Aner Sela Faculty of Industrial Engineering and Management, The Technion,

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

Gradient descent for symmetric and asymmetric multiagent reinforcement learning

Gradient descent for symmetric and asymmetric multiagent reinforcement learning Web Intelligence and Agent Systems: An international journal 3 (25) 17 3 17 IOS Press Gradient descent for symmetric and asymmetric multiagent reinforcement learning Ville Könönen Neural Networks Research

More information

Practicable Robust Markov Decision Processes

Practicable Robust Markov Decision Processes Practicable Robust Markov Decision Processes Huan Xu Department of Mechanical Engineering National University of Singapore Joint work with Shiau-Hong Lim (IBM), Shie Mannor (Techion), Ofir Mebel (Apple)

More information

Sequential Optimality and Coordination in Multiagent Systems

Sequential Optimality and Coordination in Multiagent Systems Sequential Optimality and Coordination in Multiagent Systems Craig Boutilier Department of Computer Science University of British Columbia Vancouver, B.C., Canada V6T 1Z4 cebly@cs.ubc.ca Abstract Coordination

More information

Learning ε-pareto Efficient Solutions With Minimal Knowledge Requirements Using Satisficing

Learning ε-pareto Efficient Solutions With Minimal Knowledge Requirements Using Satisficing Learning ε-pareto Efficient Solutions With Minimal Knowledge Requirements Using Satisficing Jacob W. Crandall and Michael A. Goodrich Computer Science Department Brigham Young University Provo, UT 84602

More information

Learning to Compete, Compromise, and Cooperate in Repeated General-Sum Games

Learning to Compete, Compromise, and Cooperate in Repeated General-Sum Games Learning to Compete, Compromise, and Cooperate in Repeated General-Sum Games Jacob W. Crandall Michael A. Goodrich Computer Science Department, Brigham Young University, Provo, UT 84602 USA crandall@cs.byu.edu

More information

Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE

Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 55, NO. 9, SEPTEMBER 2010 1987 Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE Abstract

More information

AWESOME: A General Multiagent Learning Algorithm that Converges in Self-Play and Learns a Best Response Against Stationary Opponents

AWESOME: A General Multiagent Learning Algorithm that Converges in Self-Play and Learns a Best Response Against Stationary Opponents AWESOME: A General Multiagent Learning Algorithm that Converges in Self-Play and Learns a Best Response Against Stationary Opponents Vincent Conitzer conitzer@cs.cmu.edu Computer Science Department, Carnegie

More information

The Complexity of Ergodic Mean-payoff Games,

The Complexity of Ergodic Mean-payoff Games, The Complexity of Ergodic Mean-payoff Games, Krishnendu Chatterjee Rasmus Ibsen-Jensen Abstract We study two-player (zero-sum) concurrent mean-payoff games played on a finite-state graph. We focus on the

More information

1 MDP Value Iteration Algorithm

1 MDP Value Iteration Algorithm CS 0. - Active Learning Problem Set Handed out: 4 Jan 009 Due: 9 Jan 009 MDP Value Iteration Algorithm. Implement the value iteration algorithm given in the lecture. That is, solve Bellman s equation using

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

Computation of Efficient Nash Equilibria for experimental economic games

Computation of Efficient Nash Equilibria for experimental economic games International Journal of Mathematics and Soft Computing Vol.5, No.2 (2015), 197-212. ISSN Print : 2249-3328 ISSN Online: 2319-5215 Computation of Efficient Nash Equilibria for experimental economic games

More information

Combining Memory and Landmarks with Predictive State Representations

Combining Memory and Landmarks with Predictive State Representations Combining Memory and Landmarks with Predictive State Representations Michael R. James and Britton Wolfe and Satinder Singh Computer Science and Engineering University of Michigan {mrjames, bdwolfe, baveja}@umich.edu

More information

QUICR-learning for Multi-Agent Coordination

QUICR-learning for Multi-Agent Coordination QUICR-learning for Multi-Agent Coordination Adrian K. Agogino UCSC, NASA Ames Research Center Mailstop 269-3 Moffett Field, CA 94035 adrian@email.arc.nasa.gov Kagan Tumer NASA Ames Research Center Mailstop

More information

Sequential Decision Problems

Sequential Decision Problems Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted

More information

Learning To Cooperate in a Social Dilemma: A Satisficing Approach to Bargaining

Learning To Cooperate in a Social Dilemma: A Satisficing Approach to Bargaining Learning To Cooperate in a Social Dilemma: A Satisficing Approach to Bargaining Jeffrey L. Stimpson & ichael A. Goodrich Computer Science Department, Brigham Young University, Provo, UT 84602 jstim,mike@cs.byu.edu

More information

An Adaptive Clustering Method for Model-free Reinforcement Learning

An Adaptive Clustering Method for Model-free Reinforcement Learning An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at

More information

Self-stabilizing uncoupled dynamics

Self-stabilizing uncoupled dynamics Self-stabilizing uncoupled dynamics Aaron D. Jaggard 1, Neil Lutz 2, Michael Schapira 3, and Rebecca N. Wright 4 1 U.S. Naval Research Laboratory, Washington, DC 20375, USA. aaron.jaggard@nrl.navy.mil

More information

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational

More information

Some notes on Markov Decision Theory

Some notes on Markov Decision Theory Some notes on Markov Decision Theory Nikolaos Laoutaris laoutaris@di.uoa.gr January, 2004 1 Markov Decision Theory[1, 2, 3, 4] provides a methodology for the analysis of probabilistic sequential decision

More information

Notes on Coursera s Game Theory

Notes on Coursera s Game Theory Notes on Coursera s Game Theory Manoel Horta Ribeiro Week 01: Introduction and Overview Game theory is about self interested agents interacting within a specific set of rules. Self-Interested Agents have

More information

Incremental Policy Learning: An Equilibrium Selection Algorithm for Reinforcement Learning Agents with Common Interests

Incremental Policy Learning: An Equilibrium Selection Algorithm for Reinforcement Learning Agents with Common Interests Brigham Young University BYU ScholarsArchive All Faculty Publications 2004-07-01 Incremental Policy Learning: An Equilibrium Selection Algorithm for Reinforcement Learning Agents with Common Interests

More information

Best-Response Multiagent Learning in Non-Stationary Environments

Best-Response Multiagent Learning in Non-Stationary Environments Best-Response Multiagent Learning in Non-Stationary Environments Michael Weinberg Jeffrey S. Rosenschein School of Engineering and Computer Science Heew University Jerusalem, Israel fmwmw,jeffg@cs.huji.ac.il

More information

CS 570: Machine Learning Seminar. Fall 2016

CS 570: Machine Learning Seminar. Fall 2016 CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or

More information

Sequential-Simultaneous Information Elicitation in Multi-Agent Systems

Sequential-Simultaneous Information Elicitation in Multi-Agent Systems Sequential-Simultaneous Information Elicitation in Multi-Agent Systems Gal Bahar and Moshe Tennenholtz Faculty of Industrial Engineering and Management Technion Israel Institute of Technology Haifa 3000,

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo January 29, 2012 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

Online regret in reinforcement learning

Online regret in reinforcement learning University of Leoben, Austria Tübingen, 31 July 2007 Undiscounted online regret I am interested in the difference (in rewards during learning) between an optimal policy and a reinforcement learner: T T

More information

Planning With Information States: A Survey Term Project for cs397sml Spring 2002

Planning With Information States: A Survey Term Project for cs397sml Spring 2002 Planning With Information States: A Survey Term Project for cs397sml Spring 2002 Jason O Kane jokane@uiuc.edu April 18, 2003 1 Introduction Classical planning generally depends on the assumption that the

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

Markov Decision Processes

Markov Decision Processes Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour

More information

Zero-Sum Stochastic Games An algorithmic review

Zero-Sum Stochastic Games An algorithmic review Zero-Sum Stochastic Games An algorithmic review Emmanuel Hyon LIP6/Paris Nanterre with N Yemele and L Perrotin Rosario November 2017 Final Meeting Dygame Dygame Project Amstic Outline 1 Introduction Static

More information

chapter 12 MORE MATRIX ALGEBRA 12.1 Systems of Linear Equations GOALS

chapter 12 MORE MATRIX ALGEBRA 12.1 Systems of Linear Equations GOALS chapter MORE MATRIX ALGEBRA GOALS In Chapter we studied matrix operations and the algebra of sets and logic. We also made note of the strong resemblance of matrix algebra to elementary algebra. The reader

More information

DECENTRALIZED LEARNING IN GENERAL-SUM MATRIX GAMES: AN L R I LAGGING ANCHOR ALGORITHM. Xiaosong Lu and Howard M. Schwartz

DECENTRALIZED LEARNING IN GENERAL-SUM MATRIX GAMES: AN L R I LAGGING ANCHOR ALGORITHM. Xiaosong Lu and Howard M. Schwartz International Journal of Innovative Computing, Information and Control ICIC International c 20 ISSN 349-498 Volume 7, Number, January 20 pp. 0 DECENTRALIZED LEARNING IN GENERAL-SUM MATRIX GAMES: AN L R

More information

Cyber-Awareness and Games of Incomplete Information

Cyber-Awareness and Games of Incomplete Information Cyber-Awareness and Games of Incomplete Information Jeff S Shamma Georgia Institute of Technology ARO/MURI Annual Review August 23 24, 2010 Preview Game theoretic modeling formalisms Main issue: Information

More information

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Reinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it

More information

Gradient Descent for Symmetric and Asymmetric Multiagent Reinforcement Learning

Gradient Descent for Symmetric and Asymmetric Multiagent Reinforcement Learning 1 Gradient Descent for Symmetric and Asymmetric Multiagent Reinforcement Learning Ville Könönen Neural Networks Research Centre Helsinki University of Technology P.O. Box 5400, FI-02015 HUT, Finland Tel:

More information

Quitting games - An Example

Quitting games - An Example Quitting games - An Example E. Solan 1 and N. Vieille 2 January 22, 2001 Abstract Quitting games are n-player sequential games in which, at any stage, each player has the choice between continuing and

More information

Multiagent Learning and Optimality Criteria in Repeated Game Self-play

Multiagent Learning and Optimality Criteria in Repeated Game Self-play Multiagent Learning and Optimality Criteria in Repeated Game Self-play Andriy Burkov burkov@damas.ift.ulaval.ca Brahim Chaib-draa chaib@damas.ift.ulaval.ca DAMAS Laboratory Computer Science Department

More information

Towards a General Theory of Non-Cooperative Computation

Towards a General Theory of Non-Cooperative Computation Towards a General Theory of Non-Cooperative Computation (Extended Abstract) Robert McGrew, Ryan Porter, and Yoav Shoham Stanford University {bmcgrew,rwporter,shoham}@cs.stanford.edu Abstract We generalize

More information

Economics 209A Theory and Application of Non-Cooperative Games (Fall 2013) Extensive games with perfect information OR6and7,FT3,4and11

Economics 209A Theory and Application of Non-Cooperative Games (Fall 2013) Extensive games with perfect information OR6and7,FT3,4and11 Economics 209A Theory and Application of Non-Cooperative Games (Fall 2013) Extensive games with perfect information OR6and7,FT3,4and11 Perfect information A finite extensive game with perfect information

More information

, and rewards and transition matrices as shown below:

, and rewards and transition matrices as shown below: CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount

More information

A Game-Theoretic Approach to Apprenticeship Learning

A Game-Theoretic Approach to Apprenticeship Learning Advances in Neural Information Processing Systems 20, 2008. A Game-Theoretic Approach to Apprenticeship Learning Umar Syed Computer Science Department Princeton University 35 Olden St Princeton, NJ 08540-5233

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Lecture 3: RL problems, sample complexity and regret Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Introduce the

More information

The Reinforcement Learning Problem

The Reinforcement Learning Problem The Reinforcement Learning Problem Slides based on the book Reinforcement Learning by Sutton and Barto Formalizing Reinforcement Learning Formally, the agent and environment interact at each of a sequence

More information