Selecting Efficient Correlated Equilibria Through Distributed Learning
Jason R. Marden


Abstract—A learning rule is completely uncoupled if each player's behavior is conditioned only on his own realized payoffs, and does not need to know the actions or payoffs of anyone else. We demonstrate a simple, completely uncoupled learning rule such that, in any finite normal form game with generic payoffs, the players' realized strategies implement a Pareto optimal coarse correlated (Hannan) equilibrium a very high proportion of the time. A variant of the rule implements correlated equilibrium a very high proportion of the time.

This research was supported by AFOSR grants #FA and #FA and by ONR grant #N. J. R. Marden is with the Department of Electrical, Computer, and Energy Engineering, University of Colorado, Boulder, CO 80309; jason.marden@colorado.edu.

I. INTRODUCTION

This paper builds on a recent literature that seeks to identify learning rules that lead to equilibrium without the usual assumptions of perfect rationality and common knowledge. Of particular interest are learning rules that are simple to implement and require a minimal degree of information about what others in the population are doing. Such rules can be viewed as models of behavior in games with many dispersed agents and very limited observability. They also have practical application to the design of distributed control systems, where the agents can be designed to respond to their environment in ways that lead to desirable system-wide outcomes.

One can distinguish between various classes of learning rules depending on the amount of information they require. A rule is uncoupled if it does not require any knowledge of the payoffs of the other players [1]. A rule is completely uncoupled if it does not require any knowledge of the actions or payoffs of the other players [2]. The latter paper identifies a family of completely uncoupled learning rules that come close to Nash equilibrium (pure or mixed) with high probability in two-person normal form games with generic payoffs. Subsequently, [3] showed that similar results hold for n-person normal form games with generic payoffs. Lastly, [4] exhibited a much simpler class of completely uncoupled rules that lead to Nash equilibrium in weakly acyclic games. These learning algorithms all share the feature that agents occasionally experiment with new strategies, which they adopt if they lead to higher realized payoffs. In [5], this approach was further developed by making an agent's search behavior dependent on his mood (an internal state variable). Changes in mood are triggered by changes in realized payoffs relative to the agent's current aspiration level. Rules of this nature can be designed to select pure Nash equilibria in any normal form game with generic payoffs that has at least one pure Nash equilibrium. Moreover, the rule can be designed so that it selects a Pareto optimal pure Nash equilibrium [6], or even a Pareto optimal action profile irrespective of whether this action profile is a pure Nash equilibrium [7].

There is a quite different class of learning dynamics that leads to coarse correlated equilibrium (alternatively, correlated equilibrium). These rules are based on the concept of no-regret. They can be formulated so that they depend only on a player's own realized payoffs, that is, they are completely uncoupled [8]–[10]. However, while the resulting dynamics converge almost surely to the set of correlated equilibria, they do not necessarily converge to, or even approximate, correlated equilibrium behavior at any given point in time.

The contribution of this paper is to demonstrate a class of completely uncoupled learning rules that bridges these two approaches. In overall structure the rules are similar to the learning dynamics introduced in [5]–[7]. Like the no-regret rules, our approach selects (coarse) correlated equilibria instead of Nash equilibria. Unlike no-regret learning, however, our rule leads to equilibrium in the sense that the players' strategies actually constitute a coarse correlated equilibrium a high proportion of the time. In fact, as a bonus, they constitute a Pareto optimal coarse correlated equilibrium a high proportion of the time.

It is important to highlight that there have been great strides in developing polynomial-time algorithms for computing (coarse) correlated equilibria, e.g., [11]–[14]. The starting point of these algorithms is a complete representation of the game. Unfortunately, the applicability of these algorithms to the design of distributed control systems is limited, as such representations are typically not available. Hence, the focus of this paper is on identifying distributed algorithms by which agents can learn to play an efficient (coarse) correlated equilibrium under less stringent informational demands.

II. PRELIMINARIES

Let $G$ be a finite strategic-form game with $n$ agents. The set of agents is denoted by $N := \{1, \dots, n\}$. Each agent $i \in N$ has a finite action set $A_i$ and a utility function $U_i : A \to \mathbb{R}$, where $A = A_1 \times \cdots \times A_n$ denotes the joint action set. We shall henceforth refer to a finite strategic-form game simply as a game.

For any joint distribution $q = \{q_a\}_{a \in A} \in \Delta(A)$, where $\Delta(A)$ denotes the simplex over the joint action set $A$, we extend the definition of an agent's utility function in the usual fashion:
$$U_i(q) = \sum_{a \in A} U_i(a)\, q_a.$$
The set of coarse correlated equilibria can then be characterized as the set of joint distributions
$$\mathrm{CCE} = \Big\{ q \in \Delta(A) \,:\, \sum_{a \in A} U_i(a)\, q_a \;\ge\; \sum_{a \in A} U_i(a_i', a_{-i})\, q_a, \ \ \forall\, i \in N,\ \forall\, a_i' \in A_i \Big\},$$
which is by definition non-empty. In this paper we focus on the derivation of learning rules that provide convergence to an efficient coarse correlated equilibrium of the form
$$q^* \in \arg\max_{q \in \mathrm{CCE}} \sum_{i \in N} U_i(q).$$
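The CCE condition above is a finite system of linear inequalities, so membership of a given joint distribution can be checked directly. The following minimal sketch is our own illustration (not part of the paper); the data layout — a dict of joint-action probabilities and a list of per-agent utility functions — is an assumption made for the example.

```python
def is_coarse_correlated_eq(q, utilities, action_sets, tol=1e-9):
    """Check the CCE inequalities for a joint distribution q over A = A_1 x ... x A_n.

    q           : dict mapping joint-action tuples a to probabilities q_a
    utilities   : list of functions, utilities[i](a) -> payoff U_i(a)
    action_sets : list of per-agent action sets A_i
    """
    for i, A_i in enumerate(action_sets):
        # Expected payoff to agent i under the joint distribution q.
        u_q = sum(p * utilities[i](a) for a, p in q.items())
        # No fixed unilateral deviation a_i' may improve on u_q.
        for a_dev in A_i:
            u_dev = sum(p * utilities[i](a[:i] + (a_dev,) + a[i + 1:])
                        for a, p in q.items())
            if u_dev > u_q + tol:
                return False
    return True

# Usage example: in Matching Pennies the uniform joint distribution is a CCE.
A = [(0, 1), (0, 1)]
U = [lambda a: 1.0 if a[0] == a[1] else 0.0,   # row player wants to match
     lambda a: 0.0 if a[0] == a[1] else 1.0]   # column player wants to mismatch
uniform = {(r, c): 0.25 for r in A[0] for c in A[1]}
assert is_coarse_correlated_eq(uniform, U, A)
```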

To that end, we consider the framework of a repeated one-shot game, where the game $G$ is repeated once each period $t \in \{0, 1, 2, \dots\}$. In period $t$, the agents simultaneously choose actions $a(t) = (a_1(t), \dots, a_n(t))$ and receive payoffs $U_i(a(t))$. Agent $i \in N$ chooses the action $a_i(t)$ according to a probability distribution $p_i(t) \in \Delta(A_i)$, which we refer to as the strategy of agent $i$ at time $t$. We adopt the convention that $p_i^{a_i}(t)$ is the probability that agent $i$ selects action $a_i$ at time $t$ according to the strategy $p_i(t)$. An agent's strategy at time $t$ can rely only on observations from the one-shot games played at times $\{0, 1, 2, \dots, t-1\}$. Different learning algorithms are specified by the agents' available information and the mechanism by which their strategies are updated as information is gathered. Here, we focus on one of the most informationally restrictive classes of learning rules, termed completely uncoupled or payoff-based, where each agent only has access to (i) the action he played and (ii) the payoff he received. More formally, the strategy adjustment mechanism of agent $i$ takes the form
$$p_i(t) = F_i\big( \{a_i(\tau), U_i(a(\tau))\}_{\tau = 0, \dots, t-1} \big). \tag{1}$$

Recent work has shown that for finite games with generic payoffs, there exist completely uncoupled learning rules that lead to Pareto optimal Nash equilibria [6], and also to Pareto optimal action profiles irrespective of whether or not they are pure Nash equilibria [7]; see also [5], [15], [16]. Here, we exhibit a different class of learning procedures that lead to efficient coarse correlated equilibria.

III. ALGORITHM DESCRIPTION

We now introduce a payoff-based learning algorithm which ensures that the agents' collective behavior constitutes, with high probability, a coarse correlated equilibrium that maximizes the sum of the players' average payoffs. In the forthcoming algorithm, each agent commits to playing a sequence of actions, as opposed to just a single action, when faced with a decision. More specifically, the set of possible action sequences for agent $i$ is represented by the set $\bar{\mathcal{A}}_i = \bigcup_{k=1,\dots,w} A_i^k$, where $w$ represents the maximum length of a sequence of actions that any agent will play and $A_i^k$ denotes the set of all action sequences of length $k$ for agent $i$. Accordingly, if agent $i$ commits to playing a sequence of actions $\bar a_i \in \bar{\mathcal{A}}_i$ of length $l_i = |\bar a_i| \le w$ at time $t$, then the resulting sequence of play for agent $i$ is
$$a_i(t) = \bar a_i(1),\quad a_i(t+1) = \bar a_i(2),\quad \dots,\quad a_i(t + l_i - 1) = \bar a_i(l_i).$$

The following algorithm follows the theme of [5], where an agent's search behavior depends on his mood (an internal state variable). Changes in mood are triggered by changes in realized payoffs relative to the agent's current aspiration level.
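For concreteness, here is a minimal sketch (our own illustration, not from the paper) of how the sequence set $\bar{\mathcal{A}}_i$ could be enumerated and how a committed sequence unrolls into play; the function names are ours.

```python
from itertools import product

def sequence_set(A_i, w):
    """Enumerate the action-sequence set of agent i: all sequences over A_i of length 1..w."""
    sequences = []
    for k in range(1, w + 1):
        sequences.extend(product(A_i, repeat=k))  # A_i^k
    return sequences

def rollout(sequence, t):
    """Actions played at times t, t+1, ..., t+len(sequence)-1 when committing to `sequence` at time t."""
    return {t + offset: action for offset, action in enumerate(sequence)}

# Example: with A_i = {0, 1} and w = 2, the sequence set has 2 + 4 = 6 elements.
assert len(sequence_set([0, 1], 2)) == 6
```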

In the following, we provide an informal description of the forthcoming algorithm. We divide the algorithm into two parts, agent dynamics and state dynamics, for a more fluid presentation.

Agent Dynamics: At any given time $t > 0$, the specific action that agent $i$ plays, i.e., $a_i(t)$, is determined solely by the agent's local state variable, which we represent by $x_i(t)$. The details of this state variable are described in the ensuing section. Since the agents commit to playing action sequences, most of the time the specific action played is merely the next component of the committed action sequence. At the end of an action sequence, each agent has the opportunity to revise his strategy and select a new action sequence. Here, each agent has an internal mood (content, discontent, hopeful, or watchful) which governs this process in the following way. First, each agent has a baseline action sequence and a baseline utility. Roughly speaking, each agent presumes that the average utility attained by playing the baseline action sequence will be the baseline utility; when this is true, we say that the baseline action sequence and baseline utility are aligned. When an agent is content, the agent selects his baseline action sequence with high probability, and occasionally experiments with a constant action sequence of the same length as his baseline action sequence. When an agent is discontent, the agent selects an action sequence of arbitrary length at random. When an agent is hopeful or watchful, the agent repeats his baseline action sequence with certainty. Hopeful and watchful are intermediate states that are triggered when the realized average utility does not match the baseline utility; the agent enters an intermediate mode where he waits for a better observation before overreacting.

State Dynamics: At any given time, the state of each agent $i$, i.e., $x_i(t+1)$, is updated using only the previous state $x_i(t)$, the decision of agent $i$ at time $t$, i.e., $a_i(t)$, and the utility of agent $i$ at time $t$, i.e., $U_i(a(t))$. As with the agent dynamics, the key state components change only when an agent has completed a given action sequence. The key component of the state dynamics is how each agent's mood changes as a function of (i) the baseline utility and (ii) the average utility received over the previously played action sequence. Roughly speaking, the process can be described as follows. A player switches from content to discontent for sure if his average utility is below his baseline utility for several periods in a row and he was not experimenting; he may also switch spontaneously from content to discontent with a very small probability even if this is not the case. A player switches from discontent to content with a probability that is an increasing function of his current average payoff, in which case he takes the previous action sequence and its realized average payoff as his new baseline. The details associated with the intermediate states hopeful and watchful are spelled out later; their role will become clear when we give the learning rule in detail.
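As a rough illustration of the content/discontent switching just described, the following sketch (our own; the function names are not from the paper) mirrors the exploration-rate exponents used in the formal rules below: a content agent whose baseline is confirmed becomes discontent spontaneously with probability $\epsilon^{2c}$, while a discontent agent accepts its current sequence and average payoff with probability $\epsilon^{1-u}$, which is increasing in the payoff $u$.

```python
import random

def spontaneous_discontent(epsilon, c):
    """Content agent with confirmed baseline: become discontent with (small) probability epsilon**(2*c)."""
    return random.random() < epsilon ** (2 * c)

def accept_from_discontent(epsilon, avg_payoff):
    """Discontent agent: become content with probability epsilon**(1 - avg_payoff), payoffs in [0, 1)."""
    return random.random() < epsilon ** (1.0 - avg_payoff)
```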

A. Notation

At each point in time, the action of agent $i \in N$ can be represented by the tuple $[\bar a_i, a_i]$, where agent $i$'s action sequence is $\bar a_i \in \bar{\mathcal{A}}_i$ and agent $i$'s current action is $a_i \in \bar a_i$. At each point in time, an agent's state can be represented by the tuple
$$x_i = \big[\, \bar a_i,\ u_i,\ k_i,\ \bar a_i^b,\ u_i^b,\ m_i,\ c_i^{H/W},\ L_i^{H/W} \,\big],$$
where
- $\bar a_i$: trial sequence of actions;
- $u_i$: payoff over the trial sequence of actions;
- $k_i$: element of the trial sequence of actions currently being played;
- $\bar a_i^b$: baseline sequence of actions;
- $u_i^b$: payoff over the baseline sequence of actions;
- $m_i$: mood (content, discontent, hopeful, or watchful);
- $c_i^{H/W}$: counter for the number of times the hopeful/watchful period has been repeated;
- $L_i^{H/W}$: number of times the hopeful/watchful period will be repeated.

The first three components of the state, $\{\bar a_i, u_i, k_i\}$, correspond to the action sequence that is currently being played by agent $i$. The action sequence is represented by $\bar a_i \in \bar{\mathcal{A}}_i$. The counter $k_i \in \{1, \dots, |\bar a_i|\}$ keeps track of which component of $\bar a_i$ the agent should play next. Lastly, the payoff $u_i$ represents the average utility received over the first $(k_i - 1)$ iterations of the action sequence $\bar a_i$.

The fourth and fifth components of the state, $\{\bar a_i^b, u_i^b\}$, correspond to the baseline action sequence and baseline payoff. The baseline action sequence is represented by $\bar a_i^b \in \bar{\mathcal{A}}_i$, and the baseline payoff $u_i^b$ captures the average utility received for the baseline sequence of actions. The baseline payoff is used as a gauge to determine whether experimentation with alternative action sequences is advantageous.

The sixth component of the state is the mood $m_i$, which can take on four values: content (C), discontent (D), hopeful (H), and watchful (W). Each mood leads to a different type of behavior, as will be discussed in detail.

The seventh and eighth components of the state, $\{c_i^{H/W}, L_i^{H/W}\}$, are counters on the number of times that either a hopeful or watchful mood has been repeated. The number $L_i^{H/W} \in \{0\} \cup \{w+1, \dots, w^n + w\}$ prescribes the number of times that the intermediate state (hopeful or watchful) should be repeated, and the number $c_i^{H/W} \in \{0, 1, 2, \dots, w^n + w\}$ records the number of times that the intermediate state has already been repeated. Accordingly, $c_i^{H/W} \le L_i^{H/W}$. In the case when the mood is not hopeful or watchful, we adopt the convention that $c_i^{H/W} = L_i^{H/W} = 0$.
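The state tuple above translates directly into a record type. The following sketch (our own; the field names are ours) is only meant to fix the bookkeeping used in the formal description that follows.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class AgentState:
    trial_seq: Tuple[int, ...]     # current trial sequence of actions (bar a_i)
    trial_payoff: float            # average payoff over the first k_i - 1 steps of the trial sequence (u_i)
    k: int                         # index of the component of the trial sequence to play next (k_i)
    baseline_seq: Tuple[int, ...]  # baseline sequence of actions (bar a_i^b)
    baseline_payoff: float         # baseline payoff (u_i^b)
    mood: str                      # one of "C", "D", "H", "W"
    hw_counter: int = 0            # c_i^{H/W}: repetitions of the hopeful/watchful phase so far
    hw_length: int = 0             # L_i^{H/W}: prescribed number of repetitions of the phase
```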

B. Formal Algorithm Description

We divide the dynamics into two parts: the agent dynamics and the state dynamics. Without loss of generality, we focus on the case where agent utility functions are strictly bounded between 0 and 1, i.e., for any agent $i \in N$ and action profile $a \in A$ we have $0 \le U_i(a) < 1$. Lastly, we define a constant $c > n$ which will be utilized in the following algorithm.

Agent Dynamics: Fix an experimentation rate $\epsilon > 0$. The dynamics for agent $i$ rely only on the state of agent $i$ at the given time. Let $x_i(t) = [\bar a_i, u_i, k_i, \bar a_i^b, u_i^b, m_i, c_i^{H/W}, L_i^{H/W}]$ be the state of agent $i$ at time $t$. Each agent only has the opportunity to change strategies at the beginning of a planning window. Accordingly, if $k_i > 1$ then
$$\bar a_i(t) = \bar a_i, \tag{2}$$
$$a_i(t) = \bar a_i(k_i), \tag{3}$$
where $\bar a_i(k_i)$ denotes the $k_i$-th component of the vector $\bar a_i$. If $k_i = 1$, then the player makes a decision based on his underlying mood:

Content ($m_i = C$): In this state, the agent chooses a sequence of actions $\bar a_i' \in \bar{\mathcal{A}}_i$ according to the probability distribution
$$\Pr\big[\bar a_i(t) = \bar a_i'\big] = \begin{cases} 1 - \epsilon^c & \text{for } \bar a_i' = \bar a_i^b,\\ \epsilon^c / |A_i| & \text{for any } \bar a_i' = (a_i, \dots, a_i) \in A_i^{|\bar a_i^b|} \text{ with } a_i \in A_i, \end{cases} \tag{4}$$
where $|A_i|$ represents the cardinality of the set $A_i$. The action is then chosen as $a_i(t) = \bar a_i(1; t)$, where $\bar a_i(1; t)$ denotes the first component of the vector $\bar a_i(t)$.¹

¹ We could consider variations of the deviations in order to stabilize alternative equilibria, e.g., correlated equilibria. In particular, if (4) focused on conditional deviations as opposed to unconditional deviations, then the forthcoming dynamics would stabilize efficient correlated equilibria as opposed to efficient coarse correlated equilibria.
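A minimal sketch of the content-state selection rule (4) (our own code, under the assumption that the baseline sequence is stored as a Python tuple): with probability $1 - \epsilon^c$ the baseline sequence is replayed, and with probability $\epsilon^c$ a uniformly chosen constant sequence of the same length is tried.

```python
import random

def choose_sequence_when_content(baseline_seq, A_i, epsilon, c):
    """Content-state selection: replay the baseline sequence w.p. 1 - epsilon**c,
    otherwise experiment with a constant sequence (a_i, ..., a_i) of the same length."""
    if random.random() < 1.0 - epsilon ** c:
        return tuple(baseline_seq)
    a = random.choice(list(A_i))  # each constant sequence has probability epsilon**c / |A_i|
    return tuple(a for _ in range(len(baseline_seq)))
```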

Discontent ($m_i = D$): In this state, the agent chooses a sequence of actions $\bar a_i'$ according to the probability distribution
$$\Pr\big[\bar a_i(t) = \bar a_i'\big] = \frac{1}{|\bar{\mathcal{A}}_i|} \quad \text{for every } \bar a_i' \in \bar{\mathcal{A}}_i. \tag{5}$$
Note that the baseline action sequence and baseline utility play no role in the agent dynamics when the agent is discontent. The action is then chosen as $a_i(t) = \bar a_i(1; t)$.

Hopeful ($m_i = H$) or Watchful ($m_i = W$): In either of these states, the agent selects his trial action sequence, i.e.,
$$\bar a_i(t) = \bar a_i, \tag{6}$$
$$a_i(t) = \bar a_i(1; t). \tag{7}$$

Note that the first component of the state vector corresponds to the current trial action sequence; hence, the agent dynamics update only this component of the state vector.

State Dynamics: The majority of the state components change only at the end of a sequence of actions. Let $x_i(t) = [\bar a_i, u_i, k_i, \bar a_i^b, u_i^b, m_i, c_i^{H/W}, L_i^{H/W}]$ be the state of agent $i$ at time $t$, let $\bar a_i = \bar a_i(t)$ be the action sequence played at time $t$, let $a_i(t) = \bar a_i(k_i)$ be the action that agent $i$ played at time $t$, and let $U_i(a(t))$ be the utility player $i$ received at time $t$. If $k_i < |\bar a_i|$, then
$$x_i(t+1) = \Big[\, \bar a_i,\ \tfrac{k_i - 1}{k_i}\, u_i + \tfrac{1}{k_i}\, U_i(a(t)),\ k_i + 1,\ \bar a_i^b,\ u_i^b,\ m_i,\ c_i^{H/W},\ L_i^{H/W} \,\Big].$$
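The mid-sequence update above is simply a running average plus a pointer increment; the following self-contained sketch (our own, with illustrative variable names) shows the computation.

```python
def mid_sequence_update(trial_payoff, k, realized_payoff):
    """Mid-sequence update (k_i < |bar a_i|): fold U_i(a(t)) into the running average and advance k_i."""
    new_trial_payoff = (k - 1) / k * trial_payoff + realized_payoff / k
    return new_trial_payoff, k + 1

# Example: after two steps with payoffs 0.4 and 0.8 the running average is 0.6.
u, k = mid_sequence_update(0.0, 1, 0.4)
u, k = mid_sequence_update(u, k, 0.8)
assert abs(u - 0.6) < 1e-12 and k == 3
```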

Otherwise, if $k_i = |\bar a_i|$, then the state is updated according to the underlying mood as follows. For shorthand notation, we define the running average of the payoff over the trial action sequence as
$$u_i(t) = \frac{k_i - 1}{k_i}\, u_i + \frac{1}{k_i}\, U_i(a(t)).$$

Content ($m_i = C$): If $[\bar a_i, u_i(t)] = [\bar a_i^b, u_i^b]$, the state of agent $i$ is updated as
$$x_i(t+1) = \begin{cases} \big[\bar a_i^b, u_i^b, 1, \bar a_i^b, u_i^b, C, 0, 0\big] & \text{with probability } 1 - \epsilon^{2c},\\ \big[\bar a_i^b, u_i^b, 1, \bar a_i^b, u_i^b, D, 0, 0\big] & \text{with probability } \epsilon^{2c}. \end{cases} \tag{8}$$
If $\bar a_i \neq \bar a_i^b$, the state of agent $i$ is updated as
$$x_i(t+1) = \begin{cases} \big[\bar a_i, u_i(t), 1, \bar a_i, u_i(t), C, 0, 0\big] & \text{if } u_i(t) > u_i^b,\\ \big[\bar a_i^b, u_i^b, 1, \bar a_i^b, u_i^b, C, 0, 0\big] & \text{if } u_i(t) \le u_i^b. \end{cases}$$
If $\bar a_i = \bar a_i^b$ but $u_i(t) \neq u_i^b$, the state of agent $i$ is updated as
$$x_i(t+1) = \begin{cases} \big[\bar a_i^b, u_i^b, 1, \bar a_i^b, u_i^b, H, 1, L_i^H\big] & \text{if } u_i(t) > u_i^b,\\ \big[\bar a_i^b, u_i^b, 1, \bar a_i^b, u_i^b, W, 1, L_i^W\big] & \text{if } u_i(t) < u_i^b, \end{cases}$$
where $L_i^H$ (or $L_i^W$) is selected uniformly at random from the set $\{w+1, \dots, w^n + w\}$.²

² The need for this repetition arises from the fact that the agents could be playing action sequences of distinct lengths. The purpose of this repetition will become clearer during the proof.

Discontent ($m_i = D$): The new state is determined by the transition
$$x_i(t+1) = \begin{cases} \big[\bar a_i, u_i(t), 1, \bar a_i, u_i(t), C, 0, 0\big] & \text{with probability } \epsilon^{1 - u_i(t)},\\ \big[\bar a_i, u_i(t), 1, \bar a_i, u_i(t), D, 0, 0\big] & \text{with probability } 1 - \epsilon^{1 - u_i(t)}. \end{cases}$$

Hopeful ($m_i = H$): First, it is important to highlight that if the mood of any player $i \in N$ is hopeful then $\bar a_i = \bar a_i^b$. The new state is determined as follows. If $c_i^H < L_i^H$, then
$$x_i(t+1) = \big[\bar a_i, u_i(t), 1, \bar a_i, u_i^b, H, c_i^H + 1, L_i^H\big].$$
If $c_i^H = L_i^H$ and $u_i(t) \ge u_i^b$, then
$$x_i(t+1) = \big[\bar a_i, u_i(t), 1, \bar a_i, u_i(t), C, 0, 0\big].$$
If $c_i^H = L_i^H$ and $u_i(t) < u_i^b$, then
$$x_i(t+1) = \big[\bar a_i, u_i(t), 1, \bar a_i, u_i^b, W, 1, L_i^W\big],$$
where $L_i^W$ is selected uniformly at random from the set $\{w+1, \dots, w^n + w\}$.

Watchful ($m_i = W$): First, it is important to highlight that if the mood of any player $i \in N$ is watchful then $\bar a_i = \bar a_i^b$. The new state is determined as follows. If $c_i^W < L_i^W$, then
$$x_i(t+1) = \big[\bar a_i^b, u_i(t), 1, \bar a_i^b, u_i^b, W, c_i^W + 1, L_i^W\big].$$
If $c_i^W = L_i^W$ and $u_i(t) < u_i^b$, then
$$x_i(t+1) = \big[\bar a_i^b, u_i(t), 1, \bar a_i^b, u_i(t), D, 0, 0\big].$$
If $c_i^W = L_i^W$ and $u_i(t) \ge u_i^b$, then
$$x_i(t+1) = \big[\bar a_i^b, u_i(t), 1, \bar a_i^b, u_i^b, H, 1, L_i^H\big],$$
where $L_i^H$ is selected uniformly at random from the set $\{w+1, \dots, w^n + w\}$.

IV. MAIN RESULT

Before stating the main result we introduce a bit of notation. Let $X = \prod_{i \in N} X_i$ denote the full set of states of the players, where $X_i$ is the set of possible states for player $i$. For a given state $x = (x_1, \dots, x_n)$ with $x_i = [\bar a_i, u_i, k_i, \bar a_i^b, u_i^b, m_i, c_i^{H/W}, L_i^{H/W}]$, define the ensuing sequence of baseline actions as follows: for every $k \in \{0, 1, 2, \dots\}$ and agent $i \in N$ we have
$$a_i(k \mid x_i) = \bar a_i^b(k + k_i),$$
where we write $\bar a_i^b(k + k_i)$ even in the case when $k + k_i > |\bar a_i^b|$, with the understanding that this refers to the component $\big((k + k_i - 1) \bmod |\bar a_i^b|\big) + 1$. We express the sequence of joint action profiles by $a(k \mid x) = (a_1(k \mid x_1), \dots, a_n(k \mid x_n))$. Define the average payoff over the forthcoming periods (provided that all players play according to their baseline actions) for any player $i \in N$ and period $l \in \{1, 2, \dots\}$ as
$$u_i(0 \mid x) = \frac{k_i - 1}{|\bar a_i|}\, u_i + \frac{1}{|\bar a_i|} \sum_{k=0}^{|\bar a_i| - k_i} U_i(a(k \mid x)), \tag{9}$$
$$u_i(l \mid x) = \frac{1}{|\bar a_i|} \sum_{k = l|\bar a_i| - k_i + 1}^{(l+1)|\bar a_i| - k_i} U_i(a(k \mid x)). \tag{10}$$

We will characterize the above dynamics by analyzing the empirical distribution of the joint actions. To that end, define the empirical distribution of the joint actions associated with the baseline sequences of actions for a given state $x$ by $q(x) = \{q_a(x)\}_{a \in A} \in \Delta(A)$, where
$$q_a(x) = \lim_{t \to \infty} \frac{\sum_{\tau=0}^{t} \mathbb{I}\{a = a(\tau \mid x)\}}{t + 1} \tag{11}$$
$$= \frac{\sum_{\tau=1}^{\prod_{i \in N} |\bar a_i|} \mathbb{I}\{a = a(\tau \mid x)\}}{\prod_{i \in N} |\bar a_i|}, \tag{12}$$
where $\mathbb{I}\{\cdot\}$ represents the usual indicator function and the equality derives from the fact that players are repeating finite sequences of actions, which ensures that for any $k \in \{0, 1, \dots\}$ we have
$$a(k \mid x) = a\Big(k + \prod_{i \in N} |\bar a_i| \;\Big|\; x\Big). \tag{13}$$
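Because every agent repeats its baseline sequence, the limit in (11) reduces to a single pass over $\prod_{i \in N} |\bar a_i|$ joint steps, as in (12). The following sketch is our own illustration, assuming each baseline sequence is given as a tuple and that the pointers start at $k_i = 1$.

```python
from collections import Counter
from math import prod

def empirical_distribution(baseline_seqs):
    """Empirical distribution q(x) of joint actions under cyclic repetition of the baseline sequences.

    baseline_seqs: list of per-agent action sequences (tuples), played cyclically and in lockstep.
    """
    cycle = prod(len(seq) for seq in baseline_seqs)  # joint play repeats after this many steps
    counts = Counter(
        tuple(seq[t % len(seq)] for seq in baseline_seqs) for t in range(cycle)
    )
    return {a: c / cycle for a, c in counts.items()}

# Example: two agents with baseline sequences (0, 1) and (0, 0, 1).
q = empirical_distribution([(0, 1), (0, 0, 1)])
assert abs(sum(q.values()) - 1.0) < 1e-12
```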

Define the set of states which induce coarse correlated equilibria through repeated play of the baseline sequences of actions as
$$X_{\mathrm{CCE}} := \{x \in X : q(x) \in \mathrm{CCE}\}.$$
Lastly, define the set of states $X^*$ which induce coarse correlated equilibria and are aligned, i.e.,
$$X^* = \big\{x \in X : x \in X_{\mathrm{CCE}},\ u_i(0 \mid x) = u_i(k \mid x)\ \ \forall\, i \in N,\ k \in \{1, 2, \dots\}\big\}.$$
Note that in general the set $X^*$ may be empty. In fact, a sufficient condition for $X^*$ to be non-empty is
$$\Big\{ q \in \Delta(A) : q_a \in \bigcup_{k=1,\dots,w} \big\{0, \tfrac{1}{k}, \dots, \tfrac{k-1}{k}, 1\big\} \ \text{for all } a \in A \Big\} \cap \mathrm{CCE} \neq \emptyset.$$

The process described above can be characterized as a finite Markov chain parameterized by an exploration rate $\epsilon > 0$. The following theorem characterizes the support of the limiting stationary distribution, whose elements are referred to as the stochastically stable states [17]. More precisely, a state $x \in X$ is stochastically stable if and only if $\lim_{\epsilon \to 0^+} \mu(x, \epsilon) > 0$, where $\mu(x, \epsilon)$ is a stationary distribution of the process $P^\epsilon$ for a fixed $\epsilon > 0$. Our characterization requires a mild degree of genericity in the agents' utility functions, which is summarized by the following notion of interdependence, introduced in [5].

Definition 1 (Interdependence). An $n$-person game $G$ on the finite action space $A$ is interdependent if, for every $a \in A$ and every proper subset of agents $J \subset N$, there exists an agent $i \notin J$ and a choice of actions $a_J' \in \prod_{j \in J} A_j$ such that $U_i(a_J', a_{-J}) \neq U_i(a_J, a_{-J})$.
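Definition 1 can be verified by brute force on small games. The following sketch is our own illustration (exponential in the number of agents, intended only for toy examples); the data layout is the same as in the earlier CCE sketch.

```python
from itertools import combinations, product

def is_interdependent(action_sets, utilities, tol=1e-12):
    """Brute-force check of Definition 1 (interdependence). For toy games only.

    action_sets: list of per-agent action sets A_i
    utilities  : list of functions, utilities[i](a) -> U_i(a)
    """
    n = len(action_sets)
    agents = list(range(n))
    proper_subsets = [J for r in range(1, n) for J in combinations(agents, r)]
    for a in product(*action_sets):
        for J in proper_subsets:
            # Look for an agent i outside J whose payoff changes for some choice of actions by J.
            affected = False
            for i in set(agents) - set(J):
                for a_J in product(*(action_sets[j] for j in J)):
                    b = list(a)
                    for j, aj in zip(J, a_J):
                        b[j] = aj
                    if abs(utilities[i](tuple(b)) - utilities[i](a)) > tol:
                        affected = True
                        break
                if affected:
                    break
            if not affected:
                return False  # no agent outside J is influenced by J at profile a
    return True
```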

Theorem 1. Let $G$ be a finite interdependent game and suppose all players follow the above dynamics. If $X^* \neq \emptyset$, then a state $x \in X$ is stochastically stable if and only if $x \in X^*$ and
$$\sum_{i \in N} u_i(0 \mid x) = \max_{x' \in X^*} \sum_{i \in N} u_i(0 \mid x').$$
If $X^* = \emptyset$, then a state $x \in X$ is stochastically stable if and only if
$$\sum_{i \in N} u_i(0 \mid x) = \max_{a \in A} \sum_{i \in N} U_i(a).$$

This theorem demonstrates that, as the exploration rate $\epsilon \to 0^+$, the process spends most of the time at the efficient coarse correlated equilibrium provided that the (discretized) set of coarse correlated equilibria is nonempty. If this set is empty, then the process spends most of the time at the action profile which maximizes the sum of the agents' payoffs. We prove this theorem using the theory of resistance trees for regular perturbed processes developed in [18]. We provide a brief review of the theory of resistance trees in the Appendix; for a detailed review, we direct the reader to [18].

V. PROOF OF THEOREM 1

Let $X_i$ denote the set of admissible states for agent $i$. The above dynamics induce a Markov process over the finite state space $X = \prod_{i \in N} X_i$. We shall denote the transition probability matrix by $P^\epsilon$ for each $\epsilon > 0$. Computing the stationary distribution of this process is challenging because of the large number of states and the fact that the underlying process is not reversible. Accordingly, we focus on characterizing the support of the limiting stationary distribution, whose elements are referred to as the stochastically stable states [17]. More precisely, a state $x \in X$ is stochastically stable if and only if $\lim_{\epsilon \to 0^+} \mu(x, \epsilon) > 0$, where $\mu(x, \epsilon)$ is a stationary distribution of the process $P^\epsilon$ for a fixed $\epsilon > 0$.

The proof of the above theorem encompasses two major parts. The first part involves characterizing the recurrence classes of the unperturbed process, i.e., the process induced by $\epsilon = 0$. The importance of this first part centers on the fact that the stochastically stable states are contained in the recurrence classes of the unperturbed process. The second part involves characterizing the limiting behavior of the process using the theory of resistance trees for regular perturbed processes [18]. In particular, the theory of resistance trees provides a tool for evaluating which of the recurrence classes are stochastically stable.

A. Part #1: Characterizing the recurrence classes of the unperturbed process

The following lemma characterizes the recurrence classes of the unperturbed process. We prove this lemma through a series of claims, presented after the lemma.

Lemma 2. A state $x = (x_1, \dots, x_n)$ is in a recurrence class of the unperturbed process $P^0$ if and only if the state $x$ is in one of the following two forms:

Form #1: The state of every agent $i \in N$ is of the form
$$x_i = \big[\bar a_i, u_i, k_i, \bar a_i, u_i^b, C, 0, 0\big],$$
where $\bar a_i \in \bar{\mathcal{A}}_i$, $k_i \in \{1, \dots, |\bar a_i|\}$, and $u_i^b = u_i(l \mid x)$ for every $l \in \{0, 1, 2, \dots\}$.

Form #2: The state of every agent $i \in N$ is of the form
$$x_i = \big[\bar a_i, u_i, k_i, \bar a_i, u_i^b, D, 0, 0\big],$$
where $\bar a_i \in \bar{\mathcal{A}}_i$ and $k_i \in \{1, \dots, |\bar a_i|\}$.

We begin by showing that any state of the above forms is in fact contained in a recurrence class of the unperturbed process. With that goal in mind, let $C^0$ represent all states of Form #1 and $D^0$ represent all states of Form #2. First, the set of states $D^0$ constitutes a single recurrence class of the unperturbed process, since the probability of transitioning between any two states $x^1, x^2 \in D^0$ is $O(1)$ and, when $\epsilon = 0$, there is no possibility of exiting from $D^0$.³

³ We use the notation $O(1)$ to denote probabilities that are on the order of 1, i.e., probabilities that are bounded away from 0.

Second, for any state $x \in C^0$, all components of the state remain constant for all future times except for the counters $\{k_i\}_{i \in N}$. This is a result of the third condition of Form #1, which ensures that the payoff associated with all future periods (where we use the term period to describe the entire sequence of actions) is identical to the baseline payoff. Since players are repeating action sequences of finite length, the process returns to the same counters $\{k_i\}_{i \in N}$ in exactly $\prod_{i \in N} |\bar a_i|$ iterations. Hence, $x$ belongs to a recurrence class of the unperturbed process.

We now show, through a series of claims, that any state not of the above forms is not in a recurrence class of the unperturbed process. The first claim shows that in any recurrence class there must be an equivalence between the baseline action sequence and the trial action sequence.

Claim 3. If a state $x = (x_1, \dots, x_n)$ is in a recurrence class of the unperturbed process $P^0$, then for every player $i \in N$ the baseline action sequence and the trial action sequence must be identical, i.e., the state $x_i$ is of the form
$$x_i = \big[\bar a_i, u_i, k_i, \bar a_i, u_i^b, m_i, c_i^{H/W}, n_i^{H/W}\big],$$
where $m_i \in \{C, D, H, W\}$.

Proof: According to the specified dynamics, we know that if $\bar a_i \neq \bar a_i^b$ then the agent must be content, i.e., the state is of the form $x_i = [\bar a_i, u_i, k_i, \bar a_i^b, u_i^b, C, 0, 0]$, where $k_i \le |\bar a_i|$. For notational simplicity, let $l_i = |\bar a_i|$ be the number of actions in the trial sequence of agent $i$. Given this state, the actions of player $i$ over the next $l_i - k_i + 1$ iterations will be
$$a_i(1) = \bar a_i(k_i),\quad a_i(2) = \bar a_i(k_i + 1),\quad \dots,\quad a_i(l_i - k_i + 1) = \bar a_i(l_i)$$
with probability 1. Let $a_{-i}(1), a_{-i}(2), \dots$ denote the ensuing sequence of actions chosen by the other players $j \neq i$ according to the unperturbed process. Define the running average payoff of player $i$ over the next $l_i - k_i + 1$ iterations as
$$u_i(1) = \frac{k_i - 1}{k_i}\, u_i + \frac{1}{k_i}\, U_i(a(1)),$$
$$u_i(2) = \frac{k_i}{k_i + 1}\, u_i(1) + \frac{1}{k_i + 1}\, U_i(a(2)),$$
$$\vdots$$
$$u_i(l_i - k_i + 1) = \Big(\frac{l_i - 1}{l_i}\Big) u_i(l_i - k_i) + \Big(\frac{1}{l_i}\Big) U_i(a(l_i - k_i + 1)).$$
The state of player $i$ evolves over the next $l_i - k_i$ iterations according to
$$x_i \to x_i(1) \to x_i(2) \to \dots \to x_i(l_i - k_i),$$
where for every $k \in \{1, \dots, l_i - k_i\}$ we have
$$x_i(k) = [\bar a_i, u_i(k), k_i + k, \bar a_i^b, u_i^b, C, 0, 0].$$
The ensuing state resulting from the transition $x_i(l_i - k_i) \to x_i(l_i - k_i + 1)$ is then of the form
$$x_i(l_i - k_i + 1) = \begin{cases} \big[\bar a_i, u_i(l_i - k_i + 1), 1, \bar a_i, u_i(l_i - k_i + 1), C, 0, 0\big] & \text{if } u_i(l_i - k_i + 1) > u_i^b,\\ \big[\bar a_i^b, u_i^b, 1, \bar a_i^b, u_i^b, C, 0, 0\big] & \text{if } u_i(l_i - k_i + 1) \le u_i^b. \end{cases}$$

Hence, irrespective of the play of the other players $j \neq i$, player $i$ returns to a content state with $\bar a_i = \bar a_i^b$ within $l_i$ periods with probability 1. Furthermore, when $\epsilon = 0$ a player never experiments in a content state; hence, $\bar a_i = \bar a_i^b$ for all future time periods. This completes the proof.

The following claim shows that the average payoff received over all subsequent periods must be the same for any player in any recurrence class of the unperturbed process.

Claim 4. If a state $x = (x_1, \dots, x_n)$ is in a recurrence class of the unperturbed process, then for every agent $i \in N$ with $m_i \in \{C, H, W\}$ we have $u_i(0 \mid x) = u_i(l \mid x)$ for every $l \in \{1, 2, \dots\}$.

Proof: Suppose the state $x$ is of the form depicted in Claim 3. We focus on the case where every player continues to play according to his baseline action sequence (regardless of his mood), which occurs with probability $O(1)$. Let $x'_i$ be the state of agent $i$ after $w$ time steps of the players playing according to their baseline actions. The state of agent $i$ after $w$ time steps is of one of the following four forms:
$$x'_i = \begin{cases} \big[\bar a_i, u_i, k_i, \bar a_i, u_i^b, C, 0, 0\big],\\ \big[\bar a_i, u_i, k_i, \bar a_i, u_i^b, H, c_i^H, n_i^H\big],\\ \big[\bar a_i, u_i, k_i, \bar a_i, u_i^b, W, c_i^W, n_i^W\big],\\ \big[\bar a_i, u_i, k_i, \bar a_i, u_i^b, D, 0, 0\big]. \end{cases}$$
We refer to $x'$ as the current state and refer to these four forms as content, hopeful, watchful, and discontent, respectively. Because the players are repeating action sequences, combining (10) and (13) gives
$$u_i(l \mid x') = u_i\Big(\Big(l + \prod_{j \in N} |\bar a_j|\Big) \,\Big|\, x'\Big)$$
for any $l \in \{0, 1, 2, \dots\}$. Note that waiting the initial $w$ periods was essential for ensuring the above equality for $l = 0$. For the forthcoming proof, we focus on the state at the end of each period, and we represent the state of agent $i$ at the end of the $l$-th period by $x_i(l)$.

We prove the claim by contradiction. In particular, if the utilities are not of the form specified by the claim, then we specify a sequence of transitions, all of which occur with probability $O(1)$ in the unperturbed process, that leads to agent $i$ becoming discontent. Once an agent becomes discontent, the agent remains discontent for all future times in the unperturbed process, which completes the proof.

First, if agent $i$ is discontent in the state $x'$ then we are done. Accordingly, we analyze each of the remaining possible moods of agent $i$ separately below.

Case #1: Content. We start with the case where agent $i$ is content in the state $x'$, and suppose $u_i(l_0 \mid x') \neq u_i(l_1 \mid x')$ for some $l_0, l_1 \in \{0, 1, 2, \dots\}$. Let $l_0^* \in \{0, 1, 2, \dots\}$ be the first period where $u_i(l_0^* \mid x') \neq u_i(l_0^* - 1 \mid x')$, where we set $u_i(-1 \mid x') = u_i^b$. Player $i$ remains content to the end of the $l_0^*$-th period, at which time the player transitions to hopeful if $u_i(l_0^* \mid x') > u_i^b$ or watchful if $u_i(l_0^* \mid x') < u_i^b$.

Suppose that $u_i(l_0^* \mid x') < u_i^b$. In this case, there is a probability $O(1)$ that the state of agent $i$ at the end of the $l_0^*$-th period will be
$$x_i(l_0^*) = \big[\bar a_i, u_i(l_0^* \mid x'), 1, \bar a_i, u_i^b, W, 1, n_i^W\big],$$
where $n_i^W \in \{n_i \in \{w+1, \dots, w^n + w\} : n_i \bmod \prod_{j \neq i} |\bar a_j| = 0\}$. Note that this set is not empty since $\prod_{j \neq i} |\bar a_j| < w^n$. Furthermore, note that $u_i(l_0^* + n_i^W \mid x') = u_i(l_0^* \mid x')$. Conditioned on this event, the state of agent $i$ remains watchful until the end of the $n_i^W$-th period, at which point it transitions to
$$x_i(l_0^* + n_i^W) = \big[\bar a_i, u_i(l_0^* \mid x'), 1, \bar a_i, u_i(l_0^* \mid x'), D, 0, 0\big],$$
which completes the proof.

Now suppose that $u_i(l_0^* \mid x') > u_i^b$. In this case, there is a probability $O(1)$ that the state of agent $i$ at the end of the $l_0^*$-th period will be
$$x_i(l_0^*) = \big[\bar a_i, u_i(l_0^* \mid x'), 1, \bar a_i, u_i^b, H, 1, n_i^H\big],$$
where $n_i^H \in \{n_i \in \{w+1, \dots, w^n + w\} : n_i \bmod \prod_{j \neq i} |\bar a_j| = 0\}$. Conditioned on this event, the state of agent $i$ remains hopeful until the end of the $n_i^H$-th period, at which point it transitions to
$$x_i(l_0^* + n_i^H) = \big[\bar a_i, u_i(l_0^* \mid x'), 1, \bar a_i, u_i(l_0^* \mid x'), C, 0, 0\big].$$
Conditioned on this event, let $l_1^* \in \{1, 2, \dots\}$ be the first period where $u_i(l_0^* + l_1^* \mid x') \neq u_i(l_0^* + l_1^* - 1 \mid x')$. Player $i$ remains content to the end of the $(l_0^* + l_1^*)$-th period, at which time he transitions to hopeful if $u_i(l_0^* + l_1^* \mid x') > u_i(l_0^* \mid x')$ or watchful if $u_i(l_0^* + l_1^* \mid x') < u_i(l_0^* \mid x')$. If $u_i(l_0^* + l_1^* \mid x') < u_i(l_0^* \mid x')$, then we can follow the first process depicted above, which results in the agent becoming discontent, and we are done. Otherwise, if $u_i(l_0^* + l_1^* \mid x') > u_i(l_0^* \mid x')$, then we can follow the second process depicted above, which results in the agent becoming content with a baseline payoff $u_i(l_0^* + l_1^* \mid x') > u_i(l_0^* \mid x') > u_i^b$.

Repeat the process depicted above. Note that an agent can only transition to hopeful a finite number of times, fewer than $\prod_{j \neq i} |\bar a_j| \le w^n$, before the agent transitions to watchful. Since this process happens with probability $O(1)$, the mood of agent $i$ will eventually transition to $D$. This completes the proof of this case.

Case #2: Hopeful. Next, we focus on the case where agent $i$ is hopeful in the state $x'$, and suppose $u_i(l_0 \mid x') \neq u_i(l_1 \mid x')$ for some $l_0, l_1 \in \{0, 1, 2, \dots\}$. In this case, agent $i$ remains hopeful to the end of the $(n_i^H - c_i^H)$-th period, at which point agent $i$ transitions to content or watchful depending on how $u_i(n_i^H - c_i^H \mid x')$ compares to $u_i^b$. If $u_i(n_i^H - c_i^H \mid x') \ge u_i^b$, then the state of agent $i$ at the end of the $(n_i^H - c_i^H)$-th period is
$$x_i(n_i^H - c_i^H) = \big[\bar a_i, u_i(n_i^H - c_i^H \mid x'), 1, \bar a_i, u_i(n_i^H - c_i^H \mid x'), C, 0, 0\big].$$
However, note that if the agent is in a state of this form, then it matches the form analyzed in Case #1; hence, with probability $O(1)$ this agent transitions to discontent and we are done. Alternatively, if $u_i(n_i^H - c_i^H \mid x') < u_i^b$, then the state of agent $i$ at the end of the $(n_i^H - c_i^H)$-th period will be
$$x_i(n_i^H - c_i^H) = \big[\bar a_i, u_i(n_i^H - c_i^H \mid x'), 1, \bar a_i, u_i^b, W, 1, n_i^W\big],$$
where $n_i^W \in \{n_i \in \{w+1, \dots, w^n + w\} : n_i \bmod \prod_{j \neq i} |\bar a_j| = 0\}$ with probability $O(1)$. Conditioned on this event, the agent remains watchful for an additional $n_i^W$ periods, at which point the state of the agent will be
$$x_i(n_i^W + n_i^H - c_i^H) = \big[\bar a_i, u_i(n_i^W + n_i^H - c_i^H \mid x'), 1, \bar a_i, u_i(n_i^W + n_i^H - c_i^H \mid x'), D, 0, 0\big]$$
and we are done. This results from the fact that $u_i(n_i^W + n_i^H - c_i^H \mid x') = u_i(n_i^H - c_i^H \mid x') < u_i^b$.

Case #3: Watchful. Lastly, we focus on the case where agent $i$ is watchful in the state $x'$. In this case, agent $i$ remains watchful to the end of the $(n_i^W - c_i^W)$-th period, at which point agent $i$ transitions to hopeful or discontent depending on how $u_i(n_i^W - c_i^W \mid x')$ compares to $u_i^b$. If $u_i(n_i^W - c_i^W \mid x') < u_i^b$, then the state of agent $i$ at the end of the $(n_i^W - c_i^W)$-th period will be
$$x_i(n_i^W - c_i^W) = \big[\bar a_i, u_i(n_i^W - c_i^W \mid x'), 1, \bar a_i, u_i(n_i^W - c_i^W \mid x'), D, 0, 0\big]$$
and we are done. Alternatively, suppose $u_i(n_i^W - c_i^W \mid x') \ge u_i^b$. In this case, the state of agent $i$ at the end of the $(n_i^W - c_i^W)$-th period will be
$$x_i(n_i^W - c_i^W) = \big[\bar a_i, u_i(n_i^W - c_i^W \mid x'), 1, \bar a_i, u_i^b, H, 1, n_i^H\big],$$

where $n_i^H \in \{n_i \in \{w+1, \dots, w^n + w\} : n_i \bmod \prod_{j \neq i} |\bar a_j| = 0\}$ with probability $O(1)$. Conditioned on this event, the agent remains hopeful for an additional $n_i^H$ periods, at which point the state of the agent will be
$$x_i(n_i^H + n_i^W - c_i^W) = \big[\bar a_i, u_i(n_i^H + n_i^W - c_i^W \mid x'), 1, \bar a_i, u_i(n_i^H + n_i^W - c_i^W \mid x'), C, 0, 0\big].$$
However, the agent in this state now matches the form analyzed in Case #1; hence, with probability $O(1)$ this agent transitions to discontent and we are done. This completes the proof.

The next claim shows that in any recurrence class, if one agent is discontent, then all agents must be discontent.

Claim 5. If a state $x = (x_1, \dots, x_n)$ is in a recurrence class of the unperturbed process $P^0$ and $m_i = D$ for some agent $i \in N$, then $m_j = D$ for every agent $j \in N$.

Proof: Suppose the state $x$ is of the form depicted in Claims 3 and 4. We focus on the case where every player plays according to his baseline action sequence, which occurs with probability $O(1)$. As in the proof of Claim 4, let $x'$ be the state after $w$ time steps of the players playing according to their baseline actions. The state of each agent $i \in N$ after $w$ time steps is of one of the following four forms:
$$x'_i = \begin{cases} \big[\bar a_i, u_i, k_i, \bar a_i, u_i^b, C, 0, 0\big],\\ \big[\bar a_i, u_i, k_i, \bar a_i, u_i^b, H, c_i^H, n_i^H\big],\\ \big[\bar a_i, u_i, k_i, \bar a_i, u_i^b, W, c_i^W, n_i^W\big],\\ \big[\bar a_i, u_i, k_i, \bar a_i, u_i^b, D, 0, 0\big]. \end{cases}$$
We refer to $x'$ as the current state, i.e., the state at time 0, and refer to these four forms as content, hopeful, watchful, and discontent, respectively.

Let $S \subset N$ denote the subset of players that are discontent given the state $x'$, i.e., $m_i = D$ for all agents $i \in S$ and $m_j \neq D$ for all agents $j \notin S$. If $S = N$ then we are done. Otherwise, let $\tilde a \in \{a(0 \mid x'), a(1 \mid x'), \dots, a(w^n + w \mid x')\}$ be any ensuing action profile. By our interdependence condition, there exists a player $j \notin S$ such that $U_j(\tilde a) \neq U_j(a'_S, \tilde a_{-S})$ for some action profile $a'_S \in \prod_{i \in S} A_i$, where $\tilde a_{-S} = \{\tilde a_j : j \notin S\}$. Suppose all players play according to their baseline actions, which happens with probability $O(1)$.

As in the previous claims, we focus on analyzing the state of agent $j$ at the end of each period. We denote the state of agent $j$ at the end of the $l$-th period by $x_j(l)$.

We begin by showing that if agent $j$ is either hopeful or watchful, the agent transitions to being either content or discontent within $2w^n$ periods with probability $O(1)$. We start with the case where agent $j$ is hopeful, i.e., the state of agent $j$ is of the form
$$x'_j = \big[\bar a_j, u_j, k_j, \bar a_j, u_j^b, H, c_j^H, n_j^H\big],$$
where $0 < c_j^H \le n_j^H$. The mood of agent $j$ continues to be hopeful until the end of the $n_j^H$-th period, which yields a payoff of $u_j^H = u_j(n_j^H - c_j^H \mid x')$. If $u_j^H \ge u_j^b$, then at the end of the $n_j^H$-th period the state of agent $j$ transitions to
$$x_j(n_j^H - c_j^H) = \big[\bar a_j, u_j^H, 1, \bar a_j, u_j^H, C, 0, 0\big]$$
and we are done. Otherwise, if $u_j^H < u_j^b$, then at the end of the $n_j^H$-th period the state of agent $j$ transitions to
$$x_j(n_j^H - c_j^H) = \big[\bar a_j, u_j^H, 1, \bar a_j, u_j^b, W, 1, n_j^W\big]$$
with probability $O(1)$, where $n_j^W \in \{l \in \{w+1, \dots, w^n + w\} : l \bmod \prod_{i \neq j} |\bar a_i| = 0\}$. Note that this set is nonempty since $\prod_{i \neq j} |\bar a_i| \le w^n$. Conditioned on this event, we know that $u_j(n_j^W + n_j^H - c_j^H \mid x') = u_j^H < u_j^b$; hence, at the end of the $(n_j^W + n_j^H)$-th period the state of agent $j$ transitions to
$$x_j(n_j^W + n_j^H - c_j^H) = \big[\bar a_j, u_j^H, 1, \bar a_j, u_j^H, D, 0, 0\big].$$
Similar arguments show that if agent $j$ was initially watchful, then the agent transitions to either content or discontent within the same number of periods (at most $2w^n$) with probability $O(1)$.

We complete the proof by focusing on the case where agent $j$ is content or discontent, i.e., the state of agent $j$ is of the form
$$x'_j = \big[\bar a_j, u_j, 1, \bar a_j, u_j^b, C, 0, 0\big] \quad \text{or} \quad x'_j = \big[\bar a_j, u_j, 1, \bar a_j, u_j^b, D, 0, 0\big].$$
If agent $j$ is discontent, then we can repeat the argument above for a new agent $j'$ which satisfies the interdependence condition. Otherwise, suppose agent $j$ is content.

If agent $j$'s payoffs are not aligned, i.e., $u_j(l \mid x') \neq u_j(0 \mid x')$ for some $l \in \{0, 1, \dots\}$, then we can follow the arguments in the proof of Claim 4, which show that agent $j$ becomes discontent with probability $O(1)$. Now, suppose that agent $j$'s payoffs are aligned, i.e., $u_j(l \mid x') = u_j(0 \mid x')$ for every $l \in \{0, 1, \dots\}$. Consider the ensuing sequence of actions where for each $k \in \{0, 1, \dots\}$ we have
$$\tilde a(k \mid x') = \begin{cases} a(k \mid x') & \text{if } a(k \mid x') \neq \tilde a,\\ (a'_S, \tilde a_{-S}) & \text{if } a(k \mid x') = \tilde a. \end{cases}$$
Note that such a sequence of actions will be played with probability $O(1)$. Define $\tilde u_j(\cdot)$ in the same fashion as $u_j(\cdot)$, with the sole exception of using $\tilde a(\cdot \mid x')$ as opposed to $a(\cdot \mid x')$.

Suppose $U_j(a'_S, \tilde a_{-S}) < U_j(\tilde a)$, which in turn guarantees that $\tilde u_j(l \mid x') \le u_j(l \mid x') = u_j^b$ for every $l \in \{0, 1, \dots\}$. Let $l^* \in \{0, 1, \dots, w^n - 1\}$ denote the first time at which $\tilde u_j(l^* \mid x') < u_j(l^* \mid x')$. Player $j$ remains content to the end of the $l^*$-th period, at which time the player transitions to
$$x_j(l^*) = \big[\bar a_j, \tilde u_j(l^* \mid x'), 1, \bar a_j, u_j^b, W, 1, n_j^W\big]$$
with probability $O(1)$, where $n_j^W \in \{l \in \{w+1, \dots, w^n + w\} : l \bmod \prod_{i \neq j} |\bar a_i| = 0\}$. Conditioned on this event, the state of agent $j$ after $n_j^W$ additional periods will be
$$x_j(l^* + n_j^W) = \big[\bar a_j, \tilde u_j(l^* \mid x'), 1, \bar a_j, \tilde u_j(l^* \mid x'), D, 0, 0\big]$$
and we are done.

Alternatively, suppose $U_j(a'_S, \tilde a_{-S}) > U_j(\tilde a)$, which in turn guarantees that $\tilde u_j(l \mid x') \ge u_j^b$ for every $l \in \{0, 1, \dots\}$. Player $j$ remains content to the end of the $l^*$-th period (defined analogously as the first period at which $\tilde u_j(l^* \mid x') > u_j(l^* \mid x')$), at which time the player transitions to
$$x_j(l^*) = \big[\bar a_j, \tilde u_j(l^* \mid x'), 1, \bar a_j, u_j^b, H, 1, n_j^H\big]$$
with probability $O(1)$, where $n_j^H \in \{l \in \{w+1, \dots, w^n + w\} : l \bmod \prod_{i \neq j} |\bar a_i| = 0\}$. Conditioned on this event, the state of agent $j$ after $n_j^H$ additional periods will be
$$x_j(l^* + n_j^H) = \big[\bar a_j, \tilde u_j', 1, \bar a_j, \tilde u_j', C, 0, 0\big],$$
where $\tilde u_j' = \tilde u_j(l^* + n_j^H \mid x')$. Conditioned on this event, consider the case where the agents play according to $a(\cdot \mid x')$ as opposed to $\tilde a(\cdot \mid x')$ for all future times, and let $\tilde u_j(\cdot)$ reflect this change. Such a sequence will be played with probability $O(1)$. Note that this situation is precisely the situation highlighted above; hence, the highlighted procedure demonstrates that agent $j$ will transition to discontent with probability $O(1)$. This completes the proof.

The following claim proves that either all agents must be content or all agents must be discontent in any recurrence class of the unperturbed process.

Claim 6. If a state $x = (x_1, \dots, x_n)$ is in a recurrence class of the unperturbed process $P^0$, then (i) $m_i = C$ for every agent $i \in N$, or (ii) $m_i = D$ for every agent $i \in N$.

Proof: Suppose the state $x$ is of the form depicted in Claims 3 and 4. We focus on the case where every player plays according to his baseline action sequence, which occurs with probability $O(1)$. As in the proof of Claim 5, let $x' = (x'_1, \dots, x'_n)$ be the state after $w$ time steps of the players playing according to their baseline actions. First note that if $m_i = D$ for any agent $i \in N$, then by Claim 5 we know that $m_j = D$ for every agent $j \in N$, and we are done. Alternatively, suppose that $m_i \in \{C, H, W\}$ for every agent $i \in N$. By Claim 4, we know that since $x$ and $x'$ are both in a recurrence class of the unperturbed process, for every agent $i \in N$ and every pair of periods $l_1, l_2 \in \{0, 1, 2, \dots\}$ we have $u_i' := u_i(l_1 \mid x') = u_i(l_2 \mid x')$.

Suppose $m_i = H$ for some agent $i \in N$. Accordingly, the state of agent $i$ is of the form
$$x'_i = \big[\bar a_i, u_i, k_i, \bar a_i, u_i^b, H, c_i^H, n_i^H\big].$$
The mood of agent $i$ continues to be hopeful until the end of the $n_i^H$-th period, which yields a payoff of $u_i'$. If $u_i' \ge u_i^b$, then at the end of the $n_i^H$-th period the state of agent $i$ transitions to
$$x_i(n_i^H - c_i^H) = \big[\bar a_i, u_i', 1, \bar a_i, u_i', C, 0, 0\big].$$
Note that in this case the agent remains content for all future times, and we are done. Otherwise, if $u_i' < u_i^b$, then at the end of the $n_i^H$-th period the state of agent $i$ transitions to
$$x_i(n_i^H - c_i^H) = \big[\bar a_i, u_i', 1, \bar a_i, u_i^b, W, 1, n_i^W\big],$$
where $n_i^W \in \{w+1, \dots, w^n + w\}$. Since $u_i' = u_i(l \mid x')$ for any $l \in \{0, 1, \dots\}$, at the end of the $(n_i^W + n_i^H)$-th period the state of agent $i$ transitions to
$$x_i(n_i^H + n_i^W - c_i^H) = \big[\bar a_i, u_i', 1, \bar a_i, u_i', D, 0, 0\big].$$
Hence, an agent cannot be hopeful in a recurrence class of the unperturbed process.

Suppose $m_i = W$ for some agent $i \in N$. Accordingly, the state of agent $i$ is of the form
$$x'_i = \big[\bar a_i, u_i, k_i, \bar a_i, u_i^b, W, c_i^W, n_i^W\big].$$

The mood of agent $i$ continues to be watchful until the end of the $n_i^W$-th period, which yields a payoff of $u_i'$. If $u_i' < u_i^b$, then at the end of the $n_i^W$-th period the state of agent $i$ transitions to
$$x_i(n_i^W - c_i^W) = \big[\bar a_i, u_i', 1, \bar a_i, u_i', D, 0, 0\big]$$
and we are done. Otherwise, if $u_i' \ge u_i^b$, then at the end of the $n_i^W$-th period the state of agent $i$ transitions to
$$x_i(n_i^W - c_i^W) = \big[\bar a_i, u_i', 1, \bar a_i, u_i^b, H, 1, n_i^H\big],$$
where $n_i^H \in \{w+1, \dots, w^n + w\}$. Since $u_i' = u_i(l \mid x')$ for any $l \in \{0, 1, \dots\}$, at the end of the $(n_i^H + n_i^W)$-th period the state of agent $i$ transitions to
$$x_i(n_i^H + n_i^W - c_i^W) = \big[\bar a_i, u_i', 1, \bar a_i, u_i', C, 0, 0\big].$$
Note that in this case the agent remains content for all future times, and we are done. Hence, an agent cannot be watchful in a recurrence class of the unperturbed process. This completes the proof.

The following claim finishes the proof of Lemma 2 by showing that in any recurrence class of the unperturbed process, if the agents are all content, then their baseline utilities must be aligned with their baseline action sequences.

Claim 7. If a state $x = (x_1, \dots, x_n)$ is in a recurrence class of the unperturbed process $P^0$ and $m_i = C$ for every agent $i \in N$, then $u_i^b = u_i(l \mid x)$ for every $l \in \{0, 1, 2, \dots\}$.

Proof: By Claim 4, we know that if $x$ is in a recurrence class of the unperturbed process, then $u_i(l \mid x) = u_i(l' \mid x)$ for every $l, l' \in \{0, 1, \dots\}$. Hence, we complete the proof by showing that $u_i^b = u_i(0 \mid x)$. As in the previous claims, we argue by contradiction.

Suppose $u_i(0 \mid x) > u_i^b$ for some agent $i \in N$. Then the state of player $i$ at the end of the 0-th period will be
$$x_i(0) = \big[\bar a_i, u_i(0 \mid x), 1, \bar a_i, u_i^b, H, 1, n_i^H\big],$$
where $n_i^H \in \{w+1, \dots, w^n + w\}$. Furthermore, after an additional $n_i^H$ periods, the state of player $i$ will be
$$x_i(n_i^H) = \big[\bar a_i, u_i(0 \mid x), 1, \bar a_i, u_i(0 \mid x), C, 0, 0\big].$$
The state of agent $i$ then stays fixed for all future times, so the process never returns to the state $x$ (whose baseline utility is $u_i^b$), and we are done.

Alternatively, suppose $u_i(0 \mid x) < u_i^b$ for some agent $i \in N$. Then the state of player $i$ at the end of the 0-th period will be
$$x_i(0) = \big[\bar a_i, u_i(0 \mid x), 1, \bar a_i, u_i^b, W, 1, n_i^W\big],$$
where $n_i^W \in \{w+1, \dots, w^n + w\}$. Furthermore, after an additional $n_i^W$ periods, the state of player $i$ will be
$$x_i(n_i^W) = \big[\bar a_i, u_i(0 \mid x), 1, \bar a_i, u_i(0 \mid x), D, 0, 0\big].$$
The mood of agent $i$ then remains discontent for all future times, so the process never returns to the state $x$, and we are done. This completes the proof.

B. Part #2: Derivation of the stochastically stable states

We know from [18] that the computation of the stochastically stable states can be reduced to an analysis of rooted trees on the vertex set consisting solely of the recurrence classes. To that end, we classify the recurrence classes as follows:

- We denote the collection of states of Form #2, i.e., $m_i = D$ for all agents $i \in N$, by the single variable $D^0$.
- For each state $x = (x_1, \dots, x_n)$ of Form #1, consider the collection of states $x(1), x(2), \dots$, where for any agent $i \in N$ and $l \in \{0, 1, \dots\}$ the state is of the form
$$x_i(l) = \big[\bar a_i, u_i(l), k_i(l), \bar a_i, u_i, C, 0, 0\big],$$
where
$$k_i(l) = \big((k_i + l - 1) \bmod |\bar a_i^b|\big) + 1, \qquad u_i(0) = u_i,$$
$$u_i(k) = \frac{1}{k_i + k}\, U_i(a(k \mid x)) + \frac{k_i + k - 1}{k_i + k}\, u_i(k-1)$$
for any $k \in \{1, 2, \dots\}$. Note that this collection of states represents a single recurrence class of the unperturbed process $P^0$. Consequently, we represent this collection of states compactly by the tuple $[\bar a_i, u_i, k_i]$. We denote the collection of these recurrence classes by $C^0$.

The set of recurrence classes of the unperturbed process is thus characterized by the set $D^0 \cup C^0$. The theory of resistance trees for regular perturbed processes provides an analytical technique for evaluating the stochastically stable states using graph-theoretic arguments constructed over the vertex set $D^0 \cup C^0$.

Before proceeding with this derivation, we define the set of states $C^* \subseteq C^0$ as follows: a state $x \in C^*$ if for every player $i \in N$, every action $a_i' \in A_i$, and every $l \in \{0, 1, \dots, w^n\}$ we have
$$\frac{1}{|\bar a_i|} \sum_{k = l|\bar a_i| - k_i + 1}^{(l+1)|\bar a_i| - k_i} U_i(a_i', a_{-i}(k \mid x)) \;\le\; \frac{1}{|\bar a_i|} \sum_{k = l|\bar a_i| - k_i + 1}^{(l+1)|\bar a_i| - k_i} U_i(a(k \mid x)).$$
Note that if $x \in C^*$, then $q(x)$ is a coarse correlated equilibrium.

Definition 2 (Edge resistance). For every pair of distinct recurrence classes $w$ and $z$, let $r(w \to z)$ denote the total resistance of the least-resistance path that starts in $w$ and ends in $z$. We call $w \to z$ an edge and $r(w \to z)$ the resistance of the edge.

The following lemma highlights five properties of the edge resistances.

Lemma 8. The edge resistances defined over the recurrence classes $C^0 \cup D^0$ satisfy the following five properties, where $u_i$ and $u_i'$ denote the baseline payoffs of agent $i$ associated with the recurrence classes $x$ and $x'$, respectively:

(i) For any state $x \in C^0$, the resistance associated with the transition $D^0 \to x$ satisfies
$$r(D^0 \to x) = \sum_{i \in N} (1 - u_i).$$

(ii) For any states $x \in C^0 \setminus C^*$ and $x' \in C^0$, the resistance associated with the transition $x \to x'$ satisfies
$$r(x \to x') \ge c + \sum_{i \in N : u_i' < u_i} (1 - u_i').$$

(iii) For any sequence of transitions of the form $D^0 \to x^0 \to x^1 \to \dots \to x^m = x$, where $x^k \in C^0 \setminus C^*$ for every $k \in \{0, 1, \dots, m-1\}$ and $x^m \in C^0$, the resistance associated with this sequence of transitions satisfies
$$r(D^0 \to x^0) + \sum_{k=0}^{m-1} r(x^k \to x^{k+1}) \ge mc + \sum_{i \in N} (1 - u_i).$$

(iv) For any states $x \in C^*$ and $x' \in C^0 \cup D^0$, the resistance associated with the transition $x \to x'$ satisfies
$$r(x \to x') \ge 2c.$$

(v) For any state $x \in C^*$, the resistance associated with the transition $x \to D^0$ is
$$r(x \to D^0) = 2c.$$

Proof: The first three properties are relatively straightforward. Property (i) results from the fact that each agent needs to accept the baseline utility, which has a resistance of $1 - u_i$. Property (ii) results from two facts: first, at least one player needs to experiment in order to transition out of a content state, which occurs with a resistance of $c$; second, if a player transitions from a content state to an alternative content state with a lower baseline utility $u_i' < u_i^b$, then the agent must become content with this baseline payoff, which occurs with a resistance of $1 - u_i'$. Lastly, Property (iii) follows immediately from Properties (i) and (ii).

We prove Property (iv) by demonstrating that at least two experimentations are required in order to leave a state $x \in C^*$. To that end, let $x_i = [\bar a_i, u_i, k_i, \bar a_i, u_i^b, C, 0, 0]$ denote the state of agent $i \in N$. Suppose agent $i$ experiments with a constant block of actions $(a_i, \dots, a_i) \in A_i^{|\bar a_i|}$ at the beginning of the $l_i$-th period of agent $i$. Likewise, assume this experimentation occurred during the periods $\{l_j, l_j + 1, \dots, l_j'\}$ of each agent $j$. Since $|\bar a_i| \le w$ and $|\bar a_j| \ge 1$, we know that $l_j' - l_j \le w$. Since $x \in C^*$, we know that
$$\frac{1}{|\bar a_i|} \sum_{k = l_i|\bar a_i| - k_i + 1}^{(l_i+1)|\bar a_i| - k_i} U_i(a_i, a_{-i}(k \mid x)) \le u_i^b. \tag{14}$$
Hence, the state of agent $i$ at the end of this period remains content and of the form
$$x_i(l_i) = \big[\bar a_i, u_i', 1, \bar a_i, u_i^b, C, 0, 0\big],$$
where $u_i'$ represents the average expressed in (14).

It is important to note that the utility of any other agent $j \neq i$ could have changed during the periods $\{l_j, l_j + 1, \dots, l_j'\}$. Let $l_j^*$ represent the first period for which $u_j(l_j^*) \neq u_j^b$. If $u_j(l_j^*) > u_j^b$, then at the end of the $l_j^*$-th period the state of agent $j$ transitions to
$$x_j(l_j^*) = \big[\bar a_j, u_j(l_j^*), 1, \bar a_j, u_j^b, H, 1, n_j^H\big],$$
where $n_j^H \in \{w+1, \dots, w^n + w\}$. However, note that for all $k \ge w + 1 > l_j' - l_j$ we have $u_j(k) = u_j^b$. Therefore, after $n_j^H$ additional periods, the state of agent $j$ transitions to
$$x_j(l_j^* + n_j^H) = \big[\bar a_j, u_j^b, 1, \bar a_j, u_j^b, C, 0, 0\big]$$
and we are done.


More information

Computing Solution Concepts of Normal-Form Games. Song Chong EE, KAIST

Computing Solution Concepts of Normal-Form Games. Song Chong EE, KAIST Computing Solution Concepts of Normal-Form Games Song Chong EE, KAIST songchong@kaist.edu Computing Nash Equilibria of Two-Player, Zero-Sum Games Can be expressed as a linear program (LP), which means

More information

Games and Economic Behavior

Games and Economic Behavior Games and Economic Behavior 75 (2012) 882 897 Contents lists available at SciVerse ScienceDirect Games and Economic Behavior www.elsevier.com/locate/geb Learning efficient Nash equilibria in distributed

More information

Lecture Notes on Game Theory

Lecture Notes on Game Theory Lecture Notes on Game Theory Levent Koçkesen Strategic Form Games In this part we will analyze games in which the players choose their actions simultaneously (or without the knowledge of other players

More information

On Equilibria of Distributed Message-Passing Games

On Equilibria of Distributed Message-Passing Games On Equilibria of Distributed Message-Passing Games Concetta Pilotto and K. Mani Chandy California Institute of Technology, Computer Science Department 1200 E. California Blvd. MC 256-80 Pasadena, US {pilotto,mani}@cs.caltech.edu

More information

Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games Gabriel Y. Weintraub, Lanier Benkard, and Benjamin Van Roy Stanford University {gweintra,lanierb,bvr}@stanford.edu Abstract

More information

Ergodicity and Non-Ergodicity in Economics

Ergodicity and Non-Ergodicity in Economics Abstract An stochastic system is called ergodic if it tends in probability to a limiting form that is independent of the initial conditions. Breakdown of ergodicity gives rise to path dependence. We illustrate

More information

Area I: Contract Theory Question (Econ 206)

Area I: Contract Theory Question (Econ 206) Theory Field Exam Summer 2011 Instructions You must complete two of the four areas (the areas being (I) contract theory, (II) game theory A, (III) game theory B, and (IV) psychology & economics). Be sure

More information

Industrial Organization Lecture 3: Game Theory

Industrial Organization Lecture 3: Game Theory Industrial Organization Lecture 3: Game Theory Nicolas Schutz Nicolas Schutz Game Theory 1 / 43 Introduction Why game theory? In the introductory lecture, we defined Industrial Organization as the economics

More information

Lecture notes for Analysis of Algorithms : Markov decision processes

Lecture notes for Analysis of Algorithms : Markov decision processes Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with

More information

CS364A: Algorithmic Game Theory Lecture #13: Potential Games; A Hierarchy of Equilibria

CS364A: Algorithmic Game Theory Lecture #13: Potential Games; A Hierarchy of Equilibria CS364A: Algorithmic Game Theory Lecture #13: Potential Games; A Hierarchy of Equilibria Tim Roughgarden November 4, 2013 Last lecture we proved that every pure Nash equilibrium of an atomic selfish routing

More information

Synthesis weakness of standard approach. Rational Synthesis

Synthesis weakness of standard approach. Rational Synthesis 1 Synthesis weakness of standard approach Rational Synthesis 3 Overview Introduction to formal verification Reactive systems Verification Synthesis Introduction to Formal Verification of Reactive Systems

More information

6.207/14.15: Networks Lecture 11: Introduction to Game Theory 3

6.207/14.15: Networks Lecture 11: Introduction to Game Theory 3 6.207/14.15: Networks Lecture 11: Introduction to Game Theory 3 Daron Acemoglu and Asu Ozdaglar MIT October 19, 2009 1 Introduction Outline Existence of Nash Equilibrium in Infinite Games Extensive Form

More information

STOCHASTIC STABILITY OF GROUP FORMATION IN COLLECTIVE ACTION GAMES. Toshimasa Maruta 1 and Akira Okada 2

STOCHASTIC STABILITY OF GROUP FORMATION IN COLLECTIVE ACTION GAMES. Toshimasa Maruta 1 and Akira Okada 2 STOCHASTIC STABILITY OF GROUP FORMATION IN COLLECTIVE ACTION GAMES Toshimasa Maruta 1 and Akira Okada 2 December 20, 2001 We present a game theoretic model of voluntary group formation in collective action

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Computational Game Theory Spring Semester, 2005/6. Lecturer: Yishay Mansour Scribe: Ilan Cohen, Natan Rubin, Ophir Bleiberg*

Computational Game Theory Spring Semester, 2005/6. Lecturer: Yishay Mansour Scribe: Ilan Cohen, Natan Rubin, Ophir Bleiberg* Computational Game Theory Spring Semester, 2005/6 Lecture 5: 2-Player Zero Sum Games Lecturer: Yishay Mansour Scribe: Ilan Cohen, Natan Rubin, Ophir Bleiberg* 1 5.1 2-Player Zero Sum Games In this lecture

More information

We set up the basic model of two-sided, one-to-one matching

We set up the basic model of two-sided, one-to-one matching Econ 805 Advanced Micro Theory I Dan Quint Fall 2009 Lecture 18 To recap Tuesday: We set up the basic model of two-sided, one-to-one matching Two finite populations, call them Men and Women, who want to

More information

האוניברסיטה העברית בירושלים

האוניברסיטה העברית בירושלים האוניברסיטה העברית בירושלים THE HEBREW UNIVERSITY OF JERUSALEM TOWARDS A CHARACTERIZATION OF RATIONAL EXPECTATIONS by ITAI ARIELI Discussion Paper # 475 February 2008 מרכז לחקר הרציונליות CENTER FOR THE

More information

A Note on the Existence of Ratifiable Acts

A Note on the Existence of Ratifiable Acts A Note on the Existence of Ratifiable Acts Joseph Y. Halpern Cornell University Computer Science Department Ithaca, NY 14853 halpern@cs.cornell.edu http://www.cs.cornell.edu/home/halpern August 15, 2018

More information

Tijmen Daniëls Universiteit van Amsterdam. Abstract

Tijmen Daniëls Universiteit van Amsterdam. Abstract Pure strategy dominance with quasiconcave utility functions Tijmen Daniëls Universiteit van Amsterdam Abstract By a result of Pearce (1984), in a finite strategic form game, the set of a player's serially

More information

6.891 Games, Decision, and Computation February 5, Lecture 2

6.891 Games, Decision, and Computation February 5, Lecture 2 6.891 Games, Decision, and Computation February 5, 2015 Lecture 2 Lecturer: Constantinos Daskalakis Scribe: Constantinos Daskalakis We formally define games and the solution concepts overviewed in Lecture

More information

Payoff Continuity in Incomplete Information Games

Payoff Continuity in Incomplete Information Games journal of economic theory 82, 267276 (1998) article no. ET982418 Payoff Continuity in Incomplete Information Games Atsushi Kajii* Institute of Policy and Planning Sciences, University of Tsukuba, 1-1-1

More information

Definitions and Proofs

Definitions and Proofs Giving Advice vs. Making Decisions: Transparency, Information, and Delegation Online Appendix A Definitions and Proofs A. The Informational Environment The set of states of nature is denoted by = [, ],

More information

WEAKLY DOMINATED STRATEGIES: A MYSTERY CRACKED

WEAKLY DOMINATED STRATEGIES: A MYSTERY CRACKED WEAKLY DOMINATED STRATEGIES: A MYSTERY CRACKED DOV SAMET Abstract. An informal argument shows that common knowledge of rationality implies the iterative elimination of strongly dominated strategies. Rationality

More information

NEGOTIATION-PROOF CORRELATED EQUILIBRIUM

NEGOTIATION-PROOF CORRELATED EQUILIBRIUM DEPARTMENT OF ECONOMICS UNIVERSITY OF CYPRUS NEGOTIATION-PROOF CORRELATED EQUILIBRIUM Nicholas Ziros Discussion Paper 14-2011 P.O. Box 20537, 1678 Nicosia, CYPRUS Tel.: +357-22893700, Fax: +357-22895028

More information

Computing Minmax; Dominance

Computing Minmax; Dominance Computing Minmax; Dominance CPSC 532A Lecture 5 Computing Minmax; Dominance CPSC 532A Lecture 5, Slide 1 Lecture Overview 1 Recap 2 Linear Programming 3 Computational Problems Involving Maxmin 4 Domination

More information

Known Unknowns: Power Shifts, Uncertainty, and War.

Known Unknowns: Power Shifts, Uncertainty, and War. Known Unknowns: Power Shifts, Uncertainty, and War. Online Appendix Alexandre Debs and Nuno P. Monteiro May 10, 2016 he Appendix is structured as follows. Section 1 offers proofs of the formal results

More information

SF2972 Game Theory Exam with Solutions March 15, 2013

SF2972 Game Theory Exam with Solutions March 15, 2013 SF2972 Game Theory Exam with s March 5, 203 Part A Classical Game Theory Jörgen Weibull and Mark Voorneveld. (a) What are N, S and u in the definition of a finite normal-form (or, equivalently, strategic-form)

More information

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria 12. LOCAL SEARCH gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley h ttp://www.cs.princeton.edu/~wayne/kleinberg-tardos

More information

Learning Approaches to the Witsenhausen Counterexample From a View of Potential Games

Learning Approaches to the Witsenhausen Counterexample From a View of Potential Games Learning Approaches to the Witsenhausen Counterexample From a View of Potential Games Na Li, Jason R. Marden and Jeff S. Shamma Abstract Since Witsenhausen put forward his remarkable counterexample in

More information

Learning ε-pareto Efficient Solutions With Minimal Knowledge Requirements Using Satisficing

Learning ε-pareto Efficient Solutions With Minimal Knowledge Requirements Using Satisficing Learning ε-pareto Efficient Solutions With Minimal Knowledge Requirements Using Satisficing Jacob W. Crandall and Michael A. Goodrich Computer Science Department Brigham Young University Provo, UT 84602

More information

A Modified Q-Learning Algorithm for Potential Games

A Modified Q-Learning Algorithm for Potential Games Preprints of the 19th World Congress The International Federation of Automatic Control A Modified Q-Learning Algorithm for Potential Games Yatao Wang Lacra Pavel Edward S. Rogers Department of Electrical

More information

Efficient Sensor Network Planning Method. Using Approximate Potential Game

Efficient Sensor Network Planning Method. Using Approximate Potential Game Efficient Sensor Network Planning Method 1 Using Approximate Potential Game Su-Jin Lee, Young-Jin Park, and Han-Lim Choi, Member, IEEE arxiv:1707.00796v1 [cs.gt] 4 Jul 2017 Abstract This paper addresses

More information

Area I: Contract Theory Question (Econ 206)

Area I: Contract Theory Question (Econ 206) Theory Field Exam Winter 2011 Instructions You must complete two of the three areas (the areas being (I) contract theory, (II) game theory, and (III) psychology & economics). Be sure to indicate clearly

More information

Preliminary Results on Social Learning with Partial Observations

Preliminary Results on Social Learning with Partial Observations Preliminary Results on Social Learning with Partial Observations Ilan Lobel, Daron Acemoglu, Munther Dahleh and Asuman Ozdaglar ABSTRACT We study a model of social learning with partial observations from

More information

CS364A: Algorithmic Game Theory Lecture #16: Best-Response Dynamics

CS364A: Algorithmic Game Theory Lecture #16: Best-Response Dynamics CS364A: Algorithmic Game Theory Lecture #16: Best-Response Dynamics Tim Roughgarden November 13, 2013 1 Do Players Learn Equilibria? In this lecture we segue into the third part of the course, which studies

More information

Realization Plans for Extensive Form Games without Perfect Recall

Realization Plans for Extensive Form Games without Perfect Recall Realization Plans for Extensive Form Games without Perfect Recall Richard E. Stearns Department of Computer Science University at Albany - SUNY Albany, NY 12222 April 13, 2015 Abstract Given a game in

More information

Monotonic ɛ-equilibria in strongly symmetric games

Monotonic ɛ-equilibria in strongly symmetric games Monotonic ɛ-equilibria in strongly symmetric games Shiran Rachmilevitch April 22, 2016 Abstract ɛ-equilibrium allows for worse actions to be played with higher probability than better actions. I introduce

More information

Equilibria in Games with Weak Payoff Externalities

Equilibria in Games with Weak Payoff Externalities NUPRI Working Paper 2016-03 Equilibria in Games with Weak Payoff Externalities Takuya Iimura, Toshimasa Maruta, and Takahiro Watanabe October, 2016 Nihon University Population Research Institute http://www.nihon-u.ac.jp/research/institute/population/nupri/en/publications.html

More information

Chapter 9. Mixed Extensions. 9.1 Mixed strategies

Chapter 9. Mixed Extensions. 9.1 Mixed strategies Chapter 9 Mixed Extensions We now study a special case of infinite strategic games that are obtained in a canonic way from the finite games, by allowing mixed strategies. Below [0, 1] stands for the real

More information

COORDINATION AND EQUILIBRIUM SELECTION IN GAMES WITH POSITIVE NETWORK EFFECTS

COORDINATION AND EQUILIBRIUM SELECTION IN GAMES WITH POSITIVE NETWORK EFFECTS COORDINATION AND EQUILIBRIUM SELECTION IN GAMES WITH POSITIVE NETWORK EFFECTS Alexander M. Jakobsen B. Curtis Eaton David Krause August 31, 2009 Abstract When agents make their choices simultaneously,

More information

The No-Regret Framework for Online Learning

The No-Regret Framework for Online Learning The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,

More information

Mixed Nash Equilibria

Mixed Nash Equilibria lgorithmic Game Theory, Summer 2017 Mixed Nash Equilibria Lecture 2 (5 pages) Instructor: Thomas Kesselheim In this lecture, we introduce the general framework of games. Congestion games, as introduced

More information

Game Theory and Algorithms Lecture 7: PPAD and Fixed-Point Theorems

Game Theory and Algorithms Lecture 7: PPAD and Fixed-Point Theorems Game Theory and Algorithms Lecture 7: PPAD and Fixed-Point Theorems March 17, 2011 Summary: The ultimate goal of this lecture is to finally prove Nash s theorem. First, we introduce and prove Sperner s

More information

Notes on Blackwell s Comparison of Experiments Tilman Börgers, June 29, 2009

Notes on Blackwell s Comparison of Experiments Tilman Börgers, June 29, 2009 Notes on Blackwell s Comparison of Experiments Tilman Börgers, June 29, 2009 These notes are based on Chapter 12 of David Blackwell and M. A.Girshick, Theory of Games and Statistical Decisions, John Wiley

More information

Microeconomics. 2. Game Theory

Microeconomics. 2. Game Theory Microeconomics 2. Game Theory Alex Gershkov http://www.econ2.uni-bonn.de/gershkov/gershkov.htm 18. November 2008 1 / 36 Dynamic games Time permitting we will cover 2.a Describing a game in extensive form

More information

Security Against Impersonation Attacks in Distributed Systems

Security Against Impersonation Attacks in Distributed Systems 1 Security Against Impersonation Attacks in Distributed Systems Philip N. Brown, Holly P. Borowski, and Jason R. Marden Abstract In a multi-agent system, transitioning from a centralized to a distributed

More information

Political Economy of Institutions and Development: Problem Set 1. Due Date: Thursday, February 23, in class.

Political Economy of Institutions and Development: Problem Set 1. Due Date: Thursday, February 23, in class. Political Economy of Institutions and Development: 14.773 Problem Set 1 Due Date: Thursday, February 23, in class. Answer Questions 1-3. handed in. The other two questions are for practice and are not

More information

Interacting Vehicles: Rules of the Game

Interacting Vehicles: Rules of the Game Chapter 7 Interacting Vehicles: Rules of the Game In previous chapters, we introduced an intelligent control method for autonomous navigation and path planning. The decision system mainly uses local information,

More information

Performance Analysis of Trial and Error Algorithms

Performance Analysis of Trial and Error Algorithms 1 Performance Analysis of Trial and Error Algorithms Jérôme Gaveau, Student Member, IEEE, Christophe J. Le Martret, Senior Member, IEEE and Mohamad Assaad, Senior Member, IEEE, arxiv:1711.01788v1 [cs.gt]

More information

Appendix of Homophily in Peer Groups The Costly Information Case

Appendix of Homophily in Peer Groups The Costly Information Case Appendix of Homophily in Peer Groups The Costly Information Case Mariagiovanna Baccara Leeat Yariv August 19, 2012 1 Introduction In this Appendix we study the information sharing application analyzed

More information

Correlated Equilibria: Rationality and Dynamics

Correlated Equilibria: Rationality and Dynamics Correlated Equilibria: Rationality and Dynamics Sergiu Hart June 2010 AUMANN 80 SERGIU HART c 2010 p. 1 CORRELATED EQUILIBRIA: RATIONALITY AND DYNAMICS Sergiu Hart Center for the Study of Rationality Dept

More information

Refinements - change set of equilibria to find "better" set of equilibria by eliminating some that are less plausible

Refinements - change set of equilibria to find better set of equilibria by eliminating some that are less plausible efinements efinements - change set of equilibria to find "better" set of equilibria by eliminating some that are less plausible Strategic Form Eliminate Weakly Dominated Strategies - Purpose - throwing

More information

Learning by (limited) forward looking players

Learning by (limited) forward looking players Learning by (limited) forward looking players Friederike Mengel Maastricht University April 2009 Abstract We present a model of adaptive economic agents who are k periods forward looking. Agents in our

More information

Notes on Coursera s Game Theory

Notes on Coursera s Game Theory Notes on Coursera s Game Theory Manoel Horta Ribeiro Week 01: Introduction and Overview Game theory is about self interested agents interacting within a specific set of rules. Self-Interested Agents have

More information

Lecture December 2009 Fall 2009 Scribe: R. Ring In this lecture we will talk about

Lecture December 2009 Fall 2009 Scribe: R. Ring In this lecture we will talk about 0368.4170: Cryptography and Game Theory Ran Canetti and Alon Rosen Lecture 7 02 December 2009 Fall 2009 Scribe: R. Ring In this lecture we will talk about Two-Player zero-sum games (min-max theorem) Mixed

More information

Computing Equilibria of Repeated And Dynamic Games

Computing Equilibria of Repeated And Dynamic Games Computing Equilibria of Repeated And Dynamic Games Şevin Yeltekin Carnegie Mellon University ICE 2012 July 2012 1 / 44 Introduction Repeated and dynamic games have been used to model dynamic interactions

More information

: Cryptography and Game Theory Ran Canetti and Alon Rosen. Lecture 8

: Cryptography and Game Theory Ran Canetti and Alon Rosen. Lecture 8 0368.4170: Cryptography and Game Theory Ran Canetti and Alon Rosen Lecture 8 December 9, 2009 Scribe: Naama Ben-Aroya Last Week 2 player zero-sum games (min-max) Mixed NE (existence, complexity) ɛ-ne Correlated

More information

Game Theory for Linguists

Game Theory for Linguists Fritz Hamm, Roland Mühlenbernd 4. Mai 2016 Overview Overview 1. Exercises 2. Contribution to a Public Good 3. Dominated Actions Exercises Exercise I Exercise Find the player s best response functions in

More information

On the errors introduced by the naive Bayes independence assumption

On the errors introduced by the naive Bayes independence assumption On the errors introduced by the naive Bayes independence assumption Author Matthijs de Wachter 3671100 Utrecht University Master Thesis Artificial Intelligence Supervisor Dr. Silja Renooij Department of

More information

BELIEFS & EVOLUTIONARY GAME THEORY

BELIEFS & EVOLUTIONARY GAME THEORY 1 / 32 BELIEFS & EVOLUTIONARY GAME THEORY Heinrich H. Nax hnax@ethz.ch & Bary S. R. Pradelski bpradelski@ethz.ch May 15, 217: Lecture 1 2 / 32 Plan Normal form games Equilibrium invariance Equilibrium

More information

Mechanism Design: Basic Concepts

Mechanism Design: Basic Concepts Advanced Microeconomic Theory: Economics 521b Spring 2011 Juuso Välimäki Mechanism Design: Basic Concepts The setup is similar to that of a Bayesian game. The ingredients are: 1. Set of players, i {1,

More information

Prediction and Playing Games

Prediction and Playing Games Prediction and Playing Games Vineel Pratap vineel@eng.ucsd.edu February 20, 204 Chapter 7 : Prediction, Learning and Games - Cesa Binachi & Lugosi K-Person Normal Form Games Each player k (k =,..., K)

More information

1 Lattices and Tarski s Theorem

1 Lattices and Tarski s Theorem MS&E 336 Lecture 8: Supermodular games Ramesh Johari April 30, 2007 In this lecture, we develop the theory of supermodular games; key references are the papers of Topkis [7], Vives [8], and Milgrom and

More information

Lecture Notes on Bargaining

Lecture Notes on Bargaining Lecture Notes on Bargaining Levent Koçkesen 1 Axiomatic Bargaining and Nash Solution 1.1 Preliminaries The axiomatic theory of bargaining originated in a fundamental paper by Nash (1950, Econometrica).

More information

Wars of Attrition with Budget Constraints

Wars of Attrition with Budget Constraints Wars of Attrition with Budget Constraints Gagan Ghosh Bingchao Huangfu Heng Liu October 19, 2017 (PRELIMINARY AND INCOMPLETE: COMMENTS WELCOME) Abstract We study wars of attrition between two bidders who

More information

Lecture 9 Classification of States

Lecture 9 Classification of States Lecture 9: Classification of States of 27 Course: M32K Intro to Stochastic Processes Term: Fall 204 Instructor: Gordan Zitkovic Lecture 9 Classification of States There will be a lot of definitions and

More information

Computation of Efficient Nash Equilibria for experimental economic games

Computation of Efficient Nash Equilibria for experimental economic games International Journal of Mathematics and Soft Computing Vol.5, No.2 (2015), 197-212. ISSN Print : 2249-3328 ISSN Online: 2319-5215 Computation of Efficient Nash Equilibria for experimental economic games

More information

Reinforcement Learning

Reinforcement Learning 5 / 28 Reinforcement Learning Based on a simple principle: More likely to repeat an action, if it had to a positive outcome. 6 / 28 Reinforcement Learning Idea of reinforcement learning first formulated

More information

A Game-Theoretic Analysis of Games with a Purpose

A Game-Theoretic Analysis of Games with a Purpose A Game-Theoretic Analysis of Games with a Purpose The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters. Citation Published Version

More information

Possibility and Impossibility of Learning with Limited Behavior Rules

Possibility and Impossibility of Learning with Limited Behavior Rules Possibility and Impossibility of Learning with Limited Behavior Rules Takako Fujiwara-Greve Dept. of Economics Keio University, Tokyo, Japan and Norwegian School of Management BI, Sandvika, Norway and

More information

A Generic Bound on Cycles in Two-Player Games

A Generic Bound on Cycles in Two-Player Games A Generic Bound on Cycles in Two-Player Games David S. Ahn February 006 Abstract We provide a bound on the size of simultaneous best response cycles for generic finite two-player games. The bound shows

More information

Distributed Optimization. Song Chong EE, KAIST

Distributed Optimization. Song Chong EE, KAIST Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links

More information

Markov Processes Hamid R. Rabiee

Markov Processes Hamid R. Rabiee Markov Processes Hamid R. Rabiee Overview Markov Property Markov Chains Definition Stationary Property Paths in Markov Chains Classification of States Steady States in MCs. 2 Markov Property A discrete

More information

Designing Games for Distributed Optimization

Designing Games for Distributed Optimization Designing Games for Distributed Optimization Na Li and Jason R. Marden Abstract The central goal in multiagent systems is to design local control laws for the individual agents to ensure that the emergent

More information