The Dynamics of Generalized Reinforcement Learning


Ratul Lahkar and Robert M. Seymour

August 21, 2012

Abstract

We consider reinforcement learning in games with both positive and negative payoffs. The Cross rule is the prototypical reinforcement learning rule in games that have only positive payoffs. We extend this rule to incorporate negative payoffs to obtain the generalized reinforcement learning rule. Applying this rule to a population game, we obtain the generalized reinforcement dynamic which describes the evolution of mixed strategies of agents in the population. We show that pure strategy Nash equilibria in negative payoffs are not stationary points of this dynamic. Therefore, in simple two strategy games like the stag hunt and the prisoner's dilemma, the population moves away from a Pareto inferior Nash equilibrium in negative payoffs towards a more cooperative state. Finally, simulations reveal convergence of the dynamic to interior stationary points in all monocyclic games including the bad RSP game.

Keywords: Reinforcement learning; Negative reinforcement; Generalized Reinforcement Dynamic.

JEL classification: C72; C73.

IFMR, 24, Kothari Road, Nungambakkam, Chennai, India. r.lahkar@ifmr.ac.in. My coauthor passed away on July 24. He was fully involved in the initial stages of this paper when we were discussing the idea and preparing the preliminary draft. Unfortunately, he expired before we could finish the final version of the paper. I dedicate this paper to his memory.

Deceased. Formerly, Professor Emeritus of Mathematics, University College London.

1 Introduction

Learning and evolutionary game theory seek to provide foundations for social equilibrium behavior based on more realistic norms of human behavior. Reinforcement learning models form a significant part of the literature in these fields of research. Such models are based on the general psychological principle that the higher the benefit from using an action in the past, the greater is its likelihood in the present (Estes, 1950; Estes and Burke, 1953; Bush and Mosteller, 1951a, 1951b). More formally, in reinforcement models, an agent carries an internal mixed strategy, construed as the agent's behavioral disposition. If an action has yielded a high payoff in the past, then the probability assigned to it increases in the present; or the behavior associated with the action gets reinforced. Young (2004) provides a review of the several variants of strategic models of reinforcement learning that have been developed around this general principle. Experimental tests on some of these models (for example, Roth and Erev, 1995; Erev and Roth, 1998) have also yielded significant support for the predictions of reinforcement learning.

In reinforcement learning models, payoffs in a game are not interpreted as von Neumann Morgenstern utilities. Instead, as emphasized in Börgers and Sarin (1997), they represent reinforcement stimuli. The payoffs parameterize the direction and magnitude by which an agent's behavioral disposition changes in light of his experience. The agent's experience consists of the action he has chosen and the payoff he consequently receives. A common feature of reinforcement models has been that they allow agents to only experience positive payoffs. Hence, in such models, actions are always positively reinforced in proportion to the magnitude of the payoff obtained. For example, Börgers and Sarin (1997, 2000) consider a model in which all payoffs exceed an exogenously fixed aspiration parameter common to all agents. Whichever action an agent may play, his aspiration is always satisfied and therefore, there is no stimulus to reduce the propensity of that action. Instead, the probability of that action always increases. By normalizing the aspiration parameter to zero, Börgers and Sarin (1997, 2000) arrive at the canonical learning model of Cross (1973) in which positive payoffs signify fulfilment of aspiration. In this model, the current probability of an action increases by a fraction, equal to the payoff obtained, of the residual probability that had been assigned to the other actions.

In this paper, we generalize the principle of reinforcement to incorporate the idea of negative reinforcement. We consider games which may contain both positive and negative payoffs, with all payoffs lying between −1 and 1.¹ Positive payoffs represent, as usual, positive reinforcement stimuli. However, if an action yields a negative payoff, then we allow its probability to decrease. Negative payoffs, therefore, serve as negative reinforcement stimuli which reduce an agent's propensity to use that action in his next round of play. We formalize negative reinforcement with a rule analogous to the Cross (1973) rule of positive reinforcement.

1 This is a technical assumption required to ensure that strategy revision under generalized reinforcement generates a sensible probability distribution. It is a generalization of the assumption in the Cross rule of positive reinforcement that all payoffs are between 0 and 1.

Whereas the Cross rule transfers a fraction of the residual probability to the current action, our rule of negative reinforcement diverts a part of the current action's probability to other actions. Therefore, the probability of the current action declines in proportion to the negative payoff it generates. Following Börgers and Sarin (1997, 2000), we interpret negative reinforcement as arising from payoffs falling short of a common aspiration parameter of zero. A negative payoff, therefore, represents failure to meet aspiration, which acts as a stimulus to reduce the propensity of the action that generates that payoff. We then combine the Cross rule with the negative reinforcement rule to obtain our rule of generalized reinforcement. In the generalized reinforcement rule, an action that yields a positive payoff gets positively reinforced as per the Cross rule. On the other hand, if the action yields a negative payoff, it gets negatively reinforced in a manner analogous to the Cross rule.

We analyze our model of generalized reinforcement in the setting of a population game. In this setting, members of a large population are repeatedly randomly matched in pairs to play a two player symmetric normal form game. Every agent in the population revises strategy using the generalized reinforcement rule. Our objective is to assess the impact of such individual behavior on aggregate population behavior. To make our analysis tractable, we start with the critical assumption that all agents use the same mixed strategy in the first round of matching in the population game. We also follow the standard procedure, as in Börgers and Sarin (1997), of formulating our reinforcement rule in a way such that the extent of strategy revision declines as the duration of each matching declines. With these assumptions, we establish that as the duration of each matching becomes vanishingly small, the evolution of the mixed strategy of each agent in the population is approximated arbitrarily well by the solution trajectories of an ordinary differential equation system we call the generalized reinforcement (GR) dynamic. Due to our large population setting, we may also interpret the solution of this dynamic as representing the distribution of agents in the population across the different actions. This interpretation corresponds to the conventional definition of a population state in evolutionary game theory models.

For the special case in which all payoffs in the game are between 0 and 1, the GR dynamic is identical to the replicator dynamic. This recovers the result in Börgers and Sarin (1997) that the expected change in mixed strategy generated by the Cross rule of positive reinforcement is given by the replicator dynamic.² However, if some payoffs are negative, the GR dynamic differs in certain crucial aspects from the replicator dynamic. The most striking difference is that not all Nash equilibria of the game are stationary points of the GR dynamic. For example, in a pure Nash equilibrium in negative payoffs, negative reinforcement reduces the propensity of the equilibrium action. Therefore, such a Nash equilibrium is not a stationary point of the dynamic. Generalized reinforcement, therefore, has the important implication that it enables a society to move away from an equilibrium that fails to meet the aspiration levels of the members of the society.

2 We note that in the Börgers and Sarin (1997) model, the same two players are repeatedly matched to play the game, with each player updating strategies using the Cross rule. Our model, on the other hand, is a population game model in which, in each new matching, a player encounters a new opponent. However, due to our assumption that all players start by using the same strategy, we are able to follow the same technical procedure as in Börgers and Sarin (1997) to generate the GR dynamic from the generalized reinforcement strategy revision process.

We also show that interior stationary points of the dynamic do not typically correspond to the mixed equilibria of the game. However, monomorphic states, including pure equilibria, in positive payoffs do represent stationary points of the dynamic.

We apply the GR dynamic to 2 × 2 normal form games. The application is not straightforward because the dynamic depends intricately on the distribution of positive and negative signs among the payoff parameters of the game. We use this analysis to discuss the operation of the dynamic in certain interesting 2 × 2 games, for example, the stag hunt game and the prisoner's dilemma game. These games are used as prototypical models of the problem of establishing cooperation in society. The stag hunt has two pure equilibria, one of them payoff dominated. Under a conventional evolutionary dynamic like the replicator dynamic, society may get trapped in the inferior equilibrium if it starts near that equilibrium. In the prisoner's dilemma, the unique Nash equilibrium is payoff dominated but attracts the replicator dynamic. However, under the generalized reinforcement paradigm, if the inferior equilibrium is in negative payoffs, it cannot be an attracting state. Instead, society is able to move away from that state towards one where the cooperative action may have a significant presence. This suggests that if a status quo equilibrium fails to meet aspiration in society, generalized reinforcement is able to facilitate greater cooperation. But we need to modify this conclusion if the inferior equilibrium is in positive payoffs, in which case it may remain attracting even under generalized reinforcement.

Our final application of the GR dynamic is to a class of n-strategy games known as monocyclic games (Hofbauer, 1995). This class of games is characterized by a unique mixed strategy equilibrium and no pure equilibria. An example of such games is a Rock-Scissors-Paper (RSP) game. We focus on the bad RSP game where the loss from losing is greater than the gain from winning. The bad RSP game is interesting because all standard evolutionary dynamics, including the replicator dynamic, display non-convergence to the interior equilibrium. Instead, such dynamics cycle away from the equilibrium. However, our simulations suggest that the GR dynamic globally converges to its interior stationary point, which in this case happens to coincide with the Nash equilibrium. Indeed, these simulations show convergence in all RSP games. This gives rise to the conjecture that the GR dynamic converges in all monocyclic games, including such monocyclic games in which other dynamics display cyclical behavior. It is important to note that such convergence typically would not imply convergence to mixed equilibrium. Unfortunately, we have not been able to rigorously prove this conjecture. However, simulations on another such game where other dynamics are known to cycle lend support to this conjecture.

The rest of the paper is organized as follows. Section 2 introduces generalized reinforcement. Section 3 derives the GR dynamic and Section 4 analyzes some of its properties, including the stationary points of the dynamic. After the general analysis of 2 × 2 games in Section 5, we consider four categories of such games: games with two pure equilibria, the prisoner's dilemma, games with payoff dominant equilibria in dominant strategies, and the hawk-dove game, in Sections 6-9. The stag hunt game belongs to the first of these categories. Section 10 is on the application to monocyclic games. Section 11 concludes.

2 Generalized Reinforcement Learning

Let U be an n × n symmetric two player normal form game. The game has the set of pure actions A = {A_1, A_2, ..., A_n}. We denote by u_{ij} the payoff to the row player when the row player plays action A_i and the column player plays A_j.³ We assume that u_{ij} ∈ [−1, 1] for all A_i, A_j ∈ A. We consider a population consisting of a continuum of agents who are randomly matched to play this game. We refer to the game U, the population and the random matching framework as a population game. Each agent in the population carries a mixed strategy, interpreted as their behavioral disposition, which they use to choose their pure action when called upon to do so. We denote the set of mixed strategies of an agent by

Δ = { x ∈ R^n : x_i ≥ 0 for each A_i ∈ A, with Σ_{i=1}^n x_i = 1 }.   (1)

At time t, all agents are randomly matched in pairs and each pair plays the game. Matchings last for a period τ, 0 < τ ≤ 1, after which they are rearranged. Once matched, an agent adopts a mixed strategy, which he then (possibly) revises for use during the next matching. In revising his strategy, the agent uses some method of heuristic learning by recalling his personal experience in his previous matching. The agent can observe the actions used by his opponents in the past, or at least infer them from the payoffs received, but has no knowledge of the strategy used by his opponents.

We may describe the process of heuristic learning in the population game as follows. Suppose that in a current matching, a player using strategy x plays A_i and his opponent uses A_j. The row player updates his strategy to x' given by

x' = x + τ f_{ij}(x).   (2)

We call f_{ij}(x) the potential strategy revision function. With this interpretation, (2) implies that the proportion of the total strategy revision potential that is realized depends upon τ, the time difference between two successive pairwise matchings in the population game. In particular, we are assuming that as τ → 0, the difference between x' and x also becomes infinitesimally small. This assumption becomes crucial when we derive the continuous time limit of the population dynamics generated by a strategy revision rule of the form (2). We assume that the strategy revision function f_{ij}(x) satisfies the following properties:

1. f_{ij}(x) extends to a differentiable function f_{ij} : R^n → R^n.

2. Σ_{r=1}^n f_{ij,r}(x) = 0 for x ∈ Δ.

3. x_r + f_{ij,r}(x) ≥ 0 for all 1 ≤ r ≤ n, x ∈ Δ.

Condition 1 is a technical property. The other two conditions ensure that the new strategy x' ∈ Δ. The strategy revision function can take any form subject to the three conditions.

3 We confine ourselves to two-player symmetric games merely for notational convenience. All the ideas involved can be extended easily to multi-player symmetric as well as asymmetric games at the cost of more cumbersome notation.

Our main interest is in the case when agents use generalized reinforcement learning to update their strategies. In order to introduce generalized reinforcement, we consider three cases depending upon the range of values the payoffs can take. The first is the well known case of positive reinforcement where all payoffs are positive. In the second case, we introduce negative reinforcement to accommodate the case where all payoffs are negative. The third case combines positive and negative reinforcement to yield generalized reinforcement, where payoffs can be both positive and negative.

2.1 Positive Reinforcement

Let 0 < u_{ij} ≤ 1 for all A_i, A_j ∈ A. Conventional models of reinforcement learning have only considered the case where all payoffs are positive. In such models, payoffs are interpreted as positive stimuli which serve to increase the likelihood of an action the agent uses. The most well known rule of positive reinforcement is the Cross (1973) rule. To specify this rule, we assume an agent with strategy x plays A_i and encounters an opponent using A_j. Under the Cross rule, the agent then revises his strategy to x' given by

x'_i = x_i + τ u_{ij}(1 − x_i),   (3)
x'_k = x_k − τ u_{ij} x_k,   k ≠ i.   (4)

We may also describe this rule by specifying the strategy revision function as

f_{ij}(x) = (e_i − x) u_{ij},   (5)

where e_i ∈ R^n is the i-th standard basis vector. The Cross rule has been extensively analyzed, for example, in Börgers and Sarin (1997, 2000) and Börgers et al. (2004). Under the Cross rule, the likelihood of the action the agent currently uses always increases. The rule transfers a fraction τu_{ij} of the probability (1 − x_i) assigned to other actions to A_i. Therefore, the higher the payoff obtained from using A_i, the greater is the increase in its likelihood in the next opportunity the agent gets to play the game.

Börgers and Sarin (2000) interpret the Cross rule in terms of the aspiration level of an agent. Suppose that an agent aspires to a payoff of s ∈ [0, 1]. The probability of playing an action A_k ≠ A_i is then x'_k = x_k + (s − τu_{ij}) x_k. Hence, if u_{ij} > s, then A_i gets reinforced. Setting s = 0 for all agents, we obtain the Cross rule.
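As a concrete illustration (our own sketch, not code from the paper; the function name and example numbers are ours), the Cross rule (3)-(4) can be written in a few lines of Python: a fraction τu_{ij} of the mass on every other action is transferred to the action just played.

```python
import numpy as np

def cross_update(x, i, u_ij, tau):
    """Cross (1973) rule: positively reinforce action i after receiving payoff u_ij.

    x     : current mixed strategy (array summing to 1)
    i     : index of the action just played
    u_ij  : realized payoff, assumed to lie in (0, 1]
    tau   : duration of a matching, 0 < tau <= 1
    """
    x_new = x - tau * u_ij * x                  # rule (4): every action loses the fraction tau*u_ij
    x_new[i] = x[i] + tau * u_ij * (1 - x[i])   # rule (3): that mass is transferred to action i
    return x_new

# Example: three actions, action 0 played, payoff 0.5, tau = 0.1.
x = np.array([0.2, 0.3, 0.5])
print(cross_update(x, 0, 0.5, 0.1))             # -> [0.24, 0.285, 0.475], still on the simplex
```

Since 0 < τu_{ij} ≤ 1, the update never leaves the simplex, in line with conditions 2 and 3 above.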

2.2 Negative Reinforcement

We now extend the fundamental idea behind the Cross rule to develop the notion of negative reinforcement when all payoffs are strictly negative. Let −1 ≤ u_{ij} < 0 for all A_i, A_j ∈ A. The payoffs now represent negative stimuli which, by a logical extension of the notion of positive reinforcement, should act to reduce the likelihood of the current action. Positive reinforcement rewards a current action by transferring a fraction of the remaining probability to the current action. In contrast, negative reinforcement should penalize the current action by shifting some of its probability to other actions. Of course, x_i cannot decrease below zero, so it is reasonable to expect that x_i decreases by an amount proportional to x_i. The proportion, in turn, is determined by the payoff obtained from the action.

To formalize this notion, we assume as before that an agent with strategy x plays A_i and encounters an opponent using A_j. The updated probability of A_i is then x'_i given by

x'_i = x_i + τ u_{ij} x_i.   (6)

Since −1 ≤ u_{ij} < 0, x'_i < x_i. Therefore, f_{ij,i}(x) = u_{ij} x_i < 0. While x_i falls, we expect x_k for k ≠ i to compensate by increasing. Clearly x'_k cannot increase to more than 1. So it is reasonable to assume that x_k increases by an amount proportional to 1 − x_k. That is, we assume that

x'_k = x_k − τ α u_{ij}(1 − x_k),   k ≠ i.   (7)

Hence, f_{ij,k}(x) = −α u_{ij}(1 − x_k) for all A_k ≠ A_i. The value of α is determined by the requirement that Σ_{l=1}^n f_{ij,l}(x) = 0, or Σ_{r=1}^n x'_r = 1. From (6) and (7), this condition gives

α {(n − 1) − (1 − x_i)} = x_i,

from which we obtain

α = α(x_i) = x_i / (n − 2 + x_i).   (8)

From (6), (7) and (8), we obtain the reinforcement learning rule for negative payoffs:

x'_i = x_i + τ u_{ij} x_i,
x'_k = x_k − τ u_{ij} [x_i / (n − 2 + x_i)](1 − x_k),   k ≠ i.   (9)

We note that we can also write (9) as x'_k = x_k − τ u_{ij} x_i (1 − x_k) / Σ_{l≠i}(1 − x_l). Therefore, of the total mass −τ u_{ij} x_i available for redistribution among the other actions, A_k receives the fraction (1 − x_k) / Σ_{l≠i}(1 − x_l). The lower is x_k, the higher is the increase in its mass following the redistribution.

In positive reinforcement, we note that x_i = 1 implies x'_i = 1. Once an action is played with certainty, it simply gets repeated on all future occasions. However, this is not so under negative reinforcement. From (6), even if x_i = 1, u_{ij} < 0 implies x'_i < 1. This, in turn, means that all other strategies x_k are reinforced to the positive value −τ u_{ij}(1/(n − 1)). Thus, actions that were not previously available become so. Therefore, negative reinforcement allows an agent to revise his strategy in the future even when he currently plays an action with complete certainty. As we show later, this has significant implications on the population dynamics under generalized reinforcement. In particular, it allows a population to escape from a monomorphic state, which may be a Nash equilibrium, if the payoff in that state is negative.
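To see the escape mechanism in numbers (an illustration with values of our own choosing, not an example from the paper), take n = 3 actions and suppose an agent plays A_1 with certainty, x = (1, 0, 0), receives the payoff u_{1j} = −0.5, and revises with τ = 0.2. Then α(x_1) = 1/(3 − 2 + 1) = 1/2, and (9) gives

x'_1 = 1 + (0.2)(−0.5)(1) = 0.9,
x'_2 = x'_3 = 0 − (0.2)(−0.5)(1/2)(1 − 0) = 0.05.

Each previously unused action thus picks up probability 0.05 = −τ u_{1j}/(n − 1), and the agent can experiment away from A_1 in subsequent matchings.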

8 that state has negative payoffs. In the standard interpretation of positive reinforcement, positive parameters represent the amount by which the payoffs exceed the aspiration level of zero. Since all payoffs are positive, agents always obtain more than the aspiration level irrespective of whichever action they use. Therefore, the reinforcement stimulus is always in the direction of increasing the likelihood of the current action. Furthermore, the greater the payoff, the higher is the positive reinforcement. We may provide an analogous interpretation of negative reinforcement. If we set an agent s aspiration level at zero, then negative parameters measure the degree to which the agent s payoffs fall short of the aspired level. Therefore, reinforcement stimulus acts towards reducing the likelihood of the current action with the fall in likelihood greater the lower the payoff. This interpretation also provides the intuition why the likelihood of A i can decline from x i = Generalized Reinforcement We now combine positive and negative reinforcement to define generalized reinforcement. This case applies when when a game has both positive or negative payoffs. We assume that for all actions A i, A j A, u ij [ 1, 1]. We now define the following parameters from the payoff matrix U. u + ij = 1 2 (u ij + u ij ), (10) u ij = 1 2 (u ij u ij ), (11) so that u ij = u + ij + u ij ; u ij = 0 if u ij > 0, and u + ij = 0 if u ij < 0. Therefore, if u ij > 0, then u + ij = u ij and u ij = 0. If, instead, u ij < 0, then u + ij = 0 and u ij = u ij. As in Cases 1 and 2, we assume an agent with strategy x plays A i and encounters an opponent using A j. Combining Cases 1 and 2, we then obtain the agent s updated strategy x i = = x i + τu + ij (1 x i) + τu ij x i, (12) x i x k = x k τu + ij x k τu ij (1 x k ), k i. (13) n 2 + x i In terms of the strategy revision function f, we have f ij,i (x) = u + ij (1 x i) + u ij x i, (14) x i f ij,r (x) = u + ij x r u ij (1 x r ), r i. (15) n 2 + x i Equations (12)-(13) define the rule of generalized reinforcement learning. The updated probability of an action, therefore, depends upon whether the payoff u ij the agent obtains is positive or negative. If u ij > 0 so that u ij = 0, then generalized reinforcement is equivalent to positive reinforcement. On the other hand, if u ij < 0 so that u + ij = 0, then generalized reinforcement reduces to negative reinforcement. We may extend the interpretation of reinforcement learning in terms of an aspiration level to 7

We may extend the interpretation of reinforcement learning in terms of an aspiration level to generalized reinforcement. We set the aspiration parameter at zero as usual. Then a game with both positive and negative payoffs implies that sometimes an agent obtains a higher than aspired payoff, whereas on other occasions the payoff is less than the aspiration level. If the payoff is more than zero, then positive reinforcement holds and the likelihood of the current action increases in the future. On the other hand, in the case of negative payoffs, negative reinforcement makes it less likely that the agent uses the current action in the future. This also implies that x_i may decline from 1 if the payoff obtained from using A_i is negative.

3 Generalized Reinforcement Dynamic

We identify a population state with a probability measure over Δ. Formally, a probability measure P is a population state such that P(A), A ⊆ Δ, denotes the proportion of the population using strategies in A. Our objective is to analyze the way the population state changes as agents revise their strategies using the generalized reinforcement rule (12)-(13) in successive matchings of the population game. A general solution to this problem can be quite complex since it would involve the analysis of a partial differential equation system in an abstract space of probability measures.⁴ Since the primary aim of this paper is to introduce generalized reinforcement, we adopt a simpler approach. In particular, we assume that at the initial time t = 0, all agents start with the same behavioral disposition x(0) = x_0. We may describe this as the population state δ_{x_0}, the Dirac distribution on x_0. As we argue below, with this simplifying assumption, we can analyze population dynamics using a much simpler ordinary differential equation system.

We first provide an intuitive explanation of this approach. Consider an agent during his first matching as he uses strategy x_0. Since the population state is δ_{x_0}, the opponent he encounters also uses x_0. As the agent then revises his strategy to x'_0, the expected change in his strategy is given by τL(x_0) ∈ R^n, where

L_k(x) = Σ_{i,j} x_i f_{ij,k}(x) x_j,   (16)

with f_{ij,k}(x) being the strategy revision function (12)-(13) under generalized reinforcement. The expected value of x'_0 is, therefore, x_0 + τL(x_0). Typically, this value will be different from x_0. However, the agent's adjustment of his strategy, τf_{ij}(x_0), slows down as the duration of a matching, τ, declines. Hence, as τ → 0, x_0 + τL(x_0) becomes an increasingly close approximation of x'_0. The argument applies to every agent. Since every agent starts with the same strategy, each agent's strategy at time τ is close to x_τ = x_0 + τL(x_0) if τ is sufficiently small. But then, a repetition of the earlier argument implies that in the next round of matching at time 2τ, every agent uses a strategy close to x_{2τ} = x_τ + τL(x_τ). We can continue this chain of reasoning for any finite number of matching rounds.

4 See Lahkar and Seymour (2012) for an application of this approach to the Cross learning rule of positive reinforcement in population games.
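The expected change in (16) can be computed directly by summing the revision function (14)-(15) over all action pairs with weights x_i x_j. The sketch below (our own code; the helper names and the 2 × 2 payoff values are ours) does exactly that and then takes the Euler step x + τL(x) used in the argument above.

```python
import numpy as np

def f_ij(x, i, j, U):
    """Potential strategy revision function (14)-(15) for the action pair (i, j)."""
    n = len(x)
    u_pos, u_neg = max(U[i, j], 0.0), min(U[i, j], 0.0)
    denom = n - 2 + x[i]
    alpha = x[i] / denom if denom > 0 else 0.0   # guard only matters when x[i] = 0 and n = 2
    f = -u_pos * x - u_neg * alpha * (1 - x)     # component (15) for r != i
    f[i] = u_pos * (1 - x[i]) + u_neg * x[i]     # component (14)
    return f

def L(x, U):
    """Expected change L(x) of equation (16)."""
    n = len(x)
    return sum(x[i] * x[j] * f_ij(x, i, j, U) for i in range(n) for j in range(n))

# One step of the intuitive argument: x at time tau is approximately x_0 + tau * L(x_0).
U = np.array([[ 0.5, -0.4],
              [-0.2,  0.3]])
x0 = np.array([0.6, 0.4])
tau = 0.01
print(x0 + tau * L(x0, U))
```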

Hence, if we have N rounds of matching and Nτ = T, then up to time T, the change in the mixed strategy x of an agent between any two matchings is well approximated by τL(x), where L(x) is given by (16). Formally, for τ small enough,

x'(mτ) − x(mτ) ≈ E(x'(mτ) − x(mτ)) = τ L(x(mτ)),   for m < N,

or x'(mτ) − x(mτ) ≈ τ L(x(mτ)). To make the approximation increasingly accurate, we take the continuous time limit of the strategy adjustment process as τ → 0. But τ is simply the time differential between two matchings. We, therefore, conclude that for t ∈ [0, T], the continuous time mixed strategy trajectory x(t) for any agent is given by the solution to the differential equation system

dx/dt = L(x),

with initial condition x(0).

To make this argument rigorous, we can follow the approach adopted by Börgers and Sarin (1997) to derive the replicator dynamic as the continuous time limit of the Cross rule of positive reinforcement learning. Their analysis is in the context of a learning model in which the same two players are repeatedly matched to play the game. But their proof can be easily adapted to the large population random matching context in our model. We, therefore, provide the formal statement of our result in Proposition 3.1 below while referring the reader to Proposition 1 in Börgers and Sarin (1997) for details of the proof.

Let x^τ(m) be the strategy of an agent in his m-th matching, with the superscript τ denoting the duration of a matching. We rewrite the generalized reinforcement rule (12)-(13) as

x^τ_i(m + 1) = x^τ_i(m) + τ u^+_{ij}(1 − x^τ_i(m)) + τ u^-_{ij} x^τ_i(m),   (17)
x^τ_k(m + 1) = x^τ_k(m) − τ u^+_{ij} x^τ_k(m) − τ u^-_{ij} [x^τ_i(m) / (n − 2 + x^τ_i(m))](1 − x^τ_k(m)),   k ≠ i.   (18)

With a slight abuse of notation, we treat x^τ as a random variable. We, therefore, obtain a Markov process {x^τ(m)}_{m ∈ N} in discrete time if we specify the initial value of the random variable x^τ(0) = x(0). This Markov process describes the process of strategy change of an agent in Δ. If we now assume that every agent starts with the strategy x(0) at time t = 0, then the same Markov process holds for every agent. Since each matching lasts for duration τ, the variable x^τ(m) describes the agent's strategy at time mτ. We are interested in the continuous time limiting behavior of the process as τ → 0. In the following proposition, we characterize this limit for some finite time T ≥ 0, under the conditions that τ → 0 and mτ → T.

Proposition 3.1 Suppose all agents revise strategies according to the generalized reinforcement learning rule (17)-(18). Further suppose that for all agents, x^τ(0) = x(0). Let T ∈ [0, ∞) and assume τ → 0 and mτ → T. Let x(t) be the solution to the differential equation

ẋ = L(x),   (19)

where L_k(x) = Σ_{i,j} x_i f_{ij,k}(x) x_j, with f_{ij}(x) being the potential strategy revision function (14)-(15) under generalized reinforcement. Then, x^τ(m) converges in probability to x(t) for every agent.

For the formal details of the proof, we refer the reader to Börgers and Sarin (1997). Their proof is a straightforward application of results from Norman (1972) on the continuous time limit of discrete time Markov processes with infinite state spaces. The proof requires that the function L_k(x) be polynomial, which, as we calculate below, is obviously the case.

We call (19) the generalized reinforcement dynamic or the GR dynamic. In order to obtain the precise form of the GR dynamic, we need to calculate the vector field L(x) corresponding to generalized reinforcement. For this purpose, we define two n × n matrices, both derived from the payoff matrix U. The first is U^+, whose components are the u^+_{ij} defined in (10). The second is U^-, consisting of the elements u^-_{ij} defined in (11). Therefore,

U^+ = [u^+_{ij}] such that u^+_{ij} = u_{ij} if u_{ij} > 0 and u^+_{ij} = 0 if u_{ij} ≤ 0,   (20)
U^- = [u^-_{ij}] such that u^-_{ij} = u_{ij} if u_{ij} < 0 and u^-_{ij} = 0 if u_{ij} ≥ 0.   (21)

Along with these matrices, we also use a property derived from the function α(x_i) defined in (8). Given a scalar h, we have from (8), α(h) = h/(n − 2 + h). Clearly, 0 ≤ α(h) ≤ 1/(n − 1) for all 0 ≤ h ≤ 1, and

h + (1 − h)α(h) = (n − 1)α(h).   (22)

With these preliminaries, we now compute L_k(x) = Σ_{i,j} x_i f_{ij,k}(x) x_j, with f_{ij}(x) being the potential strategy revision function under generalized reinforcement defined in (12)-(13).

L_k(x) = Σ_{i,j} x_i f_{ij,k}(x) x_j
       = Σ_j x_k f_{kj,k}(x) x_j + Σ_j Σ_{i≠k} x_i f_{ij,k}(x) x_j
       = Σ_j { x_k(1 − x_k) u^+_{kj} + x_k^2 u^-_{kj} − x_k Σ_{i≠k} x_i u^+_{ij} − (1 − x_k) Σ_{i≠k} α(x_i) x_i u^-_{ij} } x_j
       = Σ_j { x_k u^+_{kj} − x_k Σ_i x_i u^+_{ij} + x_k^2 u^-_{kj} + x_k(1 − x_k) α(x_k) u^-_{kj} − (1 − x_k) Σ_i α(x_i) x_i u^-_{ij} } x_j
       = x_k { Σ_j u^+_{kj} x_j − Σ_{i,j} x_i u^+_{ij} x_j } + Σ_j { x_k [x_k + (1 − x_k) α(x_k)] u^-_{kj} − (1 − x_k) Σ_i α(x_i) x_i u^-_{ij} } x_j
       = x_k {e_k − x} · U^+ x + Σ_j { (n − 1) α(x_k) x_k u^-_{kj} − (1 − x_k) Σ_i α(x_i) x_i u^-_{ij} } x_j   (from (22))
       = x_k {e_k − x} · U^+ x + { (n − 1)[α(x_k) x_k] e_k − (1 − x_k)[α(x)x] } · U^- x,

where α(x)x ∈ R^n is the vector with components α(x_i) x_i.
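The closed form just derived (restated as (23) below) is easy to check numerically against the direct definition (16). The following self-contained sketch (our own code and naming) evaluates both for a random game and confirms they coincide.

```python
import numpy as np

def L_closed(x, U):
    """GR vector field in the closed form derived above (equation (23))."""
    n = len(x)
    Up, Um = np.maximum(U, 0.0), np.minimum(U, 0.0)        # U^+ and U^- from (20)-(21)
    ax = (x / (n - 2 + x)) * x                             # vector alpha(x)x
    rep = x * (Up @ x - x @ Up @ x)                        # x_k {e_k - x}.U^+ x
    neg = (n - 1) * ax * (Um @ x) - (1 - x) * (ax @ Um @ x)
    return rep + neg

def L_direct(x, U):
    """Brute-force evaluation of (16) from the revision rule (14)-(15)."""
    n = len(x)
    out = np.zeros(n)
    for i in range(n):
        for j in range(n):
            up, um = max(U[i, j], 0.0), min(U[i, j], 0.0)
            a = x[i] / (n - 2 + x[i])
            f = -up * x - um * a * (1 - x)
            f[i] = up * (1 - x[i]) + um * x[i]
            out += x[i] * x[j] * f
    return out

rng = np.random.default_rng(0)
U = rng.uniform(-1, 1, size=(4, 4))
x = rng.dirichlet(np.ones(4))
print(np.allclose(L_closed(x, U), L_direct(x, U)))         # -> True
```

For a payoff matrix with no negative entries, the U^- term vanishes and L_closed is exactly the replicator vector field, as noted in the text below.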

We, therefore, obtain the vector field L(x) on Δ whose i-th component is

L_i(x) = x_i {e_i − x} · U^+ x + { (n − 1)[α(x_i) x_i] e_i − (1 − x_i)[α(x)x] } · U^- x,   (23)

for 1 ≤ i ≤ n. From (23), we obtain the GR dynamic ẋ = L(x), with L_i(x) being the rate at which the probability associated with action A_i, x_i, changes. Clearly, the first term in (23), involving U^+, is the standard replicator dynamics component associated with the positive payoffs. The second term is the component associated with the negative payoffs U^-.

By Proposition 3.1, we may interpret solutions to the GR dynamic in two equivalent ways. The initial common strategy implies that at all future times t, all agents play the same strategy x(t) given by the solution of the GR dynamic with initial point x(0). This is, of course, a limiting result obtained when the duration of each matching, τ, becomes vanishingly small. For τ that is small but remains positive, agents play different strategies at time t > 0, but all those strategies are extremely close to x(t). Equivalently, if the initial population state is δ_{x_0}, then at time t > 0, the population state is δ_{x(t)}, where x(t) is the solution to the GR dynamic from initial condition x(0). This also means x_i(t) is the proportion of agents who play action A_i at time t. Therefore, to simplify notation, in the following sections, we identify the population state δ_{x(t)} with the strategy x(t).

3.1 The case n = 2

We now derive the GR dynamic for the special case of a two strategy symmetric game. When the number of strategies n = 2, α(h) = 1 (see (22)), and hence α(x)x = x. Therefore, (23) reduces to

L_i(x) = x_i {e_i − x} · U^+ x + {x_i e_i − (1 − x_i) x} · U^- x,   i = 1, 2.   (24)

However, since x_1 + x_2 = 1, only one of these two equations is independent. Writing x = (x_1, x_2) = (x, 1 − x), we may then specify the GR dynamic completely by ẋ = L(x), the rate of change in the probability of action A_1. The relevant state space is therefore the interval [0, 1]. To specify L(x), we can expand the two components of (24) more explicitly. The first component, which is the replicator component associated with U^+, has the explicit form

x_1 {e_1 − x} · U^+ x = x(1 − x)(u^+_{11} x + u^+_{12}(1 − x) − u^+_{21} x − u^+_{22}(1 − x)).   (25)

The second component, associated with U^-, is

{x e_1 − (1 − x) x} · U^- x = x {u^-_{11} x + u^-_{12}(1 − x)} − (1 − x) {u^-_{11} x^2 + (u^-_{12} + u^-_{21}) x(1 − x) + u^-_{22}(1 − x)^2}
                            = u^-_{11} x^3 + u^-_{12} x^2 (1 − x) − u^-_{21} x(1 − x)^2 − u^-_{22}(1 − x)^3.   (26)

Adding (25) and (26), we obtain the GR dynamic ẋ = L(x), where

L(x) = (u^+_{11} x^2 (1 − x) + u^+_{12} x(1 − x)^2 − u^+_{21} x^2 (1 − x) − u^+_{22} x(1 − x)^2)
     + (u^-_{11} x^3 + u^-_{12} x^2 (1 − x) − u^-_{21} x(1 − x)^2 − u^-_{22}(1 − x)^3).   (27)

We may also rewrite L(x) as

L(x) = (u^+_{11} x^2 (1 − x) + u^+_{12} x(1 − x)^2 − u^-_{21} x(1 − x)^2 − u^-_{22}(1 − x)^3)
     + (u^-_{11} x^3 + u^-_{12} x^2 (1 − x) − u^+_{21} x^2 (1 − x) − u^+_{22} x(1 − x)^2).   (28)

The first component on the right hand side of (28) represents the inflow of probability mass into action A_1. Note that this component is positive. The value of x increases if playing A_1 generates a positive payoff or playing A_2 generates a negative payoff. For example, u^+_{11} x^2 (1 − x) is the product of the increase in x for a player, u^+_{11}(1 − x), multiplied with the probability x^2 of both players in the matching playing A_1. The second component denotes the outflow of mass from A_1. This component is negative. The probability of A_1 declines if it generates a negative payoff or if the other action, A_2, generates a positive payoff. We note that a similar inflow-outflow interpretation holds for the more general n-strategy version of the GR dynamic (23).
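The scalar form (27) is convenient for quick numerical experiments. The sketch below (our own code, with illustrative payoff values) evaluates L(x) for a 2 × 2 game and integrates ẋ = L(x) by a crude Euler scheme; the chosen game is a stag-hunt-like game whose inferior equilibrium (A_2, A_2) lies in negative payoffs.

```python
import numpy as np

def L_2x2(x, U):
    """Scalar GR dynamic (27); x is the probability of action A_1."""
    Up, Um = np.maximum(U, 0.0), np.minimum(U, 0.0)
    pos = (Up[0, 0] * x**2 * (1 - x) + Up[0, 1] * x * (1 - x)**2
           - Up[1, 0] * x**2 * (1 - x) - Up[1, 1] * x * (1 - x)**2)
    neg = (Um[0, 0] * x**3 + Um[0, 1] * x**2 * (1 - x)
           - Um[1, 0] * x * (1 - x)**2 - Um[1, 1] * (1 - x)**3)
    return pos + neg

def euler_path(x0, U, dt=0.01, steps=5000):
    """Crude Euler integration of xdot = L(x), clamped to [0, 1]."""
    x = x0
    for _ in range(steps):
        x = min(max(x + dt * L_2x2(x, U), 0.0), 1.0)
    return x

U = np.array([[ 0.8, -0.5],
              [ 0.2, -0.1]])          # both (A_1, A_1) and (A_2, A_2) are Nash equilibria
print(euler_path(0.05, U))            # prints a value above 0.9: the population leaves x = 0
```

Starting next to the inferior equilibrium at x = 0, the trajectory moves away from it and heads toward x = 1, in line with the discussion of negative-payoff equilibria in the following sections.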

4 Some Properties of the GR Dynamic

The strategy revision rule f under generalized reinforcement defined in (12)-(13) is clearly differentiable. Therefore, from any initial point x(0), the GR dynamic (19) admits a unique solution which is continuous with respect to the initial condition. However, for this dynamic to be relevant in a game theoretic context, we also require that it satisfy forward invariance, i.e. from initial condition x(0) ∈ Δ, the solution trajectory x(t) ∈ Δ for all t > 0. This condition ensures that x(t) remains a meaningful description of strategy for all times t. To establish forward invariance, it is sufficient to show that at any point on the boundary of Δ, the GR dynamic never points outward from Δ. To state and prove this property formally, we first define the (n − k)-dimensional faces of Δ.

Definition 4.1 Let A = {A_1, A_2, ..., A_n} be the set of pure strategies. Then a proper subset A^i = {A_{i_1}, ..., A_{i_k}} ⊂ A defines an (n − k)-dimensional face of Δ,

Δ(i) = {x ∈ Δ : x_j = 0 for all A_j ∈ A^i}.   (29)

In an (n − k)-dimensional face, x_{i_j} = 0 for the k actions {A_{i_1}, ..., A_{i_k}}. Therefore, any such face represents a part of the boundary of Δ. The following proposition establishes that from any such face of Δ, the GR dynamic either remains on the face or points inwards into Δ. This is equivalent to showing that for any x ∈ Δ(i), ẋ_{i_j} ≥ 0 for any A_{i_j} ∈ A^i. Clearly, this implies that the dynamic never points outward from the boundary of Δ.

Proposition 4.2 The (n − k)-dimensional face Δ(i) is invariant under the GR dynamic ẋ = L(x) if and only if u_{ij} ≥ 0 for all A_i, A_j ∉ A^i. If there exists A_i, A_j ∉ A^i for which u_{ij} < 0, then L(x) points into the interior of Δ for all x ∈ int Δ(i).

Proof. For x ∈ int Δ(i), we have, for 1 ≤ r ≤ k,

L_{i_r}(x) = −[α(x)x] · U^- x = −Σ_{A_i, A_j ∉ A^i} u^-_{ij} α(x_i) x_i x_j ≥ 0.

Since x_i, x_j > 0 for x ∈ int Δ(i) and A_i, A_j ∉ A^i, it follows that L_{i_r}(x) = 0 if and only if u^-_{ij} = 0 for all A_i, A_j ∉ A^i. That is, if and only if u_{ij} ≥ 0 for all A_i, A_j ∉ A^i. If there exists A_i, A_j ∉ A^i for which u_{ij} < 0, then L_{i_r}(x) ≥ −u^-_{ij} α(x_i) x_i x_j > 0 for all 1 ≤ r ≤ k. Hence, the vector field L(x) points into the interior of Δ at x ∈ int Δ(i).

The calculation in the proof involves points in the interior of Δ(i). However, any point in the boundary of Δ(i) is in the interior of a lower dimensional face. Therefore, the proposition covers every point in the boundary of Δ. The calculation of L_{i_r}(x) in the proof of the proposition then suffices to establish forward invariance.

However, Proposition 4.2 goes beyond establishing forward invariance. It also establishes a basic distinction between the replicator dynamic and the GR dynamic. It is well known that the boundary faces of Δ are invariant under the replicator dynamic. Hence, any solution trajectory of the replicator dynamic that starts in a particular face remains in that face at all times in the future. However, under the GR dynamic, this is true only in the case covered by the first part of Proposition 4.2. A face is invariant if and only if actions present in that face always yield positive payoffs. If all payoffs are positive, then all faces are invariant. But this is also the case in which the GR dynamic is equivalent to the replicator dynamic. In the more general case in which some action leads to a negative payoff, the GR dynamic diverts a part of that action's probability into an unused action. This pushes the state variable into the interior of Δ. This also implies that any pure strategy that does not yield a positive payoff is not a stationary point of the GR dynamic. We can establish this result formally as a corollary of Proposition 4.2.

Corollary 4.3 The pure strategy e_i is a stationary point of the reinforcement dynamics if and only if u_{ii} ≥ 0.

Proof. Take A^i = A \ {A_i} in Proposition 4.2.

In a large population context, a monomorphic state e_i is a state in which all agents play the same action A_i. Corollary 4.3 implies that from any monomorphic state in which the payoff is negative, the GR dynamic moves away from that state.

In particular, a pure strategy Nash equilibrium in negative payoffs is not a stationary point of the GR dynamic. This behavior is in contrast to the replicator dynamic in which all monomorphic states are stationary points.

Corollary 4.3 establishes that if all monomorphic states of the population game are in negative payoffs, then the only possible stationary points of the GR dynamic are in the interior of Δ. The following proposition establishes the existence of at least one such interior stationary point. To prove the proposition, we define the forward flow of the GR dynamic. The forward flow of the GR dynamic is the function φ_t(ξ) = x_t, where {x_t}_{t ∈ [0, ∞)} is the solution to the dynamic with initial condition x_0 = ξ. The existence of a stationary point follows from an application of Brouwer's fixed point theorem to φ_t(ξ).

Proposition 4.4 If u_{ii} < 0 for all A_i ∈ A, then the GR dynamic ẋ = L(x) is inward pointing everywhere on the boundary, and hence the dynamic has only interior stationary points.

Proof. If u_{ii} < 0 for all A_i ∈ A, then Corollary 4.3 implies the GR dynamic is inward pointing. Hence, no pure strategy can be a stationary point of the dynamic. To establish the existence of an interior rest point, note that by forward invariance, the forward flow φ_t is a function from Δ to itself. Furthermore, through the standard results on existence, uniqueness and continuity of solutions, φ_t : Δ → Δ is a continuous function for any t ∈ [0, ∞). Since Δ is a compact set, Brouwer's fixed point theorem then implies the existence of a fixed point φ_t(x) = x for any t ∈ [0, ∞). Clearly, such a fixed point is a stationary point of the GR dynamic. Since u_{ii} < 0 for all A_i ∈ A, such a stationary point can only be in the interior of Δ.

Any mixed Nash equilibrium of U is a stationary point of the replicator dynamic. Therefore, a fully mixed equilibrium of U is an interior stationary point of the replicator dynamic. For the GR dynamic, however, it is not readily apparent how to arrive at such an intuitive characterization of an interior stationary point. We can establish that a mixed Nash equilibrium is not generally a stationary point of the GR dynamic using a simple 2 strategy game. Consider the 2 strategy game U in which u_{11} > 0 but the other payoffs are negative. Therefore, u_{12}, u_{21}, u_{22} < 0. We assume that u_{22} > u_{12} or, equivalently, |u_{22}| < |u_{12}|. So the game has three Nash equilibria, x = 0, x = 1 and the mixed equilibrium x* = (u_{22} − u_{12}) / (u_{22} − u_{12} + u_{11} − u_{21}). If we apply the replicator dynamic to the game, then the three Nash equilibria constitute the set of stationary points of the dynamic. To obtain the stationary points of the GR dynamic, let us apply (28) to U to obtain

L(x) = u^+_{11} x^2 (1 − x) − u^-_{21} x(1 − x)^2 − u^-_{22}(1 − x)^3 + u^-_{12} x^2 (1 − x)
     = (1 − x)( u_{11} x^2 + u_{12} x^2 − u_{21} x(1 − x) − u_{22}(1 − x)^2 ).   (30)

From (30), it is readily apparent that x = 1 is a rest point of the GR dynamic. But at x = 0, L(x) = −u_{22} > 0. At x = 0, all agents play action A_2 with certainty. Due to the negative payoff, all agents negatively reinforce A_2. Hence, there is an inflow of mass into A_1 and so, x rises from 0. The interior stationary point is given by the solution to u_{11} x^2 + u_{12} x^2 − u_{21} x(1 − x) − u_{22}(1 − x)^2 = 0. At that solution, the inflow into A_1 is matched by the outflow from that action.
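A numerical illustration, with payoff values of our own choosing, makes the gap between the two notions concrete. Take u_{11} = 0.5, u_{12} = −0.6, u_{21} = −0.4 and u_{22} = −0.2, so that u_{22} > u_{12} as assumed. The mixed Nash equilibrium is

x* = (u_{22} − u_{12}) / (u_{22} − u_{12} + u_{11} − u_{21}) = 0.4 / (0.4 + 0.9) ≈ 0.31,

whereas the interior stationary point of the GR dynamic solves

0.5 x^2 − 0.6 x^2 + 0.4 x(1 − x) + 0.2(1 − x)^2 = 0,   i.e.   −0.3 x^2 + 0.2 = 0,

giving x = √(2/3) ≈ 0.82. The two points are far apart: the GR dynamic settles at a state with much more weight on A_1 than the mixed equilibrium prescribes.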

However, it is evident that this solution bears no relation to the mixed Nash equilibrium x*, unless the payoffs are such that x* = 1/2. In the special case in which all payoffs are positive, so that the GR dynamic reduces to the replicator dynamic, mixed equilibria of a game are stationary points of the GR dynamic. Except for this case, the above example in (30) verifies that typically, such Nash equilibria do not constitute stationary points of the GR dynamic.

5 Application: Two Strategy Games

We now apply the GR dynamic to two strategy symmetric games. The GR dynamic depends intricately on the signs of the payoffs in the game U. We, therefore, expect the behavior of the dynamic, particularly its long-run asymptotic state, to also depend upon the signs of the payoffs. This is totally unlike the case with conventional evolutionary dynamics like the replicator dynamic. For example, in a prisoner's dilemma game, the replicator dynamic always converges to the unique Nash equilibrium irrespective of the signs of the payoffs. It is readily apparent that such a general conclusion does not hold for the GR dynamic if we consider a prisoner's dilemma where the Nash equilibrium is in negative payoffs. In that case, the Nash equilibrium is not even a stationary point of the GR dynamic. We, therefore, need to distinguish between various distributions of positive and negative signs for a complete analysis of the GR dynamic in two strategy games. While there are various ways to make the distinction, the most convenient one is to classify such a game U according to the signs of u_{11} and u_{22}. These are the payoffs corresponding to the monomorphic states x = 1 and x = 0, where x is the probability of action A_1 or the proportion of agents using A_1. This gives us four cases to analyze.

5.1 Case 1: u_{11} ≥ 0 and u_{22} < 0

In this case, u^-_{11} = u^+_{22} = 0. Therefore, from (27), the GR dynamic takes the form

L(x) = (1 − x)( u^+_{11} x^2 + u^+_{12} x(1 − x) − u^-_{21} x(1 − x) − u^-_{22}(1 − x)^2 + u^-_{12} x^2 − u^+_{21} x^2 )
     = (1 − x)( u^+_{11} x^2 + u^+_{12} x(1 − x) + |u^-_{21}| x(1 − x) + |u^-_{22}|(1 − x)^2 − |u^-_{12}| x^2 − u^+_{21} x^2 ).   (31)

We write this equation as L(x) = (1 − x)q(x), where q(x) is quadratic in x. We note that q(0) = |u^-_{22}| > 0 and q(1) = u^+_{11} − |u^-_{12}| − u^+_{21}. It is obvious that x = 1 is a stationary point of the GR dynamic in this case. The only other possible stationary points of the dynamic are such roots of q(x) that belong to the interval [0, 1]. We now show that if q(1) > 0, then x = 1 is the globally asymptotic stationary point of the GR dynamic.

Let us apply the change of variable ξ = (1 − x)/x and note that x ∈ (0, 1] corresponds to ξ ≥ 0. With this transformation, we can write q(x) as p(ξ), where

p(ξ) = [1/(1 + ξ)^2] ( (u^+_{11} − |u^-_{12}| − u^+_{21}) + (u^+_{12} + |u^-_{21}|)ξ + |u^-_{22}|ξ^2 ).

This transforms the condition q(x) = 0 into the equivalent condition p(ξ) = 0 in terms of the positive variable ξ. If p(ξ*) = 0 for some ξ* ≥ 0, then x = 1/(1 + ξ*) is a root of q(x). However, if q(1) = u^+_{11} − |u^-_{12}| − u^+_{21} > 0, then p(ξ) > 0 for all ξ ≥ 0. Therefore, in this case, q(x) has no root in [0, 1]. But with q(0) > 0, this further implies that q(x) > 0 for all x ∈ [0, 1]. Hence, the only stationary point of L(x) is x = 1. For any other x ∈ [0, 1], L(x) > 0, which implies that x = 1 is globally asymptotically stable. However, if q(1) < 0, then p(ξ) has the root

ξ* = (1/(2|u^-_{22}|)) { −(u^+_{12} + |u^-_{21}|) + √( (u^+_{12} + |u^-_{21}|)^2 − 4|u^-_{22}|(u^+_{11} − |u^-_{12}| − u^+_{21}) ) }.   (32)

Therefore, q(x) has the unique root x* = 1/(1 + ξ*) ∈ (0, 1). In this case, x* is another stationary point of the GR dynamic along with x = 1. Furthermore, for ξ > ξ*, p(ξ) > 0 and for ξ < ξ*, p(ξ) < 0. Given the inverse relationship between x and ξ, this implies that for x < x*, q(x) > 0 and for x > x*, q(x) < 0. This implies that x* is the globally asymptotic stationary point of the GR dynamic. For the borderline case q(1) = 0, ξ* = 0. Therefore, x = 1 is the root of q(x). Hence, x = 1 is the only stationary point of the GR dynamic which, furthermore, is globally asymptotically stable. We summarize this discussion in the following proposition.

Proposition 5.1 Consider a 2 × 2 symmetric normal form game with u_{11} ≥ 0 and u_{22} < 0. Let ẋ = L(x) be the GR dynamic for this game, where L(x) is given by (31). Then,

1. If u^+_{11} − |u^-_{12}| − u^+_{21} ≥ 0, then x = 1 is the unique stationary point and is globally asymptotically stable.

2. If u^+_{11} − |u^-_{12}| − u^+_{21} < 0, then there is a unique interior stationary point x* = 1/(1 + ξ*), where ξ* is given by (32). Further, x* is globally asymptotically stable on 0 ≤ x < 1. The other stationary point x = 1 is unstable.

5.2 Case 2: u_{11} < 0 and u_{22} ≥ 0

Here, u^+_{11} = u^-_{22} = 0. The GR dynamic, therefore, takes the form

L(x) = x ( u^+_{12}(1 − x)^2 + |u^-_{21}|(1 − x)^2 − |u^-_{11}| x^2 − |u^-_{12}| x(1 − x) − u^+_{21} x(1 − x) − u^+_{22}(1 − x)^2 ).   (33)

Instead of going through the formal analysis of the dynamic, we can solve this case directly from Case 1 in Section 5.1 by interchanging the strategies 1 ↔ 2 and the roles of x and (1 − x). We, therefore, obtain the following proposition.

Proposition 5.2 Consider a 2 × 2 symmetric normal form game with u_{11} < 0 and u_{22} ≥ 0. Let ẋ = L(x) be the GR dynamic for this game, where L(x) is given by (33). Then,

1. If u^+_{22} − |u^-_{21}| − u^+_{12} ≥ 0, then x = 0 is the unique stationary point and is globally asymptotically stable.

2. If u^+_{22} − |u^-_{21}| − u^+_{12} < 0, then there is a unique interior stationary point x* = ξ*/(1 + ξ*), where ξ* is given by

ξ* = (1/(2|u^-_{11}|)) { −(u^+_{21} + |u^-_{12}|) + √( (u^+_{21} + |u^-_{12}|)^2 − 4|u^-_{11}|(u^+_{22} − |u^-_{21}| − u^+_{12}) ) }.

Further, x* is globally asymptotically stable on 0 < x ≤ 1. The other stationary point x = 0 is unstable.

5.3 Case 3: u_{11} ≥ 0 and u_{22} ≥ 0

In this case, u^-_{11} = u^-_{22} = 0. Hence, the dynamic ẋ = L(x), with L(x) given by (27), reduces to ẋ = x(1 − x)l(x), where

l(x) = (|u^-_{21}| + u^+_{12} − u^+_{22}) − { (|u^-_{21}| + u^+_{12} − u^+_{22}) + (|u^-_{12}| + u^+_{21} − u^+_{11}) } x.   (34)

Note that l(0) = (|u^-_{21}| + u^+_{12} − u^+_{22}), and l(1) = −(|u^-_{12}| + u^+_{21} − u^+_{11}). Thus, if (|u^-_{21}| + u^+_{12} − u^+_{22}) and (|u^-_{12}| + u^+_{21} − u^+_{11}) have opposite signs, then l(x) has no zero in the range 0 ≤ x ≤ 1. Hence, x = 0 and x = 1 are the only stationary points in this case. Further, if l(x) > 0, then x = 1 is globally asymptotically stable, and if l(x) < 0, then x = 0 is globally asymptotically stable. On the other hand, if (|u^-_{21}| + u^+_{12} − u^+_{22}) and (|u^-_{12}| + u^+_{21} − u^+_{11}) have the same sign, then there is an interior equilibrium at

x* = (|u^-_{21}| + u^+_{12} − u^+_{22}) / [ (|u^-_{21}| + u^+_{12} − u^+_{22}) + (|u^-_{12}| + u^+_{21} − u^+_{11}) ].   (35)

In this case, we can write the dynamic in the form

ẋ = γ x(1 − x)(x* − x),   (36)

where

γ = (|u^-_{21}| + u^+_{12} − u^+_{22}) + (|u^-_{12}| + u^+_{21} − u^+_{11}).   (37)

If γ > 0, then ẋ > 0 for 0 < x < x*, and ẋ < 0 for x* < x < 1. In this case, x* is globally asymptotically stable on 0 < x < 1. On the other hand, if γ < 0, then ẋ < 0 for 0 < x < x* and ẋ > 0 for x* < x < 1. In this case, x = 0 is locally asymptotically stable, with basin of attraction 0 ≤ x < x*, and x = 1 is locally asymptotically stable with basin of attraction x* < x ≤ 1. We summarize the analysis in the following proposition.

Proposition 5.3 Consider a 2 × 2 symmetric normal form game with u_{11} ≥ 0 and u_{22} ≥ 0. Let ẋ = L(x) be the GR dynamic, with L(x) = x(1 − x)l(x) and l(x) given by (34). Then,

1. If (|u^-_{21}| + u^+_{12} − u^+_{22}) and (|u^-_{12}| + u^+_{21} − u^+_{11}) have opposite signs, then x = 0 and x = 1 are the only stationary points of the GR dynamic. Further, if l(x) > 0, then x = 1 is globally asymptotically stable, and if l(x) < 0, then x = 0 is globally asymptotically stable.

2. If (|u^-_{21}| + u^+_{12} − u^+_{22}) and (|u^-_{12}| + u^+_{21} − u^+_{11}) have the same sign, then there are three stationary points: x = 0, x = 1 and the interior point x* defined in (35). Given γ defined in (37), if γ > 0, then x* is globally asymptotically stable on 0 < x < 1. If γ < 0, x = 0 and x = 1 are locally asymptotically stable with respective basins of attraction 0 ≤ x < x* and x* < x ≤ 1.

We leave it to the reader to explicate the special cases in which one or both of (|u^-_{21}| + u^+_{12} − u^+_{22}) and (|u^-_{12}| + u^+_{21} − u^+_{11}) are zero.

5.4 Case 4: u_{11} < 0 and u_{22} < 0

In this final case, u^+_{11} = u^+_{22} = 0. Therefore, the GR dynamic (27) reduces to

L(x) = u^-_{11} x^3 + (u^-_{12} − u^+_{21} − u^+_{12}) x^2 (1 − x) − u^-_{21} x(1 − x)^2 − u^-_{22}(1 − x)^3 + u^+_{12} x(1 − x).   (38)

We note that L(0) = −u^-_{22} > 0 and L(1) = u^-_{11} < 0. Therefore, this case provides an example of Proposition 4.4 in which the GR dynamic has only interior stationary points when u_{ii} < 0 for all actions A_i ∈ A. However, the problem of finding an interior stationary point in this case cannot be reduced to solving for the roots of a quadratic equation. Therefore, we cannot obtain an explicit expression for such a rest point x* as a function of the payoffs u_{ij}. Nevertheless, we can establish that there exists exactly one stationary point of the GR dynamic in this case.

To prove this claim, apply the change of variable ξ = x/(1 − x) and note that x ∈ [0, 1) corresponds to ξ ≥ 0. We can then express the stationarity condition L(x) = 0 in terms of ξ as P(ξ) = 0, where P(ξ) = L(x)/(1 − x)^3 is the cubic

P(ξ) = u^-_{11} ξ^3 + (u^-_{12} − u^+_{21} − u^+_{12}) ξ^2 − u^-_{21} ξ − u^-_{22} + u^+_{12} ξ(1 + ξ)
     = −|u^-_{11}| ξ^3 − (|u^-_{12}| + u^+_{21}) ξ^2 + (u^+_{12} + |u^-_{21}|) ξ + |u^-_{22}|.

Positive solutions of P(ξ) = 0 are equivalent to solutions of k(ξ) = m(ξ), where k(ξ) = |u^-_{11}| ξ^3 + (|u^-_{12}| + u^+_{21}) ξ^2 and m(ξ) = (|u^-_{21}| + u^+_{12}) ξ + |u^-_{22}|. Now m(ξ) is a straight line with non-negative slope and m(0) > 0. On the other hand, k(0) = 0, and k(ξ) is a positive, strictly convex, increasing function for ξ > 0, with k(ξ) → ∞ as ξ → ∞. It follows immediately that there is exactly one solution ξ* > 0 of k(ξ) = m(ξ). The unique stationary point x* of L(x) is then determined by ξ* = x*/(1 − x*). Moreover, L(x) > 0 for 0 ≤ x < x*, and L(x) < 0 for x* < x ≤ 1. Hence, x* is globally asymptotically stable on 0 ≤ x ≤ 1.
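As a final numerical check (our own sketch, with payoffs of our own choosing), the unique interior rest point in Case 4 can be located by a sign-change search on (38), and Euler iteration of the dynamic converges to the same value from an arbitrary starting point.

```python
import numpy as np

def L_case4(x, U):
    """Scalar GR dynamic (38) for a 2x2 game with u11 < 0 and u22 < 0."""
    Up, Um = np.maximum(U, 0.0), np.minimum(U, 0.0)
    return (Um[0, 0] * x**3
            + (Um[0, 1] - Up[1, 0] - Up[0, 1]) * x**2 * (1 - x)
            - Um[1, 0] * x * (1 - x)**2
            - Um[1, 1] * (1 - x)**3
            + Up[0, 1] * x * (1 - x))

U = np.array([[-0.3,  0.4],
              [-0.7, -0.2]])                  # u11 < 0 and u22 < 0, as in Case 4

# Bisection: L(0) > 0 and L(1) < 0, and L changes sign exactly once on [0, 1].
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if L_case4(mid, U) > 0 else (lo, mid)
x_star = 0.5 * (lo + hi)

# Euler iteration from near x = 0 converges to the same interior point.
x = 0.05
for _ in range(200000):
    x += 0.001 * L_case4(x, U)
print(round(x_star, 4), round(x, 4))          # both roughly 0.67
```

With these payoffs A_1 is strictly dominant, so the unique Nash equilibrium is x = 1; yet because u_{11} < 0, the dynamic ignores it and settles at the interior rest point instead.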


More information

A Generic Bound on Cycles in Two-Player Games

A Generic Bound on Cycles in Two-Player Games A Generic Bound on Cycles in Two-Player Games David S. Ahn February 006 Abstract We provide a bound on the size of simultaneous best response cycles for generic finite two-player games. The bound shows

More information

C31: Game Theory, Lecture 1

C31: Game Theory, Lecture 1 C31: Game Theory, Lecture 1 V. Bhaskar University College London 5 October 2006 C31 Lecture 1: Games in strategic form & Pure strategy equilibrium Osborne: ch 2,3, 12.2, 12.3 A game is a situation where:

More information

DETERMINISTIC AND STOCHASTIC SELECTION DYNAMICS

DETERMINISTIC AND STOCHASTIC SELECTION DYNAMICS DETERMINISTIC AND STOCHASTIC SELECTION DYNAMICS Jörgen Weibull March 23, 2010 1 The multi-population replicator dynamic Domain of analysis: finite games in normal form, G =(N, S, π), with mixed-strategy

More information

Payoff Continuity in Incomplete Information Games

Payoff Continuity in Incomplete Information Games journal of economic theory 82, 267276 (1998) article no. ET982418 Payoff Continuity in Incomplete Information Games Atsushi Kajii* Institute of Policy and Planning Sciences, University of Tsukuba, 1-1-1

More information

Chapter III. Stability of Linear Systems

Chapter III. Stability of Linear Systems 1 Chapter III Stability of Linear Systems 1. Stability and state transition matrix 2. Time-varying (non-autonomous) systems 3. Time-invariant systems 1 STABILITY AND STATE TRANSITION MATRIX 2 In this chapter,

More information

Computing Minmax; Dominance

Computing Minmax; Dominance Computing Minmax; Dominance CPSC 532A Lecture 5 Computing Minmax; Dominance CPSC 532A Lecture 5, Slide 1 Lecture Overview 1 Recap 2 Linear Programming 3 Computational Problems Involving Maxmin 4 Domination

More information

6.891 Games, Decision, and Computation February 5, Lecture 2

6.891 Games, Decision, and Computation February 5, Lecture 2 6.891 Games, Decision, and Computation February 5, 2015 Lecture 2 Lecturer: Constantinos Daskalakis Scribe: Constantinos Daskalakis We formally define games and the solution concepts overviewed in Lecture

More information

Reinforcement Learning

Reinforcement Learning 5 / 28 Reinforcement Learning Based on a simple principle: More likely to repeat an action, if it had to a positive outcome. 6 / 28 Reinforcement Learning Idea of reinforcement learning first formulated

More information

Pairwise Comparison Dynamics for Games with Continuous Strategy Space

Pairwise Comparison Dynamics for Games with Continuous Strategy Space Pairwise Comparison Dynamics for Games with Continuous Strategy Space Man-Wah Cheung https://sites.google.com/site/jennymwcheung University of Wisconsin Madison Department of Economics Nov 5, 2013 Evolutionary

More information

Iterated Strict Dominance in Pure Strategies

Iterated Strict Dominance in Pure Strategies Iterated Strict Dominance in Pure Strategies We know that no rational player ever plays strictly dominated strategies. As each player knows that each player is rational, each player knows that his opponents

More information

Evolutionary Dynamics and Extensive Form Games by Ross Cressman. Reviewed by William H. Sandholm *

Evolutionary Dynamics and Extensive Form Games by Ross Cressman. Reviewed by William H. Sandholm * Evolutionary Dynamics and Extensive Form Games by Ross Cressman Reviewed by William H. Sandholm * Noncooperative game theory is one of a handful of fundamental frameworks used for economic modeling. It

More information

Evolutionary Game Theory: Overview and Recent Results

Evolutionary Game Theory: Overview and Recent Results Overviews: Evolutionary Game Theory: Overview and Recent Results William H. Sandholm University of Wisconsin nontechnical survey: Evolutionary Game Theory (in Encyclopedia of Complexity and System Science,

More information

Belief-based Learning

Belief-based Learning Belief-based Learning Algorithmic Game Theory Marcello Restelli Lecture Outline Introdutcion to multi-agent learning Belief-based learning Cournot adjustment Fictitious play Bayesian learning Equilibrium

More information

An Introduction to Evolutionary Game Theory

An Introduction to Evolutionary Game Theory An Introduction to Evolutionary Game Theory Lectures delivered at the Graduate School on Nonlinear and Stochastic Systems in Biology held in the Department of Applied Mathematics, School of Mathematics

More information

Equilibria in Games with Weak Payoff Externalities

Equilibria in Games with Weak Payoff Externalities NUPRI Working Paper 2016-03 Equilibria in Games with Weak Payoff Externalities Takuya Iimura, Toshimasa Maruta, and Takahiro Watanabe October, 2016 Nihon University Population Research Institute http://www.nihon-u.ac.jp/research/institute/population/nupri/en/publications.html

More information

A (Brief) Introduction to Game Theory

A (Brief) Introduction to Game Theory A (Brief) Introduction to Game Theory Johanne Cohen PRiSM/CNRS, Versailles, France. Goal Goal is a Nash equilibrium. Today The game of Chicken Definitions Nash Equilibrium Rock-paper-scissors Game Mixed

More information

A Note on the Existence of Ratifiable Acts

A Note on the Existence of Ratifiable Acts A Note on the Existence of Ratifiable Acts Joseph Y. Halpern Cornell University Computer Science Department Ithaca, NY 14853 halpern@cs.cornell.edu http://www.cs.cornell.edu/home/halpern August 15, 2018

More information

Stochastic Evolutionary Game Dynamics: Foundations, Deterministic Approximation, and Equilibrium Selection

Stochastic Evolutionary Game Dynamics: Foundations, Deterministic Approximation, and Equilibrium Selection Stochastic Evolutionary Game Dynamics: Foundations, Deterministic Approximation, and Equilibrium Selection William H. Sandholm October 31, 2010 Abstract We present a general model of stochastic evolution

More information

NEGOTIATION-PROOF CORRELATED EQUILIBRIUM

NEGOTIATION-PROOF CORRELATED EQUILIBRIUM DEPARTMENT OF ECONOMICS UNIVERSITY OF CYPRUS NEGOTIATION-PROOF CORRELATED EQUILIBRIUM Nicholas Ziros Discussion Paper 14-2011 P.O. Box 20537, 1678 Nicosia, CYPRUS Tel.: +357-22893700, Fax: +357-22895028

More information

Game Theory, Population Dynamics, Social Aggregation. Daniele Vilone (CSDC - Firenze) Namur

Game Theory, Population Dynamics, Social Aggregation. Daniele Vilone (CSDC - Firenze) Namur Game Theory, Population Dynamics, Social Aggregation Daniele Vilone (CSDC - Firenze) Namur - 18.12.2008 Summary Introduction ( GT ) General concepts of Game Theory Game Theory and Social Dynamics Application:

More information

Linear Programming in Matrix Form

Linear Programming in Matrix Form Linear Programming in Matrix Form Appendix B We first introduce matrix concepts in linear programming by developing a variation of the simplex method called the revised simplex method. This algorithm,

More information

DIMACS Technical Report March Game Seki 1

DIMACS Technical Report March Game Seki 1 DIMACS Technical Report 2007-05 March 2007 Game Seki 1 by Diogo V. Andrade RUTCOR, Rutgers University 640 Bartholomew Road Piscataway, NJ 08854-8003 dandrade@rutcor.rutgers.edu Vladimir A. Gurvich RUTCOR,

More information

Population Games and Evolutionary Dynamics

Population Games and Evolutionary Dynamics Population Games and Evolutionary Dynamics William H. Sandholm The MIT Press Cambridge, Massachusetts London, England in Brief Series Foreword Preface xvii xix 1 Introduction 1 1 Population Games 2 Population

More information

BELIEFS & EVOLUTIONARY GAME THEORY

BELIEFS & EVOLUTIONARY GAME THEORY 1 / 32 BELIEFS & EVOLUTIONARY GAME THEORY Heinrich H. Nax hnax@ethz.ch & Bary S. R. Pradelski bpradelski@ethz.ch May 15, 217: Lecture 1 2 / 32 Plan Normal form games Equilibrium invariance Equilibrium

More information

6.254 : Game Theory with Engineering Applications Lecture 8: Supermodular and Potential Games

6.254 : Game Theory with Engineering Applications Lecture 8: Supermodular and Potential Games 6.254 : Game Theory with Engineering Applications Lecture 8: Supermodular and Asu Ozdaglar MIT March 2, 2010 1 Introduction Outline Review of Supermodular Games Reading: Fudenberg and Tirole, Section 12.3.

More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

SONDERFORSCHUNGSBEREICH 504

SONDERFORSCHUNGSBEREICH 504 SONDERFORSCHUNGSBEREICH 504 Rationalitätskonzepte, Entscheidungsverhalten und ökonomische Modellierung No. 02-03 Two-Speed Evolution of Strategies and Preferences in Symmetric Games Possajennikov, Alex

More information

Tijmen Daniëls Universiteit van Amsterdam. Abstract

Tijmen Daniëls Universiteit van Amsterdam. Abstract Pure strategy dominance with quasiconcave utility functions Tijmen Daniëls Universiteit van Amsterdam Abstract By a result of Pearce (1984), in a finite strategic form game, the set of a player's serially

More information

Bargaining Efficiency and the Repeated Prisoners Dilemma. Bhaskar Chakravorti* and John Conley**

Bargaining Efficiency and the Repeated Prisoners Dilemma. Bhaskar Chakravorti* and John Conley** Bargaining Efficiency and the Repeated Prisoners Dilemma Bhaskar Chakravorti* and John Conley** Published as: Bhaskar Chakravorti and John P. Conley (2004) Bargaining Efficiency and the repeated Prisoners

More information

Prisoner s Dilemma. Veronica Ciocanel. February 25, 2013

Prisoner s Dilemma. Veronica Ciocanel. February 25, 2013 n-person February 25, 2013 n-person Table of contents 1 Equations 5.4, 5.6 2 3 Types of dilemmas 4 n-person n-person GRIM, GRIM, ALLD Useful to think of equations 5.4 and 5.6 in terms of cooperation and

More information

Game Theory Fall 2003

Game Theory Fall 2003 Game Theory Fall 2003 Problem Set 1 [1] In this problem (see FT Ex. 1.1) you are asked to play with arbitrary 2 2 games just to get used to the idea of equilibrium computation. Specifically, consider the

More information

Game Theory and Algorithms Lecture 2: Nash Equilibria and Examples

Game Theory and Algorithms Lecture 2: Nash Equilibria and Examples Game Theory and Algorithms Lecture 2: Nash Equilibria and Examples February 24, 2011 Summary: We introduce the Nash Equilibrium: an outcome (action profile) which is stable in the sense that no player

More information

Ex Post Cheap Talk : Value of Information and Value of Signals

Ex Post Cheap Talk : Value of Information and Value of Signals Ex Post Cheap Talk : Value of Information and Value of Signals Liping Tang Carnegie Mellon University, Pittsburgh PA 15213, USA Abstract. Crawford and Sobel s Cheap Talk model [1] describes an information

More information

University of Zurich. Best-reply matching in games. Zurich Open Repository and Archive. Droste, E; Kosfeld, M; Voorneveld, M.

University of Zurich. Best-reply matching in games. Zurich Open Repository and Archive. Droste, E; Kosfeld, M; Voorneveld, M. University of Zurich Zurich Open Repository and Archive Winterthurerstr. 190 CH-8057 Zurich http://www.zora.unizh.ch Year: 2003 Best-reply matching in games Droste, E; Kosfeld, M; Voorneveld, M Droste,

More information

Distributed Learning based on Entropy-Driven Game Dynamics

Distributed Learning based on Entropy-Driven Game Dynamics Distributed Learning based on Entropy-Driven Game Dynamics Bruno Gaujal joint work with Pierre Coucheney and Panayotis Mertikopoulos Inria Aug., 2014 Model Shared resource systems (network, processors)

More information

Selfishness vs Altruism vs Balance

Selfishness vs Altruism vs Balance Selfishness vs Altruism vs Balance Pradeep Dubey and Yair Tauman 18 April 2017 Abstract We give examples of strategic interaction which are beneficial for players who follow a "middle path" of balance

More information

Fast Convergence in Evolutionary Equilibrium Selection 1

Fast Convergence in Evolutionary Equilibrium Selection 1 Fast Convergence in Evolutionary Equilibrium Selection 1 Gabriel E Kreindler H Peyton Young January 19, 2012 Abstract Stochastic selection models provide sharp predictions about equilibrium selection when

More information

Games and Their Equilibria

Games and Their Equilibria Chapter 1 Games and Their Equilibria The central notion of game theory that captures many aspects of strategic decision making is that of a strategic game Definition 11 (Strategic Game) An n-player strategic

More information

STOCHASTIC PROCESSES Basic notions

STOCHASTIC PROCESSES Basic notions J. Virtamo 38.3143 Queueing Theory / Stochastic processes 1 STOCHASTIC PROCESSES Basic notions Often the systems we consider evolve in time and we are interested in their dynamic behaviour, usually involving

More information

Computing Minmax; Dominance

Computing Minmax; Dominance Computing Minmax; Dominance CPSC 532A Lecture 5 Computing Minmax; Dominance CPSC 532A Lecture 5, Slide 1 Lecture Overview 1 Recap 2 Linear Programming 3 Computational Problems Involving Maxmin 4 Domination

More information

Self-stabilizing uncoupled dynamics

Self-stabilizing uncoupled dynamics Self-stabilizing uncoupled dynamics Aaron D. Jaggard 1, Neil Lutz 2, Michael Schapira 3, and Rebecca N. Wright 4 1 U.S. Naval Research Laboratory, Washington, DC 20375, USA. aaron.jaggard@nrl.navy.mil

More information

Game interactions and dynamics on networked populations

Game interactions and dynamics on networked populations Game interactions and dynamics on networked populations Chiara Mocenni & Dario Madeo Department of Information Engineering and Mathematics University of Siena (Italy) ({mocenni, madeo}@dii.unisi.it) Siena,

More information

Statistics 992 Continuous-time Markov Chains Spring 2004

Statistics 992 Continuous-time Markov Chains Spring 2004 Summary Continuous-time finite-state-space Markov chains are stochastic processes that are widely used to model the process of nucleotide substitution. This chapter aims to present much of the mathematics

More information

Interval values for strategic games in which players cooperate

Interval values for strategic games in which players cooperate Interval values for strategic games in which players cooperate Luisa Carpente 1 Balbina Casas-Méndez 2 Ignacio García-Jurado 2 Anne van den Nouweland 3 September 22, 2005 Abstract In this paper we propose

More information

6.207/14.15: Networks Lecture 16: Cooperation and Trust in Networks

6.207/14.15: Networks Lecture 16: Cooperation and Trust in Networks 6.207/14.15: Networks Lecture 16: Cooperation and Trust in Networks Daron Acemoglu and Asu Ozdaglar MIT November 4, 2009 1 Introduction Outline The role of networks in cooperation A model of social norms

More information

Evolutionary Game Theory Notes

Evolutionary Game Theory Notes Evolutionary Game Theory Notes James Massey These notes are intended to be a largely self contained guide to everything you need to know for the evolutionary game theory part of the EC341 module. During

More information

An Introduction to Evolutionary Game Theory: Lecture 2

An Introduction to Evolutionary Game Theory: Lecture 2 An Introduction to Evolutionary Game Theory: Lecture 2 Mauro Mobilia Lectures delivered at the Graduate School on Nonlinear and Stochastic Systems in Biology held in the Department of Applied Mathematics,

More information

Evolution Through Imitation in a. Single Population 1

Evolution Through Imitation in a. Single Population 1 Evolution Through Imitation in a Single Population 1 David K. Levine and Wolfgang Pesendorfer 2 First version: September 29, 1999 This version: May 10, 2000 Abstract: Kandori, Mailath and Rob [1993] and

More information

TWO COMPETING MODELSOF HOW PEOPLE LEARN IN GAMES. By Ed Hopkins 1

TWO COMPETING MODELSOF HOW PEOPLE LEARN IN GAMES. By Ed Hopkins 1 Econometrica, Vol. 70, No. 6 (November, 2002), 2141 2166 TWO COMPETING MODELSOF HOW PEOPLE LEARN IN GAMES By Ed Hopkins 1 Reinforcement learning and stochastic fictitious play are apparent rivals as models

More information

1 AUTOCRATIC STRATEGIES

1 AUTOCRATIC STRATEGIES AUTOCRATIC STRATEGIES. ORIGINAL DISCOVERY Recall that the transition matrix M for two interacting players X and Y with memory-one strategies p and q, respectively, is given by p R q R p R ( q R ) ( p R

More information

arxiv: v1 [cs.sy] 13 Sep 2017

arxiv: v1 [cs.sy] 13 Sep 2017 On imitation dynamics in potential population games Lorenzo Zino, Giacomo Como, and Fabio Fagnani arxiv:1709.04748v1 [cs.sy] 13 Sep 017 Abstract Imitation dynamics for population games are studied and

More information

Population Dynamics Approach for Resource Allocation Problems. Ashkan Pashaie

Population Dynamics Approach for Resource Allocation Problems. Ashkan Pashaie Population Dynamics Approach for Resource Allocation Problems by Ashkan Pashaie A thesis submitted in conformity with the requirements for the degree of Master of Applied Science Graduate Department of

More information

Bargaining, Contracts, and Theories of the Firm. Dr. Margaret Meyer Nuffield College

Bargaining, Contracts, and Theories of the Firm. Dr. Margaret Meyer Nuffield College Bargaining, Contracts, and Theories of the Firm Dr. Margaret Meyer Nuffield College 2015 Course Overview 1. Bargaining 2. Hidden information and self-selection Optimal contracting with hidden information

More information

Quantitative Techniques (Finance) 203. Polynomial Functions

Quantitative Techniques (Finance) 203. Polynomial Functions Quantitative Techniques (Finance) 03 Polynomial Functions Felix Chan October 006 Introduction This topic discusses the properties and the applications of polynomial functions, specifically, linear and

More information

Product differences and prices

Product differences and prices Product differences and prices Claude d Aspremont, Jean Jaskold Gabszewicz and Jacques-François Thisse Abstract Under assumptions following Hotelling s 1929 paper Stability in Competition, the possibility

More information

LEARNING IN CONCAVE GAMES

LEARNING IN CONCAVE GAMES LEARNING IN CONCAVE GAMES P. Mertikopoulos French National Center for Scientific Research (CNRS) Laboratoire d Informatique de Grenoble GSBE ETBC seminar Maastricht, October 22, 2015 Motivation and Preliminaries

More information

Inertial Game Dynamics

Inertial Game Dynamics ... Inertial Game Dynamics R. Laraki P. Mertikopoulos CNRS LAMSADE laboratory CNRS LIG laboratory ADGO'13 Playa Blanca, October 15, 2013 ... Motivation Main Idea: use second order tools to derive efficient

More information

6 Evolution of Networks

6 Evolution of Networks last revised: March 2008 WARNING for Soc 376 students: This draft adopts the demography convention for transition matrices (i.e., transitions from column to row). 6 Evolution of Networks 6. Strategic network

More information

Problems on Evolutionary dynamics

Problems on Evolutionary dynamics Problems on Evolutionary dynamics Doctoral Programme in Physics José A. Cuesta Lausanne, June 10 13, 2014 Replication 1. Consider the Galton-Watson process defined by the offspring distribution p 0 =

More information

(x k ) sequence in F, lim x k = x x F. If F : R n R is a function, level sets and sublevel sets of F are any sets of the form (respectively);

(x k ) sequence in F, lim x k = x x F. If F : R n R is a function, level sets and sublevel sets of F are any sets of the form (respectively); STABILITY OF EQUILIBRIA AND LIAPUNOV FUNCTIONS. By topological properties in general we mean qualitative geometric properties (of subsets of R n or of functions in R n ), that is, those that don t depend

More information

1 Basic Game Modelling

1 Basic Game Modelling Max-Planck-Institut für Informatik, Winter 2017 Advanced Topic Course Algorithmic Game Theory, Mechanism Design & Computational Economics Lecturer: CHEUNG, Yun Kuen (Marco) Lecture 1: Basic Game Modelling,

More information

CS364A: Algorithmic Game Theory Lecture #13: Potential Games; A Hierarchy of Equilibria

CS364A: Algorithmic Game Theory Lecture #13: Potential Games; A Hierarchy of Equilibria CS364A: Algorithmic Game Theory Lecture #13: Potential Games; A Hierarchy of Equilibria Tim Roughgarden November 4, 2013 Last lecture we proved that every pure Nash equilibrium of an atomic selfish routing

More information

Lecture Notes on Bargaining

Lecture Notes on Bargaining Lecture Notes on Bargaining Levent Koçkesen 1 Axiomatic Bargaining and Nash Solution 1.1 Preliminaries The axiomatic theory of bargaining originated in a fundamental paper by Nash (1950, Econometrica).

More information

Game theory Lecture 19. Dynamic games. Game theory

Game theory Lecture 19. Dynamic games. Game theory Lecture 9. Dynamic games . Introduction Definition. A dynamic game is a game Γ =< N, x, {U i } n i=, {H i } n i= >, where N = {, 2,..., n} denotes the set of players, x (t) = f (x, u,..., u n, t), x(0)

More information

Question 1. (p p) (x(p, w ) x(p, w)) 0. with strict inequality if x(p, w) x(p, w ).

Question 1. (p p) (x(p, w ) x(p, w)) 0. with strict inequality if x(p, w) x(p, w ). University of California, Davis Date: August 24, 2017 Department of Economics Time: 5 hours Microeconomics Reading Time: 20 minutes PRELIMINARY EXAMINATION FOR THE Ph.D. DEGREE Please answer any three

More information

Weak Dominance and Never Best Responses

Weak Dominance and Never Best Responses Chapter 4 Weak Dominance and Never Best Responses Let us return now to our analysis of an arbitrary strategic game G := (S 1,...,S n, p 1,...,p n ). Let s i, s i be strategies of player i. We say that

More information

SEQUENTIAL EQUILIBRIA IN BAYESIAN GAMES WITH COMMUNICATION. Dino Gerardi and Roger B. Myerson. December 2005

SEQUENTIAL EQUILIBRIA IN BAYESIAN GAMES WITH COMMUNICATION. Dino Gerardi and Roger B. Myerson. December 2005 SEQUENTIAL EQUILIBRIA IN BAYESIAN GAMES WITH COMMUNICATION By Dino Gerardi and Roger B. Myerson December 2005 COWLES FOUNDATION DISCUSSION AER NO. 1542 COWLES FOUNDATION FOR RESEARCH IN ECONOMICS YALE

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Non-zero-sum Game and Nash Equilibarium

Non-zero-sum Game and Nash Equilibarium Non-zero-sum Game and Nash Equilibarium Team nogg December 21, 2016 Overview Prisoner s Dilemma Prisoner s Dilemma: Alice Deny Alice Confess Bob Deny (-1,-1) (-9,0) Bob Confess (0,-9) (-6,-6) Prisoner

More information

Symmetries and the Complexity of Pure Nash Equilibrium

Symmetries and the Complexity of Pure Nash Equilibrium Symmetries and the Complexity of Pure Nash Equilibrium Felix Brandt a Felix Fischer a, Markus Holzer b a Institut für Informatik, Universität München, Oettingenstr. 67, 80538 München, Germany b Institut

More information

The Game of Normal Numbers

The Game of Normal Numbers The Game of Normal Numbers Ehud Lehrer September 4, 2003 Abstract We introduce a two-player game where at each period one player, say, Player 2, chooses a distribution and the other player, Player 1, a

More information

Walras-Bowley Lecture 2003

Walras-Bowley Lecture 2003 Walras-Bowley Lecture 2003 Sergiu Hart This version: September 2004 SERGIU HART c 2004 p. 1 ADAPTIVE HEURISTICS A Little Rationality Goes a Long Way Sergiu Hart Center for Rationality, Dept. of Economics,

More information

Economics 201B Economic Theory (Spring 2017) Bargaining. Topics: the axiomatic approach (OR 15) and the strategic approach (OR 7).

Economics 201B Economic Theory (Spring 2017) Bargaining. Topics: the axiomatic approach (OR 15) and the strategic approach (OR 7). Economics 201B Economic Theory (Spring 2017) Bargaining Topics: the axiomatic approach (OR 15) and the strategic approach (OR 7). The axiomatic approach (OR 15) Nash s (1950) work is the starting point

More information

WHEN ORDER MATTERS FOR ITERATED STRICT DOMINANCE *

WHEN ORDER MATTERS FOR ITERATED STRICT DOMINANCE * WHEN ORDER MATTERS FOR ITERATED STRICT DOMINANCE * Martin Dufwenberg ** & Mark Stegeman *** February 1999 Abstract: We demonstrate that iterated elimination of strictly dominated strategies is an order

More information