On the Impossibility of Predicting the Behavior of Rational Agents

Dean P. Foster* and H. Peyton Young**

February, 1999. This version: October, 2000.


We exhibit a class of games that almost surely cannot be learned by perfectly rational players. That is, if players are uncertain about their opponents' payoffs, and if all of them are rational, then no matter what their prior beliefs, the probability is zero that they will learn to predict the strategies of their opponents to a good approximation. The probability is also zero that play will come close to Nash equilibrium play. Thus there are inherent limits to the ability of rational players to predict the behavior of other rational players when the others' payoffs are unknown.

*Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA

**Department of Economics, Johns Hopkins University, Baltimore, MD. pyoung@jhu.edu

The authors are indebted to Junfu Zhang and several anonymous referees for comments on an earlier draft. This research was supported in part by NSF grant SBR

Rationality vs Predictability

A standard assumption in economics is that people are rational: they maximize their expected payoffs given their beliefs about future states of the world. This hypothesis plays a crucial role in game theory, where each player is assumed to choose an optimal strategy given his belief about the strategies of his opponents. In this setting, a belief amounts to a forecast or prediction of the probabilities with which one's opponents will take various actions. The prediction is good if the forecasted probabilities are close to the actual probabilities. Together, prediction and rationality justify the central solution concept of the theory: if each player correctly predicts the opponents' strategies, and if each chooses an optimal strategy given his prediction, then the strategies form a Nash equilibrium of the game.

It remains an open problem, however, to explain how equilibrium actually emerges from out-of-equilibrium conditions. One branch of the literature attempts to identify equilibria that seem natural or focal from an a priori standpoint (Harsanyi and Selten, 1988). This equilibrium selection approach presumes that there is agreement among the players on the relevant selection principles. An alternative branch of the literature seeks to identify dynamic processes by which the players learn equilibria by observing each other's behavior over a long series of trials. To date, however, this learning approach has not succeeded in identifying any single learning rule that is fully consistent with rational behavior and that leads to Nash equilibrium in general games. (For surveys of this literature see Weibull (1995), Kirman and Salmon (1995), Vega-Redondo (1996), Samuelson (1997), Fudenberg and Levine (1998), Young (1998), and Foster, Levine, and Vohra (1999).)

In this paper we argue that no such rule exists when players are sufficiently uncertain about their opponents' payoffs. That is, there are situations in which they will never learn to predict their opponents' behavior even approximately, no matter what learning rule they use, and they will never come close to playing Nash equilibrium. This result builds on pioneering work of Jordan (1991, 1993, 1995) and is in a similar spirit to recent papers of Nachbar (1997, 1999a, 1999b) and Miller and Sanchirico (1997).[1] These results differ from ours in that agents are either assumed to be myopic, or they are forward-looking but certain belief combinations are ruled out or assumed to be improbable.

[1] For a general critique of Bayesian rationality in game theory see Binmore (1987, 1990, 1991).

The contribution of the present paper is to demonstrate that a fundamental tension exists between rationality and predictability even when players are perfectly rational and forward-looking, and there are no restrictions on their beliefs.

An example

To illustrate the problem in a concrete case, consider two individuals, A and B, who are playing the game of matching pennies. Simultaneously each turns a penny face up or face down. If the pennies match (both are Heads or both are Tails), then A wins a prize; if they do not match, B wins a prize. Assume first that the prize is one dollar, which the loser pays to the winner. Then the game has a unique Nash equilibrium in which each player randomizes by choosing Heads (H) and Tails (T) with equal probability. If both adopt this strategy, then each is optimizing given the strategy of the other. Moreover, although neither can predict the action of the opponent, each can predict his strategy, namely, the probabilities with which the actions will be taken. In this case no difficulty arises because the game has a unique equilibrium and the players know what it is.

Now change the situation by assuming that if both choose Heads, then B buys an ice cream cone for A, whereas if both choose Tails then B buys A a milk shake. Similarly, if A chooses Heads and B chooses Tails then A buys B a coke, whereas if the opposite occurs then A buys B a bag of chips. Assume that the game is played once each day, and the players' tastes do not change from one day to the next (e.g., they do not get tired of eating ice cream cones). Unlike the previous situation, neither player knows the other's payoffs.

Outcomes for A:
                 B plays H      B plays T
A plays H        eat cone       buy coke
A plays T        buy chips      drink shake

Outcomes for B:
                 B plays H      B plays T
A plays H        buy cone       drink coke
A plays T        eat chips      buy shake

For expositional simplicity let us assume first that the players are completely myopic: they do not worry about the effect of their actions on the future course of the game, but are only concerned with optimizing in the current period given their knowledge of what has happened in previous periods. Imagine that the following sequence of actions has occurred over the first ten periods:

Period:   1   2   3   4   5   6   7   8   9   10   11
A:        H   T   T   H   H   H   T   H   T   H    ?
B:        T   H   T   H   T   H   T   H   T   H    ?

The problem for each player is to predict the intention of the opponent in period eleven. Let's put ourselves in A's shoes for a moment. The behavior of B suggests an alternating pattern, perhaps leading us to predict that B will play T next period. Since we are rational, we will (given our prediction) play T for sure next period. But if B is a good predictor, then she must be able to predict that, with high probability, we are in fact going to play T next period. This prediction by B leads her to play H next period, thus falsifying our original prediction that she is about to play T.

The point is that if either side makes a prediction that leads them to play H or T for sure, the other side must predict that they are going to do so with high probability, which means that they too will choose H or T for sure. But there is no pair of predictions such that both are approximately correct and the optimal responses are H or T for sure. It follows that, for both players to be good predictors of the opponent's next-period behavior, at least one of them must be intending to play a mixed strategy next period, and the other must predict this.

Suppose, for example, that player B intends to play a mixed strategy in period eleven. This means she must be exactly indifferent between playing H and T given her predictions about A. Now B's predictions about A's behavior in the eleventh period are based on the observed history of play in the first ten periods. Let's say that the particular history given above leads B to predict that A will play H with probability .127. Since B intends to play mixed, it must be the case that B's expected utilities from playing H and T are identical, given B's utility function $u_B$ for the various outcomes. In other words, it must be that

$.127\, u_B(\text{buy cone}) + .873\, u_B(\text{eat chips}) = .127\, u_B(\text{drink coke}) + .873\, u_B(\text{buy shake}).$

But there is no reason to think that B's utilities actually do satisfy this equation exactly.
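The knife-edge nature of this indifference condition is easy to see numerically. The following minimal sketch (ours, not the paper's) draws B's utilities at random and checks the condition; the forecast .127 comes from the text, while the uniform draws on [0, 1] are an illustrative assumption:

```python
import random

# Hypothetical utility draw for B: the paper only assumes each utility lies in
# some interval and is drawn once before play; uniform on [0, 1] is our choice.
u_B = {o: random.uniform(0.0, 1.0)
       for o in ("buy cone", "eat chips", "drink coke", "buy shake")}

p_H = 0.127  # B's forecast that A plays Heads in period eleven (from the text)

# B's expected utility from playing H, and from playing T, given the forecast.
eu_H = p_H * u_B["buy cone"] + (1 - p_H) * u_B["eat chips"]
eu_T = p_H * u_B["drink coke"] + (1 - p_H) * u_B["buy shake"]

# Exact equality eu_H == eu_T is a measure-zero event under any continuous
# draw, so a rational myopic B almost surely has a strict best response.
print(f"EU(H) = {eu_H:.4f}, EU(T) = {eu_T:.4f}, indifferent: {eu_H == eu_T}")
```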

More precisely, let us suppose that B's utility for each outcome could be any real number within a certain interval, and that B's actual utility (B's type) is the result of a random draw from among these possible values. (The draw occurs once and for all before the game begins.) Following Jordan (1993), we claim that the probability is zero that the above equation will be satisfied. The reason is that there are only a finite number of distinct predictions that B could make at this point in time, because B's prediction can only be based on A's observed behavior (together with B's initial beliefs). Since this argument holds for every period, the probability is zero that B will ever be indifferent. From this and the preceding argument it follows that, in any given period, one or both players must be making a bad prediction. Moreover, they cannot be playing a Nash equilibrium in any given period (or even close to a Nash equilibrium), because this would require them to play mixed strategies, which means that both must be indifferent.

Does this imply that Nash equilibrium is useless as a predictive tool? Not necessarily. The reason is that, even though the players' strategies may not be close to equilibrium in any given period, the empirical distribution of their actions may in fact be approaching equilibrium in an asymptotic sense. Specifically, consider the following simple learning rule known as fictitious play (Brown, 1951; Robinson, 1951). Let $p_t$ denote the relative frequency with which A has played Heads, and let $q_t$ be the relative frequency with which B has played Heads, up through period t. Suppose that each player predicts that the opponent will act in period t + 1 according to these observed frequency distributions. These predictions will typically be quite wrong for the reasons given above. (That is, for almost all utility functions, the players will never be indifferent, so they will not play mixed strategies close to $p_t$ and $q_t$.) It can be shown, however, that if the players are rational and they act on these (incorrect) predictions, then the frequencies $p_t$, $q_t$ will converge with probability one to the mixed Nash equilibrium of the game (Miyasawa, 1961; Monderer and Shapley, 1996); the simulation sketched below illustrates this. In other words, to a naive observer it will appear as if the players are playing close to Nash equilibrium on average, even though in each period their intention is to play a pure strategy (which is not close to Nash).

This point illustrates that our impossibility theorem must be interpreted with some care. It does not show that Nash equilibrium is useless as a descriptive regularity of behavior; rather, it shows that the players cannot arrive at equilibrium by being good predictors. The fundamental tension is between rationality and prediction, not between rationality and equilibrium.
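The following small simulation is a sketch of standard fictitious play under the usual one-dollar payoffs (the fair-coin tie-breaking rule is our assumption, not the paper's); per-period play is always pure, yet the empirical frequencies approach (1/2, 1/2):

```python
import random

def fictitious_play(periods: int = 100_000) -> tuple:
    """Fictitious play in standard matching pennies (A wins on a match).

    Each period both players best-respond to the opponent's empirical
    frequency of Heads; ties are broken by a fair coin (our assumption).
    """
    a_heads, b_heads = random.randint(0, 1), random.randint(0, 1)  # seed period
    for t in range(2, periods + 1):
        p = a_heads / (t - 1)  # empirical frequency of A playing Heads
        q = b_heads / (t - 1)  # empirical frequency of B playing Heads
        # A matches: expected payoff of H is 2q - 1, so play H iff q > 1/2.
        a = 1 if q > 0.5 else 0 if q < 0.5 else random.randint(0, 1)
        # B mismatches: expected payoff of H is 1 - 2p, so play H iff p < 1/2.
        b = 1 if p < 0.5 else 0 if p > 0.5 else random.randint(0, 1)
        a_heads += a
        b_heads += b
    return a_heads / periods, b_heads / periods

print(fictitious_play())  # both frequencies tend to 0.5
```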

One might well ask, however, whether this argument holds up when the players are more thoroughly rational in the sense that they are forward-looking and calculate the effect of their present moves on future payoffs (suitably discounted). Forward-looking behavior allows for more flexibility in the learning process and a much richer set of strategies. First, a patient player, one who does not discount the future heavily, might be willing to test the responses of the opponent in the early stages of the game in order to make better predictions about his behavior later in the game. Second, due to a result in game theory known as the Folk Theorem, it may be possible for patient players to discover equilibrium strategies that do not require them to randomize. An example would be a form of tacit cooperation in which one of the players always plays H, say, while the other cycles between H and T in some fixed proportions. Third, forward-looking behavior opens up the possibility of invoking weaker notions of prediction. One might argue, for example, that it suffices to predict the opponent's average behavior over some given number of periods, rather than his period-by-period behavior.

This will not resolve the difficulty, however. In the first place, it does not correspond to what we normally mean by good prediction. If a weather forecaster correctly predicted that approximately one in three days will be rainy over the coming year, but he cannot estimate the probability that it will rain tomorrow, we would not be very impressed. Moreover, even if we were to accept this weaker conception of prediction, it leads to the same kinds of impossibility results that obtain in the case of period-by-period prediction. To see why, suppose that, in matching pennies (with the usual payoffs), A predicts that B will play Heads with probability .489 over the next hundred periods. Further, suppose that B predicts that A will play Heads with probability .501 over the next hundred periods. Then the rational response is that both play Tails for certain over the next hundred periods. Hence their predictions of the opponent's intentions are not close to being correct even in an average sense. The argument is easily generalized to show that a prediction about average behavior over any finite horizon is almost certainly not going to lead the predictor to be indifferent, which means that someone's prediction will not be close to the opponent's intended play over the relevant horizon.
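To make the arithmetic behind this example explicit (a worked check we add, assuming the usual one-dollar stakes in which A gains a dollar on a match and loses one otherwise):

For A, with $q = \Pr(B = \mathrm{H}) = .489$: $E[\mathrm{H}] = q(+1) + (1-q)(-1) = 2q - 1 = -.022 < 0$ and $E[\mathrm{T}] = 1 - 2q = +.022 > 0$.

For B, with $p = \Pr(A = \mathrm{H}) = .501$: $E[\mathrm{H}] = p(-1) + (1-p)(+1) = 1 - 2p = -.002 < 0$ and $E[\mathrm{T}] = 2p - 1 = +.002 > 0$.

Each player therefore strictly prefers Tails in every one of the hundred periods, so the realized frequency of Heads is zero for both, far from the forecasted .489 and .501.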

The principal result of this paper is to show that it is, in fact, inherently impossible to learn to predict the behavior of rational players in certain kinds of strategic interactions. Specifically, no matter how forward-looking the players are, there exists a distribution of payoffs for the prizes in matching pennies such that, given any learning rule, for almost all realizations of the payoffs at least one of the players will fail to learn to predict the behavior of the others (even approximately), and they will fail to learn to play a Nash equilibrium of the repeated game (even approximately). A similar impossibility theorem holds for any finite game in which the players' information about their opponents' payoffs is sufficiently incomplete. In particular, we show that if the payoffs are independently and normally distributed, there is a positive probability that the players will fail to predict their opponents' behavior and will also fail to learn to play Nash under any learning rule.

We may contrast this result with several other papers that have a similar flavor. Jordan (1993) was the first to point out that myopic players will not learn to play close to a mixed equilibrium unless they are eventually indifferent, and for any initial beliefs there is a measure-zero set of payoffs that will in fact make them indifferent. Our result extends this observation to the case where players are arbitrarily forward-looking, and we show that it has important implications for prediction as well as for learning equilibrium. Of course, players can learn to predict under certain conditions. In particular, Kalai and Lehrer (1993) show that if the players' initial beliefs about the others' strategies are sufficiently correct to begin with, then with probability one they learn to predict more and more accurately as time goes on. Our result therefore implies that, in some games of incomplete information, no beliefs are sufficiently correct to begin with; that is, the Kalai-Lehrer condition holds with probability zero. A third branch of the literature explores various conditions on players' beliefs that make learning impossible in certain kinds of games (Nachbar, 1997, 1999a, 1999b). Our approach differs in that we place no restrictions on the players' beliefs, but we do restrict the class of games to a greater extent. Our result says that for these kinds of games, and any combination of beliefs, learning fails to occur with probability one.

The learning model

Consider an n-person game of incomplete information with finite action space $X = \prod X_i$ and utility functions $u_i: X \to \mathbb{R}$. We shall assume that the payoffs can be represented in the form $u_i(x) = u_i^*(x) + \omega_i(x)$, where the values $u_i^*(x)$ are the payoffs in a benchmark game G*, and the $\omega_i(x)$ are i.i.d. random variables with distribution $\nu(\omega)$. It is assumed that the support of $\nu$ is a bounded interval $[-\lambda/2, \lambda/2]$, and that $\nu$ is absolutely continuous with respect to Lebesgue measure (e.g., the uniform distribution).
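As an illustration of this payoff structure, here is a minimal sketch (not from the paper) that draws the perturbed matching-pennies game once and for all before play; the value of λ is an arbitrary assumption:

```python
import random

LAMBDA = 0.02  # support length of the error distribution ν (our assumption)

# Benchmark game G*: matching pennies. Keys are (row action, column action);
# values are (payoff to player 1, payoff to player 2). The game is zero-sum.
G_STAR = {
    ("H", "H"): (1.0, -1.0), ("H", "T"): (-1.0, 1.0),
    ("T", "H"): (-1.0, 1.0), ("T", "T"): (1.0, -1.0),
}

def draw_perturbed_game(lam: float = LAMBDA) -> dict:
    """u_i(x) = u_i*(x) + w_i(x), with the w_i(x) i.i.d. uniform on
    [-lam/2, lam/2]; drawn once and for all before play begins."""
    return {
        x: (u1 + random.uniform(-lam / 2, lam / 2),
            u2 + random.uniform(-lam / 2, lam / 2))
        for x, (u1, u2) in G_STAR.items()
    }

G = draw_perturbed_game()
# Player 1 privately observes {x: G[x][0]}; player 2 observes {x: G[x][1]}.
# G* and ν are common knowledge; the realized draws are not.
```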

The game G* and the error structure are common knowledge, but the realization $u_i(x)$ is known only to player i. Errors are drawn once only at the beginning of play, and the resulting game is denoted by G. The players then play G infinitely often.

The outcome in period t is an n-tuple of actions $x_t \in X$. The partial history up to time t is the sequence of outcomes $h_t = (x_1, x_2, \ldots, x_t)$. Let $h_0$ represent the null history, $H_t$ the set of all length-t histories, and $H = \bigcup_t H_t$ the set of all finite histories. The set of infinite histories is denoted by $H_\infty$. Histories are publicly observed; that is, there is perfect monitoring. Each player i has a discount rate $0 < \delta_i < 1$. Thus the payoff to player i from the infinite history $h = (x_1, x_2, \ldots, x_t, \ldots) \in H_\infty$ is

$U_i(h) = \sum_{t=1}^{\infty} \delta_i^{t-1} u_i(x_t).$   (1)

Let $\Delta_i$ denote the set of probability distributions over $X_i$. Let $\Delta = \prod \Delta_i$ denote the product set of mixtures, and let $\Delta_{-i} = \prod_{j \neq i} \Delta_j$ be the product set of mixtures by i's opponents. The behavioral strategy of a player is a probability distribution over i's actions in each period t, contingent on the observed history up through t - 1. Thus we can represent i's strategy by a function $q^t_i = g_i(u_i, h_{t-1}) \in \Delta_i$, where $q^t_i(x_i)$ is the probability that i plays $x_i$ in period t, contingent on his realized payoff function being $u_i$ and the history to date being $h_{t-1}$.

A prior belief of player i is a probability distribution over all possible combinations of strategies $g_{-i}$. We can decompose an agent's beliefs into period-by-period forecasts of the others' behavior contingent on the history observed so far. Thus if $h_{t-1}$ is the history up to time t - 1, i's forecast about the behavior of her opponents in period t can be represented by $p^t_{-i} = f_i(h_{t-1}) \in \Delta_{-i}$, where $p^t_{-i}(x_{-i})$ is the probability that i assigns to the others playing the combination $x_{-i}$ in period t. The function $f_i: H \to \Delta_{-i}$ is determined by i's prior beliefs, and will be called i's forecasting function. Given any vector of forecasting functions $f = (f_1, f_2, \ldots, f_n)$, it can be shown that there exists a set of prior beliefs such that the $f_i$ describe the period-by-period forecasts of players with these beliefs (see Kalai and Lehrer, 1993).
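The objects defined so far can be summarized as code. The following sketch (ours; all names are hypothetical) states the signatures of the forecasting and strategy functions and evaluates (1) on a finite history:

```python
from typing import Callable, Dict, List, Tuple

Action = str                      # e.g. "H" or "T"
Outcome = Tuple[Action, Action]   # one action per player (two players here)
History = List[Outcome]           # a partial history h_t
Mixed = Dict[Action, float]       # a one-period mixed (behavioral) strategy
Payoffs = Dict[Outcome, float]    # a realized payoff function u_i

# Forecasting function f_i: by the independence argument below, it may depend
# on G* and the prior ν, but not on any realized payoff draw.
ForecastFn = Callable[[History], Mixed]

# Behavioral strategy g_i: unlike f_i, it does see i's own realized payoffs.
StrategyFn = Callable[[Payoffs, History], Mixed]

def discounted_utility(u_i: Payoffs, h: History, delta: float) -> float:
    """U_i(h) = sum_{t>=1} delta^(t-1) * u_i(x_t), as in (1), on a finite h."""
    return sum(delta ** t * u_i[x] for t, x in enumerate(h))
```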

Consider the situation just after the players have been informed privately of their payoff functions $u_i$. Because of the independence of the draws among players, no one knows anything he did not already know about the others' payoffs. This has an implication for the forecasting functions. Namely, at each stage t, i knows that j's information consists solely of the publicly observed history $h_t$ and j's own payoff function $u_j$. Player j's behavior cannot be conditioned on information that j does not have (namely $u_{-j}$), and player i's forecast of j's behavior cannot be conditioned on information that i does not have (namely, $u_{-i}$). Thus i's forecast $(f_i(h_{t-1}))_j$ about j's behavior in period t does not depend on the realization of the values $u_k$ for any k, including k = i, j. Moreover, this holds for all t and all i. It follows that the functions $f_i$ do not depend on the realized payoff functions $u_i(x)$, though they may depend on the prior distribution $\nu$ and the game G*. Another way of saying this is that the prior beliefs must be consistent with the players' a priori knowledge of the information structure. They cannot believe things that they know a priori to be false.

Following Jordan (1993), we shall say that a learning process is a pair $(f, g) = (f_1, \ldots, f_n; g_1, \ldots, g_n)$ where $f_i: H \to \Delta_{-i}$ and $g_i: H \to \Delta_i$ for each player i. Note that, unlike $f_i$, $g_i$ does depend on i's realized payoff function $u_i$, since $g_i$ represents an optimal choice of strategy by i. For notational simplicity we shall not write this dependence explicitly, however. A learning path is a realization of this process; in other words, it is a history $h \in H_\infty$. Each history is associated with a sequence of period-by-period forecasts and strategies for each player. Given a history h, we shall denote player i's forecast in period t by $p^t_{-i}(h) = f_i(h_{t-1})$, and i's strategy in period t by $q^t_i(h) = g_i(h_{t-1})$.

Given a realized payoff function $u_i$ for player i, the pair $(f_i, g_i)$ induces a probability measure on the set of all infinite play paths $h \in H_\infty$. Similarly, for every partial history $h_{t-1}$, $f_i$ and $g_i$ induce a conditional probability distribution on all continuation histories from $h_{t-1}$. We shall denote this conditional distribution by $\mu_i(f_i, g_i, h_{t-1})$. Individual i is rational if, for every $h_{t-1}$, i's conditional strategy $g_i(\cdot \mid h_{t-1})$ optimizes i's expected utility from time t on, given i's conditional forecast $f_i(\cdot \mid h_{t-1})$. That is, for every other choice of strategy $g'_i$ and for every partial history $h_{t-1}$,

$\int U_i(h)\, d\mu_i(f_i, g_i, h_{t-1}) \geq \int U_i(h)\, d\mu_i(f_i, g'_i, h_{t-1}).$   (2)

Intuitively, player i "learns to predict" the behavior of his opponent(s) if i's forecast of their next-period behavior becomes closer and closer to their actual next-period strategies. This idea may be formalized as follows. Consider a learning process (f, g), and let $\mu(g)$ denote the probability measure induced on $H_\infty$ by the strategies $g = (g_1, g_2, \ldots, g_n)$. We shall say that player i learns to predict if, for $\mu(g)$-almost all realizations h,

$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} [p^t_{-i}(h) - q^t_{-i}(h)]^2 = 0.$   (3)

In other words, the mean-square error of i's next-period predictions goes to zero over almost all histories of play. This is a somewhat weaker condition than the one considered by Kalai and Lehrer (1993). In their framework, good prediction occurs if for every small positive ε and every positive integer L, the predictive error over the next L periods eventually falls below ε with probability one; that is, there exists a $T_\varepsilon$ such that, for almost all h,

$\frac{1}{L} \sum_{t = T_\varepsilon + 1}^{T_\varepsilon + L} [p^t_{-i}(h) - q^t_{-i}(h)]^2 < \varepsilon.$   (4)

In Kalai and Lehrer's terminology, i's beliefs are (ε, L)-close to the actual strategies. (Note that if (4) holds for L = 1, then it holds for all positive L.) It is clear that (4) implies (3), but (3) is weaker because it allows for occasional bad predictions.[2] We say that player i never learns to predict if the subset of histories for which (3) holds has µ-measure zero.

[2] See Kalai, Lehrer, and Smorodinsky (1999) for a comprehensive discussion of various notions of prediction and their relationship to merging and calibration.
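Criterion (3) is straightforward to evaluate along any simulated history. A minimal sketch (ours, not the paper's):

```python
from typing import Dict, List, Tuple

Profile = Tuple[str, ...]     # an action profile of i's opponents, e.g. ("H",)
Dist = Dict[Profile, float]   # a probability distribution over such profiles

def mean_square_error(forecasts: List[Dist], strategies: List[Dist]) -> float:
    """Average squared next-period forecast error over T periods, as in (3).

    forecasts[t] is p_{-i}^{t+1}, i's forecast of the opponents' period-(t+1)
    mixed play; strategies[t] is q_{-i}^{t+1}, their actual behavioral mix.
    Player i "learns to predict" on a history iff this average tends to zero;
    the stronger Kalai-Lehrer notion (4) requires the error over every window
    of length L eventually to stay below epsilon.
    """
    T = len(forecasts)
    total = 0.0
    for p, q in zip(forecasts, strategies):
        support = set(p) | set(q)
        total += sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in support)
    return total / T
```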

An impossibility theorem

We now demonstrate a class of infinitely repeated games of incomplete information such that, with probability one, some player never learns to predict, and this holds for all prior beliefs. Given the Kalai-Lehrer result, it follows that, with probability one, the true distribution of play is not absolutely continuous with respect to any combination of prior beliefs. Further, since our result holds for all beliefs, it must hold for rational expectations beliefs, that is, beliefs that form a sophisticated Bayesian Nash equilibrium (Jordan, 1991). These beliefs have the property that, at each stage of play, a player's forecast of the opponent's behavior in future periods is correctly conditioned on the posterior distribution of payoff-types revealed by the history of play so far. But this does not mean that the prediction is correct (or even approximately correct) for a given opponent.

Of course, this leaves open the question of whether the players' strategies might come close to Nash equilibrium even if the players are not good predictors. We shall show, however, that with probability one play does not come close to Nash whether the players learn to predict or not. Indeed, we shall show that this is true even under the following weak conception of what it means to "come close to Nash." Let Q* be the set of all one-period strategy tuples q such that q occurs in some time period in some Nash equilibrium of the repeated game. Let d(q, q') denote the Euclidean distance between any two vectors q, q'. Given a learning process (f, g) and a specific history h, we say that play comes asymptotically close to Nash on h if, for every small ε > 0 and almost all t,

$\inf_{q \in Q^*} d(q,\, g(u, h_{t-1})) < \varepsilon.$   (5)

The learning process (f, g) comes asymptotically close to Nash if the subset of histories h for which (5) holds has µ(g)-probability one. Similarly, the learning process is asymptotically far from Nash if the subset of histories h for which (5) holds has µ(g)-probability zero.

Theorem. Let G* be a finite, zero-sum, two-person game with a unique Nash equilibrium, which has full support. Consider a game of incomplete information in which the payoffs of G* are perturbed by i.i.d. random errors drawn from a distribution ν whose support is an interval of length λ containing zero as an interior point and that is absolutely continuous with respect to Lebesgue measure. Assume that the players are rational and Bayesian with arbitrary priors on the play of the infinitely repeated game. If λ is sufficiently small, then with ν-probability one, at least one of the players almost surely will not learn to predict the strategies of the others, and the learning process will almost surely be asymptotically far from Nash.

Since the proof is technically quite involved, we shall first explain why the argument given in the introduction for myopic players does not extend easily to the general case.

Consider a zero-sum game G* with a unique Nash equilibrium, which puts positive probability on all actions. It is not difficult to show that, in the repeated game of G*, every Nash equilibrium is behaviorally mixed in the following sense: for every partial history that occurs with positive probability, the behavioral strategies in the next period have full support on the one-shot strategies in G*. Furthermore, this remains true when the payoffs of G* are perturbed slightly. This assures not only that the behavioral strategies put positive probability on every action in all time periods, but that these probabilities are uniformly bounded away from zero by some small ε > 0 in all time periods.

From this we would like to be able to conclude, as in the myopic case, that good prediction plus rationality requires that someone must play mixed in almost all time periods. Things are not so simple, however. One difficulty is that we have to show that good prediction fails on all but a measure-zero set of histories. It could be that good prediction holds on a small set of positive measure even though the strategies themselves are far from Nash equilibrium. Eliminating this case requires a fairly intricate probabilistic argument. The second difficulty is that, even when players mix and are therefore indifferent among alternative strategies, this does not immediately imply that the expected payoffs fall into a subset of measure zero. Indifference implies that the payoffs satisfy a complex relationship based on a player's expectations about the future play paths generated by the others' strategies. One has to show, not merely that indifference occurs, but that it occurs in such a way that it forces the payoffs into a set of measure zero.

To increase the transparency of the proof, we give it for the game of matching pennies. It will then be seen that it generalizes readily to any finite, zero-sum, two-person game with a unique Nash equilibrium that has full support. Fix an error distribution ν on the interval $[-\lambda/2, \lambda/2]$ that is absolutely continuous with respect to Lebesgue measure, where λ > 0 is the length of the interval. (A similar proof holds for any small interval that contains zero in its interior.) For the sake of concreteness, we may think of ν as the uniform distribution. The game of matching pennies has the payoff matrix

$\begin{pmatrix} 1, -1 & -1, 1 \\ -1, 1 & 1, -1 \end{pmatrix}$

The perturbed game G has the form

$\begin{pmatrix} 1 + \omega_{11},\ -1 + \omega'_{11} & \quad -1 + \omega_{12},\ 1 + \omega'_{12} \\ -1 + \omega_{21},\ 1 + \omega'_{21} & \quad 1 + \omega_{22},\ -1 + \omega'_{22} \end{pmatrix}$

where the $\omega_{ij}, \omega'_{ij}$ are i.i.d. random variables distributed according to ν.

Fix two players with discount rates $0 < \delta_1, \delta_2 < 1$. Let their beliefs be $f_1, f_2$, and let their strategies be $g_1(\cdot \mid A)$, $g_2(\cdot \mid B)$, where A and B are the realized values of the players' payoff matrices. The functions $f_1, f_2, g_1, g_2$ will be fixed throughout the proof. We shall always assume that the players are rational; that is, $g_1(\cdot \mid A)$ and $g_2(\cdot \mid B)$ optimize the players' expected payoffs from any time T on, conditional on any partial history $h_{T-1}$ having been observed. (Note that this requires that a player optimize conditional on any observable event, whether or not he believed initially that the event could occur with positive probability.) Given a bound λ > 0 on the perturbation size, let $B_\lambda$ denote the set of all payoff pairs (A, B) such that the set of histories h satisfying condition (3) has positive measure.

Lemma 1. Given any $\varepsilon \in (0, 1)$ and any positive integer k, for each $(A, B) \in B_\lambda$, there exists a partial history $h_{T-1} = h_{T-1}(\varepsilon, k, A, B)$ such that, conditional on being in state $h_{T-1}$ at the end of period T - 1, the probability is at least $1 - \varepsilon^{4k}$ that each player forecasts the other's behavioral strategy within ε in each of the periods T, T + 1, ..., T + 2k - 1.

Proof. Suppose there were no such partial history $h_{T-1}$ for some choice of ε, k, and $(A, B) \in B_\lambda$. Then for every time T the probability would be greater than $\varepsilon^{4k}$ that at least one player misforecasts by more than ε in one or more of the periods T, T + 1, ..., T + 2k - 1. This would imply that condition (3) is violated for almost all histories, which contradicts the assumption that the set of histories satisfying (3) has positive measure for the pair (A, B).

Suppose now that (A, B) is a pair for which good prediction occurs with positive probability. We show next that, when λ and ε are sufficiently small, both players must play mixed for many periods in succession.

Lemma 2. There exists a bound γ > 0 and a positive integer k such that, for every $\lambda, \varepsilon \in (0, \gamma)$ and every $(A, B) \in B_\lambda$, the partial history $h_{T-1}(\varepsilon, k, A, B)$ guaranteed by Lemma 1 has the property that both $g_1$ and $g_2$ are mixed in the k successive periods T, T + 1, ..., T + k - 1. Moreover, $g_1$ and $g_2$ put probability at least ε on each of the players' actions in each of these periods.

Proof. Let $\delta = \max\{\delta_1, \delta_2\}$, and let γ be so small that $\gamma \leq \delta$ and $\gamma \log\gamma/\log\delta < (1 - \delta)^2/50$. Let k be the next integer larger than $\log\gamma/\log\delta$; thus $k \geq 2$. We shall show that Lemma 2 holds with these choices of γ and k, though we make no claim that they are best possible. For future reference let us observe that, under these assumptions on k, δ, and γ,

$\gamma(k - 1) \leq \gamma \log\gamma/\log\delta < (1 - \delta)^2/50.$   (6)

In particular,

$\gamma < .02$ and $\gamma k < .04$,   (7)

since $(1 - \delta)^2/50 \leq 1/50 = .02$ and $\gamma k = \gamma(k - 1) + \gamma < .02 + .02$.

Given $(A, B) \in B_\lambda$, let $h_{T-1}(\varepsilon, k, A, B)$ be the partial history guaranteed by Lemma 1. To avoid cumbersome notation we shall assume that T = 1 and $h_{T-1}$ is the null history. This involves no loss of generality, because we make no restrictions on how the players condition on the history to date. For each period $t \geq 1$, let $a_t(g_1, f_1)$ be player 1's expected payoff in period t given his strategy and forecast, and let $b_t(f_2, g_2)$ be player 2's expected payoff in period t given her strategy and forecast. Let $\alpha_t$ denote player 1's expected discounted payoff from period t on (normalized by the factor $1 - \delta_1$), given his forecast of developments up to period t. Similarly define $\beta_t$ for player 2. We claim that $\alpha_t, \beta_t$ are small for each $t \leq k$; that is, both players' expected payoffs are approximately zero in each of these periods. We remark that this is not immediately obvious. Even though the players are predicting well with positive probability, their forecasts may be quite far off at future times, which might lead them to be overly optimistic about their expected payoffs now.

Specifically, we are going to establish the following claim:

Claim. $\alpha_t, \beta_t < [(12k + 7)/(1 - \delta)]\gamma.$   (8)

The first step in proving this is to observe that, in every future period t, the sum of the one-shot undiscounted payoffs at time t under the actual strategies is close to zero. In particular, we know that

$|a_t(g_1, g_2) + b_t(g_1, g_2)| \leq \lambda.$   (9)

Next we shall show that something close to this holds for the players' expectations of their one-shot payoffs in period t, even though these expectations may be incorrect due to forecasting errors. Let us first evaluate player 1's expectation of his one-shot payoff in period t, namely $a_t(g_1, f_1)$. Let $p_1(h_t)$ be the probability that player 1 attaches to each of the $4^t$ possible states $h_t$ in period t, and let $p^*(h_t)$ be the actual probability of $h_t$ under the players' strategies $g_1$ and $g_2$. We shall say that a forecast is good up to t if it is within ε of the other's actual behavior in each of the periods 1, 2, ..., t; otherwise it is bad. Divide the outcomes into two classes: $S_t$ is the set of outcomes $h_t$ that were preceded only by good forecasts; $L_t$ is the set of all outcomes $h_t$ that were preceded by at least one bad forecast. Bad forecasts in this interval are extremely rare; indeed, by hypothesis, $p^*(L_t) \leq \varepsilon^{4k}$, where $t \leq k$. Define the absolute discrepancy between $p_1$ and $p^*$ in period t to be

$d_t = \sum_{h_t} |p_1(h_t) - p^*(h_t)|.$   (10)

The discrepancy between player 1's forecast of his expected payoff and his actual expected payoff in period t is bounded as follows:

$|a_t(g_1, f_1) - a_t(g_1, g_2)| \leq (1 + \lambda/2)\, d_t.$   (11)

Let us now evaluate the discrepancy on $S_t$ and $L_t$ separately. For each outcome $h_t \in S_t$, all of player 1's forecasts of 2's behavior are within ε of being correct, and he makes t such forecasts to estimate $p_1(h_t)$.

(He forecasts his own behavior correctly, of course.) If t = 1, the sum of the absolute differences is at most 2ε. Over t periods, player 1's forecasting errors on $S_t$ cumulate by at most the following:

$\sum_{h_t \in S_t} |p_1(h_t) - p^*(h_t)| \leq \binom{t}{1}(2\varepsilon) + \binom{t}{2}(2\varepsilon)^2 + \binom{t}{3}(2\varepsilon)^3 + \cdots + \binom{t}{t}(2\varepsilon)^t \leq 2t\varepsilon/(1 - 2t\varepsilon),$   (12)

where the last inequality uses $\binom{t}{j}(2\varepsilon)^j \leq (2t\varepsilon)^j$ and sums the resulting geometric series. Now consider the total discrepancy between $p_1$ and $p^*$ on $L_t$. Since $p^*(L_t) \leq \varepsilon^{4k}$, the total downside error on outcomes in $L_t$ is no more than $\varepsilon^{4k}$. Since the total downside error on $S_t$ is at most $2t\varepsilon/(1 - 2t\varepsilon)$, the total upside error on $L_t$ is at most $2t\varepsilon/(1 - 2t\varepsilon)$. Hence

$d_t \leq 4t\varepsilon/(1 - 2t\varepsilon) + \varepsilon^{4k} \leq 4k\varepsilon/(1 - 2k\varepsilon) + \varepsilon^{4k}.$   (13)

Recalling that $\varepsilon \leq \gamma < .02$, it follows that $d_t \leq 5k\gamma$. Combining this with (11) we therefore have

$|a_t(g_1, f_1) - a_t(g_1, g_2)| \leq 6k\gamma.$   (14)

Similarly we can show that

$|b_t(f_2, g_2) - b_t(g_1, g_2)| \leq 6k\gamma.$   (15)

Putting (14) and (15) together with (9), it follows that

$|a_t(g_1, f_1) + b_t(f_2, g_2)| \leq 12k\gamma + \lambda \leq (12k + 1)\gamma.$   (16)

Expression (16) shows that the players' expected payoffs in the t-th period stage game are close to being zero-sum. To complete the proof of the claim, we must show that their expected discounted payoffs $\alpha_t, \beta_t$ are close to zero. In the stage game, let the maximin security levels of players 1 and 2 be $v_1$ and $v_2$ respectively. If λ is sufficiently small, $v_1$ and $v_2$ are close to zero. Indeed, the mixture (1/2, 1/2) guarantees each player at least -λ, so

$v_1, v_2 \geq -\lambda.$   (17)

By assumption, each player's repeated-game strategy $g_i$ is a best reply given his or her forecast. By (17), either player can guarantee himself at least -λ after any period t. Given our assumption that $\lambda \leq \gamma$, it follows that

$\alpha_t, \beta_t \geq -\gamma$ for all $t \geq 1$.   (18)

We shall now bound $\alpha_t$ and $\beta_t$ from above by a small number. For notational convenience we shall show this for $\alpha_1$ and $\beta_1$, then observe that the argument extends to every $t \leq k$. For brevity, denote the expected one-shot payoffs in period t by $a_t = a_t(g_1, f_1)$ and $b_t = b_t(f_2, g_2)$. Assume without loss of generality that $0 < \delta_1 \leq \delta_2 < 1$. Then we have the identities

$\alpha_1 = (1 - \delta_1)[a_1 + \delta_1 a_2 + \delta_1^2 a_3 + \cdots]$
$(\delta_2 - \delta_1)\alpha_2 = (1 - \delta_1)(\delta_2 - \delta_1)[a_2 + \delta_1 a_3 + \delta_1^2 a_4 + \cdots]$
$(\delta_2 - \delta_1)\delta_2\alpha_3 = (1 - \delta_1)(\delta_2 - \delta_1)\delta_2[a_3 + \delta_1 a_4 + \delta_1^2 a_5 + \cdots]$
$\qquad\vdots$
$(\delta_2 - \delta_1)\delta_2^{k-2}\alpha_k = (1 - \delta_1)(\delta_2 - \delta_1)\delta_2^{k-2}[a_k + \delta_1 a_{k+1} + \cdots]$

Adding these expressions we obtain

$\alpha_1 + (\delta_2 - \delta_1)[\alpha_2 + \delta_2\alpha_3 + \cdots + \delta_2^{k-2}\alpha_k] = (1 - \delta_1)[a_1 + \delta_2 a_2 + \delta_2^2 a_3 + \cdots + \delta_2^{k-1} a_k] + \delta_2^{k-1}\delta_1\alpha_{k+1}.$   (19)

By definition,

$\beta_1 = (1 - \delta_2)[b_1 + \delta_2 b_2 + \delta_2^2 b_3 + \cdots + \delta_2^{k-1} b_k] + \delta_2^k\beta_{k+1}.$   (20)

For each t define $s_t = a_t + b_t$. Multiply (19) by $(1 - \delta_2)$ and (20) by $(1 - \delta_1)$, then add them to obtain

$(1 - \delta_2)\alpha_1 + (1 - \delta_1)\beta_1 = (1 - \delta_1)\xi - (\delta_2 - \delta_1)\psi + (1 - \delta_1)(1 - \delta_2)\zeta,$   (21)

where

$\xi = (1 - \delta_2)[s_1 + \delta_2 s_2 + \delta_2^2 s_3 + \cdots + \delta_2^{k-1} s_k],$
$\psi = (1 - \delta_2)[\alpha_2 + \delta_2\alpha_3 + \delta_2^2\alpha_4 + \cdots + \delta_2^{k-2}\alpha_k],$
$\zeta = \delta_2^{k-1}\delta_1\alpha_{k+1} + \delta_2^k\beta_{k+1}.$

The term ξ is a nonnegative weighted sum of $s_1, \ldots, s_k$, where the weights sum to less than unity. Similarly, ψ is a nonnegative weighted sum of $\alpha_2, \ldots, \alpha_k$, where the weights sum to less than unity. From (16) we know that $s_t = a_t + b_t \leq (12k + 1)\gamma$ for each t, so $\xi \leq 13k\gamma$. From (18) we know that $\alpha_t \geq -\gamma$ for each t, so $\psi \geq -\gamma$. Since the one-shot payoff to any player in any period is at most $1 + \lambda/2$, we have $\alpha_{k+1}, \beta_{k+1} \leq 1 + \gamma/2$. Hence $\zeta \leq \delta^k(2 + \gamma)$, where $\delta = \delta_2 \geq \delta_1$. By choice of k, $\delta^k < \gamma$. Putting all of this together, we have

$(1 - \delta_2)\alpha_1 + (1 - \delta_1)\beta_1 < (1 - \delta_1)(12k + 1)\gamma + (\delta_2 - \delta_1)\gamma + (1 - \delta_1)(1 - \delta_2)\gamma(2 + \gamma)$
$< (12k + 1)\gamma + \gamma + 2\gamma + \gamma^2 < 12k\gamma + 5\gamma.$

Rewrite the left side as follows:

$(1 - \delta_2)[\alpha_1 + \beta_1] + (\delta_2 - \delta_1)\beta_1 < 12k\gamma + 5\gamma.$   (22)

Since $\beta_1 \geq -\gamma$ and $\delta = \delta_2$, it follows from (22) that $\alpha_1 + \beta_1 < [(12k + 6)/(1 - \delta)]\gamma$. Using the fact that $\alpha_1, \beta_1 \geq -\gamma$, we therefore have

$\alpha_1, \beta_1 < [12k\gamma + 6\gamma + (1 - \delta)\gamma]/(1 - \delta) < [(12k + 7)/(1 - \delta)]\gamma.$   (23)

Now observe that the same argument holds for every period $t \leq k$, because by assumption the players predict within ε for each of the next 2k periods. Thus $\alpha_t, \beta_t < [(12k + 7)/(1 - \delta)]\gamma$ for every $t \leq k$, which establishes claim (8).

Next we assert that both players put probability at least ε on each of their actions in each of the first k periods. Consider first the situation in period 1. Suppose by way of contradiction that some agent, say player 1, puts less than ε probability on one of his actions. Thus he puts more than probability 1 - ε on the other action (say action 1). Each player moves once each period, and each puts at least 1/2 probability on some action. Since player 1 begins by choosing action 1 with probability 1 - ε > 1/2, there is at least one continuation history up to time t = 2k whose probability exceeds $(1/2)^{4k}$.

Let $h_{2k}$ denote such a history. By assumption, the probability that a bad forecast occurs by time t = 2k is at most $\varepsilon^{4k}$. Since $\Pr(h_{2k}) \geq (1/2)^{4k} > \varepsilon^{4k}$, there can be no bad forecasts along this history. Therefore, in period t = 1, player 2 forecasts that player 1 will choose action 1 with probability at least 1 - 2ε. Consider now the following deviation strategy g* for player 2: in period 1, she plays action 2, and in each period thereafter she plays her minimax strategy. Given her forecast in period 1, player 2 can expect to get at least $(1 - 2\varepsilon)(1 - \lambda/2)$ in the first period, and at least $v_2 \geq -\gamma$ thereafter. Hence her expected discounted payoff from using g* is

$\beta^* \geq (1 - \delta_2)(1 - 2\varepsilon)(1 - \lambda/2) + \delta_2 v_2 \geq (1 - \delta)(1 - 2\gamma)(1 - \gamma/2) - \delta\gamma \geq 1 - \delta - 3\gamma.$   (24)

On the other hand, under player 2's original strategy $g_2$, the expected payoff is $\beta_1 < [(12k + 7)/(1 - \delta)]\gamma$. Thus we will have reached a contradiction if we can show that $[(12k + 7)/(1 - \delta)]\gamma \leq 1 - \delta - 3\gamma$. The latter holds if $12k\gamma + 7\gamma + 3(1 - \delta)\gamma < (1 - \delta)^2$. However, this follows easily from our earlier assumption (6) that $k \geq 2$ and $50\gamma(k - 1) < (1 - \delta)^2$. This contradiction shows that both players put probability at least ε on each of their actions in period 1.

Now consider the situation in period 2. By the above, each outcome in period 2 occurs with probability at least $\varepsilon^2$. Hence, for each outcome in period 2, there must exist a continuation path to period t = 2k along which no bad forecasts occur. (Otherwise a bad forecast would occur between periods 1 and 2k with probability at least $\varepsilon^2(1/2)^{4k-2} > \varepsilon^{4k}$, which contradicts our present hypothesis.) As in the preceding, we conclude that both players must place probability at least ε on each of their actions in period 2. Using (23) we can extend this argument to every period $1 \leq t \leq k$. This establishes Lemma 2.

Let $h_{T-1}$ be a partial history satisfying the conditions of Lemma 2. Since player 1 plays mixed at T, T + 1, ..., T + k - 1, his expected payoffs from taking action 1 or 2 must be exactly tied, given his beliefs at each of these times. This means that we can evaluate his payoffs as if he chose action 1 in each of the periods T, T + 1, ..., T + k - 1. Specifically, player 1's payoff from choosing action 1 at time T can be written as follows:

$U^{11}_T(A) = (1 - \delta_1)\sum_{s=1}^{k} \delta_1^{s-1}(\pi_s a_{11} + (1 - \pi_s)a_{12}) + \delta_1^k U^{11}_{T+k}(A)$   (26)
$\qquad\quad\ = \theta a_{11} + (1 - \delta_1^k - \theta)a_{12} + \delta_1^k U^{11}_{T+k}(A),$

where $\theta = \sum_{s=1}^{k}(1 - \delta_1)\delta_1^{s-1}\pi_s$ and the coefficients $\pi_s$ are determined by player 1's beliefs about player 2's actions in period T + s - 1, evaluated over all possible developments between periods T and T + s - 1. Similarly, player 1's expected payoff from playing action 2 at time T can be computed as if player 1 played action 2 from T to T + k - 1 inclusive; that is, for some θ' > 0,

$U^{12}_T(A) = \theta' a_{21} + (1 - \delta_1^k - \theta')a_{22} + \delta_1^k U^{12}_{T+k}(A).$   (27)

Since player 1 plays mixed at T, it must be that $U^{11}_T(A) = U^{12}_T(A)$. Let $F(A) = U^{11}_T(A) - U^{12}_T(A)$; that is,

$F(A) = \theta a_{11} + (1 - \delta_1^k - \theta)a_{12} - [\theta' a_{21} + (1 - \delta_1^k - \theta')a_{22}] + \delta_1^k[U^{11}_{T+k}(A) - U^{12}_{T+k}(A)].$   (28)

We can view (28) as defining F on the entire space $\mathbb{R}^4$. The preceding argument shows that if $(A, B) \in B_\lambda$ and $h_{T-1}$ satisfies the conditions of Lemma 2, then A is a zero of F. (Likewise, B is a zero of another function, but this fact will not be needed in the rest of the proof.) Let $Z(h_{T-1})$ denote the set of zeros of F, where F is determined by the initial history $h_{T-1}$.

Lemma 3. $Z(h_{T-1})$ has Lebesgue measure zero in $\mathbb{R}^4$.

Proof. It is convenient to make the following change of variables: $w_{11} = a_{11} + a_{12}$, $w_{12} = a_{11} - a_{12}$, $w_{21} = a_{21}$, and $w_{22} = a_{22}$.

Let $\tilde F(w_{11}, w_{12}, w_{21}, w_{22}) = F(A)$. Let G(w) be the difference between player 1's future expected payoff from playing action 1 as opposed to action 2 in period T + k; that is, $G(w) = U^{11}_{T+k}(A) - U^{12}_{T+k}(A)$. From this and (28) we have

$\tilde F(w) = [(1 - \delta_1^k)/2]w_{11} + [\theta - (1 - \delta_1^k)/2]w_{12} - [\theta' w_{21} + (1 - \delta_1^k - \theta')w_{22}] + \delta_1^k G(w).$   (29)

We shall consider F, $\tilde F$, and G to be defined on all of $\mathbb{R}^4$. Let Z be the set of zeroes of $\tilde F$. Since $\tilde F$ is continuous, Z is closed and therefore measurable. We are going to show that Z has Lebesgue measure zero. To this end, let us examine the rates of change of G(w) with respect to each of the variables $w_{ij}$.

Suppose, first, that we increase $w_{11}$ by some amount $\Delta w_{11} > 0$. Then the one-period payoff (to player 1) from choosing action 1 goes up by $\Delta w_{11}/2$ at each of 1's decision nodes (independently of how player 1 expects player 2 to mix), while the one-period payoff from choosing action 2 stays fixed. Therefore player 1's expected discounted payoff from choosing action 1 at time T + k increases by at most $\Delta w_{11}$ and by at least zero, while player 1's expected discounted payoff from action 2 also increases by at most $\Delta w_{11}$ and by at least zero. A similar analysis holds if $\Delta w_{11} < 0$. Since G(w) is the difference between player 1's expected payoffs from taking actions 1 or 2 at time T + k, we have $|\Delta G(w)/\Delta w_{11}| \leq 1$.

Next consider an incremental change $\alpha = \Delta w_{12} > 0$ where $w_{11}$ stays fixed. Then $a_{11}$ increases by $\alpha/2 > 0$, while $a_{12}$ decreases by $\alpha/2$. Thus the one-period payoff from choosing action 1 at any given decision node changes by at most ±α, while the one-period payoff from choosing action 2 stays fixed. It follows that player 1's expected discounted payoff from choosing action 1 at time T + k changes by at most ±α, and the same holds for choosing action 2 at time T + k. Thus $|\Delta G(w)/\Delta w_{12}| \leq 2$. Indeed, we can show that

$|\Delta G(w)/\Delta w_{ij}| \leq 2$ for every (i, j).   (30)

Now consider the ratios $\Delta\tilde F(w)/\Delta w_{ij}$. All of the coefficients in (29) have modulus one or less. It follows from this and (30) that

$|\Delta\tilde F(w)/\Delta w_{ij}| \leq 3$ for every (i, j).   (31)

We now bound $\Delta\tilde F(w)/\Delta w_{11}$ from below. From (29) and (30) we have

$\Delta\tilde F(w)/\Delta w_{11} \geq (1 - \delta_1^k)/2 - \delta_1^k|\Delta G(w)/\Delta w_{11}| \geq 1/2 - (5/2)\delta_1^k \geq 1/2 - (5/2)\delta^k.$   (32)

By choice of k, $\delta^k < \gamma$, and from (7) we know that γ < .02. Thus

$\Delta\tilde F(w)/\Delta w_{11} \geq .45.$   (33)

Let $Z_{-1}$ denote the set Z projected onto the last three coordinates. For each triple $(w_{12}, w_{21}, w_{22}) \in Z_{-1}$, there is at most one value $w_{11}$ such that $(w_{11}, w_{12}, w_{21}, w_{22}) \in Z$, for otherwise (33) would be violated. Thus we may define a function $\varphi: Z_{-1} \to \mathbb{R}$ such that $\varphi(w_{12}, w_{21}, w_{22})$ is the unique value $w_{11}$ satisfying $(w_{11}, w_{12}, w_{21}, w_{22}) \in Z$.

We claim that φ is uniformly continuous on $Z_{-1}$. Given a small α > 0, consider any triples $(w_{12}, w_{21}, w_{22}) \in Z_{-1}$ and $(w'_{12}, w'_{21}, w'_{22}) \in Z_{-1}$ such that

$|w'_{12} - w_{12}| + |w'_{21} - w_{21}| + |w'_{22} - w_{22}| \leq \alpha.$   (34)

Let $w_{11} = \varphi(w_{12}, w_{21}, w_{22})$ and $w'_{11} = \varphi(w'_{12}, w'_{21}, w'_{22})$. From (31) it follows that the change in $\tilde F(w)$ due to the changes in $w_{12}, w_{21}, w_{22}$ is at most $\pm(3|\Delta w_{12}| + 3|\Delta w_{21}| + 3|\Delta w_{22}|)$, which by assumption is at most ±9α. Since the total change in $\tilde F(w)$ equals zero ($\tilde F(w) = \tilde F(w + \Delta w) = 0$), the change due to the difference $w'_{11} - w_{11}$ is also at most ±9α. It follows from this and (33) that $|w'_{11} - w_{11}| \leq 9\alpha/.45 = 20\alpha$. Thus, for every small α > 0,

$|w'_{12} - w_{12}| + |w'_{21} - w_{21}| + |w'_{22} - w_{22}| \leq \alpha$ implies $|\varphi(w') - \varphi(w)| \leq 20\alpha$.   (35)

This proves that φ is uniformly continuous on $Z_{-1}$.

For each small r > 0, we can cover the set $Z_{-1}$ with a finite number of open spheres $S_1, \ldots, S_{k(r)}$ in $\mathbb{R}^3$, each of radius r, such that the total Lebesgue measure of these spheres in $\mathbb{R}^3$ satisfies $\mu(S_1) + \cdots + \mu(S_{k(r)}) \leq \mu(Z_{-1}) + r$. On each sphere $S_j$, φ varies by at most $2r \cdot 20 = 40r$. Hence we may find an open envelope in $\mathbb{R}^4$ of Lebesgue measure at most $40r(\mu(Z_{-1}) + r)$ that contains all the zeroes of $\tilde F$. Taking the limit as r → 0, we conclude that the set of zeroes of $\tilde F$ (and hence of F) has Lebesgue measure zero. This concludes the proof of Lemma 3.

The proof of the theorem is now completed as follows. Choose γ and k as in the proof of Lemma 2, and let $0 < \lambda, \varepsilon \leq \gamma$. Fix two arbitrary forecast functions $f_1, f_2$ and best-reply strategies $g_1, g_2$. Let (A, B) be a pair of payoff functions for which the players learn to predict with positive probability. Lemma 2 implies that there exists an initial history $h_{T-1}$ such that $A \in Z(h_{T-1})$, and Lemma 3 implies that $Z(h_{T-1})$ has Lebesgue measure zero in $\mathbb{R}^4$. Hence if (A, B) is a pair for which the players predict with positive probability, then A is contained in the countable union $\bigcup_{h_{T-1}} Z(h_{T-1})$, which also has Lebesgue measure zero. Since the ex ante distribution ν that determines (A, B) is absolutely continuous with respect to Lebesgue measure, it is ν-almost certain that A lies outside this union. In other words, almost surely at least one of the players almost never learns to predict his opponent's strategy. This completes the proof of the first statement of the theorem.

It remains to be shown that, with ν-probability one, play is asymptotically far from Nash. The first step is to show that all the Nash equilibria are mixed at every stage, provided λ is small enough.

Lemma 4. There exist ε* > 0 and γ* > 0 such that whenever $0 < \lambda \leq \gamma^*$, every Nash equilibrium $(g_1^*, g_2^*)$ puts probability at least 2ε* on each action in every time period.

Proof. In each time period t the expected one-shot payoffs satisfy

$|a_t + b_t| \leq \lambda.$   (36)

Let $\alpha_t, \beta_t$ denote the expected discounted payoffs of players 1 and 2 from time t on, using their respective discount rates. Note that these assessments are made in period 1 by estimating the probability of the various outcomes in period t under the strategies $g_1^*$ and $g_2^*$, and then computing the continuation payoffs from t on contingent on each of these outcomes.

Since $(g_1^*, g_2^*)$ is a Nash equilibrium, these expected continuation payoffs must be at least equal to the players' maximin values in every period. Hence

$\alpha_t, \beta_t \geq -\lambda$ for all $t \geq 1$.   (37)

Let $s_t = a_t + b_t$. As in the proof of Lemma 2, we can derive the analog of expression (21), which in the present case holds for all positive integers k. Letting k run to infinity we obtain the identity

$(1 - \delta_2)\alpha_1 + (1 - \delta_1)\beta_1 = (1 - \delta_1)\xi^* - (\delta_2 - \delta_1)\psi^*.$   (38)

Here ξ* is $(1 - \delta_2)$ times the $\delta_2$-discounted value of the series $\{s_1, s_2, s_3, \ldots\}$, and ψ* is $(1 - \delta_2)$ times the $\delta_2$-discounted value of the series $\{\alpha_2, \alpha_3, \alpha_4, \ldots\}$. By (36), every $|s_t| \leq \lambda$, so $(1 - \delta_1)\xi^* \leq \lambda$. By (37), every $\alpha_t \geq -\lambda$, so $-(\delta_2 - \delta_1)\psi^* \leq \lambda$. Finally, $(1 - \delta_1)\beta_1 \geq -\lambda$. Thus we have

$\alpha_1 \leq 3\lambda/(1 - \delta_2).$   (39)

A similar argument shows that $\beta_1 \leq 3\lambda/(1 - \delta_2)$. Combining this with (37), and recalling that $\lambda \leq \gamma^*$ and $\delta = \delta_2$, we obtain the inequality

$|\alpha_1|, |\beta_1| \leq 3\gamma^*/(1 - \delta).$   (40)

Thus we can make $\alpha_1$ and $\beta_1$ as small as we like by choosing γ* to be sufficiently small. Suppose now that one player (say player 1) puts probability less than 2ε* on one of his actions in period 1. Then player 2 can get a payoff of at least $(1 - 2\varepsilon^*)(1 - \lambda/2) - 2\varepsilon^*(1 + \lambda/2)$ in period 1, and at least -λ in each period thereafter. This deviation must not be profitable for player 2; that is,

$(1 - \delta_2)[(1 - 2\varepsilon^*)(1 - \lambda/2) - 2\varepsilon^*(1 + \lambda/2)] + \delta_2(-\lambda) \leq \beta_1 \leq 3\gamma^*/(1 - \delta_2).$   (41)

But this is false whenever $0 < \varepsilon^* \leq .05$ and $0 < \gamma^* \leq (1 - \delta)^2/15$. The same argument applies to every period t: given any outcome $h_{t-1}$ that is reached with positive probability, the expected continuation payoffs in all subsequent periods t, t + 1, ... are bounded below by $-\lambda \geq -\gamma^*$. As in (38)-(40), it follows that the expected continuation payoffs in period t are bounded above by $3\gamma^*/(1 - \delta)$. This means that the strategies must put probability at least 2ε* on each action at time t, provided that $0 < \varepsilon^* \leq .05$ and $0 < \gamma^* \leq (1 - \delta)^2/15$. This concludes the proof of Lemma 4.

Now let $g = (g_1, g_2)$ be the players' actual strategies, and let $B_\lambda^*$ be the set of pairs of payoff matrices (A, B) such that the subset of histories h for which (5) holds has positive measure under µ. We are going to show that the set $B_\lambda^*$ has Lebesgue measure zero in $\mathbb{R}^4$, which implies that it has ν-probability zero, for every choice of g.

Lemma 5. Fix a pair $(A, B) \in B_\lambda^*$ and let $0 < \varepsilon^* \leq .05$ and $0 < \lambda \leq \gamma^* \leq (1 - \delta)^2/15$. For each positive integer k, there exists a time T and a partial history $h_{T-1} = h_{T-1}(k, A, B)$, having positive probability under µ, such that each player puts probability at least ε* on every action in every period T, ..., T + k - 1.

Proof. Given $(A, B) \in B_\lambda^*$ and any positive integer k, we claim that there exists a partial history $h_{T-1}(\varepsilon^*, k, A, B)$ such that the µ-probability is at least $1 - (\varepsilon^*)^{2k}$ that every player is within ε* of a Nash equilibrium in each of the periods T, T + 1, ..., T + k - 1, conditional on reaching $h_{T-1}(\varepsilon^*, k, A, B)$ in period T - 1. If this were not the case, then at every time T the probability would be greater than $(\varepsilon^*)^{2k}$ that some player is more than ε* away from a Nash equilibrium sometime during the next k periods. But then the set of histories for which (5) holds has µ-measure zero, contrary to the hypothesis that $(A, B) \in B_\lambda^*$.

Now choose $h_{T-1} = h_{T-1}(\varepsilon^*, k, A, B)$ as above, and let $0 < \lambda \leq \gamma^* \leq (1 - \delta)^2/15$. Divide the continuation histories in period T + k - 1 into two classes: "good" if everyone was within ε* of Nash at each stage between T and T + k - 1, and "bad" otherwise. There is at least one set of choices at T that leads to a good continuation path, which means that both players are in fact within ε* of Nash in period T. By Lemma 4, every Nash equilibrium puts probability at least 2ε* on each action; hence the players' actual strategies put probability at least ε* on each action in period T. Using this fact we deduce that the players must be within ε* of Nash in period T + 1, and so forth. (A similar trick was used in the proof of Lemma 2.) This completes the proof of Lemma 5.

The proof of the theorem is now completed as follows. Let $0 < \lambda \leq \gamma^* \leq (1 - \delta)^2/15$ and choose k large enough that $\delta^k < 0.2$. Let $h_{T-1}(k, A, B)$ be the partial history guaranteed by Lemma 5. After reaching $h_{T-1}(k, A, B)$, each of the players chooses a mixed strategy


More information

Working Paper EQUILIBRIUM SELECTION IN COMMON-VALUE SECOND-PRICE AUCTIONS. Heng Liu

Working Paper EQUILIBRIUM SELECTION IN COMMON-VALUE SECOND-PRICE AUCTIONS. Heng Liu Working Paper EQUILIBRIUM SELECTION IN COMMON-VALUE SECOND-PRICE AUCTIONS Heng Liu This note considers equilibrium selection in common-value secondprice auctions with two bidders. We show that for each

More information

Equilibria in Games with Weak Payoff Externalities

Equilibria in Games with Weak Payoff Externalities NUPRI Working Paper 2016-03 Equilibria in Games with Weak Payoff Externalities Takuya Iimura, Toshimasa Maruta, and Takahiro Watanabe October, 2016 Nihon University Population Research Institute http://www.nihon-u.ac.jp/research/institute/population/nupri/en/publications.html

More information

We set up the basic model of two-sided, one-to-one matching

We set up the basic model of two-sided, one-to-one matching Econ 805 Advanced Micro Theory I Dan Quint Fall 2009 Lecture 18 To recap Tuesday: We set up the basic model of two-sided, one-to-one matching Two finite populations, call them Men and Women, who want to

More information

Correlated Equilibrium in Games with Incomplete Information

Correlated Equilibrium in Games with Incomplete Information Correlated Equilibrium in Games with Incomplete Information Dirk Bergemann and Stephen Morris Econometric Society Summer Meeting June 2012 Robust Predictions Agenda game theoretic predictions are very

More information

Other-Regarding Preferences: Theory and Evidence

Other-Regarding Preferences: Theory and Evidence Other-Regarding Preferences: Theory and Evidence June 9, 2009 GENERAL OUTLINE Economic Rationality is Individual Optimization and Group Equilibrium Narrow version: Restrictive Assumptions about Objective

More information

A Generic Bound on Cycles in Two-Player Games

A Generic Bound on Cycles in Two-Player Games A Generic Bound on Cycles in Two-Player Games David S. Ahn February 006 Abstract We provide a bound on the size of simultaneous best response cycles for generic finite two-player games. The bound shows

More information

Game Theory Lecture 10+11: Knowledge

Game Theory Lecture 10+11: Knowledge Game Theory Lecture 10+11: Knowledge Christoph Schottmüller University of Copenhagen November 13 and 20, 2014 1 / 36 Outline 1 (Common) Knowledge The hat game A model of knowledge Common knowledge Agree

More information

Theory of Auctions. Carlos Hurtado. Jun 23th, Department of Economics University of Illinois at Urbana-Champaign

Theory of Auctions. Carlos Hurtado. Jun 23th, Department of Economics University of Illinois at Urbana-Champaign Theory of Auctions Carlos Hurtado Department of Economics University of Illinois at Urbana-Champaign hrtdmrt2@illinois.edu Jun 23th, 2015 C. Hurtado (UIUC - Economics) Game Theory On the Agenda 1 Formalizing

More information

Perfect Conditional -Equilibria of Multi-Stage Games with Infinite Sets of Signals and Actions (Preliminary and Incomplete)

Perfect Conditional -Equilibria of Multi-Stage Games with Infinite Sets of Signals and Actions (Preliminary and Incomplete) Perfect Conditional -Equilibria of Multi-Stage Games with Infinite Sets of Signals and Actions (Preliminary and Incomplete) Roger B. Myerson and Philip J. Reny Department of Economics University of Chicago

More information

Epistemic Implementation and The Arbitrary-Belief Auction Jing Chen, Silvio Micali, and Rafael Pass

Epistemic Implementation and The Arbitrary-Belief Auction Jing Chen, Silvio Micali, and Rafael Pass Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2012-017 June 22, 2012 Epistemic Implementation and The Arbitrary-Belief Auction Jing Chen, Silvio Micali, and Rafael

More information

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden 1 Selecting Efficient Correlated Equilibria Through Distributed Learning Jason R. Marden Abstract A learning rule is completely uncoupled if each player s behavior is conditioned only on his own realized

More information

Learning Equilibrium as a Generalization of Learning to Optimize

Learning Equilibrium as a Generalization of Learning to Optimize Learning Equilibrium as a Generalization of Learning to Optimize Dov Monderer and Moshe Tennenholtz Faculty of Industrial Engineering and Management Technion Israel Institute of Technology Haifa 32000,

More information

REPEATED GAMES. Jörgen Weibull. April 13, 2010

REPEATED GAMES. Jörgen Weibull. April 13, 2010 REPEATED GAMES Jörgen Weibull April 13, 2010 Q1: Can repetition induce cooperation? Peace and war Oligopolistic collusion Cooperation in the tragedy of the commons Q2: Can a game be repeated? Game protocols

More information

Probability Models of Information Exchange on Networks Lecture 5

Probability Models of Information Exchange on Networks Lecture 5 Probability Models of Information Exchange on Networks Lecture 5 Elchanan Mossel (UC Berkeley) July 25, 2013 1 / 22 Informational Framework Each agent receives a private signal X i which depends on S.

More information

Tutorial on Mathematical Induction

Tutorial on Mathematical Induction Tutorial on Mathematical Induction Roy Overbeek VU University Amsterdam Department of Computer Science r.overbeek@student.vu.nl April 22, 2014 1 Dominoes: from case-by-case to induction Suppose that you

More information

Bayesian Persuasion Online Appendix

Bayesian Persuasion Online Appendix Bayesian Persuasion Online Appendix Emir Kamenica and Matthew Gentzkow University of Chicago June 2010 1 Persuasion mechanisms In this paper we study a particular game where Sender chooses a signal π whose

More information

Solving Extensive Form Games

Solving Extensive Form Games Chapter 8 Solving Extensive Form Games 8.1 The Extensive Form of a Game The extensive form of a game contains the following information: (1) the set of players (2) the order of moves (that is, who moves

More information

Theory Field Examination Game Theory (209A) Jan Question 1 (duopoly games with imperfect information)

Theory Field Examination Game Theory (209A) Jan Question 1 (duopoly games with imperfect information) Theory Field Examination Game Theory (209A) Jan 200 Good luck!!! Question (duopoly games with imperfect information) Consider a duopoly game in which the inverse demand function is linear where it is positive

More information

Microeconomics II Lecture 4: Incomplete Information Karl Wärneryd Stockholm School of Economics November 2016

Microeconomics II Lecture 4: Incomplete Information Karl Wärneryd Stockholm School of Economics November 2016 Microeconomics II Lecture 4: Incomplete Information Karl Wärneryd Stockholm School of Economics November 2016 1 Modelling incomplete information So far, we have studied games in which information was complete,

More information

NOTE. A 2 2 Game without the Fictitious Play Property

NOTE. A 2 2 Game without the Fictitious Play Property GAMES AND ECONOMIC BEHAVIOR 14, 144 148 1996 ARTICLE NO. 0045 NOTE A Game without the Fictitious Play Property Dov Monderer and Aner Sela Faculty of Industrial Engineering and Management, The Technion,

More information

On the Unique D1 Equilibrium in the Stackelberg Model with Asymmetric Information Janssen, M.C.W.; Maasland, E.

On the Unique D1 Equilibrium in the Stackelberg Model with Asymmetric Information Janssen, M.C.W.; Maasland, E. Tilburg University On the Unique D1 Equilibrium in the Stackelberg Model with Asymmetric Information Janssen, M.C.W.; Maasland, E. Publication date: 1997 Link to publication General rights Copyright and

More information

Political Economy of Institutions and Development: Problem Set 1. Due Date: Thursday, February 23, in class.

Political Economy of Institutions and Development: Problem Set 1. Due Date: Thursday, February 23, in class. Political Economy of Institutions and Development: 14.773 Problem Set 1 Due Date: Thursday, February 23, in class. Answer Questions 1-3. handed in. The other two questions are for practice and are not

More information

The Folk Theorem for Finitely Repeated Games with Mixed Strategies

The Folk Theorem for Finitely Repeated Games with Mixed Strategies The Folk Theorem for Finitely Repeated Games with Mixed Strategies Olivier Gossner February 1994 Revised Version Abstract This paper proves a Folk Theorem for finitely repeated games with mixed strategies.

More information

The No-Regret Framework for Online Learning

The No-Regret Framework for Online Learning The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,

More information

Entropic Selection of Nash Equilibrium

Entropic Selection of Nash Equilibrium Entropic Selection of Nash Equilibrium Zeynel Harun Alioğulları Mehmet Barlo February, 2012 Abstract This study argues that Nash equilibria with less variations in players best responses are more appealing.

More information

Notes on Blackwell s Comparison of Experiments Tilman Börgers, June 29, 2009

Notes on Blackwell s Comparison of Experiments Tilman Börgers, June 29, 2009 Notes on Blackwell s Comparison of Experiments Tilman Börgers, June 29, 2009 These notes are based on Chapter 12 of David Blackwell and M. A.Girshick, Theory of Games and Statistical Decisions, John Wiley

More information

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 CS 70 Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 Introduction One of the key properties of coin flips is independence: if you flip a fair coin ten times and get ten

More information

Mostly calibrated. Yossi Feinberg Nicolas S. Lambert

Mostly calibrated. Yossi Feinberg Nicolas S. Lambert Int J Game Theory (2015) 44:153 163 DOI 10.1007/s00182-014-0423-0 Mostly calibrated Yossi Feinberg Nicolas S. Lambert Accepted: 27 March 2014 / Published online: 16 April 2014 Springer-Verlag Berlin Heidelberg

More information

Disappearing Private Reputations in Long-Run Relationships

Disappearing Private Reputations in Long-Run Relationships Disappearing Private Reputations in Long-Run Relationships Martin W. Cripps John M. Olin School of Business Washington University in St. Louis St. Louis, MO 63130-4899 cripps@olin.wustl.edu George J. Mailath

More information

Bayes Correlated Equilibrium and Comparing Information Structures

Bayes Correlated Equilibrium and Comparing Information Structures Bayes Correlated Equilibrium and Comparing Information Structures Dirk Bergemann and Stephen Morris Spring 2013: 521 B Introduction game theoretic predictions are very sensitive to "information structure"

More information

Deceptive Advertising with Rational Buyers

Deceptive Advertising with Rational Buyers Deceptive Advertising with Rational Buyers September 6, 016 ONLINE APPENDIX In this Appendix we present in full additional results and extensions which are only mentioned in the paper. In the exposition

More information

Algorithmic Game Theory and Applications. Lecture 4: 2-player zero-sum games, and the Minimax Theorem

Algorithmic Game Theory and Applications. Lecture 4: 2-player zero-sum games, and the Minimax Theorem Algorithmic Game Theory and Applications Lecture 4: 2-player zero-sum games, and the Minimax Theorem Kousha Etessami 2-person zero-sum games A finite 2-person zero-sum (2p-zs) strategic game Γ, is a strategic

More information

Uncertainty. Michael Peters December 27, 2013

Uncertainty. Michael Peters December 27, 2013 Uncertainty Michael Peters December 27, 20 Lotteries In many problems in economics, people are forced to make decisions without knowing exactly what the consequences will be. For example, when you buy

More information

The Distribution of Optimal Strategies in Symmetric Zero-sum Games

The Distribution of Optimal Strategies in Symmetric Zero-sum Games The Distribution of Optimal Strategies in Symmetric Zero-sum Games Florian Brandl Technische Universität München brandlfl@in.tum.de Given a skew-symmetric matrix, the corresponding two-player symmetric

More information

Learning by (limited) forward looking players

Learning by (limited) forward looking players Learning by (limited) forward looking players Friederike Mengel Maastricht University April 2009 Abstract We present a model of adaptive economic agents who are k periods forward looking. Agents in our

More information

Belief-based Learning

Belief-based Learning Belief-based Learning Algorithmic Game Theory Marcello Restelli Lecture Outline Introdutcion to multi-agent learning Belief-based learning Cournot adjustment Fictitious play Bayesian learning Equilibrium

More information

Repeated Downsian Electoral Competition

Repeated Downsian Electoral Competition Repeated Downsian Electoral Competition John Duggan Department of Political Science and Department of Economics University of Rochester Mark Fey Department of Political Science University of Rochester

More information

A Necessary and Sufficient Condition for Convergence of Statistical to Strategic Equilibria of Market Games

A Necessary and Sufficient Condition for Convergence of Statistical to Strategic Equilibria of Market Games A Necessary and Sufficient Condition for Convergence of Statistical to Strategic Equilibria of Market Games Dimitrios P. Tsomocos Dimitris Voliotis Abstract In our model, we treat a market game where traders

More information

1 Basic Game Modelling

1 Basic Game Modelling Max-Planck-Institut für Informatik, Winter 2017 Advanced Topic Course Algorithmic Game Theory, Mechanism Design & Computational Economics Lecturer: CHEUNG, Yun Kuen (Marco) Lecture 1: Basic Game Modelling,

More information

Tijmen Daniëls Universiteit van Amsterdam. Abstract

Tijmen Daniëls Universiteit van Amsterdam. Abstract Pure strategy dominance with quasiconcave utility functions Tijmen Daniëls Universiteit van Amsterdam Abstract By a result of Pearce (1984), in a finite strategic form game, the set of a player's serially

More information

Checkpoint Questions Due Monday, October 1 at 2:15 PM Remaining Questions Due Friday, October 5 at 2:15 PM

Checkpoint Questions Due Monday, October 1 at 2:15 PM Remaining Questions Due Friday, October 5 at 2:15 PM CS103 Handout 03 Fall 2012 September 28, 2012 Problem Set 1 This first problem set is designed to help you gain a familiarity with set theory and basic proof techniques. By the time you're done, you should

More information

Puri cation 1. Stephen Morris Princeton University. July Economics.

Puri cation 1. Stephen Morris Princeton University. July Economics. Puri cation 1 Stephen Morris Princeton University July 2006 1 This survey was prepared as an entry for the second edition of the New Palgrave Dictionary of Economics. In a mixed strategy equilibrium of

More information

6.207/14.15: Networks Lecture 16: Cooperation and Trust in Networks

6.207/14.15: Networks Lecture 16: Cooperation and Trust in Networks 6.207/14.15: Networks Lecture 16: Cooperation and Trust in Networks Daron Acemoglu and Asu Ozdaglar MIT November 4, 2009 1 Introduction Outline The role of networks in cooperation A model of social norms

More information

6.891 Games, Decision, and Computation February 5, Lecture 2

6.891 Games, Decision, and Computation February 5, Lecture 2 6.891 Games, Decision, and Computation February 5, 2015 Lecture 2 Lecturer: Constantinos Daskalakis Scribe: Constantinos Daskalakis We formally define games and the solution concepts overviewed in Lecture

More information

Mechanism Design: Basic Concepts

Mechanism Design: Basic Concepts Advanced Microeconomic Theory: Economics 521b Spring 2011 Juuso Välimäki Mechanism Design: Basic Concepts The setup is similar to that of a Bayesian game. The ingredients are: 1. Set of players, i {1,

More information

Small Sample of Related Literature

Small Sample of Related Literature UCLA IPAM July 2015 Learning in (infinitely) repeated games with n players. Prediction and stability in one-shot large (many players) games. Prediction and stability in large repeated games (big games).

More information

Wars of Attrition with Budget Constraints

Wars of Attrition with Budget Constraints Wars of Attrition with Budget Constraints Gagan Ghosh Bingchao Huangfu Heng Liu October 19, 2017 (PRELIMINARY AND INCOMPLETE: COMMENTS WELCOME) Abstract We study wars of attrition between two bidders who

More information

Reputation Building under Uncertain Monitoring

Reputation Building under Uncertain Monitoring Reputation Building under Uncertain Monitoring Joyee Deb Yuhta Ishii July 26, 2016 Abstract We study a canonical model of reputation between a long-run player and a sequence of short-run opponents, in

More information

17. Convergence of Random Variables

17. Convergence of Random Variables 7. Convergence of Random Variables In elementary mathematics courses (such as Calculus) one speaks of the convergence of functions: f n : R R, then lim f n = f if lim f n (x) = f(x) for all x in R. This

More information

Preliminary Results on Social Learning with Partial Observations

Preliminary Results on Social Learning with Partial Observations Preliminary Results on Social Learning with Partial Observations Ilan Lobel, Daron Acemoglu, Munther Dahleh and Asuman Ozdaglar ABSTRACT We study a model of social learning with partial observations from

More information

Working Paper EQUILIBRIUM SELECTION IN COMMON-VALUE SECOND-PRICE AUCTIONS. Heng Liu

Working Paper EQUILIBRIUM SELECTION IN COMMON-VALUE SECOND-PRICE AUCTIONS. Heng Liu Working Paper EQUILIBRIUM SELECTION IN COMMON-VALUE SECOND-PRICE AUCTIONS Heng Liu This paper considers the problem of equilibrium selection in a commonvalue second-price auction with two bidders. We show

More information

Development Economics

Development Economics Development Economics Slides 4 Debraj Ray Columbia, Fall 2013 Transitions between one equilibrium and another Higher-order beliefs Lags in adjustment An empirical example Questions of Transition Problem:

More information

CALIFORNIA INSTITUTE OF TECHNOLOGY

CALIFORNIA INSTITUTE OF TECHNOLOGY DIVISION OF THE HUMANITIES AND SOCIAL SCIENCES CALIFORNIA INSTITUTE OF TECHNOLOGY PASADENA, CALIFORNIA 91125 YOU WON T HARM ME IF YOU FOOL ME Federico Echenique Eran Shmaya I A I N S T I T U T E O F T

More information

P (E) = P (A 1 )P (A 2 )... P (A n ).

P (E) = P (A 1 )P (A 2 )... P (A n ). Lecture 9: Conditional probability II: breaking complex events into smaller events, methods to solve probability problems, Bayes rule, law of total probability, Bayes theorem Discrete Structures II (Summer

More information

35 Chapter CHAPTER 4: Mathematical Proof

35 Chapter CHAPTER 4: Mathematical Proof 35 Chapter 4 35 CHAPTER 4: Mathematical Proof Faith is different from proof; the one is human, the other is a gift of God. Justus ex fide vivit. It is this faith that God Himself puts into the heart. 21

More information

Stochastic Processes

Stochastic Processes qmc082.tex. Version of 30 September 2010. Lecture Notes on Quantum Mechanics No. 8 R. B. Griffiths References: Stochastic Processes CQT = R. B. Griffiths, Consistent Quantum Theory (Cambridge, 2002) DeGroot

More information

The Revenue Equivalence Theorem 1

The Revenue Equivalence Theorem 1 John Nachbar Washington University May 2, 2017 The Revenue Equivalence Theorem 1 1 Introduction. The Revenue Equivalence Theorem gives conditions under which some very different auctions generate the same

More information

Realization Plans for Extensive Form Games without Perfect Recall

Realization Plans for Extensive Form Games without Perfect Recall Realization Plans for Extensive Form Games without Perfect Recall Richard E. Stearns Department of Computer Science University at Albany - SUNY Albany, NY 12222 April 13, 2015 Abstract Given a game in

More information

Best Guaranteed Result Principle and Decision Making in Operations with Stochastic Factors and Uncertainty

Best Guaranteed Result Principle and Decision Making in Operations with Stochastic Factors and Uncertainty Stochastics and uncertainty underlie all the processes of the Universe. N.N.Moiseev Best Guaranteed Result Principle and Decision Making in Operations with Stochastic Factors and Uncertainty by Iouldouz

More information

Msc Micro I exam. Lecturer: Todd Kaplan.

Msc Micro I exam. Lecturer: Todd Kaplan. Msc Micro I 204-205 exam. Lecturer: Todd Kaplan. Please answer exactly 5 questions. Answer one question from each of sections: A, B, C, and D and answer one additional question from any of the sections

More information

Economics 201B Economic Theory (Spring 2017) Bargaining. Topics: the axiomatic approach (OR 15) and the strategic approach (OR 7).

Economics 201B Economic Theory (Spring 2017) Bargaining. Topics: the axiomatic approach (OR 15) and the strategic approach (OR 7). Economics 201B Economic Theory (Spring 2017) Bargaining Topics: the axiomatic approach (OR 15) and the strategic approach (OR 7). The axiomatic approach (OR 15) Nash s (1950) work is the starting point

More information

Supplementary appendix to the paper Hierarchical cheap talk Not for publication

Supplementary appendix to the paper Hierarchical cheap talk Not for publication Supplementary appendix to the paper Hierarchical cheap talk Not for publication Attila Ambrus, Eduardo M. Azevedo, and Yuichiro Kamada December 3, 011 1 Monotonicity of the set of pure-strategy equilibria

More information

A New and Robust Subgame Perfect Equilibrium in a model of Triadic Power Relations *

A New and Robust Subgame Perfect Equilibrium in a model of Triadic Power Relations * A New and Robust Subgame Perfect Equilibrium in a model of Triadic Power Relations * by Magnus Hatlebakk ** Department of Economics, University of Bergen Abstract: We present a new subgame perfect equilibrium

More information

Game Theory: Spring 2017

Game Theory: Spring 2017 Game Theory: Spring 2017 Ulle Endriss Institute for Logic, Language and Computation University of Amsterdam Ulle Endriss 1 Plan for Today So far, our players didn t know the strategies of the others, but

More information

האוניברסיטה העברית בירושלים

האוניברסיטה העברית בירושלים האוניברסיטה העברית בירושלים THE HEBREW UNIVERSITY OF JERUSALEM TOWARDS A CHARACTERIZATION OF RATIONAL EXPECTATIONS by ITAI ARIELI Discussion Paper # 475 February 2008 מרכז לחקר הרציונליות CENTER FOR THE

More information

NTU IO (I) : Classnote 03 Meng-Yu Liang March, 2009

NTU IO (I) : Classnote 03 Meng-Yu Liang March, 2009 NTU IO (I) : Classnote 03 Meng-Yu Liang March, 2009 Kohlberg and Mertens (Econometrica 1986) We will use the term (game) tree for the extensive form of a game with perfect recall (i.e., where every player

More information

The Index of Nash Equilibria

The Index of Nash Equilibria Equilibria in Games, Santiago, Chile January 10, 2017 Finite Normal-Form Games We consider in these lectures, the set of finite games with fixed strategy sets and parametrized by the payoff functions.

More information

INFORMATIONAL ROBUSTNESS AND SOLUTION CONCEPTS. Dirk Bergemann and Stephen Morris. December 2014 COWLES FOUNDATION DISCUSSION PAPER NO.

INFORMATIONAL ROBUSTNESS AND SOLUTION CONCEPTS. Dirk Bergemann and Stephen Morris. December 2014 COWLES FOUNDATION DISCUSSION PAPER NO. INFORMATIONAL ROBUSTNESS AND SOLUTION CONCEPTS By Dirk Bergemann and Stephen Morris December 2014 COWLES FOUNDATION DISCUSSION PAPER NO. 1973 COWLES FOUNDATION FOR RESEARCH IN ECONOMICS YALE UNIVERSITY

More information

Ex Post Cheap Talk : Value of Information and Value of Signals

Ex Post Cheap Talk : Value of Information and Value of Signals Ex Post Cheap Talk : Value of Information and Value of Signals Liping Tang Carnegie Mellon University, Pittsburgh PA 15213, USA Abstract. Crawford and Sobel s Cheap Talk model [1] describes an information

More information

Game Theory Fall 2003

Game Theory Fall 2003 Game Theory Fall 2003 Problem Set 1 [1] In this problem (see FT Ex. 1.1) you are asked to play with arbitrary 2 2 games just to get used to the idea of equilibrium computation. Specifically, consider the

More information

SF2972 Game Theory Written Exam with Solutions June 10, 2011

SF2972 Game Theory Written Exam with Solutions June 10, 2011 SF97 Game Theory Written Exam with Solutions June 10, 011 Part A Classical Game Theory Jörgen Weibull and Mark Voorneveld 1. Finite normal-form games. (a) What are N, S and u in the definition of a finite

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

Effi ciency in Repeated Games with Local Monitoring

Effi ciency in Repeated Games with Local Monitoring Effi ciency in Repeated Games with Local Monitoring Francesco Nava and Michele Piccione London School of Economics April 2013 Nava & Piccione (LSE) Local Monitoring April 2013 1 / 68 General Motivation

More information

Slope Fields: Graphing Solutions Without the Solutions

Slope Fields: Graphing Solutions Without the Solutions 8 Slope Fields: Graphing Solutions Without the Solutions Up to now, our efforts have been directed mainly towards finding formulas or equations describing solutions to given differential equations. Then,

More information

Notes on Coursera s Game Theory

Notes on Coursera s Game Theory Notes on Coursera s Game Theory Manoel Horta Ribeiro Week 01: Introduction and Overview Game theory is about self interested agents interacting within a specific set of rules. Self-Interested Agents have

More information