Evolution of the Theory of Mind
Daniel Monte, São Paulo School of Economics (EESP-FGV)
Nikolaus Robalino, Simon Fraser University
Arthur Robson, Simon Fraser University
Conference: Biological Basis of Economic Preferences and Behavior, Becker Friedman Institute, University of Chicago, May 5, 2012
The Theory of Mind (TOM) refers to the ability to ascribe agency to other individuals and to oneself, and, more particularly, to ascribe beliefs and desires to one and all. This ability is manifest in non-autistic humans beyond infancy, but is less obvious in other species. A key early experiment is the Sally-Anne test (Baron-Cohen, Leslie, and Frith (1985), for example).
Onishi and Baillargeon (2005) find that even 15-month-olds exhibit behavior consistent with TOM in a non-verbal task, where violation of expectation is inferred from increased length of gaze. Would this work on nonhuman species too?
Brodmann area 10, an anterior prefrontal region implicated in theory-of-mind tasks.
Such an ability is crucial to game theory. It is usually necessary to understand an opponent's payoffs, to put oneself in his shoes, in order to predict his behavior and thereby choose an optimal strategy oneself.
This ability might be seen as an aspect of Machiavellian Intelligence: the hypothesis that our intelligence evolved under pressure to outsmart our conspecifics (see, for example, Byrne and Whiten (1998)). This is often contrasted with the Ecological Intelligence Hypothesis, under which the nonhuman environment provided the impetus (Robson and Kaplan, AER (2003), for example).
The present approach substantially extends and generalizes Robson, "Why Would Nature Give Individuals Utility Functions?" (JPE, 2001), which considered the evolutionary rationale for own utility in a decision-theoretic framework. In contrast, we consider a rationale for knowing the utility functions of others in a strategic framework. In both cases, however, the need to address novelty is the evolutionary impetus. The approach here contrasts with that of a small literature represented by Stahl (GEB, 1993), who considers the evolutionary advantage of greater strategic sophistication (see also Crawford and Iriberri (2007) and Mohlin (2012), for example). Stahl argues that it may be better to be (lucky and) dumb than to be smart. We obtain a clearer advantage to being smart, mainly because we consider a game with outcomes that are randomly selected from a growing outcome set, rather than a particular fixed game.
Here we investigate the evolutionary basis of an ability to acquire the preferences of an opponent. We contrast such sophisticated agents with naive agents who adapt to each game, as in evolutionary game theory and as in reinforcement learning in psychology.
Consider an anecdote about vervet monkeys from Cheney and Seyfarth (1990). When a male vervet sought to join Kitui's group, where Kitui was bottom-ranked, Kitui might make the leopard warning cry. The interloper would then stay in his tree, his plans thwarted. A TOM Kitui would have understood the effect of such a cry on the others and been deliberately deceptive. A naive Kitui, in contrast, would have had no model of other vervets' preferences and beliefs; perhaps he once inadvertently made the leopard warning in such a circumstance and it worked. (That the latter was closer to the truth is suggested by Kitui occasionally walking on the ground towards the other male, alarm calling all the while.)
Here we model the contest between TOMers and naive players. We consider games of perfect information, the simplest of which has two stages. If player 1 is naive, she must see all games before learning to play appropriately; if she is a TOMer, she need merely see player 2 confronted with all possible pairs of outcomes. The edge for the TOMers then derives from there being many more games than pairs of outcomes. We demonstrate this edge by considering an environment that becomes richer as time passes. If the environment becomes more complex rapidly, neither type is able to keep up; if it becomes more complex only slowly, both types can keep up. In an intermediate range, however, the TOMers learn essentially everything, the naive types essentially nothing.
Two-Stage Model. There are two large populations: player 1s ("she") and player 2s ("he"). At t = 1, 2, ..., players are randomly paired to play a simple two-stage game of perfect information: a game tree with exactly two moves at each non-terminal node. The games vary on account of the outcomes assigned to the terminal nodes. Each player has a fixed strict preference ordering over the countably infinite set X.

[Figure: the two-stage game tree. Player 1 chooses L or R; each choice leads to a node at which player 2 chooses L or R; the four terminal nodes carry outcomes x1, x2, x3, x4.]
The two-stage game at t is Γ_t. At each t, Γ_t is completed by a random draw of four outcomes from the finite set X_t ⊆ X. If any two of the outcomes are identical, the sophisticated players know that the player in question is indifferent between them; if the outcomes are different, they induce a strict preference for the player in question, but this strict preference is not known to other players. Naive players adapt to each game; theory-of-mind (TOM) players construct the opponent's preferences.
Each history of outcomes and choices is public information. If a mixed strategy is used, this is observed, although only at the nodes that are reached. We consider only symmetric strategies. There are no feedback effects of individuals' actions on their opponents, and so there is no incentive to play non-myopically. Player 2s have no interest in player 1s' payoffs. Further, the player 1s react only to the distribution of player 2 choices. Since any particular player 2 has no effect on this distribution, any particular player 2 is myopic. The outcome sets increase over time, capturing growing complexity. There is a sequence t_1, ..., t_k, ..., where t_k is the arrival date of the k-th new object. The initial set of objects is X_0, with size n. Thus, from t_k to t_{k+1} − 1, the number of objects is n + k. We assume t_k = k^α, for α ≥ 0.
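The arrival process above can be sketched in a few lines. This is our illustrative reading of the slides, not code from the paper; the names `outcome_set_size` and `n0` are our own.

```python
def outcome_set_size(t, n0, alpha):
    """Number of objects at date t: the initial n0 plus the number of
    arrival dates t_k = k**alpha that have occurred by t (alpha > 0)."""
    k = 0
    while (k + 1) ** alpha <= t:
        k += 1
    return n0 + k

# Larger alpha means arrivals slow down: |X_t| grows roughly like n0 + t**(1/alpha).
print(outcome_set_size(100, n0=4, alpha=2.0))   # 4 + 10 = 14
print(outcome_set_size(100, n0=4, alpha=4.0))   # 4 + 3 = 7
```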
A naive player adapts to each game Γ_t. For simplicity, we assume that if the game Γ_t is novel, the naive player plays inappropriately at t, say by randomizing 50-50 at each decision node. However, the next time this same game is encountered, the naive player makes the appropriate choices. That is, the adaptive learning here is as fast as it possibly could be. Even with this advantage, however, the naive types will be outdone by the TOM types. In contrast, a player has a theory of mind if she knows that her opponent will make binary decisions that are consistent with some initially unknown preference ordering. She does not avail herself of the transitivity of her opponent's preferences. Our comparison of the naive type with the TOM type is then a comparison of the speed of the associated learning processes.
We now characterize the maximum amount of knowledge in this environment for naive and TOM players. For naive player 1s or 2s at date t, the maximum amount of knowledge is simply the number of games, G_t, say. We have G_t = |X_t|^4. If the number of distinct games that have been played before date t is K^N_t, then the fraction of these games known to any naive player is L^N_t = K^N_t / G_t.
For the TOM players, consider the number of binary choices for a particular player. When that particular player makes some binary choice, it becomes common knowledge to all TOM players. For both players, the number of outcome pairs is Q_t = C(|X_t|, 2). If K^2_t is the number of player 2's binary choices that have been revealed to TOM player 1s, then the fraction of these that have been revealed is L^2_t = K^2_t / Q_t.
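The gap between the two knowledge bounds is easy to quantify. A minimal sketch, reading the slides' counts as G_t = |X_t|^4 and Q_t = C(|X_t|, 2); the function names are ours:

```python
from math import comb

def num_games(n):
    """G_t: games with four terminal nodes, each assigned any of n outcomes."""
    return n ** 4

def num_pairs(n):
    """Q_t: unordered pairs of outcomes, i.e. binary choices to be revealed."""
    return comb(n, 2)

# With 20 outcomes, a naive player faces 160000 possible games, while a TOM
# player needs to see player 2's choice over only 190 pairs.
print(num_games(20), num_pairs(20))
```

This quartic-versus-quadratic growth is what drives the TOM player's edge.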
We now have the main result for the two-stage game case.

Theorem 1 (All the results here concern player 1.)
i) If α ∈ [0, 2), then L^2_t → 0 and L^N_t → 0 as t → ∞, in probability. That is, both the sophisticated and the naive type are overwhelmed by the rapid rate of arrival of new outcomes.
ii) If α ∈ (4, ∞), then L^2_t → 1 and L^N_t → 1 as t → ∞, in probability. That is, the rate of arrival of new outcomes is slow enough that both types are able to learn essentially everything.
iii) Finally, if α ∈ (2, 4), then L^2_t → 1 but L^N_t → 0 as t → ∞, in probability. That is, for this intermediate range of arrival rates, the TOM type learns essentially everything, while the naive type learns essentially nothing.
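The three regimes can be illustrated by a crude Monte Carlo sketch. This is our illustration, not the paper's proof: we simplify by recording, each period, the game played (for the naive player) and one revealed player-2 pairwise choice (for the TOM player); all parameter values are illustrative assumptions.

```python
import random
from math import comb

def simulate(alpha, T, n0=4, seed=0):
    """Return (L_N, L_2) after T periods: the fraction of current games a
    naive player has seen, and the fraction of outcome pairs over which
    player 2's choice has been revealed, with arrivals at t_k = k**alpha."""
    rng = random.Random(seed)
    seen_games, seen_pairs = set(), set()
    k = 0
    for t in range(1, T + 1):
        while (k + 1) ** alpha <= t:   # new outcomes arrive
            k += 1
        n = n0 + k
        seen_games.add(tuple(rng.randrange(n) for _ in range(4)))
        # the reached player-2 node reveals his choice over one outcome pair
        seen_pairs.add(tuple(sorted(rng.sample(range(n), 2))))
    return len(seen_games) / n ** 4, len(seen_pairs) / comb(n, 2)

for alpha in (1.0, 3.0, 5.0):
    L_N, L_2 = simulate(alpha, T=20000)
    print(f"alpha={alpha}: L_N={L_N:.3f}  L_2={L_2:.3f}")
```

For α = 3, in the intermediate range, L_2 comes close to one while L_N stays near zero, in line with part iii) of the theorem.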
Except possibly for the values α = 2 and α = 4, the results are dramatic: either everything is learnt in the limit or nothing is. Indeed, the range where nothing is learnt is inescapable, in that the arrival rate of novelty there outstrips the maximum rate at which learning can occur. So the real contribution is to show the less obvious result that full learning occurs essentially whenever it is even possible that it could. In terms of the contest between the two types, the existence of an interval over which the TOM type learns everything and the naive type learns nothing means that we can finesse the issue of considering payoffs explicitly: whatever these payoffs might be, it is clear that the TOM type outdoes the naive type in this intermediate range.
Revealed Preference. Suppose that x ≻_2 y and w ≻_2 z for some distinct x, y, z, w ∈ X. Consider the following game.

[Figure: the two-stage game tree with terminal outcomes x, y, z, w. Player 1's L leads to player 2's choice between x and y; player 1's R leads to player 2's choice between z and w.]

Then, if player 1 knows these aspects of player 2's preferences, she will choose L and get x if x ≻_1 w, but R and get w if w ≻_1 x. Conversely, if player 1 chooses appropriately in any game like this, she has revealed that she has a correct representation of player 2's preferences.
There is no simpler way to ensure that player 1 always chooses correctly in all two-stage games, for all preferences ≻_1.
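The bookkeeping a TOM player 1 needs is just a map from observed outcome pairs to player 2's choices. A minimal sketch, in which all names (`TOMPlayer`, `observe`, `best_move`) are our own illustrative inventions:

```python
class TOMPlayer:
    """Player 1 with a theory of mind: records player 2's revealed binary
    choices and best-responds using her own preference ranking."""

    def __init__(self, my_pref):
        self.known = {}   # frozenset({x, y}) -> outcome player 2 chose
        self.rank = {x: i for i, x in enumerate(my_pref)}  # best first

    def observe(self, pair, chosen):
        """Record that player 2, facing the unordered pair, picked `chosen`."""
        self.known[frozenset(pair)] = chosen

    def best_move(self, left_pair, right_pair):
        """Choose L or R given player 2's known choices at both subgames;
        return None if either of 2's choices is still unrevealed."""
        a = self.known.get(frozenset(left_pair))
        b = self.known.get(frozenset(right_pair))
        if a is None or b is None:
            return None
        return "L" if self.rank[a] < self.rank[b] else "R"

p1 = TOMPlayer(my_pref=["w", "x", "y", "z"])   # player 1 ranks w highest
p1.observe(("x", "y"), "x")                    # x ≻_2 y revealed
p1.observe(("w", "z"), "w")                    # w ≻_2 z revealed
print(p1.best_move(("x", "y"), ("w", "z")))    # "R"
```

Here w ≻_1 x, so once player 2's two choices have been revealed, player 1 correctly chooses R to obtain w, exactly as in the revealed-preference argument above.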
Three (or More)-Stage Model. There are three equally large populations: player 1s, player 2s, and player 3s. At t = 1, 2, ..., players are matched randomly in triples to play a three-stage game of perfect information with exactly two moves at each non-terminal node. Each player has a fixed strict preference ordering over a countably infinite set of outcomes X. The three-stage game at t is Γ_t, which is now completed by a random draw of a set of eight outcomes from the finite set X_t ⊆ X.
[Figure: the three-stage game tree. Player 1 moves first, then player 2, then player 3, with two moves (L or R) at each decision node and eight terminal outcomes.]

What if the naive type of player 2, for example, were less naive, keying not on the entire three-stage game, but on the subgame deriving from each choice by player 1? A somewhat weaker version of the argument goes through.
The type of the last player, now player 3, is again irrelevant, since player 3 uses only his own preferences. TOM player 2s construct a simple model of player 3's preferences. TOM player 1s construct simple models of the other two players' preferences, and the player 1s also need to know that the player 2s know player 3's preferences.
Individual players again have no incentive to mislead opponents about their own preferences. In the three-stage case, this observation has force only for the player 2s and 3s; the player 1s cannot advantageously mislead the player 2s or 3s because the player 2s and 3s do not consider player 1's preferences. The player 1s and 2s react only to the distribution of choices made by the player 3s. Since any particular player 3 has no effect on this distribution, and hence cannot affect the play of the player 1s or 2s, any such player 3 must behave myopically. Similarly, the player 1s react only to the distribution of choices made by the player 2s, and so there is no incentive for the player 2s to distort their choices in order to influence the player 1s. We again focus on outcome sets that increase over time, described precisely as before.
As before, a naive player adapts to each game Γ_t as a distinct circumstance, but takes only two trials to play appropriately. Again, in contrast, a player with a theory of mind knows that her opponent will make decisions that are consistent with some preference ordering, and she has the capacity to learn this ordering.
For naive player 1s, 2s, or 3s at date t, the maximum amount of knowledge is again the number of games, G_t, say. Now G_t = |X_t|^8. If the number of distinct games that have been played before date t is K^N_t, the fraction of games known to any naive player is L^N_t = K^N_t / G_t.
For the players with a theory of mind, it is convenient to consider the number of binary choices that can be made by a particular player, j, say. When player j makes some binary choice, it becomes common knowledge to all TOM players. For all players, the number of outcome pairs is, as before, Q_t = C(|X_t|, 2). If K^j_t is the number of player j's binary choices, for j = 2, 3, that have been revealed to all TOM players as common knowledge, then the corresponding fraction is L^j_t = K^j_t / Q_t.
The following is then the main result for the three-stage game case. We first show that player 2 derives an advantage from TOM over naivete, essentially as in the two-stage game. Given that player 2 is of TOM type, we then show that player 1 derives an advantage from TOM over naivete.

Theorem 2 A) Suppose that player 1 plays in an arbitrary fashion. Then we have the following results for player 2.
i) If α ∈ [0, 2), then L^3_t → 0 and L^N_t → 0 as t → ∞, in probability. That is, both the sophisticated and the naive type of player 2 are overwhelmed by the rapid rate of arrival of new outcomes.
ii) If α ∈ (8, ∞), then L^3_t → 1 and L^N_t → 1 as t → ∞, in probability. That is, the rate of arrival of new outcomes is slow enough that both types are able to learn essentially everything.
iii) Finally, if α ∈ (2, 8), then L^3_t → 1 but L^N_t → 0 as t → ∞, in probability. That is, for this intermediate range of arrival rates, the TOM type learns essentially everything, while the naive type learns essentially nothing.
B) Suppose that player 2 is TOM. Then we have the following results for player 1.
i) If α ∈ [0, 2), then L^2_t → 0, L^3_t → 0, and L^N_t → 0 as t → ∞, in probability. That is, both the sophisticated and the naive type are overwhelmed by the rapid rate of arrival of new outcomes.
ii) If α ∈ (8, ∞), then L^2_t → 1, L^3_t → 1, and L^N_t → 1 as t → ∞, in probability. That is, the rate of arrival of new outcomes is slow enough that both types are able to learn essentially everything.
iii) Finally, if α ∈ (2, 8), then L^2_t → 1 and L^3_t → 1, but L^N_t → 0 as t → ∞, in probability. That is, for this intermediate range of arrival rates, the TOM type learns essentially everything, while the naive type learns essentially nothing.
Again, the results are dramatic: except possibly at two points, either everything is learnt in the limit or nothing is. Again, when nothing is learnt, it is because it is simply mechanically impossible to keep up with the rate of novelty, so the key contribution of this theorem is to show that everything is learnt essentially whenever this is not mechanically ruled out. As in the two-stage case, there is then an interval over which the TOM type learns everything and the naive type learns nothing. This interval is larger in the three-stage case because the naive types of player 1 or 2 face a larger set of possible games, now with eight outcomes drawn from the outcome set, and hence can keep up with only a slower rate of novelty. On the other hand, for the result in A), the TOM type of player 2 still needs to know only how player 3 would make each possible binary choice.
For B), the situation for player 1 is more complex, because player 1 not only needs to know both player 2's and player 3's preferences, but also needs to know that player 2 knows player 3's preferences. This would seem likely to shift the transition point from no learning to full learning for TOM player 1s. However, the following observations apply. As long as α > 2, player 1 learns player 3's preferences completely in the limit. In addition, at the same time that player 3's choices reveal information about his preferences to player 1, they reveal the same information to player 2, and player 1 knows this. But then, given only that α > 2, player 1 can also completely learn player 2's preferences. For A), the large numbers of player 1s, 2s, and 3s ensure that the player 2s should be sequentially rational. It might somehow be to the player 2s' overall advantage to be perceived by the player 1s as naive rather than sophisticated. However, when there are many player 2s, with some of them naive and some sophisticated, each individual player 2 has no effect on the player 1s' perception of the distribution of player 2s. Thus each sophisticated, sequentially rational player 2 outperforms each naive player 2.
It is clear that learning must be slow if L^2_t is close to one. When α > 2, however, the proof involves showing that this is the only circumstance under which learning is slow. There are two complicating factors. The first is that there are subgames in which player 2's choice cannot reveal information about his preferences, because there is insufficient knowledge about player 3's preferences, and therefore about player 3's choices. The second factor concerns the existence of player 2 subgames with outcomes that are avoided by player 3, thus making it difficult to reveal information about player 2's preferences. Such games arise even as t → ∞. However, A1 implies that these problematic games are a vanishing fraction of all games as t → ∞.
These results extend straightforwardly to a perfect information game with S stages and S players. When considering the sophistication of player s, we assume that players s + 1, ..., S are already TOM. The critical value of α for any TOM player remains 2, for all s. The critical value of α for a naive player grows with the number of possible games that can be formed.
Further Extensions. How much do the current results depend on the particular model described here? Although the environment is rather particular, it is best seen merely as a test to discriminate between the underlying characteristics of the TOM players and of the naive players. What about games with imperfect information? With multiple Nash equilibria, it is not clear how to disentangle the lack of knowledge of payoffs from the lack of information about which equilibrium is to be played, at least in the absence of strong assumptions. Perhaps normal form games crop up along with the games of perfect information emphasized here, using outcomes drawn from the same set. Whether or not any learning can be accomplished in such normal form games, our approach shows that learning would arise based only on the games of perfect information.
Within the class of games of perfect information, it is only for simplicity that we restrict attention to a fixed game tree. The tree itself could be random: it might involve a random number of moves or a random order of play, for example. Similarly, players could be allowed to move multiple times, and so on. Likewise, the assumption that individuals have a strict ranking over each distinct pair of outcomes is basically innocuous. If indifference is allowed, suppose, for example, that individuals randomize over each pair of indifferent outcomes. The indifference of player i between z and z′ would then become common knowledge to the TOM types if ever player i chose z over z′ and z′ over z.
We assume here that TOM types do not apply transitivity in their deductions about the preferences of other players. Allowing them to do so might change the relevant ranges of the growth parameter α. The new critical value of α cannot exceed 2, since applying transitivity could not be disadvantageous. It could not, however, fall below 1. More sophisticated naive types could clearly do better than the ones we describe here. If naive types assign beliefs to subgames, rather than to entire games, for example, they would do as well as TOM types in the two-stage case. More generally, with three or more stages, such more sophisticated naive players would do better than the naive players considered here, but not as well as the sophisticated players.
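To see what applying transitivity would buy, here is a small sketch (our illustration, not part of the model) that closes a set of revealed comparisons under transitivity:

```python
def transitive_closure(revealed):
    """revealed: set of (better, worse) pairs from observed choices;
    returns all pairs deducible by chaining comparisons."""
    closed = set(revealed)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closed):
            for (c, d) in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed

print(transitive_closure({("x", "y"), ("y", "z")}))
# includes ("x", "z"), although that pair was never faced directly
```

Chains of observed choices would then settle pairs that were never faced directly, which is why transitivity could only lower the TOM critical value of α, though, as noted above, not below 1.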
Experiments. It would be of interest to implement the model experimentally, perhaps simplified to have no innovation. It also seems we might not need a very large number of subjects in each of the I pools. Induce the same preferences over a large set of outcomes for each of the player i's, for i = 1, ..., I, by using monetary payoffs. No player knows the other players' payoffs. Play the game otherwise as above. How fast would players learn other players' preferences? Would they be closer to the sophisticated TOM types described above or to the naive types? How would the number of stages I affect matters?
The End
Sketch of Proofs. We treat a general case with I stages and A choices at each decision node. No learning in the limit: if outcomes arrive at too fast a rate, it is straightforward to prove that learning cannot occur, even when the greatest possible amount of information is revealed in every period.

Lemma 1 In each of the following, convergence is sure.
i) Suppose α ∈ [0, 2). Then L^i_t → 0 for each preference type i = 1, ..., I.
ii) Suppose there are T terminal nodes. If α ∈ [0, T), then L^N_t → 0.
Results About Learning Theorem 1 is a special case of Theorem 2. Attention is restricted to the ToM players, and so to the L^i_t's. (The corresponding claim about the naive types goes through with minor changes to the analysis.) As hypothesized in Theorem 2, when considering L^i_t, players i + 1, ..., I are taken to be ToM. Auxiliary Results The first minor result, Proposition 1, relates how much is commonly known about i's preferences over pairs of outcomes to what is commonly known about i's preferences over A-tuples of outcomes. The need for this result can be finessed, for expositional simplicity, by setting A = 2.
The gist of Proposition 2 is the following. Suppose types i, ..., I are all ToM and that L^{i+1}_t, ..., L^I_t each converge to one in probability. Then, in the limit, the probability of revealing new information about i's preferences is small only if the fraction of extant knowledge about those preferences, L^i_t, is close to one. Although the probability of revealing new information about i is clearly small if 1 − L^i_t is small, the converse is not obviously true. Proposition 2, however, establishes an appropriate lower bound. This bound decomposes the expected amount learned into a factor involving 1 − L^i_t, reflecting what is yet to be revealed about i's preferences, and a residual term.
Proposition 2 Suppose each of the random variables L^{i+1}_t, ..., L^I_t converges to one in probability. Then for each ε ∈ [0, 1] there exists a random variable ξ^{iε}_t such that E(K_{t+1} | H_t) − K_t ≥ ε^{2^i}(1 − L^i_t)^{2^i} − ξ^{iε}_t, where ξ^{iε}_t converges in probability to a continuous function m^i : [0, 1] → [0, 1] such that m^i → 0 as ε → 0.
Proposition 2 is the heart of the matter. Consider the case of player I. What is a lower bound on the probability that something new will be learned about this player's preferences? Such a lower bound arises from the event that every pair of outcomes that follows a choice by player I is unfamiliar to the ToM players. This lower bound is then [1 − L^I_t]^{2^{I−1}} ≥ [1 − L^I_t]^{2^I}.
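With A = 2 the player-I bound can be written out explicitly. This is a sketch; treating the pairs as unfamiliar independently is an assumption made here for illustration.

```latex
% Player I moves at 2^{I-1} nodes, each followed by one pair of outcomes.
% If each pair is unfamiliar with probability at least 1 - L_t^I, then
\Pr\bigl(\text{every pair following } I\text{'s choices is unfamiliar}\bigr)
  \;\geq\; \bigl(1 - L_t^I\bigr)^{2^{I-1}}
  \;\geq\; \bigl(1 - L_t^I\bigr)^{2^{I}},
% where the last inequality uses 1 - L_t^I \in [0,1]; this matches the
% exponent 2^i in Proposition 2 with i = I.
```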
For players who are not last, the situation is more complex. There are two complicating factors. The first is that there are i-type subgames in which i's choice cannot reveal information about i's preferences because there is insufficient knowledge about the remaining players' choices. The second is the existence of i-type subgames with outcomes that are avoided by the remaining opponents, making it difficult to reveal information about i's preferences. Such games arise even as t → ∞. However, A1 implies that these problematic games are a vanishing fraction of all games as t → ∞. These two complicating factors account for the additional multiplicative and additive terms in the bound obtained in Proposition 2.
Proposition 3 then essentially completes the proof of Theorem 2 (and Theorem 1) by applying the result of Proposition 2.
Proposition 3 Suppose that, for each ε ∈ [0, 1], there exists a random variable ξ^{iε}_t such that E(K_{t+1} | H_t) − K_t ≥ ε^{2^i}(1 − L^i_t)^{2^i} − ξ^{iε}_t, where ξ^{iε}_t converges in probability to a continuous function m^i : [0, 1] → [0, 1] such that m^i → 0 as ε → 0. If, in addition, α > 2, then L^i_t converges to one in probability.
Consider the case I = 3 for simplicity of exposition, and consider player 2. Assume for the moment that L^2_t converges in probability to the random variable L. Fix η > 0 and let A denote the event {L < 1 − η}. Given the above facts about the probability of new information being revealed, the random variable E(K^2_{t+τ} | h_t) is bounded below asymptotically, on A, by K^2_t + η^4 τ, as τ → ∞. That is, on A, in the limit learning occurs at least linearly in τ. Observe now that Q_{t+τ} is of order τ^{2/α}. Then, dividing through by the non-random Q_{t+τ}, we obtain that E(L^2_{t+τ} | h_t) → ∞ on A. Since by definition L^2_t is bounded above by one surely, it must be that P(A) = 0. That is, if α > 2 and L^2_t converges in probability to L, it must be that L = 1 almost surely.
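The divergence step can be written out as follows. This is a sketch; constants in front of τ are suppressed throughout.

```latex
% On A = \{ L < 1 - \eta \}, knowledge grows at least linearly in \tau,
E\bigl(K^2_{t+\tau} \mid h_t\bigr) \;\gtrsim\; K^2_t + \eta^{4}\,\tau,
% while the stock of rankings grows only polynomially with exponent 2/\alpha,
% Q_{t+\tau} = \Theta(\tau^{2/\alpha}).
% Dividing, and using 2/\alpha < 1 when \alpha > 2,
E\bigl(L^2_{t+\tau} \mid h_t\bigr)
  \;\gtrsim\; \frac{\eta^{4}\,\tau}{\tau^{2/\alpha}}
  \;=\; \eta^{4}\,\tau^{\,1 - 2/\alpha}
  \;\longrightarrow\; \infty \quad (\tau \to \infty),
% which contradicts L^2_t \leq 1 unless \Pr(A) = 0.
```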
It remains then to show the convergence in probability just assumed. We do this by showing that the L_t processes lie in a class of generalized martingales. Consider the following.
Weak Submartingale in the Limit The adapted process (L_t, h_t) is a weak submartingale in the limit (w-submil) if, for each η > 0, there is a T such that τ ≥ t ≥ T implies P(E(L_τ | h_t) ≥ L_t − η) > 1 − η, a.e. and uniformly in τ.
Egghe (1984) shows that bounded weak submartingales in the limit have limits in probability.
Given the above, we need to prove that the L_t sequences are w-submils. Clearly the L_t process has the submartingale property between arrival dates. However, at an arrival date, L_t is discounted by the factor Q_t/Q_{t+1}, and increases by at most 1/Q_{t+1}. The key is then to show that the sequence of L_t at the arrival dates is nevertheless a w-submil. This follows because there is a process converging to 1 such that L_t at the arrival dates can only decrease if it is greater than this process or close to it.
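The arrival-date bookkeeping can be made explicit. This is a sketch, writing Q_t for the stock of rankings at date t.

```latex
% At an arrival date the stock grows from Q_t to Q_{t+1}, so the fraction
% known satisfies
\frac{Q_t}{Q_{t+1}}\, L_t
  \;\leq\; L_{t+1}
  \;\leq\; \frac{Q_t}{Q_{t+1}}\, L_t \;+\; \frac{1}{Q_{t+1}}.
% Since Q_t = \Theta\bigl(t^{2/\alpha}\bigr), the discount factor
% Q_t / Q_{t+1} \to 1, so the downward jumps at arrival dates vanish
% asymptotically, which is what the w-submil property requires.
```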
The End