Behavioral learning in a population of interacting agents Jean-Pierre Nadal
1 Behavioral learning in a population of interacting agents. Jean-Pierre Nadal, nadal@lps.ens.fr, Laboratoire de Physique Statistique, ENS, and Centre d'Analyse et de Mathématique Sociales, EHESS 1
2 Continental divide game (Van Huyck, Battalio & Cook, 1997); Nash equilibria 2
3 Continental divide 3
4 Continental divide (Camerer) 4
5 Modeling human and animal behaviour (experimental psychology); computational neuroscience (at the neuronal level): decision making based on expected reward/punishment, motor control; economics/game theory: behavioral game theory.
Some references. Behavioral learning: Bush, R. & Mosteller, F., Psychological Review; Rescorla, R.A. & Wagner, A.R. (1972), A theory of Pavlovian conditioning; Sutton and Barto, Reinforcement learning, 1984, 1988, 1990; book: Reinforcement Learning: An Introduction, The MIT Press, 1998 (free online). Behavioral game theory: Cross 1973; Arthur 1991; McAllister 1991; Walliser 1997; Camerer 1998. Dayan, P. & Daw, N.D. (2008), Decision theory, reinforcement learning, and the brain, Cognitive, Affective & Behavioral Neuroscience
6 Behavioral learning. For a set of possible actions, the utility/payoff/profit is not known in advance. Exploration: making choices whose possible outcomes are not (well) known. Learning: reinforcement of the actions which appear to be the most efficient, i.e. a higher probability to choose such actions in the future. Exploitation of acquired knowledge: past experience (possibly of others) allows one to form expectations about the outcomes of some actions/choices/strategies. Efficient learning requires a compromise between exploration and exploitation. Collective scale: learning in a population of interacting agents 6
7 Attraction dynamics. At each time step every agent i makes a choice (among a set of possible actions/choices/strategies ω = 1, …, Ω). Iterated game: at each time step t, agent i associates to each possible action ω a weight (an "attraction") A_i(ω, t) (an estimate of <u_i(ω)>). Choice of ω_i(t): p_i(ω_i(t) = ω) = f(A_i(ω, t)) / Σ_{ω'} f(A_i(ω', t)), with, e.g., f(x) = exp(βx) ("logit") 7
8 Reinforcement learning: basic idea. [Bar chart of A_i(ω, t) over the strategies (actions) ω_i = 1, 2, 3, 4; ω = 3 is chosen at t and yields payoff u_i(3, ω_{-i}(t)).] A_i(ω, t) = agent i's attraction for action ω: the larger A_i(ω, t), the larger the probability for agent i to choose ω_i = ω 8
9 Basic reinforcement learning. [Bar chart of A_i(ω, t) over the strategies (actions) ω_i = 1, 2, 3, 4.] If the payoffs u_i(ω, ω_{-i}(t)) are known for ω = 1, 2, 3, 4: "fictitious play". The larger A_i(ω, t), the larger the probability for agent i to choose ω_i = ω (Cournot 1838; Brown 1951; Robinson 1951) 9
10 Basic reinforcement learning. [Bar chart of A_i(ω, t+1) over the strategies (actions) ω_i = 1, 2, 3, 4.] Renormalisation: uniform weakening of the attractions A_i 10
11 Attraction dynamics. At each time step every agent i makes a choice (among a set of possible actions/choices/strategies ω = 1, …, Ω). Choice rule: depends on the "attractions" (weights) {A_i(ω, t), ω = 1, …, Ω}: the greater the attraction A_i(ω, t) for ω, the greater the probability p_i(ω, t) that i chooses ω at time t. A_i(ω, t) ~ expectation/estimate of the payoff if ω_i = ω; ~ opinion on the usefulness of ω.
Deterministic choice: ω_i(t) = ω_i^0(t) = argmax_ω A_i(ω, t).
Probabilistic choice ("trembling hand"): ω_i(t) = ω_i^0(t) with probability 1 - ε, any other ω with probability ε/(Ω - 1).
Or: p_i(ω, t) = f(A_i(ω, t)) / Z_i(t), with Z_i(t) = Σ_ω f(A_i(ω, t)); example: f(A) = exp(βA), the logit choice function 11
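The choice rules above can be sketched in a few lines of Python (a minimal illustration; the function names are mine, not from the slides):

```python
import math
import random

def logit_probs(attractions, beta):
    """Logit (softmax) choice: p(w) proportional to exp(beta * A(w)).
    The max attraction is subtracted for numerical stability."""
    m = max(attractions)
    weights = [math.exp(beta * (a - m)) for a in attractions]
    z = sum(weights)
    return [w / z for w in weights]

def trembling_hand_choice(attractions, eps, rng=random):
    """Deterministic argmax, except that with probability eps the agent
    picks uniformly among the other actions ('trembling hand')."""
    best = max(range(len(attractions)), key=lambda w: attractions[w])
    if rng.random() < eps:
        others = [w for w in range(len(attractions)) if w != best]
        return rng.choice(others)
    return best
```

With beta = 0 the logit rule is pure exploration (uniform choice); as beta grows it approaches the deterministic argmax, so beta tunes the exploration/exploitation compromise of slide 6.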
12 Adaptation of attractions. Updating of attractions: a family of learning rules. A_i(ω, t+1) = (1 - μ) A_i(ω, t) + μ Φ[π_i(ω, t), ω_i(t)], where π_i(ω, t) is the payoff which would have been received at t if ω_i(t) = ω. [The payoff at t depends on the actions/choices of the other agents at time t: π_i(ω, t) = π_i(ω_i = ω, {ω_j(t), j = 1, …, N; j ≠ i}).]
"Fictitious play": A_i(ω, t+1) = (1 - μ) A_i(ω, t) + μ π_i(ω, t), for all ω = 1, …, Ω.
"Weighted belief learning": A_i(ω, t+1) = (1 - μ) A_i(ω, t) + μ π_i(ω, t) if ω_i(t) = ω; A_i(ω, t+1) = (1 - μ) A_i(ω, t) + μ δ π_i(ω, t) otherwise, with 0 < δ < 1.
Fictitious play: δ = 1. Myopic best response: μ = 1, δ = 1, ε = 0 12
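A one-step sketch of this updating rule (a hypothetical helper, assuming the foregone payoffs π_i(ω, t) are observable, as in fictitious play):

```python
def update_attractions(A, payoffs, chosen, mu, delta):
    """Weighted belief learning:
    A(w) <- (1-mu)*A(w) + mu*payoff(w)        if w was the chosen action,
    A(w) <- (1-mu)*A(w) + mu*delta*payoff(w)  otherwise (0 < delta < 1).
    delta = 1 recovers fictitious play, where foregone payoffs are
    reinforced as strongly as the realised one."""
    return [(1 - mu) * a + mu * (1.0 if w == chosen else delta) * payoffs[w]
            for w, a in enumerate(A)]
```

For example, with mu = 0.5 and delta = 0.5, an unchosen action is reinforced half as strongly as the chosen one; with delta = 1 both actions are updated identically.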
13 The Marseille fish market. G. Weisbuch, A. Kirman, D. Herreiner (2000) (data: A. Vignes). No posted prices (hence no fictitious play). Observation: a mixture of loyal and disloyal customers. Take the point of view of a buyer i facing K sellers. Strategies of i: S_i = { k = go to seller number k (k = 1, …, K) }. Reinforcement: A_ik(t) = weight attributed by i to the strategy "visit seller k". For k = S_i(t): A_ik(t+1) = (1 - μ) A_ik(t) + μ u_ik(t). For k ≠ S_i(t): A_ik(t+1) = (1 - μ) A_ik(t). Choice: p_i(S_i(t) = k, t) = f(A_ik(t)) / Σ_{k'=1,…,K} f(A_ik'(t)) 13
14 The Marseille fish market (continued). A_ik(t+1) - A_ik(t) = -μ A_ik(t) + μ u_ik(t). Hypothesis: convergence (stationary regime), with u_ik(t) = u_ik each time i visits k. Mean-field-type approximation: u_ik(t) → <u_ik>, with <u_ik> = p_i(S_i = k) u_ik. Fixed point: A_ik = <u_ik>, i.e. A_ik = u_ik exp(β A_ik) / Σ_{k'} exp(β A_ik').
Simplest case: K = 2 sellers and u_i1 = u_i2 = u. Let Δ_i = A_i1 - A_i2. Fixed points: Δ = u tanh(βΔ/2). For β < β_c = 2/u: Δ = 0 is stable, p_i(S_i = k) = 1/2. For β > β_c: Δ = 0 is unstable, and Δ_+, Δ_- are stable (0 < Δ_+ = -Δ_-); for each agent i the dynamics leads to Δ_i = Δ_+ or Δ_i = Δ_-. If moreover the β_i are heterogeneous: β_i < β_c gives "disloyalty", β_i > β_c gives "loyalty" 14
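The fixed-point equation Δ = u tanh(βΔ/2) and the transition at β_c = 2/u can be checked numerically (a sketch; the parameter values are illustrative):

```python
import math

def fixed_point(beta, u, d0=1.0, n=500):
    """Iterate Delta <- u * tanh(beta * Delta / 2) to a stable fixed point."""
    d = d0
    for _ in range(n):
        d = u * math.tanh(beta * d / 2)
    return d

u = 1.0
beta_c = 2.0 / u
low = fixed_point(0.5 * beta_c, u)    # below beta_c: Delta -> 0 (no loyalty)
high = fixed_point(2.0 * beta_c, u)   # above beta_c: Delta -> Delta_+ > 0 (loyalty)
```

Starting the iteration from d0 = -1 instead of +1 converges to the symmetric root Δ_- = -Δ_+, which is how individual buyers end up loyal to one seller or the other.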
15 Laboratory experiments 15
16 Experiment: laboratory version of the Dying Seminar Alexis GARAPIN, Bernard RUFFIEUX, Viktoriya SEMESHENKO, Mirta B. GORDON 16
17 Four different information treatments. All treatments (A, B, C & D): for each individual, information on his own threshold and payoffs.
Treatment A, "On line" (OL): each individual is additionally given all individuals' thresholds; all on-line individual decisions; and, ex post, the number of participants for each period of each seminar.
Treatment B, "(ex post) # participants" (NP): each individual is additionally given all individuals' thresholds and, ex post, the number of participants for each period of each seminar.
Treatment C, "(ex post) threshold reached" (H): each individual is additionally told, ex post, whether his individual threshold was reached or not.
Treatment D, "Payoff" (P): no additional information (therefore, a subject who did not participate in a seminar does not know whether his threshold would have been reached). 17
18 Putting the Dying Seminar into the lab. Seminars with N = 16 potential participants. In all treatments, the payoffs are the same for an individual seminar (one period): individual endowment: 50 (with 100 = 1.45); non-participant: 50; participant with threshold reached: 200; participant with threshold not reached: 0
19 Seminar 1 [figure: equilibria labelled fragile / stable / stable]
20 Seminar 1
21 Seminar 1
22 Seminar #2. Distribution of thresholds (IWP) [figure: density f1bis(Hi) and cumulatives F1bis(Hi), F1(Hi) against Hi; equilibria labelled stable / stable / fragile] 22
23 Experiment: laboratory version of the Dying Seminar Alexis GARAPIN, Bernard RUFFIEUX, Viktoriya SEMESHENKO, Mirta B. GORDON 23
24 Seminar #3. Distribution of thresholds (IWP) [figure: equilibria labelled stable / unstable / stable] 24
25 Seminar 3 25
26 Seminar #4 [figure: equilibria labelled stable / stable / fragile] 26
27 Seminar 4 27
28 Modeling. Attraction dynamics ~ reinforcement learning: the Experience-Weighted Attraction (EWA) learning scheme (Camerer, 2003). Numerical simulations 28
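As an illustration of what such simulations involve (not the authors' EWA code: a simplified reinforcement scheme under treatment-D-like information, where each agent observes only his own payoff; the parameter values and function name are mine):

```python
import math
import random

def simulate_seminar(thresholds, beta=0.2, mu=0.3, periods=50, seed=0):
    """Agents repeatedly decide whether to attend a seminar.
    Each agent keeps attractions for [stay, go], chooses by a logit rule,
    and reinforces the chosen action with the realised payoff.
    Payoffs as in the experiment: non-participant 50; participant 200 if
    the number of participants reaches the agent's threshold, else 0."""
    rng = random.Random(seed)
    n = len(thresholds)
    A = [[50.0, 50.0] for _ in range(n)]   # attractions for [stay, go]
    attendance = []
    for _ in range(periods):
        go = []
        for i in range(n):
            p_go = 1.0 / (1.0 + math.exp(-beta * (A[i][1] - A[i][0])))
            go.append(rng.random() < p_go)
        k = sum(go)
        attendance.append(k)
        for i in range(n):
            payoff = (200.0 if k >= thresholds[i] else 0.0) if go[i] else 50.0
            a = 1 if go[i] else 0
            A[i][a] = (1 - mu) * A[i][a] + mu * payoff
    return attendance
```

With very low thresholds the population learns to attend (the seminar stabilises at full attendance); with unreachable thresholds participation is never rewarded and the seminar dies, which is the qualitative contrast the treatments probe.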
29 Treatment A: simulations vs. experiments. Treatment A ("On line"): for each individual, information on his threshold and payoffs; additionally, all individuals' thresholds, all on-line individual decisions, and, ex post, the number of participants for each period of each seminar 29
30 Treatment B: experiments vs. simulations. Treatment B ("(ex post) # participants", NP): for each individual, information on his threshold and payoffs; additionally, all individuals' thresholds and, ex post, the number of participants for each period of each seminar 30
31 Treatment C: simulations vs. experiments. Treatment C ("(ex post) threshold reached", H): for each individual, information on his threshold and payoffs; additionally, ex post, whether his individual threshold was reached or not 31
32 Treatment D: experiments vs. simulations. Treatment D ("Payoff", P): for each individual, information on his threshold and payoffs; no additional information (therefore, a subject who did not participate in a seminar does not know whether his threshold would have been reached) 32
33 Perspectives. (Many) more participants in large experiments: are the results homothetic? What is a "large" group in coordination and critical-mass problems? Models of (thinking and) learning 33
34 (Not too rational) expectations. The "beauty contest" (Keynes, 1936): vote for the most beautiful face; the winner is the voter closest to the average choice; the best strategy is to vote for what you expect to be the choice of the majority. The "p-beauty contest": N players; every player must choose a number between 0 and 100; the winner is the player whose number is closest to p = 2/3 of the mean 34
35 p-beauty contest. A 1-step thinker assumes the others make a random choice between 0 and 100, hence an expected average of 50; their choice = 2/3 × 50 ≈ 33. A 2-step thinker assumes the others are 1-step thinkers, hence an expected average of 33; their choice = 2/3 × 33 ≈ 22. Or rather: assumes a mixture of zero-step and one-step thinkers; their choice ≈ 27. Rational expectations (consistency of expectations): every player chooses the same number = 2/3 of the average, so average = 2/3 × average, hence average = 0 35
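The slide's iteration can be written out directly (a small sketch; `level0_mean = 50` encodes the assumption that zero-step players choose uniformly on [0, 100]):

```python
def level_k_choice(k, p=2/3, level0_mean=50.0):
    """Choice of a k-step thinker who assumes everyone else reasons
    with exactly k-1 steps: guess_k = p * guess_{k-1}, starting from
    the level-0 mean of 50. As k grows the guess shrinks towards 0,
    the rational-expectations answer."""
    guess = level0_mean
    for _ in range(k):
        guess *= p
    return guess
```

So level_k_choice(1) ≈ 33 and level_k_choice(2) ≈ 22, matching the slide, while large k reproduces the unique rational-expectations equilibrium at 0.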
36 p-beauty contest Camerer 36
37 (Not too rational) expectations: Cognitive Hierarchy (Camerer, Ho & Chong 2002). Thinking steps: zero-step thinkers show myopic behavior; a k-step thinker anticipates k steps of reasoning and assumes the other players anticipate at most k-1 steps. f(k) = distribution of k-step thinkers in the population; τ = mean number of steps of thinking = <k> = Σ_k k f(k); empirical data: τ ~ 1–2. Simplest hypothesis: Poisson distribution, f(k) = (τ^k / k!) e^{-τ} 37
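The Poisson hypothesis is easy to check numerically (truncating the support at a large kmax, which introduces only a negligible error):

```python
import math

def poisson_f(tau, kmax=60):
    """Distribution of thinking steps: f(k) = tau^k e^{-tau} / k!."""
    return [tau**k * math.exp(-tau) / math.factorial(k)
            for k in range(kmax + 1)]

f = poisson_f(tau=1.5)   # tau in the empirically observed 1-2 range
mean_steps = sum(k * fk for k, fk in enumerate(f))
```

A single parameter τ thus fixes the whole population mixture of thinking levels, which is what makes the model easy to fit to experimental data.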
39 Cognitive hierarchy, example: a 2-player game. Strategy of a k-step thinker: the expected payoff from choosing strategy ω is E_k[π(ω)] = Σ_{ω'} π(ω, ω') Σ_{k'<k} g_k(k') P_{k'}(ω'), where P_{k'}(ω') is the probability that a k'-step opponent chooses strategy ω', and g_k(k') = fraction of k'-step thinkers among those with k' < k = f(k') / Σ_{k''<k} f(k''). Case of best response: P_k(ω) = 1 if E_k[π(ω)] = max_{ω'} E_k[π(ω')], and 0 otherwise 39
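The expected-payoff formula can be sketched for a 2-player matrix game (a hypothetical helper; `payoff[w][w2]` is assumed to be the row player's payoff matrix):

```python
def ch_expected_payoffs(k, payoff, f, P):
    """Expected payoffs E_k[pi(w)] of a k-step thinker facing a mixture
    of lower-step opponents: g_k(k') = f(k') / sum_{k''<k} f(k''), and
    P[k'][w'] is the choice probability of a k'-step opponent."""
    z = sum(f[k2] for k2 in range(k))          # normalisation of g_k
    out = []
    for w in range(len(payoff)):
        e = 0.0
        for k2 in range(k):
            g = f[k2] / z
            for w2 in range(len(payoff[w])):
                e += payoff[w][w2] * g * P[k2][w2]
        out.append(e)
    return out
```

For instance, a 1-step thinker facing a uniform level-0 opponent in a 2x2 coordination game with payoffs [[2, 0], [0, 1]] expects 1.0 from the first strategy and 0.5 from the second, so his best response is the first.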
More informationA reinforcement learning scheme for a multi-agent card game with Monte Carlo state estimation
A reinforcement learning scheme for a multi-agent card game with Monte Carlo state estimation Hajime Fujita and Shin Ishii, Nara Institute of Science and Technology 8916 5 Takayama, Ikoma, 630 0192 JAPAN
More informationTemporal Difference. Learning KENNETH TRAN. Principal Research Engineer, MSR AI
Temporal Difference Learning KENNETH TRAN Principal Research Engineer, MSR AI Temporal Difference Learning Policy Evaluation Intro to model-free learning Monte Carlo Learning Temporal Difference Learning
More informationEmotions and the Design of Institutions
Emotions and the Design of Institutions Burkhard C. Schipper Incomplete and preliminary: January 15, 2017 Abstract Darwin (1872) already observed that emotions may facilitate communication. Moreover, since
More informationKnowing What Others Know: Coordination Motives in Information Acquisition
Knowing What Others Know: Coordination Motives in Information Acquisition Christian Hellwig and Laura Veldkamp UCLA and NYU Stern May 2006 1 Hellwig and Veldkamp Two types of information acquisition Passive
More informationPayments System Design Using Reinforcement Learning: A Progress Report
Payments System Design Using Reinforcement Learning: A Progress Report A. Desai 1 H. Du 1 R. Garratt 2 F. Rivadeneyra 1 1 Bank of Canada 2 University of California Santa Barbara 16th Payment and Settlement
More informationContents Quantum-like Paradigm Classical (Kolmogorovian) and Quantum (Born) Probability
1 Quantum-like Paradigm... 1 1.1 Applications of Mathematical Apparatus of QM Outside ofphysics... 1 1.2 Irreducible Quantum Randomness, Copenhagen Interpretation... 2 1.3 Quantum Reductionism in Biology
More informationPaul Bourgine and Jean-Pierre Nadal. Cognitive Economics An Interdisciplinary Approach
Paul Bourgine and Jean-Pierre Nadal Cognitive Economics An Interdisciplinary Approach Springer Berlin Heidelberg New York Barcelona Budapest Hong Kong London Milan Paris Santa Clara Singapore Tokyo Preface
More informationIntroduction to Game Theory
Introduction to Game Theory Part 2. Dynamic games of complete information Chapter 2. Two-stage games of complete but imperfect information Ciclo Profissional 2 o Semestre / 2011 Graduação em Ciências Econômicas
More informationOnline Appendices for Large Matching Markets: Risk, Unraveling, and Conflation
Online Appendices for Large Matching Markets: Risk, Unraveling, and Conflation Aaron L. Bodoh-Creed - Cornell University A Online Appendix: Strategic Convergence In section 4 we described the matching
More informationThe Time Consistency Problem - Theory and Applications
The Time Consistency Problem - Theory and Applications Nils Adler and Jan Störger Seminar on Dynamic Fiscal Policy Dr. Alexander Ludwig November 30, 2006 Universität Mannheim Outline 1. Introduction 1.1
More informationToday s Outline. Recap: MDPs. Bellman Equations. Q-Value Iteration. Bellman Backup 5/7/2012. CSE 473: Artificial Intelligence Reinforcement Learning
CSE 473: Artificial Intelligence Reinforcement Learning Dan Weld Today s Outline Reinforcement Learning Q-value iteration Q-learning Exploration / exploitation Linear function approximation Many slides
More informationExponential Moving Average Based Multiagent Reinforcement Learning Algorithms
Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca
More informationIntroduction to Reinforcement Learning
Introduction to Reinforcement Learning Rémi Munos SequeL project: Sequential Learning http://researchers.lille.inria.fr/ munos/ INRIA Lille - Nord Europe Machine Learning Summer School, September 2011,
More informationModels of Strategic Reasoning Lecture 2
Models of Strategic Reasoning Lecture 2 Eric Pacuit University of Maryland, College Park ai.stanford.edu/~epacuit August 7, 2012 Eric Pacuit: Models of Strategic Reasoning 1/30 Lecture 1: Introduction,
More informationCostly Social Learning and Rational Inattention
Costly Social Learning and Rational Inattention Srijita Ghosh Dept. of Economics, NYU September 19, 2016 Abstract We consider a rationally inattentive agent with Shannon s relative entropy cost function.
More informationAugmented Rescorla-Wagner and Maximum Likelihood Estimation.
Augmented Rescorla-Wagner and Maximum Likelihood Estimation. Alan Yuille Department of Statistics University of California at Los Angeles Los Angeles, CA 90095 yuille@stat.ucla.edu In Advances in Neural
More informationDavid Silver, Google DeepMind
Tutorial: Deep Reinforcement Learning David Silver, Google DeepMind Outline Introduction to Deep Learning Introduction to Reinforcement Learning Value-Based Deep RL Policy-Based Deep RL Model-Based Deep
More information
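The attraction dynamics above can be sketched in a few lines of code. This is a minimal illustration, not code from the lecture: the payoff function, learning rate `lam`, and the exponential-moving-average update are assumptions chosen for the sketch; only the logit choice rule p(ω) = f(A(ω)) / Σ f(A(ω')) with f(x) = exp(βx) comes from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def choice_probabilities(attractions, beta):
    """Logit ("softmax") choice rule: p(w) = exp(beta*A(w)) / sum_w' exp(beta*A(w'))."""
    x = beta * attractions
    x = x - x.max()  # subtract max for numerical stability; probabilities unchanged
    e = np.exp(x)
    return e / e.sum()

def step(attractions, beta, payoff, lam=0.1):
    """One learning step for a single agent: draw an action from the logit
    rule, then reinforce its attraction toward the realized payoff
    (an exponential moving average, one possible estimate of <u(w)>)."""
    p = choice_probabilities(attractions, beta)
    w = rng.choice(len(attractions), p=p)
    u = payoff(w)  # realized payoff of the chosen action
    attractions[w] += lam * (u - attractions[w])
    return w, attractions

# Hypothetical example: 3 actions with mean payoffs 0, 0.5 and 1 plus noise.
# Reinforcement shifts the attractions toward the means, so the agent comes
# to prefer the highest-payoff action while beta controls exploration.
means = np.array([0.0, 0.5, 1.0])
A = np.zeros(3)
for _ in range(2000):
    _, A = step(A, beta=2.0,
                payoff=lambda w: means[w] + 0.1 * rng.standard_normal())
```

With a larger β the choice rule concentrates on the current best attraction (more exploitation); with β → 0 it approaches uniform random choice (pure exploration), which is the exploration/exploitation compromise discussed above.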