Game Theory, Evolutionary Dynamics, and Multi-Agent Learning Prof. Nicola Gatti (nicola.gatti@polimi.it)
Game theory
Game theory: basics
A normal-form game specifies: Players, Actions, Outcomes, Utilities, Strategies, Solutions.

Example: Rock Paper Scissors. Payoffs are (row player, column player):

              Rock     Paper    Scissors
  Rock        0, 0     -1, 1    1, -1
  Paper       1, -1    0, 0     -1, 1
  Scissors    -1, 1    1, -1    0, 0

A mixed strategy assigns a probability to each action: here each player plays Rock, Paper, and Scissors with probability 1/3.
Nash Equilibrium
A strategy profile $(\sigma_1^*, \sigma_2^*)$ is a Nash equilibrium if and only if:
$$\sigma_1^* \in \arg\max_{\sigma_1} U_1(\sigma_1, \sigma_2^*) \qquad \sigma_2^* \in \arg\max_{\sigma_2} U_2(\sigma_1^*, \sigma_2)$$
i.e., each player's strategy is a best response to the other's.
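As a concrete check, here is a minimal numpy sketch (the variable names and tolerance are mine, not from the slides) verifying that the uniform profile is a Nash equilibrium of Rock Paper Scissors: a mixed strategy is a best response iff no pure action earns strictly more against the opponent's strategy.

```python
import numpy as np

# Row player's payoff matrix for Rock Paper Scissors; zero-sum, so U2 = -U1.
U1 = np.array([[ 0., -1.,  1.],
               [ 1.,  0., -1.],
               [-1.,  1.,  0.]])
U2 = -U1

sigma1 = np.full(3, 1/3)   # candidate equilibrium strategies
sigma2 = np.full(3, 1/3)

# Player 1: no pure action may beat the mixed strategy's utility.
assert np.all(U1 @ sigma2 <= sigma1 @ U1 @ sigma2 + 1e-12)
# Player 2: the same check from the column player's side.
assert np.all(sigma1 @ U2 <= sigma1 @ U2 @ sigma2 + 1e-12)
print("the uniform profile is a Nash equilibrium")
```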
Evolutionary dynamics
An evolutionary interpretation
Consider an infinite population of individuals: 33% play Rock, 33% play Paper, 33% play Scissors.
Example: populations of bacteria.
Let this population face an opponent population that plays 30% Rock, 50% Paper, 20% Scissors.
At each round, each individual of a population plays against each individual of the opponent's population.
Against the opponent population, the expected utilities are: Rock: -0.3, Paper: 0.1, Scissors: 0.2.
The average utility of the 33/33/33 population is 0.
A positive (utility - average utility) leads to an increase of that action's share of the population; a negative (utility - average utility) leads to a decrease.
New population after the replication: 17% Rock, 37% Paper, 46% Scissors.
Revision protocol
Question: how do the populations change? Replicator dynamics:
$$\dot{\sigma}_1(a, t) = \sigma_1(a, t) \left[ e_a U_1 \sigma_2(t) - \sigma_1(t)\, U_1\, \sigma_2(t) \right]$$
where $e_a U_1 \sigma_2(t)$ is the utility given by playing $a$ with probability 1, and $\sigma_1(t) U_1 \sigma_2(t)$ is the average population utility.
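A minimal sketch of one replicator step on the Rock Paper Scissors example above; the Euler step size is my assumption (the slides' 17%/37%/46% figures are illustrative, and this step reproduces the same qualitative shift).

```python
import numpy as np

U1 = np.array([[ 0., -1.,  1.],    # row player's payoffs
               [ 1.,  0., -1.],
               [-1.,  1.,  0.]])

sigma1 = np.full(3, 1/3)                 # population 1: 33/33/33
sigma2 = np.array([0.30, 0.50, 0.20])    # opponent population

u = U1 @ sigma2        # per-action utilities: [-0.3, 0.1, 0.2]
u_avg = sigma1 @ u     # average population utility: 0.0
sigma1_dot = sigma1 * (u - u_avg)        # replicator dynamics

eta = 1.6                                # assumed Euler step size
sigma1 = sigma1 + eta * sigma1_dot
print(u, u_avg, sigma1)  # Rock shrinks to ~0.17, Scissors grows to ~0.44
```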
Evolutionarily Stable Strategies
A strategy is an ESS if it is immune to invasion by mutant strategies, given that the mutants initially occupy a small fraction of the population.
Every ESS is an asymptotically stable fixed point of the replicator dynamics.
Every ESS is a Nash equilibrium, but not vice versa; and while a NE always exists, an ESS may not exist.
Prisoner's dilemma

              Cooperate    Defect
  Cooperate   3, 3         0, 5
  Defect      5, 0         1, 1

[Phase plot of the replicator dynamics; x-axis: player 1's cooperate probability, y-axis: player 2's cooperate probability. The only asymptotically stable point is (0, 0): both players defect.]
Stag hunt

          Stag    Hare
  Stag    4, 4    1, 3
  Hare    3, 1    3, 3

[Phase plot of the replicator dynamics; x-axis: player 1's stag probability, y-axis: player 2's stag probability. Two asymptotically stable points: (1, 1), both hunt the stag, and (0, 0), both hunt the hare.]
Matching pennies

          Head    Tail
  Head    0, 1    1, 0
  Tail    1, 0    0, 1

[Phase plot of the replicator dynamics; x-axis: player 1's head probability, y-axis: player 2's head probability. No asymptotically stable point: trajectories cycle around the mixed strategy (1/2, 1/2).]
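The three phase portraits can be reproduced numerically; below is a sketch (my own code, with assumed step size, horizon, and starting point) that Euler-integrates the two-population replicator dynamics for each of the 2x2 games above.

```python
import numpy as np

GAMES = {  # (A, B): row and column players' payoff matrices
    "prisoner's dilemma": (np.array([[3., 0.], [5., 1.]]),
                           np.array([[3., 5.], [0., 1.]])),
    "stag hunt":          (np.array([[4., 1.], [3., 3.]]),
                           np.array([[4., 3.], [1., 3.]])),
    "matching pennies":   (np.array([[0., 1.], [1., 0.]]),
                           np.array([[1., 0.], [0., 1.]])),
}

def simulate(A, B, x, y, dt=0.01, steps=20000):
    """Euler-integrate the replicator dynamics for a 2x2 game.
    x, y = probability of the first action for each player."""
    for _ in range(steps):
        ux = A @ np.array([y, 1 - y])   # row player's action utilities
        uy = np.array([x, 1 - x]) @ B   # column player's action utilities
        x += dt * x * (1 - x) * (ux[0] - ux[1])
        y += dt * y * (1 - y) * (uy[0] - uy[1])
    return x, y

for name, (A, B) in GAMES.items():
    print(name, simulate(A, B, 0.8, 0.8))
# From (0.8, 0.8): the prisoner's dilemma reaches (0, 0) (both defect);
# the stag hunt reaches (1, 1) (symmetric starts below 2/3 flow to
# both-hare instead); matching pennies keeps orbiting (0.5, 0.5).
```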
Multi-agent learning
Markov decision problem
Reinforcement learning
Q-learning (1)
For every state/action pair:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
where $\alpha$ is the learning rate, $r$ the reward, and $\gamma$ the discount factor.
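A minimal tabular sketch of this update; the environment interface (reset/step/actions) is an assumption of mine, not something given in the slides.

```python
import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                  # Q[(s, a)], initialized to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration (softmax is introduced below)
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)   # assumed environment interface
            # Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
            target = r + gamma * max(Q[(s_next, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```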
Example: normal-form games
1 state, 2 actions (Cooperate, Defect):

              Cooperate    Defect
  Cooperate   3, 3         0, 5
  Defect      5, 0         1, 1

With a single state and no future reward, the update reduces to $Q(a) \leftarrow Q(a) + \alpha\,(r - Q(a))$, with $\alpha = 0.2$.
Player 1 always cooperates; the opponent mostly defects:
$$\sigma_1(a) = \begin{cases} 1.0 & a = \text{Cooperate} \\ 0.0 & a = \text{Defect} \end{cases} \qquad \sigma_2(a) = \begin{cases} 0.2 & a = \text{Cooperate} \\ 0.8 & a = \text{Defect} \end{cases}$$

  round    opponent's action    player 1's Q function
  t = 0                         Q(Cooperate) = 0
  t = 1    a = Cooperate        Q(Cooperate) = 0.6
  t = 2    a = Defect           Q(Cooperate) = 0.48
  t = 3    a = Defect           Q(Cooperate) = 0.384
  t = 4    a = Defect           Q(Cooperate) = 0.3072
  t = 5    a = Defect           Q(Cooperate) = 0.24576
  t = 6    a = Cooperate        Q(Cooperate) = 0.796608
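The trace above can be replayed in a few lines; this sketch feeds the opponent's actions from the table into the simplified update and prints the same Q values.

```python
PAYOFF_1 = {("Cooperate", "Cooperate"): 3, ("Cooperate", "Defect"): 0,
            ("Defect", "Cooperate"): 5, ("Defect", "Defect"): 1}
alpha = 0.2
q_cooperate = 0.0
opponent_actions = ["Cooperate", "Defect", "Defect",
                    "Defect", "Defect", "Cooperate"]
for t, opp in enumerate(opponent_actions, start=1):
    r = PAYOFF_1[("Cooperate", opp)]          # player 1 always cooperates
    q_cooperate += alpha * (r - q_cooperate)  # Q(a) <- Q(a) + alpha (r - Q(a))
    print(f"t = {t}: Q(Cooperate) = {q_cooperate}")
# 0.6, 0.48, 0.384, 0.3072, 0.24576, 0.796608 (up to float rounding)
```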
Q-learning (2)
Softmax (a.k.a. Boltzmann exploration):
$$\sigma_i(a) = \frac{\exp(Q(s, a)/\tau)}{\sum_{a'} \exp(Q(s, a')/\tau)}$$
where $\tau$ is the temperature.
Every action is played with strictly positive probability.
The larger the temperature, the smoother (closer to uniform) the distribution.
As the temperature tends to 0, the policy tends to a best response.
Example: normal-form games
With $\tau = 1$ and $Q(\text{Defect}) = 0$:

  Q(Cooperate)    Q(Defect)    σ₁(Cooperate)    σ₁(Defect)
  0               0            0.5              0.5
  1               0            0.731            0.269
  5               0            0.99331          0.00669
  10              0            0.999955         0.000045
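A sketch of the softmax policy that reproduces the table; the temperature τ = 1 is inferred from the numbers rather than stated on the slide.

```python
import math

def softmax_policy(q_values, tau):
    """sigma(a) = exp(Q(a)/tau) / sum_a' exp(Q(a')/tau)"""
    exps = [math.exp(q / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

for q_cooperate in [0, 1, 5, 10]:
    sigma = softmax_policy([q_cooperate, 0], tau=1.0)  # Q(Defect) = 0
    print(q_cooperate, [round(p, 6) for p in sigma])
# 0 -> [0.5, 0.5]; 1 -> [0.731059, 0.268941];
# 5 -> [0.993307, 0.006693]; 10 -> [0.999955, 0.000045]
```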
Self-play Q-learning dynamics
Self-play learning
Two Q-learning algorithms play against each other on the Prisoner's dilemma:

              Cooperate    Defect
  Cooperate   3, 3         0, 5
  Defect      5, 0         1, 1
Learning dynamics
Assumptions: time is continuous; all the actions can be selected simultaneously.
$$\dot{\sigma}_1(a, t) = \underbrace{\sigma_1(a, t)\left[e_a U_1 \sigma_2(t) - \sigma_1(t)\, U_1\, \sigma_2(t)\right]}_{\text{exploitation term}} - \underbrace{\tau\,\sigma_1(a, t)\left[\log \sigma_1(a) - \sum_{a'} \sigma_1(a')\log \sigma_1(a')\right]}_{\text{exploration term}}$$
When the temperature is 0, Q-learning behaves as the replicator dynamics.
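A sketch that Euler-integrates these learning dynamics on the Prisoner's dilemma; the temperature, step size, horizon, and starting point are all my assumptions.

```python
import numpy as np

A = np.array([[3., 0.], [5., 1.]])   # row player's payoffs (Cooperate, Defect)
B = np.array([[3., 5.], [0., 1.]])   # column player's payoffs
tau, dt, steps = 0.1, 0.01, 50000

x = np.array([0.9, 0.1])             # sigma_1: starts mostly cooperating
y = np.array([0.9, 0.1])             # sigma_2

def dynamics(sigma, utilities, tau):
    exploit = sigma * (utilities - sigma @ utilities)
    explore = sigma * (np.log(sigma) - sigma @ np.log(sigma))
    return exploit - tau * explore

for _ in range(steps):
    dx = dynamics(x, A @ y, tau)     # both players update simultaneously
    dy = dynamics(y, x @ B, tau)
    x, y = x + dt * dx, y + dt * dy
    x, y = x / x.sum(), y / y.sum()  # keep the strategies on the simplex

print(x, y)  # both settle near full defection, with a small
             # exploration-driven cooperate probability
```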
Prisoner's dilemma

              Cooperate    Defect
  Cooperate   3, 3         0, 5
  Defect      5, 0         1, 1

[Phase plot of the Q-learning dynamics; x-axis: player 1's cooperate probability, y-axis: player 2's cooperate probability. A single asymptotically stable point near (0, 0).]
Stag hunt

          Stag    Hare
  Stag    4, 4    1, 3
  Hare    3, 1    3, 3

[Phase plot of the Q-learning dynamics; x-axis: player 1's stag probability, y-axis: player 2's stag probability. Two asymptotically stable points, near (1, 1) and near (0, 0).]
Matching pennies

          Head    Tail
  Head    0, 1    1, 0
  Tail    1, 0    0, 1

[Phase plot of the Q-learning dynamics; x-axis: player 1's head probability, y-axis: player 2's head probability.]