1 On the Connection between No-Regret Learning, Fictitious Play, & Nash Equilibrium
Amy Greenwald, Brown University
Gunes Ercal, David Gondek, Amir Jafari
July,

2 Games

A game is a tuple $\Gamma = (I, (S_i, r_i)_{i \in I})$ where
• $I$ is a set of players ($i \in I$)
• $S_i$ is a set of (pure) strategies ($s_i \in S_i$)
• $r_i : S \to \mathbb{R}$ is a reward function ($s \in S = \prod_i S_i$)

A repeated game is a sequence of tuples $\Gamma^T$.
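
The later slides evaluate $r_i$ on mixed strategies, which presumably means the expected reward over pure profiles. A minimal sketch of that extension, assuming a dictionary encoding of the reward functions; the function name, the encoding, and the example payoffs are illustrative choices, not taken from the slides:

```python
from itertools import product

def expected_reward(payoffs, strategy_sets, mixed, player):
    """Expected reward of `player` when each player j draws s_j from mixed[j].

    payoffs[s] is a tuple of rewards, one per player, for the pure profile s.
    """
    total = 0.0
    for profile in product(*strategy_sets):
        prob = 1.0
        for j, s in enumerate(profile):
            prob *= mixed[j][s]  # players randomize independently
        total += prob * payoffs[profile][player]
    return total

# Illustrative two-player game (payoffs assumed for this sketch).
strategy_sets = [("C", "D"), ("C", "D")]
payoffs = {
    ("C", "C"): (4, 4), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
uniform = [{"C": 0.5, "D": 0.5}, {"C": 0.5, "D": 0.5}]
value = expected_reward(payoffs, strategy_sets, uniform, player=0)
```

The loop enumerates every pure profile, so it is only practical for small games.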

3 Nash Equilibrium

A mixed strategy $q_i$ is a probability distribution over the pure strategy set $S_i$.

Given an opposing strategy $q_{-i}$, a best-response $q_i^*$ is a strategy that maximizes (one-step) reward:

$$q_i^* \in \arg\max_{q_i} r_i(q_i, q_{-i})$$

A Nash equilibrium $q^* = (q_i^*, q_{-i}^*)$ is a strategy profile s.t. $q_i^*$ is a best-response to $q_{-i}^*$, for all players $i$.

All (finite) games have mixed strategy Nash equilibria.
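
For a two-player game, the best-response condition can be checked numerically by comparing each player's expected reward with the best pure-strategy deviation; checking pure deviations suffices because expected reward is linear in a player's own mixed strategy. A minimal sketch, assuming payoff matrices `A` (row player) and `B` (column player); the function name, the tolerance, and the example game are assumptions made for illustration:

```python
def is_nash(A, B, p, q, tol=1e-9):
    """Return True if (p, q) is a Nash equilibrium of the bimatrix game (A, B)."""
    n, m = len(A), len(A[0])
    # Reward of each pure strategy against the opponent's mixed strategy.
    row_pure = [sum(A[i][j] * q[j] for j in range(m)) for i in range(n)]
    col_pure = [sum(p[i] * B[i][j] for i in range(n)) for j in range(m)]
    # Reward actually obtained under the profile (p, q).
    row_value = sum(p[i] * row_pure[i] for i in range(n))
    col_value = sum(q[j] * col_pure[j] for j in range(m))
    # No pure deviation may improve on the current reward.
    return max(row_pure) <= row_value + tol and max(col_pure) <= col_value + tol

# Illustrative 2x2 game in which the row player wants to match and the
# column player wants to mismatch (payoffs assumed for this sketch).
A = [[1, 0], [0, 1]]
B = [[0, 1], [1, 0]]
uniform_is_nash = is_nash(A, B, [0.5, 0.5], [0.5, 0.5])
pure_is_nash = is_nash(A, B, [1.0, 0.0], [1.0, 0.0])
```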

4 No-Regret

Regret is the expected difference in rewards for playing mixed strategy $q_i^t$ rather than pure strategy $s_i$:

$$\rho_i^t(s_i) = r_i(s_i, q_{-i}^t) - r_i(q_i^t, q_{-i}^t)$$

An on-line learning algorithm exhibits no-regret iff

$$\forall s_i : \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \rho_i^t(s_i) \le 0$$

No-Regret Learning Algorithms
Narendra and Thathachar: Learning Automata
Freund and Schapire: Multiplicative Updating
Foster and Vohra: The Mixing Method
Hart and Mas-Colell: Correlated Equilibrium
Hannan, Banos, Megiddo, etc.
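
The regret definition can be evaluated directly on a recorded history of play. A minimal sketch for the row player of a two-player game, assuming the per-round regret is the reward of the pure strategy minus the reward of the mixed strategy actually played, both against the opponent's mixed strategy in that round; the function name and the example histories are hypothetical:

```python
def average_regret(A, row_mixes, col_mixes):
    """Average regret of the row player for each pure strategy.

    A[i][j] is the row player's reward; row_mixes[t] and col_mixes[t] are the
    mixed strategies played in round t.
    """
    n, m = len(A), len(A[0])
    totals = [0.0] * n
    for p, q in zip(row_mixes, col_mixes):
        pure = [sum(A[i][j] * q[j] for j in range(m)) for i in range(n)]
        played = sum(p[i] * pure[i] for i in range(n))
        for i in range(n):
            totals[i] += pure[i] - played  # regret for not having played i
    rounds = len(row_mixes)
    return [t / rounds for t in totals]

# Illustrative reward matrix (assumed): the row player is rewarded for matching.
A = [[1, 0], [0, 1]]
stubborn = average_regret(A, [[1, 0]] * 3, [[0, 1]] * 3)
balanced = average_regret(A, [[0.5, 0.5]] * 3, [[0.5, 0.5]] * 3)
```

In the `stubborn` history the row player keeps choosing the first strategy while the opponent always plays the second, so the average regret for the second strategy stays positive; in the `balanced` history no pure strategy would have done better.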

6 Observation

In an infinitely repeated game, if all players learn via no-regret algorithms, and if $q_i^t \to \bar{q}_i$ for all players $i$, then $\bar{q} = q^*$: i.e., $\bar{q}$ is a mixed strategy Nash equilibrium.

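7 Proof

By assumption (and because $r_i$ is continuous in the mixed strategies),

$$r_i(s_i, q_{-i}^t) \to r_i(s_i, \bar{q}_{-i})$$
$$r_i(q_i^t, q_{-i}^t) \to r_i(\bar{q}_i, \bar{q}_{-i})$$

Therefore, since the running average of a convergent sequence converges to the same limit,

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r_i(s_i, q_{-i}^t) = r_i(s_i, \bar{q}_{-i})$$
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r_i(q_i^t, q_{-i}^t) = r_i(\bar{q}_i, \bar{q}_{-i})$$

By the no-regret property,

$$r_i(s_i, \bar{q}_{-i}) \le r_i(\bar{q}_i, \bar{q}_{-i}) \quad \text{for all } s_i$$
$$\Rightarrow \; r_i(s_i^*, \bar{q}_{-i}) \le r_i(\bar{q}_i, \bar{q}_{-i})$$

where $s_i^* \in \arg\max_{s_i \in S_i} r_i(s_i, \bar{q}_{-i})$. Because some pure strategy always attains the best-response value against $\bar{q}_{-i}$, $\bar{q}_i$ is a best-response to $\bar{q}_{-i}$ for every player $i$: i.e., $\bar{q} = q^*$.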

8 Simulations

Informed Settings [Freund and Schapire 1995]

$$\rho^T_{s_i} = \sum_{t=1}^{T} r_i(s_i, q_{-i}^t)$$
$$q^{T+1}_{s_i} = \frac{(1+\alpha)^{\rho^T_{s_i}}}{\sum_{\bar{s}_i \in S_i} (1+\alpha)^{\rho^T_{\bar{s}_i}}}$$

Naive Settings [Auer, Cesa-Bianchi, Freund, and Schapire 1996]

$$\hat{r}(s_i, q_{-i}^t) = \frac{\mathbf{1}^t_{s_i} \, r(s_i, q_{-i}^t)}{\hat{q}^t_{s_i}}$$
$$\hat{q}^{T+1}_i = (1-\epsilon) \, q^{T+1}_i + \frac{\epsilon}{|S_i|}$$

• $\alpha > 0$ is the learning rate
• $\epsilon > 0$ is the exploration rate
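
The two update rules can be sketched in a few lines. This is one reading of the slide, not the authors' simulation code: in the informed setting each pure strategy is weighted by $(1+\alpha)$ raised to its cumulative reward and the weights are normalized; in the naive setting only the played strategy receives a reward estimate, scaled by the inverse of the probability with which it was played, and the resulting strategy is mixed with the uniform distribution at rate $\epsilon$. All function and parameter names are hypothetical.

```python
def informed_update(cum_reward, alpha):
    """Multiplicative update: q[s] proportional to (1 + alpha) ** cum_reward[s]."""
    weights = [(1.0 + alpha) ** rho for rho in cum_reward]
    total = sum(weights)
    return [w / total for w in weights]

def naive_estimate(played, reward, q_hat):
    """Importance-weighted reward estimate when only the played strategy is observed."""
    return [reward / q_hat[s] if s == played else 0.0 for s in range(len(q_hat))]

def explore(q, eps):
    """Mix q with the uniform distribution so every strategy keeps probability >= eps / n."""
    n = len(q)
    return [(1.0 - eps) * x + eps / n for x in q]

q_informed = informed_update([1.0, 0.0], alpha=1.0)   # weights 2 and 1
r_hat = naive_estimate(played=0, reward=1.0, q_hat=[0.5, 0.5])
q_naive = explore([1.0, 0.0], eps=0.2)
```

For long runs the raw powers in `informed_update` can overflow; subtracting the largest cumulative reward from every exponent before exponentiating leaves the normalized result unchanged.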

10 Pure Strategy Nash Equilibria

Prisoners' Dilemma

  1 \ 2    C      D
  C        4,4    0,5
  D        5,0    1,1

Battle of the Sexes

  1 \ 2    B      F
  B        2,1    0,0
  F        0,0    1,2
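
Pure strategy equilibria of small games like these can be found by brute force: a cell is an equilibrium when neither player gains by switching unilaterally. A minimal sketch, assuming the payoff entries that are blank in the extracted slide text are zeros; the function name and the matrix encoding are illustrative:

```python
def pure_nash(A, B):
    """Return all (row, column) index pairs that are pure strategy Nash equilibria."""
    n, m = len(A), len(A[0])
    equilibria = []
    for i in range(n):
        for j in range(m):
            row_ok = all(A[i][j] >= A[k][j] for k in range(n))  # row cannot improve
            col_ok = all(B[i][j] >= B[i][l] for l in range(m))  # column cannot improve
            if row_ok and col_ok:
                equilibria.append((i, j))
    return equilibria

# Prisoners' Dilemma, strategies ordered (C, D); blank entries assumed to be 0.
PD_A = [[4, 0], [5, 1]]
PD_B = [[4, 5], [0, 1]]
# Battle of the Sexes, strategies ordered (B, F); blank entries assumed to be 0.
BOS_A = [[2, 0], [0, 1]]
BOS_B = [[1, 0], [0, 2]]

pd_equilibria = pure_nash(PD_A, PD_B)     # index 1 is D
bos_equilibria = pure_nash(BOS_A, BOS_B)  # index 0 is B, index 1 is F
```

Under these assumed payoffs the Prisoners' Dilemma has the single equilibrium (D, D), while Battle of the Sexes has two, (B, B) and (F, F).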

11 Prisoners' Dilemma
[Figure: two panels, Informed Setting and Naive Setting; each plots Weights against Time.]

12 Battle of the Sexes
[Figure: two panels, Informed Setting and Naive Setting; each plots Weights against Time.]

13 Mixed Strategy Nash Equilibria

Matching Pennies

  1 \ 2    H      T
  H        1,0    0,1
  T        0,1    1,0

Shapley Game

  1 \ 2    L      C      R
  T        1,0    0,1    0,0
  M        0,0    1,0    0,1
  B        0,1    0,0    1,0
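
In both games the uniform mixed strategy is an equilibrium, which can be seen from an indifference check: against a uniformly mixing opponent, every pure strategy earns the same expected reward, so no deviation helps. A minimal sketch for the Shapley game using exact fractions, assuming the payoff entries that are blank in the extracted slide text are zeros; the helper name is hypothetical:

```python
from fractions import Fraction

def payoffs_vs_uniform(A, B):
    """Expected reward of each pure strategy against a uniformly mixing opponent."""
    n, m = len(A), len(A[0])
    row = [sum(Fraction(A[i][j], m) for j in range(m)) for i in range(n)]
    col = [sum(Fraction(B[i][j], n) for i in range(n)) for j in range(m)]
    return row, col

# Shapley game, rows (T, M, B) and columns (L, C, R); blank entries assumed to be 0.
SHAPLEY_A = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
SHAPLEY_B = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]

row_payoffs, col_payoffs = payoffs_vs_uniform(SHAPLEY_A, SHAPLEY_B)
```

Every entry of `row_payoffs` and `col_payoffs` equals 1/3 under these assumed payoffs, so each player is indifferent among all pure strategies and uniform mixing is a best response for both.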

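14 Matching Pennies
[Figure: two panels. Mixed Strategies plots Weights against Time; Empirical Frequencies plots Frequencies against Time.]

15 Shapley Game
[Figure: two panels. Mixed Strategies plots Weights against Time; Empirical Frequencies plots Frequencies against Time.]

16 Shapley Game
[Figure: two panels. Smoothed Mixed Strategies plots Weights against Time; Smoothed Empirical Frequencies plots Frequencies against Time.]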

17 Summary of Simulations

Theoretical
• If no-regret learning converges, it converges to Nash equilibrium.

Experimental
• No-regret learning converges to pure strategy Nash equilibria in games for which such equilibria exist.
• No-regret learning does not converge otherwise.
  - Empirical frequencies converge to Nash in zero-sum games.
  - Empirical frequencies converge to Nash in non-zero sum games of 2 strategies.
  - Smoothed empirical frequencies converge to Nash in games of > 2 strategies.
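
The first experimental finding can be reproduced in a deterministic toy version of the informed setting, where each player observes the expected reward of every pure strategy and applies the multiplicative update. This is a rough sketch under assumed Prisoners' Dilemma payoffs (blank entries in the extracted slide text read as zeros), not the simulation behind the slides; the function names, the learning rate, and the round count are arbitrary choices.

```python
def normalize(cum, alpha):
    """Multiplicative weights from cumulative rewards, shifted by the max for stability."""
    top = max(cum)
    weights = [(1.0 + alpha) ** (c - top) for c in cum]
    total = sum(weights)
    return [w / total for w in weights]

def mw_self_play(A, B, alpha=0.1, rounds=200):
    """Both players run the informed multiplicative update against each other."""
    n, m = len(A), len(A[0])
    cum_row, cum_col = [0.0] * n, [0.0] * m
    p, q = [1.0 / n] * n, [1.0 / m] * m
    for _ in range(rounds):
        # Accumulate the expected reward of each pure strategy against the
        # opponent's current mixed strategy.
        for i in range(n):
            cum_row[i] += sum(A[i][j] * q[j] for j in range(m))
        for j in range(m):
            cum_col[j] += sum(p[i] * B[i][j] for i in range(n))
        p, q = normalize(cum_row, alpha), normalize(cum_col, alpha)
    return p, q

# Prisoners' Dilemma, strategies ordered (C, D); blank entries assumed to be 0.
PD_A = [[4, 0], [5, 1]]
PD_B = [[4, 5], [0, 1]]
p_final, q_final = mw_self_play(PD_A, PD_B)
```

With these payoffs D earns exactly one unit more than C in every round regardless of the opponent's mix, so the weight on D approaches 1 for both players, matching the pure strategy equilibrium (D, D).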

18 No-Regret and Fictitious Play
[Figure: two panels, Prisoners' Dilemma and Battle of the Sexes; axes labeled % Fictitious Play and Iterations.]

19 No-Regret and Fictitious Play
[Figure: two Matching Pennies panels; axes labeled % Fictitious Play and Iterations.]

20 No-Regret and Fictitious Play
[Figure: two Shapley Game panels; axes labeled % Fictitious Play and Iterations.]
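
For reference, plain fictitious play has each player best-respond to the empirical frequencies of the opponent's past actions. The sketch below implements only that textbook rule; how the slides combine it with no-regret learning (the "% Fictitious Play" axis) is not recoverable from the text, so this is not a reconstruction of the plotted experiments. The function name, the tie-breaking rule, the initial profile, and the Prisoners' Dilemma payoffs (blank entries read as zeros) are assumptions.

```python
def fictitious_play(A, B, rounds, first=(0, 0)):
    """Run two-player fictitious play and return the list of pure profiles played."""
    n, m = len(A), len(A[0])
    row_counts, col_counts = [0] * n, [0] * m
    i, j = first
    history = []
    for _ in range(rounds):
        history.append((i, j))
        row_counts[i] += 1
        col_counts[j] += 1
        # Best response to the opponent's empirical action counts
        # (ties broken in favor of the lowest index).
        next_i = max(range(n), key=lambda a: sum(A[a][b] * col_counts[b] for b in range(m)))
        next_j = max(range(m), key=lambda b: sum(B[a][b] * row_counts[a] for a in range(n)))
        i, j = next_i, next_j
    return history

# Prisoners' Dilemma, strategies ordered (C, D); blank entries assumed to be 0.
PD_A = [[4, 0], [5, 1]]
PD_B = [[4, 5], [0, 1]]
history = fictitious_play(PD_A, PD_B, rounds=10)
```

Starting from (C, C), both players switch to D after the first round and stay there, since D is a best response to any belief about the opponent.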

22 Papers
• Shopbots and Pricebots [w/ Kephart]
• Santa Fe Bar Problem [w/ Mishra & Parikh]
• Learning in Networks [w/ Friedman & Shenker]
