Game-Theoretic Learning: Regret Minimization vs. Utility Maximization


1 Game-Theoretic Learning: Regret Minimization vs. Utility Maximization
Amy Greenwald, with David Gondek, Amir Jafari, and Casey Marks
Brown University / University of Pennsylvania
November 17, 2004

2 Background
No-external-regret learning converges to the set of minimax equilibria in zero-sum games. [e.g., Freund and Schapire 1996]
No-internal-regret learning converges to the set of correlated equilibria in general-sum games. [e.g., Foster and Vohra 1997]

3 Foreground
1. Definitions: a continuum of no-regret properties, called no-φ-regret, and a continuum of game-theoretic equilibria, called Φ-equilibria.
2. Existence Theorem (constructive proof): no-Φ-regret learning algorithms exist, for all Φ.
3. Convergence Theorem: no-Φ-regret learning converges to the set of Φ-equilibria, for all Φ.
4. Surprising Result: no-internal-regret is the strongest form of no-φ-regret learning. Therefore, no no-φ-regret algorithm learns Nash equilibria.

4 Outline
Game Theory
Single Agent Learning Model
Multiagent Learning & Game-Theoretic Equilibria

5 Game Theory: A Crash Course
1. General-Sum Games: Nash Equilibrium, Correlated Equilibrium
2. Zero-Sum Games: Minimax Equilibrium

6 An Example: Prisoners' Dilemma

         C      D
  C     4,4    0,5
  D     5,0    1,1

C: Cooperate, D: Defect

7-8 One-Shot Games
A one-shot game is a 3-tuple Γ = (I, (A_i, r_i)_{i∈I}), where
- I is a set of players
- for all players i ∈ I, a set of pure actions A_i, with a_i ∈ A_i
- a reward function r_i : A → ℝ, where A = ∏_{i∈I} A_i, with a ∈ A

The players can also employ randomized, or mixed, actions:
- for all players i ∈ I, a set of mixed actions Q_i = {q_i ∈ ℝ^{|A_i|} : Σ_j q_{ij} = 1 and q_{ij} ≥ 0, ∀j}, with q_i ∈ Q_i
- an expected reward function r_i : Q → ℝ, where Q = ∏_{i∈I} Q_i, with q ∈ Q, such that r_i(q) = Σ_{a∈A} q(a) r_i(a)
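To make the expected-reward formula concrete, here is a minimal Python sketch, assuming a two-player game encoded as a numpy payoff array and a product distribution; the names are illustrative, not from the talk:

```python
import numpy as np

# Payoff array for player i in a two-player game: R[a1, a2] = r_i(a1, a2).
# Example: the row player's rewards in the Prisoners' Dilemma above.
R1 = np.array([[4.0, 0.0],
               [5.0, 1.0]])

def expected_reward(R, q1, q2):
    """r_i(q) = sum over joint pure actions a of q(a) * r_i(a),
    for a product distribution q = q1 x q2."""
    return q1 @ R @ q2

q1 = np.array([0.5, 0.5])   # row player mixes uniformly over C, D
q2 = np.array([0.5, 0.5])   # column player mixes uniformly
print(expected_reward(R1, q1, q2))  # (4 + 0 + 5 + 1) / 4 = 2.5
```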

9 Nash Equilibrium
Notation: Write a = (a_i, a_{-i}) ∈ A for a_i ∈ A_i and a_{-i} ∈ A_{-i} = ∏_{j≠i} A_j. Write q = (q_i, q_{-i}) ∈ Q for q_i ∈ Q_i and q_{-i} ∈ Q_{-i} = ∏_{j≠i} Q_j.

Definition: A Nash equilibrium is a mixed action profile q* such that r_i(q*) ≥ r_i(q_i, q*_{-i}), for all players i and for all mixed actions q_i ∈ Q_i.

Theorem [Nash 51]: Every finite strategic form game has a mixed strategy Nash equilibrium.

10-11 Correlated Equilibrium
Chicken:
         L      R
  T     6,6    2,7
  B     7,2    0,0

A correlated equilibrium (maximizing the sum of rewards):
         L      R
  T     1/2    1/4
  B     1/4     0

This CE solves the linear program:
maximize 12π_TL + 9π_TR + 9π_BL + 0π_BR
subject to
  π_TL + π_TR + π_BL + π_BR = 1
  π_TL, π_TR, π_BL, π_BR ≥ 0
  6π_TL + 2π_TR ≥ 7π_TL + 0π_TR   (row player, told T)
  7π_BL + 0π_BR ≥ 6π_BL + 2π_BR   (row player, told B)
  6π_TL + 2π_BL ≥ 7π_TL + 0π_BL   (column player, told L)
  7π_TR + 0π_BR ≥ 6π_TR + 2π_BR   (column player, told R)
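As a sanity check, the LP above can be solved directly. A minimal sketch with scipy; the variable ordering π = (π_TL, π_TR, π_BL, π_BR) is my own encoding, not notation from the slides:

```python
import numpy as np
from scipy.optimize import linprog

# Variable order: pi = (pi_TL, pi_TR, pi_BL, pi_BR).
# Objective: maximize 12 pi_TL + 9 pi_TR + 9 pi_BL + 0 pi_BR,
# i.e., minimize the negated coefficient vector.
c = -np.array([12.0, 9.0, 9.0, 0.0])

# Incentive constraints rewritten in A_ub @ pi <= 0 form:
#   told T:  7 pi_TL <= 6 pi_TL + 2 pi_TR   ->   pi_TL - 2 pi_TR <= 0
#   told B:  6 pi_BL + 2 pi_BR <= 7 pi_BL   ->  -pi_BL + 2 pi_BR <= 0
#   told L:  7 pi_TL <= 6 pi_TL + 2 pi_BL   ->   pi_TL - 2 pi_BL <= 0
#   told R:  6 pi_TR + 2 pi_BR <= 7 pi_TR   ->  -pi_TR + 2 pi_BR <= 0
A_ub = np.array([[ 1.0, -2.0,  0.0,  0.0],
                 [ 0.0,  0.0, -1.0,  2.0],
                 [ 1.0,  0.0, -2.0,  0.0],
                 [ 0.0, -1.0,  0.0,  2.0]])
b_ub = np.zeros(4)

# Probabilities sum to one.
A_eq = np.ones((1, 4))
b_eq = np.array([1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * 4)
print(res.x)  # expected: [0.5, 0.25, 0.25, 0.0], objective value 10.5
```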

12 Correlated Equilibrium
Definition: A mixed action profile q ∈ Q is a correlated equilibrium iff for all players i and for all pure actions j, k ∈ A_i,

  Σ_{a_{-i} ∈ A_{-i}} q(j, a_{-i}) (r_i(j, a_{-i}) − r_i(k, a_{-i})) ≥ 0    (1)

Observe:
- Every Nash equilibrium is a correlated equilibrium.
- Every finite normal form game has a correlated equilibrium.
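Condition (1) is easy to check mechanically. A small sketch for two-player games; the joint-distribution array Q and payoff arrays R1, R2 are my own illustrative encodings:

```python
import numpy as np

def is_correlated_eq(Q, R1, R2, tol=1e-9):
    """Check condition (1) for both players of a two-player game.
    Q[a1, a2] : joint action distribution
    R1, R2    : payoff arrays for players 1 and 2."""
    n1, n2 = Q.shape
    for j in range(n1):            # recommended action for player 1
        for k in range(n1):        # candidate deviation
            if np.sum(Q[j, :] * (R1[j, :] - R1[k, :])) < -tol:
                return False
    for j in range(n2):            # recommended action for player 2
        for k in range(n2):
            if np.sum(Q[:, j] * (R2[:, j] - R2[:, k])) < -tol:
                return False
    return True

# The Chicken CE from the previous slide:
Q  = np.array([[0.5, 0.25], [0.25, 0.0]])
R1 = np.array([[6.0, 2.0], [7.0, 0.0]])
R2 = np.array([[6.0, 7.0], [2.0, 0.0]])
print(is_correlated_eq(Q, R1, R2))  # True
```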

13 Zero-Sum Games
Matching Pennies:
         H        T
  H    1,-1    -1,1
  T   -1,1     1,-1

Rock-Paper-Scissors:
         R        P        S
  R     0,0    -1,1     1,-1
  P    1,-1     0,0    -1,1
  S   -1,1     1,-1     0,0

Zero-sum: Σ_{i∈I} r_i(a) = 0, for all a ∈ A
Constant-sum: Σ_{i∈I} r_i(a) = c, for all a ∈ A, for some c ∈ ℝ

14 Minimax Equilibrium
Example (entries are rewards to the row player):
         L   R
  T      1   2
  B      4   3

Definition: A mixed action profile (q₁*, q₂*) ∈ Q is a minimax equilibrium in a two-player, zero-sum game iff
  r₁(q₁*, q₂*) ≥ r₁(j, q₂*), ∀j ∈ A₁
  l₂(q₁*, q₂*) ≤ l₂(q₁*, k), ∀k ∈ A₂
where l₂ denotes player 2's loss.
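The row player's minimax strategy can be computed by linear programming. A minimal sketch with scipy; the encoding of the game value v as an extra variable is the standard LP trick, not something from the slides:

```python
import numpy as np
from scipy.optimize import linprog

# Row-player payoff matrix of the example game above.
M = np.array([[1.0, 2.0],
              [4.0, 3.0]])

def minimax_row_strategy(M):
    """maximize v  s.t.  (q^T M)_k >= v for every column k, q a distribution.
    Variables: x = (q_1, ..., q_m, v); linprog minimizes, so use -v."""
    m, n = M.shape
    c = np.zeros(m + 1); c[-1] = -1.0            # minimize -v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])    # v - (q^T M)_k <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]       # v is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds)
    return res.x[:m], res.x[-1]

q, v = minimax_row_strategy(M)
print(q, v)  # here B dominates T, so q = (0, 1) and the value is 3
```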

15 Single Agent Learning Model
- a set of actions N = {1, ..., n}
- for all times t:
  - a mixed action vector q^t ∈ Q = {q ∈ ℝ^n : Σ_i q_i = 1 and q_i ≥ 0, ∀i}
  - a pure action vector a^t = e_i, for some pure action i
  - a reward vector r^t = (r_1, ..., r_n) ∈ [0,1]^n

A learning algorithm A is a sequence of functions q^t : History^{t-1} → Q, where a History is a sequence of action-reward pairs (a^1, r^1), (a^2, r^2), ...

16 Transformations
Mixed Transformations:
  Φ_LINEAR = {φ : Q → Q linear} = the set of all linear transformations = the set of all row stochastic matrices
  Φ_SWAP = {φ : Q → Q : φ deterministic} ⊆ Φ_LINEAR

Pure Transformations:
  F_SWAP = {F : N → N} = the set of all pure transformations

17 Isomorphism
The operation of elements of F_SWAP on N = the operation of elements of Φ_SWAP on Q:

  φ_ij = δ_{F(i)=j}    (2)
  e_k φ = e_{F(k)}, ∀k    (3)

Example: If n = 4 and F = {1→2, 2→3, 3→4, 4→1}, then

  φ = [ 0 1 0 0
        0 0 1 0
        0 0 0 1
        1 0 0 0 ]

Thus, ⟨q₁, q₂, q₃, q₄⟩φ = ⟨q₄, q₁, q₂, q₃⟩, for all q = ⟨q₁, q₂, q₃, q₄⟩ ∈ Q.
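A quick sketch of the isomorphism in code: build φ from F via equation (2), then check equation (3) and the rotation example. Indices are 0-based in code, an implementation choice of mine:

```python
import numpy as np

def swap_matrix(F, n):
    """phi_ij = 1 if F(i) = j else 0, per equation (2).
    F maps {0, ..., n-1} to itself."""
    phi = np.zeros((n, n))
    for i in range(n):
        phi[i, F[i]] = 1.0
    return phi

n = 4
F = [1, 2, 3, 0]            # the rotation 1->2, 2->3, 3->4, 4->1, 0-indexed
phi = swap_matrix(F, n)

# Equation (3): e_k phi = e_{F(k)}
for k in range(n):
    assert np.array_equal(np.eye(n)[k] @ phi, np.eye(n)[F[k]])

q = np.array([0.1, 0.2, 0.3, 0.4])
print(q @ phi)              # [0.4, 0.1, 0.2, 0.3] = <q4, q1, q2, q3>
```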

18 External Regret Matrices
  F_EXT = {F_j ∈ F_SWAP : j ∈ N}, where F_j(k) = j
  Φ_EXT = {φ_j ∈ Φ_SWAP : j ∈ N}, where e_k φ_j = e_j

Example: If n = 4, then

  φ₂ = [ 0 1 0 0
         0 1 0 0
         0 1 0 0
         0 1 0 0 ]

Thus, ⟨q₁, q₂, q₃, q₄⟩φ₂ = ⟨0, 1, 0, 0⟩, for all q ∈ Q.

19 Internal Regret Matrices
  F_INT = {F_ij ∈ F_SWAP : i, j ∈ N}, where F_ij(k) = j if k = i, and k otherwise
  Φ_INT = {φ_ij ∈ Φ_SWAP : i, j ∈ N}, where e_k φ_ij = e_j if k = i, and e_k otherwise

Example: If n = 4, then

  φ₂₃ = [ 1 0 0 0
          0 0 1 0
          0 0 1 0
          0 0 0 1 ]

Thus, ⟨q₁, q₂, q₃, q₄⟩φ₂₃ = ⟨q₁, 0, q₂ + q₃, q₄⟩, for all q ∈ Q.
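Both families are one-liners on top of the swap-matrix idea above. A sketch, again with 0-based indices:

```python
import numpy as np

def external_matrix(j, n):
    """phi_j: every row is e_j ("always play j instead")."""
    phi = np.zeros((n, n))
    phi[:, j] = 1.0
    return phi

def internal_matrix(i, j, n):
    """phi_ij: the identity, except action i is replaced by action j."""
    phi = np.eye(n)
    phi[i, i] = 0.0
    phi[i, j] = 1.0
    return phi

q = np.array([0.1, 0.2, 0.3, 0.4])
print(q @ external_matrix(1, 4))     # [0, 1, 0, 0]
print(q @ internal_matrix(1, 2, 4))  # [0.1, 0, 0.5, 0.4] = <q1, 0, q2+q3, q4>
```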

20 Regret Vector ρ ∈ ℝ^Φ

Observed Regret Vector: ρ_φ(r, a) = r·(aφ) − r·a
Expected Regret Vector: ρ̂_φ(r, q) = E[ρ_φ(r, a) | a ∼ q] = ρ_φ(r, E[a | a ∼ q]) = r·(qφ) − r·q

No Observed Φ-Regret: limsup_{t→∞} (1/t) Σ_{τ=1}^t ρ_φ(r^τ, a^τ) ≤ 0, for all φ ∈ Φ, a.s.
No Expected Φ-Regret: limsup_{t→∞} (1/t) Σ_{τ=1}^t ρ̂_φ(r^τ, q^τ) ≤ 0, for all φ ∈ Φ
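In code, a single regret coordinate is one dot product. A sketch reusing the swap matrices built above:

```python
import numpy as np

def regret(r, x, phi):
    """rho_phi(r, x) = r . (x phi) - r . x.
    With x a pure action e_i this is the observed regret;
    with x a mixed action q it is the expected regret."""
    return r @ (x @ phi) - r @ x

r = np.array([0.0, 1.0, 0.2])     # rewards for 3 actions
e0 = np.array([1.0, 0.0, 0.0])    # we played action 0
phi_1 = np.zeros((3, 3))          # external transformation "always play 1"
phi_1[:, 1] = 1.0
print(regret(r, e0, phi_1))       # 1.0 - 0.0 = 1.0: we regret not playing 1
```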

21 Approachability
A set U ⊆ V is said to be approachable iff there exists a learning algorithm A = (q^1, q^2, ...) s.t. for any sequence of rewards r^1, r^2, ...,

  lim_{t→∞} d(U, ρ̄^t) = lim_{t→∞} inf_{u∈U} d(u, ρ̄^t) = 0, a.s.,

where ρ̄^t denotes the average value of ρ through time t.

A no-Φ-regret learning algorithm is one whose observed regret approaches the negative orthant ℝ₋^Φ.

22 Blackwell's Theorem
The negative orthant ℝ₋^Φ is approachable iff there exists a learning algorithm A = (q^1, q^2, ...) s.t. for any sequence of rewards r^1, r^2, ...,

  ρ(r^{t+1}, q^{t+1}) · (ρ̄^t)^+ ≤ 0    (4)

for all times t, where x^+ = max{x, 0}, componentwise. Moreover, this procedure can be used to approach the negative orthant ℝ₋^Φ: if ρ̄^t ∈ ℝ₋^Φ, play arbitrarily; if ρ̄^t ∈ V \ ℝ₋^Φ, play according to A.

23 Regret Matching Algorithm
Given Φ and Y ∈ ℝ₊^Φ:
- If Σ_{φ∈Φ} Y_φ = 0, play arbitrarily.
- If Σ_{φ∈Φ} Y_φ > 0, define the stochastic matrix

    A = A(Φ, Y) = (Σ_{φ∈Φ} φ Y_φ) / (Σ_{φ∈Φ} Y_φ)    (5)

  and play a mixed strategy q such that q = qA.

Regret Matching Theorem: Regret matching satisfies the generalized Blackwell condition: ρ(r, q) · Y = 0.
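The fixed point q = qA is a stationary distribution of the row-stochastic matrix A, so it can be found by solving a small linear system. A sketch, assuming the φ's are given as row-stochastic numpy arrays as built above:

```python
import numpy as np

def regret_matching_strategy(phis, Y):
    """Equation (5): A = (sum_phi Y_phi * phi) / (sum_phi Y_phi), then return
    a fixed point q = qA, i.e., a stationary distribution of A."""
    n = phis[0].shape[0]
    total = float(np.sum(Y))
    if total <= 0.0:
        return np.full(n, 1.0 / n)   # "play arbitrarily"; uniform is one choice
    A = sum(y * phi for y, phi in zip(Y, phis)) / total
    # q = qA  <=>  (A - I)^T q^T = 0, together with sum(q) = 1.
    M = np.vstack([(A - np.eye(n)).T, np.ones((1, n))])
    b = np.zeros(n + 1); b[-1] = 1.0
    q, *_ = np.linalg.lstsq(M, b, rcond=None)
    q = np.clip(q, 0.0, None)        # guard against tiny numerical negatives
    return q / q.sum()
```

A stationary distribution always exists because A is row stochastic; lstsq returns one of them.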

24 Proof
ρ(r, q) · Y
  = Σ_{φ∈Φ} ρ_φ(r, q) Y_φ    (6)
  = Σ_{φ∈Φ} (r·(qφ) − r·q) Y_φ    (7)
  = Σ_{φ∈Φ} r·(qφY_φ − qY_φ)    (8)
  = r·( q (Σ_{φ∈Φ} φY_φ) − q (Σ_{φ∈Φ} Y_φ) )    (9)
  = (Σ_{φ∈Φ} Y_φ) r·( q (Σ_{φ∈Φ} φY_φ)/(Σ_{φ∈Φ} Y_φ) − q )    (10)
  = (Σ_{φ∈Φ} Y_φ) r·(qA − q)    (11)
  = (Σ_{φ∈Φ} Y_φ) r·(q − q)    (12)
  = 0    (13)

where (12) holds because q is a fixed point of A.

25 Generic Regret Matching Algorithm (Φ, g)
for t = 1, ..., T:
1. play mixed strategy q^t
2. realize pure action i
3. observe rewards r^t
4. for all φ ∈ Φ, compute the instantaneous regret,
     observed: ρ^t_φ = ρ_φ(r^t, e_i) = r^t·(e_i φ) − r^t·e_i
     expected: ρ^t_φ = ρ_φ(r^t, q^t) = r^t·(q^t φ) − r^t·q^t
   and update the cumulative regret vector X^t_φ = X^{t-1}_φ + ρ^t_φ
5. compute Y = g(X^t)
6. compute A = (Σ_{φ∈Φ} φ Y_φ) / (Σ_{φ∈Φ} Y_φ)
7. solve for the fixed point q^{t+1} = q^{t+1} A
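Putting the pieces together, here is a compact self-play sketch of the generic algorithm with Φ = Φ_INT and the polynomial link function g_k(X) = X_k^+. The game (Rock-Paper-Scissors rescaled into [0,1]), the horizon, and the use of the expected-regret variant are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rock-Paper-Scissors with rewards rescaled from {-1, 0, 1} into [0, 1].
R1 = np.array([[0.5, 0.0, 1.0],
               [1.0, 0.5, 0.0],
               [0.0, 1.0, 0.5]])
n = 3

# Phi_INT: one swap matrix per ordered pair i != j.
phis = []
for i in range(n):
    for j in range(n):
        if i != j:
            phi = np.eye(n)
            phi[i, i], phi[i, j] = 0.0, 1.0
            phis.append(phi)

def fixed_point(A):
    """Solve q = qA and normalize to a distribution."""
    M = np.vstack([(A - np.eye(n)).T, np.ones((1, n))])
    b = np.zeros(n + 1); b[-1] = 1.0
    q, *_ = np.linalg.lstsq(M, b, rcond=None)
    q = np.clip(q, 0.0, None)
    return q / q.sum()

X = [np.zeros(len(phis)), np.zeros(len(phis))]  # cumulative regrets
q = [np.full(n, 1/3), np.full(n, 1/3)]          # current mixed strategies
counts = np.zeros((n, n))

for t in range(20000):
    a = [rng.choice(n, p=q[0]), rng.choice(n, p=q[1])]   # steps 1-2
    counts[a[0], a[1]] += 1
    r = [R1[:, a[1]], 1.0 - R1[a[0], :]]                 # step 3 (constant-sum)
    for p in range(2):                                   # steps 4-7 per player
        for k, phi in enumerate(phis):
            X[p][k] += r[p] @ (q[p] @ phi) - r[p] @ q[p]
        Y = np.maximum(X[p], 0.0)                        # g_k(X) = X_k^+
        if Y.sum() > 0:
            q[p] = fixed_point(sum(y * phi for y, phi in zip(Y, phis)) / Y.sum())

print(counts / counts.sum())  # joint empirical play approaches the set of CE
```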

26 Special Cases of Regret Matching
Foster and Vohra 97 (Φ_INT), Hart and Mas-Colell 00 (Φ_EXT):
  choose G(X) = (1/2) Σ_k (X_k^+)², so that g_k(X) = X_k^+
Freund and Schapire 95 (Φ_EXT), Cesa-Bianchi and Lugosi 03 (Φ_INT):
  choose G(X) = (1/η) ln(Σ_k e^{ηX_k}), so that g_k(X) = e^{ηX_k} / Σ_k e^{ηX_k}
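The two link functions differ by one line in code. A sketch, where η is a tunable parameter and the max-shift is a standard numerical-stability trick of mine, not from the slides:

```python
import numpy as np

def g_polynomial(X):
    """g_k(X) = X_k^+ (the Foster-Vohra / Hart-Mas-Colell style weights)."""
    return np.maximum(X, 0.0)

def g_exponential(X, eta=0.1):
    """g_k(X) = exp(eta X_k) / sum_k exp(eta X_k)
    (the Freund-Schapire / Cesa-Bianchi-Lugosi style weights)."""
    w = np.exp(eta * (X - X.max()))   # shift by max(X) for stability
    return w / w.sum()
```

Either can be plugged in as g in step 5 of the generic algorithm above; since A in step 6 normalizes by Σ_φ Y_φ, only the ratios of the weights matter.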

27 Multiagent Model
- a set of players I (i ∈ I)
- for all players i:
  - a set of pure actions A_i, with a_i ∈ A_i
  - a set of mixed actions Q_i, with q_i ∈ Q_i
  - a reward function r_i : A → [0,1], where A = ∏_i A_i, with a ∈ A
  - an expected reward function r_i : Q → [0,1], where Q = ∏_i Q_i, with q ∈ Q, s.t. r_i(q) = Σ_{a∈A} q(a) r_i(a)
  - a set of transformations Φ_i (φ_i ∈ Φ_i)

28 Φ-Equilibrium
A mixed action profile q ∈ Q is a Φ-equilibrium iff r_i(φ_i(q)) ≤ r_i(q), for all players i and for all φ_i ∈ Φ_i.

Examples:
- Correlated Equilibrium: Φ_i = Φ_INT, for all players i
- Generalized Minimax Equilibrium: Φ_i = Φ_EXT, for all players i
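A generic check is possible for two players and finite Φ_i sets. In the sketch below, which is my interpretation rather than code from the talk, φ_i(q) is realized by applying the swap matrix to player i's coordinate of the joint distribution; with Φ_i = Φ_INT this reduces exactly to condition (1):

```python
import numpy as np

def is_phi_equilibrium(Q, R1, R2, Phi1, Phi2, tol=1e-9):
    """Q: joint distribution over A1 x A2; R1, R2: payoff arrays;
    Phi1, Phi2: lists of swap matrices, one set per player.
    Checks r_i(phi_i(q)) <= r_i(q) for all players i and phi_i."""
    v1, v2 = np.sum(Q * R1), np.sum(Q * R2)
    for phi in Phi1:
        if np.sum((phi.T @ Q) * R1) > v1 + tol:   # remap player 1's actions
            return False
    for phi in Phi2:
        if np.sum((Q @ phi) * R2) > v2 + tol:     # remap player 2's actions
            return False
    return True
```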

29 Convergence Theorem
If all players i play via some no-Φ_i-regret algorithm, then the joint empirical distribution of play converges to the set of Φ-equilibria, almost surely.

Proof: For all players i and for all φ_i ∈ Φ_i, almost surely,

  limsup_{t→∞} [ r_i(φ_i(z^t)) − r_i(z^t) ]    (14)
    = limsup_{t→∞} [ (1/t) Σ_{τ=1}^t r_i(φ_i(a_i^τ), a_{-i}^τ) − (1/t) Σ_{τ=1}^t r_i(a_i^τ, a_{-i}^τ) ]    (15)
    ≤ 0    (16)

where z^t denotes the joint empirical distribution of play through time t.

30 Zero-Sum Games
Matching Pennies:
         H        T
  H    1,-1    -1,1
  T   -1,1     1,-1

Rock-Paper-Scissors:
         R        P        S
  R     0,0    -1,1     1,-1
  P    1,-1     0,0    -1,1
  S   -1,1     1,-1     0,0

31 Matching Pennies
[Figure: NER exponential [η = ]: Matching Pennies. Two panels vs. time: each player's weights on H and T (left) and the empirical frequencies of H and T (right).]

32 Rock-Paper-Scissors
[Figure: NER polynomial [p = 2]: Rock, Paper, Scissors. Two panels vs. time: each player's weights on R, P, and S (left) and the empirical frequencies (right).]

33 General-Sum Games
Shapley Game:
         L      C      R
  T    0,0    1,0    0,1
  M    0,1    0,0    1,0
  B    1,0    0,1    0,0

Correlated Equilibrium:
         L      C      R
  T     0     1/6    1/6
  M    1/6     0     1/6
  B    1/6    1/6     0

34 Shapley Game: No-Internal-Regret Learning Frequencies
[Figure: empirical frequencies of each player's actions (T, M, B and L, C, R) vs. time under NIR learning, polynomial [p = 2] (left) and exponential [η = ] (right).]

35 Shapley Game: No-Internal-Regret Learning Joint Frequencies
[Figure: empirical frequencies of the nine joint actions (T,L) through (B,R) vs. time under NIR learning, polynomial [p = 2] (left) and exponential [η = ] (right).]

36 Shapley Game: No-External-Regret Learning Frequencies
[Figure: empirical frequencies of each player's actions (T, M, B and L, C, R) vs. time under NER learning, polynomial [p = 2] (left) and exponential [η = ] (right).]

37 Summary
- No-external- and no-internal-regret can be defined along one continuum, no-φ-regret.
- No-Φ-regret learning algorithms exist, for all Φ.
- No-Φ-regret learning converges to the set of Φ-equilibria, for all Φ.
- No-internal-regret learning is the strongest form of no-φ-regret learning. Therefore, Nash equilibrium cannot be learned via no-φ-regret learning.

38 "A little rationality goes a long way" [Hart 03]
Regret Minimization vs. Utility Maximization:
- RM is easy to implement.
- RM justifies randomness in actions.
- Can RM be used to explain human behavior?
