Advanced Machine Learning

Size: px

Start display at page:

Download "Advanced Machine Learning"

Scott Grant
5 years ago
Views:

1 Advanced Machine Learning Learning and Games MEHRYAR MOHRI COURANT INSTITUTE & GOOGLE RESEARCH.

2 Outline Normal form games Nash equilibrium von Neumann s minimax theorem Correlated equilibrium Internal regret page 2

3 Normal Form Games: Example Rock-Paper-Scissors. R P S R 0,0-1,1 1,-1 P 1,-1 0,0-1,1 S -1,1 1,-1 0,0 page 3

4 Be Truly Random page 4

5 Normal Form Games p players. k 2 [1,p] For each player : A k set of actions (or pure strategies). payoff function u k : Q p k=1 A k! R. Goal of each player: maximize his payoff in a repeated game. page 5

6 Prisoner s Dilemma Silence/Betrayal. for each player, the best action is B, regardless of the other player s action. but, with (B, B), both are worse off than (S, S). S B S 2,2 0,3 B 3,0 1,1 page 6

7 Matching Pennies Player A wins when pennies match, player B otherwise. other versions: penalty kick. no pure strategy Nash equilibrium. H T H 1,-1-1,1 T -1,1 1,-1 page 7

8 Battle of The Sexes Opera/Football. two pure strategy Nash equilibria. O F O 3,2 0,0 F 0,0 2,3 page 8

9 Mixed Strategies Strategies: pure strategies: elements of. mixed strategies: elements of. Payoff: for each player strategies (p 1,...,p p ), E [u k (a)] = a j p j k 2 [1,p] X a=(a 1,...,a p ) Q p k=1 A k Q p k=1 1(A k ), when players play mixed p 1 (a 1 ) p p (a p )u k (a). page 9

10 Nash Equilibrium Definition: a mixed strategy is a (mixed) Nash equilibrium if for all and, if for all, is a pure strategy, then is said to be a pure Nash equilibrium. (p 1,...,p p ) k 2 [1,p] q k 2 1 (A k ) u k (q k, p k ) apple u k (p k, p k ). p k k (p 1,...,p p ) page 10

11 Nash Equilibrium: Examples Prisoner s dilemma: (B, B) is a pure Nash equilibrium. Dominant strategy: both better off playing B regardless of the other player s action. Matching Pennies: no pure Nash equilibrium; clear mixed Nash equilibrium: uniform probability for both. Battle of The Sexes: pure Nash equilibria: both (O, O) and (F, F). mixed Nash equilibria: ((2/3, 1/3), (1/3, 2/3)). payoff of 2/3 for both in mixed case: less than payoffs in pure cases! page 11

12 Nash s Theorem Theorem: any normal form game with a finite set of players and finite set of actions admits a (mixed) Nash equilibrium. page 12

13 Define function Proof : Q p k=1 1(A k )! Q p k=1 1(A k ) (p 1,...p p )=(p 0 1,...p 0 p) with 8k 2 [1,p],j 2 [1,n k ], p 0j k = pj k +cj+ k 1+ P n k j=1 cj+ k where c j k = u k(e j, p k ) u k (p k, p k ), c j+ k = max(0,c j k ). is a continuous function mapping from a non-empty compact convex set to itself, thus, by Brouwer s fixed-point theorem, there exists (p 1,...p p ) such that (p 1,...p p )=(p 1,...p p )., by page 13

14 Proof Observe that for any, Xn k j=1 k 2 [1,p] n k p j k cj k = X p j k u k(e j, p k ) u k (p k, p k )=0. j=1 k 2 [1,p] p j k > 0 Thus, for any, there exists at least one such that with. For that, and p j k = c j k apple 0 p j k 1+ P n k j=1 cj+ k ) 1+ Xn k j=1 j c j+ k =0 c j+ k =1 ) c j+ k =0, 8j ) u k (e j, p k ) apple u k (p k, p k ), 8j ) u k (q k, p k ) apple u k (p k, p k ), 8q k. j page 14

15 Nash Equilibrium: Problems Different equilibria: not clear which one will be selected. different payoffs. Circular definition. Finding any Nash equilibrium is a PPAD-complete (polynomial parity argument on directed graphs) problem (Daskalakis et al., 2009). Not a natural model of rationality if computationally hard. page 15

16 Zero-Sum Games: Order of Play If row player plays then column player plays solution of Thus, if row player starts, he plays quantity and the payoff is p min q2 1 (A 2 ) max p2 1 (A 1 ) to maximize that Similarly, if column player plays first, the expected payoff is min q2 1 (A 2 ) E a 1 p a 2 q min q2 1 (A 2 ) max p2 1 (A 1 ) [u 1 (a)]. E E p a 1 p a 2 q a 1 p a 2 q [u 1 (a)]. [u 1 (a)]. q page 16

17 von Neumann s Theorem Theorem (von Neumann s minimax theorem): for any twoplayer zero-sum game with finite action sets, max p2 1 (A 1 ) min q2 1 (A 2 ) E a 1 p a 2 q [u 1 (a)] = min q2 1 (A 2 ) common value called value of the game. max p2 1 (A 1 ) mixed Nash equilibria coincide with maximizing and minimizing pairs and they all have the same payoff. (von Neumann, 1928) E a 1 p a 2 q [u 1 (a)]. page 17

18 Proof Playing second is never worse: max p2 1 (A 1 ) min q2 1 (A 2 ) E a 1 p a 2 q straightforward: [u 1 (a)] apple min 8p 2 1 (A 1 ), 8q 2 1 (A 2 ), min q2 1 (A 2 ) ) 8q 2 1 (A 2 ), max p2 1 (A 1 ) ) max p2 1 (A 1 ) min q2 1 (A 2 ) min q2 1 (A 2 ) q2 1 (A 2 ) E a 1 p a 2 q E a 1 p a 2 q E a 1 p a 2 q [u 1 (a)] apple E max p2 1 (A 1 ) a1 p a 2 q [u 1 (a)] apple max [u 1 (a)] p2 1 (A 1 ) [u 1 (a)] apple min q2 1 (A 2 ) E a 1 p a 2 q E a 1 p a 2 q [u 1 (a)] max p2 1 (A 1 ) [u 1 (a)]. E a 1 p a 2 q [u 1 (a)]. page 18

19 Proof min q2 1 (A 2 ) Set-up: at reach round, column player selects using RWM. row player selects. Thus, letting proof: max p2 1 (A 1 ) E a 1 p a 2 q T! +1 [u 1 (a)] = min apple apple 1 T q2 1 (A 2 ) in the following completes the max p2 max p2 1 (A 1 ) p> U =min q TX t=1 " 1 (A 1 ) p> Uq " 1 T TX t=1 max p2 1 (A 1 ) p> Uq t = 1 T 1 T p t = TX t=1 q t p > t p2 # max 1 (A 1 ) p> Uq t Uq + R T T q t # = max p2 1 (A 1 ) TX t=1 1 T TX p > Uq t t=1 p > 1 t Uq t =min q T TX t=1 apple max min p > Uq + R T p q T. p > t Uq + R T T page 19

20 Proof Let p 2 argmax min u 1(p, q) p2 1 (A 1 ) q2 1 (A 2 ) q 2 argmin max u 1(p, q). q2 1 (A 2 ) p2 1 (A 1 ) and p and q exist by the continuity of u 1 and the compactness of the simplices. By definition of p and q and the minmax theorem: v =minu 1 (p,q) apple u 1 (p,q ) apple max q p Thus, (p,q ) is a Nash equilibrium. u 1 (p, q )=v. page 20

21 Proof Conversely, assume that (p,q ) is a Nash equilibrium. Then, u 1 (p,q ) = max p u 1 (p, q ) min q max p u 1 (p, q) =v u 1 (p,q )=min q u 1 (p,q) apple max p min q u 1 (p, q) =v. This implies equalities and u 1 (p,q ) = max p min q u 1 (p, q) =min q max p u 1 (p, q). page 21

22 Notes Unique value: all Nash equilibria have the same payoff (less problematic than general case). Potentially several equilibria but no need to cooperate. q O log N T Computationally efficient: convergence in. Plausible explanation of how an equilibrium is reached note that both players can play RWM. In general non-zero-sum games regret minimization does not lead to an equilibrium. page 22

23 Yao s Lemma Theorem: for any two-player zero-sum game with finite action sets, max min p2 1 (A 1 ) a 2 2A 2 E [u a1 1(a)] = p min max q2 1 (A 2 ) a 1 2A 1 E [u a2 1(a)]. q (Yao, 1977) consequence: for any distribution D over the inputs, the cost of a randomized algorithm is lower bounded by the minimum D-average cost of a deterministic algorithm. page 23

24 General Finite Games Regret notion not relevant: (external) regret minimization may not lead to a Nash equilibrium. Notion of equilibrium: several issues related to Nash equilibria. new notion of equilibrium, new notion of regret. page 24

25 Correlated Equilibrium - Tale There is an authority or a correlation mechanism device. The authority defines a probability distribution p-tuple of the players actions. p over the The authority draws player only his action. k (a 1,...,a p ) p a k and reveals to each The authority is a correlated equilibrium if player k has no incentive to deviate from the action recommended: the utility of any other action is lower than a k, conditioned on the fact that he was told. a k page 25

26 X Correlated Equilibrium p<+1 Definition: consider a normal form game with players and finite action sets A Q k, k 2 [1,p]. Then, a p probability distribution p over k=1 A k is a correlated equilibrium if for all k 2[1,p], for all a k 2 A k with positive probability and all, or X a 0 k 2 A k X X (Aumann, 1974) p(a k,a k )u k (a k,a k ) p(a k,a k )u k (a 0 k,a k ), a k 2A k a k 2A k p(a a k )u k (a k,a k ) p(a a k )u k (a 0 k,a k ). a k 2A k a k 2A k page 26

27 Notes Think of the distribution as a correlation device. The set of all correlated equilibria is a convex set (it is a polyhedron): defined by a system of linear inequalities, including the simplex constraints. Solution via solving an LP problem. The set of Nash equilibria in general is not convex. It is defined by the intersection of the polyhedron of correlated equilibria and the constraints p(a) =p 1 (a 1 ) p p (a p ). page 27

28 Traffic Lights Stop/Go. S G Pure Nash equilibria: (S, G), (G, S). Mixed Nash equilibrium: ((1/2, 1/2), (1/2, 1/2)). Correlated equilibria: S 4,4 1,5 G 5,1 0,0 0 1/2 1/2 0 1/3 1/3 1/3 0 page 28

29 Internal Regret Definition: internal regret, functions leaving all actions unchanged but which is switched to. R T = C a,b f : A! A a b Definition: swap regret, family of all functions. R T = TX t=1 TX t=1 E [l(a t )] a t p t C E [l(a t )] a t p t min f2c a,b min f2c TX t=1 TX t=1 E [l(f(a t ))]. a t p t f : A! A E [l(f(a t ))]. a t p t page 29

30 Swap Regret and Correlated Eq. Theorem: consider a finite normal form game played repeatedly. Assume that each player follows a swap regret minimizing strategy. Then, the empirical distribution of all plays converges to a correlated equilibrium. page 30

31 Alternative Proof Let C be the convex set of correlated equilibria. If the sequence of empirical dist. (bp t ) t2n does not converge to C, it admits a subsequence in C = {p: d(p,c) } (compact set), thus, it admits a subsequence converging to bp 62 C. Thus, there exist > 0, k 2 [1,p], and a k,a 0 k 2 A k such that X a k 2A k bp(a)[u k (a 0 k,a k ) u k (a k,a k )] =. Therefore, for X t sufficiently large, a k 2A k bp (t) (a)[u k (a 0 k,a k ) u k (a k,a k )] 2. page 31

32 Alternative Proof bp (t) (a) = 1 Since, 1 (t) (t) X s=1 (t) P (t) s=1 1 a k, (s) =a k 1 ak, (s) =a k h u k (a 0 k, a k, (s) ) u k (a k, a k, (s) ) Thus, the internal regret of player k for switching a k to a 0 k is lower bounded by 2 at time (t) and later, which implies that the player is not following a swap regret minimization strategy. i 1 ak, (s) =a k 2. page 32

33 Proof Define the instantanous regret of player Then, at time as br k,t,j,j 0 =1 ak,t =j[l k (j, a k,t ) l k (j 0,a k,t )], and r k,t,j,j 0 = p k,t,j [l k (j, a k,t ) l k (j 0,a k,t )]. E[br k,t,j,j 0 past ^ other players actions] = r k,t,j,j 0. (j, j 0 ) (r k,t,j,j 0 br k,t,j,j 0) Thus, for any, is a bounded martingale difference. By Azuma s inequality and the Borell-Cantelli lemma, for all and, Therefore, 8k, lim sup T!+1 1 max j,j 0 T k k (j, j 0 ) P 1 T lim sup T!+1 T t=1 br k,t,j,j 0 r k,t,j,j 0 = 0 (a.s.). TX br k,t,j,j 0 apple 0 (a.s.). t=1 t page 33

34 Swap Regret Algorithm Theorem: there exists an algorithm with swap regret. (Blum and Mansour, 2007) O( p NT log N) R p 1 l t p 2 l t p N l t q t,1 q t,2 q t,n R 1 R 2 R N R i s external regret minimization algorithms page 34

35 Proof Define for all the stochastic matrix Since is stochastic, it admits a stationary distribution : Thus, Q t t 2 [1,T] Q t =(q t,i,j ) (i,j)2[1,n] 2 = p > t = p > t Q t,8j 2 [1,N],p t,j = " q > t,1. q > t,n NX i=1 #. p t,i q t,i,j p t TX p t l t = t=1 TX NX p t,j l t,j = t=1 j=1 TX NX NX p t,i q t,i,j l t,j = t=1 j=1 i=1 NX TX q t,i (p t,i l t ) apple i=1 t=1 NX i=1 min j TX p t,i l t,j + R T,i. t=1 page 35

36 Proof f : A! A TX NX p t l t apple Thus, for any, t=1 i=1 TX t=1 p t,i l t,f(i) + R T,i. For RWM, inequality, R T,i = O( p L min,i log N) NX R T,i = N 1 NX R T,i N i=1 i=1 0 v 1 u apple 1 NX L min,i log NA N apple O N i=1 r! 1 N T log N p = O NT log N.. Thus, by Jensen s (Jensen s ineq.) X N L min,i = i=1 TX t=1 NX p t,i l t,j i i=1 page 36

37 Notes Surprising result: no explicit joint distribution in the game! correlation induced by the empirical sequence of plays by the players. Game matrix: no need to know the full matrix (which could be huge with a lot of players). only need to know the loss or payoff for actions taken. page 37

38 Conclusion Zero-sum finite games: external regret minimization algorithms (e.g., RWM). Nash equilibrium, value of the game reached. General finite games: internal/swap regret minimization algorithms. correlated equilibrium, can be learned. Questions: Nash equilibria. extensions: e.g., time selection functions (Blum and Mansour, 2007), conditional correlated equilibrium (MM and Yang, 2014). page 38

39 References Robert Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1:67 96, Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8: , Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, Constantinos Daskalakis, Paul W. Goldberg, Christos H. Papadimitriou. The complexity of computing a Nash equilibrium. Commun. ACM 52(2): (2009). Dean P. Foster and Rakesh V. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(12):40 55, Yoav Freund and Robert Schapire. Large margin classification using the perceptron algorithm. In Proceedings of COLT ACM Press, page 39

40 References Ehud Lehrer. Correlated equilibria in two-player repeated games with nonobservable actions. Mathematics of Operations Research, 17(1), Nick Littlestone. From On-Line to Batch Learning. COLT 1989: Nick Littlestone, Manfred K. Warmuth: The Weighted Majority Algorithm. FOCS 1989: Mehryar Mohri and Scott Yang. Conditional swap regret and conditional correlated equilibrium. In NIPS Noam Nisan, Tim Roughgarden, Éva Tardos, and Vijay V. Vazirani. Algorithmic Game Theory. Cambridge University Press, New York, NY, USA, John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1): , page 40

41 References Gilles Stoltz and Gábor Lugosi. Learning correlated equilibria in games with compact sets of strategies. Games and Economic Behavior, 59(1), Andrew Yao. Probabilistic computations: Toward a unified measure of complexity, in FOCS, pp , page 41

Conditional Swap Regret and Conditional Correlated Equilibrium

Conditional Swap Regret and Conditional Correlated quilibrium Mehryar Mohri Courant Institute and Google 251 Mercer Street New York, NY 10012 mohri@cims.nyu.edu Scott Yang Courant Institute 251 Mercer