Lecture 6 Lecturer: Shaddin Dughmi Scribes: Omkar Thakoor & Umang Gupta
CSCI699: Topics in Learning & Game Theory — Lecture 6
Lecturer: Shaddin Dughmi. Scribes: Omkar Thakoor & Umang Gupta

1 No-regret learning in zero-sum games

Definition 1. A 2-player game of complete information is said to be zero-sum if $U_2(a_1, a_2) = -U_1(a_1, a_2)$ for all action profiles $(a_1, a_2)$.

Such a game can be described by an $n \times m$ matrix $A$, where $n, m$ are the numbers of actions of $P_1, P_2$ respectively, and we have

$$A_{ij} = U_1(i, j) \quad (\Rightarrow U_2(i, j) = -A_{ij}) \tag{1}$$

Example 1. We can represent the well-known Rock-Paper-Scissors game with the following matrix (Table 2 shows the general-sum representation of this game):

        R    P    S
  R     0   -1   +1
  P    +1    0   -1
  S    -1   +1    0

Next, let $x \in \Delta_n$, $y \in \Delta_m$ be mixed strategies of $P_1, P_2$. Then $P_1$ plays his action $i$ with probability $x_i$, and $P_2$ plays his action $j$ with probability $y_j$, and thus $P_1$ gets payoff $A_{ij}$ with probability $x_i y_j$. Hence, his expected payoff can be written as

$$U_1(x, y) = \sum_{i=1}^{n} \sum_{j=1}^{m} x_i y_j A_{ij} = x^\top A y \qquad (\Rightarrow U_2(x, y) = -x^\top A y)$$

Recall that, by definition, $(x, y)$ form a Nash equilibrium iff

$$U_1(x, y) \ge U_1(x', y) \;\; \forall x' \quad \& \quad U_2(x, y) \ge U_2(x, y') \;\; \forall y'$$

i.e. $x^\top A y \ge x'^\top A y$ for all $x'$, and $x^\top A y \le x^\top A y'$ for all $y'$. Equivalently, we can reduce this to

$$x^\top A y \ge (Ay)_i \;\; \forall i \in [n] \quad \& \quad x^\top A y \le (x^\top A)_j \;\; \forall j \in [m]$$
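To make the matrix representation concrete, here is a small Python sketch (an illustration, not part of the lecture) that computes $U_1(x, y) = x^\top A y$ for Rock-Paper-Scissors and checks the Nash condition above for the uniform strategies:

```python
# Rock-Paper-Scissors payoff matrix for P1 (rows: P1's action, columns: P2's).
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def expected_payoff(x, y, A):
    """U_1(x, y) = x^T A y: P1's expected payoff under mixed strategies x, y."""
    return sum(x[i] * A[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

x = y = [1/3, 1/3, 1/3]  # uniform mixed strategies
v = expected_payoff(x, y, A)

# Nash condition: x^T A y >= (A y)_i for all i, and x^T A y <= (x^T A)_j for all j.
Ay = [sum(A[i][j] * y[j] for j in range(3)) for i in range(3)]
xA = [sum(x[i] * A[i][j] for i in range(3)) for j in range(3)]
assert all(v >= a - 1e-12 for a in Ay)
assert all(v <= c + 1e-12 for c in xA)
print(v)  # 0.0 up to floating point: uniform vs. uniform is the equilibrium
```

Since $A$ is antisymmetric, the uniform pair yields value 0 and satisfies both inequality families with equality.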
Definition 2. The maximin value of $A$ is
$$\max_x \min_y x^\top A y = \max_x \min_{j \in [m]} (x^\top A)_j$$
Further, we call any $x^* \in \arg\max_x \min_y x^\top A y$ a maximin strategy of $P_1$.

Similarly, we also define the following:

Definition 3. The minimax value of $A$ is
$$\min_y \max_x x^\top A y = \min_y \max_{i \in [n]} (Ay)_i$$
Further, we call any $y^* \in \arg\min_y \max_x x^\top A y$ a minimax strategy of $P_2$.

An interesting question that arises is: which is better, moving first or second? To answer this, we first show the following result, also known as the weak 2nd mover's advantage.

Theorem 1. $\mathrm{maximin}(A) \le \mathrm{minimax}(A)$.

Proof.
$$\mathrm{maximin}(A) = \max_x \min_y x^\top A y = \min_y (x^*)^\top A y \quad \text{(by definition of } x^* \text{ (Def. 2))}$$
$$\le (x^*)^\top A y^* \quad \text{(by definition of min)}$$
$$\le \max_x x^\top A y^* \quad \text{(by definition of max)}$$
$$= \min_y \max_x x^\top A y \quad \text{(by definition of } y^* \text{ (Def. 3))}$$
$$= \mathrm{minimax}(A)$$

Next, we prove that the two values are in fact equal.

Theorem 2 (Minimax Thm). $\mathrm{maximin}(A) = \mathrm{minimax}(A)$.

Before proceeding with the proof of the theorem, note the following consequence of the theorem.

Corollary 3. $(x^*, y^*)$ is a Nash equilibrium of the simultaneous game.
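The inequality of Theorem 1 is easy to check computationally. The sketch below (an illustration, not from the notes) restricts both players to pure strategies, where maximin and minimax reduce to row/column scans; the gap it exhibits on a hypothetical matrix is exactly what mixed strategies, and Theorem 2, close:

```python
def pure_maximin(A):
    """max over rows of the row minimum: P1 commits first to a pure strategy."""
    return max(min(row) for row in A)

def pure_minimax(A):
    """min over columns of the column maximum: P2 commits first to a pure strategy."""
    return min(max(A[i][j] for i in range(len(A))) for j in range(len(A[0])))

# A game with no pure-strategy saddle point: the two values differ.
A = [[2, -1],
     [0, 1]]
lo, hi = pure_maximin(A), pure_minimax(A)
assert lo <= hi       # weak 2nd mover's advantage (Theorem 1)
print(lo, hi)         # 0 1: a strict gap, closed only by mixing
```

Over mixed strategies the two values coincide (computing them is a small linear program), which is the content of Theorem 2.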
Proof. Using Theorem 2, it follows that every inequality in the proof of Theorem 1 is, in fact, an equality. Hence
$$(x^*)^\top A y^* = \max_x x^\top A y^* \quad \& \quad (x^*)^\top A y^* = \min_y (x^*)^\top A y$$
Thus $x^*, y^*$ are best responses to each other, making $(x^*, y^*)$ a Nash equilibrium by definition.

Next, we present the proof of Theorem 2.

Proof (Minimax Thm). Consider the following setting.
- Assume that the two players play repeatedly $T$ times, where $T$ is large.
- At each step $t = 1, \ldots, T$, they play simultaneously: $P_1, P_2$ choose $x^t, y^t$ respectively. $P_1$ sees all the previous iterations before choosing $x^t$, but does not see $y^t$. Similarly, $P_2$ sees all the previous iterations before choosing $y^t$, but does not see $x^t$.
- The utility of $P_1$ at time $t$ is $(x^t)^\top A y^t$, which is also the loss of $P_2$ at time $t$.
- Let $U_i^t = (A y^t)_i$ be the utility of action $i$ for $P_1$ at time $t$. Similarly, let $C_j^t = ((x^t)^\top A)_j$ be the cost of action $j$ for $P_2$ at time $t$.
- Each player faces an online learning problem: $P_1$ must select $x^t$ based only on $U_i^1, \ldots, U_i^{t-1}$ for all $i$; similarly, $P_2$ must select $y^t$ based only on $C_j^1, \ldots, C_j^{t-1}$ for all $j$.
- Assume each uses a no-external-regret algorithm (e.g. Multiplicative Weights).
- Let $\delta(T)$ be the bound on external regret.
- Let $v = \frac{1}{T} \sum_{t=1}^{T} (x^t)^\top A y^t$.

Then, we have

$$v \ge \max_i \frac{1}{T} \sum_{t=1}^{T} (A y^t)_i - \delta(T) = \max_i \left( A \left( \frac{1}{T} \sum_{t=1}^{T} y^t \right) \right)_i - \delta(T) \ge \min_y \max_i (Ay)_i - \delta(T) = \mathrm{minimax}(A) - \delta(T)$$

Similarly, from $P_2$'s perspective, we can get
$$v \le \mathrm{maximin}(A) + \delta(T)$$
Combining the two inequalities,
$$\mathrm{minimax}(A) \le \mathrm{maximin}(A) + 2\delta(T) \quad \text{(since both players have no external regret)}$$
$$\Rightarrow \mathrm{minimax}(A) \le \mathrm{maximin}(A) \quad \text{(since } \delta(T) \to 0 \text{ as } T \to \infty\text{)}$$
Together with Theorem 1, this gives $\mathrm{minimax}(A) = \mathrm{maximin}(A)$.
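The proof above is constructive: running any no-external-regret algorithm in self-play approximates the value. Below is a minimal Multiplicative Weights self-play sketch for Rock-Paper-Scissors (the learning rate, horizon, and asymmetric initialization are illustrative choices, not from the lecture); the time-averaged payoff approaches the value 0, and the averaged strategy of $P_2$ becomes an approximate minimax strategy:

```python
import math

A = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]   # RPS payoffs for P1 (zero-sum)
n = len(A)
T, eta = 5000, 0.05

w1 = [1.0] * n              # P1's Multiplicative Weights
w2 = [1.0, 2.0, 3.0]        # P2's weights (asymmetric start, to break symmetry)
v_sum = 0.0
y_avg = [0.0] * n           # time-averaged strategy of P2

for t in range(T):
    x = [w / sum(w1) for w in w1]
    y = [w / sum(w2) for w in w2]
    v_sum += sum(x[i] * A[i][j] * y[j] for i in range(n) for j in range(n))
    for j in range(n):
        y_avg[j] += y[j] / T
    Ay = [sum(A[i][j] * y[j] for j in range(n)) for i in range(n)]   # P1's action utilities
    xA = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]   # P2's action costs
    for i in range(n):
        w1[i] *= math.exp(eta * Ay[i])     # P1 maximizes payoff
        w2[i] *= math.exp(-eta * xA[i])    # P2 minimizes it

v = v_sum / T
exploitability = max(sum(A[i][j] * y_avg[j] for j in range(n)) for i in range(n))
print(round(v, 3), round(exploitability, 3))  # both close to 0, the value of RPS
```

By the regret bounds in the proof, $|v|$ and $\max_i (A \bar{y})_i$ are both within $O(\delta(T))$ of the value.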
As a result of this, the following immediately follows by definition:

Corollary 4. $\left( \frac{1}{T} \sum_{t=1}^{T} x^t, \; \frac{1}{T} \sum_{t=1}^{T} y^t \right)$ is a $\delta$-NE.

2 No-regret learning in general-sum games

In this section, we discuss no-regret learning in general-sum games. We discuss two equilibrium criteria, correlated equilibrium and coarse correlated equilibrium, and their existence in general-sum games. We introduce a new regret measure/benchmark called swap regret, and an algorithm that has vanishing swap regret.

Recall that a game of complete information has:
- $n$ players $1, \ldots, n$
- an action set $A_i$ for each player $i$; $A = A_1 \times A_2 \times \ldots \times A_n$
- a utility function $u_i : A \to [-1, 1]$ for each player $i$; $u_i(a_1, a_2, \ldots, a_n)$ is player $i$'s utility when the players play $a = (a_1, \ldots, a_n)$

Definition 4. A correlated equilibrium is a distribution $\chi$ over $A$ such that for all $i$ and all $j, j' \in A_i$,
$$\mathbb{E}_{a \sim \chi}[u_i(a) \mid a_i = j] \ge \mathbb{E}_{a \sim \chi}[u_i(j', a_{-i}) \mid a_i = j]$$
In other words, if player $i$ is recommended an action $j$, he should choose to play $j$. This is equivalent to saying that for all $i$ and all swap functions $s : A_i \to A_i$,
$$\mathbb{E}_{a \sim \chi}[u_i(a)] \ge \mathbb{E}_{a \sim \chi}[u_i(s(a_i), a_{-i})]$$

Note that, knowing his own recommended action, a player can infer something about the other players' moves, yet he is better off playing the recommended action. For example, consider the chicken-dare game, in which players can choose to chicken out or dare. Payoffs are as given in Table 1. If the players are recommended the action profiles (C,D), (D,C) & (C,C) with equal probability, this is a correlated equilibrium: neither player wants to deviate from the recommended strategy. Consider player 1: if he is recommended to chicken out, he knows the other player is going to dare or chicken out with equal probability, so obeying pays $(-2 + 0)/2 = -1$ while daring pays $(1 - 10)/2 = -4.5$, and he is better off not deviating. If player 1 is recommended dare, he knows the other player is going to chicken out, and hence won't deviate.
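The chicken-dare claim can be verified mechanically. This sketch (payoffs taken from Table 1; the helper function is ours) enumerates the conditional deviation inequalities of Definition 4 for the uniform distribution over {(C,D), (D,C), (C,C)}:

```python
C, D = 0, 1
U1 = [[0, -2], [1, -10]]    # row: player 1's action, column: player 2's (Table 1)
U2 = [[0, 1], [-2, -10]]
support = [(C, D), (D, C), (C, C)]    # recommended with probability 1/3 each

def is_correlated_eq(support, U1, U2, tol=1e-9):
    """Check Definition 4 for a uniform distribution over `support`."""
    for player, U in ((0, U1), (1, U2)):
        for j in (C, D):
            # Profiles consistent with recommendation a_i = j.
            cond = [a for a in support if a[player] == j]
            if not cond:
                continue
            follow = sum(U[a[0]][a[1]] for a in cond) / len(cond)
            for jp in (C, D):   # every possible deviation j'
                swapped = [(jp, a[1]) if player == 0 else (a[0], jp) for a in cond]
                deviate = sum(U[a[0]][a[1]] for a in swapped) / len(swapped)
                if deviate > follow + tol:
                    return False
    return True

print(is_correlated_eq(support, U1, U2))  # True: no profitable conditional deviation
```

Running the same check on, say, the singleton support {(D,D)} returns False, since chickening out beats crashing.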
            Chicken      Dare
Chicken      0, 0       -2, 1
Dare         1, -2     -10, -10

Table 1: Payoffs in the chicken-dare game

Definition 5. A coarse correlated equilibrium is a distribution $\chi$ such that for all $i$ and all $j \in A_i$,
$$\mathbb{E}_{a \sim \chi}[u_i(a)] \ge \mathbb{E}_{a \sim \chi}[u_i(j, a_{-i})]$$
The above definition states that the agent cannot profit by deviating to some other constant/fixed action. Note the subtle difference from the definition of correlated equilibrium: a correlated equilibrium uses a smart swap function. Coarse correlated equilibrium is therefore a weaker equilibrium criterion, since the deviation is not correlated with the recommended action. Another way to look at it: to compute a correlated equilibrium, we assume players have more knowledge and hence can respond better, in which case it becomes more difficult to make them obey.

$$\text{DominantStrategy} \subseteq \text{NashEq} \subseteq \text{CorrelatedEq} \subseteq \text{CoarseCorrelatedEq}$$

For example, consider the rock-paper-scissors game with the payoff matrix given in Table 2. Choosing all the off-diagonal action profiles with equal probability (1/6 each) is a coarse correlated equilibrium. To understand why, suppose player 1 is playing rock: player 2 will respond with either paper or scissors with equal probability, but if he starts responding with paper with higher probability, he will incur more loss when player 1 plays scissors (recall player 1 plays rock as well as scissors with equal probability). Hence, player 2 is not better off playing any other fixed strategy. This is not a correlated equilibrium, however. Suppose player 2 is instructed to play paper. He knows the other player is playing either rock or scissors, and player 2's expected payoff from paper is 0. He can change his move to rock and improve his payoff to 1/2, as against 0 if he plays the recommended action. In this case, player 2 can exploit knowledge of the action recommended to him to improve his payoff.
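The rock-paper-scissors example can likewise be checked numerically. A small sketch (payoffs from Table 2) confirms that the uniform off-diagonal distribution satisfies the coarse condition of Definition 5 but violates the conditional condition of Definition 4:

```python
R, P, S = 0, 1, 2
U = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]   # player 1's payoff; player 2 gets -U
support = [(a, b) for a in (R, P, S) for b in (R, P, S) if a != b]  # prob 1/6 each

# Coarse CE check for player 2: deviating to any *fixed* action gains nothing.
follow2 = sum(-U[a][b] for a, b in support) / len(support)
for fixed in (R, P, S):
    deviate2 = sum(-U[a][fixed] for a, b in support) / len(support)
    assert deviate2 <= follow2 + 1e-9

# CE check fails: conditioned on being told Paper, player 2 prefers Rock.
cond = [(a, b) for a, b in support if b == P]          # opponent is R or S, equally
follow = sum(-U[a][P] for a, b in cond) / len(cond)    # payoff from obeying
deviate = sum(-U[a][R] for a, b in cond) / len(cond)   # payoff from switching to Rock
print(follow, deviate)  # 0.0 0.5 -> a profitable conditional deviation, so not a CE
```

This reproduces the text's numbers: 0 from the recommendation, 1/2 from the informed deviation.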
            Rock     Paper    Scissors
Rock        0, 0     -1, 1     1, -1
Paper       1, -1     0, 0    -1, 1
Scissors   -1, 1      1, -1    0, 0

Table 2: Payoffs in the rock-paper-scissors game

Definition 6. A $\delta$-correlated equilibrium (or approximate correlated equilibrium), for $\delta > 0$, is a distribution $\chi$ over $A$ such that for all $i$ and all $j, j' \in A_i$,
$$\mathbb{E}_{a \sim \chi}[u_i(a) \mid a_i = j] \ge \mathbb{E}_{a \sim \chi}[u_i(j', a_{-i}) \mid a_i = j] - \delta$$
In other words, if player $i$ is recommended action $j$, he can gain only a small quantity $\delta$ by deviating from it. This is equivalent to saying that for all $i$ and all swap functions $s : A_i \to A_i$,
$$\mathbb{E}_{a \sim \chi}[u_i(a)] \ge \mathbb{E}_{a \sim \chi}[u_i(s(a_i), a_{-i})] - \delta$$

Definition 7. A $\delta$-coarse correlated equilibrium (or approximate coarse correlated equilibrium), for $\delta > 0$, is a distribution $\chi$ such that for all $i$ and all $j \in A_i$,
$$\mathbb{E}_{a \sim \chi}[u_i(a)] \ge \mathbb{E}_{a \sim \chi}[u_i(j, a_{-i})] - \delta$$

Note that in both the $\delta$-correlated and $\delta$-coarse correlated equilibrium, the inequalities may be slack by an arbitrary $\delta$ relative to the exact correlated and coarse correlated equilibrium criteria.

2.1 Existence of δ-coarse correlated equilibrium

Theorem 5. Fix a game of complete information, suppose the players play the game repeatedly $T$ times, and each player uses a vanishing-external-regret algorithm. Then the time-averaged mixed strategy profiles form an approximate coarse correlated equilibrium. Formally, let $P_i^t \in \Delta(A_i)$ be player $i$'s mixed strategy at time $t$, let $\chi^t = P_1^t \times \ldots \times P_n^t$, and let $\bar{\chi} = \frac{1}{T} \sum_{t=1}^{T} \chi^t$ be the time-averaged joint action profile distribution. Then $\bar{\chi}$ is a $\delta$-coarse correlated equilibrium, where $\delta(T) \to 0$ as $T \to \infty$.

Note that $\bar{\chi}$ can be realized in two steps: sample a time $t$ uniformly from $\{1, \ldots, T\}$, then sample from $\chi^t$.
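A sketch of Theorem 5 in action (the horizon, learning rate, and rescaling are illustrative choices, not from the lecture): both chicken-dare players run Multiplicative Weights on payoffs rescaled into [0,1], and we check that the time-averaged product distribution $\bar{\chi}$ approximately satisfies the coarse correlated equilibrium inequalities:

```python
import math

U1 = [[0, -2], [1, -10]]          # chicken-dare payoffs (Table 1)
U2 = [[0, 1], [-2, -10]]
T, eta = 4000, 0.02
w1, w2 = [1.0, 1.0], [1.0, 1.0]
chi = [[0.0, 0.0], [0.0, 0.0]]    # time-averaged joint distribution chi-bar

for t in range(T):
    x = [w / sum(w1) for w in w1]
    y = [w / sum(w2) for w in w2]
    for i in range(2):
        for j in range(2):
            chi[i][j] += x[i] * y[j] / T
    # Expected utility of each action; rescale from [-10, 1] to [0, 1] for MW.
    g1 = [sum(U1[i][j] * y[j] for j in range(2)) for i in range(2)]
    g2 = [sum(x[i] * U2[i][j] for i in range(2)) for j in range(2)]
    for i in range(2):
        w1[i] *= math.exp(eta * (g1[i] + 10) / 11)
        w2[i] *= math.exp(eta * (g2[i] + 10) / 11)

def cce_violation(chi, U, player):
    """Largest gain any fixed deviation yields over following chi-bar."""
    follow = sum(chi[i][j] * U[i][j] for i in range(2) for j in range(2))
    best = max(sum(chi[i][j] * (U[k][j] if player == 0 else U[i][k])
                   for i in range(2) for j in range(2))
               for k in range(2))
    return best - follow

delta = max(cce_violation(chi, U1, 0), cce_violation(chi, U2, 1))
print(delta <= 0.5)  # True: chi-bar is a delta-CCE for a small delta
```

The violation for each player is bounded by that player's average external regret, which the MW bound keeps small at this horizon.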
Proof. Fix player $i$ and assume he uses a vanishing-external-regret algorithm to choose his actions. Vanishing external regret implies that switching to a fixed action $j$ in hindsight cannot gain more than $\delta(T)$ per time step on average, where $\delta(T) \to 0$ as $T \to \infty$. Recall that Multiplicative Weights is one such vanishing-external-regret algorithm. For a vanishing-external-regret algorithm, we know that at time $T$, for any $j$,
$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{a^t \sim \chi^t}[u_i(a^t)] \ge \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{a^t \sim \chi^t}[u_i(j, a^t_{-i})] - \delta(T)$$
Since expectations are linear, we can view the above as picking some $t$ uniformly at random and sampling an action profile from $\chi^t$. Therefore the above equation can be written using $\bar{\chi}$ as
$$\mathbb{E}_{a \sim \bar{\chi}}[u_i(a)] \ge \mathbb{E}_{a \sim \bar{\chi}}[u_i(j, a_{-i})] - \delta(T)$$
This is exactly the $\delta$-coarse correlated equilibrium condition, so a $\delta$-coarse correlated equilibrium exists in the above scenario and is achieved whenever the agents use vanishing-external-regret algorithms.

Corollary 6. A $\delta$-coarse correlated equilibrium exists for every $\delta > 0$ and every finite game. Moreover, vanishing-external-regret learning dynamics arrive at such a $\delta$-coarse correlated equilibrium.

2.2 Swap Regret

Definition 8. Fix an online learning environment where an adversary chooses a cost function at each time step, $C^1, \ldots, C^T : A \to [-1, 1]$. An online learning algorithm which plays a mixed strategy $P^t \in \Delta(A)$ at time $t$, drawing action $j^t \sim P^t$, has swap regret
$$\text{swap-regret}_{\text{alg}} = \frac{1}{T} \left( \sum_{t=1}^{T} \mathbb{E}[C^t(j^t)] - \min_{s : A \to A} \sum_{t=1}^{T} \mathbb{E}[C^t(s(j^t))] \right)$$
where $s$ is the swap function. Thus, we say that an algorithm has vanishing swap regret (or no swap regret) if $\text{swap-regret}_{\text{alg}} \to 0$ for all adversaries $C^t$.

Note that the swap-regret criterion is stronger than the external-regret criterion, in that the swap function is smart: it can look at the player's action and choose the swap depending on that action, as opposed to a fixed action in the external-regret criterion. The main idea is that:
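To see why the swap benchmark is stronger, the following sketch (a made-up cost sequence, not from the lecture) computes both regrets by brute force; the swap benchmark, which may pick a different replacement for each played action, is never smaller than the external one:

```python
from itertools import product

# A played sequence of pure actions, and the cost vectors the adversary revealed.
actions = [0, 1, 0, 1, 0, 1]          # the algorithm alternated actions 0 and 1
costs   = [[1, 0], [0, 1]] * 3        # whichever action was played cost 1
n = 2

alg_cost = sum(c[a] for a, c in zip(actions, costs))
best_fixed = min(sum(c[j] for c in costs) for j in range(n))
external_regret = alg_cost - best_fixed

# Swap regret: search over all n^n swap functions s : A -> A.
best_swapped = min(sum(c[s[a]] for a, c in zip(actions, costs))
                   for s in product(range(n), repeat=n))
swap_regret = alg_cost - best_swapped

print(external_regret, swap_regret)  # 3 6: the swap comparator class is richer
```

Constant swap functions recover every fixed action, so swap regret always dominates external regret; here the swap s(0)=1, s(1)=0 drives the comparator's cost to zero.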
- The swap-regret benchmark moves around with the algorithm.
- It imposes some type of local optimality.
- It considers local modifications of the played action.

2.3 Existence of δ-correlated equilibrium

Theorem 7. Fix a game of complete information, suppose the players play the game repeatedly $T$ times, and each player uses a vanishing-swap-regret algorithm. Then the time-averaged mixed strategy profiles form an approximate correlated equilibrium. Formally, let $P_i^t \in \Delta(A_i)$ be player $i$'s mixed strategy at time $t$, let $\chi^t = P_1^t \times \ldots \times P_n^t$, and let $\bar{\chi} = \frac{1}{T} \sum_{t=1}^{T} \chi^t$ be the time-averaged joint action profile distribution. Then $\bar{\chi}$ is a $\delta$-correlated equilibrium, where $\delta(T) \to 0$ as $T \to \infty$.

Note that $\bar{\chi}$ can be realized in two steps by sampling $t$ uniformly from $\{1, \ldots, T\}$ and then sampling from $\chi^t$.

Proof. Fix player $i$ and assume he uses a vanishing-swap-regret algorithm to choose his actions. No swap regret implies that swapping the played actions via any swap function in hindsight cannot gain more than $\delta(T)$ per time step on average, where $\delta(T) \to 0$ as $T \to \infty$. For a vanishing-swap-regret algorithm, we know that at time $T$, for every swap function $s : A_i \to A_i$,
$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{a^t \sim \chi^t}[u_i(a^t)] \ge \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{a^t \sim \chi^t}[u_i(s(a^t_i), a^t_{-i})] - \delta(T)$$
Since expectations are linear, we can view the above as picking some $t$ uniformly at random and sampling an action profile from $\chi^t$. Therefore the above equation can be written using $\bar{\chi}$ as
$$\mathbb{E}_{a \sim \bar{\chi}}[u_i(a)] \ge \mathbb{E}_{a \sim \bar{\chi}}[u_i(s(a_i), a_{-i})] - \delta(T)$$
This is exactly the $\delta$-correlated equilibrium condition, so a $\delta$-correlated equilibrium exists in the above scenario and is achieved whenever the agents use no-swap-regret algorithms.

Corollary 8. A $\delta$-correlated equilibrium exists for every $\delta > 0$ and every finite game. Moreover, no-swap-regret learning dynamics arrive at such a $\delta$-correlated equilibrium.
So far, we have shown that given an algorithm with vanishing swap regret, a $\delta$-correlated equilibrium always exists. Next, we establish the existence of a no-swap-regret algorithm.

2.4 Existence of a no-swap-regret algorithm

Theorem 9. In any online learning setting with finitely many actions $A$ (at most $n$), there is a reduction from a no-swap-regret algorithm to no-external-regret algorithms.

[Figure 1: the no-swap-regret construction. A master algorithm instantiates $n$ no-external-regret algorithms $\mathrm{Alg}_1, \ldots, \mathrm{Alg}_n$. At each time $t$, each $\mathrm{Alg}_j$ outputs a distribution $q_j^t \in \Delta(A)$ and receives the scaled cost vector $p_j^t C^t$; the master computes $p_i^t, q_i^t$, aggregates $q_1^t, \ldots, q_n^t$ into $P^t \in \Delta(A)$, and receives the cost vector $C^t(\cdot)$.]

Proof. The algorithm for the above is as follows:
- Instantiate $n$ copies of a vanishing-external-regret algorithm. Each instance gets some proportion of the cost. Intuitively, the $j$-th instance of the no-external-regret algorithm is in charge of swapping action $j$ to some other action, and similarly for the other instances (see Fig. 1).
- For each time step $t = 1, \ldots, T$:
  - Receive $q_1^t, \ldots, q_n^t \in \Delta(A)$ from $\mathrm{Alg}_1, \ldots, \mathrm{Alg}_n$.
  - Aggregate $q_1^t, \ldots, q_n^t$ to get $P^t$.
  - Play according to $P^t$.
  - Receive $C^t : A \to [-1, 1]$.
  - Compute and give the scaled cost vector $p_j^t C^t$ to each of $\mathrm{Alg}_1, \ldots, \mathrm{Alg}_n$.

To achieve no swap regret we need, for any $s : A \to A$,
$$\sum_{t=1}^{T} \mathbb{E}_{i \sim P^t}[C^t(i)] \le \sum_{t=1}^{T} \mathbb{E}_{i \sim P^t}[C^t(s(i))] + \delta(T) \cdot T$$
i.e.
$$\sum_{t=1}^{T} \sum_i P_i^t \, C^t(i) \le \sum_{t=1}^{T} \sum_i P_i^t \, C^t(s(i)) + \delta(T) \cdot T \tag{2}$$

We know $\mathrm{Alg}_1, \ldots, \mathrm{Alg}_n$ have vanishing external regret. The cost vector handed to $\mathrm{Alg}_j$ at time $t$ is $c^t = p_j^t C^t$, so for any fixed action $k$,
$$\sum_{t=1}^{T} \mathbb{E}_{i \sim q_j^t}[c^t(i)] \le \sum_{t=1}^{T} c^t(k) + \delta'(T) \cdot T$$
i.e.
$$\sum_{t=1}^{T} \sum_i q_{ji}^t \, p_j^t \, C^t(i) \le \sum_{t=1}^{T} p_j^t \, C^t(k) + \delta'(T) \cdot T \qquad j = 1, \ldots, n \tag{3}$$

We need to prove eq. (2). Since eq. (3) holds for any fixed $k$, it holds in particular with $k = s(j)$ for any swap function $s : A \to A$. We can write,
$$\sum_{t=1}^{T} \sum_i q_{ji}^t \, p_j^t \, C^t(i) \le \sum_{t=1}^{T} p_j^t \, C^t(s(j)) + \delta'(T) \cdot T \qquad j = 1, \ldots, n$$
Adding these inequalities over $j = 1, \ldots, n$:
$$\sum_{t=1}^{T} \sum_i \sum_j q_{ji}^t \, p_j^t \, C^t(i) \le \sum_{t=1}^{T} \sum_j p_j^t \, C^t(s(j)) + n \, \delta'(T) \cdot T \tag{4}$$

Comparing eq. (4) with eq. (2), we can conclude that the algorithm should satisfy
$$\sum_j q_{ji}^t \, p_j^t = P_i^t \quad \text{with} \quad p_j^t = P_j^t$$
so that the left side of (4) matches the left side of (2) and the right sides match with $\delta(T) = n \, \delta'(T)$. Viewing $p^t, P^t$ as row vectors and $Q^t = (q_{ji}^t)$ as a matrix of transition probabilities from $j$ to $i$, we want
$$p^t Q^t = P^t = P^t Q^t \tag{5}$$
that is, $P^t$ should be a stationary distribution of the Markov chain $Q^t$. We know that every finite Markov chain $Q$ has a stationary distribution $P$; moreover, such a distribution can be computed efficiently (by solving a linear system over the $n$ states). Therefore eq. (5) has a solution, and we can create a no-swap-regret algorithm from $n$ instances of a no-external-regret algorithm. This proves the existence of a no-swap-regret algorithm, and hence of $\delta$-correlated equilibrium.
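The missing computational piece is solving $P^t Q^t = P^t$. A minimal sketch (the 3-action matrix is illustrative, and power iteration is one of several valid solvers; for a general chain one would solve the linear system directly): build the row-stochastic matrix $Q^t$ from the sub-algorithms' distributions and iterate to the required stationary distribution $P^t$:

```python
def stationary(Q, iters=1000):
    """Left fixed point p = pQ of a row-stochastic matrix, via power iteration."""
    n = len(Q)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[j] * Q[j][i] for j in range(n)) for i in range(n)]
    return p

# Row j is Alg_j's distribution q_j^t over the n actions at time t.
Q = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]

p = stationary(Q)
residual = max(abs(p[i] - sum(p[j] * Q[j][i] for j in range(3))) for i in range(3))
print(residual < 1e-9)  # True: p is (numerically) stationary, p = pQ
```

The master algorithm would play this $p$ as $P^t$ and then split the observed cost vector $C^t$ among the sub-algorithms in proportion to $p$.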
More informationComputing Minmax; Dominance
Computing Minmax; Dominance CPSC 532A Lecture 5 Computing Minmax; Dominance CPSC 532A Lecture 5, Slide 1 Lecture Overview 1 Recap 2 Linear Programming 3 Computational Problems Involving Maxmin 4 Domination
More informationFirst Prev Next Last Go Back Full Screen Close Quit. Game Theory. Giorgio Fagiolo
Game Theory Giorgio Fagiolo giorgio.fagiolo@univr.it https://mail.sssup.it/ fagiolo/welcome.html Academic Year 2005-2006 University of Verona Summary 1. Why Game Theory? 2. Cooperative vs. Noncooperative
More information15-780: LinearProgramming
15-780: LinearProgramming J. Zico Kolter February 1-3, 2016 1 Outline Introduction Some linear algebra review Linear programming Simplex algorithm Duality and dual simplex 2 Outline Introduction Some linear
More informationComputational Evolutionary Game Theory and why I m never using PowerPoint for another presentation involving maths ever again
Computational Evolutionary Game Theory and why I m never using PowerPoint for another presentation involving maths ever again Enoch Lau 5 September 2007 Outline What is evolutionary game theory? Why evolutionary
More informationAlgorithms, Games, and Networks January 17, Lecture 2
Algorithms, Games, and Networks January 17, 2013 Lecturer: Avrim Blum Lecture 2 Scribe: Aleksandr Kazachkov 1 Readings for today s lecture Today s topic is online learning, regret minimization, and minimax
More informationGame Theory: Lecture 2
Game Theory: Lecture 2 Tai-Wei Hu June 29, 2011 Outline Two-person zero-sum games normal-form games Minimax theorem Simplex method 1 2-person 0-sum games 1.1 2-Person Normal Form Games A 2-person normal
More information1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016
AM 1: Advanced Optimization Spring 016 Prof. Yaron Singer Lecture 11 March 3rd 1 Overview In this lecture we will introduce the notion of online convex optimization. This is an extremely useful framework
More informationGame Theory Correlated equilibrium 1
Game Theory Correlated equilibrium 1 Christoph Schottmüller University of Copenhagen 1 License: CC Attribution ShareAlike 4.0 1 / 17 Correlated equilibrium I Example (correlated equilibrium 1) L R U 5,1
More informationAdministration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.
Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,
More informationOnline Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems
216 IEEE 55th Conference on Decision and Control (CDC) ARIA Resort & Casino December 12-14, 216, Las Vegas, USA Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems
More informationMathematical Games and Random Walks
Mathematical Games and Random Walks Alexander Engau, Ph.D. Department of Mathematical and Statistical Sciences College of Liberal Arts and Sciences / Intl. College at Beijing University of Colorado Denver
More informationGame Theory, Evolutionary Dynamics, and Multi-Agent Learning. Prof. Nicola Gatti
Game Theory, Evolutionary Dynamics, and Multi-Agent Learning Prof. Nicola Gatti (nicola.gatti@polimi.it) Game theory Game theory: basics Normal form Players Actions Outcomes Utilities Strategies Solutions
More informationLecture 2: Basic Concepts of Statistical Decision Theory
EE378A Statistical Signal Processing Lecture 2-03/31/2016 Lecture 2: Basic Concepts of Statistical Decision Theory Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: John Miller and Aran Nayebi In this lecture
More informationPROBLEMS In each of Problems 1 through 12:
6.5 Impulse Functions 33 which is the formal solution of the given problem. It is also possible to write y in the form 0, t < 5, y = 5 e (t 5/ sin 5 (t 5, t 5. ( The graph of Eq. ( is shown in Figure 6.5.3.
More informationFirst Variation of a Functional
First Variation of a Functional The derivative of a function being zero is a necessary condition for the etremum of that function in ordinary calculus. Let us now consider the equivalent of a derivative
More informationTheory of Auctions. Carlos Hurtado. Jun 23th, Department of Economics University of Illinois at Urbana-Champaign
Theory of Auctions Carlos Hurtado Department of Economics University of Illinois at Urbana-Champaign hrtdmrt2@illinois.edu Jun 23th, 2015 C. Hurtado (UIUC - Economics) Game Theory On the Agenda 1 Formalizing
More informationCalculus in Business. By Frederic A. Palmliden December 7, 1999
Calculus in Business By Frederic A. Palmliden December 7, 999 Optimization Linear Programming Game Theory Optimization The quest for the best Definition of goal equilibrium: The equilibrium state is defined
More informationQuitting games - An Example
Quitting games - An Example E. Solan 1 and N. Vieille 2 January 22, 2001 Abstract Quitting games are n-player sequential games in which, at any stage, each player has the choice between continuing and
More informationMechanism Design. Christoph Schottmüller / 27
Mechanism Design Christoph Schottmüller 2015-02-25 1 / 27 Outline 1 Bayesian implementation and revelation principle 2 Expected externality mechanism 3 Review questions and exercises 2 / 27 Bayesian implementation
More informationApproximate inference, Sampling & Variational inference Fall Cours 9 November 25
Approimate inference, Sampling & Variational inference Fall 2015 Cours 9 November 25 Enseignant: Guillaume Obozinski Scribe: Basile Clément, Nathan de Lara 9.1 Approimate inference with MCMC 9.1.1 Gibbs
More information3. Fundamentals of Lyapunov Theory
Applied Nonlinear Control Nguyen an ien -.. Fundamentals of Lyapunov heory he objective of this chapter is to present Lyapunov stability theorem and illustrate its use in the analysis and the design of
More informationNormal-form games. Vincent Conitzer
Normal-form games Vincent Conitzer conitzer@cs.duke.edu 2/3 of the average game Everyone writes down a number between 0 and 100 Person closest to 2/3 of the average wins Example: A says 50 B says 10 C
More informationIterated Strict Dominance in Pure Strategies
Iterated Strict Dominance in Pure Strategies We know that no rational player ever plays strictly dominated strategies. As each player knows that each player is rational, each player knows that his opponents
More informationC 4,4 0,5 0,0 0,0 B 0,0 0,0 1,1 1 2 T 3,3 0,0. Freund and Schapire: Prisoners Dilemma
On No-Regret Learning, Fictitious Play, and Nash Equilibrium Amir Jafari Department of Mathematics, Brown University, Providence, RI 292 Amy Greenwald David Gondek Department of Computer Science, Brown
More informationExperts in a Markov Decision Process
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2004 Experts in a Markov Decision Process Eyal Even-Dar Sham Kakade University of Pennsylvania Yishay Mansour Follow
More informationLecture 8. Instructor: Haipeng Luo
Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine
More informationECE Optimization for wireless networks Final. minimize f o (x) s.t. Ax = b,
ECE 788 - Optimization for wireless networks Final Please provide clear and complete answers. PART I: Questions - Q.. Discuss an iterative algorithm that converges to the solution of the problem minimize
More informationBelief-based Learning
Belief-based Learning Algorithmic Game Theory Marcello Restelli Lecture Outline Introdutcion to multi-agent learning Belief-based learning Cournot adjustment Fictitious play Bayesian learning Equilibrium
More informationTHS Step By Step Calculus Chapter 1
Name: Class Period: Throughout this packet there will be blanks you are epected to fill in prior to coming to class. This packet follows your Larson Tetbook. Do NOT throw away! Keep in 3 ring binder until
More informationWeak Dominance and Never Best Responses
Chapter 4 Weak Dominance and Never Best Responses Let us return now to our analysis of an arbitrary strategic game G := (S 1,...,S n, p 1,...,p n ). Let s i, s i be strategies of player i. We say that
More informationLA lecture 4: linear eq. systems, (inverses,) determinants
LA lecture 4: linear eq. systems, (inverses,) determinants Yesterday: ˆ Linear equation systems Theory Gaussian elimination To follow today: ˆ Gaussian elimination leftovers ˆ A bit about the inverse:
More informationLecture 3: Latent Variables Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July25, 2006 Machine Learning Summer School, Taiwan
Lecture 3: Latent Variables Models and Learning with the EM Algorithm Sam Roweis Tuesday July25, 2006 Machine Learning Summer School, Taiwan Latent Variable Models What to do when a variable z is always
More informationCSCI-6971 Lecture Notes: Probability theory
CSCI-6971 Lecture Notes: Probability theory Kristopher R. Beevers Department of Computer Science Rensselaer Polytechnic Institute beevek@cs.rpi.edu January 31, 2006 1 Properties of probabilities Let, A,
More informationShort course A vademecum of statistical pattern recognition techniques with applications to image and video analysis. Agenda
Short course A vademecum of statistical pattern recognition techniques with applications to image and video analysis Lecture Recalls of probability theory Massimo Piccardi University of Technology, Sydney,
More information1 Lattices and Tarski s Theorem
MS&E 336 Lecture 8: Supermodular games Ramesh Johari April 30, 2007 In this lecture, we develop the theory of supermodular games; key references are the papers of Topkis [7], Vives [8], and Milgrom and
More information