Supplemental Material for Monte Carlo Sampling for Regret Minimization in Extensive Games

Marc Lanctot, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8
Martin Zinkevich, Yahoo! Research, Santa Clara, CA, USA
Kevin Waugh, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Michael Bowling, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8

1 Introduction

The supplementary material presented here first presents a detailed description of the MCCFR algorithm. We then give proofs of Theorems 3, 4, and 5 from the submission Monte Carlo Sampling for Regret Minimization in Extensive Games. We begin with some preliminaries, then prove a general result about all members of the MCCFR family of algorithms (Theorem 18 in Section 6). We then use that result to prove bounds for the MCCFR variants (Theorems 19 and 20 in Section 7). We finally prove the tightened bound for vanilla CFR (Theorem 21 in Section 8).

2 MCCFR Algorithm

The MCCFR algorithm is presented in detail in Algorithm 1. In Algorithm 1, the average strategy is updated optimistically: the update to the average strategy is weighted equally for every iteration that has passed since the last time the information set was visited. Note that this can be corrected by maintaining weights at each parent information set which get updated whenever they are visited, and pushing the values of the weights down as needed (lazy updating). The average strategy can also be updated stochastically by weighting each update by the inverse of the probability of reaching the information set. The average strategy $\bar{\sigma}$ is obtained by normalizing the values of the cumulative strategy tables $s_I$ for each action at each information set $I$. Although optimistic averaging is not technically a correct average, it performs well empirically.
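As a concrete illustration of the last step, here is a minimal sketch (not the authors' code; the dictionary-based tables are an assumption for illustration) of recovering the average strategy $\bar{\sigma}$ by normalizing the cumulative strategy tables $s_I$:

```python
def average_strategy(cumulative):
    """Normalize cumulative strategy tables into an average strategy.

    cumulative: dict mapping information set -> dict mapping action -> weight,
    i.e. the tables s_I[a] accumulated by Algorithm 1.
    """
    avg = {}
    for infoset, table in cumulative.items():
        total = sum(table.values())
        if total > 0:
            avg[infoset] = {a: w / total for a, w in table.items()}
        else:
            # Information set never updated: fall back to uniform.
            n = len(table)
            avg[infoset] = {a: 1.0 / n for a in table}
    return avg
```

For example, an information set whose cumulative table is `{"a": 3.0, "b": 1.0}` yields the average strategy `{"a": 0.75, "b": 0.25}`.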
We've discussed two novel sampling schemes in this work: outcome-sampling and external sampling.

2.1 Outcome Sampling

When using outcome-sampling, we can do the updates for each player simultaneously on a single pass over the one sampled terminal history.

Algorithm 1 Monte Carlo CFR with optimistic averaging

Require: a sampling scheme $\mathcal{S}$
  Initialize information set markers: $\forall I$, $c_I \leftarrow 0$
  Initialize regret tables: $\forall I$, $r_I[a] \leftarrow 0$
  Initialize cumulative strategy tables: $\forall I$, $s_I[a] \leftarrow 0$
  Initialize initial profile: $\sigma(I, a) \leftarrow 1/|A(I)|$
  for $t \in \{1, 2, 3, \ldots\}$ do
    for $i \in N$ do
      Sample a block of terminal histories $Q \in \mathcal{Q}$ using $\mathcal{S}$
      for each prefix history $z[I]$ of a terminal history $z \in Q$ with $P(z[I]) = i$ do
        for $a \in A(I)$ do
          Let $\tilde{r} = \tilde{r}(I, a)$, the sampled counterfactual regret
          $r_I[a] \leftarrow r_I[a] + \tilde{r}$
          $s_I[a] \leftarrow s_I[a] + (t - c_I)\,\pi^{\sigma}_i(z[I])\,\sigma_i(I, a)$
        end for
        $c_I \leftarrow t$
        $\sigma_i(I) \leftarrow \mathrm{RegretMatching}(r_I)$
      end for
    end for
  end for

When $z[I]a$ is a prefix of $z$ (action $a$ was taken at $I$ in our sampled history) then

  $\tilde{r}(I, a) = \tilde{v}_i(\sigma^t_{(I \to a)}, I) - \tilde{v}_i(\sigma^t, I)$   (1)
  $= \frac{u_i(z)\,\pi^{\sigma}_{-i}(z[I])\,\pi^{\sigma}(z[I]a, z)}{q(z)} - \frac{u_i(z)\,\pi^{\sigma}_{-i}(z[I])\,\pi^{\sigma}(z[I], z)}{q(z)}$   (2)
  $= \frac{u_i(z)\,\pi^{\sigma}_{-i}(z[I])}{q(z)}\left(\pi^{\sigma}(z[I]a, z) - \pi^{\sigma}(z[I], z)\right)$   (3)
  $= W\left(\pi^{\sigma}(z[I]a, z) - \pi^{\sigma}(z[I], z)\right)$   (4)

where

  $W = \frac{u_i(z)\,\pi^{\sigma}_{-i}(z[I])}{q(z)}$   (5)

When $z[I]a$ is not a prefix of $z$, then $\tilde{v}_i(\sigma^t_{(I \to a)}, I) = 0$, so

  $\tilde{r}(I, a) = 0 - \tilde{v}_i(\sigma^t, I)$   (6)
  $= -W\,\pi^{\sigma}(z[I], z)$   (7)

2.2 External Sampling

When using external sampling, we do the updates for each player separately (one pass over the tree for each player). When updating $I$ belonging to player $i$ and $z[I]a$ is a prefix of $z$, then $\pi^{\sigma}_{-i}(z[I], z) = \pi^{\sigma}_{-i}(z[I]a, z)$ since $a$ is taken by $i$, not the opponent. Also note that $q(z) = \pi^{\sigma}_{-i}(z)$. We have the regret:

  $\tilde{r}(I, a) = \tilde{v}_i(\sigma^t_{(I \to a)}, I) - \tilde{v}_i(\sigma^t, I)$   (8)
  $= \sum_{z \in Q \cap Z_I} \frac{u_i(z)\,\pi^{\sigma}_{-i}(z[I])}{q(z)}\left(\pi^{\sigma}(z[I]a, z) - \pi^{\sigma}(z[I], z)\right)$   (9)
  $= \sum_{z \in Q \cap Z_I} \frac{u_i(z)\,\pi^{\sigma}_{-i}(z[I])\,\pi^{\sigma}_{-i}(z[I], z)}{q(z)}\left(\frac{\pi^{\sigma}(z[I]a, z)}{\pi^{\sigma}_{-i}(z[I], z)} - \frac{\pi^{\sigma}(z[I], z)}{\pi^{\sigma}_{-i}(z[I], z)}\right)$   (10)
  $= \sum_{z \in Q \cap Z_I} \frac{u_i(z)\,\pi^{\sigma}_{-i}(z)}{q(z)}\left(\pi^{\sigma}_i(z[I]a, z) - \pi^{\sigma}_i(z[I], z)\right)$   (11)
  $= \sum_{z \in Q \cap Z_I} u_i(z)\left(\pi^{\sigma}_i(z[I]a, z) - \pi^{\sigma}_i(z[I], z)\right)$   (12)

In practice, $\tilde{r}(I, a)$ can be computed for all $a \in A(I)$ directly from Equation (12) by splitting the sum into two terms and recursively traversing the sub-tree under each action in $I$.
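Equations (4)-(7) reduce the outcome-sampled regret to one importance weight $W$ plus two tail probabilities. A minimal sketch, assuming these quantities have already been computed during the traversal (the function and argument names are illustrative, not from the paper):

```python
def outcome_sampling_regret(u_z, reach_opp, q_z, tail_after_a, tail_after_I, on_path):
    """Sampled counterfactual regret r~(I, a) for one sampled terminal history z.

    u_z          -- u_i(z), payoff of the sampled terminal history
    reach_opp    -- pi^sigma_{-i}(z[I]), opponents'/chance probability of reaching z[I]
    q_z          -- q(z), probability that the sampling scheme produced z
    tail_after_a -- pi^sigma(z[I]a, z), probability of playing from z[I]a to z
    tail_after_I -- pi^sigma(z[I], z), probability of playing from z[I] to z
    on_path      -- True iff z[I]a is a prefix of z (a was the sampled action)
    """
    W = u_z * reach_opp / q_z                      # Equation (5)
    if on_path:
        return W * (tail_after_a - tail_after_I)   # Equation (4)
    return -W * tail_after_I                       # Equation (7)
```

With, say, $u_i(z) = 1$, $\pi^{\sigma}_{-i}(z[I]) = 0.5$, $q(z) = 0.25$, this gives $W = 2$, so the sampled regret is $2(\pi^{\sigma}(z[I]a,z) - \pi^{\sigma}(z[I],z))$ on the sampled path.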

3 Preliminaries

There are several basic properties of random variables and real numbers that are necessary to prove the main results.

Lemma 1 For any random variable $X$:

  $\Pr\left[|X| \geq k\sqrt{E[X^2]}\right] \leq \frac{1}{k^2}$.   (13)

Proof: Markov's Inequality states that, if $Y$ is always non-negative:

  $\Pr[Y \geq j\,E[Y]] \leq \frac{1}{j}$.   (14)

By setting $Y = X^2$:

  $\Pr[X^2 \geq j\,E[X^2]] \leq \frac{1}{j}$   (15)
  $\Pr\left[|X| \geq \sqrt{j}\sqrt{E[X^2]}\right] \leq \frac{1}{j}$.   (16)

Replacing $k = \sqrt{j}$:

  $\Pr\left[|X| \geq k\sqrt{E[X^2]}\right] \leq \frac{1}{k^2}$.   (17)

Lemma 2 If $a_1, \ldots, a_n$ are non-negative real numbers in the interval $[0, 1]$ where $\sum_{i=1}^n a_i = S$, then $\sum_{i=1}^n (a_i)^2 \leq S$.

Proof: Assume without loss of generality that $n \geq S$. Suppose that there are two elements $a_i, a_j$ where $a_i < 1$ and $a_j < 1$. If $a_i + a_j \leq 1$, then:

  $(a_i)^2 + (a_j)^2 \leq (a_i)^2 + 2 a_i a_j + (a_j)^2$   (18)
  $= (a_i + a_j)^2$.   (19)

Thus, for maximizing the sum of squares it is better to replace the pair with $(a_i + a_j, 0)$. If $a_i + a_j > 1$, then define $A = a_i + a_j$, and define $f(x) = (A - x)^2 + x^2$. Setting the derivative to zero:

  $0 = f'(x)$   (20)
  $f'(x) = -2(A - x) + 2x$   (21)
  $2A = 4x$   (22)
  $\frac{A}{2} = x$   (23)

Upon further observation, $f''(x) = 4$, implying that $A/2$ is a minimum point. Therefore, since the critical points of $f(x)$ are $A/2$ and the limits of the feasible region, namely $A - 1$ and $1$, the limits of the feasible region must be the maximum points. Therefore, for any two $a_i$ and $a_j$, either:

1. One or the other is zero, or:
2. One or the other is equal to one.

Therefore, there can be no more than one $i$ such that $a_i \in (0, 1)$; all others must be equal to zero or one. Define $i^* = \lfloor S \rfloor$. Without loss of generality, assume for all $i \in \{1, \ldots, i^*\}$, $a_i = 1$, that $a_{i^*+1} = S - \lfloor S \rfloor$, and for all $i \in \{i^* + 2, \ldots, n\}$, $a_i = 0$. The result follows directly.

Lemma 3 If $a_1, \ldots, a_n$ are non-negative real numbers where $\sum_{i=1}^n a_i = S$, then $\sum_{i=1}^n \sqrt{a_i} \leq \sqrt{Sn}$.

Proof: We prove this by induction on $n$. If $n = 1$, then the result is trivial. Otherwise, define $x = \sum_{i=1}^{n-1} a_i$, so that $a_n + x = S$, and therefore by induction $\sum_{i=1}^n \sqrt{a_i} \leq \sqrt{x(n-1)} + \sqrt{S - x}$. Define $f(x) = \sqrt{x(n-1)} + \sqrt{S - x}$. To maximize $f(x)$, we observe that $0$ and $S$ are critical points, and we take the derivative and set it to zero:

  $f'(x) = 0$   (24)
  $f'(x) = 0.5\sqrt{\frac{n-1}{x}} - 0.5\sqrt{\frac{1}{S-x}}$   (25)
  $0.5\sqrt{\frac{n-1}{x}} = 0.5\sqrt{\frac{1}{S-x}}$   (26)
  $\frac{n-1}{x} = \frac{1}{S-x}$   (27)
  $(n-1)(S-x) = x$   (28)
  $x\,\frac{n}{n-1} = S$   (29)
  $x = \frac{S(n-1)}{n}$   (30)

Therefore, substituting the three critical points yields:

  $f(0) = \sqrt{S}$   (31)
  $f(S) = \sqrt{S(n-1)}$   (32)
  $f\!\left(\frac{S(n-1)}{n}\right) = \sqrt{\frac{S(n-1)}{n}(n-1)} + \sqrt{S - \frac{S(n-1)}{n}}$   (33)
  $= (n-1)\sqrt{\frac{S}{n}} + \sqrt{\frac{S}{n}}$   (34)
  $= \sqrt{Sn}$   (35)

The maximum of these is $\sqrt{Sn}$, establishing the inductive step.

Lemma 4 If $b_1, \ldots, b_n$ are non-negative real numbers where $\sum_{i=1}^n (b_i)^2 = S$, then $\sum_{i=1}^n b_i \leq \sqrt{Sn}$.

Proof: Let $a_i = (b_i)^2$ and apply Lemma 3.

Lemma 5 Given non-negative reals $a_{i,j}$ in $[0, 1]$, where $\sum_{i=1}^m \sum_{j=1}^n a_{i,j} = S$, then:

  $\sum_{i=1}^m \sqrt{\sum_{j=1}^n a_{i,j}} \leq \sqrt{mS}$.   (36)
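Lemma 3 is easy to sanity-check numerically (it also follows from the Cauchy-Schwarz inequality). A quick illustrative script, not part of the proof:

```python
import math
import random

random.seed(0)

# Random non-negative a_1..a_n with sum S; Lemma 3 asserts
# sum_i sqrt(a_i) <= sqrt(S * n).
a = [random.random() for _ in range(10)]
S = sum(a)
lhs = sum(math.sqrt(x) for x in a)
rhs = math.sqrt(S * len(a))
assert lhs <= rhs
```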

4 Blackwell's Approachability Theorem

Consider the following more sophisticated bound for the regret matching procedure using Blackwell's approachability.

Lemma 6 For all real $a$, define $a^+ = \max(a, 0)$. For all $a, b$, it is the case that

  $((a + b)^+)^2 \leq (a^+)^2 + 2(a^+)b + b^2$   (37)

Proof: We prove this by enumerating the possibilities:

1. $a \leq 0$. Then $a^+ = 0$, so we have:

  $((a + b)^+)^2 \leq (b^+)^2$   (38)
  $\leq b^2$,   (39)

and:

  $(a^+)^2 + 2(a^+)b + b^2 = b^2$.   (40)

2. $a \geq 0$, $b \geq -a$. Then $a = a^+$ and $(a + b)^+ = (a + b)$. So:

  $((a + b)^+)^2 = (a + b)^2$.   (41)

Also:

  $(a^+)^2 + 2(a^+)b + b^2 = a^2 + 2ab + b^2$   (42)
  $= (a + b)^2$   (43)

3. $a \geq 0$, $b \leq -a$. Then $a = a^+$, and $(a + b)^+ = 0$. So:

  $((a + b)^+)^2 = 0$.   (44)

Also:

  $(a^+)^2 + 2(a^+)b + b^2 = a^2 + 2ab + b^2$   (45)
  $= (a + b)^2$   (46)
  $\geq 0$   (47)

Define $R^+_{P,T} = \sum_{a \in A} R^+_T(a)$. Regret matching is the strategy $\sigma^{T+1}$ where:

  $\sigma^{T+1}(a) = \begin{cases} \frac{R^+_T(a)}{R^+_{P,T}} & \text{if } R^+_{P,T} > 0, \\ \frac{1}{|A|} & \text{otherwise} \end{cases}$   (48)

Lemma 7 If regret matching is used, then:

  $\sum_{a \in A} R^+_T(a)\, r_{T+1}(a) = 0$   (49)
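Lemma 6 is a pointwise inequality in $a$ and $b$, so it can be spot-checked on random inputs. An illustrative check, not part of the proof:

```python
import random

random.seed(1)

def plus(x):
    """a+ = max(a, 0), as defined in Lemma 6."""
    return max(x, 0.0)

# Check ((a+b)+)^2 <= (a+)^2 + 2(a+)b + b^2 on random reals;
# the small epsilon guards against floating-point rounding.
for _ in range(10000):
    a = random.uniform(-5.0, 5.0)
    b = random.uniform(-5.0, 5.0)
    assert plus(a + b) ** 2 <= plus(a) ** 2 + 2.0 * plus(a) * b + b * b + 1e-9
```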

Proof: If $R^+_{P,T} = 0$, then for all $a \in A$, $R^+_T(a) = 0$, and the result is trivial. Otherwise:

  $\sum_{a \in A} R^+_T(a)\, r_{T+1}(a) = \sum_{a \in A} R^+_T(a)\left(u_{T+1}(a) - u_{T+1}(\sigma^{T+1})\right)$   (50)
  $= \left(\sum_{a \in A} R^+_T(a)\, u_{T+1}(a)\right) - u_{T+1}(\sigma^{T+1})\left(\sum_{a \in A} R^+_T(a)\right)$   (51)
  $= \left(\sum_{a \in A} R^+_T(a)\, u_{T+1}(a)\right) - \left(\sum_{a' \in A} \sigma^{T+1}(a')\, u_{T+1}(a')\right) R^+_{P,T}$   (52)
  $= \left(\sum_{a \in A} R^+_T(a)\, u_{T+1}(a)\right) - \sum_{a' \in A} \frac{R^+_T(a')}{R^+_{P,T}}\, u_{T+1}(a')\, R^+_{P,T}$   (53)
  $= \left(\sum_{a \in A} R^+_T(a)\, u_{T+1}(a)\right) - \sum_{a' \in A} R^+_T(a')\, u_{T+1}(a')$   (54)
  $= 0$   (55)

Theorem 8 Define $\Delta_t$ to be $\max_{a, a' \in A}(u_t(a) - u_t(a'))$. Then regret matching yields:

  $\sum_{a \in A} (R^+_T(a))^2 \leq \frac{1}{T^2}\, |A| \sum_{t=1}^T (\Delta_t)^2$.   (56)

Proof: We prove this by recursion on $T$. The base case (for $T = 1$) is obvious. Assuming the bound holds for $T - 1$, we prove it holds for $T$. Since $R_T(a) = \frac{T-1}{T} R_{T-1}(a) + \frac{1}{T} r_T(a)$, by Lemma 6:

  $(R^+_T(a))^2 \leq \left(\frac{(T-1)\,R^+_{T-1}(a)}{T}\right)^2 + \frac{2(T-1)}{T^2}\, R^+_{T-1}(a)\, r_T(a) + \frac{(r_T(a))^2}{T^2}$   (57)

Summing yields:

  $\sum_{a \in A} (R^+_T(a))^2 \leq \left(\frac{T-1}{T}\right)^2 \sum_{a \in A} (R^+_{T-1}(a))^2 + \frac{2(T-1)}{T^2} \sum_{a \in A} R^+_{T-1}(a)\, r_T(a) + \frac{1}{T^2} \sum_{a \in A} (r_T(a))^2$   (58)

By Lemma 7, $\sum_{a \in A} R^+_{T-1}(a)\, r_T(a) = 0$, so:

  $\sum_{a \in A} (R^+_T(a))^2 \leq \left(\frac{T-1}{T}\right)^2 \sum_{a \in A} (R^+_{T-1}(a))^2 + \frac{1}{T^2} \sum_{a \in A} (r_T(a))^2$   (59)

By induction:

  $\sum_{a \in A} (R^+_{T-1}(a))^2 \leq \frac{1}{(T-1)^2}\, |A| \sum_{t=1}^{T-1} (\Delta_t)^2$.   (60)

Note that $|r_T(a)| \leq \Delta_T$. So:

  $\sum_{a \in A} (R^+_T(a))^2 \leq \frac{1}{T^2}\, |A| \sum_{t=1}^{T-1} (\Delta_t)^2 + \frac{1}{T^2}\, |A|\, (\Delta_T)^2$.   (61)

5 Deterministic Strategies

Before delving into the general proof, we need a few gory details involving deterministic strategies.
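The regret matching rule of Equation (48) and the guarantee of Theorem 8 can be simulated directly. The sketch below (illustrative only, not the paper's code) uses cumulative rather than averaged regrets, i.e. the bound of (56) multiplied through by $T^2$, and spot-checks it against arbitrary bounded payoff sequences with $\Delta_t \leq 1$:

```python
import random

random.seed(2)

def regret_matching(regret):
    """Equation (48): play proportionally to positive regret, else uniform."""
    pos = [max(r, 0.0) for r in regret]
    s = sum(pos)
    n = len(regret)
    return [p / s for p in pos] if s > 0 else [1.0 / n] * n

A, T, Delta = 3, 500, 1.0   # |A| actions, T rounds, payoffs in [0, 1]
R = [0.0] * A               # cumulative regrets R_t(a)
for t in range(T):
    sigma = regret_matching(R)
    u = [random.random() for _ in range(A)]           # arbitrary bounded payoffs
    ev = sum(p * x for p, x in zip(sigma, u))          # u_t(sigma^t)
    for a in range(A):
        R[a] += u[a] - ev                              # r_t(a) = u_t(a) - u_t(sigma^t)

# Cumulative form of Theorem 8: sum_a (R_T^+(a))^2 <= Delta^2 * |A| * T.
assert sum(max(r, 0.0) ** 2 for r in R) <= Delta ** 2 * A * T
```

The bound holds for any payoff sequence, not just the random one used here, which is what makes regret matching a no-regret procedure.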

A deterministic strategy $\sigma_i : \mathcal{I}_i \to A(\mathcal{I}_i)$ maps each information set $I_i \in \mathcal{I}_i$ to an action $a \in A(I_i)$. Define $\hat{\Sigma}_i$ to be the set of deterministic strategies for $i$, $\hat{\Sigma} = \times_{i \in N} \hat{\Sigma}_i$, and $\hat{\Sigma}_{-i} = \times_{j \in N \setminus \{i\}} \hat{\Sigma}_j$. Define $I(h)$ to be the information set $I_i \in \mathcal{I}_{P(h)}$ containing $h$. Given a deterministic strategy profile $\sigma$, we can make it into a function from a history to the next action, defined as $\sigma(h) = \sigma_{P(h)}(I(h))$. The terminal history $h(\sigma)$ is the unique $h \in Z$ such that, for all $t \in \{0, \ldots, |h| - 1\}$, $\sigma(h(t)) = h_{t+1}$, where $h(t)$ denotes the prefix of $h$ of length $t$. An information set $I$ is reached with $\sigma$ if for some $h \sqsubseteq h(\sigma)$, $h \in I$. In a game with perfect recall, define $h(\sigma, I)$ to be the unique $h \in I$ where $h \sqsubseteq h(\sigma)$. If no deterministic strategy $\sigma_{-i}$ of the other players allows $I$ to be reached with $(\sigma_i, \sigma_{-i})$, then $I$ is unreachable with $\sigma_i$. In a game with perfect recall, for each information set $I_i \in \mathcal{I}_i$, given two deterministic profiles $\sigma_{-i}$ and $\sigma'_{-i}$, if $(\sigma_i, \sigma_{-i})$ and $(\sigma_i, \sigma'_{-i})$ both reach $I$, then $h((\sigma_i, \sigma_{-i}), I) = h((\sigma_i, \sigma'_{-i}), I)$. Therefore, if $I$ is reachable with $\sigma_i$ we define $h(\sigma_i, I) = h((\sigma_i, \sigma_{-i}), I)$ for some $\sigma_{-i}$ such that $I$ is reached with $(\sigma_i, \sigma_{-i})$. In general, for any set $S \subseteq N$, $I$ is reachable with $\sigma_S = \{\hat{\sigma}_i\}_{i \in S}$ if there exists a set $\sigma_{N \setminus S} = \{\hat{\sigma}_i\}_{i \in N \setminus S}$ such that $I$ is reachable with $(\sigma_S, \sigma_{N \setminus S})$.

Given a history $h' \in H$, one can consider what would happen if $\sigma$ were used to play $h'$ to termination. In particular, define $h(\sigma, h') \in Z$ to be the unique history $h \in Z$ such that $h' \sqsubseteq h$ and for all $t \in \{|h'|, \ldots, |h| - 1\}$, $\sigma(h(t)) = h_{t+1}$. Thus, for all $h' \in H$, we can define $u_i(h', \sigma) = u_i(h(\sigma, h'))$.

Given an action sequence $\vec{a}$, $\sigma_i$ obliviously plays $\vec{a}$ if it plays the actions in $\vec{a}$ deterministically in sequence. In particular, for any information set $I_i \in \mathcal{I}_i$, define $c(I_i) = |X_i(h)|$, the length of the sequence of information sets and actions reached by this player before this information set, for any $h \in I_i$ (in a game with perfect recall, this is well-defined). Then $\sigma_i(I_i) = a_{c(I_i)+1}$, or is arbitrary if $c(I_i) + 1$ is greater than the number of elements of $\vec{a}$.
Lemma 9 For any deterministic profile $\sigma_{-i}$, for any $\vec{a}$, if $I_i \in \mathcal{I}_i(\vec{a})$ is reachable with $\sigma_{-i}$, then it is reachable with $(\sigma_i, \sigma_{-i})$, where $\sigma_i$ obliviously plays $\vec{a}$.

Proof: Since $I_i$ is reachable with $\sigma_{-i}$, there exists some $\sigma'_i$ such that $I_i$ is reachable with $(\sigma'_i, \sigma_{-i})$. By definition, the history $h(\sigma'_i, \sigma_{-i})$ has a prefix $h^* \in I_i$. Define $\vec{a}(t)$ to be the first $t$ elements of $\vec{a}$, and define $\sigma^t_i$ to be the strategy that obliviously plays $\vec{a}(t)$ and whose arbitrary decisions are equal to those of $\sigma'_i$. We will prove by recursion on $t$ that for all $t \leq |\vec{a}|$, $h(\sigma^t_i, \sigma_{-i}) = h(\sigma'_i, \sigma_{-i})$. First of all, observe that $\sigma^0_i = \sigma'_i$, so the basis of the recursion holds. For the inductive step, we assume that $h(\sigma^{t-1}_i, \sigma_{-i}) = h(\sigma'_i, \sigma_{-i})$, and prove that $h(\sigma^t_i, \sigma_{-i}) = h(\sigma'_i, \sigma_{-i})$. Since $h^* \in I_i$, we have $X(h^*) = ((I_1, a_1), \ldots, (I_k, a_k))$, and since $I_i \in \mathcal{I}_i(\vec{a})$, we have $\vec{a} = (a_1, \ldots, a_k)$. Therefore, define $h'$ to be the prefix of $h^*$ in $I_t$. Note that $\sigma^{t-1}_i$ is in control for all of $I_1, \ldots, I_{t-1}$, and then $\sigma'_i$ selects $a_t$ in information set $I_t$. However, since $\sigma^t_i$ would also have selected $a_t$ by definition, $h(\sigma^t_i, \sigma_{-i}) = h(\sigma^{t-1}_i, \sigma_{-i}) = h(\sigma'_i, \sigma_{-i})$. Note that changing later actions of $\sigma_i$ does not affect whether or not $h^*$ is played, so any arbitrary deterministic strategy which is $\vec{a}$-oblivious will work.

Lemma 10 For any deterministic profile $\sigma_{-i}$, for any $\vec{a}$, there is no more than one reachable $I_i \in \mathcal{I}_i(\vec{a})$.

Proof: Consider two information sets $I_i, I'_i \in \mathcal{I}_i(\vec{a})$ where $I_i \neq I'_i$, and for the sake of contradiction, assume both are reachable with $\sigma_{-i}$. Given a $\sigma_i$ which is $\vec{a}$-oblivious, $I_i$ and $I'_i$ are both reachable with $(\sigma_i, \sigma_{-i})$. But there is only one history generated, $h(\sigma_i, \sigma_{-i})$, and therefore there must exist $h \in I_i$ and $h' \in I'_i$, both prefixes of $h(\sigma_i, \sigma_{-i})$. But that implies that $h \sqsubseteq h'$ or vice-versa, meaning that in a perfect recall game, the sequence of prior information sets and actions of either $I_i$ or $I'_i$ must include the other, an obvious contradiction to both having action sequences of equal size.
Lemma 11 For any strategy $\sigma_j \in \Sigma_j$, there exists a distribution $\rho \in \Delta(\hat{\Sigma}_j)$ such that for any $h \in H$,

  $\Pr_{\hat{\sigma}_j \sim \rho}\left[\forall (I, a) \in X_j(h),\ \hat{\sigma}_j(I) = a\right] = \pi^{\sigma_j}_j(h)$.   (62)

Proof: First, we define $\rho(\hat{\sigma}_j)$ to be:

  $\rho(\hat{\sigma}_j) = \prod_{I \in \mathcal{I}_j} \sigma_j(I)(\hat{\sigma}_j(I))$.   (63)

In other words, the probability of playing $\hat{\sigma}_j$ is the probability of playing like $\hat{\sigma}_j$ everywhere. Summing over all actions outside of $X_j(h)$ gives the lemma.

Lemma 12 Given a single player strategy $\sigma_i \in \Sigma_i$, and $\rho$ generated as in Lemma 11, then:

  $\Pr_{\hat{\sigma}_i \sim \rho}\left[\mathrm{Reach}^{\hat{\sigma}_i}_i(h)\right] = \pi^{\sigma_i}_i(h)$.   (64)

Lemma 13 Given a history $h \in H$, $\mathrm{Reach}^{\hat{\sigma}_i}_i(h)$ if and only if for all $(I, a) \in X_i(h)$, $\hat{\sigma}_i(I) = a$.

Proof: First, if for all $(I, a) \in X_i(h)$, $\hat{\sigma}_i(I) = a$, then for every $j \neq i$ we can define $\hat{\sigma}_j$ such that for all $(I, a) \in X_j(h)$, $\hat{\sigma}_j(I) = a$, and for all $I \notin X_j(h)$, $\hat{\sigma}_j(I)$ is arbitrary. Note that for all $h' \sqsubset h$, there exists an $(I, a) \in X_{P(h')}(h)$ where $h' \in I$ and $h_{|h'|+1} = a$. Therefore, for all $h' \sqsubset h$, $\hat{\sigma}_{P(h')}(I(h')) = h_{|h'|+1}$, implying that $\hat{\sigma}$ reaches $h$. Conversely, if $\mathrm{Reach}^{\hat{\sigma}_i}_i(h)$, then there exists a $\hat{\sigma}_{-i}$ such that $h \sqsubseteq h(\hat{\sigma}_i, \hat{\sigma}_{-i})$. Therefore, for all $h' \sqsubset h$, $\hat{\sigma}(h') = h_{|h'|+1}$, and $\hat{\sigma}_{P(h')}(I(h')) = h_{|h'|+1}$. For all $(I, a) \in X_i(h)$, $I = I(h')$ and $a = h_{|h'|+1}$ for some $h' \sqsubset h$. Moreover, $P(h') = i$, implying that $\hat{\sigma}_i(I(h')) = h_{|h'|+1}$.

Corollary 14 Given any $h \in H$, given $i \in N$, there exists a $\hat{\sigma}_i \in \hat{\Sigma}_i$ that reaches $h$.

Proof: For any history $h$, it is easy to construct a strategy which satisfies Lemma 13.

Lemma 15 Given a set of strategies $\hat{\sigma}_S = \{\hat{\sigma}_i\}_{i \in S}$, if for all $i \in S$, $\mathrm{Reach}^{\hat{\sigma}_i}_i(h)$, then $\mathrm{Reach}^{\hat{\sigma}_S}_S(h)$.

Proof: By Corollary 14, for every $i \in N \setminus S$, there exists a strategy $\hat{\sigma}_i$ that reaches $h$. By Lemma 13, for all $i \in N$, for all $(I, a) \in X_i(h)$, $\hat{\sigma}_i(I) = a$. Moreover, this implies that these strategies reconstruct $h$.

Lemma 16 For any strategy profile $\sigma_{-i} \in \Sigma_{-i}$, there exists a distribution $\rho \in \Delta(\hat{\Sigma}_{-i})$ such that for all $I \in \mathcal{I}_i$, $\pi^{\sigma_{-i}}_{-i}(I) = \sum_{\hat{\sigma}_{-i} \in \hat{\Sigma}_{-i} : \mathrm{Reach}(\hat{\sigma}_{-i}, I)} \rho(\hat{\sigma}_{-i})$.

Proof: For all $j \in N \setminus \{i\}$, using Lemma 11, we generate a distribution $\rho_j \in \Delta(\hat{\Sigma}_j)$. Define $\rho$ to be the distribution over $\hat{\Sigma}_{-i}$ obtained by independently sampling each $\hat{\sigma}_j$ from $\rho_j$; formally,

  $\rho(\hat{\sigma}_{-i}) = \prod_{j \in N \setminus \{i\}} \rho_j(\hat{\sigma}_j)$.   (65)

Consider a history $h \in H$. By Lemma 12, $\pi^{\sigma_j}_j(h) = \Pr_{\hat{\sigma}_j \sim \rho_j}[\mathrm{Reach}^{\hat{\sigma}_j}_j(h)]$. Since the strategies are selected independently:

  $\Pr_{\hat{\sigma}_{-i} \sim \rho}\left[\forall j \in N \setminus \{i\},\ \mathrm{Reach}^{\hat{\sigma}_j}_j(h)\right] = \prod_{j \in N \setminus \{i\}} \Pr_{\hat{\sigma}_j \sim \rho_j}\left[\mathrm{Reach}^{\hat{\sigma}_j}_j(h)\right]$   (66)
  $= \prod_{j \in N \setminus \{i\}} \pi^{\sigma_j}_j(h)$   (67)
  $= \pi^{\sigma_{-i}}_{-i}(h)$   (68)

If we sum over all $h \in I$, we get the result.

Lemma 17 For any strategy profile $\sigma_{-i}$, for any $\vec{a}$:

  $\sum_{I \in \mathcal{I}_i(\vec{a})} \pi^{\sigma_{-i}}_{-i}(I) \leq 1$   (69)

Proof: This follows directly from Lemma 16 and Lemma 10.

6 General MCCFR Bound

We begin by proving a very general bound applicable to all algorithms in the MCCFR family. First, define $\mathcal{B}_i = \{\mathcal{I}_i(\vec{a}) : \vec{a} \in \vec{A}_i\}$, so that $M_i = \sum_{B \in \mathcal{B}_i} \sqrt{|B|}$.

Theorem 18 For any $p \in (0, 1]$, when using any algorithm in the MCCFR family such that for all $Q \in \mathcal{Q}$ and $B \in \mathcal{B}_i$,

  $\sum_{I \in B} \left(\sum_{z \in Q \cap Z_I} \frac{\pi^{\sigma}(z[I], z)\,\pi^{\sigma}_{-i}(z[I])}{q(z)}\right)^2 \leq \frac{1}{\delta^2}$   (70)

where $\delta \leq 1$, then with probability at least $1 - p$, average overall regret is bounded by,

  $R^T_i \leq \left(1 + \frac{\sqrt{2}}{\sqrt{p}}\right) \frac{1}{\delta}\, \Delta_{u,i}\, M_i\, \frac{\sqrt{|A_i|}}{\sqrt{T}}$.   (71)

Proof: Define $r^t_i(I, a)$ to be the unsampled immediate counterfactual regret and $\tilde{r}^t_i(I, a)$ to be the sampled immediate counterfactual regret. Formally,

  $r^t_i(I, a) = v_i(\sigma^t_{(I \to a)}, I) - v_i(\sigma^t, I)$   (72)
  $\tilde{r}^t_i(I, a) = \tilde{v}_i(\sigma^t_{(I \to a)}, I) - \tilde{v}_i(\sigma^t, I)$   (73)
  $R^T_i(I) = \frac{1}{T} \max_{a \in A(I)} \sum_{t=1}^T r^t_i(I, a)$   (74)
  $\tilde{R}^T_i(I) = \frac{1}{T} \max_{a \in A(I)} \sum_{t=1}^T \tilde{r}^t_i(I, a)$   (75)

Let $Q_t \in \mathcal{Q}$ be the block sampled at time $t$. Note that we can bound the difference between two sampled counterfactual values for information set $I$ at time $t$ by,

  $\left|\tilde{v}_i(\sigma^t_{(I \to a)}, I) - \tilde{v}_i(\sigma^t, I)\right| \leq \tilde{\Delta}^t_{u,i}(I) = \Delta_{u,i} \sum_{z \in Q_t \cap Z_I} \frac{\pi^{\sigma}(z[I], z)\,\pi^{\sigma}_{-i}(z[I])}{q(z)}$   (76)

so by our assumption,

  $\sum_{I \in B} \left(\tilde{\Delta}^t_{u,i}(I)\right)^2 \leq \frac{\Delta^2_{u,i}}{\delta^2}$.   (77)

So we can apply Theorem 8, to get,

  $\tilde{R}^T_i(I) \leq \frac{1}{T}\sqrt{|A(I)| \sum_{t=1}^T \left(\tilde{\Delta}^t_{u,i}(I)\right)^2}$   (78)

Using Lemma 4,

  $\sum_{I \in B} \tilde{R}^T_i(I) \leq \frac{1}{T} \sum_{I \in B} \sqrt{|A(B)| \sum_{t=1}^T \left(\tilde{\Delta}^t_{u,i}(I)\right)^2}$   (79)
  $\leq \frac{1}{T} \sqrt{|B|\,|A(B)| \sum_{t=1}^T \sum_{I \in B} \left(\tilde{\Delta}^t_{u,i}(I)\right)^2}$   (80)
  $\leq \frac{1}{T} \sqrt{|B|\,|A(B)|\, T\, \Delta^2_{u,i}/\delta^2}$   (81)
  $= \frac{\Delta_{u,i}}{\delta} \sqrt{\frac{|B|\,|A(B)|}{T}}$   (82)

The average overall sampled regret can then be bounded by,

  $\tilde{R}^T_i \leq \sum_{B \in \mathcal{B}_i} \sum_{I \in B} \tilde{R}^T_i(I)$   (83)
  $\leq \sum_{B \in \mathcal{B}_i} \frac{\Delta_{u,i}}{\delta} \sqrt{\frac{|B|\,|A(B)|}{T}}$   (84)
  $\leq \frac{\Delta_{u,i}\sqrt{|A_i|}}{\delta\sqrt{T}} \sum_{B \in \mathcal{B}_i} \sqrt{|B|}$   (85)
  $= \frac{\Delta_{u,i}\, M_i \sqrt{|A_i|}}{\delta\sqrt{T}}$   (86)

We now need to prove that $R$ and $\tilde{R}$ are similar. This last portion is tricky. Since the algorithm is randomized, we cannot guarantee that every information set is reached, let alone that it has converged. Therefore, instead of proving a bound on the absolute difference of $R$ and $\tilde{R}$, we focus on proving a probabilistic connection. In particular, we will bound the expected squared difference between $\sum_{I \in \mathcal{I}_i} R^T_i(I)$ and $\sum_{I \in \mathcal{I}_i} \tilde{R}^T_i(I)$ in order to prove that they are close, and then use Lemma 1 to bound the absolute value. We begin by focusing on the similarity of the counterfactual regrets $R^T_i(I)$ and $\tilde{R}^T_i(I)$ at every node, by focusing on the similarity of the counterfactual regrets of a particular action at a particular time, $r^t_i(I, a)$ and $\tilde{r}^t_i(I, a)$. By the Lemma from the main paper, we know that $E[r^t_i(I, a) - \tilde{r}^t_i(I, a)] = 0$. From Lemma 4 we have,

  $E\left[\left(\sum_{I \in \mathcal{I}_i} (R^T_i(I) - \tilde{R}^T_i(I))\right)^2\right] \leq |\mathcal{I}_i| \sum_{I \in \mathcal{I}_i} E\left[(R^T_i(I) - \tilde{R}^T_i(I))^2\right]$   (87)

So,

  $(R^T_i(I) - \tilde{R}^T_i(I))^2 = \frac{1}{T^2}\left(\max_{a \in A(I)} \sum_{t=1}^T r^t_i(I, a) - \max_{a \in A(I)} \sum_{t=1}^T \tilde{r}^t_i(I, a)\right)^2$   (88)
  $\leq \frac{1}{T^2} \max_{a \in A(I)} \left(\sum_{t=1}^T r^t_i(I, a) - \sum_{t=1}^T \tilde{r}^t_i(I, a)\right)^2$   (89)
  $= \frac{1}{T^2} \max_{a \in A(I)} \left(\sum_{t=1}^T \left(r^t_i(I, a) - \tilde{r}^t_i(I, a)\right)\right)^2$   (90)

Note that if $f(x)$ is monotonically increasing on the non-negative numbers, then $f(\max_a x_a) = \max_a f(x_a)$. Bounding the maximum by the sum over actions and taking expectations,

  $(R^T_i(I) - \tilde{R}^T_i(I))^2 \leq \frac{1}{T^2} \sum_{a \in A(I)} \left(\sum_{t=1}^T \left(r^t_i(I, a) - \tilde{r}^t_i(I, a)\right)\right)^2$   (91, 92)
  $E\left[(R^T_i(I) - \tilde{R}^T_i(I))^2\right] \leq \frac{1}{T^2} \sum_{a \in A(I)} \sum_{t=1}^T E\left[\left(r^t_i(I, a) - \tilde{r}^t_i(I, a)\right)^2\right]$   (93)

The final step is because if $t \neq t'$, then $E[(r^t_i(I, a) - \tilde{r}^t_i(I, a))(r^{t'}_i(I, a) - \tilde{r}^{t'}_i(I, a))] = 0$ (because if $t' > t$, then after time $t$, $\tilde{r}^{t'}_i(I, a)$ is an unbiased estimator of $r^{t'}_i(I, a)$, and vice-versa).

Substituting back into Equation (87):

  $E\left[\left(\sum_{I \in \mathcal{I}_i} (R^T_i(I) - \tilde{R}^T_i(I))\right)^2\right] \leq \frac{|\mathcal{I}_i|}{T^2} \sum_{I \in \mathcal{I}_i} \sum_{a \in A(I)} \sum_{t=1}^T E\left[\left(r^t_i(I, a) - \tilde{r}^t_i(I, a)\right)^2\right]$   (94)
  $= \frac{|\mathcal{I}_i|}{T^2} \sum_{t=1}^T \sum_{B \in \mathcal{B}_i} \sum_{I \in B} \sum_{a \in A(I)} E\left[\left(r^t_i(I, a) - \tilde{r}^t_i(I, a)\right)^2\right]$   (95)

By Equation (72), $r^t_i(I, a) \leq \Delta_{u,i}\,\pi^{\sigma^t}_{-i}(I)$. From Equation (76), $\tilde{r}^t_i(I, a) \leq \tilde{\Delta}^t_{u,i}(I)$. Thus,

  $E\left[\left(r^t_i(I, a) - \tilde{r}^t_i(I, a)\right)^2\right] \leq E\left[\left(r^t_i(I, a)\right)^2 + \left(\tilde{r}^t_i(I, a)\right)^2\right]$   (96)

Note that for all $B \in \mathcal{B}_i$, by Lemma 17:

  $\sum_{I \in B} \left(\Delta_{u,i}\,\pi^{\sigma^t}_{-i}(I)\right)^2 \leq \Delta^2_{u,i} \sum_{I \in B} \pi^{\sigma^t}_{-i}(I)$   (97)
  $\leq \Delta^2_{u,i}$   (98)

Along with Equation (77), and the fact that $\delta \leq 1$, this means,

  $\sum_{I \in B} E\left[\left(r^t_i(I, a) - \tilde{r}^t_i(I, a)\right)^2\right] \leq \Delta^2_{u,i} + \frac{\Delta^2_{u,i}}{\delta^2}$   (99)
  $\leq \frac{2\,\Delta^2_{u,i}}{\delta^2}$   (100)

Returning to Equation (95),

  $E\left[\left(\sum_{I \in \mathcal{I}_i} (R^T_i(I) - \tilde{R}^T_i(I))\right)^2\right] \leq \frac{|\mathcal{I}_i|}{T^2} \sum_{t=1}^T \sum_{B \in \mathcal{B}_i} |A(B)|\,\frac{2\,\Delta^2_{u,i}}{\delta^2}$   (101)
  $\leq \frac{2\,|\mathcal{I}_i|\,|\mathcal{B}_i|\,|A_i|\,\Delta^2_{u,i}}{\delta^2\, T}$   (102)

Thus by Lemma 1, with probability at least $1 - p$,

  $R^T_i \leq \tilde{R}^T_i + \left|R^T_i - \tilde{R}^T_i\right| \leq \frac{\Delta_{u,i}\, M_i \sqrt{|A_i|}}{\delta\sqrt{T}} + \frac{1}{\sqrt{p}}\sqrt{\frac{2\,|\mathcal{I}_i|\,|\mathcal{B}_i|\,|A_i|\,\Delta^2_{u,i}}{\delta^2\, T}}$   (103)

Since $M_i \geq \sqrt{|\mathcal{I}_i|\,|\mathcal{B}_i|}$,

  $R^T_i \leq \left(1 + \frac{\sqrt{2}}{\sqrt{p}}\right) \frac{1}{\delta}\, \Delta_{u,i}\, M_i\, \frac{\sqrt{|A_i|}}{\sqrt{T}}$   (104)

7 Specific MCCFR Variants

We can now apply Theorem 18 to prove a regret bound for outcome-sampling and external-sampling.

7.1 Outcome-Sampling

Theorem 19 For any $p \in (0, 1]$, when using outcome-sampling MCCFR where for all $z \in Z$ either $\pi^{\sigma}_{-i}(z) = 0$ or $q(z) \geq \delta > 0$ at every timestep, with probability at least $1 - p$, average overall regret is bounded by

  $R^T_i \leq \left(1 + \frac{\sqrt{2}}{\sqrt{p}}\right) \frac{1}{\delta}\, \Delta_{u,i}\, M_i\, \frac{\sqrt{|A_i|}}{\sqrt{T}}$   (105)

Proof: We simply need to show that,

  $\sum_{I \in B} \left(\sum_{z \in Q \cap Z_I} \frac{\pi^{\sigma}(z[I], z)\,\pi^{\sigma}_{-i}(z[I])}{q(z)}\right)^2 \leq \frac{1}{\delta^2}$.   (106)

Note that for all $Q \in \mathcal{Q}$, $|Q| = 1$. Also note that for any $B \in \mathcal{B}_i$ there is at most one $I \in B$ such that $Q \cap Z_I \neq \emptyset$. This is because the information sets with $Q \cap Z_I \neq \emptyset$ all have player $i$'s action sequences of different lengths, while all information sets in $B$ have player $i$'s action sequences of the same length. Therefore, only a single term of the inner sum is ever non-zero. Now by our assumption, for all $I$ and $z \in Z_I$ where $\pi^{\sigma}_{-i}(z) > 0$,

  $\frac{\pi^{\sigma}(z[I], z)\,\pi^{\sigma}_{-i}(z[I])}{q(z)} \leq \frac{1}{\delta}$   (107)

as all the terms of the numerator are at most 1. So the one non-zero term is bounded by $1/\delta$ and so the overall sum of squares must be bounded by $1/\delta^2$.

7.2 External-Sampling

Theorem 20 For any $p \in (0, 1]$, when using external-sampling MCCFR, with probability at least $1 - p$, average overall regret is bounded by

  $R^T_i \leq \left(1 + \frac{\sqrt{2}}{\sqrt{p}}\right) \Delta_{u,i}\, M_i\, \frac{\sqrt{|A_i|}}{\sqrt{T}}$   (108)

Proof: We will simply show that,

  $\sum_{I \in B} \left(\sum_{z \in Q \cap Z_I} \frac{\pi^{\sigma}(z[I], z)\,\pi^{\sigma}_{-i}(z[I])}{q(z)}\right)^2 \leq 1$   (109)

that is, Equation (70) holds with $\delta = 1$. Since $q(z) = \pi^{\sigma}_{-i}(z)$, we need to show,

  $\sum_{I \in B} \left(\sum_{z \in Q \cap Z_I} \pi^{\sigma}_i(z[I], z)\right)^2 \leq 1$   (110)

Let $\hat{\sigma}^t$ be a deterministic strategy profile sampled from $\sigma^t$, where $Q$ is the set of histories consistent with $\hat{\sigma}^t_{-i}$. So $Q \cap Z_I \neq \emptyset$ if and only if $I$ is reachable with $\hat{\sigma}^t_{-i}$. By Lemma 10, for all $B \in \mathcal{B}_i$ there is only one $I \in B$ that is reachable; name it $I^*$. Moreover, there is a unique history in $I^*$ that is a prefix of all $z \in Q \cap Z_{I^*}$; name it $h^*$. So for all $z \in Q \cap Z_{I^*}$, $z[I^*] = h^*$. This is because $\hat{\sigma}^t_{-i}$ uniquely specifies the actions for all but player $i$, and $B$ uniquely specifies the actions for player $i$ prior to reaching $I^*$. Define $\rho$ to be a strategy profile for all players (including chance) where $\rho_j = \hat{\sigma}_j$ for $j \neq i$ but $\rho_i = \sigma_i$. Consider a $z \in Q \cap Z_{I^*}$. $z$ must be reachable by $\hat{\sigma}_{-i}$, so $\pi^{\rho}_{-i}(z) = 1$. So

  $\pi^{\sigma}_i(z[I^*], z) = \pi^{\rho}(h^*, z)$   (111)

So,

  $\sum_{I \in B} \left(\sum_{z \in Q \cap Z_I} \pi^{\sigma}_i(z[I], z)\right)^2 = \left(\sum_{z \in Q \cap Z_{I^*}} \pi^{\rho}(h^*, z)\right)^2$   (112)
  $\leq \left(\sum_{z \in Z_{I^*}} \pi^{\rho}(h^*, z)\right)^2$   (113)
  $\leq 1$   (114)

8 Vanilla CFR: A Tighter Bound

In the final proof we use some of the same ideas as in the previous proofs to tighten the original bound of vanilla CFR, so that the bound depends on $M_i$ rather than $|\mathcal{I}_i|$, as with the MCCFR variants.

Theorem 21 When using vanilla CFR for player $i$, $R^T_i \leq \Delta_{u,i}\, M_i\, \sqrt{|A_i|}/\sqrt{T}$.

Proof: Define $\Delta^t_{u,i}(I) = \pi^{\sigma^t}_{-i}(I)\,\Delta_{u,i}$. Using Theorem 8,

  $R^{T,+}_i(I) \leq \frac{1}{T}\sqrt{|A(I)| \sum_{t=1}^T \left(\Delta^t_{u,i}(I)\right)^2}$   (115)
  $= \frac{\Delta_{u,i}}{T}\sqrt{|A(I)| \sum_{t=1}^T \left(\pi^{\sigma^t}_{-i}(I)\right)^2}$.   (116)

By summing over all information sets of $\mathcal{I}_i$, we get:

  $R^{T,+}_i \leq \frac{\Delta_{u,i}}{T} \sum_{I \in \mathcal{I}_i} \sqrt{|A(I)| \sum_{t=1}^T \left(\pi^{\sigma^t}_{-i}(I)\right)^2}$   (117)
  $\leq \frac{\Delta_{u,i}\sqrt{|A_i|}}{T} \sum_{I \in \mathcal{I}_i} \sqrt{\sum_{t=1}^T \left(\pi^{\sigma^t}_{-i}(I)\right)^2}$   (118)
  $\leq \frac{\Delta_{u,i}\sqrt{|A_i|}}{T} \sum_{I \in \mathcal{I}_i} \sqrt{\sum_{t=1}^T \pi^{\sigma^t}_{-i}(I)}$.   (119)

For each action sequence $B \in \mathcal{B}_i$, by Lemma 17:

  $\sum_{I \in B} \sum_{t=1}^T \pi^{\sigma^t}_{-i}(I) \leq \sum_{t=1}^T 1$   (120)
  $= T$   (121)

Therefore, by Lemma 5:

  $\sum_{I \in B} \sqrt{\sum_{t=1}^T \pi^{\sigma^t}_{-i}(I)} \leq \sqrt{|B|\,T}$   (122)

Summing over all $B \in \mathcal{B}_i$ yields:

  $R^{T,+}_i \leq \frac{\Delta_{u,i}\sqrt{|A_i|}}{T} \sum_{B \in \mathcal{B}_i} \sqrt{|B|\,T} = \Delta_{u,i}\, M_i\, \frac{\sqrt{|A_i|}}{\sqrt{T}}$.   (123)

In practice, this makes the bound on vanilla counterfactual regret as tight as the sampling bounds. The distinctive difference is the amount of computation required per iteration.


More information

Discrete Mathematics. On Nash equilibria and improvement cycles in pure positional strategies for Chess-like and Backgammon-like n-person games

Discrete Mathematics. On Nash equilibria and improvement cycles in pure positional strategies for Chess-like and Backgammon-like n-person games Discrete Mathematics 312 (2012) 772 788 Contents lists available at SciVerse ScienceDirect Discrete Mathematics journal homepage: www.elsevier.com/locate/disc On Nash equilibria and improvement cycles

More information

Numerical Sequences and Series

Numerical Sequences and Series Numerical Sequences and Series Written by Men-Gen Tsai email: b89902089@ntu.edu.tw. Prove that the convergence of {s n } implies convergence of { s n }. Is the converse true? Solution: Since {s n } is

More information

Notes on induction proofs and recursive definitions

Notes on induction proofs and recursive definitions Notes on induction proofs and recursive definitions James Aspnes December 13, 2010 1 Simple induction Most of the proof techniques we ve talked about so far are only really useful for proving a property

More information

Convergence and No-Regret in Multiagent Learning

Convergence and No-Regret in Multiagent Learning Convergence and No-Regret in Multiagent Learning Michael Bowling Department of Computing Science University of Alberta Edmonton, Alberta Canada T6G 2E8 bowling@cs.ualberta.ca Abstract Learning in a multiagent

More information

New Bounds for the Joint Replenishment Problem: Tighter, but not always better

New Bounds for the Joint Replenishment Problem: Tighter, but not always better New Bounds for the Joint Replenishment Problem: ighter, but not always better Eric Porras, Rommert Dekker Econometric Institute, inbergen Institute, Erasmus University Rotterdam, P.O. Box 738, 3000 DR

More information

The Complexity of Ergodic Mean-payoff Games,

The Complexity of Ergodic Mean-payoff Games, The Complexity of Ergodic Mean-payoff Games, Krishnendu Chatterjee Rasmus Ibsen-Jensen Abstract We study two-player (zero-sum) concurrent mean-payoff games played on a finite-state graph. We focus on the

More information

J. Flesch, J. Kuipers, G. Schoenmakers, K. Vrieze. Subgame-Perfection in Free Transition Games RM/11/047

J. Flesch, J. Kuipers, G. Schoenmakers, K. Vrieze. Subgame-Perfection in Free Transition Games RM/11/047 J. Flesch, J. Kuipers, G. Schoenmakers, K. Vrieze Subgame-Perfection in Free Transition Games RM/11/047 Subgame-Perfection in Free Transition Games J. Flesch, J. Kuipers, G. Schoenmakers, K. Vrieze October

More information

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan

Monte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis

More information

Lecture 5: Two-point Sampling

Lecture 5: Two-point Sampling Randomized Algorithms Lecture 5: Two-point Sampling Sotiris Nikoletseas Professor CEID - ETY Course 2017-2018 Sotiris Nikoletseas, Professor Randomized Algorithms - Lecture 5 1 / 26 Overview A. Pairwise

More information

Using Regret Estimation to Solve Games Compactly. Dustin Morrill

Using Regret Estimation to Solve Games Compactly. Dustin Morrill Using Regret Estimation to Solve Games Compactly by Dustin Morrill A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science Department of Computing Science University

More information

14.1 Finding frequent elements in stream

14.1 Finding frequent elements in stream Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours

More information

arxiv: v2 [cs.ds] 27 Nov 2014

arxiv: v2 [cs.ds] 27 Nov 2014 Single machine scheduling problems with uncertain parameters and the OWA criterion arxiv:1405.5371v2 [cs.ds] 27 Nov 2014 Adam Kasperski Institute of Industrial Engineering and Management, Wroc law University

More information

HW11. Due: December 6, 2018

HW11. Due: December 6, 2018 CSCI 1010 Theory of Computation HW11 Due: December 6, 2018 Attach a fully filled-in cover sheet to the front of your printed homework. Your name should not appear anywhere; the cover sheet and each individual

More information

Shortest paths with negative lengths

Shortest paths with negative lengths Chapter 8 Shortest paths with negative lengths In this chapter we give a linear-space, nearly linear-time algorithm that, given a directed planar graph G with real positive and negative lengths, but no

More information

CS6999 Probabilistic Methods in Integer Programming Randomized Rounding Andrew D. Smith April 2003

CS6999 Probabilistic Methods in Integer Programming Randomized Rounding Andrew D. Smith April 2003 CS6999 Probabilistic Methods in Integer Programming Randomized Rounding April 2003 Overview 2 Background Randomized Rounding Handling Feasibility Derandomization Advanced Techniques Integer Programming

More information

Online Learning Class 12, 20 March 2006 Andrea Caponnetto, Sanmay Das

Online Learning Class 12, 20 March 2006 Andrea Caponnetto, Sanmay Das Online Learning 9.520 Class 12, 20 March 2006 Andrea Caponnetto, Sanmay Das About this class Goal To introduce the general setting of online learning. To describe an online version of the RLS algorithm

More information

11.1 Set Cover ILP formulation of set cover Deterministic rounding

11.1 Set Cover ILP formulation of set cover Deterministic rounding CS787: Advanced Algorithms Lecture 11: Randomized Rounding, Concentration Bounds In this lecture we will see some more examples of approximation algorithms based on LP relaxations. This time we will use

More information

On Acyclicity of Games with Cycles

On Acyclicity of Games with Cycles On Acyclicity of Games with Cycles Daniel Andersson, Vladimir Gurvich, and Thomas Dueholm Hansen Dept. of Computer Science, Aarhus University, {koda,tdh}@cs.au.dk RUTCOR, Rutgers University, gurvich@rutcor.rutgers.edu

More information

25.1 Markov Chain Monte Carlo (MCMC)

25.1 Markov Chain Monte Carlo (MCMC) CS880: Approximations Algorithms Scribe: Dave Andrzejewski Lecturer: Shuchi Chawla Topic: Approx counting/sampling, MCMC methods Date: 4/4/07 The previous lecture showed that, for self-reducible problems,

More information

Partitioning Metric Spaces

Partitioning Metric Spaces Partitioning Metric Spaces Computational and Metric Geometry Instructor: Yury Makarychev 1 Multiway Cut Problem 1.1 Preliminaries Definition 1.1. We are given a graph G = (V, E) and a set of terminals

More information

Lecture 17: D-Stable Polynomials and Lee-Yang Theorems

Lecture 17: D-Stable Polynomials and Lee-Yang Theorems Counting and Sampling Fall 2017 Lecture 17: D-Stable Polynomials and Lee-Yang Theorems Lecturer: Shayan Oveis Gharan November 29th Disclaimer: These notes have not been subjected to the usual scrutiny

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

Nash-solvable bidirected cyclic two-person game forms

Nash-solvable bidirected cyclic two-person game forms DIMACS Technical Report 2008-13 November 2008 Nash-solvable bidirected cyclic two-person game forms by Endre Boros 1 RUTCOR, Rutgers University 640 Bartholomew Road, Piscataway NJ 08854-8003 boros@rutcor.rutgers.edu

More information

The Las-Vegas Processor Identity Problem (How and When to Be Unique)

The Las-Vegas Processor Identity Problem (How and When to Be Unique) The Las-Vegas Processor Identity Problem (How and When to Be Unique) Shay Kutten Department of Industrial Engineering The Technion kutten@ie.technion.ac.il Rafail Ostrovsky Bellcore rafail@bellcore.com

More information

CSE 525 Randomized Algorithms & Probabilistic Analysis Spring Lecture 3: April 9

CSE 525 Randomized Algorithms & Probabilistic Analysis Spring Lecture 3: April 9 CSE 55 Randomized Algorithms & Probabilistic Analysis Spring 01 Lecture : April 9 Lecturer: Anna Karlin Scribe: Tyler Rigsby & John MacKinnon.1 Kinds of randomization in algorithms So far in our discussion

More information

Solution Set for Homework #1

Solution Set for Homework #1 CS 683 Spring 07 Learning, Games, and Electronic Markets Solution Set for Homework #1 1. Suppose x and y are real numbers and x > y. Prove that e x > ex e y x y > e y. Solution: Let f(s = e s. By the mean

More information

Lecture 14: Approachability and regret minimization Ramesh Johari May 23, 2007

Lecture 14: Approachability and regret minimization Ramesh Johari May 23, 2007 MS&E 336 Lecture 4: Approachability and regret minimization Ramesh Johari May 23, 2007 In this lecture we use Blackwell s approachability theorem to formulate both external and internal regret minimizing

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden 1 Selecting Efficient Correlated Equilibria Through Distributed Learning Jason R. Marden Abstract A learning rule is completely uncoupled if each player s behavior is conditioned only on his own realized

More information

The Lovász Local Lemma : A constructive proof

The Lovász Local Lemma : A constructive proof The Lovász Local Lemma : A constructive proof Andrew Li 19 May 2016 Abstract The Lovász Local Lemma is a tool used to non-constructively prove existence of combinatorial objects meeting a certain conditions.

More information

1 Stochastic Dynamic Programming

1 Stochastic Dynamic Programming 1 Stochastic Dynamic Programming Formally, a stochastic dynamic program has the same components as a deterministic one; the only modification is to the state transition equation. When events in the future

More information

Endre Boros b Khaled Elbassioni d

Endre Boros b Khaled Elbassioni d R u t c o r Research R e p o r t On Nash equilibria and improvement cycles in pure positional strategies for Chess-like and Backgammon-like n-person games a Endre Boros b Khaled Elbassioni d Vladimir Gurvich

More information

CMPUT 675: Approximation Algorithms Fall 2014

CMPUT 675: Approximation Algorithms Fall 2014 CMPUT 675: Approximation Algorithms Fall 204 Lecture 25 (Nov 3 & 5): Group Steiner Tree Lecturer: Zachary Friggstad Scribe: Zachary Friggstad 25. Group Steiner Tree In this problem, we are given a graph

More information

Efficient Approximation for Restricted Biclique Cover Problems

Efficient Approximation for Restricted Biclique Cover Problems algorithms Article Efficient Approximation for Restricted Biclique Cover Problems Alessandro Epasto 1, *, and Eli Upfal 2 ID 1 Google Research, New York, NY 10011, USA 2 Department of Computer Science,

More information

MTAT Complexity Theory October 20th-21st, Lecture 7

MTAT Complexity Theory October 20th-21st, Lecture 7 MTAT.07.004 Complexity Theory October 20th-21st, 2011 Lecturer: Peeter Laud Lecture 7 Scribe(s): Riivo Talviste Polynomial hierarchy 1 Turing reducibility From the algorithmics course, we know the notion

More information

Lecture #5. Dependencies along the genome

Lecture #5. Dependencies along the genome Markov Chains Lecture #5 Background Readings: Durbin et. al. Section 3., Polanski&Kimmel Section 2.8. Prepared by Shlomo Moran, based on Danny Geiger s and Nir Friedman s. Dependencies along the genome

More information

Notes 6 : First and second moment methods

Notes 6 : First and second moment methods Notes 6 : First and second moment methods Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Roc, Sections 2.1-2.3]. Recall: THM 6.1 (Markov s inequality) Let X be a non-negative

More information

A Parameterized Family of Equilibrium Profiles for Three-Player Kuhn Poker

A Parameterized Family of Equilibrium Profiles for Three-Player Kuhn Poker A Parameterized Family of Equilibrium Profiles for Three-Player Kuhn Poker Duane Szafron University of Alberta Edmonton, Alberta dszafron@ualberta.ca Richard Gibson University of Alberta Edmonton, Alberta

More information

LOWER BOUNDS FOR THE UNCAPACITATED FACILITY LOCATION PROBLEM WITH USER PREFERENCES. 1 Introduction

LOWER BOUNDS FOR THE UNCAPACITATED FACILITY LOCATION PROBLEM WITH USER PREFERENCES. 1 Introduction LOWER BOUNDS FOR THE UNCAPACITATED FACILITY LOCATION PROBLEM WITH USER PREFERENCES PIERRE HANSEN, YURI KOCHETOV 2, NENAD MLADENOVIĆ,3 GERAD and Department of Quantitative Methods in Management, HEC Montréal,

More information

Supplementary lecture notes on linear programming. We will present an algorithm to solve linear programs of the form. maximize.

Supplementary lecture notes on linear programming. We will present an algorithm to solve linear programs of the form. maximize. Cornell University, Fall 2016 Supplementary lecture notes on linear programming CS 6820: Algorithms 26 Sep 28 Sep 1 The Simplex Method We will present an algorithm to solve linear programs of the form

More information

Decision making, Markov decision processes

Decision making, Markov decision processes Decision making, Markov decision processes Solved tasks Collected by: Jiří Kléma, klema@fel.cvut.cz Spring 2017 The main goal: The text presents solved tasks to support labs in the A4B33ZUI course. 1 Simple

More information

Distributed Optimization. Song Chong EE, KAIST

Distributed Optimization. Song Chong EE, KAIST Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links

More information

Stochastic Shortest Path Problems

Stochastic Shortest Path Problems Chapter 8 Stochastic Shortest Path Problems 1 In this chapter, we study a stochastic version of the shortest path problem of chapter 2, where only probabilities of transitions along different arcs can

More information

Experts in a Markov Decision Process

Experts in a Markov Decision Process University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2004 Experts in a Markov Decision Process Eyal Even-Dar Sham Kakade University of Pennsylvania Yishay Mansour Follow

More information

Computational aspects of two-player zero-sum games Course notes for Computational Game Theory Section 4 Fall 2010

Computational aspects of two-player zero-sum games Course notes for Computational Game Theory Section 4 Fall 2010 Computational aspects of two-player zero-sum games Course notes for Computational Game Theory Section 4 Fall 010 Peter Bro Miltersen January 16, 011 Version 1. 4 Finite Perfect Information Games Definition

More information

Streaming Algorithms for Optimal Generation of Random Bits

Streaming Algorithms for Optimal Generation of Random Bits Streaming Algorithms for Optimal Generation of Random Bits ongchao Zhou, and Jehoshua Bruck, Fellow, IEEE arxiv:09.0730v [cs.i] 4 Sep 0 Abstract Generating random bits from a source of biased coins (the

More information

Linear Programming Methods

Linear Programming Methods Chapter 11 Linear Programming Methods 1 In this chapter we consider the linear programming approach to dynamic programming. First, Bellman s equation can be reformulated as a linear program whose solution

More information

Theoretical and Practical Advances on Smoothing for Extensive-Form Games

Theoretical and Practical Advances on Smoothing for Extensive-Form Games Theoretical and Practical Advances on Smoothing for Extensive-Form Games CHRISTIAN KROER, Carnegie Mellon University KEVIN WAUGH, University of Alberta FATMA KILINÇ-KARZAN, Carnegie Mellon University TUOMAS

More information

0.1 Uniform integrability

0.1 Uniform integrability Copyright c 2009 by Karl Sigman 0.1 Uniform integrability Given a sequence of rvs {X n } for which it is known apriori that X n X, n, wp1. for some r.v. X, it is of great importance in many applications

More information

CS261: Problem Set #3

CS261: Problem Set #3 CS261: Problem Set #3 Due by 11:59 PM on Tuesday, February 23, 2016 Instructions: (1) Form a group of 1-3 students. You should turn in only one write-up for your entire group. (2) Submission instructions:

More information

A RANDOM VARIANT OF THE GAME OF PLATES AND OLIVES

A RANDOM VARIANT OF THE GAME OF PLATES AND OLIVES A RANDOM VARIANT OF THE GAME OF PLATES AND OLIVES ANDRZEJ DUDEK, SEAN ENGLISH, AND ALAN FRIEZE Abstract The game of plates and olives was originally formulated by Nicolaescu and encodes the evolution of

More information

A note on monotone real circuits

A note on monotone real circuits A note on monotone real circuits Pavel Hrubeš and Pavel Pudlák March 14, 2017 Abstract We show that if a Boolean function f : {0, 1} n {0, 1} can be computed by a monotone real circuit of size s using

More information

Vertex colorings of graphs without short odd cycles

Vertex colorings of graphs without short odd cycles Vertex colorings of graphs without short odd cycles Andrzej Dudek and Reshma Ramadurai Department of Mathematical Sciences Carnegie Mellon University Pittsburgh, PA 1513, USA {adudek,rramadur}@andrew.cmu.edu

More information

R u t c o r Research R e p o r t. On Acyclicity of Games with Cycles. Daniel Andersson a Vladimir Gurvich b Thomas Dueholm Hansen c

R u t c o r Research R e p o r t. On Acyclicity of Games with Cycles. Daniel Andersson a Vladimir Gurvich b Thomas Dueholm Hansen c R u t c o r Research R e p o r t On Acyclicity of Games with Cycles Daniel Andersson a Vladimir Gurvich b Thomas Dueholm Hansen c RRR 8-8, November 8 RUTCOR Rutgers Center for Operations Research Rutgers

More information

The Algorithmic Foundations of Adaptive Data Analysis November, Lecture The Multiplicative Weights Algorithm

The Algorithmic Foundations of Adaptive Data Analysis November, Lecture The Multiplicative Weights Algorithm he Algorithmic Foundations of Adaptive Data Analysis November, 207 Lecture 5-6 Lecturer: Aaron Roth Scribe: Aaron Roth he Multiplicative Weights Algorithm In this lecture, we define and analyze a classic,

More information

ACO Comprehensive Exam October 14 and 15, 2013

ACO Comprehensive Exam October 14 and 15, 2013 1. Computability, Complexity and Algorithms (a) Let G be the complete graph on n vertices, and let c : V (G) V (G) [0, ) be a symmetric cost function. Consider the following closest point heuristic for

More information

The Game of Normal Numbers

The Game of Normal Numbers The Game of Normal Numbers Ehud Lehrer September 4, 2003 Abstract We introduce a two-player game where at each period one player, say, Player 2, chooses a distribution and the other player, Player 1, a

More information

Space and Nondeterminism

Space and Nondeterminism CS 221 Computational Complexity, Lecture 5 Feb 6, 2018 Space and Nondeterminism Instructor: Madhu Sudan 1 Scribe: Yong Wook Kwon Topic Overview Today we ll talk about space and non-determinism. For some

More information

6 The Principle of Optimality

6 The Principle of Optimality 6 The Principle of Optimality De nition A T shot deviation from a strategy s i is a strategy bs i such that there exists T such that bs i (h t ) = s i (h t ) for all h t 2 H with t T De nition 2 A one-shot

More information

For any y 2C, any sub-gradient, v, of h at prox(x), i.e., v and by optimality of prox(x) in (3), we have

For any y 2C, any sub-gradient, v, of h at prox(x), i.e., v and by optimality of prox(x) in (3), we have A Proofs We now give the details for the proof of our main results, i.e., heorems and 2. Below, we outline the steps for the proof of FAG s heorem. he proof of heorem 2 for FARE follows the same line of

More information

Upper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions

Upper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions Upper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions Matthew J. Streeter Computer Science Department and Center for the Neural Basis of Cognition Carnegie Mellon University

More information

Planning in Markov Decision Processes

Planning in Markov Decision Processes Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov

More information

Bipartite Perfect Matching

Bipartite Perfect Matching Bipartite Perfect Matching We are given a bipartite graph G = (U, V, E). U = {u 1, u 2,..., u n }. V = {v 1, v 2,..., v n }. E U V. We are asked if there is a perfect matching. A permutation π of {1, 2,...,

More information

Self-stabilizing uncoupled dynamics

Self-stabilizing uncoupled dynamics Self-stabilizing uncoupled dynamics Aaron D. Jaggard 1, Neil Lutz 2, Michael Schapira 3, and Rebecca N. Wright 4 1 U.S. Naval Research Laboratory, Washington, DC 20375, USA. aaron.jaggard@nrl.navy.mil

More information

Proving simple set properties...

Proving simple set properties... Proving simple set properties... Part 1: Some examples of proofs over sets Fall 2013 Proving simple set properties... Fall 2013 1 / 17 Introduction Overview: Learning outcomes In this session we will...

More information