Minimizing Regret on Reflexive Banach Spaces and Nash Equilibria in Continuous Zero-Sum Games

Maximilian Balandat, Walid Krichene, Claire Tomlin, Alexandre Bayen
Electrical Engineering and Computer Sciences, UC Berkeley

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

We study a general adversarial online learning problem, in which we are given a decision set $\mathcal{X}$ in a reflexive Banach space $X$ and a sequence of reward vectors in the dual space of $X$. At each iteration, we choose an action from $\mathcal{X}$, based on the observed sequence of previous rewards. Our goal is to minimize regret. Using results from infinite dimensional convex analysis, we generalize the method of Dual Averaging to our setting and obtain upper bounds on the worst-case regret that generalize many previous results. Under the assumption of uniformly continuous rewards, we obtain explicit regret bounds in a setting where the decision set is the set of probability distributions on a compact metric space $S$. Importantly, we make no convexity assumptions on either $S$ or the reward functions. We also prove a general lower bound on the worst-case regret for any online algorithm. We then apply these results to the problem of learning in repeated two-player zero-sum games on compact metric spaces. In doing so, we first prove that if both players play a Hannan-consistent strategy, then with probability 1 the empirical distributions of play weakly converge to the set of Nash equilibria of the game. We then show that, under mild assumptions, Dual Averaging on the (infinite-dimensional) space of probability distributions indeed achieves Hannan-consistency.

1 Introduction

Regret analysis is a general technique for designing and analyzing algorithms for sequential decision problems in adversarial or stochastic settings (Shalev-Shwartz, 2012; Bubeck and Cesa-Bianchi). Online learning algorithms have applications in machine learning (Xiao, 2010), portfolio optimization (Cover, 1991), online convex optimization (Hazan et al., 2007) and other areas.

Regret analysis also plays an important role in the study of repeated play of finite games (Hart and Mas-Colell). It is well known, for example, that in a two-player zero-sum finite game, if both players play according to a Hannan-consistent strategy (Hannan, 1957), their (marginal) empirical distributions of play almost surely converge to the set of Nash equilibria of the game (Cesa-Bianchi and Lugosi, 2006). Moreover, it can be shown that playing a strategy that achieves sublinear regret almost surely guarantees Hannan-consistency.

A natural question then is whether a similar result holds for games with infinite action sets. In this article we provide a positive answer. In particular, we prove that in a continuous two-player zero-sum game over compact (not necessarily convex) metric spaces, if both players follow a Hannan-consistent strategy, then with probability 1 their empirical distributions of play weakly converge to the set of Nash equilibria of the game. This in turn raises another important question: Do algorithms that ensure Hannan-consistency exist in such a setting? More generally, can one develop algorithms that guarantee sublinear growth of the worst-case regret? We answer these questions affirmatively as well.

To this end, we develop a general framework to study the Dual Averaging (or Follow the Regularized Leader) method on reflexive Banach spaces. This framework generalizes a wide range of existing results in the literature, including algorithms for online learning on finite sets (Arora et al., 2012) and finite-dimensional online convex optimization (Hazan et al., 2007).

Given a convex subset $\mathcal{X}$ of a reflexive Banach space $X$, the generalized Dual Averaging (DA) method maximizes, at each iteration, the cumulative past rewards (which are elements of $X^*$, the dual space of $X$) minus a regularization term $h$. We show that under certain conditions, the maximizer in the DA update is the Fréchet gradient $Dh^*$ of the regularizer's conjugate function. In doing so, we develop a novel characterization of the duality between essential strong convexity of $h$ and essential Fréchet differentiability of $h^*$ in reflexive Banach spaces, which is of independent interest.

We apply these general results to the problem of minimizing regret when the rewards are uniformly continuous functions over a compact metric space $S$. Importantly, we do not assume convexity of either $S$ or the rewards, and show that it is possible to achieve sublinear regret under a mild geometric condition on $S$ (namely, the existence of a locally $Q$-regular Borel measure). We provide explicit bounds for a class of regularizers, which guarantee sublinear worst-case regret. We also prove a general lower bound on the regret for any online algorithm and show that DA asymptotically achieves this bound up to a log factor.

Our results are related to work by Lehrer (2003) and Sridharan and Tewari (2010); Srebro et al. (2011). Lehrer (2003) gives necessary geometric conditions for Blackwell approachability in infinite-dimensional spaces, but no implementable algorithm guaranteeing Hannan-consistency. Sridharan and Tewari (2010) derive general regret bounds for Mirror Descent (MD) under the assumption that the strategy set is uniformly bounded in the norm of the Banach space. We do not make such an assumption here. In fact, this assumption does not hold in general for our applications in Section 3.

The paper is organized as follows: In Section 2 we introduce and provide a general analysis of Dual Averaging in reflexive Banach spaces.
In Section 3 we apply these results to obtain explicit regret bounds on compact metric spaces with uniformly continuous reward functions. We use these results in Section 4 in the context of learning Nash equilibria in continuous two-player zero-sum games, and provide a numerical example in Section 4. All proofs are given in the supplementary material.

2 Regret Minimization on Reflexive Banach Spaces

Consider a sequential decision problem in which we are to choose a sequence $(x_1, x_2, \dots)$ of actions from some feasible subset $\mathcal{X}$ of a reflexive Banach space $X$, and seek to maximize a sequence $(u_1(x_1), u_2(x_2), \dots)$ of rewards, where the $u_\tau : X \to \mathbb{R}$ are elements of a given subset $\mathcal{U} \subset X^*$, with $X^*$ the dual space of $X$. We assume that $x_t$, the action chosen at time $t$, may only depend on the sequence of previously observed reward vectors $(u_1, \dots, u_{t-1})$. We call any such algorithm an online algorithm. We consider the adversarial setting, i.e., we do not make any distributional assumptions on the rewards. In particular, they could be picked maliciously by some adversary.

The notion of regret is a standard measure of performance for such a sequential decision problem. For a sequence $(u_1, \dots, u_t)$ of reward vectors, and a sequence of decisions $(x_1, \dots, x_t)$ produced by an algorithm, the regret of the algorithm w.r.t. a (fixed) decision $x \in \mathcal{X}$ is the gap between the realized reward and the reward under $x$, i.e.,

$$R_t(x) := \sum_{\tau=1}^{t} u_\tau(x) - \sum_{\tau=1}^{t} u_\tau(x_\tau).$$

The regret is defined as $R_t := \sup_{x \in \mathcal{X}} R_t(x)$. An algorithm is said to have sublinear regret if for any sequence $(u_t)_{t \ge 1}$ in the set of admissible reward functions $\mathcal{U}$, the regret grows sublinearly, i.e. $\limsup_{t} R_t / t \le 0$.

Example 1. Consider a finite action set $S = \{1, \dots, n\}$, let $X = X^* = \mathbb{R}^n$, and let $\mathcal{X} = \Delta^{n-1}$, the probability simplex in $\mathbb{R}^n$. A reward function can be identified with a vector $u \in \mathbb{R}^n$, such that the $i$-th element $u_i$ is the reward of action $i$. A choice $x \in \mathcal{X}$ corresponds to a randomization over the $n$ actions in $S$. This is the classic setting of many regret-minimizing algorithms in the literature.

Example 2. Suppose $S$ is a compact metric space with $\mu$ a finite measure on $S$.
Consider $X = X^* = L^2(S, \mu)$ and let $\mathcal{X} = \{x \in X : x \ge 0 \text{ a.e.}, \|x\|_1 = 1\}$. A reward function is an $L^2$-integrable function on $S$, and each choice $x \in \mathcal{X}$ corresponds to a probability distribution (absolutely continuous w.r.t. $\mu$) over $S$. We will explore a more general variant of this problem in Section 3.

In this Section, we prove a general bound on the worst-case regret for DA. DA was introduced by Nesterov (2009) for (finite dimensional) convex optimization, and has also been applied to online learning, e.g. by Xiao (2010). In the finite dimensional case, the method solves, at each iteration, the optimization problem

$$x_{t+1} = \arg\max_{x \in \mathcal{X}} \; \eta_t \Big\langle \sum_{\tau=1}^{t} u_\tau, x \Big\rangle - h(x),$$

where $h$ is a strongly convex regularizer defined on $\mathcal{X} \subset \mathbb{R}^n$ and $(\eta_t)_{t \ge 0}$ is a sequence of learning rates. The regret analysis of the method relies on the duality between strong convexity and smoothness (Nesterov, 2009, Lemma 1). In order to generalize DA to our Banach space setting, we develop an analogous duality result in Theorem 1. In particular, we show that the correct notion of strong convexity is (uniform) essential strong convexity. Equipped with this duality result, we analyze the regret of the Dual Averaging method and derive a general bound in Theorem 2.

2.1 Preliminaries

Let $(X, \|\cdot\|)$ be a reflexive Banach space, and denote by $\langle \cdot, \cdot \rangle : X \times X^* \to \mathbb{R}$ the canonical pairing between $X$ and its dual space $X^*$, so that $\langle x, \xi \rangle := \xi(x)$ for all $x \in X$, $\xi \in X^*$. By the effective domain of an extended real-valued function $f : X \to [-\infty, +\infty]$ we mean the set $\operatorname{dom} f = \{x \in X : f(x) < +\infty\}$. A function $f$ is proper if $f > -\infty$ and $\operatorname{dom} f$ is non-empty. The conjugate or Legendre-Fenchel transform of $f$ is the function $f^* : X^* \to [-\infty, +\infty]$ given by

$$f^*(\xi) = \sup_{x \in X} \; \langle x, \xi \rangle - f(x) \tag{1}$$

for all $\xi \in X^*$. If $f$ is proper, lower semicontinuous and convex, its subdifferential $\partial f$ is the set-valued mapping $\partial f(x) = \{\xi \in X^* : f(y) \ge f(x) + \langle y - x, \xi \rangle \text{ for all } y \in X\}$. We define $\operatorname{dom} \partial f := \{x \in X : \partial f(x) \ne \emptyset\}$. Let $\Gamma$ denote the set of all convex, lower semicontinuous functions $\gamma : [0, \infty) \to [0, \infty]$ such that $\gamma(0) = 0$, and let

$$\Gamma_U := \{\gamma \in \Gamma : \forall r > 0, \ \gamma(r) > 0\}, \qquad \Gamma_L := \{\gamma \in \Gamma : \gamma(r)/r \to 0 \text{ as } r \to 0\}. \tag{2}$$

We now introduce some definitions. Additional results are reviewed in the supplementary material.

Definition 1 (Strömberg). A proper convex lower semicontinuous function $f : X \to (-\infty, \infty]$ is essentially strongly convex if
(i) $f$ is strictly convex on every convex subset of $\operatorname{dom} \partial f$,
(ii) $(\partial f)^{-1}$ is locally bounded on its domain,
(iii) for every $x_0 \in \operatorname{dom} \partial f$ there exist $\xi_0 \in X^*$ and $\gamma \in \Gamma_U$ such that

$$f(x) \ge f(x_0) + \langle x - x_0, \xi_0 \rangle + \gamma(\|x - x_0\|), \quad \forall x \in X. \tag{3}$$

If (3) holds with $\gamma$ independent of $x_0$, $f$ is uniformly essentially strongly convex with modulus $\gamma$.
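Definition 2 (Strömberg). A proper convex lower semicontinuous function $f : X \to (-\infty, \infty]$ is essentially Fréchet differentiable if $\operatorname{int} \operatorname{dom} f \ne \emptyset$, $f$ is Fréchet differentiable on $\operatorname{int} \operatorname{dom} f$ with Fréchet derivative $Df$, and $\|Df(x_j)\|_* \to \infty$ for any sequence $(x_j)_j$ in $\operatorname{int} \operatorname{dom} f$ converging to some boundary point of $\operatorname{dom} f$.

Definition 3. A proper Fréchet differentiable function $f : X \to (-\infty, \infty]$ is essentially strongly smooth if $\forall x_0 \in \operatorname{dom} \partial f$, $\exists\, \xi_0 \in X^*$, $\kappa \in \Gamma_L$ such that

$$f(x) \le f(x_0) + \langle \xi_0, x - x_0 \rangle + \kappa(\|x - x_0\|), \quad \forall x \in X. \tag{4}$$

If (4) holds with $\kappa$ independent of $x_0$, $f$ is uniformly essentially strongly smooth with modulus $\kappa$.

With this we are now ready to give our main duality result:

Theorem 1. Let $f : X \to (-\infty, +\infty]$ be proper, lower semicontinuous and uniformly essentially strongly convex with modulus $\gamma \in \Gamma_U$. Then
(i) $f^*$ is proper and essentially Fréchet differentiable with Fréchet derivative

$$Df^*(\xi) = \arg\max_{x \in X} \; \langle x, \xi \rangle - f(x). \tag{5}$$

If, in addition, $\tilde\gamma(r) := \gamma(r)/r$ is strictly increasing, then

$$\|Df^*(\xi_1) - Df^*(\xi_2)\| \le \tilde\gamma^{-1}\big(\|\xi_1 - \xi_2\|_* / 2\big). \tag{6}$$

In other words, $Df^*$ is uniformly continuous with modulus of continuity $\chi(r) = \tilde\gamma^{-1}(r/2)$.
(ii) $f^*$ is uniformly essentially smooth with modulus $\gamma^*$.

Corollary 1. If $\gamma(r) \ge C\, r^{1+\kappa}$, $\forall r \ge 0$, then $\|Df^*(\xi_1) - Df^*(\xi_2)\| \le (2C)^{-1/\kappa} \|\xi_1 - \xi_2\|_*^{1/\kappa}$. In particular, with $\gamma(r) = \frac{K}{2} r^2$, Definition 1 becomes the classic definition of $K$-strong convexity, and (6) yields the result familiar from the finite-dimensional case that the gradient $Df^*$ is $1/K$-Lipschitz with respect to the dual norm (Nesterov, 2009, Lemma 1).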
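The finite-dimensional case of Corollary 1 can be checked numerically. The sketch below is an illustration only and rests on assumptions not spelled out in the text: it takes the setting of Example 1 with the negative entropy $h(x) = \sum_i x_i \log x_i$ on the simplex, which is 1-strongly convex with respect to the $\ell_1$ norm (Pinsker's inequality), so that $Dh^*$ is the softmax map and the predicted bound is $\|Dh^*(\xi_1) - Dh^*(\xi_2)\|_1 \le \|\xi_1 - \xi_2\|_\infty$. The function names and the random test points are hypothetical.

```python
import numpy as np

def softmax(xi):
    """Gradient of h*(xi) = log(sum(exp(xi))), the conjugate of negative entropy on the simplex."""
    w = np.exp(xi - np.max(xi))  # shift by the max for numerical stability
    return w / w.sum()

def worst_ratio(n=5, trials=2000, seed=0):
    """Largest observed ratio ||Dh*(xi1) - Dh*(xi2)||_1 / ||xi1 - xi2||_inf over random pairs."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        xi1 = rng.normal(scale=3.0, size=n)
        xi2 = xi1 + rng.normal(scale=rng.uniform(0.01, 3.0), size=n)
        num = np.abs(softmax(xi1) - softmax(xi2)).sum()   # primal (l1) norm
        den = np.abs(xi1 - xi2).max()                     # dual (l_inf) norm
        worst = max(worst, num / den)
    return worst

# With K = 1 the corollary predicts that this ratio never exceeds 1.
print(worst_ratio())
```

A ratio above 1 on any pair would contradict the $1/K$-Lipschitz statement for this regularizer; random sampling of course only probes the bound and does not prove it.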

2.2 Dual Averaging in Reflexive Banach Spaces

We call a proper convex function $h : X \to (-\infty, +\infty]$ a regularizer function on a set $\mathcal{X} \subset X$ if $h$ is essentially strongly convex and $\operatorname{dom} h = \mathcal{X}$. We emphasize that we do not assume $h$ to be Fréchet-differentiable. Definition 1 in conjunction with Lemma S.1 (supplemental material) implies that for any regularizer $h$, the supremum of any function of the form $\langle \cdot, \xi \rangle - h(\cdot)$ over $\mathcal{X}$, where $\xi \in X^*$, will be attained at a unique element of $\mathcal{X}$, namely $Dh^*(\xi)$, the Fréchet gradient of $h^*$ at $\xi$.

DA with regularizer $h$ and a sequence of learning rates $(\eta_t)_{t \ge 1}$ generates a sequence of decisions using the simple update rule $x_{t+1} = Dh^*(\eta_t U_t)$, where $U_t = \sum_{\tau=1}^{t} u_\tau$ and $U_0 := 0$.

Theorem 2. Let $h$ be a uniformly essentially strongly convex regularizer on $\mathcal{X}$ with modulus $\gamma$ and let $(\eta_t)_{t \ge 1}$ be a positive non-increasing sequence of learning rates. Then, for any sequence of payoff functions $(u_t)_{t \ge 1}$ in $X^*$ for which there exists $M < \infty$ such that $\sup_{x \in \mathcal{X}} |\langle u_t, x \rangle| \le M$ for all $t$, the sequence of plays $(x_t)_{t \ge 0}$ given by

$$x_{t+1} = Dh^*\Big(\eta_t \sum_{\tau=1}^{t} u_\tau\Big) \tag{7}$$

ensures that

$$R_t(x) := \sum_{\tau=1}^{t} \langle u_\tau, x \rangle - \sum_{\tau=1}^{t} \langle u_\tau, x_\tau \rangle \le \frac{h(x) - \underline{h}}{\eta_t} + \sum_{\tau=1}^{t} \|u_\tau\|_* \, \tilde\gamma^{-1}\Big(\frac{\eta_{\tau-1}}{2} \|u_\tau\|_*\Big), \tag{8}$$

where $\underline{h} = \inf_{x \in \mathcal{X}} h(x)$, $\tilde\gamma(r) := \gamma(r)/r$ and $\eta_0 := \eta_1$.

It is possible to obtain a regret bound similar to (8) also in a continuous-time setting. In fact, following Kwon and Mertikopoulos (2014), we derive the bound (8) by first proving a bound on a suitably defined notion of continuous-time regret, and then bounding the difference between the continuous-time and discrete-time regrets. This analysis is detailed in the supplementary material.

Note that the condition that $\sup_{x \in \mathcal{X}} |\langle u_t, x \rangle| \le M$ in Theorem 2 is weaker than the one in Sridharan and Tewari (2010), as it does not imply a uniformly bounded strategy set (e.g., if $X = L^2(\mathbb{R})$ and $\mathcal{X}$ is the set of distributions on $X$, then $\mathcal{X}$ is unbounded in $L^2$, but the condition may still hold).

Theorem 2 provides a regret bound for a particular choice $x \in \mathcal{X}$. Recall that $R_t := \sup_{x \in \mathcal{X}} R_t(x)$. In Example 1 the set $\mathcal{X}$ is compact, so any continuous regularizer $h$ will be bounded, and hence taking the supremum over $x$ in (8) poses no issue.
However, this is not the case in our general setting, as the regularizer may be unbounded on $\mathcal{X}$. For instance, consider Example 2 with the entropy regularizer $h(x) = \int_S x(s) \log(x(s))\, ds$, which is easily seen to be unbounded on $\mathcal{X}$. As a consequence, obtaining a worst-case bound will in general require additional assumptions on the reward functions and the decision set $\mathcal{X}$. This will be investigated in detail in Section 3.

Corollary 2. Suppose that $\gamma(r) \ge C\, r^{1+\kappa}$, $\forall r \ge 0$, for some $C > 0$ and $\kappa > 0$. Then

$$R_t(x) \le \frac{h(x) - \underline{h}}{\eta_t} + (2C)^{-1/\kappa} \sum_{\tau=1}^{t} \eta_{\tau-1}^{1/\kappa} \|u_\tau\|_*^{1+1/\kappa}. \tag{9}$$

In particular, if $\|u_t\|_* \le M$ for all $t$ and $\eta_t = \eta\, t^{-\beta}$, then

$$R_t(x) \le \frac{h(x) - \underline{h}}{\eta}\, t^{\beta} + \frac{\kappa}{\kappa - \beta} \Big(\frac{\eta}{2C}\Big)^{1/\kappa} M^{1+1/\kappa}\, t^{1 - \beta/\kappa}. \tag{10}$$

Assuming $h$ is bounded, optimizing over $\beta$ yields a rate of $R_t(x) = O(t^{\frac{\kappa}{1+\kappa}})$. In particular, if $\gamma(r) = \frac{K}{2} r^2$, which corresponds to the classic definition of strong convexity, then $R_t(x) = O(\sqrt{t})$. For non-vanishing $u_\tau$ we will need that $\eta_t \searrow 0$ for the sum in (9) to converge. Thus we could get potentially tighter control over the rate of this term for $\kappa < 1$, at the expense of larger constants.

3 Online Optimization on Compact Metric Spaces

We now apply the above results to the problem of minimizing regret on compact metric spaces under the additional assumption of uniformly continuous reward functions. We make no assumptions on convexity of either the feasible set or the rewards. Essentially, we lift the non-convex problem of minimizing a sequence of functions over the (possibly non-convex) set $S$ to the convex (albeit infinite-dimensional) problem of minimizing a sequence of linear functionals over a set $\mathcal{X}$ of probability measures (a convex subset of the vector space of measures on $S$).
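The lifting idea can be illustrated on a finite grid. The following is a minimal sketch under stated assumptions, not the algorithm analyzed in Section 3: it discretizes $S = [0, 1]$ into $n$ points, uses non-concave rewards $u_t(s) = \cos(6\pi(s - c_t))$ with random phases $c_t$, and runs the update (7) with the entropy regularizer on the simplex, so that the expected reward $\langle u_t, x_t \rangle$ is linear in the distribution $x_t$. The helper names, the reward family and the learning rate $\eta_t = \sqrt{\log n / t}$ are all hypothetical choices.

```python
import numpy as np

def make_rewards(T, n, seed=0):
    """Non-concave rewards on an n-point grid of [0, 1], one row per round, values in [-1, 1]."""
    rng = np.random.default_rng(seed)
    s = np.linspace(0.0, 1.0, n)
    centers = rng.uniform(size=T)
    return np.cos(6.0 * np.pi * (s[None, :] - centers[:, None]))

def hedge_regret(rewards):
    """Run x_t = Dh*(eta_{t-1} U_{t-1}) with entropy regularizer; return the regret R_T."""
    T, n = rewards.shape
    cum = np.zeros(n)      # U_0 = 0
    earned = 0.0
    for t in range(1, T + 1):
        eta = np.sqrt(np.log(n) / max(t - 1, 1))   # eta_{t-1}, with eta_0 := eta_1
        z = eta * cum
        x = np.exp(z - z.max())
        x /= x.sum()                               # softmax = Dh* for negative entropy
        u = rewards[t - 1]
        earned += float(u @ x)                     # expected reward is linear in x
        cum += u
    return float(cum.max()) - earned               # best fixed grid point vs. realized reward

T, n = 2000, 100
regret = hedge_regret(make_rewards(T, n))
bound = np.sqrt(np.log(n)) * (3.0 * np.sqrt(T) + 1.0)  # from (8) with M = 1, gamma(r) = r^2 / 2
print(regret, bound)
```

The comparison value `bound` comes from evaluating (8) for this finite setting with $h(x) - \underline{h} \le \log n$ and $\tilde\gamma^{-1}(s) = 2s$; it is meant as a sanity check rather than a tight constant.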

3.1 An Upper Bound on the Worst-Case Regret

Let $(S, d)$ be a compact metric space, and let $\mu$ be a Borel measure on $S$. Suppose that the reward vectors $u_\tau$ are given by elements in $L^q(S, \mu)$, where $q > 1$. Let $X = L^p(S, \mu)$, where $p$ and $q$ are Hölder conjugates, i.e., $\frac{1}{p} + \frac{1}{q} = 1$. Consider $\mathcal{X} = \{x \in X : x \ge 0 \text{ a.e.}, \|x\|_1 = 1\}$, the set of probability measures on $S$ that are absolutely continuous w.r.t. $\mu$ with $p$-integrable Radon-Nikodym derivatives. Moreover, denote by $\mathcal{Z}$ the class of non-decreasing $\chi : [0, \infty) \to [0, \infty]$ such that $\lim_{r \to 0} \chi(r) = \chi(0) = 0$. The following assumption will be made throughout this section:

Assumption 1. The reward vectors $u_t$ have modulus of continuity $\chi$ on $S$, uniformly in $t$. That is, there exists $\chi \in \mathcal{Z}$ such that $|u_t(s) - u_t(s')| \le \chi(d(s, s'))$ for all $t$ and for all $s, s' \in S$.

Let $B(s, r) = \{s' \in S : d(s, s') < r\}$ and denote by $\mathcal{B}(s, \delta) \subset \mathcal{X}$ the elements of $\mathcal{X}$ with support contained in $B(s, \delta)$. Furthermore, let $D_S := \sup_{s, s' \in S} d(s, s')$. Then we have the following:

Theorem 3. Let $(S, d)$ be compact, and suppose that Assumption 1 holds. Let $h$ be a uniformly essentially strongly convex regularizer on $\mathcal{X}$ with modulus $\gamma$, and let $(\eta_t)_{t \ge 1}$ be a positive non-increasing sequence of learning rates. Then, under (7), for any positive sequence $(\vartheta_t)_{t \ge 1}$,

$$R_t \le \sup_{s \in S} \inf_{x \in \mathcal{B}(s, \vartheta_t)} \frac{h(x) - \underline{h}}{\eta_t} + t\, \chi(\vartheta_t) + \sum_{\tau=1}^{t} \|u_\tau\|_* \, \tilde\gamma^{-1}\Big(\frac{\eta_{\tau-1}}{2} \|u_\tau\|_*\Big). \tag{11}$$

Remark 1. The sequence $(\vartheta_t)_{t \ge 1}$ in Theorem 3 is not a parameter of the algorithm, but rather a parameter in the regret bound. In particular, (11) holds true for any such sequence, and we will use this fact later on to obtain explicit bounds by instantiating (11) with a particular choice of $(\vartheta_t)_{t \ge 1}$.

It is important to realize that the infimum over $\mathcal{B}(s, \vartheta_t)$ in (11) may be infinite, in which case the bound is meaningless. This happens for example if $s$ is an isolated point of some $S \subset \mathbb{R}^n$ and $\mu$ is the Lebesgue measure, in which case $\mathcal{B}(s, \vartheta_t) = \emptyset$. However, under an additional regularity assumption on the measure $\mu$ we can avoid such degenerate situations.

Definition 4 (Heinonen et al.). A Borel measure $\mu$ on a metric space $(S, d)$ is (Ahlfors) $Q$-regular if there exist $0 < c_0 \le C_0 < \infty$ such that for any open ball $B(s, r)$

$$c_0\, r^Q \le \mu(B(s, r)) \le C_0\, r^Q. \tag{12}$$

We say that $\mu$ is $r_0$-locally $Q$-regular if (12) holds for all $0 < r \le r_0$.

Intuitively, under an $r_0$-locally $Q$-regular measure, the mass in the neighborhood of any point of $S$ is uniformly bounded from above and below. This will allow, at each iteration, to assign sufficient probability mass around the maximizer(s) of the cumulative reward function.

Example 3. The canonical example for a $Q$-regular measure is the Lebesgue measure $\lambda$ on $\mathbb{R}^n$. If $d$ is the metric induced by the Euclidean norm, then $Q = n$ and the bound (12) is tight with $c_0 = C_0$, a dimensional constant. However, for general sets $S \subset \mathbb{R}^n$, $\lambda$ need not be locally $Q$-regular. A sufficient condition for local regularity of $\lambda$ is that $S$ is $v$-uniformly fat (Krichene et al., 2015).

Assumption 2. The measure $\mu$ is $r_0$-locally $Q$-regular on $(S, d)$.

Under Assumption 2, $\mathcal{B}(s, \vartheta_t) \ne \emptyset$ for all $s \in S$ and $\vartheta_t > 0$, hence we may hope for a bound on $\inf_{x \in \mathcal{B}(s, \vartheta_t)} h(x)$ uniform in $s$. To obtain explicit convergence rates, we have to consider a more specific class of regularizers.

3.2 Explicit Rates for f-divergences on $L^p(S)$

We consider a particular class of regularizers called f-divergences or Csiszár divergences (Csiszár). Following Audibert et al. (2014), we define $\omega$-potentials and the associated f-divergence.

Definition 5. Let $\omega \le 0$ and $a \in (-\infty, +\infty]$. A continuous increasing diffeomorphism $\phi : (-\infty, a) \to (\omega, \infty)$ is an $\omega$-potential if $\lim_{z \to -\infty} \phi(z) = \omega$, $\lim_{z \to a} \phi(z) = +\infty$ and $\phi(0) \le 1$. Associated to $\phi$ is the convex function $f_\phi : [0, \infty) \to \mathbb{R}$ defined by $f_\phi(x) = \int_1^x \phi^{-1}(z)\, dz$ and the $f_\phi$-divergence, defined by $h_\phi(x) = \int_S f_\phi(x(s))\, d\mu(s) + \iota_{\mathcal{X}}(x)$, where $\iota_{\mathcal{X}}$ is the indicator function of $\mathcal{X}$ (i.e. $\iota_{\mathcal{X}}(x) = 0$ if $x \in \mathcal{X}$ and $\iota_{\mathcal{X}}(x) = +\infty$ if $x \notin \mathcal{X}$).

A remarkable fact is that for regularizers based on $\omega$-potentials, the DA update (7) can be computed efficiently. More precisely, it can be shown (see Proposition 3 in Krichene (2015)) that the maximizer in this case has a simple expression in terms of the dual problem, and the problem of computing $x_{t+1} = Dh^*(\eta_t \sum_{\tau=1}^{t} u_\tau)$ reduces to computing a scalar dual variable $\nu_t^*$.
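To make the role of the scalar dual variable concrete, here is a hedged sketch on a discretized space. It assumes, as a reading of the first-order optimality conditions for potentials with $\omega = 0$ and not as a statement of Proposition 3 in Krichene (2015), that the maximizer has the form $x(s) = \phi(\xi(s) - \nu)$ with $\nu$ chosen so that $x$ integrates to one, and it finds $\nu$ by bisection. The function names, the grid weights `mu`, and the second potential $\phi(z) = (1 - z)^{-2}$ on $(-\infty, 1)$ are illustrative assumptions.

```python
import numpy as np

def phi_entropy(z):
    """The potential phi(z) = exp(z - 1), which yields f(x) = x log x (a = +inf, omega = 0)."""
    return np.exp(z - 1.0)

def phi_power(z):
    """A hypothetical potential phi(z) = 1 / (1 - z)^2 on (-inf, 1), with phi(0) = 1 (a = 1, omega = 0)."""
    return 1.0 / (1.0 - z) ** 2

def da_update(xi, mu, phi, a=np.inf, iters=200):
    """Return the grid density x = phi(xi - nu), with nu found by bisection so that sum(x * mu) = 1."""
    mass = lambda nu: float(np.sum(phi(xi - nu) * mu))
    top = float(np.max(xi))
    # Upper end of the bracket: the mass decreases in nu, so walk right until it is <= 1.
    hi, step = top, 1.0
    while mass(hi) > 1.0:
        hi, step = hi + step, 2.0 * step
    # Lower end: for finite a the mass blows up as nu -> top - a; otherwise walk left.
    if np.isfinite(a):
        lo = top - a
    else:
        lo, step = hi - 1.0, 1.0
        while mass(lo) < 1.0:
            lo, step = lo - step, 2.0 * step
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid == lo or mid == hi:      # interval collapsed to floating-point precision
            break
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return phi(xi - hi)

# Example: cumulative (scaled) rewards on a uniform grid of [0, 1] with mu(S) = 1.
s = np.linspace(0.0, 1.0, 101)
mu = np.full(s.size, 1.0 / s.size)
xi = 3.0 * np.cos(4.0 * s)
x_ent = da_update(xi, mu, phi_entropy)
x_pow = da_update(xi, mu, phi_power, a=1.0)
```

For the entropy potential the result can be compared with the closed form $\exp \xi / \|\exp \xi\|_1$, which gives a simple consistency check of the bisection.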

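Proposition 1. Suppose that $\mu(S) = 1$, and that Assumption 2 holds with constants $r_0 > 0$ and $0 < c_0 \le C_0 < \infty$. Under the Assumptions of Theorem 3, with $h = h_\phi$ the regularizer associated to an $\omega$-potential $\phi$, we have that, for any positive sequence $(\vartheta_t)_{t \ge 1}$ with $\vartheta_t \le r_0$,

$$R_t \le \frac{\min(C_0 \vartheta_t^Q, \mu(S))}{\eta_t}\, f_\phi\big(c_0^{-1} \vartheta_t^{-Q}\big) + t\, \chi(\vartheta_t) + \sum_{\tau=1}^{t} \|u_\tau\|_* \, \tilde\gamma^{-1}\Big(\frac{\eta_{\tau-1}}{2} \|u_\tau\|_*\Big). \tag{13}$$

For particular choices of the sequences $(\eta_t)_{t \ge 1}$ and $(\vartheta_t)_{t \ge 1}$, we can derive explicit regret rates.

3.3 Analysis for Entropy Dual Averaging (The Generalized Hedge Algorithm)

Taking $\phi(z) = e^{z-1}$, we have that $f_\phi(x) = \int_1^x \phi^{-1}(z)\, dz = x \log x$, and hence the regularizer is $h_\phi(x) = \int_S x(s) \log x(s)\, d\mu(s)$. Then

$$Dh^*(\xi)(s) = \frac{\exp \xi(s)}{\|\exp \xi(s)\|_1}.$$

This corresponds to a generalized Hedge algorithm (Arora et al., 2012; Krichene et al., 2015) or the entropic barrier of Bubeck and Eldan (2014) for Euclidean spaces. The regularizer $h_\phi$ can be shown to be essentially strongly convex with modulus $\gamma(r) = \frac{1}{2} r^2$.

Corollary 3. Suppose that $\mu(S) = 1$, that $\mu$ is $r_0$-locally $Q$-regular with constants $c_0, C_0$, that $\|u_t\|_* \le M$ for all $t$, and that $\chi(r) = C_\alpha r^\alpha$ for $0 < \alpha \le 1$ (that is, the rewards are $\alpha$-Hölder continuous). Then, under Entropy Dual Averaging, choosing $\eta_t = \eta \sqrt{\log t / t}$ with

$$\eta = \frac{1}{M} \Big( \frac{C_0}{2 c_0} \Big( \log\big(c_0^{-1} \vartheta^{-Q/\alpha}\big) + \frac{Q}{2\alpha} \Big) \Big)^{1/2}$$

and $\vartheta > 0$, we have that

$$R_t \le \Big( 2M \sqrt{\frac{2 C_0}{c_0} \Big( \log\big(c_0^{-1} \vartheta^{-Q/\alpha}\big) + \frac{Q}{2\alpha} \Big)} + C_\alpha \vartheta \Big) \sqrt{t \log t} \tag{14}$$

whenever $\sqrt{\log t / t} < r_0^{\alpha} \vartheta^{-1}$.

One can now further optimize over the choice of $\vartheta$ to obtain the best constant in the bound. Note also that the case $\alpha = 1$ corresponds to Lipschitz continuity.

3.4 A General Lower Bound

Theorem 4. Let $(S, d)$ be compact, suppose that Assumption 2 holds, and let $w : \mathbb{R} \to \mathbb{R}$ be any function with modulus of continuity $\chi \in \mathcal{Z}$ such that $\|w(d(\cdot, s'))\|_q \le M$ for some $s' \in S$ for which there exists $s \in S$ with $d(s, s') = D_S$. Then for any online algorithm, there exists a sequence $(u_\tau)_{\tau=1}^{t}$ of reward vectors $u_\tau \in X^*$ with $\|u_\tau\|_* \le M$ and modulus of continuity $\chi_\tau < \chi$ such that

$$R_t \ge \frac{w(D_S)}{2\sqrt{2}} \sqrt{t}. \tag{15}$$

Maximizing the constant in (15) is of interest in order to benchmark the bound against the upper bounds obtained in the previous sections. This problem is however quite challenging, and we will defer this analysis to future work.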
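As a reading aid for Corollary 3 in Section 3.3, the following sketch shows how the three terms of (13) simplify for the entropy regularizer. It is not the proof from the supplementary material. It assumes that the constant inside the minimum in (13) is $C_0$, that $\mu(S) = 1$ (so $c_0 \vartheta_t^Q \le 1$ and the logarithm below is non-negative), and it uses $f_\phi(x) = x \log x$ together with $\gamma(r) = \frac{1}{2} r^2$, i.e. $\tilde\gamma(r) = r/2$ and $\tilde\gamma^{-1}(s) = 2s$.

```latex
\begin{align*}
\frac{\min(C_0 \vartheta_t^{Q}, \mu(S))}{\eta_t}\, f_\phi\big(c_0^{-1}\vartheta_t^{-Q}\big)
  &\le \frac{C_0 \vartheta_t^{Q}}{\eta_t}\; c_0^{-1}\vartheta_t^{-Q} \log\big(c_0^{-1}\vartheta_t^{-Q}\big)
   = \frac{C_0}{c_0}\, \frac{\log\big(c_0^{-1}\vartheta_t^{-Q}\big)}{\eta_t}, \\
t\, \chi(\vartheta_t) &= C_\alpha\, t\, \vartheta_t^{\alpha}, \\
\sum_{\tau=1}^{t} \|u_\tau\|_*\, \tilde\gamma^{-1}\Big(\frac{\eta_{\tau-1}}{2}\|u_\tau\|_*\Big)
  &= \sum_{\tau=1}^{t} \eta_{\tau-1} \|u_\tau\|_*^{2}
   \le M^{2} \sum_{\tau=1}^{t} \eta_{\tau-1}.
\end{align*}
```

With $\eta_t$ proportional to $\sqrt{\log t / t}$ and $\vartheta_t^{\alpha}$ proportional to $\sqrt{\log t / t}$, each of the three terms is of order $\sqrt{t \log t}$, which is consistent with the rate in (14).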
For Hölder-continuous functions, we have the following result:

Proposition 2. In the setting of Theorem 4, suppose that $\mu(S) = 1$ and that $\chi(r) = C_\alpha r^\alpha$ for some $0 < \alpha \le 1$. Then

$$R_t \ge \frac{\min\big(C_\alpha^{1/\alpha} D_S^{\alpha}, M\big)}{2\sqrt{2}} \sqrt{t}. \tag{16}$$

Observe that, up to a log factor, the asymptotic rate of this general lower bound for any online algorithm matches that of the upper bound (14) of Entropy Dual Averaging.

4 Learning in Continuous Two-Player Zero-Sum Games

Consider a two-player zero-sum game $G = (S^1, S^2, u)$, in which the strategy spaces $S^1$ and $S^2$ of player 1 and 2, respectively, are Hausdorff spaces, and $u : S^1 \times S^2 \to \mathbb{R}$ is the payoff function of player 1 (as $G$ is zero-sum, the payoff function of player 2 is $-u$). For each $i$, denote by $\mathcal{P}^i := \mathcal{P}(S^i)$ the set of Borel probability measures on $S^i$. Denote $S := S^1 \times S^2$ and $\mathcal{P} := \mathcal{P}^1 \times \mathcal{P}^2$. For a (joint) mixed strategy $x \in \mathcal{P}$, we define the natural extension $\bar u : \mathcal{P} \to \mathbb{R}$ by $\bar u(x) := \mathbb{E}_x[u] = \int_S u(s^1, s^2)\, dx(s^1, s^2)$, which is the expected payoff of player 1 under $x$.

A continuous zero-sum game $G$ is said to have value $V$ if

$$\sup_{x^1 \in \mathcal{P}^1} \inf_{x^2 \in \mathcal{P}^2} \bar u(x^1, x^2) = \inf_{x^2 \in \mathcal{P}^2} \sup_{x^1 \in \mathcal{P}^1} \bar u(x^1, x^2) = V. \tag{17}$$

The elements $x^1 \times x^2 \in \mathcal{P}$ at which (17) holds are the (mixed) Nash Equilibria of $G$. We denote the set of Nash equilibria of $G$ by $\mathcal{N}(G)$. In the case of finite games, it is well known that every two-player zero-sum game has a value. This is not true in general for continuous games, and additional conditions on strategy sets and payoffs are required, see e.g. (Glicksberg).

4.1 Repeated Play

We consider repeated play of the continuous two-player zero-sum game. Given a game $G$ and a sequence of plays $(s^1_t)_{t \ge 1}$ and $(s^2_t)_{t \ge 1}$, we say that player $i$ has sublinear (realized) regret if

$$\limsup_{t \to \infty} \; \sup_{s^i \in S^i} \; \frac{1}{t} \sum_{\tau=1}^{t} \Big( u^i(s^i, s^{-i}_\tau) - u^i(s^i_\tau, s^{-i}_\tau) \Big) \le 0 \tag{18}$$

where we use $-i$ to denote the other player.

A strategy $\sigma^i$ for player $i$ is, loosely speaking, a (possibly random) mapping from past observations to its actions. Of primary interest to us are Hannan-consistent strategies:

Definition 6 (Hannan, 1957). A strategy $\sigma^i$ of player $i$ is Hannan consistent if, for any sequence $(s^{-i}_t)_{t \ge 1}$, the sequence of plays $(s^i_t)_{t \ge 1}$ generated by $\sigma^i$ has sublinear regret almost surely.

Note that the almost sure statement in Definition 6 is with respect to the randomness in the strategy $\sigma^i$. The following result is a generalization of its counterpart for discrete games (e.g. Corollary 7.1 in (Cesa-Bianchi and Lugosi, 2006)):

Proposition 3. Suppose $G$ has value $V$ and consider a sequence of plays $(s^1_t)_{t \ge 1}$, $(s^2_t)_{t \ge 1}$ and assume that both players have sublinear realized regret. Then $\lim_{t \to \infty} \frac{1}{t} \sum_{\tau=1}^{t} u(s^1_\tau, s^2_\tau) = V$.

As in the discrete case (Cesa-Bianchi and Lugosi, 2006), we can also say something about convergence of the empirical distributions of play to the set of Nash Equilibria. Since these distributions have finite support for every $t$, we can at best hope for convergence in the weak sense as follows:

Theorem 5. Suppose that in a repeated two-player zero-sum game $G$ that has a value both players follow a Hannan-consistent strategy, and denote by $\hat x^i_t = \frac{1}{t} \sum_{\tau=1}^{t} \delta_{s^i_\tau}$ the marginal empirical distribution of play of player $i$ at iteration $t$.
Let $\hat x_t := (\hat x^1_t, \hat x^2_t)$. Then $\hat x_t \rightharpoonup \mathcal{N}(G)$ almost surely, that is, with probability 1 the sequence $(\hat x_t)_{t \ge 1}$ weakly converges to the set of Nash equilibria of $G$.

Corollary 4. If $G$ has a unique Nash equilibrium $x^*$, then with probability 1, $\hat x_t \rightharpoonup x^*$.

4.2 Hannan-Consistent Strategies

By Theorem 5, if each player follows a Hannan-consistent strategy, then the empirical distributions of play weakly converge to the set of Nash equilibria of the game. But do such strategies exist? Regret minimizing strategies are intuitive candidates, and the intimate connection between regret minimization and learning in games is well studied in many cases, e.g. for finite games (Cesa-Bianchi and Lugosi, 2006) or potential games (Monderer and Shapley). Using our results from Section 3, we will show that, under the appropriate assumption on the information revealed to the player, no-regret learning based on Dual Averaging leads to Hannan consistency in our setting.

Specifically, suppose that after each iteration $t$, each player $i$ observes a partial payoff function $\tilde u^i_t : S^i \to \mathbb{R}$ describing their payoff as a function of only their own action, $s^i$, holding the action played by the other player fixed. That is, $\tilde u^1_t(s^1) := u(s^1, s^2_t)$ and $\tilde u^2_t(s^2) := -u(s^1_t, s^2)$.

Remark 2. Note that we do not assume that the players have knowledge of the joint utility function $u$. However, we do assume that the player has full information feedback, in the sense that they observe partial reward functions $u(\cdot, s^{-i}_\tau)$ on their entire action set, as opposed to only observing the reward $u(s^1_\tau, s^2_\tau)$ of the action played (the latter corresponds to the bandit setting).

We denote by $\tilde U^i_t = (\tilde u^i_\tau)_{\tau=1}^{t}$ the sequence of partial payoff functions observed by player $i$. We use $\mathcal{U}^i_t$ to denote the set of all possible such histories, and define $\mathcal{U}^i_0 := \emptyset$. A strategy $\sigma^i$ of player $i$ is a collection $(\sigma^i_t)_{t=1}^{\infty}$ of (possibly random) mappings $\sigma^i_t : \mathcal{U}^i_{t-1} \to S^i$, such that at iteration $t$, player $i$ plays $s^i_t = \sigma^i_t(\tilde U^i_{t-1})$. We make the following assumption on the payoff function:

Assumption 3. The payoff function $u$ is uniformly continuous in $s^i$ with modulus of continuity independent of $s^{-i}$ for $i = 1, 2$. That is, for each $i$ there exists $\chi^i \in \mathcal{Z}$ such that $|u(s, s^{-i}) - u(s', s^{-i})| \le \chi^i(d^i(s, s'))$ for all $s^{-i} \in S^{-i}$.

It is easy to see that Assumption 3 implies that the game has a value (see supplementary material). It also makes our setting compatible with that of Section 3. Suppose now that each player randomizes their play according to the sequence of probability distributions on $S^i$ generated by DA with regularizer $h_i$. That is, suppose that each $\sigma^i_t$ is a random variable with the following distribution:

$$\sigma^i_t \sim Dh_i^*\Big(\eta_{t-1} \sum_{\tau=1}^{t-1} \tilde u^i_\tau\Big). \tag{19}$$

Theorem 6. Suppose that player $i$ uses strategy $\sigma^i$ according to (19), and that the DA algorithm ensures sublinear regret (i.e. $\limsup_t R_t / t \le 0$). Then $\sigma^i$ is Hannan-consistent.

Corollary 5. If both players use strategies according to (19) with the respective Dual Averaging ensuring that $\limsup_t R_t / t \le 0$, then with probability 1 the sequence $(\hat x_t)_{t \ge 1}$ of empirical distributions of play weakly converges to the set of Nash equilibria of $G$.

4.3 Example

Consider a zero-sum game $G_1$ between two players on the unit interval with payoff function $u(s^1, s^2) = s^1 s^2 - a^1 s^1 - a^2 s^2$, where $a^1 = \frac{e-2}{e-1}$ and $a^2 = \frac{1}{e-1}$. It is easy to verify that the pair

$$\big(x^1(s), x^2(s)\big) = \Big( \frac{\exp(s)}{e-1}, \frac{\exp(1-s)}{e-1} \Big)$$

is a mixed-strategy Nash equilibrium of $G_1$. For sequences $(s^1_\tau)_{\tau=1}^{t}$ and $(s^2_\tau)_{\tau=1}^{t}$, the cumulative payoff functions for fixed action $s \in [0, 1]$ are given, respectively, by

$$U^1_t(s^1) = \Big( \sum_{\tau=1}^{t} s^2_\tau - a^1 t \Big) s^1 - a^2 \sum_{\tau=1}^{t} s^2_\tau, \qquad U^2_t(s^2) = \Big( a^2 t - \sum_{\tau=1}^{t} s^1_\tau \Big) s^2 + a^1 \sum_{\tau=1}^{t} s^1_\tau.$$

If each player $i$ uses the Generalized Hedge Algorithm with learning rates $(\eta_\tau)_{\tau=1}^{t}$, their strategy in period $t$ is to sample from the distribution $x^i_t(s) \propto \exp(\alpha^i_t s)$, where $\alpha^1_t = \eta_t \big( \sum_{\tau=1}^{t} s^2_\tau - a^1 t \big)$ and $\alpha^2_t = \eta_t \big( a^2 t - \sum_{\tau=1}^{t} s^1_\tau \big)$.
Interestingly, in this case the sum of the opponent's past plays is a sufficient statistic, in the sense that it completely determines the mixed strategy at time $t$.

[Figure 1: Normalized histograms of the empirical distributions of play in $G_1$ (100 bins). Panels show $x^1_t(s)$ for player 1 and $x^2_t(s)$ for player 2 at three iterations, one of which is $t = 50000$.]

Figure 1 shows normalized histograms of the empirical distributions of play at different iterations $t$. As $t$ grows the histograms approach the equilibrium densities $x^1$ and $x^2$, respectively. However, this does not mean that the individual strategies $x^i_t$ converge. Indeed, Figure 2 shows the $\alpha^i_t$ oscillating around the equilibrium parameters $1$ and $-1$, respectively, even for very large $t$. We do, however, observe that the time-averaged parameters $\bar\alpha^i_t$ converge to the equilibrium values $1$ and $-1$.

[Figure 2: Evolution of parameters $\alpha^i_t$ and $\bar\alpha^i_t := \frac{1}{t} \sum_{\tau=1}^{t} \alpha^i_\tau$ in $G_1$.]

In the supplementary material we provide additional numerical examples, including one that illustrates how our algorithms can be utilized as a tool to compute approximate Nash equilibria in continuous zero-sum games on non-convex domains.
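The repeated play of $G_1$ under the Generalized Hedge Algorithm can be simulated in a few lines. The sketch below is an illustration under assumptions rather than the experiment reported in the figures: the learning rate $\eta_k = \sqrt{\log(k+1)/k}$, the horizon, the seed and the function names are hypothetical choices, and the densities $x^i_t(s) \propto \exp(\alpha^i_t s)$ are sampled by inverting their CDF on $[0, 1]$. Since the equilibrium densities $\exp(s)/(e-1)$ and $\exp(1-s)/(e-1)$ have means $a^2$ and $a^1$, the empirical means of play are a convenient quantity to monitor.

```python
import math
import random

A1 = (math.e - 2.0) / (math.e - 1.0)   # a^1 in the payoff function of G_1
A2 = 1.0 / (math.e - 1.0)              # a^2 in the payoff function of G_1

def sample_exp_density(alpha, u):
    """Inverse-CDF sample from the density proportional to exp(alpha * s) on [0, 1]."""
    if abs(alpha) < 1e-8:
        return u                        # (numerically) uniform
    if alpha < 0:
        # reflect: if s' has density ~ exp(|alpha| s'), then 1 - s' has density ~ exp(alpha s)
        return 1.0 - sample_exp_density(-alpha, 1.0 - u)
    alpha = min(alpha, 700.0)           # keep exp(-alpha) away from underflow
    return 1.0 + math.log(u + (1.0 - u) * math.exp(-alpha)) / alpha

def simulate(T, seed=0):
    """Both players run entropy DA on G_1; returns the empirical means of their plays."""
    rng = random.Random(seed)
    sum1 = sum2 = 0.0                   # running sums of the plays s^1 and s^2
    for t in range(1, T + 1):
        k = t - 1                       # number of rounds observed so far
        eta = math.sqrt(math.log(k + 1) / k) if k > 0 else 0.0   # assumed learning rate
        alpha1 = eta * (sum2 - A1 * k)  # parameter of player 1's density
        alpha2 = eta * (A2 * k - sum1)  # parameter of player 2's density
        s1 = sample_exp_density(alpha1, rng.random())
        s2 = sample_exp_density(alpha2, rng.random())
        sum1 += s1
        sum2 += s2
    return sum1 / T, sum2 / T

mean1, mean2 = simulate(20000)
print(mean1, A2)   # empirical mean of player 1's play vs. mean of exp(s) / (e - 1)
print(mean2, A1)   # empirical mean of player 2's play vs. mean of exp(1 - s) / (e - 1)
```

Only the running sums of the opponent's plays are stored, which reflects the sufficient-statistic observation made above; tracking `alpha1` and `alpha2` over time would be the natural way to reproduce a plot in the spirit of Figure 2.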

References

Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1), 2012.
Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31–45, 2014.
S. Bubeck and R. Eldan. The entropic barrier: a simple and optimal universal self-concordant barrier. ArXiv e-prints, December 2014.
Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122.
Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge UP, 2006.
Thomas M. Cover. Universal portfolios. Mathematical Finance, 1(1):1–29, 1991.
Imre Csiszár. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2.
Irving L. Glicksberg. Minimax theorem for upper and lower semicontinuous payoffs. Research Memorandum RM-478, The RAND Corporation, Oct.
James Hannan. Approximation to Bayes risk in repeated play. In Contributions to the Theory of Games, vol. III of Annals of Mathematics Studies 39. Princeton University Press, 1957.
Sergiu Hart and Andreu Mas-Colell. A general class of adaptive strategies. Journal of Economic Theory, 98(1):26–54.
Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3), 2007.
Juha Heinonen, Pekka Koskela, Nageswari Shanmugalingam, and Jeremy T. Tyson. Sobolev Spaces on Metric Measure Spaces: An Approach Based on Upper Gradients. New Mathematical Monographs. Cambridge University Press.
Walid Krichene. Dual averaging on compactly-supported distributions and application to no-regret learning on a continuum. CoRR, 2015.
Walid Krichene, Maximilian Balandat, Claire Tomlin, and Alexandre Bayen. The Hedge Algorithm on a Continuum. In 32nd International Conference on Machine Learning, 2015.
Joon Kwon and Panayotis Mertikopoulos. A continuous-time approach to online optimization. ArXiv e-prints, January 2014.
Ehud Lehrer. Approachability in infinite dimensional spaces. International Journal of Game Theory, 31(2), 2003.
Dov Monderer and Lloyd S. Shapley. Potential games. Games and Economic Behavior, 14(1).
Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1), 2009.
Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 2012.
Nati Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent. In Advances in Neural Information Processing Systems 24 (NIPS), 2011.
Karthik Sridharan and Ambuj Tewari. Convex games in Banach spaces. In COLT: The 23rd Conference on Learning Theory, pages 1–13, Haifa, Israel, June 2010.
Thomas Strömberg. Duality between Fréchet differentiability and strong convexity. Positivity, 15(3).
Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res., 11, December 2010.

arxiv: v1 [cs.lg] 3 Jun 2016
