
Learning Nash Equilibria in Congestion Games

Walid Krichene, Benjamin Drighès, Alexandre M. Bayen

arXiv:1408.0017v1 [cs.LG] 31 Jul 2014

Abstract

We study the repeated congestion game, in which multiple populations of players share resources, and make, at each iteration, a decentralized decision on which resources to utilize. We investigate the following question: given a model of how individual players update their strategies, does the resulting dynamics of strategy profiles converge to the set of Nash equilibria of the one-shot game? We consider in particular a model in which players update their strategies using algorithms with sublinear discounted regret. We show that the resulting sequence of strategy profiles converges to the set of Nash equilibria in the sense of Cesàro means. However, strong convergence is not guaranteed in general. We show that strong convergence can be guaranteed for a class of algorithms with a vanishing upper bound on discounted regret, and which satisfy an additional condition. We call such algorithms AREP algorithms, for Approximate REPlicator, as they can be interpreted as a discrete-time approximation of the replicator equation, which models the continuous-time evolution of population strategies, and which is known to converge for the class of congestion games. In particular, we show that the discounted Hedge algorithm belongs to the AREP class, which guarantees its strong convergence.

1 Introduction

Congestion games are non-cooperative games that model the interaction of players who share resources. Each player makes a decision on which resources to utilize. The individual decisions of players result in a resource allocation at the population scale. Resources which are highly utilized become congested, and the corresponding players incur higher losses. For example, in routing games (a sub-class of congestion games), the resources are edges in a network, and each player needs to travel from a given source vertex to a given destination vertex on the graph.
Each player chooses a path, and the joint decision of all players determines the congestion on each edge. The more a given edge is utilized, the more congested it is, creating delays for those players using that edge.

The one-shot congestion game has been studied extensively, and a comprehensive presentation is given for example in [9]. In particular, congestion games are shown to be potential games, thus their Nash equilibria can be expressed as the solution to a convex optimization problem. Characterizing the Nash equilibria of the congestion game gives useful insights, such as the loss of efficiency due to selfishness of players. One popular measure of inefficiency is the price of anarchy, introduced by Koutsoupias and Papadimitriou in [4], and studied in the case of routing games by Roughgarden et al. in [20].

While characterizing Nash equilibria of the one-shot congestion game gives many insights, it does not model how players arrive at the equilibrium. Studying the game in a repeated setting can help answer this question. Additionally, most realistic scenarios do not correspond to a one-shot setting, but rather a repeated setting in which players make decisions in an online fashion, observe outcomes, and may update their strategies given the previous outcomes. This motivates the study of the game and the population dynamics in an online learning framework.

This work was supported in part by FORCES (Foundations Of Resilient CybEr-physical Systems), which receives support from the National Science Foundation (NSF award numbers CNS-1238959, CNS-1238962, CNS-1239054, CNS-1239166). Walid Krichene is with the department of Electrical Engineering and Computer Sciences, UC Berkeley, walid@eecs.berkeley.edu. Benjamin Drighès is with the Ecole Polytechnique, Palaiseau, France, benjamin.drighes@polytechnique.edu. Alexandre M. Bayen is with the department of Electrical Engineering and Computer Sciences and the department of Civil and Environmental Engineering, UC Berkeley, bayen@berkeley.edu.

Arguably, a good model for learning should be distributed, and should not have extensive information requirements. In particular, one should not expect the players to have an accurate model of congestion of the different resources. Players should be able to learn simply by observing the outcomes of their previous actions, and those of other players. No-regret learning is of particular interest here, as many regret-minimizing algorithms are easy to implement by individual players, and only require the player losses to be revealed. The Hedge algorithm (also known as boosting or the exponential update rule) is a famous example of regret-minimizing algorithms. It was introduced to the machine learning community by Freund and Schapire in [2], as a generalization of the weighted majority algorithm of Littlestone and Warmuth [5]. The Hedge algorithm will be central in our discussion, as it will motivate the study of the continuous-time replicator equation, and will eventually be shown to converge for congestion games.

No-regret learning and its resulting population dynamics have been studied in the context of routing games, a special case of congestion games. For example, in [4], Blum et al. show that the sequence of strategy profiles converges to the set of $\epsilon$-approximate Nash equilibria on a $(1-\epsilon)$-fraction of days. In other words, a subsequence of strategy profiles in which an $\epsilon$ fraction of terms is dropped converges to the set of $\epsilon$-approximate Nash equilibria. They also give explicit convergence rates which depend on the maximum slopes of the congestion functions.

Continuous-time population dynamics have also been studied for congestion games. In [10], Fischer and Vöcking study the convergence of the replicator dynamics for the congestion game. The replicator ODE is also of particular interest in evolutionary game theory, see for example [23]. In [2], Sandholm studies convergence for the larger class of potential games.
He shows that dynamics which satisfy a positive correlation condition with respect to the potential function of the game converge to the set of stationary points of the vector field (usually, a superset of Nash equilibria). However, many regret-minimizing algorithms do not satisfy this correlation condition.

Our discussion is mainly concerned with discrete-time dynamics. However, properties of the replicator equation will be used in our analysis. We will consider a model in which the losses are discounted over time, using a vanishing sequence of discount factors $(\gamma_\tau)_{\tau \in \mathbb{N}}$, meaning that future losses matter less to players than present losses. This defines a discounted regret, and we will focus our attention on online learning algorithms with sublinear discounted regret. The sequence of discount factors will have several interpretations beyond its economic motivation. For example, we will observe that some multiplicative weight algorithms, such as the Hedge algorithm, have sublinear discounted regret if we use the sequence $(\gamma_\tau)$ as learning rates, provided it also satisfies $\sum_{\tau=0}^T \gamma_\tau^2 / \sum_{\tau=0}^T \gamma_\tau \to 0$ as $T \to \infty$.

After defining the model and giving preliminary results in Sections 2 and 3, we show in Section 4 that when players use online learning algorithms with sublinear discounted regret, the sequence of strategy profiles converges to the set of Nash equilibria in the Cesàro sense. In order to obtain strong convergence, we first motivate the study of the replicator dynamics. Indeed, it can be viewed as a continuous-time limit of the Hedge algorithm with decreasing learning rates. In Section 5, we recall the convergence result of the replicator dynamics. By discretizing the replicator equation (using the same discount sequence $(\gamma_\tau)_{\tau \in \mathbb{N}}$ as discretization time steps) we obtain a multiplicative-weights update rule with sublinear discounted regret, which we call the REP algorithm, for replicator.
Finally, in Section 6, we define a class of online learning algorithms we call the AREP algorithms, which can be expressed as a discrete REP algorithm with perturbations that satisfy a condition given in Definition 2. Using results from the theory of stochastic approximation, we show that strong convergence is guaranteed for AREP algorithms with sublinear discounted regret. We finally observe that both the REP algorithm and the Hedge algorithm belong to this class, which proves convergence for these two algorithms in particular.

2 The congestion game model

In the congestion game, a finite set $\mathcal{R}$ of resources is shared by a set $\mathcal{X}$ of players. The set of players is endowed with a structure of measure space, $(\mathcal{X}, \mathcal{M}, m)$, where $\mathcal{M}$ is a $\sigma$-algebra of measurable subsets, and $m$ is a finite Lebesgue measure. The measure is non-atomic, in the sense that single-player sets are null-sets for $m$. The player set is partitioned into $K$ populations, $\mathcal{X} = \mathcal{X}_1 \cup \dots \cup \mathcal{X}_K$. For all $k$, the total mass of

population $\mathcal{X}_k$ is assumed to be finite and nonzero. Each player $x \in \mathcal{X}_k$ has a task to perform, characterized by a collection of bundles $\mathcal{P}_k \subset \mathcal{P}$, where $\mathcal{P}$ is the power set of $\mathcal{R}$. The task can be accomplished by choosing any bundle of resources $p \in \mathcal{P}_k$. The action set of any player in $\mathcal{X}_k$ is then simply $\mathcal{P}_k$.

The joint actions of all players can be represented by an action profile $a : \mathcal{X} \to \mathcal{P}$ such that for all $x \in \mathcal{X}_k$, $a(x) \in \mathcal{P}_k$ is the bundle of resources chosen by player $x$. The function $x \mapsto a(x)$ is assumed to be $\mathcal{M}$-measurable ($\mathcal{P}$ is equipped with the counting measure). The action profile $a$ determines the bundle loads and resource loads, defined as follows: for all $k \in \{1, \dots, K\}$ and $p \in \mathcal{P}_k$, the load of bundle $p$ under population $\mathcal{X}_k$ is the total mass of players in $\mathcal{X}_k$ who chose that bundle,

$$f_p^k(a) = \int_{x \in \mathcal{X}_k} \mathbf{1}_{(a(x) = p)} \, dm(x) \tag{1}$$

For any $r \in \mathcal{R}$, the resource load is defined to be the total mass of players utilizing that resource,

$$\phi_r(a) = \sum_{k=1}^K \sum_{p \in \mathcal{P}_k : r \in p} f_p^k(a) \tag{2}$$

The resource loads determine the losses of all players: the loss associated to a resource $r$ is given by $c_r(\phi_r(a))$, where the congestion functions $c_r$ are assumed to satisfy the following:

Assumption 1. The congestion functions $c_r$ are non-negative, non-decreasing, Lipschitz-continuous functions.

The total loss of a player $x$ such that $a(x) = p$ is $\sum_{r \in p} c_r(\phi_r(a))$. The congestion model is given by the tuple $(K, (\mathcal{X}_k)_{1 \le k \le K}, \mathcal{R}, (\mathcal{P}_k)_{1 \le k \le K}, (c_r)_{r \in \mathcal{R}})$. The congestion game is determined by the action set and the loss function for every player: for all $x \in \mathcal{X}_k$, the action set of $x$ is $\mathcal{P}_k$, and the loss function of $x$, given the action profile $a$, is $\sum_{r \in a(x)} c_r(\phi_r(a))$.

2.1 A macroscopic view

The action profile $a$ specifies the bundle of each player $x$. A more concise description of the joint action of players is given by the bundle distribution: the proportion of players choosing bundle $p$ in population $\mathcal{X}_k$ is denoted by $\mu_p^k(a) = f_p^k(a) / m(\mathcal{X}_k)$, which defines a bundle distribution for population $\mathcal{X}_k$, $\mu^k(a) = (\mu_p^k(a))_{p \in \mathcal{P}_k} \in \Delta^{\mathcal{P}_k}$, and a bundle distribution across populations, given by the product distribution $\mu(a) = (\mu^1(a), \dots, \mu^K(a)) \in \Delta^{\mathcal{P}_1} \times \dots \times \Delta^{\mathcal{P}_K}$.
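As an illustration of the definitions above, the following is a minimal sketch that computes resource loads and bundle losses from bundle distributions, using the fact that each bundle load is the population mass times the bundle proportion. The function names (`resource_loads`, `bundle_losses`) and the two-resource toy instance are hypothetical and not part of the paper; congestion functions are passed as plain Python callables.

```python
def resource_loads(masses, bundles, mu):
    """Return phi[r]: total mass of players using resource r.

    masses[k]  : total mass of population k
    bundles[k] : list of bundles of population k (each a tuple of resources)
    mu[k][i]   : proportion of population k choosing bundles[k][i]
    """
    phi = {}
    for mass, P_k, mu_k in zip(masses, bundles, mu):
        for p, share in zip(P_k, mu_k):
            for r in p:
                # bundle load = mass * share, added to every resource in the bundle
                phi[r] = phi.get(r, 0.0) + mass * share
    return phi


def bundle_losses(bundles, costs, phi):
    """Return, per population, the loss of each bundle: sum of c_r(phi_r) over r in p."""
    return [[sum(costs[r](phi.get(r, 0.0)) for r in p) for p in P_k]
            for P_k in bundles]


# Hypothetical toy instance: one population of mass 1, two single-resource bundles.
masses = [1.0]
bundles = [[("a",), ("b",)]]
costs = {"a": lambda u: u, "b": lambda u: 1.0}  # non-negative, non-decreasing
mu = [[0.5, 0.5]]

phi = resource_loads(masses, bundles, mu)
losses = bundle_losses(bundles, costs, phi)
print(phi, losses)
```

The sketch works directly with the bundle proportions rather than with individual players, which is enough to evaluate every loss in the model.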
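We say that the action profile $a$ induces the distribution $\mu(a)$. Here $\Delta^{\mathcal{P}_k}$ denotes the simplex of distributions over $\mathcal{P}_k$, that is

$$\Delta^{\mathcal{P}_k} = \Big\{ \mu \in \mathbb{R}_+^{\mathcal{P}_k} : \sum_{p \in \mathcal{P}_k} \mu_p = 1 \Big\}$$

The product of simplexes $\Delta^{\mathcal{P}_1} \times \dots \times \Delta^{\mathcal{P}_K}$ will be denoted $\Delta$. This macroscopic representation of the joint actions of players will be useful in our analysis. We will also view the resource loads as linear functions of the product distribution $\mu(a)$. Indeed, we have from equation (2) and the definition of $\mu^k(a)$

$$\phi_r(a) = \sum_{k=1}^K m(\mathcal{X}_k) \sum_{p \in \mathcal{P}_k : r \in p} \mu_p^k(a) = \sum_{k=1}^K m(\mathcal{X}_k) \big( M^k \mu^k(a) \big)_r$$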

where for all $k$, $M^k \in \mathbb{R}^{\mathcal{R} \times \mathcal{P}_k}$ is an incidence matrix defined as follows: for all $r \in \mathcal{R}$ and all $p \in \mathcal{P}_k$,

$$M^k_{r,p} = \begin{cases} 1 & \text{if } r \in p \\ 0 & \text{otherwise} \end{cases}$$

We write in vector form $\phi(a) = \sum_{k=1}^K m(\mathcal{X}_k) M^k \mu^k(a)$, and by defining the scaled incidence matrix $\bar M = \big( m(\mathcal{X}_1) M^1 \mid \dots \mid m(\mathcal{X}_K) M^K \big)$, we have

$$\phi(a) = \bar M \mu(a)$$

By abuse of notation, the dependence on the action profile $a$ will be omitted, so we will write $\mu$ instead of $\mu(a)$ and $\phi$ instead of $\phi(a)$. Finally, we define the loss function of a bundle $p \in \mathcal{P}_k$ to be

$$\ell_p^k(\mu) = \sum_{r \in p} c_r(\phi_r) = \sum_{r \in p} c_r\big( (\bar M \mu)_r \big) = \big( M^T c(\bar M \mu) \big)_p \tag{3}$$

where $M$ is the incidence matrix $M = ( M^1 \mid \dots \mid M^K )$, and $c(\phi)$ is the vector $(c_r(\phi_r))_{r \in \mathcal{R}}$. We denote by $\ell^k(\mu)$ the vector of losses $(\ell_p^k(\mu))_{p \in \mathcal{P}_k}$, and by $\ell(\mu)$ the $K$-tuple $\ell(\mu) = (\ell^1(\mu), \dots, \ell^K(\mu))$.

2.2 Nash equilibria of the congestion game

We can now define and characterize the Nash equilibria of the congestion game, also called Wardrop equilibria, in reference to [22].

Definition 1 (Nash equilibrium). A product distribution $\mu \in \Delta$ is a Nash equilibrium of the congestion game if for all $k$, and all $p \in \mathcal{P}_k$ such that $\mu_p^k > 0$, $\ell_p^k(\mu) \le \ell_{p'}^k(\mu)$ for all $p' \in \mathcal{P}_k$. The set of Nash equilibria will be denoted by $\mathcal{N}$.

In finite player games, a Nash equilibrium is defined to be an action profile $a$ such that no player has an incentive to unilaterally deviate [7], that is, no player can strictly decrease her loss by unilaterally changing her action. We show that this condition (referred to as the Nash condition) holds for almost all players whenever $\mu$ is a Nash equilibrium in the sense of Definition 1.

Proposition 1. A distribution $\mu$ is a Nash equilibrium if and only if for any joint action $a$ which induces the distribution $\mu$, almost all players have no incentive to unilaterally deviate from $a$.

Proof. First, we observe that, given an action profile $a$, when a single player $x$ changes her strategy, this does not affect the distribution $\mu$. This follows from the definition of the distribution,

$$\mu_p^k = \frac{1}{m(\mathcal{X}_k)} \int_{\mathcal{X}_k} \mathbf{1}_{(a(x) = p)} \, dm(x).$$

Changing the action profile $a$ on a null-set $\{x\}$ does not affect the integral. Now, assume that almost all players have no incentive to unilaterally deviate.
That is, for all $k$, for almost all $x \in \mathcal{X}_k$, and for all $p' \in \mathcal{P}_k$,

$$\ell^k_{a(x)}(\mu) \le \ell^k_{p'}(\mu') \tag{4}$$

where $\mu'$ is the distribution obtained when $x$ unilaterally changes her bundle from $a(x)$ to $p'$. By the previous observation, $\mu' = \mu$. As a consequence, condition (4) becomes: for almost all $x$, and for all $p'$, $\ell^k_{a(x)}(\mu) \le \ell^k_{p'}(\mu)$. Therefore, integrating over the set $\{x \in \mathcal{X}_k : a(x) = p\}$, we have for all $k$,

$$\ell_p^k(\mu) \, \mu_p^k \le \ell_{p'}^k(\mu) \, \mu_p^k \quad \text{for all } p, p' \in \mathcal{P}_k,$$

which implies that $\mu$ is a Nash equilibrium in the sense of Definition 1. Conversely, if $a$ is an action profile, inducing distribution $\mu$, such that the Nash condition does not hold for a set of players with positive measure, then there exists $k_0$ and a subset $\mathcal{X}' \subset \mathcal{X}_{k_0}$ with $m(\mathcal{X}') > 0$, such that every player in $\mathcal{X}'$ can strictly decrease

her loss by changing her action. Let $\mathcal{X}'_p = \{x \in \mathcal{X}' : a(x) = p\}$, then $\mathcal{X}'$ is the disjoint union $\mathcal{X}' = \cup_{p \in \mathcal{P}_{k_0}} \mathcal{X}'_p$, and there exists $p_0$ such that $m(\mathcal{X}'_{p_0}) > 0$. Therefore

$$\mu^{k_0}_{p_0} = \frac{m(\{x \in \mathcal{X}_{k_0} : a(x) = p_0\})}{m(\mathcal{X}_{k_0})} \ge \frac{m(\mathcal{X}'_{p_0})}{m(\mathcal{X}_{k_0})} > 0.$$

Let $x \in \mathcal{X}'_{p_0}$. Since $x$ can strictly decrease her loss by unilaterally changing her action, there exists $p_1$ such that $\ell^{k_0}_{p_1}(\mu) < \ell^{k_0}_{a(x)}(\mu) = \ell^{k_0}_{p_0}(\mu)$. But since $\mu^{k_0}_{p_0} > 0$, $\mu$ is not a Nash equilibrium.

Definition 1 also implies that, for a population $\mathcal{X}_k$, all bundles with non-zero mass have equal losses, and bundles with zero mass have greater or equal losses. Therefore almost all players incur the same loss. This observation motivates a second characterization of Nash equilibria, in terms of the average loss.

Definition 2 (Average loss). The average loss incurred by population $\mathcal{X}_k$ is the real number

$$\bar\ell^k(\mu) = \frac{1}{m(\mathcal{X}_k)} \int_{\mathcal{X}_k} \ell^k_{a(x)}(\mu) \, dm(x) = \sum_{p \in \mathcal{P}_k} \mu_p^k \, \ell_p^k(\mu)$$

Proposition 2. $\mu$ is a Nash equilibrium if and only if for all $k$ and all $p \in \mathcal{P}_k$, $\ell_p^k(\mu) \ge \bar\ell^k(\mu)$.

Proof. If $\mu$ is a Nash equilibrium, then all bundles with non-zero mass have equal losses, and bundles with zero mass have greater or equal losses. That is, for all $k$, there exists $p_0 \in \mathcal{P}_k$ such that for all $p$, if $\mu_p^k > 0$ then $\ell_p^k(\mu) = \ell_{p_0}^k(\mu)$, and if $\mu_p^k = 0$, then $\ell_p^k(\mu) \ge \ell_{p_0}^k(\mu)$. Thus

$$\bar\ell^k(\mu) = \sum_{p : \mu_p^k > 0} \mu_p^k \, \ell_p^k(\mu) = \ell_{p_0}^k(\mu) \sum_{p : \mu_p^k > 0} \mu_p^k = \ell_{p_0}^k(\mu),$$

and it follows that for all $p$, $\ell_p^k(\mu) \ge \ell_{p_0}^k(\mu) = \bar\ell^k(\mu)$. Conversely, assume that for all $p$, $\ell_p^k(\mu) \ge \bar\ell^k(\mu)$, and let $p_0 \in \arg\min_{p \in \mathcal{P}_k : \mu_p^k > 0} \ell_p^k(\mu)$. Then,

$$\ell_{p_0}^k(\mu) \ge \bar\ell^k(\mu) = \sum_{p : \mu_p^k > 0} \mu_p^k \, \ell_p^k(\mu) \ge \ell_{p_0}^k(\mu) \sum_{p : \mu_p^k > 0} \mu_p^k = \ell_{p_0}^k(\mu)$$

and the inequalities must hold with equality, thus all bundles with non-zero mass have the same loss (equal to the average loss), while bundles with zero mass have larger or equal losses. This proves that $\mu$ is a Nash equilibrium.

2.3 Mixed strategies

The Nash equilibria we have described so far are pure strategy equilibria, since each player $x$ deterministically plays a single action $a(x)$. We now extend the model to allow mixed strategies. That is, the action of a player $x$ is a random variable $A(x)$ with distribution $\pi(x)$, and with realization $a(x)$.
We show that when players use mixed strategies, provided they randomize independently, the resulting Nash equilibria are, in fact, the same as those given in Definition 1. The key observation is that under independent randomization, the resulting bundle distributions $\mu^k$ are random variables with zero variance, thus they are essentially deterministic.

To formalize the probabilistic setting, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. A mixed strategy profile is a function $A : \mathcal{X} \times \Omega \to \mathcal{P}$, such that for all $k$ and all $x \in \mathcal{X}_k$, $A(x)$ is a $\mathcal{P}_k$-valued random variable, and such that the mapping $(x, \omega) \mapsto A(x)(\omega)$ is $\mathcal{M} \times \mathcal{F}$-measurable. For all $x \in \mathcal{X}_k$ and $p \in \mathcal{P}_k$, let $\pi_p^k(x) = \mathbb{P}[A(x) = p]$. Similarly to the deterministic case, the mixed strategy profile $A$ determines the bundle distributions $\mu^k$, which are, in this case, random variables, as we recall that:

$$\mu_p^k = \frac{1}{m(\mathcal{X}_k)} \int_{\mathcal{X}_k} \mathbf{1}_{(A(x) = p)} \, dm(x)$$

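Nevertheless, assuming players randomize independently, the bundle distribution is almost surely equal to its expectation, as stated in the following Proposition. The assumption of independent randomization is a reasonable one, since players are non-cooperative.

Proposition 3. Under independent randomization, for all $k$, almost surely,

$$\mu^k = \mathbb{E}[\mu^k] = \frac{1}{m(\mathcal{X}_k)} \int_{\mathcal{X}_k} \pi^k(x) \, dm(x)$$

Proof. Fix $k$ and let $p \in \mathcal{P}_k$. Since $(x, \omega) \mapsto \mathbf{1}_{(A(x) = p)}(\omega)$ is a non-negative bounded $\mathcal{M} \times \mathcal{F}$-measurable function, we can apply Tonelli's theorem and write:

$$\mathbb{E}[\mu_p^k] = \mathbb{E}\Big[ \frac{1}{m(\mathcal{X}_k)} \int_{\mathcal{X}_k} \mathbf{1}_{(A(x) = p)} \, dm(x) \Big] = \frac{1}{m(\mathcal{X}_k)} \int_{\mathcal{X}_k} \mathbb{E}\big[ \mathbf{1}_{(A(x) = p)} \big] \, dm(x) = \frac{1}{m(\mathcal{X}_k)} \int_{\mathcal{X}_k} \pi_p^k(x) \, dm(x)$$

Similarly,

$$m(\mathcal{X}_k)^2 \, \mathrm{var}[\mu_p^k] = \mathbb{E}\Big[ \Big( \int_{\mathcal{X}_k} \mathbf{1}_{(A(x) = p)} \, dm(x) \Big)^2 \Big] - \Big( \int_{\mathcal{X}_k} \pi_p^k(x) \, dm(x) \Big)^2$$

$$= \mathbb{E}\Big[ \int_{\mathcal{X}_k} \int_{\mathcal{X}_k} \mathbf{1}_{(A(x) = p; \, A(x') = p)} \, dm(x) \, dm(x') \Big] - \int_{\mathcal{X}_k} \int_{\mathcal{X}_k} \pi_p^k(x) \pi_p^k(x') \, dm(x) \, dm(x')$$

$$= \int_{\mathcal{X}_k \times \mathcal{X}_k} \Big( \mathbb{P}[A(x) = p; \, A(x') = p] - \pi_p^k(x) \pi_p^k(x') \Big) \, d(m \times m)(x, x')$$

Then observing that the diagonal $D = \{(x, x) : x \in \mathcal{X}_k\}$ is an $(m \times m)$-nullset (this follows for example from Proposition 25T in []), we can restrict the integral to the set $\mathcal{X}_k \times \mathcal{X}_k \setminus D$, on which $\mathbb{P}[A(x) = p; \, A(x') = p] = \pi_p^k(x) \pi_p^k(x')$, by the independent randomization assumption. This proves that $\mathrm{var}[\mu_p^k] = 0$. Therefore $\mu_p^k = \mathbb{E}[\mu_p^k]$ almost surely.

We observe that here, the assumption of non-atomicity is essential. This fact is reminiscent of temperature in statistical physics: independent measurements of the global distribution of the same state yield the same result almost surely.

2.4 The Rosenthal potential function

We now discuss how one can formulate the set of Nash equilibria as the solution of a convex optimization problem. Consider the function

$$V(\mu) = \sum_{r \in \mathcal{R}} \int_0^{(\bar M \mu)_r} c_r(u) \, du \tag{5}$$

defined on the product of simplexes $\Delta^{\mathcal{P}_1} \times \dots \times \Delta^{\mathcal{P}_K}$, which will be denoted $\Delta$. $V$ is called the Rosenthal potential function, and was introduced in [8] for the congestion game with finitely many players, and later generalized to the infinite-players case. It can be viewed as the composition of the function

$$\bar V : \phi \in \mathbb{R}_+^{\mathcal{R}} \mapsto \sum_{r \in \mathcal{R}} \int_0^{\phi_r} c_r(u) \, du$$

and the linear function $\mu \mapsto \bar M \mu$. Since for all $r$, $c_r$ is, by assumption, continuous and non-negative, $\bar V$ is differentiable, non-negative and $\nabla \bar V(\phi) = (c_r(\phi_r))_{r \in \mathcal{R}}$. And since the $c_r$ are non-decreasing, $\bar V$ is convex.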
Therefore $V$ is convex as the composition of a convex and a linear function.
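To make the potential concrete, here is a minimal sketch that evaluates it in closed form when every congestion function is affine, $c_r(u) = a_r u + b_r$, so that each integral equals $a_r \phi_r^2 / 2 + b_r \phi_r$. The helper names (`rosenthal_potential`, `loads`) and the two-resource instance are hypothetical illustrations, and the midpoint comparison only spot-checks convexity at one pair of points.

```python
def rosenthal_potential(phi, affine_costs):
    """Potential as a function of the loads, for affine costs c_r(u) = a_r*u + b_r.

    Each term is the integral of c_r from 0 to phi[r]: a_r*phi^2/2 + b_r*phi.
    """
    total = 0.0
    for r, load in phi.items():
        a, b = affine_costs[r]
        total += 0.5 * a * load * load + b * load
    return total


def loads(share_on_a):
    """Loads for one population of mass 1 split between bundles {a} and {b}."""
    return {"a": share_on_a, "b": 1.0 - share_on_a}


# Hypothetical instance: c_a(u) = u and c_b(u) = 1.
affine_costs = {"a": (1.0, 0.0), "b": (0.0, 1.0)}

v_all_on_b = rosenthal_potential(loads(0.0), affine_costs)
v_all_on_a = rosenthal_potential(loads(1.0), affine_costs)
v_midpoint = rosenthal_potential(loads(0.5), affine_costs)

# Midpoint convexity check between the two extreme distributions.
print(v_midpoint <= 0.5 * (v_all_on_a + v_all_on_b))
```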

A simple application of the chain rule gives $\nabla V(\mu) = \bar M^T c(\bar M \mu)$. If we denote by $\nabla_{\mu^k} V(\mu)$ the vector of partial derivatives with respect to $\mu_p^k$, $p \in \mathcal{P}_k$, we have $\nabla_{\mu^k} V(\mu) = m(\mathcal{X}_k) (M^k)^T c(\bar M \mu) = m(\mathcal{X}_k) \ell^k(\mu)$. Thus, for all $k$ and all $p \in \mathcal{P}_k$,

$$\frac{\partial V}{\partial \mu_p^k}(\mu) = m(\mathcal{X}_k) \, \ell_p^k(\mu) \tag{6}$$

and $V$ is a potential function for the congestion game, as defined in [2] for example. Next, we show the relationship between the set of Nash equilibria and the potential function $V$.

Theorem 1 (Rosenthal [8]). $\mathcal{N}$ is the set of minimizers of $V$ on the product of simplexes $\Delta$. It is a non-empty convex compact set. We will denote by $V_{\mathcal{N}}$ the value of $V$ on $\mathcal{N}$.

A version of this theorem is proved in [8]. We also give a proof in Appendix B. Since the set of Nash equilibria can be expressed as the solution to a convex optimization problem, it can be computed in polynomial time in the size of the problem. Beyond computing Nash equilibria, we seek to model how players arrive at the set $\mathcal{N}$. This is discussed in Section 3. But first, we define routing games, a special case of congestion games.

2.5 Example: routing games

A routing game is a congestion game with an underlying graph $G = (V, E)$, with vertex set $V$ and edge set $E \subset V \times V$. In this case, the resource set is equal to the edge set, $\mathcal{R} = E$. Routing games are used to model congestion on transportation or communication networks. Each population $\mathcal{X}_k$ is characterized by a common source vertex $s_k \in V$ and a common destination vertex $t_k \in V$. In a transportation setting, players represent drivers traveling from $s_k$ to $t_k$; in a communication setting, players send packets from $s_k$ to $t_k$. The action set $\mathcal{P}_k$ is a set of paths connecting $s_k$ to $t_k$. In other words, each player chooses a path connecting his or her source and destination vertices. The bundle load $f_p^k$ is then called the flow on path $p$. The resource load $\phi_r$ is called the total edge flow. Finally, the congestion functions $\phi_r \mapsto c_r(\phi_r)$ determine the delay (or latency) incurred by each player.

Figure 1: Routing game with two populations of players.
We will use the routing game given in Figure 1 as an example to illustrate some of our results in later sections. In this example, two populations of players share the network: the first population sends packets from $v_0$ to $v_1$, and the second population from $v_2$ to $v_3$. The population masses are $F_1 = F_2 = 1$. The congestion functions are given below:

$$\begin{array}{lll}
c_{(v_0, v_1)}(u) = u + 2 & c_{(v_0, v_4)}(u) = u/2 & c_{(v_2, v_3)}(u) = u + 1 \\
c_{(v_2, v_4)}(u) = 1/2 & c_{(v_4, v_5)}(u) = 3u & c_{(v_5, v_1)}(u) = u/3 \\
c_{(v_0, v_5)}(u) = u & c_{(v_4, v_3)}(u) = u & c_{(v_5, v_3)}(u) = u/4
\end{array}$$

The paths (bundles) available to each population are given by:

$$\mathcal{P}_1 = \{(v_0, v_1), (v_0, v_4, v_5, v_1), (v_0, v_5, v_1)\}$$

$$\mathcal{P}_2 = \{(v_2, v_3), (v_2, v_4, v_5, v_3), (v_2, v_4, v_3)\}$$

In this case, since the congestion functions are affine, the Rosenthal potential function is quadratic. Its minimizer is, in this example, unique, given by $\mathcal{N} = \{((0, 0.187, 0.813), (0.223, 0.053, 0.724))\}$, and the corresponding path losses are given by:

for all $p \in \mathcal{P}_1 \setminus \{(v_0, v_1)\}$, $\ell_p^1(\mu) = 1.14$
for $p = (v_0, v_1)$, $\ell_p^1(\mu) = 2.00$
for all $p \in \mathcal{P}_2$, $\ell_p^2(\mu) = 1.22$

3 Online learning in congestion games

We now describe the online learning framework for the congestion game, and present the Hedge algorithm in particular.

3.1 The online learning framework

Suppose that the game is played repeatedly for infinitely many iterations, indexed by $\tau \in \mathbb{N}$. During iteration $\tau$, each player chooses a bundle simultaneously. The decision of all players can be represented, as defined above, by an action profile $a^{(\tau)} : \mathcal{X} \to \mathcal{P}$. This induces, at the level of each population $\mathcal{X}_k$, a bundle distribution $\mu^{k(\tau)}$. These, in turn, determine the resource loads and the bundle losses $\ell_p^k(\mu^{(\tau)})$. The losses for bundles $p \in \mathcal{P}_k$ are revealed to all players in population $\mathcal{X}_k$, which marks the end of iteration $\tau$. Players can then use the information revealed to them to update their strategies before the start of the next iteration.

A note on the information assumptions. Here, we assume that at the end of iteration $\tau$, a player observes the losses of all bundles $p \in \mathcal{P}_k$. Instead, one could assume that a player can only observe the losses she incurs. This is often called the multi-armed-bandit setting, in reference to the one-armed-bandit slot machines, in which a gambler can choose, at each iteration, one machine to play, and is only revealed the loss of that machine. Making this restriction requires players to use additional exploration of bundles. A comprehensive presentation of online learning algorithms in the multi-armed bandit setting, both stochastic and deterministic, can be found for example in [6, 2].
Regret bounds are also given in [9] (Section 6.7, pp. 156-159) and [8, 7]. We choose to use the full feedback assumption to simplify our discussion, leaving the multi-armed-bandit setting as a possible extension. We believe this is a reasonable model in many games, since bundle losses could be announced publicly. In the special case of routing games, this can be achieved by having a central authority measure and announce the delays. This is particularly true in transportation networks, in which many agencies and online services measure delays and make this information publicly available. Assuming the full vector of bundle losses is revealed does not mean, however, that players have access to the individual resource loads $\phi_r$, or to the congestion functions $c_r$, which is consistent with our initial argument that, in a realistic model, players should only rely on the observed value of the bundle losses.

Each player $x \in \mathcal{X}_k$ is assumed to draw her bundle from a randomized strategy $\pi^{(\tau)}(x) \in \Delta^{\mathcal{P}_k}$ (the deterministic case is a special case in which $\pi^{(\tau)}(x)$ is a vertex on the simplex, i.e. a pure strategy). As discussed in Section 2.3, players randomize independently. At the end of iteration $\tau$, player $x$ updates her strategy using an update rule or online learning algorithm, as defined below.

Definition 3 (Online learning algorithm for the congestion game). An online learning algorithm (or update rule) for the congestion game, applied by a player $x \in \mathcal{X}_k$, is a sequence of functions $(U_x^{(\tau)})_{\tau \in \mathbb{N}}$, fixed a priori, that is, before the start of the game, such that for each $\tau$,

$$U_x^{(\tau)} : \big( (\ell^k(\mu^{(t)}))_{t \le \tau}, \, \pi^{(\tau)}(x) \big) \mapsto \pi^{(\tau+1)}(x)$$

is a function which maps, given the history of bundle losses $(\ell^k(\mu^{(t)}))_{t \le \tau}$, the strategy on the current day $\pi^{(\tau)}(x) \in \Delta^{\mathcal{P}_k}$ to the strategy on the next day $\pi^{(\tau+1)}(x) \in \Delta^{\mathcal{P}_k}$.

The online learning framework is summarized in Algorithm 1.

Algorithm 1 Online learning framework for the congestion game
1: For every player $x \in \mathcal{X}_k$, an initial mixed strategy $\pi^{(0)}(x) \in \Delta^{\mathcal{P}_k}$ and an online learning algorithm $(U_x^{(\tau)})_{\tau \in \mathbb{N}}$
2: for each iteration $\tau \in \mathbb{N}$ do
3: Every player $x$ independently draws a bundle according to her strategy $\pi^{(\tau)}(x)$, i.e. $A^{(\tau)}(x) \sim \pi^{(\tau)}(x)$.
4: The vector of bundle losses $\ell^k(\mu^{(\tau)})$ is revealed to all players in $\mathcal{X}_k$. Each player incurs the loss of the bundle she chose.
5: Players update their mixed strategies: $\pi^{(\tau+1)}(x) = U_x^{(\tau)}\big( (\ell^k(\mu^{(t)}))_{t \le \tau}, \pi^{(\tau)}(x) \big)$.
6: end for

We will focus our attention on algorithms which have vanishing upper bounds on the average discounted regret, defined in the next section.

3.2 Discounted regret

Since the game is played for infinitely many iterations, we assume that the losses of players are discounted over time. This is a common technique in infinite-horizon optimal control for example, and can be motivated from an economic perspective by considering that losses are devalued over time. We also give an interpretation of discounting in terms of learning rates, as discussed in Section 3.4. Let $(\gamma_\tau)_{\tau \in \mathbb{N}}$ denote the sequence of discount factors. We make the following assumption:

Assumption 2. The sequence of discount factors $(\gamma_\tau)_{\tau \in \mathbb{N}}$ is assumed to be positive decreasing, with $\lim_{\tau \to \infty} \gamma_\tau = 0$ and $\lim_{T \to \infty} \sum_{\tau=0}^T \gamma_\tau = \infty$.

On iteration $\tau$, a player $x \in \mathcal{X}_k$ who draws an action $A^{(\tau)}(x) \sim \pi^{(\tau)}(x)$ incurs a discounted loss given by $\gamma_\tau \ell^k_{A^{(\tau)}(x)}(\mu^{(\tau)})$, where $\mu^{(\tau)}$ is the distribution induced by the profile $A^{(\tau)}$.
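The cumulative discounted loss for player $x$, up to iteration $T$, is then defined to be

$$L^{(T)}(x) = \sum_{\tau=0}^T \gamma_\tau \, \ell^k_{A^{(\tau)}(x)}(\mu^{(\tau)}) \tag{7}$$

We observe that this is a random variable, since the action $A^{(\tau)}(x)$ of player $x$ is random, drawn from a distribution $\pi^{(\tau)}(x)$. The expectation of the cumulative discounted loss is then

$$\mathbb{E}[L^{(T)}(x)] = \sum_{\tau=0}^T \gamma_\tau \, \mathbb{E}\big[ \ell^k_{A^{(\tau)}(x)}(\mu^{(\tau)}) \big] = \sum_{\tau=0}^T \gamma_\tau \big\langle \pi^{(\tau)}(x), \ell^k(\mu^{(\tau)}) \big\rangle$$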

where $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product on $\mathbb{R}^{\mathcal{P}_k}$. Similarly, we define the cumulative discounted loss for a fixed bundle $p \in \mathcal{P}_k$

$$L_p^{k(T)} = \sum_{\tau=0}^T \gamma_\tau \, \ell_p^k(\mu^{(\tau)}) \tag{8}$$

We can now define the discounted regret.

Definition 4 (Discounted regret). Let $x \in \mathcal{X}_k$, and consider an online learning algorithm for the congestion game, given by the sequence of functions $(U_x^{(\tau)})_{\tau \in \mathbb{N}}$. Let $(\mu^{(\tau)})_{\tau \in \mathbb{N}}$ be the sequence of distributions, determined by the mixed strategy profile of all players. Then the discounted regret up to iteration $T$, for player $x$, under algorithm $U$, is the random variable

$$R^{(T)}(x) = L^{(T)}(x) - \min_{p \in \mathcal{P}_k} L_p^{k(T)} \tag{9}$$

The algorithm $U$ is said to have sublinear discounted regret if, for any sequence of distributions $(\mu^{(\tau)})_{\tau \in \mathbb{N}}$, and any initial strategy $\pi^{(0)}$,

$$\frac{1}{\sum_{\tau=0}^T \gamma_\tau} \big[ R^{(T)}(x) \big]^+ \to 0 \text{ almost surely as } T \to \infty \tag{10}$$

If we have convergence in the $L^1$-norm, $\frac{1}{\sum_{\tau=0}^T \gamma_\tau} \big[ \mathbb{E}[R^{(T)}(x)] \big]^+ \to 0$, we say that the algorithm has sublinear discounted regret in expectation.

We observe that, in the definition of the regret, one can replace the minimum over the set $\mathcal{P}_k$ by a minimum over the simplex $\Delta^{\mathcal{P}_k}$

$$\min_{p \in \mathcal{P}_k} L_p^{k(T)} = \min_{\pi \in \Delta^{\mathcal{P}_k}} \big\langle \pi, L^{k(T)} \big\rangle$$

since the minimizers of a bounded linear function lie on the set of extremal points of the feasible set. Therefore, the discounted regret compares the performance of the online learning algorithm to the best constant strategy in hindsight. Indeed, $\langle \pi, L^{k(T)} \rangle$ is the cumulative discounted loss of a constant strategy $\pi$, and minimizing this expression over $\pi \in \Delta^{\mathcal{P}_k}$ yields the best constant strategy in hindsight: one cannot know a priori which strategy will minimize the expression, until all losses up to $T$ are revealed. If the algorithm has sublinear regret, its average performance is, asymptotically, as good as the performance of any constant strategy, regardless of the sequence of distributions $(\mu^{(\tau)})_{\tau \in \mathbb{N}}$.

A note on monotonicity of the discount factors: a similar definition of discounted regret is used for example by Cesa-Bianchi and Lugosi in Section 3.2 of [9]. However, in their definition, the sequence of discount factors is increasing.
This can be motivated by the following argument: present observations may provide better information than past, stale observations. While this argument is accurate in many applications, it does not serve our purpose of convergence of population strategies. In our discussion, the standing assumption is that discount factors are decreasing.

Finally, we observe that the cumulative discounted loss and regret are bounded, uniformly in $x$.

Proposition 4. There exists $\rho > 0$ such that

$$\forall k, \ \forall p \in \mathcal{P}_k, \ \forall \mu \in \Delta, \quad \ell_p^k(\mu) \in [0, \rho] \tag{11}$$

$$\forall x \in \mathcal{X}_k, \quad \frac{1}{\sum_{\tau=0}^T \gamma_\tau} L^{(T)}(x) \in [0, \rho] \tag{12}$$

$$\forall x \in \mathcal{X}_k, \quad \frac{1}{\sum_{\tau=0}^T \gamma_\tau} \big[ R^{(T)}(x) \big]^+ \in [0, \rho] \tag{13}$$

Proof. Since the bundle loss functions $\mu \mapsto \ell_p^k(\mu)$ are continuous on the compact set $\Delta$, they are bounded, and since there are finitely many bundles, there exists a common bound $\rho$ such that for all $k$, for all $p \in \mathcal{P}_k$ and all $\mu \in \Delta$, $0 \le \ell_p^k(\mu) \le \rho$. The bounds (12) and (13) follow from (11) and the definitions (7) and (9) of $L^{(T)}(x)$ and $R^{(T)}(x)$.
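The quantities of this section can be evaluated numerically for a finite history. The following minimal sketch computes the expected discounted regret, that is, the expected cumulative discounted loss $\sum_\tau \gamma_\tau \langle \pi^{(\tau)}, \ell^{(\tau)} \rangle$ minus the best fixed-bundle cumulative discounted loss, together with its normalized positive part. The function name `expected_discounted_regret`, the discount sequence $\gamma_\tau = 1/(\tau+1)$ and the two-bundle loss history are assumptions chosen for illustration only.

```python
def expected_discounted_regret(gammas, losses, strategies):
    """Expected discounted regret of a sequence of mixed strategies.

    gammas[t]     : discount factor at iteration t
    losses[t]     : vector of bundle losses revealed at iteration t
    strategies[t] : mixed strategy (probability vector) used at iteration t
    Returns (regret, normalized positive part of the regret).
    """
    # Expected cumulative discounted loss of the player.
    expected_loss = sum(
        g * sum(prob * loss for prob, loss in zip(pi, loss_vec))
        for g, loss_vec, pi in zip(gammas, losses, strategies)
    )
    # Cumulative discounted loss of each fixed bundle.
    n_bundles = len(losses[0])
    fixed = [sum(g * loss_vec[i] for g, loss_vec in zip(gammas, losses))
             for i in range(n_bundles)]
    regret = expected_loss - min(fixed)
    return regret, max(regret, 0.0) / sum(gammas)


# Hypothetical history over two iterations, with gamma_t = 1 / (t + 1).
gammas = [1.0 / (t + 1) for t in range(2)]
losses = [[1.0, 0.0], [0.0, 1.0]]
strategies = [[0.5, 0.5], [0.5, 0.5]]

regret, normalized = expected_discounted_regret(gammas, losses, strategies)
print(regret, normalized)
```

With losses in $[0, 1]$, the normalized value stays in $[0, 1]$, in line with the uniform bound stated above.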

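3.3 Population-wide regret

We have defined the discounted regret $R^{(T)}(x)$ for a single player $x$. In order to analyze the population dynamics, we define a population-wide cumulative discounted loss $\bar L^{k(T)}$, and discounted regret $\bar R^{k(T)}$ as follows:

$$\bar L^{k(T)} = \frac{1}{m(\mathcal{X}_k)} \int_{\mathcal{X}_k} L^{(T)}(x) \, dm(x) \tag{14}$$

$$\bar R^{k(T)} = \frac{1}{m(\mathcal{X}_k)} \int_{\mathcal{X}_k} R^{(T)}(x) \, dm(x) = \bar L^{k(T)} - \min_{p \in \mathcal{P}_k} L_p^{k(T)} \tag{15}$$

Since $L^{(T)}(x)$ is random for all $x$, $\bar L^{k(T)}$ is also a random variable. However, it is, in fact, almost surely equal to its expectation. Indeed, recalling that $\mu_p^{k(\tau)}$ is the proportion of players who chose bundle $p$ at iteration $\tau$ (also a random variable), we can write

$$\bar L^{k(T)} = \sum_{\tau=0}^T \gamma_\tau \frac{1}{m(\mathcal{X}_k)} \int_{\mathcal{X}_k} \ell^k_{A^{(\tau)}(x)}(\mu^{(\tau)}) \, dm(x)$$

$$= \sum_{\tau=0}^T \gamma_\tau \frac{1}{m(\mathcal{X}_k)} \sum_{p \in \mathcal{P}_k} \int_{\{x \in \mathcal{X}_k : A^{(\tau)}(x) = p\}} \ell_p^k(\mu^{(\tau)}) \, dm(x)$$

$$= \sum_{\tau=0}^T \gamma_\tau \sum_{p \in \mathcal{P}_k} \mu_p^{k(\tau)} \, \ell_p^k(\mu^{(\tau)}) \tag{16}$$

thus assuming players randomize independently, $\mu^{(\tau)}$ is almost surely deterministic by Proposition 3, and so is $\bar L^{k(T)}$. The same holds for $\bar R^{k(T)}$.

Proposition 5. If almost every player $x \in \mathcal{X}_k$ applies an online learning algorithm with sublinear regret in expectation, then the population-wide regret is also sublinear.

Proof. By the previous observation, we have, almost surely,

$$\bar R^{k(T)} = \mathbb{E}\big[ \bar R^{k(T)} \big] = \frac{1}{m(\mathcal{X}_k)} \int_{\mathcal{X}_k} \mathbb{E}\big[ R^{(T)}(x) \big] \, dm(x)$$

where the second equality follows from Tonelli's theorem. Taking the positive part and using Jensen's inequality, we have

$$\Big[ \frac{1}{\sum_{\tau=0}^T \gamma_\tau} \bar R^{k(T)} \Big]^+ \le \frac{1}{m(\mathcal{X}_k)} \int_{\mathcal{X}_k} \Big[ \frac{1}{\sum_{\tau=0}^T \gamma_\tau} \mathbb{E}\big[ R^{(T)}(x) \big] \Big]^+ dm(x)$$

By assumption, $\frac{1}{\sum_{\tau=0}^T \gamma_\tau} \big[ \mathbb{E}[R^{(T)}(x)] \big]^+$ converges to 0 for almost every $x$, and by Proposition 4, it is bounded uniformly in $x$. Thus the result follows by applying the dominated convergence theorem.

3.4 Hedge algorithm with vanishing learning rates

We now present one particular online learning algorithm with sublinear regret. Consider a congestion game, and let $\rho$ be an upper bound on the losses. The existence of such an upper bound was established in Proposition 4.

Definition 5 (Hedge algorithm). The Hedge algorithm, applied by player $x \in \mathcal{X}_k$, with initial distribution $\pi^{(0)} \in \Delta^{\mathcal{P}_k}$ and learning rates $(\eta_\tau)_{\tau \in \mathbb{N}}$ is an online learning algorithm $(U_x^{(\tau)})_{\tau \in \mathbb{N}}$ such that the $\tau$-th update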

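function is given by

$$U_x^{(\tau)}\big( (\ell^k(\mu^{(t)}))_{t \le \tau}, \pi^{(\tau)} \big) = \psi\Big( \Big( \pi_p^{(\tau)} \exp\Big( -\eta_\tau \frac{\ell_p^k(\mu^{(\tau)})}{\rho} \Big) \Big)_{p \in \mathcal{P}_k} \Big)$$

where $\psi$ is the normalization function

$$\psi : v \in \mathbb{R}_+^{\mathcal{P}_k} \setminus \{0\} \mapsto \frac{v}{\|v\|_1} \in \Delta^{\mathcal{P}_k}$$

That is, the distribution at iteration $\tau + 1$ is proportional to the following vector

$$\pi_p^{(\tau+1)} \propto \pi_p^{(\tau)} \exp\Big( -\eta_\tau \frac{\ell_p^k(\mu^{(\tau)})}{\rho} \Big) \tag{17}$$

Intuitively, the Hedge algorithm updates the distribution by computing, at each iteration, a set of bundle weights, then normalizing the vector of weights. The weight of a bundle $p$ is obtained by multiplying the probability at the previous iteration, $\pi_p^{(\tau)}$, by a term which is exponentially decreasing in the bundle loss $\ell_p^k(\mu^{(\tau)})$, thus the higher the loss of bundle $p$ at iteration $\tau$, the lower the probability of selecting $p$ at the next iteration. The parameter $\eta_\tau$ can be interpreted as a learning rate, as discussed in the following proposition.

Proposition 6. The Hedge update rule (17) is the solution to the following optimization problem:

$$\pi^{(\tau+1)} \in \arg\min_{\pi \in \Delta^{\mathcal{P}_k}} \Big\langle \pi, \frac{\ell^k(\mu^{(\tau)})}{\rho} \Big\rangle + \frac{1}{\eta_\tau} D_{KL}\big( \pi \,\|\, \pi^{(\tau)} \big) \tag{18}$$

where $D_{KL}(\pi \| \nu) = \sum_{p \in \mathcal{P}_k} \pi_p \log \frac{\pi_p}{\nu_p}$ is the Kullback-Leibler divergence of distribution $\pi$ with respect to $\nu$.

Proof. Consider the Lagrangian of the problem, with dual variable $\lambda \in \mathbb{R}$ associated to the constraint $\sum_p \pi_p = 1$,

$$\mathcal{L}(\pi; \lambda) = \Big\langle \pi, \frac{\ell^k(\mu^{(\tau)})}{\rho} \Big\rangle + \frac{1}{\eta_\tau} \sum_{p \in \mathcal{P}_k} \pi_p \log \frac{\pi_p}{\pi_p^{(\tau)}} + \lambda \Big( \sum_{p \in \mathcal{P}_k} \pi_p - 1 \Big)$$

its gradient is given by

$$\frac{\partial}{\partial \pi_p} \mathcal{L}(\pi; \lambda) = \frac{\ell_p^k(\mu^{(\tau)})}{\rho} + \frac{1}{\eta_\tau} \Big( 1 + \log \frac{\pi_p}{\pi_p^{(\tau)}} \Big) + \lambda, \qquad \frac{\partial}{\partial \lambda} \mathcal{L}(\pi; \lambda) = \sum_{p \in \mathcal{P}_k} \pi_p - 1$$

and $(\pi^\star, \lambda^\star)$ are primal-dual optimal if and only if the gradient of $\mathcal{L}$ vanishes at $(\pi^\star, \lambda^\star)$, that is,

$$\frac{\ell_p^k(\mu^{(\tau)})}{\rho} + \frac{1}{\eta_\tau} \Big( 1 + \log \frac{\pi_p^\star}{\pi_p^{(\tau)}} \Big) + \lambda^\star = 0 \ \text{ for all } p, \qquad \sum_{p \in \mathcal{P}_k} \pi_p^\star = 1$$

which can be rewritten as

$$\pi_p^\star = \alpha \, \pi_p^{(\tau)} \exp\Big( -\eta_\tau \frac{\ell_p^k(\mu^{(\tau)})}{\rho} \Big), \quad \text{with } \alpha = \exp\big( -(1 + \eta_\tau \lambda^\star) \big) = \frac{1}{\sum_{p' \in \mathcal{P}_k} \pi_{p'}^{(\tau)} \exp\big( -\eta_\tau \ell_{p'}^k(\mu^{(\tau)}) / \rho \big)}$$

where $\alpha$ is the normalization constant. Thus $\pi^\star$ satisfies the Hedge update equation (17).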
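The update rule and its variational characterization can be checked numerically on a small instance. The following is a minimal sketch: `hedge_update` implements the multiplicative update followed by normalization, and `regularized_objective` evaluates the instantaneous loss term plus the Kullback-Leibler term scaled by the inverse learning rate. The function names, the parameter `rho` (an assumed upper bound on the losses) and the numerical values are hypothetical.

```python
import math


def hedge_update(pi, losses, eta, rho):
    """One Hedge step: reweight each bundle by exp(-eta * loss / rho), then normalize."""
    weights = [p * math.exp(-eta * loss / rho) for p, loss in zip(pi, losses)]
    total = sum(weights)
    return [w / total for w in weights]


def regularized_objective(pi, previous, losses, eta, rho):
    """<pi, losses / rho> + (1 / eta) * KL(pi || previous), with 0 * log 0 taken as 0."""
    linear = sum(p * loss / rho for p, loss in zip(pi, losses))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, previous) if p > 0.0)
    return linear + kl / eta


# Hypothetical two-bundle example: the first bundle had the higher loss.
previous = [0.5, 0.5]
losses = [1.0, 0.0]
eta, rho = 0.5, 1.0

updated = hedge_update(previous, losses, eta, rho)
best = regularized_objective(updated, previous, losses, eta, rho)

# The update should do at least as well as a few other candidate distributions.
candidates = [[0.5, 0.5], [0.45, 0.55], [0.3, 0.7], [0.1, 0.9]]
print(updated, all(best <= regularized_objective(c, previous, losses, eta, rho) + 1e-12
                   for c in candidates))
```

The comparison against a handful of candidates is only a spot check of the minimization property, not a proof of it.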

The objective function in (18) is the sum of an instantaneous loss term $\langle \pi, \ell^k(\mu^{(\tau)})/\rho \rangle$ and a regularization term $\frac{1}{\eta_\tau} D_{KL}(\pi \,\|\, \pi^{(\tau)})$ which penalizes deviations from the previous distribution $\pi^{(\tau)}$, with a regularization coefficient $\frac{1}{\eta_\tau}$. The greedy problem with no regularization term would yield a pure strategy which concentrates all the mass on the bundle which had minimal loss on the previous iteration. With the regularization term, the player hedges her bet by penalizing too much deviation from the previous distribution. The coefficient $\eta_\tau$ determines the relative importance of the two terms in the objective function. In particular, as $\eta_\tau \to 0$, the solution to the problem (18) converges to $\pi^{(\tau)}$ since the regularization term dominates the instantaneous loss term. In other words, as $\eta_\tau$ converges to 0, the player stops learning from new observations, which justifies calling $\eta_\tau$ a learning rate.

Remark 1. The sequence of distributions given by the Hedge algorithm also satisfies, for all $\tau$,

$$\pi_p^{(\tau+1)} \propto \pi_p^{(0)} \exp\left(-\sum_{t=0}^{\tau} \eta_t \frac{\ell_p^k(\mu^{(t)})}{\rho}\right) \tag{19}$$

This follows from the update equation (17) and a simple induction on $\tau$. In particular, when $\eta_\tau = \gamma_\tau$, the term $\sum_{t=0}^{\tau} \eta_t \ell_p^k(\mu^{(t)})$ coincides with the cumulative discounted loss $\mathcal{L}_p^{k(\tau)}$ defined in (8). This motivates using the discount factors $\gamma_\tau$ as learning rates. We discuss this in the next proposition.

Proposition 7. Consider a congestion game with a sequence of discount factors $(\gamma_\tau)_{\tau \in \mathbb{N}}$ satisfying Assumption 2. Then the Hedge algorithm with learning rates $(\gamma_\tau)$ satisfies the following regret bound: for any sequence of distributions $(\mu^{(\tau)})$ and any initial strategy $\pi^{(0)}$,

$$\mathbb{E}\big[R^{(T)}(x)\big] \le -\rho \log \pi_{\min}^{(0)} + \frac{\rho}{8} \sum_{\tau=0}^{T} \gamma_\tau^2$$

where $\pi_{\min}^{(0)} = \min_{p} \pi_p^{(0)}$. In particular, when $\sum_{\tau=0}^{T} \gamma_\tau^2 / \sum_{\tau=0}^{T} \gamma_\tau \to 0$, the Hedge algorithm with rates $\gamma_\tau$ has sublinear discounted regret in expectation.

Proof. Given an initial strategy $\pi^{(0)}$, define

$$\xi : u \in \mathbb{R}^{\mathcal{P}_k} \mapsto \log \Big(\sum_{p \in \mathcal{P}_k} \pi_p^{(0)} \exp(-u_p/\rho)\Big).$$
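Recalling the expression of the cumulative bundle loss $\mathcal{L}_p^{k(\tau)} = \sum_{t=0}^{\tau} \gamma_t \ell_p^k(\mu^{(t)})$, we have for all $\tau \ge 0$:

$$\begin{aligned}
\xi\big(\mathcal{L}^{k(\tau+1)}\big) - \xi\big(\mathcal{L}^{k(\tau)}\big) &= \log \frac{\sum_{p \in \mathcal{P}_k} \pi_p^{(0)} \exp\big(-\mathcal{L}_p^{k(\tau)}/\rho\big) \exp\big(-\gamma_{\tau+1} \ell_p^k(\mu^{(\tau+1)})/\rho\big)}{\sum_{p \in \mathcal{P}_k} \pi_p^{(0)} \exp\big(-\mathcal{L}_p^{k(\tau)}/\rho\big)} \\
&= \log \sum_{p \in \mathcal{P}_k} \pi_p^{(\tau+1)} \exp\Big(-\gamma_{\tau+1} \frac{\ell_p^k(\mu^{(\tau+1)})}{\rho}\Big) \\
&\le -\gamma_{\tau+1} \Big\langle \pi^{(\tau+1)}, \frac{\ell^k(\mu^{(\tau+1)})}{\rho} \Big\rangle + \frac{\gamma_{\tau+1}^2}{8}
\end{aligned}$$

The second equality uses the closed form (19) with $\eta_t = \gamma_t$. The last inequality follows from Hoeffding's lemma (see Appendix A), since $0 \le \ell_p^k(\mu^{(\tau+1)})/\rho \le 1$. Summing over $\tau \in \{0, \dots, T-1\}$, we have

$$\xi\big(\mathcal{L}^{k(T)}\big) - \xi\big(\mathcal{L}^{k(0)}\big) \le -\sum_{\tau=1}^{T} \gamma_\tau \Big\langle \pi^{(\tau)}, \frac{\ell^k(\mu^{(\tau)})}{\rho} \Big\rangle + \frac{1}{8} \sum_{\tau=1}^{T} \gamma_\tau^2.$$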

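But we also have

$$\xi\big(\mathcal{L}^{k(0)}\big) = \log \sum_{p \in \mathcal{P}_k} \pi_p^{(0)} \exp\Big(-\gamma_0 \frac{\ell_p^k(\mu^{(0)})}{\rho}\Big) \le -\gamma_0 \Big\langle \pi^{(0)}, \frac{\ell^k(\mu^{(0)})}{\rho} \Big\rangle + \frac{\gamma_0^2}{8}.$$

And as $\log$ is increasing, we have for all $p_0 \in \mathcal{P}_k$,

$$\xi\big(\mathcal{L}^{k(T)}\big) \ge \log\Big(\pi_{p_0}^{(0)} \exp\big(-\mathcal{L}_{p_0}^{k(T)}/\rho\big)\Big) = -\frac{\mathcal{L}_{p_0}^{k(T)}}{\rho} + \log \pi_{p_0}^{(0)}.$$

Combining these inequalities and rearranging, we have for all $p_0 \in \mathcal{P}_k$

$$\sum_{\tau=0}^{T} \gamma_\tau \big\langle \pi^{(\tau)}, \ell^k(\mu^{(\tau)}) \big\rangle - \mathcal{L}_{p_0}^{k(T)} \le -\rho \log \pi_{p_0}^{(0)} + \frac{\rho}{8} \sum_{\tau=0}^{T} \gamma_\tau^2$$

and we obtain the desired inequality by maximizing both sides over $p_0 \in \mathcal{P}_k$.

The previous proposition provides an upper bound on the expected regret of the Hedge algorithm, of the form

$$\frac{\mathbb{E}\big[R^{(T)}(x)\big]}{\sum_{\tau \le T} \gamma_\tau} \le \frac{-\rho \log \pi_{\min}^{(0)}}{\sum_{\tau \le T} \gamma_\tau} + \frac{\rho}{8} \frac{\sum_{\tau \le T} \gamma_\tau^2}{\sum_{\tau \le T} \gamma_\tau}.$$

Given Assumption 2 on the discount factors, we have $\lim_{T \to \infty} \sum_{\tau \le T} \gamma_\tau^2 / \sum_{\tau \le T} \gamma_\tau = 0$ (see Fact 1 in the Appendix), which proves that the discounted regret is sublinear. This also provides a bound on the convergence rate. For example, if $\gamma_\tau \sim 1/\tau$, then the upper bound is of order $1/\log T$, which converges to zero as $T \to \infty$, albeit slowly. A better bound can be obtained for sequences of discount factors which are not square-summable: for example, taking $\gamma_\tau \sim 1/\sqrt{\tau}$, the upper bound is of order $\log T / \sqrt{T}$.

We now have one example of an online learning algorithm with sublinear discounted regret. Furthermore, we have an interpretation of the sequence $(\gamma_\tau)$ as learning rates, which provides additional intuition on Assumption 2 on $(\gamma_\tau)$: decreasing the learning rates will help the system converge. In the next section, we start our analysis of the population dynamics when all players apply a learning algorithm with sublinear discounted regret.

4 Convergence in the Cesàro sense

As discussed in Proposition 5, if almost every player applies an algorithm with sublinear discounted regret in expectation, then the population-wide discounted regret is sublinear almost surely. We now show that whenever the population has sublinear discounted regret, the sequence of distributions $(\mu^{(\tau)})$ converges in the sense of Cesàro. That is, $\sum_{\tau \le T} \gamma_\tau \mu^{(\tau)} / \sum_{\tau \le T} \gamma_\tau$ converges to the set of Nash equilibria. We also show that we have convergence of a dense subsequence. First, we give some definitions.

Definition 6 (Convergence in the sense of Cesàro). Fix a sequence of positive weights $(\gamma_\tau)_{\tau \in \mathbb{N}}$.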
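A sequence $(u^{(\tau)})_{\tau \in \mathbb{N}}$ of elements of a normed vector space $(F, \|\cdot\|)$ converges to $u \in F$ in the sense of Cesàro means with respect to $(\gamma_\tau)$ if

$$\lim_{T \to \infty} \frac{\sum_{\tau \in \mathbb{N} : \tau \le T} \gamma_\tau u^{(\tau)}}{\sum_{\tau \in \mathbb{N} : \tau \le T} \gamma_\tau} = u.$$

We write $u^{(\tau)} \xrightarrow{(\gamma_\tau)} u$.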
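To make Definition 6 concrete, the sketch below computes weighted Cesàro means for a sequence that does not converge in the usual sense. The choices $u^{(\tau)} = (-1)^\tau$ and $\gamma_\tau = 1/(\tau+1)$, as well as the function names, are assumptions made for illustration; the weights are positive and non-summable, and the weighted means drift toward 0 only slowly.

```python
def cesaro_mean(u, gamma, T):
    """Weighted Cesaro mean: sum_{tau <= T} gamma(tau) * u(tau) / sum_{tau <= T} gamma(tau)."""
    numerator = sum(gamma(t) * u(t) for t in range(T + 1))
    denominator = sum(gamma(t) for t in range(T + 1))
    return numerator / denominator

def u(t):
    # Oscillating sequence: it has no limit in the usual sense.
    return (-1) ** t

def gamma(t):
    # Positive, decreasing, non-summable weights (an illustrative choice).
    return 1.0 / (t + 1)

# Weighted means at increasing horizons; they shrink toward 0.
means = [cesaro_mean(u, gamma, T) for T in (10, 1000, 100000)]
```

The numerator stays bounded (it is a partial sum of the alternating harmonic series) while the denominator grows like $\log T$, which is why the convergence is slow.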

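The Stolz-Cesàro theorem states that if $(u^{(\tau)})$ converges to $u$, then it converges in the sense of Cesàro means with respect to any non-summable sequence $(\gamma_\tau)$, see for example [6]. The converse is not true in general. However, if a sequence converges absolutely in the sense of Cesàro means, i.e. $\|u^{(\tau)} - u\| \xrightarrow{(\gamma_\tau)} 0$, then a dense subsequence of $(u^{(\tau)})$ converges to $u$. To show this, we first show that absolute Cesàro convergence implies statistical convergence, as defined below.

Definition 7 (Statistical convergence). Fix a sequence of positive weights $(\gamma_\tau)$. A sequence $(u^{(\tau)})_{\tau \in \mathbb{N}}$ of elements of a normed vector space $(F, \|\cdot\|)$ converges to $u \in F$ statistically with respect to $(\gamma_\tau)$ if for all $\epsilon > 0$, the set of indexes $I_\epsilon = \{\tau \in \mathbb{N} : \|u^{(\tau)} - u\| \ge \epsilon\}$ has zero density with respect to $(\gamma_\tau)$. The density of a subset of integers $I \subset \mathbb{N}$, with respect to the sequence of positive weights $(\gamma_\tau)$, is defined to be the limit, if it exists,

$$\lim_{T \to \infty} \frac{\sum_{\tau \in I : \tau \le T} \gamma_\tau}{\sum_{\tau \in \mathbb{N} : \tau \le T} \gamma_\tau}.$$

Lemma 1. If $(u^{(\tau)})$ converges to $u$ absolutely in the sense of Cesàro means with respect to $(\gamma_\tau)$, then it converges to $u$ statistically with respect to $(\gamma_\tau)$.

Proof. Let $\epsilon > 0$. We have for all $T \in \mathbb{N}$,

$$0 \le \frac{\sum_{\tau \in I_\epsilon : \tau \le T} \gamma_\tau}{\sum_{\tau \in \mathbb{N} : \tau \le T} \gamma_\tau} \le \frac{1}{\epsilon} \frac{\sum_{\tau \in \mathbb{N} : \tau \le T} \gamma_\tau \|u^{(\tau)} - u\|}{\sum_{\tau \in \mathbb{N} : \tau \le T} \gamma_\tau}$$

which converges to 0 since $(u^{(\tau)})$ converges to $u$ absolutely in the sense of Cesàro means. Therefore $I_\epsilon$ has zero density for all $\epsilon$.

We can now show convergence of a dense subsequence.

Proposition 8. If $(u^{(\tau)})_{\tau \in \mathbb{N}}$ converges to $u$ absolutely in the sense of Cesàro means with respect to $(\gamma_\tau)$, then there exists a subset of indexes $\mathcal{T} \subset \mathbb{N}$ of density one, such that the subsequence $(u^{(\tau)})_{\tau \in \mathcal{T}}$ converges to $u$.

Proof. By Lemma 1, for all $\epsilon > 0$, the set $I_\epsilon = \{\tau \in \mathbb{N} : \|u^{(\tau)} - u\| \ge \epsilon\}$ has zero density. We will construct a set $I \subset \mathbb{N}$ of zero density, such that the subsequence $(u^{(\tau)})_{\tau \in \mathbb{N} \setminus I}$ converges. For all $k \in \mathbb{N}^*$, let

$$p_k(T) = \frac{\sum_{\tau \in I_{1/k} : \tau \le T} \gamma_\tau}{\sum_{\tau \in \mathbb{N} : \tau \le T} \gamma_\tau}.$$

Since $p_k(T)$ converges to 0 as $T \to \infty$, there exists $T_k > 0$ such that for all $T \ge T_k$, $p_k(T) \le \frac{1}{k}$. Without loss of generality, we can assume that $(T_k)_{k \in \mathbb{N}^*}$ is increasing. Now, let

$$I = \bigcup_{k \in \mathbb{N}^*} \big(I_{1/k} \cap \{T_k, \dots, T_{k+1}\}\big).$$

Then we have for all $k \in \mathbb{N}^*$,

$$I \cap \{0, \dots, T_{k+1}\} = \bigcup_{j=1}^{k} \big(I_{1/j} \cap \{T_j, \dots, T_{j+1}\}\big).$$

But since $I_1 \subset I_{1/2} \subset \dots \subset I_{1/k}$, we have $I \cap \{0, \dots, T_{k+1}\} \subset I_{1/k} \cap \{0, \dots, T_{k+1}\}$, thus for all $T$ such that $T_k \le T < T_{k+1}$, we have

$$\frac{\sum_{\tau \in I : \tau \le T} \gamma_\tau}{\sum_{\tau \in \mathbb{N} : \tau \le T} \gamma_\tau} \le \frac{\sum_{\tau \in I_{1/k} : \tau \le T} \gamma_\tau}{\sum_{\tau \in \mathbb{N} : \tau \le T} \gamma_\tau} = p_k(T) \le \frac{1}{k}$$

which proves that $I$ has zero density. Let $\mathcal{T} = \mathbb{N} \setminus I$. We have that $\mathcal{T}$ has density one, and it remains to prove that the subsequence $(u^{(\tau)})_{\tau \in \mathcal{T}}$ converges to $u$. Since $\mathcal{T}$ has density one, it has infinitely many elements, and for all $k$, there exists $S_k \in \mathcal{T}$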

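such that $S_k \ge T_k$. For all $\tau \in \mathcal{T}$ with $\tau \ge S_k$, there exists $k' \ge k$ such that $T_{k'} \le \tau < T_{k'+1}$. Since $\tau \notin I$ and $T_{k'} \le \tau < T_{k'+1}$, we must have $\tau \notin I_{1/k'}$, therefore

$$\|u^{(\tau)} - u\| < \frac{1}{k'} \le \frac{1}{k}.$$

This proves that $(u^{(\tau)})_{\tau \in \mathcal{T}}$ converges to $u$.

We now present the main result of this section, which concerns the convergence of the sequence of population distributions $(\mu^{(\tau)})$ to the set $\mathcal{N}$ of Nash equilibria. We say that $(\mu^{(\tau)})$ converges to $\mathcal{N}$ if $d(\mu^{(\tau)}, \mathcal{N}) \to 0$, where $d(\mu, \mathcal{N}) = \inf_{\nu \in \mathcal{N}} \|\mu - \nu\|$.

Theorem 2. Consider a congestion game with discount factors $(\gamma_\tau)$ satisfying Assumption 2. Assume that for all $k \in \{1, \dots, K\}$, population $k$ has sublinear discounted regret. Then the sequence of distributions $(\mu^{(\tau)})$ converges to the set of Nash equilibria in the sense of Cesàro means with respect to $(\gamma_\tau)$. Furthermore, there exists a dense subsequence $(\mu^{(\tau)})_{\tau \in \mathcal{T}}$ which converges to $\mathcal{N}$.

Proof. First, we observe the following fact:

Lemma 2. A sequence $(\nu^{(\tau)})$ in $\Delta$ converges to $\mathcal{N}$ if and only if $V(\nu^{(\tau)})$ converges to $V_{\mathcal{N}}$, the value of $V$ on $\mathcal{N}$.

The "only if" direction follows from the uniform continuity of $V$ on the compact set $\Delta$. For the "if" direction, which is the one used below, suppose by contradiction that $V(\nu^{(\tau)}) \to V_{\mathcal{N}}$ but $\nu^{(\tau)} \not\to \mathcal{N}$. Then there would exist $\epsilon > 0$ and a subsequence $(\nu^{(\tau)})_{\tau \in \mathcal{T}}$, $\mathcal{T} \subset \mathbb{N}$, such that $d(\nu^{(\tau)}, \mathcal{N}) \ge \epsilon$ for all $\tau \in \mathcal{T}$. Since $\Delta$ is compact, we can extract a further subsequence $(\nu^{(\tau)})_{\tau \in \mathcal{T}'}$ which converges to some $\nu \notin \mathcal{N}$. But by continuity of $V$, $(V(\nu^{(\tau)}))_{\tau \in \mathcal{T}'}$ converges to $V(\nu) > V_{\mathcal{N}}$, a contradiction.

Consider the potential function $V$ defined in equation (5). By convexity of $V$ and the expression (6) of its gradient, we have for all $\tau$ and for all $\mu \in \Delta$:

$$V(\mu^{(\tau)}) - V(\mu) \le \big\langle \nabla V(\mu^{(\tau)}), \mu^{(\tau)} - \mu \big\rangle = \sum_{k=1}^{K} m(X_k) \big\langle \ell^k(\mu^{(\tau)}), \mu^{k(\tau)} - \mu^k \big\rangle$$

then taking the time-weighted sum up to iteration $T$,

$$\begin{aligned}
\sum_{\tau=0}^{T} \gamma_\tau \big(V(\mu^{(\tau)}) - V(\mu)\big) &\le \sum_{k=1}^{K} m(X_k) \left[\sum_{\tau=0}^{T} \gamma_\tau \big\langle \mu^{k(\tau)}, \ell^k(\mu^{(\tau)}) \big\rangle - \Big\langle \mu^k, \sum_{\tau=0}^{T} \gamma_\tau \ell^k(\mu^{(\tau)}) \Big\rangle\right] \\
&= \sum_{k=1}^{K} m(X_k) \Big[L_k^{(T)} - \big\langle \mu^k, \mathcal{L}^{k(T)} \big\rangle\Big] \\
&\le \sum_{k=1}^{K} m(X_k) R_k^{(T)},
\end{aligned}$$

where for the last inequality, we use the fact that $\langle \mu^k, \mathcal{L}^{k(T)} \rangle \ge \min_{p \in \mathcal{P}_k} \mathcal{L}_p^{k(T)}$. In particular, when $\mu$ is a Nash equilibrium, by Theorem 1, $V(\mu) = \min_{\mu \in \Delta} V(\mu) = V_{\mathcal{N}}$, thus

$$\frac{\sum_{\tau=0}^{T} \gamma_\tau \big(V(\mu^{(\tau)}) - V_{\mathcal{N}}\big)}{\sum_{\tau=0}^{T} \gamma_\tau} \le \sum_{k=1}^{K} m(X_k) \frac{R_k^{(T)}}{\sum_{\tau=0}^{T} \gamma_\tau}.$$

Since the population-wide regret $R_k^{(T)}$ is assumed to be sublinear for all $k$, we have $V(\mu^{(\tau)}) - V_{\mathcal{N}} \xrightarrow{(\gamma_\tau)} 0$. By Proposition 8, there exists $\mathcal{T} \subset \mathbb{N}$ of density one, such that $(V(\mu^{(\tau)}))_{\tau \in \mathcal{T}}$ converges to $V_{\mathcal{N}}$. And it follows, by Lemma 2, that $(\mu^{(\tau)})_{\tau \in \mathcal{T}}$ converges to $\mathcal{N}$.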
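This proves the second part of the theorem. To prove the first part, we observe that, by convexity of $V$,

$$V_{\mathcal{N}} \le V\left(\frac{\sum_{\tau=0}^{T} \gamma_\tau \mu^{(\tau)}}{\sum_{\tau=0}^{T} \gamma_\tau}\right) \le \frac{\sum_{\tau=0}^{T} \gamma_\tau V(\mu^{(\tau)})}{\sum_{\tau=0}^{T} \gamma_\tau} = V_{\mathcal{N}} + \frac{\sum_{\tau=0}^{T} \gamma_\tau \big(V(\mu^{(\tau)}) - V_{\mathcal{N}}\big)}{\sum_{\tau=0}^{T} \gamma_\tau}$$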

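and the upper bound converges to $V_{\mathcal{N}}$. Therefore $\big(\sum_{\tau \le T} \gamma_\tau \mu^{(\tau)} / \sum_{\tau \le T} \gamma_\tau\big)_{T \in \mathbb{N}}$ converges to $\mathcal{N}$.

To conclude this section, we observe that the Cesàro convergence result of Theorem 2 can be generalized to any game with a convex potential function.

5 Continuous-time dynamics

We now turn to the harder question of convergence of $(\mu^{(\tau)})$: we seek to derive sufficient conditions under which the sequence $(\mu^{(\tau)})$ converges to $\mathcal{N}$. In this section, we study a continuous-time limit of the update equation given by the Hedge algorithm. The resulting ODE, known as the replicator equation, will be useful in proving strong convergence results in the next section.

5.1 The replicator dynamics

To motivate the study of the replicator dynamics from an online learning point of view, we first derive the continuous-time replicator dynamics as a limit of the discrete Hedge dynamics, as discussed below. Assume that in each population $X_k$, all players start from the same initial distribution $\pi^{k(0)} \in \Delta^{\mathcal{P}_k}$, and apply the Hedge algorithm with learning rates $\gamma_\tau$. As a result, the sequence of distributions $(\mu^{k(\tau)})$ satisfies the Hedge update rule (17). Now suppose the existence of an underlying continuous time $t \in \mathbb{R}_+$, and write $\mu(t)$ for the distribution at time $t$. Suppose that the updates occur at discrete times $T_\tau$, $\tau \in \mathbb{N}$, such that the time steps are given by a decreasing, vanishing sequence $\epsilon_\tau$. That is, $T_{\tau+1} - T_\tau = \epsilon_\tau$. Writing $\bar\ell^k(\mu) = \langle \mu^k, \ell^k(\mu) \rangle$ for the average loss in population $k$, we have for all $k$ and all $p \in \mathcal{P}_k$, using Landau notation:

$$\begin{aligned}
\mu_p^k(T_{\tau+1}) = \mu_p^{k(\tau+1)} &= \frac{\mu_p^{k(\tau)} e^{-\gamma_\tau \ell_p^k(\mu^{(\tau)})/\rho}}{\sum_{p' \in \mathcal{P}_k} \mu_{p'}^{k(\tau)} e^{-\gamma_\tau \ell_{p'}^k(\mu^{(\tau)})/\rho}} \\
&= \mu_p^{k(\tau)} \frac{1 - \gamma_\tau \ell_p^k(\mu^{(\tau)})/\rho + o(\gamma_\tau)}{1 - \gamma_\tau \sum_{p' \in \mathcal{P}_k} \mu_{p'}^{k(\tau)} \ell_{p'}^k(\mu^{(\tau)})/\rho + o(\gamma_\tau)} \\
&= \mu_p^{k(\tau)} \left[1 + \gamma_\tau \frac{\bar\ell^k(\mu^{(\tau)}) - \ell_p^k(\mu^{(\tau)})}{\rho}\right] + o(\gamma_\tau)
\end{aligned}$$

Thus,

$$\frac{\mu_p^k(T_{\tau+1}) - \mu_p^k(T_\tau)}{T_{\tau+1} - T_\tau} = \mu_p^{k(\tau)} \frac{\gamma_\tau}{\epsilon_\tau} \frac{\bar\ell^k(\mu^{(\tau)}) - \ell_p^k(\mu^{(\tau)})}{\rho} + o\Big(\frac{\gamma_\tau}{\epsilon_\tau}\Big).$$

In particular, if we take the discretization time steps $\epsilon_\tau$ to be equal to the sequence of learning rates $\gamma_\tau$, the expression simplifies, and taking the limit as $\gamma_\tau \to 0$, we obtain the following ODE system:

$$\begin{cases}
\mu(0) \in \mathring{\Delta} \\
\forall k, \ \forall p \in \mathcal{P}_k, \quad \dfrac{d\mu_p^k(t)}{dt} = \mu_p^k(t) \dfrac{\bar\ell^k(\mu(t)) - \ell_p^k(\mu(t))}{\rho}
\end{cases} \tag{20}$$

where $\mathring{\Delta} = \{\mu \in \Delta : \forall k, \ \forall p \in \mathcal{P}_k, \ \mu_p^k > 0\}$ is the relative interior of $\Delta$.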
Here, we require that the initial distribution have positive weights on all bundles for the following reason: whenever $\mu_p^k(0) = 0$, any solution trajectory will have $\mu_p^k(t) \equiv 0$. It is impossible for such trajectories to converge to the set of Nash equilibria $\mathcal{N}$ if the support of equilibria in $\mathcal{N}$ contains $p$. In other words, the replicator dynamics cannot expand the support of the initial distribution, therefore we require that the initial distribution be supported everywhere.
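The first-order expansion behind this limit can be checked numerically: one Hedge step with rate $\gamma$ should differ from the Euler-type step $\mu + \gamma F(\mu)$, where $F$ is the right-hand side of the replicator ODE, by a term of order $\gamma^2$. The sketch below is illustrative only; the function names, the three-bundle distribution, and the loss vector (held fixed here, whereas in the game it depends on $\mu$) are assumptions.

```python
import math

def hedge_step(mu, losses, gamma, rho):
    """Hedge update with learning rate gamma."""
    weights = [m * math.exp(-gamma * l / rho) for m, l in zip(mu, losses)]
    total = sum(weights)
    return [w / total for w in weights]

def replicator_field(mu, losses, rho):
    """Right-hand side mu_p * (average loss - loss_p) / rho of the replicator ODE."""
    average = sum(m * l for m, l in zip(mu, losses))
    return [m * (average - l) / rho for m, l in zip(mu, losses)]

def first_order_gap(mu, losses, gamma, rho):
    """Max-norm distance between the Hedge step and mu + gamma * F(mu)."""
    exact = hedge_step(mu, losses, gamma, rho)
    field = replicator_field(mu, losses, rho)
    return max(abs(e - (m + gamma * f)) for e, m, f in zip(exact, mu, field))

# Hypothetical data for one population with three bundles.
mu = [0.5, 0.3, 0.2]
losses = [1.0, 0.4, 0.7]
gaps = {g: first_order_gap(mu, losses, g, 1.0) for g in (0.1, 0.01, 0.001)}
```

Dividing $\gamma$ by 10 divides the gap by roughly 100 in this instance, which is consistent with the $o(\gamma_\tau)$ remainder in the derivation.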

Equation (20) defines a vector field $F : \mathring{\Delta} \to H$, where $H$ is the product $H = H^{\mathcal{P}_1} \times \dots \times H^{\mathcal{P}_K}$, and $H^{\mathcal{P}_k} = \{v \in \mathbb{R}^{\mathcal{P}_k} : \sum_{p \in \mathcal{P}_k} v_p = 0\}$ is the linear hyperplane parallel to the simplex $\Delta^{\mathcal{P}_k}$. Indeed, writing $\bar\ell^k(\mu) = \langle \mu^k, \ell^k(\mu) \rangle$ for the average loss, we have for all $\mu$ and for all $k$,

$$\sum_{p \in \mathcal{P}_k} F_p^k(\mu) = \frac{1}{\rho}\Big(\bar\ell^k(\mu) \sum_{p \in \mathcal{P}_k} \mu_p^k - \sum_{p \in \mathcal{P}_k} \ell_p^k(\mu) \mu_p^k\Big) = 0.$$

The following proposition ensures that the solutions remain in the relative interior and are defined for all times.

Proposition 9. The ODE (20) has a unique solution $\mu(t)$ which remains in $\mathring{\Delta}$ and is defined on $\mathbb{R}_+$.

Proof. First, since the congestion functions $c_r$ are assumed to be Lipschitz continuous, so is the vector field $F$. We thus have existence and uniqueness of a solution by the Cauchy-Lipschitz theorem. To show that the solution remains in the relative interior of $\Delta$, we observe that for all $k$, $\frac{d}{dt} \sum_{p} \mu_p^k(t) = \sum_p F_p^k(\mu(t)) = 0$ by the previous observation. Therefore, $\sum_p \mu_p^k(t)$ is constant and equal to 1. To show that $\mu_p^k(t) > 0$ for all $t$ in the solution domain, assume by contradiction that there exist $t_0 > 0$ and $p_0 \in \mathcal{P}_k$ such that $\mu_{p_0}^k(t_0) = 0$. Since the solution trajectories are continuous, we can assume, without loss of generality, that $t_0$ is the infimum of all such times (thus for all $t < t_0$, $\mu_{p_0}^k(t) > 0$). Now consider the new system given by

$$\frac{d\tilde\mu_p^{k'}}{dt} = \tilde\mu_p^{k'} \frac{\bar\ell^{k'}(\tilde\mu) - \ell_p^{k'}(\tilde\mu)}{\rho} \quad \text{for all } (k', p) \ne (k, p_0), \qquad \tilde\mu(t_0) = \mu(t_0),$$

and $\tilde\mu_{p_0}^k(t)$ is identically equal to 0. Any solution of the new system, defined on $(t_0 - \delta, t_0]$, is also a solution of equation (20). Since $\tilde\mu(t_0) = \mu(t_0)$, we have $\tilde\mu \equiv \mu$ by uniqueness of the solution. This leads to a contradiction since by assumption, for all $t < t_0$, $\mu_{p_0}^k(t) > 0$ but $\tilde\mu_{p_0}^k(t) = 0$. This proves that $\mu$ remains in $\mathring{\Delta}$. Furthermore, since $\Delta$ is compact, we have by Theorem 2.4 in [3] that the solution is defined on $\mathbb{R}_+$ (otherwise it would eventually leave any compact set).

Equation (20) is also studied in evolutionary game theory and is referred to as the replicator dynamics (see [23] for example). It arises from the following model: for all time $t$, players of population $X_k$ are matched in random pairs, and each pair compares their losses.
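If the two players have strategies $p, p' \in \mathcal{P}_k$, the player with higher loss imitates the strategy of the other player with probability proportional to the difference in losses (hence the name replicator), that is, if $\ell_p^k(\mu) > \ell_{p'}^k(\mu)$, then the first player switches to bundle $p'$ with probability $\big(\ell_p^k(\mu) - \ell_{p'}^k(\mu)\big)/\rho$. Under this model, we have

$$\begin{aligned}
\frac{d\mu_p^k}{dt} &= -\mu_p^k \sum_{p' \in \mathcal{P}_k : \ell_p^k(\mu) > \ell_{p'}^k(\mu)} \mu_{p'}^k \frac{\ell_p^k(\mu) - \ell_{p'}^k(\mu)}{\rho} + \sum_{p' \in \mathcal{P}_k : \ell_{p'}^k(\mu) \ge \ell_p^k(\mu)} \mu_{p'}^k \mu_p^k \frac{\ell_{p'}^k(\mu) - \ell_p^k(\mu)}{\rho} \\
&= \mu_p^k \sum_{p' \in \mathcal{P}_k} \mu_{p'}^k \frac{\ell_{p'}^k(\mu) - \ell_p^k(\mu)}{\rho} \\
&= \mu_p^k \frac{\langle \mu^k, \ell^k(\mu) \rangle - \ell_p^k(\mu)}{\rho}
\end{aligned}$$

which results in the same ODE (20).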
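The identity between the imitation model and the replicator right-hand side can be verified directly on a small instance. In the sketch below, the function names, the distribution, and the loss values are hypothetical choices made for illustration; the code sums the outflow and inflow terms of the pairwise imitation model and compares the result with $\mu_p (\langle \mu, \ell \rangle - \ell_p)/\rho$.

```python
def imitation_drift(mu, losses, rho):
    """Net rate of change of each bundle's mass under pairwise proportional imitation."""
    n = len(mu)
    drift = []
    for p in range(n):
        # Players on p who meet a player on a strictly better bundle q may leave p.
        outflow = sum(mu[p] * mu[q] * (losses[p] - losses[q]) / rho
                      for q in range(n) if losses[p] > losses[q])
        # Players on a bundle q that is no better than p may switch to p.
        inflow = sum(mu[q] * mu[p] * (losses[q] - losses[p]) / rho
                     for q in range(n) if losses[q] >= losses[p])
        drift.append(inflow - outflow)
    return drift

def replicator_rhs(mu, losses, rho):
    """Replicator right-hand side mu_p * (average loss - loss_p) / rho."""
    average = sum(m * l for m, l in zip(mu, losses))
    return [m * (average - l) / rho for m, l in zip(mu, losses)]

# Hypothetical population state and bundle losses, with rho bounding the losses.
mu = [0.4, 0.1, 0.3, 0.2]
losses = [0.9, 0.2, 0.5, 0.5]
rho = 1.0
from_imitation = imitation_drift(mu, losses, rho)
from_replicator = replicator_rhs(mu, losses, rho)
```

Both vectors agree up to floating-point error, and their entries sum to zero, so total mass is conserved.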

5.2 Stationary points of the replicator dynamics

We first give a characterization of stationary points of the replicator dynamics applied to the congestion game.

Proposition 10. A product distribution $\mu$ is a stationary point for the replicator dynamics (20) if and only if the bundle losses $\ell_p^k(\mu)$ are equal on the support of $\mu^k$.

This follows immediately from equation (20). We observe in particular that all Nash equilibria are stationary points, but a stationary point may not be a Nash equilibrium in general: one may have a stationary point $\mu$ such that $\mu_p^k = 0$ but $\ell_p^k(\mu)$ is strictly lower than the losses of bundles in the support, which violates the condition in Definition 1 of a Nash equilibrium. A stationary point $\mu$ with support $\mathcal{P}'_1 \times \dots \times \mathcal{P}'_K$ can be viewed as a Nash equilibrium of a modified congestion game, in which the bundle set of each population $X_k$ is restricted to $\mathcal{P}'_k$. For this reason, stationary points have been called restricted Nash equilibria by Fischer and Vöcking in [10]. We will denote the set of stationary points by $\mathcal{RN}$, in reference to the aforementioned paper.

Remark 2. By the previous observation, a stationary point with support $\mathcal{P}'_1 \times \dots \times \mathcal{P}'_K$ is a minimizer of the potential function $V$ on the product $\Delta^{\mathcal{P}'_1} \times \dots \times \Delta^{\mathcal{P}'_K}$. As the number of support sets is finite, the set of potential values of stationary points $V(\mathcal{RN})$ is also finite.

5.3 Convergence of the replicator dynamics

In [10], Fischer and Vöcking prove, using a Lyapunov argument, that all solution trajectories of the replicator system asymptotically approach the set of stationary points $\mathcal{RN}$. Unfortunately, this result only guarantees convergence to a superset of Nash equilibria. However, this will be useful in the next section, and we present a proof for completeness.

Proposition 11 (Fischer and Vöcking, [10]). Every solution of the system (20) converges to the set of stationary points $\mathcal{RN}$.

Proof. Consider the potential function $V$ defined by equation (5).
The function $V$ is continuously differentiable and its derivative along the vector field of the ODE is given by:

$$\begin{aligned}
\dot V(\mu) = \langle \nabla V(\mu), F(\mu) \rangle &= \sum_{k=1}^{K} m(X_k) \sum_{p \in \mathcal{P}_k} \ell_p^k(\mu) \mu_p^k \frac{\langle \mu^k, \ell^k(\mu) \rangle - \ell_p^k(\mu)}{\rho} \\
&= \frac{1}{\rho} \sum_{k=1}^{K} m(X_k) \left[\Big(\sum_{p \in \mathcal{P}_k} \mu_p^k \ell_p^k(\mu)\Big)^2 - \sum_{p \in \mathcal{P}_k} \mu_p^k \ell_p^k(\mu)^2\right]
\end{aligned}$$

By Jensen's inequality, $\dot V(\mu) \le 0$, with equality if and only if $\mu \in \mathcal{RN}$. Therefore $V$, which is defined on the compact set $\Delta$, is decreasing along the vector field $F$. By the LaSalle-Krasovskii invariance principle, $\mu(t)$ approaches the largest invariant set contained in the set where $\dot V$ vanishes, $\{\mu \in \Delta : \dot V(\mu) = 0\} = \mathcal{RN}$ (for example by Theorem 4.4 in [3]). But since $\mathcal{RN}$ itself is an invariant set, $\mu(t)$ approaches $\mathcal{RN}$.

5.4 A discrete-time replicator equation: the REP update rule

Inspired by the continuous-time replicator dynamics, we propose a discrete-time multiplicative update rule by discretizing the ODE (20). The resulting algorithm has many desirable properties such as sublinear discounted regret and simplicity of implementation. We call it the REP algorithm in reference to the replicator ODE.

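The vector field $F$ can be written in the following form: for all $k$, $F^k(\mu) = G^k(\mu^k, \ell^k(\mu))$ where for all $p$,

$$G_p^k(\mu^k, \ell^k) = \mu_p^k \frac{\langle \mu^k, \ell^k \rangle - \ell_p^k}{\rho}.$$

This motivates the following update rule for a player $x \in X_k$ with distribution $\pi^{(\tau)}(x)$:

$$\pi^{(\tau+1)}(x) = \pi^{(\tau)}(x) + \eta_\tau G^k\big(\pi^{(\tau)}(x), \ell^k(\mu^{(\tau)})\big).$$

Definition 8 (Discrete Replicator algorithm). The REP algorithm, applied by player $x \in X_k$, with initial distribution $\pi^{(0)} \in \Delta^{\mathcal{P}_k}$ and learning rates $(\eta_\tau)_{\tau \in \mathbb{N}}$ with $\eta_\tau \le 1$, is an online learning algorithm $(U_x^{(\tau)})_{\tau \in \mathbb{N}}$ such that the $\tau$-th update function is given by $U_x^{(\tau)}\big((\ell^k(\mu^{(t)}))_{t \le \tau}, \pi^{(\tau)}\big) = \pi^{(\tau+1)}$, such that

$$\pi_p^{(\tau+1)} - \pi_p^{(\tau)} = \eta_\tau \pi_p^{(\tau)} \frac{\langle \pi^{(\tau)}, \ell^k(\mu^{(\tau)}) \rangle - \ell_p^k(\mu^{(\tau)})}{\rho} \tag{21}$$

Here, $\langle \pi^{(\tau)}, \ell^k(\mu^{(\tau)}) \rangle - \ell_p^k(\mu^{(\tau)})$ is the expected instantaneous regret of the player, with respect to bundle $p$. Thus the REP update can also be expressed in terms of the previous distribution and the expected instantaneous regret.

Under the REP update, the sequence of strategy profiles $\pi^{(\tau)}$ remains in the product of simplexes $\Delta$, provided $\eta_\tau \le 1$ for all $\tau$. Indeed, for all $\tau \in \mathbb{N}$, we have

$$\sum_{p \in \mathcal{P}_k} \pi_p^{(\tau+1)} = \sum_{p \in \mathcal{P}_k} \pi_p^{(\tau)} + \eta_\tau \frac{\langle \pi^{(\tau)}, \ell^k(\mu^{(\tau)}) \rangle \sum_{p} \pi_p^{(\tau)} - \sum_{p} \pi_p^{(\tau)} \ell_p^k(\mu^{(\tau)})}{\rho} = 1$$

and

$$\pi_p^{(\tau+1)} = \pi_p^{(\tau)} \left(1 + \eta_\tau \frac{\langle \pi^{(\tau)}, \ell^k(\mu^{(\tau)}) \rangle - \ell_p^k(\mu^{(\tau)})}{\rho}\right) \ge \pi_p^{(\tau)} (1 - \eta_\tau) \ge 0 \quad \text{if } \eta_\tau \le 1,$$

which guarantees that $\pi^{(\tau)}$ remains in $\Delta$.

We now show that the REP update rule with learning rates $\gamma_\tau$ has sublinear discounted regret. First, we prove the following lemma, for general online learning problems with signed losses.

Lemma 3. Consider a discounted online learning problem, with sequence of discount factors $(\gamma_\tau)$, with $\gamma_\tau \le \frac{1}{2}$ for all $\tau$. Let $\mathcal{P}$ be the finite decision set, and assume that the losses are signed and bounded, $m_p^{(\tau)} \in [-1, 1]$ for all $\tau$ and $p \in \mathcal{P}$. Then the multiplicative-weights algorithm defined by the update rule

$$\pi_p^{(\tau+1)} \propto \pi_p^{(\tau)} \big(1 - \gamma_\tau m_p^{(\tau)}\big) \tag{22}$$

has the following regret bound: for all $T$ and all $p \in \mathcal{P}$,

$$\sum_{0 \le \tau \le T} \gamma_\tau \big\langle m^{(\tau)}, \pi^{(\tau)} \big\rangle \le -\log \pi_{\min}^{(0)} + \sum_{0 \le \tau \le T} \gamma_\tau m_p^{(\tau)} + \sum_{0 \le \tau \le T} \gamma_\tau^2 \big|m_p^{(\tau)}\big|$$

where $\pi_{\min}^{(0)} = \min_{p} \pi_p^{(0)}$.

Proof. We extend the proof of Theorem 2.1 in [1] to the discounted case. By a simple induction, we have for all $T$, $\pi^{(T)}$ is proportional to the vector $w^{(T)}$, defined as follows:

$$w_p^{(T)} = \pi_p^{(0)} \prod_{0 \le \tau < T} \big(1 - \gamma_\tau m_p^{(\tau)}\big).$$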

arxiv: v1 [cs.lg] 31 Jul 2014