arxiv: v1 [cs.lg] 21 Sep 2018

Size: px

Start display at page:

Download "arxiv: v1 [cs.lg] 21 Sep 2018"

Maximilian Ferguson
5 years ago
Views:

1 SIC - MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits arxiv: v [cs.lg] 2 Sep 208 Etienne Boursier and Vianney Perchet,2 CMLA, ENS Cachan, CNRS, Université Paris-Saclay, Cachan, France 2 Criteo AI Lab, Paris September 24, 208 Abstract We consider the stochastic multiplayer multi-armed bandit problem, where several players pull arms simultaneously and a collision occurs if the same arm is pulled by more than one player; this is a standard model of cognitive radio networs. We construct a decentralized algorithm that achieves the same performances as a centralized one, if players are synchronized and observe their collisions. We actually construct a communication protocol between players by enforcing willingly collisions, allowing them to share their exploration. With a weaer feedbac, when collisions are not observed, we still maintain some communication between players but at the cost of some extra multiplicative term in the regret. We also prove that the logarithmic growth of the regret is still achievable in the dynamic case where players are not synchronized with each other, thus preventing communication. Finally, we prove that if all players follow naively the celebrated ucb algorithm, the total regret grows linearly. Introduction In the stochastic Multi Armed Bandit MAB problem, a single player repeatedly taes a decision or pulls an arm amongst a finite set of possibilities [K] : {,..., K}. After pulling arm [K] at stage t N, the player receives a random reward X t [0, ], drawn i.i.d. according to the unnown distribution ν of expectation µ = E[X t]. Her objective is to maximize the cumulative reward up to stage T N. This sequential decision problem, first introduced for clinical trials [23, 2], involves an exploration/exploitation dilemma where the player must balance acquiring vs. using information. The performance of an algorithm is controlled in term of regret, which is the difference of cumulated reward of an algorithm nowing beforehand the distributions ν and the cumulated reward of the player. Lai and Robbins showed that any reasonable algorithm could not asymptotically perform better than a logarithmic regret [7]. The popularization of MAB problem comes eboursie@ens-paris-saclay.fr vianney.perchet@normalesup.org

2 from the emergence of optimal algorithms as ucb [, 4] and their applications to online recommendation systems, especially in the field of advertising to optimize the clic through rate. As a consequence, many different MAB algorithms and problems have developed since 2002 [9]. In particular, they have been considered for cognitive radios [2]. In this setting, arms are channels, players are secondary users and the reward is the availability of the channel to secondary users. The problem gets more intricate when multiple players are involved. Indeed, if several players choose to transmit on the same channel at the same time, a collision occurs and the reward is assumed to be 0 as in most of the existing literature, but there might exist different ways to model collisions. If a central agent can control simultaneously the behaviors of all players then a tight lower bound is nown [3, 5]. However, this centralized problem is not adapted to cognitive radios, as it implicitly allows communication between players at each time step. This is not a practical solution as communication costs a significant amount of energy and time. As a consequence, most of the literature focuses on the decentralized case [8, 2]. In that setting, the number of players can be either nown [8, 8], estimated [2, 22] or even varying [6, 22]. This dynamic setting however needs additional assumptions on the way players leave and enter the game to recover a sub-linear regret [22]. There exists variants of the multiplayer bandits worth mentioning, although their problematics differ from ours. When reward distributions depend on users [3, 7, 5], an objective is to find a stable marriage configuration, i.e., a Pareto optimal allocation of the arms. Non-stochasticity of rewards can also be modeled by arms with marovian rewards [4]. In our setting, maybe two of the current most efficient algorithms are the mctopm algorithm[8] and Musical Chairs [22] for the stochastic decentralized multiplayer MAB. The latter also allows a variation for the dynamic setting. The guaranteed regret upper-bounds of these algorithms involve the number of players M N, the number of arms K N and the difference between expected reward µ i µ j where µ n is the n-th order statistics of µ, i.e., µ µ 2... µ K. Indeed, their regret are respectively bounded by O M 3 a,b=,...,k µ a<µ b logt µ a µ b 2 and by O MK logt µ M µ M+ 2 in the static setting. However, the first one assumes the number of players already nown while the second one assumes to now the gap µ M µ M+ between the last optimal arm and the first sub-optimal one. The first lower bound for this problem [8] has been recently improved [8]. The latest one gives an asymptotic regret in Ω M logt µ M µ >M Besides the additional M or K factors in the regret, the gap between the current upper and lower bounds is also due to the fact that the upper bounds are in µ M µ 2 while the lower bounds scale with µ M µ. This is mostly due to the fact that lower bounds are proved for collision-less algorithms, while a large part of the regret is due to collisions. This lower bound [8] also suggests that the cost of decentralization is a multiplicative factor M compared to the centralized case [3]. 2

3 Our main contributions are the following. When players observe collisions, we propose a decentralized algorithm where players communicate statistics and additional information through forced collisions. Our algorithm s regret reaches up to a multiplicative constant the lower bound of the centralized problem, meaning that the earlier mentioned lower bounds [8, 8] are unfortunately incorrect under these sets of assumptions. Indeed, the decentralized multiplayer bandits problem is not much more complex, in terms of rates of growth of regret, than the centralized one. 2. Without the observations of collisions, we still obtain a logarithmic growth of regret under an additional assumption: a positive lower bound on µ K is nown. The cost of not observing collisions is, basically, an extra-multiplicative dependency in M. Those strongly rely on the implicit assumption that players are synchronized which is not realistic. It also explains why the current literature in the multiplayer bandits failed to provide near optimal results. At the light of this, it appears that the assumption of synchronization is a cornerstone and has to be removed. 3. Without synchronization, we develop yet another algorithm with logarithmic regret, yet the dependencies in µ a µ b become quadratic again. In a concurrent and independent wor [9] with respect to this paper, the model without observation of collisions was also considered. Although some of the ideas from both papers are similar, their major points remain different. Table recaps the differences in models, assumptions and bounds between the different proposed algorithms.. Models In this section, we introduce the different models of multi-player MAB with K arms and M players, where the former is nown but the latter is unnown. At each time step, given her private information, player j [M] pulls the arm π j t and receives the reward r j t = X πj tt η πj tt where η πj tt is a collision indicator defined by : The total regret is defined by η t = #C t> with C t = {j [M] π j t = } R T = T M µ = T t= j= M µ π j tt η π j tt Different observation models can be considered. However, most of them only mae sense if rewards are Bernoulli or at least discrete, an assumption that we are going to mae as it represents well the availability of cognitive radio channels. Full Sensing: Player j observes X πj tt and η πj tt. This is the easiest bandit model in this configuration, since the player observes both the score of the arm and the collision indicator at each time-step. 3

4 Model Algorithm Input Upper bound up to constant factor logt Centralized Multiplayer MP-TS [5] M µ >M M µ logt Decentralized, Full Sensing sic-mmab - µ >M M µ + KM logt KM Decentralized, No Sensing Concurrent [9] M 2 logt MK 2 log 2 T Decentralized, No Sensing Concurrent 2 [9] M, µ µ + KM min T logt, logt logt µ >M M µ KM logt Decentralized, No Sensing sic-mmab + + KM 3 logt log 2 logt µ M µm+ 2 logt M µ >M M µ Decentralized, No Sensing sic-mmab2 2 Decentralized, No Sensing, Dynamic dyn-mmab MK2 logt M K logt m= 2 m + M 2 K logt µ M Table : Comparison of bounds for different problems and algorithms. = µ M µ M+ here. The definition of is given in [9]. It is actually as soon as 0. m is defined in Lemma. Our algorithms and results are highlighted in red. Sensing I: Player j observes X π j tt and r j t, meaning she observes the collision indicator only if the reward of the arm is. The model represents the case where a secondary user i.e., a player can sense a channel to detect the presence of a primary user X i before deciding to transmit data on this channel. Sensing II: Player j observes η π j tt and r j t, meaning she observes the collision indicator first and then the reward of the arm if no collision. No sensing: Player j only observes r j t. This is the most difficult case, as there is no extra observation. Indeed, a low empirical performance of an arm can either be a consequence of a low mean µ or, alternatively, to a high frequency of collisions. Notice that, the better an arm, the more a player tends to play it. But if all players have the same behavior, then the better an arm is, the more collisions there will be. The collision factor could balance off the intrinsic performance of an arm. This is the main difficulty of multiplayer MAB, especially without sensing. Some algorithms can easily be adapted from a model to another, e.g. see [6] for adapting Musical Chairs from Sensing II to Sensing I. 4

5 2 Full Sensing: Achieving Centralized Performances by Communicating through Collisions In this section, we will consider the Full Sensing model and show that the decentralized problem is not much more complicated, in terms of regret growth, than the centralized one, when players are synchronized: we provide in the following an algorithm reaching up to a multiplicative constant the lower bound of the centralized Multiplayer MAB [3]. This algorithm strongly relies on the synchronization assumption, which actually allows players to communicate together through observed collisions - the communication protocol is detailed and explained later on. This result also implies that the two lower bounds provided in the literature [8, 8] are unfortunately not correct. Especially, the factor M that was supposed to be the cost of the decentralization in the regret should not appear, at least in the Full Sensing model, and for the Sensing II model, as our algorithm can be easily adapted to that case. Our algorithm sic-mmab consists in several phases:. First, an initialization phase lines -4 in Algorithm gives an estimate of the number of players M and attributes a ran to each player 2. Players then alternate exploration phases and communication phases lines 5-7 a During the exploration phase line 6, performance of arms are estimated in a successive accept and rejects fashion [0] yet without collisions. b The communication phase lines 7-3 allows players to share their statistics c Statistics are updated lines 4-6 after those two phases 3. The last phase, the exploitation line 8, is triggered as soon as an arm is detected as among the top-m arms and it consists in pulling this arm until the final horizon T. 2. Notations For the sae of clarity, we first introduce some notations. Players that are not yet in the exploitation phase are called active. We denote, with a slight abuse of notations, [M t ] the set of active players at stage t and by M t M its cardinality. Notice that M t is non increasing because players can only enter the exploitation phase but cannot leave it. Arms that must still be explored i.e., players can t tell whether they are among the top-m arms or not yet are called active. By construction of our algorithm, this set is shared by all the players at each time step. We denote, with the same abuse of notations, the set of active arms by [K t ] of cardinality K t K. Our algorithm is based on a protocol called sequential hopping []. It consists in incrementing the index of the arm a player plays, i.e., if she plays arm πt at time t, she will play πt+ = πt + mod K t+ at time t Description of the protocol As mentioned above, the sic-mmab algorithm consists in several phases. At the end of a communication phase, players update and share the same statistics on arms, thus the decentralized case becomes equivalent to the centralized one. If an arm has been pulled 2 n times, it only requires n + bits to communicate the relevant statistics: the total reward 5

6 Algorithm sic-mmab algorithm Input: T : Musical Chairs until t = T 0 2: Set the index of the arm where the player is fixed at T 0 3: Play until time t = T : Until t = T 0 + 2K, hop sequentially {now M and her internal ran} 5: for T expl 2, 2 2,..., 2 N until fixed do 6: Estimate the remaining arms for a time K t T expl by starting sampling the remaining arm corresponding to her ran among the active players and proceeding sequential hopping among the active arms 7: for j [M t ] do 8: for l [M t ] \ {j} do 9: for [K t ] do 0: Player j communicates in time log 2 T expl + to player l the number of rewards she observed on the arm : end for 2: end for 3: end for 4: All the players update their ranings and have the same statistics. 5: If p arms are detected as among the top-m, each active player with the internal ran j,..., p will exploit the j-th arm among these p arms 6: Update the active arms, the active players and the internal ran 7: end for 8: Continue exploiting during those 2 n pulls. And a bit of information can be sent by forcing or not a collision. Sections 2.2., and respectively describe the initialization, the exploration and the communication phases. After a large enough number of alternations between exploration and communication phases, sub-optimal arms are detected, and eliminated and players are fixed to an arm among the top-m and will exploit it until time T. Notice that our algorithm can be easily adapted to be fair amongst players they asymptotically have the same average reward. During the exploitation phase, players just have to sequentially hop between the top-m arms. Similarly, exploration can be made by mini-batches before hopping to reduce the number of switches [20] Initialization phase The objectives of the first phase is to estimate the number of players M and to distribute rans amongst them. First, players follow the Musical chairs algorithm see [22]: they sample uniformly among the K arms, during T 0 steps and start pulling the same arm as soon as they encounter no collision on it. A correct choice for T 0 is given below by Lemma. At time t = T 0, each player will have a unique number starting from 0 which is the index of the arm she is pulling at T 0. We call it the external ran of a player. The second procedure, explained in Figure, will determine M and their internal ran, which is the raning of a player amongst the others. As an example, if there are three players on the arms 5, 7 and 2 at t = T 0, their external rans respectively correspond to 5, 7 and 2 while their internal rans are, 2 and 0. The procedure is the following. If a player 6

7 t = T 0 t = T 0 + t = T ext. ran 0 int. ran ext. ran 2 int. ran ext. ran 2 int. ran ext. ran int. ran j ext. ran int. ran j ext. ran int. ran j ext. ran int. ran j ext. ran 0 int. ran ext. ran int. ran j ext. ran + 2 int. ran j + ext. ran + 2 int. ran j + ext. ran + 2 int. ran j + ext. ran int. ran j K 2 K 2 K 2 ext. ran K int. ran M K ext. ran K int. ran M K ext. ran K int. ran M K Figure : Estimating M and the internal ran in a time 2K. A column represents the set of arms. The left resp. right arrows represent the choice of a player fixed to an arm resp. sequentially hopping. The information written in red resp. blue is nown resp. unnown to the player at the considered time step. has the external ran, then she will pull until T and will then start sequential hopping. This ensures that players start sequential hopping in a delayed fashion and that the player with the external ran will collide with the player with the external ran at time T At stage T 0 +2K, player of external ran nows her internal ran and M. Indeed, they are respectively the total number of collisions she encountered during the time intervals [T 0, T 0 + 2] and [T 0, T 0 + 2K ]. After that point, all the players have perfect nowledge of M and also have an attributed internal ran in [M]. During the other phases, players will always now the set of active players [M t ] and, each player has a unique internal ran in [M t ]. This is what breas the initial symmetry between all players and allows the algorithm to establish clean communication protocols. 7

8 2.2.2 Exploration phase 2 logt s For the exploration phase, we note B s = the confidence bound after s pulls and ˆµ p the centralized empirical mean of the arm after the p-th exploration phase. The p-th exploration phase simply consists in sequential hopping among the set of active arms for a time K t 2 p so that each arm is pulled 2 p times. Each player starts by pulling the arm whose index among the active arms corresponds to her internal ran. The exploration phase is then collision-less, thans to the sequential hopping procedure. After a succession of two phases exploration and communication, an arm is said to be accepted if #{i [K t ] ˆµ p B T p ˆµ i p + B Tip} K t M t where T i p is the centralized, total number of times the arm i has been pulled after the p-th exploration round. This inequality actually says that the number of arms worse than is at least K t M t with high probability among the active arms. In the same way, is declared as rejected if #{i [K t ] ˆµ i p B Tip ˆµ p + B T p} M t meaning that there are at least M t active arms better than with high probability Communication phase In this phase, each active player will, one at a time, communicate her statistics on the active arms to all other active players. Recall that since an arm is pulled 2 p times by a single player during the p-th exploration phase, the statistics of an arm can be sent by a player in log 2 2 p + bits, i.e., collisions. An active player has two possible status during this phase:. either she is receiving some other players information. In that case, she is always pulling the arm whose index among the active arms corresponds to her internal ran 2. or she is communicating her statistics. This status starts just after the end of the communication for the previous player i.e., with the internal ran just below hers. Then for all active arms and all active players, one at a time, she communicates her statistics about the arm to the player j. This is done by sending in binary a collision corresponding to a the sum of rewards she observed on the arm to the player j during the previous exploration phase. Since all active players perfectly now [M t ], [K t ] and their internal rans, they now when they have to switch from the receiving to the communicating status. While receiving information, they also now which player is communicating about which arm Updating the active sets At the end of the communication phase, all active players share the same information on arms. They now which ones must be accepted and/or rejected. Rejected arms are removed right away from the set of active arms. 8

9 The procedure is a bit more tricy for the accepted arms. Those arms have an intrinsic raning given by their respective indices. If the arm with ran j is declared as accepted, then the active player with the same internal ran j will start pulling this arm until the final horizon. This is possible since the number of accepted arms cannot exceed the number of remaining active players. Afterwards, all accepted arms are removed from the set of active arms and players update [K t ], [M t ] and their internal ran. 2.3 Regret analysis This section is devoted to the analysis of our algorithm. The regret is decomposed as follows, R T = R init + R explo + R comm + R bad run 2 where the first term corresponds to the regret due to the initialization phase, the second one to the exploration phases and the third one to the communication phases. The last term corresponds to the regret caused by bad runs, i.e., incorrect initialization no orthogonal setting at T 0 or exploration/estimation of arms. In case of a good run, the exploitation step is loss-less Initialization regret Recall that the initialization phase consists in T 0 steps of Musical Chair and in a procedure of length 2K. Lemma. If T 0 K logt, then after time T 0, all players will play different arms with probability at least M T. As a consequence, R init = OKM logt The proof is simple and delayed to Section B Exploration regret For the exploration phase, one must just control the number of times an arm must be pulled before being declared as accepted or rejected, with high probability. Proposition. All optimal arms are declared accepted before being pulled t = O µ µ M+ 2 logt times in total, and every sub-optimal arm is declared as rejected after being pulled at most t = O µ M µ 2 logt with probability greater than O K logtm T The proof is delayed to Section B.2, in the appendix. In the following, we eep the notations t = O µ µ M+ logt for an optimal arm and t 2 = O µ M µ logt 2 for a suboptimal one.. 9

10 For the exploration phases, the decomposition used in the centralized case [3] holds because there is no collision at all during the exploration phases: R explo = µ M µ E[T explo T ] + µ µ M T explo E[T explo T ] 3 >M M Where T explo = T T init T comm and T explo T is the number of time the -th best arm has been pulled for exploration. Lemma 2. For a sub-optimal arm : E[T explo T ] = O logt µ M µ 2 Moreover, in case of a good run, the following holds for the optimal arms: µ µ M T explo E[T explo ] = O logt, µ M µ j M and thus it trivially holds that R explo = j>m j>m logt O µ M µ j The first part is a direct consequence of the fact that E[T explo T ] t given by Proposition. The proof of the second part is also delayed to the appendix, Section B Communication cost We now focus on the R comm term in Equation 2. Lemma 3. For the communication phases, the regret is bounded as follows: R comm = O KM 3 log 2 logt µ M µ M+ 2 Proof. As explained in Section 2.2.3, each communication phase is of length at most KM 2 p+ where p = 0,..., N. Also, N log 2 t M + in case of a good run, since t M = max t. So for the cost of communication: log 2 t M + R comm KM 3 n n=0 O KM 3 log 2 2t M This term will be significant for low values of T, but asymptotically, it is in ologt. 0

11 2.3.4 Bad run regret The term R bad run is due to the low probability events of bad initialization or bad estimation during the exploration. Lemma and Proposition directly provide the following result: Lemma 4. R bad run = O M 2 logt + MK log µ M µ M+ 2 Proof. We got from Lemma and Proposition that: P [ bad run ] M O T + K log 2t M T So we have: R bad run MT P [ bad run ] = O M 2 + MK log 2 t M Total regret Theorem finally provides an asymptotical upper bound of the regret: Theorem. For any given set of parameters K, M and µ: lim T R T logt c >M where c and c 2 are two problem independent constants µ M µ + c 2 KM Proof. This is a direct consequence of Lemmas, 2, 3 and 4 and the regret decomposition given by Equation Experiments We now compare the empirical performances of sic-mmab with the mctopm algorithm [8], on generated data. We also compared with the musicalchairs algorithm [22], but its regret is so large that it squeezes the pictures and displaying it maes it really hard to distinguish between sic-mmab and mctopm. All the considered values for regret are averaged over 200 runs. Figure 2 represents the evolution of the regret for both algorithms with the following problem parameters: K = 9, M = 6, T = The means of the arms are linearly distributed between 0.9 and 0.89, so the gap between two consecutive arms is The alternation between exploration phases and communication phases is easily observable here. Figure 3 represents the evolution of the final regret according to the gap between two consecutive arms in a logarithmic scale. The problem parameters K, M and T are the same. Although mctopm seems to provide better results with low values of, sic-mmab seems to have a smaller dependency in /. This confirms the theoretical results claiming that mctopm scales with 2 while sic-mmab scales with. This can be observed on the left part of Figure 3 where the slopes is approximately twice as big for mctopm than for

12 sic-mmab. Also, a different behavior of the regret appears for very low values of which is certainly due to the fact that the regret linearly scales with for extremely small values of. In practice, mctopm might outperform sic-mmab, however simulations tend to confirm the theoretical upper bounds asymptotically. Figure 2: Evolution of regret over time Figure 3: Evolution of final regret according to logarithmic scale 2.5 In contradiction with existing lower bounds? Theorem directly contradicts the two lower bounds provided in the literature [8, 8], however sic-mmab respects the conditions required for both lower bounds. It was thought that the cost of the decentralization was a multiplicative factor M compared to the centralized lower bound [3]. However, with our approach, it appears that the regret of the decentralized case does not seem so much different from the one in the centralized case, at least if there is synchronization between the players. Indeed, sic-mmab actually taes advantage of this synchronization to establish communication protocols where players are able to communicate through collisions. The incorrect statement in the paper [8] seems to be found in Lemma 2. It claims that the probability distribution of collisions does not depend on the distributions of the arms while it does, especially for sic-mmab. In the second paper [8], the mistae would be in Lemma 3, since the proof does not tae into account the fact that the actions of a player also depend on the collisions she observed. Our algorithm reaches, up to a constant factor, the lower bound from the decentralized algorithm [3], up to an additional term in OKM logt due to the initialization. If a pre-agreement is allowed between the players, this phase would not be needed, and this additional term in the regret would disappear. As a consequence, Algorithm shows that the decentralized multiplayer bandits in case of time synchronization between the players is almost equivalent to the centralized multiplayer problem which is a priori a much simpler problem. Since the synchronization is not a reasonable assumption in the practical case and that it leads to undesirable lower bounds, the assumption of synchronization should be removed when considering the multiplayer MAB problem. This setting is called the dynamic setting. However, this problem is quite complex to model formally, since we must either put some restrictions on the way the players enter or leave the game, or find an appropriate formulation of the regret. Indeed, without them and as already noticed [22], the players could enter and 2

13 leave so quicly that they would learn nothing about the arm distributions. With the natural formulation of regret in the dynamic setting, the regret would then be linear. The difficulty to have a formal interesting model under this setting explains why most of the literature currently focuses on the synchronized case. 3 Without Full Sensing. Communication through Synchronization This section focuses on the No Sensing model, with synchronization the dynamic setting is considered in the following Section. First of all, we claim that a communication protocol similar to the one of sic-mmab can be devised here. Indeed, in the Full Sensing model, a bit is sent through a single collision. Without sensing, it can be done with arbitrary high probability using logt time-steps where denotes a nown lower bound of the average rewards of all arms i.e., assuming Hypothesis stated below. This only adds a multiplicative factor of to the initialization phase and logt to the communication one. So, sic-mmab can be easily adapted for the No Sensing model into the sic-mmab with a regret scaling as: O >M logt KM logt + + KM 3 logt log 2 µ M µ logt µ M µm+ 2 Although this regret is not asymptotically logarithmic because of the logt log 2 logt term, for specific problem parameters and for a fixed horizon T, this upper bound would be lower than the natural bound assuming each player actually has to individually explore all arms: M logt. µ M µ >M This remar illustrates how complex an anytime lower bound for the No Sensing setting can be. In this section, we introduce a first algorithm for the synchronized case. It also relies on a communication protocol, but with more limited information. In the following section, we will introduce another algorithm for the dynamic case; it will not be based on communication between players. As a consequence its regret might be larger, but it is adapted to more realistic scenarios. These are among the first algorithms for the No Sensing model with strong theoretical results and logarithmic regrets. Simple algorithms as ucb provide good results in practice but appear to actually provide a linear regret with a constant probability [8]. In Appendix A, the discussion about ucb is extended and the reasons of its failure are explained for the No Sensing model, using algebraic arguments. 3. Adapted Communication Protocol The algorithm sic-mmab2 is described below. The involved subroutines will be detailed in the next sections. Similarly to sic-mmab, it starts with an initialization phase to estimate M. It then alternates between exploration and communication phases in the same fashion. But the goal of the communication phases is to communicate to the other players which arms are optimal or sub-optimal and not to transmit statistics. This allows to eep updated the sets of active arms and active players. How to declare such arms and how to detect 3

14 declarations from other players is detailed in section The algorithm then ends with an exploitation phase as sic-mmab. The following assumption is required for sic-mmab2: Hypothesis. A lower bound of the arm means is nown to all players: min µ i i Some terms of the regret of sic-mmab2 scale with logt. In practice, the lowest average reward is at least one or two orders of magnitude larger than the gaps between the different arms. Thus, the significant terms in the regret are actually those in logt i. Also, it seems reasonable to have the nowledge of such a lower bound; indeed, just represents the order of magnitude of the average rewards of the arms, often nown in practice. This assumption is very close to an assumption made for other no sensing algorithms [9]. The notations for M t and K t are the same as in sic-mmab. Figures 4 and 5 illustrate the possible cases of an alternation between declaration and exploration phases. They will be further explained in Section Initialization phase The objective of the initialization phase is to estimate M here and is explained in Algorithm 3. The idea is the same as for sic-mmab, but as already mentioned, instead of a single time step, a number of logt time steps is needed to transmit a bit with high probability. Then, M is estimated as the number of slots of duration logt where only 0 rewards are observed. If we tae a loo bac at sic-mmab, this procedure is the same as its initialization with the following differences:. the fixation phase is longer adding a factor 2. when colliding with other players to communicate her presence, instead of doing it a single time-step, it is done logt times 3. a collision is detected when a player observes only 0 for logt consecutive time-steps. Notice that the internal ran can also be estimated but won t be used here, because it can t be updated during the exploration/communication phases Exploration phases Each exploration round is split into two parts: a first one where each player fixes to an arm, a second one where all players are sequentially hopping to explore the remaining arms without any collision. At the end of a fixation phase, all players are on different arms they are in an orthogonal position and thus sequential hopping can start. The decisions for accepting/rejecting arms, are still based on the exploratory pulls as in sic-mmab. The differences in exploration are the following:. Statistics are not shared between the players; this induces a multiplicative factor M in the regret 4

15 Algorithm 2 sic-mmab2 algorithm Input: T, : Estimate M until time T M {see section 3..} 2: T logt 3: for T expl in T 0, 2T 0,..., 2 N T 0 until fixed do 4: if n = 0 or there was a declaration during the last phase then Kt logt 5: Fix to an active arm using Musical Chairs during a time 6: Estimate the active arms during a time K t T expl doing sequential hopping 7: else 8: Add the statistics of the last declaration phase in the exploratory statistics {i.e., the last declaration phase is used as part of this exploration phase} 9: Continue sequential hopping according to the end of the last phase for a time K t T expl T d 0: end if : T d K t T 0 2: for in the arms detected as sub-optimal do 3: Declare as a sub-optimal arm if not declared by another player yet {see section 3..3} 4: end for 5: if at least an arm is detected as optimal then 6: Try to occupy any arm detected as optimal time T d per fixation phase 7: end if 8: Continue sequential hopping until there is a phase of length T d without new declaration 9: Any arm with reward 0 during this last phase corresponds to an occupied arm update [M t ] 20: update [K t ] 2: end for 22: if Fixed then 23: Exploit by playing the arm where fixed 24: end if Algorithm 3 Estimation of M Input: T, K Musical Chairs until time logt Play until time t = K logt + 2 logt Until t = K logt + 2K logt, hop sequentially with batches of length logt between arm switches 2. A fixation phase is added at the beginning of a new exploratory round, if there was at least one declaration in the previous declaration phase. This fixation is needed for players to reach an orthogonal position before the sequential hopping. An orthogonal position can t be reached instantly because the internal rans can t be updated in sic-mmab2. Figures 4 and 5 illustrate when such a fixation phase is added 5

16 All players start declaring Player j ended declaring Waits for other players to end All players ended declaring a slot ago Fixation phase Sequential hopping time time K t T expl T f Declare an arm as suboptimal time T d Declare an arm as suboptimal time T d Try to occupy optimal arms time T d Detect other players declarations time T d Detect other players declarations time T d No new declaration observed time T d Fixation phase time T f Sequential hopping time 2K t T expl Figure 4: Alternation between fixation, exploration and declaration blocs. Case where the players declare sub-optimal and try to occupy in vain optimal arms No player declared anything Fixation phase time T f Sequential hopping time K t T expl No new declaration observed time T d Sequential hopping time 2K t T expl T d The declaration phase is included in the exploration one and no fixation phase is required Figure 5: Alternation between fixation, exploration and declaration blocs. Case with no declaration. In that case, the single declaration phase, which just consists in sequential hopping, is included in the next exploration phase lines 7-0 in Algorithm 2. No fixation phase is needed in that case Communication phase All players are in the communication phase at the same time. However, this phase is decomposed into blocs of same length T d to eep synchronization. A bloc can be of three different types, and the type of a bloc does not need to be the same for all players, as illustrated by Figure 4. Types are the folllowing Declaration bloc for player j: she communicates to other players that an arm is suboptimal Fixation bloc for player j: she tries to occupy any arm that she detected as optimal. If she succeeds, she exploits that arm until the end of the algorithm Reception bloc for player j: she just sequential hops, in order to detect other players declarations Player j starts the communication phase with declaration blocs, one per arm detected as sub-optimal. She then proceeds to a fixation bloc, had she detected any arm as optimal during the last exploitation phase. She then proceeds to reception blocs until no new declaration is detected. As soon as no new declaration is detected, she starts a new exploration phase. Notice that for the three types of blocs, players eep receiving declarations from other players. Declaration bloc: In a declaration bloc, players follow Algorithm 4. The idea is to often sample the sub-optimal arm to send a signal to the other players. The other players 6

17 will detect this signal by observing a significant loss in the empirical reward of this arm. However, a player sending a signal should also be able to detect signals on other arms sent by other players. That is the reason why to declare an arm as sub-optimal, a player randomly chooses between pulling this arm and sequential hopping. Algorithm 4 Declare an arm as sub-optimal to the other players Input: T,, [K t ] During a time { T d : {T d computed according to Lemma 7} with probability Pull an arm 2 according to the sequential hopping on [K t ], with probability 2 Lemma 7 gives a good choice for T d such that the declaration is detected by every player, without detecting any false positive declaration, no matter the bloc they are currently proceeding, with high probability. Indeed, let ˆµ i and ˆr i be respectively the empirical reward during the exploration phases and during the last communication bloc for arm i and player j. Arm i is considered as perceived as sub-optimal by another player if: ˆµ i ˆr i 4 ˆµ i 4 Lemma 7 states that the players will only detect arms declared as sub-optimal or that are occupied by a player. However, using the last reception bloc where there is no new declaration, it is easy to distinguish optimal arms occupied by a player from the sub-optimal arms. Indeed, for the former, only 0 are observed during this last phase, while on the latter, at least one positive reward will be observed during the phase of length T d with probability at least T, since there is no player pulling it except for the sequential hopping. Fixation bloc: In a fixation bloc, player j continues sequential hopping on the active arms and decides to occupy an arm as soon as an arm detected as among the M optimal ones returns a reward no collision at that stage. In that case, she exploits this arm until the final horizon T. At the end of a bloc, if she did not manage to fix to any of these arms, then all of them are certainly occupied by other players with high probability. Declarations made by other players are taing care of, following the rule of Equation 4. Reception bloc: In a reception bloc, player j simply sequentially hops and detects the declarations of other players, following the rule of Equation 4. Notice that every active player will at least proceed to one reception bloc per communication phase if she does not manage to occupy an optimal arm. The last reception bloc is considered as the bloc where no new declaration is detected. This bloc is thus the same for all active players. Moreover, the only arms giving 0 rewards during this last reception bloc correspond to optimal arms occupied by other players. This allows to distinguish between arms occupied by another player, i.e., the optimal ones, or just declared by some player, i.e., the sub-optimal ones. As a consequence, active players can update the set of active arms [K t ] and the number of active players M t at the end of each communication phase. 7

18 3.2 Regret Analysis The same decomposition of the regret will be used for sic-mmab2: 3.2. Initialization Regret R T = R init + R explo + R comm + R bad run 5 Lemma 5 claims that the initialization is successful with high probability, meaning all players perfectly now M after this phase. Lemma 5. With probability at least O KM T, at the end of this procedure, every player has a correct estimation of M. In particular, this yields that MK R init = O logt The proof of this lemma is delayed to Section C. in the appendix Exploration regret The proof of Lemma 5 also claims that with probability at least O M t T, all the players are in an orthogonal position at the end of a single fixation phase. In case of good runs, the exploration will thus be collision-less. Using the same ind of arguments as for Lemma 2, we provide an upper bound for the exploration regret of sic-mmab2. Lemma 6. E[R explo ] O >M M logt + MK2 logt µ M µ The proof is also delayed to the appendix, in Section C Communication regret The following Lemma 7 provide the properties, guarantees and cost of the algorithm during the communication phases. Lemma 7. Let the length of a bloc be such that T d, then. With probability at least O M T, a player j declaring an arm i as sub-optimal will be successfully detected by all the other players; 2. With probability at least O MK T, no player will detect a false declaration during the declaration bloc i.e., no arm is detected as sub-optimal if there was no declaration or if it is not occupied by another player; 3. With probability at least O M T, if player j starts occupying arm, it is detected as a declaration by active players following the rule of Equation 4. The communication regret is bounded as MK R comm 2 = O logt 8 200Kt logt

19 Proof. The proof of the first three points is delayed to Section C.3 in the appendix. They imply that, when in a fixation bloc and with probability at least O M T, player j either succeeds to occupy an arm she detected as optimal, or all the arms she detected as optimal are occupied by other players at the end of the bloc. Also, if she manages to occupy an arm, this arm will be detected as declared by other players. This yields that active players remain synchronized and share the nowledge of M t and [K t ] during the whole process. Thans to the construction of the algorithm, there can not be two different blocs used to declare or occupy the same arm, hence the number of declaration phases will be at most K we emphasize here that the declaration phases are not included in an exploration phase. Since each bloc is of length T d, the communication regret is then bounded by KMT d, which gives the result. We mention here that, we described the declaration with probabilities /2. It can be seen from the proof in Section C.3 that T d is actually optimized when pulling the arm to declare with probability 2/ Bad run regret A bad run might occur for three different reasons: a fixation phase fails, 2 there is a wrong estimation of an arm by at least one player, 3 a declaration bloc of some player failed. The previous lemmas directly lead to the following results. Lemma 8. R badrun = O R explo Proof. We consider the three reasons of bad run independently.. Thans to Lemma 5, the probability of failure of a single fixation phase including the blocs in communication phase is in O M T. The number of fixation phases is smaller than OK, so the total loss for the first reason is smaller than OKM For the case of bad exploration, as explained in the proof of Lemma 6, the cost of the low probability event of a bad estimation of an arm is lower than the cost of exploration in case of a good run. We thus simply claim that the cost is dominated by E[R explo ] and actually even by a tighter term. 3. For the case of declarations, as implied by Lemma 7, the probability of failure of a single declaration or reception bloc is smaller than O KM T. There are at most K + N such blocs, where N = O log is the number of exploration phases logt 2 M see the proof of Lemma 6. So the regret due to this case is also dominated by the exploration regret. Summing these three terms yields the result Total Regret Using all these previous intermediate results, we now provide an upper bound for the regret incurred by sic-mmab2: 9

20 Theorem 2. sic-mmab2 has a regret scaling as: M E[R T ] = O logt + MK2 logt µ M µ >M Proof. This is a direct consequence of Lemmas 5, 6, 7, 8 and Equation 2. 4 Without Synchronization, the Dynamic Setting In this section, we no longer want to assume that players can communicate using extreme synchronisation. In the previous sections, it was crucial that all exploration/communication phases start and end at the exact or at least approximately exact same time. This assumption is clearly non-realistic and should be removed, as radios do not start and end transmitting simultaneously. So we are going to assume that players enter the game at different times, without nowing how many players are already present and without informing them of their arrival. As the model needs strong additional assumptions on how players can enter and leave the game [22] to avoid trivial linear regret bound, we are going to assume for simplicity that players do not leave the game once they have entered it. More precisely, player j starts pulling arm at a time T j start until the horizon T, and we denote by Mt the numbers of present players at stage t. We mention that the following results can also be adapted to the case where players can leave the game during specific intervals or where they all share an internal synchronized cloc [22], which is actually a different assumption than synchronization. 4. A Communication-less Protocol Algorithms sic-mmab and sic-mmab2 heavily rely on the fact that phases in the communication protocols start and end at the exact same moments for all active players, which is only possible if they are synchronized. In the dynamic setting, this synchronization is not possible and our new algorithm dyn-mmab does not rely on the same trics as the previous algorithms. The protocol adapted to the dynamic setting is pretty simple: a player will only follow two different sampling strategies: either she samples uniformly at random among all the arms, or she exploits an arm found optimal and pulls it until the final horizon T. In the first case, the exploration of the other players is not too disturbed as it only changes the mean reward of all the arms by a more or less constant factor. In the second case, the exploited arm will appear as sub-optimal to the other players, and this is not hurtful as this arm is now occupied. In the first phase, players uniformly explore among all the arms, without nowing M recall it may evolve dynamically during the process. As soon as a player distinguishes an arm as optimal i.e. the best available arm, she tries to occupy it as follows. She actually continues to sample uniformly and will occupy this arm as soon as she encounters a positive reward. The player either succeeds to occupy an arm within a given number of stages, The naive choice to sample only this arm until observing a and to decide it is already occupied if too many 0 are observed does not deal with cases where several players try to occupy the same arm at the same time. 20

21 and then exploits it until the final horizon, or she fails. In the latter, she mars this arm as occupied and continues to explore to find the next available optimal arm. Here, an arm is said active if it was neither detected as occupied nor optimal at some point. Using the notations described below, this means that this arm is not in Preferences Occupied. Here are the different notations for dyn-mmab : Preferences is the player s raning of arms. The player tries to occupy the p-th arm in Preferences until she succeeds or mars it as occupied p represents the index in Preferences a player is currently trying to occupy T and ˆr T temp respectively are the number of pulls and the empirical mean reward on arm and ˆr temp on arm since the last reset. A reset occurs on arm every T f respectively represent the number of pulls and the empirical mean reward pulls. T f is constantly updated and represents the number of successive 0 to observe on arm to be sure that is actually occupied. T f is updated at each stage with the following rule: T f t + min e logt ˆr t + B T t+t +, T f t 6 with the convention that T f 0 = + and x/0 = + for all non negative x. Occupied is the set of arms mared as occupied by a player B s t corresponds to the confidence bound. Its exact value is given in Lemma 9. The pseudo-code is detailed in Algorithm Exploration rules Two main actions can happen during the exploration. First, an arm can be considered as occupied and is then added to Occupied. Also, it can be considered as optimal and is then added to Preferences. An arm is added to Occupied can already be in Preferences if only 0 have been observed during a whole dynamic sliding window whose length might vary with time. A dynamic sliding window ends when at least T f new observations have been gathered on arm. A new window is then started. The crucial quantity of interest is therefore the length T f. It is an estimation of the number of successive observed 0 required to consider an arm as occupied with high probability. This value is thus constantly updated using Equation 6, which uses the current estimation of a lower bound for µ. This rule is described at lines 5-20 in Algorithm 5. An active arm is added to Preferences if it is better, with high probability, than any active arm: if l, l an active arm, ˆr B T t > ˆr l + B Tl t then is added to Preferences The value for B s t is not as easy as for the previous algorithms and is given by Equation 8 below. This rule is described at lines 2-23 in Algorithm 5. 2

Online Learning and Sequential Decision Making

Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Sequential Decision