arxiv: v1 [cs.lg] 21 Sep 2018

Size: px
Start display at page:

Download "arxiv: v1 [cs.lg] 21 Sep 2018"

Transcription

1 SIC - MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits arxiv: v [cs.lg] 2 Sep 208 Etienne Boursier and Vianney Perchet,2 CMLA, ENS Cachan, CNRS, Université Paris-Saclay, Cachan, France 2 Criteo AI Lab, Paris September 24, 208 Abstract We consider the stochastic multiplayer multi-armed bandit problem, where several players pull arms simultaneously and a collision occurs if the same arm is pulled by more than one player; this is a standard model of cognitive radio networs. We construct a decentralized algorithm that achieves the same performances as a centralized one, if players are synchronized and observe their collisions. We actually construct a communication protocol between players by enforcing willingly collisions, allowing them to share their exploration. With a weaer feedbac, when collisions are not observed, we still maintain some communication between players but at the cost of some extra multiplicative term in the regret. We also prove that the logarithmic growth of the regret is still achievable in the dynamic case where players are not synchronized with each other, thus preventing communication. Finally, we prove that if all players follow naively the celebrated ucb algorithm, the total regret grows linearly. Introduction In the stochastic Multi Armed Bandit MAB problem, a single player repeatedly taes a decision or pulls an arm amongst a finite set of possibilities [K] : {,..., K}. After pulling arm [K] at stage t N, the player receives a random reward X t [0, ], drawn i.i.d. according to the unnown distribution ν of expectation µ = E[X t]. Her objective is to maximize the cumulative reward up to stage T N. This sequential decision problem, first introduced for clinical trials [23, 2], involves an exploration/exploitation dilemma where the player must balance acquiring vs. using information. The performance of an algorithm is controlled in term of regret, which is the difference of cumulated reward of an algorithm nowing beforehand the distributions ν and the cumulated reward of the player. Lai and Robbins showed that any reasonable algorithm could not asymptotically perform better than a logarithmic regret [7]. The popularization of MAB problem comes eboursie@ens-paris-saclay.fr vianney.perchet@normalesup.org

2 from the emergence of optimal algorithms as ucb [, 4] and their applications to online recommendation systems, especially in the field of advertising to optimize the clic through rate. As a consequence, many different MAB algorithms and problems have developed since 2002 [9]. In particular, they have been considered for cognitive radios [2]. In this setting, arms are channels, players are secondary users and the reward is the availability of the channel to secondary users. The problem gets more intricate when multiple players are involved. Indeed, if several players choose to transmit on the same channel at the same time, a collision occurs and the reward is assumed to be 0 as in most of the existing literature, but there might exist different ways to model collisions. If a central agent can control simultaneously the behaviors of all players then a tight lower bound is nown [3, 5]. However, this centralized problem is not adapted to cognitive radios, as it implicitly allows communication between players at each time step. This is not a practical solution as communication costs a significant amount of energy and time. As a consequence, most of the literature focuses on the decentralized case [8, 2]. In that setting, the number of players can be either nown [8, 8], estimated [2, 22] or even varying [6, 22]. This dynamic setting however needs additional assumptions on the way players leave and enter the game to recover a sub-linear regret [22]. There exists variants of the multiplayer bandits worth mentioning, although their problematics differ from ours. When reward distributions depend on users [3, 7, 5], an objective is to find a stable marriage configuration, i.e., a Pareto optimal allocation of the arms. Non-stochasticity of rewards can also be modeled by arms with marovian rewards [4]. In our setting, maybe two of the current most efficient algorithms are the mctopm algorithm[8] and Musical Chairs [22] for the stochastic decentralized multiplayer MAB. The latter also allows a variation for the dynamic setting. The guaranteed regret upper-bounds of these algorithms involve the number of players M N, the number of arms K N and the difference between expected reward µ i µ j where µ n is the n-th order statistics of µ, i.e., µ µ 2... µ K. Indeed, their regret are respectively bounded by O M 3 a,b=,...,k µ a<µ b logt µ a µ b 2 and by O MK logt µ M µ M+ 2 in the static setting. However, the first one assumes the number of players already nown while the second one assumes to now the gap µ M µ M+ between the last optimal arm and the first sub-optimal one. The first lower bound for this problem [8] has been recently improved [8]. The latest one gives an asymptotic regret in Ω M logt µ M µ >M Besides the additional M or K factors in the regret, the gap between the current upper and lower bounds is also due to the fact that the upper bounds are in µ M µ 2 while the lower bounds scale with µ M µ. This is mostly due to the fact that lower bounds are proved for collision-less algorithms, while a large part of the regret is due to collisions. This lower bound [8] also suggests that the cost of decentralization is a multiplicative factor M compared to the centralized case [3]. 2

3 Our main contributions are the following. When players observe collisions, we propose a decentralized algorithm where players communicate statistics and additional information through forced collisions. Our algorithm s regret reaches up to a multiplicative constant the lower bound of the centralized problem, meaning that the earlier mentioned lower bounds [8, 8] are unfortunately incorrect under these sets of assumptions. Indeed, the decentralized multiplayer bandits problem is not much more complex, in terms of rates of growth of regret, than the centralized one. 2. Without the observations of collisions, we still obtain a logarithmic growth of regret under an additional assumption: a positive lower bound on µ K is nown. The cost of not observing collisions is, basically, an extra-multiplicative dependency in M. Those strongly rely on the implicit assumption that players are synchronized which is not realistic. It also explains why the current literature in the multiplayer bandits failed to provide near optimal results. At the light of this, it appears that the assumption of synchronization is a cornerstone and has to be removed. 3. Without synchronization, we develop yet another algorithm with logarithmic regret, yet the dependencies in µ a µ b become quadratic again. In a concurrent and independent wor [9] with respect to this paper, the model without observation of collisions was also considered. Although some of the ideas from both papers are similar, their major points remain different. Table recaps the differences in models, assumptions and bounds between the different proposed algorithms.. Models In this section, we introduce the different models of multi-player MAB with K arms and M players, where the former is nown but the latter is unnown. At each time step, given her private information, player j [M] pulls the arm π j t and receives the reward r j t = X πj tt η πj tt where η πj tt is a collision indicator defined by : The total regret is defined by η t = #C t> with C t = {j [M] π j t = } R T = T M µ = T t= j= M µ π j tt η π j tt Different observation models can be considered. However, most of them only mae sense if rewards are Bernoulli or at least discrete, an assumption that we are going to mae as it represents well the availability of cognitive radio channels. Full Sensing: Player j observes X πj tt and η πj tt. This is the easiest bandit model in this configuration, since the player observes both the score of the arm and the collision indicator at each time-step. 3

4 Model Algorithm Input Upper bound up to constant factor logt Centralized Multiplayer MP-TS [5] M µ >M M µ logt Decentralized, Full Sensing sic-mmab - µ >M M µ + KM logt KM Decentralized, No Sensing Concurrent [9] M 2 logt MK 2 log 2 T Decentralized, No Sensing Concurrent 2 [9] M, µ µ + KM min T logt, logt logt µ >M M µ KM logt Decentralized, No Sensing sic-mmab + + KM 3 logt log 2 logt µ M µm+ 2 logt M µ >M M µ Decentralized, No Sensing sic-mmab2 2 Decentralized, No Sensing, Dynamic dyn-mmab MK2 logt M K logt m= 2 m + M 2 K logt µ M Table : Comparison of bounds for different problems and algorithms. = µ M µ M+ here. The definition of is given in [9]. It is actually as soon as 0. m is defined in Lemma. Our algorithms and results are highlighted in red. Sensing I: Player j observes X π j tt and r j t, meaning she observes the collision indicator only if the reward of the arm is. The model represents the case where a secondary user i.e., a player can sense a channel to detect the presence of a primary user X i before deciding to transmit data on this channel. Sensing II: Player j observes η π j tt and r j t, meaning she observes the collision indicator first and then the reward of the arm if no collision. No sensing: Player j only observes r j t. This is the most difficult case, as there is no extra observation. Indeed, a low empirical performance of an arm can either be a consequence of a low mean µ or, alternatively, to a high frequency of collisions. Notice that, the better an arm, the more a player tends to play it. But if all players have the same behavior, then the better an arm is, the more collisions there will be. The collision factor could balance off the intrinsic performance of an arm. This is the main difficulty of multiplayer MAB, especially without sensing. Some algorithms can easily be adapted from a model to another, e.g. see [6] for adapting Musical Chairs from Sensing II to Sensing I. 4

5 2 Full Sensing: Achieving Centralized Performances by Communicating through Collisions In this section, we will consider the Full Sensing model and show that the decentralized problem is not much more complicated, in terms of regret growth, than the centralized one, when players are synchronized: we provide in the following an algorithm reaching up to a multiplicative constant the lower bound of the centralized Multiplayer MAB [3]. This algorithm strongly relies on the synchronization assumption, which actually allows players to communicate together through observed collisions - the communication protocol is detailed and explained later on. This result also implies that the two lower bounds provided in the literature [8, 8] are unfortunately not correct. Especially, the factor M that was supposed to be the cost of the decentralization in the regret should not appear, at least in the Full Sensing model, and for the Sensing II model, as our algorithm can be easily adapted to that case. Our algorithm sic-mmab consists in several phases:. First, an initialization phase lines -4 in Algorithm gives an estimate of the number of players M and attributes a ran to each player 2. Players then alternate exploration phases and communication phases lines 5-7 a During the exploration phase line 6, performance of arms are estimated in a successive accept and rejects fashion [0] yet without collisions. b The communication phase lines 7-3 allows players to share their statistics c Statistics are updated lines 4-6 after those two phases 3. The last phase, the exploitation line 8, is triggered as soon as an arm is detected as among the top-m arms and it consists in pulling this arm until the final horizon T. 2. Notations For the sae of clarity, we first introduce some notations. Players that are not yet in the exploitation phase are called active. We denote, with a slight abuse of notations, [M t ] the set of active players at stage t and by M t M its cardinality. Notice that M t is non increasing because players can only enter the exploitation phase but cannot leave it. Arms that must still be explored i.e., players can t tell whether they are among the top-m arms or not yet are called active. By construction of our algorithm, this set is shared by all the players at each time step. We denote, with the same abuse of notations, the set of active arms by [K t ] of cardinality K t K. Our algorithm is based on a protocol called sequential hopping []. It consists in incrementing the index of the arm a player plays, i.e., if she plays arm πt at time t, she will play πt+ = πt + mod K t+ at time t Description of the protocol As mentioned above, the sic-mmab algorithm consists in several phases. At the end of a communication phase, players update and share the same statistics on arms, thus the decentralized case becomes equivalent to the centralized one. If an arm has been pulled 2 n times, it only requires n + bits to communicate the relevant statistics: the total reward 5

6 Algorithm sic-mmab algorithm Input: T : Musical Chairs until t = T 0 2: Set the index of the arm where the player is fixed at T 0 3: Play until time t = T : Until t = T 0 + 2K, hop sequentially {now M and her internal ran} 5: for T expl 2, 2 2,..., 2 N until fixed do 6: Estimate the remaining arms for a time K t T expl by starting sampling the remaining arm corresponding to her ran among the active players and proceeding sequential hopping among the active arms 7: for j [M t ] do 8: for l [M t ] \ {j} do 9: for [K t ] do 0: Player j communicates in time log 2 T expl + to player l the number of rewards she observed on the arm : end for 2: end for 3: end for 4: All the players update their ranings and have the same statistics. 5: If p arms are detected as among the top-m, each active player with the internal ran j,..., p will exploit the j-th arm among these p arms 6: Update the active arms, the active players and the internal ran 7: end for 8: Continue exploiting during those 2 n pulls. And a bit of information can be sent by forcing or not a collision. Sections 2.2., and respectively describe the initialization, the exploration and the communication phases. After a large enough number of alternations between exploration and communication phases, sub-optimal arms are detected, and eliminated and players are fixed to an arm among the top-m and will exploit it until time T. Notice that our algorithm can be easily adapted to be fair amongst players they asymptotically have the same average reward. During the exploitation phase, players just have to sequentially hop between the top-m arms. Similarly, exploration can be made by mini-batches before hopping to reduce the number of switches [20] Initialization phase The objectives of the first phase is to estimate the number of players M and to distribute rans amongst them. First, players follow the Musical chairs algorithm see [22]: they sample uniformly among the K arms, during T 0 steps and start pulling the same arm as soon as they encounter no collision on it. A correct choice for T 0 is given below by Lemma. At time t = T 0, each player will have a unique number starting from 0 which is the index of the arm she is pulling at T 0. We call it the external ran of a player. The second procedure, explained in Figure, will determine M and their internal ran, which is the raning of a player amongst the others. As an example, if there are three players on the arms 5, 7 and 2 at t = T 0, their external rans respectively correspond to 5, 7 and 2 while their internal rans are, 2 and 0. The procedure is the following. If a player 6

7 t = T 0 t = T 0 + t = T ext. ran 0 int. ran ext. ran 2 int. ran ext. ran 2 int. ran ext. ran int. ran j ext. ran int. ran j ext. ran int. ran j ext. ran int. ran j ext. ran 0 int. ran ext. ran int. ran j ext. ran + 2 int. ran j + ext. ran + 2 int. ran j + ext. ran + 2 int. ran j + ext. ran int. ran j K 2 K 2 K 2 ext. ran K int. ran M K ext. ran K int. ran M K ext. ran K int. ran M K Figure : Estimating M and the internal ran in a time 2K. A column represents the set of arms. The left resp. right arrows represent the choice of a player fixed to an arm resp. sequentially hopping. The information written in red resp. blue is nown resp. unnown to the player at the considered time step. has the external ran, then she will pull until T and will then start sequential hopping. This ensures that players start sequential hopping in a delayed fashion and that the player with the external ran will collide with the player with the external ran at time T At stage T 0 +2K, player of external ran nows her internal ran and M. Indeed, they are respectively the total number of collisions she encountered during the time intervals [T 0, T 0 + 2] and [T 0, T 0 + 2K ]. After that point, all the players have perfect nowledge of M and also have an attributed internal ran in [M]. During the other phases, players will always now the set of active players [M t ] and, each player has a unique internal ran in [M t ]. This is what breas the initial symmetry between all players and allows the algorithm to establish clean communication protocols. 7

8 2.2.2 Exploration phase 2 logt s For the exploration phase, we note B s = the confidence bound after s pulls and ˆµ p the centralized empirical mean of the arm after the p-th exploration phase. The p-th exploration phase simply consists in sequential hopping among the set of active arms for a time K t 2 p so that each arm is pulled 2 p times. Each player starts by pulling the arm whose index among the active arms corresponds to her internal ran. The exploration phase is then collision-less, thans to the sequential hopping procedure. After a succession of two phases exploration and communication, an arm is said to be accepted if #{i [K t ] ˆµ p B T p ˆµ i p + B Tip} K t M t where T i p is the centralized, total number of times the arm i has been pulled after the p-th exploration round. This inequality actually says that the number of arms worse than is at least K t M t with high probability among the active arms. In the same way, is declared as rejected if #{i [K t ] ˆµ i p B Tip ˆµ p + B T p} M t meaning that there are at least M t active arms better than with high probability Communication phase In this phase, each active player will, one at a time, communicate her statistics on the active arms to all other active players. Recall that since an arm is pulled 2 p times by a single player during the p-th exploration phase, the statistics of an arm can be sent by a player in log 2 2 p + bits, i.e., collisions. An active player has two possible status during this phase:. either she is receiving some other players information. In that case, she is always pulling the arm whose index among the active arms corresponds to her internal ran 2. or she is communicating her statistics. This status starts just after the end of the communication for the previous player i.e., with the internal ran just below hers. Then for all active arms and all active players, one at a time, she communicates her statistics about the arm to the player j. This is done by sending in binary a collision corresponding to a the sum of rewards she observed on the arm to the player j during the previous exploration phase. Since all active players perfectly now [M t ], [K t ] and their internal rans, they now when they have to switch from the receiving to the communicating status. While receiving information, they also now which player is communicating about which arm Updating the active sets At the end of the communication phase, all active players share the same information on arms. They now which ones must be accepted and/or rejected. Rejected arms are removed right away from the set of active arms. 8

9 The procedure is a bit more tricy for the accepted arms. Those arms have an intrinsic raning given by their respective indices. If the arm with ran j is declared as accepted, then the active player with the same internal ran j will start pulling this arm until the final horizon. This is possible since the number of accepted arms cannot exceed the number of remaining active players. Afterwards, all accepted arms are removed from the set of active arms and players update [K t ], [M t ] and their internal ran. 2.3 Regret analysis This section is devoted to the analysis of our algorithm. The regret is decomposed as follows, R T = R init + R explo + R comm + R bad run 2 where the first term corresponds to the regret due to the initialization phase, the second one to the exploration phases and the third one to the communication phases. The last term corresponds to the regret caused by bad runs, i.e., incorrect initialization no orthogonal setting at T 0 or exploration/estimation of arms. In case of a good run, the exploitation step is loss-less Initialization regret Recall that the initialization phase consists in T 0 steps of Musical Chair and in a procedure of length 2K. Lemma. If T 0 K logt, then after time T 0, all players will play different arms with probability at least M T. As a consequence, R init = OKM logt The proof is simple and delayed to Section B Exploration regret For the exploration phase, one must just control the number of times an arm must be pulled before being declared as accepted or rejected, with high probability. Proposition. All optimal arms are declared accepted before being pulled t = O µ µ M+ 2 logt times in total, and every sub-optimal arm is declared as rejected after being pulled at most t = O µ M µ 2 logt with probability greater than O K logtm T The proof is delayed to Section B.2, in the appendix. In the following, we eep the notations t = O µ µ M+ logt for an optimal arm and t 2 = O µ M µ logt 2 for a suboptimal one.. 9

10 For the exploration phases, the decomposition used in the centralized case [3] holds because there is no collision at all during the exploration phases: R explo = µ M µ E[T explo T ] + µ µ M T explo E[T explo T ] 3 >M M Where T explo = T T init T comm and T explo T is the number of time the -th best arm has been pulled for exploration. Lemma 2. For a sub-optimal arm : E[T explo T ] = O logt µ M µ 2 Moreover, in case of a good run, the following holds for the optimal arms: µ µ M T explo E[T explo ] = O logt, µ M µ j M and thus it trivially holds that R explo = j>m j>m logt O µ M µ j The first part is a direct consequence of the fact that E[T explo T ] t given by Proposition. The proof of the second part is also delayed to the appendix, Section B Communication cost We now focus on the R comm term in Equation 2. Lemma 3. For the communication phases, the regret is bounded as follows: R comm = O KM 3 log 2 logt µ M µ M+ 2 Proof. As explained in Section 2.2.3, each communication phase is of length at most KM 2 p+ where p = 0,..., N. Also, N log 2 t M + in case of a good run, since t M = max t. So for the cost of communication: log 2 t M + R comm KM 3 n n=0 O KM 3 log 2 2t M This term will be significant for low values of T, but asymptotically, it is in ologt. 0

11 2.3.4 Bad run regret The term R bad run is due to the low probability events of bad initialization or bad estimation during the exploration. Lemma and Proposition directly provide the following result: Lemma 4. R bad run = O M 2 logt + MK log µ M µ M+ 2 Proof. We got from Lemma and Proposition that: P [ bad run ] M O T + K log 2t M T So we have: R bad run MT P [ bad run ] = O M 2 + MK log 2 t M Total regret Theorem finally provides an asymptotical upper bound of the regret: Theorem. For any given set of parameters K, M and µ: lim T R T logt c >M where c and c 2 are two problem independent constants µ M µ + c 2 KM Proof. This is a direct consequence of Lemmas, 2, 3 and 4 and the regret decomposition given by Equation Experiments We now compare the empirical performances of sic-mmab with the mctopm algorithm [8], on generated data. We also compared with the musicalchairs algorithm [22], but its regret is so large that it squeezes the pictures and displaying it maes it really hard to distinguish between sic-mmab and mctopm. All the considered values for regret are averaged over 200 runs. Figure 2 represents the evolution of the regret for both algorithms with the following problem parameters: K = 9, M = 6, T = The means of the arms are linearly distributed between 0.9 and 0.89, so the gap between two consecutive arms is The alternation between exploration phases and communication phases is easily observable here. Figure 3 represents the evolution of the final regret according to the gap between two consecutive arms in a logarithmic scale. The problem parameters K, M and T are the same. Although mctopm seems to provide better results with low values of, sic-mmab seems to have a smaller dependency in /. This confirms the theoretical results claiming that mctopm scales with 2 while sic-mmab scales with. This can be observed on the left part of Figure 3 where the slopes is approximately twice as big for mctopm than for

12 sic-mmab. Also, a different behavior of the regret appears for very low values of which is certainly due to the fact that the regret linearly scales with for extremely small values of. In practice, mctopm might outperform sic-mmab, however simulations tend to confirm the theoretical upper bounds asymptotically. Figure 2: Evolution of regret over time Figure 3: Evolution of final regret according to logarithmic scale 2.5 In contradiction with existing lower bounds? Theorem directly contradicts the two lower bounds provided in the literature [8, 8], however sic-mmab respects the conditions required for both lower bounds. It was thought that the cost of the decentralization was a multiplicative factor M compared to the centralized lower bound [3]. However, with our approach, it appears that the regret of the decentralized case does not seem so much different from the one in the centralized case, at least if there is synchronization between the players. Indeed, sic-mmab actually taes advantage of this synchronization to establish communication protocols where players are able to communicate through collisions. The incorrect statement in the paper [8] seems to be found in Lemma 2. It claims that the probability distribution of collisions does not depend on the distributions of the arms while it does, especially for sic-mmab. In the second paper [8], the mistae would be in Lemma 3, since the proof does not tae into account the fact that the actions of a player also depend on the collisions she observed. Our algorithm reaches, up to a constant factor, the lower bound from the decentralized algorithm [3], up to an additional term in OKM logt due to the initialization. If a pre-agreement is allowed between the players, this phase would not be needed, and this additional term in the regret would disappear. As a consequence, Algorithm shows that the decentralized multiplayer bandits in case of time synchronization between the players is almost equivalent to the centralized multiplayer problem which is a priori a much simpler problem. Since the synchronization is not a reasonable assumption in the practical case and that it leads to undesirable lower bounds, the assumption of synchronization should be removed when considering the multiplayer MAB problem. This setting is called the dynamic setting. However, this problem is quite complex to model formally, since we must either put some restrictions on the way the players enter or leave the game, or find an appropriate formulation of the regret. Indeed, without them and as already noticed [22], the players could enter and 2

13 leave so quicly that they would learn nothing about the arm distributions. With the natural formulation of regret in the dynamic setting, the regret would then be linear. The difficulty to have a formal interesting model under this setting explains why most of the literature currently focuses on the synchronized case. 3 Without Full Sensing. Communication through Synchronization This section focuses on the No Sensing model, with synchronization the dynamic setting is considered in the following Section. First of all, we claim that a communication protocol similar to the one of sic-mmab can be devised here. Indeed, in the Full Sensing model, a bit is sent through a single collision. Without sensing, it can be done with arbitrary high probability using logt time-steps where denotes a nown lower bound of the average rewards of all arms i.e., assuming Hypothesis stated below. This only adds a multiplicative factor of to the initialization phase and logt to the communication one. So, sic-mmab can be easily adapted for the No Sensing model into the sic-mmab with a regret scaling as: O >M logt KM logt + + KM 3 logt log 2 µ M µ logt µ M µm+ 2 Although this regret is not asymptotically logarithmic because of the logt log 2 logt term, for specific problem parameters and for a fixed horizon T, this upper bound would be lower than the natural bound assuming each player actually has to individually explore all arms: M logt. µ M µ >M This remar illustrates how complex an anytime lower bound for the No Sensing setting can be. In this section, we introduce a first algorithm for the synchronized case. It also relies on a communication protocol, but with more limited information. In the following section, we will introduce another algorithm for the dynamic case; it will not be based on communication between players. As a consequence its regret might be larger, but it is adapted to more realistic scenarios. These are among the first algorithms for the No Sensing model with strong theoretical results and logarithmic regrets. Simple algorithms as ucb provide good results in practice but appear to actually provide a linear regret with a constant probability [8]. In Appendix A, the discussion about ucb is extended and the reasons of its failure are explained for the No Sensing model, using algebraic arguments. 3. Adapted Communication Protocol The algorithm sic-mmab2 is described below. The involved subroutines will be detailed in the next sections. Similarly to sic-mmab, it starts with an initialization phase to estimate M. It then alternates between exploration and communication phases in the same fashion. But the goal of the communication phases is to communicate to the other players which arms are optimal or sub-optimal and not to transmit statistics. This allows to eep updated the sets of active arms and active players. How to declare such arms and how to detect 3

14 declarations from other players is detailed in section The algorithm then ends with an exploitation phase as sic-mmab. The following assumption is required for sic-mmab2: Hypothesis. A lower bound of the arm means is nown to all players: min µ i i Some terms of the regret of sic-mmab2 scale with logt. In practice, the lowest average reward is at least one or two orders of magnitude larger than the gaps between the different arms. Thus, the significant terms in the regret are actually those in logt i. Also, it seems reasonable to have the nowledge of such a lower bound; indeed, just represents the order of magnitude of the average rewards of the arms, often nown in practice. This assumption is very close to an assumption made for other no sensing algorithms [9]. The notations for M t and K t are the same as in sic-mmab. Figures 4 and 5 illustrate the possible cases of an alternation between declaration and exploration phases. They will be further explained in Section Initialization phase The objective of the initialization phase is to estimate M here and is explained in Algorithm 3. The idea is the same as for sic-mmab, but as already mentioned, instead of a single time step, a number of logt time steps is needed to transmit a bit with high probability. Then, M is estimated as the number of slots of duration logt where only 0 rewards are observed. If we tae a loo bac at sic-mmab, this procedure is the same as its initialization with the following differences:. the fixation phase is longer adding a factor 2. when colliding with other players to communicate her presence, instead of doing it a single time-step, it is done logt times 3. a collision is detected when a player observes only 0 for logt consecutive time-steps. Notice that the internal ran can also be estimated but won t be used here, because it can t be updated during the exploration/communication phases Exploration phases Each exploration round is split into two parts: a first one where each player fixes to an arm, a second one where all players are sequentially hopping to explore the remaining arms without any collision. At the end of a fixation phase, all players are on different arms they are in an orthogonal position and thus sequential hopping can start. The decisions for accepting/rejecting arms, are still based on the exploratory pulls as in sic-mmab. The differences in exploration are the following:. Statistics are not shared between the players; this induces a multiplicative factor M in the regret 4

15 Algorithm 2 sic-mmab2 algorithm Input: T, : Estimate M until time T M {see section 3..} 2: T logt 3: for T expl in T 0, 2T 0,..., 2 N T 0 until fixed do 4: if n = 0 or there was a declaration during the last phase then Kt logt 5: Fix to an active arm using Musical Chairs during a time 6: Estimate the active arms during a time K t T expl doing sequential hopping 7: else 8: Add the statistics of the last declaration phase in the exploratory statistics {i.e., the last declaration phase is used as part of this exploration phase} 9: Continue sequential hopping according to the end of the last phase for a time K t T expl T d 0: end if : T d K t T 0 2: for in the arms detected as sub-optimal do 3: Declare as a sub-optimal arm if not declared by another player yet {see section 3..3} 4: end for 5: if at least an arm is detected as optimal then 6: Try to occupy any arm detected as optimal time T d per fixation phase 7: end if 8: Continue sequential hopping until there is a phase of length T d without new declaration 9: Any arm with reward 0 during this last phase corresponds to an occupied arm update [M t ] 20: update [K t ] 2: end for 22: if Fixed then 23: Exploit by playing the arm where fixed 24: end if Algorithm 3 Estimation of M Input: T, K Musical Chairs until time logt Play until time t = K logt + 2 logt Until t = K logt + 2K logt, hop sequentially with batches of length logt between arm switches 2. A fixation phase is added at the beginning of a new exploratory round, if there was at least one declaration in the previous declaration phase. This fixation is needed for players to reach an orthogonal position before the sequential hopping. An orthogonal position can t be reached instantly because the internal rans can t be updated in sic-mmab2. Figures 4 and 5 illustrate when such a fixation phase is added 5

16 All players start declaring Player j ended declaring Waits for other players to end All players ended declaring a slot ago Fixation phase Sequential hopping time time K t T expl T f Declare an arm as suboptimal time T d Declare an arm as suboptimal time T d Try to occupy optimal arms time T d Detect other players declarations time T d Detect other players declarations time T d No new declaration observed time T d Fixation phase time T f Sequential hopping time 2K t T expl Figure 4: Alternation between fixation, exploration and declaration blocs. Case where the players declare sub-optimal and try to occupy in vain optimal arms No player declared anything Fixation phase time T f Sequential hopping time K t T expl No new declaration observed time T d Sequential hopping time 2K t T expl T d The declaration phase is included in the exploration one and no fixation phase is required Figure 5: Alternation between fixation, exploration and declaration blocs. Case with no declaration. In that case, the single declaration phase, which just consists in sequential hopping, is included in the next exploration phase lines 7-0 in Algorithm 2. No fixation phase is needed in that case Communication phase All players are in the communication phase at the same time. However, this phase is decomposed into blocs of same length T d to eep synchronization. A bloc can be of three different types, and the type of a bloc does not need to be the same for all players, as illustrated by Figure 4. Types are the folllowing Declaration bloc for player j: she communicates to other players that an arm is suboptimal Fixation bloc for player j: she tries to occupy any arm that she detected as optimal. If she succeeds, she exploits that arm until the end of the algorithm Reception bloc for player j: she just sequential hops, in order to detect other players declarations Player j starts the communication phase with declaration blocs, one per arm detected as sub-optimal. She then proceeds to a fixation bloc, had she detected any arm as optimal during the last exploitation phase. She then proceeds to reception blocs until no new declaration is detected. As soon as no new declaration is detected, she starts a new exploration phase. Notice that for the three types of blocs, players eep receiving declarations from other players. Declaration bloc: In a declaration bloc, players follow Algorithm 4. The idea is to often sample the sub-optimal arm to send a signal to the other players. The other players 6

17 will detect this signal by observing a significant loss in the empirical reward of this arm. However, a player sending a signal should also be able to detect signals on other arms sent by other players. That is the reason why to declare an arm as sub-optimal, a player randomly chooses between pulling this arm and sequential hopping. Algorithm 4 Declare an arm as sub-optimal to the other players Input: T,, [K t ] During a time { T d : {T d computed according to Lemma 7} with probability Pull an arm 2 according to the sequential hopping on [K t ], with probability 2 Lemma 7 gives a good choice for T d such that the declaration is detected by every player, without detecting any false positive declaration, no matter the bloc they are currently proceeding, with high probability. Indeed, let ˆµ i and ˆr i be respectively the empirical reward during the exploration phases and during the last communication bloc for arm i and player j. Arm i is considered as perceived as sub-optimal by another player if: ˆµ i ˆr i 4 ˆµ i 4 Lemma 7 states that the players will only detect arms declared as sub-optimal or that are occupied by a player. However, using the last reception bloc where there is no new declaration, it is easy to distinguish optimal arms occupied by a player from the sub-optimal arms. Indeed, for the former, only 0 are observed during this last phase, while on the latter, at least one positive reward will be observed during the phase of length T d with probability at least T, since there is no player pulling it except for the sequential hopping. Fixation bloc: In a fixation bloc, player j continues sequential hopping on the active arms and decides to occupy an arm as soon as an arm detected as among the M optimal ones returns a reward no collision at that stage. In that case, she exploits this arm until the final horizon T. At the end of a bloc, if she did not manage to fix to any of these arms, then all of them are certainly occupied by other players with high probability. Declarations made by other players are taing care of, following the rule of Equation 4. Reception bloc: In a reception bloc, player j simply sequentially hops and detects the declarations of other players, following the rule of Equation 4. Notice that every active player will at least proceed to one reception bloc per communication phase if she does not manage to occupy an optimal arm. The last reception bloc is considered as the bloc where no new declaration is detected. This bloc is thus the same for all active players. Moreover, the only arms giving 0 rewards during this last reception bloc correspond to optimal arms occupied by other players. This allows to distinguish between arms occupied by another player, i.e., the optimal ones, or just declared by some player, i.e., the sub-optimal ones. As a consequence, active players can update the set of active arms [K t ] and the number of active players M t at the end of each communication phase. 7

18 3.2 Regret Analysis The same decomposition of the regret will be used for sic-mmab2: 3.2. Initialization Regret R T = R init + R explo + R comm + R bad run 5 Lemma 5 claims that the initialization is successful with high probability, meaning all players perfectly now M after this phase. Lemma 5. With probability at least O KM T, at the end of this procedure, every player has a correct estimation of M. In particular, this yields that MK R init = O logt The proof of this lemma is delayed to Section C. in the appendix Exploration regret The proof of Lemma 5 also claims that with probability at least O M t T, all the players are in an orthogonal position at the end of a single fixation phase. In case of good runs, the exploration will thus be collision-less. Using the same ind of arguments as for Lemma 2, we provide an upper bound for the exploration regret of sic-mmab2. Lemma 6. E[R explo ] O >M M logt + MK2 logt µ M µ The proof is also delayed to the appendix, in Section C Communication regret The following Lemma 7 provide the properties, guarantees and cost of the algorithm during the communication phases. Lemma 7. Let the length of a bloc be such that T d, then. With probability at least O M T, a player j declaring an arm i as sub-optimal will be successfully detected by all the other players; 2. With probability at least O MK T, no player will detect a false declaration during the declaration bloc i.e., no arm is detected as sub-optimal if there was no declaration or if it is not occupied by another player; 3. With probability at least O M T, if player j starts occupying arm, it is detected as a declaration by active players following the rule of Equation 4. The communication regret is bounded as MK R comm 2 = O logt 8 200Kt logt

19 Proof. The proof of the first three points is delayed to Section C.3 in the appendix. They imply that, when in a fixation bloc and with probability at least O M T, player j either succeeds to occupy an arm she detected as optimal, or all the arms she detected as optimal are occupied by other players at the end of the bloc. Also, if she manages to occupy an arm, this arm will be detected as declared by other players. This yields that active players remain synchronized and share the nowledge of M t and [K t ] during the whole process. Thans to the construction of the algorithm, there can not be two different blocs used to declare or occupy the same arm, hence the number of declaration phases will be at most K we emphasize here that the declaration phases are not included in an exploration phase. Since each bloc is of length T d, the communication regret is then bounded by KMT d, which gives the result. We mention here that, we described the declaration with probabilities /2. It can be seen from the proof in Section C.3 that T d is actually optimized when pulling the arm to declare with probability 2/ Bad run regret A bad run might occur for three different reasons: a fixation phase fails, 2 there is a wrong estimation of an arm by at least one player, 3 a declaration bloc of some player failed. The previous lemmas directly lead to the following results. Lemma 8. R badrun = O R explo Proof. We consider the three reasons of bad run independently.. Thans to Lemma 5, the probability of failure of a single fixation phase including the blocs in communication phase is in O M T. The number of fixation phases is smaller than OK, so the total loss for the first reason is smaller than OKM For the case of bad exploration, as explained in the proof of Lemma 6, the cost of the low probability event of a bad estimation of an arm is lower than the cost of exploration in case of a good run. We thus simply claim that the cost is dominated by E[R explo ] and actually even by a tighter term. 3. For the case of declarations, as implied by Lemma 7, the probability of failure of a single declaration or reception bloc is smaller than O KM T. There are at most K + N such blocs, where N = O log is the number of exploration phases logt 2 M see the proof of Lemma 6. So the regret due to this case is also dominated by the exploration regret. Summing these three terms yields the result Total Regret Using all these previous intermediate results, we now provide an upper bound for the regret incurred by sic-mmab2: 9

20 Theorem 2. sic-mmab2 has a regret scaling as: M E[R T ] = O logt + MK2 logt µ M µ >M Proof. This is a direct consequence of Lemmas 5, 6, 7, 8 and Equation 2. 4 Without Synchronization, the Dynamic Setting In this section, we no longer want to assume that players can communicate using extreme synchronisation. In the previous sections, it was crucial that all exploration/communication phases start and end at the exact or at least approximately exact same time. This assumption is clearly non-realistic and should be removed, as radios do not start and end transmitting simultaneously. So we are going to assume that players enter the game at different times, without nowing how many players are already present and without informing them of their arrival. As the model needs strong additional assumptions on how players can enter and leave the game [22] to avoid trivial linear regret bound, we are going to assume for simplicity that players do not leave the game once they have entered it. More precisely, player j starts pulling arm at a time T j start until the horizon T, and we denote by Mt the numbers of present players at stage t. We mention that the following results can also be adapted to the case where players can leave the game during specific intervals or where they all share an internal synchronized cloc [22], which is actually a different assumption than synchronization. 4. A Communication-less Protocol Algorithms sic-mmab and sic-mmab2 heavily rely on the fact that phases in the communication protocols start and end at the exact same moments for all active players, which is only possible if they are synchronized. In the dynamic setting, this synchronization is not possible and our new algorithm dyn-mmab does not rely on the same trics as the previous algorithms. The protocol adapted to the dynamic setting is pretty simple: a player will only follow two different sampling strategies: either she samples uniformly at random among all the arms, or she exploits an arm found optimal and pulls it until the final horizon T. In the first case, the exploration of the other players is not too disturbed as it only changes the mean reward of all the arms by a more or less constant factor. In the second case, the exploited arm will appear as sub-optimal to the other players, and this is not hurtful as this arm is now occupied. In the first phase, players uniformly explore among all the arms, without nowing M recall it may evolve dynamically during the process. As soon as a player distinguishes an arm as optimal i.e. the best available arm, she tries to occupy it as follows. She actually continues to sample uniformly and will occupy this arm as soon as she encounters a positive reward. The player either succeeds to occupy an arm within a given number of stages, The naive choice to sample only this arm until observing a and to decide it is already occupied if too many 0 are observed does not deal with cases where several players try to occupy the same arm at the same time. 20

21 and then exploits it until the final horizon, or she fails. In the latter, she mars this arm as occupied and continues to explore to find the next available optimal arm. Here, an arm is said active if it was neither detected as occupied nor optimal at some point. Using the notations described below, this means that this arm is not in Preferences Occupied. Here are the different notations for dyn-mmab : Preferences is the player s raning of arms. The player tries to occupy the p-th arm in Preferences until she succeeds or mars it as occupied p represents the index in Preferences a player is currently trying to occupy T and ˆr T temp respectively are the number of pulls and the empirical mean reward on arm and ˆr temp on arm since the last reset. A reset occurs on arm every T f respectively represent the number of pulls and the empirical mean reward pulls. T f is constantly updated and represents the number of successive 0 to observe on arm to be sure that is actually occupied. T f is updated at each stage with the following rule: T f t + min e logt ˆr t + B T t+t +, T f t 6 with the convention that T f 0 = + and x/0 = + for all non negative x. Occupied is the set of arms mared as occupied by a player B s t corresponds to the confidence bound. Its exact value is given in Lemma 9. The pseudo-code is detailed in Algorithm Exploration rules Two main actions can happen during the exploration. First, an arm can be considered as occupied and is then added to Occupied. Also, it can be considered as optimal and is then added to Preferences. An arm is added to Occupied can already be in Preferences if only 0 have been observed during a whole dynamic sliding window whose length might vary with time. A dynamic sliding window ends when at least T f new observations have been gathered on arm. A new window is then started. The crucial quantity of interest is therefore the length T f. It is an estimation of the number of successive observed 0 required to consider an arm as occupied with high probability. This value is thus constantly updated using Equation 6, which uses the current estimation of a lower bound for µ. This rule is described at lines 5-20 in Algorithm 5. An active arm is added to Preferences if it is better, with high probability, than any active arm: if l, l an active arm, ˆr B T t > ˆr l + B Tl t then is added to Preferences The value for B s t is not as easy as for the previous algorithms and is given by Equation 8 below. This rule is described at lines 2-23 in Algorithm 5. 2

Online Learning and Sequential Decision Making

Online Learning and Sequential Decision Making Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Sequential Decision

More information

Bandit models: a tutorial

Bandit models: a tutorial Gdt COS, December 3rd, 2015 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions) Bandit game: a each round t, an agent chooses

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his

More information

The multi armed-bandit problem

The multi armed-bandit problem The multi armed-bandit problem (with covariates if we have time) Vianney Perchet & Philippe Rigollet LPMA Université Paris Diderot ORFE Princeton University Algorithms and Dynamics for Games and Optimization

More information

An Estimation Based Allocation Rule with Super-linear Regret and Finite Lock-on Time for Time-dependent Multi-armed Bandit Processes

An Estimation Based Allocation Rule with Super-linear Regret and Finite Lock-on Time for Time-dependent Multi-armed Bandit Processes An Estimation Based Allocation Rule with Super-linear Regret and Finite Lock-on Time for Time-dependent Multi-armed Bandit Processes Prokopis C. Prokopiou, Peter E. Caines, and Aditya Mahajan McGill University

More information

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms Stochastic K-Arm Bandit Problem Formulation Consider K arms (actions) each correspond to an unknown distribution {ν k } K k=1 with values bounded in [0, 1]. At each time t, the agent pulls an arm I t {1,...,

More information

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for Big data 1 / 22

More information

The Multi-Arm Bandit Framework

The Multi-Arm Bandit Framework The Multi-Arm Bandit Framework A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course In This Lecture A. LAZARIC Reinforcement Learning Algorithms Oct 29th, 2013-2/94

More information

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon. Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,

More information

Thompson Sampling for the non-stationary Corrupt Multi-Armed Bandit

Thompson Sampling for the non-stationary Corrupt Multi-Armed Bandit European Worshop on Reinforcement Learning 14 (2018 October 2018, Lille, France. Thompson Sampling for the non-stationary Corrupt Multi-Armed Bandit Réda Alami Orange Labs 2 Avenue Pierre Marzin 22300,

More information

Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models

Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models c Qing Zhao, UC Davis. Talk at Xidian Univ., September, 2011. 1 Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models Qing Zhao Department of Electrical and Computer Engineering University

More information

Lecture 6 Random walks - advanced methods

Lecture 6 Random walks - advanced methods Lecture 6: Random wals - advanced methods 1 of 11 Course: M362K Intro to Stochastic Processes Term: Fall 2014 Instructor: Gordan Zitovic Lecture 6 Random wals - advanced methods STOPPING TIMES Our last

More information

On the Optimality of Myopic Sensing. in Multi-channel Opportunistic Access: the Case of Sensing Multiple Channels

On the Optimality of Myopic Sensing. in Multi-channel Opportunistic Access: the Case of Sensing Multiple Channels On the Optimality of Myopic Sensing 1 in Multi-channel Opportunistic Access: the Case of Sensing Multiple Channels Kehao Wang, Lin Chen arxiv:1103.1784v1 [cs.it] 9 Mar 2011 Abstract Recent works ([1],

More information

Multi-armed bandit models: a tutorial

Multi-armed bandit models: a tutorial Multi-armed bandit models: a tutorial CERMICS seminar, March 30th, 2016 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions)

More information

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I Sébastien Bubeck Theory Group i.i.d. multi-armed bandit, Robbins [1952] i.i.d. multi-armed bandit, Robbins [1952] Known

More information

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 22 Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 How to balance exploration and exploitation in reinforcement

More information

Introducing strategic measure actions in multi-armed bandits

Introducing strategic measure actions in multi-armed bandits 213 IEEE 24th International Symposium on Personal, Indoor and Mobile Radio Communications: Workshop on Cognitive Radio Medium Access Control and Network Solutions Introducing strategic measure actions

More information

Revisiting the Exploration-Exploitation Tradeoff in Bandit Models

Revisiting the Exploration-Exploitation Tradeoff in Bandit Models Revisiting the Exploration-Exploitation Tradeoff in Bandit Models joint work with Aurélien Garivier (IMT, Toulouse) and Tor Lattimore (University of Alberta) Workshop on Optimization and Decision-Making

More information

Stratégies bayésiennes et fréquentistes dans un modèle de bandit

Stratégies bayésiennes et fréquentistes dans un modèle de bandit Stratégies bayésiennes et fréquentistes dans un modèle de bandit thèse effectuée à Telecom ParisTech, co-dirigée par Olivier Cappé, Aurélien Garivier et Rémi Munos Journées MAS, Grenoble, 30 août 2016

More information

Multiple Identifications in Multi-Armed Bandits

Multiple Identifications in Multi-Armed Bandits Multiple Identifications in Multi-Armed Bandits arxiv:05.38v [cs.lg] 4 May 0 Sébastien Bubeck Department of Operations Research and Financial Engineering, Princeton University sbubeck@princeton.edu Tengyao

More information

Stochastic bandits: Explore-First and UCB

Stochastic bandits: Explore-First and UCB CSE599s, Spring 2014, Online Learning Lecture 15-2/19/2014 Stochastic bandits: Explore-First and UCB Lecturer: Brendan McMahan or Ofer Dekel Scribe: Javad Hosseini In this lecture, we like to answer this

More information

Analysis of Thompson Sampling for the multi-armed bandit problem

Analysis of Thompson Sampling for the multi-armed bandit problem Analysis of Thompson Sampling for the multi-armed bandit problem Shipra Agrawal Microsoft Research India shipra@microsoft.com avin Goyal Microsoft Research India navingo@microsoft.com Abstract We show

More information

Stochastic Contextual Bandits with Known. Reward Functions

Stochastic Contextual Bandits with Known. Reward Functions Stochastic Contextual Bandits with nown 1 Reward Functions Pranav Sakulkar and Bhaskar rishnamachari Ming Hsieh Department of Electrical Engineering Viterbi School of Engineering University of Southern

More information

Online Learning Schemes for Power Allocation in Energy Harvesting Communications

Online Learning Schemes for Power Allocation in Energy Harvesting Communications Online Learning Schemes for Power Allocation in Energy Harvesting Communications Pranav Sakulkar and Bhaskar Krishnamachari Ming Hsieh Department of Electrical Engineering Viterbi School of Engineering

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden 1 Selecting Efficient Correlated Equilibria Through Distributed Learning Jason R. Marden Abstract A learning rule is completely uncoupled if each player s behavior is conditioned only on his own realized

More information

Bandit Algorithms. Zhifeng Wang ... Department of Statistics Florida State University

Bandit Algorithms. Zhifeng Wang ... Department of Statistics Florida State University Bandit Algorithms Zhifeng Wang Department of Statistics Florida State University Outline Multi-Armed Bandits (MAB) Exploration-First Epsilon-Greedy Softmax UCB Thompson Sampling Adversarial Bandits Exp3

More information

Lecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem

Lecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem Lecture 9: UCB Algorithm and Adversarial Bandit Problem EECS598: Prediction and Learning: It s Only a Game Fall 03 Lecture 9: UCB Algorithm and Adversarial Bandit Problem Prof. Jacob Abernethy Scribe:

More information

Hybrid Machine Learning Algorithms

Hybrid Machine Learning Algorithms Hybrid Machine Learning Algorithms Umar Syed Princeton University Includes joint work with: Rob Schapire (Princeton) Nina Mishra, Alex Slivkins (Microsoft) Common Approaches to Machine Learning!! Supervised

More information

Satisfaction Equilibrium: Achieving Cooperation in Incomplete Information Games

Satisfaction Equilibrium: Achieving Cooperation in Incomplete Information Games Satisfaction Equilibrium: Achieving Cooperation in Incomplete Information Games Stéphane Ross and Brahim Chaib-draa Department of Computer Science and Software Engineering Laval University, Québec (Qc),

More information

Multi-Armed Bandits. Credit: David Silver. Google DeepMind. Presenter: Tianlu Wang

Multi-Armed Bandits. Credit: David Silver. Google DeepMind. Presenter: Tianlu Wang Multi-Armed Bandits Credit: David Silver Google DeepMind Presenter: Tianlu Wang Credit: David Silver (DeepMind) Multi-Armed Bandits Presenter: Tianlu Wang 1 / 27 Outline 1 Introduction Exploration vs.

More information

Learning Algorithms for Minimizing Queue Length Regret

Learning Algorithms for Minimizing Queue Length Regret Learning Algorithms for Minimizing Queue Length Regret Thomas Stahlbuhk Massachusetts Institute of Technology Cambridge, MA Brooke Shrader MIT Lincoln Laboratory Lexington, MA Eytan Modiano Massachusetts

More information

The information complexity of sequential resource allocation

The information complexity of sequential resource allocation The information complexity of sequential resource allocation Emilie Kaufmann, joint work with Olivier Cappé, Aurélien Garivier and Shivaram Kalyanakrishan SMILE Seminar, ENS, June 8th, 205 Sequential allocation

More information

Channel Probing in Communication Systems: Myopic Policies Are Not Always Optimal

Channel Probing in Communication Systems: Myopic Policies Are Not Always Optimal Channel Probing in Communication Systems: Myopic Policies Are Not Always Optimal Matthew Johnston, Eytan Modiano Laboratory for Information and Decision Systems Massachusetts Institute of Technology Cambridge,

More information

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm CS61: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm Tim Roughgarden February 9, 016 1 Online Algorithms This lecture begins the third module of the

More information

The Multi-Armed Bandit Problem

The Multi-Armed Bandit Problem The Multi-Armed Bandit Problem Electrical and Computer Engineering December 7, 2013 Outline 1 2 Mathematical 3 Algorithm Upper Confidence Bound Algorithm A/B Testing Exploration vs. Exploitation Scientist

More information

Spectrum Sensing via Event-triggered Sampling

Spectrum Sensing via Event-triggered Sampling Spectrum Sensing via Event-triggered Sampling Yasin Yilmaz Electrical Engineering Department Columbia University New Yor, NY 0027 Email: yasin@ee.columbia.edu George Moustaides Dept. of Electrical & Computer

More information

Performance of Round Robin Policies for Dynamic Multichannel Access

Performance of Round Robin Policies for Dynamic Multichannel Access Performance of Round Robin Policies for Dynamic Multichannel Access Changmian Wang, Bhaskar Krishnamachari, Qing Zhao and Geir E. Øien Norwegian University of Science and Technology, Norway, {changmia,

More information

Online Learning with Randomized Feedback Graphs for Optimal PUE Attacks in Cognitive Radio Networks

Online Learning with Randomized Feedback Graphs for Optimal PUE Attacks in Cognitive Radio Networks 1 Online Learning with Randomized Feedback Graphs for Optimal PUE Attacks in Cognitive Radio Networks Monireh Dabaghchian, Amir Alipour-Fanid, Kai Zeng, Qingsi Wang, Peter Auer arxiv:1709.10128v3 [cs.ni]

More information

Bandits : optimality in exponential families

Bandits : optimality in exponential families Bandits : optimality in exponential families Odalric-Ambrym Maillard IHES, January 2016 Odalric-Ambrym Maillard Bandits 1 / 40 Introduction 1 Stochastic multi-armed bandits 2 Boundary crossing probabilities

More information

Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12

Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Lecture 4: Multiarmed Bandit in the Adversarial Model Lecturer: Yishay Mansour Scribe: Shai Vardi 4.1 Lecture Overview

More information

Learning to play K-armed bandit problems

Learning to play K-armed bandit problems Learning to play K-armed bandit problems Francis Maes 1, Louis Wehenkel 1 and Damien Ernst 1 1 University of Liège Dept. of Electrical Engineering and Computer Science Institut Montefiore, B28, B-4000,

More information

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 41 Pulse Code Modulation (PCM) So, if you remember we have been talking

More information

THE EGG GAME DR. WILLIAM GASARCH AND STUART FLETCHER

THE EGG GAME DR. WILLIAM GASARCH AND STUART FLETCHER THE EGG GAME DR. WILLIAM GASARCH AND STUART FLETCHER Abstract. We present a game and proofs for an optimal solution. 1. The Game You are presented with a multistory building and some number of superstrong

More information

RL 3: Reinforcement Learning

RL 3: Reinforcement Learning RL 3: Reinforcement Learning Q-Learning Michael Herrmann University of Edinburgh, School of Informatics 20/01/2015 Last time: Multi-Armed Bandits (10 Points to remember) MAB applications do exist (e.g.

More information

Analysis of Thompson Sampling for the multi-armed bandit problem

Analysis of Thompson Sampling for the multi-armed bandit problem Analysis of Thompson Sampling for the multi-armed bandit problem Shipra Agrawal Microsoft Research India shipra@microsoft.com Navin Goyal Microsoft Research India navingo@microsoft.com Abstract The multi-armed

More information

On Bayesian bandit algorithms

On Bayesian bandit algorithms On Bayesian bandit algorithms Emilie Kaufmann joint work with Olivier Cappé, Aurélien Garivier, Nathaniel Korda and Rémi Munos July 1st, 2012 Emilie Kaufmann (Telecom ParisTech) On Bayesian bandit algorithms

More information

Algebraic Gossip: A Network Coding Approach to Optimal Multiple Rumor Mongering

Algebraic Gossip: A Network Coding Approach to Optimal Multiple Rumor Mongering Algebraic Gossip: A Networ Coding Approach to Optimal Multiple Rumor Mongering Supratim Deb, Muriel Médard, and Clifford Choute Laboratory for Information & Decison Systems, Massachusetts Institute of

More information

Two optimization problems in a stochastic bandit model

Two optimization problems in a stochastic bandit model Two optimization problems in a stochastic bandit model Emilie Kaufmann joint work with Olivier Cappé, Aurélien Garivier and Shivaram Kalyanakrishnan Journées MAS 204, Toulouse Outline From stochastic optimization

More information

Evaluation of multi armed bandit algorithms and empirical algorithm

Evaluation of multi armed bandit algorithms and empirical algorithm Acta Technica 62, No. 2B/2017, 639 656 c 2017 Institute of Thermomechanics CAS, v.v.i. Evaluation of multi armed bandit algorithms and empirical algorithm Zhang Hong 2,3, Cao Xiushan 1, Pu Qiumei 1,4 Abstract.

More information

Multicast With Prioritized Delivery: How Fresh is Your Data?

Multicast With Prioritized Delivery: How Fresh is Your Data? Multicast With Prioritized Delivery: How Fresh is Your Data? Jing Zhong, Roy D Yates and Emina Solanin Department of ECE, Rutgers University, {ingzhong, ryates, eminasolanin}@rutgersedu arxiv:885738v [csit

More information

Notes on Discrete Probability

Notes on Discrete Probability Columbia University Handout 3 W4231: Analysis of Algorithms September 21, 1999 Professor Luca Trevisan Notes on Discrete Probability The following notes cover, mostly without proofs, the basic notions

More information

Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication

Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication Decentralized Control of Discrete Event Systems with Bounded or Unbounded Delay Communication Stavros Tripakis Abstract We introduce problems of decentralized control with communication, where we explicitly

More information

Subsampling, Concentration and Multi-armed bandits

Subsampling, Concentration and Multi-armed bandits Subsampling, Concentration and Multi-armed bandits Odalric-Ambrym Maillard, R. Bardenet, S. Mannor, A. Baransi, N. Galichet, J. Pineau, A. Durand Toulouse, November 09, 2015 O-A. Maillard Subsampling and

More information

Algebra Exam. Solutions and Grading Guide

Algebra Exam. Solutions and Grading Guide Algebra Exam Solutions and Grading Guide You should use this grading guide to carefully grade your own exam, trying to be as objective as possible about what score the TAs would give your responses. Full

More information

EUSIPCO

EUSIPCO EUSIPCO 3 569736677 FULLY ISTRIBUTE SIGNAL ETECTION: APPLICATION TO COGNITIVE RAIO Franc Iutzeler Philippe Ciblat Telecom ParisTech, 46 rue Barrault 753 Paris, France email: firstnamelastname@telecom-paristechfr

More information

Asymptotically Optimal and Bandwith-efficient Decentralized Detection

Asymptotically Optimal and Bandwith-efficient Decentralized Detection Asymptotically Optimal and Bandwith-efficient Decentralized Detection Yasin Yılmaz and Xiaodong Wang Electrical Engineering Department, Columbia University New Yor, NY 10027 Email: yasin,wangx@ee.columbia.edu

More information

Multi-Armed Bandit Formulations for Identification and Control

Multi-Armed Bandit Formulations for Identification and Control Multi-Armed Bandit Formulations for Identification and Control Cristian R. Rojas Joint work with Matías I. Müller and Alexandre Proutiere KTH Royal Institute of Technology, Sweden ERNSI, September 24-27,

More information

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16 600.463 Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16 25.1 Introduction Today we re going to talk about machine learning, but from an

More information

Sparse Linear Contextual Bandits via Relevance Vector Machines

Sparse Linear Contextual Bandits via Relevance Vector Machines Sparse Linear Contextual Bandits via Relevance Vector Machines Davis Gilton and Rebecca Willett Electrical and Computer Engineering University of Wisconsin-Madison Madison, WI 53706 Email: gilton@wisc.edu,

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Uncertainty & Probabilities & Bandits Daniel Hennes 16.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Uncertainty Probability

More information

CSE250A Fall 12: Discussion Week 9

CSE250A Fall 12: Discussion Week 9 CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.

More information

THE POSTDOC VARIANT OF THE SECRETARY PROBLEM 1. INTRODUCTION.

THE POSTDOC VARIANT OF THE SECRETARY PROBLEM 1. INTRODUCTION. THE POSTDOC VARIANT OF THE SECRETARY PROBLEM ROBERT J. VANDERBEI ABSTRACT. The classical secretary problem involves sequentially interviewing a pool of n applicants with the aim of hiring exactly the best

More information

High-Order Conversion From Boolean to Arithmetic Masking

High-Order Conversion From Boolean to Arithmetic Masking High-Order Conversion From Boolean to Arithmetic Masking Jean-Sébastien Coron University of Luxembourg jean-sebastien.coron@uni.lu Abstract. Masking with random values is an effective countermeasure against

More information

Lecture 4 January 23

Lecture 4 January 23 STAT 263/363: Experimental Design Winter 2016/17 Lecture 4 January 23 Lecturer: Art B. Owen Scribe: Zachary del Rosario 4.1 Bandits Bandits are a form of online (adaptive) experiments; i.e. samples are

More information

A. Notation. Attraction probability of item d. (d) Highest attraction probability, (1) A

A. Notation. Attraction probability of item d. (d) Highest attraction probability, (1) A A Notation Symbol Definition (d) Attraction probability of item d max Highest attraction probability, (1) A Binary attraction vector, where A(d) is the attraction indicator of item d P Distribution over

More information

Multi-armed Bandits in the Presence of Side Observations in Social Networks

Multi-armed Bandits in the Presence of Side Observations in Social Networks 52nd IEEE Conference on Decision and Control December 0-3, 203. Florence, Italy Multi-armed Bandits in the Presence of Side Observations in Social Networks Swapna Buccapatnam, Atilla Eryilmaz, and Ness

More information

An analogy from Calculus: limits

An analogy from Calculus: limits COMP 250 Fall 2018 35 - big O Nov. 30, 2018 We have seen several algorithms in the course, and we have loosely characterized their runtimes in terms of the size n of the input. We say that the algorithm

More information

Optimality of Myopic Sensing in Multi-Channel Opportunistic Access

Optimality of Myopic Sensing in Multi-Channel Opportunistic Access Optimality of Myopic Sensing in Multi-Channel Opportunistic Access Tara Javidi, Bhasar Krishnamachari, Qing Zhao, Mingyan Liu tara@ece.ucsd.edu, brishna@usc.edu, qzhao@ece.ucdavis.edu, mingyan@eecs.umich.edu

More information

On the Complexity of Best Arm Identification with Fixed Confidence

On the Complexity of Best Arm Identification with Fixed Confidence On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, Emilie Kaufmann COLT, June 23 th 2016, New York Institut de Mathématiques de Toulouse

More information

1 Secure two-party computation

1 Secure two-party computation CSCI 5440: Cryptography Lecture 7 The Chinese University of Hong Kong, Spring 2018 26 and 27 February 2018 In the first half of the course we covered the basic cryptographic primitives that enable secure

More information

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Multi-armed bandit algorithms. Concentration inequalities. P(X ǫ) exp( ψ (ǫ))). Cumulant generating function bounds. Hoeffding

More information

Pure Exploration in Finitely Armed and Continuous Armed Bandits

Pure Exploration in Finitely Armed and Continuous Armed Bandits Pure Exploration in Finitely Armed and Continuous Armed Bandits Sébastien Bubeck INRIA Lille Nord Europe, SequeL project, 40 avenue Halley, 59650 Villeneuve d Ascq, France Rémi Munos INRIA Lille Nord Europe,

More information

Relaying a Fountain code across multiple nodes

Relaying a Fountain code across multiple nodes Relaying a Fountain code across multiple nodes Ramarishna Gummadi, R.S.Sreenivas Coordinated Science Lab University of Illinois at Urbana-Champaign {gummadi2, rsree} @uiuc.edu Abstract Fountain codes are

More information

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 On the Structure of Real-Time Encoding and Decoding Functions in a Multiterminal Communication System Ashutosh Nayyar, Student

More information

On the Complexity of Best Arm Identification in Multi-Armed Bandit Models

On the Complexity of Best Arm Identification in Multi-Armed Bandit Models On the Complexity of Best Arm Identification in Multi-Armed Bandit Models Aurélien Garivier Institut de Mathématiques de Toulouse Information Theory, Learning and Big Data Simons Institute, Berkeley, March

More information

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati

COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning

More information

Chapter 2. Mathematical Reasoning. 2.1 Mathematical Models

Chapter 2. Mathematical Reasoning. 2.1 Mathematical Models Contents Mathematical Reasoning 3.1 Mathematical Models........................... 3. Mathematical Proof............................ 4..1 Structure of Proofs........................ 4.. Direct Method..........................

More information

Cooperation Speeds Surfing: Use Co-Bandit!

Cooperation Speeds Surfing: Use Co-Bandit! Cooperation Speeds Surfing: Use Co-Bandit! Anuja Meetoo Appavoo, Seth Gilbert, and Kian-Lee Tan Department of Computer Science, National University of Singapore {anuja, seth.gilbert, tankl}@comp.nus.edu.sg

More information

Bayesian and Frequentist Methods in Bandit Models

Bayesian and Frequentist Methods in Bandit Models Bayesian and Frequentist Methods in Bandit Models Emilie Kaufmann, Telecom ParisTech Bayes In Paris, ENSAE, October 24th, 2013 Emilie Kaufmann (Telecom ParisTech) Bayesian and Frequentist Bandits BIP,

More information

NOWADAYS the demand to wireless bandwidth is growing

NOWADAYS the demand to wireless bandwidth is growing 1 Online Learning with Randomized Feedback Graphs for Optimal PUE Attacks in Cognitive Radio Networks Monireh Dabaghchian, Student Member, IEEE, Amir Alipour-Fanid, Student Member, IEEE, Kai Zeng, Member,

More information

MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing

MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing Afonso S. Bandeira April 9, 2015 1 The Johnson-Lindenstrauss Lemma Suppose one has n points, X = {x 1,..., x n }, in R d with d very

More information

Bandits with Delayed, Aggregated Anonymous Feedback

Bandits with Delayed, Aggregated Anonymous Feedback Ciara Pike-Burke 1 Shipra Agrawal 2 Csaba Szepesvári 3 4 Steffen Grünewälder 1 Abstract We study a variant of the stochastic K-armed bandit problem, which we call bandits with delayed, aggregated anonymous

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

Optimality of Myopic Sensing in Multi-Channel Opportunistic Access

Optimality of Myopic Sensing in Multi-Channel Opportunistic Access Optimality of Myopic Sensing in Multi-Channel Opportunistic Access Tara Javidi, Bhasar Krishnamachari,QingZhao, Mingyan Liu tara@ece.ucsd.edu, brishna@usc.edu, qzhao@ece.ucdavis.edu, mingyan@eecs.umich.edu

More information

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Thompson sampling Bernoulli strategy Regret bounds Extensions the flexibility of Bayesian strategies 1 Bayesian bandit strategies

More information

Worst case analysis for a general class of on-line lot-sizing heuristics

Worst case analysis for a general class of on-line lot-sizing heuristics Worst case analysis for a general class of on-line lot-sizing heuristics Wilco van den Heuvel a, Albert P.M. Wagelmans a a Econometric Institute and Erasmus Research Institute of Management, Erasmus University

More information

Chapter 5. Number Theory. 5.1 Base b representations

Chapter 5. Number Theory. 5.1 Base b representations Chapter 5 Number Theory The material in this chapter offers a small glimpse of why a lot of facts that you ve probably nown and used for a long time are true. It also offers some exposure to generalization,

More information

Distributed Stochastic Optimization in Networks with Low Informational Exchange

Distributed Stochastic Optimization in Networks with Low Informational Exchange Distributed Stochastic Optimization in Networs with Low Informational Exchange Wenjie Li and Mohamad Assaad, Senior Member, IEEE arxiv:80790v [csit] 30 Jul 08 Abstract We consider a distributed stochastic

More information

Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks

Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks Reward Maximization Under Uncertainty: Leveraging Side-Observations Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks Swapna Buccapatnam AT&T Labs Research, Middletown, NJ

More information

New Algorithms for Contextual Bandits

New Algorithms for Contextual Bandits New Algorithms for Contextual Bandits Lev Reyzin Georgia Institute of Technology Work done at Yahoo! 1 S A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R.E. Schapire Contextual Bandit Algorithms with Supervised

More information

Alireza Shafaei. Machine Learning Reading Group The University of British Columbia Summer 2017

Alireza Shafaei. Machine Learning Reading Group The University of British Columbia Summer 2017 s s Machine Learning Reading Group The University of British Columbia Summer 2017 (OCO) Convex 1/29 Outline (OCO) Convex Stochastic Bernoulli s (OCO) Convex 2/29 At each iteration t, the player chooses

More information

1 Recap: Interactive Proofs

1 Recap: Interactive Proofs Theoretical Foundations of Cryptography Lecture 16 Georgia Tech, Spring 2010 Zero-Knowledge Proofs 1 Recap: Interactive Proofs Instructor: Chris Peikert Scribe: Alessio Guerrieri Definition 1.1. An interactive

More information

arxiv: v1 [cs.ds] 7 Oct 2015

arxiv: v1 [cs.ds] 7 Oct 2015 Low regret bounds for andits with Knapsacs Arthur Flajolet, Patric Jaillet arxiv:1510.01800v1 [cs.ds] 7 Oct 2015 Abstract Achievable regret bounds for Multi-Armed andit problems are now well-documented.

More information

Learning Optimal Online Advertising Portfolios with Periodic Budgets

Learning Optimal Online Advertising Portfolios with Periodic Budgets Learning Optimal Online Advertising Portfolios with Periodic Budgets Lennart Baardman Operations Research Center, MIT, Cambridge, MA 02139, baardman@mit.edu Elaheh Fata Department of Aeronautics and Astronautics,

More information

Planning With Information States: A Survey Term Project for cs397sml Spring 2002

Planning With Information States: A Survey Term Project for cs397sml Spring 2002 Planning With Information States: A Survey Term Project for cs397sml Spring 2002 Jason O Kane jokane@uiuc.edu April 18, 2003 1 Introduction Classical planning generally depends on the assumption that the

More information

arxiv: v4 [cs.lg] 22 Jul 2014

arxiv: v4 [cs.lg] 22 Jul 2014 Learning to Optimize Via Information-Directed Sampling Daniel Russo and Benjamin Van Roy July 23, 2014 arxiv:1403.5556v4 cs.lg] 22 Jul 2014 Abstract We propose information-directed sampling a new algorithm

More information

Supplementary: Battle of Bandits

Supplementary: Battle of Bandits Supplementary: Battle of Bandits A Proof of Lemma Proof. We sta by noting that PX i PX i, i {U, V } Pi {U, V } PX i i {U, V } PX i i {U, V } j,j i j,j i j,j i PX i {U, V } {i, j} Q Q aia a ia j j, j,j

More information

ALGEBRA. 1. Some elementary number theory 1.1. Primes and divisibility. We denote the collection of integers

ALGEBRA. 1. Some elementary number theory 1.1. Primes and divisibility. We denote the collection of integers ALGEBRA CHRISTIAN REMLING 1. Some elementary number theory 1.1. Primes and divisibility. We denote the collection of integers by Z = {..., 2, 1, 0, 1,...}. Given a, b Z, we write a b if b = ac for some

More information