arxiv: v1 [stat.ml] 27 May 2016

An optimal algorithm for the Thresholding Bandit Problem

Andrea Locatelli, Maurilio Gutzeit, Alexandra Carpentier
Department of Mathematics, University of Potsdam, Germany

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

We study a specific combinatorial pure exploration stochastic bandit problem where the learner aims at finding the set of arms whose means are above a given threshold, up to a given precision, and for a fixed time horizon. We propose a parameter-free algorithm based on an original heuristic, and prove that it is optimal for this problem by deriving matching upper and lower bounds. To the best of our knowledge, this is the first non-trivial pure exploration setting with fixed budget for which optimal strategies are constructed.

1. Introduction

In this paper we study a specific combinatorial, pure exploration, stochastic bandit setting. More precisely, consider a stochastic bandit setting where each arm has mean $\mu_k$. The learner can sample sequentially $T > 0$ samples from the arms and aims at finding as efficiently as possible the set of arms whose means are larger than a threshold $\tau \in \mathbb{R}$. In this paper, we refer to this setting as the Thresholding Bandit Problem (TBP), which is a specific instance of the combinatorial pure exploration bandit setting introduced in (Chen et al., 2014). A simpler one-armed version of this problem is known as the SIGN-ξ problem, see (Chen & Li, 2015).

This problem is related to the popular combinatorial pure exploration bandit problem known as the TopM problem, where the aim of the learner is to return the set of M arms with highest mean (Bubeck et al., 2013b; Gabillon et al., 2012; Kaufmann et al., 2015; Zhou et al., 2014; Cao et al., 2015), which is a combinatorial version of the best arm identification problem (Even-Dar et al., 2002; Mannor & Tsitsiklis, 2004; Bubeck et al., 2009; Audibert & Bubeck, 2010; Gabillon et al., 2012; Jamieson et al., 2014; Karnin et al., 2013; Kaufmann et al., 2015; Chen & Li, 2015). To formulate this link with a simple metaphor, the TopM problem is a "contest" and the TBP problem is an "exam": in the former, the learner wants to select the M arms with highest mean; in the latter, the learner wants to select the arms whose means are higher than a certain threshold. We believe that this distinction is important and that in many applications the TBP problem is more relevant than the TopM problem, as in many domains one has a natural efficiency or correctness threshold above which one wants to use an option. For instance, in industrial applications one wants to keep a machine if its production's value is above its functioning costs, and in crowd-sourcing one wants to hire a worker as long as their productivity is higher than their wage. In addition to these applications derived from the TopM problem, the TBP problem has applications in dueling bandits and is a natural way to cast the problem of active and discrete level set detection, which is in turn related to the important applications of active classification and active anomaly detection; we detail this point in Subsection 3.1.

As mentioned previously, the TBP problem is a specific instance of the combinatorial pure exploration bandit framework introduced in (Chen et al., 2014). Without going into the details of the combinatorial pure exploration setting, for which the paper (Chen et al., 2014) derives interesting general results, we will summarize what these results imply for the particular TBP and TopM problems, which are specific cases of the combinatorial pure exploration setting.
As is often the case for pure exploration problems, the paper (Chen et al., 2014) distinguishes between two settings:

(i) The fixed budget setting, where the learner aims, given a fixed budget, at returning the set of arms that are above the threshold (in the case of TBP) or the set of M best arms (in the case of TopM), with highest possible probability. In this setting, upper and lower bounds are on the probability of making an error when returning the set of arms.

(ii) The fixed confidence setting, where the learner aims, given a probability $\delta$ of acceptable error, at returning the set of arms that are above the threshold (in the case of TBP) or the set of M best arms (in the case of TopM) with as few pulls of the arms as possible. In this setting, upper and lower bounds are on the number of pulls that are necessary to return the correct set of arms with probability at least $1 - \delta$.

The similarities and differences between these two settings have been discussed in the literature in the case of the TopM problem (in particular in the case M = 1), see (Gabillon et al., 2012; Karnin et al., 2013; Chen et al., 2014). While, as explained in (Audibert & Bubeck, 2010; Gabillon et al., 2012), the two settings share similarities in the specific case when additional information about the problem is available to the learner (such as the complexity H defined in Table 1), they are very different in general and results do not transfer from one setting to the other, see (Bubeck et al., 2009; Audibert & Bubeck, 2010; Karnin et al., 2013; Kaufmann et al., 2015). In particular we highlight the following fact: while the fixed confidence setting is relatively well understood, in the sense that there are constructions for optimal strategies (Kalyanakrishnan et al., 2012; Jamieson et al., 2014; Karnin et al., 2013; Kaufmann et al., 2015; Chen & Li, 2015), there is an important knowledge gap in the fixed budget setting. In this case, without additional information on the problem such as the complexity H defined in Table 1, there is a gap between the known upper and lower bounds, see (Audibert & Bubeck, 2010; Gabillon et al., 2012; Karnin et al., 2013; Kaufmann et al., 2015). This knowledge gap is more acute for the general combinatorial exploration bandit problem defined in the paper (Chen et al., 2014) (see their Theorem 3), and therefore for the TBP problem (where in fact no fixed budget lower bound exists, to the best of our knowledge).
We summarize in Table 1 the state of the art results for the TBP problem and for the TopM problem with M = 1. The summary of Table 1 highlights that in the fixed budget setting, both for the TopM and the TBP problem, the correct complexity $H^*$ that should appear in the bound, i.e. the problem dependent quantity $H^*$ such that the upper and lower bounds on the probability of error are of order $\exp(-T/H^*)$, is still an open question. In the TopM problem, Table 1 implies that $H \le H^* \le \log(2K) H_2$. In the TBP problem, Table 1 implies $0 \le H^* \le \log(2K) H_2$, since to the best of our knowledge a lower bound for this problem exists only in the case of the fixed confidence setting. Note that although this gap may appear small, in particular in the case of the TopM problem where it involves only a $\log(K)$ multiplicative factor, it is far from being negligible: the $\log(K)$ factor sits inside the exponent of a term of order $\exp(-T/H^*)$, so it changes the error probability exponentially in $T$.

Problem   | Lower Bound        | Upper Bound
TBP (FC)  | $H \log(1/\delta)$ | $H \log(1/\delta)$
TBP (FB)  | No results         | $K \exp(-T / (\log(K) H_2))$
TopM (FC) | $H \log(1/\delta)$ | $H \log(1/\delta)$
TopM (FB) | $\exp(-T/H)$       | $K \exp(-T / (\log(K) H_2))$

Table 1. State of the art results for the TBP problem and the TopM problem with M = 1, with fixed confidence (FC, for $\delta$ small enough) and fixed budget (FB). For FC, the bound is on the expected total number of samples needed for making an error of at most $\delta$ on the set of arms; for FB, the bound is on the probability of making a mistake on the returned set of arms. The quantities $H$, $H_2$ depend on the means $\mu_k$ of the arm distributions and are defined in (Chen et al., 2014) (and are not the same for TopM and TBP). In the case of the TBP problem, set $\Delta_k = |\tau - \mu_k|$ and write $\Delta_{(k)}$ for the $\Delta_k$ ordered in increasing order; we have $H = \sum_i \Delta_i^{-2}$ and $H_2 = \max_i i \Delta_{(i)}^{-2}$. For the TopM problem with M = 1, the same definitions hold with $\Delta_k = \max_i \mu_i - \mu_k$.

In this paper we close, up to constants, the gap in the fixed budget setting for the TBP problem: we prove that $H^* = H$.
In addition, we also prove that our strategy minimizes at the same time the cumulative regret, and identifies optimally the best arm, provided that the highest mean of the arms is known to the learner. Our findings are summarized in Table 2. To obtain them, we introduce a new algorithm for the TBP problem which is entirely parameter free, and based on an original heuristic. In Section 2, we describe formally the TBP problem, the algorithm, and the results. In Section 3, we describe how our algorithm can be applied to the active detection of discrete level sets, and therefore to the problem of active classification and active anomaly detection. We also describe the implications of our results for the TopM problem. Finally, Section 4 presents some simulations for evaluating our algorithm with respect to the state of the art competitors. The proofs of all theorems are in Appendix A, as well as additional simulation results.

Problem        | Results
TBP (FB): UB   | $\exp(-T/H + \log(\log(T) K))$
TBP (FB): LB   | $\exp(-T/H - \log(\log(T) K))$
TopM (FB): UB  | $\exp(-T/H + \log(\log(T) K))$ (with $\mu^*$ known)

Table 2. Our results for the TopM and the TBP problem in the fixed budget setting, i.e. upper and lower bounds on the probability of making a mistake on the set of arms returned by the learner.

2. The Thresholding Bandit Problem

2.1. Problem formulation

Learning setting. Let K be the number of arms that the learner can choose from. Each of these arms is characterized by a distribution $\nu_k$ that we assume to be R-sub-Gaussian.

Definition (R-sub-Gaussian distribution). Let $R > 0$. A distribution $\nu$ is R-sub-Gaussian if for all $t \in \mathbb{R}$ we have

$$\mathbb{E}_{X \sim \nu}\big[\exp(tX - t\mathbb{E}[X])\big] \le \exp(R^2 t^2 / 2).$$

This encompasses various distributions such as bounded distributions or Gaussian distributions of variance $R^2$ for $R \in \mathbb{R}$. Such distributions have a finite mean; let $\mu_k = \mathbb{E}_{X \sim \nu_k}[X]$ be the mean of arm k.

We consider the following dynamic game setting, which is common in the bandit literature. For any time $t \ge 1$, the learner chooses an arm $I_t$ from $A = \{1, \dots, K\}$. It receives a noisy reward drawn from the distribution $\nu_{I_t}$ associated to the chosen arm. An adaptive learner bases its decision at time t on the samples observed in the past.

Set notations. Let $u \in \mathbb{R}$ and let A be the finite set of arms. We define $S_u$ as the set of arms whose means are over u, that is $S_u := \{k \in A, \mu_k \ge u\}$. We also define $S_u^C$ as the complementary set of $S_u$ in A, i.e. $S_u^C = \{k \in A, \mu_k < u\}$.

Objective. Let $T > 0$ (not necessarily known to the learner beforehand) be the horizon of the game, let $\tau \in \mathbb{R}$ be the threshold and $\epsilon \ge 0$ be the precision. We define the $(\tau, \epsilon)$ thresholding problem as follows: after T rounds of the game described above, the goal of the learner is to correctly identify the arms whose means are over or under the threshold $\tau$ up to a certain precision $\epsilon$, i.e. to correctly discriminate arms that belong to $S_{\tau+\epsilon}$ from those in $S^C_{\tau-\epsilon}$. In the rest of the paper, the sentence "the arm is over the threshold $\tau$" is to be understood as "the arm's mean is over the threshold". After T rounds of the previously defined game, the learner has to output a set $\hat{S}_\tau := \hat{S}_\tau(T) \subset A$ of arms and it suffers the following loss:

$$L(T) = \mathbb{I}\big(S_{\tau+\epsilon} \cap \hat{S}_\tau^C \neq \emptyset \;\vee\; S^C_{\tau-\epsilon} \cap \hat{S}_\tau \neq \emptyset\big).$$
A good learner minimizes this loss by correctly discriminating arms that are outside of a $2\epsilon$ band around the threshold: arms whose means are smaller than $(\tau - \epsilon)$ should not belong to the output set $\hat{S}_\tau$, and symmetrically those whose means are bigger than $(\tau + \epsilon)$ should not belong to $\hat{S}_\tau^C$. If it manages to do so, the algorithm suffers no loss, and otherwise it incurs a loss of 1. For arms that lie inside this $2\epsilon$ strip, on the other hand, mistakes bear no cost. If we set $\epsilon$ to 0 we recover the exact TBP thresholding problem described in the introduction, and the algorithm suffers no loss if it discriminates exactly arms that are over the threshold from those under.

Let $\mathbb{E}$ be the expectation according to the samples collected by an algorithm; its expected loss is

$$\mathbb{E}[L(T)] = \mathbb{P}\big(S_{\tau+\epsilon} \cap \hat{S}_\tau^C \neq \emptyset \;\vee\; S^C_{\tau-\epsilon} \cap \hat{S}_\tau \neq \emptyset\big),$$

i.e. it is the probability of making a mistake, that is rejecting an arm over $(\tau + \epsilon)$ or accepting an arm under $(\tau - \epsilon)$. The lower this probability of error, the better the algorithm, as an oracle strategy would simply classify each arm correctly and suffer an expected loss of 0.

Our problem is a pure exploration bandit problem, and is in fact, after shifting the means by $\tau$, a specific case of the pure exploration bandit problem considered in (Chen et al., 2014), namely the specific case where the set of sets of arms that they call $\mathcal{M}$, and which is their decision class, is the set of all possible sets of arms. We will comment more on this later in Subsection 2.4.

Problem complexity. We define $\Delta_i^{\tau,\epsilon}$, the gap of arm i with respect to $\tau$ and $\epsilon$, as

$$\Delta_i := \Delta_i^{\tau,\epsilon} = |\mu_i - \tau| + \epsilon. \quad (1)$$

We also define the complexity $H_{\tau,\epsilon}$ of the problem as

$$H := H_{\tau,\epsilon} = \sum_{i=1}^{K} \big(\Delta_i^{\tau,\epsilon}\big)^{-2}. \quad (2)$$

We call H the complexity as it is a characterization of the hardness of the problem.
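As a small illustration of the definitions above, the following Python sketch computes the gaps of Equation (1), the complexity of Equation (2), and the 0/1 loss of a returned set. The function names (`gaps`, `complexity`, `loss`) and the zero-based arm indices are assumptions made for this example; they are not part of the paper.

```python
from typing import Iterable, List, Sequence


def gaps(means: Sequence[float], tau: float, eps: float) -> List[float]:
    """Gap of each arm, Delta_i = |mu_i - tau| + eps (Equation 1)."""
    return [abs(mu - tau) + eps for mu in means]


def complexity(means: Sequence[float], tau: float, eps: float) -> float:
    """Complexity H = sum_i Delta_i^{-2} (Equation 2); needs every gap > 0."""
    return sum(d ** -2 for d in gaps(means, tau, eps))


def loss(means: Sequence[float], tau: float, eps: float,
         returned: Iterable[int]) -> int:
    """0/1 loss of a returned set of (zero-based) arm indices.

    The loss is 1 if an arm with mean >= tau + eps is left out, or if an arm
    with mean < tau - eps is included; arms inside the band are free.
    """
    accepted = set(returned)
    for k, mu in enumerate(means):
        if mu >= tau + eps and k not in accepted:
            return 1
        if mu < tau - eps and k in accepted:
            return 1
    return 0
```

With means (0.1, 0.5, 0.9), threshold 0.5 and precision 0.1, the middle arm lies inside the $2\epsilon$ band, so both {2} and {1, 2} incur no loss, while any set that contains arm 0 or omits arm 2 incurs a loss of 1.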
A similar quantity was introduced for general combinatorial bandit problems (Chen et al., 2014) and is similar in essence to the complexity introduced for the best arm identification problem, see (Audibert & Bubeck, 2010).

2.2. A lower bound

In this section, we exhibit a lower bound for the thresholding problem. More precisely, for any sequence of gaps $(d_k)_k$, we define a finite set of problems where the distributions of the arms of these problems correspond to these gaps and are Gaussian of variance 1. We lower bound the largest probability of error among these problems, for the best possible algorithm.

Theorem 1. Let $K, T \ge 0$. Let $d_i \ge 0$ for any $i \le K$. Let $\tau \in \mathbb{R}$, $\epsilon > 0$. For $0 \le i \le K$, we write $B^i$ for the problem where the distribution of arm $j \in \{1, \dots, K\}$ is $\mathcal{N}(\tau + d_i + \epsilon, 1)$ if $i \neq j$ and $\mathcal{N}(\tau - d_i - \epsilon, 1)$ otherwise. For all these problems, $H := H_{\tau,\epsilon} = \sum_i (d_i + 2\epsilon)^{-2}$ is the same by definition. It holds that for any bandit algorithm

$$\max_{i \in \{0, \dots, K\}} \mathbb{E}_{B^i}(L(T)) \ge \exp\big(-3T/H - 4\log(12(\log(T) + 1)K)\big),$$

where $\mathbb{E}_{B^i}$ is the expectation according to the samples of problem $B^i$.
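To get a feel for the two terms in the exponent of Theorem 1, the following Python sketch evaluates them separately for given values of $T$, $H$ and $K$. The function name `theorem1_exponent_terms` and its argument names are assumptions made for this illustration only.

```python
import math
from typing import Tuple


def theorem1_exponent_terms(horizon: int, complexity: float,
                            n_arms: int) -> Tuple[float, float]:
    """Return the two terms of the exponent in Theorem 1's lower bound.

    linear   = -3 T / H
    log_term = -4 log(12 (log(T) + 1) K)
    The lower bound itself is exp(linear + log_term).
    """
    linear = -3.0 * horizon / complexity
    log_term = -4.0 * math.log(12.0 * (math.log(horizon) + 1.0) * n_arms)
    return linear, log_term


if __name__ == "__main__":
    # Hypothetical values: the linear term dominates once T is large
    # compared with H times the log term.
    for T in (10 ** 3, 10 ** 4, 10 ** 6):
        print(T, theorem1_exponent_terms(T, complexity=100.0, n_arms=10))
```

The log term grows only like $\log\log(T) + \log(K)$, so for horizons that are large relative to $H$ the linear term $-3T/H$ drives the bound.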

This lower bound implies that even if the learner is given the distance of the mean of each arm to the threshold and the shape of the distribution of each arm (here Gaussian of variance 1), any algorithm still makes an error of at least $\exp(-3T/H - 4\log(12(\log(T) + 1)K))$ on one of the problems. This is a lower bound in a very strong sense, because we restrict the set of possibilities to a setting where we know all gaps and prove that this lower bound nevertheless holds. Also, it is non-asymptotic and holds for any T, and therefore implies a non-asymptotic minimax lower bound. The closer the means of the distributions to the threshold, the larger the complexity H, and the larger the lower bound. The proof is to be found in Appendix A.

This theorem's lower bound contains two terms in the exponential: a term that is linear in T, and a term that is of order $\log((\log(T) + 1)K) \approx \log(\log(T)) + \log(K)$. For large enough values of T, one has the following simpler corollary.

Corollary. Let $H > 0$ and $R > 0$, $\tau \in \mathbb{R}$ and $\epsilon \ge 0$. Consider $\mathcal{B}_{H,R}$, the set of K-armed bandit problems where the distributions of the arms are R-sub-Gaussian and which all have a complexity smaller than H. Assume that $T \ge 4HR^2 \log(12(\log(T) + 1)K)$. It holds that for any bandit algorithm

$$\sup_{B \in \mathcal{B}_{H,R}} \mathbb{E}_B(L(T)) \ge \exp\big(-4T/(R^2 H)\big),$$

where $\mathbb{E}_B$ is the expectation according to the samples of problem $B \in \mathcal{B}_{H,R}$.

2.3. Algorithm APT and associated upper bound

In this section we introduce APT (Anytime Parameter-free Thresholding algorithm), an anytime parameter-free learning algorithm. Its heuristic is based on a simple observation, namely that a near optimal static strategy that allocates $T_k$ samples to arm k is such that $T_k \Delta_k^2$ is constant across k (and increasing with T), see Theorem 1, and in particular the second half of Step 3 of its proof in Appendix A. A natural idea is therefore to simply pull at time t the arm that minimizes an estimator of this quantity.
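The static allocation behind this heuristic can be sketched in a few lines of Python. The sketch assumes an oracle that knows the true means (which the learner does not), ignores rounding to integers, and uses a hypothetical function name `oracle_allocation`; it only illustrates that choosing $T_k = T \Delta_k^{-2} / H$ makes $T_k \Delta_k^2$ equal to $T/H$ for every arm.

```python
from typing import List, Sequence


def oracle_allocation(means: Sequence[float], tau: float, eps: float,
                      horizon: float) -> List[float]:
    """Real-valued static allocation with T_k * Delta_k^2 constant across arms.

    Delta_k = |mu_k - tau| + eps and H = sum_k Delta_k^{-2}; the allocation
    T_k = horizon * Delta_k^{-2} / H sums to the horizon.
    """
    deltas = [abs(mu - tau) + eps for mu in means]
    big_h = sum(d ** -2 for d in deltas)
    return [horizon * d ** -2 / big_h for d in deltas]


if __name__ == "__main__":
    # Hypothetical instance: the arm closest to the threshold gets most pulls.
    print(oracle_allocation([0.1, 0.5, 0.9], tau=0.5, eps=0.1, horizon=1080))
```

APT can be read as an adaptive attempt to track this allocation using empirical gaps in place of the unknown $\Delta_k$.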
Note that in this paper, we consider for the sake of simplicity that each arm is tested against the same threshold; however, this can be relaxed to $(\tau_k)_k$ at no additional cost.

Algorithm. The algorithm receives as input the definition of the problem $(\tau, \epsilon)$. First, it pulls each arm of the game once. At time $t > K$, APT updates $T_i(t)$, the number of pulls up to time t of arm i, and the empirical mean $\hat{\mu}_i(t)$ of arm i after $T_i(t)$ pulls. Formally, for each $i \in A$ it computes $T_i(t) = \sum_{s=1}^{t} \mathbb{I}(I_s = i)$ and the updated means

$$\hat{\mu}_i(t) = \frac{1}{T_i(t)} \sum_{s=1}^{T_i(t)} X_{i,s}, \quad (3)$$

where $X_{i,s}$ denotes the sample received when pulling i for the s-th time. The algorithm then computes

$$\hat{\Delta}_i(t) := \hat{\Delta}_i^{\tau,\epsilon}(t) = |\hat{\mu}_i(t) - \tau| + \epsilon, \quad (4)$$

the current empirical estimate of the gap associated with arm i. The algorithm then computes

$$B_i(t+1) = \sqrt{T_i(t)}\, \hat{\Delta}_i(t), \quad (5)$$

and pulls the arm $I_{t+1} = \arg\min_{i \le K} B_i(t+1)$ that minimizes this quantity. At the end of the horizon T, the algorithm outputs the set of arms $\hat{S}_\tau = \{k : \hat{\mu}_k(T) \ge \tau\}$.

Algorithm 1 APT algorithm
    Input: $\tau$, $\epsilon$
    Pull each arm once
    for $t = K + 1$ to $T$ do
        Pull arm $I_t = \arg\min_{k \le K} B_k(t)$ from Equation (5)
        Observe reward $X \sim \nu_{I_t}$
    end for
    Output: $\hat{S}_\tau = \{k : \hat{\mu}_k(T) \ge \tau\}$

The expected loss of this algorithm can be bounded as follows.

Theorem 2. Let $K \ge 0$, $T \ge 2K$, and consider a problem B. Assume that all arms $\nu_k$ of the problem are R-sub-Gaussian with means $\mu_k$. Let $\tau \in \mathbb{R}$, $\epsilon \ge 0$. Algorithm APT's expected loss is upper bounded on this problem as

$$\mathbb{E}(L(T)) \le \exp\Big(-\frac{1}{64R^2}\frac{T}{H} + 2\log((\log(T) + 1)K)\Big),$$

where we recall that $H = \sum_i (|\mu_i - \tau| + \epsilon)^{-2}$ and where $\mathbb{E}$ is the expectation according to the samples of the problem.

The bound of Theorem 2 holds for any R-sub-Gaussian bandit problem. Note that one does not need to know R in order to implement the algorithm; e.g. if the distributions are bounded, one does not need to know the bound. This is a desirable feature for an algorithm, yet e.g. all algorithms based on upper confidence bounds need a bound on R.
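The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `apt`, the `pull` callback interface, the zero-based arm indices, and breaking ties in the arg min by lowest index are all assumptions of this sketch.

```python
import math
import random
from typing import Callable, List, Set, Tuple


def apt(pull: Callable[[int], float], n_arms: int, horizon: int,
        tau: float, eps: float) -> Tuple[Set[int], List[int]]:
    """Run APT for `horizon` pulls; return (accepted arms, pull counts)."""
    if horizon < n_arms:
        raise ValueError("horizon must be at least the number of arms")
    counts = [0] * n_arms   # T_i(t)
    sums = [0.0] * n_arms   # running sum of rewards of each arm

    def sample(arm: int) -> None:
        sums[arm] += pull(arm)
        counts[arm] += 1

    for arm in range(n_arms):          # initialization: pull each arm once
        sample(arm)

    for _ in range(n_arms, horizon):
        # B_i = sqrt(T_i) * (|mu_hat_i - tau| + eps), Equations (4) and (5)
        index = [math.sqrt(counts[k]) * (abs(sums[k] / counts[k] - tau) + eps)
                 for k in range(n_arms)]
        sample(min(range(n_arms), key=index.__getitem__))  # ties: lowest index

    accepted = {k for k in range(n_arms) if sums[k] / counts[k] >= tau}
    return accepted, counts


if __name__ == "__main__":
    # Hypothetical Bernoulli instance, seeded for reproducibility.
    rng = random.Random(0)
    means = [0.1, 0.35, 0.45, 0.55, 0.9]
    result = apt(lambda k: 1.0 if rng.random() < means[k] else 0.0,
                 n_arms=len(means), horizon=2000, tau=0.5, eps=0.0)
    print(result)
```

Note that neither R, H nor the horizon enters the index: the horizon only determines when the loop stops, which is what makes the procedure anytime.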
This bound is non-asymptotic (one just needs $T \ge 2K$ so that one can initialize the algorithm) and therefore Theorem 2 provides a minimax upper bound result over the class of problems that have sub-Gaussian constant R and complexity H. The term in the exponential of the upper bound of Theorem 2 matches the lower bound of Theorem 1 up to a multiplicative factor and the $\log((\log(T) + 1)K)$ term. Now, as in the case of the lower bound, for large enough values of T one has the following simpler corollary.

Corollary. Let $H > 0$ and $R > 0$, $\tau \in \mathbb{R}$ and $\epsilon \ge 0$. Consider $\mathcal{B}_{H,R}$, the set of K-armed bandit problems where the distributions of the arms are R-sub-Gaussian and whose complexity is smaller than H. Assume that $T \ge 256HR^2 \log((\log(T) + 1)K)$. For Algorithm APT it holds that

$$\sup_{B \in \mathcal{B}_{H,R}} \mathbb{E}_B(L(T)) \le \exp\big(-T/(128R^2 H)\big),$$

where $\mathbb{E}_B$ is the expectation according to the samples of problem $B \in \mathcal{B}_{H,R}$.

This corollary and Corollary 2.2 imply that for T large enough, i.e. of larger order than $HR^2 \log((\log(T) + 1)K)$, Algorithm APT is order optimal over the class of problems whose complexity is bounded by H and whose arms are R-sub-Gaussian.

2.4. Discussion

A parameter free algorithm. An important point that we want to highlight for our strategy APT is that it does not need any parameter, such as the complexity H, the horizon T or the sub-Gaussian constant R. This contrasts with any upper confidence based approach as in e.g. (Audibert & Bubeck, 2010; Gabillon et al., 2012) (e.g. the UCB-E algorithm in (Audibert & Bubeck, 2010)), which need as parameters an upper bound on R and the exact knowledge of H, while the bound of Theorem 2 holds for any R and any H, and our algorithm adapts to these quantities. We would also like to highlight that for the related problem of best arm identification, existing fixed budget strategies need to know the budget T in advance (Audibert & Bubeck, 2010; Karnin et al., 2013; Chen et al., 2014), while our algorithm can be stopped at any time and the bound of Theorem 2 will hold.

Extensions to distributions that are not sub-Gaussian (as opposed to adaptation to sub-models). It is easy to see in the light of (Bubeck et al., 2013a) that one could extend our algorithm to non sub-Gaussian distributions by using an estimator other than the empirical means, e.g. the estimators in (Catoni et al., 2012) or in (Alon et al., 1996).
These estimators have sub-Gaussian concentration asymptotically under the only assumption that the distributions have a finite $(1 + v)$ moment with $v > 0$ (and the sub-Gaussian concentration will depend on v). Using our algorithm with such an estimator will therefore provide a result that is similar to the one of Theorem 2, and this without requiring the knowledge of v. This means that our algorithm APT, modified to use these robust estimators instead of the empirical mean, will work for any bandit problem where the arm distributions have a finite $(1 + v)$ moment with $v > 0$. On the other hand, if we consider more specific models, e.g. exponential models, it is possible to obtain a refined lower bound in terms of Kullback-Leibler divergences rather than gaps, following (Kaufmann et al., 2015). However, an upper bound of the same order clearly comes at the cost of a more complicated strategy and holds in less generality than our bound.

Optimality of our strategy. As explained previously, the upper bound on the expected risk of algorithm APT is comparable to the lower bound on the expected risk up to a $\log((\log(T) + 1)K)$ term (see Theorems 1 and 2), and this term vanishes when the horizon T is large enough, namely when $T \ge O(HR^2 \log((\log(T) + 1)K))$, which is the case for most problems. So for T large enough, our strategy is order optimal over the class of problems that have complexity smaller than H and sub-Gaussian constant smaller than R.

Comparison with existing results. Our setting is a specific combinatorial pure exploration setting with fixed budget where the objective is to find the set of arms that are above a given threshold. Settings related to ours have been analyzed in the literature, and the state of the art result on our problem can be found (to the best of our knowledge) in the paper (Chen et al., 2014). In this paper, the authors consider a general pure exploration combinatorial problem.
Given a set $\mathcal{M}$ of subsets of $\{1, \dots, K\}$, they aim at finding a subset of arms $M^* \in \mathcal{M}$ such that $M^* = \arg\max_{M \in \mathcal{M}} \sum_{k \in M} \mu_k$. In the specific case where $\mathcal{M}$ is the set of all subsets of $\{1, \dots, K\}$, their problem in the fixed budget setting is exactly the same as ours when $\epsilon = 0$ and the means are shifted by $-\tau$. Their algorithm CSAR's upper bound on the loss is (see their Theorem 3):

$$\mathbb{E}(L(T)) \le K^2 \exp\Big(-\frac{T - K}{72R^2 \log(K) H_{CSAR,2}}\Big),$$

where $H_{CSAR,2} = \max_i i \Delta_{(i)}^{-2}$. As $H_{CSAR,2} \log(K) \ge H$ by definition, there is a gap for their strategy in the fixed budget setting with respect to the lower bound of Theorem 1, which is smaller and of order $\exp(-T/(HR^2))$. Our strategy on the contrary does not have this gap, and improves over the CSAR strategy. We believe that this lack of optimality for CSAR is not an artefact of the proof of the paper (Chen et al., 2014), and that CSAR is sub-optimal, as it is a successive reject algorithm with fixed and non-adaptive reject phase length. A similar gap between upper and lower bounds for successive reject based algorithms in the fixed budget setting was also observed for the best arm identification problem when no additional information such as the complexity is known to the learner, see (Audibert & Bubeck, 2010; Karnin et al., 2013; Kaufmann et al., 2015; Chen et al., 2014). It is therefore an interesting fact that there is a parameter free optimal algorithm for our fixed budget problem.
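To make the comparison between the two complexities concrete, here is a short Python sketch that computes $H$ and $H_{CSAR,2}$ from a list of gaps. The function name `h_and_h2` is hypothetical. The demo checks the elementary harmonic-sum relation $H_{CSAR,2} \le H \le (1 + \log K) H_{CSAR,2}$; this particular constant is an assumption of the sketch, standing in for the $\log(K)$-type factor quoted in the text.

```python
import math
from typing import Sequence, Tuple


def h_and_h2(gaps: Sequence[float]) -> Tuple[float, float]:
    """Return (H, H_2) with H = sum_i Delta_i^{-2}, H_2 = max_i i * Delta_(i)^{-2}.

    Delta_(1) <= Delta_(2) <= ... are the gaps sorted in increasing order,
    and all gaps are assumed to be strictly positive.
    """
    ordered = sorted(gaps)
    big_h = sum(d ** -2 for d in ordered)
    h2 = max(i * d ** -2 for i, d in enumerate(ordered, start=1))
    return big_h, h2


if __name__ == "__main__":
    # Hypothetical gaps; the two complexities differ by at most a log factor.
    example = [0.4, 0.1, 0.2, 0.3]
    big_h, h2 = h_and_h2(example)
    print(big_h, h2, h2 <= big_h <= (1 + math.log(len(example))) * h2)
```

The two quantities coincide when a single arm dominates the sum, and differ most when $i \Delta_{(i)}^{-2}$ is roughly constant in $i$.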

The paper (Chen et al., 2014) also provides results in the fixed confidence setting, where the objective is to provide an $\epsilon$-optimal set using the smallest possible sample size. In these results such a gap in optimality does not appear and the algorithm CLUCB they propose is almost optimal; see also (Kalyanakrishnan et al., 2012; Jamieson et al., 2014; Karnin et al., 2013; Kaufmann et al., 2015; Chen & Li, 2015) for related results in the fixed confidence setting. This highlights that the fixed budget setting and the fixed confidence setting are fundamentally different (at least in the absence of additional information such as the complexity H), and that providing optimal strategies in the fixed budget setting is a more difficult problem than providing an adaptive strategy in the fixed confidence setting: adaptive algorithms that are nearly optimal in the absence of additional information have only been exhibited in the latter case. To the best of our knowledge, all strategies except ours have such an optimality gap for fixed budget pure exploration combinatorial bandit problems, while there exist fixed confidence strategies for general pure exploration combinatorial bandits that are very close to optimal, see (Chen et al., 2014).

Now, in the case where the learner has additional information on the problem, e.g. the complexity H, it has been proved in the TopM problem that a UCB-type strategy has probability of error upper bounded as $\exp(-T/H)$, see (Audibert & Bubeck, 2010; Gabillon et al., 2012). A similar UCB type of algorithm would also work in the TBP problem, implying the same upper bound results as APT. But we would like to highlight that the exact knowledge of H is needed by these algorithms for reaching this bound, which is unlikely in applications. Our strategy on the other hand reaches, up to constants, the optimal expected loss for the TBP problem, without needing any parameter.

3. Extensions of our results to related settings

In this section we detail some implications of the results of the previous section for some specific problems.

3.1. Active level set detection: Active classification and active anomaly detection

Here we explain how a simple modification of our setting transforms it into the setting of active level set detection, and therefore why it can be applied to active classification and active anomaly detection. We define the problem of discrete, active level set detection as the problem of deciding as efficiently as possible, in our bandit setting, whether for any k the probability that the samples of arm $\nu_k$ are above a given level L is higher or smaller than a threshold $\tau$ up to a precision $\epsilon$, i.e. it is the problem of deciding for all k whether

$$\mu_k(L) := \mathbb{P}_{X \sim \nu_k}(X > L) \ge \tau,$$

or not, up to a precision $\epsilon$. This problem can be immediately solved by our approach with a simple change of variable. Namely, for the sample $X_t \sim \nu_{I_t}$ collected by the algorithm at time t, consider the transformation $\tilde{X}_t = \mathbb{1}_{X_t > L}$. Then $\tilde{X}_t$ is a Bernoulli random variable of parameter $\mu_{I_t}(L)$ (which is a 1/2-sub-Gaussian distribution), and applying our algorithm to the transformed samples $\tilde{X}_t$ solves the active level set detection problem. This has two interesting applications, namely in active binary classification and in active anomaly detection.

Active binary classification. In active binary classification, the learner aims at deciding, for K points (the arms of the bandit), whether each point belongs to the class 1 or the class 0. At each round t, the learner can request help from a homogeneous mass of experts (which can be a set of previously trained classifiers, where one wants to minimize the computational cost, or crowd-sourcing, where one wants to minimize the costs of the task), and obtain a noisy label for the chosen data point $I_t$.
We assume that for any point k, the experts' responses are independent and stochastic random variables in $\{0, 1\}$ of mean $\mu_k$ (i.e. the arm distributions are Bernoulli random variables of parameter $\mu_k$). We assume that the experts are right on average and that the label $\ell_k$ of k is equal to $\ell_k := \mathbb{1}\{\mu_k > 1/2\}$. The active classification task therefore amounts to deciding whether $\mu_k > \tau := 1/2$ or not, possibly up to a given precision $\epsilon$. Our strategy therefore directly applies to this problem by choosing $\tau = 1/2$.

Active anomaly detection. In the case of anomaly detection, a common way to characterize anomalies is to describe them as naturally not concentrated (Steinwart et al., 2005). A natural way to characterize anomalies is thus to define a cutoff level L, and classify the samples e.g. above this level L as anomalous. Such an approach has already received attention for anomaly detection, e.g. in (Streeter & Smith), albeit in a cumulative regret setting. Here we consider an active anomaly detection setting where we face K sources of data (the arms), and we aim at sampling them actively to detect which sources emit anomalous samples with a probability higher than a given threshold $\tau$; this threshold is chosen e.g. as the maximal tolerable amount of anomalous behavior of a source. This illustrates the fact that, as described in (Steinwart et al., 2005), the problem of anomaly detection is indeed a problem of level set detection, and so the problem of active anomaly detection is a problem of active level set detection on which we can use our approach as explained above.
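The change of variable used for level set detection amounts to wrapping the sampling routine so that each raw sample is replaced by the indicator that it exceeds the level L. A minimal Python sketch follows; the name `level_set_pull` and the callback-style `pull` interface are assumptions of this example, not something defined in the paper.

```python
import random
from typing import Callable


def level_set_pull(pull: Callable[[int], float],
                   level: float) -> Callable[[int], float]:
    """Wrap a sampler so that it returns 1.0 if the raw sample exceeds `level`.

    The transformed samples are Bernoulli with parameter P(X > level), so a
    thresholding strategy can be run on them with the probability threshold tau.
    """
    def transformed(arm: int) -> float:
        return 1.0 if pull(arm) > level else 0.0
    return transformed


if __name__ == "__main__":
    # Hypothetical sources: Gaussian samples with different means, level L = 2.
    rng = random.Random(1)
    source_means = [0.0, 1.5, 3.0]
    wrapped = level_set_pull(lambda k: rng.gauss(source_means[k], 1.0), 2.0)
    n = 5000
    # Empirical frequency of exceeding the level, one estimate per source.
    print([sum(wrapped(k) for _ in range(n)) / n
           for k in range(len(source_means))])
```

For active binary classification no wrapping is needed, since the expert labels are already in $\{0, 1\}$ and one simply sets $\tau = 1/2$.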

3.2. Best arm identification and cumulative reward maximization with known highest mean value

Two classical bandit problems are the best arm identification problem and the cumulative reward maximization problem. In the former, the goal of the learner is to identify the arm with the highest mean (Bubeck et al., 2009). In the latter, the goal is to maximize the sum of the samples collected by the algorithm up to time T (Auer et al., 1995). Intuitively, the two problems should call for different strategies: in the best arm identification problem one wants to explore all arms heavily, while in the cumulative reward maximization problem one wants to sample as much as possible the arm with the highest mean. Such intuition is backed up by Theorem 1 of (Bubeck et al., 2009), which states that in the absence of additional information and with a fixed budget, the lower the regret suffered in the cumulative setting, expressed in terms of rewards, the higher the regret suffered in the identification problem, expressed in terms of probability of error. We prove in this section the somewhat non-intuitive fact that if one knows the value of the best arm's mean, it is possible to perform both tasks simultaneously by running our algorithm where we choose $\epsilon = 0$ and $\tau = \mu^* := \max_k \mu_k$. Our algorithm then reduces to the GCL algorithm that can be found in (Salomon & Audibert, 2011).

Best arm identification. In the best arm identification problem, the game setting is the same as the one we considered but the goal of the learner is different: it aims at returning an arm $J_T$ with the highest possible mean. The following theorem holds for our strategy APT that runs for T rounds, and then returns the arm $J_T$ that was the most pulled.

Theorem 3. Let $K > 0$, $R > 0$ and $T \ge 2K$, and consider a problem where the distribution of each arm $\nu_k$ is R-sub-Gaussian and has mean $\mu_k$. Let $\mu^* := \max_k \mu_k$ and $H_{\mu^*} = \sum_{i : \mu_i \neq \mu^*} (\mu^* - \mu_i)^{-2}$.
Then APT run with parameters $\tau = \mu^*$ and $\epsilon = 0$, recommending the arm $J_T = \arg\max_{k \in A} T_k(T)$, is such that

$$\mathbb{P}(\mu_{J_T} \neq \mu^*) \le \exp\Big(-\frac{T}{36R^2 H_{\mu^*}} + 2\log((\log(T) + 1)K)\Big).$$

If the complexity H is also known to the learner, algorithm UCB-E from (Audibert & Bubeck, 2010) would attain a similar performance.

Remark. This implies that if $\mu^*$ is known to the learner, there exists an algorithm such that its probability of error is of order $\exp(-cT/H)$. The recent paper (Carpentier & Locatelli, 2016) implies that the knowledge of $\mu^*$ is key here, since without this information the simple regret is at least of order $\exp(-cT/(\log(K)H))$ in a minimax sense.

Cumulative reward maximization. In the cumulative reward maximization problem, the game setting is the same as the one we considered but the aim of the learner is different: if we write $X_t$ for the sample collected at time t by the algorithm, it aims at maximizing $\sum_{t \le T} X_t$. The following theorem holds for our strategy APT that runs for T rounds.

Theorem 4. Let $K > 0$, $R > 0$ and $T \ge 2K$, and consider a problem where the distribution of each arm $\nu_k$ is R-sub-Gaussian. Then APT run with parameters $\tau = \mu^*$ and $\epsilon = 0$ is such that

$$T\mu^* - \mathbb{E}\Big[\sum_{t \le T} X_t\Big] \le \inf_{\delta \ge 1} \sum_{i : \mu_i \neq \mu^*} \Big[\frac{4R^2 \log(T)\delta}{\mu^* - \mu_i} + (\mu^* - \mu_i)\Big(1 + \frac{K}{T^{2\delta - 2}}\Big)\Big].$$

This bound implies both the problem dependent upper bound of order $\sum_i \Delta_i^{-1} \log(T)$ and the problem independent upper bound of order $\sqrt{TK\log(T)}$, and this matches the performance of algorithms like UCB for any tuning parameter. A similar result can also be found in (Salomon & Audibert, 2011).

Discussion. Theorems 3 and 4, whose proofs are provided in Appendix A, imply that our algorithm APT is a good strategy for solving both problems at the same time when $\mu^*$ is known. As mentioned previously, this is counter-intuitive since one would expect a good strategy for the best arm identification problem to explore significantly more than a good strategy for the cumulative reward maximization problem.
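The special case $\tau = \mu^*$, $\epsilon = 0$ can be sketched as follows: the index becomes $\sqrt{T_i(t)}\,|\hat{\mu}_i(t) - \mu^*|$, the recommendation is the most pulled arm, and the collected rewards are summed along the way. The function name `apt_known_best_mean`, the `pull` callback and the tie-breaking by lowest index are assumptions of this Python sketch.

```python
import math
import random
from typing import Callable, Tuple


def apt_known_best_mean(pull: Callable[[int], float], n_arms: int,
                        horizon: int, mu_star: float) -> Tuple[int, float]:
    """APT with tau = mu_star and eps = 0.

    Returns (most pulled arm, total collected reward).
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(horizon):
        if t < n_arms:
            arm = t  # initialization: pull each arm once
        else:
            # index sqrt(T_i) * |mu_hat_i - mu_star|; ties go to lowest index
            arm = min(range(n_arms),
                      key=lambda k: math.sqrt(counts[k])
                      * abs(sums[k] / counts[k] - mu_star))
        sums[arm] += pull(arm)
        counts[arm] += 1
    recommended = max(range(n_arms), key=counts.__getitem__)
    return recommended, sum(sums)


if __name__ == "__main__":
    # Hypothetical Bernoulli instance where the best mean 0.8 is known.
    rng = random.Random(2)
    means = [0.3, 0.6, 0.8]
    print(apt_known_best_mean(lambda k: 1.0 if rng.random() < means[k] else 0.0,
                              n_arms=3, horizon=3000, mu_star=0.8))
```

Because an arm whose empirical mean is close to $\mu^*$ keeps a small index, it is pulled most of the time, which is why the same run serves both the identification and the cumulative reward objectives.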
To convince oneself, it is sufficient to look at the two-armed case, for which in the fixed budget setting it is optimal to sample both arms equally, while this strategy has linear regret in the cumulative setting. This intuition is formalized in (Bubeck et al., 2009), where the authors prove that no algorithm can achieve this without additional information. Our results therefore imply that the knowledge of $\mu^*$ by the learner is sufficient information for Theorem 1 of (Bubeck et al., 2009) to no longer hold, and that there exist algorithms that solve both problems at the same time, as APT does.

TopM problem. An extension of the best arm identification problem is known as the TopM arms identification problem, where one is concerned with identifying the set of the $M$ arms with the highest means (Bubeck et al., 2013b; Gabillon et al., 2012; Kaufmann et al., 2015; Zhou et al., 2014; Chen et al., 2014; Cao et al., 2015). If the learner has some additional information, such as the mean values of the arms with the $M$th and $(M+1)$th highest means, then one can directly apply our algorithm APT, setting $\tau$ in the middle between the $M$th and $(M+1)$th highest means. The set $\hat S_\tau$ would then be returned as the estimated set of $M$ optimal arms. The upper bound and proof for this problem are a direct consequence of Theorem 2, and, granted one has such extra information, the resulting bound outperforms existing results for the fixed budget setting, see (Bubeck et al., 2013b; Kaufmann et al., 2015; Chen et al., 2014; Cao et al., 2015). If the complexity $H$ were also known to the learner, the strategy in (Gabillon et al., 2012) would attain a similar performance.

[Figure 1: three panels, Exp. 1 (3 groups), Exp. 2 (arithmetic progression) and Exp. 3 (geometric progression), all with Bernoulli arms; x-axis: horizon; y-axis: log(1 - est. probability of success); curves: APT, UA, UCBE($4^4$), UCBE(1), UCBE(0.25), CSAR.]

Figure 1. Results of Experiments 1-3 with Bernoulli distributions. The average error of the specified methods is displayed on a logarithmic scale with respect to the horizon.

4. Experiments

We illustrate the performance of algorithm APT in a number of experiments. For comparison, we use the following methods, which include the state-of-the-art CSAR algorithm of (Chen et al., 2014) and two minor adaptations of known methods that are also suitable for our problem.

Uniform Allocation (UA): For each $t \in \{1, 2, \ldots, T\}$, we choose $I_t \sim U_A$. This method is known to be optimal if all arms are equally difficult to classify, that is, in our setting, if the quantities $\Delta_i^{\tau,\epsilon}$, $i \in A$, are very close.

UCB-type algorithm: The algorithm UCBE given and analyzed in (Audibert & Bubeck, 2010) is designed for finding the best arm, and its heuristic is to pull the arm that maximizes a UCB bound; see also (Gabillon et al., 2012) for an adaptation of this algorithm to the general TopM problem. The natural adaptation of the method for our problem corresponds to pulling the arm that minimizes $\hat\Delta_k(t) - \sqrt{a / T_k(t)}$. From the theoretical analysis in the
paper (Audibert & Bubeck, 2010; Gabillon et al., 2012), it is not hard to see that setting $a \approx (T-K)/H$ minimizes their upper bound, and that this algorithm attains the same expected loss as ours, but it requires the knowledge of $H$. In the experiments we choose values $a_i = 4^i \frac{T-K}{H}$, $i \in \{-1, 0, 4\}$, and denote the respective results as UCBE($4^i$). The value $a_0$ can be seen as the optimal choice, while the two other choices give rise to strategies that are sub-optimal because they respectively explore too little or too much.

CSAR: As mentioned before, this method is given in (Chen et al., 2014). In our specific setting, via the shift $\mu_i \mapsto \mu_i - \tau$, lines 7-17 of the algorithm reduce to classifying the arm $i$ that maximizes the absolute value of its shifted empirical mean, $|\hat\mu_i - \tau|$. The set $A_t$ corresponds to $\hat S_\tau$ at time $t$. In fact, in our specific setting the CSAR algorithm is a successive-reject-type strategy (see Audibert & Bubeck, 2010) where the arm whose empirical mean is furthest from the threshold is rejected at the end of each phase.

Figure 1 displays the estimated probability of success on a logarithmic scale with respect to the horizon of the six algorithms, based on $N = 5000$ simulated games with $\tau = 1/2$, $\epsilon = 0.1$, $K = 10$, and $T = 500$.

Experiment 1 (3 groups setting): $K$ Bernoulli arms with means $\mu_{1:3} \equiv 0.1$, $\mu_{4:7} = (0.35, 0.45, 0.55, 0.65)$ and $\mu_{8:10} \equiv 0.9$, which amounts to 2 difficult relevant arms (that is, outside the $2\epsilon$-band), 2 difficult irrelevant arms and six easy relevant arms.

Experiment 2 (arithmetic progression): $K$ Bernoulli arms with means $\mu_{1:4} = \ldots + (0:3) \cdot 0.05$, $\mu_5 = 0.45$, $\mu_6 = 0.55$ and $\mu_{7:10} = \ldots + (0:3) \cdot 0.05$, which amounts to 2 difficult irrelevant arms and eight arms arithmetically progressing away from $\tau$.

Experiment 3 (geometric progression): $K$ Bernoulli arms with means $\mu_{1:4} = \ldots$, $\mu_5 = 0.45$, $\mu_6 = 0.55$ and $\mu_{7:10} = \ldots$, which amounts to 2 difficult irrelevant arms and eight arms geometrically progressing away from $\tau$.
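The setup of Experiment 1 can be reproduced on a small scale with the sketch below. It is an illustration under stated assumptions, not the authors' experimental code: the index $B_i(t) = \sqrt{T_i(t)}\,(|\hat\mu_i(t) - \tau| + \epsilon)$ and the rule of pulling the arm with the smallest index are taken from the proofs in Appendix A, while the single initial pull per arm, the success criterion (arms with mean at least $\tau + \epsilon$ accepted, arms with mean at most $\tau - \epsilon$ rejected) and all function names are assumptions made for this example.

```python
import math
import random


def apt(means, tau, eps, horizon, rng):
    """Play one game of an APT-style strategy on Bernoulli arms.

    Assumes horizon >= len(means). Returns the final empirical means.
    """
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k

    def pull(i):
        # Draw one Bernoulli(means[i]) sample and update the statistics of arm i.
        sums[i] += 1.0 if rng.random() < means[i] else 0.0
        pulls[i] += 1

    for i in range(k):  # initialization (assumed): one pull per arm
        pull(i)
    for _ in range(horizon - k):
        # Index B_i = sqrt(T_i) * (|mu_hat_i - tau| + eps); pull the smallest one.
        index = [
            math.sqrt(pulls[i]) * (abs(sums[i] / pulls[i] - tau) + eps)
            for i in range(k)
        ]
        pull(index.index(min(index)))
    return [sums[i] / pulls[i] for i in range(k)]


def is_success(means, estimates, tau, eps):
    """True if every arm outside the 2*eps band is classified correctly."""
    for mu, mu_hat in zip(means, estimates):
        if mu >= tau + eps and mu_hat < tau:
            return False  # relevant arm above the band was rejected
        if mu <= tau - eps and mu_hat >= tau:
            return False  # relevant arm below the band was accepted
    return True


def estimate_success(means, tau, eps, horizon, runs, seed=0):
    """Fraction of successful games over `runs` independent simulations."""
    rng = random.Random(seed)
    wins = sum(
        is_success(means, apt(means, tau, eps, horizon, rng), tau, eps)
        for _ in range(runs)
    )
    return wins / runs


# Means of Experiment 1 (3 groups setting), as listed above.
EXPERIMENT_1 = [0.1] * 3 + [0.35, 0.45, 0.55, 0.65] + [0.9] * 3

if __name__ == "__main__":
    print(estimate_success(EXPERIMENT_1, tau=0.5, eps=0.1, horizon=500, runs=200))
```

With far fewer than the $N = 5000$ games used for Figure 1, the estimate is noisy, so it should be read as a sanity check of the sampling rule rather than as a reproduction of the reported curves.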
The experimental results confirm that our algorithm may only be outperformed by methods that have an advantage in the sense that they have access to the underlying problem complexity and, in the case of UCBE(1), an additional optimal parameter choice. In particular, other choices for that parameter lead to significantly less accurate results, comparable to the naive strategy UA. These effects are also visible in the further results given in Appendix B.

Conclusion

In this paper we proposed a parameter-free algorithm based on a new heuristic for the TBP problem in the fixed budget setting, and we proved that it is optimal, a kind of result that is highly non-trivial for combinatorial pure exploration problems with fixed budget.

Acknowledgement

This work is supported by the DFG's Emmy Noether grant MuSyAD (CA 1488/1-1).

References

Alon, Noga, Matias, Yossi, and Szegedy, Mario. The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing. ACM.

Audibert, Jean-Yves and Bubeck, Sébastien. Best arm identification in multi-armed bandits. In Proceedings of the 23rd Conference on Learning Theory, 2010.

Auer, Peter, Cesa-Bianchi, Nicolò, Freund, Yoav, and Schapire, Robert. Gambling in a Rigged Casino: The Adversarial Multi-Armed Bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, 1995.

Bubeck, Sébastien, Cesa-Bianchi, Nicolò, and Lugosi, Gábor. Bandits with heavy tail. Information Theory, IEEE Transactions on, 59(11), 2013a.

Bubeck, Sébastien, Munos, Rémi, and Stoltz, Gilles. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory. Springer, 2009.

Bubeck, Sébastien, Wang, Tengyao, and Viswanathan, Nitin. Multiple identifications in multi-armed bandits. In Proceedings of The 30th International Conference on Machine Learning (ICML-13), 2013b.

Cao, Wei, Li, Jian, Tao, Yufei, and Li, Zhize. On top-k selection in multi-armed bandits and hidden bipartite graphs. In Advances in Neural Information Processing Systems, 2015.

Carpentier, Alexandra and Locatelli, Andrea. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Proceedings of the 29th Conference on Learning Theory, 2016.

Catoni, Olivier et al. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 48. Institut Henri Poincaré.

Chen, Lijie and Li, Jian. On the optimal sample complexity for best arm identification. arXiv preprint, 2015.

Even-Dar, Eyal, Mannor, Shie, and Mansour, Yishay. PAC bounds for multi-armed bandit and Markov decision processes. In Computational Learning Theory. Springer, 2002.

Gabillon, Victor, Ghavamzadeh, Mohammad, and Lazaric, Alessandro. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, 2012.

Jamieson, Kevin, Malloy, Matthew, Nowak, Robert, and Bubeck, Sébastien. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of the 27th Conference on Learning Theory.

Kalyanakrishnan, Shivaram, Tewari, Ambuj, Auer, Peter, and Stone, Peter. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning (ICML-12).

Karnin, Zohar, Koren, Tomer, and Somekh, Oren. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML-13).

Kaufmann, Emilie, Cappé, Olivier, and Garivier, Aurélien. On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 2015.

Mannor, S and Tsitsiklis, J N. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem. Journal of Machine Learning Research, 5, 2004.

Salomon, Antoine and Audibert, Jean-Yves. Deviations of stochastic bandit regret. In Algorithmic Learning Theory. Springer, 2011.

Steinwart, Ingo, Hush, Don R, and Scovel, Clint. A classification framework for anomaly detection. In Journal of Machine Learning Research.

Streeter, Matthew J and Smith, Stephen F. Selecting among heuristics by solving thresholded k-armed bandit problems. ICAPS 2006.

Zhou, Yuan, Chen, Xi, and Li, Jian. Optimal PAC multiple arm identification with applications to crowdsourcing. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.

Chen, Shouyuan, Lin, Tian, King, Irwin, Lyu, Michael R, and Chen, Wei. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, 2014.

A. Proofs

A.1. Proof of Theorem 1

Proof. In this proof, we will prove that on at least one instance of the problem, any algorithm makes a mistake of order at least $\exp(-cT/H)$.

Step 0: Setting and notations. Let us consider $K$ real numbers $\Delta_i \geq 0$, and let us set $\tau = 0$, $\epsilon = 0$. Let us write $\nu_i := N(\Delta_i, 1)$ for the Gaussian distribution of mean $\Delta_i$ and variance 1, and $\nu'_i := N(-\Delta_i, 1)$ for the Gaussian distribution of mean $-\Delta_i$ and variance 1. Note that this construction is easily generalised to cases where $\tau \neq 0$ or $\epsilon \neq 0$ by translation or careful choice of the $\Delta_i$.

We define the product distributions $B^i$, where $i \in \{0, \ldots, K\}$, as $\nu^i_1 \otimes \ldots \otimes \nu^i_K$ where for $k \leq K$, $\nu^i_k := \nu_k 1_{k \neq i} + \nu'_k 1_{k = i}$ is $\nu_k$ if $k \neq i$ and $\nu'_k$ otherwise. We also extend this notation to $B^0$, where none of the arms is flipped with respect to the threshold ($\forall k$, $\nu^0_k := \nu_k$).

It is straightforward that the gap of arm $k$ with respect to the threshold $\tau = 0$ does not depend on $B^i$ and is equal to $\Delta_k$. It follows that all these problems have the same complexity $H$ (as defined previously with $\epsilon = 0$ and $\tau = 0$).

We write, for $i \leq K$, $P_{B^i}$ for the probability distribution according to all the samples that a strategy could possibly collect up to horizon $T$, i.e. according to the samples $(X_{k,s})_{k \leq K, s \leq T} \sim (B^i)^{\otimes T}$. Let $(T_k)_{k \leq K}$ denote the numbers of samples collected by the algorithm on arm $k$.

Let $k \in \{1, \ldots, K\}$. Note that $KL_k := KL(\nu_k, \nu'_k) = 2\Delta_k^2$, where KL is the Kullback-Leibler divergence. Let $t \geq 0$. We define the quantity:

$$\widehat{KL}_{k,t} = \frac{1}{t}\sum_{s=1}^{t} \log\left(\frac{d\nu_k}{d\nu'_k}(X_{k,s})\right) = \frac{1}{t}\sum_{s=1}^{t} 2 X_{k,s}\Delta_k.$$

Step 1: Concentration of the empirical KL. Let us define the event:

$$\xi = \left\{\forall k \leq K, \forall t \leq T : \big|\widehat{KL}_{k,t} - KL_k\big| \leq 4\Delta_k\sqrt{\frac{\log(4(\log(T)+1)K)}{t}}\right\}.$$

Since $\widehat{KL}_{k,t} = \frac{1}{t}\sum_{s=1}^{t} 2X_{k,s}\Delta_k$ and $KL_k = 2\Delta_k^2$, by Gaussian concentration (a peeling argument and the maximal martingale inequality), it holds for any $i$ that $P_{B^i}(\xi) \geq 3/4$.

Step 2: A change of measure. We will now use the change of measure introduced previously for a well chosen event $A$.
Namely, we consider $A_i = \{i \in \hat S_\tau\}$, the event where the algorithm classified arm $i$ as being above the threshold. We have, by doing a change of measure between $B^i$ and $B^0$ (since they only differ in arm $i$, and only the first $T_i$ samples of arm $i$ are used by the algorithm):

$$\begin{aligned}
P_{B^i}(A_i) &= E_{B^0}\Big[1_{A_i}\exp\big(-T_i\widehat{KL}_{i,T_i}\big)\Big] \\
&\geq E_{B^0}\Big[1_{A_i \cap \xi}\exp\big(-T_i\widehat{KL}_{i,T_i}\big)\Big] \\
&\geq E_{B^0}\Big[1_{A_i \cap \xi}\exp\Big(-2\Delta_i^2 T_i - 4\Delta_i\sqrt{T_i\log(4(\log(T)+1)K)}\Big)\Big],
\end{aligned}$$

by definition of $\xi$ and $KL_i$.

Step 3: A union of events. We now consider the event $A = \bigcap_{i=1}^{K} A_i$, i.e. the event where all arms are classified as

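being above the threshold $\tau = 0$. We have:

$$\begin{aligned}
\max_{i \in \{1,\ldots,K\}} P_{B^i}(A_i) &\geq \frac{1}{K}\sum_{i=1}^{K} P_{B^i}(A_i) \qquad (6) \\
&\geq \frac{1}{K}\sum_{i=1}^{K} P_{B^i}(A_i \cap \xi) \\
&\geq \frac{1}{K}\sum_{i=1}^{K} E_{B^0}\Big[1_{A_i \cap \xi}\exp\Big(-2\Delta_i^2 T_i - 4\Delta_i\sqrt{T_i\log(4(\log(T)+1)K)}\Big)\Big] \\
&\geq E_{B^0}\Big[1_{A \cap \xi}\,\frac{1}{K}\sum_{i=1}^{K}\exp\Big(-3\Delta_i^2 T_i - 4\log(4(\log(T)+1)K)\Big)\Big] \\
&= \exp\Big(-4\log(4(\log(T)+1)K)\Big)\, E_{B^0}\big[1_{A \cap \xi}\, S\big], \qquad (7)
\end{aligned}$$

where the fourth line comes from using $2ab \leq a^2 + b^2$ with $a = \Delta_i\sqrt{T_i}$ and $b = 2\sqrt{\log(4(\log(T)+1)K)}$, and where:

$$S = \frac{1}{K}\sum_{i=1}^{K}\exp\big(-3\Delta_i^2 T_i\big).$$

Since $\sum_i T_i = T$ and all $T_i$ are positive, there exists an arm $i$ such that $T_i \leq \frac{T}{H\Delta_i^2}$. This yields:

$$S \geq \frac{1}{K}\exp\Big(-3\frac{T}{H}\Big) = \exp\Big(-3\frac{T}{H} - \log(K)\Big).$$

This implies by definition of the risk:

$$\begin{aligned}
\max_{i \in \{0,\ldots,K\}} E_{B^i}(L(T)) &\geq \max\Big(\max_{i \in \{1,\ldots,K\}} P_{B^i}(A_i),\ 1 - P_{B^0}(A)\Big) \\
&\geq \frac{1}{2}\Big(\exp\Big(-3\frac{T}{H} - 4\log(4(\log(T)+1)K) - \log(K)\Big) E_{B^0}\big[1_{A \cap \xi}\big] + 1 - P_{B^0}(A)\Big) \\
&= \frac{1}{2}\Big(\exp\Big(-3\frac{T}{H} - 4\log(4(\log(T)+1)K) - \log(K)\Big) P_{B^0}(A \cap \xi) + 1 - P_{B^0}(A)\Big) \\
&\geq \frac{1}{8}\exp\Big(-3\frac{T}{H} - 4\log(4(\log(T)+1)K) - \log(K)\Big) \\
&\geq \exp\Big(-3\frac{T}{H} - 4\log(12(\log(T)+1)K)\Big).
\end{aligned}$$

The fourth line comes from $P(\xi) \geq 3/4$, and we consider two cases, $P_{B^0}(A) \geq 1/2$ and $P_{B^0}(A) \leq 1/2$. The first leads directly to the condition, as the intersection is at least of probability 1/4; in the latter case, we have the same bound via

$$\max_{i \in \{0,\ldots,K\}} E_{B^i}(L(T)) \geq E_{B^0}(L(T)) = P_{B^0}(A^C) \geq 1/2.$$

This concludes the proof.

A.2. Proof of Theorem 2

Proof. In this proof, we will show that on a well chosen event $\xi$, we classify correctly the arms which are over $\tau + \epsilon$, and reject the arms that are under $\tau - \epsilon$.

Step 1: A favorable event. Let $\delta = (4\sqrt{2})^{-1}$. Towards this goal, we define the event $\xi$ as follows:

$$\xi = \left\{\forall i \in A, \forall s \in \{1, \ldots, T\} : \Big|\frac{1}{s}\sum_{t=1}^{s} X_{i,t} - \mu_i\Big| \leq \sqrt{\frac{T\delta^2}{Hs}}\right\}.$$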

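We know from the sub-Gaussian martingale inequality that for each $i \in A$ and each $u \in \{0, \ldots, \lfloor\log(T)\rfloor\}$:

$$P\left(\exists v \in [2^u, 2^{u+1}] : \Big|\frac{1}{v}\sum_{t=1}^{v} X_{i,t} - \mu_i\Big| \geq \sqrt{\frac{T\delta^2}{Hv}}\right) \leq \exp\Big(-\frac{T\delta^2}{2R^2H}\Big).$$

The complement of $\xi$ is contained in the union of these events over all $i \leq K$ and $u \leq \lfloor\log(T)\rfloor$. As there are fewer than $(\log(T)+1)K$ such combinations, we can lower bound the probability of $\xi$ with a union bound by:

$$P(\xi) \geq 1 - 2(\log(T)+1)K\exp\Big(-\frac{T\delta^2}{2R^2H}\Big).$$

Step 2: Characterization of some helpful arm. At time $T$, we consider an arm $k$ that has been pulled after the initialization phase and such that $T_k(T) - 1 \geq \frac{T-K}{H\Delta_k^2}$. We know that such an arm exists, otherwise we get:

$$T - K = \sum_{i=1}^{K}\big(T_i(T) - 1\big) < \sum_{i=1}^{K}\frac{T-K}{H\Delta_i^2} = T - K,$$

which is a contradiction. Note that since $T \geq 2K$, we have that $T_k(T) - 1 \geq \frac{T-K}{H\Delta_k^2} \geq \frac{T}{2H\Delta_k^2}$. We now consider $t \leq T$, the last time that this arm $k$ was pulled. Using $T_k(t) \geq 2$ (by the initialisation of the algorithm), we know that:

$$T_k(t) \geq T_k(T) - 1 \geq \frac{T}{2H\Delta_k^2}. \qquad (8)$$

Step 3: Lower bound on the number of pulls of the other arms. On $\xi$, at time $t$ as we defined previously, we have for every arm $i$:

$$|\hat\mu_i(t) - \mu_i| \leq \sqrt{\frac{T\delta^2}{H T_i(t)}}. \qquad (9)$$

From the reverse triangle inequality and Equation (4), we have:

$$\begin{aligned}
|\hat\mu_i(t) - \mu_i| &= \big|(\hat\mu_i(t) - \tau) - (\mu_i - \tau)\big| \\
&\geq \big||\hat\mu_i(t) - \tau| - |\mu_i - \tau|\big| \\
&= \big|(|\hat\mu_i(t) - \tau| + \epsilon) - (|\mu_i - \tau| + \epsilon)\big| \\
&= \big|\hat\Delta_i(t) - \Delta_i\big|.
\end{aligned}$$

Combining this with (9) yields the following (and likewise for every arm $i$ in place of $k$):

$$\Delta_k - \sqrt{\frac{T\delta^2}{H T_k(t)}} \leq \hat\Delta_k(t) \leq \Delta_k + \sqrt{\frac{T\delta^2}{H T_k(t)}}. \qquad (10)$$

By construction, we know that at time $t$ we pulled arm $k$, which yields for every $i \in A$:

$$B_k(t) \leq B_i(t). \qquad (11)$$

We can lower bound the left-hand side of (11) using (8):

$$\begin{aligned}
\Big(\Delta_k - \sqrt{\frac{T\delta^2}{H T_k(t)}}\Big)\sqrt{T_k(t)} &\leq B_k(t) \\
\big(\Delta_k - \sqrt{2}\,\delta\Delta_k\big)\sqrt{\frac{T}{2H\Delta_k^2}} &\leq B_k(t) \\
\Big(\frac{1}{\sqrt{2}} - \delta\Big)\sqrt{\frac{T}{H}} &\leq B_k(t), \qquad (12)
\end{aligned}$$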

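and upper bound the right-hand side using (10) by:

$$B_i(t) = \hat\Delta_i(t)\sqrt{T_i(t)} \leq \Big(\Delta_i + \sqrt{\frac{T\delta^2}{H T_i(t)}}\Big)\sqrt{T_i(t)} \leq \Delta_i\sqrt{T_i(t)} + \delta\sqrt{\frac{T}{H}}. \qquad (13)$$

As both $\Delta_i$ and $T_i(t)$ are positive by definition, combining (12) and (13) yields the following lower bound on $T_i(T) \geq T_i(t)$:

$$\big(1 - 2\sqrt{2}\,\delta\big)^2\frac{T}{2H\Delta_i^2} \leq T_i(T). \qquad (14)$$

Step 4: Conclusion. On $\xi$, as $\Delta_i$ is a positive quantity, combining (9) and (14) yields:

$$\mu_i - \Delta_i\frac{\sqrt{2}\,\delta}{1 - 2\sqrt{2}\,\delta} \leq \hat\mu_i(T) \leq \mu_i + \Delta_i\frac{\sqrt{2}\,\delta}{1 - 2\sqrt{2}\,\delta}, \qquad (15)$$

where $\frac{\sqrt{2}\,\delta}{1 - 2\sqrt{2}\,\delta}$ simplifies to $1/2$ for $\delta = (4\sqrt{2})^{-1}$.

For arms such that $\mu_i \geq \tau + \epsilon$, we have $\Delta_i = \mu_i - \tau + \epsilon$ and we can rewrite (15):

$$\begin{aligned}
\mu_i - \tau - \frac{1}{2}\Delta_i &\leq \hat\mu_i(T) - \tau \\
(\mu_i - \tau)\Big(1 - \frac{1}{2}\Big) - \frac{\epsilon}{2} &\leq \hat\mu_i(T) - \tau \\
0 &\leq \hat\mu_i(T) - \tau,
\end{aligned}$$

where the last line uses $\mu_i \geq \tau + \epsilon$. One can easily check through similar derivations that $\hat\mu_i(T) - \tau < 0$ holds for $\mu_i < \tau - \epsilon$.

On $\xi$, arms over $\tau + \epsilon$ are all accepted, and arms under $\tau - \epsilon$ are all rejected, which means the loss suffered by the algorithm is 0. As $1 - P(\xi) \leq 2(\log(T)+1)K\exp\big(-\frac{T}{64R^2H}\big)$, this concludes the proof.

A.3. Proof of Theorem 3

Proof. We will prove that on a well defined event $\xi$, each sub-optimal arm $k$ is pulled fewer than $\frac{T-K}{2\Delta_k^2 H}$ times after the initialization, which translates to the best arm being chosen at the end of the horizon, as it was pulled more than half of the time.

Step 1: A favorable event. Let $\delta = 1/18$. We define the following events $\forall i \in A$:

$$\xi_i = \left\{\forall s \leq T : |\mu_i - \hat\mu_i(s)| \leq \sqrt{\frac{T\delta}{H T_i(s)}}\right\}.$$

We now define $\xi$ as the intersection of these events:

$$\xi = \bigcap_{k \in A}\xi_k.$$

Using the same sub-Gaussian martingale inequality as in the proof of Theorem 2, we can lower bound its probability of occurrence with a union bound by:

$$P(\xi) \geq 1 - 2(\log(T)+1)K\exp\Big(-\frac{T}{36R^2H}\Big).$$

Step 2: The wrong arm at the wrong time. Let us now suppose that a sub-optimal arm $k$ was pulled at least $\frac{T-K}{2\Delta_k^2H}$ times after the initialization, which translates to $T_k(T) - 1 \geq \frac{T-K}{2\Delta_k^2H}$. Let us now consider the last time $t \leq T$ that this arm was pulled. As it was pulled at time $t$, the following inequality holds:

$$B_k(t) \leq B_{k^*}(t). \qquad (16)$$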

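On $\xi$, we can now lower bound the left-hand side by:

$$\begin{aligned}
\Big(\Delta_k - \sqrt{\frac{T\delta}{H T_k(t)}}\Big)\sqrt{T_k(t)} &\leq B_k(t) \\
\Delta_k\sqrt{T_k(t)} - \sqrt{\frac{T\delta}{H}} &\leq B_k(t). \qquad (17)
\end{aligned}$$

We also upper bound the right-hand side of (16) by:

$$B_{k^*}(t) \leq \sqrt{\frac{T\delta}{H}}. \qquad (18)$$

Combining both bounds (17) and (18) with (16), as well as rearranging the terms, yields:

$$\Delta_k\sqrt{T_k(t)} \leq 2\sqrt{\frac{T\delta}{H}}, \quad\text{that is,}\quad T_k(t)\Delta_k^2 \leq 4\delta\frac{T}{H}. \qquad (19)$$

Using $T_k(t) \geq T_k(T) - 1 \geq \frac{T-K}{2\Delta_k^2H}$ as well as $T \geq 2K$, we have

$$T_k(t) \geq \frac{T}{4\Delta_k^2H}. \qquad (20)$$

Plugging this in (19) brings the following condition:

$$\frac{T}{4\Delta_k^2H}\Delta_k^2 \leq 4\delta\frac{T}{H}, \qquad (21)$$

which directly reduces to $\delta \geq 1/16$, which is a contradiction as we have set $\delta = 1/18$.

As we have proved that any sub-optimal arm $i \neq k^*$ satisfies $T_i(T) < \frac{T}{2\Delta_i^2H}$, summing over all sub-optimal arms yields:

$$T - T_{k^*}(T) = \sum_{i \neq k^*} T_i(T) < \sum_{i \neq k^*}\frac{T}{2H\Delta_i^2} = \frac{T}{2}. \qquad (22)$$

We conclude by observing that $T_{k^*}(T) > T/2$, and as such arm $k^*$ will be chosen by the algorithm at the end as being the best arm.

A.4. Proof of Theorem 4

Proof. In this proof we will show that with high probability the sub-optimal arms have been pulled at most at a logarithmic rate, and will then bound the expectation of the number of pulls of these arms.

Step 1: A favorable event. We define the following events $\forall s \leq T$:

$$\xi_{k^*,s} = \left\{\mu^* - \hat\mu_{k^*}(s) \leq R\sqrt{\frac{\log(T)\delta}{T_{k^*}(s)}}\right\},$$

as well as for all arms $i \neq k^*$:

$$\xi_{i,s} = \left\{\hat\mu_i(s) - \mu_i \leq R\sqrt{\frac{\log(T)\delta}{T_i(s)}}\right\}.$$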


Upper-Confidence-Bound Algorithms for Active Learning in Multi-armed Bandits Upper-Confidence-Bound Algorithms for Active Learning in Multi-armed Bandits Alexandra Carpentier 1, Alessandro Lazaric 1, Mohammad Ghavamzadeh 1, Rémi Munos 1, and Peter Auer 2 1 INRIA Lille - Nord Europe,

More information

Lecture 6: Non-stochastic best arm identification

Lecture 6: Non-stochastic best arm identification CSE599i: Online and Adaptive Machine Learning Winter 08 Lecturer: Kevin Jamieson Lecture 6: Non-stochastic best arm identification Scribes: Anran Wang, eibin Li, rian Chan, Shiqing Yu, Zhijin Zhou Disclaimer:

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Improved Algorithms for Linear Stochastic Bandits

Improved Algorithms for Linear Stochastic Bandits Improved Algorithms for Linear Stochastic Bandits Yasin Abbasi-Yadkori abbasiya@ualberta.ca Dept. of Computing Science University of Alberta Dávid Pál dpal@google.com Dept. of Computing Science University

More information

Explore no more: Improved high-probability regret bounds for non-stochastic bandits

Explore no more: Improved high-probability regret bounds for non-stochastic bandits Explore no more: Improved high-probability regret bounds for non-stochastic bandits Gergely Neu SequeL team INRIA Lille Nord Europe gergely.neu@gmail.com Abstract This work addresses the problem of regret

More information

Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors

Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conugate Priors Yichi Zhou 1 Jun Zhu 1 Jingwe Zhuo 1 Abstract Thompson sampling has impressive empirical performance for many multi-armed

More information

Monte-Carlo Tree Search by. MCTS by Best Arm Identification

Monte-Carlo Tree Search by. MCTS by Best Arm Identification Monte-Carlo Tree Search by Best Arm Identification and Wouter M. Koolen Inria Lille SequeL team CWI Machine Learning Group Inria-CWI workshop Amsterdam, September 20th, 2017 Part of...... a new Associate

More information

Minimax strategy for Stratified Sampling for Monte Carlo

Minimax strategy for Stratified Sampling for Monte Carlo Journal of Machine Learning Research 1 2000 1-48 Submitted 4/00; Published 10/00 Minimax strategy for Stratified Sampling for Monte Carlo Alexandra Carpentier INRIA Lille - Nord Europe 40, avenue Halley

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Lecture 5: Bandit optimisation Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Introduce bandit optimisation: the

More information

Informational Confidence Bounds for Self-Normalized Averages and Applications

Informational Confidence Bounds for Self-Normalized Averages and Applications Informational Confidence Bounds for Self-Normalized Averages and Applications Aurélien Garivier Institut de Mathématiques de Toulouse - Université Paul Sabatier Thursday, September 12th 2013 Context Tree

More information

Lecture 3: Lower Bounds for Bandit Algorithms

Lecture 3: Lower Bounds for Bandit Algorithms CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and

More information

Online Learning with Feedback Graphs

Online Learning with Feedback Graphs Online Learning with Feedback Graphs Claudio Gentile INRIA and Google NY clagentile@gmailcom NYC March 6th, 2018 1 Content of this lecture Regret analysis of sequential prediction problems lying between

More information

KULLBACK-LEIBLER UPPER CONFIDENCE BOUNDS FOR OPTIMAL SEQUENTIAL ALLOCATION

KULLBACK-LEIBLER UPPER CONFIDENCE BOUNDS FOR OPTIMAL SEQUENTIAL ALLOCATION Submitted to the Annals of Statistics arxiv: math.pr/0000000 KULLBACK-LEIBLER UPPER CONFIDENCE BOUNDS FOR OPTIMAL SEQUENTIAL ALLOCATION By Olivier Cappé 1, Aurélien Garivier 2, Odalric-Ambrym Maillard

More information

Learning to play K-armed bandit problems

Learning to play K-armed bandit problems Learning to play K-armed bandit problems Francis Maes 1, Louis Wehenkel 1 and Damien Ernst 1 1 University of Liège Dept. of Electrical Engineering and Computer Science Institut Montefiore, B28, B-4000,

More information

Yevgeny Seldin. University of Copenhagen

Yevgeny Seldin. University of Copenhagen Yevgeny Seldin University of Copenhagen Classical (Batch) Machine Learning Collect Data Data Assumption The samples are independent identically distributed (i.i.d.) Machine Learning Prediction rule New

More information

Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12

Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Lecture 4: Multiarmed Bandit in the Adversarial Model Lecturer: Yishay Mansour Scribe: Shai Vardi 4.1 Lecture Overview

More information

Pure Exploration in Infinitely-Armed Bandit Models with Fixed-Confidence

Pure Exploration in Infinitely-Armed Bandit Models with Fixed-Confidence Pure Exploration in Infinitely-Armed Bandit Models with Fixed-Confidence Maryam Aziz, Jesse Anderton, Emilie Kaufmann, Javed Aslam To cite this version: Maryam Aziz, Jesse Anderton, Emilie Kaufmann, Javed

More information

Multi-armed Bandits in the Presence of Side Observations in Social Networks

Multi-armed Bandits in the Presence of Side Observations in Social Networks 52nd IEEE Conference on Decision and Control December 0-3, 203. Florence, Italy Multi-armed Bandits in the Presence of Side Observations in Social Networks Swapna Buccapatnam, Atilla Eryilmaz, and Ness

More information

arxiv: v2 [stat.ml] 19 Jul 2012

arxiv: v2 [stat.ml] 19 Jul 2012 Thompson Sampling: An Asymptotically Optimal Finite Time Analysis Emilie Kaufmann, Nathaniel Korda and Rémi Munos arxiv:105.417v [stat.ml] 19 Jul 01 Telecom Paristech UMR CNRS 5141 & INRIA Lille - Nord

More information

Adaptive Multiple-Arm Identification

Adaptive Multiple-Arm Identification Adaptive Multiple-Arm Identification Jiecao Chen Computer Science Department Indiana University at Bloomington jiecchen@umail.iu.edu Qin Zhang Computer Science Department Indiana University at Bloomington

More information

Extreme bandits. Abstract

Extreme bandits. Abstract Extreme bandits Alexandra Carpentier Statistical Laboratory, CMS University of Cambridge, UK a.carpentier@statslab.cam.ac.uk Michal Valko SequeL team INRIA Lille - Nord Europe, France michal.valko@inria.fr

More information

From Bandits to Experts: A Tale of Domination and Independence

From Bandits to Experts: A Tale of Domination and Independence From Bandits to Experts: A Tale of Domination and Independence Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Domination and Independence 1 / 1 From Bandits to Experts: A

More information

Beat the Mean Bandit

Beat the Mean Bandit Yisong Yue H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA, USA Thorsten Joachims Department of Computer Science, Cornell University, Ithaca, NY, USA yisongyue@cmu.edu tj@cs.cornell.edu

More information

Agnostic Online learnability

Agnostic Online learnability Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes

More information

Online learning with feedback graphs and switching costs

Online learning with feedback graphs and switching costs Online learning with feedback graphs and switching costs A Proof of Theorem Proof. Without loss of generality let the independent sequence set I(G :T ) formed of actions (or arms ) from to. Given the sequence

More information

Learning, Games, and Networks

Learning, Games, and Networks Learning, Games, and Networks Abhishek Sinha Laboratory for Information and Decision Systems MIT ML Talk Series @CNRG December 12, 2016 1 / 44 Outline 1 Prediction With Experts Advice 2 Application to

More information

Efficient learning by implicit exploration in bandit problems with side observations

Efficient learning by implicit exploration in bandit problems with side observations Efficient learning by implicit exploration in bandit problems with side observations omáš Kocák Gergely Neu Michal Valko Rémi Munos SequeL team, INRIA Lille Nord Europe, France {tomas.kocak,gergely.neu,michal.valko,remi.munos}@inria.fr

More information

Robustness of Anytime Bandit Policies

Robustness of Anytime Bandit Policies Robustness of Anytime Bandit Policies Antoine Salomon, Jean-Yves Audibert To cite this version: Antoine Salomon, Jean-Yves Audibert. Robustness of Anytime Bandit Policies. 011. HAL Id:

More information

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning Peter Auer Ronald Ortner University of Leoben, Franz-Josef-Strasse 18, 8700 Leoben, Austria auer,rortner}@unileoben.ac.at Abstract

More information

Ordinal Optimization and Multi Armed Bandit Techniques

Ordinal Optimization and Multi Armed Bandit Techniques Ordinal Optimization and Multi Armed Bandit Techniques Sandeep Juneja. with Peter Glynn September 10, 2014 The ordinal optimization problem Determining the best of d alternative designs for a system, on

More information

Dynamic resource allocation: Bandit problems and extensions

Dynamic resource allocation: Bandit problems and extensions Dynamic resource allocation: Bandit problems and extensions Aurélien Garivier Institut de Mathématiques de Toulouse MAD Seminar, Université Toulouse 1 October 3rd, 2014 The Bandit Model Roadmap 1 The Bandit

More information

arxiv: v3 [cs.lg] 30 Jun 2012

arxiv: v3 [cs.lg] 30 Jun 2012 arxiv:05874v3 [cslg] 30 Jun 0 Orly Avner Shie Mannor Department of Electrical Engineering, Technion Ohad Shamir Microsoft Research New England Abstract We consider a multi-armed bandit problem where the

More information

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for Big data 1 / 22

More information

Corrupt Bandits. Abstract

Corrupt Bandits. Abstract Corrupt Bandits Pratik Gajane Orange labs/inria SequeL Tanguy Urvoy Orange labs Emilie Kaufmann INRIA SequeL pratik.gajane@inria.fr tanguy.urvoy@orange.com emilie.kaufmann@inria.fr Editor: Abstract We

More information

Adaptive Learning with Unknown Information Flows

Adaptive Learning with Unknown Information Flows Adaptive Learning with Unknown Information Flows Yonatan Gur Stanford University Ahmadreza Momeni Stanford University June 8, 018 Abstract An agent facing sequential decisions that are characterized by

More information

Two generic principles in modern bandits: the optimistic principle and Thompson sampling

Two generic principles in modern bandits: the optimistic principle and Thompson sampling Two generic principles in modern bandits: the optimistic principle and Thompson sampling Rémi Munos INRIA Lille, France CSML Lunch Seminars, September 12, 2014 Outline Two principles: The optimistic principle

More information

The multi armed-bandit problem

The multi armed-bandit problem The multi armed-bandit problem (with covariates if we have time) Vianney Perchet & Philippe Rigollet LPMA Université Paris Diderot ORFE Princeton University Algorithms and Dynamics for Games and Optimization

More information

Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning

Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning JMLR: Workshop and Conference Proceedings vol:1 8, 2012 10th European Workshop on Reinforcement Learning Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning Michael

More information

Adaptive Strategy for Stratified Monte Carlo Sampling

Adaptive Strategy for Stratified Monte Carlo Sampling Journal of Machine Learning Research 16 2015 2231-2271 Submitted 3/13; Revised 12/14; Published 8/15 Adaptive Strategy for Stratified Monte Carlo Sampling Alexandra Carpentier Statistical Laboratory Center

More information

Explore no more: Improved high-probability regret bounds for non-stochastic bandits

Explore no more: Improved high-probability regret bounds for non-stochastic bandits Explore no more: Improved high-probability regret bounds for non-stochastic bandits Gergely Neu SequeL team INRIA Lille Nord Europe gergely.neu@gmail.com Abstract This work addresses the problem of regret

More information

Stochastic Online Greedy Learning with Semi-bandit Feedbacks

Stochastic Online Greedy Learning with Semi-bandit Feedbacks Stochastic Online Greedy Learning with Semi-bandit Feedbacks (Full Version Including Appendices) Tian Lin Tsinghua University Beijing, China lintian06@gmail.com Jian Li Tsinghua University Beijing, China

More information

Combinatorial Multi-Armed Bandit and Its Extension to Probabilistically Triggered Arms

Combinatorial Multi-Armed Bandit and Its Extension to Probabilistically Triggered Arms Journal of Machine Learning Research 17 2016) 1-33 Submitted 7/14; Revised 3/15; Published 4/16 Combinatorial Multi-Armed Bandit and Its Extension to Probabilistically Triggered Arms Wei Chen Microsoft

More information

Introduction to Reinforcement Learning Part 3: Exploration for decision making, Application to games, optimization, and planning

Introduction to Reinforcement Learning Part 3: Exploration for decision making, Application to games, optimization, and planning Introduction to Reinforcement Learning Part 3: Exploration for decision making, Application to games, optimization, and planning Rémi Munos SequeL project: Sequential Learning http://researchers.lille.inria.fr/

More information

Deviations of stochastic bandit regret

Deviations of stochastic bandit regret Deviations of stochastic bandit regret Antoine Salomon 1 and Jean-Yves Audibert 1,2 1 Imagine École des Ponts ParisTech Université Paris Est salomona@imagine.enpc.fr audibert@imagine.enpc.fr 2 Sierra,

More information

Qualitative Multi-Armed Bandits: A Quantile-Based Approach

Qualitative Multi-Armed Bandits: A Quantile-Based Approach Qualitative Multi-Armed Bandits: A Quantile-Based Approach Balazs Szorenyi, Róbert Busa-Fekete, Paul Weng, Eyke Hüllermeier To cite this version: Balazs Szorenyi, Róbert Busa-Fekete, Paul Weng, Eyke Hüllermeier.

More information

Regret lower bounds and extended Upper Confidence Bounds policies in stochastic multi-armed bandit problem

Regret lower bounds and extended Upper Confidence Bounds policies in stochastic multi-armed bandit problem Journal of Machine Learning Research 1 01) 1-48 Submitted 4/00; Published 10/00 Regret lower bounds and extended Upper Confidence Bounds policies in stochastic multi-armed bandit problem Antoine Salomon

More information

arxiv: v3 [cs.gt] 17 Jun 2015

arxiv: v3 [cs.gt] 17 Jun 2015 An Incentive Compatible Multi-Armed-Bandit Crowdsourcing Mechanism with Quality Assurance Shweta Jain, Sujit Gujar, Satyanath Bhat, Onno Zoeter, Y. Narahari arxiv:1406.7157v3 [cs.gt] 17 Jun 2015 June 18,

More information

Tsinghua Machine Learning Guest Lecture, June 9,

Tsinghua Machine Learning Guest Lecture, June 9, Tsinghua Machine Learning Guest Lecture, June 9, 2015 1 Lecture Outline Introduction: motivations and definitions for online learning Multi-armed bandit: canonical example of online learning Combinatorial

More information

Toward a Classification of Finite Partial-Monitoring Games

Toward a Classification of Finite Partial-Monitoring Games Toward a Classification of Finite Partial-Monitoring Games Gábor Bartók ( *student* ), Dávid Pál, and Csaba Szepesvári Department of Computing Science, University of Alberta, Canada {bartok,dpal,szepesva}@cs.ualberta.ca

More information

Improving Regret Bounds for Combinatorial Semi-Bandits with Probabilistically Triggered Arms and Its Applications

Improving Regret Bounds for Combinatorial Semi-Bandits with Probabilistically Triggered Arms and Its Applications Improving Regret Bounds for Combinatorial Semi-Bandits with Probabilistically Triggered Arms and Its Applications Qinshi Wang Princeton University Princeton, NJ 08544 qinshiw@princeton.edu Wei Chen Microsoft

More information

Hybrid Machine Learning Algorithms

Hybrid Machine Learning Algorithms Hybrid Machine Learning Algorithms Umar Syed Princeton University Includes joint work with: Rob Schapire (Princeton) Nina Mishra, Alex Slivkins (Microsoft) Common Approaches to Machine Learning!! Supervised

More information

A Drifting-Games Analysis for Online Learning and Applications to Boosting

A Drifting-Games Analysis for Online Learning and Applications to Boosting A Drifting-Games Analysis for Online Learning and Applications to Boosting Haipeng Luo Department of Computer Science Princeton University Princeton, NJ 08540 haipengl@cs.princeton.edu Robert E. Schapire

More information

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 22 Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 How to balance exploration and exploitation in reinforcement

More information