arxiv: v1 [stat.ml] 27 May 2016

An optimal algorithm for the Thresholding Bandit Problem

Andrea Locatelli, Maurilio Gutzeit, Alexandra Carpentier
Department of Mathematics, University of Potsdam, Germany

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

We study a specific combinatorial pure exploration stochastic bandit problem where the learner aims at finding the set of arms whose means are above a given threshold, up to a given precision, and for a fixed time horizon. We propose a parameter-free algorithm based on an original heuristic, and prove that it is optimal for this problem by deriving matching upper and lower bounds. To the best of our knowledge, this is the first non-trivial pure exploration setting with fixed budget for which optimal strategies are constructed.

1. Introduction

In this paper we study a specific combinatorial, pure exploration, stochastic bandit setting. More precisely, consider a stochastic bandit setting where each arm has mean $\mu_k$. The learner can sequentially draw $T > 0$ samples from the arms and aims at finding as efficiently as possible the set of arms whose means are larger than a threshold $\tau \in \mathbb{R}$. In this paper, we refer to this setting as the Thresholding Bandit Problem (TBP), which is a specific instance of the combinatorial pure exploration bandit setting introduced in (Chen et al., 2014). A simpler one-armed version of this problem is known as the SIGN-ξ problem, see (Chen & Li, 2015).

This problem is related to the popular combinatorial pure exploration bandit problem known as the TopM problem, where the aim of the learner is to return the set of M arms with highest mean (Bubeck et al., 2013b; Gabillon et al., 2012; Kaufmann et al., 2015; Zhou et al., 2014; Cao et al., 2015), which is a combinatorial version of the best arm identification problem (Even-Dar et al., 2002; Mannor & Tsitsiklis, 2004; Bubeck et al., 2009; Audibert & Bubeck, 2010; Gabillon et al., 2012; Jamieson et al., 2014; Karnin et al., 2013; Kaufmann et al., 2015; Chen & Li, 2015). To formulate this link with a simple metaphor, the TopM problem is a "contest" and the TBP problem is an "exam": in the former, the learner wants to select the M arms with highest mean; in the latter, the learner wants to select the arms whose means are higher than a certain threshold. We believe that this distinction is important and that in many applications the TBP problem is more relevant than the TopM problem, as in many domains one has a natural efficiency or correctness threshold above which one wants to use an option. For instance, in industrial applications one wants to keep a machine if its production's value is above its functioning costs, in crowd-sourcing one wants to hire a worker as long as their productivity is higher than their wage, etc. In addition to these applications derived from the TopM problem, the TBP problem has applications in dueling bandits and is a natural way to cast the problem of active and discrete level set detection, which is in turn related to the important applications of active classification and active anomaly detection. We detail this point in Subsection 3.1.

As mentioned previously, the TBP problem is a specific instance of the combinatorial pure exploration bandit framework introduced in (Chen et al., 2014). Without going into the details of the combinatorial pure exploration setting, for which the paper (Chen et al., 2014) derives interesting general results, we summarize what these results imply for the particular TBP and TopM problems, which are specific cases of the combinatorial pure exploration setting.
As is often the case for pure exploration problems, the paper (Chen et al., 2014) distinguishes between two settings:

The fixed budget setting, where the learner aims, given a fixed budget $T$, at returning the set of arms that are above the threshold (in the case of TBP) or the set of M best arms (in the case of TopM), with highest possible probability. In this setting, upper and lower bounds are on the probability of making an error when returning the set of arms.

The fixed confidence setting, where the learner aims, given a probability δ of acceptable error, at returning the set of arms that are above the threshold (in the case of TBP) or the set of M best arms (in the case of TopM) with as few pulls of the arms as possible. In this setting, upper and lower bounds are on the number of pulls $T$ that are necessary to return the correct set of arms with probability at least $1 - \delta$.

The similarities and differences of these two settings have been discussed in the literature in the case of the TopM problem (in particular in the case M = 1), see (Gabillon et al., 2012; Karnin et al., 2013; Chen et al., 2014). While, as explained in (Audibert & Bubeck, 2010; Gabillon et al., 2012), the two settings share similarities in the specific case when additional information about the problem is available to the learner (such as the complexity H defined in Table 1), they are very different in general and results do not transfer from one setting to the other, see (Bubeck et al., 2009; Audibert & Bubeck, 2010; Karnin et al., 2013; Kaufmann et al., 2015). In particular we highlight the following fact: while the fixed confidence setting is relatively well understood, in the sense that there are constructions for optimal strategies (Kalyanakrishnan et al., 2012; Jamieson et al., 2014; Karnin et al., 2013; Kaufmann et al., 2015; Chen & Li, 2015), there is an important knowledge gap in the fixed budget setting. In this case, without additional information on the problem such as the complexity H defined in Table 1, there is a gap between the known upper and lower bounds, see (Audibert & Bubeck, 2010; Gabillon et al., 2012; Karnin et al., 2013; Kaufmann et al., 2015). This knowledge gap is more acute for the general combinatorial exploration bandit problem defined in the paper (Chen et al., 2014) (see their Theorem 3), and therefore for the TBP problem (where in fact no fixed budget lower bound exists to the best of our knowledge).
We summarize in Table 1 the state of the art results for the TBP problem and for the TopM problem with M = 1.

| Problem   | Lower Bound         | Upper Bound                 |
| TBP (FC)  | $H \log(1/\delta)$  | $H \log(1/\delta)$          |
| TBP (FB)  | No results          | $K \exp(-T/(\log(K) H_2))$  |
| TopM (FC) | $H \log(1/\delta)$  | $H \log(1/\delta)$          |
| TopM (FB) | $\exp(-T/H)$        | $K \exp(-T/(\log(K) H_2))$  |

Table 1. State of the art results for the TBP problem and the TopM problem with M = 1, with fixed confidence for δ small enough (FC) and fixed budget (FB). For FC, the bounds are on the expected total number of samples needed for making an error of at most δ on the set of arms; for FB, they are on the probability of making a mistake on the returned set of arms. The quantities $H$, $H_2$ depend on the means $\mu_k$ of the arm distributions, are defined in (Chen et al., 2014), and are not the same for TopM and TBP. In the case of the TBP problem, set $\Delta_k = |\tau - \mu_k|$ and write $\Delta_{(k)}$ for the $\Delta_k$ ordered in increasing order; we have $H = \sum_i \Delta_i^{-2}$ and $H_2 = \max_i i\,\Delta_{(i)}^{-2}$. For the TopM problem with M = 1, the same definitions hold with $\Delta_k = \max_i \mu_i - \mu_k$.

The summary of Table 1 highlights that in the fixed budget setting, both for the TopM and the TBP problem, the correct complexity $H^*$ that should appear in the bound, i.e. the problem dependent quantity $H^*$ such that the upper and lower bounds on the probability of error are of order $\exp(-T/H^*)$, is still an open question. In the TopM problem, Table 1 implies that $H \le H^* \le \log(2K) H_2$. In the TBP problem, Table 1 implies $0 \le H^* \le \log(2K) H_2$, since to the best of our knowledge a lower bound for this problem exists only in the fixed confidence setting. Note that although this gap may appear small, in particular in the case of the TopM problem where it involves only a $\log(K)$ multiplicative factor, it is far from negligible: the $\log(K)$ factor sits inside the exponent, so it changes the rate at which the error probability decays exponentially in $T$. In this paper we close, up to constants, the gap in the fixed budget setting for the TBP problem: we prove that $H^* = H$.
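To make the two complexity measures of Table 1 concrete, the following Python sketch computes them for the TBP case. The function name `complexities` and the example means are illustrative assumptions, not part of the paper; $H_2$ is taken as the maximum over $i$ of $i\,\Delta_{(i)}^{-2}$, the same form as $H_{CSAR,2}$ in Subsection 2.4.

```python
import math


def complexities(means, tau):
    """Return (H, H2) for the thresholding problem with gaps |tau - mu_k|.

    Hypothetical helper: H = sum_i Delta_i^-2 and
    H2 = max_i i * Delta_(i)^-2 with gaps sorted in increasing order.
    """
    gaps = sorted(abs(tau - m) for m in means)  # Delta_(1) <= ... <= Delta_(K)
    if gaps[0] == 0:
        raise ValueError("an arm sits exactly on the threshold: H is infinite")
    H = sum(g ** -2 for g in gaps)
    H2 = max(i * g ** -2 for i, g in enumerate(gaps, start=1))
    return H, H2


# Illustrative instance (assumed values): three arms, threshold 0.5.
example_means = [0.1, 0.4, 0.9]
H, H2 = complexities(example_means, tau=0.5)
K = len(example_means)
print("H =", H, " H2 =", H2, " log(2K) * H2 =", math.log(2 * K) * H2)
```

On this instance a single arm close to the threshold dominates both quantities, so $H$ and $H_2$ are of the same order; they differ most when many arms have comparable gaps.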
In addition, we also prove that our strategy minimizes the cumulative regret and at the same time identifies the best arm optimally, provided that the highest mean of the arms is known to the learner. Our findings are summarized in Table 2. In order to do that, we introduce a new algorithm for the TBP problem which is entirely parameter free, and based on an original heuristic. In Section 2, we formally describe the TBP problem, the algorithm, and the results. In Section 3, we describe how our algorithm can be applied to the active detection of discrete level sets, and therefore to the problems of active classification and active anomaly detection. We also describe the implications of our results for the TopM problem. Finally, Section 4 presents simulations evaluating our algorithm against state of the art competitors. The proofs of all theorems are in Appendix A, as well as additional simulation results.

| Problem                                | Results                             |
| TBP (FB): UB                           | $\exp(-T/H + \log(\log(T)K))$       |
| TBP (FB): LB                           | $\exp(-T/H - \log(\log(T)K))$       |
| TopM (FB): UB (with $\mu^*$ known)     | $\exp(-T/H + \log(\log(T)K))$       |

Table 2. Our results for the TopM and the TBP problem in the fixed budget setting, i.e. upper and lower bounds on the probability of making a mistake on the set of arms returned by the learner.

2. The Thresholding Bandit Problem

2.1. Problem formulation

Learning setting. Let $K$ be the number of arms that the learner can choose from. Each of these arms is characterized by a distribution $\nu_k$ that we assume to be R-sub-Gaussian.

Definition (R-sub-Gaussian distribution). Let $R > 0$. A distribution $\nu$ is R-sub-Gaussian if for all $t \in \mathbb{R}$ we have $\mathbb{E}_{X \sim \nu}[\exp(tX - t\mathbb{E}[X])] \le \exp(R^2 t^2 / 2)$.

This encompasses various distributions such as bounded distributions or Gaussian distributions of variance $R^2$ for $R \in \mathbb{R}$. Such distributions have a finite mean; let $\mu_k = \mathbb{E}_{X \sim \nu_k}[X]$ be the mean of arm $k$. We consider the following dynamic game setting, which is common in the bandit literature. For any time $t \ge 1$, the learner chooses an arm $I_t$ from $A = \{1, \ldots, K\}$. It receives a noisy reward drawn from the distribution $\nu_{I_t}$ associated with the chosen arm. An adaptive learner bases its decision at time $t$ on the samples observed in the past.

Set notations. Let $u \in \mathbb{R}$ and $A$ be the finite set of arms. We define $S_u$ as the set of arms whose means are over $u$, that is $S_u := \{k \in A, \mu_k \ge u\}$. We also define $S_u^C$ as the complementary set of $S_u$ in $A$, i.e. $S_u^C = \{k \in A, \mu_k < u\}$.

Objective. Let $T > 0$ (not necessarily known to the learner beforehand) be the horizon of the game, let $\tau \in \mathbb{R}$ be the threshold and $\epsilon \ge 0$ be the precision. We define the $(\tau, \epsilon)$ thresholding problem as follows: after $T$ rounds of the game described above, the goal of the learner is to correctly identify the arms whose means are over or under the threshold $\tau$ up to a certain precision $\epsilon$, i.e. to correctly discriminate arms that belong to $S_{\tau+\epsilon}$ from those in $S_{\tau-\epsilon}^C$. In the rest of the paper, the sentence "the arm is over the threshold τ" is to be understood as "the arm's mean is over the threshold". After $T$ rounds of the previously defined game, the learner has to output a set $\hat{S}_\tau := \hat{S}_\tau(T) \subset A$ of arms and it suffers the following loss:

$\mathcal{L}(T) = \mathbb{I}\big(S_{\tau+\epsilon} \cap \hat{S}_\tau^C \neq \emptyset \ \lor\ S_{\tau-\epsilon}^C \cap \hat{S}_\tau \neq \emptyset\big)$.
A good learner minimizes this loss by correctly discriminating arms that are outside of a $2\epsilon$ band around the threshold: arms whose means are smaller than $(\tau - \epsilon)$ should not belong to the output set $\hat{S}_\tau$, and symmetrically those whose means are bigger than $(\tau + \epsilon)$ should not belong to $\hat{S}_\tau^C$. If it manages to do so, the algorithm suffers no loss, and otherwise it incurs a loss of 1. For arms that lie inside this $2\epsilon$ strip, on the other hand, mistakes bear no cost. If we set $\epsilon$ to 0 we recover the exact TBP thresholding problem described in the introduction, and the algorithm suffers no loss if it discriminates exactly arms that are over the threshold from those under.

Let $\mathbb{E}$ be the expectation according to the samples collected by an algorithm. Its expected loss is

$\mathbb{E}[\mathcal{L}(T)] = \mathbb{P}\big(S_{\tau+\epsilon} \cap \hat{S}_\tau^C \neq \emptyset \ \lor\ S_{\tau-\epsilon}^C \cap \hat{S}_\tau \neq \emptyset\big)$,

i.e. it is the probability of making a mistake, that is rejecting an arm over $(\tau + \epsilon)$ or accepting an arm under $(\tau - \epsilon)$. The lower this probability of error, the better the algorithm, as an oracle strategy would simply classify each arm correctly and suffer an expected loss of 0.

Our problem is a pure exploration bandit problem, and is in fact, after shifting the means by $\tau$, a specific case of the pure exploration bandit problem considered in (Chen et al., 2014), namely the specific case where the set of sets of arms that they call $\mathcal{M}$, and which is their decision class, is the set of all possible sets of arms. We will comment more on this later in Subsection 2.4.

Problem complexity. We define $\Delta_i^{\tau,\epsilon}$, the gap of arm $i$ with respect to $\tau$ and $\epsilon$, as

$\Delta_i := \Delta_i^{\tau,\epsilon} = |\mu_i - \tau| + \epsilon$. (1)

We also define the complexity $H$ of the problem as

$H := H_{\tau,\epsilon} = \sum_{i=1}^{K} (\Delta_i^{\tau,\epsilon})^{-2}$. (2)

We call $H$ the complexity as it is a characterization of the hardness of the problem.
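The loss and the complexity defined above translate directly into code. The following Python sketch is one possible reading of Equations (1) and (2) and of the 0/1 loss; the function names and the example values are illustrative assumptions, and arms are indexed from 0 rather than 1.

```python
def gap(mu, tau, eps):
    """Gap of Equation (1): |mu - tau| + eps."""
    return abs(mu - tau) + eps


def complexity(means, tau, eps):
    """Complexity of Equation (2): sum of inverse squared gaps.

    Raises ZeroDivisionError if eps == 0 and some mean equals tau.
    """
    return sum(gap(m, tau, eps) ** -2 for m in means)


def loss(means, accepted, tau, eps):
    """0/1 loss of an output set `accepted` (a set of arm indices).

    Returns 1 if an arm with mean >= tau + eps was rejected or an arm
    with mean < tau - eps was accepted, and 0 otherwise. Arms inside
    the 2*eps band never cause a loss.
    """
    accepted = set(accepted)
    for k, m in enumerate(means):
        if m >= tau + eps and k not in accepted:
            return 1
        if m < tau - eps and k in accepted:
            return 1
    return 0


# Assumed example: arm 1 lies inside the band [0.45, 0.55).
example_means = [0.2, 0.48, 0.8]
print(loss(example_means, {2}, tau=0.5, eps=0.05))     # arm 1 rejected: no cost
print(loss(example_means, {1, 2}, tau=0.5, eps=0.05))  # arm 1 accepted: no cost
print(loss(example_means, {0, 2}, tau=0.5, eps=0.05))  # arm 0 wrongly accepted
```

Either decision on the in-band arm is free, which is why the gap in Equation (1) is inflated by $\epsilon$: arms near the threshold become easier exactly by the amount of tolerated imprecision.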
A similar quantity was introduced for general combinatorial bandit problems (Chen et al., 2014) and is similar in essence to the complexity introduced for the best arm identification problem, see (Audibert & Bubeck, 2010).

2.2. A lower bound

In this section, we exhibit a lower bound for the thresholding problem. More precisely, for any sequence of gaps $(d_k)_k$, we define a finite set of problems where the distributions of the arms of these problems correspond to these gaps and are Gaussian of variance 1. We lower bound the largest probability of error among these problems, for the best possible algorithm.

Theorem 1. Let $K, T \ge 0$. Let for any $i \le K$, $d_i \ge 0$. Let $\tau \in \mathbb{R}$, $\epsilon > 0$. For $0 \le i \le K$, we write $\mathcal{B}^i$ for the problem where the distribution of arm $j \in \{1, \ldots, K\}$ is $\mathcal{N}(\tau + d_j + \epsilon, 1)$ if $i \neq j$ and $\mathcal{N}(\tau - d_j - \epsilon, 1)$ otherwise. For all these problems, $H := H_{\tau,\epsilon} = \sum_i (d_i + 2\epsilon)^{-2}$ is the same by definition. It holds that for any bandit algorithm

$\max_{i \in \{0, \ldots, K\}} \mathbb{E}_{\mathcal{B}^i}(\mathcal{L}(T)) \ge \exp\big(-3T/H - 4\log(12(\log(T) + 1)K)\big)$,

where $\mathbb{E}_{\mathcal{B}^i}$ is the expectation according to the samples of problem $\mathcal{B}^i$.

This lower bound implies that even if the learner is given the distance of the mean of each arm to the threshold and the shape of the distribution of each arm (here Gaussian of variance 1), any algorithm still makes an error of at least $\exp(-3T/H - 4\log(12(\log(T) + 1)K))$ on one of the problems. This is a lower bound in a very strong sense, because we restrict the set of possibilities to a setting where all gaps are known and prove that the lower bound nevertheless holds. It is also non-asymptotic and holds for any $T$, and therefore implies a non-asymptotic minimax lower bound. The closer the means of the distributions are to the threshold, the larger the complexity $H$, and the larger the lower bound. The proof is to be found in Appendix A.

This theorem's lower bound contains two terms in the exponential: a term that is linear in $T$ and a term that is of order $\log((\log(T) + 1)K) \approx \log(\log(T)) + \log(K)$. For large enough values of $T$, one has the following simpler corollary.

Corollary. Let $H > 0$ and $R > 0$, $\tau \in \mathbb{R}$ and $\epsilon \ge 0$. Consider $\mathbb{B}_{H,R}$, the set of K-armed bandit problems where the distributions of the arms are R-sub-Gaussian and which all have a complexity smaller than $H$. Assume that $T \ge 4HR^2 \log(12(\log(T) + 1)K)$. It holds that for any bandit algorithm

$\sup_{\mathcal{B} \in \mathbb{B}_{H,R}} \mathbb{E}_{\mathcal{B}}(\mathcal{L}(T)) \ge \exp\big(-4T/(R^2 H)\big)$,

where $\mathbb{E}_{\mathcal{B}}$ is the expectation according to the samples of problem $\mathcal{B} \in \mathbb{B}_{H,R}$.

2.3. Algorithm APT and associated upper bound

In this section we introduce APT (Anytime Parameter-free Thresholding algorithm), an anytime parameter-free learning algorithm. Its heuristic is based on a simple observation, namely that a near optimal static strategy that allocates $T_k$ samples to arm $k$ is such that $T_k \Delta_k^2$ is constant across $k$ (and increasing with $T$), see Theorem 1 and in particular the second half of Step 3 of its proof in Appendix A. A natural idea is therefore to simply pull at time $t$ the arm that minimizes an estimator of this quantity.
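The observation behind the heuristic can be made explicit with a short calculation. The sketch below assumes that $T_k$ may be treated as a real number (integer rounding is ignored) and uses the standard tail bound for the empirical mean of R-sub-Gaussian samples; its constants are not those of Theorems 1 and 2.

```latex
% Static allocation proportional to the inverse squared gaps
% (assumption: T_k is treated as a real number).
\[
T_k = \frac{T\,\Delta_k^{-2}}{H},
\qquad H = \sum_{i=1}^{K} \Delta_i^{-2}
\quad\Longrightarrow\quad
\sum_{k=1}^{K} T_k = T
\quad\text{and}\quad
T_k\,\Delta_k^{2} = \frac{T}{H}\ \text{ for every } k .
\]
% With T_k samples, the empirical mean of an R-sub-Gaussian arm satisfies
\[
\mathbb{P}\big(|\hat\mu_k - \mu_k| \ge \Delta_k\big)
\le 2\exp\!\Big(-\frac{T_k\,\Delta_k^{2}}{2R^{2}}\Big)
= 2\exp\!\Big(-\frac{T}{2R^{2}H}\Big).
\]
```

Under this allocation every arm has the same error exponent $T/H$, so no arm is a bottleneck. The allocation itself requires the gaps, which the learner does not know; pulling the arm with the smallest estimated $\sqrt{T_k}\,\hat\Delta_k$ is a way of tracking it adaptively.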
Note that in this paper, we consider for the sake of simplicity that each arm is tested against the same threshold; this can however be relaxed to arm-dependent thresholds $(\tau_k)_k$ at no additional cost.

Algorithm. The algorithm receives as input the definition of the problem $(\tau, \epsilon)$. First, it pulls each arm of the game once. At time $t > K$, APT updates $T_i(t)$, the number of pulls up to time $t$ of arm $i$, and the empirical mean $\hat\mu_i(t)$ of arm $i$ after $T_i(t)$ pulls. Formally, for each $i \in A$ it computes $T_i(t) = \sum_{s=1}^{t} \mathbb{I}(I_s = i)$ and the updated means

$\hat\mu_i(t) = \frac{1}{T_i(t)} \sum_{s=1}^{T_i(t)} X_{i,s}$, (3)

where $X_{i,s}$ denotes the sample received when pulling $i$ for the $s$-th time. The algorithm then computes

$\hat\Delta_i(t) := \hat\Delta_i^{\tau,\epsilon}(t) = |\hat\mu_i(t) - \tau| + \epsilon$, (4)

the current empirical estimate of the gap associated with arm $i$. The algorithm then computes

$B_i(t+1) = \sqrt{T_i(t)}\,\hat\Delta_i(t)$, (5)

and pulls the arm $I_{t+1} = \arg\min_{i \le K} B_i(t+1)$ that minimizes this quantity. At the end of the horizon $T$, the algorithm outputs the set of arms $\hat{S}_\tau = \{k : \hat\mu_k(T) \ge \tau\}$.

Algorithm 1 APT algorithm
  Input: $\tau$, $\epsilon$
  Pull each arm once
  for $t = K + 1$ to $T$ do
    Pull arm $I_t = \arg\min_{k \le K} B_k(t)$ from Equation (5)
    Observe reward $X \sim \nu_{I_t}$
  end for
  Output: $\hat{S}_\tau = \{k : \hat\mu_k(T) \ge \tau\}$

The expected loss of this algorithm can be bounded as follows.

Theorem 2. Let $K \ge 0$, $T \ge 2K$, and consider a problem $\mathcal{B}$. Assume that all arms $\nu_k$ of the problem are R-sub-Gaussian with means $\mu_k$. Let $\tau \in \mathbb{R}$, $\epsilon \ge 0$. Algorithm APT's expected loss is upper bounded on this problem as

$\mathbb{E}(\mathcal{L}(T)) \le \exp\Big(-\frac{1}{64R^2}\frac{T}{H} + 2\log((\log(T) + 1)K)\Big)$,

where we recall that $H = \sum_i (|\mu_i - \tau| + \epsilon)^{-2}$ and where $\mathbb{E}$ is the expectation according to the samples of the problem.

The bound of Theorem 2 holds for any R-sub-Gaussian bandit problem. Note that one does not need to know $R$ in order to implement the algorithm; e.g. if the distributions are bounded, one does not need to know the bound. This is a desirable feature for an algorithm, yet for instance all algorithms based on upper confidence bounds need a bound on $R$.
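The procedure of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the callback `pull`, the function name `apt`, the tie-breaking rule (lowest index) and the Bernoulli example are assumptions made for the sketch, and arms are indexed from 0.

```python
import math
import random


def apt(pull, K, T, tau, eps):
    """Run the index rule B_k = sqrt(T_k) * (|mu_hat_k - tau| + eps).

    `pull(k)` is a hypothetical callback returning one reward of arm k.
    Returns (accepted_set, pull_counts, empirical_means).
    """
    if T < K:
        raise ValueError("the horizon must allow one pull per arm")
    counts = [0] * K
    sums = [0.0] * K
    for k in range(K):  # initialisation: pull each arm once
        sums[k] += pull(k)
        counts[k] = 1
    for _ in range(K, T):
        # Equations (4) and (5): empirical gap scaled by sqrt of pull count.
        index = [
            math.sqrt(counts[k]) * (abs(sums[k] / counts[k] - tau) + eps)
            for k in range(K)
        ]
        k = min(range(K), key=index.__getitem__)  # ties go to the lowest index
        sums[k] += pull(k)
        counts[k] += 1
    means = [sums[k] / counts[k] for k in range(K)]
    accepted = {k for k in range(K) if means[k] >= tau}
    return accepted, counts, means


# Assumed example: Bernoulli arms with a fixed seed for reproducibility.
rng = random.Random(0)
mus = [0.1, 0.35, 0.45, 0.6, 0.9]
accepted, counts, means = apt(
    lambda k: 1.0 if rng.random() < mus[k] else 0.0,
    K=len(mus), T=2000, tau=0.5, eps=0.0,
)
print("accepted arms:", sorted(accepted))
print("pull counts:  ", counts)
```

Note that neither the horizon nor $R$ nor $H$ enters the index: `T` is only used to decide when to stop, which is what makes the rule anytime. Arms whose empirical mean is close to the threshold keep a small index and therefore receive most of the budget.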
This bound is non-asymptotic (one just needs $T \ge 2K$ so that one can initialize the algorithm), and therefore Theorem 2 provides a minimax upper bound result over the class of problems that have sub-Gaussian constant $R$ and complexity $H$. The term in the exponential of the upper bound of Theorem 2 matches the lower bound of Theorem 1 up to a multiplicative factor and the $\log((\log(T) + 1)K)$ term. Now, as in the case of the lower bound, for large enough values of $T$ one has the following simpler corollary.

Corollary. Let $H > 0$ and $R > 0$, $\tau \in \mathbb{R}$ and $\epsilon \ge 0$. Consider $\mathbb{B}_{H,R}$, the set of K-armed bandit problems where the distributions of the arms are R-sub-Gaussian and whose complexity is smaller than $H$. Assume that $T \ge 256HR^2 \log((\log(T) + 1)K)$. For Algorithm APT it holds that

$\sup_{\mathcal{B} \in \mathbb{B}_{H,R}} \mathbb{E}_{\mathcal{B}}(\mathcal{L}(T)) \le \exp\big(-T/(128R^2 H)\big)$,

where $\mathbb{E}_{\mathcal{B}}$ is the expectation according to the samples of problem $\mathcal{B} \in \mathbb{B}_{H,R}$.

This corollary and Corollary 2.2 imply that for $T$ large enough, i.e. of larger order than $HR^2 \log((\log(T) + 1)K)$, Algorithm APT is order optimal over the class of problems whose complexity is bounded by $H$ and whose arms are R-sub-Gaussian.

2.4. Discussion

A parameter free algorithm. An important point that we want to highlight for our strategy APT is that it does not need any parameter, such as the complexity $H$, the horizon $T$ or the sub-Gaussian constant $R$. This contrasts with any upper confidence based approach as in (Audibert & Bubeck, 2010; Gabillon et al., 2012) (e.g. the UCB-E algorithm in (Audibert & Bubeck, 2010)), which needs as parameters an upper bound on $R$ and the exact knowledge of $H$, while the bound of Theorem 2 holds for any $R$ and any $H$, and our algorithm adapts to these quantities. We would also like to highlight that for the related problem of best arm identification, existing fixed budget strategies need to know the budget $T$ in advance (Audibert & Bubeck, 2010; Karnin et al., 2013; Chen et al., 2014), while our algorithm can be stopped at any time and the bound of Theorem 2 will hold.

Extensions to distributions that are not sub-Gaussian (as opposed to adaptation to sub-models). It is easy to see in the light of (Bubeck et al., 2013a) that one could extend our algorithm to non sub-Gaussian distributions by using an estimator other than the empirical mean, such as the estimators in (Catoni et al., 2012) or in (Alon et al., 1996).
These estimators have sub-Gaussian concentration asymptotically under the sole assumption that the distributions have a finite $(1 + v)$ moment with $v > 0$ (and the sub-Gaussian concentration will depend on $v$). Using our algorithm with such an estimator will therefore provide a result that is similar to that of Theorem 2, and this without requiring the knowledge of $v$: our algorithm APT, modified to use these robust estimators instead of the empirical mean, will work for any bandit problem where the arm distributions have a finite $(1 + v)$ moment with $v > 0$. On the other hand, if we consider more specific models, e.g. exponential ones, it is possible to obtain a refined lower bound in terms of Kullback-Leibler divergences rather than gaps, following (Kaufmann et al., 2015). However, an upper bound of the same order clearly comes at the cost of a more complicated strategy and holds in less generality than our bound.

Optimality of our strategy. As explained previously, the upper bound on the expected risk of algorithm APT is comparable to the lower bound on the expected risk up to a $\log((\log(T) + 1)K)$ term (see Theorems 1 and 2), and this term becomes negligible when the horizon $T$ is large enough, namely when $T$ is of larger order than $HR^2 \log((\log(T) + 1)K)$, which is the case for most problems. So for $T$ large enough, our strategy is order optimal over the class of problems that have complexity smaller than $H$ and sub-Gaussian constant smaller than $R$.

Comparison with existing results. Our setting is a specific combinatorial pure exploration setting with fixed budget where the objective is to find the set of arms that are above a given threshold. Settings related to ours have been analyzed in the literature, and the state of the art result on our problem can be found (to the best of our knowledge) in the paper (Chen et al., 2014). In that paper, the authors consider a general pure exploration combinatorial problem.
Given a set $\mathcal{M}$ of subsets of $\{1, \ldots, K\}$, they aim at finding a subset of arms $M^* \in \mathcal{M}$ such that $M^* = \arg\max_{M \in \mathcal{M}} \sum_{k \in M} \mu_k$. In the specific case where $\mathcal{M}$ is the set of all subsets of $\{1, \ldots, K\}$, their problem in the fixed budget setting is exactly the same as ours when $\epsilon = 0$ and the means are shifted by $-\tau$. Their algorithm CSAR's upper bound on the loss is (see their Theorem 3):

$\mathbb{E}(\mathcal{L}(T)) \le K^2 \exp\Big(-\frac{T - K}{72R^2 \log(K) H_{CSAR,2}}\Big)$,

where $H_{CSAR,2} = \max_i i\,\Delta_{(i)}^{-2}$. As $H_{CSAR,2} \log(K) \ge H$ by definition, there is a gap for their strategy in the fixed budget setting with respect to the lower bound of Theorem 1, which is smaller and of order $\exp(-T/(HR^2))$. Our strategy, on the contrary, does not have this gap, and improves over the CSAR strategy. We believe that this lack of optimality for CSAR is not an artefact of the proof in (Chen et al., 2014), and that CSAR is sub-optimal, as it is a successive reject algorithm with fixed and non-adaptive reject phase length. A similar gap between upper and lower bounds for successive reject based algorithms in the fixed budget setting was also observed for the best arm identification problem when no additional information such as the complexity is known to the learner, see (Audibert & Bubeck, 2010; Karnin et al., 2013; Kaufmann et al., 2015; Chen et al., 2014). It is therefore an interesting fact that there is a parameter free optimal algorithm for our fixed budget problem.
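To get a feeling for how the two guarantees compare, one can evaluate the logarithm of each bound numerically. The sketch below only compares the stated upper bounds (Theorem 2 for APT and Theorem 3 of Chen et al., 2014 for CSAR) and says nothing about the actual error of either algorithm; the function names and the example instance are assumptions.

```python
import math


def log_apt_bound(T, K, H, R):
    """Log of the Theorem 2 bound: -T/(64 R^2 H) + 2 log((log T + 1) K)."""
    return -T / (64 * R ** 2 * H) + 2 * math.log((math.log(T) + 1) * K)


def log_csar_bound(T, K, H2, R):
    """Log of the CSAR bound: 2 log K - (T - K)/(72 R^2 log(K) H2)."""
    return 2 * math.log(K) - (T - K) / (72 * R ** 2 * math.log(K) * H2)


# Assumed instance: K = 10 arms, all gaps equal to 0.1, so that
# H = K / 0.1^2 = 1000 and H2 = max_i i / 0.1^2 = 1000; R = 1/2.
K, R = 10, 0.5
H = H2 = 1000.0
for T in (10 ** 4, 10 ** 5, 10 ** 6):
    print(T, log_apt_bound(T, K, H, R), log_csar_bound(T, K, H2, R))
```

A positive value means that the corresponding bound is vacuous at that horizon (it exceeds 1). On this equal-gap instance the $\log(K)$ factor in the CSAR exponent makes its bound decay more slowly once $T$ is large, which is the gap discussed above.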

The paper (Chen et al., 2014) also provides results in the fixed confidence setting, where the objective is to provide an ε-optimal set using the smallest possible sample size. In these results such a gap in optimality does not appear and the algorithm CLUCB they propose is almost optimal; see also (Kalyanakrishnan et al., 2012; Jamieson et al., 2014; Karnin et al., 2013; Kaufmann et al., 2015; Chen & Li, 2015) for related results in the fixed confidence setting. This highlights that the fixed budget setting and the fixed confidence setting are fundamentally different (at least in the absence of additional information such as the complexity $H$), and that providing optimal strategies in the fixed budget setting is a more difficult problem than providing an adaptive strategy in the fixed confidence setting: adaptive algorithms that are nearly optimal in the absence of additional information have only been exhibited in the latter case. To the best of our knowledge, all strategies except ours have such an optimality gap for fixed budget pure exploration combinatorial bandit problems, while there exist fixed confidence strategies for general pure exploration combinatorial bandits that are very close to optimal, see (Chen et al., 2014).

Now, in the case where the learner has additional information on the problem, such as the complexity $H$, it has been proved for the TopM problem that a UCB-type strategy has probability of error upper bounded as $\exp(-T/H)$, see (Audibert & Bubeck, 2010; Gabillon et al., 2012). A similar UCB type of algorithm would also work in the TBP problem, implying the same upper bound results as APT. But we would like to highlight that the exact knowledge of $H$ is needed by these algorithms for reaching this bound, which is unlikely in applications. Our strategy, on the other hand, reaches, up to constants, the optimal expected loss for the TBP problem without needing any parameter.

3. Extensions of our results to related settings

In this section we detail some implications of the results of the previous section for some specific problems.

3.1. Active level set detection: active classification and active anomaly detection

Here we explain how a simple modification of our setting transforms it into the setting of active level set detection, and therefore why it can be applied to active classification and active anomaly detection. We define the problem of discrete, active level set detection as the problem of deciding as efficiently as possible, in our bandit setting, whether for any $k$ the probability that the samples of arm $\nu_k$ are above a given level $L$ is higher or smaller than a threshold $\tau$ up to a precision $\epsilon$, i.e. it is the problem of deciding for all $k$ whether $\tilde\mu_k(L) := \mathbb{P}_{X \sim \nu_k}(X > L) \ge \tau$ or not, up to a precision $\epsilon$. This problem can be immediately solved by our approach with a simple change of variable. Namely, for the sample $X_t \sim \nu_{I_t}$ collected by the algorithm at time $t$, consider the transformation $\tilde{X}_t = \mathbb{1}\{X_t > L\}$. Then $\tilde{X}_t$ is a Bernoulli random variable of parameter $\tilde\mu_{I_t}(L)$ (which is a 1/2-sub-Gaussian distribution), and applying our algorithm to the transformed samples $\tilde{X}_t$ solves the active level set detection problem. This has two interesting applications, namely in active binary classification and in active anomaly detection.

Active binary classification. In active binary classification, the learner aims at deciding, for $K$ points (the arms of the bandit), whether each point belongs to the class 1 or the class 0. At each round $t$, the learner can request help from a homogeneous mass of experts (which can be a set of previously trained classifiers, where one wants to minimize the computational cost, or crowd-sourcing, where one wants to minimize the costs of the task), and obtain a noisy label for the chosen data point $I_t$.
We assume that for any point $k$, the experts' responses are independent and stochastic random variables in $\{0, 1\}$ of mean $\tilde\mu_k$ (i.e. the arm distributions are Bernoulli random variables of parameter $\tilde\mu_k$). We assume that the experts are right on average and that the label $l_k$ of $k$ is equal to $l_k := \mathbb{1}\{\tilde\mu_k > 1/2\}$. The active classification task therefore amounts to deciding whether $\tilde\mu_k > \tau := 1/2$ or not, possibly up to a given precision $\epsilon$. Our strategy therefore directly applies to this problem by choosing $\tau = 1/2$.

Active anomaly detection. In anomaly detection, a common way to characterize anomalies is to describe them as naturally not concentrated (Steinwart et al., 2005). A natural way to characterize anomalies is thus to define a cutoff level $L$, and classify the samples that are e.g. above this level $L$ as anomalous. Such an approach has already received attention for anomaly detection, e.g. in (Streeter & Smith), albeit in a cumulative regret setting. Here we consider an active anomaly detection setting where we face $K$ sources of data (the arms), and we aim at sampling them actively to detect which sources emit anomalous samples with a probability higher than a given threshold $\tau$; this threshold is chosen e.g. as the maximal tolerable amount of anomalous behavior of a source. This illustrates the fact that, as described in (Steinwart et al., 2005), the problem of anomaly detection is indeed a problem of level set detection, and so the problem of active anomaly detection is a problem of active level set detection on which we can use our approach as explained above.
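The change of variable used throughout this subsection is a one-line transformation of each sample. The Python sketch below shows it for a hypothetical Gaussian data source; the helper names, the level and the source parameters are assumptions chosen for illustration.

```python
import random


def to_level_indicator(sample, level):
    """Map a raw sample X to the Bernoulli variable 1{X > level}."""
    return 1 if sample > level else 0


def make_indicator_arm(draw, level):
    """Wrap a sampler `draw()` into a sampler of 1{X > level}.

    The wrapped arm is Bernoulli with parameter P(X > level), so a
    thresholding algorithm run on it compares that probability to tau.
    """
    return lambda: to_level_indicator(draw(), level)


# Assumed example: a standard Gaussian source and the level L = 1.
rng = random.Random(1)
arm = make_indicator_arm(lambda: rng.gauss(0.0, 1.0), level=1.0)
n = 10000
print("empirical P(X > 1):", sum(arm() for _ in range(n)) / n)
```

Because the transformed samples take values in {0, 1}, they are 1/2-sub-Gaussian whatever the distribution of the raw samples, so no tail assumption on the original source is needed.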

3.2. Best arm identification and cumulative reward maximization with known highest mean value

Two classical bandit problems are the best arm identification problem and the cumulative reward maximization problem. In the former, the goal of the learner is to identify the arm with the highest mean (Bubeck et al., 2009). In the latter, the goal is to maximize the sum of the samples collected by the algorithm up to time $T$ (Auer et al., 1995). Intuitively, the two problems should call for different strategies: in the best arm identification problem one wants to explore all arms heavily, while in the cumulative reward maximization problem one wants to sample as much as possible the arm with the highest mean. This intuition is backed up by Theorem 1 of (Bubeck et al., 2009), which states that in the absence of additional information and with a fixed budget, the lower the regret suffered in the cumulative setting, expressed in terms of rewards, the higher the regret suffered in the identification problem, expressed in terms of probability of error. We prove in this section the somewhat non-intuitive fact that if one knows the value of the best arm's mean, it is possible to perform both tasks simultaneously by running our algorithm with $\epsilon = 0$ and $\tau = \mu^* := \max_k \mu_k$. Our algorithm then reduces to the GCL algorithm that can be found in (Salomon & Audibert, 2011).

Best arm identification. In the best arm identification problem, the game setting is the same as the one we considered but the goal of the learner is different: it aims at returning an arm $J_T$ with the highest possible mean. The following theorem holds for our strategy APT when it runs for $T$ rounds and then returns the arm $J_T$ that was pulled most often.

Theorem 3. Let $K > 0$, $R > 0$ and $T \ge 2K$, and consider a problem where the distribution of the arms $\nu_k$ is R-sub-Gaussian and has mean $\mu_k$. Let $\mu^* := \max_k \mu_k$ and $H_{\mu^*} = \sum_{i : \mu_i \neq \mu^*} (\mu^* - \mu_i)^{-2}$.
Then APT run with parameters $\tau = \mu^*$ and $\epsilon = 0$, recommending the arm $J_T = \arg\max_{k \in A} T_k(T)$, is such that

$\mathbb{P}(\mu_{J_T} \neq \mu^*) \le \exp\Big(-\frac{T}{36R^2 H_{\mu^*}} + 2\log((\log(T) + 1)K)\Big)$.

If the complexity $H$ is also known to the learner, algorithm UCB-E from (Audibert & Bubeck, 2010) would attain a similar performance.

Remark. This implies that if $\mu^*$ is known to the learner, there exists an algorithm whose probability of error is of order $\exp(-cT/H)$. The recent paper (Carpentier & Locatelli, 2016) implies that the knowledge of $\mu^*$ is actually key here, since without this information the simple regret is at least of order $\exp(-cT/(\log(K)H))$ in a minimax sense.

Cumulative reward maximization. In the cumulative reward maximization problem, the game setting is the same as the one we considered but the aim of the learner is different: if we write $X_t$ for the sample collected at time $t$ by the algorithm, it aims at maximizing $\sum_{t \le T} X_t$. The following theorem holds for our strategy APT when it runs for $T$ rounds.

Theorem 4. Let $K > 0$, $R > 0$ and $T \ge 2K$, and consider a problem where the distribution of the arms $\nu_k$ is R-sub-Gaussian. Then APT run with parameters $\tau = \mu^*$ and $\epsilon = 0$ is such that

$T\mu^* - \mathbb{E}\Big[\sum_{t \le T} X_t\Big] \le \inf_{\delta \ge 1} \sum_{i : \mu_i \neq \mu^*} \Big[\frac{4R^2 \log(T)\delta}{\mu^* - \mu_i} + (\mu^* - \mu_i)\Big(1 + \frac{K}{T^{2\delta - 2}}\Big)\Big]$.

This bound implies both the problem dependent upper bound of order $\sum_i \Delta_i^{-1} \log(T)$ and the problem independent upper bound of order $\sqrt{TK \log(T)}$, and this matches the performance of algorithms like UCB for any tuning parameter. A similar result can also be found in (Salomon & Audibert, 2011).

Discussion. Theorems 3 and 4, whose proofs are provided in Appendix A, imply that our algorithm APT is a good strategy for solving both problems at the same time when $\mu^*$ is known. As mentioned previously, this is counter-intuitive, since one would expect a good strategy for the best arm identification problem to explore significantly more than a good strategy for the cumulative reward maximization problem.
To convince oneself, it is sufficient to look at the two-armed case, for which in the fixed budget setting it is optimal to sample both arms equally, while this strategy has linear regret in the cumulative setting. This intuition is formalized in (Bubeck et al., 2009), where the authors prove that no algorithm can achieve this without additional information. Our results therefore imply that the knowledge of $\mu^*$ by the learner is sufficient information for Theorem 1 of (Bubeck et al., 2009) to no longer hold, and that there exist algorithms that solve both problems at the same time, as APT does.

TopM problem. An extension of the best arm identification problem is known as the TopM arms identification problem, where one is concerned with identifying the set of the $M$ arms with the highest means (Bubeck et al., 2013b; Gabillon et al., 2012; Kaufmann et al., 2015; Zhou et al., 2014; Chen et al., 2014; Cao et al., 2015). If the learner has some additional information, such as the mean values of the arms with the $M$th and $(M+1)$th highest means, then one can directly apply our algorithm APT, setting $\tau$ halfway between the $M$th and $(M+1)$th highest means. The set $\hat{S}_\tau$ would then be returned as the estimated set of $M$ optimal arms. The upper bound and proof for this problem are a direct consequence of Theorem 2 and, granted one has such extra information, the bound outperforms existing results for the fixed budget setting, see (Bubeck et al., 2013b; Kaufmann et al., 2015; Chen et al., 2014; Cao et al., 2015). If the complexity $H$ were also known to the learner, the strategy in (Gabillon et al., 2012) would attain a similar performance.

[Figure 1: three panels, "Exp. 1 (3 groups), Bernoulli arms", "Exp. 2 (arithmetic progression), Bernoulli arms" and "Exp. 3 (geometric progression), Bernoulli arms"; vertical axis: log(1 - est. probability of success); horizontal axis: horizon; curves: APT, UA, UCBE($4^4$), UCBE(1), UCBE(0.25), CSAR.]

Figure 1. Results of Experiments 1-3 with Bernoulli distributions. The average error of the specified methods is displayed on a logarithmic scale with respect to the horizon.

4. Experiments

We illustrate the performance of algorithm APT in a number of experiments. For comparison, we use the following methods, which include the state of the art CSAR algorithm of (Chen et al., 2014) and two minor adaptations of known methods that are also suitable for our problem.

Uniform Allocation (UA): For each $t \in \{1, 2, \ldots, T\}$, we choose $I_t \sim \mathcal{U}_A$. This method is known to be optimal if all arms are equally difficult to classify, that is, in our setting, if the quantities $\Delta_i^{\tau,\epsilon}$, $i \in A$, are very close.

UCB-type algorithm: The algorithm UCBE given and analyzed in (Audibert & Bubeck, 2010) is designed for finding the best arm, and its heuristic is to pull the arm that maximizes a UCB bound; see also (Gabillon et al., 2012) for an adaptation of this algorithm to the general TopM problem. The natural adaptation of the method for our problem corresponds to pulling the arm that minimizes $\hat{\Delta}_k(t) - \sqrt{a / T_k(t)}$. From the theoretical analysis in the papers (Audibert & Bubeck, 2010; Gabillon et al., 2012), it is not hard to see that setting $a \approx (T - K)/H$ minimizes their upper bound, and that this algorithm attains the same expected loss as ours, but it requires the knowledge of $H$. In the experiments we choose values $a_i = 4^i \frac{T - K}{H}$, $i \in \{-1, 0, 4\}$, and denote the respective results as UCBE($4^i$). The value $a_0$ can be seen as the optimal choice, while the two other choices give rise to strategies that are sub-optimal because they respectively explore too little or too much.

CSAR: As mentioned before, this method is given in (Chen et al., 2014). In our specific setting, via the shift $\mu'_i = \mu_i - \tau$, lines 7-17 of the algorithm reduce to classifying the arm $i$ that maximizes $|\hat{\mu}'_i|$ based on its current mean. The set $A_t$ corresponds to $\hat{S}_\tau$ at time $t$. In fact, in our specific setting the CSAR algorithm is a successive reject-type strategy (see Audibert & Bubeck, 2010) where the arm whose empirical mean is furthest from the threshold is rejected at the end of each phase.

Figure 1 displays the estimated probability of success on a logarithmic scale with respect to the horizon of the six algorithms, based on $N = 5000$ simulated games with $\tau = \frac{1}{2}$, $\epsilon = 0.1$, $K = 10$, and $T = 500$.

Experiment 1 (3 groups setting): $K$ Bernoulli arms with means $\mu_{1:3} \equiv 0.1$, $\mu_{4:7} = (0.35, 0.45, 0.55, 0.65)$ and $\mu_{8:10} \equiv 0.9$, which amounts to 2 difficult relevant arms (that is, outside the $2\epsilon$-band), 2 difficult irrelevant arms and six easy relevant arms.

Experiment 2 (arithmetic progression): $K$ Bernoulli arms with means $\mu_{1:4} = 0.2 + (0:3) \cdot 0.05$, $\mu_5 = 0.45$, $\mu_6 = 0.55$ and $\mu_{7:10} = 0.65 + (0:3) \cdot 0.05$, which amounts to 2 difficult irrelevant arms and eight arms arithmetically progressing away from $\tau$.

Experiment 3 (geometric progression): $K$ Bernoulli arms with means $\mu_{1:4} = 0.4 - 0.2^{(1:4)}$, $\mu_5 = 0.45$, $\mu_6 = 0.55$ and $\mu_{7:10} = 0.6 + 0.2^{5 - (1:4)}$, which amounts to 2 difficult irrelevant arms and eight arms geometrically progressing away from $\tau$.
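To make the quantities behind this setup concrete, the following sketch computes the gaps $\Delta_i = |\mu_i - \tau| + \epsilon$, the complexity $H = \sum_i \Delta_i^{-2}$, the UCBE parameters $a_i = 4^i (T - K)/H$ and the 0/1 loss for Experiment 1. The helper names (`gaps`, `complexity`, `ucbe_parameters`, `loss`) are hypothetical and chosen for illustration only; the loss follows the convention used in the proof of Theorem 2, where arms with mean at least $\tau + \epsilon$ must be accepted and arms with mean below $\tau - \epsilon$ must be rejected.

```python
def gaps(means, tau, eps):
    """Delta_i = |mu_i - tau| + eps for every arm."""
    return [abs(m - tau) + eps for m in means]


def complexity(means, tau, eps):
    """H = sum_i Delta_i^{-2}."""
    return sum(d ** -2 for d in gaps(means, tau, eps))


def ucbe_parameters(means, tau, eps, horizon, exponents=(-1, 0, 4)):
    """a_i = 4^i (T - K) / H for the exponents i used in the experiments."""
    h = complexity(means, tau, eps)
    k = len(means)
    return {i: 4.0 ** i * (horizon - k) / h for i in exponents}


def loss(means, accepted, tau, eps):
    """0/1 loss of a returned set `accepted` (a set of arm indices).

    An error occurs if an arm with mean >= tau + eps is missing or an arm
    with mean < tau - eps is accepted; arms inside the band never count.
    """
    for i, m in enumerate(means):
        if m >= tau + eps and i not in accepted:
            return 1
        if m < tau - eps and i in accepted:
            return 1
    return 0


# Experiment 1 (3 groups) with tau = 1/2, eps = 0.1 and T = 500.
means_exp1 = [0.1] * 3 + [0.35, 0.45, 0.55, 0.65] + [0.9] * 3
H = complexity(means_exp1, tau=0.5, eps=0.1)
a = ucbe_parameters(means_exp1, tau=0.5, eps=0.1, horizon=500)
```

Under these assumptions, Experiment 1 has $H \approx 144.9$, so the parameter regarded as optimal is $a_0 = 490/H \approx 3.38$, while UCBE(0.25) and UCBE($4^4$) use one quarter and 256 times that value respectively.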
The experimental results confirm that our algorithm may only be outperformed by methods that have an advantage in the sense that they have access to the underlying problem complexity and, in the case of UCBE(1), an additional optimal parameter choice. In particular, other choices for that parameter lead to significantly less accurate results, comparable to the naive strategy UA. These effects are also visible in the further results given in Appendix B.

5. Conclusion

In this paper we proposed a parameter-free algorithm based on a new heuristic for the TBP problem in the fixed budget setting, and we proved that it is optimal, a kind of result that is highly non-trivial for combinatorial pure exploration problems with fixed budget.

Acknowledgement

This work is supported by the DFG's Emmy Noether grant MuSyAD (CA 1488/1-1).

References

Alon, Noga, Matias, Yossi, and Szegedy, Mario. The space complexity of approximating the frequency moments. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing. ACM.

Audibert, Jean-Yves and Bubeck, Sébastien. Best arm identification in multi-armed bandits. In Proceedings of the 23rd Conference on Learning Theory, 2010.

Auer, Peter, Cesa-Bianchi, Nicolò, Freund, Yoav, and Schapire, Robert. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, 1995.

Bubeck, Sébastien, Cesa-Bianchi, Nicolò, and Lugosi, Gábor. Bandits with heavy tail. Information Theory, IEEE Transactions on, 59(11), 2013a.

Bubeck, Sébastien, Munos, Rémi, and Stoltz, Gilles. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory. Springer, 2009.

Bubeck, Sébastien, Wang, Tengyao, and Viswanathan, Nitin. Multiple identifications in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013b.

Cao, Wei, Li, Jian, Tao, Yufei, and Li, Zhize. On top-k selection in multi-armed bandits and hidden bipartite graphs. In Advances in Neural Information Processing Systems, 2015.

Carpentier, Alexandra and Locatelli, Andrea. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Proceedings of the 29th Conference on Learning Theory, 2016.

Catoni, Olivier et al. Challenging the empirical mean and empirical variance: a deviation study. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 48. Institut Henri Poincaré.

Chen, Lijie and Li, Jian. On the optimal sample complexity for best arm identification. arXiv preprint, 2015.

Even-Dar, Eyal, Mannor, Shie, and Mansour, Yishay. PAC bounds for multi-armed bandit and Markov decision processes. In Computational Learning Theory. Springer, 2002.

Gabillon, Victor, Ghavamzadeh, Mohammad, and Lazaric, Alessandro. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems, 2012.

Jamieson, Kevin, Malloy, Matthew, Nowak, Robert, and Bubeck, Sébastien. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of the 27th Conference on Learning Theory.

Kalyanakrishnan, Shivaram, Tewari, Ambuj, Auer, Peter, and Stone, Peter. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning (ICML-12).

Karnin, Zohar, Koren, Tomer, and Somekh, Oren. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML-13).

Kaufmann, Emilie, Cappé, Olivier, and Garivier, Aurélien. On the complexity of best arm identification in multi-armed bandit models. Journal of Machine Learning Research, 2015.

Mannor, S. and Tsitsiklis, J. N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5, 2004.

Salomon, Antoine and Audibert, Jean-Yves. Deviations of stochastic bandit regret. In Algorithmic Learning Theory. Springer, 2011.

Steinwart, Ingo, Hush, Don R., and Scovel, Clint. A classification framework for anomaly detection. In Journal of Machine Learning Research.

Streeter, Matthew J. and Smith, Stephen F. Selecting among heuristics by solving thresholded k-armed bandit problems. ICAPS 2006.

Zhou, Yuan, Chen, Xi, and Li, Jian. Optimal PAC multiple arm identification with applications to crowdsourcing. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.

Chen, Shouyuan, Lin, Tian, King, Irwin, Lyu, Michael R., and Chen, Wei. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, 2014.

A. Proofs

A.1. Proof of Theorem 1

Proof. In this proof, we will prove that on at least one instance of the problem, any algorithm makes a mistake with probability of order at least $\exp(-cT/H)$.

Step 0: Setting and notations. Let us consider $K$ real numbers $\Delta_i \geq 0$, and let us set $\tau = 0$, $\epsilon = 0$. Let us write $\nu_i := \mathcal{N}(\Delta_i, 1)$ for the Gaussian distribution of mean $\Delta_i$ and variance 1, and $\nu'_i := \mathcal{N}(-\Delta_i, 1)$ for the Gaussian distribution of mean $-\Delta_i$ and variance 1. Note that this construction is easily generalised to cases where $\tau \neq 0$ or $\epsilon \neq 0$ by translation or careful choice of the $\Delta_i$.

We define the product distributions $B^i$, where $i \in \{0, \ldots, K\}$, as $\nu^i_1 \otimes \ldots \otimes \nu^i_K$, where for $k \leq K$,

$$\nu^i_k := \nu_k \mathbf{1}_{k \neq i} + \nu'_k \mathbf{1}_{k = i}$$

is $\nu_k$ if $k \neq i$ and $\nu'_k$ otherwise. We also extend this notation to $B^0$, where none of the arms is flipped with respect to the threshold ($\forall k$, $\nu^0_k := \nu_k$). It is straightforward that the gap of arm $i$ with respect to the threshold $\tau = 0$ does not depend on $B^i$ and is equal to $\Delta_i$. It follows that all these problems have the same complexity $H$ (as defined previously with $\epsilon = 0$ and $\tau = 0$).

We write, for $i \leq K$, $P_{B^i}$ for the probability distribution according to all the samples that a strategy could possibly collect up to horizon $T$, i.e. according to the samples $(X_{k,s})_{k \leq K, s \leq T} \sim (B^i)^{\otimes T}$. Let $(T_k)_{k \leq K}$ denote the numbers of samples collected by the algorithm on arm $k$.

Let $k \in \{1, \ldots, K\}$. Note that $\mathrm{KL}_k := \mathrm{KL}(\nu'_k, \nu_k) = 2\Delta_k^2$, where $\mathrm{KL}$ is the Kullback-Leibler divergence. Let $1 \leq t \leq T$. We define the quantity:

$$\widehat{\mathrm{KL}}_{k,t} = \frac{1}{t}\sum_{s=1}^{t} \log\left(\frac{d\nu_k}{d\nu'_k}(X_{k,s})\right) = \frac{1}{t}\sum_{s=1}^{t} 2 X_{k,s}\Delta_k.$$

Step 1: Concentration of the empirical KL. Let us define the event:

$$\xi = \left\{\forall k \leq K, \forall t \leq T, \ \big|\widehat{\mathrm{KL}}_{k,t} - \mathrm{KL}_k\big| \leq 4\Delta_k\sqrt{\frac{\log(4(\log(T) + 1)K)}{t}}\right\}.$$

Since $\widehat{\mathrm{KL}}_{k,t} = \frac{1}{t}\sum_{s=1}^{t} 2X_{k,s}\Delta_k$ and $\mathrm{KL}_k = 2\Delta_k^2$, by Gaussian concentration (a peeling and the maximal martingale inequality), it holds for any $i$ that $P_{B^i}(\xi) \geq 3/4$.

Step 2: A change of measure. We will now use the change of measure introduced previously for a well chosen event $A$.
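Namely, we consider $A_i = \{i \in \hat{S}_\tau\}$, the event where the algorithm classified arm $i$ as being above the threshold. We have, by doing a change of measure between $B^i$ and $B^0$ (since they only differ in arm $i$, and only the first $T_i$ samples of arm $i$ are used by the algorithm):

$$P_{B^i}(A_i) = E_{B^0}\left[\mathbf{1}_{A_i}\exp\big(-T_i\widehat{\mathrm{KL}}_{i,T_i}\big)\right] \geq E_{B^0}\left[\mathbf{1}_{A_i \cap \xi}\exp\big(-T_i\widehat{\mathrm{KL}}_{i,T_i}\big)\right] \geq E_{B^0}\left[\mathbf{1}_{A_i \cap \xi}\exp\Big(-2\Delta_i^2 T_i - 4\Delta_i\sqrt{T_i}\sqrt{\log(4(\log(T) + 1)K)}\Big)\right],$$

by definition of $\xi$ and $\mathrm{KL}_i$.

Step 3: A union of events. We now consider the event $A = \bigcap_{i=1}^{K} A_i$, i.e. the event where all arms are classified as being above the threshold $\tau = 0$. We have:

$$\max_{i \in \{1,\ldots,K\}} P_{B^i}(A_i) \geq \frac{1}{K}\sum_{i=1}^{K} P_{B^i}(A_i) \quad (6)$$
$$\geq \frac{1}{K}\sum_{i=1}^{K} P_{B^i}(A_i \cap \xi)$$
$$\geq \frac{1}{K}\sum_{i=1}^{K} E_{B^0}\left[\mathbf{1}_{A_i \cap \xi}\exp\Big(-2\Delta_i^2 T_i - 4\Delta_i\sqrt{T_i}\sqrt{\log(4(\log(T) + 1)K)}\Big)\right]$$
$$\geq E_{B^0}\left[\mathbf{1}_{A \cap \xi}\frac{1}{K}\sum_{i=1}^{K}\exp\Big(-3\Delta_i^2 T_i - 4\log(4(\log(T) + 1)K)\Big)\right]$$
$$\geq \exp\Big(-4\log(4(\log(T) + 1)K)\Big) E_{B^0}\left[\mathbf{1}_{A \cap \xi} S\right], \quad (7)$$

where the fourth line comes from using $2ab \leq a^2 + b^2$ with $a = \Delta_i\sqrt{T_i}$, and where:

$$S = \frac{1}{K}\sum_{i=1}^{K}\exp\big(-3\Delta_i^2 T_i\big).$$

Since $\sum_i T_i = T$ and all $T_i$ are positive, there exists an arm $i$ such that $T_i \leq \frac{T}{H\Delta_i^2}$. This yields:

$$S \geq \frac{1}{K}\exp\Big(-3\frac{T}{H}\Big) = \exp\Big(-3\frac{T}{H} - \log(K)\Big).$$

This implies, by definition of the risk:

$$\max_{i \in \{0,\ldots,K\}} E_{B^i}(L(T)) \geq \max\Big(\max_{i \in \{1,\ldots,K\}} P_{B^i}(A_i), \ 1 - P_{B^0}(A)\Big)$$
$$\geq \frac{1}{2}\left(\exp\Big(-3\frac{T}{H} - 4\log(4(\log(T) + 1)K) - \log(K)\Big) E_{B^0}\left[\mathbf{1}_{A \cap \xi}\right] + 1 - P_{B^0}(A)\right)$$
$$= \frac{1}{2}\left(\exp\Big(-3\frac{T}{H} - 4\log(4(\log(T) + 1)K) - \log(K)\Big) P_{B^0}(A \cap \xi) + 1 - P_{B^0}(A)\right)$$
$$\geq \frac{1}{8}\exp\Big(-3\frac{T}{H} - 4\log(4(\log(T) + 1)K) - \log(K)\Big)$$
$$\geq \exp\Big(-3\frac{T}{H} - 4\log(12(\log(T) + 1)K)\Big).$$

The fourth line comes from $P(\xi) \geq 3/4$, and we consider two cases, $P_{B^0}(A) \geq 1/2$ and $P_{B^0}(A) \leq 1/2$. The first leads directly to the condition, as the intersection is at least of probability 1/4; in the latter case, we have the same bound via

$$\max_{i \in \{0,\ldots,K\}} E_{B^i}(L(T)) \geq E_{B^0}(L(T)) = P_{B^0}(A^C) \geq 1/2.$$

This concludes the proof.

A.2. Proof of Theorem 2

Proof. In this proof, we will show that on a well chosen event $\xi$, we classify correctly the arms which are over $\tau + \epsilon$, and reject the arms that are under $\tau - \epsilon$.

Step 1: A favorable event. Let $\delta = (4\sqrt{2})^{-1}$. Towards this goal, we define the event $\xi$ as follows:

$$\xi = \left\{\forall i \in A, \forall s \in \{1, \ldots, T\} : \ \Big|\frac{1}{s}\sum_{t=1}^{s} X_{i,t} - \mu_i\Big| \leq \sqrt{\frac{T\delta^2}{Hs}}\right\}.$$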

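We know from the Sub-Gaussian martingale inequality that for each $i \in A$ and each $u \in \{0, \ldots, \lfloor\log(T)\rfloor\}$:

$$P\left(\exists v \in [2^u, 2^{u+1}], \ \Big\{\Big|\frac{1}{v}\sum_{t=1}^{v} X_{i,t} - \mu_i\Big| \geq \sqrt{\frac{T\delta^2}{Hv}}\Big\}\right) \leq \exp\Big(-\frac{T\delta^2}{2R^2 H}\Big).$$

The complement of $\xi$ is contained in the union of these events over all $i \leq K$ and $u \leq \lfloor\log(T)\rfloor$. As there are fewer than $(\log(T) + 1)K$ such combinations, we can lower-bound the probability of $\xi$ with a union bound by:

$$P(\xi) \geq 1 - 2(\log(T) + 1)K\exp\Big(-\frac{T\delta^2}{2R^2 H}\Big).$$

Step 2: Characterization of some helpful arm. At time $T$, we consider an arm $k$ that has been pulled after the initialization phase and such that $T_k(T) - 1 \geq \frac{T - K}{H\Delta_k^2}$. We know that such an arm exists, since otherwise we get:

$$T - K = \sum_{i=1}^{K}\big(T_i(T) - 1\big) < \sum_{i=1}^{K}\frac{T - K}{H\Delta_i^2} = T - K,$$

which is a contradiction. Note that since $T \geq 2K$, we have that

$$T_k(T) - 1 \geq \frac{T - K}{H\Delta_k^2} \geq \frac{T}{2H\Delta_k^2}.$$

We now consider $t \leq T$, the last time that this arm $k$ was pulled. Using $T_k(t) \geq 2$ (by the initialisation of the algorithm), we know that:

$$T_k(t) \geq T_k(T) - 1 \geq \frac{T}{2H\Delta_k^2}. \quad (8)$$

Step 3: Lower bound on the number of pulls of the other arms. On $\xi$, at time $t$ as we defined previously, we have for every arm $i$:

$$|\hat{\mu}_i(t) - \mu_i| \leq \delta\sqrt{\frac{T}{H T_i(t)}}. \quad (9)$$

From the reverse triangle inequality and Equation (4), we have:

$$|\hat{\mu}_i(t) - \mu_i| = \big|(\hat{\mu}_i(t) - \tau) - (\mu_i - \tau)\big| \geq \big||\hat{\mu}_i(t) - \tau| - |\mu_i - \tau|\big| = \big|(|\hat{\mu}_i(t) - \tau| + \epsilon) - (|\mu_i - \tau| + \epsilon)\big| = |\hat{\Delta}_i(t) - \Delta_i|.$$

Combining this with (9) yields the following for every arm $i \in A$, and in particular for arm $k$:

$$\Delta_i - \delta\sqrt{\frac{T}{H T_i(t)}} \leq \hat{\Delta}_i(t) \leq \Delta_i + \delta\sqrt{\frac{T}{H T_i(t)}}. \quad (10)$$

By construction, we know that at time $t$ we pulled arm $k$, which yields for every $i \in A$:

$$B_k(t) \leq B_i(t). \quad (11)$$

We can lower bound the left-hand side of (11) using (8):

$$\Big(\Delta_k - \delta\sqrt{\frac{T}{H T_k(t)}}\Big)\sqrt{T_k(t)} \leq B_k(t),$$
$$\big(\Delta_k - \sqrt{2}\delta\Delta_k\big)\sqrt{\frac{T}{2H\Delta_k^2}} \leq B_k(t),$$
$$\Big(\frac{1}{\sqrt{2}} - \delta\Big)\sqrt{\frac{T}{H}} \leq B_k(t), \quad (12)$$

and upper bound the right-hand side using (10) by:

$$B_i(t) = \hat{\Delta}_i(t)\sqrt{T_i(t)} \leq \Big(\Delta_i + \delta\sqrt{\frac{T}{H T_i(t)}}\Big)\sqrt{T_i(t)} \leq \Delta_i\sqrt{T_i(t)} + \delta\sqrt{\frac{T}{H}}. \quad (13)$$

As both $\Delta_i$ and $T_i(t)$ are positive by definition, combining (12) and (13) yields the following lower bound on $T_i(T) \geq T_i(t)$:

$$\big(1 - 2\sqrt{2}\delta\big)^2\frac{T}{2H\Delta_i^2} \leq T_i(T). \quad (14)$$

Step 4: Conclusion. On $\xi$, as $\Delta_i$ is a positive quantity, combining (9) and (14) yields:

$$\mu_i - \Delta_i\frac{\sqrt{2}\delta}{1 - 2\sqrt{2}\delta} \leq \hat{\mu}_i(T) \leq \mu_i + \Delta_i\frac{\sqrt{2}\delta}{1 - 2\sqrt{2}\delta}, \quad (15)$$

where $\frac{\sqrt{2}\delta}{1 - 2\sqrt{2}\delta}$ simplifies to $1/2$ for $\delta = (4\sqrt{2})^{-1}$.

For arms such that $\mu_i \geq \tau + \epsilon$, we have $\Delta_i = \mu_i - \tau + \epsilon$ and we can rewrite (15):

$$\mu_i - \tau - \frac{1}{2}\Delta_i \leq \hat{\mu}_i(T) - \tau,$$
$$(\mu_i - \tau)\Big(1 - \frac{1}{2}\Big) - \frac{\epsilon}{2} \leq \hat{\mu}_i(T) - \tau,$$
$$0 \leq \hat{\mu}_i(T) - \tau,$$

where the last line uses $\mu_i \geq \tau + \epsilon$. One can easily check through similar derivations that $\hat{\mu}_i(T) - \tau < 0$ holds for $\mu_i < \tau - \epsilon$. On $\xi$, arms over $\tau + \epsilon$ are all accepted, and arms under $\tau - \epsilon$ are all rejected, which means the loss suffered by the algorithm is 0. As $1 - P(\xi) \leq 2(\log(T) + 1)K\exp\big(-\frac{1}{64R^2}\frac{T}{H}\big)$, this concludes the proof.

A.3. Proof of Theorem 3

Proof. We will prove that on a well defined event $\xi$, each sub-optimal arm $k$ is pulled fewer than $\frac{T}{2\Delta_k^2 H}$ times, which translates to the best arm being chosen at the end of the horizon, as it was pulled more than half of the time.

Step 1: A favorable event. Let $\delta = 1/18$. We define the following events $\forall i \in A$:

$$\xi_i = \left\{\forall s \leq T : \ |\hat{\mu}_i(s) - \mu_i| \leq \sqrt{\frac{T\delta}{H T_i(s)}}\right\}.$$

We now define $\xi$ as the intersection of these events:

$$\xi = \bigcap_{k \in A}\xi_k.$$

Using the same Sub-Gaussian martingale inequality as in the proof of Theorem 2, we can lower bound its probability of occurrence with a union bound by:

$$P(\xi) \geq 1 - 2(\log(T) + 1)K\exp\Big(-\frac{T}{36R^2 H}\Big).$$

Step 2: The wrong arm at the wrong time. Let us now suppose that a sub-optimal arm $k$ was pulled at least $\frac{T - K}{2\Delta_k^2 H}$ times after the initialization, which translates to $T_k(T) - 1 \geq \frac{T - K}{2\Delta_k^2 H}$. Let us now consider the last time $t \leq T$ that this arm was pulled. As it was pulled at time $t$, the following inequality holds, where $k^*$ denotes the best arm:

$$B_k(t) \leq B_{k^*}(t). \quad (16)$$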

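On $\xi$, we can now lower bound the left-hand side by:

$$\Big(\Delta_k - \sqrt{\frac{T\delta}{H T_k(t)}}\Big)\sqrt{T_k(t)} \leq B_k(t),$$
$$\Delta_k\sqrt{T_k(t)} - \sqrt{\frac{T\delta}{H}} \leq B_k(t). \quad (17)$$

We also upper bound the right-hand side of (16) by:

$$B_{k^*}(t) \leq \sqrt{\frac{T\delta}{H}}. \quad (18)$$

Combining both bounds (17) and (18) with (16), as well as rearranging the terms, yields:

$$\Delta_k\sqrt{T_k(t)} \leq 2\sqrt{\frac{T\delta}{H}}, \qquad T_k(t)\Delta_k^2 \leq \frac{4T\delta}{H}. \quad (19)$$

Using $T_k(t) \geq T_k(T) - 1 \geq \frac{T - K}{2\Delta_k^2 H}$ as well as $T \geq 2K$, we have

$$T_k(t) \geq \frac{T}{4\Delta_k^2 H}. \quad (20)$$

Plugging this in (19) brings the following condition:

$$\frac{T}{4\Delta_k^2 H}\Delta_k^2 \leq \frac{4T\delta}{H}, \quad (21)$$

which directly reduces to $\delta \geq 1/16$, which is a contradiction as we have set $\delta = 1/18$. As we have proved that any sub-optimal arm $i \neq k^*$ satisfies $T_i(T) < \frac{T}{2\Delta_i^2 H}$, summing over all sub-optimal arms yields:

$$T - T_{k^*}(T) = \sum_{i \neq k^*} T_i(T) < \frac{T}{2H}\sum_{i \neq k^*}\frac{1}{\Delta_i^2} = \frac{T}{2}. \quad (22)$$

We conclude by observing that $T_{k^*}(T) > T/2$, and as such arm $k^*$ will be chosen by the algorithm at the end as being the best arm.

A.4. Proof of Theorem 4

Proof. In this proof we will show that with high probability the sub-optimal arms have been pulled at most at a logarithmic rate, and will then bound the expectation of the number of pulls of these arms.

Step 1: A favorable event. We define the following events $\forall s \leq T$:

$$\xi_{k^*,s} = \left\{|\mu^* - \hat{\mu}_{k^*}(s)| \leq R\sqrt{\frac{\log(T)\delta}{T_{k^*}(s)}}\right\},$$

as well as for all arms $i \neq k^*$:

$$\xi_{i,s} = \left\{|\hat{\mu}_i(s) - \mu_i| \leq R\sqrt{\frac{\log(T)\delta}{T_i(s)}}\right\}.$$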
