
Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric
INRIA Lille - Nord Europe, Team SequeL
{victor.gabillon,mohammad.ghavamzadeh,alessandro.lazaric}@inria.fr

Abstract

We study the problem of identifying the best arm(s) in the stochastic multi-armed bandit setting. This problem has been studied in the literature from two different perspectives: fixed budget and fixed confidence. We propose a unifying approach that leads to a meta-algorithm called unified gap-based exploration (UGapE), with a common structure and similar theoretical analysis for these two settings. We prove a performance bound for the two versions of the algorithm showing that the two problems are characterized by the same notion of complexity. We also show how the UGapE algorithm as well as its theoretical analysis can be extended to take into account the variance of the arms and to multiple bandits. Finally, we evaluate the performance of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms.

1 Introduction

The problem of best arm(s) identification [6, 3, 1] in the stochastic multi-armed bandit setting has recently received much attention. In this problem, a forecaster repeatedly selects an arm and observes a sample drawn from its reward distribution during an exploration phase, and then is asked to return the best arm(s). Unlike the standard multi-armed bandit problem, where the goal is to maximize the cumulative sum of rewards obtained by the forecaster (see e.g., [15, 2]), in this problem the forecaster is evaluated on the quality of the arm(s) returned at the end of the exploration phase. This abstract problem models a wide range of applications. For instance, let us consider a company that has K different variants of a product and needs to identify the best one(s) before actually placing it on the market. The company sets up a testing phase in which the products are tested by potential customers. Each customer tests one product at a time and gives it a score (a reward). The objective of the company is to return a product at the end of the test phase which is likely to be successful once placed on the market (i.e., the best arm identification), and it is not interested in the scores collected during the test phase (i.e., the cumulative reward).

The problem of best arm(s) identification has been studied in two distinct settings in the literature.

Fixed budget. In the fixed budget setting (see e.g., [3, 1]), the number of rounds of the exploration phase is fixed and is known by the forecaster, and the objective is to maximize the probability of returning the best arm(s). In the above example, the company fixes the length of the test phase beforehand (e.g., enrolls a fixed number of customers) and defines a strategy for choosing which products to show to the testers, so that the final selected product is the best with the highest probability. Audibert et al. [1] proposed two different strategies to solve this problem. They defined a strategy based on upper confidence bounds, called UCB-E, whose optimal parameterization is strictly related to a measure of the complexity of the problem. They also introduced an elimination algorithm, called Successive Rejects, which divides the budget n into phases and discards one arm per phase. Both algorithms were shown to have nearly optimal probability of returning the best arm.

Deng et al. [5] and Gabillon et al. [8] considered the extension of the best arm identification problem to the multi-bandit setting, where the objective is to return the best arm for each bandit. Recently, Bubeck et al. [4] extended the previous results to the problem of m-best arm identification and introduced a new version of the Successive Rejects algorithm (with accept and reject) that is able to return the set of the m best arms with high probability.

Fixed confidence. In the fixed confidence setting (see e.g., [12, 6]), the forecaster tries to minimize the number of rounds needed to achieve a fixed confidence about the quality of the returned arm(s). In the above example, the company keeps enrolling customers in the test until it is, e.g., 95% confident that the best product has been identified. Maron & Moore [12] considered a slightly different setting where, besides a fixed confidence, the maximum number of rounds is also fixed. They designed an elimination algorithm, called Hoeffding Races, based on progressively discarding the arms that are suboptimal with enough confidence. Mnih et al. [14] introduced an improved algorithm, built on the Bernstein concentration inequality, which takes into account the empirical variance of each arm. Even-Dar et al. [6] studied the fixed confidence setting without any budget constraint and designed an elimination algorithm able to return an arm with a required accuracy ɛ (i.e., whose performance is at least ɛ-close to that of the optimal arm). Kalyanakrishnan & Stone [10] further extended this approach to the case where the m best arms must be returned with a given confidence. Finally, Kalyanakrishnan et al. [11] recently introduced an algorithm for the case of m-best arm identification along with a thorough theoretical analysis showing the number of rounds needed to achieve the desired confidence.

Although the fixed budget and fixed confidence problems have been studied separately, they display several similarities. In this paper, we propose a unified approach to these two settings in the general case of m-best arm identification with accuracy ɛ.¹ The main contributions of the paper can be summarized as follows:

Algorithm. In Section 3, we propose a novel meta-algorithm, called unified gap-based exploration (UGapE), which uses the same arm selection and (arm) return strategies for the two settings. This algorithm allows us to solve settings that have not been covered in previous work (e.g., the case of ɛ ≠ 0 has not been studied in the fixed budget setting). Furthermore, we show in Appendix C of [7] that UGapE outperforms existing algorithms in some settings (e.g., it improves the performance of the algorithm by Mnih et al. [14] in the fixed confidence setting). We also provide a thorough empirical evaluation of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms in Appendix C of [7].

Theoretical analysis. Similar to the algorithmic contribution, in Section 4, we show that a large portion of the theoretical analysis required to study the behavior of the two settings of the UGapE algorithm can be unified in a series of lemmas. The final theoretical guarantees are thus a direct consequence of these lemmas when used in the two specific settings.

Problem complexity. In Section 4.4, we show that the theoretical analysis indicates that the two problems share exactly the same definition of complexity. In particular, we show that the probability of success in the fixed budget setting as well as the sample complexity in the fixed confidence setting strictly depend on the inverse of the gaps of the arms and the desired accuracy ɛ.
Extensions. Finally, in Appendix B of [7], we discuss how the proposed algorithm and analysis can be extended to improved definitions of the confidence interval (e.g., Bernstein-based bounds) and to more complex settings, such as the multi-bandit best arm identification problem introduced in [8].

2 Problem Formulation

In this section, we introduce the notation used throughout the paper. Let A = {1, ..., K} be the set of arms, such that each arm k ∈ A is characterized by a distribution ν_k bounded in [0, b] with mean µ_k and variance σ_k². We define the m-max and m-argmax operators as²

µ_(m) = max^(m)_{k∈A} µ_k   and   (m) = argmax^(m)_{k∈A} µ_k,

where (m) denotes the index of the m-th best arm in A and µ_(m) is its corresponding mean, so that µ_(1) ≥ µ_(2) ≥ ... ≥ µ_(K). We denote by S ⊆ A any subset of m arms (i.e., |S| = m < K) and by S*_m the subset of the m best arms (i.e., k ∈ S*_m iff µ_k ≥ µ_(m)).

¹ Note that when ɛ = 0 and m = 1, this reduces to the standard best arm identification problem.
² Ties are broken in an arbitrary but consistent manner.

Without loss of generality, we assume there exists a unique set S*_m. In the following, we drop the superscript m and use S = S_m and S* = S*_m whenever m is clear from the context. With a slight abuse of notation, we further extend the m-max operator to an operator returning a set of arms, such that

{µ_(1), ..., µ_(m)} = max^{1..m}_{k∈A} µ_k   and   S*_m = argmax^{1..m}_{k∈A} µ_k.

For each arm k ∈ A, we define the gap Δ_k as

Δ_k = µ_k − µ_(m+1)   if k ∈ S*_m,
Δ_k = µ_(m) − µ_k     if k ∉ S*_m.

This definition of the gap indicates that if k ∈ S*_m, Δ_k represents the advantage of arm k over the suboptimal arms, and if k ∉ S*_m, Δ_k denotes how suboptimal arm k is. Note that we can also write the gap as Δ_k = |max^(m)_{i≠k} µ_i − µ_k|. Given an accuracy ɛ and a number of arms m, we say that an arm k is (ɛ,m)-optimal if µ_k ≥ µ_(m) − ɛ. Thus, we define the (ɛ,m)-best arm identification problem as the problem of finding a set S of m (ɛ,m)-optimal arms.

The (ɛ,m)-best arm identification problem can be formalized as a game between a stochastic bandit environment and a forecaster. The distributions {ν_k} are unknown to the forecaster. At each round t, the forecaster pulls an arm I(t) ∈ A and observes an independent sample drawn from the distribution ν_{I(t)}. The forecaster estimates the expected value of each arm by computing the average of the samples observed over time. Let T_k(t) be the number of times that arm k has been pulled by the end of round t; the mean of this arm is then estimated as µ_k(t) = (1/T_k(t)) Σ_{s=1}^{T_k(t)} X_k(s), where X_k(s) is the s-th sample observed from ν_k. For any arm k ∈ A, we define the notion of arm simple regret as

r_k = µ_(m) − µ_k,   (1)

and for any set S ⊆ A of m arms, we define the simple regret as

r_S = max_{k∈S} r_k = µ_(m) − min_{k∈S} µ_k.   (2)

We denote by Ω(t) ⊆ A the set of m arms returned by the forecaster at the end of the exploration phase (when the algorithm stops after t rounds), and by r_{Ω(t)} its corresponding simple regret. Returning m (ɛ,m)-optimal arms is then equivalent to having r_{Ω(t)} smaller than ɛ. Given an accuracy ɛ and a number m of arms to return, we now formalize the two settings of fixed budget and fixed confidence.

Fixed budget. The objective is to design a forecaster capable of returning a set of m (ɛ,m)-optimal arms with the largest possible confidence using a fixed budget of n rounds. More formally, given a budget n, the performance of the forecaster is measured by the probability δ of not meeting the (ɛ,m) requirement, i.e., δ = P[r_{Ω(n)} ≥ ɛ]; the smaller δ, the better the algorithm.

Fixed confidence. The goal is to design a forecaster that stops as soon as possible and returns a set of m (ɛ,m)-optimal arms with a fixed confidence. We denote by ñ the time when the algorithm stops and by Ω(ñ) its set of returned arms. Given a confidence level δ, the forecaster has to guarantee that P[r_{Ω(ñ)} ≥ ɛ] ≤ δ. The performance of the forecaster is then measured by the number of rounds ñ, either in expectation or in high probability.

Although these settings have been considered as two distinct problems, in Section 3 we introduce a unified arm selection strategy that can be used in both cases by simply changing the stopping criterion. Moreover, we show in Section 4 that the bounds on the performance of the algorithm in the two settings share the same notion of complexity and can be derived using very similar arguments.
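To make the notation concrete, the following sketch (numpy assumed; the instance values and variable names are ours, purely hypothetical) computes µ_(m), the set S*_m, the gaps Δ_k, and the simple regret of a candidate set:

```python
import numpy as np

mu = np.array([0.9, 0.8, 0.7, 0.5, 0.4])  # true (unknown) means of a toy instance
m = 2                                     # number of arms to identify

order = np.argsort(mu)[::-1]              # arms sorted by decreasing mean
mu_m = mu[order[m - 1]]                   # m-max: the m-th largest mean, mu_(m)
S_star = set(order[:m])                   # S*_m: indices of the m best arms

# Gaps: advantage over the suboptimal arms for k in S*_m,
# amount of suboptimality otherwise.
mu_m_plus_1 = mu[order[m]]
gaps = np.where(np.isin(np.arange(len(mu)), list(S_star)),
                mu - mu_m_plus_1,         # Delta_k = mu_k - mu_(m+1) if k in S*_m
                mu_m - mu)                # Delta_k = mu_(m) - mu_k   otherwise

# Simple regret of a candidate set S (Eqs. 1-2): r_S = mu_(m) - min_{k in S} mu_k.
S = [0, 2]                                # a candidate set returned by a forecaster
r_S = mu_m - mu[S].min()
print(S_star, gaps, r_S)                  # S is (eps, m)-optimal iff r_S <= eps
```

Here S = {0, 2} has simple regret 0.1, so it is (ɛ,m)-optimal for any ɛ ≥ 0.1 even though it differs from S*_m = {0, 1}.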
3 Unified Gap-based Exploration Algorithm

In this section, we describe the unified gap-based exploration (UGapE) meta-algorithm and show how it is implemented in the fixed-budget and fixed-confidence settings. As shown in Figure 1, both the fixed-budget (UGapEb) and fixed-confidence (UGapEc) instances of UGapE use the same arm-selection strategy, SELECT-ARM (described in Figure 2), and upon stopping, return the m best arms in the same manner (using Ω). The two algorithms only differ in their stopping criteria. More precisely, both algorithms receive as input the definition of the problem (ɛ, m), a constraint (the budget n in UGapEb and the confidence level δ in UGapEc), and a parameter (a or c).

While UGapEb runs for n rounds and then returns the set of arms Ω(n), UGapEc runs until it achieves the desired accuracy ɛ with the requested confidence level δ. This difference is due to the two different objectives targeted by the algorithms: while UGapEc optimizes its budget for a given confidence level, UGapEb's goal is to optimize the quality of its recommendation for a fixed budget.

UGapEb (ɛ, m, n, a)
Parameters: accuracy ɛ, number of arms m, budget n, exploration parameter a
Initialize: Pull each arm k once, update µ_k(K) and set T_k(K) = 1
for t = K + 1, ..., n do
    SELECT-ARM(t)
end for
Return Ω(n) = argmin_{J(t), t∈{1,...,n}} B_{J(t)}(t)

UGapEc (ɛ, m, δ, c)
Parameters: accuracy ɛ, number of arms m, confidence level δ, exploration parameter c
Initialize: Pull each arm k once, update µ_k(K), set T_k(K) = 1 and t ← K + 1
while B_{J(t)}(t) ≥ ɛ do
    SELECT-ARM(t)
    t ← t + 1
end while
Return Ω(t) = J(t)

Figure 1: The pseudo-code for the UGapE algorithm in the fixed-budget (UGapEb) (left) and fixed-confidence (UGapEc) (right) settings.

Regardless of the final objective, how to select an arm at each round (the arm-selection strategy) is the key component of any multi-armed bandit algorithm. One of the most important features of UGapE is having a unique arm-selection strategy for the fixed-budget and fixed-confidence settings. We now describe UGapE's arm-selection strategy, whose pseudo-code is reported in Figure 2. At each time step t, UGapE first uses the observations up to time t − 1 and computes an index

B_k(t) = max^(m)_{i≠k} U_i(t) − L_k(t)

for each arm k ∈ A, where, for all t and all k ∈ A,

U_k(t) = µ_k(t−1) + β_k(t−1),   L_k(t) = µ_k(t−1) − β_k(t−1).   (3)

SELECT-ARM(t)
Compute B_k(t) for each arm k ∈ A
Identify the set of m arms J(t) = argmin^{1..m}_{k∈A} B_k(t)
Pull the arm I(t) = argmax_{k∈{l_t,u_t}} β_k(t−1)
Observe X_{I(t)}(T_{I(t)}(t−1) + 1) ~ ν_{I(t)}
Update µ_{I(t)}(t) and T_{I(t)}(t)

Figure 2: The pseudo-code for UGapE's arm-selection strategy. This routine is used in both the UGapEb and UGapEc instances of UGapE.

In Eq. 3, β_k(t−1) is a confidence interval,³ and U_k(t) and L_k(t) are high-probability upper and lower bounds on the mean of arm k, µ_k, after t − 1 rounds. Note that the parameters a and c are used in the definition of the confidence interval β_k, whose shape strictly depends on the concentration bound used by the algorithm. For example, we can derive β_k from the Chernoff-Hoeffding bound as

UGapEb: β_k(t−1) = b √( a / T_k(t−1) ),   UGapEc: β_k(t−1) = b √( c log(4K(t−1)³/δ) / T_k(t−1) ).   (4)

In Section 4, we discuss how the parameters a and c can be tuned, and we show that while a should be tuned as a function of n and ɛ in UGapEb, c = 1/2 is always a good choice for UGapEc. Defining the confidence interval in the general form β_k(t−1) allows us to easily extend the algorithm by taking into account different (higher) moments of the arms (see Appendix B of [7] for the case of variance, where β_k(t−1) is obtained from the Bernstein inequality).

From Eq. 3, we may see that the index B_k(t) is an upper bound on the simple regret r_k of the k-th arm (see Eq. 1). We also define an index for a set S as B_S(t) = max_{i∈S} B_i(t). Similar to the arm index, B_S is also defined so as to upper-bound the simple regret r_S with high probability (see Lemma 1). After computing the arm indices, UGapE finds a set of m arms J(t) with minimum upper bound on their simple regrets, i.e., J(t) = argmin^{1..m}_{k∈A} B_k(t).
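As a small illustration of Eqs. 3-4, the following sketch (numpy assumed; the function names are ours, not the paper's) computes the two confidence radii and the resulting U, L, and B indices from empirical means and pull counts:

```python
import numpy as np

def radius_b(T, b, a):
    """UGapEb radius (Eq. 4): beta_k(t-1) = b * sqrt(a / T_k(t-1))."""
    return b * np.sqrt(a / T)

def radius_c(T, b, c, t, K, delta):
    """UGapEc radius (Eq. 4): b * sqrt(c * log(4 K (t-1)^3 / delta) / T_k(t-1))."""
    return b * np.sqrt(c * np.log(4 * K * (t - 1) ** 3 / delta) / T)

def indices(mu_hat, radius, m):
    """U_k, L_k (Eq. 3) and B_k(t) = max^(m)_{i != k} U_i(t) - L_k(t)."""
    U, L = mu_hat + radius, mu_hat - radius
    B = np.array([np.sort(np.delete(U, k))[::-1][m - 1] - L[k]
                  for k in range(len(mu_hat))])
    return U, L, B
```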
From J(t), it computes two arm indices u_t = argmax_{j∉J(t)} U_j(t) and l_t = argmin_{i∈J(t)} L_i(t), where in both cases ties are broken in favor of the arm with the largest uncertainty β_k(t−1).

³ To be more precise, β_k(t−1) is the width of a confidence interval, or a confidence radius.
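The following sketch wires these pieces into the two instances of Figure 1 (the index computation is repeated for self-containment; names and the reward-sampling interface are ours, a sketch rather than the authors' implementation):

```python
import numpy as np

def select_arm(mu_hat, radius, m):
    """One round of UGapE's arm-selection strategy (Figure 2).
    Returns the arm I(t) to pull, the candidate set J(t), and the B-indices."""
    K = len(mu_hat)
    U, L = mu_hat + radius, mu_hat - radius                 # Eq. 3
    # B_k(t): m-th largest U_i over i != k, minus L_k
    B = np.array([np.sort(np.delete(U, k))[::-1][m - 1] - L[k] for k in range(K)])
    J = np.argsort(B)[:m]                                   # m smallest B-indices
    out = np.setdiff1d(np.arange(K), J)
    u = out[np.lexsort((-radius[out], -U[out]))[0]]         # u_t: largest U outside J(t)
    l = J[np.lexsort((-radius[J], L[J]))[0]]                # l_t: smallest L inside J(t)
    I = u if radius[u] >= radius[l] else l                  # pull the more uncertain one
    return I, J, B

def ugape_b(pull, K, m, n, b, a):
    """UGapEb sketch (Figure 1, left): fixed budget n, exploration parameter a."""
    T = np.ones(K)
    sums = np.array([float(pull(k)) for k in range(K)])     # initialization: one pull each
    best, omega = np.inf, None
    for t in range(K + 1, n + 1):
        radius = b * np.sqrt(a / T)                         # Eq. 4, fixed-budget radius
        I, J, B = select_arm(sums / T, radius, m)
        if B[J].max() < best:                               # Omega(n) = argmin_t B_{J(t)}(t)
            best, omega = B[J].max(), J.copy()
        sums[I] += pull(I)
        T[I] += 1
    return omega

def ugape_c(pull, K, m, eps, delta, b, c=0.5):
    """UGapEc sketch (Figure 1, right): stops once B_{J(t)}(t) < eps."""
    T = np.ones(K)
    sums = np.array([float(pull(k)) for k in range(K)])
    t = K + 1
    while True:
        radius = b * np.sqrt(c * np.log(4 * K * (t - 1) ** 3 / delta) / T)  # Eq. 4
        I, J, B = select_arm(sums / T, radius, m)
        if B[J].max() < eps:                                # stopping condition
            return J
        sums[I] += pull(I)
        T[I] += 1
        t += 1
```

For instance, with Bernoulli arms, `rng = np.random.default_rng(0)` and `pull = lambda k: float(rng.random() < means[k])`, the call `ugape_c(pull, K=5, m=2, eps=0.05, delta=0.05, b=1.0)` returns a set of two arms whose simple regret is at most 0.05 with probability at least 0.95.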

Arms l_t and u_t are the worst possible arm among those in J(t) and the best possible arm left outside J(t), respectively; together they represent how bad the choice of J(t) could be. Intuitively, pulling the more uncertain of u_t and l_t allows the algorithm to reduce the uncertainty about the potential regret of J(t). The algorithm thus selects and pulls the arm I(t) with the larger β_k(t−1) among u_t and l_t, observes a sample X_{I(t)}(T_{I(t)}(t−1) + 1) from the distribution ν_{I(t)}, and updates the empirical mean µ_{I(t)}(t) and the number of pulls T_{I(t)}(t) of the selected arm I(t).

There are two more points that need to be discussed about the UGapE algorithm. 1) While UGapEc defines the set of returned arms as Ω(t) = J(t), UGapEb returns the set of arms J(t) with the smallest index, i.e., Ω(n) = argmin_{J(t), t∈{1,...,n}} B_{J(t)}(t). 2) UGapEc stops (we refer to the number of rounds before stopping as ñ) when B_{J(ñ+1)}(ñ+1) is less than the given accuracy ɛ, i.e., when even the worst upper bound on the arm simple regret among all the arms in the selected set J(ñ+1) is smaller than ɛ. This guarantees that the simple regret (see Eq. 2) of the set returned by the algorithm, Ω(ñ) = J(ñ+1), is smaller than ɛ with probability larger than 1 − δ.

4 Theoretical Analysis

In this section, we provide high-probability upper bounds on the performance of the two instances of the UGapE algorithm, UGapEb and UGapEc, introduced in Section 3. An important feature of UGapE is that, since its fixed-budget and fixed-confidence versions share the same arm-selection strategy, a large part of their theoretical analysis can be unified. We first report this unified part of the proof in Section 4.1, and then provide the final performance bound for each of the algorithms, UGapEb and UGapEc, separately, in Sections 4.2 and 4.3, respectively.

Before moving to the main results, we define additional notation used in the analysis. We first define the event E as

E = { ∀k ∈ A, ∀t ∈ {1, ..., T}, |µ_k(t) − µ_k| < β_k(t) },   (5)

where the values of T and β_k are defined for each specific setting separately. Note that the event E plays an important role in the sequel, since it allows us to first derive a series of results that are directly implied by E and to postpone the study of the stochastic nature of the problem (i.e., the probability of E) to the two specific settings. In particular, when E holds, we have that for any arm k ∈ A and at any time t, L_k(t) ≤ µ_k ≤ U_k(t). Finally, we define the complexity of the problem as

H_ɛ = Σ_{i=1}^{K} b² / max( (Δ_i + ɛ)/2, ɛ )².   (6)

Note that although the complexity has an explicit dependence on ɛ, it also depends on the number of arms m through the definition of the gaps Δ_i, thus making it a complexity measure of the (ɛ,m) best arm identification problem. In Section 4.4, we will discuss why the complexity of the two instances of the problem is measured by this quantity.
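When the gaps are known (e.g., in simulations), the complexity of Eq. 6 is immediate to compute; a minimal sketch with hypothetical values (numpy assumed, our function name):

```python
import numpy as np

def complexity(gaps, b, eps):
    """H_eps = sum_i b^2 / max((Delta_i + eps) / 2, eps)^2 (Eq. 6)."""
    return float(np.sum(b ** 2 / np.maximum((gaps + eps) / 2, eps) ** 2))

gaps = np.array([0.2, 0.1, 0.1, 0.3, 0.4])   # hypothetical instance
print(complexity(gaps, b=1.0, eps=0.0))      # ~969.4: each arm contributes 4b^2/Delta_i^2
print(complexity(gaps, b=1.0, eps=0.05))     # ~472: accuracy slack reduces the complexity
```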
4.1 Analysis of the Arm-Selection Strategy

Here we report lower (Lemma 1) and upper (Lemma 2) bounds for the indices B_S on the event E, which show their connection with the regret and the gaps. The technical lemmas used in the proofs (Lemmas 3 and 4 and Corollary 1) are reported in Appendix A of [7]. We first prove that for any set S ≠ S* and any time t ∈ {1, ..., T}, the index B_S(t) is an upper bound on the simple regret r_S of this set.

Lemma 1. On event E, for any set S ≠ S* and any time t ∈ {1, ..., T}, we have B_S(t) ≥ r_S.

Proof. On event E, for any arm i ∉ S* and any time t ∈ {1, ..., T}, we may write

B_i(t) = max^(m)_{j≠i} U_j(t) − L_i(t) = max^(m)_{j≠i} ( µ_j(t−1) + β_j(t−1) ) − ( µ_i(t−1) − β_i(t−1) ) ≥ max^(m)_{j≠i} µ_j − µ_i = µ_(m) − µ_i = r_i.   (7)

Using Eq. 7, we have

B_S(t) = max_{i∈S} B_i(t) ≥ max_{i∈(S−S*)} B_i(t) ≥ max_{i∈(S−S*)} r_i = r_S,

where the last passage follows from the fact that r_i ≤ 0 for any i ∈ S*.

Lea. On event E, if ar k {l t, u t } is pulled at tie t {1,..., T }, we have B J(t) (t) in ( 0, Δ k + β k(t 1) ) + β k(t 1). (8) Proof. We first prove the stateent for B(t) = U ut (t) L lt (t), i.e., B(t) in ( 0, Δ k + β k(t 1) ) + β k(t 1). (9) We consider the following cases: Case 1. k = u t : Case 1.1. u t S : Since by definition u t / J(t), there exists an ar j / S such that j J(t). Now we ay write µ (+1) µ j (a) L j (t) (b) L lt (t) (c) L ut (t) = µ k (t 1) β k(t 1) (d) µ k β k(t 1) (10) (a) and (d) hold because of event E, (b) follows fro the fact that j J(t) and fro the definition of l t, and (c) is the result of Lea 4. Fro Eq. 10, we ay deduce that Δ k + β k(t 1) 0, which together with Corollary 1 gives us the desired result (Eq. 9). Case 1.. u t / S : Case 1..1. l t S : In this case, we ay write B(t) = U ut (t) L lt (t) (a) µ ut + β u t (t 1) µ lt + β l t (t 1) (b) µ ut + β u t (t 1) µ () + β l t (t 1) (c) Δ ut + 4β u t (t 1) (11) (a) holds because of event E, (b) is fro the fact that l t S, and (c) is because u t is pulled, and thus, β u t (t 1) β l t (t 1). The final result follows fro Eq. 11 and Corollary 1. Case 1... l t / S : Since l t / S and the fact that by definition l t J(t), there exists an ar j S such that j / J(t). Now we ay write µ ut + β u t (t 1) (a) U ut (t) (b) U j (t) (c) µ j (d) µ () (1) (a) and (c) hold because of event E, (b) is fro the definition of u t and the fact that j / J(t), and (d) holds because j S. Fro Eq. 1, we ay deduce that Δ ut + β u t (t 1) 0, which together with Corollary 1 gives us the final result (Eq. 9). With siilar arguents and cases, we prove the result of Eq. 9 for k = l t. The final stateent of the lea (Eq. 8) follows directly fro B J(t) (t) B(t) as shown in Lea 3. Using Leas 1 and, we define an upper and a lower bounds on B J(t) in ters of quantities related to the regret of J(t). Lea 1 confirs the intuition that the B-values upper-bound the regret of the corresponding set of ars (with high probability). Unfortunately, this is not enough to clai that selecting J(t) as the set of ars with sallest B-values actually correspond to ars with sall regret, since B J(t) could be an arbitrary loose bound on the regret. Lea provides this copleentary guarantee specifically for the set J(t), in the for of an upper-bound on B J(t) w.r.t. the gap of k {u t, l t }. This iplies that as the algorith runs, the choice of J(t) becoes ore and ore accurate since B J(t) is constrained between r J(t) and a quantity (Eq. 8) that gets saller and saller, thus iplying that selecting the ars with the saller B-value, i.e., the set J(t), corresponds to those which actually have the sallest regret, i.e., the ars in S. This arguent will be iplicitly at the basis of the proofs of the two following theores. 4. Regret Bound for the Fixed-Budget Setting Here we prove an upper-bound on the siple-regret of UGapEb. Since the setting considered by the algorith is fixed-budget, we ay set T = n. Fro the definition of the confidence interval β i(t) in Eq. 4 and a union bound, we have that P(E) 1 Kn exp( a). 4 We now have all the tools needed to prove the perforance of UGapEb for the (ɛ,)-best ar identification proble. 4 The extension to a confidence interval that takes into account the variance of the ars is discussed in Appendix B of [7]. 6

Theorem 1. If we run UGapEb with parameter 0 < a ≤ (n − K)/(4H_ɛ), its simple regret r_{Ω(n)} satisfies

δ = P( r_{Ω(n)} ≥ ɛ ) ≤ 2Kn exp(−2a),

and in particular this probability is minimized for a = (n − K)/(4H_ɛ).

Proof. The proof is by contradiction. We assume that r_{Ω(n)} > ɛ on event E and consider the following two steps:

Step 1: Here we show that on event E, we have the following upper bound on the number of pulls of any arm i ∈ A:

T_i(n) < 4ab² / max( (Δ_i + ɛ)/2, ɛ )² + 1.   (13)

Let t_i be the last time that arm i is pulled. If arm i has been pulled only during the initialization phase, T_i(n) = 1 and Eq. 13 trivially holds. If i has been selected by SELECT-ARM, then we have

min( −Δ_i + 2β_i(t_i−1), 0 ) + 2β_i(t_i−1) ≥(a) B(t_i) ≥(b) B_{J(t_i)}(t_i) ≥(c) B_{Ω(n)}(t_l) >(d) ɛ,   (14)

where t_l ∈ {1, ..., n} is the time such that Ω(n) = J(t_l). Here (a) and (b) are the results of Lemmas 2 and 3, (c) holds by the definition of Ω(n), and (d) holds because, using Lemma 1, we know that if the algorithm suffers a simple regret r_{Ω(n)} > ɛ (as assumed at the beginning of the proof), then for all t = 1, ..., n + 1, B_{Ω(n)}(t) > ɛ. By the definition of t_i, we know T_i(n) = T_i(t_i − 1) + 1. Using this fact, the definition of β_i(t_i−1), and Eq. 14, it is straightforward to show that Eq. 13 holds.

Step 2: We know that Σ_{i=1}^{K} T_i(n) = n. Using Eq. 13, we have Σ_{i=1}^{K} 4ab² / max( (Δ_i+ɛ)/2, ɛ )² + K > n, i.e., 4aH_ɛ + K > n, on event E. It is easy to see that by selecting a ≤ (n − K)/(4H_ɛ), the left-hand side of this inequality becomes smaller than or equal to n, which is a contradiction. Thus, we conclude that r_{Ω(n)} ≤ ɛ on event E. The final result follows from the probability of event E defined at the beginning of this section.
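In simulations where the gaps, and hence H_ɛ, are known, Theorem 1 suggests setting a = (n − K)/(4H_ɛ); a small sketch of this tuning and the resulting failure-probability bound (our helper names, hypothetical numbers, reusing the complexity value from the earlier sketch):

```python
import numpy as np

def tuned_a(n, K, H_eps):
    """Largest a allowed by Theorem 1; it also minimizes the bound."""
    return (n - K) / (4 * H_eps)

def failure_bound(n, K, a):
    """Theorem 1: P(r_Omega(n) >= eps) <= 2 K n exp(-2 a)."""
    return 2 * K * n * np.exp(-2 * a)

K, H = 5, 969.4                       # hypothetical instance from the earlier sketch (eps = 0)
for n in (20000, 40000, 100000):
    a = tuned_a(n, K, H)
    print(n, round(a, 2), failure_bound(n, K, a))
# the bound only becomes meaningful once n is large relative to H_eps
```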

4.3 Regret Bound for the Fixed-Confidence Setting

Here we prove an upper bound on the simple regret of UGapEc. Since the setting considered by the algorithm is fixed confidence, we may set T = +∞. From the definition of the confidence interval β_i(t) in Eq. 4 and a union bound on T_k(t) ∈ {0, ..., t} and t = 1, ..., ∞, we have that P(E) ≥ 1 − δ.

Theorem 2. The UGapEc algorithm stops after ñ rounds and returns a set of m arms, Ω(ñ), that satisfies

P( r_{Ω(ñ+1)} ≤ ɛ ∧ ñ ≤ N ) ≥ 1 − δ,

where N = K + O( H_ɛ log(H_ɛ/δ) ) and c has been set to its optimal value 1/2.

Proof. We first prove the bound on the simple regret of UGapEc. Using Lemma 1, we have that on event E, the simple regret of UGapEc upon stopping satisfies B_{J(ñ+1)}(ñ+1) = B_{Ω(ñ+1)}(ñ+1) ≥ r_{Ω(ñ+1)}. As a result, on event E, the regret of UGapEc cannot be bigger than ɛ, since otherwise this would contradict the stopping condition of the algorithm, i.e., B_{J(ñ+1)}(ñ+1) < ɛ. Therefore, we have P( r_{Ω(ñ+1)} ≤ ɛ ) ≥ 1 − δ.

Now we prove the bound on the sample complexity. Similar to the proof of Theorem 1, we consider the following two steps:

Step 1: Here we show that on event E, we have the following upper bound on the number of pulls of any arm i ∈ A:

T_i(ñ) ≤ 2b² log( 4K(ñ−1)³/δ ) / max( (Δ_i + ɛ)/2, ɛ )² + 1.   (15)

Let t_i be the last time that arm i is pulled. If arm i has been pulled only during the initialization phase, T_i(ñ) = 1 and Eq. 15 trivially holds. If i has been selected by SELECT-ARM, then we have B_{J(t_i)}(t_i) ≥ ɛ. Now using Lemma 2, we may write

B_{J(t_i)}(t_i) ≤ min( 0, −Δ_i + 2β_i(t_i−1) ) + 2β_i(t_i−1).   (16)

We can prove Eq. 15 by plugging the value of β_i(t_i−1) from Eq. 4 into Eq. 16 and solving for T_i(t_i), taking into account that T_i(t_i − 1) + 1 = T_i(t_i).

Step 2: We know that Σ_{i=1}^{K} T_i(ñ) = ñ. Using Eq. 15, on event E, we have 2H_ɛ log( 4K(ñ−1)³/δ ) + K ≥ ñ. Solving this inequality gives us ñ ≤ N.

4.4 Problem Complexity

Theorems 1 and 2 indicate that both the probability of success and the sample complexity of UGapE are directly related to the complexity H_ɛ defined by Eq. 6. This implies that H_ɛ captures the intrinsic difficulty of the (ɛ,m)-best arm(s) identification problem, independently from the specific setting considered. Furthermore, note that this definition generalizes existing notions of complexity. For example, for ɛ = 0 and m = 1, we recover the complexity used in the definition of UCB-E [1] for the fixed budget setting and the one defined in [6] for the fixed accuracy problem.

Let us analyze H_ɛ in the general case of ɛ > 0. We define the complexity of a single arm i ∈ A as H_ɛ,i = b² / max( (Δ_i+ɛ)/2, ɛ )². When the gap Δ_i is smaller than the desired accuracy ɛ, i.e., Δ_i ≤ ɛ, the complexity reduces to H_ɛ,i = b²/ɛ². In fact, the algorithm can stop as soon as the desired accuracy ɛ is achieved, which means that there is no need to exactly discriminate between arm i and the best arm. On the other hand, when Δ_i > ɛ, the complexity becomes H_ɛ,i = 4b²/(Δ_i + ɛ)². This shows that when the desired accuracy is smaller than the gap, the complexity of the problem is smaller than in the case of ɛ = 0, for which we have H_0,i = 4b²/Δ_i².

More generally, the analysis reported in the paper suggests that the performance of an upper-confidence-bound-based algorithm such as UGapE is characterized by the same notion of complexity in both settings. Thus, whenever the complexity is known, it is possible to exploit the theoretical analysis (bounds on the performance) to easily switch from one setting to the other. For instance, as also suggested in Section 5.4 of [9], if the complexity H is known, an algorithm like UGapEc can be adapted to run in the fixed budget setting by inverting the bound on its sample complexity. This would lead to an algorithm similar to UGapEb with similar performance, although the parameter tuning could be more difficult because of the intrinsically poor accuracy of the constants in the bound. On the other hand, it is an open question whether it is possible to find an equivalence between algorithms for the two settings when the complexity is not known. In particular, it would be important to derive a distribution-dependent lower bound of the form of the one reported in [1] for the general case of ɛ ≥ 0 and m ≥ 1, for both the fixed budget and fixed confidence settings.
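The two regimes of the per-arm complexity H_ɛ,i discussed above can be checked numerically; a brief sketch (our helper name, hypothetical values):

```python
def per_arm_complexity(gap, b, eps):
    """H_eps,i = b^2 / max((Delta_i + eps) / 2, eps)^2."""
    return b ** 2 / max((gap + eps) / 2, eps) ** 2

b, eps = 1.0, 0.1
for gap in (0.02, 0.1, 0.5):
    # gap <= eps: H_eps,i = b^2 / eps^2 (no need to discriminate the arm exactly);
    # gap  > eps: H_eps,i = 4 b^2 / (gap + eps)^2 < 4 b^2 / gap^2 = H_0,i
    print(gap, per_arm_complexity(gap, b, eps))   # 100.0, 100.0, ~11.1
```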
5 Summary and Discussion

We proposed a meta-algorithm, called unified gap-based exploration (UGapE), that unifies the two settings of the best arm(s) identification problem in stochastic multi-armed bandits: fixed budget and fixed confidence. UGapE can be instantiated as two algorithms with a common structure (the same arm-selection and arm-return strategies) corresponding to these two settings, whose performance can be analyzed in a unified way, i.e., a large portion of their theoretical analysis can be unified in a series of lemmas. We proved a performance bound for the UGapE algorithm in the two settings. We also showed how UGapE and its theoretical analysis can be extended to take into account the variance of the arms and to multiple bandits. Finally, we evaluated the performance of UGapE and compared it with a number of existing fixed budget and fixed confidence algorithms.

This unification is important for both theoretical and algorithmic reasons. Despite their similarities, the fixed budget and fixed confidence settings have been treated differently in the literature. We believe that this unification provides a better understanding of the intrinsic difficulties of the best arm(s) identification problem. In particular, our analysis showed that the same complexity term characterizes the hardness of both settings. As mentioned in the introduction, there was no algorithm available for several settings considered in this paper, e.g., (ɛ,m)-best arm identification with fixed budget. With UGapE, we introduced an algorithm that can be easily adapted to all these settings.

Acknowledgments

This work was supported by the Ministry of Higher Education and Research, the Nord-Pas de Calais Regional Council and FEDER through the "contrat de projets état région 2007-2013", the French National Research Agency (ANR) under project LAMPADA n° ANR-09-EMER-007, the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 270327, and the PASCAL2 European Network of Excellence.

References

[1] J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In Proceedings of the Twenty-Third Annual Conference on Learning Theory, pages 41-53, 2010.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235-256, 2002.
[3] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandit problems. In Proceedings of the Twentieth International Conference on Algorithmic Learning Theory, pages 23-37, 2009.
[4] S. Bubeck, T. Wang, and N. Viswanathan. Multiple identifications in multi-armed bandits. CoRR, abs/1205.3181, 2012.
[5] K. Deng, J. Pineau, and S. Murphy. Active learning for developing personalized treatment. In Proceedings of the Twenty-Seventh International Conference on Uncertainty in Artificial Intelligence, pages 161-168, 2011.
[6] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079-1105, 2006.
[7] V. Gabillon, M. Ghavamzadeh, and A. Lazaric. Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence. Technical report 00747005, October 2012.
[8] V. Gabillon, M. Ghavamzadeh, A. Lazaric, and S. Bubeck. Multi-bandit best arm identification. In Proceedings of Advances in Neural Information Processing Systems 25, pages 2222-2230, 2011.
[9] S. Kalyanakrishnan. Learning Methods for Sequential Decision Making with Imperfect Representations. PhD thesis, Department of Computer Science, The University of Texas at Austin, Austin, Texas, USA, December 2011. Published as UT Austin Computer Science Technical Report TR-11-41.
[10] S. Kalyanakrishnan and P. Stone. Efficient selection of multiple bandit arms: Theory and practice. In Proceedings of the Twenty-Seventh International Conference on Machine Learning, pages 511-518, 2010.
[11] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, 2012.
[12] O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Proceedings of Advances in Neural Information Processing Systems 6, pages 59-66, 1993.
[13] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.
[14] V. Mnih, Cs. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In Proceedings of the Twenty-Fifth International Conference on Machine Learning, pages 672-679, 2008.
[15] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527-535, 1952.