Multi-Bandit Best Arm Identification


Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric
INRIA Lille - Nord Europe, Team SequeL
{victor.gabillon,mohammad.ghavamzadeh,alessandro.lazaric}@inria.fr

Sébastien Bubeck
Department of Operations Research and Financial Engineering, Princeton University
sbubeck@princeton.edu

Abstract

We study the problem of identifying the best arm in each of the bandits in a multi-bandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper-bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.

1 Introduction

Consider a clinical problem with $M$ subpopulations, in which one should decide between $K_m$ options for treating subjects from each subpopulation $m$. A subpopulation may correspond to patients with a particular gene biomarker (or other risk categories) and the treatment options are the available treatments for a disease. The main objective here is to construct a rule which recommends the best treatment for each of the subpopulations. These rules are usually constructed using data from clinical trials that are generally costly to run. Therefore, it is important to distribute the trial resources wisely so that the devised rule yields a good performance. Since it may take significantly more resources to find the best treatment for one subpopulation than for the others, the common strategy of enrolling patients as they arrive may not yield an overall good performance.
Moreover, applying treatment options uniformly at random in a subpopulation could not only waste trial resources, but also run the risk of finding a bad treatment for that subpopulation. This problem can be formulated as the best arm identification over $M$ multi-armed bandits [1], which itself can be seen as the problem of pure exploration [4] over multiple bandits. In this formulation, each subpopulation is considered as a multi-armed bandit, each treatment as an arm, trying a medication on a patient as a pull, and we are asked to recommend an arm for each bandit after a given number of pulls (budget). The evaluation can be based on 1) the average over the bandits of the reward of the recommended arms, or 2) the average probability of error (not selecting the best arm), or 3) the maximum probability of error. Note that this setting is different from the standard multi-armed bandit problem, in which the goal is to maximize the cumulative sum of rewards (see e.g., [13, 3]).

The pure exploration problem is about designing strategies that make the best use of the limited budget (e.g., the total number of patients that can be admitted to the clinical trial) in order to optimize the performance in a decision-making task. Audibert et al. [1] proposed two algorithms to address this problem: 1) a highly exploring strategy based on upper confidence bounds, called UCB-E, in which the optimal value of its parameter depends on some measure of the complexity of the problem, and 2) a parameter-free method based on progressively rejecting the arms which seem to be suboptimal, called Successive Rejects. They showed that both algorithms are nearly optimal since their probability of returning the wrong arm decreases at an exponential rate. Racing algorithms (e.g., [10, 12]) and action-elimination algorithms [7] address this problem under a constraint on the accuracy in identifying the best arm, and they minimize the budget needed to achieve that accuracy. However, UCB-E and Successive Rejects are designed for a single bandit problem and, as we will discuss later, cannot be easily extended to the multi-bandit case studied in this paper. Deng et al. have recently proposed an active learning algorithm for resource allocation over multiple bandits [5]. However, they do not provide any theoretical analysis for their algorithm and only empirically evaluate its performance. Moreover, the target of their proposed algorithm is to minimize the maximum uncertainty in estimating the value of the arms for each bandit. Note that this is different from our target, which is to maximize the quality of the arms recommended for each bandit.

In this paper, we study the problem of best-arm identification in a multi-armed multi-bandit setting under a fixed budget constraint, and propose an algorithm, called Gap-based Exploration (GapE), to solve it. The allocation strategy implemented by GapE focuses on the gap of the arms, i.e., the difference between the mean of the arm and the mean of the best arm (in that bandit). The GapE-variance (GapE-V) algorithm extends this approach by also taking into account the variance of the arms. For both algorithms, we prove an upper-bound on the probability of error that decreases exponentially with the budget. Since both GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is rarely known in advance, we also introduce their adaptive versions. Finally, we evaluate the performance of these algorithms and compare them with Uniform and Uniform+UCB-E strategies on a number of synthetic problems. Our empirical results indicate that 1) GapE and GapE-V have a better performance than Uniform and Uniform+UCB-E, and 2) the adaptive versions of these algorithms match the performance of their non-adaptive counterparts.
2 Problem Setup

In this section, we introduce the notation used throughout the paper and formalize the multi-bandit best arm identification problem. Let $M$ be the number of bandits and $K$ be the number of arms for each bandit (we use indices $m, p, q$ for the bandits and $k, i, j$ for the arms). Each arm $k$ of bandit $m$ is characterized by a distribution $\nu_{mk}$ bounded in $[0, b]$ with mean $\mu_{mk}$ and variance $\sigma_{mk}^2$. In the following, we assume that each bandit has a unique best arm. We denote by $\mu_m^*$ and $k_m^*$ the mean and the index of the best arm of bandit $m$ (i.e., $\mu_m^* = \max_{1 \le k \le K} \mu_{mk}$, $k_m^* = \arg\max_{1 \le k \le K} \mu_{mk}$). In each bandit $m$, we define the gap for each arm as $\Delta_{mk} = |\max_{j \ne k} \mu_{mj} - \mu_{mk}|$.

The clinical trial problem described in Sec. 1 can be formalized as a game between a stochastic multi-bandit environment and a forecaster, where the distributions $\{\nu_{mk}\}$ are unknown to the forecaster. At each round $t = 1, \ldots, n$, the forecaster pulls a bandit-arm pair $I(t) = (m, k)$ and observes a sample drawn from the distribution $\nu_{I(t)}$ independent from the past. The forecaster estimates the expected value of each arm by computing the average of the samples observed over time. Let $T_{mk}(t)$ be the number of times that arm $k$ of bandit $m$ has been pulled by the end of round $t$; then the mean of this arm is estimated as

\[ \hat\mu_{mk}(t) = \frac{1}{T_{mk}(t)} \sum_{s=1}^{T_{mk}(t)} X_{mk}(s), \]

where $X_{mk}(s)$ is the $s$-th sample observed from $\nu_{mk}$. Given the previous definitions, we define the estimated gaps as $\hat\Delta_{mk}(t) = |\max_{j \ne k} \hat\mu_{mj}(t) - \hat\mu_{mk}(t)|$. At the end of round $n$, the forecaster returns for each bandit $m$ the arm with the highest estimated mean, i.e., $J_m(n) = \arg\max_k \hat\mu_{mk}(n)$, and incurs a regret

\[ r(n) = \frac{1}{M} \sum_{m=1}^{M} r_m(n) = \frac{1}{M} \sum_{m=1}^{M} \left( \mu_m^* - \mu_{m J_m(n)} \right). \]

As discussed in the introduction, other performance measures can be defined for this problem. In some applications, returning the wrong arm is considered as an error independently from its regret, and thus, the objective is to minimize the average probability of error

\[ e(n) = \frac{1}{M} \sum_{m=1}^{M} e_m(n) = \frac{1}{M} \sum_{m=1}^{M} P\left( J_m(n) \ne k_m^* \right). \]

Finally, in problems similar to the clinical trial, a reasonable objective is to return the right treatment for all the genetic profiles and not just to have a small average probability of error. In this case, the global performance of the forecaster can be measured as

\[ \ell(n) = \max_m \ell_m(n) = \max_m P\left( J_m(n) \ne k_m^* \right). \]

It is interesting to note the relationship between these three performance measures:

\[ \min_{m,k} \Delta_{mk} \, e(n) \le E[r(n)] \le b \, e(n) \le b \, \ell(n), \]

where the expectation in the regret is w.r.t. the random samples. As a result, any algorithm minimizing the worst case probability of error, $\ell(n)$, also controls the average probability of error, $e(n)$, and the simple regret $E[r(n)]$. Note that the algorithms introduced in this paper directly target the problem of minimizing $\ell(n)$.
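The definitions of this section translate almost line by line into code. The following is a minimal Python sketch, not part of the original paper: the function names `gaps`, `recommend`, and `simple_regret` are hypothetical, and the arm means in the usage lines are illustrative values chosen only to exercise the formulas.

```python
def gaps(mu):
    """Return the gaps Delta_mk = |max_{j != k} mu[m][j] - mu[m][k]|.

    `mu` is a list of M lists, each holding the K >= 2 arm means of one bandit.
    """
    all_gaps = []
    for arms in mu:
        row = []
        for k, value in enumerate(arms):
            best_other = max(v for j, v in enumerate(arms) if j != k)
            row.append(abs(best_other - value))
        all_gaps.append(row)
    return all_gaps


def recommend(mu_hat):
    """J_m(n): index of the arm with the highest estimated mean in each bandit."""
    return [max(range(len(arms)), key=arms.__getitem__) for arms in mu_hat]


def simple_regret(mu, chosen):
    """r(n): average over bandits of mu*_m - mu_{m, J_m(n)}."""
    return sum(max(arms) - arms[j] for arms, j in zip(mu, chosen)) / len(mu)


# Illustrative means for M = 2 bandits with K = 4 arms each.
mu = [[0.5, 0.45, 0.4, 0.3], [0.5, 0.3, 0.2, 0.1]]
print(gaps(mu))
print(simple_regret(mu, chosen=[0, 1]))
```

Note that with the absolute value in the definition, the best arm and the second best arm of a bandit always receive the same gap, which is the property the GapE index relies on.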

3 The Gap-based Exploration Algorithm

Parameters: number of rounds $n$, exploration parameter $a$, maximum range $b$
Initialize: $T_{mk}(0) = 0$, $\hat\Delta_{mk}(0) = 0$ for all bandit-arm pairs $(m, k)$
for $t = 1, 2, \ldots, n$ do
    Compute $B_{mk}(t) = -\hat\Delta_{mk}(t-1) + b \sqrt{a / T_{mk}(t-1)}$ for all bandit-arm pairs $(m, k)$
    Draw $I(t) \in \arg\max_{m,k} B_{mk}(t)$
    Observe $X_{I(t)}(T_{I(t)}(t-1) + 1) \sim \nu_{I(t)}$
    Update $T_{I(t)}(t) = T_{I(t)}(t-1) + 1$ and $\hat\Delta_{mk}(t)$ for all arms $k$ of the selected bandit
end for
Return $J_m(n) \in \arg\max_{k \in \{1, \ldots, K\}} \hat\mu_{mk}(n)$ for all $m \in \{1, \ldots, M\}$

Figure 1: The pseudo-code of the gap-based Exploration (GapE) algorithm.

Fig. 1 contains the pseudo-code of the gap-based exploration (GapE) algorithm. GapE flattens the bandit-arm structure and reduces it to a single-bandit problem with $MK$ arms. At each time step $t$, the algorithm relies on the observations up to time $t-1$ to build an index $B_{mk}(t)$ for each bandit-arm pair, and then selects the pair $I(t)$ with the highest index. The index $B_{mk}$ consists of two terms. The first term is the negative of the estimated gap for arm $k$ in bandit $m$. Similar to other upper-confidence bound (UCB) methods [3], the second part is an exploration term which forces the algorithm to pull arms that have been less explored. As a result, the algorithm tends to pull arms with small estimated gap and small number of pulls. The exploration parameter $a$ tunes the level of exploration of the algorithm. As shown by the theoretical analysis of Sec. 3.1, if the time horizon $n$ is known, $a$ should be set to $a = \frac{4}{9} \frac{n - MK}{H}$, where $H = \sum_{m,k} b^2 / \Delta_{mk}^2$ is the complexity of the problem (see Sec. 3.1 for further discussion). Note that GapE differs from most standard bandit strategies in the sense that the B-index for an arm depends explicitly on the statistics of the other arms. This feature makes the analysis of this algorithm much more involved.

As we may notice from Fig. 1, GapE resembles the UCB-E algorithm [1] designed to solve the pure exploration problem in the single-bandit setting. Nonetheless, the use of the negative estimated gap ($-\hat\Delta_{mk}$) instead of the estimated mean ($\hat\mu_{mk}$) (used by UCB-E) is crucial in the multi-bandit setting.
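The pseudo-code of Fig. 1 can be turned into a short runnable sketch. The Python version below is an illustration under stated assumptions rather than the authors' implementation: the names `gape` and `estimated_gaps` are hypothetical, ties in the index are broken in favour of the first pair encountered, unpulled pairs are given an infinite index (which is what $T_{mk}(0) = 0$ implies), and the Bernoulli instance and the value of `a` in the usage lines are made up for the example.

```python
import math
import random


def estimated_gaps(means):
    """Estimated gaps of one bandit: |max_{j != k} mean_j - mean_k| for each arm k."""
    return [abs(max(v for j, v in enumerate(means) if j != k) - mean_k)
            for k, mean_k in enumerate(means)]


def gape(sample, M, K, n, a, b=1.0):
    """Run GapE for n >= M*K rounds (K >= 2). sample(m, k) returns a reward in [0, b]."""
    pulls = [[0] * K for _ in range(M)]        # T_mk
    totals = [[0.0] * K for _ in range(M)]     # sum of observed rewards
    gaps = [[0.0] * K for _ in range(M)]       # estimated gaps, initialised to 0

    for _ in range(n):
        best_value, best_pair = -math.inf, (0, 0)
        for m in range(M):
            for k in range(K):
                # An unpulled pair gets an infinite index, so every pair is tried once.
                bonus = b * math.sqrt(a / pulls[m][k]) if pulls[m][k] else math.inf
                value = -gaps[m][k] + bonus
                if value > best_value:
                    best_value, best_pair = value, (m, k)
        m, k = best_pair
        totals[m][k] += sample(m, k)
        pulls[m][k] += 1
        # Only the gaps of the selected bandit change; unpulled arms count as mean 0.
        means = [totals[m][j] / pulls[m][j] if pulls[m][j] else 0.0 for j in range(K)]
        gaps[m] = estimated_gaps(means)

    means = [[totals[m][k] / pulls[m][k] for k in range(K)] for m in range(M)]
    recommended = [max(range(K), key=row.__getitem__) for row in means]
    return recommended, pulls


# Hypothetical two-bandit Bernoulli instance (means chosen for illustration).
mu = [[0.5, 0.45, 0.4, 0.3], [0.5, 0.3, 0.2, 0.1]]
rng = random.Random(0)
recommended, pulls = gape(lambda m, k: float(rng.random() < mu[m][k]),
                          M=2, K=4, n=700, a=0.3)
print(recommended, [sum(row) for row in pulls])
```

Because the index uses the negative estimated gap, the bandit whose arms are harder to separate should end up with most of the budget, which is the behaviour the text describes.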
In the single-bandit problem, since the best and second best arms have the same gap ($\Delta_{m k_m^*} = \min_{k \ne k_m^*} \Delta_{mk}$), GapE considers them equivalent and tends to pull them the same amount of time, while UCB-E tends to pull the best arm more often than the second best one. Despite this difference, the performance of both algorithms in predicting the best arm after $n$ pulls would be the same. This is due to the fact that the probability of error depends on the capability of the algorithm to distinguish optimal and suboptimal arms, and this is not affected by a different allocation over the best and second best arms as long as the number of pulls allocated to that pair is large enough w.r.t. their gap. Despite this similarity, the two approaches become completely different in the multi-bandit case. In this case, if we run UCB-E on all the $MK$ arms, it tends to pull more the arm with the highest mean over all the bandits, i.e., $k^* = \arg\max_{m,k} \mu_{mk}$. As a result, it would be accurate in predicting the best arm $k^*$ over bandits, but may have an arbitrarily bad performance in predicting the best arm for each bandit, and thus, may incur a large error $\ell(n)$. On the other hand, GapE focuses on the arms with the smallest gaps. This way, it assigns more pulls to bandits whose optimal arms are difficult to identify (i.e., bandits with arms with small gaps), and as shown in the next section, it achieves a high probability of identifying the best arm in each bandit.

3.1 Theoretical Analysis

In this section, we derive an upper-bound on the probability of error $\ell(n)$ for the GapE algorithm.

Theorem 1. If we run GapE with parameter $0 < a \le \frac{4}{9} \frac{n - MK}{H}$, then its probability of error satisfies

\[ \ell(n) \le P\left( \exists m : J_m(n) \ne k_m^* \right) \le 2MKn \exp\left( -\frac{a}{64} \right), \]

in particular for $a = \frac{4}{9} \frac{n - MK}{H}$, we have $\ell(n) \le 2MKn \exp\left( -\frac{1}{144} \frac{n - MK}{H} \right)$.

Remark 1 (Analysis of the bound). If the time horizon $n$ is known in advance, it would be possible to set the exploration parameter $a$ as a linear function of $n$, and as a result, the probability of error of GapE decreases exponentially with the time horizon. The other interesting aspect of the bound is the complexity term $H$ appearing in the optimal value of the exploration parameter $a$ (i.e., $a = \frac{4}{9} \frac{n - MK}{H}$). If we denote by $H_{mk} = b^2 / \Delta_{mk}^2$ the complexity of arm $k$ in bandit $m$, it is clear from the definition of $H$ that each arm has an additive impact on the overall complexity of the multi-bandit problem. Moreover, if we define the complexity of each bandit $m$ as $H_m = \sum_k b^2 / \Delta_{mk}^2$ (similar to the definition of complexity for UCB-E in [1]), the GapE complexity may be rewritten as $H = \sum_m H_m$. This means that the complexity of GapE is simply the sum of the complexities of all the bandits.

Remark 2 (Comparison with the static allocation strategy). The main objective of GapE is to trade off between allocating pulls according to the gaps (more precisely, according to the complexities $H_{mk}$) and the exploration needed to improve the accuracy of their estimates. If the gaps were known in advance, a nearly-optimal static allocation strategy assigns to each bandit-arm pair a number of pulls proportional to its complexity. Let us consider a strategy that pulls each arm a fixed number of times over the horizon $n$. The probability of error for this strategy may be bounded as

\[ \ell_{\mathrm{Static}}(n) \le P\left( \exists m : J_m(n) \ne k_m^* \right) \le \sum_{m=1}^{M} P\left( J_m(n) \ne k_m^* \right) \le \sum_{m=1}^{M} \sum_{k \ne k_m^*} P\left( \hat\mu_{m k_m^*}(n) \le \hat\mu_{mk}(n) \right) \]
\[ \le \sum_{m=1}^{M} \sum_{k \ne k_m^*} \exp\left( -T_{mk}(n) \frac{\Delta_{mk}^2}{b^2} \right) = \sum_{m=1}^{M} \sum_{k \ne k_m^*} \exp\left( -T_{mk}(n) H_{mk}^{-1} \right). \quad (1) \]

Given the constraint $\sum_{m,k} T_{mk}(n) = n$, the allocation minimizing the last term in Eq. 1 is $T_{mk}^*(n) = n H_{mk} / H$. We refer to this fixed strategy as StaticGap. Although this is not necessarily the optimal static strategy ($T_{mk}^*(n)$ minimizes an upper-bound), this allocation guarantees a probability of error smaller than $MK \exp(-n/H)$. Theorem 1 shows that, for $n$ large enough, GapE achieves the same performance as the static allocation StaticGap.

Remark 3 (Comparison with other allocation strategies). At the beginning of Sec. 3, we discussed the difference between GapE and UCB-E. Here we compare the bound reported in Theorem 1 with the performance of the Uniform and combined Uniform+UCB-E allocation strategies.
In the uniform allocation strategy, the total budget $n$ is uniformly split over all the bandits and arms. As a result, each bandit-arm pair is pulled $T_{mk}(n) = n/(MK)$ times. Using the same derivation as in Remark 2, the probability of error $\ell(n)$ for this strategy may be bounded as

\[ \ell_{\mathrm{Unif}}(n) \le \sum_{m=1}^{M} \sum_{k \ne k_m^*} \exp\left( -\frac{n}{MK} \frac{\Delta_{mk}^2}{b^2} \right) \le MK \exp\left( -\frac{n}{MK \max_{m,k} H_{mk}} \right). \]

In the Uniform+UCB-E allocation strategy, i.e., a two-level algorithm that first selects a bandit uniformly and then pulls arms within each bandit using UCB-E, the total number of pulls for each bandit $m$ is $\sum_k T_{mk}(n) = n/M$, while the number of pulls $T_{mk}(n)$ over the arms in bandit $m$ is determined by UCB-E. Thus, the probability of error of this strategy may be bounded as

\[ \ell_{\mathrm{Unif+UCB\text{-}E}}(n) \le \sum_{m=1}^{M} 2nK \exp\left( -\frac{n/M - K}{18 H_m} \right) \le 2nMK \exp\left( -\frac{n/M - K}{18 \max_m H_m} \right), \]

where the first inequality follows from Theorem 1 in [1] (recall that $H_m = \sum_k b^2 / \Delta_{mk}^2$). Let $b = 1$ (i.e., all the arms have distributions bounded in $[0, 1]$). Up to constants and multiplicative factors in front of the exponentials, and if $n$ is large enough compared to $M$ and $K$ (so as to approximate $n/M - K$ by $n/M$ and $n - MK$ by $n$), the probability of error for the three algorithms may be bounded as

\[ \ell_{\mathrm{Unif}}(n) \le \exp\left( -O\left( \frac{n}{MK \max_{m,k} H_{mk}} \right) \right), \quad \ell_{\mathrm{U+UCBE}}(n) \le \exp\left( -O\left( \frac{n}{M \max_m H_m} \right) \right), \quad \ell_{\mathrm{GapE}}(n) \le \exp\left( -O\left( \frac{n}{\sum_{m,k} H_{mk}} \right) \right). \]

By comparing the arguments of the exponential terms, we have the trivial sequence of inequalities $MK \max_{m,k} H_{mk} \ge M \max_m \sum_k H_{mk} \ge \sum_{m,k} H_{mk}$, which implies that the upper bound on the probability of error of GapE is usually significantly smaller. This relationship, which is confirmed by the experiments reported in Sec. 4, shows that GapE is able to adapt to the complexity $H$ of the overall multi-bandit problem better than the other two allocation strategies. In fact, while the performance of the Uniform strategy depends on the most complex arm over the bandits and the strategy Unif+UCB-E is affected by the most complex bandit, the performance of GapE depends on the sum of the complexities of all the arms involved in the pure exploration problem.
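To make the comparison in Remarks 2 and 3 concrete, the three quantities that divide $n$ in the exponents, together with the StaticGap allocation $T_{mk}^* = n H_{mk} / H$, can be computed for a small instance. This is a sketch only: the function names are hypothetical, the allocation is left real-valued rather than rounded to integers, and the arm means are illustrative.

```python
def arm_complexities(mu, b=1.0):
    """H_mk = b^2 / Delta_mk^2 for every bandit m and arm k (needs K >= 2)."""
    table = []
    for arms in mu:
        row = []
        for k, value in enumerate(arms):
            gap = abs(max(v for j, v in enumerate(arms) if j != k) - value)
            row.append(b * b / gap ** 2)
        table.append(row)
    return table


def bound_denominators(mu, b=1.0):
    """Quantities dividing n in the exponents of the Uniform, Unif+UCB-E and GapE bounds."""
    h = arm_complexities(mu, b)
    M, K = len(h), len(h[0])
    uniform = M * K * max(max(row) for row in h)    # M K max_{m,k} H_mk
    two_level = M * max(sum(row) for row in h)      # M max_m H_m
    gape = sum(sum(row) for row in h)               # H = sum_{m,k} H_mk
    return uniform, two_level, gape


def static_gap_allocation(mu, n, b=1.0):
    """StaticGap allocation T*_mk = n * H_mk / H (real-valued, not rounded)."""
    h = arm_complexities(mu, b)
    total = sum(sum(row) for row in h)
    return [[n * value / total for value in row] for row in h]


mu = [[0.5, 0.45, 0.4, 0.3], [0.5, 0.3, 0.2, 0.1]]   # illustrative means
print(bound_denominators(mu))
print(static_gap_allocation(mu, n=700))
```

A smaller denominator means a faster exponential decay of the bound, so the ordering of the three printed values mirrors the ordering of the three strategies in Remark 3.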

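Proof of Theorem 1.

Step 1. Let us consider the following event:

\[ \mathcal{E} = \left\{ \forall m \in \{1, \ldots, M\}, \ \forall k \in \{1, \ldots, K\}, \ \forall t \in \{1, \ldots, n\}, \ \left| \hat\mu_{mk}(t) - \mu_{mk} \right| < bc \sqrt{\frac{a}{T_{mk}(t)}} \right\}. \]

From Chernoff-Hoeffding's inequality and a union bound, we have $P(\mathcal{E}) \ge 1 - 2MKn \exp(-2ac^2)$. Now we would like to prove that on the event $\mathcal{E}$, we find the best arm for all the bandits, i.e., $J_m(n) = k_m^*$ for all $m \in \{1, \ldots, M\}$. Since $J_m(n)$ is the empirical best arm of bandit $m$, we should prove that for any $k \in \{1, \ldots, K\}$, $\hat\mu_{mk}(n) \le \hat\mu_{m k_m^*}(n)$. By upper-bounding the LHS and lower-bounding the RHS of this inequality, we note that it would be enough to prove $bc \sqrt{a / T_{mk}(n)} \le \Delta_{mk} / 2$ on the event $\mathcal{E}$, or equivalently, to prove that for any bandit-arm pair $(m, k)$, we have $T_{mk}(n) \ge \frac{4ab^2c^2}{\Delta_{mk}^2}$.

Step 2. In this step, we show that in GapE, for any bandits $(m, q)$ and arms $(k, j)$, and for any $t \ge MK$, the following dependence between the number of pulls of the arms holds:

\[ -\Delta_{mk} + (1+d) b \sqrt{\frac{a}{\max\left( T_{mk}(t) - 1, 1 \right)}} \ge -\Delta_{qj} + (1-d) b \sqrt{\frac{a}{T_{qj}(t)}}, \quad (2) \]

where $d \in [0, 1]$. We prove this inequality by induction.

Base step. We know that after the first $MK$ rounds of the GapE algorithm, all the arms have been pulled once, i.e., $T_{mk}(t) = 1$ for all $m, k$; thus if $a \ge 1/(4d^2)$, the inequality (2) holds for $t = MK$.

Inductive step. Let us assume that (2) holds at time $t-1$ and we pull arm $i$ of bandit $p$ at time $t$, i.e., $I(t) = (p, i)$. So at time $t$, the inequality (2) trivially holds for every choice of $m$, $q$, $k$, and $j$, except when $(m, k) = (p, i)$. As a result, in the inductive step, we only need to prove that the following holds for any $q \in \{1, \ldots, M\}$ and $j \in \{1, \ldots, K\}$:

\[ -\Delta_{pi} + (1+d) b \sqrt{\frac{a}{\max\left( T_{pi}(t) - 1, 1 \right)}} \ge -\Delta_{qj} + (1-d) b \sqrt{\frac{a}{T_{qj}(t)}}. \quad (3) \]

Since arm $i$ of bandit $p$ has been pulled at time $t$, we have that for any bandit-arm pair $(q, j)$

\[ -\hat\Delta_{pi}(t-1) + b \sqrt{\frac{a}{T_{pi}(t-1)}} \ge -\hat\Delta_{qj}(t-1) + b \sqrt{\frac{a}{T_{qj}(t-1)}}. \quad (4) \]

To prove (3), we first prove an upper-bound for $-\hat\Delta_{pi}(t-1)$ and a lower-bound for $-\hat\Delta_{qj}(t-1)$:

\[ -\hat\Delta_{pi}(t-1) \le -\Delta_{pi} + \frac{2bc}{1-c} \sqrt{\frac{a}{T_{pi}(t) - 1}} \quad \text{and} \quad -\hat\Delta_{qj}(t-1) \ge -\Delta_{qj} - \frac{2\sqrt{2}\, bc}{1-d} \sqrt{\frac{a}{T_{qj}(t)}}. \quad (5) \]

We report the proofs of the inequalities in (5) in App. B of [8].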
The inequality (3), and as a result the inductive step, is proved by replacing $-\hat\Delta_{pi}(t-1)$ and $-\hat\Delta_{qj}(t-1)$ in (4) using (5), under the conditions that $d \ge \frac{2c}{1-c}$ and $d \ge \frac{2\sqrt{2}\, c}{1-d}$. These conditions are satisfied by $d = 1/2$ and $c = \sqrt{2}/16$.

Step 3. In order to prove the condition on $T_{mk}(n)$ in Step 1, we need to find a lower-bound on the number of pulls of all the arms at time $t = n$ (at the end). Let us assume that arm $k$ of bandit $m$ has been pulled less than $\frac{ab^2(1-d)^2}{\Delta_{mk}^2}$ times, which indicates that $-\Delta_{mk} + (1-d) b \sqrt{a / T_{mk}(n)} > 0$. From this result and (2), we have $-\Delta_{qj} + (1+d) b \sqrt{a / (T_{qj}(n) - 1)} > 0$, or equivalently $T_{qj}(n) - 1 < \frac{ab^2(1+d)^2}{\Delta_{qj}^2}$ for any pair $(q, j)$. We also know that $\sum_{q,j} T_{qj}(n) = n$. From these, we deduce that $n - MK < ab^2(1+d)^2 \sum_{q,j} \frac{1}{\Delta_{qj}^2}$. So, if we select $a$ such that $n - MK \ge ab^2(1+d)^2 \sum_{q,j} \frac{1}{\Delta_{qj}^2}$, we contradict the first assumption that $T_{mk}(n) < \frac{ab^2(1-d)^2}{\Delta_{mk}^2}$, which means that $T_{mk}(n) \ge \frac{4ab^2c^2}{\Delta_{mk}^2}$ for any pair $(m, k)$, when $1 - d \ge 2c$. This concludes the proof. The condition for $a$ in the statement of the theorem comes from our choice of $a$ in this step and the values of $c$ and $d$ from the inductive step.

3.2 Extensions

In this section we propose two variants of the GapE algorithm with the objective of extending its applicability and improving its performance.
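Before turning to the variants, the constants used in the proof of Theorem 1 can be checked mechanically. The sketch below is only a sanity check written for this purpose (all variable names are hypothetical): it verifies that $d = 1/2$ and $c = \sqrt{2}/16$ satisfy the three conditions used in Steps 1 to 3, and that they yield the factors $1/64$, $4/9$ and $1/144$ appearing in the theorem. Exact fractions are used wherever $c$ enters only through $c^2$ or $\sqrt{2}\,c$, since the second condition holds with equality and would be sensitive to floating-point rounding.

```python
import math
from fractions import Fraction

d = Fraction(1, 2)
c_squared = Fraction(2, 256)           # c = sqrt(2) / 16, so c^2 = 2 / 256
c = math.sqrt(c_squared)               # floating-point value of c
sqrt2_times_c = Fraction(2, 16)        # sqrt(2) * sqrt(2) / 16, computed exactly

checks = {
    "d >= 2c / (1 - c)": float(d) >= 2 * c / (1 - c),
    "d >= 2 sqrt(2) c / (1 - d)": d >= 2 * sqrt2_times_c / (1 - d),   # holds with equality
    "1 - d >= 2c": float(1 - d) >= 2 * c,
}

exponent_rate = 2 * c_squared          # P(E) >= 1 - 2MKn exp(-2 a c^2)
a_upper_factor = 1 / (1 + d) ** 2      # a <= a_upper_factor * (n - MK) / H
a_lower_bound = 1 / (4 * d ** 2)       # base step of the induction: a >= 1 / (4 d^2)
final_rate = exponent_rate * a_upper_factor

print(checks)
print(exponent_rate, a_upper_factor, a_lower_bound, final_rate)
```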

GapE with variance (GapE-V). The allocation strategy implemented by GapE focuses only on the arms with small gap and does not take into consideration their variance. However, it is clear that the arms with small variance, even if their gap is small, just need a few pulls to be correctly estimated. In order to take into account both the gaps and variances of the arms, we introduce the GapE-variance (GapE-V) algorithm. Let

\[ \hat\sigma_{mk}^2(t) = \frac{1}{T_{mk}(t) - 1} \sum_{s=1}^{T_{mk}(t)} X_{mk}^2(s) - \hat\mu_{mk}^2(t) \]

be the estimated variance for arm $k$ of bandit $m$ at the end of round $t$. GapE-V uses the following B-index for each arm:

\[ B_{mk}(t) = -\hat\Delta_{mk}(t-1) + \sqrt{\frac{2a\, \hat\sigma_{mk}^2(t-1)}{T_{mk}(t-1)}} + \frac{7ab}{3\left( T_{mk}(t-1) - 1 \right)}. \]

Note that the exploration term in the B-index now has two components: the first one depends on the empirical variance and the second one decreases as $O(1/T_{mk})$. As a result, arms with low variance will be explored much less than in the GapE algorithm. Similar to the difference between UCB [3] and UCB-V [2], while the B-index in GapE is motivated by Hoeffding's inequalities, the one for GapE-V is obtained using an empirical Bernstein's inequality [11, 2].

The following performance bound can be proved for the GapE-V algorithm. We report the proof of Theorem 2 in App. C of [8].

Theorem 2. If GapE-V is run with parameter $0 < a \le \frac{8}{9} \frac{n - 2MK}{H^\sigma}$, then it satisfies

\[ \ell(n) \le P\left( \exists m : J_m(n) \ne k_m^* \right) \le 6nMK \exp\left( -\frac{9}{64^2}\, a \right), \]

in particular for $a = \frac{8}{9} \frac{n - 2MK}{H^\sigma}$, we have $\ell(n) \le 6nMK \exp\left( -\frac{1}{64 \times 8} \frac{n - 2MK}{H^\sigma} \right)$.

In Theorem 2, $H^\sigma$ is the complexity of the GapE-V algorithm and is defined as

\[ H^\sigma = \sum_{m=1}^{M} \sum_{k=1}^{K} \frac{\left( \sigma_{mk} + \sqrt{\sigma_{mk}^2 + (16/3)\, b\, \Delta_{mk}} \right)^2}{\Delta_{mk}^2}. \]

Although the variance-complexity $H^\sigma$ could be larger than the complexity $H$ used in GapE, whenever the variances of the arms are small compared to the range $b$ of the distribution, we expect $H^\sigma$ to be smaller than $H$. Furthermore, if the arms have very different variances, then GapE-V is expected to better capture the complexity of each arm and allocate the pulls accordingly.
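The effect of the variance term is easy to see by evaluating the two indices side by side. The sketch below assumes the estimated gap and the estimated variance are already available and simply evaluates the GapE and GapE-V index formulas; the function names and the numbers in the loop are hypothetical and chosen for illustration only.

```python
import math


def gape_index(gap_hat, pulls, a, b=1.0):
    """GapE index: -gap_hat + b * sqrt(a / T)."""
    return -gap_hat + b * math.sqrt(a / pulls)


def gape_v_index(gap_hat, var_hat, pulls, a, b=1.0):
    """GapE-V index; needs pulls >= 2 because of the 1 / (T - 1) term."""
    variance_term = math.sqrt(2.0 * a * var_hat / pulls)
    range_term = 7.0 * a * b / (3.0 * (pulls - 1))
    return -gap_hat + variance_term + range_term


# Illustrative numbers: same estimated gap and pull count, different variances.
for var_hat in (0.0, 0.01, 0.25):
    print(var_hat, gape_index(0.05, 50, a=1.0), gape_v_index(0.05, var_hat, 50, a=1.0))
```

For an arm with very small estimated variance only the $O(1/T_{mk})$ term remains, so its GapE-V index drops below the GapE index after a moderate number of pulls, whereas a high-variance arm keeps a large exploration bonus.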
For instance, in the case where all the gaps are the same, GapE tends to allocate pulls proportionally to the complexity $H_{mk}$ and it would perform an almost uniform allocation over bandits and arms. On the other hand, the variances of the arms could be very heterogeneous and GapE-V would adapt the allocation strategy by pulling more often the arms whose values are more uncertain.

Adaptive GapE and GapE-V. A drawback of GapE and GapE-V is that the exploration parameter $a$ should be tuned according to the complexities $H$ and $H^\sigma$ of the multi-bandit problem, which are rarely known in advance. A straightforward solution to this issue is to move to an adaptive version of these algorithms by substituting $H$ and $H^\sigma$ with suitable estimates $\hat H$ and $\hat H^\sigma$. At each step $t$ of the adaptive GapE and GapE-V algorithms, we estimate these complexities as

\[ \hat H(t) = \sum_{m,k} \frac{b^2}{\mathrm{UCB}_{\Delta_i}(t)^2}, \qquad \hat H^\sigma(t) = \sum_{m,k} \frac{\left( \mathrm{LCB}_{\sigma_i}(t) + \sqrt{\mathrm{LCB}_{\sigma_i}(t)^2 + (16/3)\, b\, \mathrm{UCB}_{\Delta_i}(t)} \right)^2}{\mathrm{UCB}_{\Delta_i}(t)^2}, \]

where the index $i$ stands for the bandit-arm pair $(m, k)$ and

\[ \mathrm{UCB}_{\Delta_i}(t) = \hat\Delta_i(t-1) + \sqrt{\frac{1}{2 T_i(t-1)}} \quad \text{and} \quad \mathrm{LCB}_{\sigma_i}(t) = \max\left( 0, \ \hat\sigma_i(t-1) - \sqrt{\frac{2}{T_i(t-1) - 1}} \right). \]

Similar to the adaptive version of UCB-E in [1], $\hat H$ and $\hat H^\sigma$ are lower-confidence bounds on the true complexities $H$ and $H^\sigma$. Note that the GapE and GapE-V bounds written for the optimal value of $a$ indicate an inverse relation between the complexity and the exploration. By using a lower-bound on the true $H$ and $H^\sigma$, the algorithms tend to explore arms more uniformly and this allows them to increase the accuracy of their estimated complexities. Although we do not analyze these algorithms, we empirically show in Sec. 4 that they are in fact able to match the performance of the GapE and GapE-V algorithms.

4 Numerical Simulations

In this section, we report numerical simulations of the gap-based algorithms presented in this paper, GapE and GapE-V, and their adaptive versions A-GapE and A-GapE-V, and compare them with the Unif and Unif+UCB-E algorithms introduced in Sec. 3.1. The results of our experiments (both those in the paper and those in App. A of [8]) indicate that 1) GapE successfully adapts its allocation strategy to the complexity of each bandit and outperforms the uniform allocation strategies, 2) the use of the empirical variance in GapE-V can significantly improve the performance over GapE, and 3) the adaptive versions of GapE and GapE-V that estimate the complexities $H$ and $H^\sigma$ online attain the same performance as the basic algorithms, which receive $H$ and $H^\sigma$ as an input.

Figure 2: (left) Problem 1: Comparison between GapE, adaptive GapE, and the uniform strategies. (right) Problem 2: Comparison between GapE, GapE-V, and adaptive GapE-V algorithms. Both panels plot the maximum probability of error against the parameter $\eta$.

Figure 3: Performance of the algorithms in Problem 3 (maximum probability of error against the parameter $\eta$) for Unif+UCBE, Unif+A-UCBE, Unif+UCBE-V, Unif+A-UCBE-V, GapE, A-GapE, GapE-V, and A-GapE-V.

Experimental setting. We use the following three problems in our experiments. Note that $b = 1$ and that a Rademacher distribution with parameters $(x, y)$ takes value $x$ or $y$ with probability 1/2.

Problem 1. $n = 700$, $M = 2$, $K = 4$. The arms have Bernoulli distribution with parameters: bandit 1 = (0.5, 0.45, 0.4, 0.3), bandit 2 = (0.5, 0.3, 0.2, 0.1).

Problem 2. $n = 1000$, $M = 2$, $K = 4$. The arms have Rademacher distribution with parameters $(x, y)$: bandit 1 = {(0, 1.0), (0.45, 0.45), (0.25, 0.65), (0, 0.9)} and bandit 2 = {(0.4, 0.6), (0.45, 0.45), (0.35, 0.55), (0.25, 0.65)}.

Problem 3. $n = 1400$, $M = 4$, $K = 4$. The arms have Rademacher distribution with parameters $(x, y)$: bandit 1 = {(0, 1.0), (0.45, 0.45), (0.25, 0.65), (0, 0.9)}, bandit 2 = {(0.4, 0.6), (0.45, 0.45), (0.35, 0.55), (0.25, 0.65)}, bandit 3 = {(0, 1.0), (0.45, 0.45), (0.25, 0.65), (0, 0.9)}, and bandit 4 = {(0.4, 0.6), (0.45, 0.45), (0.35, 0.55), (0.25, 0.65)}.

All the algorithms, except the uniform allocation, have an exploration parameter $a$.
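The complexities of Problems 1 and 2 follow directly from the definitions of $H$ and $H^\sigma$ and the parameters listed above. The sketch below computes them per bandit; it is an illustration with hypothetical function names, and it uses the fact that a variable taking the values $x$ or $y$ with probability 1/2 has mean $(x+y)/2$ and standard deviation $|y-x|/2$.

```python
import math


def gaps(means):
    """Gaps of one bandit: |max_{j != k} mean_j - mean_k| for each arm k."""
    return [abs(max(v for j, v in enumerate(means) if j != k) - mean_k)
            for k, mean_k in enumerate(means)]


def complexity_h(means, b=1.0):
    """H_m = sum_k b^2 / Delta_mk^2 for one bandit."""
    return sum(b * b / g ** 2 for g in gaps(means))


def complexity_h_sigma(means, sigmas, b=1.0):
    """H^sigma_m = sum_k (sigma + sqrt(sigma^2 + (16/3) b Delta))^2 / Delta^2."""
    return sum((s + math.sqrt(s * s + (16.0 / 3.0) * b * g)) ** 2 / g ** 2
               for g, s in zip(gaps(means), sigmas))


def rademacher_stats(params):
    """Means and standard deviations of variables equal to x or y with probability 1/2."""
    means = [(x + y) / 2.0 for x, y in params]
    sigmas = [abs(y - x) / 2.0 for x, y in params]
    return means, sigmas


problem1 = [[0.5, 0.45, 0.4, 0.3], [0.5, 0.3, 0.2, 0.1]]          # Bernoulli means
problem2 = [[(0, 1.0), (0.45, 0.45), (0.25, 0.65), (0, 0.9)],
            [(0.4, 0.6), (0.45, 0.45), (0.35, 0.55), (0.25, 0.65)]]

h_problem1 = [complexity_h(means) for means in problem1]
h_problem2, h_sigma_problem2 = [], []
for params in problem2:
    means, sigmas = rademacher_stats(params)
    h_problem2.append(complexity_h(means))
    h_sigma_problem2.append(complexity_h_sigma(means, sigmas))
print(h_problem1, h_problem2, h_sigma_problem2)
```

In Problem 1 the first bandit comes out far more complex than the second, while in Problem 2 the two bandits share the same $H$ and differ only through $H^\sigma$, which is the contrast the experiments are built around.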
The theoretical analysis suggests that $a$ should be proportional to $n/H$. Although $a$ could be optimized according to the bound, since the constants in the analysis are not accurate, we will run the algorithms with $a = \eta\, n / H$, where $\eta$ is a parameter which is empirically tuned (in the experiments we report four different values for $\eta$). If $H$ correctly defines the complexity of the exploration problem (i.e., the number of samples needed to find the best arms with high probability), $\eta$ should simply correct the inaccuracy of the constants in the analysis, and thus, the range of its nearly-optimal values should be constant across different problems. In Unif+UCB-E, UCB-E is run with the budget of $n/M$ and the same parameter $\eta$ for all the bandits. Finally, for the variance-based algorithms we set $a = \eta\, n / H^\sigma$, since we expect $H^\sigma$ to roughly capture the number of pulls necessary to solve the pure exploration problem with high probability. In Figs. 2 and 3, we report the performance $\ell(n)$, i.e., the maximum over the bandits of the probability of not identifying the best arm after $n$ rounds, of the gap-based algorithms as well as the Unif and Unif+UCB-E strategies. The results are averaged over $10^5$ runs and the error bars correspond to three times the estimated standard deviation. In all the figures the performance of Unif is reported as a horizontal dashed line.

The left panel of Fig. 2 displays the performance of Unif+UCB-E, GapE, and A-GapE in Problem 1. As expected, Unif+UCB-E has a better performance (23.9% probability of error) than Unif (29.4% probability of error), since it adapts the allocation within each bandit so as to pull more often the nearly-optimal arms. However, the two bandit problems are not equally difficult. In fact, their complexities are very different ($H_1 \approx 925$ and $H_2 \approx 67$), and thus, far fewer samples are needed to identify the best arm in the second bandit than in the first one. Unlike Unif+UCB-E, GapE adapts its allocation strategy to the complexities of the bandits (on average only 19% of the pulls are allocated to the second bandit), and at the same time to the arm complexities within each bandit (in the first bandit the averaged allocation of GapE is (37%, 36%, 20%, 7%)). As a result, GapE has a probability of error of 15.7%, which represents a significant improvement over Unif+UCB-E.

The right panel of Fig. 2 compares the performance of GapE, GapE-V, and A-GapE-V in Problem 2. In this problem, all the gaps are equal ($\Delta_{mk} = 0.05$), thus all the arms (and bandits) have the same complexity $H_{mk} = 400$. As a result, GapE tends to implement a nearly uniform allocation, which results in a small difference between Unif and GapE (28% and 25% accuracy, respectively). The reason why GapE is still able to improve over Unif may be explained by the difference between static and dynamic allocation strategies and it is further investigated in App. A of [8]. Unlike the gaps, the variance of the arms is extremely heterogeneous. In fact, the variance of the arms of bandit 1 is bigger than in bandit 2, thus making it harder to solve. This difference is captured by the definition of $H^\sigma$ ($H_1^\sigma \approx 1400 > H_2^\sigma \approx 600$). Note also that $H^\sigma \le H$. As discussed in Sec.
3.2, since GapE-V takes into account the empirical variance of the arms, it is able to adapt to the complexity $H_{mk}^\sigma$ of each bandit-arm pair and to focus more on uncertain arms. GapE-V improves the final accuracy by almost 10% w.r.t. GapE. From both panels of Fig. 2, we also notice that the adaptive algorithms achieve similar performance to their non-adaptive counterparts. Finally, we notice that a good choice of parameter $\eta$ for GapE-V is always close to 2 and 4 (see also [8] for additional experiments), while GapE needs $\eta$ to be tuned more carefully, particularly in Problem 2 where the large values of $\eta$ try to compensate for the fact that $H$ does not successfully capture the real complexity of the problem. This further strengthens the intuition that $H^\sigma$ is a more accurate measure of the complexity for the multi-bandit pure exploration problem.

While Problems 1 and 2 are relatively simple, we report the results of the more complicated Problem 3 in Fig. 3. The experiment is designed so that the complexity w.r.t. the variance of each bandit and within each bandit is strongly heterogeneous. In this experiment, we also introduce UCBE-V, which extends UCB-E by taking into account the empirical variance similarly to GapE-V. The results confirm the previous findings and show the improvement achieved by introducing empirical estimates of the variance and allocating non-uniformly over bandits.

5 Conclusion

In this paper, we studied the problem of best arm identification in a multi-bandit multi-armed setting. We introduced a gap-based exploration algorithm, called GapE, and proved an upper-bound for its probability of error. We extended the basic algorithm to also consider the variance of the arms and proved an upper-bound for its probability of error. We also introduced adaptive versions of these algorithms that estimate the complexity of the problem online. The numerical simulations confirmed the theoretical findings that GapE and GapE-V outperform other allocation strategies, and that their adaptive counterparts are able to estimate the complexity without worsening the global performance.
Although GapE does not know the gaps, the experimental results reported in [8] indicate that it might outperform a static allocation strategy, which knows the gaps in advance, thus suggesting that an adaptive strategy could perform better than a static one. This observation calls for further investigation. Moreover, we plan to apply the algorithms introduced in this paper to the problem of rollout allocation for classification-based policy iteration in reinforcement learning [9, 6], where the goal is to identify the greedy action (arm) in each of the states (bandit) in a training set.

Acknowledgments

Experiments presented in this paper were carried out using the Grid'5000 experimental testbed. This work was supported by the Ministry of Higher Education and Research, Nord-Pas de Calais Regional Council and FEDER through the "contrat de projets état région", the French National Research Agency (ANR) under project LAMPADA n° ANR-09-EMER-007, the European Community's Seventh Framework Programme (FP7) under a grant agreement, and the PASCAL2 European Network of Excellence.

References

[1] J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In Proceedings of the Twenty-Third Annual Conference on Learning Theory, pages 41-53.
[2] J.-Y. Audibert, R. Munos, and Cs. Szepesvári. Tuning bandit algorithms in stochastic environments. In M. Hutter, R. Servedio, and E. Takimoto, editors, Algorithmic Learning Theory, volume 4754 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg.
[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47.
[4] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandit problems. In Proceedings of the Twentieth International Conference on Algorithmic Learning Theory, pages 23-37.
[5] K. Deng, J. Pineau, and S. Murphy. Active learning for personalizing treatment. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.
[6] C. Dimitrakakis and M. Lagoudakis. Rollout sampling approximate policy iteration. Machine Learning Journal, 72(3).
[7] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7.
[8] V. Gabillon, M. Ghavamzadeh, A. Lazaric, and S. Bubeck. Multi-bandit best arm identification. Technical Report, INRIA.
[9] M. Lagoudakis and R. Parr. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the Twentieth International Conference on Machine Learning.
[10] O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Proceedings of Advances in Neural Information Processing Systems 6.
[11] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In 22nd Annual Conference on Learning Theory.
[12] V. Mnih, Cs. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In Proceedings of the Twenty-Fifth International Conference on Machine Learning.
[13] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematics Society, 58.
