Tuning bandit algorithms in stochastic environments


Jean-Yves Audibert (1), Rémi Munos (2), and Csaba Szepesvári (3)

(1) CERTIS - Ecole des Ponts, 19, rue Alfred Nobel - Cité Descartes, Marne-la-Vallée, France, audibert@certis.enpc.fr
(2) INRIA Futurs Lille, SequeL project, 50 avenue Halley, Villeneuve d'Ascq, France, remi.munos@inria.fr
(3) University of Alberta, Edmonton T6G 2E8, Canada, szepesva@cs.ualberta.ca

Abstract. Algorithms based on upper-confidence bounds for balancing exploration and exploitation are gaining popularity since they are easy to implement, efficient and effective. In this paper we consider a variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms. In earlier experimental works, such algorithms were found to outperform the competing algorithms. The purpose of this paper is to provide a theoretical explanation of these findings and provide theoretical guidelines for the tuning of the parameters of these algorithms. For this we analyze the expected regret and, for the first time, the concentration of the regret. The analysis of the expected regret shows that variance estimates can be especially advantageous when the payoffs of suboptimal arms have low variance. The risk analysis, rather unexpectedly, reveals that, except for some very special bandit problems, for upper confidence bound based algorithms with standard bias sequences, the regret concentrates only at a polynomial rate. Hence, although these algorithms achieve logarithmic expected regret rates, they seem less attractive when the risk of achieving much worse than logarithmic cumulative regret is also taken into account.

1 Introduction and notation

In this paper we consider stochastic multi-armed bandit problems. The original motivation of bandit problems comes from the desire to optimize efficiency in clinical trials when the decision maker can choose between treatments but initially he does not know which of the treatments is the most effective one [9]. Multi-armed bandit problems became popular with the seminal paper of Robbins [8], after which they have found applications in diverse fields, such as control, economics, statistics, or learning theory.

Formally, a K-armed bandit problem ($K \ge 2$) is defined by K distributions, $\nu_1, \dots, \nu_K$, one for each "arm" of the bandit. Imagine a gambler playing with these K slot machines. The gambler can pull the arm of any of the machines. Successive plays of arm k yield a sequence of independent and identically distributed (i.i.d.) real-valued random variables $X_{k,1}, X_{k,2}, \dots$, coming from the distribution $\nu_k$. The random variable $X_{k,t}$ is the payoff (or reward) of the k-th arm when this arm is pulled the t-th time. Independence also holds for rewards across the different arms. The gambler facing the bandit problem wants to pull the arms so as to maximize his cumulative payoff.

The problem is made challenging by assuming that the payoff distributions are initially unknown. Thus the gambler must use exploratory actions in order to learn the utility of the individual arms, making his decisions based on the available past information. However, exploration has to be carefully controlled since excessive exploration may lead to unnecessary losses. Hence, efficient online algorithms must find the right balance between exploration and exploitation.

Since the gambler cannot use the distributions of the arms (which are not available to him) he must follow a policy, which is a mapping from the space of possible histories, $\bigcup_{t \in \mathbb{N}^+} \{1, \dots, K\}^t \times \mathbb{R}^t$, into the set $\{1, \dots, K\}$, which indexes the arms. Let $\mu_k = \mathbb{E}[X_{k,1}]$ denote the expected reward of arm k [footnote 4]. By definition, an optimal arm is an arm having the largest expected reward. We will use $k^*$ to denote the index of such an arm. Let the optimal expected reward be $\mu^* = \max_{1 \le k \le K} \mu_k$. Further, let $T_k(t)$ denote the number of times arm k is chosen by the policy during the first t plays and let $I_t$ denote the arm played at time t. The (cumulative) regret at time n is defined by

$$\hat R_n \triangleq \sum_{t=1}^{n} X_{k^*,t} - \sum_{t=1}^{n} X_{I_t, T_{I_t}(t)}.$$

Oftentimes, the goal is to minimize the expected (cumulative) regret of the policy, $\mathbb{E}[\hat R_n]$. Clearly, this is equivalent to maximizing the total expected reward achieved up to time n.
It turns out that the expected regret satisfies

$$\mathbb{E}[\hat R_n] = \sum_{k=1}^{K} \mathbb{E}[T_k(n)]\, \Delta_k,$$

where $\Delta_k = \mu^* - \mu_k$ is the expected loss of playing arm k. Hence, an algorithm that aims at minimizing the expected regret should minimize the expected sampling times of suboptimal arms.

Footnote 4: $\mathbb{N}$ denotes the set of natural numbers, including zero, and $\mathbb{N}^+$ denotes the set of positive integers.

Early papers studied stochastic bandit problems under Bayesian assumptions (e.g., [5]). Lai and Robbins [6] studied bandit problems with parametric uncertainties. They introduced an algorithm that follows what is now called the "optimism in the face of uncertainty" principle. Their algorithm computes upper confidence bounds for all the arms by maximizing the expected payoff when the parameters are varied within appropriate confidence sets derived for the parameters. Then the algorithm chooses the arm with the highest such bound. They show that the expected regret increases only logarithmically with the number of trials and prove that the regret is asymptotically the smallest possible up to a sublogarithmic factor for the considered family of distributions. Agrawal has shown how to construct such optimal policies starting from the sample-means of the arms [1].

More recently, Auer et al. considered the case when the rewards come from a bounded support, say [0, b], but otherwise the reward distributions are unconstrained [3]. They have studied several policies, most notably UCB1, which constructs the Upper Confidence Bound (UCB) for arm k at time t by adding the bias factor $\sqrt{\frac{2 b^2 \log t}{T_k(t-1)}}$ to its sample-mean. They have proven that the expected regret of this algorithm satisfies

$$\mathbb{E}[\hat R_n] \le 8 \Big( \sum_{k : \mu_k < \mu^*} \frac{b^2}{\Delta_k} \Big) \log(n) + O(1). \tag{1}$$

In the same paper they propose UCB1-NORMAL, which is designed to work with normally distributed rewards only. This algorithm estimates the variance of the arms and uses these estimates to refine the bias factor. They show that for this algorithm, when the rewards are indeed normally distributed with means $\mu_k$ and variances $\sigma_k^2$,

$$\mathbb{E}[\hat R_n] \le 8 \sum_{k : \mu_k < \mu^*} \Big( \frac{32 \sigma_k^2}{\Delta_k} + \Delta_k \Big) \log(n) + O(1). \tag{2}$$

Note that one major difference between this result and the previous one is that the regret bound for UCB1 scales with $b^2$, while the regret bound for UCB1-NORMAL scales with the variances of the arms. First, let us note that it can be proven that the scaling behavior of the regret bound with b is not a proof artifact: the expected regret indeed scales with $\Omega(b^2)$. Since b is typically just an a priori guess on the size of the interval containing the rewards, which might be overly conservative, it is desirable to lessen the dependence on it.

Auer et al. introduced another algorithm, UCB1-Tuned, in the experimental section of their paper.
This algorithm, similarly to UCB1-NORMAL, uses the empirical estimates of the variance in the bias sequence. Although no theoretical guarantees were derived for UCB1-Tuned, this algorithm has been shown to outperform the other algorithms considered in the paper in essentially all the experiments. The superiority of this algorithm has been reconfirmed recently in the latest Pascal Challenge [4]. Intuitively, algorithms using variance estimates should work better than UCB1 when the variance of some suboptimal arm is much smaller than $b^2$, since these arms will be less often drawn: suboptimal arms are more easily spotted by algorithms using variance estimates.

In this paper we study the regret of UCB-V, which is a generic UCB algorithm that uses variance estimates in the bias sequence. In particular, the bias sequence of UCB-V takes the form

$$\sqrt{\frac{2 V_{k,T_k(t-1)}\, E_{T_k(t-1),t}}{T_k(t-1)}} + c\, \frac{3 b\, E_{T_k(t-1),t}}{T_k(t-1)},$$

where $V_{k,s}$ is the empirical variance estimate for arm k based on s samples, and E (viewed as a function of (s, t)) is a so-called exploration function for which a typical choice is $E_{s,t} = \zeta \log(t)$ (thus in this case, E is independent of s). Here $\zeta, c > 0$ are tuning parameters that can be used to control the behavior of the algorithm.

One major result of the paper (Corollary 1) is a bound on the expected regret that scales in an improved fashion with b. In particular, we show that for a particular setting of the parameters of the algorithm,

$$\mathbb{E}[\hat R_n] \le 10 \sum_{k : \mu_k < \mu^*} \Big( \frac{\sigma_k^2}{\Delta_k} + 2b \Big) \log(n).$$

The main difference to the bound (1) is that $b^2$ is replaced by $\sigma_k^2$, though b still appears in the bound. This is indeed the major difference to the bound (2) [footnote 5]. In order to prove this result we will prove a novel tail bound on the sample average of i.i.d. random variables with bounded support that, unlike previous similar bounds, involves the empirical variance and which may be of independent interest (Theorem 1). Otherwise, the proof of the regret bound involves the analysis of the sampling times of suboptimal arms (Theorem 2), which contains significant advances compared with the one in [3]. This way we obtain results on the expected regret for a wide class of exploration functions (Theorem 3). For the standard logarithmic sequence we will give lower limits on the tuning parameters: if the tuning parameters are below these limits the loss goes up considerably (Theorems 4 and 5).

The second major contribution of the paper is the analysis of the risk that the studied upper confidence based policies have a regret much higher than its expected value. To our best knowledge no such analysis existed for this class of algorithms so far. In order to analyze this risk, we define the (cumulative) pseudo-regret at time n via

$$R_n = \sum_{k=1}^{K} T_k(n)\, \Delta_k.$$

Note that the expectation of the pseudo-regret and the regret are the same: $\mathbb{E}[R_n] = \mathbb{E}[\hat R_n]$. The difference of the regret and the pseudo-regret comes from the randomness of the rewards.
Sections 4 and 5 develop high probability bounds for the pseudo-regret. The same kind of formulae can be obtained for the cumulative regret (see Remark 2).

Footnote 5: Although this is unfortunate, it is possible to show that the dependence on b is unavoidable.
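The two notions of regret defined above can be compared on a small hand-made example. The sketch below is only illustrative: the function names, the payoff table layout (`payoffs[k][s]` is the reward of arm k on its (s+1)-th pull) and the 0-based arm indices are assumptions made for this example, not part of the paper.

```python
def cumulative_regret(payoffs, best_arm, pulls):
    """Regret: sum_t X_{k*,t} - sum_t X_{I_t, T_{I_t}(t)}.

    payoffs[k][s] is the reward of arm k on its (s+1)-th pull and
    pulls is the sequence of played arms I_1, ..., I_n (0-based).
    """
    n = len(pulls)
    counts = [0] * len(payoffs)
    collected = 0.0
    for arm in pulls:
        # The played arm consumes its next unused reward.
        collected += payoffs[arm][counts[arm]]
        counts[arm] += 1
    return sum(payoffs[best_arm][:n]) - collected


def pseudo_regret(means, pulls):
    """Pseudo-regret: sum_k T_k(n) * Delta_k with Delta_k = mu* - mu_k."""
    best = max(means)
    return sum(best - means[arm] for arm in pulls)


# Hypothetical two-armed instance: arm 0 is optimal (mean 0.75 vs 0.25).
payoffs = [[1.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.0, 0.0]]
means = [0.75, 0.25]
pulls = [0, 1, 1, 0]
print(cumulative_regret(payoffs, 0, pulls), pseudo_regret(means, pulls))
```

The regret depends on the realized rewards, whereas the pseudo-regret depends only on how often each suboptimal arm was pulled; the two agree in expectation but not in general on a given sample path.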

Interestingly, our analysis revealed some tradeoffs that we did not expect: as it turns out, if one aims for logarithmic expected regret (or, more generally, for subpolynomial regret) then the regret does not necessarily concentrate exponentially fast around its mean (Theorem 7). In fact, this is the case when with positive probability the optimal arm yields a reward smaller than the expected reward of some suboptimal arm. Take for example two arms satisfying this condition and with $\mu_1 > \mu_2$: the first arm is the optimal arm and $\Delta_2 = \mu_1 - \mu_2 > 0$. Then the distribution of the pseudo-regret at time n will have two modes, one at $\Omega(\log n)$ and the other at $\Omega(\Delta_2 n)$. The probability mass associated with this second mode will decay polynomially with n, where the rate of decay depends on $\Delta_2$. Above the second mode the distribution decays exponentially. By increasing the exploration rate the situation can be improved. Our risk tail bound (Theorem 6) makes this dependence explicit. Of course, increasing the exploration rate increases the expected regret, hence the tradeoff between the expected regret and the risk of achieving much worse than the expected regret. One lesson is thus that if in an application risk is important then it might be better to increase the exploration rate.

In Section 5, we study a variant of the algorithm obtained by $E_{s,t} = E_s$. In particular, we show that with an appropriate choice of $E_s = E_s(\beta)$, for any $0 < \beta < 1$, for an infinite number of plays, the algorithm achieves finite cumulative regret with probability $1 - \beta$ (Theorem 8). Hence, we name this variant PAC-UCB ("Probably approximately correct UCB"). Besides, for a finite time horizon n, choosing $\beta = 1/n$ then yields a logarithmic bound on the regret that fails with probability $O(1/n)$ only. This should be compared with the bound $O(1/\log(n)^a)$, $a > 0$, obtained for the standard choice $E_{s,t} = \zeta \log t$ in Corollary 2. Hence, we conjecture that knowing the time horizon might represent a significant advantage.

Due to limited space, some of the proofs are absent from this paper.
All the proofs can be found in the extended version [2].

2 The UCB-V algorithm

For any $k \in \{1, \dots, K\}$ and $t \in \mathbb{N}$, let $\bar X_{k,t}$ and $V_{k,t}$ be the empirical estimates of the mean payoff and variance of arm k:

$$\bar X_{k,t} \triangleq \frac{1}{t} \sum_{i=1}^{t} X_{k,i} \quad \text{and} \quad V_{k,t} \triangleq \frac{1}{t} \sum_{i=1}^{t} (X_{k,i} - \bar X_{k,t})^2,$$

where by convention $\bar X_{k,0} \triangleq 0$ and $V_{k,0} \triangleq 0$. We recall that an optimal arm is an arm having the best expected reward, $k^* \in \operatorname{argmax}_{k \in \{1, \dots, K\}} \mu_k$. We denote quantities related to the optimal arm by putting $*$ in the upper index.

In the following, we assume that the rewards are bounded. Without loss of generality, we may assume that all the rewards are almost surely in [0, b], with $b > 0$. We summarize our assumptions on the reward sequence here:

Assumptions: Let $K > 2$, and let $\nu_1, \dots, \nu_K$ be distributions over the reals with support [0, b]. For $1 \le k \le K$, let $\{X_{k,t}\} \sim \nu_k$ be an i.i.d. sequence of random variables specifying the rewards for arm k [footnote 6]. Assume that the rewards of different arms are independent of each other, i.e., for any $k, k'$ with $1 \le k < k' \le K$ and any $t \in \mathbb{N}^+$, the collections of random variables $(X_{k,1}, \dots, X_{k,t})$ and $(X_{k',1}, \dots, X_{k',t})$ are independent of each other.

2.1 The algorithm

Let $c \ge 0$. Let $E = (E_{s,t})_{s \ge 0, t \ge 0}$ be nonnegative real numbers such that for any $s \ge 0$, the function $t \mapsto E_{s,t}$ is nondecreasing. We shall call E (viewed as a function of (s, t)) the exploration function. For any arm k and any nonnegative integers s, t, introduce

$$B_{k,s,t} \triangleq \bar X_{k,s} + \sqrt{\frac{2 V_{k,s} E_{s,t}}{s}} + c\, \frac{3 b E_{s,t}}{s} \tag{3}$$

with the convention $1/0 = +\infty$.

UCB-V policy: At time t, play an arm maximizing $B_{k,T_k(t-1),t}$.

Let us roughly describe the behaviour of the algorithm. At the beginning (i.e., for small t), every arm that has not been drawn is associated with an infinite bound, which will become finite as soon as the arm is drawn. The more an arm k is drawn, the closer the bound (3) gets to its first term, and thus, from the law of large numbers, to the expected reward $\mu_k$. So the procedure will hopefully tend to draw more often the arms having the greatest expected rewards. Nevertheless, since the obtained rewards are stochastic, it might happen that during the first draws the (unknown) optimal arm always gives low rewards. Fortunately, if the optimal arm has not been drawn too often (i.e., small $T_{k^*}(t-1)$), for appropriate choices of E (when $E_{s,t}$ increases without bound in t for any fixed s), after a while the last term of (3) will start to dominate the two other terms and will also dominate the bound associated with the arms drawn very often. Thus the optimal arm will be drawn even if the empirical mean of the obtained rewards, $\bar X_{k^*, T_{k^*}(t-1)}$, is small. More generally, such choices of E lead to the exploration of arms with inferior empirical mean. This is why E is referred to as the exploration function. Naturally, a high-valued exploration function also leads to drawing suboptimal arms often.
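The index (3) and the UCB-V policy described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it fixes the exploration function to $E_{s,t} = \zeta \log t$, and the names `ucbv_index` and `run_ucbv`, the representation of an arm as a function of a random generator, and the default parameter values are all choices made for this example.

```python
import math
import random


def ucbv_index(mean, var, s, t, b=1.0, zeta=1.2, c=1.0):
    """Index (3) with E_{s,t} = zeta * log(t); infinite for an unpulled arm."""
    if s == 0:
        return math.inf  # convention 1/0 = +infinity
    e = zeta * math.log(t)
    return mean + math.sqrt(2.0 * var * e / s) + c * 3.0 * b * e / s


def run_ucbv(arms, n, b=1.0, zeta=1.2, c=1.0, seed=0):
    """Play n rounds; each arm is a function rng -> reward in [0, b].

    Returns the number of pulls T_k(n) of every arm.
    """
    rng = random.Random(seed)
    k = len(arms)
    counts = [0] * k
    sums = [0.0] * k
    sq_sums = [0.0] * k
    for t in range(1, n + 1):
        indices = []
        for i in range(k):
            s = counts[i]
            mean = sums[i] / s if s else 0.0
            # Empirical variance (1/s) * sum (X - mean)^2, clipped at 0
            # to absorb floating-point round-off.
            var = max(sq_sums[i] / s - mean * mean, 0.0) if s else 0.0
            indices.append(ucbv_index(mean, var, s, t, b, zeta, c))
        arm = max(range(k), key=indices.__getitem__)
        x = arms[arm](rng)
        counts[arm] += 1
        sums[arm] += x
        sq_sums[arm] += x * x
    return counts


# Hypothetical instance: a Bernoulli(0.9) arm and a constant 0.2 arm.
arms = [lambda rng: float(rng.random() < 0.9), lambda rng: 0.2]
print(run_ucbv(arms, 2000))
```

In this instance the suboptimal arm has zero variance, so its index is driven only by the last term of (3); this is the regime in which the variance estimate is expected to help most.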
Therefore the choice of E is crucial in order to explore possibly optimal arms while continuing to exploit (what looks like) the optimal arm. The actual form of $B_{k,s,t}$ comes from the following novel tail bound on the sample average of i.i.d. random variables with bounded support that, unlike previous similar bounds (Bennett's and Bernstein's inequalities), involves the empirical variance.

Footnote 6: The i.i.d. assumption can be relaxed, see e.g., [7].

Theorem 1. Let $X_1, \dots, X_t$ be i.i.d. random variables taking their values in [0, b]. Let $\mu = \mathbb{E}[X_1]$ be their common expected value. Consider the empirical expectation $\bar X_t$ and variance $V_t$ defined respectively by

$$\bar X_t = \frac{\sum_{i=1}^{t} X_i}{t} \quad \text{and} \quad V_t = \frac{\sum_{i=1}^{t} (X_i - \bar X_t)^2}{t}.$$

Then for any $t \in \mathbb{N}$ and $x > 0$, with probability at least $1 - 3e^{-x}$,

$$|\bar X_t - \mu| \le \sqrt{\frac{2 V_t x}{t}} + \frac{3 b x}{t}. \tag{4}$$

Furthermore, introducing

$$\beta(x, t) = 3 \inf_{1 < \alpha \le 3} \Big( \frac{\log t}{\log \alpha} \wedge t \Big) e^{-x/\alpha}, \tag{5}$$

where $u \wedge v$ denotes the minimum of u and v, we have for any $t \in \mathbb{N}$ and $x > 0$, with probability at least $1 - \beta(x, t)$,

$$|\bar X_s - \mu| \le \sqrt{\frac{2 V_s x}{s}} + \frac{3 b x}{s} \tag{6}$$

holds simultaneously for $s \in \{1, 2, \dots, t\}$.

Remark 1. The uniformity in time is the only difference between the two assertions of the previous theorem. When we use (6), the values of x and t will be such that $\beta(x, t)$ is of the order of $3e^{-x}$, hence there will be no real price to pay for writing a version of (4) that is uniform in time. In particular, this means that if $1 \le S \le t$ is a random variable, then (4), with t replaced by S, still holds with probability at least $1 - \beta(x, t)$. Note that (4) is useless for $t \le 3$ since its r.h.s. is larger than b.

For any arm k, time t and integer $1 \le s \le t$ we may apply Theorem 1 to the rewards $X_{k,1}, \dots, X_{k,s}$, and obtain that with probability at least $1 - 3 \sum_{s=4}^{\infty} e^{-(c \wedge 1) E_{s,t}}$, we have $\mu_k \le B_{k,s,t}$. Hence, by our previous remark, at time t with high probability (for a high-valued exploration function E) the expected reward of arm k is upper bounded by $B_{k,T_k(t-1),t}$.

The user of the generic UCB-V policy has two parameters to tune: the exploration function E and the positive real number c. A cumbersome technical analysis (not reproduced here) shows that there are essentially two interesting types of exploration functions:
- the ones in which $E_{s,t}$ depends only on t (see Sections 3 and 4);
- the ones in which $E_{s,t}$ depends only on s (see Section 5).

2.2 Bounds for the sampling times of suboptimal arms

The natural way of bounding the regret of UCB policies is to bound the number of times suboptimal arms are drawn. The bounds presented here significantly improve the ones used in [3].
The improvement is a necessary step to get tight bounds for the interesting case where the exploration function is logarithmic.
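Before turning to the sampling times, the tail bound (4) of Theorem 1 can be checked numerically. The sketch below is a hedged illustration: the Bernoulli reward model, the sample size, the value of x and the function names are assumptions chosen for this example. It estimates how often the deviation $|\bar X_t - \mu|$ stays within the right-hand side of (4) and compares that frequency with the guaranteed level $1 - 3e^{-x}$.

```python
import math
import random


def empirical_bernstein_radius(var, t, x, b=1.0):
    """Right-hand side of (4): sqrt(2 * V_t * x / t) + 3 * b * x / t."""
    return math.sqrt(2.0 * var * x / t) + 3.0 * b * x / t


def coverage(t=200, x=3.0, p=0.1, trials=2000, seed=0):
    """Fraction of trials with |mean - mu| <= radius, Bernoulli(p) rewards."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [1.0 if rng.random() < p else 0.0 for _ in range(t)]
        mean = sum(sample) / t
        var = sum((v - mean) ** 2 for v in sample) / t
        if abs(mean - p) <= empirical_bernstein_radius(var, t, x):
            hits += 1
    return hits / trials


guaranteed = 1.0 - 3.0 * math.exp(-3.0)
print(coverage(), ">=", guaranteed)
```

The radius only uses observable quantities (the empirical variance and the range b), which is what allows it to be plugged directly into the index (3).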

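Theorem 2. After K plays, each arm has been pulled once. Let arm k and time $n \in \mathbb{N}^+$ be fixed. For any $\tau \in \mathbb{R}$ and any integer $u > 1$, we have

$$T_k(n) \le u + \sum_{t=u+K-1}^{n} \Big( \mathbb{1}_{\{\exists s :\, u \le s \le t-1 \text{ s.t. } B_{k,s,t} > \tau\}} + \mathbb{1}_{\{\exists s^* :\, 1 \le s^* \le t-1 \text{ s.t. } \tau \ge B_{k^*,s^*,t}\}} \Big), \tag{7}$$

hence

$$\mathbb{E}[T_k(n)] \le u + \sum_{t=u+K-1}^{n} \sum_{s=u}^{t-1} \mathbb{P}\big( B_{k,s,t} > \tau \big) + \sum_{t=u+K-1}^{n} \mathbb{P}\big( \exists s : 1 \le s \le t-1 \text{ s.t. } B_{k^*,s,t} \le \tau \big). \tag{8}$$

Besides we have

$$\mathbb{P}\big( T_k(n) > u \big) \le \sum_{t=3}^{n} \mathbb{P}\big( B_{k,u,t} > \tau \big) + \mathbb{P}\big( \exists s : 1 \le s \le n-u \text{ s.t. } B_{k^*,s,u+s} \le \tau \big). \tag{9}$$

Even if the above statements hold for any arm, they will be only useful for suboptimal arms.

Proof. The first assertion is trivial since at the beginning all arms have an infinite UCB, which becomes finite as soon as the arm has been played once. To obtain (7), we note that

$$T_k(n) - u \le \sum_{t=u+K-1}^{n} \mathbb{1}_{\{I_t = k;\, T_k(t) > u\}} = \sum_{t=u+K-1}^{n} Z_{k,t,u},$$

where

$$Z_{k,t,u} = \mathbb{1}_{\{I_t = k;\; u \le T_k(t-1);\; 1 \le T_{k^*}(t-1);\; B_{k,T_k(t-1),t} \ge B_{k^*,T_{k^*}(t-1),t}\}} \le \mathbb{1}_{\{\exists s :\, u \le s \le t-1 \text{ s.t. } B_{k,s,t} > \tau\}} + \mathbb{1}_{\{\exists s^* :\, 1 \le s^* \le t-1 \text{ s.t. } \tau \ge B_{k^*,s^*,t}\}}.$$

Taking the expectation on both sides of (7) and using the probability union bound, we obtain (8). Finally, (9) comes from a more direct argument that uses the fact that the exploration function $E_{s,t}$ is a nondecreasing function with respect to t. Consider an event such that the following statements hold:

$$\begin{cases} \forall t : 3 \le t \le n, & B_{k,u,t} \le \tau, \\ \forall s : 1 \le s \le n-u, & B_{k^*,s,u+s} > \tau. \end{cases}$$

Then for any $1 \le s \le n-u$ and $u+s \le t \le n$,

$$B_{k^*,s,t} \ge B_{k^*,s,u+s} > \tau \ge B_{k,u,t}.$$

This implies that arm k will not be pulled a (u+1)-th time. Therefore we have proved by contradiction that

$$\big\{ T_k(n) > u \big\} \subset \Big( \big\{ \exists t : 3 \le t \le n \text{ s.t. } B_{k,u,t} > \tau \big\} \cup \big\{ \exists s : 1 \le s \le n-u \text{ s.t. } B_{k^*,s,u+s} \le \tau \big\} \Big), \tag{10}$$

which by taking probabilities of both sides gives the announced result.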

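3 Expected regret of UCB-V

In this section, we consider that the exploration function does not depend on s (still, $E = (E_t)_{t \ge 0}$ is a nondecreasing function of t). We will see that as far as the expected regret is concerned, a natural choice of $E_t$ is the logarithmic function and that c should not be taken too small if one does not want to suffer polynomial regret instead of logarithmic one. We derive bounds on the expected regret and conclude by specifying natural constraints on c and $E_t$.

Theorem 3. We have

$$\mathbb{E}[R_n] \le \sum_{k : \Delta_k > 0} \Big\{ 1 + 8 (c \vee 1) E_n \Big( \frac{\sigma_k^2}{\Delta_k^2} + \frac{2b}{\Delta_k} \Big) + n e^{-E_n} \Big( \frac{24 \sigma_k^2}{\Delta_k^2} + \frac{4b}{\Delta_k} \Big) + \sum_{t=16 E_n}^{n} \beta\big( (c \wedge 1) E_t, t \big) \Big\} \Delta_k, \tag{11}$$

where $u \vee v$ denotes the maximum of u and v, and where we recall that $\beta\big( (c \wedge 1) E_t, t \big)$ is essentially of order $e^{-(c \wedge 1) E_t}$ (see (5) and Remark 1).

Proof. Let $E'_n = (c \vee 1) E_n$. We use (8) with u the smallest integer larger than $8 \big( \frac{\sigma_k^2}{\Delta_k^2} + \frac{2b}{\Delta_k} \big) E'_n$ and $\tau = \mu^*$. The above choice of u guarantees that for any $u \le s < t$ and $t \ge 2$,

$$\sqrt{\frac{2 [\sigma_k^2 + b \Delta_k / 2] E_t}{s}} + 3bc \frac{E_t}{s} \le \sqrt{\frac{[2 \sigma_k^2 + b \Delta_k] E'_n}{u}} + 3b \frac{E'_n}{u} \le \sqrt{\frac{[2 \sigma_k^2 + b \Delta_k] \Delta_k^2}{8 [\sigma_k^2 + 2b \Delta_k]}} + \frac{3b \Delta_k^2}{8 [\sigma_k^2 + 2b \Delta_k]} = \frac{\Delta_k}{2} \Big[ \sqrt{\frac{2 \sigma_k^2 + b \Delta_k}{2 \sigma_k^2 + 4b \Delta_k}} + \frac{3b \Delta_k}{4 \sigma_k^2 + 8b \Delta_k} \Big] \le \frac{\Delta_k}{2}, \tag{12}$$

since the last inequality is equivalent to $(x - 1)^2 \ge 0$ for $x = \sqrt{\frac{2 \sigma_k^2 + b \Delta_k}{2 \sigma_k^2 + 4b \Delta_k}}$. For any $s \ge u$ and $t \ge 2$, using (12), we have

$$\begin{aligned} \mathbb{P}(B_{k,s,t} > \mu^*) &= \mathbb{P}\Big( \bar X_{k,s} + \sqrt{\tfrac{2 V_{k,s} E_t}{s}} + 3bc \tfrac{E_t}{s} > \mu_k + \Delta_k \Big) \\ &\le \mathbb{P}\Big( \bar X_{k,s} + \sqrt{\tfrac{2 [\sigma_k^2 + b \Delta_k / 2] E_t}{s}} + 3bc \tfrac{E_t}{s} > \mu_k + \Delta_k \Big) + \mathbb{P}\big( V_{k,s} \ge \sigma_k^2 + b \Delta_k / 2 \big) \\ &\le \mathbb{P}\big( \bar X_{k,s} - \mu_k > \Delta_k / 2 \big) + \mathbb{P}\Big( \tfrac{\sum_{j=1}^{s} (X_{k,j} - \mu_k)^2}{s} - \sigma_k^2 \ge b \Delta_k / 2 \Big) \\ &\le 2 e^{-s \Delta_k^2 / (8 \sigma_k^2 + 4b \Delta_k / 3)}, \end{aligned} \tag{13}$$

where in the last step we used Bernstein's inequality twice. Summing up these probabilities we obtain

$$\sum_{s=u}^{t-1} \mathbb{P}(B_{k,s,t} > \mu^*) \le 2 \sum_{s=u}^{\infty} e^{-s \Delta_k^2 / (8 \sigma_k^2 + 4b \Delta_k / 3)} = \frac{2 e^{-u \Delta_k^2 / (8 \sigma_k^2 + 4b \Delta_k / 3)}}{1 - e^{-\Delta_k^2 / (8 \sigma_k^2 + 4b \Delta_k / 3)}} \le \Big( \frac{24 \sigma_k^2}{\Delta_k^2} + \frac{4b}{\Delta_k} \Big) e^{-u \Delta_k^2 / (8 \sigma_k^2 + 4b \Delta_k / 3)} \le \Big( \frac{24 \sigma_k^2}{\Delta_k^2} + \frac{4b}{\Delta_k} \Big) e^{-E'_n}, \tag{14}$$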

where we have used that $1 - e^{-x} \ge 2x/3$ for $0 \le x \le 3/4$. By using (6) of Theorem 1 to bound the other probability in (8), we obtain that

$$\mathbb{E}[T_k(n)] \le 1 + 8 E'_n \Big( \frac{\sigma_k^2}{\Delta_k^2} + \frac{2b}{\Delta_k} \Big) + n e^{-E'_n} \Big( \frac{24 \sigma_k^2}{\Delta_k^2} + \frac{4b}{\Delta_k} \Big) + \sum_{t=u+1}^{n} \beta\big( (c \wedge 1) E_t, t \big),$$

which by $u \ge 16 E_n$ gives the announced result.

In order to balance the terms in (11) the exploration function should be chosen to be proportional to $\log t$. For this choice, the following corollary gives an explicit bound on the expected regret:

Corollary 1. If $c = 1$ and $E_t = \zeta \log t$ for $\zeta > 1$, then there exists a constant $c_\zeta$ depending only on $\zeta$ such that for $n \ge 2$

$$\mathbb{E}[R_n] \le c_\zeta \sum_{k : \Delta_k > 0} \Big( \frac{\sigma_k^2}{\Delta_k} + 2b \Big) \log n. \tag{15}$$

For instance, for $\zeta = 1.2$, the result holds for $c_\zeta = 10$.

Proof (Sketch of the proof). The first part, (15), follows directly from Theorem 3. Let us thus turn to the numerical result. For $n \ge K$, we have $R_n \le b(n-1)$ (since in the first K rounds, the optimal arm is chosen at least once). As a consequence, the numerical bound is nontrivial only for $20 \log n < n - 1$, so we only need to check the result for $n > 91$. For $n > 91$, we bound the constant term using $1 \le \frac{\log n}{\log 91} \le a_1 \frac{2b}{\Delta_k} \log n$, with $a_1 = 1/(2 \log 91)$. The second term between the brackets in (11) is bounded by $a_2 \big( \frac{\sigma_k^2}{\Delta_k^2} + \frac{2b}{\Delta_k} \big) \log n$, with $a_2 = 8 \times 1.2 = 9.6$. For the third term, we use that for $n > 91$, we have $24 n^{-0.2} < a_3 \log n$ for a suitable numerical constant $a_3$. By tedious computations, the fourth term can be bounded by $a_4 \frac{2b}{\Delta_k} \log n$ for a numerical constant $a_4$. This gives the desired result since $a_1 + a_2 + a_3 + a_4 \le 10$.

As promised, Corollary 1 gives a logarithmic bound on the expected regret that has a linear dependence on the range of the rewards, contrary to bounds on algorithms that do not take into account the empirical variance of the reward distributions (see e.g. the bound (1) that holds for UCB1).

The previous corollary is well complemented by the following result, which essentially says that we should not use $E_t = \zeta \log t$ with $\zeta < 1$.

Theorem 4. Consider $E_t = \zeta \log t$ and let n denote the total number of draws.
Whatever c is, if $\zeta < 1$, then there exist some reward distributions (depending on n) such that
- the expected number of draws of suboptimal arms using the UCB-V algorithm is polynomial in the total number of draws,
- the UCB-V algorithm suffers a polynomial loss.
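To make the comparison between the UCB1 bound (1) and the UCB-V bound (15) of Corollary 1 concrete, the leading terms can be evaluated on a hypothetical instance. The sketch below is an illustration only: the function names and the numerical instance are assumptions, the $O(1)$ term of (1) is ignored, and $c_\zeta = 10$ is the value quoted for $\zeta = 1.2$.

```python
import math


def ucb1_bound(gaps, b, n):
    """Leading term of bound (1): 8 * sum_k (b^2 / Delta_k) * log n."""
    return 8.0 * sum(b * b / d for d in gaps if d > 0) * math.log(n)


def ucbv_bound(gaps, variances, b, n, c_zeta=10.0):
    """Bound (15): c_zeta * sum_k (sigma_k^2 / Delta_k + 2 b) * log n."""
    total = sum(v / d + 2.0 * b for d, v in zip(gaps, variances) if d > 0)
    return c_zeta * total * math.log(n)


# Hypothetical instance with a conservative range b = 10 and low variances.
gaps = [0.0, 0.5, 1.0]          # Delta_k, the first arm is optimal
variances = [0.05, 0.01, 0.04]  # sigma_k^2
n = 10_000
print(ucb1_bound(gaps, 10.0, n), ucbv_bound(gaps, variances, 10.0, n))
```

With a loose range b and small variances the bound (15) is much smaller, since it is linear rather than quadratic in b. The ordering can reverse when b is tight and the variances are of order $b^2$ (for example b = 1, a single gap 0.5 and variance 0.25), because of the larger constant in (15).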

So far we have seen that for $c = 1$ and $\zeta > 1$ we obtain a logarithmic regret, and that the constant $\zeta$ could not be taken below 1 (whatever c is) without risking to suffer polynomial regret. Now we consider the last term in $B_{k,s,t}$, which is linear in the ratio $E_t / s$, and show that this term is also necessary to obtain a logarithmic regret, since we have:

Theorem 5. Consider $E_t = \zeta \log t$. Whatever $\zeta$ is, if $c\zeta < 1/6$, there exist probability distributions of the rewards such that the UCB-V algorithm suffers a polynomial loss.

To conclude the above analysis, natural values for the constants appearing in the bound are the following ones:

$$B_{k,s,t} \triangleq \bar X_{k,s} + \sqrt{\frac{2 V_{k,s} \log t}{s}} + \frac{b \log t}{2s}.$$

This choice corresponds to the critical exploration function $E_t = \log t$ and to $c = 1/6$, that is, the minimal associated value of c in view of the previous theorem. In practice, it would be unwise (or risk seeking) to use smaller constants in front of the last two terms.

4 Concentration of the regret

In real life, people are not only interested in the expected rewards that they can obtain by some policy. They also want to estimate probabilities of obtaining much less reward than expected, hence they are interested in the concentration of the regret. This section starts with the study of the concentration of the pseudo-regret, since, as we will see in Remark 2, the concentration properties of the regret follow from the concentration properties of the pseudo-regret.

We still assume that the exploration function does not depend on s and that $E = (E_t)_{t \ge 0}$ is nondecreasing. Introduce

$$\beta_n(t) \triangleq 3 \min_{\substack{\alpha \ge 1,\; M \in \mathbb{N} \\ s_0 = 0 < s_1 < \cdots < s_M = n \\ \text{s.t. } s_{j+1} \le \alpha (s_j + 1)}} \; \sum_{j=0}^{M-1} e^{-\frac{(c \wedge 1) E_{s_j + t + 1}}{\alpha}}. \tag{16}$$

We have seen in the previous section that to obtain logarithmic expected regret, it is natural to take a logarithmic exploration function. In this case, and also when the exploration function goes to infinity faster than the logarithmic function, the complicated sum of (16), up to second order logarithmic terms, is of the order of $e^{-(c \wedge 1) E_t}$.
This can be seen by considering (disregarding rounding issues) the geometric grid $s_j = \alpha^j$ with $\alpha$ close to 1. Let $\lfloor x \rfloor$ still denote the largest integer smaller than or equal to x. The next theorem provides a bound for the tail of the pseudo-regret.

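Theorem 6. Let

$$v_k \triangleq 8 (c \vee 1) \Big( \frac{\sigma_k^2}{\Delta_k^2} + \frac{4b}{3 \Delta_k} \Big), \qquad r_0 \triangleq \sum_{k : \Delta_k > 0} \Delta_k \big( 1 + v_k E_n \big).$$

Then, for any $x \ge 1$, we have

$$\mathbb{P}\big( R_n > r_0 x \big) \le \sum_{k : \Delta_k > 0} \Big\{ 2n e^{-(c \vee 1) E_n x} + \beta_n\big( \lfloor v_k E_n x \rfloor \big) \Big\}, \tag{17}$$

where we recall that $\beta_n(t)$ is essentially of order $e^{-(c \wedge 1) E_t}$ (see the above discussion).

Proof (sketch of the proof). First note that

$$\mathbb{P}\big( R_n > r_0 x \big) = \mathbb{P}\Big\{ \sum_{k : \Delta_k > 0} \Delta_k T_k(n) > \sum_{k : \Delta_k > 0} \Delta_k (1 + v_k E_n) x \Big\} \le \sum_{k : \Delta_k > 0} \mathbb{P}\big\{ T_k(n) > (1 + v_k E_n) x \big\}.$$

Let $E'_n = (c \vee 1) E_n$. We use (9) with $\tau = \mu^*$ and $u = \lfloor (1 + v_k E_n) x \rfloor \ge v_k E_n x$. From (13), we have $\mathbb{P}(B_{k,u,t} > \mu^*) \le 2 e^{-u \Delta_k^2 / (8 \sigma_k^2 + 4b \Delta_k / 3)} \le 2 e^{-E'_n x}$. To bound the other probability in (9), we use $\alpha \ge 1$ and the grid $s_0, \dots, s_M$ of $\{1, \dots, n\}$ realizing the minimum of (16) when $t = u$. Let $I_j = \{s_j + 1, \dots, s_{j+1}\}$. Then

$$\begin{aligned} \mathbb{P}\big( \exists s : 1 \le s \le n-u \text{ s.t. } B_{k^*,s,u+s} \le \mu^* \big) &\le \sum_{j=0}^{M-1} \mathbb{P}\big( \exists s \in I_j \text{ s.t. } B_{k^*,s,s_j+u+1} \le \mu^* \big) \\ &\le \sum_{j=0}^{M-1} \mathbb{P}\Big( \exists s \in I_j \text{ s.t. } s (\bar X_{k^*,s} - \mu^*) + \sqrt{2 s V_{k^*,s} E_{s_j+u+1}} + 3bc E_{s_j+u+1} \le 0 \Big) \\ &\le 3 \sum_{j=0}^{M-1} e^{-\frac{(c \wedge 1) E_{s_j+u+1}}{\alpha}} = \beta_n(u) \le \beta_n\big( \lfloor v_k E_n x \rfloor \big), \end{aligned}$$

where the second to last inequality comes from an appropriate union bound argument (see [2] for details).

When $E_n \ge \log n$, the last term is the leading term. In particular, when $c = 1$ and $E_t = \zeta \log t$ with $\zeta > 1$, Theorem 6 leads to the following corollary, which essentially says that for any $z > \gamma \log n$ with $\gamma$ large enough, for some constant $C > 0$:

$$\mathbb{P}\big( R_n > z \big) \le \frac{C}{z^{\zeta}}.$$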

Corollary 2. When $c = 1$ and $E_t = \zeta \log t$ with $\zeta > 1$, there exist $\kappa_1 > 0$ and $\kappa_2 > 0$ depending only on b, K, $(\sigma_k)_{k \in \{1, \dots, K\}}$, $(\Delta_k)_{k \in \{1, \dots, K\}}$ satisfying that for any $\varepsilon > 0$ there exists $\Gamma_\varepsilon > 0$ (tending to infinity when $\varepsilon$ goes to 0) such that for any $n \ge 2$ and any $z > \kappa_1 \log n$

$$\mathbb{P}\big( R_n > z \big) \le \kappa_2 \frac{\Gamma_\varepsilon \log z}{z^{\zeta (1 - \varepsilon)}}.$$

Since the regret is expected to be of order $\log n$, the condition $z = \Omega(\log n)$ is not an essential restriction. Further, the regret concentration, although it improves with increasing $\zeta$, is pretty slow. For comparison, remember that a zero-mean martingale $M_n$ with increments bounded by 1 would satisfy $\mathbb{P}(M_n > z) \le \exp(-2z^2/n)$. The slow concentration for UCB-V happens because the first $\Omega(\log(t))$ choices of the optimal arm can be unlucky (yielding small rewards), in which case the optimal arm will not be selected any more during the first t steps. Hence, the distribution of the regret will be of a mixture form with a mode whose position scales linearly with time and which decays only at a polynomial rate, which is controlled by $\zeta$ [footnote 7]. This reasoning relies crucially on the fact that the choices of the optimal arm can be unlucky. Hence, we have the following result:

Theorem 7. Consider $E_t = \zeta \log t$ with $c\zeta > 1$. Let $\tilde k$ denote the second optimal arm. If the essential infimum of the optimal arm is strictly larger than $\mu_{\tilde k}$, then the pseudo-regret has exponentially small tails. Inversely, if the essential infimum of the optimal arm is strictly smaller than $\mu_{\tilde k}$, then the pseudo-regret has only polynomial tails.

Remark 2. In Theorem 6 and Corollary 2, we have considered the pseudo-regret $R_n = \sum_{k=1}^{K} T_k(n) \Delta_k$ instead of the regret $\hat R_n \triangleq \sum_{t=1}^{n} X_{k^*,t} - \sum_{t=1}^{n} X_{I_t, T_{I_t}(t)}$. Our main motivation for this was to provide formulae and assumptions that are as simple as possible. The following computations explain that when the optimal arm is unique, one can obtain similar concentration bounds for the regret. Consider the interesting case when $c = 1$ and $E_t = \zeta \log t$ with $\zeta > 1$.
By modifying the analysis slightly in Corollary 2, one can get that there exists $\kappa_1 > 0$ such that for any $z > \kappa_1 \log n$, with probability at least $1 - z^{-1}$, the number of draws of suboptimal arms is bounded by $Cz$ for some $C > 0$. This means that the algorithm draws an optimal arm at least $n - Cz$ times. Now if the optimal arm is unique, this means that $n - Cz$ terms cancel out in the summation of the definition of the regret. For the $Cz$ terms which remain, one can use standard Bernstein inequalities and union bounds to prove that with probability $1 - C z^{-1}$, we have $\hat R_n \le R_n + C' \sqrt{z}$. Since the bound on the pseudo-regret is of order z (Corollary 2), a similar bound holds for the regret.

Footnote 7: Note that entirely analogous results hold for UCB1.

5 PAC-UCB

In this section, we consider that the exploration function does not depend on t: $E_{s,t} = E_s$. We show that for appropriate sequences $(E_s)_{s \ge 0}$, this leads to a UCB algorithm which has nice properties with high probability (Probably Approximately Correct), hence its name. Note that in this setting, the quantity $B_{k,s,t}$ does not depend on the time t, so we will simply write it $B_{k,s}$. Besides, in order to simplify the discussion, we take $c = 1$.

Theorem 8. Let $\beta \in (0, 1)$. Consider a sequence $(E_s)_{s \ge 0}$ satisfying $E_s \ge 2$ and

$$4K \sum_{s \ge 7} e^{-E_s} \le \beta. \tag{18}$$

Consider $u_k$ the smallest integer such that

$$\frac{u_k}{E_{u_k}} > \frac{8 \sigma_k^2}{\Delta_k^2} + \frac{26 b}{3 \Delta_k}. \tag{19}$$

With probability at least $1 - \beta$, the PAC-UCB policy plays any suboptimal arm k at most $u_k$ times.

Let $q > 1$ be a fixed parameter. A typical choice for $E_s$ is

$$E_s = \log(K s^q \beta^{-1}) \vee 2, \tag{20}$$

up to some additive constant ensuring that (18) holds. For this choice, Theorem 8 implies that for some positive constant $\kappa$, with probability at least $1 - \beta$, for any suboptimal arm k (i.e., $\Delta_k > 0$), its number of plays is bounded by

$$T_{k,\beta} \triangleq \kappa \Big( \frac{\sigma_k^2}{\Delta_k^2} + \frac{b}{\Delta_k} \Big) \log\Big[ K \Big( \frac{\sigma_k^2}{\Delta_k^2} + \frac{b}{\Delta_k} \Big) \beta^{-1} \Big],$$

which is independent of the total number of plays! This directly leads to the following upper bound on the regret of the policy at time n:

$$\sum_{k=1}^{K} T_k(n) \Delta_k \le \sum_{k : \Delta_k > 0} T_{k,\beta} \Delta_k. \tag{21}$$

One should notice that the previous bound holds with probability at least $1 - \beta$ and on the complement set no small upper bound is possible: one can find a situation in which with probability of order $\beta$, the regret is of order n (even if (21) holds with probability greater than $1 - \beta$). More formally, this means that the following bound cannot be essentially improved (unless additional assumptions are made):

$$\mathbb{E}[R_n] = \sum_{k=1}^{K} \mathbb{E}[T_k(n)] \Delta_k \le (1 - \beta) \sum_{k : \Delta_k > 0} T_{k,\beta} \Delta_k + \beta n.$$

As a consequence, if one is interested in having a bound on the expected regret at some fixed time n, one should take $\beta$ of order $1/n$ (up to possibly a logarithmic factor):

Theorem 9. Let $n \ge 7$. Consider the sequence $E_s = \log[K n (s + 1)]$. For this sequence, the PAC-UCB policy satisfies:
- with probability at least $1 - \frac{4 \log(n/7)}{n}$, for any $k : \Delta_k > 0$, the number of plays of arm k up to time n is bounded by $1 + \big( \frac{8 \sigma_k^2}{\Delta_k^2} + \frac{26 b}{3 \Delta_k} \big) \log(K n^2)$;
- the expected regret at time n satisfies

$$\mathbb{E}[R_n] \le \sum_{k : \Delta_k > 0} \Big( \frac{24 \sigma_k^2}{\Delta_k} + 30 b \Big) \log(n/3). \tag{22}$$
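The cap $u_k$ of Theorem 8 can be computed by a direct search on condition (19). The sketch below is a simplified illustration, not the authors' procedure: it uses the typical choice (20) without the additive constant, so condition (18) is not verified here, and the function names, the default $q = 1.3$ and the numerical instance are assumptions made for the example.

```python
import math


def exploration(s, K, beta, q=1.3):
    """E_s = max(log(K * s**q / beta), 2): choice (20), additive constant omitted."""
    return max(math.log(K * s ** q / beta), 2.0)


def pull_cap(sigma2, gap, b, K, beta, q=1.3):
    """Smallest integer u with u / E_u > 8 sigma^2 / gap^2 + 26 b / (3 gap), see (19)."""
    threshold = 8.0 * sigma2 / gap ** 2 + 26.0 * b / (3.0 * gap)
    u = 1
    while u / exploration(u, K, beta, q) <= threshold:
        u += 1
    return u


# Hypothetical suboptimal arm: variance 0.25, gap 0.5, rewards in [0, 1].
print(pull_cap(sigma2=0.25, gap=0.5, b=1.0, K=2, beta=0.05))
```

The search terminates because $u / E_u$ grows without bound in u, and the resulting cap involves K, $\beta$ and the arm's parameters but not the horizon n, in line with the remark that $T_{k,\beta}$ is independent of the total number of plays.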

6 Open problem

When the horizon time n is known, one may want to choose the exploration function E depending on the value of n. For instance, in view of Theorems 3 and 6, one may want to take $c = 1$ and a constant exploration function $E \equiv 3 \log n$. This choice ensures logarithmic expected regret and a nice concentration property:

$$\mathbb{P}\Big\{ R_n > 24 \sum_{k : \Delta_k > 0} \Big( \frac{\sigma_k^2}{\Delta_k} + 2b \Big) \log n \Big\} \le \frac{C}{n}. \tag{23}$$

This algorithm does not behave like the one which simply takes $E_{s,t} = 3 \log t$. Indeed the algorithm with constant exploration function $E_{s,t} = 3 \log n$ concentrates its exploration phase at the beginning of the plays, and then switches to exploitation mode. On the contrary, the algorithm which adapts to the time horizon explores and exploits during all the time interval $[0; n]$. However, in view of Theorem 7, it satisfies only

$$\mathbb{P}\Big\{ R_n > 24 \sum_{k : \Delta_k > 0} \Big( \frac{\sigma_k^2}{\Delta_k} + 2b \Big) \log n \Big\} \ge \frac{C}{(\log n)^{C}},$$

which is significantly worse than (23). The open question is: is there an algorithm that adapts to the time horizon which has a logarithmic expected regret and a concentration property similar to (23)? We conjecture that the answer is no.

References

1. R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27:1054-1078, 1995.
2. J.-Y. Audibert, R. Munos, and C. Szepesvári. Variance estimates and exploration function in multi-armed bandit. Research report 07-31, Certis - Ecole des Ponts, 2007.
3. P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.
4. P. Auer, N. Cesa-Bianchi, and J. Shawe-Taylor. Exploration versus exploitation challenge. In 2nd PASCAL Challenges Workshop. Pascal Network, 2006.
5. J. C. Gittins. Multi-armed Bandit Allocation Indices. Wiley-Interscience series in systems and optimization. Wiley, Chichester, NY, 1989.
6. T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22, 1985.
7. T. L. Lai and S. Yakowitz. Machine learning and nonparametric bandit theory. IEEE Transactions on Automatic Control, 40:1199-1209, 1995.
8. H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematics Society, 58:527-535, 1952.
9. W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285-294, 1933.
