Stochastic Multi-armed Bandits in Constant Space


David Liau, Eric Price, Zhao Song, Ger Yang
The University of Texas at Austin

Abstract

We consider the stochastic bandit problem in the sublinear space setting, where one cannot record the win-loss record for all K arms. We give an algorithm using O(1) words of space with regret

Σ_{i=1}^{K} (1/Δ_i) · log(Δ_i/Δ) · log T,

where Δ_i is the gap between the best arm and arm i, and Δ is the gap between the best and the second-best arms. If the rewards are bounded away from 0 and 1, this is within an O(log(1/Δ)) factor of the optimum regret possible without space constraints.

1 Introduction

In this paper, we study the multi-armed bandit problem in a sublinear space setting. In an instance of the bandit problem, there are K arms and a finite time horizon 1, ..., T, where T could be unknown to us. At each time step, we pull one of the K arms and receive a reward that depends on our choice. The goal is to find a strategy that achieves regret sublinear in time, where regret is defined as the difference between the cumulative reward received by our strategy and the reward we could have received had we always pulled the best arm in hindsight. There are many formulations of the bandit problem; in this paper we consider the stochastic setting specifically. In the stochastic setting, one assumes the rewards from the i-th arm are i.i.d. random variables with mean µ_i and support [0, 1].

(Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the authors.)

A well-known algorithm for the stochastic bandit is the UCB algorithm (Auer et al., 2002), and it is known that UCB achieves regret O(Σ_{i:Δ_i>0} (log T)/Δ_i). The UCB algorithm requires Ω(K) space, since it records the estimated rewards of all K arms. However, in settings with limited space, such as streaming algorithms, or settings with infinitely many arms (Kleinberg, 2004), this requirement is problematic. There is a significant literature addressing this problem, but existing approaches assume structural properties on the set of arms, e.g.
combinatorial structure (Cesa-Bianchi and Lugosi, 2012) or a continuum of arms with a local Lipschitz condition (Kleinberg, 2004). A natural question is: what can we do without these structural assumptions, given limited space? A particular example is the streaming algorithm setting, where space is much more limited than time, as in a router (Zhang, 2013). If the space constraint is o(K) but the time constraint is Ω(K), one cannot run traditional UCB. In this case, O(K) total regret is still acceptable, and by accepting O(K) total regret we can avoid requiring structural assumptions. In a router, a more complicated strategy corresponds to a larger set K of possible strategies, which grants us a tradeoff: a larger K will result in higher regret but a better optimum. Since routers have strict space constraints, running UCB would keep K small, resulting in an extremely small average regret over time of K/T = space/time, which is acceptable for routers. Our algorithm provides more flexibility in this bias/variance tradeoff.

Our algorithm is based on fairly simple ideas. First, suppose we know the time horizon T and the expected value µ* of the optimal arm. We could then make a single pass through the arms; for each arm i, flip it until we have high (1 − 1/T³)

confidence that Δ_i = µ* − µ_i > 0, where µ_i is the expected value of arm i. Once this happens, move to the next arm. This will flip each arm O((log T)/Δ_i²) times, inducing regret O((log T)/Δ_i) from this arm. The total regret will then be O(Σ_i (log T)/Δ_i), which is ideal, with only constant space required.

The problem is that we don't know T or µ*. Not knowing T isn't a big deal — we can partition the time horizon into O(log log T) scales, and the last log T_l term will dominate (Auer and Ortner, 2010) — but not knowing µ* is a serious problem. We solve this problem by iteratively refining upper and lower bounds µ_UB and µ_LB on µ*. In each pass through the arms, we get new estimates that are half as far from each other. After log(1/Δ) passes, where Δ = min_{i:Δ_i>0} Δ_i is the minimal gap between the optimal arm and any suboptimal arm, only the best arm will remain in the interval. This gives an algorithm that loses at most an O(log(1/Δ)) factor in the regret. In some cases, the loss is significantly smaller. Hence, we obtain the following result, which improves the log(1/Δ) factor to a log(Δ_i/Δ) factor.

Theorem 1.1. Given a stochastic bandit instance with K arms and expected values µ_1, ..., µ_K ∈ [0, 1], let µ* = max_{i∈[K]} µ_i, Δ_i = µ* − µ_i, and Δ = min_{i:Δ_i>0} Δ_i. For any T > 0, there exists an algorithm that uses O(1) words of space and achieves regret

O( Σ_{i:Δ_i>0} (1/Δ_i) · log(Δ_i/Δ) · log T ).

Recall that the well-known UCB algorithm gives regret O(Σ_{i:Δ_i>0} (log T)/Δ_i). Our algorithm is always within a log(1/Δ) factor of its space-unlimited version. In certain situations, we can do slightly better by refining our estimate of µ* by more than a constant factor in each iteration. This gives us the following result.

Theorem 1.2. Under the same setting as Theorem 1.1, for any γ > 0, there exists an algorithm that uses O(1) words of space and achieves regret

O( Σ_{i:Δ_i>0} (1/Δ_i) · ( log^γ(2/Δ_i) + log(Δ_i/Δ)/(γ · log log(Δ_i/Δ)) ) · log T ).

In particular, if we set γ = 1/2, this algorithm is always within an O( log(1/Δ)/log log(1/Δ) ) factor of the space-unlimited UCB algorithm.

The paper is organized as follows. Section 2 reviews the related work. Section 3 provides detailed preliminaries on the problem formulation and the background needed for our results.
Sections 4 and 5 contain the algorithms that give the results of Theorems 1.1 and 1.2, respectively, with known time horizon T. Section 6 demonstrates how to extend the algorithms to the case of an unknown time horizon. The full version is available on arXiv (Liau et al., 2017).

2 Related Work

For stochastic bandits, the seminal work by Lai and Robbins (1985) demonstrated the idea of using confidence intervals to solve the problem, and it showed a lower bound of Ω( Σ_i Δ_i log(T) / KL(µ_i, µ*) ) on the regret. The UCB algorithm, a simple solution to stochastic bandits, was analyzed in Auer et al. (2002). The UCB algorithm is based on Hoeffding's inequality, which is optimal only when Δ_i² is on the order of KL(µ_i, µ*). In certain situations this can be improved using different types of concentration inequalities; for example, Audibert et al. (2009) used Bernstein's inequality to derive an algorithm whose regret depends on the second moments. Later, Garivier and Cappé (2011) and Maillard et al. (2011) independently proposed the KL-UCB algorithm, which matches the lower bound. We refer the reader to the comprehensive survey by Bubeck and Cesa-Bianchi (2012) for general bandit problems.

In addition to regret analysis for online decision making, there is a set of papers that discuss the sample complexity of the pure exploration problem, i.e., how to identify the best arm (Mannor and Tsitsiklis, 2004; Even-Dar et al., 2002; Jamieson et al., 2014; Karnin et al., 2013; Kaufmann et al., 2015; Even-Dar et al., 2006). Similar algorithms have been used in the regime of online decision making (Bui et al., 2011; Auer and Ortner, 2010). Building on best-arm identification, the explore-then-commit (ETC) policy is designed to first perform some tests to identify the best arm, and then commit to it for the remaining time horizon. The ETC policy is known to be suboptimal (Garivier et al., 2016) but simplifies the analysis. In particular, our algorithm is based on the framework of Auer and Ortner (2010), but our algorithm takes only O(1) space while their method takes O(K) space.
Moreover, there is a small set of papers that integrate sketching techniques from streaming into online learning (Hazan and Seshadhri, 2009; Luo et al., 2016). Hazan and Seshadhri (2009) considered

the problem of minimizing α-exp-concave losses, where the regret is required to be O(log T) uniformly over time; they used ideas from streaming to keep a small active set of experts. Luo et al. (2016) considered the online convex optimization problem and used sketching ideas to reduce the cost of computing online Newton steps; however, the complexity is still Ω(K).

3 Preliminaries

Notation. For any positive integer n, we use [n] to denote the set {1, 2, ..., n}. For a random variable X, let E[X] denote its expectation (if this quantity exists). In addition to O(·) notation, for two functions f, g, we use the shorthand f ≲ g (resp. ≳) to indicate that f ≤ Cg (resp. ≥) for an absolute constant C. We use f ≂ g to mean cf ≤ g ≤ Cf for constants c, C. We measure space in words using the word RAM model, so that the input values, such as K, T, and the rewards, as well as our variables, can each be expressed in one word of O(log(KT)) bits. For more details on the word RAM model, we refer the reader to Aho et al. (1974) and Cormen et al. (2009).

3.1 Problem Formulation

Definition 3.1. In a multi-armed bandit problem, there are K arms in total and a finite time horizon 1, 2, ..., T. At each time step t ∈ [T], the player has to choose an arm I_t ∈ [K] to play, and receives a reward X_{I_t,t} associated with that arm. Without loss of generality, assume that for each arm i ∈ [K] and each time step t ∈ [T], X_{i,t} ∈ [0, 1]. The goal of the player is to maximize the total reward. We measure the performance of an algorithm via its regret, defined as the difference between the best reward in hindsight and the reward received by the algorithm:

Ψ̂_T = max_{i∈[K]} Σ_{t=1}^{T} X_{i,t} − Σ_{t=1}^{T} X_{I_t,t}.

In this paper, we consider the stochastic setting, where we assume the rewards come from stochastic processes.

Definition 3.2. In a stochastic bandit, we assume each arm i ∈ [K] is associated with a distribution D_i over [0, 1] with mean µ_i. The reward X_{i,t} at time t ∈ [T] is drawn from D_i independently.
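As a concrete illustration of the regret in Definition 3.1, the following sketch (illustrative code, not from the paper; the reward table and play sequence are made up) computes the realized regret of a fixed sequence of choices on a toy instance:

```python
def empirical_regret(rewards, choices):
    """Regret per Definition 3.1: best fixed-arm reward in hindsight minus
    the reward actually collected by the play sequence `choices`."""
    T = len(choices)
    best = max(sum(row[:T]) for row in rewards)       # best arm in hindsight
    got = sum(rewards[choices[t]][t] for t in range(T))
    return best - got

# Toy instance with K = 2 arms and T = 4 steps; always playing arm 1
# collects 1.0, while arm 0 would have collected 3.0 in hindsight.
rewards = [[1.0, 0.0, 1.0, 1.0],   # X_{0,t}
           [0.0, 1.0, 0.0, 0.0]]   # X_{1,t}
assert empirical_regret(rewards, [1, 1, 1, 1]) == 2.0
```

In the stochastic setting of Definition 3.2, each row of such a table would be drawn i.i.d. from the arm's distribution D_i.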
For stochastic bandits, instead of the regret defined above, we consider the pseudo-regret:

Ψ_T = max_{i∈[K]} E[ Σ_{t=1}^{T} X_{i,t} ] − E[ Σ_{t=1}^{T} X_{I_t,t} ].

We can rewrite the pseudo-regret using Wald's identity:

Ψ_T = Σ_{j=1}^{K} E[N_{j,T}] · Δ_j,

where N_{j,T} is the number of times arm j is chosen up to time T, and we define Δ_j = µ* − µ_j to be the gap between the means of the best arm and arm j. We use µ* to denote the mean reward of the arm with the highest mean, i.e., µ* = max_{i∈[K]} µ_i.

3.2 Concentration Inequalities

In this paper, for simplicity, we use the Chernoff-Hoeffding inequality to analyze the concentration behavior of random variables with bounded support.

Fact 3.3 (Chernoff-Hoeffding bound). Let x_1, x_2, ..., x_n be i.i.d. random variables in [0, 1], and let X̄ = (1/n) Σ_{i=1}^{n} x_i. Then for any ε > 0, Pr[ |X̄ − E[X̄]| > ε ] ≤ 2e^{−2nε²}.

4 UCBConstSpace with Known T

The original UCB-1 algorithm (Auer et al., 2002) needs O(K) space to achieve O(Σ_{i:Δ_i>0} (log T)/Δ_i) regret. In this section, we propose a new algorithm that requires only O(1) space in exchange for a slightly worse regret. First, we consider the setting where T is known. The main result is presented in the following theorem.

Theorem 4.1. Given a stochastic bandit instance with known T, let Δ_i = µ* − µ_i, and let Δ = min_{i:Δ_i>0} Δ_i. Then for any T > 0, there exists an algorithm that uses O(1) words of space and achieves regret

O( Σ_{i:Δ_i>0} (1/Δ_i) · log(Δ_i/Δ) · log T ).

We present the method in Algorithm 1, where we iteratively improve our estimate of µ*. More precisely, we scan through the arms in multiple rounds. In the r-th round, we sample each arm up to some

precision g_r. The desired precision g_r is halved after each round. In this sampling process, we keep only the information of the best and the second-best arms seen in the current and the previous rounds, instead of saving it for all arms. With the information of the best arm and the current precision g_r, we can refine the upper and lower bounds µ_UB and µ_LB on µ*. If an arm's upper confidence value is less than µ_LB, we can rule it out without continuing to precision g_r. The process terminates once we can distinguish the best arm from the rest.

Algorithm 1 UCB algorithm with constant space and known T (Theorem 4.1)

1: procedure UCBConstSpace(K, T)
2:   Set δ ← 1/T³; initialize g_1 ← 1/2, t ← 1
3:   Exploration Phase:
4:   for rounds r = 1, 2, ... do
5:     ã* ← the best arm from the previous round; µ̃* ← mean reward recorded for arm ã* in the previous round
6:     N ← 4 log(1/δ)/g_r², the maximum number of plays for each arm in the current round
7:     Initialize a, b ← 0, the best and the second-best arms in this round
8:     Initialize µ̃_a, µ̃_b ← 0, the means for arms a and b
9:     for each arm i = 1, ..., K do
10:      Set µ̃_i ← 0, which keeps the mean reward for arm i in the current round
11:      for n = 1, ..., N do
12:        Pull arm i and receive reward v
13:        t ← t + 1
14:        Update µ̃_i with v: µ̃_i ← ((n − 1)µ̃_i + v)/n
15:        if µ̃_i + √(log(1/δ)/n) < µ̃* − g_{r−1}/2 then
16:          break, i.e., we rule out arm i for the current round
17:        end if
18:      end for
19:      if µ̃_i > µ̃_a then b ← a, µ̃_b ← µ̃_a, a ← i, µ̃_a ← µ̃_i  (update the best and the 2nd best arms)
20:      else if µ̃_i > µ̃_b then b ← i, µ̃_b ← µ̃_i  (update the 2nd best arm)
21:    end for
22:    Stopping Criterion: if µ̃_a − g_r/2 > µ̃_b + g_r/2 or t > T then break
23:    Update ã* ← a and µ̃* ← µ̃_a
24:    Set the new precision: g_{r+1} ← g_r/2
25:  end for
26:  Exploitation Phase:
27:  Pull arm a for the remaining time steps.
28: end procedure

We define a_r and b_r as the best arm and the second-best arm stored at the end of the r-th round. Also, we let µ̃_i^r be the recorded empirical mean for arm i at the end of the r-th round, and denote by n_i^r the total number of pulls of arm i in the r-th round. Then, we define µ̃_{i,n}^r as the empirical mean µ̃_i stored for arm i after pulling it n times in round r.
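To make the control flow concrete, here is a minimal simulation sketch of Algorithm 1 (hypothetical Python, not the authors' code; `pull(i)` is an assumed reward oracle returning values in [0, 1], and the constants are for illustration). Note that only O(1) scalars are stored, regardless of K:

```python
import math

def ucb_const_space(pull, K, T):
    """Sketch of Algorithm 1: constant-space UCB with known horizon T.
    Returns the arm committed to and the total reward collected."""
    delta = 1.0 / T**3
    g = 0.5                             # current precision g_r
    t, total = 0, 0.0
    star_mean = None                    # previous round's best empirical mean
    while True:
        N = max(1, math.ceil(4 * math.log(1 / delta) / g**2))
        a, mu_a = None, -1.0            # best arm this round
        b, mu_b = None, -1.0            # second-best arm this round
        for i in range(K):
            mu_i = 0.0
            for n in range(1, N + 1):
                if t >= T:
                    break
                v = pull(i)
                total, t = total + v, t + 1
                mu_i += (v - mu_i) / n  # running mean, O(1) space
                # rule arm i out early if its UCB falls below the lower
                # confidence bound on mu* from the previous round
                # (previous precision was 2g, so the threshold is mean - g)
                if star_mean is not None and \
                   mu_i + math.sqrt(math.log(1 / delta) / n) < star_mean - g:
                    break
            if mu_i > mu_a:
                b, mu_b, a, mu_a = a, mu_a, i, mu_i
            elif mu_i > mu_b:
                b, mu_b = i, mu_i
        # stopping criterion: best and second-best separated, or time is up
        if t >= T or mu_a - g / 2 > mu_b + g / 2:
            break
        star_mean = mu_a
        g /= 2                          # halve the precision each round
    while t < T:                        # exploitation phase
        total += pull(a)
        t += 1
    return a, total

# Deterministic sanity check (degenerate reward distributions are allowed):
best, reward = ucb_const_space(lambda i: [0.2, 0.9][i], K=2, T=5000)
assert best == 1
```

With these gaps (Δ = 0.7 > g_1 = 1/2), the stopping criterion already fires after the first round, and the remaining budget is spent on the committed arm.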
Further, we define r* as the value of r at the moment the algorithm exits the loop in Line 22.

Definition 4.2. For each r ∈ [r*], define ξ_r to be the event: ∃ r′ ∈ [r], i ∈ [K], n ∈ [n_i^{r′}] such that |µ̃_{i,n}^{r′} − µ_i| > √(log(1/δ)/n), i.e., the event that some estimate µ̃_{i,n}^{r′} up to round r is not within our desired confidence interval.

Throughout the first part of our analysis, when discussing the state of the algorithm at round r, we focus on the case where the complement ξ̄_r holds, i.e., all estimates are within our desired confidence interval.

Lemma 4.3. In Algorithm 1, at any round r ∈ [r*], given ξ̄_r, the following statements are true:

1. n_{a_r}^r = 4 log(1/δ)/g_r², i.e., the claimed optimal arm cannot be ruled out early.
2. n_{i*}^r = 4 log(1/δ)/g_r², i.e., the true optimal arm i* cannot be ruled out early.
3. |µ̃_{a_r}^r − µ*| ≤ g_r/2.

Proof. We prove this lemma by induction. For the base case, the first and the second statements are true because in the first round every arm has to be played 4 log(1/δ)/g_1²

times. For the third statement, we prove by contradiction. Assume the contrary, i.e., µ̃_{a_1}^1 − µ* > g_1/2 or µ* − µ̃_{a_1}^1 > g_1/2. Note that under ξ̄_1, √(log(1/δ)/n) = g_1/2 when n = 4 log(1/δ)/g_1². If µ̃_{a_1}^1 − µ* > g_1/2, then we have

µ* < µ̃_{a_1}^1 − g_1/2 ≤ µ_{a_1} + √(log(1/δ)/n_{a_1}^1) − g_1/2 ≤ µ_{a_1} + g_1/2 − g_1/2 = µ_{a_1},

where the second step follows from condition ξ̄_1 and the third step follows from n_{a_1}^1 = 4 log(1/δ)/g_1². The above leads to a contradiction, because µ* ≥ µ_i for any i. Similarly, if µ* − µ̃_{a_1}^1 > g_1/2, then we have

µ̃_{a_1}^1 < µ* − g_1/2 ≤ µ̃_{i*}^1 + √(log(1/δ)/n_{i*}^1) − g_1/2 ≤ µ̃_{i*}^1 + g_1/2 − g_1/2 = µ̃_{i*}^1,

where the second step follows from condition ξ̄_1 and the third step follows from n_{i*}^1 = 4 log(1/δ)/g_1². This also results in a contradiction, because for any arm to be assigned as a_1, its empirical mean must be at least µ̃_{i*}^1; in particular µ̃_{a_1}^1 ≥ µ̃_{i*}^1.

For the induction step, we assume the three statements are true for all rounds up to r − 1, and consider round r. We first prove the second statement. Assume the contrary, i.e., the true optimal arm has been ruled out early, meaning that for some n,

µ̃_{i*,n}^r + √(log(1/δ)/n) < µ̃_{a_{r−1}}^{r−1} − g_{r−1}/2.

Then, we can see that

µ* ≤ µ̃_{i*,n}^r + √(log(1/δ)/n) < µ̃_{a_{r−1}}^{r−1} − g_{r−1}/2 ≤ µ*,    (3)

where the first step follows from condition ξ̄_r and the last step follows from the induction hypothesis |µ̃_{a_{r−1}}^{r−1} − µ*| ≤ g_{r−1}/2. There is a contradiction in (3), hence the second statement is true.

Next, we can see that the first statement is now clear, because we have shown that there is at least one arm, namely i*, that is played 4 log(1/δ)/g_r² times in the r-th round. If arm a_r is not arm i*, then for a_r to be assigned as the best arm we must have µ̃_{a_r}^r ≥ µ̃_{i*}^r, so arm a_r cannot have triggered the elimination test and has to be pulled 4 log(1/δ)/g_r² times as well.

For the third statement, the proof is similar to the base case, where we prove by contradiction. Assume the contrary, i.e., µ̃_{a_r}^r − µ* > g_r/2 or µ* − µ̃_{a_r}^r > g_r/2. If µ̃_{a_r}^r − µ* > g_r/2, then we have

µ* < µ̃_{a_r}^r − g_r/2 ≤ µ_{a_r} + √(log(1/δ)/n_{a_r}^r) − g_r/2 ≤ µ_{a_r} + g_r/2 − g_r/2 = µ_{a_r},

where the second step follows from condition ξ̄_r and the third step follows from n_{a_r}^r = 4 log(1/δ)/g_r², by the first statement. This results in a contradiction, because µ* ≥ µ_i for any i ∈ [K]. Similarly, if µ* − µ̃_{a_r}^r > g_r/2, then we have

µ̃_{a_r}^r < µ* − g_r/2 ≤ µ̃_{i*}^r + √(log(1/δ)/n_{i*}^r) − g_r/2 ≤ µ̃_{i*}^r + g_r/2 − g_r/2 = µ̃_{i*}^r,

where the second step follows from condition ξ̄_r and the third step follows from n_{i*}^r = 4 log(1/δ)/g_r², by the second statement. This results in a contradiction, because for any arm to be assigned as a_r, we must have µ̃_{a_r}^r ≥ µ̃_{i*}^r.

Lemma 4.4. In Algorithm 1, conditioning on the event ξ̄_{r*}, we have r* ≤ log(2/Δ).

Proof. Assume the contrary, i.e., at the end of round r = log(2/Δ), the best arm and the second-best arm are still not differentiated, meaning that we still have

µ̃_{a_r}^r − g_r/2 ≤ µ̃_{b_r}^r + g_r/2.

First note that r > log(2/Δ) implies g_r = 2^{−r} < Δ/2. We have

µ* ≤ µ̃_{i*}^r + √(log(1/δ)/n_{i*}^r) ≤ µ̃_{i*}^r + g_r/2 < µ̃_{a_r}^r + 3g_r/2 < µ̃_{a_r}^r + 3Δ/4,

where the third step uses µ̃_{i*}^r ≤ µ̃_{a_r}^r + g_r, which holds whether or not i* = a_r, since the arms are not differentiated. Similarly, we can show that µ_{a_r} > µ̃_{a_r}^r − Δ/4. Then, we have

µ* − µ_{a_r} < (µ̃_{a_r}^r + 3Δ/4) − (µ̃_{a_r}^r − Δ/4) = Δ,

which results in a contradiction if a_r ≠ i*, since then µ* − µ_{a_r} ≥ Δ. (If a_r = i*, the same argument applied to b_r, whose empirical mean is within g_r of µ̃_{a_r}^r, gives µ* − µ_{b_r} < Δ, again a contradiction.) This implies that, given ξ̄_{r*}, we must have r* ≤ log(2/Δ).

Lemma 4.5. In Algorithm 1, at any round r, given ξ̄_r, the number of plays of any arm i ∈ [K] is upper bounded by n_i^r ≤ 4 log(1/δ)/(Δ_i − 2g_r)² whenever Δ_i > 2g_r.

Proof. First, note that as long as arm i has not been ruled out at pull n of round r, we have

µ̃_{i,n}^r + √(log(1/δ)/n) ≥ µ̃_{a_{r−1}}^{r−1} − g_{r−1}/2.    (4)

Then, we can show

µ_i + 2√(log(1/δ)/n) ≥ µ̃_{i,n}^r + √(log(1/δ)/n) ≥ µ̃_{a_{r−1}}^{r−1} − g_{r−1}/2 ≥ µ* − g_{r−1} = µ* − 2g_r,

where the first step follows from condition ξ̄_r, the second step follows from (4), and the third step follows from Lemma 4.3. Hence 2√(log(1/δ)/n) ≥ Δ_i − 2g_r, and reorganizing this inequality proves the lemma.

Proof of Theorem 4.1. Consider Algorithm 1. For each round r ∈ [r*], conditioned on ξ̄_{r*} (i.e., the confidence intervals are correct), we first recognize two bounds on the number of plays n_i^r of each arm i ∈ [K]. By the definition of Algorithm 1, we have

n_i^r ≤ 4 log(1/δ)/g_r².    (5)

Also, from Lemma 4.5, we have

n_i^r ≤ 4 log(1/δ)/(Δ_i − 2g_r)².    (6)

Combining (5) and (6) gives n_i^r ≲ log(1/δ)/max{g_r, Δ_i}², and together with r* ≤ log(2/Δ) from Lemma 4.4, we can upper bound the regret resulting from pulling arm i in the algorithm:

Σ_{r=1}^{r*} Δ_i n_i^r ≲ Σ_{r: g_r ≥ Δ_i} Δ_i log(1/δ)/g_r² + Σ_{r: g_r < Δ_i} log(1/δ)/Δ_i ≲ log(1/δ)/Δ_i + log(Δ_i/Δ) · log(1/δ)/Δ_i ≲ log(2Δ_i/Δ) · log(1/δ)/Δ_i,    (7)

where the first sum is geometric because g_r is halved each round, and the second sum has at most log(Δ_i/Δ) + O(1) terms because g_r only ranges from roughly Δ_i down to roughly Δ.

For the next step, we find an upper bound for the probability of the event ξ_{r*} = { ∃ r ∈ [r*], i ∈ [K], n ∈ [n_i^r] s.t. |µ̃_{i,n}^r − µ_i| > √(log(1/δ)/n) }. By Fact 3.3 and a union bound over at most T² triples (r, i, n),

Pr[ξ_{r*}] ≤ Σ_{r,i,n} Pr[ |µ̃_{i,n}^r − µ_i| > √(log(1/δ)/n) ] ≤ T² · 2e^{−2 log(1/δ)} ≤ T²δ.    (8)

Finally, by choosing δ = 1/T³ and combining (7) and (8), we have

Ψ_T ≲ Σ_{i:Δ_i>0} (1/Δ_i) log(Δ_i/Δ) log T + T · T²δ ≲ Σ_{i:Δ_i>0} (1/Δ_i) log(Δ_i/Δ) log T,

which proves the theorem.

5 Improved Algorithm for UCBConstSpace

The result in Theorem 4.1 carries an additional log(Δ_i/Δ) factor relative to the original UCB-1 algorithm of Auer et al. (2002). This means that in a

bad scenario, for example, if most of the arms have gap Δ_i = 1/2 while Δ = 1/K, the log(Δ_i/Δ) factor translates into an additional log K factor in the regret. In this section, we show that we are able to improve the additional log(Δ_i/Δ) factor to a log(Δ_i/Δ)/log log(Δ_i/Δ) factor by slightly changing the update rule for the precision g_r. This means that in the bad example described above, we are improving the competitive ratio from log K to log K / log log K. We present our result in the following theorem.

Theorem 5.1. Given a stochastic bandit instance with known T, let Δ_i = µ* − µ_i, and let Δ = min_{i:Δ_i>0} Δ_i. For any γ > 0 and any T > 0, there exists an algorithm that uses O(1) words of space and achieves regret

O( Σ_{i:Δ_i>0} (1/Δ_i) · ( log^γ(2/Δ_i) + log(Δ_i/Δ)/(γ · log log(Δ_i/Δ)) ) · log T ).

We consider a modified version of Algorithm 1, where the update rule in Line 24 is replaced by

g_{r+1} = g_r / (log(1/g_r))^ε,    (9)

where ε is some constant to be determined later. In the following lemma, we show that with this update rule, given any D < 1, it takes only O( (1/ε) · log(1/D)/log log(1/D) ) steps to reach accuracy D.

Lemma 5.2. Given any g_0, D ∈ (0, 1) with D < g_0, let r_0 = log(g_0/D)/log log(g_0/D). Suppose that for every positive integer r, g_r = g_{r−1}/(log(1/g_{r−1}))^ε. Then, for any r ≥ (1 + 2/ε) r_0, we have g_r ≤ D.

Proof. First, note that by the definition of g_r, we have g_r ≤ g_0 · 2^{−r} for any r. Therefore, for any r ≥ r_0, we have

g_r ≤ g_0 · 2^{−r} ≤ g_0 · 2^{−r_0} = g_0 · (D/g_0)^{1/log log(g_0/D)},

so that log(1/g_r) ≥ log(g_0/D)/log log(g_0/D) for all r ≥ r_0. As a result, each subsequent step divides g_r by at least

( log(g_0/D)/log log(g_0/D) )^ε ≥ ( log(g_0/D) )^{ε/2},

so after (2/ε) r_0 further steps, g_r has been divided by at least (log(g_0/D))^{r_0} = g_0/D, since r_0 · log log(g_0/D) = log(g_0/D). This implies that for any r ≥ r_0 + (2/ε) r_0 = (1 + 2/ε) r_0, we have g_r ≤ D.

Note that we can apply Lemma 4.3 and Lemma 4.5 to Algorithm 1 with update rule (9), because they do not rely on the specific update rule. Before we proceed to the proof of Theorem 5.1, we need the following lemma for an upper bound on r*.

Lemma 5.3. In Algorithm 1 with update rule (9), given ξ̄_{r*}, we have r* ≲ (1/ε) · log(2/Δ)/log log(2/Δ).

Due to space constraints, we provide the detailed proof of this lemma in the full version of our paper (Liau et al., 2017).

Proof of Theorem 5.1. Consider Algorithm 1 with update rule (9).
For each arm i ∈ [K], if we condition on ξ̄_{r*}, then by Lemma 4.5 and Lemma 5.3 we can upper bound the regret resulting from pulling arm i in the algorithm:

Σ_{r=1}^{r*} Δ_i n_i^r ≲ Σ_{r=1}^{r_i} Δ_i log(1/δ)/g_r² + Σ_{r=r_i+1}^{r*} log(1/δ)/Δ_i,    (10)

where r_i is the minimal round r such that g_r < Δ_i/4. For the first term of (10), since g_r decays super-exponentially (i.e., g_{r+1} ≤ g_r/2), we have

Σ_{r=1}^{r_i} Δ_i log(1/δ)/g_r² ≲ Δ_i log(1/δ)/g_{r_i}² = (log(1/g_{r_i−1}))^{2ε} · Δ_i log(1/δ)/g_{r_i−1}² ≲ (log(2/Δ_i))^{2ε} · log(1/δ)/Δ_i,    (12)

where the second step follows from the update rule (9), and the last step follows because g_{r_i−1} ≥ Δ_i/4 by the definition of r_i. For the second term of (10), by Lemma 5.2 we can find that it takes O( (1/ε) · log(Δ_i/Δ)/log log(Δ_i/Δ) ) rounds to get from Δ_i/4 to Δ/4.

As a result, we can upper bound the second term of (10) by

Σ_{r=r_i+1}^{r*} log(1/δ)/Δ_i ≲ (1/ε) · ( log(Δ_i/Δ)/log log(Δ_i/Δ) ) · log(1/δ)/Δ_i.    (13)

Using a similar argument to the one in the proof of Theorem 4.1, we can find that

Pr[ξ_{r*}] ≤ T²δ.    (14)

Finally, by combining (12), (13), and (14), we get

Ψ_T ≲ Σ_{i:Δ_i>0} ( (log(2/Δ_i))^{2ε} + (1/ε) · log(Δ_i/Δ)/log log(Δ_i/Δ) ) · log(1/δ)/Δ_i + T · T²δ.

By choosing δ = 1/T³ and ε = γ/2, we can find that

Ψ_T ≲ Σ_{i:Δ_i>0} (1/Δ_i) · ( log^γ(2/Δ_i) + log(Δ_i/Δ)/(γ · log log(Δ_i/Δ)) ) · log T,

which proves the theorem.

We conjecture below that the log(1/Δ)/log log(1/Δ) factor is not improvable given the space constraint. The discussion of our conjectured hard instance is in the full version of our paper (Liau et al., 2017).

Conjecture 5.4. There exists a distribution over stochastic bandit problems such that any algorithm taking O(1) words of space will have regret

Ω( Σ_{i:Δ_i>0} (1/Δ_i) · ( log(1/Δ)/log log(1/Δ) ) · log T ).

6 Unknown Horizon T

Now, we show that, using the technique described in Auer and Ortner (2010), we are able to get the same regret as in Theorem 4.1 when T is unknown.

Theorem 6.1 (Restatement of Theorem 1.1). Given a stochastic bandit instance with unknown T, let Δ_i = µ* − µ_i, and let Δ = min_{i:Δ_i>0} Δ_i. For any T > 0, there exists an algorithm that uses O(1) words of space and achieves regret

O( Σ_{i:Δ_i>0} (1/Δ_i) · log(Δ_i/Δ) · log T ).

Algorithm 2 UCB algorithm with constant space and unknown T (Theorem 6.1 and Theorem 6.2)

1: procedure UCBCS-UnknownT(K)
2:   Initialize T_0 to a constant
3:   l ← 0, t ← 1
4:   while t ≤ T do
5:     Call UCBConstSpace(K, T_l)
6:     t ← t + T_l
7:     l ← l + 1
8:     T_l ← T_{l−1}²
9:   end while
10: end procedure

Due to space constraints, we defer the proof of this theorem to the full version of our paper (Liau et al., 2017). Similarly, we are able to use this trick for the improved algorithm in Section 5 and get the same regret as in Theorem 5.1.

Theorem 6.2 (Restatement of Theorem 1.2). Given a stochastic bandit instance with unknown T, let Δ_i = µ* − µ_i, and let Δ = min_{i:Δ_i>0} Δ_i. For any γ > 0 and any T > 0, there exists an algorithm that uses O(1) words of space and achieves regret

O( Σ_{i:Δ_i>0} (1/Δ_i) · ( log^γ(2/Δ_i) + log(Δ_i/Δ)/(γ · log log(Δ_i/Δ)) ) · log T ).

7 Conclusion

We proposed a constant-space algorithm for the stochastic multi-armed bandit problem.
Our algorithms proceed by iteratively refining a confidence interval containing the best arm's value. In the simpler version of our algorithm, we refine the interval by a constant factor in each step, and each iteration only uses O(OPT) regret. This gives an O(log(1/Δ))-competitive algorithm. We then showed how to improve this by a log log(1/Δ) factor in certain cases, by using fewer rounds that each make more progress. Finally, we showed how to adapt our algorithms — which involve parameters that depend on the time horizon T — to situations with an unknown time horizon.
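For concreteness, the horizon-guessing schedule used by Algorithm 2 in Section 6 can be sketched as follows (illustrative code, not the authors'; the initial guess T_0 = 2 is an assumption, as the constant is left unspecified above):

```python
def squaring_horizons(T, T0=2):
    """Epoch schedule from Algorithm 2: run the known-horizon algorithm with
    guess T_l, then square the guess (T_l = T_{l-1}^2), so that only
    O(log log T) epochs occur before the guess exceeds the true horizon T."""
    horizons = []
    t, guess = 0, T0
    while t < T:
        horizons.append(guess)  # here Algorithm 2 calls UCBConstSpace(K, guess)
        t += guess
        guess *= guess
    return horizons

assert squaring_horizons(10**6)[:5] == [2, 4, 16, 256, 65536]
```

Because log T_l doubles with each epoch, the log T_l term of the final epoch dominates the summed regret up to a constant factor, which is why the unknown-horizon bounds match the known-horizon ones.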

References

A. V. Aho, J. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Series in Computer Science and Information Processing, 1974.

J.-Y. Audibert, R. Munos, and C. Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876-1902, 2009.

P. Auer and R. Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55-65, 2010.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1), 2012.

L. X. Bui, R. Johari, and S. Mannor. Committing bandits. In Advances in Neural Information Processing Systems, 2011.

N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404-1422, 2012.

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2009.

E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory. Springer, 2002.

E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079-1105, 2006.

A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT, 2011.

A. Garivier, T. Lattimore, and E. Kaufmann. On explore-then-commit strategies. In Advances in Neural Information Processing Systems, 2016.

E. Hazan and C. Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.

K. G. Jamieson, M. Malloy, R. D. Nowak, and S. Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In COLT, volume 35, 2014.

Z. S. Karnin, T. Koren, and O. Somekh. Almost optimal exploration in multi-armed bandits. In ICML (3), 28:1238-1246, 2013.

E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 2015.

R. D. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, 2004.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

D. Liau, E. Price, Z. Song, and G. Yang. Stochastic multi-armed bandits in constant space. arXiv preprint, 2017.

H. Luo, A. Agarwal, N. Cesa-Bianchi, and J. Langford. Efficient second order online learning by sketching. In Advances in Neural Information Processing Systems, 2016.

O.-A. Maillard, R. Munos, G. Stoltz, et al. A finite-time analysis of multi-armed bandit problems with Kullback-Leibler divergences. In COLT, 2011.

S. Mannor and J. N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623-648, 2004.

Q. Zhang. Introduction. In Lecture Notes of Sublinear Algorithms for Big Data. qzhangcs/b669-3-fall-sublnear/sldes/space--dst.pdf, 2013.


More information

Further Optimal Regret Bounds for Thompson Sampling

Further Optimal Regret Bounds for Thompson Sampling Further Optmal Regret Bounds for hompson Samplng Shpra Agrawal Mcrosoft Research Inda Navn Goyal Mcrosoft Research Inda Abstract proven recently by Kaufmann et al. [5. hompson Samplng s one of the oldest

More information

Finding Primitive Roots Pseudo-Deterministically

Finding Primitive Roots Pseudo-Deterministically Electronc Colloquum on Computatonal Complexty, Report No 207 (205) Fndng Prmtve Roots Pseudo-Determnstcally Ofer Grossman December 22, 205 Abstract Pseudo-determnstc algorthms are randomzed search algorthms

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

P exp(tx) = 1 + t 2k M 2k. k N

P exp(tx) = 1 + t 2k M 2k. k N 1. Subgaussan tals Defnton. Say that a random varable X has a subgaussan dstrbuton wth scale factor σ< f P exp(tx) exp(σ 2 t 2 /2) for all real t. For example, f X s dstrbuted N(,σ 2 ) then t s subgaussan.

More information

Lecture 10: May 6, 2013

Lecture 10: May 6, 2013 TTIC/CMSC 31150 Mathematcal Toolkt Sprng 013 Madhur Tulsan Lecture 10: May 6, 013 Scrbe: Wenje Luo In today s lecture, we manly talked about random walk on graphs and ntroduce the concept of graph expander,

More information

Further Optimal Regret Bounds for Thompson Sampling

Further Optimal Regret Bounds for Thompson Sampling Further Optmal Regret Bounds for hompson Samplng Shpra Agrawal Mcrosoft Research Inda Navn Goyal Mcrosoft Research Inda Abstract hompson Samplng s one of the oldest heurstcs for mult-armed bandt problems.

More information

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora prnceton unv. F 13 cos 521: Advanced Algorthm Desgn Lecture 3: Large devatons bounds and applcatons Lecturer: Sanjeev Arora Scrbe: Today s topc s devaton bounds: what s the probablty that a random varable

More information

11 Tail Inequalities Markov s Inequality. Lecture 11: Tail Inequalities [Fa 13]

11 Tail Inequalities Markov s Inequality. Lecture 11: Tail Inequalities [Fa 13] Algorthms Lecture 11: Tal Inequaltes [Fa 13] If you hold a cat by the tal you learn thngs you cannot learn any other way. Mark Twan 11 Tal Inequaltes The smple recursve structure of skp lsts made t relatvely

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016 U.C. Berkeley CS94: Spectral Methods and Expanders Handout 8 Luca Trevsan February 7, 06 Lecture 8: Spectral Algorthms Wrap-up In whch we talk about even more generalzatons of Cheeger s nequaltes, and

More information

Complete subgraphs in multipartite graphs

Complete subgraphs in multipartite graphs Complete subgraphs n multpartte graphs FLORIAN PFENDER Unverstät Rostock, Insttut für Mathematk D-18057 Rostock, Germany Floran.Pfender@un-rostock.de Abstract Turán s Theorem states that every graph G

More information

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009 College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

NUMERICAL DIFFERENTIATION

NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

The Minimum Universal Cost Flow in an Infeasible Flow Network

The Minimum Universal Cost Flow in an Infeasible Flow Network Journal of Scences, Islamc Republc of Iran 17(2): 175-180 (2006) Unversty of Tehran, ISSN 1016-1104 http://jscencesutacr The Mnmum Unversal Cost Flow n an Infeasble Flow Network H Saleh Fathabad * M Bagheran

More information

Lecture 17 : Stochastic Processes II

Lecture 17 : Stochastic Processes II : Stochastc Processes II 1 Contnuous-tme stochastc process So far we have studed dscrete-tme stochastc processes. We studed the concept of Makov chans and martngales, tme seres analyss, and regresson analyss

More information

Online Appendix. t=1 (p t w)q t. Then the first order condition shows that

Online Appendix. t=1 (p t w)q t. Then the first order condition shows that Artcle forthcomng to ; manuscrpt no (Please, provde the manuscrpt number!) 1 Onlne Appendx Appendx E: Proofs Proof of Proposton 1 Frst we derve the equlbrum when the manufacturer does not vertcally ntegrate

More information

Generalized Linear Methods

Generalized Linear Methods Generalzed Lnear Methods 1 Introducton In the Ensemble Methods the general dea s that usng a combnaton of several weak learner one could make a better learner. More formally, assume that we have a set

More information

ECE559VV Project Report

ECE559VV Project Report ECE559VV Project Report (Supplementary Notes Loc Xuan Bu I. MAX SUM-RATE SCHEDULING: THE UPLINK CASE We have seen (n the presentaton that, for downlnk (broadcast channels, the strategy maxmzng the sum-rate

More information

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture # 15 Scribe: Jieming Mao April 1, 2013

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture # 15 Scribe: Jieming Mao April 1, 2013 COS 511: heoretcal Machne Learnng Lecturer: Rob Schapre Lecture # 15 Scrbe: Jemng Mao Aprl 1, 013 1 Bref revew 1.1 Learnng wth expert advce Last tme, we started to talk about learnng wth expert advce.

More information

On the Multicriteria Integer Network Flow Problem

On the Multicriteria Integer Network Flow Problem BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 5, No 2 Sofa 2005 On the Multcrtera Integer Network Flow Problem Vassl Vasslev, Marana Nkolova, Maryana Vassleva Insttute of

More information

Lecture 10 Support Vector Machines II

Lecture 10 Support Vector Machines II Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

Finding Dense Subgraphs in G(n, 1/2)

Finding Dense Subgraphs in G(n, 1/2) Fndng Dense Subgraphs n Gn, 1/ Atsh Das Sarma 1, Amt Deshpande, and Rav Kannan 1 Georga Insttute of Technology,atsh@cc.gatech.edu Mcrosoft Research-Bangalore,amtdesh,annan@mcrosoft.com Abstract. Fndng

More information

The Experts/Multiplicative Weights Algorithm and Applications

The Experts/Multiplicative Weights Algorithm and Applications Chapter 2 he Experts/Multplcatve Weghts Algorthm and Applcatons We turn to the problem of onlne learnng, and analyze a very powerful and versatle algorthm called the multplcatve weghts update algorthm.

More information

arxiv: v1 [cs.lg] 5 Jun 2017

arxiv: v1 [cs.lg] 5 Jun 2017 Sparse Stochastc Bandts arxv:706.0383v [cs.lg 5 Jun 207 Joon Kwon CMAP, École polytechnque, Unversté Pars Saclay joon.kwon@ens-lyon.org Vanney Perchet CMLA, École Normale Supéreure Pars Saclay & Crteo

More information

Channel Selection for Cognitive Radio Terminals

Channel Selection for Cognitive Radio Terminals Channel Selecton for Cogntve Rado Termnals Lng-Hung Kung; SUID: 04906103 1 Introducton Due to the excessve need of wreless spectrum and the neffcency n utlzng t, the technology of cogntve rado (CR) addresses

More information

Basically, if you have a dummy dependent variable you will be estimating a probability.

Basically, if you have a dummy dependent variable you will be estimating a probability. ECON 497: Lecture Notes 13 Page 1 of 1 Metropoltan State Unversty ECON 497: Research and Forecastng Lecture Notes 13 Dummy Dependent Varable Technques Studenmund Chapter 13 Bascally, f you have a dummy

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

1 Convex Optimization

1 Convex Optimization Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,

More information

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin Proceedngs of the 007 Wnter Smulaton Conference S G Henderson, B Bller, M-H Hseh, J Shortle, J D Tew, and R R Barton, eds LOW BIAS INTEGRATED PATH ESTIMATORS James M Calvn Department of Computer Scence

More information

= z 20 z n. (k 20) + 4 z k = 4

= z 20 z n. (k 20) + 4 z k = 4 Problem Set #7 solutons 7.2.. (a Fnd the coeffcent of z k n (z + z 5 + z 6 + z 7 + 5, k 20. We use the known seres expanson ( n+l ( z l l z n below: (z + z 5 + z 6 + z 7 + 5 (z 5 ( + z + z 2 + z + 5 5

More information

Computing Correlated Equilibria in Multi-Player Games

Computing Correlated Equilibria in Multi-Player Games Computng Correlated Equlbra n Mult-Player Games Chrstos H. Papadmtrou Presented by Zhanxang Huang December 7th, 2005 1 The Author Dr. Chrstos H. Papadmtrou CS professor at UC Berkley (taught at Harvard,

More information

Maximizing the number of nonnegative subsets

Maximizing the number of nonnegative subsets Maxmzng the number of nonnegatve subsets Noga Alon Hao Huang December 1, 213 Abstract Gven a set of n real numbers, f the sum of elements of every subset of sze larger than k s negatve, what s the maxmum

More information

Matrix Approximation via Sampling, Subspace Embedding. 1 Solving Linear Systems Using SVD

Matrix Approximation via Sampling, Subspace Embedding. 1 Solving Linear Systems Using SVD Matrx Approxmaton va Samplng, Subspace Embeddng Lecturer: Anup Rao Scrbe: Rashth Sharma, Peng Zhang 0/01/016 1 Solvng Lnear Systems Usng SVD Two applcatons of SVD have been covered so far. Today we loo

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 30 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 2 Remedes for multcollnearty Varous technques have

More information

Estimation: Part 2. Chapter GREG estimation

Estimation: Part 2. Chapter GREG estimation Chapter 9 Estmaton: Part 2 9. GREG estmaton In Chapter 8, we have seen that the regresson estmator s an effcent estmator when there s a lnear relatonshp between y and x. In ths chapter, we generalzed the

More information

E Tail Inequalities. E.1 Markov s Inequality. Non-Lecture E: Tail Inequalities

E Tail Inequalities. E.1 Markov s Inequality. Non-Lecture E: Tail Inequalities Algorthms Non-Lecture E: Tal Inequaltes If you hold a cat by the tal you learn thngs you cannot learn any other way. Mar Twan E Tal Inequaltes The smple recursve structure of sp lsts made t relatvely easy

More information

Anti-van der Waerden numbers of 3-term arithmetic progressions.

Anti-van der Waerden numbers of 3-term arithmetic progressions. Ant-van der Waerden numbers of 3-term arthmetc progressons. Zhanar Berkkyzy, Alex Schulte, and Mchael Young Aprl 24, 2016 Abstract The ant-van der Waerden number, denoted by aw([n], k), s the smallest

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals Smultaneous Optmzaton of Berth Allocaton, Quay Crane Assgnment and Quay Crane Schedulng Problems n Contaner Termnals Necat Aras, Yavuz Türkoğulları, Z. Caner Taşkın, Kuban Altınel Abstract In ths work,

More information

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix Lectures - Week 4 Matrx norms, Condtonng, Vector Spaces, Lnear Independence, Spannng sets and Bass, Null space and Range of a Matrx Matrx Norms Now we turn to assocatng a number to each matrx. We could

More information

A new Approach for Solving Linear Ordinary Differential Equations

A new Approach for Solving Linear Ordinary Differential Equations , ISSN 974-57X (Onlne), ISSN 974-5718 (Prnt), Vol. ; Issue No. 1; Year 14, Copyrght 13-14 by CESER PUBLICATIONS A new Approach for Solvng Lnear Ordnary Dfferental Equatons Fawz Abdelwahd Department of

More information

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran

More information

Online Algorithms for the Multi-Armed Bandit Problem with Markovian Rewards

Online Algorithms for the Multi-Armed Bandit Problem with Markovian Rewards Onlne Algorthms for the Mult-Armed Bandt Problem wth Markovan Rewards Cem Tekn, Mngyan Lu Abstract We consder the classcal mult-armed bandt problem wth Markovan rewards. When played an arm changes ts state

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

An Interactive Optimisation Tool for Allocation Problems

An Interactive Optimisation Tool for Allocation Problems An Interactve Optmsaton ool for Allocaton Problems Fredr Bonäs, Joam Westerlund and apo Westerlund Process Desgn Laboratory, Faculty of echnology, Åbo Aadem Unversty, uru 20500, Fnland hs paper presents

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

Lecture 4: November 17, Part 1 Single Buffer Management

Lecture 4: November 17, Part 1 Single Buffer Management Lecturer: Ad Rosén Algorthms for the anagement of Networs Fall 2003-2004 Lecture 4: November 7, 2003 Scrbe: Guy Grebla Part Sngle Buffer anagement In the prevous lecture we taled about the Combned Input

More information

Outline. Communication. Bellman Ford Algorithm. Bellman Ford Example. Bellman Ford Shortest Path [1]

Outline. Communication. Bellman Ford Algorithm. Bellman Ford Example. Bellman Ford Shortest Path [1] DYNAMIC SHORTEST PATH SEARCH AND SYNCHRONIZED TASK SWITCHING Jay Wagenpfel, Adran Trachte 2 Outlne Shortest Communcaton Path Searchng Bellmann Ford algorthm Algorthm for dynamc case Modfcatons to our algorthm

More information

On the size of quotient of two subsets of positive integers.

On the size of quotient of two subsets of positive integers. arxv:1706.04101v1 [math.nt] 13 Jun 2017 On the sze of quotent of two subsets of postve ntegers. Yur Shtenkov Abstract We obtan non-trval lower bound for the set A/A, where A s a subset of the nterval [1,

More information

Economics 101. Lecture 4 - Equilibrium and Efficiency

Economics 101. Lecture 4 - Equilibrium and Efficiency Economcs 0 Lecture 4 - Equlbrum and Effcency Intro As dscussed n the prevous lecture, we wll now move from an envronment where we looed at consumers mang decsons n solaton to analyzng economes full of

More information

VARIATION OF CONSTANT SUM CONSTRAINT FOR INTEGER MODEL WITH NON UNIFORM VARIABLES

VARIATION OF CONSTANT SUM CONSTRAINT FOR INTEGER MODEL WITH NON UNIFORM VARIABLES VARIATION OF CONSTANT SUM CONSTRAINT FOR INTEGER MODEL WITH NON UNIFORM VARIABLES BÂRZĂ, Slvu Faculty of Mathematcs-Informatcs Spru Haret Unversty barza_slvu@yahoo.com Abstract Ths paper wants to contnue

More information

Hidden Markov Models

Hidden Markov Models Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,

More information

Boostrapaggregating (Bagging)

Boostrapaggregating (Bagging) Boostrapaggregatng (Baggng) An ensemble meta-algorthm desgned to mprove the stablty and accuracy of machne learnng algorthms Can be used n both regresson and classfcaton Reduces varance and helps to avod

More information

Exercises of Chapter 2

Exercises of Chapter 2 Exercses of Chapter Chuang-Cheh Ln Department of Computer Scence and Informaton Engneerng, Natonal Chung Cheng Unversty, Mng-Hsung, Chay 61, Tawan. Exercse.6. Suppose that we ndependently roll two standard

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 31 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 6. Rdge regresson The OLSE s the best lnear unbased

More information

Remarks on the Properties of a Quasi-Fibonacci-like Polynomial Sequence

Remarks on the Properties of a Quasi-Fibonacci-like Polynomial Sequence Remarks on the Propertes of a Quas-Fbonacc-lke Polynomal Sequence Brce Merwne LIU Brooklyn Ilan Wenschelbaum Wesleyan Unversty Abstract Consder the Quas-Fbonacc-lke Polynomal Sequence gven by F 0 = 1,

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

Department of Quantitative Methods & Information Systems. Time Series and Their Components QMIS 320. Chapter 6

Department of Quantitative Methods & Information Systems. Time Series and Their Components QMIS 320. Chapter 6 Department of Quanttatve Methods & Informaton Systems Tme Seres and Ther Components QMIS 30 Chapter 6 Fall 00 Dr. Mohammad Zanal These sldes were modfed from ther orgnal source for educatonal purpose only.

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010

Parametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010 Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton

More information

MMA and GCMMA two methods for nonlinear optimization

MMA and GCMMA two methods for nonlinear optimization MMA and GCMMA two methods for nonlnear optmzaton Krster Svanberg Optmzaton and Systems Theory, KTH, Stockholm, Sweden. krlle@math.kth.se Ths note descrbes the algorthms used n the author s 2007 mplementatons

More information

Stanford University CS254: Computational Complexity Notes 7 Luca Trevisan January 29, Notes for Lecture 7

Stanford University CS254: Computational Complexity Notes 7 Luca Trevisan January 29, Notes for Lecture 7 Stanford Unversty CS54: Computatonal Complexty Notes 7 Luca Trevsan January 9, 014 Notes for Lecture 7 1 Approxmate Countng wt an N oracle We complete te proof of te followng result: Teorem 1 For every

More information

Edge Isoperimetric Inequalities

Edge Isoperimetric Inequalities November 7, 2005 Ross M. Rchardson Edge Isopermetrc Inequaltes 1 Four Questons Recall that n the last lecture we looked at the problem of sopermetrc nequaltes n the hypercube, Q n. Our noton of boundary

More information

arxiv:submit/ [cs.lg] 30 Aug 2011

arxiv:submit/ [cs.lg] 30 Aug 2011 No Internal Regret va Neghborhood Watch Dean Foster Department of Statstcs Unversty of Pennsylvana Alexander Rakhln Department of Statstcs Unversty of Pennsylvana arxv:submt/0308560 cs.lg 30 Aug 2011 August

More information

Min Cut, Fast Cut, Polynomial Identities

Min Cut, Fast Cut, Polynomial Identities Randomzed Algorthms, Summer 016 Mn Cut, Fast Cut, Polynomal Identtes Instructor: Thomas Kesselhem and Kurt Mehlhorn 1 Mn Cuts n Graphs Lecture (5 pages) Throughout ths secton, G = (V, E) s a mult-graph.

More information

A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS

A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS HCMC Unversty of Pedagogy Thong Nguyen Huu et al. A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS Thong Nguyen Huu and Hao Tran Van Department of mathematcs-nformaton,

More information

Excess Error, Approximation Error, and Estimation Error

Excess Error, Approximation Error, and Estimation Error E0 370 Statstcal Learnng Theory Lecture 10 Sep 15, 011 Excess Error, Approxaton Error, and Estaton Error Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton So far, we have consdered the fnte saple

More information

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law:

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law: CE304, Sprng 2004 Lecture 4 Introducton to Vapor/Lqud Equlbrum, part 2 Raoult s Law: The smplest model that allows us do VLE calculatons s obtaned when we assume that the vapor phase s an deal gas, and

More information

An (almost) unbiased estimator for the S-Gini index

An (almost) unbiased estimator for the S-Gini index An (almost unbased estmator for the S-Gn ndex Thomas Demuynck February 25, 2009 Abstract Ths note provdes an unbased estmator for the absolute S-Gn and an almost unbased estmator for the relatve S-Gn for

More information

Expected Value and Variance

Expected Value and Variance MATH 38 Expected Value and Varance Dr. Neal, WKU We now shall dscuss how to fnd the average and standard devaton of a random varable X. Expected Value Defnton. The expected value (or average value, or

More information

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS Avalable onlne at http://sck.org J. Math. Comput. Sc. 3 (3), No., 6-3 ISSN: 97-537 COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

More information

Grover s Algorithm + Quantum Zeno Effect + Vaidman

Grover s Algorithm + Quantum Zeno Effect + Vaidman Grover s Algorthm + Quantum Zeno Effect + Vadman CS 294-2 Bomb 10/12/04 Fall 2004 Lecture 11 Grover s algorthm Recall that Grover s algorthm for searchng over a space of sze wors as follows: consder the

More information

arxiv: v1 [cs.lg] 5 Nov 2018

arxiv: v1 [cs.lg] 5 Nov 2018 Mult-armed Bandts wth Compensaton Swe Wang IIIS, Tsnghua Unversty wangsw5@mals.tsnghua.edu.cn Longbo Huang IIIS, Tsnghua Unversty longbohuang@tsnghua.edu.cn arxv:8.075v [cs.lg] 5 ov 08 Abstract We propose

More information

Conservative Contextual Linear Bandits

Conservative Contextual Linear Bandits Conservatve Contextual Lnear Bandts Abbas Kazeroun 1, Mohammad Ghavamzadeh 2, Yasn Abbas-Yadkor 3, and Benjamn Van Roy 4 arxv:1611.06426v2 [stat.ml] 4 Mar 2017 1 Stanford Unversty, abbask@stanford.edu

More information