Multiarmed Bandits With Limited Expert Advice


Satyen Kale
Yahoo Labs, New York

Abstract

We consider the problem of minimizing regret in the setting of advice-efficient multiarmed bandits with expert advice. We give an algorithm for the setting of K arms and N experts, out of which we are allowed to query and use the advice of only M experts in each round, which has a regret bound¹ of Õ(√(min{K,M}·NT/M)) after T rounds. We also prove that any algorithm for this problem must have expected regret at least Ω̃(√(min{K,M}·NT/M)), thus showing that our upper bound is nearly tight. This solves the COLT 2013 open problem of Seldin et al. [7].

1 Introduction

In many real world applications one is faced with the problem of choosing one of several actions: for example, in healthcare, a choice of treatment; in financial domains, a choice of investment. Typically in such scenarios one may utilize the advice of several domain experts to make an informed choice. Once an action is chosen, one obtains feedback for the action in terms of some loss (or reward), but no feedback for other actions is obtained. This is repeated over several rounds. Repeated decision-making in this context is modeled by the well-studied multiarmed bandits with expert advice problem [4]. In this paper, we study an important practical consideration for this setting: frequently there are costs associated with obtaining useful advice, and budget constraints imply that only a few experts may be queried for advice. This constraint on the number of experts that can be queried in any round is modeled by the advice-efficient setting of the multiarmed bandits with expert advice problem, introduced by Seldin et al. [7]. In this setting, in each round t = 1, 2, ..., T, the learner is required to pull one arm A_t from some set A of K arms. Simultaneously, an adversary sets losses l_t(a) ∈ [0,1] for each arm a ∈ A, thus generating the loss vector l_t ∈ R^A. Assisting us in this task are N experts in the set H. Each expert h can provide advice² on which arm to pull in the form of a probability distribution ξ_t^h ∈ R^A on the set of arms. This advice gives the expert h an expected loss of ξ_t^h · l_t in round t.
The catch is that we can only observe the advice of at most M experts of our choosing in each round. The goal is to choose subsets of experts in each round to query the advice of, and using their advice

---
This work was done when the author was at the IBM T. J. Watson Research Center.
¹ Here, we use the Õ(·) and Ω̃(·) notation to suppress dependence on logarithmic factors in the problem parameters.
² No assumptions are made on how this advice is chosen by the experts in each round other than that it is independent of the losses of the arms set by the adversary in that round.

play some arm A_t ∈ A (probabilistically, if desired) to minimize the expected regret with respect to the loss of the best expert, where the regret is defined as:

    Regret := Σ_{t=1}^T l_t(A_t) − min_{h∈H} Σ_{t=1}^T ξ_t^h · l_t.

In this paper we give an algorithm whose expected regret is bounded by 2√(min{K,M}·(N/M)·T·log(N)) after T rounds, based on the Multiplicative Weights (MW) forecaster for prediction with expert advice [5]. We can improve this upper bound using the PolyINF forecaster of Audibert and Bubeck [2] to 4√(min{K,M}·(N/M)·T·log(8(N/M)·min{K,M})). This matches the regret of the best known algorithms for the special cases M = 1 and M = N, and interpolates between them for intermediate values of M. This solves the COLT 2013 open problem proposed by Seldin et al. [7], and in fact gives a better regret bound than the bound conjectured in [7], which was O(√(K·(N/M)·T·log(N))). Furthermore, we also show that any algorithm for the problem must incur expected regret of Ω(√((NT/M)·min{K, M/log(K)})) on some sequence of expert advice and arm losses, thus showing that our upper bound is nearly tight: the ratio between the upper and lower bounds is always bounded by O(√(log(K)·log(N))).

2 Preliminaries

For any event E, let I[E] be the indicator random variable set to 1 if E happens and 0 otherwise. In any round t of the algorithm, let Pr_t[·] and E_t[·] denote probability and expectation respectively, conditioned on all the randomness defined up to round t − 1. For two probability distributions P and Q defined on the same space, let KL(P‖Q) and d_TV(P, Q) denote the KL-divergence and total variation distance between the two distributions respectively. Let ‖·‖_p denote the p-norm for any p ≥ 1. Without loss of generality, we may assume that each expert suggests exactly one arm to play in any round; i.e. ξ_t^h(a) = 1 for exactly one arm a ∈ A and 0 for all other arms. Call such advice vectors pure. To see this, for every expert h we can randomly round a general advice vector ξ_t^h to a pure vector by sampling some arm a_t^h ~ ξ_t^h and constructing a new advice vector ξ̂_t^h by setting ξ̂_t^h(a_t^h) = 1 and ξ̂_t^h(a) = 0 for all a ≠ a_t^h.
Note that E[ξ̂_t^h] = ξ_t^h; thus for any expert h, following the randomly rounded advice ξ̂_t^h for t = 1, 2, ..., T has the same expected cost as following the advice ξ_t^h. Since this randomized rounding trick can be applied to the advice (algorithmically for the observed advice, and conceptually for the unobserved advice), in the rest of the paper we assume that all advice vectors are pure vectors; this helps us in getting a tighter bound on the regret.
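As a concrete illustration, the randomized rounding step above can be sketched in a few lines of Python. The function name and the dict-based representation of advice vectors are ours, not from the paper:

```python
import random

def round_advice(xi, rng=random):
    """Randomly round a general advice vector xi (a dict mapping arms to
    probabilities) to a pure advice vector: sample one arm from xi and put
    all the probability mass on it. Since the expectation of the rounded
    vector equals xi, following the rounded advice has the same expected
    loss as following xi."""
    arms = list(xi)
    sampled = rng.choices(arms, weights=[xi[a] for a in arms])[0]
    return {a: 1.0 if a == sampled else 0.0 for a in arms}
```

For example, with xi = {"a1": 0.25, "a2": 0.75}, repeated calls put all the mass on "a2" about 75% of the time, so the average of the rounded vectors converges to xi.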

Let a_t^h denote the action chosen by expert h at time t, so that the loss of the expert can be rewritten as ξ_t^h · l_t = l_t(a_t^h). For any time period t and any set U ⊆ H, define the active set of arms to be the set of all arms recommended by experts in U, i.e. A_t^U = {a ∈ A : ∃h ∈ U s.t. a_t^h = a}. Note that since we are allowed to query at most M experts in any round, if U_t is the queried set of experts in round t, then |A_t^{U_t}| ≤ min{K, M}; this leads to the min{K, M} factor in the regret bound. Define K̃ := min{K, M}, the effective number of arms. Throughout the paper we also assume that N ≥ 2 and K ≥ 2: in the remaining cases we trivially get 0 regret.

3 Algorithm

The algorithm, dubbed LEXP, works as follows. Assume for simplicity³ that M divides N, and in the beginning, partition the N experts into R := N/M groups of size M arbitrarily. Run an algorithm for prediction with expert advice (such as the Multiplicative Weights (MW) forecaster of Littlestone and Warmuth [5], or the PolyINF forecaster of Audibert and Bubeck [2]) on all the N experts. In each round, this base expert learning algorithm computes a distribution over the experts. Then LEXP samples an expert from this distribution, and chooses the group of experts it belongs to to query for advice, thus ensuring that at most M experts are queried in any round. It then plays the action recommended by the chosen expert, and observes its loss. It then constructs unbiased loss estimators for all experts using the observed loss and queried advice, and passes these to the base expert learning algorithm, which updates its distribution. The loss estimators are non-zero only for experts in the chosen group; thus they can be computed for all experts and the algorithm is well-defined. The pseudo-code follows.

4 Analysis

We first prove a number of utility lemmas. The first lemma shows that the loss estimators we construct are unbiased for all experts with positive probability in the distribution (and an underestimate in general):

Lemma 1 For all rounds t and all experts h, we have E_t[Y_t^h] ≤ l_t(a_t^h), with equality holding if q_t(h) > 0.⁴ Thus, E_t[q_t(h)Y_t^h] = q_t(h)l_t(a_t^h), and unconditionally, E[Y_t^h] ≤ l_t(a_t^h).

Proof: Let h ∈ B_i. For clarity, let a = a_t^h.
If Pr_t[i, a] > 0, then by the definition of the loss estimator in (1), we have

    E_t[Y_t^h] = E_t[l̂_t^i(a)] = E_t[ l_t(a) · I[I_t = i, A_t = a] / Pr_t[i, a] ] = l_t(a) · Pr_t[i, a] / Pr_t[i, a] = l_t(a).

³ The regret bounds only change by a small constant factor if M doesn't divide N.
⁴ It is easy to see that both the MW and PolyINF forecasters always have positive probability on all experts, so if we use one of these two expert learning algorithms, then all the inequalities in this lemma are actually equalities.

Algorithm 1 Multiarmed Bandits with Limited Expert Advice Algorithm (LEXP).
1: Partition the N experts into R = N/M groups of M experts each arbitrarily. Call the groups B_1, B_2, ..., B_R, and define ℛ := {1, 2, ..., R}.
2: Run an algorithm for prediction with expert advice (such as MW or PolyINF) on all the N experts.
3: for t = 1, 2, ..., T do
4:   Let q_t be the distribution over experts generated by the base expert learning algorithm. Sample an expert H_t ~ q_t, and set I_t to be the index of the group to which H_t belongs.
5:   Query the advice of all experts in B_{I_t}.
6:   Play A_t = a_t^{H_t}, and observe its loss l_t(A_t).
7:   For every group B_i and every arm a ∈ A, define the loss estimator

       l̂_t^i(a) := { l_t(a) · I[I_t = i, A_t = a] / Pr_t[i, a]   if Pr_t[i, a] > 0
                   { 0                                            otherwise,        (1)

     where Pr_t[i, a] = Σ_{h∈B_i} q_t(h)ξ_t^h(a) is the probability of the event {I_t = i, A_t = a}, conditioned on all the randomness up to round t − 1.
8:   For all experts h ∈ B_i, define the loss estimator Y_t^h := l̂_t^i(a_t^h), and pass them to the base expert learning algorithm.
9: end for

If Pr_t[i, a] = 0, then l̂_t^i(a) = 0, and so E_t[l̂_t^i(a)] = 0 ≤ l_t(a). Thus in either case, E_t[Y_t^h] ≤ l_t(a), and E_t[q_t(h)Y_t^h] = q_t(h)l_t(a_t^h). Finally, note that if q_t(h) > 0, then Pr_t[i, a] > 0, so equality holds.

The next lemma says that the algorithm's expected loss in each round is the same as that of the base expert learning algorithm:

Lemma 2 For all rounds t we have E[l_t(A_t)] = E[Σ_{h∈H} q_t(h)Y_t^h].

Proof: We have

    E_t[l_t(A_t)] = E_t[l_t(a_t^{H_t})] = Σ_{h∈H} q_t(h)l_t(a_t^h) = E_t[Σ_{h∈H} q_t(h)Y_t^h],

by Lemma 1. Taking expectation over all the randomness up to time t − 1, the proof is complete.

The next lemma gives a bound on the variance of the estimated losses. We state this in slightly more general terms than necessary to unify the analysis of the algorithms using the MW or PolyINF forecasters as the expert learning algorithm.

Lemma 3 Fix any α ∈ [1, 2]. For all rounds t we have E[Σ_{h∈H} (q_t(h))^α (Y_t^h)²] ≤ (RK̃)^{2−α}.

Proof: Let S_t := {(i, a) ∈ ℛ × A : Pr_t[i, a] > 0}

be the set of all (group index, action) pairs that have positive probability in round t. Since in round t the algorithm only plays arms in A_t^{B_{I_t}}, and for any group B_i the set of active arms in round t, A_t^{B_i}, has size at most K̃, we conclude that |S_t| ≤ RK̃. The pair (I_t, A_t) computed by the algorithm is in S_t. Conditioning on the value of (I_t, A_t), we can upper bound Σ_{h∈H} (q_t(h))^α (Y_t^h)² as follows:

    Σ_{h∈H} (q_t(h))^α (Y_t^h)²
      = Σ_{h∈B_{I_t}} (q_t(h))^α (l̂_t^{I_t}(a_t^h))²                                  (Y_t^h = 0 for all h ∉ B_{I_t})
      = Σ_{h∈B_{I_t}} (q_t(h))^α ξ_t^h(A_t) (l_t(A_t) / Pr_t[I_t, A_t])²              (l̂_t^{I_t}(a) = 0 for all a ≠ A_t)
      ≤ (Σ_{h∈B_{I_t}} q_t(h)ξ_t^h(A_t))^α / (Pr_t[I_t, A_t])²                        (ξ_t^h(A_t), l_t(A_t) ∈ [0,1], α ≥ 1)
      = (Pr_t[I_t, A_t])^{α−2},                                                        (2)

since Pr_t[I_t, A_t] = Σ_{h∈B_{I_t}} q_t(h)ξ_t^h(A_t). Next, we have

    E_t[Σ_{h∈H} (q_t(h))^α (Y_t^h)²] = E_t[ E_t[Σ_{h∈H} (q_t(h))^α (Y_t^h)² | (I_t, A_t)] ]
      ≤ Σ_{(i,a)∈S_t} Pr_t[i, a] · (Pr_t[i, a])^{α−2}                                 (By (2))
      = Σ_{(i,a)∈S_t} (Pr_t[i, a])^{α−1}
      ≤ |S_t|^{2−α} ≤ (RK̃)^{2−α}.

The penultimate inequality follows by applying Hölder's inequality to the pair of dual norms 1/(α−1) and 1/(2−α). Taking expectation over all the randomness up to time t − 1, the proof is complete.

4.1 Analysis using the MW forecaster

The MW forecaster for prediction with expert advice takes one parameter, η. It starts with q_1 being the uniform distribution over all experts, and for any t ≥ 1, constructs the distribution q_{t+1} using the following update rule:

    q_{t+1}(h) := q_t(h) exp(−ηY_t^h) / Z_t,

where Z_t is the normalization constant required to make q_{t+1} a distribution, i.e. Σ_{h∈H} q_{t+1}(h) = 1.
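In code, this update rule is essentially a one-liner; a minimal Python sketch (representation ours):

```python
import math

def mw_update(q, Y, eta):
    """Multiplicative Weights update: q_{t+1}(h) is proportional to
    q_t(h) * exp(-eta * Y[h]), renormalized to sum to 1.
    q and Y are dicts mapping experts to probabilities and estimated losses."""
    w = {h: q[h] * math.exp(-eta * Y[h]) for h in q}
    z = sum(w.values())          # normalization constant Z_t
    return {h: wh / z for h, wh in w.items()}
```

Experts with larger estimated losses lose weight exponentially fast, while experts outside the queried group (whose estimator is 0) keep their relative weight unchanged.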

Theorem 1 Set η = √(log(N)/(TRK̃)). Then the expected regret of the algorithm using the MW forecaster is bounded by 2√(K̃(N/M)T log(N)).

Proof: The MW forecaster guarantees (see [1]) that as long as Y_t^h ≥ 0 for all t, h, we have for any expert h*:

    Σ_{t=1}^T Σ_{h∈H} q_t(h)Y_t^h ≤ Σ_{t=1}^T Y_t^{h*} + (η/2) Σ_{t=1}^T Σ_{h∈H} q_t(h)(Y_t^h)² + log(N)/η.    (3)

Now, we have for any expert h*:

    Σ_{t=1}^T E[l_t(A_t)] = Σ_{t=1}^T E[Σ_{h∈H} q_t(h)Y_t^h]                                    (By Lemma 2)
      ≤ Σ_{t=1}^T E[Y_t^{h*}] + (η/2) Σ_{t=1}^T E[Σ_{h∈H} q_t(h)(Y_t^h)²] + log(N)/η           (By (3))
      ≤ Σ_{t=1}^T l_t(a_t^{h*}) + (η/2)RK̃T + log(N)/η                                          (By Lemma 1 and Lemma 3 with α = 1)
      ≤ Σ_{t=1}^T l_t(a_t^{h*}) + 2√(T log(N) RK̃),

using η = √(log(N)/(TRK̃)); and 2√(T log(N) RK̃) = 2√((N/M)K̃T log(N)).

4.2 Analysis using the PolyINF forecaster

The PolyINF forecaster for prediction with expert advice takes two parameters, η and c > 1. It starts with q_1 being the uniform distribution over all experts, and for any t ≥ 1, constructs the distribution q_{t+1} as follows:

    q_{t+1}(h) = 1 / [η(Σ_{τ=1}^t Y_τ^h + C_{t+1})]^c,

where C_{t+1} is a constant chosen so that q_{t+1} is a distribution, i.e. Σ_{h∈H} q_{t+1}(h) = 1.

Theorem 2 Set c = log(8RK̃) and η = 2^{1−2/c}[cT(RK̃)^{1−1/c}]^{−1/2}. Then the expected regret of the algorithm using the PolyINF forecaster is bounded by 4√(K̃(N/M)T log(8RK̃)).

Proof: Audibert et al. [3] prove that for the PolyINF forecaster, as long as Y_t^h ≥ 0 for all t, h, we have for any expert h*:

    Σ_{t=1}^T Σ_{h∈H} q_t(h)Y_t^h ≤ Σ_{t=1}^T Y_t^{h*} + (cη/2) Σ_{t=1}^T Σ_{h∈H} (q_t(h))^{1+1/c} (Y_t^h)² + cN^{1/c}/(η(c−1)).    (4)

Now, we have for any expert h*:

    Σ_{t=1}^T E[l_t(A_t)] = Σ_{t=1}^T E[Σ_{h∈H} q_t(h)Y_t^h]                                              (By Lemma 2)
      ≤ Σ_{t=1}^T E[Y_t^{h*}] + (cη/2) Σ_{t=1}^T E[Σ_{h∈H} (q_t(h))^{1+1/c} (Y_t^h)²] + cN^{1/c}/(η(c−1))  (By (4))
      ≤ Σ_{t=1}^T l_t(a_t^{h*}) + (cη/2)(RK̃)^{1−1/c}T + cN^{1/c}/(η(c−1))                                 (By Lemma 1 and Lemma 3 with α = 1 + 1/c)
      ≤ Σ_{t=1}^T l_t(a_t^{h*}) + 2√(cTRK̃)·(N/(RK̃))^{1/(2c)}                                             (Using η = 2^{1−2/c}[cT(RK̃)^{1−1/c}]^{−1/2})
      ≤ Σ_{t=1}^T l_t(a_t^{h*}) + 4√(K̃(N/M)T log(8RK̃)),                                                  (Using c ≥ 2)

using c = log(8NK̃/M) = log(8RK̃).

4.3 Extension to Changing Number of Queried Experts

The algorithm and its analysis extend easily to the situation where the number of experts queried is not fixed but can change from round to round. Specifically, at time t, the learner is told the number M_t of experts that can be queried in that round. In this setting, consider the following variant of the algorithm. In each round t, the experts are re-partitioned into N/M_t groups⁵ of size M_t. The rest of the algorithm stays the same: viz. an expert is chosen from the current probability distribution over the experts, and the group it belongs to is chosen for querying for expert advice. The updates to the distribution and the loss estimators are the same as in Algorithm 1. The analysis of Algorithm 1 relies on Lemmas 1, 2 and 3, all of which concern a specific round t, and the re-partitioning doesn't affect them. Thus, we easily obtain the following bound:

Theorem 3 In the setting where in each round the number of experts that can be queried in that round is specified, the extension of Algorithm 1 which re-partitions the experts in each round t into R_t := N/M_t groups of size M_t has the following regret bound. For every round t, let K̃_t = min{K, M_t}, and let t* = arg max_t K̃_t/M_t. If the MW forecaster is used with η = √(log(N)/(TNK̃_{t*}/M_{t*})), then the expected regret is bounded by 2√(K̃_{t*}(N/M_{t*})T log(N)). If the PolyINF forecaster is used with c = log(8NK̃_{t*}/M_{t*}) and η = 2^{1−2/c}[cT(R_{t*}K̃_{t*})^{1−1/c}]^{−1/2}, then the expected regret is bounded by 4√(K̃_{t*}(N/M_{t*})T log(8NK̃_{t*}/M_{t*})).

⁵ Again, here we assume M_t divides N for convenience.
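To summarize Section 3 and the estimator (1), a single round of LEXP can be sketched in Python as follows. The function name and the dict/list data representation are ours; pure advice is represented as a dict mapping each expert to its recommended arm:

```python
import random

def lexp_round(q, groups, advice, losses, rng=random):
    """One round of LEXP (sketch). q: dict expert -> probability under the
    base algorithm; groups: list of lists partitioning the experts; advice:
    dict expert -> recommended arm (pure advice); losses: dict arm -> loss
    in [0, 1] (only the played arm's loss is read, as in the bandit setting).
    Returns the played arm and the loss estimators Y: dict expert -> float."""
    experts = list(q)
    sampled = rng.choices(experts, weights=[q[h] for h in experts])[0]
    i = next(j for j, g in enumerate(groups) if sampled in g)  # queried group
    arm = advice[sampled]                                      # played arm A_t
    # Pr_t[i, arm] = total probability of experts in group i advising this arm
    pr = sum(q[h] for h in groups[i] if advice[h] == arm)
    lhat = losses[arm] / pr                # importance-weighted loss estimate
    # Y^h is nonzero only for queried experts whose advice equals the played arm
    return arm, {h: lhat if h in groups[i] and advice[h] == arm else 0.0
                 for h in q}
```

Averaged over the algorithm's randomness, Y[h] equals the true loss of h's recommended arm whenever q(h) > 0, which is exactly the content of Lemma 1.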

5 Lower Bound

In this section, we show a lower bound on the regret of any algorithm for the multiarmed bandits with limited expert advice setting, which shows that our upper bound is nearly tight. To describe the lower bound, consider the well-studied balls-into-bins process. Here M balls are tossed randomly into K bins. In each toss a bin is chosen uniformly at random from the K bins, independently of other tosses. Define the function f(K, M) to be the expected number of balls in the bin with the maximum number of balls. It is well-known (see, for example, [6]) that f(K, M) = O(max{log(K), M/K}). With this definition, we can prove the following lower bound. Note that this lower bound doesn't follow from a similar lower bound in [8], because in their setting the experts' losses can be all uncorrelated, whereas in our setting the experts' losses are necessarily correlated because there are only K arms.

Theorem 4 For any algorithm for the multiarmed bandits with limited expert advice setting, there is a sequence of expert advice and losses for each arm so that the expected regret of the algorithm is at least Ω(√(NT/f(K,M))) = Ω(√((NT/M) min{K, M/log(K)})).

Proof: The lower bound is based on standard information-theoretic arguments (see, e.g. [4]). Let B(p) be the Bernoulli distribution with parameter p, i.e. 1 is chosen with probability p and 0 with probability 1 − p. In the following, we assume the online algorithm is deterministic: the extension to randomized algorithms is easy by conditioning on the random seed of the algorithm, since the sequence of advice and losses we construct does not depend on the algorithm. Fix the parameter ε := (1/16)√(N/(f(K,M)T)). The expert advice and the losses of the arms are generated randomly as follows. We define probability distributions over advice and losses, P_h, for all h ∈ H. Fix an h ∈ H, and define P_h as follows. In each round t, for all experts h' ∈ H, we set their advice to be a uniformly random arm in A. Recall that the arm chosen by expert h in round t is a_t^h. Conditioned on the choice of the arm a_t^h, the loss of arm a_t^h is chosen from B(1/2 − ε), and the loss of all arms a ≠ a_t^h from B(1/2), independently.
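The function f(K, M) is easy to estimate empirically; a quick Monte Carlo sketch (ours, for intuition only):

```python
import random

def max_load(K, M, trials=2000, seed=0):
    """Monte Carlo estimate of f(K, M): the expected number of balls in the
    fullest bin when M balls are tossed uniformly at random into K bins."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        bins = [0] * K
        for _ in range(M):
            bins[rng.randrange(K)] += 1
        total += max(bins)
    return total / trials
```

For instance, f(1, M) = M and f(K, 1) = 1 exactly, max_load(2, 2) is close to the exact value 3/2, and the estimate always dominates the average load M/K, in line with the asymptotics f(K, M) = O(max{log(K), M/K}).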
Unconditionally, the distribution of the loss of any arm a at any time t is B(p), where

    p = (1/K)(1/2 − ε) + ((K−1)/K)(1/2) = 1/2 − ε/K.

A similar calculation shows that for all experts h' ≠ h, the distribution of the loss of their chosen arm is B(p) and thus has expectation p, while the expected loss of the arm chosen by h is 1/2 − ε. Thus the best expert is h. Let E_h denote expectation under P_h. Consider another probability distribution P_0 over advice and losses: in all rounds, all experts choose their arms in A uniformly at random as before, and all arms have loss distributed as B(p). Let E_0 denote the expectation of random variables under P_0. Before round 1, we choose an expert h ∈ H uniformly at random, and advice and losses are then generated from P_h. In round t, let S_t denote the set of experts chosen by the algorithm to query. Lemma 4 below shows that if either of the events [h ∉ S_t] or [h ∈ S_t, A_t ≠ a_t^h] happens, the algorithm suffers an expected regret of at least ε/2. Define the random variables

    L_T^h = Σ_{t=1}^T I[h ∈ S_t]   and   N_T^h = Σ_{t=1}^T I[h ∈ S_t, A_t = a_t^h].

Then to get a lower bound on the expected regret, we need to upper bound E_h[N_T^h]. To do this, we use arguments based on the KL-divergence between the distributions P_h and P_0. Specifically, for all t, let

    H_t = (G_1, l_1(A_1)), (G_2, l_2(A_2)), ..., (G_t, l_t(A_t))

denote the history up to time t; here, G_τ = {(h', a_τ^{h'}) : h' ∈ S_τ} is the set of pairs of experts and their advice for the experts queried at time τ. For convenience, we define H_0 = {}, the empty set. Note that since the algorithm is assumed to be deterministic, N_T^h is a deterministic function of the history H_T. Thus, to upper bound E_h[N_T^h] we compute an upper bound on KL(P_0(H_T)‖P_h(H_T)). Lemma 5 below shows that

    KL(P_0(H_T)‖P_h(H_T)) ≤ 6ε² E_0[N_T^h] + (4ε²/K²) E_0[L_T^h].

Thus, by Pinsker's inequality, we get

    d_TV(P_0(H_T), P_h(H_T)) ≤ √((1/2) KL(P_0(H_T)‖P_h(H_T))) ≤ √(3ε² E_0[N_T^h] + (2ε²/K²) E_0[L_T^h]).

Since N_T^h ∈ [0, T], this implies that

    E_h[N_T^h] ≤ E_0[N_T^h] + T√(3ε² E_0[N_T^h] + (2ε²/K²) E_0[L_T^h]).

By Jensen's inequality applied to the concave square root function, we get

    (1/N) Σ_{h∈H} E_h[N_T^h] ≤ (1/N) Σ_{h∈H} E_0[N_T^h] + T√(3ε² (1/N) Σ_{h∈H} E_0[N_T^h] + (2ε²/K²) (1/N) Σ_{h∈H} E_0[L_T^h])
      ≤ f(K,M)T/N + T√(3ε² f(K,M)T/N + (2ε²/K²) MT/N)    (5)
      ≤ 3T/4 + 2εT√(f(K,M)T/N).    (6)

Inequality (5) follows from Lemma 6 below, using

    E_0[L_T^h] = Σ_{t=1}^T P_0[h ∈ S_t] = Σ_{t=1}^T E_0[|S_t|]/N ≤ MT/N   and
    E_0[N_T^h] = Σ_{t=1}^T P_0[h ∈ S_t, A_t = a_t^h] ≤ Σ_{t=1}^T E_0[f(K, |S_t|)]/N ≤ f(K,M)T/N.    (7)

To obtain inequality (6), we upper bound 2ε²M/K² by ε²f(K,M), because f(K,M) is at least the expected number of balls in each bin, which equals M/K, and so f(K,M) ≥ 2M/K² for K ≥ 2. As for the f(K,M)T/N term, we bound it using the fact that f(K,M) ≤ f(2,M) for K ≥ 2 (since f is clearly monotonically decreasing in the first argument and monotonically increasing in the second), f(2,M) ≤ 3M/4 for M ≥ 2, and M ≤ N.
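The per-round Bernoulli KL bounds that feed into the displayed bound on KL(P_0(H_T)‖P_h(H_T)) — namely KL(B(p)‖B(1/2)) ≤ 4ε²/K² and KL(B(p)‖B(1/2 − ε)) ≤ 6ε² for p = 1/2 − ε/K, established in Lemma 5 — can be sanity-checked numerically; a small sketch (function names ours):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence KL(B(p) || B(q)) between two Bernoulli distributions."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def check_kl_bounds(eps, K):
    """Check the two per-round KL bounds used in the lower bound, for the
    uniform mixture parameter p = 1/2 - eps/K."""
    p = 0.5 - eps / K
    return (kl_bernoulli(p, 0.5) <= 4 * eps**2 / K**2 and
            kl_bernoulli(p, 0.5 - eps) <= 6 * eps**2)
```

For small deviations δ, KL(B(1/2 − δ)‖B(1/2)) ≈ 2δ², so both bounds hold with room to spare for the small ε used in the proof.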

Now, taking expectation over the choice of the expert h, the expected regret of the algorithm is at least

    (ε/2)(T − (1/N) Σ_{h∈H} E_h[N_T^h]) ≥ (ε/2)(T − 3T/4 − T/8) = εT/16 = (1/256)√(NT/f(K,M)) = Ω(√((NT/M) min{K, M/log(K)})),

using the setting ε = (1/16)√(N/(f(K,M)T)) and the fact that f(K,M) = O(max{log(K), M/K}).

Lemma 4 Suppose h is the expert chosen in the beginning, and advice and losses are then generated from P_h. Then in any round t, if either of the events [h ∉ S_t] or [h ∈ S_t, A_t ≠ a_t^h] happens, the algorithm suffers an expected regret of at least ε/2.

Proof: First, recall that the expert h always incurs an expected loss of 1/2 − ε in each round. Now if h ∉ S_t, then the losses of the arms are independent of the advice of the experts in S_t, and hence their distribution conditioned on the advice of experts in S_t is B(p). Thus, the distribution of the loss of the chosen arm A_t is also B(p), which implies that the algorithm suffers an expected regret of p − (1/2 − ε) = ε(1 − 1/K) ≥ ε/2. If h ∈ S_t but A_t ≠ a_t^h, then the distribution of the loss of A_t, conditioned on the advice of the experts in S_t, is B(1/2). This implies that the algorithm suffers an expected regret of 1/2 − (1/2 − ε) = ε ≥ ε/2.

Lemma 5 We have

    KL(P_0(H_T)‖P_h(H_T)) ≤ 6ε² E_0[N_T^h] + (4ε²/K²) E_0[L_T^h].

Proof: We have

    KL(P_0(H_T)‖P_h(H_T))
      = Σ_{t=1}^T KL(P_0((G_t, l_t(A_t)) | H_{t−1}) ‖ P_h((G_t, l_t(A_t)) | H_{t−1}))    (8)
      = Σ_{t=1}^T [KL(P_0(l_t(A_t) | H_{t−1}, G_t) ‖ P_h(l_t(A_t) | H_{t−1}, G_t)) + KL(P_0(G_t | H_{t−1}) ‖ P_h(G_t | H_{t−1}))]    (9)
      = Σ_{t=1}^T KL(P_0(l_t(A_t) | H_{t−1}, G_t) ‖ P_h(l_t(A_t) | H_{t−1}, G_t))    (10)
      = Σ_{t=1}^T [P_0[h ∈ S_t, A_t = a_t^h] KL(B(p)‖B(1/2 − ε)) + P_0[h ∈ S_t, A_t ≠ a_t^h] KL(B(p)‖B(1/2)) + P_0[h ∉ S_t] KL(B(p)‖B(p))]    (11)

      ≤ Σ_{t=1}^T [P_0[h ∈ S_t, A_t = a_t^h] · 6ε² + P_0[h ∈ S_t, A_t ≠ a_t^h] · 4ε²/K²]    (12)
      ≤ 6ε² Σ_{t=1}^T P_0[h ∈ S_t, A_t = a_t^h] + (4ε²/K²) Σ_{t=1}^T P_0[h ∈ S_t]
      = 6ε² E_0[N_T^h] + (4ε²/K²) E_0[L_T^h].

Equalities (8) and (9) follow from the chain rule for relative entropy. Equality (10) follows because the distribution of G_t conditioned on H_{t−1} is identical in P_0 and P_h. Equality (11) follows because under P_0, the loss of the chosen arm always follows B(p); and under P_h, if h ∉ S_t, then the loss of the chosen arm follows B(p); if h ∈ S_t and A_t = a_t^h, then the loss of the chosen arm follows B(1/2 − ε); and if h ∈ S_t and A_t ≠ a_t^h, then the loss of the chosen arm follows B(1/2). Finally, inequality (12) follows using standard calculations for the KL-divergence between Bernoulli random variables.

Recall that f(K, M) is the expected number of balls in the bin with the maximum number of balls in an M-balls-into-K-bins process.

Lemma 6 For all t, we have P_0[h ∈ S_t] = E_0[|S_t|]/N and P_0[h ∈ S_t, A_t = a_t^h] ≤ E_0[f(K, |S_t|)]/N.

Proof: First, by the symmetry of the experts under P_0, we have

    P_0[h ∈ S_t] = (1/N) Σ_{h'∈H} E_0[I[h' ∈ S_t]] = E_0[|S_t|]/N.

Next, we have

    P_0[h ∈ S_t, A_t = a_t^h] = (1/N) Σ_{h'∈H} E_0[I[h' ∈ S_t, A_t = a_t^{h'}]] = (1/N) E_0[|{h' ∈ S_t : A_t = a_t^{h'}}|]
      ≤ (1/N) E_0[max_{a∈A} |{h' ∈ S_t : a = a_t^{h'}}|] = (1/N) E_0[ E_0[max_{a∈A} |{h' ∈ S_t : a = a_t^{h'}}| | S_t] ] = E_0[f(K, |S_t|)]/N.

The penultimate equality follows because, conditioned on the choice of S_t, the random variable max_{a∈A} |{h' ∈ S_t : a = a_t^{h'}}| is completely determined by the choice of the arms recommended by the experts h' ∈ S_t. Since these arms are chosen uniformly at random from A, independently for each expert h' ∈ S_t, we can think of the |S_t| experts in S_t as balls and the K arms in A as bins in a balls-into-bins process. Then the random variable of interest is exactly the number of balls in the bin with the maximum number of balls. The expectation of this random variable is f(K, |S_t|).

5.1 Extension to a Global Limit on Queries

In certain situations, a global limit on the number of queries made to experts over the entire run of the algorithm, rather than a per-round limit, is more natural. Then the analysis of Theorem 4 can be extended easily to give the following theorem (proved in Appendix A):

Theorem 5 In the setting of the multiarmed bandits with limited expert advice problem where there is a global limit of MT queries to experts over the T rounds, for any algorithm, there is a sequence of expert advice and losses for each arm so that the expected regret of the algorithm is at least Ω(√((NT/M) min{K, M/log(K)})).

This shows that, up to logarithmic factors, the optimal allocation of MT queries over the T rounds is the uniform allocation of M queries per round.

5.2 Extension to Changing Number of Queried Experts

The lower bound also extends to the setting of Section 4.3 where in each round t, the learner is told the number of experts that can be queried, M_t. The analysis is basically the same, with a few modifications to handle the changing number of experts to be queried. In Appendix A, we prove the following theorem:

Theorem 6 For any algorithm working in the setting where the algorithm is told the number M_t of experts that can be queried in each round t, there is a sequence of expert advice and losses for each arm so that the expected regret of the algorithm is at least Ω(√(Σ_{t=1}^T N/f(K, M_t))) = Ω(√(Σ_{t=1}^T (N/M_t) min{K, M_t/log(K)})).

6 Conclusions

In this paper, we presented near-optimal algorithms for the multiarmed bandits with limited expert advice problem, solving the COLT 2013 open problem of Seldin et al. [7]. The upper bound uses a novel grouping idea combined with a standard experts learning algorithm, whereas the lower bound uses an information-theoretic approach and a connection to the classic balls-into-bins problem to get a nearly-tight dependence on the problem parameters. The binning strategy might be useful in other contexts, such as settings where there may be a non-uniform cost associated with the advice of each expert. An interesting open question is to close the sub-logarithmic gap between the upper and lower bounds.

Acknowledgments

The author thanks Elad Hazan, Dean Foster, Rob Schapire, and Yevgeny Seldin for discussions on this problem.

References

[1] Sanjeev Arora, Elad Hazan, and Satyen Kale. The Multiplicative Weights Update Method: a Meta-Algorithm and Applications. Theory of Computing, 8(1):121–164, 2012.

[2] Jean-Yves Audibert and Sébastien Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11, 2010.

[3] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Minimax policies for combinatorial prediction games. Journal of Machine Learning Research - Proceedings Track, 19, 2011.

[4] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002.

[5] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Inf. Comput., 108(2):212–261, 1994.

[6] Martin Raab and Angelika Steger. Balls into Bins - A Simple and Tight Analysis. In RANDOM, 1998.

[7] Yevgeny Seldin, Koby Crammer, and Peter Bartlett. Open Problem: Adversarial Multiarmed Bandits with Limited Advice. In COLT, 2013.

[8] Yevgeny Seldin, Peter L. Bartlett, Koby Crammer, and Yasin Abbasi-Yadkori. Prediction with limited advice and multiarmed bandits with paid observations. In ICML, 2014.

A Proofs of Extensions to Lower Bounds

In this section, we provide the missing proofs for the extensions to the lower bounds on the regret.

A.1 Global Limit on Queries: Proof of Theorem 5

First, note that since f(K,M) = O(max{log(K), M/K}), we have that f(K,M) ≤ g(K,M) := c(log(K) + M/K) for some constant c. Note that g is linear in its second argument (as opposed to f), so it is easier to manipulate. We use the exact same construction of expert advice and losses as in the proof of Theorem 4, with the choice of ε = (1/16)√(N/(g(K,M)T)). The only change that needs to be made to the proof is in inequality (7), which now becomes

    E_0[N_T^h] ≤ Σ_{t=1}^T E_0[f(K, |S_t|)]/N ≤ Σ_{t=1}^T E_0[g(K, |S_t|)]/N = E_0[g(K, Σ_{t=1}^T |S_t|) + (T−1)c log(K)]/N·? ≤ g(K,M)T/N,

where the linearity of g in its second argument and the global budget Σ_{t=1}^T |S_t| ≤ MT are used. Since, as proved in the paragraph after inequality (7), we have f(K,m) ≤ 3m/4 for all m ≥ 2, we also have that

    E_0[N_T^h] ≤ Σ_{t=1}^T E_0[f(K, |S_t|)]/N ≤ 3T/4.

Using these two bounds, we can now derive the following analogue of inequality (6):

    (1/N) Σ_{h∈H} E_h[N_T^h] ≤ 3T/4 + 2εT√(g(K,M)T/N).

The rest of the analysis goes through just as before, and yields a regret lower bound of (1/256)√(NT/g(K,M)) = Ω(√((NT/M) min{K, M/log(K)})).

A.2 Changing Number of Queried Experts: Proof of Theorem 6

We use essentially the same construction as in the proof of Theorem 4, but with one important twist: the parameter ε_t controlling the loss of the best expert changes in each round. Specifically, in round t, we set

    ε_t = (1/16) · (N/f(K,M_t)) / √(Σ_{τ=1}^T N/f(K,M_τ)),

and the loss of the arm a_t^h chosen by expert h is chosen from B(1/2 − ε_t); the losses of all other arms a ≠ a_t^h are chosen from B(1/2), as before.

We now turn to the analysis. First, we note that since Lemma 4 gives a lower bound on the expected regret in specific rounds, summing over all rounds, we conclude that the expected regret of the algorithm is at least

    Σ_{t=1}^T (ε_t/2)(1 − I[h ∈ S_t, A_t = a_t^h]).

Thus, define the random variable

    G_T^h = Σ_{t=1}^T (ε_t/2) I[h ∈ S_t, A_t = a_t^h].

To get a lower bound on the regret, we need to upper bound this random variable. Next, because ε_t changes in different rounds, we need to consider slightly different random variables:

    L̃_T^h = Σ_{t=1}^T ε_t² I[h ∈ S_t]   and   Ñ_T^h = Σ_{t=1}^T ε_t² I[h ∈ S_t, A_t = a_t^h].

With this definition, the statement of Lemma 5 extends easily to the following:

    KL(P_0(H_T)‖P_h(H_T)) ≤ 6 E_0[Ñ_T^h] + (4/K²) E_0[L̃_T^h].

Define U = Σ_{t=1}^T ε_t/2. Continuing the analysis as in the proof of Theorem 4, we use Pinsker's inequality, the fact that G_T^h ∈ [0, U], and Jensen's inequality applied to the concave square root function to conclude that

    (1/N) Σ_{h∈H} E_h[G_T^h] ≤ (1/N) Σ_{h∈H} E_0[G_T^h] + U√(3 (1/N) Σ_{h∈H} E_0[Ñ_T^h] + (2/K²) (1/N) Σ_{h∈H} E_0[L̃_T^h])
      ≤ Σ_{t=1}^T (ε_t/2) f(K,M_t)/N + U√(3 Σ_{t=1}^T ε_t² f(K,M_t)/N + (2/K²) Σ_{t=1}^T ε_t² M_t/N)    (13)
      ≤ 3U/4 + 2U√(Σ_{t=1}^T ε_t² f(K,M_t)/N).    (14)

Inequality (13) follows from Lemma 6, using the following bounds:

    E_0[L̃_T^h] = Σ_{t=1}^T ε_t² P_0[h ∈ S_t] = Σ_{t=1}^T ε_t² E_0[|S_t|]/N ≤ Σ_{t=1}^T ε_t² M_t/N,
    E_0[Ñ_T^h] = Σ_{t=1}^T ε_t² P_0[h ∈ S_t, A_t = a_t^h] ≤ Σ_{t=1}^T ε_t² E_0[f(K, |S_t|)]/N ≤ Σ_{t=1}^T ε_t² f(K,M_t)/N,
    E_0[G_T^h] = Σ_{t=1}^T (ε_t/2) P_0[h ∈ S_t, A_t = a_t^h] ≤ Σ_{t=1}^T (ε_t/2) E_0[f(K, |S_t|)]/N ≤ Σ_{t=1}^T (ε_t/2) f(K,M_t)/N.

Inequality (14) follows from the bound f(K,M_t) ≥ 2M_t/K² for K ≥ 2, and the bound f(K,M_t) ≤ 3M_t/4 for M_t ≥ 2 together with M_t ≤ N. Finally, taking expectation over the choice of the expert h, the expected regret of the

algorithm is at least

    (1/N) Σ_{h∈H} (U − E_h[G_T^h]) ≥ U/4 − 2U√(Σ_{t=1}^T ε_t² f(K,M_t)/N) ≥ U/8 = (1/256)√(Σ_{t=1}^T N/f(K,M_t)) = Ω(√(Σ_{t=1}^T (N/M_t) min{K, M_t/log(K)})),

using the definition of ε_t.
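To close, here is a compact, self-contained toy simulation of LEXP with the MW forecaster on a synthetic instance. All names, the loss model, and the instance are ours, for illustration only; it is a sketch of the algorithm of Section 3, not the paper's experimental setup (the paper has none):

```python
import math
import random

def simulate_lexp(K=3, N=8, M=4, T=3000, seed=0):
    """Toy simulation of LEXP with the MW forecaster (sketch). Experts
    recommend uniformly random arms, except expert 0, which always
    recommends arm 0; arm 0 has the lowest mean loss, so expert 0 is the
    best expert. Returns the realized regret against expert 0."""
    rng = random.Random(seed)
    groups = [list(range(i, i + M)) for i in range(0, N, M)]
    eta = math.sqrt(math.log(N) / (T * (N / M) * min(K, M)))
    q = [1.0 / N] * N
    alg_loss = best_loss = 0.0
    means = [0.25] + [0.5] * (K - 1)           # arm 0 is the good arm
    for _ in range(T):
        advice = [0] + [rng.randrange(K) for _ in range(N - 1)]
        losses = [1.0 if rng.random() < means[a] else 0.0 for a in range(K)]
        h = rng.choices(range(N), weights=q)[0]
        i = h // M                              # queried group index
        arm = advice[h]                         # played arm A_t
        alg_loss += losses[arm]
        best_loss += losses[advice[0]]
        # importance-weighted estimator, nonzero only in the queried group
        pr = sum(q[e] for e in groups[i] if advice[e] == arm)
        lhat = losses[arm] / pr
        # MW update: experts with estimator 0 keep their weight
        for e in groups[i]:
            if advice[e] == arm:
                q[e] *= math.exp(-eta * lhat)
        z = sum(q)
        q = [w / z for w in q]
    return alg_loss - best_loss
```

The realized regret should be on the order of the Theorem 1 bound 2√(K̃(N/M)T log N) (roughly 400 for these parameters), far below the trivial linear-in-T regret of a non-learning strategy.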

1 Review of Zero-Sum Games

1 Review of Zero-Sum Games COS 5: heoreical Machine Learning Lecurer: Rob Schapire Lecure #23 Scribe: Eugene Brevdo April 30, 2008 Review of Zero-Sum Games Las ime we inroduced a mahemaical model for wo player zero-sum games. Any

More information

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles Diebold, Chaper 7 Francis X. Diebold, Elemens of Forecasing, 4h Ediion (Mason, Ohio: Cengage Learning, 006). Chaper 7. Characerizing Cycles Afer compleing his reading you should be able o: Define covariance

More information

20. Applications of the Genetic-Drift Model

20. Applications of the Genetic-Drift Model 0. Applicaions of he Geneic-Drif Model 1) Deermining he probabiliy of forming any paricular combinaion of genoypes in he nex generaion: Example: If he parenal allele frequencies are p 0 = 0.35 and q 0

More information

Online Convex Optimization Example And Follow-The-Leader

Online Convex Optimization Example And Follow-The-Leader CSE599s, Spring 2014, Online Learning Lecure 2-04/03/2014 Online Convex Opimizaion Example And Follow-The-Leader Lecurer: Brendan McMahan Scribe: Sephen Joe Jonany 1 Review of Online Convex Opimizaion

More information

T L. t=1. Proof of Lemma 1. Using the marginal cost accounting in Equation(4) and standard arguments. t )+Π RB. t )+K 1(Q RB

T L. t=1. Proof of Lemma 1. Using the marginal cost accounting in Equation(4) and standard arguments. t )+Π RB. t )+K 1(Q RB Elecronic Companion EC.1. Proofs of Technical Lemmas and Theorems LEMMA 1. Le C(RB) be he oal cos incurred by he RB policy. Then we have, T L E[C(RB)] 3 E[Z RB ]. (EC.1) Proof of Lemma 1. Using he marginal

More information

Approximation Algorithms for Unique Games via Orthogonal Separators

Approximation Algorithms for Unique Games via Orthogonal Separators Approximaion Algorihms for Unique Games via Orhogonal Separaors Lecure noes by Konsanin Makarychev. Lecure noes are based on he papers [CMM06a, CMM06b, LM4]. Unique Games In hese lecure noes, we define

More information

Notes on online convex optimization

Notes on online convex optimization Noes on online convex opimizaion Karl Sraos Online convex opimizaion (OCO) is a principled framework for online learning: OnlineConvexOpimizaion Inpu: convex se S, number of seps T For =, 2,..., T : Selec

More information

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H.

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H. ACE 56 Fall 005 Lecure 5: he Simple Linear Regression Model: Sampling Properies of he Leas Squares Esimaors by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Inference in he Simple

More information

Non-Stochastic Bandit Slate Problems

Non-Stochastic Bandit Slate Problems Non-Sochasic Bandi Slae Problems Sayen Kale Yahoo! Research Sana Clara, CA skale@yahoo-inccom Lev Reyzin Georgia Ins of echnology Alana, GA lreyzin@ccgaechedu Absrac Rober E Schapire Princeon Universiy

More information

Lecture 33: November 29

Lecture 33: November 29 36-705: Inermediae Saisics Fall 2017 Lecurer: Siva Balakrishnan Lecure 33: November 29 Today we will coninue discussing he boosrap, and hen ry o undersand why i works in a simple case. In he las lecure

More information

Online Learning with Partial Feedback. 1 Online Mirror Descent with Estimated Gradient

Online Learning with Partial Feedback. 1 Online Mirror Descent with Estimated Gradient Avance Course in Machine Learning Spring 2010 Online Learning wih Parial Feeback Hanous are joinly prepare by Shie Mannor an Shai Shalev-Shwarz In previous lecures we alke abou he general framework of

More information

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé Bias in Condiional and Uncondiional Fixed Effecs Logi Esimaion: a Correcion * Tom Coupé Economics Educaion and Research Consorium, Naional Universiy of Kyiv Mohyla Academy Address: Vul Voloska 10, 04070

More information

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8)

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8) I. Definiions and Problems A. Perfec Mulicollineariy Econ7 Applied Economerics Topic 7: Mulicollineariy (Sudenmund, Chaper 8) Definiion: Perfec mulicollineariy exiss in a following K-variable regression

More information

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H.

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H. ACE 56 Fall 5 Lecure 8: The Simple Linear Regression Model: R, Reporing he Resuls and Predicion by Professor Sco H. Irwin Required Readings: Griffihs, Hill and Judge. "Explaining Variaion in he Dependen

More information

Vehicle Arrival Models : Headway

Vehicle Arrival Models : Headway Chaper 12 Vehicle Arrival Models : Headway 12.1 Inroducion Modelling arrival of vehicle a secion of road is an imporan sep in raffic flow modelling. I has imporan applicaion in raffic flow simulaion where

More information

GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin ACE 56 Fall 005 Lecure 4: Simple Linear Regression Model: Specificaion and Esimaion by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Simple Regression: Economic and Saisical Model

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017

Two Popular Bayesian Estimators: Particle and Kalman Filters McGill COMP 765 Sept 14th, 2017 Recall: Bayes filters: Bel(x_t) = η P(z_t | x_t) ∫ P(x_t | u_t, x_{t-1}) Bel(x_{t-1}) dx_{t-1}, where z = observation, u =

Notes for Lecture 17-18

Notes for Lecture 17-18 U.C. Berkeley CS278: Compuaional Complexiy Handou N7-8 Professor Luca Trevisan April 3-8, 2008 Noes for Lecure 7-8 In hese wo lecures we prove he firs half of he PCP Theorem, he Amplificaion Lemma, up

More information

Topics in Machine Learning Theory

Topics in Machine Learning Theory Topics in Machine Learning Theory The Adversarial Muli-armed Bandi Problem, Inernal Regre, and Correlaed Equilibria Avrim Blum 10/8/14 Plan for oday Online game playing / combining exper advice bu: Wha

More information

Lecture Notes 2. The Hilbert Space Approach to Time Series

Lecture Notes 2. The Hilbert Space Approach to Time Series Time Series Seven N. Durlauf Universiy of Wisconsin. Basic ideas Lecure Noes. The Hilber Space Approach o Time Series The Hilber space framework provides a very powerful language for discussing he relaionship

More information

RL Lecture 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1

RL Lecture 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 RL Lecure 7: Eligibiliy Traces R. S. Suon and A. G. Baro: Reinforcemen Learning: An Inroducion 1 N-sep TD Predicion Idea: Look farher ino he fuure when you do TD backup (1, 2, 3,, n seps) R. S. Suon and

More information

Macroeconomic Theory Ph.D. Qualifying Examination Fall 2005 ANSWER EACH PART IN A SEPARATE BLUE BOOK. PART ONE: ANSWER IN BOOK 1 WEIGHT 1/3

Macroeconomic Theory Ph.D. Qualifying Examination Fall 2005 Comprehensive Examination UCLA Dept. of Economics You have 4 hours to complete the exam. There are three parts to the exam. Answer all parts. Each part has

An introduction to the theory of SDDP algorithm

An introduction to the theory of SDDP algorithm An inroducion o he heory of SDDP algorihm V. Leclère (ENPC) Augus 1, 2014 V. Leclère Inroducion o SDDP Augus 1, 2014 1 / 21 Inroducion Large scale sochasic problem are hard o solve. Two ways of aacking

More information

Random Walk with Anti-Correlated Steps

Random Walk with Anti-Correlated Steps Random Walk wih Ani-Correlaed Seps John Noga Dirk Wagner 2 Absrac We conjecure he expeced value of random walks wih ani-correlaed seps o be exacly. We suppor his conjecure wih 2 plausibiliy argumens and

More information

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS NA568 Mobile Robotics: Methods & Algorithms Today's Topic Quick review on (Linear) Kalman Filter Kalman Filtering for Non-Linear Systems Extended Kalman Filter (EKF)

Exponential Weighted Moving Average (EWMA) Chart Under The Assumption of Moderateness And Its 3 Control Limits

Exponential Weighted Moving Average (EWMA) Chart Under The Assumption of Moderateness And Its 3 Control Limits DOI: 0.545/mjis.07.5009 Exponenial Weighed Moving Average (EWMA) Char Under The Assumpion of Moderaeness And Is 3 Conrol Limis KALPESH S TAILOR Assisan Professor, Deparmen of Saisics, M. K. Bhavnagar Universiy,

More information

Online Learning with Queries

Online Learning with Queries Online Learning wih Queries Chao-Kai Chiang Chi-Jen Lu Absrac The online learning problem requires a player o ieraively choose an acion in an unknown and changing environmen. In he sandard seing of his

More information

Comparing Means: t-tests for One Sample & Two Related Samples

Comparing Means: t-tests for One Sample & Two Related Samples Comparing Means: -Tess for One Sample & Two Relaed Samples Using he z-tes: Assumpions -Tess for One Sample & Two Relaed Samples The z-es (of a sample mean agains a populaion mean) is based on he assumpion

More information

Christos Papadimitriou & Luca Trevisan November 22, 2016

Christos Papadimitriou & Luca Trevisan November 22, 2016 U.C. Bereley CS170: Algorihms Handou LN-11-22 Chrisos Papadimiriou & Luca Trevisan November 22, 2016 Sreaming algorihms In his lecure and he nex one we sudy memory-efficien algorihms ha process a sream

More information

Notes on Kalman Filtering

Notes on Kalman Filtering Noes on Kalman Filering Brian Borchers and Rick Aser November 7, Inroducion Daa Assimilaion is he problem of merging model predicions wih acual measuremens of a sysem o produce an opimal esimae of he curren

More information

SELBERG S CENTRAL LIMIT THEOREM ON THE CRITICAL LINE AND THE LERCH ZETA-FUNCTION. II

SELBERG S CENTRAL LIMIT THEOREM ON THE CRITICAL LINE AND THE LERCH ZETA-FUNCTION. II SELBERG S CENRAL LIMI HEOREM ON HE CRIICAL LINE AND HE LERCH ZEA-FUNCION. II ANDRIUS GRIGUIS Deparmen of Mahemaics Informaics Vilnius Universiy, Naugarduko 4 035 Vilnius, Lihuania rius.griguis@mif.vu.l

More information

Lecture 20: Riccati Equations and Least Squares Feedback Control

Lecture 20: Riccati Equations and Least Squares Feedback Control 34-5 LINEAR SYSTEMS Lecure : Riccai Equaions and Leas Squares Feedback Conrol 5.6.4 Sae Feedback via Riccai Equaions A recursive approach in generaing he marix-valued funcion W ( ) equaion for i for he

More information

Stability and Bifurcation in a Neural Network Model with Two Delays

Stability and Bifurcation in a Neural Network Model with Two Delays Inernaional Mahemaical Forum, Vol. 6, 11, no. 35, 175-1731 Sabiliy and Bifurcaion in a Neural Nework Model wih Two Delays GuangPing Hu and XiaoLing Li School of Mahemaics and Physics, Nanjing Universiy

More information

Games Against Nature

Games Against Nature Advanced Course in Machine Learning Spring 2010 Games Agains Naure Handous are joinly prepared by Shie Mannor and Shai Shalev-Shwarz In he previous lecures we alked abou expers in differen seups and analyzed

More information

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Kriging Models Predicing Arazine Concenraions in Surface Waer Draining Agriculural Waersheds Paul L. Mosquin, Jeremy Aldworh, Wenlin Chen Supplemenal Maerial Number

More information

Lecture 2-1 Kinematics in One Dimension Displacement, Velocity and Acceleration Everything in the world is moving. Nothing stays still.

Lecture 2-1 Kinematics in One Dimension Displacement, Velocity and Acceleration Everything in the world is moving. Nothing stays still. Lecure - Kinemaics in One Dimension Displacemen, Velociy and Acceleraion Everyhing in he world is moving. Nohing says sill. Moion occurs a all scales of he universe, saring from he moion of elecrons in

More information

Stochastic models and their distributions

Stochastic models and their distributions Sochasic models and heir disribuions Couning cusomers Suppose ha n cusomers arrive a a grocery a imes, say T 1,, T n, each of which akes any real number in he inerval (, ) equally likely The values T 1,,

More information

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 38, No. 2, May 2013, pp. 209 227 ISSN 0364-765X (prin) ISSN 1526-5471 (online) hp://dx.doi.org/10.1287/moor.1120.0562 2013 INFORMS On Boundedness of Q-Learning Ieraes

More information

Let us start with a two dimensional case. We consider a vector ( x,

Let us start with a two dimensional case. We consider a vector ( x, Roaion marices We consider now roaion marices in wo and hree dimensions. We sar wih wo dimensions since wo dimensions are easier han hree o undersand, and one dimension is a lile oo simple. However, our

More information

Online Learning, Regret Minimization, Minimax Optimality, and Correlated Equilibrium

Online Learning, Regret Minimization, Minimax Optimality, and Correlated Equilibrium Algorihm Online Learning, Regre Minimizaion, Minimax Opimaliy, and Correlaed Equilibrium High level Las ime we discussed noion of Nash equilibrium Saic concep: se of prob Disribuions (p,q, ) such ha nobody

More information

State-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter

State-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter Sae-Space Models Iniializaion, Esimaion and Smoohing of he Kalman Filer Iniializaion of he Kalman Filer The Kalman filer shows how o updae pas predicors and he corresponding predicion error variances when

More information

Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence

Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence Supplemen for Sochasic Convex Opimizaion: Faser Local Growh Implies Faser Global Convergence Yi Xu Qihang Lin ianbao Yang Proof of heorem heorem Suppose Assumpion holds and F (w) obeys he LGC (6) Given

More information

Expert Advice for Amateurs

Expert Advice for Amateurs Exper Advice for Amaeurs Ernes K. Lai Online Appendix - Exisence of Equilibria The analysis in his secion is performed under more general payoff funcions. Wihou aking an explici form, he payoffs of he

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

References are appeared in the last slide. Last update: (1393/08/19)

References are appeared in the last slide. Last update: (1393/08/19) SYSEM IDEIFICAIO Ali Karimpour Associae Professor Ferdowsi Universi of Mashhad References are appeared in he las slide. Las updae: 0..204 393/08/9 Lecure 5 lecure 5 Parameer Esimaion Mehods opics o be

More information

Guest Lectures for Dr. MacFarlane s EE3350 Part Deux

Guest Lectures for Dr. MacFarlane s EE3350 Part Deux Gues Lecures for Dr. MacFarlane s EE3350 Par Deux Michael Plane Mon., 08-30-2010 Wrie name in corner. Poin ou his is a review, so I will go faser. Remind hem o go lisen o online lecure abou geing an A

More information

Ensemble methods: Bagging and Boosting

Ensamble methods: Bagging and Boosting Lecure 21 Ensamble mehods: Bagging and Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Ensemble mehods Mixure of expers Muliple base models (classifiers, regressors), each covers a differen par

More information

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates (modified from Cox, 00, General Ecology Lab Manual, McGraw Hill) Last week we estimated population size through several methods. One assumption of all these

Final Spring 2007

Final Spring 2007 .615 Final Spring 7 Overview The purpose of he final exam is o calculae he MHD β limi in a high-bea oroidal okamak agains he dangerous n = 1 exernal ballooning-kink mode. Effecively, his corresponds o

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

4.1 Other Interpretations of Ridge Regression

4.1 Other Interpretations of Ridge Regression CHAPTER 4 FURTHER RIDGE THEORY 4. Oher Inerpreaions of Ridge Regression In his secion we will presen hree inerpreaions for he use of ridge regression. The firs one is analogous o Hoerl and Kennard reasoning

More information

Testing for a Single Factor Model in the Multivariate State Space Framework

Testing for a Single Factor Model in the Multivariate State Space Framework esing for a Single Facor Model in he Mulivariae Sae Space Framework Chen C.-Y. M. Chiba and M. Kobayashi Inernaional Graduae School of Social Sciences Yokohama Naional Universiy Japan Faculy of Economics

More information

Solutions from Chapter 9.1 and 9.2

Solutions from Chapter 9.1 and 9.2 Soluions from Chaper 9 and 92 Secion 9 Problem # This basically boils down o an exercise in he chain rule from calculus We are looking for soluions of he form: u( x) = f( k x c) where k x R 3 and k is

More information

Ensemble methods: Boosting

Ensamble methods: Boosting Lecure 21 Ensamble mehods: Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Schedule Final exam: April 18: 1:00-2:15pm, in-class Term projecs April 23 & April 25: a 1:00-2:30pm in CS seminar room

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Noes for EE7C Spring 018: Convex Opimizaion and Approximaion Insrucor: Moriz Hard Email: hard+ee7c@berkeley.edu Graduae Insrucor: Max Simchowiz Email: msimchow+ee7c@berkeley.edu Ocober 15, 018 3

More information

Boosting with Online Binary Learners for the Multiclass Bandit Problem

Boosting with Online Binary Learners for the Multiclass Bandit Problem Shang-Tse Chen School of Compuer Science, Georgia Insiue of Technology, Alana, GA Hsuan-Tien Lin Deparmen of Compuer Science and Informaion Engineering Naional Taiwan Universiy, Taipei, Taiwan Chi-Jen

More information

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED Maximum likelihood estimation is a best-fit statistical method for the estimation of the values of the parameters of a system, based on a set of observations of a random variable

IMPLICIT AND INVERSE FUNCTION THEOREMS PAUL SCHRIMPF 1 OCTOBER 25, 2013

IMPLICIT AND INVERSE FUNCTION THEOREMS PAUL SCHRIMPF 1 OCTOBER 25, 2013 IMPLICI AND INVERSE FUNCION HEOREMS PAUL SCHRIMPF 1 OCOBER 25, 213 UNIVERSIY OF BRIISH COLUMBIA ECONOMICS 526 We have exensively sudied how o solve sysems of linear equaions. We know how o check wheher

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecure Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/i2ml3e CHAPTER 2: SUPERVISED LEARNING Learning a Class

More information

4 Sequences of measurable functions

4 Sequences of measurable functions 4 Sequences of measurable funcions 1. Le (Ω, A, µ) be a measure space (complee, afer a possible applicaion of he compleion heorem). In his chaper we invesigae relaions beween various (nonequivalen) convergences

More information

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details!

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details! MAT 257, Handou 6: Ocober 7-2, 20. I. Assignmen. Finish reading Chaper 2 of Spiva, rereading earlier secions as necessary. handou and fill in some missing deails! II. Higher derivaives. Also, read his

More information

arxiv:math.fa/ v1 31 Oct 2004

arxiv:math.fa/ v1 31 Oct 2004 A sharp isoperimeric bound for convex bodies Ravi Monenegro arxiv:mah.fa/0408 v 3 Oc 2004 Absrac We consider he problem of lower bounding a generalized Minkowski measure of subses of a convex body wih

More information


Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality Marix Versions of Some Refinemens of he Arihmeic-Geomeric Mean Inequaliy Bao Qi Feng and Andrew Tonge Absrac. We esablish marix versions of refinemens due o Alzer ], Carwrigh and Field 4], and Mercer 5]

More information

W = C exp(∫ 3/(t − 4) dt) = C exp(3 ln(4 − t)) = C(4 − t)³.

Math Rahman Exam Review Solutions (1) Consider the IVP: (t − 4)y″ − 3y′ + 4y = 0; y(3) = 0, y′(3) =. (a) Please determine the longest interval for which the IVP is guaranteed to have a unique solution. Solution: The discontinuities

C_t/P_t = α + β R_t/P_t + u_t. C_t = α P_t + β R_t + v_t. C_t/R_t = α P_t/R_t + β + w_t

Exercise 7 (a) C_t/P_t = α + β R_t/P_t + u_t (b) C_t = α P_t + β R_t + v_t (c) C_t/R_t = α P_t/R_t + β + w_t Assumptions about the disturbances u_t, v_t, w_t: Classical assumptions on the disturbance of one of the equations, e.g. on (b): E(v_t v_s | P,

A Dynamic Model of Economic Fluctuations

A Dynamic Model of Economic Fluctuations CHAPTER 15 A Dynamic Model of Economic Flucuaions Modified for ECON 2204 by Bob Murphy 2016 Worh Publishers, all righs reserved IN THIS CHAPTER, OU WILL LEARN: how o incorporae dynamics ino he AD-AS model

More information

Prediction with Limited Advice and Multiarmed Bandits with Paid Observations

Prediction with Limited Advice and Multiarmed Bandits with Paid Observations Predicion wi Limied Advice and Muliarmed Bandis wi Paid Observaions Yevgeny Seldin Queensland Universiy of ecnology and UC Berkeley Peer Barle UC Berkeley and Queensland Universiy of ecnology Koby Crammer

More information

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II Roland Siegwar Margaria Chli Paul Furgale Marco Huer Marin Rufli Davide Scaramuzza ETH Maser Course: 151-0854-00L Auonomous Mobile Robos Localizaion II ACT and SEE For all do, (predicion updae / ACT),

More information

x_1(t), x_2(t), ..., x_n(t) is a basis for the solution space to this system, then the matrix having these solutions as columns,

Math 228- Fri Mar 24 5.6 Matrix exponentials and linear systems: The analogy between first order systems of linear differential equations (Chapter 5) and scalar linear differential equations (Chapter ) is much stronger

A New Perturbative Approach in Nonlinear Singularity Analysis

A New Perturbative Approach in Nonlinear Singularity Analysis Journal of Mahemaics and Saisics 7 (: 49-54, ISSN 549-644 Science Publicaions A New Perurbaive Approach in Nonlinear Singulariy Analysis Ta-Leung Yee Deparmen of Mahemaics and Informaion Technology, The

More information

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions Muli-Period Sochasic Models: Opimali of (s, S) Polic for -Convex Objecive Funcions Consider a seing similar o he N-sage newsvendor problem excep ha now here is a fixed re-ordering cos (> 0) for each (re-)order.

More information

Introduction D P. r = constant discount rate, g = Gordon Model (1962): constant dividend growth rate.

Introduction D P. r = constant discount rate, g = Gordon Model (1962): constant dividend growth rate. Inroducion Gordon Model (1962): D P = r g r = consan discoun rae, g = consan dividend growh rae. If raional expecaions of fuure discoun raes and dividend growh vary over ime, so should he D/P raio. Since

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

Some Ramsey results for the n-cube

Some Ramsey results for the n-cube Some Ramsey resuls for he n-cube Ron Graham Universiy of California, San Diego Jozsef Solymosi Universiy of Briish Columbia, Vancouver, Canada Absrac In his noe we esablish a Ramsey-ype resul for cerain

More information

1. An introduction to dynamic optimization -- Optimal Control and Dynamic Programming AGEC

1. An introduction to dynamic optimization -- Optimal Control and Dynamic Programming AGEC This documen was generaed a :45 PM 8/8/04 Copyrigh 04 Richard T. Woodward. An inroducion o dynamic opimizaion -- Opimal Conrol and Dynamic Programming AGEC 637-04 I. Overview of opimizaion Opimizaion is

More information

Presentation Overview

Presentation Overview Acion Refinemen in Reinforcemen Learning by Probabiliy Smoohing By Thomas G. Dieerich & Didac Busques Speaer: Kai Xu Presenaion Overview Bacground The Probabiliy Smoohing Mehod Experimenal Sudy of Acion

More information

Answers to QUIZ

Answers to QUIZ 18441 Answers o QUIZ 1 18441 1 Le P be he proporion of voers who will voe Yes Suppose he prior probabiliy disribuion of P is given by Pr(P < p) p for 0 < p < 1 You ake a poll by choosing nine voers a random,

More information

Hamilton-Jacobi Equation: Weak Solution We continue the study of the Hamilton-Jacobi equation:

Math 5 7 Fall 9 Lecture Oct. 4, 9 Hamilton-Jacobi Equation: Weak Solution We continue the study of the Hamilton-Jacobi equation: We have shown that u_t + H(Du) = 0 in R^n × (0, ∞); u = g on R^n × {t = 0}. In general we cannot

MATHEMATICS: ERDŐS AND ULAM PROC. N. A. S. of decomposition, properly speaking) contradicts the possibility of defining a countably additive real-valued

ON EQUATIONS WITH SETS AS UNKNOWNS BY PAUL ERDŐS AND S. ULAM DEPARTMENT OF MATHEMATICS, UNIVERSITY OF COLORADO, BOULDER Communicated May 27, 1968 We shall present here a number of results in set theory concerning

Sequential Importance Resampling (SIR) Particle Filter

Sequential Importance Resampling (SIR) Particle Filter Paricle Filers++ Pieer Abbeel UC Berkeley EECS Many slides adaped from Thrun, Burgard and Fox, Probabilisic Roboics 1. Algorihm paricle_filer( S -1, u, z ): 2. Sequenial Imporance Resampling (SIR) Paricle

More information

Lecture 4 Notes (Little s Theorem)

Lecture 4 Notes (Little s Theorem) Lecure 4 Noes (Lile s Theorem) This lecure concerns one of he mos imporan (and simples) heorems in Queuing Theory, Lile s Theorem. More informaion can be found in he course book, Bersekas & Gallagher,

More information

Matlab and Python programming: how to get started

Matlab and Python programming: how to get started Malab and Pyhon programming: how o ge sared Equipping readers he skills o wrie programs o explore complex sysems and discover ineresing paerns from big daa is one of he main goals of his book. In his chaper,

More information

Robust estimation based on the first- and third-moment restrictions of the power transformation model

Robust estimation based on the first- and third-moment restrictions of the power transformation model h Inernaional Congress on Modelling and Simulaion, Adelaide, Ausralia, 6 December 3 www.mssanz.org.au/modsim3 Robus esimaion based on he firs- and hird-momen resricions of he power ransformaion Nawaa,

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

Essential Microeconomics : OPTIMAL CONTROL 1. Consider the following class of optimization problems

Essential Microeconomics : OPTIMAL CONTROL 1. Consider the following class of optimization problems Essenial Microeconomics -- 6.5: OPIMAL CONROL Consider he following class of opimizaion problems Max{ U( k, x) + U+ ( k+ ) k+ k F( k, x)}. { x, k+ } = In he language of conrol heory, he vecor k is he vecor

More information

ACE 564 Spring Lecture 7. Extensions of The Multiple Regression Model: Dummy Independent Variables. by Professor Scott H.

ACE 564 Spring Lecture 7. Extensions of The Multiple Regression Model: Dummy Independent Variables. by Professor Scott H. ACE 564 Spring 2006 Lecure 7 Exensions of The Muliple Regression Model: Dumm Independen Variables b Professor Sco H. Irwin Readings: Griffihs, Hill and Judge. "Dumm Variables and Varing Coefficien Models

More information

PENALIZED LEAST SQUARES AND PENALIZED LIKELIHOOD

PENALIZED LEAST SQUARES AND PENALIZED LIKELIHOOD HAN XIAO 1. Penalized Least Squares Lasso solves the following optimization problem, β̂_lasso = arg min_{β ∈ R^{p+1}} (1/N) Σ_{i=1}^N (y_i − β_0 − Σ_{j=1}^p x_{ij} β_j)² + λ Σ_{j=1}^p |β_j| (1.1) for some λ ≥ 0.

Module 2 Fick's laws of diffusion

Module 2 Fick's laws of diffusion Fick's laws of diffusion and thin film solution Adolf Fick (1855) proposed: J ∝ dc/dx, i.e. J = −D (dc/dx), with J (mole/m²s) the flux, D (m²/s) the diffusion coefficient and c (mole/m³) the concentration of ions, atoms

Problem Set 5. Graduate Macro II, Spring 2017 The University of Notre Dame Professor Sims

Problem Set 5. Graduate Macro II, Spring 2017 The University of Notre Dame Professor Sims Problem Se 5 Graduae Macro II, Spring 2017 The Universiy of Nore Dame Professor Sims Insrucions: You may consul wih oher members of he class, bu please make sure o urn in your own work. Where applicable,

More information

Supplementary Material

Supplementary Material Dynamic Global Games of Regime Change: Learning, Mulipliciy and iming of Aacks Supplemenary Maerial George-Marios Angeleos MI and NBER Chrisian Hellwig UCLA Alessandro Pavan Norhwesern Universiy Ocober

More information

LECTURE 1: GENERALIZED RAY KNIGHT THEOREM FOR FINITE MARKOV CHAINS

LECTURE 1: GENERALIZED RAY KNIGHT THEOREM FOR FINITE MARKOV CHAINS LECTURE : GENERALIZED RAY KNIGHT THEOREM FOR FINITE MARKOV CHAINS We will work wih a coninuous ime reversible Markov chain X on a finie conneced sae space, wih generaor Lf(x = y q x,yf(y. (Recall ha q

More information

Empirical Process Theory

Empirical Process Theory Empirical Process heory 4.384 ime Series Analysis, Fall 27 Reciaion by Paul Schrimpf Supplemenary o lecures given by Anna Mikusheva Ocober 7, 28 Reciaion 7 Empirical Process heory Le x be a real-valued

More information

18 Biological models with discrete time

18 Biological models with discrete time 8 Biological models wih discree ime The mos imporan applicaions, however, may be pedagogical. The elegan body of mahemaical heory peraining o linear sysems (Fourier analysis, orhogonal funcions, and so

More information

INDEPENDENT SETS IN GRAPHS WITH GIVEN MINIMUM DEGREE

INDEPENDENT SETS IN GRAPHS WITH GIVEN MINIMUM DEGREE JAMES ALEXANDER, JONATHAN CUTLER, AND TIM MINK Abstract. The enumeration of independent sets in graphs with various restrictions has been a topic of much interest

Bias-Variance Error Bounds for Temporal Difference Updates

Bias-Variance Error Bounds for Temporal Difference Updates Bias-Variance Bounds for Temporal Difference Updaes Michael Kearns AT&T Labs mkearns@research.a.com Sainder Singh AT&T Labs baveja@research.a.com Absrac We give he firs rigorous upper bounds on he error

More information