Near Minimax Optimal Players for the Finite-Time 3-Expert Prediction Problem

Size: px

Start display at page:

Download "Near Minimax Optimal Players for the Finite-Time 3-Expert Prediction Problem"

Terence Edwards
5 years ago
Views:

1 Near Minimax Opimal Players for he Finie-Time 3-Exper Predicion Problem Yasin Abbasi-Yadkori Adobe Research Peer L. Barle UC Berkeley Vicor Gabillon Queensland Universiy of Technology Absrac We sudy minimax sraegies for he online predicion problem wih exper advice. I has been conjecured ha a simple adversary sraegy, called COMB, is near opimal in his game for any number of expers. Our resuls and new insighs make progress in his direcion by showing ha, up o a small addiive erm, COMB is minimax opimal in he finie-ime hree exper problem. In addiion, we provide for his seing a new near minimax opimal COMB-based learner. Prior o his work, in his problem, learners obaining he opimal muliplicaive consan in heir regre rae were known only when K = or K. We characerize, when K = 3, he regre of he game scaling as 8/(9)T ± log(t ) which gives for he firs ime he opimal consan in he leading ( T ) erm of he regre. Inroducion This paper sudies he online predicion problem wih exper advice. This is a fundamenal problem of machine learning ha has been sudied for decades, going back a leas o he work of Hannan [] (see [4] for a survey). As i sudies predicion under adversarial daa he designed algorihms are known o be robus and are commonly used as building blocks of more complicaed machine learning algorihms wih numerous applicaions. Thus, elucidaing he ye unknown opimal sraegies has he poenial o significanly improve he performance of hese higher level algorihms, in addiion o providing insigh ino a classic predicion problem. The problem is a repeaed wo-player zero-sum game beween an adversary and a learner. A each of he T rounds, he adversary decides he qualiy/gain of K expers advice, while simulaneously he learner decides o follow he advice of one of he expers. The objecive of he adversary is o maximize he regre of he learner, defined as he difference beween he oal gain of he learner and he oal gain of he bes fixed exper. Open Problems and our Main Resuls. Previously his game has been solved asympoically as boh T and K end o : asympoically he upper bound on he performance of he sae-of-hear Muliplicaive Weighs Algorihm (MWA) for he learner maches he opimal muliplicaive consan of he asympoic minimax opimal regre rae (T/) log K [3]. However, for finie K, his asympoic quaniy acually overesimaes he finie-ime value of he game. Moreover, Gravin e al. [] proved a maching lower bound (T/) log K on he regre of he classic version of MWA, addiionally showing ha he opimal learner does no belong an exended MWA family. Already, Cover [5] proved ha he value of he game is of order of T/() when K =, meaning ha he regre of a MWA learner is 47% larger ha he opimal learner in his case. Therefore he quesion of opimaliy remains open for non-asympoic K which are he ypical cases in applicaions. In sudying a relaed seing wih K = 3, where T is sampled from a geomeric disribuion wih parameer δ, Gravin e al. [9] conjecured ha, for any K, a simple adversary sraegy, called he COMB adversary, is asympoically opimal (T, or when δ ), and also excessively compeiive for finie-ime fixed T. The COMB sraegy sors he expers based on heir cumulaive gains and, wih probabiliy one half, assigns gain one o each exper in an odd posiion and gain zero 3s Conference on Neural Informaion Processing Sysems (NIPS 7), Long Beach, CA, USA.

2 o each exper in an even posiion. Wih probabiliy one half, he zeros and ones are swapped. The simpliciy and elegance of his sraegy, combined wih is almos opimal performance makes i very appealing and calls for a more exensive sudy of is properies. Our resuls and new insighs make progress in his direcion by showing ha, for any fixed T and up o small addiive erms, COMB is minimax opimal in he finie-ime hree exper problem. Addiionally and wih similar guaranees, we provide for his seing a new near minimax opimal COMB-based learner. For K = 3, he regre of a MWA learner is 39% larger han our new opimal learner. In his paper we also characerize, when K = 3, he regre of he game as 8/(9)T ± log(t ) which gives for he firs ime he opimal consan in he leading ( T ) erm of he regre. Noe ha he sae-of-he-ar non-asympoic lower bound in [5] on he value of his problem is non informaive as he lower bound for he case of K = 3 is a negaive quaniy. Relaed Works and Challenges. For he case of K = 3, Gravin e al. [9] proved he exac minimax opimaliy of a COMB-relaed adversary in he geomerical seing, i.e. where T is no fixed in advance bu raher sampled from a geomeric disribuion wih parameer δ. However he connecion beween he geomerical seing and he original finie-ime seing is no well undersood, even asympoically (possibly due o he large variance of geomeric disribuions wih small δ). Addressing his issue, in Secion 7 of [8], Gravin e al. formulae he Finie vs Geomeric Regre conjecure which saes ha he value of he game in he geomerical seing, V α, and he value of he game in he finie-ime seing, V T, verify V T = V α=/t. We resolve here he conjecure for K = 3. Analyzing he finie-ime exper problem raises new challenges compared o he geomeric seing. In he geomeric seing, a any ime (round) of he game, he expeced number of remaining rounds before he end of he game is consan (does no depend on he curren ime ). This simplifies he problem o he poin ha, when K = 3, here exiss an exacly minimax opimal adversary ha ignores he ime and he parameer δ. As noed in [9], and noiceable from solving exacly small insances of he game wih a compuer, in he finie-ime case, he exac opimal adversary seems o depend in a complex manner on ime and sae. I is herefore naural o compromise for a simpler adversary ha is opimal up o a small addiive error erm. Acually, based on he observaion of he resriced compuer-based soluions, he addiive error erm of COMB seems o vanish wih larger T. Tighly conrolling he errors made by COMB is a new challenge wih respec o [9], where he soluion o he opimaliy equaions led direcly o he exac opimal adversary. The exisence of such equaions in he geomeric seing crucially relies on he fac ha he value-o-go of a given policy in a given sae does no depend on he curren ime (because geomeric disribuions are memoryless). To conrol he errors in he finie-ime seing, our new approach solves he game by backward inducion showing he approximae greediness of COMB wih respec o iself (read Secion. for an overview of our new proof echniques and heir organizaion). We use a novel exchangeabiliy propery, new connecions o random walks and a close relaion ha we develop beween COMB and a TWIN-COMB sraegy. Addiional connecions wih new relaed opimal sraegies and random walks are used o compue he value of he game (Theorem ). We discuss in Secion 6 how our new echniques have more poenial o exend o an arbirary number of arms, han hose of [9]. Addiionally, we show how he approximae greediness of COMB wih respec o iself is key o proving ha a learner based direcly on he COMB adversary is iself quasi-minimax-opimal. This is he firs work o exend o he approximae case, approaches used o designed exacly opimal players in relaed works. In [] a probabiliy maching learner is proven opimal under he assumpion ha he adversary is limied o a fixed cumulaive loss for he bes exper. In [4] and [], he opimal learner relies on esimaing he value-o-go of he game hrough rollous of he opimal adversary s plays. The resuls in hese papers were limied o games where he opimal adversary was only playing canonical uni vecor while our resul holds for general gain vecors. Noe also ha a probabiliy maching learner is opimal in [9]. Noaion: Le [a : b] = {a, a +,..., b} wih a, b N, a b, and [a] = [ : a]. For a vecor w R n, n N, w = max k [n] w k. A vecor indexed by boh a ime and a specific elemen index k is w,k. An undiscouned Markov Decision Process (MDP) [3, 6] M is a 4-uple S, A, r, p. S is he sae space, A is he se of acions, r : S A R is he reward funcion, and he ransiion model p( s, a) gives he probabiliy disribuion over he nex sae when acion a is aken in sae s. A sae is denoed by s or s if i is aken a ime. An acion is denoed by a or a. [9] also provides an upper-bound ha is subopimal when K = 3 even afer opimizaion of is parameers.

3 The Game We consider a game, composed of T rounds, beween wo players, called a learner and an adversary. A each ime/round he learner chooses an index I [K] from a disribuion p on he K arms. Simulaneously, he adversary assigns a binary gain o each of he arms/expers, possibly a random from a disribuion Ȧ, and we denoe he vecor of hese gains by g {, } K. The adversary and he learner hen observe I and g. For simpliciy we use he noaion g [] = (g s ) s=,...,. The value of one realizaion of such a game is he cumulaive regre defined as R T = g g,i. = A sae s S = (N {}) K is a K-dimensional vecor such ha he k-h elemen is he cumulaive sum of gains deal by he adversary on arm k before he curren ime. Here he sae does no include bu is ypically denoed for a specific ime as s and compued as s = = g. This definiion is moivaed by he fac ha here exis minimax sraegies for boh players ha rely solely on he sae and ime informaion as opposed o he complee hisory of plays, g [] I []. In sae s, he se of leading expers, i.e., hose wih maximum cumulaive gain, is X(s) = {k [K] : s k = s }. We use o denoe he (possibly non-saionary) sraegy/policy used by he adversary, i.e., for any inpu sae s and ime i oupus he gain disribuion (s, ) played by he adversary a ime in sae s. Similarly we use p o denoe he sraegy of he learner. As he sae depends only on he adversary plays, we can sample a sae s a ime from. Given an adversary and a learner p, he expeced regre of he game, V T p,, is V T p, = E g[t ],I [T ] p [R T ]. The learner ries o minimize he expeced regre while he adversary ries o maximize i. The value of he game is he minimax value V T defined by V T = min max p = V T p, = max min V T p,. p In his work, we are ineresed in he search for opimal minimax sraegies, which are adversary sraegies such ha V T = min p V T p, and learner sraegies p, such ha V T = max V T p,.. Summary of our Approach o Obain he Near Greediness of COMB Mos of our maerial is new. Firs, Secion 3 recalls ha Gravin e al. [9] have shown ha he search for he opimal adversary can be resriced o he finie family of balanced sraegies (defined in he nex secion). When K = 3, he acion space of a balanced adversary is limied o seven sochasic acions (gain disribuions), denoed by Ḃ3 = {Ẇ, Ċ, V,,, {}, {3}} (see Secion 5. for heir descripion). The COMB adversary repeas he gain disribuion Ċ a each ime and in any sae. In Secion 4 we provide an explici formulaion of he problem as finding inside an MDP wih a specific reward funcion. Ineresingly, we observe ha anoher adversary, which we call TWIN- COMB and denoe by W, which repeas he disribuion Ẇ, has he same value as C (Secion 5.). To conrol he errors made by COMB, he proof uses a novel and inriguing exchangeabiliy propery (Secion 5.). This exchangeabiliy propery holds hanks o he surprising role played by he TWIN- COMB sraegy. For any disribuions Ȧ Ḃ3 here exiss a disribuion Ḋ, mixure of Ċ and Ẇ, such ha for almos all saes, playing Ȧ and hen Ḋ is he same as playing Ẇ and hen Ȧ in erms of he expeced reward and he probabiliies over he nex saes afer hese wo seps. Using Bellman operaors, his can be concisely wrien as: for any (value) funcion f : S R, in (almos) any sae s, we have ha [TȦ[TḊf]](s) = [TẆ[TȦf]](s). We solve he MDP wih a backward inducion in ime from = T. We show ha playing Ċ a ime is almos greedy wih respec o playing C in laer rounds >. The greedy error is defined as he difference of expeced reward beween always playing C and playing he bes (greedy) firs acion before playing COMB. Bounding how hese errors accumulae hrough he rounds relaes he value of COMB o he value of (Lemma 6). To illusrae he main ideas, le us firs make wo simplifying (bu unrealisic) assumpions a ime : COMB has been proven greedy w.r.. iself in rounds > and he exchangeabiliy holds in all saes. Then we would argue a ime ha by he exchangeabiliy propery, insead of opimizing he greedy 3

4 acion w.r.. COMB as maxȧ Ḃ 3 ȦĊ... Ċ, we can sudy he opimizer of maxȧ Ḃ 3 ẆȦĊ... Ċ. Then we use he inducion propery o conclude ha Ċ is he soluion of he previous opimizaion problem. Unforunaely, he exchangeabiliy propery does no hold in one specific sae denoed by s α. Wha saves us hough is ha we can direcly compue he error of greedificaion of any gain disribuion wih respec o COMB in s α and show ha i diminishes exponenially fas as T, he number of rounds remaining, increases (Lemma 7). This helps us o conrol how he errors accumulae during he inducion. From one given sae s s α a ime, firs, we use he exchangeabiliy propery once when rying o assess he qualiy of an acion Ȧ as a greedy acion w.r.. COMB. This leads us o consider he qualiy of playing Ȧ in possibly several new saes {s + } a ime + reached following TWIN-COMB in s. We use our exchangeabiliy propery repeaedly, saring from he sae s unil a subsequen sae reaches s α, say a ime α, where we can subsiue he exponenially decreasing greedy error compued a his ime α in s α. Here he subsequen saes are he saes reached afer having played TWIN-COMB repeiively saring from he sae s. If s α is never reached we use he fac ha COMB is an opimal acion everywhere else in he las round. The problem is hen o deermine a which ime α, saring from any sae a ime and following a TWIN-COMB sraegy, we hi s α for he firs ime. This is ranslaed ino a classical gambler s ruin problem, which concerns he hiing imes of a simple random walk (Secion 5.3). Similarly he value of he game is compued using he sudy of he expeced number of equalizaions of a simple random walk (Theorem 5.). 3 Solving for he Adversary Direcly In his secion, we recall he resuls from [9] ha, for arbirary K, permi us o direcly search for he minimax opimal adversary in he resriced se of balanced adversaries while ignoring he learner. Definiion. A gain disribuion Ȧ is balanced if here exiss a consan cȧ, he mean gain of Ȧ, such ha k [K], cȧ = E g Ȧ [g k ]. A balanced adversary uses exclusively balanced gain disribuions. Lemma (Claim 5 in [9]). There exiss a minimax opimal balanced adversary. Use B o denoe he se of all balanced sraegies and Ḃ o denoe he se of all balanced gain disribuions. Ineresingly, as demonsraed in [9], a balanced adversary inflics he same regre on every learner: If B, hen VT R : p, V T p, = VT. (See Lemma ) Therefore, given an adversary sraegy, we can define he value-o-go V (s) associaed wih from ime in sae s, V [ ] (s) = E s T + st E c(s,), s+ P (. s, (s, ), s = s). + s = Anoher reducion comes from he fac ha he se of balanced gain disribuions can be seen as a convex combinaion of a finie se of balanced disribuions [9, Claim and 3]. We call his limied se he aomic gain disribuions. Therefore he search for can be limied o his se. The se of convex combinaions of he m disribuions Ȧ,... Ȧ m is denoed by (Ȧ,... Ȧ m ). 4 Reformulaion as a Markovian Decision Problem In his secion we formulae, for arbirary K, he maximizaion problem over balanced adversaries as an undiscouned MDP problem S, A, r, p. The sae space S was defined in Secion and he acion space is he se of aomic balanced disribuions as discussed in Secion 3. The ransiion model is defined by p(. s, Ḋ), which is a probabiliy disribuion over saes given he curren sae s and a balanced disribuion over gains Ḋ. In his model, he ransiion dynamics are deerminisic and enirely conrolled by he adversary s acion choices. However, he adversary is forced o choose sochasic acions (balanced gain disribuions). The maximizaion problem can herefore also be hough of as designing a balanced random walk on saes so as o maximize a sum of rewards (ha are ye o be defined). Firs, we define PȦ he ransiion probabiliy operaor wih respec o a gain disribuion Ȧ. Given funcion f : S R, PȦ reurns [PȦf](s) = E[f(s ) s p(. s, Ȧ)] = E [f(s + g)]. g s,ȧ g is sampled in s according o Ȧ. Given Ȧ in s, he per-sep regre is denoed by rȧ(s) and defined as rȧ(s) = E s s,ȧ s s cȧ. 4

5 Given an adversary sraegy, saring in s a ime, he cumulaive per-sep regre is V (s) = T = E [ r (,) (s ) s + p(. s, (s, ), s = s) ]. The acion-value funcion of a (s, Ḋ) and is he expeced sum of rewards received by saring from s, aking acion Ḋ, and hen following : Q (s, Ḋ) = E [ T = r Ȧ (s ) Ȧ = Ḋ, s + p( s, Ȧ ), Ȧ + = (s +, + )]. The Bellman operaor of Ȧ, TȦ, is [TȦf](s) = rȧ(s) + [PȦf](s). wih [T (s,) V + ](s) = V (s). This per-sep regre, rȧ(s), depends on s and Ȧ and no on he ime sep. Removing he ime from he picure permis a simplified view of he problem ha leads o a naural formulaion of he exchangeabiliy propery ha is independen of he ime. Crucially, his decomposiion of he regre ino per-sep regres is such ha maximizing V (s) over adversaries is equivalen, for all ime and s, o maximizing over adversaries he original value of he game, he regre V (s) (Lemma ). Lemma. For any adversary sraegy and any sae s and ime, V (s) = V (s) + s. The proof of Lemma is in Secion 8. In he following, our focus will be on maximizing V (s) in any sae s. We now show some basic properies of he per-sep regre ha holds for an arbirary number of expers K and discuss heir implicaions. The proofs are in Secion 9. Lemma 3. Le Ȧ Ḃ, for all s,, we have r Ȧ(s). Furhermore if X(s) =, rȧ(s) =. Lemma 3 shows ha a sae s in which he reward is no zero conains a leas wo equal leading expers, X(s) >. Therefore he goal of maximizing he reward can be rephrased ino finding a policy ha visis he saes wih X(s) > as ofen as possible, while sill aking ino accoun ha he per-sep reward increases wih X(s). The se of saes wih X(s) > is called he reward wall. Lemma 4. In any sae s, wih X(s) =, for any balanced gain disribuion Ḋ such ha wih probabiliy one exacly one of he leading exper receives a gain of, rḋ(s) = maxȧ Ḃ r Ȧ(s). 5 The Case of K = 3 5. Noaions in he 3-Expers Case, he COMB and he TWIN-COMB Adversaries Firs we define he sae space in he 3-exper case. The expers are sored wih respec o heir cumulaive gains and are named in decreasing order, he leading exper, he middle exper and he lagging exper. As menioned in [9], in our search for he minimax opimal adversary, i is sufficien for any K o describe our sae only using d ij ha denoe he difference beween he cumulaive gains of consecuive sored expers i and j = i +. Here, i denoes he exper wih ih larges cumulaive gains, and hence d ij for all i < j. Therefore one noaion for a sae, ha will be used hroughou his secion, is s = (x, y) = (d, d 3 ). We disinguish four ypes of saes C, C, C 3, C 4 as deailed below in Figure. In he same figure, in he cener, he saes are represened on a d-grid. C 4 conains only he sae denoed s α = (, ). s C, d >, d 3 > s C, d =, d 3 > s C 3, d >, d 3 = s C 4, d =, d 3 = Reward Wall d d Aomic Ȧ Symbol cȧ {}{3} Ẇ / {}{3} Ċ / {3}{} V / {}{}{3} /3 {}{3}{3} /3 Figure : 4 ypes of saes (lef), heir locaion on he d grid of saes (cener) and 5 aomic Ȧ (righ) Concerning he acion space, he gain disribuions use brackes. The group of arms in he same bracke receive gains ogeher and each group receive gains wih equal probabiliy. For insance, {}{}{3} exclusively deals a gain o exper (leading exper) wih probabiliy /3, exper (middle exper) wih probabiliy /3, and exper 3 (lagging exper) wih probabiliy /3, whereas {}{3} means dealing a gain o exper alone wih probabiliy / and expers and 3 ogeher wih probabiliy /. As discussed in Secion 3, we are searching for a using mixures of aomic balanced disribuions. When K = 3 here are seven aomic disribuions, denoed by Ḃ3 = { V,,, Ċ, Ẇ, {}, {3}} and described in Figure (righ). Moreover, in Figure, we repor in deail in a able (lef) and 5

6 Disribuion of nex sae s s rċ(s) p( s, Ċ) d wih s = (x, y) C P (s = (x, y+)) = P (s = (x+, y )) = C / P (s = (x +, y)) = P (s = (x +, y )) = C 3 P (s = (x, y + )) = P (s = (x, y + )) = C 4 / P (s = (x, y + )) = P (s = (x +, y)) = 3 4 Figure : The per-sep regre and ransiion probabiliies of he gain disribuion Ċ d 3 / / an illusraion (righ) on he -D sae grid he properies of he COMB gain disribuion Ċ. The remaining aomic disribuions are similarly repored in he appendix in Figures 5 o 8. In he case of hree expers, he COMB disribuion is simply playing {}{3} in any sae. We use Ẇ o denoe he sraegy ha plays {}{3} in any sae and refer o i as he TWIN-COMB sraegy. The COMB and TWIN-COMB sraegies (as opposed o he disribuions) repea heir respecive gain disribuions in any sae and any ime. They are respecively denoed C, W. The Lemma 5 shows ha he COMB sraegy C, he TWIN-COMB sraegy W and herefore any mixure of boh, have he same expeced cumulaive per-sep regre. The proof is repored o Secion. Lemma 5. For all saes s a ime, we have V C 5. The Exchangeabiliy Propery (s) = V W Lemma 6. Le Ȧ Ḃ3, here exiss Ḋ (Ċ, Ẇ) such ha for any s s α, and for any f : S R, (s). [TȦ[TḊf]](s) = [TẆ[TȦf]](s). Proof. If Ȧ = Ẇ, Ȧ = {} or Ȧ = {3}, use Ḋ = Ẇ. If Ȧ = Ċ, use Lemma and. Case. Ȧ = V: V is equal o Ċ in C3 C 4 and if s p(. s, Ẇ) wih s C 3 hen s C 3 C 4. So when s C 3 we reuse he case Ȧ = Ċ above. When s C C, we consider wo cases. Case.. s (, ): We choose Ḋ = Ẇ which is {}{3}. If s p(. s, V) wih s C hen s C. Similarly, if s p(. s, V) wih s C hen s C C 3. Moreover Ḋ modifies similarly he coordinaes (d, d 3 ) of s C and s C 3. Therefore he effec in erms of ransiion probabiliy and reward of Ḋ is he same wheher i is done before or afer he acions chosen by V. If s p(. s, Ḋ) wih s C C hen s C C. Moreover V modifies similarly he coordinaes (d, d 3 ) of s C and s C. Therefore he effec in erms of he ransiion probabiliy of V is he same wheher i is done before or afer he acion Ḋ. In erms of reward, noice ha in he saes s C C, V has per-sep regre and using V does no make s leave or ener he reward wall. Case. s = (, ): We can chose Ḋ = Ẇ. One can check from he ables in Figures 7 and 8 ha exchangebiliy holds. Addiionally we provide an illusraion of he exchangeabiliy equaliy in he d-grid in Figure. The saring sae s = (, ), is graphically represened by. We show on he grid he effec of he gain disribuion V (in dashed red) followed (lef picure) or preceded (righ picure) by he gain disribuion Ḋ (in plain blue). The illusraion shows ha V Ḋ and Ḋ V lead o he same final saes ( ) wih equal probabiliies. The rewards are displayed on op of he picures. Their color corresponds o he acions, he probabiliies are in ialic, and he rewards are in roman. Case & 3. Ȧ = & Ȧ = : The proof is similar and is repored in Secion of he appendix. 6

7 5.3 Approximae Greediness of COMB, Minimax Players and Regre The greedy error of he gain disribuion Ḋ in sae s a ime is ɛḋ s, = max Ȧ Ḃ3 Q C (s, Ȧ) Q C (s, Ḋ). Le ɛḋ = max s S ɛḋ s, denoe he maximum greedy error of he gain disribuion Ḋ a ime. The COMB greedy error in s α is conrolled by he following lemma proved in Secion 3.. Missing proofs from his secion are in he appendix in Secion 3.. ( Lemma 7. For any [T ] and gain disribuion Ḋ {Ẇ, Ċ, V, }, ɛḋ s α, ) T. 6 The following proposiion shows how we can index he saes d 3 in he d-grid as a one dimensional line over which he TWIN COMB sraegy behaves very similarly o a simple random walk. Figure 3 (op) illusraes his random walk on he d-grid and he indexing scheme (he yellow sickers). Proposiion. Index a sae s = (x, y) by i s = x + y irrespecive of he ime. Then for any sae s s α, and s p( s, Ẇ) we have ha P (i s = i s ) = P (i s = i s +) =. Consider a random walk ha sars from sae s = s and is generaed by he TWIN-COMB sraegy, s + p(. s, Ẇ). Define he random variable T α,s = min{ N {} : s = s α }. This random variable is he number of seps of he random walk before hiing s α for he firs ime. Then, le P α (s, ) be he probabiliy ha s α is reached afer seps: P α (s, ) = P (T α,s = ). Lemma 8 conrols he COMB greedy error in s in relaion o P α (s, ). Lemma 9 derives a sae-independen upper-bound for P α (s, ). Lemma 8. For any ime [T ] and sae s, ɛċ s, P α (s, ) 6 = ( ) T d d d Figure 3: Numbering TWIN-COMB (op) & G random walks (boom) Proof. If s = s α, his is a direc applicaion of Lemma 7 as P α (s α, ) = for >. When s s α, he following proof is by inducion. Iniializaion: Le = T. A he las round only he las per-sep regre maers (for all saes s, Q C (s, Ḋ) = rḋ(s)). As s s α, s is such ha X(s) hen rḋ(s) = maxȧ Ḃ r Ȧ(s) because of Lemma 4 and Lemma 3. Therefore he saemen holds. Inducion: Le < T. We assume he saemen is rue a ime +. We disinguish wo cases. For all gain disribuions Ḋ Ḃ3, Q C (s, Ḋ) = [TḊ[TĖ V C + ]](s) = [TẆ[TḊ V C + ]](s) = [T Q C Ẇ (c) [TẆ max C Q + (., Ȧ)](s) Ȧ Ḃ3 (d) max C [TẆ Q + (., Ȧ)](s) Ȧ Ḃ3 = max Ȧ Ḃ3 (e) = max Ȧ Ḃ3 Q C Q C (s, Ȧ) (s, Ȧ) =+ = 6. + (., Ḋ)](s) [PẆP α (., ) ( ) T ](s) 6 =+ =+ 6 ( 6 ( ) T [PẆP α (., )](s) ) T [PẆP α (., )](s) ( ) T P α (s, ) 7

8 where in Ė is any disribuion in (Ċ, Ẇ) and his sep holds because of Lemma 5, holds because of he exchangeabiliy propery of Lemma 6, (c) is rue by inducion and monooniciy of Bellman operaor, in (d) he max operaors change from being specific o any nex sae s a ime + o being jus one max operaor ha has o choose a single opimal gain disribuion in sae s a ime, (e) holds by definiion as for any, (here he las equaliy holds because s s α ) [PẆP α (., )](s) = E s p(. s,ẇ)[p α (s, )] = E s p(. s,ẇ)[p (T α,s = )] = P α (s, + ). Lemma 9. For > and any s, P α (s, ). Proof. Using he connecion beween he TWIN-COMB sraegy and a simple random walk in Proposiion, a formula can be found for P α (s, ) from he classical Gambler s ruin problem, where one wans o know he probabiliy ha he gambler reaches ruin (here sae s α ) a any ime given an iniial capial in dollars (here i s as defined in Proposiion ). The gambler has an equal probabiliy o win or lose one dollar a each round and has no upper bound on( his capial during he game. Using [7] (Chaper XIV, Equaion 4.4) or [8] we have P α (s, ) = is ) +is, where he binomial coefficien is if and i s are no of he same pariy. The echnical Lemma 4 complees he proof. We now sae our main resul, connecing he value of he COMB adversary o he value of he game. Theorem. Le K = 3, he regre of COMB sraegies agains any learner p, min p V T p, C, saisfies min p V T p, C V T log (T + ). We also characerize he minimax regre of he game. Theorem. Le K = 3, for even T, we have ha ( ) T + T/ + V T T/ + 3 T log (T + ), ( ) T + T/ + 8T wih T/ + 3 T 9. In Figure 4 we inroduce a COMB-based learner ha is denoed by p C. Here a sae is represened by a vecor of 3 inegers. The hree arms/expers are ordered as () () (3), breaking ies arbirarily. We connec he value of he COMB-based learner o he value of he game. p,() (s) = V C + (s+e ()) V C (s) Theorem 3. Le K = 3, he regre of COMB-based p,() (s) = V C + (s+e ()) V C (s) learner agains any adversary, max V T p C,, saisfies p,(3) (s) = p,() (s) p,() (s) max V T p C, V T + 36 log (T + ). Figure 4: A COMB learner, p C Similarly o [] and [4], his sraegy can be efficienly compued using rollous/simulaions from he COMB adversary in order o esimae he value V C (s) of C in s a ime. 6 Discussion and Fuure Work The main objecive is o generalize our new proof echniques o higher dimensions. In our case, he MDP formulaion and all he resuls in Secion 4 already holds for general K. Ineresingly, Lemma 3 and 4 show ha he COMB disribuion is he balanced disribuion wih highes per-sep regre in all he saes s such ha X(s), for arbirary K. Then assuming an ideal exchangeabiliy propery ha gives maxȧ Ḃ ȦĊ... Ċ = maxȧ Ḃ ĊĊ... ĊȦ, a disribuion would be greedy w.r. he COMB sraegy a an early round of he game if i maximizes he per-sep regre a he las round of he game. The COMB policy specifically ends o visi almos exclusively saes X(s), saes where COMB iself is he maximizer of he per-sep regre (Lemma 3). This would give ha COMB is greedy w.r.. iself and herefore opimal. To obain his resul for larger K, we will need o exend he exchangeabiliy propery o higher K and herefore undersand how he COMB and TWIN-COMB families exend o higher dimensions. One could also borrow ideas from he link wih pde approaches made in [6]. 8

9 Acknowledgemens We graefully acknowledge he suppor of he NSF hrough gran IIS-6936 and of he Ausralian Research Council hrough an Ausralian Laureae Fellowship (FL8) and hrough he Ausralian Research Council Cenre of Excellence for Mahemaical and Saisical Froniers (ACEMS). We would like o hank Nae Eldredge for poining us o he resuls in [8] and Wouer Koolen for poining us a [9]! References [] Jacob Abernehy and Manfred K. Warmuh. Repeaed games agains budgeed adversaries. In Advances in Neural Informaion Processing Sysems (NIPS), pages 9,. [] Jacob Abernehy, Manfred K. Warmuh, and Joel Yellin. Opimal sraegies from random walks. In s Annual Conference on Learning Theory (COLT), pages , 8. [3] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Rober E. Schapire, and Manfred K. Warmuh. How o use exper advice. Journal of he ACM (JACM), 44(3):47 485, 997. [4] Nicolò Cesa-Bianchi and Gábor Lugosi. Predicion, learning, and games. Cambridge universiy press, 6. [5] Thomas M. Cover. Behavior of sequenial predicors of binary sequences. In 4h Prague Conference on Informaion Theory, Saisical Decision Funcions, Random Processes, pages 63 7, 965. [6] Nadeja Drenska. A pde approach o mixed sraegies predicion wih exper advice. hp:// (Exended absrac). [7] Willliam Feller. An Inroducion o Probabiliy Theory and is Applicaions, volume. John Wiley & Sons, 8. [8] Nick Gravin, Yuval Peres, and Balasubramanian Sivan. Towards opimal algorihms for predicion wih exper advice. In arxiv preprin arxiv:63.498, 4. [9] Nick Gravin, Yuval Peres, and Balasubramanian Sivan. Towards opimal algorihms for predicion wih exper advice. In Proceedings of he Tweny-Sevenh Annual ACM-SIAM Symposium on Discree Algorihms (SODA), pages , 6. [] Nick Gravin, Yuval Peres, and Balasubramanian Sivan. Tigh Lower Bounds for Muliplicaive Weighs Algorihmic Families. In 44h Inernaional Colloquium on Auomaa, Languages, and Programming (ICALP), volume 8, pages 48: 48:4, 7. [] Charles Miller Grinsead and James Laurie Snell. Inroducion o probabiliy. American Mahemaical Soc.,. [] James Hannan. Approximaion o bayes risk in repeaed play. Conribuions o he Theory of Games, 3:97 39, 957. [3] Ronald A. Howard. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA, 96. [4] Haipeng Luo and Rober E. Schapire. Towards minimax online learning wih unknown ime horizon. In Proceedings of The 3s Inernaional Conference on Machine Learning (ICML), pages 6 34, 4. [5] Francesco Orabona and Dávid Pál. Opimal non-asympoic lower bound on he minimax regre of learning wih exper advice. arxiv preprin arxiv:5.76, 5. [6] Marin L. Puerman. Markov Decision Processes. Wiley, New York, 994. [7] Panelimon Sanica. Good lower and upper bounds on binomial coefficiens. Journal of Inequaliies in Pure and Applied Mahemaics, (3):3,. 9

10 [8] Remco van der Hofsad and Michael Keane. An elemenary proof of he hiing ime heorem. The American Mahemaical Monhly, 5(8): , 8. [9] Vladimir Vovk. A game of predicion wih exper advice. Journal of Compuer and Sysem Sciences (JCSS), 56():53 73, 998.

11 We use e,... e K {, } K o denoe he K canonical basis uni vecors. 7 A Shor Proof on he Regre of Balanced Adversaries Lemma. A balanced adversary inflics he same regre on any learner: If B, V T R : p, V T p, = V T. Proof. Indeed he expeced regre for a balanced adversary, E[R T ] agains any learner, can be wrien as [ ] E [R T ] = E g e I g () p, g,i p, = = E s T + s T + = E s T + s T + = E s = = E (g ) I () g,i, p s E c (s,) (3) s where is because is balanced which means ha E g s [(g ) k ] = c (s,) for each k. 8 Equivalence Beween Maximizing he Value and Maximizing he Cumulaive Per-Sep Regre Proof of Lemma. In he following he expecaion over saes are all aken wih respec o he sraegy. V (s) + s = = = E r(s ) + s (4) s s, = E s s = ( = [ ] E s + s c (s,) + s (5) s + s E s + E s s + s s s = E s T + s s T + s = E s T + s s T + = = ) E c (s,) + s (6) s s = E c (s,) + s (7) s s E c (s,) (8) s s = V (s) (9) 9 Proofs of Basic Properies of he Per-Sep Regre Firs we noice ha he per-sep regre can be wrien as rȧ(s) = PȦ ( k : k X(s) and g k = ) cȧ. () where he noaion PȦ denoes he fac ha he gain vecor g is sampled from Ȧ. Proof of Lemma 3. Sep ) rȧ(s) : We have E s s s s,ȧ

12 s r (s) Disribuion of wih s = (x, y) nex sae s p( s, ) C P (s = (x+, y)) = P (s = (x, y )) = P (s = (x, y + )) = /3 C /3 P (s = (x+, y)) = /3, P (s = (x, y )) = /3 C 3 P (s = (x+, y)) = /3, P (s = (x, y +)) = /3 C 4 /3 P (s = (x +, y)) = /3 /3 /3 /3 3 /3 /3 /3 /3 /3 4 Figure 5: The per-sep regre and ransiion probabiliies of he gain disribuion by definiion as he maximum cumulaive gain canno increase by more han in one round. Moreover c s as he adversary only deals posiive gains. Therefore rȧ(s) = E s s,ȧ s s c s. Sep ) rȧ(s): We are following he argumen in [9]. We wrie PȦ ( k : k X(s) and g k = ) PȦ (g k = ) = E g Ȧ [g k s ] (c) = cȧ, where in k s is any arm in X(s), holds because as g {, } K, for all k [K], E g s,ȧ [g k ] = PȦ(g k = ), and (c) holds because, as Ȧ is balanced, for all k, cȧ = E g s [g k ]. Therefore rȧ(s) = PȦ ( k : k X(s) and g k = ) cȧ. Sep 3) rȧ(s) = if X(s) = : This resul can be proven by following he same seps as in he previous case bu now he inequaliy urns ino an equaliy as X(s) =. Proof of Lemma 4. We have exacly wo (leading) arms ha have equal maximal cumulaive gain a ime. The adversary is designing a balanced gain disribuion on he K arms. Le p denoe he probabiliy ha boh leading arms are allocaed a gain of and le p, p, p be similarly defined. Therefore p + p + p + p =. Also, he balanced propery forces ha he expeced gains of boh leading expers are equal: cȧ = p + p = p + p, which gives p = p. A balanced gain disribuion Ȧ possesses herefore he following per-sep regre a sae s (see Equaion ): rȧ(s) = PȦ ( k : k X(s) and g k = ) cȧ = p + p + p (p + p ) = p = p. Finding he balanced gain disribuion wih maximal per-sep regre, arg maxȧ Ḃ r Ȧ(s), means maximizing p = p subjec o he consrains p + p + p + p = which is also p + p + p =. This is solved by having p = and p = p =, which is a valid balanced disribuion ha saisfies PḊ( k : k X(s) and g k = = ) =. Gain Disribuions Illusraions In his secion we repor he dynamics of he MDP for all he balanced gain disribuions in Ḃ3 and all he saes. We also illusrae hem on he d-grid of saes as inroduced in Figure. We firs repor he dynamics of he MDP grouped by gain disribuion (Secion.) and hen grouped by sae class (Secion.).. Grouped by Gain Disribuions We deail in ables and illusrae on he d sae grid, he properies of he aomic gain disribuions V,,, Ẇ in Figures 5 o 8. Noe ha he able and he illusraion for Ċ is in Figure of he main paper.. Grouped by Sae Class In Figure 9, we illusrae for each sae class he effec of each acions in erms of he nex saed reached in he d-grid. For insance if we look a he sae in class C 4, we can see ha he acions, and 3 leads us o he sae (, ).

13 s r (s) Disribuion of wih s = (x, y) nex sae s p( s, ) C P (s = (x +, y )) = P (s = (x, y + )) = P (s = (x, y)) = /3 C /3 P (s = (x+, y )) = /3, P (s = (x, y +)) = /3 C 3 P (s = (x, y)) = /3, P (s = (x, y+)) = /3 C 4 /3 P (s = (x, y + )) = /3 /3 3 /3 /3 /3 /3 4 /3 /3 /3 Figure 6: The per-sep regre and ransiion probabiliies of he gain disribuion s rẇ(s) Disribuion of nex sae s p( s, Ẇ) wih s = (x, y) C P (s = (x +, y)) = P (s = (x, y)) = C / P (s = (x +, y)) = P (s = (x +, y )) = C 3 P (s = (x +, y)) = P (s = (x, y)) = C 4 / P (s = (x, y + )) = P (s = (x +, y)) = 4 / / 3 Figure 7: The per-sep regre and ransiion probabiliies of he gain disribuion Ẇ s r V (s) Disribuion of nex sae s p( s, V) wih s = (x, y) C P (s = (x, y + )) =, P (s = (x, y )) = C / P (s = (x, y + )) =, P (s = (x, y )) = C 3 P (s = (x, y+)) =, P (s = (x, y+)) = C 4 / P (s = (x, y + )) =, P (s = (x +, y)) = 3 4 / Figure 8: The per-sep regre and ransiion probabiliies of he gain disribuion V d d Figure 9: The acions illusraed for each sae class 3

14 Proofs of Secion 5. Given wo adversary gain disribuions Ȧ, Ȧ respecively a imes, +, a wo-sep regre in sae s is rȧȧ (s) = E s s,ȧ [rȧ (s )] + rȧ (s). Proof of Lemma 5. The proof is by inducion. If = T, V C (s) = rċ(s) = rẇ(s) = V W (s) for any sae s. Now, given a ime, we assume for all imes >, for all saes s a ime, we have V C (s) = V W (s). Case. A ime, if s C or s C 4 hen we have p(. s, Ċ) = p(. s, Ẇ) by Lemma and, looking a he gain disribuions ables, rċ(s) = rẇ(s). Therefore we have, using he inducion propery, V C (s) = rċ(s) + [PĊ V C + ](s) = r Ẇ(s) + [PẆ V W + ](s) = V W (s). Case. A ime, if s C or s C 3 hen we have, V C (s) = rċ(s) + [PĊ V C + ](s) = rċẇ(s) + [PĊ[PẆ V C + ]](s) = rẇċ(s) + [PẆ[PĊ V C + ]](s) = rẇ(s) + [PẆ V W + ](s) = V W (s), where holds by inducion, and follows from he exchangeabiliy beween PẆ and PĊ (proved in Lemma ) and also he fac ha rċẇ(s) = rẇċ(s) (proved in Lemma ). The exac saemens and proofs of hese wo lemmas are repored in he nex subsecion.. Lemma and Lemma We prove a firs exchangeabiliy resul beween he COMB gain disribuion Ċ and he TWIN-COMB gain disribuion Ẇ. Lemma. If s C or s C 4, p( s, Ẇ) = p( s, Ċ). If s C or s C 3, E p( s, Ẇ) = s s,ċ E p( s, Ċ). s s,ẇ Proof. If s C or s C 4 hen we have p(. s, Ċ) = p(. s, Ẇ). This can be seen by direcly reading he ables in Figures and 5 o 8. Le us now focus on he cases s C or s C 3. To prove ha he gains disribuions Ċ and Ẇ are inverible, we show ha he order does no maer for any pair of possible oucomes of he gains of each disribuion. Recall ha he oucomes are or 3 for Ẇ and or 3 for Ċ. To follow he upcoming reasonings one can use he illusraions of he effec of he acions in Figure 9 and he d-grid in Figure. In he following we will refer o he acion as he oucome of he gain disribuion Ẇ = {}{3} ha happens by definiion half of he ime and ha means ha he exper number, ie. he leading exper receives a gain of one while he middle and lagging expers receive zero. Similarly, for insance acion 3 refers o a one deal o boh he lagging and middle expers while he leading exper receives zero. Case of he exchangeabiliy of wih 3: The acion applied o a sae s C leads o a sae s C and similarly a sae s C 3 is lead o a sae s C 3. Therefore he effec of 3 is he same wheher i is done before or afer he acion. The acion 3 applied o a s C C 3 leads o a sae s C C 3. Moreover he acion has he same effec wheher i is in C or C 3 (incremening d by one and leaving d 3 he same). Therefore he effec of is he same wheher i is done before or afer he acion 3. Case of he exchangeabiliy of wih 3: Case. d > : The acion has he same effec wheher i is in C C 3. The same goes for acion 3. Moreover d > insures ha afer applying eiher or 3 we say in C C 3. The exchangeabiliy holds. 4

15 Case. d = : One can check from he ables ha he effecs of playing hen 3 or 3 hen eiher from a sae in C or C 3 are canceling ou and he final sae is he original sae in all cases. Case of he exchangeabiliy of wih : The acion applied o a s C leads o a sae s C similarly for a sae s C 3 leads o a sae s C 3. Therefore he effec of is he same wheher i is done before or afer he acion. is an acion ha has he same effec in any sae. Therefore he effec of is he same wheher i is done before or afer he acion. Case of he exchangeabiliy of 3 wih 3: The acion 3 applied o a sae s C leads o a sae s C or C. The acion 3 applied o a sae s C 3 leads o a sae s C 3 or C 4. Moreover he acion 3 has he same effec wheher i is in C or C and also has he same effec wheher i is in C 3 or C 4. Therefore he effec of 3 is he same wheher i is done before or afer he acion 3. The acion 3 applied o a sae s C C 3 leads o a sae s C C 3. Moreover he acion 3 has he same effec wheher i is in C or C 3. Therefore he effec of 3 is he same wheher i is done before or afer he acion 3. Lemma. For all sae s we have rċẇ(s) = rẇċ(s). Proof. Firs we have rċ(s) = rẇ(s) for all saes s. Then we have [ ] E [rẇ(s { X(s )] = E ) >} = P s s,ċ s s,ċ s s,ċ( X(s ) > ) () [ ] = P s s,ẇ( X(s { X(s ) > ) = E ) >} = E [rċ(s )], () s s,ẇ s s,ẇ where This is obvious in he case s C 4 C as saed in Lemma, p(. s, Ċ) = p(. s, Ẇ). If s = (x, y) C 3 C. Le he sae s = (x, y ) be generaed wih he COMB policy Ċ and a sae s = (x, y ) be generaed wih he TWIN-COMB policy Ẇ, we have ha P (x = x ) = P (x = x ) = / so P ( X(s ) > ) = P ( X(s ) > ) and hese probabiliies are eiher equal o (if d > ) or / (if d = ). Therefore rċẇ(s) = E s s,ċ [rẇ(s )] + rċ(s) = E s s,ẇ [rċ(s )] + rẇ(s) = rẇċ(s) Proof of he Exchangeabiliy Propery End of he Proof of Lemma 6. Case. Ȧ = : Case. s C : We can chose Ḋ = {}{}{3}{3}, he disribuion ha mixes wih equal probabiliies Ċ and Ẇ. One can check from he ables ha he exchangeabiliy holds. Case.. s (, ): One can check from he able in Figures 7 and 8 ha he exchangebiliy holds. We provide a visual illusraion of he exchangeabiliy equaliy below. 5

16 Case.. s = (, ): One can check from he able ha he exchangebiliy holds. Also we provide a visual illusraion of he exchangeabiliy equaliy below. Case. s C C 3 : We can chose Ḋ = {}{}{3}{3}, he disribuion ha mixes wih probabiliy half-half Ċ and Ẇ. Case.. d > : Following a very similar reasoning as in he Case. he resul hold. Case.. d = : Case... d 3 > : One can check from he able ha he exchangebiliy holds. Also we provide a visual illusraion of he exchangeabiliy equaliy below. Case... d 3 = : One can check from he able ha he exchangebiliy holds. Also we provide a visual illusraion of he exchangeabiliy equaliy below. Case 3. Ȧ = : This case uses a very similar srucure of argumens as in he Case. 6

17 3 Proofs for Conrolling he Accumulaion of he Greedy Errors 3. The COMB Greedy Error in s α, Proof of Lemma 7 Le he value of playing he COMB sraegy from he saes s A = (, ), s B = (, ), s C = (, ) a ime be V A = V C (s A ), V B = V C (s B ) and V C = V C (s C ). Lemma 3 relaes V A o V B. Lemma 3. For all [T ] we have, V B = V A ( 6 T ). Proof. From Table, for all [T ] we have, V A for all [T ] we have, V B V A = V A + V B + + and V T B V T A formula for he value of V B V A, we have for all [T ], V B = V B + + V C +, V B = + V A + + V C +. Therefore =. Solving his recurrence V A = 3 + ( 6 T. ) The greedy acion in s α w.r. playing COMB in laer rounds is arg C maxȧ Ḃ Q 3 (s α, Ȧ) which is equal o arg maxȧ {Ċ,, } Q C (s α, Ȧ) as Ẇ, Ċ, V are equivalen in s α and Q C (s α, {}) = Q C (s α, {3}) = V C + (s α) V C (s α ) = Q C (s α, Ċ). Moreover, from Figures, 5 and 6 we have Q C (s α, Ċ) = / + ( V + A + V +)/, B C Q (s α, ) = /3 + V +, A Combining hese equaliies wih Lemma 3 leads o Lemma Bounding All he Errors Lemma 4. For any wo inegers i n of he same pariy, i ( ) n n+i n. Proof of Lemma 4. Case. i n: we have ( ) n n i ( ) n n n n+i n Here uses he fac ha he cenral binomial coefficien ( n C Q (s α, ) = /3 + V + B. n+ n n n +. ) n is he larges of he binomial coefficien of he shape ( n m) wih m an ineger, and uses he upper bound in Theorem.6 in [7] which gives ( ) n n n n+ n+. ) is non-increasing for Case. i n: We now wan o show ha he mapping Φ from i o i ( n n+i i n. Indeed his would prove ha for all i n, Φ(i) Φ( n). As we have already proven in he Case above Φ( n) n his would prove he desired inequaliy for he Case also. To prove he non-increase, we sudy he raio of he values for i and i + (he expression is zero a i + ) is (i + ) ( n n+i i ( n n+i ) = + ) ( + ) i ( + ) n ( n+i n i )! ( )! )! = n n ( + ( n+i + )! ( n i n+ n + = ( + i ) n i n+i + ) n n + n + n + n + =. n Lemma 5. For [T ] and any s, T = P ( α(s, ) T 6 ) log(t +)+4 T +. Proof of Lemma 5. Case. s = s α : We have P α (s, ) = for = and P α (s, ) = for >. Therefore he bound holds in his case. 7

18 Case. s s α : We have P α (s, ) = which gives T = P ( α(s, ) T 6 ) T =+ P ( α(s, ) ) T 6. Le T = T log (T + ) so ha we have = We have ha + =+ where is using Equaion 3. Moreover we have =T + T T T +. (3) ( ) T ( ) T + =+ T T =T + 4 T +. ( ) T (T T ) T + log (T + ) T + log (T + ) T +, where is using he definiion of T above, and is because T + T log (T + ) T + and herefore T + = T + T + T + T +. Therefore we have ( ) T P α (s, ( ) T ) =+ = ( T+ ( ) T ( ) T ) = + =+ =T + ( log (T + ) + 4) T +, where holds by Lemma 9. The opimal Bellman operaor is denoed by T and is defined as [T f](s) = maxḋ Ḃ r Ḋ(s) + [PḊf](s). If we define he opimal cumulaive per-sep regre as V (s) = max B V (s), we have ha [T V + ](s) = V (s). Lemma 6. For any sae s and any ime, we have V (s) V C (s) + T = ɛċ. Proof of Lemma 6. The proof is by backward inducion on ime. Iniializaion = T : We use Lemma 8 and Lemma 5. Inducion: We wrie V (s) = [T V + ](s) = max rḋ(s) + [PḊ V + ](s) Ḋ Ḃ max rḋ(s) + [PḊ V C + ](s) + T ɛċ Ḋ Ḃ =+ 8

19 = max Q C (s, Ḋ) + Ḋ Ḃ = Q C (s, Ċ) + Q C (s, Ċ) + = V C (s) + = =+ =+ ɛċ, =+ ɛċ ɛċ + ɛċ s, ɛċ + ɛċ where holds by inducion and holds by definiion of he errors ɛċ s, and ɛċ. Proof of Theorem 3. We have, min V T p, p C = V C T (s α) = V C T (s α) (c) (d) V (s α ) V (s α ) = ɛċ ( log (T + ) + 4) T + = T V (s α ) ( log (T + ) + 4) T + log (T + ), where we use C is balanced, Lemma, (c) Lemma 6, and (d) Lemma 8 & 5. = 4 Proofs for he Minimax Regre of he Game Proof of Theorem. A new adversary, GREEDY-COMB, is denoed G. We prove ha he value of COMB and he value of GREEDY-COMB do no differ by more han an addiive numerical consan. As we have proved ha COMB is almos minimax opimal, hen GREEDY-COMB is also almos opimal. However, compuing he value of his new sraegy is easier using he classical resuls on number of passages o he origin in random walks. The GREEDY-COMB adversary akes he same acions as he TWIN-COMB in all saes excep in sae s α where, a all ime, he gain disribuion played by GREEDY-COMB is. Lemma 7 in he appendix proves ha he value of GREEDY-COMB is no differen from he value of TWIN-COMB by more han a small consan: V T G V T C /3. This is shown by using a backward inducion over ime, accumulaing he errors ha appear in sae s α (see Lemma 3), and noing ha hese errors decrease exponenially wih T. Saring from s α, he random walk followed by GREEDY-COMB is a simple random walk on he line d 3 = as illusraed in Figure 3(boom). Therefore he value of his adversary is V T G = 3 H G, where H G is he expeced number of imes he random walk of GREEDY-COMB his he reward wall. Indeed each ime he wall is hi he GREEDY-COMB earns /3 as is gain disribuion on he wall is. We noice ha compuing H G is equivalen o compuing he expeced number of equalizaions (passages by ) in a random walk ha sars wih a value and incremens is value by + wih probabiliy half and decremens by - wih probabiliy half. Therefore, following he classic resul in [, Example.3 in Secion ], we have H G = ( T + T/+) (T/ + ) T T. Finally we have, V T log (T + ) V T C V T G + /3, where holds by Theorem 3. Moreover we have V T = max B V T V T G. Lemma 7. For all T >, V T G V T C 3. 9

20 Proof of Lemma 7. We will acually prove ha for all s, and all T, V G (s) V C (s) T ( = ) T 6 which implies he claim of he lemma by looking a he special case =, s = sα and using Lemma. The proof is by backward inducion on ime. Iniializaion = T : The wo policies G and C only differ in sae s α where we use Lemma 7. Inducion: We wrie V G (s) = Q G (s, G (s)) Q C (s, G (s)) =+ where he las inequaliy is by inducion. We disinguish wo cases. Case. s s α : inducion. 6 ( Q C (s, G (s)) = Q C 6 ( ) T, (4) (s, C (s)) = V C (s) and he Equaion 4 gives he Case. s = s α : Using Lemma 7 we have Q C (s α, G (s α )) + 6 ) T C maxȧ Ḃ Q 3 (s α, Ȧ) Q C (s α, C (s α )) = V C (s α ). 5 Proofs for he COMB-based Learner ( ) T = Q C (s α, ) + Definiion. An ordering ( ) is a bijecive funcion ha maps an order k [K] o is arm (k) [K]. For a sae s, a valid increasing ordering ( ) is an ordering such ha for all pairs of disinc orders i, j [K] wih i j and i < j we have s (i) s (j). Proposiion. Le g {, } K, for any sae s, X(s) X(s + g). Proof. If here exiss an arm k X(s) such ha g k =, hen k X(s + g) and hen X(s) X(s + g). Oherwise for all arms k X(s) we have g k = so acually X(s) X(s + g) as s = s + g. Again we have X(s) X(s + g). For a se of deacivaed arms I [K], we define X I (s) = {k [K]\I : s k = max j [K]\I s j }. Therefore we also have he following proposiion. Proposiion 3. Le g {, } K, for any sae s, for any se arms I [K], X I (s) X I (s + g). Proposiion 4. Le g {, } K, for any sae s here exiss an ordering ( ) ha is valid simulaneously for boh saes s and s + g. Proof. To consruc he ordering follow he following seps. Le he firs sep be i = and A = [K], I =. Ieraively, for each sep i, we ake k i X I (s) X I (s + g) wih X I (s) X I (s + g) (Proposiion 3). We se (i) k i and we pass o he nex ieraion (i i + ) by deacivaing index arm k i wih I i+ = I i {k i }. Noe ha by consrucion his ordering saisfies a any sep he definiing propery of valid increasing ordering for s. The following lemma, which holds for any K, will be used in showing ha p C defines a probabiliy. Lemma 8. Le g {, } K, for any sae s, hen for all, and V C (s + g) V C (s), V C (s + ) V C (s) =. Proof. The proof is by backward inducion on ime. Iniiaion: A = T + we have for any s and any g {, } K, V C T + (s + g) V C T + (s) = s + g s = (s j + g j ) s i g i min g k, k [K] where in i = arg max k [K] (s) and j = arg max k [K] (s + g), and uses s j + g j s i + g i by definiion. Noe ha he inequaliies urn ino equaliies when g =.

21 Inducion sep: Assuming ha for all > and for any s and any g {, } K, V C (s + g) (s) min k [K] g k we have, V C V C (s + g) V C (s) = ( + V C + (s + g + e ()) + V C min g k k [K] + (s + g + e () + e (3) ) ) ( + V C + (s + e ()) + V C + (s + e () + e (3) ) ) where is because, using Proposiion 4, here exiss a common ordering ( ) for boh saes s ans s + g and is by inducion. Noe ha he inequaliy urn ino equaliy when g =. We now urn o he analysis of our COMB-based learner, specifically designed for he case K = 3. Proposiion 5. In all saes and a all imes, he COMB-based learner is a probabiliy. Proof. We need o show for any sae s, any ime, 3 k= p,k(s) = and for any exper k, p,k (s). For all, for any sae s, We have moreover, 3 p,k (s) = p,() (s) + p,() (s) + p,(3) (s) k= p,() (s) + p,() (s) = p,() (s) + p,() (s) + p,() (s) p,() (s) =. = V C + (s + e ()) V C (s) + V C + (s + e ()) V C (s) = V C + (s + e ()) (V C + (s + e ()) + V C + (s + e () + e (3) ) ) + V C + (s + e ()) (V C + (s + e ()) + V C + (s + e () + e (3) ) ) = + V C + (s + e ()) V C + (s + e () + e (3) ) + V C + (s + e ()) V C + (s + e () + e (3) ) = ( V C + (s + e () + e (3) ) V C + (s + e ()) ) ( V C + (s + e () + e (3) ) V C + (s + e ()) ) min (s + e () + e (3) (s + e () )) k k [K] min (s + e () + e (3) (s + e () )) k k [K] =, where is because of Lemma 5 and is using Lemma 8. Therefore p,(3) (s) = p,() (s) p,() (s). Concerning he posiiveness of p,() (s), we have, p,() (s) = V C + (s + e ()) V C (s) = V C + (s + e ()) (V C + (s + e ()) + V C + (s + e () + e (3) ) ) = (V C + (s + e ()) V C + (s + e () + e (3) ) + ) = (V C + (s + e ()) V C + (s) + V C + (s + e () + e () + e (3) ) V C + (s + e () + e (3) ) + ) min k [K] (s + e () s) k

Near Minimax Optimal Players for the Finite-Time 3-Expert Prediction Problem

Near Minimax Optimal Players for the Finite-Time 3-Expert Prediction Problem Near Minimax Opimal Players for he Finie-Time 3-Exper Predicion Problem Yasin Abbasi-Yadkori Adobe Research Peer L. Barle UC Berkeley Vicor Gabillon Queensland Universiy of Technology Absrac We sudy minimax