Machine Learning, 43, 265–291, 2001.


Drifting Games

Robert E. Schapire (schapire@research.att.com)
AT&T Labs Research, Shannon Laboratory, 180 Park Avenue, Room A279, Florham Park, NJ 07932-0971

© 2001 Kluwer Academic Publishers. Printed in the Netherlands.

Abstract. We introduce and study a general, abstract game played between two players called the shepherd and the adversary. The game is played in a series of rounds using a finite set of "chips" which are moved about in ℝ^n. On each round, the shepherd assigns a desired direction of movement and an importance weight to each of the chips. The adversary then moves the chips in any way that need only be weakly correlated with the desired directions assigned by the shepherd. The shepherd's goal is to cause the chips to be moved to low-loss positions, where the loss of each chip at its final position is measured by a given loss function. We present a shepherd algorithm for this game and prove an upper bound on its performance. We also prove a lower bound showing that the algorithm is essentially optimal for a large number of chips. We discuss computational methods for efficiently implementing our algorithm. We show that our general drifting-game algorithm subsumes some well studied boosting and on-line learning algorithms whose analyses follow as easy corollaries of our general result.

Keywords: boosting, on-line learning algorithms

1. Introduction

We introduce a general, abstract game played between two players called the shepherd and the adversary. (In an earlier version of this paper, the "shepherd" was called the "drifter," a term that was found by some readers to be confusing. The name of the main algorithm has also been changed from "Shepherd" to "OS.") The game is played in a series of rounds using a finite set of "chips" which are moved about in ℝ^n. On each round, the shepherd assigns a desired direction of movement to each of the chips, as well as a nonnegative weight measuring the relative importance that each chip be moved in the desired direction. In response, the adversary moves each chip however it wishes, so long as the relative movements of the chips projected in the directions chosen by the shepherd are at least δ, on average. Here, the average is taken with respect to the importance weights that were selected by the shepherd, and δ ≥ 0 is a given parameter of the game.

Since we think of δ as a small number, the adversary need move the chips in a fashion that is only weakly correlated with the directions desired by the shepherd. The adversary is also restricted to choose relative movements for the

chips from a given set B ⊆ ℝ^n. The goal of the shepherd is to force the chips to be moved to low-loss positions, where the loss of each chip at its final position is measured by a given loss function L. A more formal description of the game is given in Section 2.

We present in Section 4 a new algorithm called "OS" for playing this game in the role of the shepherd, and we analyze the algorithm's performance for any parameterization of the game meeting certain natural conditions. Under the same conditions, we also prove in Section 5 that our algorithm is the best possible when the number of chips becomes large.

As spelled out in Section 3, the drifting game is closely related to boosting, the problem of finding a highly accurate classification rule by combining many weak classifiers or hypotheses. The drifting game and its analysis are generalizations of Freund's (1995) "majority-vote game" which was used to derive his boost-by-majority algorithm. This latter algorithm is optimal in a certain sense for boosting binary problems using weak hypotheses which are restricted to making binary predictions. However, the boost-by-majority algorithm has never been generalized to multiclass problems, nor to a setting in which weak hypotheses may "abstain" or give graded predictions between two classes. The general drifting game that we study leads immediately to new boosting algorithms for these settings. By our result on the optimality of the OS algorithm, these new boosting algorithms are also best possible, assuming as we do in this paper that the final hypothesis is restricted in form to a simple majority vote. We do not know if the derived algorithms are optimal without this restriction.

In Section 6, we discuss computational methods for implementing the OS algorithm. We give a useful theorem for handling games in which the loss function enjoys certain monotonicity properties. We also give a more general technique using linear programming for implementing OS in many settings, including the drifting game that corresponds to multiclass boosting. In this latter case, the algorithm runs in polynomial time when the number of classes is held constant.

In Section 7, we discuss the analysis of several drifting games corresponding to previously studied learning problems. For the drifting games corresponding to binary boosting with or without abstaining weak hypotheses, we show how to implement the algorithm efficiently. We also show that there are parameterizations of the drifting game under which OS is equivalent to a simplified version of the AdaBoost algorithm (Freund and Schapire, 1997; Schapire and Singer, 1999), as well as Cesa-Bianchi et al.'s (1996) BW algorithm and Littlestone and Warmuth's (1994) weighted majority algorithm for combining the advice of experts in an on-line learning setting. Analyses of these algorithms follow as easy corollaries of the analysis we give for general drifting games.

2. Drifting games

We begin with a formal description of the drifting game. An outline of the game is shown in Fig. 1. There are two players in the game called the shepherd and the adversary. The game is played in T rounds using m chips.

Figure 1. The drifting game.

  parameters:
    number of rounds T
    dimension of space n
    set B ⊆ ℝ^n of permitted relative movements
    norm ℓ_p where p ≥ 1
    minimum average drift δ ≥ 0
    loss function L : ℝ^n → ℝ
    number of chips m
  for t = 1, ..., T:
    shepherd chooses weight vector w_i^t ∈ ℝ^n for each chip i
    adversary chooses drift vector z_i^t ∈ B for each chip i so that
      Σ_{i=1}^m w_i^t · z_i^t ≥ δ Σ_{i=1}^m ||w_i^t||_p
  the final loss suffered by the shepherd is
      (1/m) Σ_{i=1}^m L(Σ_{t=1}^T z_i^t)

On each round t, the shepherd specifies a weight vector w_i^t ∈ ℝ^n for each chip i. The direction of this vector, v_i^t = w_i^t / ||w_i^t||_p, specifies a desired direction of drift, while the length of the vector ||w_i^t||_p specifies the relative importance of moving the chip in the desired direction. In response, the adversary chooses a drift vector z_i^t for each chip i. The adversary is constrained to choose each z_i^t from a fixed set B ⊆ ℝ^n. Moreover, the z_i^t's must satisfy

  Σ_i w_i^t · z_i^t ≥ δ Σ_i ||w_i^t||_p   (1)

or equivalently

  (Σ_i ||w_i^t||_p v_i^t · z_i^t) / (Σ_i ||w_i^t||_p) ≥ δ   (2)

where δ ≥ 0 is a fixed parameter of the game. (Here and throughout the paper, when clear from context, Σ_i denotes Σ_{i=1}^m; likewise, we will shortly use the notation Σ_t for Σ_{t=1}^T.) In words, v_i^t · z_i^t is the amount by which chip i has moved in the desired direction. Thus, the left hand side of Eq. (2) represents a weighted average of the drifts of the chips projected in the desired directions, where chip i's projected drift is weighted by ||w_i^t||_p / Σ_j ||w_j^t||_p. We require that this average projected drift be at least δ.

The position of chip i at time t, denoted by s_i^t, is simply the sum of the drifts of that chip up to that point in time. Thus, s_i^1 = 0 and s_i^{t+1} = s_i^t + z_i^t. The final position of chip i at the end of the game is s_i^{T+1}. At the end of T rounds, we measure the shepherd's performance using a function L of the final positions of the chips; this function is called the loss function. Specifically, the shepherd's goal is to minimize

  (1/m) Σ_i L(s_i^{T+1}).

Summarizing, we see that a game is specified by several parameters: the number of rounds T; the dimension n of the space; a norm ||·||_p on ℝ^n; a set B ⊆ ℝ^n; a minimum drift constant δ ≥ 0; a loss function L; and the number of chips m. Since the lengths of weight vectors w are measured using an ℓ_p-norm, it is natural to measure drift vectors z using the dual ℓ_q-norm, where 1/p + 1/q = 1. When clear from context, we will generally drop p and q subscripts and write simply ||w|| or ||z||.

As an example of a drifting game, suppose that the game is played on the real line and that the shepherd's goal is to get as many chips as possible into the interval [1, 7]. Suppose further that the adversary is constrained to move each chip left or right by one unit, and that, on each round, 10% of the chips (as weighted by the shepherd's chosen distribution over chips) must be moved in the shepherd's desired direction. Then for this game, n = 1, B = {−1, +1} and δ = 0.1. Any norm will do (since we are working in just one dimension), and the loss function is

  L(s) = 0 if s ∈ [1, 7], 1 otherwise.

We will return to this example later in the paper.

Drifting games bear a certain resemblance to the kind of games studied in Blackwell's (1956) celebrated approachability theory. However, it is unclear what the exact relationship is between these two types of games and whether one type is a special case of the other.
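The example above is concrete enough to simulate. The sketch below plays the game with a deliberately naive shepherd (unit weight on every chip, desired direction pointing toward the middle of the goal interval) and a greedy adversary that concedes just enough weighted drift to satisfy Eq. (2). Both strategies are stand-ins of our own, not the OS algorithm introduced in Section 4.

```python
# Simulation sketch of the one-dimensional example game of Section 2:
# chips start at 0 and drift on the integers, B = {-1, +1}, delta = 0.1,
# and the shepherd wants chips to finish inside the goal interval [1, 7].
# The shepherd strategy here (unit weights, aim at the interval's center)
# is a naive placeholder, NOT the OS algorithm of Section 4.

def play_drifting_game(m=100, T=20, delta=0.1, goal=(1, 7)):
    center = (goal[0] + goal[1]) // 2
    positions = [0] * m
    for _ in range(T):
        # Shepherd: desired direction v_i points toward the goal's center.
        dirs = [1 if s <= center else -1 for s in positions]
        weights = [1.0] * m
        # Adversary: push every chip against its desired direction, then
        # concede chips one by one until the weighted average projected
        # drift satisfies the constraint of Eq. (2).
        drifts = [-d for d in dirs]
        i = 0
        while sum(w * z * d for w, z, d in zip(weights, drifts, dirs)) \
                < delta * sum(weights):
            drifts[i] = dirs[i]          # concede chip i to the shepherd
            i += 1
        positions = [s + z for s, z in zip(positions, drifts)]
    loss = sum(1 for s in positions if not goal[0] <= s <= goal[1]) / m
    return loss
```

Against this naive shepherd, the adversary concedes a bare majority of the chips (which settle inside the goal interval) and drags the rest off to −20, losing roughly 45% of the chips; Sections 4–5 show what the shepherd can guarantee instead with optimal play.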

3. Relation to boosting

In this section, we describe how the general game of drift relates directly to boosting. In the simplest boosting model, there is a boosting algorithm that has access to a weak learning algorithm that it calls in a series of rounds. There are m given labeled examples (x_1, y_1), ..., (x_m, y_m) where x_i ∈ X and y_i ∈ {−1, +1}. On each round t, the booster chooses a distribution D_t(i) over the examples. The weak learner then must generate a weak hypothesis h_t : X → {−1, +1} whose error is at most 1/2 − γ/2 with respect to distribution D_t. That is,

  Pr_{i∼D_t}[y_i ≠ h_t(x_i)] ≤ 1/2 − γ/2.   (3)

Here, γ > 0 is known a priori to both the booster and the weak learner. After T rounds, the booster outputs a final hypothesis which we here assume is a majority vote of the weak hypotheses:

  H(x) = sign(Σ_t h_t(x)).   (4)

For our purposes, the goal of the booster is to minimize the fraction of errors of the final hypothesis on the given set of examples:

  (1/m) |{i : y_i ≠ H(x_i)}|.   (5)

(Of course, the real goal of a boosting algorithm is to find a hypothesis with low generalization error. In this paper, we only focus on the simplified problem of minimizing error on the given training examples.)

We can recast boosting as just described as a special-case drifting game; a similar game, called the "majority-vote game," was studied by Freund (1995) for this case. The chips are identified with examples, and the game is one-dimensional so that n = 1. The drift of a chip z_i^t is +1 if example i is correctly classified by h_t and −1 otherwise; that is, z_i^t = y_i h_t(x_i) and B = {−1, +1}. The weight w_i^t is formally permitted to be negative, something that does not make sense in the boosting setting; however, for the optimal shepherd described in the next section, this weight will always be nonnegative for this game (by Theorem 7), so we henceforth assume that w_i^t ≥ 0. The distribution D_t(i) corresponds to w_i^t / Σ_j w_j^t. Then the condition in Eq. (3) is equivalent to

  (Σ_i w_i^t z_i^t) / (Σ_i w_i^t) ≥ γ,   or   Σ_i w_i^t z_i^t ≥ γ Σ_i w_i^t.   (6)
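The identity behind Eq. (6) is easy to check numerically. In this sketch (the labels, predictions, and distribution are made-up stand-ins), the drift of each chip is y_i h_t(x_i), and the weighted error and the weighted average drift determine each other:

```python
# Sketch of the binary reduction of Section 3: a chip's drift on round t is
# z_i = y_i * h_t(x_i), so the weighted-error condition (3) and the drift
# condition (1) (with delta = gamma) are two views of the same constraint.
# The labels, predictions, and distribution below are made-up stand-ins.

def weighted_error(D, ys, preds):
    """Pr_{i ~ D}[y_i != h(x_i)], the left-hand side of Eq. (3)."""
    return sum(d for d, y, p in zip(D, ys, preds) if y != p)

def weighted_drift(D, ys, preds):
    """Weighted average projected drift: sum_i D(i) * y_i * h(x_i)."""
    return sum(d * y * p for d, y, p in zip(D, ys, preds))

ys    = [+1, +1, -1, -1]          # labels
preds = [+1, -1, -1, +1]          # one weak hypothesis' predictions
D     = [0.4, 0.1, 0.4, 0.1]      # booster's distribution over examples

err   = weighted_error(D, ys, preds)
drift = weighted_drift(D, ys, preds)

# Each y_i * h(x_i) is +1 on a correct example and -1 on an error, so the
# drift is 1 - 2 * error: error <= 1/2 - gamma/2 iff drift >= gamma.
assert abs(drift - (1 - 2 * err)) < 1e-12
```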

This is the same as Eq. (1) if we let δ = γ. Finally, if we define the loss function to be

  L(s) = 1 if s ≤ 0, 0 if s > 0   (7)

then

  (1/m) Σ_i L(s_i^{T+1})   (8)

is exactly equal to Eq. (5). Our main result on playing drifting games yields in this case exactly Freund's boost-by-majority algorithm (1995).

There are numerous variants of this basic boosting setting to which Freund's algorithm has never been generalized and analyzed. For instance, we have so far required weak hypotheses to output values in {−1, +1}. It is natural to generalize this model to allow weak hypotheses to take values in {−1, 0, +1} so that the weak hypotheses may "abstain" on some examples, or to take values in [−1, +1] so that a whole range of values are possible. These correspond to simple modifications of the drifting game described above in which we simply change B to {−1, 0, +1} or [−1, +1]. As before, we require that Eq. (6) hold for all weak hypotheses and we attempt to design a boosting algorithm which minimizes Eq. (8). For both of these cases, we are able to derive analogs of the boost-by-majority algorithm which we prove are optimal in a particular sense.

Another direction for generalization is to the non-binary multiclass case in which labels y_i belong to a set Y = {1, ..., n}, n > 2. Following generalizations of the boosting algorithm AdaBoost to the multiclass case (Freund and Schapire, 1997; Schapire and Singer, 1999), we allow the booster to assign weights both to examples and labels. That is, on each round, the booster devises a distribution D_t(i, ℓ) over examples i and labels ℓ ∈ Y. The weak learner then computes a weak hypothesis h_t : X × Y → {−1, +1} which must be correct on a non-trivial fraction of the example-label pairs. That is, if we define

  ỹ_i(ℓ) = +1 if y_i = ℓ, −1 otherwise

then we require

  Pr_{(i,ℓ)∼D_t}[h_t(x_i, ℓ) ≠ ỹ_i(ℓ)] ≤ 1/2 − γ/2.   (9)

The final hypothesis, we assume, is again a plurality vote of the weak hypotheses:

  H(x) = arg max_{y∈Y} Σ_t h_t(x, y).   (10)

We can cast this multiclass boosting problem as a drifting game as follows. We have n dimensions, one per class. It will be convenient for the first dimension always to correspond to the correct label, with the remaining n − 1 dimensions corresponding to incorrect labels. To do this, let us define a map σ_ℓ : ℝ^n → ℝ^n which simply swaps coordinates 1 and ℓ, leaving the other coordinates untouched. The weight vectors w_i^t correspond to the distribution D_t, modulo swapping of coordinates, a correction of sign and normalization:

  D_t(i, ℓ) = ỹ_i(ℓ) [σ_{y_i}(w_i^t)]_ℓ / Σ_j ||w_j^t||_1.

The norm used here to measure weight vectors is the ℓ_1-norm. Also, it will follow from Theorem 7 that, for optimal play of this game, the first coordinate of w_i^t is always nonnegative and all other coordinates are nonpositive. The drift vectors z_i^t are derived as before from the weak hypotheses:

  z_i^t = σ_{y_i}(⟨h_t(x_i, 1), ..., h_t(x_i, n)⟩).

It can be verified that the condition in Eq. (9) is equivalent to Eq. (1) with δ = γ. For binary weak hypotheses, B = {−1, +1}^n. The final hypothesis H makes a mistake on example (x, y) if and only if

  Σ_t h_t(x, y) ≤ max_{ℓ: ℓ≠y} Σ_t h_t(x, ℓ).

Therefore, we can count the fraction of mistakes of the final hypothesis in the drifting game context as (1/m) Σ_i L(s_i^{T+1}) where

  L(s) = 1 if s_1 ≤ max{s_2, ..., s_n}, 0 otherwise.   (11)

Thus, by giving an algorithm for the general drifting game, we also obtain a generalization of the boost-by-majority algorithm for multiclass problems. The algorithm can be implemented in this case in polynomial time for a constant number of classes n, and the algorithm is provably best possible in a particular sense.

We note also that a simplified form of the AdaBoost algorithm (Freund and Schapire, 1997; Schapire and Singer, 1999) can be derived as an instance of the OS algorithm simply by changing the loss function L in Eq. (7) to an exponential L(s) = exp(−ηs) for some η > 0. More details on this game are given in Section 7.2.
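A compact sketch of this encoding (with one made-up example and two made-up weak-hypothesis rows) checks that the loss of Eq. (11), applied to the summed drifts, agrees with the plurality-vote mistake criterion:

```python
# Sketch of the multiclass encoding of Section 3 with n = 3 classes.
# sigma_ell swaps coordinates 1 and ell, so after encoding, coordinate 1
# of a chip always tracks the vote total for the correct label.
# The example label y and the weak-hypothesis rows are made-up stand-ins.

def swap(vec, ell):
    """sigma_ell: swap coordinates 1 and ell (1-indexed), leave the rest."""
    v = list(vec)
    v[0], v[ell - 1] = v[ell - 1], v[0]
    return tuple(v)

def loss(s):
    """Eq. (11): the final vote errs iff s_1 <= max(s_2, ..., s_n)."""
    return 1.0 if s[0] <= max(s[1:]) else 0.0

y = 2                                  # correct label of this example
h_rows = [(1, -1, 1), (-1, -1, 1)]     # h_t(x, 1..3) for rounds t = 1, 2

# Final position: coordinatewise sum of the drifts z_t = sigma_y(h_t row).
s = [0, 0, 0]
for row in h_rows:
    z = swap(row, y)
    s = [a + b for a, b in zip(s, z)]

# Same mistake criterion computed directly from the vote totals.
direct = max(sum(r[l] for r in h_rows) for l in range(3) if l != y - 1) \
         >= sum(r[y - 1] for r in h_rows)
assert loss(tuple(s)) == float(direct)
```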

Besides boosting problems, the drifting game also generalizes the problem of learning on-line with a set of "experts" (Cesa-Bianchi et al., 1997; Littlestone and Warmuth, 1994). In particular, the BW algorithm of Cesa-Bianchi et al. (1996) and the weighted majority algorithm of Littlestone and Warmuth (1994) can be derived as special cases of our main algorithm for a particular natural parameterization of the drifting game. Details are given in Section 7.3.

4. The algorithm and its analysis

We next describe our algorithm for playing the general drifting game of Section 2. Like Freund's boost-by-majority algorithm (1995), the algorithm we present here uses a "potential function" which is central both to the workings of the algorithm and its analysis. This function can be thought of as a "guess" of the loss that we expect to suffer for a chip at a particular position and at a particular point in time. We denote the potential of a chip at position s on round t by φ_t(s). The final potential is the actual loss so that φ_T = L. The potential functions φ_t for earlier time steps are defined inductively:

  φ_{t−1}(s) = min_{w∈ℝ^n} sup_{z∈B} (φ_t(s + z) + w · z − δ||w||_p).   (12)

We will show later that, under natural conditions, the minimum above actually exists. Moreover, the minimizing vector w is the one used by the shepherd for the algorithm we now present. We call our shepherd algorithm "OS" for "optimal shepherd." The weight vector w_i^t chosen by OS for chip i is any vector w which minimizes

  sup_{z∈B} (φ_t(s_i^t + z) + w · z − δ||w||_p).

Returning to the example at the end of Section 2, Fig. 2 shows the potential function φ_t and the weights that would be selected by OS as a function of the position of each chip, for various choices of t. For this figure, T = 20.

We will need some natural assumptions to analyze this algorithm. The first assumption states merely that the allowed drift vectors in B are bounded; for convenience, we assume they have norm at most one.

ASSUMPTION 1. sup_{z∈B} ||z||_q ≤ 1.

We next assume that the loss function L is bounded.

ASSUMPTION 2. There exist finite L_min and L_max such that L_min ≤ L(s) ≤ L_max for all s ∈ ℝ^n.
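For the one-dimensional example game, the recursion of Eq. (12) can be evaluated by straightforward dynamic programming over integer positions. With scalar w and B = {−1, +1}, minimizing max(φ_t(s+1) + w − δ|w|, φ_t(s−1) − w − δ|w|) over w reduces, by a short case analysis, to the closed form used below; that case analysis is ours, so treat the closed form as an assumption of this sketch rather than a formula quoted from the paper.

```python
# Dynamic-programming sketch of the potential recursion (12) for the
# one-dimensional example game of Section 2 (B = {-1, +1}, delta = 0.1,
# goal interval [1, 7], T = 20).  For scalar w, our case analysis of
#   min_w max( phi_t(s+1) + w - delta*|w|,  phi_t(s-1) - w - delta*|w| )
# gives the closed form used in the inner loop (an assumption of ours).

def potentials(T=20, delta=0.1, goal=(1, 7)):
    # Final potential phi_T = L on every reachable position.
    L = {s: 0.0 if goal[0] <= s <= goal[1] else 1.0 for s in range(-T, T + 1)}
    phi = {T: L}
    for t in range(T, 0, -1):
        prev = {}
        for s in range(-(t - 1), t):   # positions reachable after t-1 rounds
            a, b = phi[t][s + 1], phi[t][s - 1]
            prev[s] = ((1 + delta) * min(a, b) + (1 - delta) * max(a, b)) / 2
        phi[t - 1] = prev
    return phi

phi = potentials()
print(phi[0][0])   # the upper bound phi_0(0) of Theorem 2 for this game
```

Each potential is a convex combination of later potentials, so all values stay within [L_min, L_max] = [0, 1], matching Part 2 of Lemma 1 below.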

[Figure 2 appears here: six panels (t = 0, 5, 10, 15, 19, 20) plotting potential and weight against position.]

Figure 2. Plots of the potential function (top curve in each figure) and the weights selected by OS (bottom curves) as a function of the position of a chip in the example game at the end of Section 2, for various choices of t and with T = 20. The vertical dotted lines show the boundary of the goal interval [1, 7]. Curves are only meaningful at integer values.

In fact, this assumption need only hold for all s with ||s||_q ≤ T since positions outside this range are never reached, given Assumption 1. Finally, we assume that, for any direction v, it is possible to choose a drift whose projection onto v is more than δ by a constant amount.

ASSUMPTION 3. There exists a number ν > 0 such that for all w ∈ ℝ^n there exists z ∈ B with w · z ≥ (δ + ν)||w||.

LEMMA 1. Given Assumptions 1, 2 and 3, for all t = 0, ..., T:

1. the minimum in Eq. (12) exists; and
2. L_min ≤ φ_t(s) ≤ L_max for all s ∈ ℝ^n.

Proof. By backwards induction on t. The base cases are trivial. Let us fix s and let F(z) = φ_t(s + z). Let

  H(w) = sup_{z∈B} (F(z) + w · z − δ||w||).

Using Assumption 1, for any w, w′:

  |H(w′) − H(w)| ≤ sup_{z∈B} |(F(z) + w′ · z − δ||w′||) − (F(z) + w · z − δ||w||)|
                ≤ sup_{z∈B} |(w′ − w) · z| + δ | ||w′|| − ||w|| |
                ≤ (1 + δ)||w′ − w||.

Therefore, H is continuous. Moreover, for w ∈ ℝ^n, by Assumptions 2 and 3 (as well as our inductive hypothesis),

  H(w) ≥ L_min + (δ + ν)||w|| − δ||w|| = L_min + ν||w||.   (13)

Since

  H(0) ≤ L_max,   (14)

it follows that H(w) > H(0) if ||w|| > (L_max − L_min)/ν. Thus, for computing the minimum of H, we only need consider points in the compact set

  {w : ||w|| ≤ (L_max − L_min)/ν}.

Since a continuous function over a compact set has a minimum, this proves Part 1. Part 2 follows immediately from Eqs. (13) and (14).

We next prove an upper bound on the loss suffered by a shepherd employing the OS algorithm against any adversary. This is the main result of this section. We will shortly see that this bound is essentially best possible for any algorithm. It is important to note that these theorems tell us much more than the almost obvious point that the optimal thing to do is whatever is best in a minimax sense. These theorems prove the nontrivial fact that (nearly) minimax behavior can be obtained without the simultaneous consideration of all of the chips at once. Rather, we can compute each weight vector w_i^t merely as a function of the position of chip i, without consideration of the positions of any of the other chips.

THEOREM 2. Under the condition of Assumptions 1, 2 and 3, the final loss suffered by the OS algorithm against any adversary is at most φ_0(0), where the functions φ_t are defined above.

Proof. Following Freund's analysis (1995), we show that the total potential never increases. That is, we prove by induction that

  Σ_i φ_t(s_i^{t+1}) ≤ Σ_i φ_{t−1}(s_i^t).   (15)

This implies, through repeated application of Eq. (15), that

  (1/m) Σ_i L(s_i^{T+1}) = (1/m) Σ_i φ_T(s_i^{T+1}) ≤ (1/m) Σ_i φ_0(s_i^1) = φ_0(0)

as claimed. The definition of φ_{t−1} given in Eq. (12) implies that for w_i^t chosen by the OS algorithm, and for all z ∈ B and all i:

  φ_t(s_i^t + z) + w_i^t · z − δ||w_i^t|| ≤ φ_{t−1}(s_i^t).

Therefore,

  Σ_i φ_t(s_i^{t+1}) = Σ_i φ_t(s_i^t + z_i^t) ≤ Σ_i (φ_{t−1}(s_i^t) − w_i^t · z_i^t + δ||w_i^t||) ≤ Σ_i φ_{t−1}(s_i^t)

where the last inequality follows from Eq. (1).

Returning again to the example at the end of Section 2, Fig. 3 shows a plot of the bound φ_0(0) as a function of the total number of rounds T. It is rather curious that the bound is not monotonic in T (even discounting the jagged nature of the curve caused by the difference between even and odd length games). Apparently, for this game, having more time to get the chips into the goal region can actually hurt the shepherd.

5. A lower bound

In this section, we prove that the OS algorithm is essentially optimal in the sense that, for any shepherd algorithm, there exists an adversary capable of forcing a loss matching the upper bound of Theorem 2 in the limit of a large number of chips. Specifically, we prove the following theorem, the main result of this section:

[Figure 3 appears here: "optimal loss" (roughly 0.4 to 0.9) plotted against T.]

Figure 3. A plot of the loss bound φ_0(0) as a function of the total number of rounds T for the example game at the end of Section 2. The jagged nature of the curve is due to the difference between a game with an odd or an even number of steps.

THEOREM 3. Let A be any shepherd algorithm for playing a drifting game satisfying Assumptions 1, 2 and 3 where all parameters of the game are fixed, except the number of chips m. Let φ_t be as defined above. Then for any ε > 0, there exists an adversary such that for m sufficiently large, the loss suffered by algorithm A is at least φ_0(0) − ε.

To prove the theorem, we will need two lemmas. The first gives an abstract result on computing a minimax of the kind appearing in Eq. (12). The second lemma uses the first to prove a characterization of φ_t in a form amenable to use in the proof of Theorem 3.

LEMMA 4. Let S be any nonempty, bounded subset of ℝ². Let C be the convex hull of S. Then

  inf_{λ∈ℝ} sup{y + λx : (x, y) ∈ S} = sup{y : (0, y) ∈ C}.

Proof. Let C̄ be the closure of C. First, for any λ ∈ ℝ,

  sup{y + λx : (x, y) ∈ S} = sup{y + λx : (x, y) ∈ C} = sup{y + λx : (x, y) ∈ C̄}.   (16)

The first equality follows from the fact that, if (x, y) ∈ C then (x, y) = Σ_{j=1}^N p_j (x_j, y_j) for some positive integer N, p_j ∈ [0, 1], Σ_j p_j = 1, (x_j, y_j) ∈ S. But then

  y + λx = Σ_{j=1}^N p_j (y_j + λx_j) ≤ max_j (y_j + λx_j).

The second equality in Eq. (16) follows simply because the supremum of a continuous function on any set is equal to its supremum over the closure of the set. For this same reason,

  sup{y : (0, y) ∈ C} = sup{y : (0, y) ∈ C̄}.   (17)

Because C̄ is closed, convex and bounded, and because the function y + λx is continuous, concave in (x, y) and convex in λ, we can reverse the order of the "inf sup" (see, for instance, Corollary 37.3.2 of Rockafellar (1970)). That is,

  inf_{λ∈ℝ} sup_{(x,y)∈C̄} (y + λx) = sup_{(x,y)∈C̄} inf_{λ∈ℝ} (y + λx).   (18)

Clearly, if x ≠ 0 then

  inf_{λ∈ℝ} (y + λx) = −∞.

Thus, the right hand side of Eq. (18) is equal to

  sup{y : (0, y) ∈ C̄}.

Combining with Eqs. (16) and (17) immediately gives the result.

LEMMA 5. Under the condition of Assumptions 1, 2 and 3, and for φ_t as defined above,

  φ_{t−1}(s) = inf_{v: ||v||=1} sup Σ_{j=1}^N d_j φ_t(s + z_j)

where the supremum is taken over all positive integers N, all z_1, ..., z_N ∈ B and all nonnegative d_1, ..., d_N satisfying Σ_j d_j = 1 and

  Σ_j d_j v · z_j ≥ δ.

Proof. To simplify notation, let us fix t and s. Let F and H be as defined in the proof of Lemma 1. For ||v|| = 1, let

  G(v) = sup Σ_{j=1}^N d_j F(z_j)   (19)

where again the supremum is taken over d_j's and z_j's as in the statement of the lemma. Note that by Assumption 3, this supremum cannot be vacuous. Throughout this proof, we use v to denote a vector of norm one, while w is a vector of unrestricted norm. Our goal is to show that

  inf_v G(v) = inf_w H(w).   (20)

Let us fix v momentarily. Let

  S = {(v · z − δ, F(z)) : z ∈ B}.

Then S is bounded by Assumptions 1, 2 and 3 (and Part 2 of Lemma 1), so we can apply Lemma 4, which gives

  inf_{λ∈ℝ} sup_{z∈B} (F(z) + λ(v · z − δ)) = G(v).   (21)

Note that

  inf_{λ≥0} H(λv) = inf_{λ≥0} sup_{z∈B} (F(z) + λ v · z − δλ)
                 ≥ inf_{λ∈ℝ} sup_{z∈B} (F(z) + λ(v · z − δ))
                 ≥ inf_{λ∈ℝ} sup_{z∈B} (F(z) + λ v · z − δ|λ|)
                 = inf_{λ∈ℝ} H(λv)

(where the second inequality uses λ ≤ |λ|). Combining with Eq. (21) gives

  inf_v inf_{λ≥0} H(λv) ≥ inf_v G(v) ≥ inf_v inf_{λ∈ℝ} H(λv).

Since the left and right terms are both equal to inf_w H(w), this implies Eq. (20) and completes the proof.

Proof of Theorem 3. We will show that, for m sufficiently large, on each round t, the adversary can choose the z_i^t's so that

  (1/m) Σ_i φ_t(s_i^{t+1}) ≥ (1/m) Σ_i φ_{t−1}(s_i^t) − ε/T.   (22)

Repeatedly applying Eq. (22) implies that

  (1/m) Σ_i L(s_i^{T+1}) = (1/m) Σ_i φ_T(s_i^{T+1}) ≥ (1/m) Σ_i φ_0(s_i^1) − ε = φ_0(0) − ε,

proving the theorem.

Fix t. We use a random construction to show that there exist z_i^t's with the desired properties. For each weight vector w_i^t chosen by the shepherd, let d_{i1}, ..., d_{iN} ∈ [0, 1] and z_{i1}, ..., z_{iN} ∈ B be such that Σ_j d_{ij} = 1,

  Σ_j d_{ij} w_i^t · z_{ij} ≥ δ||w_i^t||

and

  Σ_j d_{ij} φ_t(s_i^t + z_{ij}) ≥ φ_{t−1}(s_i^t) − ε/(2T).

Such d_{ij}'s and z_{ij}'s must exist by Lemma 5. Using Assumption 3, let z_{i0} be such that

  w_i^t · z_{i0} ≥ (δ + ν)||w_i^t||.

Finally, let Z_i be a random variable that is z_{i0} with probability ρ and z_{ij} with probability (1 − ρ)d_{ij} (independent of the other Z_i's). Here,

  ρ = ε / (4T(L_max − L_min)).

Let v_i = w_i^t/||w_i^t||, and let a_i = ||w_i^t|| / Σ_j ||w_j^t||. By Assumption 1, |v_i · Z_i| ≤ 1. Also,

  E[v_i · Z_i] ≥ (1 − ρ)δ + ρ(δ + ν) = δ + ρν.

Thus, by Hoeffding's inequality (1963), since Σ_i a_i² ≤ 1,

  Pr[Σ_i a_i v_i · Z_i < δ] ≤ exp(−(ρν)² / (2 Σ_i a_i²)) ≤ e^{−(ρν)²/2}.   (23)

Let S = (1/m) Σ_i φ_t(s_i^t + Z_i). Then

  E[S] ≥ (1/m) Σ_i [φ_{t−1}(s_i^t) + ρ(φ_t(s_i^t + z_{i0}) − φ_{t−1}(s_i^t))] − ε/(2T)
       ≥ (1/m) Σ_i φ_{t−1}(s_i^t) − ρ(L_max − L_min) − ε/(2T)
       ≥ (1/m) Σ_i φ_{t−1}(s_i^t) − 3ε/(4T).   (24)

By Hoeffding's inequality (1963), since L_min ≤ φ_t(s_i^t + Z_i) ≤ L_max,

  Pr[S < E[S] − ρ(L_max − L_min)] ≤ e^{−2ρ²m}.   (25)

Now let m be so large that

  e^{−2ρ²m} + e^{−(ρν)²/2} < 1.

Then by Eqs. (23) and (25), there exists a choice of the z_i^t's such that

  Σ_i a_i v_i · z_i^t ≥ δ,   i.e.,   Σ_i w_i^t · z_i^t ≥ δ Σ_i ||w_i^t||,

and such that

  (1/m) Σ_i φ_t(s_i^{t+1}) = (1/m) Σ_i φ_t(s_i^t + z_i^t) ≥ E[S] − ρ(L_max − L_min) ≥ (1/m) Σ_i φ_{t−1}(s_i^t) − ε/T

by Eq. (24) and our choice of ρ.

6. Computational methods

In this section, we discuss general computational methods for implementing the OS algorithm.

6.1. Unate loss functions

We first note that, for loss functions L with certain monotonicity properties, the quadrant in which the minimizing weight vectors are to be found can be determined a priori. This often simplifies the search for minima. To be more precise, for ζ ∈ {−1, +1}^n and x, y ∈ ℝ^n, let us write x ≤_ζ y if ζ_i x_i ≤ ζ_i y_i for all 1 ≤ i ≤ n. We say that a function f : ℝ^n → ℝ is unate with sign vector ζ ∈ {−1, +1}^n if f(x) ≤ f(y) whenever x ≤_ζ y.

LEMMA 6. If the loss function L is unate with sign vector ζ ∈ {−1, +1}^n, then so is φ_t (as defined above) for t = 0, ..., T.

Proof. By backwards induction on t. The base case is immediate. Let x ≤_ζ y. Then for any z ∈ B and w ∈ ℝ^n, x + z ≤_ζ y + z, and so

  φ_t(x + z) + w · z − δ||w|| ≤ φ_t(y + z) + w · z − δ||w||

by the inductive hypothesis. Therefore, φ_{t−1}(x) ≥ φ_{t−1}(y), and so φ_{t−1} is also unate.

For the main theorem of this subsection, we need one more assumption:

ASSUMPTION 4. If z ∈ B and z' is such that |z'_i| = |z_i| for all i, then z' ∈ B.

THEOREM 7. Under the conditions of Assumptions 1, 2, 3 and 4, if L is unate with sign vector σ ∈ {−1,+1}^n, then for any s ∈ ℝ^n, there is a vector w which minimizes

  sup_{z∈B} ( φ_t(s + z) + w·z − δ||w|| )

and for which σ_i w_i ≥ 0 for all i.

Proof. Let F and H be as in the proof of Lemma 2. By Lemma 6, F is unate. Let w ∈ ℝ^n have some coordinate i for which σ_i w_i < 0, so that w does not already satisfy the conclusion. Let w' be such that w'_j = w_j if j ≠ i, and w'_i = −w_i. We show that H(w') ≤ H(w). Let z ∈ B. If w_i z_i ≥ 0 then w'·z ≤ w·z, and so

  F(z) + w'·z − δ||w'|| ≤ F(z) + w·z − δ||w||.

If w_i z_i < 0 then let z' be defined analogously to w'. By Assumption 4, z' ∈ B. Then z' ≤_σ z, and so F(z) ≤ F(z'). Thus, since w·z' = w'·z and ||w'|| = ||w||,

  F(z) + w'·z − δ||w'|| ≤ F(z') + w·z' − δ||w||.

Hence, H(w') ≤ H(w). Applying this argument repeatedly, we can derive a vector w̃ with σ_i w̃_i ≥ 0 for all i and such that H(w̃) ≤ H(w). This proves the theorem.

Note that the loss functions for all of the games in Section 3 are unate (and also satisfy Assumptions 1–4). The same will be true of all of the games discussed in Section 7. Thus, for all of these games, we can determine a priori the signs of each of the coordinates of the minimizing vectors used by the OS algorithm.

6.2. A general technique using linear programming

In many cases, we can use linear programming to implement OS. In particular, let us assume that we measure weight vectors w using the l_1 norm (i.e., p = 1). Also, let us assume that |B| is finite. Then given φ_t and s, computing

  φ_{t−1}(s) = min_{w∈ℝ^n} max_{z∈B} ( φ_t(s + z) + w·z − δ||w||_1 )

can be rewritten as an optimization problem:

  variables:   w ∈ ℝ^n, b ∈ ℝ
  minimize:    b
  subject to:  ∀z ∈ B:  φ_t(s + z) + w·z − δ||w||_1 ≤ b.

The minimizing value b is the desired value of φ_{t−1}(s). Note that, with respect to the variables w and b, this problem is "almost" a linear program, if not for the norm operator. However, when L is unate with sign vector σ, and when the other conditions of Theorem 7 hold, we can restrict w so that σ_i w_i ≥ 0 for all i. This allows us to write

  ||w||_1 = Σ_{i=1}^n σ_i w_i.

Adding σ_i w_i ≥ 0 as a constraint (or rather, a set of n constraints), we now have derived a linear program with n + 1 variables and |B| + n constraints. This can be solved in polynomial time.

Thus, for instance, this technique can be applied to the multiclass boosting problem discussed in Section 3. In this case, B = {−1,+1}^n. So, for any s, φ_{t−1}(s) can be computed from φ_t in time polynomial in 2^n, which may be reasonable for small n. In addition, φ_t must be computed at each reachable position s in an n-dimensional integer grid of radius t, i.e., for all s ∈ {−t, −t+1, ..., t−1, t}^n. This involves computation of φ_t at (2t+1)^n points, giving an overall running time for the algorithm which is polynomial in (2T+1)^n. Again, this may be reasonable for very small n. It is an open problem to find a way to implement the algorithm more efficiently.

7. Deriving old and new algorithms

In this section, we show how a number of old and new boosting and on-line learning algorithms can be derived and analyzed as instances of the OS algorithm for appropriately chosen drifting games.
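For concreteness, the minimization over w described in Section 6.2 can be sketched in a few lines of Python for a one-dimensional game (so that the l_1 norm is just |w|). This is an illustrative stand-in only: instead of handing the constraints to an LP solver, it searches a fine grid of candidate weights; the function name, grid range and resolution are our own choices, not part of the original algorithm.

```python
def os_step(phi_t, s, B, delta, w_max=2.0, steps=4001):
    """Return (phi_{t-1}(s), minimizing w) for a finite drift set B.

    Evaluates  min_w  max_{z in B} [ phi_t(s + z) + w*z - delta*|w| ]
    by brute force over a grid of candidate weights w.
    """
    best_val, best_w = float("inf"), 0.0
    for k in range(steps):
        w = -w_max + 2 * w_max * k / (steps - 1)
        # inner maximum over the finite set of allowed drifts
        val = max(phi_t(s + z) + w * z - delta * abs(w) for z in B)
        if val < best_val:
            best_val, best_w = val, w
    return best_val, best_w
```

For the binary boosting game of Section 7.1 at s = 0 with δ = 0.2 and terminal potential 1{s ≤ 0}, this search recovers the closed-form value (1−δ)/2 = 0.4 with minimizing weight 1/2.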

7.1. Boost-by-majority and variants

We begin with the drifting game described in Section 3 corresponding to binary boosting with B = {−1,+1}. For this game,

  φ_{t−1}(s) = min_{w≥0} max{ φ_t(s−1) − w − δw,  φ_t(s+1) + w − δw }

where we know from Theorem 7 that only nonnegative values of w need to be considered. It can be argued that the minimum must occur when

  φ_t(s−1) − w − δw = φ_t(s+1) + w − δw,

i.e., when

  w = ( φ_t(s−1) − φ_t(s+1) ) / 2.  (26)

This gives

  φ_{t−1}(s) = ((1+δ)/2) φ_t(s+1) + ((1−δ)/2) φ_t(s−1).

Solving gives

  φ_t(s) = Σ_{0 ≤ k ≤ (T−t−s)/2} C(T−t, k) ((1+δ)/2)^k ((1−δ)/2)^{T−t−k}

(where we follow the convention that C(n, k) = 0 if k < 0 or k > n). Weighting examples using Eq. (26) gives exactly Freund's (1995) boost-by-majority algorithm (the "boosting by resampling" version).

When B = {−1, 0, +1}, a similar but more involved analysis gives

  φ_{t−1}(s) = max{ (1−δ) φ_t(s) + δ φ_t(s+1),  ((1+δ)/2) φ_t(s+1) + ((1−δ)/2) φ_t(s−1) }  (27)

and the corresponding choice of w is φ_t(s) − φ_t(s+1) or (φ_t(s−1) − φ_t(s+1))/2, depending on whether the maximum in Eq. (27) is realized by the first or second quantity. We do not know how to solve the recurrence in Eq. (27) so that the bound φ_0(0) given in Theorem 2 can be put in explicit form. Nevertheless, this bound can easily be evaluated numerically, and the algorithm can certainly be implemented efficiently in its present form.

We have thus far been unable to solve the recurrence for the case that B = [−1,+1], even to a point at which the algorithm can be implemented. However, this case can be approximated by the case in which B = {i/N : i = −N, ..., N} for a moderate value of N. In the latter case, the potential function and associated weights can be

computed numerically. For instance, linear programming can be used as discussed in Section 6.2. Alternatively, it can be shown that Lemma 5 combined with Theorem 7 implies that

  φ_{t−1}(s) = max{ p φ_t(s + z_1) + (1−p) φ_t(s + z_2) : z_1, z_2 ∈ B, p ∈ [0,1], p z_1 + (1−p) z_2 = δ },

which can be evaluated using a simple search over all pairs z_1, z_2 (since B is finite).

Fig. 4 compares the bound φ_0(0) for the drifting games associated with boost-by-majority and variants in which B is {−1,+1}, {−1,0,+1} and [−1,+1] (using the approximation that was just mentioned), as well as AdaBoost (discussed in the next section). These bounds are plotted as a function of the number of rounds T.

Figure 4. A comparison of the bound φ_0(0) for the drifting games associated with AdaBoost (Section 7.2) and boost-by-majority (Sections 3 and 7.1). For AdaBoost, η is set to its optimized value (Section 7.2). For boost-by-majority, the bound is plotted when B is {−1,+1}, {−1,0,+1} and [−1,+1]. (The latter case is approximated by B = {i/100 : i = −100, ..., 100}.) The bound is plotted as a function of the number of rounds T. The drift parameter is fixed to δ = 0.2. (The jagged nature of the B = {−1,+1} curve is due to the fact that games with an even number of rounds, in which ties count as a loss for the shepherd so that L(0) = 1, are harder than games with an odd number of rounds.)
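The two closed-form curves in Fig. 4 can be reproduced directly. The sketch below uses our own function names (and assumes Python 3.8+ for math.comb): phi_bbm computes the binomial-tail potential of the binary boost-by-majority game of Section 7.1, and bound_adaboost computes the β^T bound of the exponential-loss game with η at its optimized value, for which β = sqrt(1 − δ²).

```python
from math import comb, exp, log

def phi_bbm(t, s, T, delta):
    # Potential for the binary boost-by-majority game (B = {-1,+1}):
    # the probability that a chip at position s ends at or below zero after
    # T - t steps, each +1 w.p. (1+delta)/2 and -1 w.p. (1-delta)/2.
    r = T - t
    p, q = (1 + delta) / 2, (1 - delta) / 2
    return sum(comb(r, k) * p**k * q**(r - k) for k in range((r - s) // 2 + 1))

def bound_adaboost(T, delta):
    # phi_0(0) = beta^T with eta = (1/2) ln((1+delta)/(1-delta)),
    # which makes beta = sqrt(1 - delta^2).
    eta = 0.5 * log((1 + delta) / (1 - delta))
    beta = ((1 - delta) * exp(eta) + (1 + delta) * exp(-eta)) / 2
    return beta**T
```

Here phi_bbm(0, 0, T, delta) gives the B = {−1,+1} curve of Fig. 4 and bound_adaboost(T, delta) the AdaBoost curve; the latter equals (1 − δ²)^{T/2}. One can also check numerically that phi_bbm satisfies the recurrence φ_{t−1}(s) = ((1+δ)/2)φ_t(s+1) + ((1−δ)/2)φ_t(s−1).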

7.2. AdaBoost and variants

As mentioned in Section 3, a simplified, non-adaptive version of AdaBoost can be derived as an instance of OS. To do this, we simply replace the loss function (Eq. (7)) in the binary boosting game of Section 3 with an exponential loss function

  L(s) = e^{−ηs}

where η > 0 is a parameter of the game. As a special case of the discussion below, it will follow that

  φ_t(s) = β^{T−t} e^{−ηs}

where β is the constant

  β = ( (1−δ) e^η + (1+δ) e^{−η} ) / 2.

Also, the weight given to a chip at position s on round t is

  β^{T−t−1} ( (e^η − e^{−η}) / 2 ) e^{−ηs},

which is proportional to e^{−ηs} (in other words, the weighting function is effectively unchanged from round to round). This weighting is the same as the one used by a non-adaptive version of AdaBoost in which all weak hypotheses are given equal weight. Since e^{−ηs} is an upper bound on the loss function of Eq. (7), Theorem 2 implies an upper bound on the fraction of mistakes of the final hypothesis of

  φ_0(0) = β^T.

When

  η = (1/2) ln( (1+δ)/(1−δ) )  (28)

so that β is minimized, this gives an upper bound of

  (1−δ²)^{T/2} = (1−4γ²)^{T/2},

which is equivalent to a non-adaptive version of Freund and Schapire's (1997) analysis.

We next consider a more general drifting game in n dimensions whose loss function is a sum of exponentials

  L(s) = Σ_{j=1}^k b_j exp( −η_j u_j·s )  (29)

where the b_j's, η_j's and u_j's are parameters with b_j > 0, η_j > 0, ||u_j||_1 = 1 and σ_i u_{ji} ≥ 0 for all i, for some sign vector σ. For this game, B =

[−1,+1]^n and p = 1. Many (non-adaptive) variants of AdaBoost correspond to special cases of this game. For instance, AdaBoost.M2 (Freund and Schapire, 1997), a multiclass version of AdaBoost, essentially uses the loss function

  L(s) = Σ_{j=2}^n e^{−(η/2)(s_1 − s_j)}

where we follow the multiclass setup of Section 3, so that n is the number of classes and the first component in the drifting game is identified with the correct class. (As before, we only consider a non-adaptive game in which η > 0 is a fixed, tunable parameter.) Likewise, AdaBoost.MH (Schapire and Singer, 1999), another multiclass version of AdaBoost, uses the loss function

  L(s) = e^{−ηs_1} + Σ_{j=2}^n e^{ηs_j}.

Note that both loss functions upper bound the "true" loss for multiclass boosting given in Section 3. Moreover, both functions clearly have the form given in Eq. (29). We claim that, for the general game with loss function as in Eq. (29),

  φ_t(s) = Σ_j b_j β_j^{T−t} exp( −η_j u_j·s )  (30)

where

  β_j = ( (1−δ) e^{η_j} + (1+δ) e^{−η_j} ) / 2.

Proof of Eq. (30) is by backwards induction on t. For fixed t and s, let

  w = Σ_j b_j β_j^{T−t} ( (e^{η_j} − e^{−η_j}) / 2 ) u_j exp( −η_j u_j·s ).

We will show that this is the minimizing weight vector that gets used by OS for a chip at position s at time t. Let

  b'_j = b_j β_j^{T−t} exp( −η_j u_j·s ).

Note that

  φ_t(s + z) + w·z = Σ_j b'_j [ exp( −η_j u_j·z ) + ( (e^{η_j} − e^{−η_j}) / 2 ) u_j·z ]
                   ≤ Σ_j b'_j ( (e^{η_j} + e^{−η_j}) / 2 )  (31)

since

  e^{−ηx} ≤ (e^η + e^{−η})/2 − ( (e^η − e^{−η})/2 ) x

for all η ∈ ℝ and x ∈ [−1,+1], by convexity of e^{−ηx}. Also, by our assumptions on the b_j, u_j and η_j, we can compute

  δ||w||_1 = δ Σ_j b'_j ( (e^{η_j} − e^{−η_j}) / 2 ).  (32)

Thus, combining Eqs. (31) and (32) gives

  φ_{t−1}(s) ≤ sup_{z∈B} ( φ_t(s + z) + w·z − δ||w||_1 ) ≤ Σ_j b'_j β_j = Σ_j b_j β_j^{T−t+1} exp( −η_j u_j·s ).

This gives the needed upper bound on φ_{t−1}(s). For the lower bound, using Theorem 7 (since L is unate with sign vector σ), we have

  φ_{t−1}(s) ≥ min over w with σ_i w_i ≥ 0 of max_{z ∈ {−σ, +σ}} ( φ_t(s + z) + w·z − δ||w||_1 )
            = min_{c ≥ 0} max{ Σ_j b'_j e^{η_j} − c − δc,  Σ_j b'_j e^{−η_j} + c − δc }

where we have used u_j·σ = 1 and w·σ = ||w||_1 (since σ_i u_{ji} ≥ 0 and σ_i w_i ≥ 0). We also have identified c with ||w||_1. Solving the min-max expression gives the desired lower bound. This completes the proof of Eq. (30).

7.3. On-line learning algorithms

In this section, we show how Cesa-Bianchi et al.'s (1996) BW algorithm for combining expert advice can be derived as an instance of OS. We will also see how their algorithm can be generalized, and how Littlestone and Warmuth's (1994) weighted majority algorithm can also be derived and analyzed.

Suppose that we have access to m "experts." On each round t, each expert i provides a prediction ξ^i_t ∈ {−1,+1}. A "master" algorithm combines their predictions into its own prediction ŷ_t ∈ {−1,+1}. An

outcome y_t ∈ {−1,+1} is then observed. The master makes a mistake if ŷ_t ≠ y_t, and similarly expert i makes a mistake if ξ^i_t ≠ y_t. The goal of the master is to minimize the number of mistakes it makes relative to the best expert.

We will consider master algorithms which use a weighted majority vote to form their predictions; that is,

  ŷ_t = sign( Σ_{i=1}^m w^i_t ξ^i_t ).

The problem is to derive a good choice of weights w^i_t. We also assume that the master algorithm is conservative in the sense that rounds on which the master's predictions are correct are effectively ignored (so that the weights w^i_t depend only upon previous rounds on which mistakes were made).

Let us suppose that there is one expert that makes at most k mistakes. We will (re)derive an algorithm (namely, BW) and a bound on the number of mistakes made by the master, given this assumption. Since we restrict our attention to conservative algorithms, we can assume without loss of generality that a mistake occurs on every round, and simply proceed to bound the total number of rounds.

To set up the problem as a drifting game, we identify one chip with each of the experts. The problem is one-dimensional, so n = 1. The weights w^i_t selected by the master are the same as those chosen by the shepherd. Since we assume that the master makes a mistake on each round, we have for all t that

  y_t Σ_i w^i_t ξ^i_t ≤ 0.  (33)

Thus, if we define the drift z^i_t to be −y_t ξ^i_t, then

  Σ_i w^i_t z^i_t ≥ 0.

Setting δ = 0, we see that Eq. (33) is equivalent to Eq. (1). Also, B = {−1,+1}.

Let M^i_t be the number of mistakes made by expert i on rounds 1, ..., t. Then by definition of z^i_t,

  s^i_t = 2M^i_{t−1} − t + 1.

Let the loss function L be

  L(s) = { 1 if s ≤ 2k − T;  0 otherwise }.  (34)

Then L(s^i_{T+1}) = 1 if and only if expert i makes a total of k or fewer mistakes in T rounds. Thus, our assumption that the best expert makes at most k mistakes implies that

  1/m ≤ (1/m) Σ_i L(s^i_{T+1}).  (35)

On the other hand, Theorem 2 implies that

  (1/m) Σ_i L(s^i_{T+1}) ≤ φ_0(0).  (36)

By an analysis similar to the one given in Section 7.1, it can be seen that

  φ_{t−1}(s) = ( φ_t(s+1) + φ_t(s−1) ) / 2.

Solving this recurrence gives

  φ_t(s) = 2^{t−T} Σ_{k'=0}^{⌊k−(t+s)/2⌋} C(T−t, k')  (37)

(where, as before, C(n, k') = 0 if k' < 0 or k' > n). In particular,

  φ_0(0) = 2^{−T} Σ_{k'=0}^{k} C(T, k').

Combining Eqs. (35), (36) and (37) gives

  m ≥ 2^T / Σ_{k'=0}^{k} C(T, k').  (38)

In other words, the number of mistakes T of the master algorithm must satisfy Eq. (38), and so must be at most

  max{ q ∈ ℕ : q ≤ lg m + lg( Σ_{k'=0}^{k} C(q, k') ) },

the same bound given by Cesa-Bianchi et al. (1996). The weighting function obtained is also equivalent to theirs since, by a similar argument to that used in Section 7.1, OS gives

  w^i_t = ( φ_t(s^i_t − 1) − φ_t(s^i_t + 1) ) / 2
        = 2^{t−T−1} C(T−t, k − (t + s^i_t − 1)/2)
        = 2^{t−T−1} C(T−t, k − M^i_{t−1}).
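The mistake bound and the weighting function just derived are easy to evaluate. The following sketch uses our own function names (math.comb requires Python 3.8+) and a simple linear scan for the largest feasible q; the scan cap is an illustrative choice.

```python
from math import comb, log2

def binom_le(n, k):
    # "n choose at most k": sum_{j=0}^{k} C(n, j)
    return sum(comb(n, j) for j in range(k + 1))

def bw_mistake_bound(m, k, cap=1000):
    # Largest q with q <= lg(m) + lg(sum_{j<=k} C(q, j)).
    return max(q for q in range(cap) if q <= log2(m) + log2(binom_le(q, k)))

def bw_weight(t, T, k, mistakes):
    # Weight of an expert (chip) that has made `mistakes` mistakes
    # before round t: proportional to C(T-t, k - mistakes).
    j = k - mistakes
    return comb(T - t, j) / 2 ** (T - t + 1) if j >= 0 else 0.0
```

For example, with a perfect best expert (k = 0) and m = 16 experts, the bound is 4 = lg m, as expected from a halving-style argument.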

Note that this argument can be generalized to the case in which the experts' predictions are not restricted to {−1,+1} but instead may be all of [−1,+1], or a subset of this interval, such as {−1,0,+1}. The performance of each expert is then measured on each round using the absolute loss |ξ^i_t − y_t| rather than whether or not it made a mistake. In this case, as in the analogous extension of boost-by-majority given in Section 3, we only need to replace B by [−1,+1] or {−1,0,+1}. The resulting bound on the number of mistakes of the master is then the largest T for which 1/m ≤ φ_0(0) (note that φ_0(0) depends implicitly on T). The resulting master algorithm simply uses the weights computed by OS for the appropriate drifting game. It is an open problem to determine if this generalized algorithm enjoys strong optimality properties similar to those of BW (Cesa-Bianchi et al., 1996).

Littlestone and Warmuth's (1994) weighted majority algorithm can also be derived as an instance of OS. To do this, we simply replace the loss function L in the game above with

  L(s) = exp( −η (s − 2k + T) )

for some parameter η > 0. This loss function upper bounds the one in Eq. (34). We assume that experts are permitted to output predictions in [−1,+1], so that B = [−1,+1]. From the results of Section 7.2 applied to this drifting game,

  φ_t(s) = β^{T−t} exp( −η (s − 2k + T) )

where

  β = ( e^η + e^{−η} ) / 2.

Therefore, because one expert suffers loss at most k,

  1/m ≤ φ_0(0) = β^T e^{η(2k−T)}.

Equivalently, the number of mistakes T is at most

  (2ηk + ln m) / ln( 2 / (1 + e^{−2η}) ),

exactly the bound given by Littlestone and Warmuth (1994). The algorithm is also the same as theirs, since the weight given to an expert (chip) at position s^i_t at time t is

  w^i_t = β^{T−t−1} ( (e^η − e^{−η}) / 2 ) exp( −η (s^i_t − 2k + T) ) ∝ exp( −2η M^i_{t−1} ).
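The weighted-majority bound and weights above take one line each; this is a sketch under our own naming, with the multiplicative factor written as β_LW = e^{−2η} to match Littlestone and Warmuth's usual parameterization.

```python
from math import exp, log

def wm_mistake_bound(m, k, eta):
    # Mistake bound derived above:
    #   T <= (2*eta*k + ln m) / ln(2 / (1 + e^{-2*eta}))
    return (2 * eta * k + log(m)) / log(2 / (1 + exp(-2 * eta)))

def wm_weight(mistakes, eta):
    # Weight proportional to exp(-2*eta*M) = beta_LW^M with beta_LW = e^{-2*eta},
    # i.e., the weighted majority multiplicative update.
    return exp(-2 * eta * mistakes)
```

Substituting β_LW = e^{−2η} turns the bound into the familiar form (k ln(1/β_LW) + ln m) / ln(2/(1+β_LW)).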

8. Open problems

This paper represents the first work on general drifting games. As such, there are many open problems.

We have presented closed-form solutions of the potential function for just a few special cases. Are there other cases in which such closed-form solutions are possible? In particular, can the boosting games of Section 3 corresponding to B = {−1,0,+1} and B = [−1,+1] be put into closed form? For games in which a closed form is not possible, is there nevertheless a general method of characterizing the loss bound φ_0(0), say, as the number of rounds T gets large?

Side products of our work include new versions of boost-by-majority for the multiclass case, as well as binary cases in which the weak hypotheses have range {−1,0,+1} or [−1,+1]. However, the optimality proof for the drifting game only carries over to the boosting setting if the final hypothesis has the restricted forms given in Eqs. (4) and (10). Are the resulting boosting algorithms also optimal (for instance, in the sense proved by Freund (1995) for boost-by-majority) without these restrictions?

Likewise, can the extensions of the BW algorithm in Section 7.3 be shown to be optimal? Can this algorithm be extended using drifting games to the multiclass case, or to the case in which the master is allowed to output predictions in [−1,+1] (suffering absolute loss)?

The OS algorithm is non-adaptive in the sense that δ must be known ahead of time. To what extent can OS be made adaptive? For instance, can Freund's (1999) recent technique for making boost-by-majority adaptive be carried over to the general drifting-game setting? Similarly, what happens if the number of rounds T is not known in advance?

Finally, are there other interesting drifting games for entirely different learning problems such as regression or density estimation?

Acknowledgments

Many thanks to Yoav Freund for very helpful discussions which led to this research.

References

Blackwell, D.: 1956, 'An analog of the minimax theorem for vector payoffs'. Pacific Journal of Mathematics 6(1), 1–8.
Cesa-Bianchi, N., Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth: 1997, 'How to Use Expert Advice'. Journal of the Association for Computing Machinery 44(3), 427–485.

Cesa-Bianchi, N., Y. Freund, D. P. Helmbold, and M. K. Warmuth: 1996, 'On-line Prediction and Conversion Strategies'. Machine Learning 25, 71–110.
Freund, Y.: 1995, 'Boosting a weak learning algorithm by majority'. Information and Computation 121(2), 256–285.
Freund, Y.: 1999, 'An adaptive version of the boost by majority algorithm'. In: Proceedings of the Twelfth Annual Conference on Computational Learning Theory. pp. 102–113, ACM Press.
Freund, Y. and R. E. Schapire: 1997, 'A decision-theoretic generalization of on-line learning and an application to boosting'. Journal of Computer and System Sciences 55(1), 119–139.
Hoeffding, W.: 1963, 'Probability inequalities for sums of bounded random variables'. Journal of the American Statistical Association 58(301), 13–30.
Littlestone, N. and M. K. Warmuth: 1994, 'The Weighted Majority Algorithm'. Information and Computation 108, 212–261.
Rockafellar, R. T.: 1970, Convex Analysis. Princeton University Press.
Schapire, R. E. and Y. Singer: 1999, 'Improved boosting algorithms using confidence-rated predictions'. Machine Learning 37(3), 297–336.