A decision-theoretic generalization of on-line learning and an application to boosting. AT&T Bell Laboratories, 600 Mountain Avenue


A decision-theoretic generalization of on-line learning and an application to boosting

Yoav Freund    Robert E. Schapire
AT&T Bell Laboratories
600 Mountain Avenue, Room {2B-428, 2A-424}
Murray Hill, NJ
{yoav, schapire}@research.att.com

September 20, 1995

Abstract

In the first part of the paper we consider the problem of dynamically apportioning resources among a set of options in a worst-case on-line framework. The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting. We show that the multiplicative weight-update rule of Littlestone and Warmuth [15] can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems. We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multiple-outcome prediction, repeated games and prediction of points in R^n. In the second part of the paper we apply the multiplicative weight-update technique to derive a new boosting algorithm. This boosting algorithm does not require any prior knowledge about the performance of the weak learning algorithm. We also study generalizations of the new boosting algorithm to the problem of learning functions whose range, rather than being binary, is an arbitrary finite set or a bounded segment of the real line.

1 Introduction

A gambler, frustrated by persistent horse-racing losses and envious of his friends' winnings, decides to allow a group of his fellow gamblers to make bets on his behalf. He decides he will wager a fixed sum of money in every race, but that he will apportion his money among his friends based on how well they are doing. Certainly, if he knew psychically ahead of time which of his friends would win the most, he would naturally have that friend handle all his wagers. Lacking such clairvoyance, however, he attempts to allocate each race's wager in such a way that his total winnings for the season will be reasonably close to what he would have won had he bet everything with the luckiest of his friends.

In this paper, we describe a simple algorithm for solving such dynamic allocation problems, and we show that our solution can be applied to a great assortment of learning problems.

(An extended abstract of this work appeared in the Proceedings of the Second European Conference on Computational Learning Theory, Barcelona, March, 1995. This draft was submitted for journal publication.)

Perhaps the most surprising of these applications is the derivation of a new algorithm for "boosting", i.e., for converting a "weak" PAC learning algorithm that performs just slightly better than random guessing into one with arbitrarily high accuracy.

We formalize our on-line allocation model as follows. The allocation agent A has N options or strategies to choose from; we number these using the integers 1, ..., N. At each time step t = 1, 2, ..., T, the allocator A decides on a distribution p^t over the strategies; that is, p_i^t ≥ 0 is the amount allocated to strategy i, and Σ_{i=1}^N p_i^t = 1. Each strategy i then suffers some loss ℓ_i^t which is determined by the (possibly adversarial) "environment." The loss suffered by A is then Σ_{i=1}^N p_i^t ℓ_i^t = p^t · ℓ^t, i.e., the average loss of the strategies with respect to A's chosen allocation rule. We call this loss function the mixture loss.

In this paper, we always assume that the loss suffered by any strategy is bounded so that, without loss of generality, ℓ_i^t ∈ [0, 1]. Besides this condition, we make no assumptions about the form of the loss vectors ℓ^t, or about the manner in which they are generated; indeed, the adversary's choice for ℓ^t may even depend on the allocator's chosen mixture p^t.

The goal of the algorithm A is to minimize its cumulative loss relative to the loss suffered by the best strategy. That is, A attempts to minimize its net loss

    L_A − min_i L_i

where L_A = Σ_{t=1}^T p^t · ℓ^t is the total cumulative loss suffered by algorithm A on the first T trials, and L_i = Σ_{t=1}^T ℓ_i^t is strategy i's cumulative loss.

In Section 2, we show that Littlestone and Warmuth's [15] "weighted majority" algorithm can be generalized to handle this problem, and we prove a number of bounds on the net loss. For instance, one of our results shows that the net loss of our algorithm can be bounded by O(√(T ln N)) or, put another way, that the average per-trial net loss is decreasing at the rate O(√((ln N)/T)). Thus, as T increases, this difference decreases to zero.

Our results for the on-line allocation model can be applied to a wide variety of learning problems, as we describe in Section 3. In particular, we generalize the results of Littlestone and Warmuth [15] and Cesa-Bianchi et al. [2] for the problem of predicting a binary sequence using the advice of a team of "experts." Whereas these authors proved worst-case bounds for making on-line randomized decisions over a binary decision and outcome space with a {0, 1}-valued discrete loss, we prove (slightly weaker) bounds that are applicable to any bounded loss function over any decision and outcome spaces. Our bounds express explicitly the rate at which the loss of the learning algorithm approaches that of the best expert.

Related generalizations of the expert prediction model were studied by Vovk [19], Kivinen and Warmuth [14], and Haussler, Kivinen and Warmuth [11]. Like us, these authors focused primarily on multiplicative weight-update algorithms. Chung [3] also presented a generalization, giving the problem a game-theoretic treatment.

1.1 Boosting

Returning to the horse-racing story, suppose now that the gambler grows weary of choosing among the experts and instead wishes to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.). To create such a program, he asks his favorite expert to explain his betting strategy. Not surprisingly, the expert is unable to articulate a grand set of rules for selecting a horse. On the other hand, when presented with the data for a specific set of races, the expert has no trouble coming up with a "rule-of-thumb" for that set of races (such as, "Bet on the horse that has recently won the most races" or "Bet on the horse with the most favored odds"). Although such a rule-of-thumb, by itself, is obviously very rough and inaccurate, it is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing. Furthermore, by repeatedly asking the expert's opinion on different collections of races, the gambler is able to extract many rules-of-thumb.

In order to use these rules-of-thumb to maximum advantage, there are two problems faced by the gambler: First, how should he choose the collections of races presented to the expert so as to extract rules-of-thumb from the expert that will be the most useful? Second, once he has collected many rules-of-thumb, how can they be combined into a single, highly accurate prediction rule?

Boosting refers to this general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules-of-thumb. In the second part of the paper, we present and analyze a new boosting algorithm inspired by the methods we used for solving the on-line allocation problem.

Formally, boosting proceeds as follows: The booster is provided with a set of labeled training examples (x_1, y_1), ..., (x_N, y_N), where y_i is the label associated with instance x_i; for instance, in the horse-racing example, x_i might be the observable data associated with a particular horse race, and y_i the outcome (winning horse) of that race. On each round t = 1, ..., T, the booster devises a distribution D_t over the set of examples, and requests (from an unspecified oracle) a weak hypothesis (or rule-of-thumb) h_t with low error ε_t with respect to D_t (that is, ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i]). Thus, distribution D_t specifies the relative importance of each example for the current round. After T rounds, the booster must combine the weak hypotheses into a single prediction rule.

Unlike the previous boosting algorithms of Freund [8, 9] and Schapire [16], the new algorithm needs no prior knowledge of the accuracies of the weak hypotheses. Rather, it adapts to these accuracies and generates a weighted majority hypothesis in which the weight of each weak hypothesis is a function of its accuracy. For binary prediction problems, we prove in Section 4 that the error of this final hypothesis (with respect to the given set of examples) is bounded by exp(−2 Σ_{t=1}^T γ_t²) where ε_t = 1/2 − γ_t is the error of the t-th weak hypothesis. Since a hypothesis that makes entirely random guesses has error 1/2, γ_t measures the accuracy of the t-th weak hypothesis relative to random guessing. Thus, this bound shows that if we can consistently find weak hypotheses that are slightly better than random guessing, then the error of the final hypothesis drops exponentially fast.

Note that the bound on the accuracy of the final hypothesis improves when any of the weak hypotheses is improved. This is in contrast with previous boosting algorithms whose performance bound depended only on the accuracy of the least accurate weak hypothesis. At the same time, if the weak hypotheses all have the same accuracy, the performance of the new algorithm is very close to that achieved by the best of the known boosting algorithms. In Section 5, we give two extensions of our boosting algorithm to multi-class prediction problems in which each example belongs to one of several possible classes (rather than just two). We also give an extension to regression problems in which the goal is to estimate a real-valued function.

Algorithm Hedge(β)
Parameters: β ∈ [0, 1]
            initial weight vector w^1 ∈ [0, 1]^N with Σ_{i=1}^N w_i^1 = 1
            number of trials T
Do for t = 1, 2, ..., T
  1. Choose allocation p^t = w^t / Σ_{i=1}^N w_i^t
  2. Receive loss vector ℓ^t ∈ [0, 1]^N from environment.
  3. Suffer loss p^t · ℓ^t.
  4. Set the new weights vector to be w_i^{t+1} = w_i^t β^{ℓ_i^t}

Figure 1: The on-line allocation algorithm.

2 The on-line allocation algorithm and its analysis

In this section, we present our algorithm, called Hedge(β), for the on-line allocation problem. The algorithm and its analysis are direct generalizations of Littlestone and Warmuth's weighted majority algorithm [15].

The pseudo-code for Hedge(β) is shown in Figure 1. The algorithm maintains a weight vector whose value at time t is denoted w^t = (w_1^t, ..., w_N^t). At all times, all weights will be nonnegative. All of the weights of the initial weight vector w^1 must be nonnegative and sum to one, so that Σ_{i=1}^N w_i^1 = 1. Besides these conditions, the initial weight vector may be arbitrary, and may be viewed as a "prior" over the set of strategies. Since our bounds are strongest for those strategies receiving the greatest initial weight, we will want to choose the initial weights so as to give the most weight to those strategies which we expect are most likely to perform the best. Naturally, if we have no reason to favor any of the strategies, we can set all of the initial weights equally so that w_i^1 = 1/N. Note that the weights on future trials need not sum to one.

Our algorithm allocates among the strategies using the current weight vector, after normalizing. That is, at time t, Hedge(β) chooses the distribution vector

    p^t = w^t / Σ_{i=1}^N w_i^t.     (1)

After the loss vector ℓ^t has been received, the weight vector w^t is updated using the multiplicative rule

    w_i^{t+1} = w_i^t β^{ℓ_i^t}.     (2)
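To make the allocation loop concrete, here is a minimal sketch of Hedge(β) in Python/NumPy. It is an illustration only: the uniform initial weights and the randomly generated loss vectors in the usage lines are assumptions made for the example, not part of the algorithm's specification.

    import numpy as np

    def hedge(loss_vectors, beta, w=None):
        """Run Hedge(beta) on loss_vectors, an array of shape (T, N) with entries in [0, 1].

        Returns the cumulative mixture loss sum_t p^t . l^t."""
        T, N = loss_vectors.shape
        if w is None:
            w = np.full(N, 1.0 / N)      # uniform "prior" over the N strategies
        total_loss = 0.0
        for t in range(T):
            p = w / w.sum()              # step 1: choose allocation p^t
            loss = loss_vectors[t]       # step 2: receive loss vector
            total_loss += p @ loss       # step 3: suffer mixture loss
            w = w * beta ** loss         # step 4: multiplicative weight update
        return total_loss

    # Illustrative use: 5 strategies, 100 trials of synthetic losses.
    rng = np.random.default_rng(0)
    losses = rng.random((100, 5))
    print(hedge(losses, beta=0.8))
    print(losses.sum(axis=0).min())      # cumulative loss of the best strategy, for comparison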

More generally, it can be shown that our analysis is applicable with only minor modification to an alternative update rule of the form

    w_i^{t+1} = w_i^t U_β(ℓ_i^t)

where U_β : [0, 1] → [0, 1] is any function, parameterized by β ∈ [0, 1], satisfying

    β^r ≤ U_β(r) ≤ 1 − (1 − β)r

for all r ∈ [0, 1].

2.1 Analysis

The analysis of Hedge(β) mimics directly that given by Littlestone and Warmuth [15]. The main idea is to derive upper and lower bounds on Σ_{i=1}^N w_i^{T+1} which, together, imply an upper bound on the loss of the algorithm. We begin with an upper bound.

Lemma 1  For any sequence of loss vectors ℓ^1, ..., ℓ^T,

    ln( Σ_{i=1}^N w_i^{T+1} ) ≤ −(1 − β) L_{Hedge(β)}.

Proof: By a convexity argument, it can be shown that

    β^r ≤ 1 − (1 − β)r     (3)

for β ≥ 0 and r ∈ [0, 1]. Combined with Equations (1) and (2), this implies

    Σ_{i=1}^N w_i^{t+1} = Σ_{i=1}^N w_i^t β^{ℓ_i^t} ≤ Σ_{i=1}^N w_i^t (1 − (1 − β)ℓ_i^t) = ( Σ_{i=1}^N w_i^t ) (1 − (1 − β) p^t · ℓ^t).     (4)

Applying this repeatedly for t = 1, ..., T yields

    Σ_{i=1}^N w_i^{T+1} ≤ ∏_{t=1}^T (1 − (1 − β) p^t · ℓ^t) ≤ exp( −(1 − β) Σ_{t=1}^T p^t · ℓ^t )

since 1 + x ≤ e^x for all x. The lemma follows immediately.

Thus,

    L_{Hedge(β)} ≤ −ln( Σ_{i=1}^N w_i^{T+1} ) / (1 − β).     (5)

Note that, from Equation (2),

    w_i^{T+1} = w_i^1 ∏_{t=1}^T β^{ℓ_i^t} = w_i^1 β^{L_i}.     (6)

This is all that is needed to complete our analysis.

Theorem 2  For any sequence of loss vectors ℓ^1, ..., ℓ^T, and for any i ∈ {1, ..., N}, we have

    L_{Hedge(β)} ≤ ( −ln(w_i^1) − L_i ln β ) / (1 − β).     (7)

More generally, for any nonempty set S ⊆ {1, ..., N}, we have

    L_{Hedge(β)} ≤ ( −ln( Σ_{i∈S} w_i^1 ) − (ln β) max_{i∈S} L_i ) / (1 − β).     (8)

Proof: We prove the more general statement (8) since Equation (7) follows in the special case that S = {i}. From Equation (6),

    Σ_{i=1}^N w_i^{T+1} ≥ Σ_{i∈S} w_i^{T+1} = Σ_{i∈S} w_i^1 β^{L_i} ≥ β^{max_{i∈S} L_i} Σ_{i∈S} w_i^1.

The theorem now follows immediately from Equation (5).

The simpler bound (7) states that Hedge(β) does not perform "too much worse" than the best strategy i for the sequence. The difference in loss depends on our choice of β and on the initial weight w_i^1 of each strategy. If each weight is set equally so that w_i^1 = 1/N, then this bound becomes

    L_{Hedge(β)} ≤ ( min_i L_i ln(1/β) + ln N ) / (1 − β).     (9)

Since it depends only logarithmically on N, this bound is reasonable even for a very large number of strategies.

The more complicated bound (8) is a generalization of the simpler bound that is especially applicable when the number of strategies is infinite. Naturally, for uncountable collections of strategies, the sum appearing in Equation (8) can be replaced by an integral, and the maximum by a supremum.

The bound given in Equation (9) can be written as

    L_{Hedge(β)} ≤ c min_i L_i + a ln N,     (10)

where c = ln(1/β)/(1 − β) and a = 1/(1 − β). Vovk [18] analyzes prediction algorithms that have performance bounds of this form, and proves tight upper and lower bounds for the achievable values of c and a. Using Vovk's results, we can show that the constants a and c achieved by Hedge(β) are optimal.

Theorem 3  Let B be an algorithm for the on-line allocation problem with an arbitrary number of strategies. Suppose that there exist positive real numbers a and c such that for any number of strategies N and for any sequence of loss vectors ℓ^1, ..., ℓ^T

    L_B ≤ c min_i L_i + a ln N.

Then for all β ∈ (0, 1), either

    c ≥ ln(1/β)/(1 − β)   or   a ≥ 1/(1 − β).

The proof is given in the appendix.

2.2 How to choose β

So far, we have analyzed Hedge(β) for a given choice of β, and we have proved reasonable bounds for any choice of β. In practice, we will often want to choose β so as to maximally exploit any prior knowledge we may have about the specific problem at hand. The following lemma will be helpful for choosing β using the bounds derived above.

Lemma 4  Suppose 0 ≤ L ≤ L̃ and 0 < R ≤ R̃. Let β = g(L̃/R̃) where g(z) = 1/(1 + √(2/z)). Then

    ( −L ln β + R ) / (1 − β) ≤ L + √(2 L̃ R̃) + R.

Proof: (Sketch) It can be shown that −ln β ≤ (1 − β²)/(2β) for β ∈ (0, 1]. Applying this approximation and the given choice of β yields the result.

Lemma 4 can be applied to any of the bounds above since all of these bounds have the form given in the lemma. For example, suppose we have N strategies, and we also know a prior bound L̃ on the loss of the best strategy. Then, combining Equation (9) and Lemma 4, we have

    L_{Hedge(β)} ≤ min_i L_i + √(2 L̃ ln N) + ln N     (11)

for β = g(L̃ / ln N). In general, if we know ahead of time the number of trials T, then we can use L̃ = T as an upper bound on the cumulative loss of each strategy i.

Dividing both sides of Equation (11) by T, we obtain an explicit bound on the rate at which the average per-trial loss of Hedge(β) approaches the average loss for the best strategy:

    L_{Hedge(β)}/T ≤ min_i L_i/T + √(2 L̃ ln N)/T + (ln N)/T.     (12)

Since L̃ ≤ T, this gives a worst case rate of convergence of O(√((ln N)/T)). However, if L̃ is close to zero, then the rate of convergence will be much faster, roughly, O((ln N)/T).

Lemma 4 can also be applied to the other bounds given in Theorem 2 to obtain analogous results.
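As a small numerical illustration of this tuning rule, the sketch below computes β = g(L̃/ln N) and the right-hand side of Equation (11); the particular values of N, T and L̃ are assumptions invented for the example.

    import math

    def g(z):
        # Lemma 4: beta = g(L_tilde / R_tilde) with g(z) = 1 / (1 + sqrt(2/z))
        return 1.0 / (1.0 + math.sqrt(2.0 / z))

    def hedge_bound(L_tilde, N):
        # Right-hand side of Equation (11), with min_i L_i replaced by its prior bound L_tilde
        return L_tilde + math.sqrt(2.0 * L_tilde * math.log(N)) + math.log(N)

    N, T = 10, 1000
    L_tilde = T                          # trivial prior bound on the best strategy's loss
    beta = g(L_tilde / math.log(N))
    print(beta, hedge_bound(L_tilde, N))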

3 Applications

The framework described up to this point is quite general and can be applied in a wide variety of learning problems. Consider the following set-up used by Chung [3]. We are given a decision space Δ, a space of outcomes Ω, and a bounded loss function λ : Δ × Ω → [0, 1]. (Actually, our results require only that λ be bounded, but, by rescaling, we can assume that its range is [0, 1].) At every time step t, the learning algorithm selects a decision δ_t ∈ Δ, receives an outcome ω_t ∈ Ω, and suffers loss λ(δ_t, ω_t). More generally, we may allow the learner to select a distribution D_t over the space of decisions, in which case it suffers the expected loss of a decision randomly selected according to D_t; that is, its expected loss is Λ(D_t, ω_t) where

    Λ(D, ω) = E_{δ∼D}[λ(δ, ω)].

To decide on distribution D_t, we assume that the learner has access to a set of N experts. At every time step t, expert i produces its own distribution E_i^t on Δ, and suffers loss Λ(E_i^t, ω_t).

The goal of the learner is to combine the distributions produced by the experts so as to suffer expected loss "not much worse" than that of the best expert.

The results of Section 2 provide a method for solving this problem. Specifically, we run algorithm Hedge(β), treating each expert as a strategy. At every time step t, Hedge(β) produces a distribution p^t on the set of experts which is used to construct the mixture distribution

    D_t = Σ_{i=1}^N p_i^t E_i^t.

For any outcome ω_t, the loss suffered by Hedge(β) will then be

    Λ(D_t, ω_t) = Σ_{i=1}^N p_i^t Λ(E_i^t, ω_t).

Thus, if we define ℓ_i^t = Λ(E_i^t, ω_t) then the loss suffered by the learner is p^t · ℓ^t, i.e., exactly the mixture loss that was analyzed in Section 2. Hence, the bounds of Section 2 can be applied to our current framework. For instance, applying Equation (11), we obtain the following:

Theorem 5  For any loss function λ, for any set of experts, and for any sequence of outcomes, the expected loss of Hedge(β) if used as described above is at most

    Σ_{t=1}^T Λ(D_t, ω_t) ≤ min_i Σ_{t=1}^T Λ(E_i^t, ω_t) + √(2 L̃ ln N) + ln N

where L̃ ≤ T is an assumed bound on the expected loss of the best expert, and β = g(L̃ / ln N).

Example 1. In the k-ary prediction problem, Δ = Ω = {1, 2, ..., k}, and λ(δ, ω) is 1 if δ ≠ ω and 0 otherwise. In other words, the problem is to predict a sequence of letters over an alphabet of size k. The loss function λ is 1 if a mistake was made, and 0 otherwise. Thus, Λ(D, ω) is the probability (with respect to D) of a prediction that disagrees with ω. The cumulative loss of the learner, or of any expert, is therefore the expected number of mistakes on the entire sequence. So, in this case, Theorem 2 states that the expected number of mistakes of the learning algorithm will exceed the expected number of mistakes of the best expert by at most O(√(T ln N)), or possibly much less if the loss of the best expert can be bounded ahead of time.

Bounds of this type were previously proved in the binary case (k = 2) by Littlestone and Warmuth [15] using the same algorithm. Their algorithm was later improved by Vovk [19] and Cesa-Bianchi et al. [2]. The main result of this section is a proof that such bounds can be shown to hold for any bounded loss function.

Example 2. The loss function λ may represent an arbitrary matrix game, such as "rock, paper, scissors." Here, Δ = Ω = {R, P, S}, and the loss function is defined by the matrix:

                        ω
                  R     P     S
            R    1/2    1     0
      δ     P     0    1/2    1
            S     1     0    1/2

The decision δ represents the learner's play, and the outcome ω is the adversary's play; then λ(δ, ω), the learner's loss, is 1 if the learner loses the round, 0 if it wins the round, and 1/2 if the round is tied. (For instance, λ(S, P) = 0 since "scissors cut paper.") So the cumulative loss of the learner (or an expert) is the expected number of losses in a series of rounds of game play (counting ties as half a loss). Our results show then that, in repeated play, the expected number of rounds lost by our algorithm will converge quickly to the expected number that would have been lost by the best of the experts (for the particular sequence of moves that were actually played by the adversary).

Example 3. Suppose that Δ and Ω are finite, and that λ represents a game matrix as in the last example. Suppose further that we create one expert for each decision δ ∈ Δ (i.e., an expert that always recommends playing δ). In this case, Theorem 2 implies that the learner's average per-round loss on a sequence of repeated plays of the game will converge, at worst, to the value of the game, i.e., to the loss that would have been suffered had the learner used the minimax "optimal" strategy for the game. Moreover, this holds true even if the learner knows nothing at all about the game that is being played (so that λ is unknown to the learner), and even if the adversarial opponent has complete knowledge both of the game that is being played and the algorithm that is being used by the learner. (See the related work of Hannan [10].)

Example 4. Suppose that Δ = Ω is the unit ball in R^n, and that λ(δ, ω) = ||δ − ω||. Thus, the problem here is to predict the location of a point ω, and the loss suffered is the Euclidean distance between the predicted point δ and the actual outcome ω. Theorem 2 can be applied if probabilistic predictions are allowed. However, in this setting it is more natural to require that the learner and each expert predict a single point (rather than a measure on the space of possible points). Essentially, this is the problem of "tracking" a sequence of points ω_1, ..., ω_T where the loss function measures the distance to the predicted point.

To see how to handle the problem of finding deterministic predictions, notice that the loss function λ(δ, ω) is convex with respect to δ:

    ||(a δ_1 + (1 − a) δ_2) − ω|| ≤ a ||δ_1 − ω|| + (1 − a) ||δ_2 − ω||     (13)

for any a ∈ [0, 1] and any ω ∈ Ω. Thus we can do as follows. At time t, the learner predicts with the weighted average of the experts' predictions: δ_t = Σ_{i=1}^N p_i^t ξ_i^t where ξ_i^t ∈ R^n is the prediction of the i-th expert at time t. Regardless of the outcome ω_t, Equation (13) implies that

    ||δ_t − ω_t|| ≤ Σ_{i=1}^N p_i^t ||ξ_i^t − ω_t||.

Since Theorem 2 provides an upper bound on the right hand side of this inequality, we also obtain upper bounds for the left hand side. Thus, our results in this case give explicit bounds on the total error (i.e., distance between predicted and observed points) for the learner relative to the best of a team of experts.

In the one-dimensional case (n = 1), this case was previously analyzed by Littlestone and Warmuth [15], and later improved upon by Kivinen and Warmuth [14]. This result depends only on the convexity and the bounded range of the loss function λ(δ, ω) with respect to δ. Thus, it can also be applied, for example, to the squared-distance loss function λ(δ, ω) = ||δ − ω||², as well as the log loss function λ(δ, ω) = −ln(δ · ω) used by Cover [4] for the design of "universal" investment portfolios. (In this last case, Δ is the set of probability vectors on n points, and Ω = [1/B, B]^n for some constant B > 1.)
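The deterministic tracking scheme of Example 4 is easy to state in code. The sketch below is a rough illustration rather than anything from the paper: it runs the Hedge(β) update on the experts' distances (rescaled to [0, 1] by the diameter of the unit ball, an assumption of this sketch) and predicts with the weighted average of their points.

    import numpy as np

    def track_points(expert_preds, outcomes, beta=0.9):
        """expert_preds: array (T, N, n) of expert predictions; outcomes: array (T, n).

        Returns the total Euclidean loss of the weighted-average predictions."""
        T, N, _ = expert_preds.shape
        w = np.full(N, 1.0 / N)
        total = 0.0
        for t in range(T):
            p = w / w.sum()
            prediction = p @ expert_preds[t]          # weighted average, justified by Equation (13)
            total += np.linalg.norm(prediction - outcomes[t])
            dists = np.linalg.norm(expert_preds[t] - outcomes[t], axis=1)
            w = w * beta ** (dists / 2.0)             # Hedge update; losses rescaled to [0, 1]
        return total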

In many of the cases listed above, superior algorithms or analyses are known. Although weaker in specific cases, it should be emphasized that our results are far more general, and can be applied in settings that exhibit considerably less structure, such as the horse-racing example described in the introduction.

4 Boosting

In this section we show how the algorithm presented in Section 2 for the on-line allocation problem can be modified to boost the performance of weak learning algorithms.

We very briefly review the PAC learning model (see, for instance, Kearns and Vazirani [13] for a more detailed description). Let X be a set called the domain. A concept is a Boolean function c : X → {0, 1}. A concept class C is a collection of concepts. The learner has access to an oracle which provides labeled examples of the form (x, c(x)) where x is chosen randomly according to some fixed but unknown and arbitrary distribution D on the domain X, and c ∈ C is the target concept. After some amount of time, the learner must output a hypothesis h : X → [0, 1]. The value h(x) can be interpreted as a randomized prediction of the label of x that is 1 with probability h(x) and 0 with probability 1 − h(x). (Although we assume here that we have direct access to the bias of this prediction, our results can be extended to the case that h is instead a random mapping into {0, 1}.) The error of the hypothesis h is the expected value E_{x∼D}(|h(x) − c(x)|) where x is chosen according to D. If h(x) is interpreted as a stochastic prediction, then this is simply the probability of an incorrect prediction.

A strong PAC-learning algorithm is an algorithm that, given ε, δ > 0 and access to random examples, outputs with probability 1 − δ a hypothesis with error at most ε. Further, the running time must be polynomial in 1/ε, 1/δ and other relevant parameters (namely, the "size" of the examples received, and the "size" or "complexity" of the target concept). A weak PAC-learning algorithm satisfies the same conditions but only for ε ≥ 1/2 − γ where γ > 0 is either a constant, or decreases as 1/p where p is a polynomial in the relevant parameters. We use WeakLearn to denote a generic weak learning algorithm.

Schapire [16] showed that any weak learning algorithm can be efficiently transformed or "boosted" into a strong learning algorithm. Later, Freund [8, 9] presented the "boost-by-majority" algorithm that is considerably more efficient than Schapire's. Both algorithms work by calling a given weak learning algorithm WeakLearn multiple times, each time presenting it with a different distribution over the domain X, and finally combining all of the generated hypotheses into a single hypothesis. The intuitive idea is to alter the distribution over the domain X in a way that increases the probability of the "harder" parts of the space, thus forcing the weak learner to generate new hypotheses that make fewer mistakes on these parts.

An important, practical deficiency of the boost-by-majority algorithm is the requirement that the bias γ of the weak learning algorithm WeakLearn be known ahead of time. Not only is this worst-case bias usually unknown in practice, but the bias that can be achieved by WeakLearn will typically vary considerably from one distribution to the next. Unfortunately, the boost-by-majority algorithm cannot take advantage of hypotheses computed by WeakLearn with error significantly smaller than the presumed worst-case bias of 1/2 − γ.

In this section, we present a new boosting algorithm which was derived from the on-line allocation algorithm of Section 2. This new algorithm is very nearly as efficient as boost-by-majority. However, unlike boost-by-majority, the accuracy of the final hypothesis produced by the new algorithm depends on the accuracy of all the hypotheses returned by WeakLearn, and so is able to more fully exploit the power of the weak learning algorithm.

Also, this new algorithm gives a clean method for handling real-valued hypotheses which often are produced by neural networks and other learning algorithms.

4.1 The new boosting algorithm

Although boosting has its roots in the PAC model, for the remainder of the paper, we adopt a more general learning framework in which the learner receives examples (x_i, y_i) chosen randomly according to some fixed but unknown distribution P on X × Y, where Y is a set of possible labels. As usual, the goal is to learn to predict the label y given an instance x.

We start by describing our new boosting algorithm in the simplest case that the label set Y consists of just two possible labels, Y = {0, 1}. In later sections, we give extensions of the algorithm for more general label sets.

Freund [9] describes two frameworks in which boosting can be applied: boosting by filtering and boosting by sampling. In this paper, we use the boosting by sampling framework, which is the natural framework for analyzing "batch" learning, i.e., learning using a fixed training set which is stored in the computer's memory.

We assume that a sequence of N training examples (labeled instances) (x_1, y_1), ..., (x_N, y_N) is drawn randomly from X × Y according to distribution P. We use boosting to find a hypothesis h_f which is consistent with most of the sample (i.e., h_f(x_i) = y_i for most 1 ≤ i ≤ N). In general, a hypothesis which is accurate on the training set might not be accurate on examples outside the training set; this problem is sometimes referred to as "over-fitting." Often, however, over-fitting can be avoided by restricting the hypothesis to be simple. We will come back to this problem in Section 4.3.

The new boosting algorithm is described in Figure 2. The goal of the algorithm is to find a final hypothesis with low error relative to a given distribution D over the training examples. Unlike the distribution P which is over X × Y and is set by "nature," the distribution D is only over the instances in the training set and is controlled by the learner. Ordinarily, this distribution will be set to be uniform so that D(i) = 1/N. The algorithm maintains a set of weights w^t over the training examples. On iteration t a distribution p^t is computed by normalizing these weights. This distribution is fed to the weak learner WeakLearn which generates a hypothesis h_t that (we hope) has small error with respect to the distribution.¹ Using the new hypothesis h_t, the boosting algorithm generates the next weight vector w^{t+1}, and the process repeats. After T such iterations, the final hypothesis h_f is output. The hypothesis h_f combines the outputs of the T weak hypotheses using a weighted majority vote.

We call the algorithm AdaBoost because, unlike previous algorithms, it adjusts adaptively to the errors of the weak hypotheses returned by WeakLearn. If WeakLearn is a PAC weak learning algorithm in the sense defined above, then ε_t ≤ 1/2 − γ for all t (assuming the examples have been generated appropriately with y_i = c(x_i) for some c ∈ C). However, such a bound on the error need not be known ahead of time. Our results hold for any ε_t ∈ [0, 1], and depend only on the performance of the weak learner on those distributions that are actually generated during the boosting process.

The parameter β_t is chosen as a function of ε_t and is used for updating the weight vector. The update rule reduces the probability assigned to those examples on which the hypothesis makes a good prediction and increases the probability of the examples on which the prediction is poor.²

¹ Some learning algorithms can be generalized to use a given distribution directly. For instance, gradient based algorithms can use the probability associated with each example to scale the update step size which is based on the example. If the algorithm cannot be generalized in this way, the training sample can be re-sampled to generate a new set of training examples that is distributed according to the given distribution. The computation required to generate each re-sampled example takes O(log N) time.

Algorithm AdaBoost
Input: sequence of N labeled examples (x_1, y_1), ..., (x_N, y_N)
       distribution D over the N examples
       weak learning algorithm WeakLearn
       integer T specifying number of iterations
Initialize the weight vector: w_i^1 = D(i) for i = 1, ..., N.
Do for t = 1, 2, ..., T
  1. Set p^t = w^t / Σ_{i=1}^N w_i^t
  2. Call WeakLearn, providing it with the distribution p^t; get back a hypothesis h_t : X → [0, 1].
  3. Calculate the error of h_t: ε_t = Σ_{i=1}^N p_i^t |h_t(x_i) − y_i|.
  4. Set β_t = ε_t / (1 − ε_t).
  5. Set the new weights vector to be
         w_i^{t+1} = w_i^t β_t^{1 − |h_t(x_i) − y_i|}
Output the hypothesis
    h_f(x) = 1 if Σ_{t=1}^T (log 1/β_t) h_t(x) ≥ (1/2) Σ_{t=1}^T log(1/β_t)
             0 otherwise.

Figure 2: The adaptive boosting algorithm.
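The following is a minimal sketch of the AdaBoost loop of Figure 2 in Python/NumPy, assuming labels in {0, 1} and weak hypotheses with outputs in [0, 1]. The decision-stump weak learner is a hypothetical stand-in for WeakLearn, and the sketch assumes every weighted error falls strictly between 0 and 1/2.

    import numpy as np

    def adaboost(X, y, weak_learn, T, D=None):
        """X: (N, d) instances; y: (N,) labels in {0, 1};
        weak_learn(X, y, p) returns a callable h with h(X) in [0, 1]^N."""
        N = len(y)
        w = np.full(N, 1.0 / N) if D is None else np.array(D, dtype=float)  # w^1_i = D(i)
        hypotheses, betas = [], []
        for t in range(T):
            p = w / w.sum()                                # step 1: normalize weights
            h = weak_learn(X, y, p)                        # step 2: call WeakLearn
            err = np.dot(p, np.abs(h(X) - y))              # step 3: weighted error
            beta = err / (1.0 - err)                       # step 4
            w = w * beta ** (1.0 - np.abs(h(X) - y))       # step 5: reweight examples
            hypotheses.append(h)
            betas.append(beta)
        alphas = np.log(1.0 / np.array(betas))
        def h_final(Xq):                                   # weighted majority vote
            votes = sum(a * h(Xq) for a, h in zip(alphas, hypotheses))
            return (votes >= 0.5 * alphas.sum()).astype(int)
        return h_final

    def stump_learner(X, y, p):
        """Hypothetical weak learner: best threshold and sign on the first feature."""
        best = None
        for thresh in np.unique(X[:, 0]):
            for sign in (1.0, -1.0):
                pred = (sign * (X[:, 0] - thresh) >= 0).astype(float)
                err = np.dot(p, np.abs(pred - y))
                if best is None or err < best[0]:
                    best = (err, thresh, sign)
        _, thresh, sign = best
        return lambda Xq: (sign * (Xq[:, 0] - thresh) >= 0).astype(float)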

Note that AdaBoost, unlike boost-by-majority, combines the weak hypotheses by summing their probabilistic predictions. Drucker, Schapire and Simard [7], in experiments they performed using boosting to improve the performance of a real-valued neural network, observed that summing the outcomes of the networks and then selecting the best prediction performs better than selecting the best prediction of each network and then combining them with a majority rule. It is interesting that the new boosting algorithm's final hypothesis uses the same combination rule that was observed to be better in practice, but which previously lacked theoretical justification. Experiments are needed to measure whether the new algorithm has an advantage in real world applications.

4.2 Analysis

Comparing Figures 1 and 2, there is an obvious similarity between the algorithms Hedge(β) and AdaBoost. This similarity reflects a surprising "dual" relationship between the on-line allocation model and the problem of boosting. Put another way, there is a direct mapping or reduction of the boosting problem to the on-line allocation problem. In such a reduction, one might naturally expect a correspondence relating the strategies to the weak hypotheses and the trials (and associated loss vectors) to the examples in the training set. However, the reduction we have used is reversed: the "strategies" correspond to the examples, and the trials are associated with the weak hypotheses. Another reversal is in the definition of the loss: in Hedge(β) the loss ℓ_i^t is small if the i-th strategy suggests a good action on the t-th trial while in AdaBoost the "loss" ℓ_i^t = 1 − |h_t(x_i) − y_i| appearing in the weight-update rule (Step 5) is small if the t-th hypothesis suggests a bad prediction on the i-th example. The reason is that in Hedge(β) the weight associated with a strategy is increased if the strategy is successful while in AdaBoost the weight associated with an example is increased if the example is "hard."

The main technical difference between the two algorithms is that in AdaBoost the parameter β is no longer fixed ahead of time but rather changes at each iteration according to ε_t. If we are given ahead of time the information that ε_t ≤ 1/2 − γ for some γ > 0 and for all t = 1, ..., T, then we could instead directly apply algorithm Hedge(β) and its analysis as follows: Fix β to be 1 − γ, and set ℓ_i^t = 1 − |h_t(x_i) − y_i|, and h_f as in AdaBoost, but with equal weight assigned to all T hypotheses. Then p^t · ℓ^t is exactly the accuracy of h_t on distribution p^t, which, by assumption, is at least 1/2 + γ. Also, letting S = {i : h_f(x_i) ≠ y_i}, it is straightforward to show that if i ∈ S then

    L_i/T = (1/T) Σ_{t=1}^T ℓ_i^t = 1 − (1/T) Σ_{t=1}^T |y_i − h_t(x_i)| ≤ 1 − |y_i − (1/T) Σ_{t=1}^T h_t(x_i)| ≤ 1/2

by h_f's definition, and since y_i ∈ {0, 1}. Thus, by Theorem 2,

    T(1/2 + γ) ≤ Σ_{t=1}^T p^t · ℓ^t ≤ ( −ln( Σ_{i∈S} D(i) ) + (γ + γ²)(T/2) ) / γ

since −ln(β) = −ln(1 − γ) ≤ γ + γ² for γ ∈ [0, 1/2]. This implies that the error ε = Σ_{i∈S} D(i) of h_f is at most e^{−Tγ²/2}.

² Furthermore, if h_t is Boolean (with range {0, 1}), then it can be shown that this update rule exactly removes the advantage of the last hypothesis. That is, the error of h_t on distribution p^{t+1} is exactly 1/2.

The boosting algorithm AdaBoost has two advantages over this direct application of Hedge(β). First, by giving a more refined analysis and choice of β, we obtain a significantly superior bound on the error ε. Second, the algorithm does not require prior knowledge of the accuracy of the hypotheses that WeakLearn will generate. Instead, it measures the accuracy of h_t at each iteration and sets its parameters accordingly. The update factor β_t decreases with ε_t which causes the difference between the distributions p^t and p^{t+1} to increase. Decreasing β_t also increases the weight ln(1/β_t) which is associated with h_t in the final hypothesis. This makes intuitive sense: more accurate hypotheses cause larger changes in the generated distributions and have more influence on the outcome of the final hypothesis.

We now give our analysis of the performance of AdaBoost. Note that this theorem applies also if, for some hypotheses, ε_t ≥ 1/2.

Theorem 6  Suppose the weak learning algorithm WeakLearn, when called by AdaBoost, generates hypotheses with errors ε_1, ..., ε_T (as defined in Step 3 of Figure 2). Then the error ε = Pr_{i∼D}[h_f(x_i) ≠ y_i] of the final hypothesis h_f output by AdaBoost is bounded above by

    ε ≤ 2^T ∏_{t=1}^T √(ε_t (1 − ε_t)).     (14)

Proof: We adapt the main arguments from Lemma 1 and Theorem 2. We use p^t and w^t as they are defined in Figure 2.

Similar to Equation (4), the update rule given in Step 5 in Figure 2 implies that

    Σ_{i=1}^N w_i^{t+1} = Σ_{i=1}^N w_i^t β_t^{1−|h_t(x_i)−y_i|} ≤ Σ_{i=1}^N w_i^t (1 − (1 − β_t)(1 − |h_t(x_i) − y_i|)) = ( Σ_{i=1}^N w_i^t ) (1 − (1 − β_t)(1 − ε_t)).     (15)

Combining this inequality over t = 1, ..., T, we get that

    Σ_{i=1}^N w_i^{T+1} ≤ ∏_{t=1}^T (1 − (1 − β_t)(1 − ε_t)).     (16)

The final hypothesis h_f, as defined in Figure 2, makes a mistake on instance i only if

    ∏_{t=1}^T β_t^{−|h_t(x_i)−y_i|} ≥ ( ∏_{t=1}^T β_t )^{−1/2}     (17)

(since y_i ∈ {0, 1}). The final weight of any instance i is

    w_i^{T+1} = D(i) ∏_{t=1}^T β_t^{1−|h_t(x_i)−y_i|}.     (18)

Combining Equations (17) and (18) we can lower bound the sum of the final weights by the sum of the final weights of the examples on which h_f is incorrect:

    Σ_{i=1}^N w_i^{T+1} ≥ Σ_{i : h_f(x_i)≠y_i} w_i^{T+1} ≥ ( Σ_{i : h_f(x_i)≠y_i} D(i) ) ( ∏_{t=1}^T β_t )^{1/2} = ε ( ∏_{t=1}^T β_t )^{1/2}     (19)

where ε is the error of h_f. Combining Equations (16) and (19), we get that

    ε ≤ ∏_{t=1}^T (1 − (1 − β_t)(1 − ε_t)) / √(β_t).     (20)

As all the factors in the product are positive, we can minimize the right hand side by minimizing each factor separately. Setting the derivative of the t-th factor to zero, we find that the choice of β_t which minimizes the right hand side is β_t = ε_t/(1 − ε_t). Plugging this choice of β_t into Equation (20) we get Equation (14), completing the proof.

The bound on the error ε given in Theorem 6 can also be written in the form

    ε ≤ ∏_{t=1}^T √(1 − 4γ_t²) = exp( −Σ_{t=1}^T KL(1/2 || 1/2 − γ_t) ) ≤ exp( −2 Σ_{t=1}^T γ_t² )     (21)

where KL(a || b) = a ln(a/b) + (1 − a) ln((1 − a)/(1 − b)) is the Kullback-Leibler divergence, and where ε_t has been replaced by 1/2 − γ_t. In the case where the errors of all the hypotheses are equal to 1/2 − γ, Equation (21) simplifies to

    ε ≤ (1 − 4γ²)^{T/2} = exp( −T · KL(1/2 || 1/2 − γ) ) ≤ exp( −2Tγ² ).     (22)

This is a form of the Chernoff bound for the probability that less than T/2 coin flips turn out "heads" in T tosses of a random coin whose probability for "heads" is 1/2 − γ. This bound has the same asymptotic behavior as the bound given for the boost-by-majority algorithm [9]. From Equation (22) we get that the number of iterations of the boosting algorithm that is sufficient to achieve error ε of h_f is

    T = ⌈ (1/KL(1/2 || 1/2 − γ)) ln(1/ε) ⌉ ≤ ⌈ (1/(2γ²)) ln(1/ε) ⌉.     (23)

Note, however, that when the errors of the hypotheses generated by WeakLearn are not uniform, Theorem 6 implies that the final error depends on the error of all of the weak hypotheses. Previous bounds on the errors of boosting algorithms depended only on the maximal error of the weakest hypothesis and ignored the advantage that can be gained from the hypotheses whose errors are smaller. This advantage seems to be very relevant to practical applications of boosting, because there one expects the error of the learning algorithm to increase as the distributions fed to WeakLearn shift more and more away from the target distribution.
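As a quick numerical check of these bounds, the sketch below evaluates the product form of Equation (14) and the two relaxations in Equation (21) for a list of weak-hypothesis errors; the sample errors are invented for illustration.

    import math

    def adaboost_error_bounds(errors):
        """Return (Eq. 14 product form, KL form, Chernoff-style form) of the training-error bound."""
        prod_form = (2 ** len(errors)) * math.prod(math.sqrt(e * (1 - e)) for e in errors)
        def kl(a, b):
            return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
        gammas = [0.5 - e for e in errors]
        kl_form = math.exp(-sum(kl(0.5, 0.5 - g) for g in gammas))
        chernoff_form = math.exp(-2.0 * sum(g * g for g in gammas))
        return prod_form, kl_form, chernoff_form

    # The first two values coincide; the third is the weaker exp(-2 sum gamma_t^2) bound.
    print(adaboost_error_bounds([0.3, 0.35, 0.4, 0.25]))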

4.3 Generalization error

We now come back to discussing the error of the final hypothesis outside the training set. Theorem 6 guarantees that the error of h_f on the sample is small; however, the quantity that interests us is the generalization error of h_f, which is the error of h_f over the whole instance space X; that is, ε_g = Pr_{(x,y)∼P}[h_f(x) ≠ y]. In order to make ε_g close to the empirical error ε̂ on the training set, we have to restrict the choice of h_f in some way. One natural way of doing this in the context of boosting is to restrict the weak learner to choose its hypotheses from some simple class of functions and restrict T, the number of weak hypotheses that are combined to make h_f. The choice of the class of weak hypotheses is specific to the learning problem at hand and should reflect our knowledge about the properties of the unknown concept. As for the choice of T, various general methods can be devised. One popular method is to use an upper bound on the VC-dimension of the concept class. This method is sometimes called "structural risk minimization." See Vapnik's book [17] for an extensive discussion of the theory of structural risk minimization. For our purposes, we quote Vapnik's Theorem 6.7:

Theorem 7 (Vapnik)  Let H be a class of binary functions over some domain X. Let d be the VC-dimension of H. Let P be a distribution over the pairs X × {0, 1}. For h ∈ H, define the (generalization) error of h with respect to P to be

    ε_g(h) := Pr_{(x,y)∼P}[h(x) ≠ y].

Let S = {(x_1, y_1), ..., (x_N, y_N)} be a sample (training set) of N independent random examples drawn from X × {0, 1} according to P. Define the empirical error of h with respect to the sample S to be

    ε̂(h) := |{i : h(x_i) ≠ y_i}| / N.

Then, for any δ > 0 we have that

    Pr[ ∃ h ∈ H : |ε̂(h) − ε_g(h)| > 2 √( (d (ln(2N/d) + 1) + ln(9/δ)) / N ) ] ≤ δ

where the probability is computed with respect to the random choice of the sample S.

Let Θ : R → {0, 1} be defined by

    Θ(x) = 1 if x ≥ 0, and 0 otherwise

and, for any class H of functions, let Θ_T(H) be the class of all functions defined as a linear threshold of T functions in H:

    Θ_T(H) = { Θ( Σ_{t=1}^T a_t h_t − b ) : b, a_1, ..., a_T ∈ R; h_1, ..., h_T ∈ H }.

Clearly, if all hypotheses generated by WeakLearn belong to some class H, then the final hypothesis of AdaBoost, after T rounds of boosting, belongs to Θ_T(H). Thus, the next theorem provides an upper bound on the VC-dimension of the class of final hypotheses generated by AdaBoost in terms of the weak hypothesis class.

Theorem 8  Let H be a class of binary functions of VC-dimension d ≥ 2. Then the VC-dimension of Θ_T(H) is at most 2(d + 1)(T + 1) log₂(e(T + 1)).

Therefore, if the hypotheses generated by WeakLearn are chosen from a class of VC-dimension d ≥ 2, then the final hypotheses generated by AdaBoost after T iterations belong to a class of VC-dimension at most 2(d + 1)(T + 1) log₂[e(T + 1)].

Proof: We use a result about the VC-dimension of computation networks proved by Baum and Haussler [1]. We can view the final hypothesis output by AdaBoost as a function that is computed by a two-layer feed-forward network where the computation units of the first layer are the weak hypotheses and the computation unit of the second layer is the linear threshold function which combines the weak hypotheses. The VC-dimension of the set of linear threshold functions over R^T is T + 1 [20].

Thus the sum over all computation units of the VC-dimensions of the classes of functions associated with each unit is Td + (T + 1) < (T + 1)(d + 1). Baum and Haussler's Theorem 1 [1] implies that the number of different functions that can be realized by h ∈ Θ_T(H) when the domain is restricted to a set of size m is at most ((T + 1)em / ((T + 1)(d + 1)))^{(T+1)(d+1)}. If d ≥ 2, T ≥ 1 and we set m = ⌈2(T + 1)(d + 1) log₂[e(T + 1)]⌉, then the number of realizable functions is smaller than 2^m which implies that the VC-dimension of Θ_T(H) is smaller than m.

Following the guidelines of structural risk minimization we can do the following (assuming we know a reasonable upper bound on the VC-dimension of the class of weak hypotheses). Let h_f^T be the hypothesis generated by running AdaBoost for T iterations. By combining the observed empirical error of h_f^T with the bounds given in Theorems 7 and 8, we can compute an upper bound on the generalization error of h_f^T for all T. We would then select the hypothesis h_f^T that minimizes the guaranteed upper bound.

While structural risk minimization is a mathematically sound method, the upper bounds on ε_g that are generated in this way might be larger than the actual value and so the chosen number of iterations T might be much smaller than the optimal value, leading to inferior performance. A simple alternative is to use "cross-validation" in which a fraction of the training set is left outside the set used to generate h_f as the so-called "validation" set. The value of T is then chosen to be the one for which the error of the final hypothesis on the validation set is minimized. (For an extensive analysis of the relations between different methods for selecting model complexity in learning, see Kearns et al. [12].)

Some initial experiments using AdaBoost on real-world problems conducted by ourselves and Drucker and Cortes [6] indicate that AdaBoost tends not to over-fit; on many problems, even after hundreds of rounds of boosting, the generalization error continues to drop, or at least does not increase. On problems where over-fitting can occur, cross validation seems to be a reasonable method for finding a good value of T.

4.4 A Bayesian interpretation

The final hypothesis generated by AdaBoost is closely related to one suggested by a Bayesian analysis. As usual, we assume that examples (x, y) are being generated according to some distribution P on X × {0, 1}; all probabilities in this section are taken with respect to P. Suppose we are given a set of {0, 1}-valued hypotheses h_1, ..., h_T and that our goal is to combine the predictions of these hypotheses in the optimal way. Then, given an instance x and the hypothesis predictions h_t(x), the Bayes optimal decision rule says that we should predict the label with the highest likelihood, given the hypothesis values, i.e., we should predict 1 if

    Pr[y = 1 | h_1(x), ..., h_T(x)] > Pr[y = 0 | h_1(x), ..., h_T(x)],

and otherwise we should predict 0. This rule is especially easy to compute if we assume that the errors of the different hypotheses are independent of one another and of the target concept, that is, if we assume that the event h_t(x) ≠ y is conditionally independent of the actual label y and the predictions of all the other hypotheses h_1(x), ..., h_{t−1}(x), h_{t+1}(x), ..., h_T(x). In this case, by applying Bayes rule, we can rewrite the Bayes optimal decision rule in a particularly simple form in which we predict 1 if

    Pr[y = 1] ∏_{t : h_t(x)=0} ε_t ∏_{t : h_t(x)=1} (1 − ε_t) > Pr[y = 0] ∏_{t : h_t(x)=0} (1 − ε_t) ∏_{t : h_t(x)=1} ε_t,

and 0 otherwise. Here ε_t = Pr[h_t(x) ≠ y]. We add to the set of hypotheses the trivial hypothesis h_0 which always predicts the value 1. We can then replace Pr[y = 0] by ε_0. Taking the logarithm of both sides in this inequality and rearranging the terms, we find that the Bayes optimal decision rule is identical to the combination rule that is generated by AdaBoost.

If the errors of the different hypotheses are dependent, then the Bayes optimal decision rule becomes much more complicated. However, in practice, it is common to use the simple rule described above even when there is no justification for assuming independence. (This is sometimes called "naive Bayes.") An interesting and more principled alternative to this practice would be to use the algorithm AdaBoost to find a combination rule which, by Theorem 6, has a guaranteed non-trivial accuracy.

4.5 Improving the error bound

We show in this section how the bound given in Theorem 6 can be improved by a factor of two. The main idea of this improvement is to replace the "hard" {0, 1}-valued decision used by h_f by a "soft" threshold.

To be more precise, let

    r(x_i) = ( Σ_{t=1}^T (log 1/β_t) h_t(x_i) ) / ( Σ_{t=1}^T log(1/β_t) )

be a weighted average of the weak hypotheses h_t. We will here consider final hypotheses of the form h_f(x_i) = F(r(x_i)) where F : [0, 1] → [0, 1]. For the version of AdaBoost given in Figure 2, F(r) is the hard threshold that equals 1 if r ≥ 1/2 and 0 otherwise. In this section, we will instead use soft threshold functions that take values in [0, 1]. As mentioned above, when h_f(x_i) ∈ [0, 1], we can interpret h_f as a randomized hypothesis and h_f(x_i) as the probability of predicting 1. Then the error E_{i∼D}[|h_f(x_i) − y_i|] is simply the probability of an incorrect prediction.

Theorem 9  Let ε_1, ..., ε_T be as in Theorem 6, and let r(x_i) be as defined above. Let the modified final hypothesis be defined by h_f = F(r(x_i)) where F satisfies the following for r ∈ [0, 1]:

    F(1 − r) = 1 − F(r);   and   F(r) ≤ (1/2) ( ∏_{t=1}^T β_t )^{1/2 − r}.

Then the error ε of h_f is bounded above by

    ε ≤ 2^{T−1} ∏_{t=1}^T √(ε_t (1 − ε_t)).

For instance, it can be shown that the sigmoid function F(r) = ( 1 + ∏_{t=1}^T β_t^{2r−1} )^{−1} satisfies the conditions of the theorem.

Proof: By our assumptions on F, the error of h_f is

    ε = Σ_{i=1}^N D(i) |F(r(x_i)) − y_i| = Σ_{i=1}^N D(i) F(|r(x_i) − y_i|) ≤ (1/2) Σ_{i=1}^N D(i) ( ∏_{t=1}^T β_t )^{1/2 − |r(x_i) − y_i|}.

Since y_i ∈ {0, 1} and by definition of r(x_i), this implies that

    ε ≤ (1/2) Σ_{i=1}^N D(i) ( ∏_{t=1}^T β_t^{1 − |h_t(x_i) − y_i|} ) ( ∏_{t=1}^T β_t )^{−1/2}
      = (1/2) ( Σ_{i=1}^N w_i^{T+1} ) ( ∏_{t=1}^T β_t )^{−1/2}
      ≤ (1/2) ∏_{t=1}^T (1 − (1 − β_t)(1 − ε_t)) β_t^{−1/2}.

The last two steps follow from Equations (18) and (16), respectively. The theorem now follows from our choice of β_t.

5 Boosting for multi-class and regression problems

So far, we have restricted our attention to binary classification problems in which the set of labels Y contains only two elements. In this section, we describe two possible extensions of AdaBoost to the multi-class case in which Y is any finite set of class labels. We also give an extension for a regression problem in which Y is a real bounded interval.

We start with the multiple-label classification problem. Let Y = {1, 2, ..., k} be the set of possible labels. The boosting algorithms we present output hypotheses h_f : X → Y, and the error of the final hypothesis is measured in the usual way as the probability of an incorrect prediction.

The first extension of AdaBoost, which we call AdaBoost.M1, is the most direct. The weak learner generates hypotheses which assign to each instance one of the k possible labels. We require that each weak hypothesis have prediction error less than 1/2 (with respect to the distribution on which it was trained). Provided this requirement can be met, we are able to prove that the error of the combined final hypothesis decreases exponentially, as in the binary case. Intuitively, however, this requirement on the performance of the weak learner is stronger than might be desired. In the binary case (k = 2), a random guess will be correct with probability 1/2, but when k > 2, the probability of a correct random prediction is only 1/k < 1/2. Thus, our requirement that the accuracy of the weak hypothesis be greater than 1/2 is significantly stronger than simply requiring that the weak hypothesis perform better than random guessing.

In fact, when the performance of the weak learner is measured only in terms of error rate, this difficulty is unavoidable as is shown by the following informal example (also presented by Schapire [16]): Consider a learning problem where Y = {0, 1, 2} and suppose that it is "easy" to predict whether the label is 2 but "hard" to predict whether the label is 0 or 1. Then a hypothesis which predicts correctly whenever the label is 2 and otherwise guesses randomly between 0 and 1 is guaranteed to be correct at least half of the time (significantly beating the 1/3 accuracy achieved by guessing entirely at random). On the other hand, boosting this learner to an arbitrary accuracy is infeasible since we assumed that it is hard to distinguish 0- and 1-labeled instances.

As a more natural example of this problem, consider classification of handwritten digits in an OCR application. It may be easy for the weak learner to tell that a particular image of a "7" is not a "0" but hard to tell for sure if it is a "7" or a "9".

20 \7" s no a \0" bu hard o ell for sure f s a \7" or a \9". Par of he problem here s ha, alhough he boosng algorhm can focus he aenon of he weak learner on he harder examples, has no way of forcng he weak learner o dscrmnae beween parcular labels ha may be especally hard o dsngush. In our second verson of mul-class boosng, we aemp o overcome hs dculy by exendng he communcaon beween he boosng algorhm and he weak learner. Frs, we allow he weak learner o generae more expressve hypoheses whose oupu s a vecor n [0; ] k, raher han a sngle label n Y. Inuvely, he yh componen of hs vecor represens a \degree of belef" ha he correc label s y. The componens wh large values (close o ) correspond o hose labels consdered o be plausble. Lkewse, labels consdered mplausble are assgned a small value (near 0), and quesonable labels may be assgned a value near =2. If several labels are consdered plausble (or mplausble), hen hey all may be assgned large (or small) values. Whle we gve he weak learnng algorhm more expressve power, we also place a more complex requremen on he performance of he weak hypoheses. Raher han usng he usual predcon error, we ask ha he weak hypoheses do well wh respec o a more sophscaed error measure ha we call he pseudo-loss. Ths pseudo-loss vares from example o example, and from one round o he nex. On each eraon, he pseudo-loss funcon s suppled o he weak learner by he boosng algorhm, along wh he dsrbuon on he examples. By manpulang he pseudo-loss funcon, he boosng algorhm can focus he weak learner on he labels ha are hardes o dscrmnae. The boosng algorhm AdaBoos.M2, descrbed n Secon 5.2, s based on hese deas and acheves boosng f each weak hypohess has pseudoloss slghly beer han random guessng (wh respec o he pseudo-loss measure ha was suppled o he weak learner). In addon o he wo exensons descrbed n hs paper, we menon an alernave, sandard approach whch would be o conver he gven mul-class problem no several bnary problems, and hen o use boosng separaely on each of he bnary problems. There are several sandard ways of makng such a converson, one of he mos successful beng he errorcorrecng oupu codng approach advocaed by Deerch and Bakr [5]. Fnally, n Secon 5.3 we exend AdaBoos o boosng regresson algorhms. In hs case Y = [0; ], and he error of a hypohess s dened as E (x;y)p (h(x)? y) 2. We descrbe a boosng algorhm AdaBoos.R whch, usng mehods smlar o hose used n AdaBoos.M2, booss he performance of a weak regresson algorhm. 5. Frs mul-class exenson In our rs and mos drec exenson o he mul-class case, he goal of he weak learner s o generae on round a hypohess h : X! Y wh low classcaon error : = Prp [h (x ) 6= y ]. Our exended boosng algorhm, called AdaBoos.M, s shown n Fgure 3, and ders only slghly from AdaBoos. The man derence s n he replacemen of he error jh (x )? y j for he bnary case by [h (x ) 6= y ] where, for any predcae, we dene [ ] o be f holds and 0 oherwse. Also, he nal hypohess h f, for a gven nsance x, now oupus he label y ha maxmzes he sum of he weghs of he weak hypoheses predcng ha label. In he case of bnary classcaon (k = 2), a weak hypohess h wh error sgncanly larger han =2 s of equal value o one wh error sgncanly less han =2 snce h can be replaced by? h. However, for k > 2, a hypohess h wh error =2 s useless o he boosng algorhm. If such a weak hypohess s reurned by he weak learner, our algorhm 20

Algorithm AdaBoost.M1
Input: sequence of N examples (x_1, y_1), ..., (x_N, y_N) with labels y_i ∈ Y = {1, ..., k}
       distribution D over the examples
       weak learning algorithm WeakLearn
       integer T specifying number of iterations
Initialize the weight vector: w_i^1 = D(i) for i = 1, ..., N.
Do for t = 1, 2, ..., T
  1. Set p^t = w^t / Σ_{i=1}^N w_i^t
  2. Call WeakLearn, providing it with the distribution p^t; get back a hypothesis h_t : X → Y.
  3. Calculate the error of h_t: ε_t = Σ_{i=1}^N p_i^t [h_t(x_i) ≠ y_i].
     If ε_t > 1/2, then set T = t − 1 and abort loop.
  4. Set β_t = ε_t / (1 − ε_t).
  5. Set the new weights vector to be
         w_i^{t+1} = w_i^t β_t^{1 − [h_t(x_i) ≠ y_i]}
Output the hypothesis
    h_f(x) = arg max_{y∈Y} Σ_{t=1}^T (log 1/β_t) [h_t(x) = y].

Figure 3: A first multi-class extension of AdaBoost.
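A minimal sketch of the AdaBoost.M1 loop of Figure 3, again in Python/NumPy. The weak learner is assumed to be supplied by the caller and to return a hypothesis mapping instances to labels in Y; the abort test and the arg-max vote follow Step 3 and the output rule above, and the sketch assumes every retained error satisfies 0 < ε_t ≤ 1/2.

    import numpy as np

    def adaboost_m1(X, y, weak_learn, T, labels):
        """y holds labels drawn from `labels`; weak_learn(X, y, p) returns h with h(X) giving labels."""
        N = len(y)
        w = np.full(N, 1.0 / N)
        hypotheses, alphas = [], []
        for t in range(T):
            p = w / w.sum()
            h = weak_learn(X, y, p)
            miss = (h(X) != y).astype(float)
            err = np.dot(p, miss)
            if err > 0.5:                            # Step 3: abort if the hypothesis is too weak
                break
            beta = err / (1.0 - err)                 # Step 4
            w = w * beta ** (1.0 - miss)             # Step 5
            hypotheses.append(h)
            alphas.append(np.log(1.0 / beta))
        def h_final(Xq):                             # arg-max weighted vote over the labels
            votes = np.zeros((len(Xq), len(labels)))
            for a, h in zip(alphas, hypotheses):
                pred = h(Xq)
                for j, lab in enumerate(labels):
                    votes[:, j] += a * (pred == lab)
            return np.asarray(labels)[votes.argmax(axis=1)]
        return h_final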

Theorem 10  Suppose the weak learning algorithm WeakLearn, when called by AdaBoost.M1, generates hypotheses with errors ε_1, ..., ε_T, where ε_t is as defined in Figure 3. Assume each ε_t ≤ 1/2. Then the error ε = Pr_{i∼D}[h_f(x_i) ≠ y_i] of the final hypothesis h_f output by AdaBoost.M1 is bounded above by

    ε ≤ 2^T ∏_{t=1}^T √(ε_t (1 − ε_t)).

Proof: To prove this theorem, we reduce our setup for AdaBoost.M1 to an instantiation of AdaBoost, and then apply Theorem 6. For clarity, we mark with tildes variables in the reduced AdaBoost space.

For each of the given examples (x_i, y_i), we define an AdaBoost example (x̃_i, ỹ_i) in which x̃_i = i and ỹ_i = 0. We define the AdaBoost distribution D̃ over examples to be equal to the AdaBoost.M1 distribution D. On the t-th round, we provide AdaBoost with a hypothesis h̃_t defined by the rule

    h̃_t(i) = [h_t(x_i) ≠ y_i]

in terms of the t-th hypothesis h_t which was returned to AdaBoost.M1 by WeakLearn.

Given this setup, it can be easily proved by induction on the number of rounds that the weight vectors, distributions and errors computed by AdaBoost and AdaBoost.M1 are identical so that w̃^t = w^t, p̃^t = p^t, ε̃_t = ε_t and β̃_t = β_t.

Suppose that AdaBoost.M1's final hypothesis h_f makes a mistake on instance i so that h_f(x_i) ≠ y_i. Then, by definition of h_f,

    Σ_{t=1}^T α_t [h_t(x_i) = y_i] ≤ Σ_{t=1}^T α_t [h_t(x_i) = h_f(x_i)]

where α_t = ln(1/β_t). This implies

    Σ_{t=1}^T α_t [h_t(x_i) = y_i] ≤ (1/2) Σ_{t=1}^T α_t

using the fact that each α_t ≥ 0 since ε_t ≤ 1/2. By definition of h̃_t, this implies

    Σ_{t=1}^T α_t h̃_t(i) ≥ (1/2) Σ_{t=1}^T α_t

so h̃_f(i) = 1 by definition of the final AdaBoost hypothesis. Therefore,

    Pr_{i∼D}[h_f(x_i) ≠ y_i] ≤ Pr_{i∼D̃}[h̃_f(i) = 1].

Since each AdaBoost instance has a 0-label, Pr_{i∼D̃}[h̃_f(i) = 1] is exactly the error of h̃_f. Applying Theorem 6, we can obtain a bound on this error, completing the proof.

It is possible, for this version of the boosting algorithm, to allow hypotheses which generate for each x, not only a predicted class label h(x) ∈ Y, but also a "confidence" η(x) ∈ [0, 1]. The learner then suffers loss 1/2 − η(x)/2 if its prediction is correct and 1/2 + η(x)/2 otherwise. (Details omitted.)

5.2 Second multi-class extension

In this section we describe a second alternative extension of AdaBoost to the case where the label space Y is finite. This extension requires more elaborate communication between the boosting algorithm and the weak learning algorithm. The advantage of doing this is that it gives the weak learner more flexibility in making its predictions. In particular, it sometimes enables the weak learner to make useful contributions to the accuracy of the final hypothesis even when the weak hypothesis does not predict the correct label with probability greater than 1/2.

As described above, the weak learner generates hypotheses which have the form h : X × Y → [0, 1]. Roughly speaking, h(x, y) measures the degree to which it is believed that y is the correct label associated with instance x. If, for a given x, h(x, y) attains the same value for all y then we say that the hypothesis is uninformative on instance x. On the other hand, any deviation from strict equality is potentially informative, because it predicts some labels to be more plausible than others. As will be seen, any such information is potentially useful for the boosting algorithm.

Below, we formalize the goal of the weak learner by defining a pseudo-loss which measures the goodness of the weak hypotheses. To motivate our definition, we first consider the following setup. For a fixed training example (x_i, y_i), we use a given hypothesis h to answer k − 1 binary questions. For each of the incorrect labels y ≠ y_i we ask the question: "Which is the label of x_i: y_i or y?" In other words, we ask that the correct label y_i be discriminated from the incorrect label y.

Assume momentarily that h only takes values in {0, 1}. Then if h(x_i, y) = 0 and h(x_i, y_i) = 1, we interpret h's answer to the question above to be y_i (since h deems y_i to be a plausible label for x_i, but y is considered implausible). Likewise, if h(x_i, y) = 1 and h(x_i, y_i) = 0 then the answer is y. If h(x_i, y) = h(x_i, y_i), then one of the two answers is chosen uniformly at random.

In the more general case that h takes values in [0, 1], we interpret h(x, y) as a randomized decision for the procedure above. That is, we first choose a random bit b(x, y) which is 1 with probability h(x, y) and 0 otherwise. We then apply the above procedure to the stochastically chosen binary function b. The probability of choosing the incorrect answer y to the question above is

    Pr[b(x_i, y_i) = 0 ∧ b(x_i, y) = 1] + (1/2) Pr[b(x_i, y_i) = b(x_i, y)] = (1/2)(1 − h(x_i, y_i) + h(x_i, y)).

If the answers to all k − 1 questions are considered equally important, then it is natural to define the loss of the hypothesis to be the average, over all k − 1 questions, of the probability of an incorrect answer:

    (1/(k − 1)) Σ_{y≠y_i} (1/2)(1 − h(x_i, y_i) + h(x_i, y)) = (1/2)( 1 − h(x_i, y_i) + (1/(k − 1)) Σ_{y≠y_i} h(x_i, y) ).     (24)

However, as was discussed in the introduction to Section 5, different discrimination questions are likely to have different importance in different situations. For example, considering the OCR problem described earlier, it might be that at some point during the boosting process, some example of the digit "7" has been recognized as being either a "7" or a "9". At this stage the question that discriminates between "7" (the correct label) and "9" is clearly much more important than the other eight questions that discriminate "7" from the other digits.
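A tiny sketch of the plain (equally weighted) pseudo-loss of Equation (24) for a single example, assuming the hypothesis is given as a vector of scores h(x_i, y) indexed by label; the weighted variant actually used by AdaBoost.M2 is introduced later in Section 5.2.

    import numpy as np

    def plain_pseudo_loss(scores, correct_label):
        """scores: array of h(x_i, y) for y = 0, ..., k-1; returns Equation (24)."""
        wrong = np.delete(scores, correct_label)
        return 0.5 * (1.0 - scores[correct_label] + wrong.mean())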


More information

THE PREDICTION OF COMPETITIVE ENVIRONMENT IN BUSINESS

THE PREDICTION OF COMPETITIVE ENVIRONMENT IN BUSINESS THE PREICTION OF COMPETITIVE ENVIRONMENT IN BUSINESS INTROUCTION The wo dmensonal paral dfferenal equaons of second order can be used for he smulaon of compeve envronmen n busness The arcle presens he

More information

Outline. Probabilistic Model Learning. Probabilistic Model Learning. Probabilistic Model for Time-series Data: Hidden Markov Model

Outline. Probabilistic Model Learning. Probabilistic Model Learning. Probabilistic Model for Time-series Data: Hidden Markov Model Probablsc Model for Tme-seres Daa: Hdden Markov Model Hrosh Mamsuka Bonformacs Cener Kyoo Unversy Oulne Three Problems for probablsc models n machne learnng. Compung lkelhood 2. Learnng 3. Parsng (predcon

More information

Robust and Accurate Cancer Classification with Gene Expression Profiling

Robust and Accurate Cancer Classification with Gene Expression Profiling Robus and Accurae Cancer Classfcaon wh Gene Expresson Proflng (Compuaonal ysems Bology, 2005) Auhor: Hafeng L, Keshu Zhang, ao Jang Oulne Background LDA (lnear dscrmnan analyss) and small sample sze problem

More information

Epistemic Game Theory: Online Appendix

Epistemic Game Theory: Online Appendix Epsemc Game Theory: Onlne Appendx Edde Dekel Lucano Pomao Marcano Snscalch July 18, 2014 Prelmnares Fx a fne ype srucure T I, S, T, β I and a probably µ S T. Le T µ I, S, T µ, βµ I be a ype srucure ha

More information

Graduate Macroeconomics 2 Problem set 5. - Solutions

Graduate Macroeconomics 2 Problem set 5. - Solutions Graduae Macroeconomcs 2 Problem se. - Soluons Queson 1 To answer hs queson we need he frms frs order condons and he equaon ha deermnes he number of frms n equlbrum. The frms frs order condons are: F K

More information

Computing Relevance, Similarity: The Vector Space Model

Computing Relevance, Similarity: The Vector Space Model Compung Relevance, Smlary: The Vecor Space Model Based on Larson and Hears s sldes a UC-Bereley hp://.sms.bereley.edu/courses/s0/f00/ aabase Managemen Sysems, R. Ramarshnan ocumen Vecors v ocumens are

More information

Lecture VI Regression

Lecture VI Regression Lecure VI Regresson (Lnear Mehods for Regresson) Conens: Lnear Mehods for Regresson Leas Squares, Gauss Markov heorem Recursve Leas Squares Lecure VI: MLSC - Dr. Sehu Vjayakumar Lnear Regresson Model M

More information

On One Analytic Method of. Constructing Program Controls

On One Analytic Method of. Constructing Program Controls Appled Mahemacal Scences, Vol. 9, 05, no. 8, 409-407 HIKARI Ld, www.m-hkar.com hp://dx.do.org/0.988/ams.05.54349 On One Analyc Mehod of Consrucng Program Conrols A. N. Kvko, S. V. Chsyakov and Yu. E. Balyna

More information

FTCS Solution to the Heat Equation

FTCS Solution to the Heat Equation FTCS Soluon o he Hea Equaon ME 448/548 Noes Gerald Reckenwald Porland Sae Unversy Deparmen of Mechancal Engneerng gerry@pdxedu ME 448/548: FTCS Soluon o he Hea Equaon Overvew Use he forward fne d erence

More information

12d Model. Civil and Surveying Software. Drainage Analysis Module Detention/Retention Basins. Owen Thornton BE (Mech), 12d Model Programmer

12d Model. Civil and Surveying Software. Drainage Analysis Module Detention/Retention Basins. Owen Thornton BE (Mech), 12d Model Programmer d Model Cvl and Surveyng Soware Dranage Analyss Module Deenon/Reenon Basns Owen Thornon BE (Mech), d Model Programmer owen.hornon@d.com 4 January 007 Revsed: 04 Aprl 007 9 February 008 (8Cp) Ths documen

More information

Dynamic Team Decision Theory. EECS 558 Project Shrutivandana Sharma and David Shuman December 10, 2005

Dynamic Team Decision Theory. EECS 558 Project Shrutivandana Sharma and David Shuman December 10, 2005 Dynamc Team Decson Theory EECS 558 Proec Shruvandana Sharma and Davd Shuman December 0, 005 Oulne Inroducon o Team Decson Theory Decomposon of he Dynamc Team Decson Problem Equvalence of Sac and Dynamc

More information

Robustness Experiments with Two Variance Components

Robustness Experiments with Two Variance Components Naonal Insue of Sandards and Technology (NIST) Informaon Technology Laboraory (ITL) Sascal Engneerng Dvson (SED) Robusness Expermens wh Two Varance Componens by Ana Ivelsse Avlés avles@ns.gov Conference

More information

( ) [ ] MAP Decision Rule

( ) [ ] MAP Decision Rule Announcemens Bayes Decson Theory wh Normal Dsrbuons HW0 due oday HW o be assgned soon Proec descrpon posed Bomercs CSE 90 Lecure 4 CSE90, Sprng 04 CSE90, Sprng 04 Key Probables 4 ω class label X feaure

More information

( t) Outline of program: BGC1: Survival and event history analysis Oslo, March-May Recapitulation. The additive regression model

( t) Outline of program: BGC1: Survival and event history analysis Oslo, March-May Recapitulation. The additive regression model BGC1: Survval and even hsory analyss Oslo, March-May 212 Monday May 7h and Tuesday May 8h The addve regresson model Ørnulf Borgan Deparmen of Mahemacs Unversy of Oslo Oulne of program: Recapulaon Counng

More information

Lecture 11 SVM cont

Lecture 11 SVM cont Lecure SVM con. 0 008 Wha we have done so far We have esalshed ha we wan o fnd a lnear decson oundary whose margn s he larges We know how o measure he margn of a lnear decson oundary Tha s: he mnmum geomerc

More information

Volatility Interpolation

Volatility Interpolation Volaly Inerpolaon Prelmnary Verson March 00 Jesper Andreasen and Bran Huge Danse Mares, Copenhagen wan.daddy@danseban.com brno@danseban.com Elecronc copy avalable a: hp://ssrn.com/absrac=69497 Inro Local

More information

Time-interval analysis of β decay. V. Horvat and J. C. Hardy

Time-interval analysis of β decay. V. Horvat and J. C. Hardy Tme-nerval analyss of β decay V. Horva and J. C. Hardy Work on he even analyss of β decay [1] connued and resuled n he developmen of a novel mehod of bea-decay me-nerval analyss ha produces hghly accurae

More information

Cubic Bezier Homotopy Function for Solving Exponential Equations

Cubic Bezier Homotopy Function for Solving Exponential Equations Penerb Journal of Advanced Research n Compung and Applcaons ISSN (onlne: 46-97 Vol. 4, No.. Pages -8, 6 omoopy Funcon for Solvng Eponenal Equaons S. S. Raml *,,. Mohamad Nor,a, N. S. Saharzan,b and M.

More information

Econ107 Applied Econometrics Topic 5: Specification: Choosing Independent Variables (Studenmund, Chapter 6)

Econ107 Applied Econometrics Topic 5: Specification: Choosing Independent Variables (Studenmund, Chapter 6) Econ7 Appled Economercs Topc 5: Specfcaon: Choosng Independen Varables (Sudenmund, Chaper 6 Specfcaon errors ha we wll deal wh: wrong ndependen varable; wrong funconal form. Ths lecure deals wh wrong ndependen

More information

TSS = SST + SSE An orthogonal partition of the total SS

TSS = SST + SSE An orthogonal partition of the total SS ANOVA: Topc 4. Orhogonal conrass [ST&D p. 183] H 0 : µ 1 = µ =... = µ H 1 : The mean of a leas one reamen group s dfferen To es hs hypohess, a basc ANOVA allocaes he varaon among reamen means (SST) equally

More information

Existence and Uniqueness Results for Random Impulsive Integro-Differential Equation

Existence and Uniqueness Results for Random Impulsive Integro-Differential Equation Global Journal of Pure and Appled Mahemacs. ISSN 973-768 Volume 4, Number 6 (8), pp. 89-87 Research Inda Publcaons hp://www.rpublcaon.com Exsence and Unqueness Resuls for Random Impulsve Inegro-Dfferenal

More information

Department of Economics University of Toronto

Department of Economics University of Toronto Deparmen of Economcs Unversy of Torono ECO408F M.A. Economercs Lecure Noes on Heeroskedascy Heeroskedascy o Ths lecure nvolves lookng a modfcaons we need o make o deal wh he regresson model when some of

More information

Online Supplement for Dynamic Multi-Technology. Production-Inventory Problem with Emissions Trading

Online Supplement for Dynamic Multi-Technology. Production-Inventory Problem with Emissions Trading Onlne Supplemen for Dynamc Mul-Technology Producon-Invenory Problem wh Emssons Tradng by We Zhang Zhongsheng Hua Yu Xa and Baofeng Huo Proof of Lemma For any ( qr ) Θ s easy o verfy ha he lnear programmng

More information

HEAT CONDUCTION PROBLEM IN A TWO-LAYERED HOLLOW CYLINDER BY USING THE GREEN S FUNCTION METHOD

HEAT CONDUCTION PROBLEM IN A TWO-LAYERED HOLLOW CYLINDER BY USING THE GREEN S FUNCTION METHOD Journal of Appled Mahemacs and Compuaonal Mechancs 3, (), 45-5 HEAT CONDUCTION PROBLEM IN A TWO-LAYERED HOLLOW CYLINDER BY USING THE GREEN S FUNCTION METHOD Sansław Kukla, Urszula Sedlecka Insue of Mahemacs,

More information

Math 128b Project. Jude Yuen

Math 128b Project. Jude Yuen Mah 8b Proec Jude Yuen . Inroducon Le { Z } be a sequence of observed ndependen vecor varables. If he elemens of Z have a on normal dsrbuon hen { Z } has a mean vecor Z and a varancecovarance marx z. Geomercally

More information

Advanced Machine Learning & Perception

Advanced Machine Learning & Perception Advanced Machne Learnng & Percepon Insrucor: Tony Jebara SVM Feaure & Kernel Selecon SVM Eensons Feaure Selecon (Flerng and Wrappng) SVM Feaure Selecon SVM Kernel Selecon SVM Eensons Classfcaon Feaure/Kernel

More information

Including the ordinary differential of distance with time as velocity makes a system of ordinary differential equations.

Including the ordinary differential of distance with time as velocity makes a system of ordinary differential equations. Soluons o Ordnary Derenal Equaons An ordnary derenal equaon has only one ndependen varable. A sysem o ordnary derenal equaons consss o several derenal equaons each wh he same ndependen varable. An eample

More information

General Weighted Majority, Online Learning as Online Optimization

General Weighted Majority, Online Learning as Online Optimization Sascal Technques n Robocs (16-831, F10) Lecure#10 (Thursday Sepember 23) General Weghed Majory, Onlne Learnng as Onlne Opmzaon Lecurer: Drew Bagnell Scrbe: Nahanel Barshay 1 1 Generalzed Weghed majory

More information

1 Review of Zero-Sum Games

1 Review of Zero-Sum Games COS 5: heoreical Machine Learning Lecurer: Rob Schapire Lecure #23 Scribe: Eugene Brevdo April 30, 2008 Review of Zero-Sum Games Las ime we inroduced a mahemaical model for wo player zero-sum games. Any

More information

Li An-Ping. Beijing , P.R.China

Li An-Ping. Beijing , P.R.China A New Type of Cpher: DICING_csb L An-Png Bejng 100085, P.R.Chna apl0001@sna.com Absrac: In hs paper, we wll propose a new ype of cpher named DICING_csb, whch s derved from our prevous sream cpher DICING.

More information

Lecture 18: The Laplace Transform (See Sections and 14.7 in Boas)

Lecture 18: The Laplace Transform (See Sections and 14.7 in Boas) Lecure 8: The Lalace Transform (See Secons 88- and 47 n Boas) Recall ha our bg-cure goal s he analyss of he dfferenal equaon, ax bx cx F, where we emloy varous exansons for he drvng funcon F deendng on

More information

Should Exact Index Numbers have Standard Errors? Theory and Application to Asian Growth

Should Exact Index Numbers have Standard Errors? Theory and Application to Asian Growth Should Exac Index umbers have Sandard Errors? Theory and Applcaon o Asan Growh Rober C. Feensra Marshall B. Rensdorf ovember 003 Proof of Proposon APPEDIX () Frs, we wll derve he convenonal Sao-Vara prce

More information

Let s treat the problem of the response of a system to an applied external force. Again,

Let s treat the problem of the response of a system to an applied external force. Again, Page 33 QUANTUM LNEAR RESPONSE FUNCTON Le s rea he problem of he response of a sysem o an appled exernal force. Agan, H() H f () A H + V () Exernal agen acng on nernal varable Hamlonan for equlbrum sysem

More information

Lecture 2 L n i e n a e r a M od o e d l e s

Lecture 2 L n i e n a e r a M od o e d l e s Lecure Lnear Models Las lecure You have learned abou ha s machne learnng Supervsed learnng Unsupervsed learnng Renforcemen learnng You have seen an eample learnng problem and he general process ha one

More information

Fall 2010 Graduate Course on Dynamic Learning

Fall 2010 Graduate Course on Dynamic Learning Fall 200 Graduae Course on Dynamc Learnng Chaper 4: Parcle Flers Sepember 27, 200 Byoung-Tak Zhang School of Compuer Scence and Engneerng & Cognve Scence and Bran Scence Programs Seoul aonal Unversy hp://b.snu.ac.kr/~bzhang/

More information

J i-1 i. J i i+1. Numerical integration of the diffusion equation (I) Finite difference method. Spatial Discretization. Internal nodes.

J i-1 i. J i i+1. Numerical integration of the diffusion equation (I) Finite difference method. Spatial Discretization. Internal nodes. umercal negraon of he dffuson equaon (I) Fne dfference mehod. Spaal screaon. Inernal nodes. R L V For hermal conducon le s dscree he spaal doman no small fne spans, =,,: Balance of parcles for an nernal

More information

Comb Filters. Comb Filters

Comb Filters. Comb Filters The smple flers dscussed so far are characered eher by a sngle passband and/or a sngle sopband There are applcaons where flers wh mulple passbands and sopbands are requred Thecomb fler s an example of

More information

Bayes rule for a classification problem INF Discriminant functions for the normal density. Euclidean distance. Mahalanobis distance

Bayes rule for a classification problem INF Discriminant functions for the normal density. Euclidean distance. Mahalanobis distance INF 43 3.. Repeon Anne Solberg (anne@f.uo.no Bayes rule for a classfcaon problem Suppose we have J, =,...J classes. s he class label for a pxel, and x s he observed feaure vecor. We can use Bayes rule

More information

Appendix to Online Clustering with Experts

Appendix to Online Clustering with Experts A Appendx o Onlne Cluserng wh Expers Furher dscusson of expermens. Here we furher dscuss expermenal resuls repored n he paper. Ineresngly, we observe ha OCE (and n parcular Learn- ) racks he bes exper

More information

Mechanics Physics 151

Mechanics Physics 151 Mechancs Physcs 5 Lecure 0 Canoncal Transformaons (Chaper 9) Wha We Dd Las Tme Hamlon s Prncple n he Hamlonan formalsm Dervaon was smple δi δ Addonal end-pon consrans pq H( q, p, ) d 0 δ q ( ) δq ( ) δ

More information

P R = P 0. The system is shown on the next figure:

P R = P 0. The system is shown on the next figure: TPG460 Reservor Smulaon 08 page of INTRODUCTION TO RESERVOIR SIMULATION Analycal and numercal soluons of smple one-dmensonal, one-phase flow equaons As an nroducon o reservor smulaon, we wll revew he smples

More information

Appendix H: Rarefaction and extrapolation of Hill numbers for incidence data

Appendix H: Rarefaction and extrapolation of Hill numbers for incidence data Anne Chao Ncholas J Goell C seh lzabeh L ander K Ma Rober K Colwell and Aaron M llson 03 Rarefacon and erapolaon wh ll numbers: a framewor for samplng and esmaon n speces dversy sudes cology Monographs

More information

CHAPTER 2: Supervised Learning

CHAPTER 2: Supervised Learning HATER 2: Supervsed Learnng Learnng a lass from Eamples lass of a famly car redcon: Is car a famly car? Knowledge eracon: Wha do people epec from a famly car? Oupu: osve (+) and negave ( ) eamples Inpu

More information

Chapter 6: AC Circuits

Chapter 6: AC Circuits Chaper 6: AC Crcus Chaper 6: Oulne Phasors and he AC Seady Sae AC Crcus A sable, lnear crcu operang n he seady sae wh snusodal excaon (.e., snusodal seady sae. Complee response forced response naural response.

More information

Ordinary Differential Equations in Neuroscience with Matlab examples. Aim 1- Gain understanding of how to set up and solve ODE s

Ordinary Differential Equations in Neuroscience with Matlab examples. Aim 1- Gain understanding of how to set up and solve ODE s Ordnary Dfferenal Equaons n Neuroscence wh Malab eamples. Am - Gan undersandng of how o se up and solve ODE s Am Undersand how o se up an solve a smple eample of he Hebb rule n D Our goal a end of class

More information

Hidden Markov Models Following a lecture by Andrew W. Moore Carnegie Mellon University

Hidden Markov Models Following a lecture by Andrew W. Moore Carnegie Mellon University Hdden Markov Models Followng a lecure by Andrew W. Moore Carnege Mellon Unversy www.cs.cmu.edu/~awm/uorals A Markov Sysem Has N saes, called s, s 2.. s N s 2 There are dscree meseps, 0,, s s 3 N 3 0 Hdden

More information

Tight results for Next Fit and Worst Fit with resource augmentation

Tight results for Next Fit and Worst Fit with resource augmentation Tgh resuls for Nex F and Wors F wh resource augmenaon Joan Boyar Leah Epsen Asaf Levn Asrac I s well known ha he wo smple algorhms for he classc n packng prolem, NF and WF oh have an approxmaon rao of

More information

Machine Learning Linear Regression

Machine Learning Linear Regression Machne Learnng Lnear Regresson Lesson 3 Lnear Regresson Bascs of Regresson Leas Squares esmaon Polynomal Regresson Bass funcons Regresson model Regularzed Regresson Sascal Regresson Mamum Lkelhood (ML)

More information

2 Aggregate demand in partial equilibrium static framework

2 Aggregate demand in partial equilibrium static framework Unversy of Mnnesoa 8107 Macroeconomc Theory, Sprng 2009, Mn 1 Fabrzo Perr Lecure 1. Aggregaon 1 Inroducon Probably so far n he macro sequence you have deal drecly wh represenave consumers and represenave

More information

SOME NOISELESS CODING THEOREMS OF INACCURACY MEASURE OF ORDER α AND TYPE β

SOME NOISELESS CODING THEOREMS OF INACCURACY MEASURE OF ORDER α AND TYPE β SARAJEVO JOURNAL OF MATHEMATICS Vol.3 (15) (2007), 137 143 SOME NOISELESS CODING THEOREMS OF INACCURACY MEASURE OF ORDER α AND TYPE β M. A. K. BAIG AND RAYEES AHMAD DAR Absrac. In hs paper, we propose

More information

Testing a new idea to solve the P = NP problem with mathematical induction

Testing a new idea to solve the P = NP problem with mathematical induction Tesng a new dea o solve he P = NP problem wh mahemacal nducon Bacground P and NP are wo classes (ses) of languages n Compuer Scence An open problem s wheher P = NP Ths paper ess a new dea o compare he

More information

Machine Learning 2nd Edition

Machine Learning 2nd Edition INTRODUCTION TO Lecure Sldes for Machne Learnng nd Edon ETHEM ALPAYDIN, modfed by Leonardo Bobadlla and some pars from hp://www.cs.au.ac.l/~aparzn/machnelearnng/ The MIT Press, 00 alpaydn@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/mle

More information

EEL 6266 Power System Operation and Control. Chapter 5 Unit Commitment

EEL 6266 Power System Operation and Control. Chapter 5 Unit Commitment EEL 6266 Power Sysem Operaon and Conrol Chaper 5 Un Commmen Dynamc programmng chef advanage over enumeraon schemes s he reducon n he dmensonaly of he problem n a src prory order scheme, here are only N

More information

Comparison of Differences between Power Means 1

Comparison of Differences between Power Means 1 In. Journal of Mah. Analyss, Vol. 7, 203, no., 5-55 Comparson of Dfferences beween Power Means Chang-An Tan, Guanghua Sh and Fe Zuo College of Mahemacs and Informaon Scence Henan Normal Unversy, 453007,

More information

F-Tests and Analysis of Variance (ANOVA) in the Simple Linear Regression Model. 1. Introduction

F-Tests and Analysis of Variance (ANOVA) in the Simple Linear Regression Model. 1. Introduction ECOOMICS 35* -- OTE 9 ECO 35* -- OTE 9 F-Tess and Analyss of Varance (AOVA n he Smple Lnear Regresson Model Inroducon The smple lnear regresson model s gven by he followng populaon regresson equaon, or

More information

Reactive Methods to Solve the Berth AllocationProblem with Stochastic Arrival and Handling Times

Reactive Methods to Solve the Berth AllocationProblem with Stochastic Arrival and Handling Times Reacve Mehods o Solve he Berh AllocaonProblem wh Sochasc Arrval and Handlng Tmes Nsh Umang* Mchel Berlare* * TRANSP-OR, Ecole Polyechnque Fédérale de Lausanne Frs Workshop on Large Scale Opmzaon November

More information

Notes on the stability of dynamic systems and the use of Eigen Values.

Notes on the stability of dynamic systems and the use of Eigen Values. Noes on he sabl of dnamc ssems and he use of Egen Values. Source: Macro II course noes, Dr. Davd Bessler s Tme Seres course noes, zarads (999) Ineremporal Macroeconomcs chaper 4 & Techncal ppend, and Hamlon

More information

2.1 Constitutive Theory

2.1 Constitutive Theory Secon.. Consuve Theory.. Consuve Equaons Governng Equaons The equaons governng he behavour of maerals are (n he spaal form) dρ v & ρ + ρdv v = + ρ = Conservaon of Mass (..a) d x σ j dv dvσ + b = ρ v& +

More information

Introduction ( Week 1-2) Course introduction A brief introduction to molecular biology A brief introduction to sequence comparison Part I: Algorithms

Introduction ( Week 1-2) Course introduction A brief introduction to molecular biology A brief introduction to sequence comparison Part I: Algorithms Course organzaon Inroducon Wee -2) Course nroducon A bref nroducon o molecular bology A bref nroducon o sequence comparson Par I: Algorhms for Sequence Analyss Wee 3-8) Chaper -3, Models and heores» Probably

More information

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window A Deermnsc Algorhm for Summarzng Asynchronous Sreams over a Sldng ndow Cosas Busch Rensselaer Polyechnc Insue Srkana Trhapura Iowa Sae Unversy Oulne of Talk Inroducon Algorhm Analyss Tme C Daa sream: 3

More information

CH.3. COMPATIBILITY EQUATIONS. Continuum Mechanics Course (MMC) - ETSECCPB - UPC

CH.3. COMPATIBILITY EQUATIONS. Continuum Mechanics Course (MMC) - ETSECCPB - UPC CH.3. COMPATIBILITY EQUATIONS Connuum Mechancs Course (MMC) - ETSECCPB - UPC Overvew Compably Condons Compably Equaons of a Poenal Vecor Feld Compably Condons for Infnesmal Srans Inegraon of he Infnesmal

More information

Political Economy of Institutions and Development: Problem Set 2 Due Date: Thursday, March 15, 2019.

Political Economy of Institutions and Development: Problem Set 2 Due Date: Thursday, March 15, 2019. Polcal Economy of Insuons and Developmen: 14.773 Problem Se 2 Due Dae: Thursday, March 15, 2019. Please answer Quesons 1, 2 and 3. Queson 1 Consder an nfne-horzon dynamc game beween wo groups, an ele and

More information

ON THE WEAK LIMITS OF SMOOTH MAPS FOR THE DIRICHLET ENERGY BETWEEN MANIFOLDS

ON THE WEAK LIMITS OF SMOOTH MAPS FOR THE DIRICHLET ENERGY BETWEEN MANIFOLDS ON THE WEA LIMITS OF SMOOTH MAPS FOR THE DIRICHLET ENERGY BETWEEN MANIFOLDS FENGBO HANG Absrac. We denfy all he weak sequenal lms of smooh maps n W (M N). In parcular, hs mples a necessary su cen opologcal

More information

Satisficing in Gaussian bandit problems

Satisficing in Gaussian bandit problems Sasfcng n Gaussan band problems Paul Reverdy and Naom E. Leonard Absrac We propose a sasfcng objecve for he mularmed band problem,.e., where he objecve s o acheve performance above a gven hreshold. We

More information

Mechanics Physics 151

Mechanics Physics 151 Mechancs Physcs 5 Lecure 9 Hamlonan Equaons of Moon (Chaper 8) Wha We Dd Las Tme Consruced Hamlonan formalsm H ( q, p, ) = q p L( q, q, ) H p = q H q = p H = L Equvalen o Lagrangan formalsm Smpler, bu

More information

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. Ths documen s downloaded from DR-NTU, Nanyang Technologcal Unversy Lbrary, Sngapore. Tle A smplfed verb machng algorhm for word paron n vsual speech processng( Acceped verson ) Auhor(s) Foo, Say We; Yong,

More information

Density Matrix Description of NMR BCMB/CHEM 8190

Density Matrix Description of NMR BCMB/CHEM 8190 Densy Marx Descrpon of NMR BCMBCHEM 89 Operaors n Marx Noaon Alernae approach o second order specra: ask abou x magnezaon nsead of energes and ranson probables. If we say wh one bass se, properes vary

More information

Advanced Macroeconomics II: Exchange economy

Advanced Macroeconomics II: Exchange economy Advanced Macroeconomcs II: Exchange economy Krzyszof Makarsk 1 Smple deermnsc dynamc model. 1.1 Inroducon Inroducon Smple deermnsc dynamc model. Defnons of equlbrum: Arrow-Debreu Sequenal Recursve Equvalence

More information

Mechanics Physics 151

Mechanics Physics 151 Mechancs Physcs 5 Lecure 9 Hamlonan Equaons of Moon (Chaper 8) Wha We Dd Las Tme Consruced Hamlonan formalsm Hqp (,,) = qp Lqq (,,) H p = q H q = p H L = Equvalen o Lagrangan formalsm Smpler, bu wce as

More information

e-journal Reliability: Theory& Applications No 2 (Vol.2) Vyacheslav Abramov

e-journal Reliability: Theory& Applications No 2 (Vol.2) Vyacheslav Abramov June 7 e-ournal Relably: Theory& Applcaons No (Vol. CONFIDENCE INTERVALS ASSOCIATED WITH PERFORMANCE ANALYSIS OF SYMMETRIC LARGE CLOSED CLIENT/SERVER COMPUTER NETWORKS Absrac Vyacheslav Abramov School

More information

[Link to MIT-Lab 6P.1 goes here.] After completing the lab, fill in the following blanks: Numerical. Simulation s Calculations

[Link to MIT-Lab 6P.1 goes here.] After completing the lab, fill in the following blanks: Numerical. Simulation s Calculations Chaper 6: Ordnary Leas Squares Esmaon Procedure he Properes Chaper 6 Oulne Cln s Assgnmen: Assess he Effec of Sudyng on Quz Scores Revew o Regresson Model o Ordnary Leas Squares () Esmaon Procedure o he

More information

A HIERARCHICAL KALMAN FILTER

A HIERARCHICAL KALMAN FILTER A HIERARCHICAL KALMAN FILER Greg aylor aylor Fry Consulng Acuares Level 8, 3 Clarence Sree Sydney NSW Ausrala Professoral Assocae, Cenre for Acuaral Sudes Faculy of Economcs and Commerce Unversy of Melbourne

More information

Chapter Lagrangian Interpolation

Chapter Lagrangian Interpolation Chaper 5.4 agrangan Inerpolaon Afer readng hs chaper you should be able o:. dere agrangan mehod of nerpolaon. sole problems usng agrangan mehod of nerpolaon and. use agrangan nerpolans o fnd deraes and

More information

Ensamble methods: Boosting

Ensamble methods: Boosting Lecure 21 Ensamble mehods: Boosing Milos Hauskrech milos@cs.pi.edu 5329 Senno Square Schedule Final exam: April 18: 1:00-2:15pm, in-class Term projecs April 23 & April 25: a 1:00-2:30pm in CS seminar room

More information

January Examinations 2012

January Examinations 2012 Page of 5 EC79 January Examnaons No. of Pages: 5 No. of Quesons: 8 Subjec ECONOMICS (POSTGRADUATE) Tle of Paper EC79 QUANTITATIVE METHODS FOR BUSINESS AND FINANCE Tme Allowed Two Hours ( hours) Insrucons

More information

arxiv: v4 [math.pr] 30 Sep 2016

arxiv: v4 [math.pr] 30 Sep 2016 arxv:1511.05094v4 [mah.pr] 30 Sep 2016 The Unfed Approach for Bes Choce Modelng Appled o Alernave-Choce Selecon Problems Rém Dendevel Absrac The objecve of hs paper s o show ha he so-called unfed approach

More information

2. SPATIALLY LAGGED DEPENDENT VARIABLES

2. SPATIALLY LAGGED DEPENDENT VARIABLES 2. SPATIALLY LAGGED DEPENDENT VARIABLES In hs chaper, we descrbe a sascal model ha ncorporaes spaal dependence explcly by addng a spaally lagged dependen varable y on he rgh-hand sde of he regresson equaon.

More information

AT&T Labs Research, Shannon Laboratory, 180 Park Avenue, Room A279, Florham Park, NJ , USA

AT&T Labs Research, Shannon Laboratory, 180 Park Avenue, Room A279, Florham Park, NJ , USA Machne Learnng, 43, 65 91, 001 c 001 Kluwer Acadec Publshers. Manufacured n The Neherlands. Drfng Gaes ROBERT E. SCHAPIRE schapre@research.a.co AT&T Labs Research, Shannon Laboraory, 180 Park Avenue, Roo

More information

Learning Objectives. Self Organization Map. Hamming Distance(1/5) Introduction. Hamming Distance(3/5) Hamming Distance(2/5) 15/04/2015

Learning Objectives. Self Organization Map. Hamming Distance(1/5) Introduction. Hamming Distance(3/5) Hamming Distance(2/5) 15/04/2015 /4/ Learnng Objecves Self Organzaon Map Learnng whou Exaples. Inroducon. MAXNET 3. Cluserng 4. Feaure Map. Self-organzng Feaure Map 6. Concluson 38 Inroducon. Learnng whou exaples. Daa are npu o he syse

More information

On elements with index of the form 2 a 3 b in a parametric family of biquadratic elds

On elements with index of the form 2 a 3 b in a parametric family of biquadratic elds On elemens wh ndex of he form a 3 b n a paramerc famly of bquadrac elds Bora JadrevĆ Absrac In hs paper we gve some resuls abou prmve negral elemens p(c p n he famly of bcyclc bquadrac elds L c = Q ) c;

More information