Boosting with Online Binary Learners for the Multiclass Bandit Problem


Shang-Tse Chen, School of Computer Science, Georgia Institute of Technology, Atlanta, GA
Hsuan-Tien Lin, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan
Chi-Jen Lu, Institute of Information Science, Academia Sinica, Taipei, Taiwan

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

We consider the problem of online multiclass prediction in the bandit setting. Compared with the full-information setting, in which the learner receives the true label as feedback after making each prediction, the bandit setting assumes that the learner can only know the correctness of the predicted label. Because the bandit setting is more restricted, it is difficult to design good bandit learners, and currently there are not many of them. In this paper, we propose an approach that systematically converts existing online binary classifiers into promising bandit learners with a strong theoretical guarantee. The approach matches the idea of boosting, which has been shown to be powerful for batch learning as well as online learning. In particular, we establish the weak-learning condition on the online binary classifier, and show that the condition allows automatically constructing a bandit learner with arbitrary strength by combining several of those classifiers. Experimental results on several real-world data sets demonstrate the effectiveness of the proposed approach.

1. Introduction

Recently, machine learning problems that involve partial feedback have received an increasing amount of attention (Auer et al., 2002; Flaxman et al., 2005). These problems occur naturally in many modern applications, such as online advertising and recommender systems (Li et al., 2010). For instance, in a recommender system, the partial feedback represents whether the user likes the content recommended by the system, whereas the user's preference for the other contents that have not been displayed remains unknown.

In this paper, we consider one particular learning problem related to partial feedback: the online multiclass prediction problem in the bandit setting (Kakade et al., 2008). In this problem, the learner iteratively interacts with the environment. In each iteration, the learner observes an instance and is asked to predict its label. The main difference between the traditional full-information setting and the bandit setting is the feedback received after each prediction. In the full-information setting, the true label of the instance is revealed, whereas in the bandit setting, only whether the prediction is correct is known. That is, in the bandit setting, the true label remains unknown if the prediction is incorrect. Our goal is to make as few errors as possible in the harsh environment of the bandit setting.

With more restricted information available, it becomes harder to design good learning algorithms in the bandit setting, except for the case of online binary classification, in which the bandit setting and the full-information setting coincide. Thus, it is desirable to find a systematic way to transform existing online binary classification algorithms, or combine many of them, to get an algorithm that effectively deals with the bandit setting. The motivation calls for boosting (Schapire, 1990), one of the most popular and well-developed ensemble methods in the traditional batch supervised classification framework. While most studies on boosting focus on the batch setting (Freund & Schapire, 1997; Schapire & Singer, 1999), some works have extended the success of boosting to the online setting (Oza & Russell, 2001; Chen et al., 2012).
However, to the best of our knowledge, there is no boosting algorithm yet for the problem of online multiclass prediction in the bandit setting. In this paper, we study the possibility of extending the promising theoretical and empirical results of boosting to the bandit setting. As in the design and analysis of boosting algorithms in other settings, we need an appropriate assumption on weak learners in order for boosting to work. A stronger assumption makes the design of a boosting algorithm easier, but at the expense of more restricted applicability. To weaken the assumption, we consider binary full-information weak learners, instead of multiclass bandit ones, with the given binary examples constructed through the one-versus-rest decomposition of the multiclass examples (Chen et al., 2009). Following (Chen et al., 2012), we propose a similar assumption which requires such binary weak learners to perform better than random guessing with respect to smooth weight distributions over the binary examples. Then we prove that boosting is possible under this assumption by designing a strong bandit algorithm using such binary weak learners.

Our bandit algorithm is extended from the full-information one of (Chen et al., 2012), which provides a method to generate such smooth example weights for updating the weak learners, as well as appropriate voting weights for combining the predictions of the weak learners. Nevertheless, our extension in this paper is non-trivial. To compute these weights exactly in (Chen et al., 2012), one needs the full-information feedback, which is not available in our bandit setting. With the limited information of bandit feedback, we show how to find good estimators for the example weights as well as for the voting weights, and we prove that they can in fact be used to replace the true weights to make boosting work in the bandit setting.

Our proposed bandit boosting algorithm enjoys nice theoretical properties similar to those of its batch counterpart. In particular, the proposed algorithm can achieve a small error rate if the performance of each weak learner is better than that of random guessing with respect to the carefully generated weight distributions. In addition, the algorithm reaches promising empirical performance on real-world data sets, even when using very simple full-information weak learners.

Finally, let us stress the difference between our work and existing ones on the bandit problem. Unlike existing works, our goal is not to construct one specific bandit algorithm and analyze its regret. Instead, our goal is to study the possibility of a general paradigm for designing bandit algorithms in a systematic way. Note that there are currently only a very small number of bandit algorithms for the multiclass prediction problem, and most seem to be based on linear models (Kakade et al., 2008; Hazan & Kale, 2011). With the limited power of such linear models, a high error rate is unavoidable in general, so the focus of these works was to reduce the regret, regardless of whether the actual error rate is high. Our result, on the other hand, works for a broader class of classifiers beyond linear ones. We show how to construct a strong bandit algorithm with an error rate close to zero, when we have weak learners that can perform slightly better than random guessing. Here we allow any weak learners, not just linear ones, and they only need to work in the simpler full-information setting rather than in the more challenging bandit setting. Constructing such weak learners may look much less daunting, but we show that they in fact suffice for constructing strong bandit algorithms. We hope that this could open more possibilities for designing better bandit algorithms in the future.

2. Boosting in different settings

Before formally describing our boosting framework in the online bandit setting, let us first review the traditional batch setting as well as the online full-information setting.
In the batch setting, the boosting algorithm has the whole training set $S = \{(x_1, y_1), \ldots, (x_T, y_T)\}$ available at the beginning, where each $x_t$ is the feature vector from some space $\mathcal{X} \subseteq \mathbb{R}^d$ and $y_t$ is its label from some space $\mathcal{Y}$. For the case of binary classification, we assume $\mathcal{Y} = \{-1, +1\}$, and the boosting algorithm repeatedly calls the batch weak learner for a number of rounds as follows. In round $i$, it feeds $S$ as well as a probability distribution $p^{(i)}$ over $S$ to the weak learner, which then returns a weak hypothesis $h^{(i)}$ after seeing the whole $S$ and $p^{(i)}$. It stops at some round $N$ when the strong hypothesis $H(x) = \mathrm{sign}(\sum_{i=1}^{N} \alpha^{(i)} h^{(i)}(x))$, with $\alpha^{(i)} \in \mathbb{R}$ being the voting weight of $h^{(i)}$, achieves a small error rate over $S$, defined as $|\{t : H(x_t) \neq y_t\}| / T$.

For the case of multiclass classification, we assume $\mathcal{Y} = \{1, \ldots, K\}$, and for simplicity we adopt the one-versus-rest approach to reduce the multiclass problem to a binary one. More precisely, each multiclass example $(x_t, y_t)$ is decomposed into $K$ binary examples $((x_t, k), y_{t,k})$, $k = 1, \ldots, K$, where $y_{t,k}$ is $1$ if $y_t = k$ and $-1$ otherwise. One can then apply the boosting algorithm to such binary examples and use $H(x) = \arg\max_k \sum_i \alpha_k^{(i)} h^{(i)}(x, k)$ as the strong hypothesis for the original multiclass problem.

In the online full-information setting, the examples of $S$ are usually considered as chosen adversarially, and they arrive one at a time. The online boosting algorithm must decide on some number $N$ of online weak learners to start with. At step $t$, the boosting algorithm receives $x_t$ and predicts $H_t(x_t) = \arg\max_k \sum_i \alpha_{t,k}^{(i)} h_t^{(i)}(x_t, k)$, where $h_t^{(i)}$ is the weak hypothesis provided by the $i$-th weak learner and $\alpha_{t,k}^{(i)}$ is its voting weight. After the prediction, the true label $y_t$ is revealed, and to update each weak learner, we would like to feed it with a probability measure on each binary example, as in batch boosting.
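As a small illustration of the one-versus-rest reduction and the weighted-vote strong hypothesis described above, the following sketch (ours, not from the paper) decomposes a multiclass example into $K$ binary examples and predicts by the arg-max of the per-class weighted votes; all function and variable names are illustrative assumptions.

```python
import numpy as np

def decompose(x, y, K):
    """One-versus-rest decomposition: a multiclass example (x, y) becomes
    K binary examples ((x, k), y_k) with y_k = +1 if y == k else -1."""
    return [((x, k), 1 if y == k else -1) for k in range(K)]

def strong_predict(x, K, weak_hypotheses, voting_weights):
    """Strong hypothesis H(x) = argmax_k sum_i alpha[i][k] * h_i(x, k).

    weak_hypotheses: list of N callables h_i(x, k) returning a value in [-1, 1].
    voting_weights:  N x K array of voting weights alpha[i][k].
    """
    scores = np.zeros(K)
    for k in range(K):
        scores[k] = sum(voting_weights[i][k] * h(x, k)
                        for i, h in enumerate(weak_hypotheses))
    return int(np.argmax(scores))
```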

However, in the online setting it is hard to determine a good measure for an example without seeing the remaining examples, so we instead only generate a weight $w_{t,k}^{(i)}$ for $((x_t, k), y_{t,k})$, which after normalization corresponds to the measure $p_{t,k}^{(i)}$, for the $i$-th weak learner. The goal is again to achieve a small error rate over $S$, given that each weak learner has some positive advantage, defined as $\sum_{t,k} p_{t,k}^{(i)} y_{t,k} h_t^{(i)}(x_t, k)$. Chen et al. (2012) proposed an online boosting algorithm that achieves this goal in the binary case, which can be easily adapted to the multiclass case here.

In this paper, we consider the online multiclass prediction problem in the bandit setting. The setting is similar to the full-information one, except that at step $t$ the boosting algorithm only receives the bandit information of whether its prediction is correct or not. The goal is essentially the same: to achieve a small error rate, given that each weak learner has some positive advantage.

Several issues arise in designing such a bandit boosting algorithm. The standard approach in designing a bandit algorithm is to use a full-information algorithm as a black box, with its needed information replaced by some estimated one. Usually, the only information needed by a full-information algorithm is the gradient of the loss function at each step, and this information is used only once, for updating its next strategy or action. As a result, the performance (regret) of such a bandit algorithm can be easily analyzed based on that of the full-information one, as it is usually expressed as a simple function of the gradients. For our boosting problem, we would also like to follow this approach, and the only available full-information boosting algorithm with a theoretical guarantee is that of (Chen et al., 2012). However, it is not obvious what to estimate now, since that algorithm involves three online processes which all need the information $y_t$, but for different purposes. First, the boosting algorithm needs $y_t$ to compute the example weights $w_{t,k}^{(i)}$. Second, the boosting algorithm needs $y_t$ to compute the voting weights $\alpha_{t,k}^{(i)}$. Third, the weak learners also need $y_t$, in addition to the $w_{t,k}^{(i)}$'s, to update their next hypotheses. Can one single bit of bandit information about $y_t$ be used to get good estimators for all three processes? Furthermore, as $y_t$ is used in several places and in a more involved way, the bandit algorithm may not be able to use the full-information one as a simple black box, and its performance (error rate) may not be easily based on that of the full-information one. Finally, it is not clear what appropriate assumption one should make on the weak learners in order for boosting to work in the bandit setting. In fact, it is not even clear what type of weak learners one should use. Perhaps the most natural choice is to use multiclass bandit algorithms. That is, starting from weak multiclass bandit algorithms, we boost them into strong multiclass bandit ones. Surprisingly, we will show that it suffices to use binary full-information algorithms with a positive advantage as weak learners. This not only gives us a stronger result in theory, as a weaker assumption on weak learners is needed, but also provides more possibilities for designing weak learners (and thus strong bandit algorithms) in practice, as most existing multiclass bandit algorithms are linear ones.

We will use the following notation and conventions. For a positive integer $n$, we let $[n]$ denote the set $\{1, \ldots, n\}$. For a condition $\pi$, we use the notation $1[\pi]$, which gives the value $1$ if $\pi$ holds and $0$ otherwise. For simplicity, we assume that each $x_t$ has length $\|x_t\|_2 \leq 1$ and each hypothesis $h_t$ comes from some family $\mathcal{H}$ with $h_t(x, k) \in [-1, 1]$.

3. Online weak learners

In this section, we study reasonable assumptions on weak learners for allowing boosting to work in the bandit setting.
As mentioned in the previous section, instead of using multiclass bandit algorithms as weak learners, we will use binary full-information ones. A natural assumption to make is for such a binary full-information algorithm to achieve a positive advantage with respect to any example weights. However, as noted in (Chen et al., 2012), this assumption is too strong to achieve, as one cannot expect an online algorithm to achieve a positive advantage in extreme cases, such as when only the first example has a nonzero weight. Thus, some constraints must be put on the example weights.

To identify an appropriate constraint, let us follow (Chen et al., 2012) and consider the case that each hypothesis $h_t$ consists of $K$ linear functions with $h_t(x, k) = \langle h_{t,k}, x \rangle$, the inner product of the two vectors $h_{t,k}$ and $x$, with $\|h_{t,k}\|_2 \leq 1$. When given an example $(x_t, k)$, the weak learner uses $h_{t,k}$ to predict the binary label $y_{t,k}$. After that, it receives $y_{t,k}$ as well as the example weight $w_{t,k}$, and uses them to update $h_{t,k}$ into a new $h_{(t+1)k}$. We can reduce the task of such a weak learner to the well-known online linear optimization problem, by using the reward function $r_t(h_{t,k}) = w_{t,k}\, y_{t,k} \langle h_{t,k}, x_t \rangle$, which is linear in $h_{t,k}$. Then we can apply the online gradient descent algorithm of (Zinkevich, 2003) to generate $h_{t,k}$ at step $t$, and a standard regret analysis shows that for some constant $c > 0$,

$\sum_t w_{t,k} y_{t,k} \langle h_{t,k}, x_t \rangle \geq \sum_t w_{t,k} y_{t,k} \langle h_k, x_t \rangle - c \sqrt{\textstyle\sum_t w_{t,k}^2}$

for any $h_k$ with $\|h_k\|_2 \leq 1$. Summing over $k \in [K]$ and using the Cauchy-Schwarz inequality, we get

$\sum_{t,k} w_{t,k} y_{t,k} \langle h_{t,k}, x_t \rangle \geq \sum_{t,k} w_{t,k} y_{t,k} \langle h_k, x_t \rangle - c\sqrt{K} \sqrt{\textstyle\sum_{t,k} w_{t,k}^2}.$

Let $\bar{w}$ denote the total weight $\sum_{t,k} w_{t,k}$, so that $p_{t,k} = w_{t,k}/\bar{w}$ is the measure of example $(x_t, k)$.
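Before continuing the derivation, the reduction just described can be illustrated with a short sketch (ours, not the authors' implementation, and only under the linear assumption above): one linear function per class, updated by projected online gradient ascent on the weighted linear reward; the fixed step size is an arbitrary illustrative choice.

```python
import numpy as np

class LinearOGDWeakLearner:
    """One linear function per class, h(x, k) = <h_k, x> with ||h_k|| <= 1,
    updated by projected online gradient ascent on r(h) = w * y * <h, x>."""

    def __init__(self, n_features, n_classes, eta=0.1):
        self.h = np.zeros((n_classes, n_features))
        self.eta = eta  # illustrative fixed step size

    def predict(self, x, k):
        # Value in [-1, 1] because ||h_k|| <= 1 and ||x|| <= 1.
        return float(np.dot(self.h[k], x))

    def update(self, x, k, y, w):
        """y in {-1, +1}; w >= 0 is the example weight supplied by the booster."""
        # Gradient of the linear reward w * y * <h_k, x> is w * y * x.
        self.h[k] += self.eta * w * y * x
        # Project back onto the unit ball to keep ||h_k|| <= 1.
        norm = np.linalg.norm(self.h[k])
        if norm > 1.0:
            self.h[k] /= norm
```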

Dividing both sides of the inequality above by $\bar{w}$, we obtain

$\sum_{t,k} p_{t,k} y_{t,k} \langle h_{t,k}, x_t \rangle \geq \sum_{t,k} p_{t,k} y_{t,k} \langle h_k, x_t \rangle - c\sqrt{K} \sqrt{\textstyle\sum_{t,k} w_{t,k}^2}\,/\,\bar{w}.$

Note that $\sum_{t,k} p_{t,k} y_{t,k} \langle h_k, x_t \rangle$ is the advantage of the offline learner, and suppose that it is at least $3\gamma > 0$. Moreover, suppose the example weights are large, in the sense that they satisfy the following condition:

$\bar{w} \geq cKB/\gamma^2$,   (1)

where $B \geq \max_{t,k} w_{t,k}$ is a constant that will be fixed later. Then, since $\sum_{t,k} w_{t,k}^2 \leq B\,\bar{w}$, the advantage of the online weak learner becomes

$\sum_{t,k} p_{t,k} y_{t,k} \langle h_{t,k}, x_t \rangle \geq 3\gamma - c\sqrt{K}\sqrt{\textstyle\sum_{t,k} w_{t,k}^2}\,/\,\bar{w} \geq 3\gamma - \gamma = 2\gamma.$

This motivates us to propose the following assumption on weak learners, which need not be linear ones.

Assumption 1. There is an online full-information weak learner which can achieve an advantage $2\gamma > 0$ for any sequence of examples and weights satisfying condition (1).

From the discussion above, we have the following.

Lemma 1. Suppose that for any sequence of examples and weights satisfying condition (1), there exists an offline linear hypothesis with an advantage $3\gamma > 0$. Then Assumption 1 holds.

Let us make two remarks on Assumption 1. First, the assumption that a weak learner has a positive advantage is just the assumption that it predicts better than random guessing, which is the standard assumption used by (almost) all previous batch boosting algorithms. Second, the condition (1) on example weights actually makes our assumption weaker, which in turn makes the boosting task harder and our boosting result in the next section stronger. More precisely, we only require the weak learner to perform well (having a positive advantage) when the weights are large, and we do not care how badly it may perform with small weights. In fact, we will make our boosting algorithm call the weak learner with large weights.

4. Our bandit boosting algorithm

In this section we show how to design a bandit boosting algorithm under Assumption 1. Let WL be such an online full-information weak learner; we will run $N$ copies of WL, for some $N$ to be determined later. We follow the approach of reducing the multiclass problem to a binary one as described in Section 2, and we base our bandit boosting algorithm on the full-information one of (Chen et al., 2012), which works for binary classification in the full-information setting. More precisely, at step $t$ we do the following after receiving the feature vector $x_t$. For each class $k \in [K]$, a new feature vector $(x_t, k)$ is created, we obtain a binary weak hypothesis $h_{t,k}^{(i)}(x) = h_t^{(i)}(x, k)$ from the $i$-th weak learner, for $i \in [N]$, and we form the strong hypothesis $H_t(x) = \arg\max_{k \in [K]} f_{t,k}(x)$, with $f_{t,k}(x) = \sum_{i=1}^{N} \alpha_{t,k}^{(i)} h_{t,k}^{(i)}(x)$, where $\alpha_{t,k}^{(i)}$ is some voting weight for the $i$-th weak learner. Then we make our prediction $\hat{y}_t$ based on $H_t(x_t)$ in some way and receive the feedback $1[\hat{y}_t = y_t]$. Using the feedback, we prepare some example weight $\tilde{w}_{t,k}^{(i)}$ to update the $i$-th weak learner, as well as to compute the next voting weight $\alpha_{(t+1)k}^{(i)}$, for $i \in [N]$. It remains to show how to set the example weights and the voting weights, as well as how to choose $\hat{y}_t$, which we describe and analyze in detail next. The complete algorithm is given in Algorithm 1.

Algorithm 1: Bandit boosting algorithm with online weak learner WL
  Input: streaming examples $(x_1, y_1), \ldots, (x_T, y_T)$.
  Parameters: $0 < \delta < 1$, $0 \leq \theta < \gamma < 1/2$.
  Choose $\alpha_{1,k}^{(i)} = 1/N$ and a random $h_{1,k}^{(i)}$ for $k \in [K]$, $i \in [N]$.
  for $t = 1$ to $T$ do
    Let $H_t(x) = \arg\max_{k \in [K]} \sum_{i \in [N]} \alpha_{t,k}^{(i)} h_{t,k}^{(i)}(x)$.
    Let $p_t(k) = (1 - \delta)\, 1[k = H_t(x_t)] + \delta/K$ for $k \in [K]$.
    Predict $\hat{y}_t$ according to the distribution $p_t$.
    Receive the information $1[\hat{y}_t = y_t]$.
    for $k = 1$ to $K$ and $i = 1$ to $N$ do
      Update $\tilde{w}_{t,k}^{(i)}$ according to (4).
      If $\hat{y}_t = k$, call WL$(h_{t,k}^{(i)}, (x_t, k), y_{t,k}, \tilde{w}_{t,k}^{(i)})$ to obtain $h_{(t+1)k}^{(i)}$; otherwise, let $h_{(t+1)k}^{(i)} = h_{t,k}^{(i)}$.
      Update $\alpha_{(t+1)k}^{(i)}$ according to (6).
    end for
  end for
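To make the flow of Algorithm 1 concrete, here is a minimal illustrative sketch (ours, not the authors' code) of one step of the loop. The weak-learner interface matches the sketch from Section 3, and for brevity the full example-weight computation of equations (2)-(4) is replaced by plain importance weighting, while the voting-weight update of equation (6) is omitted; all names are assumptions.

```python
import numpy as np

def bandit_boost_step(x, y_true, learners, alpha, delta, rng):
    """One step of the bandit boosting loop (simplified illustration).

    learners: list of N weak learners, each exposing predict(x, k) -> [-1, 1]
              and update(x, k, y, w), e.g. LinearOGDWeakLearner above.
    alpha:    (N, K) array of voting weights (each column on the simplex).
    """
    N, K = alpha.shape
    # Strong hypothesis: per-class weighted vote over the weak hypotheses.
    scores = np.array([sum(alpha[i, k] * learners[i].predict(x, k)
                           for i in range(N)) for k in range(K)])
    H = int(np.argmax(scores))

    # Exploration: play H with probability 1 - delta, otherwise a uniform label.
    p = np.full(K, delta / K)
    p[H] += 1.0 - delta
    y_hat = int(rng.choice(K, p=p))

    # Bandit feedback: we only learn whether y_hat equals the true label.
    correct = (y_hat == y_true)

    # Only the binary problem with k == y_hat can be updated, because only there
    # do we know the binary label y_{t,k}; the weight is importance-weighted by
    # 1 / p(k) (a stand-in for eq. (4), with the base weight of eq. (2) set to 1).
    k = y_hat
    y_bin = 1 if correct else -1
    w_tilde = 1.0 / p[k]
    for i in range(N):
        learners[i].update(x, k, y_bin, w_tilde)
    # (The voting weights alpha would also be updated here, via eq. (6).)
    return y_hat
```

A typical way to drive this sketch would be `rng = np.random.default_rng(0)`, `alpha = np.full((N, K), 1.0 / N)` (matching the initialization in Algorithm 1), and a list of `N` weak learners such as the one sketched in Section 3.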
The example weight of $(x_t, k)$ for the $i$-th weak learner used by the full-information algorithm of (Chen et al., 2012) is

$w_{t,k}^{(i)} = \min\left\{(1-\gamma)^{z_{t,k}^{(i-1)}/2},\ 1\right\}$,   (2)

where $z_{t,k}^{(0)} = 0$ and $z_{t,k}^{(i-1)}$, for $i \geq 2$, is defined as

$z_{t,k}^{(i-1)} = \sum_{j=1}^{i-1}\left(y_{t,k}\, h_{t,k}^{(j)}(x_t) - \theta\right)$, with $\theta = \gamma/(2+\gamma)$,   (3)

which depends on the information $y_t$. As we are in the bandit setting, we do not have $y_t$ to compute such weights in general.
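As a small illustration (assuming the notation above; not code from the paper), the full-information example weight of equations (2)-(3) can be computed from the margins of the earlier weak learners as follows.

```python
def example_weight(margins, gamma):
    """Full-information example weight of (x_t, k) for weak learner i,
    following eqs. (2)-(3): w = min((1 - gamma) ** (z / 2), 1) with
    z = sum_{j < i} (y_{t,k} * h^{(j)}(x_t, k) - theta), theta = gamma / (2 + gamma).

    margins: values y_{t,k} * h^{(j)}(x_t, k) for the earlier learners j < i.
    """
    theta = gamma / (2.0 + gamma)
    z = sum(m - theta for m in margins)
    return min((1.0 - gamma) ** (z / 2.0), 1.0)
```

Note that when the accumulated margin $z$ is nonpositive, the exponential term is at least $1$ and the weight saturates at $1$, which is exactly the property used in the proof of Lemma 2 below.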

Thus, we balance exploitation with exploration by independently predicting $H_t(x_t)$ with probability $1 - \delta$ and a random label with probability $\delta$, with $\delta \leq \gamma$; let $\hat{y}_t$ denote our prediction. For $k \in [K]$, let $p_t(k)$ denote the probability that $\hat{y}_t = k$. Then we replace the example weight $w_{t,k}^{(i)}$ by the estimator

$\tilde{w}_{t,k}^{(i)} = w_{t,k}^{(i)}/p_t(k)$ if $\hat{y}_t = k$, and $\tilde{w}_{t,k}^{(i)} = 0$ otherwise,   (4)

which we can compute, because when $\hat{y}_t = k$, we do have $y_{t,k}$ to compute $w_{t,k}^{(i)}$. As $p_t(k) \geq \delta/K$, we can choose $B = K/\delta$ and have $w_{t,k}^{(i)} \leq 1$ and $\tilde{w}_{t,k}^{(i)} \leq B$ for any $t, k, i$. Note that the $\tilde{w}_{t,k}^{(i)}$ are random variables, and the following shows that they are in fact good estimators for the $w_{t,k}^{(i)}$'s.

Claim 1. For any $t, k, i$, $\mathbb{E}[\tilde{w}_{t,k}^{(i)}] = \mathbb{E}[w_{t,k}^{(i)}]$. For any $k, i$ and $\lambda > 0$,

$\Pr\left[\left|\sum_t \tilde{w}_{t,k}^{(i)} - \sum_t w_{t,k}^{(i)}\right| > \lambda T\right] \leq 2^{-\Omega(\lambda^2 T/B^2)}.$

Proof. Observe that any fixing of the randomness up to step $t-1$ leaves $\tilde{w}_{t,k}^{(i)}$ and $w_{t,k}^{(i)}$ with the same conditional expectation. Thus, $\mathbb{E}[\tilde{w}_{t,k}^{(i)}] = \mathbb{E}[w_{t,k}^{(i)}]$. Moreover, as the random variables $M_t = \tilde{w}_{t,k}^{(i)} - w_{t,k}^{(i)}$, for $t \in [T]$, form a martingale difference sequence with $|M_t| \leq B$, the probability bound follows from Azuma's inequality.

This claim allows us to use $\tilde{w}_{t,k}^{(i)}$ as the example weight of $(x_t, k)$ to update the $i$-th weak learner. However, as each weak learner is assumed to be a full-information one, it also needs the label $y_{t,k}$ to update, which we may not know. One may try to take a similar approach as before and feed the weak learner with an estimator which is $y_{t,k}/p_t(k)$ when $\hat{y}_t = k$ and $0$ otherwise, but this does not work, as it does not take a value in $\{-1, 1\}$ as needed by the binary weak learner. Instead, we take a different approach: we only call the weak learner to update when $\hat{y}_t = k$, so that we know $y_{t,k}$. That is, when $\hat{y}_t = k$, we call the $i$-th weak learner with $\tilde{w}_{t,k}^{(i)}$ and $y_{t,k}$, which can then update and return the next weak hypothesis $h_{(t+1)k}^{(i)}$; otherwise, we do not call the $i$-th weak learner to update and we simply let the next hypothesis $h_{(t+1)k}^{(i)}$ be the current $h_{t,k}^{(i)}$.

Another issue is that a weak learner is only assumed to work well when given large example weights satisfying condition (1), and even then, it only works well on those examples which are given to it to update. This is dealt with by the following.

Lemma 2. Let $\delta \in [0, 1]$, let $m$ be the largest number such that $\sum_{t,k} \tilde{w}_{t,k}^{(i)} \geq \delta K T$ for every $i \leq m$, and let $f_{t,k}(x) = \sum_{i=1}^{m} \frac{1}{m} h_{t,k}^{(i)}(x)$. Then when $T \geq c'(K^2/\delta^4)\log(K/\delta)$ for a large enough constant $c'$,

$\Pr\left[\left|\{(t, k) : y_{t,k} f_{t,k}(x_t) < \theta\}\right| > 2\delta K T\right] \leq \delta,$

for the parameter $\theta = \gamma/(2+\gamma)$ introduced in (3).

Proof. Note that according to the definition, for any $t$ and $k$, $w_{t,k}^{(m+1)} \leq 1$, and $w_{t,k}^{(m+1)} = 1$ if $y_{t,k} f_{t,k}(x_t) < \theta$, as then $z_{t,k}^{(m)} = m(y_{t,k} f_{t,k}(x_t) - \theta) \leq 0$. This implies that

$\left|\{(t, k) : y_{t,k} f_{t,k}(x_t) < \theta\}\right| \leq \sum_{t,k} w_{t,k}^{(m+1)}.$

As $\sum_{t,k} \tilde{w}_{t,k}^{(m+1)} < \delta K T$ by the definition of $m$, we have

$\Pr\left[\left|\{(t, k) : y_{t,k} f_{t,k}(x_t) < \theta\}\right| > 2\delta K T\right] \leq \Pr\left[\sum_{t,k} w_{t,k}^{(m+1)} - \sum_{t,k} \tilde{w}_{t,k}^{(m+1)} > \delta K T\right],$

which by a union bound and Claim 1 is at most $K \cdot 2^{-\Omega(\delta^2 T/B^2)} \leq \delta$.
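Before moving on, here is a minimal sketch of the estimator (4) used above (our illustration, not the authors' code); the caller is assumed to pass in the full-information weight of equation (2), which is computable exactly when the prediction matches class $k$.

```python
def estimated_example_weight(w, p_k, y_hat, k):
    """Bandit estimator of eq. (4): w_tilde = w / p(k) if y_hat == k, else 0.

    w:    the full-information weight from eq. (2); when y_hat == k, the bandit
          feedback reveals the binary label y_{t,k}, so w can be computed.
    p_k:  probability that the algorithm predicted class k at this step.
    """
    return w / p_k if y_hat == k else 0.0
```

Since class $k$ is predicted with probability $p_t(k)$, the estimator takes the value $w/p_t(k)$ with probability $p_t(k)$ and $0$ otherwise, so its conditional expectation is exactly $w$; this is the unbiasedness stated in Claim 1.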

The following lemma gives an upper bound on the parameter $m$ defined in Lemma 2.

Lemma 3. Suppose Assumption 1 holds and $T \geq cK/(\delta^2\gamma^2)$ for the constant $c$ in condition (1). Then the parameter $m$ in Lemma 2 is at most $O(K/(\delta^2\gamma^2))$.

Proof. Note that for any $i \in [m]$, $\sum_{t,k} \tilde{w}_{t,k}^{(i)} \geq \delta K T \geq cKB/\gamma^2$, with $B = K/\delta$, so condition (1) is satisfied. Thus from Assumption 1, we have

$\sum_{i,t,k} \tilde{w}_{t,k}^{(i)} y_{t,k} h_{t,k}^{(i)}(x_t) \geq 2\gamma \sum_{i,t,k} \tilde{w}_{t,k}^{(i)}$,   (5)

with the sums over $i$ above, as well as in the rest of the proof, taken over $i \in [m]$. On the other hand, we have the following claim.

Claim 2. $\sum_{i,t,k} \tilde{w}_{t,k}^{(i)} y_{t,k} h_{t,k}^{(i)}(x_t) \leq O(BKT/\gamma) + \gamma \sum_{i,t,k} \tilde{w}_{t,k}^{(i)}.$

We omit its proof here as it is very similar to that of Lemma 5 in (Servedio, 2003). (Although that lemma is for the $w_{t,k}^{(i)}$'s, its proof can be easily modified to work for the $\tilde{w}_{t,k}^{(i)}$'s, but with an additional factor of $B$ appearing in the term $O(BKT/\gamma)$ here.) Combining the bound in Claim 2 with the inequality (5), we have $\gamma \sum_{i,t,k} \tilde{w}_{t,k}^{(i)} \leq O(BKT/\gamma)$. Since $\sum_{t,k} \tilde{w}_{t,k}^{(i)} \geq \delta K T$ for $i \in [m]$ and $B = K/\delta$, we get $\gamma m \delta K T \leq O(K^2 T/(\delta\gamma))$. From this, the required bound on $m$ follows, and we have the lemma.

Let us suppose that $T \geq c'(K^2/\delta^4)\log(K/\delta)$ for a large enough constant $c'$, so that both lemmas apply. Then Lemma 2 shows that one can obtain a strong learner by combining the first $m$ weak learners. However, one cannot determine the number $m$ before seeing all the examples, and in fact in our online setting, we need to decide the number $N$ of weak learners even before seeing the first example. Following (Chen et al., 2012), we set $N$ to be the upper bound given by Lemma 3. Then at step $t$, for each $k \in [K]$, we consider the function $f_{t,k}(x) = \sum_{i=1}^{N} \alpha_{t,k}^{(i)} h_{t,k}^{(i)}(x)$ and reduce the task of finding such $\alpha_{t,k} = (\alpha_{t,k}^{(1)}, \ldots, \alpha_{t,k}^{(N)})$ to the Online Convex Programming problem. More precisely, we use the $N$-dimensional probability simplex, denoted by $\mathcal{P}_N$, as the feasible set and define the loss function as

$L_{t,k}(\alpha) = \max\left\{0,\ \theta - y_{t,k} \sum_{i=1}^{N} \alpha^{(i)} h_{t,k}^{(i)}(x_t)\right\},$

which is a convex function of $\alpha$. However, unlike in (Chen et al., 2012), we are in the bandit setting and thus may not know $y_{t,k}$. To overcome this, we use a similar idea as before to estimate a subgradient $\nabla L_{t,k}(\alpha_{t,k})$ by

$\ell_{t,k} = \nabla L_{t,k}(\alpha_{t,k})/p_t(k)$ if $\hat{y}_t = k$, and $\ell_{t,k} = 0$ otherwise.

One can then use $\ell_{t,k} = (\ell_{t,k}^{(1)}, \ldots, \ell_{t,k}^{(N)})$ to perform gradient descent to update $\alpha_{t,k}$ as in (Chen et al., 2012). However, to get a better theoretical bound, here we choose to perform a multiplicative update on $\alpha_{t,k}$ to get $\alpha_{(t+1)k} = (\alpha_{(t+1)k}^{(1)}, \ldots, \alpha_{(t+1)k}^{(N)})$ for step $t+1$, with

$\alpha_{(t+1)k}^{(i)} = \alpha_{t,k}^{(i)}\, e^{-\eta \ell_{t,k}^{(i)}} / Z_{(t+1)k}$,   (6)

where $Z_{(t+1)k}$ is the normalization factor and $\eta$ is the learning rate, which we set to $\delta^3/K$. Then we have the following.

Lemma 4. $\Pr\left[\sum_{t,k} L_{t,k}(\alpha_{t,k}) \leq O(\delta K T)\right] \geq 1 - 2\delta.$

Proof. Following the standard analysis, one can show that for any $k \in [K]$ and any $\bar{\alpha}_k \in \mathcal{P}_N$,

$\sum_t \langle \ell_{t,k}, \alpha_{t,k} - \bar{\alpha}_k \rangle \leq O((\log N)/\eta) + \eta \sum_t \|\ell_{t,k}\|_\infty^2 \leq O(\delta K T)$,   (7)

since $\|\ell_{t,k}\|_\infty^2 \leq B^2 = K^2/\delta^2$, $\eta = \delta^3/K$ and $N \leq O(K/\delta^4)$. Now note that for any $t$ and $k$, $\mathbb{E}[\langle \ell_{t,k}, \alpha_{t,k} - \bar{\alpha}_k \rangle] = \mathbb{E}[\langle \nabla L_{t,k}(\alpha_{t,k}), \alpha_{t,k} - \bar{\alpha}_k \rangle]$, because given the randomness up to step $t-1$, $\alpha_{t,k}$ is fixed and the conditional expectation of $\ell_{t,k}$ equals $\nabla L_{t,k}(\alpha_{t,k})$. Moreover, as $L_{t,k}(\alpha_{t,k}) - L_{t,k}(\bar{\alpha}_k) \leq \langle \nabla L_{t,k}(\alpha_{t,k}), \alpha_{t,k} - \bar{\alpha}_k \rangle$ for convex $L_{t,k}$, we have $\mathbb{E}[L_{t,k}(\alpha_{t,k}) - L_{t,k}(\bar{\alpha}_k)] \leq \mathbb{E}[\langle \ell_{t,k}, \alpha_{t,k} - \bar{\alpha}_k \rangle]$. Then using the bound in (7) and applying a similar martingale analysis as before, one can show that for any $\bar{\alpha}_k \in \mathcal{P}_N$,

$\Pr\left[\sum_{t,k}\left(L_{t,k}(\alpha_{t,k}) - L_{t,k}(\bar{\alpha}_k)\right) > O(\delta K T)\right] \leq \delta.$

Let $\bar{\alpha}_k = (\bar{\alpha}_k^{(1)}, \ldots, \bar{\alpha}_k^{(N)})$, with $\bar{\alpha}_k^{(i)} = 1/m$ for $i \leq m$ and $\bar{\alpha}_k^{(i)} = 0$ for $i > m$, so that

$L_{t,k}(\bar{\alpha}_k) = \max\{0,\ \theta - y_{t,k} f_{t,k}(x_t)\} \leq (1+\theta)\, 1[y_{t,k} f_{t,k}(x_t) < \theta].$

Then we know from Lemma 2 that

$\Pr\left[\sum_{t,k} L_{t,k}(\bar{\alpha}_k) > (1+\theta)\, 2\delta K T\right] \leq \delta.$

Combining the two probability bounds together, we have the lemma.
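For concreteness, here is a hedged sketch (our illustration, not the authors' code) of the multiplicative update (6) with the importance-weighted subgradient estimate; the function and argument names are assumptions.

```python
import numpy as np

def update_voting_weights(alpha_k, h_values, y_bin, p_k, matched, theta, eta):
    """Multiplicative (Hedge-style) update of eq. (6) on the probability simplex.

    alpha_k:  current voting weights (length N, nonnegative, summing to 1).
    h_values: weak hypothesis values h^{(i)}(x_t, k), i = 1..N.
    y_bin:    binary label y_{t,k} in {-1, +1} (known only when matched is True).
    p_k:      probability of having predicted class k at this step.
    matched:  whether y_hat == k, i.e. whether the feedback applies to class k.
    """
    alpha_k = np.asarray(alpha_k, dtype=float)
    h_values = np.asarray(h_values, dtype=float)
    if not matched:
        return alpha_k  # estimated subgradient is zero, weights unchanged
    # Subgradient of L(alpha) = max(0, theta - y * <alpha, h>) w.r.t. alpha.
    margin = y_bin * float(np.dot(alpha_k, h_values))
    grad = -y_bin * h_values if margin < theta else np.zeros_like(h_values)
    l_hat = grad / p_k                      # importance-weighted estimate
    new_alpha = alpha_k * np.exp(-eta * l_hat)
    return new_alpha / new_alpha.sum()      # renormalize onto the simplex
```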

Finally, recall that to predict each $y_t$, we independently output $H_t(x_t) = \arg\max_{k \in [K]} f_{t,k}(x_t)$ with probability $1 - \delta$ and a random label with probability $\delta$. Thus, by a Chernoff bound, our algorithm makes at most $|\{t : H_t(x_t) \neq y_t\}| + 2\delta T$ errors with probability $1 - 2^{-\Omega(\delta^2 T)} \geq 1 - \delta$. On the other hand, as

$1[H_t(x_t) \neq y_t] \leq \sum_k 1[y_{t,k} f_{t,k}(x_t) < 0] \leq \sum_k L_{t,k}(\alpha_{t,k})/\theta,$

Lemma 4 implies that

$\Pr\left[|\{t : H_t(x_t) \neq y_t\}| \leq O(\delta K T/\theta)\right] \geq 1 - 2\delta.$

Consequently, for $\theta = \gamma/(2+\gamma)$, we can conclude that our algorithm makes at most $O(\delta K T/\theta) + 2\delta T \leq O(\delta K T/\gamma)$ errors with probability at least $1 - 3\delta$. Therefore, we have the following, which is the main result of our paper.

Theorem 1. Suppose Assumption 1 holds and $T \geq c'(K^2/\delta^4)\log(K/\delta)$ for a large enough constant $c'$. Then our bandit algorithm uses $O(K/(\delta^2\gamma^2))$ weak learners and makes $O(\delta K T/\gamma)$ errors with probability $1 - 3\delta$.

Note that the error rate of our algorithm is $O(\delta K/\gamma)$, which can be made smaller than any $\varepsilon$ by setting $\delta = O(\varepsilon\gamma/K)$, with the requirement on $T$ and the number of weak learners adjusted accordingly. We remark that we did not attempt to optimize our bounds (which we believe can be improved), as our focus was on establishing the possibility of boosting in the bandit setting. Moreover, it does not seem appropriate to compare our error bound with the regret bounds of existing bandit algorithms. This is because existing algorithms are usually based on linear classifiers, which may have large error rates even though their regrets are small. On the other hand, our boosting algorithm works for any type of classifiers and achieves a small error rate as long as we have weak learners which satisfy Assumption 1.

5. Experiments

In this section, we validate the empirical performance of the proposed algorithm on several real-world data sets. We compare with two representative algorithms. The first one is Banditron (Kakade et al., 2008), which is one of the first proposed algorithms for the bandit setting. It is modified from a multiclass variant of the well-known Perceptron algorithm (Rosenblatt, 1962) using the so-called Kesler's construction (Duda & Hart, 1973). By doing some random exploration, it can accurately construct an estimate of the update step of the full-information multiclass Perceptron. The algorithm has a good theoretical guarantee, especially when the data is linearly separable. The algorithm can be viewed as a direct modification of a full-information learner (Perceptron) for the bandit setting, without combining the learners for boosting.

The second one is Conservative OVA (C-OVA) (Chen et al., 2009), which uses the one-versus-all multiclass-to-binary decomposition similar to our algorithm. But unlike most of the bandit algorithms, it does not do random exploration at all. Instead, it conservatively updates using whatever it gets from the partial feedback, hence the name. Note that although it embeds an online binary learning algorithm as its base learner, it does not perform boosting by combining several base learners like our algorithm does. Also, C-OVA performs a margin-based decoding of the binary classification results, and hence may not work well with non-margin-based base learners.

To demonstrate the boosting ability of our proposed algorithm, we choose two completely different types of online binary classifiers as our weak learners. The first one is Perceptron, a standard margin-based linear classifier. Note that (Chen et al., 2009) used the similar but more complex Online Passive-Aggressive Algorithm (PA) (Crammer et al., 2006) as its internal learner. Since we found little difference in performance between the PA algorithm and the Perceptron algorithm on the data sets we tested, we only report the results using the simpler and more famous Perceptron algorithm, to compare fairly with Banditron. The second weak learner we use is Naive Bayes, a simple statistical classifier that estimates the posterior probability of each class using Bayes' theorem and the assumption of conditional independence between features.

5.1. Results

We test our algorithm on 5 public real-world data sets from various domains with different sizes: CAR, NURSERY, and CONNECT4 from the UCI machine learning repository (Frank & Asuncion, 2010); DNA from the Statlog project (Michie et al., 1994); REUTERS4 from the paper of (Kakade et al., 2008).

Table 1. The data sets used in our experiments.

Data set   | Car    | DNA    | Nursery | Connect4 | Reuters4
#classes   | 4      | 3      | 5       | 3        | 4
#features  |        | 180    |         |          | 346,810
#examples  | 1,728  | 3,186  | 12,960  | 67,557   | 673,768
Basic information of these data sets is summarized in Table 1. As described previously, each example is first used for prediction before the disclosure of its label, and the error rate is the number of prediction errors divided by the total number of examples. All the experiments are repeated 10 times with different random orderings of the examples. For fairness of comparison to Banditron, we do not tune the parameters other than the exploration rate. We fix the number of weak learners and the assumed weak learner advantage $\gamma$ as in the full-information online boosting algorithm (Chen et al., 2012). For the exploration rate, we test a wide range of values to see the effect of random exploration.

The results are shown in Figure 1. Note that C-OVA is not included in this figure, since C-OVA does not perform random exploration at all and is parameter-free. One can see that for a reasonable range of values of the exploration rate (around 0.1), the performance of our algorithm is quite strong and relatively stable, while setting it too high or too low results in worse performance, as expected. Table 2 summarizes the average error rate and the standard deviation when the best choices of the exploration rate are used in Banditron and in our algorithm.

Let us first focus on the case when Perceptron is used as the weak learner. Here, the categorical features are transformed into numerical ones by decomposition into binary vectors. We can see that the proposed bandit boosting algorithm consistently outperforms Banditron on all the data sets, and is also comparable to C-OVA, especially on larger data sets.

Figure 1. (a)-(e): Error rate of Banditron, BanditBoost + Perceptron, and BanditBoost + NaiveBayes under different values of the exploration rate on the CAR, DNA, NURSERY, CONNECT4, and REUTERS4 data sets. (f): Learning curve (error rate versus number of examples) on REUTERS4 using the best exploration rate, comparing Banditron, C-OVA + Perceptron, and BanditBoost + Perceptron.

Table 2. Average (over 10 trials) error rate (%) and standard deviation of Banditron, C-OVA, and BanditBoost, the latter two each with Perceptron and Naive Bayes weak learners, on the five data sets (Naive Bayes results on REUTERS4 are N/A).

To take a closer look at the performance of these algorithms, we plot the learning curve for the largest data set (REUTERS4) in Figure 1(f). One can see that our algorithm begins to outperform the other algorithms when the number of examples is sufficiently large. This is due to the more complex model we use and the need for random exploration, as opposed to the deterministic C-OVA algorithm. Note that this is in accordance with our analysis in Theorem 1, as the error bound only holds when the number of rounds $T$ is large.

Next, let us consider the situation when the weak learner is switched to Naive Bayes. Note that here we did not test on the REUTERS4 data set, due to the slow inference of Naive Bayes for high-dimensional data. It can be seen that our algorithm consistently reaches the best performance on all the data sets. Moreover, we see a large difference between C-OVA and our algorithm, especially on the DNA and NURSERY data sets. The superiority echoes the earlier conjecture that C-OVA may not work well with non-margin-based base learners. On the other hand, the proposed bandit boosting algorithm enjoys a stronger theoretical guarantee and works well with various types of weak learners.

6. Conclusion

We propose a boosting algorithm to efficiently generate strong multiclass bandit learners by exploiting the abundance of existing online binary learners. The proposed algorithm can be viewed as a careful combination of the online boosting algorithm for binary classification (Chen et al., 2012) and some key estimation techniques from the bandit algorithms. While the proposed algorithm is simple, we show some non-trivial theoretical analysis that leads to a sound theoretical guarantee. To the best of our knowledge, our proposed boosting algorithm is the first one that comes with such a theoretical guarantee. In addition, experimental results on real-world data sets show that the proposed bandit boosting algorithm can be easily coupled with different weak binary learners to reach promising performance.

References

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.

Chen, G., Chen, G., Zhang, J., Chen, S., and Zhang, C. Beyond Banditron: A conservative and efficient reduction for online multiclass prediction with bandit setting model. In Proceedings of ICDM, pp. 71-80, 2009.

Chen, S.-T., Lin, H.-T., and Lu, C.-J. An online boosting algorithm with theoretical justifications. In Proceedings of ICML, July 2012.

Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551-585, December 2006.

Duda, R. O. and Hart, P. E. Pattern Classification and Scene Analysis. Wiley, 1973.

Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of SODA, Philadelphia, PA, USA, 2005.

Frank, A. and Asuncion, A. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Hazan, E. and Kale, S. Newtron: an efficient bandit algorithm for online multiclass prediction. In Proceedings of NIPS, 2011.

Kakade, S. M., Shalev-Shwartz, S., and Tewari, A. Efficient bandit algorithms for online multiclass prediction. In Proceedings of ICML, New York, NY, USA, 2008.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of WWW, New York, NY, USA, 2010. ACM.

Michie, D., Spiegelhalter, D. J., and Taylor, C. C. Machine Learning, Neural and Statistical Classification, 1994.

Oza, N. C. and Russell, S. Online bagging and boosting. In Proceedings of AISTATS, 2001.

Rosenblatt, F. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan, 1962.

Schapire, R. E. The strength of weak learnability. Machine Learning, 5(2):197-227, July 1990.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, December 1999.

Servedio, R. A. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 4:633-648, 2003.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of ICML, 2003.


Conservative Contextual Linear Bandits

Conservative Contextual Linear Bandits Conservaive Conexual Linear Bandis Abbas Kazerouni Sanford Universiy abbask@sanford.edu Yasin Abbasi-Yadkori Adobe Research abbasiya@adobe.com Mohammad Ghavamzadeh DeepMind ghavamza@google.com Benjamin

More information

DEPARTMENT OF STATISTICS

DEPARTMENT OF STATISTICS A Tes for Mulivariae ARCH Effecs R. Sco Hacker and Abdulnasser Haemi-J 004: DEPARTMENT OF STATISTICS S-0 07 LUND SWEDEN A Tes for Mulivariae ARCH Effecs R. Sco Hacker Jönköping Inernaional Business School

More information

20. Applications of the Genetic-Drift Model

20. Applications of the Genetic-Drift Model 0. Applicaions of he Geneic-Drif Model 1) Deermining he probabiliy of forming any paricular combinaion of genoypes in he nex generaion: Example: If he parenal allele frequencies are p 0 = 0.35 and q 0

More information

We just finished the Erdős-Stone Theorem, and ex(n, F ) (1 1/(χ(F ) 1)) ( n

We just finished the Erdős-Stone Theorem, and ex(n, F ) (1 1/(χ(F ) 1)) ( n Lecure 3 - Kövari-Sós-Turán Theorem Jacques Versraëe jacques@ucsd.edu We jus finished he Erdős-Sone Theorem, and ex(n, F ) ( /(χ(f ) )) ( n 2). So we have asympoics when χ(f ) 3 bu no when χ(f ) = 2 i.e.

More information

Echocardiography Project and Finite Fourier Series

Echocardiography Project and Finite Fourier Series Echocardiography Projec and Finie Fourier Series 1 U M An echocardiagram is a plo of how a porion of he hear moves as he funcion of ime over he one or more hearbea cycles If he hearbea repeas iself every

More information

Non-Stochastic Bandit Slate Problems

Non-Stochastic Bandit Slate Problems Non-Sochasic Bandi Slae Problems Sayen Kale Yahoo! Research Sana Clara, CA skale@yahoo-inccom Lev Reyzin Georgia Ins of echnology Alana, GA lreyzin@ccgaechedu Absrac Rober E Schapire Princeon Universiy

More information

Methodology. -ratios are biased and that the appropriate critical values have to be increased by an amount. that depends on the sample size.

Methodology. -ratios are biased and that the appropriate critical values have to be increased by an amount. that depends on the sample size. Mehodology. Uni Roo Tess A ime series is inegraed when i has a mean revering propery and a finie variance. I is only emporarily ou of equilibrium and is called saionary in I(0). However a ime series ha

More information

State-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter

State-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter Sae-Space Models Iniializaion, Esimaion and Smoohing of he Kalman Filer Iniializaion of he Kalman Filer The Kalman filer shows how o updae pas predicors and he corresponding predicion error variances when

More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

Lecture Notes 2. The Hilbert Space Approach to Time Series

Lecture Notes 2. The Hilbert Space Approach to Time Series Time Series Seven N. Durlauf Universiy of Wisconsin. Basic ideas Lecure Noes. The Hilber Space Approach o Time Series The Hilber space framework provides a very powerful language for discussing he relaionship

More information

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS NA568 Mobile Roboics: Mehods & Algorihms Today s Topic Quick review on (Linear) Kalman Filer Kalman Filering for Non-Linear Sysems Exended Kalman Filer (EKF)

More information

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018 MATH 5720: Gradien Mehods Hung Phan, UMass Lowell Ocober 4, 208 Descen Direcion Mehods Consider he problem min { f(x) x R n}. The general descen direcions mehod is x k+ = x k + k d k where x k is he curren

More information

A Specification Test for Linear Dynamic Stochastic General Equilibrium Models

A Specification Test for Linear Dynamic Stochastic General Equilibrium Models Journal of Saisical and Economeric Mehods, vol.1, no.2, 2012, 65-70 ISSN: 2241-0384 (prin), 2241-0376 (online) Scienpress Ld, 2012 A Specificaion Tes for Linear Dynamic Sochasic General Equilibrium Models

More information

Lecture 12: Multiple Hypothesis Testing

Lecture 12: Multiple Hypothesis Testing ECE 830 Fall 00 Saisical Signal Processing insrucor: R. Nowak, scribe: Xinjue Yu Lecure : Muliple Hypohesis Tesing Inroducion In many applicaions we consider muliple hypohesis es a he same ime. Example

More information

References are appeared in the last slide. Last update: (1393/08/19)

References are appeared in the last slide. Last update: (1393/08/19) SYSEM IDEIFICAIO Ali Karimpour Associae Professor Ferdowsi Universi of Mashhad References are appeared in he las slide. Las updae: 0..204 393/08/9 Lecure 5 lecure 5 Parameer Esimaion Mehods opics o be

More information

The General Linear Test in the Ridge Regression

The General Linear Test in the Ridge Regression ommunicaions for Saisical Applicaions Mehods 2014, Vol. 21, No. 4, 297 307 DOI: hp://dx.doi.org/10.5351/sam.2014.21.4.297 Prin ISSN 2287-7843 / Online ISSN 2383-4757 The General Linear Tes in he Ridge

More information

Two Coupled Oscillators / Normal Modes

Two Coupled Oscillators / Normal Modes Lecure 3 Phys 3750 Two Coupled Oscillaors / Normal Modes Overview and Moivaion: Today we ake a small, bu significan, sep owards wave moion. We will no ye observe waves, bu his sep is imporan in is own

More information

Conservative Contextual Linear Bandits

Conservative Contextual Linear Bandits Conservaive Conexual Linear Bandis Abbas Kazerouni, Mohammad Ghavamzadeh and Benjamin Van Roy 1 Absrac Safey is a desirable propery ha can immensely increase he applicabiliy of learning algorihms in real-world

More information

Understanding the asymptotic behaviour of empirical Bayes methods

Understanding the asymptotic behaviour of empirical Bayes methods Undersanding he asympoic behaviour of empirical Bayes mehods Boond Szabo, Aad van der Vaar and Harry van Zanen EURANDOM, 11.10.2011. Conens 2/20 Moivaion Nonparameric Bayesian saisics Signal in Whie noise

More information

Some Ramsey results for the n-cube

Some Ramsey results for the n-cube Some Ramsey resuls for he n-cube Ron Graham Universiy of California, San Diego Jozsef Solymosi Universiy of Briish Columbia, Vancouver, Canada Absrac In his noe we esablish a Ramsey-ype resul for cerain

More information

Monochromatic Infinite Sumsets

Monochromatic Infinite Sumsets Monochromaic Infinie Sumses Imre Leader Paul A. Russell July 25, 2017 Absrac WeshowhahereisaraionalvecorspaceV suchha,whenever V is finiely coloured, here is an infinie se X whose sumse X+X is monochromaic.

More information

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important on-parameric echniques Insance Based Learning AKA: neares neighbor mehods, non-parameric, lazy, memorybased, or case-based learning Copyrigh 2005 by David Helmbold 1 Do no fi a model (as do LDA, logisic

More information

di Bernardo, M. (1995). A purely adaptive controller to synchronize and control chaotic systems.

di Bernardo, M. (1995). A purely adaptive controller to synchronize and control chaotic systems. di ernardo, M. (995). A purely adapive conroller o synchronize and conrol chaoic sysems. hps://doi.org/.6/375-96(96)8-x Early version, also known as pre-prin Link o published version (if available):.6/375-96(96)8-x

More information

Written HW 9 Sol. CS 188 Fall Introduction to Artificial Intelligence

Written HW 9 Sol. CS 188 Fall Introduction to Artificial Intelligence CS 188 Fall 2018 Inroducion o Arificial Inelligence Wrien HW 9 Sol. Self-assessmen due: Tuesday 11/13/2018 a 11:59pm (submi via Gradescope) For he self assessmen, fill in he self assessmen boxes in your

More information

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important on-parameric echniques Insance Based Learning AKA: neares neighbor mehods, non-parameric, lazy, memorybased, or case-based learning Copyrigh 2005 by David Helmbold 1 Do no fi a model (as do LTU, decision

More information

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H.

ACE 562 Fall Lecture 8: The Simple Linear Regression Model: R 2, Reporting the Results and Prediction. by Professor Scott H. ACE 56 Fall 5 Lecure 8: The Simple Linear Regression Model: R, Reporing he Resuls and Predicion by Professor Sco H. Irwin Required Readings: Griffihs, Hill and Judge. "Explaining Variaion in he Dependen

More information

Optimal Paired Choice Block Designs. Supplementary Material

Optimal Paired Choice Block Designs. Supplementary Material Saisica Sinica: Supplemen Opimal Paired Choice Block Designs Rakhi Singh 1, Ashish Das 2 and Feng-Shun Chai 3 1 IITB-Monash Research Academy, Mumbai, India 2 Indian Insiue of Technology Bombay, Mumbai,

More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

STATE-SPACE MODELLING. A mass balance across the tank gives:

STATE-SPACE MODELLING. A mass balance across the tank gives: B. Lennox and N.F. Thornhill, 9, Sae Space Modelling, IChemE Process Managemen and Conrol Subjec Group Newsleer STE-SPACE MODELLING Inroducion: Over he pas decade or so here has been an ever increasing

More information