Robust and Efficient Boosting Method using the Conditional Risk


Zhi Xiao, Zhe Luo, Bo Zhong, and Xin Dang, Member, IEEE

Z. Xiao is with the Department of Information Management, Chongqing University, China. E-mail: xiaozhi@cqu.edu.cn. Z. Luo is with Bank of China, 151 National Road, Qingxiu District, Nanning, China. E-mail: shulifang 1988@126.com. B. Zhong is with the Department of Statistics and Actuarial Science, Chongqing University, China. E-mail: zhongbo@cqu.edu.cn. X. Dang is the corresponding author and is with the Department of Mathematics, University of Mississippi, University, MS 38677, USA. E-mail: xdang@olemiss.edu.

Abstract: Well known for its simplicity and effectiveness in classification, AdaBoost nevertheless suffers from overfitting when the class-conditional distributions have significant overlap, and it is very sensitive to noise in the labels. This article tackles both limitations simultaneously by optimizing a modified loss function (i.e., the conditional risk). The proposed approach has two advantages. (1) It directly takes label uncertainty into account through an associated label confidence. (2) It introduces a trustworthiness measure on training samples via the Bayesian risk rule, so the resulting classifier tends to have finite-sample performance superior to that of the original AdaBoost when there is a large overlap between the class-conditional distributions. Theoretical properties of the proposed method are investigated. Extensive experimental results using synthetic data and real-world data sets from the UCI machine learning repository are provided. The empirical study shows the high competitiveness of the proposed method in prediction accuracy and robustness compared with the original AdaBoost and several existing robust AdaBoost algorithms.

Index Terms: AdaBoosting, classification, conditional risk, exponential loss, label noise, overfitting, robustness.

I. INTRODUCTION

For classification, AdaBoost is well known as a simple but effective boosting algorithm whose goal is to construct a strong classifier by gradually combining weak learners [46], [12], [31]. Its improvement in classification accuracy comes from its ability to adaptively sample instances for each base classifier during training, more specifically from its re-weighting mechanism: it emphasizes the instances that were previously misclassified and decreases the importance of those that have been adequately trained. This adaptive scheme, however, causes overfitting on noisy data or on data from overlapping class distributions [9], [25], [43]. The problem stems from the uncertainty of the observed labels. Classification is usually very challenging when classes overlap. It is also both expensive and difficult to obtain reliable labels [11]; in some applications (such as biomedical data), perfect training labels are almost impossible to obtain. Hence, making AdaBoost robust to noise while avoiding overfitting is an important task. The aim of this paper is to construct a modified AdaBoost classification algorithm that tackles these problems from a new perspective.

A. Related Work

Modifications of AdaBoost for dealing with noisy data fall into three strategic categories. The first introduces robust loss functions as new criteria to be minimized, rather than the original exponential loss. The second focuses on modifying the re-weighting rule across iterations in order to reduce or eliminate the effect of noisy data or outliers in the training set. The third suggests more modest ways of combining weak learners that take advantage of the base classifiers in other ways. LogitBoost [13] is an outstanding example of the first strategic category.
It uses the negative binomial log-likelihood loss function, which puts relatively less influence than the exponential loss on instances with large negative margins (the margin is generally defined as yf(x); a negative margin implies a misclassified instance), so LogitBoost is less affected by contaminated data [15]. Based on concepts from robust statistics, Kanamori et al. [19] studied loss functions for robust boosting and proposed a transformation of loss functions in order to construct boosting algorithms that are more robust against outliers. Their usefulness has been confirmed empirically; however, the loss function they utilized was derived without considering efficiency. Onoda [26] proposed a set of algorithms that incorporate a normalization term into the original objective function to prevent overfitting. Sun et al. [35] and Sun et al. [36] modified AdaBoost using regularization methods. The approaches in this first category differ mainly in the loss functions and optimization techniques used; in the pursuit of robustness, it is sometimes hard to balance the complexity of a loss function against its computational cost.

In general, modifying the loss function leads to a new re-weighting rule for AdaBoost, but some heuristic algorithms directly rebuild the weight-updating scheme to avoid skewed distributions of examples in the training set. For instance, Domingo and Watanabe [10] proposed MadaBoost, which bounds the weight assigned to every sample by its initial probability. Zhang et al. [49] introduced a parameter into the weight-updating formula to reduce weight changes during training. Servedio [32] provided a new boosting algorithm, SmoothBoost, which produces only smooth distributions of weights yet still generates a large margin in the final hypothesis. Utkin and Zhuk [40] took a minimax (pessimistic) approach to search for the optimal weights at each

iteration in order to avoid outliers being heavily sampled in the next iteration.

Since the ensemble classifier in AdaBoost predicts a new instance by weighted majority voting among weak learners, a classifier that achieves high training accuracy greatly impacts the prediction because of its large coefficient. This can have a detrimental effect on the generalization error, especially when the training set itself is corrupted [30], [1]. With this in mind, the third strategy seeks a better way to combine weak learners. Schapire and Singer [30] improved boosting in an extended framework where each weak hypothesis produces not only classifications but also confidence scores, in order to smooth the predictions. Another method, Modest AdaBoost [42], decreases the contributions of base learners in a modest way and forces them to work only in their own domain.

The algorithms described above mainly focus on some robustifying principle, but they do not consider specific information in the training samples. Many other studies [37], [18], [16] introduced the noise level into the loss function and extended some of the methods mentioned above. Nevertheless, most of these algorithms do not change the fact that misclassified samples are weighted more heavily than they were in the previous stage, even though the increment of the weights is smaller than in AdaBoost. Thus mislabeled data may still hurt the final decision and cause overfitting.

In recent studies, many researchers have turned to instance-based methods to make AdaBoost robust against label noise or outliers. They evaluate the reliability or usefulness of each sample using statistical methods and take that information into account. Cao et al. [6] suggested a noise-detection based loss function that teaches AdaBoost to classify each sample into the mostly agreed class rather than its observed label. Gao and Gao [14] set the weight of suspicious samples to zero in each iteration, eliminating their effect on AdaBoost. Essentially, these two methods use dynamic correcting and deleting techniques in the training process. In [43], the boosting algorithm works directly on a reduced training set from which confusing samples have been removed. Zhang and Zhang [48] considered a local boosting algorithm whose re-weighting rule and combination of multiple classifiers utilize more local information from the training instances.

For handling label noise, it is natural to delete or correct suspicious instances first and then take the remaining good samples as prototypes for the learning task. This idea is not specific to AdaBoost but applies to general methods in many fields (e.g., [39]). Some approaches aim at constructing a good noise-purification mechanism within the framework of different methods, such as ensemble methods [41], [4], [5], KNN and its variants [29], [22], [17], and so on. Data preprocessing is a necessary step to improve the quality of prediction models in some cases [28]. However, correct samples carrying valuable information may be discarded while, at the same time, some noisy samples may be retained or even newly introduced; this is the limitation of correcting and deleting techniques. To overcome this weakness, Rebbapragada and Brodley [27] used the confidence in the observed label as a weight for each instance during training and provided a novel framework for mitigating class noise. They showed empirically that this confidence-weighting approach can outperform the discarding approach, but the new method was only applied to the tree-based C4.5 classifier, and the confidence-labeling technique they utilized falls short of being a desirable label-correction method.
In [45] and [50], the probability of an instance belonging to class 1 is estimated and used as a soft label for the instance.

B. An overview of the proposed approach

Inspired by instance-based methods and the construction of robust algorithms, we propose a novel boosting algorithm based on label confidence, called CB-AdaBoost. The observed label of each instance is treated as uncertain: not only the correctness, but also the degree of correctness, of each label is evaluated according to a certain criterion before the training procedure. We introduce the confidence of each instance into the exponential loss function. With this modification, the misclassified and correctly classified exponential losses are averaged with weights given by their corresponding probabilities, represented by the correctness-certainty parameter. In this way, the algorithm treats instances differently based on their confidence and thus moderately controls the training intensity for each observation. The modified loss function is in fact the conditional risk, or inner risk, which is quite different from an asymmetric loss or a fuzzy loss. Our method makes a smooth transition between full acceptance and full rejection of a sample label, thereby achieving robustness and efficiency at the same time. In addition, our label-confidence based learning has no threshold parameter, whereas correcting and deleting techniques must define a confidence level for suspect instances so that they can be relabeled or discarded in the training procedure. We derive theoretical results and also provide empirical evidence of the superior performance of the proposed CB-AdaBoost.

The contributions of this paper are as follows.
- A new loss function. We consider the conditional risk, so that label uncertainty can be dealt with directly through the concept of label confidence. This new loss function also leads to the use of the sign of the Bayesian risk rule at each sample point at the initialization of the procedure.
- A simple modification of the adaptive boosting algorithm. Based on the new exponential loss function, AdaBoost has a simple explicit optimization solution at each iteration.
- Theoretical and empirical justifications of the efficiency and robustness of the proposed method. Consistency of CB-AdaBoost is studied.
- Broad adaptivity. The proposed CB-AdaBoost is suitable for noisy data and for class-overlapping data.

C. Outline of the paper

The remainder of the paper is organized as follows. Section II reviews the original AdaBoost. In Section III, we propose a new AdaBoost algorithm; we discuss in detail the assignment

of label confidence, the loss function, and the algorithm, as well as its ability to learn adaptively in the label-confidence framework. Section IV is devoted to a study of the consistency property. In Section V, we illustrate how the proposed algorithm works and investigate its performance through empirical studies on both synthetic and real-world data sets. Finally, the paper concludes with some remarks in Section VI. A proof of consistency is provided in the Appendix.

II. REVIEW OF THE ADABOOST ALGORITHM

For binary classification, the main idea of AdaBoost is to produce a strong classifier by combining weak learners. This is achieved through an optimization that minimizes the exponential loss criterion over the training set. Let L = {(x_i, z_i)}_{i=1}^n denote a given training set of n independent observations, where x_i = (x_{i1}, x_{i2}, ..., x_{ip})^T ∈ R^p and z_i ∈ {1, -1} represent the input attributes and the class label of the i-th instance, respectively. The pseudo-code of AdaBoost is given in Algorithm 1 below.

Algorithm 1: AdaBoost
Input: L = {(x_i, z_i)}_{i=1}^n and the maximum number of base classifiers M.
Initialize: for each i, w_i^{(1)} = 1/n and D_i^{(1)} = w_i^{(1)}/S_1, where S_1 = Σ_{i=1}^n w_i^{(1)} is the normalization factor.
For m = 1 to M:
  1. Draw instances from L with replacement according to the distribution D^{(m)} to form a training set L_m.
  2. Train the base learning algorithm on L_m and obtain a weak hypothesis h_m.
  3. Compute ε_m = Σ_{i: h_m(x_i) ≠ z_i} D_i^{(m)}.
  4. Let β_m = (1/2) ln((1 - ε_m)/ε_m); if β_m < 0, set M = m - 1 and abort the loop.
  5. Update w_i^{(m+1)} = w_i^{(m)} e^{-z_i β_m h_m(x_i)} and D_i^{(m+1)} = w_i^{(m+1)}/S_{m+1} for each i, where S_{m+1} = Σ_{i=1}^n w_i^{(m+1)}.
End For
Output: sign(Σ_{m=1}^M β_m h_m(x)).

In Algorithm 1, the current classifier h_m is induced on the weighted sampling data, and the resulting weighted error ε_m is computed. The individual weight of each observation is then updated for the next iteration. AdaBoost is designed for clean training data, that is, each label z_i is the true label of x_i. In this framework, any instance that was previously misclassified has a higher probability of being sampled in the next stage; the next classifier therefore focuses more on those misclassified instances and, hence, the final ensemble classifier achieves high accuracy. For mislabeled data, however, the observations that were misclassified in the previous step are weighted less, and the correctly classified instances are weighted more, than they should be. This leads to the next training set L_{m+1} being seriously corrupted, and the mislabeled data eventually hurt the performance of the ensemble classifier. Therefore, modifications should be introduced to make AdaBoost insensitive to class noise.

III. LABEL-CONFIDENCE BASED BOOSTING ALGORITHM

A. Label confidence

In the class-noise problem, the observed label y associated with x may be incorrect due to some random mechanism. In the class-overlapping problem, the label y associated with x is a realization of a random label from some distribution. To deal with both problems, we treat the true label Z as random. Let y (either 1 or -1) be the observed label associated with x. We define a parameter γ as the probability of being correctly labeled, that is, γ = P(Z = y | x) and P(Z = -y | x) = 1 - γ, with γ ∈ [0, 1]. The quantity |γ - (1 - γ)| = (2γ - 1) sgn(2γ - 1) measures the trustworthiness of the label y, and sgn(2γ - 1) = ±1 represents confidence in the correctness or wrongness of the label. Thus we can use sgn(2γ - 1) y as the trusted label with confidence level |2γ - 1|.
For example, when γ = 1, 2γ - 1 = 1 and sgn(2γ - 1) = 1: we are 100% confident that the label y is correct. When γ = 0, 2γ - 1 = -1 and sgn(2γ - 1) = -1: we are 100% certain that y is wrong, so -y should be fully trusted. A label y with γ = 0.5 is the most unsure, or fuzzy, case, with zero confidence. It is easy to see that the trusted label sgn(2γ - 1) y is exactly the Bayes rule: let η(x) = P(Z = 1 | x); the Bayes rule is then sgn(2η(x) - 1), which equals sgn(2γ - 1) y for both y = 1 and y = -1.

For the given training data L = {(x_1, y_1), ..., (x_n, y_n)}, let the parameter vector γ = (γ_1, γ_2, ..., γ_n) collect the probabilities of being correctly labeled. That is, γ_i can be regarded as the confidence that the sample x_i is correctly labeled as y_i. In the next subsections, we first introduce the modified loss function based on a given γ, then propose the confidence-based adaptive boosting method (CB-AdaBoost). At the end of the section we discuss the estimation of γ.

B. Conditional-risk loss function

Given a clean training set with correct labels z_i, the original AdaBoost minimizes the empirical exponential risk

  \hat{risk}(f) = (1/n) Σ_{i=1}^n exp(-z_i f(x_i))    (III.1)

over all linear combinations of base classifiers in the given space H, assuming that an exhaustive weak learner returns the best weak hypothesis at every round [13], [31]. With class-noise data, the true label z_i is unknown; we only observe y_i associated with x_i. Under our assumption, given x_i, the probability that Z_i equals y_i is γ_i. It is therefore natural to consider the following empirical risk:

  \hat{R} = (1/n) Σ_{i=1}^n [γ_i exp(-y_i f(x_i)) + (1 - γ_i) exp(y_i f(x_i))].    (III.2)
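For concreteness, the difference between (III.1) and (III.2) can be written out as a short numerical sketch. It is illustrative only; the function names and the toy values of f, y and γ below are ours, not the paper's:

```python
import numpy as np

def empirical_exp_risk(f_values, labels):
    """Empirical exponential risk (III.1): mean of exp(-z_i f(x_i))."""
    return np.mean(np.exp(-labels * f_values))

def empirical_conditional_risk(f_values, labels, gamma):
    """Empirical conditional (inner) risk (III.2):
    mean of gamma_i exp(-y_i f(x_i)) + (1 - gamma_i) exp(+y_i f(x_i))."""
    return np.mean(gamma * np.exp(-labels * f_values)
                   + (1.0 - gamma) * np.exp(labels * f_values))

# Toy illustration: the third point is almost certainly mislabeled (gamma = 0.1),
# so (III.2) penalizes a score that agrees with its observed label, while (III.1)
# rewards it.
f_values = np.array([1.2, -0.8, 0.9])
y_obs = np.array([1, -1, 1])
gamma = np.array([0.95, 0.90, 0.10])
print(empirical_exp_risk(f_values, y_obs))
print(empirical_conditional_risk(f_values, y_obs, gamma))
```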

That is, we treat the observed label y_i as a fuzzy label with correctness confidence γ_i. In other words, we consider the modified exponential loss function

  L_γ(y, f(x)) = γ exp(-y f(x)) + (1 - γ) exp(y f(x)),    (III.3)

which has a straightforward interpretation: the label y associated with x is trusted with confidence γ and is corrected to -y with confidence 1 - γ. It is easy to check that the loss (III.3) satisfies L_γ(y, f(x)) = E_{Z|x} exp(-Z f(x)), which is the inner risk defined in [33]. It is called the inner risk because the true exponential risk is

  risk(f) = E exp(-Z f(X))    (III.4)
          = E_X E_{Z|X}[exp(-Z f(X))] = E_X L_γ(y, f(X))    (III.5)

for y = ±1. From this perspective, we minimize the empirical inner risk of (III.5), while the original AdaBoost minimizes the empirical risk of (III.4). Steinwart and Christmann [33] showed in their Lemma 3.4 that the risk can be attained by minimizing the inner risks, where the expectation is taken with respect to the marginal distribution of x, in contrast to (III.4), where the expectation is taken with respect to the joint distribution of (x, z). Clearly, under class overlap and label noise, the empirical inner risk (III.2) has an advantage over (III.1).

In [2], (III.3) is called the conditional ψ-risk, with ψ the exponential loss. A classification-calibration condition on the conditional risk is provided there to ensure a pointwise form of Fisher consistency for classification; in other words, if the condition is satisfied, the 0-1 loss can be surrogated by the convex ψ loss so that the minimization becomes computationally efficient. The exponential loss is classification-calibrated. Our proposed method utilizes a different empirical estimator of the exponential risk; its consistency follows from the consistency of AdaBoost [3] together with consistent estimation of γ. More details are presented in Section IV.

The loss (III.3) is closely related to the asymmetric loss used in the literature (e.g., [44], [24]), but the motivation and goal of the two losses are quite different. The asymmetric loss treats the two classes unequally: the two misclassification errors produce different costs, and those costs or weights do not necessarily sum to 1. In the asymmetric loss, the ratio of the two costs is usually used to measure the degree of asymmetry and is often a constant parameter, whereas in (III.3) it is a function of x. Also, the loss (III.3) takes a linear combination of the exponential loss at y and -y, while the asymmetric loss takes only one of them. Indeed, the γ in the loss (III.3) is the posterior probability used in [38] for the support vector machine. The similarity is that both use the sign of the Bayes rule as the trusted label; however, we also include the magnitude |2γ - 1| in our loss function, associating the trusted label with a confidence |2γ - 1|, whereas in [38] the confidence is always 1. The idea of label confidence is also closely related to the fuzzy labels used in fuzzy support vector machines [21]; the difference is that a fuzzy label only assigns an importance weight to the observed label without considering its correctness.

Next, we derive the proposed method based on the modified exponential loss function.

C. Derivation of our algorithm

Consider an additive model

  f_M(x) = Σ_{m=1}^M β_m h_m(x),    (III.6)

where h_m(x) ∈ {-1, 1} is the weak classifier of the m-th iteration, β_m is its coefficient, and f_M(x) is the ensemble classifier. Our goal is to learn an ensemble classifier with a forward stage-wise estimation procedure by fitting this additive model to minimize the modified loss function. Let us consider an update from f_{m-1}(x) to f_m(x) = f_{m-1}(x) + β_m h_m(x) by minimizing (III.2).
This is an optimization problem for h_m and β_m, namely

  (β_m, h_m) = arg min_{β,h} Σ_{i=1}^n [γ_i exp(-y_i f_m(x_i)) + (1 - γ_i) exp(y_i f_m(x_i))]
             = arg min_{β,h} Σ_{i=1}^n [w_{i1}^{(m)} exp(-y_i β h(x_i)) + w_{i2}^{(m)} exp(y_i β h(x_i))],    (III.7)

where w_{i1}^{(m)} = γ_i e^{-y_i f_{m-1}(x_i)} and w_{i2}^{(m)} = (1 - γ_i) e^{y_i f_{m-1}(x_i)} do not depend on h_m and β_m. As we will show, h_m and β_m can be derived separately in two steps.

Let us first optimize the weak hypothesis h_m. The summation in (III.7) can be expressed alternatively as

  Σ_{i=1}^n [w_{i1}^{(m)} exp(-y_i β h(x_i)) + w_{i2}^{(m)} exp(y_i β h(x_i))]
  = Σ_{i: h(x_i)=y_i} [w_{i1}^{(m)} e^{-β} + w_{i2}^{(m)} e^{β}] + Σ_{i: h(x_i)≠y_i} [w_{i1}^{(m)} e^{β} + w_{i2}^{(m)} e^{-β}]
  = Σ_{i=1}^n [w_{i1}^{(m)} e^{-β} + w_{i2}^{(m)} e^{β}] + (e^{β} - e^{-β}) Σ_{i: h(x_i)≠y_i} [w_{i1}^{(m)} - w_{i2}^{(m)}].

Therefore, for any given value of β > 0, (III.7) is equivalent to the minimization

  h_m = arg min_h Σ_{i=1}^n [w_{i1}^{(m)} - w_{i2}^{(m)}] I{h(x_i) ≠ y_i}.    (III.8)

It is worth mentioning that the term (w_{i1}^{(m)} - w_{i2}^{(m)}) may be negative, so it cannot be interpreted directly as the weight of the instance (x_i, y_i) in the training set. According to the analytical solution for h_m, the base classifier is expected

to predict (x_i, y_i) correctly when w_{i1}^{(m)} ≥ w_{i2}^{(m)} and to misclassify (x_i, y_i) otherwise. This is equivalent to solving

  min_h Σ_{i=1}^n |w_{i1}^{(m)} - w_{i2}^{(m)}| I{h(x_i) ≠ sgn([w_{i1}^{(m)} - w_{i2}^{(m)}] y_i)}.    (III.9)

In other words, h_m is the hypothesis that minimizes the prediction error over the set {(x_i, sgn([w_{i1}^{(m)} - w_{i2}^{(m)}] y_i))}_{i=1}^n, with each instance weighted by |w_{i1}^{(m)} - w_{i2}^{(m)}|. In each iteration, we treat sgn([w_{i1}^{(m)} - w_{i2}^{(m)}] y_i) as the label of x_i and |w_{i1}^{(m)} - w_{i2}^{(m)}| as its importance. This provides the theoretical justification for the sampling scheme in our proposed algorithm, given later.

Next, we optimize β_m. With h_m fixed, β_m minimizes

  Σ_{i: h_m(x_i)=y_i} [w_{i1}^{(m)} e^{-β} + w_{i2}^{(m)} e^{β}] + Σ_{i: h_m(x_i)≠y_i} [w_{i1}^{(m)} e^{β} + w_{i2}^{(m)} e^{-β}].    (III.10)

Setting the derivative of (III.10) with respect to β to zero, we obtain

  β_m = (1/2) ln [ (Σ_{i: h_m(x_i)=y_i} w_{i1}^{(m)} + Σ_{i: h_m(x_i)≠y_i} w_{i2}^{(m)}) / (Σ_{i: h_m(x_i)≠y_i} w_{i1}^{(m)} + Σ_{i: h_m(x_i)=y_i} w_{i2}^{(m)}) ].    (III.11)

Note that the condition

  Σ_{i: h_m(x_i)=y_i} w_{i1}^{(m)} + Σ_{i: h_m(x_i)≠y_i} w_{i2}^{(m)} > Σ_{i: h_m(x_i)≠y_i} w_{i1}^{(m)} + Σ_{i: h_m(x_i)=y_i} w_{i2}^{(m)}    (III.12)

must hold to ensure that β_m is positive. The approximation at the m-th iteration is then updated as f_m(x) = f_{m-1}(x) + β_m h_m(x), which leads to the following update of w_{i1} and w_{i2}:

  w_{i1}^{(m+1)} = w_{i1}^{(m)} e^{-y_i β_m h_m(x_i)},   w_{i2}^{(m+1)} = w_{i2}^{(m)} e^{y_i β_m h_m(x_i)}.    (III.13)

By repeating the procedure above, the iterative process runs for all rounds m ≥ 2 until m = M or condition (III.12) fails. The initial values are w_{i1}^{(1)} = γ_i and w_{i2}^{(1)} = 1 - γ_i. The procedure is summarized in the pseudo-code of Algorithm 2.

Algorithm 2: CB-AdaBoost
Input: L = {(x_i, y_i)}_{i=1}^n, γ = (γ_i)_{i=1}^n and M.
Initialize: for each i, w_{i1}^{(1)} = γ_i, w_{i2}^{(1)} = 1 - γ_i and D_i^{(1)} = |w_{i1}^{(1)} - w_{i2}^{(1)}| / S_1, where S_1 = Σ_{i=1}^n |w_{i1}^{(1)} - w_{i2}^{(1)}|.
For m = 1 to M:
  1. Relabel all instances in L to compose a new data set L' = {(x_i, y'_i)}_{i=1}^n, where x_i ∈ L and y'_i = sgn[(w_{i1}^{(m)} - w_{i2}^{(m)}) y_i].
  2. Draw instances from L' with replacement according to the distribution D^{(m)} to compose a training set L_m.
  3. Train the base learning algorithm on L_m and obtain a weak hypothesis h_m.
  4. Let β_m = (1/2) ln [ (Σ_{i: h_m(x_i)=y_i} w_{i1}^{(m)} + Σ_{i: h_m(x_i)≠y_i} w_{i2}^{(m)}) / (Σ_{i: h_m(x_i)≠y_i} w_{i1}^{(m)} + Σ_{i: h_m(x_i)=y_i} w_{i2}^{(m)}) ]; if β_m < 0, set M = m - 1 and abort the loop.
  5. Update w_{i1}^{(m+1)} = w_{i1}^{(m)} e^{-y_i β_m h_m(x_i)}, w_{i2}^{(m+1)} = w_{i2}^{(m)} e^{y_i β_m h_m(x_i)}, and D_i^{(m+1)} = |w_{i1}^{(m+1)} - w_{i2}^{(m+1)}| / S_{m+1} for each i, where S_{m+1} = Σ_{i=1}^n |w_{i1}^{(m+1)} - w_{i2}^{(m+1)}|.
End For
Output: sign(Σ_{m=1}^M β_m h_m(x)).

D. Class noise mitigation

In this subsection, we study the effect of label confidence and investigate the adaptive ability of CB-AdaBoost to mitigate overfitting and class noise, looking at its re-weighting procedure and its classifier combination rule.

First, the initialization of the distribution shows the different initial emphases that Algorithm 1 and Algorithm 2 place on the training instances. As discussed earlier, |γ_i - (1 - γ_i)| represents the label certainty of instance i, and it is used as the initial weight in Algorithm 2. The conditional-risk type of loss function leads to this initialization and to a weighting strategy that distinguishes instances based on their own confidences. Consequently, instances with high certainty receive priority in training. This makes sense, as these instances are usually the ones identifiable from a statistical standpoint and thus more valuable for classification. By contrast, Algorithm 1 treats every instance equally at the beginning, without considering the reliability of the samples.

Second, we use y'_i = sgn(2γ_i - 1) y_i as the label of x_i in Algorithm 2. Under mislabeling or class-overlapping scenarios this design makes sense, because sgn(2γ_i - 1) represents the confidence in the correctness or wrongness of the label y_i: if sgn(2γ_i - 1) = 1, then y_i should be trusted with confidence 2γ_i - 1, whereas if sgn(2γ_i - 1) = -1, then -y_i should be trusted with confidence |2γ_i - 1|. The original AdaBoost trusts the label y_i completely, which is inappropriate under mislabeling and class overlap.
As shown before, the trusted label y'_i in CB-AdaBoost has the same sign as the Bayes rule at the sample point x_i. Intuitively, our method uses more information at the initialization.

Third, we take a detailed look at the weight-updating formulas in Algorithm 2 and obtain the following results on the first re-weighting step. We say that an instance x_i is misclassified at the m-th iteration if h_m(x_i) ≠ y'_i, where y'_i = sgn[(w_{i1}^{(m)} - w_{i2}^{(m)}) y_i]; otherwise, it is correctly classified.

Proposition 1. A misclassified instance receives a larger weight for the next iteration.

Proof. The two types of misclassification are either h_m(x_i) ≠ y_i with w_{i1}^{(m)} > w_{i2}^{(m)}, or h_m(x_i) = y_i with w_{i1}^{(m)} < w_{i2}^{(m)}. In the first case,

  w_{i1}^{(m+1)} - w_{i2}^{(m+1)} = w_{i1}^{(m)} e^{β_m} - w_{i2}^{(m)} e^{-β_m} > w_{i1}^{(m)} - w_{i2}^{(m)},

while in the second case,

  w_{i2}^{(m+1)} - w_{i1}^{(m+1)} = w_{i2}^{(m)} e^{β_m} - w_{i1}^{(m)} e^{-β_m} > w_{i2}^{(m)} - w_{i1}^{(m)}.

In both cases the weight |w_{i1}^{(m+1)} - w_{i2}^{(m+1)}| increases.

Proposition 2. If an instance is correctly classified and its certainty is high enough that max{w_{i1}^{(m)}, w_{i2}^{(m)}} > e^{β_m} min{w_{i1}^{(m)}, w_{i2}^{(m)}}, then it receives a smaller weight at the next iteration.

Proof. We check the two cases. For w_{i1}^{(m)} > w_{i2}^{(m)} and h_m(x_i) = y_i: if w_{i1}^{(m)} > e^{β_m} w_{i2}^{(m)}, then

  |w_{i1}^{(m+1)} - w_{i2}^{(m+1)}| = |w_{i1}^{(m)} e^{-β_m} - w_{i2}^{(m)} e^{β_m}| < w_{i1}^{(m)} - w_{i2}^{(m)}.

For w_{i1}^{(m)} < w_{i2}^{(m)} and h_m(x_i) ≠ y_i: if w_{i2}^{(m)} > e^{β_m} w_{i1}^{(m)}, then

  |w_{i2}^{(m+1)} - w_{i1}^{(m+1)}| = |w_{i2}^{(m)} e^{-β_m} - w_{i1}^{(m)} e^{β_m}| < w_{i2}^{(m)} - w_{i1}^{(m)}.

Propositions 1 and 2 show that, at this first important stage, CB-AdaBoost inherits the adaptive learning ability of AdaBoost, with the distinction that it adjusts the distribution of instances according to the current classification with respect to the commonly agreed information; moreover, the degree of adjustment is governed by the confidence of each sample. For the subsequent iterations we can picture the resampling process: the weights of instances with high confidence stay at a high level until most of them have been sufficiently learned, after which their proportion decreases rapidly while the proportion of instances with low confidence increases gradually. Once uncertain instances make up most of the training set, the training process becomes difficult to continue. On the other hand, once a new classifier becomes no better than a random guess, an early stop of the iterative process is possible, because condition (III.12) no longer holds in that case. Thus, our proposed method effectively prevents the ensemble classifier from overfitting.

Fourth, let us scrutinize the classifier combination rule.

Proposition 3. In the framework of Algorithm 2, define ε_m as the error rate of h_m over its training set L_m during the m-th iteration, that is, ε_m = Σ_{i: h_m(x_i) ≠ y'_i} |w_{i1}^{(m)} - w_{i2}^{(m)}| / S_m. Then β_m < (1/2) ln((1 - ε_m)/ε_m).

Proof. We can prove the result by giving an equivalent representation of β_m:

  β_m = (1/2) ln [ (Σ_{i: h_m(x_i)=y_i} w_{i1}^{(m)} + Σ_{i: h_m(x_i)≠y_i} w_{i2}^{(m)}) / (Σ_{i: h_m(x_i)≠y_i} w_{i1}^{(m)} + Σ_{i: h_m(x_i)=y_i} w_{i2}^{(m)}) ]
      = (1/2) ln [ (Σ_{i: h_m(x_i)=y'_i} |w_{i1}^{(m)} - w_{i2}^{(m)}| + c) / (Σ_{i: h_m(x_i)≠y'_i} |w_{i1}^{(m)} - w_{i2}^{(m)}| + c) ],

where c = Σ_{i: w_{i1}^{(m)} < w_{i2}^{(m)}} w_{i1}^{(m)} + Σ_{i: w_{i1}^{(m)} > w_{i2}^{(m)}} w_{i2}^{(m)}. With condition (III.12) satisfied, we obtain

  Σ_{i: h_m(x_i)=y'_i} |w_{i1}^{(m)} - w_{i2}^{(m)}| > Σ_{i: h_m(x_i)≠y'_i} |w_{i1}^{(m)} - w_{i2}^{(m)}|,

which implies

  (1 - ε_m)/ε_m = Σ_{i: h_m(x_i)=y'_i} |w_{i1}^{(m)} - w_{i2}^{(m)}| / Σ_{i: h_m(x_i)≠y'_i} |w_{i1}^{(m)} - w_{i2}^{(m)}|
                > (Σ_{i: h_m(x_i)=y'_i} |w_{i1}^{(m)} - w_{i2}^{(m)}| + c) / (Σ_{i: h_m(x_i)≠y'_i} |w_{i1}^{(m)} - w_{i2}^{(m)}| + c).

This completes the proof of Proposition 3.

It turns out that the β_m calculated in our modified algorithm does not use the full value of the odds ratio for each hypothesis. In fact, it is smaller than the coefficient calculated in AdaBoost, so our algorithm combines base classifiers and updates instance weights modestly. This effectively avoids the situation where hypotheses dominated by substantial classification noise are exaggerated by large coefficients in the final classifier.

We have studied the CB-AdaBoost algorithm in detail and compared its advantages to the original. Next, we discuss the remaining issue of how to estimate the label confidence.

E. Assignment of label confidence

In most cases, since it is difficult to track the data-collection process and identify where corruption is most likely to occur, we evaluate the confidence in the labels from the statistical characteristics of the data itself. In this regard, [27] suggested a pair-wise expectation maximization method (PWEM) to compute the confidence of labels, and Cao et al. [6] applied KNN to detect suspicious examples. However, a direct application of these methods may not be efficient for data sets with a high noise level. We believe that a cleaner data set leads to a better confidence estimation.
Therefore, before confidence assignment, a noise filter is introduced to eliminate highly suspicious instances, so that more reliable statistical characteristics can be extracted from the remaining data. First, the noise filter scans the original data set: using a similarity measure between instances to find a neighborhood of each instance, one computes the agreement rate of its label among its neighbors, and instances with an agreement rate below a certain threshold are eliminated. This process can be repeated several times, since some suspect instances may only be exposed later, when their neighborhoods change.

TABLE I
AVERAGE AND STANDARD DEVIATION OF THE CONFIDENCES FOR CLEAN AND MISLABELED SAMPLES IN THE TWO DATA SETS AT DIFFERENT NOISE LEVELS.

Data    Size     Label type   Noise 10%        Noise 20%        Noise 30%
Normal  n = 50   Clean        0.8919 ± 0.2068  0.8693 ± 0.2267  0.8201 ± 0.2519
                 Mislabeled   0.0581 ± 0.0616  0.1795 ± 0.2265  0.4459 ± 0.4018
        n = 500  Clean        0.9172 ± 0.2011  0.8547 ± 0.1978  0.7145 ± 0.1514
                 Mislabeled   0.0850 ± 0.1790  0.1446 ± 0.1843  0.2742 ± 0.1499
Sine    n = 50   Clean        0.8551 ± 0.2503  0.8503 ± 0.2720  0.7142 ± 0.3556
                 Mislabeled   0.1833 ± 0.2905  0.3888 ± 0.4047  0.4661 ± 0.3999
        n = 500  Clean        0.8731 ± 0.2639  0.8543 ± 0.2675  0.8451 ± 0.2844
                 Mislabeled   0.2870 ± 0.3832  0.4142 ± 0.4195  0.4958 ± 0.4257

In our experiment, the threshold is set to 0.07 at the beginning, with an increment of 0.07 in each subsequent round. The process is repeated three times, so the final cut-off value for the agreement rate is 0.21; the sample size therefore does not decrease much and, at the same time, the distributional information of the sample is kept relatively intact.

Once a filtered data set, denoted L_red, is obtained, two methods can be used to compute the label confidence. If the noise level ε of the training labels is known or can be estimated, we can express the frequency of observations with label y as

  P(Y = y) = P(Y = y, Z = y) + P(Y = y, Z = -y) = (1 - ε) P(Z = y) + ε P(Z = -y),

where the noise level is ε = P(Y = -y | Z = y) = P(Y = y | Z = -y). This representation reflects the two sources making up label y: correctly labeled instances whose true class is y and mislabeled instances whose true class is -y. Then P(Z = y) = (P(Y = y) - ε)/(1 - 2ε), and using the Bayes formula we assess the confidence as

  γ = P(Z = y | x) = P(Z = y) f(x | Z = y) / f(x)
    = P(Z = y) f(x | Z = y) / [f(x | Z = y) P(Z = y) + f(x | Z = -y) P(Z = -y)]
    = (P(Y = y) - ε) f(x | Z = y) / [(P(Y = y) - ε) f(x | Z = y) + (P(Y = -y) - ε) f(x | Z = -y)].

With the form of the conditional distributions known, f(x | Z = y) and f(x | Z = -y) can be estimated from L_red, while P(Y = y) is set directly to the sample proportion of class y in L.

The second method does not require the noise level. KNN is used again to assign a confidence to each label: based on L_red, the label-agreement rate of each instance among its nearest neighbors acts as its confidence. The confidence of an example (x, y) in L is thus computed as

  P(Z = y | x) = (1/K) Σ_{x_j ∈ N(x)} I(y_j = y),    (III.14)

where N(x) denotes the set of the K nearest neighbors of x in L_red. In our experiments, K = 5 is used. In the simulations of Section V, we evaluate the quality of the confidences assigned by these two methods. In practice, however, the Bayesian method is usually infeasible, since the noise level is unknown.

F. Relationship to previous work

Note that our modified algorithm reduces to AdaBoost if the confidence of every label is set to one. The greater the confidence of each instance, the less CB-AdaBoost differs from AdaBoost in terms of the weight updating, the base classifiers, and their coefficients in successive iterations.

Rebbapragada et al. [27] proposed instance weighting via confidence in order to mitigate class noise. They attempted to assign confidences to instance labels such that incorrect labels receive lower confidence. We share a similar view of noisy data, but instance weighting via confidence by itself is a discarding technique rather than a correcting one: a low confidence implies an attempt to eliminate the example, while a high confidence implies keeping it. By contrast, our algorithm considers both the correctly-labeled and the mislabeled probability of an instance.
Therefore, the loss function L_γ(y, f(x)) = γ e^{-y f(x)} + (1 - γ) e^{y f(x)} encodes the attitude toward an instance: its observed label is retained with weight γ and corrected with weight 1 - γ. In other words, our algorithm can be viewed as a composite of the discarding and correcting techniques.

For the same reason, our algorithm differs from those proposed in [14] and [6]. They suggested heuristic algorithms that delete or revise suspicious examples during the iterations in order to improve the accuracy of AdaBoost on mislabeled data. In our algorithm, suspicious labels are revised in a similar spirit, but as a consequence of minimizing the modified loss function (III.2): the trusted label at each sample point is the sign of the Bayes rule and is associated with a confidence level.

Other closely related work includes [45] and [50]. Both consider the confidence level of x_i to be p_i = P(z_i = 1 | x_i), whereas our approach takes advantage of the observed label y_i by considering γ_i = P(z_i = y_i | x_i). We evaluate the confidence of the observed label y_i, while they assess the confidence of the

positive label +1. In [50], the initial weight |2p_i - 1| is very similar to our choice, but our re-weighting and classifier combination rules are different; [45] has a combination rule similar to ours, but its initial weights are different.

IV. CONSISTENCY OF CB-ADABOOSTING

In this section, we study the consistency of the proposed CB-AdaBoosting method with label confidences estimated by the KNN approach. Several authors have shown that the original and modified versions of AdaBoost are consistent. For example, Zhang and Yu [47] considered a general boosting with a step-size restriction, Lugosi and Vayatis [23] proved the consistency of regularized boosting methods, and Bartlett and Traskin [3] studied the stopping rule of the traditional AdaBoost that guarantees its consistency. Our algorithm uses the same exponential loss function, only with a different empirical version of the exponential risk. This enables us to adopt the stopping strategy used in [3], together with a consistency result for the nearest-neighbor method ([34], [8]), to show that the proposed CB-AdaBoost is Bayes-risk consistent.

We use notation similar to [3]. Let (X, Z) be a pair of random values in R^p × {-1, 1} with joint distribution P_{X,Z} and marginal distribution P_X of X. The training sample L_n = {(x_1, y_1), ..., (x_n, y_n)} is available, having the same distribution as (X, Z). The mislabel problem can be treated as the case where P_{X,Z} is a contaminated distribution. CB-AdaBoost produces a classifier g_n = sgn(f_n): R^p → {-1, 1} based on the sample L_n. The misclassification probability is L(g_n) = P(g_n(X) ≠ Z | L_n). Our goal is to prove that L(g_n) approaches the Bayes risk

  L* = inf_f L(f) = E(min(η(X), 1 - η(X)))

as n → ∞, where the infimum is taken over all measurable classifiers and η(x) is the conditional probability η(x) = P(Z = 1 | X = x).

Assume that the base class H has a finite VC dimension and that the algorithm searches over linear combinations of classifiers in H. The proposed CB-AdaBoost finds a combination f_n of classifiers in H that minimizes

  R_{n,k_n}(f) = (1/n) Σ_{i=1}^n [γ̂_i exp(-y_i f(x_i)) + (1 - γ̂_i) exp(y_i f(x_i))],

where γ̂_i is a K-NN estimator of γ_i = P(Z = y_i | x_i), that is,

  γ̂_i = (1/k_n) Σ_{x_j ∈ N(x_i)} I(y_j = y_i),

with N(x_i) denoting the set of the k_n nearest neighbors of x_i. We also write

  R_n(f) = (1/n) Σ_{i=1}^n [γ_i exp(-y_i f(x_i)) + (1 - γ_i) exp(y_i f(x_i))],

and the true exponential risk as

  R(f) = E_X E_{Z|X} exp(-Z f(X)) = E exp(-Z f(X)).

We first prove that CB-AdaBoost is consistent for the exponential risk. Then, by [2], its 0-1 risk also approaches the Bayes risk L*, since the exponential loss is classification-calibrated. We denote the convex hull of H scaled by λ ≥ 0 as

  F_λ = {f : f = Σ_{i=1}^N β_i h_i, N ∈ ℕ ∪ {0}, β_i ≥ 0, Σ_{i=1}^N β_i = λ, h_i ∈ H},

and the set of t-combinations, t ∈ ℕ, of functions in H as

  F^t = {f : f = Σ_{i=1}^t β_i h_i, β_i ∈ R, h_i ∈ H}.

Define the truncated function π_l(·) by π_l(x) = x I(x ∈ [-l, l]) + l sgn(x) I(x ∉ [-l, l]), where I(·) is the indicator function. The set of truncated functions is π_l F = {f̃ : f̃ = π_l(f), f ∈ F}, and the set of classifiers based on a class F is g F = {f̃ : f̃ = g(f), f ∈ F}.

Based on the stopping strategy of [3] and the universal consistency of the nearest-neighbor function estimate of [8], we have the following proposition.

Proposition 4. Assume that V = d_VC(H) < ∞ and that H is dense in the sense that lim_{λ→∞} inf_{f ∈ F_λ} R(f) = R*. Further assume k_n → ∞, k_n/n → 0 and t_n = n^{1-a} for some a ∈ (0, 1). Then CB-AdaBoost stopped at step t_n returns a sequence of classifiers almost surely satisfying L(g(f_n)) → L*.

The proposition states the strong consistency of the proposed CB-AdaBoost method if it is stopped at t_n = n^{1-a} and the number of neighbors used to estimate the label confidence satisfies k_n → ∞ with k_n/n → 0. A proof of Proposition 4 is given in the Appendix.
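Before turning to the empirical study, the following minimal sketch pulls Algorithm 2 and the KNN confidence rule (III.14) together. It is an illustrative re-implementation rather than the authors' code: the function names are ours, labels are assumed to be in {-1, +1}, the confidences γ̂_i are computed by plain KNN on the full training set (without the noise filter of Section III-E), and the resampling step of Algorithm 2 is replaced by an equivalent weighted fit of a scikit-learn decision stump.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

def knn_label_confidence(X, y, k=5):
    """Estimate gamma_i = P(Z = y_i | x_i) as the label-agreement rate among
    the k nearest neighbors of x_i, cf. (III.14)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    idx = idx[:, 1:]                               # drop each point itself
    return (y[idx] == y[:, None]).mean(axis=1)

def cb_adaboost_fit(X, y, gamma, M=200):
    """CB-AdaBoost (Algorithm 2) with decision stumps as base learners."""
    w1 = gamma.astype(float).copy()                # weight on the observed label y_i
    w2 = 1.0 - gamma.astype(float)                 # weight on the corrected label -y_i
    learners, betas = [], []
    for _ in range(M):
        diff = w1 - w2
        y_trust = np.where(diff >= 0, y, -y)       # trusted label sgn[(w1 - w2) y_i]
        D = np.abs(diff) / np.abs(diff).sum()      # sampling distribution D_i
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y_trust, sample_weight=D)     # weighted fit instead of resampling
        pred = stump.predict(X)
        agree = pred == y                          # agreement with the observed label
        num = w1[agree].sum() + w2[~agree].sum()
        den = w1[~agree].sum() + w2[agree].sum()
        if num <= den:                             # condition (III.12) fails: stop early
            break
        beta = 0.5 * np.log(num / den)             # coefficient (III.11)
        w1 = w1 * np.exp(-y * beta * pred)         # weight update (III.13)
        w2 = w2 * np.exp(y * beta * pred)
        learners.append(stump)
        betas.append(beta)
    return learners, betas

def cb_adaboost_predict(X, learners, betas):
    """Weighted-majority vote sign(sum_m beta_m h_m(x))."""
    score = sum(b * h.predict(X) for h, b in zip(learners, betas))
    return np.where(score >= 0, 1, -1)
```

Setting every γ_i to 1 makes w2 identically zero, and the loop then reduces to the usual AdaBoost updates, matching the remark in Section III-F that CB-AdaBoost contains AdaBoost as a special case.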
V. EXPERIMENTS

To begin, we run three experiments investigating the performance of the proposed algorithm on synthetic data. The first examines the quality of the assigned label confidence, since it has a great impact on the effectiveness of the proposed method. The second explores the advantages of the proposed algorithm over other commonly used methods for dealing with noisy data. The third demonstrates the significant differences in weights between the proposed algorithm and the original AdaBoost. We generate random samples from two scenarios with increasing levels of label noise.

Normal: the two classes are sampled from the bivariate normal distributions N((0, 0)^T, I) and N((2, 2)^T, I), respectively.

Sine: random vectors x_i = (x_{i1}, x_{i2})^T are drawn uniformly on [-3, 3] × [-3, 3], and their labels are assigned according to the conditional probability P(z_i = y | x_i) = e^{y g(x_i)} / (e^{y g(x_i)} + e^{-y g(x_i)}), where y ∈ {1, -1} and g(x_i) = (x_{i2} - 3 sin x_{i1}) / 2.
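The two synthetic scenarios can be reproduced with the sketch below. It is illustrative only: the random seed, the even class split in the Normal scenario, and the reuse of knn_label_confidence and cb_adaboost_fit from the previous sketch are our own choices, not specified by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_normal(n):
    """Two bivariate normals N((0,0)^T, I) and N((2,2)^T, I) with labels -1 / +1."""
    n_neg = n // 2
    X = np.vstack([rng.normal(0.0, 1.0, size=(n_neg, 2)),
                   rng.normal(2.0, 1.0, size=(n - n_neg, 2))])
    y = np.concatenate([-np.ones(n_neg, dtype=int), np.ones(n - n_neg, dtype=int)])
    return X, y

def make_sine(n):
    """x uniform on [-3, 3]^2 with P(z = y | x) = e^{y g(x)} / (e^{y g(x)} + e^{-y g(x)}),
    g(x) = (x_2 - 3 sin x_1) / 2."""
    X = rng.uniform(-3.0, 3.0, size=(n, 2))
    g = (X[:, 1] - 3.0 * np.sin(X[:, 0])) / 2.0
    p_pos = np.exp(g) / (np.exp(g) + np.exp(-g))   # P(z = +1 | x)
    return X, np.where(rng.uniform(size=n) < p_pos, 1, -1)

def flip_labels(y, noise_level):
    """Reverse the labels of a randomly chosen fraction of training instances."""
    y = y.copy()
    flip = rng.choice(len(y), size=int(noise_level * len(y)), replace=False)
    y[flip] = -y[flip]
    return y

# Example pipeline (function names from the previous sketch):
# X, y = make_sine(500)
# y_noisy = flip_labels(y, 0.20)
# gamma = knn_label_confidence(X, y_noisy, k=5)
# learners, betas = cb_adaboost_fit(X, y_noisy, gamma, M=200)
```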

Fig. 1. Testing errors of each method (AdaBoost, CORR, DISC, CB-AdaBoost, and a single stump) under different noise levels (0%, 10%, 20% and 30%) as the number of iterations increases. Data sets consist of 50/500 training observations and 10000 testing instances.

We introduce mislabeled data by randomly choosing training instances and reversing their labels. We then carry out experiments on real data sets from the UCI repository [20]. Seventeen data sets of different sizes and with different numbers of input variables are used to compare the performance of the proposed algorithm with several existing robust boosting methods. We set the number of iterations M to 200 for all ensemble classifiers. The base classifier used in AdaBoost and CB-AdaBoost is the classification stump, the simplest one-level decision tree.

A. Assessing the quality of label confidence

The label confidence of clean instances is expected to be high, while for mislabeled instances it should be low. In this experiment, we examine the two assignment methods introduced in Section III-E by assessing the quality of their label-confidence estimates. We use the Bayesian method on the Normal data, for which the noise level is known to be 0%, 10% and 20%, respectively, and the KNN method on the Sine data. The number of nearest neighbors K used in KNN is selected from the range 3 to 15 and is set to 5 as a balance between accuracy and computational efficiency.

Table I reports the average and standard deviation of the confidences of clean and mislabeled samples, calculated over 30 repetitions. As expected, there is a clear separation in confidence between the two types of samples: on average, clean labels achieve a much higher degree of confidence than corrupted ones. For example, under 10% contamination of the normal samples with n = 500, the confidence of the clean samples is 0.9172 compared with 0.0855 for the mislabeled samples; for the small size n = 50, the difference is also significant, with 0.8919 for clean versus 0.0581 for mislabeled samples. As the noise level increases, the difference in label confidence between clean and mislabeled data becomes smaller. This phenomenon, mentioned in [27], is understandable because the certainty decreases in highly noisy data and the assignment methods become more conservative than they are on low-noise data.

B. Comparisons with discarding and correcting methods

We compare the efficiency of label-confidence based learning with the discarding and correcting techniques. For the latter two, a threshold on the confidence is pre-specified to define suspect samples. We consider four types of classifiers: 1) AdaBoost; 2) AdaBoost working on the data with the suspected samples discarded (DISC); 3) AdaBoost working on the original training set but with the suspected labels having been

corrected (CORR); 4) CB-AdaBoost. We repeat the procedure 30 times and record the test errors of the four classifiers.

Fig. 1 illustrates how the average test error changes as the number of iterations increases for the different classifiers, based on a training set of size 50; the threshold is set to 0.5 for the DISC and CORR methods. AdaBoost greatly improves the prediction accuracy of the stump (a simple one-level decision tree) on clean data, but its boosting ability is limited when the training set is corrupted, especially at high noise levels, where it performs even worse than a single stump. This demonstrates that AdaBoost is indeed very sensitive to noise. It also suffers from overfitting at the 0% noise level if the number of iterations becomes large. With preprocessing (CORR or DISC), AdaBoost does well at the beginning, but its accuracy decreases as a large number of base learners accumulate. Compared with the above methods, our proposed algorithm shows better performance on clean data and better robustness against noise. Moreover, it avoids overfitting by ceasing the learning process at an early iteration (as early as 40).

TABLE II
AVERAGE AND STANDARD DEVIATION OF TESTING ERRORS OF EACH METHOD UNDER DIFFERENT NOISE LEVELS. THE DISCARDING AND CORRECTING METHODS USE 0.20, 0.50 AND 0.80 AS CONFIDENCE THRESHOLDS.

Data    Level  n    AdaBoost     DISC20       DISC50       DISC80       CORR20       CORR50       CORR80       CB-AdaBoost
Normal  0%     50   .1453±.0302  .1407±.0308  .1510±.0410  .1719±.0410  .1457±.0537  .1413±.0429  .1484±.0272  .1070±.0168
               500  .0942±.0040  .0863±.0038  .0848±.0037  .0899±.0074  .0857±.0037  .0833±.0026  .0888±.0044  .0809±.0032
        10%    50   .1979±.0419  .1447±.0464  .1562±.0452  .1779±.0544  .1410±.0458  .1468±.0441  .1482±.0627  .1128±.0269
               500  .1296±.0108  .0901±.0044  .0881±.0049  .0949±.0078  .0895±.0042  .0863±.0044  .0921±.0052  .0835±.0048
        20%    50   .2749±.0576  .1678±.0418  .1769±.0528  .2041±.0490  .1651±.0453  .1782±.0530  .1900±.0577  .1390±.0366
               500  .1742±.0183  .0967±.0069  .0882±.0047  .1048±.0142  .0953±.0070  .0857±.0047  .1099±.0164  .0849±.0050
        30%    50   .3446±.0391  .2602±.0771  .2450±.0872  .3270±.1381  .2679±.0782  .2497±.1169  .2994±.0882  .2375±.1195
               500  .2474±.0298  .1457±.0205  .1015±.0134  .6049±.4305  .1410±.0220  .1014±.0131  .2397±.0699  .1028±.0173
Sine    0%     50   .2305±.0221  .2271±.0272  .2358±.0508  .2420±.0281  .2306±.0316  .2307±.0330  .2402±.0228  .2139±.0188
               500  .1934±.0074  .1850±.0071  .1859±.0090  .1871±.0076  .1851±.0073  .1861±.0083  .1891±.0087  .1834±.0067
        10%    50   .2872±.0303  .2497±.0450  .2430±.0343  .2469±.0353  .2408±.0310  .2428±.0328  .2566±.0339  .2318±.0299
               500  .2242±.0086  .1961±.0089  .1934±.0083  .1902±.0081  .1954±.0099  .1926±.0085  .1931±.0100  .1887±.0098
        20%    50   .3247±.0433  .2782±.0406  .2761±.0395  .2848±.0717  .2754±.0432  .2786±.0494  .2978±.0558  .2672±.0540
               500  .2641±.0162  .2295±.0135  .2236±.0135  .2166±.0147  .2250±.0129  .2217±.0140  .2275±.0135  .2096±.0168
        30%    50   .4017±.0545  .3349±.0711  .3433±.0822  .3448±.0793  .3338±.0709  .3320±.0766  .3497±.0687  .3258±.0811
               500  .3166±.0310  .2676±.0292  .2671±.0270  .2576±.0264  .2659±.0267  .2661±.0285  .2654±.0290  .2264±.0278

Fig. 2. Average weights of different types of instances during the learning process in the original AdaBoost and CB-AdaBoost. The left panel shows the weights of mislabeled and clean-labeled instances; the right panel shows the weights of the groups with high and low label confidence.
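For reference, the discarding (DISC) and correcting (CORR) preprocessing baselines compared above can be sketched as follows. The thresholding logic mirrors the description in this section, while the use of scikit-learn's AdaBoostClassifier with a stump base learner is our own stand-in for the subsequent AdaBoost step (scikit-learn 1.2+ names the argument estimator; older versions use base_estimator).

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def disc(X, y, gamma, threshold=0.5):
    """DISC: drop instances whose label confidence falls below the threshold."""
    keep = gamma >= threshold
    return X[keep], y[keep]

def corr(X, y, gamma, threshold=0.5):
    """CORR: flip the labels of instances whose confidence falls below the threshold."""
    y = y.copy()
    suspect = gamma < threshold
    y[suspect] = -y[suspect]
    return X, y

def adaboost_stumps(X, y, M=200):
    """Plain AdaBoost on the (possibly cleaned) data, with one-level stumps."""
    return AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                              n_estimators=M).fit(X, y)
```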
Table II provides the test errors for the correcting and discarding methods under the thresholds 0.2, 0.5 and 0.8, denoted DISC20, DISC50 and DISC80 and CORR20, CORR50 and CORR80, respectively. CB-AdaBoost's performance is superior in 15 out of 16 cases. The only exception is the Normal data under the 30% noise level with n = 500, where CORR50 and DISC50 perform better. The advantage of CB-AdaBoost over the others is more pronounced for small sample sizes than for large ones. Neither the correcting nor the discarding method performs uniformly better at any one threshold than at the others, which makes their practical use with a reasonable confidence threshold difficult. It is worth mentioning that CB-AdaBoost uniformly outperforms AdaBoost even in the case without mislabels. This is because

there is overlap between the two classes and because the proposed loss function considers the true risk, which helps the classification achieve better performance, with a test error close to the theoretical minimum, namely the Bayes error.

C. Reweighting

This experiment illustrates the re-weighting differences between the original AdaBoost and the proposed method. Fig. 2 plots how the average weights of different groups of instances change as the number of iterations increases. First, we consider two groups: mislabeled instances and clean-labeled instances. Their mean weights are plotted in the left panel of Fig. 2. As the learning process continues, the mean weight of the noisy data in AdaBoost (mis-AdaBoost, the top red curve) rises rapidly and stays at a level much higher than that of CB-AdaBoost (mis-CB, the middle red curve). If the iterations are not stopped in time, the weak classifiers trained on heavily weighted noisy data become unreliable. By contrast, our proposed method does not place too much weight on noisy examples. The right panel of Fig. 2 shows the groups divided by certainty degree (higher than 0.7 or not). The plot clearly demonstrates the features of the weighting rule in CB-AdaBoost: instances with high certainty are initialized with larger weights and their average weight declines after they have been fully trained, whereas the average weights of the others increase and remain at high values until the iterations stop. This adaptive ability is not present in AdaBoost.

D. Real data sets

In addition, we conducted experiments on 17 real data sets available from the UCI repository [20]. Since we focus on the two-class problem, the classes of several multi-class data sets are combined into two classes. If the class variable is nominal, class 1 is treated as the positive class and the remaining classes are treated as the negative class. If the class variable is ordinal, we merge the classes with similar properties; for example, in the Cardiotocography data, the Suspect and Pathologic classes are combined into the positive class and Normal is the negative class. For the Urban Land Cover data set, we combine the training and test sets, and any instances with missing values are removed. Table III summarizes the main characteristics of all data sets. For each data set, half of the instances are randomly selected as the training set and the remaining are used for testing. Mislabels at the 10%, 20% and 30% levels are introduced into the training data by randomly choosing training instances and reversing their labels.

For comparison, we consider another boosting method, LogitBoost, in addition to the two modified AdaBoost algorithms MadaBoost [10] and β-Boosting [49] (the original paper did not name the method; we call it β-Boosting after the β parameter added to the algorithm, as suggested by a reviewer), all of which are robust against noisy data. The procedure is repeated 30 times, and the average of the 30 test errors of each classifier is taken as the measure of its performance.

According to Table IV, CB-AdaBoost performs better than the original AdaBoost in all cases except one, which
is the case of Musk under the 10% noise level. It also greatly improves the accuracy of the stump (i.e., the base classifier). β-Boosting, MadaBoost and LogitBoost show robustness to mislabeled data: they outperform AdaBoost in most cases, with LogitBoost achieving a lower test error than the other two. However, like AdaBoost, they suffer from overfitting because their weight distributions do not allow them to stop the iterations. This problem is overcome by CB-AdaBoost and, as a result, the win-lose counts of the proposed algorithm against these three robust algorithms are 42-9, 48-3 and 38-13, respectively. We conducted the sign test based on the counts of wins, losses and ties [7] in order to quantify the significance of the proposed method. Table V lists the frequency with which, and the significance level at which, CB-AdaBoost beats each of the other algorithms on the 17 data sets at each noise level. This demonstrates the effectiveness and advantages of CB-AdaBoost in handling mislabeled data.

VI. CONCLUSION

In this paper, we have provided a label-confidence based boosting method that is largely immune to label noise and overfitting. With the assignment of confidence, the proposed algorithm distinguishes between clean and contaminated instances, and the confidence values represent different levels of judgment about label reliability. Under the guidance of confident instances, CB-AdaBoost minimizes the loss function over the training set under the conditional risk. Moreover, in CB-AdaBoost explicit solutions for the weak learners and their coefficients at each stage are easily obtained and applied in practice. In comparison with common noise-handling techniques and other robust algorithms, CB-AdaBoost does a better job of tackling class overlap and mislabeling.

The proposed method has some limitations. The computational complexity of CB-AdaBoost is O(n^2 d), where n is the sample size and d is the dimension. This is because the label confidence of each instance must be computed or estimated, and the KNN method for label-confidence evaluation has computational complexity O(n^2 d); the remaining part of CB-AdaBoost is O(n^{2-a} d) with a ∈ (0, 1). Collectively, this yields an overall computational complexity of O(n^2 d), which may be prohibitive for large-scale applications. As currently formulated, the proposed method also cannot directly handle categorical or symbolic features; a similarity metric on such features would need to be introduced to define neighbors for the label-confidence assignment.

This work can be continued in several directions. A general framework of optimization strategies based on the conditional risk deserves a deeper understanding and further development. In the current work, KNN is used to estimate the confidence of each instance. Theoretically, the number of neighbors should go to infinity at a rate slower than the sample size to ensure strong consistency of the KNN

VI. CONCLUSION

In this paper, we have provided a label-confidence based boosting method that is substantially immune to label noise and overfitting. With the assignment of confidence, the proposed algorithm distinguishes between clean and contaminated instances, and the confidence values represent different levels of judgment on label reliability. Under the guidance of the confident instances, CB-AdaBoost is able to minimize the loss function over the training set under the conditional risk. Moreover, in CB-AdaBoost, explicit solutions for the weak learners and their coefficients at each stage can be easily obtained and applied in practice. In comparisons with some common noise-handling techniques and other robust algorithms, CB-AdaBoost does a better job of tackling class overlapping and mislabeling.

The proposed method has some limitations. The computational complexity of CB-AdaBoost is O(n^2 d), where n is the sample size and d is the dimension. This is because we need to compute or estimate the label confidence of each instance, and the KNN method used for label-confidence evaluation has complexity O(n^2 d); the remaining steps of CB-AdaBoost cost O(n^{2-a} d) with a ∈ (0, 1). Collectively, this yields an overall computational complexity of O(n^2 d), which may be prohibitive for large-scale applications (a generic sketch of such a neighborhood-based confidence estimate is given at the end of this section). As currently formulated, the proposed method cannot directly handle categorical or symbolic features; a similarity metric on such features needs to be introduced to define neighbors for label-confidence assignment.

Continuation of this work could take several directions. A general optimization framework based on the conditional risk deserves deeper understanding and further development. In the current work, KNN is used to estimate the confidence of each instance. Theoretically, the number of neighbors should go to infinity at a rate slower than the sample size to ensure strong consistency of the KNN estimator. In practice, however, a small number of neighbors seems to be sufficient; perhaps a proof of consistency exists without the condition k_n → ∞. It will be interesting to study the impact of the parameter k and to discuss a proper selection of the number of neighbors in practice; for example, the cross-validation method for choosing k deserves further investigation. In fact, the problem of how to design a good criterion for confidence assignment is still open, and other methods are needed to produce high-quality confidences, especially when categorical features are involved. Since CB-AdaBoost outperforms AdaBoost on class-overlapping problems, it is also promising to extend CB-AdaBoost to multi-class classification and to other applications such as image or object recognition.
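To make the dominant O(n^2 d) term concrete, the sketch below implements a generic k-nearest-neighbor neighborhood-agreement score with brute-force pairwise distances. It is an illustration under our own simplifying assumptions, not the exact confidence estimator used by CB-AdaBoost; the O(n^2 d) cost comes from the full distance matrix, and in practice a tree-based neighbor search would reduce it.

    import numpy as np

    def knn_label_agreement(X, y, k=10):
        """Generic neighborhood-agreement score: the fraction of an instance's k
        nearest neighbors that carry the same label (a stand-in for label confidence).
        The pairwise distance matrix costs O(n^2 d) time and O(n^2) memory."""
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # (n, n) distances
        np.fill_diagonal(dist, np.inf)                # exclude each point itself
        nn = np.argsort(dist, axis=1)[:, :k]          # indices of the k nearest neighbors
        agree = (y[nn] == y[:, None]).mean(axis=1)    # per-instance label agreement in [0, 1]
        return agree

Swapping the brute-force distance matrix for, e.g., a KD-tree neighbor query changes only the neighbor-search cost, not the downstream use of the agreement scores.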
VII. APPENDIX

Proof of Proposition 4. Let $\{\bar f_n\}_{n=1}^{\infty}$ be a sequence of reference functions such that $R(\bar f_n) \to R^*$. We shall prove that there exist non-negative sequences $t_n$, $\xi_n$ and $k_n$, with $k_n \to \infty$ and $k_n/n \to 0$, such that the following conditions are satisfied.

Uniform convergence of $t_n$-combinations:
$$\sup_{f \in \pi_{\xi_n}\mathcal{F}^{t_n}} |R(f) - R_n(f)| \xrightarrow{a.s.} 0; \tag{VII.1}$$

Empirical convergence for the sequence $\{\bar f_n\}$:
$$|R_n(\bar f_n) - R(\bar f_n)| \xrightarrow{a.s.} 0; \tag{VII.2}$$

Convergence of the KNN estimates:
$$|R_{n,k_n}(\bar f_n) - R_n(\bar f_n)| \xrightarrow{a.s.} 0; \tag{VII.3}$$

Algorithmic convergence of $t_n$-combinations:
$$R_{n,k_n}(f_{t_n}) - R_{n,k_n}(\bar f_n) \xrightarrow{a.s.} 0. \tag{VII.4}$$

Since $R_n(f)$ is an empirical exponential risk, a proof of (VII.1) follows exactly the same lines as Lemma 4 in [3], with the Lipschitz constant $L_\xi = (e^{\xi} - e^{-\xi})/(2\xi)$ and $M_\xi = e^{\xi}$. Then for any $\delta > 0$, with probability at least $1-\delta$,
$$\sup_{f \in \pi_{\xi}\mathcal{F}^{t}} |R(f) - R_n(f)| \le c\,\xi L_\xi \sqrt{\frac{(V+1)(t+1)\log_2[2(t+1)/\ln 2]}{n}} + M_\xi \sqrt{\frac{\ln(1/\delta)}{2n}}, \tag{VII.5}$$
where $V = d_{VC}(\mathcal{H})$ and $c = 24\int_0^1 \sqrt{\ln(8e/\epsilon)}\,d\epsilon$. We can take $t = n^{1-a}$ and $\xi = \kappa\ln n$ with $\kappa > 0$, $a \in (0,1)$ and $2\kappa - a < 0$, so that the right side of inequality (VII.5) converges to 0 and, at the same time, $\sum_{n=1}^{\infty}\delta_n < \infty$. Hence an application of the Borel–Cantelli lemma ensures the almost-sure convergence in (VII.1).

Applying Theorem 8 of [3], we obtain (VII.4), in which the reference sequence satisfies $\bar f_n \in \mathcal{F}_{\lambda_n}$ with $\lambda_n = \kappa_1 \ln n$, where $\kappa_1 \in (0, 1/2)$.

(VII.2) can be proved by Hoeffding's inequality if the range of $\bar f_n$ is restricted to the interval $[-\lambda_n, \lambda_n]$. That is,
$$P\big(|R_n(\bar f_n) - R(\bar f_n)| \ge \epsilon_n\big) \le \exp(-2n\epsilon_n^2/M_{\lambda_n}^2) := \delta_n,$$
where $M_{\lambda_n} = e^{\lambda_n} - e^{-\lambda_n}$. Let $\lambda_n = \kappa_1\ln n$ with $\kappa_1 \in (0, 1/2)$. Letting $\epsilon_n \to 0$ slowly enough, we still have $\sum_{n=1}^{\infty}\delta_n < \infty$, and hence (VII.2) holds with probability 1.

By the result of Theorem 1 in [8], for each KNN estimate $\hat\gamma$ with $k_n \to \infty$ and $k_n/n \to 0$, we have $P(2|\hat\gamma - \gamma| > \epsilon_n) \le \exp[-n\epsilon_n^2/(8N_p^2)]$, where the constant $N_p$ is the minimal number of cones centered at the origin of angle $\pi/6$ that cover $\mathbb{R}^p$. Then, with the restriction of $\bar f_n$ to $[-\lambda_n, \lambda_n]$, we have
$$P\big(|R_{n,k_n}(\bar f_n) - R_n(\bar f_n)| > \epsilon_n\big) < \exp[-n\epsilon_n^2/(2M_{\lambda_n}^2 N_d^2) + \ln n] := \delta_n.$$
Again, the choice $\lambda_n = \kappa_1\ln n$ with $\kappa_1 \in (0, 1/2)$ guarantees $\sum_n \delta_n < \infty$ when $\epsilon_n = o(1)$ decays sufficiently slowly, and hence (VII.3) holds.

Now we are ready to prove Proposition 4. For almost every outcome $\omega$ of the probability space, we can define sequences $\epsilon_{n,i}(\omega) \downarrow 0$ for $i = 1,\ldots,5$ so that, for almost all $\omega$, the following inequalities hold:
$$\begin{aligned}
R(\pi_{\xi_n}(f_{t_n})) &\le R_n(\pi_{\xi_n}(f_{t_n})) + \epsilon_{n,1}(\omega) && \text{by (VII.1)}\\
&\le R_{n,k_n}(\pi_{\xi_n}(f_{t_n})) + \bar\epsilon_{n,2}(\omega) && \text{(VII.6)}\\
&\le R_{n,k_n}(f_{t_n}) + e^{-\xi_n} + \bar\epsilon_{n,2}(\omega) && \text{(VII.7)}\\
&\le R_{n,k_n}(\bar f_n) + e^{-\xi_n} + \bar\epsilon_{n,3}(\omega) && \text{by (VII.4)}\\
&\le R_n(\bar f_n) + e^{-\xi_n} + \bar\epsilon_{n,4}(\omega) && \text{by (VII.3)}\\
&\le R(\bar f_n) + e^{-\xi_n} + \bar\epsilon_{n,5}(\omega), && \text{by (VII.2)} \quad \text{(VII.8)}
\end{aligned}$$
where $\bar\epsilon_{n,k}(\omega) = \sum_{j=1}^{k}\epsilon_{n,j}(\omega)$. Inequality (VII.6) follows in the same way as (VII.3) with $\xi_n = \kappa_1\ln n$, where $\kappa_1 \in (0,1/2)$. Inequality (VII.7) follows from the facts that $e^{-\pi_{\xi_n}(x)} < e^{-x} + e^{-\xi_n}$ and $e^{\pi_{\xi_n}(x)} < e^{x} + e^{-\xi_n}$.

Then with $t_n = n^{1-a}$, $\xi_n = \kappa\ln n$ ($a > 0$, $\kappa > 0$, $2\kappa < a$) and (VII.8), by the choice of the sequence $\{\bar f_n\} \subset \mathcal{F}_{\lambda_n}$ with $\lambda_n = \kappa_1\log n$, $\kappa_1 \in (0,1/2)$, we have $R(\bar f_n) \to R^*$ and $R(\pi_{\xi_n}(f_{t_n})) \to R^*$ a.s. By Theorem 3 of [2], $L(g(\pi_{\xi_n}(f_{t_n}))) \xrightarrow{a.s.} L^*$. Since for $\xi_n > 0$ we have $g(\pi_{\xi_n}(f_{t_n})) = g(f_{t_n})$, it follows that $L(g(f_{t_n})) \xrightarrow{a.s.} L^*$. Hence, the proposed CB-AdaBoost procedure is consistent if stopped after $t_n$ steps.

ACKNOWLEDGMENT

Zhi Xiao and Bo Zhong are supported by the China National Science Foundation (NSF) under Grant No. 71171209. The authors would like to thank Yixin Chen for discussing the conditional risk.

REFERENCES

[1] H. Allende-Cid et al., Robust alternating AdaBoost, in: Progress in Pattern Recognition, Image Analysis and Applications, Springer, Berlin Heidelberg, 2007, 427-436.
[2] P.L. Bartlett, M. Jordan and J.D. McAuliffe, Convexity, classification, and risk bounds, J. Amer. Stat. Assoc. 101 (2006) 138-156.
[3] P.L. Bartlett and M. Traskin, AdaBoost is consistent, J. Mach. Learn. Res. 8 (2007) 2347-2368.
[4] C.E. Brodley and M.A. Friedl, Identifying and eliminating mislabeled training instances, in: AAAI/IAAI, 1 (1996) 799-805.
[5] C.E. Brodley and M.A. Friedl, Identifying mislabeled training data, J. Artif. Intell. Res. 11 (1999) 131-167.
[6] J. Cao, S. Kwong and R. Wang, A noise-detection based AdaBoost algorithm for mislabeled data, Pattern Recogn. 45 (12) (2012) 4451-4465.
[7] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1-30.
[8] L. Devroye, L. Györfi, A. Krzyżak and G. Lugosi, On the strong universal consistency of nearest neighbor regression function estimates, Ann. Stat. 22 (3) (1994) 1371-1385.
[9] T. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization, Mach. Learn. 40 (2) (2000) 139-157.
[10] C. Domingo and O. Watanabe, MadaBoost: a modification of AdaBoost, in: COLT (2000) 180-189.
[11] B. Frénay and A. Kabán, A comprehensive introduction to label noise, in: Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), 2014, Bruges, Belgium.
[12] Y. Freund and R.E. Schapire, Experiments with a new boosting algorithm, in: Proceedings of the Thirteenth International Conference on Machine Learning (ICML), 1996, 148-156.
[13] J. Friedman, T. Hastie and R. Tibshirani, Additive logistic regression: a statistical view of boosting, Ann. Stat. 28 (2) (2000) 337-374.
[14] Y. Gao and F. Gao, Edited AdaBoost by weighted kNN, Neurocomputing 73 (2010) 3079-3088.
[15] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, 2001.
[16] K. Hayashi, A simple extension of boosting for asymmetric mislabeled data, Stat. Probabil. Lett. 82 (2) (2012) 348-356.
[17] Y. Jiang and Z. Zhou, Editing training data for kNN classifiers with neural network ensemble, in: Advances in Neural Networks, ISNN 2004, Springer, Berlin Heidelberg, 2004, 356-361.
[18] T. Kanamori, T. Takenouchi and S. Eguchi, The most robust loss function for boosting, in: Neural Information Processing, ICONIP 2004, Springer, Berlin Heidelberg, 2004, 496-501.
[19] T. Kanamori, T. Takenouchi and S. Eguchi, Robust loss functions for boosting, Neural Comput. 19 (8) (2007) 2183-2244.
[20] M. Lichman, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], 2013, Irvine, CA: University of California, School of Information and Computer Science.
[21] C. Lin and S. Wang, Fuzzy support vector machines, IEEE Trans. Neural Netw. 13 (2) (2002) 464-471.
[22] H. Liu and S. Zhang, Noisy data elimination using mutual k-nearest neighbor for classification mining, J. Syst. Software 85 (5) (2012) 1067-1074.
[23] G. Lugosi and N. Vayatis, On the Bayes-risk consistency of regularized boosting methods, Ann. Stat. 32 (1) (2004) 30-55.
[24] H. Masnadi-Shirazi and N. Vasconcelos, Cost-sensitive boosting, IEEE T. Pattern Anal. Mach. Intell. 33 (2) (2011) 294-309.
[25] P. Melville, N. Shah, L. Mihalkova and R. Mooney, Experiments on ensembles with missing and noisy data, in: Proc. of the Fifth International Workshop on Multiple Classifier Systems (2004) 293-302.
[26] T. Onoda, Overfitting of boosting and regularized boosting algorithms, Electron. Comm. Jpn. Part 3, 90 (9) (2007) 69-78.
[27] U. Rebbapragada and C. Brodley, Class noise mitigation through instance weighting, in: Lecture Notes in Computer Science, ECML, Springer, Berlin Heidelberg, 2007, 708-715.
[28] J.A. Sáez, J. Luengo and F. Herrera, Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification, Pattern Recogn. 46 (1) (2013) 355-364.
[29] J.S. Sánchez, R. Barandela, A.I. Marqués et al., Analysis of new techniques to obtain quality training sets, Pattern Recogn. Lett. 24 (7) (2003) 1015-1022.
[30] R.E. Schapire and Y. Singer, Improved boosting algorithms using confidence-rated predictions, Mach. Learn. 37 (3) (1999) 297-336.
[31] R.E. Schapire and Y. Freund, Boosting: Foundations and Algorithms, The MIT Press, 2012.
[32] R.A. Servedio, Smooth boosting and learning with malicious noise, J. Mach. Learn. Res. 4 (2003) 633-648.
[33] I. Steinwart and A. Christmann, Support Vector Machines, Springer, New York, 2008.
[34] C.J. Stone, Consistent nonparametric regression, Ann. Stat. 5 (4) (1977) 595-620.
[35] Y. Sun, J. Li and W. Hager, Two new regularized AdaBoost algorithms, in: Proc. ICMLA (2004) 11.
[36] Y. Sun, S. Todorovic and J. Li, Reducing the overfitting of AdaBoost by controlling its data distribution skewness, Int. J. Pattern Recogn. 20 (7) (2006) 1093-1116.
[37] T. Takenouchi and S. Eguchi, Robustifying AdaBoost by adding the naive error rate, Neural Comput. 16 (4) (2004) 767-787.
[38] Q. Tao, G. Wu, F. Wang and J. Wang, Posterior probability support vector machines for unbalanced data, IEEE T. Neural Netw. 16 (6) (2005) 1561-1573.
[39] J. Thongkam, G. Xu, Y. Zhang and F. Huang, Toward breast cancer survivability prediction models through improving training space, Expert Syst. Appl. 36 (2009) 12200-12209.
[40] L. Utkin and Y. Zhuk, Robust boosting classification models with local sets of probability distributions, Knowl.-Based Syst. 61 (2014) 59-75.
[41] S. Verbaeten and A. Van Assche, Ensemble methods for noise elimination in classification problems, in: Multiple Classifier Systems, Springer, Berlin Heidelberg, 2003, 317-325.
[42] A. Vezhnevets and V. Vezhnevets, Modest AdaBoost - teaching AdaBoost to generalize better, in: Graphicon-2005, Novosibirsk Akademgorodok, Russia, 2005.
[43] A. Vezhnevets and O. Barinova, Avoiding boosting overfitting by removing confusing samples, in: Lecture Notes in Computer Science, ECML, Springer, Berlin Heidelberg, 2007, 430-441.
[44] P. Wang, C.H. Shen, N. Barnes and H. Zheng, Fast and robust object detection using asymmetric totally corrective boosting, IEEE T. Neural Netw. Learn. Syst. 23 (1) (2012) 33-46.
[45] W. Wang, Y. Wang, F. Chen and A. Sowmya, A weakly supervised approach for object detection based on soft-label boosting, in: IEEE Workshop on Applications of Computer Vision, 2013, 331-338.
[46] Y. Freund and R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1) (1997) 119-139.
[47] T. Zhang and B. Yu, Boosting with early stopping: convergence and consistency, Ann. Stat. 33 (4) (2005) 1538-1579.
[48] C.X. Zhang and J.S. Zhang, A local boosting algorithm for solving classification problems, Comput. Stat. Data An. 52 (4) (2008) 1928-1941.
[49] C.X. Zhang, J.S. Zhang and G.Y. Zhang, An efficient modified boosting method for solving classification problems, J. Comput. Appl. Math. 214 (2) (2008) 381-392.
[50] D. Zhou, B. Quost and V. Frémont, Soft label based semi-supervised boosting for classification and object recognition, in: 13th International Conference on Control, Automation, Robotics and Vision (ICARCV), 2014, Singapore.

Zhi Xiao is a professor and the chair of the Information Management Department at Chongqing University. His research interests include operational optimization, statistics, forecasting, information intelligence analysis and data mining. Recently he has focused on soft sets and interdisciplinary big data analysis. Dr. Xiao has been the Principal Investigator of 50 funded projects. He has published 5 textbooks and more than 100 scientific papers in journals including Knowledge-Based Systems, Expert Systems with Applications, Applied Mathematical Modelling, Journal of Computational and Applied Mathematics, Computers & Mathematics with Applications, etc. Professor Xiao serves as Vice Executive Director of the China Information Economics Association, as Executive Officer of the National Statistical Society of China and as Vice President of the Chongqing Statistical Society.

Zhe Luo received the Master's degree in Probability and Mathematical Statistics from Chongqing University. Currently, he is an assistant manager at the Bank of China in Nanning. His research focuses on statistical decision, pattern recognition, cluster analysis and Monte Carlo simulations.

Bo Zhong is a professor in the Statistics and Actuary Department at Chongqing University. She is the director of the Graduate Mathematics Courses Program. Her research specialties are soft sets, soft computation, rough sets, statistical learning and reliability analysis in power systems. Dr. Zhong leads 30 funded projects, including 10 from national funding agencies. She has published 70 scientific papers in journals including Expert Systems with Applications and Knowledge-Based Systems, etc. She is also the author of 8 textbooks.

Xin Dang (M'17) received the PhD degree in Statistics from the University of Texas at Dallas in 2005. Currently she is an associate professor in the Department of Mathematics at the University of Mississippi. Her research interests include robust and nonparametric statistics, statistical and numerical computing, and multivariate data analysis. In particular, she has focused on data depth and its applications, bioinformatics, machine learning, and robust procedure computation. Dr. Dang is a member of the Institute of Mathematical Statistics, the American Statistical Association, the International Chinese Statistical Association, the International Neural Network Society and the IEEE. For further information, see home.olemiss.edu/xdang/.