Boosting with Abstention

Size: px
Start display at page:

Download "Boosting with Abstention"

Transcription

1 Boosting with Abstention Corinna Cortes Google Research New York, NY 00 Giulia DeSalvo Courant Institute New York, NY 00 Mehryar Mohri Courant Institute and Google New York, NY 00 Abstract We present a new boosting algorith for the key scenario of binary classification with abstention where the algorith can abstain fro predicting the label of a point, at the price of a fixed cost. At each round, our algorith selects a pair of functions, a base predictor and a base abstention function. We define convex upper bounds for the natural loss function associated to this proble, which we prove to be calibrated with respect to the Bayes solution. Our algorith benefits fro general argin-based learning guarantees which we derive for ensebles of pairs of base predictor and abstention functions, in ters of the Radeacher coplexities of the corresponding function classes. We give convergence guarantees for our algorith along with a linear-tie weak-learning algorith for abstention stups. We also report the results of several experients suggesting that our algorith provides a significant iproveent in practice over two confidence-based algoriths. Introduction Classification with abstention is a key learning scenario where the algorith can abstain fro aking a prediction, at the price of incurring a fixed cost. This is the natural scenario in a variety of coon and iportant applications. An exaple is spoken-dialog applications where the syste can redirect a call to an operator to avoid the cost of incorrectly assigning a category to a spoken utterance and isguiding the dialog anager. This requires the availability of an operator, which incurs a fixed and predefined price. Other exaples arise in the design of a search engine or an inforation extraction syste, where, rather than taking the risk of displaying an irrelevant docuent, the syste can resort to the help of a ore sophisticated, but ore tie-consuing classifier. More generally, this learning scenario arises in a wide range of applications including health, bioinforatics, astronoical event detection, active learning, and any others, where abstention is an acceptable option with soe cost. Classification with abstention is thus a highly relevant proble. The standard approach for tackling this proble is via confidence-based abstention: a real-valued function h is learned for the classification proble and the points x for which its agnitude hx is saller than soe threshold γ are rejected. Bartlett and Wegkap [ gave a theoretical analysis of this approach based on consistency. They introduced a discontinuous loss function taking into account the cost for rejection, upper-bounded that loss by a convex and continuous Double Hinge Loss DHL surrogate, and derived an algorith based on that convex surrogate loss. Their work inspired a series of follow-up papers that developed both the theory and practice behind confidence-based abstention [3, 5, 3. Further related works can be found in Appendix A. In this paper, we present a solution to the proble of classification with abstention that radically departs fro the confidence-based approach. We introduce a general odel where a pair h, r for a classifier h and rejection function r are learned siultaneously. Under this novel fraework, we present a Boosting-style algorith with Abstention, BA, that learns accurately the classifier and abstention functions. 
Note that the terinology of boosting with abstention was used by Schapire and Singer [6 to refer to a scenario where a base classifier is allowed to abstain, but 30th Conference on Neural Inforation Processing Systes NIPS 06, Barcelona, Spain.

2 hx = x + Figure : The best predictor h is defined by the threshold θ: hx = x + θ. For c <, the region defined by X η should be rejected. But the corresponding abstention function r defined by rx = x η cannot be defined as hx γ for any γ > 0. y hx < where the boosting algorith itself has to coit to a prediction. This is therefore distinct fro the scenario of classification with abstention studied here. Nevertheless, we will introduce and exaine a confidence-based Two-Step Boosting algorith, the TSB algorith, that consists of first training Adaboost and next searching for the best confidence-based abstention threshold. The paper is organized as follows. Section describes our general abstention odel which consists of learning a pair h, r siultaneously and copares it with confidence-based odels. Section 3. presents a series of theoretical results for the proble of learning convex ensebles for classification with abstention, including the introduction of calibrated convex surrogate losses and general datadependent learning guarantees. In Section 4, we use these learning bounds to design a regularized boosting algorith. We further prove the convergence of the algorith and present a linear-tie weak-learning algorith for a natural faily of abstention stups. Finally, in Section 5, we report several experiental results coparing the BA algorith with the DHL and the TSB algoriths. Preliinaries In this section, we first introduce a general odel for learning with abstention [7 and then copare it with confidence-based odels.. General abstention odel We assue as in standard supervised learning that the training and test points are drawn i.i.d. according to soe fixed but unknown distribution D over X {, +}. We consider the learning scenario of binary classification with abstention. Given an instance x X, the learner has the option of abstaining fro aking a prediction for x at the price of incurring a non-negative loss cx, or otherwise aking a prediction hx using a predictor h and incurring the standard zero-one loss yhx 0 where the true label is y. Since a rando guess achieves an expected cost of at ost, rejection only akes sense for cx<. We will odel the learner by a pair h, r where the function r : X R deterines the points x X to be rejected according to rx 0 and where the hypothesis h: X R predicts labels for non-rejected points via its sign. Extending the loss function considered in Bartlett and Wegkap [, the abstention loss for a pair h, r is defined as as follows for any x, y X {, +}: Lh, r, x, y = yhx 0 rx>0 + cx rx 0. The abstention cost cx is assued known to the learner. In the following, we assue that c is a constant function, but part of our analysis is applicable to the ore general case. We denote by H and R two failies of functions apping X to R and we assue the labeled saple S = x, y,..., x, y is drawn i.i.d. fro D. The learning proble consists of deterining a pair h, r H R that adits a sall expected abstention loss Rh, r, defined as follows: Rh, r = E x,y D [ yhx 0 rx>0 + c rx 0. Siilarly, we define the epirical loss of a pair h, r H R over the saple S by: R S h, r = E x,y S [ yhx 0 rx>0 + c rx 0, where x, y S indicates that x, y is drawn according to the epirical distribution defined by S.. Confidence-based abstention odel Confidence-based odels are a special case of the general odel for learning with rejection presented in Section. corresponding to the pair hx, rx = hx, hx γ, where γ is a paraeter x

3 that changes the threshold of rejection. This specific choice was based on consistency results shown in [. In particular, the Bayes solution h, r of the learning proble, that is where the distribution D is known, is given by h x = ηx and r x = h x c where ηx = P[Y = + x for any x X, but note that this is not a unique solution. The for of h x follows by a siilar reasoning as for the standard binary classification proble. It is straightforward to see that the optial rejection function r is non-positive, eaning a point is rejected, if and only if in{ηx, ηx} c. Equivalently, the following holds: ax{ηx, ηx} c if and only if ηx c and using the definition of h, we recover the optial r. In light of the Bayes solution, the specific choice of the abstention function r is natural; however, requiring the abstention function r to be defined as rx = hx γ, for soe h H, is in general too restrictive when predictors are selected out of a liited subset H of all easurable functions over X. Consider the exaple shown in Figure where H is a faily of linear functions. For this siple case, the optial abstention region cannot be attained as a function of the best predictor h while it can be achieved by allowing to learn a pair h, r. Thus, the general odel for learning with abstention analyzed in Section. is both ore flexible and ore general. 3 Theoretical analysis This section presents a theoretical analysis of the proble of learning convex ensebles for classification with abstention. We first introduce general convex surrogate functions for the abstention loss and prove a necessary and sufficient condition based on their paraeters for the to be calibrated. Next we define the enseble faily we consider and prove general data-dependent learning guarantees for it based on the Radeacher coplexities of the base predictor and base rejector sets. 3. Convex surrogates We introduce two types of convex surrogate functions for the abstention loss. Observe that the abstention loss Lh, r, x, y can be equivalently expressed as Lh, r, x, y = ax yhx 0 rx<0, c rx 0. In view of that, since for any f, g R, axf, g = f+g+ g f f+g, the following inequalities hold for a > 0 and b > 0: Lh, r, x, y = ax yhx 0 rx<0, c rx 0 ax axyhx, rx 0, c rx 0 ax yhx rx 0, c rx 0 = ax a [yhx rx 0, c b rx 0 ax Φ a [rx yhx, c Φ b rx, where u Φ u and u Φ u are two non-increasing convex functions upper-bounding u u 0 over R. Let L MB be the convex surrogate defined by the last inequality above: L MB h, r, x, y = ax Φ a [rx yhx, c Φ b rx, 3 Since L MB is not differentiable everywhere, we upper-bound the convex surrogate L MB as follows: ax a [yhx rx 0, c b rx 0 Φ a [rx yhx + c Φ b rx. Siilarly, we let L SB denote this convex surrogate: L SB h, r, x, y = Φ a [rx yhx + c Φ b rx. 4 Figure shows the plots of the convex surrogates L MB and L SB as well as that of the abstention loss. Let h L, r L denote the pair that attains the iniu of the expected loss E x,yl SB h, r, x, y over all easurable functions for Φ u = Φ u = expu. In Appendix F, we show that with ηx= PY =+ X =x, the pair h L, r L where h L = a log η η and r L = a +b log cb a η η akes L SB a calibrated loss, eaning that the sign of the h L, r L that iniizes the expected surrogate loss atches the sign of the Bayes classifier h, r. More precisely, the following holds. Theore Calibration of convex surrogate. For a > 0 and b > 0, the inf h,r E x,y [Lh, r, x, y is attained at h L, r L such that signh = signh L and signr = signrl if and only if b /a = c/c. 3

4 Figure : The left figure is a plot of the abstention loss. The iddle figure is a plot of the surrogate function L MB while the right figure is a plot of the surrogate loss L SB both for c = The theore shows that the classification and rejection solution obtained by iniizing the surrogate loss for that choice of a, b coincides with the one obtained using the original loss. In the following, we ake the explicit choice of a = and b = c/c for the loss L SB to be calibrated. 3. Learning guarantees for ensebles in classification with abstention In the standard scenario of classification, it is often easy to coe up with siple base classifiers that ay abstain. As an exaple, a siple rule could classify a essage as spa based on the presence of soe word, as ha in the presence of soe other word, and just abstain in the absence of both, as in the boosting with abstention algorith by Schapire and Singer [6. Our objective is to learn ensebles of such base hypotheses to create accurate solutions for classification with abstention. Our enseble functions are based on the fraework described in Section.. Let H and R be two failies of functions apping X to [,. The enseble faily F that we consider is then the convex hull of H R: { T T T } F = α t h t, α t r t : T, α t 0, α t =, h t H, r t R. 5 t= t= Thus, h, r F abstains on input x X when rx 0 and predicts the label signhx otherwise. Let u Φ u and u Φ u be two strictly decreasing differentiable convex function upperbounding u u 0 over R. For calibration constants a, b > 0, and cost c > 0, we assue that there exist u and v such that Φ a u < and c Φ v <, otherwise the surrogate would not be useful. Let Φ and Φ be the inverse functions, which always exist since Φ and Φ are strictly onotone. We will use the following definitions: C Φ = a Φ Φ and C Φ = cb Φ Φ /c. Observe that for Φ u = Φ u = expu, we siply have C Φ = a and C Φ = b. Theore. Let H and R be two failies of functions apping X to R. Assue N >. Then, for any δ > 0, with probability at least δ over the draw of a saple S of size fro D, the following holds for all h, r F: log /δ Rh, r E [L MBh, r, x, y + C Φ R H + C Φ + C Φ R R + x,y S. The proof is given in Appendix C. The theore gives effective learning guarantees for enseble pairs h, r F when the base predictor and abstention functions adit favorable Radeacher coplexities. In earlier work [7, we present a learning bound for a different type of surrogate losses which can also be extended to hold for ensebles. Next, we derive argin-based guarantees in the case where Φ u = Φ u = expu. For any ρ > 0, the argin-losses associated to L MB and L SB are denoted by L ρ MB and Lρ SB and defined for all h, r F and x, y X {, +} by L ρ MB h, r, x, y = L MBh/ρ, r/ρ, x, y and L ρ SB h, r, x, y = L SBh/ρ, r/ρ, x, y. Theore applied to this argin-based loss results in the following corollary. Corollary 3. Assue N > and fix ρ > 0. Then, for any δ > 0, with probability at least δ over the draw of an i.i.d. saple S of size fro D, the following holds for all f F: Rh, r E x,y S [Lρ MB h, r, x, y + a ρ R H + a + b log /δ R R + ρ. t= 4

5 BAS = x, y,..., x, y for i to do D i, ; D i, 3 for t to T do 4 Z,t D ti, ; Z,t D ti, 5 k argin j [,N Z,t ɛ t,j + Z,t r j, c cz,t r j, Direction 6 Z Z,t ɛ t,k + r k, r c cz k,,t 7 if Z,t Ze α t,k Ze α t,k < Z t β then 8 η t α t,k Step 9 else η t log 0 α t α t + η t e k r t N j= α jr j h t N j= α jh j 3 Z t+ [ [ β Z + β Z tz Z tz +,t Z Step Φ r t x i y i h t x i + Φ c c r tx i 4 for i to do 5 D t+ i, Φ r tx i y ih tx i Z t+ ; D t+ i, Φ 6 h, r N j= α T,jh j, r j 7 return h, r c c rtxi Z t+ Figure 3: Pseudocode of the BA algorith for both the exponential loss with Φ u = Φ u = expu as well as for the logistic loss with Φ u = Φ u = log + e u. The paraeters include the cost of rejection c and β deterining the strength of the the α-constraint for the L regularization. The definition of the weighted errors ɛ t,k as well as the expected rejections, r k, and r k,, are given in Equation 7. For other surrogate losses, the step size η t is found via a line search or other nuerical ethods by solving argin η F α t + ηe k. The bound of Corollary 3 applies siilarly to L ρ SB since it is an upper bound on Lρ MB. It can further log log /ρ be shown to hold uniforly for all ρ 0, at the price of a ter in O using standard techniques [6, see Appendix C. 4 Boosting algorith Here, we derive a boosting-style algorith BA algorith for learning an enseble with the option of abstention for both losses L MB and L SB. Below, we describe the algorith for L SB and refer the reader to Appendix H for the version using the loss L MB. 4. Objective function The BA algorith solves a convex optiization proble that is based on Corollary 3 for loss L SB. Since the last three ters of the right-hand side of the boundof the corollary do not depend on α, this suggests to select α as the solution of in α Lρ SB h, r, x i, y i. Via a change of variable α α/ρ that does not affect the optiization proble, we can equivalently search for in α 0 L SBh, r, x i, y i such that T t= α t /ρ. Introducing the Lagrange variable β associated to the constraint T t= α t /ρ, the proble can rewritten as: in α 0 L SBh, r, x i, y i +β T t= α t. Letting {h, r,..., h N, r N } be the set of base functions pairs for the classifier and rejection function, we can rewrite the optiization proble as 5

6 the iniization over α 0 of N Φ α j r j x i y i j= N j= α j h j x i +c Φ b N j= α j r j x i +β N α j. Thus, the following is the objective function of our optiization proble: F α = Φ r t x i y i h t x i + c Φ b r t x i N + β α j Projected coordinate descent The proble in α 0 F α is a convex optiization proble, which we solve via projected coordinate descent. Let e k be the kth unit vector in R N and let F α, e j be the directional derivative of F along the direction e j at α. The algorith consists of the following three steps. First, it deterines the direction of axial descent by k = argax j [,N F α t, e j. Second, it calculates the best step η along the direction that preserves non-negativity of α by η = argin αt +ηe k 0 F α t + ηe k. Third, it updates α t to α t = α t + ηe k. The pseudocode of the BA algorith is given in Figure 3. The step and direction are based on F α t, e j. For any t [, T, define a distribution D t over the pairs i, n, with n in {, } D t i, = Φ r t x i y i h t x i and D t i, = Φ b rt x i, Z t Z t where Z t is the noralization factor given by Z t = Φ r t x i y i h t x i + Φ b r t x i. In order to derive an explicit forulation of the descent direction that is based on the weighted error of the classification function h j and the expected value of the rejection function r j, we use the distributions D,t and D,t defined by D t i, /Z,t and D t i, /Z,t where Z,t = D ti, and Z,t = D ti, are the noralization factors. Now, for any j [, N and s [, T, we can define the weighted error ɛ t,j and the expected value of the rejection function, r j, and r j,, over distribution D,t and D,t as follows: [ ɛ t,j = E [y i h j x i, r j, = E [r j x i, and r j, = i D,t i D,t Using these definition, we show see Appendix D that the descent direction is given by k = argin Z,t ɛ t,j + Z,t r j, c cz,t r j,. j [,N j= j= E [r j x i. 7 i D,t This equation shows that Z,t and c cz,t re-scale the weighted error and expected rejection. Thus, finding the best descent direction by iniizing this equation is equivalent to finding the best scaled trade-off between the isclassification error and the average rejection cost. The step size can in general be found via line search or other nuerical ethods, but we have derived a closed-for solution of the step size for both the exponential and logistic loss see Appendix D.. Further details of the derivation of projected coordinate descent on F are also given in Appendix D. Note that for r t 0 + in Equation 6, that is when the rejection ters are dropped in the objective, we retrieve the L-regularized Adaboost. As for Adaboost, we can define a weak learning assuption which requires that the directional derivative along at least one base pair be non-zero. For β = 0, it does not hold when for all j: ɛ s,j = r j, + c cz,t Z,t r j,, which corresponds to a balance between the edge and rejection costs for all j. Observe that in the particular case when the rejection functions are zero, it coincides with the standard weak learning assuption for Adaboost ɛ s,j = for all j. The following theore provides the convergence of the projected coordinate descent algorith for our objective function, F α. The proof is given in Appendix E. Theore 4. Assue that Φ is twice differentiable and that Φ u > 0 for all u R. Then, the projected coordinate descent algorith applied to F converges to the solution α of the optiization proble ax α 0 F α. 
If additionally Φ is strongly convex over the path of the iterates α t then there exists τ > 0 and ν > 0 such that for all t > τ, F α t+ F α ν F αt F α. 6

7 R + + R Figure 4: Illustration of the abstention stups on a variable X. Specifically, this theore holds for the exponential loss Φu = expu and the logistic loss Φ u = log + e u since they are strongly convex over the copact set containing the α t s. 4.3 Abstention stups We first define a faily of base hypotheses, abstention stups, that can be viewed as extensions of the standard boosting stups to the setting of classification with abstention. An abstention stup h θ,θ over the feature X is defined by two thresholds θ, θ R with θ θ. There are 6 different such stups, Figure 4 illustrates two of the. For the left figure, points with variables X less than or equal to θ are labeled negatively, those with X θ are labeled positively, and those with X between θ and θ are rejected. In general, an abstention stup is defined by the pair h θ,θ X, r θ,θ X where, for Figure 4-left, h θ,θ X = X θ + X>θ and r θ,θ X = θ<x θ. Thus, our abstention stups are pairs h, ˆr with h taking values in {, 0, } and ˆr in {0, }, and such that for any x either hx or ˆrx is zero. For our forulation and algorith, these stups can be used in cobination with any γ > 0, to define a faily of base predictor and base rejector pairs of the for hx, γ ˆrx. Since α t is non-negative, the value γ is needed to correct for over-rejection by previously selected abstention stups. The γ can be autoatically learned by adding to the set of base pairs the constant functions h 0, r 0 = 0,. An enseble solution returned by the BA algorith is therefore of the for t α th t x, t α tr t x where α t s are the weights assigned to each base pair. Now, consider a saple of points sorted by the value of X, which we denote by X X. For abstention stups, the derivative of the objective, F, can be further siplified see Appendix G such that the proble can be reduced to finding an abstention stup with the inial expected abstention loss lθ, θ, that is argin D t i, [ yi=+ Xi θ + yi= Xi>θ + D t i, cb D t i, θ<x i θ. θ,θ Notice that given points, at ost + thresholds need to be considered for θ and θ. Hence, a straightforward algorith inspects all possible O pairs θ, θ with θ θ in tie O. However, Lea 5 below and further derivations in Appendix G, allows for an O-tie algorith for finding optial abstention stups when the proble is solved without the constraint θ θ. Note that while we state the lea for the abstention stup in Figure 4-left, siilar results hold for any of the 6 types of stups. Lea 5. The optiization proble without the constraint θ < θ can be decoposed as follows: argin θ,θ lθ, θ = argin θ + argin θ D t i, yi=+ Xi θ + D t i, cb D t i, θ<x i 8 D t i, yi= Xi>θ + D t i, cb D t i, Xi θ. 9 The optiization Probles 8 and 9 can be solved in linear tie, via a ethod siilar to that of finding the optial threshold for a standard zero-one loss boosting stup. When the condition θ < θ does not hold, we can siply revert to finding the iniu of lθ, θ in the naive way. In practice, we find ost often that the optial solution of Proble 8 and Proble 9 satisfies θ < θ. 5 Experients In this section, we present the results of experients with our abstention stup BA algorith based on L SB for several datasets. We copare the BA algorith with the DHL algorith [, as well as a 7

8 0.3 cod pia skin Rejection Loss Rejection Loss Rejection Loss Cost Cost Cost 0.6 banknote haberan australian Rejection Loss Rejection Loss Rejection Loss Cost Cost Figure 5: Average rejection loss on the test set as a function of the abstention cost c for the TSB Algorith in orange, the DHL Algorith in red and the BA Algorith in blue based on L SB. Cost confidence-based boosting algorith TSB. Both of these algoriths are described in further detail in Appendix B. We tested the algoriths on six data sets fro UCI s data repository, specifically australian, cod, skin, banknote, haberan, and pia. For ore inforation about the data sets, see Appendix I. For each data set, we ipleented the standard 5-fold cross-validation where we randoly divided the data into training, validation and test set with the ratio 3::. Using a different rando partition, we repeated the experients five ties. For all three algoriths, the cost values ranged over c {0.05, 0.,..., 0.5} while threshold γ ranged over γ {0.08, 0.6,..., 0.96}. For the BA algorith, the β regularization paraeter ranged over β {0, 0.05,..., 0.95}. All experients for BA were based on T = 00 boosting rounds. The DHL algorith used polynoial kernels with degree d {,, 3} and it was ipleented in CVX [8. For each cost c, the hyperparaeter configuration was chosen to be the set of paraeters that attained the sallest average rejection loss on the validation set. For that set of paraeters we report the results on the test set. We first copared the confidence-based TSB algorith with the BA and DHL algoriths first row of Figure 5. The experients show that, while TSB can soeties perfor better than DHL, in a nuber of cases its perforance is draatically worse as a function of c and, in all cases it is outperfored by BA. In Appendix J, we give the full set of results for the TSB algorith. In view of that, our next series of results focus on the BA and DHL algoriths, directly designed to optiize the rejection loss, for 3 other datasets second row of Figure 5. Overall, the figures show that BA outperfors the state-of-the-art DHL algorith for ost values of c, thereby indicating that BA yields a significant iproveent in practice. We have also successfully run BA on the CIFAR-0 data set boat and horse iages which contains 0,000 instances and we believe that our algorith can scale to uch larger datasets. In contrast, training DHL on such larger saples did not terinate as it is based on a costly QCQP. In Appendix J, we present tables that report the average and standard deviation of the abstention loss as well as the fraction of rejected points and the classification error on non-rejected points. 6 Conclusion We introduced a general fraework for classification with abstention where the predictor and abstention functions are learned siultaneously. We gave a detailed study of enseble learning within this fraework including: new surrogate loss functions proven to be calibrated, Radeacher coplexity argin bounds for enseble learning of the pair of predictor and abstention functions, a new boosting-style algorith, the analysis of a natural faily of base predictor and abstention functions, and the results of several experients showing that BA algorith yield a significant iproveent over the confidence-based algoriths DHL and TSB. Our algorith can be further extended by considering ore coplex base pairs such as ore general ternary decision trees with rejection leaves. 
Moreover, our theory and algorith can be generalized to the scenario of ulti-class classification with abstention, which we have already initiated. Acknowledgents This work was partly funded by NSF CCF and IIS

9 References [ P. Bartlett and M. Wegkap. Classification with a reject option using a hinge loss. JMLR, 008. [ A. Bounsiar, E. Grall, and P. Beauseroy. Kernel based rejection ethod for supervised classification. In WASET, 007. [3 H. L. Capitaine and C. Frelicot. An optiu class-rejective decision rule and its evaluation. In ICPR, 00. [4 K. Chaudhuri and C. Zhang. Beyond disagreeent-based agnostic active learning. In NIPS, 04. [5 C. Chow. An optiu character recognition syste using decision function. IEEE Trans. Coput., 957. [6 C. Chow. On optiu recognition error and reject trade-off. IEEE Trans. Coput., 970. [7 C. Cortes, G. DeSalvo, and M. Mohri. Learning with rejection. In ALT, 06. [8 I. CVX Research. CVX: Matlab software for disciplined convex prograing, version.0, Aug. 0. [9 B. Dubuisson and M. Masson. Statistical decision rule with incoplete knowledge about classes. In PR, 993. [0 R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classification. JMLR, 00. [ R. El-Yaniv and Y. Wiener. Agnostic selective classification. In NIPS, 0. [ C. Elkan. The foundations of cost-sensitive learning. In IJCAI, 00. [3 G. Fuera and F. Roli. Support vector achines with ebedded reject option. In ICPR, 00. [4 G. Fuera, F. Roli, and G. Giacinto. Multiple reject thresholds for iproving classification reliability. In ICAPR, 000. [5 Y. Grandvalet, J. Keshet, A. Rakotoaonjy, and S. Canu. Suppport vector achines with a reject option. In NIPS, 008. [6 V. Koltchinskii and D. Panchenko. Epirical argin distributions and bounding the generalization error of cobined classifiers. Annals of Statistics, 30, 00. [7 T. Landgrebe, D. Tax, P. Paclik, and R. Duin. The interaction between classification and reject perforance for distance-based reject-option classifiers. PRL, 005. [8 M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperietry and Processes. Springer, New York, 99. [9 M. Littan, L. Li, and T. Walsh. Knows what it knows: A fraework for self-aware learning. In ICML, 008. [0 Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent ethod for convex differentiable iniization. Journal of Optiization Theory and Applications, 99. [ I. Melvin, J. Weston, C. S. Leslie, and W. S. Noble. Cobining classifiers for iproved classification of proteins fro sequence or structure. BMC Bioinforatics, 008. [ M. Mohri, A. Rostaizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 0. [3 F. Pedregosa, G. Varoquaux, A. Grafort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in python. In JMLR, 0. [4 T. Pietraszek. Optiizing abstaining classifiers using ROC analysis. In ICML, 005. [5 C. Santos-Pereira and A. Pires. On optial reject rules and ROC curves. PRL, 005. [6 R. E. Schapire and Y. Singer. Boostexter: A boosting-based syste for text categorization. Machine Learning, 39-3:35 68, 000. [7 D. Tax and R. Duin. Growing a ulti-class classifier with a reject option. In Pattern Recognition Letters, 008. [8 F. Tortorella. An optial reject rule for binary classifiers. In ICAPR, 00. [9 K. Trapeznikov and V. Saligraa. Supervised sequential classification under budget constraints. In AISTATS, 03. [30 J. Wang, K. Trapeznikov, and V. Saligraa. An lp for sequential learning under budgets. In JMLR, 04. [3 M. Yuan and M. Wegkap. Classification ethods with reject option based on convex risk iniizations. In JMLR, 00. [3 M. Yuang and M. Wegkap. 
Support vector achines with a reject option. In Bernoulli, 0. [33 C. Zhang and K. Chaudhuri. The extended littlestone s diension for learning with istakes and abstentions. In COLT, 06. 9

10 A Extended Related Work Initial work in learning with abstention has focused on the optial trade-off between the error and abstention rate [5, 6 as well as finding the optial abstention rule based on the ROC curve [4, 8, 5. Another series of papers used rejection options to reduce isclassification rate, but theoretical learning guarantees were not given [3, 4,, 7,. More recently, El-Yaniv and Wiener [0, study the trade-off between the coverage and the accuracy of classifiers by using an approach related to active learning. A seeingly connected fraework is that of cost-sensitive learning where the cost of isclassifying class y as class y ay depend on the pair y, y [. It would be tepting to view classification with abstention as a special instance of cost-sensitive learning with the set of classes {, +, R }, with R standing for abstention and where a different cost would be assigned to abstention. However, in our proble, R is not an intrinsic class: training or test saples bear no R label. Instead, the distribution over that set will depend on the algorith. Thus, classification with abstention cannot be cast as a special case of cost-sensitive learning. Sequential learning with a budget is also a arginally related task where abstention functions are learned. But, unlike our approach, it is done in a two-step process where the classifier function is fixed [9, 30. Lastly, the option of abstaining has been analyzed in related topics including the ulti-class setting [7, 9, 3, reinforceent learning [9, online learning [33 and active learning [4. B Confidence-based abstention odel In this appendix, we describe two confidence-based abstention algoriths: the DHL algorith and the TSB algorith. B. DHL algorith The DHL algorith found in [ is based on a double hinge loss, which is a hinge-type convex surrogate, with favorable consistency results. The optiization proble solved by the DHL algorith iniizes this surrogate loss along with the constraint that the nor of the classifier is bounded by c. More precisely, let H be a hypotheses sets defined in ters of PSD kernels K over X where Φ is the feature apping associated to K, then the DHL solves the following QCQP optiization proble in α,ξ,β subject to ξ i + c β i c α i α j Kx i, x j c i,j= B. Two-step Adaboost TSB ξ i y i α i Kx i, x ξ i 0, β i y i α i Kx i, x β i 0, i [,. The TSB algorith is a confidence-based algorith that proceeds in two steps. The first step consists of training a vanilla Adaboost algorith which returns a classifier h. Then, given classifier h, the second step is to search for the best threshold γ that iniizes the epirical abstention loss. More precisely, we pick the paraeter γ via cross-validation, by choosing the threshold that iniizes the epirical abstention loss on the validation set. This is a natural confidence-based boosting algorith and since the BA algorith is based on boosting, it provides a useful baseline for our experients. We ipleented this algorith using scikit-learn [3. C Theoretical guarantees In this appendix, we provide the proof of the theoretical guarantees presented in Section 3.. 0

11 Let u Φ u and u Φ u be two strictly non-increasing differentiable convex functions upper-bounding u u 0 over R. We assue that a, b > 0, and c > 0. We will use the quantity inφ a u, and so we assue that there exists u such that Φ a u < and siilarly, we need to analyze incφ u, and so we assue there exists a u such that cφ u <. Note that if these two assuptions did not hold, then the surrogate would not be useful. Let Φ and Φ be the inverse functions, which always exist since Φ and Φ are strictly onotone functions. For siplicity, we define C Φ = a Φ Φ and C Φ = cb Φ Φ /c. Theore. Let H and R be faily of functions apping X to R. Assue N >. Then, for any δ > 0, with probability at least δ over the draw of a saple S of size fro D, the following holds for all h, r F: Rh, r E [L MBh, r, x, y + C Φ R H + C Φ + C Φ R R + x,y S log δ. Proof. Let L MB,F be the faily of functions defined by L MB,F = { x, y inl MB h, r, x, y,, h, r F }. Since inl MB, is bounded by one, by the general Radeacher coplexity generalization bound [6, with probability at least δ over the draw of a saple S, the following holds: Rh, r E [inl MBh, r, x, y, x,y D E [inl MBh, r, x, y, + R L MB,F + x,y S E [L log δ MBh, r, x, y + R L MB,F + x,y S. log δ Since for any a, b R, in axa, b, = ax ina,, inb,, we can write inl MB h, r, x, y, = ax in Φ a [rx yhx,, in c Φ b rx, in Φ b [rx yhx, + in c Φ b rx,. The function Φ a u has a non-negative increasing derivative because it is a strictly increasing convex function. Since in Φ a u, = Φ a u for a u Φ, the Lipschitz constant of u in Φ a u, is given by a Φ Φ. Siilarly, u in cφ b u, is also cb Φ Φ /c-lipschitz. Then, by Talagrand s lea [8, R L MB,F a Φ Φ R x, y rx yhx: h, r F + c b Φ Φ /c R x, y rx: h, r F. 0 We exaine each of the ters in the right-hand side of the inequality: [ R x, y rx yhx: h, r F = Eσ sup E σ [ = E σ [ sup h,r F sup h,r F [ σ i rx i + Eσ [ σ i rx i + Eσ = R R + R H, sup h,r F sup h,r F h,r F σ i y i hx i σ i hx i σ i rx i y i hx i since y i σ i and σ i are distributed in the sae way, we effectively can absorb y i into the definition of σ i. Lastly, since the α does not affect the Radeacher coplexity, we have that

12 [ [ E σ sup h,r F σ ihx i = R H and siilarly E σ sup h,r F σ irx i = R R. By a siilar reasoning, we also have that R x, y rx: h, r F = Eσ [ sup h,r F σ i rx i = R R. Cobining the above, we have that the right-hand side of Inequality 0 is bounded as follows R L MB,F a Φ Φ R H + c b Φ Φ /c + a Φ Φ R R, which copletes the proof. By taking Φ u = Φ u = expu, we have the following theore since in this case, we siply have that C Φ = a and C Φ = b. Theore 6. Let H and R be faily of functions apping X to R. Assue N >. Then, for any δ > 0, with probability at least δ over the draw of a saple S of size fro D, the following holds for all h, r F: Rh, r E x,y S [L MBh, r, x, y + a R H + a + b R R + log /δ. The corollary below is a direct consequence of the above Theore 6 and it presents argin-based guarantees that are subsequently used to derive the BA algorith. Corollary 3. Assue N > and fix ρ > 0. Then, for any δ > 0, with probability at least δ over the draw of an i.i.d. saple S of size fro D, the following holds for all h, r F: Rh, r E x,y S [Lρ MB h, r, x, y + a ρ R H + a + b log /δ R R + ρ. D Direction and step of projected coordinate descent In this appendix, we provide the details of the projected coordinate descent, projected CD, algorith by first deriving the direction and then the optial step. We give a closed for solution of the step size for exponential loss Φu = expu and logistic loss Φu = log + e u. D. Direction At each iteration t, the direction e k selected by projected CD is k = argax j [,N F α t, e j where the derivative is given by the following F α t, e j = [r j x i y i h j x i Φ r t x i y i h t x i cb r j x i Φ b r t x i + β. Using the definition of Di, and Di,, we re-write the derivative as follows: F α t, e j = Z t = Z t [r j x i y i h j x i D t i, cb r j x i D t i, + β Z,t ɛ s,j Z,t + Z,t r j, cb Z,t r j, + β. Hence, we have that the descent direction is k = argin j [,N Z,t ɛ t,j + Z,t r j, cb Z,t r j,.

13 D. Step The optial step values η for direction e k is given by argin η+αt,k 0 F α t + ηe k. The values η ay be found via line search or other nuerical ethods, but below we derive a closed-for solution by iniizing an upper bound of F α t + ηe k. Since Φ is convex and since for all i [, y i h k x i + r k x i = + y ih k x i r k x i + y ih k x i + r k x i we have that the following holds for all η R Φ r t x i y i h t x i ηy i h k x i + ηr k x i + y ih k x i r k x i Φ r t x i y i h t x i η + y ih k x i + r k x i Φ r t x i y i h t x i + η. Siilarly, we have that b r k x i = b r kx i + b r kx i Φ b r t x i b ηr k x i b r kx i Φ b r t x i + η + b r kx i Φ b r t x i η Thus, we can upper-bound F as follows:, F α t + ηe k + y i h k x i r k x i Φ r t x i y i h t x i η + y i h k x i + r k x i Φ r t x i y i h t x i + η + b r k x i c Φ b r t x i + η + b r k x i c Φ b r t x i η + N α t β + βη j= We define Jη to be the right-hand side of the inequality above. We will select η as the solution of in η+αt,k 0 Jη, which is a convex optiization proble since J is convex. D.. Exponential loss When Φu = expu, the J function is given by Jη = + y i h k x i r k x i e rt xi yiht xi e η + y i h k x i + r k x i e rt xi yiht xi e η + b r k x i ce b rt xi e η + b r k x i ce b rt xi e η + N α t β + βη. j= 3

14 Since e rt xi yiht xi = Φ r t x i y i h t x i = Z t D t i, and e b rt xi = Φ b r t x i = Z t D t i,, it iplies that Jη = Z t ɛ t,k r k, Z,te η + ɛ t,k + r k, Z,te η + b r k, cz,t e η + b r k, cz,te η + N α t β + βη. For siplicity below, we define A = Z,t ɛ t,k r k, + cz,t b r k, and Z = Z,t ɛ t,k + r k, + cz,t b r k, so that J can be written as Jη = Z t Ae η + Ze η + j= N α t β + βη. Introducing a Lagrange variable λ 0, the optiization proble then becoes j= Lη, λ = Jη λη + α t,k with η Lη, λ = J η λ. By the KKT conditions, at the solution η, λ, J η = λ and λ η + α t,k = 0. Thus, we can fall in one of the two following cases:. λ > 0 J η > 0 and η = α t,k. λ = 0 and η is a solution of the equation Jη = 0 The first case can be written as Z t Ae α t,k + Ze α t,k + β > 0 Ae α t,k Ze α t,k < Z t β. For the second case we have to solve J η = 0 which can be written as e η + β Z tz eη A Z. The solution is given by e η = β Z t Z + β A [ + Z t Z Z η = log β β Z t Z + Z t Z Noting that A = Z,t Z, the above can be siplified to D.. Logistic loss [ η = log β Z t Z + A +. Z β Z,t + Z t Z Z. For the logistic loss, we have that for any u R, Φ u = log + e u and Φ u = log +e u. We have the following upper bound + e u + e u v e u Φ u v Φ u = log + e u = log + e v e u + e v log + e u = Φ ue v, which allows us to write F α t + ηe k F α t Φ r t x i y i h t x i e ηyih kx i+ηr k x i + c Φ b r t x i e b ηr kx i + βη. Fro here, we can use a very siilar reasoning as the exponential loss which results in a siilar expression for the step size. 4

15 E Convergence analysis of algorith In the section, we prove the convergence of the projected CD algorith for F α = Φ r t x i y i h t x i + c Φ b r t x i + β N j= α j. Theore 4. Assue that Φ is twice differentiable and that Φ u > 0 for all u R. Then, the projected CD algorith applied to F converges to the solution α of the optiization proble ax α 0 F α. If additionally Φ is strongly convex over the path of the iterates α t then there exists τ > 0 and ν > 0 such that for all t > τ, F α t+ F α ν F α t F α. Proof. Let H be the atrix in R N defined by H i,,j = y i h j x i r j x i and H i,,j = b r j x i for all i [, and for all j [, N, and let e i, and e i, be unit vectors in R. Then for any α, we have that e T i, Hα = N j= α jy i h j x i r j x i and e T i, Hα = b N j= α jr j x i. Thus, we can write for any α R N, F α = GHα + Λ T α, 3 where Λ = Λ,..., Λ N T and where G is the function defined by Gu = Φ e T i,u + cφ e T i,u = Φ u i, + cφ u i, 4 for all u R with u i, its i, th coordinate and u i, its i, th coordinate. Since Φ is differentiable, the function G is differentiable and Gu is a diagonal atrix with diagonal entries Φ u i, > 0 or c Φ u i, > 0 for all i [,. Thus, GHα is positive definite for all α. The conditions of Theore. of [0 are therefore satisfied for the optiization proble in GHα + α 0 ΛT α, 5 thereby guaranteeing the convergence of the projected CD ethod applied to F. If additionally F is strongly convex over the sequence of α t s, the by the result of [0[page 6, the Inequality holds for the projected coordinate ethod that we are using which selects the best direction at each round, as with the Gauss-Southwell ethod. F Calibration In this section, we show that L SB h, r, x, y = e a rx yhx +ce b rx is a calibrated loss whenever b a = c c. Below, let L := L SBh, r, x, y and define ηx = PY = + X = x. Theore. For a > 0 and b > 0, the inf h,r E x,y [Lh, r, x, y is attained at h L, r L such that signh = signh L and signr = signr L if and only if b a = c c. Proof. Conditioning on the label y, we can write the generalization error for the Lh, r, x, y as follows E [Lh, r, x, y = E [ηxψ hx, rx + ηxψhx, rx, x,y x where Ψ hx, rx = e a rx hx + ce b rx. For siplicity, we also let L Ψ hx, rx = ηxψ hx, rx + ηxψhx, rx. Since the infiu is over all easurable functions hx, rx, we have that inf h,r E x L Ψ hx, rx = E x inf hx,rx L Ψ hx, rx. Thus, we need to find the optial u, v for a fixed x that iniizes L Ψ u, v over all easurable functions, which is a convex optiization proble. When ηx = 0, the sign of the iniizers of L Ψ u, v are u < 0 and v > 0 while when ηx =, the the sign of the iniizers are u > 0 and v > 0, which atches the sign of h and r in both cases respectively. Now for ηx 0, [, we take the derivative of L Ψ u, v with respect to u L Ψu,v u = ηxa e a v u + ηxa e a u+v. 5

16 Setting it to zero and solving for u, we have that u = ηx a log ηx. We can now see that u > 0 if ηx > and u 0 if ηx. Recalling that h = ηx, we can conclude that the sign of u atches the sign of h. We now take the derivative of L Ψ u, v with respect to v L Ψu,v v = ηxe a v u + ηxe a v+u + c b e b v. Setting it equal to zero and using the fact that ηxe a u + ηxe a u = ηx ηx, we have that v = a +b log cb a ηx ηx. Now, we know that the Bayes classifiers h, r satisfy h = ηx and r = h + c so that the following holds ηx ηx = 4 h = 4 r + c. Thus, we can replace ηx ηx in the definition of v to arrive at this equation v = a +b log cb a. We now analyze when v > 0 which is equivalent to a +b log cb a 4 c > 4 r + c 4 r + c > 0 cb a cb a 4 r + > c > 4 r + c. Since 4 r + c for r > 0 and using the fact that c c = 4 c, we need that cb a c c. By siilar reasoning for v 0, we need that cb a c c. Thus, we can conclude that the sign of v atches the sign of r if and only if = c c. cb a G Abstention stups Under the assuptions of Section 4.3, the derivative of F can be siplified as follows F α t, e j = Z t D t i, + D t i, + D t i, i:y ih jx i=+ i:y ih jx i= i:r jx i= cb D t i, + β 6 i:r jx i= Fro the definition of Di, and the assuptions on hx and rx, the following holds D t i, = D t i, + D t i, + D t i, i:y ih jx i=+ i:y ih jx i= i:r jx i= Solving for i:y ih jx i=+ D ti, and plugging it in Equation 6, we can siplify the derivative F α t, e j = Z t i:y ih jx i= D t i, + i:r jx i= D t i, D t i, cb = Z t Z,t ɛ t,j + Z,t r j, cb Z,t r j, Z,t + β i:r jx i= Thus, the optial descent direction is k = argin j [,N Z,t ɛ t,j + Z,t r j, cb Z,t r j, Below, we provide the proof of the lea that was needed to decouple the optiization proble for the abstention stups. D t i, + β 6

17 Lea 5. The optiization proble without the constraint θ < θ can be decoposed as follows: argin D t i, [ yi=+ Xi θ + yi= Xi>θ + D t i, cb D t i, θ<x i θ. θ,θ = argin θ D t i, yi=+ Xi θ + D t i, cb D t i, θ<x i + argin D t i, yi= Xi>θ + D t i, cb D t i, Xi θ. θ Proof. For siplicity below, let κ = D t i, cb D t i, and observe that the following identity holds: θ<x i θ = θ<x i + Xi θ In view of that, we can write argin D t i, [ yi=+ Xi θ + yi= Xi>θ + κ θ<x i θ θ,θ = argin D t i, [ yi=+ Xi θ + yi= Xi>θ + κ[ θ<x i + Xi θ θ,θ = argin D t i, [ yi=+ Xi θ + yi= Xi>θ + κ θ<x i θ,θ + κ Xi θ κ = argin D t i, [ yi=+ Xi θ + yi= Xi>θ + κ θ<x i θ,θ = argin θ + κ Xi θ D t i, yi=+ Xi θ + κ θ<x i + argin D t i, yi= Xi>θ + κ Xi θ. θ H Alternative surrogate, L MB In this section, we derive the boosting algorith for the surrogate loss L MB h, r, x, y = ax Φ a [rx yhx, c Φ b rx. 7 By a siilar reasoning as Section 4, the objective function F α of our optiization proble is given by the following F α = ax e a [rtxi yihtxi, cb e b rtxi + β N α j. For siplicity, we define u t i = e a [rtxi yihtxi, v t i = cb e b rtxi and w t i = axu t i, v t i. We also let uti be the indicator functions that equals if u t i v t i and siilarly vti be the indicator functions that equals if v t i > u t i. For any t [, T, we also define the distribution D t i = w t i Z t, 8 7 j=

18 where Z t is the noralization factor given by Z t = w t i. We then apply projected coordinate descent to this objective function. Notice that our objective F is differentiable everywhere except when u t i = v t i. A true axiu descent algorith would choose the eleent of the subgradient that is closest to 0 as the descent direction. However, since this event is rare in our case, we arbitrarily a pick descent direction that is an eleent of the subgradient. For siplicity below, we will use the sybol F α t, e j to denote the directional derivative with the added condition that for the non-differentiable point, we choose the direction that is an eleent of the subgradient. H.0.3 Direction and step At each iteration t, the direction e k selected by projected CD is k = argax j [,N F α t, e j where F α t, e j = = = a [ y i h j x i + r j x i ut i cb r j x i vt i w t i + β a y i h j x i ut i [ a + cb ut i + cb r j x i w t i + β a y i h j x i ut i [ a + cb ut i + cb r j x i D t iz t + β. 9 The step can siply be found via line search or other nuerical ethods. H. Abstention stups We focus in on a special case where the base classifiers have a specific for defined as follows: hx takes values in {, 0, } and rx take values in {0, }. We also have the added the condition that for each saple point x, only one of the two coponents of hx, rx is non-zero. Under this setting, Equation 9 can be siplified as follows. The derivative of F is given by F α t, e j = a [ y i h j x i + r j x i ut i cb r j x i vt i D t iz t + β, which can be rewritten as = Z t a [ ut id t i ut id t i + ut id t i i:y ih jx i= i:y ih jx i=+ i:r jx i= cb D t i vt i + β. 0 i:r jx i= Fro the assuptions on hx and rx, the relation below holds: D t i ut i = i:y ih jx i=+ which is equivalent to the following i:y ih jx i=+ D t i ut i = D t i ut i+ i:y ih jx i= i:y ih jx i= D t i ut i+ D t i ut i+ i:r jx i= i:r jx i= D t i ut i, D t i ut i D t i ut i. 8

19 Table : For each data set, we report the saple size and the nuber of features. Plugging this into equation 0, we have that F α t, e j = Z t a [ cb i:r jx i= i:y ih jx i= D t i vt i Data Sets Saple Size Feature australian cod skin banknote,37 4 haberan pia ut id t i + + β. i:r jx i= ut id t i This in turn iplies that our weak learning algorith is given by the following: ut id t i lθ, θ = E i D [a ui [ yi=+ hθ,θ x= + yi= hθ,θ x= + [a ui cb vi rθ,θ x=. The following lea allows us to decouple the optiization proble into two optiization probles with respect to θ and θ that can be solved in linear tie. Lea 7. The optiization proble without the constraint θ < θ can be decoposed as follows: argin E a ui [ θ,θ i D yi=+ Xi θ + yi= Xi>θ + [a ui cb vi θ<x i θ = argin E a ui θ i D yi=+ Xi θ + [a ui cb vi θ<x i + argin θ E a ui i D yi= Xi>θ + [a ui cb vi Xi θ. Proof. For siplicity, let κ = a ui cb vi and observe that the following identity holds: In view of that, we can write θ<x i θ = θ<x i + Xi θ argin E [a ui [ i D yi=+ Xi θ + yi= Xi>θ + κ θ<x i θ θ,θ = argin E [a ui [ i D yi=+ Xi θ + yi= Xi>θ + κ θ<x i + Xi θ θ,θ = argin E [a ui i D yi=+ Xi θ + κ θ<x i + a ui yi= Xi>θ θ,θ + κ Xi θ κ = argin E [a ui i D yi=+ Xi θ + κ θ<x i θ,θ + argin E [a ui i D yi= Xi>θ + κ Xi θ θ,θ which copletes the proof. I Data sets Table shows the saple size and nuber of features for each data set used in our experients. 9

20 J Experients In this appendix, we report the results of several experients by presenting different tables in order to copare the three algoriths studied in this paper: TSB, DHL, and BA. In each table, we provide the average and standard deviation on the test set for the hyper paraeter configurations that aditted the sallest abstention loss on the validation set. Overall, these results reveal that BA yields a significant iproveent in practice for all the data sets across different values of cost c. Table gives the average abstention loss on the test set for TSB, DHL, and BA algoriths. Across alost all the different values of cost c, the BA algorith attains the sallest abstention loss copared with the TSB and DHL algoriths. On soe datasets, the TSB perfors better than the DHL algorith, but on other datasets, its perforance largely deteriorates. We also see that the effects of changing the cost c of rejection for soe datasets is uch stronger than for other datasets. For exaple, the pia dataset has a large change in abstention loss as c increases while for banknote dataset the difference in abstention loss is very sall. These changes reflect the changes in the fraction of points rejected by the algoriths, see Table 3. Note that this effect also depends on the algorith as seen in the cod dataset where BA algorith s abstention loss changes only slightly while for the other two algoriths the difference is uch higher as c increases. Table 3 shows the fraction of points that are rejected on the test set. For all three algoriths, the fraction of points rejected decreases as the cost c of rejection increases. Moreover, the fraction of points rejected is uch higher for soe datasets. For ost values of c, the DHL algorith appears to reject less frequently, but its abstention loss is also higher. For haberan, australian, and pia datasets, the TSB algorith rejection rates is quite high, which reinforces our clai that DHL and BA algoriths are better algoriths. Finally, Table 4, presents the classification loss on non-rejected points for different values of c. As c increases, we see that ore points are classified incorrectly, which is in accordance with the previous table since it shows that we are also rejecting less points. 0

Boosting with Abstention

Boosting with Abstention Boosting with Abstention Corinna Cortes Google Research New York, NY corinna@google.co Giulia DeSalvo Courant Institute New York, NY desalvo@cis.nyu.edu Mehryar Mohri Courant Institute and Google New York,

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Matheatical Sciences, 251 Mercer Street, New York,

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 1,2 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Matheatical Sciences, 251 Mercer Street, New York,

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research
