Boosting with Abstention

Size: px
Start display at page:

Download "Boosting with Abstention"

Transcription

1 Boosting with Abstention Corinna Cortes Google Research New York, NY 00 Giulia DeSalvo Courant Institute New York, NY 00 Mehryar Mohri Courant Institute and Google New York, NY 00 Abstract We present a new boosting algorith for the key scenario of binary classification with abstention where the algorith can abstain fro predicting the label of a point, at the price of a fixed cost. At each round, our algorith selects a pair of functions, a base predictor and a base abstention function. We define convex upper bounds for the natural loss function associated to this proble, which we prove to be calibrated with respect to the Bayes solution. Our algorith benefits fro general argin-based learning guarantees which we derive for ensebles of pairs of base predictor and abstention functions, in ters of the Radeacher coplexities of the corresponding function classes. We give convergence guarantees for our algorith along with a linear-tie weak-learning algorith for abstention stups. We also report the results of several experients suggesting that our algorith provides a significant iproveent in practice over two confidence-based algoriths. Introduction Classification with abstention is a key learning scenario where the algorith can abstain fro aking a prediction, at the price of incurring a fixed cost. This is the natural scenario in a variety of coon and iportant applications. An exaple is spoken-dialog applications where the syste can redirect a call to an operator to avoid the cost of incorrectly assigning a category to a spoken utterance and isguiding the dialog anager. This requires the availability of an operator, which incurs a fixed and predefined price. Other exaples arise in the design of a search engine or an inforation extraction syste, where, rather than taking the risk of displaying an irrelevant docuent, the syste can resort to the help of a ore sophisticated, but ore tie-consuing classifier. More generally, this learning scenario arises in a wide range of applications including health, bioinforatics, astronoical event detection, active learning, and any others, where abstention is an acceptable option with soe cost. Classification with abstention is thus a highly relevant proble. The standard approach for tackling this proble is via confidence-based abstention: a real-valued function h is learned for the classification proble and the points x for which its agnitude hx is saller than soe threshold γ are rejected. Bartlett and Wegkap [ gave a theoretical analysis of this approach based on consistency. They introduced a discontinuous loss function taking into account the cost for rejection, upper-bounded that loss by a convex and continuous Double Hinge Loss DHL surrogate, and derived an algorith based on that convex surrogate loss. Their work inspired a series of follow-up papers that developed both the theory and practice behind confidence-based abstention [3, 5, 3. Further related works can be found in Appendix A. In this paper, we present a solution to the proble of classification with abstention that radically departs fro the confidence-based approach. We introduce a general odel where a pair h, r for a classifier h and rejection function r are learned siultaneously. Under this novel fraework, we present a Boosting-style algorith with Abstention, BA, that learns accurately the classifier and abstention functions. 
Note that the terinology of boosting with abstention was used by Schapire and Singer [6 to refer to a scenario where a base classifier is allowed to abstain, but 30th Conference on Neural Inforation Processing Systes NIPS 06, Barcelona, Spain.

2 hx = x + Figure : The best predictor h is defined by the threshold θ: hx = x + θ. For c <, the region defined by X η should be rejected. But the corresponding abstention function r defined by rx = x η cannot be defined as hx γ for any γ > 0. y hx < where the boosting algorith itself has to coit to a prediction. This is therefore distinct fro the scenario of classification with abstention studied here. Nevertheless, we will introduce and exaine a confidence-based Two-Step Boosting algorith, the TSB algorith, that consists of first training Adaboost and next searching for the best confidence-based abstention threshold. The paper is organized as follows. Section describes our general abstention odel which consists of learning a pair h, r siultaneously and copares it with confidence-based odels. Section 3. presents a series of theoretical results for the proble of learning convex ensebles for classification with abstention, including the introduction of calibrated convex surrogate losses and general datadependent learning guarantees. In Section 4, we use these learning bounds to design a regularized boosting algorith. We further prove the convergence of the algorith and present a linear-tie weak-learning algorith for a natural faily of abstention stups. Finally, in Section 5, we report several experiental results coparing the BA algorith with the DHL and the TSB algoriths. Preliinaries In this section, we first introduce a general odel for learning with abstention [7 and then copare it with confidence-based odels.. General abstention odel We assue as in standard supervised learning that the training and test points are drawn i.i.d. according to soe fixed but unknown distribution D over X {, +}. We consider the learning scenario of binary classification with abstention. Given an instance x X, the learner has the option of abstaining fro aking a prediction for x at the price of incurring a non-negative loss cx, or otherwise aking a prediction hx using a predictor h and incurring the standard zero-one loss yhx 0 where the true label is y. Since a rando guess achieves an expected cost of at ost, rejection only akes sense for cx<. We will odel the learner by a pair h, r where the function r : X R deterines the points x X to be rejected according to rx 0 and where the hypothesis h: X R predicts labels for non-rejected points via its sign. Extending the loss function considered in Bartlett and Wegkap [, the abstention loss for a pair h, r is defined as as follows for any x, y X {, +}: Lh, r, x, y = yhx 0 rx>0 + cx rx 0. The abstention cost cx is assued known to the learner. In the following, we assue that c is a constant function, but part of our analysis is applicable to the ore general case. We denote by H and R two failies of functions apping X to R and we assue the labeled saple S = x, y,..., x, y is drawn i.i.d. fro D. The learning proble consists of deterining a pair h, r H R that adits a sall expected abstention loss Rh, r, defined as follows: Rh, r = E x,y D [ yhx 0 rx>0 + c rx 0. Siilarly, we define the epirical loss of a pair h, r H R over the saple S by: R S h, r = E x,y S [ yhx 0 rx>0 + c rx 0, where x, y S indicates that x, y is drawn according to the epirical distribution defined by S.. Confidence-based abstention odel Confidence-based odels are a special case of the general odel for learning with rejection presented in Section. corresponding to the pair hx, rx = hx, hx γ, where γ is a paraeter x

3 that changes the threshold of rejection. This specific choice was based on consistency results shown in [. In particular, the Bayes solution h, r of the learning proble, that is where the distribution D is known, is given by h x = ηx and r x = h x c where ηx = P[Y = + x for any x X, but note that this is not a unique solution. The for of h x follows by a siilar reasoning as for the standard binary classification proble. It is straightforward to see that the optial rejection function r is non-positive, eaning a point is rejected, if and only if in{ηx, ηx} c. Equivalently, the following holds: ax{ηx, ηx} c if and only if ηx c and using the definition of h, we recover the optial r. In light of the Bayes solution, the specific choice of the abstention function r is natural; however, requiring the abstention function r to be defined as rx = hx γ, for soe h H, is in general too restrictive when predictors are selected out of a liited subset H of all easurable functions over X. Consider the exaple shown in Figure where H is a faily of linear functions. For this siple case, the optial abstention region cannot be attained as a function of the best predictor h while it can be achieved by allowing to learn a pair h, r. Thus, the general odel for learning with abstention analyzed in Section. is both ore flexible and ore general. 3 Theoretical analysis This section presents a theoretical analysis of the proble of learning convex ensebles for classification with abstention. We first introduce general convex surrogate functions for the abstention loss and prove a necessary and sufficient condition based on their paraeters for the to be calibrated. Next we define the enseble faily we consider and prove general data-dependent learning guarantees for it based on the Radeacher coplexities of the base predictor and base rejector sets. 3. Convex surrogates We introduce two types of convex surrogate functions for the abstention loss. Observe that the abstention loss Lh, r, x, y can be equivalently expressed as Lh, r, x, y = ax yhx 0 rx<0, c rx 0. In view of that, since for any f, g R, axf, g = f+g+ g f f+g, the following inequalities hold for a > 0 and b > 0: Lh, r, x, y = ax yhx 0 rx<0, c rx 0 ax axyhx, rx 0, c rx 0 ax yhx rx 0, c rx 0 = ax a [yhx rx 0, c b rx 0 ax Φ a [rx yhx, c Φ b rx, where u Φ u and u Φ u are two non-increasing convex functions upper-bounding u u 0 over R. Let L MB be the convex surrogate defined by the last inequality above: L MB h, r, x, y = ax Φ a [rx yhx, c Φ b rx, 3 Since L MB is not differentiable everywhere, we upper-bound the convex surrogate L MB as follows: ax a [yhx rx 0, c b rx 0 Φ a [rx yhx + c Φ b rx. Siilarly, we let L SB denote this convex surrogate: L SB h, r, x, y = Φ a [rx yhx + c Φ b rx. 4 Figure shows the plots of the convex surrogates L MB and L SB as well as that of the abstention loss. Let h L, r L denote the pair that attains the iniu of the expected loss E x,yl SB h, r, x, y over all easurable functions for Φ u = Φ u = expu. In Appendix F, we show that with ηx= PY =+ X =x, the pair h L, r L where h L = a log η η and r L = a +b log cb a η η akes L SB a calibrated loss, eaning that the sign of the h L, r L that iniizes the expected surrogate loss atches the sign of the Bayes classifier h, r. More precisely, the following holds. Theore Calibration of convex surrogate. For a > 0 and b > 0, the inf h,r E x,y [Lh, r, x, y is attained at h L, r L such that signh = signh L and signr = signrl if and only if b /a = c/c. 3

4 Figure : The left figure is a plot of the abstention loss. The iddle figure is a plot of the surrogate function L MB while the right figure is a plot of the surrogate loss L SB both for c = The theore shows that the classification and rejection solution obtained by iniizing the surrogate loss for that choice of a, b coincides with the one obtained using the original loss. In the following, we ake the explicit choice of a = and b = c/c for the loss L SB to be calibrated. 3. Learning guarantees for ensebles in classification with abstention In the standard scenario of classification, it is often easy to coe up with siple base classifiers that ay abstain. As an exaple, a siple rule could classify a essage as spa based on the presence of soe word, as ha in the presence of soe other word, and just abstain in the absence of both, as in the boosting with abstention algorith by Schapire and Singer [6. Our objective is to learn ensebles of such base hypotheses to create accurate solutions for classification with abstention. Our enseble functions are based on the fraework described in Section.. Let H and R be two failies of functions apping X to [,. The enseble faily F that we consider is then the convex hull of H R: { T T T } F = α t h t, α t r t : T, α t 0, α t =, h t H, r t R. 5 t= t= Thus, h, r F abstains on input x X when rx 0 and predicts the label signhx otherwise. Let u Φ u and u Φ u be two strictly decreasing differentiable convex function upperbounding u u 0 over R. For calibration constants a, b > 0, and cost c > 0, we assue that there exist u and v such that Φ a u < and c Φ v <, otherwise the surrogate would not be useful. Let Φ and Φ be the inverse functions, which always exist since Φ and Φ are strictly onotone. We will use the following definitions: C Φ = a Φ Φ and C Φ = cb Φ Φ /c. Observe that for Φ u = Φ u = expu, we siply have C Φ = a and C Φ = b. Theore. Let H and R be two failies of functions apping X to R. Assue N >. Then, for any δ > 0, with probability at least δ over the draw of a saple S of size fro D, the following holds for all h, r F: log /δ Rh, r E [L MBh, r, x, y + C Φ R H + C Φ + C Φ R R + x,y S. The proof is given in Appendix C. The theore gives effective learning guarantees for enseble pairs h, r F when the base predictor and abstention functions adit favorable Radeacher coplexities. In earlier work [7, we present a learning bound for a different type of surrogate losses which can also be extended to hold for ensebles. Next, we derive argin-based guarantees in the case where Φ u = Φ u = expu. For any ρ > 0, the argin-losses associated to L MB and L SB are denoted by L ρ MB and Lρ SB and defined for all h, r F and x, y X {, +} by L ρ MB h, r, x, y = L MBh/ρ, r/ρ, x, y and L ρ SB h, r, x, y = L SBh/ρ, r/ρ, x, y. Theore applied to this argin-based loss results in the following corollary. Corollary 3. Assue N > and fix ρ > 0. Then, for any δ > 0, with probability at least δ over the draw of an i.i.d. saple S of size fro D, the following holds for all f F: Rh, r E x,y S [Lρ MB h, r, x, y + a ρ R H + a + b log /δ R R + ρ. t= 4

5 BAS = x, y,..., x, y for i to do D i, ; D i, 3 for t to T do 4 Z,t D ti, ; Z,t D ti, 5 k argin j [,N Z,t ɛ t,j + Z,t r j, c cz,t r j, Direction 6 Z Z,t ɛ t,k + r k, r c cz k,,t 7 if Z,t Ze α t,k Ze α t,k < Z t β then 8 η t α t,k Step 9 else η t log 0 α t α t + η t e k r t N j= α jr j h t N j= α jh j 3 Z t+ [ [ β Z + β Z tz Z tz +,t Z Step Φ r t x i y i h t x i + Φ c c r tx i 4 for i to do 5 D t+ i, Φ r tx i y ih tx i Z t+ ; D t+ i, Φ 6 h, r N j= α T,jh j, r j 7 return h, r c c rtxi Z t+ Figure 3: Pseudocode of the BA algorith for both the exponential loss with Φ u = Φ u = expu as well as for the logistic loss with Φ u = Φ u = log + e u. The paraeters include the cost of rejection c and β deterining the strength of the the α-constraint for the L regularization. The definition of the weighted errors ɛ t,k as well as the expected rejections, r k, and r k,, are given in Equation 7. For other surrogate losses, the step size η t is found via a line search or other nuerical ethods by solving argin η F α t + ηe k. The bound of Corollary 3 applies siilarly to L ρ SB since it is an upper bound on Lρ MB. It can further log log /ρ be shown to hold uniforly for all ρ 0, at the price of a ter in O using standard techniques [6, see Appendix C. 4 Boosting algorith Here, we derive a boosting-style algorith BA algorith for learning an enseble with the option of abstention for both losses L MB and L SB. Below, we describe the algorith for L SB and refer the reader to Appendix H for the version using the loss L MB. 4. Objective function The BA algorith solves a convex optiization proble that is based on Corollary 3 for loss L SB. Since the last three ters of the right-hand side of the boundof the corollary do not depend on α, this suggests to select α as the solution of in α Lρ SB h, r, x i, y i. Via a change of variable α α/ρ that does not affect the optiization proble, we can equivalently search for in α 0 L SBh, r, x i, y i such that T t= α t /ρ. Introducing the Lagrange variable β associated to the constraint T t= α t /ρ, the proble can rewritten as: in α 0 L SBh, r, x i, y i +β T t= α t. Letting {h, r,..., h N, r N } be the set of base functions pairs for the classifier and rejection function, we can rewrite the optiization proble as 5

6 the iniization over α 0 of N Φ α j r j x i y i j= N j= α j h j x i +c Φ b N j= α j r j x i +β N α j. Thus, the following is the objective function of our optiization proble: F α = Φ r t x i y i h t x i + c Φ b r t x i N + β α j Projected coordinate descent The proble in α 0 F α is a convex optiization proble, which we solve via projected coordinate descent. Let e k be the kth unit vector in R N and let F α, e j be the directional derivative of F along the direction e j at α. The algorith consists of the following three steps. First, it deterines the direction of axial descent by k = argax j [,N F α t, e j. Second, it calculates the best step η along the direction that preserves non-negativity of α by η = argin αt +ηe k 0 F α t + ηe k. Third, it updates α t to α t = α t + ηe k. The pseudocode of the BA algorith is given in Figure 3. The step and direction are based on F α t, e j. For any t [, T, define a distribution D t over the pairs i, n, with n in {, } D t i, = Φ r t x i y i h t x i and D t i, = Φ b rt x i, Z t Z t where Z t is the noralization factor given by Z t = Φ r t x i y i h t x i + Φ b r t x i. In order to derive an explicit forulation of the descent direction that is based on the weighted error of the classification function h j and the expected value of the rejection function r j, we use the distributions D,t and D,t defined by D t i, /Z,t and D t i, /Z,t where Z,t = D ti, and Z,t = D ti, are the noralization factors. Now, for any j [, N and s [, T, we can define the weighted error ɛ t,j and the expected value of the rejection function, r j, and r j,, over distribution D,t and D,t as follows: [ ɛ t,j = E [y i h j x i, r j, = E [r j x i, and r j, = i D,t i D,t Using these definition, we show see Appendix D that the descent direction is given by k = argin Z,t ɛ t,j + Z,t r j, c cz,t r j,. j [,N j= j= E [r j x i. 7 i D,t This equation shows that Z,t and c cz,t re-scale the weighted error and expected rejection. Thus, finding the best descent direction by iniizing this equation is equivalent to finding the best scaled trade-off between the isclassification error and the average rejection cost. The step size can in general be found via line search or other nuerical ethods, but we have derived a closed-for solution of the step size for both the exponential and logistic loss see Appendix D.. Further details of the derivation of projected coordinate descent on F are also given in Appendix D. Note that for r t 0 + in Equation 6, that is when the rejection ters are dropped in the objective, we retrieve the L-regularized Adaboost. As for Adaboost, we can define a weak learning assuption which requires that the directional derivative along at least one base pair be non-zero. For β = 0, it does not hold when for all j: ɛ s,j = r j, + c cz,t Z,t r j,, which corresponds to a balance between the edge and rejection costs for all j. Observe that in the particular case when the rejection functions are zero, it coincides with the standard weak learning assuption for Adaboost ɛ s,j = for all j. The following theore provides the convergence of the projected coordinate descent algorith for our objective function, F α. The proof is given in Appendix E. Theore 4. Assue that Φ is twice differentiable and that Φ u > 0 for all u R. Then, the projected coordinate descent algorith applied to F converges to the solution α of the optiization proble ax α 0 F α. 
If additionally Φ is strongly convex over the path of the iterates α t then there exists τ > 0 and ν > 0 such that for all t > τ, F α t+ F α ν F αt F α. 6

7 R + + R Figure 4: Illustration of the abstention stups on a variable X. Specifically, this theore holds for the exponential loss Φu = expu and the logistic loss Φ u = log + e u since they are strongly convex over the copact set containing the α t s. 4.3 Abstention stups We first define a faily of base hypotheses, abstention stups, that can be viewed as extensions of the standard boosting stups to the setting of classification with abstention. An abstention stup h θ,θ over the feature X is defined by two thresholds θ, θ R with θ θ. There are 6 different such stups, Figure 4 illustrates two of the. For the left figure, points with variables X less than or equal to θ are labeled negatively, those with X θ are labeled positively, and those with X between θ and θ are rejected. In general, an abstention stup is defined by the pair h θ,θ X, r θ,θ X where, for Figure 4-left, h θ,θ X = X θ + X>θ and r θ,θ X = θ<x θ. Thus, our abstention stups are pairs h, ˆr with h taking values in {, 0, } and ˆr in {0, }, and such that for any x either hx or ˆrx is zero. For our forulation and algorith, these stups can be used in cobination with any γ > 0, to define a faily of base predictor and base rejector pairs of the for hx, γ ˆrx. Since α t is non-negative, the value γ is needed to correct for over-rejection by previously selected abstention stups. The γ can be autoatically learned by adding to the set of base pairs the constant functions h 0, r 0 = 0,. An enseble solution returned by the BA algorith is therefore of the for t α th t x, t α tr t x where α t s are the weights assigned to each base pair. Now, consider a saple of points sorted by the value of X, which we denote by X X. For abstention stups, the derivative of the objective, F, can be further siplified see Appendix G such that the proble can be reduced to finding an abstention stup with the inial expected abstention loss lθ, θ, that is argin D t i, [ yi=+ Xi θ + yi= Xi>θ + D t i, cb D t i, θ<x i θ. θ,θ Notice that given points, at ost + thresholds need to be considered for θ and θ. Hence, a straightforward algorith inspects all possible O pairs θ, θ with θ θ in tie O. However, Lea 5 below and further derivations in Appendix G, allows for an O-tie algorith for finding optial abstention stups when the proble is solved without the constraint θ θ. Note that while we state the lea for the abstention stup in Figure 4-left, siilar results hold for any of the 6 types of stups. Lea 5. The optiization proble without the constraint θ < θ can be decoposed as follows: argin θ,θ lθ, θ = argin θ + argin θ D t i, yi=+ Xi θ + D t i, cb D t i, θ<x i 8 D t i, yi= Xi>θ + D t i, cb D t i, Xi θ. 9 The optiization Probles 8 and 9 can be solved in linear tie, via a ethod siilar to that of finding the optial threshold for a standard zero-one loss boosting stup. When the condition θ < θ does not hold, we can siply revert to finding the iniu of lθ, θ in the naive way. In practice, we find ost often that the optial solution of Proble 8 and Proble 9 satisfies θ < θ. 5 Experients In this section, we present the results of experients with our abstention stup BA algorith based on L SB for several datasets. We copare the BA algorith with the DHL algorith [, as well as a 7

8 0.3 cod pia skin Rejection Loss Rejection Loss Rejection Loss Cost Cost Cost 0.6 banknote haberan australian Rejection Loss Rejection Loss Rejection Loss Cost Cost Figure 5: Average rejection loss on the test set as a function of the abstention cost c for the TSB Algorith in orange, the DHL Algorith in red and the BA Algorith in blue based on L SB. Cost confidence-based boosting algorith TSB. Both of these algoriths are described in further detail in Appendix B. We tested the algoriths on six data sets fro UCI s data repository, specifically australian, cod, skin, banknote, haberan, and pia. For ore inforation about the data sets, see Appendix I. For each data set, we ipleented the standard 5-fold cross-validation where we randoly divided the data into training, validation and test set with the ratio 3::. Using a different rando partition, we repeated the experients five ties. For all three algoriths, the cost values ranged over c {0.05, 0.,..., 0.5} while threshold γ ranged over γ {0.08, 0.6,..., 0.96}. For the BA algorith, the β regularization paraeter ranged over β {0, 0.05,..., 0.95}. All experients for BA were based on T = 00 boosting rounds. The DHL algorith used polynoial kernels with degree d {,, 3} and it was ipleented in CVX [8. For each cost c, the hyperparaeter configuration was chosen to be the set of paraeters that attained the sallest average rejection loss on the validation set. For that set of paraeters we report the results on the test set. We first copared the confidence-based TSB algorith with the BA and DHL algoriths first row of Figure 5. The experients show that, while TSB can soeties perfor better than DHL, in a nuber of cases its perforance is draatically worse as a function of c and, in all cases it is outperfored by BA. In Appendix J, we give the full set of results for the TSB algorith. In view of that, our next series of results focus on the BA and DHL algoriths, directly designed to optiize the rejection loss, for 3 other datasets second row of Figure 5. Overall, the figures show that BA outperfors the state-of-the-art DHL algorith for ost values of c, thereby indicating that BA yields a significant iproveent in practice. We have also successfully run BA on the CIFAR-0 data set boat and horse iages which contains 0,000 instances and we believe that our algorith can scale to uch larger datasets. In contrast, training DHL on such larger saples did not terinate as it is based on a costly QCQP. In Appendix J, we present tables that report the average and standard deviation of the abstention loss as well as the fraction of rejected points and the classification error on non-rejected points. 6 Conclusion We introduced a general fraework for classification with abstention where the predictor and abstention functions are learned siultaneously. We gave a detailed study of enseble learning within this fraework including: new surrogate loss functions proven to be calibrated, Radeacher coplexity argin bounds for enseble learning of the pair of predictor and abstention functions, a new boosting-style algorith, the analysis of a natural faily of base predictor and abstention functions, and the results of several experients showing that BA algorith yield a significant iproveent over the confidence-based algoriths DHL and TSB. Our algorith can be further extended by considering ore coplex base pairs such as ore general ternary decision trees with rejection leaves. 
Moreover, our theory and algorith can be generalized to the scenario of ulti-class classification with abstention, which we have already initiated. Acknowledgents This work was partly funded by NSF CCF and IIS

9 References [ P. Bartlett and M. Wegkap. Classification with a reject option using a hinge loss. JMLR, 008. [ A. Bounsiar, E. Grall, and P. Beauseroy. Kernel based rejection ethod for supervised classification. In WASET, 007. [3 H. L. Capitaine and C. Frelicot. An optiu class-rejective decision rule and its evaluation. In ICPR, 00. [4 K. Chaudhuri and C. Zhang. Beyond disagreeent-based agnostic active learning. In NIPS, 04. [5 C. Chow. An optiu character recognition syste using decision function. IEEE Trans. Coput., 957. [6 C. Chow. On optiu recognition error and reject trade-off. IEEE Trans. Coput., 970. [7 C. Cortes, G. DeSalvo, and M. Mohri. Learning with rejection. In ALT, 06. [8 I. CVX Research. CVX: Matlab software for disciplined convex prograing, version.0, Aug. 0. [9 B. Dubuisson and M. Masson. Statistical decision rule with incoplete knowledge about classes. In PR, 993. [0 R. El-Yaniv and Y. Wiener. On the foundations of noise-free selective classification. JMLR, 00. [ R. El-Yaniv and Y. Wiener. Agnostic selective classification. In NIPS, 0. [ C. Elkan. The foundations of cost-sensitive learning. In IJCAI, 00. [3 G. Fuera and F. Roli. Support vector achines with ebedded reject option. In ICPR, 00. [4 G. Fuera, F. Roli, and G. Giacinto. Multiple reject thresholds for iproving classification reliability. In ICAPR, 000. [5 Y. Grandvalet, J. Keshet, A. Rakotoaonjy, and S. Canu. Suppport vector achines with a reject option. In NIPS, 008. [6 V. Koltchinskii and D. Panchenko. Epirical argin distributions and bounding the generalization error of cobined classifiers. Annals of Statistics, 30, 00. [7 T. Landgrebe, D. Tax, P. Paclik, and R. Duin. The interaction between classification and reject perforance for distance-based reject-option classifiers. PRL, 005. [8 M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperietry and Processes. Springer, New York, 99. [9 M. Littan, L. Li, and T. Walsh. Knows what it knows: A fraework for self-aware learning. In ICML, 008. [0 Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent ethod for convex differentiable iniization. Journal of Optiization Theory and Applications, 99. [ I. Melvin, J. Weston, C. S. Leslie, and W. S. Noble. Cobining classifiers for iproved classification of proteins fro sequence or structure. BMC Bioinforatics, 008. [ M. Mohri, A. Rostaizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 0. [3 F. Pedregosa, G. Varoquaux, A. Grafort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in python. In JMLR, 0. [4 T. Pietraszek. Optiizing abstaining classifiers using ROC analysis. In ICML, 005. [5 C. Santos-Pereira and A. Pires. On optial reject rules and ROC curves. PRL, 005. [6 R. E. Schapire and Y. Singer. Boostexter: A boosting-based syste for text categorization. Machine Learning, 39-3:35 68, 000. [7 D. Tax and R. Duin. Growing a ulti-class classifier with a reject option. In Pattern Recognition Letters, 008. [8 F. Tortorella. An optial reject rule for binary classifiers. In ICAPR, 00. [9 K. Trapeznikov and V. Saligraa. Supervised sequential classification under budget constraints. In AISTATS, 03. [30 J. Wang, K. Trapeznikov, and V. Saligraa. An lp for sequential learning under budgets. In JMLR, 04. [3 M. Yuan and M. Wegkap. Classification ethods with reject option based on convex risk iniizations. In JMLR, 00. [3 M. Yuang and M. Wegkap. 
Support vector achines with a reject option. In Bernoulli, 0. [33 C. Zhang and K. Chaudhuri. The extended littlestone s diension for learning with istakes and abstentions. In COLT, 06. 9

10 A Extended Related Work Initial work in learning with abstention has focused on the optial trade-off between the error and abstention rate [5, 6 as well as finding the optial abstention rule based on the ROC curve [4, 8, 5. Another series of papers used rejection options to reduce isclassification rate, but theoretical learning guarantees were not given [3, 4,, 7,. More recently, El-Yaniv and Wiener [0, study the trade-off between the coverage and the accuracy of classifiers by using an approach related to active learning. A seeingly connected fraework is that of cost-sensitive learning where the cost of isclassifying class y as class y ay depend on the pair y, y [. It would be tepting to view classification with abstention as a special instance of cost-sensitive learning with the set of classes {, +, R }, with R standing for abstention and where a different cost would be assigned to abstention. However, in our proble, R is not an intrinsic class: training or test saples bear no R label. Instead, the distribution over that set will depend on the algorith. Thus, classification with abstention cannot be cast as a special case of cost-sensitive learning. Sequential learning with a budget is also a arginally related task where abstention functions are learned. But, unlike our approach, it is done in a two-step process where the classifier function is fixed [9, 30. Lastly, the option of abstaining has been analyzed in related topics including the ulti-class setting [7, 9, 3, reinforceent learning [9, online learning [33 and active learning [4. B Confidence-based abstention odel In this appendix, we describe two confidence-based abstention algoriths: the DHL algorith and the TSB algorith. B. DHL algorith The DHL algorith found in [ is based on a double hinge loss, which is a hinge-type convex surrogate, with favorable consistency results. The optiization proble solved by the DHL algorith iniizes this surrogate loss along with the constraint that the nor of the classifier is bounded by c. More precisely, let H be a hypotheses sets defined in ters of PSD kernels K over X where Φ is the feature apping associated to K, then the DHL solves the following QCQP optiization proble in α,ξ,β subject to ξ i + c β i c α i α j Kx i, x j c i,j= B. Two-step Adaboost TSB ξ i y i α i Kx i, x ξ i 0, β i y i α i Kx i, x β i 0, i [,. The TSB algorith is a confidence-based algorith that proceeds in two steps. The first step consists of training a vanilla Adaboost algorith which returns a classifier h. Then, given classifier h, the second step is to search for the best threshold γ that iniizes the epirical abstention loss. More precisely, we pick the paraeter γ via cross-validation, by choosing the threshold that iniizes the epirical abstention loss on the validation set. This is a natural confidence-based boosting algorith and since the BA algorith is based on boosting, it provides a useful baseline for our experients. We ipleented this algorith using scikit-learn [3. C Theoretical guarantees In this appendix, we provide the proof of the theoretical guarantees presented in Section 3.. 0

11 Let u Φ u and u Φ u be two strictly non-increasing differentiable convex functions upper-bounding u u 0 over R. We assue that a, b > 0, and c > 0. We will use the quantity inφ a u, and so we assue that there exists u such that Φ a u < and siilarly, we need to analyze incφ u, and so we assue there exists a u such that cφ u <. Note that if these two assuptions did not hold, then the surrogate would not be useful. Let Φ and Φ be the inverse functions, which always exist since Φ and Φ are strictly onotone functions. For siplicity, we define C Φ = a Φ Φ and C Φ = cb Φ Φ /c. Theore. Let H and R be faily of functions apping X to R. Assue N >. Then, for any δ > 0, with probability at least δ over the draw of a saple S of size fro D, the following holds for all h, r F: Rh, r E [L MBh, r, x, y + C Φ R H + C Φ + C Φ R R + x,y S log δ. Proof. Let L MB,F be the faily of functions defined by L MB,F = { x, y inl MB h, r, x, y,, h, r F }. Since inl MB, is bounded by one, by the general Radeacher coplexity generalization bound [6, with probability at least δ over the draw of a saple S, the following holds: Rh, r E [inl MBh, r, x, y, x,y D E [inl MBh, r, x, y, + R L MB,F + x,y S E [L log δ MBh, r, x, y + R L MB,F + x,y S. log δ Since for any a, b R, in axa, b, = ax ina,, inb,, we can write inl MB h, r, x, y, = ax in Φ a [rx yhx,, in c Φ b rx, in Φ b [rx yhx, + in c Φ b rx,. The function Φ a u has a non-negative increasing derivative because it is a strictly increasing convex function. Since in Φ a u, = Φ a u for a u Φ, the Lipschitz constant of u in Φ a u, is given by a Φ Φ. Siilarly, u in cφ b u, is also cb Φ Φ /c-lipschitz. Then, by Talagrand s lea [8, R L MB,F a Φ Φ R x, y rx yhx: h, r F + c b Φ Φ /c R x, y rx: h, r F. 0 We exaine each of the ters in the right-hand side of the inequality: [ R x, y rx yhx: h, r F = Eσ sup E σ [ = E σ [ sup h,r F sup h,r F [ σ i rx i + Eσ [ σ i rx i + Eσ = R R + R H, sup h,r F sup h,r F h,r F σ i y i hx i σ i hx i σ i rx i y i hx i since y i σ i and σ i are distributed in the sae way, we effectively can absorb y i into the definition of σ i. Lastly, since the α does not affect the Radeacher coplexity, we have that

12 [ [ E σ sup h,r F σ ihx i = R H and siilarly E σ sup h,r F σ irx i = R R. By a siilar reasoning, we also have that R x, y rx: h, r F = Eσ [ sup h,r F σ i rx i = R R. Cobining the above, we have that the right-hand side of Inequality 0 is bounded as follows R L MB,F a Φ Φ R H + c b Φ Φ /c + a Φ Φ R R, which copletes the proof. By taking Φ u = Φ u = expu, we have the following theore since in this case, we siply have that C Φ = a and C Φ = b. Theore 6. Let H and R be faily of functions apping X to R. Assue N >. Then, for any δ > 0, with probability at least δ over the draw of a saple S of size fro D, the following holds for all h, r F: Rh, r E x,y S [L MBh, r, x, y + a R H + a + b R R + log /δ. The corollary below is a direct consequence of the above Theore 6 and it presents argin-based guarantees that are subsequently used to derive the BA algorith. Corollary 3. Assue N > and fix ρ > 0. Then, for any δ > 0, with probability at least δ over the draw of an i.i.d. saple S of size fro D, the following holds for all h, r F: Rh, r E x,y S [Lρ MB h, r, x, y + a ρ R H + a + b log /δ R R + ρ. D Direction and step of projected coordinate descent In this appendix, we provide the details of the projected coordinate descent, projected CD, algorith by first deriving the direction and then the optial step. We give a closed for solution of the step size for exponential loss Φu = expu and logistic loss Φu = log + e u. D. Direction At each iteration t, the direction e k selected by projected CD is k = argax j [,N F α t, e j where the derivative is given by the following F α t, e j = [r j x i y i h j x i Φ r t x i y i h t x i cb r j x i Φ b r t x i + β. Using the definition of Di, and Di,, we re-write the derivative as follows: F α t, e j = Z t = Z t [r j x i y i h j x i D t i, cb r j x i D t i, + β Z,t ɛ s,j Z,t + Z,t r j, cb Z,t r j, + β. Hence, we have that the descent direction is k = argin j [,N Z,t ɛ t,j + Z,t r j, cb Z,t r j,.

13 D. Step The optial step values η for direction e k is given by argin η+αt,k 0 F α t + ηe k. The values η ay be found via line search or other nuerical ethods, but below we derive a closed-for solution by iniizing an upper bound of F α t + ηe k. Since Φ is convex and since for all i [, y i h k x i + r k x i = + y ih k x i r k x i + y ih k x i + r k x i we have that the following holds for all η R Φ r t x i y i h t x i ηy i h k x i + ηr k x i + y ih k x i r k x i Φ r t x i y i h t x i η + y ih k x i + r k x i Φ r t x i y i h t x i + η. Siilarly, we have that b r k x i = b r kx i + b r kx i Φ b r t x i b ηr k x i b r kx i Φ b r t x i + η + b r kx i Φ b r t x i η Thus, we can upper-bound F as follows:, F α t + ηe k + y i h k x i r k x i Φ r t x i y i h t x i η + y i h k x i + r k x i Φ r t x i y i h t x i + η + b r k x i c Φ b r t x i + η + b r k x i c Φ b r t x i η + N α t β + βη j= We define Jη to be the right-hand side of the inequality above. We will select η as the solution of in η+αt,k 0 Jη, which is a convex optiization proble since J is convex. D.. Exponential loss When Φu = expu, the J function is given by Jη = + y i h k x i r k x i e rt xi yiht xi e η + y i h k x i + r k x i e rt xi yiht xi e η + b r k x i ce b rt xi e η + b r k x i ce b rt xi e η + N α t β + βη. j= 3

14 Since e rt xi yiht xi = Φ r t x i y i h t x i = Z t D t i, and e b rt xi = Φ b r t x i = Z t D t i,, it iplies that Jη = Z t ɛ t,k r k, Z,te η + ɛ t,k + r k, Z,te η + b r k, cz,t e η + b r k, cz,te η + N α t β + βη. For siplicity below, we define A = Z,t ɛ t,k r k, + cz,t b r k, and Z = Z,t ɛ t,k + r k, + cz,t b r k, so that J can be written as Jη = Z t Ae η + Ze η + j= N α t β + βη. Introducing a Lagrange variable λ 0, the optiization proble then becoes j= Lη, λ = Jη λη + α t,k with η Lη, λ = J η λ. By the KKT conditions, at the solution η, λ, J η = λ and λ η + α t,k = 0. Thus, we can fall in one of the two following cases:. λ > 0 J η > 0 and η = α t,k. λ = 0 and η is a solution of the equation Jη = 0 The first case can be written as Z t Ae α t,k + Ze α t,k + β > 0 Ae α t,k Ze α t,k < Z t β. For the second case we have to solve J η = 0 which can be written as e η + β Z tz eη A Z. The solution is given by e η = β Z t Z + β A [ + Z t Z Z η = log β β Z t Z + Z t Z Noting that A = Z,t Z, the above can be siplified to D.. Logistic loss [ η = log β Z t Z + A +. Z β Z,t + Z t Z Z. For the logistic loss, we have that for any u R, Φ u = log + e u and Φ u = log +e u. We have the following upper bound + e u + e u v e u Φ u v Φ u = log + e u = log + e v e u + e v log + e u = Φ ue v, which allows us to write F α t + ηe k F α t Φ r t x i y i h t x i e ηyih kx i+ηr k x i + c Φ b r t x i e b ηr kx i + βη. Fro here, we can use a very siilar reasoning as the exponential loss which results in a siilar expression for the step size. 4

15 E Convergence analysis of algorith In the section, we prove the convergence of the projected CD algorith for F α = Φ r t x i y i h t x i + c Φ b r t x i + β N j= α j. Theore 4. Assue that Φ is twice differentiable and that Φ u > 0 for all u R. Then, the projected CD algorith applied to F converges to the solution α of the optiization proble ax α 0 F α. If additionally Φ is strongly convex over the path of the iterates α t then there exists τ > 0 and ν > 0 such that for all t > τ, F α t+ F α ν F α t F α. Proof. Let H be the atrix in R N defined by H i,,j = y i h j x i r j x i and H i,,j = b r j x i for all i [, and for all j [, N, and let e i, and e i, be unit vectors in R. Then for any α, we have that e T i, Hα = N j= α jy i h j x i r j x i and e T i, Hα = b N j= α jr j x i. Thus, we can write for any α R N, F α = GHα + Λ T α, 3 where Λ = Λ,..., Λ N T and where G is the function defined by Gu = Φ e T i,u + cφ e T i,u = Φ u i, + cφ u i, 4 for all u R with u i, its i, th coordinate and u i, its i, th coordinate. Since Φ is differentiable, the function G is differentiable and Gu is a diagonal atrix with diagonal entries Φ u i, > 0 or c Φ u i, > 0 for all i [,. Thus, GHα is positive definite for all α. The conditions of Theore. of [0 are therefore satisfied for the optiization proble in GHα + α 0 ΛT α, 5 thereby guaranteeing the convergence of the projected CD ethod applied to F. If additionally F is strongly convex over the sequence of α t s, the by the result of [0[page 6, the Inequality holds for the projected coordinate ethod that we are using which selects the best direction at each round, as with the Gauss-Southwell ethod. F Calibration In this section, we show that L SB h, r, x, y = e a rx yhx +ce b rx is a calibrated loss whenever b a = c c. Below, let L := L SBh, r, x, y and define ηx = PY = + X = x. Theore. For a > 0 and b > 0, the inf h,r E x,y [Lh, r, x, y is attained at h L, r L such that signh = signh L and signr = signr L if and only if b a = c c. Proof. Conditioning on the label y, we can write the generalization error for the Lh, r, x, y as follows E [Lh, r, x, y = E [ηxψ hx, rx + ηxψhx, rx, x,y x where Ψ hx, rx = e a rx hx + ce b rx. For siplicity, we also let L Ψ hx, rx = ηxψ hx, rx + ηxψhx, rx. Since the infiu is over all easurable functions hx, rx, we have that inf h,r E x L Ψ hx, rx = E x inf hx,rx L Ψ hx, rx. Thus, we need to find the optial u, v for a fixed x that iniizes L Ψ u, v over all easurable functions, which is a convex optiization proble. When ηx = 0, the sign of the iniizers of L Ψ u, v are u < 0 and v > 0 while when ηx =, the the sign of the iniizers are u > 0 and v > 0, which atches the sign of h and r in both cases respectively. Now for ηx 0, [, we take the derivative of L Ψ u, v with respect to u L Ψu,v u = ηxa e a v u + ηxa e a u+v. 5

16 Setting it to zero and solving for u, we have that u = ηx a log ηx. We can now see that u > 0 if ηx > and u 0 if ηx. Recalling that h = ηx, we can conclude that the sign of u atches the sign of h. We now take the derivative of L Ψ u, v with respect to v L Ψu,v v = ηxe a v u + ηxe a v+u + c b e b v. Setting it equal to zero and using the fact that ηxe a u + ηxe a u = ηx ηx, we have that v = a +b log cb a ηx ηx. Now, we know that the Bayes classifiers h, r satisfy h = ηx and r = h + c so that the following holds ηx ηx = 4 h = 4 r + c. Thus, we can replace ηx ηx in the definition of v to arrive at this equation v = a +b log cb a. We now analyze when v > 0 which is equivalent to a +b log cb a 4 c > 4 r + c 4 r + c > 0 cb a cb a 4 r + > c > 4 r + c. Since 4 r + c for r > 0 and using the fact that c c = 4 c, we need that cb a c c. By siilar reasoning for v 0, we need that cb a c c. Thus, we can conclude that the sign of v atches the sign of r if and only if = c c. cb a G Abstention stups Under the assuptions of Section 4.3, the derivative of F can be siplified as follows F α t, e j = Z t D t i, + D t i, + D t i, i:y ih jx i=+ i:y ih jx i= i:r jx i= cb D t i, + β 6 i:r jx i= Fro the definition of Di, and the assuptions on hx and rx, the following holds D t i, = D t i, + D t i, + D t i, i:y ih jx i=+ i:y ih jx i= i:r jx i= Solving for i:y ih jx i=+ D ti, and plugging it in Equation 6, we can siplify the derivative F α t, e j = Z t i:y ih jx i= D t i, + i:r jx i= D t i, D t i, cb = Z t Z,t ɛ t,j + Z,t r j, cb Z,t r j, Z,t + β i:r jx i= Thus, the optial descent direction is k = argin j [,N Z,t ɛ t,j + Z,t r j, cb Z,t r j, Below, we provide the proof of the lea that was needed to decouple the optiization proble for the abstention stups. D t i, + β 6

17 Lea 5. The optiization proble without the constraint θ < θ can be decoposed as follows: argin D t i, [ yi=+ Xi θ + yi= Xi>θ + D t i, cb D t i, θ<x i θ. θ,θ = argin θ D t i, yi=+ Xi θ + D t i, cb D t i, θ<x i + argin D t i, yi= Xi>θ + D t i, cb D t i, Xi θ. θ Proof. For siplicity below, let κ = D t i, cb D t i, and observe that the following identity holds: θ<x i θ = θ<x i + Xi θ In view of that, we can write argin D t i, [ yi=+ Xi θ + yi= Xi>θ + κ θ<x i θ θ,θ = argin D t i, [ yi=+ Xi θ + yi= Xi>θ + κ[ θ<x i + Xi θ θ,θ = argin D t i, [ yi=+ Xi θ + yi= Xi>θ + κ θ<x i θ,θ + κ Xi θ κ = argin D t i, [ yi=+ Xi θ + yi= Xi>θ + κ θ<x i θ,θ = argin θ + κ Xi θ D t i, yi=+ Xi θ + κ θ<x i + argin D t i, yi= Xi>θ + κ Xi θ. θ H Alternative surrogate, L MB In this section, we derive the boosting algorith for the surrogate loss L MB h, r, x, y = ax Φ a [rx yhx, c Φ b rx. 7 By a siilar reasoning as Section 4, the objective function F α of our optiization proble is given by the following F α = ax e a [rtxi yihtxi, cb e b rtxi + β N α j. For siplicity, we define u t i = e a [rtxi yihtxi, v t i = cb e b rtxi and w t i = axu t i, v t i. We also let uti be the indicator functions that equals if u t i v t i and siilarly vti be the indicator functions that equals if v t i > u t i. For any t [, T, we also define the distribution D t i = w t i Z t, 8 7 j=

18 where Z t is the noralization factor given by Z t = w t i. We then apply projected coordinate descent to this objective function. Notice that our objective F is differentiable everywhere except when u t i = v t i. A true axiu descent algorith would choose the eleent of the subgradient that is closest to 0 as the descent direction. However, since this event is rare in our case, we arbitrarily a pick descent direction that is an eleent of the subgradient. For siplicity below, we will use the sybol F α t, e j to denote the directional derivative with the added condition that for the non-differentiable point, we choose the direction that is an eleent of the subgradient. H.0.3 Direction and step At each iteration t, the direction e k selected by projected CD is k = argax j [,N F α t, e j where F α t, e j = = = a [ y i h j x i + r j x i ut i cb r j x i vt i w t i + β a y i h j x i ut i [ a + cb ut i + cb r j x i w t i + β a y i h j x i ut i [ a + cb ut i + cb r j x i D t iz t + β. 9 The step can siply be found via line search or other nuerical ethods. H. Abstention stups We focus in on a special case where the base classifiers have a specific for defined as follows: hx takes values in {, 0, } and rx take values in {0, }. We also have the added the condition that for each saple point x, only one of the two coponents of hx, rx is non-zero. Under this setting, Equation 9 can be siplified as follows. The derivative of F is given by F α t, e j = a [ y i h j x i + r j x i ut i cb r j x i vt i D t iz t + β, which can be rewritten as = Z t a [ ut id t i ut id t i + ut id t i i:y ih jx i= i:y ih jx i=+ i:r jx i= cb D t i vt i + β. 0 i:r jx i= Fro the assuptions on hx and rx, the relation below holds: D t i ut i = i:y ih jx i=+ which is equivalent to the following i:y ih jx i=+ D t i ut i = D t i ut i+ i:y ih jx i= i:y ih jx i= D t i ut i+ D t i ut i+ i:r jx i= i:r jx i= D t i ut i, D t i ut i D t i ut i. 8

19 Table : For each data set, we report the saple size and the nuber of features. Plugging this into equation 0, we have that F α t, e j = Z t a [ cb i:r jx i= i:y ih jx i= D t i vt i Data Sets Saple Size Feature australian cod skin banknote,37 4 haberan pia ut id t i + + β. i:r jx i= ut id t i This in turn iplies that our weak learning algorith is given by the following: ut id t i lθ, θ = E i D [a ui [ yi=+ hθ,θ x= + yi= hθ,θ x= + [a ui cb vi rθ,θ x=. The following lea allows us to decouple the optiization proble into two optiization probles with respect to θ and θ that can be solved in linear tie. Lea 7. The optiization proble without the constraint θ < θ can be decoposed as follows: argin E a ui [ θ,θ i D yi=+ Xi θ + yi= Xi>θ + [a ui cb vi θ<x i θ = argin E a ui θ i D yi=+ Xi θ + [a ui cb vi θ<x i + argin θ E a ui i D yi= Xi>θ + [a ui cb vi Xi θ. Proof. For siplicity, let κ = a ui cb vi and observe that the following identity holds: In view of that, we can write θ<x i θ = θ<x i + Xi θ argin E [a ui [ i D yi=+ Xi θ + yi= Xi>θ + κ θ<x i θ θ,θ = argin E [a ui [ i D yi=+ Xi θ + yi= Xi>θ + κ θ<x i + Xi θ θ,θ = argin E [a ui i D yi=+ Xi θ + κ θ<x i + a ui yi= Xi>θ θ,θ + κ Xi θ κ = argin E [a ui i D yi=+ Xi θ + κ θ<x i θ,θ + argin E [a ui i D yi= Xi>θ + κ Xi θ θ,θ which copletes the proof. I Data sets Table shows the saple size and nuber of features for each data set used in our experients. 9

20 J Experients In this appendix, we report the results of several experients by presenting different tables in order to copare the three algoriths studied in this paper: TSB, DHL, and BA. In each table, we provide the average and standard deviation on the test set for the hyper paraeter configurations that aditted the sallest abstention loss on the validation set. Overall, these results reveal that BA yields a significant iproveent in practice for all the data sets across different values of cost c. Table gives the average abstention loss on the test set for TSB, DHL, and BA algoriths. Across alost all the different values of cost c, the BA algorith attains the sallest abstention loss copared with the TSB and DHL algoriths. On soe datasets, the TSB perfors better than the DHL algorith, but on other datasets, its perforance largely deteriorates. We also see that the effects of changing the cost c of rejection for soe datasets is uch stronger than for other datasets. For exaple, the pia dataset has a large change in abstention loss as c increases while for banknote dataset the difference in abstention loss is very sall. These changes reflect the changes in the fraction of points rejected by the algoriths, see Table 3. Note that this effect also depends on the algorith as seen in the cod dataset where BA algorith s abstention loss changes only slightly while for the other two algoriths the difference is uch higher as c increases. Table 3 shows the fraction of points that are rejected on the test set. For all three algoriths, the fraction of points rejected decreases as the cost c of rejection increases. Moreover, the fraction of points rejected is uch higher for soe datasets. For ost values of c, the DHL algorith appears to reject less frequently, but its abstention loss is also higher. For haberan, australian, and pia datasets, the TSB algorith rejection rates is quite high, which reinforces our clai that DHL and BA algoriths are better algoriths. Finally, Table 4, presents the classification loss on non-rejected points for different values of c. As c increases, we see that ore points are classified incorrectly, which is in accordance with the previous table since it shows that we are also rejecting less points. 0

Boosting with Abstention

Boosting with Abstention Boosting with Abstention Corinna Cortes Google Research New York, NY corinna@google.co Giulia DeSalvo Courant Institute New York, NY desalvo@cis.nyu.edu Mehryar Mohri Courant Institute and Google New York,

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Matheatical Sciences, 251 Mercer Street, New York,

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 1,2 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Matheatical Sciences, 251 Mercer Street, New York,

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research
