Learning with Rejection

Corinna Cortes (1), Giulia DeSalvo (2), and Mehryar Mohri (2,1)

(1) Google Research, 111 8th Avenue, New York, NY
(2) Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY

Abstract. We introduce a novel framework for classification with a rejection option that consists of simultaneously learning two functions: a classifier along with a rejection function. We present a full theoretical analysis of this framework, including new data-dependent learning bounds in terms of the Rademacher complexities of the classifier and rejection families, as well as consistency and calibration results. These theoretical guarantees guide us in designing new algorithms that can exploit different kernel-based hypothesis sets for the classifier and rejection functions. We compare and contrast our general framework with the special case of confidence-based rejection, for which we devise alternative loss functions and algorithms as well. We report the results of several experiments showing that our kernel-based algorithms can yield a notable improvement over the best existing confidence-based rejection algorithm.

1 Introduction

We consider a flexible binary classification scenario where the learner is given the option to reject an instance instead of predicting its label, thereby incurring some pre-specified cost, typically less than that of a random prediction. While classification with a rejection option has received little attention in the past, it is in fact a scenario of great significance that frequently arises in applications. Incorrect predictions can be costly, especially in applications such as medical diagnosis and bioinformatics. In comparison, the cost of abstaining from prediction, which may be that of additional medical tests, or that of routing a call to a customer representative in a spoken-dialog system, is often more acceptable. From a learning perspective, abstaining from fitting systematic outliers can also result in a more accurate predictor. Accurate algorithms for learning with rejection can further be useful for developing solutions to other learning problems such as active learning [2].

One of the most influential works in this area has been that of Bartlett and Wegkamp [1], who studied a natural discontinuous loss function taking into account the cost of a rejection. They used consistency results to define a convex and continuous Double Hinge Loss (DHL) surrogate loss upper-bounding that rejection loss, which they also used to derive an algorithm. A series of follow-up articles further extended this publication, including [8], which used the same convex surrogate while focusing on the $\ell_1$ penalty. Grandvalet et al. [5] derived a convex surrogate based on [1] that aims at estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule.

They also provided some preliminary experimental results comparing the DHL algorithm and their variant with a naive rejection algorithm. Under the same rejection rule, Yuan and Wegkamp [7] studied the infinite-sample consistency of classification with a reject option.

In this paper, we introduce a novel framework for classification with a rejection option that consists of simultaneously learning a pair of functions $(h, r)$: a predictor $h$ along with a rejection function $r$, each selected from a different hypothesis set. This is a more general framework than the special case of confidence-based rejection studied by Bartlett and Wegkamp [1] and others, where the rejection function is constrained to be a thresholded function of the predictor's scores. Our novel framework opens up a new perspective on the problem of learning with rejection, for which we present a full theoretical analysis, including new data-dependent learning bounds in terms of the Rademacher complexities of the classifier and rejection families, as well as consistency and calibration results. We derive convex surrogates for this framework that are realizable $(H, R)$-consistent. These guarantees in turn guide the design of a variety of algorithms for learning with rejection. We describe in depth two different types of algorithms: the first type uses kernel-based hypothesis classes, the second type confidence-based rejection functions. We report the results of experiments comparing the performance of these algorithms and that of the DHL algorithm.

The paper is organized as follows. Section 2 introduces our novel learning framework and contrasts it with that of Bartlett and Wegkamp [1]. Section 3 provides generalization guarantees for learning with rejection. It also analyzes two convex surrogates of the loss along with consistency results and provides margin-based learning guarantees. Note that many of the proofs are omitted due to space limitations. In Section 4, we present an algorithm with kernel-based hypothesis sets derived from our learning bounds. Lastly, we report the results of several experiments comparing the performance of our algorithms with that of DHL.

2 Learning problem

Let $X$ denote the input space. We assume, as in standard supervised learning, that training and test points are drawn i.i.d. according to some fixed yet unknown distribution $D$ over $X \times \{-1, +1\}$. We present a new general model for learning with rejection, which includes the confidence-based models as a special case.

2.1 General rejection model

The learning scenario we consider is that of binary classification with rejection. Let ® denote the rejection symbol. For any given instance $x \in X$, the learner has the option of abstaining or rejecting that instance and returning the symbol ®, or assigning to it a label $\hat{y} \in \{-1, +1\}$. If the learner rejects an instance, it incurs some loss $c(x) \in [0, 1]$; if it does not reject but assigns an incorrect label, it incurs a cost of one; otherwise, it suffers no loss.

Thus, the learner's output is a pair $(h, r)$, where $h\colon X \to \mathbb{R}$ is the hypothesis used for predicting a label for points not rejected, using $\operatorname{sign}(h)$, and where $r\colon X \to \mathbb{R}$ is a function determining the points $x \in X$ to be rejected, according to $r(x) \leq 0$. The problem is distinct from a standard multi-class classification problem since no point is inherently labeled with ®. Its natural loss function $L$ is defined by
$$L(h, r, x, y) = 1_{yh(x) \leq 0}\, 1_{r(x) > 0} + c(x)\, 1_{r(x) \leq 0}, \qquad (1)$$
for any pair of functions $(h, r)$ and labeled sample $(x, y) \in X \times \{-1, +1\}$, thus extending the loss function considered by [1]. In what follows, we assume for simplicity that $c$ is a constant function, though part of our analysis is applicable to the general case. Observe that for $c \geq \frac{1}{2}$, on average, there is no incentive for rejection since a random guess can never incur an expected cost of more than $\frac{1}{2}$. For biased distributions, one may further limit $c$ to the fraction of the smallest class. For $c = 0$, we obtain a trivial solution by rejecting all points, so we restrict $c$ to the case $c \in \left(0, \frac{1}{2}\right)$.

Let $H$ and $R$ denote two families of functions mapping $X$ to $\mathbb{R}$. The learning problem consists of using a labeled sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ drawn i.i.d. from $D^m$ to determine a pair $(h, r) \in H \times R$ with a small expected rejection loss $R(h, r)$:
$$R(h, r) = \mathbb{E}_{(x, y) \sim D}\big[1_{yh(x) \leq 0}\, 1_{r(x) > 0} + c\, 1_{r(x) \leq 0}\big]. \qquad (2)$$
We denote by $\widehat{R}_S(h, r)$ the empirical loss of a pair $(h, r) \in H \times R$ over the sample $S$ and use $(x, y) \sim S$ to denote the draw of $(x, y)$ according to the empirical distribution defined by $S$: $\widehat{R}_S(h, r) = \mathbb{E}_{(x, y) \sim S}\big[1_{yh(x) \leq 0}\, 1_{r(x) > 0} + c\, 1_{r(x) \leq 0}\big]$.

2.2 Confidence-based rejection model

Learning with rejection based on two independent yet jointly learned functions $h$ and $r$ introduces a completely novel approach to this subject. However, our new framework encompasses much of the previous work on this problem: e.g., [1] is a special case where rejection is based on the magnitude of the value of the predictor $h$, that is, $x \in X$ is rejected if $|h(x)| \leq \gamma$ for some $\gamma \geq 0$. Thus, $r$ is implicitly defined in terms of the predictor $h$ by $r(x) = |h(x)| - \gamma$.

This specific choice of the rejection function $r$ is natural when considering the Bayes solution $(h^*, r^*)$ of the learning problem, that is, the one where the distribution $D$ is known. Indeed, for any $x \in X$, let $\eta(x)$ be defined by $\eta(x) = \mathbb{P}[Y = +1 \mid x]$. As shown in [1], we can choose the Bayes solution $h^*$ and $r^*$ as in Figure 1, which also provides an illustration of confidence-based rejection:
$$h^*(x) = \eta(x) - \tfrac{1}{2} \quad \text{and} \quad r^*(x) = |h^*(x)| - \left(\tfrac{1}{2} - c\right).$$

Fig. 1. Mathematical expression and illustration of the optimal classification and rejection functions for the Bayes solution. Note that as $c$ increases, the rejection region shrinks.
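
To make these definitions concrete, here is a minimal NumPy sketch (not part of the original paper; the function names are illustrative) of the rejection loss of Eq. (1), its empirical average over a sample, and the confidence-based rejection function $r(x) = |h(x)| - \gamma$:

```python
import numpy as np

def rejection_loss(h_vals, r_vals, y, c=0.3):
    """Rejection loss of Eq. (1): 1_{yh(x)<=0} 1_{r(x)>0} + c 1_{r(x)<=0},
    evaluated pointwise for arrays of scores h(x), r(x) and labels y in {-1, +1}."""
    misclassified = (y * h_vals <= 0).astype(float)
    accepted = (r_vals > 0).astype(float)
    rejected = (r_vals <= 0).astype(float)
    return misclassified * accepted + c * rejected

def empirical_rejection_loss(h_vals, r_vals, y, c=0.3):
    """Empirical counterpart of Eq. (2): average rejection loss over the sample S."""
    return rejection_loss(h_vals, r_vals, y, c).mean()

def confidence_based_r(h_vals, gamma=0.25):
    """Confidence-based rejection function r(x) = |h(x)| - gamma:
    a point is rejected when r(x) <= 0, i.e. when |h(x)| <= gamma."""
    return np.abs(h_vals) - gamma

# Toy usage with arbitrary scores.
y = np.array([+1, -1, +1, -1])
h_vals = np.array([0.8, 0.1, -0.3, -0.9])
r_vals = confidence_based_r(h_vals, gamma=0.25)
print(empirical_rejection_loss(h_vals, r_vals, y, c=0.3))
```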

Fig. 2. The best predictor $h$ is defined by the threshold $\theta$: $h(x) = x - \theta$. For $c < \frac{1}{2}$, the region defined by $x \leq \eta$ should be rejected. Note that the corresponding rejection function $r$ defined by $r(x) = x - \eta$ cannot be expressed as $|h(x)| - \gamma$ for some $\gamma > 0$.

However, when predictors are selected out of a limited subset $H$ of all measurable functions over $X$, requiring the rejection function $r$ to be defined as $r(x) = |h(x)| - \gamma$, for some $h \in H$, can be too restrictive. Consider, for example, the case where $H$ is a family of linear functions. Figure 2 shows a simple case in dimension one where the optimal rejection region cannot be defined simply as a function of the best predictor $h$. The model for learning with rejection that we describe, where a pair $(h, r)$ is selected, is more general. In the next section, we study the problem of learning such a pair.

3 Theoretical analysis

We first give a generalization bound for the problem of learning with our rejection loss function, as well as consistency results. Next, to devise efficient learning algorithms, we give general convex upper bounds on the rejection loss. For several of these convex surrogate losses, we prove margin-based guarantees that we subsequently use to define our learning algorithms (Section 4).

3.1 Generalization bound

Theorem 1. Let $H$ and $R$ be families of functions taking values in $\{-1, +1\}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$ from $D$, the following holds for all $(h, r) \in H \times R$:
$$R(h, r) \leq \widehat{R}_S(h, r) + \mathfrak{R}_m(H) + (1 + c)\, \mathfrak{R}_m(R) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

The theorem gives generalization guarantees for learning with a family of predictors $H$ and a family of rejection functions $R$ mapping to $\{-1, +1\}$ that admit Rademacher complexities in $O(1/\sqrt{m})$. For such families, it suggests selecting the pair $(h, r)$ that minimizes the right-hand side. As with the zero-one loss, minimizing $\widehat{R}_S(h, r)$ is computationally hard for most families of functions. Thus, in the next section, we study convex upper bounds that lead to more efficient optimization problems, while admitting favorable learning guarantees as well as consistency results.

3.2 Convex surrogate losses

We first present general convex upper bounds on the rejection loss. Let $u \mapsto \Phi(-u)$ and $u \mapsto \Psi(-u)$ be convex functions upper-bounding $1_{u \leq 0}$.

Since for any $a, b \in \mathbb{R}$, $\max(a, b) = \frac{a + b + |b - a|}{2} \geq \frac{a + b}{2}$, the following inequalities hold for any $\alpha > 0$ and $\beta > 0$:
$$\begin{aligned}
L(h, r, x, y) &= 1_{yh(x) \leq 0}\, 1_{r(x) > 0} + c\, 1_{r(x) \leq 0} \\
&= \max\big(1_{yh(x) \leq 0}\, 1_{-r(x) < 0},\; c\, 1_{r(x) \leq 0}\big) \\
&\leq \max\big(1_{\max(yh(x), -r(x)) \leq 0},\; c\, 1_{r(x) \leq 0}\big) \\
&\leq \max\big(1_{\frac{yh(x) - r(x)}{2} \leq 0},\; c\, 1_{r(x) \leq 0}\big) \\
&= \max\big(1_{\frac{\alpha}{2}(yh(x) - r(x)) \leq 0},\; c\, 1_{\beta r(x) \leq 0}\big) \\
&\leq \max\Big(\Phi\big(\tfrac{\alpha}{2}(r(x) - yh(x))\big),\; c\, \Psi(-\beta r(x))\Big) \qquad (3) \\
&\leq \Phi\big(\tfrac{\alpha}{2}(r(x) - yh(x))\big) + c\, \Psi(-\beta r(x)). \qquad (4)
\end{aligned}$$
Since $\Phi$ and $\Psi$ are convex, their composition with an affine function of $h$ and $r$ is also a convex function of $h$ and $r$. Since the maximum of two convex functions is convex, the right-hand side of (3) is a convex function of $h$ and $r$. Similarly, the right-hand side of (4) is a convex function of $h$ and $r$. In the specific case where the hinge loss is used for both $u \mapsto \Phi(-u)$ and $u \mapsto \Psi(-u)$, we obtain the following two convex upper bounds, Max Hinge (MH) and Plus Hinge (PH):
$$L_{\mathrm{MH}}(h, r, x, y) = \max\Big(1 + \tfrac{\alpha}{2}\big(r(x) - yh(x)\big),\; c\big(1 - \beta r(x)\big),\; 0\Big),$$
$$L_{\mathrm{PH}}(h, r, x, y) = \max\Big(1 + \tfrac{\alpha}{2}\big(r(x) - yh(x)\big),\; 0\Big) + \max\Big(c\big(1 - \beta r(x)\big),\; 0\Big).$$

3.3 Consistency results

In this section, we present a series of theoretical results related to the consistency of the convex surrogate losses introduced. We first prove calibration and consistency for specific choices of the parameters $\alpha$ and $\beta$. Next, we show that the excess risk with respect to the rejection loss can be bounded by its counterpart defined via our surrogate loss. We further prove a general realizable $(H, R)$-consistency for our surrogate losses.

Calibration. The constants $\alpha > 0$ and $\beta > 0$ are introduced in order to calibrate the surrogate loss with respect to the Bayes solution. Let $(h_M, r_M)$ be a pair attaining the infimum of the expected surrogate loss $\mathbb{E}_{(x, y) \sim D}[L_{\mathrm{MH}}(h, r, x, y)]$ over all measurable functions. Recall from Section 2 that the Bayes classifier is denoted by $(h^*, r^*)$. The following result shows that for $\alpha = 1$ and $\beta = \frac{1}{1 - 2c}$, the loss $L_{\mathrm{MH}}$ is calibrated, that is, the sign of $(h_M, r_M)$ matches the sign of $(h^*, r^*)$.

Theorem 2. Let $(h_M, r_M)$ denote a pair attaining the infimum of the expected surrogate loss, $\mathbb{E}_{(x, y) \sim D}[L_{\mathrm{MH}}(h_M, r_M, x, y)] = \inf_{(h, r)\ \mathrm{meas.}} \mathbb{E}_{(x, y) \sim D}[L_{\mathrm{MH}}(h, r, x, y)]$. Then, for $\beta = \frac{1}{1 - 2c}$ and $\alpha = 1$:

1. the surrogate loss $L_{\mathrm{MH}}$ is calibrated with respect to the Bayes classifier: $\operatorname{sign}(h^*) = \operatorname{sign}(h_M)$ and $\operatorname{sign}(r^*) = \operatorname{sign}(r_M)$;

2. furthermore, the following equality holds for the infima over pairs of measurable functions:
$$\inf_{(h, r)} \mathbb{E}_{(x, y) \sim D}[L_{\mathrm{MH}}(h, r, x, y)] = (3 - 2c) \inf_{(h, r)} \mathbb{E}_{(x, y) \sim D}[L(h, r, x, y)].$$

Excess risk bound. Here, we show upper bounds on the excess risk in terms of the surrogate loss excess risk. Let $R^*$ denote the Bayes rejection loss, that is $R^* = \inf_{(h, r)} \mathbb{E}_{(x, y) \sim D}[L(h, r, x, y)]$, where the infimum is taken over all measurable functions, and similarly let $R_L^*$ denote $\inf_{(h, r)} \mathbb{E}_{(x, y) \sim D}[L_{\mathrm{MH}}(h, r, x, y)]$.

Theorem 3. Let $R_L(h, r) = \mathbb{E}_{(x, y) \sim D}[L_{\mathrm{MH}}(h, r, x, y)]$ denote the expected surrogate loss of a pair $(h, r)$. Then, the excess error of $(h, r)$ is upper bounded by its surrogate excess error as follows:
$$R(h, r) - R^* \leq \frac{1}{(1 - c)(1 - 2c)}\big(R_L(h, r) - R_L^*\big).$$

3.4 Margin bounds

In this section, we give margin-based learning guarantees for the loss function $L_{\mathrm{MH}}$. Since $L_{\mathrm{PH}}$ is a simple upper bound on $L_{\mathrm{MH}}$, its margin-based learning bound can be derived similarly. In fact, the same technique can be used to derive margin-based guarantees for the subsequent convex surrogate loss functions we present. For any $\rho, \rho' > 0$, the margin loss associated to $L_{\mathrm{MH}}$ is given by
$$L^{\rho, \rho'}_{\mathrm{MH}}(h, r, x, y) = \max\Big(\max\Big(1 + \tfrac{\alpha}{2}\Big(\tfrac{r(x)}{\rho'} - \tfrac{yh(x)}{\rho}\Big),\, 0\Big),\; \max\Big(c\Big(1 - \beta \tfrac{r(x)}{\rho'}\Big),\, 0\Big)\Big).$$
The following theorem enables us to derive margin-based learning guarantees. Its proof requires dealing with this max-based surrogate loss, which is a non-standard derivation.

Theorem 4. Let $H$ and $R$ be families of functions mapping $X$ to $\mathbb{R}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$ from $D$, the following holds for all $(h, r) \in H \times R$:
$$R(h, r) \leq \mathbb{E}_{(x, y) \sim S}[L_{\mathrm{MH}}(h, r, x, y)] + \alpha\, \mathfrak{R}_m(H) + (2\beta c + \alpha)\, \mathfrak{R}_m(R) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

The following corollary is then a direct consequence of the theorem above.

Corollary 1. Let $H$ and $R$ be families of functions mapping $X$ to $\mathbb{R}$. Fix $\rho, \rho' > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of an i.i.d. sample $S$ of size $m$ from $D$, the following holds for all $(h, r) \in H \times R$:
$$R(h, r) \leq \mathbb{E}_{(x, y) \sim S}[L^{\rho, \rho'}_{\mathrm{MH}}(h, r, x, y)] + \frac{\alpha}{\rho}\, \mathfrak{R}_m(H) + \frac{2\beta c + \alpha}{\rho'}\, \mathfrak{R}_m(R) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

Then, via [6], the bound of Corollary 1 can be shown to hold uniformly for all $\rho, \rho' \in (0, 1)$, at the price of an additional term in $O\Big(\sqrt{\tfrac{\log \log \frac{1}{\rho}}{m}} + \sqrt{\tfrac{\log \log \frac{1}{\rho'}}{m}}\Big)$.
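
As a quick numerical sanity check on the surrogates of Section 3.2 (not part of the paper; a NumPy sketch), the following code evaluates $L_{\mathrm{MH}}$ and $L_{\mathrm{PH}}$ pointwise, with the calibrated choice $\alpha = 1$, $\beta = \frac{1}{1-2c}$ of Theorem 2 as the default, and checks on random scores that $L \leq L_{\mathrm{MH}} \leq L_{\mathrm{PH}}$:

```python
import numpy as np

def L_MH(h, r, y, c, alpha=1.0, beta=None):
    """Max Hinge surrogate: max(1 + alpha/2 (r - y h), c (1 - beta r), 0).
    By default beta = 1/(1 - 2c), the calibrated choice of Theorem 2."""
    beta = 1.0 / (1.0 - 2.0 * c) if beta is None else beta
    return np.maximum.reduce([1 + 0.5 * alpha * (r - y * h),
                              c * (1 - beta * r),
                              np.zeros_like(h)])

def L_PH(h, r, y, c, alpha=1.0, beta=None):
    """Plus Hinge surrogate: max(1 + alpha/2 (r - y h), 0) + max(c (1 - beta r), 0)."""
    beta = 1.0 / (1.0 - 2.0 * c) if beta is None else beta
    return (np.maximum(1 + 0.5 * alpha * (r - y * h), 0)
            + np.maximum(c * (1 - beta * r), 0))

def L(h, r, y, c):
    """Rejection loss of Eq. (1) with constant cost c."""
    return (y * h <= 0) * (r > 0) + c * (r <= 0)

# Numerical check of L <= L_MH <= L_PH on random scores.
rng = np.random.default_rng(0)
h = rng.normal(size=1000)
r = rng.normal(size=1000)
y = rng.choice([-1.0, 1.0], size=1000)
c = 0.3
assert np.all(L(h, r, y, c) <= L_MH(h, r, y, c) + 1e-12)
assert np.all(L_MH(h, r, y, c) <= L_PH(h, r, y, c) + 1e-12)
```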

4 Algorithms for kernel-based hypotheses

In this section, we devise new algorithms for learning with a rejection option when $H$ and $R$ are kernel-based hypothesis sets. We use Corollary 1 to guide the optimization problems for our algorithms. Let $H$ and $R$ be hypothesis sets defined in terms of PSD kernels $K$ and $K'$ over $X$:
$$H = \{x \mapsto \mathbf{w} \cdot \Phi(x) \colon \|\mathbf{w}\| \leq \Lambda\} \quad \text{and} \quad R = \{x \mapsto \mathbf{u} \cdot \Phi'(x) \colon \|\mathbf{u}\| \leq \Lambda'\},$$
where $\Phi$ is the feature mapping associated to $K$, $\Phi'$ the feature mapping associated to $K'$, and where $\Lambda$ and $\Lambda'$ are hyperparameters. One key advantage of this formulation is that different kernels can be used to define $H$ and $R$, thereby providing greater flexibility for the learning algorithm. In particular, when using a second-degree polynomial for the feature vector $\Phi'$, the rejection function corresponds to abstaining on an ellipsoidal region, which covers confidence-based rejection. For example, the Bartlett and Wegkamp [1] solution consists of choosing $\Phi'(x) = \Phi(x)$, $\mathbf{u} = \mathbf{w}$, and the rejection function $r(x) = |h(x)| - \gamma$.

Corollary 2. Let $H$ and $R$ be the hypothesis spaces defined above. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S$ of size $m$ from $D$, the following holds for all $(h, r) \in H \times R$:
$$R(h, r) \leq \mathbb{E}_{(x, y) \sim S}[L^{\rho, \rho'}_{\mathrm{MH}}(h, r, x, y)] + \alpha \sqrt{\frac{(\kappa \Lambda / \rho)^2}{m}} + (2\beta c + \alpha)\sqrt{\frac{(\kappa' \Lambda' / \rho')^2}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
where $\kappa^2 = \sup_{x \in X} K(x, x)$ and $\kappa'^2 = \sup_{x \in X} K'(x, x)$.

This learning bound directly guides the definition of our first algorithm, based on $L_{\mathrm{MH}}$ (see the full version [3] for details), resulting in the following optimization problem:
$$\min_{\mathbf{w}, \mathbf{u}, \boldsymbol{\xi}} \; \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{\lambda'}{2}\|\mathbf{u}\|^2 + \sum_{i=1}^{m} \xi_i$$
subject to $\xi_i \geq c\big(1 - \beta(\mathbf{u} \cdot \Phi'(x_i) + b')\big)$, $\xi_i \geq 1 + \frac{\alpha}{2}\big(\mathbf{u} \cdot \Phi'(x_i) + b' - y_i(\mathbf{w} \cdot \Phi(x_i) + b)\big)$, and $\xi_i \geq 0$ for all $i$, where $\lambda$ and $\lambda'$ are parameters and $b$ and $b'$ are explicit offsets for the linear functions $h$ and $r$. Similarly, we use the learning bound to derive a second algorithm based on the loss $L_{\mathrm{PH}}$ (see the full paper [3]). We have implemented and tested the dual of both algorithms, which we will refer to as CHR algorithms (short for convex algorithms using $H$ and $R$ families). Both the primal and dual optimization problems are standard QP problems whose solution can be readily found via both general-purpose and specialized QP solvers. The flexibility of the kernel choice and the QP formulation for both primal and dual are key advantages of the CHR algorithms. In Section 5, we report experimental results with these algorithms as well as the details of our implementation.
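
The following is a minimal CVXPY sketch of the primal optimization above for explicit, finite-dimensional feature maps $\Phi$ and $\Phi'$ (the paper's experiments use CVX in Matlab and solve the dual; the function name `chr_primal` and its defaults are illustrative assumptions, not the authors' code):

```python
import numpy as np
import cvxpy as cp

def chr_primal(Phi, Phi_p, y, c=0.3, alpha=1.0, beta=None, lam=1.0, lam_p=1.0):
    """Sketch of the primal QP above for explicit feature maps.

    Phi:   (m, d)  feature vectors Phi(x_i) for the predictor h
    Phi_p: (m, d') feature vectors Phi'(x_i) for the rejection function r
    y:     (m,)    labels in {-1, +1}
    Returns (w, b, u, b') defining h(x) = w.Phi(x) + b and r(x) = u.Phi'(x) + b'.
    """
    beta = 1.0 / (1.0 - 2.0 * c) if beta is None else beta
    m, d = Phi.shape
    d_p = Phi_p.shape[1]
    w, u = cp.Variable(d), cp.Variable(d_p)
    b, b_p = cp.Variable(), cp.Variable()
    xi = cp.Variable(m)

    h_vals = Phi @ w + b                      # h(x_i)
    r_vals = Phi_p @ u + b_p                  # r(x_i)
    constraints = [
        xi >= c * (1 - beta * r_vals),                                   # xi_i >= c(1 - beta r(x_i))
        xi >= 1 + (alpha / 2) * (r_vals - cp.multiply(y, h_vals)),       # hinge-type constraint
        xi >= 0,
    ]
    objective = cp.Minimize(lam / 2 * cp.sum_squares(w)
                            + lam_p / 2 * cp.sum_squares(u)
                            + cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, u.value, b_p.value
```

In practice one would pass explicit polynomial features or a kernel approximation (e.g., random features) for $\Phi$ and $\Phi'$, or derive and solve the dual QP to work with the kernels $K$ and $K'$ directly, as the paper does.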

Fig. 3. Average rejection loss on the test set as a function of the cost c for the DHL algorithm and the CHR algorithm for six datasets and polynomial kernels. The blue line is the DHL algorithm while the red line is the CHR algorithm based on $L_{\mathrm{MH}}$. The figures on the top, starting from the left, are for the cod, skin, and haberman data sets, while the figures on the bottom are for the banknote, australian, and pima data sets. These figures show that the CHR algorithm outperforms the DHL algorithm for most values of the cost c across all data sets.

5 Experiments

In this section, we present the results of several experiments comparing our CHR algorithms with the DHL algorithm. All algorithms were implemented using CVX [4]. We tested the algorithms on seven data sets from the UCI data repository, specifically australian, cod, skin, liver, banknote, haberman, and pima. For each data set, we performed standard 5-fold cross-validation. We randomly divided the data into training, validation, and test sets in the ratio 3:1:1. We then repeated the experiments five times, where each time we used a different random partition. The cost values ranged over $c \in \{0.05, 0.1, \ldots, 0.5\}$ and the kernels for both algorithms were polynomial kernels of degree $d \in \{1, 2, 3\}$ and Gaussian kernels with widths in the set $\{1, 10, 100\}$. The regularization parameters $\lambda, \lambda'$ for the CHR algorithms varied over $\lambda, \lambda' \in \{10^i \colon i = -5, \ldots, 5\}$ and the threshold $\gamma$ for DHL ranged over $\gamma \in \{0.08, 0.16, \ldots, 0.96\}$. For each fixed value of $c$, we chose the parameters with the smallest average rejection loss on the validation set. For these parameter values, Figure 3 shows the corresponding rejection loss on the test set for the CHR algorithm based on $L_{\mathrm{MH}}$ and the DHL algorithm as a function of the cost for six of our data sets. These plots demonstrate that the difference in accuracy between the two algorithms holds consistently for almost all values of $c$ across all the data sets. Thus, the CHR algorithm yields an improvement in the rejection loss over the DHL algorithm.
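
For illustration, here is a hedged sketch of the model-selection protocol described above (random 3:1:1 train/validation/test split, grid search over the regularization parameters for a fixed cost $c$); it reuses the hypothetical `chr_primal` and `empirical_rejection_loss` sketches given earlier and assumes a user-supplied `featurize` helper returning feature matrices for $h$ and $r$:

```python
import numpy as np
from itertools import product

def select_for_cost(X, y, c, featurize,
                    lambdas=tuple(10.0 ** i for i in range(-5, 6)), seed=0):
    """Pick (lambda, lambda') by validation rejection loss for a fixed cost c,
    then report the test rejection loss, mirroring the protocol described above."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_tr, n_val = 3 * len(y) // 5, len(y) // 5            # 3:1:1 split
    tr, val, te = idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]
    Phi, Phi_p = featurize(X)                             # features for h and for r
    best_loss, best_model = np.inf, None
    for lam, lam_p in product(lambdas, lambdas):
        w, b, u, b_p = chr_primal(Phi[tr], Phi_p[tr], y[tr], c=c, lam=lam, lam_p=lam_p)
        h_val, r_val = Phi[val] @ w + b, Phi_p[val] @ u + b_p
        loss = empirical_rejection_loss(h_val, r_val, y[val], c=c)
        if loss < best_loss:
            best_loss, best_model = loss, (w, b, u, b_p)
    w, b, u, b_p = best_model
    return empirical_rejection_loss(Phi[te] @ w + b, Phi_p[te] @ u + b_p, y[te], c=c)
```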

References

[1] P. Bartlett and M. Wegkamp. Classification with a reject option using a hinge loss. JMLR, 2008.
[2] K. Chaudhuri and C. Zhang. Beyond disagreement-based agnostic active learning. In NIPS, 2014.
[3] C. Cortes, G. DeSalvo, and M. Mohri. Learning with rejection. In ALT, 2016.
[4] CVX Research, Inc. CVX: Matlab software for disciplined convex programming, version 2.0, Aug. 2012.
[5] Y. Grandvalet, J. Keshet, A. Rakotomamonjy, and S. Canu. Support vector machines with a reject option. In NIPS, 2008.
[6] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.
[7] M. Yuan and M. Wegkamp. Classification methods with reject option based on convex risk minimization. JMLR, 2010.
[8] M. Yuan and M. Wegkamp. SVMs with a reject option. Bernoulli, 2011.