Classification Methods with Reject Option Based on Convex Risk Minimization


Journal of Machine Learning Research 11 (2010) Submitted 2/09; Revised 11/09; Published 1/10

Classification Methods with Reject Option Based on Convex Risk Minimization

Ming Yuan
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA

Marten Wegkamp
Department of Statistics, Florida State University, Tallahassee, FL 32306, USA

Editor: Bin Yu

Abstract

In this paper, we investigate the problem of binary classification with a reject option, in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex, so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions.

Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option

1. Introduction

In binary classification, one observes independent realizations (X_1, Y_1), ..., (X_n, Y_n) of the random pair (X, Y), where X ∈ 𝒳 and Y ∈ 𝒴 = {−1, 1}. The goal is to learn from these training data a classification rule g: 𝒳 → 𝒴 that classifies an observation X into one of the two classes. It is recognized that in many applications the consequences of misclassification can be substantial. In such situations, a less specific response that reserves the right of not making a decision, sometimes referred to as a reject option (see, e.g., Herbei and Wegkamp, 2006), may be preferable to risking misclassification. This, for example, is typical in medical studies where screening for a certain disease can be done based on relatively inexpensive clinical measures. If the classification based on these measurements is satisfactory, nothing further needs to be done. But in the event that there are ambiguities, it would be more desirable to take the rejection option and seek more expensive studies to identify a subject's disease status. Similar approaches are often adopted in DNA sequencing or genotyping applications, where the rejection option is commonly referred to as a "no-call". Similar problems have attracted much attention in various application fields and have also received an increasing amount of interest more recently in the machine learning literature. See Ripley (1996) and Bartlett and Wegkamp (2008) and references therein.

© 2010 Ming Yuan and Marten Wegkamp.

To accommodate the reject option, we now seek a classification rule g: 𝒳 → 𝒴̃, where 𝒴̃ = {−1, 1, 0} is an augmented response space and g(X) = 0 indicates that no definitive classification will be made for X, that is, the reject option is taken. To measure the performance of a classification rule, we employ the following loss function, which generalizes the usual 0-1 loss to account for the reject option:

    ℓ[g(X), Y] = 1 if g(X) ≠ Y and g(X) ≠ 0;
                 d if g(X) = 0;
                 0 if g(X) = Y.

In other words, an ambiguous response (g(X) = 0) incurs a loss of d, whereas misclassification incurs a loss of 1. Note that d is necessarily smaller than 1/2: otherwise, rather than taking the rejection option with a loss of d, we could always flip a fair coin to randomly assign ±1 as the value of g(X), which incurs an average loss of 1/2. For this reason, we shall assume that d < 1/2 in what follows. For any classification rule g: 𝒳 → 𝒴̃, the risk function is then given by R(g) = E(ℓ[g(X), Y]), where the expectation is taken over the joint distribution of X and Y. It is not hard to show that the optimal classification rule g* := argmin R(g) is given by (see, e.g., Bartlett and Wegkamp, 2008)

    g*(X) = 1 if η(X) > 1 − d;
            0 if d ≤ η(X) ≤ 1 − d;
            −1 if η(X) < d,

where η(X) = P(Y = 1 | X). The corresponding risk is

    R* := inf R(g) = R(g*) = E(min{η(X), 1 − η(X), d}).

Thus, the performance of any classification rule g: 𝒳 → 𝒴̃ can be measured by the excess risk ΔR(g) := R(g) − R*. Appealing to the general empirical risk minimization strategy, one could attempt to derive a classification rule from the training data by minimizing the empirical risk

    R_n(g) = (1/n) Σ_{i=1}^n ℓ[g(X_i), Y_i].

Similar to the usual 0-1 loss, however, ℓ is not convex in g, and direct minimization of R_n is typically an NP-hard problem. A common remedy is to consider a surrogate convex loss function.
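Before turning to surrogates, the quantities defined so far are simple to compute. Here is a minimal NumPy sketch (ours, not the paper's; all names are hypothetical) of the loss ℓ and the plug-in rule g*, together with a Monte Carlo check of the identity R* = E(min{η(X), 1 − η(X), d}):

    import numpy as np

    def reject_loss(g, y, d):
        # loss l[g, y]: 0 if correct, d if rejected (g == 0), 1 if misclassified
        g, y = np.asarray(g), np.asarray(y)
        return np.where(g == 0, d, np.where(g == y, 0.0, 1.0))

    def bayes_rule(eta, d):
        # g*(x): predict 1 if eta > 1 - d, reject if d <= eta <= 1 - d, else -1
        eta = np.asarray(eta)
        return np.where(eta > 1 - d, 1, np.where(eta < d, -1, 0))

    # Monte Carlo check that R(g*) matches E[min(eta, 1 - eta, d)]
    rng = np.random.default_rng(0)
    d = 0.2
    eta = rng.uniform(size=100_000)                 # eta(X) for random draws of X
    y = np.where(rng.uniform(size=eta.size) < eta, 1, -1)
    g = bayes_rule(eta, d)
    print(reject_loss(g, y, d).mean())              # both print ~0.16 for uniform eta
    print(np.minimum(np.minimum(eta, 1 - eta), d).mean())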

To this end, let φ: ℝ → ℝ be a convex function. Denote by Q(f) = E[φ(Y f(X))] the corresponding risk for a discriminant function f: 𝒳 → ℝ. Let f̂_n be the minimizer of

    Q_n(f) = (1/n) Σ_{i=1}^n φ(Y_i f(X_i))

over a certain functional space 𝓕 consisting of functions that map from 𝒳 to ℝ. f̂_n can be conveniently converted to a classification rule C(f̂_n; δ) as follows:

    C(f(X); δ) = 1 if f(X) > δ;
                 0 if |f(X)| ≤ δ;
                 −1 if f(X) < −δ,

where δ > 0 is a parameter that, as we shall see, plays a critical role in determining the performance of C(f̂_n, δ).
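The truncation rule C(f; δ) itself is a one-liner; a companion sketch in the same vein (the name truncate is ours):

    import numpy as np

    def truncate(f, delta):
        # C(f; delta): predict sign(f) when |f| > delta, otherwise reject (0)
        f = np.asarray(f)
        return np.where(f > delta, 1, np.where(f < -delta, -1, 0))

    print(truncate([-1.2, 0.3, 0.9], delta=0.5))    # [-1  0  1]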

In this paper, we investigate the statistical properties of this general convex risk minimization technique. To what extent C(f̂_n, δ) mimics the optimal classification rule g* plays a critical role in the success of this technique. Let f*_φ be the minimizer of Q(f). We shall assume throughout the paper that f*_φ is uniquely defined. Typically f*_φ reflects the limiting behavior of f̂_n when 𝓕 is rich enough and there are infinitely many training data. Therefore the first question is whether or not f*_φ can be used to recover the optimal rule g*. A surrogate loss function φ that satisfies this property is often called infinite sample consistent (see, e.g., Zhang, 2004) or classification calibrated (see, e.g., Bartlett, Jordan and McAuliffe, 2006). A second question concerns the relationship between the excess risk ΔR[C(f, δ)] and the excess φ-risk ΔQ(f) = Q(f) − inf Q(f): can we find an increasing function ρ: ℝ → ℝ such that, for all f,

    ΔR[C(f, δ)] ≤ ρ(ΔQ(f))?    (1)

Clearly the infinite sample consistency of φ implies that ρ(0) = 0. Such a bound on the excess risk provides a useful tool for bounding the excess risk of f̂_n. In particular, (1) indicates that

    ΔR[C(f̂_n, δ)] ≤ ρ(ΔQ(f̂_n)) = ρ[ΔQ(f̄) + (ΔQ(f̂_n) − ΔQ(f̄))],

where f̄ = argmin_{f∈𝓕} Q(f). The first term ΔQ(f̄) on the right-hand side exhibits the approximation error of the functional class 𝓕, whereas the second term ΔQ(f̂_n) − ΔQ(f̄) is the estimation error. In the case when there is no reject option, these problems have been well investigated in recent years (Lin, 2002; Zhang, 2004; Bartlett, Jordan and McAuliffe, 2006). In this paper, we establish similar results when there is a reject option. The most significant difference between the two situations, with or without the reject option, is the role of δ. As we shall see, for some loss functions, such as the least squares, exponential or logistic losses, a good choice of δ yields classifiers that are infinite sample consistent. For other loss functions, however, such as the hinge loss, no matter how δ is chosen, the classification rule C(f, δ) cannot be infinite sample consistent.

The remainder of the paper is organized as follows. We first examine in Section 2 the infinite sample consistency for classification with a reject option. After establishing a general result, we consider its implication for several commonly used loss functions. In Section 3, we establish bounds on the excess risk in the form of (1), followed by applications to the popular loss functions. We also show that under an additional assumption on the behavior of η(X) near d and 1 − d as in Herbei and Wegkamp (2006), generalizing the condition in the case of d = 1/2 of Mammen and Tsybakov (1999) and Tsybakov (2004), the bound (1) can be tightened considerably. Section 4 discusses rates of convergence of the empirical risk minimizer f̂_n that minimizes the empirical risk Q_n(f) over a bounded class 𝓕. Section 5 considers an extension to asymmetric loss, where one type of misclassification may be more costly than the other. All proofs are relegated to Section 6.

2. Infinite Sample Consistency

We first give a general result on the infinite sample consistency of the classification rule C(f*_φ, δ).

Theorem 1 Assume that φ is convex. Then the classification rule C(f*_φ, δ) for some δ > 0 is infinite sample consistent, that is, C(f*_φ, δ) = g*, if and only if φ′(δ) and φ′(−δ) both exist, φ′(δ) < 0, and

    φ′(δ) / (φ′(−δ) + φ′(δ)) = d.    (2)

When there is no reject option, it is known that the necessary and sufficient condition for infinite sample consistency is that φ is differentiable at 0 and φ′(0) < 0 (see, e.g., Bartlett, Jordan and McAuliffe, 2006). As indicated by Theorem 1, the differentiability of φ at ±δ plays a more prominent role in the general case when there is a reject option. From Theorem 1 it is also evident that the infinite sample consistency depends on both φ and the choice of the thresholding parameter δ. Observe that for any δ_1 < δ_2,

    φ′(−δ_2) ≤ φ′(−δ_1) ≤ φ′(δ_1) ≤ φ′(δ_2),

which implies that the left-hand side of (2) is a decreasing function of δ. If φ is strictly convex, then it is strictly decreasing, and therefore there is at most one value of δ that satisfies (2). In other words, for strictly convex φ, there is at most one thresholding parameter δ such that C(f*_φ, δ) = g*. On the other hand, if φ is twice differentiable with φ′(0) < 0 and φ′(z) → 0 as z → +∞, then for any d < 1/2 there always exists a δ > 0 such that (2) holds. This is because the left-hand side of (2) is a decreasing function of δ, which approaches its supremum 1/2 as δ → 0 and tends to 0 as δ increases. Moreover, the twice differentiability of φ ensures that the left-hand side of (2) is also a continuous function of δ. The following is therefore a direct consequence of Theorem 1:

Corollary 2 If φ is strictly convex, then either there is a unique δ > 0 such that C(f*_φ, δ) is infinite sample consistent, or C(f*_φ, δ) is not infinite sample consistent for any δ > 0. In addition to convexity, if φ is twice differentiable with φ′(0) < 0 and φ′(z) → 0 as z → +∞, then there always exists a δ > 0 such that C(f*_φ, δ) is infinite sample consistent.

Theorem 1 provides a general guideline on how to choose δ for common choices of convex losses. Below we look at several concrete examples.

2.1 Least Squares Loss

We first examine the least squares loss φ(z) = (1 − z)². Observe that

    φ′(δ) / (φ′(−δ) + φ′(δ)) = (1 − δ)/2.

All conditions of Theorem 1 are met if and only if δ = 1 − 2d.

Corollary 3 For the least squares loss, C(f*_φ, 1 − 2d) = g*.
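Condition (2) also lends itself to numerical solution: its left-hand side decreases in δ, so bisection finds the calibrated threshold whenever one exists (cf. Corollary 2). A sketch (ours; solve_delta is a hypothetical helper), checked against the closed form δ = 1 − 2d for least squares:

    def solve_delta(phi_prime, d, hi=50.0, tol=1e-10):
        # Bisection for phi'(delta) / (phi'(-delta) + phi'(delta)) = d.
        # The left-hand side decreases from its supremum 1/2 as delta grows,
        # so for 0 < d < 1/2 a root exists for smooth losses (Corollary 2).
        G = lambda t: phi_prime(t) / (phi_prime(-t) + phi_prime(t))
        lo = 0.0
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if G(mid) > d else (lo, mid)
        return 0.5 * (lo + hi)

    ls_prime = lambda z: -2.0 * (1.0 - z)              # least squares phi(z) = (1 - z)^2
    print(solve_delta(ls_prime, d=0.2), 1 - 2 * 0.2)   # both ~0.6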

2.2 Exponential Loss

The exponential loss, φ(z) = exp(−z), is connected with boosting (Friedman, Hastie and Tibshirani, 2000). Because

    φ′(δ) / (φ′(−δ) + φ′(δ)) = 1 / (1 + exp(2δ)),

all conditions of Theorem 1 are met if and only if δ = (1/2) log((1 − d)/d).

Corollary 4 For the exponential loss, C(f*_φ, (1/2) log((1 − d)/d)) = g*.

2.3 Logistic Loss

Logistic regression employs the loss φ(z) = ln(1 + exp(−z)). Similar to before,

    φ′(δ) / (φ′(−δ) + φ′(δ)) = 1 / (1 + exp(δ)),

which shows that all conditions of Theorem 1 are met if δ = log((1 − d)/d).

Corollary 5 For the logistic loss, C(f*_φ, log((1 − d)/d)) = g*.

2.4 Square Hinge Loss

The square hinge loss, φ(z) = [(1 − z)_+]², is another popular choice, for which

    φ′(δ) / (φ′(−δ) + φ′(δ)) = (1 − δ)/2

for 0 < δ < 1. Similar to the least squares loss, we have the following corollary.

Corollary 6 For the square hinge loss, C(f*_φ, 1 − 2d) = g*.
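The closed-form thresholds of Corollaries 3-6 are immediate to evaluate; for instance, at d = 0.2 (a quick check, ours, which also agrees with the generic bisection sketch above):

    import math

    d = 0.2
    print(1 - 2 * d)                       # least squares and square hinge: 0.6
    print(0.5 * math.log((1 - d) / d))     # exponential: ~0.693
    print(math.log((1 - d) / d))           # logistic: ~1.386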

2.5 Distance Weighted Discrimination

Marron, Todd and Ahn (2007) recently introduced the so-called distance weighted discrimination method, where the following loss function (see, e.g., Bartlett, Jordan and McAuliffe, 2006) is used:

    φ(z) = 1/z if z ≥ γ;  (1/γ)(2 − z/γ) if z < γ,    (3)

where γ > 0 is a constant. It is not hard to see that φ is convex. Moreover,

    φ′(z) = −1/z² if z ≥ γ;  −1/γ² if z < γ.

Thus,

    φ′(δ) / (φ′(−δ) + φ′(δ)) = 1/2 if δ < γ;  (1/δ²) / (1/δ² + 1/γ²) if δ > γ.

In other words, we have the following result for the distance weighted discrimination loss.

Corollary 7 For the loss (3), C(f*_φ, [(1 − d)/d]^{1/2} γ) = g*.

2.6 Hinge Loss

The popular support vector machine employs the hinge loss, φ(z) = (1 − z)_+. The hinge loss is differentiable everywhere except at 1. Therefore

    φ′(δ) / (φ′(−δ) + φ′(δ)) = 1/2 if 0 < δ < 1;  0 if δ > 1.

Because 0 < d < 1/2, there does not exist a δ such that all conditions of Theorem 1 are met. As a matter of fact, for any δ > 0, C(f*_φ, δ) ≠ g*. Motivated by this observation, Bartlett and Wegkamp (2008) introduced the following modification of the hinge loss:

    φ(z) = 1 − az if z ≤ 0;  1 − z if 0 < z ≤ 1;  0 if z > 1,    (4)

where a > 1. Note that with this modification,

    φ′(δ) / (φ′(−δ) + φ′(δ)) = 1/(a + 1) if 0 < δ < 1;  0 if δ > 1.

Therefore, we have the following corollary.

Corollary 8 For the modified hinge loss (4) and any δ < 1, if a = (1 − d)/d, then C(f*_φ, δ) = g*.

It is interesting to note that for the examples considered previously, a specific choice of δ is needed to ensure infinite sample consistency, whereas for the modified hinge loss a whole range of choices of δ serves the same purpose. However, as we shall see in the next section, different choices of δ for the modified hinge loss may result in slightly different bounds on the excess risk, with δ = 1/2 appearing to be preferable in that it yields the smallest upper bound on the excess risk.
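Corollary 8 is easy to probe numerically: with a = (1 − d)/d, the minimizer of Q_η(z) = ηφ(z) + (1 − η)φ(−z) over a grid lands at −1, 0 or 1 according to whether η < d, d < η < 1 − d or η > 1 − d, so any threshold 0 < δ < 1 recovers g*. A sketch (ours):

    import numpy as np

    def mod_hinge(z, a):
        # modified hinge (4): 1 - a z (z <= 0), 1 - z (0 < z <= 1), 0 (z > 1)
        return np.where(z <= 0, 1 - a * z, np.where(z <= 1, 1 - z, 0.0))

    d = 0.2
    a = (1 - d) / d
    z = np.linspace(-2, 2, 4001)
    for eta in [0.1, 0.5, 0.9]:            # below d, inside the reject band, above 1 - d
        q = eta * mod_hinge(z, a) + (1 - eta) * mod_hinge(-z, a)
        print(eta, z[np.argmin(q)])        # minimizer ~ -1, 0, 1 respectively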

3. Excess Risk

We now turn to the excess risk ΔR[C(f, δ)] and show how it can be bounded through the excess φ-risk ΔQ(f) := Q(f) − Q(f*_φ). Recall that the infinite sample consistency established in the previous section means that ΔQ(f) = 0 implies ΔR(C(f, δ)) = 0. For brevity, we shall assume implicitly throughout this section that δ is chosen in accordance with Theorem 1 to ensure infinite sample consistency. Write

    Q_{η(X)}(z) = η(X)φ(z) + (1 − η(X))φ(−z).

By definition,

    Q_{η(X)}(f*_φ(X)) = inf_z Q_{η(X)}(z).

Denote

    ΔQ_η(f) = Q_η(f) − Q_η(f*_φ),

where we suppress the dependence of η, f and f*_φ on X for brevity.

Theorem 9 Assume that φ is convex, φ′(δ) and φ′(−δ) both exist, φ′(δ) < 0, and (2) holds. In addition, suppose that there exist constants C > 0 and s ≥ 1 such that

    |d − η|^s ≤ C^s ΔQ_η(−δ);    |(1 − d) − η|^s ≤ C^s ΔQ_η(δ).

Then

    ΔR[C(f, δ)] ≤ 2C [ΔQ(f)]^{1/s}.    (5)

It is immediate from Theorem 9 that ΔQ(f̂_n) → 0 in probability implies ΔR(f̂_n) → 0 in probability. In other words, consistency in terms of the φ-risk implies consistency in terms of the loss ℓ. It is worth noting that the constant in the upper bound can be tightened under stronger conditions.

Theorem 10 In addition to the assumptions of Theorem 9, assume that

    (2η − 1)_+^s ≤ C^s ΔQ_η(−δ);    (1 − 2η)_+^s ≤ C^s ΔQ_η(δ).

Then ΔR[C(f, δ)] ≤ C [ΔQ(f)]^{1/s}.

We can improve the bounds even further by the following margin condition. Assume that for some α ≥ 0 and A ≥ 1,

    P{|η(X) − z| ≤ t} ≤ A t^α    (6)

for all 0 ≤ t < d, at z = d and z = 1 − d. This assumption was introduced in Herbei and Wegkamp (2006) and generalizes the margin condition of Mammen and Tsybakov (1999) and Tsybakov (2004). It is always met for α = 0 and A = 1. The other extreme is α → +∞, the case where η(X) stays away from d and 1 − d with probability one.
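For intuition, the hypothesis of Theorem 9 can be checked directly in the least squares case worked out in Section 3.1 below, where ΔQ_η(−δ) = 4(η − d)² and ΔQ_η(δ) = 4(1 − d − η)² make the choice s = 2, C = 1/2 exact. A numerical sketch (ours):

    import numpy as np

    d, delta, s, C = 0.2, 0.6, 2, 0.5          # delta = 1 - 2d for least squares
    phi = lambda z: (1 - z) ** 2
    Q = lambda eta, z: eta * phi(z) + (1 - eta) * phi(-z)
    dQ = lambda eta, z: Q(eta, z) - Q(eta, 2 * eta - 1)   # excess over f*_phi = 2 eta - 1

    eta = np.linspace(0.001, 0.999, 999)
    assert np.all(np.abs(d - eta) ** s <= C ** s * dQ(eta, -delta) + 1e-12)
    assert np.all(np.abs(1 - d - eta) ** s <= C ** s * dQ(eta, delta) + 1e-12)
    print("Theorem 9 hypothesis holds with s = 2, C = 1/2")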

Theorem 11 In addition to the assumptions of Theorem 9, assume that (6) holds for some α ≥ 0 and A ≥ 1. Then, for some K depending on A and α,

    ΔR[C(f, δ)] ≤ K [ΔQ(f)]^{1/(s + β − βs)},    (7)

where β = α/(1 + α).

In case α = 0, the exponent 1/(s + β − βs) on the right-hand side of (7) is 1/s, and the situation is as in Theorem 9. For α → +∞, the bound (7) improves upon the one in Theorem 9, as the exponent 1/(s + β − βs) converges to 1. We now examine the consequences of Theorems 9, 10 and 11 for several common loss functions.

3.1 Least Squares

Note that for the least squares loss,

    ΔQ_η(f) = (2η − 1 − f)².

Simple algebraic manipulations show that

    ΔQ_η(−δ) = 4(η − d)²;    ΔQ_η(δ) = 4(1 − d − η)².

Therefore, by Theorems 9 and 11:

Corollary 12 For the least squares loss,

    ΔR[C(f, 1 − 2d)] ≤ [ΔQ(f)]^{1/2}.

Furthermore, if the margin condition (6) holds, then

    ΔR[C(f, 1 − 2d)] ≤ K [ΔQ(f)]^{(1+α)/(2+α)}

for some constant K > 0.

3.2 Exponential Loss

An application of a Taylor expansion yields (see, e.g., Zhang, 2004)

    ΔQ_η(f) ≥ 2(η − 1/(1 + exp(−2f)))².

Therefore,

    ΔQ_η(−δ) ≥ 2(η − d)²;    ΔQ_η(δ) ≥ 2(1 − d − η)².
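The displayed lower bound is easy to verify on a grid (a numerical sanity check, ours, not a proof):

    import numpy as np

    eta = np.linspace(0.01, 0.99, 99)[:, None]
    f = np.linspace(-5, 5, 201)[None, :]
    Q = eta * np.exp(-f) + (1 - eta) * np.exp(f)      # exponential-loss Q_eta(f)
    Qmin = 2 * np.sqrt(eta * (1 - eta))               # attained at f = (1/2) log(eta/(1-eta))
    lower = 2 * (eta - 1 / (1 + np.exp(-2 * f))) ** 2
    print(np.all(Q - Qmin >= lower - 1e-12))          # True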

Therefore, by Theorems 9 and 11:

Corollary 13 For the exponential loss,

    ΔR[C(f, (1/2) log((1 − d)/d))] ≤ √2 [ΔQ(f)]^{1/2}.

Furthermore, if the margin condition (6) holds, then

    ΔR[C(f, (1/2) log((1 − d)/d))] ≤ K [ΔQ(f)]^{(1+α)/(2+α)}

for some constant K > 0.

3.3 Logistic Loss

Similar to the exponential loss, an application of a Taylor expansion yields

    ΔQ_η(f) ≥ 2(η − 1/(1 + exp(−f)))².

Therefore,

    ΔQ_η(−δ) ≥ 2(η − d)²;    ΔQ_η(δ) ≥ 2(1 − d − η)².

Therefore, by Theorems 9 and 11:

Corollary 14 For the logistic loss,

    ΔR[C(f, log((1 − d)/d))] ≤ √2 [ΔQ(f)]^{1/2}.

Furthermore, if the margin condition (6) holds, then

    ΔR[C(f, log((1 − d)/d))] ≤ K [ΔQ(f)]^{(1+α)/(2+α)}

for some constant K > 0.

3.4 Square Hinge Loss

Simple algebraic derivation shows

    ΔQ_η(f) = (2η − 1 − f)² − η[(f − 1)_+]² − (1 − η)[(−f − 1)_+]².

Therefore,

    ΔQ_η(−δ) = 4(η − d)²;    ΔQ_η(δ) = 4(1 − d − η)².

By Theorems 9 and 11:

Corollary 15 For the square hinge loss,

    ΔR[C(f, 1 − 2d)] ≤ [ΔQ(f)]^{1/2}.

Furthermore, if the margin condition (6) holds, then

    ΔR[C(f, 1 − 2d)] ≤ K [ΔQ(f)]^{(1+α)/(2+α)}

for some constant K > 0.

3.5 Distance Weighted Discrimination

Observe that

    Q_η(z) = η/z + (1 − η)(2γ + z)/γ²      if z ≥ γ;
             (2γ + (1 − 2η)z)/γ²           if |z| < γ;
             η(2γ − z)/γ² + (1 − η)/(−z)   if z ≤ −γ.

Hence

    inf_z Q_η(z) = (2/γ)( [η(1 − η)]^{1/2} + min{η, 1 − η} )

and

    f*_φ = (η/(1 − η))^{1/2} γ if η > 1/2;  any value in [−γ, γ] if η = 1/2;  −((1 − η)/η)^{1/2} γ if η < 1/2.

Recall that δ = ((1 − d)/d)^{1/2} γ ≥ γ. Then, since min{η, 1 − η} ≤ 1 − η,

    ΔQ_η(δ) ≥ η/δ + (1 − η)δ/γ² − 2[η(1 − η)]^{1/2}/γ
            = ( (η/δ)^{1/2} − ((1 − η)δ)^{1/2}/γ )²
            = (δ/γ²) ( (ηγ²/δ²)^{1/2} − (1 − η)^{1/2} )².

Observe that γ²/δ² = d/(1 − d). Thus,

    ΔQ_η(δ) ≥ (δ/(γ²(1 − d))) ( (ηd)^{1/2} − ((1 − η)(1 − d))^{1/2} )²
            ≥ (δ/(γ²(1 − d))) ( ηd − (1 − η)(1 − d) )²
            = (1 − d − η)² / (γ(1 − d)^{1/2} d^{1/2}),

where we used (ηd)^{1/2} + ((1 − η)(1 − d))^{1/2} ≤ 1 (by the Cauchy-Schwarz inequality) and ηd − (1 − η)(1 − d) = η − (1 − d). Thus,

    (1 − d − η)² ≤ γ(1 − d)^{1/2} d^{1/2} ΔQ_η(δ).

Similarly,

    (η − d)² ≤ γ(1 − d)^{1/2} d^{1/2} ΔQ_η(−δ).

From Theorems 9 and 11, we conclude:

Corollary 16 For the distance weighted discrimination loss,

    ΔR[C(f, ((1 − d)/d)^{1/2} γ)] ≤ 2γ^{1/2}(1 − d)^{1/4} d^{1/4} [ΔQ(f)]^{1/2}.

Furthermore, if the margin condition (6) holds, then

    ΔR[C(f, ((1 − d)/d)^{1/2} γ)] ≤ K [ΔQ(f)]^{(1+α)/(2+α)}

for some constant K > 0.

3.6 Hinge Loss with Rejection Option

As shown by Bartlett and Wegkamp (2008), for the modified hinge loss (4) with a = (1 − d)/d,

    argmin_z Q_η(z) = −1 if η < d;  0 if d < η < 1 − d;  1 if η > 1 − d.

Simple algebraic manipulations lead to

    ΔQ_η(−δ) = (1 − δ)(d − η)/d                 if η ≤ d;
               (η − d)δ/d                       if d < η < 1 − d;
               1 − (1 − η)/d + (η − d)δ/d       if η ≥ 1 − d,

and

    ΔQ_η(δ) = 1 − η/d + (1 − d − η)δ/d          if η ≤ d;
              (1 − d − η)δ/d                    if d < η < 1 − d;
              (δ − 1)(1 − d − η)/d              if η ≥ 1 − d.

Therefore,

    min{δ, 1 − δ} |η − d| ≤ d ΔQ_η(−δ);    min{δ, 1 − δ} |(1 − d) − η| ≤ d ΔQ_η(δ).

Furthermore,

    min{δ, 1 − δ} (2η − 1)_+ ≤ d ΔQ_η(−δ);    min{δ, 1 − δ} (1 − 2η)_+ ≤ d ΔQ_η(δ).

From Theorems 10 and 11, we conclude:

Corollary 17 For the modified hinge loss and any δ < 1,

    ΔR[C(f, δ)] ≤ d ΔQ(f) / min{δ, 1 − δ}.    (8)

Furthermore, if the margin condition (6) holds, then

    ΔR[C(f, δ)] ≤ K ΔQ(f)    (9)

for some constant K > 0.

Notice that the corollary also suggests that δ = 1/2 yields the best constant in the upper bound. A similar result has also been established recently by Bartlett and Wegkamp (2008). It is also interesting to see that (8) cannot be improved further by the generalized margin condition (6), as the bounds (8) and (9) only differ by a constant.

4. Rates of Convergence for Empirical Risk Minimizers

In this section we briefly review the possible rates of convergence for minimizers f̂_n of the empirical risk Q_n(f) = (1/n) Σ_{i=1}^n φ(Y_i f(X_i)) over a convex class of discriminant functions 𝓕, and show the implications of the excess risk bounds obtained in the previous section. The analysis of the generalized hinge loss is complicated and is treated in detail in Wegkamp (2007) and Bartlett and Wegkamp (2008). The other loss functions φ considered in this paper have in common that the modulus of convexity of Q,

    δ(ε) = inf{ [Q(f) + Q(g)]/2 − Q((f + g)/2) : E[(f − g)²(X)] ≥ ε² },

satisfies δ(ε) ≥ cε² for some c > 0, and that, for some L < ∞, |φ(x) − φ(x′)| ≤ L|x − x′| for all x, x′ ∈ ℝ. We have the following result, which imposes a restriction on the 1/n-covering number N_n = N(1/n, L_∞, 𝓕), the cardinality of the smallest set of closed balls with radius 1/n in L_∞ needed to cover 𝓕.

Theorem 18 Assume that ‖f‖_∞ ≤ B for all f ∈ 𝓕, and let 0 < γ < 1. With probability at least 1 − γ,

    ΔQ(f̂_n) ≤ inf_{f∈𝓕} ΔQ(f) + 3L/n + (8L²/c + 2LB/3) log(n N_n / γ)/n.

Together with the excess risk bounds from Theorems 9 and 11, we have:

Corollary 19 Under the assumptions of Theorems 9 and 18, we have, with probability at least 1 − γ,

    ΔR(C(f̂_n, δ)) ≤ 2C { inf_{f∈𝓕} ΔQ(f) + 3L/n + (8L²/c + 2LB/3) log(n N_n / γ)/n }^{1/s}.

Furthermore, if the generalized margin condition (6) holds, then with probability at least 1 − γ,

    ΔR(C(f̂_n, δ)) ≤ K { inf_{f∈𝓕} ΔQ(f) + 3L/n + (8L²/c + 2LB/3) log(n N_n / γ)/n }^{1/(s + β − βs)}

for some constant K > 0.

In the special case where 𝓕 consists of linear combinations

    f_λ(x) = Σ_{j=1}^M λ_j f_j(x)

of simple discriminant functions (decision stumps) f_1, ..., f_M with Σ_{j=1}^M |λ_j| ≤ B and ‖f_j‖_∞ ≤ 1, we obtain the rate (M log n / n)^{1/(s + β − βs)}. We can view B as a tuning parameter here, and if the functions f_j are nearly orthogonal in the sense that

    max_{1 ≤ i ≠ j ≤ M} | E[f_i(X) f_j(X)] − E[f_i(X)] E[f_j(X)] | ≤ c/‖λ_0‖_0

for some small c > 0, a small modification of Theorem 1 in Wegkamp (2007) shows that we also adapt to the unknown sparsity of the minimizer λ_0 of Q(f_λ) over λ, in that the rate becomes (‖λ_0‖_0 log n / n)^{1/(s + β − βs)} for suitably chosen B = B(n).
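As a toy end-to-end illustration of this section's program (minimize the empirical φ-risk over a bounded linear class, then threshold at the δ prescribed by Theorem 1), consider the logistic loss with δ = log((1 − d)/d) from Corollary 5. The sketch below is ours; the data-generating model, step size and iteration count are arbitrary choices, not the paper's experiment:

    import numpy as np

    rng = np.random.default_rng(1)
    d = 0.2
    delta = np.log((1 - d) / d)            # calibrated threshold for the logistic loss

    # Toy data: X in R^2 with eta(x) = 1 / (1 + exp(-3 x_1)), so the reject
    # band of g* straddles the boundary x_1 = 0.
    n = 2000
    X = rng.normal(size=(n, 2))
    eta = 1 / (1 + np.exp(-3 * X[:, 0]))
    y = np.where(rng.uniform(size=n) < eta, 1.0, -1.0)

    # Gradient descent on Q_n(f) = (1/n) sum_i log(1 + exp(-y_i <w, x_i>)).
    w = np.zeros(2)
    for _ in range(500):
        m = y * (X @ w)
        grad = -(X * (y / (1 + np.exp(m)))[:, None]).mean(axis=0)
        w -= 0.5 * grad

    f = X @ w
    g = np.where(f > delta, 1, np.where(f < -delta, -1, 0))
    print("reject rate:", (g == 0).mean())
    print("error on classified points:", (g[g != 0] != y[g != 0]).mean())

Here f̂_n approaches f*_φ(x) = log(η(x)/(1 − η(x))) = 3x_1, and the classifier abstains exactly on the band |f̂_n| ≤ δ.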

5. Asymmetric Loss

We have focused thus far on the case where misclassifying from one class to the other, either g(X) = 1 while Y = −1 or g(X) = −1 while Y = 1, is assigned the same loss. In many applications, however, one type of misclassification may incur a heavier loss than the other. Such situations naturally arise in risk management or medical diagnosis. To this end, the following loss function can be adopted in place of ℓ:

    ℓ_θ[g(X), Y] = 1 if g(X) = −1 and Y = 1;
                   θ if g(X) = 1 and Y = −1;
                   d if g(X) = 0;
                   0 if g(X) = Y.

We shall assume that θ ≤ 1 without loss of generality. It can be shown that the rejection option is only viable if d < θ/(1 + θ) (see, e.g., Herbei and Wegkamp, 2006), which we shall assume throughout the section. When this holds, the corresponding Bayes rule is given by (see, e.g., Herbei and Wegkamp, 2006)

    g*_θ(X) = 1 if η(X) > 1 − d/θ;
              0 if d ≤ η(X) ≤ 1 − d/θ;
              −1 if η(X) < d.

Instead of C(f̂_n, δ), an asymmetrically truncated classification rule C(f̂_n; δ_1, δ_2) can be used for our purpose here, where

    C(f(X); δ_1, δ_2) = 1 if f(X) > δ_1;
                        0 if −δ_2 ≤ f(X) ≤ δ_1;
                        −1 if f(X) < −δ_2.

The behavior of the asymmetrically truncated classification rule C(f̂_n; δ_1, δ_2) can be studied in a similar fashion as before. In particular, we have the following result in parallel to Theorems 1 and 9.

Theorem 20 Assume that φ is convex. Then C(f*_φ, δ_1, δ_2) for some δ_1, δ_2 > 0 is infinite sample consistent, that is, C(f*_φ, δ_1, δ_2) = g*_θ, if and only if φ′(±δ_1) and φ′(±δ_2) exist; φ′(δ_1), φ′(δ_2) < 0; and

    φ′(δ_1) / (φ′(−δ_1) + φ′(δ_1)) = d/θ;    φ′(δ_2) / (φ′(−δ_2) + φ′(δ_2)) = d.

Furthermore, if C(f*_φ, δ_1, δ_2) is infinite sample consistent and

    |θ(1 − η) − d|^s ≤ C^s ΔQ_η(δ_1);    |η − d|^s ≤ C^s ΔQ_η(−δ_2),

then

    ΔR_θ[C(f, δ_1, δ_2)] ≤ 2C [ΔQ(f)]^{1/s},

where ΔR_θ(g) = R_θ(g) − R_θ(g*_θ) and R_θ(g) = E[ℓ_θ(g(X), Y)].

Theorem 20 can be proved in the same fashion as Theorems 1 and 9 and is therefore omitted for brevity.
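For illustration, the asymmetric rule is the obvious two-threshold variant of the symmetric truncation. For the least squares loss, the two equations in Theorem 20 give δ_1 = 1 − 2d/θ and δ_2 = 1 − 2d, and plugging in f*_φ = 2η − 1 recovers g*_θ. A sketch (ours):

    import numpy as np

    def truncate_asym(f, delta1, delta2):
        # C(f; delta1, delta2): 1 if f > delta1, -1 if f < -delta2, else reject
        f = np.asarray(f)
        return np.where(f > delta1, 1, np.where(f < -delta2, -1, 0))

    d, theta = 0.1, 0.5                            # requires d < theta / (1 + theta)
    delta1, delta2 = 1 - 2 * d / theta, 1 - 2 * d  # least squares thresholds
    eta = np.linspace(0.015, 0.985, 98)
    g = truncate_asym(2 * eta - 1, delta1, delta2) # f*_phi = 2 eta - 1 for least squares
    g_star = np.where(eta > 1 - d / theta, 1, np.where(eta < d, -1, 0))
    assert np.all(g == g_star)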

6. Proofs

Proof of Theorem 1. We first show the "if" part. Recall that

    Q(f) = E[φ(Y f(X))] = E(E[φ(Y f(X)) | X]) = E[η(X)φ(f(X)) + (1 − η(X))φ(−f(X))].

With slight abuse of notation, write

    Q_{η(X)}(f(X)) = η(X)φ(f(X)) + (1 − η(X))φ(−f(X)).

Then f*_φ(X) minimizes Q_{η(X)}(·). We now proceed by separately considering three different scenarios: (a) η(X) < d; (b) η(X) > 1 − d; and (c) d < η(X) < 1 − d. For brevity, we shall suppress the dependence of η and f*_φ on X in the remainder of the proof when no confusion can occur.

First consider the case when η < d. Recall that φ′(−δ) ≤ φ′(δ) < 0 and φ′(δ)/(φ′(−δ) + φ′(δ)) = d. Therefore

    ηφ′(−δ) − (1 − η)φ′(δ) > 0.

By the convexity of φ, for any z > 0,

    φ(z − δ) − φ(−δ) ≥ φ′(−δ) z;    φ(−z + δ) − φ(δ) ≥ −φ′(δ) z.

Hence

    Q_η(z − δ) − Q_η(−δ) ≥ [ηφ′(−δ) − (1 − η)φ′(δ)] z > 0,

which implies that f*_φ ≤ −δ. It now suffices to show that f*_φ ≠ −δ. By the definition of φ′(−δ) and φ′(δ), for any ε > 0 there exists a ζ > 0 such that for any 0 < z < ζ,

    φ(−z − δ) − φ(−δ) ≤ [−φ′(−δ) + ε] z;    φ(z + δ) − φ(δ) ≤ [φ′(δ) + ε] z.

Therefore, for any 0 < z < ζ,

    Q_η(−z − δ) − Q_η(−δ) = η[φ(−z − δ) − φ(−δ)] + (1 − η)[φ(z + δ) − φ(δ)]
                          ≤ η[−φ′(−δ) + ε] z + (1 − η)[φ′(δ) + ε] z
                          = ([(1 − η)φ′(δ) − ηφ′(−δ)] + ε) z.

Recall that (1 − η)φ′(δ) − ηφ′(−δ) < 0. By choosing ε small enough, we can ensure that (1 − η)φ′(δ) − ηφ′(−δ) + ε remains negative. Hence Q_η(−z − δ) < Q_η(−δ),

which implies that f*_φ ≠ −δ, and therefore f*_φ < −δ. Now consider the case when η > 1 − d. Observe that Q_η(z) = Q_{1−η}(−z). From the previous discussion,

    f*_φ = argmin_z Q_η(z) = −argmin_z Q_{1−η}(z) > δ.

At last, consider the case when d < η < 1 − d. Observe that in this case,

    ηφ′(δ) − (1 − η)φ′(−δ) > 0;    ηφ′(−δ) − (1 − η)φ′(δ) < 0.

Hence, for any z > 0,

    Q_η(z + δ) − Q_η(δ) ≥ [ηφ′(δ) − (1 − η)φ′(−δ)] z > 0,

which implies that f*_φ ≤ δ. Similarly,

    Q_η(−z − δ) − Q_η(−δ) ≥ [−ηφ′(−δ) + (1 − η)φ′(δ)] z > 0,

which implies that f*_φ ≥ −δ. In summary, f*_φ ∈ [−δ, δ].

We now consider the "only if" part. Let [a_−, b_−] and [a_+, b_+] be the subdifferentials of φ at −δ and δ, respectively. We need to show that a_− = b_−, a_+ = b_+ and a_+/(a_+ + a_−) = d. We begin by showing that b_+ ≤ 0. Assume the contrary. The infinite sample consistency implies that for any η > 1 − d, f*_φ > δ. Because b_+ > 0, we have φ(f*_φ) > φ(δ). Together with the fact that Q_η(f*_φ) < Q_η(δ), this implies that φ(−f*_φ) < φ(−δ), and subsequently a_− > 0. The convexity of φ also implies that a_− ≤ b_− ≤ a_+ ≤ b_+. Because

    φ(f*_φ) − φ(δ) ≥ b_+ (f*_φ − δ);    φ(−δ) − φ(−f*_φ) ≤ a_− (f*_φ − δ),

we have, for η sufficiently close to 1,

    Q_η(f*_φ) − Q_η(δ) ≥ (η b_+ − (1 − η) a_−)(f*_φ − δ) > 0.

This contradiction shows that b_+ ≤ 0. Given that a_− ≤ b_− ≤ a_+ ≤ b_+ ≤ 0, we have

    b_+/(a_− + b_+) ≤ a_+/(b_− + a_+),

with equality only if a_− = b_− and a_+ = b_+. It therefore suffices to show that b_+/(a_− + b_+) ≥ d and a_+/(b_− + a_+) ≤ d. Assume the contrary. First consider the case when b_+/(a_− + b_+) < d. Let η be such that b_+/(a_− + b_+) < η < d. By definition, for any f < −δ,

    φ(f) − φ(−δ) ≥ a_−(f + δ);    φ(−f) − φ(δ) ≥ −b_+(f + δ).

Hence

    Q_η(f) − Q_η(−δ) ≥ [ηa_− − (1 − η)b_+](f + δ) > 0,

which implies that argmin_z Q_η(z) ≥ −δ. This contradicts the infinite sample consistency, since η < d requires f*_φ < −δ. Therefore b_+/(a_− + b_+) ≥ d. Next we deal with the case a_+/(b_− + a_+) > d. Let η be such that a_+/(b_− + a_+) > η > d. Following a similar argument as before, one can show that Q_{1−η}(f) − Q_{1−η}(δ) > 0 for any f < δ, which implies that argmin_z Q_{1−η}(z) ≥ δ. This again contradicts the infinite sample consistency, because 1 − η < 1 − d. Therefore a_+/(b_− + a_+) ≤ d. The proof is now concluded. ∎

Proof of Theorem 9. Recall that Q_η(f) = ηφ(f) + (1 − η)φ(−f). Similarly, write

    R_η[C(f, δ)] = η ℓ(C(f, δ), 1) + (1 − η) ℓ(C(f, δ), −1).

Also write ΔQ_η(f) = Q_η(f) − inf Q_η(f) and ΔR_η(f) = R_η(f) − inf R_η(f). It suffices to show that

    ΔR_η[C(f, δ)] ≤ 2C [ΔQ_η(f)]^{1/s}.    (10)

The theorem can be deduced from (10) by Jensen's inequality:

    ΔR[C(f, δ)] = E[ ΔR_{η(X)}[C(f(X), δ)] ]
               ≤ 2C E([ΔQ_{η(X)}(f(X))]^{1/s})
               ≤ 2C (E[ΔQ_{η(X)}(f(X))])^{1/s} = 2C [ΔQ(f)]^{1/s}.

To show (10), we consider separately the different combinations of values of η and f. For brevity, we shall suppress their dependence on X in what follows.

Case 1. η < d and f < −δ. As shown before, in this case f*_φ(X) < −δ. Thus

    ΔR_η[C(f, δ)] = 0 ≤ 2C [ΔQ_η(f)]^{1/s}.

Case 2. η < d and |f| ≤ δ. Observe that

    Q_η(f) − Q_η(−δ) ≥ [ηφ′(−δ) − (1 − η)φ′(δ)](f + δ) = −[φ′(−δ) + φ′(δ)](d − η)(f + δ) ≥ 0.

Together with the fact that C^s ΔQ_η(−δ) ≥ (d − η)^s, we have

    ΔQ_η(f) ≥ ΔQ_η(−δ) ≥ C^{−s}(d − η)^s.

Note that ΔR_η[C(f, δ)] = d − η. We conclude that

    ΔR_η[C(f, δ)] ≤ C [ΔQ_η(f)]^{1/s}.

Case 3. η < d and f > δ. Observe that

    Q_η(f) − Q_η(δ) ≥ [ηφ′(δ) − (1 − η)φ′(−δ)](f − δ) = −[φ′(−δ) + φ′(δ)](1 − d − η)(f − δ) ≥ 0.

Together with the facts that d < 1/2 and C^s ΔQ_η(δ) ≥ |1 − d − η|^s, we have

    ΔQ_η(f) ≥ ΔQ_η(δ) ≥ C^{−s}|1 − d − η|^s ≥ (2C)^{−s}|1 − 2η|^s.

Note that ΔR_η[C(f, δ)] = 1 − 2η. Therefore

    ΔR_η[C(f, δ)] ≤ 2C [ΔQ_η(f)]^{1/s}.

Case 4. d < η < 1 − d and f < −δ. Following a similar argument as before,

    Q_η(f) − Q_η(−δ) ≥ −[φ′(−δ) + φ′(δ)](η − d)(−f − δ) ≥ 0.

Therefore

    ΔQ_η(f) ≥ ΔQ_η(−δ) ≥ C^{−s}(η − d)^s,

which, together with the fact that ΔR_η[C(f, δ)] = η − d, implies that

    ΔR_η[C(f, δ)] ≤ C [ΔQ_η(f)]^{1/s}.

Case 5. d < η < 1 − d and |f| ≤ δ. In this case,

    ΔR_η[C(f, δ)] = 0 ≤ 2C [ΔQ_η(f)]^{1/s}.

Case 6. d < η < 1 − d and f > δ. Observe that

    Q_η(f) − Q_η(δ) ≥ −[φ′(−δ) + φ′(δ)](1 − d − η)(f − δ) ≥ 0.

Hence

    ΔQ_η(f) ≥ ΔQ_η(δ) ≥ C^{−s}(1 − d − η)^s,

which, together with the fact that ΔR_η[C(f, δ)] = 1 − d − η, implies that

    ΔR_η[C(f, δ)] ≤ C [ΔQ_η(f)]^{1/s}.

Case 7. η > 1 − d. Observe that ΔR_η[C(f, δ)] = ΔR_{1−η}[C(−f, δ)] and ΔQ_η(f) = ΔQ_{1−η}(−f). Because 1 − η < d, from Cases 1, 2 and 3 we have

    ΔR_η[C(f, δ)] = ΔR_{1−η}[C(−f, δ)] ≤ 2C [ΔQ_{1−η}(−f)]^{1/s} = 2C [ΔQ_η(f)]^{1/s}.

The proof is therefore complete. ∎

Proof of Theorem 10. The proof follows from the same argument as that of Theorem 9. The only difference occurs in Case 3, where under the current assumptions

    ΔR_η(f) = 1 − 2η ≤ C [ΔQ_η(δ)]^{1/s}. ∎

Proof of Theorem 11. The last part of the proof is based on the proof of Theorem 3 in Bartlett, Jordan and McAuliffe (2006). Let g = C(f, δ) be the classification rule with reject option based on f: 𝒳 → ℝ, and set g* = C(f*_φ, δ). We have shown above that, under the assumptions of Theorem 9,

    |η − d| 1{g ≠ g*}(1{g = −1} + 1{g* = −1}) ≤ 2C [ΔQ_η(f)]^{1/s};
    |1 − d − η| 1{g ≠ g*}(1{g = 1} + 1{g* = 1}) ≤ 2C [ΔQ_η(f)]^{1/s}.

Moreover, Lemma 1 in Herbei and Wegkamp (2006) states that

    ΔR(g) = E[ |η(X) − d| 1{g(X) ≠ g*(X)}(1{g(X) = −1} + 1{g*(X) = −1}) ]
          + E[ |1 − d − η(X)| 1{g(X) ≠ g*(X)}(1{g(X) = 1} + 1{g*(X) = 1}) ].    (11)

Hence, for any ε > 0,

    ΔR(g) = E[ |η(X) − d| 1{|η(X) − d| ≤ ε} 1{g(X) ≠ g*(X)}(1{g(X) = −1} + 1{g*(X) = −1}) ]
          + E[ |η(X) − d| 1{|η(X) − d| > ε} 1{g(X) ≠ g*(X)}(1{g(X) = −1} + 1{g*(X) = −1}) ]
          + E[ |1 − d − η(X)| 1{|1 − d − η(X)| ≤ ε} 1{g(X) ≠ g*(X)}(1{g(X) = 1} + 1{g*(X) = 1}) ]
          + E[ |1 − d − η(X)| 1{|1 − d − η(X)| > ε} 1{g(X) ≠ g*(X)}(1{g(X) = 1} + 1{g*(X) = 1}) ]
          ≤ 2ε P{g*(X) ≠ g(X)} + ε^{1−s}(2C)^s ΔQ(f),

where we used (11) and the inequality |x| 1{|x| ≥ ε} ≤ |x|^r ε^{1−r} for r ≥ 1. Using the bound

    P{g(X) ≠ g*(X)} ≤ [(8A)^{1/α} ΔR(g)]^β

from the proof of Lemma 4 of Herbei and Wegkamp (2006), and choosing ε = c[ΔR(g)]^{1−β} with c = [(8A)^{1/α}]^{−β}/4, readily gives the desired claim with K = 4Cc^{1−s}. ∎

Proof of Theorem 18. Recall that f̄ ∈ 𝓕 minimizes Q(f) over f ∈ 𝓕. Let h(Y f(X)) = φ(Y f(X)) − φ(Y f̄(X)). Since

    [Q(f) + Q(f̄)]/2 ≥ Q((f + f̄)/2) + c E[(f − f̄)²(X)] ≥ Q(f̄) + c E[(f − f̄)²(X)],

we have

    E[h²(Y f(X))] ≤ L² E[(f − f̄)²(X)] ≤ (L²/(2c)) {Q(f) − Q(f̄)} = (L²/(2c)) E[h(Y f(X))];

see, for example, Bartlett, Jordan and McAuliffe (2006). Since f̂_n minimizes Q_n(f), we have

    Q(f̂_n) − Q(f̄) = P h(Y f̂_n(X)) = P_n h(Y f̂_n(X)) + (P − P_n) h(Y f̂_n(X))
                   ≤ P_n h(Y f̄(X)) + (P − P_n) h(Y f̂_n(X))
                   ≤ sup_{f∈𝓕} (P − P_n) h(Y f(X)),

where P h(Y f(X)) = E[h(Y f(X))] and P_n h(Y f(X)) = (1/n) Σ_{i=1}^n h(Y_i f(X_i)) for any f ∈ 𝓕. Next we observe that

    sup_{f∈𝓕} (P − P_n) h(Y f(X)) ≤ 3L/n + max_{f∈𝓕_n} (P − P_n) h(Y f(X)),

where 𝓕_n is a minimal 1/n-net of 𝓕. By Bernstein's inequality, we get

    P{ max_{f∈𝓕_n} (P − P_n) h(Y f(X)) ≥ t }
      ≤ N_n exp( − n{t + P h(Y f(X))}²/8 / ( P h²(Y f(X)) + (2LB){t + P h(Y f(X))}/6 ) )
      ≤ N_n exp( − n t (8L²/c + 2LB/3)^{−1} ),

and the conclusion follows easily. ∎

Acknowledgments

The authors gratefully acknowledge that part of this research was done while they were visiting the Isaac Newton Institute for Mathematical Sciences (Statistical Theory and Methods for Complex, High-Dimensional Data Programme) at Cambridge University during Spring 2008. The research of Ming Yuan was supported in part by NSF grants DMS-MPSA and DMS. The research of Marten Wegkamp was supported in part by NSF grant DMS.

References

P.L. Bartlett, M.I. Jordan and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138-156, 2006.

P.L. Bartlett and M.H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823-1840, 2008.

J. Friedman, T. Hastie and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337-407, 2000.

R. Herbei and M.H. Wegkamp. Classification with reject option. Canadian Journal of Statistics, 34(4):709-721, 2006.

Y. Lin. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6:259-275, 2002.

E. Mammen and A.B. Tsybakov. Smooth discrimination analysis. Annals of Statistics, 27(6):1808-1829, 1999.

J.S. Marron, M. Todd and J. Ahn. Distance weighted discrimination. Journal of the American Statistical Association, 102:1267-1271, 2007.

B.D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, 1996.

A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135-166, 2004.

M.H. Wegkamp. Lasso type classifiers with a reject option. Electronic Journal of Statistics, 1:155-168, 2007.

T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32:56-134, 2004.


More information

Integration Review. May 11, 2013

Integration Review. May 11, 2013 Integration Review May 11, 2013 Goals: Review the funamental theorem of calculus. Review u-substitution. Review integration by parts. Do lots of integration eamples. 1 Funamental Theorem of Calculus In

More information

4.2 First Differentiation Rules; Leibniz Notation

4.2 First Differentiation Rules; Leibniz Notation .. FIRST DIFFERENTIATION RULES; LEIBNIZ NOTATION 307. First Differentiation Rules; Leibniz Notation In this section we erive rules which let us quickly compute the erivative function f (x) for any polynomial

More information

Linear Regression with Limited Observation

Linear Regression with Limited Observation Ela Hazan Tomer Koren Technion Israel Institute of Technology, Technion City 32000, Haifa, Israel ehazan@ie.technion.ac.il tomerk@cs.technion.ac.il Abstract We consier the most common variants of linear

More information

Assignment 1. g i (x 1,..., x n ) dx i = 0. i=1

Assignment 1. g i (x 1,..., x n ) dx i = 0. i=1 Assignment 1 Golstein 1.4 The equations of motion for the rolling isk are special cases of general linear ifferential equations of constraint of the form g i (x 1,..., x n x i = 0. i=1 A constraint conition

More information

Fall 2016: Calculus I Final

Fall 2016: Calculus I Final Answer the questions in the spaces provie on the question sheets. If you run out of room for an answer, continue on the back of the page. NO calculators or other electronic evices, books or notes are allowe

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

Lecture XII. where Φ is called the potential function. Let us introduce spherical coordinates defined through the relations

Lecture XII. where Φ is called the potential function. Let us introduce spherical coordinates defined through the relations Lecture XII Abstract We introuce the Laplace equation in spherical coorinates an apply the metho of separation of variables to solve it. This will generate three linear orinary secon orer ifferential equations:

More information

A. Incorrect! The letter t does not appear in the expression of the given integral

A. Incorrect! The letter t does not appear in the expression of the given integral AP Physics C - Problem Drill 1: The Funamental Theorem of Calculus Question No. 1 of 1 Instruction: (1) Rea the problem statement an answer choices carefully () Work the problems on paper as neee (3) Question

More information

Schrödinger s equation.

Schrödinger s equation. Physics 342 Lecture 5 Schröinger s Equation Lecture 5 Physics 342 Quantum Mechanics I Wenesay, February 3r, 2010 Toay we iscuss Schröinger s equation an show that it supports the basic interpretation of

More information

Students need encouragement. So if a student gets an answer right, tell them it was a lucky guess. That way, they develop a good, lucky feeling.

Students need encouragement. So if a student gets an answer right, tell them it was a lucky guess. That way, they develop a good, lucky feeling. Chapter 8 Analytic Functions Stuents nee encouragement. So if a stuent gets an answer right, tell them it was a lucky guess. That way, they evelop a goo, lucky feeling. 1 8.1 Complex Derivatives -Jack

More information

Energy behaviour of the Boris method for charged-particle dynamics

Energy behaviour of the Boris method for charged-particle dynamics Version of 25 April 218 Energy behaviour of the Boris metho for charge-particle ynamics Ernst Hairer 1, Christian Lubich 2 Abstract The Boris algorithm is a wiely use numerical integrator for the motion

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

Equilibrium in Queues Under Unknown Service Times and Service Value

Equilibrium in Queues Under Unknown Service Times and Service Value University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University

More information

A FURTHER REFINEMENT OF MORDELL S BOUND ON EXPONENTIAL SUMS

A FURTHER REFINEMENT OF MORDELL S BOUND ON EXPONENTIAL SUMS A FURTHER REFINEMENT OF MORDELL S BOUND ON EXPONENTIAL SUMS TODD COCHRANE, JEREMY COFFELT, AND CHRISTOPHER PINNER 1. Introuction For a prime p, integer Laurent polynomial (1.1) f(x) = a 1 x k 1 + + a r

More information

Final Exam Study Guide and Practice Problems Solutions

Final Exam Study Guide and Practice Problems Solutions Final Exam Stuy Guie an Practice Problems Solutions Note: These problems are just some of the types of problems that might appear on the exam. However, to fully prepare for the exam, in aition to making

More information

Exam 2 Review Solutions

Exam 2 Review Solutions Exam Review Solutions 1. True or False, an explain: (a) There exists a function f with continuous secon partial erivatives such that f x (x, y) = x + y f y = x y False. If the function has continuous secon

More information

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7.

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7. Lectures Nine an Ten The WKB Approximation The WKB metho is a powerful tool to obtain solutions for many physical problems It is generally applicable to problems of wave propagation in which the frequency

More information