PETER L. BARTLETT AND MARTEN H. WEGKAMP
CLASSIFICATION WITH A REJECT OPTION USING A HINGE LOSS

PETER L. BARTLETT AND MARTEN H. WEGKAMP

Abstract. We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss (the φ-risk) also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1 | X) is unlikely to be close to certain critical values.

Key words and phrases: Bayes classifiers; classification; convex surrogate loss; empirical risk minimization; hinge loss; large margin classifiers; margin condition; reject option; support vector machines.

MSC 2000: Primary 62C05; secondary 62G05, 62G08.

1. Introduction

The aim of binary classification is to classify observations that take values in an arbitrary feature space X into one of two classes, labelled −1 or +1. Since an observation X does not fully determine its label Y, we construct a classifier t : X → {−1, +1} that represents our guess t(X) of the label Y of a future observation X. The rule with the smallest probability of error P{t(X) ≠ Y} is the Bayes rule

(1)  t*(x) := −1 if P{Y = −1 | X = x} ≥ P{Y = +1 | X = x}, and +1 otherwise.

Observations x for which the conditional probability

(2)  η(x) := P{Y = +1 | X = x}

is close to 1/2 are the most difficult to classify. In the extreme case where η(x) = 1/2, we may just as well toss a coin to make a decision.
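As a concrete illustration (a sketch of ours, not code from the paper), the rule (1) and its conditional probability of error can be written directly in terms of η(x):

```python
# Illustration only: the Bayes rule (1), as a function of the conditional
# probability eta(x) = P{Y = +1 | X = x}; the tie eta(x) = 1/2 goes to -1,
# as in (1).
def bayes_rule(eta_x):
    return -1 if eta_x <= 0.5 else +1

# The conditional probability of error of the Bayes rule at x is
# min(eta(x), 1 - eta(x)), which is largest when eta(x) is near 1/2.
def conditional_error(eta_x):
    return min(eta_x, 1.0 - eta_x)
```

No other rule can have smaller conditional error at x, which is why observations with η(x) near 1/2 are the hard ones.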
While it is our aim to classify the majority of future observations in an automatic way, it is often appropriate to instead report a warning for those observations that are hard to classify (the ones having conditional probability η(x) near the value 1/2). This motivates the introduction of a reject option for classifiers, by allowing for a third decision, R (reject), expressing doubt. In case the reject option is invoked, no decision is made. Although such classifiers are valuable in practice, few theoretical results are available; see [10] and [7] for references.

In this note, we assume that the cost of making a wrong decision is 1 and the cost of utilizing the reject option is d > 0. Given a classifier with reject option t : X → {−1, +1, R}, the appropriate risk function is

(3)  L_d(t) := d P{t(X) = R} + P{t(X) ≠ Y, t(X) ≠ R}.

It is easy to see that the rule minimizing this risk assigns −1, +1 or R depending on which of η(x), 1 − η(x) or d is smallest. According to this rule, which we refer to as the Bayes rule, we need never reject if d ≥ 1/2. For this reason we restrict ourselves to the cases 0 ≤ d ≤ 1/2, and the Bayes rule with reject option is then

(4)  t*(x) := −1 if η(x) < d;  +1 if η(x) > 1 − d;  R otherwise,

with risk

(5)  L* := L_d(t*) = E min{η(X), 1 − η(X), d}.

The case d = 1/2 reduces to the classical situation without the reject option, and (4) coincides with (1). Herbei and Wegkamp [7] study plug-in rules that replace the regression function η(x) by an estimate η̂(x) in the formula for t*(x) above. They show that the rate of convergence of the risk (3) to the Bayes risk L* depends on how well η̂(x) estimates η(x) and on the behavior of η(x) near the values d and 1 − d. This condition on η(x) nicely generalizes Tsybakov's noise condition (cf. [13]) from the classical setting d = 1/2 to our more general framework 0 ≤ d ≤ 1/2. The same paper considers classifiers t̂ that minimize the empirical counterpart

(6)  L̂_d(t) := (d/n) Σ_{i=1}^n I{t(X_i) = R} + (1/n) Σ_{i=1}^n I{t(X_i) ≠ Y_i, t(X_i) ≠ R}

of the risk L_d(t) over a class T of classifiers t : X → {−1, +1, R} with reject option, based on a sample (X_1, Y_1), ..., (X_n, Y_n) of independent copies of the pair (X, Y). For such rules t̂,
[7] establish oracle inequalities for the excess risk, of the form

L_d(t̂) − L* ≤ C_0 inf_{t∈T} [L_d(t) − L*] + ∆_n

for some constant C_0 > 1, where the remainder ∆_n depends on the sample size n, the complexity of the class T and the behavior of η(x) near the values d and 1 − d. Their findings are in line with the recent theoretical developments of standard binary classification without the reject option (d = 1/2); see e.g. [3], [4], [9]. Despite the attractive theoretical properties of this method, it is often hard to implement. This paper addresses this pitfall by considering a convex surrogate for the loss function, akin to the hinge loss that is used in SVMs.

The next section introduces a piecewise linear loss function φ_d(x) that generalizes the hinge loss function max{0, 1 − x}, in that it allows for the reject option and φ_d(x) = max{0, 1 − x} for d = 1/2. We prove that the Bayes classifier (4) is a transformed version of the minimizer of the risk associated with this new loss, and that the excess risk L_d − L* can be bounded by 2d times the excess risk based on the piecewise linear loss φ_d. Thus classifiers with small excess φ_d-risk automatically have small excess classification risk, providing theoretical justification of the more computationally appealing method. In Section 3, we illustrate the computational convenience of the new loss, showing that the SVM classifier with reject option can be obtained by solving a quadratic program. Finally, in Section 4, we show that fast rates (for instance, faster than n^{−1/2}) of the SVM classifier with reject option are possible under the same noise conditions on η(x) used in [7]. As a side effect, for the standard SVM (the special case of d = 1/2), our results imply fast rates without an assumption that η(x) is unlikely to be near 0 and 1, a technical condition that has been imposed in the literature for that case [5], [12].

2. Generalized hinge loss

We can associate with any real-valued function f a classifier with reject option, using the transformation

(7)  t_{f,δ}(x) := −1 if f(x) < −δ;  R if |f(x)| ≤ δ;  +1 if f(x) > δ,

where 0 < δ ≤ 1 − d is an arbitrary positive number. We recommend the value δ = 1/2, but we postpone the discussion of the choice of δ until after Theorem 2. According to (3), the risk
of t_{f,δ} then equals

L_d(t_{f,δ}) = P{t_{f,δ}(X) ≠ Y, t_{f,δ}(X) ≠ R} + d P{t_{f,δ}(X) = R}
             = P{f(X) < −δ, Y = 1} + P{f(X) > δ, Y = −1} + d P{−δ ≤ f(X) ≤ δ}
             = P{Y f(X) < −δ} + d P{|Y f(X)| ≤ δ}.

For example, the Bayes classifier t*(x) defined in (4) corresponds to the function

(8)  f*(x) := −1 if η(x) < d;  0 if d ≤ η(x) ≤ 1 − d;  +1 if η(x) > 1 − d.

Although the function f*(x) is not unique, the classifier t*(x) is the unique minimizer of L_d(t_{f,δ}) over all measurable f : X → R. We see that

(9)  L_d(t_{f,δ}) = L_{d,δ}(f) := E ℓ_{d,δ}(Y f(X))

for the discontinuous loss

ℓ_{d,δ}(α) := 1 if α < −δ;  d if |α| ≤ δ;  0 otherwise.

The choice (δ, d) = (0, 1/2) corresponds to the classical case L_d(t_{f,δ}) = P{Y f(X) < 0}. All other choices δ > 0 lead to classification that allows for the reject option, with minimal risk inf_f L_{d,δ}(f) = E ℓ_{d,δ}(Y f*(X)) = L*.

Instead of the discontinuous loss ℓ_{d,δ}, we consider a convex surrogate loss. Define the piecewise linear function

φ_d(α) := 1 − aα if α < 0;  1 − α if 0 ≤ α < 1;  0 otherwise,

where a = (1 − d)/d ≥ 1. The next result states that the minimizer of the expectation of the discrete loss ℓ_{d,δ} and of the convex loss φ_d remains the same: f* defined in (8) minimizes both E ℓ_{d,δ}(Y f(X)) and E φ_d(Y f(X)), and the minimal risks are related via the equality L_{d,δ}(f*) = d L_φ(f*).

Proposition 1. The minimizer of the risk L_φ(f) := E φ_d(Y f(X)) over all measurable f : X → R is the Bayes rule (8). Furthermore, d L_φ(f*) = L_{d,δ}(f*).
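The loss φ_d and Proposition 1 are easy to check numerically. The following sketch (ours, not the paper's code) minimizes the conditional φ_d-risk η φ_d(α) + (1 − η) φ_d(−α) over a grid of α and confirms that f*(η) attains the minimum, and that d = 1/2 recovers the usual hinge loss:

```python
# Sketch (not from the paper): the generalized hinge loss phi_d, the target
# f* of (8), and a grid search confirming Proposition 1 pointwise in eta.
def phi(alpha, d):
    a = (1.0 - d) / d                 # a = (1 - d)/d >= 1
    if alpha < 0.0:
        return 1.0 - a * alpha
    if alpha < 1.0:
        return 1.0 - alpha
    return 0.0

def f_star(eta, d):
    # Bayes target (8): -1, 0 or +1 according to eta relative to d, 1 - d.
    return -1.0 if eta < d else (1.0 if eta > 1.0 - d else 0.0)

def cond_risk(alpha, eta, d):
    # Conditional phi_d-risk: eta * phi_d(alpha) + (1 - eta) * phi_d(-alpha).
    return eta * phi(alpha, d) + (1.0 - eta) * phi(-alpha, d)

# d = 1/2 recovers the hinge loss max(0, 1 - alpha).
assert phi(-2.0, 0.5) == max(0.0, 1.0 - (-2.0))

# f*(eta) attains the minimal conditional risk on a fine grid of alphas.
d = 0.2
grid = [i / 1000.0 - 2.0 for i in range(4001)]    # alphas in [-2, 2]
for eta in (0.05, 0.2, 0.5, 0.9):
    best = min(cond_risk(al, eta, d) for al in grid)
    assert cond_risk(f_star(eta, d), eta, d) <= best + 1e-9
```

The minimal conditional risk found this way equals min{η, 1 − η, d}/d, in line with the identity d L_φ(f*) = L* of Proposition 1.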
Proof. Note that

L_φ(f) = E[η(X) φ_d(f(X))] + E[(1 − η(X)) φ_d(−f(X))].

Hence, for

(10)  r_{η,φ}(α) := η φ_d(α) + (1 − η) φ_d(−α),

it suffices to show that

(11)  α*(η) := −1 if η < 1/(1 + a);  0 if 1/(1 + a) ≤ η ≤ a/(1 + a);  +1 if η > a/(1 + a)

minimizes r_{η,φ}(α). The function r_{η,φ}(α) can be written as

r_{η,φ}(α) = η − aηα               if α ≤ −1,
           = 1 + α(1 − (1 + a)η)    if −1 ≤ α ≤ 0,
           = 1 + α(a(1 − η) − η)    if 0 ≤ α ≤ 1,
           = αa(1 − η) + (1 − η)    if α ≥ 1,

and it is now a simple exercise to verify that α* indeed minimizes r_{η,φ}(α). Finally, since L_φ(f) = E r_{η(X),φ}(f(X)) and

inf_α [η φ_d(α) + (1 − η) φ_d(−α)] = η φ_d(α*) + (1 − η) φ_d(−α*)
  = (η/d) 1{η ≤ d} + 1{d ≤ η ≤ 1 − d} + ((1 − η)/d) 1{η ≥ 1 − d},

we find that

d L_φ(f*) = E[min{η(X), 1 − η(X), d}] = L*,

and the second claim follows as well. □

We see that φ_d(α) ≥ ℓ_{d,δ}(α) for all α ∈ R as long as 0 < δ ≤ 1 − d. Since this pointwise relation remains preserved under taking expected values, we immediately obtain L_d(t_{f,δ}) ≤ L_φ(f). The following comparison theorem shows that a relation like this holds not only for the risks, but for the excess risks as well.

Theorem 2. Let 0 < d ≤ 1/2 be fixed. For all 0 < δ ≤ 1/2, we have

(12)  L_d(t_{f,δ}) − L* ≤ (d/δ) (L_φ(f) − L_φ*),

where L_φ* := L_φ(f*). For 1/2 ≤ δ ≤ 1 − d, we have

(13)  L_d(t_{f,δ}) − L* ≤ L_φ(f) − L_φ*.
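The comparison inequality (12) can be illustrated on a toy distribution (all numbers below are our own, made up for illustration): a four-point feature space, an arbitrary score function f, and δ = 1/2, for which (12) gives the constant d/δ = 2d.

```python
# Sketch (our own check, not from the paper): verify the comparison (12),
# L_d(t_{f,delta}) - L* <= (d/delta) * (L_phi(f) - L_phi*), on a finite X.
d, delta = 0.2, 0.5
a = (1 - d) / d

def phi(alpha):                        # generalized hinge loss phi_d
    return 1 - a * alpha if alpha < 0 else (1 - alpha if alpha < 1 else 0.0)

# Toy distribution: P(X = x_k) and eta(x_k) = P(Y = +1 | X = x_k).
px  = [0.25, 0.25, 0.25, 0.25]
eta = [0.10, 0.40, 0.60, 0.95]
f   = [-0.8, 0.10, -0.20, 2.0]         # an arbitrary, non-optimal score

def reject_risk(t):                    # risk (3) of a {-1, +1, 'R'}-valued rule
    r = 0.0
    for p, e, lab in zip(px, eta, t):
        r += p * d if lab == 'R' else p * (1 - e if lab == 1 else e)
    return r

t_f = [(-1 if v < -delta else (1 if v > delta else 'R')) for v in f]
L_star     = sum(p * min(e, 1 - e, d) for p, e in zip(px, eta))
L_phi      = sum(p * (e * phi(v) + (1 - e) * phi(-v))
                 for p, e, v in zip(px, eta, f))
L_phi_star = L_star / d                # Proposition 1: L* = d * L_phi(f*)
excess_cls = reject_risk(t_f) - L_star
excess_phi = L_phi - L_phi_star
assert excess_cls <= (d / delta) * excess_phi + 1e-12
```

Here f happens to induce the Bayes decisions, so the classification excess risk is zero while the φ_d-excess risk is positive; the bound (12) can be loose, but is never reversed.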
Finally, for (δ, d) = (0, 1/2), we have

(14)  L(t_f) − L* ≤ L_φ(f) − L_φ*,

where L(t_f) := P{Y f(X) < 0}, L* := E min{η(X), 1 − η(X)} and φ(x) = max{0, 1 − x}.

Remark. The optimal multiplicative constant (d/δ or 1, depending on the value of δ) in front of the φ_d-excess risk is achieved at δ = 1/2. For this choice, Theorem 2 states that

L_d(t_{f,1/2}) − L* ≤ 2d (L_φ(f) − L_φ*).

For all d ≤ δ ≤ 1 − d, the multiplicative constant in front of the φ_d-excess risk does not exceed 1. The choice δ = 1/2, with the smallest constant 2d ≤ 1, is right in the middle of the interval [d, 1 − d]. The choice δ = 1 − d corresponds to the largest value of δ for which the piecewise constant function ℓ_{d,δ}(α) is still majorized by the convex surrogate φ_d(α). For δ = d we will reject less frequently than for δ = 1 − d, and δ = 1/2 can be seen as a compromise between these two extreme cases. Inequality (14) is due to Zhang [14].

Before we prove the theorem, we need an intermediate result. We define the functions

ξ(η) := η 1{η ≤ d} + d 1{d ≤ η ≤ 1 − d} + (1 − η) 1{η ≥ 1 − d}

and

H(η) := inf_α [η φ_d(α) + (1 − η) φ_d(−α)] = (η/d) 1{η ≤ d} + 1{d ≤ η ≤ 1 − d} + ((1 − η)/d) 1{η ≥ 1 − d}.

We suppress their dependence on d in our notation. Their expectations are L* = E ξ(η(X)) and L_φ* = E H(η(X)), respectively. Furthermore, we define

H_{−1}(η) := inf_{α < −δ} [η φ_d(α) + (1 − η) φ_d(−α)],
H_R(η) := inf_{|α| ≤ δ} [η φ_d(α) + (1 − η) φ_d(−α)],
H_1(η) := inf_{α > δ} [η φ_d(α) + (1 − η) φ_d(−α)];

ξ_{−1}(η) := η − ξ(η),  ξ_R(η) := d − ξ(η),  ξ_1(η) := (1 − η) − ξ(η).
Proposition 3. Let 0 < d ≤ 1/2. If 0 < δ ≤ 1/2, then, for b ∈ {−1, +1, R},

(δ/d) ξ_b(η) ≤ H_b(η) − H(η).

If 1/2 ≤ δ ≤ 1 − d, then, for b ∈ {−1, +1, R},

ξ_b(η) ≤ H_b(η) − H(η).

If (δ, d) = (0, 1/2), then, for b ∈ {−1, +1, R},

ξ_b(η) ≤ H_b(η) − H(η).

Proof. Writing r_φ(α) for r_{η,φ}(α), we first compute

inf_{α ≤ −1} r_φ(α) = η/d,
inf_{−1 ≤ α ≤ −δ} r_φ(α) = (η/d) 1{η ≤ d} + (1 − δ + (δ/d)η) 1{η ≥ d},
inf_{−δ ≤ α ≤ 0} r_φ(α) = 1{η ≥ d} + (1 − δ + (δ/d)η) 1{η ≤ d},
inf_{0 ≤ α ≤ δ} r_φ(α) = 1{η ≤ 1 − d} + (1 − δ + (δ/d)(1 − η)) 1{η ≥ 1 − d},
inf_{δ ≤ α ≤ 1} r_φ(α) = ((1 − η)/d) 1{η ≥ 1 − d} + (1 − δ + (δ/d)(1 − η)) 1{η ≤ 1 − d},
inf_{α ≥ 1} r_φ(α) = (1 − η)/d.

It is now easy to verify that

H_{−1}(η) = inf_{α < −δ} [η φ_d(α) + (1 − η) φ_d(−α)] = (η/d) 1{η ≤ d} + (1 − δ + (δ/d)η) 1{η ≥ d},

so that

H_{−1}(η) − H(η) = 0·1{η ≤ d} + ((δ/d)η − δ) 1{d ≤ η ≤ 1 − d} + (1 − δ + ((1 + δ)η − 1)/d) 1{η ≥ 1 − d}.

On the other hand,

ξ_{−1}(η) = η − ξ(η) = 0·1{η ≤ d} + (η − d) 1{d ≤ η ≤ 1 − d} + (2η − 1) 1{η ≥ 1 − d},

and we see that (δ/d) ξ_{−1}(η) ≤ H_{−1}(η) − H(η) for all 0 < δ ≤ 1 − d. Next, we compute

H_R(η) = inf_{|α| ≤ δ} [η φ_d(α) + (1 − η) φ_d(−α)]
       = (1 − δ + (δ/d)η) 1{η ≤ d} + 1{d ≤ η ≤ 1 − d} + (1 − δ + (δ/d)(1 − η)) 1{η ≥ 1 − d}

and

H_R(η) − H(η) = (1 − δ)(1 − η/d) 1{η ≤ d} + 0·1{d ≤ η ≤ 1 − d} + (1 − δ)(1 − (1 − η)/d) 1{η ≥ 1 − d}.

Since

ξ_R(η) = d − ξ(η) = (d − η) 1{η ≤ d} + 0·1{d ≤ η ≤ 1 − d} + (d − 1 + η) 1{η ≥ 1 − d},

we find that (δ/d) ξ_R(η) ≤ H_R(η) − H(η) provided 0 < δ ≤ 1/2. Finally, we find that

H_1(η) = inf_{α > δ} [η φ_d(α) + (1 − η) φ_d(−α)] = (1 − δ + (δ/d)(1 − η)) 1{η ≤ 1 − d} + ((1 − η)/d) 1{η ≥ 1 − d},

and consequently

H_1(η) − H(η) = (1 − δ + (δ − (1 + δ)η)/d) 1{η ≤ d} + (δ/d)(1 − d − η) 1{d ≤ η ≤ 1 − d} + 0·1{η ≥ 1 − d}.

Now,

ξ_1(η) = (1 − η) − ξ(η) = (1 − 2η) 1{η ≤ d} + (1 − d − η) 1{d ≤ η ≤ 1 − d} + 0·1{η ≥ 1 − d},

and we find that (δ/d) ξ_1(η) ≤ H_1(η) − H(η) provided 0 < δ ≤ 1 − d.

We now verify the second claim of Proposition 3. Assume that 1/2 ≤ δ ≤ 1 − d. First we consider the case η ≤ d. Then

ξ_{−1}(η) ≤ H_{−1}(η) − H(η) holds trivially.
ξ_R(η) ≤ H_R(η) − H(η) ⟺ d − η ≤ (1 − δ)(d − η)/d. As η ≤ d, we need d ≤ 1 − δ, that is, δ ≤ 1 − d.
ξ_1(η) ≤ H_1(η) − H(η) ⟺ (1 + δ − 2d)η ≤ δ(1 − d). As η ≤ d, we need (1 + δ − 2d)d ≤ δ(1 − d), equivalently, δ ≥ d.

Next, if d ≤ η ≤ 1 − d, we see that

ξ_{−1}(η) ≤ H_{−1}(η) − H(η) ⟺ η − d ≤ (δ/d)(η − d), which holds as δ ≥ d.
ξ_R(η) ≤ H_R(η) − H(η) holds trivially.
ξ_1(η) ≤ H_1(η) − H(η) ⟺ 1 − d − η ≤ (δ/d)(1 − d − η), which holds as δ ≥ d.

Finally, if η ≥ 1 − d, we find that

ξ_{−1}(η) ≤ H_{−1}(η) − H(η) ⟺ (1 + δ − 2d)η ≥ 1 + dδ − 2d. For η ≥ 1 − d this holds provided (1 + δ − 2d)(1 − d) ≥ 1 + dδ − 2d, that is, δ ≥ d.
ξ_R(η) ≤ H_R(η) − H(η) ⟺ d − 1 + η ≤ (1 − δ)(η − 1 + d)/d, that is, δ ≤ 1 − d.
ξ_1(η) ≤ H_1(η) − H(η) holds trivially.

This concludes the proof of the second claim, since d ≤ 1/2 ≤ δ ≤ 1 − d. The last claim, for the case (δ, d) = (0, 1/2), follows as well from the preceding calculations. □

Proof of Theorem 2. Assume 0 < δ ≤ 1/2 and 0 < d ≤ 1/2. Define ψ(x) = xδ/d. By linearity of ψ, we have for any measurable function f,

ψ(L_d(t_{f,δ}) − L*) = P[ψ(ξ_{−1}(η)) I{f < −δ}] + P[ψ(ξ_R(η)) I{−δ ≤ f ≤ δ}] + P[ψ(ξ_1(η)) I{f > δ}],

where P is the probability measure of X and Pf = ∫ f dP. Invoke now Proposition 3 to deduce

ψ(L_d(t_{f,δ}) − L*) ≤ P[(H_{−1}(η) − H(η)) I{f < −δ}] + P[(H_R(η) − H(η)) I{−δ ≤ f ≤ δ}] + P[(H_1(η) − H(η)) I{f > δ}],

and conclude the proof by observing that the term on the right of the previous inequality is at most L_φ(f) − L_φ*, since H_b(η(x)) ≤ r_{η(x),φ}(f(x)) on the corresponding event.
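The inequalities of Proposition 3 can also be verified numerically. The sketch below (ours, not the paper's) computes H(η) and the constrained infima H_b(η) by grid search and checks the first claim, (δ/d)ξ_b(η) ≤ H_b(η) − H(η), for a value δ ≤ 1/2; the grid minima can only overestimate the constrained infima, so the check errs on the safe side.

```python
# Sketch (our own numerical check): Proposition 3 on a grid of eta values,
# comparing (delta/d) * xi_b(eta) with H_b(eta) - H(eta) for b in {-1, R, +1}.
d, delta = 0.2, 0.4                     # a case with 0 < delta <= 1/2
a = (1 - d) / d

def phi(al):
    return 1 - a * al if al < 0 else (1 - al if al < 1 else 0.0)

def r(al, eta):                          # conditional phi_d-risk at alpha
    return eta * phi(al) + (1 - eta) * phi(-al)

alphas = [i / 200.0 - 2.0 for i in range(801)]   # grid on [-2, 2]

def H(eta, region=lambda al: True):
    return min(r(al, eta) for al in alphas if region(al))

for k in range(1, 20):
    eta = k / 20.0
    xi = min(eta, 1 - eta, d)            # xi(eta)
    h = H(eta)                           # unconstrained infimum H(eta)
    checks = [
        (eta - xi,       lambda al: al < -delta),            # b = -1
        (d - xi,         lambda al: -delta <= al <= delta),  # b = R
        ((1 - eta) - xi, lambda al: al > delta),             # b = +1
    ]
    for xi_b, region in checks:
        assert (delta / d) * xi_b <= H(eta, region) - h + 1e-6
```

The unconstrained grid minimum also recovers the closed form H(η) = min{η, 1 − η, d}/d derived above.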
For the case (δ, d) = (0, 1/2), and for the case (δ, d) with 1/2 ≤ δ ≤ 1 − d and 0 < d ≤ 1/2, take ψ(x) = x. □

3. SVM classifiers with reject option

In this section, we consider the SVM classifier with reject option, and show that it can be obtained by solving a quadratic program. Let k : X × X → R be the kernel of a reproducing kernel Hilbert space (RKHS) H, and let ‖f‖ be the norm of f in H. The SVM classifier with reject option is the minimizer of the sum of the empirical φ_d-risk and a regularization term that is proportional to the squared RKHS norm. The following theorem shows that this classifier is the solution to a quadratic program, that is, it is the minimizer of a quadratic criterion on a subset of Euclidean space defined by linear inequalities. Thus, the classifier can be found efficiently using general-purpose algorithms.

Theorem 4. For any x_1, ..., x_n ∈ X and y_1, ..., y_n ∈ {−1, 1}, let f̂ ∈ H be the minimizer of the regularized risk functional

(1/n) Σ_{i=1}^n φ_d(y_i f(x_i)) + λ‖f‖²,

where λ > 0. Then we can represent f̂ as the finite sum f̂(x) = Σ_{i=1}^n α_i* k(x_i, x), where (α_1*, ..., α_n*) is the solution to the following quadratic program:

min_{α_i, ξ_i, γ_i}  (1/n) Σ_{i=1}^n (ξ_i + (a − 1) γ_i) + λ Σ_{i,j} α_i α_j k(x_i, x_j)

s.t.  ξ_i ≥ 0,  γ_i ≥ 0,
      ξ_i ≥ 1 − y_i Σ_{j=1}^n α_j k(x_i, x_j),
      γ_i ≥ −y_i Σ_{j=1}^n α_j k(x_i, x_j),   for i = 1, ..., n,

where a = (1 − d)/d.

Proof. The fact that f̂ can be represented as a finite sum over the kernel basis functions is a standard argument; see [8, 6]. It follows from Pythagoras' theorem: the squared RKHS norm can be split into the squared norm of the component in the space spanned by the kernel basis functions x ↦ k(x_i, x) and that of the component in the orthogonal subspace. Since the cost
function depends on f only at the points x_i, and the reproducing property f(x_i) = ⟨k(x_i, ·), f⟩ shows that these values depend only on the component of f in the space spanned by the kernel basis functions, the orthogonal subspace contributes only to the squared norm term and not to the cost term. Thus, a minimizing f̂ can be represented in terms of the solution α to the minimization

min_{α_1,...,α_n}  (1/n) Σ_{i=1}^n φ_d(y_i Σ_{j=1}^n α_j k(x_i, x_j)) + λ Σ_{i,j} α_i α_j k(x_i, x_j).

But then it is easy to see that we can decompose φ_d as φ_d(β) = max{0, 1 − β} + (a − 1) max{0, −β}. Defining ξ_i = max{0, 1 − y_i f(x_i)} and γ_i = max{0, −y_i f(x_i)} gives the QP. □

4. Tsybakov's noise condition, Bernstein classes, and fast rates

In this section, we consider methods that choose the function f̂ from some class F so as to minimize the empirical φ_d-risk,

L̂_φ(f) := (1/n) Σ_{i=1}^n φ_d(Y_i f(X_i)).

For instance, to analyze the SVM classifier with reject option, we could consider classes F_n = {f ∈ H : ‖f‖ ≤ c_n} for some sequence of constants c_n. We are interested in bounds on the excess φ_d-risk, that is, the difference between the φ_d-risk of f̂ and the minimal φ_d-risk over all measurable functions, of the form

E L_φ(f̂) − L_φ* ≤ 2 (inf_{f∈F} L_φ(f) − L_φ*) + ε_n.

Such bounds can be combined with an assumption on the rate of decrease of the approximation error inf_{f∈F_n} L_φ(f) − L_φ* for a sequence of classes F_n used by a method of sieves, and thus provide bounds on the rate of convergence of risk to Bayes risk. For many binary classification methods (including empirical risk minimization, plug-in estimates, and minimization of the sample average of a suitable convex loss), the estimation error term ε_n approaches zero at a faster rate when the conditional probability η(X) is unlikely to be close to the critical value of 1/2; see [13, 2, 5, 12, 1]. For plug-in rules, [7] showed an analogous result for classification with a reject option, where the corresponding condition concerns the probability that η(X) is close to the critical values of d and 1 − d. In this
section, we prove a bound on the excess φ_d-risk of f̂ that converges rapidly when a condition of this kind applies. We begin with a precise statement of the condition. For d = 1/2, it is equivalent to Tsybakov's margin condition [13].

Definition 5. We say that η satisfies the margin condition at d with exponent α ≥ 0 if there is an A ≥ 1 such that for all t > 0,

P{|η(X) − d| ≤ t} ≤ A t^α   and   P{|η(X) − (1 − d)| ≤ t} ≤ A t^α.

The reason that conditions of this kind allow fast rates is related to the variance of the excess φ_d-loss,

g_f(x, y) := φ_d(y f(x)) − φ_d(y f*(x)),

where f* minimizes the φ_d-risk. Notice that the expectation of g_f is precisely the excess risk of f, E g_f(X, Y) = L_φ(f) − L_φ*. We will show that when η satisfies the margin condition at d with exponent α, the variance of each g_f is bounded in terms of its expectation, and thus approaches zero as the φ_d-risk of f approaches the minimal value. Classes for which this occurs are called Bernstein classes.

Definition 6. We say that G ⊆ L_2(P) is a (β, B)-Bernstein class with respect to the probability measure P (0 < β ≤ 1, B ≥ 1) if every g ∈ G satisfies Pg² ≤ B(Pg)^β. We say that G has a Bernstein exponent β with respect to P if there exists a constant B for which G is a (β, B)-Bernstein class.

Lemma 7. If η satisfies the margin condition at d with exponent α, then for any class F of measurable, uniformly bounded functions, the class G = {g_f : f ∈ F} has a Bernstein exponent β = α/(1 + α).

The result relies on the following two lemmas. The first shows that the excess φ_d-risk is at least linear in a certain pseudo-norm of the difference between f and f*. It is similar to the L_1(P) norm, but it penalizes f less for large excursions that have little impact on the φ_d-risk. For example, if η(x) = 1, then the conditional φ_d-risk is zero even if f(x) takes a large positive value. For η ∈ [0, 1], define

ρ_η(f, f*) := η|f − f*| if η < d and f < −1;  (1 − η)|f − f*| if η > 1 − d and f > 1;  |f − f*| otherwise,
and recall the definition of the conditional φ_d-risk in (10).

Lemma 8. For η ∈ [0, 1],

r_{η,φ}(f) − r_{η,φ}(f*) ≥ min{|η − d|, |η − (1 − d)|} ρ_η(f, f*).

Proof. Since r_{η,φ} is convex, r_{η,φ}(f) ≥ r_{η,φ}(f*) + g(f − f*) for any g in the subgradient of r_{η,φ} at f*. In our case, r_{η,φ} is piecewise linear, with four pieces, whose slopes are

−aη on (−∞, −1],  (d − η)/d on [−1, 0],  (1 − d − η)/d on [0, 1],  a(1 − η) on [1, ∞).

Thus, we have

r_{η,φ}(f) − r_{η,φ}(f*) ≥ aη|f − f*|                         if η < d and f < −1,
                          ((d − η)/d)|f − f*|                 if η < d and f > −1,
                          (min{η − d, 1 − d − η}/d)|f − f*|   if d ≤ η ≤ 1 − d,
                          ((η − (1 − d))/d)|f − f*|           if η > 1 − d and f < 1,
                          a(1 − η)|f − f*|                    if η > 1 − d and f > 1

                        = ((1 − d)/d) ρ_η(f, f*)              if η < d and f < −1,
                          ((d − η)/d) ρ_η(f, f*)              if η < d and f > −1,
                          (min{η − d, 1 − d − η}/d) ρ_η(f, f*) if d ≤ η ≤ 1 − d,
                          ((η − (1 − d))/d) ρ_η(f, f*)        if η > 1 − d and f < 1,
                          ((1 − d)/d) ρ_η(f, f*)              if η > 1 − d and f > 1

                        ≥ min{|η − d|, |η − (1 − d)|} ρ_η(f, f*). □

We shall also use the following inequalities.

Lemma 9. For η ∈ [0, 1], ρ_η(f, f*) ≤ |f − f*|, and, if ‖f‖_∞ ≤ B,

η (φ_d(f) − φ_d(f*))² + (1 − η)(φ_d(−f) − φ_d(−f*))² ≤ a²(B + 1) ρ_η(f, f*).
Proof. The first inequality is immediate from the definition of ρ_η. To see the second, use the fact that φ_d is flat to the right of 1 to notice that

η (φ_d(f) − φ_d(f*))² + (1 − η)(φ_d(−f) − φ_d(−f*))²
  = η (φ_d(f) − φ_d(f*))²          if η < d and f < −1,
  = (1 − η)(φ_d(−f) − φ_d(−f*))²   if η > 1 − d and f > 1.

Since φ_d has Lipschitz constant a = (1 − d)/d, this implies

η (φ_d(f) − φ_d(f*))² + (1 − η)(φ_d(−f) − φ_d(−f*))²
  ≤ η a²(f − f*)²        if η < d and f < −1,
    (1 − η) a²(f − f*)²  if η > 1 − d and f > 1,
    a²(f − f*)²          otherwise
  ≤ a²(B + 1) ρ_η(f, f*). □

Proof of Lemma 7. By Lemma 8, we have

L_φ(f) − L_φ* ≥ P[ρ_η(f, f*)(|η − (1 − d)| I_{E_−} + |η − d| I_{E_+})],

with E_− = {|η − (1 − d)| ≤ |η − d|} and E_+ = {|η − (1 − d)| > |η − d|}. Using the assumption on η, there is an A ≥ 1 such that for all t > 0, P{|η(X) − d| ≤ t} ≤ A t^α and P{|η(X) − (1 − d)| ≤ t} ≤ A t^α. Thus, for any set E,

P[ρ_η(f, f*)|η − d| I_E] ≥ t P[ρ_η(f, f*) I{|η − d| ≥ t} I_E]
  = t P[ρ_η(f, f*) I_E] − t P[ρ_η(f, f*) I{|η − d| < t} I_E]
  ≥ t {P[ρ_η(f, f*) I_E] − (B + 1) A t^α},

where B is such that ‖f‖_∞ ≤ B, so that ρ_η(f, f*) ≤ |f − f*| ≤ B + 1. Similarly,

P[ρ_η(f, f*)|η − (1 − d)| I_E] ≥ t {P[ρ_η(f, f*) I_E] − (B + 1) A t^α},

and we obtain

L_φ(f) − L_φ* ≥ t (P[ρ_η(f, f*) I_{E_− ∪ E_+}] − 2(B + 1) A t^α) = t (P ρ_η(f, f*) − 2(B + 1) A t^α).

Choose

t = (P ρ_η(f, f*) / (4(B + 1)A))^{1/α}

in the expression above, and we obtain

(15)  E g_f(X, Y) = L_φ(f) − L_φ* ≥ (1/2)(4(B + 1)A)^{−1/α} (P ρ_η(f, f*))^{(1+α)/α},

and so

P ρ_η(f, f*) ≤ {2(4(B + 1)A)^{1/α}}^{α/(α+1)} {E g_f(X, Y)}^{α/(1+α)}.

In addition, by Lemma 9,

E{g_f(X, Y)}² = E E[{g_f(X, Y)}² | X]
  = P[η (φ_d(f) − φ_d(f*))² + (1 − η)(φ_d(−f) − φ_d(−f*))²]
  ≤ a²(B + 1) P ρ_η(f, f*).

Combining these two inequalities shows that

E{g_f(X, Y)}² ≤ a²(B + 1) {2(4(B + 1)A)^{1/α}}^{α/(α+1)} (E g_f(X, Y))^{α/(1+α)}. □
Remark. Specialized to the case (δ, d) = (0, 1/2), we note that Lemma 7 removes unnecessary technical restrictions on η(X) near 0 and 1 imposed in [5] and [12].

Lemma 7 provides the main ingredient for establishing fast rates of minimizers f̂ of the empirical risk

L̂_φ(f) := (1/n) Σ_{i=1}^n φ_d(Y_i f(X_i)).

Theorem 10. If η satisfies the margin condition at d with exponent α, F is a countable class of functions f : X → R satisfying ‖f‖_∞ ≤ B, and F satisfies

log N(ε, L_∞, F) ≤ C ε^{−p}

for all ε > 0 and some 0 ≤ p < 2, then there exists a constant C′, independent of n, such that

E L_φ(f̂) − L_φ* ≤ 2 (inf_{f∈F} L_φ(f) − L_φ*) + C′ n^{−(1+α)/(2+p+α+pα)},

where f̂ = argmin_{f∈F} L̂_φ(f).
Proof. We use the notation Pg_f = E g_f(X, Y) and P_n g_f = (1/n) Σ_{i=1}^n g_f(X_i, Y_i). By definition of f̂, we have

L_φ(f̂) − L_φ* = Pg_{f̂} = 2P_n g_{f̂} + (P − 2P_n)g_{f̂} ≤ 2 inf_{f∈F} P_n g_f + sup_{f∈F} (P − 2P_n)g_f.

Taking expected values on both sides yields

E L_φ(f̂) − L_φ* ≤ 2 (inf_{f∈F} L_φ(f) − L_φ*) + E[sup_{f∈F} (P − 2P_n)g_f].

Since |g_f − g_{f′}| can be bounded in terms of ‖f − f′‖_∞, it follows that

E[sup_{f∈F} (P − 2P_n)g_f] ≤ 4ε_n + 2B P{sup_{f∈F_n} (P − 2P_n)g_f ≥ ε_n},

where F_n is an ε_n-minimal covering net of F with

ε_n = A n^{−(1+α)/(2+p+α+pα)}.

The union bound and Bernstein's exponential inequality for the tail probability of sums of bounded random variables, in conjunction with Lemma 7, yield (noting that (P − 2P_n)g_f ≥ ε_n is equivalent to (P − P_n)g_f ≥ (ε_n + Pg_f)/2)

P{sup_{f∈F_n} (P − 2P_n)g_f ≥ ε_n} ≤ |F_n| max_{f∈F_n} exp(−(n/8) (ε_n + Pg_f)² / (Pg_f² + B(ε_n + Pg_f)/6))
  ≤ exp(C ε_n^{−p} − c n ε_n^{2−β})

with 0 ≤ β = α/(1 + α) < 1 and some c > 0 independent of n. Conclude the proof by noting that

exp(C ε_n^{−p} − c n ε_n^{2−β}) ≤ exp(−(c/2) n ε_n^{2−β}),

by choosing the constant A in ε_n such that C ε_n^{−p} ≤ (c/2) n ε_n^{2−β}, and that exp(−(c/2) n ε_n^{2−β}) = o(ε_n).

Remark. Consider for simplicity the case where F is finite (p = 0). Then, if the margin condition holds for α = +∞, we obtain rates of convergence log|F|/n. If α = 0, we in fact impose no restriction on η(X) at all, and the rate equals (log|F|/n)^{1/2}.
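The rate in Theorem 10 is driven by the exponent (1 + α)/(2 + p + α + pα); the following tiny sketch (ours) tabulates it for a few values:

```python
# Sketch: the estimation-error exponent of Theorem 10, i.e. epsilon_n of
# order n^{-(1+alpha)/(2+p+alpha+p*alpha)}, evaluated for some (p, alpha).
def rate_exponent(p, alpha):
    return (1 + alpha) / (2 + p + alpha + p * alpha)

assert rate_exponent(0, 0) == 0.5                 # no margin condition: n^{-1/2}
assert abs(rate_exponent(0, 1) - 2 / 3) < 1e-12
assert abs(rate_exponent(0, 1e9) - 1.0) < 1e-6    # alpha -> infinity: near n^{-1}
```

For a finite class (p = 0) the exponent interpolates between 1/2 (α = 0, no assumption on η) and 1 (α = +∞), matching the remark above.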
The constant 2 in front of the minimal excess risk on the right could be made closer to 1, at the expense of increasing C′.

References

1. J.-Y. Audibert and A. B. Tsybakov. Fast convergence rates for plug-in classifiers under margin conditions. Annals of Statistics. To appear.
2. P. L. Bartlett, M. I. Jordan and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138-156, 2006.
3. S. Boucheron, O. Bousquet and G. Lugosi. Introduction to statistical learning theory. In Advanced Lectures in Machine Learning (O. Bousquet, U. von Luxburg, and G. Rätsch, editors), Springer, New York, 2004.
4. S. Boucheron, O. Bousquet and G. Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.
5. G. Blanchard, O. Bousquet and P. Massart. Statistical performance of support vector machines. Manuscript, blanchard/publi.
6. D. Cox and F. O'Sullivan. Asymptotic analysis of penalized likelihood and related estimators. Annals of Statistics, 18:1676-1695, 1990.
7. R. Herbei and M. H. Wegkamp. Classification with reject option. Canadian Journal of Statistics. To appear.
8. G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82-95, 1971.
9. P. Massart. St. Flour Lecture Notes. Manuscript, massart/stf2003_massart.pdf.
10. B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, 1996.
11. I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Los Alamos National Laboratory Technical Report LA-UR.
12. B. Tarigan and S. A. van de Geer. Adaptivity of support vector machines with l_1 penalty. Technical Report MI, University of Leiden.
13. A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135-166, 2004.
14. T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32:56-85, 2004.

Computer Science Division and Department of Statistics, University of California, Berkeley
E-mail address: bartlett@cs.berkeley.edu

Department of Statistics, Florida State University, Tallahassee
E-mail address: wegkamp@stat.fsu.edu
MAE 494/598, Fall 2016 Project #1 (Regular tasks = 20 points) Har copy of report is ue at the start of class on the ue ate. The rules on collaboration will be release separately. Please always follow the
More informationJUST THE MATHS UNIT NUMBER DIFFERENTIATION 2 (Rates of change) A.J.Hobson
JUST THE MATHS UNIT NUMBER 10.2 DIFFERENTIATION 2 (Rates of change) by A.J.Hobson 10.2.1 Introuction 10.2.2 Average rates of change 10.2.3 Instantaneous rates of change 10.2.4 Derivatives 10.2.5 Exercises
More informationAdaptive Sampling Under Low Noise Conditions 1
Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università
More informationLecture 2: Correlated Topic Model
Probabilistic Moels for Unsupervise Learning Spring 203 Lecture 2: Correlate Topic Moel Inference for Correlate Topic Moel Yuan Yuan First of all, let us make some claims about the parameters an variables
More information1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7.
Lectures Nine an Ten The WKB Approximation The WKB metho is a powerful tool to obtain solutions for many physical problems It is generally applicable to problems of wave propagation in which the frequency
More informationAll s Well That Ends Well: Supplementary Proofs
All s Well That Ens Well: Guarantee Resolution of Simultaneous Rigi Boy Impact 1:1 All s Well That Ens Well: Supplementary Proofs This ocument complements the paper All s Well That Ens Well: Guarantee
More informationInfluence of weight initialization on multilayer perceptron performance
Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -
More informationLATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION
The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische
More informationLecture 10 February 23
EECS 281B / STAT 241B: Advanced Topics in Statistical LearningSpring 2009 Lecture 10 February 23 Lecturer: Martin Wainwright Scribe: Dave Golland Note: These lecture notes are still rough, and have only
More informationThe Exact Form and General Integrating Factors
7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily
More informationCalculus of Variations
16.323 Lecture 5 Calculus of Variations Calculus of Variations Most books cover this material well, but Kirk Chapter 4 oes a particularly nice job. x(t) x* x*+ αδx (1) x*- αδx (1) αδx (1) αδx (1) t f t
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More information1 Heisenberg Representation
1 Heisenberg Representation What we have been ealing with so far is calle the Schröinger representation. In this representation, operators are constants an all the time epenence is carrie by the states.
More informationSurvival exponents for fractional Brownian motion with multivariate time
Survival exponents for fractional Brownian motion with multivariate time G Molchan Institute of Earthquae Preiction heory an Mathematical Geophysics Russian Acaemy of Science 84/3 Profsoyuznaya st 7997
More informationSeparation of Variables
Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical
More informationAn Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback
Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an
More informationNOTES ON EULER-BOOLE SUMMATION (1) f (l 1) (n) f (l 1) (m) + ( 1)k 1 k! B k (y) f (k) (y) dy,
NOTES ON EULER-BOOLE SUMMATION JONATHAN M BORWEIN, NEIL J CALKIN, AND DANTE MANNA Abstract We stuy a connection between Euler-MacLaurin Summation an Boole Summation suggeste in an AMM note from 196, which
More informationA note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz
A note on asymptotic formulae for one-imensional network flow problems Carlos F. Daganzo an Karen R. Smilowitz (to appear in Annals of Operations Research) Abstract This note evelops asymptotic formulae
More informationAgmon Kolmogorov Inequalities on l 2 (Z d )
Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,
More informationSINGULAR PERTURBATION AND STATIONARY SOLUTIONS OF PARABOLIC EQUATIONS IN GAUSS-SOBOLEV SPACES
Communications on Stochastic Analysis Vol. 2, No. 2 (28) 289-36 Serials Publications www.serialspublications.com SINGULAR PERTURBATION AND STATIONARY SOLUTIONS OF PARABOLIC EQUATIONS IN GAUSS-SOBOLEV SPACES
More informationSparseness Versus Estimating Conditional Probabilities: Some Asymptotic Results
Sparseness Versus Estimating Conditional Probabilities: Some Asymptotic Results Peter L. Bartlett 1 and Ambuj Tewari 2 1 Division of Computer Science and Department of Statistics University of California,
More informationAcute sets in Euclidean spaces
Acute sets in Eucliean spaces Viktor Harangi April, 011 Abstract A finite set H in R is calle an acute set if any angle etermine by three points of H is acute. We examine the maximal carinality α() of
More informationSome Examples. Uniform motion. Poisson processes on the real line
Some Examples Our immeiate goal is to see some examples of Lévy processes, an/or infinitely-ivisible laws on. Uniform motion Choose an fix a nonranom an efine X := for all (1) Then, {X } is a [nonranom]
More informationA Course in Machine Learning
A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.
More informationA LIMIT THEOREM FOR RANDOM FIELDS WITH A SINGULARITY IN THE SPECTRUM
Teor Imov r. ta Matem. Statist. Theor. Probability an Math. Statist. Vip. 81, 1 No. 81, 1, Pages 147 158 S 94-911)816- Article electronically publishe on January, 11 UDC 519.1 A LIMIT THEOREM FOR RANDOM
More informationA new proof of the sharpness of the phase transition for Bernoulli percolation on Z d
A new proof of the sharpness of the phase transition for Bernoulli percolation on Z Hugo Duminil-Copin an Vincent Tassion October 8, 205 Abstract We provie a new proof of the sharpness of the phase transition
More informationEntanglement is not very useful for estimating multiple phases
PHYSICAL REVIEW A 70, 032310 (2004) Entanglement is not very useful for estimating multiple phases Manuel A. Ballester* Department of Mathematics, University of Utrecht, Box 80010, 3508 TA Utrecht, The
More informationSolutions to Math 41 Second Exam November 4, 2010
Solutions to Math 41 Secon Exam November 4, 2010 1. (13 points) Differentiate, using the metho of your choice. (a) p(t) = ln(sec t + tan t) + log 2 (2 + t) (4 points) Using the rule for the erivative of
More informationFLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction
FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number
More informationLogarithmic spurious regressions
Logarithmic spurious regressions Robert M. e Jong Michigan State University February 5, 22 Abstract Spurious regressions, i.e. regressions in which an integrate process is regresse on another integrate
More informationOnline Appendix for Trade Policy under Monopolistic Competition with Firm Selection
Online Appenix for Trae Policy uner Monopolistic Competition with Firm Selection Kyle Bagwell Stanfor University an NBER Seung Hoon Lee Georgia Institute of Technology September 6, 2018 In this Online
More informationarxiv: v4 [cs.ds] 7 Mar 2014
Analysis of Agglomerative Clustering Marcel R. Ackermann Johannes Blömer Daniel Kuntze Christian Sohler arxiv:101.697v [cs.ds] 7 Mar 01 Abstract The iameter k-clustering problem is the problem of partitioning
More informationConsistency of Nearest Neighbor Methods
E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study
More informationTwo formulas for the Euler ϕ-function
Two formulas for the Euler ϕ-function Robert Frieman A multiplication formula for ϕ(n) The first formula we want to prove is the following: Theorem 1. If n 1 an n 2 are relatively prime positive integers,
More informationIntroduction to the Vlasov-Poisson system
Introuction to the Vlasov-Poisson system Simone Calogero 1 The Vlasov equation Consier a particle with mass m > 0. Let x(t) R 3 enote the position of the particle at time t R an v(t) = ẋ(t) = x(t)/t its
More informationLecture 6: Calculus. In Song Kim. September 7, 2011
Lecture 6: Calculus In Song Kim September 7, 20 Introuction to Differential Calculus In our previous lecture we came up with several ways to analyze functions. We saw previously that the slope of a linear
More informationTutorial on Maximum Likelyhood Estimation: Parametric Density Estimation
Tutorial on Maximum Likelyhoo Estimation: Parametric Density Estimation Suhir B Kylasa 03/13/2014 1 Motivation Suppose one wishes to etermine just how biase an unfair coin is. Call the probability of tossing
More informationImplicit Differentiation
Implicit Differentiation Thus far, the functions we have been concerne with have been efine explicitly. A function is efine explicitly if the output is given irectly in terms of the input. For instance,
More informationarxiv: v4 [math.pr] 27 Jul 2016
The Asymptotic Distribution of the Determinant of a Ranom Correlation Matrix arxiv:309768v4 mathpr] 7 Jul 06 AM Hanea a, & GF Nane b a Centre of xcellence for Biosecurity Risk Analysis, University of Melbourne,
More informationChaos, Solitons and Fractals Nonlinear Science, and Nonequilibrium and Complex Phenomena
Chaos, Solitons an Fractals (7 64 73 Contents lists available at ScienceDirect Chaos, Solitons an Fractals onlinear Science, an onequilibrium an Complex Phenomena journal homepage: www.elsevier.com/locate/chaos
More informationA Note on Exact Solutions to Linear Differential Equations by the Matrix Exponential
Avances in Applie Mathematics an Mechanics Av. Appl. Math. Mech. Vol. 1 No. 4 pp. 573-580 DOI: 10.4208/aamm.09-m0946 August 2009 A Note on Exact Solutions to Linear Differential Equations by the Matrix
More informationThe total derivative. Chapter Lagrangian and Eulerian approaches
Chapter 5 The total erivative 51 Lagrangian an Eulerian approaches The representation of a flui through scalar or vector fiels means that each physical quantity uner consieration is escribe as a function
More informationDiscrete Operators in Canonical Domains
Discrete Operators in Canonical Domains VLADIMIR VASILYEV Belgoro National Research University Chair of Differential Equations Stuencheskaya 14/1, 308007 Belgoro RUSSIA vlaimir.b.vasilyev@gmail.com Abstract:
More informationON THE OPTIMAL CONVERGENCE RATE OF UNIVERSAL AND NON-UNIVERSAL ALGORITHMS FOR MULTIVARIATE INTEGRATION AND APPROXIMATION
ON THE OPTIMAL CONVERGENCE RATE OF UNIVERSAL AN NON-UNIVERSAL ALGORITHMS FOR MULTIVARIATE INTEGRATION AN APPROXIMATION MICHAEL GRIEBEL AN HENRYK WOŹNIAKOWSKI Abstract. We stuy the optimal rate of convergence
More informationHITTING TIMES FOR RANDOM WALKS WITH RESTARTS
HITTING TIMES FOR RANDOM WALKS WITH RESTARTS SVANTE JANSON AND YUVAL PERES Abstract. The time it takes a ranom walker in a lattice to reach the origin from another vertex x, has infinite mean. If the walker
More informationA talk on Oracle inequalities and regularization. by Sara van de Geer
A talk on Oracle inequalities and regularization by Sara van de Geer Workshop Regularization in Statistics Banff International Regularization Station September 6-11, 2003 Aim: to compare l 1 and other
More informationWhy Bernstein Polynomials Are Better: Fuzzy-Inspired Justification
Why Bernstein Polynomials Are Better: Fuzzy-Inspire Justification Jaime Nava 1, Olga Kosheleva 2, an Vlaik Kreinovich 3 1,3 Department of Computer Science 2 Department of Teacher Eucation University of
More informationSHARP BOUNDS FOR EXPECTATIONS OF SPACINGS FROM DECREASING DENSITY AND FAILURE RATE FAMILIES
APPLICATIONES MATHEMATICAE 3,4 (24), pp. 369 395 Katarzyna Danielak (Warszawa) Tomasz Rychlik (Toruń) SHARP BOUNDS FOR EXPECTATIONS OF SPACINGS FROM DECREASING DENSITY AND FAILURE RATE FAMILIES Abstract.
More informationThe Generalized Incompressible Navier-Stokes Equations in Besov Spaces
Dynamics of PDE, Vol1, No4, 381-400, 2004 The Generalize Incompressible Navier-Stokes Equations in Besov Spaces Jiahong Wu Communicate by Charles Li, receive July 21, 2004 Abstract This paper is concerne
More informationThe deterministic Lasso
The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality
More informationChapter 6: Energy-Momentum Tensors
49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.
More informationINVERSE PROBLEM OF A HYPERBOLIC EQUATION WITH AN INTEGRAL OVERDETERMINATION CONDITION
Electronic Journal of Differential Equations, Vol. 216 (216), No. 138, pp. 1 7. ISSN: 172-6691. URL: http://eje.math.txstate.eu or http://eje.math.unt.eu INVERSE PROBLEM OF A HYPERBOLIC EQUATION WITH AN
More informationLecture 2 Lagrangian formulation of classical mechanics Mechanics
Lecture Lagrangian formulation of classical mechanics 70.00 Mechanics Principle of stationary action MATH-GA To specify a motion uniquely in classical mechanics, it suffices to give, at some time t 0,
More informationA Weak First Digit Law for a Class of Sequences
International Mathematical Forum, Vol. 11, 2016, no. 15, 67-702 HIKARI Lt, www.m-hikari.com http://x.oi.org/10.1288/imf.2016.6562 A Weak First Digit Law for a Class of Sequences M. A. Nyblom School of
More informationHigh-Dimensional p-norms
High-Dimensional p-norms Gérar Biau an Davi M. Mason Abstract Let X = X 1,...,X be a R -value ranom vector with i.i.. components, an let X p = j=1 X j p 1/p be its p-norm, for p > 0. The impact of letting
More informationLie symmetry and Mei conservation law of continuum system
Chin. Phys. B Vol. 20, No. 2 20 020 Lie symmetry an Mei conservation law of continuum system Shi Shen-Yang an Fu Jing-Li Department of Physics, Zhejiang Sci-Tech University, Hangzhou 3008, China Receive
More informationRobust Low Rank Kernel Embeddings of Multivariate Distributions
Robust Low Rank Kernel Embeings of Multivariate Distributions Le Song, Bo Dai College of Computing, Georgia Institute of Technology lsong@cc.gatech.eu, boai@gatech.eu Abstract Kernel embeing of istributions
More informationPhysics 505 Electricity and Magnetism Fall 2003 Prof. G. Raithel. Problem Set 3. 2 (x x ) 2 + (y y ) 2 + (z + z ) 2
Physics 505 Electricity an Magnetism Fall 003 Prof. G. Raithel Problem Set 3 Problem.7 5 Points a): Green s function: Using cartesian coorinates x = (x, y, z), it is G(x, x ) = 1 (x x ) + (y y ) + (z z
More informationConvergence rates of moment-sum-of-squares hierarchies for optimal control problems
Convergence rates of moment-sum-of-squares hierarchies for optimal control problems Milan Kora 1, Diier Henrion 2,3,4, Colin N. Jones 1 Draft of September 8, 2016 Abstract We stuy the convergence rate
More informationθ x = f ( x,t) could be written as
9. Higher orer PDEs as systems of first-orer PDEs. Hyperbolic systems. For PDEs, as for ODEs, we may reuce the orer by efining new epenent variables. For example, in the case of the wave equation, (1)
More informationarxiv: v1 [math.ap] 6 Jul 2017
Local an global time ecay for parabolic equations with super linear first orer terms arxiv:177.1761v1 [math.ap] 6 Jul 17 Martina Magliocca an Alessio Porretta ABSTRACT. We stuy a class of parabolic equations
More informationParameter estimation: A new approach to weighting a priori information
Parameter estimation: A new approach to weighting a priori information J.L. Mea Department of Mathematics, Boise State University, Boise, ID 83725-555 E-mail: jmea@boisestate.eu Abstract. We propose a
More informationThe effect of dissipation on solutions of the complex KdV equation
Mathematics an Computers in Simulation 69 (25) 589 599 The effect of issipation on solutions of the complex KV equation Jiahong Wu a,, Juan-Ming Yuan a,b a Department of Mathematics, Oklahoma State University,
More informationFURTHER BOUNDS FOR THE ESTIMATION ERROR VARIANCE OF A CONTINUOUS STREAM WITH STATIONARY VARIOGRAM
FURTHER BOUNDS FOR THE ESTIMATION ERROR VARIANCE OF A CONTINUOUS STREAM WITH STATIONARY VARIOGRAM N. S. BARNETT, S. S. DRAGOMIR, AND I. S. GOMM Abstract. In this paper we establish an upper boun for the
More information