Fisher Consistency of Multicategory Support Vector Machines

Yufeng Liu
Department of Statistics and Operations Research
Carolina Center for Genome Sciences
University of North Carolina
Chapel Hill, NC 27599-3260
yliu@email.unc.edu

Abstract

The Support Vector Machine (SVM) has become one of the most popular machine learning techniques in recent years. The success of the SVM is mostly due to its elegant margin concept and theory in binary classification. Generalization to the multicategory setting, however, is not trivial. There are a number of different multicategory extensions of the SVM in the literature. In this paper, we review several commonly used extensions and Fisher consistency of these extensions. For inconsistent extensions, we propose two approaches to make them Fisher consistent: one is to add bounded constraints and the other is to truncate unbounded hinge losses.

1 Background on Binary SVM

The Support Vector Machine (SVM) is a well-known large margin classifier and has achieved great success in many applications (Vapnik, 1998; Cristianini and Shawe-Taylor, 2000; Hastie, Tibshirani, and Friedman, 2000). The basic concept behind the binary SVM is to search for a separating hyperplane with maximum separation between the two classes. Suppose a training dataset containing n training pairs {x_i, y_i}_{i=1}^n, i.i.d. realizations from a probability distribution P(x, y), is given. The goal is to search for a linear function f(x) = w^T x + b so that sign(f(x)) can be used for prediction of labels for new inputs. The SVM aims to find such an f so that points of class +1 and points of class −1 are best separated. In particular, for the separable case, the SVM's solution maximizes the distance between f(x) = ±1 subject to y_i f(x_i) ≥ 1; i = 1, ..., n. This distance can be expressed as 2/||w|| and is known as the geometric margin.

When perfect separation between the two classes is not feasible, slack variables ξ_i ≥ 0; i = 1, ..., n, can be used to measure the amount of violation of the original constraints. Then the SVM solves the following optimization problem:

  min_{w,b}  (1/2)||w||^2 + C Σ_{i=1}^n ξ_i                            (1)
  s.t.  y_i f(x_i) ≥ 1 − ξ_i,  ξ_i ≥ 0,  ∀ i,

where C > 0 is a tuning parameter which balances the separation and the amount of violation of the constraints. The optimization formulation in (1) is also known as the primal problem of the SVM. Using Lagrange multipliers, (1) can be converted into an equivalent dual problem as follows:

  max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j ⟨x_i, x_j⟩    (2)
  s.t.  Σ_{i=1}^n y_i α_i = 0;  0 ≤ α_i ≤ C,  ∀ i.

Once the solution of problem (2) is obtained, w can be calculated as Σ_{i=1}^n y_i α_i x_i, and the intercept b can be computed using the Karush-Kuhn-Tucker (KKT) complementary conditions of optimization theory. If nonlinear learning is needed, one can apply the kernel trick by replacing the inner product ⟨x_i, x_j⟩ by K(x_i, x_j), where the kernel K is a positive definite function. This amounts to applying linear learning in the feature space induced by the kernel K to achieve nonlinear learning in the original input space. It is now known that the SVM can be fit in the regularization framework (Wahba, 1998) as follows:

  min_f  J(f) + C Σ_{i=1}^n [1 − y_i f(x_i)]_+,                        (3)

where the function [1 − u]_+ equals 1 − u if u ≤ 1 and 0 otherwise; it is known as the hinge loss function (see Figure 1). The term J(f) is a regularization term; in the linear learning setting it becomes ||w||^2, which is related to the geometric margin.

Figure 1: Plot of the hinge loss H_1(u) = [1 − u]_+.

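To make the hinge-loss formulation concrete, the short Python sketch below fits a linear binary SVM by (sub)gradient descent on the objective in (3) with J(f) = (1/2)||w||^2, evaluated on simulated two-class data. The data, step size, and number of passes are illustrative assumptions rather than anything prescribed in the paper, and a practical solver would typically work with the dual problem (2) instead.

```python
import numpy as np

def hinge(u):
    """Hinge loss H_1(u) = [1 - u]_+ applied to the functional margin u = y * f(x)."""
    return np.maximum(0.0, 1.0 - u)

def fit_linear_svm(X, y, C=1.0, lr=0.01, epochs=300):
    """Minimize 0.5*||w||^2 + C * sum_i [1 - y_i f(x_i)]_+ with f(x) = w'x + b
    by plain (sub)gradient descent; a rough illustration of (3), not a dual/QP solver."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1                   # points whose hinge term is nonzero
        grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

# toy two-class data with labels in {-1, +1}
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.r_[-np.ones(50), np.ones(50)]
w, b = fit_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```
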

Denote P(x) = P(Y = 1 | X = x). Then a binary classifier with loss V(f(x), y) is Fisher consistent if the minimizer of E[V(f(X), Y) | X = x] has the same sign as P(x) − 1/2. Clearly, Fisher consistency requires that the loss function asymptotically yields the Bayes decision boundary. Fisher consistency is also known as classification-calibration (Bartlett et al., 2006). For the SVM, Lin (2002) shows that the minimizer of E[[1 − Y f(X)]_+ | X = x] is sign(P(x) − 1/2), and consequently the hinge loss of the SVM is Fisher consistent. Interestingly, the SVM only targets the classification set {x : P(x) ≥ 1/2} without estimating P(x) itself.

Extending the binary SVM to the multicategory SVM is not trivial. In this paper, we first review four common extensions in Section 2. Fisher consistency of the different extensions is discussed in Section 3. Among the four extensions, three are not always Fisher consistent. In Sections 4 and 5, we propose to use bounded constraints and truncation to derive Fisher consistent analogs of the losses discussed in Section 2. Some discussions are given in Section 6.

2 Multicategory SVM

The standard SVM only solves binary problems. However, one very often encounters multicategory problems in practice. To solve a multicategory problem using the SVM, one can use two possible approaches. The first approach is to solve the multicategory problem via a sequence of binary problems, e.g., one-versus-rest and one-versus-one. The second approach is to generalize the binary SVM into a simultaneous multicategory formulation. In this paper, we focus on the second approach.

Consider a k-class classification problem with k ≥ 2. When k = 2, the methodology to be discussed here reduces to the binary counterpart in Section 1. Let f = (f_1, f_2, ..., f_k) be the decision function vector, where each component represents one class and maps from S to R. To remove redundant solutions, a sum-to-zero constraint Σ_{j=1}^k f_j = 0 is employed. For any new input vector x, its label is estimated via the decision rule ŷ = argmax_{j=1,...,k} f_j(x). Clearly, the argmax rule is equivalent to the sign function used in the binary case in Section 1.

The extension of the SVM from the binary to the multicategory case is nontrivial. Before we discuss the detailed formulations of multicategory hinge losses, we first discuss Fisher consistency for multicategory problems. Consider y ∈ {1, ..., k} and let P_j(x) = P(Y = j | X = x). Suppose V(f(x), y) is a multicategory loss function. Then, in this context, Fisher consistency requires that argmax_j f_j* = argmax_j P_j, where f*(x) = (f_1*(x), ..., f_k*(x)) denotes the minimizer of E[V(f(X), Y) | X = x]. Fisher consistency is a desirable condition on a loss function, although a consistent loss may not always translate into better classification accuracy (Hsu and Lin, 2002; Rifkin and Klautau, 2004). For simplicity, we only focus on standard learning where all types of misclassification are treated equally. The proposed techniques, however, can be extended to more general settings with unequal losses.

Note that a point (x, y) is misclassified by f if y ≠ argmax_j f_j(x). Thus a sensible loss V should try to force f_y to be the maximum among the k functions. Once V is chosen, the multicategory SVM solves the following problem:

  min_f  Σ_{j=1}^k J(f_j) + C Σ_{i=1}^n V(f(x_i), y_i)                 (4)

subject to Σ_{j=1}^k f_j(x) = 0.

The key to extending the SVM from binary to multicategory is the choice of the loss V. There are a number of extensions of the binary hinge loss to the multicategory case proposed in the literature. We consider the following four commonly used extensions; their Fisher consistency or inconsistency will be discussed.

For inconsistent losses, we propose two methods to make them Fisher consistent.

(a) (Naive hinge loss; cf. Zou et al., 2006): [1 − f_y(x)]_+;

(b) (Lee et al., 2004): Σ_{j≠y} [1 + f_j(x)]_+;

(c) (Vapnik, 1998; Weston and Watkins, 1999; Bredensteiner and Bennett, 1999; Guermeur, 2002): Σ_{j≠y} [1 − (f_y(x) − f_j(x))]_+;

(d) (Crammer and Singer, 2001; Liu and Shen, 2006): [1 − min_{j≠y} (f_y(x) − f_j(x))]_+.

Note that the constant 1 in these losses can be changed to a general positive value. However, the resulting losses will be equivalent to the current ones after re-scaling.
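For concreteness, the following Python sketch evaluates the four candidate losses for a single decision vector f and observed label y. The zero-based class indices and the example vector (chosen to satisfy the sum-to-zero constraint) are assumptions made for illustration only.

```python
def loss_a(f, y):
    """Naive hinge loss (a): [1 - f_y]_+."""
    return max(0.0, 1.0 - f[y])

def loss_b(f, y):
    """Lee et al. (2004) loss (b): sum_{j != y} [1 + f_j]_+."""
    return sum(max(0.0, 1.0 + f[j]) for j in range(len(f)) if j != y)

def loss_c(f, y):
    """Vapnik / Weston-Watkins type loss (c): sum_{j != y} [1 - (f_y - f_j)]_+."""
    return sum(max(0.0, 1.0 - (f[y] - f[j])) for j in range(len(f)) if j != y)

def loss_d(f, y):
    """Crammer-Singer type loss (d): [1 - min_{j != y} (f_y - f_j)]_+."""
    g = min(f[y] - f[j] for j in range(len(f)) if j != y)
    return max(0.0, 1.0 - g)

f = [0.8, 0.1, -0.9]          # sum-to-zero decision vector, classes indexed 0, 1, 2
print([loss(f, 0) for loss in (loss_a, loss_b, loss_c, loss_d)])
```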

As a remark, we note that the sum-to-zero constraint is essential for Losses (a) and (b), but not for Losses (c) and (d). It is easy to see that all of these losses try to encourage f_y to be the maximum among the k functions, either explicitly or implicitly. In the next section, we explore Fisher consistency of these four multicategory hinge loss functions.

3 Fisher Consistency of Multicategory Hinge Losses

In this section, we discuss Fisher consistency of all four losses. We would like to clarify here that some of the Fisher consistency results are already available in the literature. There are previous studies on Fisher consistency of multicategory SVMs, such as Zhang (2004), Lee et al. (2004), Tewari and Bartlett (2005), and Zou et al. (2006). The goal of this paper is to first review Fisher consistency or inconsistency of these losses and then explore ways, motivated from the discussions in this section, to modify inconsistent extensions to be consistent.

Inconsistency of Loss (a):

Lemma 1. The minimizer f* of E[[1 − f_Y(X)]_+ | X = x] subject to Σ_j f_j(x) = 0 satisfies the following: f_j*(x) = −(k − 1) if j = argmin_j P_j(x), and 1 otherwise.

Proof: E[[1 − f_Y(X)]_+] = E[Σ_{l=1}^k [1 − f_l(X)]_+ P_l(X)]. For any fixed X = x, our goal is to minimize Σ_{l=1}^k [1 − f_l(x)]_+ P_l(x). We first show that the minimizer f* satisfies f_j* ≤ 1 for j = 1, ..., k. To show this, suppose a solution f̃ has f̃_j > 1. Then we can construct another solution f with f_j = 1 and f_l = f̃_l + A for l ≠ j, where A = (f̃_j − 1)/(k − 1) > 0. Then Σ_l f_l = 0 and f_l > f̃_l for l ≠ j. Consequently, Σ_l [1 − f_l]_+ P_l < Σ_l [1 − f̃_l]_+ P_l. This implies that f̃ cannot be the minimizer. Therefore, the minimizer f* satisfies f_j* ≤ 1 for all j. Using this property of f*, we only need to consider f with f_j ≤ 1 for all j. Thus, Σ_{l=1}^k [1 − f_l(x)]_+ P_l(x) = Σ_l (1 − f_l(x)) P_l(x) = 1 − Σ_l f_l(x) P_l(x). Then the problem reduces to

  max_f  Σ_l P_l(x) f_l(x)                                             (5)
  subject to  Σ_l f_l(x) = 0;  f_l(x) ≤ 1, ∀ l.                        (6)

It is easy to see that the solution satisfies f_j(x) = −(k − 1) if j = argmin_j P_j(x), and 1 otherwise.

From Lemma 1, we can see that Loss (a) is not Fisher consistent since, except for the smallest element, all remaining elements of its minimizer are 1. Consequently, the argmax rule cannot be uniquely determined, and thus the loss is not Fisher consistent when k ≥ 3, no matter how the P_j's are distributed.

Consistency of Loss (b):

Lemma 2. The minimizer f* of E[Σ_{j≠Y} [1 + f_j(X)]_+ | X = x] subject to Σ_j f_j(x) = 0 satisfies the following: f_j*(x) = k − 1 if j = argmax_j P_j(x), and −1 otherwise.

Proof: Note that E[Σ_{j≠Y} [1 + f_j(X)]_+] = E[E(Σ_{j≠Y} [1 + f_j(X)]_+ | X = x)]. Thus, it is sufficient to consider the minimizer for a given x, and E(Σ_{j≠Y} [1 + f_j(x)]_+ | X = x) = Σ_l Σ_{j≠l} [1 + f_j(x)]_+ P_l(x). Next, we show that the minimizer f* satisfies f_j* ≥ −1 for j = 1, ..., k. To show this, suppose a solution f̃ has f̃_j < −1. Then we can construct another solution f with f_j = −1 and f_l = f̃_l − A for l ≠ j, where A = (−1 − f̃_j)/(k − 1) > 0. Then Σ_l f_l = 0 and f_l < f̃_l for l ≠ j. Consequently, Σ_l Σ_{j≠l} [1 + f_j]_+ P_l < Σ_l Σ_{j≠l} [1 + f̃_j]_+ P_l. This implies that f̃ cannot be the minimizer. Therefore, the minimizer f* satisfies f_j* ≥ −1 for all j. Using this property of f*, we only need to consider f with f_j ≥ −1 for all j. Thus, Σ_l Σ_{j≠l} [1 + f_j]_+ P_l = Σ_l P_l Σ_{j≠l} (1 + f_j) = Σ_l P_l (k − 1 + Σ_{j≠l} f_j) = Σ_l P_l (k − 1 − f_l) = k − 1 − Σ_l P_l f_l. Consequently, minimizing Σ_l Σ_{j≠l} [1 + f_j]_+ P_l is equivalent to maximizing Σ_l P_l f_l. Then the problem reduces to

  max_f  Σ_l P_l(x) f_l(x)                                             (7)
  subject to  Σ_l f_l(x) = 0;  f_l(x) ≥ −1, ∀ l.                       (8)

It is easy to see that the solution satisfies f_j(x) = k − 1 if j = argmax_j P_j(x), and −1 otherwise.

Lemma 2 implies that Loss (b) yields the Bayes classification boundary asymptotically; consequently, it is a Fisher consistent loss. A similar result was also established by Lee et al. (2004).

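As an informal numerical check of Lemmas 1 and 2 (not part of the paper), the sketch below evaluates the conditional risks of Losses (a) and (b) at the closed-form minimizers stated above and compares them with many random sum-to-zero vectors; the probability vector P is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.2, 0.3, 0.5])          # hypothetical conditional class probabilities, k = 3
k = len(P)

def risk_a(f):                          # E[Loss (a) | x] = sum_l P_l [1 - f_l]_+
    return float(np.sum(P * np.maximum(0.0, 1.0 - f)))

def risk_b(f):                          # E[Loss (b) | x] = sum_l P_l sum_{j != l} [1 + f_j]_+
    h = np.maximum(0.0, 1.0 + f)
    return float(np.sum(P * (h.sum() - h)))

f_a = np.where(np.arange(k) == P.argmin(), -(k - 1.0), 1.0)   # Lemma 1: -(k-1) at argmin_j P_j, 1 elsewhere
f_b = np.where(np.arange(k) == P.argmax(), k - 1.0, -1.0)     # Lemma 2: k-1 at argmax_j P_j, -1 elsewhere

cand = rng.normal(size=(20000, k)) * 2
cand -= cand.mean(axis=1, keepdims=True)                      # impose the sum-to-zero constraint
print(risk_a(f_a) <= min(risk_a(f) for f in cand) + 1e-9)     # expected: True
print(risk_b(f_b) <= min(risk_b(f) for f in cand) + 1e-9)     # expected: True
print(f_a, f_b)   # every entry of f_a except one equals 1, so its argmax is not unique
```
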
Inconsistency of Loss (c):

Lemma 3. Consider k = 3 with 1/2 > P_1 > P_2 > P_3. Then the minimizer(s) f* of E[Σ_{j≠Y} [1 − (f_Y(X) − f_j(X))]_+ | X = x] is (are) as follows:

P_2 = 1/3: any f* = (f_1*, f_2*, f_3*) satisfying f_1* ≥ f_2* ≥ f_3* and f_1* − f_3* = 1 is a minimizer.

P_2 > 1/3: f* = (f_1*, f_2*, f_3*) satisfying f_1* ≥ f_2* ≥ f_3*, f_1* = f_2*, and f_2* − f_3* = 1.

P_2 < 1/3: f* = (f_1*, f_2*, f_3*) satisfying f_1* ≥ f_2* ≥ f_3*, f_2* = f_3*, and f_1* − f_2* = 1.

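Before turning to the proof, a small numerical illustration (added here, with probabilities taken from the remark that follows the proof) shows the non-uniqueness in the boundary case P_2 = 1/3: the two vectors (1/2, 0, −1/2) and (1/3, 1/3, −2/3) attain the same conditional risk under Loss (c), and random sum-to-zero candidates do no better.

```python
import numpy as np

P = np.array([5/12, 1/3, 1/4])                   # probabilities from the remark after the proof
def risk_c(f):                                   # E[Loss (c) | x] = sum_l P_l sum_{j != l} [1 - (f_l - f_j)]_+
    return sum(P[l] * sum(max(0.0, 1.0 - (f[l] - f[j])) for j in range(3) if j != l)
               for l in range(3))

cands = [np.array([0.5, 0.0, -0.5]), np.array([1/3, 1/3, -2/3])]
print([round(risk_c(f), 4) for f in cands])      # the two claimed minimizers give equal risk

rng = np.random.default_rng(0)
random_f = rng.normal(size=(20000, 3)) * 2
random_f -= random_f.mean(axis=1, keepdims=True)                     # sum-to-zero candidates
print(min(risk_c(f) for f in random_f) >= risk_c(cands[0]) - 1e-9)   # expected: True
```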

Proof: Note that E[Σ_{j≠Y} [1 − (f_Y(X) − f_j(X))]_+ | X = x] can be rewritten as Σ_l P_l Σ_{j≠l} [1 − (f_l(x) − f_j(x))]_+. By the non-increasing property of [1 − u]_+, the minimizer must satisfy f_1 ≥ f_2 ≥ ... ≥ f_k if P_1 ≥ P_2 ≥ ... ≥ P_k. Next we focus on k = 3. Consider f_1 ≥ f_2 ≥ f_3 with f_1 − f_2 = A ≥ 0 and f_2 − f_3 = B ≥ 0. Then the objective function becomes

  L(A, B) = P_1 ([1 − A]_+ + [1 − (A + B)]_+)                          (9)
          + P_2 (1 + A + [1 − B]_+)                                    (10)
          + P_3 (1 + A + B + 1 + B).                                   (11)

Before proving the lemma, we first need to prove that the minimizer must satisfy A + B ≤ 1. To this end, we prove it in two steps: (1) A ≤ 1, B ≤ 1; (2) A + B ≤ 1. To prove (1), suppose that the minimizer satisfies A > 1. Using contradiction, we can consider another solution with A' = A − ε > 1, where ε > 0, and B' = B. It is easy to see that L(A − ε, B) < L(A, B). Thus, the minimizer must have A ≤ 1. Similarly, we can show B ≤ 1. To prove (2), we know A ≤ 1 and B ≤ 1 by property (1). Suppose the minimizer has A + B > 1. Using contradiction, consider another solution with A' + B' = A + B − ε > 1, where A' = A − ε and B' = B. Then, by the fact that P_1 < P_2 + P_3, we have

  L(A', B') − L(A, B) = P_1 ε − P_2 ε − P_3 ε < 0.                     (12)

This implies A + B ≤ 1. Using properties (1) and (2), minimizing L(A, B) can be reduced as follows:

  min_{A,B}  (−2P_1 + P_2 + P_3) A + (−P_1 − P_2 + 2P_3) B             (13)
  s.t.  A + B ≤ 1,  A ≥ 0,  B ≥ 0.

If P_2 = 1/3, then (−2P_1 + P_2 + P_3) = (−P_1 − P_2 + 2P_3) < 0. Therefore, the solution of (13) satisfies A + B = 1, implying f_1 − f_3 = 1. If P_2 > 1/3, then 0 > (−2P_1 + P_2 + P_3) > (−P_1 − P_2 + 2P_3), and consequently the solution satisfies A = 0 and B = 1, i.e., f_1 = f_2 and f_2 − f_3 = 1. If P_2 < 1/3, then 0 > (−P_1 − P_2 + 2P_3) > (−2P_1 + P_2 + P_3), and consequently the solution satisfies A = 1 and B = 0, i.e., f_1 − f_2 = 1 and f_2 = f_3. The desired result then follows.

Lemma 3 tells us that Loss (c) may be Fisher inconsistent. In fact, in the case of k = 3 with 1/2 > P_1 > P_2 > P_3, Loss (c) is Fisher consistent only when P_2 < 1/3.

Remark: Lee et al. (2004) considered a special case with P_2 > 1/3. Our Lemma 3 here is more general. Interestingly, for the case of P_2 = 1/3, there are infinitely many minimizers although the loss is convex. For example, if (P_1, P_2, P_3) = (5/12, 1/3, 1/4), both (1/2, 0, −1/2) and (1/3, 1/3, −2/3) are minimizers. In general, one cannot claim uniqueness of the minimizer unless the objective function to be minimized is strictly convex.

Inconsistency of Loss (d):

Denote g(f(x), y) = min{f_y(x) − f_j(x); j ≠ y} and H_1(u) = [1 − u]_+. Then Loss (d) can be written as H_1(g(f(x), y)).

Lemma 4. The minimizer f* of E[H_1(g(f(X), Y)) | X = x] has the following properties: (1) if max_j P_j > 1/2, then argmax_j f_j* = argmax_j P_j and g(f*(x), argmax_j f_j*) = 1; (2) if max_j P_j < 1/2, then f* = 0.

Proof: Note that E[H_1(g(f(X), Y))] can be written as

  E[E(H_1(g(f(X), Y)) | X = x)] = E[Σ_{j=1}^k P_j(X) H_1(g(f(X), j))] = E[Σ_{j=1}^k P_j(X) (1 − g(f(X), j))_+].

For any given X = x, let g_j = g(f(x), j); j = 1, ..., k. The problem reduces to minimizing Σ_{j=1}^k P_j (1 − g_j)_+. To prove the lemma, we utilize two properties of the minimizer f*: (1) the minimizer satisfies max_j g_j* ≤ 1; (2) let j_0 = argmax_j g_j*; for l ≠ j_0, g_l* equals −max_j g_j* = −g_{j_0}*. Using property (1), we have

  Σ_{j=1}^k P_j (1 − g_j*)_+ = 1 − Σ_{j=1}^k P_j g_j*.

Therefore, minimizing Σ_{j=1}^k P_j (1 − g_j)_+ is equivalent to maximizing Σ_{j=1}^k P_j g_j. Using property (2), the problem reduces to maximizing (2P_{j_0} − 1) g_{j_0} for g_{j_0} ∈ [0, 1]. Clearly, the minimizer satisfies the following conditions: if P_{j_0} > 1/2, then g_{j_0}* = 1; if P_{j_0} < 1/2, then g_{j_0}* = 0. The desired result of the lemma then follows. We are now left to show the two properties of f*.

Property (1): the minimizer satisfies max_j g_j ≤ 1. To show this property, we note that (1 − g_{j_0})_+ = 0 for g_{j_0} ≥ 1. However, for l ≠ j_0,

  g_l = min_{j≠l} {f_l − f_j} ≤ f_l − f_{j_0} ≤ −min_{j≠j_0} {f_{j_0} − f_j} = −g_{j_0} = −max_j g_j,   (14)

which is less than −1 if max_j g_j > 1. Therefore, f cannot be a minimizer if max_j g_j > 1. Property (1) then follows.

Property (2): for l ≠ j_0, g_l* equals −max_j g_j* = −g_{j_0}*. Since g_l ≤ −g_{j_0} for l ≠ j_0, as shown in (14), to maximize Σ_{j=1}^k P_j g_j, property (2) holds.

Lemma 4 suggests that H_1(g(f(x), y)) is Fisher consistent when max_j P_j > 1/2, i.e., when there is a dominating class. Except for the Bayes decision boundary, this condition always holds for a binary problem. For a problem with k > 2, however, the existence of a dominating class may not be guaranteed. If max_j P_j(x) < 1/2 for a given x, then f*(x) = 0 and consequently the argmax of f*(x) cannot be uniquely determined. The Fisher inconsistency of Loss (d) was also noted by Zhang (2004) and Tewari and Bartlett (2005).

4 A Consistent Hinge Loss with Bounded Constraints

Figure 2: Illustration of the binary SVM classifier using the hinge loss (a), with the lines f(x) = 1, f(x) = 0, and f(x) = −1.

In Section 3, we show that Losses (a), (c), and (d) may not always be Fisher consistent while, in contrast, Loss (b) is always Fisher consistent. An interesting observation is that although Loss (a) is inconsistent while Loss (b) is consistent, the reduced problems (5) and (7) in the derivation of their asymptotic properties are very similar. In fact, the only difference is that Loss (b) has the constraint f_l(x) ≥ −1 for all l, whereas Loss (a) has the constraint f_l(x) ≤ 1 for all l. Our observation is that one can add additional constraints on f to force Loss (a) to be consistent. In fact, if the additional constraints f_j ≥ −1/(k − 1) for all j are imposed on (5), then the minimizer becomes f_j*(x) = 1 if j = argmax_j P_j(x), and −1/(k − 1) otherwise. Consequently, the corresponding loss is Fisher consistent. To be specific, we can modify the scale of Loss (a) and propose the following new loss: [k − 1 − f_y(x)]_+ subject to Σ_j f_j(x) = 0 and f_l(x) ≥ −1 for all l. Clearly, this loss can be reduced to

(e). −f_y(x) subject to Σ_j f_j(x) = 0 and −1 ≤ f_l(x) ≤ k − 1 for all l.

Consistency of Loss (e):

Lemma 5. The minimizer f* of E[−f_Y(X) | X = x], subject to Σ_j f_j(x) = 0 and −1 ≤ f_l(x) ≤ k − 1 for all l, satisfies the following: f_j*(x) = k − 1 if j = argmax_j P_j(x), and −1 otherwise.

Proof: The proof is analogous to that of Lemma 1. It is easy to see that the problem can be reduced to

  max_f  Σ_l P_l(x) f_l(x)                                             (15)
  s.t.  Σ_l f_l(x) = 0;  −1 ≤ f_l(x) ≤ k − 1, ∀ l.

Thus, the solution satisfies f_j*(x) = k − 1 if j = argmax_j P_j(x), and −1 otherwise.

Figure 3: Illustration of the binary SVM classifier using the bounded hinge loss (e), with the lines f(x) = 1, f(x) = 0, and f(x) = −1.

It is worthwhile to point out that Losses (b) and (c) can be reduced to Loss (e) using the bounded constraints. Specifically, for Loss (b), if −1 ≤ f_l(x) ≤ k − 1 for all l and Σ_l f_l(x) = 0, then Σ_{j≠y} [1 + f_j]_+ = Σ_{j≠y} (1 + f_j) = k − 1 − f_y. Thus, it is equivalent to Loss (e). For Loss (c), we can rewrite it in an equivalent form as Σ_{j≠y} [k − (f_y − f_j)]_+. Then, if −1 ≤ f_l(x) ≤ k − 1 for all l and Σ_l f_l(x) = 0, Σ_{j≠y} [k − (f_y − f_j)]_+ = Σ_{j≠y} (k − (f_y − f_j)) = k(k − 1) − k f_y. Thus, it is equivalent to Loss (e) as well.

As a remark, we want to point out that the constraint −1 ≤ f_l(x) ≤ k − 1 for all l can be difficult to implement for all x ∈ S. For simplicity of learning, we suggest relaxing such constraints to the training points only, that is, −1 ≤ f_l(x_i) ≤ k − 1 for i = 1, ..., n and l = 1, ..., k. Then we have the following multicategory learning method:

  min_f  Σ_{j=1}^k J(f_j) − C Σ_{i=1}^n f_{y_i}(x_i)                   (16)
  s.t.  Σ_{j=1}^k f_j(x_i) = 0;  −1 ≤ f_l(x_i) ≤ k − 1,
        l = 1, ..., k,  i = 1, ..., n.

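As one possible implementation of (16) in the linear case f_j(x) = w_j^T x + b_j with J(f_j) = (1/2)||w_j||^2, the sketch below uses cvxpy as a generic convex solver on simulated three-class data. The data, the solver, and the handling of the intercepts are assumptions of this sketch, not a prescription from the paper.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
k, d, n_per = 3, 2, 30
means = np.array([[0.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])
X = np.vstack([rng.normal(m, 0.7, (n_per, d)) for m in means])
y = np.repeat(np.arange(k), n_per)
n = len(y)

Xa = np.hstack([X, np.ones((n, 1))])          # fold the intercept b_j into the last column
W = cp.Variable((k, d + 1))                   # row j holds (w_j, b_j)
F = Xa @ W.T                                  # F[i, j] = f_j(x_i)
Y = np.eye(k)[y]                              # one-hot indicator of the observed labels
C = 1.0

objective = cp.Minimize(0.5 * cp.sum_squares(W[:, :d]) - C * cp.sum(cp.multiply(Y, F)))
constraints = [cp.sum(F, axis=1) == 0,        # sum-to-zero constraint at the training points
               F >= -1.0, F <= k - 1.0]       # relaxed bounded constraints from Loss (e)
cp.Problem(objective, constraints).solve()

pred = np.argmax(Xa @ W.value.T, axis=1)
print("training accuracy:", np.mean(pred == y))
```
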

To further understand learning using the proposed Loss (e), we discuss the binary case with y ∈ {±1}. Recall that in the separable case, the standard SVM tries to find a hyperplane with maximum separation, i.e., the distance between f(x) = ±1 is maximized. As we can see from Figure 2, only a small subset of the training data, the so-called support vectors, determines the solution. In contrast, using the relaxed bounded constraints in Loss (e) amounts to forcing f(x_i) ∈ [−1, 1] for all training points. As a result, in the separable case, this new classifier tries to find a hyperplane so that the total distance of all training points to the classification boundary, Σ_{i=1}^n y_i f(x_i), subject to f(x_i) ∈ [−1, 1]; i = 1, ..., n, is maximized, as shown in Figure 3. Thus, all points play a role in determining the final solution.

In summary, we can conclude that Loss (e) aims for a different classifier from that of Loss (a) due to the bounded constraints. Note that, in contrast to the notion of support vectors for the SVM, the new classifier in (16) utilizes all training points to determine its solution, as illustrated in Figure 3. In order to make the classifier sparse in training points, one can select a fraction of the training points to approximate the classifier obtained from the whole training set without jeopardizing the classification performance. Using such a sparse classifier may help to reduce the computational cost, especially for large training datasets. The idea of the Import Vector Machine (IVM; Zhu and Hastie, 2005) may prove useful here.

Although Losses (a), (b), and (c) can all be reduced to Loss (e) using bounded constraints, Loss (d) appears to behave differently. Currently, we are not able to make Loss (d) Fisher consistent using bounded constraints, due to special properties of the min function. In Section 5, we propose to use truncation to make Loss (d) Fisher consistent.

5 Fisher Consistent Truncated Hinge Loss

In many applications, outliers may exist in the training sample, and unbounded losses can be sensitive to such points (Shen et al., 2003; Liu et al., 2005; Liu and Shen, 2006). The hinge loss is unbounded, and truncating unbounded hinge losses may help to improve the robustness of the corresponding classifiers. In this section, we explore the idea of truncation on Loss (d). We show that the truncated version of Loss (d) can be Fisher consistent for certain truncation locations. Define T_s(u) = H_1(u) − H_s(u), where H_s(u) = [s − u]_+. Then T_s(g(f(x), y)) with s ≤ 0 becomes a truncated version of Loss (d). Figure 4 shows the functions H_1(u), H_s(u), and T_s(u). We first show in Lemma 6 that for a binary problem, T_s is Fisher consistent for any s ≤ 0. For multicategory problems, truncating H_1(g(f(x), y)) can make it Fisher consistent even in the situation of no dominating class, as shown in Lemma 7.

Figure 4: The left, middle, and right panels display the functions H_1(u), H_s(u), and T_s(u), respectively.

The following lemma establishes Fisher consistency of the truncated hinge loss T_s for the binary case:

Lemma 6. The minimizer f* of E[T_s(Y f(X)) | X = x] has the same sign as P(x) − 1/2 for any s ≤ 0.

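A small numerical illustration of Lemma 6 (not from the paper): with the truncated hinge T_s defined as above and a hypothetical P(x) = 0.7 > 1/2, a grid search over f shows that the minimizing value of f is positive, as the lemma predicts; the choice s = −1 is arbitrary within s ≤ 0.

```python
import numpy as np

def H(u, a=1.0):
    """Shifted hinge H_a(u) = [a - u]_+ (so H_1 is the usual hinge)."""
    return np.maximum(0.0, a - u)

def T(u, s=-1.0):
    """Truncated hinge T_s(u) = H_1(u) - H_s(u): equals H_1(u) for u >= s, flat (= 1 - s) for u <= s."""
    return H(u, 1.0) - H(u, s)

P = 0.7                                   # hypothetical P(x) = P(Y = +1 | X = x)
f_grid = np.linspace(-3, 3, 601)
risk = P * T(f_grid) + (1 - P) * T(-f_grid)
print("sign of minimizing f:", np.sign(f_grid[np.argmin(risk)]))   # expected: +1
```
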
Proof: Notice that E[T_s(Y f(X))] = E[E(T_s(Y f(X)) | X = x)]. We can minimize E[T_s(Y f(X))] by minimizing E(T_s(Y f(X)) | X = x) for every x. For any fixed x, E(T_s(Y f(X)) | X = x) can be written as P(x) T_s(f(x)) + (1 − P(x)) T_s(−f(x)). Since T_s is a non-increasing function, the minimizer f* must satisfy f*(x) ≥ 0 if P(x) > 1/2, and f*(x) ≤ 0 otherwise. Thus, it is sufficient to show that f* = 0 is not a minimizer. Without loss of generality, assume P(x) > 1/2; then E(T_s(0) | X = x) = 1. Consider another solution f(x) = 1. Then E(T_s(Y f(X)) | X = x) = (1 − P(x)) T_s(−1) ≤ 2(1 − P(x)) < 1 for any s ≤ 0. Therefore, f(x) = 1 gives a smaller value of E(T_s(Y f(X)) | X = x) than f(x) = 0, which implies that f* = 0 is not a minimizer. We can then conclude that f*(x) has the same sign as P(x) − 1/2.

Truncating Loss (d):

Lemma 7. The minimizer f* of E[T_s(g(f(X), Y)) | X = x] satisfies argmax_j f_j* = argmax_j P_j, for any s ∈ [−1/(k − 1), 0].

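As an added numerical contrast between Lemma 4(2) and Lemma 7 (hypothetical probabilities with no dominating class, max_j P_j = 0.4 < 1/2), the grid search below minimizes the conditional risk of Loss (d) with and without truncation over sum-to-zero vectors for k = 3: without truncation the minimizer collapses to 0, whereas with truncation at s = −0.5 ∈ [−1/(k − 1), 0] the largest component lands on argmax_j P_j.

```python
import numpy as np

P = np.array([0.4, 0.35, 0.25])            # no dominating class: max_j P_j < 1/2
s = -0.5                                   # truncation point, inside [-1/(k-1), 0] for k = 3

def g(f, y):                               # g(f, y) = min_{j != y} (f_y - f_j)
    return min(f[y] - f[j] for j in range(3) if j != y)

def risk(f, truncated):
    def loss(u):
        val = max(0.0, 1.0 - u)            # H_1(u)
        if truncated:
            val -= max(0.0, s - u)         # subtract H_s(u), giving T_s(u)
        return val
    return sum(P[y] * loss(g(f, y)) for y in range(3))

grid = np.linspace(-2, 2, 161)             # search over sum-to-zero vectors f = (a, b, -a-b)
for truncated in (False, True):
    best = min(((a, b) for a in grid for b in grid),
               key=lambda ab: risk(np.array([ab[0], ab[1], -ab[0] - ab[1]]), truncated))
    f_star = np.array([best[0], best[1], -best[0] - best[1]])
    print("truncated =", truncated, " minimizer ~", np.round(f_star, 2))
# expected: without truncation the minimizer is ~0 (argmax not unique, Lemma 4(2));
# with truncation the largest component sits on argmax_j P_j (index 0), as in Lemma 7.
```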

Proof: Note that E[T_s(g(f(X), Y))] = E[Σ_{j=1}^k T_s(g(f(X), j)) P_j(X)]. For any given x, we need to minimize Σ_{j=1}^k T_s(g_j) P_j, where g_j = g(f(x), j). By definition and the fact that Σ_{j=1}^k f_j = 0, we can conclude that max_j g_j ≥ 0 and at most one of the g_j's is positive. Let j_p satisfy P_{j_p} = max_j P_j. Then, using the non-increasing property of T_s, the minimizer f* satisfies g_{j_p}* ≥ 0. We are now left to show that g_{j_p}* > 0, equivalently that f = 0 cannot be a minimizer. To this end, assume P_{j_p} > 1/k and consider another solution f^0 with f^0_{j_p} = (k − 1)/k and f^0_j = −1/k for j ≠ j_p. Then g^0_{j_p} = 1 and g^0_j = −1 for j ≠ j_p. Clearly, Σ_{j=1}^k T_s(g^0_j) P_j ≤ (1 + 1/(k − 1))(1 − P_{j_p}) < 1. Therefore, f^0 gives a smaller value of Σ_{j=1}^k T_s(g_j) P_j than f = 0, and consequently g_{j_p}* > 0. The desired result then follows.

Remark: The truncation operation can be applied to many other losses, such as Loss (b). Denote H̃_s(u) = [u − s]_+. Then Loss (b) can be expressed as Σ_{j=1}^k I(y ≠ j) H̃_{−1}(f_j(x)). Define T̃_s(u) = H̃_{−1}(u) − H̃_s(u) for s ≥ 0. Then the truncated version of Loss (b) becomes Σ_{j=1}^k I(y ≠ j) T̃_s(f_j(x)). It can be shown that the truncated Loss (b) is Fisher consistent for any s ≥ 0 (Wu and Liu, 2006a).

6 Discussion

Fisher consistency is a desirable property for a loss function in classification. It ensures that the corresponding classifier delivers the Bayes classification boundary asymptotically. Consequently, we view Fisher consistency as a necessary property for any good loss to have. In this paper, we focus on the method of the SVM and discuss Fisher consistency of several commonly used multicategory hinge losses. We study four different losses, and three out of the four are Fisher inconsistent. For the Fisher inconsistent losses, we propose two methods to make them Fisher consistent.

Our first approach is to add bounded constraints to force the corresponding hinge loss to be always Fisher consistent. This results in a very interesting new loss (Loss (e)) and a new learning algorithm. In contrast to the notion of support vectors in the standard SVM, the new classifier utilizes all points to determine the classification boundary. Implementation of the new loss involves convex minimization and solves a quadratic programming problem similar to that of the standard SVM. In future research, we will explore the performance of the new classifier and compare it with the standard SVM both theoretically and numerically. With Fisher consistency of the new loss for both binary and multicategory problems, we believe this new classifier is promising.

Our second approach is to use truncation to make Loss (d) Fisher consistent. Interestingly, truncation not only helps to remedy Fisher inconsistency, it also improves the robustness of the resulting classifier, as discussed in Wu and Liu (2006b). One drawback of the truncated losses, however, is that the corresponding optimization is nonconvex. For a truncated hinge loss, one can decompose it into a difference of two convex functions and then apply the d.c. algorithm (Liu et al., 2005).

Acknowledgements

This work is partially supported by Grant DMS-0606577 from the National Science Foundation and the UNC Junior Faculty Development Award.

References

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101, 138-156.

Bredensteiner, E. and Bennett, K. (1999). Multicategory classification by support vector machines. Computational Optimization and Applications, 12, 53-79.

Crammer, K. and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265-292.

Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.

Guermeur, Y. (2002). Combining discriminant models with new multi-class SVMs. Pattern Analysis and Applications (PAA), 5, 168-179.

Hastie, T., Tibshirani, R., and Friedman, J. (2000). The Elements of Statistical Learning. Springer-Verlag, New York.

Hsu, C. W. and Lin, C. J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13, 2, 415-425.

Lee, Y., Lin, Y., and Wahba, G. (2004). Multicategory Support Vector Machines: theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99, 465, 67-81.

Lin, Y. (2002). Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6, 259-275.

Liu, Y. and Shen, X. (2006). Multicategory ψ-learning. Journal of the American Statistical Association, 101, 474, 500-509.

Liu, Y., Shen, X., and Doss, H. (2005). Multicategory ψ-learning and support vector machine: computational tools. Journal of Computational and Graphical Statistics, 14, 1, 219-236.


Rifkin, R. and Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5, 101-141.

Shen, X., Tseng, G. C., Zhang, X., and Wong, W. H. (2003). On ψ-learning. Journal of the American Statistical Association, 98, 724-734.

Tewari, A. and Bartlett, P. L. (2005). On the consistency of multiclass classification methods. In Proceedings of the 18th Annual Conference on Learning Theory, volume 3559, pages 143-157. Springer.

Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.

Wahba, G. (1998). Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV. In: B. Schölkopf, C. J. C. Burges, and A. J. Smola (eds), Advances in Kernel Methods: Support Vector Learning, MIT Press, 125-143.

Weston, J. and Watkins, C. (1999). Support vector machines for multi-class pattern recognition. Proceedings of the Seventh European Symposium on Artificial Neural Networks.

Wu, Y. and Liu, Y. (2006a). On multicategory truncated-hinge-loss support vector machines. To appear in Proceedings of the Joint Summer Research Conference on Machine and Statistical Learning: Prediction and Discovery.

Wu, Y. and Liu, Y. (2006b). Robust truncated-hinge-loss support vector machines. Submitted.

Zhang, T. (2004). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5, 1225-1251.

Zhu, J. and Hastie, T. (2005). Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14, 1, 185-205.

Zou, H., Zhu, J., and Hastie, T. (2006). The margin vector, admissible loss, and multi-class margin-based classifiers. Technical report, Department of Statistics, University of Minnesota.