Multicategory Classification by Support Vector Machines


Erin J. Bredensteiner
Department of Mathematics
University of Evansville
1800 Lincoln Avenue
Evansville, Indiana
eb6@evansville.edu

Kristin P. Bennett
Department of Mathematical Sciences
Rensselaer Polytechnic Institute
Troy, NY 12180
bennek@rpi.edu

Abstract

We examine the problem of how to discriminate between objects of three or more classes. Specifically, we investigate how two-class discrimination methods can be extended to the multiclass case. We show how the linear programming (LP) approaches based on the work of Mangasarian and the quadratic programming (QP) approaches based on Vapnik's Support Vector Machines (SVM) can be combined to yield two new approaches to the multiclass problem. In LP multiclass discrimination, a single linear program is used to construct a piecewise-linear classification function. In our proposed multiclass SVM method, a single quadratic program is used to construct a piecewise-nonlinear classification function. Each piece of this function can take the form of a polynomial, a radial basis function, or even a neural network. For k > 2 class problems, the SVM method as originally proposed required the construction of a two-class SVM to separate each class from the remaining classes. Similarly, k two-class linear programs can be used for the multiclass problem. We performed an empirical study of the original LP method, the proposed k-LP method, the proposed single QP method, and the original k-QP method. We discuss the advantages and disadvantages of each approach.

1 Introduction

We investigate the problem of discriminating large real-world datasets with more than two classes. Given examples of points known to come from k > 2 classes, we construct a function to discriminate between the classes. The goal is to select a function that will efficiently and correctly classify future points. This classification technique can be used for data mining or pattern recognition. For example, the United States Postal Service is interested in an efficient yet accurate method of classifying zipcodes. Actual handwritten digits from zipcodes collected by the United States Postal Service are used in our study. Each digit is represented by a 16 by 16 pixel grayscale map, resulting in 256 attributes for each sample number. Given the enormous quantities of mail the Postal Service sorts each day, accuracy and efficiency in evaluation are extremely important.

In this paper, we combine two independent but related research directions developed for solving the two-class linear discrimination problem. The first is the linear programming (LP) methods stemming from the Multisurface Method of Mangasarian [12, 13]. This method and its later extension, the Robust Linear Programming (RLP) approach [6], have been used in a highly successful breast cancer diagnosis system [26]. The second direction is the quadratic programming (QP) methods based on Vapnik's Statistical Learning Theory [24, 25]. Statistical Learning Theory addresses mathematically the problem of how to best construct functions that generalize well on future points. The problem of constructing the best linear two-class discriminant can be posed as a convex quadratic program with linear constraints. The resulting linear discriminant is known as a Support Vector Machine (SVM) because it is a function of a subset of the training data known as support vectors. Specific implementations such as the Generalized Optimal Plane (GOP) method have proven to perform very well in practice [8]. Throughout this paper we will refer to the two different approaches as RLP and SVM.

The primary focus of this paper is how the two research directions have differed in their approach to solving problems with k > 2 classes. The original SVM method for multiclass problems was to find k separate two-class discriminants [23]. Each discriminant is constructed by separating a single class from all the others. This process requires the solution of k quadratic programs. When applying all k classifiers to the original multicategory dataset, multiply classified points or unclassified points may occur. This ambiguity has been avoided by choosing the class of a point corresponding to the classification function that is maximized at that point. The LP approach has been to directly construct k classification functions such that for each point the corresponding class function is maximized [5, 6]. The Multicategory Discrimination Method [5, 6] constructs a piecewise-linear discriminant for the k-class problem using a single linear program. We will call this method M-RLP since it is a direct extension of the RLP approach. We will show how these two different approaches can be combined to yield two new methods: k-RLP and M-SVM. In Section 2, we will provide background on the existing RLP and SVM methods.

While the k-class cases are quite different, the two-class linear discrimination methods for SVM and RLP are almost identical. They differ only in the regularization term used in the objective. We use the regularized form of RLP proposed in [3], which is equivalent to SVM except that a different norm is used for the regularization term. For two-class linear discrimination, RLP generalizes equally well and is more computationally efficient than SVM. RLP exploits the fact that state-of-the-art LP codes are far more efficient and reliable than QP codes.

The primary appeal of SVM is that it can be simply and elegantly applied to nonlinear discrimination. With only minor changes, SVM methods can construct a wide class of two-class nonlinear discriminants by solving a single QP [24]. The basic idea is that the points are mapped nonlinearly to a higher dimensional space. Then the dual SVM problem is used to construct a linear discriminant in the higher dimensional space that is nonlinear in the original attribute space. By using kernel functions in the dual SVM problem, SVM can efficiently and effectively construct many types of nonlinear discriminant functions including polynomials, radial basis function machines, and neural networks. The successful polynomial-time nonlinear methods based on LP use multi-step approaches. The methods of Roy et al. [20, 19, 18] use clustering in conjunction with LP to generate neural networks in polynomial time. Another approach is to recursively construct piecewise-linear discriminants using a series of LPs [13, 2, 15]. These approaches could also be used with SVM, but we limit discussion to nonlinear discriminants constructed using the SVM kernel-type approaches.

After the introduction to the existing multiclass methods, M-RLP and k-SVM, we will show how the same idea used in M-RLP can be adapted to construct a multiclass SVM using a single quadratic program. We adopt a problem formulation similar to the two-class case. In the two-class case, initially the problem is to construct a linear discriminant. The data points are then transformed to a higher dimensional feature space. A linear discriminant is constructed in the higher dimensional space. This results in a nonlinear classification function in the original feature space. In Section 3, for the k > 2 class case, we begin by constructing a piecewise-linear discriminant function. A regularization term is added to avoid overfitting. This method is then extended to piecewise-nonlinear classification functions in Section 4. The variables are mapped to a higher dimensional space. Then a piecewise-linear discriminant function is constructed in the new space. This results in a piecewise-nonlinear discriminant in the original space. In Section 5, we extend the method to piecewise inseparable datasets. We call the final approach the Multicategory Support Vector Machine (M-SVM). Depending on the choice of transformation, the pieces may be polynomials, radial basis functions, neural networks, etc. We concentrate our research on the polynomial classifier and leave the computational investigation of other classification functions as future work. Figure 1 shows a piecewise second-degree polynomial separating three classes in two dimensions.

M-SVM requires the solution of a very large quadratic program. When transforming the data points into a higher dimensional feature space, the number of variables will grow exponentially.

Figure 1: Piecewise-polynomial separation of three classes in two dimensions

For example, a second degree polynomial classifier in two dimensions requires the original variables x_1 and x_2 as well as the variables x_1^2, x_2^2, and x_1 x_2. In the primal problem, the problem size will explode as the degree of the polynomial increases. The dual problem, however, remains tractable. The number of dual variables is k times the number of points regardless of what transformation is selected. In the dual problem, the transformation appears as an inner product in the high dimensional space. Inexpensive techniques exist for computing these inner products. Each dual variable corresponds to a point in the original feature space. A point with a corresponding positive dual variable is referred to as a support vector. The goal is to maintain a high accuracy while using a small number of support vectors. Minimizing the number of support vectors is important for generalization and also for reducing the computational time required to evaluate new examples. Section 6 contains computational results comparing the two LP approaches, k-RLP and M-RLP, and the two QP approaches, k-SVM and M-SVM. The methods were compared in terms of generalization (testing set accuracy), number of support vectors, and computational time.

The following notation will be used throughout this paper. Mathematically we can abstract the problem as follows: given the elements of the sets A^i, i = 1, ..., k, in the n-dimensional real space R^n, construct a discriminant function that separates these points into distinct regions. Each region should contain points belonging to all or almost all of the same class. Let A^j be a set of points in the n-dimensional real space R^n with cardinality m_j. Let A^j also denote the m_j × n matrix whose rows are the points in A^j. The i-th point in A^j and the i-th row of A^j are both denoted A^j_i. Let e denote a vector of ones of the appropriate dimension. The scalar 0 and a vector of zeros are both represented by 0. Thus, for x ∈ R^n, x > 0 implies that x_i > 0 for i = 1, ..., n. Similarly, x ≥ y implies that x_i ≥ y_i for i = 1, ..., n. The set of minimizers of f(x) on the set S is denoted by arg min_{x ∈ S} f(x). For a vector x in R^n, x_+ will denote the vector in R^n with components (x_+)_i := max{x_i, 0}, i = 1, ..., n. The step function x_* will denote the vector in [0, 1]^n with components (x_*)_i := 0 if x_i ≤ 0 and (x_*)_i := 1 if x_i > 0, i = 1, ..., n. For the vector x in R^n and the matrix A in R^{n×m}, the transposes of x and A are denoted x^T and A^T respectively. The dot product of two vectors x and y will be denoted x^T y or (x · y).

2 Background

This section contains a brief overview of the RLP and SVM methods for classification. First we will discuss the two-class problem using a linear classifier. Then SVM for two classes will be defined. Then RLP will be reviewed. Finally, the piecewise-linear function used for multicategory classification in M-RLP will be reviewed.

2.1 Two-Class Linear Discrimination

Commonly, the method of discrimination for two classes of points involves determining a linear function that consists of a linear combination of the attributes of the given sets. In the simplest case, a linear function can be used to separate two sets as shown in Figure 2. This function is the separating plane

x^T w = γ

Figure 2: Two linearly separable sets and a separating plane

where w is the normal to the plane and γ is the distance from the origin. Let A^1 and A^2 be two sets of points in the n-dimensional real space R^n with cardinality m_1 and m_2 respectively. Let A^1 be an m_1 × n matrix whose rows are the points in A^1. Let A^2 be an m_2 × n matrix whose rows are the points in A^2. Let x ∈ R^n be a point to be classified as follows:

x^T w − γ > 0  ⟹  x ∈ A^1
x^T w − γ < 0  ⟹  x ∈ A^2    (1)

The two sets of points, A^1 and A^2, are linearly separable if

A^1 w > γe > A^2 w    (2)

where e is a vector of ones of the appropriate dimension. If the two classes are linearly separable, there are infinitely many planes that separate the two classes. The goal is to choose the plane that will generalize best on future points. Both Mangasarian [12] and Vapnik and Chervonenkis [25] concluded that the best plane in the separable case is the one that minimizes the distance of the closest vector in each class to the separating plane. For the separable case the formulations of Mangasarian's Multi-surface Method of Pattern Recognition [13] and those of Vapnik's Optimal Hyperplane [24, 25] are very similar [3]. We will concentrate on the Optimal Hyperplane problem since it is the basis of SVM, and it is validated theoretically by Statistical Learning Theory [24]. According to Statistical Learning Theory, the Optimal Hyperplane can construct linear discriminants in very high dimensional spaces without overfitting. The reader should consult [24] for full details of Statistical Learning Theory not covered in this paper.

Figure 3: Two supporting planes x^T w = γ + 1 and x^T w = γ − 1 and the resulting optimal separating plane x^T w = γ

The problem in the canonical form of Vapnik [24] becomes to determine two parallel planes x^T w = γ + 1 and x^T w = γ − 1 such that

A^1 w − γe − e ≥ 0
−A^2 w + γe − e ≥ 0    (3)

and the margin, or distance between the two planes, is maximized. The margin of separation between the two supporting planes is 2/‖w‖. An example of such planes is shown in Figure 3. The problem of finding the maximum margin becomes [24]:

min_{w,γ}  (1/2) w^T w
s.t.  A^1 w − γe − e ≥ 0
      −A^2 w + γe − e ≥ 0    (4)

In general it is not always possible for a single linear function to completely separate two given sets of points. Thus, it is important to find the linear function that discriminates best between the two sets according to some error minimization criterion. Bennett and Mangasarian [4] minimize the average magnitude of the misclassification errors in the construction of their robust linear programming problem (RLP):

min_{w,γ,y,z}  δ_1 e^T y + δ_2 e^T z
s.t.  y + A^1 w − γe − e ≥ 0
      z − A^2 w + γe − e ≥ 0
      y ≥ 0, z ≥ 0    (5)

where δ_1 > 0 and δ_2 > 0 are the misclassification costs. To avoid the null solution w = 0, use δ_1 = 1/m_1 and δ_2 = 1/m_2, where m_1 and m_2 are the cardinalities of A^1 and A^2 respectively. The RLP method is very effective in practice.
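As a concrete illustration, the following sketch sets up RLP (5) for the scipy.optimize.linprog solver. It is a minimal reading of the formulation above under stated assumptions, not the authors' implementation; the function name rlp and the choice of the HiGHS solver are my own.

```python
# A minimal sketch of the two-class RLP linear program (5).
import numpy as np
from scipy.optimize import linprog

def rlp(A1, A2):
    """min d1*e'y + d2*e'z  s.t.  y + A1 w - gamma e - e >= 0,
                                  z - A2 w + gamma e - e >= 0,  y, z >= 0."""
    m1, n = A1.shape
    m2 = A2.shape[0]
    d1, d2 = 1.0 / m1, 1.0 / m2          # misclassification costs from the text
    # decision vector x = [w (n), gamma (1), y (m1), z (m2)]
    c = np.concatenate([np.zeros(n + 1), d1 * np.ones(m1), d2 * np.ones(m2)])
    # linprog expects A_ub x <= b_ub, so negate the >= constraints
    top = np.hstack([-A1,  np.ones((m1, 1)), -np.eye(m1), np.zeros((m1, m2))])
    bot = np.hstack([ A2, -np.ones((m2, 1)), np.zeros((m2, m1)), -np.eye(m2)])
    res = linprog(c, A_ub=np.vstack([top, bot]), b_ub=-np.ones(m1 + m2),
                  bounds=[(None, None)] * (n + 1) + [(0, None)] * (m1 + m2),
                  method="highs")
    w, gamma = res.x[:n], res.x[n]
    return w, gamma   # classify x as A1 if x @ w - gamma > 0, else A2, as in (1)
```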

The functions generated by RLP generalize well on many real-world problems. Additionally, the computational time is reasonably small because its solution involves only a single linear program. Note however that the RLP method no longer includes any notion of maximizing the margin. Statistical Learning Theory indicates that maximizing the margin is essential for good generalization. The SVM approach [8, 23] is a multiobjective quadratic program which minimizes the absolute misclassification errors and maximizes the separation margin by minimizing ‖w‖²:

min_{w,γ,y,z}  (1 − λ)(e^T y + e^T z) + (λ/2) w^T w
s.t.  A^1 w − γe + y − e ≥ 0
      −A^2 w + γe + z − e ≥ 0
      y ≥ 0, z ≥ 0    (6)

where 0 < λ < 1 is a fixed constant. Note that Problem (6) is equivalent to RLP with the addition of a regularization term (λ/2) w^T w, and δ_1 = δ_2 = 1.

A linear programming version of (6) can be constructed by replacing the norm used to minimize the weights w [3]. Recall that the SVM objective minimizes the square of the 2-norm of w, ‖w‖²_2 = w^T w. The 1-norm of w, ‖w‖_1, can be used instead. The absolute value function can be removed by introducing the variable s and the constraints −s ≤ w ≤ s. The SVM objective is then modified by substituting e^T s for (1/2) w^T w. At optimality, s_i = |w_i|, i = 1, ..., n. The resulting LP is:

min_{w,γ,y,z,s}  (1 − λ)((1/m_1) e^T y + (1/m_2) e^T z) + λ e^T s
s.t.  A^1 w − γe + y − e ≥ 0
      −A^2 w + γe + z − e ≥ 0
      −s ≤ w ≤ s
      y ≥ 0, z ≥ 0, s ≥ 0    (7)

We will refer to this problem as RLP since λ = 0 yields the original RLP method. As in the SVM method, the RLP method minimizes both the average distance of the misclassified points from the relaxed supporting planes and the 1-norm of the weight vector w. The main advantage of the RLP method over the SVM problem is that RLP is a linear program solvable using very robust algorithms such as the Simplex Method [17]. SVM requires the solution of a quadratic program that is typically much more computationally costly for the same size problem. In [3], the RLP method was found to generalize as well as the linear SVM but with much less computational cost.

It is more efficient computationally to solve the dual RLP and SVM problems. The dual RLP problem is

max_{u,v}  e^T u + e^T v
s.t.  −λe ≤ A^{1T} u − A^{2T} v ≤ λe
      e^T u − e^T v = 0
      0 ≤ u ≤ (1 − λ)δ_1 e
      0 ≤ v ≤ (1 − λ)δ_2 e    (8)

In this paper we use δ_1 = 1/m_1 and δ_2 = 1/m_2, but δ_1 and δ_2 may be any positive weights for the misclassification costs. The dual SVM problem and its extension to nonlinear discriminants is given in the next section.

2.2 Nonlinear Classifiers Using Support Vector Machines

The primary advantage of the SVM (6) over RLP (7) is that in its dual form it can be used to construct nonlinear discriminants using polynomial separators, radial basis functions, neural networks, etc. The basic idea is to map the original problem to a higher dimensional space and then to construct a linear discriminant in the higher dimensional space that corresponds to a nonlinear discriminant in the original space. So, for example, to construct a quadratic discriminant for a two-dimensional problem, the input attributes [x_1, x_2] are mapped into [x_1², x_2², √2 x_1 x_2, x_1, x_2] and a linear discriminant function is constructed in the new five-dimensional space. Two examples of possible polynomial classifiers are given in Figure 4. The dual SVM is applied to the mapped points. The regularization term in the primal objective helps avoid overfitting the higher dimensional space. The dual SVM provides a practical computational approach through the use of generalized inner products or kernels.

Figure 4: Two examples of second degree polynomial separations of two sets

The dual SVM is as follows:

min_{u,v}  (1/(2λ)) ‖A^{1T} u − A^{2T} v‖² − e^T u − e^T v
s.t.  e^T u = e^T v
      (1 − λ)e ≥ u ≥ 0
      (1 − λ)e ≥ v ≥ 0    (9)

To formulate the nonlinear case it is convenient to rewrite the problem in summation notation. Let A be the set of all points in A^1 and A^2. Define M = m_1 + m_2 to be the total number of points.

Let

α^T = [α_1, α_2, ..., α_M] = [(1/λ) u^T, (1/λ) v^T]

Let t ∈ R^M be such that

t_i = 1 if x_i ∈ A^1,  t_i = −1 if x_i ∈ A^2

To construct the nonlinear classification function, the original data points x are transformed to the higher dimensional feature space by the function φ(x): R^n → R^{n'}, n' >> n. The dot product of the original vectors, x_i^T x_j, is replaced by the dot product of the transformed vectors, (φ(x_i) · φ(x_j)). The first term of the objective function can then be written as the sum

(λ/2) Σ_{i=1}^{M} Σ_{j=1}^{M} t_i t_j α_i α_j (φ(x_i) · φ(x_j))

Using this notation and simplifying, the problem becomes:

min_α  (1/2) Σ_{i=1}^{M} Σ_{j=1}^{M} t_i t_j α_i α_j (φ(x_i) · φ(x_j)) − Σ_{i=1}^{M} α_i
s.t.  Σ_{i=1}^{M} α_i t_i = 0
      ((1 − λ)/λ) e ≥ α ≥ 0    (10)

In the support vector machine (SVM), Vapnik replaces the inner product (φ(x) · φ(x_i)) with the inner product in the Hilbert space K(x, x_i). This symmetric function K(x, x_i) must satisfy Theorem 5.3 in [23]. This theorem ensures K(x, x_i) is an inner product in some feature space. The choice of K(x, x_i) determines the type of classifier that is constructed. Possible choices include polynomial classifiers as in Figure 4 (K(x, x_i) = (x^T x_i + 1)^d, where d is the degree of the polynomial), radial basis function machines (K_γ(|x − x_i|) = exp{−γ ‖x − x_i‖²}, where |x − x_i| is the distance between two vectors and γ is the width parameter), and two-layer neural networks (K(x, x_i) = S[v(x^T x_i) + c], where S(u) is a sigmoid function) [23]. Variants of SVM (10) have proven to be quite successful in practice [21, 22, 7].

Note that the number of variables in Program (10) remains constant as K(x, x_i) increases in dimensionality. Additionally, the objective function remains quadratic and thus the complexity of the problem does not increase. In fact, the size of the problem is dependent on the number of nonzero dual variables α_i. The points x_i corresponding to these variables are called the support vectors. According to Statistical Learning Theory, the best solution for a given misclassification error uses the minimum number of support vectors. The final classification function with the generalized kernel function K(x, x_i) is:

f(x) = sign( Σ_{support vectors} t_i α_i K(x, x_i) − γ )    (11)

where x ∈ A^1 if f(x) = 1, otherwise x ∈ A^2.
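The kernels just listed are easy to state in code. The sketch below is a hedged illustration rather than the paper's implementation: it shows the polynomial and radial basis kernels and the decision function (11), where the arrays support_x, support_t, support_alpha, and gamma are assumed to come from a solver for dual problem (10).

```python
# Kernels and the classification function (11) for the two-class SVM.
import numpy as np

def poly_kernel(x, xi, d=2):
    # polynomial classifier: K(x, xi) = (x'xi + 1)^d
    return (x @ xi + 1.0) ** d

def rbf_kernel(x, xi, width=1.0):
    # radial basis function machine: K(x, xi) = exp(-width * ||x - xi||^2)
    return np.exp(-width * np.sum((x - xi) ** 2))

def decide(x, support_x, support_t, support_alpha, gamma, kernel=poly_kernel):
    """Evaluate f(x) = sign(sum over support vectors of t_i alpha_i K(x, x_i) - gamma).
    Returns +1 for class A1 and -1 for class A2."""
    s = sum(t * a * kernel(x, xi)
            for xi, t, a in zip(support_x, support_t, support_alpha))
    return np.sign(s - gamma)
```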

Figure 5: Piecewise-linear separation of sets A^1, A^2, and A^3 by the convex piecewise-linear function f(x) = max_{i=1,2,3} (w^i x − γ_i)

2.3 Multicategory Discrimination

In multicategory classification a piecewise-linear separator is used to discriminate between k > 2 classes of m_i, i = 1, ..., k, points. We will examine two methods for accomplishing this. The first, used in SVM [24], is to construct a discriminant function to separate one class from the remaining classes. This process is repeated k times. In the separable case, the linear discriminant for each class must satisfy the following set of inequalities: find (w^1, γ_1), ..., (w^k, γ_k) such that

A^i w^i − γ_i e > A^j w^i − γ_i e,  i, j = 1, ..., k,  i ≠ j    (12)

To classify a new point x, compute f_i(x) = x^T w^i − γ_i. If f_i(x) > 0 for only one i, then clearly the point belongs to class A^i. If more than one f_i(x) > 0, or f_i(x) ≤ 0 for all i = 1, ..., k, then the class is ambiguous. Thus the general rule is that the class of a point x is determined from (w^i, γ_i), i = 1, ..., k, by finding i such that

f_i(x) = x^T w^i − γ_i    (13)

is maximized. Figure 5 shows a piecewise-linear function f(x) = max_{i=1,2,3} f_i(x) on R that separates three sets. Note either SVM (10) or RLP can be used to construct the k two-class discriminants. For clarity, we will call this method used with SVM (10) k-SVM. We will denote this method used with RLP (8) k-RLP. The advantage of k-SVM is that it can be used for piecewise-nonlinear discriminants, while k-RLP is limited to piecewise-linear discriminants.
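A minimal sketch of this one-per-class scheme follows. Here train_two_class stands for any two-class method from this section (for instance, the rlp sketch after Problem (5)); it is an assumption of the illustration, not a routine defined in the paper.

```python
# Train k two-class discriminants (class i versus the rest) and
# classify by the maximal f_i(x) = x'w^i - gamma_i, as in rule (13).
import numpy as np

def train_one_vs_rest(point_sets, train_two_class):
    """point_sets[i] is the m_i x n matrix A^i; returns a list of (w, gamma)."""
    models = []
    for i, Ai in enumerate(point_sets):
        rest = np.vstack([A for j, A in enumerate(point_sets) if j != i])
        w, gamma = train_two_class(Ai, rest)   # separate class i from the rest
        models.append((w, gamma))
    return models

def classify(x, models):
    """Pick the class whose f_i(x) is maximized; this also resolves
    multiply classified or unclassified points."""
    scores = [x @ w - gamma for (w, gamma) in models]
    return int(np.argmax(scores))
```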

For both k-SVM and k-RLP to attain perfect training set accuracy, the following inequalities must be satisfied:

A^i w^i − γ_i e > A^i w^j − γ_j e,  i, j = 1, ..., k,  i ≠ j

This inequality can be used as a definition of piecewise-linear separability.

Definition 2.1 (Piecewise-linear Separability) The sets of points A^i, i = 1, ..., k, represented by the matrices A^i ∈ R^{m_i × n}, i = 1, ..., k, are piecewise-linearly separable if there exist w^i ∈ R^n and γ_i ∈ R, i = 1, ..., k, such that

A^i w^i − γ_i e > A^i w^j − γ_j e,  i, j = 1, ..., k,  i ≠ j    (14)

Equivalent to Definition 2.1, finding the piecewise-linear separator involves solving the system

A^i w^i − γ_i e ≥ A^i w^j − γ_j e + e,  i, j = 1, ..., k,  i ≠ j

This can be rewritten as

0 ≥ −A^i (w^i − w^j) + (γ_i − γ_j) e + e,  i, j = 1, ..., k,  i ≠ j

Figure 6 shows an example of a piecewise-linear separator for three classes in two dimensions. The linear separating functions are represented by the quantities (w^i − w^j, γ_i − γ_j), i, j = 1, ..., k, j ≠ i, where w^i ∈ R^n and γ_i ∈ R, i = 1, ..., k.

Figure 6: Three classes separated by a piecewise-linear function; the separating planes are x^T (w^1 − w^2) = γ_1 − γ_2, x^T (w^2 − w^3) = γ_2 − γ_3, and x^T (w^3 − w^1) = γ_3 − γ_1

The M-RLP method (originally called the Multicategory Discrimination method) proposed and investigated in [5, 6] can be used to find (w^i, γ_i), i = 1, ..., k, satisfying Definition 2.1:

min_{w^i, γ_i, y^{ij}}  Σ_{i=1}^{k} Σ_{j=1, j≠i}^{k} (1/m_i) e^T y^{ij}
s.t.  y^{ij} ≥ −A^i (w^i − w^j) + (γ_i − γ_j) e + e
      y^{ij} ≥ 0,  i ≠ j,  i, j = 1, ..., k    (15)

where y^{ij} ∈ R^{m_i}. In M-RLP (15), if the optimal objective value is zero, then the dataset is piecewise-linearly separable. If the dataset is not piecewise-linearly separable, the positive values of the variables y^{ij} are proportional to the magnitude of the misclassification of the points from the plane x^T (w^i − w^j) = (γ_i − γ_j) + 1.

This program (15) is a generalization of the two-class RLP linear program (5) to the multicategory case. Like the original RLP (5), M-RLP does not include any terms for maximizing the margin, and it does not directly permit the use of generalized inner products or kernels to allow extension to the nonlinear case. So in the next section we will show how M-RLP and SVM can be combined by including margin maximization and generalized inner products into M-RLP.
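To make the single-program character of M-RLP concrete, here is a hedged sketch that assembles Problem (15) for scipy.optimize.linprog. The variable layout and function name are my own choices; the paper's experiments use the MINOS solver instead.

```python
# A minimal sketch of the single linear program M-RLP (15).
import numpy as np
from scipy.optimize import linprog

def m_rlp(point_sets):
    """point_sets[i] is the m_i x n matrix A^i; returns (W, gammas)."""
    k = len(point_sets)
    n = point_sets[0].shape[1]
    ms = [A.shape[0] for A in point_sets]
    pairs = [(i, j) for i in range(k) for j in range(k) if j != i]
    ny = sum(ms[i] for i, _ in pairs)          # total slack variables y^{ij}
    nv = k * n + k + ny                        # x = [w^1..w^k, gamma, y]
    c = np.zeros(nv)
    rows, rhs = [], []
    off = k * n + k                            # start of the y block
    for i, j in pairs:
        Ai, mi = point_sets[i], ms[i]
        c[off:off + mi] = 1.0 / mi             # objective term (1/m_i) e'y^{ij}
        # -A^i(w^i - w^j) + (gamma_i - gamma_j)e + e - y^{ij} <= 0
        R = np.zeros((mi, nv))
        R[:, i*n:(i+1)*n] = -Ai
        R[:, j*n:(j+1)*n] = Ai
        R[:, k*n + i] = 1.0
        R[:, k*n + j] = -1.0
        R[:, off:off + mi] = -np.eye(mi)
        rows.append(R)
        rhs.append(-np.ones(mi))
        off += mi
    bounds = [(None, None)] * (k * n + k) + [(0, None)] * ny
    res = linprog(c, A_ub=np.vstack(rows), b_ub=np.concatenate(rhs),
                  bounds=bounds, method="highs")
    W = res.x[:k*n].reshape(k, n)
    gammas = res.x[k*n:k*n + k]
    return W, gammas       # classify by the maximal x'w^i - gamma_i, as in (13)
```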

3 Formulation of M-SVM: Piecewise-linear Separable Case

We now propose to construct piecewise-linear and piecewise-nonlinear SVMs using a single quadratic program. Analogous to the two-class case, we start by formulating the optimal piecewise-linear separator for the separable case. Assume that the k sets of points are piecewise-linearly separable, i.e., there exist w^i ∈ R^n and γ_i ∈ R, i = 1, ..., k, such that

A^i w^i − γ_i e > A^i w^j − γ_j e,  i, j = 1, ..., k,  i ≠ j    (16)

The class of a point x is determined from (w^i, γ_i), i = 1, ..., k, by finding i such that

f_i(x) = x^T w^i − γ_i    (17)

is maximized. For this piecewise-linearly separable problem, infinitely many (w^i, γ_i) exist that satisfy (16). Intuitively, the optimal (w^i, γ_i) provides the largest margin of classification. So, in an approach analogous to the two-class support vector machine (SVM) approach, we add regularization terms. The dashed lines in Figure 7 represent the margins for each piece (w^i − w^j, γ_i − γ_j) of the piecewise-linear separating function. The margin of separation between the classes i and j, i.e., the distance between the planes x^T (w^i − w^j) = (γ_i − γ_j) + 1 and x^T (w^i − w^j) = (γ_i − γ_j) − 1, is 2/‖w^i − w^j‖. So, we would like to minimize ‖w^i − w^j‖ for all i, j = 1, ..., k, i ≠ j. Also, we will add the regularization term (1/2) Σ_{i=1}^{k} ‖w^i‖² to the objective. For the piecewise-linearly separable problem we get the following:

min_{w^i, γ_i}  (1/2) Σ_{i=1}^{k} Σ_{j=i+1}^{k} ‖w^i − w^j‖² + (1/2) Σ_{i=1}^{k} ‖w^i‖²
s.t.  A^i (w^i − w^j) − e(γ_i − γ_j) − e ≥ 0,  i, j = 1, ..., k,  i ≠ j    (18)

To simplify the notation for the formulation of the piecewise-linear SVM, we rewrite this in matrix notation. See Appendix A for complete matrix definitions for general k. For the three class problem (k = 3) the following matrices are obtained. Let

C = [ I  −I   0 ]
    [ I   0  −I ]
    [ 0   I  −I ]

Figure 7: Piecewise-linear separator with margins for three classes; the dashed margin planes for the piece (w^1 − w^2, γ_1 − γ_2) are (w^1 − w^2) x = (γ_1 − γ_2) + 1 and (w^1 − w^2) x = (γ_1 − γ_2) − 1

where I ∈ R^{n×n} is the identity matrix. Let

Ā = [ A^1  −A^1    0  ]        Ē = [ −e^1   e^1    0  ]
    [ A^1    0   −A^1 ]            [ −e^1    0    e^1 ]
    [−A^2   A^2    0  ]            [  e^2  −e^2    0  ]
    [  0    A^2  −A^2 ]            [  0   −e^2   e^2 ]
    [−A^3    0    A^3 ]            [  e^3    0   −e^3 ]
    [  0   −A^3   A^3 ]            [  0    e^3  −e^3 ]

where A^i ∈ R^{m_i × n}, i = 1, ..., 3, and e^i ∈ R^{m_i}, i = 1, ..., 3, is a vector of ones. Using this notation, for fixed k > 2 the program becomes:

min_{w,γ}  (1/2) ‖Cw‖² + (1/2) ‖w‖²
s.t.  Āw + Ēγ − e ≥ 0    (19)

where w = [w^{1T}, w^{2T}, ..., w^{kT}]^T and γ = [γ_1, γ_2, ..., γ_k]^T. The dual of this problem can be written as:

max_{u,w,γ}  (1/2) ‖Cw‖² + (1/2) ‖w‖² − u^T (Āw + Ēγ − e)
s.t.  (I + C^T C) w = Ā^T u
      Ē^T u = 0
      u ≥ 0    (20)
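For completeness, here is a small sketch that builds Ā and Ē for general k under the ordered-pair row layout shown above (one block row per class i and each j ≠ i); the function name is my own.

```python
# Construct the block matrices Abar and Ebar of Section 3 so that
# each row of Abar w + Ebar gamma - e >= 0 reads
# A^i_p (w^i - w^j) - (gamma_i - gamma_j) - 1 >= 0.
import numpy as np

def build_Abar_Ebar(point_sets):
    """point_sets[i] is the m_i x n matrix A^i; returns (Abar, Ebar)."""
    k = len(point_sets)
    n = point_sets[0].shape[1]
    A_rows, E_rows = [], []
    for i in range(k):
        mi = point_sets[i].shape[0]
        for j in range(k):
            if j == i:
                continue
            Arow = np.zeros((mi, k * n))
            Arow[:, i*n:(i+1)*n] = point_sets[i]
            Arow[:, j*n:(j+1)*n] = -point_sets[i]
            Erow = np.zeros((mi, k))
            Erow[:, i] = -1.0
            Erow[:, j] = 1.0
            A_rows.append(Arow)
            E_rows.append(Erow)
    return np.vstack(A_rows), np.vstack(E_rows)
```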

To eliminate the variables w and γ from this problem we will first show that the matrix (I + C^T C) is nonsingular.

Proposition 3.1 (Nonsingularity of (I + C^T C)) The inverse of the matrix (I + C^T C) for k > 2 is

(I_{kn} + C^T C)^{−1} = [ (2/(k+1)) I_n   (1/(k+1)) I_n   ...   (1/(k+1)) I_n ]
                        [ (1/(k+1)) I_n   (2/(k+1)) I_n   ...   (1/(k+1)) I_n ]
                        [      ...              ...       ...        ...     ]
                        [ (1/(k+1)) I_n   (1/(k+1)) I_n   ...   (2/(k+1)) I_n ]    (21)

where I_n indicates the n × n identity matrix.

Proof. To show that (I + C^T C) is nonsingular for k > 2, we will calculate its inverse. The matrix C as defined in Appendix A has size n Σ_{i=2}^{k} (i − 1) × kn. Recall that n indicates the dimension of the feature space. The matrix

C^T C = [ (k−1) I_n     −I_n      ...     −I_n    ]
        [    −I_n    (k−1) I_n    ...     −I_n    ]
        [     ...        ...      ...      ...    ]
        [    −I_n       −I_n      ...  (k−1) I_n  ]

has size kn × kn. Therefore

I_{kn} + C^T C = [ k I_n   −I_n   ...   −I_n  ]
                 [ −I_n    k I_n  ...   −I_n  ]
                 [  ...     ...   ...    ...  ]
                 [ −I_n    −I_n   ...   k I_n ]

Through simple calculations it can be shown that the inverse of this matrix is (21).

Using Proposition 3.1 the following relationship results:

(I + C^T C)^{−1} Ā^T = (1/(k+1)) Ā^T    (22)
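Proposition 3.1 is easy to check numerically. The following sketch builds C for a small k, forms I + C^T C, and compares its inverse against formula (21); it is a verification aid, not part of the method.

```python
# Numerical check of Proposition 3.1 for the pairwise block matrix C.
import numpy as np
from itertools import combinations

def build_C(k, n):
    # one block row [ ... I ... -I ... ] per unordered pair i < j
    rows = []
    for i, j in combinations(range(k), 2):
        block = np.zeros((n, k * n))
        block[:, i*n:(i+1)*n] = np.eye(n)
        block[:, j*n:(j+1)*n] = -np.eye(n)
        rows.append(block)
    return np.vstack(rows)

k, n = 4, 3
C = build_C(k, n)
M = np.eye(k * n) + C.T @ C
# formula (21): diagonal blocks 2/(k+1) I_n, off-diagonal blocks 1/(k+1) I_n
F = (np.kron(np.ones((k, k)), np.eye(n)) + np.eye(k * n)) / (k + 1)
assert np.allclose(np.linalg.inv(M), F)
```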

It follows from Problem (20) and equation (22) that

w = (I + C^T C)^{−1} Ā^T u = (1/(k+1)) Ā^T u    (23)

Using this relationship, we eliminate w from the dual problem. Additionally, γ is removed because Ē^T u = 0. After some simplification the new dual problem becomes:

max_u  e^T u − (1/(2(k+1))) u^T Ā Ā^T u
s.t.  Ē^T u = 0
      u ≥ 0    (24)

To construct the multicategory support vector machine, it is convenient to write this problem in summation notation. Let the dual vector u^T = [u^{12T}, u^{13T}, ..., u^{1kT}, u^{21T}, u^{23T}, ..., u^{k(k−1)T}], where u^{ij} ∈ R^{m_i}. The resulting dual problem for piecewise-linear datasets is:

max_u  Σ_{i=1}^{k} Σ_{j≠i} Σ_{l=1}^{m_i} u^{ij}_l
       − (1/(2(k+1))) Σ_{i=1}^{k} [ Σ_{j≠i} Σ_{l≠i} Σ_{p=1}^{m_i} Σ_{q=1}^{m_i} u^{ij}_p u^{il}_q A^i_p A^{iT}_q
         − 2 Σ_{j≠i} Σ_{l≠i} Σ_{p=1}^{m_j} Σ_{q=1}^{m_i} u^{ji}_p u^{il}_q A^j_p A^{iT}_q
         + Σ_{j≠i} Σ_{l≠i} Σ_{p=1}^{m_j} Σ_{q=1}^{m_l} u^{ji}_p u^{li}_q A^j_p A^{lT}_q ]
s.t.  Σ_{j≠i} Σ_{l=1}^{m_i} u^{ij}_l − Σ_{j≠i} Σ_{l=1}^{m_j} u^{ji}_l = 0  for i = 1, ..., k
      u^{ij}_l ≥ 0  for i, j = 1, ..., k, i ≠ j, and l = 1, ..., m_i    (25)

where m_i is the number of points in class i. Recall, for the piecewise-linear classification function, the class of a point x is determined by finding i = 1, ..., k such that

f_i(x) = x^T w^i − γ_i    (26)

is maximized. From equation (23),

w = [w^1; w^2; ...; w^k] = (1/(k+1)) Ā^T u

Solving for w^i in summation notation we get:

w^i = (1/(k+1)) [ Σ_{j≠i} Σ_{p=1}^{m_i} u^{ij}_p A^{iT}_p − Σ_{j≠i} Σ_{p=1}^{m_j} u^{ji}_p A^{jT}_p ]

Therefore,

f_i(x) = (1/(k+1)) [ Σ_{j≠i} Σ_{p=1}^{m_i} u^{ij}_p x^T A^{iT}_p − Σ_{j≠i} Σ_{p=1}^{m_j} u^{ji}_p x^T A^{jT}_p ] − γ_i

4 Formulation of M-SVM: Piecewise-nonlinearly Separable Case

Just like in the two-class case, M-SVM can be generalized to piecewise-nonlinear functions. To construct the separating functions f_i(x) in a higher dimensional feature space, the original data points x are transformed by some function φ(x): R^n → R^{n'} [23, 8]. The function f_i(x) is now related to the sum of dot products of vectors in this higher dimensional feature space:

f_i(x) = (1/(k+1)) [ Σ_{j≠i} Σ_{p=1}^{m_i} u^{ij}_p (φ(x) · φ(A^{iT}_p)) − Σ_{j≠i} Σ_{p=1}^{m_j} u^{ji}_p (φ(x) · φ(A^{jT}_p)) ] − γ_i

According to [23], any symmetric function K(x, x_i) ∈ L_2 that satisfies Mercer's Theorem [9] can replace the dot product (φ(x) · φ(x_i)). Mercer's Theorem guarantees that any eigenvalue λ_j in the expansion K(x, x_i) = Σ_j λ_j (φ_j(x) · φ_j(x_i)) is positive. This is a sufficient condition for a function K(x, x_i) to define a dot product in the higher dimensional feature space. Therefore we let K(x, x_i) = (φ(x) · φ(x_i)).

Returning to dual Problem (25), the objective function contains dot products A^j_p A^{iT}_q of pairs of points in the original feature space. To transform the points to a higher dimensional feature space we replace these dot products by K(A^{jT}_p, A^{iT}_q). The resulting M-SVM for piecewise-linearly separable datasets is:

max_u  Σ_{i=1}^{k} Σ_{j≠i} Σ_{l=1}^{m_i} u^{ij}_l
       − (1/(2(k+1))) Σ_{i=1}^{k} [ Σ_{j≠i} Σ_{l≠i} Σ_{p=1}^{m_i} Σ_{q=1}^{m_i} u^{ij}_p u^{il}_q K(A^{iT}_p, A^{iT}_q)
         − 2 Σ_{j≠i} Σ_{l≠i} Σ_{p=1}^{m_j} Σ_{q=1}^{m_i} u^{ji}_p u^{il}_q K(A^{jT}_p, A^{iT}_q)
         + Σ_{j≠i} Σ_{l≠i} Σ_{p=1}^{m_j} Σ_{q=1}^{m_l} u^{ji}_p u^{li}_q K(A^{jT}_p, A^{lT}_q) ]
s.t.  Σ_{j≠i} Σ_{l=1}^{m_i} u^{ij}_l − Σ_{j≠i} Σ_{l=1}^{m_j} u^{ji}_l = 0  for i = 1, ..., k
      u^{ij}_l ≥ 0  for i, j = 1, ..., k, i ≠ j, and l = 1, ..., m_i    (27)

The points A^i_l corresponding to nonzero dual variables u^{ij}_l, j = 1, ..., k, j ≠ i, are referred to as support vectors. It is possible for A^i_l to correspond with more than one nonzero variable u^{ij}_l, j = 1, ..., k, j ≠ i.
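For the linear kernel, dual problem (24) and its bounded variant (30) below can be handed to an off-the-shelf convex solver. The following sketch uses cvxpy and the build_Abar_Ebar helper from the earlier snippet; it is an illustration under those assumptions, not the reduced-gradient MINOS approach used in Section 6.

```python
# A hedged cvxpy sketch of the linear-kernel M-SVM dual; with the upper
# bound (1 - lam)/lam it is problem (30), and letting lam -> 0 removes
# the bound and recovers the separable dual (24).
import cvxpy as cp
import numpy as np

def msvm_dual(Abar, Ebar, k, lam=0.5):
    m = Abar.shape[0]
    u = cp.Variable(m)
    # u' Abar Abar' u is written as ||Abar' u||^2 to keep the problem convex
    obj = cp.Maximize(cp.sum(u) - cp.sum_squares(Abar.T @ u) / (2 * (k + 1)))
    cons = [u >= 0, u <= (1 - lam) / lam, Ebar.T @ u == 0]
    cp.Problem(obj, cons).solve()
    w = Abar.T @ u.value / (k + 1)        # recover w via relationship (23)
    return u.value, w.reshape(k, -1)      # rows are w^1, ..., w^k
```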

Figure 8: Piecewise-polynomial separation of three classes in two dimensions. Support vectors are indicated with circles.

In Figure 8, support vectors are represented by a circle around the point. Some points have double circles, which indicate that two dual variables u^{ij}_l > 0, j = 1, ..., 3, j ≠ i. By the complementarity within the KKT conditions [14], u^{ij}_l > 0 implies A^i_l (w^i − w^j) = (γ_i − γ_j) + 1. Consequently the support vectors are located closest to the separating function. In fact, the remainder of the points, those that are not support vectors, are not necessary in the construction of the separating function. The resulting nonlinear classification problem for a point x is to find i = 1, ..., k such that the classification function

f_i(x) = (1/(k+1)) Σ_{j≠i} [ Σ_{support vectors A^i_p} u^{ij}_p K(x, A^{iT}_p) − Σ_{support vectors A^j_p} u^{ji}_p K(x, A^{jT}_p) ] − γ_i    (28)

is maximized.

5 Formulation of M-SVM: Piecewise Inseparable Case

The preceding sections provided a formulation for the piecewise-linearly and piecewise-nonlinearly separable cases. To construct a classification function for a piecewise-linearly inseparable dataset, we must first choose an error minimization criterion. The technique used in the preceding sections of formulating the M-SVM for piecewise-linearly separable datasets can be combined with the 1-norm error criterion used in Problem (15) of Bennett and Mangasarian [6]. The result is the M-SVM for piecewise-linearly inseparable problems. Using the same matrix notation as in Section 3, we add the terms (1/2)‖Cw‖² and (1/2)‖w‖² to the objective of Problem (15). The resulting primal problem is as follows:

min_{w,γ,y}  λ ((1/2) ‖Cw‖² + (1/2) ‖w‖²) + (1 − λ) e^T y
s.t.  Āw + Ēγ − e + y ≥ 0
      y ≥ 0    (29)

where y = [y^{12T}, y^{13T}, ..., y^{1kT}, y^{21T}, ..., y^{k(k−1)T}]^T and 0 < λ < 1. Solving for the dual, substituting w = (1/(k+1)) Ā^T u, and simplifying produces the following problem:

max_u  u^T e − (1/(2(k+1))) u^T Ā Ā^T u
s.t.  0 ≤ u ≤ ((1 − λ)/λ) e
      Ē^T u = 0    (30)

As shown in Proposition 5.1, Problem (30) maximizes a concave quadratic objective over a bounded polyhedral set. Thus a locally optimal solution is globally optimal.

Proposition 5.1 (Concavity of objective) The function u^T e − (1/(2(k+1))) u^T Ā Ā^T u is concave.

Proof. The matrix Ā Ā^T is always positive semi-definite and symmetric. Thus the Hessian matrix (= −(1/(k+1)) Ā Ā^T) is negative semi-definite. Therefore, the objective is a concave function.

Problem (30) is identical to Problem (24) in the piecewise-linearly separable case except that the dual variables are now bounded above by (1 − λ)/λ. Therefore, transforming the data points A^i proceeds identically as in Section 4. Using the function K(x, x_i) to denote the dot product in some feature space, the final M-SVM results:

max_u  Σ_{i=1}^{k} Σ_{j≠i} Σ_{l=1}^{m_i} u^{ij}_l
       − (1/(2(k+1))) Σ_{i=1}^{k} [ Σ_{j≠i} Σ_{l≠i} Σ_{p=1}^{m_i} Σ_{q=1}^{m_i} u^{ij}_p u^{il}_q K(A^{iT}_p, A^{iT}_q)
         − 2 Σ_{j≠i} Σ_{l≠i} Σ_{p=1}^{m_j} Σ_{q=1}^{m_i} u^{ji}_p u^{il}_q K(A^{jT}_p, A^{iT}_q)
         + Σ_{j≠i} Σ_{l≠i} Σ_{p=1}^{m_j} Σ_{q=1}^{m_l} u^{ji}_p u^{li}_q K(A^{jT}_p, A^{lT}_q) ]
s.t.  Σ_{j≠i} Σ_{l=1}^{m_i} u^{ij}_l − Σ_{j≠i} Σ_{l=1}^{m_j} u^{ji}_l = 0  for i = 1, ..., k
      0 ≤ u^{ij}_l ≤ (1 − λ)/λ  for i, j = 1, ..., k, i ≠ j, and l = 1, ..., m_i    (31)

As in Sections 3 and 4, the class of a point x is determined by finding the maximum of the functions

f_i(x) = (1/(k+1)) Σ_{j≠i} [ Σ_{support vectors A^i_p} u^{ij}_p K(x, A^{iT}_p) − Σ_{support vectors A^j_p} u^{ji}_p K(x, A^{jT}_p) ] − γ_i    (32)

for i = 1, ..., k. To determine the threshold values γ_i, i = 1, ..., k, we solve the primal problem (29) with w fixed, where Āw is transformed to the higher dimensional feature space. This problem is as follows:

min_{γ,y}  Σ_{i=1}^{k} Σ_{j≠i} e^T y^{ij}
s.t.  y^{ij}_q ≥ (γ_i − γ_j) + 1
        − (1/(k+1)) [ Σ_{l≠i} Σ_{r=1}^{m_i} K(A^{iT}_q, A^{iT}_r) u^{il}_r − Σ_{l≠i} Σ_{r=1}^{m_l} K(A^{iT}_q, A^{lT}_r) u^{li}_r ]
        + (1/(k+1)) [ Σ_{l≠j} Σ_{r=1}^{m_j} K(A^{iT}_q, A^{jT}_r) u^{jl}_r − Σ_{l≠j} Σ_{r=1}^{m_l} K(A^{iT}_q, A^{lT}_r) u^{lj}_r ]
      y^{ij}_q ≥ 0,  i, j = 1, ..., k, i ≠ j, q = 1, ..., m_i    (33)

The right side of the constraints is constant. Thus Problem (33) is a linear program and is easily solved.

6 Computational Experiments

In this section, we present computational results comparing M-SVM (32), M-RLP (15), k-SVM using SVM (10), and k-RLP using RLP (8). Several experiments on real-world datasets are reported. A description of each of the datasets follows this paragraph.

Each of these methods was implemented using the MINOS 5.4 [17] solver. The quadratic programming problems for M-SVM and k-SVM were solved using the nonlinear solver implemented in MINOS 5.4. This solver uses a reduced-gradient algorithm in conjunction with a quasi-Newton method. In M-SVM, k-SVM and M-RLP, the selected values for λ are given. Better solutions may result with different choices of λ. Additionally, it is not necessary for the same value of λ to be used for both methods. The kernel function for the piecewise-nonlinear M-SVM and k-SVM methods is K(x, x_i) = ((x^T x_i)/n + 1)^d, where d is the degree of the desired polynomial.

Wine Recognition Data. The Wine dataset [1] uses the chemical analysis of wine to determine the cultivar. There are 178 points with 13 features. This is a three class dataset distributed as follows: 59 points in class 1, 71 points in class 2, and 48 points in class 3. This dataset is available via anonymous file transfer protocol (ftp) from the UCI Repository of Machine Learning Databases and Domain Theories [16] at ftp://ftp.ics.uci.edu/pub/machine-learning-databases.

Glass Identification Database. The Glass dataset [11] is used to identify the origin of a sample of glass through chemical analysis. This dataset is comprised of six classes of 214 points with 9 features. The distribution of points by class is as follows: 70 float processed building windows, 17 float processed vehicle windows, 76 non-float processed building windows, 13 containers, 9 tableware, and 29 headlamps. This dataset is also available from the UCI Repository [16] at ftp://ftp.ics.uci.edu/pub/machine-learning-databases.

US Postal Service Database. The USPS Database [10] contains zipcode samples from actual mail. This database is comprised of separate training and testing sets. There are 7291 samples in the training set and 2007 samples in the testing set. Each sample belongs to one of ten classes: the integers 0 through 9. The samples are represented by 256 features.

Two experiments were performed. In the first, the datasets were normalized between −1 and 1, and 10-fold cross validation was used to estimate generalization on future data. The second experiment was conducted on two subsets of the United States Postal Service (USPS) data. This data contains handwriting samples of the integers 0 through 9. The objective of this dataset is to quickly and effectively interpret zipcodes. This data has separate training and testing sets, each of which consist of the ten integer classes. We compiled two individual training subsets from the USPS training data. The first subset contains 1756 examples, each belonging to one of the classes 3, 5, and 8. We call this set the USPS-1 training data. The second subset contains 1961 examples, each belonging to one of the classes 4, 6, and 7. We call this set the USPS-2 training data. Similarly, two subsets are created from the testing data. In all of these datasets, the data values are scaled by 200. Testing set accuracies are reported for all four methods. The total numbers of unique support vectors in the resulting classification functions for the M-SVM and k-SVM methods are given.

Table 1 contains results for M-RLP, k-RLP, M-SVM, and k-SVM on the Wine and Glass datasets. As anticipated, adding the regularization term to the degree one problem in M-SVM produced better testing generalization than M-RLP on the Wine dataset.

Data    Method    Degree 1   Degree 2   Degree 3   Degree 4   Degree 5
Wine    M-RLP
        k-RLP
        M-SVM     (378)      (29)       (258)      (239)      (228)
        k-SVM     (537)      (424)      (405)      (394)      (43)
Glass   M-RLP
        k-RLP
        M-SVM     (759)      (660)      (595)      (533)      (476)
        k-SVM     (898)      (854)      (796)      (769)      (734)

Table 1: Percent testing set accuracies and (total number of support vectors) for M-SVM and k-SVM. λ = 0.5 for k-RLP, M-SVM, and k-SVM.

The Wine dataset is piecewise-linearly separable. Therefore, the M-RLP method has infinitely many optimal solutions. However, the testing accuracy for M-SVM with degree one on the Glass data was much lower than the M-RLP accuracy. This may indicate that the choice of λ is too large. However, as the degree increases, the accuracy of the M-SVM method improves and exceeds the M-RLP results. The k-SVM method generalized surprisingly well. The testing accuracies reported for k-SVM on the Wine dataset are higher than those of M-SVM. The linear k-RLP method performed just as well as the quadratic k-SVM program on the Wine dataset and better than the M-SVM and M-RLP methods. On the Glass data, as the degree increases, both methods, M-SVM and k-SVM, improve dramatically in testing accuracy. Using higher degree polynomials, the M-SVM and k-SVM methods surpass the accuracies of M-RLP and k-RLP. This demonstrates the potential of polynomial and piecewise-polynomial classification functions over linear and piecewise-linear functions.

Table 2 contains results for the four methods on the USPS data subsets. Similar observations as above can be made. Both of these datasets are piecewise-linearly separable. The solution that M-RLP found for each of these datasets tests significantly lower than the other methods. The k-SVM method generalizes slightly better than M-SVM. The k-RLP method reports similar accuracies as the k-SVM method. Additionally, it solves linear programs rather than quadratic programs, so the computational training time is significantly smaller than the other methods. Changing the parameter λ may further improve generalization. The M-SVM method consistently finds classification functions using fewer support vectors than those of k-SVM.

Data     Method    Degree 1   Degree 2   Degree 3   Degree 4   Degree 5
USPS-1   M-RLP
         k-RLP
         M-SVM     (45)       (327)      (32)       (305)      (37)
         k-SVM     (666)      (557)      (54)       (59)       (56)
USPS-2   M-RLP
         k-RLP
         M-SVM     (228)      (85)       (67)       (66)       (80)
         k-SVM     (383)      (33)       (303)      (294)      (289)

Table 2: Percent testing set accuracies and (total number of support vectors) for M-SVM and k-SVM. λ = 0.5 for k-SVM and λ = 0.3 for k-RLP and M-SVM.

Degree   M-RLP   k-RLP   M-SVM   k-SVM

Table 3: Total computational training time (in seconds) for M-RLP, k-RLP, M-SVM, and k-SVM on USPS-1.

With fewer support vectors, a sample can be classified more quickly since the dot-product of the sample with each support vector must be computed. Thus the M-SVM would be a good method to choose when classification time is critical. CPU times for training all four methods on the USPS-1 dataset are reported in Table 3. The times for all the datasets are not listed because the programs were run using a batch system on clusters of machines, so the timing was not reliable. However, the trends were clear. The k-RLP method is significantly faster than the other methods. In the M-SVM and k-SVM methods, as the degree increased the computational time would decrease, and then after a certain degree is reached it would increase. The degree of the polynomial for which it starts to increase varies by dataset. Surprisingly, for the USPS datasets the k-SVM method was faster than the M-RLP method. This was not the case for the Wine and Glass datasets. The M-RLP method had faster training times than k-SVM for these datasets. The times reported are for IBM RS/6000 model 590 workstations with 128 MB RAM.

7 Conclusions

We have examined four methods for the solution of multicategory discrimination problems based on the LP methods of Mangasarian and the QP methods for SVM of Vapnik. The two-class methods, RLP and SVM, differ only in the norm of the regularization term. In the past, two different approaches had been used for the k > 2 class case. The method we called k-SVM constructed k two-class discriminants using k quadratic programs. The resulting classifier was a piecewise-linear or piecewise-nonlinear discriminant function depending on what kernel function was used in the SVM. The original multicategory RLP for k classes constructed a piecewise-linear discriminant using a single linear program. We proposed two new hybrid approaches. Like the k-SVM method, k-RLP uses LP to construct k two-class discriminants. We also formulated a new approach, M-SVM. We began the formulation by adding regularization terms to M-RLP. Then, like k-SVM with piecewise-nonlinear discriminants, the nonlinear pieces are found by mapping the original data points into a higher dimensional feature space. This transformation appeared in the dual problem as an inner product of two points in the higher dimensional space. A generalized inner product was used to make the problem tractable. The new M-SVM method requires the solution of a single quadratic program.

We performed a computational study of the four methods on four datasets. In general, we found that k-SVM and k-RLP generalized best. However, M-SVM used fewer support vectors, a counter-intuitive result since for the two-class case Statistical Learning Theory predicts that fewer support vectors should result in better generalization. The theoretical justification of the better generalization of k-SVM and k-RLP over M-SVM and M-RLP is an open question. The k-RLP method provided accurate and efficient results on the piecewise-linearly separable datasets. The k-SVM method also tested surprisingly well but requires the solution of k quadratic programs. The M-SVM method used fewer support vectors, thus providing solutions with smaller classification time. On the piecewise-linearly inseparable dataset, the polynomial and piecewise-polynomial classifiers provided an improvement over the M-RLP and k-RLP methods. On the other datasets, the k-RLP method found solutions that generalized best or nearly best in less computational time.

A Matrix Representations for Multicategory Support Vector Machines

This appendix contains the definitions of the matrices used for the general k-class SVM formulation (19):

min_{w,γ}  (1/2) ‖Cw‖² + (1/2) ‖w‖²
s.t.  Āw + Ēγ − e ≥ 0

Let

C = [ I  −I   0  ...   0 ]
    [ I   0  −I  ...   0 ]
    [ ...               ]
    [ I   0   0  ...  −I ]
    [ 0   I  −I  ...   0 ]
    [ ...               ]
    [ 0  ...  0   I  −I ]    (34)

where I ∈ R^{n×n} is the identity matrix. The matrix C has n Σ_{i=2}^{k} (i − 1) rows and kn columns.

Let

Ā = [ A^1  −A^1    0   ...    0  ]
    [ A^1    0   −A^1  ...    0  ]
    [ ...                        ]
    [ A^1    0     0   ...  −A^1 ]
    [−A^2   A^2    0   ...    0  ]
    [  0    A^2  −A^2  ...    0  ]
    [ ...                        ]
    [  0    A^2    0   ...  −A^2 ]
    [ ...                        ]
    [−A^k    0     0   ...   A^k ]
    [  0   −A^k    0   ...   A^k ]
    [ ...                        ]
    [  0    ...    0  −A^k   A^k ]    (35)

where A^i ∈ R^{m_i × n}. The matrix Ā has (k − 1) Σ_{i=1}^{k} m_i rows and kn columns.

Let

Ē = [−e^1   e^1    0   ...    0  ]
    [−e^1    0    e^1  ...    0  ]
    [ ...                        ]
    [−e^1    0     0   ...   e^1 ]
    [ e^2  −e^2    0   ...    0  ]
    [  0   −e^2   e^2  ...    0  ]
    [ ...                        ]
    [  0   −e^2    0   ...   e^2 ]
    [ ...                        ]
    [ e^k    0     0   ...  −e^k ]
    [  0    e^k    0   ...  −e^k ]
    [ ...                        ]
    [  0    ...    0   e^k  −e^k ]

where e^i ∈ R^{m_i} is a vector of ones. The matrix Ē has (k − 1) Σ_{i=1}^{k} m_i rows and k columns.

References

[1] S. Aeberhard, D. Coomans, and O. de Vel. Comparison of classifiers in high dimensional settings. Technical Report 92-02, Departments of Computer Science and of Mathematics and Statistics, James Cook University of North Queensland, 1992.

[2] K. P. Bennett. Decision tree construction via linear programming. In M. Evans, editor, Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society Conference, pages 97-101, Utica, Illinois, 1992.

[3] K. P. Bennett and E. J. Bredensteiner. Geometry in learning. In C. Gorini, E. Hart, W. Meyer, and T. Philips, editors, Geometry at Work, Washington, DC, 1998. Mathematical Association of America. To appear.

[4] K. P. Bennett and O. L. Mangasarian. Neural network training via linear programming. In P. M. Pardalos, editor, Advances in Optimization and Parallel Computing, pages 56-67, Amsterdam, 1992. North Holland.

[5] K. P. Bennett and O. L. Mangasarian. Multicategory discrimination via linear programming. Optimization Methods and Software, 3:27-39, 1994.

[6] K. P. Bennett and O. L. Mangasarian. Serial and parallel multicategory discrimination. SIAM Journal on Optimization, 4(4), 1994.

[7] V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter. Comparison of view-based object recognition algorithms using realistic 3D models. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN'96, Berlin, 1996. Springer Lecture Notes in Computer Science Vol. 1112.

[8] C. Cortes and V. N. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

[9] R. Courant and D. Hilbert. Methods of Mathematical Physics. J. Wiley, New York, 1953.

[10] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. J. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541-551, 1989.

[11] I. W. Evett and E. J. Spiehler. Rule induction in forensic science. Technical report, Central Research Establishment, Home Office Forensic Science Service, Aldermaston, Reading, Berkshire RG7 4PN, 1987.

[12] O. L. Mangasarian. Linear and nonlinear separation of patterns by linear programming. Operations Research, 13:444-452, 1965.

[13] O. L. Mangasarian. Multi-surface method of pattern separation. IEEE Transactions on Information Theory, IT-14:801-807, 1968.

[14] O. L. Mangasarian. Nonlinear Programming. McGraw-Hill, New York, 1969.

[15] O. L. Mangasarian. Mathematical programming in machine learning. In G. DiPillo and F. Giannessi, editors, Proceedings of Nonlinear Optimization and Applications Workshop, New York, 1996. Plenum Press.

[16] P. M. Murphy and D. W. Aha. UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Department of Information and Computer Science, University of California, Irvine, California, 1994.

[17] B. A. Murtagh and M. A. Saunders. MINOS 5.4 user's guide. Technical Report SOL 83-20, Stanford University, 1993.

[18] A. Roy, S. Govil, and R. Miranda. An algorithm to generate radial basis function (RBF)-like nets for classification problems. Neural Networks, 8(2):179-202, 1995.

[19] A. Roy, L. S. Kim, and S. Mukhopadhyay. A polynomial time algorithm for the construction and training of a class of multilayer perceptrons. Neural Networks, 6, 1993.

[20] A. Roy and S. Mukhopadhyay. Pattern classification using linear programming. ORSA Journal on Computing, 3:66-80, 1990.

[21] B. Schölkopf, C. Burges, and V. Vapnik. Incorporating invariances in support vector machines. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN'96, pages 47-52, Berlin, 1996. Springer Lecture Notes in Computer Science Vol. 1112.

[22] B. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. AI Memo No. 1599; CBCL Paper No. 142, Massachusetts Institute of Technology, Cambridge, 1996.

[23] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.

[24] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1996.

[25] V. N. Vapnik and A. Ja. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974. In Russian.

[26] W. H. Wolberg and O. L. Mangasarian. Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, USA, 87:9193-9196, 1990.


More information

Stochastic Variational Inference with Gradient Linearization

Stochastic Variational Inference with Gradient Linearization Stochastic Variationa Inference with Gradient Linearization Suppementa Materia Tobias Pötz * Anne S Wannenwetsch Stefan Roth Department of Computer Science, TU Darmstadt Preface In this suppementa materia,

More information

Problem set 6 The Perron Frobenius theorem.

Problem set 6 The Perron Frobenius theorem. Probem set 6 The Perron Frobenius theorem. Math 22a4 Oct 2 204, Due Oct.28 In a future probem set I want to discuss some criteria which aow us to concude that that the ground state of a sef-adjoint operator

More information

LECTURE NOTES 9 TRACELESS SYMMETRIC TENSOR APPROACH TO LEGENDRE POLYNOMIALS AND SPHERICAL HARMONICS

LECTURE NOTES 9 TRACELESS SYMMETRIC TENSOR APPROACH TO LEGENDRE POLYNOMIALS AND SPHERICAL HARMONICS MASSACHUSETTS INSTITUTE OF TECHNOLOGY Physics Department Physics 8.07: Eectromagnetism II October 7, 202 Prof. Aan Guth LECTURE NOTES 9 TRACELESS SYMMETRIC TENSOR APPROACH TO LEGENDRE POLYNOMIALS AND SPHERICAL

More information

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron Neura Information Processing - Letters and Reviews Vo. 5, No. 2, November 2004 LETTER A Soution to the 4-bit Parity Probem with a Singe Quaternary Neuron Tohru Nitta Nationa Institute of Advanced Industria

More information

VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES

VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES SIAM J. NUMER. ANAL. Vo. 0, No. 0, pp. 000 000 c 200X Society for Industria and Appied Mathematics VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES SARAH DAY, JEAN-PHILIPPE LESSARD, AND KONSTANTIN MISCHAIKOW

More information

Convergence Property of the Iri-Imai Algorithm for Some Smooth Convex Programming Problems

Convergence Property of the Iri-Imai Algorithm for Some Smooth Convex Programming Problems Convergence Property of the Iri-Imai Agorithm for Some Smooth Convex Programming Probems S. Zhang Communicated by Z.Q. Luo Assistant Professor, Department of Econometrics, University of Groningen, Groningen,

More information

Kernel pea and De-Noising in Feature Spaces

Kernel pea and De-Noising in Feature Spaces Kerne pea and De-Noising in Feature Spaces Sebastian Mika, Bernhard Schokopf, Aex Smoa Kaus-Robert Muer, Matthias Schoz, Gunnar Riitsch GMD FIRST, Rudower Chaussee 5, 12489 Berin, Germany {mika, bs, smoa,

More information

MARKOV CHAINS AND MARKOV DECISION THEORY. Contents

MARKOV CHAINS AND MARKOV DECISION THEORY. Contents MARKOV CHAINS AND MARKOV DECISION THEORY ARINDRIMA DATTA Abstract. In this paper, we begin with a forma introduction to probabiity and expain the concept of random variabes and stochastic processes. After

More information

Determining The Degree of Generalization Using An Incremental Learning Algorithm

Determining The Degree of Generalization Using An Incremental Learning Algorithm Determining The Degree of Generaization Using An Incrementa Learning Agorithm Pabo Zegers Facutad de Ingeniería, Universidad de os Andes San Caros de Apoquindo 22, Las Condes, Santiago, Chie pzegers@uandes.c

More information

Power Control and Transmission Scheduling for Network Utility Maximization in Wireless Networks

Power Control and Transmission Scheduling for Network Utility Maximization in Wireless Networks ower Contro and Transmission Scheduing for Network Utiity Maximization in Wireess Networks Min Cao, Vivek Raghunathan, Stephen Hany, Vinod Sharma and. R. Kumar Abstract We consider a joint power contro

More information

(f) is called a nearly holomorphic modular form of weight k + 2r as in [5].

(f) is called a nearly holomorphic modular form of weight k + 2r as in [5]. PRODUCTS OF NEARLY HOLOMORPHIC EIGENFORMS JEFFREY BEYERL, KEVIN JAMES, CATHERINE TRENTACOSTE, AND HUI XUE Abstract. We prove that the product of two neary hoomorphic Hece eigenforms is again a Hece eigenform

More information

Moreau-Yosida Regularization for Grouped Tree Structure Learning

Moreau-Yosida Regularization for Grouped Tree Structure Learning Moreau-Yosida Reguarization for Grouped Tree Structure Learning Jun Liu Computer Science and Engineering Arizona State University J.Liu@asu.edu Jieping Ye Computer Science and Engineering Arizona State

More information

4 1-D Boundary Value Problems Heat Equation

4 1-D Boundary Value Problems Heat Equation 4 -D Boundary Vaue Probems Heat Equation The main purpose of this chapter is to study boundary vaue probems for the heat equation on a finite rod a x b. u t (x, t = ku xx (x, t, a < x < b, t > u(x, = ϕ(x

More information

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones ASummaryofGaussianProcesses Coryn A.L. Baier-Jones Cavendish Laboratory University of Cambridge caj@mrao.cam.ac.uk Introduction A genera prediction probem can be posed as foows. We consider that the variabe

More information

CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION

CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION SAHAR KARIMI AND STEPHEN VAVASIS Abstract. In this paper we present a variant of the conjugate gradient (CG) agorithm in which we invoke a subspace minimization

More information

Universal Consistency of Multi-Class Support Vector Classification

Universal Consistency of Multi-Class Support Vector Classification Universa Consistency of Muti-Cass Support Vector Cassification Tobias Gasmachers Dae Moe Institute for rtificia Inteigence IDSI, 6928 Manno-Lugano, Switzerand tobias@idsia.ch bstract Steinwart was the

More information

4 Separation of Variables

4 Separation of Variables 4 Separation of Variabes In this chapter we describe a cassica technique for constructing forma soutions to inear boundary vaue probems. The soution of three cassica (paraboic, hyperboic and eiptic) PDE

More information

XSAT of linear CNF formulas

XSAT of linear CNF formulas XSAT of inear CN formuas Bernd R. Schuh Dr. Bernd Schuh, D-50968 Kön, Germany; bernd.schuh@netcoogne.de eywords: compexity, XSAT, exact inear formua, -reguarity, -uniformity, NPcompeteness Abstract. Open

More information

How the backpropagation algorithm works Srikumar Ramalingam School of Computing University of Utah

How the backpropagation algorithm works Srikumar Ramalingam School of Computing University of Utah How the backpropagation agorithm works Srikumar Ramaingam Schoo of Computing University of Utah Reference Most of the sides are taken from the second chapter of the onine book by Michae Nieson: neuranetworksanddeepearning.com

More information

Paragraph Topic Classification

Paragraph Topic Classification Paragraph Topic Cassification Eugene Nho Graduate Schoo of Business Stanford University Stanford, CA 94305 enho@stanford.edu Edward Ng Department of Eectrica Engineering Stanford University Stanford, CA

More information

Separation of Variables and a Spherical Shell with Surface Charge

Separation of Variables and a Spherical Shell with Surface Charge Separation of Variabes and a Spherica She with Surface Charge In cass we worked out the eectrostatic potentia due to a spherica she of radius R with a surface charge density σθ = σ cos θ. This cacuation

More information

PHYS 110B - HW #1 Fall 2005, Solutions by David Pace Equations referenced as Eq. # are from Griffiths Problem statements are paraphrased

PHYS 110B - HW #1 Fall 2005, Solutions by David Pace Equations referenced as Eq. # are from Griffiths Problem statements are paraphrased PHYS 110B - HW #1 Fa 2005, Soutions by David Pace Equations referenced as Eq. # are from Griffiths Probem statements are paraphrased [1.] Probem 6.8 from Griffiths A ong cyinder has radius R and a magnetization

More information

Formulas for Angular-Momentum Barrier Factors Version II

Formulas for Angular-Momentum Barrier Factors Version II BNL PREPRINT BNL-QGS-06-101 brfactor1.tex Formuas for Anguar-Momentum Barrier Factors Version II S. U. Chung Physics Department, Brookhaven Nationa Laboratory, Upton, NY 11973 March 19, 2015 abstract A

More information

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM MIKAEL NILSSON, MATTIAS DAHL AND INGVAR CLAESSON Bekinge Institute of Technoogy Department of Teecommunications and Signa Processing

More information

A Statistical Framework for Real-time Event Detection in Power Systems

A Statistical Framework for Real-time Event Detection in Power Systems 1 A Statistica Framework for Rea-time Event Detection in Power Systems Noan Uhrich, Tim Christman, Phiip Swisher, and Xichen Jiang Abstract A quickest change detection (QCD) agorithm is appied to the probem

More information

An approximate method for solving the inverse scattering problem with fixed-energy data

An approximate method for solving the inverse scattering problem with fixed-energy data J. Inv. I-Posed Probems, Vo. 7, No. 6, pp. 561 571 (1999) c VSP 1999 An approximate method for soving the inverse scattering probem with fixed-energy data A. G. Ramm and W. Scheid Received May 12, 1999

More information

Gauss Law. 2. Gauss s Law: connects charge and field 3. Applications of Gauss s Law

Gauss Law. 2. Gauss s Law: connects charge and field 3. Applications of Gauss s Law Gauss Law 1. Review on 1) Couomb s Law (charge and force) 2) Eectric Fied (fied and force) 2. Gauss s Law: connects charge and fied 3. Appications of Gauss s Law Couomb s Law and Eectric Fied Couomb s

More information

BALANCING REGULAR MATRIX PENCILS

BALANCING REGULAR MATRIX PENCILS BALANCING REGULAR MATRIX PENCILS DAMIEN LEMONNIER AND PAUL VAN DOOREN Abstract. In this paper we present a new diagona baancing technique for reguar matrix pencis λb A, which aims at reducing the sensitivity

More information

Some Measures for Asymmetry of Distributions

Some Measures for Asymmetry of Distributions Some Measures for Asymmetry of Distributions Georgi N. Boshnakov First version: 31 January 2006 Research Report No. 5, 2006, Probabiity and Statistics Group Schoo of Mathematics, The University of Manchester

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

arxiv: v2 [cond-mat.stat-mech] 14 Nov 2008

arxiv: v2 [cond-mat.stat-mech] 14 Nov 2008 Random Booean Networks Barbara Drosse Institute of Condensed Matter Physics, Darmstadt University of Technoogy, Hochschustraße 6, 64289 Darmstadt, Germany (Dated: June 27) arxiv:76.335v2 [cond-mat.stat-mech]

More information

Chapter 1 Decomposition methods for Support Vector Machines

Chapter 1 Decomposition methods for Support Vector Machines Chapter 1 Decomposition methods for Support Vector Machines Support Vector Machines (SVM) are widey used as a simpe and efficient too for inear and noninear cassification as we as for regression probems.

More information

Symbolic models for nonlinear control systems using approximate bisimulation

Symbolic models for nonlinear control systems using approximate bisimulation Symboic modes for noninear contro systems using approximate bisimuation Giordano Poa, Antoine Girard and Pauo Tabuada Abstract Contro systems are usuay modeed by differentia equations describing how physica

More information

Discrete Techniques. Chapter Introduction

Discrete Techniques. Chapter Introduction Chapter 3 Discrete Techniques 3. Introduction In the previous two chapters we introduced Fourier transforms of continuous functions of the periodic and non-periodic (finite energy) type, we as various

More information

Algorithms to solve massively under-defined systems of multivariate quadratic equations

Algorithms to solve massively under-defined systems of multivariate quadratic equations Agorithms to sove massivey under-defined systems of mutivariate quadratic equations Yasufumi Hashimoto Abstract It is we known that the probem to sove a set of randomy chosen mutivariate quadratic equations

More information

A Ridgelet Kernel Regression Model using Genetic Algorithm

A Ridgelet Kernel Regression Model using Genetic Algorithm A Ridgeet Kerne Regression Mode using Genetic Agorithm Shuyuan Yang, Min Wang, Licheng Jiao * Institute of Inteigence Information Processing, Department of Eectrica Engineering Xidian University Xi an,

More information

An explicit Jordan Decomposition of Companion matrices

An explicit Jordan Decomposition of Companion matrices An expicit Jordan Decomposition of Companion matrices Fermín S V Bazán Departamento de Matemática CFM UFSC 88040-900 Forianópois SC E-mai: fermin@mtmufscbr S Gratton CERFACS 42 Av Gaspard Coriois 31057

More information

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION Hsiao-Chang Chen Dept. of Systems Engineering University of Pennsyvania Phiadephia, PA 904-635, U.S.A. Chun-Hung Chen

More information

MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES

MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES MATH 172: MOTIVATION FOR FOURIER SERIES: SEPARATION OF VARIABLES Separation of variabes is a method to sove certain PDEs which have a warped product structure. First, on R n, a inear PDE of order m is

More information

Another Look at Linear Programming for Feature Selection via Methods of Regularization 1

Another Look at Linear Programming for Feature Selection via Methods of Regularization 1 Another Look at Linear Programming for Feature Seection via Methods of Reguarization Yonggang Yao, The Ohio State University Yoonkyung Lee, The Ohio State University Technica Report No. 800 November, 2007

More information

A proposed nonparametric mixture density estimation using B-spline functions

A proposed nonparametric mixture density estimation using B-spline functions A proposed nonparametric mixture density estimation using B-spine functions Atizez Hadrich a,b, Mourad Zribi a, Afif Masmoudi b a Laboratoire d Informatique Signa et Image de a Côte d Opae (LISIC-EA 4491),

More information

Mixed Volume Computation, A Revisit

Mixed Volume Computation, A Revisit Mixed Voume Computation, A Revisit Tsung-Lin Lee, Tien-Yien Li October 31, 2007 Abstract The superiority of the dynamic enumeration of a mixed ces suggested by T Mizutani et a for the mixed voume computation

More information

Efficient Generation of Random Bits from Finite State Markov Chains

Efficient Generation of Random Bits from Finite State Markov Chains Efficient Generation of Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization Scaabe Spectrum ocation for Large Networks ased on Sparse Optimization innan Zhuang Modem R&D Lab Samsung Semiconductor, Inc. San Diego, C Dongning Guo, Ermin Wei, and Michae L. Honig Department of Eectrica

More information

Data Mining Technology for Failure Prognostic of Avionics

Data Mining Technology for Failure Prognostic of Avionics IEEE Transactions on Aerospace and Eectronic Systems. Voume 38, #, pp.388-403, 00. Data Mining Technoogy for Faiure Prognostic of Avionics V.A. Skormin, Binghamton University, Binghamton, NY, 1390, USA

More information

BP neural network-based sports performance prediction model applied research

BP neural network-based sports performance prediction model applied research Avaiabe onine www.jocpr.com Journa of Chemica and Pharmaceutica Research, 204, 6(7:93-936 Research Artice ISSN : 0975-7384 CODEN(USA : JCPRC5 BP neura networ-based sports performance prediction mode appied

More information

High Spectral Resolution Infrared Radiance Modeling Using Optimal Spectral Sampling (OSS) Method

High Spectral Resolution Infrared Radiance Modeling Using Optimal Spectral Sampling (OSS) Method High Spectra Resoution Infrared Radiance Modeing Using Optima Spectra Samping (OSS) Method J.-L. Moncet and G. Uymin Background Optima Spectra Samping (OSS) method is a fast and accurate monochromatic

More information

Appendix of the Paper The Role of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Model

Appendix of the Paper The Role of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Model Appendix of the Paper The Roe of No-Arbitrage on Forecasting: Lessons from a Parametric Term Structure Mode Caio Ameida cameida@fgv.br José Vicente jose.vaentim@bcb.gov.br June 008 1 Introduction In this

More information

Componentwise Determination of the Interval Hull Solution for Linear Interval Parameter Systems

Componentwise Determination of the Interval Hull Solution for Linear Interval Parameter Systems Componentwise Determination of the Interva Hu Soution for Linear Interva Parameter Systems L. V. Koev Dept. of Theoretica Eectrotechnics, Facuty of Automatics, Technica University of Sofia, 1000 Sofia,

More information

Fitting affine and orthogonal transformations between two sets of points

Fitting affine and orthogonal transformations between two sets of points Mathematica Communications 9(2004), 27-34 27 Fitting affine and orthogona transformations between two sets of points Hemuth Späth Abstract. Let two point sets P and Q be given in R n. We determine a transation

More information

(This is a sample cover image for this issue. The actual cover is not yet available at this time.)

(This is a sample cover image for this issue. The actual cover is not yet available at this time.) (This is a sampe cover image for this issue The actua cover is not yet avaiabe at this time) This artice appeared in a journa pubished by Esevier The attached copy is furnished to the author for interna

More information

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance Send Orders for Reprints to reprints@benthamscience.ae 340 The Open Cybernetics & Systemics Journa, 015, 9, 340-344 Open Access Research of Data Fusion Method of Muti-Sensor Based on Correation Coefficient

More information

Lecture 17 - The Secrets we have Swept Under the Rug

Lecture 17 - The Secrets we have Swept Under the Rug Lecture 17 - The Secrets we have Swept Under the Rug Today s ectures examines some of the uirky features of eectrostatics that we have negected up unti this point A Puzze... Let s go back to the basics

More information

Kernel Trick Embedded Gaussian Mixture Model

Kernel Trick Embedded Gaussian Mixture Model Kerne Trick Embedded Gaussian Mixture Mode Jingdong Wang, Jianguo Lee, and Changshui Zhang State Key Laboratory of Inteigent Technoogy and Systems Department of Automation, Tsinghua University Beijing,

More information

Source and Relay Matrices Optimization for Multiuser Multi-Hop MIMO Relay Systems

Source and Relay Matrices Optimization for Multiuser Multi-Hop MIMO Relay Systems Source and Reay Matrices Optimization for Mutiuser Muti-Hop MIMO Reay Systems Yue Rong Department of Eectrica and Computer Engineering, Curtin University, Bentey, WA 6102, Austraia Abstract In this paper,

More information

Combining reaction kinetics to the multi-phase Gibbs energy calculation

Combining reaction kinetics to the multi-phase Gibbs energy calculation 7 th European Symposium on Computer Aided Process Engineering ESCAPE7 V. Pesu and P.S. Agachi (Editors) 2007 Esevier B.V. A rights reserved. Combining reaction inetics to the muti-phase Gibbs energy cacuation

More information

Theory and implementation behind: Universal surface creation - smallest unitcell

Theory and implementation behind: Universal surface creation - smallest unitcell Teory and impementation beind: Universa surface creation - smaest unitce Bjare Brin Buus, Jaob Howat & Tomas Bigaard September 15, 218 1 Construction of surface sabs Te aim for tis part of te project is

More information

14 Separation of Variables Method

14 Separation of Variables Method 14 Separation of Variabes Method Consider, for exampe, the Dirichet probem u t = Du xx < x u(x, ) = f(x) < x < u(, t) = = u(, t) t > Let u(x, t) = T (t)φ(x); now substitute into the equation: dt

More information

Akaike Information Criterion for ANOVA Model with a Simple Order Restriction

Akaike Information Criterion for ANOVA Model with a Simple Order Restriction Akaike Information Criterion for ANOVA Mode with a Simpe Order Restriction Yu Inatsu * Department of Mathematics, Graduate Schoo of Science, Hiroshima University ABSTRACT In this paper, we consider Akaike

More information

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7 6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17 Soution 7 Probem 1: Generating Random Variabes Each part of this probem requires impementation in MATLAB. For the

More information

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract Stochastic Compement Anaysis of Muti-Server Threshod Queues with Hysteresis John C.S. Lui The Dept. of Computer Science & Engineering The Chinese University of Hong Kong Leana Goubchik Dept. of Computer

More information