SVM-based Supervised and Unsupervised Classification Schemes

LUMINITA STATE, University of Pitesti, Faculty of Mathematics and Computer Science, 1 Targu din Vale St., Pitesti, ROMANIA, state@clicknet.ro

IULIANA PARASCHIV-MUNTEANU, University of Bucharest, Faculty of Mathematics and Computer Science, 14 Academiei St., Bucharest, ROMANIA, pmiulia@fmi.unibuc.ro

Abstract: The aim of the reported research is to propose a training algorithm for support vector machines based on kernel functions and to test its performance on non-linearly separable data. The training is based on the Sequential Minimal Optimization algorithm introduced by J.C. Platt in 1999. Several classification schemes obtained by combining the SVM and 2-means methods are proposed in the fifth section of the paper. A series of experimentally derived conclusions concerning the comparative analysis of the performances of the proposed methods is summarized in the final part of the paper. The tests were performed on samples randomly generated from two-dimensional Gaussian distributions and on data available in the Wisconsin Diagnostic Breast Cancer Database.

Key Words: Support Vector Machine, Pattern recognition, Statistical learning theory, Kernel functions, Principal Components Analysis, k-means Algorithm.

1 Introduction

In empirical data modeling a process of induction is used to build a model of the system, from which one hopes to deduce responses of the system that have yet to be observed. Ultimately the quantity and quality of the observations govern the performance of this empirical model. The Support Vector Machine (SVM) is a pattern classification technique developed by Vladimir Vapnik and his team at AT&T Bell Laboratories as an alternative training technique for Polynomial, Radial Basis Function and Multi-Layer Perceptron classifiers, in which the parameters are determined by solving a Quadratic Programming (QP) problem with linear inequality and equality constraints, the number of variables in the QP problem being equal to the size of the learning data.

The learning problem setting for SVMs is as follows: there is some unknown and nonlinear dependency (mapping, function) y = f(x) between a high-dimensional input vector x and a scalar output y (or a vector output y, as in the case of multiclass SVMs). There is no information about the underlying joint probability distributions, so one must perform distribution-free learning. SVMs belong to the supervised learning techniques, because the only information available is a finite training data set S consisting of labeled examples $(x_i, y_i)$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. In training an SVM the decision boundaries are determined directly from the training data, so that the separating margins of the decision boundaries are maximized in the high-dimensional space called the feature space. This learning strategy, based on the statistical learning theory developed by Vapnik [6], [22], minimizes the classification error on both the training data and unknown data.

The paper presents the results of research toward improving the performance, expressed in accuracy and time complexity, of SVM implementations. The proposed supervised training algorithm for SVM is essentially based on kernel functions of polynomial and exponential types. The implementation of the search process for the soft margin hyperplane uses a slight modification of the Sequential Minimal Optimization (SMO) algorithm introduced by Platt [17] in 1999 to solve the quadratic programming problem involved in the learning process.
SMO is a simple algorithm that quickly solves the SVM problem by decomposing the overall quadratic programming problem into smaller quadratic programming sub-problems, without any extra matrix storage and without invoking an iterative numerical routine for each sub-problem. The overall memory requirement of the SMO algorithm is linear in the size of the training data, and therefore it allows the use of large samples in the training process. The proposed method was tested on simulated data and on medical data freely offered by the Wisconsin Diagnostic Breast Cancer Database (UCI Machine Learning Repository).

We propose a combined methodology that uses Principal Component Analysis (PCA) for dimensionality reduction in the space of initial data representations together with kernel-based SVM to allow for possible inseparability. We also propose classification schemes that combine kernel-based SVM with k-means as a method of unsupervised learning. Using a combination of both supervised and unsupervised learning methods yielded very good experimental results, presented in the fifth section of the paper.

2 General Presentation of Support Vector Machine-based Classification

The Support Vector Machine is a relatively new model-free paradigm in Machine Learning. Briefly, SVM is a supervised learning technique of parametric type, the parameters of the discrimination function being learned directly from data. Let us consider a two-class classification problem, the only information concerning the classes being represented by a finite sequence of labeled data,

$$S = \{(x_i, y_i) \mid x_i = (x_i^1, \dots, x_i^d)^T \in \mathbb{R}^d,\ y_i \in \{-1, 1\},\ i = 1, \dots, N\}. \qquad (1)$$

The first component of each pair $(x_i, y_i)$ of $S$ represents an instance coming from the class of label $y_i$.

2.1 The case of linearly separable data

The sequence is linearly separable if there exists a linear discriminant function $f: \mathbb{R}^d \to \mathbb{R}$ that separates the examples of $S$, that is,

$$f(x) = b + w_1 x_1 + \dots + w_d x_d \qquad (2)$$

for each $x = (x_1, \dots, x_d)^T \in \mathbb{R}^d$, such that for any $(x_i, y_i) \in S$, $f(x_i) > 0$ if $y_i = 1$ and $f(x_i) < 0$ if $y_i = -1$. Linear separability can be expressed as the existence of parameters $b \in \mathbb{R}$ and $w = (w_1, \dots, w_d)^T \in \mathbb{R}^d$ such that $w^T z_i + b\,y_i > 0$, where $z_i = y_i x_i$. In such a case we say that the hyperplane

$$H_{w,b}: w^T x + b = 0 \qquad (3)$$

separates $S$ without errors. In an SVM-based approach the search for a solution leads to the constrained quadratic optimization problem

$$\min_{w,b}\ \Phi(w) = \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \ge 1,\ i = 1, \dots, N. \qquad (4)$$

Using the Lagrange multipliers method, the solution of the optimization problem (4) is given by the primal-dual solution for the objective function

$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^N \alpha_i \big[y_i(w^T x_i + b) - 1\big], \qquad (5)$$

where $\alpha_1, \dots, \alpha_N$ are the Lagrange multipliers. Using standard arguments [1], the problem reduces to the simpler optimization problem

$$\max_\alpha\ \sum_{i=1}^N \alpha_i - \tfrac{1}{2}\sum_{i=1}^N\sum_{k=1}^N \alpha_i \alpha_k y_i y_k x_i^T x_k \quad \text{s.t.}\quad \alpha_i \ge 0,\ 1 \le i \le N, \qquad \sum_{i=1}^N \alpha_i y_i = 0. \qquad (6)$$

If $\alpha^* = (\alpha_1^*, \dots, \alpha_N^*)$ is a solution of (6), we say that $x_i$ is a support vector if $\alpha_i^* \ne 0$. If $S_1$ is the set of support vectors, then the optimal solution of (4) is

$$w^* = \sum_{x_i \in S_1} \alpha_i^* y_i x_i, \qquad b^* = \frac{1}{|S_1|}\sum_{x_i \in S_1}\Big(y_i - \sum_{x_j \in S_1}\alpha_j^* y_j\, x_j^T x_i\Big),$$

that is, the discrimination function is

$$f^*(x) = (w^*)^T x + b^* = \sum_{x_i \in S_1} \alpha_i^* y_i\, x_i^T x + b^*. \qquad (7)$$

The computation involved in solving the optimization problem (6) can be carried out using the algorithm SVM1 [20]. Briefly, the algorithm SVM1 works as follows.

Algorithm SVM1 [16]
Input: $S = \{(x_i, y_i) \mid x_i \in \mathbb{R}^d,\ y_i \in \{-1, 1\},\ i = 1, \dots, N\}$
Step 1. Compute the matrix $D = (d_{ik})$ of entries $d_{ik} = y_i y_k\, x_i^T x_k$, $i, k = 1, \dots, N$.
Step 2. Solve the constrained optimization problem
$$\alpha^* = \arg\max_{\alpha \in \mathbb{R}^N}\ \alpha^T \mathbf{1} - \tfrac{1}{2}\alpha^T D\alpha, \quad \alpha_i \ge 0,\ 1 \le i \le N, \quad \sum_{i=1}^N \alpha_i y_i = 0.$$
Step 3. Select two support vectors $x_r, x_s$ such that $\alpha_r^* > 0$, $\alpha_s^* > 0$, $y_r = -1$, $y_s = 1$.
Step 4. Compute the parameters $w^*, b^*$ of the optimal separating hyperplane,
$$w^* = \sum_{i=1}^N \alpha_i^* y_i x_i, \qquad b^* = -\tfrac{1}{2}(w^*)^T (x_r + x_s),$$
and the width of the separating area $\rho(w^*, b^*) = 2/\|w^*\|$.
Output: $w^*, b^*, \rho(w^*, b^*)$.
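To make the construction concrete, the following minimal sketch (an illustration, not part of the original paper) solves the dual problem (6) numerically with SciPy's SLSQP solver and recovers $w^*$ and $b^*$ exactly as in Steps 3-4 of SVM1; the toy data and all names are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable sample: two point clouds labeled -1 / +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
N = len(y)

D = (y[:, None] * y[None, :]) * (X @ X.T)        # Step 1: d_ik = y_i y_k x_i^T x_k

# Step 2: maximizing 1^T a - 0.5 a^T D a is minimizing the negated objective.
res = minimize(lambda a: 0.5 * a @ D @ a - a.sum(),
               x0=np.zeros(N),
               jac=lambda a: D @ a - 1.0,
               bounds=[(0.0, None)] * N,                        # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum a_i y_i = 0
               method="SLSQP")
alpha = res.x

w = (alpha * y) @ X                               # Step 4: w* = sum alpha_i y_i x_i
sv = alpha > 1e-6                                 # support vectors: alpha_i > 0
r = np.argmax(sv & (y == -1))                     # Step 3: one support vector per class
s = np.argmax(sv & (y == 1))
b = -0.5 * w @ (X[r] + X[s])
print("w* =", w, " b* =", b, " width =", 2 / np.linalg.norm(w))
```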

2.2 The general case

In real-life situations, tests checking whether the data are linearly separable are computationally costly. Moreover, even when such a test is used, frequently enough we find out that the labeled data are not linearly separable. The method presented in Section 2.1 was generalized by Cortes and Vapnik [6], yielding the concept of the soft margin hyperplane. The method is essentially a regularization technique that uses a term expressing the effect of the set of classification errors,

$$\Phi_\sigma(\xi_1, \dots, \xi_N) = \sum_{i=1}^N \xi_i^\sigma, \qquad (8)$$

where $\sigma$ is a positive constant and the non-negative slack variables $\xi_i$, $1 \le i \le N$, are introduced to allow inseparability. The soft margin hyperplane is given by a solution of the constrained optimization problem

$$\min\ \tfrac{1}{2}\|w\|^2 + c\,F\Big(\sum_{i=1}^N \xi_i^\sigma\Big) \quad \text{s.t.}\quad y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,\ 1 \le i \le N, \qquad (9)$$

where $c$ is a given positive constant and $F$ is a monotone increasing convex function such that $F(0) = 0$. Applying the Lagrange multipliers method in the particular case $F(u) = u^k$ and $\sigma = 1$, we obtain the objective function

$$L(w, \xi, b, \alpha, \beta) = \tfrac{1}{2}\|w\|^2 + c\Big(\sum_{i=1}^N \xi_i\Big)^k - \sum_{i=1}^N \alpha_i\big[y_i(w^T x_i + b) - 1 + \xi_i\big] - \sum_{i=1}^N \beta_i \xi_i, \qquad (10)$$

where $\alpha = (\alpha_1, \dots, \alpha_N)$ and $\beta = (\beta_1, \dots, \beta_N)$ are the Lagrange multipliers. Therefore, in order to solve the constrained problem (9), the objective function $L$ should be minimized with respect to $w$, $\xi = (\xi_1, \dots, \xi_N)$ and $b$, and maximized with respect to the non-negative parameters $\alpha_i \ge 0$ and $\beta_i \ge 0$, $1 \le i \le N$. Using standard arguments we get

$$\frac{\partial L}{\partial w}\Big|_{w=w^*} = w^* - \sum_{i=1}^N \alpha_i y_i x_i = 0, \qquad (11)$$

$$\frac{\partial L}{\partial b}\Big|_{b=b^*} = -\sum_{i=1}^N \alpha_i y_i = 0, \qquad \frac{\partial L}{\partial \xi_i}\Big|_{\xi_i=\xi_i^*} = kc\Big(\sum_{j=1}^N \xi_j^*\Big)^{k-1} - \alpha_i - \beta_i = 0,\ \ i = 1, \dots, N. \qquad (12)$$

In the following we refer to the particular case $k = 1$, in which we get

$$w^* = \sum_{i=1}^N \alpha_i y_i x_i, \qquad \sum_{i=1}^N \alpha_i y_i = 0, \qquad c = \alpha_i + \beta_i,\ 1 \le i \le N. \qquad (13)$$

A hyperplane that assures the minimum number of errors in separating the data is given by the solution of the constrained optimization problem

$$\max_\alpha\ \sum_{i=1}^N \alpha_i - \tfrac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t.}\quad \sum_{i=1}^N \alpha_i y_i = 0, \quad 0 \le \alpha_i \le c,\ i = 1, \dots, N. \qquad (14)$$

The computation of the soft margin hyperplane is carried out by the algorithm SVM2 [20].

Algorithm SVM2 [16]
Input: $S = \{(x_i, y_i) \mid x_i \in \mathbb{R}^n,\ y_i \in \{-1, 1\},\ i = 1, \dots, N\}$, $c > 0$
Step 1. Compute the matrix $D = (d_{ik})$ of entries $d_{ik} = y_i y_k\, x_i^T x_k$, $i, k = 1, \dots, N$.
Step 2. Solve the constrained optimization problem
$$\alpha^* = \arg\max_{\alpha \in \mathbb{R}^N}\ \alpha^T \mathbf{1} - \tfrac{1}{2}\alpha^T D\alpha - \frac{\alpha_{\max}^2}{4c}, \quad \alpha_i \ge 0,\ 1 \le i \le N, \quad \sum_{i=1}^N \alpha_i y_i = 0,$$
where $\alpha_{\max} = \max\{\alpha_1, \dots, \alpha_N\}$.
Step 3. Select two support vectors $x_r, x_s$ such that $\alpha_r^* > 0$, $\alpha_s^* > 0$, $y_r = -1$, $y_s = 1$.
Step 4. Compute the parameters $w^*, b^*$ of the soft margin hyperplane,
$$w^* = \sum_{i=1}^N \alpha_i^* y_i x_i, \qquad b^* = -\tfrac{1}{2}(w^*)^T(x_r + x_s),$$
and the width of the separating area $\rho(w^*, b^*) = 2/\|w^*\|$.
Output: $w^*, b^*, \rho(w^*, b^*)$.
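As a practical illustration of the soft margin machine (a sketch under the assumption that a standard hinge-loss solver is an acceptable stand-in for SVM2, which it closely parallels for $k = 1$), scikit-learn's SVC with a linear kernel solves the box-constrained dual (14) for a chosen $c$ and makes the misclassification count directly observable; the data and names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping (non-separable) Gaussian clouds, labeled -1 / +1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (25, 2))])
y = np.array([-1] * 50 + [1] * 25)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # C plays the role of c in (9)/(14)
w, b = clf.coef_.ravel(), clf.intercept_[0]
print("support vectors:", clf.support_.size)
print("misclassified:", int((clf.predict(X) != y).sum()))
print("separating width rho =", 2 / np.linalg.norm(w))
```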

2.3 The kernel trick

In cases when the data are strongly non-separable and the performance corresponding to the soft margin hyperplane is poor, the data are projected onto a higher-dimensional space $\mathbb{R}^m$, $m > d$, using a non-linear mapping function $g: \mathbb{R}^d \to \mathbb{R}^m$. The explicit functional expression of the mapping $g$ is hidden by the kernel trick. A symmetric function $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a positive semi-definite kernel [14], [15] if for any positive natural number $M$, any $(h_1, \dots, h_M) \in \mathbb{R}^M$ and any $\{x_1, \dots, x_M\} \subset \mathbb{R}^d$,

$$\sum_{i,j=1}^M h_i h_j K(x_i, x_j) \ge 0. \qquad (15)$$

According to the Mercer theorem [14], [15], if $K$ is a positive semi-definite kernel then there exists a mapping function $g: \mathbb{R}^d \to \mathbb{R}^m$ such that

$$K(x, x') = g(x)^T g(x'), \quad \forall x, x' \in \mathbb{R}^d. \qquad (16)$$

The kernel trick in learning the soft margin hyperplane is twofold. On one hand, by selecting a positive semi-definite kernel, the computation of the soft margin hyperplane is performed in a higher-dimensional space without using the explicit expression of the mapping function $g$. On the other hand, the computation in the higher-dimensional space is in fact carried out in terms of computations in the initial space, that is, without increased computational complexity. In our tests we used two types of kernels, namely polynomial,

$$K(x, x') = (x^T x' + 1)^r, \qquad (17)$$

and exponential,

$$K(x, x') = \exp(-\gamma\|x - x'\|^2), \quad \gamma > 0. \qquad (18)$$

In order to determine the soft margin hyperplane in the higher-dimensional space $\mathbb{R}^m$ we have to solve the constrained optimization problem

$$\min_{w \in \mathbb{R}^m,\ b \in \mathbb{R}} L(w, b, \xi) \quad \text{s.t.}\quad y_i\big(w^T g(x_i) + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0,\ i = 1, \dots, N, \qquad (19)$$

where $L(w, b, \xi) = \tfrac{1}{2}\|w\|^2 + c\sum_{i=1}^N \xi_i$ and $g$ is the mapping function. Using the Lagrange multipliers method, the dual problem resulting from the Karush-Kuhn-Tucker conditions is

$$\max_{\alpha \in \mathbb{R}^N}\ \sum_{i=1}^N \alpha_i - \tfrac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.}\quad \sum_{i=1}^N \alpha_i y_i = 0, \quad 0 \le \alpha_i \le c,\ i = 1, \dots, N. \qquad (20)$$

If $\alpha^* = (\alpha_1^*, \dots, \alpha_N^*)$ is a solution of (20) and $S_1$ is the set of support vectors, then

$$b^* = \frac{1}{|S_1|}\sum_{x_i \in S_1}\Big(y_i - \sum_{x_j \in S_1}\alpha_j^* y_j K(x_i, x_j)\Big) \qquad (21)$$

and the discriminant function is

$$f(x) = \operatorname{sgn}\Big(\sum_{i \in S_1} \alpha_i^* y_i K(x_i, x) + b^*\Big). \qquad (22)$$

Consequently, $\|w^*\|^2 = \sum_{i=1}^N\sum_{j=1}^N \alpha_i^* \alpha_j^* y_i y_j K(x_i, x_j)$ and the width of the separating area is $\rho = 2/\|w^*\|$.
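A short sketch (illustrative, not from the paper) showing how the two kernel families (17) and (18) are evaluated as Gram matrices, with a numerical check of the positive semi-definiteness property (15) via the eigenvalue spectrum; the function names and the sample are assumptions.

```python
import numpy as np

def poly_kernel(X, r=2):
    """Polynomial kernel (17): K(x, x') = (x^T x' + 1)^r, as a Gram matrix."""
    return (X @ X.T + 1.0) ** r

def exp_kernel(X, gamma=1.0):
    """Exponential (RBF) kernel (18): K(x, x') = exp(-gamma ||x - x'||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

X = np.random.default_rng(2).normal(size=(30, 2))
for K in (poly_kernel(X), exp_kernel(X)):
    # Property (15) is equivalent to the Gram matrix having no negative eigenvalues.
    print("min eigenvalue:", np.linalg.eigvalsh(K).min())
```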

2.4 The SMO Algorithm for Solving the Dual Problem

The Sequential Minimal Optimization (SMO) algorithm was introduced by Platt [17] as an iterative method for solving constrained optimization problems of the type

$$\min_{\alpha \in \mathbb{R}^N} W(\alpha) \quad \text{s.t.}\quad \sum_{i=1}^N \alpha_i y_i = 0, \quad 0 \le \alpha_i \le c,\ i = 1, \dots, N. \qquad (23)$$

In our tests we applied the SMO algorithm to minimize the function

$$W(\alpha) = \tfrac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^N \alpha_i.$$

SMO decomposes the overall quadratic programming problem into the smallest possible sub-problems, which are solved analytically, without any extra matrix storage. Because the linear equality constraint has to hold for the multipliers, the smallest possible optimization problem involves exactly two Lagrange multipliers. At every step, SMO chooses two Lagrange multipliers to jointly optimize, finds the optimal values for these multipliers, and updates the SVM to reflect the new optimal values.
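The two-multiplier update can be written in closed form. Below is a compressed sketch of a *simplified* SMO variant (random choice of the second multiplier, as in common teaching implementations) rather than Platt's full heuristic; all names are illustrative, and the clipping bounds L, H encode the box constraint of (23).

```python
import numpy as np

def smo_train(K, y, c=1.0, tol=1e-5, max_passes=20, seed=0):
    """Simplified SMO for problem (23); K is the N x N Gram matrix, y the +/-1 labels."""
    rng = np.random.default_rng(seed)
    N = len(y)
    alpha, b, passes = np.zeros(N), 0.0, 0
    while passes < max_passes:
        changed = 0
        for i in range(N):
            Ei = (alpha * y) @ K[:, i] + b - y[i]          # prediction error on x_i
            if (y[i] * Ei < -tol and alpha[i] < c) or (y[i] * Ei > tol and alpha[i] > 0):
                j = (i + 1 + rng.integers(N - 1)) % N      # random second multiplier, j != i
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai, aj = alpha[i], alpha[j]
                if y[i] != y[j]:                           # feasible segment from (23)
                    L, H = max(0.0, aj - ai), min(c, c + aj - ai)
                else:
                    L, H = max(0.0, ai + aj - c), min(c, ai + aj)
                eta = K[i, i] + K[j, j] - 2.0 * K[i, j]    # curvature along the pair
                if L == H or eta <= 0:
                    continue
                alpha[j] = np.clip(aj + y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj) < 1e-12:
                    continue
                alpha[i] = ai + y[i] * y[j] * (aj - alpha[j])
                # threshold update keeping the KKT conditions consistent
                b1 = b - Ei - y[i] * (alpha[i] - ai) * K[i, i] - y[j] * (alpha[j] - aj) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai) * K[i, j] - y[j] * (alpha[j] - aj) * K[j, j]
                b = b1 if 0 < alpha[i] < c else (b2 if 0 < alpha[j] < c else 0.5 * (b1 + b2))
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```

The decision function then follows (22), using the returned multipliers and threshold.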

3 Unsupervised learning: clustering using the k-means method

Center-based clustering algorithms are very efficient for clustering large and high-dimensional databases. They have their own objective functions, which define how good a clustering solution is, the goal being to minimize the objective function. Clusters found by center-based algorithms have convex shapes, and each cluster is represented by a center. The k-means algorithm (MacQueen [13], 1967) was designed to cluster numerical data, each cluster having a center called the mean.

Let $D = \{x_1, \dots, x_N\} \subset \mathbb{R}^d$ be the data set, $k$ a given positive integer, and $C_1, \dots, C_k$ pairwise disjoint clusters of $D$, that is, $\bigcup_{i=1}^k C_i = D$ and $C_i \cap C_j = \emptyset$ for $i \ne j$. If we denote by $\mu(C_i)$ the center of $C_i$, then the inertia momentum (error) is expressed by

$$\varepsilon = \sum_{i=1}^k \sum_{x \in C_i} d^2\big(x, \mu(C_i)\big), \qquad (24)$$

where $d$ is a convenient distance function on $\mathbb{R}^d$. In the following we take $d$ to be the Euclidean distance, $d(x, y) = \|x - y\|$. The k-means method proceeds, for given initial $k$ clusters, by allocating the remaining data to the nearest clusters and then repeatedly changing the membership of the clusters according to the error function, until the error function does not change significantly or the membership of the clusters no longer changes.

The k-means algorithm can be treated as an optimization problem where the goal is to minimize a given objective function under certain constraints. We denote by $\mathcal{C}$ the set of all subsets of $\mathbb{R}^d$ of cardinality $k$; any particular $Q = \{q_1, \dots, q_k\} \in \mathcal{C}$ is called a set of possible centers. A system of $k$ pairwise disjoint clusters of $D$ can obviously be represented in terms of a matrix $W = (w_{il}) \in M_{N \times k}(\mathbb{R})$ such that

$$\text{(i)}\ \ w_{il} \in \{0, 1\},\ i = 1, \dots, N,\ l = 1, \dots, k; \qquad \text{(ii)}\ \ \sum_{l=1}^k w_{il} = 1,\ i = 1, \dots, N. \qquad (25)$$

The k-means algorithm can be formulated as the constrained optimization problem

$$\min_{W \in M_{N \times k}(\mathbb{R}),\ Q \in \mathcal{C}} P(W, Q) \quad \text{s.t.}\quad w_{il} \in \{0, 1\}, \quad \sum_{l=1}^k w_{il} = 1,\ i = 1, \dots, N, \qquad (26)$$

where the objective function is defined as

$$P(W, Q) = \sum_{i=1}^N \sum_{l=1}^k w_{il}\,\|x_i - q_l\|^2. \qquad (27)$$

Problem (26) can be solved by decomposing it into two simpler problems, P1 and P2, and iteratively solving them, where:

P1. Fix $Q = \hat{Q} \in \mathcal{C}$ and solve the reduced constrained optimization problem for $P(W, \hat{Q})$.
P2. Fix $W = \hat{W} \in M_{N \times k}(\mathbb{R})$ and solve the reduced unconstrained optimization problem for $P(\hat{W}, Q)$.

The solutions of these problems are given by the following theorems.

Theorem 1. For any fixed set of centers $\hat{Q} = \{\hat{q}_1, \dots, \hat{q}_k\}$, the function $P(W, \hat{Q})$ is minimized if and only if $W$ satisfies, for any $i = 1, \dots, N$ and $l = 1, \dots, k$,

$$w_{il} = 0 \ \text{ if } \ \|x_i - \hat{q}_l\| > \min_{1 \le t \le k} \|x_i - \hat{q}_t\|, \qquad w_{il} = 1 \ \Rightarrow\ \|x_i - \hat{q}_l\| = \min_{1 \le t \le k} \|x_i - \hat{q}_t\|, \qquad \sum_{j=1}^k w_{ij} = 1.$$

Proof: Let $W^0 = (w^0_{il})$, where for each $i$ exactly one minimizing index receives

$$w^0_{il} = \begin{cases} 1, & \|x_i - \hat{q}_l\| = \min_{1 \le t \le k} \|x_i - \hat{q}_t\| \\ 0, & \text{otherwise,} \end{cases} \qquad \sum_{l=1}^k w^0_{il} = 1,\ i = 1, \dots, N.$$

Then for any $W \in M_{N \times k}(\mathbb{R})$ satisfying the constraints of (26), if we denote by $l_i$ the index such that $w_{i l_i} = 1$ and $w_{ij} = 0$ for $j \ne l_i$, we get

$$P(W, \hat{Q}) = \sum_{i=1}^N \|x_i - \hat{q}_{l_i}\|^2 \ \ge\ \sum_{i=1}^N \min_{1 \le t \le k}\|x_i - \hat{q}_t\|^2 = P(W^0, \hat{Q}),$$

that is, $W^0$ is a solution of P1. Conversely, let $W$ be a solution of P1. If there exists $i$ such that $w_{il} = 1$ and $\|x_i - \hat{q}_l\| > \min_{1 \le t \le k}\|x_i - \hat{q}_t\| = \|x_i - \hat{q}_{l_0}\|$, then for $W' = (w'_{jl})$, where $w'_{jl} = w_{jl}$ for $j \ne i$ and

$$w'_{il} = \begin{cases} 1, & l = l_0 \\ 0, & \text{otherwise,} \end{cases}$$

we get

$$P(W', \hat{Q}) = \sum_{j \ne i}\sum_{l=1}^k w_{jl}\|x_j - \hat{q}_l\|^2 + \|x_i - \hat{q}_{l_0}\|^2,$$

that is, $P(W, \hat{Q}) - P(W', \hat{Q}) = \|x_i - \hat{q}_l\|^2 - \|x_i - \hat{q}_{l_0}\|^2 > 0$, which contradicts the assumption that $W$ minimizes $P(\cdot, \hat{Q})$. Note that, in general, for any given $\hat{Q}$ there may be several solutions of type $W^0$, because a particular data point $x_i$ can be at minimum distance from more than one center of $\hat{Q}$.

Theorem 2. For any fixed $\hat{W}$ satisfying the constraints of (26), the function $P(\hat{W}, Q)$ is minimized if and only if

$$q_l = \frac{\sum_{i=1}^N \hat{w}_{il}\, x_i}{\sum_{i=1}^N \hat{w}_{il}}, \quad l = 1, \dots, k.$$

Proof: For each $l = 1, \dots, k$ let $C_l = \{x_i \mid x_i \in D,\ \hat{w}_{il} = 1\}$. Obviously $C_1, \dots, C_k$ are pairwise disjoint clusters of $D$ and

$$\sum_{i=1}^N \hat{w}_{il} = |C_l|, \qquad \sum_{i=1}^N \hat{w}_{il}\, x_i = \sum_{x_i \in C_l} x_i, \quad l = 1, \dots, k.$$

Then

$$P(\hat{W}, Q) = \sum_{i=1}^N\sum_{l=1}^k \hat{w}_{il}\,\|x_i - q_l\|^2 = \sum_{l=1}^k \sum_{x_i \in C_l} \|x_i - q_l\|^2.$$

Let $q^0_l = \frac{1}{|C_l|}\sum_{x_i \in C_l} x_i$, $l = 1, \dots, k$. Obviously,

$$\sum_{x_i \in C_l}\|x_i - q_l\|^2 = \sum_{x_i \in C_l}\|x_i - q^0_l\|^2 + |C_l|\,\|q^0_l - q_l\|^2 + 2\,(q^0_l - q_l)^T \underbrace{\sum_{x_i \in C_l}(x_i - q^0_l)}_{=\,0},$$

so $P(\hat{W}, Q^0) \le P(\hat{W}, Q)$ for any $Q \in \mathcal{C}$. Moreover, since the equality $P(\hat{W}, Q^0) = P(\hat{W}, Q)$ holds if and only if $q_l = q^0_l$ for all $l = 1, \dots, k$, for the given $\hat{W}$, $Q^0$ is the unique set of centers that minimizes $P(\hat{W}, Q)$.

The k-means algorithm, viewed as an optimization process for solving (26), is as follows.

Algorithm k-MOP
Input: $D$ - the data set, $k$ - the pre-specified number of clusters, $d$ - the data dimensionality, $T$ - threshold on the maximum number of iterations.
Initializations: $Q^0$, $t \leftarrow 0$; solve $P(W, Q^0)$ and get $W^0$; $sw \leftarrow$ false
repeat
  $\hat{W} \leftarrow W^t$; solve $P(\hat{W}, Q)$ and get $Q^{t+1}$
  if $P(\hat{W}, Q^t) = P(\hat{W}, Q^{t+1})$ then
    $sw \leftarrow$ true; output $\hat{W}, Q^{t+1}$
  else
    $\hat{Q} \leftarrow Q^{t+1}$; solve $P(W^t, \hat{Q})$ and get $W^{t+1}$
    if $P(W^t, \hat{Q}) = P(W^{t+1}, \hat{Q})$ then
      $sw \leftarrow$ true; output $W^{t+1}, \hat{Q}$
    endif
  endif
  $t \leftarrow t + 1$
until $sw$ or $t > T$.

Note that the computational complexity of the algorithm k-MOP is $O(Nkd)$ per iteration. The sequence of values $P(W^t, Q^t)$, where $(W^t, Q^t)$ are computed by k-MOP, is strictly decreasing; therefore the algorithm converges to a local minimum of the objective function.
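For concreteness, here is a minimal NumPy sketch (not from the paper) of k-MOP as the alternation of P1 (nearest-center assignment, Theorem 1) and P2 (centroid update, Theorem 2), stopping when the objective stops decreasing; it assumes no cluster ever becomes empty.

```python
import numpy as np

def k_mop(D, k, T=100, seed=0):
    """k-means as alternating minimization of P(W, Q) in (26)-(27)."""
    rng = np.random.default_rng(seed)
    Q = D[rng.choice(len(D), k, replace=False)]      # initial centers Q^0
    prev = np.inf
    for _ in range(T):
        # P1: for fixed Q, assign each point to its nearest center (Theorem 1)
        labels = np.argmin(((D[:, None, :] - Q[None, :, :]) ** 2).sum(-1), axis=1)
        # P2: for fixed assignments, each center becomes its cluster mean (Theorem 2);
        # assumes every cluster stays non-empty
        Q = np.array([D[labels == l].mean(axis=0) for l in range(k)])
        P = sum(((D[labels == l] - Q[l]) ** 2).sum() for l in range(k))
        if P == prev:                                # objective unchanged: local minimum
            break
        prev = P
    return labels, Q
```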
4 The combined separating technique based on SVM and the k-means algorithm

At first sight it seems unreasonable to compare a supervised technique to an unsupervised one, mainly because they refer to totally different situations: supervised techniques are applied when the data set consists of correctly labeled objects, whereas unsupervised methods deal with unlabeled objects. However, our point is to combine the SVM and the k-means algorithm in order to obtain a new design of a linear classifier. The aim of the experimental analysis is to evaluate the performance of the linear classifier resulting from the combination of the supervised SVM method and the 2-means algorithm.

Our method can be applied to any data, either linearly separable or non-linearly separable. Obviously, in the case of non-linearly separable data the classification cannot be performed without errors, and then the number of misclassified examples is the most reasonable criterion for performance evaluation. Of particular importance is the case of linearly separable data, where the performance is evaluated in terms of both the number of misclassified examples and the generalization capacity, expressed by the width of the separating area.

In real-life situations it is usually very difficult, or even impossible, to establish whether the data represent a linearly or non-linearly separable set. Using the SVM1 approach we can identify which case a given data set belongs to. For linearly separable data, SVM1 computes a separating hyperplane that is optimal from the point of view of the generalization capacity. In the case of non-linearly separable data, SVM2 computes a linear classifier that minimizes the number of misclassified examples. A series of developments are based on non-linear transforms, represented by kernel functions, whose range is a high-dimensional space. The increase of dimensionality and a convenient choice of the kernel allow the transformation of a non-linearly separable problem into a linearly separable one. The computational complexity corresponding to kernel-based approaches is significantly larger; therefore, when the performance of the algorithm SVM1 proves reasonably good, it can be taken as an alternative to a kernel-based SVM. We perform a comparative analysis on data consisting of examples generated from two-dimensional Gaussian distributions.

In the case of a non-linearly separable data set, using the k-means algorithm we get a system of pairwise disjoint clusters together with the set of their centers, representing a local minimum point of criterion (26); for k = 2 the resulting clusters are linearly separable. Consequently, the SVM1 algorithm computes a linear classifier that separates the resulting clusters without errors. Our procedure is described as follows (a code sketch is given after the listing).

Input: $S = \{(x_i, y_i) \mid x_i \in \mathbb{R}^n,\ y_i \in \{-1, 1\},\ i = 1, \dots, N\}$
Step 1. Compute the matrix $D = (d_{ik})$ of entries $d_{ik} = y_i y_k\, x_i^T x_k$, $i, k = 1, \dots, N$, and initialize $sh \leftarrow$ true.
Step 2. If the constrained optimization problem
$$\alpha^* = \arg\max_{\alpha \in \mathbb{R}^N}\ \alpha^T \mathbf{1} - \tfrac{1}{2}\alpha^T D\alpha, \quad \alpha_i \ge 0,\ 1 \le i \le N, \quad \sum_{i=1}^N \alpha_i y_i = 0,$$
does not have a solution, then set $sh \leftarrow$ false, input $c$ for the soft margin hyperplane, and solve the constrained optimization problem
$$\alpha^* = \arg\max_{\alpha \in \mathbb{R}^N}\ \alpha^T \mathbf{1} - \tfrac{1}{2}\alpha^T D\alpha - \frac{\alpha_{\max}^2}{4c}, \quad \alpha_i \ge 0,\ 1 \le i \le N, \quad \sum_{i=1}^N \alpha_i y_i = 0.$$
endif
Step 3. Select $x_r, x_s$ such that $\alpha_r^* > 0$, $\alpha_s^* > 0$, $y_r = -1$, $y_s = 1$, and compute the parameters $w^*, b^*$ of the separating hyperplane,
$$w^* = \sum_{i=1}^N \alpha_i^* y_i x_i, \qquad b^* = -\tfrac{1}{2}(w^*)^T(x_r + x_s),$$
together with the width of the separating area $\rho(w^*, b^*) = 2/\|w^*\|$.
Step 4. If not $sh$, then compute nr_err1, the number of incorrectly classified examples, and err1, the classification error.
Step 5. The set $D' = \{x_i \mid x_i \in \mathbb{R}^d,\ i = 1, \dots, N\}$ is divided into two clusters $C_1$ and $C_2$ using 2-means, marked with $y'_i = -1$ and $y'_i = 1$ respectively.
Step 6. Apply the algorithm SVM1 to $S' = \{(x_i, y'_i) \mid x_i \in \mathbb{R}^d,\ y'_i \in \{-1, 1\},\ i = 1, \dots, N\}$ and obtain the parameters of the optimal separating hyperplane, $w_1^*, b_1^*, \rho(w_1^*, b_1^*)$; compute nr_err2, the number of examples incorrectly classified after 2-means, and err2, the classification error after 2-means.
Output: $w^*, b^*, \rho(w^*, b^*)$, nr_err1, err1, $w_1^*, b_1^*, \rho(w_1^*, b_1^*)$, nr_err2, err2.
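The following minimal sketch (an illustration, not the authors' code) reproduces Steps 5-6 with off-the-shelf components: 2-means relabels the data, and a linear SVM separates the resulting clusters, which are linearly separable for k = 2; the use of scikit-learn and all names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def two_means_svm(X):
    """Steps 5-6: relabel by 2-means, then separate the clusters with a linear SVM."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    y = np.where(labels == 0, -1, 1)       # clusters marked -1 / +1
    # 2-means clusters (k = 2) are linearly separable, so a large C
    # approximates the hard-margin classifier computed by SVM1.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    w, b = clf.coef_.ravel(), clf.intercept_[0]
    return w, b, 2.0 / np.linalg.norm(w), y

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(2, 1.0, (25, 2))])
w, b, rho, y_clusters = two_means_svm(X)
print("w* =", w, " b* =", b, " width =", rho)
```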

5 Experimental results

This section presents some of the results obtained in testing the potential of the methodology described in the previous sections for solving classification problems. Some of the tests were performed on simulated data randomly generated from two-dimensional Gaussian distributions, using kernels of polynomial and exponential types respectively. A refined methodology is proposed in the final part of this section; it combines a Principal Component Analysis (PCA) approach for dimensionality reduction in the space of the initial data representations with kernel-based SVM to allow for possible inseparability. These tests were performed on the free Wisconsin Diagnostic Breast Cancer Database, taken from the UCI Machine Learning Repository.

The result of the test performed on the famous XOR problem, using a polynomial kernel, is presented in Figure 1; the labeled data set S consists of the four labeled XOR examples. (Figure 1: Solution of the XOR problem computed by the kernel SVM.)

Several tests were performed on discriminating between examples coming from two classes when the data are generated from two-dimensional Gaussian distributions. For instance, for samples of sizes $N_1 = 50$ and $N_2 = 25$, generated from $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$ respectively, in most cases we obtained non-linearly separable data. Some of the results obtained when we used different expressions for polynomial and exponential kernels are represented in Figure 2 and Figure 3. (Figure 2: (a) $K(x, x') = (x^T x' + 1)^2$; (b) a second polynomial kernel of type (17). Figure 3: (a) $K(x, x') = \exp(-2(x - x')^T(x - x'))$; (b) $K(x, x') = \exp(-(x - x')^T(x - x'))$.)

A long series of tests was performed on the medical data with confirmed diagnoses available at the UCI Machine Learning Repository; the dimension of each record is d = 30, and for each example the confirmed diagnosis, representing the presence/absence of breast cancer, is also provided. The database contains 569 examples, of which 357 are positive and 212 are negative, and we used them for the SVM design and for testing purposes. The first series of tests was performed on design samples of sizes 20, 50, 100, 150 and 200 respectively, the examples being taken randomly. In each case the complementary set of examples was used to test the performance of the computed discrimination function, and the empirical error functions were computed. The results are presented in Tables 1-8. For each sample a PCA analysis was developed separately for the sub-samples coming from each class, in order to determine their most informative directions. The tests pointed out that only the first 15 eigenvectors of the sample autocorrelation matrix are relevant, where we evaluated relevance by the magnitude of the corresponding eigenvalues. We applied the SVM-based methodology with polynomial kernels to the resulting 15-length representations, obtained by considering only the principal directions whose corresponding eigenvalues are larger than $10^{-3}$, and evaluated the empirical error functions.
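A minimal sketch of this preprocessing step (illustrative; the paper works with the sample autocorrelation matrix, while the sketch uses the centered covariance, and all names and the random data are assumptions): keep only the principal directions whose eigenvalues exceed the chosen threshold.

```python
import numpy as np

def pca_reduce(X, eig_threshold=1e-3):
    """Keep only the principal directions whose eigenvalues exceed the threshold."""
    Xc = X - X.mean(axis=0)                                  # center the representations
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))    # ascending eigenvalues
    keep = vecs[:, vals > eig_threshold]                     # informative directions only
    return Xc @ keep                                         # reduced-length representations

X = np.random.default_rng(4).normal(size=(569, 30))
print("reduced dimension:", pca_reduce(X, 1e-3).shape[1])
```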

Our tests aimed to derive conclusions concerning the performance, expressed in terms of the empirical error, of the following classification schemes (a short sketch of scheme 1b is given after the list):

1. 2-means classification schemes.
a. The 2-means algorithm applied to the whole data set.
b. The 2-means minimum-distance classification scheme (2-MMD). The data set is split into a design data set and a test data set; the 2-means algorithm is applied to the design data set, and each element of the test data set is classified into the cluster whose center is at minimum distance.

2. SVM classification schemes using linear and exponential kernels (SVMexp, SVMlin). For each type of kernel, the soft margin separating hyperplane is computed for the whole data set. The variant SVMexp uses the exponential kernel $K(x, x') = \exp(-(x - x')^T(x - x'))$ and the variant SVMlin uses the linear kernel $K(x, x') = x^T x'$.

3. 2-means SVM classification schemes (SVM2m_exp, SVM2m_lin). The 2-means algorithm is applied to the design data, and the soft margin hyperplane is computed to separate the resulting clusters. Each test example is then classified using the computed separating hyperplane. The variant SVM2m_exp uses the exponential kernel $K(x, x') = \exp(-(x - x')^T(x - x'))$ and the variant SVM2m_lin uses the linear kernel $K(x, x') = x^T x'$.
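A minimal sketch of the 2-MMD scheme (illustrative, not the authors' code; names and data are assumptions): KMeans.predict already implements the minimum-distance rule with respect to the centers learned on the design set.

```python
import numpy as np
from sklearn.cluster import KMeans

def two_mmd(X_design, X_test):
    """Scheme 1b: cluster the design set, then classify each test point
    into the cluster whose center is at minimum distance."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_design)
    return km.predict(X_test)     # predict() is exactly the nearest-center rule

rng = np.random.default_rng(5)
design, test = rng.normal(size=(100, 30)), rng.normal(size=(50, 30))
print(two_mmd(design, test)[:10])
```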

The performance is evaluated by computing the percentage of misclassified examples for the design data set and for the test data set respectively. We used the following notation:

E_2m,p - the percentage of misclassified examples when the 2-means algorithm is applied to the design data set;
E_2m,t - the percentage of misclassified examples when the 2-means algorithm is applied to the test data set;
N_S - the number of support vectors in the case of the SVM-based classification schemes;
E_d - the percentage of misclassified examples when an SVM-based classification is applied to the design data set;
E_t - the percentage of misclassified examples when an SVM-based classification is applied to the test data set;
CPUt - the duration, expressed in seconds, of computing the solution of the optimization problems (14) and (20) respectively using the SMO algorithm. In Table 1, CPUt represents the duration, expressed in seconds, of applying the 2-means algorithm.

Some of the results of our tests in the initial data space of dimension 30 are presented in Tables 1-4 (numeric entries omitted):

Table 1: The 2-MMD scheme for d = 30 (columns: N, E_2m,p %, E_2m,t %, CPUt).
Table 2: The SVM2m_lin scheme for d = 30 (columns: N, N_S, E_d %, E_t %, CPUt).
Table 3: The SVM2m_exp scheme for d = 30 (columns: N, N_S, E_d %, E_t %, CPUt).
Table 4: The SVM_exp scheme for d = 30 (columns: N, N_S, E_d %, E_t %, CPUt).

Note that in the 2-means-based schemes CPUt represents the total CPU time needed to apply the 2-means algorithm to the design data set and to classify the test set examples. The empirical error when the 2-means algorithm is applied to the whole data set is 14.58%. The tests entail several conclusions concerning the comparative analysis of the proposed classification schemes.

1. The best recognition can be obtained using the SVMlin technique, but unfortunately the time required to compute the soft margin hyperplane becomes prohibitively large when the volume of the design data set increases. In case the volume of the design data set is N = 20, the empirical error is 8.3% for the test data set and 0 for the design data set.
2. The 2-MMD and SVM2m_lin schemes prove comparable performances from the points of view of both the encoding quality and the generalization capacity.
3. Also, while in general the SVM implementations using exponential kernels prove better performances, the SVMexp classification scheme proves poor generalization capacity.
4. The variant SVM2m_lin proves better generalization capacity compared to the 2-MMD method, but its duration increases significantly when the volume of the design data becomes larger.

We combined the proposed methodology with a PCA preprocessing step. When we consider as informative only the principal directions whose corresponding eigenvalues are larger than $10^{-3}$, we obtain 15 principal directions for each class as well as for the whole data set; when the significance level is $10^{-2}$, we obtain only 12 principal directions, respectively.

We preprocessed the data in order to obtain 15-length and 12-length representations and applied the described methodology. The PCA dimensionality reduction can be performed in two ways: on one hand, by computing the principal directions of the whole data set; on the other hand, the principal directions can be computed separately for each class, in which case more informative representations result for all samples. The amount of time needed to compute the principal components of the whole data set is t = 0.03, the computation of the principal directions corresponding to each class being almost equal to it. Given that for this particular data set there are no significant differences between the representations resulting from these two methods, our option was to reduce the dimensionality by using the overall principal directions. The results are presented in Tables 5-8 (numeric entries omitted):

Table 5: The 2-MMD scheme for d = 15 (columns: N, E_2m,p %, E_2m,t %, CPUt).
Table 6: The SVM2m_lin scheme for d = 15 (columns: N, N_S, E_d %, E_t %, CPUt).
Table 7: The SVM2m_exp scheme for d = 15 (columns: N, N_S, E_d %, E_t %, CPUt).
Table 8: The SVM_exp scheme for d = 15 (columns: N, N_S, E_d %, E_t %, CPUt).

Some other tests performed on different databases pointed out that all the classification schemes proved better performance when the representations were computed in terms of the principal directions of each class. It is interesting to note that the empirical error of the 2-means algorithm applied in the reduced space remains unchanged, that is, the class separability is not affected by the dimensionality reduction. Comparing the results presented in Tables 1-4 and Tables 5-8, we see that the performance of the classifiers is significantly improved by including PCA as a preprocessing step. This can be interpreted as a proof that the effect of the minor components on the class variability is similar to a sort of noise that affects the data and is responsible for the decrease of classifier performance.

6 Summary and final remarks

The paper presents some results on using SVM-based techniques for classification purposes. Since the design of the SVM involves the solution of quadratic programming problems, a natural question is to find fast algorithms for solving the involved quadratic programming problems; in our developments we used the SMO algorithm introduced by Platt [17]. We also included a PCA approach as a preprocessing step and applied the SVM-based methodology in the reduced space of principal components. The tests performed on UCI Machine Learning Repository data confirmed that this combined methodology allows a significant improvement of the classification performance. Work aiming to combine different types of classifiers with the SVM is still in progress, and the results are going to be published elsewhere.

Acknowledgements: The research was developed in the framework of the Ph.D. program at the University of Pitesti, Romania.

References:
[1] S. Abe, Support Vector Machines for Pattern Classification, Springer-Verlag.
[2] C. Aviles, J. Villegas, J. Ocampo, R. Arechiga, Principal Components and Invariant Moments Analysis-Font Recognition Applied, WSEAS Transactions on Computers, Issue 5, Vol. 5, 2006.
[3] C.M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006.

[4] C.J.C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery 2, 1998, pp. 121-167.
[5] P.C. Cheng, B.C. Chien, W.P. Yang, Categorizing Medical Images by Supervised Machine Learning, WSEAS Transactions on Computers, Issue 12, Vol. 5, 2006.
[6] C. Cortes and V.N. Vapnik, Support-vector networks, Machine Learning 20(3), 1995, pp. 273-297.
[7] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[8] G. Gan, C. Ma, J. Wu, Data Clustering: Theory, Algorithms and Applications, SIAM.
[9] N. El Gayar, S.A. Shaban, S. Hamdy, Face Recognition with Co-training and Ensemble-driven Learning, WSEAS Transactions on Computers, Issue 3, Vol. 6, 2007.
[10] S.R. Gunn, Support Vector Machines for Classification and Regression, University of Southampton, Technical Report.
[11] R. Herbrich, Learning Kernel Classifiers: Theory and Algorithms, MIT Press.
[12] G. Kaur, A.S. Arora, V.K. Jain, Multi-Class Support Vector Machine Classifier in EMG Diagnosis, WSEAS Transactions on Signal Processing, Issue 12, Vol. 5, 2009.
[13] J.B. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1967, pp. 281-297.
[14] J. Mercer, Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations, Philosophical Transactions of the Royal Society of London 209, Series A, 1909, pp. 415-446.
[15] H.Q. Minh, P. Niyogi and Y. Yao, Mercer's Theorem, Feature Maps and Smoothing, Proc. of Computational Learning Theory (COLT'06), 2006.
[16] I. Paraschiv-Munteanu, Support Vector Machine in solving pattern recognition tasks, Technical Report, Proceedings of the First Doctoral Student Workshop in Computer Science, University of Piteşti.
[17] J.C. Platt, Fast Training of Support Vector Machines using Sequential Minimal Optimization, in Advances in Kernel Methods: Support Vector Learning, edited by B. Schölkopf, C.J.C. Burges and A.J. Smola, The MIT Press, 1999, pp. 185-208.
[18] S.D. Sawarkar, A.A. Ghatol, Breast Cancer Malignancy Identification using Support Vector Machine, WSEAS Transactions on Computers, Issue 8, Vol. 5, 2006.
[19] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002.
[20] L. State and I. Paraschiv-Munteanu, A comparative analysis on the potential of SVM and k-means in solving classification tasks, Proceedings of the First International Conference on Modelling and Development of Intelligent Systems (MDIS-2009), Sibiu, Romania, 2009.
[21] Z. Teng, F. Ren, S. Kuroiwa, The Emotion Recognition through Classification with the Support Vector Machines, WSEAS Transactions on Computers, Issue 9, Vol. 5, 2006.
[22] V.N. Vapnik, Statistical Learning Theory, New York, Wiley-Interscience, 1998.
[23] L. Wang, Support Vector Machines: Theory and Applications, Springer-Verlag, 2005.
[24] H. Xiao, X. Zhang, Comparison Studies on Classification for Remote Sensing Image Based on Data Mining Method, WSEAS Transactions on Computers, Issue 5, Vol. 7, 2008.
[25] R. Xu, D.C. Wunsch II, Clustering, Wiley & Sons.
[26] C.Y. Yeh, S.J. Lee, C.H. Wu, S.H. Doong, A Hybrid Kernel Method for Clustering, WSEAS Transactions on Computers, Issue 10, Vol. 5, 2006.


More information

Nonlinear Analysis of Spatial Trusses

Nonlinear Analysis of Spatial Trusses Noninear Anaysis of Spatia Trusses João Barrigó October 14 Abstract The present work addresses the noninear behavior of space trusses A formuation for geometrica noninear anaysis is presented, which incudes

More information

BDD-Based Analysis of Gapped q-gram Filters

BDD-Based Analysis of Gapped q-gram Filters BDD-Based Anaysis of Gapped q-gram Fiters Marc Fontaine, Stefan Burkhardt 2 and Juha Kärkkäinen 2 Max-Panck-Institut für Informatik Stuhsatzenhausweg 85, 6623 Saarbrücken, Germany e-mai: stburk@mpi-sb.mpg.de

More information

II. PROBLEM. A. Description. For the space of audio signals

II. PROBLEM. A. Description. For the space of audio signals CS229 - Fina Report Speech Recording based Language Recognition (Natura Language) Leopod Cambier - cambier; Matan Leibovich - matane; Cindy Orozco Bohorquez - orozcocc ABSTRACT We construct a rea time

More information

Bayesian Unscented Kalman Filter for State Estimation of Nonlinear and Non-Gaussian Systems

Bayesian Unscented Kalman Filter for State Estimation of Nonlinear and Non-Gaussian Systems Bayesian Unscented Kaman Fiter for State Estimation of Noninear and Non-aussian Systems Zhong Liu, Shing-Chow Chan, Ho-Chun Wu and iafei Wu Department of Eectrica and Eectronic Engineering, he University

More information

A Ridgelet Kernel Regression Model using Genetic Algorithm

A Ridgelet Kernel Regression Model using Genetic Algorithm A Ridgeet Kerne Regression Mode using Genetic Agorithm Shuyuan Yang, Min Wang, Licheng Jiao * Institute of Inteigence Information Processing, Department of Eectrica Engineering Xidian University Xi an,

More information

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization Scaabe Spectrum ocation for Large Networks ased on Sparse Optimization innan Zhuang Modem R&D Lab Samsung Semiconductor, Inc. San Diego, C Dongning Guo, Ermin Wei, and Michae L. Honig Department of Eectrica

More information

Research Article Numerical Range of Two Operators in Semi-Inner Product Spaces

Research Article Numerical Range of Two Operators in Semi-Inner Product Spaces Abstract and Appied Anaysis Voume 01, Artice ID 846396, 13 pages doi:10.1155/01/846396 Research Artice Numerica Range of Two Operators in Semi-Inner Product Spaces N. K. Sahu, 1 C. Nahak, 1 and S. Nanda

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

V.B The Cluster Expansion

V.B The Cluster Expansion V.B The Custer Expansion For short range interactions, speciay with a hard core, it is much better to repace the expansion parameter V( q ) by f(q ) = exp ( βv( q )) 1, which is obtained by summing over

More information

A MODEL FOR ESTIMATING THE LATERAL OVERLAP PROBABILITY OF AIRCRAFT WITH RNP ALERTING CAPABILITY IN PARALLEL RNAV ROUTES

A MODEL FOR ESTIMATING THE LATERAL OVERLAP PROBABILITY OF AIRCRAFT WITH RNP ALERTING CAPABILITY IN PARALLEL RNAV ROUTES 6 TH INTERNATIONAL CONGRESS OF THE AERONAUTICAL SCIENCES A MODEL FOR ESTIMATING THE LATERAL OVERLAP PROBABILITY OF AIRCRAFT WITH RNP ALERTING CAPABILITY IN PARALLEL RNAV ROUTES Sakae NAGAOKA* *Eectronic

More information

A Robust Voice Activity Detection based on Noise Eigenspace Projection

A Robust Voice Activity Detection based on Noise Eigenspace Projection A Robust Voice Activity Detection based on Noise Eigenspace Projection Dongwen Ying 1, Yu Shi 2, Frank Soong 2, Jianwu Dang 1, and Xugang Lu 1 1 Japan Advanced Institute of Science and Technoogy, Nomi

More information

Identification of macro and micro parameters in solidification model

Identification of macro and micro parameters in solidification model BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vo. 55, No. 1, 27 Identification of macro and micro parameters in soidification mode B. MOCHNACKI 1 and E. MAJCHRZAK 2,1 1 Czestochowa University

More information

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7 6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17 Soution 7 Probem 1: Generating Random Variabes Each part of this probem requires impementation in MATLAB. For the

More information

Intuitionistic Fuzzy Optimization Technique for Nash Equilibrium Solution of Multi-objective Bi-Matrix Games

Intuitionistic Fuzzy Optimization Technique for Nash Equilibrium Solution of Multi-objective Bi-Matrix Games Journa of Uncertain Systems Vo.5, No.4, pp.27-285, 20 Onine at: www.jus.org.u Intuitionistic Fuzzy Optimization Technique for Nash Equiibrium Soution of Muti-objective Bi-Matri Games Prasun Kumar Naya,,

More information

Ant Colony Algorithms for Constructing Bayesian Multi-net Classifiers

Ant Colony Algorithms for Constructing Bayesian Multi-net Classifiers Ant Coony Agorithms for Constructing Bayesian Muti-net Cassifiers Khaid M. Saama and Aex A. Freitas Schoo of Computing, University of Kent, Canterbury, UK. {kms39,a.a.freitas}@kent.ac.uk December 5, 2013

More information

CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION

CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION CONJUGATE GRADIENT WITH SUBSPACE OPTIMIZATION SAHAR KARIMI AND STEPHEN VAVASIS Abstract. In this paper we present a variant of the conjugate gradient (CG) agorithm in which we invoke a subspace minimization

More information

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM MIKAEL NILSSON, MATTIAS DAHL AND INGVAR CLAESSON Bekinge Institute of Technoogy Department of Teecommunications and Signa Processing

More information

Available online at ScienceDirect. Procedia Computer Science 96 (2016 )

Available online at  ScienceDirect. Procedia Computer Science 96 (2016 ) Avaiabe onine at www.sciencedirect.com ScienceDirect Procedia Computer Science 96 (206 92 99 20th Internationa Conference on Knowedge Based and Inteigent Information and Engineering Systems Connected categorica

More information

Mathematical Scheme Comparing of. the Three-Level Economical Systems

Mathematical Scheme Comparing of. the Three-Level Economical Systems Appied Mathematica Sciences, Vo. 11, 2017, no. 15, 703-709 IKAI td, www.m-hikari.com https://doi.org/10.12988/ams.2017.7252 Mathematica Scheme Comparing of the Three-eve Economica Systems S.M. Brykaov

More information

Notes: Most of the material presented in this chapter is taken from Jackson, Chap. 2, 3, and 4, and Di Bartolo, Chap. 2. 2π nx i a. ( ) = G n.

Notes: Most of the material presented in this chapter is taken from Jackson, Chap. 2, 3, and 4, and Di Bartolo, Chap. 2. 2π nx i a. ( ) = G n. Chapter. Eectrostatic II Notes: Most of the materia presented in this chapter is taken from Jackson, Chap.,, and 4, and Di Bartoo, Chap... Mathematica Considerations.. The Fourier series and the Fourier

More information

Copyright information to be inserted by the Publishers. Unsplitting BGK-type Schemes for the Shallow. Water Equations KUN XU

Copyright information to be inserted by the Publishers. Unsplitting BGK-type Schemes for the Shallow. Water Equations KUN XU Copyright information to be inserted by the Pubishers Unspitting BGK-type Schemes for the Shaow Water Equations KUN XU Mathematics Department, Hong Kong University of Science and Technoogy, Cear Water

More information

Semidefinite relaxation and Branch-and-Bound Algorithm for LPECs

Semidefinite relaxation and Branch-and-Bound Algorithm for LPECs Semidefinite reaxation and Branch-and-Bound Agorithm for LPECs Marcia H. C. Fampa Universidade Federa do Rio de Janeiro Instituto de Matemática e COPPE. Caixa Posta 68530 Rio de Janeiro RJ 21941-590 Brasi

More information

Structural health monitoring of concrete dams using least squares support vector machines

Structural health monitoring of concrete dams using least squares support vector machines Structura heath monitoring of concrete dams using east squares support vector machines *Fei Kang ), Junjie Li ), Shouju Li 3) and Jia Liu ) ), ), ) Schoo of Hydrauic Engineering, Daian University of echnoogy,

More information

Trainable fusion rules. I. Large sample size case

Trainable fusion rules. I. Large sample size case Neura Networks 19 (2006) 1506 1516 www.esevier.com/ocate/neunet Trainabe fusion rues. I. Large sampe size case Šarūnas Raudys Institute of Mathematics and Informatics, Akademijos 4, Vinius 08633, Lithuania

More information

Target Location Estimation in Wireless Sensor Networks Using Binary Data

Target Location Estimation in Wireless Sensor Networks Using Binary Data Target Location stimation in Wireess Sensor Networks Using Binary Data Ruixin Niu and Pramod K. Varshney Department of ectrica ngineering and Computer Science Link Ha Syracuse University Syracuse, NY 344

More information

V.B The Cluster Expansion

V.B The Cluster Expansion V.B The Custer Expansion For short range interactions, speciay with a hard core, it is much better to repace the expansion parameter V( q ) by f( q ) = exp ( βv( q )), which is obtained by summing over

More information

Supervised i-vector Modeling - Theory and Applications

Supervised i-vector Modeling - Theory and Applications Supervised i-vector Modeing - Theory and Appications Shreyas Ramoji, Sriram Ganapathy Learning and Extraction of Acoustic Patterns LEAP) Lab, Eectrica Engineering, Indian Institute of Science, Bengauru,

More information

Fitting Algorithms for MMPP ATM Traffic Models

Fitting Algorithms for MMPP ATM Traffic Models Fitting Agorithms for PP AT Traffic odes A. Nogueira, P. Savador, R. Vaadas University of Aveiro / Institute of Teecommunications, 38-93 Aveiro, Portuga; e-mai: (nogueira, savador, rv)@av.it.pt ABSTRACT

More information

Worst Case Analysis of the Analog Circuits

Worst Case Analysis of the Analog Circuits Proceedings of the 11th WSEAS Internationa Conference on CIRCUITS, Agios Nikoaos, Crete Isand, Greece, Juy 3-5, 7 9 Worst Case Anaysis of the Anaog Circuits ELENA NICULESCU*, DORINA-MIOARA PURCARU* and

More information