Feature Selection for SVMs

J. Weston†, S. Mukherjee‡, O. Chapelle*, M. Pontil‡, T. Poggio‡, V. Vapnik*,§
† Barnhill BioInformatics.com, Savannah, Georgia, USA.
‡ CBCL MIT, Cambridge, Massachusetts, USA.
* AT&T Research Laboratories, Red Bank, USA.
§ Royal Holloway, University of London, Egham, Surrey, UK.

Abstract

We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error. This search can be efficiently performed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA microarray data.

1 Introduction

In many supervised learning problems feature selection is important for a variety of reasons: generalization performance, running time requirements, and constraints and interpretational issues imposed by the problem itself. In classification problems we are given ℓ data points x_i ∈ R^n labeled y ∈ ±1, drawn i.i.d. from a probability distribution P(x, y). We would like to select a subset of features while preserving or improving the discriminative ability of a classifier. As a brute force search over all possible feature subsets is a combinatorial problem, one needs to take into account both the quality of solution and the computational expense of any given algorithm.

Support vector machines (SVMs) have been extensively used as a classification tool with a great deal of success, from object recognition [5, 11] to classification of cancer morphologies [10] and a variety of other areas, see e.g. [13]. In this article we introduce feature selection algorithms for SVMs. The methods are based on minimizing generalization bounds via gradient descent and are feasible to compute. This allows several new possibilities: one can speed up time-critical applications (e.g. object recognition) and one can perform feature discovery (e.g. cancer diagnosis). We also show how SVMs can perform badly in the situation of many irrelevant features, a problem which is remedied by using our feature selection approach.

The article is organized as follows. In section 2 we describe the feature selection problem, in section 3 we review SVMs and some of their generalization bounds, and in section 4 we introduce the new SVM feature selection method. Section 5 then describes results on toy and real-life data indicating the usefulness of our approach.

2 The Feature Selection problem

The feature selection problem can be addressed in the following two ways: (1) given a fixed m ≪ n, find the m features that give the smallest expected generalization error; or (2) given a maximum allowable generalization error γ, find the smallest m. In both of these problems the expected generalization error is of course unknown, and thus must be estimated. In this article we will consider problem (1). Note that choices of m in problem (1) can usually be reparameterized as choices of γ in problem (2).

Problem (1) is formulated as follows. Given a fixed set of functions y = f(x, α) we wish to find a preprocessing of the data x ↦ (x * σ), σ ∈ {0, 1}^n, and the parameters α of the function f that give the minimum value of

    τ(σ, α) = ∫ V(y, f((x * σ), α)) dP(x, y)    (1)

subject to ||σ||_0 = m, where P(x, y) is unknown, x * σ = (x_1 σ_1, ..., x_n σ_n) denotes an elementwise product, V(·,·) is a loss functional and ||·||_0 is the 0-norm.

In the literature one distinguishes between two types of method to solve this problem: the so-called filter and wrapper methods [2]. Filter methods are defined as a preprocessing step to induction that can remove irrelevant attributes before induction occurs, and thus wish to be valid for any set of functions f(x, α). For example, one popular filter method is to use Pearson correlation coefficients. The wrapper method, on the other hand, is defined as a search through the space of feature subsets using the estimated accuracy from an induction algorithm as a measure of goodness of a particular feature subset. Thus, one approximates τ(σ, α) by minimizing

    τ_wrap(σ) = min_σ τ_alg(σ)    (2)

subject to σ ∈ {0, 1}^n, where τ_alg is a learning algorithm trained on data preprocessed with fixed σ. Wrapper methods can provide more accurate solutions than filter methods [9], but in general are more computationally expensive since the induction algorithm τ_alg must be evaluated over each feature set (vector σ) considered, typically using performance on a hold-out set as a measure of goodness of fit.

In this article we introduce a feature selection algorithm for SVMs that takes advantage of the performance increase of wrapper methods whilst avoiding their computational complexity. Note, some previous work on feature selection for SVMs does exist, however results have been limited to linear kernels [3, 7] or linear probabilistic models [8]. Our approach can be applied to nonlinear problems. In order to describe this algorithm, we first review the SVM method and some of its properties.
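Before moving on, here is a minimal sketch of the Pearson-correlation filter baseline mentioned above. It is our illustration, not the paper's code; the function name and the small epsilon guard against constant features are our own choices.

import numpy as np

def pearson_filter(X, y, m):
    """Keep the m features whose |Pearson correlation| with the labels is largest.

    X: (l, n) data matrix; y: (l,) labels in {-1, +1}; returns selected column indices.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12   # guard against constant columns
    scores = np.abs(num / den)
    return np.argsort(scores)[::-1][:m]

# Usage: idx = pearson_filter(X, y, m=2); then train any classifier on X[:, idx].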

3 Support Vector Learning

Support Vector Machines [13] realize the following idea: they map x ∈ R^n into a high (possibly infinite) dimensional space and construct an optimal hyperplane in this space. Different mappings x ↦ Φ(x) ∈ H construct different SVMs. The mapping Φ(·) is performed by a kernel function K(·,·) which defines an inner product in H. The decision function given by an SVM is thus:

    f(x) = w · Φ(x) + b = Σ_i α_i⁰ y_i K(x_i, x) + b.    (3)

The optimal hyperplane is the one with the maximal distance (in H space) to the closest image Φ(x_i) from the training data (called the maximal margin). This reduces to maximizing the following optimization problem:

    W²(α) = Σ_{i=1}^ℓ α_i − (1/2) Σ_{i,j=1}^ℓ α_i α_j y_i y_j K(x_i, x_j)    (4)

under the constraints Σ_{i=1}^ℓ α_i y_i = 0 and α_i ≥ 0, i = 1, ..., ℓ. For the non-separable case one can quadratically penalize errors with the modified kernel K ← K + (1/λ) I, where I is the identity matrix and λ a constant penalizing the training errors (see [4] for reasons for this choice).

Suppose that the size of the maximal margin is M and the images Φ(x_1), ..., Φ(x_ℓ) of the training vectors are within a sphere of radius R. Then the following holds true [13].

Theorem 1. If images of training data of size ℓ belonging to a sphere of size R are separable with the corresponding margin M, then the expectation of the error probability has the bound

    E P_err ≤ (1/ℓ) E { R² / M² } = (1/ℓ) E { R² W²(α⁰) },    (5)

where expectation is taken over sets of training data of size ℓ.

This theorem justifies the idea that the performance depends on the ratio E{R²/M²} and not simply on the large margin M, where R is controlled by the mapping function Φ(·). Other bounds also exist; in particular Vapnik and Chapelle [4] derived an estimate using the concept of the span of support vectors.

Theorem 2. Under the assumption that the set of support vectors does not change when removing the example p,

    E P_err^(ℓ−1) ≤ (1/ℓ) E { Σ_{p=1}^ℓ Ψ( α_p⁰ / (K_SV⁻¹)_pp − 1 ) },    (6)

where Ψ is the step function, K_SV is the matrix of dot products between support vectors, P_err^(ℓ−1) is the probability of test error for the machine trained on a sample of size ℓ − 1 and the expectations are taken over the random choice of the sample.

4 Feature Selection for SVMs

In the problem of feature selection we wish to minimize equation (1) over σ and α. The support vector method attempts to find the function from the set f(x, w, b) = w · Φ(x) + b that minimizes generalization error. We first enlarge the set of functions considered by the algorithm to f(x, w, b, σ) = w · Φ(x * σ) + b. Note that the mapping Φ_σ(x) = Φ(x * σ) can be represented by choosing the kernel function K_σ in equations (3) and (4):

    K_σ(x, y) = K((x * σ), (y * σ)) = (Φ_σ(x) · Φ_σ(y))    (7)

for any K. Thus for these kernels the bounds in Theorems (1) and (2) still hold. Hence, to minimize τ(σ, α) over σ and α we minimize the wrapper functional τ_wrap in equation (2) where τ_alg is given by equation (5) or (6), choosing a fixed value of σ implemented by the kernel (7). Using equation (5) one minimizes over σ:

    R²W²(σ) = R²(σ) W²(α⁰, σ),    (8)

where the radius R for kernel K_σ can be computed by maximizing (see, e.g. [13]):

    R²(σ) = max_β Σ_i β_i K_σ(x_i, x_i) − Σ_{i,j} β_i β_j K_σ(x_i, x_j)    (9)

subject to Σ_i β_i = 1, β_i ≥ 0, i = 1, ..., ℓ, and W²(α⁰, σ) is defined by the maximum of functional (4) using kernel (7). In a similar way, one can minimize the span bound over σ instead of equation (8).
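The quantities in (8)-(9) can be evaluated numerically for a fixed kernel matrix. The sketch below is ours, not the authors' code: it uses scikit-learn's soft-margin C-SVM with a large C as a stand-in for the hard-margin / quadratically-penalized machine in the text to obtain α⁰ for W²(α⁰), and scipy's constrained optimizer to solve the radius problem (9). Function names and the choice C = 100 are our own assumptions.

import numpy as np
from scipy.optimize import minimize
from sklearn.svm import SVC

def w2_from_svm(K, y, C=100.0):
    """W^2(alpha^0) = sum_i alpha_i - (1/2) sum_ij alpha_i alpha_j y_i y_j K_ij,
    with alpha^0 taken from a trained SVM (soft-margin C-SVM with large C here,
    as a stand-in for the hard-margin machine in the text)."""
    svm = SVC(kernel="precomputed", C=C).fit(K, y)
    sv = svm.support_                      # indices of the support vectors
    ay = svm.dual_coef_.ravel()            # alpha_i * y_i for the support vectors
    return np.abs(ay).sum() - 0.5 * ay @ K[np.ix_(sv, sv)] @ ay

def r2_from_kernel(K):
    """R^2 from equation (9): maximize sum_i beta_i K_ii - sum_ij beta_i beta_j K_ij
    subject to sum_i beta_i = 1, beta_i >= 0."""
    l = K.shape[0]
    kdiag = np.diag(K)
    obj = lambda b: -(b @ kdiag - b @ K @ b)            # negate: scipy minimizes
    cons = [{"type": "eq", "fun": lambda b: b.sum() - 1.0}]
    res = minimize(obj, np.full(l, 1.0 / l), bounds=[(0.0, 1.0)] * l, constraints=cons)
    return -res.fun

# For a candidate scaling sigma, build K_sigma from the scaled data and evaluate the bound:
#   bound = r2_from_kernel(K_sigma) * w2_from_svm(K_sigma, y)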

Finding the minimum of R²W² over σ requires searching over all possible subsets of n features, which is a combinatorial problem. To avoid this problem, classical methods of search include greedily adding or removing features (forward or backward selection) and hill climbing. All of these methods are expensive to compute if n is large. As an alternative to these approaches we suggest the following method: approximate the binary valued vector σ ∈ {0, 1}^n with a real valued vector σ ∈ R^n. Then, to find the optimum value of σ one can minimize R²W², or some other differentiable criterion, by gradient descent. As explained in [4], the derivatives of our criterion are:

    ∂R²W²(σ)/∂σ_k = R²(σ) ∂W²(α⁰, σ)/∂σ_k + W²(α⁰, σ) ∂R²(σ)/∂σ_k    (10)
    ∂R²(σ)/∂σ_k = Σ_i β_i⁰ ∂K_σ(x_i, x_i)/∂σ_k − Σ_{i,j} β_i⁰ β_j⁰ ∂K_σ(x_i, x_j)/∂σ_k    (11)
    ∂W²(α⁰, σ)/∂σ_k = −(1/2) Σ_{i,j} α_i⁰ α_j⁰ y_i y_j ∂K_σ(x_i, x_j)/∂σ_k,    (12)

where β⁰ maximizes (9) and α⁰ maximizes (4) for the kernel K_σ. We estimate the minimum of τ(σ, α) by minimizing equation (8) in the space σ ∈ R^n using the gradients (10) with the following extra constraint which approximates integer programming:

    R²W²(σ) + λ Σ_i (σ_i)^p    (13)

subject to Σ_i σ_i = m, σ_i ≥ 0, i = 1, ..., n. For large enough λ, as p → 0 only m elements of σ will be nonzero, approximating the optimization problem τ(σ, α).

One can further simplify computations by considering a stepwise approximation procedure to find m features. To do this one can minimize R²W²(σ) with σ unconstrained. One then sets the q ≪ n smallest values of σ to zero, and repeats the minimization until only m nonzero elements of σ remain. This can mean repeatedly training an SVM just a few times, which can be fast. (A sketch of this stepwise procedure is given below.)
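A minimal sketch of the stepwise procedure just described, reusing r2_from_kernel and w2_from_svm from the previous sketch. To keep it short we let scipy differentiate R²W²(σ) numerically instead of implementing the analytic gradients (10)-(12), so this is an illustration of the idea rather than the authors' implementation; the function names and the linear base kernel are our own choices.

import numpy as np
from scipy.optimize import minimize

def linear_base_kernel(A, B):
    return A @ B.T

def rw_bound(sigma, X, y, kernel=linear_base_kernel):
    """R^2 W^2 criterion (8) evaluated on the feature-scaled data x * sigma."""
    Xs = X * sigma
    K = kernel(Xs, Xs)
    return r2_from_kernel(K) * w2_from_svm(K, y)   # helpers from the previous sketch

def stepwise_select(X, y, m, q=1, kernel=linear_base_kernel):
    """Stepwise approximation: minimize R^2 W^2 over the active scaling factors
    (numerical gradients here, in place of equations (10)-(12)), set the q
    smallest factors to zero, and repeat until only m features remain."""
    n = X.shape[1]
    active = np.arange(n)
    sigma = np.ones(n)
    while active.size > m:
        def obj(s_active):
            s = np.zeros(n)
            s[active] = s_active
            return rw_bound(s, X, y, kernel)
        res = minimize(obj, sigma[active], bounds=[(0.0, None)] * active.size)
        sigma[active] = res.x
        n_drop = min(q, active.size - m)
        drop = active[np.argsort(sigma[active])[:n_drop]]
        sigma[drop] = 0.0
        active = np.setdiff1d(active, drop)
    return active, sigma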

5 Experiments

5.1 Toy data

We compared standard SVMs, our feature selection algorithms and three classical filter methods used to select features followed by SVM training. The three filter methods chose the m largest features according to: Pearson correlation coefficients, the Fisher criterion score¹, and the Kolmogorov-Smirnov test². The Pearson coefficients and Fisher criterion cannot model nonlinear dependencies. In the two following artificial datasets our objective was to assess the ability of the algorithm to select a small number of target features in the presence of irrelevant and redundant features.

¹ F(r) = | μ_r⁺ − μ_r⁻ | / ( (σ_r⁺)² + (σ_r⁻)² ), where μ_r^± is the mean value for the r-th feature in the positive and negative classes and σ_r^± is the corresponding standard deviation.

² KS_tst(r) = √ℓ sup ( P̂{X ≤ f_r} − P̂{X ≤ f_r, y_r = 1} ), where f_r denotes the r-th feature from each training example, and P̂ is the corresponding empirical distribution.

Linear problem. Six dimensions out of 202 were relevant. The probability of y = 1 or −1 was equal. The first three features {x_1, x_2, x_3} were drawn as x_i = y N(i, 1) and the second three features {x_4, x_5, x_6} were drawn as x_i = N(0, 1) with a probability of 0.7; otherwise the first three were drawn as x_i = N(0, 1) and the second three as x_i = y N(i − 3, 1). The remaining features are noise, x_i = N(0, 20), i = 7, ..., 202. (A generation sketch is given at the end of this subsection.)

Nonlinear problem. Two dimensions out of 52 were relevant. The probability of y = 1 or −1 was equal. The data are drawn from the following: if y = −1 then {x_1, x_2} are drawn from N(μ_1, Σ) or N(μ_2, Σ) with equal probability, with μ_1 = {−3/4, −3}, μ_2 = {3/4, 3} and Σ = I; if y = 1 then {x_1, x_2} are drawn again from two normal distributions with equal probability, with μ_1 = {3, −3}, μ_2 = {−3, 3} and the same Σ as before. The rest of the features are noise, x_i = N(0, 20), i = 3, ..., 52.

In the linear problem the first six features have redundancy and the rest of the features are irrelevant. In the nonlinear problem all but the first two features are irrelevant. We used a linear SVM for the linear problem and a second order polynomial kernel for the nonlinear problem. For the filter methods and the SVM with feature selection we selected the 2 best features. The results are shown in Figure 1 for various training set sizes, taking the average test error on 500 samples over 30 runs of each training set size. The Fisher score (not shown in the graphs due to space constraints) performed almost identically to correlation coefficients.

In both problems standard SVMs perform poorly: in the linear example using ℓ = 500 points one obtains a test error of 13% for SVMs, which should be compared to a test error of 3% with ℓ = 50 using our methods. Our SVM feature selection methods also outperformed the filter methods, with forward selection being marginally better than gradient descent. In the nonlinear problem, among the filter methods only the Kolmogorov-Smirnov test improved performance over standard SVMs.

Figure 1: A comparison of feature selection methods on (a) a linear problem and (b) a nonlinear problem, both with many irrelevant features. The x-axis is the number of training points, and the y-axis the test error as a fraction of test points. (Methods plotted: Span Bound & Forward Selection, R²W² Bound & Gradient, Standard SVMs, Correlation Coefficients, Kolmogorov-Smirnov Test.)
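For concreteness, one reading of the linear-problem recipe above (our sketch, not the authors' code; the text does not say whether the noise parameter 20 is a standard deviation or a variance, and we take it as a standard deviation):

import numpy as np

def linear_toy_problem(l, n=202, seed=0):
    """Draw l examples of the linear problem: 6 relevant features out of n = 202."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=l)            # P(y = +1) = P(y = -1) = 1/2
    X = rng.normal(0.0, 20.0, size=(l, n))         # irrelevant noise features, N(0, 20)
    for i in range(l):
        if rng.random() < 0.7:
            X[i, 0:3] = y[i] * rng.normal([1.0, 2.0, 3.0], 1.0)   # x_j = y N(j, 1),     j = 1..3
            X[i, 3:6] = rng.normal(0.0, 1.0, size=3)              # x_j = N(0, 1),       j = 4..6
        else:
            X[i, 0:3] = rng.normal(0.0, 1.0, size=3)              # x_j = N(0, 1),       j = 1..3
            X[i, 3:6] = y[i] * rng.normal([1.0, 2.0, 3.0], 1.0)   # x_j = y N(j - 3, 1), j = 4..6
    return X, y

# e.g. X_train, y_train = linear_toy_problem(50, seed=1)
#      X_test,  y_test  = linear_toy_problem(500, seed=2)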

5.2 Real-life data

For the following problems we compared minimizing R²W² via gradient descent to the Fisher criterion score.

Face detection. The face detection experiments described in this section are for the system introduced in [12, 5]. The training set consisted of 2,429 positive images of frontal faces of size 19x19 and 13,229 negative images not containing faces. The test set consisted of 105 positive images and 2,000,000 negative images. A wavelet representation of these images [5] was used, which resulted in 1,740 coefficients for each image.

Performance of the system using all coefficients, 725 coefficients, and 120 coefficients is shown in the ROC curve in Figure 2(a). The best results were achieved using all features; however, R²W² outperformed the Fisher score. In this case feature selection was not useful for eliminating irrelevant features, but one could obtain a solution with comparable performance but reduced complexity, which could be important for time-critical applications.

Pedestrian detection. The pedestrian detection experiments described in this section are for the system introduced in [11]. The training set consisted of 924 positive images of people of size 128x64 and 10,044 negative images not containing pedestrians. The test set consisted of 124 positive images and 800,000 negative images. A wavelet representation of these images [5, 11] was used, which resulted in 1,326 coefficients for each image. Performance of the system using all coefficients and 120 coefficients is shown in the ROC curve in Figure 2(b). The results showed the same trends that were observed in the face recognition problem.

Figure 2: ROC curves (detection rate versus false positive rate). The solid line is using all features, the solid line with a circle is our feature selection method (minimizing R²W² by gradient descent) and the dotted line is the Fisher score. (a) The top ROC curves are for 725 features and the bottom ones for 120 features for face detection. (b) ROC curves using all features and 120 features for pedestrian detection.

Cancer morphology classification. For DNA microarray data analysis one needs to determine the relevant genes in discrimination as well as discriminate accurately. We look at two leukemia discrimination problems [6, 10] and a colon cancer problem [1] (see also [7] for a treatment of both of these problems).

The first problem was classifying myeloid and lymphoblastic leukemias based on the expression of 7129 genes. The training set consists of 38 examples and the test set of 34 examples. Using all genes a linear SVM makes 1 error on the test set. Using 20 genes, 0 errors are made for R²W² and 3 errors are made using the Fisher score. Using 5 genes, 1 error is made for R²W² and 5 errors are made for the Fisher score. The method of [6] performs comparably to the Fisher score.

The second problem was discriminating B versus T cells for lymphoblastic cells [6]. Standard linear SVMs make 1 error for this problem. Using 5 genes, 0 errors are made for R²W² and 3 errors are made using the Fisher score.

In the colon cancer problem [1], 62 tissue samples probed by oligonucleotide arrays contain 22 normal and 40 colon cancer tissues that must be discriminated based upon the expression of 2000 genes. Splitting the data into a training set of 50 and a test set of 12 in 50 separate trials, we obtained a test error of 13% for standard linear SVMs. Taking 15 genes for each feature selection method, we obtained 12.8% for R²W², 17.0% for Pearson correlation coefficients, 19.3% for the Fisher score and 19.2% for the Kolmogorov-Smirnov test. Our method is only worse than the best filter method in 8 of the 50 trials.

6 Conclusion

In this article we have introduced a method to perform feature selection for SVMs. This method is computationally feasible for high dimensional datasets compared to existing wrapper methods, and experiments on a variety of toy and real datasets show superior performance to the filter methods tried. This method, amongst other applications, speeds up SVMs for time-critical applications (e.g. pedestrian detection), and makes possible feature discovery (e.g. gene discovery). Secondly, in simple experiments we showed that SVMs can indeed suffer in high dimensional spaces where many features are irrelevant. Our method provides one way to circumvent this naturally occurring, complex problem.

References

[1] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon cancer tissues probed by oligonucleotide arrays. Cell Biology, 96.
[2] A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97.
[3] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proc. 13th International Conference on Machine Learning, pages 82-90, San Francisco, CA.
[4] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing kernel parameters for support vector machines. Machine Learning.
[5] T. Evgeniou, M. Pontil, C. Papageorgiou, and T. Poggio. Image representations for object detection using kernel classifiers. In Asian Conference on Computer Vision.
[6] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286.
[7] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning.
[8] T. Jebara and T. Jaakkola. Feature selection and dualities in maximum entropy discrimination. In Uncertainty in Artificial Intelligence.
[9] J. Kohavi. Wrappers for feature subset selection. AIJ issue on relevance.
[10] S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J. Mesirov, and T. Poggio. Support vector machine classification of microarray data. AI Memo 1677, Massachusetts Institute of Technology.
[11] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian detection using wavelet templates. In Proc. Computer Vision and Pattern Recognition, Puerto Rico, June.
[12] C. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In International Conference on Computer Vision, Bombay, India, January.
[13] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.
