Sparse Support Vector Machine with L_p Penalty for Feature Selection


Yao L, Zeng F, Li DH et al. Sparse support vector machine with L_p penalty for feature selection. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 32(1), Jan. 2017.

Sparse Support Vector Machine with L_p Penalty for Feature Selection

Lan Yao 1, Feng Zeng 2,*, Member, CCF, Dong-Hui Li 3, and Zhi-Gang Chen 2, Senior Member, CCF

1 College of Mathematics and Econometrics, Hunan University, Changsha, China
2 School of Software, Central South University, Changsha, China
3 School of Mathematical Sciences, South China Normal University, Guangzhou, China

E-mail: yao@hnu.edu.cn; fengzeng@csu.edu.cn; dhli@scnu.edu.cn; czg@csu.edu.cn

Received February 28, 2016; revised September 7.

Abstract   We study strategies for feature selection with the sparse support vector machine (SVM). Recently, the so-called L_p-SVM ($0 < p < 1$) has attracted much attention because it can encourage better sparsity than the widely used L_1-SVM. However, L_p-SVM is a non-convex and non-Lipschitz optimization problem, and solving it numerically is challenging. In this paper, we reformulate the L_p-SVM into an optimization model with a linear objective function and smooth constraints (LOSC-SVM) so that it can be solved by numerical methods for smooth constrained optimization. Our numerical experiments on artificial datasets show that LOSC-SVM ($0 < p < 1$) can improve both feature selection and classification performance by choosing a suitable parameter $p$. We also apply it to some real-life datasets, and the experimental results show that it is superior to L_1-SVM.

Keywords: machine learning, feature selection, support vector machine, L_p regularization

1 Introduction

The support vector machine (SVM) [1] is an optimal-margin classifier and has been a popular tool in both the machine learning and statistics communities. Although the SVM hyperplane relies only on a small subset of the training points, the resulting classifier always utilizes all features. When there are many noisy or redundant features, this can cause overfitting, reduce generalization ability and interpretability, and increase the computational cost. Consequently, feature selection is very important in classification.

Filter, wrapper, and embedded methods are popular feature selection strategies for SVM [2]. The major difference among these three methods is their relationship with the classifier. Filters act as a preprocessing step before classifier training: they select important features based on statistical properties, such as Pearson correlation coefficients and other classical test statistics, and this procedure is independent of classifier learning. Wrappers evaluate subsets of features according to their classification performance; they use the learning machine as a black box, and cross-validation is a common choice for evaluating the performance. Wrapper methods are usually more accurate than filter methods, but they are more computationally expensive. Embedded methods perform feature selection and classifier training simultaneously and take the classifier structure into account; they are less computationally intensive than wrapper methods [3].

Many embedded methods for SVM have been developed. Guyon et al. [4] and Rakotomamonjy [5] applied a recursive feature elimination (RFE) strategy to obtain a relevant feature subset, training a series of SVMs and removing the feature with the smallest SVM-based ranking criterion at each iteration. Weston et al. [6] and Peleg and Meir [7] used scaling factors to indicate the importance of features and iteratively optimized these scaling factors by minimizing a generalization error bound of SVM.

Regular Paper. This work is supported in part by the National Natural Science Foundation of China and by the Research Foundation of Central South University of China under Grant No. 2014JSJJ019.
*Corresponding Author
©2017 Springer Science + Business Media, LLC & Science Press, China

Besides selecting a subset of features, another category of embedded methods formulates the optimization problem so as to obtain a sparse solution, by adding a sparsity term to the objective function or by adding a cardinality constraint. It has been proved that L_1-SVM can yield sparse solutions. The L_1 norm in L_1-SVM plays a key role in feature selection: it encourages each coefficient to be either large or exactly zero, so that irrelevant features are automatically removed from the model. Another alternative is to minimize the L_0 quasi-norm. Since this is a combinatorial optimization problem and NP-hard, several continuous, differentiable, and concave approximations have been proposed [8-9]. Recently, the L_p quasi-norm ($0 < p < 1$) penalty has attracted great attention since it can encourage better sparsity than the L_1 norm, and several adaptive L_p-SVMs have been proposed to perform automatic feature selection.

In this paper, we focus on the L_p-SVM ($0 < p < 1$). Taking into account that the problem is non-convex, non-Lipschitzian, and NP-hard, we reformulate it as a smooth optimization model with a linear objective function and smooth constraints (LOSC). The resulting LOSC-SVM can be solved with standard optimization tools in Matlab. Theoretically, we establish the equivalence between the LOSC-SVM and the L_p-SVM. We also carry out numerical experiments to test the proposed LOSC-SVM model and analyze the influence of the parameter $p$ on classifier performance. Observing that a certain penalty may best suit a certain data structure, we treat $p$ as a tuning parameter instead of a fixed one, and the best parameter $p$ is selected for each test problem. Our numerical experiments show that the choice of $p$ is indeed an important factor for encouraging sparsity and improving the accuracy of the classifier: the LOSC-SVM with adaptive $p$ works better than any fixed $p$ in various situations.

The rest of this paper is organized as follows. In Section 2, we briefly review the standard SVM (L_2-SVM) and the sparse regularization SVMs. Section 3 describes the L_p-SVM model and its smooth constrained optimization reformulation. In Section 4, we report numerical experiments on the proposed reformulation LOSC-SVM. Section 5 gives the concluding remarks.

2 Sparse Support Vector Machine

Given a training dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^m$ is the feature vector and $y_i \in \{-1, 1\}$ is the class label, SVM seeks, for the binary classification problem, a separating hyperplane

$$w^T x + b = 0$$

which maximizes the margin and minimizes the training errors $\sum_{i=1}^{n} \xi_i$. The general model of L_p-SVM can then be written as

$$\min_{w,b,\xi} \; \|w\|_p^p + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, n,$$

where $C$ is a trade-off parameter and $p$ is a nonnegative scalar. When $p = 0$, $\|w\|_0$ stands for the cardinality of the support set $\{j : |w_j| > 0\}$, and for $p \in (0, 2]$, $\|w\|_p = (\sum_{j=1}^{m} |w_j|^p)^{1/p}$. The case $p = 2$ corresponds to the standard C-SVM (L_2-SVM) [1]. It is a convex quadratic program and can be solved easily; however, the decision hyperplane learned by L_2-SVM often utilizes all features. For feature selection purposes, $\|w\|_p$ with $0 < p < 1$ is generally used as a sparsity penalty to shrink the feature space, and feature selection is then an indirect consequence of SVM training. In what follows, we give some details on the L_p-SVMs with $p = 0$, $p = 1$, and $p \in (0, 1)$, due to their particular roles in sparse SVMs.
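To see concretely why an L_p penalty with $p < 1$ favors sparse weight vectors, the following short Python sketch (an illustration added here; the example vectors and values of $p$ are arbitrary and are not taken from the experiments) evaluates $\|w\|_p^p = \sum_j |w_j|^p$ for a sparse and a dense vector with the same L_1 norm.

```python
import numpy as np

def lp_penalty(w, p):
    """Return ||w||_p^p = sum_j |w_j|^p, the penalty used in L_p-SVM."""
    return np.sum(np.abs(w) ** p)

# Two weight vectors with the same L1 norm: one sparse, one dense.
w_sparse = np.array([2.0, 0.0, 0.0, 0.0])   # mass concentrated on one feature
w_dense  = np.array([0.5, 0.5, 0.5, 0.5])   # mass spread over all features

for p in (0.1, 0.5, 1.0, 2.0):
    print(f"p = {p:3.1f}:  sparse -> {lp_penalty(w_sparse, p):6.3f},  "
          f"dense -> {lp_penalty(w_dense, p):6.3f}")
# For p = 1 both penalties equal 2.0; for p < 1 the sparse vector is
# strictly cheaper, so the penalty pushes solutions toward sparsity.
```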
2.1 L_0-SVM

L_0-SVM is expected to find the sparsest classifier by minimizing $\|w\|_0$, the number of nonzero elements of $w$. However, this is a discrete and NP-hard problem [10], and it is in general very difficult to develop efficient numerical methods for it. A widely used technique is to approximate L_0-SVM by a smooth problem. Bradley and Mangasarian [8] approximated $\|w\|_0$ with a concave function as

$$\|w\|_0 \approx \sum_{j=1}^{m} \left(1 - e^{-\alpha |w_j|}\right).$$

Here, the parameter $\alpha$ controls the closeness of the approximation to $\|w\|_0$, and its value is suggested to be 5 in this paper. The resulting problem is known as feature selection concave minimization (FSV). A successive linear approximation (SLA) algorithm was suggested to solve it, which involves a sequence of linear problems of the form

$$\min \; \sum_{j=1}^{m} \alpha e^{-\alpha \bar{v}_j} (v_j - \bar{v}_j) \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \quad -v \le w \le v, \quad i = 1, \dots, n,$$

where $v_j$ ($j = 1, \dots, m$) denotes the $j$-th component of the vector $v$. Here, $v$ is introduced to eliminate the absolute value of $w$, and $\bar{v}$ is the value of $v$ obtained at the previous iteration.
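As a quick numerical check of this surrogate (a standalone sketch of our own, not code from [8] or from this paper; the test vector is arbitrary), the snippet below evaluates the exponential approximation for a few values of $\alpha$ and compares it with the exact zero norm.

```python
import numpy as np

def l0_norm(w):
    """Exact zero norm: number of nonzero components."""
    return int(np.count_nonzero(w))

def concave_l0_approx(w, alpha):
    """Bradley-Mangasarian surrogate: sum_j (1 - exp(-alpha * |w_j|))."""
    return float(np.sum(1.0 - np.exp(-alpha * np.abs(w))))

w = np.array([1.2, 0.0, -0.7, 0.0, 0.05])   # true zero norm is 3
print("exact ||w||_0 =", l0_norm(w))
for alpha in (1.0, 5.0, 20.0):
    print(f"alpha = {alpha:4.1f}:  surrogate = {concave_l0_approx(w, alpha):.3f}")
# Larger alpha makes the surrogate approach the exact count, at the price of a
# sharper (harder to optimize) objective; the paper follows [8] in using alpha = 5.
```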

Weston et al. [9] proposed another approximation for zero-norm minimization (AROM), in which the zero norm is approximated as

$$\|w\|_0 \approx \sum_{j=1}^{m} \log(\epsilon + |w_j|).$$

Besides approximation techniques, some other authors have explored convex relaxations of the L_0 norm. For example, Chan et al. [11] added the constraint $\|w\|_0 \le r$ to the standard SVM and proposed two direct convex relaxations of it, namely QCQP-SVM and SDP-SVM.

2.2 L_1-SVM

Bradley and Mangasarian [8] first proposed L_1-SVM for classification and noted the sparsity-inducing ability of L_1-SVM. They reformulated L_1-SVM as the following linear programming problem:

$$\min \; \sum_{j=1}^{m} u_j + \sum_{j=1}^{m} v_j + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i\big((u - v)^T x_i + b\big) \ge 1 - \xi_i, \quad u_j \ge 0, \; v_j \ge 0, \; \xi_i \ge 0, \; i = 1, \dots, n,$$

where $w = u - v$ ($u \ge 0$, $v \ge 0$), $u_j = (w_j)_+$ and $v_j = (-w_j)_+$. This problem can be solved easily by existing linear programming solvers. L_1-SVM has also been widely applied in computational biology [12] and drug design [13]. In the context of linear regression, the L_1 norm penalty is well known as the LASSO [14]. Instead of replacing the L_2 norm with a sparsity term, Neumann et al. [15] introduced an additional sparsity penalty into the standard SVM and proposed two modified SVMs: L_2-L_1-SVM and L_2-L_0-SVM.
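Because the reformulation above is a plain linear program, it can be handed directly to a generic LP solver. The following Python sketch (our own illustration using scipy.optimize.linprog and toy data; it is not the authors' code) stacks the variables as (u, v, b, xi) and recovers w = u - v.

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm_lp(X, y, C=1.0):
    """Solve the L1-SVM linear program of Bradley and Mangasarian.

    Variables are stacked as z = (u, v, b, xi) with w = u - v.
    """
    n, m = X.shape
    # Objective: sum(u) + sum(v) + C * sum(xi); b has zero cost.
    c = np.concatenate([np.ones(m), np.ones(m), [0.0], C * np.ones(n)])
    # Margin constraints y_i((u - v)^T x_i + b) >= 1 - xi_i, written as A_ub z <= b_ub.
    Yx = y[:, None] * X                       # rows are y_i * x_i^T
    A_ub = np.hstack([-Yx, Yx, -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(0, None)] * m + [(0, None)] * m + [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    u, v = res.x[:m], res.x[m:2 * m]
    b = res.x[2 * m]
    return u - v, b

# Toy usage: two informative features, two pure-noise features.
rng = np.random.default_rng(0)
y = np.repeat([1.0, -1.0], 50)
X = np.c_[y + 0.3 * rng.standard_normal(100),
          y + 0.3 * rng.standard_normal(100),
          rng.standard_normal((100, 2))]
w, b = l1_svm_lp(X, y, C=1.0)
print("weights:", np.round(w, 3))   # noise features are driven to (near) zero
```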
2.3 L_p-SVM (0 < p < 1)

Recently, extensive computational studies [16-18] have shown that the L_p problem with $0 < p < 1$ can find sparser solutions than the L_1 problem. On the other hand, in practice, an SVM with a fixed norm, such as L_2-SVM, L_1-SVM, or L_0-SVM, has advantages over the others only in certain situations, because different norms may work well for different data structures. Therefore, applying adaptive L_p regularization ($0 < p < 1$) to perform feature selection has become a welcome strategy. At the same time, L_p regularization has been introduced into several variants of SVM, such as the least squares SVM [19], the proximal SVM [20], and the multi-task SVM [21].

L_p-SVM is a non-convex and non-Lipschitz problem. Due to the term $|w_i|^p$, the objective function is not even directionally differentiable at a point with some $w_i = 0$, which makes the problem very difficult to solve, since most existing optimization algorithms are only efficient for smooth and convex problems. To solve this special non-smooth and non-Lipschitz L_p-SVM problem, several approximation algorithms have been proposed [22-26].

Liu et al. [22] proposed an L_p-SVM model for the multiclass classification problem and developed an iterative local quadratic algorithm (LQA) to solve it, in which the L_p regularization is approximated by

$$|w|^p \approx |w_0|^p + \frac{(|w_0|^p)'}{2|w_0|}\,(w^2 - w_0^2),$$

where $w_0$ is nonzero and close to $w$. Liu et al. [23] also proposed another smoothing model in which the non-smooth term $\sum_{j=1}^{m} |w_j|^p$ is approximated by

$$\sum_{j=1}^{m} |w_j|^p \approx \sum_{j=1}^{m} \big(w_j^2 + \gamma\big)^{p/2},$$

where $\gamma$ is set to a small value. With smooth approximations of the L_p norm and of the hinge loss function, the objective function becomes differentiable and any gradient-based algorithm for unconstrained problems can be used. This approach seems easy to implement, but there is as yet no principled way to set the smoothing parameter $\gamma$. Based on the idea in [27], Tian et al. [25] and Chen and Tian [20] applied the following smooth function to $|w|$ in L_p-SVM and L_p-PSVM, respectively:

$$s_\mu(t) = \begin{cases} |t|, & |t| > \mu, \\ \dfrac{t^2}{2\mu} + \dfrac{\mu}{2}, & |t| \le \mu, \end{cases}$$

where $\mu > 0$. Similar to the approach of [8], Tan et al. [24] introduced the variable $v$ to eliminate the absolute value of $w$, leading to the following equivalent problem:

$$\min \; \sum_{j=1}^{m} v_j^p + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n, \quad -v \le w \le v,$$

where $v_j$ ($j = 1, \dots, m$) denotes the $j$-th component of the vector $v$. The resulting problem is differentiable for all positive $v$, but non-convex. Its solution can be found by a successive linear approximation (SLA) algorithm. Zhang et al. [28] adopted the same reformulation in their

L_2-L_p SVM model and proposed a constrained concave-convex procedure (CCCP) to solve it. Liu and Liu [26] used a Legendre-Fenchel duality framework to solve L_p-SVM. Different from the above methods, in Section 3 we derive a novel equivalent reformulation of L_p-SVM. The resulting smooth constrained optimization model with a linear objective function is relatively easy to deal with.

3 LOSC Reformulation of L_p-SVM

In this section, we derive a smooth constrained optimization reformulation of L_p-SVM so that it becomes relatively easy to handle with optimization solvers.

3.1 L_p-SVM Model

First, let us recall the model of L_p-SVM:

$$\min \; \phi(w, b, \xi) = \sum_{j=1}^{m} |w_j|^p + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, n. \quad (1)$$

3.2 Reformulation of the L_p-SVM Model

By introducing an auxiliary variable $t = (t_1, t_2, \dots, t_m)^T$, L_p-SVM can be reformulated as the following constrained problem:

$$\min \; f(w, b, \xi, t) = \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad t_j^{\alpha} - w_j \ge 0, \; t_j^{\alpha} + w_j \ge 0, \; t_j \ge 0, \; j = 1, \dots, m, \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, n. \quad (2)$$

It is obtained by setting $t_j = |w_j|^p$ and $\alpha = p^{-1} > 1$. Problem (2) is a minimization problem with a linear objective function and smooth constraints, which we call LOSC-SVM. In a way similar to [29], we can derive some nice properties of the reformulation; in particular, the KKT points of the reformulation are the same as the KKT points of L_p-SVM. Denote the feasible region of the problem by $F$, i.e.,

$$F = \{(w, b, t, \xi) : t_j^{\alpha} - w_j \ge 0, \; t_j^{\alpha} + w_j \ge 0, \; t_j \ge 0, \; j = 1, \dots, m\} \cap \{(w, b, t, \xi) : y_i(w^T x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, n\}.$$

We conclude this subsection by showing the equivalence between LOSC-SVM (2) and L_p-SVM (1).

Theorem 1. If $u^* = (w^*, b^*, \xi^*) \in \mathbb{R}^{m+1+n}$ is a solution of L_p-SVM (1), then $z^* = (w^*, b^*, \xi^*, |w^*|^p) \in \mathbb{R}^{m+1+n+m}$ is a solution of LOSC-SVM (2). Conversely, if $(\bar{w}, \bar{b}, \bar{\xi}, \bar{t})$ is a solution of LOSC-SVM (2), then $(\bar{w}, \bar{b}, \bar{\xi})$ is a solution of L_p-SVM (1).

Proof. Let $u^* = (w^*, b^*, \xi^*)$ be a solution of L_p-SVM (1) and $\bar{z} = (\bar{w}, \bar{b}, \bar{\xi}, \bar{t})$ be a solution of LOSC-SVM (2). It is clear that $z^* = (w^*, b^*, \xi^*, |w^*|^p) \in F$. Moreover, we have $\bar{t}_j^{\alpha} \ge |\bar{w}_j|$, i.e., $\bar{t}_j \ge |\bar{w}_j|^p$, $j = 1, 2, \dots, m$, and hence

$$f(\bar{z}) = \sum_{j=1}^{m} \bar{t}_j + C \sum_{i=1}^{n} \bar{\xi}_i \ge \sum_{j=1}^{m} |\bar{w}_j|^p + C \sum_{i=1}^{n} \bar{\xi}_i \ge \sum_{j=1}^{m} |w_j^*|^p + C \sum_{i=1}^{n} \xi_i^* = \phi(u^*). \quad (3)$$

Since $z^* \in F$, we have $\phi(u^*) = f(z^*) \ge f(\bar{z})$. This together with (3) implies $f(\bar{z}) = \phi(u^*)$. Consequently, $\phi(\bar{w}, \bar{b}, \bar{\xi}) \le f(\bar{z}) = \phi(u^*)$, while the feasibility of $(\bar{w}, \bar{b}, \bar{\xi})$ for (1) gives $\phi(\bar{w}, \bar{b}, \bar{\xi}) \ge \phi(u^*)$; hence $(\bar{w}, \bar{b}, \bar{\xi})$ solves (1), and $f(z^*) = \phi(u^*) = f(\bar{z})$ shows that $z^*$ solves (2).

3.3 Algorithm for Solving the L_p-SVM Problem

Based on the equivalent reformulation LOSC-SVM (2) of L_p-SVM, we give the following algorithm for solving the L_p-SVM problem.

Step 1: Select the parameters $C$ ($C > 0$) and $p$ ($0 < p < 1$), given the training dataset $T = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^m$ and $y_i \in \{-1, 1\}$.
Step 2: Solve LOSC-SVM (2) by the constrained optimization method fmincon in Matlab, and obtain the solution $(\bar{w}, \bar{b}, \bar{\xi}, \bar{t})$.
Step 3: If $|\bar{w}_k| < \max_j(|\bar{w}_j|) \times 10^{-4}$, set $\bar{w}_k = 0$, $k = 1, \dots, m$.
Step 4: Construct the decision function $f(x) = \mathrm{sgn}(\bar{w}^T x + \bar{b})$.

In Step 2, LOSC-SVM (2) can be solved by smooth constrained optimization methods, such as interior point methods, successive linear programming algorithms, sequential quadratic programming algorithms, and so on.
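The algorithm above relies on a general-purpose smooth constrained solver (fmincon in the paper). As a rough, illustrative analogue (not the authors' implementation), the sketch below solves reformulation (2) with SciPy's SLSQP solver; since the problem is non-convex, SLSQP only returns a local solution, and the variable stacking, starting point, and iteration limit are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def losc_svm(X, y, C=1.0, p=0.5):
    """Solve the LOSC-SVM reformulation (2) with a smooth constrained solver.

    The paper uses Matlab's fmincon; this is an analogous SciPy (SLSQP) sketch
    written for illustration. Variables are stacked as z = (w, b, xi, t).
    """
    n, m = X.shape
    alpha = 1.0 / p                                   # alpha = p^{-1} > 1

    def unpack(z):
        return z[:m], z[m], z[m + 1:m + 1 + n], z[m + 1 + n:]

    def objective(z):
        w, b, xi, t = unpack(z)
        return np.sum(t) + C * np.sum(xi)             # linear objective of (2)

    def cons_t(z):                                    # t_j^alpha >= |w_j|
        w, b, xi, t = unpack(z)
        ta = np.maximum(t, 0.0) ** alpha              # guard tiny negative iterates
        return np.concatenate([ta - w, ta + w])

    def cons_margin(z):                               # y_i(w^T x_i + b) >= 1 - xi_i
        w, b, xi, t = unpack(z)
        return y * (X @ w + b) - 1.0 + xi

    z0 = np.concatenate([np.zeros(m), [0.0], np.ones(n), np.full(m, 0.1)])
    bounds = ([(None, None)] * (m + 1)                # w and b are free
              + [(0, None)] * n + [(0, None)] * m)    # xi >= 0, t >= 0
    res = minimize(objective, z0, method="SLSQP", bounds=bounds,
                   constraints=[{"type": "ineq", "fun": cons_t},
                                {"type": "ineq", "fun": cons_margin}],
                   options={"maxiter": 500})
    w, b, xi, t = unpack(res.x)
    w[np.abs(w) < np.max(np.abs(w)) * 1e-4] = 0.0     # Step 3: threshold small weights
    return w, b

# Decision function of Step 4: f(x) = sgn(w^T x + b).
def predict(X, w, b):
    return np.sign(X @ w + b)
```

A call such as `w, b = losc_svm(X_train, y_train, C=0.5, p=0.5)` followed by `predict(X_test, w, b)` mirrors Steps 1-4 of the algorithm above.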

4 Experiments

Numerical experiments are carried out to evaluate the proposed LOSC-SVM. We first analyze the effect of the parameter $p$ in LOSC-SVM on artificial datasets. Then, we compare the performance of LOSC-SVM with that of L_p-SVM [24], L_0-SVM, L_1-SVM, and L_2-SVM on eight UCI datasets. All the experiments are run on a personal computer (1.6 GHz CPU, 4 GB RAM) with Matlab R2010b on 64-bit Windows.

4.1 Artificial Dataset

We start with an artificial binary linear classification problem. Similar to [9], the sample sizes of the two classes ($y = 1$ or $-1$) are equal, and only the first six features are relevant. In 70% of the samples, the first three features $\{x_1, x_2, x_3\}$ are drawn as $x_i = y\,N(i, 1)$ and the next three features $\{x_4, x_5, x_6\}$ as $x_i = N(0, 1)$. In the other 30% of the samples, the first three are drawn as $x_i = N(0, 1)$ and the next three as $x_i = y\,N(i-3, 1)$. The remaining features are noise drawn from $N(0, 20)$.

To test the effect of the parameter $p$ on classification performance, we fix $C = 0.5$ and vary $p$ from 0.1 to 2. We then train LOSC-SVM on training datasets with $m$ features and $n$ samples, and estimate the performance on a testing dataset with 500 points. We repeat the training and testing 30 times and report the average results.
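For reference, the data-generation scheme described above can be written out as follows (an illustrative NumPy sketch of our own; the paper does not provide code, and $N(0, 20)$ is read here as a normal with standard deviation 20 since the text does not specify variance versus standard deviation):

```python
import numpy as np

def make_artificial_data(n, m, rng=None):
    """Toy linear problem from Subsection 4.1: only the first six features matter."""
    rng = np.random.default_rng(rng)
    y = np.repeat([1.0, -1.0], [n // 2, n - n // 2])      # equal class sizes
    X = rng.normal(0.0, 20.0, size=(n, m))                # noise features by default
    first = rng.permutation(n) < int(0.7 * n)             # 70% / 30% split of samples
    for j in range(3):                                    # features x1..x3 and x4..x6
        mean = j + 1
        X[first, j] = y[first] * rng.normal(mean, 1.0, size=first.sum())
        X[first, j + 3] = rng.normal(0.0, 1.0, size=first.sum())
        X[~first, j] = rng.normal(0.0, 1.0, size=(~first).sum())
        X[~first, j + 3] = y[~first] * rng.normal(mean, 1.0, size=(~first).sum())
    return X, y

X_train, y_train = make_artificial_data(n=100, m=30, rng=0)
X_test, y_test = make_artificial_data(n=500, m=30, rng=1)
```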

Test 1 ($m$ = 10, 30, 50, 100 and $n$ = 100). Cardinality and accuracy are used as the metrics for performance comparison. Cardinality is the number of nonzero elements in the weight vector $w$, that is, the number of selected features. Fig.1(a) illustrates how the cardinality of the classifier changes as $p$ increases: when $p > 1$, the classifier utilizes almost all features, whereas for $0.1 \le p \le 1$ LOSC-SVM always finds a sparse solution, and a smaller $p$ ($0.1 \le p \le 1$) encourages a sparser solution. Fig.1(b) illustrates how the accuracy changes as $p$ increases. When there is little noise in the dataset ($m = 10$), the accuracy of LOSC-SVM with different $p$ is almost identical (about 99%). When the noise increases ($m = 100$), the accuracy of LOSC-SVM with $p > 1$ decreases sharply, while the accuracy of LOSC-SVM with $0.1 \le p \le 1$ changes little.

Fig.1. Results on Test 1: (a) cardinality and (b) accuracy of the classifier versus $p$.

Table 1 lists the average results of Test 1, where RFR (relevant feature rate) is the percentage of relevant features among the selected features. The LOSC-SVMs with different $p$ ($0.1 \le p \le 1$) have similar average accuracy. When $p$ changes from 0.1 to 1, the cardinality increases by 12.37 and the relevant feature rate (RFR) decreases by 46.47%. This suggests that a larger $p$ selects more irrelevant features.

Table 1. Average Results of Test 1 ($m$ = 10, 30, 50, 100 and $n$ = 100); columns: $p$, Cardinality, Accuracy (%), RFR (%).

Test 2 ($m$ = 30 and $n$ = 10, 30, 50, 100). Fig.2 shows the results of Test 2, which still support the conclusions of Test 1. Furthermore, Fig.2(b) shows that fewer training samples lead to lower accuracy. The reason may be that too few training samples are prone to overfitting. When there are only 10 training samples, the classification accuracy drops considerably, especially when $p = 0.1$. This reveals that the sparsest solution is not always the best solution; that is, the smallest $p$ is not always the best choice. Therefore, the choice of $p$ is important: we cannot simply set $p = 0.1$, and $p$ should be determined by the data itself. Based on the above analysis, the real-data experiments in Subsection 4.2 focus on the LOSC-SVM with $0 < p < 1$, and the best $p$ is adaptively selected for each dataset.

Fig.2. Results on Test 2: (a) cardinality and (b) accuracy of the classifier versus $p$.

4.2 UCI Datasets

In this subsection, we compare the performance of the proposed LOSC-SVM (a reformulation of L_p-SVM) with that of L_2-SVM, L_1-SVM, L_0-SVM, and SLA-SVM [24] on eight UCI datasets [30]. The proposed LOSC-SVM is solved by the constrained optimization method fmincon in Matlab. L_p-SVM [24] is solved by the SLA algorithm, and we call it SLA-SVM in this paper. L_0-SVM is solved by the commonly cited method FSV [8]. The basic information of these datasets is described in Table 2, where $n$ and $m$ are the numbers of samples and features, respectively. PID is the abbreviation of Pima Indians Diabetes, and BCW is the abbreviation of Breast Cancer Wisconsin.

Table 2. Statistics of UCI Datasets Used in This Paper; columns: Dataset, Number of Samples ($n$), Number of Features ($m$), Class Distribution ($n_+/n_-$); datasets: 1 PID, 2 BCW, 3 SPECT, 4 WDBC, 5 Ionosphere, 6 SPECTF, 7 Sonar, 8 Musk1.

We first preprocess each dataset as follows: samples with missing values are deleted, and each feature is scaled to $[-1, 1]$. Then, the dataset is randomly split into three parts: two parts for training and one for testing. The parameters $C \in \{2^{-5}, 2^{-4}, \dots, 2^{5}\}$ and $p \in \{0.1, 0.2, \dots, 0.9\}$ are determined by 5-fold cross-validation. The weight $w_k$ is set to zero if it does not satisfy $|w_k| / \max_j(|w_j|) \ge 10^{-4}$ [11], and the cardinality is computed as the number of nonzero elements in $w$. We repeat the training and testing procedure 20 times and report the average results.

In the simulation experiments, the datasets are class-balanced, i.e., the sample sizes of the two classes are equal. However, many real datasets are class-imbalanced (e.g., 3 SPECT and 6 SPECTF). Therefore, in addition to accuracy (Acc), we apply three further scores to evaluate the classification performance: the true positive rate (TPR), the false positive rate (FPR), and the area under the ROC curve (AUC). TPR is the proportion of positive samples that are correctly labeled, FPR is the proportion of negative samples that are mislabeled, and AUC is the area under the curve obtained by plotting TPR against FPR for each confidence value. A larger AUC represents better classification performance. When no confidence values are supplied for the classification,

$$\mathrm{AUC} = \tfrac{1}{2}\,(1 - \mathrm{FPR} + \mathrm{TPR}).$$
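These scores are straightforward to compute from hard plus/minus-one predictions. The sketch below (our own illustration; the function name and the toy labels are not from the paper) computes Acc, TPR, FPR, and the AUC value defined above for the no-confidence case.

```python
import numpy as np

def classification_scores(y_true, y_pred):
    """Acc, TPR, FPR, and AUC = (1 - FPR + TPR)/2 for hard +/-1 predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == 1
    neg = y_true == -1
    acc = np.mean(y_pred == y_true)
    tpr = np.mean(y_pred[pos] == 1) if pos.any() else 0.0   # correctly labeled positives
    fpr = np.mean(y_pred[neg] == 1) if neg.any() else 0.0   # mislabeled negatives
    auc = 0.5 * (1.0 - fpr + tpr)   # AUC surrogate for hard (no-confidence) outputs
    return acc, tpr, fpr, auc

# Example on a small imbalanced label set.
y_true = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, -1])
y_pred = np.array([1, 1, -1, -1, -1, -1, 1, -1, -1, -1])
print("Acc, TPR, FPR, AUC =", classification_scores(y_true, y_pred))
```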

The results on the eight UCI datasets are listed in Table 3, where the best results among LOSC-SVM, SLA-SVM, L_0-SVM, L_1-SVM, and L_2-SVM are shown in bold. Table 3 shows that the proposed LOSC-SVM has good sparsity, while L_2-SVM utilizes all features. The accuracy of LOSC-SVM is as good as or better than that of L_2-SVM on seven datasets (datasets 1-7); only on dataset 8 does the sparse solution introduce a slight increase in error (4.4%). Compared with the other sparse SVMs, LOSC-SVM achieves better sparsity and accuracy than L_1-SVM on the first six datasets (datasets 1-6). LOSC-SVM selects more features than L_0-SVM on datasets 7 and 8; however, it achieves higher accuracy than L_0-SVM on all datasets. Furthermore, LOSC-SVM performs better than SLA-SVM in both feature selection and classification on seven datasets (datasets 1-5, 7, 8). Only on dataset 6 is the solution of LOSC-SVM not as sparse as that of SLA-SVM; however, the FPR of SLA-SVM is 100%, which indicates that all of the negative samples are wrongly classified, possibly due to the lack of negative samples to learn from. In contrast, the FPR of LOSC-SVM is 36.1% lower than that of SLA-SVM. This suggests that LOSC-SVM works better on class-imbalanced problems.

Table 4 shows the average results over the eight UCI datasets. LOSC-SVM, SLA-SVM, and L_0-SVM have similar average cardinality, which is much smaller than that of L_1-SVM. Among the five SVMs, LOSC-SVM achieves the best results in four scores: sparsity, accuracy, AUC, and FPR. With respect to the two L_p-SVMs ($0 < p < 1$), the average FPR of LOSC-SVM is 6.6% lower than that of SLA-SVM, which indicates that LOSC-SVM is more robust than SLA-SVM on class-imbalanced problems. These results support that the proposed LOSC-SVM is a promising tool for performing feature selection and classification simultaneously.

Table 3. Results on UCI Datasets; for each of the eight datasets and each method (LOSC-SVM, SLA-SVM, L_0-SVM, L_1-SVM, L_2-SVM), the columns report Cardinality, Acc (%), AUC, TPR, FPR, and the selected $C$ and $p$.


Table 4. Average Results over the Eight UCI Datasets (Sparsity = Cardinality/$m$); for each method (LOSC-SVM, SLA-SVM, L_0-SVM, L_1-SVM, L_2-SVM), the columns report Cardinality, Sparsity, Acc (%), AUC, TPR, and FPR.

5 Conclusions

In this paper, we studied a new adaptive L_p-SVM classification method. L_p-SVM allows a flexible penalty form chosen by the data; hence, the classifier is built based on the best $p$ for the specific application. Since solving L_p-SVM is an NP-hard problem, we proposed a new reformulation of L_p-SVM, named LOSC-SVM, which is easy to solve. Simulation experiments showed that in most situations a smaller $p$ ($0 < p \le 1$) leads to a sparser solution with almost no loss of accuracy. The results on real datasets showed that the proposed LOSC-SVM is better than L_1-SVM and L_2-SVM in both feature selection and classification. The results suggest that the adaptive L_p-SVM is an interesting approach for feature selection applications.

We will extend our research in several directions. For example, the choice of the parameters $(C, p)$ is important for improving the performance of L_p-SVM. The most commonly used approach to selecting $(C, p)$ is grid search coupled with cross-validation, which is computationally expensive. In the future, we will consider a theoretical analysis of $(C, p)$ and explore more efficient methods. Furthermore, we will also develop faster algorithms that can handle large-scale problems.

References

[1] Vapnik V N. The Nature of Statistical Learning Theory (2nd edition). Springer.
[2] Guyon I, Gunn S, Nikravesh M, Zadeh L A. Feature Extraction: Foundations and Applications (1st edition). Springer.
[3] Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics, 2007, 23(19).
[4] Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning, 2002, 46(1/2/3).
[5] Rakotomamonjy A. Variable selection using SVM-based criteria. The Journal of Machine Learning Research, 2003, 3.
[6] Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V. Feature selection for SVMs. In Advances in Neural Information Processing Systems 13, Leen T K, Dietterich T G, Tresp V (eds.), Massachusetts Institute of Technology, 2001.
[7] Peleg D, Meir R. A feature selection algorithm based on the global minimization of a generalization error bound. In Advances in Neural Information Processing Systems 17, Saul L K, Weiss Y, Bottou L (eds.), Massachusetts Institute of Technology, 2005.
[8] Bradley P S, Mangasarian O L. Feature selection via concave minimization and support vector machines. In Proc. the 15th International Conference on Machine Learning, July 1998.
[9] Weston J, Elisseeff A, Schölkopf B, Tipping M. Use of the zero norm with linear models and kernel methods. The Journal of Machine Learning Research, 2003, 3.
[10] Amaldi E, Kann V. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 1998, 209(1/2).
[11] Chan A B, Vasconcelos N, Lanckriet G R G. Direct convex relaxations of sparse SVM. In Proc. the 24th International Conference on Machine Learning, June 2007.
[12] Fung G M, Mangasarian O L. A feature selection Newton method for support vector machine classification.
Computational Optimization and Applications, 2004, 28(2).
[13] Bi J B, Bennett K, Embrechts M, Breneman C, Song M H. Dimensionality reduction via sparse support vector machines. The Journal of Machine Learning Research, 2003, 3.
[14] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 1996, 58(1).
[15] Neumann J, Schnörr C, Steidl G. Combined SVM-based feature selection and classification. Machine Learning, 2005, 61(1/2/3).
[16] Chartrand R. Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Processing Letters, 2007, 14(10).
[17] Chartrand R. Nonconvex regularization for shape preservation. In Proc. the IEEE International Conference on Image Processing, September 16-October 19, 2007.

[18] Xu Z B, Zhang H, Wang Y, Chang X Y, Liang Y. L_1/2 regularization. Science China Information Sciences, 2010, 53(6).
[19] Liu J L, Li J P, Xu W X, Shi Y. A weighted L_q adaptive least squares support vector machine classifiers: robust and sparse approximation. Expert Systems with Applications, 2011, 38(3).
[20] Chen W J, Tian Y J. L_p-norm proximal support vector machine and its applications. Procedia Computer Science, 2010, 1(1).
[21] Rakotomamonjy A, Flamary R, Gasso G, Canu S. l_p-l_q penalty for sparse linear and sparse multiple kernel multitask learning. IEEE Transactions on Neural Networks, 2011, 22(8).
[22] Liu Y F, Zhang H H, Park C, Ahn J. Support vector machines with adaptive L_q penalty. Computational Statistics and Data Analysis, 2007, 51(12).
[23] Liu Z Q, Lin S L, Tan M. Sparse support vector machines with L_p penalty for biomarker identification. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2010, 7(1).
[24] Tan J Y, Zhang Z Q, Zhen L, Zhang C H, Deng N Y. Adaptive feature selection via a new version of support vector machine. Neural Computing and Applications, 2013, 23(3/4).
[25] Tian Y J, Yu J, Chen W J. l_p-norm support vector machine with CCCP. In Proc. the 7th International Conference on Fuzzy Systems and Knowledge Discovery, August 2010.
[26] Liu J W, Liu Y. Non-integer norm regularization SVM via Legendre-Fenchel duality. Neurocomputing, 2014, 144.
[27] Chen X J, Xu F M, Ye Y Y. Lower bound theory of nonzero entries in solutions of l_2-l_p minimization. SIAM Journal on Scientific Computing, 2010, 32(5).
[28] Zhang C H, Shao Y H, Tan J Y, Deng N Y. Mixed-norm linear support vector machine. Neural Computing and Applications, 2013, 23(7).
[29] Li D H, Wu L, Sun Z, Zhang X J. A constrained optimization reformulation and a feasible descent direction method for L_1/2 regularization. Computational Optimization and Applications, 2014, 59(1/2).
[30] Newman D J, Hettich S, Blake C L, Merz C J. UCI repository of machine learning databases. Technical Report 9702, Department of Information and Computer Science, University of California, Irvine.

Feng Zeng received his B.Eng. and M.Eng. degrees in computer science from Hunan University, Changsha, in 2000 and 2005 respectively, and his Ph.D. degree in computer science in 2010 from Central South University, Changsha. He is now an associate professor in the School of Software, Central South University, Changsha. His current research interests include wireless networks, QoS routing, and data mining. He is a member of CCF.

Dong-Hui Li is a professor in the School of Mathematical Sciences, South China Normal University, Guangzhou. He got his Bachelor's and Master's degrees in applied mathematics from Hunan University, Changsha, in 1983 and 1986 respectively. He got his first Ph.D. degree in applied mathematics from Hunan University, Changsha, in 1994, and his second Ph.D. degree in applied mathematics from Kyoto University, Kyoto. His research interests include numerical methods in optimization and nonlinear equations with applications in supply chains and finance. He has published nearly 100 academic papers.

Zhi-Gang Chen is a professor in the School of Software, Central South University (CSU), Changsha. His research interests are in QoS mechanisms for IP networks, web services, and wireless networks. He is the director of the Network Computing and Distributed Processing Laboratory in the School of Software, CSU, Changsha.
He obtained his Ph.D. and M.S. degrees in computer science from CSU in 1998 and 1987 respectively. He is a senior member of CCF.

Lan Yao is an assistant professor in the College of Mathematics and Econometrics, Hunan University, Changsha. She got her B.S. degree in computer science, and her M.S. and Ph.D. degrees in applied mathematics, from Hunan University, Changsha, in 2000, 2006, and 2014 respectively. Her research interests include data mining, numerical methods in optimization, and network optimization.


More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

FEATURE SELECTION IN SVM VIA POLYHEDRAL K-NORM. Sparse optimization, Cardinality constraint, k-norm, Support Vector Machine, DC

FEATURE SELECTION IN SVM VIA POLYHEDRAL K-NORM. Sparse optimization, Cardinality constraint, k-norm, Support Vector Machine, DC FEATURE SELECTION IN SVM VIA POLYHEDRAL K-NORM M. GAUDIOSO, E. GORGONE, AND J.-B. HIRIART URRUTY Abstract. We treat the Feature Selection problem in the Support Vector Machine (SVM) framework by adopting

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

L5 Support Vector Classification

L5 Support Vector Classification L5 Support Vector Classification Support Vector Machine Problem definition Geometrical picture Optimization problem Optimization Problem Hard margin Convexity Dual problem Soft margin problem Alexander

More information

Learning by constraints and SVMs (2)

Learning by constraints and SVMs (2) Statistical Techniques in Robotics (16-831, F12) Lecture#14 (Wednesday ctober 17) Learning by constraints and SVMs (2) Lecturer: Drew Bagnell Scribe: Albert Wu 1 1 Support Vector Ranking Machine pening

More information

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning Kernel Machines Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 SVM linearly separable case n training points (x 1,, x n ) d features x j is a d-dimensional vector Primal problem:

More information

Hierarchical Penalization

Hierarchical Penalization Hierarchical Penalization Marie Szafransi 1, Yves Grandvalet 1, 2 and Pierre Morizet-Mahoudeaux 1 Heudiasyc 1, UMR CNRS 6599 Université de Technologie de Compiègne BP 20529, 60205 Compiègne Cedex, France

More information

Research Reports on Mathematical and Computing Sciences

Research Reports on Mathematical and Computing Sciences ISSN 1342-2804 Research Reports on Mathematical and Computing Sciences A Modified Algorithm for Nonconvex Support Vector Classification Akiko Takeda August 2007, B 443 Department of Mathematical and Computing

More information

Support Vector Machine Classification via Parameterless Robust Linear Programming

Support Vector Machine Classification via Parameterless Robust Linear Programming Support Vector Machine Classification via Parameterless Robust Linear Programming O. L. Mangasarian Abstract We show that the problem of minimizing the sum of arbitrary-norm real distances to misclassified

More information

FEATURE SELECTION COMBINED WITH RANDOM SUBSPACE ENSEMBLE FOR GENE EXPRESSION BASED DIAGNOSIS OF MALIGNANCIES

FEATURE SELECTION COMBINED WITH RANDOM SUBSPACE ENSEMBLE FOR GENE EXPRESSION BASED DIAGNOSIS OF MALIGNANCIES FEATURE SELECTION COMBINED WITH RANDOM SUBSPACE ENSEMBLE FOR GENE EXPRESSION BASED DIAGNOSIS OF MALIGNANCIES Alberto Bertoni, 1 Raffaella Folgieri, 1 Giorgio Valentini, 1 1 DSI, Dipartimento di Scienze

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

PROGRAMMING. Yufeng Liu and Yichao Wu. University of North Carolina at Chapel Hill

PROGRAMMING. Yufeng Liu and Yichao Wu. University of North Carolina at Chapel Hill Statistica Sinica 16(26), 441-457 OPTIMIZING ψ-learning VIA MIXED INTEGER PROGRAMMING Yufeng Liu and Yichao Wu University of North Carolina at Chapel Hill Abstract: As a new margin-based classifier, ψ-learning

More information

IE598 Big Data Optimization Introduction

IE598 Big Data Optimization Introduction IE598 Big Data Optimization Introduction Instructor: Niao He Jan 17, 2018 1 A little about me Assistant Professor, ISE & CSL UIUC, 2016 Ph.D. in Operations Research, M.S. in Computational Sci. & Eng. Georgia

More information

Sparse Additive machine

Sparse Additive machine Sparse Additive machine Tuo Zhao Han Liu Department of Biostatistics and Computer Science, Johns Hopkins University Abstract We develop a high dimensional nonparametric classification method named sparse

More information

Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs

Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs Ammon Washburn University of Arizona September 25, 2015 1 / 28 Introduction We will begin with basic Support Vector Machines (SVMs)

More information

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines www.cs.wisc.edu/~dpage 1 Goals for the lecture you should understand the following concepts the margin slack variables the linear support vector machine nonlinear SVMs the kernel

More information

A Second order Cone Programming Formulation for Classifying Missing Data

A Second order Cone Programming Formulation for Classifying Missing Data A Second order Cone Programming Formulation for Classifying Missing Data Chiranjib Bhattacharyya Department of Computer Science and Automation Indian Institute of Science Bangalore, 560 012, India chiru@csa.iisc.ernet.in

More information

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses

More information

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp.

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp. On different ensembles of kernel machines Michiko Yamana, Hiroyuki Nakahara, Massimiliano Pontil, and Shun-ichi Amari Λ Abstract. We study some ensembles of kernel machines. Each machine is first trained

More information

Support Vector Machine

Support Vector Machine Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

More information

Pattern Recognition 2018 Support Vector Machines

Pattern Recognition 2018 Support Vector Machines Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht

More information