Sparse Support Vector Machine with L_p Penalty for Feature Selection


Yao L, Zeng F, Li DH et al. Sparse support vector machine with L_p penalty for feature selection. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 32(1), Jan. 2017.

Sparse Support Vector Machine with L_p Penalty for Feature Selection

Lan Yao 1, Feng Zeng 2,*, Member, CCF, Dong-Hui Li 3, and Zhi-Gang Chen 2, Senior Member, CCF

1 College of Mathematics and Econometrics, Hunan University, Changsha, China
2 School of Software, Central South University, Changsha, China
3 School of Mathematical Sciences, South China Normal University, Guangzhou, China

E-mail: yao@hnu.edu.cn; fengzeng@csu.edu.cn; dhli@scnu.edu.cn; czg@csu.edu.cn

Received February 28, 2016; revised September 7.

Abstract   We study strategies for feature selection with the sparse support vector machine (SVM). Recently, the so-called L_p-SVM ($0 < p < 1$) has attracted much attention because it can encourage better sparsity than the widely used L_1-SVM. However, L_p-SVM is a non-convex and non-Lipschitz optimization problem, and solving it numerically is challenging. In this paper, we reformulate the L_p-SVM into an optimization model with a linear objective function and smooth constraints (LOSC-SVM) so that it can be solved by numerical methods for smooth constrained optimization. Our numerical experiments on artificial datasets show that LOSC-SVM ($0 < p < 1$) can improve both feature selection and classification performance by choosing a suitable parameter $p$. We also apply it to some real-life datasets, and the experimental results show that it is superior to L_1-SVM.

Keywords: machine learning, feature selection, support vector machine, L_p regularization

1 Introduction

The support vector machine (SVM) [1] is an optimal-margin classifier and has been a popular tool in both the machine learning and statistics communities. Although the SVM hyperplane relies only on a small subset of the training points, the resulting classifier always utilizes all features. When there are many noisy or redundant features, this can cause overfitting, reduce generalization ability and interpretability, and increase the computational cost. Consequently, feature selection is very important in classification.

Filter, wrapper, and embedded methods are popular feature selection strategies for SVM [2]. The major difference among these three methods is their relationship with the classifier. Filters act as a preprocessing step before classifier training: they select important features based on statistical properties, such as Pearson correlation coefficients and other classical test statistics, and this procedure is independent of classifier learning. Wrappers evaluate subsets of features according to their classification performance; they use the learning machine as a black box, and cross-validation is a common choice for evaluating the performance. Wrapper methods are usually more accurate than filter methods, but they are more computationally expensive. Embedded methods perform feature selection and classifier training simultaneously and take the classifier structure into account; they are less computationally intensive than wrapper methods [3].

Many embedded methods for SVM have been developed. Guyon et al. [4] and Rakotomamonjy [5] applied a recursive feature elimination (RFE) strategy to obtain a relevant feature subset, training a series of SVMs and removing the feature with the smallest SVM-based ranking criterion at each iteration. Weston et al. [6] and Peleg and Meir [7] used scaling factors to indicate the importance of features and iteratively optimized these scaling factors by minimizing a generalization error bound of SVM.

Regular Paper. This work is supported in part by the National Natural Science Foundation of China and by the Research Foundation of Central South University of China under Grant No. 2014JSJJ019.
*Corresponding Author
©2017 Springer Science + Business Media, LLC & Science Press, China

Besides selecting a subset of features, another category of embedded methods formulates the optimization problem so as to obtain a sparse solution, by adding a sparsity term to the objective function or by adding a cardinality constraint. It has been proved that L_1-SVM can yield sparse solutions. The L_1 norm in L_1-SVM plays a key role in feature selection: it encourages each coefficient to be either large or exactly zero, so that irrelevant features are automatically removed from the model. Another alternative is to minimize the L_0 quasi-norm. Since this is a combinatorial optimization problem and NP-hard, several continuous, differentiable, and concave approximations have been proposed [8-9]. Recently, the L_p quasi-norm ($0 < p < 1$) penalty has attracted great attention since it can encourage better sparsity than the L_1 norm, and several adaptive L_p-SVMs have been proposed to perform automatic feature selection.

In this paper, we focus on the L_p-SVM ($0 < p < 1$). Taking into account that the problem is non-convex, non-Lipschitzian, and NP-hard, we reformulate it as a smooth optimization model with a linear objective function and smooth constraints (LOSC). The resulting LOSC-SVM can be solved with standard optimization tools in Matlab. Theoretically, we establish the equivalence between the LOSC-SVM and the L_p-SVM. We also carry out numerical experiments to test the proposed LOSC-SVM model and analyze the influence of the parameter $p$ on classifier performance. Observing that a certain penalty may best suit a certain data structure, we treat $p$ as a tuning parameter instead of a fixed one, and the best parameter $p$ is selected for each test problem. Our numerical experiments show that the choice of $p$ is indeed an important factor for encouraging sparsity and improving the accuracy of the classifier: the LOSC-SVM with adaptive $p$ works better than any fixed $p$ in various situations.

The rest of this paper is organized as follows. In Section 2, we briefly review the standard SVM (L_2-SVM) and the sparse regularization SVMs. Section 3 describes the L_p-SVM model and its smooth constrained optimization reformulation. In Section 4, we report numerical experiments on the proposed reformulation LOSC-SVM. Section 5 gives the concluding remarks.

2 Sparse Support Vector Machine

Given a training dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^m$ is the feature vector and $y_i \in \{-1, 1\}$ is the class label, SVM seeks, for the binary classification problem, a separating hyperplane

$$w^T x + b = 0$$

which maximizes the margin and minimizes the training errors $\sum_{i=1}^{n} \xi_i$. The general model of L_p-SVM can then be written as

$$\min_{w,b,\xi} \; \|w\|_p^p + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, n,$$

where $C$ is a trade-off parameter and $p$ is a nonnegative scalar. When $p = 0$, $\|w\|_0$ stands for the cardinality of the support set $\{j : |w_j| > 0\}$, and for $p \in (0, 2]$, $\|w\|_p = (\sum_{j=1}^{m} |w_j|^p)^{1/p}$. The case $p = 2$ corresponds to the standard C-SVM (L_2-SVM) [1]. It is a convex quadratic program and can be solved easily; however, the decision hyperplane learned by L_2-SVM often utilizes all features. For feature selection purposes, $\|w\|_p$ with $0 < p < 1$ is generally used as a sparsity penalty to shrink the feature space, and feature selection is then an indirect consequence of SVM training. In what follows, we give some details on the L_p-SVMs with $p = 0$, $p = 1$, and $p \in (0, 1)$, due to their particular roles in sparse SVMs.
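To see concretely why an L_p penalty with $p < 1$ favors sparse weight vectors, the following short Python sketch (an illustration added here; the example vectors and values of $p$ are arbitrary and are not taken from the experiments) evaluates $\|w\|_p^p = \sum_j |w_j|^p$ for a sparse and a dense vector with the same L_1 norm.

```python
import numpy as np

def lp_penalty(w, p):
    """Return ||w||_p^p = sum_j |w_j|^p, the penalty used in L_p-SVM."""
    return np.sum(np.abs(w) ** p)

# Two weight vectors with the same L1 norm: one sparse, one dense.
w_sparse = np.array([2.0, 0.0, 0.0, 0.0])   # mass concentrated on one feature
w_dense  = np.array([0.5, 0.5, 0.5, 0.5])   # mass spread over all features

for p in (0.1, 0.5, 1.0, 2.0):
    print(f"p = {p:3.1f}:  sparse -> {lp_penalty(w_sparse, p):6.3f},  "
          f"dense -> {lp_penalty(w_dense, p):6.3f}")
# For p = 1 both penalties equal 2.0; for p < 1 the sparse vector is
# strictly cheaper, so the penalty pushes solutions toward sparsity.
```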
2.1 L_0-SVM

L_0-SVM is expected to find the sparsest classifier by minimizing $\|w\|_0$, the number of nonzero elements of $w$. However, this is a discrete and NP-hard problem [10], and it is in general very difficult to develop efficient numerical methods for it. A widely used technique is to approximate L_0-SVM by a smooth problem. Bradley and Mangasarian [8] approximated $\|w\|_0$ with a concave function as

$$\|w\|_0 \approx \sum_{j=1}^{m} \left(1 - e^{-\alpha |w_j|}\right).$$

Here, the parameter $\alpha$ controls the closeness of the approximation to $\|w\|_0$, and its value is suggested to be 5 in this paper. The resulting problem is known as feature selection concave minimization (FSV). A successive linear approximation (SLA) algorithm was suggested to solve it, which involves a sequence of linear problems of the form

$$\min \; \sum_{j=1}^{m} \alpha e^{-\alpha \bar{v}_j} (v_j - \bar{v}_j) \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \quad -v \le w \le v, \quad i = 1, \dots, n,$$

where $v_j$ ($j = 1, \dots, m$) denotes the $j$-th component of the vector $v$. Here, $v$ is introduced to eliminate the absolute value of $w$, and $\bar{v}$ is the value of $v$ obtained at the previous iteration.
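As a quick numerical check of this surrogate (a standalone sketch of our own, not code from [8] or from this paper; the test vector is arbitrary), the snippet below evaluates the exponential approximation for a few values of $\alpha$ and compares it with the exact zero norm.

```python
import numpy as np

def l0_norm(w):
    """Exact zero norm: number of nonzero components."""
    return int(np.count_nonzero(w))

def concave_l0_approx(w, alpha):
    """Bradley-Mangasarian surrogate: sum_j (1 - exp(-alpha * |w_j|))."""
    return float(np.sum(1.0 - np.exp(-alpha * np.abs(w))))

w = np.array([1.2, 0.0, -0.7, 0.0, 0.05])   # true zero norm is 3
print("exact ||w||_0 =", l0_norm(w))
for alpha in (1.0, 5.0, 20.0):
    print(f"alpha = {alpha:4.1f}:  surrogate = {concave_l0_approx(w, alpha):.3f}")
# Larger alpha makes the surrogate approach the exact count, at the price of a
# sharper (harder to optimize) objective; the paper follows [8] in using alpha = 5.
```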

Weston et al. [9] proposed another approximation for zero-norm minimization (AROM), in which the zero norm is approximated as

$$\|w\|_0 \approx \sum_{j=1}^{m} \log(\epsilon + |w_j|).$$

Besides approximation techniques, some other authors have explored convex relaxations of the L_0 norm. For example, Chan et al. [11] added the constraint $\|w\|_0 \le r$ to the standard SVM and proposed two direct convex relaxations of it, namely QCQP-SVM and SDP-SVM.

2.2 L_1-SVM

Bradley and Mangasarian [8] first proposed L_1-SVM for classification and noted the sparsity-inducing ability of L_1-SVM. They reformulated L_1-SVM as the following linear programming problem:

$$\min \; \sum_{j=1}^{m} u_j + \sum_{j=1}^{m} v_j + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i\big((u - v)^T x_i + b\big) \ge 1 - \xi_i, \quad u_j \ge 0, \; v_j \ge 0, \; \xi_i \ge 0, \; i = 1, \dots, n,$$

where $w = u - v$ ($u \ge 0$, $v \ge 0$), $u_j = (w_j)_+$ and $v_j = (-w_j)_+$. This problem can be solved easily by existing linear programming solvers. L_1-SVM has also been widely applied in computational biology [12] and drug design [13]. In the context of linear regression, the L_1 norm penalty is well known as the LASSO [14]. Instead of replacing the L_2 norm with a sparsity term, Neumann et al. [15] introduced an additional sparsity penalty into the standard SVM and proposed two modified SVMs: L_2-L_1-SVM and L_2-L_0-SVM.
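Because the reformulation above is a plain linear program, it can be handed directly to a generic LP solver. The following Python sketch (our own illustration using scipy.optimize.linprog and toy data; it is not the authors' code) stacks the variables as (u, v, b, xi) and recovers w = u - v.

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm_lp(X, y, C=1.0):
    """Solve the L1-SVM linear program of Bradley and Mangasarian.

    Variables are stacked as z = (u, v, b, xi) with w = u - v.
    """
    n, m = X.shape
    # Objective: sum(u) + sum(v) + C * sum(xi); b has zero cost.
    c = np.concatenate([np.ones(m), np.ones(m), [0.0], C * np.ones(n)])
    # Margin constraints y_i((u - v)^T x_i + b) >= 1 - xi_i, written as A_ub z <= b_ub.
    Yx = y[:, None] * X                       # rows are y_i * x_i^T
    A_ub = np.hstack([-Yx, Yx, -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(0, None)] * m + [(0, None)] * m + [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    u, v = res.x[:m], res.x[m:2 * m]
    b = res.x[2 * m]
    return u - v, b

# Toy usage: two informative features, two pure-noise features.
rng = np.random.default_rng(0)
y = np.repeat([1.0, -1.0], 50)
X = np.c_[y + 0.3 * rng.standard_normal(100),
          y + 0.3 * rng.standard_normal(100),
          rng.standard_normal((100, 2))]
w, b = l1_svm_lp(X, y, C=1.0)
print("weights:", np.round(w, 3))   # noise features are driven to (near) zero
```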
2.3 L_p-SVM (0 < p < 1)

Recently, extensive computational studies [16-18] have shown that the L_p problem with $0 < p < 1$ can find sparser solutions than the L_1 problem. On the other hand, in practice, an SVM with a fixed norm, such as L_2-SVM, L_1-SVM, or L_0-SVM, has advantages over the others only in certain situations, because different norms may work well for different data structures. Therefore, applying adaptive L_p regularization ($0 < p < 1$) to perform feature selection has become a welcome strategy. At the same time, L_p regularization has been introduced into several variants of SVM, such as the least squares SVM [19], the proximal SVM [20], and the multi-task SVM [21].

L_p-SVM is a non-convex and non-Lipschitz problem. Due to the term $|w_i|^p$, the objective function is not even directionally differentiable at a point with some $w_i = 0$, which makes the problem very difficult to solve, since most existing optimization algorithms are only efficient for smooth and convex problems. To solve this special non-smooth and non-Lipschitz L_p-SVM problem, several approximation algorithms have been proposed [22-26].

Liu et al. [22] proposed an L_p-SVM model for the multiclass classification problem and developed an iterative local quadratic algorithm (LQA) to solve it, in which the L_p regularization is approximated by

$$|w|^p \approx |w_0|^p + \frac{(|w_0|^p)'}{2|w_0|}\,(w^2 - w_0^2),$$

where $w_0$ is nonzero and close to $w$. Liu et al. [23] also proposed another smoothing model in which the non-smooth term $\sum_{j=1}^{m} |w_j|^p$ is approximated by

$$\sum_{j=1}^{m} |w_j|^p \approx \sum_{j=1}^{m} \big(w_j^2 + \gamma\big)^{p/2},$$

where $\gamma$ is set to a small value. With smooth approximations of the L_p norm and of the hinge loss function, the objective function becomes differentiable and any gradient-based algorithm for unconstrained problems can be used. This approach seems easy to implement, but there is as yet no principled way to set the smoothing parameter $\gamma$. Based on the idea in [27], Tian et al. [25] and Chen and Tian [20] applied the following smooth function to $|w|$ in L_p-SVM and L_p-PSVM, respectively:

$$s_\mu(t) = \begin{cases} |t|, & |t| > \mu, \\ \dfrac{t^2}{2\mu} + \dfrac{\mu}{2}, & |t| \le \mu, \end{cases}$$

where $\mu > 0$. Similar to the approach of [8], Tan et al. [24] introduced the variable $v$ to eliminate the absolute value of $w$, leading to the following equivalent problem:

$$\min \; \sum_{j=1}^{m} v_j^p + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n, \quad -v \le w \le v,$$

where $v_j$ ($j = 1, \dots, m$) denotes the $j$-th component of the vector $v$. The resulting problem is differentiable for all positive $v$, but non-convex. Its solution can be found by a successive linear approximation (SLA) algorithm. Zhang et al. [28] adopted the same reformulation in their

L_2-L_p SVM model and proposed a constrained concave-convex procedure (CCCP) to solve it. Liu and Liu [26] used a Legendre-Fenchel duality framework to solve L_p-SVM. Different from the above methods, in Section 3 we derive a novel equivalent reformulation of L_p-SVM. The resulting smooth constrained optimization model with a linear objective function is relatively easy to deal with.

3 LOSC Reformulation of L_p-SVM

In this section, we derive a smooth constrained optimization reformulation of L_p-SVM so that it becomes relatively easy to handle with optimization solvers.

3.1 L_p-SVM Model

First, let us recall the model of L_p-SVM:

$$\min \; \phi(w, b, \xi) = \sum_{j=1}^{m} |w_j|^p + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, n. \quad (1)$$

3.2 Reformulation of the L_p-SVM Model

By introducing an auxiliary variable $t = (t_1, t_2, \dots, t_m)^T$, L_p-SVM can be reformulated as the following constrained problem:

$$\min \; f(w, b, \xi, t) = \sum_{j=1}^{m} t_j + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad t_j^{\alpha} - w_j \ge 0, \; t_j^{\alpha} + w_j \ge 0, \; t_j \ge 0, \; j = 1, \dots, m, \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, n. \quad (2)$$

It is obtained by setting $t_j = |w_j|^p$ and $\alpha = p^{-1} > 1$. Problem (2) is a minimization problem with a linear objective function and smooth constraints, which we call LOSC-SVM. In a way similar to [29], we can derive some nice properties of the reformulation; in particular, the KKT points of the reformulation are the same as the KKT points of L_p-SVM. Denote the feasible region of the problem by $F$, i.e.,

$$F = \{(w, b, t, \xi) : t_j^{\alpha} - w_j \ge 0, \; t_j^{\alpha} + w_j \ge 0, \; t_j \ge 0, \; j = 1, \dots, m\} \cap \{(w, b, t, \xi) : y_i(w^T x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, n\}.$$

We conclude this subsection by showing the equivalence between LOSC-SVM (2) and L_p-SVM (1).

Theorem 1. If $u^* = (w^*, b^*, \xi^*) \in \mathbb{R}^{m+1+n}$ is a solution of L_p-SVM (1), then $z^* = (w^*, b^*, \xi^*, |w^*|^p) \in \mathbb{R}^{m+1+n+m}$ is a solution of LOSC-SVM (2). Conversely, if $(\bar{w}, \bar{b}, \bar{\xi}, \bar{t})$ is a solution of LOSC-SVM (2), then $(\bar{w}, \bar{b}, \bar{\xi})$ is a solution of L_p-SVM (1).

Proof. Let $u^* = (w^*, b^*, \xi^*)$ be a solution of L_p-SVM (1) and $\bar{z} = (\bar{w}, \bar{b}, \bar{\xi}, \bar{t})$ be a solution of LOSC-SVM (2). It is clear that $z^* = (w^*, b^*, \xi^*, |w^*|^p) \in F$. Moreover, we have $\bar{t}_j^{\alpha} \ge |\bar{w}_j|$, i.e., $\bar{t}_j \ge |\bar{w}_j|^p$, $j = 1, 2, \dots, m$, and hence

$$f(\bar{z}) = \sum_{j=1}^{m} \bar{t}_j + C \sum_{i=1}^{n} \bar{\xi}_i \ge \sum_{j=1}^{m} |\bar{w}_j|^p + C \sum_{i=1}^{n} \bar{\xi}_i \ge \sum_{j=1}^{m} |w_j^*|^p + C \sum_{i=1}^{n} \xi_i^* = \phi(u^*). \quad (3)$$

Since $z^* \in F$, we have $\phi(u^*) = f(z^*) \ge f(\bar{z})$. This together with (3) implies $f(\bar{z}) = \phi(u^*)$. Consequently, $\phi(\bar{w}, \bar{b}, \bar{\xi}) \le f(\bar{z}) = \phi(u^*)$, while the feasibility of $(\bar{w}, \bar{b}, \bar{\xi})$ for (1) gives $\phi(\bar{w}, \bar{b}, \bar{\xi}) \ge \phi(u^*)$; hence $(\bar{w}, \bar{b}, \bar{\xi})$ solves (1), and $f(z^*) = \phi(u^*) = f(\bar{z})$ shows that $z^*$ solves (2).

3.3 Algorithm for Solving the L_p-SVM Problem

Based on the equivalent reformulation LOSC-SVM (2) of L_p-SVM, we give the following algorithm for solving the L_p-SVM problem.

Step 1: Select the parameters $C$ ($C > 0$) and $p$ ($0 < p < 1$), given the training dataset $T = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^m$ and $y_i \in \{-1, 1\}$.
Step 2: Solve LOSC-SVM (2) by the constrained optimization method fmincon in Matlab, and obtain the solution $(\bar{w}, \bar{b}, \bar{\xi}, \bar{t})$.
Step 3: If $|\bar{w}_k| < \max_j(|\bar{w}_j|) \times 10^{-4}$, set $\bar{w}_k = 0$, $k = 1, \dots, m$.
Step 4: Construct the decision function $f(x) = \mathrm{sgn}(\bar{w}^T x + \bar{b})$.

In Step 2, LOSC-SVM (2) can be solved by smooth constrained optimization methods, such as interior point methods, successive linear programming algorithms, sequential quadratic programming algorithms, and so on.
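The algorithm above relies on a general-purpose smooth constrained solver (fmincon in the paper). As a rough, illustrative analogue (not the authors' implementation), the sketch below solves reformulation (2) with SciPy's SLSQP solver; since the problem is non-convex, SLSQP only returns a local solution, and the variable stacking, starting point, and iteration limit are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def losc_svm(X, y, C=1.0, p=0.5):
    """Solve the LOSC-SVM reformulation (2) with a smooth constrained solver.

    The paper uses Matlab's fmincon; this is an analogous SciPy (SLSQP) sketch
    written for illustration. Variables are stacked as z = (w, b, xi, t).
    """
    n, m = X.shape
    alpha = 1.0 / p                                   # alpha = p^{-1} > 1

    def unpack(z):
        return z[:m], z[m], z[m + 1:m + 1 + n], z[m + 1 + n:]

    def objective(z):
        w, b, xi, t = unpack(z)
        return np.sum(t) + C * np.sum(xi)             # linear objective of (2)

    def cons_t(z):                                    # t_j^alpha >= |w_j|
        w, b, xi, t = unpack(z)
        ta = np.maximum(t, 0.0) ** alpha              # guard tiny negative iterates
        return np.concatenate([ta - w, ta + w])

    def cons_margin(z):                               # y_i(w^T x_i + b) >= 1 - xi_i
        w, b, xi, t = unpack(z)
        return y * (X @ w + b) - 1.0 + xi

    z0 = np.concatenate([np.zeros(m), [0.0], np.ones(n), np.full(m, 0.1)])
    bounds = ([(None, None)] * (m + 1)                # w and b are free
              + [(0, None)] * n + [(0, None)] * m)    # xi >= 0, t >= 0
    res = minimize(objective, z0, method="SLSQP", bounds=bounds,
                   constraints=[{"type": "ineq", "fun": cons_t},
                                {"type": "ineq", "fun": cons_margin}],
                   options={"maxiter": 500})
    w, b, xi, t = unpack(res.x)
    w[np.abs(w) < np.max(np.abs(w)) * 1e-4] = 0.0     # Step 3: threshold small weights
    return w, b

# Decision function of Step 4: f(x) = sgn(w^T x + b).
def predict(X, w, b):
    return np.sign(X @ w + b)
```

A call such as `w, b = losc_svm(X_train, y_train, C=0.5, p=0.5)` followed by `predict(X_test, w, b)` mirrors Steps 1-4 of the algorithm above.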

4 Experiments

Numerical experiments are carried out to evaluate the proposed LOSC-SVM. We first analyze the effect of the parameter $p$ in LOSC-SVM on artificial datasets. Then, we compare the performance of LOSC-SVM with that of L_p-SVM [24], L_0-SVM, L_1-SVM, and L_2-SVM on eight UCI datasets. All the experiments are run on a personal computer (1.6 GHz CPU, 4 GB RAM) with Matlab R2010b on 64-bit Windows.

4.1 Artificial Dataset

We start with an artificial binary linear classification problem. Similar to [9], the sample sizes of the two classes ($y = 1$ or $-1$) are equal, and only the first six features are relevant. In 70% of the samples, the first three features $\{x_1, x_2, x_3\}$ are drawn as $x_i = y\,N(i, 1)$ and the next three features $\{x_4, x_5, x_6\}$ as $x_i = N(0, 1)$. In the other 30% of the samples, the first three are drawn as $x_i = N(0, 1)$ and the next three as $x_i = y\,N(i-3, 1)$. The remaining features are noise drawn from $N(0, 20)$.

To test the effect of the parameter $p$ on classification performance, we fix $C = 0.5$ and vary $p$ from 0.1 to 2. We then train LOSC-SVM on training datasets with $m$ features and $n$ samples, and estimate the performance on a testing dataset with 500 points. We repeat the training and testing 30 times and report the average results.
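For reference, the data-generation scheme described above can be written out as follows (an illustrative NumPy sketch of our own; the paper does not provide code, and $N(0, 20)$ is read here as a normal with standard deviation 20 since the text does not specify variance versus standard deviation):

```python
import numpy as np

def make_artificial_data(n, m, rng=None):
    """Toy linear problem from Subsection 4.1: only the first six features matter."""
    rng = np.random.default_rng(rng)
    y = np.repeat([1.0, -1.0], [n // 2, n - n // 2])      # equal class sizes
    X = rng.normal(0.0, 20.0, size=(n, m))                # noise features by default
    first = rng.permutation(n) < int(0.7 * n)             # 70% / 30% split of samples
    for j in range(3):                                    # features x1..x3 and x4..x6
        mean = j + 1
        X[first, j] = y[first] * rng.normal(mean, 1.0, size=first.sum())
        X[first, j + 3] = rng.normal(0.0, 1.0, size=first.sum())
        X[~first, j] = rng.normal(0.0, 1.0, size=(~first).sum())
        X[~first, j + 3] = y[~first] * rng.normal(mean, 1.0, size=(~first).sum())
    return X, y

X_train, y_train = make_artificial_data(n=100, m=30, rng=0)
X_test, y_test = make_artificial_data(n=500, m=30, rng=1)
```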

Test 1 ($m$ = 10, 30, 50, 100 and $n$ = 100). Cardinality and accuracy are used as the metrics for performance comparison. Cardinality is the number of nonzero elements in the weight vector $w$, that is, the number of selected features. Fig.1(a) illustrates how the cardinality of the classifier changes as $p$ increases: when $p > 1$, the classifier utilizes almost all features, whereas for $0.1 \le p \le 1$ LOSC-SVM always finds a sparse solution, and a smaller $p$ ($0.1 \le p \le 1$) encourages a sparser solution. Fig.1(b) illustrates how the accuracy changes as $p$ increases. When there is little noise in the dataset ($m = 10$), the accuracy of LOSC-SVM with different $p$ is almost identical (about 99%). When the noise increases ($m = 100$), the accuracy of LOSC-SVM with $p > 1$ decreases sharply, while the accuracy of LOSC-SVM with $0.1 \le p \le 1$ changes little.

Fig.1. Results on Test 1: (a) cardinality and (b) accuracy of the classifier versus $p$.

Table 1 lists the average results of Test 1, where RFR (relevant feature rate) is the percentage of relevant features among the selected features. The LOSC-SVMs with different $p$ ($0.1 \le p \le 1$) have similar average accuracy. When $p$ changes from 0.1 to 1, the cardinality increases by 12.37 and the relevant feature rate (RFR) decreases by 46.47%. This suggests that a larger $p$ selects more irrelevant features.

Table 1. Average Results of Test 1 ($m$ = 10, 30, 50, 100 and $n$ = 100); columns: $p$, Cardinality, Accuracy (%), RFR (%).

Test 2 ($m$ = 30 and $n$ = 10, 30, 50, 100). Fig.2 shows the results of Test 2, which still support the conclusions of Test 1. Furthermore, Fig.2(b) shows that fewer training samples lead to lower accuracy. The reason may be that too few training samples are prone to overfitting. When there are only 10 training samples, the classification accuracy drops considerably, especially when $p = 0.1$. This reveals that the sparsest solution is not always the best solution; that is, the smallest $p$ is not always the best choice. Therefore, the choice of $p$ is important: we cannot simply set $p = 0.1$, and $p$ should be determined by the data itself. Based on the above analysis, the real-data experiments in Subsection 4.2 focus on the LOSC-SVM with $0 < p < 1$, and the best $p$ is adaptively selected for each dataset.

Fig.2. Results on Test 2: (a) cardinality and (b) accuracy of the classifier versus $p$.

4.2 UCI Datasets

In this subsection, we compare the performance of the proposed LOSC-SVM (a reformulation of L_p-SVM) with that of L_2-SVM, L_1-SVM, L_0-SVM, and SLA-SVM [24] on eight UCI datasets [30]. The proposed LOSC-SVM is solved by the constrained optimization method fmincon in Matlab. L_p-SVM [24] is solved by the SLA algorithm, and we call it SLA-SVM in this paper. L_0-SVM is solved by the commonly cited method FSV [8]. The basic information of these datasets is described in Table 2, where $n$ and $m$ are the numbers of samples and features, respectively. PID is the abbreviation of Pima Indians Diabetes, and BCW is the abbreviation of Breast Cancer Wisconsin.

Table 2. Statistics of UCI Datasets Used in This Paper; columns: Dataset, Number of Samples ($n$), Number of Features ($m$), Class Distribution ($n_+/n_-$); datasets: 1 PID, 2 BCW, 3 SPECT, 4 WDBC, 5 Ionosphere, 6 SPECTF, 7 Sonar, 8 Musk1.

We first preprocess each dataset as follows: samples with missing values are deleted, and each feature is scaled to $[-1, 1]$. Then, the dataset is randomly split into three parts: two parts for training and one for testing. The parameters $C \in \{2^{-5}, 2^{-4}, \dots, 2^{5}\}$ and $p \in \{0.1, 0.2, \dots, 0.9\}$ are determined by 5-fold cross-validation. The weight $w_k$ is set to zero if it does not satisfy $|w_k| / \max_j(|w_j|) \ge 10^{-4}$ [11], and the cardinality is computed as the number of nonzero elements in $w$. We repeat the training and testing procedure 20 times and report the average results.

In the simulation experiments, the datasets are class-balanced, i.e., the sample sizes of the two classes are equal. However, many real datasets are class-imbalanced (e.g., 3 SPECT and 6 SPECTF). Therefore, in addition to accuracy (Acc), we apply three further scores to evaluate the classification performance: the true positive rate (TPR), the false positive rate (FPR), and the area under the ROC curve (AUC). TPR is the proportion of positive samples that are correctly labeled, FPR is the proportion of negative samples that are mislabeled, and AUC is the area under the curve obtained by plotting TPR against FPR for each confidence value. A larger AUC represents better classification performance. When no confidence values are supplied for the classification,

$$\mathrm{AUC} = \tfrac{1}{2}\,(1 - \mathrm{FPR} + \mathrm{TPR}).$$
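These scores are straightforward to compute from hard plus/minus-one predictions. The sketch below (our own illustration; the function name and the toy labels are not from the paper) computes Acc, TPR, FPR, and the AUC value defined above for the no-confidence case.

```python
import numpy as np

def classification_scores(y_true, y_pred):
    """Acc, TPR, FPR, and AUC = (1 - FPR + TPR)/2 for hard +/-1 predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == 1
    neg = y_true == -1
    acc = np.mean(y_pred == y_true)
    tpr = np.mean(y_pred[pos] == 1) if pos.any() else 0.0   # correctly labeled positives
    fpr = np.mean(y_pred[neg] == 1) if neg.any() else 0.0   # mislabeled negatives
    auc = 0.5 * (1.0 - fpr + tpr)   # AUC surrogate for hard (no-confidence) outputs
    return acc, tpr, fpr, auc

# Example on a small imbalanced label set.
y_true = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, -1])
y_pred = np.array([1, 1, -1, -1, -1, -1, 1, -1, -1, -1])
print("Acc, TPR, FPR, AUC =", classification_scores(y_true, y_pred))
```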

The results on the eight UCI datasets are listed in Table 3, where the best results among LOSC-SVM, SLA-SVM, L_0-SVM, L_1-SVM, and L_2-SVM are shown in bold. Table 3 shows that the proposed LOSC-SVM has good sparsity, while L_2-SVM utilizes all features. The accuracy of LOSC-SVM is as good as or better than that of L_2-SVM on seven datasets (datasets 1-7); only on dataset 8 does the sparse solution introduce a slight increase in error (4.4%). Compared with the other sparse SVMs, LOSC-SVM achieves better sparsity and accuracy than L_1-SVM on the first six datasets (datasets 1-6). LOSC-SVM selects more features than L_0-SVM on datasets 7 and 8; however, it achieves higher accuracy than L_0-SVM on all datasets. Furthermore, LOSC-SVM performs better than SLA-SVM in both feature selection and classification on seven datasets (datasets 1-5, 7, 8). Only on dataset 6 is the solution of LOSC-SVM not as sparse as that of SLA-SVM; however, the FPR of SLA-SVM is 100%, which indicates that all of the negative samples are wrongly classified, possibly due to the lack of negative samples to learn from. In contrast, the FPR of LOSC-SVM is 36.1% lower than that of SLA-SVM. This suggests that LOSC-SVM works better on class-imbalanced problems.

Table 4 shows the average results over the eight UCI datasets. LOSC-SVM, SLA-SVM, and L_0-SVM have similar average cardinality, which is much smaller than that of L_1-SVM. Among the five SVMs, LOSC-SVM achieves the best results in four scores: sparsity, accuracy, AUC, and FPR. With respect to the two L_p-SVMs ($0 < p < 1$), the average FPR of LOSC-SVM is 6.6% lower than that of SLA-SVM, which indicates that LOSC-SVM is more robust than SLA-SVM on class-imbalanced problems. These results support that the proposed LOSC-SVM is a promising tool for performing feature selection and classification simultaneously.

Table 3. Results on UCI Datasets; for each of the eight datasets and each method (LOSC-SVM, SLA-SVM, L_0-SVM, L_1-SVM, L_2-SVM), the columns report Cardinality, Acc (%), AUC, TPR, FPR, and the selected $C$ and $p$.


Table 4. Average Results over the Eight UCI Datasets (Sparsity = Cardinality/$m$); for each method (LOSC-SVM, SLA-SVM, L_0-SVM, L_1-SVM, L_2-SVM), the columns report Cardinality, Sparsity, Acc (%), AUC, TPR, and FPR.

5 Conclusions

In this paper, we studied a new adaptive L_p-SVM classification method. L_p-SVM allows a flexible penalty form chosen by the data; hence, the classifier is built based on the best $p$ for the specific application. Since solving L_p-SVM is an NP-hard problem, we proposed a new reformulation of L_p-SVM, named LOSC-SVM, which is easy to solve. Simulation experiments showed that in most situations a smaller $p$ ($0 < p \le 1$) leads to a sparser solution with almost no loss of accuracy. The results on real datasets showed that the proposed LOSC-SVM is better than L_1-SVM and L_2-SVM in both feature selection and classification. The results suggest that the adaptive L_p-SVM is an interesting approach for feature selection applications.

We will extend our research in several directions. For example, the choice of the parameters $(C, p)$ is important for improving the performance of L_p-SVM. The most commonly used approach to selecting $(C, p)$ is grid search coupled with cross-validation, which is computationally expensive. In the future, we will consider a theoretical analysis of $(C, p)$ and explore more efficient methods. Furthermore, we will also develop faster algorithms that can handle large-scale problems.

References

[1] Vapnik V N. The Nature of Statistical Learning Theory (2nd edition). Springer.
[2] Guyon I, Gunn S, Nikravesh M, Zadeh L A. Feature Extraction: Foundations and Applications (1st edition). Springer.
[3] Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics, 2007, 23(19).
[4] Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning, 2002, 46(1/2/3).
[5] Rakotomamonjy A. Variable selection using SVM-based criteria. The Journal of Machine Learning Research, 2003, 3.
[6] Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V. Feature selection for SVMs. In Advances in Neural Information Processing Systems 13, Leen T K, Dietterich T G, Tresp V (eds.), Massachusetts Institute of Technology, 2001.
[7] Peleg D, Meir R. A feature selection algorithm based on the global minimization of a generalization error bound. In Advances in Neural Information Processing Systems 17, Saul L K, Weiss Y, Bottou L (eds.), Massachusetts Institute of Technology, 2005.
[8] Bradley P S, Mangasarian O L. Feature selection via concave minimization and support vector machines. In Proc. the 15th International Conference on Machine Learning, July 1998.
[9] Weston J, Elisseeff A, Schölkopf B, Tipping M. Use of the zero norm with linear models and kernel methods. The Journal of Machine Learning Research, 2003, 3.
[10] Amaldi E, Kann V. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 1998, 209(1/2).
[11] Chan A B, Vasconcelos N, Lanckriet G R G. Direct convex relaxations of sparse SVM. In Proc. the 24th International Conference on Machine Learning, June 2007.
[12] Fung G M, Mangasarian O L. A feature selection Newton method for support vector machine classification.
Computational Optimization and Applications, 2004, 28(2).
[13] Bi J B, Bennett K, Embrechts M, Breneman C, Song M H. Dimensionality reduction via sparse support vector machines. The Journal of Machine Learning Research, 2003, 3.
[14] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 1996, 58(1).
[15] Neumann J, Schnörr C, Steidl G. Combined SVM-based feature selection and classification. Machine Learning, 2005, 61(1/2/3).
[16] Chartrand R. Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Processing Letters, 2007, 14(10).
[17] Chartrand R. Nonconvex regularization for shape preservation. In Proc. the IEEE International Conference on Image Processing, September 16-October 19, 2007.

[18] Xu Z B, Zhang H, Wang Y, Chang X Y, Liang Y. L_1/2 regularization. Science China Information Sciences, 2010, 53(6).
[19] Liu J L, Li J P, Xu W X, Shi Y. A weighted L_q adaptive least squares support vector machine classifiers: robust and sparse approximation. Expert Systems with Applications, 2011, 38(3).
[20] Chen W J, Tian Y J. L_p-norm proximal support vector machine and its applications. Procedia Computer Science, 2010, 1(1).
[21] Rakotomamonjy A, Flamary R, Gasso G, Canu S. l_p-l_q penalty for sparse linear and sparse multiple kernel multitask learning. IEEE Transactions on Neural Networks, 2011, 22(8).
[22] Liu Y F, Zhang H H, Park C, Ahn J. Support vector machines with adaptive L_q penalty. Computational Statistics and Data Analysis, 2007, 51(12).
[23] Liu Z Q, Lin S L, Tan M. Sparse support vector machines with L_p penalty for biomarker identification. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2010, 7(1).
[24] Tan J Y, Zhang Z Q, Zhen L, Zhang C H, Deng N Y. Adaptive feature selection via a new version of support vector machine. Neural Computing and Applications, 2013, 23(3/4).
[25] Tian Y J, Yu J, Chen W J. l_p-norm support vector machine with CCCP. In Proc. the 7th International Conference on Fuzzy Systems and Knowledge Discovery, August 2010.
[26] Liu J W, Liu Y. Non-integer norm regularization SVM via Legendre-Fenchel duality. Neurocomputing, 2014, 144.
[27] Chen X J, Xu F M, Ye Y Y. Lower bound theory of nonzero entries in solutions of l_2-l_p minimization. SIAM Journal on Scientific Computing, 2010, 32(5).
[28] Zhang C H, Shao Y H, Tan J Y, Deng N Y. Mixed-norm linear support vector machine. Neural Computing and Applications, 2013, 23(7).
[29] Li D H, Wu L, Sun Z, Zhang X J. A constrained optimization reformulation and a feasible descent direction method for L_1/2 regularization. Computational Optimization and Applications, 2014, 59(1/2).
[30] Newman D J, Hettich S, Blake C L, Merz C J. UCI repository of machine learning databases. Technical Report 9702, Department of Information and Computer Science, University of California, Irvine.

Feng Zeng received his B.Eng. and M.Eng. degrees in computer science from Hunan University, Changsha, in 2000 and 2005 respectively, and his Ph.D. degree in computer science in 2010 from Central South University, Changsha. He is now an associate professor in the School of Software, Central South University, Changsha. His current research interests include wireless networks, QoS routing, and data mining. He is a member of CCF.

Dong-Hui Li is a professor in the School of Mathematical Sciences, South China Normal University, Guangzhou. He got his Bachelor's and Master's degrees in applied mathematics from Hunan University, Changsha, in 1983 and 1986 respectively. He got his first Ph.D. degree in applied mathematics from Hunan University, Changsha, in 1994, and his second Ph.D. degree in applied mathematics from Kyoto University, Kyoto. His research interests include numerical methods in optimization and nonlinear equations with applications in supply chains and finance. He has published nearly 100 academic papers.

Zhi-Gang Chen is a professor in the School of Software, Central South University (CSU), Changsha. His research interests are in QoS mechanisms for IP networks, web services, and wireless networks. He is the director of the Network Computing and Distributed Processing Laboratory in the School of Software, CSU, Changsha.
He obtained his Ph.D. and M.S. degrees in computer science from CSU in 1998 and 1987 respectively. He is a senior member of CCF.

Lan Yao is an assistant professor in the College of Mathematics and Econometrics, Hunan University, Changsha. She got her B.S. degree in computer science, and her M.S. and Ph.D. degrees in applied mathematics, from Hunan University, Changsha, in 2000, 2006, and 2014 respectively. Her research interests include data mining, numerical methods in optimization, and network optimization.


More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

FEATURE SELECTION IN SVM VIA POLYHEDRAL K-NORM. Sparse optimization, Cardinality constraint, k-norm, Support Vector Machine, DC

FEATURE SELECTION IN SVM VIA POLYHEDRAL K-NORM. Sparse optimization, Cardinality constraint, k-norm, Support Vector Machine, DC FEATURE SELECTION IN SVM VIA POLYHEDRAL K-NORM M. GAUDIOSO, E. GORGONE, AND J.-B. HIRIART URRUTY Abstract. We treat the Feature Selection problem in the Support Vector Machine (SVM) framework by adopting

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

L5 Support Vector Classification

L5 Support Vector Classification L5 Support Vector Classification Support Vector Machine Problem definition Geometrical picture Optimization problem Optimization Problem Hard margin Convexity Dual problem Soft margin problem Alexander

More information

Learning by constraints and SVMs (2)

Learning by constraints and SVMs (2) Statistical Techniques in Robotics (16-831, F12) Lecture#14 (Wednesday ctober 17) Learning by constraints and SVMs (2) Lecturer: Drew Bagnell Scribe: Albert Wu 1 1 Support Vector Ranking Machine pening

More information

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning Kernel Machines Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 SVM linearly separable case n training points (x 1,, x n ) d features x j is a d-dimensional vector Primal problem:

More information

Hierarchical Penalization

Hierarchical Penalization Hierarchical Penalization Marie Szafransi 1, Yves Grandvalet 1, 2 and Pierre Morizet-Mahoudeaux 1 Heudiasyc 1, UMR CNRS 6599 Université de Technologie de Compiègne BP 20529, 60205 Compiègne Cedex, France

More information

Research Reports on Mathematical and Computing Sciences

Research Reports on Mathematical and Computing Sciences ISSN 1342-2804 Research Reports on Mathematical and Computing Sciences A Modified Algorithm for Nonconvex Support Vector Classification Akiko Takeda August 2007, B 443 Department of Mathematical and Computing

More information

Support Vector Machine Classification via Parameterless Robust Linear Programming

Support Vector Machine Classification via Parameterless Robust Linear Programming Support Vector Machine Classification via Parameterless Robust Linear Programming O. L. Mangasarian Abstract We show that the problem of minimizing the sum of arbitrary-norm real distances to misclassified

More information

FEATURE SELECTION COMBINED WITH RANDOM SUBSPACE ENSEMBLE FOR GENE EXPRESSION BASED DIAGNOSIS OF MALIGNANCIES

FEATURE SELECTION COMBINED WITH RANDOM SUBSPACE ENSEMBLE FOR GENE EXPRESSION BASED DIAGNOSIS OF MALIGNANCIES FEATURE SELECTION COMBINED WITH RANDOM SUBSPACE ENSEMBLE FOR GENE EXPRESSION BASED DIAGNOSIS OF MALIGNANCIES Alberto Bertoni, 1 Raffaella Folgieri, 1 Giorgio Valentini, 1 1 DSI, Dipartimento di Scienze

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

PROGRAMMING. Yufeng Liu and Yichao Wu. University of North Carolina at Chapel Hill

PROGRAMMING. Yufeng Liu and Yichao Wu. University of North Carolina at Chapel Hill Statistica Sinica 16(26), 441-457 OPTIMIZING ψ-learning VIA MIXED INTEGER PROGRAMMING Yufeng Liu and Yichao Wu University of North Carolina at Chapel Hill Abstract: As a new margin-based classifier, ψ-learning

More information

IE598 Big Data Optimization Introduction

IE598 Big Data Optimization Introduction IE598 Big Data Optimization Introduction Instructor: Niao He Jan 17, 2018 1 A little about me Assistant Professor, ISE & CSL UIUC, 2016 Ph.D. in Operations Research, M.S. in Computational Sci. & Eng. Georgia

More information

Sparse Additive machine

Sparse Additive machine Sparse Additive machine Tuo Zhao Han Liu Department of Biostatistics and Computer Science, Johns Hopkins University Abstract We develop a high dimensional nonparametric classification method named sparse

More information

Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs

Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs Ammon Washburn University of Arizona September 25, 2015 1 / 28 Introduction We will begin with basic Support Vector Machines (SVMs)

More information

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines www.cs.wisc.edu/~dpage 1 Goals for the lecture you should understand the following concepts the margin slack variables the linear support vector machine nonlinear SVMs the kernel

More information

A Second order Cone Programming Formulation for Classifying Missing Data

A Second order Cone Programming Formulation for Classifying Missing Data A Second order Cone Programming Formulation for Classifying Missing Data Chiranjib Bhattacharyya Department of Computer Science and Automation Indian Institute of Science Bangalore, 560 012, India chiru@csa.iisc.ernet.in

More information

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract

Scale-Invariance of Support Vector Machines based on the Triangular Kernel. Abstract Scale-Invariance of Support Vector Machines based on the Triangular Kernel François Fleuret Hichem Sahbi IMEDIA Research Group INRIA Domaine de Voluceau 78150 Le Chesnay, France Abstract This paper focuses

More information

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp.

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp. On different ensembles of kernel machines Michiko Yamana, Hiroyuki Nakahara, Massimiliano Pontil, and Shun-ichi Amari Λ Abstract. We study some ensembles of kernel machines. Each machine is first trained

More information

Support Vector Machine

Support Vector Machine Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

More information

Pattern Recognition 2018 Support Vector Machines

Pattern Recognition 2018 Support Vector Machines Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht

More information