SVM-based Feature Selection by Direct Objective Minimisation

Julia Neumann, Christoph Schnörr, and Gabriele Steidl
Dept. of Mathematics and Computer Science, University of Mannheim, 68131 Mannheim, Germany
http://www.cvgpr.uni-mannheim.de, http://kiwi.math.uni-mannheim.de
{jneumann,schnoerr,steidl}@uni-mannheim.de

Abstract. We propose various novel embedded approaches for (simultaneous) feature selection and classification within a general optimisation framework. In particular, we include linear and nonlinear SVMs. We apply difference of convex functions programming to solve our problems and present results for artificial and real-world data.

1 Introduction

Overview and related work. Given a pattern recognition problem as a training set of labelled feature vectors, our goal is to find a mapping that classifies the data correctly. In this context, feature selection aims at picking out some of the original input dimensions (features) (i) for performance issues by facilitating data collection and reducing storage space and classification time, (ii) to perform semantics analysis helping to understand the problem, and (iii) to improve prediction accuracy by avoiding the curse of dimensionality (cf. [6]).

Feature selection approaches divide into filters that act as a preprocessing step independently of the classifier, wrappers that take the classifier into account as a black box, and embedded approaches that determine features and classifier simultaneously during the training process (cf. [6]). In this paper, we deal with the latter kind of method and focus on direct objective minimisation. Our linear classification framework is based on [4], but takes into account that the Support Vector Machine (SVM) owes its good generalisation ability to its l2-regulariser. There exist only few papers on nonlinear classification with embedded feature selection. An approach for the quadratic l1-norm SVM was suggested in [12]. An example of a wrapper method employing a Gaussian kernel SVM error bound is [11].

Contribution. We propose a range of new embedded methods for feature selection by regularising linear embedded approaches, and we construct feature selection methods for nonlinear SVMs. To solve the resulting non-convex problems, we apply the general difference of convex functions (d.c.) optimisation algorithm.

Structure. In the next section, we present various extensions of the linear embedded approach proposed in [4] and consider feature selection methods in conjunction with nonlinear classification. The d.c. optimisation approach and its application to our problems is described in Sect. 3. Numerical results illustrating and evaluating the various approaches are given in Sect. 4.

2 Feature Selection by Direct Objective Minimisation

Given a training set {(x_i, y_i) ∈ X × {-1, 1} : i = 1, ..., n} with X ⊂ R^d, our goal is both to find a classifier F : X → {-1, 1} and to select features.

2.1 Linear Classification

The linear classification approaches construct two parallel bounding planes in R^d such that the differently labelled sets lie, to some extent, in the two opposite half spaces determined by these planes. More precisely, one solves the minimisation problem

    min_{w ∈ R^d, b ∈ R}  (1 - λ) Σ_{i=1}^n (1 - y_i (w^T x_i + b))_+ + λ ρ(w)    (1)

with λ ∈ [0, 1), regulariser ρ and x_+ := max(x, 0). Then the classifier is F(x) = sgn(w^T x + b).

For ρ ≡ 0, the linear method (1) was proposed as Robust Linear Programming (RLP) by Bennett and Mangasarian [2]. Note that these authors weighted the training errors by 1/n_±, where n_± = |{i : y_i = ±1}|.

In order to maximise the margin between the two parallel planes, the original SVM penalises the l2-norm ρ(w) = ‖w‖_2^2. Then (1) can be solved by a convex Quadratic Program (QP). In order to suppress features, l_p-norms with p < 2 are used. In [4], the l1-norm (lasso penalty) ρ(w) = ‖w‖_1 leads to good feature selection and classification results. Moreover, for the l1-norm, (1) can be solved by a linear program. The feature selection can be further improved by using the so-called l0-norm ‖w‖_0 = |{i : w_i ≠ 0}| [4, 10]. Since the l0-norm is non-smooth, it was approximated in [4] by the concave functional

    ρ(w) = e^T (e - e^{-α|w|}) = Σ_{i=1}^d (1 - e^{-α|w_i|})    (2)

with approximation parameter α ∈ R_+ and e = (1, ..., 1)^T. Problem (1) with penalty (2) is known as Feature Selection concave (FSV). Now the solution of (1) becomes more sophisticated and can be obtained, e.g., by the Successive Linearization Algorithm (SLA) as proposed in [4].

New feature selection approaches. Since the l2 penalty term leads to very good classification results while the l1 and l0 penalty terms focus on feature selection, we suggest using combinations of these terms. As is common, to eliminate the absolute values in the l1-norm or in the approximate l0-norm, we introduce additional variables v_i ≥ |w_i| (i = 1, ..., d) and consider ν ρ(v) + χ_{[-v,v]}(w) instead of ρ(w), where χ_C denotes the indicator function with χ_C(x) = 0 if x ∈ C and χ_C(x) = ∞ otherwise (cf. [7, 8]). As a result, for µ, ν ∈ R_+, we minimise

    f(w, b, v) := µ Σ_{i=1}^n (1 - y_i (w^T x_i + b))_+ + (1/2) ‖w‖_2^2 + ν ρ(v) + χ_{[-v,v]}(w).    (3)

In case of the l1-norm, problem (3) can be solved by a convex QP. For the approximate l0-norm an appropriate method is presented in Sect. 3.
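For the l1 case, problem (3) is directly solvable with an off-the-shelf convex solver. The following is a minimal sketch of that variant (the l2-l1-SVM), not the authors' implementation; the use of cvxpy and the default values of mu and nu are illustrative assumptions.

```python
# Sketch of the l2-l1-SVM (3) with rho(v) = sum(v); assumes X is an (n, d)
# numpy array and y a vector with entries in {-1, +1}.
import numpy as np
import cvxpy as cp

def l2_l1_svm(X, y, mu=1.0, nu=0.1):
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    v = cp.Variable(d, nonneg=True)            # auxiliary variables v_i >= |w_i|
    hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ w + b)))
    objective = mu * hinge + 0.5 * cp.sum_squares(w) + nu * cp.sum(v)
    constraints = [w <= v, w >= -v]            # encodes the indicator chi_[-v,v](w)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w.value, b.value, v.value
```

At the optimum, nu * sum(v) equals nu * ||w||_1, so the box constraints reproduce the lasso penalty; features j with v_j close to zero are the discarded ones.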

2.2 Nonlinear Classification

For problems which are not linearly separable, a so-called feature map φ is used which maps the set X ⊂ R^d into a higher dimensional space φ(X) ⊂ R^d̃ (d̃ ≥ d). Then the linear approach (1) is applied in the new feature space φ(X). This results in a nonlinear classification in the original space R^d, i.e., in nonlinear separating surfaces.

Quadratic feature map. We start with the simple quadratic feature map φ : X → R^d̃, x ↦ (x^α : α ∈ N_0^d, 0 < |α| ≤ 2), where d̃ = d(d+3)/2, and apply (1) in R^d̃ with the approximate l0-penalty (2):

    f(w, b, v) := (1 - λ) Σ_{i=1}^n (1 - y_i (w^T φ(x_i) + b))_+ + (λ/d) e^T (e - e^{-αv})
                  + Σ_{i=1}^{d̃} Σ_{j : φ_i(e_j) ≠ 0} χ_{[-v_j, v_j]}(w_i)  →  min_{w ∈ R^d̃, b ∈ R, v ∈ R^d},    (4)

where e_j ∈ R^d denotes the j-th unit vector. We want to select features in the original space R^d due to (i)-(ii) in Sect. 1; thus we include the appropriate indicator functions. A similar approach in [12] does not involve this idea and achieves only a feature selection in the transformed feature space R^d̃. We will refer to (4) as quadratic FSV. In principle, the approach can be extended to other feature maps φ, especially to other polynomial degrees.

Gaussian kernel feature map. Next we consider SVMs with the feature map related to the Gaussian kernel

    K(x, z) = K_θ(x, z) = e^{-‖x - z‖_{1,θ}/σ}    (5)

with weighted l1-norm ‖x‖_{1,θ} = Σ_{k=1}^d θ_k |x_k|, defined by K(x, z) = ⟨φ(x), φ(z)⟩ for all x, z ∈ X. We apply the usual SVM classifier. For further information on nonlinear SVMs see, e.g., [9].

Direct feature selection, i.e., setting as many θ_k to zero as possible while retaining or improving the classification ability, is a difficult problem. One possible approach is to use a wrapper as in [11]. In [5], the alignment Â(K, yy^T) = y^T K y / (n ‖K‖_F) was proposed as a measure of conformance of a kernel with a learning task. We therefore suggest to maximise, in a modified form, y_n^T K y_n, where y_n = (y_i/n_{y_i})_{i=1}^n. Then, with penalty (2), we define our kernel-target alignment approach for feature selection as

    f(θ) := -(1-λ)/2 · y_n^T K_θ y_n + (λ/d) e^T (e - e^{-αθ}) + χ_{[0,e]}(θ)  →  min_{θ ∈ R^d}.    (6)

The scaling factors 1/2 and 1/d ensure that both objective terms take values in [0, 1].
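To make (5) and (6) concrete, the sketch below evaluates the weighted-l1 Gaussian kernel and the modified alignment y_n^T K_θ y_n for given feature scales θ. It is an illustration only, not the authors' code; the function names and argument conventions are assumptions.

```python
import numpy as np

def weighted_l1_gauss_kernel(X, theta, sigma):
    # K_theta(x, z) = exp(-sum_k theta_k |x_k - z_k| / sigma), cf. (5)
    diff = np.abs(X[:, None, :] - X[None, :, :])       # (n, n, d) pairwise |x_i - x_j|
    return np.exp(-(diff * theta).sum(axis=-1) / sigma)

def modified_alignment(X, y, theta, sigma):
    # alignment term y_n^T K_theta y_n from (6), with y_n = (y_i / n_{y_i})
    K = weighted_l1_gauss_kernel(X, theta, sigma)
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    y_n = np.where(y == 1, 1.0 / n_pos, -1.0 / n_neg)
    return y_n @ K @ y_n
```

A component θ_k driven to zero removes feature k from every kernel evaluation, which is what turns (6) into a feature selection criterion.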

3 D.C. Programming and Optimisation

A robust algorithm for minimising non-convex problems is the Difference of Convex functions Algorithm (DCA) proposed in [7]. Its goal is to minimise a function f : R^d → R ∪ {∞} which reads

    f(x) = g(x) - h(x)  →  min_{x ∈ R^d},    (7)

where g, h : R^d → R ∪ {∞} are lower semi-continuous, proper convex functions, cf. [8]. In the next subsections, we first introduce the DCA and then apply it to our non-convex feature selection problems.

3.1 D.C. Programming

For g as assumed above, we introduce the domain of g, its conjugate function at x̄ ∈ R^d and its subdifferential at z ∈ R^d by

    dom g := {x ∈ R^d : g(x) < ∞},
    g*(x̄) := sup_{x ∈ R^d} {⟨x̄, x⟩ - g(x)},
    ∂g(z) := {x̄ ∈ R^d : g(x) ≥ g(z) + ⟨x - z, x̄⟩ for all x ∈ R^d},

respectively. For differentiable functions we have ∂g(z) = {∇g(z)}. According to [8, Theorem 23.5], it holds that

    ∂g(x) = arg max_{x̄ ∈ R^d} {x^T x̄ - g*(x̄)},    ∂g*(x̄) = arg max_{x ∈ R^d} {x̄^T x - g(x)}.    (8)

Further assume that dom g ⊂ dom h and dom h* ⊂ dom g*. It was proved in [7] that then every limit point of the sequence (x^k)_{k ∈ N} produced by the following algorithm is a critical point of f in (7):

Algorithm 3.1: D.C. minimisation Algorithm (DCA)(g, h, tol)
    choose x^0 ∈ dom g arbitrarily
    for k = 0, 1, 2, ...
        select x̄^k ∈ ∂h(x^k) arbitrarily
        select x^{k+1} ∈ ∂g*(x̄^k) arbitrarily
        if min{ |x_i^{k+1} - x_i^k|, |x_i^{k+1} - x_i^k| / |x_i^k| } ≤ tol for i = 1, ..., d
            then return x^{k+1}

We can show, but omit this point due to lack of space, that the DCA applied to a particular d.c. decomposition (7) of FSV coincides with the SLA.

3.2 Application to our Feature Selection Problems

The crucial point in applying the DCA is to define a suitable d.c. decomposition (7) of the objective function. The aim of this section is to propose such decompositions for our different approaches.

l2-l0-SVM. A viable d.c. decomposition for (3) with (2) reads

    g(w, b, v) = µ Σ_{i=1}^n (1 - y_i (w^T x_i + b))_+ + (1/2) ‖w‖_2^2 + χ_{[-v,v]}(w),
    h(v) = ν e^T (e - e^{-αv}),
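As a reading aid, the DCA loop of Algorithm 3.1 can be written generically as below. The two callables correspond to the two "select" steps, and everything here (names, the per-coordinate stopping rule) is an illustrative sketch under those assumptions rather than the authors' code.

```python
import numpy as np

def dca(x0, subgrad_h, argmax_g_conj, tol=1e-5, max_iter=200):
    """Generic DCA sketch for f = g - h (cf. Algorithm 3.1).

    subgrad_h(x)        -- returns some element of the subdifferential of h at x
    argmax_g_conj(xbar) -- returns argmax_x { x^T xbar - g(x) }, i.e. an element
                           of the subdifferential of g* at xbar, cf. (8)
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        xbar = subgrad_h(x)                   # first DCA step
        x_new = argmax_g_conj(xbar)           # second DCA step: a convex problem
        # per-coordinate mixed absolute/relative stopping test
        change = np.abs(x_new - x)
        if np.all(np.minimum(change, change / np.maximum(np.abs(x), 1e-12)) <= tol):
            return x_new
        x = x_new
    return x
```

For the l2-l0-SVM decomposition above, subgrad_h is the gradient of ν e^T(e - e^{-αv}) and argmax_g_conj amounts to solving a convex QP in each iteration.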

which gives rise to a convex QP in each DCA step.

Quadratic FSV. To solve (4) we use the d.c. decomposition

    g(w, b, v) = (1 - λ) Σ_{i=1}^n (1 - y_i (w^T φ(x_i) + b))_+ + Σ_{i=1}^{d̃} Σ_{j : φ_i(e_j) ≠ 0} χ_{[-v_j, v_j]}(w_i),
    h(v) = (λ/d) e^T (e - e^{-αv}),

which leads to a linear problem in each DCA step.

Kernel-target alignment approach. For the function defined in (6), as the kernel (5) is convex in θ, we split f as

    g(θ) = (1-λ)/(2 n_+ n_-) Σ_{i,j : y_i ≠ y_j} e^{-‖x_i - x_j‖_{1,θ}/σ} + χ_{[0,e]}(θ),
    h(θ) = (1-λ)/2 Σ_{i,j : y_i = y_j} (1/n_{y_i}^2) e^{-‖x_i - x_j‖_{1,θ}/σ} - (λ/d) e^T (e - e^{-αθ}).

Now h is differentiable, so applying the DCA we find the solution of the first step of iteration k as θ̄^k = ∇h(θ^k). In the second step, we are looking for

    θ^{k+1} ∈ ∂g*(θ̄^k) = arg max_θ {θ^T θ̄^k - g(θ)},

which leads to solving the convex non-quadratic problem

    min_{θ ∈ R^d}  (1-λ)/(2 n_+ n_-) Σ_{i,j : y_i ≠ y_j} e^{-‖x_i - x_j‖_{1,θ}/σ} - θ^T θ̄^k    subject to  0 ≤ θ ≤ e

with a valid initial point 0 ≤ θ ≤ e. We efficiently solve this problem by a penalty/barrier multiplier method with logarithmic-quadratic penalty function as proposed in [1].

4 Evaluation

4.1 Ground Truth Experiments

In this section, we consider artificial training sets in R^2 and R^4 where y is a function of the first two features x_1 and x_2. The examples in Fig. 1 show that our quadratic FSV approach indeed performs feature selection and finds classification rules for quadratic, not linearly separable problems. For the non-quadratic chess board classification problems in Fig. 2, our kernel-target alignment approach performs very well, in contrast to all other feature selection approaches presented. Remarkably, the alignment functional incorporates implicit feature selection even for λ = 0. In both cases, only relevant feature sets are selected, as can be seen in the bottom plots.
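Returning to the kernel-target alignment decomposition of Sect. 3.2: since h is differentiable, the first DCA step is a plain gradient evaluation, θ̄^k = ∇h(θ^k). The sketch below computes that gradient for the decomposition as reconstructed here; it is illustrative only (array names, shapes and the treatment of the penalty term are assumptions), not the authors' implementation.

```python
import numpy as np

def grad_h_alignment(X, y, theta, sigma, lam, alpha):
    # gradient of h(theta) = (1-lam)/2 * sum_{y_i=y_j} K_ij / n_{y_i}^2
    #                        - (lam/d) * sum_k (1 - exp(-alpha*theta_k))
    n, d = X.shape
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    diff = np.abs(X[:, None, :] - X[None, :, :])            # (n, n, d)
    K = np.exp(-(diff * theta).sum(axis=-1) / sigma)         # kernel matrix (5)
    same = (y[:, None] == y[None, :])                        # same-class pairs
    ny2 = np.where(y == 1, n_pos, n_neg).astype(float) ** 2
    weights = np.where(same, K / ny2[:, None], 0.0)          # K_ij / n_{y_i}^2
    grad_kernel = -(weights[:, :, None] * diff).sum(axis=(0, 1)) / sigma
    grad_penalty = -(lam / d) * alpha * np.exp(-alpha * np.asarray(theta, float))
    return 0.5 * (1.0 - lam) * grad_kernel + grad_penalty
```

The second DCA step then requires a convex solver for the box-constrained problem stated above.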

[Fig. 1. Quadratic classification problems (y a quadratic function of x_1 and x_2). Top: training points and decision boundaries (white lines) computed by (4); left: deterministic data in R^2, right: projection of random data in R^4 (four normal random variables x_1, ..., x_4) onto the selected features. Bottom: features determined by (4) for varying λ.]

[Fig. 2. Chess board classification problems. Top: training points and Gaussian SVM decision boundaries (white lines); left: deterministic data in R^2, right: projection of random data in R^4 (same as Fig. 1, right) onto the selected features. Bottom: features determined by (6) for varying λ.]

4.2 Real-World Data

To test all our methods on real-world data, we use several data sets from the UCI repository [3], summarised in Table 1. We rescaled the features linearly to zero mean and unit variance and compare our approaches with RLP and FSV as favoured in [4].

Table 1. Statistics for the data sets used

data set    | number of features d | number of samples n | class distribution n_+/n_-
wpbc60      | 32                   | 110                 | 41/69
wpbc24      | 32                   | 155                 | 28/127
liver       | 6                    | 345                 | 145/200
cleveland   | 13                   | 297                 | 160/137
ionosphere  | 34                   | 351                 | 225/126
pima        | 8                    | 768                 | 500/268
bcw         | 9                    | 683                 | 444/239

Choice of parameters. We set α = 5 in (2) as proposed in [4], and σ = d in (5), which maximises the problems' alignment. We start the DCA with v^0 = 0 for the l2-l0-SVM, FSV and quadratic FSV, and with θ^0 = e/2 for the kernel-target alignment approach. We stop on v with tol = 10^{-5} resp. on θ with tol = 10^{-3}. We retain one half of each run's cross-validation training set for parameter selection. The parameters are chosen to minimise the validation error over grids for ln µ, for ln ν ∈ {-5, ..., 5}, and for λ ∈ {0.05, 0.1, 0.2, ..., 0.9, 0.95} for (quadratic) FSV resp. λ ∈ {0, 0.1, ..., 0.9} for the kernel-target alignment approach. In case of equal validation error, we choose the larger values for (ν, µ) resp. λ. In the same manner, the weight parameter of the Gaussian kernel SVM is chosen from {e^{-5}, e^{-4}, ..., e^{5}} according to the smallest validation error, independently of the selected features.

The results are summarised in Table 2, where the number of selected features is determined as |{j = 1, ..., d : v_j > 10^{-8}}| resp. |{j = 1, ..., d : θ_j > 10^{-2}}|. It is clear that all proposed approaches perform feature selection: linear FSV discards most features, followed by the kernel-target alignment approach, then the l2-l0-SVM, then the l2-l1-SVM. In addition, for all approaches the test error is often smaller than for RLP. The quadratic FSV performs well mainly for special problems (e.g., liver and ionosphere), but the classification is good in general for all other approaches.

Table 2. Feature selection and classification tenfold cross-validation performance (average number of features "dim", average test error "err" [%]); bold numbers indicate lowest errors. Values as printed in the source.

            | linear classification                              | nonlinear classification
            | RLP       | FSV       | l2-l1-SVM | l2-l0-SVM | quad. FSV  | k.-t. align.
data set    | dim  err  | dim  err  | dim  err  | dim  err  | dim  err   | dim  err
wpbc60      | 3.   4.9  | .4   36.4 | .4   35.5 | 3.4  37.3 | 3.   37.3  | 3.9  35.5
wpbc24      | 3.   7.7  | .    8.   | .6   7.4  | .9   8.   | .    8.    | .9   8.
liver       | 6.   3.9  | .    36.  | 6.   35.  | 5.   34.  | 3.   3.5   | .5   35.4
cleveland   | 3.   6.   | .8   3.   | 9.9  6.5  | 8.   6.5  | 9.   3.3   | 3.   3.6
ionosphere  | 33.  3.4  | .3   .7   | 4.8  3.4  | 4.   5.7  | 3.9  .8    | 6.6  7.7
pima        | 8.   .5   | .7   8.9  | 6.6  5.   | 6.   4.7  | 4.7  9.9   | .6   5.7
bcw         | 9.   3.4  | .4   4.8  | 8.7  3.   | 7.9  3.   | 5.4  9.4   | .8   4.
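For orientation, a hedged sketch of this evaluation protocol (tenfold cross-validation with half of each training fold held out for parameter selection) could look as follows. The fit/predict hooks, the parameter-grid handling and all names are assumptions, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def cross_validate(X, y, fit, predict, param_grid, seed=0):
    """Tenfold CV; half of each training set is retained for parameter selection.

    fit(X, y, p) -> model, predict(model, X) -> labels in {-1, +1};
    param_grid is a list of candidate parameter settings p.
    """
    X = (X - X.mean(axis=0)) / X.std(axis=0)     # zero mean, unit variance
    errors = []
    for train, test in KFold(n_splits=10, shuffle=True, random_state=seed).split(X):
        X_tr, X_val, y_tr, y_val = train_test_split(
            X[train], y[train], test_size=0.5, random_state=seed)
        # choose the parameters with the smallest validation error
        best = min(param_grid,
                   key=lambda p: np.mean(predict(fit(X_tr, y_tr, p), X_val) != y_val))
        model = fit(X[train], y[train], best)
        errors.append(np.mean(predict(model, X[test]) != y[test]))
    return np.mean(errors)
```

The tie-breaking rule from the text (prefer larger penalty weights on equal validation error) is omitted here for brevity.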

5 Summary and Conclusion

We proposed several novel methods that extend existing linear embedded feature selection approaches towards better generalisation ability through improved regularisation, and we constructed feature selection methods in connection with nonlinear classifiers. In order to apply the DCA, we found appropriate splittings of our non-convex objective functions. In the experiments with real data, effective feature selection was always carried out in conjunction with a small classification error. So direct objective minimisation feature selection is profitable and viable for different types of classifiers. In higher dimensions, the curse of dimensionality affects the classification error even more, so that our methods should become even more important there. A further evaluation on high-dimensional problems as well as the incorporation of other feature maps is future work.

Acknowledgements. This work was funded by the DFG, Grant Schn 457/5.

References

1. A. Ben-Tal and M. Zibulevsky. Penalty/barrier multiplier methods for convex programming problems. SIAM Journal on Optimization, 7(2):347-366, May 1997.
2. K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23-34, 1992.
3. C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.
4. P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proceedings of the 15th International Conference on Machine Learning, pages 82-90, San Francisco, CA, USA, 1998. Morgan Kaufmann.
5. N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 367-373. MIT Press, Cambridge, MA, USA, 2002.
6. I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182, Mar. 2003.
7. T. Pham Dinh and L. T. Hoai An. A d.c. optimization algorithm for solving the trust-region subproblem. SIAM Journal on Optimization, 8(2):476-505, May 1998.
8. R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, USA, 1970.
9. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, USA, 2002.
10. J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 3:1439-1461, Mar. 2003.
11. J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 668-674. MIT Press, Cambridge, MA, USA, 2001.
12. J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, USA, 2004.