Enhancing Generalization Capability of SVM Classifiers with Feature Weight Adjustment
Xizhao Wang and Qiang He
College of Mathematics and Computer Science, Hebei University, Baoding 071002, Hebei, China

Abstract. It is well recognized that support vector machines (SVMs) produce good classification performance in terms of generalization power. An SVM constructs an optimal separating hyper-plane by maximizing the margin between the two classes in a high-dimensional feature space. According to statistical learning theory, the margin scale reflects the generalization capability to a great extent: the bigger the margin, the better the generalization capability of the SVM. This paper makes an attempt to enlarge the margin between the two support vector hyper-planes by feature weight adjustment. The experiments demonstrate that the proposed technique can enhance the generalization capability of the original SVM classifiers.

1 Introduction

Statistical learning theory (SLT) [1], a theory for small-sample learning problems introduced in the 1960s by Vapnik et al., can deal with situations where the samples are limited. Most machine-learning methods follow the empirical risk minimization (ERM) induction principle [1], which is effective when samples are plentiful; however, in most real-world cases the samples are limited. It is difficult to apply expected risk minimization [2] directly to classification problems, so most expected risk minimization problems are converted into minimizing the empirical risk. Unfortunately, empirical risk minimization is not always equivalent to expected risk minimization: ERM alone cannot guarantee good generalization capability, whereas expected risk minimization can. Statistical learning theory has established a clear relationship between the expected risk and the empirical risk, and has shown that generalization can be controlled by the capacity of the learning machine. Support vector machines (SVMs) are a classification technique based on SLT [2].
Due to its extraordinary generalization ability, the SVM has become a powerful tool for solving two-class classification problems. An SVM first maps the original input space into a high-dimensional feature space through some predefined nonlinear mapping, and then constructs the optimal separating hyper-plane that maximizes the margin between the two classes in the feature space. Based on SLT, we know that the bigger the margin, the better the generalization capability of the SVM. Therefore we always want the margin to be as large as possible for two-class problems.

M.Gh. Negoita et al. (Eds.): KES 2004, LNAI 3213, pp. 1037-1043, 2004. © Springer-Verlag Berlin Heidelberg 2004
Feature weight learning, which assigns a weight to each feature to indicate its degree of importance, is an extension of feature selection. This paper adopts the feature weight learning technique introduced in [4]. We expect to enlarge the margin between the two support vector hyper-planes by using this technique, thereby improving generalization capability. The rest of this paper is organized as follows. In Section 2, we review the relationship between the margin of the two hyper-planes and the generalization capability of SVMs. In Section 3, the detailed feature weight learning technique is introduced. Two experiments, which show the effect of feature weight learning on margin enlargement, are presented in Section 4. Section 5 gives some remarks and concludes the paper.

2 Relationship Between the Margin and the Generalization Capability

Firstly, we consider the growth function of a set of indicator functions [2]:

G(l) = ln sup_{z_1,…,z_l} N(z_1, z_2, …, z_l)    (1)

where N(z_1, z_2, …, z_l) counts how many different separations of the given samples z_1, z_2, …, z_l can be produced using functions from the set of indicator functions. From SLT we know the following conclusion: any growth function either satisfies G(l) = l·ln 2 or is bounded by

G(l) < h(ln(l/h) + 1)    (2)

where h is an integer for which

G(h) = h·ln 2 and G(h+1) < (h+1)·ln 2    (3)

We now recall the definition of the VC dimension [3]: the VC dimension of a set of indicator functions Q(z, α), α ∈ Λ, equals h if the growth function is bounded by a logarithmic function with coefficient h. The VC dimension is a pivotal concept in statistical learning theory. For a two-class classification problem, the Δ-margin separating hyper-plane is defined in SLT [1]. The following theorems are valid for the set of Δ-margin separating hyper-planes [2].

Theorem 1. Let the vectors x ∈ X belong to a sphere of radius R.
Then the set of Δ-margin separating hyper-planes has VC dimension h bounded by the inequality

h ≤ min(R²/Δ², n) + 1    (4)

From inequality (4) we know that the larger the margin of the set of functions, the smaller its VC dimension.
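Inequality (4) is straightforward to evaluate numerically. The following sketch (an illustration, not from the paper; the function name and data are hypothetical) estimates R as the radius of the smallest data-centred sphere and checks that enlarging Δ cannot increase the bound on h:

```python
import numpy as np

def vc_dim_bound(X, delta):
    """Upper bound of Theorem 1: h <= min(R^2/Delta^2, n) + 1,
    where R is the radius of a sphere containing the data and
    n is the input dimension."""
    center = X.mean(axis=0)
    R = np.max(np.linalg.norm(X - center, axis=1))  # enclosing radius
    n = X.shape[1]
    return min(R**2 / delta**2, n) + 1

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
b_small = vc_dim_bound(X, delta=0.5)
b_large = vc_dim_bound(X, delta=1.0)
assert b_large <= b_small  # larger margin -> no larger VC-dimension bound
```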
Theorem 2. With probability at least 1 − η, the inequality

R(α) ≤ R_emp(α) + (Bε/2)·(1 + √(1 + 4·R_emp(α)/(Bε)))    (5)

holds true, where R(α) and R_emp(α) are the expected risk and the empirical risk functionals respectively [1], l is the size of the training set, h is the VC dimension of the set of functions, and ε is formulated in [2]. For simplicity, we write (5) as

R(α) ≤ R_emp(α) + Φ(h/l)    (6)

Φ is called the confidence interval; it is a monotonically increasing function of h and a monotonically decreasing function of l. Inequality (6) gives a bound on the generalization ability of the learning machine. Our purpose is to minimize the left side of (6), i.e., R(α); minimizing R(α) implies optimal generalization capability of the learning machine. In fact, minimizing R(α) directly is not feasible due to its integral formulation, so in practice we minimize R_emp(α) instead. Inequality (6) shows the relationship between R(α) and R_emp(α). Because of the second term on the right side of (6), it is clear that minimizing R_emp(α) cannot guarantee minimizing R(α). We therefore want the second term on the right side of (6) to be as small as possible, since a small confidence interval possibly implies a small actual risk R(α), i.e., possibly implies good generalization capability. Noting that Φ is a monotonically increasing function of h, and that h decreases as the margin Δ of the separating hyper-planes increases (see inequality (4)), enlarging Δ possibly leads to an improvement of the generalization capability of the learning machine. The traditional ERM principle only minimizes the empirical risk without considering the confidence interval, so it may fail to achieve good generalization capability. SVMs, based on SLT, follow the structural risk minimization (SRM) induction principle [1][2]: they not only minimize the empirical risk, but also keep the confidence interval small by maximizing the margin between the two classes in the high-dimensional feature space.
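For a linear SVM with decision function f(x) = ⟨w, x⟩ + b, the two support vector hyper-planes are ⟨w, x⟩ + b = ±1 and the margin between them is Δ = 2/‖w‖, so "maximize the margin" is exactly "minimize ‖w‖". As a minimal sketch (assuming scikit-learn, which the paper does not use), the margin can be read off a fitted linear SVM like this:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, size=(50, 2)),
               rng.normal(+2, 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1e3).fit(X, y)  # large C approximates a hard margin
w = clf.coef_.ravel()
margin = 2.0 / np.linalg.norm(w)  # distance between <w,x>+b = +1 and <w,x>+b = -1
assert margin > 0
```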
3 Feature-Weight Learning

Each feature is considered to have an importance degree called its feature-weight. Feature-weight assignment is an extension of feature selection: the latter allows only 0-weight or 1-weight values, while the former allows weight values anywhere in the interval [0,1]. Feature-weight learning depends on the similarity between samples. There are many ways to define the similarity measure, such as the correlation coefficient and the Euclidean distance; here the similarity measure ρ is defined as follows:
ρ_ij = 1/(1 + β·d_ij)    (7)

where β is a positive parameter. It adjusts the similarity degrees so that they are distributed around 0.5, i.e., β is required to satisfy

(2/(n(n−1))) Σ_{p<q} 1/(1 + β·d_pq) = 0.5    (8)

where d is the commonly used Euclidean distance. The weighted Euclidean distance d^(w) is defined as

d^(w)_ij = √( Σ_{k=1}^{s} w_k²·(x_ik − x_jk)² )    (9)

where w = (w_1, …, w_s) is the feature-weight vector. Its component w_k is the importance degree of the k-th feature: the larger w_k is, the more important the k-th feature. When w = (1, …, 1), the space {x : d^(w) ≤ r} is a hyper-sphere with radius r (called the original space), d^(w) coincides with d, and ρ^(w) with ρ. When w ≠ (1, …, 1), the axes are extended or shrunk in accordance with w, and the space {x : d^(w) ≤ r} becomes a hyper-ellipse, called the transformed space. The lower the value of w_k, the higher the flattening extent along the k-th axis. A good partition should have the following property: samples within one cluster are more similar and dissimilar samples are more separated, which implies that the fuzziness of the partition is low. Therefore we hope that, by adjusting w, ρ^(w) tends to one or zero according to whether ρ is greater or less than 0.5. Based on the above discussion, we learn the feature-weight values by minimizing an evaluation function E(w) [4] defined as follows:

E(w) = (2/(n(n−1))) Σ_i Σ_{j<i} [ρ^(w)_ij·(1 − ρ_ij) + ρ_ij·(1 − ρ^(w)_ij)] / 2    (10)

where n is the number of samples. From equation (10) we can see that E(w) decreases as the similarity degree ρ^(w)_ij tends to 0 (if ρ_ij < 0.5) or to 1 (if ρ_ij > 0.5). Therefore we expect that feature-weight learning by minimizing E(w) drives ρ^(w)_pq close to 0 or 1, which can obviously improve the performance of FCM. To minimize the evaluation function E(w), we use the gradient descent technique. First of all, let Δw_k be the change of w_k, computed as follows:
Δw_k = −η·∂E/∂w_k    (11)

where η is the learning rate. An appropriate value of η speeds up the convergence of the algorithm: too small an η leads to low computational efficiency, while too large an η makes the algorithm diverge. Through a one-dimensional search technique [5], η is determined by

E(w_1 − η·∂E/∂w_1, …, w_s − η·∂E/∂w_s) = min_{λ>0} E(w_1 − λ·∂E/∂w_1, …, w_s − λ·∂E/∂w_s)    (12)

The derivative of E(w) is obtained from the following:

∂E/∂w_k = (2/(n(n−1))) Σ_i Σ_{j<i} (1 − 2ρ_ij)·∂ρ^(w)_ij/∂w_k    (13)

∂ρ^(w)_ij/∂d^(w)_ij = −β/(1 + β·d^(w)_ij)²    (14)

∂d^(w)_ij/∂w_k = w_k·(x_ik − x_jk)²/d^(w)_ij    (15)

The algorithm is described briefly as follows:
(1) Initialize all weight values with 1, and solve for β from equation (8);
(2) Compute ρ^(w) by equation (7) and d^(w) by equation (9);
(3) Compute the change Δw_k = −η·∂E/∂w_k using (13)-(15);
(4) If w_k + Δw_k ≥ 0, then set w_k = w_k + Δw_k;
(5) Go to (3) until E(w) is less than a given threshold or the number of iterations reaches a user-specified limit.

4 Enlarging the Margin by Feature Weight Adjustment for Generalization Capability Improvement

In this section, for the purpose of generalization capability improvement, we experimentally demonstrate the enlargement of the margin between the two hyper-planes by the feature weight adjustment described in the previous section. We select two databases for the demonstration. The first is the well-known toy example, the Iris database from UCI, which includes 150 cases in 3 classes. Since the SVM is only for two-class classification
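Steps (1)-(5) can be rendered in code. The sketch below is a hypothetical NumPy implementation of equations (7)-(9) and (13)-(15): β is solved from (8) by bisection, and a fixed learning rate η stands in for the one-dimensional search (12).

```python
import numpy as np

def pairwise_dist(X, w):
    """Weighted Euclidean distance of eq. (9); w = ones gives the plain d."""
    diff = X[:, None, :] - X[None, :, :]                  # (n, n, s)
    return np.sqrt(((w**2) * diff**2).sum(-1))

def solve_beta(d, lo=1e-6, hi=1e6, iters=100):
    """Bisection for beta in eq. (8): mean pairwise similarity = 0.5."""
    iu = np.triu_indices_from(d, k=1)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (1.0 / (1.0 + mid * d[iu])).mean() > 0.5:
            lo = mid                                      # similarities too high
        else:
            hi = mid
    return 0.5 * (lo + hi)

def learn_weights(X, eta=0.1, iters=200, tol=1e-4):
    n, s = X.shape
    w = np.ones(s)                                        # step (1)
    d0 = pairwise_dist(X, w)
    beta = solve_beta(d0)
    rho0 = 1.0 / (1.0 + beta * d0)                        # unweighted rho, eq. (7)
    iu = np.triu_indices(n, k=1)
    for _ in range(iters):
        dw = pairwise_dist(X, w)                          # step (2)
        rho = 1.0 / (1.0 + beta * dw)
        E = (rho * (1 - rho0) + rho0 * (1 - rho))[iu].mean()
        if E < tol:                                       # step (5)
            break
        # chain rule, eqs. (13)-(15)
        dE_drho = 1.0 - 2.0 * rho0                        # eq. (13), up to constants
        drho_dd = -beta / (1.0 + beta * dw)**2            # eq. (14)
        diff2 = (X[:, None, :] - X[None, :, :])**2
        with np.errstate(divide="ignore", invalid="ignore"):
            dd_dw = np.where(dw[..., None] > 0,
                             w * diff2 / dw[..., None], 0.0)  # eq. (15)
        grad = ((dE_drho * drho_dd)[..., None] * dd_dw)[iu].mean(axis=0)
        w_new = w - eta * grad                            # step (3)
        w = np.where(w_new >= 0, w_new, w)                # step (4): keep w >= 0
    return w
```

With a fixed η this loop is only a rough stand-in for the line search (12); in practice η would be tuned per iteration as in [5].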
problems, we delete the cases belonging to class one; the remaining 100 cases in two classes are used for the demonstration. The second database, Pima Indians Diabetes, is also from UCI. The characteristics of the two databases are shown in Table 1. Using the feature weight learning algorithm given at the end of Section 3, we learn the weights for the two selected databases; Table 2 shows the results. Applying an SVM Toolbox to the original data of the two selected databases, one can obtain the optimal separating hyper-planes and the corresponding margins. Due to the limit of paper length, we omit the formulation of hyper-planes and margins; for details, one can refer to [1][2]. Similarly, using the same method, one can evaluate the margins of the separating hyper-planes after the learned weights are incorporated into the original databases. Tables 3 and 4 show the size of the margin of the separating hyper-planes, the training accuracy, and the testing accuracy for the two databases, where 70% of each database is randomly selected as the training set and the remaining 30% as the testing set. It is worth noting that the experimental results depend on the parameters chosen in the SVM Toolbox. From Tables 3 and 4, one can see that the margins are indeed enlarged. However, the improvement in training and testing accuracy is not significant. We speculate that the reasons are that (1) the data is not sufficient for the Iris database and (2) the Pima database is strongly non-linearly separable. Further investigation of the generalization capability improvement is in progress.

Table 1. The characteristics of the databases

Database Name | Number of samples | Number of features | Category of features
Iris          | 100               | 4                  | Numerical
Pima          | 768               | 8                  | Numerical

Table 2. The results of feature-weight learning

Database Name | Results of feature-weight learning
Iris          | 0.0001, 0.0002, 1.0, 0.64
Pima          |

Table 3.
Margin enlargement for the Iris database

Iris database                    | Margin | Training Accuracy | Testing Accuracy
Before feature weight adjustment |        |                   |
After feature weight adjustment  |        |                   |
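The before/after comparison reported in Tables 3 and 4 can be sketched as a small pipeline: multiply each feature by its learned weight, retrain the SVM, and read off the margin 2/‖w‖ in each case. The sketch below is a hypothetical reconstruction using scikit-learn (the paper used a MATLAB SVM Toolbox) with toy data and a hand-picked stand-in weight vector; the paper's 70/30 train/test split is omitted for brevity.

```python
import numpy as np
from sklearn.svm import SVC

def linear_margin(X, y, C=1.0):
    """Train a linear SVM and return the margin 2/||w||."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    return 2.0 / np.linalg.norm(clf.coef_)

# Toy stand-in: feature 0 separates the classes, feature 1 is noise.
rng = np.random.default_rng(0)
X = np.vstack([np.column_stack([rng.normal(-2, 0.3, 60), rng.normal(0, 2, 60)]),
               np.column_stack([rng.normal(+2, 0.3, 60), rng.normal(0, 2, 60)])])
y = np.array([0] * 60 + [1] * 60)

weights = np.array([1.0, 0.1])        # stand-in for learned feature weights
m_before = linear_margin(X, y)        # margin on the original data
m_after = linear_margin(X * weights, y)  # margin on the reweighted data
print(m_before, m_after)
```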
Table 4. Margin enlargement for the Pima database

Pima database                    | Margin | Training Accuracy | Testing Accuracy
Before feature weight adjustment |        |                   |
After feature weight adjustment  |        |                   |

5 Conclusions

According to SLT, enlarging the margin of the separating hyper-plane can enhance the generalization capability of the learning machine. This paper makes an attempt to enlarge the margin through feature weight adjustment, and initial experiments show the approach's effectiveness. We close with the following open questions: (1) Can the feature weight learning approach be mathematically proved to enlarge the margin? (2) Does the approach behave differently on linearly separable and non-linearly separable data sets? (3) When evaluating the margin of the separating hyper-plane, how should the optimal values of the parameters in the quadratic program be chosen?

Acknowledgement. This research is supported by the key project of the Chinese Educational Ministry (0306), the Natural Science Foundation of Hebei Province (60336), and the Doctoral Foundation of the Hebei Educational Committee (B20036).

References
1. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
2. Vapnik, V.N.: Statistical Learning Theory. Wiley-Interscience, New York (1998)
3. Vapnik, V.N.: An Overview of Statistical Learning Theory. IEEE Transactions on Neural Networks 10(5) (1999)
4. Basak, J., De, R.K., Pal, S.K.: Unsupervised Feature Selection Using a Neuro-Fuzzy Approach. Pattern Recognition Letters 19 (1998)
5. Rao: Optimization Theory and Applications. Wiley Eastern Limited (1985)
More informationTraining the linear classifier
215, Training the linear classifier A natural way to train the classifier is to minimize the number of classification errors on the training data, i.e. choosing w so that the training error is minimized.
More informationResearch on Evaluation Method of Organization s Performance Based on Comparative Advantage Characteristics
Vol.1, No.10, Ar 01,.67-7 Research on Evaluation Method of Organization s Performance Based on Comarative Advantage Characteristics WEN Xin 1, JIA Jianfeng and ZHAO Xi nan 3 Abstract It as under the guidance
More informationLanguage Recognition Power of Watson-Crick Automata
01 Third International Conference on Netorking and Computing Language Recognition Poer of Watson-Crick Automata - Multiheads and Sensing - Kunio Aizaa and Masayuki Aoyama Dept. of Mathematics and Computer
More informationOutline of lectures 3-6
GENOME 453 J. Felsenstein Evolutionary Genetics Autumn, 013 Population genetics Outline of lectures 3-6 1. We ant to kno hat theory says about the reproduction of genotypes in a population. This results
More informationBusiness Cycles: The Classical Approach
San Francisco State University ECON 302 Business Cycles: The Classical Approach Introduction Michael Bar Recall from the introduction that the output per capita in the U.S. is groing steady, but there
More informationRandomized Smoothing Networks
Randomized Smoothing Netorks Maurice Herlihy Computer Science Dept., Bron University, Providence, RI, USA Srikanta Tirthapura Dept. of Electrical and Computer Engg., Ioa State University, Ames, IA, USA
More informationEnergy Minimization via a Primal-Dual Algorithm for a Convex Program
Energy Minimization via a Primal-Dual Algorithm for a Convex Program Evripidis Bampis 1,,, Vincent Chau 2,, Dimitrios Letsios 1,2,, Giorgio Lucarelli 1,2,,, and Ioannis Milis 3, 1 LIP6, Université Pierre
More informationCHARACTERIZATION OF ULTRASONIC IMMERSION TRANSDUCERS
CHARACTERIZATION OF ULTRASONIC IMMERSION TRANSDUCERS INTRODUCTION David D. Bennink, Center for NDE Anna L. Pate, Engineering Science and Mechanics Ioa State University Ames, Ioa 50011 In any ultrasonic
More informationMultisurface Proximal Support Vector Machine Classification via Generalized Eigenvalues
Multisurface Proximal Support Vector Machine Classification via Generalized Eigenvalues O. L. Mangasarian and E. W. Wild Presented by: Jun Fang Multisurface Proximal Support Vector Machine Classification
More informationreader is comfortable with the meaning of expressions such as # g, f (g) >, and their
The Abelian Hidden Subgroup Problem Stephen McAdam Department of Mathematics University of Texas at Austin mcadam@math.utexas.edu Introduction: The general Hidden Subgroup Problem is as follos. Let G be
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More information6.036 midterm review. Wednesday, March 18, 15
6.036 midterm review 1 Topics covered supervised learning labels available unsupervised learning no labels available semi-supervised learning some labels available - what algorithms have you learned that
More informationClassification of Text Documents and Extraction of Semantically Related Words using Hierarchical Latent Dirichlet Allocation.
Classification of Text Documents and Extraction of Semantically Related Words using Hierarchical Latent Dirichlet Allocation BY Imane Chatri A thesis submitted to the Concordia Institute for Information
More informationFORCE NATIONAL STANDARDS COMPARISON BETWEEN CENAM/MEXICO AND NIM/CHINA
FORCE NATIONAL STANDARDS COMPARISON BETWEEN CENAM/MEXICO AND NIM/CHINA Qingzhong Li *, Jorge C. Torres G.**, Daniel A. Ramírez A.** *National Institute of Metrology (NIM) Beijing, 100013, China Tel/fax:
More informationThe DMP Inverse for Rectangular Matrices
Filomat 31:19 (2017, 6015 6019 https://doi.org/10.2298/fil1719015m Published by Faculty of Sciences Mathematics, University of Niš, Serbia Available at: http://.pmf.ni.ac.rs/filomat The DMP Inverse for
More informationAnomaly Detection using Support Vector Machine
Anomaly Detection using Support Vector Machine Dharminder Kumar 1, Suman 2, Nutan 3 1 GJUS&T Hisar 2 Research Scholar, GJUS&T Hisar HCE Sonepat 3 HCE Sonepat Abstract: Support vector machine are among
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationGroup-invariant solutions of nonlinear elastodynamic problems of plates and shells *
Group-invariant solutions of nonlinear elastodynamic problems of plates and shells * V. A. Dzhupanov, V. M. Vassilev, P. A. Dzhondzhorov Institute of mechanics, Bulgarian Academy of Sciences, Acad. G.
More informationAnalytical Study of Biometrics Normalization and Fusion Techniques For Designing a Multimodal System
Volume Issue 8, November 4, ISSN No.: 348 89 3 Analytical Study of Biometrics Normalization and Fusion Techniques For Designing a Multimodal System Divya Singhal, Ajay Kumar Yadav M.Tech, EC Department,
More informationClassification of Ordinal Data Using Neural Networks
Classification of Ordinal Data Using Neural Networks Joaquim Pinto da Costa and Jaime S. Cardoso 2 Faculdade Ciências Universidade Porto, Porto, Portugal jpcosta@fc.up.pt 2 Faculdade Engenharia Universidade
More informationBivariate Uniqueness in the Logistic Recursive Distributional Equation
Bivariate Uniqueness in the Logistic Recursive Distributional Equation Antar Bandyopadhyay Technical Report # 629 University of California Department of Statistics 367 Evans Hall # 3860 Berkeley CA 94720-3860
More informationIntroduction To Resonant. Circuits. Resonance in series & parallel RLC circuits
Introduction To esonant Circuits esonance in series & parallel C circuits Basic Electrical Engineering (EE-0) esonance In Electric Circuits Any passive electric circuit ill resonate if it has an inductor
More informationGRADIENT DESCENT. CSE 559A: Computer Vision GRADIENT DESCENT GRADIENT DESCENT [0, 1] Pr(y = 1) w T x. 1 f (x; θ) = 1 f (x; θ) = exp( w T x)
0 x x x CSE 559A: Computer Vision For Binary Classification: [0, ] f (x; ) = σ( x) = exp( x) + exp( x) Output is interpreted as probability Pr(y = ) x are the log-odds. Fall 207: -R: :30-pm @ Lopata 0
More informationMachine Learning 4771
Machine Learning 477 Instructor: Tony Jebara Topic 5 Generalization Guarantees VC-Dimension Nearest Neighbor Classification (infinite VC dimension) Structural Risk Minimization Support Vector Machines
More informationAn Investigation of the use of Spatial Derivatives in Active Structural Acoustic Control
An Investigation of the use of Spatial Derivatives in Active Structural Acoustic Control Brigham Young University Abstract-- A ne parameter as recently developed by Jeffery M. Fisher (M.S.) for use in
More informationMMSE Equalizer Design
MMSE Equalizer Design Phil Schniter March 6, 2008 [k] a[m] P a [k] g[k] m[k] h[k] + ṽ[k] q[k] y [k] P y[m] For a trivial channel (i.e., h[k] = δ[k]), e kno that the use of square-root raisedcosine (SRRC)
More informationA proof of topological completeness for S4 in (0,1)
A proof of topological completeness for S4 in (,) Grigori Mints and Ting Zhang 2 Philosophy Department, Stanford University mints@csli.stanford.edu 2 Computer Science Department, Stanford University tingz@cs.stanford.edu
More informationImproving Time Efficiency of Feedforward Neural Network Learning
Improving Time Efficiency of Feedforard Neural Netork Learning by Batsukh Batbayar Bachelor of Engineering, National University of Mongolia Master of Physics, National University of Mongolia A thesis submitted
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationComputation of turbulent natural convection at vertical walls using new wall functions
Computation of turbulent natural convection at vertical alls using ne all functions M. Hölling, H. Herig Institute of Thermo-Fluid Dynamics Hamburg University of Technology Denickestraße 17, 2173 Hamburg,
More informationUnsupervised Learning with Permuted Data
Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University
More information