Kobe University Repository : Kernel

Size: px

Start display at page:

Download "Kobe University Repository : Kernel"

Hannah Hamilton
5 years ago
Views:

Shigeo Neural Networks for Signal Processing, 00.

1 Kobe University Repository : Kernel タイトル Title 著者 Author(s) 掲載誌巻号ページ Citation 刊行日 Issue date 資源タイプ Resource Type 版区分 Resource Version 権利 Rights DOI JaLCDOI URL Analysis of support vector machines Abe, Shigeo Neural Networks for Signal Processing, 00. Proceedings of the 00 th IEEE Workshop on,: Conference Paper / 会議発表論文 author 0.09/NNSP PDF issue:

2 ANALYSIS OF SUPPORT VECTOR MACHINES Shigeo Abe Graduate School of Science and Technology Kobe University, Kobe, Japan Phone/Fax: Web: english.html Abstract. We compare L and L soft margin support vector machines from the standpoint of positive definiteness, the number of support vectors, and uniqueness and degeneracy of solutions. Since the Hessian matrix of L SVMs is positive definite, the number of support vectors for L SVMs is larger than or equal to the number of L SVMs. For L SVMs, if there are plural irreducible sets of support vectors, the solution of the dual problem is non-unique although the primal problem is unique. Similar to L SVMs, degenerate solutions, in which all the data are classified into one class, occur for L SVMs. INTRODUCTION As opposed to L soft margin support vector machines (L SVMs), L soft margin support vector machines (L SVMs) are widely used for pattern classification and function approximation. And much effort has been done to clarify the properties of L SVMs [,, 3, 4]. Pontil and Verri [] clarified dependence of the L SVM solutions on the margin parameter C. Rifkin, Pontil, and Verri [] showed degeneracy of L SVM solutions, in which any data are classified into one class. Fernández [3] also proved the existence of degeneracy without mentioning it. Burges [4] discussed non-uniqueness of L SVM primal solutions, and uniqueness of L SVM solutions. But except for [4], little comparison has been made between L and L SVMs [5]. In this paper, we compare L SVMs with L SVMs from the standpoint of positive definiteness, the number of support vectors, and uniqueness and degeneracy of solutions. Since the Hessian matrix of L SVMs is positive definite, the solutions are unique. For the L SVMs, we introduce the concept of irreducible set of support vectors and show that if there are plural irreducible sets, the dual solutions are non-unique. Finally, we show that L SVMs have degenerate solutions similar to L SVMs. In the following, first we summarize L and L SVMs and discuss the Hessian matrices for L and L SVMs. Then we discuss non-uniqueness of

3 L SVM dual solutions. Finally, we prove the existence of degeneracy for L SVMs. SOFT MARGIN SUPPORT VECTOR MACHINES In soft margin support vector machines, we consider the linear decision function D(x) =w t g(x)+b () in the feature space, where w is the weight vector, g(x) is the mapping function that maps the m-dimensional input x into the l-dimensional feature space, and b is a scalar. We determine the decision function so that the classification error for the training data and unknown data is minimized. This can be achieved by minimizing w + C ξ p i i = () subject to the constraints y i (w t g(x i )+b) ξ i for i =,...,M, (3) where ξ i are the positive slack variables associated with the training data x i, M is the number of training data, y i are the class labels ( or ) for x i, C is the margin parameter, and p is either or. When p =, we call the support vector machine L soft margin support vector machine (L SVM) and when p =, L soft margin support vector machine (L SVM). The dual problem for the L SVM is to maximize Q(α) = α i α i α j y i y j H(x i, x j ) (4) i = i,j= subject to the constraints y i α i =0, 0 α i C, (5) i = where α i are the Lagrange multipliers associated with the training data x i and H(x, x )=g(x) t g(x) is the kernel function. The dual problem for the L SVM is to maximize Q(α) = α i ( α i α j y i y j H(x i, x j )+ δ ) ij (6) C i = i,j= subject to the constraints y i α i =0, α i 0. (7) i = Let the solution of (4) and (5), or (6) and (7) be α i (i =,...,M).

4 HESSIAN MATRIX Rewriting (4) for the L SVM using the mapping function g(x), we have Q(α) = ( α i M ) t M α i y i g(x i ) α i y i g(x i ). (8) i = Solving (5) for α s (s {,...,M}), Substituting (9) into (8), we obtain M α s = y s y i α i. (9) i s Q(α) = ( y s y i )α i i s α i y i (g(x i ) g(x s )) i s t α i y i (g(x i ) g(x s )). (0) i s Thus the Hessian matrix of Q(α), H L, which is an (M ) (M ) matrix, is given by H L = Q(α) α = ( y i (g(x i ) g(x s )) )t ( yj (g(x j ) g(x s )) ), () where α is obtained by deleting α s from α. Since H L is expressed by the product of a transposed matrix and the matrix, H L is positive semidefinite. Let N g be the maximum number of independent vectors among {g(x i ) g(x s ) i {i,...,m,i s}. Since N g does not exceeds the dimension of the feature space, l, the rank of H L is given by [6, pp. 3 3] min(l, N g ). () Therefore, if M>(l + ), H L is positive semi-definite. The Hessian matrix H L, in which one variable is eliminated, for the L SVM is expressed by { } yi y j + δ ij H L = H L +. (3) C Because of the term M α i /C in Q(α), H L is positive definite. Thus, unlike H L, H L is positive definite irrespective of the dimension of the feature space.

5 y y 0 x 0 x (a) (b) Figure : Convex functions. (a) Strictly convex function. (b) Convex function Table : CONVEXITY OF OBJECTIVE FUNCTIONS Hard Margin L Soft Margin L Soft Margin Primal Strictly Convex Convex Strictly Convex (w,b) (w,b,ξ i ) (w,b,ξ i ) Dual Convex Convex Strictly Convex (α i ) (α i ) (α i ) NON-UNIQUE SOLUTIONS If a convex function gives a minimum or maximum at a point not in an interval, the function is called strictly convex. In general, if the objective function of a quadratic programming problem constrained in a convex set is strictly convex or the associated Hessian matrix is positive (negative) definite, the solution is unique. And if the objective function is convex, there may be cases where the solution is non-unique (see Fig. ). Convexity of objective functions for different support vector machine architectures is summarized in Table. The symbols in the brackets show the variables. We must notice that since b is not included in the dual problem, even if the solution of the dual problem is unique, the solution of the primal problem may not be unique [4]. Assume that the hard margin SVM has a solution, i.e., the given problem is separable in the feature space. Then, since the objective function of the primal problem is w /, which is strictly convex, the primal problem has a unique solution for w and b. But the dual solution may be non-unique because the Hessian matrix is positive semi-definite. Since the hard margin SVM for a separable problem is equivalent to the L SVM with an unbounded solution, we leave the discussion to that for the L SVM. The objective function of the primal problem for the L SVM is strictly convex. Therefore, w and b are uniquely determined if we solve the primal problem. In addition, since the Hessian matrix of the dual objective function

6 x 3 0 x Figure : An example of a non-support vector is positive definite, α i are uniquely determined. And because of the uniqueness of the primal problem, b is determined uniquely using the Kuhn-Tucker condition. The L SVM includes the linear sum of ξ i. Therefore, the primal objective function is convex. Likewise, the Hessian matrix of the dual objective function is positive semi-definite. Thus the primal and dual solutions may be non-unique. Theorem For the L SVM, vectors that satisfy y i (w t g(x i )+b) = are not always support vectors. Proof. Consider the two-dimensional case shown in Fig.. In the figure, x belongs to Class, x and x belong to Class, and x x and x 3 x are orthogonal. The dual problem with the dot product kernel is given as follows. Maximize subject to Q(α) = α + α + α 3 (α x α x α 3 x 3 ) t (α x α x α 3 x 3 ) (4) α α α 3 =0, C α i 0, i =,, 3. (5) Substituting α 3 = α α and α = aα (a 0) into (4), we obtain Defining Q(α) = α (6) becomes α (x x 3 a(x x 3 )) t (x x 3 a(x x 3 )). (6) d (a) =(x x 3 a(x x 3 )) t (x x 3 a(x x 3 )), (7) Q(α) =α α d (a). (8)

7 When C d (a), (9) Q(α) is maximized at α =/d (a) and takes the maximum ( ) Q d = (a) d (a). (0) Since x x and x 3 x are orthogonal, d(a) is minimized at a =. Thus Q(/d (a)) is maximized at a =. Namely, α = α =/d (a) and α 3 =0. Since y 3 (w t x 3 + b) =, the theorem is proved. Definition For L SVMs a set of support vectors is irreducible if deletion of all the non-support vectors that satisfy y i (w t g(x i )+b) =, and any support vector results in the change of the optimal hyperplane. It is reducible if the optimal hyperplane does not change for deletion of all the non-support vectors that satisfy y i (w t g(x i )+b) =, and some support vectors. Deletion of non-support vectors from the training data set does not change the solution. In the following theorem, the Hessian matrix associated with a set of support vectors means that the Hessian matrix is calculated for the support vectors, not the entire training data. Theorem For L SVMs, let all the support vectors be unbounded. Then the Hessian matrix associated with an irreducible set of support vectors is positive definite and the Hessian matrix associated with a reducible set of support vectors is positive semi-definite. Proof. Let the set of support vectors be irreducible. Then, since deletion of any support vector results in the change of the optimal hyperplane, any g(x i ) g(x s ) cannot be expressed by the remaining g(x j ) g(x s ). Thus the associated Hessian matrix is positive definite. If the set of support vectors is reducible, deletion of some support vector, e.g., x i does not cause the change of the optimal hyperplane. This means that g(x i ) g(x s ) is expressed by the linear sum of the remaining g(x j ) g(x s ). Thus the associated Hessian matrix is positive semi-definite. Theorem 3 For the L SVM, if there is only one irreducible set of support vectors and the support vectors are all unbounded, the solution is unique. Proof. Delete the non-support vectors from the training data. Then since the set of support vectors is irreducible the associated Hessian matrix is positive definite. Thus, the solution is unique for the irreducible set. Since there is only one irreducible set, the solution is unique for the given problem. Example Consider the two-dimensional case shown in Fig. 3, in which x and x belong to Class and x 3 and x 4 belong to Class. Since x, x, x 3,

8 x 4 0 x 3 Figure 3: Non-unique solutions and x 4 form a rectangle, {x, x 3 } and {x, x 4 } are irreducible sets of support vectors for the dot product kernel. The training is to maximize Q(α) =α + α + α 3 + α 4 ( (α + α 4 ) +(α + α 3 ) ) () subject to α + α = α 3 + α 4, C α i 0, i =,...,4. () For C, (α,α,α 3,α 4 ) = (, 0,, 0) and (0,, 0, ) are two solutions. Thus, (α,α,α 3,α 4 )=(β, β,β, β), (3) where 0 β, is also a solution. Then, (α,α,α 3,α 4 )=(0.5, 0.5, 0.5, 0.5) is a solution. For the L SVM, the objective function becomes Q(α) = α + α + α 3 + α 4 ( (α + α 4 ) +(α + α 3 ) + α + α + α 3 + α 4 C Then for α i =/( + /C) (i =,...,4), (4) becomes Q(α) = ). (4) +. (5) C For α = α 3 =/( + /C) and α = α 4 = 0, (4) becomes Q(α) = +. (6) C Thus, for C>0, Q(α) given by (6) is smaller than that by (5). Therefore, α = α 3 =/( + /C) and α = α 4 =0orα = α 4 =/( + /C) and α = α 3 = 0 are not optimal, but α i =/( + /C) (i =,...,4) are.

9 In general, the number of support vectors for L SVMs is larger than that for L SVMs. The above example shows non-uniqueness of the dual problem but the primal problem is unique since there are unbounded support vectors. Nonunique solutions occur when there are no unbounded support vectors. Burges and Crisp [4] derive conditions in which the dual solution is unique but the primal solution is non-unique. DEGENERATE SOLUTIONS Rifkin, Pontil, and Verri [] discuss degenerate solutions, in which w = 0, for L SVMs. Fernández [3] derived similar results for L SVMs, although he did not refer to degeneracy. Degeneracy occurs also for L SVMs. In the following we discuss degenerate solutions following the proof of [3]. Theorem 4 Let C = KC 0, where K and C 0 are positive parameters, and α be a solution of the L support vector machine with K =. Define w(α) = α i y i g(x). (7) Then the necessary and sufficient condition for is that Kα is also a solution for any K (> ). w(α )=0 (8) Proof. We prove the theorem for L SVMs. The proof for L SVMs is given by deleting α t α/(c) in the following proof. Necessary condition. Let α be the optimal solution for K (K >) and w(α ) 0. Then, there is α such that α = Kα. Since α satisfies the equality constraint for K =, it is a non-optimal solution. Then for K =, Q(α ) = For K (K >), Q(α )= Q(Kα ) = K = K α i α t α C 0 α i w(α ) t w(α ) α t α C 0. (9) α i Kα t α C 0 Q(Kα ) α i K w(α ) t w(α ) Kα t α. (30) C 0

10 Class Class Class 0 x x x 3 x Figure 4: Non-separable one-dimensional case Multiplying K to all the terms in (9) and comparing it with (30), we see the contradiction. Thus, Kα is the optimal solution for K>. Sufficient condition. Suppose Kα is the optimal solution for any K ( ). Thus for any K ( ) Q(Kα ) = Kα i K w(α ) t w(α ) Kα t α Q(α )= Rewriting (3), we have C 0 α i w(α ) t w(α ) α t α. (3) C 0 α i K + w(α ) t w(α )+ α t α. (3) C 0 Since (3) is satisfied as K approaches infinity, w(α )=0 must be satisfied. Otherwise, Kα cannot be the optimal solution. Example Consider the case shown in Fig. 4. Here, we use the dot product kernel. The inequality constraints are w + b ξ, (33) b ξ, (34) w + b ξ 3. (35) The dual problem for the L SVM is given as follows:namely maximize subject to Q(α) =α + α + α 3 ( α + α 3 ) (36) α α + α 3 =0, (37) C α i 0 for i =,, 3. (38) From (37), α = α + α 3. Then substituting it into (36), we obtain Q(α) = α +α 3 ( α + α 3 ), (39) C α i 0 for i =,, 3. (40)

11 Equation (39) is maximized when α = α 3. Thus the optimal solution is given by α = C, α = C, α 3 = C. (4) Therefore, x =, 0, and are support vectors. Thus from (7), w =0. Substituting w = ξ = ξ 3 = 0 into (33) and (35) and taking the equalities, we obtain b =. The dual objective function for the L SVM is given by Q(α) =α + α + α 3 ( α + α 3 ) α + α + α 3 C The objective function is maximized when (4) α = α = C 3, α = 4C 3. (43) Thus, w = 0 and b =/3. Therefore, any datum is classified into Class. CONCLUSIONS In this paper, we compared L and L support vector machines from the standpoint of uniqueness and degeneracy of solutions. We introduced the concept of irreducible set of support vectors and clarified the condition for non-uniqueness of L SVM dual solutions. We also proved that degeneracy of L SVM solutions occurs. REFERENCES [] M. Pontil and A. Verri, Properties of support vector machines, Neural Computation, Vol. 0, No. 4, pp , 998. [] R. Rifkin, M. Pontil, and A. Verri, A note on support vector machine degeneracy, Proceedings of 0th International Conference on Algorithmic Learning Theory, (ALT 99). (Lecture Notes in Artificial Intelligence Vol. 70), pp. 5 63, 999. [3] R. Fernández, Behavior of the weights of a support vector machine as a function of the regularization parameter c, Proceedings of the 8th International Conference on Artificial Neural Networks (ICANN 98), Vol., pp. 97 9, 998. [4] C. J. C. Burges and D. J. Crisp, Uniqueness of the SVM solution, In S. A. Solla, T. K. Leen, and K.-R. Müller, Eds., Advances in Neural Information Processing Systems, pp. 3 9, The MIT Press, 000. [5] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, UK, 000. [6] S. Abe, Pattern Classification: Neuro-fuzzy Methods and Their Comparison, Springer-Verlag, London, UK, 00.

Kobe University Repository : Kernel

Kobe University Repository : Kernel タイトル Title 著者 Author(s) 掲載誌巻号ページ Citation 刊行日 Issue date 資源タイプ Resource Type 版区分 Resource Version 権利 Rights DOI JaLCDOI URL Comparison between error correcting output