
Kobe University Repository : Kernel

Title: Analysis of support vector machines
Author: Abe, Shigeo
Citation: Neural Networks for Signal Processing: Proceedings of the 2002 12th IEEE Workshop, pp. 89-98, 2002-09
Resource Type: Conference Paper
Resource Version: author
URL: http://www.lib.kobe-u.ac.jp/handle_kernel/900003

ANALYSIS OF SUPPORT VECTOR MACHINES

Shigeo Abe
Graduate School of Science and Technology
Kobe University, Kobe, Japan
Phone/Fax: +81 78 803 60
E-mail: abe@eedept.kobe-u.ac.jp
Web: www.eedept.kobe-u.ac.jp/eedept english.html

Abstract. We compare L1 and L2 soft margin support vector machines from the standpoint of positive definiteness, the number of support vectors, and uniqueness and degeneracy of solutions. Since the Hessian matrix of L2 SVMs is positive definite, the number of support vectors for L2 SVMs is larger than or equal to that for L1 SVMs. For L1 SVMs, if there are plural irreducible sets of support vectors, the solution of the dual problem is non-unique, although that of the primal problem is unique. Similar to L1 SVMs, degenerate solutions, in which all the data are classified into one class, occur for L2 SVMs.

INTRODUCTION

As opposed to L2 soft margin support vector machines (L2 SVMs), L1 soft margin support vector machines (L1 SVMs) are widely used for pattern classification and function approximation, and much effort has been devoted to clarifying the properties of L1 SVMs [1, 2, 3, 4]. Pontil and Verri [1] clarified the dependence of L1 SVM solutions on the margin parameter C. Rifkin, Pontil, and Verri [2] showed degeneracy of L1 SVM solutions, in which all the data are classified into one class. Fernández [3] also proved the existence of degeneracy without mentioning it. Burges and Crisp [4] discussed non-uniqueness of L1 SVM primal solutions and uniqueness of L2 SVM solutions. But except for [4], little comparison has been made between L1 and L2 SVMs [5].

In this paper, we compare L1 SVMs with L2 SVMs from the standpoint of positive definiteness, the number of support vectors, and uniqueness and degeneracy of solutions. Since the Hessian matrix of L2 SVMs is positive definite, the solutions are unique. For L1 SVMs, we introduce the concept of an irreducible set of support vectors and show that if there are plural irreducible sets, the dual solutions are non-unique. Finally, we show that L2 SVMs have degenerate solutions similar to L1 SVMs.

In the following, we first summarize L1 and L2 SVMs and discuss the Hessian matrices for L1 and L2 SVMs. Then we discuss non-uniqueness of L1 SVM dual solutions. Finally, we prove the existence of degeneracy for L2 SVMs.

SOFT MARGIN SUPPORT VECTOR MACHINES

In soft margin support vector machines, we consider the linear decision function

    D(x) = w^t g(x) + b    (1)

in the feature space, where w is the weight vector, g(x) is the mapping function that maps the m-dimensional input x into the l-dimensional feature space, and b is a scalar. We determine the decision function so that the classification error for the training data and unknown data is minimized. This can be achieved by minimizing

    (1/2) ||w||^2 + (C/p) Σ_{i=1}^M ξ_i^p    (2)

subject to the constraints

    y_i (w^t g(x_i) + b) ≥ 1 − ξ_i  for i = 1, ..., M,    (3)

where ξ_i are the nonnegative slack variables associated with the training data x_i, M is the number of training data, y_i are the class labels (1 or −1) for x_i, C is the margin parameter, and p is either 1 or 2. When p = 1, we call the support vector machine the L1 soft margin support vector machine (L1 SVM), and when p = 2, the L2 soft margin support vector machine (L2 SVM).

The dual problem for the L1 SVM is to maximize

    Q(α) = Σ_{i=1}^M α_i − (1/2) Σ_{i,j=1}^M α_i α_j y_i y_j H(x_i, x_j)    (4)

subject to the constraints

    Σ_{i=1}^M y_i α_i = 0,  0 ≤ α_i ≤ C,    (5)

where α_i are the Lagrange multipliers associated with the training data x_i and H(x, x') = g(x)^t g(x') is the kernel function. The dual problem for the L2 SVM is to maximize

    Q(α) = Σ_{i=1}^M α_i − (1/2) Σ_{i,j=1}^M α_i α_j y_i y_j (H(x_i, x_j) + δ_ij / C)    (6)

subject to the constraints

    Σ_{i=1}^M y_i α_i = 0,  α_i ≥ 0.    (7)

Let the solution of (4) and (5), or of (6) and (7), be α_i (i = 1, ..., M).
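The two dual problems (4)-(7) can be checked numerically. The following sketch, which is not part of the original formulation, sets them up for a small artificial data set with the dot-product kernel H(x, x') = x^t x'; it assumes the numpy and cvxpy packages, and the data, the value of C, and the helper name solve_dual are illustrative only.

    # Sketch: solving the L1 and L2 SVM duals (4)-(7) for a toy two-dimensional
    # problem with the dot-product kernel. Data and parameter values are arbitrary.
    import numpy as np
    import cvxpy as cp

    X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])  # training data
    y = np.array([1.0, 1.0, -1.0, -1.0])                                # class labels
    C = 10.0                                                            # margin parameter
    Xy = X * y[:, None]        # rows are y_i x_i, so the quadratic term is ||Xy^t alpha||^2 / 2

    def solve_dual(p):
        """Maximize the dual objective (4) for p = 1 or (6) for p = 2."""
        a = cp.Variable(len(y))
        quad = 0.5 * cp.sum_squares(Xy.T @ a)            # (1/2) sum_ij a_i a_j y_i y_j x_i^t x_j
        if p == 2:
            quad += cp.sum_squares(a) / (2.0 * C)        # extra delta_ij / C term of (6)
        cons = [a >= 0, cp.sum(cp.multiply(y, a)) == 0]  # constraints common to (5) and (7)
        if p == 1:
            cons.append(a <= C)                          # upper bound appears only in (5)
        prob = cp.Problem(cp.Maximize(cp.sum(a) - quad), cons)
        prob.solve()
        return a.value

    print("L1 SVM alpha:", np.round(solve_dual(1), 4))
    print("L2 SVM alpha:", np.round(solve_dual(2), 4))

Writing the quadratic term through the mapped vectors y_i x_i keeps the problem in a form the solver recognizes as concave; for a nonlinear kernel one would work with the kernel matrix instead.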

HESSIAN MATRIX

Rewriting (4) for the L1 SVM using the mapping function g(x), we have

    Q(α) = Σ_{i=1}^M α_i − (1/2) (Σ_{i=1}^M α_i y_i g(x_i))^t (Σ_{i=1}^M α_i y_i g(x_i)).    (8)

Solving (5) for α_s (s ∈ {1, ..., M}),

    α_s = −y_s Σ_{i≠s} y_i α_i.    (9)

Substituting (9) into (8), we obtain

    Q(α) = Σ_{i≠s} (1 − y_s y_i) α_i − (1/2) (Σ_{i≠s} α_i y_i (g(x_i) − g(x_s)))^t (Σ_{i≠s} α_i y_i (g(x_i) − g(x_s))).    (10)

Thus the negative of the Hessian matrix of Q(α), which we denote by H_L1 and which is an (M − 1) × (M − 1) matrix, is given by

    H_L1 = −∂²Q(α)/∂α'² = ( ... y_i (g(x_i) − g(x_s)) ... )^t ( ... y_j (g(x_j) − g(x_s)) ... ),    (11)

where α' is obtained by deleting α_s from α. Since H_L1 is expressed as the product of a matrix transposed and the matrix itself, H_L1 is positive semi-definite. Let N_g be the maximum number of independent vectors among {g(x_i) − g(x_s) | i ∈ {1, ..., M}, i ≠ s}. Since N_g does not exceed the dimension l of the feature space, the rank of H_L1 is given by [6]

    min(l, N_g).    (12)

Therefore, if M > l + 1, H_L1 is only positive semi-definite, not positive definite.

The matrix H_L2 for the L2 SVM, in which one variable is eliminated in the same way, is expressed by

    H_L2 = H_L1 + {(y_i y_j + δ_ij) / C},    (13)

where {a_ij} denotes the matrix whose (i, j) element is a_ij (i, j ≠ s). Because of the term −Σ_{i=1}^M α_i² / (2C) in Q(α), H_L2 is positive definite. Thus, unlike H_L1, H_L2 is positive definite irrespective of the dimension of the feature space.
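The definiteness statements above are easy to verify numerically. The sketch below, again only an illustration and not part of the original paper, forms the reduced matrices of (11) and (13) for an artificial data set with the dot-product kernel (so g is the identity mapping) and inspects their eigenvalues; numpy is assumed, and the data, C, and the eliminated index s are arbitrary choices.

    # Sketch: reduced matrices H_L1 of (11) and H_L2 of (13) for toy data,
    # with definiteness checked through the eigenvalues.
    import numpy as np

    X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [2.0, 2.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
    C, s = 10.0, 0                       # margin parameter; index of the eliminated alpha_s

    g = X                                # identity mapping g(x) = x (dot-product kernel)
    idx = [i for i in range(len(y)) if i != s]
    D = np.array([y[i] * (g[i] - g[s]) for i in idx])   # rows y_i (g(x_i) - g(x_s))

    H_L1 = D @ D.T                                                     # (11): positive semi-definite
    H_L2 = H_L1 + (np.outer(y[idx], y[idx]) + np.eye(len(idx))) / C    # (13): positive definite

    print("eigenvalues of H_L1:", np.round(np.linalg.eigvalsh(H_L1), 6))
    print("eigenvalues of H_L2:", np.round(np.linalg.eigvalsh(H_L2), 6))
    # With M - 1 = 4 > l = 2, H_L1 has zero eigenvalues, while H_L2 has none.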

Figure 1: Convex functions. (a) Strictly convex function. (b) Convex function.

Table 1: Convexity of objective functions

                 Hard margin        L1 soft margin      L2 soft margin
    Primal       Strictly convex    Convex              Strictly convex
                 (w, b)             (w, b, ξ_i)         (w, b, ξ_i)
    Dual         Convex             Convex              Strictly convex
                 (α_i)              (α_i)               (α_i)

NON-UNIQUE SOLUTIONS

If a convex (concave) function attains its minimum (maximum) at a single point, not on an interval, the function is called strictly convex (concave). In general, if the objective function of a quadratic programming problem constrained on a convex set is strictly convex, or the associated Hessian matrix is positive (negative) definite, the solution is unique. If the objective function is merely convex, there may be cases where the solution is non-unique (see Fig. 1). Convexity of the objective functions for the different support vector machine architectures is summarized in Table 1; the symbols in brackets show the variables. We must notice that since b is not included in the dual problem, even if the solution of the dual problem is unique, the solution of the primal problem may not be unique [4].

Assume that the hard margin SVM has a solution, i.e., the given problem is separable in the feature space. Then, since the objective function of the primal problem is ||w||^2 / 2, which is strictly convex, the primal problem has a unique solution for w and b. But the dual solution may be non-unique because the Hessian matrix is positive semi-definite. Since the hard margin SVM for a separable problem is equivalent to the L1 SVM with an unbounded solution, we leave the discussion to that for the L1 SVM.

The objective function of the primal problem for the L2 SVM is strictly convex. Therefore, w and b are uniquely determined if we solve the primal problem. In addition, since the Hessian matrix of the dual objective function is positive definite, the α_i are uniquely determined, and because of the uniqueness of the primal problem, b is determined uniquely using the Kuhn-Tucker conditions.

The L1 SVM includes the linear sum of the ξ_i. Therefore, the primal objective function is convex. Likewise, the Hessian matrix of the dual objective function is positive semi-definite. Thus the primal and dual solutions may be non-unique.

Figure 2: An example of a non-support vector.

Theorem 1. For the L1 SVM, vectors that satisfy y_i (w^t g(x_i) + b) = 1 are not always support vectors.

Proof. Consider the two-dimensional case shown in Fig. 2. In the figure, x_1 belongs to Class 1, x_2 and x_3 belong to Class 2, and x_1 − x_2 and x_3 − x_2 are orthogonal. The dual problem with the dot product kernel is given as follows. Maximize

    Q(α) = α_1 + α_2 + α_3 − (1/2) (α_1 x_1 − α_2 x_2 − α_3 x_3)^t (α_1 x_1 − α_2 x_2 − α_3 x_3)    (14)

subject to

    α_1 − α_2 − α_3 = 0,  C ≥ α_i ≥ 0,  i = 1, 2, 3.    (15)

Substituting α_3 = α_1 − α_2 and α_2 = a α_1 (a ≥ 0) into (14), we obtain

    Q(α) = 2 α_1 − (1/2) α_1² (x_1 − x_3 − a (x_2 − x_3))^t (x_1 − x_3 − a (x_2 − x_3)).    (16)

Defining

    d²(a) = (x_1 − x_3 − a (x_2 − x_3))^t (x_1 − x_3 − a (x_2 − x_3)),    (17)

(16) becomes

    Q(α) = 2 α_1 − (1/2) α_1² d²(a).    (18)

When

    C ≥ 2 / d²(a),    (19)

Q(α) is maximized at α_1 = 2/d²(a) and takes the maximum

    Q(2/d²(a)) = 2 / d²(a).    (20)

Since x_1 − x_2 and x_3 − x_2 are orthogonal, d(a) is minimized at a = 1. Thus Q(2/d²(a)) is maximized at a = 1, namely at α_1 = α_2 = 2/d²(1) and α_3 = 0. Since y_3 (w^t x_3 + b) = 1 while α_3 = 0, the theorem is proved.

Definition 1. For L1 SVMs, a set of support vectors is irreducible if deletion of all the non-support vectors that satisfy y_i (w^t g(x_i) + b) = 1 and of any support vector results in a change of the optimal hyperplane. It is reducible if the optimal hyperplane does not change after deletion of all the non-support vectors that satisfy y_i (w^t g(x_i) + b) = 1 and of some support vectors.

Deletion of non-support vectors from the training data set does not change the solution. In the following theorem, the Hessian matrix associated with a set of support vectors means the Hessian matrix calculated for the support vectors, not for the entire training data.

Theorem 2. For L1 SVMs, let all the support vectors be unbounded. Then the Hessian matrix associated with an irreducible set of support vectors is positive definite, and the Hessian matrix associated with a reducible set of support vectors is positive semi-definite.

Proof. Let the set of support vectors be irreducible. Then, since deletion of any support vector results in a change of the optimal hyperplane, no g(x_i) − g(x_s) can be expressed by the remaining g(x_j) − g(x_s). Thus the associated Hessian matrix is positive definite. If the set of support vectors is reducible, deletion of some support vector, e.g., x_i, does not change the optimal hyperplane. This means that g(x_i) − g(x_s) is expressed as a linear sum of the remaining g(x_j) − g(x_s). Thus the associated Hessian matrix is positive semi-definite.

Theorem 3. For the L1 SVM, if there is only one irreducible set of support vectors and the support vectors are all unbounded, the solution is unique.

Proof. Delete the non-support vectors from the training data. Then, since the set of support vectors is irreducible, the associated Hessian matrix is positive definite. Thus, the solution is unique for the irreducible set. Since there is only one irreducible set, the solution is unique for the given problem.
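Returning to Theorem 1, the configuration used in its proof can also be checked numerically. The following sketch, assuming numpy and cvxpy and using one illustrative choice of points with x_1 − x_2 orthogonal to x_3 − x_2 (none of the numbers below appear in the original paper), solves the L1 dual (4)-(5) and shows that α_3 vanishes although y_3 (w^t x_3 + b) = 1.

    # Sketch: numerical check of the configuration in the proof of Theorem 1.
    import numpy as np
    import cvxpy as cp

    X = np.array([[0.0, 2.0],    # x1, Class 1
                  [0.0, 0.0],    # x2, Class 2
                  [3.0, 0.0]])   # x3, Class 2, with (x1 - x2) orthogonal to (x3 - x2)
    y = np.array([1.0, -1.0, -1.0])
    C = 10.0
    Xy = X * y[:, None]

    a = cp.Variable(3)
    obj = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(Xy.T @ a))   # L1 dual (4)
    prob = cp.Problem(obj, [a >= 0, a <= C, cp.sum(cp.multiply(y, a)) == 0])
    prob.solve()

    alpha = a.value
    w = Xy.T @ alpha                      # w = sum_i alpha_i y_i x_i
    b = y[0] - w @ X[0]                   # from the unbounded support vector x1
    print("alpha =", np.round(alpha, 4))  # roughly (0.5, 0.5, 0) for these points
    print("y3 * D(x3) =", round(y[2] * (w @ X[2] + b), 4))  # equals 1 although alpha_3 = 0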

Figure 3: Non-unique solutions.

Example 1. Consider the two-dimensional case shown in Fig. 3, in which x_1 and x_2 belong to Class 1 and x_3 and x_4 belong to Class 2. Since x_1, x_2, x_3, and x_4 form a rectangle, {x_1, x_3} and {x_2, x_4} are irreducible sets of support vectors for the dot product kernel. The training is to maximize

    Q(α) = α_1 + α_2 + α_3 + α_4 − (1/2) ((α_1 + α_4)² + (α_2 + α_3)²)    (21)

subject to

    α_1 + α_2 = α_3 + α_4,  C ≥ α_i ≥ 0,  i = 1, ..., 4.    (22)

For C ≥ 1, (α_1, α_2, α_3, α_4) = (1, 0, 1, 0) and (0, 1, 0, 1) are two solutions. Thus,

    (α_1, α_2, α_3, α_4) = (β, 1 − β, β, 1 − β),    (23)

where 0 ≤ β ≤ 1, is also a solution. In particular, (α_1, α_2, α_3, α_4) = (0.5, 0.5, 0.5, 0.5) is a solution.

For the L2 SVM, the objective function becomes

    Q(α) = α_1 + α_2 + α_3 + α_4 − (1/2) ((α_1 + α_4)² + (α_2 + α_3)² + (α_1² + α_2² + α_3² + α_4²)/C).    (24)

Then for α_i = 1/(2 + 1/C) (i = 1, ..., 4), (24) becomes

    Q(α) = 2 / (2 + 1/C).    (25)

For α_1 = α_3 = 1/(1 + 1/C) and α_2 = α_4 = 0, (24) becomes

    Q(α) = 1 / (1 + 1/C).    (26)

Thus, for C > 0, Q(α) given by (26) is smaller than that given by (25). Therefore, α_1 = α_3 = 1/(1 + 1/C) with α_2 = α_4 = 0, or α_2 = α_4 = 1/(1 + 1/C) with α_1 = α_3 = 0, is not optimal, but α_i = 1/(2 + 1/C) (i = 1, ..., 4) is.
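A numerical check of Example 1 may be helpful. The sketch below, assuming numpy and cvxpy, takes one concrete square as the rectangle of Fig. 3 (the coordinates are an illustrative choice), evaluates the L1 dual objective (21) at the solutions of (23), and solves the L2 dual, whose unique maximizer is α_i = 1/(2 + 1/C); the value of C and the helper name dual_value are also illustrative only.

    # Sketch: non-unique L1 dual solutions versus the unique L2 dual solution of Example 1.
    import numpy as np
    import cvxpy as cp

    X = np.array([[1.0, 0.0], [0.0, 1.0],      # x1, x2: Class 1
                  [0.0, -1.0], [-1.0, 0.0]])   # x3, x4: Class 2 (a square as in Fig. 3)
    y = np.array([1.0, 1.0, -1.0, -1.0])
    C = 10.0
    Xy = X * y[:, None]

    def dual_value(alpha, p):
        """Dual objective (4) for p = 1 or (6) for p = 2 with the dot-product kernel."""
        q = 0.5 * np.sum((Xy.T @ alpha) ** 2)
        if p == 2:
            q += np.sum(alpha ** 2) / (2.0 * C)
        return np.sum(alpha) - q

    # The candidate L1 solutions of (23) all give the same dual objective value:
    for alpha in ([1, 0, 1, 0], [0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]):
        print(alpha, "-> Q =", dual_value(np.array(alpha, float), p=1))

    # The L2 dual, in contrast, has the unique maximizer alpha_i = 1/(2 + 1/C):
    a = cp.Variable(4)
    obj = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(Xy.T @ a)
                      - cp.sum_squares(a) / (2.0 * C))
    prob = cp.Problem(obj, [a >= 0, cp.sum(cp.multiply(y, a)) == 0])
    prob.solve()
    print("L2 alpha =", np.round(a.value, 4), " vs 1/(2+1/C) =", round(1 / (2 + 1 / C), 4))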

In general, the number of support vectors for L2 SVMs is larger than that for L1 SVMs. The above example shows non-uniqueness of the dual problem, but the primal problem is unique since there are unbounded support vectors. Non-unique primal solutions occur when there are no unbounded support vectors. Burges and Crisp [4] derive conditions in which the dual solution is unique but the primal solution is non-unique.

DEGENERATE SOLUTIONS

Rifkin, Pontil, and Verri [2] discuss degenerate solutions, in which w = 0, for L1 SVMs. Fernández [3] derived similar results for L1 SVMs, although he did not refer to degeneracy. Degeneracy also occurs for L2 SVMs. In the following we discuss degenerate solutions following the proof of [3].

Theorem 4. Let C = K C_0, where K and C_0 are positive parameters, and let α* be a solution of the L1 or L2 support vector machine with K = 1. Define

    w(α) = Σ_{i=1}^M α_i y_i g(x_i).    (27)

Then the necessary and sufficient condition for

    w(α*) = 0    (28)

is that K α* is also a solution for any K (> 1).

Proof. We prove the theorem for L2 SVMs; the proof for L1 SVMs is obtained by deleting the term α^t α / (2C) in the following proof.

Necessary condition. Let w(α*) = 0, let α'' be the optimal solution for K (K > 1), and suppose, contrary to the claim, that K α* is not optimal for K. There is α' such that α'' = K α'. Since α' satisfies the constraints for K = 1, it is a feasible, in general non-optimal, solution for K = 1. Then, for K = 1,

    Q(α*) = Σ_{i=1}^M α_i* − α*^t α* / (2 C_0) ≥ Q(α') = Σ_{i=1}^M α_i' − (1/2) w(α')^t w(α') − α'^t α' / (2 C_0).    (29)

For K (K > 1),

    Q(K α*) = K Σ_{i=1}^M α_i* − K α*^t α* / (2 C_0),
    Q(K α') = K Σ_{i=1}^M α_i' − (K²/2) w(α')^t w(α') − K α'^t α' / (2 C_0).    (30)

Multiplying all the terms in (29) by K and comparing the result with (30), we see the contradiction: Q(K α*) = K Q(α*) ≥ K Q(α') ≥ Q(K α') = Q(α''), so K α* is no worse than the optimal solution for K. Thus, K α* is the optimal solution for K > 1.

Sufficient condition. Suppose K α* is the optimal solution for any K (≥ 1). Then for any K (≥ 1),

    Q(K α*) = K Σ_{i=1}^M α_i* − (K²/2) w(α*)^t w(α*) − K α*^t α* / (2 C_0) ≥ Q(α*) = Σ_{i=1}^M α_i* − (1/2) w(α*)^t w(α*) − α*^t α* / (2 C_0).    (31)

Rewriting (31), we have

    Σ_{i=1}^M α_i* ≥ ((K + 1)/2) w(α*)^t w(α*) + α*^t α* / (2 C_0).    (32)

Since (32) must remain satisfied as K approaches infinity, w(α*) = 0 must hold. Otherwise, K α* cannot be the optimal solution.

Figure 4: Non-separable one-dimensional case.

Example 2. Consider the non-separable one-dimensional case shown in Fig. 4, in which x_1 = 1 and x_3 = −1 belong to Class 1 and x_2 = 0 belongs to Class 2. Here, we use the dot product kernel. The inequality constraints are

    w + b ≥ 1 − ξ_1,    (33)
    −b ≥ 1 − ξ_2,    (34)
    −w + b ≥ 1 − ξ_3.    (35)

The dual problem for the L1 SVM is given as follows: maximize

    Q(α) = α_1 + α_2 + α_3 − (1/2) (α_1 − α_3)²    (36)

subject to

    α_1 − α_2 + α_3 = 0,    (37)
    C ≥ α_i ≥ 0  for i = 1, 2, 3.    (38)

From (37), α_2 = α_1 + α_3. Substituting this into (36), we obtain

    Q(α) = 2 α_1 + 2 α_3 − (1/2) (α_1 − α_3)²,    (39)
    C ≥ α_i ≥ 0  for i = 1, 2, 3.    (40)

Equation (39) is maximized when α_1 = α_3. Thus the optimal solution is given by

    α_1 = C/2,  α_2 = C,  α_3 = C/2.    (41)

Therefore, x = 1, 0, and −1 are all support vectors. Thus, from (27), w = 0. Substituting w = ξ_1 = ξ_3 = 0 into (33) and (35) and taking the equalities, we obtain b = 1.

The dual objective function for the L2 SVM is given by

    Q(α) = α_1 + α_2 + α_3 − (1/2) ((α_1 − α_3)² + (α_1² + α_2² + α_3²)/C).    (42)

The objective function is maximized when

    α_1 = α_3 = 2C/3,  α_2 = 4C/3.    (43)

Thus, w = 0 and b = 1/3. Therefore, any datum is classified into Class 1.

CONCLUSIONS

In this paper, we compared L1 and L2 support vector machines from the standpoint of uniqueness and degeneracy of solutions. We introduced the concept of an irreducible set of support vectors and clarified the condition for non-uniqueness of L1 SVM dual solutions. We also proved that degeneracy of L2 SVM solutions occurs.

REFERENCES

[1] M. Pontil and A. Verri, "Properties of support vector machines," Neural Computation, Vol. 10, No. 4, pp. 955-974, 1998.

[2] R. Rifkin, M. Pontil, and A. Verri, "A note on support vector machine degeneracy," Proceedings of the 10th International Conference on Algorithmic Learning Theory (ALT '99), Lecture Notes in Artificial Intelligence, Vol. 1720, 1999.

[3] R. Fernández, "Behavior of the weights of a support vector machine as a function of the regularization parameter C," Proceedings of the 8th International Conference on Artificial Neural Networks (ICANN '98), 1998.

[4] C. J. C. Burges and D. J. Crisp, "Uniqueness of the SVM solution," in S. A. Solla, T. K. Leen, and K.-R. Müller, Eds., Advances in Neural Information Processing Systems 12, The MIT Press, 2000.

[5] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.

[6] S. Abe, Pattern Classification: Neuro-fuzzy Methods and Their Comparison, Springer-Verlag, London, UK, 2001.