Kernel Methods and Support Vector Machines

Intelligent Systems: Reasoning and Recognition
James L. Crowley
ENSIMAG 2 / MoSIG M1, Second Semester 2012/2013
Lesson 20, 2 May 2013

Kernel Methods and Support Vector Machines

Contents
Kernel Functions
  Quadratic as a Kernel Function
  Gaussian Kernel
  Kernel Functions for Symbolic Data
  Kernels for Bayesian Reasoning
Support Vector Machines
  Hard-Margin SVMs: Separable Training Data
  Soft-Margin SVMs: Non-Separable Training Data

Bibliographic sources:
Neural Networks for Pattern Recognition, C. M. Bishop, Oxford Univ. Press, 1995.
A Computational Biology Example using Support Vector Machines, Suzy Fei, 2009 (online).

Kernel Functions

Linear discriminant functions can be very efficient classifiers, provided that the class features can be separated by a linear decision surface. The linear discriminant function is

$g(X) = W^T X + b$

The decision rule is: IF $W^T X + b > 0$ THEN $X \in C_1$ ELSE $X \in C_2$.

For many domains it is easier to separate the classes with a linear function if you first transform your feature data into a space with a higher number of dimensions. One way to do this is to transform the features with a kernel function. Instead of the decision surface $g(X) = W^T X + b$, we use a decision surface

$g(X) = W^T \phi(X) + b$

This can be used to construct non-linear decision surfaces for our data. The trick is to learn with the non-linear mapping, but to use the resulting discriminant without actually computing the mapping.

Formally, a kernel function is any function that satisfies the Mercer condition. Mercer's condition requires that for any finite set of observations $\{X_1, \ldots, X_M\} \subseteq S$ and any choice of coefficients $c_i$, the kernel is positive semi-definite:

$\sum_{i=1}^{M} \sum_{j=1}^{M} c_i c_j \, k(X_i, X_j) \ge 0$
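
As an aside (a minimal numerical sketch, not part of the original notes): the Mercer condition can be checked empirically on a finite sample by testing whether the Gram matrix of a candidate kernel is positive semi-definite. The kernels and sample data below are illustrative assumptions.

import numpy as np

def gram_matrix(kernel, X):
    """Gram matrix K[i, j] = kernel(X[i], X[j]) for a finite set of observations."""
    M = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(M)] for i in range(M)])

def satisfies_mercer_empirically(kernel, X, tol=1e-10):
    """sum_i sum_j c_i c_j k(X_i, X_j) >= 0 for all c  <=>  all eigenvalues of K >= 0."""
    K = gram_matrix(kernel, X)
    eigenvalues = np.linalg.eigvalsh(K)     # K is symmetric for a symmetric kernel
    return bool(np.all(eigenvalues >= -tol))

# Illustrative sample and kernels
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
linear = lambda x, z: x @ z                                       # inner product
gaussian = lambda x, z: np.exp(-np.sum((x - z)**2) / (2 * 1.0**2))
print(satisfies_mercer_empirically(linear, X))     # True
print(satisfies_mercer_empirically(gaussian, X))   # True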

This condition is satisfied by inner products. Inner products are of the form

$W^T X = \langle W, X \rangle = \sum_{d=1}^{D} w_d x_d$

Thus $k(x, z) = \langle \phi(x), \phi(z) \rangle$ is a valid kernel function. The Mercer condition can also be satisfied by other functions. For example, the Gaussian function

$k(x, z) = e^{-\frac{\|x - z\|^2}{2\sigma^2}}$

is a valid kernel.

We can learn the discriminant in an inner product space, $W^T \phi(X) = \langle W, \phi(X) \rangle$, where the vector $W$ is learned from the mapped values of our training data. This gives a discriminant function of the form:

$g(X) = \sum_{m=1}^{M} a_m y_m \langle \phi(X_m), \phi(X) \rangle + b$

We will see that we can learn in the kernel space, and then recognize, without actually having to compute the mapping $\phi(X)$.

Kernels can be extended to infinite-dimensional spaces and even to non-numerical, symbolic data.

Quadratic as a Kernel Function

A common kernel is the quadratic kernel

$k(x, z) = (z^T x + c)^2 = \phi(z)^T \phi(x)$

For D = 2, this gives

$\phi(X) = \left( x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2,\ \sqrt{2c}\, x_1,\ \sqrt{2c}\, x_2 \right)^T$

This kernel maps a 2-D feature space to a 5-D kernel space (a sixth, constant component $c$ completes the expansion of $(z^T x + c)^2$ exactly). This can be used to map a plane to a hyperbolic surface.
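
The "learn with the mapping, recognize without computing it" property can be verified numerically. The sketch below is my own illustration; it includes the constant sixth component $c$ so that the identity $k(x, z) = \phi(z)^T \phi(x)$ holds exactly.

import numpy as np

def phi_quadratic(x, c=1.0):
    """Explicit feature map for the quadratic kernel k(x, z) = (z.T x + c)**2, D = 2."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])                       # constant component completes the expansion

def k_quadratic(x, z, c=1.0):
    """Quadratic kernel computed directly, without the mapping."""
    return (z @ x + c) ** 2

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(k_quadratic(x, z))                       # direct evaluation: 0.25
print(phi_quadratic(x) @ phi_quadratic(z))     # same value via the explicit mapping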

Gaussian Kernel

The Gaussian exponential is very often used as a kernel function. In this case:

$k(x, z) = e^{-\frac{\|x - z\|^2}{2\sigma^2}}$

This satisfies Mercer's condition because the exponent is separable:

$\|x - z\|^2 = x^T x - 2\, x^T z + z^T z$

Intuitively, you can see this as placing a Gaussian function, multiplied by the indicator variable ($y_m$ = +/- 1), at each training sample, and then summing the functions. The zero-crossings in the sum of Gaussians define the decision surface.

Depending on $\sigma$, this can provide a good fit or an overfit to the data. If $\sigma$ is large compared to the distance between the classes, this can give an overly flat discriminant surface. If $\sigma$ is small compared to the distance between classes, this will overfit the samples. A good choice for $\sigma$ will be comparable to the distance between the closest members of the two classes.

(Figure from the lecture "A Computational Biology Example using Support Vector Machines", Suzy Fei, 2009, online.)

Among other properties, the feature vector can have an infinite number of dimensions.
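
A small sketch (my own, with illustrative data) of the Gaussian kernel and of the heuristic above, taking $\sigma$ comparable to the distance between the closest members of the two classes:

import numpy as np

def gaussian_kernel(x, z, sigma):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigma_heuristic(X, y):
    """Heuristic from the text: sigma comparable to the smallest distance
    between members of the two classes (labels y in {-1, +1})."""
    X_pos, X_neg = X[y == +1], X[y == -1]
    dists = np.linalg.norm(X_pos[:, None, :] - X_neg[None, :, :], axis=2)
    return dists.min()

# Illustrative two-class data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(10, 2)), rng.normal(+2, 1, size=(10, 2))])
y = np.array([-1] * 10 + [+1] * 10)
sigma = sigma_heuristic(X, y)
print(sigma, gaussian_kernel(X[0], X[10], sigma))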

Kernel Functions for Symbolic Data

Kernel functions can be defined over graphs, sets, strings and text. Consider, for example, a non-vector space composed of a set of words S, and two subsets of S: $A \subseteq S$ and $B \subseteq S$. We can define a kernel function of A and B using the intersection operation:

$k(A, B) = 2^{|A \cap B|}$

where $|\cdot|$ denotes the cardinality (the number of elements) of a set.

Kernels for Bayesian Reasoning

We can define a kernel for Bayesian reasoning, for evidence accumulation. Given a probabilistic model $p(X)$ we can define a kernel as:

$k(X, Z) = p(X)\, p(Z)$

This is clearly a valid kernel because it is a 1-D inner product. Intuitively, it says that two feature vectors, X and Z, are similar if they both have high probability. We can extend this to conditional probabilities:

$k(X, Z) = \sum_{n=1}^{N} p(X \mid A_n)\, p(Z \mid A_n)\, p(A_n)$

Two vectors X, Z will give large values for the kernel, and hence be seen as similar, if they have significant probability for the same components $A_n$.
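
A minimal sketch of the set-intersection kernel, with illustrative word sets:

def intersection_kernel(A, B):
    """k(A, B) = 2**|A intersect B|, defined over subsets of a set of words S."""
    return 2 ** len(A & B)

A = {"kernel", "margin", "support", "vector"}
B = {"margin", "vector", "gaussian"}
print(intersection_kernel(A, B))   # |A & B| = 2, so 2**2 = 4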

Support Vector Machines

A significant limitation of linear learning methods is that the kernel function must be evaluated for every training point during learning. An alternative is to use a learning algorithm with sparse support, that is, one that uses only a small number of points to learn the separation boundary. A Support Vector Machine (SVM) is such an algorithm.

SVMs are popular for problems of classification, regression and novelty detection. Solving for the model parameters corresponds to a convex optimisation problem: any local solution is a global solution.

We will use the two-class problem, K = 2, to illustrate the principle. Multi-class solutions are possible.

Our linear model for the decision surface is

$g(X) = W^T \phi(X) + b$

where $\phi(X)$ is a feature space transformation that maps a hyper-plane in F dimensions into a non-linear decision surface in D dimensions.

The training data is a set of M training samples $\{X_m\}$ and their indicator variables $\{y_m\}$. As in our last lecture, for a 2-class problem, $y_m$ is -1 or +1. A new, observed point (not in the training data) will be classified using the function $\mathrm{sign}(g(X))$, so that the classification of a training sample is correct if $y_m\, g(X_m) > 0$.
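
As a concrete, hedged illustration of the decision rule $\mathrm{sign}(g(X))$: the sketch below uses scikit-learn's SVC, a library implementation of the support vector machine developed in the rest of this lesson. The data, the kernel width and the regularization parameter C (introduced in the soft-margin section later) are assumptions chosen only for the example.

import numpy as np
from sklearn.svm import SVC

# Two-class toy data with labels y_m in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(30, 2)), rng.normal(+2, 1, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

sigma = 1.0
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma**2), C=10.0)   # Gaussian kernel
clf.fit(X, y)

X_new = np.array([[0.5, 1.5]])
g = clf.decision_function(X_new)         # g(X) = sum_m a_m y_m k(X, X_m) + b
print(np.sign(g), clf.predict(X_new))    # classification: sign(g(X))
print("number of support vectors:", len(clf.support_))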

Hard-Margin SVMs: Separable Training Data

We assume that the two classes can be separated by a linear function. That is, there exists a hyper-plane $g(X) = w^T \phi(X) + b$ such that $y_m\, g(X_m) > 0$ for all m. Generally there will exist many such solutions for separable data.

The margin, $\gamma$, is the minimum distance of any sample from the hyper-plane. For a Support Vector Machine, we will determine the decision surface that maximizes the margin, $\gamma$. What we are going to do is design the decision boundary so that it has an equal distance from a small number of support points.

The distance of a point from the hyper-plane is $\frac{g(X)}{\|w\|}$, and $y_m\, g(X_m) > 0$ for all training points, so the distance of the point $X_m$ to the decision surface is:

$\frac{y_m\, g(X_m)}{\|w\|} = \frac{y_m\, (w^T \phi(X_m) + b)}{\|w\|}$

For a decision surface $(w, b)$, the support vectors are the samples from the training set $\{X_m\}$ that attain this minimum distance, the margin:

$\gamma = \min_m \left\{ \frac{1}{\|w\|}\, y_m\, (w^T \phi(X_m) + b) \right\}$
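
As a numerical illustration (my own sketch, with an assumed hyperplane and toy data) of the margin formula above; $\phi$ is taken as the identity to keep the example small:

import numpy as np

def margin(w, b, X, y):
    """gamma = min_m  y_m * (w.T X_m + b) / ||w||   (phi = identity here)."""
    distances = y * (X @ w + b) / np.linalg.norm(w)
    return distances.min()

# Linearly separable toy data and an assumed separating hyperplane (w, b)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([+1, +1, -1, -1])
w = np.array([1.0, 1.0])
b = 0.0
print(margin(w, b, X, y))   # minimum signed distance over the training set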

We will seek to maximize the margin by solving:

$\arg\max_{w,b} \left\{ \frac{1}{\|w\|} \min_m \left[ y_m\, (w^T \phi(X_m) + b) \right] \right\}$

The factor $\frac{1}{\|w\|}$ can be taken outside the minimization over m because $\|w\|$ does not depend on m.

A direct solution would be very difficult, but the problem can be converted to an equivalent problem. Note that rescaling $w$ and $b$ changes nothing. Thus we will scale the equation such that, for the sample that is closest to the decision surface (smallest margin):

$y_m\, (w^T \phi(X_m) + b) = 1$, that is, $y_m\, g(X_m) = 1$

For all other sample points:

$y_m\, (w^T \phi(X_m) + b) \ge 1$

This is known as the canonical representation of the decision hyperplane. The training samples for which $y_m\, (w^T \phi(X_m) + b) = 1$ are said to be the active constraints. All other training samples are inactive. By definition there is always at least one active constraint, and when the margin is maximized there will be at least two active constraints.

Thus the optimization problem becomes

$\arg\min_{w,b} \left\{ \frac{1}{2} \|w\|^2 \right\}$

subject to the active constraints. The factor of 1/2 is a convenience for later analysis.

To solve this problem, we use Lagrange multipliers $a_m \ge 0$, with one multiplier for each constraint. This gives the Lagrangian function:

$L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{m=1}^{M} a_m \left\{ y_m\, (w^T \phi(X_m) + b) - 1 \right\}$

Setting the derivatives to zero, we obtain:

$\frac{\partial L}{\partial w} = 0 \implies w = \sum_{m=1}^{M} a_m y_m \phi(X_m)$

$\frac{\partial L}{\partial b} = 0 \implies \sum_{m=1}^{M} a_m y_m = 0$

Eliminating $w$ and $b$ from $L(w, b, a)$ we obtain the dual:

$\tilde{L}(a) = \sum_{m=1}^{M} a_m - \frac{1}{2} \sum_{m=1}^{M} \sum_{n=1}^{M} a_m a_n y_m y_n\, k(X_m, X_n)$

with constraints:

$a_m \ge 0$ for $m = 1, \ldots, M$
$\sum_{m=1}^{M} a_m y_m = 0$

where the kernel function is $k(X_1, X_2) = \phi(X_1)^T \phi(X_2)$.

The solution takes the form of a quadratic programming problem. In the primal, this is a problem in D variables (the dimension of the kernel space), which would normally take $O(D^3)$ computations. In going to the dual formulation, we have converted this to a problem over the M data points, requiring $O(M^3)$ computations. This can appear to be a drawback, but the solution only depends on a small number of points.

To classify a new observed point, we evaluate:

$g(X) = \sum_{m=1}^{M} a_m y_m\, k(X, X_m) + b$
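
A sketch (not from the notes) of this decision function in its dual form; it assumes the multipliers $a_m$ and the bias $b$ have already been obtained from a quadratic programming solver, and it evaluates only the kernel, never $\phi$:

import numpy as np

def decision_function(X_new, X_train, y_train, a, b, kernel):
    """g(X) = sum_m a_m y_m k(X, X_m) + b, evaluated from the dual variables."""
    return sum(a_m * y_m * kernel(X_new, X_m)
               for a_m, y_m, X_m in zip(a, y_train, X_train)) + b

def classify(X_new, X_train, y_train, a, b, kernel):
    """Decision rule: sign(g(X))."""
    return np.sign(decision_function(X_new, X_train, y_train, a, b, kernel))

In practice only the training points with $a_m \ne 0$ (the support vectors, introduced next) need to be retained in these sums.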

The solution to optimization problems of this form satisfies the Karush-Kuhn-Tucker (KKT) conditions, requiring:

$a_m \ge 0$
$y_m\, g(X_m) - 1 \ge 0$
$a_m \left\{ y_m\, g(X_m) - 1 \right\} = 0$

For every data point in the training samples $\{X_m\}$, either $a_m = 0$ or $y_m\, g(X_m) = 1$. Any point for which $a_m = 0$ does not contribute to

$g(X) = \sum_{m=1}^{M} a_m y_m\, k(X, X_m) + b$

and thus is not used (is not active). The remaining points, for which $a_m > 0$, are called the support vectors. These points lie on the margin, at $y_m\, g(X_m) = 1$, of the maximum margin hyperplane. Once the model is trained, all other points can be discarded.

Let us define the set of support vectors as S. Now that we have solved for S and the $a_m$, we can solve for b. We note that for any support vector:

$y_m \left( \sum_{n \in S} a_n y_n\, k(X_m, X_n) + b \right) = 1$

Averaging over all $N_S$ support vectors in S gives:

$b = \frac{1}{N_S} \sum_{m \in S} \left( y_m - \sum_{n \in S} a_n y_n\, k(X_m, X_n) \right)$

The maximum margin criterion can also be expressed as minimization of an error function $E_\infty(z)$ that is zero if $z \ge 0$ and $\infty$ otherwise.

(From Bishop, p. 331.)
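
Following the averaging formula above, a sketch (my own) of recovering b once the multipliers $a_m$ are known; the support vector set S is taken as the indices with $a_m > 0$, up to a small numerical tolerance:

import numpy as np

def solve_bias(X_train, y_train, a, kernel, tol=1e-8):
    """b = (1/N_S) * sum_{m in S} ( y_m - sum_{n in S} a_n y_n k(X_m, X_n) ),
    where S = { m : a_m > 0 } is the set of support vectors."""
    S = [m for m in range(len(a)) if a[m] > tol]
    b_terms = []
    for m in S:
        s = sum(a[n] * y_train[n] * kernel(X_train[m], X_train[n]) for n in S)
        b_terms.append(y_train[m] - s)
    return np.mean(b_terms)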

Soft-Margin SVMs: Non-Separable Training Data

So far we have assumed that the training data are linearly separable in $\phi(X)$. For many problems, some training data may overlap. The difficulty is that the error function $E_\infty$ goes to $\infty$ for any point on the wrong side of the decision surface. This is called a hard-margin SVM.

We relax this by adding a slack variable $S_m \ge 0$ for each training sample. We define $S_m = 0$ for samples on or inside the correct margin boundary, and $S_m = |y_m - g(X_m)|$ for all other samples. Thus:

For a sample inside the margin, but on the correct side of the decision surface: $0 < S_m \le 1$
For a sample on the decision surface: $S_m = 1$
For a sample on the wrong side of the decision surface: $S_m > 1$

(Soft-margin SVM: Bishop p. 332; note the use of $\xi_m$ in place of $S_m$.)

This is sometimes called a soft margin.
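
As a numeric aid (my own sketch, with assumed discriminant values), the slack variables can be computed as $S_m = \max(0,\, 1 - y_m\, g(X_m))$, which for labels in {-1, +1} agrees with $|y_m - g(X_m)|$ on the samples that violate the margin:

import numpy as np

def slack_variables(g_values, y):
    """S_m = 0 on the correct side of the margin, S_m = 1 - y_m g(X_m) otherwise:
    0 < S_m <= 1 inside the margin, S_m = 1 on the decision surface,
    S_m > 1 for misclassified samples."""
    return np.maximum(0.0, 1.0 - y * g_values)

g = np.array([1.7, 0.4, 0.0, -0.3])   # assumed discriminant values g(X_m)
y = np.array([+1, +1, +1, +1])
print(slack_variables(g, y))          # [0.0, 0.6, 1.0, 1.3]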

To softly penalize points on the wrong side, we minimize:

$C \sum_{m=1}^{M} S_m + \frac{1}{2} \|w\|^2$

where $C > 0$ controls the trade-off between the slack variables and the margin. Because any misclassified point has $S_m > 1$, an upper bound on the number of misclassified points is $\sum_{m=1}^{M} S_m$. C acts as an inverse regularization factor (note that $C = \infty$ recovers the earlier hard-margin SVM).

To solve for the SVM we write the Lagrangian:

$L(w, b, a) = \frac{1}{2} \|w\|^2 + C \sum_{m=1}^{M} S_m - \sum_{m=1}^{M} a_m \left\{ y_m\, g(X_m) - 1 + S_m \right\} - \sum_{m=1}^{M} \mu_m S_m$

The KKT conditions are:

$a_m \ge 0$
$y_m\, g(X_m) - 1 + S_m \ge 0$
$a_m \left\{ y_m\, g(X_m) - 1 + S_m \right\} = 0$
$\mu_m \ge 0$
$S_m \ge 0$
$\mu_m S_m = 0$

Setting the derivatives of $L(w, b, a)$ to zero gives:

$\frac{\partial L}{\partial w} = 0 \implies w = \sum_{m=1}^{M} a_m y_m \phi(X_m)$
$\frac{\partial L}{\partial b} = 0 \implies \sum_{m=1}^{M} a_m y_m = 0$
$\frac{\partial L}{\partial S_m} = 0 \implies a_m = C - \mu_m$

Using these to eliminate $w$, $b$ and $\{S_m\}$ from $L(w, b, a)$ we obtain:

$\tilde{L}(a) = \sum_{m=1}^{M} a_m - \frac{1}{2} \sum_{m=1}^{M} \sum_{n=1}^{M} a_m a_n y_m y_n\, k(X_m, X_n)$

This appears to be the same as before, except that the constraints are different:

$0 \le a_m \le C$
$\sum_{m=1}^{M} a_m y_m = 0$

(referred to as box constraints).

The solution is again a quadratic programming problem, with complexity $O(M^3)$. However, as before, a large subset of the training samples have $a_m = 0$ and thus do not contribute to the solution. For the remaining points:

$y_m\, g(X_m) = 1 - S_m$

For samples ON the margin: $a_m < C$, hence $\mu_m > 0$, requiring that $S_m = 0$.
For samples INSIDE the margin: $a_m = C$, with $S_m \le 1$ if correctly classified and $S_m > 1$ if misclassified.

As before, to solve for b we note that:

$y_m \left( \sum_{n \in S} a_n y_n\, k(X_m, X_n) + b \right) = 1$

Averaging over all support vectors in T gives:

$b = \frac{1}{N_T} \sum_{m \in T} \left( y_m - \sum_{n \in S} a_n y_n\, k(X_m, X_n) \right)$

where T denotes the set of support vectors such that $0 < a_m < C$.
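
To make the box-constrained dual concrete, here is a minimal sketch (my own, not from the notes) that solves it numerically with a general-purpose optimizer and then recovers b from the margin support vectors as above. A dedicated QP solver would normally be used instead; the data, function names and the choice of C are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

def train_soft_margin_svm(X, y, kernel, C, tol=1e-6):
    """Maximize L~(a) = sum_m a_m - 1/2 sum_m sum_n a_m a_n y_m y_n k(X_m, X_n)
    subject to 0 <= a_m <= C (box constraint) and sum_m a_m y_m = 0."""
    M = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(M)] for i in range(M)])
    Q = (y[:, None] * y[None, :]) * K

    neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()        # minimize the negative dual
    result = minimize(neg_dual, np.zeros(M), method="SLSQP",
                      bounds=[(0.0, C)] * M,
                      constraints={"type": "eq", "fun": lambda a: a @ y})
    a = result.x

    S = [m for m in range(M) if a[m] > tol]               # support vectors (a_m > 0)
    T = [m for m in S if a[m] < C - tol]                  # margin support vectors (0 < a_m < C)
    b = np.mean([y[m] - sum(a[n] * y[n] * K[m, n] for n in S) for m in T])
    return a, b

# Usage with a Gaussian kernel and illustrative data
gaussian = lambda x, z, sigma=1.0: np.exp(-np.sum((x - z)**2) / (2 * sigma**2))
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, size=(15, 2)), rng.normal(+2, 1, size=(15, 2))])
y = np.array([-1.0] * 15 + [+1.0] * 15)
a, b = train_soft_margin_svm(X, y, gaussian, C=10.0)
g = lambda x: sum(a[m] * y[m] * gaussian(x, X[m]) for m in range(len(y))) + b
print(np.sign(g(np.array([0.0, 1.0]))))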