SVMC An introduction to Support Vector Machines Classification


1 SVMC An introduction to Support Vector Machines Classification 6.783, Biomedical Decision Support Lorenzo Rosasco Department of Brain and Cognitive Science MIT

2 A typical problem We have a cohort of patients from two groups, say A and B. We wish to devise a classification rule to distinguish patients of one group from patients of the other group.

3 Learning and Generalization Goal: correctly classify new patients

4 Plan 1. Linear SVM 2. Non Linear SVM: Kernels 3. Tuning SVM 4. Beyond SVM: Regularization Networks

5 Learning from Data To make predictions we need information about the patients. patient 1: $x = (x_1, \dots, x_n)$; patient 2: $x = (x_1, \dots, x_n)$; ... patient l: $x = (x_1, \dots, x_n)$

6 Linear model Patients of class A are labeled y = 1. Patients of class B are labeled y = -1. Linear model: $w \cdot x = \sum_{j=1}^{n} w_j x_j$. Classification rule: $\mathrm{sign}(w \cdot x)$
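
As a minimal sketch of the classification rule on this slide, the snippet below computes $w \cdot x$ and takes its sign. The weight vector and the patient measurements are made-up values for illustration, and mapping a score of exactly zero to -1 is an assumption (the slide does not say how ties are broken).

```python
def linear_score(w, x):
    """Return the inner product w . x = sum_j w_j * x_j."""
    return sum(wj * xj for wj, xj in zip(w, x))


def classify(w, x):
    """Rule sign(w . x): +1 (class A) or -1 (class B).

    Assumption: a score of exactly 0 is assigned to class B.
    """
    return 1 if linear_score(w, x) > 0 else -1


w = [2.0, -1.0]        # hypothetical weights
patient = [1.0, 0.5]   # hypothetical measurements of one patient
label = classify(w, patient)
```

Here the score is 2.0 * 1.0 - 1.0 * 0.5 = 1.5, so the patient is assigned to class A.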

7 1D Case [Figure: Y plotted against X; $w \cdot x > 0$ where y = 1, $w \cdot x < 0$ where y = -1, and $w \cdot x = 0$ at the boundary]

8 How do we find a good solution? 2D Classification Problem [Figure: points $x = (x_1, x_2)$ labeled y = -1 and y = 1]

9 How do we find a good solution? [Figure: the line $w \cdot x = 0$ separating the region $w \cdot x > 0$ from the region $w \cdot x < 0$]


13 How do we find a good solution? [Figure: margin M] The margin M measures the distance between the two closest points.

14 Maximum Margin Hyperplane ...with little effort... one can show that maximizing the margin M is equivalent to maximizing $\frac{1}{\|w\|}$

15 Linear and Separable SVM (Friday, October 30, 2009) $\min_{w \in \mathbb{R}^n} \|w\|^2$ subject to: $y_i (w \cdot x_i) \geq 1$, $i = 1, \dots, l$. Bias and Slack: The SVM introduced by Vapnik includes an unregularized bias term b, leading to classification via a function of the form $f(x) = \mathrm{sign}(w \cdot x + b)$. Typically an offset term is added to the solution. In practice, we want to work with datasets that are not linearly separable.
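
To make the separable problem concrete, here is a small sketch that checks the constraints $y_i (w \cdot x_i) \geq 1$ for a candidate w and reports $1 / \|w\|$, which under these constraints is the distance from the hyperplane $w \cdot x = 0$ to the closest feasible point. The four 2D points and the candidate vectors are invented for illustration, and no optimizer is involved.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def is_feasible(w, X, y):
    """True if every point satisfies y_i * (w . x_i) >= 1."""
    return all(yi * dot(w, xi) >= 1 for xi, yi in zip(X, y))


def margin(w):
    """Return 1 / ||w||."""
    return 1.0 / math.sqrt(dot(w, w))


# Hypothetical, linearly separable toy data.
X = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]

w_a = [0.5, 0.0]   # feasible, smaller norm
w_b = [1.0, 0.0]   # feasible, larger norm
margins = {"w_a": margin(w_a), "w_b": margin(w_b)}
```

Both candidates satisfy the constraints, but w_a has the smaller norm and therefore the larger margin, which is why the problem minimizes $\|w\|^2$.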

16 A more general Algorithm There are two things we would like to improve: Allow for errors Non Linear Models

17 Measuring errors

18 Measuring errors (cont) [Figure: slack variables $\xi_i$]

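19 Linear SVM $\min_{w \in \mathbb{R}^n,\, \xi \in \mathbb{R}^l,\, b \in \mathbb{R}} \; C \sum_{i=1}^{l} \xi_i + \|w\|^2$ subject to: $y_i (w \cdot x_i + b) \geq 1 - \xi_i$, $i = 1, \dots, l$; $\xi_i \geq 0$, $i = 1, \dots, l$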
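
For a fixed (w, b), the smallest slack satisfying both constraints is $\xi_i = \max(0,\, 1 - y_i (w \cdot x_i + b))$. The sketch below uses that to evaluate the objective in the form written on this slide, without a factor 1/2 in front of $\|w\|^2$ (many textbooks include one). The 1D data, w, b and C are made-up values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, X, y):
    """Smallest xi_i >= 0 with y_i * (w . x_i + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]


def primal_objective(w, b, X, y, C):
    """C * sum_i xi_i + ||w||^2, following the slide's form."""
    return C * sum(slacks(w, b, X, y)) + dot(w, w)


# Hypothetical 1D data; the last point lies on the wrong side.
X = [[2.0], [0.5], [-1.0], [0.5]]
y = [1, 1, -1, -1]
w, b, C = [1.0], 0.0, 10.0

xi = slacks(w, b, X, y)
value = primal_objective(w, b, X, y, C)
```

The second point is correctly classified but inside the margin (slack 0.5), while the fourth is misclassified (slack 1.5).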

20 Optimization How do we solve this minimization problem? (...and why do we call it SVM anyway?)

21 Some facts Representer Theorem Dual Formulation Box Constraints and Support Vectors

22 Representer Theorem The solution to the minimization problem can be written as $w \cdot x = \sum_{i=1}^{l} c_i (x \cdot x_i)$

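23 Dual Problem The coefficients can be found by solving: $\max_{\alpha \in \mathbb{R}^l} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \alpha^T Q \alpha$ subject to: $\sum_{i=1}^{l} y_i \alpha_i = 0$, $0 \leq \alpha_i \leq C$, $i = 1, \dots, l$. Here $Q_{ij} = y_i y_j (x_i \cdot x_j)$ and $\alpha_i = c_i / y_i$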
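
The following sketch only evaluates the dual; it does not solve it. It builds $Q_{ij} = y_i y_j (x_i \cdot x_j)$, checks the equality and box constraints for a hand-picked $\alpha$, and recovers $c_i = \alpha_i y_i$ and $w = \sum_i c_i x_i$. The two data points, C, and $\alpha$ are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_Q(X, y):
    """Q_ij = y_i * y_j * (x_i . x_j)."""
    return [[yi * yj * dot(xi, xj) for xj, yj in zip(X, y)]
            for xi, yi in zip(X, y)]


def dual_objective(alpha, Q):
    """sum_i alpha_i - 1/2 * alpha^T Q alpha."""
    n = len(alpha)
    quad = sum(alpha[i] * Q[i][j] * alpha[j]
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


def is_dual_feasible(alpha, y, C, tol=1e-12):
    """Check sum_i y_i alpha_i = 0 and 0 <= alpha_i <= C."""
    balanced = abs(sum(yi * ai for yi, ai in zip(y, alpha))) <= tol
    boxed = all(0.0 <= ai <= C for ai in alpha)
    return balanced and boxed


X = [[1.0], [-1.0]]   # hypothetical training points
y = [1, -1]
C = 10.0
alpha = [0.5, 0.5]    # hand-picked candidate, not produced by a solver

Q = build_Q(X, y)
value = dual_objective(alpha, Q)
c = [ai * yi for ai, yi in zip(alpha, y)]           # c_i = alpha_i * y_i
w = [sum(ci * xi[0] for ci, xi in zip(c, X))]       # w = sum_i c_i x_i
```

In practice the maximization is handed to a quadratic programming routine or a dedicated SVM solver; the functions above are only meant to show what each term of the dual refers to.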

24 Toward Simpler Optimality Conditions: Determining b. Optimality conditions: with little effort... one can show that if $y_i f(x_i) > 1$ then $\alpha_i = 0$. The solution is sparse: some training points do not contribute to the solution. Suppose we have the optimal $\alpha_i$'s. Also suppose (this happens in practice) that there exists an i satisfying $0 < \alpha_i < C$. Then $\alpha_i < C \implies \zeta_i > 0 \implies \xi_i = 0 \implies y_i \left( \sum_{j=1}^{l} y_j \alpha_j K(x_i, x_j) + b \right) - 1 = 0 \implies b = y_i - \sum_{j=1}^{l} y_j \alpha_j K(x_i, x_j)$. Here $\zeta_i$ denotes the multiplier of the constraint $\xi_i \geq 0$, and $K(x_i, x_j) = x_i \cdot x_j$ in the linear case.
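
A short sketch of the formula for b: pick any index with $0 < \alpha_i < C$ and compute $b = y_i - \sum_j y_j \alpha_j K(x_i, x_j)$. The two-point data set and the $\alpha$ values below were worked out by hand for illustration (they are assumptions, not solver output), and the linear kernel stands in for K.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bias_from_alpha(alpha, X, y, C, kernel):
    """b = y_i - sum_j y_j alpha_j K(x_i, x_j) for an i with 0 < alpha_i < C."""
    for i, a in enumerate(alpha):
        if 0.0 < a < C:
            return y[i] - sum(y[j] * alpha[j] * kernel(X[i], X[j])
                              for j in range(len(X)))
    raise ValueError("no alpha_i strictly between 0 and C")


def decision_value(x, alpha, X, y, b, kernel):
    """f(x) = sum_j y_j alpha_j K(x, x_j) + b."""
    return sum(y[j] * alpha[j] * kernel(x, X[j]) for j in range(len(X))) + b


X = [[3.0], [1.0]]    # hypothetical training points
y = [1, -1]
C = 10.0
alpha = [0.5, 0.5]    # hand-derived for this toy problem

b = bias_from_alpha(alpha, X, y, C, dot)
values = [decision_value(x, alpha, X, y, b, dot) for x in X]
```

Both training points end up exactly on the margin, with decision values +1 and -1.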

25 Sparse Solution Note that: The solution depends only on the training set points. (no dependence on the number of features!)

26 Feature Map $f(x) = w \cdot \Phi(x)$

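27 A Key Observation The solution depends only on $Q_{ij} = y_i y_j (x_i \cdot x_j)$: $\max_{\alpha \in \mathbb{R}^l} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \alpha^T Q \alpha$ subject to: $\sum_{i=1}^{l} y_i \alpha_i = 0$, $0 \leq \alpha_i \leq C$, $i = 1, \dots, l$. Idea: use $Q_{ij} = y_i y_j (\Phi(x_i) \cdot \Phi(x_j))$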

28 Kernels and Feature Maps The crucial quantity is the inner product $K(x, t) = \Phi(x) \cdot \Phi(t)$, called the kernel. A function is called a kernel if it is: symmetric; positive definite.
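
One way to see the identity $K(x, t) = \Phi(x) \cdot \Phi(t)$ at work is a small worked example. The feature map below, $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ for 2D inputs, is a standard textbook choice rather than one taken from the slides; its inner product equals the kernel $(x \cdot t)^2$.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map for 2D inputs (illustrative choice)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def k_quad(x, t):
    """Kernel K(x, t) = (x . t)^2, computed without building phi."""
    return dot(x, t) ** 2


x, t = [1.0, 2.0], [3.0, 4.0]
via_features = dot(phi(x), phi(t))   # inner product in feature space
via_kernel = k_quad(x, t)            # same quantity from the kernel
```

Both routes give 121 (up to floating-point rounding on the feature-space side), but the kernel never forms the 3-dimensional vectors.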

29 Examples of Kernels Very common examples of symmetric pd kernels are: Linear kernel $K(x, x') = x \cdot x'$; Gaussian kernel $K(x, x') = e^{-\frac{\|x - x'\|^2}{\sigma^2}}$, $\sigma > 0$; Polynomial kernel $K(x, x') = (x \cdot x' + 1)^d$, $d \in \mathbb{N}$. For specific applications, designing an effective kernel is a challenging problem.
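
The three kernels listed on this slide can be written down directly. This is a plain-Python sketch with made-up inputs; the helper `gram_matrix` is a hypothetical name introduced here to build the matrix of pairwise kernel values.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(x, z):
    """K(x, x') = x . x'"""
    return dot(x, z)


def gaussian_kernel(x, z, sigma):
    """K(x, x') = exp(-||x - x'||^2 / sigma^2), sigma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)


def polynomial_kernel(x, z, d):
    """K(x, x') = (x . x' + 1)^d, d a positive integer."""
    return (dot(x, z) + 1.0) ** d


def gram_matrix(kernel, X):
    """Matrix of kernel values K(x_i, x_j) over all pairs in X."""
    return [[kernel(xi, xj) for xj in X] for xi in X]


X = [[1.0, 2.0], [3.0, 4.0], [0.0, -1.0]]   # hypothetical points
G = gram_matrix(lambda a, b: gaussian_kernel(a, b, sigma=2.0), X)
```

The Gram matrix G is symmetric with ones on the diagonal, as expected for the Gaussian kernel.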

30 Non Linear SVM Summing up: Define the feature map, either explicitly or via a kernel. Find the linear solution in the feature space. Use the same solver as in the linear case. The representer theorem now gives: $w \cdot \Phi(x) = \sum_{i=1}^{l} c_i (\Phi(x) \cdot \Phi(x_i)) = \sum_{i=1}^{l} c_i K(x, x_i)$
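
The kernel expansion $\sum_i c_i K(x, x_i)$ can be illustrated on a 1D problem that no linear rule separates: labels +1, -1, +1 along the line. The coefficients below are hand-picked for illustration ($c_i = y_i$), not the output of an SVM solver, and the kernel width is an arbitrary assumption.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """1D Gaussian kernel exp(-(x - z)^2 / sigma^2)."""
    return math.exp(-((x - z) ** 2) / sigma ** 2)


def decision_value(x, train_x, coeffs, kernel):
    """w . Phi(x) = sum_i c_i K(x, x_i)."""
    return sum(c * kernel(x, xi) for c, xi in zip(coeffs, train_x))


def predict(x, train_x, coeffs, kernel):
    return 1 if decision_value(x, train_x, coeffs, kernel) > 0 else -1


train_x = [-2.0, 0.0, 2.0]    # hypothetical 1D training points
train_y = [1, -1, 1]          # not separable by a single threshold
coeffs = [1.0, -1.0, 1.0]     # hand-picked c_i, for illustration only

predictions = [predict(x, train_x, coeffs, gaussian_kernel) for x in train_x]
```

With the Gaussian kernel the decision function is positive near -2 and 2 and negative near 0, so all three training labels are reproduced.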

31 Example in 1D [Figure: Y plotted against X, with points labeled y = 1 and y = -1]

32 Software SVM Light: SVM Torch: libsvm:

33 Model Selection We have to fix the Regularization parameter C We have to choose the kernel (and its parameter) Using default values is usually a BAD BAD idea

34 Regularization Parameter $\min_{w \in \mathbb{R}^n,\, \xi \in \mathbb{R}^l,\, b \in \mathbb{R}} \; C \sum_{i=1}^{l} \xi_i + \|w\|^2$. Large C: we try to minimize errors, ignoring the complexity of the solution. Small C: we ignore the errors to obtain a simple solution.
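
The trade-off can be made concrete by scoring two fixed candidate solutions under a large and a small C. This is only a comparison of objective values, under the assumption that the slack of each point is $\max(0, 1 - y_i (w \cdot x_i + b))$; the data and both candidates are invented.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def primal_objective(w, b, X, y, C):
    """C * sum_i max(0, 1 - y_i (w . x_i + b)) + ||w||^2."""
    slack_sum = sum(max(0.0, 1.0 - yi * (dot(w, xi) + b))
                    for xi, yi in zip(X, y))
    return C * slack_sum + dot(w, w)


# Hypothetical 1D data: one positive point sits close to the boundary.
X = [[0.25], [2.0], [-2.0]]
y = [1, 1, -1]

w_fit = [4.0]      # zero slack, but large norm ("complex" solution)
w_simple = [0.5]   # small norm, but slack on the first point

scores = {
    C: (primal_objective(w_fit, 0.0, X, y, C),
        primal_objective(w_simple, 0.0, X, y, C))
    for C in (100.0, 1.0)
}
```

With C = 100 the zero-error candidate has the lower objective (16 versus 87.75); with C = 1 the small-norm candidate wins (1.125 versus 16).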

35 Which Kernel? For very high dimensional data the linear kernel is often the default choice: it allows computational speed-up and is less prone to overfitting. The Gaussian kernel with proper tuning is another common choice. Whenever possible, use prior knowledge to build problem-specific features or kernels.

36 2D demo [Figure: panels (a) and (b)]

37 Practical Rules We can choose C (and the kernel parameter) via cross validation. Holdout set: Training Set, Validation Set. K-fold cross validation: K = number of examples is called Leave One Out.
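
A minimal sketch of how the K-fold splits might be generated. The function name `kfold_indices` and the round-robin assignment of examples to folds are choices made here for illustration; in practice the data is usually shuffled first.

```python
def kfold_indices(n_examples, k):
    """Return k (train_indices, validation_indices) pairs.

    Example i goes to fold i mod k (no shuffling, for reproducibility).
    """
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        train = sorted(j for fold in folds[:i] + folds[i + 1:] for j in fold)
        splits.append((train, validation))
    return splits


splits = kfold_indices(6, 3)          # 3-fold CV on 6 examples
leave_one_out = kfold_indices(4, 4)   # K = number of examples
```

For each candidate C (and kernel parameter), one would train on every `train` part, measure the error on the matching `validation` part, and keep the value with the lowest average validation error.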

38 K-Fold CV We have to compute several solutions...

39 A Rule of Thumb This is what the CV error typically looks like: [Figure: curves of log10 GCKL, log10 GACV, log10 BRMISCLASS and log10 BRXA] Fix a reasonable kernel, then fine-tune C.

40 Which values do we start from? For the Gaussian kernel, pick sigma of the order of the average distance... $K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{\sigma^2} \right)$. Take min (and max) C as the value for which the training set error does not increase (decrease) anymore.
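
One possible reading of the sigma heuristic, sketched below, is to start from the mean Euclidean distance between pairs of training points. Both that interpretation and the multipliers used to build a search grid around it are assumptions, since the slide leaves the details open.

```python
import math
from itertools import combinations


def mean_pairwise_distance(X):
    """Average Euclidean distance over all pairs of points in X."""
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
             for u, v in combinations(X, 2)]
    return sum(dists) / len(dists)


X = [[0.0, 0.0], [3.0, 4.0], [0.0, 8.0]]   # hypothetical training points
sigma0 = mean_pairwise_distance(X)

# Hypothetical grid of candidate sigmas around the starting value.
sigma_grid = [sigma0 * factor for factor in (0.25, 0.5, 1.0, 2.0, 4.0)]
```

The pairwise distances here are 5, 8 and 5, giving a starting value of 6; each candidate in the grid would then be scored by cross validation.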

41 Computational Considerations The training time depends on the parameters: the more we fit, the slower the algorithm. Typically the computational burden is in the selection of the regularization parameter (solvers for the regularization path).

42 Regularization Networks SVMs are an example of a family of algorithms of the form: $C \sum_{i=1}^{l} V(y_i, w \cdot \Phi(x_i)) + \|w\|^2$, where V is called the loss function.

43 Hinge Loss [Figure: the 0-1 loss and the hinge loss $V(y\, w \cdot \Phi(x))$ plotted against $y\, w \cdot \Phi(x)$]
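
Both losses in the figure are functions of the margin $m = y\, w \cdot \Phi(x)$. As a sketch, the hinge loss is $\max(0, 1 - m)$ and the 0-1 loss counts a mistake when the margin is not positive; treating a margin of exactly zero as an error is an assumption made here.

```python
def hinge_loss(margin):
    """Hinge loss max(0, 1 - m), with m = y * (w . Phi(x))."""
    return max(0.0, 1.0 - margin)


def zero_one_loss(margin):
    """0-1 loss: 1 for a mistake (m <= 0 by assumption), else 0."""
    return 1.0 if margin <= 0 else 0.0


margins = [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]   # sample margin values
hinge = [hinge_loss(m) for m in margins]
zero_one = [zero_one_loss(m) for m in margins]
```

On every sampled margin the hinge loss is at least as large as the 0-1 loss, and it also penalizes correctly classified points whose margin is below 1.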

44 Loss functions

45 Representer Theorem For a LARGE class of loss functions: $w \cdot \Phi(x) = \sum_{i=1}^{l} \alpha_i (\Phi(x) \cdot \Phi(x_i)) = \sum_{i=1}^{l} \alpha_i K(x, x_i)$. The way we compute the coefficients depends on the considered loss function.

46 Regularized LS The simplest, yet powerful, algorithm is probably RLS. Square loss: $V(y, w \cdot \Phi(x)) = (y - w \cdot \Phi(x))^2$. Algorithm: $\left( Q + \frac{1}{C} I \right) \alpha = y$, $Q_{ij} = K(x_i, x_j)$. Leave one out can be computed at the price of one (!!!) solution.
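
A compact sketch of the RLS algorithm: build $Q_{ij} = K(x_i, x_j)$, solve $(Q + \frac{1}{C} I)\alpha = y$, and predict with $\sum_i \alpha_i K(x, x_i)$. The tiny Gaussian-elimination solver, the function names, and the two-point data set are illustrative assumptions; a real implementation would call a linear algebra library.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def rls_fit(X, y, C, kernel):
    """Return alpha solving (Q + I / C) alpha = y, Q_ij = K(x_i, x_j)."""
    n = len(X)
    A = [[kernel(X[i], X[j]) + (1.0 / C if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(A, [float(v) for v in y])


def rls_predict(x, X, alpha, kernel):
    """f(x) = sum_i alpha_i K(x, x_i)."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, X))


def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))


X = [[1.0], [-1.0]]   # hypothetical training points
y = [1, -1]
alpha = rls_fit(X, y, C=1.0, kernel=linear_kernel)
score = rls_predict([2.0], X, alpha, linear_kernel)
```

For classification the predicted label would be the sign of `score`; here the system is 2x2 and can be checked by hand, giving $\alpha = (1/3, -1/3)$.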

47 Summary Separable, Linear SVM Non Separable, Linear SVM Non Separable, Non Linear SVM How to use SVM
