Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

1 Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012

2 Linear classifier: which classifier?
[Figure: axes $x_1$, $x_2$]

3 Linear classifier: margin concept
Support Vector Machine (SVM): a classifier that finds the solution with maximum margin.
[Figure: axes $x_1$, $x_2$]

4 Linear classifier geometry (recall)
Discriminant function: $y(x) = w^T x + w_0$
(Signed) distance from $x$ to the boundary: $\frac{w^T x + w_0}{\|w\|}$
The distance does not change when $w$ and $w_0$ are scaled equally.
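The scale invariance stated on this slide can be checked numerically. The following is a minimal sketch; the values of `w`, `w0` and `x` are illustrative assumptions, not taken from the slides.

```python
import math

def signed_distance(w, w0, x):
    """Signed distance from x to the hyperplane w^T x + w0 = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + w0
    norm = math.sqrt(sum(wi * wi for wi in w))
    return score / norm

# Illustrative values (assumed for this example only).
w, w0 = [3.0, 4.0], -5.0
x = [2.0, 1.0]

d = signed_distance(w, w0, x)
# Scale w and w0 by the same positive factor: the distance is unchanged.
d_scaled = signed_distance([10 * wi for wi in w], 10 * w0, x)
print(d, d_scaled)
```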

5 Margin
Margin: when the classes are linearly separable, the distance from the boundary to the closest point:
$\min_i \frac{y^{(i)} (w^T x^{(i)} + w_0)}{\|w\|}$

6 Maximum margin: optimization problem
We want to find a line with maximum margin:
$\max_{w, w_0} \min_i \frac{y^{(i)} (w^T x^{(i)} + w_0)}{\|w\|} = \max_{w, w_0} \frac{1}{\|w\|} \min_i y^{(i)} (w^T x^{(i)} + w_0)$
This is hard to solve directly. Instead we set $\min_i y^{(i)} (w^T x^{(i)} + w_0) = 1$, which gives an equivalent problem because $w$ and $w_0$ can be rescaled jointly without moving the boundary.
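The rescaling argument can be illustrated on a small data set. This is a sketch under assumed toy data and an assumed separating line (not the maximum-margin one): dividing $w$ and $w_0$ by the smallest value of $y^{(i)}(w^T x^{(i)} + w_0)$ makes that minimum equal to 1 while the geometric margin stays the same and becomes $1/\|w\|$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def functional_margins(w, w0, X, y):
    """y_i * (w^T x_i + w0) for every training point."""
    return [yi * (dot(w, xi) + w0) for xi, yi in zip(X, y)]

# Toy data and a separating line, both assumed for illustration.
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
y = [1, 1, -1, -1]
w, w0 = [2.0, 2.0], -4.0

m = min(functional_margins(w, w0, X, y))
margin_before = m / math.sqrt(dot(w, w))

# Rescale so that the closest point has functional margin exactly 1.
w_c = [wj / m for wj in w]
w0_c = w0 / m
min_after = min(functional_margins(w_c, w0_c, X, y))
margin_after = 1.0 / math.sqrt(dot(w_c, w_c))

print(min_after, margin_before, margin_after)
```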

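7 Maximum margin: optimization problem
$\max_{w, w_0} \frac{1}{\|w\|}$ s.t. $y^{(i)} (w^T x^{(i)} + w_0) \ge 1$, $i = 1, \dots, N$
is equivalent to
$\min_{w, w_0} \frac{1}{2} \|w\|^2$ s.t. $y^{(i)} (w^T x^{(i)} + w_0) \ge 1$, $i = 1, \dots, N$
(the factor $\frac{1}{2}$ is a convenience that does not change the minimizer).
A regularization problem subject to the margin constraints.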

8 SVM: primal problem
$\min_{w, w_0} \frac{1}{2} \|w\|^2$ s.t. $y^{(i)} (w^T x^{(i)} + w_0) \ge 1$, $i = 1, \dots, N$
[Figure: axes $x_1$, $x_2$]

9 Representer theorem
Theorem: the solution to the problem
$w^* = \arg\min_{w, w_0} \|w\|^2$, where $\|w\|^2 = w^T w$, s.t. $y^{(i)} (w^T x^{(i)} + w_0) \ge 1$, $i = 1, \dots, N$
can be represented as
$w^* = \sum_{i=1}^{N} \alpha_i x^{(i)}$
That is, $w^*$ is a linear combination of the training examples.

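10 Lagrangian multipliers
$\min_{w, w_0} \frac{1}{2} \|w\|^2$ s.t. $y^{(i)} (w^T x^{(i)} + w_0) \ge 1$, $i = 1, \dots, N$
The above problem is equivalent to the following one:
$\min_{w, w_0} \max_{\{\alpha_i \ge 0\}} \frac{1}{2} \|w\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y^{(i)} (w^T x^{(i)} + w_0)\right)$
where the $\alpha_i$ are the Lagrangian multipliers.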

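11 Maximum margin equivalent problems
$\min_{w, w_0} \max_{\{\alpha_i \ge 0\}} \frac{1}{2} \|w\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y^{(i)} (w^T x^{(i)} + w_0)\right)$
For this problem we can interchange the order of min and max:
$\max_{\{\alpha_i \ge 0\}} \min_{w, w_0} \frac{1}{2} \|w\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y^{(i)} (w^T x^{(i)} + w_0)\right)$
with $\alpha = [\alpha_1, \dots, \alpha_N]$.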

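12 Solving optimization problem
$\max_{\{\alpha_i \ge 0\}} \min_{w, w_0} J(w, w_0, \alpha)$
$J(w, w_0, \alpha) = \frac{1}{2} \|w\|^2 + \sum_{i=1}^{N} \alpha_i \left(1 - y^{(i)} (w^T x^{(i)} + w_0)\right)$
$\frac{\partial J(w, w_0, \alpha)}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{N} \alpha_i y^{(i)} x^{(i)}$
$\frac{\partial J(w, w_0, \alpha)}{\partial w_0} = 0 \Rightarrow \sum_{i=1}^{N} \alpha_i y^{(i)} = 0$
$w_0$ drops out of the problem, but a global constraint on $\alpha$ is created.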
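Substituting the two stationarity conditions back into $J$ leaves an objective in $\alpha$ alone. The following worked step is a sketch of that substitution, using only the symbols defined on this slide:

```latex
\begin{aligned}
J(w, w_0, \alpha)
  &= \tfrac{1}{2} w^T w + \sum_{i=1}^{N} \alpha_i
     - \sum_{i=1}^{N} \alpha_i y^{(i)} w^T x^{(i)}
     - w_0 \sum_{i=1}^{N} \alpha_i y^{(i)} \\
  &= \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} x^{(i)T} x^{(j)}
     + \sum_{i=1}^{N} \alpha_i
     - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} x^{(i)T} x^{(j)}
     - 0 \\
  &= \sum_{i=1}^{N} \alpha_i
     - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} x^{(i)T} x^{(j)}
\end{aligned}
```

The second line uses $w = \sum_i \alpha_i y^{(i)} x^{(i)}$ in both the quadratic term and the cross term, and $\sum_i \alpha_i y^{(i)} = 0$ to eliminate the $w_0$ term.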

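13 Maximum margin: dual problem
$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} x^{(i)T} x^{(j)}$
subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$ and $\alpha_i \ge 0$, $i = 1, \dots, N$
This is a quadratic programming (QP) problem. We find $\alpha$ by solving it, and then
$w = \sum_{i=1}^{N} \alpha_i y^{(i)} x^{(i)}$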

14 Maximum margin: decision boundary
If a training point satisfies $y^{(i)} (w^T x^{(i)} + w_0) > 1$, its constraint is inactive, so $\alpha_i = 0$ and $x^{(i)}$ is not a support vector.
The direction of the hyper-plane depends only on the support vectors:
$w = \sum_{\alpha_i > 0} \alpha_i y^{(i)} x^{(i)}$
$w_0$ can be set by making the margin equidistant to the two classes.

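15 Karush-Kuhn-Tucker (KKT) conditions
$\frac{\partial L(w, w_0, \alpha)}{\partial w} = 0$
$\frac{\partial L(w, w_0, \alpha)}{\partial w_0} = 0$
$y^{(i)} (w^T x^{(i)} + w_0) \ge 1$, $i = 1, \dots, N$
$\alpha_i \ge 0$, $i = 1, \dots, N$
$\alpha_i \left(1 - y^{(i)} (w^T x^{(i)} + w_0)\right) = 0$, $i = 1, \dots, N$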
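The conditions can be verified by hand on a tiny problem. The sketch below assumes a two-point training set and a candidate solution ($\alpha_1 = \alpha_2 = 0.5$, $w_0 = 0$) chosen for illustration, then checks each KKT condition in turn.

```python
# Two-point toy problem (assumed for illustration).
X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]
alpha = [0.5, 0.5]   # candidate multipliers for this toy problem
w0 = 0.0
TOL = 1e-12

# Stationarity in w: w = sum_i alpha_i * y_i * x_i
w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(2)]

def margin(xi, yi):
    """y_i * (w^T x_i + w0)"""
    return yi * (sum(wk * xk for wk, xk in zip(w, xi)) + w0)

# Stationarity in w0: sum_i alpha_i * y_i = 0
stationarity_w0 = abs(sum(a * yi for a, yi in zip(alpha, y))) < TOL
# Primal feasibility: every margin constraint holds.
primal_feasible = all(margin(xi, yi) >= 1 - TOL for xi, yi in zip(X, y))
# Dual feasibility: multipliers are non-negative.
dual_feasible = all(a >= 0 for a in alpha)
# Complementary slackness: alpha_i * (1 - margin_i) = 0
complementary = all(abs(a * (1 - margin(xi, yi))) < TOL
                    for a, xi, yi in zip(alpha, X, y))

print(w, stationarity_w0, primal_feasible, dual_feasible, complementary)
```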

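16 Support Vectors
[Figure: axes $x_1$, $x_2$; support vectors marked $\alpha > 0$; margin width $\frac{1}{\|w\|}$]

17 Support Vectors
[Figure: axes $x_1$, $x_2$; support vectors marked $\alpha > 0$, other points marked $\alpha = 0$; margin width $\frac{1}{\|w\|}$]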

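18 Classification of a test example
Test example: $x$
$\hat{y} = \operatorname{sign}(w_0 + w^T x) = \operatorname{sign}\left(w_0 + \left(\sum_{\alpha_i > 0} \alpha_i y^{(i)} x^{(i)}\right)^T x\right) = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} x^{(i)T} x\right)$
The classifier is based on the expansion in terms of dot products of $x$ with the support vectors.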
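As a rough illustration of this expansion, the sketch below classifies test points using only the training points with $\alpha_i > 0$ and compares the result with the explicit weight-vector form. The training set, the multipliers and `w0` are assumptions for a toy problem, not values from the slides.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy training set (assumed): the third point lies well outside the margin,
# so its multiplier is zero and it is not a support vector.
X = [[1.0, 0.0], [-1.0, 0.0], [3.0, 1.0]]
y = [1, -1, 1]
alpha = [0.5, 0.5, 0.0]
w0 = 0.0

def predict(x):
    """sign(w0 + sum over support vectors of alpha_i * y_i * x_i^T x)."""
    score = w0 + sum(a * yi * dot(xi, x)
                     for a, yi, xi in zip(alpha, y, X) if a > 0)
    return 1 if score >= 0 else -1

# The same classifier written with an explicit weight vector.
w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(2)]

def predict_primal(x):
    return 1 if w0 + dot(w, x) >= 0 else -1

tests = [[2.0, 5.0], [-0.5, 3.0], [0.1, -4.0]]
print([predict(t) for t in tests], [predict_primal(t) for t in tests])
```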

19 Support Vectors
The linear hyper-plane is defined by the support vectors.
Support vectors are sufficient to predict labels of new points.
How many support vectors in linearly separable case (d dimensions)? d

20 More general than linearly separable case
Allow error in classification:
Overlapping classes that can be approximately separated by a linear boundary
Noise in the linearly separable data
[Figures: axes $x_1$, $x_2$]

21 More general than linearly separable case: objective function
Minimizing the number of misclassified points? NP-complete.
Soft margin: maximizing a margin while trying to minimize the distance between misclassified points and their correct plane.

22 SVM (soft margin): primal problem
SVM with slack variables:
$\min_{w, w_0, \{\xi_i\}} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$
s.t. $y^{(i)} (w^T x^{(i)} + w_0) \ge 1 - \xi_i$, $i = 1, \dots, N$
$\xi_i \ge 0$
$\xi_i$: slack variables
$\xi_i > 1$: if $x^{(i)}$ is misclassified
$0 < \xi_i < 1$: if $x^{(i)}$ is correctly classified but inside the margin
[Figure: axes $x_1$, $x_2$]

23 Soft margin
Linear penalty for mistakes (hinge loss)
$C$: tradeoff parameter
The soft margin approach is still a QP and has a unique minimum.
[Figure: points labelled $\xi = 0$]

24 Soft margin parameter
$C$ is a regularization parameter:
small $C$ allows margin constraints to be easily ignored, giving a large margin
large $C$ makes constraints hard to ignore, giving a narrow margin
$C = \infty$ enforces all constraints: hard margin
$C$ can be determined using a technique like cross-validation.

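25 SVM (soft margin): dual problem
$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} x^{(i)T} x^{(j)}$
subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$ and $0 \le \alpha_i \le C$, $i = 1, \dots, N$
By solving the above quadratic problem we first find $\alpha$ and then $w = \sum_{i=1}^{N} \alpha_i y^{(i)} x^{(i)}$.
For a test sample $x$ (as before):
$\hat{y} = \operatorname{sign}(w_0 + w^T x) = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} x^{(i)T} x\right)$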

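26 SVM (soft margin): another view
$\min_{w, w_0, \{\xi_i\}} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$
s.t. $y^{(i)} (w^T x^{(i)} + w_0) \ge 1 - \xi_i$, $i = 1, \dots, N$, and $\xi_i \ge 0$
is equivalent to the unconstrained optimization problem
$\min_{w, w_0} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \max\left(0, 1 - y^{(i)} (w^T x^{(i)} + w_0)\right)$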
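The equivalence rests on the fact that, for fixed $w$ and $w_0$, the smallest feasible slack is $\xi_i = \max(0, 1 - y^{(i)}(w^T x^{(i)} + w_0))$. The sketch below evaluates the unconstrained objective for an assumed toy data set and an assumed (not optimized) $w$, $w_0$, using the $\frac{1}{2}\|w\|^2$ convention of the Lagrangian slides.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hinge(z):
    """Smallest slack satisfying z >= 1 - xi and xi >= 0."""
    return max(0.0, 1.0 - z)

def soft_margin_objective(w, w0, X, y, C):
    """Return (0.5*||w||^2 + C * sum of slacks, list of slacks)."""
    slack = [hinge(yi * (dot(w, xi) + w0)) for xi, yi in zip(X, y)]
    return 0.5 * dot(w, w) + C * sum(slack), slack

# Toy data and parameters, assumed for illustration only.
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.5, 1.0]]
y = [1, 1, -1, -1]
w, w0, C = [1.0, 0.0], 0.0, 1.0

objective, slack = soft_margin_objective(w, w0, X, y, C)
# slack[1] is between 0 and 1: correctly classified but inside the margin.
# slack[3] is above 1: misclassified.
print(objective, slack)
```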

27 SVM loss function vs. other loss functions
Hinge loss vs. 0-1 loss
Hinge loss: $\max\left(0, 1 - y^{(i)} (w^T x^{(i)} + w_0)\right)$
[Figure: 0-1 loss and hinge loss plotted against $w^T x + w_0$ for $y = 1$]

28 SVM loss function vs. other loss functions
Hinge loss vs. log conditional likelihood
Hinge loss: $\max\left(0, 1 - y^{(i)} (w^T x^{(i)} + w_0)\right)$
[Figure: hinge loss plotted against $w^T x + w_0$ for $y = 1$]
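The loss curves compared on these two slides can be tabulated as functions of $z = y (w^T x + w_0)$. This is a small sketch; writing the negative log conditional likelihood as $\log(1 + e^{-z})$ (the logistic loss) is an assumption about which form the slide plots.

```python
import math

def zero_one(z):
    """1 if the point is misclassified (z <= 0), else 0."""
    return 1.0 if z <= 0 else 0.0

def hinge(z):
    return max(0.0, 1.0 - z)

def logistic(z):
    """Negative log conditional likelihood of a logistic model (assumed form)."""
    return math.log(1.0 + math.exp(-z))

zs = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]
table = [(z, zero_one(z), hinge(z), logistic(z)) for z in zs]
for z, l01, lh, ll in table:
    print(f"z={z:+.1f}  0-1={l01:.0f}  hinge={lh:.2f}  log={ll:.3f}")
```

On these sample points the hinge loss is never below the 0-1 loss, and it reaches exactly zero once $z \ge 1$, whereas the logistic loss stays positive.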

29 Not linearly separable data
Noisy data or overlapping classes (discussed above: soft margin): near linearly separable
Non-linear decision surface: transform to a new feature space
[Figures: axes $x_1$, $x_2$]

30 Nonlinear SVM
$\phi: x \to \phi(x)$

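31 SVM in a transformed space
$\phi: \mathbb{R}^d \to \mathbb{R}^m$, $x \to \phi(x)$
Find a hyper-plane in the feature space:
$f(x) = w^T \phi(x) + w_0 = 0$
[Figure: input space with axes $x_1$, $x_2$ mapped by $\phi$ to feature space with axes $\phi_1(x)$, $\phi_2(x)$]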

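32 SVM in a transformed space: primal
Transform to a space where the data can be classified linearly.
Primal problem:
$\min_{w, w_0} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$
s.t. $y^{(i)} (w^T \phi(x^{(i)}) + w_0) \ge 1 - \xi_i$, $i = 1, \dots, N$, and $\xi_i \ge 0$
with $w \in \mathbb{R}^m$.
Classifying a new data point:
$\hat{y} = \operatorname{sign}(w_0 + w^T \phi(x)) = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \phi(x^{(i)})^T \phi(x)\right)$
If $m \gg d$ (very high dimensional feature space), then there are many more parameters to learn.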

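33 SVM classifier in a transformed space: dual
Optimization problem:
$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \phi(x^{(i)})^T \phi(x^{(j)})$
subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$ and $0 \le \alpha_i \le C$, $i = 1, \dots, N$
If we have the inner products $\phi(x^{(i)})^T \phi(x^{(j)})$, only $\alpha = [\alpha_1, \dots, \alpha_N]$ needs to be learnt.
It is not necessary to learn $m$ parameters.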

34 Kernel trick
Kernel: inner product in a feature space
$k(x, x') = \phi(x)^T \phi(x')$, where $\phi(x) = [\phi_1(x), \dots, \phi_m(x)]^T$
Kernel trick: an extension of many well-known algorithms.
Idea: the input vectors enter only in the form of scalar products, so we can replace that inner product with other choices of kernel.

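35 Kernel SVM
Optimization problem (with $\phi(x^{(i)})^T \phi(x^{(j)})$ replaced by $k(x^{(i)}, x^{(j)})$):
$\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} k(x^{(i)}, x^{(j)})$
subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$ and $0 \le \alpha_i \le C$, $i = 1, \dots, N$
Classifying a new data point (with $\phi(x^{(i)})^T \phi(x)$ replaced by $k(x^{(i)}, x)$):
$\hat{y} = \operatorname{sign}(w_0 + w^T \phi(x)) = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} k(x^{(i)}, x)\right)$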
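To make the kernelized decision rule concrete, here is a sketch on the XOR problem with the polynomial kernel $k(x, x') = (1 + x^T x')^2$. The multipliers $\alpha_i = 1/8$ and $w_0 = 0$ are taken as given for this textbook example rather than computed by a QP solver; the code only checks that they classify every training point with $y^{(i)} f(x^{(i)}) = 1$.

```python
def poly_kernel(u, v):
    """Polynomial kernel (1 + u^T v)^2."""
    return (1.0 + sum(a * b for a, b in zip(u, v))) ** 2

# XOR data: not linearly separable in the input space.
X = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
y = [-1, 1, 1, -1]
alpha = [0.125] * 4   # assumed dual solution for this example
w0 = 0.0

def decision(x):
    """w0 + sum over support vectors of alpha_i * y_i * k(x_i, x)."""
    return w0 + sum(a * yi * poly_kernel(xi, x)
                    for a, yi, xi in zip(alpha, y, X) if a > 0)

def predict(x):
    return 1 if decision(x) >= 0 else -1

print([decision(xi) for xi in X])
print([predict(xi) for xi in X])
```

Every training point is a support vector here, and no feature vector $\phi(x)$ is ever formed explicitly: only kernel evaluations are needed.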

36 Constructing kernels
Construct kernel functions directly, ensuring that the result is a valid kernel, i.e. that it corresponds to an inner product in some feature space.
Example: $k(x, x') = (x^T x')^2$ with $\phi(x) = [x_1^2, \sqrt{2} x_1 x_2, x_2^2]^T$
We need a way to test whether a kernel is valid without having to construct $\phi(x)$.
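The example on this slide can be checked numerically for two-dimensional inputs. The sketch below compares the kernel value with the explicit inner product in feature space; the two input vectors are arbitrary illustrative choices.

```python
import math

def kernel(x, xp):
    """k(x, x') = (x^T x')^2"""
    return sum(a * b for a, b in zip(x, xp)) ** 2

def phi(x):
    """Explicit feature map for 2-D inputs: [x1^2, sqrt(2)*x1*x2, x2^2]."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

# Arbitrary illustrative inputs.
x, xp = [1.0, 2.0], [3.0, -1.0]

k_value = kernel(x, xp)
inner = sum(a * b for a, b in zip(phi(x), phi(xp)))
print(k_value, inner)
```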

37 Valid kernel: necessary & sufficient conditions [Shawe-Taylor & Cristianini 2004]
Gram matrix $K_{N \times N}$: $K_{ij} = k(x^{(i)}, x^{(j)})$, obtained by restricting the kernel function to a set of points $\{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$.
$K$ must be positive semi-definite for all possible choices of the set $\{x^{(i)}\}$.
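A numerical spot check can refute validity but cannot prove it. The sketch below evaluates the quadratic form $z^T K z$ for random $z$ on a Gaussian Gram matrix (which should never be negative), and shows that the squared Euclidean distance, used here as an assumed counterexample, yields a negative value and so is not a valid kernel.

```python
import math
import random

def gaussian(u, v, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * sigma ** 2))

def sq_dist(u, v):
    """Squared distance: symmetric, but NOT a valid kernel."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def gram(k, pts):
    return [[k(p, q) for q in pts] for p in pts]

def quad_form(K, z):
    n = len(z)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))

random.seed(0)
pts = [[random.uniform(-2, 2), random.uniform(-2, 2)] for _ in range(5)]
K = gram(gaussian, pts)
# Positive semi-definiteness requires z^T K z >= 0 for every z; sample a few.
min_form = min(quad_form(K, [random.gauss(0, 1) for _ in pts])
               for _ in range(200))

# One negative quadratic form is enough to rule a function out.
K_bad = gram(sq_dist, [[0.0, 0.0], [1.0, 0.0]])
bad_form = quad_form(K_bad, [1.0, -1.0])

print(min_form >= -1e-9, bad_form)
```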

38 Some common kernels
Linear: $k(x, x') = x^T x'$
Polynomial: $k(x, x') = (x^T x' + c)^m$, $c \ge 0$ (contains all polynomial terms up to degree $m$)
Gaussian: $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$ (infinite dimensional feature space)
Sigmoid: $k(x, x') = \tanh(a x^T x' + b)$
Stationary kernels: functions of $x - x'$
RBF functions: $k(x, x') = g(\|x - x'\|)$
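These kernels translate directly into short functions. The sketch below uses assumed default hyperparameters (`c=1`, `m=2`, `sigma=1`, `a=1`, `b=0`) and illustrates the stationarity remark: shifting both inputs by the same vector leaves the Gaussian kernel unchanged but changes the linear kernel.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear(u, v):
    return dot(u, v)

def polynomial(u, v, c=1.0, m=2):
    return (dot(u, v) + c) ** m

def gaussian(u, v, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * sigma ** 2))

def sigmoid(u, v, a=1.0, b=0.0):
    return math.tanh(a * dot(u, v) + b)

# Illustrative inputs and shift (assumed values).
x, xp = [1.0, 2.0], [0.5, -1.0]
t = [10.0, -3.0]
xs = [xi + ti for xi, ti in zip(x, t)]
xps = [xi + ti for xi, ti in zip(xp, t)]

print(linear(x, xp), polynomial(x, xp), gaussian(x, xp), sigmoid(x, xp))
print(gaussian(xs, xps), linear(xs, xps))
```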

39-45 SVM Gaussian kernel: Example
[Figures on slides 39 to 45. Source: Zisserman's slides]

46 Kernel function for objects
Kernel functions can be defined over objects: graphs, sets, strings, ...
Kernel for sets (example): $k(A, B) = 2^{|A \cap B|}$
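One way to see why this set kernel is an inner product: $2^{|A \cap B|}$ counts the subsets that $A$ and $B$ have in common, which is the dot product of two indicator vectors indexed by all possible subsets. The sketch below compares the closed form with that explicit count on two small example sets (chosen arbitrarily).

```python
from itertools import combinations

def set_kernel(A, B):
    """k(A, B) = 2^{|A intersect B|}"""
    return 2 ** len(set(A) & set(B))

def subsets(S):
    """All subsets of S, including the empty set, as frozensets."""
    items = sorted(S)
    return {frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)}

# Arbitrary example sets.
A = {1, 2, 3}
B = {2, 3, 4}

k_direct = set_kernel(A, B)
# Feature-space view: count the subsets shared by A and B.
k_feature = len(subsets(A) & subsets(B))
print(k_direct, k_feature)
```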

47 SVM: Summary
Hard margin: maximizing the margin. Primal and dual problems; the dual problem represents the classifier decision in terms of support vectors. It is a quadratic optimization problem with a single global minimum.
Soft margin: handling noisy data and overlapping classes. Slack variables are added to the primal problem; the dual problem is again a quadratic problem.
Kernel SVMs: learn a linear decision boundary in a high dimensional space using SVM, so that suitable boundaries can be found in the original space. Kernel trick.
