Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012
Linear classifier: Which classifier?
[figure: two classes of points in the (x1, x2) plane]
Linear classifier: Margin concept
Support Vector Machine (SVM): a classifier that finds the solution with maximum margin.
[figure in the (x1, x2) plane]
Linear classifier geometry (recall)
Discriminant function: $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$
(Signed) distance from $\mathbf{x}$ to the boundary: $\frac{\mathbf{w}^T \mathbf{x} + w_0}{\|\mathbf{w}\|}$
The distance does not change when $\mathbf{w}$ and $w_0$ are scaled equally.
Margin
Margin (when the classes are linearly separable): the distance to the closest point,
$\min_i \frac{y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right)}{\|\mathbf{w}\|}$
Maximum margin: optimization problem
We want to find the separator with maximum margin:
$\max_{\mathbf{w}, w_0} \min_i \frac{y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right)}{\|\mathbf{w}\|} = \max_{\mathbf{w}, w_0} \frac{1}{\|\mathbf{w}\|} \min_i y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right)$
This is hard to solve directly. Instead we set $\min_i y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) = 1$, which we may do since $\mathbf{w}$ and $w_0$ can be scaled freely (an equivalent problem).
Maximum margin: optimization problem
$\max_{\mathbf{w}, w_0} \frac{1}{\|\mathbf{w}\|}$  s.t. $y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) \geq 1, \; i = 1, \dots, N$
$\Leftrightarrow \min_{\mathbf{w}, w_0} \|\mathbf{w}\|^2$  s.t. $y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) \geq 1, \; i = 1, \dots, N$
A regularization problem subject to the margin constraints.
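As an illustration (not part of the original slides), a minimal sketch of solving this primal QP numerically, assuming the cvxpy library and a tiny hand-made separable dataset:

```python
# Sketch: solve the hard-margin primal QP with cvxpy (toy 2-D separable data, assumed).
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])  # inputs x^(i)
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels y^(i) in {-1, +1}

w = cp.Variable(2)
w0 = cp.Variable()
# min ||w||^2  s.t.  y^(i) (w^T x^(i) + w_0) >= 1
constraints = [cp.multiply(y, X @ w + w0) >= 1]
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
prob.solve()
print("w =", w.value, "w0 =", w0.value, "margin =", 1.0 / np.linalg.norm(w.value))
```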
SVM: primal problem
$\min_{\mathbf{w}, w_0} \|\mathbf{w}\|^2$  s.t. $y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) \geq 1, \; i = 1, \dots, N$
[figure: maximum-margin separator in the (x1, x2) plane]
Representer theorem
Theorem: the solution to
$\mathbf{w}^* = \arg\min_{\mathbf{w}, w_0} \|\mathbf{w}\|^2 = \mathbf{w}^T \mathbf{w}$  s.t. $y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) \geq 1, \; i = 1, \dots, N$
can be represented as $\mathbf{w} = \sum_{i=1}^{N} \alpha_i \mathbf{x}^{(i)}$,
i.e. $\mathbf{w}$ is a linear combination of the training examples.
Lagrange multipliers
$\min_{\mathbf{w}, w_0} \|\mathbf{w}\|^2$  s.t. $y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) \geq 1, \; i = 1, \dots, N$
The above problem is equivalent to:
$\min_{\mathbf{w}, w_0} \max_{\{\alpha_i \geq 0\}} \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N} \alpha_i \left[1 - y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right)\right]$
where the $\alpha_i$ are Lagrange multipliers.
Maximum margin: equivalent problems
$\min_{\mathbf{w}, w_0} \max_{\{\alpha_i \geq 0\}} \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N} \alpha_i \left[1 - y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right)\right]$
For this problem we can interchange the order of min and max:
$\max_{\{\alpha_i \geq 0\}} \min_{\mathbf{w}, w_0} \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N} \alpha_i \left[1 - y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right)\right]$
where $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_N]$.
Solving the optimization problem
$\max_{\{\alpha_i \geq 0\}} \min_{\mathbf{w}, w_0} J(\mathbf{w}, w_0, \boldsymbol{\alpha})$, where
$J(\mathbf{w}, w_0, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N} \alpha_i \left[1 - y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right)\right]$
$\frac{\partial J}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N} \alpha_i y^{(i)} \mathbf{x}^{(i)}$
$\frac{\partial J}{\partial w_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y^{(i)} = 0$
$w_0$ drops out of the objective, but a global constraint on $\boldsymbol{\alpha}$ is created.
Maximum margin: dual problem
$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \mathbf{x}^{(i)T} \mathbf{x}^{(j)}$
subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$ and $\alpha_i \geq 0, \; i = 1, \dots, N$
This is a QP (quadratic programming) problem: we find $\boldsymbol{\alpha}$ by solving it, and then $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y^{(i)} \mathbf{x}^{(i)}$.
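A sketch of solving this dual QP with SciPy's SLSQP solver (the toy data and the solver choice are assumptions, not part of the slides); it then recovers $\mathbf{w}$ from the optimal $\boldsymbol{\alpha}$:

```python
# Sketch: solve the hard-margin dual QP with scipy's SLSQP solver (assumed toy data).
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)
Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y^(i) y^(j) x^(i)T x^(j)

def neg_dual(alpha):                         # minimize the negative dual objective
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0, None)] * N,                                # alpha_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i alpha_i y^(i) = 0
alpha = res.x
w = (alpha * y) @ X                          # w = sum_i alpha_i y^(i) x^(i)
print("alpha =", alpha.round(3), "w =", w)
```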
Maximum margin: decision boundary
If a particular $\mathbf{x}^{(i)}$ satisfies $y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) > 1$ (it lies strictly beyond the margin), then $\alpha_i = 0$ and $\mathbf{x}^{(i)}$ is not a support vector (inactive constraint).
The direction of the hyperplane depends only on the support vectors:
$\mathbf{w} = \sum_{\alpha_i > 0} \alpha_i y^{(i)} \mathbf{x}^{(i)}$
$w_0$ can be set by making the margin equidistant from the two classes.
Karush-Kuhn-Tucker (KKT) conditions
$\frac{\partial L(\mathbf{w}, w_0, \boldsymbol{\alpha})}{\partial \mathbf{w}} = 0$, $\frac{\partial L(\mathbf{w}, w_0, \boldsymbol{\alpha})}{\partial w_0} = 0$
$y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) \geq 1, \; i = 1, \dots, N$
$\alpha_i \geq 0, \; i = 1, \dots, N$
$\alpha_i \left[1 - y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right)\right] = 0, \; i = 1, \dots, N$
Support vectors
[figures: points lying on the margin (at distance $1/\|\mathbf{w}\|$ from the boundary) have $\alpha > 0$; points away from the margin have $\alpha = 0$]
Classification of a test example
For a test example $\mathbf{x}$:
$y = \operatorname{sign}\left(w_0 + \mathbf{w}^T \mathbf{x}\right) = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \mathbf{x}^{(i)T} \mathbf{x}\right)$
The classifier is based on an expansion in terms of dot products of $\mathbf{x}$ with the support vectors.
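A short illustrative snippet (the support vectors, multipliers, and offset below are hypothetical values chosen only for the example) showing how the support-vector expansion classifies a test point:

```python
# Sketch: classify a test point from the support-vector expansion (hypothetical alphas/SVs).
import numpy as np

sv_X = np.array([[2.0, 2.0], [-1.0, -1.0]])   # support vectors (alpha_i > 0), assumed
sv_y = np.array([1.0, -1.0])
sv_alpha = np.array([0.16, 0.16])
w0 = -0.5                                      # assumed offset

def predict(x):
    # y = sign(w_0 + sum_{alpha_i > 0} alpha_i y^(i) x^(i)T x)
    return np.sign(w0 + np.sum(sv_alpha * sv_y * (sv_X @ x)))

print(predict(np.array([2.5, 2.0])), predict(np.array([-2.0, 0.0])))
```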
Support vectors
The linear hyperplane is defined by the support vectors; the support vectors are sufficient to predict the labels of new points.
How many support vectors in the linearly separable case ($d$ dimensions)? $d + 1$
More general than the linearly separable case
Allow errors in classification:
- Overlapping classes that can be approximately separated by a linear boundary
- Noise in otherwise linearly separable data
[figures in the (x1, x2) plane]
More general than the linearly separable case: objective function
Minimizing the number of misclassified points?! NP-complete.
Soft margin: maximize the margin while trying to minimize the distance of misclassified points from the correct side of the margin.
SVM (soft margin): primal problem
SVM with slack variables:
$\min_{\mathbf{w}, w_0, \boldsymbol{\xi}} \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$
s.t. $y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) \geq 1 - \xi_i$ and $\xi_i \geq 0, \; i = 1, \dots, N$
$\xi_i$: slack variables. $\xi_i > 1$: $\mathbf{x}^{(i)}$ is misclassified; $0 \leq \xi_i < 1$: $\mathbf{x}^{(i)}$ is correctly classified but lies inside the margin.
[figure in the (x1, x2) plane]
Soft margin
Linear penalty for a mistake (hinge loss); $C$: trade-off parameter.
The soft-margin approach is still a QP and has a unique minimum.
[figure: $\xi = 0$ for points on or outside the margin]
Soft margin parameter
$C$ is a regularization parameter:
- small $C$ allows the margin constraints to be easily ignored → large margin
- large $C$ makes the constraints hard to ignore → narrow margin
- $C = \infty$ enforces all constraints: hard margin
$C$ can be determined using a technique such as cross-validation (see the sketch below).
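A sketch, assuming scikit-learn is available, of selecting $C$ by cross-validation on synthetic data:

```python
# Sketch: choosing C by cross-validation with scikit-learn (library assumed available).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])  # two noisy blobs
y = np.array([1] * 50 + [-1] * 50)

grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"], "cv accuracy:", grid.best_score_)
```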
SVM (soft margin): dual problem
$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \mathbf{x}^{(i)T} \mathbf{x}^{(j)}$
subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$ and $0 \leq \alpha_i \leq C, \; i = 1, \dots, N$
By solving this quadratic problem we first find $\boldsymbol{\alpha}$ and then $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y^{(i)} \mathbf{x}^{(i)}$.
For a test sample $\mathbf{x}$ (as before): $y = \operatorname{sign}\left(w_0 + \mathbf{w}^T \mathbf{x}\right) = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \mathbf{x}^{(i)T} \mathbf{x}\right)$
SVM (soft margin): another view
$\min_{\mathbf{w}, w_0, \boldsymbol{\xi}} \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$
s.t. $y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right) \geq 1 - \xi_i$ and $\xi_i \geq 0, \; i = 1, \dots, N$
is equivalent to the unconstrained optimization problem
$\min_{\mathbf{w}, w_0} \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \max\left(0, 1 - y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right)\right)$
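A rough sketch (not from the slides) of minimizing this unconstrained hinge-loss objective by subgradient descent in plain NumPy; the synthetic data, step size, and iteration count are arbitrary choices:

```python
# Sketch: minimizing ||w||^2 + C * sum hinge(...) by subgradient descent (pure NumPy).
import numpy as np

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
y = np.array([1.0] * 50 + [-1.0] * 50)
C, lr, epochs = 1.0, 0.01, 200

w, w0 = np.zeros(2), 0.0
for _ in range(epochs):
    margins = y * (X @ w + w0)
    viol = margins < 1                    # points with non-zero hinge loss
    # subgradient of ||w||^2 + C * sum_i max(0, 1 - y^(i)(w^T x^(i) + w_0))
    grad_w = 2 * w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_w0 = -C * y[viol].sum()
    w -= lr * grad_w
    w0 -= lr * grad_w0

print("w =", w, "w0 =", w0, "train accuracy:", np.mean(np.sign(X @ w + w0) == y))
```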
SVM loss function vs. other loss functions
Hinge loss vs. 0-1 loss: $\max\left(0, 1 - y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right)\right)$
[figure: 0-1 loss and hinge loss plotted against $\mathbf{w}^T \mathbf{x} + w_0$ for $y = 1$]
SVM loss function vs. other loss functions
Hinge loss vs. log conditional likelihood: $\max\left(0, 1 - y^{(i)}\left(\mathbf{w}^T \mathbf{x}^{(i)} + w_0\right)\right)$
[figure: hinge loss plotted against $\mathbf{w}^T \mathbf{x} + w_0$ for $y = 1$]
Not linearly separable data
- Noisy data or overlapping classes (discussed above: soft margin) — nearly linearly separable
- Non-linear decision surface — transform to a new feature space
[figures in the (x1, x2) plane]
Nonlinear SVM
$\Phi: \mathbf{x} \mapsto \varphi(\mathbf{x})$
SVM in a transformed space
$\varphi: \mathbb{R}^d \to \mathbb{R}^m$, $\mathbf{x} \mapsto \varphi(\mathbf{x})$
Find a hyperplane in the feature space: $f(\mathbf{x}) = \mathbf{w}^T \varphi(\mathbf{x}) + w_0 = 0$
[figure: mapping from the (x1, x2) plane to the (φ1(x), φ2(x)) plane]
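As an illustration (assuming scikit-learn), mapping the inputs through an explicit quadratic feature map and fitting a linear SVM in that space can separate data that is not linearly separable in the original space:

```python
# Sketch: mapping to a feature space phi(x) and fitting a linear SVM there (scikit-learn assumed).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)        # classes separated by a circle, not a line

phi = PolynomialFeatures(degree=2, include_bias=False)    # phi(x): x1, x2, x1^2, x1*x2, x2^2
clf = LinearSVC(C=1.0, max_iter=10000).fit(phi.fit_transform(X), y)
print("train accuracy in feature space:", clf.score(phi.transform(X), y))
```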
SVM in a transformed space: primal
Primal problem:
$\min_{\mathbf{w}, w_0, \boldsymbol{\xi}} \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$
s.t. $y^{(i)}\left(\mathbf{w}^T \varphi(\mathbf{x}^{(i)}) + w_0\right) \geq 1 - \xi_i$ and $\xi_i \geq 0, \; i = 1, \dots, N$
Classifying a new data point: $y = \operatorname{sign}\left(w_0 + \mathbf{w}^T \varphi(\mathbf{x})\right) = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \varphi(\mathbf{x}^{(i)})^T \varphi(\mathbf{x})\right)$
We transform to a space where the data can be classified linearly; here $\mathbf{w} \in \mathbb{R}^m$.
If $m \gg d$ (a very high-dimensional feature space), there are many more parameters to learn.
SVM classifier in a transformed space: dual
Optimization problem:
$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \varphi(\mathbf{x}^{(i)})^T \varphi(\mathbf{x}^{(j)})$
subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$ and $0 \leq \alpha_i \leq C, \; i = 1, \dots, N$
If we have the inner products $\varphi(\mathbf{x}^{(i)})^T \varphi(\mathbf{x}^{(j)})$, only $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_N]$ needs to be learnt; it is not necessary to learn the $m$ parameters of $\mathbf{w}$.
Kernel trick
Kernel: an inner product in a feature space, $k(\mathbf{x}, \mathbf{x}') = \varphi(\mathbf{x})^T \varphi(\mathbf{x}')$ where $\varphi(\mathbf{x}) = [\varphi_1(\mathbf{x}), \dots, \varphi_m(\mathbf{x})]^T$.
Kernel trick: an extension of many well-known algorithms.
Idea: when the input vectors enter only in the form of scalar products, we can replace that inner product with other choices of kernel.
Kernel SVM
Optimization problem:
$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$
subject to $\sum_{i=1}^{N} \alpha_i y^{(i)} = 0$ and $0 \leq \alpha_i \leq C, \; i = 1, \dots, N$
Classifying a new data point:
$y = \operatorname{sign}\left(w_0 + \mathbf{w}^T \varphi(\mathbf{x})\right) = \operatorname{sign}\left(w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} k(\mathbf{x}^{(i)}, \mathbf{x})\right)$
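A sketch of a kernel SVM where the learner only sees the Gram matrix of kernel values; it assumes scikit-learn's support for precomputed kernels and uses a hand-rolled Gaussian kernel:

```python
# Sketch: kernel SVM where only the Gram matrix k(x^(i), x^(j)) is supplied (scikit-learn assumed).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

def gaussian_kernel(A, B, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), computed for all pairs of rows
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

K = gaussian_kernel(X, X)
clf = SVC(kernel="precomputed", C=1.0).fit(K, y)

X_test = rng.randn(5, 2)
K_test = gaussian_kernel(X_test, X)     # k(test point, training point), shape (5, 200)
print(clf.predict(K_test))
```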
Constructing kernels
Construct kernel functions directly, ensuring that each is a valid kernel, i.e. corresponds to an inner product in some feature space.
Example: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}')^2$ with $\varphi(\mathbf{x}) = [x_1^2, \sqrt{2}\, x_1 x_2, x_2^2]^T$
We need a way to test whether a kernel is valid without having to construct $\varphi(\mathbf{x})$.
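A quick numerical check (illustrative only) that this kernel and feature map agree:

```python
# Sketch: check numerically that k(x, x') = (x^T x')^2 equals phi(x)^T phi(x')
# with phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ xp) ** 2, phi(x) @ phi(xp))   # both give the same value (here 1.0)
```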
Valid kernel: necessary & sufficient conditions [Shawe-Taylor & Cristianini 2004]
Gram matrix $K_{N \times N}$: $K_{ij} = k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$, i.e. the kernel function restricted to a set of points $\{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(N)}\}$.
$K$ must be positive semi-definite for all possible choices of the set $\{\mathbf{x}^{(i)}\}$.
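An empirical sanity check (not a proof of validity, since the condition must hold for all point sets): compute the Gram matrix on a random sample and inspect its smallest eigenvalue; a clearly negative eigenvalue would show the kernel is not valid.

```python
# Sketch: empirically checking positive semi-definiteness of a Gram matrix on a sample of points.
import numpy as np

def gram(kernel, X):
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

X = np.random.RandomState(0).randn(20, 3)
K = gram(lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2), X)   # Gaussian kernel, sigma = 1
print("min eigenvalue:", np.linalg.eigvalsh(K).min())         # >= 0 (up to round-off) if PSD
```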
Some common kernels
- Linear: $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{x}'$
- Polynomial: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + c)^m$, $c \geq 0$ (contains all polynomial terms up to degree $m$)
- Gaussian: $k(\mathbf{x}, \mathbf{x}') = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right)$ (infinite-dimensional feature space)
- Sigmoid: $k(\mathbf{x}, \mathbf{x}') = \tanh(a\, \mathbf{x}^T \mathbf{x}' + b)$
- Stationary kernels: functions of $\mathbf{x} - \mathbf{x}'$
- RBF kernels: $k(\mathbf{x}, \mathbf{x}') = g(\|\mathbf{x} - \mathbf{x}'\|)$
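For concreteness, the first few of these kernels written as plain Python functions (the parameter values below are arbitrary defaults, not recommendations):

```python
# Sketch: common kernels as plain functions on NumPy vectors x, xp.
import numpy as np

def linear(x, xp):                  return x @ xp
def polynomial(x, xp, c=1.0, m=3):  return (x @ xp + c) ** m
def gaussian(x, xp, sigma=1.0):     return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))
def sigmoid(x, xp, a=1.0, b=0.0):   return np.tanh(a * (x @ xp) + b)

x, xp = np.array([1.0, 0.5]), np.array([-0.5, 2.0])
print(linear(x, xp), polynomial(x, xp), gaussian(x, xp), sigmoid(x, xp))
```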
SVM Gaussian kernel: examples [figures omitted]. Source: Zisserman's slides.
Kernel functions for objects
Kernel functions can be defined over objects such as graphs, sets, strings, …
Example kernel for sets: $k(A, B) = 2^{|A \cap B|}$
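A tiny illustration of this set kernel on Python sets:

```python
# Sketch: the set kernel k(A, B) = 2^{|A ∩ B|} from this slide.
def set_kernel(A, B):
    return 2 ** len(A & B)

print(set_kernel({"a", "b", "c"}, {"b", "c", "d"}))   # |A ∩ B| = 2, so k = 4
```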
SVM: Summary
Hard margin: maximizing the margin
- Primal and dual problems
- The dual problem expresses the classifier decision in terms of support vectors
- Quadratic optimization problem with a single global minimum
Soft margin: handling noisy data and overlapping classes
- Slack variables in the primal problem
- Again, the dual problem is a quadratic problem
Kernel SVMs
- Learn a linear decision boundary in a high-dimensional space using SVM, so that suitable (non-linear) boundaries are obtained in the original input space
- Kernel trick