SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels

Size: px

Start display at page:

Download "SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels"

Reginald Bates
5 years ago
Views:

1 SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels Karl Stratos June 21, / 33

2 Tangent: Some Loose Ends in Logistic Regression Polynomial feature expansion in logistic regression Regularization in logistic regression Classification metrics 2 / 33

3 Increasing the Complexity of Logistic Regression The predicted probability of a logistic regressor is p(1 x, w, w 0 ) = σ(w x + w 0 ). Linear decision boundary w x + w 0 = 0 p(1 ) = / 33

4 Polynomial Feature Expansion We can use the same polynomial feature expansion that we used in linear regression. For instance, with d = 2 dimensions p(1 x, w, w 0 ) = σ(w 0 + w 1 x 1 + w 2 x 2 + w 3 x w 4 x 2 2) Nonlinear decision boundary w 0 + w 1 x 1 + w 2 x 2 + w 3 x w 4 x 2 2 = 0 p(1 ) = / 33

5 Tangent: Some Loose Ends in Logistic Regression Polynomial feature expansion in logistic regression Regularization in logistic regression Classification metrics 5 / 33

6 Regularization in Logistic Regression Same idea as in linear regression: penalize the squared l 2 or l 1 norm of the model parameter to prevent the model from becoming too confident about the training data. Squared l 2 regularization with hyperparameter σ 2 i log p(y (i) x (i), w) + 1 2σ 2 w 2 6 / 33

7 Tangent: Some Loose Ends in Logistic Regression Polynomial feature expansion in logistic regression Regularization in logistic regression Classification metrics 7 / 33

8 Label Imbalance Problem Suppose most of the labels are y = 0, say 99.9% of the time. In this scenario, classification accuracy is not a useful metric. tp + tn tp + fp + tn + fn Just by guessing 0 all the time, we get accuracy 99.9%! 8 / 33

9 Label Imbalance Problem Suppose most of the labels are y = 0, say 99.9% of the time. In this scenario, classification accuracy is not a useful metric. tp + tn tp + fp + tn + fn Just by guessing 0 all the time, we get accuracy 99.9%! Consider other metrics, such as Precision: Out of your y = 1 predictions, how many were actually 1? tp tp + fp Recall: Out of points that are labeled 1, how many did you label as y = 1? tp tp + fn 8 / 33

10 More Metrics F1: Harmonic mean of precision and recall precision recall 2 precision + recall PR curve: change threshold and plot precision/recall ROC curve: change threshold and plot false/true positive rates 9 / 33

11 Transition Slide Back to SVMs 10 / 33

12 Review: Basic Support Vector Machines (SVMs) Training data: S = { (x (i), y (i) ) } n i=1 where y(i) {±1} is a binary label of x (i) R d. S is assumed to be linearly separable: there exists w such that y (i) w x (i) > 0 for all i = 1... n. Find a separator that maximizes the margin on S: 11 / 33

13 Review: Basic Support Vector Machines (SVMs) Training data: S = { (x (i), y (i) ) } n i=1 where y(i) {±1} is a binary label of x (i) R d. S is assumed to be linearly separable: there exists w such that y (i) w x (i) > 0 for all i = 1... n. Find a separator that maximizes the margin on S: w svm S := arg max w R d wlog arg max w R d : w =1 arg min w R d : y (i) w x (i) 1 i n min i=1 y(i) w w x(i) n min i=1 y(i) w x (i) w 2... Equivalently, find a separator with the minimum l 2 norm. 11 / 33

14 Review: Inner Product Formulation Using the representer theorem β 1... β n R : w svm S = n β i x (i) we can convert the original problem into an equivalent problem min β 1...β n R n β i β j x (i) x (j) i,j=1 subject to y (i) n β j x (j) x (i) 1 j=1 i=1 i = 1... n where the only information from data we need for training is the inner product between input points. Likewise at test time: w svm S x = n i=1: β i 0 β ix (i) x Allows for the use of kernels (later). 12 / 33

15 Today How to handle non-separable data Connection to a convex surrogate loss on the 0-1 loss How to handle multi-class classification The kernel trick 13 / 33

16 Overview Non-Separable Data Multi-Class Classification Kernel Trick 14 / 33

17 Introduce Slack Variables min w 2 w R d subject to y (i) w x (i) 1 i = 1... n min w 2 + w R d, ξ 1...ξ n R n i=1 ξ i subject to y (i) w x (i) 1 ξ i ξ i 0 i = 1... n i = 1... n 15 / 33

18 Unconstrained Formulation Soft SVM solution w soft S, ξ1... ξn := arg min w 2 + w R d, ξ 1...ξ n R with constraints ξ i max ( 0, 1 y (i) w x (i)) for all i = 1... n. n i=1 ξ i Note that ξ i = max ( 0, 1 y (i) w soft S w soft S = arg min w R d w 2 + n i=1 x (i)), so max (0, 1 y (i) w x (i)) No constraints :) Convex but not differentiable, can still be optimized by subgradient descent 16 / 33

19 Soft SVMs as Empirical Risk Minimization What we really want: minimize the 0-1 loss w S = arg min w R d n i=1 [[ ]] y (i) w x (i) 0 where [[τ]] is 1 if τ is true and 0 otherwise Difficult to optimize (neither convex nor differentiable) Instead minimize the hinge loss w S = arg min w R d n max (0, 1 y (i) w x (i)) } {{ } hinge(w x (i) ) i=1 which is a convex upper bound on the 0-1 loss 17 / 33

20 A Big Picture of Binary Classification Both logistic regression and soft SVMs are l 2 -regularized minimization of the 0-1 loss by convex surrogates. 18 / 33

21 Generalized Representer Theorem Claim. Let l : R n [0, ) be any function and define w = arg min w 2 + l (w x (1),..., w x (n)) w R d Then w = n i=1 β ix (i) for some β 1... β n R. Proof. Same as in the hard SVM case Thus we can simiarly derive an inner product formulation of the soft SVM solution (will come back to this later): w soft S = arg min w R d w 2 + n i=1 max (0, 1 y (i) w x (i)) } {{ } l(w x (1),...,w x (n) ) 19 / 33

22 Overview Non-Separable Data Multi-Class Classification Kernel Trick 20 / 33

23 One-Vs-All Training data: S = { (x (i), y (i) ) } n i=1 where y(i) {1... m} is the label of x (i) R d. Parameters: w y R d for each y {1... m} One-vs-all soft SVM objective: optimize min w 2 + w R d, ξ 1...ξ n R such that ξ i 0 for all i = 1... n and n i=1 ξ i w y(i) x (i) w y x (i) 1 ξ i for all i = 1... n and y {1... m} such that y y (i) 21 / 33

24 Unconstrained Formulation w one vs all S = arg min w 2 + w R d n max i=1 y {1...m}: y y (i) (0, 1 + w y x (i) w y(i) x (i)) 22 / 33

25 One-Vs-One One-vs-one soft SVM objective: optimize n min w 2 + w R d, ξ 1...ξ n R such that ξ i 0 for all i = 1... n and i=1 w y(i) x (i) max w y x (i) 1 ξ i y {1...m}:y y (i) for all i = 1... n Unconstrained Formulation w one vs all S = arg min w 2 + w R d n max i=1 ( ) 0, 1 + max w y x (i) w y(i) x (i) y {1...m}: y y (i) ξ i 23 / 33

26 Overview Non-Separable Data Multi-Class Classification Kernel Trick 24 / 33

27 Inner Product Formulation of Soft SVM (Binary) G R n n : a symmetric matrix with G i,j = x (i) x (j) (i.e., the Gram matrix). β = arg min β R n β Gβ + w soft S = n β i x (i) i=1 n i=1 ) max (0, 1 y (i) β G i User manual 1. Calculate G i,j = x (i) x (j) for every i, j = 1... n. 2. Find β using G (e.g., by subgradient descent). 3. Test time: given a new point x to classify, return n sign β i x (i) x i=1:β i 0 25 / 33

28 Recall: Polynomial Feature Expansion Idea: transform input by φ : R d R D to allow the linear model to better fit the data. Example: expansion by a degree p = 2 polynomial with bias c = 1 ( (x ( ( ) ) φ poly(2,1) 2 (x 1... x d ) = i )i, 2xi x j )i<j, 2xi, 1 i Computationally expensive: time to calculate feature expansion O(d p ) exponential in p 26 / 33

29 But Computing Inner Product is Easy! Inner product between two points x and y in the feature space φ poly(2,1) (x) φ poly(2,1) (y) = i,j x i x j y i y j + 2 i x i y i + 1 = ( ) 2 x y + 1 Instead of computing O(d 2 ) terms in each φ poly(2,1) (x) and φ poly(2,1) (y) and then taking a dot product, we can just 1. Compute z = x y (O(d)-time operation) 2. Square z + 1 (O(1)-time operation) 27 / 33

30 Kernel Trick Idea: when all we need is inner product, we can do implicit feature expansion by a kernel function without ever computing the explicit feature expansion Kernel function K(x, y) is any function that defines pairwise similarity between two data points such that K(x, y) = φ(x) φ(y) for some input mapping φ : R d? Applicable beyond SVMs (e.g., kernel PCA) 28 / 33

31 Degree-p Polynomial Kernel Parameters: degree p, bias c K poly(p,c) (x, y) = ( ) p x y + c We saw that the underlying feature expansion is some degree p polynomial 29 / 33

32 Radial Basis Function (RBF) Kernel Parameter: σ 2 > 0 K RBF(σ2) (x, y) = exp ( ) x y 2 2σ 2 What is the underlying feature expansion? For σ 2 = 1, K RBF(σ2) (x, y) = C = C p=0 p=0 1 ( ) p x y p! 1 p! φpoly(p,0) (x) φ poly(p,0) (y) The underlying feature space is infinite-dimensional. 30 / 33

33 Summary of Kernel Trick for Soft SVM Before ( linear kernel ) 1. Calculate G i,j = x (i) x (j) for every i, j = 1... n. 2. Find β using G (e.g., by subgradient descent). 3. Test time: given a new point x to classify, return n sign βi x (i) x i=1:βi 0 Choose some kernel K(x, y). 1. Calculate G i,j = K ( x (i), x (j)) for every i, j = 1... n. 2. Find β using G (e.g., by subgradient descent). 3. Test time: given a new point x to classify, return n sign βi K i=1:βi 0 ( x (i), x) 31 / 33

34 Aside: Kernel Approximation Kernel trick is clever but requires the inner product formulation. This requires storing the n n Gram matrix: not scalable. Kernel approximation: approximate the implicit feature expansion defined under a kernel, and use that expansion directly Rahimi and Recht (2007): z(x) R N where z i (x) := 2/N cos (µ i x + b i ) given by µ i N (0, I d ) and b i U(0, 2π) E [z(x) z(y)] = K RBF(1) (x, y) Use z(x) R N directly (no kernel trick) 32 / 33

35 Summary Soft SVM = hard SVM + slack variables to handle non-separable data A unifying framework for SVMs and logistic regression: convex surrogate loss on the 0-1 loss SVMs can be naturally extended to handle multi-class classification Kernel trick: when training and inference only depend on inner product between data points, we can replace the inner product with a kernel function and perform implicit feature expansion 33 / 33

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2