Support Vector Machine

Size: px

Start display at page:

Download "Support Vector Machine"

Marcus Burns
5 years ago
Views:

1 Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018

2 Outline Linear Support Vector Machine Kernelized SVM Kernels 2

3 From ERM to RLM Empirical Risk Minimization in the binary classification case Y = { 1, 1} with the binary loss function is computationally hard does not provide asymptotic universal consistency 3

4 From ERM to RLM Empirical Risk Minimization in the binary classification case Y = { 1, 1} with the binary loss function is computationally hard does not provide asymptotic universal consistency Regularized Loss Minimization replace the binary loss function by a surrogate convex calibrated loss (function) add a (convex) regularization term typical example: Support Vector Machine 3

5 Outline Linear Support Vector Machine Kernelized SVM Kernels 4

6 Support Vector Machine (linear case) Hypotheses Y = { 1, 1} X = R P (with extensions) binary loss l b (p, t) = 1 p t RLM point of view class of models G = { } x g β0,β(x) = β 0 + β T x surrogate loss: the hinge loss l hinge (p, t) = max(0, 1 pt) (convex and calibrated) regularization: C(g β0,β) = β 2 (convex) (β 1 0, β) = arg min β 0,β N N i=1 )) max (0, 1 Y i (β 0 + β T X i + λ β 2 5

7 Ridge principle Linear function regularity if f (x) = β T x, then f (x) f (x ) β x x (Cauchy-Schwarz inequality) thus β measures the regularity of f Ridge models linear models regularized using their squared natural regularity measure ridge regression: Y = R quadratic loss: l b (p, t) = (p t) 2, no surrogate loss needed ridge logistic regression: Y = { 1, 1} surrogate logistic loss: l logi (p, t) = log(1 + exp( pt)) (convex and calibrated) 6

8 Why the hinge loss? Binary loss versus surrogate losses loss binary logistic hinge quadratic pt 7

9 Why the hinge loss? The hinge loss is convex and calibrated is an upper bound of the binary loss does not penalize (at all) very good decision (pt 1) does not overemphasize errors loss pt Robust decisions the actual decision model is f (x) = sign(g(x)) if g(x i )Y i 0 the decision is correct, but it could be influenced by noise, especially is g(x i )Y i 0 confidence in the decision increases with g(x i )Y i the hinge loss penalizes decisions that are not robust enough in term of the margin g(x i )Y i 8

10 Geometrical point of view Definition (Linear separation) A data set D is linearly separable if there is a linear model given by g β0,β(x) = β 0 + β T x such tat 1 i N sign(g(x i )) = Y i Definition (Geometrical margin) Let D be a linearly separable data set. The geometrical margin of a separating linear model given by g β0,β is M(g β0,β) = min 1 i N β 0 + β T X i. β Interpretation The geometrical margin is the smallest distance between points in the data set and the separating hyperplane associated to the model. 9

11 Maximal geometrical margin Linearly separable data many linear classifiers 10

12 Maximal geometrical margin Linearly separable data many linear classifiers if some data are close to the separator, the margin is small low robustness 10

13 Maximal geometrical margin Linearly separable data many linear classifiers if some data are close to the separator, the margin is small low robustness choose the classifier by maximizing the geometrical margin 10

14 Optimization problem Geometrical margin maximization the problem is equivalent to max β 0,β min 1 i N (P 0 ) maximize β 0,β β 0 + β T X i β min 1 i N Y i (β 0 +β T X i ) β subject to i, Y i (β 0 + β T X i ) > 0 Multiple solutions if (β 0, β) is a solution of (P 0 ) then (λβ 0, λβ) is also a solution for any λ > 0 normalization needed 11

15 Optimization problem Normalization let (β 0, β) be a solution of (P 0 ) and denote λ = min 1 i N Y i (β 0 + β T X i ) as λ > 0, (β 0, β ) = 1 λ (β 0, β) is also a solution and i, Y i (β 0 + β T Xi ) = 1 λ Y i(β 0 + β T X i ) and thus min 1 i N Y i (β 0 + β T Xi ) = 1 the maximal geometrical margin is then 1 β in summary, if (P 0 ) has a solution, then there is (β 0, β ) such that min 1 i N Y i (β 0 + β T X i ) = 1 the margin is 1 β 12

16 Equivalent problem Normalized problem we consider (P 1 ) maximize β 0,β 1 β subject to i, Y i (β 0 + β T X i ) 1 let (β 0, β) be a solution of (P 1 ) and for a λ > 0 denote (β 0, β ) = λ(β 0, β) then 1 β = 1 λ β and i, Y i(β 0 + β T Xi ) λ > 0 thus min 1 i N Y i (β 0 +β T X i ) β 1 β therefore (P 0 ) has a solution with a maximal margin of at least 1 β with the converse result from last slide, we conclude that (P 1 ) and (P 0 ) have the same optimal value and have solutions together 13

17 Linearly separable SVM Final problem (P) 1 minimize β 0,β 2 β 2 subject to i, Y i (β 0 + β T X i ) 1 Convex problem: convex criterion and convex constraints KKT conditions stationarity: β = N i=1 µ iy i X i and N i=1 µ iy i = 0 ) complementary slackness: i, µ i (Y i (β0 + β T X i ) 1 = 0 Support vectors the X i such that Y i (β 0 + β T X i ) = 1 β depends only on them! 14

18 Non separable case Soft margin if the data is not separable i, Y i (β 0 + β T X i ) 1 can not hold relax the constraints: i, Y i (β 0 + β T X i ) 1 ξ i (with ξ i 0) penalize the relaxation: N i=1 ξ i should be minimal Linear Support Vector Machine (P) minimize β 0,β,ξ subject to 1 2 β 2 + C N i=1 ξ i i, Y i (β 0 + β T X i ) 1 ξ i i, ξ i 0 15

19 Link to the hinge loss Analysis let (β 0, β, ξ ) be a solution of (P), then either ξ i = 0 or ξ i = 1 Y i (β 0 + β T X i ) and thus ξ i = max(0, 1 Y i (β 0 + β T X i )) i.e. ξ i = l hinge (g β 0,β (X i), Y i ) therefore solving (P) implies to minimize 1 2 β 2 + CN R lhinge (g β0,β, D) one can show rigorously that (P) is equivalent to RLM 16

20 In practice Dual approach (P) is equivalent to following easier to solve dual problem (D) maximize α subject to N i=1 α i 1 N N 2 i=1 j=1 α iα j Y i Y j X T i X j N i=1 α iy i = 0 i, 0 α i C for any α i such that 0 < α i < C, β 0 = Y i N j=1 α jy j X T i X j β = N j=1 α jy j X j conditions on α i sparsity: Y i (β 0 + β T X i ) > 1 α i = 0 support vector: Y i (β 0 + β T X i ) < 1 α i = C on the margin: 0 < α i < C Y i (β 0 + β T X i ) = 1 17

21 Algorithms Standard (older) solutions with quadratic programming for the dual problem scales between Θ ( N 2) and Θ ( N 3), depending on the number of support vectors (and thus on C) Modern approaches For solutions within ɛ of the optimal one cutting plan methods Θ ( ) NP λɛ stochastic sub-gradient methods Θ ( ) P λɛ Meta-parameter C or λ must be optimized (with a validation like approach) notice that the running time of the algorithm strongly depends on C/λ 18

22 Example Wine data set chemical analysis of wines derived from three cultivars (Y = {1, 2, 3}) merge cultivar 1 and 3 to get a binary classification problem 178 observations with 2 variables (X = R 2 ) Support vector machine use half the data for the learning set D use the other half D for selecting C grid search: C of the form 10 k with k { 3, 2,..., 3} notice that the range of C should depend on N 19

23 Example Alcohol class 1 class Flavanoids 20

24 Results R^ Empirical risk on D Empirical risk on D C 21

25 Results Alcohol class 1 class Flavanoids 22

26 Results Alcohol class 1 class Flavanoids 22

27 Example Wine quality data quality of a wine graded from 1 to 10: good wines for 6 or more bad wines for 5 or less 11 numerical variables giving chemical and physical properties of the wine 1067 examples for learning 532 examples for selecting C 23

28 Results R^ Empirical risk on D Empirical risk on D C 24

29 Results Confusion matrix final model on the full data set BAD GOOD BAD GOOD possible over estimated quality but consistent with the validation set performances 25

30 Outline Linear Support Vector Machine Kernelized SVM Kernels 26

31 Linear separability Are linearly separable data set frequent? Results from T. Cover (1965) for points in general position: separability depends on the dimension of the input space the expectation of the maximum number of linearly separable points in dimension P is 2P the expectation of the minimal number of independent variables needed to separate N points is N+1 2 the transition from separable to non separable is more and more abrupt as the dimension increases 27

32 Linear separability Are linearly separable data set frequent? Results from T. Cover (1965) for points in general position: separability depends on the dimension of the input space the expectation of the maximum number of linearly separable points in dimension P is 2P the expectation of the minimal number of independent variables needed to separate N points is N+1 2 the transition from separable to non separable is more and more abrupt as the dimension increases Consequence Let s increase the number of variables! 27

33 Explicit mapping Data transformation chose a mapping Φ from X = R P to R Q with Q P with Φ(X) = (φ 1 (X),..., φ Q (X)) T replace the data set D = ((X i, Y i )) 1 i N by ((Φ(X i ), Y i )) 1 i N classical approach for the generalized linear model with interaction terms φ(x) = X j X k polynomial terms φ(x) = Xj k etc. Support Vector Machine applies directly with g β0,β(x) = β 0 + β T Φ(x) can be used with Q = N by leveraging the regularization (as in ridge regression) 28

34 SVM optimization revisited Primal version (P) minimize β 0,β,ξ subject to The parameter number grows with Q Dual version 1 2 β 2 + C N i=1 ξ i i, Y i (β 0 + β T Φ(X i )) 1 ξ i i, ξ i 0 (D) maximize α subject to N i=1 α i 1 N N 2 i=1 j=1 α iα j Y i Y j Φ(X i ) T Φ(X j ) N i=1 α iy i = 0 i, 0 α i C The α parameter dimension does not depend on Q 29

35 SVM optimization revisited Primal version (P) minimize β 0,β,ξ subject to The parameter number grows with Q Dual version 1 2 β 2 + C N i=1 ξ i i, Y i (β 0 + β T Φ(X i )) 1 ξ i i, ξ i 0 (D) maximize α subject to N i=1 α i 1 N N 2 i=1 j=1 α iα j Y i Y j Φ(X i ) T Φ(X j ) N i=1 α iy i = 0 i, 0 α i C The α parameter dimension does not depend on Q 29

36 Dual problem Representation of the solution β 0 = Y i N j=1 α jφ(x i ) T Φ(X j ) for an α i such that 0 < α i < C β = N j=1 α jy j Φ(X j ) and therefore g β0,β(x) = β 0 + N α j Y j Φ(X j ) T Φ(x) = g β0,α(x) j=1 Calculating the SVM β is not needed the parameters β 0 and α are of fixed dimension independent of Q everything can be computed using only Φ(x) T Φ(x ) for x X and x X 30

37 Dual problem Representation of the solution β 0 = Y i N j=1 α jφ(x i ) T Φ(X j ) for an α i such that 0 < α i < C β = N j=1 α jy j Φ(X j ) and therefore g β0,β(x) = β 0 + N α j Y j Φ(X j ) T Φ(x) = g β0,α(x) j=1 Calculating the SVM β is not needed the parameters β 0 and α are of fixed dimension independent of Q everything can be computed using only Φ(x) T Φ(x ) for x X and x X 30

38 Regularization point of view Regularization term original term β 2 using the formula for β we have β 2 = N N α i α j Y i Y j Φ(X i ) T Φ(X j ) i=1 j=1 Hinge loss R lhinge (g β0,α, D) = 1 N N N max 0, 1 Y i β 0 + α j Y j Φ(X j ) T Φ(X i ) i=1 j=1 31

39 Regularization point of view Regularization term original term β 2 using the formula for β we have β 2 = N N α i α j Y i Y j Φ(X i ) T Φ(X j ) i=1 j=1 Hinge loss R lhinge (g β0,α, D) = 1 N N N max 0, 1 Y i β 0 + α j Y j Φ(X j ) T Φ(X i ) i=1 j=1 31

40 Implicit approach Specifying Φ with the dual problem or the regularization approach we do not need Φ but rather Φ(x) T Φ(x ) for any (x, x ) X 2 the linear case corresponds to Φ(x) T Φ(x ) = x T x could we specify directly k(x, x )? Definition (Kernel) A kernel K on a set X is a function from X X to R such that k is symmetric: (x, x ) X X, k(x, x ) = k(x, x) k is positive definite: for all n 1, all (x i ) 1 i n, n elements of X, and all (γ i ) 1 i n, n elements of R, n i=1 j=1 n γ i γ j k(x i, x j ) 0 32

41 Kernel based SVM Dual with kernel optimization problem (D) maximize α subject to N i=1 α i 1 N N 2 i=1 j=1 α iα j Y i Y j k(x i, X j ) N i=1 α iy i = 0 i, 0 α i C β 0 = Y i N j=1 α jk(x i, X j ) model g β0,α(x) = β 0 + N α j Y j k(x j, x) similar construction for the regularized point of view j=1 33

42 Issues Important questions Does this construction make sense? Why specifying a kernel k rather than a transformation Φ? 34

43 Outline Linear Support Vector Machine Kernelized SVM Kernels 35

44 Kernels are efficient Polynomial kernel for X = R P, k d (x, x ) = (x T x + 1) d is a kernel computation time Θ (P + d) one can show that there is Φ d from X to R Q such that k d (x, x ) = Φ d (x) T Φ d (x ) but Q = ( ) P+d P = (P+d)! P!d! and computing Φ costs a lot, e.g. Θ ( P d) when d is small and P is large Illustration for d = 2 and P = 2, Φ 2 (x) = ( 1, 2x 1, 2x 2, ) T 2x 1 x 2, x1 2, x 2 2 Φ 2 (x) computation involves 6 operations and Φ 2 (x) T Φ 2 (x ) uses 11 operations k 2 (x, x ) needs 5 operations 36

45 Kernels are flexible Numerical examples For X = R P Linear kernel: k lin (x, x ) = x T x Polynomial kernel: k d (x, x ) = (x T x + 1) d Gaussian kernel: k gauss,σ (x, x ) = exp ( x x 2 2σ 2 ) Laplace kernel: k lap,γ (x, x ) = exp ( γ x x ) Dissimilarity spaces When X has a dissimilarity d, ( one can ) use the Gaussian kernel as follows: k gauss,σ (x, x ) = exp d(x,x ) 2σ 2 37

46 Kernels are flexible String kernels For X the set of words on an alphabet Σ (i.e., X = Σ ) generic matching kernel k(x, x ) = s σ num(x, s)num(x, s)w(s) where num(x, s) denotes the number of times a string s occurs in x and w(s) is a weighting function on Σ examples: common characters: w(s) = 0 is s > 1 (where s is the length of the string s) common sub-strings or words with similar tricks numerous extensions: position dependent weighting, approximate matches, etc. 38

47 Kernels are flexible Many other input types graph nodes whole graphs texts images etc. 39

48 Kernel theory Definition (Reproducing Kernel Hilbert Space (RKHS)) Let X be an arbitrary space and let H be a Hilbert space of functions from X to R (with the inner product.,. H ). H is a Reproducing Kernel Hilbert Space if there is a function K from X X to R such that: 1. for all x X, K (x,.) H 2. for all f H and for all x X, f (x) = f, K (x,.) H K is the reproducing kernel of H. Theorem (Moore Aronszajn) Let k be a symmetric positive definite kernel on X. Then there is a unique RKHS of functions from X to R for which k is the reproducing kernel. 40

49 From theory to practice Kernel trick Assume given a symmetric positive definite kernel k on X. then there is a feature space H and a feature map Φ from X to H defined by Φ(x) = k(x,.) H is a vector space with an inner product: any machine learning algorithm that uses only the Euclidean structure of R P can be implemented in H for instance the linear SVM: replace β T x by β, Φ(x) H and β 2 by β 2 H 41

50 Representer theorem To ease the use of the kernel trick, we need a theorem. Theorem (Representer theorem) Let k be a symmetric positive definite kernel on X with H the associated RKHS. Assume that C is a strictly increasing function from R + to R and R is a function from R N to R. Any solution to arg min f H R(f (x 1),..., f (x N )) + C( f H ) for some fixed elements of X, x 1,..., x N, can be written f = N i=1 α ik(x i,.) 42

51 Livear SVM and the kernel trick Regularized version optimization problem with β H 1 min β 0,β N N l hinge (β 0 + β, Φ(X i ) H, Y i ) + λ β 2 H i=1 with Φ(X i ) = k(x i,.) according to the representer theorem, β has the form N j=1 α jk(x j,.) and therefore β, Φ(X i ) H = N j=1 α j k(x j,.), k(x i,.) H = N j=1 α jk(x i, X j ) β 2 H = N N i=1 j=1 α iα j k(x i, X j ) new optimization problem 1 N N N N min l hinge β 0 + α j k(x i, X j ), Y i + λ α i α j k(x i, X j ) β 0,α N i=1 j=1 i=1 j=1 43

52 Kernel ridge regression Ridge regression regularized linear regression optimization problem in an arbitrary Hilbert space H 1 min β 0,β N N (β 0 + β, Φ(X i ) H Y i ) 2 + λ β H i=1 Kernel ridge regression representer theorem again! transformed optimization problem 2 1 N N N N min β 0 + α j k(x i, X j ) Y i + λ α i α j k(x i, X j ) β 0,α N i=1 j=1 i=1 j=1 44

53 Kernels provide consistent learning Dimension many kernels increase a lot the dimension of the data representation some kernels lead to a RHKS of infinite dimension (e.g., the Gaussian kernel) according to Cover s theorem, this should turn any data set into a linearly separable one but interesting RHKS have infinite VC-dimension! Effect of regularization in the convex case g = arg min {g H g H µ} Â(g, D) even if H has an infinite VC-dimension, {g H g H µ} has a controlled capacity consistency can be proved in some cases (e.g. SVM) 45

54 Kernel methods General framework chose a kernel k chose a loss function l chose a penalty function C (strictly increasing function) solve 1 N N N N min l β 0 + α j k(x i, X j ), Y i + λc α i α j k(x i, X j ) β 0,α N i=1 j=1 use a model derived from (or equal to) X β 0 + N j=1 α jk(x i, X) very general consistent results are available very good performances in practice i=1 j=1 46

55 In practice Practical aspects loss functions? quadratic in regression hinge loss in classification numerous other possibilities kernel? must be chosen by a validation like approach meta parameter(s) must be tuned the Gaussian kernel is the de facto standard choice the regularization trade-off must be tuned by a validation like approach 47

56 Example Wine quality data Same setting as before quality of a wine graded from 1 to 10: good wines for 6 or more bad wines for 5 or less 11 numerical variables giving chemical and physical properties of the wine 1067 examples for learning 532 examples for selecting C SVM with the hinge loss and a Gaussian kernel in R with e1071, the Gaussian kernel is k(x, x ) = exp( γ x x 2 ) 48

57 Results Empirical risk on D Empirical risk on D R^ γ 49

58 Results Confusion matrix final model on the full data set BAD GOOD BAD GOOD Linear kernel BAD GOOD BAD GOOD Gaussian kernel possible over estimated quality but consistent with the validation set performances Remarks strong overfitting when C is large and γ also somewhat similar effects of C and γ: small values favor regular models, large values favor overfitting 50

59 Outline Linear Support Vector Machine Kernelized SVM Kernels 51

60 Summary Kernel methods a general framework for convex regularized loss minimization pros and cons + very flexible framework (regression, classification, but also semi-supervised learning, novelty detection, etc) + can be applied to exotic data with adapted kernels + state of the art performances + strong theoretical results - relatively slow especially for nonlinear kernels - many meta-parameters - somewhat complex implementations 52

61 Licence This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. 53

62 Version Last git commit: By: Fabrice Rossi Git hash: 1b39c1bacfc1b07f96d689db230b a62d4 54

Empirical Risk Minimization

Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space