Classifier Complexity and Support Vector Classifiers

1 Classifier Complexity and Support Vector Classifiers
[Figure: RBF kernel; axes: Feature 1, Feature 2]
David M.J. Tax
Pattern Recognition Laboratory, Delft University of Technology
D.M.J.Tax@tudelft.nl
Advanced Pattern Recognition

2 Contents
- Complexity and learning curve
- How to adapt the complexity?
- Regularization (QDC, Fisher)
- Measuring complexity: VC-dimension
- Support vector classifiers
- Class overlap
- Kernel trick
- Examples
- Conclusions

3 Complexity?
[Figure: two panels titled "Banana Set"; axes: Feature 1, Feature 2]

4 Complexity?
[Figure: two panels titled "Banana Set"; axes: Feature 1, Feature 2]
The complexity of a classifier indicates its ability to fit any data distribution.

5 Which complexity to choose?
- A simple classifier fits only a few specific data distributions.
- A complex classifier fits almost all data distributions.
- So, we should always use a complex classifier! (?)

6 The learning curve
[Figure: error versus number of training objects; curves: true error $\varepsilon$, apparent error $\varepsilon_A$; level: asymptotic error]

7 The learning curve
[Figure: error versus number of training objects; curves: true error $\varepsilon$, apparent error $\varepsilon_A$; level: asymptotic error]
The apparent error is too optimistic.

8 The learning curve
[Figure: the same learning curve with a more complex classifier added; curves: true error $\varepsilon$, apparent error $\varepsilon_A$; level: asymptotic error]

9 The learning curve
[Figure: the same learning curve with a more complex classifier added; curves: true error $\varepsilon$, apparent error $\varepsilon_A$; level: asymptotic error]
The more complex classifier has:
- lower training error
- higher test error
- lower asymptotic error

10 So, complex or not?
- Complex classifiers are good when you have a sufficient number of training objects.
- When only a small number of training objects is available, you overtrain.
- Use a simple classifier when you don't have many training examples.
- Choose the complexity according to the available training set size.

11 Peaking phenomenon
[Figure: error versus complexity; labels: sample size, Bayes error]
The minimum error for a given number of samples is obtained at a specific complexity.

12 Bias-variance dilemma
Assume our function $f$ tries to model the relation between $x$ and $y$, given a dataset $X = \{x_i, y_i\},\ i = 1, \dots, n$:
$$\varepsilon = E_X\left[\left(f_X(x) - y(x)\right)^2\right]$$
The mean-squared error can be decomposed:
$$\varepsilon = \underbrace{E_X\left[\left(f_X(x) - E_X[f_X(x)]\right)^2\right]}_{\text{variance}} + \underbrace{\left(E_X[f_X(x)] - y(x)\right)^2}_{\text{(squared) bias}}$$
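The decomposition can be checked numerically. The following is a minimal sketch, not part of the original slides: the "model" is a hypothetical shrunken sample mean (the `shrink` factor, the noise level, and all function names are assumptions for illustration), evaluated at a single point with true value `true_y`. With all expectations taken as averages over the simulated datasets, the mean-squared error equals variance plus squared bias.

```python
import random


def simulate_estimates(n_datasets, n_objects, true_y, noise_sd, shrink, seed=0):
    """Train the toy model on many datasets X and return each output f_X.

    The toy model (an assumption for illustration) is a shrunken sample mean:
    f_X = shrink * mean(y_i). shrink < 1 lowers the variance but adds bias.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_datasets):
        ys = [true_y + rng.gauss(0.0, noise_sd) for _ in range(n_objects)]
        estimates.append(shrink * sum(ys) / n_objects)
    return estimates


def decompose(estimates, true_y):
    """Return (mse, variance, squared_bias), averaged over the datasets."""
    m = len(estimates)
    mean_f = sum(estimates) / m
    mse = sum((f - true_y) ** 2 for f in estimates) / m
    variance = sum((f - mean_f) ** 2 for f in estimates) / m
    squared_bias = (mean_f - true_y) ** 2
    return mse, variance, squared_bias


# Same seed, so both models see exactly the same datasets.
flexible = decompose(simulate_estimates(2000, 5, 2.0, 1.0, shrink=1.0), 2.0)
simple = decompose(simulate_estimates(2000, 5, 2.0, 1.0, shrink=0.5), 2.0)

for name, (mse, var, bias2) in (("shrink=1.0", flexible), ("shrink=0.5", simple)):
    print(f"{name}: mse={mse:.3f} variance={var:.3f} bias^2={bias2:.3f}")
```

The shrunken model plays the role of the simple classifier: its variance is lower and its bias is higher than that of the plain sample mean.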

13 Bias-variance dilemma
error = (squared) bias + variance
- This tradeoff is very general.
- More flexible models have lower bias, but higher variance.
- Simple models have high bias, but low variance.

14 Bias-variance dilemma
[Figure: error versus complexity; curves: total error, variance, bias]
Bias-variance explains the same peaking phenomenon.

15 Bias-variance in real life
[Figure: two panels titled "Banana Set"; axes: Feature 1, Feature 2]
- Generate a dataset 10 times and plot the classifier trained on each.
- Each dataset still contains 40 objects (a lot!).

16 Adapting the complexity
- Start with a simple classifier and increase the complexity until it fits. In practice it is not simple to increase complexity.
- Start with a complex classifier and add a regularizer term to diminish the chance of an over-complex solution:
$$\min\ \varepsilon_A + \lambda\,\omega(w)$$

17 Regularization
- For general classification/regression problems, prevent the weights from becoming too large (also called ridge regression):
$$\varepsilon_\lambda = \frac{1}{N}\sum_i \left(y_i - w^T x_i\right)^2 + \lambda \|w\|^2$$
- For normal-based classifiers, prevent the covariance matrix from adapting to outliers or to a few examples:
$$\Sigma_\lambda = \Sigma + \lambda I$$
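To see the effect of $\lambda$ on the weights, here is a minimal sketch for the one-dimensional case without a bias term (the data values and function names are made up for illustration). Setting the derivative of $\frac{1}{N}\sum_i (y_i - w x_i)^2 + \lambda w^2$ to zero gives $w = \sum_i x_i y_i / (\sum_i x_i^2 + N\lambda)$.

```python
def ridge_weight_1d(xs, ys, lam):
    """Closed-form minimizer of (1/N) * sum (y_i - w*x_i)^2 + lam * w^2."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)


def ridge_error(xs, ys, w, lam):
    """The regularized error for a given weight w."""
    n = len(xs)
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / n + lam * w * w


# Hypothetical data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

weights = {lam: ridge_weight_1d(xs, ys, lam) for lam in (0.0, 1.0, 10.0)}
for lam, w in weights.items():
    print(f"lambda={lam}: w={w:.3f}")
```

A larger $\lambda$ shrinks the weight toward zero, which is the sense in which the regularizer lowers the complexity.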

18 Regularization: normal-based classifiers
Estimate a Gaussian in a 2D feature space: estimate the mean $\mu$ and the covariance matrix $\Sigma$.

19 Regularization: normal-based classifiers
Estimate a Gaussian in a 2D feature space: estimate the mean $\mu$ and the covariance matrix $\Sigma$.
If you have just 2 points in 2D: PROBLEM!

20 Regularization: normal-based classifiers
Estimate a Gaussian on 2 points:
$$\hat{\Sigma} = \begin{pmatrix} 5 & 0 \\ 0 & 0 \end{pmatrix}$$
For a normal-based classifier I need the inverse, as in $(x - \mu)^T \Sigma^{-1} (x - \mu)$. Here $\hat{\Sigma}$ is singular, so $\hat{\Sigma}^{-1}$ is not defined.

21 Regularization: normal-based classifiers
Add a bit of artificial noise to the Gaussian:
$$\hat{\Sigma} = \begin{pmatrix} 5 + \lambda & 0 \\ 0 & 0 + \lambda \end{pmatrix}$$
Now the inverse is well-defined. That is:
$$\hat{\Sigma}^{-1} = \begin{pmatrix} \frac{1}{5 + \lambda} & 0 \\ 0 & \frac{1}{\lambda} \end{pmatrix}$$
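The two-point problem can be reproduced in a few lines. This is a sketch under assumptions: the two points, the value of the regularization constant `lam`, and the helper names are all chosen for illustration, and the covariance is the maximum-likelihood estimate (divide by N).

```python
def covariance_2d(points):
    """Maximum-likelihood covariance (divide by N) of 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    d = sum((p[1] - my) ** 2 for p in points) / n
    return [[a, b], [b, d]]


def regularize(cov, lam):
    """Return cov + lam * I."""
    return [[cov[0][0] + lam, cov[0][1]], [cov[1][0], cov[1][1] + lam]]


def determinant(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inverse_2x2(m):
    """Invert a 2x2 matrix; raise ValueError when it is singular."""
    det = determinant(m)
    if det == 0:
        raise ValueError("matrix is singular, the inverse is not defined")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


# Two points in 2D: all variance lies along one line, so the estimate is singular.
cov = covariance_2d([(1.0, 2.0), (3.0, 3.0)])
print("determinant without regularization:", determinant(cov))

cov_reg = regularize(cov, lam=0.1)
inv_reg = inverse_2x2(cov_reg)
print("determinant with regularization:", determinant(cov_reg))
```

Calling `inverse_2x2(cov)` on the unregularized estimate raises the `ValueError`, which is the situation the slides describe as a problem.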

22 Quadratic classifier
$$R(x) = (x - \mu_B)^T G_B^{-1} (x - \mu_B) - (x - \mu_A)^T G_A^{-1} (x - \mu_A)$$
- When insufficient data is available (the number of objects is smaller than the dimensionality), the inverse covariance matrices are not defined, so the classifier is not defined.
- Regularization solves it: $G_B \leftarrow G_B + \lambda I$

23 Quadratic classifier
[Figure: error versus dimensionality and training set size; annotation: "QDC crashes"]

24 Quadratic classifier
[Figure: error of the regularized QDC versus dimensionality and training set size]

25 Fisher classifier
$$R(x) = (\mu_B - \mu_A)^T G^{-1} x + C$$
- The same phenomenon appears, but because the covariance matrix is averaged over the two classes, the peak appears at a small sample size.
- Instead of regularization, the pseudo-inverse of the covariance matrix is used.

26 Learning & feature curve
[Figure: peaking of the Fisher classifier (plot title: "dset 520, fisherc pca") versus number of training objects and number of features; Concordia dataset, 4000 objects in 1024D]

27 Measuring complexity
- By changing the regularization parameter, the complexity is changed.
- Is there a direct way to measure complexity?
- Yes: the VC-dimension (Vapnik-Chervonenkis dimension) of a classifier.

28 VC-dimension
$h$, the VC-dimension of a classifier: the largest number $h$ of vectors that can be separated in all the $2^h$ possible ways.

29 VC-dim for linear classifier
For a linear classifier in a $p$-dimensional feature space: $h = p + 1$
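The definition can be made concrete for the smallest case, $p = 1$, where a linear classifier is $\mathrm{sign}(wx + b)$ and $h = p + 1 = 2$. The sketch below (function names are hypothetical) enumerates every labeling such a classifier can produce on a set of distinct 1D points and checks whether all $2^n$ labelings occur.

```python
def realizable_labelings(points):
    """All labelings of distinct 1D points produced by sign(w*x + b).

    Only the side of the threshold each point falls on matters, so it is
    enough to try one threshold below all points, one between each pair of
    neighbors, and one above all points, each with both orientations of w.
    """
    xs = sorted(points)
    thresholds = ([xs[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
                  + [xs[-1] + 1.0])
    labelings = set()
    for t in thresholds:
        for w in (+1, -1):
            labelings.add(tuple(+1 if w * (x - t) > 0 else -1 for x in points))
    return labelings


def is_shattered(points):
    """True when the points can be separated in all 2^n possible ways."""
    return len(realizable_labelings(points)) == 2 ** len(points)


two_points = is_shattered([0.0, 1.0])
three_points = is_shattered([0.0, 1.0, 2.0])
print("2 points shattered:", two_points)
print("3 points shattered:", three_points)
```

Two points can be shattered, but for three points the labelings (+1, -1, +1) and (-1, +1, -1) are out of reach of a single threshold, in line with $h = 2$ for $p = 1$.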

30 Use of VC-dimension
- Unfortunately, the VC-dimension is known for only a very few classifiers.
- Fortunately, when you know $h$ of a classifier, you can bound the true error of the classifier.

31 Bounding the true error
With probability at least $1 - \eta$ the inequality holds:
$$\varepsilon \le \varepsilon_A + \frac{E(N)}{2}\left(1 + \sqrt{1 + \frac{4\,\varepsilon_A}{E(N)}}\right)$$
where
$$E(N) = 4\,\frac{h\left(\ln(2N/h) + 1\right) - \ln(\eta/4)}{N}$$
(V. Vapnik, Statistical Learning Theory, 1998)
When $h$ is small, the true error is close to the apparent error.
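A quick way to get a feeling for the bound is to evaluate it. The sketch below assumes the form $\varepsilon \le \varepsilon_A + \frac{E(N)}{2}\left(1 + \sqrt{1 + 4\varepsilon_A / E(N)}\right)$ with $E(N) = 4\,(h(\ln(2N/h) + 1) - \ln(\eta/4))/N$; the factor 4 under the square root follows Vapnik's statement of the bound as far as I know, so treat it as an assumption. The function names and the example values of $h$, $N$, $\eta$ and $\varepsilon_A$ are made up.

```python
import math


def vc_confidence_term(h, n, eta):
    """E(N) = 4 * (h * (ln(2N/h) + 1) - ln(eta/4)) / N."""
    return 4.0 * (h * (math.log(2.0 * n / h) + 1.0) - math.log(eta / 4.0)) / n


def vc_bound(apparent_error, h, n, eta=0.05):
    """Upper bound on the true error, holding with probability >= 1 - eta."""
    e = vc_confidence_term(h, n, eta)
    return apparent_error + 0.5 * e * (1.0 + math.sqrt(1.0 + 4.0 * apparent_error / e))


# Linear classifier in 2D (h = p + 1 = 3) with an apparent error of 10%.
bounds = {n: vc_bound(0.1, h=3, n=n) for n in (100, 1000, 100000)}
for n, value in bounds.items():
    print(f"N={n}: true error <= {value:.3f}")
```

For small training sets the bound is close to 1 and therefore says very little, while for very large $N$ it approaches the apparent error.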

32 Is this bound practical?
- The given bound (and others) is very loose.
- The worst-case scenario is assumed: objects can be randomly labeled.
- In practice, features are chosen such that objects from one class are nicely clustered and can be separated from other classes.

33 Changing h for linear cl.
- The linear classifier had $h = p + 1$.
- By putting some constraints on this linear classifier, the VC-dimension can be reduced.
- Assume a linearly separable dataset, and constrain the weights such that the absolute output of the classifier is at least one for every training object:
$$w^T x_i + b \ge +1, \quad \text{for } y_i = +1$$
$$w^T x_i + b \le -1, \quad \text{for } y_i = -1$$

34 Constraining the weights
[Figure: labels: $w$, $w^T x + b \le -1$, $w^T x + b \ge +1$, $w^T x + b = 0$]
The constraints force the weights to have a minimum value: canonical hyperplane.

35 h for a canonical hyperplane
[Figure: labels: $w$, $R$, $\rho$]
$$h \le \min\left(\frac{R^2}{\rho^2},\ p\right) + 1$$

36 Finding a good classifier
To find a classifier with small VC-dimension:
1. minimize the dimensionality $p$
2. minimize the radius $R$
3. maximize the margin $\rho$
Using simple algebra: $\|w\|^2 = \frac{1}{\rho^2}$
So, the support vector classifier:
$$\min\ \|w\|^2$$
subject to
$$w^T x_i + b \ge +1, \quad \text{for } y_i = +1$$
$$w^T x_i + b \le -1, \quad \text{for } y_i = -1$$
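The relation between the weights and the margin can be verified on a small example. In this sketch (the hyperplane, the point, and the function names are chosen for illustration) an object that lies exactly on $w^T x + b = +1$ is at distance $\rho = 1/\|w\|$ from the decision boundary $w^T x + b = 0$.

```python
import math


def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))


def margin_of_canonical_hyperplane(w):
    """Margin rho = 1 / ||w|| of a canonical hyperplane."""
    return 1.0 / norm(w)


def distance_to_hyperplane(x, w, b):
    """Euclidean distance from x to the hyperplane w^T x + b = 0."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm(w)


w, b = (3.0, 4.0), -1.0          # ||w|| = 5
x_on_margin = (0.0, 0.5)         # w^T x + b = 4 * 0.5 - 1 = +1

rho = margin_of_canonical_hyperplane(w)
dist = distance_to_hyperplane(x_on_margin, w, b)
print("rho =", rho, "distance of margin object =", dist)
```

Because $\rho = 1/\|w\|$, maximizing the margin is the same as minimizing $\|w\|^2$.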

37 Optimizing the classifier
It turns out that you can move the constraints into the optimization using Lagrange multipliers:
$$\max_\alpha\ \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i,j}^N y_i y_j \alpha_i \alpha_j x_i^T x_j$$
subject to
$$\alpha_i \ge 0\ \ \forall i, \qquad \sum_{i=1}^N \alpha_i y_i = 0$$
with
$$w = \sum_{i=1}^N \alpha_i y_i x_i$$
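As a worked instance, consider the smallest possible problem: one object per class. This is a sketch with hypothetical helper names, not a general solver. The constraint $\sum_i \alpha_i y_i = 0$ forces both multipliers to the same value $\alpha$, and the objective $\sum_i \alpha_i - \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j x_i^T x_j$ reduces to the concave parabola $2\alpha - \frac{1}{2}\alpha^2 \|x_+ - x_-\|^2$, whose optimum (a maximum) lies at $\alpha = 2 / \|x_+ - x_-\|^2$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alphas, xs, ys):
    """sum_i alpha_i - 1/2 * sum_ij y_i y_j alpha_i alpha_j x_i^T x_j."""
    n = len(xs)
    quadratic = sum(ys[i] * ys[j] * alphas[i] * alphas[j] * dot(xs[i], xs[j])
                    for i in range(n) for j in range(n))
    return sum(alphas) - 0.5 * quadratic


def solve_two_object_svm(x_pos, x_neg):
    """Closed-form solution for one object with y=+1 and one with y=-1."""
    d = [p - q for p, q in zip(x_pos, x_neg)]
    alpha = 2.0 / dot(d, d)              # optimum of the parabola in alpha
    w = [alpha * di for di in d]         # w = sum_i alpha_i y_i x_i
    b = 1.0 - dot(w, x_pos)              # from w^T x_pos + b = +1
    return alpha, w, b


x_pos, x_neg = (2.0, 2.0), (0.0, 0.0)
alpha, w, b = solve_two_object_svm(x_pos, x_neg)
print("alpha =", alpha, "w =", w, "b =", b)
print("output on x_pos:", dot(w, x_pos) + b, "output on x_neg:", dot(w, x_neg) + b)
```

Both objects end up exactly on the margin lines, with outputs +1 and -1, and both have a non-zero $\alpha$.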

38 Support vectors
The classifier becomes:
$$w = \sum_{i=1}^N \alpha_i y_i x_i$$
- The solution is phrased in terms of objects, not features.
- Only a few weights $\alpha_i$ become non-zero.
- The objects with non-zero weight are called the support vectors.

39 Support vector classifier
[Figure: support vectors ($\alpha_i > 0$), other objects ($\alpha_i = 0$), the vector $w$, and the lines $w^T x + b = -1$, $w^T x + b = +1$, $w^T x + b = 0$]

40 Error estimation
By leave-one-out you can obtain a bound on the error:
$$\varepsilon_{LOO} \le \frac{\#\text{support vectors}}{N}$$

41 High dimensional feature spaces
The classifier is determined by objects, not features: the classifier tends to perform very well in high dimensional feature spaces.

42 Limitations of the SVM
1. The data should be separable
2. The decision boundary is linear

43 Problem 1
The classes should be separable...

44 Weakening the constraints
[Figure: labels: $w^T x + b \ge +1 - \xi_i$, $\xi_i$, $w^T x + b = 0$]
Introduce a slack variable $\xi_i$ for each object, to weaken the constraints.

45 SVM with slacks
The optimization changes into:
$$\min\ \|w\|^2 + C \sum_{i=1}^N \xi_i$$
subject to
$$w^T x_i + b \ge +1 - \xi_i, \quad \text{for } y_i = +1$$
$$w^T x_i + b \le -1 + \xi_i, \quad \text{for } y_i = -1$$
$$\xi_i \ge 0\ \ \forall i$$
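For a fixed hyperplane, the smallest slacks that satisfy the weakened constraints are $\xi_i = \max(0,\ 1 - y_i (w^T x_i + b))$. The sketch below (data, hyperplane, and function names are made up for illustration) computes these slacks and the resulting objective $\|w\|^2 + C \sum_i \xi_i$ for two values of $C$; it evaluates the objective only and does not optimize it.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, xs, ys):
    """Smallest xi_i >= 0 with y_i * (w^T x_i + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)]


def soft_margin_objective(w, b, xs, ys, c):
    """||w||^2 + C * sum_i xi_i for a given hyperplane (w, b)."""
    return dot(w, w) + c * sum(slacks(w, b, xs, ys))


# Hypothetical data: the second object lies inside the margin,
# the fourth object is on the wrong side of the decision boundary.
xs = [(2.0, 0.0), (0.5, 1.0), (-2.0, 0.0), (0.5, -1.0)]
ys = [+1, +1, -1, -1]
w, b = (1.0, 0.0), 0.0

xi = slacks(w, b, xs, ys)
print("slacks:", xi)
print("objective for C=1: ", soft_margin_objective(w, b, xs, ys, c=1.0))
print("objective for C=10:", soft_margin_objective(w, b, xs, ys, c=10.0))
```

With a large $C$ the slack term dominates the objective, so a single erroneous object weighs heavily on the solution.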

46 Tradeoff parameter C
$$\min\ \|w\|^2 + C \sum_{i=1}^N \xi_i$$
- Notice that the tradeoff parameter $C$ has to be defined beforehand.
- It weighs the contributions of the training error and the structural error.
- Its value is often optimized using cross-validation.

47 Influence of C
[Figure: two panels, titled "C=1" and "C=" (second value not recoverable); axis: Feature 1]
Erroneous objects still have a large influence when $C$ is not optimized carefully.

48 Problem 2
The decision boundary is only linear...

49 Trick: transform your data
$$\max_\alpha\ \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i,j}^N y_i y_j \alpha_i \alpha_j x_i^T x_j$$
subject to
$$\alpha_i \ge 0\ \ \forall i, \qquad \sum_{i=1}^N \alpha_i y_i = 0$$
$$f(z) = w^T z + b = \sum_{i=1}^N \alpha_i y_i x_i^T z + b$$
- Magic: all operations are on inner products between objects.
- Assume I have a magic transformation of my data that makes it linearly separable...

50 Example transformation
[Figure: original data and mapped data; axis labels include $x_1$ and $x_1^2$]
Original data: $x = (x_1, x_2)$
Mapped data: $\Phi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2)$

51 Polynomial kernel
When we have two vectors
$$\Phi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2)$$
$$\Phi(y) = (y_1^2,\ y_2^2,\ \sqrt{2}\,y_1 y_2)$$
then
$$\Phi(x)^T \Phi(y) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = \left((x_1, x_2)(y_1, y_2)^T\right)^2 = \left(x^T y\right)^2$$
and it becomes very cheap to compute the inner product.
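The identity is easy to check numerically. A minimal sketch (function names and the two example vectors are chosen for illustration), using the mapping $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$:

```python
import math


def phi(x):
    """Explicit mapping of a 2D vector to the 3D feature space."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y):
    """(x^T y)^2, computed in the original 2D space."""
    return inner(x, y) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
explicit = inner(phi(x), phi(y))   # inner product after mapping
implicit = poly_kernel(x, y)       # same value without mapping
print("explicit:", explicit, "kernel:", implicit)
```

The kernel needs one inner product in 2D and a square, while the explicit route first builds two 3D vectors; for higher degrees and dimensions this difference grows quickly.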

52 Transform your data
$$\max_\alpha\ \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i,j}^N y_i y_j \alpha_i \alpha_j \Phi(x_i)^T \Phi(x_j)$$
subject to
$$\alpha_i \ge 0\ \ \forall i, \qquad \sum_{i=1}^N \alpha_i y_i = 0$$
$$f(z) = \sum_{i=1}^N \alpha_i y_i \Phi(x_i)^T \Phi(z) + b$$
If we have to introduce the magic $\Phi(x)$, we can as well introduce the magic kernel function:
$$K(x, y) = \Phi(x)^T \Phi(y)$$

53 The kernel-trick
- The idea to replace all inner products by a single function, the kernel function $K$, is called the kernel trick.
- It implicitly maps the data to a (most often) high dimensional feature space.
- The practical computational complexity does not change (except for computing the kernel function).

54 Kernel functions
[Figure: two panels, "Polynomial kernel" and "RBF kernel"; axes: Feature 1, Feature 2]
Polynomial kernel: $K(x, y) = (x^T y + 1)^d$
RBF kernel: $K(x, y) = \exp\left(-\frac{\|x - y\|^2}{\sigma^2}\right)$
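Both kernels, and the kernelized classifier $f(z) = \sum_i \alpha_i y_i K(x_i, z) + b$, fit in a few lines. This is a sketch: the function names are hypothetical, and the support vectors, the $\alpha_i$, and $b$ in the example are made up by hand rather than obtained from the optimization.

```python
import math


def polynomial_kernel(x, y, d=2):
    """K(x, y) = (x^T y + 1)^d."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** d


def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / sigma^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / sigma ** 2)


def decision_function(z, support_vectors, labels, alphas, b, kernel):
    """f(z) = sum_i alpha_i * y_i * K(x_i, z) + b."""
    return sum(a * y * kernel(x, z)
               for x, y, a in zip(support_vectors, labels, alphas)) + b


# Hand-picked (not optimized) values, for illustration only.
support_vectors = [(0.0, 0.0), (2.0, 0.0)]
labels = [+1, -1]
alphas = [1.0, 1.0]
b = 0.0

for z in [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]:
    f = decision_function(z, support_vectors, labels, alphas, b, rbf_kernel)
    print(z, "->", round(f, 3))
```

Swapping `rbf_kernel` for `polynomial_kernel` changes the shape of the decision boundary without touching the rest of the classifier, which is the practical appeal of the kernel trick.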

55 Results
USPS handwritten digits dataset (Cortes & Vapnik, Machine Learning, 1995)

56 Advanced kernel functions
People construct special kernel functions for applications:
- Tangent distance kernels to incorporate rotation/scaling insensitivity (handwritten digits recognition)
- String matching kernels to classify DNA/protein sequences
- Fisher kernels to incorporate knowledge on class densities
- Hausdorff kernels to compare image blobs
- ...

57 High dimensional feature spaces
SVM appears to work well in high dimensional feature spaces:
- The class overlap is often not large.
- Minimizing $h$ minimizes the risk of overfitting.
- The classifier is determined by the support objects and not directly by the features.
- No density is estimated.

58 Advantages of SVM
- The SVM generalizes remarkably well, in particular in high-dimensional feature spaces with (relatively) low sample sizes.
- Given a kernel and a $C$, there is one unique solution.
- The kernel trick allows for a varying complexity of the classifier.
- The kernel trick allows for specially engineered representations for problems.
- No strict data model is assumed (when you can assume it, use it!).
- The foundation of the SVM is pretty solid (when no slack variables are used).
- An error estimate is available using just the training data (but it is a pretty loose estimate, and cross-validation is still required to optimize $K$ and $C$).

59 Disadvantages of SVM
- The quadratic optimization is computationally expensive (although more and more specialized optimizers appear).
- The kernel and $C$ have to be optimized.
- The SVM has problems with highly overlapping classes.

60 Conclusions
- Always tune the complexity of your classifier to your data (#training objects, dimensionality, class overlap, class shape...).
- A classifier, like an SVM, does not need to estimate (class-) densities.
- The SVM can be used in almost all applications (by adjusting $K$ and $C$ appropriately).
