Classifier Complexity and Support Vector Classifiers [Figure: RBF-kernel decision boundary, Feature 1 vs. Feature 2] David M.J. Tax, Pattern Recognition Laboratory, Delft University of Technology, D.M.J.Tax@tudelft.nl Advanced Pattern Recognition 1
Contents Complexity and learning curve How to adapt the complexity? Regularization (QDC, Fisher) Measuring complexity: VC-dimension Support vector classifiers Class overlap Kernel trick Examples Conclusions Advanced Pattern Recognition 2
Complexity? [Figure: two classifiers of different complexity on the Banana Set, Feature 1 vs. Feature 2] Advanced Pattern Recognition 3
Complexity? [Figure: a simple and a complex classifier on the Banana Set, Feature 1 vs. Feature 2] The complexity of a classifier indicates its ability to fit to any data distribution Advanced Pattern Recognition 4
Which complexity to choose? A simple classifier fits only a few specific data distributions A complex classifier fits almost all data distributions So, we should always use a complex classifier! (?) Advanced Pattern Recognition 5
The learning curve [Figure: true error ε and apparent error ε_A versus the number of training objects, both approaching the asymptotic error] Advanced Pattern Recognition 6
The learning curve The apparent error is too optimistic. [Figure: true error ε and apparent error ε_A versus the number of training objects, both approaching the asymptotic error] Advanced Pattern Recognition 7
The learning curve [Figure: learning curves of a more complex classifier: true error ε and apparent error ε_A versus the number of training objects] Advanced Pattern Recognition 8
The learning curve A more complex classifier gives a lower training error, a higher test error (for small training sets), and a lower asymptotic error [Figure: learning curves of a simple and a more complex classifier] Advanced Pattern Recognition 9
So, complex or not? Complex classifiers are good when you have a sufficient number of training objects When only a small number of training objects is available, you overtrain Use a simple classifier when you don't have many training examples Choose the complexity according to the available training set size Advanced Pattern Recognition 10
Peaking phenomenon [Figure: error versus complexity for a fixed sample size, with the Bayes error as lower bound] The minimum error for a given number of samples is obtained for a specific complexity Advanced Pattern Recognition 11
Bias-variance dilemma Assume our function f tries to model the relation between x and y, given dataset X = {x_i, y_i}, i = 1...N The mean-squared error ε = E_X[(f_X(x) − y(x))²] can be decomposed: ε = E_X[(f_X(x) − E_X[f_X(x)])²] + (E_X[f_X(x)] − y(x))², i.e. variance + (squared) bias Advanced Pattern Recognition 12
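As a rough illustration of this decomposition, bias² and variance can be estimated numerically by repeatedly drawing training sets from the same distribution and refitting the same model; the target function, noise level, and polynomial degree in the sketch below are arbitrary choices, not taken from the slides.

```python
# Illustrative sketch: estimate bias^2 and variance of a fixed-degree
# polynomial fit over many training sets drawn from the same distribution.
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50)
y_true = np.sin(2 * np.pi * x_test)          # the target function y(x)
degree, n_train, n_repeats = 3, 25, 200

preds = np.empty((n_repeats, x_test.size))
for r in range(n_repeats):
    x = rng.uniform(0, 1, n_train)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n_train)
    coeffs = np.polyfit(x, y, degree)        # f_X fitted on this training set
    preds[r] = np.polyval(coeffs, x_test)

mean_pred = preds.mean(axis=0)               # estimate of E_X[f_X(x)]
variance = preds.var(axis=0).mean()          # E_X[(f_X(x) - E_X[f_X(x)])^2]
bias_sq = ((mean_pred - y_true) ** 2).mean() # (E_X[f_X(x)] - y(x))^2
print(f"bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```

Increasing the polynomial degree in this sketch shifts the balance from bias towards variance, which is exactly the tradeoff discussed on the next slides.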
Bias-variance dilemma error = (squared) bias + variance This tradeoff is very general More flexible models have lower bias, but higher variance Simple models have high bias, but low variance Advanced Pattern Recognition 13
Bias-variance dilemma [Figure: (squared) bias, variance, and total error versus complexity] Bias-variance explains the same peaking phenomenon Advanced Pattern Recognition 14
Bias-variance in real life [Figure: decision boundaries of a simple and a complex classifier on 10 independently generated Banana Sets, Feature 1 vs. Feature 2] Generate a dataset 10 times and plot the classifier each time Each time 40 objects are generated (a lot!) Advanced Pattern Recognition 15
Adapting the complexity Start with a simple classifier and increase the complexity until it fits In practice it is not simple to gradually increase the complexity Alternatively, start with a complex classifier and add a regularization term to diminish the chance of an over-complex solution: minimize ε_A + λ Ω(w) Advanced Pattern Recognition 16
Regularization For general classification/regression problems, avoid that the weights become too large (also called ridge regression): ε_λ = (1/N) Σ_i (y_i − w^T x_i)² + λ ||w||² For normal-based classifiers, avoid that the covariance matrix adapts to outliers or to just a few examples: Σ_λ = Σ + λI Advanced Pattern Recognition 17
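A minimal numpy sketch of both regularizers; the data, the value of λ, and the closed-form ridge solution below are illustrative (the 1/N factor of the loss is absorbed into λ).

```python
# Sketch: ridge regression weights in closed form, and a covariance
# matrix regularized towards the identity.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 20)
lam = 0.5

# minimizes ||y - Xw||^2 + lam*||w||^2  ->  w = (X^T X + lam*I)^-1 X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Sigma_lambda = Sigma + lam*I: better conditioned than the raw estimate
sigma = np.cov(X, rowvar=False)
sigma_reg = sigma + lam * np.eye(X.shape[1])
print(w_ridge)
print(np.linalg.cond(sigma), np.linalg.cond(sigma_reg))
```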
Regularization: normal-based classifiers Estimate a Gaussian in a 2D feature space: estimate the mean µ and the covariance matrix Σ Advanced Pattern Recognition 18
Regularization: normal-based classifiers Estimate a Gaussian in a 2D feature space: estimate the mean µ and the covariance matrix Σ If you have just 2 points in 2D: PROBLEM! Advanced Pattern Recognition 19
Regularization: normal-based classifiers Estimate a Gaussian on 2 points: Σ̂ = [5 0; 0 0] For a normal-based classifier I need the inverse in (x − µ)^T Σ⁻¹ (x − µ), that is: Σ̂⁻¹ = [1/5 0; 0 1/0], which is undefined Advanced Pattern Recognition 20
Regularization: normal-based classifiers Add a bit of artificial noise to the Gaussian: Σ̂_λ = [5+λ 0; 0 0+λ] Now the inverse is well-defined, that is: Σ̂_λ⁻¹ = [1/(5+λ) 0; 0 1/(0+λ)] Advanced Pattern Recognition 21
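The same toy example in a few lines of numpy (λ = 0.1 is an arbitrary choice): the raw covariance estimate is singular, and adding λI makes the inverse well-defined.

```python
import numpy as np

sigma = np.array([[5.0, 0.0],
                  [0.0, 0.0]])        # singular: the second feature has no spread
lam = 0.1
# np.linalg.inv(sigma) would raise LinAlgError: Singular matrix
sigma_inv = np.linalg.inv(sigma + lam * np.eye(2))
print(sigma_inv)                      # approximately [[1/5.1, 0], [0, 1/0.1]]
```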
Quadratic classifier R(x) = (x − µ_A)^T G_A⁻¹ (x − µ_A) − (x − µ_B)^T G_B⁻¹ (x − µ_B) When insufficient data is available (the number of objects is smaller than the dimensionality), the inverse covariance matrices are not defined: the classifier is not defined Regularization solves it: G_A ← G_A + λI, G_B ← G_B + λI Advanced Pattern Recognition 22
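A minimal sketch of such a regularized quadratic discriminant in numpy, ignoring the log-determinant and class-prior terms a full QDC would include; the function names and the default λ are illustrative.

```python
import numpy as np

def fit_reg_qdc(X_a, X_b, lam=1e-2):
    """Estimate (mean, regularized inverse covariance) per class."""
    params = []
    for X in (X_a, X_b):
        mu = X.mean(axis=0)
        G = np.cov(X, rowvar=False) + lam * np.eye(X.shape[1])  # G + lam*I
        params.append((mu, np.linalg.inv(G)))
    return params

def reg_qdc_decide(x, params):
    (mu_a, Gi_a), (mu_b, Gi_b) = params
    r = (x - mu_a) @ Gi_a @ (x - mu_a) - (x - mu_b) @ Gi_b @ (x - mu_b)
    return 'A' if r < 0 else 'B'      # smaller Mahalanobis distance wins
```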
Quadratic classifier [Figure: error versus training set size for the QDC; around the dimensionality the QDC crashes] Advanced Pattern Recognition 23
Quadratic classifier [Figure: error versus training set size for the regularized QDC] Advanced Pattern Recognition 24
Fisher classifier R(x) = (µ_B − µ_A)^T G⁻¹ x + C The same phenomenon appears, but because the covariance matrix is averaged over the two classes, the peak appears at small sample sizes Instead of regularization, the pseudo-inverse of the covariance matrix is used Advanced Pattern Recognition 25
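A small numpy sketch of the Fisher direction using the pseudo-inverse of the averaged covariance matrix; the threshold halfway between the class means assumes equal class priors and is an illustrative choice.

```python
import numpy as np

def fisher_weights(X_a, X_b):
    mu_a, mu_b = X_a.mean(axis=0), X_b.mean(axis=0)
    G = 0.5 * (np.cov(X_a, rowvar=False) + np.cov(X_b, rowvar=False))
    w = np.linalg.pinv(G) @ (mu_b - mu_a)   # pseudo-inverse: works when G is singular
    c = -0.5 * w @ (mu_a + mu_b)            # threshold between the means (equal priors)
    return w, c

# R(x) = w @ x + c; R(x) > 0 assigns x to class B in this sketch
```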
Learning & feature curve [Figure: error of the Fisher classifier (with PCA features) as a function of the number of training objects and the number of features, showing peaking; Concordia dataset, 4000 objects in 1024D] Advanced Pattern Recognition 26
Measuring complexity By changing the regularization parameter, the complexity is changed Is there a direct way to measure complexity? Yes: the VC-dimension (Vapnik-Chervonenkis dimension) of a classifier Advanced Pattern Recognition 27
VC-dimension h: the VC-dimension of a classifier is the largest number h of vectors that can be separated in all 2^h possible ways Advanced Pattern Recognition 28
VC-dim for linear classifier For a linear classifier in p dimensions: h = p + 1 Advanced Pattern Recognition 29
Use of VC-dimension Unfortunately, the VC-dimension is known for only a very few classifiers Fortunately, when you know h of a classifier, you can bound the true error of the classifier Advanced Pattern Recognition 30
Bounding the true error With probability at least 1 − η the inequality holds: ε ≤ ε_A + (E(N)/2) (1 + √(1 + 4ε_A/E(N))), where E(N) = 4 (h(ln(2N/h) + 1) − ln(η/4)) / N V. Vapnik, Statistical Learning Theory, 1998 When h is small, the true error is close to the apparent error Advanced Pattern Recognition 31
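The bound as written above is easy to evaluate; the numbers below (h, N, η and the apparent error) are made up purely to show how quickly it loosens with growing h.

```python
import numpy as np

def vapnik_bound(eps_A, h, N, eta=0.05):
    E = 4.0 * (h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N
    return eps_A + 0.5 * E * (1 + np.sqrt(1 + 4 * eps_A / E))

print(vapnik_bound(eps_A=0.05, h=3,   N=10000))   # small h: bound close to eps_A
print(vapnik_bound(eps_A=0.05, h=500, N=10000))   # large h: bound becomes (nearly) trivial
```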
Is this bound practical? The given bound (and others) is very loose The worst case scenario is assumed: objects can be randomly labeled In practice, features are chosen such that objects from one class are nicely clustered and can be separated from other classes Advanced Pattern Recognition 32
Changing h for the linear classifier The linear classifier had h = p + 1 By putting some constraints on this linear classifier, the VC-dimension can be reduced Assume a linearly separable dataset and constrain the weights such that the output of the classifier is always at least one in absolute value: w^T x_i + b ≥ +1 for y_i = +1, w^T x_i + b ≤ −1 for y_i = −1 Advanced Pattern Recognition 33
Constraining the weights [Figure: canonical hyperplane w^T x + b = 0 with the margin hyperplanes w^T x + b = +1 and w^T x + b = −1, and weight vector w] The constraints force the weights to have a minimum value: the canonical hyperplane Advanced Pattern Recognition 34
h for a canonical hyperplane [Figure: data inside a sphere of radius R, separated by a canonical hyperplane with margin ρ and weight vector w] h ≤ min(R²/ρ², p) + 1 Advanced Pattern Recognition 35
Finding a good classifier To find a classifier with small VC-dimension: 1. minimize the dimensionality p 2. minimize the radius R 3. maximize the margin ρ Using simple algebra: ||w||² = 1/ρ² So we obtain the support vector classifier: min ||w||² subject to w^T x_i + b ≥ +1 for y_i = +1, w^T x_i + b ≤ −1 for y_i = −1 Advanced Pattern Recognition 36
Optimizing the classifier It appears you can rewrite the constraints inside the optimization using Lagrange multipliers: max_α Σ_{i=1}^N α_i − (1/2) Σ_{i,j}^N y_i y_j α_i α_j x_i^T x_j subject to α_i ≥ 0 ∀i and Σ_{i=1}^N α_i y_i = 0, with w = Σ_{i=1}^N α_i y_i x_i Advanced Pattern Recognition 37
Support vectors The classifier becomes: w = Σ_{i=1}^N α_i y_i x_i The solution is phrased in terms of objects, not features Only a few weights α_i become non-zero The objects with non-zero weight are called the support vectors Advanced Pattern Recognition 38
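A small scikit-learn sketch (the dataset and the very large C are arbitrary choices; with a huge C the classifier behaves almost like the hard-margin classifier described here): only a small fraction of the training objects end up as support vectors.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)
clf = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C ~ (almost) hard margin

print("support vectors:", clf.support_.size, "out of", len(X))
print("w =", clf.coef_[0], " b =", clf.intercept_[0])   # w = sum_i alpha_i y_i x_i
```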
Support vector classifier [Figure: decision boundary w^T x + b = 0 with margins w^T x + b = ±1; the support vectors (α_i > 0) lie on the margins, the remaining objects have α_i = 0] Advanced Pattern Recognition 39
Error estimation By leave-one-out you can obtain a bound on the error: ε_LOO ≤ (#support vectors) / N Advanced Pattern Recognition 40
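A quick check of this bound with scikit-learn (the dataset and C are arbitrary): the actual leave-one-out error is compared with #SV/N.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=1)
clf = SVC(kernel='linear', C=10.0)

n_sv = SVC(kernel='linear', C=10.0).fit(X, y).support_.size
loo_error = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
print(f"LOO error = {loo_error:.3f},  #SV/N = {n_sv / len(X):.3f}")
```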
High dimensional feature spaces The classifier is determined by objects, not features: the classifier tends to perform very well in high dimensional feature spaces. Advanced Pattern Recognition 41
Limitations of the SVM 1. The data should be separable 2. The decision boundary is linear Advanced Pattern Recognition 42
Problem 1 The classes should be separable... Advanced Pattern Recognition 43
Weakening the constraints [Figure: objects violating the margin w^T x + b ≥ +1 by a slack ξ_i, relative to the hyperplane w^T x + b = 0] Introduce a slack variable ξ_i for each object, to weaken the constraints Advanced Pattern Recognition 44
SVM with slacks The optimization changes into: min ||w||² + C Σ_{i=1}^N ξ_i subject to w^T x_i + b ≥ +1 − ξ_i for y_i = +1, w^T x_i + b ≤ −1 + ξ_i for y_i = −1, ξ_i ≥ 0 ∀i Advanced Pattern Recognition 45
Tradeoff parameter C min ||w||² + C Σ_{i=1}^N ξ_i Notice that the tradeoff parameter C has to be defined beforehand It weighs the contribution of the training error against the structural error Its value is often optimized using cross-validation Advanced Pattern Recognition 46
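A minimal sketch of that cross-validation with scikit-learn's GridSearchCV; the dataset and the grid of C values are arbitrary choices.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=2)
grid = GridSearchCV(SVC(kernel='linear'),
                    param_grid={'C': [0.01, 0.1, 1, 10, 100, 1000]},
                    cv=5)                     # 5-fold cross-validation over C
grid.fit(X, y)
print("best C:", grid.best_params_['C'], " CV accuracy:", grid.best_score_)
```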
Influence of C [Figure: decision boundaries for C = 1 and C = 1000 on the same dataset, Feature 1 vs. Feature 2] Erroneous objects still have a large influence when C is not chosen carefully Advanced Pattern Recognition 47
Problem 2 The decision boundary is only linear... Advanced Pattern Recognition 48
Trick: transform your data max_α Σ_{i=1}^N α_i − (1/2) Σ_{i,j}^N y_i y_j α_i α_j x_i^T x_j subject to α_i ≥ 0 ∀i and Σ_{i=1}^N α_i y_i = 0 f(z) = w^T z + b = Σ_{i=1}^N α_i y_i x_i^T z + b Magic: all operations are on inner products between objects Assume I have a magic transformation of my data that makes it linearly separable... Advanced Pattern Recognition 49
Example transformation [Figure: original data in (x_1, x_2) and mapped data in (x_1², x_2²)] Original data: x = (x_1, x_2) Mapped data: Φ(x) = (x_1², x_2², √2 x_1 x_2) Advanced Pattern Recognition 50
Polynomial kernel When we have two vectors Φ(x) = (x_1², x_2², √2 x_1 x_2) and Φ(y) = (y_1², y_2², √2 y_1 y_2), then Φ(x)^T Φ(y) = x_1²y_1² + x_2²y_2² + 2 x_1 x_2 y_1 y_2 = ((x_1, x_2)(y_1, y_2)^T)² = (x^T y)², so it becomes very cheap to compute the inner product Advanced Pattern Recognition 51
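A quick numerical check of this identity (the two vectors are arbitrary):

```python
import numpy as np

def phi(v):                      # Phi(v) = (v1^2, v2^2, sqrt(2)*v1*v2)
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(y), (x @ y) ** 2)   # both print 1.0
```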
Transform your data max_α Σ_{i=1}^N α_i − (1/2) Σ_{i,j}^N y_i y_j α_i α_j Φ(x_i)^T Φ(x_j) subject to α_i ≥ 0 ∀i and Σ_{i=1}^N α_i y_i = 0 f(z) = Σ_{i=1}^N α_i y_i Φ(x_i)^T Φ(z) + b If we have to introduce the magic Φ(x), we might as well introduce the magic kernel function: K(x, y) = Φ(x)^T Φ(y) Advanced Pattern Recognition 52
The kernel trick The idea of replacing all inner products by a single function, the kernel function K, is called the kernel trick It implicitly maps the data to a (most often) high dimensional feature space The practical computational complexity does not change (except for computing the kernel function) Advanced Pattern Recognition 53
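scikit-learn makes the trick explicit through a precomputed kernel matrix: the classifier only ever sees inner products (the dataset and the degree-2 polynomial kernel below are arbitrary choices).

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=3)
K = (X @ X.T + 1) ** 2                    # Gram matrix of a degree-2 polynomial kernel
clf = SVC(kernel='precomputed').fit(K, y) # the SVC never touches Phi(x) explicitly
print("training accuracy:", clf.score(K, y))
```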
Kernel functions [Figure: decision boundaries of a polynomial kernel and an RBF kernel on the Banana Set, Feature 1 vs. Feature 2] Polynomial kernel: K(x, y) = (x^T y + 1)^d RBF kernel: K(x, y) = exp(−||x − y||²/σ²) Advanced Pattern Recognition 54
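The same two kernels fitted with scikit-learn on a toy two-class problem; the dataset and kernel parameters are arbitrary, and in the SVC parametrization gamma plays the role of 1/σ².

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=4)
for name, clf in [("polynomial", SVC(kernel='poly', degree=3, coef0=1, C=1.0)),
                  ("RBF",        SVC(kernel='rbf', gamma=1.0, C=1.0))]:
    print(name, "training accuracy:", clf.fit(X, y).score(X, y))
```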
Results [Figure: example images from the USPS handwritten digits dataset] Cortes & Vapnik, Machine Learning, 1995 Advanced Pattern Recognition 55
Advanced kernel functions People construct special kernel functions for applications: Tangent distance kernels to incorporate rotation/scaling insensitivity (handwritten digit recognition) String matching kernels to classify DNA/protein sequences Fisher kernels to incorporate knowledge on class densities Hausdorff kernels to compare image blobs... Advanced Pattern Recognition 56
High dimensional feature spaces SVM appears to work well in high dimensional feature spaces The class overlap is often not large Minimizing h minimizes the risk of overfitting The classifier is determined by the support objects and not directly by the features No density is estimated Advanced Pattern Recognition 57
Advantages of SVM The SVM generalizes remarkably well, in particular in high-dimensional feature spaces with (relatively) low sample sizes Given a kernel and a C, there is one unique solution The kernel trick allows for a varying complexity of the classifier The kernel trick allows for especially engineered representations for problems No strict data model is assumed (when you can assume one, use it!) The foundation of the SVM is pretty solid (when no slack variables are used) An error estimate is available using just the training data (but it is a pretty loose estimate, and cross-validation is still required to optimize K and C) Advanced Pattern Recognition 58
Disadvantages of SVM The quadratic optimization is computationally expensive (although more and more specialized optimizers appear) The kernel and C have to be optimized The SVM has problems with highly overlapping classes Advanced Pattern Recognition 59
Conclusions Always tune the complexity of your classifier to your data (#training objects, dimensionality, class overlap, class shape, ...) A classifier does not need to estimate (class) densities; the SVM, for example, does not The SVM can be used in almost all applications (by adjusting K and C appropriately) Advanced Pattern Recognition 60