Classifier Complexity and Support Vector Classifiers

Classifier Complexity and Support Vector Classifiers [Figure: decision boundary of an RBF-kernel classifier in a 2D feature space (Feature 1 vs. Feature 2)] David M.J. Tax, Pattern Recognition Laboratory, Delft University of Technology, D.M.J.Tax@tudelft.nl Advanced Pattern Recognition 1

Contents Complexity and learning curve How to adapt the complexity? Regularization (QDC, Fisher) Measuring complexity: VC-dimension Support vector classifiers Class overlap Kernel trick Examples Conclusions Advanced Pattern Recognition 2

Complexity? [Figure: two scatter plots of the Banana Set (Feature 1 vs. Feature 2)] Advanced Pattern Recognition 3

Complexity? [Figure: two scatter plots of the Banana Set (Feature 1 vs. Feature 2)] The complexity of a classifier indicates the ability to fit to any data distribution Advanced Pattern Recognition 4

Which complexity to choose? A simple classifier fits to only a few specific data distributions A complex classifier fits to almost all data distributions So, we should always use a complex classifier! (?) Advanced Pattern Recognition 5

The learning curve [Figure: learning curves, error versus number of training objects, showing the true error $\epsilon$, the apparent error $\epsilon_A$, and the asymptotic error] Advanced Pattern Recognition 6

The learning curve The apparent error is too optimistic [Figure: learning curves, error versus number of training objects; the apparent error $\epsilon_A$ lies below the true error $\epsilon$] Advanced Pattern Recognition 7

The learning curve [Figure: learning curves for a more complex classifier: true error $\epsilon$, apparent error $\epsilon_A$, and asymptotic error versus the number of training objects] Advanced Pattern Recognition 8

The learning curve [Figure: learning curves of a simple and a more complex classifier versus the number of training objects] A more complex classifier has a lower training error, a higher test error (for small training sets), and a lower asymptotic error Advanced Pattern Recognition 9

So, complex or not? Complex classifiers are good when you have a sufficient number of training objects When only a small number of training objects is available, you overtrain Use a simple classifier when you don't have many training examples Choose the complexity according to the available training set size Advanced Pattern Recognition 10

Peaking phenomenon [Figure: error versus classifier complexity for several sample sizes, with the Bayes error as lower bound] The minimum error for a given number of samples is obtained for a specific complexity Advanced Pattern Recognition 11

Bias-variance dilemma Assume our function $f$ tries to model the relation between $x$ and $y$, given a dataset $X = \{x_i, y_i\}, i = 1 \dots N$: $\epsilon = E_X\left[(f_X(x) - y(x))^2\right]$ The mean-squared error can be decomposed: $\epsilon = E_X\left[(f_X(x) - E_X[f_X(x)])^2\right] + \left(E_X[f_X(x)] - y(x)\right)^2$, i.e. variance plus (squared) bias Advanced Pattern Recognition 12
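The decomposition can be illustrated numerically. Below is a minimal simulation sketch, not from the slides: toy sine data, polynomial models of increasing degree, and Monte Carlo estimates of the (squared) bias and the variance over many regenerated training sets; the data, degrees, and sample sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)

def fit_poly(degree, n_train=40):
    """Fit a polynomial of a given degree to one freshly sampled training set."""
    x = rng.uniform(0, 1, n_train)
    y = true_fn(x) + 0.3 * rng.normal(size=n_train)
    return np.polyfit(x, y, degree)

x_test = np.linspace(0, 1, 50)
for degree in (1, 3, 9):
    # Predictions of the same model class trained on many different datasets
    preds = np.array([np.polyval(fit_poly(degree), x_test) for _ in range(200)])
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - true_fn(x_test)) ** 2)   # (squared) bias
    variance = np.mean(preds.var(axis=0))                 # variance over datasets
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```

The low-degree fit shows a high bias and a low variance, the high-degree fit the opposite, which is exactly the tradeoff on the next slide.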

Bias-variance dilemma error = (squared) bias + variance This tradeoff is very general More flexible models have lower bias, but higher variance Simple models have high bias, but low variance Advanced Pattern Recognition 13

Bias-variance dilemma [Figure: total error, bias, and variance versus classifier complexity] Bias-variance explains the same peaking phenomenon Advanced Pattern Recognition 14

Bias-variance in real life [Figure: two Banana Set scatter plots, each overlaid with the decision boundaries obtained from 10 independently generated datasets] Generate a dataset 10 times and plot the resulting classifier each time Each dataset still contains 40 objects (a lot!) Advanced Pattern Recognition 15

Adapting the complexity Either start with a simple classifier and increase the complexity until it fits (in practice it is not simple to increase the complexity), or start with a complex classifier and add a regularization term to diminish the chance of an over-complex solution: minimize $\epsilon_A + \lambda \Omega(w)$ Advanced Pattern Recognition 16

Regularization For general classification/regression problems, avoid that the weights become too large (also called ridge regression): $\epsilon_\lambda = \frac{1}{N}\sum_i \left(y_i - w^T x_i\right)^2 + \lambda \|w\|^2$ For normal-based classifiers, avoid that the covariance matrix adapts to outliers or a few examples: $\Sigma_\lambda = \Sigma + \lambda I$ Advanced Pattern Recognition 17
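As an illustration of the first formula, the ridge-regularized least-squares weights can be computed in closed form. This is a small sketch with made-up toy data; the factor $N$ in front of $\lambda$ comes from the $1/N$ in the cost function above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = w_true^T x + noise (made up for illustration)
N, p = 20, 5
X = rng.normal(size=(N, p))
w_true = rng.normal(size=p)
y = X @ w_true + 0.1 * rng.normal(size=N)

lam = 0.5  # regularization strength lambda (a free choice here)

# Ridge regression: minimize (1/N) sum_i (y_i - w^T x_i)^2 + lambda ||w||^2
# Closed-form solution: w = (X^T X + N*lambda*I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + N * lam * np.eye(p), X.T @ y)
print("ridge weights:", w_ridge)
print("true weights :", w_true)
```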

Regularization Normal-based classifiers Estimate a Gaussian in a 2D feature space: estimate the mean and the covariance matrix [Figure: data with a fitted Gaussian, mean $\mu$ marked] Advanced Pattern Recognition 18

Regularization Normal-based classifiers Estimate a Gaussian in a 2D feature space: estimate the mean and the covariance matrix [Figure: a Gaussian fitted to just two points, mean $\mu$ marked] If you have just 2 points in 2D: PROBLEM! Advanced Pattern Recognition 19

Regularization Normal-based classifiers Estimate a Gaussian on 2 points: $\hat{\Sigma} = \begin{pmatrix} 5 & 0 \\ 0 & 0 \end{pmatrix}$ For a normal-based classifier I need the inverse, $(x - \mu)^T \Sigma^{-1} (x - \mu)$, that is: $\hat{\Sigma}^{-1} = \begin{pmatrix} \frac{1}{5} & 0 \\ 0 & \frac{1}{0} \end{pmatrix}$ Advanced Pattern Recognition 20

Regularization Normal-based classifiers Add a bit of artificial noise to the Gaussian: $\hat{\Sigma} = \begin{pmatrix} 5+\lambda & 0 \\ 0 & 0+\lambda \end{pmatrix}$ Now the inverse is well-defined: $\hat{\Sigma}^{-1} = \begin{pmatrix} \frac{1}{5+\lambda} & 0 \\ 0 & \frac{1}{0+\lambda} \end{pmatrix}$ Advanced Pattern Recognition 21
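The same regularization can be checked numerically. A small sketch: the matrix values follow the slide, the $\lambda$ value and the test point are made up.

```python
import numpy as np

# Covariance estimated from only 2 points in 2D is singular,
# e.g. the matrix from the slide:
Sigma_hat = np.array([[5.0, 0.0],
                      [0.0, 0.0]])
print(np.linalg.matrix_rank(Sigma_hat))   # 1: not invertible

# Regularize: Sigma_lambda = Sigma_hat + lambda * I
lam = 0.1  # small regularization value (a free choice)
Sigma_reg = Sigma_hat + lam * np.eye(2)
Sigma_inv = np.linalg.inv(Sigma_reg)      # now well-defined
print(Sigma_inv)

# Mahalanobis-type distance used by a normal-based classifier:
mu = np.zeros(2)
x = np.array([1.0, 2.0])
d2 = (x - mu) @ Sigma_inv @ (x - mu)
print(d2)
```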

Quadratic classifier $R(x) = (x - \mu_A)^T G_A^{-1}(x - \mu_A) - (x - \mu_B)^T G_B^{-1}(x - \mu_B)$ When insufficient data is available (the number of objects is smaller than the dimensionality), the inverse covariance matrices are not defined: the classifier is not defined Regularization solves it: $G_B = G_B + \lambda I$ Advanced Pattern Recognition 22

Quadratic classifier [Figure: error as a function of dimensionality and training set size; the QDC crashes] Advanced Pattern Recognition 23

Quadratic classifier [Figure: error as a function of dimensionality and training set size for the regularized QDC] Advanced Pattern Recognition 24

Fisher classifier $R(x) = (\mu_B - \mu_A)^T G^{-1} x + C$ The same phenomenon appears, but because the covariance matrix is averaged over the two classes, the peak appears at smaller sample sizes Instead of regularization, the pseudo-inverse of the covariance matrix is used Advanced Pattern Recognition 25

Learning & feature curve peaking [Figure: error surface of the Fisher classifier (with PCA) as a function of the number of training objects and the number of features; Concordia dataset, 4000 objects in 1024D] Advanced Pattern Recognition 26

Measuring complexity By changing the regularization parameter, the complexity is changed Is there a direct way to measure complexity? Yes: the VC-dimension (Vapnik-Chervonenkis dimension) of a classifier Advanced Pattern Recognition 27

VC-dimension $h$: the VC-dimension of a classifier: the largest number $h$ of vectors that can be separated in all the $2^h$ possible ways Advanced Pattern Recognition 28

VC-dimension for a linear classifier For a linear classifier: $h = p + 1$ (with $p$ the dimensionality of the feature space) Advanced Pattern Recognition 29

Use of the VC-dimension Unfortunately, the VC-dimension is known for only very few classifiers Fortunately, when you know $h$ of a classifier, you can bound its true error Advanced Pattern Recognition 30

Bounding the true error With probability at least $1 - \eta$ the inequality holds: $\epsilon \leq \epsilon_A + \frac{E(N)}{2}\left(1 + \sqrt{1 + \frac{4\epsilon_A}{E(N)}}\right)$ where $E(N) = 4\,\frac{h\left(\ln(2N/h) + 1\right) - \ln(\eta/4)}{N}$ When $h$ is small, the true error is close to the apparent error (V. Vapnik, Statistical Learning Theory, 1998) Advanced Pattern Recognition 31
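The bound is easy to evaluate numerically. A small sketch of the formula above, with made-up numbers for the apparent error, the VC-dimension, and the training set size:

```python
import numpy as np

def vapnik_bound(eps_A, h, N, eta=0.05):
    """Upper bound on the true error, following the VC bound on the slide.

    eps_A : apparent (training) error
    h     : VC-dimension of the classifier
    N     : number of training objects
    eta   : the bound holds with probability at least 1 - eta
    """
    E = 4.0 * (h * (np.log(2.0 * N / h) + 1.0) - np.log(eta / 4.0)) / N
    return eps_A + 0.5 * E * (1.0 + np.sqrt(1.0 + 4.0 * eps_A / E))

# Example: linear classifier (h = p + 1) in 10D, trained on 1000 objects
print(vapnik_bound(eps_A=0.05, h=11, N=1000))    # quite loose
print(vapnik_bound(eps_A=0.05, h=11, N=100000))  # tightens slowly with N
```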

Is this bound practical? The given bound (and others) is very loose The worst case scenario is assumed: objects can be randomly labeled In practice, features are chosen such that objects from one class are nicely clustered and can be separated from other classes Advanced Pattern Recognition 32

Changing $h$ for the linear classifier The linear classifier had $h = p + 1$ By putting some constraints on this linear classifier, the VC-dimension can be reduced Assume a linearly separable dataset, and constrain the weights such that the output of the classifier is always at least one in absolute value: $w^T x_i + b \geq +1$ for $y_i = +1$, and $w^T x_i + b \leq -1$ for $y_i = -1$ Advanced Pattern Recognition 33

Constraining the weights [Figure: decision boundary $w^T x + b = 0$ with the margin hyperplanes $w^T x + b = +1$ and $w^T x + b = -1$ and the weight vector $w$] The constraints force the weights to have a minimum value: the canonical hyperplane Advanced Pattern Recognition 34

h for a canonical hyperplane [Figure: data enclosed in a sphere of radius $R$, separated with margin $\rho$ by a hyperplane with weight vector $w$] $h \leq \min\left(\frac{R^2}{\rho^2}, p\right) + 1$ Advanced Pattern Recognition 35

Finding a good classifier To find a classifier with a small VC-dimension: 1. minimize the dimensionality $p$ 2. minimize the radius $R$ 3. maximize the margin $\rho$ Using simple algebra: $\|w\|^2 = \frac{1}{\rho^2}$ So the support vector classifier solves $\min \|w\|^2$ subject to $w^T x_i + b \geq +1$ for $y_i = +1$ and $w^T x_i + b \leq -1$ for $y_i = -1$ Advanced Pattern Recognition 36

Optimizing the classifier It appears you can rewrite the constraints inside the optimization using Lagrange multipliers: $\max_\alpha \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i,j}^N y_i y_j \alpha_i \alpha_j x_i^T x_j$ subject to $\alpha_i \geq 0 \;\forall i$ and $\sum_{i=1}^N \alpha_i y_i = 0$, with $w = \sum_{i=1}^N \alpha_i y_i x_i$ Advanced Pattern Recognition 37
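For a tiny separable toy problem this dual can be solved with a general-purpose optimizer. The sketch below uses scipy's SLSQP routine rather than a dedicated QP solver; the data points and the threshold for deciding $\alpha_i > 0$ are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable 2D toy problem (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -3.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = len(y)

# Dual objective: maximize sum(alpha) - 0.5 * sum_ij y_i y_j a_i a_j x_i.x_j
# We minimize its negative.
K = X @ X.T                            # Gram matrix of inner products
Q = (y[:, None] * y[None, :]) * K

def neg_dual(alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * N                               # alpha_i >= 0

res = minimize(neg_dual, np.zeros(N), bounds=bounds,
               constraints=constraints, method="SLSQP")
alpha = res.x

w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                        # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)           # b from the margin constraints
print("support vectors:", X[sv])
print("w =", w, " b =", b)
```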

Support vectors The classifier becomes: $w = \sum_{i=1}^N \alpha_i y_i x_i$ The solution is phrased in terms of objects, not features Only a few weights $\alpha_i$ become non-zero The objects with non-zero weight are called the support vectors Advanced Pattern Recognition 38

Support vector classifier [Figure: decision boundary $w^T x + b = 0$ with margin hyperplanes $w^T x + b = -1$ and $w^T x + b = +1$; the support vectors ($\alpha_i > 0$) lie on the margin, the remaining objects ($\alpha_i = 0$) lie outside it] Advanced Pattern Recognition 39

Error estimation By leave-one-out you can obtain a bound on the error: $\epsilon_{LOO} \leq \frac{\#\text{support vectors}}{N}$ Advanced Pattern Recognition 40
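The bound can be compared against an actual leave-one-out estimate, for instance with scikit-learn. A small sketch on made-up blob data; with a soft margin (finite C) the bound is only approximate.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical toy data; any 2-class set works here
X, y = make_blobs(n_samples=60, centers=2, cluster_std=2.0, random_state=0)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

bound = len(clf.support_) / len(y)         # #support vectors / N
loo_error = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

print(f"LOO bound (#SV/N): {bound:.3f}")
print(f"actual LOO error : {loo_error:.3f}")
```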

High dimensional feature spaces The classifier is determined by objects, not features: the classifier tends to perform very well in high dimensional feature spaces. Advanced Pattern Recognition 41

Limitations of the SVM 1. The data should be separable 2. The decision boundary is linear Advanced Pattern Recognition 42

Problem 1 The classes should be separable... Advanced Pattern Recognition 43

Weakening the constraints [Figure: an object on the wrong side of its margin hyperplane, at slack distance $\xi_i$ from $w^T x + b = +1$; decision boundary $w^T x + b = 0$] Introduce a slack variable $\xi_i$ for each object, to weaken the constraints: $w^T x_i + b \geq +1 - \xi_i$ Advanced Pattern Recognition 44

SVM with slacks The optimization changes into: $\min \|w\|^2 + C\sum_{i=1}^N \xi_i$ subject to $w^T x_i + b \geq +1 - \xi_i$ for $y_i = +1$, $w^T x_i + b \leq -1 + \xi_i$ for $y_i = -1$, and $\xi_i \geq 0 \;\forall i$ Advanced Pattern Recognition 45

Tradeoff parameter C $\min \|w\|^2 + C\sum_{i=1}^N \xi_i$ Notice that the tradeoff parameter $C$ has to be defined beforehand It weighs the contribution of the training error against that of the structural error Its value is often optimized using cross-validation Advanced Pattern Recognition 46

Influence of C [Figure: decision boundaries for C=1 and C=1000 on the same 2D dataset (Feature 1 vs. Feature 2)] Erroneous objects still have a large influence when C is not chosen carefully Advanced Pattern Recognition 47
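The effect of C is easy to reproduce. A small sketch with scikit-learn on made-up overlapping data, comparing C=1 and C=1000 as in the figure:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping toy data (hypothetical, just to exercise the slack variables)
X, y = make_blobs(n_samples=80, centers=2, cluster_std=3.0, random_state=1)

for C in (1.0, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_sv = len(clf.support_)
    train_err = 1 - clf.score(X, y)
    print(f"C={C:>6}: {n_sv} support vectors, training error {train_err:.3f}")

# Small C tolerates margin violations (more slack, wider margin);
# large C penalizes them heavily and lets erroneous objects pull the boundary.
```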

Problem 2 The decision boundary is only linear... Advanced Pattern Recognition 48

Trick: transform your data $\max_\alpha \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i,j}^N y_i y_j \alpha_i \alpha_j x_i^T x_j$ subject to $\alpha_i \geq 0$ and $\sum_{i=1}^N \alpha_i y_i = 0$, with $f(z) = w^T z + b = \sum_{i=1}^N \alpha_i y_i x_i^T z + b$ Magic: all operations are on inner products between objects Assume I have a magic transformation of my data that makes it linearly separable... Advanced Pattern Recognition 49

Example transformation [Figure: the original data plotted in $(x_1, x_2)$ and the mapped data plotted in $(x_1^2, x_2^2)$] Original data: $x = (x_1, x_2)$ Mapped data: $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$ Advanced Pattern Recognition 50

Polynomial kernel When we have two vectors $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$ and $\Phi(y) = (y_1^2, y_2^2, \sqrt{2}\, y_1 y_2)$, it becomes very cheap to compute the inner product: $\Phi(x)^T \Phi(y) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = \left((x_1, x_2)(y_1, y_2)^T\right)^2 = (x^T y)^2$ Advanced Pattern Recognition 51
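The identity $\Phi(x)^T\Phi(y) = (x^T y)^2$ can be verified directly. A minimal numeric check; the two example vectors are made up.

```python
import numpy as np

def phi(v):
    """Explicit feature map (x1^2, x2^2, sqrt(2)*x1*x2) from the slide."""
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.5, -0.5])
y = np.array([2.0,  3.0])

lhs = phi(x) @ phi(y)      # inner product in the mapped 3D space
rhs = (x @ y) ** 2         # polynomial kernel computed in the original 2D space

print(lhs, rhs)            # both give the same value
assert np.isclose(lhs, rhs)
```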

Transform your data $\max_\alpha \sum_{i=1}^N \alpha_i - \frac{1}{2}\sum_{i,j}^N y_i y_j \alpha_i \alpha_j \Phi(x_i)^T \Phi(x_j)$ subject to $\alpha_i \geq 0 \;\forall i$ and $\sum_{i=1}^N \alpha_i y_i = 0$, with $f(z) = \sum_{i=1}^N \alpha_i y_i \Phi(x_i)^T \Phi(z) + b$ If we have to introduce the magic $\Phi(x)$, we can just as well introduce the magic kernel function: $K(x, y) = \Phi(x)^T \Phi(y)$ Advanced Pattern Recognition 52

The kernel-trick The idea to replace all inner products by a single function, the kernel function K, is called the kernel trick It implicitly maps the data to a (most often) high dimensional feature space The practical computational complexity does not change (except for computing the kernel function) Advanced Pattern Recognition 53

Kernel functions [Figure: decision boundaries of the polynomial kernel and the RBF kernel on the Banana Set (Feature 1 vs. Feature 2)] Polynomial kernel: $K(x, y) = (x^T y + 1)^d$ RBF kernel: $K(x, y) = \exp\left(-\frac{\|x - y\|^2}{\sigma^2}\right)$ Advanced Pattern Recognition 54
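Both kernels are straightforward to implement as Gram-matrix functions and can be plugged into an SVM as a precomputed kernel. A sketch with made-up circular toy data; the sigma, C, and dataset are arbitrary choices.

```python
import numpy as np
from sklearn.svm import SVC

def poly_kernel(X, Y, d=2):
    """Polynomial kernel K(x, y) = (x^T y + 1)^d."""
    return (X @ Y.T + 1.0) ** d

def rbf_kernel(X, Y, sigma=1.0):
    """RBF kernel K(x, y) = exp(-||x - y||^2 / sigma^2)."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / sigma ** 2)

# Hypothetical 2D toy data with a nonlinear class boundary
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # circle-shaped classes

# Plug a precomputed kernel matrix into the SVM
clf = SVC(kernel="precomputed", C=10.0)
clf.fit(rbf_kernel(X, X, sigma=1.0), y)
print("training accuracy:", clf.score(rbf_kernel(X, X, sigma=1.0), y))
```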

Results [Figure: example images from the USPS handwritten digits dataset] Cortes & Vapnik, Machine Learning 1995 Advanced Pattern Recognition 55

Advanced kernel functions People construct special kernel functions for applications: Tangent distance kernels to incorporate rotation/scaling insensitivity (handwritten digit recognition) String matching kernels to classify DNA/protein sequences Fisher kernels to incorporate knowledge on class densities Hausdorff kernels to compare image blobs... Advanced Pattern Recognition 56

High dimensional feature spaces SVM appears to work well in high dimensional feature spaces The class overlap is often not large Minimizing h minimizes the risk of overfitting The classifier is determined by the support objects and not directly by the features No density is estimated Advanced Pattern Recognition 57

Advantages of SVM The SVM generalizes remarkably well, in particular in high-dimensional feature spaces with (relatively) low sample sizes Given a kernel and a C, there is one unique solution The kernel trick allows for a varying complexity of the classifier The kernel trick allows for specially engineered representations for particular problems No strict data model is assumed (when you can assume one, use it!) The foundation of the SVM is pretty solid (when no slack variables are used) An error estimate is available using just the training data (but it is a pretty loose estimate, and cross-validation is still required to optimize K and C) Advanced Pattern Recognition 58

Disadvantages of SVM The quadratic optimization is computationally expensive (although more and more specialized optimizers appear) The kernel and C have to be optimized The SVM has problems with highly overlapping classes Advanced Pattern Recognition 59

Conclusions Always tune the complexity of your classifier to your data (#training objects, dimensionality, class overlap, class shape...) A classifier does not need to estimate (class-) densities, like an SVM The SVM can be used in almost all applications (by adjusting K and C appropriately) Advanced Pattern Recognition 60