03/Feb/2010
VC Dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms
Presented by Andriy Temko, Department of Electrical and Electronic Engineering
Content
- VC Dimension
- Performance Assessment (a way to estimate the error)
- Model Selection (a way to reduce the error)
Structural Risk Minimization
The upper bound was derived by Chervonenkis and Vapnik in the 1970s. With confidence 1 − η, 0 < η < 1:

TESTERR(α) ≤ TRAINERR(α) + √[ (h(ln(2R/h) + 1) − ln(η/4)) / R ]

where R is the size of the training set and h is the VC-dimension of the class of functions.
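To get a feel for how this bound behaves, here is a minimal sketch that evaluates the VC-confidence term for illustrative values of h, R and η (the numbers are invented for illustration, not taken from the slides):

```python
import math

def vc_confidence(h, R, eta):
    """VC-confidence term of the Vapnik-Chervonenkis bound:
    sqrt((h * (ln(2R/h) + 1) - ln(eta/4)) / R)."""
    return math.sqrt((h * (math.log(2 * R / h) + 1) - math.log(eta / 4)) / R)

# Illustrative values: VC-dimension 10, 1000 training samples, confidence 95%.
h, R, eta = 10, 1000, 0.05
print(f"VC-confidence term: {vc_confidence(h, R, eta):.3f}")   # ~0.26
# With TRAINERR = 0.05, the guaranteed test error is only bounded by ~0.31,
# which already hints at how conservative the bound is in practice.
print(f"Bound on TESTERR:   {0.05 + vc_confidence(h, R, eta):.3f}")
```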
VC Dimension
- A number characterizing a class of decision strategies; abbreviated VC-dimension.
- Named after Vladimir Vapnik and Alexey Chervonenkis (it appeared in their book in Russian: V. Vapnik, A. Chervonenkis, Pattern Recognition Theory, Statistical Learning Problems, Nauka, Moscow, 1974).
- One of the core concepts in the VC theory of learning.
- In the original 1974 publication it was called the capacity of a class of strategies.
- The VC dimension is a measure of the capacity of a statistical classification algorithm: a more sophisticated measure of model complexity than dimensionality or the number of free parameters.
VC Dimension
Definition: the VC-dimension is the maximal number h of data points (observations) that can be shattered by the class of decision strategies, for some placement of the points.
Shattering (I)
2-dimensional space, 3 points; dictionary: a line. [Figure: every labelling of 3 points in general position is separated by a line.] Since every labelling can be realized, the VC dimension of a line is at least 3.
Shattering (II)
Note: shattering requires every labelling for some position of the points, not every position. Although 3 points placed on a line cannot be shattered, the VC dimension of a line is still at least 3.
Shattering (III)
2-dimensional space, 4 points; dictionary: a line. You cannot find any position of 4 points in 2-dimensional space that can be shattered for every labelling (e.g. the XOR labelling is not linearly separable), so the VC dimension of a line is exactly 3. Consequently, the VC-dimension of linear decision strategies in n dimensions is h = n + 1.
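To make the shattering argument concrete, here is a brute-force sketch of the check (my illustration, not from the slides; it assumes SciPy is available and tests each labelling for linear separability via an LP feasibility problem):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Check whether a +/-1 labelling of 2-D points is realizable by a line,
    via LP feasibility: find w, b with y_i * (w . x_i + b) >= 1 for all i."""
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)])
    b_ub = -np.ones(len(points))
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3)  # w1, w2, b are unconstrained
    return res.success

def shattered(points):
    """A point set is shattered if every labelling is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]         # general position
four = [(0, 0), (1, 0), (0, 1), (1, 1)]  # the XOR labelling fails here
print(shattered(three))  # True  -> VC dimension of a line is at least 3
print(shattered(four))   # False -> no 4 points in the plane can be shattered
```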
VC Dimension: Practical View
Bad news:
- Computing the guaranteed risk is useless in many practical situations.
- The VC dimension cannot be accurately estimated for nonlinear models such as neural networks.
- Structural Risk Minimization may lead to a non-linear optimization problem.
- The VC dimension may be infinite (e.g., for a nearest-neighbour classifier or for the Gaussian kernel), in which case the bound would require an infinite amount of training data.
Good news:
- Structural Risk Minimization can be applied to linear classifiers. It is especially useful for Support Vector Machines.
VC Dimension: Notes
Is empirical risk minimization (minimization of the training-set error, e.g. neural networks with backpropagation) therefore dead? No! The structural risk may be so large that this upper bound becomes useless. Find a tighter bound and you will be famous; it is not impossible!
Content
- VC Dimension
- Performance Assessment
- Model Selection
Performance Assessment: Loss Function
Typical choices for a quantitative response Y:
- L(Y, f̂(X)) = (Y − f̂(X))²  (squared error)
- L(Y, f̂(X)) = |Y − f̂(X)|  (absolute error)
Typical choices for a categorical response G:
- L(G, Ĝ(X)) = I(G ≠ Ĝ(X))  (0-1 loss)
- L(G, p̂(X)) = −2 Σ_{k=1}^{K} I(G = k) log p̂_k(X) = −2 log p̂_G(X)  (log-likelihood)
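These losses are straightforward to compute; a minimal sketch in Python (the function names are mine, not from the slides). The training error of the next slide is then simply the mean of the chosen loss over the training sample:

```python
import numpy as np

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    return np.abs(y - y_hat)

def zero_one_loss(g, g_hat):
    # I(G != G_hat), as a float so it can be averaged
    return (g != g_hat).astype(float)

def log_likelihood_loss(g, p_hat):
    """-2 * log p_hat_G(X): p_hat is an (N, K) matrix of class probabilities,
    g holds the true class indices."""
    return -2 * np.log(p_hat[np.arange(len(g)), g])

# Training error = average loss over the training sample, e.g.:
# err = np.mean(zero_one_loss(g, g_hat))
```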
Training Error
Training error is the average loss over the training sample.
For the quantitative response variable Y:
- err = (1/N) Σ_{i=1}^{N} L(y_i, f̂(x_i))
For the categorical response variable G:
- err = (1/N) Σ_{i=1}^{N} I(g_i ≠ Ĝ(x_i))
- err = −(2/N) Σ_{i=1}^{N} log p̂_{g_i}(x_i)
Test (Generalization) Error
Generalization error, or test error, is the expected prediction error over an independent test sample.
For a quantitative response Y: Err = E[L(Y, f̂(X))]
For a categorical response G: Err = E[L(G, Ĝ(X))] or Err = E[L(G, p̂(X))]
TRUE ERROR RATE: the classifier's error rate on the ENTIRE POPULATION (after we train on all available training data).
Estimation of the True Error
In real applications we only have access to a finite set of examples, usually smaller than we would like. The main estimation methods:
- The holdout method
- Random subsampling (bootstrap)
- K-fold cross-validation
- Leave-one-out cross-validation
Bias-Variance Dilemma (I)
[Figure: the four bias/variance combinations — high bias, low variance; low bias, high variance; high bias, high variance; low bias, low variance.]
BIAS: how much the estimate deviates from the true value.
VARIANCE: how much variability the estimate shows for different samples of the population.
Methods (I): The Holdout Method
[Figure: the dataset split into training (X) and validation (V) portions.]
- In problems where we have a sparse dataset, we may not be able to afford the luxury of setting aside a portion of the dataset for testing.
- Since it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an unfortunate split.
- Usually used in machine-learning evaluation campaigns for comparison of approaches.
- A standard split is usually specified for large stand-alone commercial databases to facilitate comparison.
Methods (II)
- Random splits: the test sets are not independent; overoptimistic assessment (for non-Gaussian data).
- K-fold CV: a trade-off between computational cost and bias/variance.
- LOO CV: large variance; high computational cost; the most unbiased estimate possible.
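A sketch comparing the three schemes, assuming scikit-learn (the slides do not name a toolkit; the dataset and classifier are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (ShuffleSplit, KFold, LeaveOneOut,
                                     cross_val_score)
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)
clf = SVC(kernel="linear")

schemes = {
    "random splits": ShuffleSplit(n_splits=10, test_size=0.25, random_state=0),
    "5-fold CV":     KFold(n_splits=5, shuffle=True, random_state=0),
    "LOO CV":        LeaveOneOut(),
}
for name, cv in schemes.items():
    scores = cross_val_score(clf, X, y, cv=cv)
    # Note: each LOO split scores 0 or 1, so its raw variance is taken
    # across single-example test sets.
    print(f"{name:14s} mean={scores.mean():.3f} var={scores.var():.4f}")
```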
Bias-Variance Dilemma (II)
[Figure: the four bias/variance quadrants again, now with the estimation methods placed on them — LOO CV: low bias, high variance; the holdout method: high bias, low variance; random splits and K-fold CV in between.]
Example: Neonatal Seizure Detection

Method                                min/max   ROC    Var
Hold-out #1 (train 75% / test 25%)    85/97     92.2   0
Hold-out #2 (train 75% / test 25%)    97/99     97.7   0
2-fold CV                             93/95     94.4   1.2
6-fold CV                             92/97     95.1   1.7
LOO CV                                89/99     96.5   2.5
Test error (!!! hypothetical !!!)               96.0

In the performance assessment task we are more interested in low bias, sacrificing high variance, so LOO is a good choice if computationally feasible. Performance assessment is a way to estimate the error, not a way to reduce the error.
Content
- VC Dimension
- Performance Assessment
- Model Selection
Model Selection (I)
Model Selection (II)
[Diagram: each learning algorithm has hyperparameters that must be chosen to minimize E_generalization — SVM: C, σ; GMM: number of Gaussians, type of covariance; MLP: number of neurons, number of layers.]
No free lunch theorems: no single model works best for every problem.
Estimation of the Model Prediction Error
Empirical:
- K-fold CV
- LOO CV
- Bootstrap
- Test set
Theoretical:
- Bayesian information criterion (BIC)
- Akaike information criterion (AIC)
- Minimum description length (MDL)
- Structural risk minimization (SRM)
- ...
Empirical methods are data-driven and in practice work better. Theoretical methods have the advantage that you only need the training error.
BIC
[Table: for each candidate model f_i — training log-likelihood, number of parameters, BIC score, choice.]
As the amount of data goes to infinity, BIC promises* to select the model that the data was generated from.
*Subject to about a million caveats.
AIC
[Table: for each candidate model f_i — training log-likelihood, number of parameters, AIC score, choice.]
As the amount of data goes to infinity, AIC promises* to select the model that will have the best likelihood for future data.
*Another million caveats.
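Both criteria are one-liners given the training log-likelihood LL and the parameter count; a sketch with hypothetical candidate models f_i (the numbers are invented for illustration, and note how BIC's ln(N) penalty favours the simpler model while AIC tolerates more parameters):

```python
import numpy as np

def bic(log_likelihood, n_params, n_samples):
    """BIC = -2*LL + k*ln(N); lower is better."""
    return -2 * log_likelihood + n_params * np.log(n_samples)

def aic(log_likelihood, n_params):
    """AIC = -2*LL + 2k; lower is better."""
    return -2 * log_likelihood + 2 * n_params

# Hypothetical candidates: (training log-likelihood, number of parameters)
models = {"f1": (-1200.0, 10), "f2": (-1150.0, 40), "f3": (-1140.0, 120)}
N = 1000
for name, (ll, k) in models.items():
    print(name, f"BIC={bic(ll, k, N):.1f}", f"AIC={aic(ll, k):.1f}")
# Here BIC selects f1 while AIC selects f2.
```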
Structural Risk (VC Dimension)
[Table: for each candidate model f_i — training error E_tr, VC confidence, upper bound on E_test, choice.]
The VC-confidence term is usually very, very conservative (at least hundreds of times larger than the empirical overfitting effect).
Cross-Validation
[Table: for each candidate model f_i — training error, 10-fold CV error, choice.]
Empirical methods tried on the neonatal seizure detection task (17 patients): LOO CV, 10×2 CV, 5×2 CV and 10 random splits gave similar performance.
The LOO problem is a lack of continuity: a small change in the data can cause a large change in the model selected. Large variance is not good for model selection.
Overfitting occurs when model selection is patient-dependent but performance assessment is patient-independent, e.g. with very long feature sets (mean-variance features): patient-dependent model selection gives ROC = 93%, patient-independent model selection gives ROC = 96%.
Response to Parameter Selection
[Figure: performance as a function of the hyperparameters.] The response surface is usually not convex!!! Use grid search, simplex search, etc. (see the sketch below).
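A sketch of mapping out the cross-validated response surface over an SVM (C, γ) grid, assuming scikit-learn; the grid values and dataset are illustrative. Because the surface is usually not convex, an exhaustive grid is a safe default:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Print mean 5-fold CV accuracy at each (C, gamma) grid point.
for C in [0.1, 1, 10, 100]:
    row = [cross_val_score(SVC(C=C, gamma=g), X, y, cv=5).mean()
           for g in [1e-3, 1e-2, 1e-1, 1]]
    print(f"C={C:>5}: " + "  ".join(f"{s:.3f}" for s in row))
```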
Procedure Outline
1. Divide the available data into a development set and a test set.
2. Divide the development set into training/validation sets.
3. Select the architecture and training parameters.
4. Train the model using the training set.
5. Evaluate the model using the validation set.
6. Repeat steps 2 through 5 using different architectures and training parameters.
7. Select the best model and train it using all development data.
8. Assess this final model using the test set.
After assessing the final model with the test set, YOU MUST NOT tune the model any further.
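The outline maps directly onto code; a sketch assuming scikit-learn, with the development/test discipline made explicit (dataset, grid and model are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Step 1: split off a test set that is touched exactly once, at the very end.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.25,
                                                random_state=0)

# Steps 2-6: model selection by cross-validation on the development set only.
search = GridSearchCV(SVC(kernel="rbf"),
                      {"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
                      cv=5)
search.fit(X_dev, y_dev)

# Step 7: refit=True (the default) retrains the best model on all dev data.
final_model = search.best_estimator_

# Step 8: assess once on the held-out test set -- and do not tune further.
print("test accuracy:", final_model.score(X_test, y_test))
```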
Acknowledgements
The following material was used in the preparation of these slides:
- T. Dietterich, "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms", Neural Computation, 1998
- L. Wang, J. Feng, "Learning Gaussian mixture models by structural risk minimization", IEEE ICMLC, 2005
- R. Kohavi, "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection", IJCAI, 1995
- C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, 1998
Talks and slides:
- V. Hlaváč, "Vapnik-Chervonenkis learning theory"
- M. Pardo, "Algorithm independent learning"
- B. Chakraborty, "Model Assessment, Selection and Averaging"
- A. Moore, "VC-dimension for characterizing classifiers"
- A. Moore, "Cross-validation for detecting and preventing overfitting"
- R. Gutierrez-Osuna, "Introduction to Pattern Analysis"
Questions?