VC dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms

03/Feb/2010. VC Dimension, Model Selection and Performance Assessment for SVM and Other Machine Learning Algorithms. Presented by Andriy Temko, Department of Electrical and Electronic Engineering.

Page 2 of 32 Content: VC Dimension; Performance Assessment (a way to estimate the error); Model Selection (a way to reduce the error).

Page 3 of 32 Structural Risk Minimization
The upper bound was derived by Vapnik and Chervonenkis in the 1970s. With confidence $1-\eta$, $0 \le \eta \le 1$,
$\mathrm{TESTERR}(\alpha) \le \mathrm{TRAINERR}(\alpha) + \sqrt{\frac{h\,(\ln(2R/h) + 1) - \ln(\eta/4)}{R}}$
where R is the size of the training set and h is the VC-dimension of the class of functions.
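
To make the bound concrete, here is a minimal sketch (not from the original slides; the common form of the Vapnik-Chervonenkis bound with natural logarithms is assumed) that evaluates the VC confidence term for a given R, h and η:

```python
# A minimal sketch (not from the slides): evaluate the VC confidence term
# sqrt((h*(ln(2R/h) + 1) - ln(eta/4)) / R), which is added to the training
# error to upper-bound the test error with confidence 1 - eta.
import math

def vc_confidence(R, h, eta=0.05):
    """R: number of training examples, h: VC-dimension, eta: confidence parameter."""
    return math.sqrt((h * (math.log(2 * R / h) + 1) - math.log(eta / 4)) / R)

def risk_bound(train_err, R, h, eta=0.05):
    """Upper bound on the test error: TRAINERR plus the VC confidence term."""
    return train_err + vc_confidence(R, h, eta)

# Example: a linear classifier in 10 dimensions (h = n + 1 = 11), 10,000 samples.
print(risk_bound(train_err=0.10, R=10_000, h=11, eta=0.05))
```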

Page 4 of 32 VC Dimension
It is a number characterizing a decision strategy, abbreviated VC-dimension, and named after Vladimir Vapnik and Alexey Chervonenkis (it appeared in their book in Russian: V. Vapnik, A. Chervonenkis, Pattern Recognition Theory, Statistical Learning Problems, Nauka, Moskva, 1974). It is one of the core concepts in the VC theory of learning. In the original 1974 publication it was called the capacity of a class of strategies. The VC dimension is a measure of the capacity of a statistical classification algorithm, a more sophisticated measure of model complexity than dimensionality or the number of free parameters.

Page 5 of 32 VC Dimension
VC-dimension (definition): the maximal number h of data points (observations) that can be shattered by the class of decision strategies.

Page 6 of 32 Shattering (I)
2-dimensional space, 3 points; dictionary: a line. The VC dimension of a line is at least 3.

Page 7 of 32 Shattering (II)
Note: shattering requires every labelling to be realisable for some position of the points, not for any position. Although 3 points placed on a line cannot be shattered, the VC dimension of a line is still at least 3.

Page 8 of 32 Shattering (III)
2-dimensional space, 4 points; dictionary: a line. You cannot find a position of 4 points in 2-dimensional space that can be shattered: some labelling is always unrealisable. The VC dimension of a line is therefore exactly 3. Consequently, the VC-dimension of linear decision strategies in an n-dimensional space is h = n + 1. (A sketch that checks this mechanically follows below.)
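
The shattering argument can be checked mechanically. Below is a small sketch (not part of the slides; NumPy and SciPy are assumed) that tests whether a set of 2-D points can be shattered by a line, by posing each labelling as a linear-programming feasibility problem:

```python
# A sketch (not from the slides): a set of points is shattered by a line if every
# labelling is linearly separable; separability of one labelling is a feasibility
# LP: find w1, w2, b with y_i * (w . x_i + b) >= 1 for every point.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    # Constraints rewritten as -y_i*w1*x_i1 - y_i*w2*x_i2 - y_i*b <= -1.
    A_ub = np.array([[-y * x[0], -y * x[1], -y] for x, y in zip(points, labels)],
                    dtype=float)
    b_ub = -np.ones(len(points))
    res = linprog(c=[0.0, 0.0, 0.0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0  # status 0 means a feasible solution was found

def shattered(points):
    # Shattered iff all 2^n labellings are realisable by a line.
    return all(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))            # True: 3 points in general position
print(shattered([(0, 0), (1, 1), (2, 2)]))            # False: 3 collinear points
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))    # False: no 4 points can be shattered
```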

Page 9 of 32 VC Dimension: Practical View
Bad news: computing the guaranteed risk is useless in many practical situations. The VC dimension cannot be accurately estimated for nonlinear models such as neural networks; Structural Risk Minimization may lead to a non-linear optimization problem; the VC dimension may be infinite (e.g., for a nearest-neighbour classifier or for a Gaussian kernel), requiring an infinite amount of training data.
Good news: Structural Risk Minimization can be applied to linear classifiers. It is especially useful for Support Vector Machines.

Page 10 of 32 VC Dimension: Notes
Is empirical risk minimization (minimization of the training-set error, e.g. neural networks with backpropagation) therefore dead? No! The structural risk may be so large that this upper bound becomes useless. Find a tighter bound and you will be famous; it is not impossible!

Page 11 of 32 Content: VC Dimension; Performance Assessment; Model Selection.

Page 12 of 32 Performance Assessment: Loss Function
Typical choices for a quantitative response Y:
$L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$ (squared error)
$L(Y, \hat{f}(X)) = |Y - \hat{f}(X)|$ (absolute error)
Typical choices for a categorical response G:
$L(G, \hat{G}(X)) = I(G \ne \hat{G}(X))$ (0-1 loss)
$L(G, \hat{p}(X)) = -2\sum_{k=1}^{K} I(G = k)\log \hat{p}_k(X) = -2\log \hat{p}_G(X)$ (log-likelihood)
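
As an illustration, here is a small NumPy sketch of these loss functions (not from the slides; the function names are mine):

```python
# A sketch (not from the slides) of the loss functions above, using NumPy.
import numpy as np

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    return np.abs(y - y_hat)

def zero_one_loss(g, g_hat):
    return (g != g_hat).astype(float)

def log_likelihood_loss(g, p_hat):
    """-2 * log p_hat_G(X): g holds class indices, p_hat holds per-class probabilities."""
    return -2.0 * np.log(p_hat[np.arange(len(g)), g])

# Example usage for a 3-class problem with two samples.
g = np.array([0, 2])
p_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
print(log_likelihood_loss(g, p_hat))
```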

Page 13 of 32 Training Error
Training error is the average loss over the training sample. For the quantitative response variable Y:
$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$
For the categorical response variable G:
$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} I(g_i \ne \hat{G}(x_i))$ (0-1 loss)
$\overline{\mathrm{err}} = -\frac{2}{N}\sum_{i=1}^{N} \log \hat{p}_{g_i}(x_i)$ (log-likelihood)

Page 14 of 32 Test (Generalization) Error
Generalization error, or test error, is the expected prediction error over an independent test sample.
For a quantitative response Y: $\mathrm{Err} = E[L(Y, \hat{f}(X))]$
For a categorical response G: $\mathrm{Err} = E[L(G, \hat{G}(X))]$ or $\mathrm{Err} = E[L(G, \hat{p}(X))]$
TRUE ERROR RATE: the classifier's error rate on the ENTIRE POPULATION (after we train on all available training data).

Page 15 of 32 Estimation of the True Error
In real applications we only have access to a finite set of examples, usually smaller than we would like. Methods: the holdout method; random subsampling (bootstrap); K-fold cross-validation; leave-one-out cross-validation. (A sketch of these estimators follows below.)
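
A sketch of these estimators of the true error (not from the slides; scikit-learn and a synthetic dataset are assumed, and random subsampling is approximated with ShuffleSplit rather than a true bootstrap):

```python
# A sketch (not from the slides): four estimators of the true error for one
# fixed classifier, using scikit-learn on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import (train_test_split, ShuffleSplit, KFold,
                                     LeaveOneOut, cross_val_score)
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")

# 1. Holdout: a single train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
holdout_err = 1 - clf.fit(X_tr, y_tr).score(X_te, y_te)

# 2. Random subsampling: repeated random train/test splits (ShuffleSplit here).
subsampling = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
subsampling_err = 1 - cross_val_score(clf, X, y, cv=subsampling).mean()

# 3. K-fold cross-validation.
kfold_err = 1 - cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True,
                                                    random_state=0)).mean()

# 4. Leave-one-out cross-validation (most expensive, least biased).
loo_err = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

print(holdout_err, subsampling_err, kfold_err, loo_err)
```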

Page 16 of 32 Bias-Variance Dilemma (I)
[Figure: four example panels, one for each combination of high/low bias and high/low variance.]
BIAS: how much the estimate deviates from the true value.
VARIANCE: how much variability the estimate shows for different samples of the population.

Page 17 of 32 Methods (I): The Holdout Method
[Figure: the dataset split into a training portion and a held-out validation/test portion.]
In problems where we have a sparse dataset we may not be able to afford the luxury of setting aside a portion of the data for testing. Since it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an unfortunate split. It is usually used in machine learning evaluation campaigns for comparison of approaches, and usually specified for stand-alone large commercial databases to facilitate comparison.

Page 18 of 32 Methods (II)
Random splits: the testing sets are not independent; overoptimistic assessment (for non-Gaussian data).
K-fold CV: trade-off between computational cost and bias/variance.
LOO CV: large variance; high computational cost; the most unbiased estimate possible.

Page 19 of 32 Bias-Variance Dilemma (II)
[Figure: the holdout method, random splits, K-fold CV and LOO CV placed on the high/low bias vs. high/low variance quadrants.]

Page 20 of 32 Example: Neonatal Seizure Detection
Hold-out #1 (train 75% / test 25%): min/max ROC 85/97, ROC 92.2, Var 0
Hold-out #2 (train 75% / test 25%): min/max ROC 97/99, ROC 97.7, Var 0
2-fold CV: min/max ROC 93/95, ROC 94.4, Var 1.2
6-fold CV: min/max ROC 92/97, ROC 95.1, Var 1.7
LOO CV: min/max ROC 89/99, ROC 96.5, Var 2.5
Test Error (!!! hypothetical !!!): ROC 96.0
In the performance-assessment task we are more interested in low bias, sacrificing high variance; LOO is a good choice if computationally feasible. This is a way to estimate the error, not a way to reduce the error.

Page 21 of 32 Content: VC Dimension; Performance Assessment; Model Selection.

Page 22 of 32 Model Selection (I)

Page 23 of 32 Model Selection (II)
[Diagram: several learning algorithms, each with hyperparameters to be chosen by model selection, each yielding a generalization error E_generalization.]
GMM: #gaus, T cov; SVM: C, σ; MLP: #n, #layers.
No free lunch theorems.

Page 24 of 32 Estimation of the Model Prediction Error
Empirical: K-fold CV; LOO CV; bootstrap; test set.
Theoretical: Bayesian information criterion (BIC); Akaike information criterion (AIC); minimum description length (MDL); structural risk minimization (SRM); ...
Empirical methods are data-driven and in practice work better. Theoretical methods have the advantage that you only need the training error.

Page 25 of 32 BIC
[Table: for each candidate model f_i, the training log-likelihood LL, the number of parameters, the BIC score, and the resulting choice.]
As the amount of data goes to infinity, BIC promises* to select the model that the data was generated from. (*Subject to about a million caveats.)

Page 26 of 32 AIC
[Table: for each candidate model f_i, the training log-likelihood LL, the number of parameters, the AIC score, and the resulting choice.]
As the amount of data goes to infinity, AIC promises* to select the model that will have the best likelihood for future data. (*Another million caveats.) (A sketch combining AIC and BIC follows below.)
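
A small sketch (not from the slides) tabulating AIC and BIC for a set of hypothetical candidate models; the common conventions AIC = -2·LL + 2k and BIC = -2·LL + k·ln N are assumed, with the lowest score chosen:

```python
# A sketch (not from the slides): tabulate AIC and BIC for candidate models f_i,
# given each model's training log-likelihood LL and number of parameters k.
# Conventions assumed here: AIC = -2*LL + 2*k, BIC = -2*LL + k*ln(N); lower is better.
import math

def aic(ll, k):
    return -2 * ll + 2 * k

def bic(ll, k, n):
    return -2 * ll + k * math.log(n)

# Hypothetical candidates: (name, training log-likelihood, number of parameters).
candidates = [("f_1", -1205.0, 4), ("f_2", -1180.0, 9), ("f_3", -1155.0, 30)]
n = 1000  # number of training examples

scores = [(name, aic(ll, k), bic(ll, k, n)) for name, ll, k in candidates]
print(min(scores, key=lambda s: s[1])[0], "chosen by AIC")
print(min(scores, key=lambda s: s[2])[0], "chosen by BIC")
```

With these made-up numbers, AIC prefers the largest model while BIC's heavier ln(N) penalty picks a smaller one, illustrating how the two criteria can disagree.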

Page 27 of 32 Structural Risk (VC dimension)
[Table: for each candidate model f_i, the training error E_tr, the VC confidence term, the upper bound on E_test, and the resulting choice.]
The VC-confidence term is usually very, very conservative (at least hundreds of times larger than the empirical overfitting effect).

Page 28 of 32 Cross-Validation
[Table: for each candidate model f_i, the training error, the 10-fold CV error, and the resulting choice.]
Empirical methods tried on the Neonatal Seizure Detection task (17 patients): LOO CV, 10x2 CV, 5x2 CV and 10 random splits give similar performance.
A problem with LOO is its lack of continuity: a small change in the data can cause a large change in the model selected, and large variance is not good for model selection.
Overfitting occurs when model selection is patient-dependent and performance assessment is patient-independent. E.g., with very long feature sets (mean-variance features): patient-dependent model selection ROC = 93%, patient-independent model selection ROC = 96%.

Page 29 of 32 Response to Parameter Selection
The response of the validation error to the parameters is usually not convex!!! Use grid search, simplex search, etc. (a sketch of a grid search follows below).
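
A sketch of such a grid search over the SVM hyperparameters (not from the slides; scikit-learn and a synthetic dataset are assumed, with gamma standing in for the σ of the Gaussian kernel):

```python
# A sketch (not from the slides): since the cross-validation error is generally
# not a convex function of the hyperparameters, evaluate it on an exhaustive
# (C, gamma) grid rather than trusting a local search.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

C_grid = np.logspace(-2, 3, 6)        # 0.01 ... 1000
gamma_grid = np.logspace(-4, 1, 6)    # gamma plays the role of 1/(2*sigma^2)

best = None
for C in C_grid:
    for gamma in gamma_grid:
        cv_err = 1 - cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()
        if best is None or cv_err < best[0]:
            best = (cv_err, C, gamma)

print("best CV error %.3f at C=%g, gamma=%g" % best)
```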

Page 30 of 32 Procedure Outline
1. Divide the available data into a development set and a test set.
2. Divide the development set into training and validation sets.
3. Select the architecture and training parameters.
4. Train the model using the training set.
5. Evaluate the model using the validation set.
6. Repeat steps 2 through 5 using different architectures and training parameters.
7. Select the best model and train it using all development data.
8. Assess this final model using the test set.
After assessing the final model with the test set, YOU MUST NOT further tune the model. (A sketch of this procedure follows after this list.)
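
A sketch of the whole procedure (not from the slides; scikit-learn and a synthetic dataset are assumed, with GridSearchCV carrying out steps 2-7 on the development set):

```python
# A sketch (not from the slides): the test set is held out once, model selection
# happens only on the development set, and the test set is used exactly once.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Step 1: development / test split.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Steps 2-6: training/validation splits, candidate parameters, train and evaluate
# each candidate (GridSearchCV does this with 5-fold cross-validation).
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(), param_grid, cv=5)

# Step 7: select the best model and refit it on all development data
# (GridSearchCV refits the best estimator on X_dev by default).
search.fit(X_dev, y_dev)

# Step 8: assess the final model on the test set, exactly once.
print("selected parameters:", search.best_params_)
print("test accuracy: %.3f" % search.score(X_test, y_test))
# After this point the test result must not be used to tune the model further.
```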

Page 31 of 32 Acknowledgements
The following material has been used in the preparation of these slides:
T. Dietterich, Statistical Tests for Comparing Supervised Classification Learning Algorithms, Neural Computation '96
L. Wang, J. Feng, Learning Gaussian mixture models by structural risk minimization, IEEE MLC '05
R. Kohavi, A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, IJCAI '95
C. Burges, A Tutorial on Support Vector Machines, DMKD '04
Talks and slides: V. Hlaváč, Vapnik-Chervonenkis learning theory; M. Pardo, Algorithm independent learning; B. Chakraborty, Model Assessment, Selection and Averaging; A. Moore, VC-dimension for characterizing classifiers; A. Moore, Cross-validation for detecting and preventing overfitting; R. Gutierrez-Osuna, Introduction to Pattern Analysis.

Page 32 of 32 Questions?