Cross-validation for detecting and preventing overfitting

Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Copyright 2001, Andrew W. Moore

A Regression Problem

y = f(x) + noise. Can we learn f from this data? Let's consider three methods...

Univariate Linear regression with a constant term

Given R records (x_1, y_1), ..., (x_R, y_R), collect the inputs into X = (x_1, ..., x_R) and the outputs into y = (y_1, ..., y_R). (Originally discussed in the previous Andrew Lecture: Neural Nets.) Form the matrix Z whose k-th row is z_k = (1, x_k), and compute

    beta = (Z^T Z)^-1 (Z^T y)

The fitted model is y_est = beta_0 + beta_1 x.
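As a concrete illustration, here is a minimal NumPy sketch of the normal-equations computation beta = (Z^T Z)^-1 (Z^T y). The use of NumPy and the tiny made-up data set are my additions, not part of the slides:

```python
import numpy as np

def fit_linear(x, y):
    """Fit y_est = beta0 + beta1*x by the normal equations
    beta = (Z^T Z)^-1 (Z^T y), where row k of Z is z_k = (1, x_k)."""
    Z = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

# Tiny illustrative data set (made up): points lying exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
beta = fit_linear(x, y)          # recovers (beta0, beta1) = (1, 2)
```

In practice one would use a least-squares solver rather than forming Z^T Z explicitly, but the explicit form mirrors the slide's formula.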
Quadratic Regression

Each input is mapped to z_k = (1, x_k, x_k^2), giving a matrix Z with one such row per record. As before,

    beta = (Z^T Z)^-1 (Z^T y)

and the fitted model is y_est = beta_0 + beta_1 x + beta_2 x^2. Much more about this in the future Andrew Lecture: Favorite Regression Algorithms.

Join-the-dots

Also known as piecewise linear nonparametric regression, if that makes you feel better.

Which is best?

Why not choose the method with the best fit to the data?

What do we really want?

Why not choose the method with the best fit to the data? Because the real question is: how well are you going to predict future data drawn from the same distribution?
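The quadratic fit above uses the same normal equations, just with a wider Z. A short NumPy sketch (library choice and toy data are mine, not the slides'):

```python
import numpy as np

def fit_poly(x, y, degree):
    """Polynomial regression via the normal equations:
    z_k = (1, x_k, x_k^2, ..., x_k^degree), beta = (Z^T Z)^-1 (Z^T y)."""
    Z = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, ...
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

# Points lying exactly on y = 1 + x^2
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = x**2 + 1.0
beta = fit_poly(x, y, degree=2)    # recovers (1, 0, 1)
```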
The test set method

1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.

Linear regression example:    Mean Squared Error = 2.4
Quadratic regression example: Mean Squared Error = 0.9
Join-the-dots example:        Mean Squared Error = 2.2

The test set method

Good news:
  Very very simple
  Can then simply choose the method with the best test-set score
Bad news:
  Wastes data: we get an estimate of the best method to apply to 30% less data
  If we don't have much data, our test set might just be lucky or unlucky

We say the test-set estimator of performance has high variance.
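The four steps above can be sketched in a few lines of NumPy. Everything here (the synthetic data, the 40-record size, the seed) is my invention for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus noise
x = rng.uniform(0, 10, size=40)
y = 2 * x + 1 + rng.normal(0, 1, size=40)

# Steps 1-2: randomly put 30% of the records in a test set; the rest train.
idx = rng.permutation(len(x))
n_test = int(0.3 * len(x))
test, train = idx[:n_test], idx[n_test:]

# Step 3: perform the regression on the training set only.
Z = np.column_stack([np.ones(train.size), x[train]])
beta = np.linalg.solve(Z.T @ Z, Z.T @ y[train])

# Step 4: estimate future performance with the test set.
pred = beta[0] + beta[1] * x[test]
test_mse = np.mean((pred - y[test]) ** 2)
```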
LOOCV (Leave-one-out Cross Validation)

For k = 1 to R:
  1. Let (x_k, y_k) be the k-th record.
  2. Temporarily remove (x_k, y_k) from the data set.
  3. Train on the remaining R-1 datapoints.
  4. Note your error on (x_k, y_k).
When you've done all points, report the mean error.

Linear regression example: MSE_LOOCV = 2.12
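The LOOCV loop above translates directly into code. A NumPy sketch (the helper names and toy data are mine, not prescribed by the slides):

```python
import numpy as np

def loocv_mse(x, y, fit, predict):
    """For k = 1..R: hold out record k, train on the remaining R-1
    datapoints, and record the squared error on (x_k, y_k)."""
    errs = []
    for k in range(len(x)):
        keep = np.arange(len(x)) != k
        model = fit(x[keep], y[keep])
        errs.append((predict(model, x[k]) - y[k]) ** 2)
    return float(np.mean(errs))

def fit_linear(x, y):
    Z = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

def predict_linear(beta, x):
    return beta[0] + beta[1] * x

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, 2.2, 3.9, 6.1, 8.3])          # roughly linear, with noise
mse = loocv_mse(x, y, fit_linear, predict_linear)

# Sanity check: on noise-free linear data the LOOCV error is zero.
mse_perfect = loocv_mse(x, 2 * x + 1.0, fit_linear, predict_linear)
```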
LOOCV for Quadratic Regression

Same procedure: MSE_LOOCV = 0.962

LOOCV for Join The Dots

Same procedure: MSE_LOOCV = 3.33

Which kind of Cross Validation?

Test-set:
  Downside: Variance: an unreliable estimate of future performance.
  Upside: Cheap.
Leave-one-out:
  Downside: Expensive. Has some weird behavior.
  Upside: Doesn't waste data.

..can we get the best of both worlds?
k-fold Cross Validation

Randomly break the data set into k partitions (in our example we'll have k=3 partitions, colored red, green and blue).

For the red partition: train on all the points not in the red partition. Find the test-set sum of errors on the red points.
For the green partition: train on all the points not in the green partition. Find the test-set sum of errors on the green points.
For the blue partition: train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.

Then report the mean error.

Linear regression:    MSE_3FOLD = 2.05
Quadratic regression: MSE_3FOLD = 1.11
Join-the-dots:        MSE_3FOLD = 2.93

Which kind of Cross Validation?

Test-set:
  Downside: Variance: an unreliable estimate of future performance.
  Upside: Cheap.
Leave-one-out:
  Downside: Expensive. Has some weird behavior.
  Upside: Doesn't waste data.
10-fold:
  Downside: Wastes 10% of the data. 10 times more expensive than test-set.
  Upside: Only wastes 10%. Only 10 times more expensive instead of R times.
3-fold:
  Downside: Wastier than 10-fold. Expensivier than test-set.
  Upside: Slightly better than test-set.
R-fold:
  Identical to Leave-one-out.

But note: one of Andrew's joys in life is algorithmic tricks for making these cheap.
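The three-partition procedure above generalizes to any k. A NumPy sketch (partitioning scheme and toy data are my own choices):

```python
import numpy as np

def kfold_mse(x, y, k, fit, predict, seed=0):
    """Randomly break the data set into k partitions; for each partition,
    train on all the points NOT in it and collect the errors on it;
    then report the mean error over all R records."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(x)) % k        # random partition labels
    sq_err = np.empty(len(x))
    for fold in range(k):
        held = folds == fold
        model = fit(x[~held], y[~held])
        sq_err[held] = (predict(model, x[held]) - y[held]) ** 2
    return float(sq_err.mean())

def fit_linear(x, y):
    Z = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

x = np.linspace(0, 9, 30)
y = 2 * x + 1.0                                 # noise-free line
mse = kfold_mse(x, y, k=3, fit=fit_linear,
                predict=lambda b, xs: b[0] + b[1] * xs)
```

On this noise-free line every fold's model fits exactly, so the 3-fold error is zero; with real noisy data it behaves like the MSE_3FOLD numbers above.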
CV-based Model Selection

We're trying to decide which learning algorithm to use. We train each machine f_1, ..., f_6, record its training-set error (TRAINERR) and its 10-fold-CV error (10-FOLD-CV-ERR) in a table, and choose the f_i with the best 10-fold-CV error:

  i   f_i   TRAINERR   10-FOLD-CV-ERR   Choice
  1   f1
  2   f2
  3   f3
  4   f4
  5   f5
  6   f6

Alternatives to CV-based model selection

Model selection methods:
1. Cross-validation
2. AIC (Akaike Information Criterion)
3. BIC (Bayesian Information Criterion)
4. VC-dimension (Vapnik-Chervonenkis Dimension): only directly applicable to choosing classifiers. Described in a future Andrew Lecture.

Which model selection method is best?

1. (CV) Cross-validation
2. AIC (Akaike Information Criterion)
3. BIC (Bayesian Information Criterion)
4. (SRMVC) Structural Risk Minimization with VC-dimension

AIC, BIC and SRMVC have the advantage that you only need the training error.
CV error might have more variance.
SRMVC is wildly conservative.
Asymptotically AIC and Leave-one-out CV should be the same.
Asymptotically BIC and a carefully chosen k-fold should be the same.
You want BIC if you want the best structure instead of the best predictor (e.g. for clustering or Bayes Net structure finding).
Many alternatives---including proper Bayesian approaches. It's an emotional issue.

Other Cross-validation issues

Can do "leave all pairs out" or "leave-all-ntuples-out" if feeling resourceful.
Some folks do k-folds in which each fold is an independently-chosen subset of the data.
Do you know what AIC and BIC are? If so...
  LOOCV behaves like AIC asymptotically.
  k-fold behaves like BIC if you choose k carefully.
If not... Nyardely nyardely nyoo nyoo.

Cross-Validation for regression

Choosing the number of hidden units in a neural net
Feature selection (see later)
Choosing a polynomial degree
Choosing which regressor to use
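One of the regression uses listed above, choosing a polynomial degree, makes a compact end-to-end example of CV-based model selection. The data, degrees tried, and seeds are all invented for this sketch:

```python
import numpy as np

def cv_mse_for_degree(x, y, degree, k=10, seed=0):
    """10-fold CV error of a degree-d polynomial fit."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(x)) % k
    sq_err = np.empty(len(x))
    for fold in range(k):
        held = folds == fold
        Z = np.vander(x[~held], degree + 1, increasing=True)
        beta = np.linalg.lstsq(Z, y[~held], rcond=None)[0]
        Zh = np.vander(x[held], degree + 1, increasing=True)
        sq_err[held] = (Zh @ beta - y[held]) ** 2
    return float(sq_err.mean())

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=60)
y = x**2 - x + rng.normal(0, 0.1, size=60)      # truly quadratic + small noise

# Train each candidate, tabulate its CV error, choose the best.
scores = {d: cv_mse_for_degree(x, y, d) for d in range(5)}
best_degree = min(scores, key=scores.get)
```

Degrees 0 and 1 cannot represent the x^2 term, so their CV error is far larger than the quadratic's; CV picks a degree of at least 2.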
Supervising Gradient Descent

This is a weird but common use of Test-set validation. Suppose you have a neural net with too many hidden units. It will overfit. As gradient descent progresses, maintain a graph of MSE-test-error vs. iteration.

[Figure: Mean Squared Error vs. Iteration of Gradient Descent, with a Training Set curve and a Test Set curve. Training-set error keeps falling, while test-set error reaches a minimum and then climbs. Use the weights you found on the iteration where test-set error was minimized.]
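The bookkeeping in that figure, tracking test-set MSE each iteration and keeping the best weights, can be sketched as below. To stay short this uses a 1-D linear model (my invention), which doesn't actually overfit, so it demonstrates the mechanism rather than the rising test curve:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a linear trend plus noise, split into train and test sets.
x_tr = rng.uniform(-1, 1, 30)
y_tr = 3 * x_tr + rng.normal(0, 0.3, 30)
x_te = rng.uniform(-1, 1, 30)
y_te = 3 * x_te + rng.normal(0, 0.3, 30)

w, b, lr = 0.0, 0.0, 0.1
best_mse, best_w, best_b = np.inf, w, b
for it in range(200):
    # One gradient-descent step on the TRAINING-set MSE.
    err = w * x_tr + b - y_tr
    w -= lr * 2 * np.mean(err * x_tr)
    b -= lr * 2 * np.mean(err)
    # Maintain the graph: test-set MSE at this iteration.
    test_mse = np.mean((w * x_te + b - y_te) ** 2)
    if test_mse < best_mse:
        # Use the weights you found on the best test-set iteration.
        best_mse, best_w, best_b = test_mse, w, b
```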
This relies on an intuition that a not-fully-minimized set of weights is somewhat like having fewer parameters. Works pretty well in practice, apparently.

Cross-validation for classification

Instead of computing the sum squared errors on a test set, you should compute the total number of misclassifications on a test set.

But there's a more sensitive alternative: compute log P(all test outputs | all test inputs, your model).

Cross-Validation for classification

Choosing the pruning parameter for decision trees
Feature selection (see later)
What kind of Gaussian to use in a Gaussian-based Bayes Classifier
Choosing which classifier to use

Cross-Validation for density estimation

Compute the sum of log-likelihoods of test points.

Example uses:
  Choosing what kind of Gaussian assumption to use
  Choosing the density estimator
  NOT feature selection (the test-set density will almost always look better with fewer features)
Feature Selection

Suppose you have a learning algorithm LA and a set of input attributes { X_1, X_2, ..., X_m }. You expect that LA will only find some subset of the attributes useful. Question: how can we use cross-validation to find a useful subset?

Four ideas:
  Forward selection
  Backward elimination
  Hill climbing
  Stochastic search (Simulated Annealing or GAs)

(Another fun area in which Andrew has spent a lot of his wild youth.)

Very serious warning

Intensive use of cross validation can overfit. How? Imagine a data set with 50 records and 1000 attributes. You try 1000 linear regression models, each one using one of the attributes. The best of those 1000 looks good! But you realize it would have looked good even if the output had been purely random!

What can be done about it? Hold out an additional test set before doing any model selection. Check that the best model performs well even on the additional test set.

What you should know

Why you can't use "training-set error" to estimate the quality of your learning algorithm on your data.
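The 50-records-by-1000-attributes warning is easy to reproduce numerically. In this sketch (sizes from the slide, everything else mine) the output is pure noise, yet the best of 1000 single-attribute models still shows a sizable correlation, so an additional held-out set is used as the final check:

```python
import numpy as np

rng = np.random.default_rng(0)

R, M = 50, 1000
X = rng.normal(size=(R, M))      # 1000 candidate attributes
y = rng.normal(size=R)           # the output is PURELY random

# Try 1000 one-attribute models; score each by |correlation| with y.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(M)])
best_j = int(scores.argmax())
best_score = float(scores[best_j])     # the winner looks good by chance

# Hold out an additional, independent set and re-check the chosen attribute.
X_hold = rng.normal(size=(R, M))
y_hold = rng.normal(size=R)
holdout_score = float(abs(np.corrcoef(X_hold[:, best_j], y_hold)[0, 1]))
```

Typically best_score comes out around 0.4 despite there being no signal at all, while the held-out score falls back to the modest level expected for a random correlation.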
Why you can't use "training-set error" to choose the learning algorithm.
Test-set cross-validation.
Leave-one-out cross-validation.
k-fold cross-validation.
Feature selection methods.
CV for classification, regression & densities.