Cross-validation for detecting and preventing overfitting

Andrew W. Moore
Associate Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm   awm@cs.cmu.edu   412-268-7599
Copyright 2001, Andrew W. Moore

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

A Regression Problem
y = f(x) + noise. Can we learn f from this data? Let's consider three methods.

Univariate linear regression with a constant term
(Linear regression was originally discussed in the previous Andrew Lecture: Neural Nets.)
Given R datapoints (x_1, y_1), ..., (x_R, y_R), augment each input with a constant term: z_k = (1, x_k). Stack the z_k as the rows of a matrix Z and the outputs y_k into a vector y. The least-squares coefficients are

    β = (Z^T Z)^-1 (Z^T y)

and the fitted model predicts y_est = β_0 + β_1 x.

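A minimal sketch of this recipe in Python/numpy. The data values here are invented for illustration (they are not the dataset from the slides), and np.linalg.lstsq is used rather than forming (Z^T Z)^-1 explicitly; it computes the same β more stably.

```python
import numpy as np

# Hypothetical (x, y) pairs standing in for the slide's dataset.
x = np.array([3.0, 1.0, 4.0, 2.0, 5.0])
y = np.array([7.0, 3.0, 8.0, 5.0, 11.0])

# Build Z with rows z_k = (1, x_k), then solve the least-squares problem
# Z beta ~= y, which yields beta = (Z^T Z)^-1 Z^T y.
Z = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)

y_est = beta[0] + beta[1] * x   # predictions y_est = beta_0 + beta_1 x
print("beta =", beta)
```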
Quadratic Regression
(Much more about this in the future Andrew Lecture: Favorite Regression Algorithms.)
Same idea, but each input now becomes z_k = (1, x_k, x_k^2). Again

    β = (Z^T Z)^-1 (Z^T y)

and the fitted model predicts y_est = β_0 + β_1 x + β_2 x^2.

Join-the-dots
Also known as piecewise linear nonparametric regression, if that makes you feel better.

Which is best?
Why not choose the method with the best fit to the data?

What do we really want?
How well are you going to predict future data drawn from the same distribution?

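To make the comparison concrete, here is a sketch of the other two candidate regressors, plus the training-set mean squared error of all three, on the same invented data as before (again, not the slides' dataset). Join-the-dots interpolates the training points exactly, so it gets zero training error whenever the x values are distinct, which is precisely why "best fit to the data" is the wrong criterion.

```python
import numpy as np

def fit_linear(x, y):
    Z = np.column_stack([np.ones_like(x), x])             # z_k = (1, x_k)
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return lambda q: b[0] + b[1] * q

def fit_quadratic(x, y):
    Z = np.column_stack([np.ones_like(x), x, x**2])       # z_k = (1, x_k, x_k^2)
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return lambda q: b[0] + b[1] * q + b[2] * q**2

def fit_join_the_dots(x, y):
    order = np.argsort(x)                                  # np.interp needs sorted x
    xs, ys = x[order], y[order]
    return lambda q: np.interp(q, xs, ys)

x = np.array([3.0, 1.0, 4.0, 2.0, 5.0])                    # made-up data, as above
y = np.array([7.0, 3.0, 8.0, 5.0, 11.0])
for name, fit in [("linear", fit_linear), ("quadratic", fit_quadratic),
                  ("join-the-dots", fit_join_the_dots)]:
    model = fit(x, y)
    print(name, "training MSE =", np.mean((model(x) - y) ** 2))
```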
The test set method
1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.

Linear regression example: Mean Squared Error = 2.4
Quadratic regression example: Mean Squared Error = 0.9
Join-the-dots example: Mean Squared Error = 2.2

The test set method
Good news:
Very very simple.
Can then simply choose the method with the best test-set score.
Bad news (what's the downside?):
Wastes data: we get an estimate of the best method to apply to 30% less data.
If we don't have much data, our test set might just be lucky or unlucky.
We say the test-set estimator of performance has high variance.

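A sketch of the test-set method for the three regressors. The dataset is synthetic (generated here, not from the slides), the split is 30/70 as in the slides, and the fit_* helpers are the hypothetical ones sketched earlier.

```python
import numpy as np
# Assumes fit_linear, fit_quadratic, fit_join_the_dots from the earlier sketch.

rng = np.random.default_rng(0)                       # fixed seed so the split is reproducible
x = rng.uniform(0, 5, size=30)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=30)   # y = f(x) + noise, with f linear here

# Steps 1-2: randomly put 30% of the points in a test set, the rest in a training set.
perm = rng.permutation(len(x))
n_test = int(0.3 * len(x))
test, train = perm[:n_test], perm[n_test:]

# Steps 3-4: fit each candidate on the training set, score it on the test set.
for name, fit in [("linear", fit_linear), ("quadratic", fit_quadratic),
                  ("join-the-dots", fit_join_the_dots)]:
    model = fit(x[train], y[train])
    print(name, "test-set MSE =", np.mean((model(x[test]) - y[test]) ** 2))
```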
LOOCV (Leave-one-out Cross Validation)
For k = 1 to R:
1. Let (x_k, y_k) be the k-th record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (x_k, y_k).
When you've done all points, report the mean error.

Linear regression example: MSE_LOOCV = 2.12

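The LOOCV loop translates almost line for line into code. A sketch, assuming the fit_* helpers and the synthetic x, y arrays from the earlier sketches:

```python
import numpy as np

def loocv_mse(fit, x, y):
    """Leave-one-out cross-validation: for each k, train on the other R-1
    points and record the squared error on the held-out (x_k, y_k)."""
    errors = []
    for k in range(len(x)):
        x_train, y_train = np.delete(x, k), np.delete(y, k)   # temporarily remove record k
        model = fit(x_train, y_train)
        errors.append((model(x[k]) - y[k]) ** 2)              # error on the held-out point
    return np.mean(errors)

# e.g. loocv_mse(fit_quadratic, x, y), using the helpers sketched earlier
```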
LOOCV for Quadratic Regression
Same procedure: for k = 1 to R, remove (x_k, y_k), train on the remaining R-1 datapoints, note your error on (x_k, y_k), and report the mean error when you've done all points.
MSE_LOOCV = 0.962

LOOCV for Join The Dots
Same procedure again.
MSE_LOOCV = 3.33

Which kind of Cross Validation?
Test-set: Downside: Variance: an unreliable estimate of future performance. Upside: Cheap.
Leave-one-out: Downside: Expensive. Has some weird behavior. Upside: Doesn't waste data.
...can we get the best of both worlds?

k-fold Cross Validation
Randomly break the dataset into k partitions (in our example we'll have k=3 partitions, coloured red, green and blue).
For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points.
For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points.

For the blue partition: Train on all the points not in the blue partition. Find the test-set sum of errors on the blue points.
Then report the mean error.
Linear regression: MSE_3FOLD = 2.05
Quadratic regression: MSE_3FOLD = 1.11
Join-the-dots: MSE_3FOLD = 2.93

Which kind of Cross Validation?
Test-set: Downside: Variance: an unreliable estimate of future performance. Upside: Cheap.
Leave-one-out: Downside: Expensive. Has some weird behavior. Upside: Doesn't waste data.
10-fold: Downside: Wastes 10% of the data. 10 times more expensive than the test-set method. Upside: Only wastes 10%. Only 10 times more expensive instead of R times.
3-fold: Downside: Wastier than 10-fold. Expensivier than the test-set method. Upside: Slightly better than the test-set method.
R-fold: Identical to Leave-one-out.
(But note: one of Andrew's joys in life is algorithmic tricks for making these cheap.)

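The k-fold loop differs from LOOCV only in what gets held out. A sketch, again assuming the fit_* helpers and synthetic x, y from the earlier sketches:

```python
import numpy as np

def kfold_mse(fit, x, y, k=3, seed=0):
    """k-fold cross-validation: randomly break the data into k partitions;
    for each partition, train on the rest and sum the errors on it."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)     # k random partitions
    total_sq_err = 0.0
    for fold in folds:
        train = np.setdiff1d(np.arange(len(x)), fold)      # all points not in this partition
        model = fit(x[train], y[train])
        total_sq_err += np.sum((model(x[fold]) - y[fold]) ** 2)
    return total_sq_err / len(x)                           # mean error over all points

# e.g. kfold_mse(fit_quadratic, x, y, k=3) or k=10
```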
CV-based Model Selection
We're trying to decide which algorithm to use. We train each machine and make a table: one row per candidate model f_1, f_2, ..., f_6, with columns for its TRAINERR and its 10-FOLD-CV-ERR, and a check in the Choice column next to the model with the lowest 10-fold-CV error.

Alternatives to CV-based model selection
Model selection methods:
1. Cross-validation
2. AIC (Akaike Information Criterion)
3. BIC (Bayesian Information Criterion)
4. VC-dimension (Vapnik-Chervonenkis Dimension)
(VC-dimension is only directly applicable to choosing classifiers. Described in a future Andrew Lecture.)

Which model selection method is best?
1. (CV) Cross-validation
2. AIC (Akaike Information Criterion)
3. BIC (Bayesian Information Criterion)
4. (SRMVC) Structural Risk Minimization with VC-dimension
AIC, BIC and SRMVC have the advantage that you only need the training error. CV error might have more variance. SRMVC is wildly conservative. Asymptotically AIC and Leave-one-out CV should be the same. Asymptotically BIC and a carefully chosen k-fold should be the same. You want BIC if you want the best structure instead of the best predictor (e.g. for clustering or Bayes Net structure finding). There are many alternatives, including proper Bayesian approaches. It's an emotional issue.

Other Cross-validation issues
Can do "leave all pairs out" or "leave-all-n-tuples-out" if feeling resourceful.
Some folks do k-folds in which each fold is an independently-chosen subset of the data.
Do you know what AIC and BIC are? If so... LOOCV behaves like AIC asymptotically, and k-fold behaves like BIC if you choose k carefully. If not... Nardel nardel noo noo.

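A sketch of the CV-based model-selection table above, computed over the three candidate regressors. It assumes the synthetic x, y and the fit_* and kfold_mse helpers sketched earlier; training error is listed alongside only for comparison, the choice is made on the CV column.

```python
import numpy as np

candidates = {
    "linear": fit_linear,
    "quadratic": fit_quadratic,
    "join-the-dots": fit_join_the_dots,
}

scores = {}
for name, fit in candidates.items():
    model = fit(x, y)
    train_err = np.mean((model(x) - y) ** 2)          # TRAINERR: always optimistic
    cv_err = kfold_mse(fit, x, y, k=10)               # 10-FOLD-CV-ERR: what we select on
    scores[name] = cv_err
    print(f"{name:>14}  TRAINERR={train_err:.3f}  10-FOLD-CV-ERR={cv_err:.3f}")

choice = min(scores, key=scores.get)                  # check the row with the lowest CV error
print("Choice:", choice)
```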
Cross-validation for regression
Example uses:
Choosing the number of hidden units in a neural net.
Feature selection (see later).
Choosing a polynomial degree.
Choosing which regressor to use.

Supervising Gradient Descent
This is a weird but common use of test-set validation. Suppose you have a neural net with too many hidden units. It will overfit. As gradient descent progresses, maintain a graph of MSE-test-error vs. iteration.
[Graph: Mean Squared Error vs. iteration of gradient descent, with a Training Set curve and a Test Set curve; use the weights you found on the iteration where the test-set error is minimized.]
This relies on an intuition that a not-fully-minimized set of weights is somewhat like having fewer parameters. It works pretty well in practice, apparently.

Cross-validation for classification
Instead of computing the sum squared errors on a test set, you should compute the total number of misclassifications on a test set. But there's a more sensitive alternative: compute log P(all test outputs | all test inputs, your model).

Cross-validation for classification
Example uses:
Choosing the pruning parameter for decision trees.
Feature selection (see later).
What kind of Gaussian to use in a Gaussian-based Bayes Classifier.
Choosing which classifier to use.

Cross-validation for density estimation
Compute the sum of log-likelihoods of test points.
Example uses:
Choosing what kind of Gaussian assumption to use.
Choosing the density estimator.
NOT feature selection (test-set density will almost always look better with fewer features).

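Both test-set scores mentioned above for classification are easy to compute, and the log-likelihood version is the same idea used for density estimation. A sketch with made-up numbers: probs is some hypothetical classifier's predicted probability of class 1 on each test point, labels are the true 0/1 outputs.

```python
import numpy as np

def classification_scores(probs, labels):
    """Return (number of misclassifications, log P(all test outputs | inputs, model))
    for a binary classifier that outputs P(class = 1) on each test point."""
    probs = np.clip(probs, 1e-12, 1 - 1e-12)          # avoid log(0)
    preds = (probs >= 0.5).astype(int)
    n_misclassified = int(np.sum(preds != labels))
    log_likelihood = float(np.sum(labels * np.log(probs)
                                  + (1 - labels) * np.log(1 - probs)))
    return n_misclassified, log_likelihood

# Hypothetical test-set predictions and true labels:
probs = np.array([0.9, 0.2, 0.6, 0.7, 0.1])
labels = np.array([1, 0, 0, 1, 0])
print(classification_scores(probs, labels))   # misclassification count, and the more sensitive log-likelihood
```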
Feature Selection
Suppose you have a learning algorithm LA and a set of input attributes {X_1, X_2, ..., X_m}. You expect that LA will only find some subset of the attributes useful. Question: how can we use cross-validation to find a useful subset?
Four ideas: Forward selection, Backward elimination, Hill climbing, Stochastic search (Simulated Annealing or GAs).
(Another fun area in which Andrew has spent a lot of his wild youth.)

Very serious warning
Intensive use of cross-validation can overfit. How? Imagine a dataset with 50 records and 1000 attributes. You try 1000 linear regression models, each one using one of the attributes. The best of those 1000 looks good! But you realize it would have looked good even if the output had been purely random!
What can be done about it? Hold out an additional test set before doing any model selection. Check that the best model performs well even on the additional test set.

What you should know
Why you can't use training-set error to estimate the quality of your learning algorithm on your data.
Why you can't use training-set error to choose the learning algorithm.
Test-set cross-validation.
Leave-one-out cross-validation.
k-fold cross-validation.
Feature selection methods.
CV for classification, regression and densities.
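The "very serious warning" above is easy to reproduce in simulation. The sketch below (an invented experiment, not from the slides) builds a dataset of 50 records and 1000 attributes whose output is purely random, scores 1000 single-attribute linear regressions on a held-out selection set, and then checks the winner on an additional held-out test set, as the slides recommend.

```python
import numpy as np

rng = np.random.default_rng(0)
n_records, n_attributes = 50, 1000
X = rng.normal(size=(n_records, n_attributes))
y = rng.normal(size=n_records)                 # output is purely random: there is nothing to find

# One split for selection, plus an ADDITIONAL held-out test set for the final check.
idx = rng.permutation(n_records)
select_test, final_test, train = idx[:10], idx[10:20], idx[20:]

def single_attr_mse(j, train_idx, test_idx):
    """Fit y ~ beta_0 + beta_1 * X[:, j] on train_idx, return MSE on test_idx."""
    Z = np.column_stack([np.ones(len(train_idx)), X[train_idx, j]])
    beta, *_ = np.linalg.lstsq(Z, y[train_idx], rcond=None)
    pred = beta[0] + beta[1] * X[test_idx, j]
    return np.mean((pred - y[test_idx]) ** 2)

scores = [single_attr_mse(j, train, select_test) for j in range(n_attributes)]
best = int(np.argmin(scores))
print("best attribute's MSE on the selection set:", scores[best])         # looks deceptively good
print("same attribute's MSE on the extra held-out set:",
      single_attr_mse(best, train, final_test))                           # roughly back to the noise level
print("variance of y (the no-model baseline):", y.var())
```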