ORIE 4741: Learning with Big Messy Data. Train, Test, Validate

Professor Udell
Operations Research and Information Engineering, Cornell
December 4

Exercise

You run a hospital. A vendor wants to sell you a new machine learning system for diagnosing patients, and tells you the system works with 99% accuracy. If she's right, you might save millions of dollars and thousands of lives. If she's wrong, you might lose the same. What evidence would you ask for to verify this claim?

ideas:
- what data was the system trained on?
- how well does it perform on your hospital's data?
- how well does it do per disease?
- how many false positives and false negatives?

Generalization of supervised learning

- unknown target function $f : X \to Y$
- training examples $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
- hypothesis set $H$
- learning algorithm $A$
- final hypothesis $g : X \to Y$

how well will our classifier do on new data?

A cautionary tale

predict who will be approved for a credit card

- training examples $D$ = all previous applicants + decisions
- hypothesis set $H$ = rules of the form

$$h(\text{applicant}) = \begin{cases} 1 & \text{applicant name = Abigail Adams} \\ 1 & \text{applicant name = Barry Balakrishnan} \\ 1 & \text{applicant name = Carrie Chen} \\ -1 & \text{otherwise} \end{cases}$$

- learning algorithm $A$: pick the $h$ that performs best on $D$
- final hypothesis memorizes the data:

$$h(\text{applicant}) = \begin{cases} 1 & \text{applicant} = x_i \text{ and } y_i = 1 \text{ for some } i \\ -1 & \text{otherwise} \end{cases}$$

how well will our classifier do on new data?
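The memorizing rule can be sketched in a few lines of Python. This is only an illustration: the function name `fit_memorizer`, the training decisions, and the new applicant "Dana Diaz" are all hypothetical and not taken from the slides.

```python
def fit_memorizer(names, labels):
    """Return a rule h that predicts +1 only for names approved in the training data."""
    approved = {name for name, y in zip(names, labels) if y == 1}

    def h(name):
        return 1 if name in approved else -1

    return h


# Hypothetical past applicants and decisions (+1 approved, -1 denied).
train_names = ["Abigail Adams", "Barry Balakrishnan", "Carrie Chen"]
train_labels = [1, -1, 1]

h = fit_memorizer(train_names, train_labels)

# The rule reproduces every training decision exactly...
train_mistakes = sum(h(x) != y for x, y in zip(train_names, train_labels))

# ...but a name it has never seen is denied no matter who the applicant is.
new_prediction = h("Dana Diaz")
```

Perfect performance on $D$ here says nothing about new applicants, which is the point of the tale.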

Estimate performance

how well will our model do on new data?

- worst case: terrible
- usually: ?

ideas:
- (given infinite data) evaluate model on new data
- (given finite data) split data; fit on one part, evaluate on another part

Error metric

how to measure how well the model fits? define an error metric

$$\text{error} : Y \times Y \to \mathbf{R}$$

$\text{error}(y, y')$ = penalty for predicting $y'$ when the true label is $y$

examples:

Classification.
- Misclassification error.
  $$\text{error}_{\text{mis}}(y, y') = \mathbb{1}(y \neq y')$$
- Weighted misclassification error. If false positives are $\beta$ times worse than false negatives,
  $$\text{error}_{\beta}(y, y') = \beta \, \mathbb{1}(y = -1,\ y' = 1) + \mathbb{1}(y = 1,\ y' = -1)$$

Regression.
- Square error.
  $$\text{error}_{\text{sq}}(y, y') = (y - y')^2$$

choose an error metric that makes sense for your application!
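As a minimal sketch, the three example metrics can be written as plain Python functions. The argument order follows the slide (true label first, prediction second); the function and parameter names are my own, and labels are assumed to be +1 and -1 for classification.

```python
def error_mis(y, y_pred):
    """Misclassification error: 1 if the prediction differs from the true label."""
    return 1 if y != y_pred else 0


def error_weighted(y, y_pred, beta):
    """Weighted misclassification error with labels in {-1, +1}.

    A false positive (true -1, predicted +1) costs beta;
    a false negative (true +1, predicted -1) costs 1.
    """
    false_positive = (y == -1 and y_pred == 1)
    false_negative = (y == 1 and y_pred == -1)
    return beta * false_positive + false_negative


def error_sq(y, y_pred):
    """Square error for regression."""
    return (y - y_pred) ** 2
```

For example, `error_weighted(-1, 1, beta=5)` charges 5 for a false positive, while `error_weighted(1, -1, beta=5)` charges 1 for a false negative.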

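Train and test error

- partition data $D$ into training set $D_{\text{train}}$ and test set $D_{\text{test}}$, so
  $$D_{\text{train}} \cap D_{\text{test}} = \emptyset, \qquad D_{\text{train}} \cup D_{\text{test}} = D$$
- algorithm $A$ is only allowed to see the training set
- training error.
  $$E_{\text{train}}(g) = \frac{1}{|D_{\text{train}}|} \sum_{i \in D_{\text{train}}} \text{error}(y_i, g(x_i))$$
- test error.
  $$E_{\text{test}}(g) = \frac{1}{|D_{\text{test}}|} \sum_{i \in D_{\text{test}}} \text{error}(y_i, g(x_i))$$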
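Both quantities are the same average taken over different sets, which a short sketch makes concrete. The data, the threshold classifier `g`, and the helper name `average_error` below are hypothetical choices made only for illustration.

```python
def error_mis(y, y_pred):
    """Misclassification error: 1 if the prediction differs from the true label."""
    return 1 if y != y_pred else 0


def average_error(g, examples, error):
    """Mean of error(y_i, g(x_i)) over a list of (x_i, y_i) pairs."""
    return sum(error(y, g(x)) for x, y in examples) / len(examples)


def g(x):
    """A hypothetical fitted classifier: predict +1 when x is positive."""
    return 1 if x > 0 else -1


# Hypothetical, disjoint training and test sets of (x, y) pairs.
d_train = [(-2, -1), (-1, -1), (1, 1), (2, 1)]
d_test = [(-3, -1), (0.5, -1), (3, 1), (4, 1)]

e_train = average_error(g, d_train, error_mis)  # no training mistakes
e_test = average_error(g, d_test, error_mis)    # one test mistake out of four
```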

Generalization and Overfitting

- goal of model is not to predict well on $D$
- goal of model is to predict well on new data

depending on its training set error and test set error, we say the model:

                           low test set error    high test set error
low training set error     generalizes           overfits
high training set error    ?!?!                  underfits

Q: How to fix underfitting?
A: Use a more complex model, or add new features.

Q: How to fix overfitting?
A: Use a less complex model, remove features, or find more data.

how to choose train and test sets?

- at random (e.g., toss a random coin with probability p)
- more data in the training set improves the model fit
- more data in the test set helps determine how well the model will work out of sample
- rule of thumb: put about 20% of the data into the test set
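The coin-toss split can be sketched as follows, assuming each example goes to the test set independently with probability p = 0.2. The function name `coin_toss_split` and the fixed seed are illustrative choices, not part of the slides.

```python
import random


def coin_toss_split(examples, p=0.2, seed=0):
    """Send each example to the test set with probability p, else to the training set."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    train, test = [], []
    for example in examples:
        if rng.random() < p:
            test.append(example)
        else:
            train.append(example)
    return train, test


data = list(range(1000))  # stand-in for 1000 examples
train, test = coin_toss_split(data, p=0.2)
```

With independent coin tosses the test set size is itself random, so it is only about 20% of the data. Shuffling the examples and cutting at a fixed index is a common alternative when an exact size is wanted.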

Validation

training set error improves with model complexity. how to decide which model to use?

a simple and effective procedure:
- split data into training set $D_{\text{train}}$ and test set $D_{\text{test}}$
- pick $m$ different interesting model classes, e.g., different $\phi$s: $\phi_1, \phi_2, \ldots, \phi_m$
- fit ("train") models on training set $D_{\text{train}}$: get one model $h : X \to Y$ for each $\phi$, and set $H = \{h_1, h_2, \ldots, h_m\}$
- compute error of each model on test set $D_{\text{test}}$ and choose the lowest:
  $$g = \operatorname*{argmin}_{h \in H} E_{D_{\text{test}}}(h)$$
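A minimal sketch of this procedure with m = 2 model classes, a constant predictor and a one-dimensional least-squares line, might look like the following. The data, the fixed split, and every function name here are assumptions made for the sake of a small reproducible example.

```python
def fit_constant(train):
    """Model class 1: predict the mean training label everywhere."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Model class 2: least-squares line y = intercept + slope * x."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_sq_error(h, examples):
    """Average square error of model h on a list of (x, y) pairs."""
    return sum((y - h(x)) ** 2 for x, y in examples) / len(examples)


# Hypothetical noiseless data on the line y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(10)]
d_test = data[::5]  # fixed split for reproducibility; a random split is more typical
d_train = [example for example in data if example not in d_test]

fitters = {"constant": fit_constant, "line": fit_line}
models = {name: fit(d_train) for name, fit in fitters.items()}          # H = {h_1, h_2}
test_errors = {name: mean_sq_error(h, d_test) for name, h in models.items()}

best_name = min(test_errors, key=test_errors.get)  # argmin over H of the test error
g = models[best_name]
```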

Demo: Linear models (validation)

Cross-Validation

a simple and effective procedure:
- pick a bunch of interesting model classes (e.g., different $\phi$s)
- for each possible split of the data into training set $D_{\text{train}}$ and test set $D_{\text{test}}$:
  - fit ("train") models on training set $D_{\text{train}}$
  - compute error of each model on test set $D_{\text{test}}$
- estimate error as the average of the error on each $D_{\text{test}}$
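For a single model class, the averaging step can be sketched as below. The list of splits is supplied by the caller, and the constant-predictor model, the data, and the function names are hypothetical.

```python
def error_sq(y, y_pred):
    """Square error."""
    return (y - y_pred) ** 2


def fit_constant(train):
    """A deliberately simple model class: predict the mean training label."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def cross_validated_error(fit, error, splits):
    """Average, over (train, test) splits, of the test error of the model fit on train."""
    split_errors = []
    for train, test in splits:
        h = fit(train)  # the model only ever sees the training part of the split
        split_errors.append(sum(error(y, h(x)) for x, y in test) / len(test))
    return sum(split_errors) / len(split_errors)


# Hypothetical data and two complementary splits.
data = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0)]
splits = [(data[:2], data[2:]), (data[2:], data[:2])]

cv_error = cross_validated_error(fit_constant, error_sq, splits)
```

Repeating the call once per model class and comparing the resulting averages gives the cross-validated analogue of choosing the model with the lowest test error.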

How to pick splits

how to pick splits?

- Leave-one-out cross-validation: $D_{\text{test}}$ has one example, $D_{\text{train}}$ has all the rest
  - advantage: accurate
  - disadvantage: slow
- n-fold cross-validation: $D_{\text{test}}$ has $1/n$ of the examples, $D_{\text{train}}$ has all the rest (usually, n = 5 or 10)
  - decently accurate, not too slow
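One way to generate these splits, sketched under the assumption that examples are referred to by index and shuffled before being dealt into folds. The function names and the seed are illustrative.

```python
import random


def n_fold_splits(num_examples, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs; each test fold holds about 1/n_folds of the data."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds folds of nearly equal size.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def leave_one_out_splits(num_examples):
    """Leave-one-out is the special case with one fold per example."""
    return n_fold_splits(num_examples, num_examples)


five_fold = list(n_fold_splits(20, 5))
loo = list(leave_one_out_splits(20))
```

With 20 examples, 5-fold cross-validation fits 5 models and leave-one-out fits 20, which illustrates why leave-one-out is the slower option.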

Recap: Validation

- we care how well the model performs on new data
- to simulate new data, split $D$ into train and test sets
- evaluate error on each via an error metric
- if training set error is high, we say the model fits poorly
  - solution: use a more complex model, or add new features
- if test set error is much higher than training set error, we say the model overfits
  - solution: make a less complex model, or remove features
