Methods and Criteria for Model Selection. CS57300 Data Mining, Fall 2016. Instructor: Bruno Ribeiro

Goal
- Introduce classifier evaluation criteria
- Introduce the bias vs. variance duality
- Model assessment
- Model selection

In the Training Phase There Is Often a Better-Looking Model
[Diagram: the data are split into Train and Test; the held-out Test split is used for model assessment, i.e., estimating the real-world error.]

Testing Classification Accuracy

Classification Error
- True positive (TP): positive prediction that is correct
- True negative (TN): negative prediction that is correct
- False positive (FP): positive prediction that is incorrect
- False negative (FN): negative prediction that is incorrect
- Accuracy = (TP + TN) / (TP + TN + FP + FN): % of predictions that are correct
- Misclassification = (FP + FN) / (TP + TN + FP + FN): % of predictions that are incorrect
- Recall (Sensitivity): R = TP / (TP + FN): % of positive instances that are predicted positive
- Precision: P = TP / (TP + FP): % of positive predictions that are correct
- Specificity = TN / (TN + FP): % of negative instances that are predicted negative
- F1 = 2PR / (P + R): harmonic mean of precision and recall
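As an illustration (not from the slides), here is a minimal Python sketch that computes these metrics from raw confusion-matrix counts; the function name and example counts are invented.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics defined above from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    misclassification = (fp + fn) / total
    recall = tp / (tp + fn)          # sensitivity: fraction of true positives recovered
    precision = tp / (tp + fp)       # fraction of positive predictions that are correct
    specificity = tn / (tn + fp)     # fraction of true negatives recovered
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "misclassification": misclassification,
            "recall": recall, "precision": precision,
            "specificity": specificity, "f1": f1}

# Hypothetical counts for illustration only
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```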

Precision and Recall

More score functions
- Absolute loss: L(t, f(x; θ̂)) = |t − f(x; θ̂)|
- Squared loss: L(t, f(x; θ̂)) = (t − f(x; θ̂))²
- Likelihood / conditional likelihood
- Area under the ROC curve
  - The Receiver Operating Characteristic (ROC) curve applies to any classifier for which a classification threshold can be defined (e.g., logistic regression, SVMs [vary the hyperplane offset b], ...)
  - It plots the true positive rate against the false positive rate for different classification thresholds

- The ROC curve evaluates performance over varying costs and class distributions
- It can be summarized with the area under the curve (AUC): an AUC of 0.5 is random, an AUC of 1.0 is perfect

E.g., measuring the performance of logistic regression predicting labels {+, −}:
[Table: test instances sorted by predicted probability P(Y) (0.94, 0.84, 0.67, 0.58, 0.51, 0.42, 0.16, 0.10, 0.07) together with their true classes (4 positives, 5 negatives); as the classification threshold is lowered, more instances are predicted positive, producing successive ROC points such as (FPR = 0/5, TPR = 1/4), (FPR = 1/5, TPR = 1/4), and (FPR = 1/5, TPR = 2/4).]

ROC curve
[Figure: ROC curve, plotting the true positive rate (TPR) against the false positive rate (FPR).]
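To make the threshold sweep concrete, here is a small Python sketch (not from the slides) that turns predicted probabilities and true labels into ROC points and an AUC estimate; the scores and labels roughly mirror the worked example above, and the helper name is invented.

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep the classification threshold over the sorted scores and
    return the resulting (FPR, TPR) points of the ROC curve."""
    order = np.argsort(-np.asarray(scores))          # sort by decreasing score
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tpr, fpr, tp, fp = [0.0], [0.0], 0, 0
    for y in labels:                                 # lower the threshold one instance at a time
        if y == 1:
            tp += 1
        else:
            fp += 1
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr

# Hypothetical scores/labels, roughly matching the worked example above
scores = [0.94, 0.84, 0.67, 0.58, 0.51, 0.42, 0.16, 0.10, 0.07]
labels = [1,    0,    1,    0,    1,    1,    0,    0,    0]
fpr, tpr = roc_points(scores, labels)
auc = np.trapz(tpr, fpr)                             # area under the curve
print(list(zip(fpr, tpr)), auc)
```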

ROC Curves Should Not Look Like This
- If your ROC curve looks like this, you are likely doing something wrong:
  - Either you have too few discrete features,
  - Or you are predicting labels 0 and 1 but gave the software library the hard predictions {0, 1} instead of the probability in [0, 1] that the label is 1.

Issues with conventional score functions
- They assume the errors for all individuals are equally weighted
  - We may want to weight recent instances more heavily
  - We may want to include information about the reliability of sets of measurements
- They assume the errors for all contexts are equally weighted
  - We may want to weight false positives and false negatives differently
  - There may be costs associated with acting on particular classifications (e.g., marketing to individuals)

Examples
- Loan decisions: the cost of lending to a defaulter is much greater than the lost business of refusing a loan to a non-defaulter
- Oil-slick detection: the cost of failing to detect an environmental disaster is far greater than the cost of a false alarm
- Promotional mailing: the cost of sending junk mail to a household that doesn't respond is far less consequential than the lost business of not sending it to a household that would have responded

Cost-sensitive models
- Define a score function based on a cost matrix
- If ỹ is the predicted class and y is the true class, we need to define a matrix of costs C(ỹ, y)
- C(ỹ, y) reflects the severity of classifying an instance with true class y as class ỹ
[Example 2×2 cost matrix with entries +$10, −$1, −$5, and 0.]
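For illustration (not from the slides), here is a minimal sketch of cost-sensitive prediction: instead of predicting the most probable class, predict the class with the lowest expected cost under an assumed cost matrix C(predicted, true). The matrix values and function name are invented.

```python
import numpy as np

# Hypothetical cost matrix C[predicted, true]; lower is better.
# Rows: predicted class (0 = negative, 1 = positive); columns: true class.
C = np.array([[0.0,  5.0],    # predict negative: no cost if truly negative, cost 5 if we miss a positive
              [1.0, -10.0]])  # predict positive: cost 1 for a false alarm, gain 10 (negative cost) if correct

def cost_sensitive_predict(p_positive, cost_matrix):
    """Choose the class that minimizes the expected cost, given P(y = positive | x)."""
    p = np.array([1.0 - p_positive, p_positive])   # [P(true = neg), P(true = pos)]
    expected_cost = cost_matrix @ p                # expected cost of predicting neg / pos
    return int(np.argmin(expected_cost))

# With these costs the decision threshold is no longer 0.5:
for p in [0.05, 0.1, 0.3, 0.7]:
    print(p, cost_sensitive_predict(p, C))
```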

Assessing Error with Training Data
We don't have the test data! Can we just use the training data to evaluate our classifier?

The Problem of Overfitting
[Figure: prediction error vs. model complexity (degrees of freedom). Training error decreases monotonically with complexity, while the real-world (test) error is U-shaped: high bias / low variance at low complexity, low bias / high variance at high complexity. From The Elements of Statistical Learning, Hastie, Tibshirani, and Friedman.]

Linear Regression
[Figure: a straight-line fit to n observations (x, y).]

Cubic Regression
[Figure: a cubic fit to the same n observations; a better fit.]

Degree n−1 Polynomial
[Figure: a degree n−1 polynomial passing through all n observations; even better?]

But while the term "overfitting" is widely used, be careful:
a) Overfitting is not just about the number of parameters
b) Overfitting is an easy-to-understand concept, but reality is more complex
c) Overfitting is also known as the variance of the generalization error

Overfitting, hypotheses, and priors
- Consider a logistic regression classifier.
- For a dataset {φ_n, t_n}, where φ_n is the transformed feature vector of item n and t_n ∈ {0, 1} is the class of item n, for n = 1, ..., N data points, with t = (t_1, ..., t_N)ᵀ, y_n = σ(a_n), and a_n = wᵀφ_n.
- If N is small but the dimension of w (and of φ_n) is large, we will likely overfit the data.
- But what happens if w ~ Normal(0, Σ)?
- The prior effectively reduces the number of hypotheses we are encouraged to test (w is no longer "free"); note that the number of model parameters stays the same.
- Overfitting is just a symptom; the real underlying cause is the testing of multiple hypotheses.
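A Gaussian prior on w corresponds to L2 (ridge) regularization of logistic regression. As a hedged illustration (not from the slides), the sketch below compares an essentially unregularized fit with a strongly regularized one on a small, high-dimensional synthetic dataset; the dataset, parameter values, and use of scikit-learn are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
N, D = 40, 200                               # few data points, many features: easy to overfit
X = rng.randn(N, D)
w_true = np.zeros(D); w_true[:5] = 2.0       # only 5 features actually matter
y = (X @ w_true + 0.5 * rng.randn(N) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Essentially no prior (huge C) vs. a strong Gaussian prior (small C = strong L2 penalty)
for C in [1e6, 0.1]:
    clf = LogisticRegression(C=C, penalty="l2", max_iter=5000).fit(X_tr, y_tr)
    print(f"C={C}: train acc={clf.score(X_tr, y_tr):.2f}, test acc={clf.score(X_te, y_te):.2f}")
```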

Multiple Hypotheses View of Classification
- We will not cover the multiple-hypothesis view of classification; just a few pointers (the topic is covered in the Machine Learning class)
- The problem is that of Empirical Risk Minimization, VC-dimension, and PAC learning
  - We define a hypothesis class H
  - Data points {(x⁽ⁱ⁾, y⁽ⁱ⁾)}, i = 1, ..., m, with empirical error for a hypothesis h ∈ H:
    ε̂(h) = (1/m) Σ_{i=1}^{m} 1{h(x⁽ⁱ⁾) ≠ y⁽ⁱ⁾}
  - Goal: find the hypothesis that minimizes the empirical error: ĥ = argmin_{h ∈ H} ε̂(h)
  - There are distinct theorems for finite and infinite hypothesis classes

Model Selection and Assessment
- Model selection: estimating the performance of different models in order to select the best one
- Model assessment: having chosen a model, estimating its prediction error on new data

Model Selection and Assessment
- In data-rich scenarios, split the data into Train / Validation / Test: the validation set is used for model selection, the test set for model assessment
- In data-poor scenarios, either:
  - approximate the validation step analytically: AIC, BIC, MDL
  - or estimate it via sample re-use: cross-validation (leave-one-out, K-fold) and the bootstrap

Learning Curves (both data-rich and data-poor)
- A learning curve shows how the method performs with more training data:
  (a) over validation data
  (b) over training data
  Score: F1 = 2PR / (P + R)
- One of the most useful methods to debug classifiers:
  - Checking for overfitting? Look at the gap between the training and validation curves.
  - Checking for code bugs or a bad local optimum? Training-data accuracy must decrease with more data.
[Figure: training-data and validation-data scores plotted against the amount of training data.]
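As a hedged illustration (not part of the slides), scikit-learn's learning_curve utility can produce exactly this diagnostic; the estimator, dataset, and scoring choice below are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Train/validation F1 as a function of the number of training examples
sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="f1")

for n, tr, va in zip(sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"n={n:4d}  train F1={tr:.3f}  validation F1={va:.3f}")
```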

Data-poor Assessment of Classification Error

Model: t = f(x₀; θ*) + ε, where f(x₀; θ*) is the true target, f(x₀; θ̂) is the predicted target, and ε is random noise with E[ε] = 0 and Var(ε) = σ².
For squared-error loss with additive noise (add and subtract E[f(x₀; θ̂)]):
  Err(x₀) = E[(t − f(x₀; θ̂))² | X = x₀]
          = σ² + (E[f(x₀; θ̂)] − f(x₀; θ*))² + E[(f(x₀; θ̂) − E[f(x₀; θ̂)])²]
          = σ² + Bias²(f(x₀; θ̂)) + Var(f(x₀; θ̂))
where σ² is the true (irreducible) noise, the squared bias is the deviation of the average estimate from the true function's mean, and the variance is the expected squared deviation of the estimate around its own mean.

- Bias
  - Often related to the size of the model space: the smaller the model space M, the less likely the true model lies in M
  - More complex models tend to have lower bias
- Variance
  - Often related to the size of the dataset and the quality of the information
  - When the data is large enough to estimate the true parameters (θ*), we get lower variance
- With big data, simple models can perform surprisingly well due to their lower variance

  Err(x₀) = E[(t − f(x₀; θ̂))² | X = x₀]
          = σ² + (E[f(x₀; θ̂)] − f(x₀; θ*))² + E[(f(x₀; θ̂) − E[f(x₀; θ̂)])²]
          = σ² + Bias²(f(x₀; θ̂)) + Var(f(x₀; θ̂))
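As a hedged illustration (not from the slides), the following Monte Carlo sketch estimates the squared bias and variance of polynomial fits of different degrees at a single point x₀; the "true" function, noise level, and degrees are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.RandomState(0)
f_true = lambda x: np.sin(2 * np.pi * x)     # assumed "true" function f(x; theta*)
sigma, n, x0, reps = 0.3, 30, 0.5, 2000

for degree in [1, 3, 9]:
    preds = []
    for _ in range(reps):                    # repeatedly draw training sets and refit
        x = rng.rand(n)
        t = f_true(x) + sigma * rng.randn(n)
        coeffs = np.polyfit(x, t, degree)
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - f_true(x0)) ** 2
    var = preds.var()
    print(f"degree {degree}: bias^2={bias2:.4f}  variance={var:.4f}  "
          f"bias^2+var+sigma^2={bias2 + var + sigma**2:.4f}")
```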

- Hypothesis space
  - The larger the search space, the smaller the bias
  - The larger the search space, the larger the variance
- Problem: we are looking for the BEST MODEL (a point estimate)
- By testing too many hypotheses, the best-looking hypothesis is likely not a good one
[Diagram: our search space containing nested sets of models with error < ε, < 2ε, and < 3ε.]

Optimism of the Training Error Rate
- Typically, training error rate < true error, because the same data is used both to fit the method and to assess its error:
  Err_train = (1/N) Σ_{i=1}^{N} L(t(i), f(x(i); θ̂))  <  Err(X) = E[L(t, f(x; θ̂))]
  where L is the loss function and t the target; the training error is overly optimistic.

Estimating Test Error
- Can we estimate the discrepancy between Err(x₀) and Err(X)?
- Let D be the old observations and D_new the new observations.
- Suppose we observed new values of the label t(i) for the same inputs x(i). The in-sample error Err_in is the expectation over N new responses at each x_i:
  E_D[Err_in] = (1/N) Σ_{i=1}^{N} E_D E_{D_new}[ L(t_new(i), f(x(i); θ̂)) ]
              = E_D[Err_train] + (2/N) Σ_{i=1}^{N} Cov(t̂_i, t_i)
  where θ̂ is trained on the old observations D and the covariance term is the adjustment for the optimism of the training error.

Measuring Optimism
If the model is t = f(x; θ) + ε with E[ε] = 0, Var(ε) = σ², and |θ| = d, and if f is linear, f(x; θ) = xᵀθ, then
  E_D[Err_in] = E_D[Err_train] + 2 (d/N) σ²
- Optimism grows linearly with the model dimension d
- Optimism decreases as the training sample size N increases

Ways to Estimate Prediction Error
- In-sample error estimates: AIC, BIC, MDL
- Extra-sample error estimates: cross-validation (leave-one-out, K-fold), bootstrap

Estimates of In-Sample Prediction Error
- General form of the in-sample estimate: Êrr_in = Err_train + ω̂, where ω̂ is an estimate of the optimism.
- For additive error: Êrr_in = Err_train + 2 (d/N) σ̂², known as Mallows' C_p statistic.

Akaike Information Criterion (AIC)
  AIC = −(2/N) ln P[t = f(x₀; θ̂)] + 2 d/N
- AIC does not assume the model is correct
- Better for a small number of samples
- May not converge to the correct model (as N → ∞, even when model bias → 0)

Bayesian Information Criterion (BIC)
  BIC = −(2/N) ln P[t = f(x₀; θ̂)] + (d/N) ln N
- BIC assumes the model is correct
- Converges to the correct parameterization (θ*) of the model (as N → ∞, assuming model bias = 0)
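As a hedged illustration (not from the slides), the helpers below compute AIC and BIC exactly in the per-sample form written on this slide, from a model's total log-likelihood, number of parameters d, and sample size N; the function names and numbers are invented.

```python
import numpy as np

def aic(log_likelihood, d, N):
    """AIC = -(2/N) * log-likelihood + 2 d / N (per-sample form above)."""
    return -2.0 * log_likelihood / N + 2.0 * d / N

def bic(log_likelihood, d, N):
    """BIC = -(2/N) * log-likelihood + (d/N) * ln N (per-sample form above)."""
    return -2.0 * log_likelihood / N + d * np.log(N) / N

# Hypothetical comparison: a 3-parameter vs. a 10-parameter model on N = 500 points.
N = 500
for d, ll in [(3, -620.0), (10, -610.0)]:
    print(f"d={d}: AIC={aic(ll, d, N):.4f}  BIC={bic(ll, d, N):.4f}")
```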

MDL (Minimum Description Length)
- Find the hypothesis (model) that minimizes H(M) + H(D | M), where
  - H(M) is the length, in bits, of the description of model M
  - H(D | M) is the length, in bits, of the description of the data encoded by model M
- If the model is probabilistic:
  length = −ln P[t = f(x₀; θ̂) | M] − ln P[θ̂ | M]
  where the first term is the length of transmitting the discrepancy given the model, under optimal coding for that model, and the second term is the description of the model under optimal coding.
- MDL principle: choose the model with the minimum description length.
- This is equivalent to maximizing the posterior: P[t = f(x₀; θ̂) | M] · P[θ̂ | M]

BIC Is One Example of a Bayesian Approach
- While we will not cover Bayesian methods in this course, they tend to ameliorate some of these point-estimate issues
- They are, however, harder to compute

- A Bayesian approach does not look for the best hypothesis: it is not interested in point estimates of the model parameters
- It seeks to take variance into account in model evaluation:
  Err(x₀) = Σ_θ E[(t − f(x₀; θ))² | X = x₀, θ] · P[θ | Data]
  (a sum, or integral, over the posterior)
- If we test too many hypotheses and do not have enough data, P[θ | Data] is less concentrated (higher entropy), leading to larger errors

Data-rich Assessment of Classification Error

Estimation of Extra-Sample Error
- Cross-validation
- Bootstrap
[Diagram: data split into Train / Validation / Test; the validation set is used for model selection, the test set for model assessment.]

Estimation of Extra-Sample Error via Cross-Validation
- Goal: estimate the test error of a supervised learning method
- Algorithm:
  1. Split the data into two parts (training and validation)
  2. Train the method on the training part
  3. Compute the error on the validation part

Wrong Way to Perform Cross-Validation
- Using cross-validation to estimate too many model decisions or to select too many features
- Using cross-validation to hunt for too many new models
  - This is the same problem we have when hunting for p-values (too many hypotheses)
  - "The human regression machine" is the same as multiple hypothesis testing: [Diagram: propose a new model → check CV → CV is bad → propose another model → ...]

In Data-Poor Scenarios We Should Use K-Fold Cross-Validation

[Diagram: the data is split into K = 5 folds; in each round, one fold serves as the validation fold and the remaining folds are used for training.]
K-fold cross-validation error:
  CV_err(K) = (1/K) Σ_{k=1}^{K} (1/|S_k|) Σ_{i ∈ S_k} L(t(i), t̂⁽⁻ᵏ⁾(i))
where S_k is the k-th validation fold, L is the loss function, t(i) is the target, and t̂⁽⁻ᵏ⁾(i) is the estimate from the model trained on the data without the k-th fold.
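As a hedged illustration (not from the slides), a from-scratch implementation of exactly this formula with 0/1 loss; the classifier and dataset are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

def kfold_cv_error(X, y, make_model, K=5, seed=0):
    """CV_err(K) = (1/K) * sum_k (1/|S_k|) * sum_{i in S_k} L(t(i), t_hat^(-k)(i)),
    using 0/1 loss; S_k is the k-th validation fold."""
    idx = np.random.RandomState(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    fold_errors = []
    for k in range(K):
        val = folds[k]                                   # k-th validation fold S_k
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = make_model().fit(X[train], y[train])     # trained without the k-th fold
        t_hat = model.predict(X[val])
        fold_errors.append(np.mean(t_hat != y[val]))     # (1/|S_k|) * sum of 0/1 losses
    return np.mean(fold_errors)

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
print(kfold_cv_error(X, y, lambda: LogisticRegression(max_iter=1000), K=5))
```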

[Diagram: the choice of K, ranging from small K to large K (leave-one-out), trades off computation (which increases with K) against the bias and variance of the CV error estimate (bias decreases with larger K, variance decreases with smaller K).]

Cross-Validation: Choosing K
[Figure: hypothetical learning curve, 1 − Err as a function of the size of the training set (0 to 200).]
Popular choices for K: 5, 10, and N (leave-one-out).

Evaluating Classification Algorithms A and B
- Use k-fold cross-validation to get k estimates of error for each of algorithms A and B
[Diagram: the dataset (columns Y, X1, X2) is split into k = 6 train/validation fold pairs; training on Train1..Train6 and evaluating on Valid.1..Valid.6 yields errors A.1..A.6 for algorithm A and B.1..B.6 for algorithm B.]
- The set of errors estimated over the held-out folds provides an empirical estimate of the sampling distribution
- The mean is an estimate of the expected error
- Is the difference in mean error between A and B significant?

Assessing Significance
- Use a paired t-test to assess whether the two distributions of errors are statistically different from each other
[Diagram: the errors are paired fold by fold: (Error A.1, Error B.1), (Error A.2, Error B.2), ..., (Error A.5, Error B.5).]
- The test takes into account both the difference in means and the variability of the scores

Paired t-test
- Instead of comparing the difference in means between populations 1 and 2,
  (1/N) Σ_{i=1}^{N} X_i⁽¹⁾ − (1/N) Σ_{i=1}^{N} X_i⁽²⁾,
- a paired t-test compares the mean difference
  d̄ = (1/N) Σ_{i=1}^{N} (X_i⁽¹⁾ − X_i⁽²⁾)
- The real effect is in estimating the standard error of the mean difference:
  SE(d̄) = s_d / √N, where s_d = sqrt( (1/N) Σ_{i=1}^{N} (X_i⁽¹⁾ − X_i⁽²⁾ − d̄)² )
- The t statistic has N − 1 degrees of freedom and is given by
  t = d̄ / SE(d̄)
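As a hedged illustration (not from the slides), the sketch below computes the paired t statistic as defined above; the per-fold error values are made-up numbers.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold error estimates for algorithms A and B (k = 6 folds)
err_A = np.array([0.12, 0.15, 0.11, 0.14, 0.13, 0.16])
err_B = np.array([0.10, 0.14, 0.09, 0.12, 0.12, 0.13])

d = err_A - err_B
N = len(d)
d_bar = d.mean()
s_d = np.sqrt(np.mean((d - d_bar) ** 2))          # slide's definition (1/N inside the sqrt)
se = s_d / np.sqrt(N)
t_stat = d_bar / se
p_value = 2 * stats.t.sf(abs(t_stat), df=N - 1)   # two-sided p-value, N-1 degrees of freedom
print(t_stat, p_value)

# Note: scipy.stats.ttest_rel(err_A, err_B) runs the same test but uses the
# (N-1)-denominator sample standard deviation, so its t value differs slightly.
```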

Bootstrapping

Bootstrapping
- The bootstrap is a powerful statistical tool to quantify the uncertainty associated with a given estimator or statistical learning method
- E.g., it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient
- Instead of sampling new datasets from the unknown population distribution, we resample from the training data (the empirical population distribution)

Why "Bootstrapping"?
- The term bootstrap derives from the phrase "to pull oneself up by one's bootstraps", likely based on an episode in the eighteenth-century novel The Surprising Adventures of Baron Munchausen by Rudolph Erich Raspe: "The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps."
- It is not the same as the term "bootstrap" used in computer science, meaning to boot a computer from a set of core instructions, though the meaning is similar.

Bootstrap: Main Concept
[Diagram, adapted from The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman):
- Step 1: From the training sample Z = (z₁, z₂, ..., z_N), draw N samples with replacement, B times, producing bootstrap samples Z*¹, Z*², ..., Z*ᴮ.
- Step 2: Learn the model / statistic S(·) on each bootstrap sample, yielding the bootstrap replications S(Z*¹), S(Z*²), ..., S(Z*ᴮ).
- Step 3: Compute the variance of the estimated parameters and of the error over validation data.]
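As a hedged illustration (not from the slides), a minimal bootstrap sketch following the three steps above to estimate the uncertainty of a statistic; here S(·) is simply the sample mean, and the data are made up.

```python
import numpy as np

rng = np.random.RandomState(0)
Z = rng.exponential(scale=2.0, size=100)     # training sample Z = (z_1, ..., z_N)
N, B = len(Z), 1000

def S(sample):
    """The statistic / fitted model of interest; the sample mean for illustration."""
    return sample.mean()

# Steps 1 and 2: draw B bootstrap samples of size N with replacement and compute S on each
replications = np.array([S(Z[rng.randint(0, N, size=N)]) for _ in range(B)])

# Step 3: the spread of the bootstrap replications estimates the uncertainty of S(Z)
print("estimate S(Z):", S(Z))
print("bootstrap standard error:", replications.std(ddof=1))
print("95% percentile interval:", np.percentile(replications, [2.5, 97.5]))
```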

Bootstrap Validation Data
- There are two ways to use validation data:
  1. The validation set consists of whatever data points are not used in training
     - Issue: the validation set size is not consistent across bootstrap replications, so the uncertainty of the validation error is not consistent either
     - Advantage: keeps all of the training data potentially available for training the model
  2. The validation set is split off from the training data before bootstrapping
     - Issue: we must split validation data from training data, reducing the amount of training data
     - Advantage: the validation set size is kept consistent across bootstrap replications