Methods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Methods and Criteria for Model Selection CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro

Goal } Introduce classifier evaluation criteria } Introduce Bias x Variance duality } Model Assessment } Model Selection 2015 Bruno Ribeiro

In Training Phase There is Often a Better Model Train Test Model assessment Real-world Error 2015 Bruno Ribeiro

Testing Classification Accuracy 4

Classification Error True positive (TP): positive prediction that is correct True negative (TN): negative prediction that is correct False positive (FP): positive prediction that is incorrect False negative (FN): negative prediction that is incorrect } Accuracy = (TP + TN) / (TP + TN + FP + FN) % predictions that are correct } Misclassification = (FP+FN) / (TP+TN+FP+FN) % predictions that are incorrect } Recall (Sensitivity): R = TP / (TP + FN) % positive instances that are predicted positive } Precision: P = TP / (TP + FP) % positive predictions that are correct } Specificity = TN / (TN + FP) % negative instances that are predicted negative } F1 = 2 (P R) / (P+R) harmonic mean of precision and recall

Precision and Recall 6

More score functions } Absolute loss: } Squared loss: } Likelihood/conditional likelihood: } Area under the ROC curve Receiver Operating Characteristic (ROC) curve For classifier that a classification threshold can be defined (e.g. Logistic Regression, SVMs [vary hyperplane offset b],..) Plots the true positive rate against the false positive rate for different classification thresholds

} Evaluates performance over varying costs and class distributions Can summarize with area under the curve (AUC) AUC of 0.5 is random AUC of 1.0 is perfect

E.g. Measuring the performance of Logistic Regression predicting labels {+,-} P(Y) True class P(Y) True class Predict class P(Y) True class Predict class P(Y) True class Predict class 0.51 0.94 + 0.94 + + 0.94 + + 0.94 + + 0.07 0.84-0.84 - - 0.84 - + 0.84 - + 0.84 0.67 + - 0.94 0.58 + - 0.67 0.51 + 0.58 0.42 + - 0.67 + - 0.58 - - 0.51 + - 0.67 + - 0.58 - - 0.51 + - 0.67 + + 0.58 - - 0.51 + - 0.16-0.42 + - 0.42 + - 0.42 + - 0.42 0.1 + - 0.16 - - 0.16 - - 0.16 - - 0.16 0.07-0.1 - - 0.1 - - 0.1 - - 0.07 - - 0.07 - - 0.07 - - FPR = 0/5 TPR = 1/4 FPR = 1/5 TPR = 1/4 FPR = 1/5 TPR = 2/4

ROC curve True positive rate (TPR) False positive rate (FPR)

ROC Curves Should not Look Like This } If your ROC curve looks like this, you are likely doing something wrong Either you have too few discrete features Or you are predicting labels 0 and 1 but gave software library the predictions { 0, 1 } instead of the probability [0,1] the label is 1 11

Issues with conventional score functions } Assumes errors for all individuals are equally weighted May want to weight recent instances more heavily May want to include information about reliability of sets of measurements } Assumes errors for all contexts are equally weighted May want to weigh false positive and false negatives differently May be costs associated with acting on particular classifications (e.g., marketing to individuals)

Examples } Loan decisions Cost of lending to a defaulter is much greater than the lost business of refusing loan to a non-defaulter } Oil-slick detection Cost of failing to detect an environmental disaster is far less than the cost of a false alarm } Promotional mailing Cost of sending junk mail to a household that doesn t respond is far less consequential than the lost business of not sending it to a household that would have responded

Cost-sensitive models } Define a score function based on a cost matrix } If ~y is the predicted class and y is the true class, then need to define a matrix of costs C(~y,y) } Reflects the severity of classifying an instance with true class y to class ~y +$10 -$1 -$5 0

Assessing Error with Training Data We don t have the test data! Train Test Can we just use training data to evaluate our classifier? 15

The Problem of Overfitting Prediction Error 0.0 0.2 0.4 0.6 0.8 1.0 1.2 High Bias Low Variance Low Bias High Variance Real-World Error Training Error 0 5 10 15 20 25 30 35 2015 Bruno Ribeiro Model Complexity (df) The Elements of Statistical Learning Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie

Linear Regression n observations

Cubic Regression y Better n observations x

Degree n-1 Polynomial y Even better? n observations x

But while overfitting is widely used, be careful a) Overfitting is not just about number of parameters b) Overfitting is an easy-to-understand concept but reality is more complex c) Overfitting is also known as the variance of the generalization error 20

Overfitting, hypotheses, and priors } Consider Logistic regression classifier. } For a dataset {φ n,t n } where ϕ n is transformed feature vector of item n, t n {0,1}is hood func the class of item n, for n=1,,n data points where t =(t 1,...,t N ) T and y n = σ(a n ) and a n = w T φ n.. } If N on is by small taking but the dimension neg t toof w, ϕ we n is large, obtainwe will likely overfit data } But what happens if w ~ Normal(0, Σ)? } Prior effectively reduces the number of hypothesis we are encouraged to test (w is no longer free ), note that number of model parameters stay the same. Overfitting is just a symptom Real underlying cause is the test of multiple hypotheses 21

Multiple Hypotheses view of Classification } We will not cover the multiple hypothesis view of classification Just a few pointers (topic covered in the Machine Learning class) } Problem is that of Empirical Risk Minimization, VC-dimension and PAC learning We define a hypothesis class H Data points {(x (i),y (i) )} i=1,,m with empirical error for hypothesis h H ˆε(h) = 1 m m 1{h(x (i) ) y (i) } i=1 Goal: Find hypothesis that minimizes empirical error: ĥ = arg min ˆε(h) h H Distinct theorems for finite and infinite hypothesis classes 22

Model Selection and Assessment } Model Selection: Estimating performances of different models to select the best one } Model Assessment: Having chosen a model, estimating the prediction error on new data 2015 Bruno Ribeiro

Model Selection and Assessment } In data-rich scenarios split the data: Train Validation Test Model Selection Model assessment } In data-poor scenarios: approximate validation step analytically AIC, BIC, MDL via sample re-use cross- validation Leave-one-out K-fold bootstrap 2015 Bruno Ribeiro

Learning Curves (both data-rich and data-poor) } Learning curve shows how method performs with more training data : (a) Over validation data (b) Over training data Score: F1 = 2 (P R) / (P+R) } One of the most useful methods to debug classifiers. Check for overfitting? See gap between training and validation Check for code bugs or bad local optimum? Training data accuracy must decrease with more data Training data Validation data 25

Data-poor Assessment of Classification Error 26

Model: True target Predicted target random noise t = f(x 0 ;? )+ ; E[ ] = 0 Var( ) = 2 For squared-error loss with additive noise: Add and subtract E h f(x 0 ; ˆ ) i Deviation of average estimate From true function s mean Err(x 0 )=E[(t f(x 0 ; ˆ )) 2 X = x 0 ] h apple = E[t 2 ]+E f(x 0 ; ˆ ) 2 f(x 0 ; )? + E f(x 0 ; ˆ ) E = 2 + Bias 2 (f(x 0 ; ˆ )) + Var(f(x 0 ; ˆ ) h f(x 0 ; ˆ ) i 2 True noise Expected squared deviation of estimate around its own mean 2015 Bruno Ribeiro

} Bias Often related to size of model space Smaller the model space M, less likely true model lies in M More complex models tend to have lower bias } Variance Often related to size of dataset and quality of information When data is large enough to estimate true parameters (θ * ) we get lower variance } With big data simple models can perform surprisingly well due to lower variance Err(x 0 )=E[(t f(x 0 ; ˆ )) 2 X = x 0 ] = 2 + Ef(x 0 ; ˆ ) 2 h f(x 0 ;? ) + E f(x 0 ; ˆ ) Ef(x 0 ; ˆ ) i 2 = 2 + Bias 2 (f(x 0 ; ˆ )) + Var(f(x 0 ; ˆ ))

} Hypothesis space Larger search space, smaller bias Larger search space, larger variance } Problem: Looking for BEST MODEL (point estimate) } By testing too many hypotheses the best looking hypothesis is likely not a good one Models with error < 3ε Models with error < ε Models with error < 2ε Our Search Space Our Search Space

Optimism of the Training Error Rate } Typically: training error rate < true error (same data is being used to fit the method and assess its error) Err train = 1 N NX L(t(i),f(x 0 (i); ˆ )) <Err(X) =E[L(t, f(x; ˆ ))] i=1 Loss function Target overly optimistic Estimate from training data without k-th fold 2015 Bruno Ribeiro

Estimating Test Error } Can we estimate the discrepancy between Err(x 0 ) and Err(X)? } D old observations; D new new observations } Suppose we observed new values of label t(i) for same input x(i) Err in --- In-sample error: Expectation over N new responses at each x i E D [Err in ]= 1 N NX E D E D newl(t new (i),f(x 0 (i); ˆ )) i=1 = E D [Err train ]+ 2 N NX Cov(ˆt i,t i ) i=1 Trained with old observation D 2015 Bruno Ribeiro Adjustment for optimism of training error

Measuring Optimism If model: with θ = d t = f(x; )+ ; E[ ] = 0 Var( ) = 2 Then, if f is linear, f(x; ) =x T then E D [Err in ]=E D [Err train ]+ 2 N d 2 Optimism grows linearly with model dimension Optimism decreases as training sample size increases 2015 Bruno Ribeiro

Ways to Estimate Prediction Error } In-sample error estimates: AIC BIC MDL } Extra-sample error estimates: Cross-Validation Leave-one-out K-fold Bootstrap 2015 Bruno Ribeiro

Estimates of In-Sample Prediction Error } General form of the in-sample estimate: } For additive error: Known as the Marllow s C p statistic 2015 Bruno Ribeiro

2015 Bruno Ribeiro Akaike Information Criterion (AIC) AIC = 2 N ln P [t = f(x 0; ˆ )] + 2 d N AIC does not assume model is correct Better for small # samples May not converge to correct model (as N model bias 0) Bayesian Information Criterion (BIC) BIC = 2 N ln P [t = f(x 0; ˆ )] + d N log N BIC assumes model is correct Converges to correct parameterization (θ * ) of model (as N model bias = 0)

MDL (Minimum Description Length) } Find hypothesis (model) that minimizes H(M) + H(D M), where H(M) length in bits of description of model M H(D M) length in bits of description of data encoded by model M } If model probabilistic: length = ln P [t = f(x 0 ; ˆ ) M] ln P [ˆ M] Length of transmitting the discrepancy given the model + optimal coding under the given model Description of the model under optimal coding MDL principle: choose the model with the minimum description length Equivalent to maximizing the posterior: 2015 Bruno Ribeiro P [t = f(x 0 ; ˆ ) M]P [ˆ M]

BIC is one example of Bayesian Approach } While we will not cover Bayesian methods in this course, they tend to ameliorate some of these point estimate issues } Harder to compute 37

} Does not look for Best Hypothesis Not interested in point estimates of model parameters } Seeks to take variance into account into model evaluation Err(x 0 )=E[(t f(x 0 ; )) 2 X = x 0, ]P [ Data] } If testing too many hypotheses and not enough data, P [ˆ X = x 0 ] is less concentrated (higher entropy), thus leading to larger errors

Data-rich Assessment of Classification Error 39

Estimation of Extra-Sample Error } Cross Validation Train Validation Test } Bootstrap Model Selection Model assessment 2015 Bruno Ribeiro

Estimation of Extra-Sample Error via Cross-Validation } Goal: Estimate the test error for a supervised learning method } Algorithm: 1. Split the data in two parts (training and validation) 2. Train the method in the training part 3. Compute the error on the validation part 41

Wrong Way to Perform Cross-validation } Use cross-validation to estimate too many model decisions or for selection of too many features } Use cross-validation to hunt for too many new models Same problem we have when hunting for p-values (too many hypotheses) The human regression machine: Same as multiple hypothesis testing Proposes New Model CV Bad Check CV 2015 Bruno Ribeiro 42

In Data-poor scenarios we should use K-fold Cross-validation 43

1 2 3 4 5 Train Train Validation Train Train validation train K-fold K-fold cross-validation error CV err (K) = 1 K 2016 Bruno Ribeiro KX k=1 1 S k X Validation fold Loss function i2s k L(t(i), ˆt ( Target Estimate from training data without k-th fold k) (i))

Computation increases Variance decreases bias decreases K fold Leave-one-out Small K Large K

Cross-Validation: Choosing K 1-Err 0.0 0.2 0.4 0.6 0.8 0 50 100 150 200 Size of Training Set 2015 Bruno Ribeiro Popular choices for K: 5, 10, N=leave one out

Evaluating classification algorithms A and B } Use k-fold cross-validation to get k estimates of error for algorithms A and B Train1 Train2 Train3 Train4 Train5 Train6 Dataset Y X1 X2 Y X1 X2 Y X1 X2 Y X1 X2 Y X1 X2 Y X1 X2 Y X1 X2 Valid.1 Valid.2 Valid.3 Valid.4 Valid.5 Valid.6 Y X1 X2 Y X1 X2 Y X1 X2 Y X1 X2 Y X1 X2 Y X1 X2 Error A.1 Error A.2 Error A.3 Error A.4 Error A.5 Error A.6 Error B.1 Error B.2 Error B.3 Error B.4 Error B.5 Error B.6 } Set of errors estimated over the test set folds provides empirical estimate of sampling distribution } Mean is estimate of expected error } Is the mean error significant between A and B?

Assessing significance } Use paired t-test to assess whether the two distributions of errors are statistically different from each other Error A.1 Error B.1 Error A.2 Error B.2 Error A.3 Error B.3 Error A.4 Error B.4 Error A.5 Error B.5 } Takes into account both the difference in means and the variability of the scores B A

Paired t-test } Instead of comparing the difference in means between populations 1 and 2 1 NX X (1) 1 NX i X (2) i N N i=1 i=1 } A paired t-test will compare the mean difference d = 1 NX (X (1) i X (2) i ) N i=1 } The real effect is when estimating the standard error of the mean difference v SE( d) = s d p n, where s d = } The t statistic has N-1 degrees of freedom and is given by t = u t 1 N d SE( d) NX i=1 (X (1) i X (2) i d) 2 49

Bootstrapping 50

Bootstrapping } Bootstrap is a powerful statistical tool to quantify the uncertainty associated with a given estimator or statistical learning method } E.g.: It can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient } Instead of sampling new datasets from the unknown population distribution, resample from the training data (empirical population distribution) 51

Why Bootstrapping? } The term bootstrap derives from to pull oneself up by one s bootstraps, likely based on one of the eighteenth century novel The Surprising Adventures of Baron Munchausen by Rudolph Erich Raspe: The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps. } It is not the same as the term bootstrap used in computer science meaning to boot a computer from a set of core instructions, though the meaning is similar. 52

Bootstrap: Main Concept 250 7. Model Assessment and Selection Step 3: Compute variance of estimated parameters & error over validation data S(Z 1 ) S(Z 2 ) S(Z B ) Bootstrap replications Step 2: Learn model S(.) over Bootstrap samples Bootstrap samples Z 1 Z 2 Z B Z =(z 1,z 2,...,z N ) Step 1: Draw N samples with replacement from training data Training sample The Elements of Statistical Learning Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie

Bootstrap Validation Data } Two ways to use validation data Validation comes from whatever data points are not used in training Issue: Validation data size not consistent across bootstrap replications Validation error uncertainty not consistent Advantage: Keeps all training data for potentially training model Validation split from training before bootstrapping Issue: Must split validation from training, reducing training data Advantage: Validation data size kept consistent between bootstrap replications 54