Classifier Learning Curves
How to estimate classifier performance:
- Learning curves
- Feature curves
- Rejects and ROC curves
(Figure: true classification error ε versus training set size; the Bayes error ε* is the lower bound; a Bayes-consistent classifier approaches ε*, a sub-optimal classifier does not.)

The Apparent Classification Error
The apparent (or resubstitution) error, measured on the training set itself, is positively biased (optimistic).

Error Estimation by Test Set (gendat)
Split the design set, train the classifier on the training set, and test it on an independent test set to obtain an estimate of the true error ε. An independent test set is needed: testing on the training set gives the optimistically biased apparent error. Another training set yields another classifier; another test set yields another error estimate.

Training Set Size / Test Set Size
The training set should be large for good classifiers. The test set should be large for a reliable, unbiased error estimate. In practice, often just a single design set is given (gendat).

Cross-validation (crossval)
Rotate n times over the design set: each rotation trains classifier_i on one part and tests it on the held-out part; the results are combined into a classification error estimate.
- Using the same set for training and testing may give a good classifier, but will definitely yield an optimistically biased error estimate.
- A large, independent test set yields an unbiased and reliable (small variance) error estimate for a badly trained classifier.
- A small, independent test set yields an unbiased, but unreliable (large variance) error estimate for a well trained classifier.
- A 50-50 split is an often used standard solution for the sizes of training and test set. Is it really good? There are better alternatives.

n-fold cross-validation
- The test set is 1/n of the design set; the training set is (n-1)/n of the design set.
- Train and test n times; average the errors. (Good choice: n = 10.)
- All objects are tested once: the most reliable test result that is possible.
- Final classifier: trained by all objects, the best possible classifier.
- The error estimate is slightly pessimistically biased.
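The optimism of the apparent error can be illustrated outside PRTools; the following is a minimal numpy sketch, assuming a hypothetical nearest-mean classifier on synthetic two-class Gaussian data (all names and data are illustrative, not part of the course software):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n_per_class):
    # Hypothetical two-class Gaussian problem (stand-in for gendat-style data).
    a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_per_class, 2))
    b = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(n_per_class, 2))
    return np.vstack([a, b]), np.array([0] * n_per_class + [1] * n_per_class)

def nearest_mean_fit(X, y):
    # One mean vector per class.
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def nearest_mean_predict(means, X):
    # Assign each object to the class with the closest mean.
    d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    return d.argmin(axis=1)

Xtr, ytr = make_data(20)      # small training set
Xte, yte = make_data(1000)    # large, independent test set
means = nearest_mean_fit(Xtr, ytr)

apparent = (nearest_mean_predict(means, Xtr) != ytr).mean()  # resubstitution error
test_err = (nearest_mean_predict(means, Xte) != yte).mean()  # unbiased estimate
```

The apparent error is measured on the very objects the means were fitted to, which is why it tends to come out below the test-set estimate.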
Leave-one-out Procedure (crossval, testk)
Rotate n times: train classifier_i on all objects but one and test the single left-out object; combining the n results gives the classification error estimate.

Leave-one-out is cross-validation in which n is the total number of objects:
- One object is tested at a time.
- n classifiers have to be computed: in general infeasible for large n.
- Doable for the k-NN classifier (which needs no training).

Expected Learning Curves by Estimated Errors
(Figure: learning curves of the leave-one-out (n-1) error estimate, the cross-validation error estimate (rotation method), the true error ε, and the apparent error of the training set.)

Averaged Learning Curve / Repeated Learning Curves
For obtaining theoretically expected curves many repetitions are needed. Small sample sizes have a very large variability. In PRTools (reconstructed; the data sizes, training set sizes and repetition counts are illustrative):

a = gendath([100 100]);                  % Highleyman data
e = cleval(a,ldc,[2,3,5,7,10,15,20],25); % averaged learning curve, 25 repetitions
plote(e);

a = gendath([100 100]);
for j=1:25                               % repeated single learning curves
  e = cleval(a,ldc,[2,3,5,7,10,15,20],1);
  hold on; plote(e);
end

Learning Curves for Different Classifier Complexity
Peaking phenomenon, overtraining; curse of dimensionality, Rao's paradox. More complex classifiers are better in case of large training sets and worse in case of small training sets. (Figure: error versus training set size, feature set size (dimensionality) and classifier complexity, with the Bayes error as lower bound.)
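The remark that leave-one-out is "doable for the k-NN classifier" can be made concrete: since k-NN needs no training, leaving one object out amounts to excluding it as its own neighbour. A minimal numpy sketch for 1-NN, on hypothetical synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical two-class Gaussian data; sizes and means are illustrative.
X = np.vstack([rng.normal(0.0, 1.0, (15, 2)), rng.normal(2.0, 1.0, (15, 2))])
y = np.array([0] * 15 + [1] * 15)

def loo_error_1nn(X, y):
    # Full distance matrix between all objects.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Leave-one-out: an object may not be its own nearest neighbour.
    np.fill_diagonal(D, np.inf)
    nearest = D.argmin(axis=1)
    return (y[nearest] != y).mean()

err = loo_error_1nn(X, y)  # n "classifiers", but no training at all
```

All n objects are tested exactly once, without ever fitting a model, which is why leave-one-out is cheap here while it is infeasible for classifiers that require expensive training.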
Example: Overtraining, Polynomial Classifier (1)-(2)
(svc([],'p',n), gendat): train a support vector classifier with a polynomial kernel of increasing degree n (complexity 1 ... 6). With growing complexity the training set is fitted ever better, but the true error eventually increases.

Example: Overtraining (3)-(4)
(parzenc([],h)): run the smoothing parameter h in the Parzen classifier from small to large.

Overtraining versus Increasing Bias
A small h overtrains on the training objects; a large h smooths the densities and increases the bias.

Example: Curse of Dimensionality
Fisher classifier for various feature rankings. (Figure: true error and apparent error versus feature set size (dimensionality) / classifier complexity; the gap between them is the bias.)
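The Parzen example can be sketched in numpy; this is an illustrative re-implementation (not the PRTools parzenc itself) of a Gaussian-kernel Parzen classifier on hypothetical synthetic data, showing that a tiny smoothing parameter h drives the apparent error towards zero while the test error stays high:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical two-class Gaussian data; means and sizes are illustrative.
Xtr = np.vstack([rng.normal(0.0, 1.0, (25, 2)), rng.normal(1.5, 1.0, (25, 2))])
ytr = np.array([0] * 25 + [1] * 25)
Xte = np.vstack([rng.normal(0.0, 1.0, (500, 2)), rng.normal(1.5, 1.0, (500, 2))])
yte = np.array([0] * 500 + [1] * 500)

def parzen_predict(Xtr, ytr, X, h):
    # Parzen density estimate per class with a Gaussian kernel of width h;
    # assign each object to the class with the higher estimated density.
    d2 = ((X[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2.0 * h * h))
    dens0 = K[:, ytr == 0].mean(axis=1)
    dens1 = K[:, ytr == 1].mean(axis=1)
    return (dens1 > dens0).astype(int)

results = {}
for h in (0.01, 0.5, 2.0):
    apparent = (parzen_predict(Xtr, ytr, Xtr, h) != ytr).mean()
    test_err = (parzen_predict(Xtr, ytr, Xte, h) != yte).mean()
    results[h] = (apparent, test_err)
```

With h = 0.01 each training object is dominated by its own kernel, so the classifier effectively memorises the training set: overtraining in its purest form.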
Confusion Matrix (1)-(3) (confmat)
The confusion matrix lists, for the objects of each real class (rows), how many received each of the obtained labels (columns). From it the per-class errors and the class-averaged error follow. (Example tables: 3x3 confusion matrices for classes A, B and C with per-class errors and the averaged error.) Classification details are only observable in the confusion matrix!

Conclusions on Error Estimation
- Larger training sets yield better classifiers.
- Independent test sets are needed for obtaining unbiased error estimates.
- Larger test sets yield more accurate error estimates.
- Leave-one-out cross-validation seems to be an optimal compromise, but might be computationally infeasible.
- 10-fold cross-validation is a good practice.
- More complex classifiers need larger training sets to avoid overtraining.
- This holds in particular for larger feature sizes, due to the curse of dimensionality.
- For too small training sets, simpler classifiers or smaller feature sets are needed.
- Confusion matrices allow a detailed look at the per-class classification.

Reject and ROC Analysis
Topics: outlier reject, reject types, reject curves, performance measures, varying costs and priors, ROC analysis.

Outlier Reject (rejectc)
Reject objects for which the probability density is low. Note: in these regions the posterior probabilities might still be high! (Figure: error ε and density-based reject region of length r along the feature axis.)
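Building a confusion matrix and the per-class errors it yields can be shown in a few lines; this numpy sketch uses hypothetical labels for three classes A, B, C (encoded 0, 1, 2), not the values from the slides:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    # Rows: real labels; columns: obtained labels.
    C = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    return C

# Hypothetical labels for classes A=0, B=1, C=2.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 2, 2, 1])

C = confusion_matrix(y_true, y_pred, 3)
per_class_error = 1.0 - np.diag(C) / C.sum(axis=1)  # error per real class
avg_error = per_class_error.mean()                   # class-averaged error
```

The diagonal holds the correctly classified objects per class; everything off-diagonal shows exactly which classes are confused with which, detail that a single error number hides.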
Ambiguity Reject (rejectc)
Reject objects for which the classification is unsure, i.e. for which the posterior probabilities are about equal (a band of width d around the decision boundary).

Reject Curve
The classification error ε can be reduced to ε(r) by rejecting a fraction r of the objects. (Figure: error versus reject fraction r for increasing reject band width d.)

Rejecting Classifier
(Figure: original classifier and rejecting classifier; the error decrease is plotted against the rejected fraction r.) For reasonably well trained classifiers r > ε(r=0) - ε(r): to decrease the error by 1%, more than 1% of the objects has to be rejected.

How much to reject? (reject)
Compare the cost of a rejected object, c_r, with the cost of a classification error, c_ε. For a given total cost c = c_ε·ε(r) + c_r·r this is a linear function in the (r, ε) space. Shift this line until a possible operating point d* on the reject curve is reached.

Objects of class A, with prior p(A), are classified by S(x) = 0 into a part η_A, assigned to A, and a part ε_A, assigned to B:
p(A) = η_A + ε_A
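An error-reject trade-off can be simulated directly; the sketch below assumes a hypothetical classifier whose confidence (maximum posterior) is uniform on [0.5, 1) and whose probability of being correct equals that confidence, so rejecting low-confidence objects lowers the error among the accepted ones:

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated classifier output: confidence per object, and whether the
# object was classified correctly; here P(correct) = confidence by design.
conf = rng.uniform(0.5, 1.0, 1000)
correct = rng.uniform(0.0, 1.0, 1000) < conf

def error_reject(conf, correct, threshold):
    # Ambiguity reject: refuse all objects whose confidence is below threshold.
    accepted = conf >= threshold
    r = 1.0 - accepted.mean()           # rejected fraction
    eps = (~correct[accepted]).mean()   # error among the accepted objects
    return r, eps

r0, e0 = error_reject(conf, correct, 0.5)   # threshold 0.5: nothing rejected
r1, e1 = error_reject(conf, correct, 0.8)   # stricter threshold
```

In this simulation the error among the accepted objects drops (e1 below e0) at the price of a substantial rejected fraction r1, in line with the observation that the rejected fraction exceeds the error decrease.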
Objects of class B, with prior p(B), are classified by S(x) = 0 into a part η_B, assigned to B, and a part ε_B, assigned to A:
p(B) = η_B + ε_B

ε = ε_A + ε_B      η = η_A + η_B      ε + η = 1      p(A) + p(B) = 1

- ε_A / p(A): error in class A;  ε_A: error contribution of class A
- ε_B / p(B): error in class B;  ε_B: error contribution of class B
- η_A / p(A): sensitivity (recall) of class A (what goes correct in class A)
- ε: classification error;  η: performance (= 1 - ε)
- η_A / (η_A + ε_B): precision of class A (what belongs to A among all that is classified as A)

Given a trained classifier and a test set, count true labels against assigned labels: TP (true positives), FN (false negatives), FP (false positives), TN (true negatives).
- Sensitivity = TP / (TP + FN)
- Specificity = TN / (TN + FP)
- Error = (FP + FN) / (TP + FN + FP + TN): the probability of erroneous classifications.
- Performance = 1 - error.
- Sensitivity of a target class (e.g. diseased patients): the performance for objects from that target class.
- Specificity: the performance for all objects outside the target class.
- Precision of a target class = TP / (TP + FP): the fraction of correct objects among all objects assigned to that class.
- Recall: the fraction of correctly classified objects. This is identical to the performance; related to a particular class it is identical to the sensitivity.
- False Discovery Rate: FDR = FP / (TP + FP) = 1 - precision.

ROC Curve (roc)
For every classifier S(x) - d = 0 two errors, ε_A and ε_B, are made. The ROC curve shows them as a function of the shift d of the decision boundary. The ROC is independent of the class prior probabilities; a change of prior is analogous to a shift of the decision boundary. (Figure: ROC curves and decision boundaries for priors [0.5 0.5] and [0.2 0.8].)
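Tracing an ROC curve by shifting the decision boundary can be sketched as follows; the scores for the two classes are hypothetical Gaussian values, and the two per-class errors are computed separately, which also makes visible why the ROC does not depend on the class priors:

```python
import numpy as np

def roc_points(scores_a, scores_b, thresholds):
    # For each decision boundary S(x) - d = 0, compute the two per-class errors:
    # eps_A: fraction of class A objects on the wrong side (score below d),
    # eps_B: fraction of class B objects on the wrong side (score at or above d).
    pts = []
    for d in thresholds:
        eps_a = (scores_a < d).mean()
        eps_b = (scores_b >= d).mean()
        pts.append((eps_a, eps_b))
    return pts

rng = np.random.default_rng(4)
sa = rng.normal(1.0, 1.0, 500)    # hypothetical scores for class A objects
sb = rng.normal(-1.0, 1.0, 500)   # hypothetical scores for class B objects
pts = roc_points(sa, sb, np.linspace(-3.0, 3.0, 13))
```

Sliding d trades ε_A against ε_B: each error is a fraction computed within its own class, so reweighting the classes (changing the priors) moves the operating point along the curve but leaves the curve itself unchanged.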
ROC Analysis (roc)
ROC: Receiver Operating Characteristic (from communication theory). The curve plots the error in class A against the error in class B, or equivalently the error in the non-targets against the performance of the target class.

When are ROC Curves Useful? (1)
What is the performance for a given cost? What are the costs of a preset performance? The ROC curve shows the trade-off between cost and performance. Applications: medical diagnostics, database retrieval, two-class pattern recognition.

When are ROC Curves Useful? (2)
Study of the effect of changing priors. The operating points on the curve correspond to the values of d in S(x) - d = 0.

When are ROC Curves Useful? (3)
Comparing the behavior of different classifiers, and combining classifiers. (Figure: ROC curves of classifiers S_1, S_2, S_3.) Any 'virtual' classifier between S_1(x) and S_2(x) in the (ε_A, ε_B) space can be realized by using at random α times S_1(x) and (1 - α) times S_2(x):
ε_A = α ε_A(S_1) + (1 - α) ε_A(S_2)
ε_B = α ε_B(S_1) + (1 - α) ε_B(S_2)

Summary Reject and ROC
- Reject for solving ambiguity: reject objects close to the decision boundary to lower the costs.
- Reject option for protection against outliers.
- ROC analysis for the trade-off between performance and cost.
- ROC analysis in case of unknown or varying priors.
- ROC analysis for comparing / combining classifiers.
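The virtual-classifier interpolation can be checked with a two-line computation; the per-class errors of S_1 and S_2 below are hypothetical numbers chosen only for illustration:

```python
# Hypothetical per-class errors of two classifiers S1 and S2.
e1A, e1B = 0.05, 0.30
e2A, e2B = 0.25, 0.10

alpha = 0.5                             # apply S1 with probability alpha, else S2
eA = alpha * e1A + (1 - alpha) * e2A    # expected class-A error of the mixture
eB = alpha * e1B + (1 - alpha) * e2B    # expected class-B error of the mixture
# (eA, eB) lies on the straight line between (e1A, e1B) and (e2A, e2B).
```

Varying α from 0 to 1 sweeps the operating point along the chord between the two classifiers in (ε_A, ε_B) space, which is why the convex hull of ROC points is always achievable.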