E. Alpaydın, AERFAISS 2010
Introduction
Questions:
- Is the error rate of my classifier less than 2%?
- Is k-NN more accurate than MLP?
- Does applying PCA beforehand improve accuracy?
- Which kernel leads to the highest accuracy with SVM?
Material
- Training/validation/test sets
- Resampling methods
- Comparing multiple algorithms on a single data set
- Comparison on multiple data sets
Algorithm Preference
Criteria (application-dependent):
- Misclassification error, or risk (loss functions)
- Training time/space complexity
- Testing time/space complexity
- Interpretability
- Easy programmability
- Cost-sensitive learning
Experiment Design: Factors and Response
Controllable factors:
- Learning algorithm
- Hyperparameters
- Input representation
Uncontrollable factors:
- Noise in data
- Randomness in splitting
- Randomness in optimization
Aim: arrive at conclusions not affected by chance, i.e., statistically significant.
Strategies of Experimentation
Response surface design
Basic Principles of Experimental Design
1. Randomization: Independence of results, unaffected by order
2. Replication: Average over chance and uncontrollable factors (k-fold cv)
3. Blocking: Reduce or eliminate the variability due to nuisance factors: paired tests
Guidelines for ML Experiments
A. Aim of the study: Compare hyperparameters, or two or more algorithms; single/multiple data sets
B. Selection of the response variable: Accuracy / precision-recall / loss function; cost-conscious framework
C. Choice of factors and levels: What are the factors to be played with? What are the factor levels?
Guidelines (cont'd)
D. Choice of experimental design: Factorial design (grid search); how many replicates?
E. Performing the experiment: Unbiased experimentation, a separate tester; good code and documentation
F. Statistical analysis of the data: Hypothesis testing; visualization of results: histograms, plots
G. Conclusions and recommendations: Draw objective conclusions
Splitting Data
The need for training, validation, and test sets:
- Training set: Optimize parameters
- Validation set: Optimize hyperparameters
- Test set: Measure generalization performance
Use data once.
Resampling and K-Fold Cross-Validation
The need for multiple training/validation sets: {T_i, V_i}, i = 1, ..., K: training/validation sets of fold i
K-fold cross-validation: Divide X into K parts X_i, i = 1, ..., K; V_i = X_i and T_i is the union of the remaining parts, so any two T_i share K − 2 parts
Stratification: Preserve class proportions in each fold
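The fold construction above can be sketched in a few lines of Python. This is a minimal illustration only: `k_fold_indices` is a made-up helper name, and stratification is omitted.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (T_i, V_i) index lists: V_i is fold i, T_i the other k-1 folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        v = folds[i]
        t = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield t, v

# Each instance appears in exactly one validation set V_i,
# and any two training sets T_i, T_j share k-2 of the k parts.
for t, v in k_fold_indices(n=10, k=5):
    print(sorted(v))
```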
5×2 Cross-Validation (Dietterich, 1998, Neural Computation)
Five replications of twofold cross-validation: in replication i, randomly split the data into two halves X_i^(1) and X_i^(2); each half serves once as training and once as validation set:
T_1 = X_1^(1), V_1 = X_1^(2); T_2 = X_1^(2), V_2 = X_1^(1); ...; T_9 = X_5^(1), V_9 = X_5^(2); T_10 = X_5^(2), V_10 = X_5^(1)
Bootstrapping
Draw N instances from a dataset of size N with replacement. The probability that a given instance is not picked after N draws is
(1 − 1/N)^N ≈ e^(−1) ≈ 0.368,
that is, only 36.8% of the instances are new (left out of the bootstrap sample)!
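A quick numeric check of this approximation; illustrative only, and the empirical fraction from the simulated sample will vary with the seed.

```python
import math
import random

N = 1000
# Analytic: probability that a given instance is never drawn in N draws
p_unseen = (1 - 1 / N) ** N
print(round(p_unseen, 3), round(math.exp(-1), 3))  # both about 0.368

# Empirical: fraction of instances missing from one bootstrap sample
rng = random.Random(0)
sample = {rng.randrange(N) for _ in range(N)}
print(round(1 - len(sample) / N, 2))  # roughly 0.37
```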
Making Decisions and Error
The classifier predicts '+' if P(+|x) > θ, and predicts '−' otherwise.
Measures of Performance
Precision and Recall
fp: retrieved but not relevant; fn: relevant but not retrieved
Precision = tp / (tp + fp); Recall = tp / (tp + fn)
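In code, a trivial helper for illustration; the counts in the example are made up.

```python
def precision_recall(tp, fp, fn):
    """precision: fraction of retrieved items that are relevant;
    recall: fraction of relevant items that are retrieved."""
    return tp / (tp + fp), tp / (tp + fn)

# 8 true positives, 2 retrieved-but-not-relevant, 4 relevant-but-not-retrieved
p, r = precision_recall(tp=8, fp=2, fn=4)
print(p, round(r, 3))  # 0.8 0.667
```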
ROC and Precision/Recall Curves
Statistics Review: Sampling
X = {x^t}, t = 1, ..., N, where x^t ~ N(μ, σ²)
Sample average: m = Σ_t x^t / N ~ N(μ, σ²/N)
Implication for model combination: Ulaş et al. (2009), Info Sci
Interval Estimation
√N (m − μ)/σ ~ Z (unit normal)
P(−1.96 < √N (m − μ)/σ < 1.96) = 0.95
P(m − 1.96 σ/√N < μ < m + 1.96 σ/√N) = 0.95
In general, the 100(1 − α) percent confidence interval:
P(m − z_{α/2} σ/√N < μ < m + z_{α/2} σ/√N) = 1 − α
One-sided:
P(√N (m − μ)/σ < 1.64) = 0.95
P(μ > m − 1.64 σ/√N) = 0.95
In general: P(μ > m − z_α σ/√N) = 1 − α
When σ is not known:
S² = Σ_t (x^t − m)² / (N − 1)
√N (m − μ)/S ~ t_{N−1}
P(m − t_{α/2, N−1} S/√N < μ < m + t_{α/2, N−1} S/√N) = 1 − α
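As a sketch, the t-based interval for a small sample of error rates; the data are made up, and t_{0.025,9} = 2.262 is taken from a t table.

```python
import math
import statistics

x = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.10, 0.12, 0.14, 0.13]
N = len(x)
m = statistics.mean(x)
S = statistics.stdev(x)              # sample std dev, N-1 in the denominator
t_crit = 2.262                       # t_{alpha/2, N-1} for alpha = 0.05, N = 10
half = t_crit * S / math.sqrt(N)
print(f"95% CI for mu: ({m - half:.3f}, {m + half:.3f})")  # about (0.117, 0.143)
```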
Hypothesis Testing
Reject a null hypothesis if it is not supported by the sample with enough confidence.
Two-sided test: H_0: μ = μ_0 vs. H_1: μ ≠ μ_0
Accept H_0 with level of significance α if μ_0 is in the 100(1 − α)% confidence interval, i.e., if √N (m − μ_0)/σ ∈ (−z_{α/2}, z_{α/2})
Type II error (accepting H_0 when it is false) determines how large a sample is needed.
One-sided test: H_0: μ ≤ μ_0 vs. H_1: μ > μ_0
Accept H_0 if √N (m − μ_0)/σ ∈ (−∞, z_α)
Variance unknown: Use t_{N−1} instead of Z
Accept H_0: μ = μ_0 if √N (m − μ_0)/S ∈ (−t_{α/2, N−1}, t_{α/2, N−1})
Assessing Error: H_0: p ≤ p_0 vs. H_1: p > p_0
Single training/validation set: binomial test
If the error probability is p_0, the probability that there are e errors or more out of N is
P(X ≥ e) = Σ_{x=e}^{N} C(N, x) p_0^x (1 − p_0)^{N−x}
Reject H_0 if this probability is less than α
Example: N = 100, e = 20
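A direct implementation of this tail probability with the standard library; `binomial_tail` is a made-up name, and the N = 10, e = 8 numbers are just an example.

```python
from math import comb

def binomial_tail(N, e, p0):
    """P(e or more errors in N instances) when the true error prob is p0."""
    return sum(comb(N, x) * p0 ** x * (1 - p0) ** (N - x)
               for x in range(e, N + 1))

# 8 errors on 10 instances under H0: p <= 0.5; reject if below alpha
p = binomial_tail(N=10, e=8, p0=0.5)
print(round(p, 4))  # 0.0547, i.e. (45 + 10 + 1) / 1024
```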
Normal Approximation to the Binomial
H_0: p ≤ p_0 vs. H_1: p > p_0
The number of errors X is approximately normal (CLT) with mean N p_0 and variance N p_0 (1 − p_0):
z = (e − N p_0) / √(N p_0 (1 − p_0)) ~ Z
Reject H_0 if z > z_α
t Test
Multiple (K) training/validation sets
x_i^t = 1 if instance t is misclassified on fold i, 0 otherwise
Error rate of fold i: p_i = Σ_t x_i^t / N
With m and S² the average and variance of the p_i, we reject the hypothesis of p_0 or less error if
√K (m − p_0) / S ~ t_{K−1}
is greater than t_{α, K−1}
Comparing Classifiers: H_0: μ_1 = μ_2 vs. H_1: μ_1 ≠ μ_2
Single training/validation set: McNemar's test
e_01: number of instances misclassified by 1 but not by 2; e_10: misclassified by 2 but not by 1
Under H_0, we expect e_01 = e_10 = (e_01 + e_10)/2
(|e_01 − e_10| − 1)² / (e_01 + e_10) ~ χ²_1
Accept H_0 if this statistic is less than χ²_{α,1}
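A minimal sketch with the continuity-corrected statistic; `mcnemar_statistic` is a made-up name, and the counts are invented.

```python
def mcnemar_statistic(e01, e10):
    """e01: misclassified by 1 but not by 2; e10: the reverse.
    Approximately chi-squared with 1 df under H0 (continuity-corrected)."""
    return (abs(e01 - e10) - 1) ** 2 / (e01 + e10)

# chi2_{0.05,1} = 3.84 from a table: reject 'equal error' if exceeded
s = mcnemar_statistic(e01=20, e10=8)
print(round(s, 2), s > 3.84)  # 4.32 True
```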
K-Fold CV Paired t Test
Use K-fold cv to get K training/validation folds
p_i^1, p_i^2: errors of classifiers 1 and 2 on fold i
p_i = p_i^1 − p_i^2: paired difference on fold i
The null hypothesis is that p_i has mean 0:
H_0: μ = 0 vs. H_1: μ ≠ 0
m = Σ_i p_i / K, s² = Σ_i (p_i − m)² / (K − 1)
Accept H_0 if √K·m / s ~ t_{K−1} is in (−t_{α/2, K−1}, t_{α/2, K−1})
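A sketch of the statistic; `paired_t` is a made-up helper, the fold errors are invented, and the result is compared with the table value t_{0.025,4} = 2.776 for α = 0.05, K = 5.

```python
import math
import statistics

def paired_t(p1, p2):
    """K-fold cv paired t statistic: sqrt(K) * m / s ~ t_{K-1} under H0."""
    d = [a - b for a, b in zip(p1, p2)]
    m = statistics.mean(d)
    s = statistics.stdev(d)
    return math.sqrt(len(d)) * m / s

p1 = [0.20, 0.22, 0.19, 0.24, 0.21]   # errors of classifier 1 per fold
p2 = [0.18, 0.21, 0.16, 0.20, 0.19]   # errors of classifier 2 per fold
t = paired_t(p1, p2)
print(round(t, 2))  # about 4.71, beyond t_{0.025,4} = 2.776
```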
5×2 cv Paired t Test (Dietterich, 1998, Neural Computation)
Use 5×2 cv to get 2 folds of 5 training/validation replications
p_i^(j): difference between the errors of classifiers 1 and 2 on fold j = 1, 2 of replication i = 1, ..., 5
Average and variance of replication i: p̄_i = (p_i^(1) + p_i^(2))/2, s_i² = (p_i^(1) − p̄_i)² + (p_i^(2) − p̄_i)²
t = p_1^(1) / √(Σ_{i=1}^{5} s_i² / 5) ~ t_5
Two-sided test: Accept H_0: μ_1 = μ_2 if t ∈ (−t_{α/2,5}, t_{α/2,5})
One-sided test: Accept H_0: μ_1 ≤ μ_2 if t < t_{α,5}
5×2 cv Paired F Test (Alpaydın, 1999, Neural Computation)
f = Σ_{i=1}^{5} Σ_{j=1}^{2} (p_i^(j))² / (2 Σ_{i=1}^{5} s_i²) ~ F_{10,5}
Two-sided test: Reject H_0: μ_1 = μ_2 if f > F_{α,10,5}
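Both 5×2 statistics can be computed together. A sketch under stated assumptions: `five_by_two_tests` is a made-up helper, `p` holds invented per-fold error differences, and t_{0.025,5} = 2.571 and F_{0.05,10,5} ≈ 4.74 are table values.

```python
def five_by_two_tests(p):
    """p[i] = (p_i_1, p_i_2): error differences of the two classifiers on
    folds 1 and 2 of replication i. Returns Dietterich's t ~ t_5 and
    Alpaydin's f ~ F_{10,5} under H0."""
    s2 = [(a - (a + b) / 2) ** 2 + (b - (a + b) / 2) ** 2 for a, b in p]
    t = p[0][0] / (sum(s2) / 5) ** 0.5
    f = sum(a * a + b * b for a, b in p) / (2 * sum(s2))
    return t, f

p = [(0.02, 0.03), (0.01, 0.04), (0.03, 0.02), (0.00, 0.02), (0.04, 0.01)]
t, f = five_by_two_tests(p)
print(round(t, 2), round(f, 2))  # about 1.29 2.67: neither test rejects here
```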
Comparing L > 2 Algorithms: Analysis of Variance (ANOVA)
H_0: μ_1 = μ_2 = ... = μ_L
Errors of L algorithms on K folds: X_ij ~ N(μ_j, σ²), j = 1, ..., L, i = 1, ..., K
We construct two estimators of σ². One is valid only if H_0 is true, the other is always valid. We reject H_0 if the two estimators disagree.
If H_0 is true, then m_j = Σ_i X_ij / K ~ N(μ, σ²/K), so an estimator of σ² is
σ̂² = K · Σ_j (m_j − m)² / (L − 1), where m = Σ_j m_j / L
Defining SSb ≡ K Σ_j (m_j − m)², when H_0 is true we have
SSb / σ² ~ χ²_{L−1}, and SSb / (L − 1) estimates σ²
Regardless of H_0, our second estimator of σ² is the average of the group variances S_j²:
S_j² = Σ_i (X_ij − m_j)² / (K − 1)
σ̂² = Σ_j S_j² / L = SSw / (L(K − 1)), where SSw ≡ Σ_j Σ_i (X_ij − m_j)²
SSw / σ² ~ χ²_{L(K−1)}
Then f = (SSb/(L−1)) / (SSw/(L(K−1))) ~ F_{L−1, L(K−1)}
Reject H_0: μ_1 = ... = μ_L if f > F_{α, L−1, L(K−1)}
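The two estimators and their F ratio in code; `anova_f` is a made-up helper, the error values are invented, and F_{0.05,2,9} = 4.26 is a table value.

```python
import statistics

def anova_f(errors):
    """errors[j]: K error rates of algorithm j, for L algorithms.
    Returns (SSb/(L-1)) / (SSw/(L(K-1))) ~ F_{L-1, L(K-1)} under H0."""
    L, K = len(errors), len(errors[0])
    means = [statistics.mean(g) for g in errors]
    grand = statistics.mean(means)
    ssb = K * sum((mj - grand) ** 2 for mj in means)
    ssw = sum((x - mj) ** 2 for g, mj in zip(errors, means) for x in g)
    return (ssb / (L - 1)) / (ssw / (L * (K - 1)))

errors = [[0.20, 0.22, 0.21, 0.21],   # algorithm 1 on K = 4 folds
          [0.25, 0.24, 0.26, 0.25],   # algorithm 2
          [0.20, 0.21, 0.19, 0.20]]   # algorithm 3
print(round(anova_f(errors), 1))  # about 42, far beyond F_{0.05,2,9} = 4.26
```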
ANOVA Table
If ANOVA rejects, we do pairwise posthoc tests:
H_0: μ_i = μ_j vs. H_1: μ_i ≠ μ_j
t = (m_i − m_j) / √(2 σ̂_w² / K) ~ t_{L(K−1)}, where σ̂_w² = SSw / (L(K − 1))
More on Comparing Multiple Populations
Range tests: Newman-Keuls test
Contrasts: Check, e.g., if there is a significant difference between algorithms {1, 2} and {3, 4, 5}:
H_0: (μ_1 + μ_2)/2 = (μ_3 + μ_4 + μ_5)/3 vs. H_1: (μ_1 + μ_2)/2 ≠ (μ_3 + μ_4 + μ_5)/3
MultiTest: Comparison of L > 2 Algorithms (Yıldız and Alpaydın, 2006, IEEE TPAMI)
Generate a full ordering using pairwise tests and a prior ordering:
- Order algorithms in decreasing order of prior preference (e.g., based on complexity)
- Form a directed graph using pairwise one-sided tests, with i preferred over j a priori
- If the test rejects, we add an edge from i to j, to show that j is to be preferred over i
MultiTest: Pseudo-code
MultiTest
Nonparametric Tests
If the normality assumption does not hold, it does not make sense to take or compare averages
Examples: comparison of training times, memory needs, and so on; comparison over multiple data sets
We can use order and rank information instead
Sign Test
Comparing two algorithms
Sign test: Count how many times A beats B over N datasets, and check whether this could have been by chance if A and B did have the same error rate
Wilcoxon signed rank test: Also takes the magnitudes of the differences into account
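The sign test reduces to a binomial tail with success probability 1/2; `sign_test` is a made-up name, and the 12-out-of-15 example is invented.

```python
from math import comb

def sign_test(wins, n):
    """P(wins or more out of n) under H0 that A and B are equally good.
    Ties are dropped from n beforehand."""
    return sum(comb(n, w) for w in range(wins, n + 1)) / 2 ** n

# A beats B on 12 of 15 datasets: unlikely under 'equal error rate'
p = sign_test(wins=12, n=15)
print(round(p, 4))  # 0.0176, below alpha = 0.05
```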
Kruskal-Wallis Test
Comparing multiple algorithms
Kruskal-Wallis test: Calculate the average rank of all algorithms on M datasets, and check whether these could have been by chance if they all had equal error
If KW rejects, we do pairwise posthoc tests: Tukey's test
Critical Difference Diagrams (Demšar, 2006, JMLR)
Friedman's test followed by Nemenyi's posthoc test for pairwise comparisons
Conclusions
"See first, think later, then test. But always see first. Otherwise you will only see what you were expecting." - Douglas Adams, So Long, and Thanks for All the Fish
Testing is not a separate step done after all runs are completed; the whole experimental process should be designed beforehand.
References
Alpaydın, E. 2010. Introduction to Machine Learning, 2nd edition, The MIT Press. This presentation is based on Chapter 19 of this book.
Demšar, J. 2006. "Statistical Comparison of Classifiers over Multiple Data Sets." Journal of Machine Learning Research 7: 1--30.
Dietterich, T. G. 1998. "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms." Neural Computation 10: 1895--1923.
Fawcett, T. 2006. "An Introduction to ROC Analysis." Pattern Recognition Letters 27: 861--874.
Montgomery, D. C. 2005. Design and Analysis of Experiments. 6th ed., New York: Wiley.
Yıldız, O. T., and E. Alpaydın. 2006. "Ordering and Finding the Best of K > 2 Supervised Learning Algorithms." IEEE Transactions on Pattern Analysis and Machine Intelligence 28: 392--402.