Statistical Comparisons of Classifiers over Multiple Data Sets. Peiqian Li


1 Statistical Comparisons of Classifiers over Multiple Data Sets Peiqian Li

2 Outline
- Motivation
- Statistics and Tests for Comparison of Classifiers
  - Comparisons of Two Classifiers
    - Averaging over data sets
    - Paired t-test
    - Wilcoxon signed-ranks test
  - Comparisons of Multiple Classifiers
    - ANOVA
    - Friedman test
- Conclusion

3 Motivation
- Comparing two learning algorithms on a single data set is the commonly studied case.
- Comparisons of more algorithms on multiple data sets are more essential to typical machine learning studies.
- There is no established procedure for comparisons over multiple data sets.

4 Statistics and Tests for Comparison of Classifiers
- k learning algorithms are tested on N data sets.
- $c_{ij}$: performance score of the j-th algorithm on the i-th data set.
- Are the algorithms statistically significantly different?
- If so, which particular algorithms differ in performance?
- Fundamental difference from testing on a single data set: the sample size is the number of data sets.

5 Averaging over data sets
- "it is debatable whether error rates in different domains are commensurable, and hence whether averaging error rates across domains is very meaningful" (Webb, 2000)
- Results on different data sets are not comparable, so their averages are meaningless.
- Averages are susceptible to outliers.

6 Paired t-test
- A common way to test whether the difference between two classifiers is non-random.
- Difference on the i-th data set: $d_i = c_{i1} - c_{i2}$
- t statistic: $t = \bar{d} / \sigma_{\bar{d}}$, distributed according to the Student distribution with $N - 1$ degrees of freedom.
- Weaknesses:
  - Commensurability of the differences across data sets
  - Differences must be normally distributed
  - Affected by outliers
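The t statistic on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `paired_t_statistic` and the score lists are hypothetical, and $\sigma_{\bar{d}}$ is taken to be the sample standard deviation of the differences divided by $\sqrt{N}$.

```python
import math
import statistics

def paired_t_statistic(scores_1, scores_2):
    """Return (t, degrees_of_freedom) for paired scores over N data sets."""
    if len(scores_1) != len(scores_2) or len(scores_1) < 2:
        raise ValueError("need two equally long score lists with N >= 2")
    diffs = [a - b for a, b in zip(scores_1, scores_2)]  # d_i = c_i1 - c_i2
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    # Assumed standard error of the mean difference: s_d / sqrt(N)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    if std_err == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_d / std_err, n - 1

# Hypothetical accuracies of two classifiers on five data sets
classifier_1 = [0.80, 0.75, 0.90, 0.60, 0.85]
classifier_2 = [0.78, 0.70, 0.91, 0.55, 0.80]
t_value, dof = paired_t_statistic(classifier_1, classifier_2)
print(round(t_value, 3), dof)
```

The resulting t value would then be compared against a Student distribution table with `dof` degrees of freedom.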

7 Wilcoxon signed-ranks test
- Ranks the differences for each data set by absolute value.
- Compares the ranks for the positive and the negative differences.

$$R^+ = \sum_{d_i > 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i), \qquad R^- = \sum_{d_i < 0} \operatorname{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \operatorname{rank}(d_i)$$

$$T = \min(R^+, R^-)$$

$$z = \frac{T - \frac{1}{4} N(N + 1)}{\sqrt{\frac{1}{24} N(N + 1)(2N + 1)}}$$
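The rank sums and the z approximation above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names and the integer scores are hypothetical, absolute differences are rounded to 10 decimals so that floating-point noise does not break ties (an assumption of this sketch), and ranks of zero differences are split evenly between the two sums as on the slide.

```python
import math

def ascending_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def wilcoxon_signed_ranks(scores_1, scores_2):
    """Return (R_plus, R_minus, T, z) for paired scores over N data sets."""
    diffs = [a - b for a, b in zip(scores_1, scores_2)]  # d_i = c_i1 - c_i2
    ranks = ascending_ranks([round(abs(d), 10) for d in diffs])
    zero_half = sum(r for d, r in zip(diffs, ranks) if d == 0) / 2
    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0) + zero_half
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0) + zero_half
    n = len(diffs)
    t = min(r_plus, r_minus)
    z = (t - n * (n + 1) / 4) / math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return r_plus, r_minus, t, z

# Hypothetical scores (in percent) on six data sets
print(wilcoxon_signed_ranks([80, 75, 90, 60, 85, 70], [78, 70, 91, 55, 85, 73]))
```

For small N, T is normally compared with a table of exact critical values; the z approximation is intended for larger numbers of data sets.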

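8 Wilcoxon signed-ranks test (example: C4.5 vs. C4.5+m)
- Table columns: data set, C4.5, C4.5+m, difference, rank (the per-data-set values are not preserved in this copy).
- Data sets (N = 14): adult (sample), breast cancer, breast cancer wisconsin, cmc, ionosphere, iris, liver disorders, lung cancer, lymphography, mushroom, primary tumor, rheum, voting, wine.
- $R^+ = 93$, $R^- = 12$
- $T = \min(R^+, R^-) = 12 < 21$, the critical value for N = 14 at α = 0.05, so the difference is significant.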

9 Wilcoxon signed-ranks test
- More sensible than the t-test.
- Commensurability: assumed only qualitatively (larger differences count more, but their absolute magnitudes are ignored).
- Does not assume normal distributions: safer.
- Outliers: less effect.
- Less powerful than the paired t-test when the assumptions of the paired t-test are met, but can be more powerful when they are violated.

10 Comparisons of Multiple Classifiers
- A well-known statistical problem.
- Goal: control the family-wise error, the probability of making at least one Type 1 error in any of the comparisons.
- Statistics offers powerful specialized procedures:
  - ANOVA
  - its non-parametric counterpart, the Friedman test

11 ANOVA
- Repeated-measures ANOVA (within-subjects ANOVA).
- A common statistical method for testing differences between more than two related sample means.
- Divides the total variability into:
  - variability between the classifiers
  - variability between the data sets
  - residual (error) variability
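The variability decomposition listed on this slide can be illustrated with sums of squares. The sketch below assumes the usual repeated-measures layout (rows are data sets, columns are classifiers) and also forms the F ratio of between-classifier to residual variability, which the slide does not state explicitly; the function name and the score table are hypothetical.

```python
def repeated_measures_anova(scores):
    """Split total variability of an N x k score table into its three parts."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    classifier_means = [sum(row[j] for row in scores) / n for j in range(k)]
    dataset_means = [sum(row) / k for row in scores]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_classifiers = n * sum((m - grand) ** 2 for m in classifier_means)
    ss_datasets = k * sum((m - grand) ** 2 for m in dataset_means)
    ss_residual = ss_total - ss_classifiers - ss_datasets

    df_classifiers, df_residual = k - 1, (k - 1) * (n - 1)
    # Assumed test statistic: mean square between classifiers / mean square error
    f_ratio = (ss_classifiers / df_classifiers) / (ss_residual / df_residual)
    return {
        "total": ss_total,
        "classifiers": ss_classifiers,
        "datasets": ss_datasets,
        "residual": ss_residual,
        "F": f_ratio,
        "df": (df_classifiers, df_residual),
    }

# Hypothetical scores: 3 data sets (rows) x 3 classifiers (columns)
table = [[1, 2, 3], [2, 4, 6], [3, 3, 6]]
print(repeated_measures_anova(table))
```

Here the residual is obtained by subtraction, so the three components always add up to the total variability.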

12 Probably violated assumptions
- Normal distributions: a minor problem.
- Sphericity: similar to homogeneity of variance; requires that the random variables have equal variance.
- Violations of these assumptions have an even greater effect on the post-hoc tests.

13 Friedman test
- Ranks the algorithms for each data set separately.
- Average rank of the j-th algorithm, where $r_i^j$ is its rank on the i-th data set: $R_j = \frac{1}{N} \sum_i r_i^j$

Cells show score (rank); where no score is shown, only the rank survives in this copy.

data set                  C4.5         C4.5+m        C4.5+cf       C4.5+m+cf
adult (sample)            0.763 (4)    0.768 (3)     0.771 (2)     (1)
breast cancer             (1)          (2)           (3)           (4)
breast cancer wisconsin   (4)          (1)           (2)           (3)
cmc                       0.68 (4)     (1)           (3)           (2)
ionosphere                0.88 (4)     (2)           (3)           (1)
iris                      (1)          (2.5)         (4)           (2.5)
liver disorders           (3)          (2)           (4)           (1)
lung cancer               (2.5)        (2.5)         (4)           0.65 (1)
lymphography              (4)          (3)           (2)           (1)
mushroom                  (2.5)        (2.5)         (2.5)         (2.5)
primary tumor             (4)          0.96 (2.5)    (1)           0.96 (2.5)
rheum                     (3)          (2)           (4)           (1)
voting                    0.97 (4)     (1)           (2)           (3)
wine                      (3)          (1)           (4)           (2)
average rank              3.143        2.000         2.893         1.964
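The per-data-set ranking and the average ranks $R_j$ can be sketched as below. This is a hedged illustration: the function names and the score table are hypothetical, higher scores are assumed to be better (rank 1 is the best), and tied scores receive the average of the ranks they span.

```python
def rank_scores(scores):
    """Rank one data set's scores: highest score gets rank 1, ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def average_ranks(score_table):
    """Return R_j for each algorithm (column) of an N x k score table."""
    n, k = len(score_table), len(score_table[0])
    rank_table = [rank_scores(row) for row in score_table]
    return [sum(row[j] for row in rank_table) / n for j in range(k)]

# Hypothetical scores: 3 data sets (rows) x 4 algorithms (columns)
scores = [
    [0.70, 0.75, 0.80, 0.85],
    [0.90, 0.90, 0.85, 0.95],
    [0.60, 0.65, 0.65, 0.65],
]
print([rank_scores(row) for row in scores])
print(average_ranks(scores))
```

A quick sanity check for any rank table of this kind: each row of ranks sums to $k(k+1)/2$, so the average ranks sum to the same value.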

14 Friedman test

$$\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right]$$

distributed according to $\chi^2$ with $k - 1$ degrees of freedom.

$$F_F = \frac{(N - 1)\,\chi_F^2}{N(k - 1) - \chi_F^2}$$

distributed according to the F-distribution with $k - 1$ and $(k - 1)(N - 1)$ degrees of freedom.
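Both statistics follow directly from the average ranks, as the sketch below shows. The function name is hypothetical, and the average ranks used in the example are those implied by the deck's rank table for the four C4.5 variants over 14 data sets, rounded to three decimals.

```python
def friedman_statistics(avg_ranks, n_datasets):
    """Return (chi2_F, F_F, (df1, df2)) from the average ranks R_j."""
    k = len(avg_ranks)
    n = n_datasets
    sum_sq = sum(r * r for r in avg_ranks)
    chi2_f = 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)
    f_f = (n - 1) * chi2_f / (n * (k - 1) - chi2_f)
    return chi2_f, f_f, (k - 1, (k - 1) * (n - 1))

# Average ranks of C4.5, C4.5+m, C4.5+cf, C4.5+m+cf over N = 14 data sets
chi2_f, f_f, dof = friedman_statistics([3.143, 2.000, 2.893, 1.964], 14)
print(round(chi2_f, 2), round(f_f, 2), dof)  # 9.28 3.69 (3, 39)
```

The value of `f_f` is then compared with the critical value of the F-distribution for the returned degrees of freedom.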

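15 Friedman test (example)
- Average ranks $R_j$: C4.5 3.143, C4.5+m 2.000, C4.5+cf 2.893, C4.5+m+cf 1.964 (N = 14, k = 4)

$$\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right] = \frac{12 \cdot 14}{4 \cdot 5} \left[ \left( 3.143^2 + 2.000^2 + 2.893^2 + 1.964^2 \right) - \frac{4 \cdot 5^2}{4} \right] = 9.28$$

$$F_F = \frac{(N - 1)\,\chi_F^2}{N(k - 1) - \chi_F^2} = \frac{13 \cdot 9.28}{14 \cdot 3 - 9.28} = 3.69$$

- Degrees of freedom: $k - 1 = 3$ and $(k - 1)(N - 1) = 39$.
- The critical value of $F(3, 39)$ for α = 0.05 is 2.85 < 3.69, so the null hypothesis is rejected.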

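16 Post-hoc test
- The Nemenyi test (Nemenyi, 1963) is used when all classifiers are compared to each other.
- Two classifiers differ significantly if their average ranks differ by at least the critical difference:

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}$$

- The critical values $q_\alpha$ are tabulated by number of classifiers (the table values are not preserved in this copy).
- α = 0.05: CD = 1.25
- α = 0.10: CD = 1.12
- Pairs whose average ranks differ by more than the CD at α = 0.10: C4.5 vs. C4.5+m and C4.5 vs. C4.5+m+cf.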
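The critical difference and the resulting pairwise decisions can be sketched as follows. The $q_\alpha$ values for k = 4 (2.569 at α = 0.05, 2.291 at α = 0.10) are an assumption of this sketch, taken from commonly published Nemenyi tables rather than from the slide; the function names are hypothetical, and the average ranks are those of the deck's four C4.5 variants over 14 data sets.

```python
import math
from itertools import combinations

# Assumed Nemenyi critical values q_alpha for k = 4 classifiers
Q_ALPHA_K4 = {0.05: 2.569, 0.10: 2.291}

def critical_difference(q_alpha, k, n_datasets):
    """CD = q_alpha * sqrt(k(k+1) / (6N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

def significant_pairs(avg_ranks, cd):
    """Return the pairs of classifiers whose average ranks differ by at least cd."""
    return [
        (a, b)
        for a, b in combinations(avg_ranks, 2)
        if abs(avg_ranks[a] - avg_ranks[b]) >= cd
    ]

ranks = {"C4.5": 3.143, "C4.5+m": 2.000, "C4.5+cf": 2.893, "C4.5+m+cf": 1.964}
for alpha, q in Q_ALPHA_K4.items():
    cd = critical_difference(q, k=len(ranks), n_datasets=14)
    print(alpha, round(cd, 2), significant_pairs(ranks, cd))
```

With these inputs no pair reaches the critical difference at α = 0.05, while at α = 0.10 plain C4.5 differs from both C4.5+m and C4.5+m+cf.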

17 Post-hoc test
- Bonferroni correction: used when all classifiers are compared with a control classifier.
- More powerful than the Nemenyi test.
- Test statistic for comparing the i-th and j-th classifier:

$$z = \frac{R_i - R_j}{\sqrt{\dfrac{k(k+1)}{6N}}}$$

- The z value is used to find the corresponding probability from the table of the normal distribution, which is then compared with an appropriate α.
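A comparison against a control classifier can be sketched like this. Two things are assumptions of the sketch rather than statements from the slide: the "appropriate α" is taken to be the plain Bonferroni adjustment α / (k - 1), and the probability is the two-sided normal tail computed with `math.erfc` instead of a table lookup. The function names are hypothetical; the average ranks are those of the deck's four C4.5 variants over 14 data sets.

```python
import math

def z_statistic(r_i, r_j, k, n_datasets):
    """z = (R_i - R_j) / sqrt(k(k+1) / (6N))."""
    return (r_i - r_j) / math.sqrt(k * (k + 1) / (6 * n_datasets))

def two_sided_p(z):
    """Two-sided tail probability of the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

def compare_with_control(avg_ranks, control, n_datasets, alpha=0.05):
    """Return {classifier: (z, p, significant)} for every non-control classifier."""
    k = len(avg_ranks)
    adjusted_alpha = alpha / (k - 1)  # assumed Bonferroni adjustment for k - 1 comparisons
    results = {}
    for name, rank in avg_ranks.items():
        if name == control:
            continue
        z = z_statistic(rank, avg_ranks[control], k, n_datasets)
        p = two_sided_p(z)
        results[name] = (z, p, p < adjusted_alpha)
    return results

ranks = {"C4.5": 3.143, "C4.5+m": 2.000, "C4.5+cf": 2.893, "C4.5+m+cf": 1.964}
for name, (z, p, significant) in compare_with_control(ranks, "C4.5", 14).items():
    print(name, round(z, 2), round(p, 4), significant)
```

Under these assumptions only C4.5+m+cf differs significantly from the C4.5 control at α = 0.05; C4.5+m falls just short of the adjusted threshold.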

18 Conclusion
- The Wilcoxon signed-ranks test and the Friedman test are appropriate for comparisons over multiple data sets.
- They assume some, but limited, commensurability.
- They are safer than parametric tests: they do not assume normal distributions or homogeneity of variance.
- They are stronger than the other tests studied.

19 Thank you for your attention
