Statistical Tests for Computational Intelligence Research and Human Subjective Tests

Size: px

Start display at page:

Download "Statistical Tests for Computational Intelligence Research and Human Subjective Tests"

Tabitha Pope
6 years ago
Views:

1 tatistical Tests for Computational Intelligence Research and Human ubjective Tests lides are downloadable from Hideyuki TAKAGI Kyushu University, Japan ver. March 6, 015 ver. July 15, 013 ver. July 11, 013 ver. April 3, 013 Contents groups n groups (n > ) distribution Parametric Test (normality) un un t -test t -test ANOVA (Analysis of Variance) one-way ANOVA two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test sign test Wilcoxon signed-ranks test one-way two-way Kruskal-Wallis test Friedman test + cheffé's method of comparison for Human ubjective Tests

2 How to how ignificance? Just compare averages visually? It is not scientific. fitness proposed EC1 conventional EC generations fitness conventional EC proposed EC generations Fig. XX Average convergence curves of n times of trial runs. How to how ignificance? ound design concept: exiting sound made by conventional IEC sound made by proposed IEC1 sound made by proposed IEC Which method is good to make exiting sound? How to show it?

3 You cannot show the superiority of your method without statistical tests. Papers without statistics tests may be rejected. My method is significantly! statistical test distribution Which Test hould We Use? groups n groups (n > ) Parametric Test (normality) un un t -test t -test ANOVA (Analysis of Variance) one-way ANOVA two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test sign test Wilcoxon signed-ranks test one-way two-way Kruskal-Wallis test Friedman test

4 Which Test hould We Use? groups n groups (n > ) distribution Parametric Test (normality) un un t -test t -test ANOVA (Analysis of Variance) one-way ANOVA two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test n-th generation sign test Wilcoxon signed-ranks test one-way two-way Kruskal-Wallis test n-th generation Friedman test Which Test hould we Use? groups n groups (n > ) distribution Parametric Test (normality) Non-parametric Test (no normality) un un un t -test t -test Mann-Whitney U-test sign test Wilcoxon signed-ranks test ANOVA (Analysis of Variance) Normality Test Anderson-Darling test D'Agostino-Pearson test Kolmogorov-mirnov test hapiro-wilk test one-way Jarque Bera test two-way one-way ANOVA two-way ANOVA Kruskal-Wallis test Friedman test

5 Which Test hould We Use? groups n groups (n > ) distribution Parametric Test (normality) Non-parametric Test (no normality) un un un un t -test group A group B t 4.3 -test Mann-Whitney U-test sign test Wilcoxon signed-ranks test ANOVA (Analysis of Variance) initial # one-way ANOVA conven tional proposed two-way ANOVA one-way Kruskal-Wallis test two-way Friedman test Which Test hould We Use? groups n groups (n > ) distribution Parametric Test (normality) un un un t -test A group B group t -test ANOVA (Analysis of Variance) initial # one-way ANOVA GA proposed two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test sign test Wilcoxon signed-ranks test one-way two-way Kruskal-Wallis test Friedman test

6 distribution Which Test hould We Use? Q1: Which tests are more sensitive, those for un or? groups n groups (n > ) A1: tatistical tests for because of more information. Parametric Test (normality) un un un t -test A group B group t -test ANOVA (Analysis of Variance) initial # one-way ANOVA GA proposed two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test sign test Wilcoxon signed-ranks test one-way two-way Kruskal-Wallis test Friedman test distribution Which Test hould We Use? Q: How should you design your experimental conditions groups to use statistical ntests groups for (n > ) and reduce the # of trial runs? A: Use the same initialized for the set of (method A, method B) at each trial run. Parametric Test (normality) un un t -test t -test ANOVA (Analysis of Variance) one-way ANOVA two-way ANOVA significant? Non-parametric Test (no normality) un Mann-Whitney U-test sign test Wilcoxon signed-ranks n-th generation test one-way two-way Kruskal-Wallis test Friedman test

7 distribution Which Test hould we Use? Q3: Which statistical tests are sensitive, parametric tests or non-parametric ones and why? groups n groups (n > ) A3: Parametric tests which can use information of assumed distribution. Parametric Test (normality) un un t -test t -test ANOVA (Analysis of Variance) one-way ANOVA two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test sign test Wilcoxon signed-ranks test one-way two-way Kruskal-Wallis test Friedman test t-test groups n groups (n > ) distribution Parametric Test (normality) un un t -test t -test t -test ANOVA (Analysis of Variance) one-way ANOVA two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test sign test Wilcoxon signed-ranks test one-way two-way Kruskal-Wallis test Friedman test

8 t-test How to how ignificance? significant? n-th generation t-test Test this difference with assuming no difference. (null hypothesis) A B significant difference? Conditions to use t-tests: (1) normality () equal variances (not essential though )

(null hypothesis) Normality Test significant Anderson-Darling test difference?

9 t-test F-Test A B 1 When 10 (p > 0.05), we assume 14 that 9 there is no significant difference between σ A and σ B Test this difference with assuming no difference. (null hypothesis) Normality Test significant Anderson-Darling test difference? D'Agostino-Pearson test Kolmogorov-mirnov test hapiro-wilk test Jarque Bera test Conditions to use t-tests: (1) normality () equal variances (not essential though ) t-test Excel (3 bits version only?) has t-tests and ANOVA in Data Analysis Tools. You must install its add-in. (File -> option -> add-in, and set its add-in.)

10 t-test (1) t-test: Pairs two sample for means significant? This is a case when each pair of two methods with the same initial condition. n-th generation () t-test: Two-sample assuming equal variances (3) t-test: Two-sample assuming unequal variances: Welch's t-test t-test sample A B t-test: Paired Two ample for Means Variable 1 Variable Mean Variance Observations Pearson Correlation Hypothesized Mean Difference 0 df 9 t tat P(T<=t) one-tail t Critical one-tail P(T<=t) two-tail t Critical two-tail

11 t-test sample A B A 3. > B t-test: Paired Two ample for Means When p-value is less than 0.01 or 0.05, we assume that there is significant difference with the level Variable of significance 1 Variable of (p < 0.01) or (p < 0.05). Mean Variance Observations % Pearson.5% Correlation % A BHypothesized A < B Mean When A>B never 0happens, Difference you may use a one-tail test. df 9 t tat P(T<=t) one-tail t Critical one-tail P(T<=t) two-tail t Critical two-tail t-test (1) t-test: Pairs two sample for means () t-test: Two-sample assuming equal variances Difference between two groups is significant (p < 0.01). We cannot say that there is a significant difference between two group.

12 distribution ANOVA: Analysis of Variance groups n groups (n > ) Parametric Test (normality) un un t -test t -test ANOVA (Analysis of Variance) one-way ANOVA ANOVA two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test sign test Wilcoxon signed-ranks test one-way two-way Kruskal-Wallis test Friedman test ANOVA: Analysis of Variance significant? n-th generation

13 ANOVA: Analysis of Variance 1. Analysis of more than two groups.. Normality and equal variance are required. Excel has ANOVA in Data Analysis Tools. A B C C A B ANOVA: Analysis of Variance 1. Analysis of more than two groups.. Normality and equal variance are required. A B C Excel has ANOVA in Data Analysis Tools. Check it using the Bartlett test. C A B three t-tests = one ANOVA Three times of t-test with (p<0.05) equivalent one ANOVA (p<0.14). 1-(1-0.05) 3 = 0.14

14 ANOVA: Analysis of Variance n-th generation When are independent, use one-way ANOVA (single factor ANOVA). When correspond each other, use two-way ANOVA (two-factor ANOVA). ANOVA: Analysis of Variance Q1: What are "single factor" and "two factors"? A1: A column factor (e.g. three groups) and a sample factor (e.g. initialized condition). When are independent, use one-way ANOVA column (single factor factor ANOVA). When correspond each other, use two-way ANOVA column (two-factor factor ANOVA). sample factor

15 ANOVA: Analysis of Variance one-factor (one-way) ANOVA column factor two-factor (two-way) ANOVA column factor group A group B group C sample factor initial group A group B group C condition # # # # # # # # We cannot say that three groups are significantly different. (p=0.089) There are significant difference somewhere among three groups. (p<0.05) ANOVA: Analysis of Variance Output of the one-way ANOVA ource of Variation df M F P-value F crit Between Groups E Within Groups Total Output of the two-way ANOVA ource of Variation df M F P-value F crit ample Columns Interaction Within Total When (p-value < 0.01 or 0.05), there is(are) significant difference somewhere among groups. ignificant difference among ample (e.g. initial conditions) cannot be found (p > 0.05). ignificant difference can be found somewhere among Columns (e.g. three methods) (p < 0.01). We need not care an interaction effect between two factors (e.g. initial condition vs. methods) (p > 0.05). ample factor Column factor A B C

16 ANOVA: Analysis of Variance Q1: Where is significant among A, B, and C? A1: Apply multiple comparisons between all pairs among columns. (Fisher's PLD method, cheffé method, Bonferroni-Dunn test, Dunnett method, Williams method, Tukey method, Nemenyi test, Tukey-Kramer method, Games/Howell method, Duncan's new multiple range test, tudent-newman-keuls method, etc. Each has different characteristics.) ource of Variation df M F P-value F crit ample Columns Interaction Within Total significant? ample factor Column factor A B C distribution Non-Parametric Tests groups n groups (n > ) Parametric Test (normality) un un t -test t -test ANOVA (Analysis of Variance) one-way ANOVA If normality and equal variances are not guaranteed, use non-parametric tests. two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test sign test Wilcoxon signed-ranks test one-way two-way Kruskal-Wallis test Friedman test

17 Mann-Whitney U-test groups n groups (n > ) distribution Parametric Test (normality) un un t -test t -test ANOVA (Analysis of Variance) one-way ANOVA two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test sign test Wilcoxon signed-ranks test one-way two-way Kruskal-Wallis test Friedman test Mann-Whitney U-test (Wilcoxon-Mann-Whitney test, two sample Wilcoxon test) 1. Comparison of two groups.. Data have no normality. 3. There are no corresponding between two groups.? no normality? n-th generation

18 Mann-Whitney U-test (Wilcoxon-Mann-Whitney test, two sample Wilcoxon test) 1. Calculate a U value U = = 9 U' = 11 (U + U' = n 1 n ) when two values are the same, ( count as 0.5. ) Mann-Whitney U-test (cont.) (Wilcoxon-Mann-Whitney test, two sample Wilcoxon test). ee a U-test table. Use the smaller value of U or U'. When n 1 0 and n 0, see a Mann-Whitney test table. (where n 1 and n are the # of of two groups.) Otherwise, since U follows the below normal distribution roughly, n1n n1n ( n1 n 1) N U, U N, 1 U U normalize U as z and check a standard normal distribution table U nn 1 U U n1 n 1 1 ( with the z, where and. n n Use an Excel function to calculate the p-value for the z-value: p-value = 1 - NORM..DIT( z ) 1)

19 Examples: Mann-Whitney U-test (Wilcoxon-Mann-Whitney test, two sample Wilcoxon test) Ex.1 0 Ex Ex U = 9 U' = 11 U = 1 U' = 13 U = 3.5 U' = 1.5 (p < 0.05) n n ー (p < 0.01) n n ー 4 ーー Exercise: Mann-Whitney U-test (Wilcoxon-Mann-Whitney test, two sample Wilcoxon test) (p < 0.05) n 1 n U = 9.5 U' = ー ince U' > 5, (p > 0.05): ( significance is not found) (p < 0.01) n n ーーーー 4 ーー

20 ign Test groups n groups (n > ) distribution Parametric Test (normality) un un t -test t -test ANOVA (Analysis of Variance) one-way ANOVA two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test sign test Wilcoxon signed-ranks test one-way two-way Kruskal-Wallis test Friedman test ign Test (1)ign Test significance test between the # of winnings and losses ()Wilcoxon's igned Ranks Test significance test using both the # of winnings and losses and the level of winnings/losses of groups # of winnings and losses the level of winnings/losses

21 ign Test 1. Calculate the # of winnings and losses by comparing runs with the same initial.. Check a sign test table to show significance of two methods. n-th generation ign Test Fig.3 in Y. Pei and H. Takagi, "Fourier analysis of the fitness landscape for evolutionary search acceleration," IEEE Congress on Evolutionary Computation (CEC), pp.1-7, Brisbane, Australia (June 10-15, 01). The (+,-) marks show whether our proposed methods converge significantly or poorer than normal DE, respectively, (p 0.05). Fig. in the same paper. Generations F1: DE_N vs. DE_LR F1: DE_N vs. DE_L F1: DE_N vs. DE_FR_GLB_nD F1: DE_N vs. DE_FR_LOC_nD F1: DE_N vs. DE_FR_GLB_1D F1: DE_N vs. DE_FR_LOC_1D F: DE_N vs. DE_LR F: DE_N vs. DE_L F: DE_N vs. DE_FR_GLB_nD F: DE_N vs. DE_FR_LOC_nD F: DE_N vs. DE_FR_GLB_1D F: DE_N vs. DE_FR_LOC_1D F3: DE_N vs. DE_LR F3: DE_N vs. DE_L F3: DE_N vs. DE_FR_GLB_nD F3: DE_N vs. DE_FR_LOC_nD F3: DE_N vs. DE_FR_GLB_1D F3: DE_N vs. DE_FR_LOC_1D + + F4: DE_N vs. DE_LR F4: DE_N vs. DE_L F4: DE_N vs. DE_FR_GLB_nD F4: DE_N vs. DE_FR_LOC_nD F4: DE_N vs. DE_FR_GLB_1D F4: DE_N vs. DE_FR_LOC_1D F5: DE_N vs. DE_LR F5: DE_N vs. DE_L F5: DE_N vs. DE_FR_GLB_nD F5: DE_N vs. DE_FR_LOC_nD ++ F5: DE_N vs. DE_FR_GLB_1D +++ F5: DE_N vs. DE_FR_LOC_1D + F6: DE_N vs. DE_LR F6: DE_N vs. DE_L F6: DE_N vs. DE_FR_GLB_nD F6: DE_N vs. DE_FR_LOC_nD F6: DE_N vs. DE_FR_GLB_1D F6: DE_N vs. DE_FR_LOC_1D F7: DE_N vs. DE_LR + F7: DE_N vs. DE_L F7: DE_N vs. DE_FR_GLB_nD F7: DE_N vs. DE_FR_LOC_nD + F7: DE_N vs. DE_FR_GLB_1D + F7: DE_N vs. DE_FR_LOC_1D F8: DE_N vs. DE_LR F8: DE_N vs. DE_L F8: DE_N vs. DE_FR_GLB_nD F8: DE_N vs. DE_FR_LOC_nD + F8: DE_N vs. DE_FR_GLB_1D F8: DE_N vs. DE_FR_LOC_1D

22 Task Example Whether performances of pattern recognition methods A and B are significantly different? n 1 cases: Both methods succeeded. n cases: Method A succeeded, and method B failed. n 3 cases: Method A failed, and method B succeeded. n 4 cases: Both methods failed. level of significance % % level of significance ign Test % % How to check? 1. et N = n + n 3.. Check the right table with the N. 3. If min(n, n 3 ) is smaller than the number for the N, we can say that there is significant difference with the significant risk level of XX. Exercise Whether there is significant difference for n = 1 and n 3 = 8? ANWER: Check the right table with N = 40. As n is bigger than 11 and smaller than 13, we can say that there is a significant difference between two with (p < 0.05) but cannot say so with (p < 0.01). ign Test level of significance % % Let's think about the case of N = 17. To say that n 1 and n are significantly different, (n 1 vs. n ) = (17 vs. 0), (16 vs. 1), or (15 vs. ) (p < 0.01) or (n 1 vs. n ) = (14 vs. 3) or (13 vs. 4) (p < 0.05)

23 Exercise: ign Test level of significance % % Check the significance of: 16 vs vs. 1 9 vs vs. 5 distribution Wilcoxon igned-ranks Test groups n groups (n > ) Parametric Test (normality) un un t -test t -test ANOVA (Analysis of Variance) one-way ANOVA two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test sign test Wilcoxon signed-ranks test one-way two-way Kruskal-Wallis test Friedman test

24 Wilcoxon igned-ranks Test Q: When a sign test could not show significance, how to do? A: Try the Wilcoxon signed-ranks test. It is more sensitive than a simple sign test due to more information use. n-th generation Wilcoxon igned-ranks Test (1)ign Test significance test between the # of winnings and losses ()Wilcoxon's igned Ranks Test significance test using both the # of winnings and losses and the level of winnings/losses of groups # of winnings and losses the level of winnings/losses

25 Wilcoxon igned-ranks Test Example: (step 1) (step ) (step 3) (step 4) v (system A) v (system B) difference d rank of d add sign to the ranks rank of fewer # of signs n 8 (step 6) Wilcoxon test table (step 5) T # of ( tep4) 3 (step 6) n = 8 T = 3 T=3 3 (n=8, p<0.05), then difference between systems A and B is significant. T=3 > 0 (n=8, p<0.01), then we cannot say there is a significant difference. When n > 5 As T follows the below normal distribution roughly, n( n 1) n( n 1)(n 1) N T, T N, 4 4 normalize T as the below and check a standard normal distribution table with the z; see T and in the above equation. T T z T T Wilcoxon Test Table: significance point of T one-tail p < 0.05 p < two-tail p < 0.05 p < 0.01 n =

26 Wilcoxon igned-ranks Test (step 1) (step ) (step 3) (step 4) v (system A) v (system B) difference d rank of d add sign to the ranks rank of fewer # of signs Tip # Tip # Tip # Tips: 1. When d = 0, ignore the.. When there are the same ranks of d, give average ranks. 1 Give the average rank 6.5 = ( )/ Exercise 1: Wilcoxon igned-ranks Test (step 1) (step ) (step 3) (step 4) v (system A) v (system B) difference d rank of d add sign to the ranks rank of fewer # of signs n = (step 5) T (step 6) Wilcoxon test table T = # of ( tep4)

27 Exercise 1: Wilcoxon igned-ranks Test (step 1) (step ) (step 3) (step 4) v (system A) v (system B) difference d rank of d add sign to the ranks rank of fewer # of signs n = 8 (step 6) Wilcoxon test table (step 5) T # of ( tep4) T = As T(=) < 3, there is a significant difference between A and B (p<0.05). But, as 0 < T(=), we cannot say so with the significance level of (p<0.01). Exercise : Wilcoxon igned-ranks Test (step 1) (step ) (step 3) (step 4) v (system A) v (system B) difference d rank of d add sign to the ranks rank of fewer # of signs n = (step 6) Wilcoxon test table (step 5) T # of ( tep4)

28 Exercise : Wilcoxon igned-ranks Test (step 1) (step ) (step 3) (step 4) v (system A) v (system B) difference d rank of d add sign to the ranks rank of fewer # of signs (No need to care the case of d = 0.) n = 8 (no count for d = 0.) (step 6) Wilcoxon test table As T > 3, we cannot say that there is a significant difference between A and B. (step 5) T # of ( tep4) T = 4 Exercise 3: Wilcoxon igned-ranks Test Explain how to apply this test to test whether two groups are significantly different at the below generation? n-th generation

29 Kruskal-Wallis Test groups n groups (n > ) distribution Parametric Test (normality) un un t -test t -test ANOVA (Analysis of Variance) one-way ANOVA two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test sign test Kruskal-Wallis test two-way Wilcoxon signed-ranks test Friedman test Kruskal-Wallis Test 1. Comparison of more than two groups.. Data have no normality. 3. There are no corresponding among groups.??? no normality n-th generation

30 Kruskal-Wallis Test Let's use ranks of Kruskal-Wallis Test N: total # of k: # of groups n i : # of of group i R i : sum of ranks of group i R 1 = 38 R = 69 R 3 = 46 How to Test 1. Rank all.. Calculate N, k, n i and R i. 3. Calculate statistical value H. 1 H N( N 1) i1 R n 3( N 1) 4. If k = 3 and N 17, compare the H with a significant point in a Kruskal-Wallis test table. Otherwise, assume that H follows the χ distribution and test the H using a χ distribution table of (k-1) degrees of freedom k i i

31 Example: Kruskal-Wallis Test N = n 1 +n +n 3 = 17 k = 3 groups (n 1, n, n 3 ) = (6, 5, 6) (R 1, R, R 3 ) = (38, 69, 46) 1 H N( N 1) 1 17(17 = k i1 R n i 38*38 1) 6 i 3( N 1) 69* *46 6 3(17 1) ince significant points of (p<0.05) and (p<0.01) for (n 1, n, n 3 ) = (6, 5, 6) are and 8.14, respectively, there are significant difference(s) somewhere among three groups (p<0.05) significance point of (p<0.05) 8.14 significance point of (p<0.01) Kruskal-Wallis Test Table (for k = 3 and N 17) n 1 n n 3 p < 0.05 p < 0.01 n 1 n n 3 p < 0.05 p < Example: Kruskal-Wallis Test N = n 1 +n +n 3 = 17 k = 3 groups (n 1, n, n 3 ) = (6, 5, 6) (R 1, R, R 3 ) = (38, 69, 46) Q1: Where is significant among A, B, and C? k 1 Ri H A1: Apply multiple 3( Ncomparisons 1) between all pairs N( among N 1) i1columns. ni (Fisher's PLD 3 method, cheffé method, Bonferroni-Dunn test, Dunnett method, 1 38*38 69*69 46*46 Williams Ri method, Tukey method, Nemenyi 3(17 test, 1) Tukey-Kramer method, 17(17 Games/Howell 1) i1 ni 6 method, 5 Duncan's 6 new multiple range test, tudent-newman-keuls method, etc. Each has different characteristics.) = ince significant points of (p<0.05) and (p<0.01) for (n 1, n, n 3 ) = (6, 5, 6) are and 8.14, respectively, there are significant difference(s) somewhere among three groups (p<0.05) significance point of (p<0.05) 8.14 significance point of (p<0.01) Kruskal-Wallis Test Table (for k = 3 and N 17) n 1 n n 3 p < 0.05 p < 0.01 n 1 n n 3 p < 0.05 p <

32 Exercise: Kruskal-Wallis Test R 1 = R = R 3 = N = n 1 +n +n 3 = 13 samples k = 3 groups (n 1, n, n 3 ) = ( 5, 4, 4) (R 1, R, R 3 ) = (4, 44, 3) 1 H N( N 1) = 6.7 k i1 significance point of (p<0.05) R n i i 3( N 1) There is/are significant difference(s) somewhere among three groups (p<0.05) significance point of (p<0.01) Friedman Test groups n groups (n > ) distribution Parametric Test (normality) un un t -test t -test ANOVA (Analysis of Variance) one-way ANOVA two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test sign test one-way Kruskal-Wallis test Wilcoxon signed-ranks test Friedman test

33 Friedman Test When (1) more than two groups, () have correspondence (not independent), but (3) the conditions of two-way ANOVA are not satisfied, Let' use ranks of and Friedman test. (ex.) Comparison of recognition rates. benchmark tasks methods a b c d A B C D methods a b c d Friedman Test tep 1: Make a ranking table. tep : um ranks of the factor that you want to test. benchmark tasks method a b c d A B C D Σ # of methods (k = 4) # of (n = 4) tep 3: Calculate the Friedman test value, χ r. k 1 r Ri 3n( k 1) nk( k 1) i1 where (k, n) are the # of levels of factors 1 and. methods a b c d ranking among methods tep 4: If k =3 or 4, compare χ r with a significant point in a Friedman test table. Otherwise, use a χ table of (k-1) degrees of freedom.

34 Example: Friedman Test tep 1: Make a ranking table. tep : um ranks of the factor that you want to test. benchmark tasks # of methods (k = 4) tep 3: Calculate the Friedman test value, χ r. k 1 r Ri 3n( k 1) nk( k 1) method a b c d A B C D Σ i *4*5 4*4*5 8.1 tep 4: ince significant point for (k,n) = (4,4) is7.80, there is/are significant difference(s) somewhere among four methods, a, b, c, and d (p<0.05) significance point of (p<0.05) significance point of (p<0.01) # of (n = 4) k n p<0.05 p< Friedman test table Example: Friedman Test tep 1: Make a ranking table. tep : um ranks of the factor that you want to test. benchmark method tasks a b c d A B C D Σ Q1: Where is significant among a, b, c, or d? A1: Apply multiple comparisons between all pairs among columns. (Fisher's PLD method, cheffé method, Bonferroni-Dunn test, Dunnett Friedman test table. method, Williams method, # of methods Tukey method, (k = 4) Nemenyi test, Tukey-Kramer tep 3: method, Calculate Games/Howell the Friedman method, test Duncan's value, χnew r. multiple range test, tudent-newman-keuls k 1 method, etc. Each has different characteristics.) r Ri 3n( k 1) nk( k 1) i *4*5 4*4*5 8.1 tep 4: ince significant point for (k,n) = (4,4) is7.80, there is/are significant difference(s) somewhere among four methods, a, b, c, and d (p<0.05) significance point of (p<0.05) significance point of (p<0.01) # of (n = 4) k n p<0.05 p<

35 Multiple Comparisons When there is significant difference among groups, multiple comparison is used to know which group is significantly difference from others. Example 4C = 6 times of pair comparisons with (p < 0.05) 1 - (1-0.05) 6 = significance level 6.5%! Multiple Comparisons When there is significant difference among groups, multiple comparison is used to know which group is significantly difference from others. olution is to apply multiple pair comparisons with more strict significance level. Example 4C = 6 times of pair comparisons with (p < 0.05) 1 - (1-0.05) 6 = significance level 6.5%!

36 Multiple Comparisons -- Bobferroni method -- When pair comparisons are applied m times, let's use a significance level of p / m C = 6 times of pair comparisons with (p < ) 6 Features: (1) imple. () Rather strict, i.e. showing significances is rather hard. Multiple Comparisons -- Holm method -- Corrected Bonferroni method to detect significances easily. Example pair comparisons p-value corrected p-value eqn. corrected p-value vs. vs. vs. vs. vs. vs = p-value* = p-value* = p-value* = p-value* = p-value* = p-value*

37 cheffé's Method of Paired Comparison groups n groups (n > ) distribution normality (parametric) t -test ANOVA (Analysis of Variance) one-way ANOVA two-way ANOVA sign test one-way kruskal-wallis test no normality (non-parametric) Wilcoxon igned-ranks Test two-way Friedman test + cheffé's method of comparison for Human ubjective Tests cheffé's Method of Paired Comparison lighting design of 3-D CG Corridor W K L Verenda Wall room layout planning design B Target ystem Evolutionary Computation room lighting design by optimizing LED assignments subjective evaluations Interactive Evolutionary Computation IEC image enhancement processing Can you hear me??? hearing-aid fitting MEM design measuring mental scale geological simulation

38 cheffé's Method of Paired Comparison ANOVA based on n C comparisons for n objects. even ANOVA even even significance check using a yardstick cheffé's Method of Paired Comparison Original method and three modified methods All subjects must evaluate all pairs. no yes order effect yes no original ( 原法, 195) Haga's variation ( 芳賀の変法 ) Ura's variation ( 浦の変法, 1956) Nakaya's variation ( 中屋の変法, 1970) Order Effect (1) and then () and then may result different evaluation.

39 cheffé's Method of Paired Comparison 1. Ask N human subjects to evaluate t objects in 3, 5 or 7 grades.. Assign [-1, +1], [-, +] or [-3, +3] for these grades. 3. Then, start calculation (see other material). Questionnaire Total row even even even strap for a mobile phone invitation to a dinner Paired comparisons for t=3 objects. O1 O O3 O4 O5 O6 A1 - A A1 - A A - A ix subjects (N = 6) Application Example: What is the best present to be her/his boy/girl friend? [ITUATION] He/he is my longing. I want to be her/his boy/girl friend before we graduate from our university. To get over my one-way love, I decided to present something of about 3,000 JPY and express my heart. I show you 5 C pairs of presents. Please compare each pair and mark your relative evaluation in five levels. tea /coffee stuffed animal fountain pen Ex. Q.

40 I think effective. Results of cheffé's Method of Paired Comparison (Nakaya's variation) What is the best present to be her/his boy/girl friend? less effective (significant difference) present from a male I will catch her heart by dinner more effective less effective present from a female How about tea leave or a stuffed anima? more effective Reality is... less effective I hesitate to accept it as we have not gone about with him more effective less effective Eat! Eat! Eat! more effective cheffé's Method of Paired Comparison Modified methods by Ura and Nakaya Original method and three modified methods All subjects must evaluate all pairs. no yes order effect yes no original ( 原法, 195) Haga's variation ( 芳賀の変法 ) Ura's variation ( 浦の変法, 1956) Nakaya's variation ( 中屋の変法, 1970)

41 cheffé's Method of Paired Comparison Modified method by Ura Pairwise comparisons for objects which are effected by display order (order effect). even even even even even even cheffé's Method of Paired Comparison Modified method by Ura Ask N human subjects to evaluate t C pairs for t objects in 3, 5 or 7 grades and assign [-1, +1], [-, +] or [-3, +3], respectively. even even even even even even

0 1 A - -1 0 1 A - -1 0 1 A 3 - -1 0 1 A 3

subject compares the i-th object with the

42 cheffé's Method of Paired Comparison Modified method by Ura tep 1: Make comparison table of each human subject. even A A A A A A A A 3 A A1 A A3 A4 3 A A 4 A A A A 4 A 4 cheffé's Method of Paired Comparison Modified method by Ura tep 1: Make comparison table of each human subject. ubject O 1 x ijl : evaluation value when the l-th human subject compares the i-th object with the j-th object. ubject O ubject O 3

of object (4) N: # of human subjects (3) A 4 A 3 A A 1-1.1667-0.

150 ˆ 4 ˆ 3 ˆ ˆ1 cheffé's Method of Paired Comparison Modified

( B) T 1 tn 1 t ( x ( x 1 ( x N i ji 1 x Nt( t 1) ( B) i l i i 1

43 cheffé's Method of Paired Comparison Modified method by Ura tep : Make a table summing all subjects' and calculate the average evaluations for all objects. Average of four objects ˆ i 1 tn ( x i x i ) where t: # of object (4) N: # of human subjects (3) A 4 A 3 A A ˆ 4 ˆ 3 ˆ ˆ1 cheffé's Method of Paired Comparison Modified method by Ura tep 3: Make a ANOVA table. ( B) T 1 tn 1 t ( x ( x 1 ( x N i ji 1 x Nt( t 1) ( B) i l i i 1 t( t 1) T l i ji i x x ij ijl il x i ) x x l ( B) ji il ) ) ( B) unbiased variance = /f where, ( B),,, ( B), F = and f degree of freedom. unbiased variance unbiased variance of for F tests., T

44 x x N x x t x x tn i i j ji ij l i il l i B i i i ) ( ) ( 1 ) ( 1 ) ( 1 l i i j ijl T B B T i l B x x t t x t Nt ) ( ) ( ) ( 1) ( 1 1) ( 1 cheffé's Method of Paired Comparison Modified method by Ura ANOVA table A 4 A 3 A A 1 ˆ1 ˆ ˆ3 ˆ 4 There are significant difference among A 1 -A 4 cheffé's Method of Paired Comparison Modified method by Ura ANOVA table.

45 cheffé's Method of Paired Comparison Modified method by Ura tep 4: Apply multiple comparisons. Q1: Where is significant among A 1, A, and A 3? A1: Apply multiple comparisons between all pairs. (Fisher's PLD method, cheffé method, Bonferroni-Dunn test, Dunnett method, Williams method, Tukey method, Nemenyi test, Tukey-Kramer method, Games/Howell method, Duncan's new multiple range test, tudent-newman-keuls method, etc. Each has different characteristics.) cheffé's Method of Paired Comparison Modified method by Ura tep 4: Apply multiple comparisons between all pairs and find which distance is significant. (Fisher's PLD method, cheffé method, Bonferroni-Dunn test, Dunnett method, Williams method, Tukey method, Nemenyi test, Tukey-Kramer method, Games/Howell method, Duncan's new multiple range test, tudent-newman-keuls method, etc. Each has different characteristics.) Example of a simple multiple comparison. Calculate a studentized yardstick When a difference of average > a studentized yardstick, the distance is significant. A 4 A 3 A A ˆ 4 ˆ 3 ˆ ˆ 1

46 cheffé's Method of Paired Comparison Modified method by Ura tep 4: Example of a simple multiple comparisons. Y q ( t, f ) ˆ / tn (studentized yardstick) where ( ˆ, t, N ) are an unbiased variance of ε, the # of objects, and the #of human subjects; q ( t, f ) is a studentized range obtained is a statistical test table for t, the degree of freedom of ε ( f ), and the significant level of φ; see these variables in an ANOVA table. When (t, f) = (4,1), studentized yardsticks for significance levels of 5% and 1% are: (ee q 0.05 (4,1) in the next slide.) tudentized yardstick q ( 0.05 t, f ) f t

47 cheffé's Method of Paired Comparison Modified method by Ura tep 4: Example of a simple multiple comparisons. cheffé's Method of Paired Comparison Modified methods by Ura and Nakaya Original method and three modified methods All subjects must evaluate all pairs. no yes order effect yes no original ( 原法, 195) Haga's variation ( 芳賀の変法 ) Ura's variation ( 浦の変法, 1956) Nakaya's variation ( 中屋の変法, 1970)

48 cheffé's Method of Paired Comparison Modified method by Nakaya Pairwise comparisons for objects that can be compared without order effect. even even even cheffé's Method of Paired Comparison Modified method by Nakaya 1. Ask N human subjects to evaluate t objects in 3, 5 or 7 grades.. Assign [-1, +1], [-, +] or [-3, +3] for these grades, respectively. 3. Then, start calculation (see other material). Questionnaire even even even Paired comparisons for t=3 objects. ix human subjects (N = 6) O 1 O O 3 O 4 O 5 O 6 A1 - A A1 - A A - A

x ijl : evaluation value when the l-th human subject

Nakaya tep : Make a table summing all subjects' and

Average of four objects ˆ i 1 tn xi where t: # of object

49 cheffé's Method of Paired Comparison Modified method by Nakaya tep 1: Make comparison table of each human subject. x ijl : evaluation value when the l-th human subject compares the i-th object with the j-th object. cheffé's Method of Paired Comparison Modified method by Nakaya tep : Make a table summing all subjects' and calculate the average evaluations for all objects. Average of four objects ˆ i 1 tn xi where t: # of object (3) N: # of human subjects (6)

50 tep 3: Make a ANOVA table. 1 xi.. tn ( B) 1 t i 1 x tn i ANOVA table. cheffé's Method of Paired Comparison Modified method by Nakaya l i i.. x i. l F ( B) 1 t T l i x i. l ( B) Unbiased variance Unbariased variance of There are significant difference among A 1 -A 3 cheffé's Method of Paired Comparison Modified method by Nakaya tep 4: Apply multiple comparisons. Q1: Where is significant among A 1, A, and A 3? A1: Apply multiple comparisons between all pairs among columns. (Fisher's PLD method, cheffé method, Bonferroni-Dunn test, Dunnett method, Williams method, Tukey method, Nemenyi test, Tukey-Kramer method, Games/Howell method, Duncan's new multiple range test, tudent-newman-keuls method, etc. Each has different characteristics.) ANOVA table.

51 cheffé's Method of Paired Comparison Modified method by Nakaya tep 4: Apply multiple comparisons between all pairs and find which distance is significant. (Fisher's PLD method, cheffé method, Bonferroni-Dunn test, Dunnett method, Williams method, Tukey method, Nemenyi test, Tukey-Kramer method, Games/Howell method, Duncan's new multiple range test, tudent-newman-keuls method, etc. Each has different characteristics.) Example of a simple multiple comparison. Calculate a studentized yardstick When a difference of average > a studentized yardstick, the distance is significant. cheffé's Method of Paired Comparison Modified method by Nakaya tep 4: Example of a simple multiple comparisons. Y q ( t, f ) ˆ / tn (studentized yardstick) where ( ˆ, t, N ) are an unbiased variance of ε, the # of objects, and the #of human subjects; q ( t, f ) is a studentized range obtained is a statistical test table for t, the degree of freedom of ε ( f ), and the significant level of φ; see these variables in an ANOVA table. Y / (ee q 0.05 (3,5) in the next slide.) Y /

52 tudentized yardstick q ( 0.05 t, f ) f t distribution UMMARY 1. We overview which statistical test we should use for which case. groups n groups (n > ) Parametric Test (normality) un un t -test t -test ANOVA (Analysis of Variance) one-way ANOVA two-way ANOVA Non-parametric Test (no normality) un Mann-Whitney U-test one-way Kruskal-Wallis test sign test two-way Wilcoxon signed-ranks test Friedman test + cheffé's method of comparison for Human ubjective Tests. We can appeal the effectiveness of our experiments with correct use of statistical tests.

Contents. 2 groups n groups (n > 2) (independent) unpaired. paired t -test. one-way ANOVA ANOVA. (related) paired. two-way ANOVA.

Contents. 2 groups n groups (n > 2) (independent) unpaired. paired t -test. one-way ANOVA ANOVA. (related) paired. two-way ANOVA. Statstcal ests for Computatonal Intellgence Research and Human Subjectve ests Sldes are downloadable from http://www.desgn.kyushu-u.ac.jp/~takag Hdeyuk AKAGI Kyushu Unversty, Japan http://www.desgn.kyushu-u.ac.jp/~takag/