Explaining Psychological Statistics (2nd Ed.) by Barry H. Cohen

Chapter 13, Section D

F versus Q: Different Approaches to Controlling Type I Errors with Multiple Comparisons

In Section B of this chapter, I mentioned that when using Tukey's HSD it is unusual to find a pair of means that are significantly different when the overall ANOVA is not significant (p. 375; EPS, 2e). However, it is also possible, though unusual, to find no significant differences with the HSD procedure when the ANOVA is significant. Tukey's HSD is based on a different distribution from the ANOVA, and though it usually yields similar results, there is some room for discrepancies with respect to statistical significance, as I will demonstrate in the following examples.

ANOVA Not Significant, but HSD Finds Significance

Suppose that your experiment consists of five groups of ten participants each, and each group is subjected to a different form of stress (e.g., speeded arithmetic; mild electric shock). For each group the average heart rate of the participants is calculated, yielding the following set of means (in beats per minute): 65, 71, 72, 73, 79. The MS_bet for these means is just 10 times their unbiased variance, so MS_bet equals 250. For simplicity, let's say that MS_W happens to be exactly 100, so that F for the ANOVA equals 2.5. Given that the critical value for a .05 test is about 2.58, the ANOVA has fallen just short of conventional significance, and therefore follow-up t tests using the LSD formula would not be justified. However, the Type I error protection of Tukey's HSD does not depend on the significance of the ANOVA, so we can go right ahead and calculate HSD for this example. The appropriate q for this example is about 4.02; when multiplied by the square root of MS_W over n, we obtain 12.7 as our value for HSD.
The difference between the two extreme means (79 - 65 = 14) is larger than HSD, so use of Tukey's procedure can legitimately declare those two means to differ significantly at the .05 level, even though the omnibus ANOVA is not significant at that level. As I said, this is an unusual but not an impossible combination of events. Note that the ANOVA was indeed close to being significant; if, for instance, MS_W were as large as 125, making F as small as 2.0, HSD would become slightly larger than 14, the largest difference of the means. Next, using a similar example, I will show how the ANOVA can be significant without HSD finding any pair of means that differs significantly.

ANOVA Is Significant, but HSD Does Not Find Significance

The pattern of the five means in the preceding example is the kind that favors Tukey's HSD over the ANOVA: two of the means are relatively far apart while the rest of the means are clustered together in the middle. That clustering tends to reduce the variance of the means and therefore the ANOVA's F. The following set of means follows the opposite pattern: 65, 66, 72, 78, 79; note that the largest difference of means is still 14, but the unbiased variance of the means is now 42.5, so MS_bet equals 425. If MS_W is still only 100, F will increase to an easily significant 4.25. In fact, given the increase in MS_bet, MS_W can increase to 125, and the ANOVA would still be significant (425 / 125 = 3.4 > 2.58). However, increasing MS_W would also produce an increase in HSD, even though there has been no change in the largest difference of means. As mentioned above, with MS_W at 125, HSD would come out larger than the difference of 65 and 79 (HSD = 14.21), so despite the significance of the ANOVA, HSD would no longer indicate any significantly different pairs of means. As you can see, the pattern that favors the ANOVA over Tukey's HSD is one in which the means are not clustered centrally, but tend instead to be clustered at the two extremes.
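Both scenarios can be checked with a short script. This is just a sketch of the arithmetic in the two examples; the q value of 4.02 (the Studentized range statistic for k = 5 groups at roughly these degrees of freedom) is taken from the text rather than computed.

```python
from math import sqrt
from statistics import variance

def anova_vs_hsd(means, n, ms_w, q):
    """Compare the omnibus F with Tukey's HSD for k equal-n groups."""
    ms_bet = n * variance(means)      # MS_between = n times the unbiased variance of the means
    f = ms_bet / ms_w                 # omnibus F ratio
    hsd = q * sqrt(ms_w / n)          # Tukey's honestly significant difference
    return ms_bet, f, hsd

# Centrally clustered means: F = 2.5 falls short of 2.58, yet HSD = 12.7 < 14 (the extreme gap)
print(anova_vs_hsd([65, 71, 72, 73, 79], n=10, ms_w=100, q=4.02))
# Extremes-clustered means with MS_W = 125: F = 3.4 is significant, yet HSD = 14.2 > 14
print(anova_vs_hsd([65, 66, 72, 78, 79], n=10, ms_w=125, q=4.02))
```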
When dealing with only three groups, perform LSD-type follow-up tests, but only if your ANOVA is significant. With more than three groups, you are allowed to use Tukey's HSD without looking at the ANOVA, but tradition strongly dictates that you report the results of the ANOVA before reporting any conclusions from a post hoc test like HSD. Whereas it is unlikely to find significance with HSD but not the ANOVA, it is even less likely to find such results published in the literature.

Fisher-Hayter Test

Tukey's HSD is easy to use and understand, but it is more conservative than necessary. Computer simulations have shown that under typical data-analytic conditions this procedure tends to keep experimentwise alpha between about .02 and .03 when you think you are setting the overall alpha at .05. Hayter (1986) devised a modification of HSD that employs the two-step process originated by Fisher (so it is often called the Fisher-Hayter or the modified LSD test) to squeeze more power out of HSD without allowing alpha_EW to rise above .05. The first step of the Fisher-Hayter test is to evaluate the significance of the one-way ANOVA. If (and only if) the ANOVA is significant, you can proceed to the second step, which involves the calculation of HSD, but with an important twist: the critical value for HSD (i.e., q) is determined by setting the number of groups to k - 1, rather than k. The Fisher-Hayter (F-H) test is easily illustrated in terms of the immediately preceding example. Because the ANOVA was significant, we can proceed to calculate HSD as though the number of groups were four instead of five. In this case, q is only about 3.77, instead of 4.02, and HSD comes out to 13.3. Now the largest difference of means is larger than HSD, so we can identify a significantly different pair of means to follow up on our significant ANOVA. Note that the Fisher-Hayter test comes in two varieties: the F version, just described, and a Q version, whose first step is testing the largest difference of means with HSD based on k groups, in order to decide whether to proceed.
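The F version of the two-step procedure can be sketched as follows. The function name is mine, and the critical values (F crit of 2.58, and q of 3.77 for k - 1 = 4 groups) are taken from the text rather than looked up by the code.

```python
from math import sqrt

def fisher_hayter(means, n, ms_w, f_crit, q_km1):
    """F version of the Fisher-Hayter test: gate on the ANOVA, then apply
    HSD with q based on k - 1 groups. Returns None if the ANOVA is not significant."""
    k = len(means)
    grand = sum(means) / k
    ms_bet = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    if ms_bet / ms_w <= f_crit:
        return None                           # step 1 fails: no pairwise tests allowed
    hsd = q_km1 * sqrt(ms_w / n)              # step 2: critical difference for k - 1 groups
    return [(a, b) for i, a in enumerate(means)
            for b in means[i + 1:] if abs(a - b) > hsd]

# Second example: F = 3.4 > 2.58, HSD = 3.77 * sqrt(12.5) = 13.3, so 65 vs. 79 is significant
print(fisher_hayter([65, 66, 72, 78, 79], n=10, ms_w=125, f_crit=2.58, q_km1=3.77))
# First example: F = 2.5 < 2.58, so the test stops at step 1
print(fisher_hayter([65, 71, 72, 73, 79], n=10, ms_w=100, f_crit=2.58, q_km1=3.77))
```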
For the preceding example, the F-H Q test would not have gotten through the first step. Similarly, in the first example (ANOVA not significant), the F-H F test could not proceed, but F-H Q would have gone on to the second step. HSD comes out to 11.92 for that second step, but beyond the largest difference of means, F-H Q would not find any additional significant pairs of means.

Simultaneous versus Sequential Tests

Tukey's HSD is a good example of a simultaneous post hoc comparison test. That is, the value of HSD is determined at the outset and is used for all pairwise comparisons; there is no part of the test that changes based on the results of another part. Simultaneous tests have an advantage in terms of the ease with which confidence intervals (CIs) can be found, but sequential tests often have greater power without sacrificing Type I error control. Fisher's protected (or LSD) test is the simplest example of a sequential test. Depending on the results of the first step (testing the ANOVA), there may or may not be a second step in the sequence. If the HSD you have calculated is based on a q value for the .05 level, it is very easy to find the 95% CI for any difference of population means; it is just the difference of the corresponding sample means plus or minus HSD. If, for instance, the average heart rate is 69 for the physically stressed group and 73 for the mentally stressed group (in the five-group experiment alluded to above), your point estimate for the difference of the two conditions, if applied to the entire population, would be 73 - 69, which equals 4. If your .05 HSD turned out to be 4.5 for this experiment, your 95% CI for the physical/mental stress difference would range from -.5 to 8.5. That zero falls within the 95% CI tells you that this difference is not significant at the .05 level. Of course, you could also tell that from the fact that the sample difference of 4 is less than HSD (4.5).
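The CI arithmetic just described is a one-liner; here it is made explicit, with the function name purely illustrative:

```python
def hsd_confidence_interval(mean1, mean2, hsd):
    """95% CI for a difference of two population means, based on a .05-level HSD:
    the sample difference plus or minus HSD."""
    diff = mean1 - mean2
    return diff - hsd, diff + hsd

# Mental (73) vs. physical (69) stress means with HSD = 4.5: CI runs from -0.5 to 8.5.
# Zero falls inside the interval, so the difference is not significant at the .05 level.
print(hsd_confidence_interval(73, 69, hsd=4.5))
```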
One problem with basing CIs on LSD, instead of HSD, even for only three groups, is that you cannot begin by calculating CIs and then determine significance by noting whether zero is contained within a particular CI or not. Zero may be outside the range of a particular CI, but if the ANOVA is not significant, the population difference being estimated cannot be declared significant either. Although they do not lend themselves to the calculation of CIs, the increased power advantages of sequential tests are hard to ignore. The F-H test described above is a good example of a powerful sequential test that maintains adequate control over experimentwise error. You may notice that other multiple comparison tests are available in F and Q versions (e.g., the REGW test). This tells you that the test is a sequential one, and that the first step requires the significance of either the ordinary ANOVA (F version) or the largest pairwise difference according to the usual HSD test (Q version) in order to proceed with further testing.

Sharper Bonferroni Tests

The ordinary Bonferroni adjustment, as applied to all possible pairs of means following an ANOVA, is, as I mentioned in Section A, much too conservative for routine use. For instance, there are ten possible pairs of means that could be tested in the five-group stress example, so an alpha of .05 / 10 = .005 would have to be used for each test. By way of contrast, q was 4.02, which is equivalent to a critical t of 4.02 / sqrt(2) = 2.84, which would correspond to using an alpha of .0068, which is better than .005 but still on the conservative side. The second stage of the F-H test involving five groups of ten participants involves a q of 3.77, and therefore an alpha per comparison of about .011 (if you get that far).
Of course, if you can plan to test only five of the ten possible pairwise differences, you can do even better (alpha per comparison = .01), but there are ways to make the Bonferroni adjustment somewhat less conservative, even without planning any comparisons, as I will show next.

Sidak's Test

The Bonferroni test is based on an inequality stating that the overall (i.e., experimentwise) alpha will be less than or equal to the number of tests conducted (j) times the alpha used for each comparison (alpha_pc). So, in the case of a seven-group experiment, in which all of the 21 possible pairwise comparisons are to be tested (each at the .05 level), Bonferroni tells us that alpha_EW will be less than or equal to j * alpha_pc = 21 * .05 = 1.05. Of course, we already knew that just by the way we define probability. As j gets larger, the Bonferroni inequality becomes progressively less informative. However, if we can assume that all of the tests we are conducting are mutually independent, we can use a sharper (i.e., more accurate) inequality, based on Formula 13.2 [alpha_EW = 1 - (1 - alpha_pc)^j]. Solving that formula for alpha_pc, we obtain an adjustment that is somewhat less severe than the Bonferroni adjustment (as expressed in Formula 13.8):

    alpha_pc = 1 - (1 - alpha_EW)^(1/j)        Formula 13.16

When you are performing all possible pairwise comparisons among your samples, your tests are not all mutually independent, but Sidak (1967) showed that even with this lack of independence the use of Formula 13.16 keeps alpha_EW below the level chosen, while providing more power than the traditional Bonferroni correction. Therefore, the use of the preceding formula is often referred to as Sidak's test. [Note: Formula 13.16 can be derived from the unnumbered formula Sidak (1967) presents on page 629, just before Section 4 of his article.] A couple of examples (based on alpha_EW = .05) will help to illustrate the difference between the Bonferroni and Sidak adjustments.
First, let us consider the three-group case, in which there are only three different pairings that can be tested. According to the Sidak test, the alpha that should be used for each of the three tests is: alpha_pc = 1 - (1 - alpha_EW)^(1/3) = 1 - (.95)^(1/3). The fractional exponent means that rather than cubing .95, we are to take the cube root of .95, which is about .98305, so alpha_pc = 1 - .98305 = .01695. This is only slightly larger than the Bonferroni alpha_pc (.05 / 3), which rounds off to .01667. There isn't much difference between the two adjustments, but the Sidak alpha is always the larger of the two. For another example, consider the seven-group experiment. There are 21 possible pairwise comparisons, so Sidak's adjusted alpha for each comparison comes to: 1 - (.95)^(1/21) = 1 - .99756 = .00244, which is a bit larger than .05 / 21, which equals .002381. Because Sidak's test does not make a large difference, and is computationally more complex than the Bonferroni test, it was not popular before readily available statistical packages (e.g., SPSS) started to include it. Various tests that incorporate the Bonferroni adjustment can be made a bit more powerful by using the Sidak adjustment instead. Another way to add power to the Bonferroni adjustment is to turn it into a sequential test, as shown next.

Sequential Bonferroni Tests

A Step-Down Bonferroni Test: Holm (1979) demonstrated that you could add power to your Bonferroni test with the following sequentially rejective (step-down) procedure. The first step is to check whether any of your comparisons are significant with the ordinary Bonferroni adjustment; if not even one is significant by the unmodified Bonferroni criterion, testing stops, and no significant comparisons are found. If you are testing all possible pairs of five means, as in the previous example, you would determine the p values for each of the ten pairwise comparisons, and then see if your smallest p value is less than .005.
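Both per-comparison adjustments are one-liners, so the two worked examples above are easy to reproduce (the function names are mine):

```python
def bonferroni_alpha(alpha_ew, j):
    """Ordinary Bonferroni per-comparison alpha: divide by the number of tests."""
    return alpha_ew / j

def sidak_alpha(alpha_ew, j):
    """Sidak per-comparison alpha (Formula 13.16): 1 - (1 - alpha_EW)^(1/j)."""
    return 1 - (1 - alpha_ew) ** (1 / j)

for j in (3, 21):   # three-group and seven-group (21-pair) examples from the text
    print(j, round(bonferroni_alpha(.05, j), 5), round(sidak_alpha(.05, j), 5))
```

As the output shows, the Sidak alpha is always slightly larger than the Bonferroni alpha, and hence slightly more powerful.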
If it is, you declare it to be significant at the .05 level, and then compare the next smallest p with .05 / 9 = .0056. In terms of a general formula, to be significant, a comparison must have p_i < alpha_EW / (m - i + 1), where p_i goes from p_1 (the smallest p) to p_m (the largest p), and m is the total number of comparisons in the set. However, you must start with the smallest p, go up step by step (following the proper sequence), and stop as soon as you hit a p that is not significant according to the preceding formula. For example, suppose that the five-group stress study had large enough samples so that the p values for the ten possible pairwise comparisons are as follows (ordered from smallest to largest): .002, .0054, .007, .008, .009, .0094, .012, .015, .028, .067. Because the smallest p is less than .005, it is, according to Holm's procedure, significant. The second smallest, .0054, does not have to be smaller than .005; it only has to be less than .0056, and it is. Next, .007 is compared to .05 / 8 = .00625; because p_3 is not less than this value, it is not declared significant, and testing stops at this point. It does not matter that p_6 (.0094) happens to be less than .05 / 5 = .01; once you hit a nonsignificant result in the sequence, you cannot test any larger p values without ruining the Type I error control. You can see the extra power in this approach in that the p of .0054 would not have been significant according to the ordinary Bonferroni correction, but it is significant according to Holm's step-down test. It is worth noting that more than one modification of the Bonferroni procedure can be applied within the same test. For instance, Holland and Copenhaver (1988) showed that the power of Holm's test can be improved slightly by basing it on Sidak's (sharper) inequality, rather than the usual Bonferroni adjustment.
However, Olejnik, Li, Supattathum, and Huberty (1997) pointed out that Holland and Copenhaver's test requires an assumption that makes it somewhat less generally applicable than Holm's test, which makes no such assumption.
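Holm's step-down sequence can be sketched in a few lines of Python; applied to the ten p values above, it stops exactly where the walk-through says it should.

```python
def holm_stepdown(pvals, alpha_ew=.05):
    """Holm's sequentially rejective (step-down) test: compare the i-th smallest
    p value to alpha_EW / (m - i + 1), stopping at the first failure."""
    m = len(pvals)
    significant = []
    for i, p in enumerate(sorted(pvals), start=1):
        if p < alpha_ew / (m - i + 1):
            significant.append(p)
        else:
            break                  # a nonsignificant p halts all further testing
    return significant

ps = [.002, .0054, .007, .008, .009, .0094, .012, .015, .028, .067]
print(holm_stepdown(ps))           # only .002 and .0054 survive; .007 fails against .00625
```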
A Step-Up Bonferroni Test: More recently, Hochberg (1988) demonstrated that a step-up procedure could have even more power than Holm's step-down test. His step-up test begins with a test of the largest p value in the set. If this p is less than .05, then all of the comparisons are declared significant (e.g., if all ten of the p's are between .02 and .04, they are all declared significant at the .05 level, even though none of them would be significant with the usual Bonferroni adjustment). You use the same formula as in Holm's test, but you apply it in the reverse order. To illustrate, I will again use the set of ten p values above. First, p_m, the largest p value, is compared to alpha_EW / (m - i + 1) = .05 / (m - m + 1) = .05 / 1 = .05. Because .067 is not less than .05, this comparison is not declared significant, but testing does not stop. The second largest p is then compared to .05 / (10 - 9 + 1) = .05 / 2 = .025. Because .028 is not less than .025, this result is not significant either, and the procedure continues. The next p, p_8 (the third largest), is compared to .05 / 3 = .0167; it (.015) is less than this value, so this comparison is significant, and therefore testing stops. All of the p values smaller than p_8 are automatically declared significant without further testing. It is easy to see that Holm's test is not as powerful as Hochberg's test. The former test found only the p's of .002 and .0054 to be significant, whereas the latter found all of the p values from .007 through .015 to be significant as well. Of course, I deliberately devised an example to emphasize the difference between these two procedures; in real-life cases, the conclusions of the two methods will rarely differ. Considering that Hochberg's procedure rests on the assumption that the tests are independent of one another (Olejnik, Li, Supattathum, & Huberty, 1997), and Holm's test does not, it seems that the conservative way to apply the Bonferroni test is by the use of either Holm's procedure or Sidak's adjustment (Formula 13.16).
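Hochberg's step-up scan is the mirror image of Holm's procedure; a sketch over the same ten p values shows the extra rejections it yields in this deliberately constructed example.

```python
def hochberg_stepup(pvals, alpha_ew=.05):
    """Hochberg's step-up test: scan from the largest p downward using the Holm
    criterion alpha_EW / (m - i + 1); the first p that passes makes itself and
    every smaller p significant."""
    m = len(pvals)
    ordered = sorted(pvals)
    for i in range(m, 0, -1):          # i indexes the i-th smallest p
        if ordered[i - 1] < alpha_ew / (m - i + 1):
            return ordered[:i]         # this p and all smaller ones are significant
    return []

ps = [.002, .0054, .007, .008, .009, .0094, .012, .015, .028, .067]
print(hochberg_stepup(ps))             # .067 and .028 fail, .015 < .0167 passes,
                                       # so the eight smallest p values are significant
```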
Of course, you can still use the simpler, unmodified Bonferroni alpha correction, but if you are analyzing your data with statistical software, there is no excuse for throwing away even a small increase in power.

Adjusted p Values

When you request a Bonferroni test from SPSS under post hoc comparisons, what you get for each pair of means is a p value that has been adjusted so that it can be compared directly to .05, assuming that that is your desired experimentwise alpha. For instance, for a three-group experiment, a pairwise comparison (i.e., a t test) that yields a p value of .016 would be considered significant at the .05 level, because .016 < (.05 / 3). Instead of giving you the actual two-tailed p value, SPSS adjusts the p value by multiplying it by 3, in this case, and gives you a Bonferroni p of .048 (.016 * 3), which you can see immediately is just under .05, and therefore significant by the Bonferroni test. Quite simply, SPSS adjusts the actual p value by applying the Bonferroni correction backwards. In the general case, without SPSS, you would divide .05 by the total number of possible pairwise comparisons (if conservative enough to use Bonferroni as a post hoc test), and then compare each of your p values to that shrunken value. SPSS performs the opposite operation, and multiplies each of your actual p values by the total number of possible pairs, so each can be compared to .05 (or whatever value you want alpha_EW to be less than). To express the above operation as a formula, solve Formula 13.8 for alpha_EW, and then change the term alpha_EW to the Bonferroni-adjusted p, and change alpha_pc in that formula to your actual p value, like this:

    Bonferroni p = j * p
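In code, both adjusted-p operations amount to one line each. This is a sketch of the arithmetic, not SPSS's internal code; the cap at 1.0 in the Bonferroni version is my addition, since a probability cannot exceed 1.

```python
def bonferroni_adjusted_p(p, j):
    """Bonferroni-adjusted p: multiply the actual p by the number of comparisons.
    Capped at 1.0 (a hedged addition; an adjusted p cannot exceed 1)."""
    return min(1.0, j * p)

def sidak_adjusted_p(p, j):
    """Sidak-adjusted p: 1 - (1 - p)^j."""
    return 1 - (1 - p) ** j

# Three-group example from the text: an actual p of .016 with j = 3 comparisons
print(round(bonferroni_adjusted_p(.016, 3), 3))   # .048
print(round(sidak_adjusted_p(.016, 3), 3))        # .047
```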
Of course, you don't need to see a formula in this simple case, but the formula approach makes it easier to understand more complex adjustments, such as the one SPSS refers to as the Sidak adjustment. To obtain the formula for the Sidak-adjusted p, you have to solve Formula 13.16 above for alpha_EW, but that just brings you back to Formula 13.2. Changing alpha_EW in Formula 13.2 to the Sidak p, and plain alpha to your actual p value, you get the formula that SPSS uses to turn your p values into Sidak-adjusted p values:

    Sidak p = 1 - (1 - p)^j        Formula 13.17

where j is the total number of possible comparisons. In the three-group Bonferroni example above, a p value of .016 was adjusted to .048. The corresponding Sidak p is 1 - (.984)^3 = 1 - .953 = .047. Not a big improvement, but if the computer is doing the work, why not get all the power you can (without sacrificing Type I error control)?

References

Hayter, A. J. (1986). The maximum familywise error rate of Fisher's least significant difference test. Journal of the American Statistical Association, 81, 1000-1004.

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800-802.

Holland, B. S., & Copenhaver, M. D. (1988). Improved Bonferroni-type multiple testing procedures. Psychological Bulletin, 104, 145-149.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65-70.

Olejnik, S., Li, J., Supattathum, S., & Huberty, C. J. (1997). Multiple testing and statistical power with modified Bonferroni procedures. Journal of Educational and Behavioral Statistics, 22, 389-406.

Sidak, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62, 626-633.