Chapter 13 Section D. F versus Q: Different Approaches to Controlling Type I Errors with Multiple Comparisons


Explaining Psychological Statistics (2nd ed.), by Barry H. Cohen

In Section B of this chapter, I mentioned that when using Tukey's HSD it is unusual to find a pair of means that are significantly different when the overall ANOVA is not significant (p. 375; EPS, 2e). However, it is also possible, though unusual, to find no significant differences with the HSD procedure when the ANOVA is significant. Tukey's HSD is based on a different distribution from the ANOVA, and though the two usually yield similar results, there is some room for discrepancies with respect to statistical significance, as I will demonstrate in the following examples.

ANOVA Not Significant, but HSD Finds Significance

Suppose that your experiment consists of five groups of ten participants each, and each group is subjected to a different form of stress (e.g., speeded arithmetic; mild electric shock). For each group the average heart rate of the participants is calculated, yielding the following set of means (in beats per minute): 65, 71, 72, 73, 79. MS_bet is just 10 (the group size) times the unbiased variance of these means, which is 25, so MS_bet equals 250. For simplicity, let's say that MS_W happens to be exactly 100, so that F for the ANOVA equals 2.5. Given that the critical value for a .05 test (with 4 and 45 degrees of freedom) is about 2.58, the ANOVA has fallen just short of conventional significance, and therefore follow-up t tests using the LSD formula would not be justified. However, the Type I error protection of Tukey's HSD does not depend on the significance of the ANOVA, so we can go right ahead and calculate HSD for this example. The appropriate q (for five groups and 45 degrees of freedom) is about 4.02; multiplied by the square root of MS_W over n, it yields 12.7 as our value for HSD. The difference between the two extreme means (79 - 65 = 14) is larger than HSD, so Tukey's procedure can legitimately declare those two means to differ significantly at the .05 level, even though the omnibus ANOVA is not significant at that level. As I said, this is an unusual but not an impossible combination of events. Note that the ANOVA was indeed close to being significant; if, for instance, MS_W were as large as 125, making F as small as 2.0, HSD would become slightly larger than 14, the largest difference among the means.
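
To make the arithmetic concrete, here is a minimal sketch of both computations in Python. It assumes scipy 1.7 or later for the studentized_range distribution, and the helper name tukey_hsd_width is mine, not from the text.

```python
import numpy as np
from scipy.stats import f, studentized_range

def tukey_hsd_width(k, df_w, ms_w, n, alpha=0.05):
    """Tukey's HSD: critical q times sqrt(MS_W / n)."""
    q_crit = studentized_range.ppf(1 - alpha, k, df_w)
    return q_crit * np.sqrt(ms_w / n)

means = np.array([65, 71, 72, 73, 79])   # group means, in bpm
n, k = 10, len(means)                    # ten participants per group
df_bet, df_w = k - 1, k * (n - 1)        # 4 and 45
ms_bet = n * means.var(ddof=1)           # 10 * 25 = 250
ms_w = 100.0                             # assumed within-group MS

F = ms_bet / ms_w                        # 2.5
F_crit = f.ppf(0.95, df_bet, df_w)       # about 2.58: ANOVA falls short

hsd = tukey_hsd_width(k, df_w, ms_w, n)  # about 4.02 * sqrt(10) = 12.7
max_diff = means.max() - means.min()     # 14 > 12.7: HSD finds significance
print(F, F_crit, hsd, max_diff)
```
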
Next, using a similar example, I will show how the ANOVA can be significant without HSD finding any pair of means that differs significantly.

ANOVA Is Significant, but HSD Does Not Find Significance

The pattern of the five means in the preceding example is the kind that favors Tukey's HSD over the ANOVA: two of the means are relatively far apart while the rest are clustered together in the middle. That clustering tends to reduce the variance of the means and therefore the ANOVA F. The following set of means follows the opposite pattern: 65, 66, 72, 78, 79. Note that the largest difference of means is still 14, but the variance of the means is now 42.5, so MS_bet equals 425. If MS_W is still only 100, F will increase to an easily significant 4.25. In fact, given the increase in MS_bet, MS_W can increase to 125 and the ANOVA will still be significant (425 / 125 = 3.4 > 2.58). However, increasing MS_W also produces an increase in HSD, even though there has been no change in the largest difference of means. With MS_W at 125, HSD comes out larger than the difference of 65 and 79 (HSD = 14.21), so despite the significance of the ANOVA, HSD would no longer indicate any significantly different pairs of means. As you can see, the pattern that favors the ANOVA over Tukey's HSD is one in which the means are not clustered centrally, but tend instead to be clustered at the two extremes. When dealing with only three groups, perform LSD-type follow-up tests, but only if your ANOVA is significant. With more than three groups, you are allowed to use Tukey's HSD without looking at the ANOVA, but tradition strongly dictates that you report the results of the ANOVA before reporting any conclusions from a post hoc test like HSD. Whereas it is unlikely to find significance with HSD but not the ANOVA, it is even less likely to find such results published in the literature.
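
Under the same assumptions, the companion computation for the second set of means shows the reversal:

```python
import numpy as np
from scipy.stats import studentized_range

means2 = np.array([65, 66, 72, 78, 79])
n, k, df_w = 10, 5, 45
ms_bet2 = n * means2.var(ddof=1)              # 10 * 42.5 = 425
ms_w2 = 125.0
F2 = ms_bet2 / ms_w2                          # 3.4 > 2.58: ANOVA significant
q_crit = studentized_range.ppf(0.95, k, df_w)
hsd2 = q_crit * np.sqrt(ms_w2 / n)            # about 4.02 * sqrt(12.5) = 14.2
print(F2, hsd2, means2.max() - means2.min())  # 14 < 14.2: no pair significant
```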

Fisher-Hayter Test

Tukey's HSD is easy to use and understand, but it is more conservative than necessary. Computer simulations have shown that under typical data-analytic conditions the procedure tends to keep experimentwise alpha between about .02 and .03 when you think you are setting the overall alpha at .05. Hayter (1986) devised a modification of HSD that employs the two-step process originated by Fisher (so it is often called the Fisher-Hayter, or modified LSD, test) to squeeze more power out of HSD without allowing α_EW to rise above .05. The first step of the Fisher-Hayter test is to evaluate the significance of the one-way ANOVA. If (and only if) the ANOVA is significant, you can proceed to the second step, which involves the calculation of HSD, but with an important twist: the critical value for HSD (i.e., q) is determined by setting the number of groups to k - 1, rather than k. The Fisher-Hayter (F-H) test is easily illustrated in terms of the immediately preceding example. Because the ANOVA was significant, we can proceed to calculate HSD as though the number of groups were four instead of five. In this case, q is only about 3.77, instead of 4.02, and HSD comes out to 13.3. Now the largest difference of means is larger than HSD, so we can identify a significantly different pair of means to follow up on our significant ANOVA. Note that the Fisher-Hayter test comes in two varieties: the F version, just described, and a Q version, whose first step is testing the largest difference of means with HSD based on k groups, in order to decide whether to proceed. For the preceding example, the F-H Q test would not have gotten through the first step. Similarly, in the first example (ANOVA not significant), the F-H F test could not proceed, but F-H Q would have gone on to the second step. HSD comes out to 11.92 for that second step, but beyond the largest difference of means, F-H Q would not find any additional significant pairs of means.
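
A minimal sketch of the F version under the same equal-n assumptions (the function name fisher_hayter_f is mine, not from the text):

```python
import numpy as np
from scipy.stats import f, studentized_range

def fisher_hayter_f(means, n, ms_w, alpha=0.05):
    """F version of the Fisher-Hayter test for k equal-n groups.
    Step 1: the omnibus ANOVA must be significant.
    Step 2: Tukey-style HSD, but with q based on k - 1 groups."""
    means = np.asarray(means, dtype=float)
    k = len(means)
    df_bet, df_w = k - 1, k * (n - 1)
    F = n * means.var(ddof=1) / ms_w
    if F <= f.ppf(1 - alpha, df_bet, df_w):
        return []                            # stop: ANOVA not significant
    q_crit = studentized_range.ppf(1 - alpha, k - 1, df_w)  # k - 1, not k
    hsd = q_crit * np.sqrt(ms_w / n)
    return [(i, j) for i in range(k) for j in range(i + 1, k)
            if abs(means[i] - means[j]) > hsd]

# Second example: q(.05, 4, 45) is about 3.77, so HSD is about 13.3 < 14,
# and only the extreme pair (65, 79) is declared significant.
print(fisher_hayter_f([65, 66, 72, 78, 79], n=10, ms_w=125))  # [(0, 4)]
```
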
Simultaneous versus Sequential Tests

Tukey's HSD is a good example of a simultaneous post hoc comparison test. That is, the value of HSD is determined at the outset and is used for all pairwise comparisons; no part of the test changes based on the results of another part. Simultaneous tests have an advantage in terms of the ease with which confidence intervals (CIs) can be found, but sequential tests often have greater power without sacrificing Type I error control. Fisher's protected (or LSD) test is the simplest example of a sequential test: depending on the results of the first step (testing the ANOVA), there may or may not be a second step in the sequence. If the HSD you have calculated is based on a q value for the .05 level, it is very easy to find the 95% CI for any difference of population means: it is just the difference of the corresponding sample means plus or minus HSD. If, for instance, the average heart rate is 69 for the physically stressed group and 73 for the mentally stressed group (in the five-group experiment alluded to above), your point estimate for the difference of the two conditions in the entire population would be 73 - 69, which equals 4. If your .05 HSD turned out to be 4.5 for this experiment, your 95% CI for the physical/mental stress difference would range from -0.5 to 8.5. The fact that zero falls within the 95% CI tells you that this difference is not significant at the .05 level. Of course, you could also tell that from the fact that the sample difference of 4 is less than HSD (4.5).
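
In code, a simultaneous Tukey CI is just the pairwise difference plus or minus the single HSD value; a tiny sketch with the numbers above:

```python
# 95% Tukey CI for a pairwise difference: (M_i - M_j) +/- HSD
m_physical, m_mental, hsd = 69.0, 73.0, 4.5  # values from the example
diff = m_mental - m_physical                 # point estimate: 4
ci = (diff - hsd, diff + hsd)                # (-0.5, 8.5) contains zero,
print(ci)                                    # so not significant at .05
```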

One problem with basing CIs on LSD rather than HSD, even for only three groups, is that you cannot begin by calculating CIs and then determine significance by noting whether zero is contained within a particular CI. Zero may be outside the range of a particular CI, but if the ANOVA is not significant, the population difference being estimated cannot be declared significant either. Although they do not lend themselves to the calculation of CIs, the increased power of sequential tests is hard to ignore. The F-H test described above is a good example of a powerful sequential test that maintains adequate control over experimentwise error. You may notice that other multiple comparison tests are available in F and Q versions (e.g., the REGW test). This tells you that the test is a sequential one, and that the first step requires the significance of either the ordinary ANOVA (the F version) or the largest pairwise difference according to the usual HSD test (the Q version) in order to proceed with further testing.

Sharper Bonferroni Tests

The ordinary Bonferroni adjustment, as applied to all possible pairs of means following an ANOVA, is, as I mentioned in Section A, much too conservative for routine use. For instance, there are ten possible pairs of means that could be tested in the five-group stress example, so an alpha of .05 / 10 = .005 would have to be used for each test. By way of contrast, q was 4.02, which is equivalent to a critical t of 4.02 / sqrt(2) = 2.84, which in turn corresponds to using an alpha of about .0068: better than .005, but still on the conservative side. The second stage of the F-H test involving five groups of ten participants uses a q of 3.77, and therefore an alpha per comparison of about .011 (if you get that far). Of course, if you can plan to test only five of the ten possible pairwise differences, you can do even better (alpha per comparison = .01), but there are ways to make the Bonferroni adjustment somewhat less conservative even without planning any comparisons, as I will show next.

Sidak's Test

The Bonferroni test is based on an inequality stating that the overall (i.e., experimentwise) alpha will be less than or equal to the number of tests conducted (j) times the alpha used for each comparison (α_pc). So, in the case of a seven-group experiment in which all 21 possible pairwise comparisons are to be tested (each at the .05 level), Bonferroni tells us that α_EW will be less than or equal to j × α_pc = 21 × .05 = 1.05. Of course, we already knew that just from the way we define probability. As j gets larger, the Bonferroni inequality becomes progressively less informative. However, if we can assume that all of the tests we are conducting are mutually independent, we can use a sharper (i.e., more accurate) inequality, based on Formula 13.2. Solving that formula for alpha, we obtain an adjustment for α_pc that is somewhat less severe than the Bonferroni adjustment (as expressed in Formula 13.8):

α_pc = 1 - (1 - α_EW)^(1/j)    (Formula 13.16)

When you are performing all possible pairwise comparisons among your samples, your tests are not all mutually independent, but Sidak (1967) showed that even with this lack of independence the use of Formula 13.16 keeps α_EW below the level chosen, while providing more power than the traditional Bonferroni correction. Therefore, the use of the preceding formula is often referred to as Sidak's test. [Note: Formula 13.16 can be derived from the unnumbered formula Sidak (1967) presents on page 629, just before Section 4 of his article.]
A couple of examples (based on α_EW = .05) will help to illustrate the difference between the Bonferroni and Sidak adjustments.

First, let us consider the three-group case, in which there are only three different pairings that can be tested. According to the Sidak test, the alpha that should be used for each of the three tests is: α_pc = 1 - (1 - α_EW)^(1/3) = 1 - (.95)^(1/3). The fractional exponent means that rather than cubing .95, we are to take the cube root of .95, which is about .98305, so α_pc = 1 - .98305 = .01695. This is only slightly larger than the Bonferroni α_pc (.05 / 3), which rounds off to .01667. There isn't much difference between the two adjustments, but the Sidak alpha is always the larger of the two. For another example, consider the seven-group experiment. There are 21 possible pairwise comparisons, so Sidak's adjusted alpha for each comparison comes to: 1 - (.95)^(1/21) = 1 - .99756 = .00244, which is a bit larger than .05 / 21, which equals .002381. Because Sidak's test does not make a large difference, and is computationally more complex than the Bonferroni test, it was not popular before readily available statistical packages (e.g., SPSS) started to include it. Various tests that incorporate the Bonferroni adjustment can be made a bit more powerful by using the Sidak adjustment instead. Another way to add power to the Bonferroni adjustment is to turn it into a sequential test, as shown next.
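
A quick sketch comparing the two per-comparison alphas (the plain Bonferroni division versus Formula 13.16):

```python
def bonferroni_alpha(alpha_ew, j):
    """Per-comparison alpha under the ordinary Bonferroni adjustment."""
    return alpha_ew / j

def sidak_alpha(alpha_ew, j):
    """Per-comparison alpha under Sidak's adjustment (Formula 13.16)."""
    return 1 - (1 - alpha_ew) ** (1 / j)

for j in (3, 21):
    print(j, bonferroni_alpha(.05, j), sidak_alpha(.05, j))
# j = 3: .01667 vs .01695;  j = 21: .002381 vs .00244
```
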
Sequential Bonferroni Tests

A Step-Down Bonferroni Test: Holm (1979) demonstrated that you can add power to your Bonferroni test with the following sequentially rejective (step-down) procedure. The first step is to check whether any of your comparisons are significant with the ordinary Bonferroni adjustment; if not even one is significant by the unmodified Bonferroni criterion, testing stops, and no significant comparisons are found. If you are testing all possible pairs of five means, as in the previous example, you would determine the p values for each of the ten pairwise comparisons, and then see whether your smallest p value is less than .005. If it is, you declare it significant at the .05 level, and then compare the next smallest p with .05 / 9 = .0056. In terms of a general formula, to be significant, a comparison must have p_i < α_EW / (m - i + 1), where p_i goes from p_1 (the smallest p) to p_m (the largest p), and m is the total number of comparisons in the set. However, you must start with the smallest p, go up step by step (following the proper sequence), and stop as soon as you hit a p that is not significant according to the preceding formula. For example, suppose that the five-group stress study had large enough samples that the p values for the ten possible pairwise comparisons are as follows (ordered from smallest to largest): .002, .0054, .007, .008, .009, .0094, .012, .015, .028, .067. Because the smallest p is less than .005, it is, according to Holm's procedure, significant. The second smallest, .0054, does not have to be smaller than .005; it only has to be less than .0056, and it is. Next, .007 is compared to .05 / 8 = .00625; because p_3 is not less than this value, it is not declared significant, and testing stops at this point. It does not matter that p_6 (.0094) happens to be less than .05 / 5 = .01; once you hit a nonsignificant result in the sequence, you cannot test any larger p values without ruining the Type I error control. You can see the extra power in this approach in that the p of .0054 would not have been significant according to the ordinary Bonferroni correction, but it is significant according to Holm's step-down test.
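
Here is a minimal sketch of Holm's step-down procedure applied to those ten p values (statsmodels' multipletests function offers a 'holm' method that should yield the same accept/reject decisions):

```python
def holm_step_down(pvals, alpha_ew=0.05):
    """Holm (1979): test sorted p values against alpha / (m - i + 1),
    stopping at the first one that fails."""
    m = len(pvals)
    significant = []
    for i, p in enumerate(sorted(pvals), start=1):
        if p < alpha_ew / (m - i + 1):
            significant.append(p)
        else:
            break                    # no larger p may be tested
    return significant

pvals = [.002, .0054, .007, .008, .009, .0094, .012, .015, .028, .067]
print(holm_step_down(pvals))         # [0.002, 0.0054]; .007 fails at .05/8
```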

It is worth noting that more than one modification of the Bonferroni procedure can be applied within the same test. For instance, Holland and Copenhaver (1988) showed that the power of Holm's test can be improved slightly by basing it on Sidak's (sharper) inequality, rather than the usual Bonferroni adjustment. However, Olejnik, Li, Supattathum, and Huberty (1997) pointed out that Holland and Copenhaver's test requires an assumption that makes it somewhat less generally applicable than Holm's test, which makes no such assumptions.

A Step-Up Bonferroni Test: More recently, Hochberg (1988) demonstrated that a step-up procedure could have more power than Holm's step-down test. His step-up test begins with a test of the largest p value in the set. If this p is less than .05, then all of the comparisons are declared significant (e.g., if all ten of the p's are between .02 and .04, they are all declared significant at the .05 level, even though none of them would be significant with the usual Bonferroni adjustment). You use the same formula as in Holm's test, but you apply it in the reverse order. To illustrate, I will again use the set of ten p values above. First, p_m, the largest p value, is compared to α_EW / (m - i + 1) = .05 / (m - m + 1) = .05 / 1 = .05. Because .067 is not less than .05, this comparison is not declared significant, but testing does not stop. The second largest p is then compared to .05 / (10 - 9 + 1) = .05 / 2 = .025. Because .028 is not less than .025, this result is not significant either, and the procedure continues. The next p, the third largest, is compared to .05 / 3 = .0167; because .015 is less than this value, the comparison is significant, and therefore testing stops: all of the p values smaller than .015 are automatically declared significant without further testing. It is easy to see that Holm's test is not as powerful as Hochberg's. The former found only the p's of .002 and .0054 to be significant, whereas the latter found all of the p values from .007 through .015 to be significant as well. Of course, I deliberately devised an example to emphasize the difference between these two procedures; in most real-life cases, the conclusions from the two methods will rarely differ. Considering that Hochberg's procedure rests on the assumption that the tests are independent of one another (Olejnik, Li, Supattathum, & Huberty, 1997), and Holm's test does not, it seems that the conservative way to apply the Bonferroni test is by the use of either Holm's procedure or Sidak's adjustment (Formula 13.16). Of course, you can still use the simpler, unmodified Bonferroni alpha correction, but if you are analyzing your data with statistical software, there is no excuse for throwing away even a small increase in power.
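
A matching sketch of Hochberg's step-up procedure on the same p values; the divisor mirrors Holm's formula, applied from the largest p downward:

```python
def hochberg_step_up(pvals, alpha_ew=0.05):
    """Hochberg (1988): scan from the largest p down; the first p with
    p < alpha / (m - i + 1) is significant, along with all smaller p's."""
    ordered = sorted(pvals)
    m = len(ordered)
    for i in range(m, 0, -1):        # i = m (largest p) down to 1
        if ordered[i - 1] < alpha_ew / (m - i + 1):
            return ordered[:i]       # that p and everything below it
    return []

pvals = [.002, .0054, .007, .008, .009, .0094, .012, .015, .028, .067]
print(hochberg_step_up(pvals))       # eight values: .002 through .015
```
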
Adjusted p Values

When you request a Bonferroni test from SPSS under post hoc comparisons, what you get for each pair of means is a p value that has been adjusted so that it can be compared directly to .05, assuming that that is your desired experimentwise alpha. For instance, for a three-group experiment, a pairwise comparison (i.e., a t test) that yields a p value of .016 would be considered significant at the .05 level, because .016 < (.05 / 3). Instead of giving you the actual two-tailed p value, SPSS adjusts the p value, in this case by multiplying it by 3, and gives you a Bonferroni p of .048 (.016 × 3), which you can see immediately is just under .05, and therefore significant by the Bonferroni test. Quite simply, SPSS adjusts the actual p value by applying the Bonferroni correction backwards. In the general case, without SPSS, you would divide .05 by the total number of possible pairwise comparisons (if you are conservative enough to use Bonferroni as a post hoc test), and then compare each of your p values to that shrunken value. SPSS performs the opposite operation, and multiplies each of your actual p values by the total number of possible pairs, so that each can be compared to .05 (or whatever value you want α_EW to be less than). To express the above operation as a formula, solve Formula 13.8 for α_EW, then change the term α_EW to the Bonferroni-adjusted p, and change α_pc in that formula to your actual p value, like this:

Bonferroni-adjusted p = j × p

Of course, you don't need to see a formula in this simple case, but the formula approach makes it easier to understand more complex adjustments, such as the one SPSS refers to as the Sidak adjustment. To obtain the formula for the Sidak-adjusted p, you have to solve Formula 13.16 above for α_EW, but that just brings you back to Formula 13.2. Changing α_EW in Formula 13.2 to the Sidak-adjusted p, and plain alpha to your actual p value, you get the formula that SPSS uses to turn your p values into Sidak-adjusted p values:

Sidak-adjusted p = 1 - (1 - p)^j    (Formula 13.17)

where j is the total number of possible comparisons. In the three-group Bonferroni example above, a p value of .016 was adjusted to .048. The corresponding Sidak-adjusted p is 1 - (.984)^3 = 1 - .953 = .047. Not a big improvement, but if the computer is doing the work, why not get all the power you can (without sacrificing Type I error control)?
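
Both adjusted-p computations in a final sketch (the cap at 1.0 on the Bonferroni product is my addition, since a reported p value cannot exceed 1):

```python
def bonferroni_adjusted_p(p, j):
    """Bonferroni-adjusted p: j * p, capped at 1.0."""
    return min(j * p, 1.0)

def sidak_adjusted_p(p, j):
    """Sidak-adjusted p (Formula 13.17): 1 - (1 - p)**j."""
    return 1 - (1 - p) ** j

print(bonferroni_adjusted_p(.016, 3))        # 0.048
print(round(sidak_adjusted_p(.016, 3), 3))   # 0.047
```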

References

Hayter, A. J. (1986). The maximum familywise error rate of Fisher's least significant difference test. Journal of the American Statistical Association, 81, 1000-1004.

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800-802.

Holland, B. S., & Copenhaver, M. D. (1988). Improved Bonferroni-type multiple testing procedures. Psychological Bulletin, 104, 145-149.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65-70.

Olejnik, S., Li, J., Supattathum, S., & Huberty, C. J. (1997). Multiple testing and statistical power with modified Bonferroni procedures. Journal of Educational and Behavioral Statistics, 22, 389-406.

Sidak, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62, 626-633.