Statistical Hypothesis Testing: Problems and Alternatives


FORUM

NORMAN S. MATLOFF

Division of Computer Science, University of California at Davis, Davis, California 95616

Environ. Entomol. 20(5): (1991)

ABSTRACT Scientists must make decisions based on experimental data and often use statistical hypothesis tests, either formally or via P values, as the basis for such decisions. However, these tests are designed to answer questions that are almost never of scientific relevance. This paper demonstrates this problem and discusses specific alternative methods.

KEY WORDS hypothesis testing, confidence intervals, simultaneous inference

THE ANALYSIS OF EXPERIMENTAL DATA is inherently a statistical problem because the data arise, explicitly or implicitly, via sampling from some population. Suppose, for example, we feed a group of animals of a certain breed an ordinary diet for a given period of time, then feed the same animals a zinc-enriched diet, taking before-and-after measurements of some attribute X. We then wish to extrapolate the results to the (conceptual) population of all animals of this breed. If, say, the mean of X was 24.7 units higher in these animals after the zinc-enriched diet, we would like to be able to say that the mean of X would be substantially higher when computed among all animals of this breed.

In most cases, the scientist is called upon to make a decision of some kind; e.g., he or she might be asked to recommend whether the zinc-enriched diet produces important benefits.

A glance at the literature in almost any scientific field immediately reveals that statistical hypothesis tests, typically in the form of P values (i.e., observed significance levels), play a dominant role in this decision-making. Tables summarizing experimental results are often presented using the "star system," in which asterisks flag the results that have small P values. However, hypothesis tests can produce misleading results.
In the zinc example described above, for instance, suppose that zinc enrichment produces only a very minor effect of no scientific importance. If we collect data on a large sample of animals, we will probably get a small P value. As is customary, this small P value would be called "significant" even though the difference caused by zinc enrichment is not of scientific significance. In other words, statistical significance is not the same as scientific significance.

The divergence between these two concepts stems from the fact that a hypothesis test, either a formal one or one implied by the reporting of P values, addresses the wrong question. Specifically, the test asks whether a certain condition holds exactly, and this exactness is almost never of scientific interest. For instance, the hypothesis being tested in the diet example would probably be $H_0\colon \mu_{\text{diff}} = 0$, where $\mu_{\text{diff}}$ is the population mean difference between the zinc-enriched and non-zinc-enriched values of X. But $H_0$ is not the question of scientific interest. If $\mu_{\text{diff}}$ is nonzero but close to 0, then $H_0$ is false in a mathematical sense yet is essentially true from the scientist's point of view. Unfortunately, the test is designed to address only the mathematical sense.

The problem can be even worse in other settings, such as goodness-of-fit tests. For instance, consider a genetics problem in which the model states that phenotype ratios are 9:3:3:1. A goodness-of-fit test is designed to answer whether these ratios hold exactly. However, we know a priori that this is not true; no model can completely capture all possible genetic mechanisms. What we are interested in instead is whether a given model holds to a sufficiently close approximation that it will be scientifically useful. Yet, again, a standard goodness-of-fit test is not designed to answer such a question.
It is designed to answer whether the model holds exactly, and thus a test on a large enough sample will pounce on even the most minor deviations from that model.

Even in cases in which the exactness of a certain null hypothesis $H_0$ seems to be of scientific interest, our test may pick up "departures" from $H_0$ that are due to "noise" rather than to real differences. For example, a measuring instrument being used in the experiment may have a small but nonzero bias. With a large enough sample, we would reject $H_0$ even if it is true.

So, a hypothesis test can be very misleading with large samples. But the other side of the coin is that a test can be equally misleading with small samples. Here, even a departure from $H_0$ that is large enough to be of scientific interest may easily be missed because of the small sample size. If so, no asterisks will be reported next to the result, and it will be announced that no significant difference was found, even if the treatment was of substantial value.

Environmental Entomology, Vol. 20, no. 5, October 1991. © 1991 Entomological Society of America

These problems could be ameliorated to some extent if the power of the test, typically denoted $1 - \beta$, were computed at various types and degrees of departure from $H_0$. But this is difficult to do in complex models and in any case still answers the questions of real scientific interest only indirectly.

Recognition of problems such as those cited above is not new. They have been mentioned occasionally, such as by Snedecor & Cochran (1967), Graybill (1976), Freedman et al. (1978), Jones & Matloff (1986), Matloff (1988), and Moore & McCabe (1989). The usual alternative cited is to create confidence intervals (CIs). However, although the use of CIs is straightforward in simple settings comparing two means, such as that of the zinc example, it is less clear how this approach can be extended to other types of settings, such as the goodness-of-fit problem mentioned earlier.

This paper discusses the drawbacks of hypothesis testing in more depth, using a different approach than do the references above. It also presents a generally applicable tool for decision-making that does not involve testing.

The concepts are illustrated using the genetics example mentioned above, which will now be described more formally. Let $p_i$ denote the population proportion of phenotype $i$; $i = 1, 2, 3, 4$. The null hypothesis is then

$$p_1 = \frac{9}{16}, \quad p_2 = \frac{3}{16}, \quad p_3 = \frac{3}{16}, \quad p_4 = \frac{1}{16}. \tag{1.1}$$

Should We Be "Guided by the Stars"?

As already mentioned, it is common in the presentation of experimental results to use asterisks to flag statistics that have small P values. A statistic is described as being "one-star," "two-star," and so on.
It was argued qualitatively in the introduction that such descriptions are useless and even dangerous. This section presents a quantitative discussion of this problem.

Suppose we perform a $\chi^2$ goodness-of-fit test on the genetics model hypothesized in equation 1.1, based on a random sample of size $n$. Assume that we report our results with the following "star system": P > 0.05, no stars; P < 0.05, one star; P < 0.01, two stars; P < 0.005, three stars; P < 0.001, four stars. How many stars can we expect under various conditions, such as type or degree of departure from $H_0$ and size of $n$?

To investigate this question, a computer simulation was conducted under the five sets of phenotype proportions $p_i$ shown in Table 1. By comparison, $H_0$ in decimal form is

$$H_0\colon\ p_1 = 0.5625, \quad p_2 = 0.1875, \quad p_3 = 0.1875, \quad p_4 = 0.0625. \tag{2.1}$$

Thus, settings I through V represent progressively larger departures from $H_0$. Specifically, these departures from $H_0$ are taken in the direction toward the setting in which $p_i = c_i$, where $c_1$ is 1 and all the other $c_i$ are 0. Let $h_i$ be the value of $p_i$ hypothesized by $H_0$. Then setting $J$ has $p_i = \lambda_J h_i + (1 - \lambda_J) c_i$, for $i = 1, 2, 3, 4$ and $J$ = I, II, III, IV, V. The values used for $\lambda_J$ were 0.95, 0.90, 0.85, 0.75, and 0.65 for $J$ = I, II, III, IV, V, respectively.

For each set of proportions $p_i$, sample sizes $n$ = 50, 100, 250, 500, 750, and 1,000 were simulated. For each set of proportions $p_i$ and each value of $n$, 10,000 samples were generated, and the number of stars was recorded for each sample. The goal was to find the expected number of stars for each setting and each value of $n$.

The results are shown in Fig. 1, which clearly shows the effects of large and small sample sizes discussed in the introduction. The proportions $p_i$ for curve II are only slightly different from those of $H_0$, yet for samples of size 750, the expected number of stars is one.
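
A simulation of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program used for the paper: the function names, the random seed, and the reduced number of replications are all choices made here, and the P value uses the closed-form upper tail of a $\chi^2$ distribution with 3 degrees of freedom.

```python
import math
import random

H0 = [9 / 16, 3 / 16, 3 / 16, 1 / 16]   # hypothesized proportions h_i (equation 1.1)
CORNER = [1.0, 0.0, 0.0, 0.0]           # the c_i toward which the departures move
LAMBDAS = {"I": 0.95, "II": 0.90, "III": 0.85, "IV": 0.75, "V": 0.65}


def setting_proportions(lam):
    """p_i = lam * h_i + (1 - lam) * c_i for one setting J."""
    return [lam * h + (1 - lam) * c for h, c in zip(H0, CORNER)]


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def stars(p_value):
    """Star system from the text: one star per threshold the P value falls below."""
    return sum(p_value < cut for cut in (0.05, 0.01, 0.005, 0.001))


def mean_stars(probs, n, reps, rng):
    """Average star count over `reps` simulated samples of size n drawn from `probs`."""
    total = 0
    for _ in range(reps):
        draws = rng.choices(range(4), weights=probs, k=n)
        counts = [draws.count(i) for i in range(4)]
        # Pearson goodness-of-fit statistic against the H0 proportions
        stat = sum((obs - n * h) ** 2 / (n * h) for obs, h in zip(counts, H0))
        total += stars(chi2_sf_3df(stat))
    return total / reps


if __name__ == "__main__":
    rng = random.Random(1991)            # arbitrary seed, for repeatability only
    for label, lam in LAMBDAS.items():
        probs = setting_proportions(lam)
        row = [mean_stars(probs, n, 200, rng) for n in (50, 100, 250, 500, 750, 1000)]
        print(label, [round(v, 2) for v in row])
```

With only 200 replications per point the printed means are noisy; raising the count toward the 10,000 used in the text smooths them at the cost of run time. In the run reported in the text, curve II reaches an expected count of about one star at $n$ = 750.
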
In other words, we will probably characterize the $p_i$ as being "significantly different" from the specification in $H_0$, even though the difference is actually rather small. On the other hand, the proportions $p_i$ for curve V represent a very substantial departure from $H_0$, yet the curve shows that this departure will not have a high probability of being detected if our sample size is only 50. In fact, in a separate computation not shown in Fig. 1, it was found that only 37% of the samples of size 50 yielded one or more stars. Thus, a substantial departure is highly likely to be overlooked in small samples.

A horizontal line at height 1.0 intersects four of the five curves within the range of sample sizes studied. Thus, if we are told only that a certain statistic is "one-star," this information by itself is useless. "One-starness" is a property that would be consistent with any of the proportion sets $p_i$ from II to V. In other words, the knowledge that our statistic rates one star does not tell us whether the departure from $H_0$ is substantial or is very minor and possibly of no scientific interest.

Thus the number of stars by itself is noninformative for scientific purposes. As mentioned in the introduction, the number of stars by itself is relevant only to the question of whether $H_0$ is exactly true, a question that is almost never of interest to us, especially because we usually know a priori that $H_0$ cannot be exactly true.

Table 1. Alternative phenotype proportions

J      p1       p2       p3       p4
I      0.5844   0.1781   0.1781   0.0594
II     0.6062   0.1688   0.1688   0.0562
III    0.6281   ...      ...      ...
IV     ...      ...      ...      ...
V      ...      ...      ...      ...

Of course, if we were armed with charts similar to Fig. 1 when we read the scientific literature, we could view the number of stars appended to a statistic in proper perspective. This kind of chart, combined with our knowledge of the sample size $n$, could in theory impart some real meaning to the star counts. For any given setting, such a chart would give us an indication of how the value of $n$ affects the number of stars reported.

But this would be impractical. A different chart would need to be constructed for each setting, because the size of $n$ has quite different effects in different settings. Most readers do not have the time and resources to generate such charts.

Moreover, note that even Fig. 1 is woefully incomplete. It presents curves for only a few proportion sets $p_i$, and only sets of a particular kind: sets in which $p_1$ is increasing toward 1.0 and $p_2$, $p_3$, and $p_4$ are decreasing to 0.0. We have no idea what the curves for other types of departures look like. And in any case, the use of such charts would still be only a patchwork repair of the problems stemming from the fact that hypothesis tests do not address questions of scientific interest.

Alternatives

The major point developed in the preceding section was that P values by themselves have very little scientific meaning. They are designed only to assess the exact validity of $H_0$, a question that is almost never of scientific interest. Instead, we should base our analyses on methods that do allow direct examination of questions of scientific interest.

Consider again the zinc example. Suppose our sample consists of $n$ animals. Using the $n$ differences $X_{\text{after}} - X_{\text{before}}$, we can form a CI for $\mu_{\text{diff}}$, which is denoted here as $(a, b)$.
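
As a concrete illustration, an interval of this kind can be computed from the paired differences as sketched below. The text does not say which interval formula is intended; this sketch assumes the usual large-sample normal approximation (mean difference plus or minus a normal quantile times the standard error), and the function name and the data in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_difference_ci(before, after, confidence=0.95):
    """Large-sample CI (a, b) for the mean of the paired differences after - before.

    Assumption: a normal quantile is used, which is reasonable only when the
    number of animals is fairly large; a t quantile would widen the interval
    slightly for small samples.
    """
    diffs = [x_after - x_before for x_before, x_after in zip(before, after)]
    n = len(diffs)
    center = mean(diffs)
    std_error = stdev(diffs) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * std_error, center + z * std_error


if __name__ == "__main__":
    # Hypothetical before/after measurements of attribute X on eight animals
    before = [101.2, 98.7, 110.4, 95.1, 104.9, 99.8, 107.3, 102.6]
    after = [125.0, 121.9, 136.8, 118.2, 130.5, 126.1, 131.0, 127.4]
    a, b = paired_difference_ci(before, after)
    print(f"CI for the mean difference: ({a:.1f}, {b:.1f})")
```

Both endpoints are reported, not only whether the interval excludes zero.
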
Consider the following four situations:

(1) $a$ is near 0, and $b - a$ is large;
(2) $a$ is near 0, and $b - a$ is small;
(3) $a$ is far from 0, and $b - a$ is large;
(4) $a$ is far from 0, and $b - a$ is small.

As can be seen in the situations listed above, a CI for $\mu_{\text{diff}}$ actually carries two major pieces of information, which are called here "the estimate (E) part" and "the accuracy (A) part." The E part estimates the size of the effect of zinc. Here, the E part is the distance from $a$ to 0. The A part provides an indication of the accuracy of the E part. Here the A part is the size of $b - a$.

Fig. 1. Mean number of asterisks (horizontal axis: n).

Of course, the words "near," "far," "large," and "small" have to be defined in the context of the experiment itself. For example, "near" would be defined to be within the range that the investigator, or the reader of the investigator's report, considers to represent an unimportant effect.

In situation 1, we see that, although the E part of the CI suggests that the effect of zinc might not be large, the A part tells us that the effect might actually be quite substantial. This contrasts with situation 2, in which we have strong evidence that the effect is small.

The main point here is that the "star system" includes neither an E part nor an A part and thus excludes vital information. In situation 2, for instance, in which the CI is telling us that we can be fairly confident that zinc does not have an important effect, a hypothesis test would indicate a "significant" effect. Reliance on hypothesis tests in such a situation is clearly contrary to good scientific method and in fact defeats the purpose of requiring statistical analyses when reporting experimental results.

In the preceding section, it was shown that hypothesis tests can be misleading in both small and large samples.
The dangers of hypothesis tests in small samples occur mainly in situation 1, in which a possibly substantial effect might be missed because of insufficient sample size. There is no such danger in basing our analysis on CIs, because in situation 1, the A part of the CI would warn us that the sample size was too small. Similarly, the danger of hypothesis testing in large samples lies mainly in situation 2, in which a small effect of no scientific interest might be declared "highly significant." Again, this danger is avoided if one uses CIs, because the E part would show us that the effect is minor.

The question to be addressed now is how to extend CI analysis such as that above to more complex settings. This will be done through the formation of simultaneous CIs, analogous to the Scheffé or Duncan methodology that is familiar to many scientists. In fact, the methodology presented here is an extension of Scheffé's techniques.

Following is the technical environment that is assumed (some readers may wish to skip this paragraph and proceed directly to the example). Suppose we have statistics $T_1, \ldots, T_w$ that have an approximately multivariate normal distribution, with the estimated covariance between $T_i$ and $T_j$ being denoted by $C_{ij}$. It is required that the matrix $C$ be nonsingular. Let $\gamma_i$ be the mean of $T_i$. The values $\gamma_i$ are the parameters of interest, such as population means or proportions, and the statistics $T_i$ are estimates of these parameters. We might be interested in various linear combinations of the form

$$\sum_{i=1}^{w} a_i \gamma_i. \tag{3.1}$$

The derivation of Scheffé's method can be generalized (Rao 1973, Hochberg & Tamhane 1987). The result is that the CIs

$$\sum_{i=1}^{w} a_i T_i \pm \sqrt{d\,\chi^2_{w;\alpha}} \tag{3.2}$$

hold simultaneously with approximate confidence level $1 - \alpha$, where $\chi^2_{w;\alpha}$ is the upper-$\alpha$ percentile of a $\chi^2$ distribution having $w$ degrees of freedom, and

$$d = \sum_{i=1}^{w} C_{ii} a_i^2 + 2 \sum_{i=2}^{w} \sum_{j=1}^{i-1} a_i a_j C_{ij}. \tag{3.3}$$

Example 1.

Consider the genetics problem mentioned above. Here we have the following correspondences: $T_i$ is $\hat{p}_i$, the estimate of $p_i$; $\gamma_i$ is $p_i$; $w$ is 3 (not 4, because $p_4$ is redundant, since the $p_i$ sum to 1); $C_{ij}$ is $-\hat{p}_i \hat{p}_j / n$ for $i \neq j$ and $\hat{p}_i (1 - \hat{p}_i) / n$ for $j = i$.

We can get simultaneous CIs for $p_1$, $p_2$, and $p_3$ using equation 3.2. For example, to get the interval for $p_1$, we use $a_1 = 1$ and $a_2 = a_3 = 0$. These CIs could then be used to assess goodness of fit of the genetics model (equation 1.1).

Again, it is crucial that we pay attention to both the E and A parts of the CIs. For example, if the E part shows that the model's value for $p_1$ is just slightly outside our CI for $p_1$, then we may decide that the model is still scientifically useful (in contrast to the star system, which would summarily discard the model). On the other hand, if the A part shows that the CI contains the model's value but is very wide, then this is a signal that we do not have enough data to properly assess the validity of the model.

Example 2.
Consider a two-way analysis of variance setting, with $r$ rows and $c$ columns. Let $X_{ijk}$ be the $k$th observation at level $i$ of the first treatment and level $j$ of the second treatment, $i = 1, \ldots, r$; $j = 1, \ldots, c$; $k = 1, \ldots, n$. To preserve the single-subscript notation above, instead of denoting the $(i, j)$th population cell mean as $\mu_{ij}$, we will call that mean $\gamma_{(i-1)c+j}$. In other words, the means in the top row are denoted $\gamma_1$ through $\gamma_c$, the means in the second row are called $\gamma_{c+1}$ through $\gamma_{2c}$, and so on. Of course, the $(i, j)$th sample cell mean is accordingly $T_{(i-1)c+j}$. The value of $w$ here is $rc$. $C_{qq}$ is the sample variance of the observations in the $q$th cell divided by $n$ (the estimated variance of the cell mean $T_q$), and $C_{pq}$ is 0 if $p \neq q$.

Suppose we want to investigate additivity (i.e., noninteraction), meaning that we want to investigate how closely the relationship

$$\mu_{bd} - \mu_{ed} = \mu_{bf} - \mu_{ef} \tag{3.4}$$

holds for all $b$, $d$, $e$, $f$. We can do this by forming simultaneous CIs for the quantities

$$\mu_{bd} - \mu_{ed} - (\mu_{bf} - \mu_{ef}). \tag{3.5}$$

For example, say $r = 4$ and $c = 5$. Then for $b = 1$, $d = 3$, $e = 2$, and $f = 4$, the nonzero values of $a_i$ would be $a_3 = 1$, $a_4 = -1$, $a_8 = -1$, $a_9 = 1$.

Again, we should not discard the idea of additivity if the E parts of the CIs show that 0 is slightly outside some of the CIs for the quantities in 3.5; i.e., that there are slight discrepancies from equation 3.4. In such a case, the additive model might still be a very valuable description. On the other hand, if the A parts of the CIs show that the CIs contain 0 but are very wide, this is a signal that we do not have sufficient data to assess the additivity. This is very important, because it can be shown that hypothesis tests for interaction typically have very low power.

The methodology here is based on large-sample approximations. The reader may wonder how large $n$ must be to make the approximations work well.
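
The computations behind the two examples can be sketched as follows. This is an illustrative sketch, not the author's code: the function names and the phenotype counts in the usage lines are hypothetical, the $\chi^2$ percentile is obtained by bisection on the closed-form tail for 3 degrees of freedom (so it applies only to Example 1, where $w = 3$), and for Example 2 only the coefficient vector is built, under the assumption that $b \neq e$ and $d \neq f$.

```python
import math


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def chi2_upper_3df(alpha):
    """Upper-alpha percentile for 3 degrees of freedom, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_sf_3df(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def combination_ci(a, t, cov, crit):
    """Interval 3.2 for sum(a_i * gamma_i).

    The full double sum below equals d in equation 3.3: the diagonal terms give
    C_ii * a_i**2 and each off-diagonal pair is counted twice.
    """
    w = len(a)
    estimate = sum(a[i] * t[i] for i in range(w))
    d = sum(a[i] * a[j] * cov[i][j] for i in range(w) for j in range(w))
    half_width = math.sqrt(d * crit)
    return estimate - half_width, estimate + half_width


def phenotype_cis(counts, alpha=0.05):
    """Simultaneous CIs for p_1, p_2, p_3 from four phenotype counts (Example 1)."""
    n = sum(counts)
    p_hat = [count / n for count in counts[:3]]   # p_4 is redundant, so w = 3
    cov = [[(p_hat[i] * (1 - p_hat[i]) if i == j else -p_hat[i] * p_hat[j]) / n
            for j in range(3)] for i in range(3)]
    crit = chi2_upper_3df(alpha)
    unit_vectors = [[1 if k == i else 0 for k in range(3)] for i in range(3)]
    return [combination_ci(a, p_hat, cov, crit) for a in unit_vectors]


def additivity_contrast(c, b, d, e, f):
    """Nonzero a_q for mu_bd - mu_ed - (mu_bf - mu_ef), with q = (i - 1) * c + j."""
    def q(i, j):
        return (i - 1) * c + j
    return {q(b, d): 1, q(e, d): -1, q(b, f): -1, q(e, f): 1}


if __name__ == "__main__":
    for lo, hi in phenotype_cis([56, 19, 19, 6]):     # hypothetical counts, n = 100
        print(f"({lo:.3f}, {hi:.3f})")
    print(additivity_contrast(c=5, b=1, d=3, e=2, f=4))
```

Each interval's position relative to the model value and its width can then be inspected directly. Such a sketch leaves open how large $n$ must be for the nominal confidence level to be trustworthy.
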
This is an important question, although it should be pointed out that even "exact" statistical methodology is actually approximate in practice; no population has an exact normal distribution, nor are variances exactly homogeneous, and independence assumptions are often violated to at least some degree.

To address the question, a simulation was conducted to assess how well equation 3.2 works. The setting used was that of Example 1 above. The simulation generated 10,000 samples of size 100. A count was kept of how many samples there were in which all three of the confidence intervals for $p_1$, $p_2$, and $p_3$ contained their respective parameters. The value of $\alpha$ was 0.05, so that the intervals had nominal confidence level 95%. The result was that 94.2% of the samples were such that all three CIs contained their respective parameters.

Summary

It has been shown that the use of hypothesis tests, including informal P values, is generally noninformative and can be misleading. One of the problems is that a test addresses the question of the exact validity of some model, yet this exactness is usually not of scientific interest. Another problem is that hypothesis tests are very sensitive to sample sizes, yet give the investigator no feedback as to the degree of the effect of sample size in his or her particular setting.

These problems can be solved by using CI analysis. In contrast to analysis based on hypothesis tests, the E parts of the CIs address questions of direct scientific interest, and the A parts give explicit feedback as to whether the investigator has collected sufficient data for making reasonable conclusions.

Basic texts in statistics usually do not provide much information about forming CIs in nonsimple settings. However, it is shown here that simultaneous CIs can be easily obtained for many such settings.

References Cited

Freedman, D., R. Pisani & R. Purves. 1978. Statistics, 1st ed. Norton, New York.

Graybill, F. 1976. Theory and application of the linear model. Duxbury, Boston, Mass.

Hochberg, Y. & A. Tamhane. 1987. Multiple comparison procedures. Wiley, New York.

Jones, D. & N. Matloff. 1986. Statistical hypothesis testing in biology: a contradiction in terms. J. Econ. Entomol. 79:

Matloff, N. 1988. Probability modeling and computer simulation, with applications to engineering and computer science. Prindle, Weber & Schmidt-Kent, Boston, Mass.

Moore, D. & G. McCabe. 1989. Introduction to the practice of statistics. Freeman, San Francisco.

Rao, C. 1973. Linear statistical inference and its applications, 2nd ed. Wiley, New York.

Snedecor, G. & W. Cochran. 1967. Statistical methods, 6th ed. Iowa State University Press, Ames.

Received for publication 18 September 1990; accepted 23 April 1991.
