Tel Aviv University
The Raymond and Beverly Sackler Faculty of Exact Sciences
Department of Statistics and Operations Research, School of Mathematical Sciences

Improved Multiple Test Procedures for Discrete Distributions: New Ideas and Analytical Review

Thesis submitted in partial fulfillment of graduation requirements for the degree of M.Sc. in Applied Statistics

By Roee Gutman

Prepared under the supervision of Prof. Yosef Hochberg

September

Acknowledgment

Thanks to Prof. Yosef Hochberg of the Raymond and Beverly Sackler Faculty of Exact Sciences, Department of Statistics and Operations Research, School of Mathematical Sciences, Tel Aviv University, for his long-lasting mentorship and for the supportive and original ideas with which he guided me through the world of statistics. Special thanks to Prof. Yoav Benjamini and to Dr. Felix Abramovich, of the same department, for their willingness to help with this project. Their meritorious comments and advice are deeply appreciated.

Abstract

Hypothesis testing is one of the basic problems in statistics. The solution usually used for this problem is the likelihood ratio test (LRT). The LRT has several well-known advantages; however, when the significance level is predefined and the power is to be maximized, it is not always the most powerful test. Furthermore, when discrete distributions are involved, other methods should be considered as well. This study explores some of the options for single hypothesis testing when the underlying distribution is discrete. The power issue in multiple hypothesis testing is of utmost importance: procedures dedicated to multiple comparisons with discrete test statistics enable a gain in power. We review and compare most of the existing non-randomized multiple hypothesis testing procedures for discrete distributions. In addition, three new procedures are proposed: TWW_k, stepwise TWW_k, and an expansion of Paroush's (1969) integer programming (IP) procedure to multiple tests. All procedures, the new and the existing ones, are compared to each other by mathematical analysis. Where the mathematical analysis did not identify a universally most powerful test, a simulation analysis was performed using several test cases in which procedures for multiple hypotheses with discrete distributions should be used. This work makes it clear that discrete procedures should be used in multiple hypothesis testing in order to gain power, and it couples the most appropriate discrete procedures with specific test cases.

Contents:
1. Introduction.
2. New thoughts about discrete single hypothesis testing.
3. A comparative review of old and new ideas in multiple testing of discrete distributions.
   3.1 Single step methods.
   3.2 A newly proposed single step method.
   3.3 Comparison between single step multiple hypothesis methods.
   3.4 Stepwise methods.
   3.5 A newly proposed stepwise method.
   3.6 Comparison between stepwise multiple hypothesis methods.
   3.7 Global hypothesis testing.
   3.8 A newly proposed global hypothesis test.
4. Applications of the multiple testing procedures.
   4.1 Example 1: cDNA transcripts.
   4.2 Example 2: Animal carcinogenicity test.
   4.3 Example 3: Efficacy of a respiratory therapy.
   4.4 Example 4: Relationship between DVT and 3 genetic factors.
5. Applications of the multiple testing procedures to simulated data sets.
   5.1 Independent test statistics.
   5.2 Independent trend simulation.
   5.3 The dependent case of multinomial sampling.
   5.4 The dependent case of the extended multinomial hyper-geometric distribution.
6. Discussion.
7. References.
8. Appendices.

Introduction:

Hypothesis testing is one of the most thoroughly explored problems in statistics. In testing a hypothesis, one wishes to decide, based on observations X, whether or not a hypothesis that has been formulated prior to observing X is correct. The choice is dichotomized between accepting or rejecting the hypothesis. The procedure that solves such a problem is called a test of the hypothesis in question. A non-randomized test procedure assigns to each possible value x ∈ Range(X) one of the decisions: either accept or reject H_0. The values may then be classified into two regions, S_0 and S_1. If x falls into S_0 the hypothesis H_0 is accepted, otherwise it is rejected. This thesis deals exclusively with non-randomized tests.

When performing a test, one may arrive at the correct decision, or may commit one of two errors: rejecting the hypothesis when it is true, or accepting it when it is false. It is desirable to carry out a test in a manner that keeps the probabilities of the two types of errors to a minimum. Unfortunately, for a given number of observations, it is impossible to minimize both probabilities simultaneously. It is customary, therefore, to assign a bound α to the probability of incorrectly rejecting H_0 when it is true, and to attempt to minimize the other probability subject to this condition. Neyman & Pearson proposed the likelihood ratio test (LRT). This test has a number of desired properties: it is easy to apply, it leads to definite and reasonable conclusions, and it possesses various pleasant large sample properties. In view of these properties, the test seems to be universally satisfactory. There are, however, scenarios in which the LRT is unsatisfactory, and may even be useless. This can be demonstrated by the following example given by E. L. Lehmann (1950), in which X takes one of five values:

x                 -2        -1                   0               1                    2
Null hypothesis   α/2       1/2 - α              α               1/2 - α              α/2
Alternatives      (1-p)c    (1-c)(1/2-α)/(1-α)   (1-c)α/(1-α)    (1-c)(1/2-α)/(1-α)   pc

α and c are constants, 0 < α ≤ 1/2, α/(2-α) < c < α, and p ranges over the interval [0, 1]. It is desired to test the null hypothesis at significance level α. The LRT rejects H_0 when x = +2 or x = -2, hence its power against each alternative is c. Since c < α, this test is, literally, worse than useless, because a test with power α can be obtained without observing X at all, simply by the use of random numbers. The test is significantly improved if instead it rejects H_0 when x = 0, thereby acquiring the power (1-c)α/(1-α) > α, so that a reasonable test for the hypothesis does exist.

This kind of example made it necessary to improve the existing methods for discrete distributions in a way that does not rely on the LRT. In the following sections some existing procedures are reviewed and some original new multiple comparison methods are presented. A power comparison between the new and the existing methods is performed and utilized in a critical review of all new and existing methods. Section two raises some thoughts about single-test methods for discrete distributions. Section three presents some of the existing multiple comparison methods and suggests new ones; in addition, a mathematical comparison between the presented methods is carried out. Section four displays several test cases where discrete methods for multiple comparisons should be considered. Section five includes a power comparison, using simulation analysis, between the methods for which the mathematical analysis was inconclusive. Section six summarizes the results and makes suggestions about the methods that should be used for discrete distributions.

2. New Thoughts about Discrete Single Hypothesis:

Discrete single testing involves a finite sample space with N outcomes (sample points). To each point two numbers are attached, P_i and Q_i, which are the probabilities of point i under the two alternative hypotheses H_0 and H_1 (P_i, Q_i ≥ 0; Σ_{i=1..N} P_i = Σ_{i=1..N} Q_i = 1). The testing also includes a test statistic t and a decision rule, such as t ≥ t(α), for accepting or rejecting H_0. The problem at this stage is to construct a non-randomized most powerful test of H_0 against H_1, where the probability of rejecting H_0 when it is true is at most α. A simple solution to this problem is the likelihood ratio test (LRT), which can be found in many elementary textbooks (e.g. Lehmann 1986, Ferguson 1967). The points are ranked according to R_i = Q_i/P_i; a point is then assigned to the rejection region if and only if its rank (i) is smaller than, or equal to, a given constant c, where c is the maximal number such that Σ_{i=1..c} P_[i] ≤ α (P_[1], ..., P_[N] being ordered by decreasing R_i). The Neyman-Pearson basic lemma states that the LRT of H_0 against H_1 with a given level of significance is more powerful than any other test with the same or smaller level of significance (for a given number C, the LRT is the uniformly most powerful among all tests of level Σ_{i=1..C} P_[i]). However, the lemma does not imply that the LRT is the most powerful test when the significance level is predefined. This point is illustrated by the following example:

Table 1: Sample points ranked by R_i (columns: sample point, P_i, Q_i, R_i, with column totals).

From this table, the points within the rejection region can easily be derived. The differences between the LRT and the non-randomized most powerful test are depicted in the next table.

Size of the test    Points in the LRT rejection region    Points in the non-randomized most powerful test
5%                  1, 2, 3                               1, 2, 4
10%                 1, 2, 3, 4                            1, 4, 5

The power of the non-randomized most powerful test is 3% higher than that of the LRT when α = 5%, and the difference increases to 5% when α = 10%. The difference between these two tests stems from the fact that α is selected a-priori and that the sample space is discrete with a finite number of points. To obtain the most powerful test one has to compare no more than 2^N potential rejection regions; however, this method is inefficient and cumbersome. Paroush (1969) devised a method that uses integer programming (IP) to solve this problem. He transformed the statistical test into a linear programming form:

max Σ_i Q_i X_i   subject to   Σ_i P_i X_i ≤ α   and   X_i ∈ {0, 1} for all i = 1, ..., N,

where P_i is the probability of the value associated with X_i under the null hypothesis and Q_i is the probability of that value under the alternative. Sample point k is in the rejection region of the test if and only if the associated X_k is set to 1 in the optimal solution.

Example: H_0: Y ~ B(10, 0.5), H_1: Y ~ B(10, 0.3), with the probability of a type I error bounded by α = 0.05. Using the Neyman-Pearson LRT, the points included in the rejection region are {0, 1}; the power of this test is 0.149 and its probability of committing a type I error is 0.011. Using IP, the rejection region consists of the points {0, 2, 10}; the power of the test is 0.262 and its type I error is 0.046. The latter result improves the power of the test while nearly exhausting the level α. However, it contains an illogical point (point 10): we would expect that a sample result of 10 implies that the probability of success is higher than or equal to 0.5, not smaller as claimed by H_1. This outcome occurs because IP does not take the direction of the rejection region into account and tries to exhaust α completely. In order to make more sense of the results received from IP, one may change the target function, minimizing a function of the two types of errors instead of minimizing only the type II error (maximizing the power). This can be done by maximizing Σ_i [Q_i X_i + (1 - X_i) P_i]. Maximizing this function minimizes both types of error while controlling the type I error at a level not larger than α; the rejection region then includes only the points {0, 2}, a result that seems more logical.
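The binomial example above is small enough to check by exhaustive search. The following sketch is only an illustration (Paroush's formulation would normally be passed to an integer programming solver rather than enumerated); it searches all candidate rejection regions under both objectives discussed above:

```python
from itertools import product
from math import comb

n, alpha = 10, 0.05
p0 = [comb(n, k) * 0.5**n for k in range(n + 1)]                  # H0: B(10, 0.5)
p1 = [comb(n, k) * 0.3**k * 0.7**(n - k) for k in range(n + 1)]   # H1: B(10, 0.3)

best_power, best_mixed = None, None
for x in product([0, 1], repeat=n + 1):        # x[k] = 1 iff point k is rejected
    size = sum(p0[k] for k in range(n + 1) if x[k])
    if size > alpha:                            # keep the type I error at most alpha
        continue
    region = [k for k in range(n + 1) if x[k]]
    power = sum(p1[k] for k in region)
    mixed = sum(p1[k] if x[k] else p0[k] for k in range(n + 1))   # Q_i*X_i + (1-X_i)*P_i
    if best_power is None or power > best_power[0]:
        best_power = (power, size, region)
    if best_mixed is None or mixed > best_mixed[0]:
        best_mixed = (mixed, size, region)

print("max power   :", best_power)   # rejection region {0, 2, 10}
print("mixed target:", best_mixed)   # rejection region {0, 2}
```

For larger sample spaces the enumeration becomes infeasible and a proper IP solver is needed.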

Another example comparing the LRT to IP is Fisher's tea-drinking lady problem. A British woman claimed to be able to distinguish whether milk or tea was added to the cup first. To test her claim, she was given eight cups of tea, in four of which milk was added first, and she was told that there were four cups of each type. The results can be recorded in the following table:

Lady says \ Truth     Before       After        Sum
Before                X_1          4 - X_1      4
After                 4 - X_2      X_2          4
Sum                   4            4            8

In this problem the alternative hypothesis is a compound hypothesis, whereas the formulation suggested by Paroush requires the exact distribution under the alternative. This can be overcome quite easily by replacing the Q_i by P_i, thus achieving a test with type I error ≤ α while maximizing the size of the test (P{X ∈ S_1} under the null hypothesis, S_1 being the rejection region). Fisher's tea-drinking lady problem can be solved using several types of probabilistic models. One model is the binomial distribution for the total number of successes (X_1 + X_2), the probability that the lady classifies a cup correctly being equal for both groups; under the null hypothesis X_1 + X_2 ~ B(8, 0.5), so that Pr(X_1 + X_2 = k) = C(8, k)/2^8 for k = 0, ..., 8. The test of interest to the researchers was whether the lady has the power to decide if the milk was poured before or after the tea, which leads to a one-sided test. However, it might also be of interest whether the lady could classify the cups in the opposite direction (separate the cups, but into the wrong groups). Under these circumstances the test is two-sided, but with a smaller interest in one side. One can suggest the following rejection region: reject the null hypothesis if X_1 + X_2 falls into the region {0, 1, 8}, when the significance level is 0.05. This gives a stronger probability of rejection on one side of the test, while still keeping the option of rejecting the null hypothesis on an opposite, surprising result. This test controls the type I error below α and attains higher power than the LRT.
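As a small numerical check of the binomial model (an illustration only, not part of the original analysis), the exact null probabilities and the size of the suggested region {0, 1, 8} can be computed directly:

```python
from math import comb

# Null distribution of the total number of correct classifications, B(8, 0.5)
pmf = [comb(8, k) / 2**8 for k in range(9)]

region = [0, 1, 8]                              # the asymmetric region suggested in the text
size = sum(pmf[k] for k in region)
print("Pr(X1+X2 = k):", [round(p, 4) for p in pmf])
print("size of {0, 1, 8}:", round(size, 4))     # 10/256 = 0.0391 <= 0.05
```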

Another way to solve the problem is to use the Fisher exact procedure (Agresti 1990), which relies on the hypergeometric probability. The one-sided rejection region obtained in this case by IP is {0, 4}, and by the LRT it is {0}, when α = 0.05. In this case Pr(X_1 = 0) = Pr(X_1 = 4), and for a two-sided test the rejection regions of IP and the LRT coincide. However, if we change the problem a bit and take 13 cups of each kind instead of 4, keeping the same level of significance, the IP rejection region consists of {0, 1, 2, 4, 12, 13}, attaining a type I error of 0.0498, while the LRT rejection region for a one-sided test consists only of {0, 1, 2, 3}, attaining a type I error of 0.0085, and for a two-sided test it consists of {0, 1, 2, 3, 11, 12, 13}, with a type I error of 0.0091; IP produces a considerably larger rejection region than the LRT. It is of note that IP alone may yield several rejection regions with an equal type I error (e.g., in the last example the region {0, 1, 2, 9, 12, 13} has the same type I error probability). The researcher needs to decide up front which hypothesis is in question. If it is "the lady has no power" (H_0) vs. "the lady has the power" (H_1) to claim the identity of the cups, the researcher should choose the one-sided test, whose rejection region consists of {0, 1, 2, 4, 12, 13}. On the other hand, if the researcher wants to test whether the lady can distinguish between the cups at all, even if the group identity is wrong, the rejection region should consist of {0, 1, 2, 9, 12, 13}. Another disadvantage that may arise when using the IP method for hypothesis testing is the lack of alpha consistency (AC): a hypothesis that is accepted at a given level may be rejected at a lower level. This can be seen in Table 1, where point 2 is included in the rejection region when α = 0.05 but is excluded from it when α = 0.1.
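The hypergeometric calculations behind these regions can be reproduced in the same spirit. The sketch below (illustrative only) tabulates the null distribution of X_1 for the 4-cup and 13-cup designs and the sizes of the regions mentioned above:

```python
from math import comb

def hyper_pmf(n_each):
    """Null pmf of X1 when n_each cups of each kind are served and the lady
    must label exactly n_each of them as 'before' (Fisher exact setting)."""
    total = comb(2 * n_each, n_each)
    return [comb(n_each, k) * comb(n_each, n_each - k) / total
            for k in range(n_each + 1)]

def region_size(pmf, region):
    return sum(pmf[k] for k in region)

pmf4 = hyper_pmf(4)
print("4 cups each, {0, 4}:", round(region_size(pmf4, [0, 4]), 4))            # ~0.0286
print("4 cups each, {0}   :", round(region_size(pmf4, [0]), 4))               # ~0.0143

pmf13 = hyper_pmf(13)
print("13 cups, IP {0,1,2,4,12,13}      :",
      round(region_size(pmf13, [0, 1, 2, 4, 12, 13]), 4))                     # ~0.0498
print("13 cups, LRT one-sided {0,1,2,3} :",
      round(region_size(pmf13, [0, 1, 2, 3]), 4))                             # ~0.0085
print("13 cups, two-sided {0,..,3,11,12,13}:",
      round(region_size(pmf13, [0, 1, 2, 3, 11, 12, 13]), 4))                 # ~0.0091
print("13 cups, alternative IP region {0,1,2,9,12,13}:",
      round(region_size(pmf13, [0, 1, 2, 9, 12, 13]), 4))                     # same size as the IP region
```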

3. A Comparative Review of Old and New Ideas in Multiple Testing of Discrete Distributions

The development and use of procedures for multiple testing with discrete distributions are even more imperative than for single testing, since in multiple testing the gain in power is much more crucial. The next sections elaborate on multiple comparison methods for which the sampling distribution is discrete.

3.1 Single Step Methods

Consider an animal carcinogenicity experiment that includes J + 1 groups, j = 0, 1, ..., J, where group j = 0 is the control group. In each group there are n_j animals. We define n_ji to be the number of animals in group j whose i-th organ (i = 1, ..., I) was available for histopathological examination, and x_ji to be the number of these n_ji animals in which a tumor was discovered in the i-th organ. The purpose of the experiment is to determine whether each of the J experimental groups differs from the control group in the rates of tumor occurrence at one or more of the I sites. This problem is described in the literature as a multiple testing problem. In general, the problem involves a family of hypotheses H_01, ..., H_0N (with alternatives H_11, ..., H_1N). The hypotheses are tested simultaneously and a multiple-level error rate has to be controlled. A valid procedure for this problem maintains strong control of the familywise error rate (FWE) at its nominal level α, i.e., the probability of rejecting at least one true H_0i (i = 1, ..., N) is at most α, no matter which and how many of the H_0i are true (Hochberg and Tamhane, 1987).

A simple way to address the problem is the Bonferroni method, which rejects all hypotheses with p-values less than or equal to α/n. If the underlying distribution is continuous, the p-values are uniformly distributed on [0, 1] under the null hypotheses. For discrete test statistics, however, there exists a smallest attainable p-value α_i* for each hypothesis. Gart, Chu and Tarone (1979) noted that the number of significance tests can be reduced by eliminating those tests for which the smallest attainable p-value is higher than α (α_i* > α). Tarone (1990) improved this idea by noting that even for hypotheses with α/n < α_i* < α, rejection may never be possible. At each of the I sites a significance test can be performed (the sites are indexed by i). For each integer k, define R_k = {H_i : α_i* ≤ α/k} (the set of sites satisfying k·α_i* ≤ α) and m(k) = |R_k|, where α is the nominal significance level and α_i* is the minimum achievable level at site i. Thus m(1) is the number of sites that can be rejected at the nominal level. If m(1) > 1, a correction for multiple comparisons must be considered. Gart et al. (1979) and Mantel (1980) noted that the denominator in the Bonferroni test can be reduced from I to m(1). In many cases the correction factor can be reduced further.

Claim 0: For any integer k < m(1), m(k+1) ≤ m(k), and m[m(1)] ≤ m(1).
This can be seen quite easily, since if H_i ∈ R_{k+1} then H_i ∈ R_k (H_i ∈ R_{k+1} ⟺ α_i* ≤ α/(k+1) ⟹ α_i* ≤ α/k ⟺ H_i ∈ R_k); thus R_{k+1} ⊆ R_k. By the same reasoning, m[m(1)] ≤ m(1).

From Claim 0 it follows that if the correction factor is m(1), there may still exist some H_i with α_i* > α/m(1); such hypotheses can never be rejected, whatever their p-values. By excluding those hypotheses, the correction factor can be reduced further, until we reach the smallest number k such that m(k) ≤ k.

Define K to be the smallest value of k such that m(k) ≤ k. This reduction only has an effect when dealing with discrete data, since with continuous data m(1) = m(2) = ... = m(I) = I, so that K = I and the usual Bonferroni method is applied. The values of K and R_k can be determined using only the information in the marginal totals. Tarone's procedure rejects H_0i if and only if H_0i is contained in R_K and p_i ≤ α/K, where p_i is the observed significance level at site i. From this it follows that Pr(reject at any true site) ≤ Σ_{R_K} Pr(reject at site i) ≤ m(K)·α/K ≤ α. Define α_i as the largest achievable significance level at site i such that α_i ≤ α/K, for i = 1, ..., m(1). Using the above modified Bonferroni, we see that Σ_{R_K} Pr(reject at site i) = Σ_{R_K} α_i < α, except when m(K) = K and α_i = α/K for all i in R_K. When Σ_{R_K} α_i is considerably less than α, Tarone suggested expanding the critical region of each significance test, using marginal information, until the largest possible rejection region is obtained (for instance, by adding the tail outcome of smallest probability not yet included in the rejection region).

Unfortunately, Tarone's procedure (T) lacks AC (Roth 1998). The following example demonstrates the lack of AC in T. Suppose n = 5, α_i* = 0.002, 0.024, 0.029, 0.029, 0.07, and p_1 = p_2 = 0.024. At level α = 0.09, K = 4, so that the critical value is 0.09/4 = 0.0225 and none of the hypotheses is rejected; but at the α = 0.05 level, K = 2 and the critical value is 0.025, so we are able to reject hypotheses 1 and 2.

Roth (1998) developed procedure T*, which modifies T, achieving AC while simultaneously increasing the power; T* maintains strong control of the FWE at level α. The procedure rejects all H_0i such that p_i ≤ α/K*, where M = {x ≥ 1 : m(x) ≤ x} and K* = inf{x ∈ M}. A simple way to construct T* in practice is to arrange the smallest attainable p-values in increasing order, α*_(1) ≤ ... ≤ α*_(n); if m(K) = K then K* = K, and otherwise K* = α/α*_(K). Roth showed that T* has AC and that it is a universal improvement over T. When one tries to redefine this procedure for the FWE, by simply defining R_j = {H_i : α_i* ≤ α/j} and redefining m(j), K, M and K* in terms of the new R_j's, the resulting T* procedure is no longer valid. This can be demonstrated using the previous example with α = 0.058: K = 3 and K* = 0.058/0.029 = 2, but T* rejects {H_0i : p_i ≤ 0.029}. Since T* is based on Bonferroni, and since there are four hypotheses that might be rejected by this rule (i.e. m(K*) = 4 > K*), the FWE can potentially be as large as 0.029·4 = 0.116 > 0.058, so that the validity disappears.
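A small sketch of Tarone's correction factor K and Roth's K*, using the definitions as described above (a minimal illustration, with the comparisons taken as non-strict), reproduces the lack of alpha consistency of T on the minimal attainable p-values of this example: the per-test critical value α/K equals 0.0225 at α = 0.09 but 0.025 at α = 0.05.

```python
def tarone_K(alpha_star, alpha):
    """Tarone's correction factor: the smallest k with m(k) <= k,
    where m(k) = #{i : alpha_star[i] <= alpha/k}."""
    n = len(alpha_star)
    for k in range(1, n + 1):
        if sum(a <= alpha / k for a in alpha_star) <= k:
            return k
    return n

def roth_K_star(alpha_star, alpha):
    """Roth's alpha-consistent correction factor K*."""
    K = tarone_K(alpha_star, alpha)
    m_K = sum(a <= alpha / K for a in alpha_star)
    if m_K == K:
        return K
    return alpha / sorted(alpha_star)[K - 1]        # K* = alpha / alpha*_(K)

# minimum attainable p-values from the example in the text
alpha_star = [0.002, 0.024, 0.029, 0.029, 0.07]
for a in (0.09, 0.05):
    K = tarone_K(alpha_star, a)
    print("alpha =", a, "| K =", K, "| alpha/K =", round(a / K, 4),
          "| K* =", round(roth_K_star(alpha_star, a), 3))
```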

Therefore, the procedure has to be modified and adapted to meet the FWE criterion. The major problem of T* under the FWE criterion arises in those cases where m(K*) > K*. Hommel and Krummenauer (1998) and Roth (1999) solved this obstacle by redefining T* as follows: when m(K*) ≤ K*, reject {H_0i : p_i ≤ α/K*}, and when m(K*) > K*, reject {H_0i : p_i ≤ α/K* and α_i* < α/K*}. Roth also suggested another procedure, named T_k, which rejects {H_0i ∈ R_K : p_i ≤ α/m(K)}. T_k does improve T and is more powerful than it (since m(K) ≤ K), but it still lacks AC. Furthermore, it is more powerful than T* in special cases with m(K) < K, namely when its critical value α/m(K) exceeds T*'s critical value α/K*. Nevertheless, T_k is not universally more powerful than T*, since T* may reject hypotheses that are outside of R_K.

Westfall and Wolfinger (W & W, 1997) suggested a different approach, based on the full set of possible values of each P_i rather than only on the minimum attainable p-value α_i*. They defined the adjusted p-value of hypothesis j as p̃_j = Pr(min_i P_i ≤ p_j), where the P_i are the random p-values under their null hypotheses. This test is widely used in the analysis of toxicology data (Heyse & Rom 1988). The justification for using min(P_i) is that it measures the degree of surprise that an analyst should experience after isolating the smallest p-value from a long list of p-values calculated from a given data set; an additional justification is that the adjusted p-values are always on the same scale. If we define p_i (i = 1, ..., k) as the observed p-values of the given tests, then, the test statistics being discrete, the possible values of the random p-value P_i are {p_it : t = 1, ..., m_i} (m_i is the number of attainable values of the i-th test statistic), where Pr(P_i ≤ p_it) = p_it. The adjusted value p̃_j is the probability of observing a p-value as small as p_j anywhere in the study when all null hypotheses are true. Using the discreteness,

p̃_j = 1 - Π_{i=1..k} (1 - p_it(j)),  where p_it(j) = max_t {p_it : p_it ≤ p_j} if min_t {p_it} ≤ p_j, and p_it(j) = 0 otherwise.

For each hypothesis, the procedure computes its adjusted p-value and compares it to the FWE level α. The procedure assumes independence between the tests, thus making the method rather conservative, although less so than the Bonferroni method. The simplest way to bound the true values of p̃_j is to use the Bonferroni inequality; the discrete Bonferroni adjusted p-values are p̃_j = min{Σ_{i=1..k} p_it(j), 1}.
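A minimal sketch of the single step W & W adjustment under the independence assumption, together with its discrete Bonferroni bound, is given below; the attainable p-value sets used in the demonstration are hypothetical:

```python
def ww_adjusted(attainable, observed):
    """Single-step Westfall-Wolfinger discrete adjusted p-values (sketch).

    attainable[i]: attainable p-values of test i, with Pr(P_i <= p) = p for each value.
    observed[j]:   observed p-value of test j.
    Returns (independence-based adjustments, discrete Bonferroni adjustments)."""
    adj_indep, adj_bonf = [], []
    for p_j in observed:
        # largest attainable value of each P_i not exceeding p_j (0 if none exists)
        caps = [max((p for p in vals if p <= p_j), default=0.0) for vals in attainable]
        prod = 1.0
        for c in caps:
            prod *= (1.0 - c)
        adj_indep.append(1.0 - prod)             # Pr(min_i P_i <= p_j) under independence
        adj_bonf.append(min(sum(caps), 1.0))     # discrete Bonferroni bound
    return adj_indep, adj_bonf

# hypothetical attainable p-value sets for three independent discrete tests
attainable = [[0.01, 0.05, 0.20, 1.0],
              [0.03, 0.12, 0.55, 1.0],
              [0.02, 0.08, 0.40, 1.0]]
observed = [0.05, 0.03, 0.08]
print(ww_adjusted(attainable, observed))
```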

3.2 A Newly Proposed Single Step Method

We propose a new method, TWW_k, that controls the FWE and incorporates the discreteness of the distribution. The method applies the W & W adjustment to the set defined by T_k: TWW_k rejects {H_0i ∈ R_K : p̃_i ≤ α}, where p̃_i = Pr(min{P_j : H_0j ∈ R_K} ≤ p_i). This method controls the FWE:

Pr(reject at least one true H_0i)
= Pr(min_{1≤j≤m(K)} {1 - (1 - P_j)^m(K)} ≤ α)
= 1 - Pr(min_{1≤j≤m(K)} {1 - (1 - P_j)^m(K)} > α)
= 1 - Pr(P_j > 1 - (1 - α)^{1/m(K)} for all true null j)
= 1 - Π_{j=1..m(K)} Pr(P_j > 1 - (1 - α)^{1/m(K)})   (under independence)
≤ 1 - {(1 - α)^{1/m(K)}}^{m(K)} = α   (with equality when P_j ~ U[0, 1]).

This method is tested against the existing methods in the following chapters.
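Combining the two previous sketches gives a minimal illustration of the proposed TWW_k procedure: Tarone's K determines R_K, and the W & W adjustment is then applied within R_K only (independence of the tests is assumed, as in the argument above).

```python
def tarone_K(alpha_star, alpha):
    """Smallest k with m(k) <= k, where m(k) = #{i : alpha_star[i] <= alpha/k}."""
    n = len(alpha_star)
    for k in range(1, n + 1):
        if sum(a <= alpha / k for a in alpha_star) <= k:
            return k
    return n

def tww_k(attainable, observed, alpha):
    """Sketch of the proposed TWW_k procedure for independent discrete tests:
    restrict attention to R_K and apply the W & W adjustment within that subset."""
    alpha_star = [min(vals) for vals in attainable]
    K = tarone_K(alpha_star, alpha)
    R_K = [i for i, a in enumerate(alpha_star) if a <= alpha / K]
    rejected = []
    for i in R_K:
        caps = [max((p for p in attainable[j] if p <= observed[i]), default=0.0)
                for j in R_K]
        surv = 1.0
        for c in caps:
            surv *= (1.0 - c)
        if 1.0 - surv <= alpha:    # adjusted p-value Pr(min over R_K <= p_i)
            rejected.append(i)
    return K, R_K, rejected
```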

3.3 Comparison between Single Step Multiple Hypothesis Procedures

This section is devoted to a comparison between the methods suggested so far. Some of the methods are universally more powerful than others; for other pairs of methods there exist situations in which each one outperforms the other.

Claim 1: T* is universally more powerful than T.
T rejects H_0i when p_i ≤ α/K. T* rejects H_0i when p_i ≤ α/K* if m(K*) ≤ K*, and rejects {H_0i : p_i ≤ α/K* and α_i* < α/K*} if m(K*) > K*.
1) If m(K) = K, then K* = K and T = T*.
2) If m(K) < K, then m(K) ≤ K - 1, so α/K < α*_(K) ≤ α/(K - 1) and K* = α/α*_(K); T* then rejects whenever p_i ≤ α*_(K), while T rejects only when p_i ≤ α/K < α*_(K). Hence T* is universally more powerful than T.

Claim 2: T_k is universally more powerful than T.
T can reject H_0i only when p_i ≤ α/K and H_0i ∈ R_K; T_k rejects H_0i when p_i ≤ α/m(K) and H_0i ∈ R_K. Since m(K) ≤ K, α/K ≤ α/m(K), so T_k is universally more powerful than T.

Claim 3: Neither of T_k and T* is universally more powerful than the other.
A simple example demonstrates this claim. Suppose we have four hypotheses whose minimal attainable p-values are (0.01, , 0.012, 0.015); K in this case equals 4, m(K) = 3, and K* = 0.05/0.015 ≈ 3.33. If the p-values attained under the H_0i are (0.016, 0.016, 0.016, 0.016), then T_k rejects {H_01, H_02, H_03} while T* rejects none of the hypotheses, when α = 0.05. If we change the attained p-values to (0.015, 0.012, 0.01, 0.015), keeping the same minimal attainable p-values and the same significance level, all of the hypotheses are rejected by T*, while T_k still rejects only {H_01, H_02, H_03}.

Claim 4: The Westfall & Wolfinger (W & W) method is universally more powerful than T*.
Consider first the case of independence:
1) The adjusted p-values devised by H & K are p̃_i = min{1, q(p_i)·p_i}, where q = q(p_i) is defined by α*_(q) ≤ p_i < α*_(q+1).
2) The adjusted p-values devised by W & W satisfy p̃_i = 1 - Π_j (1 - p_jt(i)) ≤ 1 - (1 - p_i)^q (q has the same meaning as in 1). The inequality holds because at most q of the P_j can take values smaller than or equal to p_i.
3) If the H & K adjusted p-value equals 1, then since 1 - (1 - p_i)^q < 1, the W & W adjusted p-value is smaller.
4) If the H & K adjusted p-value equals q(p_i)·p_i = q·p_i, then the W & W adjusted p-value is at most 1 - (1 - p_i)^q. By a Taylor expansion, 1 - (1 - p_i)^q = q·p_i - 0.5·q(q - 1)(1 - ξ)^{q-2}·p_i^2 for some ξ between 0 and p_i, which is smaller than q·p_i. Hence the W & W adjusted p-values are smaller than those of H & K, so the W & W method is more powerful than the T* of H & K and Roth.
When the p-values are dependent, W & W uses p̃_i = Σ_{j=1..k} p_jt(i) ≤ q·p_i, again no larger than the H & K value, so one can conclude that W & W is universally more powerful than T*.

Claim 5: TWW_k is universally more powerful than T_k.
This is shown in a similar way to the proof of the previous claim. The adjusted p-value of the T_k method is p̃_i = m(K)·p_i for {i : H_0i ∈ R_K}. The adjusted p-values of TWW_k for {i : H_0i ∈ R_K} are p̃_i ≤ 1 - (1 - p_i)^m(K) under independence and p̃_i = min{Σ_{j ∈ R_K} p_jt(i), 1} otherwise. By the arguments above, p̃_i ≤ m(K)·p_i for all {i : H_0i ∈ R_K}, so TWW_k is universally more powerful than T_k.

Claim 6: None of W & W and TWW_k/T_k is universally more powerful than the others.
This can be demonstrated by the following example. There are hypotheses H_0i, i = 1, ..., 4, whose minimum achievable levels are 0.01, 0.03, 0.05, 0.05 respectively. Suppose the p-values achieved in the experiment were 0.05, 0.05, 0.05, 0.05. Using TWW_k/T_k with α = 0.05, K = 2 and m(K) = 1, so we reject H_01, since its p-value is smaller than or equal to α and H_01 ∈ R_K. Using the W & W method, the adjusted p-value of each hypothesis is 1 - (1 - 0.05)^4 = 0.185, so none of the hypotheses is rejected at the 0.05 level. If we change the p-values achieved in the experiment slightly, to p_1 = 0.01, p_2 = 0.03, p_3 = 0.05, p_4 = 0.05, and assume that the values attainable under H_01 are {0.01, 0.02, 0.04}, the results of TWW_k/T_k do not change and H_01 is still the only rejection; however, the W & W adjusted p-values are 0.01 for H_01 and 0.0494 for H_02, so both H_01 and H_02 are rejected at α = 0.05. Thus none of these procedures is universally more powerful than the other.

3.4 Stepwise Procedures:

Stepwise methods provide a further increase in the power of multiple testing methods. These techniques are not unique to discrete distributions, but they need to be mentioned since they improve the power of the multiple hypothesis tests. The procedure suggested by Westfall and Wolfinger can easily be adapted to stepwise analysis: the p-values are adjusted using the step-down technique, the smallest p-value being adjusted according to the distribution of min(P_i) over all tests, the second smallest according to the min(P_i) distribution of all the variables excluding the one whose unadjusted p-value was smallest, and so on (see the sketch below).
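A minimal sketch of this step-down W & W adjustment, again under the independence assumption, with monotonicity of the adjusted values enforced:

```python
def ww_stepdown(attainable, observed, alpha):
    """Step-down Westfall-Wolfinger sketch for independent discrete tests.

    The smallest observed p-value is adjusted by the distribution of the minimum
    over all tests; the next smallest over the remaining tests, and so on."""
    order = sorted(range(len(observed)), key=lambda i: observed[i])
    adjusted, prev = {}, 0.0
    remaining = list(order)
    for i in order:
        surv = 1.0
        for j in remaining:
            cap = max((p for p in attainable[j] if p <= observed[i]), default=0.0)
            surv *= (1.0 - cap)
        prev = max(prev, 1.0 - surv)       # enforce monotone step-down adjustments
        adjusted[i] = prev
        remaining.remove(i)                # drop the hypothesis just processed
    return adjusted, [i for i in order if adjusted[i] <= alpha]
```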

Hommel and Krummenauer (H & K, 1998) developed another step-down procedure, similar to Holm's (1979) Bonferroni test but incorporating Tarone's discrete methods (T, T*). This procedure is named TH*:
1) Set I = {1, ..., n}.
2) For j = 1, ..., #I define m_I(α, j) = #{i ∈ I : α_i* ≤ α/j}, the number of hypotheses with indices i ∈ I that can be rejected at level α/j; K_I(α) = min{j = 1, ..., #I : m_I(α, j) ≤ j}; and b_I(α) = α/K_I(α).
3) For i ∈ I, reject H_i iff p_i ≤ b_I(γ) for some 0 < γ ≤ α.
4) Let J be the index set of all hypotheses rejected in step 3.
5) If J is empty, stop; otherwise set I = I - J and return to step 2.
For the practical performance of the third step for a specific I = {i_1, ..., i_t}, one can apply the single step methods suggested by Roth (1999) and H & K (1998) described in Section 3.1. Both of the latter procedures use the step-down technique.

Roth (1998) described a step-up procedure R based on Hochberg's procedure (H). Procedure R is composed of two components: procedure L, which is closely related to H, and a component procedure C. R rejects H_i if it is rejected by either L or C.
Procedure L:
1) Accept all the hypotheses that are not in R_1 = {H_i : α_i* ≤ α}.
2) Order the p-values of the hypotheses in R_1 from highest to lowest, p_(1), ..., p_(t).
3) Let Q = {j : p_(j) < α/j, H_(j) ∈ R_1}; define q = min{j ∈ Q}.
4) Reject all H_i ∈ R_1 such that p_i < α/q.
Procedure C:
1) Consider only the {H_i ∈ R_K}; order their p-values from highest to lowest, q_(1), ..., q_(m(K)); if m(K) < K, set q_(i) = 0 for i = m(K), ..., K.
2) For j = 1, ..., K define p*_j = max{{q_(j)} ∪ {p_i : H_i ∈ R_j - R_K}}.
3) Let W = {j : p*_j < α/j}; define w = min{j ∈ W}.
4) Reject H_i if p_i < α/w.
Roth showed that procedure R is valid if H is valid for all subsets of R_1 of size q*, where q* is defined as the larger of m(K) and max({0} ∪ {i = 1, ..., K-1 : M_i is not empty}). The validity of R thus requires weaker assumptions than those required for H: pairwise independence suffices when q* = 2, and this can be extended to independence of subgroups of size q* for q* > 2; similarly, pairwise TP2 (simply, positive correlation) suffices for q* = 2, and it is a conjecture that this notion can be extended to subgroups of size q* when q* > 2. Roth suggested another variation of R (RMOD), which generalizes Rom's (1990) procedure instead of Hochberg's; the arguments for preferring either of the variations (Rom's or Hochberg's procedure) are analogous to the continuous case. Like T and T_k, R and RMOD also lack AC.

3.5 A Newly Proposed Stepwise Method

Using the mechanism described in Section 3.2, one can apply the W & W stepwise method to the p-values of the hypotheses that belong to R_K. This method has properties similar to those of TWW_k (lack of AC; universally more powerful than T_k and T), but it has higher power, since a stepwise method is used rather than a single step one.

3.6 Comparison between Stepwise Multiple Hypothesis Methods

In this section a comparison between stepwise methods is performed. A comparison between single step methods and stepwise methods is not performed, since each single step method has a matching stepwise method that is more powerful. However, there exist situations in which one type of single step method is more powerful than another type of stepwise method.

Claim 7: The stepwise Westfall & Wolfinger (W & W) method is universally more powerful than TH*.
Using the same technique described for the single step procedures, it can be shown at each stage of the stepwise method that the W & W stepwise method is universally more powerful than both the H & K and the Roth T* methods.

Claim 8: None of Roth's R/RMOD method, the W & W stepwise method and stepwise TWW_k is universally more powerful than the others.
This can be seen using the following simple examples. Suppose we have four independent hypotheses whose minimal attainable p-values are (0.03, 0.015, , 0.002), and the p-values attained under the H_0i are (0.06, 0.026, 0.02, 0.023). Using Roth's R/RMOD method, it is clear that the third and fourth hypotheses are rejected when α = 0.05. Suppose that all the hypotheses except the first can attain a p-value of 0.02; then, if stepwise W & W or stepwise TWW_k is used, none of the hypotheses is rejected. If we change the problem a bit and assume that the maximum attainable p-value smaller than 0.02 for the second and fourth hypotheses is 0.015, then the second, third and fourth hypotheses are rejected using either stepwise W & W or stepwise TWW_k, while Roth's R/RMOD method still rejects only the third and fourth hypotheses. A head-on comparison between stepwise W & W and stepwise TWW_k shows that neither method is universally more powerful than the other; this can be demonstrated by examples similar to those given for the single step methods.

Claim 9: None of Roth's R/RMOD method, the W & W stepwise method and T_k/TWW_k is universally more powerful than the others.
In Claim 8 it was demonstrated that Roth's R/RMOD is not universally more powerful than the W & W stepwise method and vice versa. We now demonstrate that Roth's R/RMOD method and T_k/TWW_k do not dominate each other. Suppose we have four independent hypotheses whose minimal attainable p-values are (0.01, , 0.012, 0.015); K in this case equals 4 and m(K) = 3. The p-values attained under the H_0i are (0.06, 0.026, 0.016, 0.023). When α = 0.05, Roth's R/RMOD method rejects none of the hypotheses while T_k/TWW_k rejects {H_03}. If we change the attained p-values under the H_0i to (0.045, 0.026, 0.016, 0.023), keeping the same minimal attainable p-values and the same significance level, Roth's R/RMOD method rejects all four hypotheses while T_k rejects only {H_03}. In order to compare the W & W stepwise method to T_k/TWW_k, we use the same example as in Claim 6: in its first part, single step W & W rejects none of the hypotheses, and since the first step of the step-down method coincides with the single step adjustment of the smallest p-value, stepwise W & W also rejects none of the hypotheses, whereas T_k/TWW_k rejects H_01. In the second part of the example, single step W & W rejects H_01 and H_02; since stepwise W & W is more powerful than W & W, it rejects at least H_01 and H_02, and is thus more powerful than T_k/TWW_k there.

Claim 10: Neither of T_k/TWW_k and TH* is universally more powerful than the other.
Using the same example as in Claim 3, it is easily seen that when the p-values attained under the H_0i are (0.016, 0.016, 0.016, 0.016), T_k/TWW_k rejects {H_01, H_02, H_03} while TH* rejects none of the hypotheses, at significance level 0.05. If we change the attained p-values to (0.015, 0.012, 0.01, 0.015), keeping the same minimal attainable p-values and the same significance level, all of the hypotheses are rejected by TH*, while T_k/TWW_k still rejects only {H_01, H_02, H_03}.

Claim 11: Stepwise TWW_k is not universally more powerful than either TH* or T*.
Using the example described in Claims 3 and 10: since stepwise TWW_k is universally more powerful than T_k/TWW_k, it rejects {H_01, H_02, H_03} in the first part of the example, while TH* and T* reject none of the hypotheses; in the second part of the example stepwise TWW_k still rejects only {H_01, H_02, H_03}, while TH* and T* reject all of the hypotheses.

The results of the statistical power comparisons of all the single step and stepwise methods suggested so far are schematically depicted in Fig. 1.

3.7 Global Hypothesis Testing:

In evaluating the results of several hypothesis tests, the first step is usually to test the global null hypothesis H_0 = ∩_{i=1..m} H_0i. Rejection of the global hypothesis leads to the conclusion that at least one of the individual hypotheses is false, and by using a stepwise method with global hypothesis testing one can reach conclusions on all of the individual hypotheses. Rom (1992) devised a procedure for rejecting the global null hypothesis with discrete distributions. He took advantage of the discreteness of the joint distribution by evaluating the probability

Pr[{P_(1) < p_(1)} or {P_(1) = p_(1), P_(2) < p_(2)} or ... or {P_(1) = p_(1), P_(2) = p_(2), ..., P_(n-1) = p_(n-1), P_(n) ≤ p_(n)}],

where p_(i) are the ordered p-values observed in the tests. This probability is the overall significance of the observed p-values and is compared to α for testing the global null hypothesis. It is smaller than or equal to the bound based on Σ Pr(P_(i) ≤ p_(i)) that is used in the Bonferroni procedure and some of its modifications. The multiple test procedure can also be put in the following simple form: reject the global hypothesis if

{p_(1) < c_1} or {p_(1) = c_1, p_(2) < c_2} or ... or {p_(1) = c_1, p_(2) = c_2, ..., p_(n) ≤ c_n}.

The critical points c_1, ..., c_n can be computed exactly if the underlying distribution is known, or via Monte Carlo. As can easily be seen, Rom actually calculates the exact probability that P_(1) ≤ p_(1) (the min(P_i) probability). In order to reach conclusions on the individual hypotheses, we can use the method suggested by W & W and thus obtain a shortcut of the full closure test.
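Rom's overall significance level can be approximated by simulation when the joint null distribution is not tractable in closed form. The sketch below is a rough Monte Carlo illustration that assumes independent test statistics whose attainable p-values (with the largest value equal to 1) are supplied by the user:

```python
import random

def rom_global_p(attainable, observed, n_sim=100_000, seed=0):
    """Monte Carlo estimate of Rom's (1992) overall significance level.

    attainable[i]: sorted attainable p-values of test i, Pr(P_i <= p) = p for each,
                   with the largest attainable value equal to 1.
    observed:      list of observed p-values (independent tests assumed)."""
    rng = random.Random(seed)
    obs = sorted(observed)
    n, hits = len(obs), 0
    for _ in range(n_sim):
        sim = []
        for vals in attainable:
            u = rng.random()
            # P_i equals the smallest attainable value v with u <= v
            sim.append(next(v for v in vals if u <= v))
        sim.sort()
        for k in range(n):               # lexicographic comparison of ordered vectors
            if sim[k] < obs[k]:
                hits += 1
                break
            if sim[k] > obs[k]:
                break
        else:
            hits += 1                    # all coordinates equal: last term uses <=
    return hits / n_sim
```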

H & K suggested a different method for testing the global hypothesis with discrete distributions. Their method is based on the Rüger test (Rüger 1978): choose an integer s, 1 ≤ s ≤ n, in advance, and reject H_0 iff p_(s) ≤ sα/n. The method applies the same principle as T* to the Rüger test. H & K defined m_s(α, j) = #{i : α_i* ≤ sα/j}, K_s(α) = min{j = s, ..., n : m_s(α, j) ≤ j} and b_s(α) = sα/K_s(α). The rejection rule is: reject H_0 iff p_(s) ≤ b_s(γ) for some 0 < γ ≤ α. The algorithm can be constructed in the following way:
1) Choose r, s ≤ r ≤ n, such that either r = s and 0 < α ≤ (s·α*_(s+1))/s, or s < r < n and ((r-1)·α*_(r))/s < α ≤ (r·α*_(r+1))/s, or r = n and ((n-1)·α*_(n))/s < α ≤ 1.
2) Then K_s(α) = r and b_s(α) = sα/r.
3) Reject H_0 iff p_(s) ≤ sα/r or p_(s) < α*_(r).
This procedure does not provide any means of making decisions on individual hypotheses; one can apply the test within the full closure test (Marcus et al. 1976) to all intersection hypotheses.

3.8 A Newly Proposed Global Hypothesis Test

To test the global null hypothesis, an expansion of Paroush's IP method can be applied. Define the vector P of all joint probabilities P_{i1,...,in} = Pr(i_1 = a_1, ..., i_n = a_n) under the null hypothesis, and similarly Q_{i1,...,in} = Pr(i_1 = a_1, ..., i_n = a_n) under the alternative hypothesis. Define a vector X of length N (N = the size of the joint sample space). The test is

max Σ_i Q_i X_i   subject to   Σ_i P_i X_i ≤ α   and   X_i ∈ {0, 1} for all i = 1, ..., N.

Sample point k is in the rejection region of the test if and only if the associated X_k is set to 1 in the optimal solution. As with the single hypothesis test, if there is no a-priori knowledge of Q and a composite (one-sided/two-sided) hypothesis is being tested, then by defining Q = P a level-α test is achieved. In order to make decisions on individual hypotheses, the full closure test must be applied to all intersection hypotheses. A small shortcut can be made using the following method:
1. Run the procedure on the global null hypothesis; the result is a group of points in the n-dimensional space. Define this group of points as T = {(t_1, ..., t_n) : (t_1, ..., t_n) was chosen by the IP method}.
2. When running the IP procedure on each intersection hypothesis H_0i1 ∩ ... ∩ H_0iK, apply it only to the points X = (x_i1, ..., x_iK) for which there is at least one point t = (t_1, ..., t_n) ∈ T such that x_i1 = t_i1, x_i2 = t_i2, ..., x_iK = t_iK.
When applying the IP algorithm, one may obtain several groups of points that control the same level; only one group of points is chosen for the test, and the question is which group should be used. One approach, when applying this algorithm to a composite hypothesis, is to choose the group with the larger number of points. The reason for this approach is that under the null hypothesis these groups have the same probability of being rejected, whereas under the alternative the larger group might carry a higher probability and thus yield higher power. Such a group can be obtained easily using a small change in the IP objective:

max Σ_i (Q_i + ε)·X_i   subject to   Σ_i P_i X_i ≤ α   and   X_i ∈ {0, 1} for all i = 1, ..., N,

where ε equals the smallest absolute value of the differences between all pairs of Q_i, divided by N. A major disadvantage of this procedure is the great amount of time and computing power needed to carry out the calculation at each step of the closure. The burden increases exponentially as the number of hypotheses increases, and when the hypotheses are dependent it grows much faster still (depending on the computer configuration and the software used) (Fig. 2, Fig. 3).
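The tie-breaking modification can be mimicked in the brute-force setting of Section 2; the sketch below (an illustration only, feasible just for very small sample spaces) maximizes Σ Q_i X_i plus ε times the number of selected points, with ε computed as described above from the smallest nonzero difference between pairs of Q_i:

```python
from itertools import product

def ip_region_eps(p0, p1, alpha):
    """Brute-force version of the epsilon-modified IP objective:
    among regions of size <= alpha, larger regions win ties in power."""
    n = len(p0)
    diffs = [abs(a - b) for i, a in enumerate(p1) for b in p1[i + 1:] if a != b]
    eps = min(diffs) / n if diffs else 0.0
    best = None
    for x in product([0, 1], repeat=n):          # exponential: only for tiny N
        size = sum(p for p, keep in zip(p0, x) if keep)
        if size > alpha:
            continue
        score = sum(q for q, keep in zip(p1, x) if keep) + eps * sum(x)
        if best is None or score > best[0]:
            best = (score, size, [i for i, keep in enumerate(x) if keep])
    return best
```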

4. Applications of the Multiple Testing Procedures

This section includes several experiments in which discrete multiple testing methods should be used. All of the discrete multiple testing methods are applied, excluding IP + closure and the H & K improvement of the Rüger test + closure. IP + closure cannot generally be used because of the great amount of time and computer memory it requires (it was applied only in Example 3). The Rüger test + closure was not applied because of the relatively arbitrary choice of s needed at each of the closure steps. In order to compute the exact distribution of the p-values for the TWW_k and W & W methods in Examples 3 and 4, a permutation method (described by Westfall and Young, 1993) was applied using 1,000 resampled data sets.

4.1 Example 1: cDNA Transcripts

This data set was reported in Tarone (1990). In this experiment, complementary DNA (cDNA) transcripts are produced from transcribed RNA obtained from cells grown under normal conditions and from cells grown under unusual conditions. The cDNA transcripts from a gene of interest are sequenced and compared to a known nucleotide sequence in order to determine the number of nucleotide changes in each transcript. The frequencies of the nucleotide changes are compared between transcripts from the control and the study cells, to determine whether the transcribed RNA in the study cells differs from that in the control cells. The known sequence may be many nucleotides in length, so that a multiple comparison problem has to be addressed. Let N_0i be the number of transcripts in the control group, N_1i the number of transcripts in the study group, X_0i the number of observed nucleotide changes at position i in the control group, and X_1i the corresponding number in the study group. The p-values are calculated using the one-sided Fisher exact test.

[Table: for each of the nine nucleotide positions, the observed proportions X_0i/N_0i and X_1i/N_1i, the minimal attainable p-value, and the attained p-value of the one-sided Fisher exact test.]

The hypotheses rejected by the different methods at the different significance levels are summarized in the table below:

Method              α = 0.01     At the two larger levels
Tarone's T          {}           {H_04}
T*                  {}           {H_04}
TH*                 {}           {H_04}
W & W               {H_04}       {H_04}
Stepwise W & W      {H_04}       {H_04}
T_k                 {}           {H_04}
Roth's R            {}           {H_04}
Roth's RMOD         {}           {H_04}
TWW_k               {H_04}       {H_04}
Stepwise TWW_k      {H_04}       {H_04}

{} indicates that none of the hypotheses was rejected. The newly proposed methods (TWW_k and stepwise TWW_k) are equal or superior to all the others at the 0.01 level of significance.

4.2 Example 2: Animal Carcinogenicity Test

This example was also reported in Tarone (1990); it describes an animal experiment that tested the carcinogenicity of a compound. Several organs and tissues were examined for the presence of tumors. The experiment included three groups, control (0), low dose (1) and high dose (2), with equally spaced doses. The number of observed tumors was recorded for each type group [animal (mouse, rat), gender (male, female), and tumor site]. A trend statistic of the form T_j = X_0j·0 + X_1j·1 + X_2j·2 was defined, where X_ij is the number of observed tumors at dose group i and type group j. Upper-tailed p-values were computed for each type group using Fisher's exact statistics.

[Table: tumor counts X_ji/N_ji in the control, low dose and high dose groups, together with the minimal attainable and attained p-values, for each type group. Male rat: subcutaneous tissue, liver, kidney, thyroid (follicular, C-cell, all), pituitary, pancreatic islets, hematopoietic. Female rat: liver, kidney & renal pelvis, pituitary, thyroid (follicular, C-cell, all), mammary, uterus. Male mouse: lung, liver, kidney, hematopoietic. Female mouse: liver, multiple organs.]

The results are as follows. All ten procedures (Tarone's T, T*, TH*, W & W, stepwise W & W, T_k, Roth's R, Roth's RMOD, TWW_k and stepwise TWW_k) gave the same results: at the smallest significance level considered, each rejects {Male mouse liver, Female mouse liver}, and at the two larger levels each rejects {Male rat kidney, Male mouse liver, Female mouse liver}. None of the methods tested, including the new ones, was more powerful than the others for the hypotheses in this example.

4.3 Example 3: Efficacy of a Respiratory Therapy

The data for this example are from W & W (1997); they are the results of an experiment on the efficacy of a respiratory therapy given by Koch, Carr, Amara, Stokes and Uryniak (1990). The analysis of the rating of respiratory health was based on the multinomial distribution, with each group (placebo, active) regarded as a sample from a multinomial distribution, and the objective was to compare the rating categories of the active and placebo groups.

[Table: number of patients in each response category (Very poor, Poor, Fair, Good, Excellent) and the totals, for the Placebo and Active groups.]

The categories found to differ significantly by the multiple comparison procedures are shown below (Roth's R/RMOD method was not used this time, since the hypotheses are dependent):

Method              Smallest level     At the two larger levels
Tarone's T          {}                 {Very poor}
T*                  {}                 {Very poor}
TH*                 {}                 {Very poor}
W & W               {Very poor}        {Very poor}
Stepwise W & W      {Very poor}        {Very poor}
T_k                 {}                 {Very poor}
TWW_k               {Very poor}        {Very poor}
Stepwise TWW_k      {Very poor}        {Very poor}
IP + Closure        {}                 {}

{} indicates that none of the hypotheses was rejected. Two of the newly proposed methods were equal to or more powerful than all the traditional methods; IP, in this example, was inferior (see the discussion).

4.4 Example 4: Relationship between DVT and 3 Genetic Factors

This example comes from an experiment that tested the relationship between deep vein thrombosis (DVT) and three genetic factors (Factor V, Factor II and MTHFR) (Salomon et al. 1999). The population was divided into healthy controls and subjects with DVT, and each subject was tested for the presence of each of the three genetic factors. The subjects were then classified into one of the eight possible genetic groups (a genetic group is defined by the combination of presence or absence of the three factors). The results of this study were published in Arteriosclerosis, Thrombosis and Vascular Biology (Salomon et al. 1999).

[Table: numbers of healthy controls and DVT patients in each of the eight genetic groups: None, Factor V, Factor II, MTHFR, Factor V + Factor II, Factor V + MTHFR, Factor II + MTHFR, and all three factors.]


More information

Adaptive Designs: Why, How and When?

Adaptive Designs: Why, How and When? Adaptive Designs: Why, How and When? Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj ISBS Conference Shanghai, July 2008 1 Adaptive designs:

More information

Non-specific filtering and control of false positives

Non-specific filtering and control of false positives Non-specific filtering and control of false positives Richard Bourgon 16 June 2009 bourgon@ebi.ac.uk EBI is an outstation of the European Molecular Biology Laboratory Outline Multiple testing I: overview

More information

Lecture Testing Hypotheses: The Neyman-Pearson Paradigm

Lecture Testing Hypotheses: The Neyman-Pearson Paradigm Math 408 - Mathematical Statistics Lecture 29-30. Testing Hypotheses: The Neyman-Pearson Paradigm April 12-15, 2013 Konstantin Zuev (USC) Math 408, Lecture 29-30 April 12-15, 2013 1 / 12 Agenda Example:

More information

Stepwise Gatekeeping Procedures in Clinical Trial Applications

Stepwise Gatekeeping Procedures in Clinical Trial Applications 984 Biometrical Journal 48 (2006) 6, 984 991 DOI: 10.1002/bimj.200610274 Stepwise Gatekeeping Procedures in Clinical Trial Applications Alex Dmitrienko *,1, Ajit C. Tamhane 2, Xin Wang 2, and Xun Chen

More information

A note on tree gatekeeping procedures in clinical trials

A note on tree gatekeeping procedures in clinical trials STATISTICS IN MEDICINE Statist. Med. 2008; 06:1 6 [Version: 2002/09/18 v1.11] A note on tree gatekeeping procedures in clinical trials Alex Dmitrienko 1, Ajit C. Tamhane 2, Lingyun Liu 2, Brian L. Wiens

More information

On Procedures Controlling the FDR for Testing Hierarchically Ordered Hypotheses

On Procedures Controlling the FDR for Testing Hierarchically Ordered Hypotheses On Procedures Controlling the FDR for Testing Hierarchically Ordered Hypotheses Gavin Lynch Catchpoint Systems, Inc., 228 Park Ave S 28080 New York, NY 10003, U.S.A. Wenge Guo Department of Mathematical

More information

Minimal basis for connected Markov chain over 3 3 K contingency tables with fixed two-dimensional marginals. Satoshi AOKI and Akimichi TAKEMURA

Minimal basis for connected Markov chain over 3 3 K contingency tables with fixed two-dimensional marginals. Satoshi AOKI and Akimichi TAKEMURA Minimal basis for connected Markov chain over 3 3 K contingency tables with fixed two-dimensional marginals Satoshi AOKI and Akimichi TAKEMURA Graduate School of Information Science and Technology University

More information

Summary of Chapters 7-9

Summary of Chapters 7-9 Summary of Chapters 7-9 Chapter 7. Interval Estimation 7.2. Confidence Intervals for Difference of Two Means Let X 1,, X n and Y 1, Y 2,, Y m be two independent random samples of sizes n and m from two

More information

Modified Simes Critical Values Under Positive Dependence

Modified Simes Critical Values Under Positive Dependence Modified Simes Critical Values Under Positive Dependence Gengqian Cai, Sanat K. Sarkar Clinical Pharmacology Statistics & Programming, BDS, GlaxoSmithKline Statistics Department, Temple University, Philadelphia

More information

Hypothesis Testing. ECE 3530 Spring Antonio Paiva

Hypothesis Testing. ECE 3530 Spring Antonio Paiva Hypothesis Testing ECE 3530 Spring 2010 Antonio Paiva What is hypothesis testing? A statistical hypothesis is an assertion or conjecture concerning one or more populations. To prove that a hypothesis is

More information

Control of Directional Errors in Fixed Sequence Multiple Testing

Control of Directional Errors in Fixed Sequence Multiple Testing Control of Directional Errors in Fixed Sequence Multiple Testing Anjana Grandhi Department of Mathematical Sciences New Jersey Institute of Technology Newark, NJ 07102-1982 Wenge Guo Department of Mathematical

More information

Mathematical Statistics

Mathematical Statistics Mathematical Statistics MAS 713 Chapter 8 Previous lecture: 1 Bayesian Inference 2 Decision theory 3 Bayesian Vs. Frequentist 4 Loss functions 5 Conjugate priors Any questions? Mathematical Statistics

More information

Adaptive, graph based multiple testing procedures and a uniform improvement of Bonferroni type tests.

Adaptive, graph based multiple testing procedures and a uniform improvement of Bonferroni type tests. 1/35 Adaptive, graph based multiple testing procedures and a uniform improvement of Bonferroni type tests. Martin Posch Center for Medical Statistics, Informatics and Intelligent Systems Medical University

More information

exp{ (x i) 2 i=1 n i=1 (x i a) 2 (x i ) 2 = exp{ i=1 n i=1 n 2ax i a 2 i=1

exp{ (x i) 2 i=1 n i=1 (x i a) 2 (x i ) 2 = exp{ i=1 n i=1 n 2ax i a 2 i=1 4 Hypothesis testing 4. Simple hypotheses A computer tries to distinguish between two sources of signals. Both sources emit independent signals with normally distributed intensity, the signals of the first

More information

Family-wise Error Rate Control in QTL Mapping and Gene Ontology Graphs

Family-wise Error Rate Control in QTL Mapping and Gene Ontology Graphs Family-wise Error Rate Control in QTL Mapping and Gene Ontology Graphs with Remarks on Family Selection Dissertation Defense April 5, 204 Contents Dissertation Defense Introduction 2 FWER Control within

More information

The University of Hong Kong Department of Statistics and Actuarial Science STAT2802 Statistical Models Tutorial Solutions Solutions to Problems 71-80

The University of Hong Kong Department of Statistics and Actuarial Science STAT2802 Statistical Models Tutorial Solutions Solutions to Problems 71-80 The University of Hong Kong Department of Statistics and Actuarial Science STAT2802 Statistical Models Tutorial Solutions Solutions to Problems 71-80 71. Decide in each case whether the hypothesis is simple

More information

Glossary for the Triola Statistics Series

Glossary for the Triola Statistics Series Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling

More information

Basic counting techniques. Periklis A. Papakonstantinou Rutgers Business School

Basic counting techniques. Periklis A. Papakonstantinou Rutgers Business School Basic counting techniques Periklis A. Papakonstantinou Rutgers Business School i LECTURE NOTES IN Elementary counting methods Periklis A. Papakonstantinou MSIS, Rutgers Business School ALL RIGHTS RESERVED

More information

Consonance and the Closure Method in Multiple Testing. Institute for Empirical Research in Economics University of Zurich

Consonance and the Closure Method in Multiple Testing. Institute for Empirical Research in Economics University of Zurich Institute for Empirical Research in Economics University of Zurich Working Paper Series ISSN 1424-0459 Working Paper No. 446 Consonance and the Closure Method in Multiple Testing Joseph P. Romano, Azeem

More information

Optimal rejection regions for testing multiple binary endpoints in small samples

Optimal rejection regions for testing multiple binary endpoints in small samples Optimal rejection regions for testing multiple binary endpoints in small samples Robin Ristl and Martin Posch Section for Medical Statistics, Center of Medical Statistics, Informatics and Intelligent Systems,

More information

Decision Making Beyond Arrow s Impossibility Theorem, with the Analysis of Effects of Collusion and Mutual Attraction

Decision Making Beyond Arrow s Impossibility Theorem, with the Analysis of Effects of Collusion and Mutual Attraction Decision Making Beyond Arrow s Impossibility Theorem, with the Analysis of Effects of Collusion and Mutual Attraction Hung T. Nguyen New Mexico State University hunguyen@nmsu.edu Olga Kosheleva and Vladik

More information

4 Hypothesis testing. 4.1 Types of hypothesis and types of error 4 HYPOTHESIS TESTING 49

4 Hypothesis testing. 4.1 Types of hypothesis and types of error 4 HYPOTHESIS TESTING 49 4 HYPOTHESIS TESTING 49 4 Hypothesis testing In sections 2 and 3 we considered the problem of estimating a single parameter of interest, θ. In this section we consider the related problem of testing whether

More information

The Impossibility of Certain Types of Carmichael Numbers

The Impossibility of Certain Types of Carmichael Numbers The Impossibility of Certain Types of Carmichael Numbers Thomas Wright Abstract This paper proves that if a Carmichael number is composed of primes p i, then the LCM of the p i 1 s can never be of the

More information

6 Single Sample Methods for a Location Parameter

6 Single Sample Methods for a Location Parameter 6 Single Sample Methods for a Location Parameter If there are serious departures from parametric test assumptions (e.g., normality or symmetry), nonparametric tests on a measure of central tendency (usually

More information

Exam: high-dimensional data analysis January 20, 2014

Exam: high-dimensional data analysis January 20, 2014 Exam: high-dimensional data analysis January 20, 204 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question not the subquestions on a separate piece of paper. - Finish

More information

Multiple comparisons of slopes of regression lines. Jolanta Wojnar, Wojciech Zieliński

Multiple comparisons of slopes of regression lines. Jolanta Wojnar, Wojciech Zieliński Multiple comparisons of slopes of regression lines Jolanta Wojnar, Wojciech Zieliński Institute of Statistics and Econometrics University of Rzeszów ul Ćwiklińskiej 2, 35-61 Rzeszów e-mail: jwojnar@univrzeszowpl

More information

Sample Size Estimation for Studies of High-Dimensional Data

Sample Size Estimation for Studies of High-Dimensional Data Sample Size Estimation for Studies of High-Dimensional Data James J. Chen, Ph.D. National Center for Toxicological Research Food and Drug Administration June 3, 2009 China Medical University Taichung,

More information

Lecture 2. G. Cowan Lectures on Statistical Data Analysis Lecture 2 page 1

Lecture 2. G. Cowan Lectures on Statistical Data Analysis Lecture 2 page 1 Lecture 2 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

Reports of the Institute of Biostatistics

Reports of the Institute of Biostatistics Reports of the Institute of Biostatistics No 02 / 2008 Leibniz University of Hannover Natural Sciences Faculty Title: Properties of confidence intervals for the comparison of small binomial proportions

More information

Bipartite Subgraphs of Integer Weighted Graphs

Bipartite Subgraphs of Integer Weighted Graphs Bipartite Subgraphs of Integer Weighted Graphs Noga Alon Eran Halperin February, 00 Abstract For every integer p > 0 let f(p be the minimum possible value of the maximum weight of a cut in an integer weighted

More information

Lesson 1: Successive Differences in Polynomials

Lesson 1: Successive Differences in Polynomials Lesson 1 Lesson 1: Successive Differences in Polynomials Classwork Opening Exercise John noticed patterns in the arrangement of numbers in the table below. 2.4 3.4 4.4 5.4 6.4 5.76 11.56 19.36 29.16 40.96

More information

Introduction 1. STA442/2101 Fall See last slide for copyright information. 1 / 33

Introduction 1. STA442/2101 Fall See last slide for copyright information. 1 / 33 Introduction 1 STA442/2101 Fall 2016 1 See last slide for copyright information. 1 / 33 Background Reading Optional Chapter 1 of Linear models with R Chapter 1 of Davison s Statistical models: Data, and

More information

Closure properties of classes of multiple testing procedures

Closure properties of classes of multiple testing procedures AStA Adv Stat Anal (2018) 102:167 178 https://doi.org/10.1007/s10182-017-0297-0 ORIGINAL PAPER Closure properties of classes of multiple testing procedures Georg Hahn 1 Received: 28 June 2016 / Accepted:

More information

Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box Durham, NC 27708, USA

Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box Durham, NC 27708, USA Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box 90251 Durham, NC 27708, USA Summary: Pre-experimental Frequentist error probabilities do not summarize

More information

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo PROCEDURES CONTROLLING THE k-fdr USING BIVARIATE DISTRIBUTIONS OF THE NULL p-values Sanat K. Sarkar and Wenge Guo Temple University and National Institute of Environmental Health Sciences Abstract: Procedures

More information

CHL 5225H Advanced Statistical Methods for Clinical Trials: Multiplicity

CHL 5225H Advanced Statistical Methods for Clinical Trials: Multiplicity CHL 5225H Advanced Statistical Methods for Clinical Trials: Multiplicity Prof. Kevin E. Thorpe Dept. of Public Health Sciences University of Toronto Objectives 1. Be able to distinguish among the various

More information

Multiple Endpoints: A Review and New. Developments. Ajit C. Tamhane. (Joint work with Brent R. Logan) Department of IE/MS and Statistics

Multiple Endpoints: A Review and New. Developments. Ajit C. Tamhane. (Joint work with Brent R. Logan) Department of IE/MS and Statistics 1 Multiple Endpoints: A Review and New Developments Ajit C. Tamhane (Joint work with Brent R. Logan) Department of IE/MS and Statistics Northwestern University Evanston, IL 60208 ajit@iems.northwestern.edu

More information

Significance Testing with Incompletely Randomised Cases Cannot Possibly Work

Significance Testing with Incompletely Randomised Cases Cannot Possibly Work Human Journals Short Communication December 2018 Vol.:11, Issue:2 All rights are reserved by Stephen Gorard FRSA FAcSS Significance Testing with Incompletely Randomised Cases Cannot Possibly Work Keywords:

More information

Dose-response modeling with bivariate binary data under model uncertainty

Dose-response modeling with bivariate binary data under model uncertainty Dose-response modeling with bivariate binary data under model uncertainty Bernhard Klingenberg 1 1 Department of Mathematics and Statistics, Williams College, Williamstown, MA, 01267 and Institute of Statistics,

More information

Notes on statistical tests

Notes on statistical tests Notes on statistical tests Daniel Osherson Princeton University Scott Weinstein University of Pennsylvania September 20, 2005 We attempt to provide simple proofs for some facts that ought to be more widely

More information

QUANTITATIVE TECHNIQUES

QUANTITATIVE TECHNIQUES UNIVERSITY OF CALICUT SCHOOL OF DISTANCE EDUCATION (For B Com. IV Semester & BBA III Semester) COMPLEMENTARY COURSE QUANTITATIVE TECHNIQUES QUESTION BANK 1. The techniques which provide the decision maker

More information

The optimal discovery procedure: a new approach to simultaneous significance testing

The optimal discovery procedure: a new approach to simultaneous significance testing J. R. Statist. Soc. B (2007) 69, Part 3, pp. 347 368 The optimal discovery procedure: a new approach to simultaneous significance testing John D. Storey University of Washington, Seattle, USA [Received

More information

COMPLETION OF PARTIAL LATIN SQUARES

COMPLETION OF PARTIAL LATIN SQUARES COMPLETION OF PARTIAL LATIN SQUARES Benjamin Andrew Burton Honours Thesis Department of Mathematics The University of Queensland Supervisor: Dr Diane Donovan Submitted in 1996 Author s archive version

More information

Multiple testing: Intro & FWER 1

Multiple testing: Intro & FWER 1 Multiple testing: Intro & FWER 1 Mark van de Wiel mark.vdwiel@vumc.nl Dep of Epidemiology & Biostatistics,VUmc, Amsterdam Dep of Mathematics, VU 1 Some slides courtesy of Jelle Goeman 1 Practical notes

More information

Bipartite Perfect Matching

Bipartite Perfect Matching Bipartite Perfect Matching We are given a bipartite graph G = (U, V, E). U = {u 1, u 2,..., u n }. V = {v 1, v 2,..., v n }. E U V. We are asked if there is a perfect matching. A permutation π of {1, 2,...,

More information

arxiv: v1 [math.st] 31 Mar 2009

arxiv: v1 [math.st] 31 Mar 2009 The Annals of Statistics 2009, Vol. 37, No. 2, 619 629 DOI: 10.1214/07-AOS586 c Institute of Mathematical Statistics, 2009 arxiv:0903.5373v1 [math.st] 31 Mar 2009 AN ADAPTIVE STEP-DOWN PROCEDURE WITH PROVEN

More information

2011 Pearson Education, Inc

2011 Pearson Education, Inc Statistics for Business and Economics Chapter 3 Probability Contents 1. Events, Sample Spaces, and Probability 2. Unions and Intersections 3. Complementary Events 4. The Additive Rule and Mutually Exclusive

More information

One-Way ANOVA. Some examples of when ANOVA would be appropriate include:

One-Way ANOVA. Some examples of when ANOVA would be appropriate include: One-Way ANOVA 1. Purpose Analysis of variance (ANOVA) is used when one wishes to determine whether two or more groups (e.g., classes A, B, and C) differ on some outcome of interest (e.g., an achievement

More information

Review Basic Probability Concept

Review Basic Probability Concept Economic Risk and Decision Analysis for Oil and Gas Industry CE81.9008 School of Engineering and Technology Asian Institute of Technology January Semester Presented by Dr. Thitisak Boonpramote Department

More information

Political Science 236 Hypothesis Testing: Review and Bootstrapping

Political Science 236 Hypothesis Testing: Review and Bootstrapping Political Science 236 Hypothesis Testing: Review and Bootstrapping Rocío Titiunik Fall 2007 1 Hypothesis Testing Definition 1.1 Hypothesis. A hypothesis is a statement about a population parameter The

More information

Estimates for probabilities of independent events and infinite series

Estimates for probabilities of independent events and infinite series Estimates for probabilities of independent events and infinite series Jürgen Grahl and Shahar evo September 9, 06 arxiv:609.0894v [math.pr] 8 Sep 06 Abstract This paper deals with finite or infinite sequences

More information

Chapters 10. Hypothesis Testing

Chapters 10. Hypothesis Testing Chapters 10. Hypothesis Testing Some examples of hypothesis testing 1. Toss a coin 100 times and get 62 heads. Is this coin a fair coin? 2. Is the new treatment more effective than the old one? 3. Quality

More information

Permutation Tests. Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods

Permutation Tests. Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods Permutation Tests Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods The Two-Sample Problem We observe two independent random samples: F z = z 1, z 2,, z n independently of

More information

Lines With Many Points On Both Sides

Lines With Many Points On Both Sides Lines With Many Points On Both Sides Rom Pinchasi Hebrew University of Jerusalem and Massachusetts Institute of Technology September 13, 2002 Abstract Let G be a finite set of points in the plane. A line

More information

Lecture 21: October 19

Lecture 21: October 19 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 21: October 19 21.1 Likelihood Ratio Test (LRT) To test composite versus composite hypotheses the general method is to use

More information

Data Mining. CS57300 Purdue University. March 22, 2018

Data Mining. CS57300 Purdue University. March 22, 2018 Data Mining CS57300 Purdue University March 22, 2018 1 Hypothesis Testing Select 50% users to see headline A Unlimited Clean Energy: Cold Fusion has Arrived Select 50% users to see headline B Wedding War

More information

Stat 5421 Lecture Notes Fuzzy P-Values and Confidence Intervals Charles J. Geyer March 12, Discreteness versus Hypothesis Tests

Stat 5421 Lecture Notes Fuzzy P-Values and Confidence Intervals Charles J. Geyer March 12, Discreteness versus Hypothesis Tests Stat 5421 Lecture Notes Fuzzy P-Values and Confidence Intervals Charles J. Geyer March 12, 2016 1 Discreteness versus Hypothesis Tests You cannot do an exact level α test for any α when the data are discrete.

More information

Statistical Theory 1

Statistical Theory 1 Statistical Theory 1 Set Theory and Probability Paolo Bautista September 12, 2017 Set Theory We start by defining terms in Set Theory which will be used in the following sections. Definition 1 A set is

More information

Exact and Approximate Stepdown Methods For Multiple Hypothesis Testing

Exact and Approximate Stepdown Methods For Multiple Hypothesis Testing Exact and Approximate Stepdown Methods For Multiple Hypothesis Testing Joseph P. Romano Department of Statistics Stanford University Michael Wolf Department of Economics and Business Universitat Pompeu

More information

Statistical testing. Samantha Kleinberg. October 20, 2009

Statistical testing. Samantha Kleinberg. October 20, 2009 October 20, 2009 Intro to significance testing Significance testing and bioinformatics Gene expression: Frequently have microarray data for some group of subjects with/without the disease. Want to find

More information

STAT 302 Introduction to Probability Learning Outcomes. Textbook: A First Course in Probability by Sheldon Ross, 8 th ed.

STAT 302 Introduction to Probability Learning Outcomes. Textbook: A First Course in Probability by Sheldon Ross, 8 th ed. STAT 302 Introduction to Probability Learning Outcomes Textbook: A First Course in Probability by Sheldon Ross, 8 th ed. Chapter 1: Combinatorial Analysis Demonstrate the ability to solve combinatorial

More information

PERFECTLY secure key agreement has been studied recently

PERFECTLY secure key agreement has been studied recently IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 45, NO. 2, MARCH 1999 499 Unconditionally Secure Key Agreement the Intrinsic Conditional Information Ueli M. Maurer, Senior Member, IEEE, Stefan Wolf Abstract

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

Hypothesis Testing. ) the hypothesis that suggests no change from previous experience

Hypothesis Testing. ) the hypothesis that suggests no change from previous experience Hypothesis Testing Definitions Hypothesis a claim about something Null hypothesis ( H 0 ) the hypothesis that suggests no change from previous experience Alternative hypothesis ( H 1 ) the hypothesis that

More information

Review of Basic Probability

Review of Basic Probability Review of Basic Probability Erik G. Learned-Miller Department of Computer Science University of Massachusetts, Amherst Amherst, MA 01003 September 16, 2009 Abstract This document reviews basic discrete

More information

Hypothesis testing (cont d)

Hypothesis testing (cont d) Hypothesis testing (cont d) Ulrich Heintz Brown University 4/12/2016 Ulrich Heintz - PHYS 1560 Lecture 11 1 Hypothesis testing Is our hypothesis about the fundamental physics correct? We will not be able

More information

Statistical Significance of Ranking Paradoxes

Statistical Significance of Ranking Paradoxes Statistical Significance of Ranking Paradoxes Anna E. Bargagliotti and Raymond N. Greenwell 1 February 28, 2009 1 Anna E. Bargagliotti is an Assistant Professor in the Department of Mathematical Sciences

More information

The Difference in Proportions Test

The Difference in Proportions Test Overview The Difference in Proportions Test Dr Tom Ilvento Department of Food and Resource Economics A Difference of Proportions test is based on large sample only Same strategy as for the mean We calculate

More information

Chapters 10. Hypothesis Testing

Chapters 10. Hypothesis Testing Chapters 10. Hypothesis Testing Some examples of hypothesis testing 1. Toss a coin 100 times and get 62 heads. Is this coin a fair coin? 2. Is the new treatment on blood pressure more effective than the

More information

A variant of Namba Forcing

A variant of Namba Forcing A variant of Namba Forcing Moti Gitik School of Mathematical Sciences Raymond and Beverly Sackler Faculty of Exact Science Tel Aviv University Ramat Aviv 69978, Israel August 0, 009 Abstract Ronald Jensen

More information