An Alpha-Exhaustive Multiple Testing Procedure

Current Research in Biostatistics Original Research Paper An Alpha-Exhaustive Multiple Testing Procedure, Mar Chang, Xuan Deng and John Balser Boston University, Boston MA, USA Veristat, Southborough MA, USA Article history Received: 9-07-06 Revised: 07-08-06 Accepted: 06--06 Corresponding Author: Mar Chang Boston University, Boston MA, USA and Veristat, Southborough MA, USA Email: mar.chang@veristat.com Abstract: A multiple testing procedure can be a single-step procedure such as Bonferroni s method or a stepwise procedure such as Hochberg s stepup method and Hommel s method. It can be an -exhaustive or - conservative approach. We develop a single -exhaustive procedure that can improve power -5% over Hochberg s and Hommel s methods in common situations when the test statistics are mutually independent. The method can also be generalized to dependent test statistics. The idea behind our method is to construct the rejection rules using the product of marginal p-values and by controlling the upper bounds of the th order terms so that is controlled for any configuration of null hypotheses. Such upper bounds or critical values are determined progressively from = towards = K, the number of null hypotheses in the problem. Keywords: Multiple Testing, Alpha-Exhaustive, Hypothesis Test Introduction Multiple testing problems are common in pharmaceutical statistics and life-sciences in general. The main goal of Multiple Testing Procedures (MTP) is to (strongly) control the family wise type-i error rate. A MTP can be a single-step data-independent procedure such as Bonferroni s method or a data-dependent stepwise procedure such as Hochberg s stepup method and Hommel s stepup method. It can be an-exhaustive or -conservative approach. Conceptually, stepwise procedures are usually more powerful than single-step procedures and -exhaustive procedures are usually more powerful than -conservative approach. However, consider these two aspects together, comparisons of testing procedures are not that simple, often depending on the configuration of the alternative "hypotheses" or more precisely, the truths. For example, the power of a fallbac procedure is dependent on the weight and "effective sizes" in the alternative hypotheses. A fixed sequence testing procedure is a special case of the fallbac procedure and its power is heavily dependent on the order of the test sequence of the hypothesis. We develop a simple single -exhaustive procedure that can improve power -5% over Hochberg s and Hommel s methods in common situations when the test statistics are independent. The method can also be generalized to dependent test statistics. The idea behind our method is to construct the rejection rules using the product of marginal p-values and by controlling the upper bounds of the th order terms so that is exhausted for any configuration of null hypotheses. Such upper bounds or critical values are determined progressively from = towards = K, the number of null hypotheses in the problem. Unlie common stepwise test procedures, in which every step in the decision rule will only involve one critical value for decision-maing, the proposed -exhaustive approach is a single-step method with multiple critical values involved in the decision rules. The paper is organized as follows. In section, we will review several important stepwise test procedures that will be used in our power comparisons. In section 3, we elaborate our progressive -exhaustive procedure for two-hypothesis testing, outline the idea, derive the formulations for critical values and provide examples of using this procedure in comparison with other methods. We also provide the power formulation for the -exhaustive procedure for two-hypothesis testing. In section, we provide power comparisons among several different methods using simulation. In section 5, we extend the -exhaustive procedure to threehypothesis testing and compare with Hommel s procedure in power under broad conditions. In section 6, we further describe the -exhaustive procedure for general K-hypothesis testing and simulation algorithms for determining the critical values. In the last section, discussion and summary are provided. We place 06 Mar Chang, Xuan Deng and John Balser. This open access article is distributed under a Creative Commons Attribution (CC-BY) 3.0 license.

Mar Chang et al. / Current Research in Biostatistics 06, 6 (): 30. DOI:0.38/amjbsp.06.30. mathematical derivation in the Appendix. To mae the procedure ready for practical use, we have included the SAS code in the Appendix. Multiple Testing Procedures Stepwise procedures are different from single-step procedures, in the sense that a stepwise procedure must follow a specific order to test each hypothesis. In general, stepwise procedures are more powerful than single-step procedures. There are three categories of stepwise procedures that are dependent on how the stepwise tests proceed: Stepup, stepdown and fallbac procedure. The commonly used stepwise procedures include the Bonferroni-Holm stepdown method (Holm, 979), the Holm stepdown method (Dmitrieno et al., 009, p.65), Hommel s stepup procedure (Hommel, 988), Hochberg s stepup method (Hochberg, 988), the fallbac procedure (Wiens, 003) and the sequential test with fixed sequences (Westfall et al., 999). Stepdown Procedure A stepdown procedure starts with the most significant p-value and ends with the least significant p-value. In the procedure, the p-values are arranged in an ascending order, i.e., from the smallest to the largest: p p p () ( ) ( )... ( K ) with the corresponding hypotheses: H, H,..., H () K ( ) ( ) ( ) The test proceeds from H () to H (K). If p () >C ( =,..., K), retain all H (i) (i ); otherwise, reject H () and continue to test H (+). The constants C are different for different procedures. The adjusted p-values are: pɶ = C p ( ) pɶ = max ( pɶ, C p( ) ), =,..., n (3) Therefore an alternative test procedure is to compare the adjusted p-values against the unadjusted. After adjusting p-values, one can test the hypotheses in any order. Fallbac Procedure (Wiens, 003) The Holm procedure is based on a data-driven order of testing, while the fixed-sequence procedure is based on a prefixed order of testing. A compromise between them is the so-called fallbac procedure. The fallbac procedure was introduced by Wiens (003) and was further studied by Dmitrieno et al. (006; Hommel and Bretz, 008). The test procedure can be outlined as follows. In the fallbac procedure, we allocate the overall error rate among the hypotheses according to their weights w, where w 0 and w =. For fixed sequence test, w = and w =... = w K = 0: Step : Test H at = w. If p, reject this hypothesis; otherwise retain it. Go to the next step Step i =,..., K : Test H at = + w if H is rejected and at = w if H is retained. If p, reject H ; otherwise retainit. Go to the next step Step K: Test H K at K = K + w K if H K is rejected and at K = w K if H K is retained. If p K K, reject H K ; otherwise retainit The formula for the adjusted p-value is complicated to be written explicitly. Stepup Procedure A stepup procedure starts with the least significant p- value and ends with the most significant p-value. The procedure proceeds from H (K) to H (). If, P () C ( =,..., K), reject all H (i) (i ); otherwise, retain H () and continue to test H ( ). The adjusted p-values are: pɶ K = CK p( ), K pɶ = min ( pɶ +, C p( ) ), = K,..., () Progressive -Exhaustive Testing Procedure An -exhaustive procedure is a closed testing procedure based on intersection hypothesis tests the size of which is exactly. In other words, Pr (Reject H I ) = for any intersection hypothesis H I, I {,...,K}. Put in a simple way, in an -exhaustive procedure, the supremum of the probability of false rejection in any null hypothesis configuration is equal to. Many stepwise test procedures have been developed, which are not necessarily -exhaustive. Therefore, there is a room for improvement. However, an -exhaustive procedure is not necessarily a powerful test. A fixed sequence testis an -exhaustive test, but it is often the least powerful test if the sequence of tests was inappropriately chosen. Let s discuss the situation of two-hypothesis testing: H : H H versus H : H H (5) o Here H is the negation of H, =,. In this setting, if either H or H is rejected, the null hypothesis H o is 3

Mar Chang et al. / Current Research in Biostatistics 06, 6 (): 30. DOI:0.38/amjbsp.06.30. rejected. Let p and p be the marginalp-values for testing H and H, respectively. The decision rules of the progressive -exhaustive testing procedure are: If p p and p, then reject H If p p and p, then reject H where critical value > 0 and > 0. The idea behind this procedure is to borrow strength among marginal p-values. In plain language, the procedure says that we don t have to mae an adjustment, as long as p and the other p-value p is small. For example, if p = and p = 0.0, we can reject H. The and are so determined that when both H and H are true, the type-i error will not exceed. The procedure can control the Familywise Error Rate (FWER) strongly but at the same time exhaust all under all the null hypothesis configurations: H, H and H H. This is done progressively as described below. Step : To control the familywise error rate at level, when only H is true and H is not true, then p p can be satisfied with probability of (if, for example, the test drug is very effective, p will be virtually always 0). Therefore, to control FWER, a necessary condition is sup Pr(p p p H H ) = sup Pr (p H ) =. Therefore, type-i error is strongly controlled and exhausted when H H is true. By the same toen, we can reach the same conclusion when H H is true. Step : Now we need to determine and to exhaust when H H is true. In this study, we only consider the case when the two test statistics, p and p, are independent. Under the global null hypothesis H 0, p and p are iid U (0, ), which is equivalent to two standard normal test statistics: z = -Φ(p ) and z =-Φ(p ) under H 0 being independent. However, woring on the p-scale, the testing procedure can be used for different endpoints (normal, binary, survival), as long as p and p are independent and stochastically equal to or larger than uniform p and p. Since T = p p < implies p <, we have the p conditional cdf for T: T p ( ) p, < = < F T p, p p (6) Fig.. Pr (p p p ) If, then Pr (p p p H,H ) = Pr (p H,H ) =. If <, then (Fig. for a geometric interpretation): Pr ( p p p ) F ( T p T p ) f ( p ) dp dp dp (7) = < = + 0 0 = + ln Pr Thus: ( p p p ) p, = + ln, (8) Consequently, the type-i error rate under H H is given by (for simplicity, we just use FWER (H H )): ( H ) ( p p p p p p ) ( p p p ) Pr( p p p ) ( p p ( a ) p p ) FWER H = Pr = Pr + Pr min, = + ln + + ln ( ), for min, (9) In (9), we have used the following result: if min(, ), then: 3

Mar Chang et al. / Current Research in Biostatistics 06, 6 (): 30. DOI:0.38/amjbsp.06.30. ( p p ( ) p p ) ( p p ) Pr min, = Pr = (0) However, if > min(a, ), then the probability becomes (Fig. ): Pr ( p p p p ) min (, )/ = dp + dp 0 min (, )/ p = min (, ) + min (, ) ln min, () ( ) To summarize the type-i error rates under various null configurations, we have: ( ) ( ) FWER H = FWER H = FWER H ( H ), min > = + ln + + ln, min + ln + + ln min + ln, min < min where, min = min (, ): () (3) The and are determined so that FWER(H H ) =. We are not interested in, because p p in the rejection criteria has no effect. In fact, FWER(H H ) = - = will have no solution for any between 0 and. We are not interested in < either, because it maes the conditions, p < and p <, have no effect in determining the rejection boundary. In fact: FWER( H H ) = + ln + + ln min + ln = min has no solution for < and <. Therefore, the only scenario that we are interested in is: FWER ( H H ) = + ln + + ln, () With (), we can determine the rejection boundaries, for given. Here are the steps: Choose so that < Let FWER (H H ) = to solve for That is, is the solution of: + ln + + ln =, min (5) Examples of critical values from (5) are presented in Table and. When =, (5) can be simplified as: ln + = (6) Fig.. Pr (p p p p ) The critical values for various with = are presented in Table 3. The rejection boundaries in Table -3 have been verified each by 0,000,000 simulations. Illustrative Example Suppose in the Statistical Analysis Plan for a cardiovascular trial with two primary endpoints (not coprimary endpoints that have to be met simultaneously), the two test statistics for the two hypotheses (H and H ) of the two endpoints are assumed to be independent and the -exhaustive procedure (with one-sided = = 0.00855 and = 0.05) was specified in the analysis for 33

Mar Chang et al. / Current Research in Biostatistics 06, 6 (): 30. DOI:0.38/amjbsp.06.30. the multiplicity adjustment to control FWER. At the end of trial, the p-values for the two endpoints are: Scenario () one-sided p = 0.0 and p = 0.05, scenario () p = 0.0 and p = 0., (3) p = 0.05 and p = 0.0, () p = 0.0 and p = 0.6 and (5) p = 0.0 and p = 0.5. Using the -exhaustive procedure for scenario (), we reject both H and H because p p = 0.0006 p = 0.0 to reject H and p p = 0.0006 p = 0.05 to reject H. For scenario (), we reject H but not H because p p = 0.008 p = 0.0 and p p = 0.008 p = 0. >. For scenario (3), we reject H, but not H. For scenario (), we reject H but not H. For scenario (5), we can reject neither H nor H. Using the test procedures described in the previous section, we summarize the rejection status in Table. The -exhaustive procedure can reject at least one hypothesis except for scenario 5. The reason that -Ex method (with = ) cannot reject any hypothesis for scenario (5) is that the method emphasizes the consistency of the evidence against all the hypotheses and such consistency is obviously not presented in scenario (5). The power of -Ex procedure ( = ) for the twohypothesis at one-side -level can be written as: ( z ) ( z ) ( ( ) ( )) Φ Φ Power = Pr ( min Φ z, Φ z < H, H As a comparison, the power of Hommel s procedure for the two-hypothesis at one-sided -level can be written as: ( Φ( z ) Φ( z )) ( ( ) ( )) min Power = Pr min Φ z, Φ z < / H, H Table. Critical values for one-sided = 0.05 0.000095 0.000650 0.00000 0.00000 0.003000 0.00000 0.00855 0.005000 0.05000 0.088 0.0856 0.009378 0.0078 0.0058 0.00855 0.007 Table. Critical values for one-sided = 0.05 0.00035 0.00500 0.00000 0.005000 0.006000 0.007000 0.008000 0.00097 0.050000 0.0565 0.00078 0.0760 0.05607 0.0393 0.0508 0.00097 Table 3. Critical values for one-sided test when = 0.005000 0.00000 0.05000 0.050000 0.075000 0.00000, 0.0009 0.00897 0.00855 0.00097 0.05739 0.0798 Table. Rejection with different test procedures Scenario ---------------------------------------------------------------------------------------------------------------------- Method 3 5 Fixed Sequence H, H H H H Bonferroni H H Fallbac (w = 0.5) H H Holm-Stepdown H H Hochberg H, H H Hommel H, H H H -exhaustive H, H H H H Note: One-sided = 0.05, = = 0.00855 for -exhaustive Table 5. Power comparisons for two-hypothesis testing (δ = 0.3, σ = ) δ = 0.3 δ = 0.5 δ = 0 ---------------------------------- ---------------------------------- ----------------------------- Method Power Power Power Power Power Power Fixed Seq (H,H ) 0.60 0.800 0. 0.80 0.00 0.05 Fixed Seq (H,H ) 0.60 0.800 0. 0.800 0.00 0.800 Bonferroni 0.59 0.96 0.50 0.77 0.009 0.73 Fallbac (H,H ) 0.590 0.96 0.68 0.78 0.00 0.730 Fallbac (H,H ) 0.590 0.96 0. 0.783 0.08 0.73 Holm 0.65 0.96 0.33 0.78 0.09 0.730 Hochberg 0.660 0.933 0. 0.79 0.00 0.73 Hommel 0.660 0.933 0. 0.79 0.00 0.73 Progressive -Ex 0.660 0.96 0.0 0.83 0.00 0.7 Note: Sample size = 90, one-sided = 0.05, = = 0.00855 for Progressive -Ex 3

Mar Chang et al. / Current Research in Biostatistics 06, 6 (): 30. DOI:0.38/amjbsp.06.30. Power Comparison of Two Hypotheses There seems a general impression that whatever the test procedure to use the power of rejection cannot be improved significantly for two-hypothesis testing. However, this is not necessarily true. We have compared power of seven different testing methods described in section and presented results in Table 5, where Power is probability of simultaneously rejecting H :δ 0 and H : δ 0 and Power is the probability of rejecting either H or H. For the fallbac procedure the weights w = w = 0.5 are used. The fixed sequence method is equivalent to the fallbac procedure with w = and w = 0. The progressive -exhaustive procedure performs overall the best, while the Hommel method performs the second best. In general, Holm procedure is uniformly more powerful than the Bonferroni procedure. Hochberg s procedure is uniformly more powerful than Holm s procedure and Hommel s procedure is uniformly more powerful than Hochberg s procedure. Holm, fixed-sequence and fallbac are nonparametric and control FWER for any joint distribution of test statistics. Hommel and Hochberg procedures are semi parametric and control FWER only for some joint distributions, including positively dependent test statistics such as multivariate normal test statistics. Nonparametric procedures mae no assumptions about the joint distribution of test statistics which results in power loss (Dmitrieno, 03). For two-hypothesis testing, Hochberg s method is equivalent to Hommel s method. The power of the fallbac method depends on the weights w i and the order of the hypotheses. Comparisons of Hommel s method to use of the -Ex method with different and are presented in Table 6. For -Ex, = = 0.00855; -Ex, = 0.003355; = ; -Ex 3, = 0.003798, =.6 ; -Ex, = 0.0033, =.5 ; -Ex 5, = 0.0033, =.5. From the table, we can see that all -Ex procedures perform very well except -Ex 5, in which the treatment effect δ is smaller than δ, but alphas were set up in a wrong direction ( =.5 > ). In general, should be chosen larger than if δ is expected larger than δ ; otherwise choose. Table 6. Power comparison for two-hypothesis testing δ = δ (σ = ) ------------------------------------------------------------------------ Method 0.0/0.3 0.03/0.3 0.05/0.3 0./0.3 0.5/0.3 0./0.3 0.3/0.3 Hommel 0.73 0.736 0.7 0.759 0.79 0.836 0.933 -Ex 0.7 0.736 0.75 0.796 0.83 0.890 0.96 -Ex 0.75 0.76 0.777 0.8 0.85 0.893 0.96 -Ex 3 0.735 0.755 0.770 0.808 0.850 0.89 0.96 -Ex 0.73 0.76 0.76 0.803 0.86 0.89 0.96 -Ex 5 0.700 0.75 0.7 0.790 0.839 0.887 0.96 The reason that -Ex method (with = ) has lower power than Hommel s method at the extreme case when δ = 0 and δ = 0.3 is that the former emphasizes the consistency of the evidence against all the hypotheses, while δ = 0 and δ = 0.3 are clearly inconsistent. If we believe δ is smaller than δ, we should use different and (e.g., =.6 in - Ex 3 ). However, if the directional guess is wrong, it could reduce the power as seen in -Ex 5. Formulation for Three Hypotheses We now discuss the progressive -Exhaustive procedure for three-hypothesis testing: H : H H H vsh : H H H (7) 0 3 a 3 Similar to two-hypothesis testing, the rejectionacceptance rules of the -exhaustive procedure for threehypothesis testing are: If p p p 3 p p p p 3 p, reject H ; otherwise accept H If p p p 3 p p p p 3 p, reject H, otherwise accept H If p p p 3 < p 3 p 3 p 3 p 3 p 3, reject H 3, otherwise accept H 3 Here 3. Determination of Critical Values The derivations of the critical values are placed in the Appendix. Here we described the ey steps and summarized the results. The critical values, and 3 are determined by all the paired null hypotheses: H H, H H 3 and H H 3. To exhaust familywise error rate, it necessarily requires that FWER(H H ) =, FWER(H H 3 ) = and FWER(H 3 H ) =, which are equivalent to (Appendix), respectively: + ln + + ln = + ln + 3 + 3 ln =, < 3 < (8) 3 3 + 3 ln + + ln = 3 Each equation in (8) is in the same expression as (5). Therefore, the critical values in Table -3 are also valid for (8). 35

Mar Chang et al. / Current Research in Biostatistics 06, 6 (): 30. DOI:0.38/amjbsp.06.30. Table 7. Critical values for one-sided test ( = = 3 ) 0.00000 0.05000 0.050000 0.075000 0.00000,, 3 0.00897 0.00855 0.00097 0.05739 0.0798 0.0005 0.00677 0.00557 0.007566 0.009966 Table 8. Power comparison δ = δ (δ 3 = 0.3, σ = ) ------------------------------------------------------------------------------------------------------------------------------- Method 0.0/0.0 0.0/0.3 0.03/0.3 0./0.3 0./0. 0./0. 0.3/0.3 Hommel 0.8 0.735 0.737 0.79 0.6 0.533 0.869 Dunnet Step-up 0.78 0.75 0.78 0.78 0.60 0.58 0.86 Graphic Approach 0.78 0.7 0.7 0.773 0.508 0.55 0.85 -Ex 0.70 0.756 0.775 0.885 0.698 0.599 0.9 Note: = 0.05, = = 3 = 0.00855, = 0.00677, sample size = 60 To exhaust under H H H 3, it requires that FWER(H H H 3 ) =, that is (assume = = 3, Appendix): 3 3 + ln + 3 ( ) + 3 = (9) Now we can use (8) to determine, and 3 and use (9) to determine for the case when = = 3. Examples of critical values for various are presented in Table 7. The critical values can also be determined through simulations. Especially when dimension is high, simulation is a convenient way to obtain the results: For given (,, 3 ), we can use simulation by trying different until FWER(H H H 3 ) =. We have verified the critical values through simulations: For = 0.05 and = = 3 = 0.00855, = 0.00677; the type-i error rate is 0.05003 under H H H 3 through 0,000,000 simulations. This progressive method to determine the critical values can be generalized to K-hypothesis testing. Power Comparison Let H i : δ i 0, i =,, 3. Using the rejection boundaries in Table 7, we can easily obtain the power of the -Ex method through simulations. To compare the performance of -Ex, we compared the best method, Hommel s method as the standard. Since there are three hypotheses, it is meaningful to compare our approach to other more recently developed approaches. However, the gate eeping procedure is difficult to communicate with the nonstatisticians and requires large set of tests when the number of individual hypotheses increases. An iterative graphical approach by Bretz et al. (009) deals with those weaness and constructs the Bonferroni-type tests with a simple updating algorithm that fully describes a sequentially rejective test procedure. The graphic approach was then extended by dissociating the underlying weighting strategy and applied using weighted Bonferroni tests, weighted parametric tests and weighted Simes tests (Bretz et al., 0). The existing methods controls the FWER; However, the power is addressed or compared with other methods. From Table 8, we can see that the -Ex procedure provides more power in all cases except the case when δ = δ = 0 and δ = 0.3. Again, lie in the case of two-hypothesis testing, when the parameters in the alternative hypotheses (e.g., effects of the different endpoints) are very different, we should use different, and 3 such that their trend is in opposite to the trend of parameters in the alternative hypotheses. K-Hypothesis Testing Procedure We now discuss the progressive -exhaustive procedure for K-hypothesis testing. To avoid the rejection boundary being too small, causing inconvenience, we use term p i= i instead of p i = i in the decision rules for a general K-hypothesis testing. It is obvious that these two test statistics are equivalent in terms of power. For K-hypothesis testing, the K rejection rules are specified as: We will reject H i if and only if: /3 / ( ) ( ) p p p, p p, p. i l l i i > l, i, l i i We didn t include higher order term of p-product because through simulations, we find that even we set p i p j p p m, the FWER is controlled. This means that for a multiple testing problem with more than four hypothese, the proposed procedure may not be an alpha-exhaustive one. The rejection boundaries,,... K are progressively determined: Determine = based on 36

Mar Chang et al. / Current Research in Biostatistics 06, 6 (): 30. DOI:0.38/amjbsp.06.30. one-hypothesis testing, then given, determine based on two-hypothesis testing; and given and, determine 3 based on three-hypothesis testing. The process continues until K is determined. For high dimensional hypothesis testing problems, Partition Principle for multiple testing (Hsu, 996) can be used to reduce the number of null configurations to be tested. Simulation is usually more convenient than numerical integration when the dimension is high. Summary and Discussion To construct a MTP, we need to consider at least three things to ensure the power: () -exhaustive, () synergize strengths among data for local hypothesis or marginal p-values and (3) be able to use correlations between local test statistics or local p-values. In principle, the proposed -exhaustive procedure has considered all three aspects. To achieve -exhaustive, we use the marginal p-value product corresponding to each null hypothesis configuration and enforce it with an upper bound in the rejection rules. Such p-value product terms in the rejection rules also ensure the synergy between the marginal p-values. The K-hypothesis testing algorithm can be applied to the test statistics with correlations with modifications of critical regions for rejection (the critical values in Table -3 are applicable for independent test statistics), but due to its complexity and larger applications in clinical trials (dose-finding, subgroup analysis, adaptive design), we are developing separate manuscripts to address that. Unlie traditional stepwise procedures, the decision rule in the progressive -exhaustive procedure explicitly uses a set of statistics (p,p p,p p 3,p p p 3,etc.) with a set of critical values in the decision rule for rejecting a single H ( =,,..., K). In this sense, the decision rules in the -exhaustive procedure are expressed in the form of "adjusted p-values" and hence the order of the tests is irrelevant. Many stepwise testing procedures also have the feature of borrowing strengths among data for local hypotheses, but such dependencies are realized through a discrete function. The-exhaustive procedure uses a continuous dependency function of marginal p- values, i.e., product of p-values, which is more effective. We have also tried other functions such as average p-values or linear combination of normalinverse p-values, the results are similar. In summary, the proposed progressive -exhaustive procedure is not only statistically powerful, but it also stresses the importance of clinical/practical meaningfulness since the method emphasizes the consistency among the evidences coming from different endpoints, different doses and different populations, that is, the totality of the evidence. The test procedure is simple and performs well in broad situations. When the true "standardized" effect size (value of the parameter) is very different for different hypothesis, the critical values don t have to the same for rejecting all the hypotheses. Instead, the critical values can be different and optimized based on the prior information on effect size or considering the importance of different endpoints. Appendices Derivations of Formulations of Critical Values for Three-Hypothesis Testing FWER H = suppr H3 H3 ( H ) ( p p p3 p p p p3 p ) ( p pp3 p p p p3 p ) (( p p p ) ( p p p )) = suppr ( p p p ) Pr ( p p p ) ( p p p p ) = Pr + Pr = + ln + + ln,, < where, FWER(H H ) is the type-i error rate under FWER(H H ) and sup is the supreme under all H3 possible H 3. To control type-i error under H H, H H 3 and H 3 H, it is required that, respectively: ln ln + ln + + ln = + ln + + ln = + + 3 + 3 = 3 3 3 3 (A) From (A ), we can see that among, and 3, at least two should be the same, either = or = 3 ( ). The type-i error rate under H H H 3 can be expressed as: ( 3 ) ( p p p3 p p p p3 p ) ( p p p3 p p p p3 p ) ( p p p3 p3 p 3 p3 p 3 p3 ) FWER H H H = Pr For the purpose of critical value derivation, we rewrite FWER(H H H 3 )as: 37

Mar Chang et al. / Current Research in Biostatistics 06, 6 (): 30. DOI:0.38/amjbsp.06.30. ( ) FWER H H H = π + π 3 + π π π π + π 3 3 3 33 3 where, under H H H 3 : ( p p p p p p p p ) ( p p p p p p p p ) ( p p p p p p p p ) (A) π = Pr 3 3, π = Pr 3 3, (A3) π3 = Pr 3 3 3 3 3 3 p p p p p Pr, 3 π = p p3 p p3 p p p p p3 p p π 3 = Pr, (A) p p3 p3 p 3 p p3 p p p3 p3 p 3 π 3 = Pr, p p3 p p p3 p and: π 3 p p p3 p p < p p3 < = Pr p3 p < p < p < p3 < (A5) We will use Fig. 3 to assist in our integration of π, π andπ 3. That is: π + = adp dp + dp dp + dp dp Ω Ω p Ω3 p Ω dp dp p p = + + + p dpdp dp dp 0 p p dp dp p p = + ln + (A6) Similarly, we use Fig. to assist in our integration of π, π 3 and π 3. That is: π = dp dp + dp dp + dp dp Ω Ω p Ω3 p ( since ) p dpdp 0 p = + < ( ) = a + a = (A7) For π 3, we just give the result for the case when 3 = = 3. Assume, otherwise only p p p 3 has an effect in the joint probabilities, while p p and other will have no effect. Fig. 3.π = π = π 3 38

Mar Chang et al. / Current Research in Biostatistics 06, 6 (): 30. DOI:0.38/amjbsp.06.30. Fig..π Fig. 5. π 3 39

Mar Chang et al. / American Journal of Biostatistics 06, 6 (): 30. DOI:0.38/amjbsp.06.30. In fact, the is so small that the three curves p i p j = cut the cube into a smaller cube, as shown in Fig. 5. Therefore, we have: π = dp dp = (A8) 3 3 Ω To Summarize, we have: ( 3 ) FWER H H H = 3 a + ln + 3( ) + 3 = 3 + ln + 3 ( ) + 3 3 (A9) To control, FWER(H H H 3 ) at level, it is required that (assume we have chosen = = 3 ): 3 3a + ln 3 ( ) + 3 = (A0) After = = 3 is determined using (A), we can solve from (A0). For a general case, after is chosen between and, and 3 can be determined using (A) and can be easily obtained from simulations. The power simulations are presented in Table 8. SAS Code for Progressive Test Procedure /* Progressive Alpha-exhaustive Test Procedure for Two-Hypothesis */ %Macro aextesth(nsims, u, u, sigma, N, alpha, alpha, alpha); * nsims = the number of simulation runs; * u, u = parameters for H and H. sigma = common standard deviation; * N = sample size; * alpha, alpha, alpha = critical values on p-scale; * Power = prob of rejecting H or H, * PowerBoth = prob of rejecting H and H.; Data aexh; eep u u N PowerBoth Power; N=&N; u=&u; u=&u; sigma=σ alpha=α Power=0; PowerBoth=0; Do isim= To &nsims; z=rand("normal", &u, sigma/sqrt(n))/sigma*sqrt(n); p=-cdf("normal", z); z=rand("normal", &u, sigma/sqrt(n))/sigma*sqrt(n); p=-cdf("normal", z); sig=0; sig=0; If p*p<=&alpha And p<=alpha Then sig=; If p*p<=&alpha And p<=alpha Then sig=; If sig= OR sig= Then Power=Power+/&nSims; If sig= And sig= Then PowerBoth=PowerBoth+/&nSims; End; Output; Run; %Mend; Title "Checing Type-I Error under H and H"; %aextesth(0000000, 0, 0.0,, 90, 0.00855, 0.00855, 0.05); Proc print data=aexh; Run; Title "Power under H: u=0.3 and H: u= 0.3"; %aextesth(000000, 0.3, 0.3,, 90, 0.00855, 0.00855, 0.05); Proc print data=aexh; Run; /* Progressive Alpha-exhaustive Test Procedure for Three-Hypothesis */ %Macro aextest3h(nsims, u, u, u3, sigma, N, alpha, alpha, alpha); * nsims = the number of simulation runs; * N = sample size; * u, u, u3 = parmeters for H, H and H3; * alpha, alpha, alpha = cretical values on p-scale; * Power = prob of rejecting H or H or H3; * PowerAll = prob of rejecting H, H and H3 simutanneuosly; Data aex3h; eep u u u3 sigma N alpha PowerAll Power; u=&u; u=&u; u3=&u3; sigma=σ N=&N; alpha=α alpha=α alpha=α Power=0; PowerAll=0; Do isim= To &nsims; z=rand("normal", u, sigma/sqrt(n))/sigma*sqrt(n); p=-cdf("normal", z); z=rand("normal", u, sigma/sqrt(n))/sigma*sqrt(n); p=-cdf("normal", z); z3=rand("normal", u3, sigma/sqrt(n))/sigma*sqrt(n); p3=-cdf("normal", z3); sig=0; sig=0; sig3=0; p=p*p*p3; If p<=alpha & p*p<=alpha & p*p3<=alpha & p<=alpha Then sig=; If p<=alpha & p*p<=alpha & p*p3<=alpha & p<=alpha Then sig=; If p<=alpha & p3*p<=alpha & p3*p<=alpha & p3<=alpha Then sig3=; If sig= OR sig= Or sig3= Then Power=Power+/&nSims; 0

Mar Chang et al. / American Journal of Biostatistics 06, 6 (): 30. DOI:0.38/amjbsp.06.30. If sig= And sig= And sig3= Then PowerAll=PowerAll+/&nSims; End; Output; Run; %Mend; Title "Checing Type-I Error under H, H and H3"; %aextest3h(0000000, 0, 0, 0,, 60, 0.00855, 0.00677, 0.05); proc print data=aex3h; Run; Title "Power when u=0, u=0.3 and u3= 0.3"; %aextest3h(000000, 0, 0.3, 0.3,, 60, 0.00855, 0.00677, 0.05); Proc print data=aex3h; Run; Acnowledgment We would lie to than the anonymous reviewers and editors for their reviews. Author s Contributions Mar Chang: Involved in the methodology development, draft and approval of the manuscript. Xuan Deng: Contributed to reveiw, revise and approval of the manuscript. John Balser: Discussion on idea, issues, review manuscript. Ethics This article is original and contains unpublished material. The corresponding author confirms that all of the authors have read and approved the manuscript and no ethical issues involved. Reference Bretz, F., W. Maurer, W. Brannath and M. Posch, 009. A graphical approach to sequentially rejective multiple test procedures. Stat. Med., 8: 586-60. DOI: 0.00/sim.395 Bretz, F., W. Maurer and G. Hommel, 0. Test and power considerations for multiple endpoint analyses using sequentially rejective graphical procedures. Stat. Med., 30: 89.50. DOI: 0.00/sim.3988 Dmitrieno, A., A.C. Tamhane and F. Bretz, 009. Multiple Testing Problems in Pharmaceutical Statistics. st Edn., CRC Press, ISBN-0: 58889853, pp: 30. Dmitrieno, A., B.L. Wiens and P.H. Westfall, 006. Fallbac tests in dose-response clinical trials. J. Biopharmaceutical Stat., 6: 75-755. DOI: 0.080/05300600860600 Dmitrieno, A., 03. Multiple testing procedures in clinical trials. Proceedings of the IBS Worshop, Sept. 9-0, Berlin. Hochberg, Y., 988. A sharper Bonferroni procedure for multiple tests of significance. Biometria, 75: 800-80. DOI: 0.307/33635 Holm, S., 979. A simple sequentially rejective multiple test procedure. Scandinavian J. Stat., 6: 65-70. Hommel, G., 988. A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometria, 75: 383-386. DOI: 0.093/biomet/75..383 Hommel, G. and F. Bretz, 008. Aesthetics and power considerations in multiple testing-a contradiction? Biom. J., 50: 657-666. DOI: 0.00/bimj.007063 Hsu, J.C., 996. Multiple Comparisons: Theory and Methods. st Edn., CRC Press, ISBN-0: 0988, pp: 96. Westfall, P.H., R.D. Tobias, D. Rom, R.D. Wolfinger and Y. Hochberg, 999. Multiple comparisons and multiple tests using SAS system. SAS Institute, SAS Campus Drivew, Cary, North Carolina, USA. Wiens, B.L., 003. A fixed sequence Bonferroni procedure for testing multiple endpoints. Pharmaceutical Stat., : -5. DOI: 0.00/pst.6