Statistical Applications in Genetics and Molecular Biology


Statistical Applications in Genetics and Molecular Biology, Volume 5, Issue 1, Article 28 (2006)

A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments

Hongmei Jiang, Northwestern University, hongmei@northwestern.edu
Rebecca W. Doerge, Purdue University, doerge@purdue.edu

Copyright © 2006 The Berkeley Electronic Press. All rights reserved.

Abstract

For situations where the number of tested hypotheses is increasingly large, the power to detect statistically significant multiple treatment effects decreases. As is the case with microarray technology, researchers are often interested in identifying differentially expressed genes for more than two types of cells or treatments. A two-step procedure is proposed for the purpose of increasing power to detect significant effects (i.e., to identify differentially expressed genes). Specifically, in the first step, the null hypothesis of equality across the mean expression levels for all treatments is tested for each gene. In the second step, only pairwise comparisons corresponding to the genes for which the treatment means are statistically different in the first step are tested. We propose an approach to estimate the overall FDR for both fixed rejection regions and fixed FDR significance levels. Also proposed is a procedure to find the FDR significance levels used in the first step and the second step such that the overall FDR can be controlled below a pre-specified FDR significance level. When compared via simulation, the two-step approach has increased power over a one-step procedure, and controls the FDR at a desired significance level.

KEYWORDS: false discovery rate, multiple comparisons, multiple tests, testing differential expression

Acknowledgments: We are very grateful to two reviewers and the Associate Editor for their helpful comments and suggestions.

1 Introduction

Advances in many areas of technology (e.g., communication, health care, and biotechnology) are giving rise to vast experiments that provide data for conducting a very large number of repeated tests. These situations require a multiple comparison correction that not only accommodates the number of tests that are being conducted, but also controls the rate of false positives at a desired level. While this problem presents itself in a variety of applications, the one that motivated this work is microarray technology, a powerful tool that is widely applicable to almost every area of science (e.g., basic science, agriculture, and medical research). Microarrays provide a systematic way to study transcript variation for thousands of genes simultaneously. The key question addressed by most microarray experiments is which genes are differentially expressed between a pair of conditions (i.e., control and treatment). Numerous approaches that range from traditional statistical analyses to new statistical models have been proposed for testing differential gene expression (Schena et al., 1996; Baldi and Long, 2001; Efron, 2003; Newton et al., 2001; Gottardo et al., 2003; Tusher et al., 2001; Kerr et al., 2000; Wolfinger et al., 2001) between pairs of conditions. Since the traditional familywise error rate (FWER) multiple comparison procedures, such as Bonferroni's procedure, are too conservative, false discovery rate (FDR) controlling procedures (Benjamini and Hochberg, 1995) have been widely used in microarray studies. Benjamini and Hochberg (2000) propose an adaptive procedure that has increased power over the original procedure by incorporating an estimate of the proportion of true null hypotheses.
A variety of methods have been proposed to estimate the proportion of true null hypotheses for multiple testing problems, such as Storey's bootstrap method (Storey, 2002), Storey and Tibshirani's smoother estimate (Storey and Tibshirani, 2003), and Langaas et al.'s method based on nonparametric maximum likelihood estimation of the p-value density, under the restriction of decreasing and convex decreasing densities (Langaas et al., 2005). Although testing for differential expression of a gene between pairs of conditions or treatments is informative, in a microarray study it is quite common for researchers to be interested in comparing more than two treatment conditions for thousands of genes in the experiment. For instance, Hedenfalk et al. (2001) studied gene expression changes among breast cancers due to mutations in either the gene BRCA1 or the gene BRCA2 and sporadic tumor (i.e., three conditions) using 5,361 genes. With a large number ($m$) of genes, the number of pairwise comparisons is typically very large ($3m$ for 3 treatments, $6m$ for 4 treatments, etc.). Therefore, when the goal is to identify statistically

differentially expressed genes between each pair of conditions, in the typical one-step multiple comparison procedure $C \times m$ hypothesis tests ($C$ is the number of pairwise comparisons for each gene) are treated as a family, and a false discovery rate (FDR) controlling procedure such as Benjamini and Hochberg's procedure (Benjamini and Hochberg, 1995) is applied at a significance level $\alpha$. In situations where the majority of genes are not differentially expressed across the treatments, applying the FDR controlling procedure to a large family of multiple comparisons may not be most powerful, simply because the power of detecting differentially expressed genes decreases as the number of hypotheses increases. Lu et al. (2005) explored this issue and proposed a two-step strategy. In the first step, a subset of genes that are potentially differentially expressed among the treatments is identified with a loose criterion. In the second step, these potential genes are combined for detecting differentially expressed genes with a more stringent criterion. It is expected that the smaller number of genes in the second step will give rise to a more powerful test. In both steps of the procedure Lu et al. (2005) employ a Bonferroni adjustment to address the multiple comparison problem. Lu et al. (2005) point out that Benjamini and Hochberg's FDR controlling procedure (Benjamini and Hochberg, 1995) can be used in both steps but do not address the family-wise error rate (FWER) or the FDR for the entire procedure. Specifically, suppose the FDR significance levels used in the two steps are 0.05 and 0.01, respectively. The FDR for the whole procedure must be taken into account, and not limited to the individual FDRs at each step, since the false rejections in the first step will affect the results of the second step.
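To make the one-step baseline concrete, the following is a minimal sketch of the Benjamini and Hochberg (1995) step-up rule applied to all $C \times m$ pairwise p-values as a single family. The function names `benjamini_hochberg` and `one_step` and the nested-list layout of the p-values are assumptions made for this illustration, not notation from the paper.

```python
def benjamini_hochberg(pvalues, alpha):
    """Return the sorted indices rejected by the BH step-up procedure.

    Rejects the k smallest p-values, where k is the largest rank with
    p_(k) <= k * alpha / n.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda k: pvalues[k])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / n:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


def one_step(pair_pvalues, alpha):
    """One-step procedure: treat all C * m pairwise tests as one family.

    pair_pvalues[i][j] is the p-value of the j-th pairwise comparison
    for gene i (hypothetical layout chosen for this sketch).
    """
    flat = [(i, j, p)
            for i, row in enumerate(pair_pvalues)
            for j, p in enumerate(row)]
    rejected = benjamini_hochberg([p for _, _, p in flat], alpha)
    return [(flat[k][0], flat[k][1]) for k in rejected]
```

Because the BH thresholds are $k\alpha/n$ with $n = C \times m$, enlarging the family lowers every threshold, which is the loss of power the paragraph above describes.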
Using this as our motivation, a two-step multiple comparison procedure is proposed for testing pairwise comparisons of more than two treatments for a large number of genes such that the power to detect differentially expressed genes, while controlling the FDR at a pre-chosen significance level, will be higher than a one-step procedure. Although Lu et al. (2005) used a mixed model approach for their two-step procedure, our proposed two-step procedure is not limited by the specifics of the model. Specifically, in the first step, the null hypothesis of equality across the mean expression levels for all treatments is tested for each gene. In the second step, only pairwise comparisons corresponding to the genes for which the treatment means are statistically different in the first step are tested. The two-step procedure can be applied in practice in three different ways:

1. The rejection regions in the first and second step both can be fixed. That is, equality tests of expression levels for the genes in the first step with corresponding p-values less than or equal to $c_1$ are considered statistically significant, and pairwise comparisons in the second step with p-values less than or equal to $c_2$ are statistically significant, where

$c_1$ and $c_2$ are fixed and known. Although it is typical to use the term rejection region in conjunction with the term test statistic(s), here we rely on the term rejection region in conjunction with the term p-value(s) for ease of explanation;
2. One can apply an FDR controlling procedure at significance level $\alpha_1$ in the first step, and an FDR controlling procedure at significance level $\alpha_2$ in the second step, where $\alpha_1$ and $\alpha_2$ are fixed and known;
3. One can pre-specify the overall FDR significance level $\alpha$ to control the overall FDR below $\alpha$.

In this work we propose an approach to estimate the overall FDR for both fixed rejection regions (situation 1) and fixed FDR significance levels (situation 2). We also propose a procedure to find the FDR significance levels used in the first step and the second step such that the overall FDR can be controlled below a pre-specified FDR significance level. Using simulated data we demonstrate that our proposed two-step procedure has increased power over a one-step procedure and controls the FDR for the entire procedure at a desired significance level.

2 A two-step multiple comparison procedure

A novel two-step multiple comparison procedure is proposed in the context of testing for differential expression. Initially, we present it generally with no specific FDR controlling procedure specified:

Step 1. The null hypothesis that a gene is not differentially expressed across all treatment conditions is tested for each gene (e.g., the global F-test from an ANOVA model). For the family of $m$ tests corresponding to the $m$ genes, an FDR controlling procedure is applied to control the FDR at level $\alpha_1$. Suppose there are $K$ tests that are significant. Let $A$ denote the collection of the genes which have statistically significant treatment effects.
If $K = 0$, the procedure is stopped and it is concluded that no pairwise comparisons are significant and that there are no differentially expressed genes; otherwise, go to Step 2.

Step 2. (a) For genes not belonging to $A$, conclude pairwise comparisons among the treatments for these genes are not significant.
(b) For genes belonging to $A$, perform pairwise ($C$) comparisons for each gene. Since there are $K$ genes, in total there are $C \times K$ pairwise comparisons. Apply an FDR controlling procedure for this family of $C \times K$ tests at level $\alpha_2$.

The procedure above uses FDR significance levels $\alpha_1$ and $\alpha_2$; it can also be carried out with fixed rejection regions in a similar way, as follows.

Step 1. The null hypothesis that a gene is not differentially expressed across all treatment conditions is tested for each gene (e.g., the global F-test from an ANOVA model). Tests with p-values $\le c_1$ are considered statistically significant. Suppose there are $K$ tests that are significant. Let $A$ denote the collection of the genes which have statistically significant treatment effects. If $K = 0$, the procedure is stopped and it is concluded that no pairwise comparisons are significant and that there are no differentially expressed genes; otherwise, go to Step 2.

Step 2. (a) For genes not belonging to $A$, conclude pairwise comparisons among the treatments for these genes are not significant.
(b) For genes belonging to $A$, perform pairwise ($C$) comparisons for each gene. Since there are $K$ genes, in total there are $C \times K$ pairwise comparisons. Pairwise comparisons with p-values $\le c_2$ are considered statistically significant.

We assume that if a gene does not have a significant treatment effect (tested in Step 1), then all of the pairwise comparisons among the treatments corresponding to that gene are not significant. Only genes with a statistically significant treatment effect will enter into the second step to be tested for pairwise comparisons. However, if a gene has a significant treatment effect (Step 1), some or all of the pairwise comparisons may not be significant. For the fixed FDR significance levels $\alpha_1$ and $\alpha_2$, or the fixed rejection regions $[0, c_1]$ and $[0, c_2]$ in Step 1 and Step 2, respectively, determination of the overall FDR remains necessary. Choosing the significance level $\alpha_1$ in Step 1 and $\alpha_2$ in Step 2 so that the FDR for the entire two-step procedure is controlled at a desired significance level $\alpha$ is an additional issue that is of interest.
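The fixed-rejection-region version of the two steps can be sketched in a few lines. This is only an illustration that takes already-computed p-values as input; the function name `two_step_fixed` and the list-based data layout are assumptions for the sketch.

```python
def two_step_fixed(gene_pvalues, pair_pvalues, c1, c2):
    """Two-step procedure with fixed rejection regions [0, c1] and [0, c2].

    gene_pvalues[i]    : Step 1 p-value of the global test for gene i.
    pair_pvalues[i][j] : p-value of the j-th pairwise comparison for gene i.
    Returns (A, rejected), where A lists the genes passing Step 1 and
    rejected lists the (gene, pair) comparisons declared significant.
    """
    # Step 1: genes with a significant overall treatment effect (set A).
    selected = [i for i, p in enumerate(gene_pvalues) if p <= c1]
    if not selected:
        # K = 0: stop, no pairwise comparison is declared significant.
        return [], []
    # Step 2: pairwise comparisons are examined only for genes in A.
    rejected = [(i, j)
                for i in selected
                for j, p in enumerate(pair_pvalues[i])
                if p <= c2]
    return selected, rejected
```

Setting `c1 = 1` lets every gene through Step 1, so the result coincides with thresholding all pairwise p-values at `c2` in one step.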
To address these issues the two-step multiple comparison procedure is investigated further to gain an appreciation of the overall FDR relative to the FDR in each step of the procedure.

3 Estimating FDR for fixed rejection regions

3.1 Derivation of the FDR

Assume the two-step procedure with fixed rejection regions is used. That is, assume that genes with p-values $\le c_1$ have a significant treatment effect (i.e., at least one treatment mean is different from others) in Step 1; and the pairwise comparisons with p-values $\le c_2$ are identified as statistically significant in Step 2, where $c_1$ and $c_2$ are known. Our goal is to compute the overall FDR for the

two-step multiple comparison procedure. The approach is similar to Storey's positive false discovery rate (pFDR) procedure (Storey, 2002, 2003b) where one estimates the FDR for a given rejection region. Let $H_0^i$ denote the null hypothesis of no treatment effect for the $i$th gene and let $H_0^{ij}$ denote the null hypothesis that the $j$th pair of treatment means are not different for the $i$th gene. For instance, if three treatments are of interest, $j = 1, 2, 3$; if four treatments are of interest, $j = 1, 2, \ldots, 6$. Let $D_i = 0$ indicate that there is no treatment effect for the $i$th gene, and let $D_i = 1$ indicate a treatment effect for the $i$th gene. Furthermore, let $D_{ij} = 0$ indicate that the means of the $j$th pair of treatments for gene $i$ are the same, and $D_{ij} = 1$ when they are different. If $D_i = 0$, then $D_{ij} = 0$ for all $j$. Finally, let $p_i$ denote the p-value for testing the null hypothesis $H_0^i$ in Step 1, and $p_{ij}$ denote the p-value for testing the null hypothesis $H_0^{ij}$ in Step 2.

Our two-step multiple comparison approach is different from the one-step multiple comparison procedure, where the decision to reject depends only on $p_{ij}$, since the decision whether to reject $H_0^{ij}$ or not in the two-step multiple comparison procedure depends on both $p_i$ and $p_{ij}$. Essentially, the two-step multiple comparison procedure has two criteria. The null hypothesis $H_0^{ij}$ is rejected if and only if both conditions $p_i \le c_1$ and $p_{ij} \le c_2$ are satisfied. Obviously, the two-step comparison procedure is exactly the one-step procedure when $c_1 = 1$. In fact, if $c_1$ is large enough such that the two events, $\{p_{ij} \le c_2\}$ for some $j$, and $\{p_i \le c_1\}$, occur simultaneously for every gene $i$, then the two-step comparison procedure will produce the same results as the one-step procedure.

Theorem 1.
In a two-step multiple comparison procedure, suppose that objects/genes with p-values $\le c_1$ are considered as having a significant treatment effect (i.e., at least one treatment mean is different from others) in Step 1; and the pairwise comparisons with p-values $\le c_2$ are identified as statistically significant in Step 2. Assume $c_1$ and $c_2$ are known, and the objects/genes are independent. The pFDR of this two-step multiple comparison procedure is:

$$\mathrm{pFDR} = \mathrm{pFDR}_1 \, \frac{P(p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} + (1 - \mathrm{pFDR}_1) \, \frac{P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1, p_i \le c_1) \, P(D_{ij} = 0 \mid D_i = 1, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}, \tag{1}$$

where $\mathrm{pFDR}_1 = P(D_i = 0 \mid p_i \le c_1)$, which is the pFDR in Step 1.

Proof. Since the goal of the two-step multiple comparison procedure is to identify statistically significant pairwise comparisons, only the rejections in Step 2 are of interest. Assume the objects/genes are independent. Using the Bayesian interpretation of pFDR (Storey, 2003b), the pFDR for the whole procedure is the probability of having a false rejection of a pairwise comparison given that it is in the rejection region (i.e., the probability that $D_{ij} = 0$ given that $p_i \le c_1$ and $p_{ij} \le c_2$),

$$\mathrm{pFDR} = P(D_{ij} = 0 \mid p_i \le c_1, p_{ij} \le c_2) = \frac{P(D_{ij} = 0, p_{ij} \le c_2 \mid p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}. \tag{2}$$

To compute the numerator of equation (2), falsely rejected genes in the first step are treated separately from the rejected genes that in fact have different treatment effects:

$$\begin{aligned}
P(D_{ij} = 0, p_{ij} \le c_2 \mid p_i \le c_1)
&= P(D_{ij} = 0, p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1) \, P(D_i = 0 \mid p_i \le c_1) \\
&\quad + P(D_{ij} = 0, p_{ij} \le c_2 \mid D_i = 1, p_i \le c_1) \, P(D_i = 1 \mid p_i \le c_1) \\
&= P(D_{ij} = 0, p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1) \, \mathrm{pFDR}_1 \\
&\quad + P(D_{ij} = 0, p_{ij} \le c_2 \mid D_i = 1, p_i \le c_1) \, (1 - \mathrm{pFDR}_1) \\
&= P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 0, p_i \le c_1) \, P(D_{ij} = 0 \mid D_i = 0, p_i \le c_1) \, \mathrm{pFDR}_1 \\
&\quad + P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1, p_i \le c_1) \, (1 - \mathrm{pFDR}_1) \, P(D_{ij} = 0 \mid D_i = 1, p_i \le c_1).
\end{aligned}$$

Assume that all pairwise comparisons for a gene are not significant if that gene does not have a significant treatment effect; then

$$P(D_{ij} = 0 \mid D_i = 0, p_i \le c_1) = P(D_{ij} = 0 \mid D_i = 0) = 1.$$

Then,

$$P(D_{ij} = 0, p_{ij} \le c_2 \mid p_i \le c_1) = P(p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1) \, \mathrm{pFDR}_1 + (1 - \mathrm{pFDR}_1) \, P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1, p_i \le c_1) \, P(D_{ij} = 0 \mid D_i = 1, p_i \le c_1). \tag{3}$$

Combining equation (3) with equation (2) gives rise to the pFDR formulation in equation (1).
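The decomposition in the proof can be checked numerically on a toy model. The sketch below enumerates a small joint distribution over $(D_i, D_{ij})$ and the two rejection events, then compares the directly computed conditional probability of equation (2) with the right-hand side of equation (1). The model and all parameter names (`pi_alt`, `a`, `b`, `power1`, `q`, `power2`) are hypothetical choices for this illustration only.

```python
from itertools import product


def joint_pmf(pi_alt, a, b, power1, q, c2, power2):
    """Toy joint pmf over states (D_i, D_ij, r1, r2).

    r1 = 1 means p_i <= c1 (gene passes Step 1); r2 = 1 means p_ij <= c2.
    """
    pmf = {}
    for d_i, d_ij, r1, r2 in product((0, 1), repeat=4):
        if d_i == 0:
            # No treatment effect forces D_ij = 0 for every pair.
            prob_state = (1 - pi_alt) * (1.0 if d_ij == 0 else 0.0)
            p_r1, p_r2 = a, b
        else:
            prob_state = pi_alt * (q if d_ij == 0 else 1 - q)
            p_r1 = power1
            p_r2 = c2 if d_ij == 0 else power2
        prob_state *= p_r1 if r1 else 1 - p_r1
        prob_state *= p_r2 if r2 else 1 - p_r2
        pmf[(d_i, d_ij, r1, r2)] = prob_state
    return pmf


def cond_prob(pmf, event, given):
    """P(event | given) by direct summation over the pmf."""
    den = sum(p for s, p in pmf.items() if given(s))
    num = sum(p for s, p in pmf.items() if given(s) and event(s))
    return num / den


def check_theorem1(pmf):
    """Return (direct pFDR from eq. 2, value of the eq. 1 formula)."""
    direct = cond_prob(pmf, lambda s: s[1] == 0,
                       lambda s: s[2] == 1 and s[3] == 1)
    pfdr1 = cond_prob(pmf, lambda s: s[0] == 0, lambda s: s[2] == 1)
    p_rej = cond_prob(pmf, lambda s: s[3] == 1, lambda s: s[2] == 1)
    p_rej_null_gene = cond_prob(pmf, lambda s: s[3] == 1,
                                lambda s: s[0] == 0 and s[2] == 1)
    p_rej_null_pair = cond_prob(
        pmf, lambda s: s[3] == 1,
        lambda s: s[1] == 0 and s[0] == 1 and s[2] == 1)
    p_null_pair = cond_prob(pmf, lambda s: s[1] == 0,
                            lambda s: s[0] == 1 and s[2] == 1)
    formula = (pfdr1 * p_rej_null_gene
               + (1 - pfdr1) * p_rej_null_pair * p_null_pair) / p_rej
    return direct, formula
```

Agreement between the two returned numbers for any parameter values in $(0, 1)$ only confirms the algebra of the proof; it says nothing about how well the components can be estimated from data.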

3.2 Estimation of the FDR

With respect to microarray studies, the probability of having at least one rejection, $P(R > 0)$, is almost 1, making the FDR and the pFDR essentially the same (Storey et al., 2004; Black, 2004). Therefore, the pFDR can be replaced with FDR in equation (1), and the FDR for a two-step multiple comparison procedure is

$$\mathrm{FDR} = \mathrm{FDR}_1 \, \frac{P(p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} + (1 - \mathrm{FDR}_1) \, \frac{P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1, p_i \le c_1) \, P(D_{ij} = 0 \mid D_i = 1, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}. \tag{4}$$

To estimate the FDR of the two-step multiple comparison procedure with fixed rejection regions, the five components of equation (4) have to be estimated:

(1) $P(p_{ij} \le c_2 \mid p_i \le c_1)$ can be estimated using the proportion of rejections among the pairwise comparisons that occur in Step 2. That is,

$$\hat{P}(p_{ij} \le c_2 \mid p_i \le c_1) = \frac{\#\{p_{ij} : p_{ij} \le c_2, p_i \le c_1\}}{\#\{p_i : p_i \le c_1\} \times C}, \tag{5}$$

where $C$ is the number of pairwise comparisons for each gene, $\#\{p_i : p_i \le c_1\}$ is the number of statistically significant genes (i.e., with p-values $\le c_1$) in Step 1, and $\#\{p_{ij} : p_{ij} \le c_2, p_i \le c_1\}$ is the number of significant pairwise comparisons (i.e., with p-values $\le c_2$) in Step 2.

(2) The FDR in Step 1, $\mathrm{FDR}_1$, can be estimated using the approach of Storey (2002):

$$\widehat{\mathrm{FDR}}_1 = \frac{c_1 \hat{\pi}_{01}}{\#\{p_i : p_i \le c_1\} / m}, \tag{6}$$

where $m$ is the total number of genes, $\#\{p_i : p_i \le c_1\}$ is the number of p-values $\le c_1$ in Step 1, and $\hat{\pi}_{01}$ is the estimate of $\pi_{01}$, the proportion of true null hypotheses in Step 1 (i.e., the proportion of genes which in fact have no treatment effect among all $m$ genes). Details about estimating the proportion of true null hypotheses are not covered here; references are given in Section 1.
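Components (1) and (2) are simple counting estimators, sketched below. The $\pi_0$ estimator shown is the fixed-$\lambda$ form from Storey (2002), $\#\{p > \lambda\} / (m(1 - \lambda))$, used here as a stand-in; the choice $\lambda = 0.5$, the capping at 1, and all function names are assumptions of this sketch.

```python
def storey_pi0(pvalues, lam=0.5):
    """Fixed-lambda estimate of the proportion of true nulls, capped at 1."""
    m = len(pvalues)
    return min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))


def step2_rejection_rate(gene_pvalues, pair_pvalues, c1, c2):
    """Equation (5): share of Step 2 comparisons with p_ij <= c2."""
    selected = [i for i, p in enumerate(gene_pvalues) if p <= c1]
    if not selected:
        return 0.0  # no gene enters Step 2, so nothing is rejected
    n_pairs = len(pair_pvalues[0])  # C, comparisons per gene
    n_rejected = sum(p <= c2 for i in selected for p in pair_pvalues[i])
    return n_rejected / (len(selected) * n_pairs)


def step1_fdr(gene_pvalues, c1, pi01_hat):
    """Equation (6): c1 * pi01_hat / (#{p_i <= c1} / m), capped at 1."""
    m = len(gene_pvalues)
    n_selected = sum(p <= c1 for p in gene_pvalues)
    if n_selected == 0:
        return 0.0
    return min(1.0, c1 * pi01_hat * m / n_selected)
```

For example, with four genes of which two have $p_i \le 0.05$ and $\hat{\pi}_{01} = 1$, equation (6) gives $0.05 \times 1 / (2/4) = 0.1$.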

(3) $P(p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1)$ is the probability of claiming a statistically significant pairwise comparison which is associated with a falsely rejected gene (tested in Step 1). A resampling technique can be employed to estimate this probability. The following procedure is applied to the cases where a global F-test from an ANOVA model with constant variance and normal distribution assumption is employed to test for the treatment effect in Step 1. The concept is to generate a large data set under the true null hypothesis (i.e., all treatment means are the same for all genes) and then analyze these data in the same manner as the real (actual) data. The proportion of rejections in Step 2 (ratio of the number of rejections to the total number of pairwise comparisons) is then computed. The specifics are as follows:

(i) Using the same sample size as the real data, generate a random sample from a standard normal distribution for a large number of genes (e.g., $M = 100{,}000$). Assume there are 3 treatment conditions and $n$ observations within each treatment condition, making the random sample of size $3nM$.

(ii) These data are then analyzed using the same analysis as used for the real data. The p-value ($p_i$) for testing the null hypothesis that the treatment means are equal, and the p-values ($p_{ij}$) for testing the pairwise comparisons, are computed for $i = 1, \ldots, M$. Let $\#\{p_i : p_i \le c_1\}$ be the number of p-values such that $p_i \le c_1$, and $\#\{p_{ij} : p_{ij} \le c_2, p_i \le c_1\}$ be the number of p-values such that $p_{ij} \le c_2$, where $i$ is chosen such that $p_i \le c_1$. These quantities, obtained by resampling, provide an estimate of the probability of claiming a statistically significant pairwise comparison that is associated with a falsely rejected gene, namely

$$\hat{P}(p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1) = \frac{\#\{p_{ij} : p_{ij} \le c_2, p_i \le c_1\}}{\#\{p_i : p_i \le c_1\} \times C}, \tag{7}$$

where $C$ is the number of pairwise comparisons for each gene.
In Section 6, we present an algorithm for situations when the experimental design is unbalanced and the data are not normally distributed. A permutation method is used to estimate the true null distribution of the test statistics.

(4) The estimate of $P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1, p_i \le c_1)$ is $c_2$ when the

probability $P(p_i \le c_1 \mid D_{ij} = 0, D_i = 1) = 1$. Notice that

$$\begin{aligned}
& P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1, p_i \le c_1) - P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1) \\
&\quad = \frac{P(p_{ij} \le c_2, p_i \le c_1 \mid D_{ij} = 0, D_i = 1)}{P(p_i \le c_1 \mid D_{ij} = 0, D_i = 1)} - P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1) \\
&\quad \le \frac{P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1)}{P(p_i \le c_1 \mid D_{ij} = 0, D_i = 1)} - P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1) \\
&\quad = P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1) \, \frac{1 - P(p_i \le c_1 \mid D_{ij} = 0, D_i = 1)}{P(p_i \le c_1 \mid D_{ij} = 0, D_i = 1)}.
\end{aligned}$$

Since the p-value $p_{ij}$ corresponding to $D_i = 1$ and $D_{ij} = 0$ is uniformly distributed on the interval (0,1), $P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1) = c_2$. Hence,

$$P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1, p_i \le c_1) - c_2 \le c_2 \, \frac{1 - P(p_i \le c_1 \mid D_{ij} = 0, D_i = 1)}{P(p_i \le c_1 \mid D_{ij} = 0, D_i = 1)}. \tag{8}$$

Therefore, when $P(p_i \le c_1 \mid D_{ij} = 0, D_i = 1) = 1$, $P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1, p_i \le c_1) = c_2$ holds. For an infinite sample size, the event $\{p_i \le c_1 \mid D_{ij} = 0, D_i = 1\}$ is deterministic regardless of the value that $c_1$ takes. For a finite sample size, $P(p_i \le c_1 \mid D_{ij} = 0, D_i = 1)$ can be very close to, or equal to, 1 for a reasonable value of $c_1$. For example, suppose there are three treatment conditions with an equal sample size $n$ under each of the three conditions. Suppose further that a gene has treatment means (0, 0, 3). Using the noncentral F-distribution under the assumption of the normal distribution, $P(p_i \le 0.01 \mid D_{ij} = 0, D_i = 1) =$ [...] when $n = 6$, [...] when $n = 10$, and 1 when $n = 30$; $P(p_i \le$ [...] $\mid D_{ij} = 0, D_i = 1) =$ [...] when $n = 6$, [...] when $n = 10$, and 1 when $n = 30$. When $c_1$ is extremely small, $P(p_i \le c_1 \mid D_{ij} = 0, D_i = 1)$ can be much smaller than 1 for a finite sample size. Using equation (8), the following method can be employed to provide an overestimate of $P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1, p_i \le c_1)$:

$$\hat{P}(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1, p_i \le c_1) = c_2 + c_2 \, \frac{1 - \hat{P}(p_i \le c_1 \mid D_{ij} = 0, D_i = 1)}{\hat{P}(p_i \le c_1 \mid D_{ij} = 0, D_i = 1)}. \tag{9}$$

Let $E$ be the set of genes which enter the second step of the two-step procedure, but do not have all pairwise comparisons statistically significant, i.e., $E = \{\text{gene } g : p_g \le c_1, \text{ at least one } j \text{ such that } p_{gj} > c_2\}$. Since the true means are unknown they have to be estimated. For gene $g \in E$, let $\bar{x}_{gj}$ denote the sample mean for gene $g$ under treatment condition $j$, and let $[j]$ denote the treatment whose sample mean has the $j$th smallest magnitude (absolute value). For example, if the three treatment means for gene $g$ satisfy $|\bar{x}_{g3}| < |\bar{x}_{g1}| < |\bar{x}_{g2}|$, then $[1] = 3$, $[2] = 1$ and $[3] = 2$. For gene $g \in E$, define the pseudo means under the $J$ treatment conditions as follows: $\tilde{\mu}_{g[1]} = \cdots = \tilde{\mu}_{g[J-1]} = 0$ and $\tilde{\mu}_{g[J]} = \tilde{\mu}_g$, where $\tilde{\mu}_g = \max\{|\bar{x}_{gi} - \bar{x}_{gj}| : i, j = 1, \ldots, J,\ i \ne j\}$. It becomes necessary to compute the probability that a gene with these pseudo means will have a p-value for testing the equality of means below $c_1$. Under the assumption of normality, the global F-test statistic for testing the equality of the means has a non-central F-distribution with non-centrality parameter

$$\mathrm{ncp}_g = \left( \sum_{j=1}^{J-1} n_{[j]} \left(0 - \tilde{\mu}_g / J\right)^2 + n_{[J]} \left(\tilde{\mu}_g - \tilde{\mu}_g / J\right)^2 \right) \Big/ \hat{\sigma}_g^2,$$

where $n_j$ is the sample size under treatment $j$ and $\hat{\sigma}_g^2$ is the estimate of the variance for gene $g$. Then

$$\hat{P}(p_g \le c_1 \mid D_{gj} = 0, D_g = 1) = P\left(f_{J-1, N-J, \mathrm{ncp}_g} \ge F^{-1}_{J-1, N-J}(1 - c_1)\right),$$

where $N = \sum_j n_j$, $f_{J-1, N-J, \mathrm{ncp}_g}$ is a random variable with a non-central F-distribution with degrees of freedom $J - 1$ and $N - J$ and non-centrality parameter $\mathrm{ncp}_g$, and $F^{-1}_{J-1, N-J}(1 - c_1)$ is the $(1 - c_1) \times 100$th percentile of an F-distribution with degrees of freedom $J - 1$ and $N - J$. Thus,

$$\hat{P}(p_i \le c_1 \mid D_{ij} = 0, D_i = 1) = \text{average of } \hat{P}(p_g \le c_1 \mid D_{gj} = 0, D_g = 1), \tag{10}$$

where $g \in E$. When the assumption of normality does not hold, a permutation method is presented (in Section 6) to estimate this probability.
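The pseudo-mean construction and the overestimate of equation (9) can be sketched as follows. The function names are hypothetical, and the Step 1 power passed to `overestimate_null_pair_rejection` is assumed to have been obtained separately (for instance from a non-central F survival function such as `scipy.stats.ncf.sf`, which is not called here so that the sketch stays dependency-free).

```python
def pseudo_ncp(sample_means, sample_sizes, sigma2):
    """Non-centrality parameter for the pseudo means (0, ..., 0, mu_tilde).

    sample_means[j], sample_sizes[j] : per-treatment mean and n for one gene.
    sigma2                           : variance estimate for that gene.
    """
    n_treat = len(sample_means)
    # mu_tilde: largest absolute pairwise difference of the sample means.
    mu_tilde = max(abs(a - b) for a in sample_means for b in sample_means)
    # Treatments ordered by increasing |sample mean|; the last one is [J].
    order = sorted(range(n_treat), key=lambda j: abs(sample_means[j]))
    center = mu_tilde / n_treat
    ncp = sum(sample_sizes[j] * (0.0 - center) ** 2 for j in order[:-1])
    ncp += sample_sizes[order[-1]] * (mu_tilde - center) ** 2
    return ncp / sigma2


def overestimate_null_pair_rejection(c2, step1_power):
    """Equation (9): c2 + c2 * (1 - power) / power, which equals c2 / power."""
    return c2 + c2 * (1.0 - step1_power) / step1_power
```

For sample means (0, 0, 3) with $n = 6$ per treatment and unit variance, the pseudo means are (0, 0, 3), their average is 1, and the non-centrality parameter is $6 \cdot 1 + 6 \cdot 1 + 6 \cdot 4 = 36$. Equation (9) simplifies to $c_2$ divided by the estimated Step 1 power, so it reduces to $c_2$ when that power is 1.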
(5) The last component of equation (4), $P(D_{ij} = 0 \mid D_i = 1, p_i \le c_1)$, can be estimated using the proportion of non-significant pairwise comparisons among all pairwise comparisons associated with correctly rejected genes in

Step 1. However, it is impossible to separate the correctly rejected genes from the falsely rejected genes, hence an overestimate is pursued. Define $\hat{\pi}_{02}$ as the estimate of the proportion of true null hypotheses given the distribution of the p-values in Step 2. We emphasize "true" here because $\hat{\pi}_{02}$ is computed based on the distribution of p-values in Step 2 using the same methods as those used to estimate $\pi_{01}$, and it is not exactly $P(D_{ij} = 0 \mid p_i \le c_1)$. Let $K$ denote the number of genes in Step 2; then $C \times K \times \hat{\pi}_{02}$ estimates the number of true null hypotheses based on the p-values, and $C \times K \times (1 - \widehat{\mathrm{FDR}}_1)$ is the estimated number of pairwise comparisons generated by correctly rejected genes. Since the p-value ($p_{ij}$) corresponding to $D_i = 1$ and $D_{ij} = 0$ is approximately uniformly distributed, and the estimate $C \times K \times \hat{\pi}_{02}$ also includes some true null hypotheses corresponding to $D_i = 0$ and $D_{ij} = 0$, the number of true null hypotheses ($D_{ij} = 0$) corresponding to $D_i = 1$ is less than or equal to $C \times K \times \hat{\pi}_{02}$. Therefore,

$$\hat{P}(D_{ij} = 0 \mid D_i = 1, p_i \le c_1) = \frac{C \times K \times \hat{\pi}_{02}}{C \times K \times (1 - \widehat{\mathrm{FDR}}_1)} = \frac{\hat{\pi}_{02}}{1 - \widehat{\mathrm{FDR}}_1}.$$

Using equations (5)-(9), along with the estimates of the proportions of true null hypotheses ($\hat{\pi}_{01}$ and $\hat{\pi}_{02}$) in Step 1 and Step 2, the FDR (equation 4) of the two-step multiple comparison procedure can be estimated by

$$\widehat{\mathrm{FDR}} = \frac{\hat{P}(p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1)}{\hat{P}(p_{ij} \le c_2 \mid p_i \le c_1)} \, \widehat{\mathrm{FDR}}_1 + \frac{\hat{P}(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1, p_i \le c_1)}{\hat{P}(p_{ij} \le c_2 \mid p_i \le c_1)} \, \hat{\pi}_{02}. \tag{11}$$

3.3 Simulation study and results

A simulation study is employed to illustrate the accuracy of the proposed method for estimating the FDR of the two-step multiple comparison procedure. Assume there are 3 treatments, and $m = 1000$ genes. Allow a proportion ($R_1$) of the genes to have a treatment effect.
For any gene having a treatment effect, there are two cases: it is differentially expressed across all three treatments; or it is not differentially expressed between two treatments, but differentially expressed under the third treatment. Among the genes which have a treatment effect, assume a proportion ($R_2$) of them are not differentially expressed between two treatments, but differentially expressed under the third treatment.

That is, $R_1 m$ genes have treatment effects, and $R_1 R_2 m$ genes have treatment means $(\mu_a, \mu_0, \mu_0)$ or $(\mu_0, \mu_a, \mu_0)$ or $(\mu_0, \mu_0, \mu_a)$, where $\mu_0$ and $\mu_a$ are different; and $R_1 (1 - R_2) m$ genes have treatment means $(\mu_1, \mu_2, \mu_3)$, where $\mu_1$, $\mu_2$, and $\mu_3$ are different. In this simulation, half of the $R_1 R_2 m$ genes are chosen to have mean (2,0,0) and the other half have mean (4,0,0); and the $R_1 (1 - R_2) m$ genes have means (4,2,0). For the $(1 - R_1) m$ genes not having a treatment effect, the mean vector is (0,0,0). The values for $R_1$ are 0.10, 0.20, 0.30, 0.40, and 0.50, and the values for $R_2$ are 0.0, 0.20, 0.40, 0.60, 0.80 and 1. Large values of $R_1$ are not used in this simulation because the proportion of significant genes in most microarray studies is relatively small. Assume for each gene that there are $n = 6$ observations under each of the treatments. For each combination of $R_1$ and $R_2$, 1000 data sets (each of size 1000 genes × 6 replicates × 3 treatments) are generated from normal distributions with standard deviation 1. For each simulated data set, 1000 global F-test statistics corresponding to the $m = 1000$ genes are computed for testing equality of the three treatment means. If a gene has a p-value smaller than or equal to a pre-specified level $c_1$, then it is considered as having a significant treatment effect, and thus enters the second step. In the second step, for the genes with statistically significant treatment effects from Step 1, pairwise comparisons are performed using t-tests. Pairwise comparisons with a p-value less than or equal to a pre-specified level $c_2$ are considered statistically significant. Various values of $c_1$ and $c_2$ are used in the simulation.
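The mean configuration of this simulation design can be laid out programmatically. The sketch below builds only the true mean vectors and the true status $D_{ij}$ of each pairwise comparison (it does not generate data or run tests); the function names and the rounding of $R_1 m$ and $R_1 R_2 m$ to integers are assumptions of the sketch.

```python
PAIRS = [(0, 1), (0, 2), (1, 2)]  # the C = 3 pairwise comparisons


def simulation_means(m, r1, r2):
    """Return the list of true mean vectors for m genes.

    r1 : proportion of genes with a treatment effect.
    r2 : among those, proportion with one mean different and two equal.
    """
    n_effect = round(r1 * m)
    n_partial = round(r2 * n_effect)
    n_low = n_partial // 2  # half get (2,0,0), the rest get (4,0,0)
    return ([(2, 0, 0)] * n_low
            + [(4, 0, 0)] * (n_partial - n_low)
            + [(4, 2, 0)] * (n_effect - n_partial)
            + [(0, 0, 0)] * (m - n_effect))


def true_pair_status(mean_vector):
    """D_ij for each pair: 1 if the two treatment means differ, else 0."""
    return [int(mean_vector[a] != mean_vector[b]) for a, b in PAIRS]


def count_true_null_pairs(means):
    """Number of pairwise comparisons with D_ij = 0 over all genes."""
    return sum(status == 0
               for vec in means
               for status in true_pair_status(vec))
```

With $m = 1000$, $R_1 = 0.10$ and $R_2 = 0.20$ this yields 900 null genes (three null pairs each) and 20 genes with exactly one null pair, that is, 2720 true null pairwise comparisons out of 3000.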
For each simulated data set, $\hat{\pi}_{01}$ and $\hat{\pi}_{02}$, the estimates of the proportion of true null hypotheses in Step 1 and Step 2, are computed using Storey and Tibshirani's smoother estimate (Storey and Tibshirani, 2003), and the FDR is estimated using equation (11). The averages of the estimated FDR from 1000 simulations for $(c_1, c_2) = (0.10, 0.05)$, $(0.10, 0.01)$ and $(0.05, 0.01)$ are presented in Table 1. The average of the true FDR from the 1000 simulations is also presented. For the estimated FDR presented in Table 1, $P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1, p_i \le c_1)$ is estimated using $c_2$ instead of equation (9). It is clear that the estimated FDR is very close to the true FDR when $c_1$ is not too small, which indicates that $P(p_i \le c_1 \mid D_{ij} = 0, D_i = 1)$ is close to 1. As seen in Table 1, the proposed method yields accurate estimates of the overall FDR. As one would expect, the overall FDR for any two-step procedure depends on the configuration of $R_1$ and $R_2$. For our two-step approach with $c_1 = 0.10$, $c_2 = 0.05$, when $R_1 = 0.10$ and $R_2 = 1.0$ the FDR can be as big as 0.39, yet when $R_1 = 0.50$ and $R_2 = 0.0$ the FDR can be as small as [...]. For the same value of $R_1$ and the same rejection regions $[0, c_1]$ in Step 1 and $[0, c_2]$ in Step 2, the FDR increases as $R_2$ increases. On the other hand, for the same value of $R_2$ and the same rejection regions $[0, c_1]$ in Step 1 and $[0, c_2]$ in Step

2, the FDR decreases as $R_1$ (the proportion of genes having a treatment effect) increases.

4 Estimating FDR for fixed FDR significance levels

The two-step multiple comparison procedure can also be applied using fixed FDR significance levels in Step 1 and Step 2, respectively. For instance, an FDR controlling procedure at FDR significance level $\alpha_1$ ($\alpha_1$ is known and fixed) is applied to the p-values in Step 1, and statistically significant genes are identified. Let $A$ denote the collection of statistically significant genes. Define $d_1$ to be the smallest p-value in Step 1 which is not statistically significant, i.e., $d_1 = \min\{p_i : i \in A^c\}$, where $A^c$ is the complement of $A$. In Step 2, pairwise comparisons associated with the statistically significant genes (i.e., genes in set $A$) are tested using an FDR controlling procedure at FDR significance level $\alpha_2$ ($\alpha_2$ is known and fixed) and statistically significant effects are identified. Let $d_2$ be the smallest p-value for pairwise comparisons in Step 2 which is not statistically significant. Since the goal is to compute the overall FDR, this can be achieved by replacing $c_1$ and $c_2$ with $d_1$ and $d_2$, respectively, when using the method for estimating the FDR for fixed rejection regions (11). That is, assuming $d_1$ and $d_2$ are known,

$$\widehat{\mathrm{FDR}}(\alpha_1, \alpha_2) = \frac{\hat{P}(p_{ij} \le d_2 \mid D_i = 0, p_i \le d_1)}{\hat{P}(p_{ij} \le d_2 \mid p_i \le d_1)} \, \widehat{\mathrm{FDR}}_1 + \frac{\hat{P}(p_{ij} \le d_2 \mid D_{ij} = 0, D_i = 1, p_i \le d_1)}{\hat{P}(p_{ij} \le d_2 \mid p_i \le d_1)} \, \hat{\pi}_{02}. \tag{12}$$

It is worth noting that for this approach, $d_1$ is determined by the p-values in Step 1, $\alpha_1$, and the FDR controlling procedure applied in Step 1; and $d_2$ is determined by the p-values in both steps, $\alpha_1$, $\alpha_2$, and the FDR controlling procedures applied in Step 1 and Step 2, respectively.
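As an illustration of how $d_1$ or $d_2$ might be obtained, the sketch below uses the Benjamini and Hochberg step-up rule as the FDR controlling procedure (one possible choice; the section does not fix a particular procedure) and then evaluates the plug-in formula of equations (11) and (12) from already-estimated components. The function and argument names are hypothetical.

```python
def bh_cutoff(pvalues, alpha):
    """Apply BH at level alpha; return (rejected indices, d).

    d is the smallest p-value that is not declared significant, or None
    when every hypothesis is rejected.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda k: pvalues[k])
    n_rejected = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / n:
            n_rejected = rank
    rejected = sorted(order[:n_rejected])
    d = pvalues[order[n_rejected]] if n_rejected < n else None
    return rejected, d


def overall_fdr(p_rej_null_gene, p_rej, fdr1, p_rej_null_pair, pi02):
    """Plug-in overall FDR of equations (11) and (12).

    p_rej_null_gene : estimate of P(p_ij <= c2 | D_i = 0, p_i <= c1)
    p_rej           : estimate of P(p_ij <= c2 | p_i <= c1)
    fdr1            : estimated FDR of Step 1
    p_rej_null_pair : estimate of P(p_ij <= c2 | D_ij = 0, D_i = 1, p_i <= c1)
    pi02            : estimated proportion of true nulls in Step 2
    """
    return (p_rej_null_gene / p_rej) * fdr1 + (p_rej_null_pair / p_rej) * pi02
```

In use, `bh_cutoff` would be called once on the Step 1 p-values with $\alpha_1$ to obtain $A$ and $d_1$, and once on the pairwise p-values of the genes in $A$ with $\alpha_2$ to obtain $d_2$; the components of equation (12) are then estimated with $d_1$ and $d_2$ in place of $c_1$ and $c_2$.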
5 Controlling the FDR at a desired significance level

Instead of estimating the FDR for a fixed rejection region, traditional multiple comparison procedures (Hochberg and Tamhane, 1987; Hsu, 1996) reject the null hypotheses at a pre-chosen significance level. If the desired FDR significance level of the two-step multiple comparison procedure is $\alpha$, then the problem becomes choosing the FDR significance levels $\alpha_1$ and $\alpha_2$ in Step 1 and Step 2, respectively, so that the overall FDR is controlled at $\alpha$.

Table 1: Simulation results. Estimated FDR ($\widehat{\mathrm{FDR}}$) and true FDR of pairwise comparisons for 3 treatments and 1000 genes as applied to the two-step multiple comparison procedure using fixed rejection regions $[0, c_1]$ and $[0, c_2]$ in Steps 1 and 2, respectively. $R_1$: the proportion of genes having a treatment effect; $R_2$: the proportion of genes with a treatment effect having one treatment mean different and the other two the same.
Layout: one column per combination of $R_1$ and $R_2$; for each pair $(c_1, c_2)$, one row of estimated FDR and one row of true FDR. […]

5.1 An approximate upper bound for FDR

The resampling procedure required for estimating the FDR (equation 11) can be a disadvantage: when the experimental design is complicated, it may be difficult to generate data under the null hypothesis. Fortunately, an upper bound of

$$\frac{P(p_{ij} \le c_2 \mid D_i = 0,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}$$

is possible, so estimating $P(p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1)$ via simulation can be avoided.

Theorem 2. In the two-step multiple comparison procedure,

$$\frac{P(p_{ij} \le c_2 \mid D_i = 0,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} \le 1.$$

Proof.

$$\begin{aligned}
\frac{P(p_{ij} \le c_2 \mid D_i = 0,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}
&= \frac{P(p_{ij} \le c_2 \mid D_{ij} = 0,\ D_i = 0,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} \\
&= \frac{P(p_{ij} \le c_2,\ D_{ij} = 0,\ D_i = 0 \mid p_i \le c_1) \,/\, P(D_{ij} = 0,\ D_i = 0 \mid p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} \\
&= \frac{P(p_{ij} \le c_2,\ D_{ij} = 0,\ D_i = 0 \mid p_i \le c_1) \,/\, P(p_{ij} \le c_2 \mid p_i \le c_1)}{P(D_{ij} = 0,\ D_i = 0 \mid p_i \le c_1)} \\
&= \frac{P(D_{ij} = 0,\ D_i = 0 \mid p_i \le c_1,\ p_{ij} \le c_2)}{P(D_{ij} = 0,\ D_i = 0 \mid p_i \le c_1)} \le 1.
\end{aligned} \tag{13}$$

When $c_2 \le 1$, this inequality (equation 13) holds for two reasons. First, the probability of a false rejection in Step 1 (rejecting the null hypothesis $H_{0i}$ when it is true) depends only on the p-values $p_i$ and on $c_1$. Second, with a constraint in Step 2 ($p_{ij} \le c_2$ and $c_2 < 1$), the chance of making a false rejection (rejecting the null hypothesis $H_{0ij}$ when it is true) is smaller than for a procedure in which no constraint is applied.

When $P(p_i \le c_1 \mid D_i = 1,\ D_{ij} = 0 \text{ for some } j) = 1$, the FDR (equation 4) is

$$\mathrm{FDR} = \mathrm{FDR}_1\,\frac{P(p_{ij} \le c_2 \mid D_i = 0,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} + (1 - \mathrm{FDR}_1)\,\frac{c_2\, P(D_{ij} = 0 \mid D_i = 1,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}.$$

Define $\pi_{02}$, $\mathrm{FDR}_2$ and $\mathrm{pFDR}_2$ to be the proportion of true null hypotheses, the FDR and the pFDR in Step 2 based on the empirical distribution of the p-values. Then

$$\mathrm{FDR}_2 = \mathrm{pFDR}_2 = \frac{c_2\,\pi_{02}}{P(p_{ij} \le c_2 \mid p_i \le c_1)}.$$

Notice that

$$\frac{c_2\, P(D_{ij} = 0 \mid D_i = 1,\ p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} \le \frac{c_2\,\pi_{02} / (1 - \mathrm{FDR}_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} = \frac{\mathrm{FDR}_2}{1 - \mathrm{FDR}_1},$$

thus, combined with Theorem 2, an upper bound for the overall FDR (equation 4) is

$$\mathrm{FDR} \le \mathrm{FDR}_1 + \mathrm{FDR}_2. \tag{14}$$

Therefore, the overall FDR can be controlled below level $\alpha$ as long as the FDR significance levels $\alpha_1$ and $\alpha_2$ used in Step 1 and Step 2, respectively, satisfy $\alpha_1 + \alpha_2 \le \alpha$. However, when $P(p_i \le c_1 \mid D_i = 1,\ D_{ij} = 0 \text{ for some } j)$ is far less than 1, the realized FDR may exceed $\mathrm{FDR}_1 + \mathrm{FDR}_2$. One strategy is to put more of the weight of the overall FDR on $\mathrm{FDR}_1$, so that $P(p_i \le c_1 \mid D_i = 1,\ D_{ij} = 0 \text{ for some } j)$ is closer to 1 and, at the same time, more genes are included in the analysis in Step 2. Next, we investigate the performance of the two-step procedure with fixed FDR significance levels in Step 1 and Step 2, and propose a method to choose the FDR significance levels in the two steps so that the overall FDR can be controlled below a pre-chosen overall FDR significance level.

5.2 Fixing the FDR significance levels

A simulation study is employed to illustrate the improved power of the two-step multiple comparison procedure over the one-step procedure. The simulation scenario is the same as in Section 3.3. There are 3 treatment conditions, a sample size of $n = 6$ within each treatment condition, and $m = 1000$ genes. For each combination of $R_1$ and $R_2$, 1000 data sets are generated from standard normal distributions, and there are $3nm$ data points within each data set. The FDR controlling procedure is then applied to the corresponding 1000 genes at FDR significance level $\alpha_1$. In the second step, for the genes with significant treatment effects from Step 1, pairwise comparisons are performed with the FDR controlling procedure at FDR significance level $\alpha_2$. The FDR significance levels used in the first and second steps are $(\alpha_1, \alpha_2) = (0.04, 0.01)$ and $(0.03, 0.02)$, and the estimated FDR, the true FDR and the average power are listed in Tables 2 and 3. Here, the average power is defined to be the expected proportion of correct rejections among the true alternative hypotheses.

For the purpose of comparing the results with the one-step FDR controlling procedure, the estimated FDR, the true FDR, and the average power for the one-step procedure are also listed in Table 2. For the one-step procedure, an FDR controlling procedure is applied to the family of $3m$ pairwise comparisons. Specifically, Benjamini and Hochberg's adaptive FDR controlling procedure (Benjamini and Hochberg, 2000), incorporating the estimate of the proportion of null hypotheses given by Storey and Tibshirani's smoother estimate (Storey and Tibshirani, 2003), is employed. When the proportion of genes having a treatment effect ($R_1$) is small, the two-step multiple comparison procedure is more powerful than the one-step multiple comparison procedure because of the reduced number of tests in Step 2. For example, in this simulation, when $R_1 = 0.2$ and $R_2 = 0.2$, the one-step procedure has 80% power, while the two-step procedure has approximately 96% power.
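The "true FDR" and "average power" reported in such a simulation can be computed per data set from the known truth and then averaged over data sets. The following is a minimal sketch of that bookkeeping, assuming rejection decisions and true hypothesis status are available as boolean lists; the function name `fdp_and_power` and the toy inputs are hypothetical and not taken from the paper.

```python
def fdp_and_power(rejected, is_alternative):
    """Realized false discovery proportion and power for one data set.

    rejected[k]       -- True if hypothesis k was rejected
    is_alternative[k] -- True if hypothesis k is a true alternative
    """
    n_rejected = sum(rejected)
    false_rejections = sum(
        1 for r, a in zip(rejected, is_alternative) if r and not a
    )
    true_rejections = n_rejected - false_rejections
    n_alternative = sum(is_alternative)
    # By convention the false discovery proportion is 0 when nothing is rejected.
    fdp = false_rejections / n_rejected if n_rejected > 0 else 0.0
    # Power is undefined when there are no true alternatives.
    power = true_rejections / n_alternative if n_alternative > 0 else float("nan")
    return fdp, power


# Toy example: three rejections, one of them false; three true alternatives.
fdp, power = fdp_and_power(
    rejected=[True, True, True, False, False],
    is_alternative=[True, True, False, True, False],
)
print(fdp, power)
```

Averaging `fdp` over simulated data sets approximates the true FDR, and averaging `power` approximates the average power as defined above.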
As observed from the simulations, when $R_2$ (the proportion of significant genes for which one treatment effect is different but the other two are the same) increases, the power of the two-step procedure decreases. This is due to the fact that when $R_2$ increases, fewer genes are included in Step 2. In this simulation, the power for $\alpha_1 = 0.04$, $\alpha_2 = 0.01$ is slightly larger than that for $\alpha_1 = 0.03$, $\alpha_2 = 0.02$ when $R_1$ is small. Furthermore, when $\alpha_1 = 0.04$, more genes are included in Step 2. Simulations have been performed for values of the FDR level $\alpha$ ranging over $0.01, 0.02, \ldots, 0.2$. The FDR controlling procedure incorporating the estimate of the proportion of true null hypotheses is applied in both steps of the two-step procedure, and the Step 1 and Step 2 FDR levels are set to $\alpha_1 = \frac{4}{5}\alpha$ and $\alpha_2 = \frac{1}{5}\alpha$. These simulations (Figure 1) demonstrate that the overall FDR is controlled at FDR level $\alpha$ for all values of $\alpha$. Based on this work and experience, our ad hoc suggestion is to use $\alpha_1 = \frac{4}{5}\alpha$ and $\alpha_2 = \frac{1}{5}\alpha$ if the overall FDR is required to be controlled at FDR level $\alpha$.
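The two-step procedure with the suggested split of the overall level can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it takes already-computed p-values as input, uses the plain Benjamini-Hochberg step-up procedure in both steps (not the adaptive version used in the simulations), and the names `bh_reject` and `two_step` are hypothetical.

```python
def bh_reject(pvalues, alpha):
    """Plain Benjamini-Hochberg step-up; returns sorted rejected indices."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])


def two_step(gene_p, pair_p, alpha1, alpha2):
    """Two-step multiple comparison on precomputed p-values.

    gene_p[i]    -- Step 1 p-value for equality of all treatment means, gene i
    pair_p[i][j] -- p-value of pairwise comparison j for gene i
    Returns (selected_genes, significant_pairs), the latter as (gene, pair)
    index tuples.
    """
    # Step 1: screen genes at level alpha1.
    selected = bh_reject(gene_p, alpha1)
    # Step 2: pool pairwise comparisons of the selected genes only.
    labels = [(i, j) for i in selected for j in range(len(pair_p[i]))]
    pooled = [pair_p[i][j] for i, j in labels]
    hits = bh_reject(pooled, alpha2)
    return selected, [labels[k] for k in hits]


# Hypothetical p-values for four genes with three pairwise comparisons each.
alpha = 0.05
gene_p = [0.0001, 0.5, 0.003, 0.9]
pair_p = [
    [0.0002, 0.0004, 0.6],
    [0.4, 0.5, 0.6],
    [0.001, 0.7, 0.8],
    [0.9, 0.9, 0.9],
]
genes, pairs = two_step(gene_p, pair_p, alpha1=0.8 * alpha, alpha2=0.2 * alpha)
print(genes, pairs)
```

Because only the pairwise comparisons of screened genes enter Step 2, the Step 2 family is smaller than the full family of $3m$ comparisons used by a one-step procedure.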

Table 2: Simulation results. Estimated FDR ($\widehat{\mathrm{FDR}}$), true FDR, and average power for pairwise comparisons for 3 treatment conditions and 1000 genes using both the two-step and one-step procedures. For the two-step procedure, the FDR significance levels $\alpha_1 = 0.04$ and $\alpha_2 = 0.01$ are used in Step 1 and Step 2, respectively. For the one-step procedure, the FDR significance level is […].
Layout: one column per combination of $R_1$ and $R_2$; rows for the two-step procedure ($\widehat{\mathrm{FDR}}$, true FDR, power) and for the one-step procedure (true FDR, power). […]

Table 3: Simulation results. Estimated FDR ($\widehat{\mathrm{FDR}}$), true FDR, and average power for pairwise comparisons for 3 treatment conditions and 1000 genes using the two-step procedure at the FDR significance levels $\alpha_1 = 0.03$ and $\alpha_2 = 0.02$ in Step 1 and Step 2, respectively.
Layout: one column per combination of $R_1$ and $R_2$; rows for $\widehat{\mathrm{FDR}}$, true FDR, and power. […]

Figure 1: Simulation results of the FDR for the two-step multiple comparison procedure using $\alpha_1 = \frac{4}{5}\alpha$ and $\alpha_2 = \frac{1}{5}\alpha$ for different levels of $\alpha$ (horizontal axis: $\alpha$; vertical axis: true FDR). In total there are $m = 1000$ genes, 3 treatment conditions, and four different combinations of $R_1$ and $R_2$: $R_1 = 0.2$, $R_2 = 0.2$ (short dashed line); $R_1 = 0.2$, $R_2 = 0.8$ (dotted line); $R_1 = 0.4$, $R_2 = 0.2$ (dotted-dashed line); and $R_1 = 0.4$, $R_2 = 0.8$ (long dashed line). The black straight line represents the pre-chosen FDR level. Here $R_1$ is the proportion of genes having a treatment effect; $R_2$ is the proportion of genes with a treatment effect having one treatment mean different but the other two the same.

5.3 Choosing the FDR significance levels

Here we propose an adaptive approach for choosing $\alpha_1$ and $\alpha_2$, and suggest some guidelines for selecting them. First, $\alpha_1$ should be larger than $\alpha_2$. When a looser criterion is used in Step 1, more genes are available to enter the second step. Second, $\alpha_1$ and $\alpha_2$ should be chosen such that the overall FDR is close to but below the pre-specified significance level, so that the power for detecting a significant effect is maximized. Third, the choice of $\alpha_1$ and $\alpha_2$ should lead to the largest number of rejections occurring in Step 2.

With these guidelines in mind, we propose the following procedure for finding the significance levels $\alpha_1$ and $\alpha_2$. Let $S$ be the set of values $i\alpha/n$, where $i = 1, \ldots, n - 1$ and $n$ is a positive integer. That is, $S = \{\alpha/n, 2\alpha/n, \ldots, (n-1)\alpha/n\}$. Let $\widehat{\mathrm{FDR}}(\alpha_1, \alpha_2)$ be the estimated overall FDR and $R(\alpha_1, \alpha_2)$ the number of rejections (i.e., statistically significant pairwise comparisons) in Step 2 when a two-step procedure with significance levels $\alpha_1$ and $\alpha_2$ in Step 1 and Step 2, respectively, is applied. Then $\alpha_1^*$ and $\alpha_2^*$ are chosen such that

$$(\alpha_1^*, \alpha_2^*) = \arg\max_{\substack{\alpha_1, \alpha_2 \in S,\ \alpha_1 > \alpha_2, \\ \alpha_1 + \alpha_2 \le \alpha,\ \widehat{\mathrm{FDR}}(\alpha_1, \alpha_2) \le \alpha}} R(\alpha_1, \alpha_2). \tag{15}$$

Using the same simulation as in Section 3.3, for each of the 1000 data sets we apply our guidelines to find $\alpha_1^*$ and $\alpha_2^*$. Suppose the overall FDR significance level is $\alpha = 0.05$ and $S = \{\alpha/5, 2\alpha/5, 3\alpha/5, 4\alpha/5\}$; then $\alpha_1^*$ and $\alpha_2^*$ can be chosen from $(\alpha_1, \alpha_2) = (0.02, 0.01)$, $(0.03, 0.01)$, $(0.03, 0.02)$, and $(0.04, 0.01)$. Table 4 gives the frequency distribution of $\alpha_1^*$ and $\alpha_2^*$ based on these 1000 simulations. As can be seen, when $R_1 = 0.20$ and $R_2 = 0.60$, the choice of $(\alpha_1^*, \alpha_2^*)$ is $(0.03, 0.01)$ for 12 simulated data sets, $(0.03, 0.02)$ for 877 simulated data sets, and $(0.04, 0.01)$ for 111 simulated data sets.
The chosen significance levels in the two-step method are more diverse when $R_1$ is small, and they converge to $(\alpha_1^*, \alpha_2^*) = (0.03, 0.02)$ as $R_1$ gets larger. Evidently, the case where $R_2 = 0.0$ (genes which have a treatment effect where all means are different from each other) yields random results. This is most likely due to the fact that almost all pairwise comparisons in Step 2 are significant. Given the choices of $\alpha_1^*$ and $\alpha_2^*$ (Table 4), the average FDR is controlled below $\alpha = 0.05$ (Table 5), and the two-step procedure has more power than the one-step procedure (Table 2). For these results, $\alpha_1^*$ and $\alpha_2^*$ take values from $S = \{\alpha/5, 2\alpha/5, 3\alpha/5, 4\alpha/5\}$. However, for more accurate results, we suggest $S = \{\alpha/20, 2\alpha/20, \ldots, 19\alpha/20\}$.
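The grid search of equation (15) can be sketched as below. The sketch assumes two user-supplied callables, `count_rejections(a1, a2)` and `estimate_fdr(a1, a2)`, that run the two-step procedure on the data at hand and return the number of Step 2 rejections and the estimated overall FDR; both names, and the toy stand-ins in the usage example, are hypothetical. The constraints $\alpha_1 > \alpha_2$ and $\alpha_1 + \alpha_2 \le \alpha$ are checked on the integer grid indices to avoid floating-point comparisons.

```python
def candidate_index_pairs(n):
    """Index pairs (i, j) with alpha1 = i*alpha/n > alpha2 = j*alpha/n
    and alpha1 + alpha2 <= alpha, i.e. j < i and i + j <= n."""
    return [(i, j) for i in range(1, n) for j in range(1, i) if i + j <= n]


def choose_levels(alpha, n, count_rejections, estimate_fdr):
    """Grid search in the spirit of equation (15).

    Returns the pair (alpha1, alpha2) maximizing the number of Step 2
    rejections among candidates whose estimated overall FDR is at most
    alpha, or None if no candidate qualifies. Ties keep the first candidate.
    """
    best = None
    for i, j in candidate_index_pairs(n):
        a1, a2 = i * alpha / n, j * alpha / n
        if estimate_fdr(a1, a2) > alpha:
            continue  # estimated overall FDR too large
        r = count_rejections(a1, a2)
        if best is None or r > best[0]:
            best = (r, a1, a2)
    return None if best is None else (best[1], best[2])


# With n = 5 the candidates are the four pairs listed in the text.
print(candidate_index_pairs(5))

# Toy stand-ins (NOT from the paper) for the two data-dependent callables.
toy_rejections = lambda a1, a2: a1 + 2 * a2
toy_fdr = lambda a1, a2: 0.9 * (a1 + a2)
print(choose_levels(0.05, 5, toy_rejections, toy_fdr))
```

A finer grid such as $n = 20$ only changes the argument `n`; the number of candidate pairs grows roughly quadratically in $n$, so each additional grid point costs further runs of the two-step procedure.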

Table 4: Frequency distribution of $\alpha_1^*$ and $\alpha_2^*$ from 1000 simulations for pairwise comparisons for 3 treatment conditions and 1000 genes. Here $\alpha_1^*$ and $\alpha_2^*$ are determined using the stated guidelines, and by controlling the overall FDR for the two-step procedure below $\alpha =$ […].
Layout: one row per combination of $R_1$ and $R_2$; one column per candidate pair $(\alpha_1^*, \alpha_2^*)$. […]

Table 5: Simulation results. Estimated FDR ($\widehat{\mathrm{FDR}}$), true FDR, and power for pairwise comparisons for 3 treatment conditions and 1000 genes using the two-step procedure. The FDR for the entire procedure is controlled below 0.05 with significance levels $\alpha_1$ and $\alpha_2$ chosen automatically (results are listed in Table 4).
Layout: one column per combination of $R_1$ and $R_2$; rows for $\widehat{\mathrm{FDR}}$, true FDR, and power. […]
