Statistical Applications in Genetics and Molecular Biology

Volume 5, Issue 1, 2006, Article 28
http://www.bepress.com/sagmb/vol5/iss1/art28

A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments

Hongmei Jiang, Northwestern University, hongmei@northwestern.edu
Rebecca W. Doerge, Purdue University, doerge@purdue.edu

Copyright © 2006 The Berkeley Electronic Press. All rights reserved.

Abstract

For situations where the number of tested hypotheses is increasingly large, the power to detect statistically significant multiple treatment effects decreases. As is the case with microarray technology, researchers are often interested in identifying differentially expressed genes for more than two types of cells or treatments. A two-step procedure is proposed for the purpose of increasing power to detect significant effects (i.e., to identify differentially expressed genes). Specifically, in the first step, the null hypothesis of equality across the mean expression levels for all treatments is tested for each gene. In the second step, only pairwise comparisons corresponding to the genes for which the treatment means are statistically different in the first step are tested. We propose an approach to estimate the overall FDR for both fixed rejection regions and fixed FDR significance levels. Also proposed is a procedure to find the FDR significance levels used in the first step and the second step such that the overall FDR can be controlled below a pre-specified FDR significance level. When compared via simulation, the two-step approach has increased power over a one-step procedure and controls the FDR at a desired significance level.

KEYWORDS: false discovery rate, multiple comparisons, multiple tests, testing differential expression

Acknowledgments: We are very grateful to two reviewers and the Associate Editor for their helpful comments and suggestions.

1 Introduction

Advances in many areas of technology (e.g., communication, health care, and biotechnology) are giving rise to vast experiments that provide data for a very large number of repeated tests. These situations require a multiple comparison correction that not only accommodates the number of tests being conducted, but also controls the rate of false positives at a desired level. While this problem presents itself in a variety of applications, the one that motivated this work is microarray technology, a powerful tool that is widely applicable to almost every area of science (e.g., basic science, agriculture, and medical research). Microarrays provide a systematic way to study transcript variation for thousands of genes simultaneously. The key question addressed by most microarray experiments is which genes are differentially expressed between a pair of conditions (i.e., control and treatment). Numerous approaches that range from traditional statistical analyses to new statistical models have been proposed for testing differential gene expression (Schena et al., 1996; Baldi and Long, 2001; Efron, 2003; Newton et al., 2001; Gottardo et al., 2003; Tusher et al., 2001; Kerr et al., 2000; Wolfinger et al., 2001) between pairs of conditions. Since the traditional familywise error rate (FWER) multiple comparison procedures, such as Bonferroni's procedure, are too conservative, false discovery rate (FDR) controlling procedures (Benjamini and Hochberg, 1995) have been widely used in microarray studies. Benjamini and Hochberg (2000) propose an adaptive procedure that has increased power over the original procedure by incorporating an estimate of the proportion of true null hypotheses. A variety of methods have been proposed to estimate the proportion of true null hypotheses for multiple testing problems, such as Storey's bootstrap method (Storey, 2002), Storey and Tibshirani's smoother estimate (Storey and Tibshirani, 2003), and the method of Langaas et al., which is based on nonparametric maximum likelihood estimation of the p-value density under the restriction of decreasing and convex decreasing densities (Langaas et al., 2005).

Although testing for differential expression of a gene between pairs of conditions or treatments is informative, in a microarray study it is quite common for researchers to be interested in comparing more than two treatment conditions for thousands of genes in the experiment. For instance, Hedenfalk et al. (2001) studied gene expression changes among breast cancers due to mutations in either the gene BRCA1 or the gene BRCA2 and sporadic tumor (i.e., three conditions) using 5,361 genes. With a large number ($m$) of genes, the number of pairwise comparisons is typically very large ($3m$ for 3 treatments, $6m$ for 4 treatments, etc.). Therefore, when the goal is to identify statistically differentially expressed genes between each pair of conditions, in the typical one-step multiple comparison procedure the $C \times m$ hypothesis tests ($C$ is the number of pairwise comparisons for each gene) are treated as a family, and a false discovery rate (FDR) controlling procedure such as Benjamini and Hochberg's procedure (Benjamini and Hochberg, 1995) is applied at a significance level $\alpha$. In situations where the majority of genes are not differentially expressed across the treatments, applying the FDR controlling procedure to a large family of multiple comparisons may not be most powerful, simply because when the number of hypotheses increases, the power of detecting differentially expressed genes decreases.

Lu et al. (2005) explored this issue and proposed a two-step strategy. In the first step, a subset of genes that are potentially differentially expressed among the treatments is identified with a loose criterion. In the second step, these potential genes are combined for detecting differentially expressed genes with a more stringent criterion. It is expected that the smaller number of genes in the second step will give rise to a more powerful test. In both steps of the procedure Lu et al. (2005) employ a Bonferroni adjustment to address the multiple comparison problem. Lu et al. (2005) point out that Benjamini and Hochberg's FDR controlling procedure (Benjamini and Hochberg, 1995) can be used in both steps, but do not address the familywise error rate (FWER) or the FDR for the entire procedure. Specifically, suppose the FDR significance levels used in the two steps are 0.05 and 0.01, respectively. The FDR for the whole procedure must be taken into account, and not only the individual FDRs at each step, since the false rejections in the first step will affect the results of the second step.
Using this as our motivation, a two-step multiple comparison procedure is proposed for testing pairwise comparisons of more than two treatments for a large number of genes such that the power to detect differentially expressed genes, while controlling the FDR at a pre-chosen significance level, will be higher than that of a one-step procedure. Although Lu et al. (2005) used a mixed model approach for their two-step procedure, our proposed two-step procedure is not limited by the specifics of the model. Specifically, in the first step, the null hypothesis of equality across the mean expression levels for all treatments is tested for each gene. In the second step, only pairwise comparisons corresponding to the genes for which the treatment means are statistically different in the first step are tested. The two-step procedure can be applied in practice in three different ways:

1. The rejection regions in the first and second step can both be fixed. That is, equality tests of expression levels for the genes in the first step with corresponding p-values less than or equal to $c_1$ are considered statistically significant, and pairwise comparisons in the second step with p-values less than or equal to $c_2$ are statistically significant, where $c_1$ and $c_2$ are fixed and known. Although it is typical to use the term rejection region in conjunction with the term test statistic(s), here we use the term rejection region in conjunction with the term p-value(s) for ease of explanation;
2. One can apply an FDR controlling procedure at significance level $\alpha_1$ in the first step, and an FDR controlling procedure at significance level $\alpha_2$ in the second step, where $\alpha_1$ and $\alpha_2$ are fixed and known;
3. One can pre-specify an overall FDR level $\alpha$ and control the overall FDR below $\alpha$.

In this work we propose an approach to estimate the overall FDR for both fixed rejection regions (situation 1) and fixed FDR significance levels (situation 2). We also propose a procedure to find the FDR significance levels used in the first step and the second step such that the overall FDR can be controlled below a pre-specified FDR significance level. Using simulated data we demonstrate that our proposed two-step procedure has increased power over a one-step procedure and controls the FDR for the entire procedure at a desired significance level.

2 A two-step multiple comparison procedure

A novel two-step multiple comparison procedure is proposed in the context of testing for differential expression. Initially, we present it generally with no specific FDR controlling procedure specified:

Step 1. The null hypothesis that a gene is not differentially expressed across all treatment conditions is tested for each gene (e.g., the global F-test from an ANOVA model). For the family of $m$ tests corresponding to the $m$ genes, an FDR controlling procedure is applied to control the FDR at level $\alpha_1$. Suppose there are $K$ tests that are significant. Let $A$ denote the collection of the genes which have statistically significant treatment effects. If $K = 0$, the procedure is stopped and it is concluded that no pairwise comparisons are significant and that there are no differentially expressed genes; otherwise, go to Step 2.

Step 2. (a) For genes not belonging to $A$, conclude that pairwise comparisons among the treatments for these genes are not significant. (b) For genes belonging to $A$, perform the $C$ pairwise comparisons for each gene. Since there are $K$ genes, in total there are $C \times K$ pairwise comparisons. Apply an FDR controlling procedure for this family of $C \times K$ tests at level $\alpha_2$.

The steps above state the two-step procedure in terms of the FDR significance levels $\alpha_1$ and $\alpha_2$; it can also be carried out with fixed rejection regions in a similar way:

Step 1. The null hypothesis that a gene is not differentially expressed across all treatment conditions is tested for each gene (e.g., the global F-test from an ANOVA model). Tests with p-values $\le c_1$ are considered statistically significant. Suppose there are $K$ tests that are significant. Let $A$ denote the collection of the genes which have statistically significant treatment effects. If $K = 0$, the procedure is stopped and it is concluded that no pairwise comparisons are significant and that there are no differentially expressed genes; otherwise, go to Step 2.

Step 2. (a) For genes not belonging to $A$, conclude that pairwise comparisons among the treatments for these genes are not significant. (b) For genes belonging to $A$, perform the $C$ pairwise comparisons for each gene. Since there are $K$ genes, in total there are $C \times K$ pairwise comparisons. Pairwise comparisons with p-values $\le c_2$ are considered statistically significant.

We assume that if a gene does not have a significant treatment effect (tested in Step 1), then none of the pairwise comparisons among the treatments corresponding to that gene is significant. Only genes with a statistically significant treatment effect enter the second step to be tested for pairwise comparisons. However, if a gene has a significant treatment effect (Step 1), some or all of its pairwise comparisons may still not be significant. For the fixed FDR significance levels $\alpha_1$ and $\alpha_2$, or the fixed rejection regions $[0, c_1]$ and $[0, c_2]$ in Step 1 and Step 2 respectively, the overall FDR still has to be determined. Choosing the significance level $\alpha_1$ in Step 1 and $\alpha_2$ in Step 2 so that the FDR for the entire two-step procedure is controlled at a desired significance level $\alpha$ is an additional issue of interest.
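The two variants of the procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `bh_reject` and `two_step` are hypothetical, the gene-level and pairwise p-values are assumed to have been computed already, and the Benjamini and Hochberg (1995) step-up rule is used as one possible FDR controlling procedure.

```python
import numpy as np

def bh_reject(p, alpha):
    """Benjamini-Hochberg step-up rule: boolean mask of rejected hypotheses."""
    p = np.asarray(p, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    if m == 0:
        return reject
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    if below.any():
        k = np.nonzero(below)[0].max()   # largest rank meeting its threshold
        reject[order[: k + 1]] = True    # reject everything up to that rank
    return reject

def two_step(p_gene, p_pair, level1, level2, fixed_regions=True):
    """p_gene: (m,) Step 1 p-values; p_pair: (m, C) pairwise p-values.

    With fixed_regions=True the levels are the cut-offs c1 and c2;
    otherwise they are the FDR levels alpha1 and alpha2 (assumed BH).
    Returns the Step 1 mask A and the (m, C) mask of significant pairs.
    """
    p_gene = np.asarray(p_gene, dtype=float)
    p_pair = np.asarray(p_pair, dtype=float)
    # Step 1: select the genes with a significant treatment effect.
    A = (p_gene <= level1) if fixed_regions else bh_reject(p_gene, level1)
    significant = np.zeros(p_pair.shape, dtype=bool)
    if not A.any():                      # K = 0: stop, nothing is significant
        return A, significant
    # Step 2: only the C*K comparisons of the selected genes are tested.
    sub = p_pair[A]
    if fixed_regions:
        significant[A] = sub <= level2
    else:
        significant[A] = bh_reject(sub.ravel(), level2).reshape(sub.shape)
    return A, significant

# Toy input (made-up numbers): 3 genes, C = 3 pairwise comparisons each.
p_gene = np.array([0.001, 0.2, 0.03])
p_pair = np.array([[0.001, 0.5, 0.02],
                   [0.0001, 0.0001, 0.0001],
                   [0.04, 0.9, 0.8]])
A_fixed, sig_fixed = two_step(p_gene, p_pair, 0.05, 0.03, fixed_regions=True)
A_bh, sig_bh = two_step(p_gene, p_pair, 0.05, 0.05, fixed_regions=False)
```

Note that the second gene in the toy input has very small pairwise p-values but is never tested in Step 2, because it fails Step 1; this is the sense in which the decision depends on both steps.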
To address these issues the two-step multiple comparison procedure is investigated further to gain an appreciation of the overall FDR relative to the FDR in each step of the procedure.

3 Estimating FDR for fixed rejection regions

3.1 Derivation of the FDR

Assume the two-step procedure with fixed rejection regions is used. That is, assume that genes with p-values $\le c_1$ have a significant treatment effect (i.e., at least one treatment mean is different from the others) in Step 1, and that the pairwise comparisons with p-values $\le c_2$ are identified as statistically significant in Step 2, where $c_1$ and $c_2$ are known. Our goal is to compute the overall FDR for the two-step multiple comparison procedure. The approach is similar to Storey's positive false discovery rate (pFDR) procedure (Storey, 2002, 2003b), where one estimates the FDR for a given rejection region.

Let $H_0^{i}$ denote the null hypothesis of no treatment effect for the $i$th gene and let $H_0^{ij}$ denote the null hypothesis that the $j$th pair of treatment means are not different for the $i$th gene. For instance, if three treatments are of interest, $j = 1, 2, 3$; if four treatments are of interest, $j = 1, 2, \ldots, 6$. Let $D_i = 0$ indicate that there is no treatment effect for the $i$th gene, and let $D_i = 1$ indicate a treatment effect for the $i$th gene. Furthermore, let $D_{ij} = 0$ indicate that the means of the $j$th pair of treatments for gene $i$ are the same, and $D_{ij} = 1$ that they are different. If $D_i = 0$, then $D_{ij} = 0$ for all $j$. Finally, let $p_i$ denote the p-value for testing the null hypothesis $H_0^{i}$ in Step 1, and $p_{ij}$ the p-value for testing the null hypothesis $H_0^{ij}$ in Step 2.

Our two-step multiple comparison approach differs from the one-step multiple comparison procedure, where the decision to reject depends only on $p_{ij}$, since the decision whether or not to reject $H_0^{ij}$ in the two-step procedure depends on both $p_i$ and $p_{ij}$. Essentially, the two-step multiple comparison procedure has two criteria. The null hypothesis $H_0^{ij}$ is rejected if and only if both conditions $p_i \le c_1$ and $p_{ij} \le c_2$ are satisfied. Obviously, the two-step procedure is exactly the one-step procedure when $c_1 = 1$. In fact, if $c_1$ is large enough that the two events, $\{p_{ij} \le c_2\}$ for some $j$ and $\{p_i \le c_1\}$, occur simultaneously for every gene $i$, then the two-step procedure will produce the same results as the one-step procedure.

Theorem 1. In a two-step multiple comparison procedure, suppose that objects/genes with p-values $\le c_1$ are considered as having a significant treatment effect (i.e., at least one treatment mean is different from the others) in Step 1, and that the pairwise comparisons with p-values $\le c_2$ are identified as statistically significant in Step 2. Assume $c_1$ and $c_2$ are known, and the objects/genes are independent. The pFDR of this two-step multiple comparison procedure is

$$
\mathrm{pFDR} = \mathrm{pFDR}_1 \, \frac{P(p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}
+ (1 - \mathrm{pFDR}_1) \, \frac{P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1)\, P(D_{ij} = 0 \mid D_i = 1,\, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}, \tag{1}
$$

where $\mathrm{pFDR}_1 = P(D_i = 0 \mid p_i \le c_1)$, which is the pFDR in Step 1.

Proof. Since the goal of the two-step multiple comparison procedure is to identify statistically significant pairwise comparisons, only the rejections in Step 2 are of interest. Assume the objects/genes are independent. Using the Bayesian interpretation of the pFDR (Storey, 2003b), the pFDR for the whole procedure is the probability of a false rejection of a pairwise comparison given that it is in the rejection region (i.e., the probability that $D_{ij} = 0$ given that $p_i \le c_1$ and $p_{ij} \le c_2$),

$$
\mathrm{pFDR} = P(D_{ij} = 0 \mid p_i \le c_1,\, p_{ij} \le c_2) = \frac{P(D_{ij} = 0,\, p_{ij} \le c_2 \mid p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}. \tag{2}
$$

To compute the numerator of equation (2), genes falsely rejected in the first step are treated separately from rejected genes that in fact have different treatment effects:

$$
\begin{aligned}
P(D_{ij} = 0,\, p_{ij} \le c_2 \mid p_i \le c_1)
&= P(D_{ij} = 0,\, p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1)\, P(D_i = 0 \mid p_i \le c_1) \\
&\quad + P(D_{ij} = 0,\, p_{ij} \le c_2 \mid D_i = 1,\, p_i \le c_1)\, P(D_i = 1 \mid p_i \le c_1) \\
&= P(D_{ij} = 0,\, p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1)\, \mathrm{pFDR}_1 \\
&\quad + P(D_{ij} = 0,\, p_{ij} \le c_2 \mid D_i = 1,\, p_i \le c_1)\, (1 - \mathrm{pFDR}_1) \\
&= P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 0,\, p_i \le c_1)\, P(D_{ij} = 0 \mid D_i = 0,\, p_i \le c_1)\, \mathrm{pFDR}_1 \\
&\quad + P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1)\, (1 - \mathrm{pFDR}_1)\, P(D_{ij} = 0 \mid D_i = 1,\, p_i \le c_1).
\end{aligned}
$$

Assume that no pairwise comparison for a gene is truly different if that gene does not have a treatment effect (i.e., $D_i = 0$ implies $D_{ij} = 0$); then

$$
P(D_{ij} = 0 \mid D_i = 0,\, p_i \le c_1) = P(D_{ij} = 0 \mid D_i = 0) = 1.
$$

Then,

$$
\begin{aligned}
P(D_{ij} = 0,\, p_{ij} \le c_2 \mid p_i \le c_1)
&= P(p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1)\, \mathrm{pFDR}_1 \\
&\quad + (1 - \mathrm{pFDR}_1)\, P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1)\, P(D_{ij} = 0 \mid D_i = 1,\, p_i \le c_1).
\end{aligned} \tag{3}
$$

Combining equation (3) with equation (2) gives the pFDR formulation in equation (1).
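Because the proof rests only on the law of total probability and on $D_i = 0$ implying $D_{ij} = 0$, the decomposition in equation (1) also holds exactly for empirical proportions in any labelled data set. The sketch below checks this numerically. Everything about the toy data is an assumption made for illustration (the mixing proportions, the Beta-distributed alternative p-values, the independence of $p_i$ and $p_{ij}$), and the function name `two_step_pfdr` is hypothetical.

```python
import numpy as np

def two_step_pfdr(pfdr1, a, b, c, d):
    """Equation (1) with
    a = P(p_ij <= c2 | D_i = 0, p_i <= c1),
    b = P(p_ij <= c2 | D_ij = 0, D_i = 1, p_i <= c1),
    c = P(D_ij = 0 | D_i = 1, p_i <= c1),
    d = P(p_ij <= c2 | p_i <= c1)."""
    return (pfdr1 * a + (1.0 - pfdr1) * b * c) / d

rng = np.random.default_rng(0)
m, C, c1, c2 = 5000, 3, 0.05, 0.01

# Truth labels: D_i = 0 forces D_ij = 0 for every pair j (toy proportions).
D_gene = rng.random(m) < 0.2
D_pair = (rng.random((m, C)) < 0.6) & D_gene[:, None]

# Toy p-values: uniform under the null, concentrated near 0 otherwise.
p_gene = np.where(D_gene, rng.beta(0.1, 1.0, m), rng.random(m))
p_pair = np.where(D_pair, rng.beta(0.1, 1.0, (m, C)), rng.random((m, C)))

S = np.broadcast_to((p_gene <= c1)[:, None], (m, C))  # pair's gene passed Step 1
G = np.broadcast_to(D_gene[:, None], (m, C))          # pair's gene has an effect
R = S & (p_pair <= c2)                                # pair rejected in Step 2

direct = np.mean(~D_pair[R])                 # P(D_ij = 0 | p_i <= c1, p_ij <= c2)
pfdr1 = np.mean(~G[S])                       # P(D_i = 0 | p_i <= c1)
a = np.mean(p_pair[S & ~G] <= c2)
b = np.mean(p_pair[S & G & ~D_pair] <= c2)
c = np.mean(~D_pair[S & G])
d = np.mean(p_pair[S] <= c2)
formula = two_step_pfdr(pfdr1, a, b, c, d)

print(direct, formula)  # the two numbers agree up to floating-point error
```

The agreement is an algebraic identity, so it says nothing about how well the individual components can be estimated when the labels are unknown; that is the subject of the estimation steps that follow.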

3.2 Estimation of the FDR

With respect to microarray studies, the probability of having at least one rejection, $P(R > 0)$, is almost 1, making the FDR and the pFDR essentially the same (Storey et al., 2004; Black, 2004). Therefore, the pFDR can be replaced with the FDR in equation (1), and the FDR for a two-step multiple comparison procedure is

$$
\mathrm{FDR} = \mathrm{FDR}_1 \, \frac{P(p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}
+ (1 - \mathrm{FDR}_1) \, \frac{P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1)\, P(D_{ij} = 0 \mid D_i = 1,\, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}. \tag{4}
$$

To estimate the FDR of the two-step multiple comparison procedure with fixed rejection regions, the five components of equation (4) have to be estimated:

(1) $P(p_{ij} \le c_2 \mid p_i \le c_1)$ can be estimated using the proportion of rejections among the pairwise comparisons performed in Step 2. That is,

$$
\hat{P}(p_{ij} \le c_2 \mid p_i \le c_1) = \frac{\#\{p_{ij} : p_{ij} \le c_2,\, p_i \le c_1\}}{\#\{p_i : p_i \le c_1\} \cdot C}, \tag{5}
$$

where $C$ is the number of pairwise comparisons for each gene, $\#\{p_i : p_i \le c_1\}$ is the number of statistically significant genes (i.e., with p-values $\le c_1$) in Step 1, and $\#\{p_{ij} : p_{ij} \le c_2,\, p_i \le c_1\}$ is the number of significant pairwise comparisons (i.e., with p-values $\le c_2$) in Step 2.

(2) The FDR in Step 1, $\mathrm{FDR}_1$, can be estimated using the approach of Storey (2002):

$$
\widehat{\mathrm{FDR}}_1 = \frac{c_1 \, \hat{\pi}_{01}}{\#\{p_i : p_i \le c_1\} / m}, \tag{6}
$$

where $m$ is the total number of genes, $\#\{p_i : p_i \le c_1\}$ is the number of p-values $\le c_1$ in Step 1, and $\hat{\pi}_{01}$ is the estimate of $\pi_{01}$, the proportion of true null hypotheses in Step 1 (i.e., the proportion of genes which in fact have no treatment effect among all $m$ genes). Details about estimating the proportion of true null hypotheses are not covered here; references are given in Section 1.
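Estimators (5) and (6) are simple counting formulas, and a minimal sketch may help to fix the notation. The function names below are hypothetical. The text leaves the choice of estimator for $\pi_{01}$ open; here the fixed-threshold form of Storey's (2002) estimator with $\lambda = 0.5$ is used purely as one assumed option.

```python
import numpy as np

def storey_pi0(p, lam=0.5):
    """Assumed choice: Storey's (2002) estimate #{p > lam} / ((1 - lam) m), capped at 1."""
    p = np.asarray(p, dtype=float)
    return min(1.0, float(np.mean(p > lam)) / (1.0 - lam))

def estimate_fdr1(p_gene, c1, pi0_hat):
    """Equation (6): c1 * pi0_hat / (#{p_i <= c1} / m)."""
    frac_rejected = float(np.mean(np.asarray(p_gene, dtype=float) <= c1))
    if frac_rejected == 0.0:
        return float("nan")              # no Step 1 rejections, estimate undefined
    return c1 * pi0_hat / frac_rejected

def estimate_step2_rejection_rate(p_gene, p_pair, c1, c2):
    """Equation (5): #{p_ij <= c2, p_i <= c1} / (#{p_i <= c1} * C)."""
    p_pair = np.asarray(p_pair, dtype=float)
    passed = np.asarray(p_gene, dtype=float) <= c1
    K = int(passed.sum())
    if K == 0:
        return float("nan")
    return float((p_pair[passed] <= c2).sum()) / (K * p_pair.shape[1])

# Toy input (made-up numbers): m = 5 genes, C = 3 comparisons per gene.
p_gene = np.array([0.001, 0.004, 0.3, 0.6, 0.9])
p_pair = np.array([[0.001, 0.2, 0.003],
                   [0.04, 0.0005, 0.5],
                   [0.3, 0.7, 0.1],
                   [0.6, 0.2, 0.9],
                   [0.8, 0.5, 0.4]])
pi01_hat = storey_pi0(p_gene)                                   # 0.8
fdr1_hat = estimate_fdr1(p_gene, 0.01, pi01_hat)                # 0.02
rate_hat = estimate_step2_rejection_rate(p_gene, p_pair, 0.01, 0.01)  # 0.5
```

In the toy input two genes pass Step 1, and three of their six pairwise p-values fall below $c_2 = 0.01$, which gives the estimate 0.5 for equation (5).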

(3) $P(p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1)$ is the probability of claiming a statistically significant pairwise comparison which is associated with a falsely rejected gene (tested in Step 1). A resampling technique can be employed to estimate this probability. The following procedure applies to the cases where a global F-test from an ANOVA model with constant variance and a normal distribution assumption is employed to test for the treatment effect in Step 1. The concept is to generate a large data set under the true null hypothesis (i.e., all treatment means are the same for all genes) and then analyze these data in the same manner as the real (actual) data. The proportion of rejections in Step 2 (ratio of the number of rejections to the total number of pairwise comparisons) is then computed. The specifics are as follows:

(i) Using the same sample size as the real data, generate a random sample from a standard normal distribution for a large number of genes (e.g., $M = 100{,}000$). Assume there are 3 treatment conditions and $n$ observations within each treatment condition, making the random sample of size $3nM$.

(ii) These data are then analyzed using the same analysis as used for the real data. The p-value ($p_i$) for testing the null hypothesis that the treatment means are equal, and the p-values ($p_{ij}$) for testing the pairwise comparisons, are computed for $i = 1, \ldots, M$. Let $\#\{p_i : p_i \le c_1\}$ be the number of p-values such that $p_i \le c_1$, and $\#\{p_{ij} : p_{ij} \le c_2,\, p_i \le c_1\}$ be the number of p-values such that $p_{ij} \le c_2$, where $i$ is chosen such that $p_i \le c_1$.

These quantities, obtained from the simulated null data, provide an estimate of the probability of claiming a statistically significant pairwise comparison that is associated with a falsely rejected gene, namely

$$
\hat{P}(p_{ij} \le c_2 \mid D_i = 0,\, p_i \le c_1) = \frac{\#\{p_{ij} : p_{ij} \le c_2,\, p_i \le c_1\}}{\#\{p_i : p_i \le c_1\} \cdot C}, \tag{7}
$$

where $C$ is the number of pairwise comparisons for each gene and both counts are taken over the simulated null data.
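Steps (i) and (ii) can be sketched as a short simulation. This is an illustration under stated assumptions rather than the authors' code: the design is balanced, Step 1 uses the one-way ANOVA F-test, and, because the text does not say which pairwise test is used, Step 2 is assumed here to use two-sided t-tests based on the pooled ANOVA error variance. The function name `null_step2_rejection_rate` is hypothetical.

```python
import numpy as np
from scipy import stats

def null_step2_rejection_rate(n, c1, c2, J=3, M=100_000, seed=0):
    """Estimate equation (7) from M simulated null genes (J treatments, n replicates)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((M, J, n))            # step (i): all treatment means equal

    # Step (ii), gene level: one-way ANOVA F-test for each simulated gene.
    means = x.mean(axis=2)                        # shape (M, J)
    grand = means.mean(axis=1, keepdims=True)
    ss_between = n * ((means - grand) ** 2).sum(axis=1)
    ss_within = ((x - means[:, :, None]) ** 2).sum(axis=(1, 2))
    df1, df2 = J - 1, J * (n - 1)
    mse = ss_within / df2
    p_gene = stats.f.sf((ss_between / df1) / mse, df1, df2)

    # Step (ii), pair level: assumed pooled-variance t-tests for each pair.
    pairs = [(a, b) for a in range(J) for b in range(a + 1, J)]
    se = np.sqrt(mse * 2.0 / n)
    t = np.stack([(means[:, a] - means[:, b]) / se for a, b in pairs], axis=1)
    p_pair = 2.0 * stats.t.sf(np.abs(t), df2)

    passed = p_gene <= c1                         # falsely rejected genes
    n_pass = int(passed.sum())
    if n_pass == 0:
        return float("nan")
    return float((p_pair[passed] <= c2).sum()) / (n_pass * len(pairs))

rate = null_step2_rejection_rate(n=6, c1=0.05, c2=0.05, M=20_000)
print(rate)
```

With these assumed tests the estimate is typically far larger than $c_2$, because a gene that passes the F-test by chance tends to have at least one extreme pair of sample means. This is why the component cannot simply be replaced by $c_2$.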
In Section 6, we present an algorithm for situations when the experimental design is unbalanced and the data are not normally distributed. A permutation method is used to estimate the true null distribution of the test statistics.

(4) The estimate of $P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1)$ is $c_2$ when the probability $P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1) = 1$. Notice that

$$
\begin{aligned}
& P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1) - P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1) \\
&\quad = \frac{P(p_{ij} \le c_2,\, p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)}{P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)} - P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1) \\
&\quad \le \frac{P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1)}{P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)} - P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1) \\
&\quad = P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1)\, \frac{1 - P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)}{P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)}.
\end{aligned}
$$

Since the p-value $p_{ij}$ corresponding to $D_i = 1$ and $D_{ij} = 0$ is uniformly distributed on the interval $(0, 1)$, $P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1) = c_2$. Hence,

$$
P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1) - c_2 \le \frac{1 - P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)}{P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)}\, c_2. \tag{8}
$$

Therefore, when $P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1) = 1$, $P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1) = c_2$ holds. For an infinite sample size, the event $\{p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1\}$ is deterministic regardless of the value that $c_1$ takes. For a finite sample size, $P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)$ can be very close to, or equal to, 1 for a reasonable value of $c_1$. For example, suppose there are three treatment conditions with an equal sample size $n$ under each of the three conditions. Suppose further that a gene has treatment means $(0, 0, 3)$. Using the noncentral F-distribution under the assumption of the normal distribution, $P(p_i \le 0.01 \mid D_{ij} = 0,\, D_i = 1)$ is 0.9846 when $n = 6$, 0.9999 when $n = 10$, and 1 when $n = 30$; $P(p_i \le 0.001 \mid D_{ij} = 0,\, D_i = 1)$ is 0.8563 when $n = 6$, 0.9991 when $n = 10$, and 1 when $n = 30$. When $c_1$ is extremely small, $P(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)$ can be much smaller than 1 for a finite sample size. Using equation (8), the following method can be employed to provide an overestimate of $P(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1)$:

$$
\hat{P}(p_{ij} \le c_2 \mid D_{ij} = 0,\, D_i = 1,\, p_i \le c_1) = c_2 + c_2\, \frac{1 - \hat{P}(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)}{\hat{P}(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1)}. \tag{9}
$$

Let $E$ be the set of genes which enter the second step of the two-step procedure, but do not have all pairwise comparisons statistically significant, i.e.,

$$
E = \{\text{gene } g : p_g \le c_1 \text{ and } p_{gj} > c_2 \text{ for at least one } j\}.
$$

Since the true means are unknown they have to be estimated. For gene $g \in E$, let $\bar{x}_{gj}$ denote the sample mean for gene $g$ under treatment condition $j$, and let $[j]$ denote the treatment whose sample mean has the $j$th smallest magnitude (absolute value), so that $[J]$ indexes the largest. For example, if the three treatment means for gene $g$ satisfy $|\bar{x}_{g3}| < |\bar{x}_{g1}| < |\bar{x}_{g2}|$, then $[1] = 3$, $[2] = 1$ and $[3] = 2$. For gene $g \in E$, define the pseudo means under the $J$ treatment conditions as follows: $\mu_{g[1]} = \cdots = \mu_{g[J-1]} = 0$ and $\mu_{g[J]} = \mu_g$, where $\mu_g = \max\{|\bar{x}_{gi} - \bar{x}_{gj}|,\ i, j = 1, \ldots, J,\ i \ne j\}$. It becomes necessary to compute the probability that a gene with these pseudo means will have a p-value for testing the equality of means below $c_1$. Under the assumption of normality, the global F-test statistic for testing the equality of the means has a non-central F-distribution with non-centrality parameter

$$
\mathrm{ncp}_g = \left( \sum_{j=1}^{J-1} n_{[j]} \left(0 - \mu_g / J\right)^2 + n_{[J]} \left(\mu_g - \mu_g / J\right)^2 \right) \Big/ \hat{\sigma}_g^2,
$$

where $n_j$ is the sample size under treatment $j$ and $\hat{\sigma}_g^2$ is the estimate of the variance for gene $g$. Then

$$
P(p_g \le c_1 \mid D_{gj} = 0,\, D_g = 1) = P\!\left(f_{J-1,\, N-J,\, \mathrm{ncp}_g} \ge F^{-1}_{J-1,\, N-J}(1 - c_1)\right),
$$

where $N = \sum_j n_j$, $f_{J-1,\, N-J,\, \mathrm{ncp}_g}$ is a random variable with a non-central F-distribution with degrees of freedom $J - 1$ and $N - J$ and non-centrality parameter $\mathrm{ncp}_g$, and $F^{-1}_{J-1,\, N-J}(1 - c_1)$ is the $(1 - c_1) \times 100$th percentile of an F-distribution with degrees of freedom $J - 1$ and $N - J$. Thus,

$$
\hat{P}(p_i \le c_1 \mid D_{ij} = 0,\, D_i = 1) = \text{average of } P(p_g \le c_1 \mid D_{gj} = 0,\, D_g = 1) \text{ over } g \in E. \tag{10}
$$

When the assumption of normality does not hold, a permutation method is presented (in Section 6) to estimate this probability.
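The non-central F calculation and the overestimate in equation (9) can be sketched with SciPy's `scipy.stats.f` and `scipy.stats.ncf` distributions. The function names are hypothetical, and the worked numbers assume unit error variance for the gene with treatment means $(0, 0, 3)$, which the text does not state explicitly.

```python
from scipy import stats

def pseudo_mean_ncp(mu_g, n_sorted, sigma2):
    """Non-centrality parameter for pseudo means (0, ..., 0, mu_g).

    n_sorted lists the sample sizes in the order [1], ..., [J], so the
    last entry belongs to the treatment carrying the pseudo mean mu_g.
    """
    J = len(n_sorted)
    centre = mu_g / J
    ss = sum(n * (0.0 - centre) ** 2 for n in n_sorted[:-1])
    ss += n_sorted[-1] * (mu_g - centre) ** 2
    return ss / sigma2

def prob_pass_step1(c1, ncp, J, N):
    """P(p_g <= c1) for a gene whose F statistic is non-central F(J-1, N-J, ncp)."""
    crit = stats.f.ppf(1.0 - c1, J - 1, N - J)    # (1 - c1) quantile of the central F
    return float(stats.ncf.sf(crit, J - 1, N - J, ncp))

def overestimate_eq9(c2, prob_pass):
    """Equation (9): c2 + c2 * (1 - P) / P, which equals c2 / P."""
    return c2 + c2 * (1.0 - prob_pass) / prob_pass

# Worked example from the text: means (0, 0, 3), c1 = 0.01,
# with unit error variance as an assumption of this sketch.
power_n6 = prob_pass_step1(0.01, pseudo_mean_ncp(3.0, [6, 6, 6], 1.0), J=3, N=18)
power_n10 = prob_pass_step1(0.01, pseudo_mean_ncp(3.0, [10, 10, 10], 1.0), J=3, N=30)
bound = overestimate_eq9(0.01, power_n6)
print(power_n6, power_n10, bound)
```

If the unit-variance assumption matches the setting behind the numbers quoted earlier, the two probabilities should come out close to the reported 0.9846 and 0.9999. In practice equation (10) averages `prob_pass_step1` over the genes in $E$, each with its own pseudo mean and variance estimate.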
http://www.bepress.com/sagmb/vol5/iss1/art28

(5) The last component of equation (4), $P(D_{ij} = 0 \mid D_i = 1, p_i \le c_1)$, can be estimated using the proportion of non-significant pairwise comparisons among all pairwise comparisons associated with correctly rejected genes in Step 1. However, it is impossible to separate the correctly rejected genes from the falsely rejected genes, hence an overestimate is pursued. Define $\hat{\pi}_{02}$ as the estimate of the proportion of "true" null hypotheses given the distribution of the p-values in Step 2. We emphasize "true" here because $\hat{\pi}_{02}$ is computed from the distribution of p-values in Step 2 using the same methods as those used to estimate $\hat{\pi}_{01}$, and it is not exactly $P(D_{ij} = 0 \mid p_i \le c_1)$. Let $K$ denote the number of genes in Step 2; then $C K \hat{\pi}_{02}$ estimates the number of true null hypotheses based on the p-values, and $C K (1 - \widehat{FDR}_1)$ is the estimated number of pairwise comparisons generated by correctly rejected genes. Since the p-value ($p_{ij}$) corresponding to $D_i = 1$ and $D_{ij} = 0$ is approximately uniformly distributed, and the estimate $C K \hat{\pi}_{02}$ also includes some true null hypotheses corresponding to $D_i = 0$ and $D_{ij} = 0$, the number of true null hypotheses ($D_{ij} = 0$) corresponding to $D_i = 1$ is less than or equal to $C K \hat{\pi}_{02}$. Therefore,

$$
\hat{P}(D_{ij} = 0 \mid D_i = 1, p_i \le c_1) = \frac{C K \hat{\pi}_{02}}{C K (1 - \widehat{FDR}_1)} = \frac{\hat{\pi}_{02}}{1 - \widehat{FDR}_1}.
$$

Using equations (5)-(9), along with the estimates of the proportions of true null hypotheses ($\hat{\pi}_{01}$ and $\hat{\pi}_{02}$) in Step 1 and Step 2, the FDR (equation 4) of the two-step multiple comparison procedure can be estimated by

$$
\widehat{FDR} = \frac{\hat{P}(p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1)}{\hat{P}(p_{ij} \le c_2 \mid p_i \le c_1)}\, \widehat{FDR}_1 + \frac{\hat{P}(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1, p_i \le c_1)}{\hat{P}(p_{ij} \le c_2 \mid p_i \le c_1)}\, \hat{\pi}_{02}. \tag{11}
$$

3.3 Simulation study and results

A simulation study is employed to illustrate the accuracy of the proposed method for estimating the FDR of the two-step multiple comparison procedure. Assume there are 3 treatments and $m = 1000$ genes. Allow a proportion ($R_1$) of the genes to have a treatment effect.
For any gene having a treatment effect, there are two cases: it is differentially expressed across all three treatments; or it is not differentially expressed between two treatments, but differentially expressed under the third treatment. Among the genes which have a treatment effect, assume a proportion ($R_2$) of them are not differentially expressed between two treatments, but differentially expressed under the third treatment.

That is, $R_1 m$ genes have treatment effects, and $R_1 R_2 m$ genes have treatment means $(\mu_a, \mu_0, \mu_0)$ or $(\mu_0, \mu_a, \mu_0)$ or $(\mu_0, \mu_0, \mu_a)$, where $\mu_0$ and $\mu_a$ are different; and $R_1 (1 - R_2) m$ genes have treatment means $(\mu_1, \mu_2, \mu_3)$, where $\mu_1$, $\mu_2$, and $\mu_3$ are different. In this simulation, half of the $R_1 R_2 m$ genes are chosen to have mean (2,0,0) and the other half have mean (4,0,0); and the $R_1 (1 - R_2) m$ genes have means (4,2,0). For the $(1 - R_1) m$ genes not having a treatment effect, the mean vector is (0,0,0). The values for $R_1$ are 0.10, 0.20, 0.30, 0.40, and 0.50, and the values for $R_2$ are 0.0, 0.20, 0.40, 0.60, 0.80 and 1. Large values of $R_1$ are not used in this simulation because the proportion of significant genes in most microarray studies is relatively small.

Assume for each gene that there are $n = 6$ observations under each of the treatments. For each combination of $R_1$ and $R_2$, 1000 data sets (each of size 1000 genes × 6 replicates × 3 treatments) are generated from normal distributions with standard deviation 1. For each simulated data set, 1000 global F-test statistics corresponding to the $m = 1000$ genes are computed for testing equality of the three treatment means. If a gene has a p-value smaller than or equal to a pre-specified level $c_1$, then it is considered as having a significant treatment effect, and thus enters the second step. In the second step, for the genes with statistically significant treatment effects from Step 1, pairwise comparisons are performed using t-tests. Pairwise comparisons with a p-value less than or equal to a pre-specified level $c_2$ are considered statistically significant. Various values of $c_1$ and $c_2$ are used in the simulation.
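The simulation design described above can be laid out as follows. This is a sketch under stated assumptions rather than the authors' simulation code: the function names are hypothetical, counts are obtained by rounding $R_1 m$ and $R_1 R_2 m$, and when the number of genes with one differing mean is odd the extra gene is assigned to (4,0,0).

```python
import random


def simulation_mean_vectors(m, r1, r2):
    """Mean vectors for m genes under the design with proportions r1 and r2."""
    n_effect = round(r1 * m)            # genes with a treatment effect
    n_one_diff = round(r2 * n_effect)   # one mean differs, the other two agree
    n_low = n_one_diff // 2             # half get (2, 0, 0)
    n_high = n_one_diff - n_low         # the rest get (4, 0, 0)
    n_all_diff = n_effect - n_one_diff  # all three means differ
    n_null = m - n_effect               # no treatment effect
    return ([(2.0, 0.0, 0.0)] * n_low
            + [(4.0, 0.0, 0.0)] * n_high
            + [(4.0, 2.0, 0.0)] * n_all_diff
            + [(0.0, 0.0, 0.0)] * n_null)


def pairwise_truth(means):
    """True differential-expression status D_ij for pairs (1,2), (1,3), (2,3)."""
    a, b, c = means
    return [a != b, a != c, b != c]


def simulate_data_set(mean_vectors, n, rng):
    """One data set: data[gene][treatment] is a list of n N(mean, 1) draws."""
    return [[[rng.gauss(mu, 1.0) for _ in range(n)] for mu in means]
            for means in mean_vectors]


if __name__ == "__main__":
    vectors = simulation_mean_vectors(1000, 0.2, 0.6)
    data = simulate_data_set(vectors, n=6, rng=random.Random(1))
    print(len(data), len(data[0]), len(data[0][0]))
```

Keeping the true status of every pairwise comparison alongside the data is what allows the true FDR and the average power to be computed for each simulated data set.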
For each simulated data set, $\hat{\pi}_{01}$ and $\hat{\pi}_{02}$, the estimates of the proportion of true null hypotheses in Step 1 and Step 2, are computed using Storey and Tibshirani's smoother estimate (Storey and Tibshirani, 2003), and the FDR is estimated using equation (11). The averages of the estimated FDR from 1000 simulations for $(c_1, c_2)$ = (0.10, 0.05), (0.10, 0.01) and (0.05, 0.01) are presented in Table 1. The average of the true FDR from the 1000 simulations is also presented. For the estimated FDR presented in Table 1, $P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 1, p_i \le c_1)$ is estimated using $c_2$ instead of equation (9). It is clear that the estimated FDR is very close to the true FDR when $c_1$ is not too small, which indicates that $P(p_i \le c_1 \mid D_{ij} = 0, D_i = 1)$ is close to 1. As seen in Table 1, the proposed method yields accurate estimates of the overall FDR. As one would expect, the overall FDR for any two-step procedure depends on the configuration of $R_1$ and $R_2$. For our two-step approach with $c_1 = 0.10$, $c_2 = 0.05$, when $R_1 = 0.10$ and $R_2 = 1.0$ the FDR can be as big as 0.39, yet when $R_1 = 0.50$ and $R_2 = 0.0$ the FDR can be as small as 0.046. For the same value of $R_1$ and the same rejection regions $[0, c_1]$ in Step 1 and $[0, c_2]$ in Step 2, the FDR increases as $R_2$ increases. On the other hand, for the same value of $R_2$ and the same rejection regions $[0, c_1]$ in Step 1 and $[0, c_2]$ in Step 2, the FDR decreases as $R_1$ (the proportion of genes having a treatment effect) increases.

4 Estimating FDR for fixed FDR significance levels

The two-step multiple comparison procedure can also be applied using fixed FDR significance levels in Step 1 and Step 2, respectively. For instance, an FDR controlling procedure at FDR significance level $\alpha_1$ ($\alpha_1$ is known and fixed) is applied to the p-values in Step 1, and statistically significant genes are identified. Let $A$ denote the collection of statistically significant genes. Define $d_1$ to be the smallest p-value in Step 1 which is not statistically significant, i.e., $d_1 = \min\{p_i,\ i \in A^c\}$, where $A^c$ is the complement of $A$. In Step 2, pairwise comparisons associated with the statistically significant genes (i.e., genes in set $A$) are tested using an FDR controlling procedure at FDR significance level $\alpha_2$ ($\alpha_2$ is known and fixed), and statistically significant effects are identified. Let $d_2$ be the smallest p-value among the pairwise comparisons in Step 2 which are not statistically significant. Since the goal is to compute the overall FDR, this can be achieved by replacing $c_1$ and $c_2$ with the respective $d_1$ and $d_2$ when using the method for estimating the FDR for fixed rejection regions (11). That is, assuming $d_1$ and $d_2$ are known,

$$
\widehat{FDR}(\alpha_1, \alpha_2) = \frac{\hat{P}(p_{ij} \le d_2 \mid D_i = 0, p_i \le d_1)}{\hat{P}(p_{ij} \le d_2 \mid p_i \le d_1)}\, \widehat{FDR}_1 + \frac{\hat{P}(p_{ij} \le d_2 \mid D_{ij} = 0, D_i = 1, p_i \le d_1)}{\hat{P}(p_{ij} \le d_2 \mid p_i \le d_1)}\, \hat{\pi}_{02}. \tag{12}
$$

It is worth noting that for this approach, $d_1$ is determined by the p-values in Step 1, $\alpha_1$, and the FDR controlling procedure applied in Step 1; and $d_2$ is determined by the p-values in both steps, $\alpha_1$, $\alpha_2$, and the FDR controlling procedures applied in Step 1 and Step 2, respectively.
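To make the definition of $d_1$ concrete, the sketch below applies a step-up procedure of the Benjamini and Hochberg type to a list of Step 1 p-values and then reads off the smallest non-significant p-value. The section only speaks of "an FDR controlling procedure", so the choice of this particular step-up rule, the optional plug-in `pi0` for the proportion of true nulls, and the function names are all assumptions made for illustration.

```python
def bh_rejections(pvalues, alpha, pi0=1.0):
    """Indices rejected by a step-up rule at level alpha.

    The k-th smallest p-value is compared with k * alpha / (pi0 * m);
    pi0 = 1 gives the classical step-up rule, pi0 < 1 an adaptive variant.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / (pi0 * m):
            k = rank  # remember the largest rank that passes
    return set(order[:k])


def smallest_nonsignificant(pvalues, rejected):
    """d = min{p_i : i not rejected}; None if every hypothesis is rejected."""
    rest = [p for i, p in enumerate(pvalues) if i not in rejected]
    return min(rest) if rest else None


if __name__ == "__main__":
    step1_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    A = bh_rejections(step1_p, alpha=0.05)
    d1 = smallest_nonsignificant(step1_p, A)
    print(sorted(A), d1)
```

The same two functions can be applied to the Step 2 pairwise p-values of the genes in $A$ at level $\alpha_2$ to obtain $d_2$.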
Table 1: Simulation results. Estimated FDR ($\widehat{FDR}$) and true FDR of pairwise comparisons for 3 treatments and 1000 genes as applied to the two-step multiple comparison procedure using fixed rejection regions $c_1$ and $c_2$ in Steps 1 and 2, respectively. $R_1$: the proportion of genes having a treatment effect; $R_2$: the proportion of genes with a treatment effect having one treatment mean different and the other two the same.

c1 = 0.10, c2 = 0.05
                R1     R2=0.0  0.20    0.40    0.60    0.80    1.0
Estimated FDR   0.10   0.299   0.315   0.331   0.350   0.367   0.392
                0.20   0.160   0.171   0.183   0.196   0.212   0.230
                0.30   0.100   0.109   0.118   0.129   0.141   0.156
                0.40   0.067   0.074   0.082   0.090   0.101   0.113
                0.50   0.046   0.052   0.058   0.066   0.075   0.085
True FDR        0.10   0.296   0.311   0.328   0.346   0.369   0.391
                0.20   0.157   0.168   0.182   0.197   0.213   0.230
                0.30   0.098   0.107   0.118   0.129   0.142   0.158
                0.40   0.065   0.073   0.082   0.091   0.102   0.115
                0.50   0.044   0.051   0.058   0.066   0.076   0.087

c1 = 0.10, c2 = 0.01
                R1     R2=0.0  0.20    0.40    0.60    0.80    1.0
Estimated FDR   0.10   0.104   0.111   0.118   0.126   0.134   0.144
                0.20   0.049   0.053   0.057   0.061   0.066   0.072
                0.30   0.029   0.032   0.035   0.038   0.041   0.046
                0.40   0.019   0.021   0.023   0.026   0.028   0.032
                0.50   0.013   0.015   0.016   0.018   0.020   0.023
True FDR        0.10   0.102   0.107   0.116   0.122   0.130   0.142
                0.20   0.048   0.051   0.056   0.060   0.066   0.071
                0.30   0.028   0.031   0.034   0.037   0.041   0.045
                0.40   0.018   0.021   0.023   0.025   0.028   0.032
                0.50   0.012   0.014   0.016   0.018   0.020   0.023

c1 = 0.05, c2 = 0.01
                R1     R2=0.0  0.20    0.40    0.60    0.80    1.0
Estimated FDR   0.10   0.104   0.110   0.117   0.124   0.133   0.143
                0.20   0.049   0.052   0.056   0.061   0.066   0.072
                0.30   0.029   0.032   0.034   0.038   0.041   0.045
                0.40   0.019   0.021   0.023   0.025   0.028   0.031
                0.50   0.013   0.014   0.016   0.018   0.020   0.023
True FDR        0.10   0.101   0.107   0.114   0.122   0.130   0.142
                0.20   0.048   0.052   0.055   0.060   0.065   0.071
                0.30   0.029   0.031   0.034   0.037   0.041   0.045
                0.40   0.018   0.020   0.023   0.025   0.028   0.031
                0.50   0.012   0.014   0.016   0.018   0.021   0.023

5 Controlling the FDR at a desired significance level

Instead of estimating the FDR for a fixed rejection region, traditional multiple comparison procedures (Hochberg and Tamhane, 1987; Hsu, 1996) reject the null hypotheses at a pre-chosen significance level. If the desired FDR significance level of the two-step multiple comparison is $\alpha$, then the problem becomes choosing the FDR significance levels $\alpha_1$ and $\alpha_2$ in Step 1 and Step 2, respectively, so that the overall FDR is controlled by $\alpha$.

5.1 An approximate upper bound for FDR

The resampling procedure that is required for estimating the FDR (equation 11) can be a disadvantage, because when the experimental design is complicated it may be difficult to generate data under the null hypothesis. Fortunately, an upper bound for the ratio $P(p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1) / P(p_{ij} \le c_2 \mid p_i \le c_1)$ is available, so estimating $P(p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1)$ via simulation can be avoided.

Theorem 2. In the two-step multiple comparison procedure,

$$
\frac{P(p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} \le 1.
$$

Proof.

$$
\begin{aligned}
\frac{P(p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}
&= \frac{P(p_{ij} \le c_2 \mid D_{ij} = 0, D_i = 0, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} \\
&= \frac{P(p_{ij} \le c_2, D_{ij} = 0, D_i = 0 \mid p_i \le c_1) \,/\, P(D_{ij} = 0, D_i = 0 \mid p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} \\
&= \frac{P(p_{ij} \le c_2, D_{ij} = 0, D_i = 0 \mid p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)\, P(D_{ij} = 0, D_i = 0 \mid p_i \le c_1)} \\
&= \frac{P(D_{ij} = 0, D_i = 0 \mid p_i \le c_1, p_{ij} \le c_2)}{P(D_{ij} = 0, D_i = 0 \mid p_i \le c_1)} \le 1.
\end{aligned} \tag{13}
$$

When $c_2 \le 1$, this inequality (equation 13) holds for two specific reasons. First, the probability of a false rejection in Step 1 (rejecting the null hypothesis $H^0_i$ when it is true) depends only on the p-values $p_i$ and on $c_1$. Second, with a constraint in Step 2 ($p_{ij} \le c_2$ and $c_2 < 1$), the chance of making a false rejection (rejecting the null hypothesis $H^0_{ij}$ when it is true) is smaller than for a procedure in which no such constraint is applied.

When $P(p_i \le c_1 \mid D_i = 1, D_{ij} = 0 \text{ for some } j) = 1$, the FDR (equation 4) is

$$
FDR = FDR_1\, \frac{P(p_{ij} \le c_2 \mid D_i = 0, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} + (1 - FDR_1)\, \frac{c_2\, P(D_{ij} = 0 \mid D_i = 1, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)}.
$$

Define $\pi_{02}$, $FDR_2$ and $pFDR_2$ to be the proportion of true null hypotheses, the FDR and the pFDR in Step 2 based on the empirical distribution of the p-values. Then

$$
FDR_2 = pFDR_2 = \frac{c_2\, \pi_{02}}{P(p_{ij} \le c_2 \mid p_i \le c_1)}.
$$

Notice that

$$
\frac{c_2\, P(D_{ij} = 0 \mid D_i = 1, p_i \le c_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} \le \frac{c_2\, \pi_{02} / (1 - FDR_1)}{P(p_{ij} \le c_2 \mid p_i \le c_1)} = \frac{FDR_2}{1 - FDR_1};
$$

combining this with Theorem 2, an upper bound for the overall FDR (equation 4) is

$$
FDR \le FDR_1 + FDR_2. \tag{14}
$$

Therefore, the overall FDR can be controlled below level $\alpha$ as long as the FDR significance levels $\alpha_1$ and $\alpha_2$ used in the respective Step 1 and Step 2 satisfy $\alpha_1 + \alpha_2 \le \alpha$. However, when $P(p_i \le c_1 \mid D_i = 1, D_{ij} = 0 \text{ for some } j)$ is far less than 1, the realized FDR may exceed $FDR_1 + FDR_2$. One strategy is to put more weight of the overall FDR on $FDR_1$ so that $P(p_i \le c_1 \mid D_i = 1, D_{ij} = 0 \text{ for some } j)$ is closer to 1, and at the same time more genes can be included in the analysis in Step 2. Next, we investigate the performance of the two-step procedure with fixed FDR significance levels in Step 1 and Step 2, and propose a method to choose FDR significance levels in the two steps so that the overall FDR can be controlled below a pre-chosen overall FDR significance level.

5.2 Fixing the FDR significance levels

A simulation study is employed to illustrate the improved power of the two-step multiple comparison procedure over the one-step procedure. The simulation scenario is the same as in Section 3.3. There are 3 treatment conditions, a sample size of $n = 6$ within each treatment condition, and $m = 1000$ genes. For each combination of $R_1$ and $R_2$, 1000 data sets are generated from normal distributions with standard deviation 1, and there are $3nm$ data points within each data set. The FDR controlling procedure is then applied to the corresponding 1000 genes at an FDR significance level $\alpha_1$. In the second step, for the genes with significant treatment effects from Step 1, pairwise comparisons are performed with the FDR controlling procedure at FDR significance level $\alpha_2$. The respective FDR significance levels used in the first and second step are $(\alpha_1, \alpha_2)$ = (0.04, 0.01) and (0.03, 0.02), and the estimated FDR, the true FDR and the average power are listed in Tables 2 and 3. Here, the average power is defined to be the expected proportion of correct rejections among the true alternative hypotheses. For the purpose of comparing the results with the one-step FDR controlling procedure, the estimated FDR, the true FDR, and the average power for the one-step procedure are also listed in Table 2. For the one-step procedure, an FDR controlling procedure is applied to the family of $3m$ pairwise comparisons. Specifically, Benjamini and Hochberg's adaptive FDR controlling procedure (Benjamini and Hochberg, 2000), with the proportion of null hypotheses estimated by Storey and Tibshirani's smoother estimate (Storey and Tibshirani, 2003), is employed. When the proportion of genes having a treatment effect ($R_1$) is small, the two-step multiple comparison procedure is more powerful than the one-step multiple comparison procedure because of the reduced number of tests in Step 2. For example, in this simulation, when $R_1 = 0.2$ and $R_2 = 0.2$, the one-step procedure has 80% power, while the two-step procedure has approximate power 96%.
As observed from the simulations, when $R_2$ (the proportion of significant genes for which one treatment effect is different but the other two are the same) increases, the power of the two-step procedure decreases. This is due to the fact that when $R_2$ increases, fewer genes are included in Step 2. From this simulation, the power for $\alpha_1 = 0.04$, $\alpha_2 = 0.01$ is slightly bigger than that for $\alpha_1 = 0.03$, $\alpha_2 = 0.02$ when $R_1$ is small. Furthermore, when $\alpha_1 = 0.04$ more genes are included in Step 2. Simulations have been performed for values of the FDR level $\alpha$ = 0.01, 0.02, ..., 0.2. The FDR controlling procedure with the incorporation of the estimate of the proportion of true null hypotheses is applied in both steps of the two-step procedure, and the Step 1 and Step 2 FDR levels are set to $\alpha_1 = \tfrac{4}{5}\alpha$ and $\alpha_2 = \tfrac{1}{5}\alpha$. These simulations (Figure 1) demonstrate that the overall FDR is controlled at FDR level $\alpha$ for all values of $\alpha$. Based on this work and experience, our ad hoc suggestion is to use $\alpha_1 = \tfrac{4}{5}\alpha$ and $\alpha_2 = \tfrac{1}{5}\alpha$ if the overall FDR is required to be controlled at FDR level $\alpha$.

Table 2: Simulation results. Estimated FDR ($\widehat{FDR}$), true FDR, and average power for pairwise comparisons for 3 treatment conditions and 1000 genes using both the two-step and one-step procedure, respectively. For the two-step procedure, the FDR significance levels $\alpha_1 = 0.04$ and $\alpha_2 = 0.01$ are used in Step 1 and Step 2, respectively. For the one-step procedure, the FDR significance level is 0.05.

Two-step procedure
                R1     R2=0.0  0.20    0.40    0.60    0.80    1.0
Estimated FDR   0.10   0.053   0.053   0.052   0.052   0.053   0.057
                0.20   0.053   0.048   0.046   0.046   0.047   0.049
                0.30   0.053   0.046   0.044   0.043   0.043   0.044
                0.40   0.053   0.044   0.042   0.041   0.041   0.041
                0.50   0.053   0.042   0.040   0.039   0.038   0.039
True FDR        0.10   0.037   0.045   0.046   0.047   0.048   0.051
                0.20   0.038   0.042   0.043   0.043   0.045   0.046
                0.30   0.037   0.041   0.042   0.042   0.042   0.043
                0.40   0.037   0.040   0.040   0.040   0.040   0.040
                0.50   0.037   0.038   0.038   0.037   0.037   0.037
Power           0.10   0.993   0.949   0.898   0.849   0.800   0.761
                0.20   0.997   0.960   0.919   0.881   0.851   0.825
                0.30   0.999   0.965   0.929   0.899   0.877   0.861
                0.40   0.999   0.968   0.936   0.910   0.892   0.881
                0.50   0.999   0.970   0.941   0.917   0.901   0.890

One-step procedure
                R1     R2=0.0  0.20    0.40    0.60    0.80    1.0
True FDR        0.10   0.050   0.051   0.050   0.050   0.051   0.056
                0.20   0.049   0.049   0.051   0.051   0.050   0.050
                0.30   0.050   0.050   0.052   0.051   0.050   0.050
                0.40   0.050   0.050   0.050   0.050   0.052   0.048
                0.50   0.050   0.051   0.051   0.051   0.050   0.051
Power           0.10   0.695   0.700   0.715   0.712   0.726   0.748
                0.20   0.800   0.798   0.797   0.800   0.805   0.816
                0.30   0.861   0.858   0.853   0.850   0.851   0.857
                0.40   0.903   0.894   0.891   0.889   0.885   0.886
                0.50   0.931   0.926   0.921   0.914   0.910   0.909

Table 3: Simulation results. Estimated FDR ($\widehat{FDR}$), true FDR, and average power for pairwise comparisons for 3 treatment conditions and 1000 genes using the two-step procedure at the FDR significance levels $\alpha_1 = 0.03$ and $\alpha_2 = 0.02$ in Step 1 and Step 2, respectively.

                R1     R2=0.0  0.20    0.40    0.60    0.80    1.0
Estimated FDR   0.10   0.042   0.056   0.053   0.054   0.055   0.059
                0.20   0.042   0.051   0.049   0.049   0.051   0.052
                0.30   0.042   0.049   0.047   0.048   0.048   0.050
                0.40   0.042   0.048   0.046   0.046   0.047   0.048
                0.50   0.042   0.046   0.045   0.045   0.045   0.046
True FDR        0.10   0.030   0.048   0.051   0.052   0.055   0.057
                0.20   0.030   0.047   0.049   0.050   0.052   0.054
                0.30   0.030   0.046   0.047   0.049   0.049   0.051
                0.40   0.030   0.045   0.046   0.047   0.048   0.049
                0.50   0.030   0.043   0.044   0.045   0.046   0.046
Power           0.10   0.990   0.953   0.904   0.854   0.796   0.736
                0.20   0.997   0.968   0.931   0.890   0.850   0.809
                0.30   0.999   0.975   0.944   0.912   0.882   0.854
                0.40   1.000   0.980   0.954   0.928   0.905   0.884
                0.50   1.000   0.983   0.961   0.940   0.922   0.906

[Figure 1 plot not reproduced: true FDR (vertical axis, 0.05 to 0.20) against $\alpha$ (horizontal axis, 0.05 to 0.20), with one curve for each of $(R_1, R_2)$ = (0.2, 0.2), (0.2, 0.8), (0.4, 0.2) and (0.4, 0.8).]

Figure 1: Simulation results of the FDR for the two-step multiple comparison procedure using $\alpha_1 = \tfrac{4}{5}\alpha$ and $\alpha_2 = \tfrac{1}{5}\alpha$ for different levels of $\alpha$. In total there are $m = 1000$ genes, 3 treatment conditions, and four different combinations of $R_1$ and $R_2$: $R_1 = 0.2$, $R_2 = 0.2$ (short dashed line), $R_1 = 0.2$, $R_2 = 0.8$ (dotted line), $R_1 = 0.4$, $R_2 = 0.2$ (dotted-dashed line) and $R_1 = 0.4$, $R_2 = 0.8$ (long dashed line). The black straight line represents the pre-chosen FDR level. Here $R_1$ is the proportion of genes having a treatment effect; $R_2$ is the proportion of genes with a treatment effect having one treatment mean different but the other two the same.

5.3 Choosing the FDR significance levels

Here we propose an adaptive approach for choosing $\alpha_1$ and $\alpha_2$, and suggest some guidelines and direction for selecting $\alpha_1$ and $\alpha_2$. First, $\alpha_1$ should be bigger than $\alpha_2$. When a looser criterion is used in Step 1, more genes are available to enter the second step. Second, $\alpha_1$ and $\alpha_2$ should be chosen such that the overall FDR is close to but below the pre-specified significance level. Hence, the power for detecting a significant effect will be maximized. Third, the choice of $\alpha_1$ and $\alpha_2$ should lead to the largest number of rejections occurring in Step 2. With these guidelines in mind, we propose the following directive for finding the significance levels $\alpha_1$ and $\alpha_2$. Let $S$ be a set of values of $(i \times \alpha)/n$ where $i = 1, \ldots, n - 1$ and $n$ is a positive integer. That is, $S = \{\alpha/n, 2\alpha/n, \ldots, (n - 1)\alpha/n\}$. Let $\widehat{FDR}(\alpha_1, \alpha_2)$ be the estimated overall FDR and $R(\alpha_1, \alpha_2)$ the number of rejections (or statistically significant pairwise comparisons) in Step 2 when a two-step procedure with respective significance levels $\alpha_1$ and $\alpha_2$ in Step 1 and 2 is applied. Then $\alpha_1^*$ and $\alpha_2^*$ are chosen such that

$$
(\alpha_1^*, \alpha_2^*) = \arg\max_{\substack{\alpha_1, \alpha_2 \in S,\ \alpha_1 > \alpha_2, \\ \alpha_1 + \alpha_2 \le \alpha,\ \widehat{FDR}(\alpha_1, \alpha_2) \le \alpha}} R(\alpha_1, \alpha_2). \tag{15}
$$

Using the same simulation as in Section 3.3, for each of the 1000 data sets, we apply our guidelines to find $\alpha_1^*$ and $\alpha_2^*$. Suppose the overall FDR significance level is $\alpha = 0.05$ and $S = \{\alpha/5, 2\alpha/5, 3\alpha/5, 4\alpha/5\}$; then $\alpha_1^*$ and $\alpha_2^*$ can be chosen from $(\alpha_1, \alpha_2)$ = (0.02, 0.01), (0.03, 0.01), (0.03, 0.02), and (0.04, 0.01). Table 4 gives the frequency distribution of $\alpha_1^*$ and $\alpha_2^*$ based on these 1000 simulations. As can be seen, when $R_1 = 0.20$, $R_2 = 0.60$, the choice of $(\alpha_1^*, \alpha_2^*)$ is (0.03, 0.01) for 12 simulated data sets, (0.03, 0.02) for 877 simulated data sets, and (0.04, 0.01) for 111 simulated data sets.
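The search in equation (15) is a small grid search. The sketch below is illustrative only: `estimate_fdr` and `count_rejections` are hypothetical callables standing in for the estimated overall FDR (equation 12) and the number of Step 2 rejections, and ties in the number of rejections are resolved in favour of the first candidate encountered, which the text does not specify.

```python
def candidate_levels(alpha, n):
    """All (alpha1, alpha2) in S x S with alpha1 > alpha2 and alpha1 + alpha2 <= alpha.

    S = {alpha/n, 2*alpha/n, ..., (n-1)*alpha/n}. The constraints are checked
    on the integer multipliers i and k to avoid floating-point ties.
    """
    return [(i * alpha / n, k * alpha / n)
            for i in range(1, n)
            for k in range(1, n)
            if i > k and i + k <= n]


def choose_levels(alpha, n, estimate_fdr, count_rejections):
    """Return the admissible pair maximising Step 2 rejections, or None."""
    best = None
    for a1, a2 in candidate_levels(alpha, n):
        if estimate_fdr(a1, a2) > alpha:
            continue  # estimated overall FDR exceeds the target level
        r = count_rejections(a1, a2)
        if best is None or r > best[0]:
            best = (r, a1, a2)
    return None if best is None else (best[1], best[2])


if __name__ == "__main__":
    for a1, a2 in candidate_levels(0.05, 5):
        print(round(a1, 6), round(a2, 6))
```

With $\alpha = 0.05$ and $n = 5$, `candidate_levels` enumerates exactly the four pairs listed in the text; a finer grid such as $n = 20$ only enlarges the candidate list.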
The chosen significance levels in the two-step method are more diverse when $R_1$ is small, and they converge to $(\alpha_1^*, \alpha_2^*)$ = (0.03, 0.02) as $R_1$ gets larger. Evidently, the case where $R_2 = 0.0$ (genes which have a treatment effect where all means are different from each other) yields random results. This is most likely due to the fact that almost all pairwise comparisons in Step 2 are significant. Given the choices of $\alpha_1^*$ and $\alpha_2^*$ (Table 4), the average FDR is controlled below $\alpha = 0.05$ (Table 5), and the two-step procedure has more power than the one-step procedure (Table 2). For these results, $\alpha_1^*$ and $\alpha_2^*$ take values from $S = \{\alpha/5, 2\alpha/5, 3\alpha/5, 4\alpha/5\}$. However, for more accurate results, we suggest $S = \{\alpha/20, 2\alpha/20, \ldots, 19\alpha/20\}$.

Table 4: Frequency distribution of $\alpha_1^*$ and $\alpha_2^*$ from 1000 simulations for pairwise comparisons for 3 treatment conditions and 1000 genes. Here $\alpha_1^*$ and $\alpha_2^*$ are determined using the stated guidelines, and by controlling the overall FDR for the two-step procedure below $\alpha = 0.05$.

              alpha1* =  0.01    0.03    0.03    0.04
R1     R2     alpha2* =  0.01    0.01    0.02    0.01
0.10   0.0               105     351     178     366
       0.20              53      345     152     450
       0.40              18      214     280     488
       0.60              17      185     224     574
       0.80              16      251     120     613
       1.0               125     458     10      407
0.20   0.0               37      221     256     486
       0.20              1       83      505     411
       0.40              0       23      792     185
       0.60              0       10      761     229
       0.80              0       5       527     468
       1.0               0       13      90      897
0.30   0.0               9       129     309     553
       0.20              0       20      843     137
       0.40              0       5       970     25
       0.60              0       4       957     39
       0.80              0       1       856     143
       1.0               0       0       503     497
0.40   0.0               3       76      299     622
       0.20              0       4       963     33
       0.40              0       1       995     4
       0.60              0       0       999     1
       0.80              0       0       986     14
       1.0               0       0       934     66
0.50   0.0               1       48      294     657
       0.20              0       2       995     3
       0.40              0       0       1000    0
       0.60              0       0       1000    0
       0.80              0       0       1000    0
       1.0               0       0       1000    0

Table 5: Simulation results. Estimated FDR ($\widehat{FDR}$), true FDR, and power for pairwise comparisons for 3 treatment conditions and 1000 genes using the two-step procedure. The FDR for the entire procedure is controlled below 0.05 with significance levels $\alpha_1$ and $\alpha_2$ chosen automatically (results are listed in Table 4).

                R1     R2=0.0  0.20    0.40    0.60    0.80    1.0
Estimated FDR   0.10   0.045   0.050   0.050   0.050   0.050   0.051
                0.20   0.048   0.050   0.051   0.051   0.049   0.049
                0.30   0.048   0.050   0.050   0.050   0.049   0.048
                0.40   0.049   0.049   0.050   0.048   0.048   0.047
                0.50   0.049   0.049   0.047   0.047   0.047   0.046
True FDR        0.10   0.036   0.047   0.049   0.050   0.050   0.052
                0.20   0.036   0.045   0.048   0.050   0.051   0.050
                0.30   0.036   0.046   0.046   0.047   0.050   0.049
                0.40   0.037   0.045   0.046   0.046   0.048   0.048
                0.50   0.037   0.043   0.044   0.045   0.046   0.047
Power           0.10   0.992   0.950   0.903   0.851   0.799   0.755
                0.20   0.997   0.967   0.930   0.890   0.853   0.824
                0.30   0.999   0.975   0.944   0.912   0.882   0.861
                0.40   0.999   0.980   0.954   0.928   0.904   0.885
                0.50   0.999   0.983   0.961   0.939   0.921   0.908