Statistical Applications in Genetics and Molecular Biology
Volume 3, Issue 1    2004    Article 13

Multiple Testing. Part I. Single-Step Procedures for Control of General Type I Error Rates

Sandrine Dudoit, Division of Biostatistics, School of Public Health, University of California, Berkeley, sandrine@stat.berkeley.edu
Mark J. van der Laan, Division of Biostatistics, School of Public Health, University of California, Berkeley, laan@stat.berkeley.edu
Katherine S. Pollard, University of California, Santa Cruz, kpollard@gladstone.ucsf.edu

Copyright © 2004 by the authors. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, bepress, which has been given certain exclusive rights by the author. Statistical Applications in Genetics and Molecular Biology is produced by The Berkeley Electronic Press (bepress).

Multiple Testing. Part I. Single-Step Procedures for Control of General Type I Error Rates

Sandrine Dudoit, Mark J. van der Laan, and Katherine S. Pollard

Abstract

The present article proposes general single-step multiple testing procedures for controlling Type I error rates defined as arbitrary parameters of the distribution of the number of Type I errors, such as the generalized family-wise error rate. A key feature of our approach is the test statistics null distribution (rather than data generating null distribution) used to derive cut-offs (i.e., rejection regions) for these test statistics and the resulting adjusted p-values. For general null hypotheses, corresponding to submodels for the data generating distribution, we identify an asymptotic domination condition for a null distribution under which single-step common-quantile and common-cut-off procedures asymptotically control the Type I error rate, for arbitrary data generating distributions, without the need for conditions such as subset pivotality. Inspired by this general characterization of a null distribution, we then propose as an explicit null distribution the asymptotic distribution of the vector of null value shifted and scaled test statistics. In the special case of family-wise error rate (FWER) control, our method yields the single-step minP and maxT procedures, based on minima of unadjusted p-values and maxima of test statistics, respectively, with the important distinction lying in the choice of null distribution. Single-step procedures based on consistent estimators of the null distribution are shown to also provide asymptotic control of the Type I error rate. A general bootstrap algorithm is supplied to conveniently obtain consistent estimators of the null distribution. The special cases of t- and F-statistics are discussed in detail. The companion articles focus on step-down multiple testing procedures for control of the FWER (van der Laan et al., 2004b) and on augmentations of FWER-controlling methods to control error rates such as tail probabilities for the number of false positives and for the proportion of false positives among the rejected hypotheses (van der Laan et al., 2004a). The proposed bootstrap multiple testing procedures are evaluated by a simulation study and applied to genomic data in the fourth article of the series (Pollard et al., 2004).

KEYWORDS: Adjusted p-value, asymptotic control, bootstrap, consistency, cut-off, F-statistic, generalized family-wise error rate, multiple testing, null distribution, null hypothesis, quantile, single-step, test statistic, t-statistic, Type I error rate

The authors would like to thank two referees for constructive comments on an earlier version of this manuscript.

1 Introduction

1.1 Motivation

DNA microarray experiments and other high-throughput biological assays have motivated us to investigate multiple testing methods in large multivariate settings, though our results apply to multiple testing in general. Current statistical inference problems in genomic data analysis are characterized by: (i) high-dimensional multivariate distributions, with typically unknown and intricate correlation patterns among variables; (ii) large parameter spaces; (iii) a number of variables (hypotheses) that is much larger than the sample size; and (iv) some non-negligible proportion of false null hypotheses, i.e., true positives. Multiple hypothesis testing methods are concerned with the simultaneous test of M > 1 null hypotheses, while controlling a suitably defined Type I error rate (i.e., false positive rate), such as the family-wise error rate or the false discovery rate. General references on multiple testing include Hochberg and Tamhane (1987), Shaffer (1995), and Westfall and Young (1993). A number of recent articles have addressed the question of multiple testing as it relates to the identification of differentially expressed genes in DNA microarray experiments (Dudoit et al., 2002; Efron et al., 2001; Golub et al., 1999; Manduchi et al., 2000; Pollard and van der Laan, 2004; Reiner et al., 2003; Tusher et al., 2001; Westfall et al., 2001; Xiao et al., 2002); a review of multiple testing methods in the context of microarray data analysis is given in Dudoit et al. (2003).

In multiple testing, decisions to reject the null hypotheses are based on cut-off rules for test statistics (or their associated p-values), so that a given Type I error rate is controlled at a specified level α. In practice, however, the true joint distribution of the test statistics is unknown and replaced by an assumed null joint distribution in order to derive these cut-offs. Current approaches use a data generating distribution that satisfies the complete null hypothesis that all null hypotheses are true. Procedures based on such a null distribution typically rely on the subset pivotality condition stated in Westfall and Young (1993) to ensure that control under a data generating distribution satisfying the complete null hypothesis does indeed give the desired control under the true data generating distribution. However, the subset pivotality condition is violated in important testing problems, since a data generating distribution satisfying the complete null hypothesis might result in a joint distribution for the vector of test statistics that has a different dependence structure than their true distribution.

In fact, in many problems, there does not even exist a data generating null distribution that correctly specifies the dependence structure of the joint distribution of the test statistics corresponding to the true null hypotheses (e.g., tests concerning correlations and regression parameters in Section 7.1).

Pollard and van der Laan (2004) formally define a statistical framework for testing multiple single-parameter null hypotheses of the form H_0(m) = I(ψ(m) ≤ ψ_0(m)), for one-sided tests, and H_0(m) = I(ψ(m) = ψ_0(m)), for two-sided tests, where ψ = (ψ(m) : m = 1, ..., M) is an M-vector of parameters and ψ_0 = (ψ_0(m) : m = 1, ..., M) are the hypothesized null values. They propose as test statistics null distribution the asymptotic distribution of the mean-zero centered test statistics and prove that, with this choice of null distribution, single-step multiple testing procedures based on common-cut-off rules for the test statistics or the corresponding marginal p-values (common-quantile procedures) provide asymptotic control of any Type I error rate that is a function of the distribution of the number of false positives. This general approach does not rely on subset pivotality. Pollard and van der Laan (2004) propose a bootstrap algorithm for estimating the null distribution and prove the important practical result that multiple testing procedures based on consistent estimators of the null distribution (e.g., from non-parametric or model-based bootstrap) asymptotically control the Type I error rate. These authors also generalize the equivalence of hypothesis testing and confidence regions to the multivariate setting, by demonstrating that their single-step multiple testing procedures, with asymptotic control of a particular Type I error rate θ at level α, are equivalent to constructing an asymptotic θ-specific (1 − α) confidence region for the parameter of interest (e.g., bootstrap-based) and rejecting the hypotheses for which the null values are not included in the confidence region.

This manuscript and its companion (van der Laan et al., 2004b) are concerned with the choice of null distribution for single-step and step-down multiple testing procedures that provide asymptotic control of Type I error rates defined as arbitrary parameters of the distribution of the number of Type I errors. Examples of such error rates include: the generalized family-wise error rate (gFWER), i.e., the probability of at least (k + 1) Type I errors, for some user-supplied integer k ≥ 0 (the family-wise error rate (FWER) is the gFWER in the special case k = 0), and the per-comparison error rate (PCER), i.e., the expected proportion of Type I errors among the M tests. We build on the earlier work of Pollard and van der Laan (2004) as follows:

(i) general collections of null hypotheses, corresponding to submodels for the data generating distribution, are considered; (ii) step-down procedures are provided for asymptotic control of the FWER; and (iii) adjusted p-values are derived for each of the multiple testing procedures.

A general characterization and explicit construction are proposed for a test statistics null distribution that provides asymptotic control of the Type I error rate under the true data generating distribution, without the need for conditions such as subset pivotality. This null distribution is used to obtain cut-offs for the test statistics (or their corresponding unadjusted p-values) and to derive the resulting adjusted p-values, for both single-step and step-down procedures. We stress the generality of this null distribution: it can be used for any data generating distribution (i.e., for distributions with arbitrary dependence structures among variables), null hypotheses defined broadly in terms of submodels for the data generating distribution, and a wide range of test statistics. In particular, it allows us to address testing problems that cannot be handled by existing approaches, such as tests concerning correlations and, in general, tests of association between covariates and outcomes (e.g., regression parameters for the Cox proportional hazards model).

1.2 Outline

The present article focuses on the choice of test statistics null distribution for single-step procedures controlling error rates defined as arbitrary parameters of the distribution of the number of Type I errors. In the next section, we describe a general statistical framework for multiple hypothesis testing. In particular, Section 3 outlines the main features of our approach to Type I error control and the choice of a null distribution. Section 4 revisits the notions of strong and weak control of a Type I error rate, and the related condition of subset pivotality. Differences between our approach to Type I error control and earlier approaches are highlighted. Section 5 proposes single-step common-quantile (Procedure 1) and common-cut-off (Procedure 2) multiple testing procedures that provide asymptotic control of the Type I error rate. A key feature of our approach is the test statistics null distribution (rather than data generating null distribution) used to derive cut-offs (i.e., rejection regions) for these test statistics and the resulting adjusted p-values. For general null hypotheses, corresponding to submodels for the data generating distribution, we identify an asymptotic domination condition for a null distribution under which single-step common-quantile and common-cut-off procedures asymptotically control the Type I error rate, for arbitrary data generating distributions, without the need for conditions such as subset pivotality (Theorem 1).

Inspired by this general characterization, we then propose as an explicit null distribution the asymptotic distribution of the vector of null value shifted and scaled test statistics (Theorem 2). In the special case of family-wise error rate control, our approach yields the single-step minP and single-step maxT procedures, based on minima of unadjusted p-values and maxima of test statistics, respectively (Section 5.3.3). In Section 6, procedures based on a consistent estimator of the null distribution are shown to also provide asymptotic control of the Type I error rate (Theorems 3 and 4, Corollary 1). Resampling procedures are supplied to conveniently obtain consistent estimators of the null distribution (bootstrap Procedures 3–5). Section 7 focuses on two particular examples of testing problems covered by our framework: the test of single-parameter null hypotheses (e.g., tests of means, correlations, regression parameters) using t-statistics and the test of multiple-parameter hypotheses using F-statistics.

The companion article (van der Laan et al., 2004b) considers step-down approaches for controlling the family-wise error rate and provides procedures based on maxima of test statistics (step-down maxT) and minima of unadjusted p-values (step-down minP). In the third article of the series, van der Laan et al. (2004a) propose simple augmentations of FWER-controlling procedures which control tail probabilities for the number of false positives and for the proportion of false positives among the rejected hypotheses, under general data generating distributions, with arbitrary dependence structures among variables. The proposed methods are evaluated by a simulation study and applied to genomic data in the fourth article of the series (Pollard et al., 2004). Software implementing the bootstrap single-step and step-down multiple testing procedures will be available in the R package multtest, released as part of the Bioconductor Project (www.bioconductor.org).

2 Multiple hypothesis testing framework

For the remainder of the article, we adopt the following definitions for inverses of cumulative distribution functions (c.d.f.) and survivor functions. Let F denote a (non-decreasing and right-continuous) c.d.f. and let F̄ denote the corresponding (non-increasing and right-continuous) survivor function, defined as F̄ ≡ 1 − F. For α ∈ [0, 1], define inverses as

F⁻¹(α) ≡ inf{x : F(x) ≥ α}   and   F̄⁻¹(α) ≡ inf{x : F̄(x) ≤ α}.   (1)

With these definitions, F̄⁻¹(α) = F⁻¹(1 − α). Note that we follow the convention that lower case letters denote realizations of random variables, e.g., x is a realization of the random variable X.

2.1 Model

Let X_1, ..., X_n be n independent and identically distributed (i.i.d.) random J-vectors, where X_i = (X_i(j) : j = 1, ..., J) ~ P ∈ ℳ, i = 1, ..., n, and the data generating distribution P is known to be an element of a particular statistical model ℳ (possibly non-parametric). Let P_n denote the corresponding empirical distribution, which places probability 1/n on each realization of X. For example, in cancer microarray studies, (X_i(j) : j = 1, ..., G) may denote a G-vector of gene expression measures and (X_i(j) : j = G + 1, ..., J) a (J − G)-vector of biological and clinical outcomes for patient i, i = 1, ..., n. In microarray data analysis and other current areas of application of multiple testing methods, the dimension J of the data vector X is usually much larger than the sample size n, i.e., one can have thousands of expression measures for less than one hundred patients.

2.2 Parameters

We consider arbitrary parameters defined as functions of the unknown data generating distribution P: Ψ(P) = ψ = (ψ(m) : m = 1, ..., M), where ψ(m) = Ψ(P)(m) ∈ ℝ. Parameters of interest include means, differences in means, and correlations, and can refer to linear models, generalized linear models, survival models (e.g., Cox proportional hazards model), time-series models, dose-response models, etc. For instance, in microarray data analysis, one may be concerned with testing problems regarding the following parameters.

Location parameters. E.g., means and medians for measuring differential expression for G genes. Let X ~ P denote a random G-vector of genome-wide expression measures for a particular population of cells. A parameter of interest is ψ = Ψ(P), where ψ(g) = E[X(g)] is the mean expression measure of gene g, g = 1, ..., G, in the cell population.

Let X_1 ~ P_1 and X_2 ~ P_2 denote random G-vectors of genome-wide expression measures for two populations of cells. A parameter of interest is ψ = Ψ(P), where ψ(g) = ψ_2(g) − ψ_1(g) = E[X_2(g)] − E[X_1(g)] is the difference in mean expression measures for gene g, g = 1, ..., G, in the two populations.

Scale parameters. Let X ~ P denote a random G-vector of genome-wide expression measures for a particular population of cells. Parameters of interest include covariances and correlations between expression measures for pairs of genes. The G × G covariance matrix of X ~ P is σ = Σ(P) ≡ Cov[X], with entries σ(g, g') = Cov[X(g), X(g')] denoting the pairwise covariances for the expression measures of genes g and g', g, g' = 1, ..., G. We adopt the following shorter notation for the diagonal elements of σ, i.e., the variances: σ²(g) ≡ σ(g, g). The G × G correlation matrix of X ~ P is σ* = Σ*(P) ≡ Cor[X], with entries σ*(g, g') = Cor[X(g), X(g')] = σ(g, g')/σ(g)σ(g') denoting the pairwise correlations for the expression measures of genes g and g', g, g' = 1, ..., G.

Regression parameters. E.g., slopes, main effects, and interactions, for measuring association of the expression measure X(g) of gene g, g = 1, ..., G, with outcomes/covariates (X(j) : j = G + 1, ..., J), X ~ P. Examples include:

ψ(g) = regression parameter for a univariate Cox proportional hazards model for survival time T = X(G + 1) given the expression measure X(g) of gene g;

ψ(g) = interaction effect for two drugs on the expression measure of gene g;

ψ(g) = linear combination a'β(g), e.g., a contrast in ANOVA.
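To make these parameter definitions concrete, the following small numerical sketch (not part of the original article; the data and variable names are illustrative) computes empirical counterparts of the location and scale parameters above from samples-by-genes expression matrices, using numpy.

```python
import numpy as np

rng = np.random.default_rng(0)
G, n1, n2 = 5, 12, 15                       # number of genes and sample sizes (illustrative)
X1 = rng.normal(size=(n1, G))               # population 1: n1 samples x G genes
X2 = rng.normal(loc=0.5, size=(n2, G))      # population 2: n2 samples x G genes

# Location parameters: gene-wise means and differences in means
psi1 = X1.mean(axis=0)                      # estimates of E[X1(g)]
psi2 = X2.mean(axis=0)                      # estimates of E[X2(g)]
diff = psi2 - psi1                          # psi(g) = E[X2(g)] - E[X1(g)]

# Scale parameters: G x G covariance and correlation matrices for population 1
sigma = np.cov(X1, rowvar=False)            # sigma(g, g') = Cov[X(g), X(g')]
sigma_star = np.corrcoef(X1, rowvar=False)  # sigma*(g, g') = Cor[X(g), X(g')]

print(np.round(diff, 2))
print(np.round(sigma_star, 2))
```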

2.3 Null hypotheses

General submodel null hypotheses. In order to cover a broad class of testing problems, we define M null hypotheses in terms of a collection of submodels, ℳ(m) ⊆ ℳ, m = 1, ..., M, for the data generating distribution P. The M null hypotheses are defined as H_0(m) ≡ I(P ∈ ℳ(m)) and the corresponding alternative hypotheses as H_1(m) ≡ I(P ∉ ℳ(m)). Here, I(·) is the indicator function, equaling 1 if the condition in parentheses is true and 0 otherwise. Thus, H_0(m) is true, i.e., H_0(m) = 1, if P ∈ ℳ(m), and H_0(m) is false otherwise. The complete null hypothesis, H_0^C ≡ ∏_{m=1}^M H_0(m) = ∏_{m=1}^M I(P ∈ ℳ(m)) = I(P ∈ ∩_{m=1}^M ℳ(m)), is true if and only if all M individual null hypotheses H_0(m) are true, i.e., if and only if the data generating distribution P belongs to the intersection ∩_{m=1}^M ℳ(m) of the corresponding M submodels. Let ℋ_0 = ℋ_0(P) ≡ {m : H_0(m) = 1} = {m : P ∈ ℳ(m)} be the set of h_0 ≡ |ℋ_0| true null hypotheses, where we note that ℋ_0 depends on the true data generating distribution P. Let ℋ_1 = ℋ_1(P) ≡ ℋ_0^c(P) = {m : H_1(m) = 1} = {m : P ∉ ℳ(m)} be the set of h_1 ≡ |ℋ_1| = M − h_0 false null hypotheses, i.e., true positives. The goal of a multiple testing procedure is to accurately estimate the set ℋ_0, and thus its complement ℋ_1, while controlling probabilistically the number of false positives at a user-supplied level α.

Single-parameter null hypotheses. A familiar special case, considered in Section 7, is that where each null hypothesis refers to a single parameter, ψ(m) = Ψ(P)(m) ∈ ℝ, m = 1, ..., M. One distinguishes between two types of testing problems for single parameters.

One-sided tests: H_0(m) = I(ψ(m) ≤ ψ_0(m)) vs. H_1(m) = I(ψ(m) > ψ_0(m)), m = 1, ..., M.

Two-sided tests: H_0(m) = I(ψ(m) = ψ_0(m)) vs. H_1(m) = I(ψ(m) ≠ ψ_0(m)), m = 1, ..., M.

The hypothesized null values, ψ_0(m), are frequently zero (e.g., no difference in mean expression measures for gene m between two populations of patients).

2.4 Test statistics

The decisions to reject or not the null hypotheses are based on an M-vector of test statistics, T_n = (T_n(m) : m = 1, ..., M), that are functions of the data, X_1, ..., X_n. Denote the (finite sample) joint distribution of the test statistics T_n by Q_n = Q_n(P). Unless specified otherwise, it is assumed that large values of the test statistic T_n(m) provide evidence against the corresponding null hypothesis H_0(m), that is, the null H_0(m) is rejected whenever T_n(m) exceeds a specified cut-off c(m).

For two-sided tests of single-parameter null hypotheses, one could take absolute values of the test statistics. As in Pollard and van der Laan (2004), for the test of single-parameter null hypotheses, H_0(m) = I(ψ(m) ≤ ψ_0(m)), m = 1, ..., M, we consider two main types of test statistics: difference statistics,

D_n(m) ≡ √n (Estimator − Null value) = √n (ψ_n(m) − ψ_0(m)),   (2)

and t-statistics (i.e., standardized differences),

T_n(m) ≡ (Estimator − Null value) / Standard Error = √n (ψ_n(m) − ψ_0(m)) / σ_n(m).   (3)

Here, Ψ̂(P_n) = ψ_n = (ψ_n(m) : m = 1, ..., M) denotes an M-vector of estimators for the parameter M-vector Ψ(P) = ψ = (ψ(m) : m = 1, ..., M) and (σ_n(m)/√n : m = 1, ..., M) denote the corresponding estimated standard errors. We consider asymptotically linear estimators ψ_n of the parameter ψ, with M-dimensional vector influence curve (IC), IC(X | P) = (IC(X | P)(m) : m = 1, ..., M), such that

ψ_n(m) − ψ(m) = (1/n) ∑_{i=1}^n IC(X_i | P)(m) + o_P(1/√n),   (4)

where E[IC(X | P)(m)] = 0, for each m = 1, ..., M. Let Σ(P) = σ = (σ(m, m') : m, m' = 1, ..., M) denote the covariance matrix of the M-dimensional vector influence curve, where σ(m, m') ≡ E[IC(X | P)(m) IC(X | P)(m')], and we may adopt the shorter notation σ²(m) ≡ σ(m, m) = E[IC²(X | P)(m)] for variances. Similarly, let Σ*(P) = σ* = (σ*(m, m') : m, m' = 1, ..., M) denote the M × M correlation matrix of the IC, where σ*(m, m') ≡ σ(m, m')/σ(m)σ(m'). We assume that the σ_n²(m) are consistent estimators of the IC variances σ²(m). The influence curve of a given estimator can be derived as the mean-zero centered functional derivative of the estimator (as a function of the empirical distribution P_n for the entire sample of size n), applied to the empirical distribution based on a sample of size one (Gill, 1989; Gill et al., 1995).

As illustrated in Section 7.1, this general representation for the test statistics covers standard one-sample and two-sample t-statistics for tests of means, but also test statistics for correlations and regression parameters in linear and non-linear models. F-statistics for multiple-parameter null hypotheses are discussed in Section 7.
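As an illustration of the t-statistics in Equation (3), the sketch below computes gene-wise two-sample t-statistics for differences in means. It is an assumption-laden example rather than the article's own code: the Welch-type standard error is used as one concrete choice of estimated standard error σ_n(m)/√n, and the data are simulated.

```python
import numpy as np

def two_sample_t(X1, X2, psi0=0.0):
    """Welch-type t-statistics, one per gene: (estimator - null value) / SE.

    X1, X2: samples-by-genes arrays for the two populations.
    For a difference in means, the influence-curve-based variance reduces to
    the familiar s1^2/n1 + s2^2/n2 standard error used here.
    """
    n1, n2 = X1.shape[0], X2.shape[0]
    psi_n = X2.mean(axis=0) - X1.mean(axis=0)              # difference in means
    se = np.sqrt(X1.var(axis=0, ddof=1) / n1 + X2.var(axis=0, ddof=1) / n2)
    return (psi_n - psi0) / se                             # Equation (3)

rng = np.random.default_rng(1)
X1 = rng.normal(size=(10, 6))
X2 = rng.normal(loc=[1, 0, 0, 0, 0.8, 0], size=(20, 6))    # two genes truly shifted
print(np.round(two_sample_t(X1, X2), 2))
```

The same template applies to other asymptotically linear estimators covered by Equation (4): only the estimator and its standard error change.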

2.5 Multiple testing procedures

A multiple testing procedure (MTP) can be represented by a random subset R_n of rejected hypotheses, that estimates the set ℋ_1 of true positives,

R_n = R(T_n, Q_0, α) ≡ {m : H_0(m) is rejected} ⊆ {1, ..., M}.   (5)

As indicated by the long notation R(T_n, Q_0, α), the set R_n is a function of: (i) the data, X_1, ..., X_n, through an M-vector of test statistics, T_n = (T_n(m) : m = 1, ..., M), where each T_n(m) corresponds to a null hypothesis H_0(m); (ii) a test statistics null distribution, Q_0, for computing cut-offs for each T_n(m) and the resulting adjusted p-values; and (iii) the nominal level α of the MTP, i.e., the desired upper bound for a suitably defined Type I error rate. As in the companion article (van der Laan et al., 2004b), and unless specified otherwise, we consider multiple testing procedures that reject null hypotheses for large values of the test statistics, i.e., that can be represented as

R_n = R(T_n, Q_0, α) = {m : T_n(m) > c(m)},   (6)

where c(m) = c(T_n, Q_0, α)(m), m = 1, ..., M, are possibly random cut-offs, or critical values, computed under the null distribution Q_0 for the test statistics.

2.6 Errors in multiple testing

In any testing situation, two types of errors can be committed: a false positive, or Type I error, is committed by rejecting a true null hypothesis, and a false negative, or Type II error, is committed when the test procedure fails to reject a false null hypothesis. The situation can be summarized by Table 1, below, where the number of Type I errors is V_n ≡ |R_n ∩ ℋ_0| and the number of Type II errors is U_n ≡ |R_n^c ∩ ℋ_1|. Note that both U_n and V_n depend on the unknown data generating distribution P through the unknown set of true null hypotheses ℋ_0 = ℋ_0(P). The numbers h_0 = |ℋ_0| and h_1 = |ℋ_1| = M − h_0 of true and false null hypotheses are unknown parameters, the number of rejected hypotheses R_n ≡ |R_n| is an observable random variable, and the entries in the body of the table, U_n, h_1 − U_n, V_n, and h_0 − V_n, are unobservable random variables (depending on P, through ℋ_0(P)).

For two-sided tests concerning single-parameter null hypotheses, one is often interested in determining the direction of rejection for the null hypotheses. For instance, in microarray experiments, one would like to know whether genes are over- or under-expressed in, say, treated cells compared to untreated cells.

In this setting, one can commit a Type III error by correctly rejecting a null hypothesis H_0(m) = I(ψ(m) = ψ_0(m)), but incorrectly concluding that ψ(m) < ψ_0(m) when in truth ψ(m) > ψ_0(m) (or vice versa). Control of Type III errors, as well as Type I errors, brings in additional complexities (Finner, 1999; Shaffer, 2002) and is not considered here.

Table 1: Type I and Type II errors in multiple hypothesis testing.

                                 not rejected                         rejected
  Null hypotheses  true    |R_n^c ∩ ℋ_0|                    V_n = |R_n ∩ ℋ_0|  (Type I)     h_0 = |ℋ_0|
                   false   U_n = |R_n^c ∩ ℋ_1|  (Type II)   |R_n ∩ ℋ_1|                     h_1 = |ℋ_1|
                           M − R_n                          R_n                             M

2.7 Type I error rates

Ideally, one would like to simultaneously minimize the number V_n of Type I errors and the number U_n of Type II errors. A standard approach in the single hypothesis setting is to prespecify an acceptable level α for the Type I error rate and seek tests which minimize the Type II error rate, i.e., maximize power, within the class of tests with Type I error rate at most α. In the multiple hypothesis case, a variety of generalizations are possible for the definition of Type I error rate (and power). Here, we focus on error rates that are defined as functions of the distribution of the number of Type I errors, that is, can be represented as parameters θ(F_{V_n}), where F_{V_n} is the discrete cumulative distribution function (c.d.f.) on {0, ..., M} for the number of Type I errors, V_n. For convenience, we work with normalized Type I error rates, so that θ(F_{V_n}) ∈ [0, 1]. Such a general representation covers the following commonly-used Type I error rates.

The per-comparison error rate (PCER) is the expected proportion of Type I errors among the M tests,

PCER ≡ E[V_n]/M = ∫ v dF_{V_n}(v) / M.   (7)

The per-family error rate (PFER) is the expected number of Type I errors,

PFER ≡ E[V_n] = ∫ v dF_{V_n}(v).   (8)

The median-based per-family error rate (mPFER) is the median number of Type I errors,

mPFER ≡ Median(F_{V_n}) = F_{V_n}⁻¹(1/2).   (9)

The family-wise error rate (FWER) is the probability of at least one Type I error,

FWER ≡ Pr(V_n ≥ 1) = 1 − F_{V_n}(0).   (10)

The generalized family-wise error rate (gFWER), for a user-supplied integer k, k = 0, ..., h_0 − 1, is the probability of at least (k + 1) Type I errors. That is,

gFWER(k) ≡ Pr(V_n ≥ k + 1) = 1 − F_{V_n}(k).   (11)

When k = 0, the gFWER is the usual family-wise error rate, FWER.

In some testing problems, we also consider more general Type I error rates, defined as parameters θ(F_{V_n,R_n}) of the joint distribution of the numbers of Type I errors V_n and rejected hypotheses R_n (van der Laan et al., 2004a). Error rates that depend also on the distribution of R_n (i.e., on the distribution of the test statistics for the true positives, via h_1 − U_n = R_n − V_n) include the false discovery rate and tail probabilities for the proportion of false positives among the rejected hypotheses.

The tail probability for the proportion of false positives among the rejected hypotheses (TPPFP), for a user-supplied q ∈ (0, 1), is defined as

TPPFP(q) ≡ Pr(V_n/R_n > q) = 1 − F_{V_n/R_n}(q),   (12)

where F_{V_n/R_n} is the c.d.f. for the proportion V_n/R_n of false positives among the rejected hypotheses, with the convention that V_n/R_n ≡ 0 if R_n = 0.

The false discovery rate (FDR) of Benjamini and Hochberg (1995) is the expected value of the proportion of false positives among the rejected hypotheses,

FDR ≡ E[V_n/R_n] = ∫ q dF_{V_n/R_n}(q),   (13)

again with the convention that V_n/R_n ≡ 0 if R_n = 0.

Note that while the gFWER is a parameter of only the marginal distribution for the number of Type I errors V_n (tail probability, or survivor function, for V_n), the TPPFP is a parameter of the joint distribution of (V_n, R_n) (tail probability, or survivor function, for V_n/R_n). van der Laan et al. (2004a) provide simple augmentations of FWER-controlling procedures that control the gFWER and TPPFP, under general data generating distributions P with arbitrary dependence structures among variables. Other approaches to TPPFP control can be found in Genovese and Wasserman (2003) and Korn et al. (2004).

Assumptions for the mapping θ that defines the Type I error rate. Given two cumulative distribution functions F_1 and F_2 on {0, ..., M}, define the distance measure d by d(F_1, F_2) ≡ max_{x ∈ {0, ..., M}} |F_1(x) − F_2(x)|. We make the following assumptions for the mapping θ : F ↦ θ(F), defining a Type I error rate as a parameter corresponding to a c.d.f. F on {0, ..., M}.

AMI. Monotonicity Assumption. Given two c.d.f.'s F_1 and F_2 on {0, ..., M},

F_1 ≤ F_2  ⇒  θ(F_1) ≥ θ(F_2).   (AMI)

ACI. Continuity at {F_n} Assumption. Given a sequence {F_n} of c.d.f.'s on {0, ..., M}, assume that the mapping θ(·) is continuous at {F_n}. That is, for any sequence {G_n} of c.d.f.'s on {0, ..., M},

lim_n d(F_n, G_n) = 0  ⇒  lim_n (θ(G_n) − θ(F_n)) = 0.   (ACI)

In most situations, we only require continuity at a fixed c.d.f. F, that is, we consider the special case F_n ≡ F.
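All of the error rates above are simple functionals of the (joint) distribution of the number of Type I errors V_n and the number of rejections R_n. The following sketch (illustrative only; the counts supplied are hypothetical) computes empirical versions of these quantities from Monte Carlo draws of (V_n, R_n).

```python
import numpy as np

def error_rates(V, R, M, k=2, q=0.1):
    """Empirical versions of the Type I error rates of Section 2.7, given
    Monte Carlo draws of (V_n, R_n) = (number of Type I errors, number of
    rejections). All inputs and names are illustrative."""
    V, R = np.asarray(V, dtype=float), np.asarray(R, dtype=float)
    prop = np.where(R > 0, V / np.maximum(R, 1), 0.0)   # V_n/R_n, set to 0 if R_n = 0
    return {
        "PCER":        V.mean() / M,                    # E[V_n]/M
        "PFER":        V.mean(),                        # E[V_n]
        "mPFER":       np.median(V),                    # median of F_{V_n}
        "FWER":        np.mean(V >= 1),                 # Pr(V_n >= 1)
        f"gFWER({k})": np.mean(V >= k + 1),             # Pr(V_n >= k + 1)
        f"TPPFP({q})": np.mean(prop > q),               # Pr(V_n/R_n > q)
        "FDR":         prop.mean(),                     # E[V_n/R_n]
    }

# Hypothetical draws of (V_n, R_n) over 5 simulated data sets, M = 100 tests
print(error_rates(V=[0, 1, 3, 0, 2], R=[10, 12, 15, 8, 11], M=100))
```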

2.8 Power

Within the class of multiple testing procedures that control a given Type I error rate at an acceptable level α, one seeks procedures that maximize power, that is, minimize a suitably defined Type II error rate. As with Type I error rates, the concept of power can be generalized in various ways when moving from single to multiple hypothesis testing. Three common definitions of power are: (i) the probability of rejecting at least one false null hypothesis, Pr(|R_n ∩ ℋ_1| ≥ 1) = Pr(h_1 − U_n ≥ 1) = Pr(U_n ≤ h_1 − 1), where we recall that U_n is the number of Type II errors; (ii) the expected proportion of rejected false null hypotheses, E[|R_n ∩ ℋ_1|]/h_1, or average power; and (iii) the probability of rejecting all false null hypotheses, Pr(|R_n ∩ ℋ_1| = h_1) = Pr(U_n = 0) (Shaffer, 1995). When the family of tests consists of pairwise mean comparisons, these quantities have been called any-pair power, per-pair power, and all-pairs power (Ramsey, 1978). In a spirit analogous to the FDR, one could also define power as E[(h_1 − U_n)/R_n | R_n > 0] Pr(R_n > 0) = E[(R_n − V_n)/R_n | R_n > 0] Pr(R_n > 0) = Pr(R_n > 0) − FDR; when h_1 = M, this is the any-pair power Pr(U_n ≤ h_1 − 1).
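A brief sketch of the three power definitions, estimated from hypothetical simulated rejection sets (the sets and the set of true positives are made up for illustration):

```python
import numpy as np

def power_summaries(rejected_sets, H1):
    """Any-pair, average (per-pair), and all-pairs power (Section 2.8),
    estimated from simulated rejection sets. Names are illustrative."""
    H1 = set(H1)
    h1 = len(H1)
    hits = np.array([len(H1 & set(R)) for R in rejected_sets])   # |R_n intersect H_1|
    return {
        "any-pair":  np.mean(hits >= 1),     # Pr(|R_n ∩ H_1| >= 1)
        "average":   hits.mean() / h1,       # E[|R_n ∩ H_1|] / h_1
        "all-pairs": np.mean(hits == h1),    # Pr(|R_n ∩ H_1| = h_1)
    }

# Hypothetical rejection sets from 4 simulated data sets; true positives are {0, 1, 2}
sims = [{0, 1}, {0, 1, 2, 7}, {1}, set()]
print(power_summaries(sims, H1={0, 1, 2}))
```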

2.9 Adjusted p-values

Unadjusted p-values. Consider the test of individual null hypotheses H_0(m) at single test level α, i.e., such that the chance of a Type I error for each H_0(m) is at most α (note that in the M = 1 case, FWER, PCER, and PFER coincide). Given a test statistic T_n(m), with marginal null distribution Q_{0,m}, the null hypothesis H_0(m) is rejected at single test level α if T_n(m) > c(Q_{0,m}, α), where the cut-off c_m(α) = c(Q_{0,m}, α) is defined in terms of the marginal survivor function, Q̄_{0,m}(z) ≡ 1 − Q_{0,m}(z) = Pr_{Q_{0,m}}(T_n(m) > z), as c_m(α) ≡ Q̄_{0,m}⁻¹(α) = inf{z : Q̄_{0,m}(z) ≤ α}. For the test of a single null hypothesis H_0(m), the unadjusted p-value (a.k.a. marginal or raw p-value), P_{0n}(m) = P(T_n(m), Q_{0,m}), is based only on the test statistic T_n(m) for that hypothesis and is defined as

P_{0n}(m) ≡ inf {α ∈ [0, 1] : Reject H_0(m) at level α, given T_n(m)}   (14)
          = inf {α ∈ [0, 1] : c_m(α) < T_n(m)},   m = 1, ..., M.

That is, P_{0n}(m) is the nominal level of the single hypothesis testing procedure at which H_0(m) would just be rejected, given T_n(m). For continuous marginal null distributions Q_{0,m}, the unadjusted p-values are given by P_{0n}(m) = c_m⁻¹(T_n(m)) = Q̄_{0,m}(T_n(m)), where c_m⁻¹ is the inverse of the monotone decreasing function α ↦ c_m(α) = c(Q_{0,m}, α).

Adjusted p-values. The definition of p-value can be extended to multiple testing problems as follows. Given any multiple testing procedure R_n = R(T_n, Q_0, α) = {m : T_n(m) > c(m)}, based on an M-vector of cut-offs, c = c(T_n, Q_0, α), the adjusted p-values, P̃_{0n} = P̃(T_n, Q_0) = P̃(R(T_n, Q_0, α) : α ∈ [0, 1]), are defined as

P̃_{0n}(m) ≡ inf {α ∈ [0, 1] : Reject H_0(m) at MTP level α, given T_n}   (15)
           = inf {α ∈ [0, 1] : m ∈ R(T_n, Q_0, α)}
           = inf {α ∈ [0, 1] : c(T_n, Q_0, α)(m) < T_n(m)},   m = 1, ..., M.

That is, the adjusted p-value P̃_{0n}(m) for null hypothesis H_0(m) is the nominal level of the entire MTP (e.g., gFWER or FDR) at which H_0(m) would just be rejected, given T_n. For continuous null distributions Q_0, P̃_{0n}(m) = c_m⁻¹(T_n(m)), where c_m⁻¹ is the inverse of the monotone decreasing function α ↦ c_m(α) = c(T_n, Q_0, α)(m). The particular mapping c_m, defining the cut-offs c(T_n, Q_0, α)(m), will depend on the choice of MTP (e.g., single-step vs. stepwise, common cut-offs vs. common-quantile cut-offs). For instance, the adjusted p-values for the classical Bonferroni procedure for FWER control are P̃_{0n}(m) = min(M P_{0n}(m), 1). Adjusted p-values for the general single-step common-quantile and common-cut-off Procedures 1 and 2 are derived in Section 5.3. Dudoit et al. (2003) provide adjusted p-values for commonly-used FWER- and FDR-controlling procedures.

We now have two representations for an MTP: in terms of cut-offs, or critical values, c = c(T_n, Q_0, α), for the test statistics T_n, R(T_n, Q_0, α) = {m : T_n(m) > c(m)}, and in terms of adjusted p-values, P̃_{0n} = P̃(T_n, Q_0),

R(T_n, Q_0, α) = {m : P̃_{0n}(m) ≤ α}.   (16)

That is, hypothesis H_0(m) is rejected at nominal Type I error rate α if P̃_{0n}(m) ≤ α.

As in the single hypothesis case, an advantage of reporting adjusted p-values, as opposed to only rejection or not of the hypotheses, is that the level of the test does not need to be determined in advance, that is, results of the multiple testing procedure are provided for all α. Adjusted p-values are convenient and flexible summaries of the strength of the evidence against each null hypothesis, in terms of the Type I error rate for the entire MTP. Plots of sorted adjusted p-values allow scientists to examine sets of rejected hypotheses associated with various false positive rates (e.g., gFWER, FDR, or PCER). They do not require researchers to preselect a particular definition of Type I error rate or α-level, but rather provide them with tools to decide on an appropriate combination of number of rejections and tolerable false positive rate for a particular experiment and available resources.

2.10 Stepwise multiple testing procedures

One usually distinguishes between two main classes of multiple testing procedures, single-step and stepwise procedures, depending on whether the cut-off vector c = (c(m) : m = 1, ..., M) for the test statistics T_n is constant or random (given the null distribution Q_0), i.e., is independent or not of these test statistics. In single-step procedures, each hypothesis H_0(m) is evaluated using a critical value c(m) = c(Q_0, α)(m) that is independent of the results of the tests of other hypotheses and is not a function of the data X_1, ..., X_n (unless these data are used to estimate the null distribution Q_0, as in Section 6). Improvement in power, while preserving (asymptotic) Type I error rate control, may be achieved by stepwise procedures, in which rejection of a particular hypothesis depends on the outcome of the tests of other hypotheses. That is, the cut-offs c(m) = c(T_n, Q_0, α)(m) are allowed to depend on the data, X_1, ..., X_n, via the test statistics T_n. In step-down procedures, the hypotheses corresponding to the most significant test statistics (i.e., largest absolute test statistics or smallest unadjusted p-values) are considered successively, with further tests depending on the outcome of earlier ones. As soon as one fails to reject a null hypothesis, no further hypotheses are rejected. In contrast, for step-up procedures, the hypotheses corresponding to the least significant test statistics are considered successively, again with further tests depending on the outcome of earlier ones. As soon as one hypothesis is rejected, all remaining more significant hypotheses are rejected.

Step-down and step-up analogues of the classical Bonferroni procedure for FWER control are the Holm (1979) and Hochberg (1988) procedures, respectively. In these stepwise procedures, based solely on the marginal distributions of the test statistics, the unadjusted p-value for the hypothesis with the hth most significant test statistic is multiplied by (M − h + 1) ≤ M rather than M. Let O_n(m) denote the indices for the ordered unadjusted p-values P_{0n}(m), so that P_{0n}(O_n(1)) ≤ ... ≤ P_{0n}(O_n(M)). The adjusted p-values for hypothesis H_0(O_n(m)) are

P̃_{0n}(O_n(m)) = min( M P_{0n}(O_n(m)), 1 )                                      [Bonferroni]   (17)
P̃_{0n}(O_n(m)) = max_{h = 1, ..., m} { min( (M − h + 1) P_{0n}(O_n(h)), 1 ) }     [Holm]
P̃_{0n}(O_n(m)) = min_{h = m, ..., M} { min( (M − h + 1) P_{0n}(O_n(h)), 1 ) }     [Hochberg].

Note that the Hochberg (1988) step-up procedure relies on Simes' inequality (Simes, 1986) and therefore only guarantees FWER control for certain forms of dependence structures among the test statistics.
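For reference, the marginal adjusted p-values of Equation (17) can be computed as follows. This sketch operates on a given vector of unadjusted p-values and is not one of the joint-null-distribution-based procedures developed in this article.

```python
import numpy as np

def marginal_adjusted_p(p):
    """Bonferroni, Holm, and Hochberg adjusted p-values as in Equation (17).
    A sketch based only on marginal unadjusted p-values, not on the article's
    joint test statistics null distribution."""
    p = np.asarray(p, dtype=float)
    M = p.size
    o = np.argsort(p)                                   # indices O_n(1), ..., O_n(M)
    vals = np.minimum((M - np.arange(M)) * p[o], 1.0)   # min((M - h + 1) P_0n(O_n(h)), 1)

    bonferroni = np.minimum(M * p, 1.0)
    holm = np.empty(M)
    holm[o] = np.maximum.accumulate(vals)               # running max over h = 1, ..., m
    hochberg = np.empty(M)
    hochberg[o] = np.minimum.accumulate(vals[::-1])[::-1]   # running min over h = m, ..., M
    return bonferroni, holm, hochberg

# Illustrative unadjusted p-values for M = 6 hypotheses
p = np.array([0.001, 0.012, 0.019, 0.04, 0.2, 0.9])
for name, adj in zip(("Bonferroni", "Holm", "Hochberg"), marginal_adjusted_p(p)):
    print(name, np.round(adj, 3))
```

In general, Hochberg's adjusted p-values are never larger than Holm's, reflecting the extra Simes-type assumption on which the step-up procedure relies.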

The present article focuses on single-step procedures, while the companion article considers step-down procedures (van der Laan et al., 2004b). Commonly-used single-step and stepwise MTPs for control of the FWER and FDR are reviewed in Dudoit et al. (2003).

3 Type I error rate control and choice of null distribution

3.1 Type I error rate control

A multiple testing procedure R_n = R(T_n, Q_0, α) is said to provide finite sample control of the Type I error rate θ(F_{V_n}) at level α ∈ (0, 1), if θ(F_{V_n}) ≤ α, where V_n = |R_n ∩ ℋ_0| denotes the number of Type I errors. Similarly, the MTP provides asymptotic control of the Type I error rate at level α, if lim sup_n θ(F_{V_n}) ≤ α. Note that the distribution of the random variable V_n depends on the true distribution Q_n = Q_n(P) for the test statistics T_n, i.e., on their distribution under the true underlying data generating distribution P. In practice, however, the distribution Q_n(P) is unknown and estimated by a null distribution Q_0, in order to derive cut-offs for each test statistic T_n(m) (and the resulting adjusted p-values). The choice of a suitable null distribution Q_0 is crucial, in order to ensure that (finite sample or asymptotic) control of the Type I error rate under this assumed distribution does indeed provide the required control under the true distribution Q_n(P).

3.2 Sketch of proposed approach to Type I error rate control

The following discussion highlights important considerations in choosing a null distribution Q_0 and motivates our general approach to the problem of Type I error rate control. Recall that the distribution F_{V_n} for the number of Type I errors, V_n = |R_n ∩ ℋ_0| = |R(T_n, Q_0, α) ∩ ℋ_0(P)|, depends on the following: the true distribution Q_n = Q_n(P) of the test statistics T_n, the null distribution Q_0 used to derive the M-vector of cut-offs c(T_n, Q_0, α) for these test statistics, and the set ℋ_0(P) of true null hypotheses. Type I error control is therefore a statement about the true unknown data generating distribution P, via Q_n(P) and ℋ_0(P). When needed, we may use the following notation for the numbers of rejected hypotheses and Type I errors:

R_n ≡ R(Q_0 | Q_n) ≡ |R(T_n, Q_0, α)|,            T_n ~ Q_n,   (18)
R_0 ≡ R(Q_0 | Q_0) ≡ |R(T_n, Q_0, α)|,            T_n ~ Q_0,
V_n ≡ V(Q_0 | Q_n) ≡ |R(T_n, Q_0, α) ∩ ℋ_0(P)|,   T_n ~ Q_n,
V_0 ≡ V(Q_0 | Q_0) ≡ |R(T_n, Q_0, α) ∩ ℋ_0(P)|,   T_n ~ Q_0,

where Q_n = Q_n(P) refers to the (finite sample) joint distribution of the random M-vector of test statistics T_n, under the true data generating distribution P, and Q_0 refers to an assumed null distribution for these test statistics. This notation acknowledges that the distributions of the numbers of rejected hypotheses and Type I errors are defined in terms of a null distribution Q_0 (for deriving cut-offs) and a distribution (Q_0 or Q_n) for the test statistics T_n (here, the subset ℋ_0 of true null hypotheses is kept fixed at the truth ℋ_0(P) and the nominal level α of the MTP is also held fixed).

Control of Type I error rates of the form θ(F_{V_n}) can be achieved by the following three-step approach, which provides some intuition behind single-step Procedures 1 and 2, and the general characterization (Theorem 1) and explicit construction (Theorem 2) of a test statistics null distribution Q_0.

Three-step road map to Type I error rate control.

1. Null domination conditions for the Type I error rate. For proper control of the Type I error rate θ(F_{V_n}), for T_n ~ Q_n(P), select a null distribution Q_0 such that

θ(F_{V_n}) ≤ θ(F_{V_0})             [finite sample control]   (19)
lim sup_n θ(F_{V_n}) ≤ θ(F_{V_0})   [asymptotic control].

2. Note that the number of Type I errors is never greater than the total number of rejected hypotheses, i.e., V_0 ≤ R_0, so that F_{V_0} ≥ F_{R_0}, and hence, by monotonicity Assumption AMI for the mapping θ(·), θ(F_{V_0}) ≤ θ(F_{R_0}).

3. Control the parameter θ(F_{R_0}), corresponding to the observed number of rejected hypotheses R_0, under the null distribution Q_0, i.e., assuming T_n ~ Q_0. That is, θ(F_{R_0}) ≤ α (a numerical sketch of this step for the gFWER is given at the end of this subsection).

Combining Steps 1–3 provides the desired control of the Type I error rate θ(F_{V_n}) at level α ∈ (0, 1), that is,

θ(F_{V_n}) ≤ θ(F_{V_0}) ≤ θ(F_{R_0}) ≤ α             [finite sample control]
lim sup_n θ(F_{V_n}) ≤ θ(F_{V_0}) ≤ θ(F_{R_0}) ≤ α   [asymptotic control].

Note that such an approach is conservative in two ways: from controlling θ(F_{R_0}) ≥ θ(F_{V_0}) and from the null domination in Step 1. The latter step is usually the most involved and requires a judicious choice for the null distribution Q_0. This article focuses on procedures that provide asymptotic control of the Type I error rate (i.e., such that lim sup_n θ(F_{V_n}) ≤ α) and provides a general characterization (Theorem 1) and an explicit construction (Theorem 2) for a null distribution Q_0 that satisfies the asymptotic null domination condition in Step 1.
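To illustrate Step 3 of the road map in the case of the gFWER, the sketch below derives a common cut-off from draws of a null distribution Q_0, represented here by a B × M array of null test statistics (independent standard normal draws are used as a stand-in; correlated draws, e.g., from the bootstrap of Section 6, could be plugged in instead). It conveys the idea behind the common-cut-off construction rather than reproducing Procedure 2 verbatim.

```python
import numpy as np

def gfwer_common_cutoff(Z0, k, alpha):
    """Step 3 of the road map for gFWER(k) with a common cut-off: choose c0 so
    that, under the null draws Z0 (a B x M array standing in for Q_0), the
    probability of k + 1 or more rejections is at most alpha. Illustration only."""
    kth_largest = np.sort(Z0, axis=1)[:, -(k + 1)]   # (k+1)-st largest statistic per draw
    return np.quantile(kth_largest, 1 - alpha)       # Pr_{Q_0}(R_0 >= k + 1) <= alpha

rng = np.random.default_rng(2)
B, M, k, alpha = 10000, 50, 2, 0.05
Z0 = rng.standard_normal((B, M))                     # stand-in draws from a null distribution
c0 = gfwer_common_cutoff(Z0, k, alpha)

# Sanity check of the Type I error statement under Q_0: the fraction of null
# draws with at least k + 1 statistics above c0 should be close to alpha or less
print(c0, np.mean((Z0 > c0).sum(axis=1) >= k + 1))
```

With k = 0, this reduces to the familiar single-step maxT-type construction, in which the common cut-off is the (1 − α) quantile of the maximum null statistic.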

3.3 Choice of null distribution and null domination conditions

The above θ-specific null domination conditions in Step 1 of the road map hold under the following two types of general null domination conditions: (i) null domination for the distribution F_{V_n} of the number of Type I errors V_n, and (ii) null domination for the joint distribution Q_{n,ℋ_0} of the ℋ_0-specific subvector (T_n(m) : m ∈ ℋ_0) of test statistics. Thus, under the latter two conditions, the road map provides the required (finite sample or asymptotic) control of any Type I error rate of the form θ(F_{V_n}).

Null domination conditions for the number of Type I errors. For each x ∈ {0, ..., M},

F_{V_n}(x) ≥ F_{V_0}(x)             [finite sample control]   (20)
lim inf_n F_{V_n}(x) ≥ F_{V_0}(x)   [asymptotic control],

that is, the number of Type I errors, V_0, under the null distribution Q_0, is stochastically greater than the number of Type I errors, V_n, under the true distribution Q_n = Q_n(P) for the test statistics T_n. In particular, (20) holds under the following null domination property for the joint distribution of the test statistics (T_n(m) : m ∈ ℋ_0).

Null domination conditions for the ℋ_0-specific test statistics (T_n(m) : m ∈ ℋ_0). The null distribution Q_{0,ℋ_0} of the ℋ_0-specific subvector (T_n(m) : m ∈ ℋ_0), for T_n ~ Q_0, equals or dominates the corresponding true distribution Q_{n,ℋ_0} = Q_{n,ℋ_0}(P), for T_n ~ Q_n = Q_n(P). That is,

Q_{n,ℋ_0} ≥ Q_{0,ℋ_0}             [finite sample control]   (21)
lim inf_n Q_{n,ℋ_0} ≥ Q_{0,ℋ_0}   [asymptotic control].

In other words, in the finite sample case, one has Pr_{Q_n}(T_n(m) ≤ t(m), m ∈ ℋ_0) ≥ Pr_{Q_0}(T_n(m) ≤ t(m), m ∈ ℋ_0), for all t(m) ∈ ℝ, m ∈ ℋ_0. For finite sample control, the null domination condition in Step 1 then follows by monotonicity Assumption AMI for the mapping θ(·). For asymptotic control, one relies also on continuity Assumption ACI. Note that null domination is only a statement about the distribution of the test statistics (T_n(m) : m ∈ ℋ_0) corresponding to the true null hypotheses.

More specific (i.e., less stringent) forms of null domination can be derived for given definitions of the Type I error rate θ(F_{V_n}) (cf. FWER control in van der Laan et al. (2004b)).

One of the main contributions of this and the companion article (van der Laan et al., 2004b) is the general characterization (Theorem 1) and explicit construction (Theorem 2) of a proper null distribution Q_0 for the test statistics T_n. Procedures based on such a distribution provide asymptotic control of arbitrary Type I error rates θ(F_{V_n}), for testing null hypotheses H_0(m) = I(P ∈ ℳ(m)), defined in terms of submodels ℳ(m) ⊆ ℳ, for general data generating distributions P (i.e., distributions P with general dependence structures among variables). Our proposed test statistics null distribution Q_0 can be used in testing problems which cannot be handled by traditional approaches based on a data generating null distribution P_0 (e.g., tests of parameters of survival models, tests of pairwise correlations). The construction of the null distribution Q_0 in Theorem 2 is inspired by null domination condition (21) for the ℋ_0-specific subvector of test statistics (T_n(m) : m ∈ ℋ_0): Q_0 is defined as the limit distribution of a sequence of random variables that are stochastically greater than the test statistics for the true null hypotheses. The resulting null distribution therefore satisfies asymptotic null domination condition (20), for the number of Type I errors, and also the θ-specific asymptotic null domination condition (19) in Step 1 of the road map, for any Type I error rate mapping θ(·).

3.4 Contrast with other approaches

As detailed in Section 4, the following two main points distinguish our approach, and that of Pollard and van der Laan (2004), from existing approaches to Type I error rate control (e.g., in Hochberg and Tamhane (1987) and Westfall and Young (1993)).

Firstly, we are only concerned with control of the Type I error rate under the true data generating distribution P, i.e., under the joint distribution Q_n = Q_n(P) of the test statistics T_n implied by P. The notions of weak and strong control are therefore irrelevant in our context. In particular, our notion of null domination differs from that of subset pivotality (see Westfall and Young (1993)) in the following senses: (i) null domination is only concerned with the true data generating distribution P, i.e., it only considers the subset ℋ_0(P) of true null hypotheses and not all possible 2^M subsets J_0 ⊆ {1, ..., M} of null hypotheses, and (ii) null domination does not require equality of the joint distributions Q_{0,ℋ_0} and Q_{n,ℋ_0}(P) for the ℋ_0-specific test statistics, but the weaker domination.

Secondly, we propose a null distribution for the test statistics (T_n ~ Q_0) rather than a data generating null distribution (X ~ P_0). A common choice of data generating null distribution P_0 is one that satisfies the complete null hypothesis, H_0^C = ∏_{m=1}^M H_0(m) = ∏_{m=1}^M I(P ∈ ℳ(m)) = I(P ∈ ∩_{m=1}^M ℳ(m)), that all M null hypotheses are true, i.e., P_0 ∈ ∩_{m=1}^M ℳ(m). The data generating null distribution P_0 then implies a null distribution Q_n(P_0) for the test statistics. As discussed in Pollard and van der Laan (2004), procedures based on Q_n(P_0) do not necessarily provide proper (asymptotic) Type I error control under the true distribution P, as the assumed null distribution Q_n(P_0) and the true distribution Q_n(P) for the test statistics T_n may converge to distributions with different covariance structures and, as a result, may violate the required null domination condition for the Type I error rates (Equation (19) in Step 1 of the road map). For instance, for test statistics with a Gaussian asymptotic distribution (Section 7.1), the ℋ_0-specific correlation matrix under the true distribution P may be different from the corresponding correlation matrix under the assumed complete null distribution P_0, that is, Σ*_{ℋ_0}(P) ≠ Σ*_{ℋ_0}(P_0). In the two-sample problem, for the commonly-used permutation null distribution P_0, Pollard and van der Laan (2004) show that Σ*_{ℋ_0}(P) = Σ*_{ℋ_0}(P_0) only if (i) the two populations have the same covariance matrices or (ii) the sample sizes are equal. Consequently, permutation-based approaches (e.g., Korn et al. (2004), Troendle (1995, 1996), and Westfall and Young (1993)) are only valid under certain assumptions for the data generating distribution. In fact, in most testing problems, there does not even exist a data generating null distribution P_0 ∈ ∩_{m=1}^M ℳ(m) that correctly specifies a joint distribution for the test statistics, i.e., such that the required null domination condition is satisfied. Thus, unlike current procedures which can only be applied to a limited set of multiple testing problems, our proposed test statistics null distribution leads to single-step and step-down procedures that provide the desired asymptotic Type I error rate control for general data generating distributions, null hypotheses, and test statistics.
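To convey the flavor of the proposed approach in a simple two-sample setting, the following sketch bootstraps the test statistics, applies a null transformation (here, centering each statistic at its bootstrap mean, with an optional cap on its scale), and uses the resulting estimate of the test statistics null distribution to compute single-step maxT-type FWER-adjusted p-values. It is a simplified stand-in for the article's bootstrap Procedures 3–5: the exact null-value shift and scale, the common-quantile variant, and other refinements are omitted.

```python
import numpy as np

def bootstrap_null_maxT(X1, X2, B=2000, scale_cap=True, seed=0):
    """Sketch of a bootstrap test statistics null distribution and single-step
    maxT-style FWER-adjusted p-values for two-sample tests of means. A simplified
    illustration in the spirit of the article's approach (centering, and optionally
    scaling, bootstrap test statistics), not a verbatim implementation of its
    Procedures 3-5."""
    rng = np.random.default_rng(seed)
    n1, n2 = X1.shape[0], X2.shape[0]

    def tstat(A, C):
        se = np.sqrt(A.var(axis=0, ddof=1) / A.shape[0] + C.var(axis=0, ddof=1) / C.shape[0])
        return (C.mean(axis=0) - A.mean(axis=0)) / se

    T_obs = tstat(X1, X2)

    # Bootstrap the test statistics, resampling within each sample
    Tb = np.array([
        tstat(X1[rng.integers(0, n1, n1)], X2[rng.integers(0, n2, n2)])
        for _ in range(B)
    ])

    # Null transformation: shift each statistic by its bootstrap mean (the null
    # value for a t-statistic is 0) and, optionally, cap its scale at 1
    Z = Tb - Tb.mean(axis=0)
    if scale_cap:
        Z = Z * np.minimum(1.0, 1.0 / Z.std(axis=0, ddof=1))

    # Single-step maxT-type adjusted p-values: compare each observed statistic
    # to the null distribution of the maximum over all M statistics
    max_null = Z.max(axis=1)
    adj_p = np.array([(max_null >= t).mean() for t in T_obs])
    return T_obs, adj_p

rng = np.random.default_rng(3)
X1 = rng.normal(size=(15, 8))
X2 = rng.normal(loc=[2, 0, 0, 0, 0, 0, 1.5, 0], size=(25, 8))
T_obs, adj_p = bootstrap_null_maxT(X1, X2)
print(np.round(T_obs, 2))
print(np.round(adj_p, 3))    # hypotheses with adj_p <= alpha would be rejected
```

Rejecting the hypotheses with adjusted p-value at most α corresponds, up to the discreteness of the bootstrap distribution, to the cut-off representation in Equation (6) with a common cut-off equal to the (1 − α) quantile of max_m Z(m).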


More information

The miss rate for the analysis of gene expression data

The miss rate for the analysis of gene expression data Biostatistics (2005), 6, 1,pp. 111 117 doi: 10.1093/biostatistics/kxh021 The miss rate for the analysis of gene expression data JONATHAN TAYLOR Department of Statistics, Stanford University, Stanford,

More information

Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. T=number of type 2 errors

Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. T=number of type 2 errors The Multiple Testing Problem Multiple Testing Methods for the Analysis of Microarray Data 3/9/2009 Copyright 2009 Dan Nettleton Suppose one test of interest has been conducted for each of m genes in a

More information

A NEW APPROACH FOR LARGE SCALE MULTIPLE TESTING WITH APPLICATION TO FDR CONTROL FOR GRAPHICALLY STRUCTURED HYPOTHESES

A NEW APPROACH FOR LARGE SCALE MULTIPLE TESTING WITH APPLICATION TO FDR CONTROL FOR GRAPHICALLY STRUCTURED HYPOTHESES A NEW APPROACH FOR LARGE SCALE MULTIPLE TESTING WITH APPLICATION TO FDR CONTROL FOR GRAPHICALLY STRUCTURED HYPOTHESES By Wenge Guo Gavin Lynch Joseph P. Romano Technical Report No. 2018-06 September 2018

More information

Statistical testing. Samantha Kleinberg. October 20, 2009

Statistical testing. Samantha Kleinberg. October 20, 2009 October 20, 2009 Intro to significance testing Significance testing and bioinformatics Gene expression: Frequently have microarray data for some group of subjects with/without the disease. Want to find

More information

Single gene analysis of differential expression

Single gene analysis of differential expression Single gene analysis of differential expression Giorgio Valentini DSI Dipartimento di Scienze dell Informazione Università degli Studi di Milano valentini@dsi.unimi.it Comparing two conditions Each condition

More information

SIMULATION STUDIES AND IMPLEMENTATION OF BOOTSTRAP-BASED MULTIPLE TESTING PROCEDURES

SIMULATION STUDIES AND IMPLEMENTATION OF BOOTSTRAP-BASED MULTIPLE TESTING PROCEDURES SIMULATION STUDIES AND IMPLEMENTATION OF BOOTSTRAP-BASED MULTIPLE TESTING PROCEDURES A thesis submitted to the faculty of San Francisco State University In partial fulfillment of The requirements for The

More information

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Statistics Journal Club, 36-825 Beau Dabbs and Philipp Burckhardt 9-19-2014 1 Paper

More information

On Procedures Controlling the FDR for Testing Hierarchically Ordered Hypotheses

On Procedures Controlling the FDR for Testing Hierarchically Ordered Hypotheses On Procedures Controlling the FDR for Testing Hierarchically Ordered Hypotheses Gavin Lynch Catchpoint Systems, Inc., 228 Park Ave S 28080 New York, NY 10003, U.S.A. Wenge Guo Department of Mathematical

More information

Familywise Error Rate Controlling Procedures for Discrete Data

Familywise Error Rate Controlling Procedures for Discrete Data Familywise Error Rate Controlling Procedures for Discrete Data arxiv:1711.08147v1 [stat.me] 22 Nov 2017 Yalin Zhu Center for Mathematical Sciences, Merck & Co., Inc., West Point, PA, U.S.A. Wenge Guo Department

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2004 Paper 147 Multiple Testing Methods For ChIP-Chip High Density Oligonucleotide Array Data Sunduz

More information

Estimation of a Two-component Mixture Model

Estimation of a Two-component Mixture Model Estimation of a Two-component Mixture Model Bodhisattva Sen 1,2 University of Cambridge, Cambridge, UK Columbia University, New York, USA Indian Statistical Institute, Kolkata, India 6 August, 2012 1 Joint

More information

FDR and ROC: Similarities, Assumptions, and Decisions

FDR and ROC: Similarities, Assumptions, and Decisions EDITORIALS 8 FDR and ROC: Similarities, Assumptions, and Decisions. Why FDR and ROC? It is a privilege to have been asked to introduce this collection of papers appearing in Statistica Sinica. The papers

More information

STEPDOWN PROCEDURES CONTROLLING A GENERALIZED FALSE DISCOVERY RATE. National Institute of Environmental Health Sciences and Temple University

STEPDOWN PROCEDURES CONTROLLING A GENERALIZED FALSE DISCOVERY RATE. National Institute of Environmental Health Sciences and Temple University STEPDOWN PROCEDURES CONTROLLING A GENERALIZED FALSE DISCOVERY RATE Wenge Guo 1 and Sanat K. Sarkar 2 National Institute of Environmental Health Sciences and Temple University Abstract: Often in practice

More information

Exact and Approximate Stepdown Methods For Multiple Hypothesis Testing

Exact and Approximate Stepdown Methods For Multiple Hypothesis Testing Exact and Approximate Stepdown Methods For Multiple Hypothesis Testing Joseph P. Romano Department of Statistics Stanford University Michael Wolf Department of Economics and Business Universitat Pompeu

More information

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap University of Zurich Department of Economics Working Paper Series ISSN 1664-7041 (print) ISSN 1664-705X (online) Working Paper No. 254 Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and

More information

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University Multiple Testing Hoang Tran Department of Statistics, Florida State University Large-Scale Testing Examples: Microarray data: testing differences in gene expression between two traits/conditions Microbiome

More information

A Large-Sample Approach to Controlling the False Discovery Rate

A Large-Sample Approach to Controlling the False Discovery Rate A Large-Sample Approach to Controlling the False Discovery Rate Christopher R. Genovese Department of Statistics Carnegie Mellon University Larry Wasserman Department of Statistics Carnegie Mellon University

More information

MULTISTAGE AND MIXTURE PARALLEL GATEKEEPING PROCEDURES IN CLINICAL TRIALS

MULTISTAGE AND MIXTURE PARALLEL GATEKEEPING PROCEDURES IN CLINICAL TRIALS Journal of Biopharmaceutical Statistics, 21: 726 747, 2011 Copyright Taylor & Francis Group, LLC ISSN: 1054-3406 print/1520-5711 online DOI: 10.1080/10543406.2011.551333 MULTISTAGE AND MIXTURE PARALLEL

More information

This paper has been submitted for consideration for publication in Biometrics

This paper has been submitted for consideration for publication in Biometrics BIOMETRICS, 1 10 Supplementary material for Control with Pseudo-Gatekeeping Based on a Possibly Data Driven er of the Hypotheses A. Farcomeni Department of Public Health and Infectious Diseases Sapienza

More information

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo PROCEDURES CONTROLLING THE k-fdr USING BIVARIATE DISTRIBUTIONS OF THE NULL p-values Sanat K. Sarkar and Wenge Guo Temple University and National Institute of Environmental Health Sciences Abstract: Procedures

More information

False Discovery Rate

False Discovery Rate False Discovery Rate Peng Zhao Department of Statistics Florida State University December 3, 2018 Peng Zhao False Discovery Rate 1/30 Outline 1 Multiple Comparison and FWER 2 False Discovery Rate 3 FDR

More information

Multiple Testing Procedures under Dependence, with Applications

Multiple Testing Procedures under Dependence, with Applications Multiple Testing Procedures under Dependence, with Applications Alessio Farcomeni November 2004 ii Dottorato di ricerca in Statistica Metodologica Dipartimento di Statistica, Probabilità e Statistiche

More information

Controlling Bayes Directional False Discovery Rate in Random Effects Model 1

Controlling Bayes Directional False Discovery Rate in Random Effects Model 1 Controlling Bayes Directional False Discovery Rate in Random Effects Model 1 Sanat K. Sarkar a, Tianhui Zhou b a Temple University, Philadelphia, PA 19122, USA b Wyeth Pharmaceuticals, Collegeville, PA

More information

Christopher J. Bennett

Christopher J. Bennett P- VALUE ADJUSTMENTS FOR ASYMPTOTIC CONTROL OF THE GENERALIZED FAMILYWISE ERROR RATE by Christopher J. Bennett Working Paper No. 09-W05 April 2009 DEPARTMENT OF ECONOMICS VANDERBILT UNIVERSITY NASHVILLE,

More information

The International Journal of Biostatistics

The International Journal of Biostatistics The International Journal of Biostatistics Volume 7, Issue 1 2011 Article 12 Consonance and the Closure Method in Multiple Testing Joseph P. Romano, Stanford University Azeem Shaikh, University of Chicago

More information

Mixtures of multiple testing procedures for gatekeeping applications in clinical trials

Mixtures of multiple testing procedures for gatekeeping applications in clinical trials Research Article Received 29 January 2010, Accepted 26 May 2010 Published online 18 April 2011 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.4008 Mixtures of multiple testing procedures

More information

Exceedance Control of the False Discovery Proportion Christopher Genovese 1 and Larry Wasserman 2 Carnegie Mellon University July 10, 2004

Exceedance Control of the False Discovery Proportion Christopher Genovese 1 and Larry Wasserman 2 Carnegie Mellon University July 10, 2004 Exceedance Control of the False Discovery Proportion Christopher Genovese 1 and Larry Wasserman 2 Carnegie Mellon University July 10, 2004 Multiple testing methods to control the False Discovery Rate (FDR),

More information

arxiv: v2 [stat.me] 14 Mar 2011

arxiv: v2 [stat.me] 14 Mar 2011 Submission Journal de la Société Française de Statistique arxiv: 1012.4078 arxiv:1012.4078v2 [stat.me] 14 Mar 2011 Type I error rate control for testing many hypotheses: a survey with proofs Titre: Une

More information

Multiple testing: Intro & FWER 1

Multiple testing: Intro & FWER 1 Multiple testing: Intro & FWER 1 Mark van de Wiel mark.vdwiel@vumc.nl Dep of Epidemiology & Biostatistics,VUmc, Amsterdam Dep of Mathematics, VU 1 Some slides courtesy of Jelle Goeman 1 Practical notes

More information

On Methods Controlling the False Discovery Rate 1

On Methods Controlling the False Discovery Rate 1 Sankhyā : The Indian Journal of Statistics 2008, Volume 70-A, Part 2, pp. 135-168 c 2008, Indian Statistical Institute On Methods Controlling the False Discovery Rate 1 Sanat K. Sarkar Temple University,

More information

Stat 206: Estimation and testing for a mean vector,

Stat 206: Estimation and testing for a mean vector, Stat 206: Estimation and testing for a mean vector, Part II James Johndrow 2016-12-03 Comparing components of the mean vector In the last part, we talked about testing the hypothesis H 0 : µ 1 = µ 2 where

More information

IMPROVING TWO RESULTS IN MULTIPLE TESTING

IMPROVING TWO RESULTS IN MULTIPLE TESTING IMPROVING TWO RESULTS IN MULTIPLE TESTING By Sanat K. Sarkar 1, Pranab K. Sen and Helmut Finner Temple University, University of North Carolina at Chapel Hill and University of Duesseldorf October 11,

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

Department of Statistics University of Central Florida. Technical Report TR APR2007 Revised 25NOV2007

Department of Statistics University of Central Florida. Technical Report TR APR2007 Revised 25NOV2007 Department of Statistics University of Central Florida Technical Report TR-2007-01 25APR2007 Revised 25NOV2007 Controlling the Number of False Positives Using the Benjamini- Hochberg FDR Procedure Paul

More information

Advanced Statistical Methods: Beyond Linear Regression

Advanced Statistical Methods: Beyond Linear Regression Advanced Statistical Methods: Beyond Linear Regression John R. Stevens Utah State University Notes 3. Statistical Methods II Mathematics Educators Worshop 28 March 2009 1 http://www.stat.usu.edu/~jrstevens/pcmi

More information

High-throughput Testing

High-throughput Testing High-throughput Testing Noah Simon and Richard Simon July 2016 1 / 29 Testing vs Prediction On each of n patients measure y i - single binary outcome (eg. progression after a year, PCR) x i - p-vector

More information

MULTIPLE TESTING PROCEDURES AND SIMULTANEOUS INTERVAL ESTIMATES WITH THE INTERVAL PROPERTY

MULTIPLE TESTING PROCEDURES AND SIMULTANEOUS INTERVAL ESTIMATES WITH THE INTERVAL PROPERTY MULTIPLE TESTING PROCEDURES AND SIMULTANEOUS INTERVAL ESTIMATES WITH THE INTERVAL PROPERTY BY YINGQIU MA A dissertation submitted to the Graduate School New Brunswick Rutgers, The State University of New

More information

Chapter 1. Stepdown Procedures Controlling A Generalized False Discovery Rate

Chapter 1. Stepdown Procedures Controlling A Generalized False Discovery Rate Chapter Stepdown Procedures Controlling A Generalized False Discovery Rate Wenge Guo and Sanat K. Sarkar Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park,

More information

False discovery control for multiple tests of association under general dependence

False discovery control for multiple tests of association under general dependence False discovery control for multiple tests of association under general dependence Nicolai Meinshausen Seminar für Statistik ETH Zürich December 2, 2004 Abstract We propose a confidence envelope for false

More information

Sample Size Estimation for Studies of High-Dimensional Data

Sample Size Estimation for Studies of High-Dimensional Data Sample Size Estimation for Studies of High-Dimensional Data James J. Chen, Ph.D. National Center for Toxicological Research Food and Drug Administration June 3, 2009 China Medical University Taichung,

More information

On Generalized Fixed Sequence Procedures for Controlling the FWER

On Generalized Fixed Sequence Procedures for Controlling the FWER Research Article Received XXXX (www.interscience.wiley.com) DOI: 10.1002/sim.0000 On Generalized Fixed Sequence Procedures for Controlling the FWER Zhiying Qiu, a Wenge Guo b and Gavin Lynch c Testing

More information

Resampling-Based Control of the FDR

Resampling-Based Control of the FDR Resampling-Based Control of the FDR Joseph P. Romano 1 Azeem S. Shaikh 2 and Michael Wolf 3 1 Departments of Economics and Statistics Stanford University 2 Department of Economics University of Chicago

More information

Two-stage stepup procedures controlling FDR

Two-stage stepup procedures controlling FDR Journal of Statistical Planning and Inference 38 (2008) 072 084 www.elsevier.com/locate/jspi Two-stage stepup procedures controlling FDR Sanat K. Sarar Department of Statistics, Temple University, Philadelphia,

More information

Multiple Endpoints: A Review and New. Developments. Ajit C. Tamhane. (Joint work with Brent R. Logan) Department of IE/MS and Statistics

Multiple Endpoints: A Review and New. Developments. Ajit C. Tamhane. (Joint work with Brent R. Logan) Department of IE/MS and Statistics 1 Multiple Endpoints: A Review and New Developments Ajit C. Tamhane (Joint work with Brent R. Logan) Department of IE/MS and Statistics Northwestern University Evanston, IL 60208 ajit@iems.northwestern.edu

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Statistica Sinica Preprint No: SS R1

Statistica Sinica Preprint No: SS R1 Statistica Sinica Preprint No: SS-2017-0072.R1 Title Control of Directional Errors in Fixed Sequence Multiple Testing Manuscript ID SS-2017-0072.R1 URL http://www.stat.sinica.edu.tw/statistica/ DOI 10.5705/ss.202017.0072

More information

More powerful control of the false discovery rate under dependence

More powerful control of the false discovery rate under dependence Statistical Methods & Applications (2006) 15: 43 73 DOI 10.1007/s10260-006-0002-z ORIGINAL ARTICLE Alessio Farcomeni More powerful control of the false discovery rate under dependence Accepted: 10 November

More information

STEPUP PROCEDURES FOR CONTROL OF GENERALIZATIONS OF THE FAMILYWISE ERROR RATE

STEPUP PROCEDURES FOR CONTROL OF GENERALIZATIONS OF THE FAMILYWISE ERROR RATE AOS imspdf v.2006/05/02 Prn:4/08/2006; 11:19 F:aos0169.tex; (Lina) p. 1 The Annals of Statistics 2006, Vol. 0, No. 00, 1 26 DOI: 10.1214/009053606000000461 Institute of Mathematical Statistics, 2006 STEPUP

More information

Procedures controlling generalized false discovery rate

Procedures controlling generalized false discovery rate rocedures controlling generalized false discovery rate By SANAT K. SARKAR Department of Statistics, Temple University, hiladelphia, A 922, U.S.A. sanat@temple.edu AND WENGE GUO Department of Environmental

More information

Doing Cosmology with Balls and Envelopes

Doing Cosmology with Balls and Envelopes Doing Cosmology with Balls and Envelopes Christopher R. Genovese Department of Statistics Carnegie Mellon University http://www.stat.cmu.edu/ ~ genovese/ Larry Wasserman Department of Statistics Carnegie

More information

SOME STEP-DOWN PROCEDURES CONTROLLING THE FALSE DISCOVERY RATE UNDER DEPENDENCE

SOME STEP-DOWN PROCEDURES CONTROLLING THE FALSE DISCOVERY RATE UNDER DEPENDENCE Statistica Sinica 18(2008), 881-904 SOME STEP-DOWN PROCEDURES CONTROLLING THE FALSE DISCOVERY RATE UNDER DEPENDENCE Yongchao Ge 1, Stuart C. Sealfon 1 and Terence P. Speed 2,3 1 Mount Sinai School of Medicine,

More information

QED. Queen s Economics Department Working Paper No Hypothesis Testing for Arbitrary Bounds. Jeffrey Penney Queen s University

QED. Queen s Economics Department Working Paper No Hypothesis Testing for Arbitrary Bounds. Jeffrey Penney Queen s University QED Queen s Economics Department Working Paper No. 1319 Hypothesis Testing for Arbitrary Bounds Jeffrey Penney Queen s University Department of Economics Queen s University 94 University Avenue Kingston,

More information

ON TWO RESULTS IN MULTIPLE TESTING

ON TWO RESULTS IN MULTIPLE TESTING ON TWO RESULTS IN MULTIPLE TESTING By Sanat K. Sarkar 1, Pranab K. Sen and Helmut Finner Temple University, University of North Carolina at Chapel Hill and University of Duesseldorf Two known results in

More information

Post-Selection Inference

Post-Selection Inference Classical Inference start end start Post-Selection Inference selected end model data inference data selection model data inference Post-Selection Inference Todd Kuffner Washington University in St. Louis

More information

Introductory Econometrics

Introductory Econometrics Session 4 - Testing hypotheses Roland Sciences Po July 2011 Motivation After estimation, delivering information involves testing hypotheses Did this drug had any effect on the survival rate? Is this drug

More information

IEOR165 Discussion Week 12

IEOR165 Discussion Week 12 IEOR165 Discussion Week 12 Sheng Liu University of California, Berkeley Apr 15, 2016 Outline 1 Type I errors & Type II errors 2 Multiple Testing 3 ANOVA IEOR165 Discussion Sheng Liu 2 Type I errors & Type

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Resampling-based Multiple Testing with Applications to Microarray Data Analysis

Resampling-based Multiple Testing with Applications to Microarray Data Analysis Resampling-based Multiple Testing with Applications to Microarray Data Analysis DISSERTATION Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School

More information

Family-wise Error Rate Control in QTL Mapping and Gene Ontology Graphs

Family-wise Error Rate Control in QTL Mapping and Gene Ontology Graphs Family-wise Error Rate Control in QTL Mapping and Gene Ontology Graphs with Remarks on Family Selection Dissertation Defense April 5, 204 Contents Dissertation Defense Introduction 2 FWER Control within

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

Lecture 7 April 16, 2018

Lecture 7 April 16, 2018 Stats 300C: Theory of Statistics Spring 2018 Lecture 7 April 16, 2018 Prof. Emmanuel Candes Scribe: Feng Ruan; Edited by: Rina Friedberg, Junjie Zhu 1 Outline Agenda: 1. False Discovery Rate (FDR) 2. Properties

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

EMPIRICAL BAYES METHODS FOR ESTIMATION AND CONFIDENCE INTERVALS IN HIGH-DIMENSIONAL PROBLEMS

EMPIRICAL BAYES METHODS FOR ESTIMATION AND CONFIDENCE INTERVALS IN HIGH-DIMENSIONAL PROBLEMS Statistica Sinica 19 (2009), 125-143 EMPIRICAL BAYES METHODS FOR ESTIMATION AND CONFIDENCE INTERVALS IN HIGH-DIMENSIONAL PROBLEMS Debashis Ghosh Penn State University Abstract: There is much recent interest

More information

Resampling and the Bootstrap

Resampling and the Bootstrap Resampling and the Bootstrap Axel Benner Biostatistics, German Cancer Research Center INF 280, D-69120 Heidelberg benner@dkfz.de Resampling and the Bootstrap 2 Topics Estimation and Statistical Testing

More information

A Mixture Gatekeeping Procedure Based on the Hommel Test for Clinical Trial Applications

A Mixture Gatekeeping Procedure Based on the Hommel Test for Clinical Trial Applications A Mixture Gatekeeping Procedure Based on the Hommel Test for Clinical Trial Applications Thomas Brechenmacher (Dainippon Sumitomo Pharma Co., Ltd.) Jane Xu (Sunovion Pharmaceuticals Inc.) Alex Dmitrienko

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

A multiple testing procedure for input variable selection in neural networks

A multiple testing procedure for input variable selection in neural networks A multiple testing procedure for input variable selection in neural networks MicheleLaRoccaandCiraPerna Department of Economics and Statistics - University of Salerno Via Ponte Don Melillo, 84084, Fisciano

More information

Exam: high-dimensional data analysis January 20, 2014

Exam: high-dimensional data analysis January 20, 2014 Exam: high-dimensional data analysis January 20, 204 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question not the subquestions on a separate piece of paper. - Finish

More information

On weighted Hochberg procedures

On weighted Hochberg procedures Biometrika (2008), 95, 2,pp. 279 294 C 2008 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asn018 On weighted Hochberg procedures BY AJIT C. TAMHANE Department of Industrial Engineering

More information

Reports of the Institute of Biostatistics

Reports of the Institute of Biostatistics Reports of the Institute of Biostatistics No 01 / 2010 Leibniz University of Hannover Natural Sciences Faculty Titel: Multiple contrast tests for multiple endpoints Author: Mario Hasler 1 1 Lehrfach Variationsstatistik,

More information

Probabilistic Inference for Multiple Testing

Probabilistic Inference for Multiple Testing This is the title page! This is the title page! Probabilistic Inference for Multiple Testing Chuanhai Liu and Jun Xie Department of Statistics, Purdue University, West Lafayette, IN 47907. E-mail: chuanhai,

More information

The Pennsylvania State University The Graduate School Eberly College of Science GENERALIZED STEPWISE PROCEDURES FOR

The Pennsylvania State University The Graduate School Eberly College of Science GENERALIZED STEPWISE PROCEDURES FOR The Pennsylvania State University The Graduate School Eberly College of Science GENERALIZED STEPWISE PROCEDURES FOR CONTROLLING THE FALSE DISCOVERY RATE A Dissertation in Statistics by Scott Roths c 2011

More information

Confidence Thresholds and False Discovery Control

Confidence Thresholds and False Discovery Control Confidence Thresholds and False Discovery Control Christopher R. Genovese Department of Statistics Carnegie Mellon University http://www.stat.cmu.edu/ ~ genovese/ Larry Wasserman Department of Statistics

More information

Applied Multivariate and Longitudinal Data Analysis

Applied Multivariate and Longitudinal Data Analysis Applied Multivariate and Longitudinal Data Analysis Chapter 2: Inference about the mean vector(s) Ana-Maria Staicu SAS Hall 5220; 919-515-0644; astaicu@ncsu.edu 1 In this chapter we will discuss inference

More information